key: cord- - uc w
authors: chen, zhiqi; ma, xuezhong; zhang, jianhua; hu, jim; gorczynski, reginald m.
title: alternative splicing of cd is regulated by an exonic splicing enhancer and sf /asf
date: - -
journal: nucleic acids res
doi: . /nar/gkq
sha:
doc_id:
cord_uid: uc w

cd , a type i membrane glycoprotein, plays an important role in prevention of inflammatory disorders, graft rejection, autoimmune diseases and spontaneous fetal loss. it also regulates tumor immunity. a truncated cd (cd (tr)) resulting from alternative splicing has been identified and characterized as a functional antagonist to full-length cd . thus, it is important to explore the mechanism(s) controlling alternative splicing of cd . in this study, we identified an exonic splicing enhancer (ese) located in exon , which is a putative binding site for a splicing regulatory protein sf /asf. deletion or mutation of the ese site decreased expression of the full-length cd . direct binding of sf /asf to the ese site was confirmed by rna electrophoretic mobility shift assay (emsa). knockdown of expression of sf /asf resulted in the same splicing pattern as seen after deletion or mutation of the ese, whereas overexpression of sf /asf increased expression of the full-length cd . in vivo studies showed that viral infection reversed the alternative splicing pattern of cd with increased expression of sf /asf and the full-length cd . taken together, our data suggest for the first time that sf /asf regulates the function of cd by controlling cd alternative splicing, through direct binding to an ese located in exon of cd .

cd is a type membrane glycoprotein, delivering immunoregulatory signals through binding to its receptors (cd rs) ( ) ( ) ( ) ( ) . it is present on neurons, b cells, activated t cells, thymocytes, dendritic cells and endothelium in mice, rats and humans ( , ) .
a large and growing body of studies demonstrates that expression level of cd regulates graft survival ( ) ( ) ( ) , susceptibility to autoimmune diseases ( ) ( ) ( ) , fetal loss ( ) , inflammation/infection ( ) and tumor immunity ( ) ( ) ( ) ( ) . alternative splicing is a major mechanism for regulating biological systems, producing multiple messenger rna (mrna) and protein isoforms. some of these isoforms have distinct or even opposing functions ( ) . many genes in the immune system have been found to be alternatively spliced ( ) ( ) ( ) and a growing number of human diseases are associated with aberrant splicing of the genes ( ) ( ) ( ) . however, few studies to date have identified the mechanisms that regulate alternative splicing in the immune system. while cd exists as a single copy gene, data from borriello et al. ( ) , confirmed by our experiments ( ) , have reported that a splice variant of cd exists. although exon deletion of cd caused by alternative splicing results in a frame shift and premature translational termination, we noted the existence of a downstream atg start codon in a perfect kozak context ( ) . when the first start codon is followed shortly by a terminator codon and creates a small open reading frame (orf; -mini-cistron), the s ribosomal subunit remains bound to the mrna, resumes scanning, and potentially reinitiates at the next atg codon downstream ( ) . it is known that the nh -terminal region of cd is important for its biological interaction with cd rs ( , ) , and translation from the second atg start codon would produce a truncated form of cd (cd tr ) lacking the nh -terminal amino acids which includes regions important for the interaction with cd rs. indeed, our previous studies have shown that expressed cd tr is a functional antagonist to cd ( ) . exons often contain specific short oligonucleotide sequences that affect their ability to be spliced. 
exonic splicing enhancers (eses) within exons promote splicing of the corresponding exons and subsequent exon inclusion mediated by splicing regulatory proteins. the best-studied family of splicing regulatory proteins is the serine/arginine-rich proteins (sr proteins), which include sf /asf, sc , srp , srp c and many others ( , ) . it has become clear that many exons contain ese elements that bind to specific members of the sr family ( ) , leading to exon inclusion. since cd is involved in many diseases and its splice variant cd tr is an antagonist to cd , identification of the mechanism controlling the relative expression levels of cd versus cd tr may provide insight into novel strategies for treatment of clinical disorders. in the present study, we have explored the mechanism controlling cd alternative splicing and show that sf /asf regulates cd alternative splicing through its direct binding to an ese site in exon of this gene. the level of sf /asf determines the alternative splicing patterns in different tissues or cells. interestingly, in a mouse model of viral infection, we detected for the first time that the normal splicing pattern of cd was reversed in the lung tissue of a/j mice infected with mouse hepatitis virus strain i (mhv- ), following an increase in expression of sf /asf in this mhv- susceptible mouse strain.

*to whom correspondence should be addressed. tel: (ext. ); fax: ; email: zhiqi.chen@utoronto.ca

all human cell lines were obtained from american type culture collection. human b cell lines daudi, raji and tem were maintained in rpmi (invitrogen) supplemented with % fetal bovine serum (fbs). the human neuronal cell lines sk-n and hcn- a were cultured with % fbs in α-mem medium (invitrogen). total rnas from different human tissues were purchased from clontech. a human bac clone containing the whole human cd gene and a pcdna .
expression vector containing sf /asf were obtained from the center for applied genomics (hospital for sick children, toronto). taq dna polymerase, t dna ligase and all restriction endonucleases were purchased from new england biolabs. random primers, superscript reverse transcriptase ii, elongase enzyme, pcdna . expression vector and all competent cells were purchased from invitrogen. endofree plasmid purification maxi kit and qiaex ii gel extraction kit were ordered from qiagen. a purified sf /asf recombinant protein was kindly provided by dr. blencowe (university of toronto). anti-human and mouse sf /asf antibody was obtained from santa cruz biotechnology. anti-human and mouse β-actin antibody was purchased from bd biosciences. small-interfering rnas (sirnas), including sf /asf sirna and a 'scrambled' sirna, were synthesized by eurogentec. rna oligonucleotides were synthesized by the dna and rna synthesis center at hospital for sick children (toronto). all the primers used for polymerase chain reactions (pcrs), real-time pcrs and mutations were synthesized by invitrogen.

female a/j and c bl/ j mice, - weeks of age, were purchased from jackson laboratories. the mice were maintained in microisolator cages, housed in the animal facility at the toronto hospital research institute, university of toronto, and fed standard lab chow diet and water ad libitum. all protocols were approved by the animal welfare committee.

parental virus mouse hepatitis virus strain (mhv ) was ordered from the american type culture collection. as previously described ( ) , mhv infection was carried out in a viral isolation room. a/j and c bl/ j mice were anesthetized by intraperitoneal injection with . ml % pentobarbital diluted in normal saline. mice were left untreated or received plaque forming unit (pfu) of mhv intranasally. mice were sacrificed , and h postinfection and lung tissue was collected.
total rna was isolated from human b cell lines (daudi, raji, tem), human neuronal cell lines (sk-n, hcn- a) and mouse lung tissue using trizol reagent. five micrograms of total rna from human tissues (brain, heart, skeletal muscle, colon, liver, thymus, kidney, intestine, lung, placenta and spleen), or human b cell lines (daudi, raji, tem) and human neuronal cell lines (sk-n, hcn- a), or mouse lung tissue was treated with dnase i and reverse transcribed in the presence of ng of random primers,  pcr buffer, mm dntps and u of superscript ii reverse transcriptase (rt; invitrogen) in a final reaction volume of ml. reactions were carried out at c for min, c for min, followed by a -min step at c to denature the enzyme. for regular pcr, ml of first strand complementary dna (cdna) was amplified in a -ml reaction in the presence of  pcr buffer, . mm mgcl , . mm of dntps, u of taq dna polymerase (new england biolabs). a first cycle of min at c was followed by cycles of s at c, s at an annealing temperature chosen for each primer pair, and min at c. the final extension step was at c for min. for real-time pcr, first strand cdna was diluted : and quantified using an abi ht sequence detection system (applied biosystems). the sequences of the primers used for regular and real-time pcr are listed in table . the endogenous human cd primer pairs for regular pcr were also used to construct an amplicon-containing plasmid (endogenous) for a standard curve. an exogenous amplicon-containing plasmid (exogenous) for a standard curve was constructed using the primers shown in table . samples were tested in triplicate using ml of first strand cdna in a ml total volume with  universal master mix (applied biosystems). the results were normalized to those of the housekeeping genes gapdh and hprt. the copy number of transcripts was determined by comparison with a calibration curve of known amounts of amplicon-containing plasmid.
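the calibration-curve step described above can be illustrated with a short python sketch. it assumes the usual linear relation between threshold cycle (ct) and log10 of the plasmid copy number; the dilution series, ct values and function names are hypothetical and are not taken from this study.

```python
import math


def fit_standard_curve(copies, ct_values):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate copy number from a measured Ct."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope (1.0 means perfect doubling per cycle)."""
    return 10 ** (-1.0 / slope) - 1.0


# Hypothetical 10-fold dilution series of an amplicon-containing plasmid.
standard_copies = [1e3, 1e4, 1e5, 1e6, 1e7]
standard_ct = [30.0, 26.68, 23.36, 20.04, 16.72]  # invented illustrative values

slope, intercept = fit_standard_curve(standard_copies, standard_ct)
efficiency = amplification_efficiency(slope)
estimated = copies_from_ct(23.36, slope, intercept)
print(round(slope, 2), round(efficiency, 2), round(estimated))
```

in practice each isoform would be read off its own curve, and the resulting copy numbers would then be normalized to the housekeeping genes; how the two housekeeping genes are combined is not specified in the text, so that step is left out of the sketch.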
control reactions were performed for the specificity of the real-time pcr primers. a dna fragment, containing either exon , exon and exon or only exon and exon , was gel purified and subcloned into pcdna . between noti and xhoi sites. the cd -bearing plasmids were then linearized by xhoi. in vitro transcription was carried out using transcriptaid t high yield transcription kit (fermentas inc.) following the manufacturer's instructions. transcribed rna was treated with dnase i to remove template dna and purified by phenol:chloroform extraction and ethanol precipitation. first strand cdna was then synthesized and real-time pcr was performed. the primer pairs used for real-time pcr are shown in figure a and table .

a human bac clone containing the whole human cd gene was used as a template for long-distance pcr to obtain a region bearing exon , intron , exon , intron and exon of the human cd . two mixtures were prepared: mix ( ml) contained . mg of dna template, . mm dntp mix and . mm of sense and antisense primers; mix ( ml) included elongase enzyme mix and  long-distance pcr buffer a and b provided by the manufacturer (the ratio of buffer a and b is : ). the sense primer started with the noti cleavage site and the antisense primer with the sali site. the sequences of the primers are shown in table . mix and mix were combined on ice and subjected to pcr under the following conditions: c for min followed by three cycles at c for s, c for s, c for min, and then cycles of c for s, c for min. the final extension was c for min. the -kb cd fragment was displayed on a . % tae-agarose gel and purified using qiaex ii agarose gel extraction kit following the manufacturer's instructions. for more efficient elution of the large dna fragment, the final incubation time was extended to min at c. the gel-purified dna fragment was verified by restriction enzyme digestion with bamhi, bglii, ecori and hindiii, respectively, and dna sequencing. for ligation to pcdna .
expression vector, the cd fragment was digested with noti and sali. meanwhile, pcdna . expression vector was digested with noti and xhoi. afterwards, pcdna . vector was dephosphorylated to prevent the vector from self-ligation. the enzyme-treated cd fragment and pcdna . were ligated at a molar ratio of : . ligation products containing the alternative splicing construct were transformed into dh b escherichia coli cells by electroporation using a cell-porator electroporation system (life technologies) at v, mf capacitance, low and k (for booster). the cells were plated onto lb/ampicillin plates and incubated at c overnight. twenty isolated clones were randomly picked. only one clone showed a dna supercoil band with much larger size than that of the vector clone on the gel. this clone was further characterized by the combination of restriction enzyme digestion and sequence analysis.

[table fragment] primers for real-time pcr (the location of the numbered primers is shown in figure a ). endogenous human full-length cd : ( ) sense (exon ) -cagcctggtttgggtcatg- ; ( ) antisense (exon ) -gcagagagcattttaaggaagca- . endogenous human truncated cd

an ese site was identified in exon of the human cd using the computational methods rescue-ese ( ) and esefinder ( ) . to mutate the ese site, site-directed mutagenesis was employed using the quikchange ii xl site-directed mutagenesis kit from stratagene. two mutagenic primers were synthesized, in which the ese site was replaced by a bsiwi site or deleted, and purified by polyacrylamide gel electrophoresis (page). the sequences of the primers used are shown in table (the mutated region is underlined). the mutagenesis reaction was carried out in ml total volume with ng of template dna, ng of each primer and . u pfuultra high-fidelity (hf) dna polymerase and ml of quiksolution reagent provided by stratagene.
the cycling conditions included a -min initial denaturation at c, cycles with s denaturation at c, s annealing at c and min extension at c, and a final extension of min at c. the product was then subjected to digestion with u of dpni for h at c, selectively removing the parental, methylated, and nonmutated strands. four microliters of dpni-treated dna was then transformed into xl -gold ultracompetent cells. cells were plated and incubated for selection of ampicillin-resistant clones. ten isolated ampicillin-resistant clones were picked at random and their mutated or deleted regions were characterized by dna sequencing.

the b cell line daudi was washed and resuspended in  hanks balanced salt solution (hbss) to a cell density of  cells/ml. the neuronal cell line sk-n was trypsinized and resuspended in  phosphate-buffered saline (pbs) with % fbs at a density of cells/ml. three hundred microliters of the daudi cells or ml of the sk-n cells were transfected with mg of the alternative splicing minigene construct, the minigene construct with the ese site deleted or mutated, the minigene construct plus sf /asf expression vector, or the ese deleted construct plus sf /asf expression vector. electroporation was performed with square waves of v, ms pulse length for four pulses for daudi and square waves of v, ms pulse length for one pulse for sk-n using a t electrosquareporator (btx). both daudi and sk-n cells were cultured in ml of pre-warmed complete medium for h before harvesting.

the rna oligonucleotides used for gel mobility shift assay were as follows: cd exon with the wild-type ese, -gugaucaggaugcccuucuc- ; cd exon with the mutated ese, -gugacguacgugcccuucuc- . the rna gel mobility shift assay was carried out as previously described ( ) . the rna oligonucleotides were -end labeled with γ- p-atp (perkin elmer) using the kinasemax kit from applied biosystems following the manufacturer's instructions. unincorporated nucleotides were removed by using g- sephadex columns.
fifteen femtomoles of radiolabeled rna oligonucleotides were mixed with pmol of sf /asf recombinant protein in a -ml binding reaction containing mg yeast trna (applied biosystems). for competition,  cold cd exon oligonucleotide was added to the reaction containing the radiolabeled cd exon oligonucleotide and sf /asf. after incubation for min on ice, the rna-protein complexes were separated from free rna by electrophoresis on a % native polyacrylamide gel, run at v for h in . % tbe buffer. the gel was then dried and autoradiographed at − c with an intensifying screen.

sf /asf sirna was designed based on the information described by cartegni et al. ( ) . a 'scrambled' sirna, which has no match with any mrna of the human database, was used as a control. the sirnas were synthesized by eurogentec with the following sequences: .  daudi cells or  sk-n cells were seeded into -well plates h before transfection. two-and-a-half micrograms of sirna was transfected into daudi or sk-n cells using lipofectamine (invitrogen) to examine the endogenous expression pattern of cd following silencing of sf /asf. two-and-a-half micrograms of sirna, together with mg of the alternative splicing construct dna, was transfected into daudi or sk-n cells by electroporation to detect the exogenous expression pattern of cd following silencing of sf /asf. the cells were harvested h posttransfection. total rna and protein were then extracted.

nuclear extracts from daudi and sk-n cells were isolated using ne-per nuclear and cytoplasmic extraction reagents ( ) from pierce biotechnology following the manufacturer's instructions. western blotting was performed using mg of nuclear extracts. after separation on a % sds-page gel, the proteins were transferred to a nitrocellulose membrane and probed with anti-human sf /asf antibody [ : dilution, goat polyclonal immunoglobulin g (igg); santa cruz biotechnology] followed by washing in % milk-pbs tween.
the membrane was then incubated with donkey anti-goat igg ( : dilution; horseradish peroxidase-conjugated; bd biosciences) and washed again. substrates, luminol and enhancer were added onto the membrane and incubated for min. the membrane was exposed to kodak xar- film with intensifying screens for min. anti-human β-actin antibody ( : dilution, goat monoclonal igg; bd biosciences) was used as a loading control. the exposure time for β-actin was s.

statistical significance was calculated with one-way analysis of variance (anova) followed by tukey tests. p-values < . were considered significant and are shown in the figures.

the existence of discrete cd splice variants is cell and tissue specific

human cd splice variants were examined in human tissues, b cells and neuronal cells. total rnas from different human tissues or human b cell and neuronal cell lines were used for rt-pcr using a sense primer located in exon of human cd and an antisense primer in exon . as shown in figure a and b, two transcripts were detected in all the human tissues, b cell lines (daudi, raji and tem) and neuronal cell lines (sk-n and hcn- a). the larger transcript was by far the dominant one seen in the brain and neuronal cell lines. accordingly, for subsequent experiments, the b cell line daudi and neuronal cell line sk-n were used as representatives of the two different splicing patterns of cd . the only tissue not expressing cd was human skeletal muscle. the two transcripts were purified from the agarose gel and sequenced. it was confirmed that the larger one represented an exon inclusion, whereas the smaller one represented an exon exclusion (cd tr ).

since alternatively spliced exons often contain eses for binding of splicing regulators that determine the fate of the exon (exon inclusion or exclusion), we wondered whether eses for binding of splicing regulatory proteins existed in exon of cd .
for this purpose, both rescue-ese ( ) and esefinder ( ) were used to search for eses in the exon of cd . only one ese was identified in exon by both rescue-ese and esefinder. the ese existed in exon of cd in human, mouse and rat, with the sequence tcagga ( figure a ). the identified ese represents a known binding site for a splicing regulatory protein sf /asf, a member of the sr protein family ( ) .

exogenous expression of cd /cd tr shared a similar pattern with the corresponding endogenous one

to gain insight into the role of the ese in exon of cd , we generated an alternative splicing minigene construct containing the genomic region from exon to exon of the human cd ( figure b) . a -kb fragment bearing this genomic region was characterized by sequencing and restriction enzyme digestion, and ligated to a pcdna . expression vector. the construct was transfected independently into the human b cell line daudi and the neuronal cell line sk-n. after h, rna was extracted from each cell population for detection of the exogenous splicing pattern of cd . rna was also isolated from nontransfected daudi and sk-n cells for detection of the endogenously expressed splicing pattern. to measure quantitatively the expression levels of the two splice variants, real-time rt-pcr was performed using the primer pairs located in different regions ( figure a ). the specificity of the primers for amplification of full-length and truncated cd was examined. as shown in figure b , the primer pair used for full-length cd did not amplify the template from cd rna lacking exon (truncated form), whereas the primer pairs for truncated cd were not able to amplify the template from cd rna containing exon (full-length form). each primer pair generated only a single product (supplementary figure a) and the standard curves generated from each primer pair are parallel with slopes between − . and − . (supplementary figure b) .
the exogenous expression of cd /cd tr had a similar pattern to the corresponding endogenous one in daudi cells or sk-n cells ( figure d and e). to examine further whether the ese in exon of cd determined the fate of the exon (inclusion or exclusion), site-directed mutagenesis was performed to mutate the ese element in the alternative splicing construct, replacing the ese (tcctga) with a restriction enzyme bsiwi site (cgtacg) ( figure c ) or to delete the ese. after characterizing the mutation or deletion construct by sequencing, the splicing construct was transfected into daudi and sk-n cells. total rna was extracted from cells h after transfection and real-time rt-pcr was carried out. as shown in figures c and d, and a and b, expression of the full-length transcript (exon inclusion) was reduced in both daudi and sk-n cells after mutation or deletion of the ese in exon . these data suggest that the ese in exon of cd promotes exon inclusion.

a splicing regulatory protein sf /asf directly binds to the ese and determines the fate of exon of cd

since the ese described above is known to contain a putative binding site for sf /asf, we investigated whether sf /asf binds to the ese. an rna-emsa was performed. as shown in figure , an rna-protein complex was detected after the sf /asf recombinant protein with Δrs domain was mixed with a radiolabeled rna oligonucleotide containing the ese site. this protein/rna interaction is specific since sf /asf did not bind to a radiolabeled rna oligonucleotide containing the mutated ese site and the above binding was eliminated by competing  unlabelled oligonucleotide containing the same ese ( figure ). moreover, this binding was not competed by the same level of cold oligonucleotide with the ese site mutated (data not shown).

as previously described, the full-length cd was expressed predominantly in brain and neuronal cells. one explanation of this observation is that the expression of sf /asf is higher in neuronal cells and brain.
to test this hypothesis, we assessed sf /asf levels in daudi and sk-n cells by western blotting. as shown in figure a , the natural level of sf /asf was clearly higher in sk-n cells than in daudi cells. to gain further insight into the role of sf /asf in controlling alternative splicing of cd , an sirna against sf /asf was employed to knock down sf /asf in daudi and sk-n cells. a scramble sirna was used as a negative control. after transfection, the cells were harvested and total rnas were extracted for real-time rt-pcr, along with nuclear proteins for western blot. as shown in figure a , sf /asf expression was eliminated after treatment with . mg of sirna. β-actin was used as a loading control. real-time rt-pcr was then performed using rna samples treated with sirna. as shown in figure b and c, the endogenous expression of full-length cd (exon inclusion) was reduced in both daudi and sk-n cells, compared with mock (no sirna) or scramble sirna-treated cells. the same pattern was observed for the exogenous expression of cd in daudi ( figure d ) or sk-n cells ( figure e ) following silencing of sf /asf. consistent with the observation resulting from the ese mutation or deletion, the expression pattern of full-length versus truncated cd was reversed in sk-n cells after knockdown of sf /asf.

to investigate further the function of sf /asf in exon inclusion or exclusion, we performed overexpression analysis by transfection of sf /asf expression vector into daudi or sk-n cells and examined the fate of exon . as shown in figure a -c, overexpression of sf /asf induced exon inclusion but this function was abolished in the absence of the ese in exon , indicating that sf /asf regulates cd isoforms only via the ese. these results support the hypothesis that the splicing regulatory protein sf /asf, acting through binding to the ese in exon of cd , plays an important role in controlling alternative splicing of cd , and regulates the relative ratio of expression of full length to truncated cd .

figure . the pattern of expression of exogenous full-length cd or truncated cd in different cells parallels that of the endogenous molecules and mutation of the ese in exon abolishes exon inclusion. (a) the location of the primers used for real-time rt-pcr. primers and were used for endogenous expression of full-length cd ; primers and were used for endogenous expression of truncated cd ; primers and were used for exogenous expression of full-length cd ; primers and were used for exogenous expression of truncated cd ; primers and were used for the constitutive expression of v region of cd . (b) the specificity of the primers used for full-length or truncated cd . cd rna containing exon or lacking exon from in vitro transcription was reverse transcribed and used for real-time pcr using the primer pairs labeled in the figure. (c) mutation of the ese in exon was confirmed by dna sequencing. (d) endogenous and exogenous expression of the full-length and truncated cd in daudi cells, and exogenous expression of two isoforms after mutation of the ese. (e) endogenous and exogenous expression of the full-length and truncated cd in sk-n cells, and exogenous expression of two isoforms after mutation of the ese. the data represent the mean ± se (three independent experiments, triplicate determinations). broken lines indicate that exogenous expression of the full-length cd decreased after mutation of the ese relative to that of wild type (p < . in daudi; p < . in sk-n). continuous lines indicate that exogenous expression of the truncated cd increased after mutation of the ese relative to that of wild type in sk-n cells (p < . ).

figure . ese deletion or overexpression of sf /asf affects exogenous expression patterns of cd isoforms. ten micrograms of the wild-type minigene construct, the minigene construct with the ese site deleted, the minigene construct plus sf /asf expression vector, or the minigene construct with the ese site deleted plus sf /asf expression vector was transfected into daudi or sk-n cells by electroporation. after h, cells were collected and total rna was isolated for real-time rt-pcr. the expression levels of the full-length and truncated cd as well as total cd in daudi (a) and sk-n (b) were normalized to the housekeeping genes gapdh and hprt. the data shown are expression levels of full-length or truncated cd relative to total cd . the data represent the mean ± se (three independent experiments, triplicate determinations).

the alternative splicing pattern is altered in vivo in a/j mice infected with mhv-

previous studies have shown that several viruses express a viral protein which mimics human cd and down-regulates host immunity to the virus following interaction with a human cd receptor on host cells ( ) ( ) ( ) . whether viral infection itself affects the expression of cd in the host remains to be explored. intranasal infection of a/j mice with the coronavirus murine hepatitis virus strain (mhv- ) has been described to induce pulmonary pathology with features reminiscent of severe acute respiratory syndrome (sars) ( ) . to examine the correlation between the viral (mhv- ) infection and expression of cd in the host, we collected lung tissues from mhv- susceptible a/j mice and mhv- -resistant c bl/ j mice after infection. rt-pcr was performed using a sense primer located in exon and an antisense primer present in exon . interestingly, we observed a reversal of the normal cd splicing pattern in lung tissues of a/j mice postinfection ( figure a ). real-time rt-pcr provided a more accurate measure of this phenomenon.
we documented that the full-length cd was increased after viral infection and was -fold higher at h postinfection compared with that before infection ( figure b ). all the susceptible a/j mice were dead at h postinfection. in contrast, the relative ratio of full-length to truncated cd did not change in infected c bl/ j mice ( figure c and d) . thus, the pattern of alternative splicing of cd was correlated with susceptibility of these strains to viral infection. since the above studies have shown that sf /asf regulates alternative splicing of cd , we wondered whether expression of sf /asf increased in a/j mice postinfection. we performed western blotting using anti-sf /asf antibody. as shown in figure e , no obvious difference in sf /asf level was seen between a/j and c bl/ j mice before viral infection. increased expression of sf /asf was detected in lungs of a/j mice h postinfection, whereas no increase in sf /asf was seen in c bl/ j mice even h postinfection, suggesting that the effect of the virus on host cd expression is mediated by sf /asf.

the studies reported here show that the relative expression of two isoforms (cd and cd tr ) is tissue and cell specific and that the alternative splicing patterns differ between lymphoid and neuronal tissues. the relative expression of the two isoforms of cd is of interest, given our recent evidence that the truncated form (cd tr ) can antagonize the functional suppression induced by full-length cd ( ) . although borriello et al. ( ) reported no change in the alternative splicing pattern of murine cd in lymphoid tissue after stimulation by con a or lps in vivo, in our in vivo studies of mouse lung tissues before and after infection with mhv- virus we observed that, unlike in the natural condition, following viral infection the expression of total cd increased in lung of both mhv- susceptible a/j mice and mhv- -resistant c bl/ j mice.
however, the splicing pattern of cd is reversed only in a/j mice, with the full-length transcript, capable of inducing immunosuppression, becoming the predominant one. in contrast, for c bl/ j, an mhv- -resistant mouse strain, no change in the splicing pattern of cd was seen in the lung. this result indicates that the splicing pattern, rather than the total transcription level, of cd determines the murine immune response to mhv- . it is consistent with the hypothesis that a shift in the balance of expression of cd /cd tr toward decreased expression of the truncated product allows cd to function in its immunosuppressive role, possibly contributing to the increased susceptibility to mhv- in the a/j mice. further studies showed an increased expression of sf /asf in a/j mice postinfection, and the increase in sf /asf occurred prior to increased full-length cd , strongly suggesting that the regulation of alternative splicing of cd is mediated by sf /asf. it remains to be determined what viral proteins of mhv- have this effect and how the proteins regulate expression of sf /asf. our studies suggest that viruses escape elimination by the host's immune system not only through producing viral proteins which mimic cd but also by inducing host cd expression and reducing expression of the antagonist cd tr . posttranscriptional regulation, including mrna stability, plays an important role in gene expression ( ) . whether the increase of full-length cd in a/j mice is also due to differential mrna stability cannot be ruled out.

in this report, we searched for eses in the human and murine exon sequence using two ese-detecting algorithms, rescue-ese and esefinder ( , ) . only one ese, which is a putative binding site for sf /asf, was detected by both rescue-ese and esefinder . . no ese was identified in the whole exon when using the more stringent esefinder . . thus, we focused on this ese for the rest of the experiments.
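the exact-match check below is a minimal python sketch of what "locating the ese" means at the sequence level. it is not a reimplementation of rescue-ese or esefinder, which score candidate motifs rather than matching one fixed string. it assumes the hexamer tcagga reported in the results and the two rna oligonucleotides listed for the gel mobility shift assay (the space inside each listed oligonucleotide is treated as an extraction artifact); the function names are hypothetical.

```python
def rna_to_dna(seq):
    """Upper-case a sequence and convert RNA (U) to DNA (T) letters."""
    return seq.upper().replace("U", "T")


def find_motif(seq, motif):
    """Return 0-based start positions of every exact occurrence of motif."""
    seq, motif = rna_to_dna(seq), rna_to_dna(motif)
    width = len(motif)
    return [i for i in range(len(seq) - width + 1) if seq[i:i + width] == motif]


ESE = "tcagga"    # hexamer reported for the exon in human, mouse and rat
BSIWI = "cgtacg"  # BsiWI recognition site used to replace the ESE

wild_type = "gugaucaggaugcccuucuc"  # oligonucleotide with the wild-type ESE
mutated = "gugacguacgugcccuucuc"    # oligonucleotide with the mutated ESE

wt_hits = find_motif(wild_type, ESE)     # ESE present in the wild-type oligo
mut_hits = find_motif(mutated, ESE)      # ESE absent after mutation
bsiwi_hits = find_motif(mutated, BSIWI)  # BsiWI site sits where the ESE was
print(wt_hits, mut_hits, bsiwi_hits)
```

the same position is occupied by the ese in the wild-type oligonucleotide and by the bsiwi site in the mutated one, which is the substitution the mutagenesis was designed to make.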
since an ese can promote exon inclusion, mutation or deletion of the ese would be expected to lead to less full-length but more truncated cd . our results showed that after mutating the ese in exon , expression of full-length cd was reduced in both daudi and sk-n cells. this expression pattern is the reverse of that seen for endogenous cd expression in sk-n cells, in which the predominant expression is of full-length cd . to exclude the possibility that the mutation created a new exonic splicing silencer (ess), which would also decrease full-length and increase truncated cd , we deleted the ese and examined the changes in cd :cd tr . deletion of the ese promoted exon exclusion, the same result as we obtained from the mutation analysis, indicating that mutation of the ese does not create an ess. identification of a putative ese for sf /asf binding does not provide direct evidence that sf /asf recognizes and binds to the ese. to examine whether the identified ese in exon is bound by sf /asf, we performed rna-emsa using radiolabeled rna oligonucleotides bearing the ese in exon and a recombinant sf /asf lacking the rs domain (Δrs) to reduce nonspecific binding. the results showed that sf /asf binds to the ese and that the binding is specific, because it was abolished either by mutating the ese or by adding cold oligonucleotides. knockdown of sf /asf decreased expression of full-length cd in both daudi and sk-n cells. consistent with data seen following mutation or deletion of the ese, the expression pattern of cd was again reversed in sk-n cells. western blotting confirmed the efficiency of knockdown of sf /asf. in contrast, overexpression of sf /asf increased expression of full-length cd , but only in the presence of the ese in exon , highlighting the critical role of the ese in the mechanism of alternative splicing of cd .
ubiquitously expressed splicing factors, among them sf /asf, are thought to control tissue-specific alternative splicing through their different expression levels in different tissues ( ). our results showed that the natural level of sf /asf was higher in the neuronal cell line sk-n than in the b cell line daudi. this may help explain why endogenous full-length cd (exon inclusion) is expressed at a much higher level than truncated cd (exon exclusion) in sk-n. a recent report has described a higher expression level of sf /asf in many tumors, including lung, thyroid, kidney, colon, small intestine and melanoma, relative to their respective normal controls. one mechanism proposed to explain this observation is that sf /asf abolishes the tumor suppressor activity of bin , a tumor suppressor gene, by inclusion of exon a, which interferes with myc binding ( ). in contrast to its roles in transplantation, autoimmune diseases and inflammation, cd enhances the growth of malignant tumors, and it has been suggested that a novel approach to anticancer therapy might include blockade of cd ( , , ( ) ( ) ( ) ( ). since cd tr is an antagonist to cd ( ), our data are consistent with the hypothesis that increasing cd tr expression and decreasing expression of full-length cd by blockade of sf /asf may also be of potential benefit for cancer treatment. in conclusion, we have identified an alternative splicing pattern for expressed human cd in different cells and tissues, and compared this with the pattern observed in vivo following viral infection. our data suggest that regulation of expression of alternatively spliced transcripts may be important in controlling susceptibility to viral infection. an ese in exon of cd is a binding site for a splicing regulatory protein, sf /asf, which we have shown to control the alternative splicing pattern of cd . a drug-mediated manipulation of alternative splicing has recently been reported which includes modulation of sf /asf ( ).
it would be of interest to know if this drug treatment alters the expression ratio of cd to cd tr and thereby produces change in immune function. supplementary data are available at nar online. funding for open access charge: the heart and stroke foundation (na to r.m.g.). conflict of interest statement. none declared.
key: cord- -prsvv l authors: qin, jian; jones, robert c.; ramakrishnan, ramesh title: studying copy number variations using a nanofluidic platform date: - - journal: nucleic acids res doi: . /nar/gkn sha: doc_id: cord_uid: prsvv l copy number variations (cnvs) in the human genome are conventionally detected using high-throughput scanning technologies, such as comparative genomic hybridization and high-density single nucleotide polymorphism (snp) microarrays, or relatively low-throughput techniques, such as quantitative polymerase chain reaction (pcr). all these approaches are limited in resolution and can at best distinguish a twofold (or %) difference in copy number. 
we have developed a new technology to study copy numbers using a platform known as the digital array, a nanofluidic biochip capable of accurately quantitating genes of interest in dna samples. we have evaluated the digital array's performance using a model system, to show that this technology is exquisitely sensitive, capable of differentiating as little as a % difference in gene copy number (or between and copies of a target gene). we have also analyzed commercial dna samples for their cyp d copy numbers and confirmed that our results were consistent with those obtained independently using conventional techniques. in a screening experiment with breast cancer and normal dna samples, the erbb gene was found to be amplified in about % of breast cancer samples. the use of the digital array enables accurate measurement of gene copy numbers and is of significant value in cnv studies. variation in the human genome occurs on multiple levels, from single nucleotide polymorphisms (snps) to duplications or deletions of contiguous blocks of dna sequences ( ) ( ) ( ) ( ) ( ). copy number variation (cnv) is an important polymorphism of dna segments across a wide range of sizes and one of the primary sources of variation in the human genome ( ). recently, cnv has been studied extensively because of its close association with large numbers of human disorders ( , ). an understanding of this variation is important not only to understand the full spectrum of human genetic variation but also to assess the significance of such variation in disease-association studies. the first human cnv map was constructed from a study of normal individuals with a total of cnv regions in the whole genome ( ); more than cnvs have been found in the human genome (http://projects.tcag.ca/variation). 
a recent paper demonstrated the presence of novel insertion sequences across the genomes of eight unrelated individuals, which were not present in the human reference genome, and showed that many of these have different copy numbers ( ) . however, the current cnv analysis is mainly dependent upon microarray-based snp and comparative genomic hybridization (cgh) platforms, or dna sequencing, and is therefore subject to low sensitivity and low resolution. these techniques are high throughput but lack the flexibility of analyzing individual genes or sequences of interest. other existing technologies, such as quantitative polymerase chain reaction (pcr), are limited because of their inability to reliably distinguish less than a twofold difference in copy number of a particular gene in dna samples ( ) ( ) ( ) . in this study we demonstrate the use of a unique integrated nanofluidic system, the digital array, in the study of cnvs. the digital array ( , ) is able to accurately quantitate dna samples based on the fact that single dna molecules are randomly distributed in more than reaction chambers and then pcr amplified. the concentration of any sequence in a dna sample (copies/ml) can be calculated using the numbers of positive chambers that contain at least one copy of that sequence. in order to ensure that the apparent difference in gene copy numbers in different samples are real, and not distorted by differences in sample amounts, we use the expression 'relative copy number'. the relative copy number of a gene is the number of copies of that gene per haploid genome. it can be easily expressed as the ratio of the copy number of a target gene to the copy number of a single copy reference gene (two copies per cell) in a dna sample, which is always per haploid genome. 
by using two assays for the two genes (the gene of interest and the reference gene) with two fluorescent dyes on the same digital array, we are able to simultaneously quantitate both genes in the same dna sample. the ratio of the numbers of molecules of these two genes is the relative copy number of the gene of interest in a dna sample. a single copy gene should have a relative copy number of . a relative copy number greater than indicates the presence of duplication of the target gene while a number smaller than implies deletion of this gene. our data show that the digital array is able to distinguish less than twofold differences in gene copy number and differentiate between , , , , , and copies of a gene with great accuracy. it provides a reliable and robust platform to study copy number variations and has great advantages over conventional techniques. the sequence of the rpp synthetic construct and the sequences of the primers and probe used to amplify this construct are shown in supplementary the taqman assay for the rnase p gene (vic) was ordered from applied biosystems (foster city, ca). the feasibility of digital pcr has previously been demonstrated by performing pcr on a single dna sample obtained by a serial dilution process ( , ) . target molecules in a dna sample could be quantitated by counting the number of positive reactions. we utilize the principle of partitioning instead of dilution in order to identify and quantitate individual dna molecules. the fluidigm digital array is a novel nanofluidic biochip where digital pcr reactions can be performed ( , ) . utilizing nanoscale valves and pumps, the digital array delivers up to mixtures of sample and pcr reagents into individual panels. each panel contains independent -nl chambers. this nanofluidic platform utilizes soft lithography and silicone rubber to create nanoscale valves and pumps that can be used in serial or parallel applications. 
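the relative copy number defined earlier in this section can be sketched numerically. the example below is a minimal illustration, not the digital pcr analysis software: it assumes that molecules partition independently, so that the number per chamber follows a poisson distribution, and it corrects the positive-chamber counts accordingly before taking the ratio. the function names and the chamber count of 1000 are hypothetical and are not the array's actual specification.

```python
import math


def estimated_molecules(positive_chambers: int, total_chambers: int) -> float:
    """Poisson-corrected number of molecules loaded into one panel.

    If the mean number of molecules per chamber is m, the expected fraction of
    positive chambers is 1 - exp(-m), so m = -ln(1 - positive / total).
    """
    if not 0 <= positive_chambers < total_chambers:
        raise ValueError("need 0 <= positive_chambers < total_chambers")
    mean_per_chamber = -math.log(1.0 - positive_chambers / total_chambers)
    return total_chambers * mean_per_chamber


def relative_copy_number(
    target_positive: int, reference_positive: int, total_chambers: int
) -> float:
    """Ratio of target-gene molecules to single-copy reference-gene molecules."""
    reference = estimated_molecules(reference_positive, total_chambers)
    if reference == 0.0:
        raise ValueError("reference gene was not detected")
    return estimated_molecules(target_positive, total_chambers) / reference


def diploid_copy_number(relative: float) -> float:
    """Copies per cell, given that the reference gene has two copies per cell."""
    return 2.0 * relative


# Hypothetical panel of 1000 chambers (illustrative, not the real chamber count).
ratio = relative_copy_number(target_positive=510, reference_positive=300,
                             total_chambers=1000)
print(round(ratio, 3), round(diploid_copy_number(ratio), 3))
```

in this made-up panel the raw ratio of positive chambers is 510/300 = 1.7, whereas the corrected ratio is 2, which shows why counting positive chambers alone underestimates the more abundant gene.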
the digital array is composed of a pdms (silicone rubber) integrated fluidic circuit, an integrated heat spreader to ensure rapid heat transfer and temperature uniformity within the array and an sbs-formatted carrier with inputs and pressure accumulator to act as an interface between the user and the pdms chip. there are carrier inputs corresponding to separate sample inputs to the chip. individual samples of a minimum volume of ml each are delivered into -nl preprogrammed partitioning chambers in the chip by pressure-driven 'blind filling' in the pdms. control lines are primed with control fluid and are pressurized to actuate valves between the reaction chambers. the valves partition individual chambers that are kept closed during the pcr experiment. one of the important applications of the digital array is absolute quantitation ( , ) . the dna molecules in each mixture are randomly partitioned into the chambers of each panel. the chip is then thermocycled on fluidigm's biomark system and the positive chambers that originally contained one or more molecules will generate fluorescent signals and can be counted by the digital pcr analysis software. since the volumes and dilution factors of the dna samples are known prior to loading into the digital array, the dna concentrations can be accurately calculated. the precision of this test is only dependent upon the sampling randomness and, like any biological experiments, will improve with multiple tests (panels). digital array has been routinely used by us to quantitate dna samples of unknown concentration and, especially, cdna samples whose concentrations of the sequences of interest are hard to determine otherwise. when duplication occurs, multiple copies of a gene might be closely linked on the same chromosome and therefore might not be separated from each other, even on the digital array. 
as a result, multiple copies might behave as a single molecule and the total number of copies of the gene would be underestimated. when two copies are separated by a large genomic distance, some of them might be separated when dna molecules are fragmented during purification. however, in most cases this would not be sufficient (see table , sample na genomic dna data). specific target amplification (sta) is a good solution to this problem. sta is a simple pcr reaction with primers for both the reference gene and the gene of interest. it is typically performed for a limited number of thermal cycles (five in this study). the copy numbers of both genes are proportionally increased. using this process, multiple copies of the gene of interest will be amplified separately and later randomly partitioned into chambers in the digital array. since the newly generated molecules of both genes reflect the original ratio and they are not linked any more, a digital chip analysis can quantitate the molecules of the two genes and measure their ratio, and therefore the copy number of the gene of interest, very accurately ( figure ). it is very important that the amplification efficiencies of the two pairs of primers be approximately equal in order not to introduce any bias in the ratio of the two gene copy numbers in the limited number of sta thermal cycles, although this is likely to have an insignificant effect on our results since we utilized only five cycles of preamplification. the amplification efficiency of any pair of primers can be easily measured using real-time pcr ( ) . sta was performed on a geneamp pcr system (applied biosystems, foster city, ca) in a ml reaction containing  taqman preamp master mix (applied biosystems, foster city, ca), nm of primers for both rnase p and the target gene and - ng dna. thermocycling conditions were c, min hot start and five cycles of c for s and c for min. 
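the effect of unequal primer efficiencies during sta, discussed above, can be estimated with a short calculation. this is a sketch under a simple assumption that is not stated in the text: each cycle multiplies the number of molecules of a gene by (1 + efficiency), with efficiency between 0 and 1. the function name `sta_ratio_bias` and the example efficiencies are hypothetical.

```python
def sta_ratio_bias(target_efficiency: float, reference_efficiency: float,
                   cycles: int = 5) -> float:
    """Factor by which preamplification distorts the target/reference ratio.

    Assumes each cycle multiplies a gene's molecule count by (1 + efficiency).
    A value of 1.0 means the original ratio is preserved.
    """
    for efficiency in (target_efficiency, reference_efficiency):
        if not 0.0 <= efficiency <= 1.0:
            raise ValueError("efficiency must lie between 0 and 1")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    return ((1.0 + target_efficiency) / (1.0 + reference_efficiency)) ** cycles


# Equal efficiencies leave the ratio untouched.
print(sta_ratio_bias(0.95, 0.95))
# Hypothetical efficiencies of 95% and 90% over five cycles.
print(round(sta_ratio_bias(0.95, 0.90), 3))
```

because the distortion grows geometrically with the number of cycles, keeping the preamplification short and measuring both primer efficiencies by real-time pcr beforehand are the two levers for keeping this factor close to 1.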
the products were diluted prior to the copy number analysis on the digital array based on their initial concentrations so that there would be about - rnase p molecules per panel.
copy number analysis using the digital array on the biomark system
each panel of a digital array contains a total of . µl ( nl × chambers) pcr reaction mix. however, µl reaction mixes were normally prepared for each panel, containing  taqman gene expression master mix (applied biosystems, foster city, ca),  rnase p-vic taqman assay,  taqman assay ( nm primers and nm probe) for the target gene,  sample loading reagent (fluidigm, south san francisco, ca) and dna with about - copies of the rnase p gene. the reaction mix was uniformly partitioned into the reaction chambers of each panel and the digital array was thermocycled on the biomark system (http://www.fluidigm.com/products/biomark-main.html). thermocycling conditions included a c, min hot start followed by cycles of two-step pcr: s at c for denaturing and min at c for annealing and extension. molecules of the two genes were independently amplified. fam and vic signals of all chambers were recorded at the end of each pcr cycle. after the reaction was completed, digital pcr analysis software (fluidigm, south san francisco, ca) was used to process the data and count the numbers of both fam-positive chambers (target gene) and vic-positive chambers (rnase p) in each panel. there are chambers in each of the panels in a digital array. when single dna molecules are randomly partitioned into these chambers, it is possible that multiple molecules could partition into the same chamber. as a result there could be more molecules in each panel than positive chambers. the true number of molecules per chamber can be estimated using a simple poisson distribution equation as described by sindelka et al. ( ). we have developed a more robust computational algorithm to analyze cnv data obtained from the digital array. 
this algorithm has been integrated into the digital pcr analysis software and is detailed in ( ). a proof-of-principle spike-in experiment was performed using a synthetic construct to explore the digital array's feasibility as a robust platform for cnv studies. a -base oligonucleotide that is identical to a fragment of the human rpp was ordered from integrated dna technologies (coralville, ia, usa). rnase p, a single copy gene, is used as reference in this study ( , ). both rpp synthetic construct and human genomic dna na from the coriell cell repositories (camden, nj, usa) were quantitated using the rpp assay on a digital array. different amounts of rpp synthetic construct were then spiked into the genomic dna so that mixtures with ratios of rpp and rnase p of : (no spike-in), : . , : , : . , : and : . were made, simulating dna samples containing two to seven copies of the rpp gene. these spike-in mixtures were analyzed on the digital arrays. five panels were used for each mixture and - rnase p molecules were present in each panel. the ratios of rpp /rnase p of all samples were calculated and are plotted against the expected ratios in figure . a good linear relationship can be observed. also shown in figure is an example of a typical digital array experiment.
cnvs of the cyp d gene
cyp d belongs to the cytochrome p system responsible for the metabolism of many commonly prescribed medications ( , ). the cyp d gene is highly polymorphic and this can significantly influence the metabolic activity of the enzyme it codes for (debrisoquine -hydroxylase) and the therapeutic efficacy of the drugs. therefore, the pharmacogenetic polymorphism information of this gene would be of great clinical importance in therapeutic decision-making ( ) ( ) ( ) ( ). more than alleles of the cyp d gene have been identified (http://www.cypalleles.ki.se/cyp d .htm). 
allele-associated variations in the activity of the cyp d enzyme have been observed and individuals carrying these alleles are classified into poor, intermediate, extensive and ultrarapid metabolizers ( , ). genotyping can identify patients who are at risk of severe toxic responses (poor metabolizers) or in need of more than the standard level of drugs (ultrarapid metabolizers). it has been shown that some poor and ultrarapid metabolizer phenotypes are caused by the deletion or duplication of the entire cyp d gene ( , ). these large structural changes can be detected using conventional technologies such as southern blot and long-range pcr. however, it is believed that real-time pcr is currently the only promising technique that is able to provide information about the exact copy number of the cyp d gene in a routine clinical setting ( ) ( ) ( ).
figure . quantitation of the rpp copy number in spike-in samples that contain two to seven copies of the rpp molecules per two haploid genomes. the x-axis shows the expected ratio of the numbers of rpp molecules to rnase p molecules. the y-axis shows the observed ratios. each value is calculated using five panels of the same sample mix and the error bars represent standard errors. a good linear correlation can be seen with a coefficient of determination (r²) of . .
we used the digital array to measure the cyp d copy numbers of three dna samples from paragondx (morrisville, nc). the cyp d genotypes of these dna samples had been characterized (table ). the samples were sta-treated (see figure and materials and methods section) and the products were analyzed using five panels each on the digital arrays. the relative copy numbers of these three samples are , . and . , respectively, highly consistent with their assumed cyp d diploid copy numbers ( , and ) based upon their genotypes. we also studied five cell line dna samples from coriell cell repositories (camden, nj). 
first, we measured their relative copy numbers using genomic dna. the results showed that two of them have a single copy and two have two copies of the cyp d gene per cell (table ). one sample had a relative copy number of about . , equal to a diploid copy number of . . we then sta-treated these five samples and ran the products on digital arrays. the relative copy numbers of the - and -copy samples remained the same and the fifth sample showed a relative copy number of about . or a diploid copy number of . apparently this sample had a duplication of the cyp d gene on one of the two chromosomes ( ). it has been previously demonstrated ( , ) that when cyp d duplication occurs, the two copies are separated by . kb. therefore, the diploid copy number of . obtained when genomic dna was used is likely the result of dna breakage in this . kb genomic region in some dna molecules, which separated the two cyp d copies. to confirm this, we ran a long-range pcr [see ( )]. cyp d duplication was observed only in the sample with a relative copy number of . (figure ).
erbb (also known as her ) is a receptor tyrosine kinase gene overexpressed in up to % of invasive breast cancer, resulting in a loss of normal cellular growth control. most of these cases ( %) are caused by the amplification of this gene and the number of extra copies is closely related to the protein expression level ( ) ( ) ( ) ( ). erbb amplification is well correlated with an aggressive phenotype characterized by reduced response to chemotherapy, high recurrence rate and short survival time and serves as a significant prognostic predictor for breast cancer patients ( , ). trastuzumab (herceptin), an fda-approved monoclonal antibody against the erbb protein, has been shown to dramatically increase response rate and extend survival in breast cancer patients with erbb amplification. 
given trastuzumab's proven efficacy and substantial benefit in multiple clinical trials, detection of erbb amplification has become critical ( ) ( ) ( ) ( ) . there are different methodologies of determining the erbb status in breast cancer. immunohistochemistry (ihc) and fluorescence in situ hybridization (fish) are two fda-approved technologies for the detection of erbb amplification. the former detects overexpression of the erbb receptor on the cell membrane while the latter detects the copy number of the gene itself relative to the chromosome centromere. ihc is less expensive and easy to perform but is prone to a high rate of inaccuracies due to variations in tissue preparation, protein stability, antibody sensitivity and scoring subjectivity. on the other hand, fish is accurate with good clinical correlation but it is expensive, time consuming, and labor intensive and requires very experienced personnel. therefore, suggestions have been made to use a combination of ihc and fish, where ihc is used as a screening procedure followed by a fish confirmation if necessary ( , ) . we used digital arrays to analyze the erbb copy numbers of breast cancer and normal breast tissue dna samples from biochain (hayward, ca). all dna samples were from asian individuals except one normal sample that was from a caucasian. of the breast cancer samples, are adenocarcinoma, is fibroadenoma, are invasive lobular carcinoma, is infiltrative ductal carcinoma and are invasive ductal carcinoma. the samples were sta-treated and, for screening purpose, the products were analyzed using only two panels for each sample on digital arrays. the results are shown in figure . fourteen breast cancer samples ( %) had a diploid erbb copy number of more than five while all control samples were below five copies [an absolute number of erbb copies greater than . per cell is considered amplification in fish analysis ( ) . here we use five as the threshold]. 
the copy numbers shown are not all integers due to (i) heterogeneity of the cancer cells and (ii) sampling variations as only two panels were used for each sample. a real-time pcr reaction was also performed on these samples. twenty-four replicates were used for each sample. although the average copy numbers were close to the digital array data, large fluctuations (sds of up to . ) were observed in the reactions of each sample. studies on other genes (for example, cyp d ) showed that real-time pcr does not always produce accurate results (data not shown). genomewide analyses have shown the existence of large numbers of cnvs in the entire human genome with large interindividual diversity ( ) ( ) ( ) ( ) ( ) ( ). many of these cnvs colocalize with genes involved in a variety of diseases or disease susceptibility and are believed to play some role in pathogenesis ( ) ( ) ( ) ( ). the first mendelian disorder associated with the amplification of a kb dna fragment was reported recently ( ). it appears to only be a question of time before more genetic conditions related to cnv are identified.
[table note: the results of both genomic dna and sta products are shown. the ratios of the cyp d gene to the rnase p gene should be close to multiples of . . the genomic ratio of . for sample na (corresponding to a diploid copy number of . ) reflects the partial separation of the duplication alleles in the genomic dna. a ratio of . (diploid copy number of ) was obtained when the sample was subjected to sta prior to the digital pcr analysis.]
two standard genomewide scanning methods for cnv detection are array-based cgh and high-density snp genotyping arrays and both were employed in the construction of the first human cnv map ( ). these microarray techniques are able to generate whole-genome cnv data and are important in cnv discovery. their resolution is also improving with the development of new probes. 
however, since they are both based on hybridization, the detection of copy number changes largely depends on signal-to-noise ratio, which is sensitive to reagent and manufacturing variability. therefore, false positive and false negative results are sometimes inevitable ( ). additionally, the lack of standard reference genomes in the studies using these technologies further complicates the interpretation of the results ( ). on many occasions, gene- or locus-specific (rather than whole-genome) copy number information is required. this is especially true in the cases of cyp d and erbb described above, in which therapeutic decisions need to be made based upon the copy numbers of these genes. in addition to other conventional methods (southern blot, long-range pcr and fish), the possibility of using quantitative pcr in the cnv study of these two genes has been previously explored ( ) ( ) ( ) ( ) ( ). quantitative pcr is simple and easy to perform. however, since the copy number of the target gene is derived from the ct difference between the target gene and a reference gene, the results are very sensitive to the efficiency of the amplification reaction. even if one compensates for the amplification efficiency, it is considered difficult to obtain a discrimination power of better than twofold ( ). the digital array has the ability to absolutely quantitate any type of dna sample. in a multiplex pcr reaction with two assays, the quantitation of two or more genes/sequences in a single sample becomes possible, effectively eliminating the pipetting variations that inherently occur in any quantitation experiment. the accuracy of the results is subject only to the random distribution of the molecules and, as in any biological experiment, can improve with the use of multiple replicates for each sample. 
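the statement that accuracy is limited only by random partitioning and improves with replicates can be made concrete with a short calculation. the sketch below is an approximation introduced here, not the authors' algorithm: it treats each chamber as an independent trial, applies a first-order (delta-method) error estimate to the poisson-corrected mean number of molecules per chamber, and pools replicate panels as if they were one larger panel. the function name and the numbers used are hypothetical.

```python
import math


def relative_standard_error(positive_fraction: float, chambers_per_panel: int,
                            panels: int = 1) -> float:
    """Approximate relative standard error of the estimated molecules per chamber.

    With p the fraction of positive chambers and n the total number of chambers,
    m = -ln(1 - p) and, to first order, var(m) = p / (n * (1 - p)).
    """
    if not 0.0 < positive_fraction < 1.0:
        raise ValueError("positive_fraction must lie strictly between 0 and 1")
    if chambers_per_panel <= 0 or panels <= 0:
        raise ValueError("chambers_per_panel and panels must be positive")
    total = chambers_per_panel * panels  # replicate panels pooled together
    mean_per_chamber = -math.log(1.0 - positive_fraction)
    standard_error = math.sqrt(positive_fraction / (total * (1.0 - positive_fraction)))
    return standard_error / mean_per_chamber


# Hypothetical panel size of 1000 chambers with half of the chambers positive.
for n_panels in (1, 2, 5):
    print(n_panels, round(relative_standard_error(0.5, 1000, n_panels), 4))
```

under these assumptions the sampling error falls with the square root of the number of panels, so four panels halve the error of one, which is consistent with the use of five panels for quantitative work and two panels for screening.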
sta can efficiently separate the linked copies of a gene on the same chromosome when duplication occurs, while other methods, such as restriction digestion, are also valid (data not shown). we performed three experiments to test the feasibility of the digital array in the cnv study. first we measured the copy numbers of the rpp gene of a series of mixtures made of a human genomic dna and a synthetic rpp construct. we observed a very good correlation between the results and the expected outcome. we then studied the cyp d copy numbers of some dna samples that were either genotyped elsewhere or characterized by us using conventional techniques. the results were also consistent. lastly, we screened breast cancer samples for the amplification of the erbb gene. although the clinical data (other than pathological classification) of these samples were lacking, about % of the samples had a copy number of this gene above , very close to the erbb amplification frequency reported in the literature ( ). in conclusion, this study shows that the digital array provides a new and robust technology to study gene- and sequence-specific cnv and is able to detect gene copy numbers with great accuracy. digital arrays provide a much greater discrimination power than quantitative pcr. cnv studies on the digital array are fast and easy to perform, and the data obtained are easy to interpret. furthermore, the platform is very flexible and can be tailored to any gene/sequence. it can also serve as an independent measure to verify results from the whole-genome scans using array technologies. the digital array is an excellent cnv platform for both basic research and clinical investigation. supplementary data are available at nar online. funding for open access charge: fluidigm corporation. conflict of interest statement. the authors declare competing financial interests. all are employees of fluidigm corporation.
the essence of snps
recent duplication, domain accretion and the dynamic mutation of the human genome
detection of large-scale variation in the human genome
large-scale copy number polymorphism in the human genome
segmental duplications and copy-number variation in the human genome
structural variation in the human genome
new perspectives for the elucidation of genetic disorders
genomic rearrangements and sporadic disease
global variation in copy number in the human genome
mapping and sequencing of structural variation from eight human genomes
novel real-time quantitative pcr test for trisomy
digital pcr for the molecular detection of fetal chromosomal aneuploidy
two-fold differences are the detection limit for determining transgene copy numbers in plants by real-time pcr
high throughput gene expression measurement with real time pcr in a microfluidic dynamic array
intracellular expression profiles measured by real-time pcr tomography in the xenopus laevis oocyte
nanoliter scale pcr with taqman detection
application of real-time quantitative pcr in the analysis of gene expression. dna amplification: current technologies and applications
mathematical analysis of copy number variation in a dna sample using digital pcr on a nanofluidic device
real-time reverse transcription-polymerase chain reaction assay for sars-associated coronavirus
structure and transcription of a human gene for h rna, the rna component of human rnase p
the effect of cytochrome p metabolism on drug response, interactions, and adverse effects
overview of enzymes of drug metabolism
the clinical role of genetic polymorphisms in drug-metabolizing enzymes
individualized drug therapy
the prevalence and clinical relevance of cytochrome p polymorphisms
pharmacogenetics and adverse drug reactions
cyp d polymorphisms and the impact on tamoxifen therapy
clinical implications of cyp d genetic polymorphism during treatment with antipsychotic drugs
deletion of the entire cytochrome p cyp d gene as a cause of impaired drug metabolism in poor metabolizers of the debrisoquine/sparteine polymorphism
ultrarapid metabolizers of debrisoquine: characterization and pcr-based detection of alleles with duplication of the cyp d gene
cyp d genotyping strategy based on gene copy number determination by taqman real-time pcr
determination of cytochrome p d (cyp d ) gene copy number by real-time quantitative pcr
pharmacogenetic screening of the gene deletion and duplications of cyp d
the human debrisoquine -hydroxylase (cyp d) locus: sequence and identification of the polymorphic cyp d gene, a related gene, and a pseudogene
ultrarapid drug metabolism: pcr-based detection of cyp d gene duplication
human breast cancer: correlation of relapse and survival with amplification of the her- /neu oncogene
studies of the her- /neu proto-oncogene in human breast and ovarian cancer
detection and quantitation of her- /neu gene amplification in human breast cancer archival material using fluorescence in situ hybridization
prognostic and predictive value of her /neu oncogene in breast cancer
erbb oncogene in human breast cancer and its clinical significance
use of chemotherapy plus a monoclonal antibody against her for metastatic breast cancer that overexpresses her
ongoing adjuvant trials with trastuzumab in breast cancer
trastuzumab after adjuvant chemotherapy in her -positive breast cancer
-year follow-up of trastuzumab after adjuvant chemotherapy in her -positive breast cancer: a randomised controlled trial
her testing: a review of detection methodologies and their clinical performance
challenges and standards in integrating surveys of structural variation
autosomal-dominant microtia linked to five tandem copies of a copy-number-variable region at chromosome p
a comprehensive analysis of common copy-number variations in the human genome
methods and strategies for analyzing copy number variation using dna microarrays
her- /neu gene copy number quantified by real-time pcr: comparison of gene amplification, heterozygosity, and immunohistochemical status in breast cancer tissue
reliability and discriminant validity of her gene quantification and chromosome aneusomy analysis by real-time pcr in primary breast cancer

the authors would like to thank dr stephen quake for his assistance in the interpretation of the results, as well as his careful reading of this article.

key: cord- -omq becw authors: shabanpoor, fazel; mcclorey, graham; saleh, amer f.; järver, peter; wood, matthew j.a.; gait, michael j. title: bi-specific splice-switching pmo oligonucleotides conjugated via a single peptide active in a mouse model of duchenne muscular dystrophy date: - - journal: nucleic acids res doi: . /nar/gku sha: doc_id: cord_uid: omq becw

the potential for therapeutic application of splice-switching oligonucleotides (ssos) to modulate pre-mrna splicing is increasingly evident in a number of diseases. however, the primary drawback of this approach is poor cell and in vivo oligonucleotide uptake efficacy. biological activities can be significantly enhanced through the use of synthetically conjugated cationic cell penetrating peptides (cpps). studies to date have focused on the delivery of a single sso conjugated to a cpp, but here we describe the conjugation of two phosphorodiamidate morpholino oligonucleotide (pmo) ssos to a single cpp for simultaneous delivery and pre-mrna targeting of two separate genes, exon of the dmd gene and exon of the acvr b gene, in a mouse model of duchenne muscular dystrophy. conjugations of pmos to a single cpp were carried out through an amide bond in one case and through a triazole linkage (‘click chemistry’) in the other. the most active bi-specific cpp–pmos demonstrated comparable exon skipping levels for both pre-mrna targets when compared to individual cpp–pmo conjugates both in cell culture and in vivo in the mdx mouse model. thus, two ssos with different target sequences conjugated to a single cpp are biologically effective and potentially suitable for future therapeutic exploitation.
splice-switching oligonucleotides (ssos) are currently very promising for therapeutic use both in duchenne muscular dystrophy (dmd), through exon skipping, and in spinal muscular atrophy (sma), through promotion of exon inclusion. such ssos are designed to sterically block splice sites or specific binding motifs for splicing machinery in order to promote exon inclusion or exclusion. for dmd, targeting of the dystrophin pre-mrna with an sso is used to 'skip' an exon that contains a nonsense coding mutation, or to remove an exon neighbouring an out-of-frame genomic deletion, so as to restore the mrna reading frame. this allows synthesis of an internally deleted dystrophin protein that retains the elements crucial for function ( ) ( ) ( ) . proof-of-principle of this approach has been demonstrated in murine and canine animal models of dmd ( - ) and more recently in human phase ii/iii clinical trials ( ) ( ) ( ) ( ) ( ) . various sso chemistries have been developed for exon skipping, including phosphorodiamidate morpholino oligonucleotides (pmo), peptide nucleic acids (pna), locked nucleic acids (lna), -o-methoxyethyl phosphorothioate oligonucleotides ( -moe/ps), -o-methyl phosphorothioate oligonucleotides ( -ome/ps) and tricyclo oligonucleotides (see ( ) for a recent review). the two sso chemistries that have been used most extensively, and which are in current clinical trials in dmd patients, are -ome/ps ( , ) and pmo ( , , ) . in these trials, production of dystrophin protein has been demonstrated, albeit at low levels, and, rather disappointingly, a recent phase iii trial with an -ome/ps sso did not meet its primary endpoint of a statistically significant improvement in the min walk test ( ) . nevertheless, preliminary results with pmo chemistry are promising and further studies are planned to target additional exons ( , ) .
to date, clinical trials have focused on targeted removal of exon , since this would benefit the largest patient pool, but ssos are also being developed to target other exons between and . the most significant barrier to success for splice-switching therapies has been effective delivery. in the case of pmo, and indeed other chemistries, the level of exon skipping, and hence the amount of functional dystrophin restored in muscle, is poor unless high doses are used. this is thought to be due mostly to their rapid clearance from the body following systemic administration, as well as their poor ability to penetrate cellular barriers and reach their nuclear target site. one approach to address this has been to attach cell-penetrating peptides (cpps) that can effectively carry charge-neutral pmo cargos across cell membranes to their pre-mrna target site in the nucleus. in particular, pmos conjugated to arginine-rich cpps (known as p-pmos) have been shown to enhance dystrophin production in muscle following systemic administration in the mdx mouse model of dmd ( ) ( ) ( ) . we have previously developed a series of novel arginine-rich cpps known as pna/pmo internalization peptides (pips), composed of two arginine-rich sequences separated by a short central hydrophobic core sequence. these pip peptides were designed to improve serum stability whilst maintaining a high level of exon skipping, initially by attachment to a pna cargo ( ) . further derivatives of these peptides were designed as conjugates of pmos, which were shown to lead to high dystrophin production in skeletal muscle body-wide and, importantly, also in the heart, following systemic administration ( ) . a later version (pip a-pmo) proved to be an even more efficient conjugate in mediating dystrophin production in the mdx mouse ( ) .
whilst restoration of the absent dystrophin protein is the primary goal for genetic therapies for dmd, consideration of complementary therapies to reduce pathological features of disease or to improve muscle function is also very important. one such strategy would be to promote muscle growth through targeting of the myostatin pathway. myostatin, or growth/differentiation factor- (gdf ), is a member of the transforming growth factor-β (tgf-β) family, which is involved in muscle homeostasis and acts to inhibit muscle growth ( ) . myostatin is involved in control of myogenesis through binding to activin type iib receptor (acvriib) ( , ) , which recruits and activates activin type i receptor (alk or alk ) ( ) . activin receptor activation results in phosphorylation of the intracellular signalling mediators smad and smad , which translocate to the nucleus to affect gene transcription. there is evidence that inhibition of the myostatin pathway has the potential for clinical benefit in dmd. transgenic knockout models for both myostatin and dystrophin demonstrate increased musculature due to fibre hypertrophy as well as reduced fibrosis and fat deposition, compared to mdx mice alone ( ) . similarly, treatment of mdx mice with anti-myostatin antibodies resulted in enlarged muscles concurrent with improved muscle function and a strong reduction in diaphragm fibrosis ( ) . based on these improved muscle features, the concept of myostatin down-regulation concurrent with dystrophin restoration has been investigated. dumonceaux et al. reported the use of adeno-associated virus (aav) constructs to combine rnai-mediated down-regulation of acvriib with a u -based small rna exon skipping technique to restore dystrophin ( ) . whilst concurrent treatment did not improve muscle mass, absolute and specific forces were much greater compared to either individual strategy.
a subsequent study by hoogaars et al. utilized soluble acvriib decoy receptors in combination with aav-u mediated dystrophin restoration to treat mdx mice ( ) . treatment with decoy acvriib increased body weight, with morphometric measurements of muscle fibres suggesting that muscle growth was due to hypertrophy ( , , ) . based on the potential success of this approach, we sought to simultaneously target both the dystrophin and the myostatin pathway as a molecular model to evaluate the efficacy of using bi-specific pmo compounds. a splice-switching approach has been used previously to target and down-regulate expression in the myostatin pathway, with ssos developed to target both the mstn ( ) and alk ( ) transcripts. whilst the principle of multiple-exon targeting for both dystrophin restoration and myostatin depletion has been demonstrated before using a cocktail of individual ssos ( ) ( ) ( ) , there are advantages in the development of a bi-specific compound to target two pre-mrnas. first, with a bi-specific sso both ssos must enter the same cell, whereas for a cocktail of ssos there will likely be a mixed population of cells in which neither, one or both genes are targeted. more importantly, since cell and in vivo toxicity of p-pmos is thought to be predominantly due to that of the peptide, the use of a single peptide to deliver both pmos halves the total peptide requirement compared to use of a cocktail of two separate p-pmos, and this may help to reduce the potential for peptide-mediated toxicity. note also that two -ome/ps ssos that target dystrophin and myostatin have recently been joined together to make a bi-specific construct without any attached delivery peptide, but in this case activity in mdx mice was not seen for the dystrophin sso ( ) . in our study, we selected pip a ( ) as the cpp to simultaneously deliver two different pmos as a bi-specific conjugate and to develop the chemistries for their attachment.
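the peptide-sparing argument above is simple arithmetic, sketched below. the dose value and the function name are hypothetical and chosen only for illustration; the calculation assumes that every pmo molecule is delivered as part of a peptide conjugate.

```python
def peptide_required(pmo_dose_nmol: float, n_targets: int,
                     pmos_per_peptide: int) -> float:
    """Peptide (nmol) needed to deliver pmo_dose_nmol of each target PMO.

    pmos_per_peptide is 1 for a cocktail of singly conjugated P-PMOs
    and 2 for a bi-specific conjugate carrying both PMOs on one peptide.
    """
    if pmos_per_peptide < 1 or n_targets % pmos_per_peptide != 0:
        raise ValueError("n_targets must be a multiple of pmos_per_peptide")
    return pmo_dose_nmol * n_targets / pmos_per_peptide


dose = 1.0  # hypothetical dose of each PMO, in nmol
cocktail = peptide_required(dose, n_targets=2, pmos_per_peptide=1)
bispecific = peptide_required(dose, n_targets=2, pmos_per_peptide=2)
print(f"cocktail: {cocktail} nmol peptide, bi-specific: {bispecific} nmol peptide")
```

for the same amount of each pmo, the bi-specific design needs half the peptide of the two-component cocktail, which is the basis of the toxicity argument.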
three different conjugation chemistries (amide, disulfide and triazole bonds) were utilized to allow orthogonal conjugation. the first pmo targets exon of the dystrophin gene to correct the mdx genotype and the second targets removal of exon of the acvr b gene so as to produce an internally deleted protein that lacks the crucial trans-membrane domains. several different bi-specific conjugate designs were investigated whereby two pmos were joined either at one end of the pip a peptide or with one pmo at each of the n- and c-termini. the activities of these conjugate constructs were assessed in mouse mdx cells and the most active bi-specific conjugates (d and d ), which had both pmos attached at the c-terminus of the cpp, were shown to have closely comparable dmd exon skipping activity to the single pip a-pmo targeting dmd. d and d conjugates were also assessed in the same cells for targeting of acvr b and both conjugates demonstrated only very slightly reduced exon skipping activity compared to pip a-pmo targeting acvr b. importantly, the cell viability using a bi-specific compound was significantly better than for a mixture of the two individual pip a-pmos. we furthermore assessed the potential of this approach in an in vivo environment through intramuscular administration and demonstrated that there were no significant differences in exon skipping activities for both dmd and acvr b targets between bi-specific conjugates (d and d ) and a cocktail of the individual p-pmo equivalents.

fmoc-protected amino acids, coupling reagents (hbtu and pybop) and the fmoc-β-ala-oh preloaded wang resin ( . mmol g − ) were obtained from merck (hohenbrunn, germany). fmoc-azido-l-lysine-oh was from iris biotech gmbh (deutschland, germany). fmoc-l-bis-homopropargylglycine-oh (bpg) was purchased from chiralix (nijmegen, the netherlands).
chicken embryo extract (cee) and horse serum (hs) for cell culture were obtained from sera laboratories international ltd (west sussex, uk). γ-interferon was obtained from roche applied science (penzberg, germany). all other reagents were obtained from sigma-aldrich (st louis, mo, usa) unless otherwise stated. maldi-tof mass spectrometry (table ) was carried out using a voyager de pro biospectrometry workstation. a stock solution of mg ml − of α-cyano- -hydroxycinnamic acid or sinapinic acid in % acetonitrile in water was used as matrix. the measurements have an accuracy level of ± . %. peptides were synthesized by standard fmoc chemistry ( ) using a cem liberty tm microwave peptide synthesizer (buckingham, uk). peptides were assembled on fmoc-β-ala-oh preloaded wang resin on a . mmol scale with excess of fmoc-protected amino acids, pybop and dipea ( : : ). the nα-fmoc protecting groups were removed by treating the resin with piperidine in dmf ( % v/v) at °c twice, once for s and then for min. the coupling reactions were carried out at °c for min. in order to prevent racemization, fmoc-cysteine (trt)-oh was coupled at °c for min at w microwave power. all amino acids were single coupled except for the arginines, which were double coupled. the fmoc-l-bis-homopropargylglycine-oh was coupled manually using a -fold excess and the coupling success was checked using a tnbs test ( ) . after completion of peptide assembly, the resin-bound peptide was cleaved off by treating the resin with a cocktail of tfa:dodt:h o:tips ( : . : . : ). the peptide was precipitated by addition of ice-cold diethyl ether and washed three times. the crude peptides were analysed and purified to > % by reversed-phase hplc (rp-hplc). the peptide mass characterization was carried out using maldi-tof mass spectrometry (abi voyager de pro) and an α-cyano- -hydroxycinnamic acid matrix made up in % acetonitrile containing . % tfa.
the pmo sequence for exon skipping of dmd pre-mrna ( -ggccaaacctcggcttacctgaaat) was either unmodified (standard morpholino with a secondary amine at the end) or functionalized with a disulfide at its -end. the pmo targeting exon- of acvr b was unmodified ( -gcctcgtttctcggcagcaatgaac- ). all pmos were purchased from gene tools llc (philomath, usa). -unmodified pmo was functionalized with an azido group by coupling the free -secondary amine group with fmoc-azido-l-lysine-oh ( figure a ). the coupling was carried out by activating the carboxyl group of the amino acid derivative using hbtu ( . eq.) and hoat ( eq.) in nmp in the presence of . eq. of diea before addition of the pmo dissolved in dimethylsulfoxide (dmso). the fmoc-azido-l-lysine-pmo conjugate was then purified using rp-hplc followed by fmoc deprotection and purification. in the case of pmo with a disulfide bond at its end ( figure b) , the disulfide bond was reduced to give a free sulfhydryl group using a -fold excess of tris ( -carboxyethyl)phosphine hydrochloride (tcep·hcl) in water for h followed by filtration to remove the excess tcep. the pmo with a free sulfhydryl group was then activated using a . -fold molar excess of , -dithiobis ( -nitropyridine) (dtnp) in dmso: acetonitrile ( . % tfa):h o ( . % tfa) with ( : : ) ratios ( ) . the reaction mixture was stirred at room temperature for h and the npys-activated pmo was purified by rp-hplc. conjugations of peptides to pmos ( figure ) were carried out in solution using a . -fold excess of peptide using similar conditions to the coupling of fmoc-azido-l-lysine-oh to pmo. the conjugation of the second pmo was carried out using either copper (i) mediated alkyne-azide click chemistry between the alkyne-functionalized p-pmo (figure a ) and the azide-functionalized second pmo or by forming a disulfide bond between the npys-activated second pmo and a free cysteine thiol of the p-pmo (figure b ). 
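because these pmos act as antisense agents that hybridize to the pre-mrna, the target site of each can be read off as the reverse complement of the pmo sequence. the sketch below uses the two sequences quoted above and assumes standard watson-crick pairing; the helper names are hypothetical and the code is only a convenience for inspecting the sequences, not part of the published method.

```python
# PMO sequences exactly as quoted in the text (DNA alphabet, lower case).
PMO_DMD = "ggccaaacctcggcttacctgaaat"
PMO_ACVR2B = "gcctcgtttctcggcagcaatgaac"

_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a lower-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def target_rna(pmo: str) -> str:
    """Sense-strand target of an antisense oligo, written as RNA (t -> u)."""
    return reverse_complement(pmo).replace("t", "u")


def gc_fraction(seq: str) -> float:
    """Fraction of g and c bases in the sequence."""
    return (seq.count("g") + seq.count("c")) / len(seq)


for name, pmo in (("dmd", PMO_DMD), ("acvr2b", PMO_ACVR2B)):
    print(name, len(pmo), target_rna(pmo), f"gc={gc_fraction(pmo):.2f}")
```

printing the target in the sense orientation makes it easier to compare the oligonucleotide against an annotated transcript when checking where on the pre-mrna it binds.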
the alkyne-azide click reaction between azide-functionalized pmo and alkyne-functionalized p-pmo was carried out by dissolving the p-pmo in water followed by addition of azido-functionalized pmo ( . eq.). sodium ascorbate ( eq. as a -mm solution) was added and the reaction mixture was vortexed thoroughly followed by addition of copper (ii)-tbta ( eq. as a mm solution) ( ) . the click reaction was carried out at room temperature for h or at °c for min. the conjugation of npys-activated pmo to p-pmo was carried out by first dissolving the npys-activated pmo in ammonium bicarbonate solution (ph ) followed by addition of the p-pmo dissolved in water. the reaction mixture was stirred at room temperature for h. the single and dual p-pmo conjugates were purified on a high-resolution (hr)- cation-exchange column (ge healthcare, usa) using mm sodium phosphate buffer (ph . ) containing % acetonitrile. the conjugates were eluted using a m nacl solution in the same buffer at a flow rate of ml min − . the excess salts were removed by centrifugation using an amicon ultra- k centrifugal filter device. the conjugates were characterized using maldi-tof ms as mentioned above. they were dissolved in sterile water and filtered through a . m cellulose acetate membrane (costar) before use.

mouse h k/mdx myoblasts were plated at a density of × cells per well in a gelatin ( . %) pre-coated well plate. h k/mdx myoblasts were grown in high-glucose dulbecco's modified eagle's medium (dmem) supplemented with % foetal calf serum (fcs), % cee and . % of interferon-γ at °c. the myoblast cells were differentiated into myotubes for the exon-skipping assay. myoblasts were differentiated for days in dmem supplemented with % hs at °c prior to transfection of p-pmos in serum-free opti-mem for h at °c. the transfection medium was then replaced with dmem/ % hs and cells incubated for a further h at °c.
experiments were carried out in the biomedical sciences unit, university of oxford according to procedures authorized by the uk home office. experiments were carried out in mdx mice (c bl/ scsn-dmd mdx /j) ( ) . intramuscular (im) injections (n = per treatment) were carried out on -week-old mdx mice under general anaesthesia. . nmol peptide-pmo in l . % saline volume was injected into tibialis anterior (ta) muscle. two weeks post-administration, animals were sacrificed by rising co inhalation and tissues snap-frozen in a dry ice cooled isopentane bath and stored at − °c.

rna was extracted from either h k mdx cell pellets or from ta tissue sections by mechanical disruption, and subsequently processed using trizol according to manufacturer's instructions (life technologies). rt-pcr analysis of exon skipping levels was carried out with ng of total rna used as a template in a l rt-pcr using the geneamp rna pcr kit (applied biosystems, warrington, uk). rt-pcr amplification of the dystrophin dmd transcript was carried out under the following conditions: °c for s, °c for s and °c for s for cycles using the following primers: dysex fo ( -cagaattctgccaattgctgag) and dysex ro ( -ttcttcagcttgtgtcatcc). two microlitres of this reaction was used as a template for nested amplification using amplitaq gold (applied biosystems, warrington, uk) under the following conditions: °c for s, °c for s and °c for s for cycles using the following primers: dysex fi ( -cccagtctaccaccctatcagagc) and dysex ri ( -cctgcctttaaggcttcctt). acvr b rt-pcr amplification was carried out under the following conditions: °c for s, °c for s and °c for s for cycles using the following primers: acvr bex f ( -ctgcgtttggaaagctcagctcat) and acvr bex r ( -aagggcagcatgtactcatcgaca). pcr products were analysed on % agarose gels.
for quantitative analysis of exon skipping levels, g of rna was reverse transcribed using the high capacity cdna rt kit (applied biosystems, warrington, uk) according to manufacturer's instructions. qpcr analysis was carried out using ng cdna template and amplified with taqman gene expression master mix (applied biosystems, warrington, uk) on a stepone plus thermocycler.

the levels of cytotoxicity of p-pmos were assessed in human hepatocytes.

novel bi-specific pmo compounds were developed that involved use of standard peptide synthesis methods for synthesis of the functionalized peptide component. the pmo components were obtained using initially unmodified pmo that was then functionalized at its -end with an azido group ( figure a ) to enable 'click' conjugation. alternatively, a -disulfide functionalized pmo was used to prepare a -npys-activated pmo ( figure b ). the yield of -npys pmo was %, whereas the -azido pmo was obtained in a lower yield of . % because of the two-step rp-hplc purification used. the cpp chosen for the constructions was pip a (figure a) . two different types of bi-specific compounds were designed. in the first, the two pmo oligonucleotides were each conjugated to a different end of the pip a peptide ( figure a ) (designated d ); in the second, the pmo oligonucleotides were both attached at the carboxy-terminal end of pip a (designated d and d ). in the case of click chemistry conjugation to pmo, pip a was synthesized with an alkyne group either at its n-terminus for d , or at its c-terminus for d and d (figure a ). in d , pmo (dmd) was conjugated to the c-terminus through an amide bond and the -azido-pmo (acvr b) was conjugated at the n-terminus through a triazole bond. for bi-specific conjugate d , the acvr b pmo was click conjugated, whereas in conjugate d the dmd targeting pmo was click conjugated.
for click conjugations at the c-terminus of pip a, one β-alanine (b) and one aminohexanoic acid (x) spacer residue were incorporated on either side of the bpg alkyne derivative, whereas no spacer was used for n-terminal click conjugations (figure a ). bi-specific conjugate d was prepared with a similar spacing to d , but with a disulfide bond replacing the triazole bond, through synthesis of pip a having a c-terminal x-cys-b extension ( figure b ). the assembly of these bi-specific conjugates required synthesis of three different derivatives of pip a, which were synthesized on solid phase using fmoc peptide chemistry and purified by rp-hplc to > % purity in yields of - %. for each construct, conjugation of the first pmo to each of the three pip a derivatives was carried out through amide bond formation between the c-terminal carboxylic acid group of the peptide and the secondary amine at the end of the pmo, similarly to the synthesis of pip a-pmo (dmd) ( ) (figure a and b) . the conjugations were carried out in solution and purifications were carried out by ion exchange hplc. isolated yields of p-pmo were - % (based on the amount of starting pmo). the -azido pmo was coupled to the alkyne-p-pmo using copper (i)-mediated alkyne-azide click chemistry, resulting in a yield of % for conjugate d and % for conjugates d and d (figure a) . the click reaction for syntheses of d and d was sluggish at room temperature and after h only a small amount of bi-specific conjugate was formed ( figure a) . however, heating the reaction mixture to °c significantly improved the reaction rate and after min the reaction had proceeded to near completion as determined by ion exchange hplc ( figure b) . the -npys pmo (dmd) was conjugated to p-pmo (acvr b) to give a disulfide bond in bi-specific conjugate d ( figure b ) in a yield of %.
the efficacies of the bi-specific p-pmo constructs were assessed in an initial screening step by rt-pcr using dmd exon skipping in h k mdx cells ( figure a ). for all conjugates tested, high levels of exon skipping were found in a dose-dependent manner (with some exons - double skipping, which is frequently observed in this test system). the bi-specific conjugates d and d , where both pmos are conjugated to the c-terminus of pip a, exhibited better exon-skipping activity than d and d . since bi-specific conjugate d was less effective than d , this suggests that there is no advantage to use of a cleavable disulfide linkage over use of stable click chemistry for addition of the second pmo. the control singly conjugated pip a-pmo (dmd) demonstrated only slightly higher exon skipping compared to conjugates d and d . bi-specific conjugates d and d , which showed the best dmd exon skipping levels, were examined further for their exon-skipping efficacies in acvr b ( figure b ). the results for acvr b mirrored those for dmd targeting. only very slightly higher exon skipping was observed with the control singly conjugated pip a-pmo compared to bi-specific counterparts d and d . in addition, the general level of exon skipping in the acvr b gene was found to be lower than that for dmd. this is potentially because disruptive exon skipping by oligonucleotides for this gene is harder to achieve than skipping of dmd exon and has not yet been fully optimized. bi-specific conjugate d was marginally more efficient than d . since both d and d conjugates demonstrated skipping activity of both genes in cells and had the highest levels of dmd exon skipping, they were further evaluated in vivo in the mdx mouse model of dmd. intramuscular administration of . nmol of either the d or the d bi-specific conjugate was carried out into the ta muscle of mdx mice and compared with a : molar cocktail of singly conjugated pip a-pmos.
analysis of splice-switching activity was carried out weeks post-administration. each of the cpp-pmo conjugates demonstrated robust splice-switching activity, with higher splice-switching activity evident for dmd targeting compared to acvr b, as was also seen in the in vitro cell culture studies. since no significant differences between the constructs could be seen following gel analysis by rt-pcr in the case of dmd gene targeting ( figure a ), quantitative analysis of splice switching was carried out using qpcr primers to determine the reduction in the level of transcripts containing the target exons. in each case - % of exon skipped dmd transcripts were found, when normalized to dmd transcripts from non-injected control muscle, with no statistically significant differences seen between p-pmo treated mice for both singly conjugated and bi-specific conjugates. unsurprisingly, the pattern of exon skipping was also maintained for acvr b gene targeting, where there were no significant differences seen between d or d conjugates or the singly conjugated pip a-pmo counterpart ( figure b ). cell and in vivo toxicities of p-pmos are known to be predominantly a function of the peptide component and are dose-dependent ( ) . therefore, the cell viability of the d bi-specific conjugate was assessed in human hepatocyte (huh ) cells and compared to that of pip a-pmo (dmd) and a mixture of the two pip a-pmos against the two different targets dmd and acvr b, in each case using a high equimolar concentration based on total pmo. thus m of bi-specific conjugate d was compared with a mixture of m each of pip a-pmo (dmd) and pip a-pmo (acvr b) and the percentage of cell survival measured (figure ) . these results showed significantly higher cell viability for the bi-specific d conjugate compared to a mixture of both pip a-pmos for the two individual targets.
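the qpcr quantification described above, in which the loss of exon-containing transcript is expressed relative to non-injected control muscle, can be sketched with the common delta-delta-ct calculation. this is a minimal sketch under stated assumptions, not the authors' analysis: it assumes one assay for the exon-containing transcript and one for total transcript, close to 100% amplification efficiency for both, and all ct values shown are hypothetical.

```python
def percent_exon_skipped(ct_exon_treated: float, ct_total_treated: float,
                         ct_exon_control: float, ct_total_control: float) -> float:
    """Percent reduction in exon-containing transcript versus control.

    Uses the delta-delta-Ct method and ASSUMES roughly 100% amplification
    efficiency (a factor of 2 per cycle) for both assays.
    """
    delta_treated = ct_exon_treated - ct_total_treated
    delta_control = ct_exon_control - ct_total_control
    remaining = 2.0 ** -(delta_treated - delta_control)
    return 100.0 * (1.0 - remaining)


# Hypothetical Ct values: the exon-containing assay crosses threshold one
# cycle later in treated muscle than in control, relative to total transcript.
skipped = percent_exon_skipped(ct_exon_treated=26.0, ct_total_treated=24.0,
                               ct_exon_control=25.0, ct_total_control=24.0)
print(f"exon-containing transcript reduced by {skipped:.0f}%")
```

normalizing to total transcript within each sample, and then to the control muscle, removes differences in rna input between samples; if the two assays differ in efficiency, the factor of 2 would need to be replaced by measured values.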
the promise of ssos as therapeutic agents is being realized, with a number of clinical trials for dmd in progress to assess the efficacy of targeting a single exon (exon ) to bypass disease-causing mutations ( ) ( ) ( ) . further clinical trials are also being undertaken to target other exons, notably in the region covering exons - ( ) . multiple simultaneous exon skipping using a cocktail of ssos has been suggested as an approach to target a majority of patients ( , ) . a proof-of-concept for multiple exon skipping was demonstrated in the golden retriever model of dmd ( ) as well as in human patient cell lines ( ) . the concept was further extended to the targeting of exons - using cocktails of various ssos ( ) . unsurprisingly, the levels of exon skipping found to date have been low. in an attempt to improve this efficacy, studies were undertaken in the mdx mouse model using a cocktail of different 'vivo-morpholino' pmos to delete the entire stretch of exons - ( ) . although a skipped transcript could be detected in these mice, the likely significant toxicity of delivering into mice individual pmos that are functionalized with guanidinium groups might hinder their clinical development. instead of using a cocktail of p-pmos, and with the need to minimize toxicity in mind, we sought to develop proof-of-concept for use of bi-specific pmo ssos that could simultaneously target two different exons in different genes rather than in the same gene. for these initial studies we chose simultaneous targeting of exon of the dmd gene and exon of the acvr b gene, using a single cpp as pmo delivery agent, to observe whether exon skipping could be maintained for both targets. dumonceaux et al. showed that a combination of restoration of dystrophin and simultaneous inhibition of the myostatin-signalling pathway results in a significant improvement in muscle growth and force in dystrophic mdx mice ( ) .
in a similar study, myostatin knockdown in conjunction with dystrophin restoration using an exon-skipping approach with a cocktail of two separate pip a-pmos resulted in significantly increased mouse muscle mass ( ) . thus targeting two genes in this way might be expected to have clinical relevance. we used our well-characterized arg-rich cpp pip a as the model cpp because of the known high level of exon skipping observed for pmo conjugates in the dmd model ( ) and we designed orthogonal conjugation chemistries using amide, disulfide and triazole bonds. the first conjugation of pmo to pip a was carried out in all cases through formation of a stable amide bond between the -secondary amine of the pmo and the c-terminal carboxyl group of a synthetic pip a derivative. the second pmo conjugation was then effected either using an alkyne group on the pip a to an azide-functionalized pmo to give a stable triazole linkage or with a cys residue on the pip a to an activated thiol group on the pmo to form a reversible disulfide linkage (figures and ) . bi-specific conjugate d has one pmo at each pip a terminus whilst d and d have both pmos at the c-terminus of pip a. the more sluggish click conjugation of the second pmo to the c-terminus of pip a in d and d at room temperature was presumably due to poorer accessibility than in the case of d , but heating to • c readily facilitated triazole bond formation (figure ) . conjugation of the second pmo using a disulfide bond (d ) did not require heating. in mdx muscle cells, the lowest exon-skipping efficacy was observed for bi-specific conjugate d that had a pmo conjugated to each end of the pip a, which suggests that the ability of pip a to deliver pmo through the endosomal pathway and into the nucleus is inhibited by placing a bulky pmo at its n-terminus. 
by contrast, when both pmos were placed at the c-terminus of the pip a (d and d ), the dmd skipping activity was restored to close to the level observed with pip a-pmo (dmd) ( figure a ), which supports the view that cpp delivery is optimal when the cpp n-terminus is not blocked by a bulky substituent. both conjugates d and d also demonstrated high levels of acvr b exon skipping in cells, close to that of the single pip a-pmo (acvr b) ( figure b ). further, both d and d conjugates showed exon skipping of both the dmd and acvr b targets by intramuscular delivery ( figure ) at levels unaltered from that of individual pip a-pmos. these results show that there is no sequestration of one pmo at its own target that prevents the action of the other pmo at its own target. thus, there must be a sufficient on-off equilibrium established for each target release and for target accessibility. this is an important finding that validates the use of bi-specific p-pmo. interestingly, bi-specific p-pmo d , in which the pmo (dmd) was conjugated through a reducible disulfide bond, was less active than stably conjugated d and d constructs ( figure ). one explanation for this could be partial reduction of the disulfide bond upon cell entry and liberation of free dmd pmo from the bi-specific conjugate before the pip a can deliver it into the cell nucleus. however, note that no difference was found in splicing redirection in a hela cell model using either disulfide-linked or stably linked r -penetratin pna ( ) . by contrast in a recent study, bi-specific -o-methyl phosphorothioate ssos targeting both the mstn and dmd genes and delivered using a cationic transfection reagent were more effective when linked through a cleavable disulfide linker than through a non-cleavable hydrocarbon linker ( ) . 
the differential effect of use of a cleavable disulfide linkage probably reflects a different cell and nuclear uptake mechanism for ssos delivered by a transfection agent compared to covalent cpp delivery of uncharged pmo or pna. detailed mechanistic studies would be needed to confirm this, but it should be noted that uptake of pip a-pmo (dmd) into mdx skeletal muscle cells has been shown recently to be predominantly caveolae-mediated, whereas in cardiomyocytes uptake was mostly clathrin-mediated ( ) , suggesting that uptake routes are also cell dependent in addition to being dependent on oligonucleotide type and delivery method. for the future it should be noted that the efficiency of exon skipping for pmo (acvr b) is not as high as for exon skipping for pmo (dmd). pmo (acvr b) will therefore need further optimization to improve efficiency on this target. alternatively, one might prefer to use a combination of pmo (dmd) and a pmo that targets myostatin directly ( ) to make a bi-specific pmo rather than targeting its receptor, and thus this second pmo needs to be optimized before embarking on lengthy intravenous delivery studies in mdx mice where larger scale synthesis would be needed. clearly such studies need to investigate physiology benefits in addition to exon skipping ( ) . however, the increase in cell viability in human hepatocyte cultures for bi-specific conjugate d compared to the same pmo concentration for a mixture of two pip a-pmos ( figure ) is encouraging and if this lower toxicity is maintained in systemic delivery, the clinical importance could be significant when considering multiple targeting (whether targeting different exons in the same gene or in two different genes). in summary, we have developed new synthetic methodology for conjugation of two pmo ssos to a single pip a cpp and shown efficient targeting in cells and in vivo of two separate genes with retained potency for each sso. 
this work should enable further systemic studies on multiple sso targeting in both dmd and also potentially in other neuromuscular disease models. antisense-mediated exon skipping: a versatile tool with therapeutic and research applications a chemical view of oligonucleotides for exon skipping and related drug applications targeting rna to treat neuromuscular disease systemic delivery of morpholino oligonucleotide restores dystrophin expression bodywide and improves dystrophic pathology functional rescue of dystrophin-deficient mdx mice by a chimeric peptide-pmo extensive and prolonged restoration of dystrophin expression with vivo-morpholino-mediated multiple exon skipping in dystrophic dogs efficacy of systemic morpholino exon-skipping in duchenne dystrophy dogs exon skipping and dystrophin restoration in patients with duchenne muscular dystrophy after systemic phosphorodiamidate morpholino oligomer treatment: an open-label, phase , dose-escalation study systemic administration of pro in duchenne's muscular dystrophy local restoration of dystrophin expression with the morpholino oligomer avi- in duchenne muscular dystrophy: a single-blind, placebo-controlled, dose-escalation, proof-of-concept study local dystrophin restoration with antisense oligonucleotide pro eteplirsen for the treatment of duchenne muscular dystrophy orphan drug development in muscular dystrophy: update on two large clinical trials of dystrophin rescue therapies sustained dystrophin expression induced by peptide-conjugated morpholino oligomers in the muscles of mdx mice cell-penetrating peptide-morpholino conjugates alter pre-mrna splicing of dmd (duchenne muscular dystrophy) and inhibit murine coronavirus replication in vivo cell-penetrating peptide-conjugated antisense oligonucleotides restore systemic muscle and cardiac dystrophin expression and function improved cell-penetrating peptide-pna conjugates for splicing redirection in hela cells and exon skipping in mdx mouse muscle pip -pmo, a 
new generation of peptide-oligonucleotide conjugates with improved cardiac exon skipping activity for dmd treatment regulation of skeletal muscle mass in mice by a new tgf-beta superfamily member regulation of myostatin activity and muscle growth myostatin inhibits myoblast differentiation by down-regulating myod expression interleukin- and interleukin- are expressed in organs of normal young and old mice loss of myostatin attenuates severity of muscular dystrophy in mdx mice antibody-directed myostatin inhibition improves diaphragm pathology in young but not adult dystrophic mdx mice combination of myostatin pathway interference and dystrophin rescue enhances tetanic and specific force in dystrophic mdx mice combined effect of aav-u -induced dystrophin exon skipping and soluble activin type iib receptor in mdx mice transcriptional control by the tgf-beta/smad signaling system antisense-induced myostatin exon skipping leads to muscle hypertrophy in mice following octa-guanidine morpholino oligomer treatment targeting tgf-beta signaling by antisense oligonucleotide-mediated knockdown of tgf-beta type i receptor highly efficient in vivo delivery of pmo into regenerating myotubes and rescue in laminin-alpha chain-null congenital muscular dystrophy mice dual exon skipping in myostatin and dystrophin for duchenne muscular dystrophy dual myostatin and dystrophin exon skipping by morpholino nucleic acid oligomers conjugated to a cell-penetrating peptide is a promising therapeutic strategy for the treatment of duchenne muscular dystrophy design and application of bispecific splice-switching oligonucleotides methods and protocols of modern solid phase peptide synthesis new micro-test for detection of incomplete coupling reactions in solid-phase peptide synthesis using , , -trinitrobenzene-sulphonic acid guided reconstitution of membrane protein fragments development of a general methodology for labelling peptide-morpholino oligonucleotide conjugates using alkyne-azide click 
chemistry x chromosome-linked muscular dystrophy (mdx) in the mouse pharmacokinetics, biodistribution, stability and toxicity of a cell-penetrating peptide-morpholino oligomer conjugate multiexon skipping leading to an artificial dmd protein lacking amino acids from exons through could rescue up to % of patients with duchenne muscular dystrophy development of a multiplex allele-specific primer pcr assay for simultaneous detection of qoi and caa fungicide resistance alleles in plasmopara viticola populations antisense oligonucleotide-induced exon skipping restores dystrophin expression in vitro in a canine model of dmd antisense-induced multiexon skipping for duchenne muscular dystrophy makes more sense assessment of the feasibility of exon - multiexon skipping for duchenne muscular dystrophy bodywide skipping of exons - in dystrophic mdx mice by systemic antisense delivery efficient splicing correction by pna conjugation to an r -penetratin delivery peptide cellular trafficking determines the exon skipping activity of pip a-pmo in mdx skeletal and cardiac muscle cells we thank thibault coursindel (mrc-lmb) and taeyoug koo and samir el andaloussi (university of oxford) for help and discussions with the initiation of the concept of joining two pmos via a single peptide. we also thank andrey arzumanov and itaru okamoto (mrc-lmb) for advice and the academic support of the mdex consortium (http://www.mdex.org.uk/) is also acknowledged. 
key: cord- -t aufs authors: aurrecoechea, cristina; barreto, ana; basenko, evelina y.; brestelli, john; brunk, brian p.; cade, shon; crouch, kathryn; doherty, ryan; falke, dave; fischer, steve; gajria, bindu; harb, omar s.; heiges, mark; hertz-fowler, christiane; hu, sufen; iodice, john; kissinger, jessica c.; lawrence, cris; li, wei; pinney, deborah f.; pulman, jane a.; roos, david s.; shanmugasundram, achchuthan; silva-franco, fatima; steinbiss, sascha; stoeckert, christian j.; spruill, drew; wang, haiming; warrenfeltz, susanne; zheng, jie title: eupathdb: the eukaryotic pathogen genomics database resource date: - - journal: nucleic acids res doi: . /nar/gkw sha: doc_id: cord_uid: t aufs the eukaryotic pathogen genomics database resource (eupathdb, http://eupathdb.org) is a collection of databases covering + eukaryotic pathogens (protists & fungi), along with relevant free-living and non-pathogenic species, and select pathogen hosts. to facilitate the discovery of meaningful biological relationships, the databases couple preconfigured searches with visualization and analysis tools for comprehensive data mining via intuitive graphical interfaces and apis. all data are analyzed with the same workflows, including creation of gene orthology profiles, so data are easily compared across data sets, data types and organisms. eupathdb is updated with numerous new analysis tools, features, data sets and data types. new tools include go, metabolic pathway and word enrichment analyses plus an online workspace for analysis of personal, non-public, large-scale data. expanded data content is mostly genomic and functional genomic data while new data types include protein microarray, metabolic pathways, compounds, quantitative proteomics, copy number variation, and polysomal transcriptomics. 
new features include consistent categorization of searches, data sets and genome browser tracks; redesigned gene pages; effective integration of alternative transcripts; and a eupathdb galaxy instance for private analyses of a user's data. forthcoming upgrades include user workspaces for private integration of data with existing eupathdb data and improved integration and presentation of host–pathogen interactions. a unique infrastructure and search strategy system distinguish the eukaryotic pathogen database resource (eupathdb, http://eupathdb.org) from other organism databases. the power of eupathdb lies in the ability to query across hundreds of data sets while refining a set of genes, proteins, pathways or organisms of interest. the interface is designed for easy mastery by biological researchers, enabling in silico experiments that interrogate diverse and complex data sets. despite the sophisticated strategy system, browsing gene pages and genomic spans or regions remains a simple and informative task in this innovative and valuable resource. eupathdb facilitates the discovery of meaningful biological relationships between genomic features such as genes or snps by integrating pre-analyzed data with sophisticated data mining, visualization and analysis tools that are designed to be used by wet-bench researchers. organized into free, online databases eupathdb supports over eukaryotic pathogens with genomic sequence and annotation, functional genomics data, host-response data, isolate and population data and comparative genomics. table provides a web address and a link to a list of organisms supported for each database. all databases are built with the same infrastructure and use the strategies web development kit ( ) , which provides a graphical interface for building complex search strategies and exploring relationships across data sets and data types ( figure ; strategy http://plasmodb.org/plasmo/im.do?s= b dd c ). 
as one of four national institute of allergy and infectious diseases (niaid/nih) funded bioinformatics resource centers ( - ) eupathdb provides data, tools and services to scientific communities researching pathogens in the niaid list of emerging and re-emerging infectious diseases which includes niaid category a-c priority pathogens and many fungi. additional eupathdb support for the kinetoplastid and fungal research communities is funded by the wellcome trust in collaboration with genedb ( ), including support for focused curated annotation. this manuscript describes expanded content, features and tools added since that increase the data mining and discovery power of eupathdb. over the past years, eupathdb has routinely updated existing databases and added two new databases. we added new data, expanded the range of supported data types, enhanced infrastructure and added new analysis tools. eupathdb resources have been expanded to include fungidb (http://fungidb.org) ( ) , which supports fungi and oomycetes, and hostdb (http://hostdb.org), for interrogation of host responses to infection. hostdb supports host data obtained during infections by organisms supported by eupathdb's parasite lineage-specific databases. minot et al. ( ) , for example, infected murine macrophages with toxoplasma gondii strains and collected mixed parasite-host samples for rna sequencing. reads that align to the t. gondii genome are integrated into toxodb whereas hostdb houses those sequencing reads that align to the m. musculus genome. because all eupathdb databases employ the same data analysis pipelines, search strategy system, visualization and analysis tools, the t. gondii and m. musculus data can be compared. for example, one can easily identify parasite genes that are differentially expressed between two t. gondii strains from toxodb as well as host genes that are differentially expressed during infection with the same two strains from hostdb. 
enrichment analyses and comparison of these lists offers insights into host-pathogen interactions and responses. eupathdb tools are conceived and designed to reduce analysis barriers, enhance data mining and improve communication within and between the scientific communities we serve. the near-seamless integration of strategy results with tools for functional enrichment analyses and transcript interpretation as well as our new galaxy workspace and the availability of publicly shared strategies augment the data mining experience in eupathdb. galaxy workspace. eupathdb sites now include a galaxy-based ( ) workspace for large-scale data analyses, e.g. rna-seq read mapping to a reference genome. developed in partnership with globus genomics ( ), workspaces offer a private analysis platform with published workflows and pre-loaded annotated genomes for the organisms we support. the workspace is accessed through the 'analyze my experiment' (figure a ) tab on the home page of any eupathdb resource and can be used to upload your own data e.g. rna-seq reads, compose and run preconfigured or custom workflows ( figure b and c), retrieve your results, visualize them in eupathdb ( figure d ), and share workflows and data analysis results with colleagues. explore transcript subsets. transcript subsets occur when a multi-transcript gene has at least one transcript that does not meet the search criteria. for example, signal peptides are short sequences at the n-terminus of secretory proteins and eupathdb predicts signal peptides for all annotated genomes using signalp ( ) . the predicted signal peptide search returns genes and transcripts with predicted signal peptides. if one transcript of a multi-transcript gene excludes the exon containing the signal peptide, the search returns the gene but not the signal peptide-deficient transcript. searches and strategies that query transcript-specific data ( figure a ; strategy http://plasmodb.org/plasmo/im. 
do?s= df f e) are equipped with an explore tool for interrogating or filtering transcript subsets. the explore tool appears in the gene results tab above the table of ids ( figure c ) and offers filters for transcripts based on their inclusion in the result set. filters are applied to the strategy result and update the gene result list. for two-step strategies where both steps query transcript-specific data, the explore tool offers further filters for viewing transcripts that were returned by both searches, either search or neither search. enrichment analyses. gene ontology, metabolic pathway and word enrichment analyses are available for gene strategy results to aid with their interpretation ( figure f ). these functional analyses apply fisher's exact test to determine over-represented pathways, ontology terms and product description terms. clicking the analyze results tab of any gene strategy result ( figure e ) and selecting an enrichment analysis will open an analysis tab where users are prompted for parameter values. the results of an enrichment analysis are presented in tabular form and include a list of enriched go terms, pathways or product description words and associated data. public strategies. strategies marked as public when saved to a user's profile will also be shared with the community in the 'public strategies' tab of the 'my strategies' interface. users control the availability of the strategy and can remove it at any time. the panel also includes example strategies provided by eupathdb. data sets search tool. each data set integrated into eupathdb is documented with a data set record which contains information about the data including a description, contact information for the investigator who generated the data, literature references, and when available, example graphs and links to searches and genome browser tracks. links to data set records appear on gene pages and on search pages beneath the parameters. 
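The enrichment analyses described above rely on Fisher's exact test. The following is a minimal sketch of the one-sided (over-representation) form of that test, written as a hypergeometric upper tail. The function name, argument names and example numbers are illustrative assumptions; the text does not describe how eupathdb implements the test, its background set, or any multiple-testing correction.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: genes in the result list annotated with the term
    n: size of the result list
    K: genes in the background annotated with the term
    N: size of the background (all names here are illustrative)
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= k <= min(n, K) and max(n, K) <= N):
        raise ValueError("counts are inconsistent")
    total = comb(N, n)
    # Sum the probabilities of seeing k or more annotated genes by chance.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


if __name__ == "__main__":
    # Hypothetical example: 3 of 5 result genes carry a term that
    # annotates 5 of 20 background genes.
    print(enrichment_p_value(3, 5, 5, 20))
```

A small p-value indicates that the term appears in the result list more often than expected from the background alone; in practice, p-values for many terms would usually be corrected for multiple testing.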
a searchable table of all data sets is available from the data summary tab in the gray drop-down menu bar. eupathdb's philosophy is to provide a data mining platform that allows users to ask their own questions in support of hypothesis-driven research. the extensive range of data types (genomic, transcriptomic, proteomic, metabolomic, etc.) maintained by eupathdb broadens the user's ability to mine extensively by providing multiple forms of experimental evidence to interrogate. as the omics world expands, eupathdb endeavors to support meaningful data types and has expanded its coverage over the past few years. source brought many genomes from this large and diverse research community. updates to eupathdb's reflow workflow system ( ) make it possible to quickly and reliably analyze and load data. thus, over the past years, numerous functional data sets have been loaded. data sets of interest can be located with the data set search tool described above. protein microarray. this new data type offers a measure of host response to infection by revealing pathogen-specific antibodies in host serum or plasma samples. a typical data set includes data from serum samples collected from patients during an infection (or from healthy controls) that were hybridized to arrays spotted with possible pathogen antigens (peptides representing gene products) ( ) ( ) ( ) ( ) . searches that query this data type are classified under immunology and graphs of a pathogen gene's antigenicity for each sample appear on gene pages. the searches employ the filter parameter for selecting samples based on clinical characteristics of patients when configuring the search ( ) . metabolic pathways. pathways are integrated from metacyc, kegg, trypanocyc and leishcyc ( ) ( ) ( ) ( ) ( ) as networks of enzymatic reactions and substrate/product compounds. genes are mapped to pathways based on ec numbers. 
pathway record pages feature a cytoscape image which can be 'painted' with experimental data, e.g. gene expression values or ortholog profiles. for easy transition to functional analysis, gene search results can be converted to pathways using the transform to pathways function in the add step popup or users can run a pathways enrichment analysis of their gene result to identify pathways that are statistically enriched. compounds. compound records are integrated from the chemical entities of biological interest (chebi) database ( ) and associated to genes through metabolic pathway mappings. lists of compounds are returned based on molecular weight or formula, compound id, enzyme ec number and text. lists of genes and metabolic pathways can be transformed into their associated compounds using the transform function. a genome-wide loss of function screen using crispr technology is available in toxodb and provides a measure of a gene's contribution to parasite fitness ( ). quantitative proteomics. this new data type provides evidence for differential protein expression from experimental methods such as silac ( , ) . the searches appear under proteomics, quantitative mass-spec evidence, and return genes based on the fold change in protein expression between samples. gene pages include graphs of these data when available. copy number variation. whole genome resequencing data are used to estimate chromosome and gene copy number in re-sequenced strains ( ) . the median read depth is set to the organism's ploidy and each chromosome's median read depth is normalized to this value. contigs that are not assigned to chromosomes are excluded from this analysis. gene copy number is similarly calculated using a normalized read depth for each gene. to compare the number of genes in the re-sequenced genome to the reference genome, genes are grouped into clusters that are inferred to have originated by duplication. 
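The chromosome copy number estimate described above can be sketched as follows. This is an illustration under stated assumptions, not the eupathdb pipeline: it assumes the reference "median read depth" is the median over all positions on assigned chromosomes, and the function and variable names are hypothetical.

```python
from statistics import median


def chromosome_copy_number(depths_by_chrom, ploidy):
    """Estimate copy number per chromosome from read depth.

    depths_by_chrom: dict mapping chromosome name to a list of read
    depths (per position or per window); unassigned contigs should be
    left out by the caller, as in the text.
    ploidy: the organism's ploidy, which the overall median depth is
    taken to represent (an assumption about how the median is pooled).
    """
    all_depths = [d for depths in depths_by_chrom.values() for d in depths]
    genome_median = median(all_depths)
    if genome_median == 0:
        raise ValueError("overall median read depth is zero")
    # Scale each chromosome's median depth so the overall median equals ploidy.
    return {
        chrom: ploidy * median(depths) / genome_median
        for chrom, depths in depths_by_chrom.items()
    }


if __name__ == "__main__":
    # Hypothetical diploid example in which chr3 has 1.5x the typical depth.
    depths = {
        "chr1": [10, 10, 10, 10],
        "chr2": [10, 10, 10, 10],
        "chr3": [15, 15, 15, 15],
    }
    print(chromosome_copy_number(depths, ploidy=2))
```

The same scaling could be applied per gene by passing each gene's depths instead of each chromosome's, which matches the statement that gene copy number is calculated from a normalized read depth.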
searches are categorized under genetic variation and either return genes with a certain copy number, or genes with different copy numbers between strains. polysomal transcriptomics. rna-sequencing of polysome or ribosome associated transcripts reveals potential translation events. data sets of this data type are available in plasmodb ( , ) and trytripdb ( ) . categorized under transcriptomics, rna seq evidence, the searches against this new data type return genes with differential translation potential (fold change search) or genes within a certain percentile rank within a sample. expression graphs and rna sequencing coverage plots are available statically in gene pages and dynamically in gbrowse. these coverage plots provide evidence for the cds and translational start site usage. metadata. biological sample characteristics such as host clinical parameters for pathogen isolates or blood samples offer valuable information for stratifying samples while configuring searches. eupathdb integrates metadata when available and presents it in the filter parameter interface to take advantage of the rich data type when selecting samples for data mining (see below). the most recent eupathdb release represents significant updates to the underlying data and infrastructure. in addition to refreshing all data to the latest versions, we added workspaces, redesigned our gene pages, incorporated alternative transcripts into gene pages and searches, updated search categories and contemporized the rna sequence analysis workflow. categories. searches, the experimental data sets they query, and genome browser tracks for visualization are now displayed with a common logic across the websites. the categories are based on the embrace data & methods ontology (edam) ( ) , which relates biological concepts with bioinformatic analyses. the result is a logical, consistent menu structure from home page to gene page to genome browser. 
for example, the category names and order in the home page 'search for genes' (figure b) is the same as the 'contents' section of the gene page ( figure c ). eupathdb's extensive record system documents integrated data and analysis results for entities such as genes, genomic sequences, snps, isolates, compounds and metabolic pathways. record pages have a new streamlined look, contain improved navigation tools, and are reorganized to reflect edam-based categories ( figure ) . to view the gene page for pf d , autophagy-related protein , putative that is highlighted in figure , go to http://plasmodb.org/plasmo/app/record/ gene/pf d . for example, in gene record pages, gene ids and product descriptions are prominently displayed in the upper left corner of the page with other pertinent gene information and links directly below ( figure a ). also at the top of the page are 'shortcuts' ( figure b ) which serve two functions--clicking on the shortcut's magnifying glass icon offers a larger view of the data, while clicking on the image (or its title) navigates to the data within the gene page. 'view in genome browser' links (e.g. above and below the gene models image in figure d ) accompany data that are also available for dynamic viewing in the genome browser. these links open the genome browser (gbrowse) ( ) with the pertinent data track added to the user's current browser session. the collapsible and interactive 'contents' section reflects the new edam-based categories and features a search function for quickly locating a category ( figure c ). the contents section remains stationary and visible while scrolling the gene page data ( figure d) . a section indicator (small blue circle) appears to the left of the category name of the data currently in view. clicking a category name directs the page to that data section. the check boxes to the right of the category names can be used to customize the data display. 
data from categories with empty check boxes will be hidden from view. data tables ( e, f and within figure d ) are collapsible, interactive, contain sortable columns and present transcript-specific information when data can be unambiguously assigned to a transcript. tables with two or more rows include a search function. the transcriptomics (figure e) , protein properties and features ( figure f ), mass spec -based expression evidence and sequences tables contain expandable rows for retrieving detailed information. each row of the transcriptomics table represents a data set and expanding a row reveals graphs, data tables, and a data set description, as well as coverage plots for rna sequencing data. expansion of the rows in the protein properties and features table reveals the domains, blastp hits and other analysis results pertinent to the transcript's protein product. the mass spec-based expression evidence graphic table shows proteomic evidence associated with each transcript. the sequences table offers genomic, coding, predicted mrna and predicted protein sequences for each transcript. human and mouse genes (hostdb) have extensive alternative transcripts and there is increasing evidence that many eukaryotic pathogen genes have more than one transcript. eupathdb infrastructure was updated to better represent transcript information. transcripts are graphically represented on gene pages and listed in gene page tables when data can be unambiguously assigned to a transcript (figure d ). all gene search results now include a transcript id column ( figure c ). the results of searches that query transcript-specific data (e.g. predicted signal peptide) contain an explore tool (see tools section of this manuscript) for investigating transcript subsets ( figure b ). filtering samples based on metadata. 
sequences from pathogen isolates and data from host clinical blood samples are often accompanied by rich metadata: sample characteristics including host, age, geographic location, disease status and parasitemia. eupathdb's new filter parameter ( figure ) increases the user's power to mine data via display of sample characteristics (metadata) on the interface for selection of samples while configuring a search or multiple sequence alignment. for example, the filter parameter makes it possible to compare the antigenicity of parasite genes between infected children and uninfected children within the same dataset. the filter parameter is available for searches and sequence alignments that access snp, chip-seq and host-response data. rna-sequence analysis workflow updated. our pipeline for analyzing and loading rna-sequence data was updated to use standard tools and to accommodate data sets with biological replicates. the new workflow aligns reads with gsnap and calculates fpkm/rpkm with htseq ( , ) . deseq is used to determine differential expression for experiments that have appropriate biological replicates ( ) . future development efforts at eupathdb will concentrate on expanding private analysis workspaces and better integration and support for host response to pathogen infection. the galaxy toolshed contains many tools for data analysis. we expect to enhance our existing galaxy workspace with new workflows such as alignment of resequencing reads and snp calls or production of multiple sequence alignments and phylogenetic analyses. critical to our expanded workspace will be the ability for users to fully integrate the results of their analyses into eupathdb so that they can query, view, and share their results in the context of the publicly available data in eupathdb. 
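The rna-sequence workflow above reports expression as fpkm/rpkm. As a reminder of what that unit means, here is a minimal sketch of the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). It is not taken from the eupathdb pipeline; the function name and the example counts are hypothetical.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    fragment_count: fragments assigned to the gene
    gene_length_bp: gene (or transcript) length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical gene: 500 fragments, 2 kb long, 10 million mapped fragments.
    print(fpkm(500, 2000, 10_000_000))
```

RPKM uses the same formula with reads in place of fragments, so the two agree for single-end data.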
a high priority for eupathdb in the coming year is to better represent host responses to pathogen infection and enable users to mine these data to identify genes (or other entities) and relationships of interest. currently, only a few omics data sets are available for host response, but we expect this situation to change rapidly. we will be expanding not only the amount of host data that we load, but also the types of host response data so that we can include highthroughput metabolic and immune profiling and rich descriptions of all study, experiment and sample metadata. we will be loading these rich multi-dimensional studies and we will be implementing a variety of tools and analyses to mine these data at a systems level. the strategies wdk: a graphical search interface and web development kit for functional genomics databases eupathdb: the eukaryotic pathogen database ) patric, the bacterial bioinformatics database and analysis resource vectorbase: an updated bioinformatics resource for invertebrate vectors and other organisms related with human diseases virus pathogen database and analysis resource (vipr): a comprehensive bioinformatics database and analysis resource for the coronavirus research community influenza research database: an integrated bioinformatics resource for influenza research and surveillance. influenza other respir viruses genedb-an annotation database for pathogens fungidb: an integrated functional genomics database for fungi admixture and recombination among toxoplasma gondii lineages explain global genome diversity the galaxy platform for accessible, reproducible and collaborative biomedical analyses: update cloud-based bioinformatics workflow platform for large-scale next-generation sequencing analyses signalp . 
: discriminating signal peptides from transmembrane regions submicroscopic and asymptomatic plasmodium falciparum and plasmodium vivax infections are common in western thailand--molecular and serological evidence a prospective analysis of the ab response to plasmodium falciparum before and after a malaria season by protein microarray plasmodium falciparum protein microarray antibody profiles correlate with protection from symptomatic malaria in kenya malaria transmission, infection, and disease at three sites with varied transmission intensity in uganda: implications for malaria control a framework for global collaborative data management for malaria research kegg as a reference resource for gene and protein annotation leishcyc: a biochemical pathways database for leishmania major leishcyc: a guide to building a metabolic pathway database and visualization of metabolomic data the metacyc database of metabolic pathways and enzymes and the biocyc collection of pathway/genome databases trypanocyc: a community-led biochemical pathways database for trypanosoma brucei the chebi reference database and ontology for biologically relevant chemistry: enhancements for a genome-wide crispr screen in toxoplasma identifies essential apicomplexan genes quantitative proteomics using silac: principles, applications, and developments proteome remodelling during development from blood to insect-form trypanosoma brucei quantified by silac and mass spectrometry chromosome and gene copy number variation allow major structural change between species and strains of leishmania genome-wide regulatory dynamics of translation in the plasmodium falciparum asexual blood stages polysome profiling reveals translational control of gene expression in the human malaria parasite plasmodium falciparum extensive stage-regulation of translation revealed by ribosome profiling of trypanosoma brucei edam: an ontology of bioinformatics operations, types of data and identifiers, topics and formats the generic 
genome browser: a building block for a model organism system database htseq-a python framework to work with high-throughput sequencing data gmap and gsnap for genomic sequence alignment: enhancements to speed, accuracy, and functionality differential expression analysis for sequence count data the authors wish to thank members of the eupathdb research communities for their willingness to share genomicscale data sets, often prior to publication and for numerous comments and suggestions from our scientific advisors and the scientific community at large, which have helped to improve the functionality of eupathdb resources. we also thank past and present staff associated with the eupathdb brc project, and our research laboratory colleagues whose contributions have facilitated the creation and maintenance of this database resource. key: cord- -a y rfas authors: sharma, virag; prère, marie-françoise; canal, isabelle; firth, andrew e.; atkins, john f.; baranov, pavel v.; fayet, olivier title: analysis of tetra- and hepta-nucleotides motifs promoting - ribosomal frameshifting in escherichia coli date: - - journal: nucleic acids res doi: . /nar/gku sha: doc_id: cord_uid: a y rfas programmed ribosomal - frameshifting is a non-standard decoding process occurring when ribosomes encounter a signal embedded in the mrna of certain eukaryotic and prokaryotic genes. this signal has a mandatory component, the frameshift motif: it is either a z_zzn tetramer or a x_xxz_zzn heptamer (where zzz and xxx are three identical nucleotides) allowing cognate or near-cognate repairing to the - frame of the a site or a and p sites trnas. depending on the signal, the frameshifting frequency can vary over a wide range, from less than % to more than %. 
the present study combines experimental and bioinformatics approaches to carry out (i) a systematic analysis of the frameshift propensity of all possible motifs ( z_zzn tetramers and x_xxz_zzn heptamers) in escherichia coli and (ii) the identification of genes potentially using this mode of expression amongst enterobacteriaceae genomes. while motif efficiency varies widely, a major distinctive rule of bacterial - frameshifting is that the most efficient motifs are those allowing cognate re-pairing of the a site trna from zzn to zzz. the outcome of the genomic search is a set of gene clusters, of which constitute new candidates for functional utilization of - frameshifting. programmed ribosomal - frameshifting (prf- ) was recognized more than years ago as a mode of translational control of specific genes, first in retroviruses ( ) ( ) ( ) and later in bacterial genes ( , ) . since then, the number of demonstrated or suspected cases has greatly increased, generally through homology searches, or by taking advantage of the many sequenced genomes to look for genes containing potential frameshift signals ( , ) . for example, the recode database ( ) has entries for - frameshifting originating from eukaryotic viruses ( cases), from transposable elements ( cases, bacterial and eukaryotic), from bacteriophages ( cases) and from chromosomal genes ( cases). overall, entries come from eukaryotic genes. this may give the impression that - frameshifting is less common in prokaryotes, but this is not necessarily true. analysis of the isfinder database ( ) , dedicated to bacterial transposable elements called insertion sequences (is), showed that more than is elements very likely use - frameshifting to synthesize the proteins necessary for their mobility ( ) . another bioinformatics study carried out on bacterial genomes revealed more than genes that probably use - frameshifting ( ) .
these genes can be grouped into a limited number of clusters, most of which correspond to is elements. although, like their eukaryotic counterparts, most bacterial genes likely using programmed - frameshifting are found in mobile elements, such as is transposons or bacteriophages ( ) , utilization of - frameshifting may not be limited to them. another study revealed a set of prokaryotic gene families with various potential programmed frameshifts, several of which were experimentally tested ( , ) . most of these families correspond to non-mobile genes encoding proteins of known functions and proteins with conserved domains performing yet unknown functions. sequences of genes from these clusters are available from the gentack database ( ) . the execution of frameshifting at a significant level requires a relatively simple signal which is embedded within the coding part of certain mrnas ( figure ). the two components of this signal were revealed by the earlier studies on retroviruses ( ) ( ) ( ) . the obligatory component is a short sequence of nucleotides, x_xx.z_zz.n, called the frameshift motif or 'slippery' motif, where xxx and zzz are triplets of identical bases, and n is any nucleotide (underscoring separates codons in frame with the initiation codon, i.e. frame , and dots separate codons in the new frame, i.e. frame - ). thus, there are possible sequences corresponding to the above definition. it was also shown that an even shorter sequence, a z_zz.n tetramer (where zzz are three identical bases, thus leading to possible motifs), could also direct programmed - frameshifting ( ) ( ) ( ) ( ) ( ) ( ) ( ) . the 'slipperiness' of both types of motifs likely results from their capacity to allow cognate or near-cognate re-pairing in the - frame of one or two trnas ( , ) .
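The motif definitions above are small enough to enumerate directly. The following is a minimal sketch, not code from the study: the function names are hypothetical, and the motifs are written as RNA. Because X, Z and N each range independently over four bases, the definition yields 4 x 4 x 4 heptamers and 4 x 4 tetramers.

```python
from itertools import product

BASES = "ACGU"

def heptamers():
    """All X_XX.Z_ZZ.N motifs as plain 7-nt strings (XXXZZZN)."""
    return [x * 3 + z * 3 + n for x, z, n in product(BASES, repeat=3)]

def tetramers():
    """All Z_ZZ.N motifs as plain 4-nt strings (ZZZN)."""
    return [z * 3 + n for z, n in product(BASES, repeat=2)]

def annotate(motif):
    """Add '_' (codon boundary in the initial frame) and '.' (boundary in the new frame)."""
    if len(motif) == 7:
        return f"{motif[0]}_{motif[1:3]}.{motif[3]}_{motif[4:6]}.{motif[6]}"
    if len(motif) == 4:
        return f"{motif[0]}_{motif[1:3]}.{motif[3]}"
    raise ValueError("expected a 4-nt or 7-nt motif")

if __name__ == "__main__":
    print(len(heptamers()), len(tetramers()))
    print(annotate("AAAAAAG"), annotate("UUUC"))
```

Note that X and Z are allowed to be the same base, so motifs such as A_AA.A_AA.G are part of the heptamer set.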
on an x_xx.z_zz.n heptamer, the xxz- and zzn-decoding trnas, respectively, in the p and a sites of the ribosome, would break the codon-anticodon interaction and re-pair on the xxx and zzz codons in the - frame. the second component of frameshift signals is a stimulatory element which, by itself, cannot induce frameshifting. it can be an rna secondary structure, such as a simple or branched hairpin-type stem-loop (hp) or a pseudoknot (pk) ( ) ( ) ( ) . as illustrated in panels c and d of figure , it is formed by local folding of the mrna and generally starts - nucleotides downstream of the slippery sequence ( , , ) . the stimulatory effect of a structure may be linked to its capacity to block ribosomes transiently when the zzn codon occupies the a site and thus give more time to trnas for re-pairing ( ) ( ) ( ) . in addition, a structure may exert a pulling effect on the mrna and favour its realignment within the ribosome to bring the xxx and zzz - frame codons into the p and a sites ( , ) . it is present in all well-studied eukaryotic cases, often as a pk, but not always found in prokaryotic prf- regions ( ) . prokaryotic signals sometimes possess another type of stimulatory element upstream of the frameshift motif: a shine-dalgarno (sd)-like sequence normally involved in translation initiation through pairing with the ccucc sequence at the end of s ribosomal rna ( ) ( ) ( ) . in prf- signals, the same interaction occurs but within an elongating ribosome and results in translational pausing ( ) , which may provide a longer time window for trna re-pairing. the other possible effect of a stimulatory sd is linked to its distance from the motif, which could generate a tension between the mrna and the ribosome that could be resolved by realigning the mrna ( ) . thus, the two types of stimulators could act in concert, both by generating pausing and by pushing (sd) or pulling (structure) on the mrna.
in addition, frameshifting frequency in eukaryotes and prokaryotes is modulated by the immediate context on both sides of the slippery motif ( ) ( ) ( ) . it is not yet clear by which mechanism(s) this modulation operates but it could result in part from an e site trna effect ( ) , from intra-mrna interactions ( ) and possibly from mrna-ribosome interactions within the message entry tunnel ( ) . the slippery motif, being the key element in frameshifting, has been the object of particular attention. an early study dealt with the motif present in the rous sarcoma virus signal ( ) . mutating it to a limited set of different motifs led to the proposal of basic rules governing the frameshifting efficiency of x_xx.z_zz.n heptamers in eukaryotes: in short, substantial frameshifting is attained if xxx = [aaa, ggg or uuu], zzz = [aaa or uuu] and n = [a, c or u]. a subsequent nearly systematic analysis confirmed these rules: motifs were tested and the remaining motifs were not included because of their expected inefficiency (see upper panel of supplementary figure s a ) ( ) . thus, only a subset of the possible x_xx.z_zz.n heptamers can elicit frameshifting at a significant level in higher eukaryotes. the x_xx.z_zz.n heptamers were also analysed in prokaryotes, using the bacterium escherichia coli, but not as thoroughly as in eukaryotes ( , , ( ) ( ) ( ) . it turned out that the most proficient motifs in e. coli, the x_xx.a_aa.g heptamers, were the least efficient ones in eukaryotes. conversely, the best eukaryotic motifs proved inefficient in e. coli (e.g. a_aa.a_aa.c or g_gg.a_aa.a), thus indicating major differences in the response of the respective translational machineries to these signals. however, the limited scope of these studies, in terms of number of motifs tested, did not allow the establishment of precise rules concerning the efficiency of slippery heptamers in bacteria.
the first aim of the work presented here is to determine these rules by carrying out a complete functional analysis in e. coli of both types of potential frameshift motifs, the z_zz.n tetramers and x_xx.z_zz.n heptamers. the second objective is to investigate, by bioinformatics approaches, the prevalence of the x_xx.z_zz.n motifs in enterobacterial genomes, mostly from e. coli isolates, in order to determine whether or not they have been selected against in coding sequences because of their frameshifting proclivity. our third aim is to identify genes possibly utilizing - programmed frameshifting by analysing, in the same set of genomes, those containing a subset of heptamers, which were chosen on the basis of their - frameshifting efficiency and/or their significant underrepresentation.

the e. coli k- strain js [mc , arad (ara leu) galu galk hsds rpsl (laciopzya)x malp::laciq srlc::tn reca ] was used for all experiments. bacterial cultures were carried out in luria-bertani (lb) medium ( ) to which ampicillin ( mg/l) plus oxacillin ( mg/l) were added when necessary.

figure legend: reporter plasmid and sequence of the three contexts in which the frameshifting propensity of the x_xx.z_zz.n heptamers was assessed. plasmid pofx (panel a) was used to clone between a hindiii and an apai site the three frameshift windows shown in panels b-d. the no-stimulator construct (panel b) was derived from the is construct (panel c) ( ) by deletion of most of the stem-loop and mutation to ccuc of the sd-like ggag sequence. the is construct (panel d) was engineered by replacing the is stem-loop with the pk from is ( ) and by mutating to ccuc the stimulatory sd.
all frameshift cassettes were cloned into the pofx reporter (figure a), derived from the pan plasmid ( ) by changing the translation initiation region of the lacz gene, between the xbai and hindiii sites, to tctagctcgagatttattggaataacatatg aaa aaa cgt aat tta agc tt (the xbai and hindiii sites are in lowercase, the sd sequence gga and the atg start codon in frame are both underlined). overlapping oligonucleotides were inserted between the hindiii and apai sites of the vector to reconstitute the various frameshift regions in front of the lacz gene so that expression of β-galactosidase requires a - ribosomal frameshifting event within the cloned cassette. for each type of frameshift region (i.e. no stimulator, is stimulators or is stimulators, see figure b-d), an in-frame construct was made to serve as % reference for calculation of frameshifting frequencies from β-galactosidase activities. a non-shifty derivative was constructed for each motif to assess the background level of frameshifting. the rationale was to keep the same trna in the a site (z_zz.n motifs) or in both a and p sites (x_xx.z_zz.n motifs). transcription of lacz relies on a strong, isopropyl-β-d-thiogalactopyranoside-inducible, ptac promoter. its expression was monitored by a standard colorimetric assay ( ) on cultures prepared either in the absence of inducer, for constructs with a sufficiently high level of β-galactosidase activity (i.e. above . % frameshifting), or after isopropyl-β-d-thiogalactopyranoside induction for the ones with a low activity (i.e. those with less than % frameshifting). for each strain, tubes containing . ml of luria-bertani medium (supplemented with ampicillin and oxacillin) were inoculated with independent clones and incubated overnight at °c.
these cultures were either diluted / in the same medium and incubated min at • c (no-induction conditions) or diluted / in the same medium plus mm of isopropyl-␤-dthiogalactopyranoside and incubated min at • c (induction conditions). the dosage conditions were as previously described ( ) . note that both methods gave identical % frameshifting values in their overlap range, i.e. between . % and %. we also verified the accuracy of the reported values above or below the overlap range, by applying to a limited set of plasmid constructions a refined assay in which non-induced cultures were first concentrated and lysed by sonication (data not shown). the refseq accessions of the genomes that were used to construct our nrmeg, together with their organism/strain information, are given in supplementary table s . all the protein coding gene sequences were extracted from the .ffn files (national center for biotechnological information website; ftp://ftp.ncbi.nlm.nih.gov/genomes/bacteria/) of these accessions and merged together to make a combined genome of sequences. these sequences were clustered using the blastclust program using a % sequence identity threshold at the level of nucleotide sequence. one representative sequence per cluster was randomly chosen to constitute an nrmeg of sequences. each sequence from the nrmeg was randomized times using the dicodonshuffle randomization procedure ( ) to yield randomized nrmegs. the dicodonshuffle algorithm preserves the dinucleotide composition, the encoded protein sequence and the codon usage of each gene. a customized perl script was used to count the occurrences of a pattern in all three possible reading frames (i.e. x xx.z zz.n, xx x.zz z.n and xxx .zzz. n) in both the real and randomized nrmegs. violin plots ( ) , generated with the vioplot package from the r software library (http://www.r-project.org), were used to visualize the occurrences of the x xx.z zz.n patterns. 
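The counting step and the comparison against randomized genomes can be sketched as follows. This is an illustrative Python sketch, not the authors' Perl script: the function names are hypothetical, sequences are assumed to be coding sequences starting at the first base of the start codon, and the sample standard deviation is an assumption, since the text does not say which estimator was used. The z-score follows the definition given in the text that follows.

```python
import re
from statistics import mean, stdev

def count_in_frames(cds, pattern):
    """Count occurrences of a 7-nt pattern at each codon offset of a CDS.

    counts[k] holds matches starting at position p with p % 3 == k.
    A match at offset 2 places XXXZZZN as X_XX.Z_ZZ.N relative to the
    codons of the annotated reading frame.
    """
    counts = [0, 0, 0]
    # Zero-width lookahead so that overlapping matches are all counted.
    for m in re.finditer(f"(?={re.escape(pattern)})", cds):
        counts[m.start() % 3] += 1
    return counts

def z_score(observed, randomized_counts):
    """(x - x_mean) / x_sd, with mean and sd taken over the randomized genomes."""
    return (observed - mean(randomized_counts)) / stdev(randomized_counts)

if __name__ == "__main__":
    # Toy CDS: ATG GCA AAA AAG TAA, with A_AA.A_AA.G in the shift-prone frame.
    print(count_in_frames("ATGGCAAAAAAGTAA", "AAAAAAG"))
    # Toy counts: the pattern is seen 10 times but about 20 times after shuffling.
    print(z_score(10, [20, 22, 18, 20]))
```

A negative value from `z_score` corresponds to underrepresentation of the pattern in the real genome relative to the shuffled ones.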
z-scores were computed as follows: $z = (x - x_{\mathrm{mean}})/x_{\mathrm{sd}}$, where $x$ is the frequency of occurrence of a pattern in the integrated genome, $x_{\mathrm{mean}}$ is the mean of the distribution of the same pattern across randomized genomes and $x_{\mathrm{sd}}$ is the standard deviation of the distribution of the same pattern across randomized genomes. z-scores for the xxxzzzn patterns in all three frames are shown in supplementary table s while all violin plots are available online at http://lapti.ucc.ie/heptameric patterns clusters/. all the annotated protein coding genes from of the genomes listed in supplementary table s (ac was later excluded because of its removal from refseq) were screened for the presence of a motif from a set of selected x_xx.z_zz.n patterns (see results section). these sequences were clustered based on similarity between the encoded protein sequences using the blastclust program (sequence identity threshold = %). a total of clusters which had at least sequences and where the heptameric pattern was perfectly conserved were retained for further analysis. the coordinates of the conserved heptameric patterns were also recorded for each cluster. however, these clusters contain only sequences of protein coding genes from those genomes that were initially selected to constitute the nrmeg. these clusters were enriched with additional homologous sequences from the genomes not included in the nrmeg in an attempt to obtain a better phylogenetic signal. the enrichment was carried out using a tblastn search against all bacterial sequences in the nr database as described previously ( ) . the 'newly' obtained homologous sequences for each cluster were aligned by translating them into protein sequences, aligning these protein sequences and then back-translating the aligned protein sequences to their corresponding nucleotide sequences.
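The back-translation step, and the coordinate shift that alignment gaps introduce, can be illustrated with a short sketch. It assumes each protein row was produced by translating the corresponding coding sequence codon by codon, so that every residue maps to exactly one codon; the function names are hypothetical and the protein alignment itself is taken as given.

```python
def back_translate(aligned_protein, cds):
    """Thread an ungapped CDS onto its gapped protein alignment row.

    Each residue becomes its original codon and each gap becomes '---'.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in aligned_protein if aa != "-")
    if residues != len(codons) or len(cds) % 3 != 0:
        raise ValueError("protein row and CDS do not correspond")
    out, k = [], 0
    for aa in aligned_protein:
        if aa == "-":
            out.append("---")
        else:
            out.append(codons[k])
            k += 1
    return "".join(out)

def aligned_coordinate(aligned_nt, ungapped_pos):
    """Map a 0-based position in the ungapped CDS to its alignment column."""
    seen = -1
    for col, base in enumerate(aligned_nt):
        if base != "-":
            seen += 1
            if seen == ungapped_pos:
                return col
    raise IndexError("position lies beyond the end of the sequence")

if __name__ == "__main__":
    row = back_translate("M-K", "ATGAAA")
    print(row)                         # codon-level alignment row
    print(aligned_coordinate(row, 3))  # column of the 4th nucleotide
```

Working at the codon level keeps the reading frame intact across gaps, which matters when the position of a heptamer has to be compared between sequences.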
the coordinates of the conserved heptameric pattern for each cluster (which were recorded in the previous step) were recalculated to account for gaps introduced during alignment of additional sequences. for the clusters identified in the previous step, we employed an additional filtering procedure to identify those clusters where the heptameric pattern is conserved. for each cluster, the total number of sequences was referred to as n all . the number of sequences where the heptameric pattern was the same as the parent pattern was referred to as n . the number of sequences where the pattern is not the same but is one of the x xx.y yy.z patterns, was referred to as n . finally, it has been previously observed (e.g. in dnax) that the position of the frameshift site may not be perfectly conserved. to account for that possibility, a -nt window starting from the coordinate which is nt upstream of the conserved heptameric coordinate was also screened for the presence of any x xx.y yy.z pattern; the number of sequences in that category was referred to as n . for each cluster, the three values were summed up (i.e. n +n +n = n sum ) and the ratio n sum /n all was calculated. clusters where n sum /n all > . were labelled as 'conserved'. in an ideal situation, the frameshift pattern should be absolutely conserved, but this threshold was relaxed so as to allow for the possibility of sequencing errors or recent mutations in the sequences from a cluster. the end result was a set of clusters (features of these clusters are presented in supplementary tables s -s , and the complete sequence of their genes and other features are available at http://lapti.ucc.ie/heptameric patterns clusters/). the degree of conservation at synonymous sites was calculated as previously described ( ) for a -codon window. 
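The cluster-filtering rule based on the n sum / n all ratio can be sketched as below. Several points are assumptions made for illustration only: sequences are treated as ungapped DNA strings sharing the same coordinate, the window search is restricted to the same reading frame as the recorded site, and the window size, upstream offset and threshold are left as parameters because their values are not recoverable from this text. All function and parameter names are hypothetical.

```python
import re

# Any XXXYYYZ heptamer: two runs of three identical bases, then one base.
SLIPPERY = re.compile(r"([ACGT])\1\1([ACGT])\2\2[ACGT]")

def classify_site(seq, pos, parent, upstream, window):
    """Return 'n1', 'n2', 'n3' or None for one sequence of a cluster.

    n1: the parent pattern is present at the recorded coordinate.
    n2: a different slippery heptamer is present at that coordinate.
    n3: a slippery heptamer occurs in frame within the search window.
    """
    site = seq[pos:pos + 7]
    if site == parent:
        return "n1"
    if SLIPPERY.fullmatch(site):
        return "n2"
    start = max(0, pos - upstream)
    end = min(len(seq), start + window)
    for i in range(start, end - 6):
        # Assumption: only positions in the same frame as the site count.
        if i % 3 == pos % 3 and SLIPPERY.fullmatch(seq[i:i + 7]):
            return "n3"
    return None

def is_conserved(cluster, pos, parent, upstream, window, threshold):
    """True if (n1 + n2 + n3) / n_all exceeds the chosen threshold."""
    n_sum = sum(
        classify_site(s, pos, parent, upstream, window) is not None
        for s in cluster
    )
    return n_sum / len(cluster) > threshold

if __name__ == "__main__":
    cluster = ["GCAAAAAAGGCT", "GCTTTAAAGGCT", "GCAAAGAAGGCT", "GCAGCAAAAAAGGCT"]
    print([classify_site(s, 2, "AAAAAAG", 3, 15) for s in cluster])
    print(is_conserved(cluster, 2, "AAAAAAG", 3, 15, threshold=0.5))
```

A threshold below 1 tolerates a few sequences that have lost the pattern, in line with the allowance for sequencing errors or recent mutations described above.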
the detailed results of this analysis are available online at http:// lapti.ucc.ie/heptameric patterns clusters/ and a summary is included in the rssv column (for reduced variability at synonymous sites) of supplementary tables s and s . however, a statistically significant conservation at synonymous sites can only be observed if there is sufficient sequence divergence in the alignment. to numerically quantify sequence divergence, we also calculated a statistic called aln div which corresponds to an estimate of the mean number of phylogenetically independent nucleotide substitutions per alignment column (see supplementary tables s and s ). two types of potential frameshift stimulators were searched. the first type was an sd-like sequence (either gagg, ggag, agga, ggngg or aggkg, with k = [t,g]) located - nt before the second base of the motif; these sequences and the spacing interval were chosen because all were experimentally proved to be stimulatory ( ) . a segment of nt ending with the first base of the motif was scanned, using a script for perl (version . . from activestate), in all the sequences of each cluster for the above potential sd sequences. a cluster qualified as having a 'conserved sd' if at least % of its sequences had an sd. the second type of stimulator was an rna structure downstream of the motif. a preliminary study of is family members (see supplementary figure s ) led us to choose the following empirical rules: the structure is (i) a simple or branched hairpin of a length ranging from to nt, (ii) that starts - nt after the last base of the motif , (iii) with a g-c(or c-g) base-pair followed by at least three consecutive watson-crick or g-u or u-g base-pairs and (iv) has a g unfold@ • c ≥ . kcal.mol − ; the g @ • c value was determined using the default parameters of the rnafold program from version . . of the vienna rna package ( ) . 
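The search for an upstream SD-like stimulator described above amounts to a small pattern scan. The sketch below is illustrative rather than the authors' Perl script: the segment length and the fraction of sequences required for a 'conserved sd' are parameters, since their values are not recoverable from this text, and the function names are hypothetical. Sequences are assumed to be DNA strings, so K = [T,G] is written with T. The hairpin search with RNAfold is not covered here.

```python
import re

# GAGG, GGAG, AGGA, GGNGG or AGGKG, with N any base and K = T or G.
SD_LIKE = re.compile(r"GAGG|GGAG|AGGA|GG[ACGT]GG|AGG[TG]G")

def has_sd(seq, motif_start, segment_len):
    """True if an SD-like sequence lies in the segment of segment_len nt
    that ends with the first base of the frameshift motif."""
    end = motif_start + 1  # the segment includes the motif's first base
    segment = seq[max(0, end - segment_len):end]
    return SD_LIKE.search(segment) is not None

def conserved_sd(cluster, segment_len, min_fraction):
    """cluster is a list of (sequence, motif_start) pairs."""
    hits = sum(has_sd(seq, start, segment_len) for seq, start in cluster)
    return hits / len(cluster) >= min_fraction

if __name__ == "__main__":
    with_sd = "ATGGGAGCTTAAAAAAGTT"     # GGAG upstream of AAAAAAG
    without_sd = "ATGCCTCCTTAAAAAAGTT"  # no SD-like sequence upstream
    print(has_sd(with_sd, 10, 14), has_sd(without_sd, 10, 14))
    print(conserved_sd([(with_sd, 10), (without_sd, 10)], 14, 0.5))
```

This version only tests for presence within the segment; enforcing the exact spacing interval between the SD-like sequence and the motif would require checking the match position as well.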
we limited our search to hairpin structures and did not explore whether some of the structures could also form pks at this preliminary stage. for each sequence of all clusters, a nt segment starting at the fourth base after the motif was extracted and analysed with a custom perl script. each segment was progressively shortened from its end, one base at a time, down to nt. each set of nested deletions was passed to the rnafold program and the potential structures were filtered to retain those conforming to the rules. for a given cluster, structures were grouped in types according to hairpin size and distance from the motif. the frequency of each type of structure was calculated and only those present in at least % of the sequences of the cluster were retained; of the clusters had such a conserved structure. the g hp .nt − parameter (i.e. the g unfold value divided by the number of nucleotides in the hairpin structure) was calculated. for clusters with several types of structure, the 'best' type was the one with the highest value for the [( g hp .nt − ) x (frequency)] product. a summary of these analyses is reported in supplementary tables s , s and s . for further comparison, iss from the is family were selected because they contain a frameshift motif followed by a known (or likely) stimulatory structure of a size ranging from to nt. the g hp and g hp .nt − parameters were calculated for each structure. selective pressure to maintain a hairpin should likely result in a region more structured than neighbouring regions of the same size, i.e. having a value higher than average for the g.nt − parameter. to assess that, a nt segment starting nt downstream of the frameshift motif was extracted for each of the iss as well as for one typical sequence of each of the clusters with a conserved structure.
for each nt segment, a perl script generated a subset of sequences by moving ( nt at a time) a sliding window of the size of the corresponding conserved hairpin and passed it to rnafold. the average g unfold .nt- ( g av .nt − ) were calculated for each subset as well as the as expected, the iss possessing a stimulatory structure all display a higher than average g.nt − (supplementary table s ). as illustrated in figure , the z zz.n motifs and the x xx.z zz.n motifs were cloned either without flanking stimulators, or with a strong downstream stimulator derived from the is pk ( , ) , or, for the heptamers only, with the moderately efficient combination of upstream and downstream stimulators from is ( , ) . for both types of motifs, a non-shifty derivative was similarly cloned: for that, the first base of each tetramer or the first and fourth bases of each heptamer were mutated. the shifty and nonshifty cassettes were inserted in front of the lacz gene, carried by plasmid pofx , so that translation of full-length ␤-galactosidase occurs only when ribosomes move to the - frame before encountering the frame stop codon ( figure ). the graphs showing the variation of - frameshifting frequency as a function of the sequence of the motif are presented in figure for the z zz.n tetramers (see also supplementary table s ). the motif-containing constructs without pk (save g gg.g, see below) were on the average marginally above their no-motif counterpart ( . ± . % versus . ± . %), which suggests that the motifs are by themselves barely or not at all shifty. addition of the is pk led to substantial increase in frameshifting frequency for motifs. only six were at least four times above background (i.e. above . ± . %), with frequencies ranging from . % to . %. for them the hierarchy was four motifs (u uua, u uug, c cca and c ccg) were . -to . -fold above background. a few oddities were revealed. 
the g_gg.g, c_gg.a and c_gg.g constructs (with and without pk) were found above background probably not as a result of frameshifting but because these sequences, together with the following g, could act as sd sequences and direct low-level initiation on the - frame aua codon present nt downstream (see legend of figure ). the other oddity, c_aa.g, was nearly times above background ( . %) but only when the pk stimulator was present: this was likely due to - frameshifting caused by the high shiftiness of the lysyl-trna uuu ( , ) combined with the high efficiency of the pk stimulator. the results obtained for the x_xx.z_zz.n heptamers and their non-shifty derivatives are presented in figure . ( , , ) .

figure legend: frameshift region cloned in plasmid pofx was the one used in a previous study [see figure in ( )]. it differs slightly from the one used for the heptamer analysis ( figure , panel d). the nucleotides upstream ( nt) and downstream ( nt) of the motif are those found in is . the sequence from the hindiii site to the start of the pk is agcuuccuccazzzngccgc--. the no-stimulator construct was derived by deleting the half of the pk, right after the uga stop codon in the frame, to give the following sequence: agcuuccuccazzzngccgcgacauacuucgcgaa ggccugaacuugaagggcc. the four frameshifting values for each motif correspond to a construct with a motif and the is pk (open circles), a construct without motif and with the is pk (open lozenges), a construct with motif and without stimulator (black inverted triangles) and a construct without motif and stimulator (open triangles). each frameshifting value is the mean of five independent determinations (the ± standard deviation intervals were omitted because they are not bigger than the size of the symbols in most cases). the no-motif constructs were derived by changing each motif to either g_yy.n or c_rr.n.
the ratio between the motif and no-motif frameshifting frequency values was used as a classifier of motif efficiency: motifs displaying a ratio above were categorized as frameshift-prone. among the constructs without stimulator, motifs ( %) met this criterion (ratio from . to ). when the is stimulators were added, motifs ( %) showed a ratio ranging from . to . swapping the moderate is stimulators for the more efficient is pk, increased further the motif to no-motif ratio (from . to ) and raised the number of positive motifs to ( . %); the motifs below the threshold were a aa.c cc.a, g gg.c cc.a and c cc.g gg.c. in spite of divergences as to its timing in the elongation cycle, the general view concerning - frameshifting on slippery heptamers is that it occurs after proper decoding of the zzn -frame codon when the p and a ribosomal sites are occupied by the xxz and zzn codons and their cognate trnas ( , , , , , ) . simple rules emerge when the effect of the zzn codon is considered (figure ) . the uun and aan codons are on the average more frameshiftprone than ccn and ggn. whatever z is, the two homogeneous zzn codons (meaning all purine or all pyrimidine bases) are better shifters than the two corresponding heterogeneous ones. to explain further all the variations in frameshifting frequency, it is also necessary to take into account the nature of the x nucleotide. motifs are by and large more frameshift-prone when the x and z nucleotides are homogeneous, i.e. all purines or all pyrimidines. notable exceptions from the above experimental study, we concluded that a majority of heptamers (and nearly half of the tetramers) were capable of eliciting - frameshifting at substantial levels (at least twice the background level). to determine the range of motifs used in genes utilizing frameshifting for their expression, we carried out an analysis of is mobile genetics elements known, or suspected, to use this mode of translational control. 
we focused on the members of the is and is families available in the isfinder database in october ( ) . as shown in figure , both tetramers and heptamers are found, but with a marked preference for heptamers ( against ). among the five tetramers, the three most shift-prone motifs, a aa.g and u uu.[u,c], predominate and the less efficient a aa.a motif is also well represented. only different heptamers are found with % of them being either a aa.a aa.g or a aa.a aa.a. the next most frequent are a aa.a aa.c and g gg.a aa.c, both of low efficiency, followed by the more efficient g gg.a aa.g, u uu.u uu.c and g gg.a aa.a. to conclude, it appears that genes known or suspected to use prf- to express a biologically important protein do not necessarily utilize high efficiency motifs. however, this conclusion is based on one category of genes where two overlapping genes code for the proteins required for transposition of two types of is elements. there, the purpose of frameshifting is to provide the 'right' amount of a fusion protein which has the transposase function ( , ) ; this amount is what keeps transposition of the is at a level without negative effect on the bacterial host. if the 'right' amount is a low amount, then the use of low-efficiency motifs, with or without flanking stimulators, is a way to achieve this goal as illustrated by the is element ( ) . our objective was to statistically assess the prevalence of each x xxz zzn motif in the genome of various e. coli strains. the rationale was that if a given motif induces by itself frameshifting at a significant level (i.e. at a biologically detrimental level), then it should be counterselected and, therefore, be underrepresented in e. coli genes. (i) generation of a non-redundant nrmeg. in a recent study, ( ) . each gene sequence in the nrmeg was randomized times. 
this gave rise to a set of randomized enterobacterial genomes where each constituent gene sequence encodes the same protein sequence, and has the same codon usage and dinucleotide biases as in the native nrmeg. hence, the frequency of a heptamer's occurrence in these genomes could be used as an estimate of its frequency in the absence of selective pressure due to shift-prone properties of this pattern. (iv) comparison of the observed and expected values of the frequency counts for each pattern. the frequency of occurrences of each of the xxxzzzn patterns in each of the three possible frames (i.e. x xxz zzn, xx xzz zn and xxx zzz n) was determined across the randomized nrmegs. each pattern is represented by two numerical values: the mean and the standard deviation of its frequency distribution count across the randomized nrmegs. to quantify the degree of under- or overrepresentation of each pattern we used a z-score (see materials and methods section). a negative z-score implies that a pattern is underrepresented while a positive z-score is indicative of its overrepresentation (supplementary table s ). under our assumption, we would expect only the 'in-frame' shifty motif (x xxy yyz) and not the 'out-of-frame' shifty motifs (xx xyy yz and xxx yyy z) to be underrepresented. the comparison of each x xxz zzn pattern frequency in the nrmeg with its distribution in the randomized nrmegs is shown in figure . black dots correspond to the number of occurrences of a particular motif in the nrmeg while the associated violin shows the distribution of the same motif across the randomized nrmegs. comparison with the in vivo data of figure shows that some of the sequence patterns, which are characterized by a marked underrepresentation, are also associated with high number of genes table s ).
possible reasons are that these patterns may interfere with gene expression in a frame-independent manner, producing indels in mrna due to transcriptional slippage ( ) or indel mutations at a high rate ( ) . two motifs, the poor frameshifters c cc.g gg.c and c cc.g gg.u, are notably underrepresented and one motif, u uu.c cc.g, is markedly overrepresented. frameshifting is not the only factor that may affect evolution of codon co-occurrence. it is possible that a particular pair of codons is slow to decode or results in ribosome drop-off. such factors result in codon pair bias. the ccc ggy codon pairs were indeed shown to be less frequent than expected, and the uuc ccg pair more frequent ( ) . the violin plots showing the comparison of pattern occurrence in the nrmeg and their distribution in the randomized nrmegs in all three frames are available online at http://lapti.ucc.ie/heptameric patterns clusters/. we anticipated that x xx.z zz.n patterns characterized as shift-prone in our assays would be underrepresented due to selection pressure. therefore, we expected to find a negative correlation between z-scores and observed frameshifting efficiencies (in the absence of a stimulator) for these patterns. surprisingly, no significant anticorrelation was found between the two measures (r = − . , p = . ; figure a ). previously, underrepresentation of one shift-prone pattern (a aa.a aa.g) was found to be more pronounced in highly expressed genes than in lowly expressed genes ( ) . therefore, the distribution of x xx.z zz.n motifs was analysed among genes predicted as highly expressed in e. coli k (heg database; http://genomes.urv.cat/heg-db/). these sequences were similarly randomized ( times instead of because of the smaller size of this data set in comparison to the nrmeg set) and z-scores for each of the patterns were computed. however, the correlation coefficient still remained non-significant albeit only marginally (r = − . , p = . 
; figure b ). the objective was to identify genes likely using prf- on the basis of several criteria: (i) presence of an efficient motif (defined below), (ii) conservation of this motif (or a very similar one) in a given family of homologous genes from selected genomes (see supplementary table s and materials and methods) and even beyond, in orthologous sequences, (iii) sequence conservation around the motif, (iv) presence of potential stimulatory elements flanking the motif and (v) position of the motif in the gene and consequence of frameshifting in terms of protein products [i.e. synthesis of a shorter or of a longer hybrid protein; note that the answer does not provide evidence for or against frameshifting since there are proven prf - cases leading to one or the other outcome ( )]. we selected a subset of x xx.z zz.n patterns with a negative z-score and an in vivo frameshifting efficiency of more than . % in the absence of stimulators listed below. three non-underrepresented patterns (c cc.u uu.c, a aa.g gg.a and a aa.g gg.g) were also considered because they exhibit high-level frameshifting. gene families possibly using these patterns for - prf were identified using the pipeline described in materials and methods. this procedure led to alignments which represented gene families with sequences containing one of the chosen x xx.z zz.n patterns. subsequent filtering on the basis of frameshift site conservation reduced that number to clusters: correspond to mobile genetic elements, are from prophage genes and belong to other gene families (supplementary tables s and s ). the main features of these clusters are summarized in figure . it appears that the size of the gene containing the frameshift signal is very variable, since it can code for a - amino acid protein (supplementary table s , figure a ). in clusters, the frameshift product is shorter than the product of normal translation (supplementary table s , figure b) . 
the degree of conservation of synonymous sites around the frameshift site was also analysed ( ); figure c (http://lapti.ucc.ie/heptameric patterns clusters/). synonymous sites are supposed to evolve neutrally unless there are additional constraints acting at the nucleotide sequence level, for example, pressure to conserve an rna structure. only out of the non-mobile genes display reduced variability at synonymous sites in the vicinity of the frameshift site, whereas is clusters and prophage cluster do show such suppression (rvss/aln div column in supplementary tables s and s ). however, failure to detect statistically significant synonymous site conservation in the other clusters may be due to insufficient sequence divergence (rvss/aln div column in supplementary tables s and s ). among the clusters displaying reduced variability, non-mobile cluster (a aaa aag ), and is clusters (a aaa aag , a aaa aag and a aaa aag ) possess a proven or potential stimulatory structure downstream of the motif. one is cluster with reduced variability (a aaa aac , a proven case of frameshifting) has no established stimulator ( , ) . two is clusters do not display reduced synonymous site variability (a aaa aaa and a aaa aag ) in spite of being proven cases where - frameshifting is stimulated by a stem-loop structure (unpublished data) ( ) . in addition, the region nt upstream of the motif was checked for the presence of a conserved sd-like sequence and the region extending nt downstream of the frameshift site was analysed for the presence of a conserved rna secondary structure; our criterion for a conserved stimulator was the presence of such a structure in at least % of the genes of a cluster. the sd-like sequences to be searched - nt upstream of the motif were those for which a stimulatory effect was experimentally demonstrated (materials and methods) ( ) . 
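The counting and scoring steps described above can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: the function names are hypothetical, the coding sequence is assumed to start at the first base of a codon, the randomized counts are supplied by the caller (the randomization itself is not implemented here), and the use of the sample standard deviation is an assumption.

```python
from collections import Counter
from statistics import mean, stdev
from typing import Sequence

# Frame label keyed by (heptamer start index) % 3, assuming index 0 is the
# first base of a codon. Only "X XXZ ZZN" places XXZ and ZZN in the 0 frame.
FRAME_LABELS = {2: "X XXZ ZZN", 1: "XX XZZ ZN", 0: "XXX ZZZ N"}


def is_xxxzzzn(heptamer: str) -> bool:
    """True for a 7-nt pattern made of three identical bases, three identical bases, any base."""
    return (
        len(heptamer) == 7
        and set(heptamer) <= set("ACGU")
        and heptamer[0] == heptamer[1] == heptamer[2]
        and heptamer[3] == heptamer[4] == heptamer[5]
    )


def count_patterns_by_frame(cds: str) -> Counter:
    """Count every XXXZZZN heptamer in a coding sequence, split by reading frame."""
    seq = cds.upper().replace("T", "U")
    counts: Counter = Counter()
    for i in range(len(seq) - 6):
        heptamer = seq[i : i + 7]
        if is_xxxzzzn(heptamer):
            counts[(heptamer, FRAME_LABELS[i % 3])] += 1
    return counts


def z_score(observed: float, randomized_counts: Sequence[float]) -> float:
    """(observed - mean) / standard deviation of the counts in randomized genomes."""
    sd = stdev(randomized_counts)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("randomized counts have zero variance")
    return (observed - mean(randomized_counts)) / sd


# Toy sequence: codons AUG GCA AAA AAG UAA contain one in-frame A AAA AAG.
toy_counts = count_patterns_by_frame("AUGGCAAAAAAGUAA")
```

A negative value returned by `z_score` corresponds to underrepresentation of the pattern in the native gene set relative to the randomized sets, and a positive value to overrepresentation.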
a conserved sd was found in out of the non-mobile clusters and in out of the is and prophage clusters ( figure c , supplementary tables s and s ). in contrast, a potential stimulatory structure was predicted in a larger proportion of clusters: a conserved hairpin is present in the is clusters, in out of the phage clusters and in out of the non-mobile genes clusters (see materials and methods for the parameters used to define the hairpin structure). nine clusters possess both types of stimulators. to further characterize the predicted structures, we compared them with is family members possessing a frameshift site and an associated stimulatory structure ( , ). to assess structures of different sizes, we used a single parameter, ΔG_hp.nt−1, which is the ΔG_unfold@ °c value of the hairpin divided by the number of nucleotides in the structure. an overall comparison showed that taken together the hairpins of our clusters had a lower ΔG_hp.nt−1 than those from a set of is family members ( . ± . versus . ± . kcal.mol−1.nt−1; supplementary table s and figure s ). for a more refined comparison, is members were selected because they have a hairpin ranging from to nt. the average ΔG.nt−1 (ΔG_av.nt−1) downstream of the frameshift motif was determined as detailed in materials and methods for these iss as well as for our clusters. the difference between ΔG_hp.nt−1 and ΔG_av.nt−1, ΔΔG.nt−1, was calculated and plotted against the size of the structure (figure ). it appeared that all the is hairpins have a positive ΔΔG.nt−1 value (≥ . kcal.mol−1.nt−1) indicating that the hairpin segment is more structured than average, as expected if there is selective pressure for its maintenance ( figure a) . the distribution of ΔΔG.nt−1 values is clearly not the same for our clusters, especially the non-mobile genes clusters ( figure b ): only of them are at or above the . kcal.mol−1.nt−1 threshold value defined by the is set.
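The size-normalized hairpin parameter and its comparison with the local average can be sketched as below. This is a minimal illustration under stated assumptions: the unfolding free energies are taken as inputs (they would come from an RNA folding program, which is not called here), the function names are hypothetical, and the numbers in the example are invented.

```python
from statistics import mean
from typing import Sequence


def dg_per_nt(dg_unfold: float, n_nt: int) -> float:
    """Unfolding free energy (kcal/mol) divided by the number of nucleotides."""
    if n_nt <= 0:
        raise ValueError("structure length must be positive")
    return dg_unfold / n_nt


def delta_dg_per_nt(
    dg_hairpin: float, hairpin_len: int, window_dgs: Sequence[float]
) -> float:
    """Difference between the hairpin value and the sliding-window average.

    `window_dgs` holds the unfolding energies of structures predicted in
    windows of the same size as the hairpin, moved along the region
    downstream of the motif. A positive result means the hairpin segment
    is more structured than the local average.
    """
    dg_hp_nt = dg_per_nt(dg_hairpin, hairpin_len)
    dg_av_nt = dg_per_nt(mean(window_dgs), hairpin_len)
    return dg_hp_nt - dg_av_nt


# Invented values: a 30-nt hairpin of 12 kcal/mol against three windows.
example_ddg = delta_dg_per_nt(12.0, 30, [6.0, 9.0, 3.0])
```

Dividing by the hairpin length is what allows structures of different sizes to be placed on a single axis.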
the remaining clusters, as well as phage clusters and is cluster, appeared to have a local folding level close to or even below average. this suggests that their respective potential hairpins may not have been selected for but are fortuitous, non-biologically relevant, structures.

figure . the x-axis indicates the size in nucleotides of the hairpin structure and the y-axis shows the ΔΔG.nt−1 parameter, which is the difference between the mean ΔG_unfold per nucleotide of the conserved hairpin (ΔG_hp.nt−1, kcal.mol−1.nt−1) and the average ΔG_unfold of structures predicted in a sliding window, of the same size as the corresponding conserved hairpin, moved over a nt segment starting nt after the motif (ΔG_av, kcal.mol−1) (see also supplementary ...).

we determined that in e. coli, the rules of frameshifting on z zz.n tetramers are, in terms of motif hierarchy, a aa.g > u uu.y > c cc.y > a aa.a; the remaining motifs were found barely or not at all frameshift-prone in our conditions. thus, maintenance of a cognate trna-codon interaction after re-pairing of the a-site trna in the - frame is important to ensure efficient frameshifting. notably, the maximal level of frameshifting remained about -fold lower than observed with the best heptamer associated with the same stimulatory element. also in contrast with the heptamers, frameshifting on the z zz.n motifs definitely requires the presence of a strong stimulator ( figure ). relative frameshifting frequencies of x xx.z zz.n heptamers, tested in a eukaryotic system (rabbit reticulocyte lysate) ( ) or in e. coli (figure ) , are displayed in supplementary figure s a ; the motifs were placed upstream of stimulatory elements of similar efficiency, the avian infectious bronchitis virus (ibv) pk for the eukaryotic assay and the is pk for the e. coli series. a third of the motifs were not experimentally tested in the eukaryotic context because they were expected to be inefficient.
nevertheless, major rules of - frameshifting efficiency, as a function of the identity of the x, z and n nucleotides, can be formulated (supplementary figure s b ). it appears that the . the outcome is a slightly larger number of high efficiency motifs in the eukaryotic situation than in the bacterial one ( versus ) , at least in the nucleotide context in which the motifs were tested in both studies. while the nucleotides immediately flanking a given motif can modulate frameshift level ( , , ( ) ( ) ( ) , they probably cannot turn an inefficient motif into a highly efficient one or vice versa, as suggested by one analysis in e. coli [see table in ( )]. that the above rules likely apply to other eukaryotic organisms is supported by studies on yeast and plant viruses ( , ( ) ( ) ( ) . several observations suggest that the e. coli rules are probably valid for many other bacterial species. a survey of the gtrnadb database [( ); http://gtrnadb.ucsc.edu/] in june indicates that out of different bacterial species, possess only one type of lys-trna, with a uuu anticodon, like e. coli ( ) . in these species, covering all the major bacterial phyla, the a aa.r and v vv.a aa.r motifs should be as shift-prone as in e. coli. interestingly, the same two types of motifs are highly prevalent ( . %) among non-redundant is elements from the is and is families present in the isfinder database ( figure ). furthermore, among the species present in the gtrnadb database and in which is family transposable elements are found (isfinder database, october ), contain both types of lys-trna ( uuu and uuc anticodons). iss with an a aa.g or v vv.a aa.g motif are present in of these species. thus, presence of a lys-trna with a uuc anticodon, which should pair perfectly with the aag codon and thus reduce frameshifting ( ) , does not preclude the use of a aa.g or v vv.a aa.g frameshift motifs in is elements from many bacterial species. 
a common feature of tetramer and heptamer motifs is the preferred identity for the z nucleotide, a or u, constituting the first two bases of the zzn codon. this suggests that a weak trna-codon pairing interaction in the a site is a universal prerequisite for high-level - frameshifting ( ) . the major differences concern the identity of the x and n nucleotides. while n euk can be a, c or u, n prok identity is linked to that of z so that zzn prok must be all purines or all pyrimidines to achieve a high frameshifting level. in terms of trna-codon relations, this suggests that the prokaryotic ribosome tolerates a non-cognate interaction after frameshifting (e.g. following a shift from aac to aaa) less readily than its eukaryotic counterpart. one possibility is that the bacterial ribosome still monitors the correctness of the codon-anticodon pairing in the a site even after frameshifting. concerning the ribosomal p site trna, which has to shift from xxz to xxx, the prokaryotic ribosome still displays the same preference for cognate pairing in the new frame when z is u. however, it is more eukaryotic-like when z is a, a feature reflecting the high shiftiness of bacterial lys-trna uuu especially when zzn is aag [figures and ; ( ) ]. the previous paragraph highlighted the most efficient motifs and their properties, as revealed in three particular contexts (is , is and no-stimulator; see figure ). overall, about % of the heptamers are significantly shift-prone, to very different extents, in the absence of stimulators, a finding confirming that the motif, i.e. trna re-pairing, is the primary determinant of - frameshifting. stimulatory elements cannot induce frameshifting by themselves. they likely facilitate trna re-pairing by causing ribosome pausing ( ) ( ) ( ) and by promoting mrna realignment ( ) ( ) ( ) . it is interesting to note that heptamers of low efficiency in e.
coli (figure ) , like a aa.a aa.c and g gg.a aa.c, are nevertheless very likely used for programmed frameshifting by bacterial is elements [ figure ; ( ) ]. study of the distribution of the x xxz zzn heptamers in sequences, selected to constitute our nrmeg, revealed that about % of the motifs were underrepresented to different extents ( figure , supplementary table s ). however, there was no significant anticorrelation between the observed frameshifting efficiency and the underrepresentation of shifty patterns when all the genes are taken into account or when only a subset of genes categorized as highly expressed was considered (figure ). the latter finding was unexpected, because it is believed that the deleterious effect of frameshift-prone patterns, at least in highly expressed genes ( ), should increase with increased frameshifting efficiency and thus augment the pressure for selection against these sequences in protein coding regions. at this point we may only speculate about possible reasons. one reason could be the dependency of frameshifting on the context. such context effects, involving nucleotides located immediately upstream or downstream of some motifs (tetramers and heptamers), were revealed through directed mutagenesis of frameshifting signals of prokaryotic and eukaryotic origin ( , , ( ) ( ) ( ) . our experimental assays were carried out in a limited set of nucleotide context surrounding the patterns and, therefore, our results may not reflect the frameshifting efficiencies of these patterns in all their native contexts. another possibility is that, even for the most efficient motifs placed in the best immediate context, frameshifting frequency remains sufficiently low in the absence of stimulatory elements, so as to have no detrimental effect on bacterial fitness. for the three best v vv.a aa.g heptamers(v = [c,a,g]) this frequency is at around . % (figure ) . 
from another study, we know that frameshifting on these motifs could be increased by about . -fold at most, i.e. going up to . %, by modifying the context ( ) . but this is still much lower if compared to the cumulative effect of background translational errors: missense errors and drop-off have a total estimated frequency of about × − per amino-acid and thus would result in ∼ % of incorrect chains for a amino-acids protein ( ) . previous attempts to find novel recoded genes in bacteria used two different approaches, one based on search of frameshift-prone motifs ( , ) , and the other based on the identification and characterization of disrupted coding sequences ( , , ) . the former led to identification of a few candidate genes only, but the search was restricted to a limited number of motifs and to one organism only, e. coli. the studies using the second approach were more exhaustive since they used all the available sequenced bacterial genomes. consequently, they brought more candidates. a search of genes with disrupted open reading frames (orfs) among genomes initially revealed about candidate genes, % of which could be grouped into clusters ( ) . assuming an average number of protein-coding genes per genome, this gives a frequency of candidates of about . %. sequence comparison showed that clusters contained genes from is mobile genetic elements. interestingly, a substantial proportion of them ( clusters) may use programmed transcriptional realignment rather than translational - frameshifting ( clusters) and in clusters, both types of recoding may operate. the analysis with the gene-tack program of microbial genomes carried out by antonov et al. ( , ) eventually revealed genes, potentially using frameshifting (in the + or - direction) or transcriptional realignment, which were grouped into clusters. is transposable elements genes are found in a minority of clusters, . thus, other categories of genes of various functions predominate. 
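Two back-of-the-envelope calculations used in this part of the discussion can be sketched as follows. Both are illustrations under stated assumptions: the function names are hypothetical, the error model treats every residue as an independent trial with the same error frequency, and all numbers in the example are invented rather than taken from the text.

```python
def fraction_incorrect_chains(error_per_residue: float, n_residues: int) -> float:
    """Fraction of chains carrying at least one error.

    Assumes independent errors with a constant per-residue frequency, so the
    probability of an error-free chain is (1 - e) ** n.
    """
    if not 0.0 <= error_per_residue <= 1.0:
        raise ValueError("error frequency must lie between 0 and 1")
    return 1.0 - (1.0 - error_per_residue) ** n_residues


def candidate_frequency(
    n_candidates: int, n_genomes: int, avg_genes_per_genome: float
) -> float:
    """Candidates divided by the estimated total number of protein-coding genes."""
    total_genes = n_genomes * avg_genes_per_genome
    if total_genes <= 0:
        raise ValueError("total number of genes must be positive")
    return n_candidates / total_genes


# Invented inputs, for illustration only.
incorrect = fraction_incorrect_chains(error_per_residue=1e-3, n_residues=200)
frequency = candidate_frequency(n_candidates=50, n_genomes=10, avg_genes_per_genome=1000)
```

The first function makes the scale of the comparison explicit: even a small per-residue error frequency compounds over the length of a protein, which is why a low level of spontaneous frameshifting on a single motif can be minor by comparison.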
however, if the absolute number of genes is considered, then is elements prevail with a total number of genes. this probably reflects that iss are prone to horizontal transfer and are often present in multiple copies in a genome. assuming again an average number of protein-coding genes per genome, then the overall frequency of candidates found by antonov et al. ( ) among genomes is about . %. one drawback of the disrupted-orf approach is that it fails to detect cases where frameshifting would lead to a protein shorter than the product of normal translation. an example is provided by the dnax gene family: while in e. coli and many bacteria frameshifting leads to a shorter protein, it results in a longer product in a more limited number of bacteria. the genetack analysis detected only the latter cases in genomes [see cof in the genetack prokaryotic frameshift database; ( ) ]. in contrast, approaches primarily based on the search of frameshift motifs allow detection of both types of recoding outcomes, but the motifs are so short (e.g. or nucleotides for - frameshift motifs) that in the absence of proper filtering, too many candidates are found, even if presence of a potentially stimulatory structure downstream of the motif is an added condition. an illustration is provided by a study of the yeast genome: % of the orfs (i.e. out of ) were found to contain one or more 'strong' - frameshift signals ( ); by strong the authors mean that there is an efficient motif (as defined in the previous section) and a downstream structure (the total number of strong candidate frameshift regions was ).
since in % of the cases, - frameshifting leads to rapid premature termination, it was proposed, and then experimentally substantiated at least for a few genes ( ) , that - prf is largely used in yeast for regulatory purposes rather than to generate, as is the case in is elements or viruses, a fusion protein with a new carboxyl-terminal functional domain ( ) . a subsequent study, using a different method for structure prediction and scoring, led to a much less optimistic evaluation: only candidates of the former study were retained ( ) .

one aim of the present study was to identify genes containing selected frameshift motifs within individual genomes, of which are from e. coli strains. the cumulative number of protein-coding genes is and among them contain at least one of the motifs; thus the overall frequency of motif-containing genes is . %. after filtering and enrichment beyond the e. coli species, the outcome was a set of clusters (i.e. a total genes) each being a group of closely related genes where a given motif is conserved. since about genes were tested, the final yield of frameshift candidates was . %. this value is close to that obtained by antonov et al. ( ) , suggesting that both searches achieved a similar stringency. an internal validation of our method was provided by the fact that expected cases (dnax gene, is elements and bacteriophage genes; marked as [true] in supplementary tables s and s and in figure ) were found in the final set. among the remaining clusters, is from is elements, are from prophage genes and are from non-mobile cellular genes of known and unknown functions. in of our clusters, like in e. coli dnax gene, but in contrast with iss from the is and is families and all the programmed frameshift clusters from the genetack database, frameshifting would lead to a product shorter than the protein resulting from normal translation ( figure b ).
whether or not this type of frameshifting affects mrna stability, as proposed for yeast candidates ( , ) , remains to be determined. a majority of our clusters, , contain a conserved potential stimulatory element (as defined in materials and methods): there is an upstream sd in clusters, a downstream hairpin in and both types of stimulators in ( figure c ). however, assessment of the hairpins with the ΔΔG.nt−1 parameter suggested that they may be relevant in only phage cluster and in non-mobile genes clusters (figure ; supplementary table s ) . furthermore, if motif efficiency is taken as an additional constraint, only clusters remain as best candidates for high-level - frameshifting (marked with ** in supplementary table s ). as shown in supplementary table s , there is a limited overlap between our clusters and the clusters of antonov et al. ( , ) , demonstrating that the two approaches are complementary. the present study, in agreement with previous ones ( ) ( ) ( ) ( ) , suggests that recoding in bacteria is mostly found (at least in terms of absolute number of candidate genes) in is transposable elements and in a few bacteriophage genes. however, if numbers of gene clusters are considered, then it appears that non-mobile genes clusters predominate. information about the function of these genes, as found in the ecogene database ( ), is shown in supplementary table s . seventeen are of unknown function, four are predicted to be transcriptional regulators and the rest have different predicted functions.

once candidates have been found, a critical issue, as stressed in a recent review ( ) , is their functional validation. this entails two steps: (i) the demonstration that there is frameshifting (or transcriptional slippage) on the predicted signal and (ii) the determination of the cellular function of the recoding product. full functional analysis has not yet been carried out on the genetack clusters or on the candidate genes reported in this study.
representatives of both studies, of the genetack clusters ( ) and of our best candidates (supplementary table s ), were tested for the first step only. promisingly, of the genetack candidates and both of ours were found capable of eliciting frameshifting at different but substantial levels, i.e. from . % to % [( ), supplementary figure s ]. the challenge is now to carry on with the complete experimental characterization of all candidates to establish which of them indeed use recoding to synthesize alternate proteins biologically pertinent for e. coli and other bacterial species. additional information about clusters is available at http://lapti.ucc.ie/heptameric patterns clusters/.

supplementary data are available at nar online.

the help of claire bertrand and patricia licznar at an early stage of this project, as well as the hospitality of mick chandler and bao ton hoang and the support from agamemnon carpousis are gratefully acknowledged.

key: cord- -qm urt w authors: blank, maximilian f.; chen, sifan; poetz, fabian; schnölzer, martina; voit, renate; grummt, ingrid title: sirt -dependent deacetylation of cdk activates rna polymerase ii transcription date: - - journal: nucleic acids res doi: . /nar/gkx sha: doc_id: cord_uid: qm urt w

sirt is an nad(+)-dependent protein deacetylase that regulates cell growth and proliferation. previous studies have shown that sirt is required for rna polymerase i (pol i) transcription and pre-rrna processing. here, we took a proteomic approach to identify novel molecular targets and characterize the role of sirt in non-nucleolar processes. we show that sirt interacts with numerous proteins involved in transcriptional regulation and rna metabolism, the majority of interactions requiring ongoing transcription. in addition to its role in pol i transcription, we found that sirt also regulates transcription of snornas and mrnas.
mechanistically, sirt promotes the release of p-tefb from the inactive sk snrnp complex and deacetylates cdk , a subunit of the elongation factor p-tefb, which activates transcription by phosphorylating serine within the c-terminal domain (ctd) of pol ii. sirt counteracts gcn -directed acetylation of lysine within the catalytic domain of cdk , deacetylation promoting ctd phosphorylation and transcription elongation. studies over the past decade have shown that sirtuins, members of a phylogenetically conserved protein family that shares homology to the budding yeast silencing factor sir (silent information regulator), affect a broad range of cellular functions encompassing cellular stress resistance, genomic stability, energy metabolism and tumorigenesis. the seven mammalian sirtuins, denoted sirt -sirt , have distinct cellular locations and target multiple substrates. by utilizing nad + as cofactor, sirtuins act either as deacetylases or adp-ribosyltransferases, and have emerged as key metabolic sensors that link environmental signals to metabolic homeostasis and stress response. sirt is enriched in nucleoli, where it promotes cell growth and proliferation by driving rdna transcription and ribosome biogenesis ( ) . sirt expression correlates with cell growth, being high in metabolically active cells, and low or even absent in non-proliferating cells ( ) ( ) ( ) ( ) . in epithelial prostate carcinomas, high sirt levels are associated with aggressive cancer phenotypes, metastatic disease and poor patient prognosis ( ) . high expression of sirt is steadily propelling cells towards an oncogenic status. depletion of sirt or overexpression of a catalytically inactive point mutant leads to decreased cell proliferation, induction of apoptosis and reduced tumor growth ( ) . sirt -knockout mice suffer from increased embryonic lethality, reduced stress resistance, inflammatory cardiomyopathy and premature aging ( ) ( ) ( ) ( ) .
moreover, sirt catalyzes deacetylation of lysine at histone h (h k ac), a biomarker of aggressive tumors. hypoacetylation of h k compromises transcription of genes that are linked to tumor suppression and facilitates dna repair ( , ) . previous work has established that sirt is a key regulator of nucleolar transcription and pre-rrna processing. sirt is enriched in nucleoli and activates rna polymerase i (pol i) transcription by deacetylating paf (polymerase-associated factor ), a core subunit of mammalian pol i ( ) . hypoacetylation of paf enhances pre-rrna synthesis by facilitating the association of pol i with rdna, thereby promoting pol i transcription. additionally, sirt regulates processing of pre-rrna by deacetylating u - k, a core component of the u snornp complex. reversible acetylation modulates the association of u - k protein with u snorna, deacetylation by sirt enhancing the interaction ( ) . upon exposure to cellular stress, sirt is released from nucleoli and accumulates in the nucleoplasm, which leads to hyperacetylation of both paf and u - k and defects in transcription and processing of pre-rrna. these results indicated that sirt controls ribosome biogenesis through a mechanism involving binding to pre-rrna and u snorna as well as nucleolar-nucleoplasmic shuttling in response to stress signaling. the role of sirt in ribosome biogenesis and cell proliferation is also supported by recent proteomic analyses showing that sirt is associated with numerous non-nucleolar target proteins with functions in transcription, ribosome biogenesis and translation ( ) ( ) ( ) . sirt was also found to interact with chromatin remodeling complexes, such as b-wich, norc and swi/snf, which are required for the establishment of a specific chromatin structure. furthermore, sirt was shown to occupy trna genes and to interact with pol iii and tfiiic , suggesting a regulatory role of sirt in pol iii transcription ( , ) .
the present work extends these previous studies, aiming to decipher the molecular mechanisms underlying the role of sirt in transcription activation. we found that a large fraction of the sirt interactome depends on ongoing transcription and/or the presence of rna. the n-terminal part of sirt binds to rna and mediates rna-dependent interactions with sirt target proteins. consistent with sirt function not being restricted to processes related to ribosome biogenesis, we show that sirt is associated with pol ii and regulates transcription of snornas and other pol ii genes. mechanistically, sirt promotes the release of p-tefb from the inactive sk snrnp complex and deacetylates the p-tefb component cdk . deacetylation by sirt activates the kinase activity of cdk , which phosphorylates the c-terminal domain (ctd) of pol ii and facilitates transcription elongation. the results reveal a novel function of sirt outside the nucleolus, reinforcing its role as a key regulator of cellular homeostasis. u os and hek t cells cultured in dulbecco's modified eagle's medium (dmem) supplemented with % fetal calf serum (fcs) were transfected with expression vectors using fugene (life technologies). sirnas against sirt (hsirt on-targetplus smartpool), or nontargeting control sirnas were from dharmacon (thermofisher scientific) and shrnas have been described ( , ) . cells were harvested h (sirnas) or h (shrnas) after reverse transfection with lipofectamine or rnaimax (invitrogen). plasmids encoding hsirt , sh-hsirt , flag-hsirt ( ), flag/ha-sirt and clonal lines that stably express flag/ha-sirt ( ) have been described. sirt truncation mutants were generated by pcr and cloned into pcmv-flag vector. oligonucleotides used for plasmid construction and mutagenesis are listed in supplementary table s . expression vectors encoding ha-cdk and flag-rpb were from addgene. antibodies against ubf, paf , rpa and sirt have been described ( , , ) .
the following commercial antibodies were used: anti-acetyl lysine (cell signaling technology, ), anti-actin (abcam, ab ), anti-cdk (santa cruz, sc- (c )), anti-chd /mi (santa cruz, sc- ), anti-cyclin t (santa cruz, sc- ), anti-ddx (santa cruz, sc- ), anti-flag (sigma, f ), anti-hexim (bethyl laboratories, a - a), anti-hnrnpk/j (santa cruz, sc- ), anti-hnrnpu (santa cruz, sc- ), anti-nucleolin (santa cruz, sc- ), anti-nucleophosmin/b (santa cruz, sc- ), anti-p (abcam, ab ), anti-pol ii (santa cruz, sc- and sc- (n )), anti-pser -pol ii (active motif, ), anti-pser -pol ii (abcam, ab ) and anti-tubulin (sigma, clone b- - - , t ). anti-flag m agarose was from sigma (f ). secondary antibodies were from dianova ( - - and - - ). hek t cells expressing flag/ha-tagged sirt were lysed in buffer am- ( mm kcl, mm tris-hcl [ph . ], mm mgcl , . mm edta, % glycerol, . mm dtt) supplemented with . % np- , protease inhibitors (roche complete) and hdac inhibitors ( nm tsa, mm sodium butyrate, mm nam). sequential immunoprecipitation was first performed for h at °c using protein g sepharose beads coated with anti-flag or mouse igg as control. after elution with buffer am- / . % np- supplemented with flag peptide ( g/ml), proteins were precipitated with anti-ha coupled to protein g sepharose and eluted with buffer am- / . % np- supplemented with ha-peptide ( g/ml). proteins were digested with trypsin overnight at °c and tryptic peptides were analyzed by lc-ms/ms on an ltq-orbitrap xl mass spectrometer (thermo scientific) using a h gradient. the mgf-files generated by xcalibur software (thermo scientific) were used for database searches with the mascot search engine (matrix science, version . ) against the swissprot database (swissprot version ). the peptide mass tolerance was set to ppm and fragment mass tolerance to . da. proteins were considered as identified if more than one unique peptide had an individual ion score exceeding the mascot identity threshold.
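the identification rule stated above (more than one unique peptide with an ion score above the identity threshold) can be sketched in python. this is a minimal illustration under stated assumptions, not the pipeline used in the study: the `PeptideHit` record, its field names and the example protein labels are hypothetical, and the identity threshold is taken as an input value because it is reported by the search engine rather than chosen by the user. the helper `ppm_error` only illustrates how a mass deviation is expressed in parts per million.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple, Set


class PeptideHit(NamedTuple):
    """One peptide-spectrum match (hypothetical record layout)."""
    protein: str               # protein accession or name
    sequence: str              # peptide sequence
    ion_score: float           # individual ion score of the match
    identity_threshold: float  # threshold reported for this match


def identified_proteins(hits: Iterable[PeptideHit]) -> Set[str]:
    """Return proteins supported by more than one unique peptide
    whose ion score exceeds the identity threshold."""
    passing = defaultdict(set)
    for hit in hits:
        if hit.ion_score > hit.identity_threshold:
            # A set keeps each peptide sequence only once per protein.
            passing[hit.protein].add(hit.sequence)
    return {prot for prot, seqs in passing.items() if len(seqs) > 1}


def ppm_error(observed_mz: float, theoretical_mz: float) -> float:
    """Mass deviation in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6
```

a protein matched twice by the same peptide sequence counts as one unique peptide and is therefore not reported as identified.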
go analysis was performed using david bioinformatics resources ( , ) . assessment of changes in the proteome was based on the relation of unique peptides in the treated sample and the control sample after subtraction of igg values; a ratio below . was considered as decreased interaction. cellular rna was isolated with trizol reagent (invitrogen), transcribed into cdna using random primers (roche), and analyzed by real-time pcr (roche, lightcycler ). radiolabeled rna was generated by in vitro transcription using the megascript t transcription kit (ambion) and templates generated by pcr with gene-specific primers fused to the t promoter. primers are listed in supplementary table s . clip assays were performed essentially as described ( ) . briefly, nuclear lysates from uv-irradiated hek t cells ( nm, . jcm − ) expressing flag-tagged sirt were sonicated, precleared with protein g sepharose, and protein-rna complexes were immunoprecipitated using anti-flag m beads (sigma) or protein g sepharose (ge healthcare) coated with mouse iggs as control. beads were sequentially washed in ip buffer ( mm tris-hcl [ph . ], mm nacl, mm edta, mm egta, . % sds, % np- , . % sodium deoxycholate, protease inhibitors) and in ip buffer containing mm kcl. after elution with the flag peptide ( g/ml) and proteinase k digestion ( min, °c), rna was isolated, incubated with dnase i (sigma) and subjected to rt-qpcr. for pulldown experiments, flag-tagged proteins ( g) were immobilized on m agarose and incubated with radiolabeled rna for h at room temperature in buffer am- supplemented with . % triton x- , . u/ml rnasin and protease inhibitors. after stringent washing, captured rna was extracted, subjected to gel electrophoresis and visualized by phosphorimaging. alternatively, g of biotinylated ets-rna (+ /+ ) were incubated with l of streptavidin-coated magnetic beads (thermo fisher scientific) for min at room temperature in mm tris-hcl [ph . ], .
mm edta, mm nacl, . % tween- . after washing, pmol of bead-bound rna were incubated with g of nuclear extract from hek t cells expressing flag-tagged sirt or with pmol of purified gst-sirt / - in buffer am- supplemented with . % triton x- , . u/ml rnasin (promega) and protease inhibitors (roche) for h at • c. after washing with mm tris-hcl [ph . ], mm nacl, mm edta, . % tween- and protease inhibitors, bound proteins were analyzed on immunoblots. cleared cell lysates were incubated for h at • c with the respective antibodies and immunocomplexes were bound to protein g-sepharose. after washing with buffer containing mm kcl, . % np- , protease and hdac inhibitors, proteins were eluted with the corresponding epitope peptide or with sds sample buffer and visualized on western blots. in all experiments, controls with unspecific iggs or no antibody were carried out in parallel. chip assays were performed as described ( , ) . briefly, nuclei were fixed with % formaldehyde ( min, rt), quenched with . m glycine and lysed in mm tris-hcl [ph . ], mm edta and % sds. chromatin was sonicated to an average fragment length of - bp, diluted with volumes of ip-buffer ( . mm tris-hcl [ph . ], mm nacl, . mm edta, . % sds, . % triton x- ), pre-cleared on protein a/g sepharose in the presence of mg/ml sonicated escherichia coli dna, and incubated with - g antibodies overnight at • c. protein-dna complexes were captured on protein a/g-sepharose for h, washed twice with low salt buffer ( mm nacl, mm tris-hcl [ph . ], mm mgcl , % triton x- ), followed by washes with high salt buffer containing mm nacl, with licl buffer ( mm licl, mm tris-hcl [ph . ], mm edta, . % na-deoxycholate, . % triton x- ) and with te buffer. after elution, reversal of the cross-links ( • c, h) and digestion with proteinase k, dna was purified and quantified by qpcr using gene-specific primers. 
the ratio of dna in the immunoprecipitates (upon subtraction of the igg background) versus dna in the input chromatin was calculated and normalized to control reactions from mock-transfected cells. to monitor deacetylation, ha-cdk was immunopurified from hek t cells co-expressing ha-cdk and flag-gcn . flag-sirt was isolated from hek t or insect cells by m -immunopurification and flag-peptide elution. flag-sirt was incubated with bead-bound ha-cdk for h at °c in mm tris-hcl [ph . ], mm mgcl , % glycerol, m tsa, . mm dtt and mm nad + , and acetylation was detected on immunoblots using anti-acetyl-lysine antibodies. the p-tefb release assay was performed as previously described with some modifications ( , ) . g of anti-hexim antibody bound to dynal magnetic beads (life technologies) were incubated with cell lysates ( °c, h). after washing, bead-bound p-tefb/ sk snrnp complexes were incubated with purified flag-sirt /wt or flag-sirt /h y in the presence or absence of mm nad + for h on ice. after sequestration of the beads with a magnetic separator, the supernatants and bead-bound proteins were analyzed on western blots. immunopurified flag-rpb or gst-ctd (sigma) was incubated for h at °c in kinase buffer ( mm kcl, mm hepes [ph . ], mm mgcl , mm edta, mm dtt, m atp, ci γ-[ p]-atp) with bead-bound ha-cdk . phosphorylation was visualized by phosphorimaging or on western blots with anti-phospho-ser antibodies. indirect immunofluorescence and direct gfp-fluorescence analysis was done as described before ( ) . images were visualized at a zeiss axiophot microscope using a × . oil immersion plan-neofluor objective and processed with nis-elements br . and imagej software. data are reported as mean values from at least three biological replicates with error bars denoting standard deviations (sds). the two groups were compared using a paired two-tailed student's t-test. the significance level was set at p values *p < . , **p < . .
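the paired two-tailed student's t-test named above can be sketched with the python standard library alone. this is an illustrative sketch, not the software used in the study: the function names are hypothetical, the p-value is obtained by numerically integrating the t density with simpson's rule (a statistics package would normally be used instead), and the significance cut-offs passed to `stars` are caller-supplied placeholders because the values are not recoverable from the text.

```python
import math
from statistics import mean, stdev


def t_pdf(x: float, df: int) -> float:
    """Probability density of Student's t distribution."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def two_tailed_p(t: float, df: int, steps: int = 10000) -> float:
    """Two-tailed p-value: 1 - 2 * integral of the density from 0 to |t|.
    The integral uses Simpson's rule; `steps` must be even."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


def paired_t_test(a, b):
    """Paired t-test on matched measurements; returns (t, df, p)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    df = len(diffs) - 1
    return t, df, two_tailed_p(t, df)


def stars(p: float, thresholds) -> str:
    """Map a p-value to asterisks; `thresholds` are hypothetical cut-offs,
    listed from least to most stringent."""
    n = sum(p < th for th in thresholds)
    return "*" * n if n else "n.s."
```

pairing matters here: the test operates on the per-replicate differences, so replicate-to-replicate variation shared by both conditions does not inflate the error term.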
quantification of western blots and radioactive signals was performed using image gauge and imagej software. previous studies have revealed that the protein content of the nucleolus is dynamic, showing decrease, no changes, or accumulation of specific proteins after inhibition of transcription, viral infection or dna damage ( ) ( ) ( ) ( ) ( ) . to functionally characterize sirt -interacting proteins, we purified flag/ha-tagged sirt from hek t cells and analyzed associated proteins by mass spectrometry (supplementary figure s a ). using stringent inclusion criteria, proteins (≥ unique peptides) were identified, the majority residing in the nuclear and nucleolar compartment (supplementary table s ). consistent with previous studies ( ) ( ) ( ) , the sirt interactome was enriched in nucleolar proteins, such as nucleolin, nucleophosmin, mybbp a, dhx , ddx and ribosomal proteins. in addition to previously reported sirt interacting proteins, a large part of sirt -associated proteins identified by our mass spectrometry analysis was novel (supplementary figure s b and table s ). classification of proteins revealed prominent groups with functions in transcription, translation, rna maturation, chromatin organization, dna repair, and intracellular transport ( figure a ). the largest fraction comprised proteins with roles in rna metabolism and ribosome biogenesis, such as pre-ribosomal factors (prfs) and small nucleolar ribonucleoprotein particles (snornps) involved in pre-rrna folding, processing and posttranscriptional modifications. proteins with functions in pol ii transcription, such as rpb and cyclin-dependent kinase (cdk ) and mrna processing factors were also identified, suggesting that sirt is involved in both transcriptional and post-transcriptional processes (supplementary table s ). previous studies have revealed that the interaction of sirt with paf and u - k depends on rna ( , ) . 
to examine whether rna is also involved in other sirt protein interactions, we compared sirt -associated proteins isolated from untreated and rnase a-treated samples (supplementary figures s a and s c). after rnase a treatment % of sirt -associated proteins were completely or partially lost, indicating that rna mediates or stabilizes the association of sirt with a subset of proteins ( figure b, supplementary figure s d and table s ). according to go annotation, the majority of rnase a-sensitive interacting proteins serve a role in translation and rna processing (supplementary figure s e) . studies on the kinetics of nucleolar proteins under cellular stress have shown that the nucleolus is a dynamic structure. transcription inhibition by actinomycin d (amd) showed decrease, no changes, or accumulation of individual factors in nucleoli ( ) . these fine-tuned changes in the inventory of the nucleolus are thought to reflect the partition of proteins between the nucleolus and the nucleoplasm according to the physiological state of the cell. to analyze the impact of ongoing transcription on the sirt interactome, we identified sirt -associated proteins from untreated or amd-treated cells (supplementary figures s a and s f). while % of the interactions were sensitive to rnase a treatment, transcription inhibition by amd resulted in loss or reduction of % of sirt -associated proteins ( figure c , d and supplementary table s ). go analysis revealed that certain subgroups of interacting proteins were more affected than others. for example, proteins involved in nucleocytoplasmic transport or rna processing showed strongly reduced binding, whereas the interaction with factors implicated in chromatin regulation was only marginally affected. similar to rnase a treatment, binding of sirt to proteins comprising rna recognition motifs was impaired in amd-treated cells.
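the classification of interactions as lost, decreased or retained follows the criterion given in the methods: unique peptide counts in the treated sample are related to those in the control sample after subtraction of igg values. a minimal python sketch of that bookkeeping is shown here. it is an assumption-laden illustration, not the authors' script: the function names and the tuple layout are hypothetical, separate igg counts per condition are assumed, and the ratio cut-off is a required argument because its value is not recoverable from the text.

```python
from typing import Dict, Optional, Tuple

# (treated, control, igg_treated, igg_control) unique peptide counts;
# this layout is a hypothetical convention for the example.
Counts = Tuple[int, int, int, int]


def interaction_ratio(treated: int, control: int,
                      igg_treated: int = 0, igg_control: int = 0) -> Optional[float]:
    """Treated/control ratio of IgG-subtracted unique peptide counts.
    Returns None when nothing remains in the control after subtraction."""
    t = max(treated - igg_treated, 0)
    c = max(control - igg_control, 0)
    if c == 0:
        return None
    return t / c


def classify(ratio: Optional[float], cutoff: float) -> str:
    """Label one interaction given a caller-supplied ratio cut-off."""
    if ratio is None:
        return "not detected in control"
    if ratio == 0:
        return "lost"
    return "decreased" if ratio < cutoff else "retained"


def fraction_affected(counts: Dict[str, Counts], cutoff: float) -> float:
    """Fraction of control-detected interactors that are lost or decreased."""
    labels = [classify(interaction_ratio(*c), cutoff) for c in counts.values()]
    detected = [lab for lab in labels if lab != "not detected in control"]
    if not detected:
        return 0.0
    return sum(lab in ("lost", "decreased") for lab in detected) / len(detected)
```

proteins that disappear from the control after igg subtraction are excluded from the denominator, so the reported fraction refers only to interactors that were specifically detected before treatment.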
validation of amd-sensitive and -insensitive interactions by coimmunoprecipitation confirmed that some proteins, such as ubf or p , remained bound to sirt . however, binding of a substantial fraction of sirt -associated proteins, including pol i, nucleolin and hnrnpk, was markedly decreased after amd treatment ( figure d ), indicating that the majority of the sirt interactome depends on ongoing transcription. to examine which region of sirt mediates rna-dependent interactions, we performed pull-down experiments using n-terminally truncated sirt mutants. sirt comprises an unstructured arginine- and dipeptide-rich region within the n-terminal amino acids, a motif that is often present in rna binding proteins ( ) (supplementary figure s a ). to investigate whether this region is important for rna-dependent protein interactions, external spacer ( -ets) rna was immobilized on streptavidin beads and incubated with lysates of cells expressing flag-tagged wildtype sirt or mutants lacking ( n ) or n-terminal amino acids ( n ). both wildtype sirt and the n mutant were efficiently pulled down by immobilized rna. deletion of amino acids, however, abolished rna binding (figure a , upper panels). furthermore, a gst-fusion protein comprising amino acids - (gst-sirt / - ) efficiently interacted with rna, supporting that the arginine-rich n-terminal part of sirt mediates the interaction of sirt with rna (figure a bottom panel and supplementary figure s b) . in a complementary approach, we monitored binding of radiolabeled rna to immobilized wildtype and mutant sirt . again, deletion of the n-terminal amino acids markedly reduced rna binding ( figure b and supplementary figure s c ). in accord with previous rna immunoprecipitation (clip) experiments showing that sirt not only interacts with pre-rrna but also with snornas ( ), deletion of the n-terminal part of sirt impaired the interaction of sirt with pre-rrna, u , u , u and c snornas ( figure c ).
moreover, sirt / n was distributed throughout the nucleus, indicating that the n-terminal region mediates nucleolar enrichment of sirt ( figure d ). to examine whether deletion of the n-terminal part of sirt would also affect the interaction with proteins whose binding was compromised after amd or rnase a treatment, we compared the association of selected proteins with wildtype and n-terminally truncated sirt . binding of rnase- and amd-sensitive proteins, e.g. nucleolin, hnrnps, pol i and nucleophosmin, was abolished in mutant sirt / n ( figure e ). ubf or p , however, which bind to sirt in an rna-independent manner, interacted with both wildtype and mutant sirt , indicating that the n-terminal part mediates rna-dependent protein interactions, whereas the central and c-terminal part of sirt mediate rna-independent interactions. mass spectrometry analysis of sirt -associated proteins has identified several pol ii subunits, suggesting that sirt might serve a role in pol ii transcription (supplementary table s ). both unphosphorylated pol ii and pol ii phosphorylated at ser and ser within the c-terminal domain (ctd) of the large subunit rpb were present in sirt immunoprecipitates, demonstrating that sirt interacts with transcribing pol ii ( figure a and supplementary figures s a and s b) . we also identified cyclin t and cdk , constituting the positive transcription elongation factor b (p-tefb), and hexim , an inhibitor of p-tefb activity ( figure a , b and supplementary figure s a) . the canonical function of p-tefb is to phosphorylate the ctd at ser , which marks the elongation complex ( , ) . the association of sirt with cdk and cyclin t was sensitive to rnase a treatment, indicating that binding of figure a and b). clip-seq data ( ) and clip-qpcr analyses have shown that sirt is not only associated with pre-rrna and snornas, but also with sk snrna ( figure b ).
binding of the sk snrna/hexim complex to cyclin t /cdk is known to sequester p-tefb in a large inactive ribonucleoprotein complex which is released after dissociation of sk snrna/hexim ( , ) . the finding that sirt is associated with components of the inactive p-tefb complex as well as with elongating pol ii prompted us to investigate whether sirt regulates p-tefb activity. previous studies have shown that cdk activity is inhibited by gcn -mediated acetylation ( ) ( ) ( ) . this observation suggested that deacetylation by sirt activates cdk , thereby promoting the transition into the elongation phase of transcription. to test this, we monitored cdk acetylation levels on immunoblots using an antibody that recognizes acetylated lysine residues. while cdk acetylation was hardly detectable in untransfected cells, a strong signal was observed after overexpression of gcn ( figure c ). gcn -mediated acetylation was further enhanced if cells were treated with nicotinamide (nam), a competitive inhibitor of nad + -dependent deacetylases, indicating that a member of the sirtuin family counteracts acetylation by gcn . if cdk is deacetylated by sirt , cdk should be hyperacetylated after knockdown of sirt . indeed, a marked increase in cdk acetylation was observed after depletion of sirt ( figure d ). to monitor deacetylation in vitro, ha-cdk was co-expressed with flag-gcn , incubated with recombinant sirt , and acetylation was monitored on immunoblots. acetylation was significantly reduced upon incubation with wildtype sirt in the presence of nad + , confirming that sirt is the enzyme that deacetylates cdk ( figure e showing that the n-terminal region of sirt mediates rna binding in vivo. uv-crosslinked flag-sirt -rna complexes were captured on anti-flag beads, and co-precipitated rna was analyzed by rt-qpcr. the percentage of precipitated rna relative to input rna is shown. error bars denote means ±sd (n = ) ( * p < . , ** p < . , n.s.: not significant).
(d) the n-terminal part is required for nucleolar localization of sirt . direct fluorescence showing the cellular localization of gfp-tagged sirt , sirt / n and sirt / n . indirect immunofluorescence and direct gfp fluorescence analysis was done as described ( ) . nucleoli were stained with anti-ubf antibodies. scale bar, m. (e) the n-terminal region of sirt mediates protein interactions. flag-sirt or mutant n were immunoprecipitated and co-precipitated proteins were visualized on immunoblots. not interact with sirt , did not deacetylate cdk in vitro, underscoring the importance of the n-terminus for sirt function (supplementary figures s c and s d) . cdk -dependent phosphorylation of the pol ii ctd is required for efficient promoter clearance and transcriptional processivity ( , ) . given that sirt counteracts gcn -directed acetylation of cdk , we reasoned that deacetylation by sirt should activate cdk and increase ctd phosphorylation. to test this, we performed in vitro assays monitoring cdk -dependent phosphorylation of rpb , the large subunit of pol ii. consistent with previous studies ( , ) , ctd phosphorylation was compromised if cdk was hyperacetylated, i.e. if cdk was isolated from cells co-expressing gcn and treated with the sirtuin inhibitor nam ( figure f and supplementary figures s e and s f ), underscoring that acetylation inhibits the kinase activity of cdk . previous studies have shown that acetylation of lysine (k ) compromises cdk function ( , ) . to assay the impact of k acetylation on cdk activity, we compared wildtype cdk with mutants in which k has been replaced by glutamine or arginine, thus mimicking the acetylated or deacetylated state of cdk . in accord with k acetylation regulating cdk activity, phosphorylation of rpb by the acetylation-defective mutant cdk /k r was higher than wildtype cdk , while the acetylation-mimicking mutant cdk /k q was enzymatically inactive ( figure g and supplementary figure s g ).
furthermore, phosphorylation of rpb was markedly impaired if cdk was purified from sirt -depleted cells, supporting that hypoacetylation is required for cdk activity ( figure h and supplementary figure s h ). the inverse correlation of acetylation and kinase activity of cdk reinforces the relevance of sirt -dependent deacetylation of k for ctd phosphorylation. to examine whether sirt also activates p-tefb by facilitating the release of cdk /cyclin t from the inhibitory hexim / sk ribonucleoprotein complex, we incubated bead-bound p-tefb/ sk snrnp complexes with flag-sirt and monitored the release of p-tefb on immunoblots. as shown in figure i , wildtype sirt but not the catalytically inactive point mutant promoted release of p-tefb from the sk snrnp complex in an nad + -dependent manner (see also supplementary figure s i ). together these results indicate that sirt mediates ctd-ser phosphorylation and transcription elongation both by activation of cdk and releasing p-tefb from the sk snrnp complex. the finding that sirt activates the kinase activity of cdk suggested that sirt function is not restricted to the nucleolus but may also serve a regulatory role in pol ii transcription. in accord with this hypothesis, previous clip-seq analysis has shown that sirt is associated with numerous snornas ( ) . to test whether sirt affects pol ii-dependent transcription of snornas, we overexpressed flag-sirt in hek t cells and monitored rna levels by rt-qpcr. levels of u , u , u and u snornas, all of which are involved in rrna maturation, were elevated in cells overexpressing sirt . overexpression of the catalytically inactive mutant sirt /h y did not affect snorna levels, underscoring that the catalytic activity of sirt is required for upregulation of snornas ( figure a and supplementary figure s b) .
to examine whether sirt -mediated upregulation of transcription is restricted to snornas or whether sirt also affects transcription of mrnas, we analyzed the level of several pre-mrnas upon overexpression of sirt . for this, genes were chosen that have been shown to be occupied by sirt in the promoter-proximal region ( ) . we found that numerous pre-mrnas, including brf , pex , jmjd or psmc , are upregulated in cells overexpressing wildtype but not mutant sirt ( figure a and supplementary figures s b and s e) . interestingly, the level of u snrna was not affected by sirt overexpression. this result is consistent with previous studies showing that sirt occupies genes that encode snorna but not snrna ( ) , emphasizing that sirt stimulates transcription of a subset of genes transcribed by pol ii. this conclusion is further supported by loss-of-function experiments demonstrating that the level of snornas and selected pre-mrnas, but not u rna, was decreased in sirt -depleted cells ( figure b and supplementary figure s b, right) . significantly, decreased transcription of u and u snorna in sirt -knockout cells (sirt −/− ) was rescued by overexpression of flag-sirt , but not by flag-sirt /h y or sirt / n , demonstrating that both the catalytic activity and rna binding of sirt are required for activation of pol ii transcription ( figure c and supplementary figure s c ). previous results have established that sirt activates transcription of rrna genes by enhancing pol i occupancy at rdna ( , ) . we therefore reasoned that overexpression of sirt may increase binding of pol ii to genes that are regulated by sirt . indeed, pol ii occupancy at selected pol ii-dependent genes was enhanced after overexpression of wildtype sirt but not mutant sirt /h y ( figure d ). 
conversely, knockdown of sirt decreased pol ii occupancy at sirt -responsive genes to a similar level as pol i at the rdna locus, while binding of pol ii to actin and u snrna genes, which are not occupied by sirt , was not altered ( figure e and supplementary figures s a and s d ). chip assays with antibodies against pol ii-pser and pol ii-pser revealed that knockdown of sirt affected the occupancy of both initiating and elongating pol ii, supporting that hyperacetylation of cdk prevents ctd phosphorylation and impairs transcription elongation ( figure h and supplementary figure s f ). this view is substantiated by chip assays showing increased occupancy of both sirt and cdk at the promoter-proximal region of target genes ( figure f) . significantly, the association of cdk with these genes was abolished upon knockdown of sirt , confirming that downregulation of sirt correlates with abrogation of cdk binding ( figure g and supplementary figure s f ). to further strengthen the functional link between sirt and cdk activity, we determined pol ii occupancy at target genes in sirt -depleted cells that overexpress cdk /k r or cdk /k q. while decreased pol ii occupancy in sirt knockdown cells could be relieved by overexpression of figure s g) . taken together, these results uncover the mechanism underlying sirt -dependent activation of pol ii, demonstrating that sirt enhances transcription elongation by deacetylation of lysine of cdk , which is required for ctd phosphorylation and transcription activation. although sirt has emerged as a critical regulator of metabolic health and stress response, which promotes survival in times of adversity, it is the least understood member of the human sirtuin family. this is to a large extent due to its low enzymatic activity in vitro and the few molecular targets identified so far.
global proteomic studies have identified numerous sirt -associated proteins, most of them serving functions in transcription, ribosome biogenesis and translation ( ) ( ) ( ) . we have previously shown that sirt is released from nucleoli in response to transcriptional, metabolic or osmotic stress, leading to hyperacetylation of paf and the u - k protein, hyperacetylation compromising rdna transcription and pre-rrna processing ( , ) . given the vital role of sirt in cellular homeostasis, it is not surprising that sirt function is not restricted to pre-rrna synthesis and processing. our mass spectrometric analyses revealed many nuclear sirt -associated proteins with functions in rna metabolism, chromatin structure, nucleocytoplasmic transport and pol ii transcription, emphasizing that sirt serves important roles outside the nucleolus. previous studies have shown that sirt deacetylates h k in a gene-specific context, and selective hypoacetylation of h k ac is necessary for essential features of cancer cells, including anchorage-independent growth and escape from contact inhibition ( ) . sirt was also found to interact with proteins that are associated with the pol ii and the pol iii transcription machinery ( , ) . consistent with sirt interacting with tfiiic, knockdown of sirt led to decreased levels of trnas, indicating that the regulatory impact of sirt is not restricted to pol i transcription but affects transcription by all three classes of nuclear rna polymerases. although sirt does not harbor a classical rna binding domain, the n-terminal region mediates the association of sirt with rna, and binding to rna is required for the interaction of sirt with numerous proteins. other proteins, however, such as ubf and p , associate with sirt in an rna-independent fashion, presumably via the c-terminal domain. 
these results are supported by previous studies showing that n-and c-terminal regions flanking the catalytic domain enhance the activity of sirt ( ) and mediate interactions of sirt with mybbp a, which inhibits the deacetylase activity of sirt ( , ) . among the proteins detected in the sirt proteome was rpb , suggesting that sirt interacts with pol ii. the c-terminal domain (ctd) of rpb is modified by reversible phosphorylation, acetylation and methylation, all of which are implicated in pol ii recruitment and transcription ( , , ) . notably, both the proteomic data and the results of co-immunoprecipitation experiments revealed that sirt is also associated with the positive transcription elongation factor p-tefb comprising cdk and cyclin t . moreover, sirt was found to be associated with sk rna ( ) and with the sk-associated protein hexim . together with the observation that sirt interacts with elongating pol ii and co-localizes with pol ii phosphorylated at ctd-ser at sirt target genes, these results suggested that reversible acetylation may regulate cdk kinase activity and transcription elongation. cdk is acetylated at two lysine residues, k and k ( ) ( ) ( ) . gcn -mediated in vivo acetylation preferentially targets k which is essential for atp binding and cdk activity. here we show that deacetylation by sirt augments the kinase activity of cdk which is required for ctd phosphorylation and efficient pol ii transcription. acetylation of cdk has no effect on the interaction of p-tefb with hexim or sk snrna ( , ) , indicating that this modification regulates p-tefb activity independently from the inhibitor function of hexim . however, we found that sirt -mediated deacetylation also promotes dissociation of cdk /cyclin t from the inactive p-tefb/ sk snrnp complex, suggesting that sirt may target another component of the complex to release cdk /cyclin t . 
expression of sirt is known to propel cells towards tumorigenesis and to promote the invasiveness and metastasis of cancer cells ( , ) . likewise, aberrant expression of cdk and cyclin t /t has been observed in various tumors ( ) , demonstrating that regulation of p-tefb activity is crucial for controlled gene expression. based on the available data we propose the following model of p-tefb-dependent transcription activation ( figure ). in normal cells, a significant amount of p-tefb is sequestered in a large inactive complex containing sk/hexim , activation requiring release from the sk ribonucleoprotein complex. given that sirt serves a vital role in the cellular stress response ( , ) and sirt expression is linked to cell proliferation and oncogenic activity, it is not surprising that sirt is connected to an emerging network of extracellular signals that control transcription of all three classes of nuclear rna polymerases. unraveling this network will bring important clues to the pathways that regulate gene expression in response to cell cycle progression or extracellular signaling events.
[...] ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figure s b. (b) snorna and pre-mrna levels are decreased in sirt -deficient cells. hek t cells were transfected with non-targeting sirna or sirt -specific sirna. rna levels were measured by rt-qpcr and normalized to actin mrna. bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figure s b . (c) ectopic sirt rescues downregulation of u and u snorna in sirt -knockout cells. hek t/sirt −/− cells were transfected with flag-tagged sirt /wt, sirt /h y or sirt / n and snorna levels were monitored by rt-qpcr. the bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figure s c . (d) overexpression of sirt increases the association of the transcription machinery with target genes. chip-qpcr monitoring pol i and pol ii occupancy at selected target genes after overexpression of flag-sirt (wt) or mutant sirt /h y. antibodies against rpa (pol i) and rpb (pol ii, n ) were used for chip. bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figure s a . (e) chip-qpcr monitoring occupancy of pol i (anti-rpa ) and pol ii (anti-rpb , n ) at selected target genes after shrna- (left panel) or sirna- (right panels) mediated depletion of sirt . cells transfected with non-targeting shrna/sirna served as control. the bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figure s a and s d. (f) sirt and cdk occupancy is enriched at the promoter of target genes. chips monitoring occupancy of endogenous cdk or stably expressed flag-ha-sirt at different regions of brf and jmjd using antibodies against cdk or the flag epitope. the bars represent mean values ± sd (n = ). the scheme depicts the position of the amplified regions. see also supplementary figure s e . (g) knockdown of sirt impairs cdk occupancy at target genes. chip-qpcr monitoring occupancy of endogenous cdk at target genes upon sirna-mediated knockdown of sirt , cells transfected with non-targeting sirna serving as control. bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figures s a and s f. (h) knockdown of sirt abolishes pol ii occupancy at selected target genes. chips showing occupancy of pol ii phosphorylated at ctd-ser and ctd-ser in cells transfected with sirt -specific sirna or with non-targeting sirna using antibodies against pser -ctd and pser -ctd. bars represent means ± sd (n = ) ( * p < . , ** p < . ). see also supplementary figures s a and s f .
model illustrating the role of sirt in pol ii transcription. sirt -mediated deacetylation promotes the release of cdk /cyclin t from the inactive p-tefb/ sk snrnp complex and activates p-tefb by deacetylation of cdk , which leads to increased ctd-ser phosphorylation and transcriptional processivity.
mammalian sir homolog sirt is an activator of rna polymerase i transcription
evolutionarily conserved and nonconserved cellular localizations and functions of human sirt proteins
sirtuin in cell proliferation, stress and disease: rise of the seventh sirtuin!
involvement of sirt in resumption of rdna transcription at the exit from mitosis
sirt inactivation reverses metastatic phenotypes in epithelial and mesenchymal tumors
sirtuin oncogenic potential in human hepatocellular carcinoma and its regulation by the tumor suppressors mir- a- p and mir- b
sirt -dependent inhibition of cell growth and proliferation might be instrumental to mediate tissue integrity during aging
sirt increases stress resistance of cardiomyocytes and prevents apoptosis and inflammatory cardiomyopathy in mice
a sirt -dependent acetylation switch of gabpβ controls mitochondrial function
sirt promotes genomic integrity and modulates non-homologous end joining dna repair
sirt links h k deacetylation to maintenance of oncogenic transformation
repression of rna polymerase i upon stress is caused by inhibition of rna-dependent deacetylation of paf by sirt
sirt -dependent deacetylation of the u - k protein controls pre-rrna processing
functional proteomics establishes the interaction of sirt with chromatin remodeling complexes and expands its role in regulation of rna polymerase i transcription
comparative interactomes of sirt and sirt : implication of functional links to aging
sirt plays a role in ribosome biogenesis and protein synthesis
phosphorylation by g -specific cdk-cyclin complexes activates the nucleolar transcription factor ubf
constitutive and strong association of paf with rna polymerase i
david: database for annotation, visualization, and integrated discovery
systematic and integrative analysis of large gene lists using david bioinformatics resources
rna helicase ddx coordinates transcription and ribosomal rna processing
the mechanism of release of p-tefb and hexim from the sk snrnp by viral and cellular activators includes a conformational change in sk
nucleolar proteome dynamics
-dependent subcellular proteome localization following dna damage
a quantitative proteomics analysis of subcellular proteome localization and changes induced by dna damage
quantitative proteomics using stable isotope labeling with amino acids in cell culture reveals changes in the cytoplasmic, nuclear, and nucleolar proteomes in vero cells infected with the coronavirus infectious bronchitis virus
proteomics analysis of the nucleolus in adenovirus-infected cells
insights into rna biology from an atlas of mammalian mrna-binding proteins
the multi-tasking p-tefb complex
the pol ii ctd: new twists in the tail
regulation of p-tefb elongation complex activity by cdk acetylation
acetylation of conserved lysines in the catalytic core of cyclin-dependent kinase inhibits kinase activity and regulates transcription
sirt directs the replication stress response through cdk deacetylation
the rna polymerase ii ctd coordinates transcription and rna processing
pause, play, repeat: cdks push rnap ii's buttons
sirt is activated by dna and deacetylates histone h in the chromatin context
inhibition of h k deacetylation of sirt by myb-binding protein a (mybbp a)
crystal structure of the n-terminal domain of human sirt reveals a three-helical domain architecture
acetylation of rna polymerase ii regulates growth-factor-induced gene transcription in mammalian cells
site-specific methylation and acetylation of lysine residues in the c-terminal domain (ctd) of rna polymerase ii
perspective of cyclin-dependent kinase (cdk ) as a drug target
we thank elisabeth kremmer (helmholtz center munich) for providing anti-ha, anti-gst and anti-pser -pol ii antibodies. we acknowledge the technical assistance of jeanette seiler and the support of tore kempf and sabine fiedler in mass spectrometry. conflict of interest statement. none declared. supplementary data are available at nar online.
key: cord- -e licyc authors: tholstrup, jesper; oddershede, lene b.; sørensen, michael a. title: mrna pseudoknot structures can act as ribosomal roadblocks date: - - journal: nucleic acids res doi: . /nar/gkr sha: doc_id: cord_uid: e licyc several viruses utilize programmed ribosomal frameshifting mediated by mrna pseudoknots in combination with a slippery sequence to produce a well-defined stoichiometric ratio of the upstream-encoded to the downstream-encoded protein. a correlation between the mechanical strength of mrna pseudoknots and frameshifting efficiency has previously been found; however, the physical mechanism behind frameshifting still remains to be fully understood. in this study, we utilized synthetic sequences predicted to form mrna pseudoknot-like structures. surprisingly, the structures predicted to be strongest led only to limited frameshifting. two-dimensional gel electrophoresis of pulse-labelled proteins revealed that a significant fraction of the ribosomes were frameshifted but unable to pass the pseudoknot-like structures. hence, pseudoknots can act as ribosomal roadblocks, prohibiting a significant fraction of the frameshifted ribosomes from reaching the downstream stop codon. the stronger the pseudoknot, the larger the frameshifting efficiency and the larger its roadblocking effect. the maximal amount of full-length frameshifted product is produced from a structure where those two effects are balanced. taking ribosomal roadblocking into account is a prerequisite for formulating correct frameshifting hypotheses. the reading frame of the vast majority of mrnas is determined by the start codon, after which the downstream cistron is translated in the same frame. maintenance of the reading frame occurs without further signals to the ribosome. however, examples of genes containing information for programmed frameshifts can be found in most organisms, or in some of their insertion sequences (is), transposable elements, retroelement-derived sequences or viruses.
the sequence information needed for programmed ribosomal frameshift varies and both +1 and −1 frameshifts can be induced ( ) ( ) ( ) . here, we focus on the frameshifting signal found in several viruses ( ) , including infectious bronchitis virus (ibv) and sars-cov. the signal leads to programmed ribosomal −1 frameshift, whereby multiple proteins are produced from a single polycistronic messenger rna (mrna) ( , ) . the frameshift efficiency, i.e. the fraction of ribosomes that change reading frame, is important to ensure a correct stoichiometric relationship between the different products of translation. it has been shown that altered frameshift efficiency has detrimental effects on the proliferation of hiv-i and the yeast l-a viruses ( , ) . in order to induce −1 frameshift, these viruses rely on three physical features on the mrna: a heptanucleotide sequence, a spacer and a downstream structure ( ) . the heptanucleotide sequence, called the slippery sequence, is where the −1 frameshift occurs and typically has the following sequence: x xxy yyz, where x, y and z denote nucleotide species and spaces indicate the initial reading frame. the spacer is a stretch of - nt positioning the ribosome correctly at the slippery site when encountering the downstream structure. the downstream structure is most often found to be a pseudoknot. the pseudoknot structure probably functions as a physical barrier deforming upon approach of the translating ribosome ( ) , thereby assisting the frameshifting process; however, geometry and surface charge of the structure may also play a role in frameshifting ( ) . in bacteria and yeast, programmed frameshift signals can have rather different elements, e.g. the upstream shine-dalgarno binding element in the autoregulatory rf gene frameshift site first described in escherichia coli ( ) or the different pattern of the +1 frameshift-stimulating heptanucleotide sequences present in saccharomyces ty elements ( ) .
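The slippery-sequence pattern described above (x xxy yyz) can be searched for mechanically. The following is a minimal sketch, not the authors' software: the function name and the optional `orf_start` argument are hypothetical, and the pattern encodes only what the text states (three identical nucleotides, three identical nucleotides, then any nucleotide), with no further biological restrictions on x, y or z.

```python
import re

# X XXY YYZ: three identical nucleotides, three identical nucleotides, any nucleotide.
# The pattern sits inside a lookahead so that overlapping candidates are all reported.
SLIPPERY_PATTERN = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")


def find_slippery_candidates(mrna, orf_start=None):
    """Return (index, heptamer) pairs for every X XXY YYZ candidate in `mrna`.

    If `orf_start` (0-based index of the first base of the initial reading
    frame) is given, only heptamers whose XXY triplet is an in-frame codon
    are kept, matching the spacing "x xxy yyz" used in the text.
    """
    seq = mrna.upper().replace("T", "U")  # accept DNA or RNA spelling
    hits = []
    for match in SLIPPERY_PATTERN.finditer(seq):
        start = match.start()
        # The XXY codon begins one base after the start of the heptamer.
        if orf_start is not None and (start + 1 - orf_start) % 3 != 0:
            continue
        hits.append((start, match.group(1)))
    return hits


if __name__ == "__main__":
    print(find_slippery_candidates("AUGGCUUUAAACGG", orf_start=0))
```

A hit only marks a candidate site; whether frameshifting occurs there also depends on the spacer and the downstream structure discussed in the text.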
however, many frameshift signals deviate little from those described for the virus-derived system used here and many signals are of such general character that ribosomes from different kingdoms of life will respond to them by shifting frame ( ) . this does not always happen with the same efficiency as in the original organism ( , ) and there are even examples where a frameshift element can direct the ribosomes into −1 or +1 frameshift depending on the test organism ( ) . here, we challenged e. coli ribosomes by constructing artificial frameshifting signals containing pseudoknot-like structures with strong stems. using a refined frameshift assay, involving two-dimensional (2d) gel electrophoresis of pulse-labelled proteins, we show that a significant amount of frameshifted ribosomes permanently stall within the strongest pseudoknots, which therefore efficiently act as roadblocks. the small ribosomal subunits have been shown to be sensitive towards mrna secondary structure in the process of translation initiation and mrna structures can exclude initiation both in eukaryotes during the scanning process ( ) and in prokaryotes for binding between the mrna and the -end of s rna ( ) . the fully assembled and translating s or s ribosomes seem to be more robust. it is, however, broadly accepted that mrna secondary structures can function as obstacles to translating ribosomes ( , ) although examples exist of large secondary structures in mrna that are translated without any ribosomal delay ( ) . nevertheless, there is compelling evidence from in vitro experiments showing that ribosomes may pause upstream of such structures, most pronounced if the structures form pseudoknots ( ) ( ) ( ) . possibly the lack of rotational freedom in the helix of stem , due to the pairing in stem , makes pseudoknot structures harder to 'unzip' by the ribosome than simple stem-loop structures ( ) . this may explain why pseudoknots can pause ribosomes.
examples from nature show the existence of diverse peptide sequences, often present in regulatory circuits, which will stall ribosomes ( ), but to our knowledge, a permanent halt of ribosomes caused by mrna structures has not been shown previously. recent single-molecule investigations suggest that the mechanical strength of pseudoknots correlates with the ability of the pseudoknot to stimulate frameshift ( ) ( ) ( ) , at least in a certain interval. however, the calculated gibbs free energy does not always correlate with frameshift efficiency. not only the strength of the stems, but also the interaction between the loop and the stems might be of importance for the ability to induce frameshift and for the overall mechanical strength and brittleness of the structure. if the pseudoknot becomes too strong, the ribosome, frameshifted or not, might not be able to open it and continue translation, whereby the pseudoknot acts as a roadblock. often in the literature ( ) ( ) ( ) ( ) ( ) ( ) ( ) frameshifting assays were performed on constructs exhibiting the common feature that the stop codon for the normal reading frame was located at the entrance of the pseudoknot (or inside the pseudoknot) and the stop codon for the successful −1 frameshift was located downstream of the pseudoknot. in most frameshifting assays, the amount of frameshifting is determined by quantifying the amount of full-length frameshifted versus non-frameshifted products. however, for this to be a correct measure, the frameshifted ribosome must continue translation through the pseudoknot and beyond to the −1 frameshifted stop codon. if the −1 frameshifted ribosome permanently stalls inside the pseudoknot, it would falsely be interpreted as if the ribosome did not frameshift. therefore, there is a serious pitfall in the classical methods: the amount of frameshifted ribosomes is determined incorrectly, i.e. underestimated, potentially leading to false hypotheses regarding the physical mechanism of frameshifting. the observation that strong pseudoknot-like structures can stop translation led to the hypothesis that the largest amount of frameshifted product will be produced if the pseudoknot is mechanically strong but without a significant roadblocking effect. most likely, this is exactly the balance exhibited by naturally occurring viral pseudoknots.
escherichia coli strain mas [e. coli k- , reca d(pro-lac) thi ara f : laci q lacz::tn proab + ]. liquid cultures were grown in minimal mops media ( ) using glycerol as carbon source. cultures were incubated with shaking at c for at least generations in the log phase prior to being used in frameshift assays. pseudoknots were designed using custom-made software, which ensured that the codon usage was appropriate for expression in e. coli and that the sequences were likely to fold into the correct structure as determined by pknotsrg ( ) . hence, the resulting sequences are artificial pseudoknot-like structures and there is always a risk that the structure does not fold as anticipated. the selected sequences were synthesized by genescript and were subsequently inserted into plasmid ofx [containing slippery sequence, spacer and pseudoknot ( ) ] between hindiii and apai restriction sites. the in vivo frameshift assays were performed as described previously ( ) . briefly, ml of an exponentially growing culture was induced with isopropyl β-d-thiogalactopyranoside (iptg) to a final concentration of mm at an optical density of . - . measured at nm (od ). after induction for min, the culture was pulse-labelled with ~ mci l-[ s]-methionine for s and chased with mg l-methionine for min before being transferred to ml of chloramphenicol ( mg/ml) on ice. the cells were harvested by centrifugation and proteins were boiled in sds buffer and separated by % sds-page.
the gel was dried and placed on a phosphor imager screen (molecular dynamics) and left to expose for - days. the relative amount of protein in the relevant polypeptides was quantified using imagequant software and the frameshift efficiency (e) was determined as follows:
$$E = \frac{V_{fs}/n_{met,fs}}{V_{fs}/n_{met,fs} + V_{stop}/n_{met,stop}}$$
where $V_{fs}$ is the relative radioactivity in the frameshift product, $n_{met,fs}$ is the number of methionines in the frameshift product, $V_{stop}$ is the relative radioactivity in the in-frame stop product and $n_{met,stop}$ is the number of methionines in the in-frame stop product.
two-dimensional sds-gels were performed as described ( ) with a few modifications ( ) using samples from the frameshift assay described above. the frameshift efficiency was determined as described for the frameshift assay above, although polygonal shapes were used to encircle the polypeptides of interest and quantify the relative amount of radioactivity in them. polypeptides originating from stalled ribosomes were found as radioactive polypeptides with appropriate isoelectric point and molecular weight appearing on gels when the translated transcript contained a pseudoknot. these polypeptides were absent when a transcript without a pseudoknot was translated. the weakest stalled protein spots were difficult to distinguish from spots originating from endogenous gene expression on these gels (compare to the construct in supplementary figure s ) and their determination is associated with some uncertainty. the statistical analysis used to compare the stalling efficiency between pseudoknot / a and / b was an unpaired one-tailed student's t-test with a significance level of . .
total rna was extracted from . ml culture samples by the 'hot-phenol' extraction method and separated according to size by electrophoresis on . % agarose, % formaldehyde gels in recirculating xmops buffer. capillary blots were performed onto hybond-n + (perkin elmer) membranes, and the rna was crosslinked to the membrane by . j/cm uv light in a stratalinker .
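The definitions above (relative radioactivity and methionine count for the frameshift and in-frame stop products) suggest that each band intensity is divided by its number of labelled methionines before the fraction is taken. A minimal sketch under that assumption follows; the function and argument names are hypothetical and the normalization is inferred from the variable definitions, not quoted from the authors' analysis.

```python
def frameshift_efficiency(v_fs, n_met_fs, v_stop, n_met_stop):
    """Frameshift efficiency E from methionine-normalized band intensities.

    v_fs, v_stop         -- relative radioactivity of the frameshift and
                            in-frame stop products
    n_met_fs, n_met_stop -- number of methionines in each product
    Assumption: label incorporation is proportional to methionine count, so
    intensity / n_met is proportional to the number of polypeptide chains.
    """
    if n_met_fs <= 0 or n_met_stop <= 0:
        raise ValueError("methionine counts must be positive")
    if v_fs < 0 or v_stop < 0:
        raise ValueError("radioactivity values must be non-negative")
    fs_chains = v_fs / n_met_fs
    stop_chains = v_stop / n_met_stop
    if fs_chains + stop_chains == 0:
        raise ValueError("no signal in either product")
    return fs_chains / (fs_chains + stop_chains)


if __name__ == "__main__":
    # Illustrative numbers only: a frameshift product with three times the
    # signal and three times the methionines of the stop product.
    print(frameshift_efficiency(30.0, 3, 10.0, 1))
```

The normalization matters because the frameshift product is a much longer fusion protein and would otherwise be over-counted relative to the short in-frame stop product.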
riboprobes covering mrna sequences as described in figure were made by t rna polymerase transcription from the pmas 'downstream' template ( ) or from templates made by pcr where one primer included 'hanging out' t promoter sequences (gene and lacz probes). the riboprobes were synthesized in the presence of -p-utp and the final specific activity was about ci/mmol of nucleotide. hybridization and stripping of membranes were performed following standard protocols (amersham, hybond-n+ booklet, ). the membranes were wrapped in saran wrap and placed on a phosphor imager screen (molecular dynamics) and left to expose overnight. signals were visualized using imagequant software.
we created a series of plasmids containing different pseudoknots in which the in-frame stop codon was placed either immediately upstream ('upstream stop') or ~ nt downstream ('downstream stop') of the pseudoknot ( figure a ). the 'upstream stop' constructs had an in-frame stop codon in the spacer between the slippery sequence and the pseudoknot. this caused non-frameshifted ribosomes to produce a kda polypeptide (gene from phage t ) while ribosomes undergoing a −1 frameshift continued through the pseudoknot and into lacz, producing a kda fusion protein of the t gene and lacz sequences. in the 'downstream stop' constructs we replaced the uaa stop codon immediately upstream from the pseudoknot with a lysine codon (aaa). this change caused non-frameshifting ribosomes to continue through the pseudoknot and terminate at a downstream uga codon, producing a kda polypeptide. the pseudoknot constructs based on the plasmid ofx ( ) are detailed in figure b . we systematically increased the length of stem and, in pseudoknots / a through / c, we exchanged gc with ua base pairs, thus gradually decreasing the stability of stem .
often, the number of ribosomes that undergo −1 frameshift has been determined from constructs such as our 'upstream stop' constructs, by separating radioactively labelled proteins by sds-page and quantifying the relative amount of protein in each of the two polypeptides ( versus kda). given the limited resolution of sds-page, it is, however, impossible to clearly differentiate between polypeptides produced by ribosomes that terminate at the in-frame uaa stop codon and ribosomes that undergo −1 frameshift but stall within the pseudoknot. in order to overcome this problem, we invoked 2d sds-page ( ) whereby polypeptides were separated not only by molecular weight but also by their isoelectric point (pi). while polypeptides originating from ribosomes stalled in the pseudoknot varied only slightly in molecular weight, they varied significantly in their pi. based on the 'downstream stop' construct, we calculated a theoretical 2d sds-page assay of a growing polypeptide as consecutive codons are translated (shown in figure a ). at around kda, the trace splits into two: the triangles denote the non-frameshifted product and the circles denote the −1 frameshifted product. red symbols denote codons inside the pseudoknot. experimental data originating from the 'downstream stop' construct are shown in figure b . the theoretically expected features are indeed present, e.g. both the non-frameshifted (ds-stop) and the −1 frameshifted (fs) products are visible. the heat shock proteins groel and dnak serve as landmarks on the gel. interestingly, a series of polypeptides originating from ribosomes stalled inside the pseudoknot appeared (inside dashed red line). for comparison, a standard sds-page of the same sample is shown in figure c . here, the second level of information (isoelectric point) is lost and the relatively blurry bands are difficult to interpret.
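A theoretical trace of the kind described above can be approximated by computing the molecular weight and isoelectric point of every N-terminal prefix of a protein sequence. The sketch below is an illustration only and not the authors' calculation: the residue masses are standard average values, the pKa set is one of several published scales (an assumption), all function names are hypothetical, and the nascent chain is treated as a free peptide.

```python
# Average residue masses in Da; one water is added per peptide chain.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.015

# Approximate pKa values (assumption: published scales differ slightly).
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 9.0
PKA_C_TERMINUS = 3.1


def molecular_weight(peptide):
    """Average molecular weight (Da) of a peptide in one-letter code."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def net_charge(peptide, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa in peptide:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(peptide):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if net_charge(peptide, mid) > 0:
            low = mid  # still positively charged: the pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def nascent_chain_trace(protein):
    """(length, molecular weight, pI) for every N-terminal prefix."""
    return [
        (n, molecular_weight(protein[:n]), isoelectric_point(protein[:n]))
        for n in range(1, len(protein) + 1)
    ]


if __name__ == "__main__":
    for length, mass, pi in nascent_chain_trace("MKKDE"):
        print(length, round(mass, 2), round(pi, 2))
```

Running the function once on the in-frame translation and once on the frameshifted translation gives two traces that share a common stem and diverge after the slippery site, which is the qualitative behaviour the text describes.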
the results shown in figure revealed that a 2d sds-page assay could not firmly distinguish polypeptides originating from a −1 frameshifted ribosome stalled in the pseudoknot from the non-frameshifted product in a 'downstream stop' construct. in order to quantify the amount of −1 frameshifted ribosomes stalled inside the pseudoknot, we performed a 2d sds-page separation of the radioactively labelled proteins originating from the 'upstream stop' construct (supplementary figures s and s ) , which is the type of construct most commonly used throughout the literature. the advantage of a 2d-gel analysis is that all the unfinished protein chains with different lengths concentrate in a common spot when they have the same pi. this made it possible to identify randomly stalled translation products inside the pseudoknot sequence and we quantified the amount of radioactivity in all identified additional spots. this produced a conservative estimate of the amount of stalled translations. the result of quantifying the fraction of in vivo −1 frameshifted ribosomes, both those which made it all the way to the lacz stop codon (gene /lacz fusion) and those which stalled inside the pseudoknot, is shown in figure a . the hatched bars denote the −1 frameshift efficiency taking into account only the end product of −1 frameshift ( kda gene /lacz fusion). this frameshift efficiency was calculated as (intensity of fs product)/(intensity of non-fs product + intensity of fs product). the filled bars denote the −1 frameshift efficiency when both the end product ( kda gene /lacz fusion) and the products originating from stalled ribosomes are taken into account. this frameshift efficiency was calculated as (intensity of fs product + intensity of stalled product)/(intensity of non-fs product + intensity of fs product + intensity of stalled product). in addition to the six artificial pseudoknot-like structures, we also analysed two earlier investigated pseudoknots pk and pk ( ), with overall structures more similar to naturally occurring pseudoknots (figure insert), inspired by structures in the infectious bronchitis virus ( , , ) . the pseudoknot structures in this type of virus are selected for their effects on vertebrate ribosomes, but the stem length variations were found to yield approximately the same relative stimulatory effect in e. coli ( ) , suggesting that stem strength is equally important for stimulating bacterial ribosomes to frameshift. all pseudoknots investigated stalled some fraction of the frameshifted ribosomes; however, significantly more ribosomes stalled in the artificial pseudoknots than in those resembling naturally occurring pseudoknots (pk and pk ).
figure . frameshift assay and pseudoknot structures. (a) all plasmid constructs contain an iptg-inducible promoter in front of t gene (light grey), a complete frameshift signal, and lacz (dark grey). the frameshift-stimulating pseudoknot-like structure is inserted downstream of gene . immediately downstream from the pseudoknot, lacz is inserted in the −1 reading frame relative to gene . in the 'upstream stop' construct the non-frameshifting ribosomes will translate gene and terminate at a uaa stop codon in the spacer sequence and produce a kda polypeptide. ribosomes undergoing −1 frameshift at the slippery sequence translate lacz, thus producing a ~ kda polypeptide. in the 'downstream stop' construct the uaa stop codon is replaced by an aaa lysine codon, thus resulting in a ~ kda polypeptide being produced by non-frameshifting ribosomes, which terminate at a uga stop codon downstream from the pseudoknot. (b) sequence and structure of the inserted pseudoknots, the slippery sequence and the spacer. in pseudoknots / , / a, / b and / c the first base in loop has been removed in order to maintain the downstream reading frame (underlined). the boxed insert in panel b shows the structure and sequence of previously described constructs ( ) .
to quantify the amount of ribosomes stalling within a pseudoknot in vivo, we calculated the ratio of (stalled + non-stalled frameshifted ribosomes) to (non-stalled frameshifted ribosomes); the result is shown in figure b . for the ibv-inspired pseudoknots, this ratio was close to 1, signifying that essentially no ribosomes stalled. however, the ratio was significantly larger than 1 for the more artificial pseudoknots, which acted as roadblocks for a large amount of frameshifted ribosomes. the length of stem did not significantly influence the amount of frameshifted or stalled ribosomes. interestingly, within pseudoknots with the same overall structure ( / a-c), / a stalls a significantly higher fraction of frameshifted ribosomes than / b (verified by student's t-test, n = , α = . , p = . ), which again stalls more than / c. hence, the ability to stall a ribosome correlated with the strength of the pseudoknot base pairs: the stronger the base pairs, the more frameshifted ribosomes were stalled. earlier studies have shown that the insertion of sequences able to form mrna secondary structures into a gene may cause the rna polymerase to stall or create a target for endonucleolytic attack ( ) . therefore, in our analysis of mrna pseudoknot-stalled ribosomes, it was important to verify that there was no significant population of mrnas that ended within the pseudoknot structure. if such truncated transcripts were abundant, it would be difficult to distinguish between protein products from ribosomes stalled within the pseudoknot and protein products originating from ribosomes ending translation at 'non-stop' mrnas having their -ends within the pseudoknot sequence. in the latter case translation would be terminated by tmrna trans-translation, thus rendering the protein products unstable due to the tmrna-encoded tag ( ) .
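The three quantities used in this part of the results (the efficiency from the end product only, the efficiency including stalled products, and the stalling ratio) follow directly from three intensities. The sketch below is a hedged illustration: the function name and the returned keys are hypothetical, and the inputs are assumed to be intensities that have already been normalized for methionine content.

```python
def frameshift_summary(non_fs, fs_full, fs_stalled):
    """Summarize frameshifting and stalling from three normalized intensities.

    non_fs     -- in-frame (non-frameshifted) product
    fs_full    -- full-length frameshifted end product
    fs_stalled -- summed products from frameshifted ribosomes stalled in
                  the pseudoknot
    """
    if min(non_fs, fs_full, fs_stalled) < 0:
        raise ValueError("intensities must be non-negative")
    if fs_full == 0:
        raise ValueError("stalling ratio is undefined without end product")
    frameshifted = fs_full + fs_stalled
    return {
        # end product only (the classical measure)
        "efficiency_end_product": fs_full / (non_fs + fs_full),
        # end product plus stalled products
        "efficiency_with_stalled": frameshifted / (non_fs + frameshifted),
        # (stalled + non-stalled) / non-stalled frameshifted ribosomes
        "stalling_ratio": frameshifted / fs_full,
    }


if __name__ == "__main__":
    # Illustrative numbers only, not data from the paper.
    print(frameshift_summary(non_fs=80.0, fs_full=10.0, fs_stalled=10.0))
```

With no stalled product the stalling ratio is exactly 1 and the two efficiencies coincide; any stalled product makes the classical end-product measure smaller than the efficiency that counts stalled ribosomes.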
in the following subsections 'identification of transcripts from the t gene -pk-lacz gene fusions', 'messenger rna stability' and 'coupling between translation and transcription is required for full-length transcripts', we will show that the observed proteins did indeed originate from stalled ribosomes and that they were not caused by other effects. identification of transcripts from the t gene -pk-lacz gene fusions. to identify the major class of transcripts from our pseudoknot-containing constructs, we made a northern blot with rna from all strains used to measure frameshift frequencies, which are those containing the upstream stop. we used three different probes hybridizing either upstream of the pseudoknot, immediately downstream of the pseudoknot or at the very end of the lacz reading frame ( figure a ). in figure b-d, there was unspecific hybridization from all three probes to the s and s ribosomal rnas. in e. coli, ribosomal rna constitutes between % and % of total rna depending on growth conditions and some cross-hybridization to these species is often seen in northern blots. here, the uninduced culture in figure b-d, lane ' no iptg', made it possible to estimate the unspecific probing to rrna and the two bands were used as size markers on the blots. following induction with iptg, all strains showed increased hybridization above the s rna band compared to the uninduced control with all three probes. the so-called construct was described in reference ( ) , and contains a slippery sequence and the uaa stop codon but no pseudoknot-like structure. in all strains, except the one with the construct, there was a distinct band (fl) representing the expected full-length transcript. the full-length transcript reached from transcription start to the stem-loop structure downstream of the -end of the lacz open reading frame ('hp' in figure a ). this mrna stem-loop structure has been shown to stabilize the lacz transcript by reducing -end exonucleolytic attacks ( ) .
the core plasmid contained no distinct transcription termination signal after the lacz gene, and accordingly we found transcripts that extended far beyond the full-length fl band ( figure b-d) . in the beginning of lacz, ∼ nt into the open reading frame, there is a site, called 'pt' in figure a , where the rna polymerase is caused to terminate if there is inefficient translation initiation of the lacz gene ( ) . in the construct there is no pseudoknot to stimulate frameshifting at the slippery site. therefore, virtually no ribosomes were expected to follow the rna polymerase from gene into the lacz part of our gene fusion. as expected, figure b and c, lane ' ' shows a prominent band ('sp' for stop polymerase) corresponding in size and probe-ability to this premature termination product. correspondingly low amounts of high molecular weight transcripts are detected for this construct. all the other constructs shown in figure contained frameshift-stimulating pseudoknots, and an inspection of the northern blot showed that the 'sp' bands probed with both gene and lacz sequences were present in sizes that correspond to the sizes of the pseudoknots inserted. messenger rna stability. the wild type lacz mrna half-life is close to the average mrna half-life in e. coli ( s) and transcription takes close to s due to the length of the lacz gene (three times longer than the average gene). therefore, a northern blot of wild type lacz mrna under steady state transcription will always include many unfinished native transcripts, as well as mrnas undergoing degradation. here, our gene -lacz fusion was even longer and transcription should take ∼ s. accordingly, all induced strains included in figure show a distinct smear of mrna fragments recognized by all three probes. in order to examine the half-life of our artificial transcripts, we performed experiments in which transcription from the p tac promoter was stopped by removal of the inducer ( figure ). 
two minutes after iptg removal, any remaining smear should originate from mrna degradation because most rna polymerases should have reached the end of the gene fusion by this time. from the experiment shown in figure , it is evident that both the 'fl' and the 'sp' mrna fragments had a half-life close to the average min e. coli mrna half-life. in addition, both pseudoknot-containing constructs ( / and / a) revealed the existence of a short mrna fragment that was recognized only by the gene probe but not the lacz and probes (indicated by 'asterisks' in figure ). this fragment includes the transcription start in the -end and the pseudoknot in the -end. we suggest that the pseudoknot acts as an exonuclease barrier like the natural stem-loop structure in the -end of the wild type lacz transcript ( ) and thereby gives rise to a degradation intermediate of a distinct length with an increased half-life compared to unstructured mrna sequences like those from construct . alternatively, and not mutually exclusively, a pseudoknot could act like a rho-independent termination signal to the rna polymerase. however, the sequences were not followed by a row of uridine residues, which would be necessary to make a stem-loop structure into a functional transcription terminator. coupling between translation and transcription is required for full-length transcripts. the final test of our model for the transcription pattern in our artificial gene fusion was to establish translational coupling beyond the slippery sequence and into the polar termination site (sp) in lacz. by changing the upstream stop codon between the slippery site and the pseudoknot region into a sense codon, ribosomes should, frameshifted or not, follow the rna polymerase into the beginning of the lacz sequence. the / a and the constructs are the two constructs with the lowest frequency of frameshifting. therefore, they have the least ribosome traffic into the lacz sequence. 
alteration of the uaa stop codon into a lysine aaa codon in the spacer between the slippery sequence and the pseudoknot changed the pattern of transcripts immensely. these two downstream stop variants ('ds. stop' in figure ), which did not contain a stop codon upstream from the structure, expressed significantly more full-length ('fl') transcript and only insignificant amounts of premature transcription stop fragment ('sp') compared to their sister constructs containing the uaa stop codon upstream from the structure ( figure ). our control construct, pk , which stimulated % frameshift, showed no premature transcription stop fragment ('sp'), and therefore no change in transcription pattern was observed as a consequence of removing the upstream uaa stop codon ( figure ), thus confirming that the major effect causing the 'sp' fragment is polarity in the lacz gene and not transcription termination caused by the pseudoknot sequences. also, the very short band marked by 'asterisks' that appeared from the / a construct was not present in the 'ds. stop' variant ( figure ). this excludes this mrna fragment as the cause of the stalled protein products, because / a ('ds. stop') is the construct that caused the highest frequency of stalling (compare figure and supplementary figure s ). our conclusion is that the stable proteins observed from within the pseudoknot structures (figure , supplementary figure s , s , s and s ) were products from stalled ribosomes. the stalling of the ribosomes was directly caused by the tertiary structure and not by some secondary effect such as stop codon-less mrna fragments ending within the structure sequences. the structures analysed in this study are artificial and were designed to fold into pseudoknot-like structures with a gradually increasing mechanical strength. 
the mechanical strength was adjusted by changing the base pairs of the two stems, which seems to be a reasonable way of crudely varying the mechanical strength, as the energy involved in base pairing is higher than the energies involved in, e.g. the electrostatic interaction of the loop with the stems. it is, however, likely that the loop-stem interaction, surface charges or factors other than mechanical strength alone influence the frameshift-stimulating effect of mrna structures. as there is a consensus in recent literature that pseudoknot mechanical strength correlates with frameshifting efficiency ( ) ( ) ( ) , it was intriguing that the amount of frameshifted product was reduced by the stronger pseudoknot / a compared to the weaker / b or c ( figure a ). this proved to be caused by stalling of a significant amount of frameshifted ribosomes by the strong pseudoknots ( figure b ). future studies will show whether significant stalling can also be caused by naturally occurring pseudoknots. quantitative northern blot analysis was used to examine whether the observed translation products ending within the pseudoknot structure arose from fragments of mrna produced either by low rna-polymerase processivity or by specific endonucleolytic attacks by rnases at the pseudoknot sequences. no evidence was found of a specific population of transcripts that could explain the amounts of protein products attributed to pseudoknot-stalled ribosomes. also, our protein-stability assay showed that the translational products from the stalled ribosomes were stable for at least min (supplementary figure s ) , thus indicating that the stalled ribosomes are not rescued by tmrna and that the stalled proteins do not originate from truncated mrna. we also checked whether the protein products from within the pseudoknot structure could arise from very slow rather than permanently stalled ribosomes. 
a pulse-chase experiment (supplementary figure s ) revealed that within min there was no sign of a redistribution of label between the stalled spots and the stop codon-terminated downstream stop product, making it unlikely that the products arose from very slow ribosomes. it is possible that the newly discovered ribosome rescue factor, arfa ( ) , could be active at pseudoknot-stalled ribosomes and that nascent proteins would be more stable than if saved by tmrna. however, as can be seen in supplementary figure s , the growth of strains expressing pseudoknot / a was severely affected by induction and showed a decrease in growth rate correlating with the amount of stall product observed. because ribosomes are limiting in growing cells ( ) , the sequestration of ribosomes by engagement in induced overexpression of a gene from a plasmid will often cause a strain to grow slower than the uninduced counterpart. the enhanced reduction in growth rate upon induction of / a compared to the construct (supplementary figure s ) could indicate that stalled ribosomes were not rescued at a sufficiently high rate, and we suggest that either the ribosomal rescue systems were titrated by the large amount of mrna induced from the plasmid alleles or, alternatively, that no rescue is possible for pseudoknot-stalled ribosomes. our results are in agreement with the observation that the amount of protein produced from an mrna can be reduced when a pseudoknot is located upstream ( ) . also, they provide a possible explanation for the reduction in frameshift efficiency observed by, e.g. napthine et al. ( ) when increasing the thermodynamic stability of stem above a certain threshold. this apparent reduction in frameshift efficiency (observed by d sds-page) could be caused by the fact that a significant fraction of the 'frameshifted' ribosomes permanently stalled within the pseudoknot. 
we propose that pseudoknot-induced frameshifting efficiency can be viewed as a balance between two competing effects (as visualized in figure ): the mechanically stronger the pseudoknot, the larger the frameshifting efficiency ( ) ( ) ( ) ; however, the stronger the pseudoknot, the larger the likelihood of stalling the frameshifted ribosome, thus preventing the translation of full-length frameshift product. possibly, evolution optimized viral pseudoknots to balance these two effects. hence, in measurements of frameshifting efficiency it is important to take into account the roadblocking effect of mrna pseudoknots.
figure . model of frameshifting efficiency. increasing the strength of a pseudoknot causes the pseudoknot to induce frameshifting at a higher frequency. however, the stronger the pseudoknot, the larger the likelihood that it will act as a roadblock for the ribosome, reducing the amount of frameshifted product produced. the optimal frameshifting efficiency is achieved by balancing the two contributions. 
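the balance described in the model can be illustrated with a toy calculation. the sketch below is only an illustration and not the authors' model: it assumes, purely for demonstration, that both the probability of frameshifting and the probability of stalling rise as logistic functions of an abstract 'strength' value, and every name and parameter (logistic, full_length_product, shift_mid, stall_mid, steepness) is hypothetical.

```python
import math


def logistic(x, midpoint, steepness):
    """Smooth 0-to-1 switch centred on `midpoint` (an assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))


def full_length_product(strength, shift_mid=3.0, stall_mid=7.0, steepness=1.0):
    """Fraction of ribosomes that frameshift AND are not stalled.

    All parameters are hypothetical; they are not fitted to any data.
    """
    p_shift = logistic(strength, shift_mid, steepness)  # rises with strength
    p_stall = logistic(strength, stall_mid, steepness)  # also rises, but later
    return p_shift * (1.0 - p_stall)


# Scan an arbitrary strength axis and pick the value giving the most product.
strengths = [i * 0.5 for i in range(21)]  # 0.0, 0.5, ..., 10.0
optimum = max(strengths, key=full_length_product)
```

under these assumptions the full-length frameshift product is low for weak structures (little frameshifting), low for very strong structures (mostly stalling) and highest in between, which is the qualitative shape the model proposes.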
pseudoknot-dependent programmed- ribosomal frameshifting: structures, mechanisms and models recoding: expansion of decoding rules enriches programmed ribosomal- frameshifting as a tradition: the bacterial transposable elements of the is family programmed ribosomal frameshifting in hiv- and the sars-cov a conserved predicted pseudoknot in the ns a-encoding sequence of west nile and japanese encephalitis flaviviruses suggests ns may derive from ribosomal frameshifting maintenance of the gag/gag-pol ratio is important for human immunodeficiency virus type rna dimerization and viral infectivity ribosomal frameshifting efficiency and gag/gag-pol ratio are critical for yeast m double-stranded rna virus propagation frameshifting rna pseudoknots: structure and mechanism a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting crystal structure of a luteoviral rna pseudoknot and model for a minimal ribosomal frameshifting motif reading frame switch caused by base-pair formation between the end of s rrna and the mrna during elongation of protein synthesis in escherichia coli e. 
coli ribosomes re-phase on retroviral frameshift signals at rates ranging from to percent differential response to frameshift signals in eukaryotic and prokaryotic translational systems programmed frameshifting in the synthesis of mammalian antizyme is + in mammals, predominantly + in fission yeast, but - in budding yeast circumstances and mechanisms of inhibition of translation by secondary structure in eucaryotic mrnas a role for mrna secondary structure in the control of translation initiation translation and messenger rna secondary structure the ribosome uses two active mechanisms to unwind messenger rna during translation codon usage determines the translation rate in escherichia coli ribosomal movement impeded at a pseudoknot required for frameshifting ribosomal pausing during translation of an rna pseudoknot ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency torsional restraint: a new twist on frameshifting pseudoknots divergent stalling sequences sense and control cellular physiology correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of - ribosomal frameshifting characterization of the mechanical unfolding of rna pseudoknots characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an rna pseudoknot achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins the role of rna pseudoknot stem length in the promotion of efficient - ribosomal frameshifting evidence for an rna pseudoknot loop-helix interaction essential for efficient- ribosomal frameshifting culture medium for enterobacteria pknotsrg: rna pseudoknot folding including near-optimal structures and sliding windows high resolution two-dimensional electrophoresis of proteins 
determination of the peptide elongation rate in vivo cytoplasmic degradation of ssra-tagged proteins specific endonucleolytic cleavage sites for decay of escherichia coli mrna inefficient translation initiation causes premature transcription termination in the lacz gene ribosome rescue by escherichia coli arfa (yhdl) in the absence of trans-translation system synthesis of proteins in escherichia coli is limited by the concentration of free ribosomes: expression from reporter genes does not always reflect functional mrna levels supplementary data are available at nar online. funding for open access charge: university of copenhagen excellence program.conflict of interest statement. none declared. key: cord- -y wf f authors: zhang, guang lan; srinivasan, kellathur n.; veeramani, anitha; august, j. thomas; brusic, vladimir title: pred(balb/c): a system for the prediction of peptide binding to h (d) molecules, a haplotype of the balb/c mouse date: - - journal: nucleic acids res doi: . /nar/gki sha: doc_id: cord_uid: y wf f pred(balb/c) is a computational system that predicts peptides binding to the major histocompatibility complex- (h (d)) of the balb/c mouse, an important laboratory model organism. the predictions include the complete set of h (d) class i (h -k(d), h -l(d) and h -d(d)) and class ii (i-e(d) and i-a(d)) molecules. the prediction system utilizes quantitative matrices, which were rigorously validated using experimentally determined binders and non-binders and also by in vivo studies using viral proteins. the prediction performance of pred(balb/c) is of very high accuracy. to our knowledge, this is the first online server for the prediction of peptides binding to a complete set of major histocompatibility complex molecules in a model organism (h (d) haplotype). pred(balb/c) is available at . the t cells of the immune system recognize antigens as short peptide fragments (t-cell epitopes) derived from self or foreign proteins. 
self proteins include all proteins produced by the cells of the host. foreign peptides are derived from pathogens, environmental antigens, tumor cells and transplanted tissue. immune recognition of both self and foreign antigens involves proteolytic processing of antigens, binding of the peptide epitopes by major histocompatibility complex (mhc) molecules and presentation of selected peptide epitopes on the cell surface to activate t cells ( ) ( ) ( ) ( ) . cytotoxic t cells (cd + ) recognize peptides bound to mhc class i molecules and helper t cells (cd + ) recognize antigen in the context of mhc class ii molecules. mhc class i molecules are present in all cells and bind mainly endogenous peptides (those produced within the presenting cell), whereas class ii molecules are present mainly in cells that recognize foreign proteins, such as macrophages and dendritic cells. t-cell epitopes are critical for the immune response to infectious, autoimmune, allergic and neoplastic disease. they have been studied for the development of peptide-based vaccines ( ) and may also be important in the diagnosis of pathogens. it is estimated that between and % of all peptides can bind a particular mhc molecule ( ) . traditional approaches to the identification of t-cell epitopes that involve various biochemical and functional assays of overlapping peptides derived from proteins of interest are costly and not applicable to large-scale studies. accurate predictions using computer models help speed up the identification of t-cell epitopes ( ), minimize the number of experiments necessary and enable systematic scanning for candidate t-cell epitopes from larger sets of protein antigens, such as those encoded by complete viral genomes ( ) . the balb/c inbred laboratory mouse strain is one of the most commonly used animal models in immunological studies and has been used extensively in vaccine research ( , ) . 
balb/c mice express three class i (h -k d , h -l d and h -d d ) and two class ii (i-a d and i-e d ) molecules. several publicly available prediction systems for mhc class i and class ii binding peptides provide the prediction models for (histocompatibility complex- ) h d alleles. syfpeithi ( ) has h -k d and h -l d models, bimas ( ) has h -k d , h -d d and h -l d models, and rankpep ( ) has models for all h d molecules. syfpeithi uses binding motifs, whereas bimas and rankpep use binding matrices. these are general servers that contain prediction models for a range of mhc molecules in human and mouse, and several other mammalian organisms, but the accuracies of individual models have not been determined. on the other hand, quantitative matrices for h -k b , h -d b , h -l d and h -k k ( , ) have been developed and validated. pred balb/c is a computational system for the prediction of peptides binding to all five mhc molecules in balb/c mice (h d ) class i (h -k d , h -l d and h -d d ) and class ii (i-a d and i-e d ) that allows analysis of proteins for the presence of binding motifs to all five h d molecules in parallel. we derived the initial quantitative matrices for pred balb/c using logarithmic equations based on the frequency of amino acids at specific positions within the training set of mer peptides as described previously ( ) . the initial matrices were refined by including information on the consensus ( ) and other binding motifs, for example, h -k d binding peptides that have i, l or m at major anchor position p ( ) . the anchor positions (e.g. positions and in k d binding peptides) were assigned higher weights than other positions. 
in addition, the prediction scores were inspected for all permissible amino acids at each of the anchor positions. all amino acids at the anchor positions other than the permissible ones were assigned low scores to exclude peptides with non-permissible amino acids from the list of predicted binders. the final binding scores were normalized to a scale of - and the final models were tested and validated rigorously. to our knowledge, pred balb/c is the first online server for the prediction of peptides binding to a complete set of mhc molecules in a model organism (h d haplotype). the training data containing binding and non-binding peptides were extracted from mhcpep ( ), mhcbn ( ) , syfpeithi ( ) , jenpep ( ) and a set of non-binders (v. brusic, unpublished data). the mer peptides were used for deriving h d class i matrices because the majority of peptides that bind these molecules are amino acids long ( ) . although the majority of h d class ii binding peptides are - amino acids long, their binding cores are amino acids long ( , ) . an iterative elimination method starting from i-a d and i-e d motifs in syfpeithi ( ) was used to identify the core mer regions from long peptides (k.n. srinivasan, g.l. zhang, a. veeramani, j.t. august and v. brusic, manuscript in preparation). the number of peptides in the training sets is shown in table . no one method of predicting peptide-mhc binding consistently outperforms the rest and the most appropriate predictive model depends on the amount of data available in ref. ( ) . in our previous work, an artificial neural network method and hidden markov models were applied to the prediction of human leukocyte antigens (hlas) binding peptides ( , ) , where more training data were available. because relatively small training data sets are available for h d , we adopted matrix models as the prediction method. 
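the general idea of a frequency-based quantitative matrix with weighted anchor positions can be sketched as follows. this is a minimal illustration under stated assumptions, not the pred balb/c implementation: the log-odds formula, the pseudocount, the uniform background frequency, the penalty value and the function names (build_matrix, score) are all hypothetical choices, and the three training peptides are made-up examples rather than real binders. normalization of the final scores to a fixed scale is omitted.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_matrix(binders, pseudocount=1.0, background=1.0 / 20):
    """Per-position log-odds of each amino acid among equal-length binders.

    The pseudocount and uniform background are assumptions of this sketch.
    """
    length = len(binders[0])
    total = len(binders) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for pos in range(length):
        counts = Counter(peptide[pos] for peptide in binders)
        matrix.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(peptide, matrix, anchor_weights=None, permissible=None, penalty=-10.0):
    """Sum position scores; up-weight anchors, penalize disallowed anchor residues."""
    anchor_weights = anchor_weights or {}  # {position: weight}, 0-based
    permissible = permissible or {}        # {position: allowed residues}
    total = 0.0
    for pos, aa in enumerate(peptide):
        if pos in permissible and aa not in permissible[pos]:
            total += penalty               # excludes non-permissible anchors
        else:
            total += anchor_weights.get(pos, 1.0) * matrix[pos][aa]
    return total


# Made-up training peptides, for illustration only.
toy_binders = ["AYAAAAAAL", "GYGGGGGGL", "SYSSSSSSI"]
toy_matrix = build_matrix(toy_binders)
```

a call such as `score("AYAAAAAAL", toy_matrix, anchor_weights={1: 2.0, 8: 2.0}, permissible={1: "YF", 8: "LIM"})` then scores a peptide with the second and last positions treated as anchors; those positions and residue sets are placeholders, not values taken from the text.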
five · matrices were built, one for each of the five h d alleles, and -fold cross-validations were performed to test the accuracy of the prediction models. the results show that pred balb/c predicts peptides binding to i-e d , i-a d and h -k d with excellent accuracy [area under the receiver operating characteristic (roc) curve, a roc > . ], and to h -d d and h -l d with good accuracy (a roc > . ). the models were also rigorously tested using experimentally known peptides from viral, prokaryotic and eukaryotic origins ( , ( ) ( ) ( ) ( ) and validated by in vivo studies using severe acute respiratory syndrome (sars) nucleocapsid and hiv gag proteins. the h d models accurately predicted out of elispot positive regions from balb/c mice splenocytes immunized with sars nucleocapsid dna vaccines (data not shown). the web interface of pred balb/c uses a set of graphical user interface forms. the interface was built using a combination of perl, cgi and c programs. pred balb/c has been implemented in a sunos . unix environment. users have the option to predict peptides binding to all h d molecules, h d class i molecules, h d class ii molecules or a single h d molecule. the default selection on the webpage is 'all h d ' molecules. to perform predictions using pred balb/c , the user must paste a protein sequence into a textbox and assign a name to the sequence. the sequence must contain between and amino acids. if the prediction is run with an input sequence containing symbols other than the amino acid codes (spaces and carriage returns are allowed) or the total sequence length is outside the - amino acids range, an error message will be displayed. the input can be either a contiguous protein sequence or a list of peptides, one per line. the default selection on the webpage is 'protein sequence' (figure a) , which means the input sequence is treated as a contiguous protein sequence (carriage returns and line breaks will be ignored). 
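the input checks described above can be sketched as a small validation routine. this is an assumption-laden illustration rather than the server's code (which the text says was built with perl, cgi and c): the function name clean_sequence is hypothetical, and because the permitted length range is not legible in the text, the limits are left as caller-supplied parameters min_len and max_len.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw, min_len, max_len):
    """Strip whitespace, then reject non-amino-acid symbols or a bad length.

    min_len and max_len are hypothetical parameters supplied by the caller.
    """
    seq = "".join(raw.split()).upper()  # spaces and line breaks are ignored
    bad = sorted(set(seq) - AMINO_ACIDS)
    if bad:
        raise ValueError("invalid symbols in sequence: " + "".join(bad))
    if not (min_len <= len(seq) <= max_len):
        raise ValueError("sequence length %d outside %d-%d" % (len(seq), min_len, max_len))
    return seq
```

in this sketch an error is raised instead of being displayed on a web page; a front end would catch the exception and show its message to the user.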
the pred balb/c input processing program decomposes the protein sequence (or the list of peptides) into a series of mer peptides overlapping by eight amino acids. individual mer peptides are then submitted for prediction. predicted binding scores for all mers are displayed in the result tables ( figure b) . the mer binding scores are within the range - ; the higher the score, the higher the probability of the peptide being a binder. pred balb/c has the option to plot the binding scores of all the overlapping mer peptides as a graph, in which the x-axis represents the start position of a mer peptide and the y-axis represents the binding score of the mer peptide ( figure c ). the user can sort the peptides by their binding scores and choose to view only predicted binders with binding scores above a certain threshold. to assess prediction accuracy, we used measures of sensitivity se = tp/(tp + fn) and specificity sp = tn/(tn + fp) (tp: true positives; tn: true negatives; fp: false positives; fn: false negatives). the higher the value of sp, the lower the value of se, which results in a lower number of both tps and fps. the lower the value of sp, the higher the value of se, which results in a higher number of both tps and fps. raw binding scores are mapped to a linear scale that corresponds to sp values, and therefore the prediction thresholds across different models have similar meaning. for example, when a user sets the threshold to , the specificity of the predictions to all five alleles is . . the corresponding sensitivities of each model can be viewed at http://antigen.i r. a-star.edu.sg/predbalbc/html/specificity.html. when users select the input sequence type to be 'a list of peptide sequences', the input sequences separated by carriage returns or line breaks are treated as different peptides (figure a ). all overlapping mers in each peptide are submitted for prediction. 
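the decomposition into overlapping windows, and the selection of the best-scoring window within a peptide, can be sketched as below. this is an illustration only: the function names are hypothetical, the window width of nine is inferred from the stated overlap of eight residues between consecutive windows, and toy_score is a placeholder that merely counts i, l and v residues; it is not the pred balb/c scoring model.

```python
def overlapping_windows(sequence, width):
    """Return (1-based start, window) for every window of `width` residues.

    Consecutive windows overlap by width - 1 residues.
    """
    return [
        (i + 1, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


def best_window(peptide, width, score_fn):
    """Highest-scoring window of a peptide, or None if the peptide is too short."""
    windows = overlapping_windows(peptide, width)
    if not windows:
        return None
    return max(windows, key=lambda item: score_fn(item[1]))


def toy_score(window):
    """Placeholder score (count of I, L and V); not a real binding model."""
    return sum(window.count(aa) for aa in "ILV")


windows = overlapping_windows("YPILPEYLQCVK", 9)
top = best_window("YPILPEYLQCVK", 9, toy_score)
```

with a real scoring function substituted for toy_score, the same two helpers cover both input modes: scanning a whole protein, and reporting one representative score per peptide in a list.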
in the result tables, predicted binding scores are represented by the highest individual mer binding score within the input peptide. the predicted binding scores of individual mers in each peptide in the list are not shown ( figure b ). to display the top-scoring mer peptides from each input peptide, the user can use the function 'view binding peptides at threshold ' (figure b ). in the result page ( figure c ), the mers with binding scores equal to or above the threshold of are aligned with the input peptides. the predicted mers are displayed with the names of the h d alleles for which the mer binding scores are above the threshold. for example, the first input peptide, ypilpeylqcvk, has binding scores . , . , . , . and . to h -d d , h -k d , h -l d , i-a d and i-e d , respectively ( figure b ). figure c shows the alignment view of the predicted binding peptides at threshold , which corresponds to a prediction specificity of . . at threshold (sp level . ), there are no mer binders to h d class ii alleles and the mer ilpeylqcv has the highest binding score to h -d d , . . thus, in figure c , this mer is aligned with the input peptide and followed by 'd d '.
conclusion
pred balb/c marks a new direction in predictive modeling of mhc-binding peptides and t-cell epitopes. the main advantage is that pred balb/c focuses on a complete organism and its predictions represent a complete set of predicted targets of t-cell immune responses. the focus on the complete set of mhc alleles is closer to studies involving laboratory animals. this approach provides a more complete view of the immune responses of an organism. the balb/c mouse is an important laboratory model and pred balb/c is, therefore, useful for the analysis of immunization regimens and deciphering responses to infections. 
further development of pred balb/c will include addition of matrices for prediction of mer and mer binders to h d class i molecules and further improvement of prediction matrices by cyclical refinement-using newly defined binders and non-binders from experiments. mechanisms of mhc class i-restricted antigen processing proteases involved in mhc class ii antigen presentation cut and trim: generating mhc class i peptide ligands class ii mhc peptide loading by the professionals genome-wide characterization of a viral cytotoxic t lymphocyte epitope repertoire computational binding assays of antigenic peptides computational methods for prediction of t-cell epitopes-a framework for modelling, testing, and applications rapid determination of hla b * ligands from the west nile virus ny genome a peptide mimotope of type pneumococcal capsular polysaccharide induces a protective immune response in mice vaccination by genetically modified dendritic cells expressing a truncated neu oncogene prevents development of breast cancer in transgenic mice syfpeithi: database for mhc ligands and peptide motifs scheme for ranking potential hla-a binding peptides based on independent binding of individual peptide side-chains prediction of mhc class i binding peptides using profile motifs an automated prediction of mhc class i-binding peptides based on positional scanning with peptide libraries new horizons in mouse immunoinformatics: reliable in silico prediction of mouse class i histocompatibility major complex peptide binding affinity methods for prediction of peptide binding to mhc molecules: a comparative study efficient binding to the mhc class i k(d) molecule of synthetic peptides in which the anchoring position does not fit the consensus motif mhcpep, a database of mhc-binding peptides mhcbn: a comprehensive database of mhc binding and non-binding peptides jenpep: a novel computational information resource for immunobiology and vaccinology peptides naturally presented by mhc class i 
molecules chemistry of peptides associated with mhc class i and class ii molecules crystal structures of two i-a d -peptide complexes reveal that high affinity can be achieved without large anchor residues neural models for predicting viral vaccine targets multipred: a computational system for prediction of promiscuous hla binding peptides prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis the mhc class i-restricted immune response to hiv-gag in balb/c mice selects a single epitope that does not have a predictable mhc-binding motif and binds to k d through interactions between a glutamine at p and pocket d influenza a virus-specific h- d restricted cross-reactive cytotoxic t lymphocyte epitope(s) detected in the hemagglutinin ha subunit of a/udorn/ identification of murine cytotoxic t-lymphocyte epitopes of bovine herpesvirus this project has been funded in part by us federal funds from the national institute of allergy and infectious diseases, national institutes of health, department of health and human services, under grant no. u ai and contract no. hhsn c. funding to pay the open access publication charges for this article was provided by the institute for infocomm research.conflict of interest statement. none declared. supplementary material is available at nar online. key: cord- -cbzd ybv authors: belew, ashton t.; advani, vivek m.; dinman, jonathan d. title: endogenous ribosomal frameshift signals operate as mrna destabilizing elements through at least two molecular pathways in yeast date: - - journal: nucleic acids res doi: . /nar/gkq sha: doc_id: cord_uid: cbzd ybv although first discovered in viruses, previous studies have identified operational − ribosomal frameshifting (− rf) signals in eukaryotic genomic sequences, and suggested a role in mrna stability. 
here, four yeast − rf signals are shown to promote significant mrna destabilization through the nonsense mediated mrna decay pathway (nmd), and genetic evidence is presented suggesting that they may also operate through the no-go decay pathway (ngd). yeast est mrna is highly unstable and contains up to five − rf signals. ablation of the − rf signals or of nmd stabilizes this mrna, and changes in − rf efficiency have opposing effects on the steady-state abundance of the est mrna. these results demonstrate that endogenous − rf signals function as mrna destabilizing elements through at least two molecular pathways in yeast. consistent with current evolutionary theory, phylogenetic analyses suggest that − rf signals are rapidly evolving cis-acting regulatory elements. identification of high confidence − rf signals in ∼ % of genes in all eukaryotic genomes surveyed suggests that − rf is a broadly used post-transcriptional regulator of gene expression. programmed ribosomal frameshifting (prf) has historically been associated with the study of viruses. prf signals stochastically redirect ribosomes into new reading frames and viral prf promotes synthesis of c-terminally extended fusion proteins. the most well defined prf signals direct ribosomes to slip by one nucleotide in the (− ) direction. − prf signals typically contain three elements: a 'slippery site' composed of seven nucleotides (x xxy yyz, incoming zero-frame indicated by spaces) where shifting occurs; a short spacer sequence; and a downstream stimulatory structure, typically an mrna pseudoknot ( ) ( ) ( ) . current models posit that the pseudoknot directs ribosomes to pause with their aminoacyl-(aa-) and peptidyl-trnas positioned over the slippery sequence, where re-pairing of the non-wobble bases of both trnas with the − frame codons occurs ( ) ( ) ( ) ( ) . 
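the x xxy yyz shape of the slippery site lends itself to a simple pattern search. the sketch below is a hypothetical helper, not the method used in the work described here: it matches only the heptamer shape given in the text (three identical bases, three identical bases, then any base), does not check the reading frame, the spacer or the downstream structure, and does not encode any further restrictions on which bases may occupy x, y or z.

```python
import re

# Lookahead so that overlapping heptamers are all reported.
# Group 1: whole heptamer; group 2: base X; group 3: base Y; Z is any base.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([ACGU])\3\3[ACGU]))")


def find_slippery_sites(sequence):
    """Return (0-based start, heptamer) for every X XXY YYZ-shaped match."""
    rna = sequence.upper().replace("T", "U")  # accept DNA or RNA input
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(rna)]


sites = find_slippery_sites("GGCUUUAAACGG")
```

a fuller search would restrict the hits to heptamers in the correct frame relative to the annotated start codon and would then test whether the sequence downstream of the spacer can fold into a stimulatory structure.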
it is now clear that prf is employed by organisms representing every branch in the tree of life, suggesting an ancient and possibly universal mechanism for controlling the expression of actively translated mrnas ( ) . the past few years have witnessed several reports describing in silico identification of recoding signals using a variety of computational approaches ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . while the methodologies of each study covered a broad range of bioinformatics techniques, the general goal with the exceptions of ( , ) was to first find out-of-frame orfs followed by the identification of prf signals in the overlapping region between them. while this can identify new classes of prf signals, it is based on the assumption that prf outcomes should mimic those observed in viral genomes and thus cannot identify new operational outcomes of frameshifting. while 'outcome-neutral' approaches using mrna motifs known to promote efficient prf cannot identify new classes of frameshift signals, they enable an expansion of our understanding of operational uses for prf. the seminal study in this field searched the yeast genome for operational À ribosomal frameshift (À rf) promoting motifs resembling well characterized examples of viral À rf signals, identifying $ putative such elements ( ) . this work was limited by incomplete annotation of the yeast genome and insufficient computational resources available at the time. new bioinformatics tools were subsequently developed and applied using faster and more robust computational platforms. the results showed that: pattern matching approaches coupled with a predictive method for folding rna sequences provided a dramatic improvement in the results; À rf motifs are widespread in the genome of saccharomyces cerevisiae and many have predicted secondary structures with statistically significant measures of free energy ( ) . this analysis showed that $ % of yeast genes contain at least one high probability À rf signal. 
furthermore, we demonstrated that nine putative À rf signals selected from a variety of s. cerevisiae genes/genome promoted efficient recoding in vivo. more recently, this bioinformatics protocol has been applied to additional genomes. currently, more than genomes have been analyzed and it appears that - % of genes contain at least one potential À rf signal (see prfdb at http://prfdb.umd.edu/) ( ) . a key finding was that the outcome and function of À rf differs significantly between the viral and 'cellular' contexts. in viruses, prf controls the stoichiometries of structural versus enzymatic proteins ( ) . in contrast, 'cellular' rf events redirect elongating ribosomes to premature termination codons, suggesting that À rf is used to control cellular mrna abundance and stability through the nonsense-mediated mrna decay (nmd) pathway. while prf is required for the production of functional products, in the cellular context rf appears to operate in a different manner. thus, in the current work, prf is used to connote frameshift signals whose function is to produce c-terminally extended proteins with novel functions, while rf is used to refer to frameshift signals that operate to direct ribosomes to premature termination codons. a proof-of-principle experiment demonstrated that a viral À prf signal can function as an mrna destabilizing element and that mrna destabilization required nmd ( ) . here, rapid degradation of a reporter mrna through nmd is demonstrated for four cellular yeast À rf signals. further, genetic evidence suggests that the presence of the rf-stimulating pseudoknot may promote mrna destabilization through the no-go decay (ngd) pathway ( ) . the est gene, encoding the catalytic subunit of telomerase ( ) , was used to delve deeper into the relationships between À rf and mrna stability. the est mrna is destabilized by À rf primarily via nmd. 
ablation of its five − rf signals resulted in stabilization of the est mrna, and an inverse correlation between − rf efficiency and est mrna steady-state abundance was observed.
escherichia coli dh a was used to amplify plasmid dna. transformations of e. coli were performed as described previously using the calcium chloride method ( ) . yeast cells were transformed using the alkali cation method ( ) . yeast strains used in this study are shown in supplementary table s . yeast were grown on ypad and synthetic complete media (h−) ( ) . yrp and yrp were kind gifts from r. parker. yjb (generously provided by judith berman) was sporulated and strains jd , jd , jd and jd were obtained by tetrad dissection. dual luciferase and mrna stability plasmids have been previously described ( ) . oligonucleotide primers were purchased from integrated dna technologies (coralville, ia, usa) and are shown in supplementary table s . computationally identified putative − rf signals were amplified from yeast genomic dna by pcr using oligonucleotide primers that terminated in a sali restriction site at the and bamhi at the . the zero-frame dual-luciferase reporter plasmid (pjd ) and the − rf signal containing dsdna fragments were digested using these restriction enzymes and ligated together to generate endogenous − rf signal containing dual-luciferase vectors. oligonucleotide primers were chosen to terminate in kpni restriction sites and amplify and bases of renilla and firefly luciferase derived sequences respectively. the resulting amplicons were cloned into the kpni site bases into the pgk open reading frame of the unmodified pgk containing vector (pjd ). a premature termination codon vector (pjd ) was generated by cutting the readthrough vector (pjd ) with bamhi and backfilling with klenow fragment. plasmids so generated are described in supplementary table s . full length est in a centromeric plasmid and the diploid s.
cerevisiae est deletion strain were generously provided by the berman lab and have been previously described ( ) . individual mutant strains were obtained by tetrad dissection. five potentially significant − rf signals were identified in the est open reading frame using the predicted ribosomal frameshift database ( ) . the wobble bases of five slippery heptamers were mutagenized to synonymous codons by oligonucleotide site-directed mutagenesis using the quikchange ii xl site-directed mutagenesis kit (stratagene). oligonucleotide design and reaction conditions followed the manufacturer's recommendations with minor modifications. all mutations were confirmed by sequencing. plasmids so generated are described in supplementary table s .
steady state and time course rna blot analyses of pgk harboring endogenous − rf signals
mrna stability vectors were transformed into wild-type yeast (jd ), upf d or upf d (jd or jd ), xrn Δ (jd ), dcp Δ (jd ), ski Δ (jd ), ski Δ (jd ) and dom Δ (jd ) cells. the est mrna stability vector (pjd ) was transformed into rpb - (jd ) and rpb - /upf − (jd ) cells and time courses were performed as described previously ( ) . total rna was extracted with acid phenol/chloroform (ph = . ) from mid-logarithmic cell cultures ( ) , or with trizol reagent following the manufacturer's directions (invitrogen, carlsbad, ca, usa). rna (northern) blotting was performed as previously described ( ) . equal amounts of rna ( , or mg) were separated through % agarose-formaldehyde gels. rna samples were transferred and uv cross-linked to hybond-n-membranes (amersham). blots were hybridized with g[ p] -end-labeled oligonucleotides specific for u snorna (loading control) and the exogenous renilla fragment (experimental). messenger rnas were identified using a genestorm phosphoimager (bio-rad) and quantified using quantifyone (bio-rad). each experiment was repeated three or more times and averaged to generate graphs.
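Averaged replicate values of this kind are compared as ratios of an experimental value to a control value, and the error-bar formulas themselves do not survive in this text. The sketch below is therefore an assumption about the kind of calculation meant, not the authors' formula: it applies standard first-order error propagation for a quotient of two independent means, using the quantities named in the methods (value, standard deviation and number of replicates for the control and the experimental). The function names are hypothetical.

```python
import math


def ratio_with_standard_error(value_exp, stdev_exp, rep_exp,
                              value_ctrl, stdev_ctrl, rep_ctrl):
    """Return (average ratio, approximate standard error of the ratio).

    Assumes independent experimental and control groups and small
    relative errors (first-order propagation).
    """
    if value_exp == 0 or value_ctrl == 0:
        raise ValueError("values must be non-zero")
    if rep_exp < 1 or rep_ctrl < 1:
        raise ValueError("replicate counts must be at least 1")
    ratio = value_exp / value_ctrl
    # Relative variance of each mean: (stdev / value)^2 / replicates.
    rel_var_exp = (stdev_exp / value_exp) ** 2 / rep_exp
    rel_var_ctrl = (stdev_ctrl / value_ctrl) ** 2 / rep_ctrl
    return ratio, abs(ratio) * math.sqrt(rel_var_exp + rel_var_ctrl)


def ratio_of_ratios(ratio_a, se_a, ratio_b, se_b):
    """Divide one ratio by another, adding relative errors in quadrature."""
    if ratio_a == 0 or ratio_b == 0:
        raise ValueError("ratios must be non-zero")
    combined = ratio_a / ratio_b
    rel_error = math.sqrt((se_a / ratio_a) ** 2 + (se_b / ratio_b) ** 2)
    return combined, abs(combined) * rel_error


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(ratio_with_standard_error(2.0, 0.2, 4, 1.0, 0.1, 4))
```

Dividing each relative variance by the number of replicates yields a standard error rather than a standard deviation; omitting that division would give the wider, standard-deviation-based bar.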
error bars for calculations including ratios of ratios may be approximated using either of the two following calculations. 'average ratio' is defined as the ratio of the two calculated values. value ctrl is the value of the control (the denominator of the ratio) while stdev ctrl is the calculated standard deviation of the control. similarly, value exp and stdev exp are the value and standard deviations of the experimental (the numerator of the ratio). rep ctrl and rep exp are the number of replicates performed for the control and experimental respectively. the error bars in the graphs of ratios of ratios use the approximated standard error. full length est expression vectors (pjd ), est mutant vectors (pjd ) and null plasmids (pjd ) were transformed into wt (jd ), est deletion (jd ), upf (jd ) and est /upf (jd ) deletion strains. total rna was extracted with acid phenol/chloroform (ph = . ) from mid-logarithmic cell cultures. in parallel, total rnas were extracted from isogenic rpl d strains expressing wild-type rpl (jd ), the down-frameshifting rpl -r a allele (am-l r a), or the up-frameshifting rpl -w c/ p s allele (jd ). to prevent amplification from contaminating cellular dna, rna was treated with dnase i before reverse transcription using turbo dnase (ambion). cdna was generated using the bio-rad iscript cdna synthesis kit and used in the lightcycler real-time pcr system. pcr reactions were performed with ml of cdna in -ml reaction mixtures containing $ nm each sense and antisense primer, and x lightcycler sybr green i master mix (roche). pcr cycles were run as follows: cycle of c for min; cycles of c for s, c for s and c for s. u snorna was chosen as a reference gene. the spr , est , bub and tbf orthologs from the genomes of s. paradoxus, s. mikatae, s. bayanus, s. castellii, s. kudriavzevii and s. kluyveri were extracted from the yeast gene order browser (http://wolfe.gen.tcd .ie/ygob/) ( ) . orthologs were identified for all genes. 
the nucleotide sequences were analyzed for the presence of potential − rf signals as previously described ( , ) . results are compiled in supplementary table s .
cellular − rf signals are mrna destabilizing elements
four operational yeast cellular − rf signals derived from the bub , est , spr and tbf genes were employed to test the hypothesis that − rf signals function as mrna destabilization elements. the slippery heptamers for these − rf signals begin at nucleotides , , and of their respective orfs. these were cloned into a yeast pgk reporter gene so that frameshifted ribosomes are directed to ptcs. all inserts were flanked by sequences derived from renilla and firefly luciferase genes, providing unique exogenous sequences for specific detection of the reporter mrnas. two additional pgk reporters without − rf signals were used as controls: a readthrough reporter encoded a continuous orf, while a ptc control contained an in-frame uaa termination codon ( figure ). reporters were introduced into wild-type yeast cells; their steady state mrna abundances were determined by rna blot analysis and normalized to u snorna controls ( figure ) .
figure . schematic of pgk reporter vectors used to monitor the effects of − rf signals on mrna stability. the indicated renilla and firefly luciferase derived sequences from pjd were cloned into the unique kpni restriction site in a high copy pgk expression vector to create the readthrough control (pjd ). the indicated − rf signals derived from bub , est , spr and tbf were cloned into sali/bamhi digested pjd . colored arcs depict computationally predicted base-paired stems ( ) . the premature termination control (ptc) was constructed by mutagenizing pjd to create an in-frame taa codon.
a minimum of three independent blots were performed for all experiments. we note that the blots shown in figure are simplified for the purpose of publication and that the strains are not all isogenic with one another.
in contrast, the bar graphs shown in figure represent data summarized from multiple blots using isogenic strains. in wild-type cells, all four of the cellular À rf signals and the in-frame ptc containing control were less abundant than the pgk reporter mrna (figure a ). the decrease in mrna steady-state abundance varied from $ -fold of the readthrough control (est ) to $ . -fold of wild-type (tbf ). experiments were also performed in upf d and dom d strains, and the u snorna-normalized signal intensities were compared among the same signals between wild-type and mutant strains to determine the relative contributions of nmd and ngd on steady-state abundance of the À rf signal-containing reporters ( figure b and c). the ptc containing mrna was only affected through the nmd pathway: -fold increased abundance in upf d cells relative to wild-type cells, but no change in dom d cells. the tbf À rf signal similarly affected the reporter signal only through nmd ($ -fold). in contrast, the steady-state abundance of the est and bub À rf signal-containing reporter mrnas were increased in both the upf d and dom d mutants: the est signal was -fold less effective in decreasing mrna abundance in upf d cells when compared to the wt strain and $ -fold less effective in dom d cells, while the values for the bub signal were $ -fold and -fold, respectively. the steady-state abundance of the spr À rf signal containing reporter mrna was primarily increased in dom d cells ($ . -fold). deletion of dcp , xrn and ski , all of which function downstream of upf or dom , also generally increased the abundance of the reporter mrnas ( figure d -f). we note however that, in the case of the dcp d cells, the reporter mrnas were relatively abundant, most likely because of the presence of the ! exonuclease activity of the exosome, and/or due to decapping activity contributed by other factors, e.g. by the presence of the l-a virus ( ) . 
these results establish that endogenous cellular − rf signals can decrease mrna steady-state abundance in yeast through the nmd pathway. in addition, the data are consistent with the hypothesis that a subset of these signals may also affect mrna abundance through ngd, although substantiation of this claim requires further studies, e.g. monitoring the abundance of the endonucleolytic cleavage products and performing mrna stability assays in ngd − strains.
the est − rf signal at nucleotide is primarily destabilized by − rf induced nmd
figure suggests that − rf induced nmd is the major cause of decreased mrna steady-state abundance by the est − rf signal beginning at nucleotide . to confirm this, a series of time course mrna decay assays was performed employing the pgk -est − rf reporter, the readthrough control, and the ptc containing construct in cells harboring the temperature sensitive rpb - allele of rna polymerase ii. at the zero time point, cells were shifted to the non-permissive temperature ( c) to arrest transcription of mrnas; total cellular mrnas were extracted at , , , , and min subsequent to the temperature shift, and rna blots were hybridized with the firefly luciferase and u snorna probes. while the readthrough control was stable in wild-type cells ( figure a and d) , both the ptc containing control and the reporter containing the est − rf signal promoted rapid exponential decay of the reporter mrna, thus demonstrating that this − rf signal can operate as an mrna destabilizing element ( figure b-d) . in a parallel experiment using rpb - upf d cells, all of the reporter mrnas remained stable ( figure e-h) . the rapid decay kinetic profile of the est − rf containing reporter and its stabilization in nmd-deficient cells are consistent with nmd being the major decay pathway triggered by this element ( ) . to independently test this, the a aaa aat slippery site was partially inactivated by mutating it to g aag aac.
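The logic of this mutation can be checked directly from the two heptamers: the two complete zero-frame codons should encode the same amino acids, while the x xxy yyz pattern should be lost. The following minimal sketch illustrates this under stated assumptions. The helper names are hypothetical, the codon table contains only the four standard-code codons needed here, and the first base of the heptamer is the third base of an upstream codon that is not given in this passage, so synonymy at that position cannot be verified from the heptamer alone.

```python
# Standard genetic code, restricted to the codons that occur in this example.
CODON_TO_AA = {"AAA": "Lys", "AAG": "Lys", "AAU": "Asn", "AAC": "Asn"}


def _normalize(heptamer):
    """Strip frame spaces and convert to upper-case RNA letters."""
    return heptamer.replace(" ", "").upper().replace("T", "U")


def matches_slippery_pattern(heptamer):
    """True if the heptamer fits X XXY YYZ (two runs of three identical bases)."""
    h = _normalize(heptamer)
    return len(h) == 7 and h[0] == h[1] == h[2] and h[3] == h[4] == h[5]


def zero_frame_codons(heptamer):
    """Return the two complete zero-frame codons (bases 2-4 and 5-7)."""
    h = _normalize(heptamer)
    return h[1:4], h[4:7]


def encoded_residues(heptamer):
    return [CODON_TO_AA[codon] for codon in zero_frame_codons(heptamer)]


wild_type = "a aaa aat"
mutant = "g aag aac"

if __name__ == "__main__":
    print(matches_slippery_pattern(wild_type), encoded_residues(wild_type))
    print(matches_slippery_pattern(mutant), encoded_residues(mutant))
```

Both heptamers encode the same two residues in the zero frame, but only the wild-type sequence retains the two runs of identical bases on which tRNA re-pairing in the − frame depends.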
this silent mutation stabilized the reporter mrna ∼ -fold compared to the wild-type slippery site in wild-type, i.e. dom cells ( figure i ). interestingly, this is less than the -fold stabilization in upf d cells. one would expect that, since nmd is dependent on − rf, inactivation of − rf should be quantitatively the same as inactivation of nmd. to address this, the steady-state abundance of the g aag aac slippery site containing pgk reporter was assayed in an isogenic dom d strain ( figure i, dom d lane) . this combination increased the steady-state abundance of the reporter mrna to near wild-type levels.
ablation of − rf signals increases the steady-state abundance of the yeast est mrna, and − rf efficiency inversely correlates with est mrna abundance
the est family of yeast genes is named after their 'ever shortening telomere' phenotype ( ) . est encodes the catalytic subunit of telomerase and the other three est genes either encode protein subunits of telomerase (est and est ) or a telomere-associated regulator of telomerase (cdc /est ) ( ) . telomere elongation occurs in late s phase, although est p is associated to varying extents with telomeric chromatin throughout the cell cycle, and telomerase defects result in chromosome instability and rapid senescence ( ) . the very low abundance est mrna is stabilized in nmd-deficient cells ( , ) . computational analyses revealed that est contains four additional high confidence − rf signals beginning at positions , , and (supplementary figure s ). the positions of the five predicted − rf signals in the est orf are shown in figure a . silent protein coding changes were introduced into the slippery sites of all of the − rf signals in a full-length est clone expressed from a low copy vector (pest ssΔ, figure a ). clones expressing either wild-type est (pest ) or pest ssΔ were introduced into isogenic est d or est d upf d cells, and qrt-pcr analyses were performed. these silent mutations resulted in a ∼ . -fold increase in the abundance of the full-length est ssΔ mrna relative to wild-type est mrna ( figure b) . similarly, abrogation of nmd increased the abundance of the wild-type est and est ssΔ mrnas ∼ . -fold and ∼ . -fold, respectively. to independently monitor the influence of − rf on mrna abundance, the steady-state abundance of the est mrna was monitored by qrt-pcr in isogenic cells expressing up- and down-frameshift promoting alleles of rpl (which encodes ribosomal protein l ). est mrna abundances were normalized to u snorna in cells expressing wild-type rpl , the rpl -r a allele, which decreases − rf from the l-a frameshift signal to ∼ % of wild-type levels ( ) , and the rpl -w c/p s allele, which increases − rf by ∼ . -fold ( ) . relative to wild-type cells, steady-state abundance of the est mrna was increased by . ± . fold in cells expressing rpl -r a, and decreased to . ± . in cells expressing rpl -w c/p s ( figure c ). taken together, these experiments demonstrate that − rf induced nmd plays a significant role in destabilizing est mrna.
programmed − ribosomal frameshifting, but not specific − rf signals, appears to be conserved and rapidly evolving in budding yeasts
if regulation of gene expression through − rf is biologically significant, then − rf signals should be present in orthologous mrnas from other budding yeast species. to address this, the bub , est , spr and tbf orthologs were identified in s. paradoxus, s. mikatae, s. bayanus, s. castellii, s. kudriavzevii and s. kluyveri, and analyzed for potentially significant − rf signals as previously described ( ) .
(a) positions of the slippery sites of five predicted − rf signals and their sequences are indicated. the full-length gene including native and utr sequences was cloned into a low-copy yeast vector to create pest . silent coding mutations that are predicted to inactivate − rf were introduced to produce pest ssΔ. (b) pest or pest ssΔ were introduced into est d or est d upf d cells and est mrna steady state abundances were determined by quantitative real-time pcr. (c) quantitative real-time pcr was used to monitor steady-state abundance of the endogenous est mrna in isogenic cells expressing three different forms of ribosomal protein l : wild-type rpl (wt); the r a mutant, which promotes decreased rates of − rf; and the w c/p s mutant, which promotes increased − rf efficiency. est mrna abundances were normalized to u snorna abundance for each sample, and the values shown are relative to wild-type cells.
at first glance, these analyses reveal that no single − rf signal is completely conserved among the budding yeasts (supplementary table s ) . however, closer analysis shows that strong candidate − rf signals can be identified in the orthologs of all of these genes, although not in every species. for example, as noted above, the s. cerevisiae est mrna contains five potential − rf signals. similarly, the s. paradoxus ortholog contains five potential − rf signals, although none are identical to the s. cerevisiae elements. s. mikatae est appears to harbor two potential − rf signals, s. bayanus has three, s. castellii contains two and s. kluyveri has three. none were identified in the s. kudriavzevii est ortholog. turning to spr , the s. cerevisiae mrna contains a second potential − rf signal beginning at nucleotide in addition to that identified beginning at nucleotide (see http://cbmgintra.umd.edu/prfdb/index.cgi/detail?id= &accession=sgdid:s &slipstart= ). both the s. paradoxus and s. kudriavzevii spr orthologs contain three potential − rf signals, but none were identified in the s. mikatae, s. bayanus or s. castellii orthologs. interestingly, the s. kluyveri spr ortholog contains a slippery site followed by a strong stem-loop structure; while this may or may not constitute a − rf signal, it does suggest the presence of a rapidly evolving cis-acting element (see discussion below). s.
cerevisiae bub contains the operational − rf signal at nucleotide , plus potential − rf signals beginning at nucleotides and . the orthologous mrnas in s. paradoxus, s. bayanus, s. castellii and s. kudriavzevii each appear to have one potential − rf signal, but none were identified in s. mikatae or s. kluyveri. lastly, the s. cerevisiae tbf mrna has only the single confirmed − rf signal. the s. castellii, s. kudriavzevii and s. kluyveri orthologs contain two each, and the s. mikatae ortholog has one. no potential − rf signals were identified in either s. paradoxus or s. bayanus. as a control, six s. cerevisiae genes lacking predicted − rf signals were selected (pgk , hht , tef , mic , cmd and grx ), their orthologs from the six other yeast species were identified, and these were in turn queried for the presence of putative − rf signals. these analyses revealed that none of the orthologs of these six genes contain predicted − rf signals (supplementary table s , and hyperlinked data therein). the potential evolutionary significance of these observations is discussed below.
in a prior proof-of-principle experiment, we utilized the well characterized − prf signal from the yeast l-a dsrna virus to demonstrate that these elements can generally function as mrna destabilizing elements through the nmd pathway ( ) . subsequently, a bioinformatics approach was used to determine that potential − rf signals are widely found in all genomes examined, and that the great majority of these are predicted to direct elongating ribosomes to premature termination codons ( , ) . here, we show that these chromosomally encoded, endogenous − rf signals can also function as cis-acting mrna destabilizing elements, both in the context of a reporter mrna and, in one case, in a natural context. further, we demonstrated that − rf signals can differentially affect mrna abundance through the nmd pathway, and the data are also consistent with destabilization through ngd. these are modeled in figure .
in support of this idea, the est , bub and spr mrnas were all stabilized in upf d, upf d/nmd d, upf d, dcp d and xrn d cells ( ) , and the half-lives of these mrnas were less than the mean in wild-type cells ( ) . interestingly, tbf is not represented in these databases. in the case of a ribosome shifting reading frame into a ptc, the surveillance complex led by the upf proteins signals rapid decapping by dcp p/dcp p, followed by deadenylation and exonucleolytic decay via xrn p and the exosome. in parallel, the ngd pathway can be activated by ribosomes that are stalled at strong secondary structures in mrnas. stalled ribosomes are freed from mrnas by dom p/hbs p, promoting endonucleolytic cleavage at unpaired nucleotides near the pause, thus resulting in two mrna fragments which become substrates for decapping and exonucleolytic decay [reviewed in ( ) ]. the findings presented here suggest that cells are not only well equipped to deal with aberrant messages which contain premature termination codons and to clear stalled ribosomes from mrnas, but have also evolved to capitalize upon these functions to post-transcriptionally regulate gene expression. the strength of these signals to function as mrna destabilizing elements should be equal to a combination of (i) their strengths as − rf signals and (ii) their abilities to block ribosome progression, i.e. their thermodynamic stability. the est signal is both highly efficient at promoting − rf [∼ %, see ( ) ] and predicted to be quite stable (∼− to − kcal/mol depending on the particular folding solution, see http://prfdb.umd.edu/). it is important to note however that the software used to predict mrna pseudoknots cannot identify base triples, which make major contributions to frameshifting ( ) ( ) ( ) ( ) ( ) , let alone calculate their contributions to thermodynamic stability.
regardless, this combination of high frameshifting and thermodynamic stability results in very strong destabilization via nmd ( figure b ), and perhaps ngd as well ( figure c ). as discussed previously ( ) , the exponential decay profile suggests that nmd can occur beyond the 'pioneer round' of translation. in contrast to est , the tbf signal promoted $ % frameshifting ( ), but is not predicted to be highly stable (À . kcal/mol). thus, all of its mrna destabilization activity was through nmd (compare figure b with c). the thermodynamic stability of the bub signal is predicted to have an intermediate value to est and tbf ($À kcal/mol), and hence the potential contribution of ngd to the stability of its reporter was significant. interestingly, this signal only promoted $ % frameshifting ( ) , yet the contribution of nmd to its destabilization was greater than observed for tbf . one possible explanation for this apparent discrepancy may stem from the fact that, in order to measure frameshifting, one base had to be deleted from the spacer region between the slippery site and the stimulatory pseudoknot. changes in the length and composition of this spacer are known to affect rates of À prf ( ) , and thus the À rf values so determined cannot be taken as absolute. in contrast, the reporters used to monitor mrna stability contained the native sequences. in light of this, it is likely that the native bub À rf signal promotes more frameshifting than the tbf signal. lastly, the spr À rf signal is predicted to be quite stable ($À kcal/mol), yet promoted very low levels of frameshifting ($ . %) ( ) . accordingly, destabilization via nmd was negligible for this element, while ngd appeared to be the major contributor. beyond the pro forma demonstration that À rf signals can decrease cellular mrna abundance, it is important to begin to understand the biological function of this phenomenon. 
as a first step in this direction, we showed that silently mutating the slippery sites in predicted À rf signals within a full-length clone of est significantly stabilized its encoded mrna ( figure b) . similarly, abrogation of nmd stabilized this message, while changes in À rf efficiency inversely correlated with est steady-state mrna abundance. est p is the reverse transcriptase subunit of the telomerase holoenzyme ( ) . interestingly, prior studies have demonstrated that this mrna, along with other mrnas encoding proteins having telomere-associated functions, are stabilized in nmd À yeast cells ( , ) . analysis of the programmed ribosomal frameshift database (http:// prfdb.umd.edu/) reveals that, along with the other four putative À rf signals in the est mrna, the mrnas encoding est p, stn p, cdc p and orc p, all components or regulators of telomerase that are stabilized in nmd À cells, also contain high confidence À rf signals (supplementary figure s ). in addition, the est mrna contains a + prf signal ( ) . intriguingly, telomerase is limiting in cells: while a yeast cell contains chromosome ends, there are only $ telomerase molecules per cell, and telomerase is preferentially recruited to short telomeres ( ) . additionally, tbf p is a telobox containing general regulatory factor that binds to ttaggg repeats within subtelomeric anti-silencing regions ( ) . intriguingly, ablation of nmd ( ) or overexpression of single components of telomerase-associated proteins, i.e. the tel rna, est p, stn p or cdc p resulted in changes in telomere length ( , , ) . we hypothesize that yeast cells use À rf to limit the expression of these proteins in order to maintain the correct stoichiometric balance among telomere associated components. corollary to this, mutations that alter À rf and/or nmd should affect telomere function, and should thus show phenotypic defects similar to those observed in telomerase mutants, e.g. cell cycle progression defects. 
indeed, we have isolated numerous such mutants [reviewed in ( ) ], and have reported that the mof - and mof - mutants, which affect both nmd and À rf tend to accumulate large mother-daughter cells, and/or multiply budded cells, typical of g /m cell cycle defects ( ) . similarly, upf d cells have abnormally elongated buds, and decreased telomere lengths ( , ) . intriguingly, mof - mutants, which only affect À rf, arrest as large, unbudded cells, typical of m-phase exit defects ( ) . these observations suggest that stabilization of the mrnas encoding multiple telomere-associated proteins may have dominant negative effects on telomere homeostasis, and that nmd and À rf may regulate different aspects of the cell cycle. additionally, the central role of bub p at the mitotic cell cycle spindle assembly checkpoint and the progeroid phenotypes caused by bub p deficiency ( ) suggest a more general role for À rf in control of cell growth and division. lastly, the expression of spr p during sporulation ( ) suggests a role for À rf in this developmental process as well. future studies will dissect the roles of the À rf signals in these mrnas. finally, if À rf is widely used to regulate gene expression, then it should be well conserved. the major problem associated with attempting a phylogenetic analysis of À rf signals is the inherent limitations of the software used to predict them. in short, it is not well enough developed to automatically identify matching motifs. in an attempt to begin to address this issue, the orthologous bub p, est p, spr p and tbf p's in six closely related yeast species were identified, their nucleotide sequences extracted from the yeast gene order browser ( ) , and analyzed for the presence of potential À rf signals. 
these analyses revealed that while specific − rf signals do not appear to be evolutionarily conserved, − rf itself may be sufficiently well conserved as a mechanism to post-transcriptionally regulate the expression of these genes across many but not all species examined (see supplementary table s ). importantly, the control experiment showed that putative − rf signals were not detected in any of the orthologs of six s. cerevisiae genes that themselves were not predicted to contain − rf signals. these observations are in agreement with current evolutionary theory, based on analyses of utr sequences of drosophila species, proposing that rapid rates of mutation in cis-acting regulatory elements drive speciation because they confer very specific effects on gene expression, as opposed to mutations affecting protein structure, the pleiotropic effects of which impose very high penalties on fitness ( ) ( ) ( ) . the findings presented in this work, i.e. that while specific − rf signals are not conserved in orthologs from other yeasts, these orfs generically contain − rf signals, and that − rf signals are absent from other orthologous orfs, agree very well with this theory of molecular evolution.
ribosomal frameshifting on viral rnas programmed translational frameshifting recoding: translational bifurcations in gene expression ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency kinetics of ribosomal pausing during programmed - translational frameshifting the -angstrom solution: how mrna pseudoknots promote efficient programmed - ribosomal frameshifting torsional restraint: a new twist on frameshifting pseudoknots programmed ribosomal frameshifting goes beyond viruses: organisms from all three kingdoms use frameshifting to regulate gene expression, perhaps signaling a paradigm shift identification of putative programmed - ribosomal frameshift signals in large dna databases computational identification of putative programmed translational frameshift sites translational recoding signals between gag and pol in diverse ltr retrotransposons sequences that direct significant levels of frameshifting are frequent in coding regions of escherichia coli predicting genes expressed via - and + frameshifts reprogrammed genetic decoding in cellular gene expression identification of functional, endogenous programmed - ribosomal frameshift signals in the genome of saccharomyces cerevisiae knotinframe: prediction of - ribosomal frameshift events prfdb: a database of computationally predicted eukaryotic programmed - ribosomal frameshift signals ribosomal frameshifting efficiency and gag/gag-pol ratio are critical for yeast m double-stranded rna virus propagation a programmed - ribosomal frameshift signal can function as a cis-acting mrna destabilizing element endonucleolytic cleavage of eukaryotic mrnas with stalls in translation elongation the catalytic subunit of yeast telomerase molecular cloning, a laboratory manual transformation of intact yeast cells treated with alkali cations two chromosomal genes required for killing expression in killer strains of saccharomyces cerevisiae the role of 
the est genes in yeast telomere replication identification and characterization of genes that are required for the accelerated degradation of mrnas containing a premature translational termination codon an in vivo dual-luciferase assay system for studying translational recoding in the yeast saccharomyces cerevisiae the yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species the coat protein of the yeast double-stranded rna virus l-a attaches covalently to the cap structure of eukaryotic mrna senescence mutants of saccharomyces cerevisiae with a defect in telomere replication identify three additional est genes telomerase: what are the est proteins doing? chromosome end maintenance by telomerase ) mrnas encoding telomerase components and regulators are controlled by upf genes in saccharomyces cerevisiae regulation of telomerase by telomeric proteins a molecular clamp ensures allosteric coordination of peptidyltransfer and ligand binding to the ribosomal a-site identification of functionally important amino acids of ribosomal protein l by saturation mutagenesis genome-wide analysis of mrnas regulated by the nonsense-mediated and to mrna decay pathways in yeast precision and functional specificity in mrna decay quality control of eukaryotic mrna: safeguarding cells from abnormal mrna function minor groove rna triplex in the crystal structure of a ribosomal frameshifting viral pseudoknot solution structure of the pseudoknot of srv- rna, involved in ribosomal frameshifting a loop cytidine-stem minor groove interaction as a positive determinant for pseudoknot-stimulated - ribosomal frameshifting stimulation of - programmed ribosomal frameshifting by a metabolite-responsive rna pseudoknot functional analysis of the srv- rna frameshifting pseudoknot mutational analysis of the ''slippery-sequence'' component of a coronavirus ribosomal frameshifting signal programmed translational frameshifting in a gene required for 
yeast telomere replication low abundance of telomerase in yeast: implications for telomerase haploinsufficiency identification of high affinity tbf p-binding sites within the budding yeast genome telomere cap components influence the rate of senescence in telomerase-deficient yeast cells characterization of recombinant saccharomyces cerevisiae telomerase core enzyme purified from yeast cdc cooperates with the yeast ku proteins and stn to regulate telomerase recruitment recoding: expansion of decoding rules enriches gene expression translational maintenance of frame: mutants of saccharomyces cerevisiae with altered - ribosomal frameshifting efficiencies a genome-wide screen for saccharomyces cerevisiae deletion mutants that affect telomere length comprehensive and quantitative analysis of yeast deletion mutants defective in apical and isotropic bud growth functional analysis of the sporulation-specific spr gene of saccharomyces cerevisiae emerging principles of regulatory evolution functional analysis of eve stripe enhancer evolution in drosophila: rules governing conservation and change repeated morphological evolution through cis-regulatory changes in a pleiotropic gene

we would like to thank judith berman and roy parker for the gifts of yeast strains and plasmids. institutes of health (r gm , supplementary data are available at nar online.

key: cord- -wifs yy authors: yu, chien-hung; noteborn, mathieu h.; pleij, cornelis w. a.; olsthoorn, rené c. l. title: stem–loop structures can effectively substitute for an rna pseudoknot in − ribosomal frameshifting date: - - journal: nucleic acids res doi: . /nar/gkr sha: doc_id: cord_uid: wifs yy − programmed ribosomal frameshifting (prf) in synthesizing the gag-pro precursor polyprotein of simian retrovirus type- (srv- ) is stimulated by a classical h-type pseudoknot which forms an extended triple helix involving base–base and base–sugar interactions between loop and stem nucleotides. 
recently, we showed that mutation of bases involved in triple helix formation affected frameshifting, again emphasizing the role of the triple helix in − prf. here, we investigated the efficiency of hairpins with a base pair composition similar to that of the srv- gag-pro pseudoknot. although not capable of triple helix formation, they proved worthy stimulators of frameshifting. subsequent investigation of ∼ different hairpin constructs revealed that next to thermodynamic stability, loop size and composition and stem irregularities can influence frameshifting. interestingly, hairpins carrying the stable gaaa tetraloop were significantly less shifty than other hairpins, including those with a uucg motif. the data are discussed in relation to natural shifty hairpins. ribosomal frameshifting is a translational recoding event in which a certain percentage of ribosomes are forced to shift to another reading frame in order to synthesize an alternative protein. this switch occurs at a specific position on the mrna, called the slip site or slippery sequence, and can be either forwards (+ ) or backwards (− ). the nature and efficiency of frameshifting depend on several factors, including trna availability and modifications, and mrna primary and secondary structure ( , ) . the signals that are responsible for − frameshifting comprise two elements: a slippery sequence where the actual reading shift takes place, and a downstream structural element which greatly stimulates the efficiency of frameshifting. although the mechanism is still elusive, the present view is that the downstream structure forms a physical barrier that blocks ef- function and causes ribosomes to stall in their translocation step. this 'roadblock' puts tension on the mrna-trna interaction. the tension can be relieved by the realigning of a-site and p-site trnas in the -direction, whereafter ef- can do its work and the ribosome resumes translation in the − reading frame ( ) . 
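The first of the two signal elements described above, the slippery sequence, is a short heptamer that can be located by a simple scan. The following is a minimal sketch, not a tool used in this study: the function name `find_slippery_sites` is hypothetical, and the heptamer list contains only the slip sites named later in this paper (UUUAAAC, GGGAAAC, AAAAAAC, UUUUUUA), so it is illustrative rather than exhaustive. It does not check the reading frame or the downstream structure.

```python
# Slippery heptamers named in the text of this paper (illustrative list,
# not a complete catalogue of functional slip sites).
SLIPPERY_SITES = ("UUUAAAC", "GGGAAAC", "AAAAAAC", "UUUUUUA")


def find_slippery_sites(mrna, sites=SLIPPERY_SITES):
    """Return (0-based position, heptamer) for each listed slip site found.

    DNA input is accepted: T is converted to U before scanning.
    The reading frame is NOT checked; a hit is only a candidate.
    """
    seq = mrna.upper().replace("T", "U")
    hits = []
    for i in range(len(seq) - 6):
        window = seq[i:i + 7]
        if window in sites:
            hits.append((i, window))
    return hits


# Example with a made-up sequence (not taken from any construct in the paper).
print(find_slippery_sites("augGGGAAACggc"))  # [(3, 'GGGAAAC')]
```

A real search would additionally require the heptamer to sit in the correct frame and to be followed, after a short spacer, by a stable hairpin or pseudoknot.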
in general, a pseudoknot is more efficient in stimulating frameshifting than a hairpin of the same sequence composition. this difference is likely related to a higher thermodynamic stability of the pseudoknot. indeed, from thermodynamic analysis it appears that pseudoknots are more stable than their hairpin counterparts ( ) ( ) ( ) . recent studies employing mechanical 'pulling' of frameshifter pseudoknots have shown a correlation between the mechanical strength of a pseudoknot and its frameshifting capacity ( , ) , and the influence of major groove and minor groove triplex structures ( ) . the higher strength of a pseudoknot can be primarily attributed to the formation of base triples between the lower stem s and loop l ( figure a ), making it more resistant against unwinding by an elongating ribosome ( , ) . base triples in several pseudoknots, such as beet western yellows virus (bwyv) p -p ( ) , pea enation mosaic virus type- (pemv- ) p -p ( ) , sugarcane yellow leaf virus (scylv) p -p ( ) and simian retrovirus type- gagpro (srv- ) ( , ) have been shown to play an essential role in frameshifting. for pseudoknots with a longer stem s of - bp, like that of infectious bronchitis virus (ibv), base triples do not appear to contribute to frameshifting ( ) . although a hairpin is considered to be a less efficient frameshift-inducing secondary structure than a pseudoknot, some viruses like human immunodeficiency virus (hiv) ( ) , human t-lymphotropic virus type- (htlv- ) ( ) and cocksfoot mottle virus (cfmv) ( ) make use of a simple hairpin to stimulate substantial levels of frameshifting. in addition, frameshifting in the prokaryotic dnax gene requires, next to an upstream enhancer, the presence of a hairpin as well ( ) . a few studies have investigated a correlation between hairpin stability and frameshift efficiency of natural shifty hairpins ( , ) . 
nonetheless, certain studies have shown that a hairpin composed of the same base pairs as a frameshifter pseudoknot is not very efficient in inducing frameshifting in mammalian cells and lysates ( ) ( ) ( ) but is in other systems ( ) . here, we have carried out a systematic analysis of the frameshift-inducing efficiency of hairpins derived from the srv- gag-pro frameshifter pseudoknot. investigation of about different hairpin constructs revealed that next to thermodynamic stability, also loop size and composition, and stem irregularities can significantly influence frameshifting. our data showed that there are no base-specific contacts between the hairpin and the ribosome during frameshifting and suggest that the hairpin primarily serves as a barrier to allow repositioning of trnas at the slippery site. mutations in the srv- gag-pro frameshifting signal were made in an abridged version of plasmid sf ( ) which is a derivative of psfcass ( ), a frameshift reporter construct. in this version, the entire bglii-ncoi fragment of psf was replaced by a synthetic dsdna fragment ( -gatcttaatacgactcactatagggctcatttaaactagttgaggggccatatttcgc- , containing a spei restriction site, actagt). this yielded plasmid psf in which the original gggaaac slippery sequence has been replaced by the more slippery uuuaaac sequence ( ) . psf was digested with spei and ncoi, and sets of complementary oligonucleotides corresponding to the various mutants were inserted. a list of oligonucleotides is available upon request. all constructs were verified by automated dideoxy sequencing using chain terminator dyes (lgtc, leiden). dna templates were linearized by bamhi digestion and purified by successive phenol/chloroform extraction and column filtration (qiagen, benelux). sp polymerase directed transcriptions were carried out in ml reactions containing ∼ mg linearized dna, mm ntps, mm tris-hcl (ph . 
), mm nacl, mm dtt, mm mgcl , mm spermidine, u of rnase inhibitor (rnasin, promega, benelux) and u of sp polymerase (promega, benelux). after an incubation period of h at c, samples were taken and run on agarose gels to determine the quality and quantity of the transcripts. appropriate dilutions of the reaction mix in water were directly used for in vitro translations. alternatively, transcripts were purified by phenol/chloroform extraction and isopropanol precipitation and quantified by uv absorption as described previously ( ) . experiments were carried out in duplicate using mrnas serially diluted in water to final concentrations of nm. reactions contained ml of an rna solution, . ml of rabbit reticulocyte lysate (rrl, promega), . - ml of s methionine (amersham, in vitro translation grade), . ml of mm amino acids lacking methionine and were incubated for min at c. samples were boiled for min in × laemmli buffer and loaded onto % sds polyacrylamide gels. gels were dried and exposed to phosphoimager screens. band intensity of -frame and − frameshift products was measured using a molecular imager fx and quantity one software (biorad). frameshift percentages were calculated as the amount of − frameshift product divided by the sum of and − frame products, corrected for the number of methionines ( in the -frame product and in the fusion product), multiplied by . candidates of interest were constructed in a dual luciferase vector, pdual-hiv( ), essentially as described previously ( , ) . in short, pdual-hiv( ) was digested by kpni and bamhi, followed by insertion of complementary oligonucleotides to clone the srv- gag-pro pseudoknot, various hairpins as shown in figures c and , and a negative control (nc) which formed no apparent secondary structure downstream of the slippery sequence. an in-frame control was constructed by inserting an a-residue upstream of the cytosine in the uuuaaac slippery sequence of a bp hairpin frameshift construct. 
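The in vitro frameshift percentage described above (band intensities normalised by methionine content, then expressed as a fraction of total product) can be written out as a short calculation. This is a minimal sketch under stated assumptions: the function name and parameter names are hypothetical, the methionine counts must be supplied by the user because the values are not legible in this text, and the final scaling factor is assumed to be 100 because the result is reported as a percentage.

```python
def frameshift_percent(zero_frame_intensity, shifted_intensity,
                       zero_frame_met, shifted_met):
    """Percentage of ribosomes that shifted frame, from gel band intensities.

    Each band intensity is divided by the number of methionines in that
    product, so that the labelled-methionine signal reflects molar amounts.
    """
    if zero_frame_met <= 0 or shifted_met <= 0:
        raise ValueError("methionine counts must be positive")
    zero = zero_frame_intensity / zero_frame_met
    shifted = shifted_intensity / shifted_met
    total = zero + shifted
    if total == 0:
        raise ValueError("no signal in either band")
    # Assumed scaling to a percentage.
    return 100.0 * shifted / total


# Made-up intensities and methionine counts, for illustration only.
print(frameshift_percent(900.0, 200.0, zero_frame_met=3, shifted_met=2))  # 25.0
```

Without the methionine correction the longer fusion product, which usually carries more label per molecule, would be over-counted.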
hela cells were cultured in dmem/high glucose/stable glutamine (paa laboratories gmbh, germany) supplemented with % fetal calf serum, u/ml penicillin and mg/ml streptomycin. cells were kept in a humidified atmosphere containing % co at c. assay protocols were described previously ( ) . briefly, cells were transfected with ng of plasmid using ml of lipofectamine- (invitrogen) in a -well plate. cells were lysed h after transfection and luciferase activities were quantified by glomax-multi detector (promega, benelux) according to the manufacturer's protocol. frameshifting efficiency was calculated by dividing the ratio of renilla luciferase (rl) over firefly luciferase (fl) activity of the mutant by the rl/fl ratio of the in-frame control, multiplied by . in contrast to earlier reports involving the ibv frameshifting pseudoknot ( , ) , we found that in the case of the srv- gag-pro frameshift inducing pseudoknot a hairpin of similar composition as the pseudoknot did stimulate frameshifting in vitro ( figure a and b). the bp hairpin derivative of the srv- pseudoknot (srv-hp) showed % frameshifting efficiency, whereas the srv- pseudoknot (srv-pk) in this context yielded %. the pseudoknot in these experiments is a modified version of the wild-type srv- pseudoknot previously used for nmr and functional analysis ( ) . we note that the uuuaaac slippery sequence was used to enhance the sensitivity of the in vitro frameshifting assay. this sequence is ∼ . -fold more slippery than the wild-type gggaaac slippery sequence ( ) . in the latter context, the hairpin was indeed less efficient (data not shown) while a non-slippery variant, gggaagc, was not effective at all (< . %, data not shown). two other known efficient slip sites, aaaaaac and uuuuuua, caused and %, respectively, of ribosomes to switch frame in the presence of the bp hairpin (data not shown). these data showed that the bp hairpin is a genuine stimulator of frameshifting. 
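The dual-luciferase calculation given in the methods above (the RL/FL ratio of the test construct divided by the RL/FL ratio of the in-frame control) can be sketched as follows. The function and parameter names are hypothetical, and the scaling factor is assumed to be 100 because efficiencies are reported as percentages; replicate averaging and background subtraction are left out.

```python
def dual_luciferase_efficiency(rl_test, fl_test, rl_control, fl_control):
    """Frameshifting efficiency (%) from dual-luciferase readings.

    rl_*: renilla luciferase activity, fl_*: firefly luciferase activity.
    The in-frame control defines what 100% read-through into the second
    reporter looks like in this vector.
    """
    if fl_test <= 0 or fl_control <= 0 or rl_control <= 0:
        raise ValueError("control and firefly readings must be positive")
    test_ratio = rl_test / fl_test
    control_ratio = rl_control / fl_control
    # Assumed scaling to a percentage.
    return 100.0 * test_ratio / control_ratio


# Made-up luminescence readings, for illustration only.
print(dual_luciferase_efficiency(25.0, 100.0, 50.0, 100.0))  # 50.0
```

Normalising to the in-frame control cancels differences in transfection efficiency and in the intrinsic activities of the two enzymes.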
since the hairpin construct also contained sequences resembling those of l of the pseudoknot construct, it was theoretically possible that these nucleotides could take part in the same base triples. to investigate this possibility, we replaced the downstream sequence in the hairpin construct (srv-muthp). this did not affect the frameshift efficiency of the hairpin construct. in contrast, the same mutations in the pseudoknot context (srv-mutpk) reduced its activity about -fold ( figure b) . thus, it is unlikely that triple helix formation or other tertiary interactions contribute to hairpin-dependent frameshifting; the hairpin as such seems to be sufficient. next, we investigated the role of stem length on frameshifting efficiency. increasing stem size from to or bp did not significantly alter frameshifting (figure a ). on the other hand, decreasing stem size led to a steady decrease in frameshifting efficiency which seemed to vanish around a stem size of bp or ΔG of − . kcal/mol ( figure b ). thermodynamic stabilities were calculated at the mfold website using version . parameters (http://mfold.rna.albany.edu/?q=mfold/rna-folding-form . ), as these were previously shown to better fit in vivo hairpin stabilities ( ) . these data support the notion that downstream structures serve as barriers to stall translating ribosomes to stimulate frameshifting, and demonstrate that there is a correlation between the thermodynamic stability of a hairpin and its frameshift inducing capacity. a selection of the above hairpins was cloned into a dual-luciferase reporter plasmid and their frameshifting efficiency assayed in mammalian cells ( figure c) . although the absolute level of frameshifting was lower than in vitro, the trend was similar and showed maximal frameshifting of ∼ % around - bp. the pseudoknot in these assays was . times more efficient than the and bp hairpins, close to the in vitro ratio of . (see above). 
thus, the hairpin derivative can effectively substitute for the srv- pseudoknot in − ribosomal frameshifting. bulges and mismatches are known to change twisting and bending of a regular stem and are thus expected to influence the way in which a ribosome encounters a hairpin structure ( , ) . to investigate a possible effect of helical twisting and bending on frameshifting, we introduced mismatches and bulges in the bp stem at a position corresponding to the junction in the srv pseudoknot ( figure a) . introduction of an a · a mismatch halfway through the stem ( bp/aa) decreased frameshifting about fold, although its predicted thermodynamic stability of − . kcal/mol is comparable to that of a regular hairpin of bp, yielding % frameshifting ( figure b ). the frameshift inducing ability was recovered when the base pair was restored to a-u ( bp/au). we also introduced a single or triple adenosine bulge at either side of the stem, to investigate potential bending effects on frameshifting. figure a and b show that the single adenosine bulge mutant decreased frameshifting, depending on the location of the bulge, five to seven fold compared to the bp hairpin construct. when the bulge was enlarged to three adenosines the frameshifting was almost abolished. interestingly, the effect of bulges at the side of the stem was less dramatic than those at the side. the loop composition plays a major role in hairpin stability, rna/rna and rna/protein interaction. these factors may directly influence hairpin-induced ribosomal frameshifting efficiency. to explore the correlation between loop composition and frameshifting efficiency, a number of loop mutations were introduced in the context of a bp stem (figure ). we note that the uucg tetraloop with a cg closing base pair (cbp) has higher stability (∼ kcal/mol) than that with a gc cbp ( ) . therefore, we first tested if this different cbp affected frameshifting efficiency. 
our results showed that there is no difference in frameshifting efficiency between uucg and uucg/cg constructs (figure , bars and ) . replacing the uucg tetraloop by gggc which, due to its high content of purines, is among the most disfavored tetraloops ( ) had only a marginal effect on frameshifting (figure , compare bars and and figure a, lanes and ) . interestingly, increasing the loop size to nt, which is predicted to lower the stability of stem did not affect frameshifting ( figure , bar ; figure a , lane ). substituting uucg by another stable tetraloop sequence (gaaa) resulted in a -fold decrease in frameshifting ( figure a , lanes and ) either with gc ( figure , bar ) or cg cbp (figure , bar ). we designed another five loop mutants to try to explain the low efficiency of the gaaa tetraloop constructs. constructs aaaa and caaa induced . % and . % frameshifting, respectively (figure , bars and ) , which is close to that of the gaaa constructs. the efficiency of two other a-rich loop mutants, acaa and aaau, was . % and . %, respectively (figure , bars and ), thereby closely matching that of the uucg constructs. finally, the ggga tetraloop construct, belonging to the stable gnra tetraloop family, induced . -times more frameshifting than its gaaa sibling (figure , bar ) . these data suggest that the presence of or adenines at the side of a tetraloop is unfavorable for frameshifting. to further examine the role of the loop identity or size in ribosomal frameshifting, we cloned some of the above loop mutants into a dual-luciferase reporter plasmid and assayed their frameshifting efficiency in mammalian cells ( figure b ). our data show that the effects of loop nucleotides are comparable in vitro and in vivo. the stable gaaa tetraloop construct again had the lowest frameshifting efficiency ( figure b , . %), which was half that of the uucg construct ( figure b, . %). 
figure caption: influence of loop sequence and closing base pair (cbp) on − ribosomal frameshifting efficiency. the composition of various loops capping a bp stem is shown in bold, and cg-cbps are shown in lower case. the constructs are named after their loop sequence followed by the '/cg' extension when the cbp was changed from g-c to c-g. slippery sequence and spacer are the same as in the construct shown in figure a . graph is similar to that of figure b except that on the right y-axis ΔG starts from − kcal/mol.

most rna viruses that make use of ribosomal frameshifting employ pseudoknot structures instead of simple hairpins for this job. the reason for this may be the presence of a triple helix interaction between s and l in most frameshifter pseudoknots, which has been suggested to be a poor substrate for the ribosomal helicase ( , ) and hence increases ribosomal pausing and the time window for slippage. although pausing is critical, it is not sufficient for efficient frameshifting ( ) . previously, it was shown that a bp hairpin with a calculated stability of − . kcal/mol derived from the minimal ibv pseudoknot induced -to -fold less frameshifting in rrl ( ) than its parent pseudoknot even though both the hairpin and the pseudoknot can pause ribosomes at the same position and to a similar extent ( ) . in the present study, a bp hairpin derivative of the srv- gag-pro pseudoknot with a calculated stability of − . kcal/mol was capable of inducing % of frameshifting, which is only . -fold less than its pseudoknotted counterpart. this indicated that a non-natural hairpin can be an efficient frameshift stimulator, at least in the srv- model. furthermore, our results showed that the frameshifting efficiency increased upon elongation of the length of the hairpin up to - bp, which is consistent with our previous data using antisense oligonucleotides of - nt to induce ribosomal frameshifting ( ) . 
more importantly, the frameshift inducing ability of these hairpin constructs with a perfect stem linearly correlated with the calculated thermodynamic stability, in agreement with two previous reports ( , ) . in the experiments of bidou et al. ( ) studying the hiv- gag-pol frameshift hairpin the stem length was kept at bp, while its stability was varied between − . and − . kcal/mol (recalculated using mfold . ) by changing the number of au and gc base pairs in a small set of six hairpins. in the case of the dnax gene of escherichia coli variants of the wild-type bp hairpin were tested for their ability to stimulate − prf at the aaaaaag slippery sequence. hairpin stabilities varied between − . and − kcal/mol and a positive correlation between frameshifting efficiency and calculated stability was observed both in the presence (r = . ) and absence (r = . ) of upstream enhancer ( ) . the dnax gene with the highly efficient (prokaryotic) aaaaaag slippery sequence is not directly comparable to our in vitro system; a bp hairpin in the dnax gene displayed % of frameshifting without upstream enhancer, whereas a bp hairpin in our system induced only . % of frameshifting. in the hiv- gag-pol gene bidou et al. ( ) observed a - % decrease in frameshifting in vivo with their most stable hairpin, similar to our results with the bp hairpin. however, in our case, the stability at which this happened was − kcal/mol, much higher than their most stable hairpin of − . kcal/mol. it is possible that this difference is due to the different experimental systems. although it has been suggested that too stable stems increase the time for trnas to shift back into the -frame again ( ) we believe that our bp hairpin is less efficient because it has more au bps in the middle of the stem compared to the and bp hairpins (figure a) . 
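The linear relationship between calculated stability and frameshifting discussed above is summarised in the cited work by a correlation coefficient r. The following sketch shows how such a coefficient could be computed from paired values; it is a generic Pearson calculation, not the authors' analysis, and the function name is hypothetical. Stability is best supplied as a magnitude (−ΔG, in kcal/mol) so that a more stable hairpin with more frameshifting gives a positive r.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    xs: e.g. hairpin stabilities expressed as -deltaG (kcal/mol).
    ys: e.g. the corresponding frameshifting efficiencies (%).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)
```

The inputs would be one stability and one efficiency per hairpin construct with a perfect stem; constructs with bulges or mismatches are expected to fall off the line, as discussed in the following paragraph.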
the experiments with hairpins harboring bulges or mismatches halfway through the stem demonstrated that this region is quite important for frameshifting ( figure a and b) . even though the overall stability of these constructs was comparable to that of a hairpin of or bp, their frameshift activity was equal to or lower than that of a bp hairpin of − . kcal/mol, as if the mismatch or bulge after the th base pair disconnected the upper part of the stem. this observation is reminiscent of the overall destabilizing effect of mismatches in dna hairpins. in a pioneering single-molecule pulling study, it was shown that introducing a mismatch in a bp dna hairpin shifted its transition state close to the location of the mismatch ( ) . our data are consistent with this mechanical study and suggest that mechanical stability may be a better parameter than thermodynamic stability to describe the frameshift efficiency of hairpins. in addition to the mentioned dnax and hiv- gag-pol hairpins, other examples of frameshifter hairpins are found in htlv- and cfmv ( figure ). htlv- gag-pro features a perfect bp hairpin with cua tri-loop which induces % frameshifting in rrl ( ) . the cfmv a- b frameshifting hairpin consists of bp, one cytidine bulge close to the top, and a stable uacg tetraloop and is capable of inducing % of frameshifting in a wheat germ cell-free system (wge) ( ) . what these hairpins have in common is their length of - bp, their relatively low number of mismatches and bulges, their small loops and their high gc content, especially in the bottom bp. these features are also applicable to the good frameshifters from our dataset. interestingly, these features do not all apply to the minimal ibv hairpin ( figure ) that is derived from the so-called minimal ibv pseudoknot. despite its large size of bp, absence of mismatches and bulges, and presence of a small loop, the stability of the middle part of the hairpin, i.e. bp - , is not very high. 
this could be the reason why its activity in rrl is - fold lower ( ) than that of its parent pseudoknot, whose activity is % ( ) . surprisingly, in our assays the frameshift-inducing efficiency of the ibv hairpin was % (data not shown), which is in stark contrast to the - % reported by brierley et al. ( ) . this discrepancy may be due to experimental conditions: in our experiments we used non-capped transcripts, a -nt spacer and rrl from promega whereas the brierley lab used capped transcripts, a -nt spacer and in-house prepared rrl. on the other hand, the % we obtained for the ibv hairpin would be a factor of . lower than the % reported for the ibv pseudoknot ( ) , and is similar to the ratio of . and . we obtained for srv in vitro and in vivo, respectively. remarkably, in wge the ibv hairpin has been reported to induce high levels ( %) of frameshifting versus % for the ibv pseudoknot ( ) . in that study modified extracts were used that are somewhat more frameshift-prone than the standard wheat-germ extracts. nevertheless, the ratio between pseudoknot and hairpin-induced frameshifting in this system is also . . this number may reflect the additional interactions, like base triples, in a pseudoknot that make it a better frameshift stimulator than a hairpin. in addition to stem size, loop composition is another determinant of hairpin stability. an important subgroup of hairpin loops is the tetraloop, which is the most common loop size in s and s ribosomal rnas ( ) . the tetraloops with consensus uncg, gnra, or cuug loop sequence form stable loop conformations ( , ) . as opposed to the mentioned stable tetraloops, purine-rich ( ) and larger loops ( ) are considered to be less favorable for hairpin formation. our results showed that the gggc loop is indeed less efficient in inducing frameshifting but the larger loop construct ( bp/ nt), although having a lower thermodynamic stability, showed comparable frameshifting efficiency to the stable uucg tetraloop hairpin. 
this is consistent with previous studies that showed that increasing the size of the loop in a hairpin or pseudoknot can increase frameshift-inducing ability to a certain extent ( , ) . although larger loops seem efficient in inducing frameshifting, in known examples of frameshifter hairpins, there are no loop sizes of more than nt. this could relate to hairpin folding kinetics ( ) or to nuclease sensitivity. intriguingly, we found that a bp stem capped with a gaaa tetraloop is -fold less efficient in inducing frameshifting than its uucg counterpart in vitro and in vivo. it has been reported that gaaa tetraloops are frequently involved in rna tertiary interactions ( ) . we hypothesize that the gaaa tetraloop may be involved in an unknown rna tertiary structure with ribosomal rna, thereby interfering with frameshifting. the fact that in the known natural examples of frameshifter hairpins, the gaaa tetraloop, despite its high stability, is absent can be taken as support for this hypothesis (olsthoorn, unpublished data) . further investigation of this observation may lead to new insights in ribosomal frameshifting. in conclusion, our data show that hairpins of various base composition in stem and loop can act as efficient frameshift stimulators. combined with previous studies on antisense-induced frameshifting ( , ) , these data support the notion that downstream structures primarily serve as barriers to stall translating ribosomes to stimulate frameshifting. although there exists a linear relationship between calculated stability and frameshifting, local destabilizing elements like bulges or mismatches in a hairpin can greatly influence frameshift-inducing activity. future experiments addressing the mechanical strength of these hairpins ( - ) may help to improve our understanding of the basics of ribosomal frameshifting. 
translational frameshifting: implications for the mechanism of translational frame maintenance frameshifting rna pseudoknots: structure and mechanism a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting energetics of a strongly ph dependent rna tertiary structure in a frameshifting pseudoknot structure, stability and function of rna pseudoknots involved in stimulating ribosomal frameshifting solution structure of a luteoviral p -p frameshifting mrna pseudoknot correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting characterization of the mechanical unfolding of rna pseudoknots triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of - ribosomal frameshifting torsional restraint: a new twist on frameshifting pseudoknots specific mutations in a viral rna pseudoknot drastically change ribosomal frameshifting efficiency a loop cytidine-stem minor groove interaction as a positive determinant for pseudoknot-stimulated - ribosomal frameshifting solution structure of the pseudoknot of srv- rna, involved in ribosomal frameshifting functional analysis of the srv- rna frameshifting pseudoknot the role of rna pseudoknot stem length in the promotion of efficient - ribosomal frameshifting structure of the rna signal essential for translational frameshifting in hiv- comparative mutational analysis of cis-acting rna signals for translational frameshifting in hiv- and htlv- regulation of - ribosomal frameshifting directed by cocksfoot mottle sobemovirus genome structural probing and mutagenic analysis of the stem-loop required for escherichia coli dnax ribosomal frameshifting: programmed efficiency of % in vivo hiv- frameshifting efficiency is directly related to the stability of the stem-loop stimulatory signal mutational analysis of the rna pseudoknot component of a coronavirus ribosomal frameshifting signal ribosomal pausing during translation of an rna 
pseudoknot the q-base of asparaginyl-trna is dispensable for efficient - ribosomal frameshifting in eukaryotes prokaryotic-style frameshifting in a plant translation system: conservation of an unusual single-trna slippage event mutational analysis of the 'slippery-sequence' component of a coronavirus ribosomal frameshifting signal identification and analysis of the pseudoknot-containing gag-pro ribosomal frameshift signal of simian retrovirus- a dual-luciferase reporter system for studying recoding signals analysis of the role of the pseudoknot component in the srv- gag-pro ribosomal frameshift signal: loop lengths and stability of the stem regions translational standby sites: how ribosomes may deal with the rapid folding kinetics of mrna a characteristic bent conformation of rna pseudoknots promotes - frameshifting during translation of retroviral rna contribution of the closing base pair to exceptional stability in rna tetraloops: roles for molecular mimicry and electrostatic factors structural parameters affecting the kinetics of rna hairpin formation programmed ribosomal frameshifting in siv is induced by a highly structured rna stem-loop ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency stimulation of ribosomal frameshifting by antisense lna direct measurement of the full, sequence-dependent folding landscape of a nucleic acid comparative anatomy of -s-like ribosomal rna solution structure of an unusually stable rna hairpin, 'ggac(uucg)gucc structural features that give rise to the unusual stability of rna hairpins containing gnra loops loop dependence of the stability and dynamics of nucleic acid hairpins stem-loop structure of cocksfoot mottle virus rna is indispensable for programmed - ribosomal frameshifting rna tertiary structure mediation by adenosine platforms novel application of srna: stimulation of ribosomal frameshifting efficient stimulation of site-specific 
ribosome frameshifting by antisense oligonucleotides
we thank fabien sohet and maarten laurs for their initial contributions to this subject.
funding for open access charge: leiden institute of chemistry, leiden university.
conflict of interest statement. none declared.
key: cord- -s geszi authors: lang, dorothy m.; zemla, a. t.; zhou, c. l. ecale title: highly similar structural frames link the template tunnel and ntp entry tunnel to the exterior surface in rna-dependent rna polymerases date: - - journal: nucleic acids res doi: . /nar/gks sha: doc_id: cord_uid: s geszi
rna-dependent rna polymerase (rdrp) is essential to viral replication and is therefore one of the primary targets of countermeasures against these dangerous infectious agents. development of broad-spectrum therapeutics targeting polymerases has been hampered by the extreme sequence variability of these sequences. rdrps range in length from – residues, yet contain only ∼ residues that are conserved in most species. in this study, we made structure-based comparisons that are independent of sequence composition using a recently developed algorithm. we identified residue-to-residue correspondences of multiple protein structures and created (two-dimensional) structure-based alignment maps of polymerase structures that provide both sequence and structure details. using these maps, we determined that ∼ % of each polymerase species consists of seven protein segments, each of which has high structural similarity to segments in other species, though they are widely divergent in sequence composition and order. we define each of these segments as a ‘homomorph’, and each includes (though most are much larger than) the well-known conserved polymerase motifs. all homomorphs contact the template tunnel or nucleoside triphosphate (ntp) entry tunnel and the exterior of the protein, suggesting they constitute a structural and functional skeleton common among the polymerases.
the polymerase protein family has been studied extensively for > years. this interest has been motivated by their unique function, to replicate all forms of life, and confounded by their sequence diversity. as more tertiary structures of polymerase were solved, it became apparent that widely diverse sequences form highly similar structures. there has not, until recently, been a time-effective computational method to make detailed comparisons of these observations. the objective of this study was to clarify the relationship between structure and sequence in a group of rna-dependent rna polymerases (rdrps) that replicate many of the viruses that represent significant threats to life throughout the world. we selected well-studied species in order to maximize the amount of experimental data that could be used to evaluate the association of functional residues and structure (table ) . we used the stralsv algorithm ( ) to perform structure comparisons between all of the selected species. we created maps of residue-to-residue (r r) correspondence from which we determined the boundaries of structurally similar segments, which we named 'homomorphs'. in contrast to the relatively short lengths of previously described motifs, we found that most homomorphs are long, and each provides a structural connection between the template tunnel or ntp entry tunnel and the exterior of the protein. the tertiary structure of the replicative unit of most rdrps is highly conserved ( ) . it resembles that of a right-handed palm, with finger-like folds curved inward to form a tunnel that encircles the template that is being processed ( ) . most single-unit polymerases are - residues in length. early polymerase studies found that ∼ residues are highly conserved in all polymerases, and in most species they are in the same sequential order ( ) . some are clustered, with two to four highly conserved residues within a segment ∼ residues in length. 
the sequence segment that includes each of the highly conserved residues or clusters has been described as a motif. the motifs are arranged in most species in the order: g-f -f -f -a-b-c-d-e. the birnaviruses (ibdv and ipnv) differ from this scheme due to a transversion involving the c motif (c-a-b) ( ) . the references for each of the motifs and species that have been studied are listed in table ; these references were selected because they included either alignments of one or more motifs with several species, or alignments for a particular motif not found elsewhere. apart from these conserved motifs, the rdrp sequences are highly variable. an extensive study of picornaviruses by koonin et al. ( ) illustrates this variability. the basis of koonin et al.'s study was an alignment of species using the algorithm multiple sequence comparison by log expectation (muscle) ( ) and a manual adjustment. the alignment included four species that are also in our sample group, and these had a total of conserved residues within sequences ∼ residues in length. the analysis that we present in the following pages demonstrates that highly similar structures can be formed by very different sequences. structure comparison, rather than sequence comparison, enabled us to readily recognize functionally significant segments of similarity and difference between sequences.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: dorothylang@gmail.com correspondence may also be addressed to a. t. zemla. tel: + ; fax: + ; email: zemla @llnl.gov
in total, well-studied viral species with solved rna polymerase structures and four viral species with solved dna polymerase structures were selected for analysis. we used the stralsv algorithm (http://proteinmodel.org/), described in detail previously ( ) , to perform the analyses. 
stralsv compares the r r (structural) correspondence of each sequence in a set of reference sequences to a specified query sequence, beginning at the start of the sequence and continuing to the end, by evaluating successive overlapping segments of a user-selected length. each of the selected structures was used as a query to all structures available in the protein data bank (pdb release _ _ ; number of chains ) ( ) . the results were filtered for structural segments of at least % lga_s structure similarity ( ) to at least one query segment of amino acids in length (size cutoff for the structural context) from which r r correspondences were extracted from local tightly superimposed spans: continuous segments of the minimum length of five amino acids. these parameters together contributed to the identification of common regions of structure similarity, which were used to distinguish regions of conservation (structure matches) from regions in which structure deviates (non-matches). the stralsv comparison of each initial query to all structures in pdb resulted in the identification of a final set of pdb structures that were used as a reference polymerase structure set in this study. one representative of each species within the set of pdb structures was used to create an all-against-all structure comparison using stralsv. the species and pdb identities of this reference set are summarized in table . the output from the all-against-all structure comparisons was parsed to extract r r correspondences for each query/template pair. in each comparison the full query sequence was represented. at some positions in the template sequences, gaps occurred either due to structure deviation exceeding the alignment cutoff ( Å ) or because the template contained additional residues (e.g. a loop) without correspondence in the query structure. 
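The extraction of tightly superimposed spans described above can be sketched as follows. This is a minimal illustration and not the stralsv implementation: the function name `tight_spans`, the per-residue distance list and the `cutoff` argument are assumptions made for the example, and only the minimum span length of five amino acids is taken from the text.

```python
from typing import List, Optional, Tuple


def tight_spans(distances: List[Optional[float]],
                cutoff: float,
                min_len: int = 5) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index pairs of continuous runs in which
    every query residue has a template partner within `cutoff` angstroms.

    distances[i] is the superposition distance for query residue i, or None
    when the template has no corresponding residue (a gap).
    Runs shorter than `min_len` residues are discarded.
    """
    spans: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    for i, d in enumerate(distances):
        matched = d is not None and d <= cutoff
        if matched and run_start is None:
            run_start = i                      # a new run begins here
        elif not matched and run_start is not None:
            if i - run_start >= min_len:       # keep only long enough runs
                spans.append((run_start, i - 1))
            run_start = None
    # Close a run that reaches the end of the sequence.
    if run_start is not None and len(distances) - run_start >= min_len:
        spans.append((run_start, len(distances) - 1))
    return spans
```

Positions outside the returned spans correspond to the non-matches of the structure map, whether they arise from a deviation above the cutoff or from a residue with no partner.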
a structure map was created for each set of r r correspondences derived from each query/template-set stralsv comparison by combining the data for each binary alignment in an excel spreadsheet. in this article, we report the structure map using poliovirus rdrp as the primary query (figures - ) , although in most cases we include structure maps with other species as query (figures and - ) . alternative query species were used when structure similarity of some of the templates was not identified with poliovirus as query (e.g. motif g in wnv and denv were identified based on denv as the query). the query that contributes to non-poliovirus matches is indicated on each alignment ('q' following the species abbreviation).
table . references for each motif, by species (flattened in extraction; the assignment of entries to motif columns is not recoverable):
ferrer-orta et al. poch et al. poch et al. poch et al. ferrer-orta et al. ( )
hrv pan et al. pan et al. pan et al. love et al. love et al. love et al. love et al.
fmdv gorbalenya et al. ferrer-orta et al. ( ) poch et al. ferrer-orta et al. ( ) ferrer-orta et al. ( )
nv pan et al. pan et al. ferrer-orta et al. ( ) ferrer-orta et al. ( ) ferrer-orta et al. ( ) pan et al. ferrer-orta et al. ( )
rhdv pan et al. pan et al. ferrer-orta et al. ( ) love et al. love et al. love et al. ferrer-orta et al. ( )
hcv pan et al. choi et al. bressanelli ( ) pan et al. bressanelli et al. choi et al. pan et al. choi et al. choi et al. choi et al. choi et al.
hiv pan et al. pan et al. pan et al. pan et al. pan et al. pan et al. pan et al. butcher et al. butcher et al. pan et al. pan et al. pan et al. butcher et al.
reov pan et al. pan et al. pan et al. pan et al. pan et al. pan et al.
the structure maps were used as the basis for the structure alignments described in this article. on all structure maps, we identified the motifs a-f as described by gong and peersen ( ) by coloring the background of the columns matching the residues of the motif orange and the background of the columns matching the highly conserved residues yellow. 
a similar coloring scheme was used for motif g, except that it depicts the residues identified by pan et al. ( ) because motif g was not specified in the gong and peersen study ( ) . on all structure maps, we colored the residues of the picornaviruses blue; the caliciviruses green; hcv and bvdv (flaviviruses) black; wn and denv (flaviviruses) red; phi black; reov brown; rotav, ibdv, ipnv black; hiv purple; tert, t rnap, n black; taq turquoise and t dnap black. the segment of conserved structure adjacent to each motif was determined from the stralsv maps. for each of the queries and each of the motifs, the location at which structure conservation of most species became discontinuous was noted. we defined the boundaries of a homomorph as the position at which the structural segment shared by all representatives in a set became discontinuous in more than two species. in all structure maps, the conserved segments that we identified based on stralsv r r correspondences are colored light blue. we defined the homomorph of each motif as the segment consisting of the conserved motif plus the adjacent structurally conserved segments. the length of each homomorph varied somewhat depending on the query. for each query species, the start and end of the homomorphic segment of each motif were recorded, and a matrix for each motif was generated to compile the data from each query (data not shown). this matrix was used to identify the minimum start and maximum end of each homomorph, and these values are summarized in supplementary table s . these values are plotted in figure , which illustrates the maximal expanse of the homomorphic segments that include each of the polymerase motifs. all of the tertiary structures were illustrated using the cn d program ( ) . structural examination of the sequence motif regions yielded extended regions of structural conservation. 
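The boundary rule and the compilation of per-query boundaries described above can be sketched as follows. This is one possible reading under stated assumptions, not the authors' procedure: the structure map is assumed to be a dictionary of per-species match flags along the query, boundaries from different queries are assumed to be already expressed in a common numbering, and the names `homomorph_bounds`, `envelope` and `max_broken` are hypothetical.

```python
from typing import Dict, List, Tuple


def homomorph_bounds(match_map: Dict[str, List[bool]],
                     motif_start: int,
                     motif_end: int,
                     max_broken: int = 2) -> Tuple[int, int]:
    """Extend a motif outward along the query for as long as no more than
    `max_broken` species lack a structural match at the next position.

    match_map[species][i] is True when that species has an r-to-r match at
    query position i. Returns inclusive (start, end) query indices.
    """
    n = len(next(iter(match_map.values())))

    def broken(col: int) -> int:
        # Number of species without a match at this query position.
        return sum(1 for row in match_map.values() if not row[col])

    start = motif_start
    while start > 0 and broken(start - 1) <= max_broken:
        start -= 1
    end = motif_end
    while end < n - 1 and broken(end + 1) <= max_broken:
        end += 1
    return start, end


def envelope(per_query: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Minimum start and maximum end over the boundaries found per query."""
    starts = [s for s, _ in per_query.values()]
    ends = [e for _, e in per_query.values()]
    return min(starts), max(ends)
```

Taking the minimum start and maximum end gives the maximal expanse of a homomorph, so the result is at least as long as the segment found with any single query.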
we named each of these regions a 'homomorph', defined as a sequence segment that shares a highly similar tertiary structure with other species, independent of the sequence composition. we found that most of the homomorphs were at least twice as long as the corresponding sequence motif. the extent of this expansion is illustrated in figure . the length of each homomorph was determined separately for each species, using a single structure for each species as the query in a stralsv analysis. the identity of the start and end of each homomorph depends on the structural similarity to a given query. therefore, there is some minor query-specific variability of the location of the ends of homomorphs that can be observed in figure . within the homomorphs, non-matching residues can be used to identify minor differences between species. in most single-stranded rdrp (ss-rdrp) species (pv, coxs, hrv, fmdv, nv, rhdv, sapv, hcv, bvdv, wn and denv), the homomorphs of motif g are the largest (median of residues), followed by a ( ), b ( ), e ( ), f ( ), d ( ), c ( ), f ( ) and f ( ) . in double-stranded rdrps (ds-rdrps) (phi , reov, rotav, ibdv and ipnv), the homomorphs of motif g are relatively short (median of residues). several other homomorphs are also shorter in ds-rdrps: b ( ), a ( ), e ( ), f ( ), d ( ), c ( ), f ( ) and f ( ) . the lengths and occurrences of homomorphs of polymerases that are associated with dna (hiv, tert, taq, t dnap, t rnap and n ) are variable and will be discussed in the sections describing each motif. the homomorphs of all species are similarly distributed over the length of the polymerase ( figure ). the ss-rdrps (pv, coxs, hrv, fmdv, nv, rhdv, sapv, hcv, bvdv, kunj and denv) are most similar to each other. the spacing between homomorphs is more variable in the ds-rdrps (phi , reov, rotav, ibdv and ipnv), and in general larger than in the ss-rdrps. 
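The length and spacing comparisons above reduce to simple interval arithmetic. The sketch below is illustrative only: the function names and any coordinates passed to them are hypothetical, and segments are assumed to be inclusive (start, end) pairs in the numbering of a single species.

```python
from statistics import median
from typing import Dict, List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) in sequence numbering


def spacings(homomorphs: Dict[str, Segment]) -> List[Tuple[str, str, int]]:
    """Number of residues between consecutive homomorphs in sequence order.

    Sorting by start position rather than by motif name keeps the result
    meaningful when the motif order is permuted in the sequence.
    """
    ordered = sorted(homomorphs.items(), key=lambda kv: kv[1][0])
    gaps = []
    for (name_a, (_, end_a)), (name_b, (start_b, _)) in zip(ordered, ordered[1:]):
        gaps.append((name_a, name_b, start_b - end_a - 1))
    return gaps


def median_length(segments: List[Segment]) -> float:
    """Median length of one homomorph across a group of species."""
    return median(end - start + 1 for start, end in segments)
```

Calling `spacings` on a species in which the homomorph of motif c lies upstream of that of motif a returns the c-to-a gap first, since the pairs follow sequence order.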
the homomorph of motif c (hmc) is identified in the birnaviruses despite a sequence inversion that places it before motif a ( ) . relatively large segments between homomorphs occur in phi between f and f , and in kunj, denv and phi between b and c. the spacing between motifs is notably reduced in hiv and tert (rddps). in birnaviruses (ibdv, ipnv), the homomorphs of c and a are only three residues apart, and the distance between the homomorphs of motif f and c is greater than the typical f -b distance. most homomorphs are separated from each other by a segment that contains a turn (secondary structure), or there is a turn at the beginning or end of the homomorph. within all rdrps, all motifs occur within a length of residues. in t -ddrp (t rnap) and n , the motifs are spread out over approximately residues. the amount of r r correspondence for most of the rdrps, determined from the minimum and maximum values of all homomorphs, is ∼ % over the span from motif g through motif e (supplementary table s ). the structurally aligned sequences that comprise hmg are summarized in figure a . the r r correspondences of wn and denv could not be evaluated for the motif g region (approximately pv - ), as the structural configuration of the segments of these viruses that would be expected to match the homomorph of motif g (hmg) segment had not been determined. within the homomorph, most of the ss-rdrps were highly similar (figure a , top). bvdv is similar to the other ss-rdrps in the n-terminal segment, but no longer matches them at the c-terminal segment. only motif g, and not a homomorph, was identified in phi , ibdv and ipnv. in the region of motif g, stralsv did not identify r r correspondences between any of the rna polymerases and reov, rotav, hiv, tert or dna-dependent polymerases (taq, t dnap, t rnap and n ). there were structural discontinuities within the homomorph (noted by x in figure a ) and similar discontinuities within the motif. 
these minor discontinuities identify species-specific differences within a segment that is otherwise highly continuous in several species. for example, fmdv has aa between s and t (pv numbering) and therefore does not match the structure of pv -ld- , which has only aa. in contrast, wn and denv have the same number of residues for the gaps from a -r and g -r , respectively, but these segments were not structurally aligned by stralsv within the parameters used in this study. the segment pv-y to a is a β-hairpin unique to picornavirus rdrps ( ) . the numbering on nv and hcv clarifies that these regions are continuous in these species (and the other caliciviruses, rhdv and sapv, though not numbered). figure b and c illustrate the tertiary structure of the homomorph using a poliovirus structure (pdb: ra ). most of the n-terminal segment is a single helix that extends over nearly half of the surface of the protein. both ends of the homomorph terminate at the exterior surface of the protein. the distance between the homomorphs of motifs g and f was - residues in the ss-rdrps (in all species where both were present) and longer in the ds-rdrps (median residues) ( figure ). three components of motif f have been recognized: f , f and f ( , ) . in some species there are sequence segments between these motifs. in all the species in our sample set except phi , in those species that have r r correspondence within motif f, the three f motifs are continuous; therefore, we have combined them, and the adjacent structurally aligned segments, into a single homomorph. the structurally aligned sequences that comprised homomorph of motif f (hmf) for rdrps and hiv are summarized in figure a . hmf extended five residues upstream from the n-terminal edge of motif f [as defined by gong and peersen ( ) ] and ∼ residues downstream from motif f [as defined by gong and peersen ( )]. 
hmf was found in all rdrp species except wn and denv; it was not possible to evaluate r r correspondence for this segment of wn and denv as the structure of this segment has not been resolved. motifs f and f are always continuous if f is present, and motif f is present in most species. motif f is represented by a single residue in phi , reov and rotav (dsrna), two residues in hiv (rddp), residues in bvdv and residues in hcv (two of which, in hcv, are structurally aligned to the other rdrps). motif f varied in length from to residues. in phi , there was a -residue segment between f and f . hmf was present in all rdrp species. figure b and c illustrates the tertiary position of hmf. most of the structure is hairpin-like, with some residues of motif f at the apex, which is located at the exterior surface of the protein. hmf and hmf are approximately parallel for several residues. hmf then independently extends to the surface of the protein approximately opposite the motif f site. figure d shows the n- and c-terminal residues and some residues of the c-segment of hmf at the surface of the protein. figure e shows the position of motif f relative to the template tunnel.
figure . the number of residues from the start of the polymerase structure to the start of the first homomorph is identified for each species at the left of the chart. the pdb structures, and consequently, sequence position numbers for kunj, denv and taq, which are used throughout this article, do not begin at the polymerase; therefore, for this figure, the distance from the start of the polymerase is shown after the slash. for species lacking motif g, the first identified homomorph is indicated at the left of the start position. the length of the polymerase of each species is listed at the right of the chart.
the segments between hmf and hma are - residues in ss-rdrps and phi , - residues in ds-rdrps (reov, rotav, ibdv and ipnv), - residues in rddps (hiv and tert) and ddrps (t rnap and n ) and - in dddps (taq and t dnap) ( figure ). the structurally aligned sequences that comprise homomorph of motif a (hma) are summarized in figure a . and ipnv (ds-rdrps) have fewer aligned residues. reov (ds-rdrp) and hiv and tert (rddps) do not have r r correspondence with the ss-rdrps. the ddrps (t rnap and n ) and dddps (taq and t dnap) share a homomorphic structure within the n-terminal segment, but it is substantially different from the rdrp structure and therefore is not included in the homomorph or figure a . within the motif, hiv corresponds only to nv and sapv (only found using an hiv query), indicating a significant structural difference from other species; hiv also lacks r r correspondence beyond the motif and therefore is not included in the homomorph. at the c-terminal segment of the homomorph, most species in the sample set, except hiv, have a homologous structure. at some sequence positions within hma, a particular residue composition is conserved throughout a viral family (e.g. picornavirus), and a different residue composition is conserved in another viral family at the same position. this within-family sequence conservation (≥ %) occurs at the following sequence positions (shown in figure a , pv numbering): , , - , and . within the n-terminal side of the homomorph, at the edge of the motif (pv - ), there is a minor discontinuity in structure homology ( figure a ). the distance between the discontinuities in each species is provided in a column within the figure (white) that indicates the entire span over which discontinuity exists for each species. however, the loop represented by this discontinuity varies in length by only one to four amino acids. figure b illustrates the tertiary structure of the hma. 
each end of the homomorph is at the exterior surface of the protein (figure c ), and its center, the conserved motif a, is at the surface of the template tunnel. the overall configuration of the homomorph is spring-like ( figure d ). the species-specific loop within the homomorph is located at the exterior of the protein.
figure . in cells with a light blue background filled with a number, the number is the sequence position of the adjacent matches for each species; numbers in the white column between them summarize the length of sequence that the non-matched sequence represents in each species. in this segment there are more residues in each species than between the corresponding residues in pv, indicating that this region is a loop that is absent in pv, and the loop length varies by species. at the left of the alignment ( - , uncolored), there is a structure common to several species, but too few to qualify the region as part of the homomorph. (b) in this figure of poliovirus (pdb: ra ), the n-terminal segment of the homomorph is blue and the c-terminal segment is brown. the terminal residues of hma are at the exterior surface of the protein (pdb: ra ). motif a is centered within the homomorph at the wall of the template tunnel. (c) the terminal residues of the homomorph and the helix adjacent to each are constituents of the protein surface. (d) in pv, an insertion (red) at the c-terminal edge of the motif is lethal: l -i-s ( ) . a species-specific loop (green) affects the catalytic rate (in pv) ( ) .
the sequence segment between the homomorphs of motif a and motif b (hmb) is ∼ - residues in the rdrps, and mostly greater than residues in the dna-dependent polymerases. it is relatively long in reov ( ), t rnap ( ) and n ( ). in the birnaviruses (ibdv, ipnv), motif c precedes motif a in sequence; this sequence inversion is described in a later section of this article, which describes motif c. motif b is a component of the largest homomorph identified in the rdrps. the homomorph begins residues upstream from motif b and extends residues downstream. the motif is residues long. the size of the homomorph is consistent in most species. the structurally aligned sequences that comprise the homomorph are shown in the top section of figure a . they include all the rdrps in the sample set plus tert (rddp). each of these species matched a poliovirus query, indicating there is greater structural similarity than in other homomorphs and motifs. the n-terminal segment of the homomorph contains some discontinuities that are resolved by using r r matches for alternative queries ( figure a , lower section). the c-terminal segment of the homomorph is well represented in all rna polymerases and tert. no r r correspondence was found between the residues comprising hmb in the rna polymerases and residues in the dna-dependent polymerases (t rnap, n , taq and t dnap). the lower section of figure a illustrates the dependence of the r r correspondence on the query sequence. these differences make it possible to identify fine details between structures. our definition of each of the homomorphs, however, is based on the inclusion of all r r alignments using all queries in the sample set. the position of the hmb within the tertiary structure of pv is illustrated in figure b . the n-terminal residue is at the exterior surface of the protein. the n-terminal segment is a classical β-hairpin protein structure that is folded back on itself and is almost entirely exposed on a surface nearly perpendicular to the face of the protein that contains the n-terminal residue ( figure c ). the base of the loop transitions to motif b at the template tunnel. the c-terminal side of the homomorph extends from the tunnel to the exterior surface of the protein ( figure d ). the distance between the homomorphs of motifs b and c (hmc) (figure ) is < - in all rdrps except kunj and denv, which are and residues, respectively. 
in the dna-dependent polymerases, this distance is between (taq) and (n ) residues. in ibdv and ipnv, the segments between the homomorphs of motifs b and d are and residues, respectively. the structurally aligned sequences that comprise hmc are shown in figure a . motif c is the only rdrp motif that is not a component of a larger homomorphic structure. the segments immediately adjacent to both flanks of motif c do not even cluster into subgroups. motif c is short ( residues in most rdrps) and folds sharply back on itself ( figure b ). the highly conserved residues (labeled motif c) are at the surface of the template tunnel and both the n-terminal and c-terminal residues are at the exterior surface of the protein ( figure c ). in the birnaviruses ibdv and ipnv, there is a sequence inversion that results in the relocation of motif c to a position immediately preceding motif a. figure a shows an alignment that documents this inversion. the top and bottom segments of figure a illustrate that all species are well aligned upstream of motif c (ipnv positions - ) and within motif a (ipnv positions - ). rhdv, sapv and bvdv are not well aligned within motif c using the ipnv query, and therefore are missing from the middle section of figure a (ipnv positions - ). the numbering of ipnv and ibdv is sequential, indicating that motif c precedes motif a in these species. the numberings of nv and hcv indicate there are r r matches with ipnv at motif c, but that over this segment the match is not in sequential order. using a pv query, however, all of these species have r r matches over this segment (shown in figure a ). the ipnv query indicates that the structure of motif c of the birnaviruses more closely matches nv and hcv than the others in the sample set. the difference in linear order that results from the sequence inversion is compensated by a modified structure that maintains the motifs within a tertiary position that is similar to all other rdrps ( figure b and c) .
figure . (a) the high number of species that align to pv indicates that the structure of motif c is highly conserved. although a t dnap query was required to identify the matches for the n -taq-t rnap species, it was achievable. hmc is the only homomorph for which there is r r correspondence in all species of the study group. (b) motif c (gold) is the only motif in the rdrps that is not a component of a larger structure. motif c [illustrated using poliovirus (pdb: ra )] is tightly folded upon itself in a manner that places the highly conserved residues (yellow) at the tunnel wall, whereas the n-terminal segment of the motif (blue) and c-terminal segment of the motif (brown) are parallel to each other and penetrate the protein. (c) the terminal residues of both the n- and c-terminal segments are at the surface of the protein.
the distance between the hmc and hmd (homomorph of motif d) is < residues in the rdrps. it is relatively large in phi ( residues) and is indeterminate in the dddps, as neither motif d nor its homomorph is within the pdb structures included in this study. the structurally aligned sequences that comprise the hmd are shown in figure a . the homomorph is residues long and consists of a -residue extension from the n-terminal edge of the motif plus the motif itself. the structure of the n-terminal segment is more highly conserved (i.e. has more r r matches) than the motif. various query sequences were tested with the expectation that they would capture additional alignments. the middle section of figure a illustrates that this produced some improvement. for example, using an hcv query, there are r r matches to tert, taq and t dnap. the c-terminal edge of the motif has some r r correspondence, suggesting that the structure of the motif is moderately conserved. using t dnap as a query (lowest segment of the figure) , only a small portion of the c-terminal edge of motif d and a few species have similar structures. 
there is no alignment of phi within the n-terminal segment of the homomorph, because in this region phi consists of a -residue loop between the end of motif c and the start of motif d. the tertiary structure of the hmd is illustrated in figure b and c. this homomorph lies mostly at the exterior surface of the protein. the motif lines the wall of the polymerase tunnel. the segment between the hmd and homomorph of motif e (hme) is < residues in all structures in the sample set, except in ibdv and ipnv in which it is and residues, respectively. the structurally aligned sequences that comprise hme are summarized in figure a . hme is large and in most of the ss-rdrps (pv, coxs, hrv, fmdv, nv, rhdv, sapv, hcv, bvdv, wn and denv) it is highly conserved. the motif is near the n-terminal edge and a loop region is located near the middle of the homomorph. the sequences vary in length due to the loop region. the length of hme in the caliciviruses ( - residues) is shorter than that in the picornaviruses ( - residues); hcv and bvdv loops are and residues, respectively, and the loops of wn and denv are the longest at and residues, respectively. there is strain-specific amino acid variability in this segment of hrv. hme is well represented by all rdrps. no r r correspondence was found with hiv or tert. these species, however, are structurally matched to each other ( figure a , middle section). there is considerable sequence similarity between pv and denv within this homomorph; this is illustrated in the bottom section of figure a by the shaded conserved residues. ddrps and dddps are not included in the analysis of this region because the region is missing from the structures in our sample group. the tertiary structure of the hme is illustrated in figure b and c. most of the homomorph is at the exterior of the protein near the ntp entry tunnel. 
although it has extensive surface exposure, each terminus of the homomorph appears to be anchored by residues that are not part of the homomorph; as a result, the terminal residue at each end of the homomorph is exposed as a single residue at the exterior surface of the protein. motif e is located near the n-terminal edge of the homomorph and contacts the surface of the ntp entry tunnel ( ). the c-terminal segment of the homomorph is folded back on itself in a manner that places the species-specific loop at the surface of the protein (figure c). the homomorph forms a double strand through pv_m , after which the remainder of the homomorph is a single-stranded helix that emerges at the exterior surface of the protein. in pv, the c-terminus of hme (r ) is exposed at the surface of the protein and surrounded by the segment -safhyvfeg- .

structure-based sequence alignment using the stralsv algorithm ( ) enabled us to identify seven distinct homologous structures in most of the polymerases in our collection of species. in the rdrps, the combined regions of structural homology represent ~ % of the sequence from the start of the homomorph of motif g (hmg) through the end of hme in each species (~ residues). there is < % conservation of sequence composition among these species. each of the homomorphs includes a sequence motif consisting of characteristic highly conserved functional residues that are essential to replication. the tertiary position of each of the homomorphs includes at least one residue (and sometimes more) in contact with the exterior surface of the protein and one or more highly conserved functional residues located within or at the wall of the template tunnel.

we defined each boundary of a homomorph as the position where the structural segment shared by all representatives in a set became discontinuous in more than two species. for many queries, this position could be confidently identified.
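The boundary rule just stated can be illustrated with a small sketch. It assumes a hypothetical match table with one row per species and one column per aligned position, holding a true value wherever that species has a residue-residue match to the query. The table layout, the `seed` argument and the name `homomorph_bounds` are illustrative assumptions and are not part of stralsv.

```python
def homomorph_bounds(matches, seed, max_missing=2):
    """Extend outward from a seed column while at most `max_missing`
    species lack a match in each column.

    matches: dict mapping species name -> list of truthy/falsy flags,
             all of the same length (hypothetical input format).
    seed:    column index assumed to lie inside the homomorph,
             for example a position within the sequence motif.
    Returns (start, end) column indices, both inclusive.
    """
    rows = list(matches.values())
    width = len(rows[0])

    def missing(col):
        # Number of species without a match at this column.
        return sum(1 for row in rows if not row[col])

    if missing(seed) > max_missing:
        raise ValueError("seed column is not shared by enough species")

    start = seed
    while start > 0 and missing(start - 1) <= max_missing:
        start -= 1
    end = seed
    while end < width - 1 and missing(end + 1) <= max_missing:
        end += 1
    return start, end


# Toy table: the segment is discontinuous in more than two species
# at columns 1 and 6, so the homomorph spans columns 2 to 5.
example = {
    "pv":   [0, 1, 1, 1, 1, 1, 0],
    "hcv":  [0, 0, 1, 1, 1, 1, 0],
    "nv":   [0, 1, 1, 1, 1, 0, 0],
    "fmdv": [0, 0, 1, 1, 1, 1, 1],
    "denv": [0, 0, 1, 1, 1, 0, 0],
}
print(homomorph_bounds(example, seed=3))  # (2, 5)
```

Starting from a seed inside the motif keeps the search local, so an unrelated matched stretch elsewhere in the alignment cannot be mistaken for part of the same homomorph.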
however, these positions sometimes varied by one or two residues, depending on the query sequence. query-dependent differences in r r matches were also observed within the motifs themselves, where minor differences in structure resulted in a lack of r r matches for short segments of some queries. our approach was to set the boundary at the position where most queries were in agreement, but to keep in mind that these edges might vary by one or two residues. poliovirus had r r correspondence with other species in the sample set more often than did any other structure. in almost all instances, we were able to map functional features of other proteins to a structurally similar segment of poliovirus. this property of centrality makes it a useful template for polymerase structure properties. hmg is shared by picornaviruses, caliciviruses and flaviviruses, although the structures of each of these groups begin to diverge within the c-terminal segment of motif g. motif g is characterized by the conserved motif [t/sx - g], which is located near the outer edge of the template tunnel. the motif may enforce the correct orientation of essential residues and a primer ( ) . each flank of the homomorph contains amino acid residues that significantly affect the life cycle of the species. in pv, mutations at the n-terminal residue of the homomorph (d a/e a) are lethal ( ) . mutations located outside the n-terminal edge of the motif (pv d a/e a) result in small plaques ( ) . downstream from the c-terminal edge of the motif, there is a nuclear localization signal (nls) in the picornaviruses and caliciviruses. the nls is located two residues from the c-terminus of the homomorph and mutations in the nls (k a/ k a/k a and k a/r a/d a) are lethal to pv ( ) . 
previous research found that motif f occurs in all rdrps ( ), that it recognizes the incoming ntp ( ), and that it serves as the primary fidelity checkpoint for rdrp and reorients the proper triphosphate into a position for efficient catalysis ( ). hmf is an extensive structure with surface exposure at both ends and near its mid-section at motif f (figure a-d). motif f (figure e) is analogous to the loop in hmg that varies in composition and length; it is upstream of a highly conserved motif and is species-specific. the large size of this homomorph and its positioning, which transects the protein while maintaining contact with the template tunnel, are consistent with its established role in transcription, which requires both fine-scale stability and large-scale mobility. motif f consists of mostly basic amino acid residues and forms the roof of the ntp entry tunnel ( ); the characteristic conserved arg residue is essential to nucleotide binding ( ). the required orientation of the f motifs would be stabilized by the loop formed by hmf and the double-stranded segment formed by the extension of the homomorph beyond the motifs. both the n-terminal and c-terminal residues of the homomorph are exposed at the exterior surface of the protein. in pv, mutations of residues adjacent to the n-terminus are lethal: g -i-i ( ) and h a/k a ( ).

the conserved residues of motif a (in pv, d and d ) control the function of the metal ions at the active site ( , ), which perform the phosphotransfer essential to polymerase activity ( ). d is a ligand to the metal ( ). d is essential to ntp binding ( ). similar functions for the residues of motif a have been identified for hcv ( ), hrv ( ) and fmdv ( ). motif a is centered within a spring-like homomorph (figure b-d). each end terminates at the exterior surface of the protein, and the beginning and end of the homomorph terminate nearly opposite each other.
mutations in the n-terminal segment of the homomorph (in pv at e a/e a) result in small plaques ( ) , suggesting that these residues influence the rate of catalysis. this is the region where species-specific structures protrude from the homomorph (pv l -l ). this position, relative to the conserved motif, is analogous to a similar structure in hmg and motif f . all these structures contain a segment that varies in length and composition by species and is located upstream from a highly conserved motif, essential to replication. an insertion at the c-terminal edge of motif a (l -i-s ) is lethal ( figure a ) ( ) . the structure of the homomorph is highly conserved in this region, suggesting that the structural consequences of an insertion are not tolerated. the position of this lethal insertion is similar to the position of lethal mutations in motif g, although the major effect in hmg may be the loss of the nuclear localization signal. mutations near the c-terminal residue of the motif a homomorph (pv g ) affect function: e a/k a is lethal ( ) , and the insertion i -ile-g results in temperature sensitivity ( ) . hma provides a structural connection between the functional residues at the template tunnel and the exterior surface of the protein. sequence residues immediately adjacent to the motif have a high degree of functionality. it is possible that the orientation of the conserved segments that comprise the homomorph would be affected by changes in the orientation of residues at the edges of the motif. n-terminal and c-terminal segments of the homomorph are helices, which are likely to be relatively rigid. motif b is near the center of a very large homomorph that contacts the exterior surface at nearly opposite positions. as stated by bruenn ( ) , motif b forms the base of the template-entry channel and may function in guiding the template entry into the active site. choi et al. ( ) observed that the highly conserved asn (n in bvdv) is conserved in all picornavirus. 
hansen et al. ( ) found that in hrv, n is involved in positioning ntp for recognition. ferrer-orta et al. ( ) determined that the equivalent fmdv-n and d (motif a) together are involved in ribonucleoside triphosphate (rntp) selection. tao et al. ( ) and butcher et al. ( ) proposed that motif b interacts with the -oh group on the incoming nucleotide. korneeva and cameron ( ) determined that fmdv-n interacts with the c-terminal-oh in the uridylylation complex, but with the -oh in the elongation complex. the role of motif b in the mechanisms of active site closure has recently been described in detail by gong and peersen ( ) . these experiments document the role of the highly conserved asn in the motif in multiple species and suggest that structural alignment may be useful for the identification of potential functionally equivalent residues in structures that have r r correspondences. stralsv structure analysis indicates that the structure of motif b is highly conserved in all rdrps, unlike some of the other motifs that have unmatched r r correspondences. this highly conserved structure is consistent with its role in ntp recognition. an insertion in motif b at pv c -s-s is lethal ( ) . within the n-terminal segment of the hmb, the mutation in pv-k l results in small plaques ( ) . the structural position of this mutation (within the homomorph and upstream from the motif) is similar to that of rate-affecting mutations in the homomorphs of motifs g and a. at the c-terminal end of the homomorph, in bvdv (bvdv-f ), mutation of residues c , s and r to ala reduces primer-dependent rna elongation and abolishes de novo synthesis ( ) . motif c is not a component of a larger structurally conserved segment, but has the same key features of the other homomorphs. it is folded in a manner that places the apex of the fold at the wall of the template tunnel, and both the n-terminus and c-terminus at the exterior surface of the protein ( figure b and c) . 
therefore, motif c as defined in the literature comprises the homomorph. the absence of r r correspondence adjacent to hmc indicates that the structures of the adjacent sequence segments are highly specific to each species. hmc is highly conserved in the rdrps and highly similar to the dna-dependent polymerases. although there is a sequence inversion in the birnaviruses (motif c precedes motif a), figure b and c illustrate that despite the difference in sequence order, the homomorphs occupy a similar tertiary position. stralsv analysis indicates that the structure of motif c is highly conserved in the rna-dependent polymerases, though slightly different in the dna-dependent polymerases (figure a). motif c is part of the classic 'rrm-fold' that forms the core of the palm domain of all these polymerases (together with that part of motif a that forms a β-sheet with motif c).

experimental studies have demonstrated that several residues within motif c are sensitive to the position and composition of mutations. the highly conserved residues, gdd, occur near the center of the motif. the primary function of these residues is to coordinate the metal ions associated with the incoming rntp ( , ). in pv, mutation of d to e in either or both positions (d or d ) is lethal ( ). in hcv, mutation of g a is also lethal ( ). however, in birnaviruses the highly conserved residues are adn, rather than gdd, and mutation to gdd increases rna synthesis activity ( ). certain mutations immediately upstream from the gdd motif are lethal in pv: y [chims] ( ). this is similar to the effect of the l -i-s insertion at the downstream edge of motif a in pv. near the n-terminal end of motif c in hcv (hcv t ), the mutation d a characterizes chronic hepatitis ( ). mutation at the edge of a highly conserved structure seems to have a substantial effect on the viral life cycle.
the r r comparisons summarized in these structure maps identify the types of sequence variability that can occur while maintaining the same spatial structure (figure a), and they identify the variations in composition that can be tolerated even within a key functional motif. the r r correspondence of other rdrps with birnaviruses in motif c (figure a), despite the sequence inversion in birnaviruses (cab), supports the premise that conservation of structure is a significant, if not dominant, factor in evolution.

hmd is different from the other homomorphs in that it lies mostly on the surface of the protein (figure c). like the others, however, its terminal residues are located at a distinctive surface (figure d). in the case of hmd, they come from an opposite surface rather than the interior of the protein. the n-terminal segment of hmd is more conserved than the motif itself, which forms the c-terminal segment. the motion of motif d in the active state has not been captured by the existing structures in pdb ( ). therefore, the lack of r r correspondence in the motif may be a reflection of the limitations of the available structures.

residues within hmd perform varied functions. in pv, for example, polymerases form an extensive lattice system by polymerase-polymerase interactions; l and d , located within hmd, contribute to interface i of this lattice system ( ). the most highly conserved residue within the homomorph is a gly (pv g ) at the n-terminal edge of the motif and central to the homomorph; gly in this position would facilitate the folding of the homomorph, which is consistent with cameron et al.'s ( ) hypothesis that motif d may be the most dynamic structural element of rdrps and rts. another conserved residue is a lys near the c-terminal edge (pv-k ). residues equivalent to pv-k supply a proton to the nucleotidyl transfer reaction, which increases the rate constant for nucleotide addition by - to -fold ( ).
in pv, within the motif, the insertion t -t-m results in small plaques, likely due to delayed rna synthesis ( ) . in other homomorphs, mutations that affect the rate of synthesis occur more commonly outside of the motifs. immediately downstream from the homomorph, pv-t i is an attenuating mutation for the sabin vaccine ( ) . hme is $ amino acids in length ( figure a ), and well represented by all rna-dependent members of the sample set, except that no correspondence was found with hiv or tert. these two species, however, are structurally matched to each other. there is considerable sequence similarity between pv and denv within this homomorph, shown in the bottom segment of figure a . appleby et al. ( ) determined that motif e is unique to rna polymerases. hme forms part of the ntp entry tunnel and has a considerable amount of exposure on the surface of the protein. the species-specific loop within hme is at the outermost edge of the protein, a feature found in other homomorphs (g, f and a; figures a, a and a, respectively). huang et al. ( ) found that the motif e loop region acts as a pivot point for thumb subdomain movement upon template-primer binding. motif e may also function in the proper positioning of the thumb relative to the palm ( ) . the turn of the loop projects into the active site cavity where it has been implicated in helping to position the c-terminal end of the primer strand for attachment to the a-phosphate of the ntp during phosphoryl transfer ( ) . motif e in hcv plays a role in binding the priming nucleotide (not the incoming nucleotides) ( ) ; hcv has a longer loop ( figure a ), possibly related to this function. in pv, the c-terminal of hme (r ) emerges from the protein into the segment -safhyvfeg- . this segment contains residues f and f , which interact with w to maintain the polymerase structure ( ) . 
comparisons of the tertiary structures of the rdrps of viral species indicated that most of the highly conserved residues essential to polymerase function are embedded in large sequence segments that are highly conserved structurally, yet disparate in composition. we have named these conserved segments 'homomorphs' and have identified the composition and length of each homomorph that includes previously recognized polymerase motifs (table ). we have demonstrated that the rna polymerases have structural skeletons (frames) that are highly conserved, with flexible segments between them, and that extensive segments of structure similarity can be identified by the methods we have described. these methods are applicable to the studies of other groups of proteins, and we anticipate that by accessing structure similarity independent of sequence composition, skeletal frameworks will be found in other groups of proteins. additionally, after structure similarity is identified, differences between members of the group become readily apparent.

all of the homomorphs included residues that connect the template tunnel or the ntp entry tunnel with the outer surface of the protein. although some of the surface residues within these homomorphs have specific functional roles, as reported in the literature (see citations in previous paragraphs), we anticipate that they may all be important for polymerase function; the consistent occurrence of homomorphs embedding motifs, even when a defined sequence motif is small in size, suggests a structure-function relationship between the motif and its structurally conserved flanking regions. it would be interesting to explore the possibility that interactions at the surface of the protein (e.g. protein-protein contact at surface homomorph residues) may subtly affect function buried deep beneath, within the tunnel. furthermore, each homomorph is either divided by or is separated from another homomorph by a flexible secondary structure.
identification of the span of each homomorph and the terminal residue enables us to identify specific residues on the surface that would not, in many cases, be otherwise noticed. by comparing experimental data with the surface location of the ends of the homomorphs, we have found that these are often the sites of key functional interactions of the protein. a paper describing these sites is in preparation. we have compared the effects of currently recognized mutations within the motifs and within the homomorphs. most mutations within the motifs are function-specific, related to either a change in charge or size, and in most cases the mutations are lethal [ma ( ) , mb ( ) , mc ( ), ( ) ]. mutations outside of the motif (but within the homomorph) are more often rate related and located in a segment that bulges from the homomorph by an amount that varies by species (figures a, a and a ). these differences support the hypothesis that residues actively involved in template processing are essential to viability, and most of them are components of a consistent, stable structure that places and/or maintains them in their appropriate functional position. however, the practice of mutating residues to ala has resulted in a somewhat 'all or nothing' perspective of mutations. stralsv analysis can facilitate informed selection of alternative residues of various compositions, which could possibly affect replication rates to different extents. experiments involving this type of testing would enhance predictive models and may provide new insights for the design and development of medical countermeasures. the extension of all homomorphs from the template tunnel to the exterior of the protein was an unexpected finding. its universality in the polymerase family suggests a functional significance. residues within the homomorphs that were localized to the surface often had species-specific loops. 
these features have most likely not been identified previously because of the limitations of existing sequence and structure comparison tools, in particular their limited ability to perform multi-species comparisons of structures using overlapping windows of a size determined by the user, and to select the criteria for r r matches. the homomorphs as defined in this work add structural clarity and context to sequence-based functional motifs previously observed by numerous authors performing comparative studies among polymerases.

the structure maps created from the r r correspondences identified by the stralsv algorithm provided a unique and informative perspective of structure and function in rdrps. they readily identify unique regions of each species and those shared by proteins within a family. these are features that would be useful for studies of any protein family. based on the results of this study, it may be possible to define characteristic homomorphs for many other protein families, despite considerable sequence variation. it may be feasible to classify homomorphs in a manner analogous to the scop database, and in doing so provide new insight into protein evolution.

the stralsv algorithm simultaneously, rapidly and quantitatively identifies the similarities and differences of the structural components of multiple species and provides an output that facilitates the comparison of three-dimensional structure information. stralsv enabled us to cluster protein segments that have the same tertiary structure, independent of sequence variability. in a sense, it is an analog of blast, although based on structure rather than sequence. the precision of stralsv makes it easy to identify small differences between and within species. the ability to process multiple species at the same time can rapidly accelerate our understanding of differences between them.
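The idea of comparing structures through overlapping windows with a user-chosen size and match criterion can be sketched as follows. This is not the stralsv implementation, which this section does not describe in detail. The scoring used here (distance RMSD over Cα coordinates, which needs no superposition), the default window size, the cutoff and both function names are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def drmsd(window_a, window_b):
    """Distance RMSD between two equal-length lists of (x, y, z) points.

    Compares all intra-window pairwise distances, so the result does not
    depend on how either structure is rotated or translated.
    """
    pairs = list(combinations(range(len(window_a)), 2))
    total = 0.0
    for i, j in pairs:
        da = math.dist(window_a[i], window_a[j])
        db = math.dist(window_b[i], window_b[j])
        total += (da - db) ** 2
    return math.sqrt(total / len(pairs))


def window_matches(coords_a, coords_b, window=5, cutoff=1.0):
    """Flag residues covered by at least one similar overlapping window.

    coords_a, coords_b: residue-aligned lists of C-alpha coordinates
                        (hypothetical input; window must be >= 2).
    Returns one boolean per residue.
    """
    n = min(len(coords_a), len(coords_b))
    matched = [False] * n
    for start in range(0, n - window + 1):
        a = coords_a[start:start + window]
        b = coords_b[start:start + window]
        if drmsd(a, b) <= cutoff:
            for k in range(start, start + window):
                matched[k] = True
    return matched
```

Because the windows overlap, a residue is reported as matched if any window containing it is similar, so a locally divergent loop disrupts only the windows that include it.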
the identified structural associations may also facilitate the transfer of structure-related functional information among proteins. the traditional perspective of the relationship between the amino acid sequence of a protein and its tertiary structure has been that sequence determines structure. under this premise, sequence-based evolutionary studies and phylogeny would inherently incorporate structure. in this study of rdrps, we demonstrated that structure accommodates substantial sequence variability, and that highly diverse sequences can generate highly similar tertiary structures. structure-based phylogeny may provide new perspectives of protein evolution.

a.z. designed and developed the stralsv algorithm and performed calculations for all viral species in the study. c.z. and d.l. wrote codes and developed methods for post-processing of stralsv results, and performed literature searches for interpretation of biological significance of various residue positions. d.l.
performed the greater part of the detailed sequence comparisons and prepared the manuscript, with contributions from c.z. and a.z. all authors participated in the discussions and shaped the ideas that led to the experimental design and results of this work. all authors read and approved the manuscript.

conflict of interest statement. none declared.

key: cord- -m agj z authors: reddy, timothy e.; shakhnovich, boris e.; roberts, daniel s.; russek, shelley j.; delisi, charles title: positional clustering improves computational binding site detection and identifies novel cis-regulatory sites in mammalian gaba(a) receptor subunit genes date: - - journal: nucleic acids res doi: . /nar/gkl sha: doc_id: cord_uid: m agj z

understanding transcription factor (tf) mediated control of gene expression remains a major challenge at the interface of computational and experimental biology. computational techniques predicting tf-binding site specificity are frequently unreliable. on the other hand, comprehensive experimental validation is difficult and time consuming. we introduce a simple strategy that dramatically improves robustness and accuracy of computational binding site prediction. first, we evaluate the rate of recurrence of computational tfbs predictions by commonly used sampling procedures. we find that the vast majority of results are biologically meaningless. however, clustering results based on nucleotide position improves predictive power. additionally, we find that positional clustering increases robustness to long or imperfectly selected input sequences. positional clustering can also be used as a mechanism to integrate results from multiple sampling approaches for improvements in accuracy over each one alone. finally, we predict and validate regulatory sequences partially responsible for transcriptional control of the mammalian type a γ-aminobutyric acid receptor (gaba(a)r) subunit genes.
positional clustering is useful for improving computational binding site predictions, with potential application to improving our understanding of mammalian gene expression. in particular, predicted regulatory mechanisms in the mammalian gaba(a)r subunit gene family may open new avenues of research towards understanding this pharmacologically important neurotransmitter receptor system.
co-regulation is a basic mechanism to coordinately control expression of genes in modules, biochemical pathways and protein complexes ( ) ( ) ( ). in eukaryotes, expression is most often mediated by transcription factors (tfs) that bind upstream of the transcription start site (tss) and recruit the polymerase assembly ( ). tfs bind, with varying affinity, to a set of similar, short ( - nt) sequences ( ). computational binding site discovery focuses on finding significantly overrepresented sequences in upstream regions of co-regulated genes ( ) ( ) ( ). thus, computational transcription factor binding site (tfbs) prediction algorithms must begin with an input set of promoters from genes hypothetically co-regulated by a shared tf. the algorithms aim to predict the binding positions and consequently the nucleotide specificity of the tf ( ) ( ) ( ).

the first part of tfbs discovery, the input set, can be identified using either computational or experimental methods. experimental techniques, such as chromatin immunoprecipitation (chip) ( ), have been successfully used to generate a genome scale mapping of approximate tf-binding positions ( , , ). computational techniques, such as phylogenetic profiling ( , ) and artificial neural networks, can also be used to identify sets of co-regulated genes. both experimental and computational approaches, however, suffer from a significant false positive (fp) prediction rate. inclusion of extraneous promoters in the input sets dilutes the enrichment of the shared tfbs sequences, making computational tfbs discovery significantly more challenging ( ). we term such erroneously included promoters decoy sequences (dss).

after receiving a set of upstream regions co-regulated by a shared tf as input, computational methods aim to predict the binding positions of that tf ( ) ( ) ( ).
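As a rough illustration of what overrepresentation means in this setting, the sketch below counts how many promoters contain each k-mer and compares the input set with a background set. The function names, the pseudocount and the simple fold-enrichment score are illustrative assumptions. Published motif finders use more careful statistical models than this.

```python
from collections import Counter


def promoter_counts(promoters, k):
    """Count, for every k-mer, the number of promoters that contain it."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # A set ensures each promoter contributes at most once per k-mer.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


def enriched_kmers(inputs, background, k=6, pseudocount=1.0):
    """Rank k-mers by input frequency relative to background frequency.

    Returns a list of (kmer, fold_enrichment), highest first.
    The pseudocount keeps k-mers absent from the background finite.
    """
    fg = promoter_counts(inputs, k)
    bg = promoter_counts(background, k)
    scores = {}
    for kmer, n in fg.items():
        fg_rate = (n + pseudocount) / (len(inputs) + 2 * pseudocount)
        bg_rate = (bg[kmer] + pseudocount) / (len(background) + 2 * pseudocount)
        scores[kmer] = fg_rate / bg_rate
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input: three promoters share one hexamer; the background has none.
inputs = ["AATGACTCAA", "CCTGACTCGG", "GTTGACTCTT"]
background = ["AAAAAAAAAA", "CCCCCCCCCC", "GTGTGTGTGT"]
print(enriched_kmers(inputs, background)[0][0])  # TGACTC
```

Adding decoy promoters to `inputs` lowers the input frequency of the shared k-mer while leaving the background unchanged, which is the dilution effect described above.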
given a set of input promoters, motif detection algorithms identify a set of short oligonucleotide segments hypothesized to bind to the tf of interest. the predicted sequences can be used to construct a position weight matrix (pwm) representing the average nucleotide frequencies for each position in the site ( ) . ideally, computational detection will return all sequences that bind to every tf with biologically relevant function in those upstream regions. however, since the source of binding specificity for tfs is not well understood ( ) , heuristic approaches and ad hoc multiple alignment based scoring schemes are used to identify locally optimal solutions ( ) . each local optimum that exists in a given set of promoters may correspond to a distinctly different motif, and the optima may score differently relative to each other under different scoring schemes. binding site prediction algorithms are generally confounded by several factors: degeneracy in the binding site; the unknown length of the binding site; the relatively large length of promoters; and the inclusion of dss in the input sets ( , , ) . as a result, as few as % of predicted positions correspond to biologically functional binding sites ( ) . due, in part, to the low accuracy rate, computational binding site identification has been of limited use ( ) . problems identifying binding sites are further exacerbated in mammalian genomes by larger promoter regions ( ) and scarcity of reliable information on co-regulation of genes. thus, the most demanding test of efficacy for tfbs identification approaches lies in their application to mammalian systems and subsequent validation of predictions. because of the computational complexity of the problem, gibbs sampling is often used to identify binding positions ( ) . in this paper, we present a new strategy that clusters gibbs sampling results at each input nucleotide, a technique we term positional clustering, to improve accuracy of predicted tf binding.
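as an illustration of the pwm construction mentioned above, the following minimal sketch computes per-position nucleotide frequencies from a set of equal-length predicted sites. the function name `position_weight_matrix` and the toy sites are assumptions made for this example; they are not taken from the study, and no pseudocounts or background correction are applied.

```python
from collections import Counter


def position_weight_matrix(sites, alphabet="acgt"):
    """Return one {base: frequency} dict per position of the aligned sites.

    Hypothetical helper: assumes all sites are already aligned and have
    the same length.
    """
    if not sites:
        raise ValueError("need at least one site")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for column in zip(*sites):  # iterate over alignment columns
        counts = Counter(column)
        pwm.append({base: counts[base] / len(sites) for base in alphabet})
    return pwm


# toy sites for illustration only, not data from the paper
sites = ["tgaaaca", "tgaaaca", "tgaaacg", "agaaaca"]
pwm = position_weight_matrix(sites)
```

each column of the result sums to one, so the matrix can be compared directly with another frequency matrix of the same alphabet.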
we evaluate the efficacy of our approach using known examples of binding and regulation in yeast and by experimentally testing predicted tf-binding sites upstream of the subunit genes coding for the heteromeric mammalian neurotransmitter receptor system, the type a γ-aminobutyric acid receptor (gaba a r). the gaba a r is the major inhibitory neurotransmitter receptor in the central nervous system (cns) ( , ) with important roles in development ( , ) and disease ( ) ( ) ( ) . the receptor is believed to be a pentamer made up of multiple subunits that come from at least four different subunit classes (α, β, γ and δ) ( ) . at least genes code for the various subunits that differentially combine to form numerous pharmacologically distinct gaba a receptor isoforms ( , ) . isoform utilization depends in part on the relative abundance of the subunits, which may change under various conditions ( ) ( ) ( ) . understanding subunit regulatory mechanisms may provide insight into gaba a receptor isoform usage and related phenotypes ( ) . in the current study, we test the ability of positional clustering to detect known tf-binding sites in a series of increasingly noisy sets of yeast promoters, and find marked improvement in the percentage of correct predictions over gibbs sampling alone. we also present de novo predictions of tf-binding sites in promoter regions of gaba a receptor subunit genes (gabrs) whose expression is altered (either up-regulated or down-regulated) in an animal model of temporal lobe epilepsy ( ) . positional clustering identified a number of putative cis-regulatory sites, many of which correspond to known binding elements for tfs found in the cns. mobility shift assays showed that several predicted gabr-binding sequences specifically bind nuclear proteins derived from primary neocortical neurons kept in culture.
furthermore, a particular non-consensus gabr putative regulatory sequence was shown to have a functional role in cultured cortical neurons demonstrating the efficacy of positional clustering in detecting functional regulatory elements in mammals. we identified s.cerevisiae genes predicted at high confidence (p < . ) to be regulated by the tf ste in ypd growth media, according to whole-genome tf location data ( ) . for the identified genes, we collected upstream intergenic promoters. intergenic regions were truncated at kb upstream of the gene's tss. we selected for study a set of six gabrs: gabra , gabra , gabrb , gabrb , gabrd and gabre. promoters were extracted for each gene, including two alternative first exons of the gabrb ( ), giving a set of seven promoters. the length of each promoter was: gabra , bp; gabra , bp; gabrb , bp; gabrb (exon ), bp; gabrb (exon a), bp; gabrd, bp; and gabre, bp. we augmented the input set with orthologous promoters from rat, with the exception of gabrb for which an orthologous gene from mouse was used. in total, promoters upstream of six gabrs were selected for analysis. for a given input set of promoters, we ran the gibbs sampler bioprospector ( ) - times, evenly distributed across all motifs widths from - bp. we used a third-order background model derived from appropriate genomic promoters. we collected the best three results from each bioprospector run. we counted the number of times bioprospector identified each nucleotide in the input set. for each promoter, we identified the maximally occurring nucleotide, and extracted all positions identified by bioprospector > % of the maximum. we clustered together neighboring positions into putative tfbs. as a dust filter, we removed all putative tfbss < bp long ( figure ). for sets of s.cerevisiae promoters, we used results from bioprospector runs in our evaluation. 
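the counting, thresholding, clustering and dust-filtering steps described above can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' implementation: the function name `positional_clusters`, the half-open `(start, end)` hit format, and the parameters `fraction_of_max` and `min_length` are hypothetical, and the actual cutoff values used in the study are not reproduced here.

```python
def positional_clusters(hits, promoter_length, fraction_of_max, min_length):
    """Cluster sampler hits on one promoter by nucleotide position.

    hits: iterable of (start, end) half-open intervals, one per motif
          instance reported by a sampling run (assumed input format).
    fraction_of_max: keep positions hit more often than this fraction
          of the most frequently hit position (hypothetical parameter).
    min_length: dust filter; clusters shorter than this are discarded.
    """
    # count how many results cover each nucleotide
    counts = [0] * promoter_length
    for start, end in hits:
        for pos in range(max(start, 0), min(end, promoter_length)):
            counts[pos] += 1

    peak = max(counts, default=0)
    if peak == 0:
        return []
    cutoff = fraction_of_max * peak

    # merge neighboring retained positions into putative binding sites
    clusters, run_start = [], None
    for pos, count in enumerate(counts + [0]):  # trailing 0 closes a final run
        if count > cutoff and run_start is None:
            run_start = pos
        elif count <= cutoff and run_start is not None:
            if pos - run_start >= min_length:  # dust filter
                clusters.append((run_start, pos))
            run_start = None
    return clusters


# toy hits for illustration only
toy_hits = [(2, 8), (3, 9), (3, 9), (4, 10), (15, 18)]
kept = positional_clusters(toy_hits, promoter_length=20,
                           fraction_of_max=0.5, min_length=5)
```

because the counts are taken per nucleotide, results from runs with different motif widths can be pooled without aligning the motifs to each other.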
for gabrps, we considered all non-empty subsets of the seven promoters (orthologous sequences were always considered together). we used results from bioprospector runs, evenly distributed across all promoter subsets, in our analysis. in addition to dust filtering, we required putative tfbss to occur both in the human and in the orthologous rodent promoter. we used positive predictive value (ppv) to evaluate ste -binding site predictions. we classified predictions as true positive (tp) or false positive (fp) by comparison to the ste -binding motif, tgaaaca, as determined by ( ) . for each sequence, we calculated distance from the known ste pwm using a modified local ungapped sequence alignment similar to that in ( ) . alignments were scored as the sum of pearson's correlation coefficient between prediction x and the ste pwm across all aligned positions. thus, scores ranged from zero, with no positions aligned, to seven, the length of the ste pwm. we observed a bimodal distribution of scores (supplementary figure s ) , and chose the alignment score corresponding to the minimum of the distribution (alignment score = . ) as the threshold to classify predictions as tp or fp. we complemented the seed set of ste -bound promoters with - randomly chosen yeast promoters. we performed our motif detection procedure on each input set, and compared the ppv of putative tfbs with that of raw bioprospector results ( figure , solid lines). to evaluate the background rate of ste -binding site recovery, we created a seed set of randomly chosen s.cerevisiae promoters. we evaluated the percentage of ste -like binding sites identified in the random seed set, as well as in versions of the seed set augmented with - randomly chosen yeast promoters (figure , dashed lines) . for additional yeast evaluations (hap , tec , yap and ydr c), we substituted for bioprospector an in-house implementation of the bioprospector algorithm.
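the alignment score described above (a sum of per-position pearson correlations over an ungapped alignment) can be sketched as follows. this is one possible reading, not the authors' code: it assumes that both the prediction and the reference are given as lists of four-element frequency columns, that every ungapped offset is tried and the best sum is kept, and that a column with zero variance contributes zero. the names `pearson`, `one_hot` and `best_ungapped_score` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def one_hot(seq, alphabet="acgt"):
    """Turn a sequence into frequency columns with all weight on one base."""
    return [[1.0 if base == letter else 0.0 for letter in alphabet] for base in seq]


def best_ungapped_score(pred, ref):
    """Best sum of column correlations over all ungapped offsets.

    Starts at 0.0, the score of an alignment with no positions aligned.
    """
    best = 0.0
    for offset in range(-(len(pred) - 1), len(ref)):
        score = 0.0
        for i, column in enumerate(pred):
            j = i + offset
            if 0 <= j < len(ref):
                score += pearson(column, ref[j])
        best = max(best, score)
    return best


reference = one_hot("tgaaaca")  # motif string quoted in the text
same = best_ungapped_score(one_hot("tgaaaca"), reference)
different = best_ungapped_score(one_hot("ccccccc"), reference)
```

under these assumptions a perfect match to a seven-column reference scores seven, consistent with the score range stated in the text.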
comparisons of results from each implementation show the two implementations to be approximately equivalent. we ran motifscanner ( ) to search gabr promoters for all vertebrate tf-binding motifs found in transfac ( ) . for each promoter analyzed, we used a prior probability of . and the corresponding organism specific third-order promoter background model from eukaryotic promoter database (epd) ( ) .we considered positional overlap between motif-scanner predictions and putative tfbss indicative of known binding motifs in our predictions. double-stranded oligonucleotides for emsa contained the following sequences: nuclear extracts were prepared ( ) and used for gel shift analysis after concentration (microcon no. columns, amicon, ma). quantification was performed on emsas under conditions that yield a standard curve for band intensity. single-stranded sense and antisense phosphorothioate oligonucleotides for the predicted ggcggcgtgcacacacacgc-ccaccgcgg binding site are annealed by boiling sense and antisense oligonucleotides for min at equal molar ratios in dh o. oligos are then cooled to room temperature and placed on ice. transfections using dotap (roche)/hepes solutions are performed with oligonucleotides corresponding to wildtype, mutant or with dotap (roche)/hepes solution lacking oligonucleotides (mock) as described in ( ) . effects of oligonucleotide application to neurons are assessed by real-time rt-pcr. since tfbs are predicted computationally by local optimization strategies, we evaluate the extent to which one of these strategies, gibbs sampling, identifies the same set of segments in repeated runs using the same input data. identifying stably recurring motifs requires clustering of related results which, in turn, requires definition of 'related'. sequence similarity based clustering is impaired by the combination of sequence variation within motifs, the short length of tf-binding sites, and aligning motifs of different lengths. 
instead of using sequence based clustering, we chose to cluster results by position, counting the number of times gibbs sampling identifies each nucleotide in the promoter (figure ). we find that gibbs sampling predictions, generated using bioprospector ( ) , are power-law distributed over nucleotide position (supplementary figure s ) . gibbs sampling converges on the majority of nucleotides very infrequently, and on a small number of nucleotides very frequently. thus, the most frequently recurring nucleotides appear in as few as % of results. moreover, we find the power-law distribution of results is robust to the choice of gibbs sampling algorithm and scoring scheme (data not shown). we hypothesize that the most frequently occurring positions are the most biologically significant. thus, discarding the least frequent gibbs sampling results may yield higher accuracy and robust identification of biologically significant positions. as a preliminary test of the above hypothesis, we applied repeated runs of gibbs sampling to a set of s.cerevisiae promoters enriched in ste binding as identified by whole-genome chip-chip experiments ( ) . we used positional clustering of results to identify the most frequently recurring positions (see methods). incorporation of additional results did not significantly alter the distribution of results (data not shown). we chose ste because it is one of the best studied tfs, with a well known, highly conserved and experimentally well-defined binding motif ( , ) . the most frequently recurring positions were compared with the known ste binding motif ( ) . we classified predictions into two categories: true positive (tp) if they resemble the experimentally identified ste -binding motif, and false positive (fp) otherwise (see methods). finally, we calculated the positive predictive value (ppv) as ppv = tp/(tp + fp).
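the ppv calculation above amounts to a few lines of arithmetic. the sketch below assumes that each prediction already carries an alignment score against the known motif and that a score at or above a chosen threshold counts as a true positive; the function name `ppv_from_scores`, the threshold and the toy scores are hypothetical and are not values from the study.

```python
def ppv_from_scores(scores, threshold):
    """Positive predictive value, tp / (tp + fp), from alignment scores.

    A prediction is counted as a true positive when its score is at or
    above `threshold` (assumed convention), and as a false positive otherwise.
    """
    if not scores:
        raise ValueError("ppv is undefined when there are no predictions")
    true_positives = sum(1 for score in scores if score >= threshold)
    false_positives = len(scores) - true_positives
    return true_positives / (true_positives + false_positives)


# toy scores for illustration only
example_ppv = ppv_from_scores([6.5, 1.0, 5.0, 0.5], threshold=4.0)
```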
we find that positional clustering and subsequent selection of frequently recurring nucleotides improved the ppv of the ste binding site by at least % over gibbs sampling alone (figure ). to validate that the above results were not specific to the number of input promoters, the ste -binding motif, or the particular gibbs sampling implementation, we repeated the above prediction process for promoters predicted to bind to yap , tec , hap and ydr c. we also repeated the analysis replacing the original gibbs sampling procedure with our own implementation and motifsampler ( ) . in all cases, we found positional clustering significantly improves on results over local optimization procedures alone ( figure ). computational discovery of tfbs can have two types of fp predictions. one type is the identification of an incorrect motif from a set of upstream regions known to bind to a tf of interest as described above (see methods). the second type of fp error is the background discovery rate of the correct motif using upstream regions that do not bind to the tf. to simulate this rate for ste -like binding site recovery we repeated the analysis as described above starting with randomly chosen yeast promoters. we find that positional clustering identifies ste -like sites in < % of results, compared with - % for gibbs sampling alone. thus, using positional clustering, the performance of computational motif discovery is enhanced not only by improving the positive predictive value in promoters of genes co-regulated by ste , but also by decreasing the false discovery of ste -like sites by %. next, we evaluated the effect of adding dss on the performance of gibbs sampling with and without positional clustering. addition of dss dilutes enrichment of the tf-binding site in the input set, making motif detection more challenging ( , ) . 
modeling dss, we repeated our estimate of ppv of tfbs detection with the addition of - random yeast promoters (dss) to the original set of ste -bound promoters. we found that positional clustering improves the ppv of gibbs sampling by > % through the addition of up to % noise or dss (figure , supplementary figure s ). additionally, results of gibbs sampling both with and without positional clustering decay linearly with the addition of decoys [r ¼ . and . , respectively (supplementary figure s ) ]. extrapolating, we predict positional clustering will maintain an improved ppv through the addition of > % noise or dss. to address issues of generality, we repeated the procedure on additional sets of s.cerevisiae promoters (yap , tec , hap and ydr c). an added benefit is that we can evaluate the effect of information content of the binding motif and number of promoters on the improvement from positional clustering ( ) . repeating the analysis, we again find that independently of the set or sampling procedure, positional clustering improves accuracy through a broad range of random dss (figure ). improvement appears to be limited and unreliable only when sampling alone correctly identifies the binding site in fewer than % of results. this result is consistent with our analysis of ste -bound promoters (figure ) , and may correspond to a lower limit for the efficacy of positional clustering. recently, researchers have noted that complementary motif detection approaches can be used together to predict binding sites more effectively than either method alone ( ) . with this in mind, we evaluated positional clustering in terms of its ability to combine results from two different sampling implementations. for each dataset, an equal number of results from each approach were combined into a single dataset, and positional clustering was used to predict binding sites as described above ( figure c ). 
we measured the average percent change in ppv for each tf on each dataset, and found positional clustering improved combined sampling by % compared with % and % improvement for bioprospector and motifsampler, respectively. additionally, clustering combined sampling improved of the datasets evaluated, whereas clustering of bioprospector and motifsampler results improved and datasets, respectively. thus, positional clustering is an effective mechanism to integrate results from multiple sampling procedures. identification of gabr cis-regulatory sequences as described above in introduction, identifying functional tfbs in mammals is difficult due in part to inclusion of decoy sequence from long upstream regions and lack of information on co-regulation of genes. positional clustering, as shown above, is more robust to noisy input than gibbs sampling alone, and thus may be better suited to identify de novo cis-regulatory elements in mammalian promoters that are coordinately regulated. to test this possibility, we chose seven mammalian gabr promoters (gabrps) whose activity is potentially altered in response to status epilepticus as identified through change in mrna levels of the gene products ( ) . for each set, the initial promoters were analyzed using gibbs sampling with positional clustering (solid triangles) and without (open triangles). two gibbs sampling approaches were applied to each dataset: a gibbs sampler procedure similar to bioprospector ( ) (row a), and motifsampler ( ) (row b). row c shows the combination of both sampling procedures, along with positional clustering of the combined results. x-axis counts over addition of dss. we evaluated the positive predictive value of each technique on each dataset, and found positional clustering generally improved the ppv through addition of % random dss. ( , ) . we also included orthologous rodent promoters in the input sets ( ) . 
orthologous promoters were included to provide more instances of binding sites in the input set than would be expected by random, allowing for easier detection of the sites. inclusion of orthologous promoters has the additional effect of selectively amplifying evolutionarily conserved binding sites. such binding sites are more likely to have major functional roles in the regulation of the gabr receptor. thus, sensitivity to such sequences is improved at the expense of sensitivity to species-specific binding sites. with this effect in mind, we require all gabr-binding site predictions to exist in orthologous promoters. since the mechanisms of co-regulation for the seven gabrs are unknown, hypothetical co-regulation models were evaluated by querying all possible subsets of the seven gabrps. clustering results on nucleotide positions and selecting the most frequently occurring positions, we predicted functional tf-binding sites. predictions were compared with instances of known binding motifs from transfac ( ) , and of the predictions ( . %) resembled known binding sites for tfs (table ) . of the tfs, have been identified in the cns of rodents: sp- ( ); ap- , tst- (pou f ), oct- (pou f ), olf- ( ); cp- ( ); and rreb- ( ) . furthermore, previous analyses of gabr promoter regions agree with our predictions that assign putative regulatory roles to sp- , oct- , olf- in the regulation of gabrs ( ) . we chose to validate novel motif predictions with emsas and functional studies in primary cultured neurons. emsa ( ) was performed with an excess of cold competitors to define specificity of protein binding in nuclear extracts derived from primary neocortical neurons and fibroblasts (fibs) kept in culture. as shown in figures - , out of six predicted binding sites found upstream of the (a, b, g and d) subunit genes, four (gabra , gabrb , gabrb and gabrd) displayed specific binding. 
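the enumeration of hypothetical co-regulation models mentioned above (every non-empty subset of the seven gabr promoters) can be sketched with the standard library. the promoter labels below are placeholders, not the gene names from the study, and the helper name `non_empty_subsets` is an assumption for this example.

```python
from itertools import combinations


def non_empty_subsets(items):
    """Yield every non-empty subset of items as a tuple, smallest first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


# placeholder labels standing in for the seven promoters
promoters = [f"promoter_{i}" for i in range(1, 8)]
subsets = list(non_empty_subsets(promoters))
```

seven promoters give 2^7 - 1 = 127 candidate input sets, which is small enough to sample every subset exhaustively.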
in addition to specific binding of neuronal extracts to novel gabra motifs, we have evidence for specific binding using fib extracts ( figure a and b) . this is of particular interest given that the expression of gabrs is restricted to the nervous system and that repressors such as the re -silencing transcription factor (rest) ( , ) expressed in non-neuronal cells have been implicated in the neural specificity of gene expression. clearly, protein binding to dna does not always necessitate regulatory function. to begin to address the functional significance of our predicted regulatory motifs, we evaluated the effects of transfecting neurons with double-stranded oligonucleotides containing one of the gabra novel binding motifs (dsa o), as described above. gabra is especially interesting given that it is regulated by brain derived neurotrophic factor (bdnf) after status epilepticus ( , ) . transfection with the dsa o produced a significant downregulation of gabra gene expression in neocortical neurons as monitored by quantitative real-time rt-pcr with no change after mock transfection or transfection with a dso containing three copies of a camp regulatory element (cre) (figure ).
table . positional clustering based predictions of transcriptional regulatory sequences upstream of gabrs. in total, we predict orthologous pairs of regulatory sequences, representing unique sequences. comparing with known mammalian binding motifs, eight of the predictions show similarity to previously characterized tfbs, as indicated. where no known binding motif exists, the corresponding in vitro emsa and functional assay, if applicable, is indicated. similar predictions are grouped together and aligned by hand.
how reliable are the binding site predictions returned by gibbs sampling based tfbs identification algorithms? we began by evaluating the stability of binding site predictions via repeated runs of gibbs sampling.
to quantify the robustness of predictions, we counted the number of gibbs sampling results at each nucleotide position in the input set ( figure ) over a large number of repeated trials. we find that the most frequently returned positions better predict tf binding sites than the maximally scoring motifs from gibbs sampling (figures and ). since scoring functions are empirically derived and do not necessarily represent biological reality, the result is not altogether unexpected ( ) . however, we find that selecting frequently recurring positions allows filtering of up to % of spurious sampling results caused by convergence on biologically uninformative local minima. positional clustering allows unbiased aggregation of results from different motif widths, thus approximating the width of the binding site 'for free' ( ) . next we show that positional clustering improves robustness to the addition of dss (figures and ) . such sequences arise from inclusion of promoter regions in input sets without direct binding to the tf either due to experimental error or computational mis-annotation ( , ) . in the ste example studied, linear regression models indicate our approach will maintain an advantage over traditional gibbs sampling through addition of up to % noise to the original signal (supplementary figure s ) . empirical data, however, show a sharp decrease in improvement close to the addition of dss, or roughly double the input set ( figure ). moreover, evaluations using promoters co-regulated by other tfs indicate positional clustering is less likely to improve predictions when gibbs sampling identifies a correct site in < % of repetitions ( figure ). thus, it is possible the rather simplistic linear model overestimates improvement in robustness beyond what is practically achievable.
figure . double-stranded oligonucleotide functional assay for gabra regulation. primary cultures of rat neocortical neurons were treated with dotap (n-[ -( , -dioleoyloxy)propyl]-n,n,n-trimethylammonium methylsulfate) alone (mock) or with dotap and phosphorothioate oligonucleotides from either a camp response element (cre decoy) or a sequence from the gaba-a promoter predicted using positional clustering (gaba-a decoy) (gtgcacacacacgcccaccgcggctcggg). mrna was harvested after h, and real-time rt-pcr was performed with gaba-a specific primers. error bars refer to individual experiments; i.e. different platings of cells from different animals. data were normalized to rrna levels, and expressed as relative mrna levels (gaba-a /rrna). results are shown as mean ± sem, n = ; asterisk indicates significantly different from control at the % confidence interval.
figure . three putative tf-binding sites form dna-protein complexes in neocortical and fibroblast nuclear extracts, as shown by emsa. neocortical (neo) and fibroblast (fib) nuclear extracts from e rat embryos were incubated with three p-radiolabeled probes from human a and d receptor subunits. cold wild-type oligonucleotides were used to define specificity through competition. cold oligonucleotides were added at -fold excess over probe.
moreover, when multiple motifs exist in the input promoters, preliminary evidence suggests positional clustering will uniquely identify a single dominant motif (supplementary figure s ) . with further refinement, however, it may be possible to recover subordinate motifs, enabling identification of cis-regulatory modules. in spite of these limitations, using positional clustering of repeated runs, researchers can successfully apply sampling algorithms in identification of functional binding sites in datasets with a significant proportion of noise. computational prediction of tf binding in mammalian genomes poses just such a challenge due to increased decoy sequence in large upstream regions ( ) .
thus, having established increased robustness to dss in yeast, we applied our approach to identify potentially unknown gaba a receptor subunit gene regulatory sequences that may participate in the response of the genome to seizure activity. we reasoned that gaba a receptor subunit genes either up-regulated or down-regulated in the animal model of epilepsy would share common binding motifs. using positional clustering, we predicted tf-binding sites upstream of gaba a receptor subunit genes ( table ) . twelve of our predictions were verified by either comparison to known binding sites or experimental verification using in vitro binding assays. initially positive experimental results highlight the ability of computational techniques to direct research into transcriptional regulation in mammalian models. as such, our approach may be applicable in the study of other protein complexes in the mammalian proteome. the reported predictions may enable pharmacologically important downstream research. for example the predicted sites can be used as a starting point for quantifying in vivo effect on downstream transcription; for identifying the tfs bound; and even for the more complex task of understanding the role of each site in determining the relative abundance of gaba a receptor isoforms. research along these lines may dramatically improve our understanding of gaba a receptor regulation and its role in disease and development. additionally, a more comprehensive evaluation of the remaining gaba a receptor subunit genes may reveal additional tfbinding sites that uncover the evolutionary significance of g-a-b gabr clusters in the human genome. supplementary data are available at nar online. charles delisi is partially supported by nih grants a pogm a and j - . daniel s. roberts is supported by nih training grant t gm . shelley j russek is supported by nih/ninds grant ns . 
funding to pay the open access publication charges for this article was provided by the boston university bioinformatics program. regulation of genes encoding subunits of the trehalose synthase complex in saccharomyces cerevisiae: novel variations of stre-mediated transcription control? mol combinatorial control required for the specificity of yeast mapk signaling synexpression groups in eukaryotes consensus patterns in dna finding dna regulatory motifs within unaligned noncoding sequences clustered by whole-genome mrna quantitation from the cover: building a dictionary for genomes: identification of presumptive regulatory sites by statistical analysis bioprospector: discovering conserved dna motifs in upstream regulatory regions of co-expressed genes an improved map of conserved regulatory sites for saccharomyces cerevisiae transcriptional regulatory code of a eukaryotic genome practical strategies for discovering regulatory dna sequence motifs in vivo cross-linking and immunoprecipitation for studying dynamic protein:dna associations in a chromatin environment genomic binding sites of the yeast cell-cycle transcription factors sbf and mbf transcriptional regulatory networks in saccharomyces cerevisiae assigning protein functions by comparative genome analysis: protein phylogenetic profiles identification of functional links between genes using phylogenetic profiles scoring functions for transcription factor binding site prediction detecting subtle sequence signals: a gibbs sampling strategy for multiple alignment dna binding sites: representation and discovery protein-dna binding specificity predictions with structural models analysis of computational approaches for motif discovery assessing transcription factor motif drift from noisy decoy sequences assessing computational tools for the discovery of transcription factor binding sites going the distance: a current view of enhancer action basic neurochemistry: molecular, cellular, and medical aspects, th edn ). 
( ) imitators of epilepsy coronavirus main proteinase ( clpro) structure: basis for design of anti-sars drugs mice lacking the major adult gabaa receptor subtype have normal number of synapses, but retain juvenile ipsc kinetics until adulthood gabaa receptors: building the bridge between subunit mrnas, their promoters, and cognate transcription factors gabaergic mechanisms in epilepsy egr stimulation of gabra promoter activity as a mechanism for seizure-induced up-regulation of gaba(a) receptor alpha subunit expression developmental changes in gaba receptor subunit composition within the gonadotrophin-releasing hormone- neuronal system alterations in gabaa receptor occupancy occur during the postnatal development of rat purkinje cell but not granule cell synapses selective changes in single cell gaba(a) receptor subunit expression and function in temporal lobe epilepsy development of subtype selective gabaa modulators a strong promoter element is located between alternative exons of a gene encoding the human gamma-aminobutyric acid-type a receptor beta subunit (gabrb ) searching databases of conserved sequence regions by aligning protein multiple-alignments toucan: deciphering the cis-regulatory logic of coregulated genes the transfac system on gene expression regulation the eukaryotic promoter database (epd) cell-specific helix-loop-helix factor required for pituitary expression of the pro-opiomelanocortin gene the yeast ste protein binds to the dna sequence mediating pheromone induction a gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes human-mouse genome comparisons to locate regulatory sites developmental expression of sp in the mouse mouse brain organization revealed through direct genome-scale tf expression analysis characterization of the genomic structure, chromosomal location, promoter, and development expression of the alpha-globin transcription factor cp rreb- , a novel zinc finger protein, is involved 
in the differentiation response to ras in human medullary thyroid carcinomas electrophoretic mobility shift assay regulation of neuronal traits by a novel transcriptional complex rest and its corepressors mediate plasticity of neuronal gene chromatin throughout neurogenesis brain-derived neurotrophic factor (bdnf)-induced synthesis of early growth response factor (egr ) controls the levels of type a gaba receptor{alpha} subunits in hippocampal neurons finding functional sequence elements by multiple local alignment conflict of interest statement. none declared. key: cord- -uzzuoy v authors: naito, yuki; ui-tei, kumiko; nishikawa, toru; takebe, yutaka; saigo, kaoru title: sivirus: web-based antiviral sirna design software for highly divergent viral sequences date: - - journal: nucleic acids res doi: . /nar/gkl sha: doc_id: cord_uid: uzzuoy v sivirus () is a web-based online software system that provides efficient short interfering rna (sirna) design for antiviral rna interference (rnai). sivirus searches for functional, off-target minimized sirnas targeting highly conserved regions of divergent viral sequences. these sirnas are expected to resist viral mutational escape, since their highly conserved targets likely contain structurally/functionally constrained elements. sivirus will be a useful tool for designing optimal sirnas targeting highly divergent pathogens, including human immunodeficiency virus (hiv), hepatitis c virus (hcv), influenza virus and sars coronavirus, all of which pose enormous threats to global human health. rna interference (rnai) is now widely used to knockdown gene expression in a sequence-specific manner, making it a powerful tool not only for studying gene function, but also for therapeutic purposes, including antiviral treatments ( ) ( ) ( ) ( ) . currently, the replication of a wide range of viruses can be inhibited successfully using rnai, with both short interfering rnas (sirnas) and sirna expression vectors ( ) . 
in mammalian rnai, the efficacy of each sirna varies widely depending on its sequence; only a limited fraction of randomly designed sirnas is highly effective. many experiments have been conducted to clarify possible sequence requirements of functional sirnas. of these, our work incorporates guidelines from three major studies ( - ) for selecting functional sirnas. however, designing functional sirnas that target viral sequences is problematic because of their extraordinarily high genetic diversity. for example, about entries of near full-length sequences of hiv- group m, which is largely responsible for the global pandemic, are stored in the sequence databases, but it proved impossible to select a common mer from among all of them. moreover, rnai-resistant viral mutants achieved through point mutation or deletion emerge rapidly when targeting viruses in cell culture. these problems suggest a strong need to select highly conserved target sites for designing antiviral sirnas. furthermore, the off-target silencing effects of sirna are also a serious problem that could affect host gene expression ( ) . off-target silencing effects arise when an sirna has sequence similarities with unrelated genes. in antiviral rnai, it is desirable to minimize off-target effects against human genes. consequently, only a limited fraction of mers is suitable for use as antiviral sirnas. in this study, we developed a novel web-based online software system, sivirus, which provides functional, off-target minimized sirnas targeting highly conserved regions of divergent viral sequences. highly conserved sirna sequences are selected based on their degree of conservation, defined as the proportion of viral sequences that are targeted by the corresponding sirna, with complete matches (i.e. / matches). all possible sirna candidates targeting every other position of user-selected viral sequences are generated and their degrees of conservation are computed.
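the degree of conservation defined above can be illustrated with a minimal python sketch. this is not the sivirus implementation: the function names, the `target_length` parameter and the toy sequences are assumptions made for illustration, and a complete match is modelled simply as an exact substring match.

```python
from typing import Dict, List


def degree_of_conservation(sirna_target: str, viral_sequences: List[str]) -> float:
    """Fraction of viral sequences that contain a complete match to the target."""
    if not viral_sequences:
        return 0.0
    hits = sum(1 for seq in viral_sequences if sirna_target in seq)
    return hits / len(viral_sequences)


def rank_candidates(viral_sequences: List[str], target_length: int) -> Dict[str, float]:
    """Enumerate every window of `target_length` (hypothetical parameter) and score it."""
    candidates = set()
    for seq in viral_sequences:
        for start in range(len(seq) - target_length + 1):
            candidates.add(seq[start:start + target_length])
    scores = {c: degree_of_conservation(c, viral_sequences) for c in candidates}
    # Most conserved candidates first; ties are broken alphabetically.
    return dict(sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])))


# Toy data, not real viral sequences.
toy_sequences = ["ACGTAC", "ACGTTT", "GGGTAC"]
ranking = rank_candidates(toy_sequences, target_length=3)
```

in practice the candidate list would then be filtered by the efficacy guidelines and by the off-target search, which this sketch does not attempt to model.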
users can arbitrarily specify a set of viral sequences for the computation; e.g. sequences can be selected from a specific geographic region(s) or a specific genotype(s) to design the best sirnas tailored to specific user needs. sivirus also accepts users' own sequences in a multi-fasta format and shows whether each sirna can target the posted sequences. off-target searches were performed for each sirna using sidirect ( , ) . sivirus shows the number of off-target hits within two mismatches against the non-redundant database of human transcripts ( ) . currently, sivirus incorporates viral genome sequences of hiv- , hcv, influenza a virus and sars coronavirus. these sequences were downloaded from the los alamos hiv sequence database (http://hiv-web.lanl.gov/), the los alamos hcv sequence database ( ), the ncbi influenza virus sequence database (http://www.ncbi.nlm.nih.gov/genomes/flu/flu.html), and ncbi genbank ( ), respectively. sivirus will be updated continuously as these databases are revised. we also plan to incorporate other viruses if sufficient numbers of their sequences are available. to design anti-hiv sirna, we analyzed the near full-length hiv- sequences listed in the supplementary data. conserved sirnas constituted only . % of the possible sirnas if > % conservation is expected (figure a). the fraction is still as small as . % even if the threshold of the conservation is relaxed to %. on the other hand, sirnas predicted to be functional by one or more guidelines ( ) ( ) ( ) constituted . % of the sirnas (figure b). taken together, sirnas that are > % conserved and satisfy at least one guideline constitute only . % of the sirnas. in this condition, - sirnas can be designed for each full-length sequence of hiv- . these results indicate that most randomly designed sirnas are not suited for targeting hiv- efficiently. figure c shows typical output from sivirus for designing anti-hiv sirnas.
a total of sequences from hiv- subtypes b, c and crf _ae, which are the most prevalent hiv- genotypes circulating in asia, were selected. the results were sorted by their degree of conservation, and filtered to display sirnas that satisfy at least one efficacy guideline. the off-target search results against human genes are also shown. it is desirable to select an sirna that has fewer off-target hits. to test the validity of sivirus, sirnas satisfying the guideline by ui-tei et al. ( ) were designed against the conserved regions of hiv- genomes using sivirus and were assayed for inhibition of viral replication. among them, sirnas effectively inhibited hiv- replication by > % when each sirna duplex was transfected at nm (y. naito, k. ui-tei, k. saigo and y. takebe, unpublished data). potent and specific genetic interference by double-stranded rna in caenorhabditis elegans. revealing the world of rna interference. unlocking the potential of the human genome with rna interference. induction and suppression of rna silencing: insights from viral infections. antiviral rnai therapy: emerging approaches for hitting a moving target. guidelines for the selection of highly effective sirna sequences for mammalian and chick rna interference. rational sirna design for rna interference. an algorithm for selection of functional sirna sequences. noise amidst the silence: off-target effects of sirnas? sidirect: highly effective, target-specific sirna design software for mammalian rna interference. accelerated off-target search algorithm for sirna. the los alamos hepatitis c sequence database. this work was supported in part by grants from the ministry of education, culture, sports, science and technology of japan to k.s., k.u.-t. and y.t., and by grants from the ministry of health, labour and welfare of japan to y.t. funding to pay the open access publication charges for this article was provided by the ministry of education, culture, sports, science and technology of japan. y.n.
is a research fellow of the japan society for the promotion of science. supplementary data are available at nar online. conflict of interest statement. none declared. key: cord- -rloey j authors: harel, noam; meir, moran; gophna, uri; stern, adi title: direct sequencing of rna with minion nanopore: detecting mutations based on associations date: - - journal: nucleic acids res doi: . /nar/gkz sha: doc_id: cord_uid: rloey j one of the key challenges in the field of genetics is the inference of haplotypes from next generation sequencing data. the minion oxford nanopore sequencer allows sequencing long reads, with the potential of sequencing complete genes, and even complete genomes of viruses, in individual reads. however, minion suffers from high error rates, rendering the detection of true variants difficult. here, we propose a new statistical approach named associvar, which differentiates between true mutations and sequencing errors from direct rna/dna sequencing using minion. our strategy relies on the assumption that sequencing errors will be dispersed randomly along sequencing reads, and hence will not be associated with each other, whereas real mutations will display a non-random pattern of association with other mutations. we demonstrate our approach using direct rna sequencing data from evolved populations of the ms bacteriophage, whose small genome makes it ideal for minion sequencing. associvar inferred several mutations in the phage genome, which were corroborated using parallel illumina sequencing. this allowed us to reconstruct full genome viral haplotypes constituting different strains that were present in the sample. our approach is applicable to long read sequencing data from any organism for accurate detection of bona fide mutations and inter-strain polymorphisms. a major goal of genetics today is the characterization of genetic diversity in a population.
in microbes, such diversity is generated in particular by high mutation rates, which may generate both nucleotide substitutions as well as point insertions or deletions (indels) ( ) . longer indels may also occur ( ) , as may events of genetic recombination. disease pathogenesis, progression, management and epidemiology are all affected by the genetic diversity created in microbial populations ( ) ( ) ( ) ( ) ( ) ( ) ( ) . thus, characterizing this diversity is of utmost importance in clinical as well as research settings, which makes the development and improvement of suitable sequencing technologies crucial ( ) ( ) ( ) . the availability of second-generation dna sequencing technologies ( ) , with the illumina platform currently at the forefront, has made the sequencing of genomes conventional. in particular, this technology has dramatically furthered the study of viruses, whose relatively small genomes allow in depth characterization of a population of viruses ( , ) . illumina-based sequencing allows the detection of minor variants that the standard sanger-based method often missed. however, illumina short-read sequencing technologies all share one major limitation, the short length of each read, which typically ranges between and bp (for a paired end read). this means that a complete viral genome sequence cannot be obtained in a single read, impairing the ability to link distant mutations in an individual viral genome. another illumina limitation is that rna cannot be sequenced directly. during library preparation, rna is reverse-transcribed into cdna and amplified by pcr. this creates multiple problems that have been extensively discussed but not resolved ( ) : first, reverse transcription and pcr may introduce errors during early stages of amplification that will be carried on to later stages ( ) ( ) ( ) . second, some molecules may be preferentially amplified over others, a term known as pcr bias. 
third, pcr and reverse transcription reactions can often result in chimeric dna sequences that originate from different molecules ( ) ( ) ( ) . together, these problems make the inference of haplotypes from pcr-based libraries that are sequenced with illumina extremely limited. currently, single-molecule third-generation sequencing systems, such as oxford nanopore technologies, provide a promising alternative for sequencing full-length single viral genomes ( ) . in fact, these technologies now allow direct sequencing of either dna or rna. the long reads provided by these methods have the potential to allow for the inference of up to an entire genome of a typical rna virus, whose genome is generally shorter than , bp ( ) ( ) ( ) ( ) ( ) . however, one of the major shortcomings of the third-generation technologies is their relatively high error rate, with the proportion of errors on a read often exceeding % ( , ) . this high error rate makes the detection of true single-nucleotide variants very difficult ( , , ) . here, we devised a simple statistical procedure called associvar that allows real mutations to be distinguished from technical errors using only the minion sequencing results. we focused our analysis on the ms bacteriophage, an extremely small ( bases) and fast evolving +ssrna virus ( , ) that is highly amenable to direct sequencing with oxford nanopore minion. we sequenced virus populations in parallel using both minion and illumina, allowing us to corroborate the inferences of associvar. this then allowed us to directly infer relationships between mutations and to deduce the entire genome sequences of viral strains in the population. we were also able to use associvar to analyze a yeast mrna sample and data of mixed strains of zika virus. this illustrates the generality of our approach, which can be applied to other organisms as well.
during the course of an evolve-and-resequence experiment, we performed serial passaging of the phage ms (meir et al., in preparation). briefly, clonal ms stock was propagated from a single plaque that was the precursor of all the evolutionary lines established in this work. we performed serial passages at °c with two biological replicates (hereafter denoted a and b). the serial passages were performed as follows: ml cultures of naive escherichia coli c- were grown up to an optical density of od = . (corresponding to a density of about cells/ml). each passage was infected with ml of phages from the previous passage. the cultures were grown for h at each temperature with shaking, and the e. coli cells were then removed by centrifugation. the supernatant was subjected to filtration with a . µm filter (stericup ® filter, emd millipore) to remove any remaining residues. naive hosts were provided for each passage. the new phage stock was then stored at °c. aliquots of these phage stocks were used for measuring the concentration of phages by plaque assay and infecting the next serial passage. we then determined the population frequency of each mutation at passage and passage through whole genome deep sequencing as described below, using illumina and minion. library preparation was performed according to our lab's accungs sequencing protocol ( ) with some modifications: the reverse transcription reaction was performed using superscript ® iii reverse transcriptase (thermo scientific). the reaction was performed on ng of phage rna with µl of dntp mix ( mm), µl of r primer (tgggtggtaactagccaagcag) ( µm) and µl of sterile distilled water. the mixture was incubated for min at °c followed by incubation on ice for min. after brief centrifugation, µl of x first-strand buffer were added along with µl dtt ( . m), µl of rnase out (thermo scientific) and µl of superscript™ iii rt ( units/µl). the mixture was incubated for min at °c followed by inactivation for min at °c.
the last step was adding µl of rnase h to the reaction according to the manufacturer's instructions. cdna from the reverse transcription reaction was directly used as a template for the pcr amplification of the full ms genome in three overlapping fragments. pcr reactions were performed using 'phusion high-fidelity dna-polymerase' (thermo scientific) according to the manufacturer's instructions with the primers: f (gggtgggacccctttcgg), r (tttttctagagagccgttgcct), f (ggcccaaatctcagccatgc), r (cgtgtctgatccacggc), f (ggcacaagttgcaggatgca), r (tgggtggtaactagccaagcag), see supplementary figure s . after pcr, the three amplicons were purified with a pcr clean up kit (promega). purified amplicons were quantified with qubit assays (q , life technologies), diluted and pooled in equimolar concentrations. the illumina nextera xt library preparation protocol and kit were used to produce dna libraries, according to the manufacturer's instructions with some modifications. briefly, the tagmentation reaction was performed with . ng/µl of dna, µl td buffer, µl atm enzyme and up to µl ddw. the mixture was incubated for min at °c and then directly used as a template for a µl pcr reaction using 'phusion high-fidelity dna-polymerase' (thermo scientific) according to the manufacturer's instructions with the index primers from the nextera xt kit. after pcr, a double-sided size selection was performed using ampure beads to remove short and long library fragments, since bp fragments were required. we collected the supernatant and read µl from each sample in the tapestation (agilent high sensitivity d ) to verify fragment size. the yeast enolase sample was prepared as described for the phage rna. rt and pcr amplification were performed using primers: e (atggctgtctctaaagtttacgcta) and e (ttacaacttgtcaccgtggtgg). libraries were sequenced on an illumina miseq using the × miseq reagent kit (illumina, ms- - ) for paired-end reads.
bioinformatics processing of the data was performed using the accungs pipeline ( ) with the default parameters (minimal %id = , e-value threshold = e− and q score cutoff of ). briefly, this pipeline is based on (a) mapping the reads to the reference genome using blast, (b) searching for variants that appear on both overlapping reads, (c) calling variants with a given q-score threshold and inferring their frequency. all libraries attained a mean coverage of ∼ , reads/base. the reference genome was determined by comparing the consensus of passage to genbank id v . (differences noted in supplementary table s ). when examining the results of the control sequence (a plasmid bearing the ms genome), we noted a high error rate at several positions that resided near the primer sites, and accordingly positions from each end of the genome were excluded from downstream analysis. the oxford nanopore minion was used to sequence the ms rna directly. we sequenced the three samples (p a, p b and p a) in three separate runs. we prepared direct rna libraries according to the manufacturer's library prep instructions with some modifications. we altered the supplied reverse transcriptase adapter (rta) ( ) , which has a t overhang, to specifically target the ms genome with nucleotides complementary to the ms conserved end (supplementary figure s ). the ligation reaction was performed with ng rna in µl, nm custom adapter in µl, µl of nebnext quick ligation buffer and . µl of t dna ligase enzyme (neb). the mixture was incubated for min at °c followed by incubation on ice for min and then directly used as a template for the next step of the library prep, which is cdna synthesis according to the manufacturer's instructions. the cdna synthesis step was performed in order to maintain the integrity of the rna fragments during minion library prep and sequencing.
the library was cleaned up each time using µl of ampure xp dna beads per µl of sample and we added µl of rnase out (thermo scientific) to protect the rna. the rna was directly sequenced on the minion nanopore sequencing device using a flo-min flow cell equipped with the r . chemistry. the minknow control software version . . was used, and the run was allowed to proceed for h. the basecalling was performed locally by the minknow software as well, and the data were written out in the fastq format. reads were filtered with the minknow default cutoff of a minimum average q-score of . the accungs pipeline ( ) described above was next applied to the data in order to determine variant frequencies. blast parameters were modified to minimal % identity = , e-value threshold = e− and no q-score cutoff, to allow the highly variable minion reads to map. as with the illumina results, positions from each end of the genome were excluded from downstream analysis. this also solved the problem of very low coverage areas in the minion sequencing. variant frequency distributions for minion and miseq were calculated by using the accungs pipeline results. for substitutions and deletions, the distribution is straightforward and accounts for the results for all bases. for insertions, we focused on point insertions, defined as the first insertion after any position, in order to create the frequency distributions. positions close to the ends of the genome that show a high error rate in miseq due to primer proximity were removed from this analysis. to calculate the error rate, we sequenced a control sample from the beginning of the passaging experiment (p a), and the mean, th percentile and th percentile errors were calculated for every error type. associvar searches for strong associations between variants as an indication that these represent bona fide mutations. the method is based on five stages: (a) detecting non-random associations.
for each pair of positions, each read is classified into one of four categories based on whether it bears the wt nucleotide (i.e., identical to the reference genome) or a non-wt nucleotide (i.e., different from the reference genome) at each of the two given positions. we use this to create a 2 × 2 contingency matrix of observed counts, which is then used as the input for a chi-square test, yielding a chi-square statistic (see supplementary text). we focus only on contiguous reads that span the entire genome. notably, at this stage we focus only on wt versus non-wt assignments (rather than the exact identity of the non-wt allele) for computational tractability. this is relaxed later on. (b) removing proximal positions. since we observed that positions that are highly proximal (< bp apart) often tended to be highly associated, and we suspected this is an artifact of the sequencing machine, chi-square results for all such proximal positions were removed from the analysis. (c) normalization. to make the different positions comparable, the distribution of chi-square scores is normalized per position. (d) requiring a local maximum. since positions proximal to each other tend to present spurious high associations, due to transitivity we expect a position next to a real mutation to also be highly associated with other positions with real mutations. in other words, if positions p and q are highly associated because they are real mutations, and positions q and q + 1 are highly associated because they are proximal, we will see a high association between positions p and q + 1 as well. however, we expect the association between the real mutations to be the highest, i.e. to be a local maximum in the surroundings of a given pair. a normalized chi score's surroundings is defined as the four neighboring normalized chi scores when the data is regarded as a two-dimensional matrix. for example, the normalized chi score for (p, q) is required to be higher than the normalized chi scores for (p, q − 1), (p, q + 1), (p − 1, q) and (p + 1, q). (e) use of a control sequence.
in order to create a cutoff for the normalized chi-square statistic, we used the values obtained for a control sequence (supplementary figure s ). we know that our control sample was not completely homogeneous, since it contained two mutations at a frequency slightly higher than %. nevertheless, it served as a valid control when setting a confidence rate of . %, i.e. calculating the normalized chi score that allows . % of the positions in our control sample to be identified as significant (allowing for three 'false positive' positions in our case). after stages (a) through (e), associvar infers n positions with real mutations in the population. the last stage is to determine the identity of the mutations (a, c, g, t, −), in a similar way to that described in (a). insertions are ignored here. every position has four possible alternative variants (the three nucleotides that differ from the reference, or a deletion), and we test these variations against each other using chi-square tests, leading to n × (n− ) tests, where n is the number of positions previously identified as having real mutations. again, we created a × contingency matrix of observed frequencies, which is then used as the input for the chi-square test. for every position, we choose the variant with the highest average chi-square statistic for all the tests for pairs containing that variant. we begin by focusing only on the variants inferred as bona fide mutations in the last stage. this means that in principle there are 2^n possible haplotypes bearing these mutations. we filter out reads with variants that do not match our inference (for example, if one of the inferred mutants is a g, we filter out reads with the nucleotides c, t or a deletion at position ). we then use our inferred percentile error thresholds (table ) to deduce which combinations of mutations are likely to be true and which may have been created by the technical error rate.
we use an iterative approach to classify which of the 2^n haplotypes is reliable. first, for every single nucleotide variant, we group together all of the haplotypes that include this base variant (a haplotype can appear in more than one group). second, in each group, we compute the relative frequency of each haplotype as its proportion of all haplotypes in the group. we iterate through the haplotypes from highest frequency to lowest, classifying each haplotype as reliable or not. the first haplotype is automatically classified as reliable. for every haplotype, we compare its relative frequency with the probability that it is created by technical errors from the closest haplotype classified as reliable, called its parent haplotype, using the inferred error threshold. for example, if a haplotype has an additional deletion and substitution when compared to its parent haplotype, we require that its relative frequency be higher than the product of . × . = . to be classified as reliable (using the th percentile error frequencies in table ). we iterate through the haplotypes until all the haplotypes in each group are classified. if reads for a wt haplotype exist, the wt is also treated as a base for a group, which in this case will include all of the observed haplotypes. haplotypes may appear in more than one group; if a haplotype appears as reliable in at least one group it will be classified as reliable overall. see supplementary text for a visual summary of the algorithm we employ. the code we provide produces a file with all the variant combinations observed, the calculations described here and whether a haplotype is reliable or not. it also produces a file with the haplotypes classified as reliable, with their proportion in the population recalculated appropriately. the sequencing data created and used in this study are available in the sequencing read archive (sra, https://www.ncbi.nlm.nih.gov/sra), under bioproject prjna .
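the iterative classification of haplotypes within one group, as described above, can be sketched in python as follows. this is an illustrative sketch rather than the published code: the data structures, the function names and the error rates are hypothetical, the closest reliable haplotype is taken to be the one with the fewest differing mutations, and a position that carries different non-wt alleles in parent and child is counted as two differences.

```python
from typing import Dict, FrozenSet, List, Tuple

Mutation = Tuple[int, str]        # (position, allele); the allele "-" marks a deletion
Haplotype = FrozenSet[Mutation]   # the set of mutations carried by a full-length read


def error_probability(parent: Haplotype, child: Haplotype,
                      error_rates: Dict[str, float]) -> float:
    """Probability that `child` arises from `parent` purely through technical errors."""
    probability = 1.0
    for _, allele in parent ^ child:  # mutations present in exactly one of the two
        kind = "deletion" if allele == "-" else "substitution"
        probability *= error_rates[kind]
    return probability


def classify_group(counts: Dict[Haplotype, int],
                   error_rates: Dict[str, float]) -> Dict[Haplotype, bool]:
    """Walk haplotypes from most to least frequent and mark each as reliable or not."""
    total = sum(counts.values())
    reliable: List[Haplotype] = []
    verdict: Dict[Haplotype, bool] = {}
    for haplotype in sorted(counts, key=lambda h: counts[h], reverse=True):
        if not reliable:
            verdict[haplotype] = True  # the most frequent haplotype is always reliable
        else:
            parent = min(reliable, key=lambda r: len(r ^ haplotype))
            threshold = error_probability(parent, haplotype, error_rates)
            verdict[haplotype] = counts[haplotype] / total > threshold
        if verdict[haplotype]:
            reliable.append(haplotype)
    return verdict


# Hypothetical error rates and read counts, chosen only to exercise the logic.
rates = {"substitution": 0.05, "deletion": 0.1}
hap_a = frozenset({(100, "G")})
hap_abc = frozenset({(100, "G"), (200, "-"), (300, "T")})
hap_ad = frozenset({(100, "G"), (400, "C")})
result = classify_group({hap_a: 900, hap_abc: 50, hap_ad: 2}, rates)
```

here the second haplotype differs from its parent by one deletion and one substitution, so its relative frequency must exceed the product of the two error rates, mirroring the worked example in the text.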
the accompanying code can be found at https://github.com/sternlabtau/associvar. we set out to sequence two evolved populations of the ms coliphage. both populations were derived by fifteen serial passages performed at °c (denoted as a and b) as part of an evolve-and-resequence experiment (methods). we first performed deep sequencing of both populations at passage with the illumina miseq platform ( ) . this revealed several segregating mutations (figure ), some of which shared similar frequencies. however, due to the short-read nature of the sequencing it was impossible to infer whether these mutations co-occurred on the same genome. we next sequenced the same two populations of rna viruses from passage using oxford nanopore's minion. importantly, we employed direct rna sequencing, without using reverse transcription or pcr amplification, and without any shearing of the genomes. the only requirement for library preparation was the ligation of an adaptor to the of the rna genome, allowing the to enter the sequencing pore. each replicate was sequenced independently, denoted as p a and p b. we also sequenced a sample from line a passage using both illumina miseq and minion. as this was a mostly unevolved and homogeneous population, we used this as a control sample. a total of , , , and , reads were produced for the minion-p a, minion-p b and the control runs, respectively. in order to map the reads to the ms reference genome we ran our computational pipeline (materials and methods) ( ) , which infers the proportion of each point mutation (a, c, g, t or '-') at each position in the genome. over % of the reads were mapped to the reference, yet often sequencing terminated before it reached the end in both the evolved and control populations (supplementary figure s ). nevertheless, ∼ % of the reads (between ∼ , and ∼ , ) covered the entire ms genome.
we next focused on the frequency of an observed variant, defined here as any base called differently from what is present in the reference sequence, at any position. we expect such variants to be the sum of two independent processes: real biological variations derived from evolutionary processes in the phage populations, and technical errors introduced by the sequencing process. comparing the variant distributions of illumina and minion, it was evident that minion suffers from a very high technical error rate (figure ). notably, in the control population the number of variants exceeding a frequency of % in the illumina sequencing was , whereas with minion we observed variants in positions exceeding %. this allowed us to infer that the vast majority of minion variant frequencies are technical errors, and further allowed us to roughly estimate the various types of minion error rates for our experiment (table ). notably, we observed that the point deletion and point insertion rates together exceeded the substitution rate, reinforcing previous observations ( , ) . we attempted to use the inferred minion error rates as thresholds that can distinguish real mutations from errors, by setting the th percentile obtained for the control sample as an error threshold for each type of error (table ). this naïve approach, which is often used, quickly turned out to be invalid, as shown by our parallel illumina results. for example, we knew from the illumina results that only mutations in line a and mutations in line b exceeded a frequency of % at passage (figure ). however, the minion results showed mutations in positions and mutations in positions exceeding % in both replicates, respectively. we sought a strategy to weed out the technical errors from the real mutations in the minion results independently of the illumina results. we calculated the conditional probabilities of observing one variant given another variant observed on the same read (figure ).
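the conditional probabilities mentioned above can be illustrated with a short python sketch. it assumes, purely for illustration, that each full-length read has already been reduced to the set of positions at which it differs from the reference; the function name and the toy reads are hypothetical and are not taken from the published code.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def conditional_probabilities(
    reads: List[FrozenSet[int]], positions: Iterable[int]
) -> Dict[Tuple[int, int], float]:
    """Return P(variant at i | variant at j) for every ordered pair (i, j), i != j."""
    positions = list(positions)
    table: Dict[Tuple[int, int], float] = {}
    for j in positions:
        reads_with_j = [read for read in reads if j in read]
        if not reads_with_j:
            continue  # the conditional probability is undefined without such reads
        for i in positions:
            if i == j:
                continue
            both = sum(1 for read in reads_with_j if i in read)
            table[(i, j)] = both / len(reads_with_j)
    return table


# Toy reads: positions 10 and 20 travel together, position 30 does not.
toy_reads = [
    frozenset({10, 20}),
    frozenset({10, 20}),
    frozenset({10}),
    frozenset({30}),
    frozenset(),
]
probabilities = conditional_probabilities(toy_reads, [10, 20, 30])
```

in this toy input the probability of a variant at position 10 given one at position 20 is 1, while position 30 never co-occurs with position 10, which is the kind of pair-dependent pattern that distinguishes linked mutations from randomly scattered errors.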
when observing the pattern of conditional probabilities, we noted two distinctly different patterns. some variants co-occurred more or less randomly with all other variants, manifested as more or less the same probability of observing one variant given any other variant (similar colors across a given column in figure a) . on the other hand, some variants displayed a nonrandom pattern, where the probability of observing variants together depended very much on which pair of variants was examined (different colors across any given column in figure b) . importantly, the variants that displayed a non-random pattern were variants that we knew were true mutations based on the illumina data. this led us to realize that random technical errors are expected to display a different pattern than real biological mutations: we expect technical errors to be associated randomly with any other technical error, whereas a pair or more of real biological mutations are expected to be non-randomly associated with each other. this is a reflection of evolutionary processes operating on genomes. while mutations created from replicative polymerases will be mostly randomly distributed along the genome, selection and genetic drift will lead to the fact that specific combinations of mutations reach higher frequency. thus, true mutations that are prevalent in a population will tend to be either present with some other mutations on the same genome/read, or not present with some other mutation on the same genome. both these properties (tendency to be present or not present with other mutations) reflect non-random association between mutations. one of the most commonly used methods to test for associations between two properties is the chi-square test: here we use this test to see whether the observed joint variant counts deviate from what is expected when variants are counted independently. 
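as a minimal sketch of this test, the python code below builds the wt/non-wt contingency counts for one pair of positions and computes the pearson chi-square statistic for a 2 × 2 table. the exact variant of the test used by the authors is given in their supplementary text, so the uncorrected statistic, the function names and the toy reads here are assumptions.

```python
from typing import List, Tuple


def contingency_counts(reads: List[str], reference: str,
                       p: int, q: int) -> Tuple[int, int, int, int]:
    """Count reads that are (wt, wt), (wt, non-wt), (non-wt, wt), (non-wt, non-wt) at p, q."""
    counts = [0, 0, 0, 0]
    for read in reads:
        p_variant = read[p] != reference[p]
        q_variant = read[q] != reference[q]
        counts[2 * int(p_variant) + int(q_variant)] += 1
    return counts[0], counts[1], counts[2], counts[3]


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the table [[a, b], [c, d]], without correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denominator


# Toy full-length reads aligned to a four-base reference (hypothetical data).
toy_reference = "AAAA"
linked_reads = ["AAAA"] * 20 + ["CAAG"] * 20
linked_counts = contingency_counts(linked_reads, toy_reference, 0, 3)
linked_score = chi_square_2x2(*linked_counts)
```

two perfectly linked variants give the maximal statistic for the sample size, whereas counts that match the independence expectation, such as 10 reads in each of the four cells, give a statistic of zero.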
to this end, each variant was classified as either wild-type (wt) or non-wt, based on whether it was identical or not to the reference genome. we began by inspecting all associations between all pairs of positions. this allowed us to make a few general observations. first, we observed that proximal pairs of positions (residing up to bases apart from each other) tended to be highly associated. we postulate that this is a reflection of minion errors, and also the high deletion rate, which could cause slight misalignment of reads covering positions proximal to the deletions. second, we observed a very similar pattern of associations among the three samples. this suggests that minion sequencing has a tendency towards a specific pattern of errors for a given genome that is sequenced (supplementary figure s ). we devised a method called associvar that detects the real variants in the data, based on the following properties: (a) the method searches for the strongest non-random associations, (b) the method takes into account that pairs of proximal positions (i.e. up to bases apart) that have high associations between them are likely false positives induced by the minion machine itself, (c) in order to make the different positions comparable, the method normalizes the distribution of chi-square scores per position, essentially searching for outliers from all the associations of a given position, (d) the method also takes into account that because proximal positions are highly associated, a position next to a real mutation may be associated with other real mutations due to transitivity. however, we expect the two real mutations to display the highest association, i.e. we require an association to be a local maximum in a given window. finally, (e) the method uses a control sample to set a cutoff for the highest associations (materials and methods).
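properties (b) through (e) can be sketched in python as a set of small filters over a matrix of pairwise chi-square scores. this is a sketch under stated assumptions, not the published implementation: the per-position normalization is modelled here as a z-score, the `min_distance` and `allowed_false_positives` parameters are hypothetical stand-ins for values given in the paper, and all function names are invented for illustration.

```python
import statistics
from typing import List, Optional

Matrix = List[List[Optional[float]]]  # scores[p][q]; None marks a removed pair


def mask_proximal(scores: Matrix, min_distance: int) -> Matrix:
    """(b) Drop scores for pairs of positions closer than `min_distance`."""
    return [[None if value is None or abs(p - q) < min_distance else value
             for q, value in enumerate(row)]
            for p, row in enumerate(scores)]


def normalize_per_position(scores: Matrix) -> Matrix:
    """(c) Z-score each row so that positions become comparable (assumed normalization)."""
    normalized: Matrix = []
    for row in scores:
        values = [v for v in row if v is not None]
        if not values:
            normalized.append([None] * len(row))
            continue
        mean = statistics.fmean(values)
        spread = statistics.pstdev(values)
        normalized.append([None if v is None
                           else ((v - mean) / spread if spread > 0 else 0.0)
                           for v in row])
    return normalized


def is_local_maximum(scores: Matrix, p: int, q: int) -> bool:
    """(d) True if scores[p][q] exceeds its four neighbours in the matrix."""
    centre = scores[p][q]
    if centre is None:
        return False
    for dp, dq in ((0, -1), (0, 1), (-1, 0), (1, 0)):
        i, j = p + dp, q + dq
        if 0 <= i < len(scores) and 0 <= j < len(scores[i]):
            neighbour = scores[i][j]
            if neighbour is not None and neighbour >= centre:
                return False
    return True


def control_cutoff(control_scores: List[float], allowed_false_positives: int) -> float:
    """(e) Cutoff such that at most `allowed_false_positives` control scores exceed it."""
    ordered = sorted(control_scores, reverse=True)
    if allowed_false_positives >= len(ordered):
        return float("-inf")
    return ordered[allowed_false_positives]


# Toy 5 x 5 score matrix with one strong pair (0, 4) and two weaker neighbours.
toy_scores: Matrix = [[1.0] * 5 for _ in range(5)]
toy_scores[0][4], toy_scores[0][3], toy_scores[1][4] = 10.0, 6.0, 5.0
masked = mask_proximal(toy_scores, min_distance=2)
```

a pair would then be reported only if its normalized score is a local maximum and strictly exceeds the cutoff derived from the control sample.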
associvar hence calculates a normalized chi-square statistic and infers the positions where 'true' variation occurs, based on the above properties (figure ). after applying associvar to the data, we were able to identify five out of the six mutations appearing at a frequency above % in the illumina results in p a, and all eight positions within the p b sample (figure , supplementary table s ). notably, associvar also often correctly identified mutations segregating at a frequency lower than % according to illumina. all in all, the results indicate that our association approach has the power to resolve real variants from technical errors based on the minion data alone. we began the analysis by classifying mutations into wt and non-wt, for computational tractability. next, once the positions with real mutations have been identified, the specific nucleotide variant at each position can be determined using a similar approach. every position has four possible variants (the three nucleotides different from the reference and a deletion for that base), and we test these variants against each other, again under the assumption that the most highly associated variant for each position is the real one (materials and methods). for the positions identified by our association analysis and verified as correct by the illumina sequencing, all but one of the positions were matched with their correct variant with this method (position in p b was identified as a deletion instead of nucleotide a). the minion rna sequencing kit comes with a control sequence of the enolase ii yeast gene. we ran our association analysis on the enolase results, originally to verify that we pick up no variation in this gene. however, we were surprised to see two positions with a very high and outstanding association (figure ). we thus sequenced the same sample with the illumina miseq platform.
reassuringly, the results verified the findings of associvar and showed that these two variants do appear in the sample and are the only two variants that appear there at a frequency higher than %. notably, other high associations were ruled out as induced by proximity with the local maximum analysis (materials and methods). since we do not have a control sample for this gene, we cannot use it to infer a cutoff. however, the association between these two positions is so prominent when compared to the rest of the data that we were able to conclude they are between positions with 'true' variation (as later verified by illumina). this suggests that our method can be used (a) as a general approach and not only for virus populations, and (b) in the absence of a control. we next tested our method on a sample of zika virus genomes ( ). in this study, two different strains that differed at several positions had been artificially mixed and sequenced using minion. we ran associvar on one of the sequenced amplicons, in the absence of a control sequence. at least five of the six true mutations in this amplicon stood out as having highly prominent associations (supplementary figure s ). finally, we tested how our method fares in the absence of a control for our ms data, and compared the true positive rate versus false positive rate as a function of (a) thresholds set for the normalized chi-square statistic, and (b) a frequency threshold from the illumina results that determines when we define a mutation as true or false. we further compared this to the use of a 'naïve' approach where we use a varying frequency threshold for the minion results (as described above). our results show that associvar inference is consistently much more accurate than the naïve approach, and moreover, can be used even to detect mutations at a frequency of %, at the risk of some false positives (figure ).
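the comparison of true positive rate against false positive rate described above can be sketched as below. the function name roc_points and its inputs are assumptions for illustration: scores stands for any per-position score (a normalized chi-square statistic for associvar, or an observed frequency for the naïve approach), and truth is the set of positions defined as real by the gold standard.

```python
def roc_points(scores, truth, thresholds):
    """Return (false positive rate, true positive rate) for each threshold.

    scores: dict mapping position -> score (hypothetical input layout).
    truth: set of positions regarded as real mutations.
    A position is called when its score is at or above the threshold.
    """
    positives = [p for p in scores if p in truth]
    negatives = [p for p in scores if p not in truth]
    if not positives or not negatives:
        raise ValueError("need at least one true and one false position")
    points = []
    for t in thresholds:
        tp = sum(scores[p] >= t for p in positives)
        fp = sum(scores[p] >= t for p in negatives)
        points.append((fp / len(negatives), tp / len(positives)))
    return points
```

sweeping the threshold from high to low traces the curve from the origin towards the upper right corner; changing the contents of truth corresponds to changing the frequency threshold applied to the gold standard.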
one of the main goals of minion sequencing, in particular in the context of rna virus evolutionary experiments, is the detection of haplotypes and identification of distinct strains in the population. we thus set out to use the approach we devised to infer the composition of strains in our ms samples. notably, this is challenging on two fronts: first, our association approach can tell us which mutations in the minion data are real, and which pairs are associated, but it does not tell us what their frequency is (or rather, we do not trust the observed frequencies given the very high error rate). second, we are interested in inferring haplotypes, i.e. which mutations reside together regularly on the same genomes and which do not. once again, the high error rate makes this extremely tricky since we observe reads bearing almost all possible combinations of mutations. in fact, most reads bore so many variants, that ∼ - % of the bases called per read were different than the reference (supplementary figure s ). in this case, we used a two-pronged approach: we first focused only on variants inferred as true mutations using our method, associvar, as described above. second, we used the inferred error threshold to infer the probability of two or more variants residing erroneously on the same genome, utilizing an iterative approach in which we compare a given haplotype to haplotypes already classified as reliable (materials and methods). in our case, because the ms populations bore many mutations at low frequencies, we also limited the analysis to variants that appeared at a frequency of at least % in the minion sequencing results. the summary of inferred strains is shown in table and provides a few interesting insights into our populations. first, seventeen and ten different strains were identified in each of the a and b populations, respectively.
second, a g and t -, both of which rose to high frequencies in both replicates, were found to be mutually exclusive in both replicates. on the other hand, g - (which we know from the illumina data to actually be g a) was found to be tightly linked to t -, in line with the very similar frequencies of these mutations in p b. all of these results were unobtainable with the illumina results alone and highlight the added value of using minion for inferring viral genotypes. we have developed here a simple and intuitive approach, associvar, to (a) detect bona fide mutations from minion population sequencing, and (b) infer the set of haplotypes (strains) present in a population. our approach is based on the notion that sequencing errors will be randomly dispersed along the reads, whereas real mutations tend to associate with specific genetic backgrounds. in the case where technical error rates are high (such as occurs with minion), this allows one to focus on the real genetic diversity that is hidden in the vast array of technical errors generated by this method. notably, our approach is general enough so that it can be used for any type of long read sequencing.
nucleic acids research, , vol. , no.
figure . receiver operating characteristic (roc) of associvar versus a naïve method. performance of prediction of mutations is assessed using roc curves, where each curve is plotted as a function of the normalized chi-square statistic threshold for associvar (solid lines), or frequency threshold for the naïve method (dashed lines). the illumina results are used as the gold standard test to define a mutation as true or false, and the three different colors represent different thresholds for this definition. for example, for the blue line labeled as %, only mutations at a frequency higher than % according to illumina are defined as true.
we applied associvar to sequencing data from an evolved population of phages where illumina sequencing was available, allowing us to corroborate whether mutations we found based on analysis of the minion data alone were indeed real. strikingly, all but one of the high frequency mutations observed in the p a and p b data (> %) were picked up using associvar, despite the fact that the th percentile for technical errors was as high as % (table ). in fact, despite the very high deletion rate, associvar accurately identified the one real deletion mutation present in our populations, suggesting a very high sensitivity of the method. our approach also shows high specificity, with a false positive rate lower than . %. finally, we have shown that using a naïve approach based on a frequency threshold as a cutoff to separate real mutations from errors results in extremely high false positive rates, demonstrating the value of our approach. originally, when observing the data in figure , as a first approximation it seemed reasonable to assume that mutations with a similar frequency would be mutations shared on the same genomes. accordingly, we had hypothesized that at least two clusters of mutations in line b (t -/g a/a g and a g/ a g/t c/g a/a g) would be present on the same genomes. this turned out to be only partially true: mutations with similar frequency were sometimes indeed on the same genomes (e.g. t -/g a), but sometimes not at all (the former two and a g) (table ). these results illustrate the utility of minion to resolve the relationships among mutations, and its advantage for differentiating variants with mutations displaying similar frequencies. we further used our approach to perform the reverse analysis: when analyzing the mrna of the yeast gene enolase, our analysis suggested that the mrna population sequenced was not homogeneous. this was then precisely verified by illumina sequencing of the same population.
remarkably, this analysis shows that (a) associvar can be used to analyze different types of data, ranging from virus genomes to mrna of any organism, and (b) associvar can be used without sequencing a control sequence. we note that this requires more caution, since our analysis of ms showed that spurious associations between mutations may be created artifactually by the sequencing process itself. use of associvar without a control sequence requires the user to specify the threshold of the normalized chi square statistic. as with all methods, the specificity of associvar comes at the cost of sensitivity, and vice versa ( figure ). nevertheless, it seems the best strategy we can suggest is to use a very high threshold, which is extremely effective for variants at a frequency higher than or %. it is important to delineate the limitations of our approach. we note that we cannot distinguish haplotypes/strains that differ from each other at one position only, because our method relies on the association between two positions containing real mutations. similarly, if two strains differ at very proximal loci, associvar will also fail, since we filter out associations between mutations that are < base pairs apart. we postulate that the presumably artifactual associations we observed between proximal loci are induced by the rna (or dna) passing through the pore of the sequencer. finally, we also noted specific patterns of mutations that were reproduced between our control sequence and the two evolved populations of ms . this suggests two possibilities: first, perhaps sequence context and/or rna secondary structure induce specific errors in minion, and second, it is possible that ms genomes undergo rna modifications and these are the cause of these specific errors. 
minion direct rna sequencing records the raw electric signal produced by the rna going through the pores, and this potentially offers the opportunity to identify rna modifications using a newly developed tool called tombo (version . ) by oxford nanopore (https://nanoporetech.github.io/tombo/). unfortunately, we could not conclusively determine the presence or effect of rna modifications and their relationship to associated mutations. our results suggest that tombo still suffers from a high false positive rate, while the true positive rate of the method has not yet been determined ( ). the former was demonstrated herein by a high number of presumably modified sites found in the enolase yeast gene, despite the fact this gene was created synthetically in vitro, where modifications would not likely occur. we nevertheless analyzed our ms samples and found a similar pattern of presumable modifications among the three ms samples, yet there was no correlation between sites with a high rate of modification and sites with high normalized chi-square scores by associvar (see supplementary text, supplementary figures s -s ). while we cannot rule out that rna modifications are responsible for the pattern of errors in minion, we conclude that further research is required to determine which factors induce these errors. although our method is ideal for direct rna or direct dna sequencing, we also used the method for cdna that was amplified from rna in the case of the zika virus analysis ( ) (supplementary figure s ). when we tried to reconstruct the known haplotypes present in this sample, our method did not fully recapitulate the haplotypes (data not shown). one possible explanation for this is that during the amplification step, either chimeric sequences of both strains were created, or pcr recombination occurred, breaking down some of the linkage between sites.
in such cases, the use of associvar is limited to the detection of mutations only, and this further suggests that direct rna/dna sequencing may be preferable. to summarize, we anticipate that due to its ease of use and advantages listed above, direct long read sequencing using minion will be increasingly valuable in the field of virus genetics and in additional diverse fields such as transcriptome studies, cancer genetics, and microbiology. the associvar approach we suggest herein is simple and applicable to any organism, and as such we hope it will be a useful addition to the genomics toolbox in multiple fields. viral mutation rates the defective component of viral populations early minion™ nanopore single-molecule sequencing technology enables the characterization of hepatitis b virus genetic complexity in clinical samples viral phylodynamics minority hiv- drug resistance mutations are present in antiretroviral treatment-naïve populations and associate with reduced treatment efficacy evolutionary analysis of the dynamics of viral infectious disease highly accurate-single chromosomal complete genomes using iontorrent and minion sequencing of clinical pathogens antimicrobial resistance prediction and phylogenetic analysis of neisseria gonorrhoeae isolates using the oxford nanopore minion sequencer field investigation with real-time virus genetic characterisation support of a cluster of ebola virus disease cases in dubreka distinguishing low frequency mutations from rt-pcr and sequence errors in viral deep sequencing data evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts minion nanopore sequencing identifies the position and structure of bacterial antibiotic resistance determinants in a multidrug-resistant strain of enteroaggregative escherichia coli coming of age: ten years of next-generation sequencing technologies evaluating the accuracy and sensitivity of detecting minority hiv- populations by 
illumina next-generation sequencing genomics and outbreaks: foot and mouth disease examining sources of error in pcr by single-molecule sequencing pcr amplification introduces errors into mononucleotide and dinucleotide repeat sequences insight into biases and sequencing errors for amplicon sequencing with the illumina miseq platform a general method to eliminate laboratory induced recombinants during massive, parallel sequencing of cdna library minimizing dna recombination during long rt-pcr dna recombination during pcr direct rna sequencing of the coding complete influenza a virus genome highly parallel direct rna sequencing on an array of nanopores the oxford nanopore minion: delivery of nanopore sequencing to the genomics community multiplex pcr method for minion and illumina sequencing of zika and other virus genomes directly from clinical samples minion nanopore sequencing of an influenza genome long-read sequencing-a powerful tool in viral transcriptome research oxford nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome assessing the performance of the oxford nanopore technologies minion an amplicon-based sequencing framework for accurately measuring intrahost virus diversity using primalseq and ivar from squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy genomewide patterns of substitution in adaptively evolving populations of the rna bacteriophage ms effect of deleterious mutation-accumulation on the fitness of rna bacteriophage ms accurate in vivo population sequencing uncovers drivers of within-host genetic diversity in viruses the asqc basic references in quality control: statistical techniques improved data analysis for the minion nanopore sequencer characterization of minion nanopore data for resequencing analyses direct rna nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis we thank 
david burstein and tzachi hagai for critical reading. supplementary data are available at nar online.
key: cord- - g tb authors: rhodin, michael h. j.; dinman, jonathan d. title: a flexible loop in yeast ribosomal protein l coordinates p-site trna binding date: - - journal: nucleic acids res doi: . /nar/gkq sha: doc_id: cord_uid: g tb
high-resolution structures reveal that yeast ribosomal protein l and its bacterial/archaeal homologs called l contain a highly conserved, basically charged internal loop that interacts with the peptidyl-transfer rna (trna) t-loop. we call this the l ‘p-site loop’. chemical protection of wild-type ribosomes shows that the p-site loop is inherently flexible, i.e. it is extended into the ribosomal p-site when this is unoccupied by trna, while it is retracted into the terminal loop of s rrna helix when the p-site is occupied. to further analyze the function of this structure, a series of mutants within the p-site loop were created and analyzed. a mutant that favors interaction of the p-site loop with the terminal loop of helix promoted increased affinity for peptidyl-trna, while another that favors its extension into the ribosomal p-site had the opposite effect. the two mutants also had opposing effects on binding of aa-trna to the ribosomal a-site, and downstream functional effects were observed on translational fidelity, drug resistance/hypersensitivity, virus maintenance and overall cell growth. these analyses suggest that the l p-site loop normally helps to optimize ribosome function by monitoring the occupancy status of the ribosomal p-site.
over the past decade, atomic resolution ribosome structures have revealed the locations of critical elements. however, these static images do not reveal the dynamic movements within this complex macromolecule. the ribosome must coordinate multiple activities between spatially and functionally different sites in two subunits.
these include three transfer rna (trna)-binding sites, the peptidyltransferase and decoding centers and the elongation factor interacting regions. events occurring in these regions must be carefully coordinated to assure rapid and accurate decoding of messenger rnas (mrnas). current efforts in the field are focusing on determining the mechanisms by which these functional centers synchronize their actions and communicate with each other. the eukaryotic ribosome contains nearly intrinsic proteins. the high degree of similarity across species, from the primary amino acid sequences to their tertiary structures, suggests conserved functional roles beyond serving as mere scaffolding for the rrnas. ribosomal protein l of saccharomyces cerevisiae is an essential, highly conserved component of the s subunit (in bacteria and archaea, the homologous protein is named l ; the yeast nomenclature is used throughout this text to minimize confusion). at the primary amino acid sequence level, l is well conserved among eukaryotes (~ - % identity), while bacterial and archaeal l proteins are less well conserved ( - % identical) (supplementary figure s a). l is uniquely positioned at the interface between the large subunit central protuberance (figure a and b) and the head of the small subunit (figure a) ( ) ( ) ( ) ( ) ( ). in the small subunit, the head region undergoes significant rotational movement relative to the central protuberance between the pre- and post-translocational states ( ), and the protein-protein interactions between l and s (s in bacteria and archaea) on the small subunit (the b b and b c intersubunit bridges) undergo the largest intersubunit structural rearrangements between these two states ( ) ( ) ( ). these observations suggest that l may play a central role as an informational conduit between the two subunits.
detailed analysis of x-ray crystallographic and cryo-em structures (figure b) reveals that the concave surface of the β-sheet portion of l interacts with specific nucleotides in the minor groove of s rrna helix ( ). l also makes contacts with the helix iii and loop c regions of s rrna; these connections have been hypothesized to help stabilize s rrna interactions and may participate in an information signal transmission network linking functional centers within the ribosome ( , ). importantly, the b b and b c intersubunit bridges with s are the only protein-protein interactions between the two subunits ( , , , ). analyses of these structures indicate that contacts involving l and s through the b b and b c bridges break and rearrange after eef- binding and ribosome ratcheting, controlled in part by differentially charged amino acid side chains between the two proteins ( , , , , , ). an internal loop of l that we denote the 'l p-site loop', which is roughly formed by amino acid residues - , also directly contacts the t-loop of the peptidyl-trna in the p-site through trna nucleotide ( ) ( ) ( ). at the level of primary amino acid sequence, the p-site loop is highly conserved among eukaryotes ( - % identity), while it is less well conserved among bacteria and archaea ( - % identity) (supplementary figure s b). at the biochemical level, however, the p-site loop is significantly more homogeneous, containing a large number of well-aligned charged and aromatic amino acids. in particular, a , f , r and i (yeast numbering) are universally conserved. an alignment of the p-site loop structures from yeast, haloarcula marismortui, thermus thermophilus and escherichia coli reveals that the p-site loop is extremely well conserved at the structural level (supplementary figure s c). in yeast, l is encoded by the paralogous genes rpl a and rpl b located on chromosomes and , respectively ( ).
the -kda proteins are amino acids long and are identical except for an alanine (l a) to threonine (l b) difference at the third amino acid position. analysis of l in the late s (a.k.a. l ) showed that expression of either isoform was sufficient for cell viability ( ). however, when expressed as the sole form of l , rpl a mrna transcripts accumulated to only - % of wild-type levels as compared to cells expressing both isogenes, while rpl b mrnas accumulated to - %. expression of either isogene alone also affected s subunit assembly: a strain expressing only l b grew at wild-type rates but synthesized fewer s subunits than wild-type cells (although apparently not below a threshold necessary for wild-type growth rates), while strains expressing only l a grew more slowly than wild-type, and synthesized only - % of wild-type levels of total l and s subunits ( ). a random mutagenesis screen of rpl b for cold-sensitive mutants identified alleles that promoted s pre-rrna processing and initiation defects ( ). the specific mutants identified in that study were s p, s p, s f, a v, s c and g d. in arabidopsis, divergent untranslated regions (utrs) between the two isogenes were found to result in differential expression among plant tissues ( ). in addition to its function as a ribosomal protein, l has been implicated in p activation through its interactions with hdm in the nucleus of human fibroblast cells ( ), and mutant forms of l have been linked to diamond-blackfan anemia in humans ( ). although the structural information suggests that l should play a significant role in translation, functional analyses of the protein in this role have not been performed. in this report, a series of mutants were generated using a reverse genetics approach to parse the role of the l p-site loop. detailed biochemical and structural analyses focused on two multi-amino acid mutants with opposing effects on rrna structure and trna binding.
we propose that prior to peptidyltransfer, the presence of peptidyl-trna in the large subunit p-site positions the l p-site loop to interact with the helix of the large subunit rrna. after peptidyltransfer, spontaneous translocation of the deacylated trna to the large subunit e-site allows the l p-site loop to extend into the p-site, breaking contact with helix . by this model, we hypothesize that the l p-site loop functions locally as a sensor of the occupancy status of the ribosomal p-site.
restriction enzymes were obtained from promega (madison, wi, usa), mbi fermentas (vilnius, lithuania) and roche applied science (indianapolis, in, usa). the quikchange xl ii site-directed specific mutagenesis kit was purchased from stratagene (la jolla, ca, usa). dna sequencing was performed by genewiz (germantown, md, usa). escherichia coli dh a was used to amplify plasmid dna. transformations of yeast and e. coli were performed as previously described ( ). ypad, sd and . mb plates for testing the killer phenotype were as previously reported ( ). plasmids for expression of dual luciferase reporters were described previously ( ). saccharomyces cerevisiae strain psy (mat rpl a::his rpl b::his ura - leu d trp d his d + ycpl b ura ), an rpl a/rpl b gene deletion strain in which l is supplied by a ura -cen based rpl b clone, was a generous gift from dr. pamela silver ( ). the l-a and m viruses were introduced into psy by cytoplasmic mixing (cytoduction) through nonproductive mating with jd [mata kar - arg (l-ahn m )] to produce the killer + strain jd as previously described ( ). wild-type rpl b was isolated from yeast strain psy plasmid (pycp l b ura ). using flanking bamhi restriction sites, a . -kb fragment of dna containing both the -bp wild-type rpl b orf plus the native and utr regions ( bp and bp, respectively) was purified by agarose gel electrophoresis.
this -bp fragment was ligated into bamhi digested prs , a low copy trp -selectable plasmid (purchased from atcc, manassas, va, usa) ( ) to create prs l b-trp . this plasmid served as the template for generation of rpl b mutants by site directed mutagenesis using the primers listed in supplementary table s . wild-type and mutant prs l b-trp clones were transformed into jd , selected for growth on -trp medium, and cells having lost the ura -based plasmid were identified by their ability to grow in the presence of -fluoroorotic acid ( -foa) ( ). the effects of temperature and translational inhibitors were assessed by standard -fold dilution spot assays. yeast were grown in -tryptophan synthetic deletion (sd) media (-trp) to mid log phase. od values were obtained, and cells were serially diluted -fold from to cfu per . ml and spotted on -trp plates. growth was monitored at c, c and c, and pharmacogenetic assays utilized mg/ml paromomycin, mg/ml anisomycin or mg/ml sparsomycin incubated at c for - days. killer virus assays were performed as previously described ( ). the dual luciferase reporter plasmids pydl-control, pydl-la, pydl-ty , pydl-uaa ( ) and pydl-agc ( ) were employed to quantitatively monitor programmed − ribosomal frameshifting, programmed + ribosomal frameshifting, suppression of a uaa codon and suppression of an agc serine codon in place of an aga arginine codon in the firefly luciferase catalytic site, respectively. in this study, the reporters were housed in leu -based reporters: the frame dual luciferase reporter was pjd , the l-a dsrna virus − prf containing reporter was pjd and the ty containing + prf reporter was pjd . cells were grown overnight in -ml volumes of -leu synthetic depletion media to mid log phase (a = . - . ). cells were washed, resuspended in lysis buffer ( x pbs ph . , mm pmsf) and lysed using . -mm glass beads with a vortex mixer for - min at c. lysates were clarified by centrifugation for min. at r.p.m. at c.
samples were maintained on ice, and ml of clarified lysate was added to ml of pre-aliquoted promega larii reagent, mixed by pipetting, and read in a td / luminometer. immediately upon completion of this read, ml of promega stop and glo buffer was added to the tube, pipetted to mix and read again. this was repeated - times per strain per reporter depending on the consistency of the data. frameshifting rates were determined by taking the ratio of firefly to renilla luciferases for each sample, and then taking the ratio of the average ratios of the frame samples to that of test reporter ratios to obtain the rates for both − and + prf. these results were then analyzed by t-test to determine statistical significance compared to wild-type levels as previously described ( ). prior to determining rates of uaa readthrough (nonsense suppression), strains were cured of the endogenous yeast prion [psi + ] by daily serial passage of cells in -trp liquid media containing mm guanidine hydrochloride for days. rates of nonsense suppression were determined as previously described ( ) using the leu selectable -frame control pjd and in-frame uaa containing reporter pjd . missense reporters were based on ura plasmids previously described for the sense reporter ( ) and for the firefly luciferase arginine codon (aga) to serine (agc) missense reporter plasmid pydl-agc ( ). methodologies were the same as those for other dual luciferase assays described above. cells were grown overnight in a c shaker in ml of ypad media to mid-log phase (od . - . ) and cooled to c for h to allow ribosomes to run off of transcripts while remaining tightly coupled. cells were harvested by centrifugation and washed three times with ml . % kcl solution. cell pellets were stored at − c until needed, at which time they were thawed and resuspended in ml binding buffer ( mm tris-hcl ph . , mm mgcl , mm nh cl, mm dtt, mm pmsf) per gram of cells.
cells were lysed with a : vol of zirconia beads (biospec, bartlesville, ok, usa) and disrupted using two -min pulses of a minibead beater. lysates were clarified by centrifugation at r.p.m. ( g) using an msl- rotor at c for min. ribosomes were chromatographically purified using sulfolink beads (pierce, rockford, il, usa) as previously described ( ), and eluted from the resin in ml of elution buffer ( mm tris-hcl ph . , mm mgcl , mm kcl, mm dtt, . mg/ml heparin). eluted ribosomes were treated with mm puromycin and mm gtp for min at c and were layered on top of a -ml glycerol cushion [ mm hepes-koh ph . , mm mg(ch coo) , mm nh cl, mm dtt, % glycerol] and pelleted by centrifugation at r.p.m. at c for - h. pellets were washed with ml of storage buffer [ mm hepes-koh ph . , mm mg(ch coo) , mm nh cl, mm dtt, % glycerol], and resuspended in - ml of storage buffer. concentrations were determined spectrophotometrically ( od = pmol ribosomes). the salt-washed ribosomes were aliquoted and stored at − c for up to months. ribosomal rrna quality was checked on . % agarose gels and rrna to protein ratios were monitored by determining od to od ratios. polysome profiles were obtained by sucrose density gradient centrifugation as previously described ( ). samples were split, and ml of dimethyl sulfoxide (dmso) was added to half of the samples, while ml of mm m was added to the other half. samples were incubated at c for min. ribosomes were precipitated by the addition of ml of ice-cold % ethanol and stored at − c for - h. ribosomes were pelleted by centrifugation at r.p.m. for min and resuspended in lysis buffer and rrnas were isolated using an ambion (austin, tx, usa) rnaqueous-micro rna isolation kit. optical densities were taken at nm and nm to monitor the quantity and quality of rna, and samples were resuspended at a concentration of mg rrna/ ml in pure water. hplc purified oligonucleotide primers purchased from idt (coralville, ia, usa) are listed in supplementary table .
oligonucleotides were resuspended to pmol/ml, end labeled with g[ p]atp with t polynucleotide kinase (roche, indianapolis, in, usa), and purified from free radiolabeled nucleotide by passage through a microspin g- column (ge healthcare, piscataway, nj, usa). annealing reactions utilized mg of modified rrnas and ml of labeled oligonucleotide heated at c for min, followed by a - -min incubation at - c below the t m of each oligonucleotide. annealed rrna/ primers ( ml each) were added to ml of cold enzyme mix [ . ml mm dntp, . ml mm dtt, ml x superscript iii buffer, . ml superscript iii (invitrogen life technologies, carlsbad, ca, usa), . ml h o]. for sequencing samples, an additional ml of each ddntp was added to each c, t, a, g, sample, respectively. primer extension reactions were performed at c for min, with potential -min-long extensions preceding the c at lower temperatures depending on the individual t m values of the primers. denaturing rna loading dye ( ml) was added to each sample, heated to c for . min, and samples were resolved through % urea-acrylamide denaturing gels. gels were dried and radiolabeled samples were visualized by phosphorimagery. the published structures for the s ribosome from e. coli [pdb accession numbers: avy, aw ; ( )], as well as yeast s structures from yeast ( s i, s h, jyv, jyw, jyx; ( , ) ] were used in the analysis of this work and the generation of figures. published t. thermophilus s subunits containing a-site, p-site and e-site phe-trna were also employed ( g x, ( ) . all structures were visualized and manipulated using macpymol software ( ) . the visualization of a single salient loop of l interacting with peptidyl-trna indicated that it might play a vital role in sensing peptidyl-trna occupancy status and transmitting this information to other functional centers of the ribosome. 
as cells expressing rpl b alone were healthier than those solely expressing rpl a, genetic manipulations began with the yeast rpl ad rpl bd double knockout strain jd expressing wild-type (wt) rpl b from a low-copy, ura -selectable episomal plasmid (prpl b-ura ). oligonucleotide site-directed mutagenesis was used to construct a series of mutants, each containing changes of , or sequential amino acids (figure c). stretches of amino acids from arginine to arginine in the l p-site loop were targeted for site-directed mutagenesis, and the mutant alleles were expressed from a low-copy, trp -selectable episomal plasmid under control of the endogenous rpl b promoter (prpl b-trp ). after transformation and selection on sd medium lacking tryptophan (-trp), cells expressing only mutant rpl b alleles were identified by their ability to grow on sd-trp medium containing -fluoroorotic acid ( -foa). three of the multiple substitution mutants were inviable as the sole forms of l b. these were r ytvrtfgir → alanine (i.e. - a); deletion of residues - ( - Δ); and f gir → alanine ( - a). viable mutants, r ytv → alanine ( - a), v rtf → alanine ( - a), r a, y * (mutations including Δ, a, r, e, s, i, q, n, h and f), f a and r a were rescued from yeast into e. coli, and the mutations were confirmed by dna sequencing. the l p-site loop mutants confer temperature- and drug-specific growth phenotypes displayed roughly wild-type growth. cold sensitivity was assessed at c and both mutants grew at wild-type rates. - a showed enhanced growth at c relative to itself at c, while mutant - a was similar to wild type. r a grew at wild-type rates at c and c but showed enhanced growth at c. the y * mutants displayed mutant-specific effects on growth rates at c, but did not confer significant phenotypes at either c or c. f a had wild-type growth rates at all temperatures, while r a showed depressed growth at c, which was rescued at c.
small molecule inhibitors of protein translation are useful probes for identifying changes in ribosome function. this study utilized three such molecules: paromomycin, anisomycin and sparsomycin. the effects of all three drugs were monitored using dilution spot assays at c on sd-trp media containing various drug concentrations. paromomycin is an aminoglycoside antibiotic that increases translational error rates by artificially stabilizing codon:anticodon interactions at the decoding center in the small ribosomal subunit ( ) . as compared to their intrinsic growth in the absence of drug, both the - a and - a mutants were slightly hypersensitive to mg/ml paromomycin, as were y Á, y n and y h. in contrast, r a, f a and r a were all paromomycin resistant ( figure b ). anisomycin competes with the end of the aa-trna for binding to the a-site pocket of the ribosome ( , ) . both - a and - a showed anisomycin resistance at mg/ml, as did several y * mutants, and r a ( figure b ). sparsomycin binds to the p-site and interferes with peptidyl-trna binding and peptidyl transfer ( , ) . - a and - a mutants were hypersensitive to mg/ml sparsomycin, as were most of the y * mutants, with the exception of y f, which conferred slight resistance to this drug ( figure b ). the yeast 'killer' system is composed of the l-a helper and m satellite dsrna viruses ( ) . the l-a dsrna viral genome encodes a capsid protein (gag), and an rna-dependent rna polymerase (pol) that is synthesized as a gag-pol fusion protein consequent to a À programmed ribosomal frameshifting (prf) event ( ) . the m satellite dsrna is encapsidated and replicated in l-a encoded viral particles, and the m (+) strand encodes a secreted toxin that kills uninfected yeast through its interactions with the gpi-anchored kre p cell wall assembly protein ( ) . changes in À prf efficiency alter the ratio of gag to gag-pol, and inhibit the ability of cells to maintain m ( ) . 
to monitor the effects of the mutants on killer virus maintenance, colonies of jd cells expressing either wild-type or mutant rpl b alleles were spotted onto a lawn of diploid, killer − indicator cells. cells expressing wild-type rpl b were killer + as demonstrated by their ability to inhibit growth of the indicator cells (figure c). in contrast, isogenic cells expressing the - a, - a and f a mutants were killer −. a weak killer phenotype, defined by decreased zones of growth inhibition, was observed in mutants y e, y n, y h and f a. the rpl b mutants affect translational fidelity 'translational fidelity' is generically used to describe the accuracy of protein synthesis. a series of bicistronic reporter plasmids was used to quantitatively monitor the effects of the l b mutants on four aspects of translational fidelity: − prf, + prf, suppression of a uaa nonsense codon and incorporation of a missense near-cognate amino acid. in jd cells expressing wild-type rpl b, − prf directed by the l-a dsrna viral signal was . % ± . %. this compares favorably with other 'wild-type' strains in our laboratory (normal range from % to %) ( , ). the - a mutant promoted increased − prf ( . ± . -fold relative to wild type), while - a trended in the opposite direction ( . ± . -fold relative to wild type) (figure , and table ). both these values were statistically significant and correlated well with the killer − phenotypes. y Δ, y n, y e and y h mutants also showed increased rates of − prf, with statistically significant rates ranging from y h at . -fold wild type to y e at . -fold wild type. y a, y s, y q and y f all had wild-type rates of − prf. while both − and + prf are kinetically driven events, the substrates for the slippage are distinct: − prf requires that both the ribosomal a- and p-sites are occupied by trnas, while + prf occurs while the a-site is empty ( ).
rates of + prf were monitored using a cis-acting signal derived from the ty retrotransposable element using pydl-ty . baseline + prf efficiencies in cells expressing wild-type rpl b were . % ± . %. - a had no effect on + prf, while - a promoted a small but statistically significant increase ( . ± . -fold of wild type; figure ). significant changes in + prf were also observed in the y a, y s, y n, y e, y h and y f mutants. mrna decoding occurs in the small subunit decoding center, and changes in termination codon recognition (nonsense suppression) are another indicator of altered translational fidelity. pydl-uaa ( ), which contains an in-frame termination codon immediately of the firefly luciferase gene, was used to monitor this parameter. the baseline rate of nonsense suppression in cells expressing rpl b was . % ± . %. the - a mutant slightly improved this aspect of translational fidelity, with nonsense suppression levels decreasing to . ± . -fold of wild-type levels. - a did not affect uaa recognition (figure ). y Δ, y a, y s, y n, y e and y h all promoted increased rates of nonsense suppression ranging from . - to . -fold wild type. pydl-agc tests missense suppression levels by monitoring rates of incorporation of an arginine (aga) near-cognate amino acid instead of a cognate serine (agc) at the catalytic codon within the firefly luciferase gene as previously described ( ). thus, in this assay, mis-utilization of near-cognate trna arg at the ser agc codon restores firefly luciferase activity. wild-type missense levels were measured at . % ± . , comparable to previous studies ( ). mutant - a had significantly higher levels of missense suppression (measured at . ± . -fold wild type), while - a did not significantly affect this phenomenon ( . ± . -fold wild type) (figure ). missense suppression was not assayed for the single amino acid mutants.
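The bicistronic reporter readout described above can be sketched as a short calculation. This is a minimal sketch, assuming the commonly used dual-luciferase normalization in which recoding efficiency is the firefly/renilla ratio of the test reporter divided by the same ratio from a zero-frame control reporter. The function names and every luminescence value below are hypothetical and are not data from this study.

```python
from statistics import mean


def recoding_efficiency(test_reads, control_reads):
    """Percent recoding from (firefly, renilla) luminescence pairs.

    Assumed normalization: mean firefly/renilla ratio of the test reporter
    divided by the mean ratio of the zero-frame control reporter.
    """
    test_ratio = mean(firefly / renilla for firefly, renilla in test_reads)
    control_ratio = mean(firefly / renilla for firefly, renilla in control_reads)
    return 100.0 * test_ratio / control_ratio


def fold_wild_type(mutant_pct, wild_type_pct):
    """Express a mutant's recoding efficiency relative to wild type."""
    return mutant_pct / wild_type_pct


# Hypothetical replicate readings, for illustration only.
control = [(1000.0, 500.0), (1100.0, 550.0)]  # zero-frame control reporter
wt_test = [(80.0, 400.0), (100.0, 500.0)]     # test reporter, wild-type cells
mut_test = [(120.0, 400.0), (150.0, 500.0)]   # test reporter, mutant cells

wt_pct = recoding_efficiency(wt_test, control)
mut_pct = recoding_efficiency(mut_test, control)
fold = fold_wild_type(mut_pct, wt_pct)
print(f"wild type: {wt_pct:.2f}%  mutant: {mut_pct:.2f}%  fold wild type: {fold:.2f}")
```

Dividing by the control ratio removes differences in reporter expression and lysate recovery between samples, which is why results can then be compared across strains as fold wild type.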
the mutant rpl b alleles promote opposing effects on trna binding to the ribosomal a-and p-sites sucrose gradient analyses were employed to fractionate cycloheximide arrested elongating ribosomes on mrnas in lysates generated from jd cells expressing wild-type l b, - a, and - a. in all strains the s peak was smaller than that of the s fraction which can be attributed to the presence of only a single copy of rpl b, which has previously been shown to effectively reduce the number of s subunits produced by the cell to - % of true wild-type levels while having no visible phenotypic effect on growth ( ) . no significant differences were observed among the samples (data not shown). phenotypic variation in prf and in the presence of anisomycin and sparsomycin are indicative of altered interactions between the ribosome and trnas. p-site trna k d values were determined in vitro by binding -fold serial dilutions of n-acetylated-[ c]phe-trna to ribosomes until saturation was achieved ( figure a ), and the resulting data were used to determine steady-state single site binding k d values ( figure b ). wild-type ribosomes bound this p-site substrate with a k d of . ± . nm. the - a mutants promoted a slight increase in affinity for p-site substrate (k d = . ± . nm), while - a had the opposite effect (k d = . ± . nm). given the physical interaction between the l p-site loop and peptidyl-trna, it was imperative to determine whether the observed small changes in p-site affinities promoted by the mutants were biochemically significant. to this end, multiple turnover puromycin reactions were performed. in these experiments, puromycin was added to ribosomes pre-incubated with excess p-site substrate, i.e. ac-[ c]phe-trna phe , and accumulation of the peptidylpuromycin product was monitored over time. in these reactions, the first round of peptidylpuromycin synthesis is very rapid. 
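The single-site k d determination described above can be sketched as a small curve fit. This is a minimal sketch, assuming the standard single-site isotherm bound = bmax x [l] / (k d + [l]); the fitting routine (a coarse log-scale scan followed by golden-section refinement), the function name and the synthetic titration are illustrative assumptions, not the software or data used in this study.

```python
import math


def fit_single_site(ligand, bound, kd_lo=1e-3, kd_hi=1e6, grid=400, iters=80):
    """Least-squares fit of bound = bmax * L / (kd + L); returns (kd, bmax)."""

    def profile(kd):
        # For a fixed kd the best bmax has a closed form (linear least squares).
        x = [l / (kd + l) for l in ligand]
        bmax = sum(v * y for v, y in zip(x, bound)) / sum(v * v for v in x)
        sse = sum((y - bmax * v) ** 2 for v, y in zip(x, bound))
        return sse, bmax

    # Coarse scan of kd on a log scale to bracket the minimum.
    lo, hi = math.log(kd_lo), math.log(kd_hi)
    logs = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    best = min(range(grid), key=lambda i: profile(math.exp(logs[i]))[0])
    a, b = logs[max(best - 1, 0)], logs[min(best + 1, grid - 1)]

    # Golden-section refinement inside the bracket.
    phi = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = b - phi * (b - a), a + phi * (b - a)
    for _ in range(iters):
        if profile(math.exp(c))[0] < profile(math.exp(d))[0]:
            b = d
        else:
            a = c
        c, d = b - phi * (b - a), a + phi * (b - a)

    kd = math.exp((a + b) / 2.0)
    return kd, profile(kd)[1]


# Synthetic, noise-free titration (serial dilutions) with kd = 50 and bmax = 2.
ligand = [100.0 * 2.0 ** i for i in range(-4, 5)]
bound = [2.0 * l / (50.0 + l) for l in ligand]
kd, bmax = fit_single_site(ligand, bound)
print(f"kd = {kd:.3f}, bmax = {bmax:.3f}")
```

Fitting the untransformed isotherm avoids the error distortion of double-reciprocal plots; with real, noisy data the estimate should be reported with an uncertainty from replicate titrations.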
next, in a slow step, the ribosome intrinsically translocates the deacylated trna phe into the e-site ( ), followed by the slow diffusion of ac-[ c]phe-trna phe into the p-site where it can react with puromycin. repetition of this cycle results in slow multiple rounds of product synthesis (figure c). assuming that the l mutants do not affect either rates of intrinsic translocation or of ac-[ c]phe-trna phe diffusion into the p-site, changes in product accumulation, i.e. k obs , should be due to differences in binding affinities for the p-site substrate. consistent with this model, - a promoted . ± . -fold increased k obs relative to wild-type ribosomes, while - a decreased k obs to (figure e and f).

figure . the l b mutants promote defects in translational fidelity. isogenic yeast cells expressing either wild-type or mutant forms of l b were transformed with dual luciferase reporters and control plasmids and rates of translational recoding were determined. all results are graphed as fold wild type. − prf was measured using the yeast l-a virus frameshift signal. + prf was directed by the frameshift signal derived from the ty retrotransposable element. nonsense suppression denotes the percentage of ribosomes able to suppress an in-frame uaa termination codon positioned between the renilla and firefly luciferase reporter genes. missense suppression rates were evaluated by incorporation of an arginine (aga) near-cognate amino acid instead of a cognate serine (agc) at the catalytic codon within the firefly luciferase gene. error bars denote standard error. p-values are indicated above samples showing statistically significant changes.

the p-site loop is flexible depending on the occupancy status of the p-site the highly basic nature of the p-site loop, its interaction with peptidyl-trna, and its proximity to s rrna helix (h ) suggested that it might interact with either of these two rna components depending on the occupancy status of the p-site.
changes in interactions between the p-site loop and local rrna structures may in turn propagate outward to more distant regions of the ribosome. to test this, shape ( ) ( ) ( ) was employed to probe for structural alterations in selected regions of the s, s and s rrnas due to either the l b mutants or in wild-type ribosomes with occupied or unoccupied p-sites. due to the large size and complex three-dimensional structure of the ribosome, the entire rrna content was not examined. rather, approximately one-third of the rrna bases were interrogated, focusing on those bases closest to l , the a- and p-sites, and the decoding center. in the first series of experiments, salt-washed wild-type and - a, - a, y q and y f mutant ribosomes (chosen for structural analyses because they had the most pronounced genetic phenotypes) were treated with m , an electrophile that adds an adduct onto the oh groups of solvent-exposed sugars. modifications were performed on salt-washed ribosomes because they represent the thermodynamic 'ground state' of the ribosome. thus, the structural changes observed are indicative of changes in the full 'dynamic potential' of the ribosome as opposed to conformations locked in by e.g. occupation of binding sites by trnas or ribosome-associated factors. rrnas were extracted and hybridized with [ p]-labeled oligonucleotide primers, and reverse transcriptase primer extension reactions were performed. the products were separated through urea-acrylamide denaturing gels, and visualized using a phosphorimager. -oh ribose modification results in a strong stop -nt of modified bases, and the intensity of the stops is proportional to the solvent accessibility and flexibility of riboses. comparison of the protection patterns between wild-type and mutant ribosomes enables identification of specific bases which became protected or deprotected relative to wt.
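The comparison of protection patterns described above can be sketched as a per-nucleotide difference call. This is a minimal sketch, assuming that each primer-extension stop has already been quantified and background-corrected against the dmso control lane; the threshold, the function name, the nucleotide labels and the intensity values are all hypothetical and are not taken from this study.

```python
def classify_shape_changes(wild_type, mutant, threshold=0.3):
    """Label each nucleotide as protected, deprotected or unchanged vs wild type.

    wild_type, mutant: dicts mapping a nucleotide label to its normalized
    stop intensity (reagent lane minus dmso control lane).
    threshold: illustrative cutoff on the intensity difference.
    """
    calls = {}
    for nt, wt_value in wild_type.items():
        delta = mutant[nt] - wt_value
        if delta <= -threshold:
            calls[nt] = "protected"      # weaker stop: less reactive than wild type
        elif delta >= threshold:
            calls[nt] = "deprotected"    # stronger stop: more reactive than wild type
        else:
            calls[nt] = "unchanged"
    return calls


# Hypothetical normalized intensities for three positions.
wt = {"n1": 0.8, "n2": 0.2, "n3": 0.5}
mut = {"n1": 0.3, "n2": 0.9, "n3": 0.55}
calls = classify_shape_changes(wt, mut)
for nt, call in calls.items():
    print(nt, call)
```

A fixed cutoff is the simplest possible rule; in practice the call would be made only for differences that reproduce across independent ribosome preparations.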
in all areas examined, rpl b ribosomes y q and y f matched the wild-type rrna base modification profile (data not shown), while - a and - a ribosomes revealed consistently reproducible differences. the most significant changes in rrna structure were observed in bases c -a (e. coli numbering: c - ) located in the terminal loop of s rrna h ( figure a and e). the two mutants promoted opposing patterns of base protection/deprotection in this structure. specifically, as compared to wild-type ribosomes, - a promoted enhanced protection of this loop, while the loop was deprotected in the - a mutants. analysis of the recent cryo-em yeast ribosome structure ( ) revealed that these h loop bases are located within Å of the stretches of amino acids changed to alanines in both the - a and - a mutants ( figure b ). these findings suggested that the two mutants had the effects of displacing the p-site loop into two opposing conformational states: extended toward the p-site ( - a), or retracted into h ( - a). to test whether these two states are naturally dependent on p-site occupancy, the experiments were repeated with wild-type and mutant ribosomes with or without trna phe in their p-sites. consistent with this model, addition of trna to the p-site of wild-type ribosomes resulted in slightly enhanced protection of the h terminal loop bases closest to the p-site loop (a -a ). interestingly, c showed significant deprotection when the p-site was occupied by trna. this base is on the far side of the terminal end of h from the p-site loop, suggesting that h itself alters its conformation upon trna occupancy of the p-site ( figure c ). - a's h bases were unchanged between p-site bound and unbound ribosomes, consistent with the p-site loop positioned in the 'retracted' state in this mutant, although small differences in the protection patterns suggest that the p-site loop is in a slightly different orientation in this mutant. in contrast, while - a ribosomes, i.e. 
the p-site loop 'extended' state, showed deprotection at all bases (c -a ) for both p-site bound and salt washed ribosomes, bases a -a were less deprotected when trna was in the p-site and c was even more reactive, consistent with the notion that the p-site loop interacts with h when peptidyl-trna is in the p-site. although no other shape-specific changes were observed, several other phosphodiester bonds of specific s rrna bases were reproducibly more, or less, intrinsically labile as compared to wild type ( figure d ). in both mutants, g and g located in expansion segment (es ) were more stable than in wild-type ribosomes as evidenced by reduced intensity of strong reverse transcriptase stops -nt of these bases. additionally, bases a -a (e. coli a -u ) located in the terminal loop of helix were hyper-labile in - a mutant ribosomes as compared to wt, as shown by the presence of strong stops with increased intensity -nt of these bases. these are mapped onto the two-dimensional structure of yeast s rrna ( figure e ). the l p-site loop is largely comprised of polar amino acids and carries a net positive charge, making it ideal for interactions with the phosphate backbones of nucleic acids, e.g. rrna and trna. positioned between h and the peptidyl-trna t-loop, several of its amino acids are within h-bonding distance of h ($ . Å ), while c of the peptidyl trna t-loop comes within . Å of g in the l p-site loop ( , ) , suggesting that the l p-site loop can directly interact with both of the rna-based structures. while currently available x-ray crystal structures are unavailable for ratchet-state ribosomes, a recently published examination of trna movement through the e. coli ribosome using large-scale analysis of cryo-em images implicates the p-site loop as a dynamic arm interacting with and moving in relation to trnas passing across the p-site ( ) . 
although these studies were performed at resolutions of - Å , leaving considerable ambiguity regarding the precise residues involved, they clearly reveal highly dynamic interactions between the p-site loop and both p-site, and e-site trnas. although death is not a phenotype per se, the inviable mutants are informative nonetheless in so far as they demonstrate that the amino acids f gir are absolutely required for viability. while f is universally conserved, it does not appear to be essential on its own for viability, as witnessed in the mild phenotypes of the f a mutant. similarly, all single amino acid changes explored here resulted in viable cells, suggesting a certain degree of biochemical/biophysical redundancy within this essential loop. in support of this notion, the strongest growth phenotypes observed across a range of temperatures and small molecule translational inhibitors were concentrated in the multiple alanine substitutions, i.e. - a and - a, thus directing the bulk of the biochemical and structural analyses to these two mutants. analysis of the results of the assays performed on the viable multiple alanine substitution mutants (summarized in table ) provoke the hypothesis that the l p-site loop may dynamically function to help the ribosome sense the occupancy status of the large ribosomal subunit p-site. this is modeled in figure . when the p-site is unoccupied, the p-site loop can extend into this space, moving away from the terminal loop of h . upon occupation of the p-site, the peptidyl-trna t-loop displaces the l p-site loop, causing its retraction into h . by this model, the rrna shape analyses depicting increased protection of helix by the - a mutant show that this mutant drives the l p-site loop equilibrium toward the 'retracted' state. conversely, increased deprotection of helix in the - a mutant suggests that this more mimics the p-site unoccupied state, i.e. the 'extended' p-site loop state. 
this analysis directly explains the p-site binding data. retraction of the p-site loop from the p-site results in - a ribosomes having higher intrinsic affinity for this substrate while extension of this structure into the p-site creates a steric clash with the peptidyl-trna t-loop, resulting in decreased affinity for this substrate. that neither mutant conferred optimal peptidyl-trna p-site occupancy may account for their hypersensitivity to sparsomycin, especially for - a in which the p-site loop is already competing with the trna for the p-site. mutants - a, - a and - Δ appear to disrupt the normal function of the p-site loop to a lethal level. in addition, the observation that trna binding to the p-site results in deprotection of c implicates h itself as a structurally dynamic unit. the functional consequences of this are not clear, although it is tempting to speculate that this conformational change may play a role in the structural rearrangements of the b b and b c bridges between the pre- and post-translocational states. the lack of rrna structural changes in the a-site or in the decoding center suggests that the biochemical and phenotypic effects observed are indirectly due to the changes described above. the reciprocal effects between ac-aa-trna binding with the p-site and aa-trna interactions with the a-site are intriguing. in the aa-trna binding reactions, the ribosomal p-sites were occupied with deacylated trna. we suggest that in the - a mutant, the p-site ligand is more 'locked' into a suboptimal conformation, which in turn feeds back to the a-site, resulting in decreased affinity for its ligand. conversely, the lessened ability of - a mutant ribosomes to lock p-site ligand in a suboptimal conformation may account for the increased affinity of these ribosomes for a-site ligand. anisomycin resistance by both mutants also followed the reciprocal p-site/a-site pattern, i.e. both mutants were sparsomycin hypersensitive.
paromomycin interacts with the decoding center in the small subunit, where it promotes misreading of near-cognate codons in the a-site by stabilizing codon-anticodon interactions ( ) . this sensitivity may be attributable to an observed increase in missense incorporation of a near cognate arginine (aga) over that of the sense serine codon (agc) in mutant - a. intriguingly, - a had wild-type levels of missense incorporation suggesting that its sensitivity to paromomycin was indirect. the reciprocal anisomycin/paromomycin phenotypes of the l mutants demonstrate the effects of this protein on a-site ligand based ribosomal functions over very long distances. similar phenotypic patterns were previously observed with mutants of other large subunit components ( , ) . the observed effects on - prf are consistent with a recent kinetic analysis demonstrating that aa-trna slippage is the most highly weighted parameter in determining the rate at which this process occurs (liao,p.y. et al., submitted for publication). here, increased affinity for aa-trna by the large subunit suggests that the - a ribosomes stabilize the frameshifted (i.e. near-cognate) trnas, reducing their ability to be proofread, thus promoting increased rates of À prf. this is consistent with the observed increased rates of missense decoding in this mutant. conversely, post-slippage a-site trnas are even less stable in the - a mutants, leading these to be more efficiently proofread, and thus promoting decreased À prf efficiency. in both cases, altering À prf from the optimum 'golden mean' precludes these cells from maintaining the yeast killer virus ( , ) . programmed + frameshifting is completely dependent on peptidyl-trna slippage. increased + prf in the - a mutant is consistent with decreased affinity for this substrate. 
the failure to observe decreased + prf in the - a mutant, despite its increased affinity for peptidyl-trna, is not entirely understood, although this may be due to the inability of these ribosomes to achieve a threshold beyond which + prf effects can be observed. the changes in rrna stability observed in the terminal loop of helix and in es are intriguing. chemical protection experiments revealed the terminal loop of helix is involved in a kissing loop interaction with the terminal loop of helix , and this interaction is apparent in the x-ray crystal and cryo-em structures ( , ).

figure . model: the p-site loop acts as a sensor of the occupancy status of the p-site. (left) when the large subunit p-site is unoccupied by trna, the l p-site loop is able to extend into this space leaving the distal loop of h partially deprotected from chemical attack. this conformation is favored by the - a mutant of l b. (right) occupation of the p-site by peptidyl-trna displaces the l p-site loop, causing it to tightly retract from the p-site and interact with h , resulting in increased protection of the h terminal loop from chemical attack. h likely moves toward the p-site loop slightly, increasing the exposure of c to the surrounding solvent. this conformation is favored by the l b - a mutant.

increased lability at a and a was previously observed in the y c mutant of ribosomal protein l (homolog of e. coli l ) located at the base of the aa-trna accommodation corridor, and in the Ψ c (e. coli u ) s rrna mutant located in the peptidyltransferase center ( , ). the observation that mutations located in three very different and topologically distinct regions of the large subunit conferred similar structural effects suggests that this kissing loop interaction plays an important role in ribosome function.
its location on the cytoplasmic face of the ribosome where deacylated trna leaves the molecule implies that the interaction between the terminal loops of helices and may be involved in gating this deacylated trna exit corridor open and closed. this is consistent with the model of allosteric coordination between the a- and e-sites ( , ), which would indicate that the defects conferred by all of these mutants on aa-trna binding might impair this e-site gating function. the decreased lability of c and g in es is similarly intriguing, raising more questions than answers. no function is currently associated with this expansion segment, but recent cryo-em analysis shows it to be located on a solvent accessible surface of the large subunit ( ). perhaps this site is also involved in a-site/e-site coordination. alternatively, it may be a site for recognition of defective ribosomes by the nonfunctional ribosome decay apparatus.

the complete atomic structure of the large ribosomal subunit at . Å resolution
structures of the bacterial ribosome at . Å resolution
structure of the s ribosome from saccharomyces cerevisiae-trna-ribosome and subunit-subunit interactions
comprehensive molecular structure of the eukaryotic ribosome
crystal structure of the ribosome at . Å resolution
structures of the ribosome in intermediate states of ratcheting
the process of mrna-trna translocation
the conserved a-site finger of the s rrna: just one of the intersubunit bridges or a part of the allosteric communication pathway?
locking and unlocking of ribosomal motions
the roles of ribosomal proteins in the structure assembly, and evolution of the large ribosomal subunit
s rrna: structure and function from head to toe
domain movements of elongation factor eef and the eukaryotic s ribosome facilitate trna translocation
the primary structure of the gene encoding yeast ribosomal protein l
depletion of saccharomyces cerevisiae ribosomal protein l causes a decrease in s ribosomal subunits and formation of half-mer polyribosomes
assembly of s ribosomal subunits is perturbed in temperature-sensitive yeast mutants defective in ribosomal protein l
developmental regulation of ribosomal protein l genes in arabidopsis thaliana
essential role of ribosomal protein l in mediating growth inhibition-induced p activation
ribosomal protein l and l mutations are associated with cleft palate and abnormal thumbs in diamond-blackfan anemia patients
ribosomal frameshifting efficiency and gag/gag-pol ratio are critical for yeast m double-stranded rna virus propagation
an in vivo dual-luciferase assay system for studying translational recoding in the yeast saccharomyces cerevisiae
factors affecting nuclear export of the s ribosomal subunit in vivo
a system of shuttle vectors and yeast host strains designed for efficient manipulation of dna in saccharomyces cerevisiae
methods in yeast genetics
differentiating between near- and non-cognate codons in saccharomyces cerevisiae
systematic analysis of bicistronic reporter assay data
enhanced purity, activity and structural integrity of yeast ribosomes purified using a general chromatographic method
gcd , a translational repressor of the gcn gene, has a general function in the initiation of protein synthesis in saccharomyces cerevisiae
experimental prerequisites for determination of trna binding to ribosomes from escherichia coli
yeast ribosomal protein l affects the kinetics of protein synthesis and ribosomal protein l improves translational accuracy, while mutants lacking both remain viable
the pymol molecular graphics system
functional insights from the structure of the s ribosomal subunit and its interactions with antibiotics
inhibitors of protein biosynthesis. ii. mode of action of anisomycin
structures of five antibiotics bound at the peptidyl transferase center of the large ribosomal subunit
structural basis for the interaction of antibiotics with the peptidyl transferase centre in eubacteria
double-stranded rna viruses of saccharomyces cerevisiae
a - ribosomal frameshift in a double-stranded rna virus forms a gag-pol fusion protein
kre p, the plasma membrane receptor for the yeast k viral toxin
an 'integrated model' of programmed ribosomal frameshifting and post-transcriptional surveillance
evidence against a direct role for the upf proteins in frameshifting or nonsense codon readthrough
translocation mechanism of ribosomes
rna structure analysis at single nucleotide resolution by selective -hydroxyl acylation and primer extension (shape)
a fast-acting reagent for accurate analysis of rna secondary and tertiary structure by shape chemistry
selective -hydroxyl acylation analyzed by primer extension (shape): quantitative rna structure analysis at single nucleotide resolution
ribosome dynamics and trna movement by time-resolved electron cryomicroscopy
selection of trna by the ribosome requires a transition from an open to a closed form
structure/function analysis of yeast ribosomal protein l
an arc of unpaired 'hinge bases' facilitates information exchange among functional centers of the ribosome
achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins
evolutionary relationships amongst archaebacteria. a comparative study of s ribosomal rnas of a sulphur-dependent extreme thermophile, an extreme halophile and a thermophilic methanogen
yeast ribosomal protein l helps coordinate trna movement through the large subunit
rrna mutants in the yeast peptidyltransferase center reveal allosteric information networks and mechanisms of drug resistance
features of s mammalian ribosome and its subunits
deacylated trna is released from the e site upon a site occupation but before gtp is hydrolyzed by ef-tu

we would like to thank dr. rasa rakauskaite for assistance training, offering technical support and advice above and beyond the call of duty. further thanks as well to dr. arturas meskauskas, dr. karen jack, hamid-reza shahshahan, ashton trey belew, dr. jonathan leshin and the rest of our laboratory for help and support. we thank dr. pamela silver for providing us with strain psy .

supplementary data are available at nar online.

conflict of interest statement. none declared.

key: cord- - u b xo authors: firth, andrew e.; wills, norma m.; gesteland, raymond f.; atkins, john f. title: stimulation of stop codon readthrough: frequent presence of an extended ′ rna structural element date: - - journal: nucleic acids res doi: . /nar/gkr sha: doc_id: cord_uid: u b xo

in sindbis, venezuelan equine encephalitis and related alphaviruses, the polymerase is translated as a fusion with other non-structural proteins via readthrough of a uga stop codon. surprisingly, earlier work reported that the signal for efficient readthrough comprises a single cytidine residue ′-adjacent to the uga. however, analysis of variability at synonymous sites revealed strikingly enhanced conservation within the ∼ nt ′-adjacent to the uga, and rna folding algorithms revealed the potential for a phylogenetically conserved stem–loop structure in the same region. mutational analysis of the predicted structure demonstrated that the stem–loop increases readthrough by up to -fold.
the same computational analysis indicated that similar rna structures are likely to be relevant to readthrough in certain plant virus genera, notably furovirus, pomovirus, tobravirus, pecluvirus and benyvirus, as well as the drosophila gene kelch. these results suggest that 3′ rna stimulatory structures feature in a much larger proportion of readthrough cases than previously anticipated, and provide a new criterion for assessing the large number of cellular readthrough candidates that are currently being revealed by comparative sequence analysis.
there are two types of exceptions to universality of the genetic code. in one, the meaning of a codon is globally reassigned in a context-independent manner ( ) . in the other, codon redefinition is in competition with standard decoding and is codon context dependent ( ) . though there is an example where the meaning of a sense codon is redefined ( ) , most cases of codon redefinition involve one of the three stop codons of the standard code (uga, uag or uaa) specifying an amino acid at least a proportion of the time that it is decoded. where the significant feature of stop codon redefinition is to allow ribosomes to continue translation into a downstream open reading frame (orf), rather than the identity of the amino acid specified, then it is generally termed stop codon readthrough (rt) ( ) . in contrast, when selenocysteine or pyrrolysine is specified by uga or uag, respectively, then the important features are the special properties of these non-universal amino acids ( ) ( ) ( ) . both types of non-global codon redefinition are just one aspect of the variety of ways (collectively referred to as 'recoding') in which genetic readout can be dynamically altered in a site- or mrna-specific manner ( , ) .
numerous studies have shown that the identity of the 3′-adjacent nucleotide influences stop codon leakiness in both prokaryotes and eukaryotes and, correspondingly, there is considerable bias in the identity of the nucleotide at this position for natural gene terminators ( ) ( ) ( ) ( ) ( ) ( ) ( ) . of great interest was the discovery that rt of the coat protein (cp) gene terminator of the phage qβ yields a greatly extended protein that is important for viral propagation ( , ) . shortly afterwards, studies that utilized purified yeast suppressor trnas in in vitro experiments found that several plant viruses, including tobacco mosaic tobamovirus (tmv), also utilize rt to express their replicase proteins ( ) ( ) ( ) . similarly, murine leukemia gammaretrovirus (mulv), whose relevant sequence is identical to that in xenotropic mulv-related virus (xmrv), utilizes rt of the gag gene terminator to allow ribosomes to enter the pol gene and synthesize the gag-pol polyprotein that is the source of viral reverse transcriptase ( , ) . mulv pol binds to the translation release factor, erf , and non-interacting mutants of pol failed to synthesize adequate levels of gag-pol to permit replication ( ) . this raises the possibility of temporal control of rt ( ) . the efficiency of rt in the drosophila gene kelch also appears to be developmentally regulated ( ) . two other drosophila genes, headcase and out-at-first, are known to employ rt, though another approximately candidate cases have recently been identified via comparative genomic approaches utilizing sequences from drosophila species ( ) ( ) ( ) . although some of these candidates may actually be cases of alternative splicing or rna editing, the indication is that utilized rt may be significantly more common in cellular organisms than previously supposed. several alphaviruses, including sindbis virus (sinv), utilize rt of a uga stop codon in their replicase gene ( , ) .
for sinv, primarily on the basis of in vitro translation studies, the only contextual feature reported to be important for rt was the identity of the cytidine nucleotide immediately 3′ of the stop codon, directly analogous to the results of the early stop codon leakiness studies ( ) . similarly, in the tobraviruses (specifically tobacco rattle virus) and, by implication, the pecluvirus, furovirus and pomovirus replicase gene, and the furovirus cp-extension gene, it has been reported that rt of the uga stop codon might depend on just the three 3′-adjacent nucleotides ( ) . for these plant viruses, and alphaviruses that utilize rt, the consensus motif in wild-type (wt) viruses is uga-cua or uga-cgg ( ) . in contrast, for tmv (where the rt codon is uag), plant tissue culture experiments showed that the nt immediately 3′ of the stop codon are relevant, with the consensus motif for efficient rt being uag-car-yya ( , ) . the same motif is utilized by a number of other plant viruses, while the motif uag-car-nba stimulates rt in yeast ( ) . in terms of 5′ stimulatory motifs, adenines at the − and − nucleotide positions have been shown to positively modulate rt in yeast and are a feature common to many virus rt sites, notably in the tobamoviruses, poleroviruses and luteoviruses ( ) . for a relatively small number of cases of utilized rt, the known stimulatory signals involve an mrna structure 3′ of the stop codon. in mulv, in vitro translation studies showed that a compact pseudoknot structure 3′ of the gag terminator, uag, is essential for meaningful levels of rt, with the identity of certain nucleotides in the nt 'spacer' region between the stop codon and the pseudoknot, as well as some of the nucleotides in loop of the pseudoknot, being important ( ) ( ) ( ) ( ) . the location of the pseudoknot ( nt 3′ of the stop codon) may permit it to act at the mrna unwinding site half-way through the mrna entrance channel of the ribosome ( ) .
a very different stimulatory element is present in the plant luteoviruses, where rt at the end of the cp gene produces a much larger cp-extension protein that is important for aphid transmission ( ) . in the best-studied of these viruses, barley yellow dwarf, both 3′-adjacent sequences and an element ∼ - nt 3′ of the uag stop codon have been identified as important for rt, and long-range rna base pairing between the proximal and distal elements has been suggested as a possible mechanism ( ) . similar results were found for beet western yellows virus ( ) . although cytidine residues are under-represented at the position immediately 3′-adjacent to uga (and other) terminators in eukaryotes, they are by no means absent ( , ) . thus we hypothesized that, at least in vivo, rt in sinv and other alphaviruses might be modulated by additional sequence elements. to test for the existence of such elements, we investigated the degree of phylogenetic conservation at synonymous sites downstream of known rt stop codons in alphavirus genomes, and then extended the analyses to other rna viruses and selected cellular rt genes. regions of enhanced conservation at synonymous sites are indicative of overlapping functional elements such as rna secondary structures or primary nucleotide sequences with functions in addition to amino acid coding. in many cases, and in particular those cases where rt of a uga codon had been previously assumed to be stimulated simply by the 3′-adjacent nucleotides cua or cgg, we found considerably enhanced conservation at synonymous sites in the 3′-adjacent sequence, typically extending over a region of - nt 3′-adjacent to the stop codon. here, we computationally and experimentally explore these conserved regions and their significance for rt.
the genus alphavirus encompasses approximately described species, many of which infect humans and livestock, causing rashes, painful arthritis, fever and potentially fatal encephalitis (reviewed in reference ; see reference for a phylogeny). transmission is generally via arthropods such as mosquitoes. the single-stranded positive sense genomic rna is about - kb long and contains two long orfs separated by a short non-coding sequence ( figure a ). the 5′-proximal orf codes for the non-structural proteins nsp -nsp -nsp -nsp while the 3′-proximal orf, which is translated from a subgenomic rna, codes for the structural polyprotein c-e -e - k-e and, via programmed ribosomal frameshifting, c-e -e -tf ( ) . in sinv, venezuelan equine encephalitis (veev), eastern equine encephalitis (eeev), western equine encephalitis (weev) and related alphaviruses, a uga stop codon separates the coding sequence for nsp (rna-dependent rna polymerase, rdrp) from the coding sequence for nsp ( , ) . in contrast, the salmonid alphaviruses lack the uga stop codon while, for alphaviruses in the semliki forest complex, the stop codon tends to be present in some but not all strains even within a single species, possibly as a result of conflicting selective forces in alternating arthropod and vertebrate hosts (passaging in cell culture may also drive selection for or against a stop codon at this location; see ref. and references therein).
virus sequences were obtained from genbank in may , updated in october , and processed using blast, emboss and clustalw ( ) ( ) ( ) . the accession numbers of all sequences used are given in the supplementary data. coding sequences were extracted, translated, aligned with clustalw and back-translated to nucleotide sequence alignments, and manually adjusted in a few cases.
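the back-translation step described above (align at the amino acid level, then thread the original codons back onto the protein alignment) can be sketched as follows. this is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `back_translate` and the toy sequences are hypothetical, and the aligned protein is assumed to use '-' as its gap character.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Each aligned residue consumes one codon from `cds`; each protein gap
    character ('-') becomes a three-nucleotide gap ('---').
    Hypothetical helper for illustration only.
    """
    # Split the coding sequence into complete codons.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if n_residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")
    codon_iter = iter(codons)
    return "".join("---" if aa == "-" else next(codon_iter)
                   for aa in aligned_protein)


# Toy example: a one-residue gap in the protein alignment becomes '---'.
print(back_translate("M-K", "ATGAAA"))
```

because gaps are inserted only in whole-codon units, the reading frame of every sequence is preserved in the nucleotide alignment, which is what allows synonymous sites to be compared codon by codon.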
for the synonymous site conservation plots, alignment columns in which the reference sequence ( figure ) contained gap characters were removed so that the plots are in reference sequence coordinates. rna structures were predicted using a combination of vienna rna rnafold and alidot, pknotsrg and manual inspection ( , ) . conservation at synonymous sites was analyzed as described in ref. ( ; a procedure inspired by the sssv statistic of ref. ). the procedure takes into account whether synonymous site codons are 1-, 2-, 3-, 4- or 6-fold degenerate and the differing probabilities of transitions and transversions. briefly, for a given pair of sequences within an alignment, a codon position was defined as a synonymous site if the same amino acid was encoded in both sequences. a 'null' substitution model was defined such that the relative probability of each possible synonymous codon substitution (including substitution with itself) at such sites may be calculated by assuming that the component nucleotides evolve neutrally. neutral evolution was modelled using a kimura nucleotide substitution matrix with k = ( ) . for each sequence pair, the divergence parameter t was set so that the total expected number of nucleotide substitutions at synonymous sites under the null model was equal to the total observed number. next, the difference between the expected number and observed number of nucleotide substitutions was calculated at each synonymous site in the pairwise comparison. the variance at each site was calculated from the expected probabilities of each possible synonymous codon substitution, assuming a multinomial distribution. statistics were summed, at each alignment codon position, over a phylogenetic tree as described in ref. ( ) . finally, the statistics were averaged over a sliding window.
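the core of the null model described above can be sketched in a few lines. this is a simplified illustration, not the published implementation: it covers only the per-codon null probabilities, the expected number of substitutions at one synonymous site, and the sliding-window average. the function names, the rate scaling (transversion rate fixed at 1, transition rate equal to `kappa`) and the example values of `t` and `kappa` are assumptions; fitting `t` per sequence pair, the variance calculation and the summation over the tree are omitted.

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (NCBI translation table 1).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
PURINES = {"A", "G"}


def k80_prob(x, y, t, kappa):
    """Kimura two-parameter probability of nucleotide x becoming y after time t.

    Assumed scaling: each transversion has rate 1, the transition has rate kappa.
    """
    e1 = math.exp(-4.0 * t)
    e2 = math.exp(-2.0 * (kappa + 1.0) * t)
    if x == y:
        return 0.25 + 0.25 * e1 + 0.5 * e2
    if (x in PURINES) == (y in PURINES):  # purine<->purine or pyrimidine<->pyrimidine
        return 0.25 + 0.25 * e1 - 0.5 * e2
    return 0.25 - 0.25 * e1  # one specific transversion


def synonymous_null(codon, t, kappa):
    """Relative probability of each synonymous codon (including itself),
    assuming the three nucleotides evolve neutrally and independently."""
    aa = CODON_TABLE[codon]
    weights = {}
    for other, other_aa in CODON_TABLE.items():
        if other_aa == aa:
            w = 1.0
            for x, y in zip(codon, other):
                w *= k80_prob(x, y, t, kappa)
            weights[other] = w
    total = sum(weights.values())
    return {c: w / total for c, w in weights.items()}


def expected_substitutions(codon, t, kappa):
    """Expected number of nucleotide differences at one synonymous site."""
    null = synonymous_null(codon, t, kappa)
    return sum(p * sum(a != b for a, b in zip(codon, c)) for c, p in null.items())


def sliding_mean(values, window):
    """Average a per-codon statistic over a sliding window of codons."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Illustrative values only; t and kappa here are not taken from the paper.
print(synonymous_null("GGT", t=0.1, kappa=3.0))
print(expected_substitutions("GGT", t=0.1, kappa=3.0))
```

restricting the normalization to codons that encode the same amino acid is what makes degeneracy enter the model automatically: a 1-fold degenerate codon such as atg has an expected substitution count of zero, whereas a 4-fold degenerate codon spreads its probability over four alternatives, with the transition favoured over the transversions when `kappa` exceeds 1.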
an approximate p-value (probability that the mean conservation in the sliding window would be as high as observed if the null model were true) was also calculated, under the assumption of a normal distribution as an approximation to the sum of many independent multinomial distributions.
the sequences encompassing the rt site and the predicted stem-loop structure for veev and sinv were synthesized by genscript and cloned into the xhoi and bglii sites of pdluc, a derivative of the p luc vector ( , ) . the firefly luciferase gene is in the same reading frame relative to the upstream renilla luciferase gene such that rt of the stop codon results in a renilla-firefly luciferase fusion product. derivative constructs were generated by pcr using appropriate primers and recloning into pdluc. all plasmids were verified by dna sequencing. plasmid dnas ( . µg) were used as templates in µl reactions of the rabbit reticulocyte lysate tnt® t quick coupled transcription/translation system (promega). s-methionine (perkin elmer) was included in the reactions and protein products were separated by sds-page. dried gels were analyzed using a typhoon phosphorimager (ge healthcare) and the amount of radioactivity in each product was determined using the imagequant . program (molecular dynamics). after normalization for the number of methionine residues in termination and rt products ( and , respectively), the rt efficiencies were calculated as [rt/(rt+termination)].
figure . (a) map of the alphavirus genome showing the non-structural polyprotein (nsp -nsp -nsp -nsp ), the structural polyprotein (c-e -e - k-e ), the stop codon readthrough site and the programmed ribosomal frameshift site. (b) synonymous site conservation. the plot depicts the probability that the degree of conservation within a -codon sliding window could be obtained under a null model of neutral evolution at synonymous sites. note that the rt stop codon itself has been excluded from the conservation statistics. in order to map the conservation statistic onto the coordinates of a specific sequence in each alignment, all alignment columns with gaps in a chosen reference sequence were removed prior to calculation of conservation. the following reference sequences (genbank accession numbers) were used: veev nc_ , sinv nc_ .
tissue culture rt assays
rt assays were performed using the dual luciferase reporter constructs, as previously described ( , ) . to control for possible differences in stability of specific mrna sequences, each rt construct was compared with a control construct that was identical except that the tga stop codon was replaced with a tgg codon. rt efficiencies were calculated as (firefly activity/renilla activity) for the rt sequence normalized by (firefly activity/renilla activity) for the corresponding tgg control sequence. standard deviations were calculated based on six independent transfections.
sequence alignments of coding sequences containing rt stop codons were generated for a number of rna virus taxa and the degree of conservation at synonymous sites was analyzed as described in the 'materials and methods' section. for an alignment of veev, eeev and weev sequences, this analysis revealed significantly enhanced conservation in a region comprising the ∼ nt 3′-adjacent to the rt stop codon, and a -codon sliding window size clearly resolved the conservation into two distinct peaks ( figure b ). inspection of the sequence alignment demonstrated the potential for base pairing between the sequences corresponding to these two peaks to form a stem-loop structure. in veev, the 5′-end of the 5′ component of the stem is separated from the stop codon by an - nt 'spacer' and the 5′ and 3′ components of the stem are separated by a less-conserved 'loop' region (which may nonetheless contain structured elements) of nt ( figure ).
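the two readthrough efficiency calculations defined in the methods (the in vitro band ratio and the dual luciferase ratio) can be written out as a short sketch. the function names and all numerical inputs below are hypothetical; in particular, the methionine counts for the termination and rt products are not given here, so the values used are placeholders chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, stdev


def in_vitro_rt_efficiency(rt_signal, term_signal, rt_met, term_met):
    """RT / (RT + termination) after dividing each band's radioactivity by the
    number of methionine residues in that product (more methionines give more
    label per molecule)."""
    rt = rt_signal / rt_met
    term = term_signal / term_met
    return rt / (rt + term)


def dual_luciferase_rt_efficiency(firefly_rt, renilla_rt,
                                  firefly_ctrl, renilla_ctrl):
    """(firefly/renilla) for the RT construct, normalized by (firefly/renilla)
    for the matched in-frame control (stop codon replaced by a sense codon)."""
    return (firefly_rt / renilla_rt) / (firefly_ctrl / renilla_ctrl)


def summarize(replicates):
    """Mean and sample standard deviation over independent transfections."""
    return mean(replicates), stdev(replicates)


# Placeholder numbers, not data from the study.
print(in_vitro_rt_efficiency(rt_signal=20.0, term_signal=90.0,
                             rt_met=20, term_met=10))
print(dual_luciferase_rt_efficiency(5.0, 100.0, 50.0, 100.0))
print(summarize([0.09, 0.10, 0.11, 0.10, 0.12, 0.08]))
```

dividing by the matched control rather than reporting the raw firefly/renilla ratio removes construct-specific effects such as differences in mrna stability or fusion protein activity, so that the remaining ratio reflects only the fraction of ribosomes that pass the stop codon.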
the predicted stem has - bp with a nt asymmetric bulge in the centre of the component and, despite the enhanced conservation, is further supported by a compensatory a:u to g:c substitution that occurs in some strains at the fourth base pair from the 'top' of the stem. in eeev and weev, the predicted stem has bp with a nt asymmetric bulge, and is separated from the stop codon by a nt 'spacer' (figure ) . again, the predicted stem is supported by a compensatory g:c to a:u substitution in the related fort morgan virus (fmv). high conservation was also noted for the - codons immediately -adjacent to the component of the predicted stem in veev, eeev and weev. with respect to the non-structural polyprotein, sinv and aura virus (aurav) form a separate clade from veev, weev and eeev but, again, the conservation analysis revealed striking tandem conservation peaks 3′ of the rt site ( figure b ) and, again, the conservation peaks corresponded to sequences with the potential to base pair to form an rna structure, this time comprising an bp stem with a nt asymmetric bulge, a nt 'spacer' from the rt stop codon, and a nt 'loop' region ( figure ). for those alphavirus species where there appears to be a constant flux between presence and absence of the rt stop codon, it is not unreasonable to suspect that the structure, if any, will be present whether or not the stop codon is present in any particular sequence. however, although we found the potential for conserved rna stems to form in a number of these species (e.g. ross river, getah, semliki forest and chikungunya viruses; figure and supplementary data), the range of divergences in the available sequence data proved inadequate to obtain supporting evidence from an analysis of conservation at synonymous sites. curiously, this phenomenon was not limited to the alphaviruses.
the potential to form an extended stem-loop structure 3′-adjacent to an rt stop codon, phylogenetically conserved and supported by a pair of peaks in synonymous site conservation, was also found in a number of plant virus rt cases, for example, in the replicase gene in the genera (figures and ) . further, the predicted stem is well-supported by a large number of compensatory substitutions (i.e. paired substitutions that preserve the predicted base pairings) between the different species ( figure and supplementary data). the furoviruses and pomoviruses have a second rt site in the cp gene. here, however, there is a marked dichotomy between the two genera in the rt context. in the furoviruses, the rt context is generally uga-cgg (uga-ugg in the highly divergent sorghum chlorotic spot virus, ab ) and there was evidence for tandem synonymous site conservation peaks and an associated stem-loop structure that, together with a nt 'spacer', covered nt 3′-adjacent to the stop codon (figures and ) . in the pomoviruses, however, the rt context is generally a-uag-caa-uya (a-uaa-caa-uua in the highly divergent broad bean necrosis virus, d ) and the synonymous site conservation analysis failed to reveal extended conservation in the vicinity of the rt site ( figure ). thus the furovirus context and predicted structure are alphavirus-like, while the pomovirus context and lack of predicted structure are tobamovirus-like (see below). the animal-infecting coltiviruses also have an alphavirus-like rt site (uga-cgg) in the vp /vp -coding sequence and, again, there is potential to form a 3′-adjacent rna stem-loop structure (figure ; as noted previously in ref. ) , which is tentatively supported by our conservation analysis ( figure ) .
stop codon rt is also utilized by members of the plant virus taxa tombusviridae, luteoviridae, benyvirus and tobamovirus but the rt signals for these viruses had previously been grouped separately from those utilized by the alphaviruses, coltiviruses, tobraviruses, pecluviruses, furoviruses and pomoviruses (excepting pomovirus rna ), and our analysis likewise supported this distinction at the level of extended 3′-adjacent synonymous site conservation ( ) . in the case of the tobamoviruses, greatly enhanced synonymous site conservation is seen from codons − to + relative to the uag stop codon, and the motif xxa-uag-caa-uua-xxg is completely conserved in the sequences analyzed (despite lack of amino acid conservation at the − and + codons). however, more extended conservation of the type seen in the alphaviruses was not observed (figure ). in the luteoviruses and poleroviruses (family luteoviridae), the stop codon context aaa-uag-gua is completely conserved in all except one of sequences analyzed (rose spring dwarf-associated virus, eu , has gaa-uga-cgg), and enhanced synonymous site conservation was also observed over several further codons, especially codons − to + . however, while this region may well interact with distal elements as discussed in ref. ( ) , the extended 3′-adjacent conservation of the type seen in the alphaviruses was not observed in the luteoviridae (figure ) . the highly conserved local nucleotide contexts of the different rt sites mentioned here have been noted, discussed and characterized in detail in a number of previous works ( and references therein). a compilation of our own sequence analysis is given in the supplementary data and, to our knowledge, represents the largest such compilation to date. in the benyviruses, which generally have a tobamovirus-like stop codon context (i.e. uag-caa-uua; however, highly divergent rice stripe necrosis virus, eu , has uag-ggg-uac), the potential was observed for a local stem-loop structure (e.g.
nt spacer, nt stem, nt loop in beet necrotic yellow vein virus, d ; figure ), but there was insufficient sequence data to obtain strong support from the synonymous site conservation analysis ( figure ). previous deletion experiments in beet necrotic yellow vein benyvirus have shown, incidentally, that rt efficiency is considerably reduced when sequence corresponding to codons + to + from the rt codon is deleted, even though the immediately 3′-adjacent nucleotide context uag-caa-uua is left intact ( ) . in contrast, deletion of codons + to + had little effect on rt. these results indicate that there is an additional stimulatory element within the region defined by codons + to + , consistent with the predicted stem-loop structure (codons + to + ). in the tombusviridae family (including genera tombusvirus, carmovirus, necrovirus and others), rt occurs at a uag stop codon followed by ggr, but enhanced synonymous site conservation was observed for approximately codons 3′-adjacent to the uag (figure ) . some of this conservation, however, may be explained by other conserved elements in the region (see ref. and references therein; see also refs , ) . rna folding software predicted alphavirus-like 3′-adjacent stem-loops in some species and more complex structures in other species, a detailed analysis of which is beyond the scope of this article. the rt site in gammaretroviruses has been studied in depth and our computational analysis supported the known stimulatory spacer sequence and pseudoknot structure but did not reveal further conservation in the vicinity ( ) ( ) ( ) ( ) . rt sites in enamoviruses, carrot red leaf luteovirus-associated rna, middelburg and barmah forest alphaviruses, providence tetravirus and others, were not analyzed in detail due to lack of sequence data for useful comparative computational analysis ( ) ( ) ( ) . in contrast, to our knowledge, no stimulatory rna structure has been previously proposed for uga rt in kelch.
however, when we applied our computational analysis to kelch, we found tandem synonymous site conservation peaks 3′ of the rt codon and the corresponding sequences were predicted to form an rna stem ( nt loop in d. melanogaster) that is conserved in all drosophila species (figures and ; supplementary data) . the predicted stem has bp with a nt asymmetric bulge near the center of the component, and is separated from the rt codon by an nt 'spacer' sequence that is completely conserved in all drosophila species but, perhaps unusually, the rt codon context is uga-aug (uga-agc in anopheles, culex and aedes mosquitoes). in order to verify and investigate the functionality of the predicted rna structures in veev and sinv, local sequences ( nt 5′ of the uga stop codon and nt 3′ for veev or nt 3′ for sinv) were cloned in-frame between the renilla luciferase and firefly luciferase genes in vector pdluc. the firefly luciferase gene lacks an initiation codon and its expression is dependent on rt of the uga codon. rt efficiencies were determined both in vitro using rabbit reticulocyte lysate, and in hek tissue culture cell lysates. a positive control for rt, the mulv gag-pol rt site and 3′-adjacent pseudoknot, was included in all assays ( figure a and b, lane ). the wt veev and sinv constructs promoted rt in vitro at . % and . %, respectively ( figure a, lanes and ) . the rt efficiencies in tissue culture cells were much higher: . % for veev and . % for sinv ( figure b, lanes and ) . substitution of the nt immediately 3′ of the uga codon in veev with the tobamovirus-like rt stimulator, caa-uua, increased rt both in vitro ( . %) and in tissue culture cells ( . %; figure a and b, lane ). derivative constructs lacking the sequences for the predicted structures were generated ( figure c ). the veev derivative containing only nt 3′ of the uga codon directed just . % rt in vitro and .
% in tissue culture cells ( figure a and b, lane ), while the sinv derivative containing only nt 3′ of the uga codon directed just . % rt in vitro and . % in tissue culture cells ( figure a and b, lane ). thus the stimulatory effect of the stem-loop sequence is ∼ - to -fold for veev and - to -fold for sinv, depending on the assay system. this is in direct contrast to ref. where no difference was found in vitro between an insert comprising just sinv uga-cua and an insert comprising the entire sinv nsp +nsp -coding sequences. the veev stem-loop sequence was chosen for further analysis due to the higher rt efficiency and its greater stimulatory effect. when the 3′ part of the stem was deleted, rt was reduced to . % in vitro and . % in tissue culture cells ( figure a and b, lane ) , thus demonstrating the importance of the sequence corresponding to the 3′ component of the predicted stem, > nt downstream of the rt codon. to address base pairing within the predicted stem, two mutations were constructed that were predicted to disrupt watson-crick interactions: g residues in one part of the stem were changed to cs, or c residues in the opposing part of the stem were changed to gs ( figure c ). in both cases, rt was drastically reduced in both in vitro and tissue culture cell assays ( figure a and b, lanes and ) . however, when the two mutations were combined such that the predicted base pairings were restored, rt recovered to near the wt level ( figure a and b, lane ) . the importance of the sequence between the two halves of the stem (referred to here as the 'loop' region but without implication about internal structure) was tested by deleting all but nt of its sequence. interestingly, this resulted in substantially higher rt than the wt level in both assay systems ( figure a and b, lane ) .
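the logic of the disrupt-and-restore mutagenesis described above can be illustrated with a toy stem. the two arm sequences below are invented for the example and are not the veev sequences; the helper names are likewise hypothetical. only watson-crick pairs are counted, since those are the interactions the mutations were designed to disrupt.

```python
WATSON_CRICK = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def count_wc_pairs(five_prime_arm: str, three_prime_arm: str) -> int:
    """Count Watson-Crick pairs between two stem arms of equal length.

    Both arms are written 5'->3'; the 3' arm is read in reverse because the
    two strands of a stem are antiparallel.
    """
    return sum((a, b) in WATSON_CRICK
               for a, b in zip(five_prime_arm, reversed(three_prime_arm)))


def swap(seq: str, old: str, new: str) -> str:
    """Replace every occurrence of one nucleotide with another."""
    return seq.replace(old, new)


# Invented toy arms (not taken from any virus): a perfect 5 bp stem.
arm5, arm3 = "GGGAU", "AUCCC"

wild_type = count_wc_pairs(arm5, arm3)                       # intact stem
mut5_only = count_wc_pairs(swap(arm5, "G", "C"), arm3)       # G->C in one arm
mut3_only = count_wc_pairs(arm5, swap(arm3, "C", "G"))       # C->G in other arm
double_mut = count_wc_pairs(swap(arm5, "G", "C"),
                            swap(arm3, "C", "G"))            # pairing restored

print(wild_type, mut5_only, mut3_only, double_mut)
```

the double mutant changes the primary sequence of both arms yet restores every base pair, so recovery of readthrough in that construct argues that the structure, rather than the nucleotide identity at those positions, is what matters.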
we have shown that the stimulatory elements for efficient rt in veev and probably also sinv include not just the immediately 3′-adjacent nucleotides, but also a stem-loop structure that spans ∼ nt 3′ of the stop codon. computational analyses provide strong evidence that similar structures are relevant for rt in several other alphaviruses, and in plant viruses where rt occurs at a uga codon. although this rna structure is clearly not essential for some level of rt to occur in some systems [as in many previous analyses the predicted wt structure was not present ( , , ) and, in our own experiments, ∼ % rt was achieved in tissue culture without the wt structure], it does have a pronounced stimulatory effect on rt efficiency ( - to -fold for sinv, - to -fold for veev). as with the gammaretrovirus pseudoknot, the precise mechanism by which the stem-loop affects rt remains to be determined. possibilities include direct interaction with the ribosome (including pausing and/or promotion of conformational changes in the ribosome); provision of a physical block that preferentially occludes release factor from the a-site in favour of trnas; or an indirect action via some trans-acting factor. the function of the rna structure may simply be to achieve a higher rt level that is optimal for the virus. alternatively, the structure may provide a regulatory mechanism, perhaps allowing different rt levels to be achieved in different hosts or at different stages in the viral cycle. the long 'loop' length of the predicted structures is noteworthy. while long-distance base pairings have been demonstrated to play important regulatory roles in rna viruses (reviewed in ref. ) , the distances involved in the rt base pairings identified here are very much smaller and a genome-scale regulatory role seems unlikely.
furthermore, our loop-deletion mutant promoted even higher rt than the wt construct, suggesting that the presence of a long loop region, or any sequence motifs therein, plays little if any role in stimulating efficient rt. thus, we hypothesize that evolutionary selection simply acts to place the 3′ component of the stem in a convenient location (e.g. with regard to minimizing interference with the encoded amino acid sequence). although we refer to the region between the two components of the stem as a 'loop', it should not be taken to imply that this region does not fold. in fact the region generally is predicted to fold, and the fact that it can fold may indeed be functionally important, perhaps just to provide stability to the basal stem. however, in most cases, the nature of the fold seems to be relatively unimportant as it is not well-conserved between related sequences. how can our results be reconciled with previous results which indicated that only the immediately 3′-adjacent - nt were relevant for rt in these viruses? there are several possibilities. previous analyses of the rt cassette in sinv alphavirus, and also tobacco rattle tobravirus, were performed in in vitro systems ( , ) . however, rt efficiency may vary considerably between in vitro and cell culture systems, depending on the absence or presence and abundance of various relevant near-cognate trna species ( ) , and potentially also on the concentration of various trans-acting factors, salt concentrations, temperature, ribosome loading density and intracellular architecture. thus a high rt efficiency measured in vitro for a short insert does not mean that the full complement of elements that stimulate efficient rt in cell culture or in vivo has been recapitulated faithfully. such factors may also explain why our in vitro experiments produced much lower rt efficiencies than previous in vitro experiments, and highlight the importance of our experiments in mammalian cell culture ( , , ) . although ref.
( ) compared, in vitro, the rt efficiency for a short insert (that excluded the predicted structure) with a long insert (comprising the entire nsp +nsp -coding sequences), such comparisons between inserts of very different sizes are not always straightforward, in part because the different protein products may be degraded at different rates, and because chance base pairings with the construct sequence could affect rt efficiency differently for the long and short inserts. in contrast, guided by our computational analysis, we were able to make small but targeted substitutions that allowed for more precise comparisons in the context of a long veev insert that included the predicted rna structure elements. accurate measurements of the rt efficiency in alphavirus-infected cells are not readily obtainable due to the multiple cleavage products of the non-structural polyprotein and rapid degradation of excess nsp ( , , ) . nonetheless, in ref. ( ) , - to -fold less nsp was found in wt sinv-infected cells than in cells infected with mutant viruses in which the uga was replaced by a ser, trp or arg codon, thus suggesting a wt rt efficiency in the range . - %. our measurement of ∼ % for wt veev and sinv sequences in the dual luciferase construct suggests that there may be additional factors that affect alphavirus rt. interestingly, although rt for the veev and sinv cassettes was much more efficient in cell culture than in vitro, there was little difference between the two systems for the mulv rt cassette ( figure ).
while the action of some cellular trans-acting stimulatory factor cannot be ruled out (albeit presumably not interacting with the loop region, given the increased rt observed when the loop was deleted), other possible explanations include: (i) the different stop codons and nucleotide contexts involved (uga-c in veev and sinv; uag-g in mulv) and hence the different pools of potential stop codon-decoding trnas and (ii) the nature of the structure (a compact pseudoknot in mulv but an extended stem-loop in veev and sinv) with possible consequences for the ease with which the structure may fold in different environments. similar differences in rt efficiency between in vitro and cell culture systems were noted for colorado tick fever coltivirus which, like sinv and veev, utilizes a uga rt codon with a predicted 3′-adjacent stem-loop structure ( ) . besides the known structure-stimulated rt cases discussed above, rna structure also plays an integral role in the recoding of uga codons for selenocysteine insertion. in eukaryotes, this process is dependent on an rna stem-loop structure containing specific nucleotide motifs, known as the secis element, usually located in the 3′-utr of the corresponding mrnas ( , ). in certain cases, an additional stem-loop structure close to the recoded stop codon has also been identified ( , ) . for example, in the human sepn gene there is a phylogenetically conserved bp stem (with a nt symmetric bulge) and a nt loop, separated from the uga by a nt spacer. interestingly, this structure has been shown to stimulate rt in cell culture (but not in vitro) even when the secis element is absent. howard et al. located potential 3′-adjacent structures for at least of human selenocysteine-encoding uga codons analyzed. however, their initial computational selection involved rna-folding of just nucleotides + to + of the human sequence, an analysis which would have missed most of the rna structures predicted in this report.
thus -adjacent structures may be a feature of a larger proportion of selenocysteine rt sites than these, though it does not appear to be an essential feature for selenocysteine rt ( ) . the various motifs that stimulate rt in eukaryotic cells have been previously classified by beier and grimm and by harrell et al. ( , ) . beier and grimm define the classes type i (generally uag-caa-uya; includes tobamovirus replicase, and benyvirus and pomovirus cp extension), type ii (generally uga-cgg or uga-cua; includes alphavirus replicase, tobravirus, pecluvirus, furovirus and pomovirus replicase, and furovirus cp extension), and type iii (generally uag-g, plus a compact pseudoknot in gammaretroviruses and possible but as yet relatively uncharacterized structures in the luteoviruses and tombusviruses). there are exceptions to the rule (e.g. enamovirus uga-g, various pomovirus cases with atypical stop codons, and so on). one reason for this may be that the required level of rt may vary between different viruses, and may also be modulated by other sequence elements (e.g. nucleotide and/or amino acid context) so that, in certain cases, deviations from one of the 'canonical' rt motifs may be tolerated. with this proviso, our results suggest that the definition of the type ii motif should, in general though perhaps not ubiquitously, be modified to include a rna structure component. our discovery in alphaviruses and phylogenetically supported predictions for many plant viruses and the drosophila gene kelch, together with the small number of previously identified cases of structure-stimulated rt, now suggest that rna structures as a component of efficient rt cassettes in eukaryotes (especially those that lack a car-yya tobamovirus-like stimulator), rather than being exceptional, may in fact be the norm. 
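As a rough illustration of the Beier and Grimm classes summarized above, the following Python sketch assigns a leaky stop codon and its downstream nucleotides to type I, II or III using only the consensus contexts quoted in the text (UAG-CAA-UYA, UGA-CGG or UGA-CUA, UAG-G). The function name `classify_rt_motif` and the pattern names are assumptions made for illustration. The exceptions mentioned above and any RNA structure component are not modelled.

```python
import re

# Consensus contexts as quoted in the text; Y = pyrimidine (C or U).
# Each pattern is anchored at the first nucleotide of the stop codon.
TYPE_I = re.compile(r"^UAGCAAU[CU]A")   # UAG-CAA-UYA
TYPE_II = re.compile(r"^UGAC(GG|UA)")   # UGA-CGG or UGA-CUA
TYPE_III = re.compile(r"^UAGG")         # UAG-G


def classify_rt_motif(context: str) -> str:
    """Classify an RNA/DNA string that starts at the leaky stop codon.

    Hyphens are ignored and T is treated as U, so "UGA-CGG" and
    "tgacgg" are handled identically. This is a hypothetical helper,
    not a published tool.
    """
    rna = context.upper().replace("T", "U").replace("-", "")
    if TYPE_I.match(rna):
        return "type I"
    if TYPE_II.match(rna):
        return "type II"
    if TYPE_III.match(rna):
        return "type III"
    return "unclassified"


if __name__ == "__main__":
    for ctx in ("UAG-CAA-UUA", "UGA-CGG", "UAG-G", "UGA-G"):
        print(ctx, "->", classify_rt_motif(ctx))
```

A context such as UGA-G (cited in the text as an enamovirus exception) falls through to "unclassified", which is the intended behaviour for a consensus-only check.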
rewiring the keyboard: evolvability of the genetic code the distinction between recoding and codon reassignment ribosome ''skipping'': ''stop-carry on'' or ''stopgo'' translation recoding: expansion of decoding rules enriches gene expression selenocysteine biosynthesis, selenoproteins, and selenoproteomes reprogramming the ribosome for selenoprotein expression: rna elements and protein factors recoding: expansion of decoding rules enriches gene expression recoding: reprogrammed genetic decoding recoding: expansion of decoding rules enriches gene expression the influence of codon context on genetic code translation sequence analysis suggests that tetra-nucleotides signal the termination of protein synthesis in eukaryotes eukaryotic start and stop translation sites the identity of the base following the stop codon determines the efficiency of in vivo translational termination in escherichia coli translational termination efficiency in mammals is influenced by the base following the stop codon the efficiency of translation termination is determined by a synergistic interplay between upstream and downstream sequences in saccharomyces cerevisiae comparison of characteristics and function of translation termination signals between and within prokaryotic and eukaryotic organisms natural read-through at the uga termination signal of q-beta coat protein cistron the readthrough protein a is essential for the formation of viable q beta particles yeast suppressors of uaa and uag nonsense codons work efficiently in vitro via trna leaky uag termination codon in tobacco mosaic virus rna translation of tobacco rattle virus rnas in vitro: four proteins from three rnas translation of mulv and msv rnas in nuclease-treated reticulocyte extracts: enhancement of the gag-pol polypeptide with yeast suppressor trna murine leukemia virus protease is encoded by the gag-pol gene and is synthesized through suppression of an amber termination codon reverse transcriptase of moloney murine 
leukemia virus binds to eukaryotic release factor to modulate suppression of translational termination genetic reprogramming by retroviruses: enhanced suppression of translational termination examination of the function of two kelch proteins generated by stop codon suppression a novel stop codon readthrough mechanism produces functional headcase protein in drosophila trachea regulatory autonomy and molecular characterization of the drosophila out at first gene revisiting the protein-coding gene catalog of drosophila melanogaster using fly genomes sequence coding for the alphavirus nonstructural proteins is interrupted by an opal termination codon mutagenesis of the in-frame opal termination codon preceding nsp of sindbis virus: studies of translational readthrough and its effect on virus replication the signal for translational readthrough of a uga codon in sindbis virus rna involves a single cytidine residue immediately downstream of the termination codon uga suppression by trna cmca trp occurs in diverse virus rnas due to a limited influence of the codon context misreading of termination codons in eukaryotes by natural nonsense suppressor trnas the signal for a leaky uag stop codon in several plant viruses includes the two downstream codons pseudouridine in the anticodon gÉa of plant cytoplasmic trna tyr is required for uag and uaa suppression in the tmv-specific context impact of the six nucleotides downstream of the stop codon on translation termination the major determinant in stop codon read-through involves two adjacent adenines evidence that a downstream pseudoknot is required for translational read-through of the moloney murine leukemia virus gag stop codon pseudoknot-dependent read-through of retroviral gag termination codons: importance of sequences in the spacer and loop bipartite signal for read-through suppression in murine leukemia virus mrna: an eight-nucleotide purine-rich sequence immediately downstream of the gag termination codon followed by an 
rna pseudoknot structural studies of the rna pseudoknot required for readthrough of the gag-termination codon of murine leukemia virus mrna helicase activity of the ribosome aphid transmission of beet western yellows luteovirus requires the minor capsid read-through protein p local and distant sequences are required for efficient readthrough of the barley yellow dwarf virus pav coat protein gene stop codon effects of mutations in the beet western yellows virus readthrough protein on its expression and packaging and on virus accumulation, symptoms, and aphid transmission the alphaviruses: gene expression, replication, and evolution complete nucleotide sequence of middelburg virus, isolated from the spleen of a horse with severe clinical disease in zimbabwe discovery of frameshifting in alphavirus k resolves a -year enigma nonstructural proteins nsp and nsp of ross river and o'nyong-nyong viruses: sequence and comparison with those of other alphaviruses regulation of semliki forest virus rna replication: a model for the control of alphavirus pathogenesis in invertebrate hosts basic local alignment search tool emboss: the european molecular biology open software suite clustal w and clustal x version . 
vienna rna secondary structure server pknotsrg: rna pseudoknot folding including near-optimal structures and sliding windows a conserved predicted pseudoknot in the ns a-encoding sequence of west nile and japanese encephalitis flaviviruses suggests ns ' may derive from ribosomal frameshifting bioinformatic and functional analysis of rna secondary structure elements among different genera of human and animal caliciviruses a simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences detecting overlapping coding sequences in virus genomes processive selenocysteine incorporation during synthesis of eukaryotic selenoproteins a dual-luciferase reporter system for studying recoding signals programmed ribosomal frameshifting in decoding the sars-cov genome virgaviridae: a new family of rod-shaped plant viruses termination and read-through proteins encoded by genome segment of colorado tick fever virus high resolution analysis of the readthrough domain of beet necrotic yellow vein virus readthrough protein: a kter motif is important for efficient transmission of the virus by polymyxa betae a discontinuous rna platform mediates rna virus replication: building an integrated model for rna-based regulation of viral processes characterization of an internal element in turnip crinkle virus rna involved in both coat protein binding and replication immunodetection, expression strategy and complementation of turnip crinkle virus p and p replication components the nucleotide sequence and luteovirus-like nature of rna of an aphid non-transmissible strain of pea enation mosaic virus a small rna resembling the beet western yellows luteovirus st -associated rna is a component of the california carrot motley dwarf complex genome organization and translation products of providence virus: insight into a unique tetravirus evolution of genes and genomes on the drosophila phylogeny the leaky uga termination codon of tobacco rattle 
virus rna is suppressed by tobacco chloroplast and cytoplasmic trnas trp with cmca anticodon long-distance rna-rna interactions in plant virus gene expression and replication cleavage-site preferences of sindbis virus polyproteins containing the non-structural proteinase. evidence for temporal regulation of polyprotein processing in vivo regulation of sindbis virus rna replication: uncleaved p and nsp function in minus-strand rna synthesis, whereas cleaved products from p are required for efficient plus-strand rna synthesis processing the nonstructural polyproteins of sindbis virus: study of the kinetics in vivo by using monospecific antibodies recoding elements located adjacent to a subset of eukaryal selenocysteine-specifying uga codons a recoding element that stimulates decoding of uga codons by sec trna predominance of six different hexanucleotide recoding signals of read-through stop codons

the authors thank mike howard (university of utah) for his kind gift of plasmid pdluc, and chris anderson (university of utah) for help with tissue culture analyses. the authors also thank lynn cooley, their collaborator for experimental analysis of kelch rt (work in progress), for her support.

conflict of interest statement. none declared.

supplementary data are available at nar online.

key: cord- -xgwbl em authors: henderson, clark m.; anderson, christine b.; howard, michael t. title: antisense-induced ribosomal frameshifting date: - - journal: nucleic acids res doi: . /nar/gkl sha: doc_id: cord_uid: xgwbl em

programmed ribosomal frameshifting provides a mechanism to decode information located in two overlapping reading frames by diverting a proportion of translating ribosomes into a second open reading frame (orf). the result is the production of two proteins: the product of standard translation from orf and an orf –orf fusion protein.
such programmed frameshifting is commonly utilized as a gene expression mechanism in viruses that infect eukaryotic cells and in a subset of cellular genes. rna secondary structures, consisting of pseudoknots or stem–loops, located downstream of the shift site often act as cis-stimulators of frameshifting. here, we demonstrate for the first time that antisense oligonucleotides can functionally mimic these rna structures to induce + ribosomal frameshifting when annealed downstream of the frameshift site, ucc uga. antisense-induced shifting of the ribosome into the + reading frame is highly efficient in both rabbit reticulocyte lysate translation reactions and in cultured mammalian cells. the efficiency of antisense-induced frameshifting at this site is responsive to the sequence context ′ of the shift site and to polyamine levels.

the standard triplet readout of the genetic code can be reprogrammed by signals in the mrna to induce ribosomal frameshifting [reviewed in ( ) ( ) ( ) ]. generally, the resulting trans-frame protein product is functional and may in some cases be expressed in equal amounts to the product of standard translation. this elaboration of the genetic code ( , ) demonstrates versatility in decoding. requirements for eukaryotic ribosomal frameshifting include a shift-prone sequence at the decoding site and often a downstream secondary structure in mrna. the majority of − programmed frameshift sites consist of a heptanucleotide sequence x xxy yyz [where x can be a, g, c or u; y can be a or u; and z can be any nucleotide ( ) ]. in this configuration, the p- and a-site trnas can re-pair with at least out of nt when shifted nt towards the end of the mrna. similarly, for + frameshift sites, the identity of the codons in the p- and a-sites of the ribosome is critical for efficient frameshifting. one factor affecting + frameshift efficiency is the initial stability of the p-site trna-mrna interaction in the frame ( ) .
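The heptanucleotide rule quoted above (X XXY YYZ, with X any of A, G, C or U, Y either A or U, and Z any nucleotide) translates directly into a pattern search. The sketch below is a minimal Python illustration. The function name `find_slippery_heptamers` is hypothetical, and the search ignores reading frame, so a hit is only a candidate site and says nothing about whether frameshifting actually occurs there.

```python
import re

# X XXY YYZ: three identical X (any base), three identical Y (A or U),
# then any Z. The lookahead lets overlapping candidates be reported.
SLIPPERY = re.compile(r"(?=(([ACGU])\2\2([AU])\3\3[ACGU]))")


def find_slippery_heptamers(seq: str):
    """Return (0-based position, heptamer) for every X XXY YYZ match.

    DNA input is accepted: T is converted to U before matching.
    """
    rna = seq.upper().replace("T", "U")
    return [(m.start(), m.group(1)) for m in SLIPPERY.finditer(rna)]


if __name__ == "__main__":
    # Illustrative sequence only; not taken from the constructs in the text.
    print(find_slippery_heptamers("GCGUUUAAACGGG"))
```

Restricting hits to a particular reading frame would only require filtering the returned positions by their remainder modulo three relative to a chosen start codon.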
high-efficiency frameshifting occurs when the p-site trna does not form standard codon-anticodon interactions ( ) . in some studies, a correlation between + frameshift efficiency and the final stability of the p-site trna-mrna interaction in the + frame has been shown ( , ) . however, in other systems there appears to be little correlation ( ) . in addition, competition between decoding of the frame and + frame codons in the a-site may affect frameshifting efficiency ( ) . slow to decode frame codons such as stop codons or those decoded by low abundance trnas favor frameshifting, as do + frame codons with high levels of corresponding cognate trnas ( ) ( ) ( ) ( ) ( ) . high levels of frameshifting are often achieved by the stimulatory action of a cis-acting element located downstream of the shift site. a wide variety of structures, most commonly h-type pseudoknots ( ) , have been identified which stimulate − frameshifting in eukaryotes [for reviews see ( , ) ]. mutagenic and structural data for several of the frameshift stimulators have demonstrated that each pseudoknot has key structural features required for frameshift stimulation ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . however, a unifying structural feature essential for frameshifting has not yet been identified. this observation, combined with recent reports that simple antisense oligonucleotides can functionally mimic cis-acting stimulators of − frameshifting ( , ) , demonstrates that many different structures can stimulate frameshifting, although it should be noted that not all structures of equal thermodynamic stability can do so (see discussion). rna pseudoknots have also been shown to stimulate programmed + frameshifting in many eukaryotic antizyme genes ( , ) .
antizyme is a negative regulator of cellular polyamine levels through its ability to target ornithine decarboxylase (the rate-limiting enzyme in polyamine biosynthesis) for degradation ( ) ( ) ( ) , to inhibit polyamine import ( , ) and to stimulate export ( ) . antizyme expression is induced by high intracellular polyamine levels, and decreased with lowered levels. the polyamine sensor is a programmed + frameshift event that is required for antizyme synthesis.

*to whom correspondence should be addressed. tel: + ; fax: + ; email: mhoward@genetics.utah.edu
© the author(s). this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/ by-nc/ . /uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

at low polyamine levels, termination at the end of open reading frame (orf ) is efficient, whereas at high levels of polyamines, a substantial proportion of ribosomes shift to the + reading frame and then resume standard decoding to synthesize the full-length and active antizyme protein. frameshifting at the mammalian antizyme mrna shift site, ucc uga, is stimulated by two cis-acting signals ( , ) . one of these, the element, encompasses ~ bases upstream of the shift site and is important for the polyamine effect ( ) ( ) ( ) . the other cis-acting element is a pseudoknot located of the shift site. the mammalian antizyme pseudoknot and a structurally distinct counterpart in a subset of invertebrate antizyme mrnas ( ) are the only pseudoknots known to act as stimulators for + frameshifting in eukaryotes. although it is unknown if pseudoknots stimulate − frameshifting and + frameshifting by different mechanisms, one notable difference is found in positioning of the downstream structure relative to the shift site.
naturally occurring pseudoknots or stem-loop stimulators of − frameshifting typically begin ~ - nt downstream of the a-site codon of the shift site ( ) , whereas + frameshift pseudoknots are located closer with only a - nt separation from the a-site codon ( ) . mutagenic studies have revealed that altering the size of the spacer affects frameshifting and, in general, reduces efficiency ( , , ( ) ( ) ( ) . here we have tested the ability of antisense oligonucleotides, annealed downstream of the shift-prone site, ucc uga, to induce shifting of the ribosome to the + reading frame. the directionality of frameshifting (either into the + or − reading frame) is shown to be dependent upon the position of the duplex region relative to the shift site, and the efficiency of frameshifting is responsive to polyamine levels and enhanced by the inclusion of stimulatory sequences found upstream of the human antizyme + programmed frameshift site.

complementary oligonucleotides, to construct the sequences described in this paper, were synthesized at the university of utah dna/peptide core facility such that when annealed they would have appropriate ends to ligate into the sali/bamhi sites of the dual luciferase vector, p luc ( ) . dual luciferase constructs were prepared and their sequence was verified as described previously ( ) .
insert sequences (shift site in boldface) are given as follows:
p lucaz wt: tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgatgcccctcacccacccctgaagatcccaggtgggcgagggaatagtcagagggatcacaacggatc;
p lucaz sp: tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgaccctcacccacccctgaagatcccaggtgggcgagggaatagtcagagggatcacaacggatc;
p lucaz hp: tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgatg-
p lucaz pkdel: tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgatgcccctggatc;
p lucaz pkm : tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgatgcccctcacccaccgggatcacaaggatc;
p lucaz sl: tcgacggtctccctccactgctgtagtaacccgggtccggggcctcggtggtgctcctgatgcccctcacccacccggatc;
p lucaz fs: tcgacgtgctcctgatgcccctggatc;
p lucaz fsugg: tcgacgtgctcctggtgcccctggatc.

the dual luciferase constructs ( . mg) described above were added directly to tnt coupled reticulocyte lysate reactions (promega) with s-labeled methionine in a volume of ml. reactions were incubated at c for h. radiolabeled proteins were separated by sds-page and the gels were fixed with . % acetic acid and methanol for min. after drying under vacuum, the gels were visualized using a storm phosphorimager (molecular dynamics) and radioactive bands were quantified using imagequant software. percent frameshifting was calculated as the percentage of full-length (frameshift) product relative to the termination product and the full-length product combined. the value of each product was corrected for the number of methionine codons present in the coding sequence. the reported values are the averages and standard deviations obtained from at least three independent measurements. tables showing percent frameshifting and standard deviations can be found in supplementary data.

plasmid p lucaz pkdel was co-transfected into cv- cells with varying concentrations of az b -o-methyl antisense oligonucleotides under the following conditions. cv- cells ( .
× ) in ml of dmem + % fetal bovine serum were added to wells ( / area -well tissue culture treated plates) containing ng of dna, varying amounts of az b antisense oligonucleotides and . ml lipofectamine (invitrogen) in ml of optimem. cells were incubated at c ( % co ) for h. media were then removed from the cells and the transfected cells were lysed in . ml lysis buffer and luciferase activity determined by measuring light emission following injection of ml of luminescence reagent (promega). percent frameshifting was calculated by comparing firefly/renilla luciferase ratios of experimental constructs with those of control constructs: (firefly experimental rlus/renilla experimental rlus)/(firefly control rlus/renilla control rlus) × .

the ability of cis-acting rna structures or trans-acting -o-methyl antisense oligonucleotides to induce ribosomal frameshifting was determined by in vitro transcription and translation of a dual luciferase reporter vector, p luc. p luc contains the renilla and firefly luciferase genes on either side of a multiple cloning site, and can be transcribed using the t promoter located upstream of the renilla luciferase gene ( ) . sequences containing shift-prone sites were cloned between the two reporter genes such that the downstream firefly luciferase gene is in the + reading frame. the resulting constructs were then transcribed and translated in vitro with or without complementary cis-acting stimulators of frameshifting at the antizyme shift site initially, three dual luciferase reporter vectors were generated containing the human antizyme frameshift cassette (p luc-az wt) with the and stimulators of frameshifting, with the pseudoknot deleted (p luc-az pkdel), or replaced with a stem-loop (p luc-az hp) (figure ). each construct was then subjected to coupled transcription and translation reactions in the presence of increasing amounts of spermidine, and the s-labeled products separated by sds-page. table ).
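The two percent-frameshifting calculations described in the methods can be written out as a short Python sketch. The function and argument names are assumptions chosen for illustration, and the factor of 100 is inferred from the word "percent", since the numeric factor is missing from the extracted text. The example values in the usage section are arbitrary and are not measurements from the study.

```python
def percent_frameshift_gel(fs_signal: float, term_signal: float,
                           fs_met_codons: int, term_met_codons: int) -> float:
    """Gel-based estimate from radiolabelled band intensities.

    Each band is first divided by the number of methionine codons in its
    coding sequence, then the full-length (frameshift) product is expressed
    as a percentage of frameshift plus termination product.
    """
    fs = fs_signal / fs_met_codons
    term = term_signal / term_met_codons
    return 100.0 * fs / (fs + term)


def percent_frameshift_luciferase(firefly_exp: float, renilla_exp: float,
                                  firefly_ctrl: float, renilla_ctrl: float) -> float:
    """Dual-luciferase estimate: experimental firefly/renilla ratio
    normalised to the control firefly/renilla ratio, as a percentage."""
    return 100.0 * (firefly_exp / renilla_exp) / (firefly_ctrl / renilla_ctrl)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(percent_frameshift_gel(200.0, 300.0, 20, 10))
    print(percent_frameshift_luciferase(50.0, 100.0, 200.0, 100.0))
```

Normalising to the renilla signal in both the experimental and control constructs is what makes the luciferase ratio insensitive to differences in transfection or translation efficiency between samples.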
maximal levels of frameshifting were found to occur when - mm of antisense oligonucleotide was added to the transcription/translation reactions (supplementary table ). in the presence of . mm exogenous spermidine, highly efficient shifting of ribosomes into the + reading frame (higher than that observed in the wild-type antizyme frameshift cassette) was observed with the addition of az a ( . %), az b ( . %) and az c ( . %) (supplementary table ). the most efficient frameshifting is observed with the antisense oligonucleotide az b which anneals such that spacing between the shift site and the beginning of the duplex region is the same as that observed between the shift site and the beginning of stem of the natural antizyme pseudoknot structure (i.e. each has a nt spacer). to verify that the antisense oligonucleotide was activating ribosomal frameshifting and not transcription slippage, rna was transcribed from p luc-az pkdel in the absence of oligonucleotide and added to reticulocyte lysate translations in the presence of increasing amounts of -o-methyl az b oligonucleotide. frameshifting levels were increased to the same level as that observed in coupled transcription and translation reactions, demonstrating that the oligonucleotide acts to induce frameshifting during translation (supplementary figure). surprisingly, the addition of az a ( spacer) also induced high-level frameshifting into the − reading frame in a manner which was modestly inhibited by the addition of spermidine ( % in the absence and % in the presence of . mm exogenous spermidine) (figure a and supplementary table ). no − frameshift product was observed when the wild-type antizyme cassette was examined in the absence of antisense oligonucleotide addition (figure ; azwt).
as the az a antisense oligonucleotide was designed to anneal directly adjacent to the uga codon of the shift site, it was of interest to determine whether the wild-type antizyme pseudoknot could induce − frameshifting when located in the equivalent position. to address this, a new construct p luc-az - sp (figure a) was made by deleting the nt spacer between the pseudoknot and the shift site of p luc-az wt. in this case, the wild-type pseudoknot is directly adjacent to the shift site. the products of in vitro transcription and translation were separated by sds-page. no − frameshift product was observed and levels of the + frameshift product were significantly reduced to ~ % (figure d and supplementary table ). az a, az b and az c were designed to complement rna sequences encoded by the originating vector. to determine if duplexes formed between the antisense oligonucleotide and adjacent antizyme sequences would result in more efficient frameshift stimulation, reporter vectors were designed to contain a portion of the antizyme stimulator. construct p luc-az pkm contains sequences from the half of the axis formed by the stacking of stem and stem of the pseudoknot (figure a). two complementary -o-methyl antisense oligonucleotides were designed. first, pkm has perfect complementarity to the region starting nt and ending nt downstream of the uga shift site codon. second, pkm is the same except that a mispaired c and bulged a were located at positions and , respectively. these two alterations were included to more closely mimic the natural pseudoknot which also contains a mispaired c and bulged a at equivalent positions along the extended stem formed by the stacking of pseudoknot stems and (figure ; compare p luc-az wt with the duplex formed between p luc-pkm and antisense oligonucleotide pkm ).
pkm and pkm induced and % frameshifting, respectively, when added to coupled transcription and translation reactions of p luc-az pkm in the presence of spermidine (figure a and b, and supplementary table ). neither pkm nor pkm induced frameshifting to the same levels seen with az b, suggesting that the sequence content of the duplex region can affect the efficiency of frameshift stimulation and that native antizyme sequences are not required. a second construct, p luc-az sl, was designed to contain only the half of stem of the antizyme pseudoknot downstream from the ucc uga shift site (figure a). -o-methyl antisense oligonucleotides were designed to anneal between and nt (sl ) or and nt (sl ) downstream from the uga codon of the shift site. frameshift efficiency induced by these two antisense oligonucleotides, and % respectively, was somewhat lower than that observed with pkm and pkm (figure c and d and supplementary table ). in these cases frameshift efficiency was higher for the longer antisense oligonucleotide (sl ), suggesting that frameshift efficiency most probably correlates with stability of the duplex. as was seen with az a, az b and az c, frameshifting efficiency stimulated by antisense oligonucleotides pkm , pkm , sl and sl was also strongly correlated with the concentration of exogenously added spermidine (supplementary table ). the importance of the antizyme sequence context to antisense oligonucleotide induced ribosome frameshifting was examined by testing the frameshift site, ucc uga, without the and stimulatory antizyme sequences. to this end, the antizyme stimulatory sequences were deleted from p luc-az pkdel to make p luc-az fs. each of the antisense oligonucleotides az a, az b or az c was added to coupled transcription and translation reactions with p luc-az fs in the presence or absence of spermidine. frameshift efficiency was measured at , and %, in the presence of spermidine and , . and .
% in its absence for az a, az b and az c, respectively (figure a and b). to determine whether the stop codon of the shift site is essential for frameshifting, the uga codon of p luc-az fs was altered to ugg such that the shift site was ucc ugg (p luc-az -ugg). frameshift efficiency was significant, but reduced, compared to the shift site ucc uga, and showed little stimulation by the addition of spermidine; az a, az b and az c induced , and . % frameshifting in the presence of spermidine, and . , . and . % frameshifting in its absence, respectively (figure c and d). the ability of antisense oligonucleotides to induce frameshifting in cultured mammalian cells was examined by co-transfection of cv- cells with p lucaz pkdel and increasing amounts of -o-methyl antisense oligonucleotide az b as described in materials and methods. in the absence of antisense oligonucleotide, frameshifting levels were determined to be . %, whereas a graded increase in frameshift levels was observed upon the addition of az b (figure ). maximal frameshifting levels were % in the presence of mm az b in the transfection media.

several models attempting to explain pseudoknot stimulation of programmed − frameshifting have been proposed [for reviews see ( , ) ]. most models invoke a pausing mechanism whereby the ribosome is paused over the shift site such that time is allowed for the trnas to reposition in the new reading frame. this explanation is clearly too simplistic as stem-loops and pseudoknots of similar thermodynamic stability that cause ribosome pausing are not necessarily effective frameshift stimulators ( ) ( ) ( ) . in addition, variations of the ibv pseudoknot have demonstrated a lack of correlation between the extent of pausing and the efficiency of frameshifting ( ) .
a recent publication by brierley and co-workers ( ) presents structural data demonstrating that the ibv frameshift stimulating pseudoknot blocks the mrna entrance tunnel and leads to a structural deformation of the p-site trna. the resulting movement of the trna displaces the anticodon loop towards the end of the mrna. a model is presented in which this movement results in disruption of the codon-anticodon interactions, thus allowing for trna slippage relative to the mrna. similar trna movements were not observed with non-frameshift stimulating stem-loop structures. this model provides a feasible mechanistic explanation for the ability of some downstream structures to induce frameshifting. the ability of antisense oligonucleotides to induce high-level − frameshifting ( , ) demonstrates that elaborate tertiary structures are not required, and that a duplex formed by complementary antisense oligonucleotides (with a variety of chemistries, including rna, -o-methyl, morpholino) is sufficient to induce high-level frameshifting. here we demonstrate for the first time that trans-acting antisense oligonucleotides may stimulate ribosome shifting to the + reading frame at surprisingly high levels, levels which are greater than those achieved by natural cis-acting mrna pseudoknot structures in programmed + frameshifting. structural studies indicating that the mrna begins to enter the ribosome - nt downstream from the a-site codon are of direct relevance to this study ( , ) . our results indicate that maximal frameshifting is induced when the antisense-mrna duplex begins nt downstream of the uga of the shift site, in agreement with the distance found between the uga of the shift site and the beginning of stem of the pseudoknot stimulator found in antizyme genes. given this distance, the implication is that the stimulatory secondary structure would be encountered by the ribosome when the ucc codon enters the a-site of the ribosome.
perhaps, as suggested by the structural studies of the ibv − frameshift-inducing pseudoknot, the codon-anticodon interactions between the ucc codon and ser-trna ser are disrupted during translocation to the p-site. given the importance of the uga codon during frameshifting at the ucc uga shift site, subsequent events following translocation of the ucc codon to the p-site and uga to the a-site must influence frameshifting efficiency. this latter event most probably involves competition between termination and + frame decoding when the uga codon is in the a-site. various discussions have been presented for the importance of a-site and p-site events during ribosomal frameshifting ( , ) and clearly, further investigations of this topic are warranted. the observation presented here that the antisense oligonucleotide, az a, which anneals directly adjacent to the uga stop codon can induce ribosome frameshifts to either the + or − reading frame is surprising. in light of the above discussion of spacing for naturally occurring cis-acting frameshift stimulators, it is possible that frameshifting may occur at codons upstream of the known ucc uga shift site. however, visual examination of upstream codons does not reveal an obvious − or + frameshift site. the ability of spermidine to stimulate antisense oligonucleotide induced ribosome frameshifting to the + reading frame at the ucc uga shift site in the absence of the natural stimulator demonstrates that this cis-acting element is not required for polyamine responsiveness. similarly, spermidine stimulation was observed in the absence of the element but virtually eliminated by altering the uga codon of the shift site to ugg. these observations are in agreement with previous studies examining the importance of cis-acting elements for polyamine induced frameshifting during expression of antizyme genes ( ) ( ) ( ) .
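The sequence of events discussed above (UCC in the P-site, UGA in the A-site, then either termination or decoding in the + frame) can be made concrete with a small Python sketch. The function name `read_codons` is hypothetical. The test mRNA is the region around the shift site as it appears in the construct inserts listed in the methods, with an AUG start codon prepended as an assumption. The model is deterministic, whereas in the experiments only a proportion of ribosomes shift.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def read_codons(mrna: str, plus_one_at=None):
    """Read triplets from position 0 until a stop codon or the end.

    If plus_one_at is a (p_site_codon, a_site_codon) pair, the first time
    that pair is encountered one nucleotide is skipped and triplet decoding
    resumes in the new frame. The stop codon itself is not returned.
    """
    rna = mrna.upper().replace("T", "U")
    codons, i = [], 0
    while i + 3 <= len(rna):
        codon = rna[i:i + 3]
        if codon in STOP_CODONS:
            break  # termination in the current frame
        codons.append(codon)
        i += 3
        # The codon just read now occupies the P-site; rna[i:i+3] is in the A-site.
        if plus_one_at and codon == plus_one_at[0] and rna[i:i + 3] == plus_one_at[1]:
            i += 1              # skip one nucleotide: shift to the + frame
            plus_one_at = None  # shift only once
    return codons


if __name__ == "__main__":
    # Hypothetical AUG + shift-site region taken from the construct inserts.
    mrna = "AUG" + "UGGUGCUCCUGAUGCCCCUCACCC"
    print(read_codons(mrna))                   # standard decoding stops at UGA
    print(read_codons(mrna, ("UCC", "UGA")))   # decoding continues after the shift
```

Without the shift, decoding ends at UGA after UCC; with it, the ribosome skips the U of UGA and continues with GAU, GCC and so on, which mirrors the competition between termination and + frame decoding described in the text.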
finally, the ability to direct ribosomes to the + reading frame in living cells ( figure ) suggests a potential therapeutic application for antisense oligonucleotides. directed frameshifting to the + reading frame near a disease causing − frameshift mutation would cause some ribosomes to resume decoding in the wild-type orf, thus restoring partial production of full-length protein from mutant alleles. the importance of the stop codon for efficient frameshifting suggests that the stop codon following the frameshift mutation presents a promising target for antisense induced phenotypic suppression, and that modulation of intracellular polyamine levels, although not essential, may increase the effectiveness of this approach. further experiments are required to determine the therapeutic potential of this approach in vivo including the generality and efficiency of frameshift induction at non-programmed frameshift sites.
the authors would like to thank drs pasha baranov, john atkins and lorin petros for critical reading of the manuscript. this project was funded by an mda development grant and nih r ns to m.t.h. funding to pay the open access publication charges for this article was provided by nih r ns . conflict of interest statement. none declared.
supplementary data are available at nar online.
key: cord- - g f sw authors: tosoni, elena; frasson, ilaria; scalabrin, matteo; perrone, rosalba; butovskaya, elena; nadai, matteo; palù, giorgio; fabris, dan; richter, sara n. title: nucleolin stabilizes g-quadruplex structures folded by the ltr promoter and silences hiv- viral transcription date: - - journal: nucleic acids res doi: . /nar/gkv sha: doc_id: cord_uid: g f sw folding of the ltr promoter into dynamic g-quadruplex conformations has been shown to suppress its transcriptional activity in hiv- . here we sought to identify the proteins that control the folding of this region of proviral genome by inducing/stabilizing g-quadruplex structures. the implementation of electrophoretic mobility shift assay and pull-down experiments coupled with mass spectrometric analysis revealed that the cellular protein nucleolin is able to specifically recognize g-quadruplex structures present in the ltr promoter. nucleolin recognized with high affinity and specificity the majority, but not all, of the possible g-quadruplexes folded by this sequence. in addition, it displayed greater binding preference towards dna than rna g-quadruplexes, thus indicating two levels of selectivity based on the sequence and nature of the target. the interaction translated into stabilization of the ltr g-quadruplexes and increased promoter silencing activity; in contrast, disruption of nucleolin binding in cells by both sirnas and a nucleolin binding aptamer greatly increased ltr promoter activity. these data indicate that nucleolin possesses a specific and regulated activity toward the hiv- ltr promoter, which is mediated by g-quadruplexes. these observations provide new essential insights into viral transcription and a possible low mutagenic target for antiretroviral therapy. g-quadruplexes (g s) are nucleic acid secondary structures that may form in single-stranded g-rich dnas and rnas under physiological conditions ( ) ( ) ( ) .
four gs bind via hoogsteen-type hydrogen bonds to yield g-quartets that in turn stack on top of each other to form the g . g s are highly polymorphic, both in terms of strand stoichiometry (forming both inter-and intramolecular structures) and strand orientation/topology. the presence of k + cations specifically supports g formation and stability ( ) ( ) ( ) . in eukaryotes and prokaryotes, g dna motifs have been found in telomeres, g-rich micro-and mini-satellites, near promoters, and within the ribosomal dna (rdna) ( ) ( ) ( ) . in the human genome, genes that are near g dna motifs fall into specific functional classes; for example, promoters of oncogenes and tumor suppressor genes have particularly high and low g -forming potential, respectively ( ) ( ) ( ) . human g dna motifs have been reported to be associated with recombination prone regions ( ) and to show mutational patterns that preserved the potential to form g dna structures ( ) . rna g s have been detected in the and -utr and coding regions, in which they act as important regulators of pre-mrna processing (splicing and polyadenylation), rna turnover, mrna targeting and translation ( , ) . regulatory mechanisms controlled by g s involve the binding of protein factors that modulate g conformation and/or serve as a bridge to recruit additional protein regulators. indeed, g binding proteins can be classified into three functional groups: telomere-related proteins, such as the shelterin complex; proteins that unfold the g structure, such as the helicase and heterogeneous nuclear ribonucleoprotein families; proteins that stabilize g s, a large group which includes nucleolin, maz and nucleophosmin ( , ( ) ( ) ( ) . g structures and their cognate proteins are key players in numerous essential processes in eukaryotic cells. 
their misregulation has been associated with a number of relevant human diseases, such as amyotrophic lateral sclerosis ( ) ( ) ( ) , alzheimer ( ) and fragile x syndrome ( ) , in which expansion of g -forming regions has been reported. moreover, mutations in g -interacting proteins have been linked to genetic diseases, such as werner syndrome and fanconi anemia ( , ) . in recent years, new studies have increased our knowledge of the biological significance of g s in prokaryotes ( , ) and viruses ( ) . we and other groups have identified functionally significant g s in the nef coding region ( ) and the unique ltr promoter ( ) ( ) ( ) of the human immunodeficiency virus (hiv), the etiologic agent of the acquired immune deficiency syndrome (aids). these studies have shown that g folding at the ltr promoter decreased viral transcription with an effect that was augmented by g ligands ( , ) . the significance of these structures as focal points of interactions with host and viral factors is also supported by the observation that g -folded sequences are specifically recognized by various viral proteins, such as the epstein barr virus nuclear antigen ( , ) and the sars coronavirus unique domain (sud), which occurs exclusively in highly pathogenic strains ( ) . for this reason, we decided to pursue the investigation of putative cellular/viral proteins that may be involved in the regulation of the g ltr promoter activity in hiv. we employed a concerted approach combining electrophoretic mobility shift assay (emsa) and electrospray ionization mass spectrometry (esi-ms) analysis to identify possible factors capable of binding the ltr g structure. in order to validate the findings, we then tested their stabilizing activity on the g fold and evaluated their ability to inhibit ltr-driven transcription in cells.
the results provided new insights into the role of the ltr g in the viral life cycle, which could pave the way for the possible development of novel therapeutic strategies.
all desalted oligonucleotides and aptamers were purchased from sigma-aldrich, milan, italy (supplementary table s ). the hiv- ltr region was inserted into the promoterless luciferase reporter vector pgl . -luc (promega italia, milan, italy) to form the pgl . -ltr-luc vector, as previously reported ( ) . the renilla plasmid (p . , promega italia, milan, italy) was used as an internal control. the human enhanced green fluorescent protein-nucleolin plasmid (gfp-nucleolin) was purchased from addgene (addgene, cambridge, ma, usa). the pegfp empty vector was used as control (clontech, takara bio, otsu, japan).
human embryonic kidney (hek) t cells (atcc # crl- ) were grown in dmem (gibco, thermo fisher scientific, waltham, ma, usa) supplemented with % heat-inactivated fetal bovine serum (fbs, gibco, thermo fisher scientific, waltham, ma, usa). jurkat t-lymphocyte cells (atcc # tib- ) were grown in rpmi (gibco, thermo fisher scientific, waltham, ma, usa) supplemented with % heat-inactivated fbs. mcf- human breast cancer cells (atcc # htb- ) were grown in rpmi supplemented with % heat-inactivated fbs. mcf a normal human mammary epithelial cells (atcc # crl- ) were grown in dmem/f (gibco, thermo fisher scientific, waltham, ma, usa) supplemented with % heat-inactivated fbs and egf ( . g/ml), hydrocortisone ( . g/ml), cholera toxin ( . g/ml), insulin ( g/ml) (all purchased from sigma-aldrich, milan, italy). all cultures were grown in a humidified incubator maintained at °c with % co .
oligonucleotides were -end labeled with [γ- p]atp using t polynucleotide kinase at °c for min. after dna precipitation, labeled species were resuspended in lithium cacodylate buffer ( mm, ph . ) and kcl mm.
the oligonucleotides were denatured for min at • c and gradually cooled to room temperature to achieve proper folding of g-quadruplex structures. protein nuclear extracts of hek t and jurkat cells were obtained by using nx-tract kit (sigma-aldrich, milan, italy). recombinant full-length human nucleolin was expressed in hek cells and purified as described by the manufacturer (origene technologies, rockville, usa). labeled oligonucleotides ( nm) were incubated in l of reaction in emsa binding buffer and nuclear extract ( . g/l) or purified nucleolin ( ng) for h at • c. emsa binding buffer composition was: tris-hcl mm, ph , kcl mm, mgcl . nm, dtt mm, glycerol %, protease inhibitor cocktail (sigma-aldrich, milan, italy) %, naf mm, na vo mm, poly [di-dc] (sigma-aldrich, milan, italy) . ng/l. in competition experiments, an excess of cold oligonucleotides was added to the samples and their ability to disrupt the g structures was monitored to evaluate binding specificity. after incubation, reaction solutions were loaded onto % native polyacrylamide gel in × tbe buffer and kcl mm. dna-protein complexes were resolved by running the gel overnight at v at • c. emsa gels were dried using a gel dryer (bio-rad laboratories, milan, italy), free and bound dna molecules were visualized by phosphorimaging (typhoon fla , ge healthcare europe, milan, italy) and quantified by imagequant tl software (ge healthcare europe, milan, italy). after the desired complex was located on the gel, the corresponding band was cut and either directly in-gel digested for mass spectrometric (ms) analysis, or further purified by sds-page. extraction was performed in sds-page sample buffer. after min in boiling water, samples were incubated overnight at • c. finally, supernatant was loaded on % sds-page. the band of interest was excised after coomassie staining. hek t cells were seeded in a -cm dish in dmem supplemented with % heat-inactivated fbs and incubated overnight. 
cells were next either mock-transfected or transfected with of pnl - (the reagent was obtained through the nih aids reagent program, division of aids, niaid, nih) ( ) using calphos™ mammalian transfection kit (clontech, otsu, japan) according to the manufacturer's protocol. after h, cells were washed with pbs × and fresh growth medium was added. forty eight hours post-transfection, cells were washed twice with cold pbs and scraped off. after a short centrifugation, the pellet was resuspended in total protein extraction buffer (kcl mm, tris-hcl mm, ph . , glycerol %, dtt mm, protease inhibitor cocktail). cells were then lysed with three repeated freeze/thaw cycles; the supernatant was cleared by centrifugation, stored at − °c, and subsequently used in emsa assays.
bands were treated according to established in-gel digestion protocols. briefly, they were first washed with % ch oh and . % acetic acid, dehydrated with ch cn, and then reduced with l of dtt ( mm in mm nh hco ) for min at room temperature. the excess of dtt was eliminated before treating the bands with l of iodoacetamide ( mm in mm nh hco ) for min at room temperature in order to alkylate cysteine residues. bands were washed with mm nh hco , dehydrated with ch cn twice, and then digested. a g aliquot of ms-grade trypsin (thermofisher scientific, waltham, ma, usa) in l of mm nh hco was added to the dehydrated bands, followed by incubation on ice for min. the excess of trypsin was eliminated and substituted with l of mm nh hco and the sample was incubated overnight at °c. peptides were extracted twice with % formic acid and two more times with % ch cn, % formic acid. the peptide mixture was further desalted in a silica nanocolumn (polymicro technologies, phoenix, az, usa) packed in house with pinnacle c pack material (thermo fisher scientific, waltham, ma, usa). all materials were ms grade purchased from sigma aldrich, st. louis, mo, usa except where otherwise indicated.
the desalted mixture was finally analyzed by direct infusion electrospray ionization (esi) on a thermo fisher scientific (waltham, ma, usa) ltq-orbitrap velos mass spectrometer. the instrument was calibrated by using a . mg/ml solution of csi in % ch oh, which provided a typical < ppm mass accuracy. all analyses were performed in nanoflow mode by utilizing quartz emitters produced in house by using a p laser pipette puller (sutter instruments co., novato, ca, usa). up to l samples were typically loaded onto each emitter by using a gel-loader pipette tip. a stainless steel wire was inserted through the back-end of the emitter to supply an ionizing voltage that ranged between . and . kv. bands containing bovine serum albumin (bsa) and empty gel bands were used as positive and negative control, respectively. putative peptides that were not present in blank samples were submitted to tandem mass spectrometric (ms/ms) analysis. these determinations involved isolating the precursor ion of interest in the ltq element of the instrument, activating fragmentation in either the ltq or the c-trap, and performing fragment detection in the orbitrap. the masses of the more intense fragments were employed to perform a mascot database search ( ) to identify their parent protein. the matched protein was deemed as being positively identified when two or more peptides provided a mascot score greater than . hek t protein nuclear extract ( . g/l) was incubated with biotinylated ltr oligo g folded ( nm) in l of reaction containing tris-hcl mm, ph , kcl mm, mgcl . nm, protease inhibitor cocktail %, naf mm, na vo mm, poly [di-dc] . ng/l for h at • c. the binding reaction was followed by incubation ( h at • c) with l of streptavidin-agarose beads (sigma-aldrich, milan, italy). after pbs washes, proteins were eluted with increasing amount of nacl ( . and m), and concentrated with amicon ultra . (merck millipore, germany). 
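the acceptance rule stated above for protein identification (a protein is accepted when two or more peptides exceed a mascot score cut-off) can be sketched as a simple filter. this is a minimal sketch under stated assumptions: the cut-off value is not recoverable from the text, so it is a required parameter here, and the function name, the input layout and the example hits are hypothetical and do not reproduce mascot output.

```python
from collections import defaultdict


def identified_proteins(peptide_hits, score_cutoff, min_peptides=2):
    """Return proteins supported by at least `min_peptides` distinct peptides
    whose score is strictly greater than `score_cutoff`.

    peptide_hits: iterable of (protein, peptide, score) tuples (hypothetical layout).
    score_cutoff: must be supplied by the user; the value used in the study
                  is not given in the text above.
    """
    passing = defaultdict(set)
    for protein, peptide, score in peptide_hits:
        if score > score_cutoff:
            passing[protein].add(peptide)
    return sorted(p for p, peps in passing.items() if len(peps) >= min_peptides)


# Invented example data; the cut-off of 30 is an arbitrary placeholder.
hits = [
    ("protein_a", "pep1", 45.0),
    ("protein_a", "pep2", 38.5),
    ("protein_b", "pep3", 60.0),
    ("protein_c", "pep4", 12.0),
    ("protein_c", "pep5", 41.0),
]
print(identified_proteins(hits, score_cutoff=30))
```

counting distinct peptides rather than spectra prevents a single peptide that was fragmented repeatedly from satisfying the two-peptide requirement on its own.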
beads were collected by brief centrifugation, resuspended in l of laemmli buffer, and finally incubated at • c for min. supernatants were separated on sds-page and analyzed by western blot. oligonucleotides were diluted to . m in lithium cacodylate buffer ( mm, ph . ) and kcl mm heat denatured for min at • c, and folded in g structure at room temperature for h. samples were incubated alone, with purified human nucleolin ( ng) or bovine serum albumin (bsa, negative control) for h at • c. fluorescence melting curves were determined by using a lightcycler ii (roche, milan, italy). after a first equilibration step at • c for min, a stepwise increase of • c every minute for cycles was performed to reach • c. a measurement was completed after each cycle by using nm excitation and nm detection. oligonucleotide melting was monitored by observing -carboxyfluorescein ( -fam) emission, which was normalized between and . t m was defined as the temperature for which the normalized emission was . . gene-specific pooled sirna trilencer targeting human ncl and a scrambled negative control duplex were purchased from origene (ncl trilencer- human sirna, origene technologies, rockville, md, usa). mcf cells were transfected with , and nm aliquots of human ncl sirna and control sirna by using lipofectamine rnaimax (invitrogen, thermo fisher scientific, waltham, ma, usa) following the manufacturer's instructions. pltr luciferase plasmid and renilla construct were transfected into the same cells h later by using lipofectamine (invitrogen, thermo fisher scientific, waltham, ma, usa). in the double transfected cells, ltr promoter activity was assessed as firefly luciferase signal, normalized to renilla luciferase activity, by using dual-glo r luciferase assay system (promega italia, milan, italy), according to the manufacturer's directions ( ) . depending on the transfected cell line (i.e. either mcf or mcf a), the pgl . 
-ltr-luc vector provided signals ranging from to × luciferase units, as measured by a victor x multilabel plate reader (perkin elmer italia, milan, italy). in contrast, the promoterless pgl . -luc and untransfected cells displayed a background signal lower than luciferase units. all data were acquired in medium-free pbs. dna aptamers as and the control cro were added to cell medium at the time of transfection of pgl . -ltr-luc and p . -renilla plasmids, and the luciferase signal was read h after transfection. each assay was performed in duplicate and each set of experiments was repeated at least three times.
immunoblot analysis was performed on cell protein extracts obtained as previously described ( ) . protein concentrations were quantified by using the pierce® bca protein assay kit (thermo scientific, rockford, il, usa) and the samples stored at − °c. each sample was electrophoresed on % sds-page and transferred to a nitrocellulose blotting membrane (amersham™ protan™, ge healthcare life science, milan, italy) by using transblot sd semi-dry transfer cell (bio-rad laboratories, milan, italy). the membranes were blocked with % skim milk in pbst ( . % tween in pbs). membranes were incubated with the respective primary antibody directed against ncl (rabbit polyclonal c (h- ); santa cruz biotechnology, dallas, tx, usa), p (rabbit polyclonal; abcam, cambridge, uk), and β-actin (mouse monoclonal; sigma-aldrich, milan, italy). after three washes in pbst, membranes were incubated with ecl plex goat-α-rabbit igg-cy or ecl plex goat-α-mouse igg-cy (ge healthcare life sciences, milan, italy). images were captured on the typhoon fla , and quantified by imagequant tl software.
taq polymerase stop assay was carried out as previously described ( ) .
briefly, the -end labeled primer ( -ggcaaaaagcagctgcttatatgcag- ) was annealed to the template (supplementary table s ) in lithium cacodylate buffer in the presence or absence of kcl mm by heating at °c for min and gradually cooling to room temperature. where specified, samples were incubated with purified human nucleolin ( ng) at °c for h. primer extension was then conducted by using u of amplitaq gold dna polymerase (life technologies, thermo fisher scientific) for min at °c or °c. reactions were stopped by ethanol precipitation. primer extension products were separated on a % denaturing gel, and finally visualized by phosphorimaging (typhoon fla ).
the dna substrate of interest was gel-purified before use and prepared in desalted/lyophilised form. the oligonucleotide was -end-labeled with [γ- p]atp by t polynucleotide kinase, purified by using microspin g- columns (amersham biosciences, europe), resuspended in lithium cacodylate buffer mm, ph . , kcl mm, heat-denatured and folded. the oligonucleotide ( . m) was incubated either alone or with purified human nucleolin ( ng) in emsa binding buffer for h at °c. sample solutions were then treated with dimethylsulfate (dms, . % in ethanol) for min and the reaction was stopped by addition of gel loading buffer containing % glycerol and β-mercaptoethanol. samples were loaded onto % native polyacrylamide gels and run until the desired resolution was obtained. dna bands were localized via autoradiography, excised and eluted overnight. the supernatants were recovered, ethanol-precipitated and treated with piperidine m for min at °c. samples were dried in a speed-vac, washed with water, dried again and resuspended in formamide gel loading buffer. reaction products were analyzed on % denaturing polyacrylamide gels, visualized by phosphorimaging analysis, and quantified by imagequant tl software.
spr was performed on the biacore t platform (ge healthcare, life science, milan, italy).
purified human ncl was immobilized on series s sensor chip cm by amine coupling. immobilization was performed in hepes-nacl running buffer (hepes ph .
in previous work, we demonstrated that the hiv- ltr region can fold into at least three different g-quadruplex (g ) structures at positions − /− with respect to the transcription initiation site of the representative hxb lai (nc ) strain ( ) . disruption of g by point mutations increased the transcript levels, thus indicating that g formation may contribute to the modulation of viral transcription. we reasoned that stabilization and unfolding of g at the ltr level are likely regulated by interactions with viral/cellular proteins. in order to test this hypothesis, ltr g -forming sequences were incubated with nuclear protein extracts. ltr sequences of different lengths were employed to check whether individual g structures identified in the ltr promoter displayed different protein binding capabilities. in particular, we assayed sequences that possessed the minimal requirements to fold into an individual g-quadruplex (i.e. only g-tracts: ltr-ii, ltr-iii and ltr-iv) and sequences capable of providing multiple g s (i.e. more than g-tracts: ltr-ii+iii+iv and ltr-iii+iv) ( figure a) . the corresponding p-labeled oligonucleotides were incubated with nuclear extracts derived from two cell lines: jurkat t-lymphocytes, which are a model for the natural hiv- targets in vivo, and human embryonic kidney t cells, which lack hiv- cell receptors and sustain all viral steps with the exception of virion attachment and entry. the latter can however be transfected with the hiv- proviral genome to produce fully competent and infectious viral particles, which indicates that their cytoplasmic/nuclear protein makeup is competent to sustain viral replication. samples were analysed on native polyacrylamide gels to monitor the formation of slower running bands corresponding to oligonucleotide-protein complexes.
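the distinction drawn above between sequences with just enough g-tracts for a single g-quadruplex and sequences with additional g-tracts that can provide multiple g-quadruplexes can be illustrated by counting g-runs. this is a rough sketch under stated assumptions: four g-tracts are taken as the requirement for one intramolecular g-quadruplex, the minimum run length of three is an illustrative default, and the example sequences are invented and are not the ltr oligonucleotides of supplementary table s . counting tracts says nothing about whether a sequence actually folds.

```python
import re


def g_tracts(seq, min_run=3):
    """Return (start, end) spans of runs of at least `min_run` consecutive G.

    A longer uninterrupted run is counted as a single tract.
    """
    pattern = re.compile(r"G{%d,}" % min_run)
    return [m.span() for m in pattern.finditer(seq.upper())]


def g4_capacity(seq, min_run=3, tracts_per_g4=4):
    """Classify a sequence by its number of G-tracts (assumed rule of thumb)."""
    n = len(g_tracts(seq, min_run))
    if n < tracts_per_g4:
        return "none"
    if n == tracts_per_g4:
        return "single"
    return "multiple"


# Invented sequences for illustration only.
print(g4_capacity("GGGTTGGGTTGGGTTGGG"))       # exactly four tracts
print(g4_capacity("GGGTTGGGTTGGGTTGGGTTGGG"))  # five tracts
print(g4_capacity("GGGTTGGGTT"))               # too few tracts
```

with more than four tracts, different subsets of tracts can take part in folding, which is the sense in which the longer ltr sequences are described as providing multiple g-quadruplexes.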
as shown in figure b , all g ltr oligonucleotides formed slower running bands. the fact that the patterns obtained from the two nuclear extracts were essentially identical indicated that the selected cell lines contained very similar sets of g -binding proteins. in particular, a very specific band migrated with the same rate in all g oligonucleotide samples (arrow in figure b) , thus suggesting that the same protein was able to bind all ltr g sequences considered. the intensity of the observed bands indicated that the longest sequence (i.e. ltr-ii+iii+iv) was bound most efficiently, whereas the shortest sequence (i.e. ltr-iv) supported only very modest complex formation. the ltr-ii+iii+iv oligonucleotide was incubated with extracts of hiv- producing and non-producing t cells to test whether the presence of viral proteins detectably affected the observed emsa profiles ( figure c ). the actual presence of viral proteins in the transfected cells was assessed by western blot analysis ( figure d ). while viral proteins were well represented in the transfected extract, the emsa determinations revealed no major difference and confirmed that cellular proteins must constitute the major players in ltr g binding ( figure c) . in previous work, we demonstrated that one- or two-nucleotide point mutations in the g-tracts involved in g pairing were able to partially or totally disrupt, respectively, g folding in the ltr sequence (ltr-m , ltr-m , ltr-m + ), whereas a mutation in the loop did not (ltr-m ", figure a ) ( ) . based on these observations, we used mutant ltr-ii+iii+iv oligonucleotides to assess the specificity of the observed protein for the g structure (figure b ). when incubated with the nuclear extracts, only the sequences that retained g -folding capabilities (i.e. wt and m ") were able to form bands corresponding to the desired complexes (arrow in figure b ), whereas the mutants that partially fold or cannot form stable g s were much less efficiently bound.
a scrambled sequence matching the wt base composition did not bind at all (figure b and c). it should be noted that both a slower and a faster running band were present in all samples (see asterisk in figure b ), indicative of nucleic acid binding proteins that may not be selective for g . a competition experiment was performed to confirm the binding specificity of the selected protein, which involved mixing the labeled wt sequence with increasing amounts of unlabeled wt, mutant ltrs or scrambled sequence. as shown in figure d and e, only the wt and m " sequences were able to effectively compete for protein binding, whereas the other oligonucleotides did not decrease the amount of protein bound to wt. interestingly, however, fold addition of m , m and m + oligonucleotides increased the amount of protein bound to the wt sequence. in this case, the mutant sequences may compete for binding to proteins that are sequence-but not structure-specific, thus leaving more g -specific protein available to bind to the g -folded wt sequence. furthermore, the fact that wt sequences folding g or ds structures bound different proteins confirmed that the observed protein was specific for the g conformation ( figure f ).
figure legend (fragment): g-tracts are shown in bold and are numbered ( - ') above the sequence. in the mutant sequences, the mutated bases are shown in red. the names of the mutant sequences correspond to the g-tracts where g bases have been mutated. scra stands for a sequence where gs have been scrambled to such an extent that no g can form. the ability of the mutant sequences to fold into g is reported on the right: y = folding, y/n = partial folding, n = no folding (measured as reported in ( )).
with the goal of identifying the protein of interest, the relevant emsa band was excised from the gel and submitted directly to trypsin digestion (see materials and methods).
alternatively, the material extracted from the emsa band, which could possibly include multiple co-migrating species, was further purified on an sds gel before trypsin digestion. in either case, the digestion products were subjected to ms/ms analysis and database searching ( figure a ) ( , ) , which provided an excellent match with human nucleolin (ncl) ( table ). the experiment was separately repeated three times to corroborate the results. positive identification was also confirmed by performing emsa analysis of samples that included the g -folded wt and mutant ltr-ii+iii+iv sequences with either nuclear extracts or purified human ncl. ncl displayed the same binding activity towards the wt and mutant ltr sequences manifested by the unknown protein ( figure b ). ncl binding to the wt ltr-ii+iii+iv g was concentration dependent (figure c and d) . surprisingly, however, two ncl/ltr complex bands were identified ( figure b and c). it has been reported that both native and purified ncl display self-cleaving activity and indeed preparations of ncl usually exhibit multiple bands ( ) . our purified ncl also showed multiple bands on sds gel, as detected by both coomassie staining and western blot analysis (supplementary figure s a ). ms analysis of the four bands detected by coomassie staining showed full-length peptide coverage only for the upper major band, whereas coverage of the n-terminal portion was missing in the lower three bands (supplementary figure s b) . we therefore ascribed the upper and lower bands to the full-length and cleaved forms of ncl, respectively. interestingly, the band obtained from nuclear extracts migrated the fastest and, indeed, the n-terminal portion was not observed by ms peptide analysis (table and supplementary figure s b ). in addition, when the protein bound to ltr g in the emsa gel was extracted and analyzed separately on sds gel, it migrated at around kda, corresponding to one of the cleaved forms of ncl (data not shown).
therefore, we conclude that the ncl form present in the nuclear extracts corresponds to its cleaved portion lacking the n-terminus. in addition, ncl formed a complex with a scrambled sequence, which displayed a slightly slower migration rate compared to the g -bound ncl ( figure b ). the amount of complex was similar to that afforded by m and m ltr sequences. binding of ncl to g-rich oligonucleotides has been previously reported ( ) and provides an excellent indication that the rna binding domains (rbd) of the purified ncl are indeed active. in analogous fashion, biotinylated wt, m + ltr-ii+iii+iv, and scrambled-sequence oligonucleotides were incubated with nuclear extracts and then added to streptavidin-functionalized agarose beads to facilitate removal of unbound proteins. immobilized proteins were eluted with buffers of increasing ionic strength and identified by western blot analysis ( figure e) . ncl was released from the wt sequence when treated with increasing concentrations of nacl or heated to °c in denaturing buffer, whereas it was released from the m + and scrambled sequences at the lowest nacl concentration, consistent with a lower affinity for these non-g -forming sequences. these data confirm that ncl specifically binds the g folded conformation of the ltr-ii+iii+iv sequence. taq polymerase stop assays were performed to assess the stabilization imparted by ncl to the various ltr g structures. the wt ltr-ii+iii+iv and m + mutant sequences were extended to include a primer-annealing region and used as templates for a single-cycle taq reaction (supplementary table s ). elongation of the wt template was performed at °c or °c in the presence/absence of ncl. in samples containing mm kcl and no ncl, a pausing site corresponding to the most -end g-tract was observed only in the reaction elongated at °c. the pause was not observed at °c, consistent with possible destabilizing effects of temperature on the g structure (compare lanes and of figure a ). 
addition of ncl induced more evident pauses at both elongation temperatures (figure a, lanes and ) , which clearly highlighted the stabilizing properties of ncl on g conformation. as expected from the inability of the m + mutant to fold g structures, no pausing sites were observed regardless of the presence of protein ( figure a, lanes - ) , thus confirming the specificity of ncl binding for full-fledged g s. in parallel, fret melting assays were carried out to study the stabilizing effects of ncl on the g structures. the assays involved the synthesis of constructs that combined selected g -forming sequences, such as ltr-ii+iii+iv and the shorter ltr-ii, ltr-iii, ltr-iv, and ltr-iii+iv (figure a) , with fam and tamra moieties placed at their - and -ends, respectively. the results showed that ncl conferred the highest stabilization in the series to the ltr-ii+iii+iv construct, followed by ltr-iii+iv. progressively lower stabilization was observed for ltr-iii and ltr-ii, whereas ltr-iv was the least affected in the series ( table ). the negative control bovine serum albumin (bsa) did not afford any detectable stabilization to the selected sequences. to identify the position of putative ncl binding sites on the dna g structures, we performed dimethylsulfate (dms) footprinting. when a complex of ncl with the ltr-ii+iii+iv dna was assessed ( figure b ), the methylation pattern revealed that the region of the ltr sequence was protected by protein binding with unique specificity for two g bases in g-tract , as highlighted in figure c . based on the facts that ncl has been described as an rna-binding protein ( ) and that the ltr sequence is present in the u region of the hiv- genome during the initial infection steps, we tested whether ncl could also bind the rna version of the ltr g . in this case, ncl was incubated with labeled rna or dna oligonucleotides capable of folding ltr g s. 
increasing concentrations of unlabelled rna and dna counterparts were employed to compete for ncl binding. the results clearly showed that the protein was able to bind both types of g oligonucleotides (lanes and , figure a ). however, the dna g was consistently able to outcompete the rna version for protein binding (lanes - , figure a and b), whereas the rna was incapable of outcompeting the dna (lanes - , figure a and b). we had previously shown that the ltr rna g s fold into parallel structures ( ) , whereas the dna counterparts display rather hybrid-like conformations ( ) ; therefore, the preferential binding toward the dna g is likely caused by these substantial structural differences between the dna and rna g conformations.
[figure legend fragment] table ). only characteristic b and y ions are indicated ( , ) . the data matched the sequence of peptide t -k of ncl, which is reported on top with the observed fragments. (b) emsa analysis of the binding of nuclear extract (ne) proteins and purified ncl to the wt and mutant ltr sequences. arrows indicate relevant protein/ltr g complex bands. (c) emsa analysis of the binding of increasing amounts of purified ncl to the wt ltr-ii+iii+iv g . the vertical bar highlights the portion of the gel where the two ncl/ltr g complex bands are observed. (d) quantification of the upper and lower ncl/ltr g complex bands obtained in the emsa in panel (c). (e) pull-down assay of nuclear extract proteins with wt, mutant g ltr-ii+iii+iv and random (rnd) sequences, immobilized on agarose beads. shown is the western blot analysis with an ncl antibody. proteins complexed to the bead-bound ltrs were washed with increasing stringency by raising the ionic strength of the wash buffer ( . and m). the final elution was obtained in denaturing buffer at °c.
surface plasmon resonance (spr) analysis was next used to assess the affinity of the purified ncl for the wt ltr-ii+iii+iv g : a k d of . ± . 
nm was obtained, which indicates an extremely high affinity of the protein for this ltr g ( figure c ). in contrast, the scrambled oligonucleotide showed no affinity for ncl (supplementary figure s ). a luciferase reporter assay was established to explore the downstream biological effects of ncl binding to the ltr promoter. two epithelial breast cell lines were selected for the assay: mcf- breast cancer cells and mcf- a normal breast epithelial cells. the latter inherently express lower amounts of ncl compared to tumor cells ( , ) . for this reason, the mcf- a cells were transfected with wt or m + ltr luciferase reporter plasmids, either alone or in the presence of increasing amounts of ncl expression vector. the luciferase signal was measured to determine the level of activation of the ltr promoter. the results showed that the activity of wt ltr decreased to % of the control, while that of the m + promoter remained unchanged ( figure a ). the mcf cell line that overexpresses ncl was employed to perform additional activity assays. in this case, cells were treated with increasing amounts of sirnas designed to target ncl mrna ( ) , and then transfected with wt or mutant m + ltr luciferase reporter plasmids. analysis of ncl content showed that the protein was effectively depleted at - nm sirna, reaching % of the initial amount at nm ( figure b ). measured by luciferase activity, the effect of ncl depletion on ltr promoter activity was striking: at nm of sirna, the promoter activity of the wt sequence was -fold that of the control, while that of the mutant m + sequence was only . -fold (figure c) . as a complementary approach, we tested the effect of the ncl-targeted dna aptamer as , which has been reported to bind ncl with high affinity in cells ( ) . at m of as , we again observed a significant increase in wt ltr promoter activity, which reached times that of the non-treated control, while that of the mutant m + sequence increased only . times ( figure d ). 
in contrast, the control sequence cro that is complementary to as did not modify ltr promoter activity. these data indicate that the specific binding of ncl to g structures in the ltr promoter exerts significant repressive effects on hiv- promoter activity. we have identified ncl as a prominent host factor capable of binding with high affinity to the g structures present in the ltr promoter of hiv- . we observed that the specific interaction leads to g stabilization and contributes to silencing viral transcription. conversely, we also demonstrated that ncl depletion produces extraordinary enhancing effects on ltr promoter activity. these observations are consistent with the multifaceted nucleic acid binding and chaperoning activities attributed to this protein. indeed, ncl is most abundant in the nucleolus, but can also be found in cell membranes and, upon stress stimuli, in the nucleoplasm and cytoplasm, to some extent ( , ) . among other functions, it is involved in transcription ( ) by specific interactions with sequences that can adopt complex secondary structures ( , ( ) ( ) . it is widely believed that ncl plays a chaperone role by helping the correct folding of complex nucleic acid structures. indeed, ncl has been shown to display a marked preference for both endogenous and exogenous g-rich sequences that can fold into g ( ) . it has been recently reported that binding of ncl to the endogenous (ggggcc) n hexanucleotide repeat expansion (hre) in c orf is responsible for the initiation of molecular cascades that lead to neurodegenerative diseases ( ) . at the promoter level, binding of ncl to g structures augments the basal effect of the folded conformation ( , ( ) ( ) ( ) . one of the best documented examples of g -mediated regulation among g promoters ( ) is that of c-myc, which shows striking similarities with the g -mediated regulation of the hiv- ltr promoter reported here and previously by us ( ) . 
both cases involve multiple g-tracts that enable folding into alternative g conformations; g parallel-like topology ( ); at least one g n g motif; binding sites for sp ; silencing effect on promoter activity ( ); ncl-binding activity and higher affinity of ncl towards the dna g compared to the rna g counterpart ( ) . in the case of c-myc, it has been shown that the n-terminus of ncl is dispensable for its g binding activity ( ) . here, we show that ncl can naturally produce cleaved forms that lack the n-terminus but retain full binding capabilities. on the one hand, these observations demonstrate that cells allow the formation of ncl cleaved species that maintain their g /nucleic acid binding activity. on the other hand, they suggest that the hiv- virus and human host cells have likely evolved identical mechanisms to control transcription at the dna promoter level. the results of our experiments provided valuable insights into the determinants of ncl binding to the various ltr g -forming structures. the lower binding of ncl to ltr-iv compared to the other g structures suggests that the interaction has both conformation- and sequence-dependent characteristics, which in turn implies the fascinating possibility that different g s in the hiv- ltr promoter may exert different functions based on their binding partners. moreover, the greater affinity demonstrated for dna than rna constructs indicates that the interaction is deeply influenced by conformational differences between g structures folded by the different types of biopolymers. the facts that the hiv- genome consists of rna, that this g forming sequence is also present in the u region of the genome ( ) , and that viral rna during the first steps of infection is still present in the cell cytoplasm where ncl levels are low, suggest that the specific binding of ncl to the dna version may represent an essential mechanism for regulating viral transcription. 
in this direction, it has been shown that ncl is involved in different steps of the hiv- life cycle. inhibition of surface ncl by different cellular and synthetic compounds ( - ) affects cell attachment/entry by the virus ( ) . in addition, ncl can bind hiv- gag protein to promote viral budding ( ) , or to enhance gag release ( ) . further, ncl involvement in the viral life cycle has been corroborated also by evidence that hiv infection modifies the protein's cellular distribution ( , ) . the activities performed by ncl in other viruses have also been described in a number of recent papers ( ) ( ) ( ) ( ) ( ) . for example, binding of ncl to epstein-barr virus (ebv) nuclear antigen (ebna ) modulates viral replication and transcription ( ) , and virus-induced relocalization of ncl has been observed in some instances ( , ) . in this broader context, these reports support our new findings that point toward a significant role played by this protein in hiv- replication. the observation that ncl interaction with ltr g silences viral transcription is in apparent contrast with the well-known ability of ncl to interact with histone h , which induces chromatin decondensation ( ) that in turn facilitates the passage of the dna polymerases ( ) . however, the opposite effect consisting of ncl-mediated repression of dna replication has been also reported ( , ( ) ( ) ( ) , and has been attributed to the interaction of ncl with the dna processing enzyme replication protein a (rpa). in addition, ncl has been reported to recruit a dna helicase that unwinds g structures ( ) . in both cases, ncl binding to cellular proteins is a transient response to stress stimuli. 
these observations raise the intriguing possibility that when ncl is redistributed by the cellular stress imposed by hiv infection ( , ) , it may then be recruited by the g structure of the viral promoter to transiently downregulate hiv transcription and enable the virus to prepare for subsequent efficient transcription, when viral proteins, such as tat, take over. alternatively, this interaction may be required as a first switch to viral latency and to recruit proteins that further consolidate latency. the signals that trigger latency are not known at this point and are thus the subject of intense study. however, factors that repress viral transcription at the ltr promoter have been proposed to play a determinant role in latency mechanisms ( , ) . finally, the very specific nature of nucleolin binding indicates that its viral target must be less prone to mutations. this observation makes the viral g /nucleolin complex a very appealing target for the development of antiviral strategies that may afford a different mechanism of action, the possibility of targeting viral latency, and a lower probability of incurring drug resistance. in conclusion, we have shown that the specific binding of the cellular protein ncl to the ltr promoter regulates viral transcription. this result alone paves the way for the investigation of different regulation mechanisms of hiv- transcription/latency, which may lead to new possible targets for the design of specific inhibitors. 
g-quadruplex structures: in vivo evidence and function formation of parallel four-stranded complexes by guanine-rich motifs in dna and its implications for meiosis g-quadruplex nucleic acids and human disease a sodium-potassium switch in the formation of four-stranded g -dna neurodegenerative diseases: g-quadruplex poses quadruple threat g-quadruplexes and metal ions gene function correlates with potential for g dna formation in the human genome g-quadruplexes in promoters throughout the human genome the disruptive positions in human g-quadruplex motifs are less polymorphic and more conserved than their neutral counterparts g-quadruplex formation within the promoter of the kras proto-oncogene and its effect on transcription conserved elements with potential to form polymorphic g-quadruplex structures in the first intron of human genes evidence of genome-wide g dna-mediated gene expression in human cancer cells genome-wide analyses of recombination prone regions predict role of dna structural motif in recombination exploring mrna -utr g-quadruplexes: evidence of roles in both alternative polyadenylation and mrna shortening -utr rna g-quadruplexes: translation regulation and targeting the evolving world of protein-g-quadruplex recognition: a medicinal chemist's perspective the kras promoter responds to myc-associated zinc finger and poly(adp-ribose) polymerase proteins, which recognize a critical quadruplex-forming ga-element hras is silenced by two neighboring g-quadruplexes and activated by maz, a zinc-finger transcription factor with dna unfolding property c orf nucleotide repeat structures initiate molecular cascades of disease g-quadruplex structures contribute to the neuroprotective effects of angiogenin-induced trna fragments characterization of dna g-quadruplex species forming from c orf gc-expanded repeats associated with amyotrophic lateral sclerosis and frontotemporal lobar degeneration a g-rich element forms a g-quadruplex and regulates bace mrna 
alternative splicing the fragile x syndrome d(cgg)n nucleotide repeats form a stable tetrahelical structure human werner syndrome dna helicase unwinds tetrahelical structures of the fragile x syndrome repeat sequence d(cgg)n fancj helicase defective in fanconia anemia and breast cancer unwinds g-quadruplex dna to defend genomic stability genome-wide study predicts promoter-g dna motifs regulate selective functions in bacteria: radioresistance of d. radiodurans involves g dna-mediated regulation the genome-wide distribution of non-b dna motifs is shaped by operon structure and suggests the transcriptional importance of non-b dna structures in escherichia coli g-quadruplexes in viruses: function and potential therapeutic applications formation of a unique cluster of g-quadruplex structures in the hiv- nef coding region: implications for antiviral activity a dynamic g-quadruplex region regulates the hiv- long terminal repeat promoter topology of a dna g-quadruplex structure formed in the hiv- promoter: a potential target for anti-hiv drug development u region in the hiv- genome adopts a g-quadruplex structure in its rna and dna sequence anti-hiv- activity of the g-quadruplex ligand braco- g-quadruplexes regulate epstein-barr virus-encoded nuclear antigen mrna translation role for g-quadruplex rna binding by epstein-barr virus nuclear antigen in dna replication and metaphase chromosome attachment the sars-unique domain (sud) of sars coronavirus contains two macrodomains that bind g-quadruplexes production of acquired immunodeficiency syndrome-associated retrovirus in human and nonhuman cells transfected with an infectious molecular clone probability-based protein identification by searching sequence databases using mass spectrometry data a rapid micropreparation technique for extraction of dna-binding proteins from limiting numbers of mammalian cells proposal for a common nomenclature for sequence ions in mass spectra of peptides contributions of mass spectrometry to 
peptide and protein structure increased stability of nucleolin in proliferating cells by inhibition of its self-cleaving activity dna binding properties of a kda nucleolar protein localization of nucleolin binding sites on human and mouse pre-ribosomal rna identification and characterization of nucleolin as a c-myc g-quadruplex-binding protein the nucleolin targeting aptamer as destabilizes bcl- messenger rna in human breast cancer cells epithelial-mesenchymal transition in human gastric cancer cell lines induced by tnf-alpha-inducing protein of helicobacter pylori antiproliferative activity of g-rich oligonucleotides correlates with protein binding stress-dependent nucleolin mobilization mediated by p -nucleolin complex formation functions of the histone chaperone nucleolin in diseases two rna-binding domains determine the rna-binding specificity of nucleolin molecular basis of sequence-specific recognition of pre-ribosomal rna by nucleolin agro inhibits activation of nuclear factor-kappab (nf-kappab) by forming a complex with nf-kappab essential modulator (nemo) and nucleolin heterogeneous nuclear ribonucleoprotein k and nucleolin as transcriptional activators of the vascular endothelial growth factor promoter through interaction with secondary dna structures a cis-element with mixed g-quadruplex structure of npgpx promoter is essential for nucleolin-mediated transactivation on non-targeting sirna stress the c-terminus of nucleolin promotes the formation of the c-myc g-quadruplex and inhibits c-myc promoter activity targeting myc expression through g-quadruplexes structures, folding patterns, and functions of intramolecular dna g-quadruplexes found in eukaryotic promoter regions direct evidence for a g-quadruplex in a promoter region and its targeting with a small molecule to repress c-myc transcription midkine, a cytokine that inhibits hiv infection by binding to the cell surface expressed nucleolin the anti-hiv pentameric pseudopeptide hb- binds the c-terminal 
end of nucleolin and prevents anchorage of virus particles in the plasma membrane of target cells identification of v loop-binding proteins as potential receptors implicated in the binding of hiv particles to cd (+) cells nucleolin and the packaging signal, psi, promote the budding of human immunodeficiency virus type- (hiv- ) tandem immunoprecipitation approach to identify hiv- gag associated host factors specific changes in the posttranslational regulation of nucleolin in lymphocytes from patients infected with human immunodeficiency virus intracellular accumulation of cell cycle regulatory proteins and nucleolin re-localization are associated with pre-lethal ultrastructural lesions in circulating t lymphocytes: the hiv-induced cell cycle dysregulation revisited identification of nucleolin as a cellular receptor for human respiratory syncytial virus cell surface nucleolin facilitates enterovirus binding and infection host cell nucleolin is required to maintain the architecture of human cytomegalovirus replication compartments dynamic and nucleolin-dependent localization of human cytomegalovirus ul to the periphery of viral replication compartments and nucleoli nucleolin interacts with the dengue virus capsid protein and plays a role in formation of infectious virus particles nucleolin is important for epstein-barr virus nuclear antigen -mediated episome binding, maintenance, and transcription nucleolin interacts with the feline calicivirus untranslated region and the protease-polymerase ns and ns proteins, playing a role in virus replication nucleolin interacts with us protein of herpes simplex virus and is involved in its trafficking a major nucleolar protein, nucleolin, induces chromatin decondensation by binding to histone h nucleolin is a histone chaperone with fact-like activity and assists remodeling of nucleosomes formation of a complex between nucleolin and replication protein a after cell stress prevents initiation of dna replication regulation of dna 
replication after heat shock by replication protein a-nucleolin interactions novel checkpoint response to genotoxic stress mediated by nucleolin-replication protein a complex formation nucleolin inhibits g oligonucleotide unwinding by werner helicase an ap- binding site in the enhancer/core element of the hiv- promoter controls the ability of hiv- to establish latent infection establishment and molecular mechanisms of hiv- latency in t cells key: cord- -uhhtvdif authors: longhini, andrew p.; leblanc, regan m.; becette, owen; salguero, carolina; wunderlich, christoph h.; johnson, bruce a.; d'souza, victoria m.; kreutz, christoph; dayie, t. kwaku title: chemo-enzymatic synthesis of site-specific isotopically labeled nucleotides for use in nmr resonance assignment, dynamics and structural characterizations date: - - journal: nucleic acids res doi: . /nar/gkv sha: doc_id: cord_uid: uhhtvdif stable isotope labeling is central to nmr studies of nucleic acids. development of methods that incorporate labels at specific atomic positions within each nucleotide promises to expand the size range of rnas that can be studied by nmr. using recombinantly expressed enzymes and chemically synthesized ribose and nucleobase, we have developed an inexpensive, rapid chemo-enzymatic method to label atp and gtp site specifically and in high yields of up to %. we incorporated these nucleotides into rnas with sizes ranging from to nucleotides using in vitro transcription: a-site ( nt), the iron responsive elements ( nt), a fluoride riboswitch from bacillus anthracis ( nt), and a frame-shifting element from a human corona virus ( nt). finally, we showcase the improvement in spectral quality arising from reduced crowding and narrowed linewidths, and accurate analysis of nmr relaxation dispersion (cpmg) and trosy-based cest experiments to measure μs-ms time scale motions, and an improved noesy strategy for resonance assignment. 
applications of this selective labeling technology promise to reduce difficulties associated with chemical shift overlap and rapid signal decay that have made it challenging to study the structure and dynamics of large rnas beyond the nt median size found in the pdb. the tertiary architectures rnas adopt are crucial for modulating gene expression across all domains of life, making them important targets of structural and dynamics studies. for instance, for riboswitches, the presence or absence of specific ligands drives the folding of one of two or more mutually exclusive, regulatory states ( ) ( ) ( ) ( ) . in viral rna genomes, structured, untranslated regions commonly exercise direct control over viral gene expression ( , ) . in the ribosome, the ability to distinguish between cognate and near-cognate trnas is governed in part by the extrahelical flipping of adenines a and a ( ) . both the global architecture and the subtle motions of specific base and ribose moieties are thus demonstrably important and can profoundly modulate an rna's function ( ) . however, in spite of this importance, directly establishing how dynamics modulates the structure and function of rnas has been difficult because x-ray crystallography and nuclear magnetic resonance spectroscopy are plagued by distinct but equally challenging problems. in crystallography, motions can only be observed on ps-ns timescales and the strain imposed by crystal packing can obscure and distort structural data ( , ) . in contrast, nmr spectroscopy can probe dynamic fluctuations directly over a wide range of timescales. unfortunately, nmr suffers from both narrow chemical shift dispersion and rapid signal decay, exacerbated by direct one-bond and multi-bond spin-spin couplings. the former leads to spectral crowding and the spin-spin couplings can lead to decreased spectral resolution and inaccurate measurements of c relaxation rates such as longitudinal relaxation rates (r ), transverse relaxation rates (r ), and heteronuclear overhauser effect (hnoe) ( ) ( ) ( ) . furthermore, these problems become more pronounced as the size of the rna increases: the spectral quality deteriorates because of increased line broadening. addressing these problems requires the development of new technologies. in the past, spectral overlap has been addressed using heteronuclear multi-dimensional pulse sequences applied to uniformly c/ n labeled rnas ( ) ( ) ( ) . by spreading the poorly dispersed proton resonances over the better resolved carbon and/or nitrogen dimensions, it is possible to resolve overlapped proton peaks in small rnas. while these advances have greatly aided nmr structural studies of rnas with a median size of nt, they fail for rnas larger than nt. out of rna structures in the pdb (protein data bank), only seven rna structures with sizes > nt have been solved by nmr ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . of these seven, three rna structures of , and nt have been solved using mostly homonuclear two-dimensional ( d) noesy methods based on nucleotide-specific and fragmentation-based segmental h-labeling approaches ( , , ) . thus, current uniform labeling approaches, while valuable, are quite limiting ( , ) . also of great interest are the large couplings of adjacent c nuclei within the ribose and base ring systems, which cause several complications in rna relaxation measurements. the foremost concern is that uniform labeling introduces strong couplings that can render c r , hnoe and cpmg (carr-purcell-meiboom-gill) relaxation measurements inaccurate. these couplings also complicate and limit the range of applicability of cest (chemical exchange saturation transfer) and rotating-frame relaxation rate (r ) measurements and analyses while also decreasing the attainable resolution and sensitivity of nmr experiments ( , , ( ) ( ) ( ) ( ) ( ) ( ) ( ) . 
numerous robust spectroscopic solutions have been proposed in the past to circumvent these coupling problems ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . unwanted splittings can be removed using constant time (ct) evolution ( ) ( ) ( ) ( ) , adiabatic band selective decoupling ( ) ( ) ( ) , or a series of selective pulses. constant time evolution limits the acquisition time that can be used to obtain adequate resolution. improving resolution requires long constant-time delays that lead to significant signal loss for large rna molecules ( ) . additionally, obtaining accurate relaxation parameters is problematic for c-cpmg based relaxation dispersion rates for quantifying millisecond (ms) time-scale processes, as well as r and proton-carbon hnoe ( , ) important for quantifying ns-ps time-scale motions in rna ( , ) . several precautions are needed to obtain accurate r and r measurements ( , ( ) ( ) : provided r is derived from the initial slope of the relaxation decay curve, fairly accurate rates can be extracted for small rnas; for r experiments, distortions can arise from transfer between adjacent c atoms with similar chemical shifts via a hartmann-hahn mechanism and these need to be minimized ( , ( ) ( ) ; to suppress the echo-modulation caused by the large scalar couplings during the c relaxation delay, r can be measured instead of cpmg ( , ( ) ( ) ( ) . still, for nucleic acids, high-power spinlocks (> khz) are needed to study isolated spin pairs such as c found in adenine and c in both adenine and guanine. for low spin lock power levels (< khz), oscillations can be observed in the monoexponential decay of peak intensity, arising from residual scalar coupling interactions with neighbouring nuclei ( , ( ) ( ) . in addition, application of selective cross-polarization ( - ) using weak radio-frequency fields can effectively decouple homonuclear j-couplings. 
this elegant spectroscopic solution has been exploited to measure nitrogen r in proteins and both carbon and nitrogen r in uniformly labeled nucleic acids ( ) ( ) ( ) ( ) . while this scheme obviates the need for selective c isotopic enrichment, in uniformly labeled samples, the presence of large homonuclear scalar couplings again limits the range of applicability of these methods ( , ) . finally, building upon schemes for protein n and c cest measurements by kay and co-workers, zhang et al. developed a set of nucleic-acid-optimized d/ d c cest experiments that use various shaped pulses to refocus carbon-carbon scalar coupling and showed that accurate exchange parameters can be obtained for all cest profiles in uniformly labeled rna samples for purines and ribose carbons ( , ( ) ( ) ( ) . nonetheless, they and others acknowledged the following limitations for both cest and r in uniformly labeled rna and protein samples ( ) ( ) ( ) ( ) ( ) ( ) . first, the lowest spinlock or saturating b field that can be used is limited (∼ × the scalar coupling) to ∼ hz for ade c , ∼ hz for purine c , ∼ hz for c '. for pyrimidine ring carbons with large carbon-carbon couplings of ∼ hz, it would require ∼ hz spinlock fields for c and c , clearly intractable with uniformly labeled samples. second, even though c- c couplings do not introduce errors in extracted chemical shifts for purines, these homonuclear couplings decrease the resolution. ultimately, the coupling effects need to be considered in the cest data analyses for couplings greater than hz. otherwise, exchange parameters (k ex ) are overestimated and population ratios are underestimated ( ) . thus, uniformly labeled samples do limit the applicability of both cest and r to biological problems. these spectroscopic tools notwithstanding, an alternative, straightforward and effective solution for overcoming the problem of spectral crowding and j-coupling would complement existing methodologies. 
a promising method is to synthesize site-specific isotopically labeled nucleotides ( , ( ) ( ) as we recently demonstrated with our chemoenzymatic production of pyrimidine nucleotides ( , ) . here, we extend that approach to improving the synthesis of purine nucleotides. our synthesis offers improvements in speed, streamlined reaction conditions, and higher yields. by combining the newly developed purine nucleotides with our previous pyrimidine nucleotides we present an improvement to the traditional noesy structural assignment protocol. additionally, we show that relaxation parameters can be measured using cpmg, r , and cest for both small and large rnas. furthermore, we demonstrate substantial improvements in signal-to-noise and line width for transverse relaxation optimized spectroscopy (trosy) experiments compared to the traditional heteronuclear single quantum coherence (hsqc) experiments for isolated two-spin systems approximated by our purine and pyrimidine labeling schemes ( ) ( ) ( ) ( ) .
reagents and solvents were purchased from sigma-aldrich. - c adenine and - c guanine were either purchased from cambridge isotope laboratories or synthesized as described in the supplementary materials. similarly, preparative chemical synthesis of labeled adenine and guanine, chemo-enzymatic nucleotide synthesis, rna preparation, and nmr experiments are detailed in the supplementary materials. nmr spectral processing was done in topspin (bruker biospin) and nvfx (one moon scientific). peak intensities were selected using in-house software by david fushman. cpmg data were fit to a two-state exchange model using the full bloch-mcconnell matrix. the time-dependent evolution of magnetization during the cpmg period was solved numerically, and the model was fit by non-linear least squares using in-house matlab software. 
errors in fits were calculated using the jacobian or monte carlo simulations ( ) , and the larger of the two errors was reported for cpmg and cest relaxation dispersion analysis. nmrviewj was used for peak assignments. hydrogen and carbon chemical shifts were predicted based on the secondary structure of the input rna molecule. expected cross peaks for different experiment types and labeling patterns were then generated using the rna peak generator tool. expected cross-peaks were generated for hsqc spectra based on the covalent structure and for noesy spectra using inter-atomic distances typically observed in rna helices. for bacterial a-site rna, for which there are no deposited chemical shifts in the bmrb database, the rna peak generator accurately predicted of the expected c -h resonances, of the c -h , and of the c / -h / resonances within . ppm of their actual values in the hsqc spectra. since the noesy peak generator was used in a mode where it only predicts peaks in helical regions, peaks in bulge and tetraloop regions of the a-site rna were not predicted. further assignment of the noesy spectra utilized the rna peak slider tool within nmrviewj. this links the predicted peaks into a network connected by atoms shared between the different peaks. peaks are then interactively positioned in a way that utilizes the network of peaks typically connected within the noesy 'walk'. overall, the combined tools of nmrviewj allowed for relatively rapid assignment of the resonances in the specifically labeled a-site rna model system and provide a powerful tool that, when combined with selective labeling, can streamline resonance assignment for rna more than previously reported ( ) . we have developed a protocol for synthesizing site-specific isotopically labeled purine nucleotides with varied ribose and base patterns (supplementary figure s ). 
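The error-reporting rule stated in the methods (compute a Jacobian-based error and a Monte Carlo error, then report the larger) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: a single-exponential decay fitted by log-linear least squares stands in for the full Bloch-McConnell fit, the covariance-based standard error of the slope stands in for the Jacobian estimate, and the function names and numerical values are hypothetical rather than taken from the in-house software.

```python
import math
import random
import statistics


def fit_decay_rate(times, intensities):
    """Fit ln(I) = ln(I0) - R*t by linear least squares.

    Returns (R, I0, standard error of R from the fit covariance).
    """
    n = len(times)
    y = [math.log(v) for v in intensities]
    t_mean = sum(times) / n
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - y_mean) for t, v in zip(times, y))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residual_ss = sum((v - (intercept + slope * t)) ** 2 for t, v in zip(times, y))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return -slope, math.exp(intercept), slope_se


def monte_carlo_error(times, rate, i0, noise_sd, n_trials=500, seed=0):
    """Refit noisy copies of the best-fit curve; return the spread of the rates."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        synthetic = [i0 * math.exp(-rate * t) + rng.gauss(0.0, noise_sd) for t in times]
        rates.append(fit_decay_rate(times, synthetic)[0])
    return statistics.stdev(rates)


def reported_error(times, intensities, noise_sd):
    """Return the fitted rate and the larger of the two error estimates."""
    rate, i0, analytic_se = fit_decay_rate(times, intensities)
    mc_se = monte_carlo_error(times, rate, i0, noise_sd)
    return rate, max(analytic_se, mc_se)


# Hypothetical decay curve: true rate 20 s^-1, 1% noise on a unit-intensity peak.
demo_rng = random.Random(1)
times = [0.0, 0.005, 0.01, 0.015, 0.02, 0.03, 0.04]
noise_sd = 0.01
observed = [math.exp(-20.0 * t) + demo_rng.gauss(0.0, noise_sd) for t in times]
rate, error = reported_error(times, observed, noise_sd)
print(f"R = {rate:.2f} +/- {error:.2f} s^-1")
```

Taking the maximum of the two estimates is a conservative choice: the covariance-based value can be too small when residuals happen to be small, while the Monte Carlo value depends on the assumed noise level. The sketch assumes the noise is small enough that all intensities stay positive.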
in principle any combination of labeled base and ribose can be utilized, but here we demonstrate the use of both , - c and , - c purine and , - c and , - c pyrimidine nucleotides for assignment, structural, and dynamics measurements. these labeling patterns both remove the strong c- c j-coupling found in uniformly labeled nucleotides and simultaneously reduce spectral crowding. the removal of c j-coupling creates isolated spin systems in both the base and ribose, and enables the use of trosy pulse sequences for studying large rnas, and these trosy modules can be readily incorporated into cpmg, r and cest pulse sequences for measuring s-ms timescale dynamics ( ) ( ) ( ) ( ) . in addition to the removal of strong j-couplings, the new labels help to greatly reduce spectral crowding and enable the creation of a new protocol for the noesy assignment of rnas. as a proof of this assignment concept, this protocol is demonstrated on the a-site rna. we have created an improved method for the synthesis of site-selective isotopically labeled atp and gtp with increased yields and speed of synthesis. both purine reactions proceed to completion without the need to purify intermediate species. final yields of > % and > % respectively for atp and gtp were achieved relative to starting input adenine or guanine. both yields are better than previously reported ( ) ( ) ( ) ( ) ( ) ( ) . in addition to improved yields, atp synthesis is complete in - h while gtp synthesis is complete in - h. previously, atp synthesis was reported to take - h and gtp synthesis - h ( - ). these improvements allow reactions to be complete in a single day. additionally we have taken advantage of the ability of creatine kinase to act on a variety of substrates to convert ndps to ntps and adapted the use of datp as the energy source in the energy regeneration system ( ) ( ) ( ) . 
the use of datp is ideal since the lack of a -oh of the ribose in datp prevents its interaction with the boronate column used to purify atp and gtp. this offers a more robust synthesis, free of contaminants, and does not dilute the synthesized labels with unlabeled atp. the effectiveness of these nucleotides is demonstrated by their incorporation into a number of interesting rnas. the production of gtp was achieved in a two-step, one pot reaction. specifically-labeled ribose and guanine were combined in the presence of phosphoribosyl pyrophosphate synthetase (prpps), ribokinase (rk), and xanthine-guanine phosphoribosyl transferase (xgprt), with a datp regeneration system. the datp regeneration system was composed of myokinase and creatine kinase, with creatine phosphate acting as the high energy phosphate donor. the formation of gmp was monitored by fplc and nmr (supplementary figure s ) . however, due to the low solubility of guanine ( . mm) fplc was unsuitable to track its disappearance, making it difficult to monitor the progression of the reaction by fplc alone. by nmr spectroscopy, in contrast, the chemical shift difference between the labeled c- position of unreacted ribose and newly formed gmp was used to determine the completion of the first step of the reaction (supplementary figure s a) . when the majority of guanine was converted to gmp, in approximately - h, guanylate kinase was added to the reaction. guanylate kinase phosphorylates gmp to gdp. gdp is phosphorylated to gtp by creatine kinase which acts promiscuously to convert ndps to ntps. this was unexpected, as ck is said to be highly specific ( , ) . gmp is completely converted to gtp in an additional h. we confirmed by fplc that conversion was complete and further validated this observation by p nmr (supplementary figure s b) . the production of atp and the progression of the reaction were monitored as reported for gtp. 
a notable difference is that adenine's greater solubility ( mm) allowed the use of fplc to monitor the disappearance of uncoupled base and the formation of product for all steps of the reaction. labeled adenine and ribose were combined in the presence of prpps, rk, adenine phosphoribosyl transferase (aprt), and the datp regeneration system. the datp regeneration system acts on both amp and adp and takes the reaction to completion in ∼ h. the reaction is similarly monitored by fplc and nmr (supplementary figure s c&d) . when studying large rnas (> nts) by nmr, slow molecular tumbling leads to broadened linewidths and losses in signal intensity. careful selection of appropriate nmr experiments to address these losses is necessary for successful measurement of many nmr parameters. trosy experiments take advantage of the interference between the dipolar coupling and chemical shift anisotropy (csa) components of t relaxation ( ) . for the base c position of adenine and guanine, these contributions effectively cancel at ∼ mhz field strength leading to reduction in the r relaxation rate ( , ) . thus, rnas synthesized with our selective site-specifically labeled ntps should benefit from trosy based nmr experiments that reduce the problems of crowding, fast signal decay, low resolution, and decreased s/n ratios ( , , , ( ) ( ) ( ) ( ) . the benefits of trosy increase with the size of the rna. for small rnas such as ire ( nt) we saw clear, yet modest, improvements for the base region. these improvements in signal intensities ranged from . - to . -fold (average: . ± . ) when comparing trosy with conventional hsqc sequences (figures a and a) . for the larger hcv sars rna ( nt), the signal improvements are larger and ranged from . - to . -fold (average . ± . ) (figures b and b) . for the c and c peaks the improvements were more modest since these positions have lower csa values (supplementary figure s ). 
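The fold improvements quoted above (a range plus an average ± spread) are per-peak intensity ratios between the TROSY and HSQC spectra. A minimal Python sketch of that arithmetic follows. The peak intensities are hypothetical, the function names are illustrative, and the use of the sample standard deviation for the spread is an assumption, since the text does not say which statistic was reported.

```python
import statistics


def fold_improvements(trosy_intensities, hsqc_intensities):
    """Per-peak ratio of TROSY to HSQC signal intensity."""
    return [t / h for t, h in zip(trosy_intensities, hsqc_intensities)]


def summarize(folds):
    """Return (minimum, maximum, mean, sample standard deviation)."""
    return min(folds), max(folds), statistics.mean(folds), statistics.stdev(folds)


# Hypothetical intensities (arbitrary units) for four matched peaks.
trosy = [150.0, 220.0, 180.0, 300.0]
hsqc = [100.0, 110.0, 120.0, 120.0]

folds = fold_improvements(trosy, hsqc)
low, high, mean, spread = summarize(folds)
print(f"{low:.1f}- to {high:.1f}-fold (average {mean:.2f} ± {spread:.2f})")
```

Ratios are taken peak by peak, so both spectra must be recorded with matched parameters and the same peak list for the comparison to be meaningful.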
our labeled c approximates an isolated two spin system necessary for these gains in signal. thus, the large improvements seen for these positions when using our site-specifically labeled nucleotides can be harnessed for assignment, structural, and dynamics measurements ( ) . the above observations led us to run a c-trosy version of the n-trosy experiment of kay et al. ( ) . we decided that to validate this trosy pulse sequence it would be ap- figure s ) . while the fits of both the hsqc and cest data sets gave similar exchange parameters, comparing the χ of the fits showed significant improvements for the trosy cest experiment ( . - . ) when both experiments were run using the same parameters and experiment time. these measurements were made on the c and c positions of a , - c- , - n- - h utp labeled sample. what then are some of the benefits of a selectively labeled sample when uniformly labeled samples have been shown to be adequate? eliminating the strong coupling between ribose carbons allowed a straightforward analysis of the cest data without the need to account and correct for j-coupling ( , ) . obtaining cest data for c pyrimidine is particularly problematic in uniformly labeled samples because of the complications mentioned above in the introduction: the field strengths of > hz that would be needed preclude their use. with our selectively labeled samples, we were able to readily obtain excellent cest profiles for both purines and pyrimidines. cpmg relaxation dispersion measurements facilitate the extraction of information about exchange phenomena occurring on the s-ms timescale ( ) ( ) ( ) ( ) ( ) ( ) ( ) . previously, others have used similar approaches to run cpmg experiments on rnas smaller than nucleotides with specifically labeled pyrimidine bases ( , ) . 
here, we present data that illustrate the effect of creating isolated, labeled c and c positions in our nucleotides, and show that measurements of cpmg parameters are readily accessible without the problem of j-coupling-induced oscillations ( , ) . we have transcribed a nt viral rna with a , - c labeling pattern as a proof of concept. the data indicate that while a majority of the nucleotides within the rna do not experience exchange on the ms time-scale, a few residues sample a lowly populated state. without data being fit at multiple static magnetic field strengths, the only meaningful parameter that can be extracted is a k ex value ( figure a ). the exchange rates extracted from the cpmg experiments on the viral rna match well with those from cest experiments (unpublished). even though similar information, and perhaps more, can be derived from r data, we find that cpmg is straightforward to set up and analyse compared to r experiments. thus having labeled rna that facilitates cpmg measurements is important for the field. using in-house matlab scripts, cpmg data were fit to a two-state exchange model using the bloch-mcconnell matrix as previously described by kay et al. ( ) .
measurements were made on the c position of , - c labeled sample. the a position that these curves belong to has been implicated in the discrimination of cognate and near-cognate trnas in the ribosome.
site-selective labels allow us to prepare isolated two spin systems without the carbon-carbon or carbon-nitrogen scalar couplings. in the past, such scalar couplings have hindered the interpretation of relaxation dispersion data ( , ) . using the bacterial a-site rna as a model system, we were able to capture motions on the microsecond timescale using cpmg experiments to monitor exchange of the ribose c residues. it is widely accepted that motions in residues a and a are involved in the discrimination between cognate and near-cognate trnas ( ) ( ) ( ) ( ) ( ) . 
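The two-state Bloch-McConnell treatment of CPMG data mentioned above can be sketched numerically. The Python example below is not the authors' MATLAB code. It is a minimal illustration under stated assumptions: an isolated spin with no scalar coupling, equal intrinsic relaxation rate `r20` in both states, ideal 180° pulses modelled as complex conjugation of the transverse magnetization, and detection of the ground-state signal. All function names, parameter names and numerical values are hypothetical.

```python
import math


def mat_mul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_conj(a):
    """Element-wise complex conjugate of a 2x2 matrix."""
    return [[complex(x).conjugate() for x in row] for row in a]


def mat_exp(a, t):
    """exp(a * t) for a 2x2 matrix: scaled Taylor series, then repeated squaring."""
    norm = max(abs(x) for row in a for x in row) * t
    squarings = max(0, math.ceil(math.log2(norm)) + 4) if norm > 0 else 0
    step = [[x * t / 2 ** squarings for x in row] for row in a]
    identity = [[1 + 0j, 0j], [0j, 1 + 0j]]
    result, term = identity, identity
    for k in range(1, 20):
        term = [[x / k for x in row] for row in mat_mul(term, step)]
        result = [[result[i][j] + term[i][j] for j in range(2)] for i in range(2)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def r2_eff(nu_cpmg, kex, pb, delta_omega, r20, t_relax):
    """Effective transverse relaxation rate (s^-1) at one CPMG frequency (Hz).

    kex: exchange rate (s^-1); pb: minor-state population;
    delta_omega: shift difference (rad/s); t_relax: relaxation delay (s).
    """
    pa = 1.0 - pb
    k_ab, k_ba = pb * kex, pa * kex
    # Transverse magnetization of states A and B: relaxation, exchange, precession.
    evolution = [[-r20 - k_ab, k_ba],
                 [k_ab, -r20 - k_ba + 1j * delta_omega]]
    tau = 1.0 / (4.0 * nu_cpmg)                      # half the 180-degree pulse spacing
    e = mat_exp(evolution, tau)
    e_conj = mat_conj(e)
    # Two consecutive tau-180-tau echoes: M -> E E* E* E M.
    two_echoes = mat_mul(mat_mul(e, e_conj), mat_mul(e_conj, e))
    n_blocks = max(1, round(t_relax * nu_cpmg))
    m = [pa + 0j, pb + 0j]                           # start from equilibrium populations
    for _ in range(n_blocks):
        m = [two_echoes[0][0] * m[0] + two_echoes[0][1] * m[1],
             two_echoes[1][0] * m[0] + two_echoes[1][1] * m[1]]
    elapsed = n_blocks / nu_cpmg
    return -math.log(abs(m[0]) / pa) / elapsed


# Hypothetical parameters: kex = 1000 s^-1, 5% minor state, 200 Hz shift difference.
for nu in (50.0, 200.0, 1000.0):
    value = r2_eff(nu, kex=1000.0, pb=0.05, delta_omega=2 * math.pi * 200.0,
                   r20=10.0, t_relax=0.04)
    print(f"nu_cpmg = {nu:6.0f} Hz  R2,eff = {value:.2f} s^-1")
```

A fit would wrap `r2_eff` in a least-squares minimizer over `kex`, `pb` and `delta_omega`. With data at a single static field these parameters are strongly correlated, which is consistent with the statement above that only k ex can be extracted meaningfully in that case.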
most notably, a , a residue that flips in and out of the bulge region of a-site showed characteristic dispersion profiles ( figure b ). the extracted k ex and p b values of ± s − and . ± . % match well with the previously reported values of s − and . % determined by relaxation dispersion measurements on the c positions of the ribose moieties ( ) . thus, our labels can be used to readily and straightforwardly capture lowly populated states in rna. the relatively narrow spectral width over which base and sugar carbons and protons resonate is a major limitation of rna nmr that must be overcome ( ) . overlap of signals is only partially alleviated by d and d nmr experiments in samples in which all nucleotides are uniformly c- and n-labeled. we reasoned that what would be critical for decluttering spectra to manageable levels for large rnas is not only the ability to choose which of the four nucleotides to label, but also which of the atomic sites to isotopically enrich. to demonstrate the power of this approach, we have examined rnas ranging in size from to nucleotides in length. for large rnas transcribed with only , - c atp, the resonances that belong to the adenine c can be identified rapidly when compared to a sample that has all four nucleotides fully-labeled (not shown). while it is possible to achieve a similar result using a fully labeled atp only sample, one bond c- c and c- n couplings quickly degrade the quality of the spectrum. with a view to designing a new noesy assignment protocol, we synthesized rna samples that maximize the information content of their spectra while simultaneously alleviating spectral overlap. the classic approach to assign resonances in a helical stretch of an rna employs a noesy walk methodology ( , ) . protons close in space (< Å) can produce cross peaks in a noesy spectrum indicative of a through-space transfer of longitudinal magnetization between the adjacent nuclei. 
for nucleotides in a helix, the protons attached to the c /c of the base and the c /c of the sugar fulfill this distance requirement. by labeling all nucleotides at the c and c /c , the base and ribose of adjacent nucleotides can be connected. however, as the size of the rnas increases, spectral crowding becomes especially pronounced in the sugar resonances and may lead to incorrect peak assignments. in the past the solution to this problem might have been to remove these resonances by transcribing the rna with unlabeled cytosine. while the spectra would then be simplified, the noesy walk is broken in any helical stretches that contain cytosine. here we propose an alternative approach. instead of transcribing the rna with unlabeled cytosine, a different labeling pattern such as , - c could be used. in this way, the noesy walk is preserved while removing the overlapping c resonances. thus, by combining our previous work on pyrimidine synthesis with our current purine synthesis, we can make rnas that provide labeling patterns that enable an important advance in noesy assignment strategies ( , ) . for the conventional uniformly labeled samples, the c and c resonances are both extremely crowded as discussed above. in a traditional noesy walk all nucleotides or various permutations are fully-labeled. noe crosspeaks between protons attached to the c and c and the c /c of the same and previous nucleotides are observed for helical regions. as we have illustrated, spectral crowding can severely hinder this assignment process. however, by labeling the base of c /c of each nucleotide and alternating the label on the ribose between c and c a sample is created that not only distinguishes the purines from the pyrimidines but also the a-u and the g-c pairs. we first made nucleotide specific labeled samples, and from the overlaid spectra, we could immediately tell that c/u and g/a showed more spectral overlap in their sugar resonances. 
thus it was necessary to label c/g on their c carbons and u/a on their c carbons. as a proof of concept we have labeled the bacterial a-site rna with , - c- , - n ctp, , - c- , - n utp, , - c gtp, and , - c atp ( figure ). by combining this alternative labeling strategy with noesy experiments that allow for filtering/editing of h crosspeaks based on the attached carbons ( c versus c), we can create a unique and powerful system to assign resonances without ambiguity ( ) ( ) ( ) . for ambiguous or overlapped cross-peaks, we utilized d c-noesy-hsqc experiments. this alternating ribose pattern allowed us to unambiguously assign helical regions of rna. in future work, we will streamline this methodology for use in larger rnas. the resulting assignment matched those previously determined ( ) . in situations where there is significant overlap in the base region, samples in which certain bases are unlabeled or even deuterated can be made allowing for the assignment bottleneck to be quickly circumvented. this work extends our previous synthesis of pyrimidine ( , ) to purine nucleotides. we have shown that the ability to easily synthesize a variety of purine and pyrimidine nucleotides facilitates the study of large rnas. these nucleotides are suitable for use in three key aspects of rna nmr structural biology: assignment, structural and dynamics measurements. the first advantage of these new site-specific labels is the potential for new assignment schemes. we have coupled alternate labeling of either c or c labeled ribose to c labeled purine bases. these combinations have allowed us to develop a new noesy assignment strategy that benefits from reduced spectral crowding. this new strategy takes advantage of the large proton chemical shift differences between the c and c ribose carbons. by using an alternating c and c pattern with labeled bases, the noesy spectrum is greatly simplified without compromising the information content present. 
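The sequential connectivities exploited in a NOESY walk through a helix can be enumerated programmatically. The Python sketch below lists the expected intra-residue and sequential cross peaks for one strand. It is an illustration only, not the NMRViewJ peak generator. The atom names (H1' on the ribose, H8 on purines, H6 on pyrimidines) are the conventional ones for this walk and are an assumption here, because the numerals are missing from the text above; the sequence and function names are hypothetical.

```python
# Aromatic proton used for the walk: H8 for purines, H6 for pyrimidines (assumed).
BASE_PROTON = {"A": "H8", "G": "H8", "C": "H6", "U": "H6"}


def noesy_walk(sequence, first_residue=1, sugar_proton="H1'"):
    """List expected (sugar proton, base proton) cross peaks along one helical strand."""
    peaks = []
    for offset, nucleotide in enumerate(sequence):
        number = first_residue + offset
        sugar = f"{nucleotide}{number}.{sugar_proton}"
        # Intra-residue peak: sugar proton to the base proton of the same nucleotide.
        peaks.append((sugar, f"{nucleotide}{number}.{BASE_PROTON[nucleotide]}"))
        if offset + 1 < len(sequence):
            following = sequence[offset + 1]
            # Sequential peak: same sugar proton to the base proton of residue n+1.
            peaks.append((sugar, f"{following}{number + 1}.{BASE_PROTON[following]}"))
    return peaks


def is_connected_walk(peaks):
    """True when every peak shares one atom with the peak that follows it."""
    return all(set(a) & set(b) for a, b in zip(peaks, peaks[1:]))


# Hypothetical four-nucleotide helical stretch.
walk = noesy_walk("GCAU")
for sugar_atom, base_atom in walk:
    print(f"{sugar_atom:8s} <-> {base_atom}")
print("connected:", is_connected_walk(walk))
```

Because consecutive peaks always share either the sugar proton or the base proton, removing the label from one nucleotide type breaks the chain wherever that nucleotide occurs. This is the reason for alternating the ribose label between nucleotide types rather than leaving one type unlabeled.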
since all nucleotides are labeled, a complete noesy walk is possible in helical regions. additionally if the purines and pyrimidines labeled with c and c enrichment are reversed, orthogonal data is generated that can confirm the previous assignment. the second advantage of these new labels is that the removal of the strong c j-coupling leads to substantial improvements in signal intensity in the protonated base c and c positions. additionally, these isolated spin pairs have facilitated the measurement of s-ms dynamics using cpmg and cest pulse sequences without the complications of large carbon-carbon couplings. finally, with these isolated 'two-spin' labels, not only do these couplings not need to be explicitly taken into account in the data analysis of cest profiles, as was required in previous studies using uniformly labeled rna or protein ( , ) , but more useful sites such as pyrimidine c and c sites can also be probed. it is important to note that other dispersion experiments such as r will also benefit from using rnas transcribed with isolated spin systems.
figure . (a) noesy walk of the bacterial a-site rna using alternatively labeled nucleotides. starting at the h of c , the connectivity from the sugar h /h to c /c of the n+ base enables sequential assignment all the way to the h of g . this allowed the consecutive assignment of residues present in the helical environment. connectivities follow the placed arrows and move sequentially from cyan to magenta to yellow to black. (b) pymol representation of the noesy walk using the same coloring scheme as used in the noesy walk (pdb: a m ( )). (c) predicted versus actual chemical shift values for the assigned residues as determined by nmrviewj (one moon scientific). the offset from the central line represents how far the assigned resonance is from its predicted value. blue circles represent resonances that are less than . ppm from predicted values, red circles have predicted ppm more than . ppm away from observed chemical shift.
the improvements we see from trosy based pulse sequences scale with the size of the rna. a price to pay for not using spectroscopic tools to minimize the c- c coupling problem is that the number of probe sites is now limited to the labeled sites. nonetheless, the approach remains useful because our method allows for very rapid accumulation of chemical shifts, a set of parameters that are easily and accurately measured and available at very early stages of nmr data analyses. thus, by measuring various chemical shifts (h /c , h /c , h ,h /c , h /c , c , h /c , h /c , h /c , n , n , n , n ), we think the availability of such parameters will facilitate chemical shift based structure calculations of rna, especially for constructing structural models for transiently and sparsely populated rna states as has been done, so far, only for proteins ( ) . we, therefore, anticipate that as the size of the rnas under investigation becomes greater than nucleotides the combined use of these selective labels with trosy- and hmqc-based pulse elements will be critical for advancing nmr for the study of the structure and dynamics of a large number of new and interesting rnas. supplementary data are available at nar online. 
key: cord- - pq dkl authors: imbeaud, sandrine; graudens, esther; boulanger, virginie; barlet, xavier; zaborski, patrick; eveno, eric; mueller, odilo; schroeder, andreas; auffray, charles title: towards standardization of rna quality assessment using user-independent classifiers of microcapillary electrophoresis traces date: - - journal: nucleic acids res doi: . /nar/gni sha: doc_id: cord_uid: pq dkl while it is universally accepted that intact rna constitutes the best representation of the steady-state of transcription, there is no gold standard to define rna quality prior to gene expression analysis. in this report, we evaluated the reliability of conventional methods for rna quality assessment including uv spectroscopy and s: s area ratios, and demonstrated their inconsistency. we then used two new freely available classifiers, the degradometer and rin systems, to produce user-independent rna quality metrics, based on analysis of microcapillary electrophoresis traces. both provided highly informative and valuable data and the results were found highly correlated, while the rin system gave more reliable data. the relevance of the rna quality metrics for assessment of gene expression differences was tested by q-pcr, revealing a significant decline of the relative expression of genes in rna samples of disparate quality, while samples of similar, even poor integrity were found highly comparable. 
we discuss the consequences of these observations to minimize artifactual detection of false positive and negative differential expression due to rna integrity differences, and propose a scheme for the development of a standard operational procedure, with optional registration of rna integrity metrics in public repositories of gene expression data. purity and integrity of rna are critical elements for the overall success of rna-based analyses, including gene expression profiling methods to assess the expression levels of thousands of genes in a single assay. starting with low quality rna may strongly compromise the results of downstream applications which are often labor-intensive, time-consuming and highly expensive. however, in spite of the need for standardization of rna sample quality control, presently there is no real consensus on the best classification criteria. conventional methods are often not sensitive enough, not specific for single-stranded rna, and susceptible to interferences from contaminants present in the sample. for instance, when using a spectrophotometer, a ratio of absorbances at and nm (a :a ) greater than . is usually considered an acceptable indicator of rna purity ( , ) . however, the a measurement can be compromised by the presence of genomic dna leading to over-estimation of the actual rna concentration. on the other hand, the a measurement will estimate the presence of protein but provide no hint on possible residual organic contaminants, considered at nm ( ) ( ) ( ) . pure rna will have a :a equal to a :a and > . ( ) . a second check involves electrophoresis analysis, routinely performed using agarose gel electrophoresis, with rna either stained with ethidium bromide (etbr) ( ) ( ) ( ) ( ) , or the more sensitive sybr green dye ( ) . the proportion of the ribosomal bands ( s: s) has conventionally been viewed as the primary indicator of rna integrity, with a ratio of . considered to be typical of 'high quality' intact rna ( ) . 
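The spectrophotometric purity check described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes the conventional 260, 280 and 230 nm readings (the wavelengths are not legible in the text above), and the cutoff `min_ratio` is a hypothetical parameter that the caller must supply.

```python
from dataclasses import dataclass


@dataclass
class PurityReport:
    a260_a280: float  # drops when protein is present
    a260_a230: float  # drops when organic contaminants are present
    passes: bool      # True when both ratios reach the caller's cutoff


def purity_ratios(a230: float, a260: float, a280: float,
                  min_ratio: float) -> PurityReport:
    """Compute both absorbance ratios and compare them to one cutoff.

    min_ratio is a hypothetical, caller-supplied threshold; no value
    is assumed here.
    """
    if a230 <= 0 or a280 <= 0:
        raise ValueError("A230 and A280 must be positive")
    r280 = a260 / a280
    r230 = a260 / a230
    return PurityReport(r280, r230, r280 >= min_ratio and r230 >= min_ratio)
```

As the text notes, a passing result of this kind speaks only to purity; it says nothing about whether the RNA is intact.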
however, these methods are highly sample-consuming, using . - mg total rna and often not sensitive enough to detect slight rna degradation. today, microfluidic capillary electrophoresis with the agilent bioanalyzer (agilent technologies, usa) has become widely used, particularly in the gene expression profiling platforms ( , ) . it requires only a very small amount of rna sample (as low as pg), the use of a size standard during electrophoresis allows the estimation of sizes of rna bands and the measurement appears relatively unaffected by contaminants. integrity of the rna may be assessed by visualization of the s and s ribosomal rna bands ( figure a and b); an elevated threshold baseline and a decreased s: s ratio, both are indicative of degradation. a broad band shows dna contamination ( figure c ). as it is apparent from a review of the literature, the standard of a . rrna ratio is difficult to meet, especially for rna derived from clinical samples, and it now appears that the relationship between the rrna profile and mrna integrity is somewhat unclear ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) .
on the one hand, this may reflect unspecific damage to the rna, including sample mishandling, postmortem degradation, massive apoptosis or necrosis, but it can also reflect specific regulatory processes or external factors within the living cells. altogether, it appears that total rna with lower rrna ratios is not necessarily of poor quality, especially if no degradation products can be observed in the electrophoretic trace ( figure d ). for all these reasons, the development of a reliable, fully integrated and automated system appropriate for numeric evaluation of rna integrity is highly desirable. standardized rna quality assessment would allow a more reliable comparison of experiments and facilitate exchange of biological information within the scientific community. with that prospect in mind, and with the aim of anticipating future standards by pre-normative research, we identified and tested two software packages recently developed to gauge the integrity of rna samples with a user-independent strategy: one open source, the degradometer software for calculation of the degradation factor and 'true' s: s ratio based on peak heights ( ) and the freely available rin algorithm of the agilent expert software, based on computation of a 'rna integrity number' (rin) ( ) . both tools were developed separately to extract information about rna integrity from microcapillary electrophoretic traces and produce user-independent metrics. using these tools, we assessed the purity and integrity of rna samples derived from different human adult tissues and cell lines, many of them representing tumors. those results were compared with conventional rna quality measurement approaches as well as with highly expert human interpretation. we evaluated the simplicity for users and examined the potential, accuracy and efficiency of each method to contribute to standardization of rna integrity assessment upstream of biological assays.
these procedures were further validated by real-time rt-pcr quantitation of the expression levels of three housekeeping genes, using the same rna samples, at different levels of degradation. total rna was prepared from human cell lines (especially from the atcc bio-resource center, n = ) and tissue samples (clinical samples, n = ) from different human adult tissue types, i.e. blood, brain, breast, colon, epithelium, kidney, lymphoma, lung, liver, muscle, prostate, rectum and thyroid. rna purification was performed by cesium chloride ultracentrifugation according to chomczynski and sacchi ( ) , by phenol-based extraction methods (trizol reagent, invitrogen, usa), or silica gel-based purification methods (rneasy mini kit, qiagen, germany; strataprep kit, stratagene, usa or sv rna isolation kit, promega, usa) according to the manufacturer's instructions with some modifications. material was maintained at − °c with minimal handling. rna extraction was carried out in an rnase-free environment (see supplementary table online) . the commercially available rna samples were the 'universal human reference' (n = ) distributed by stratagene (usa), and human brain (n = ) and muscle (n = ) rnas supplied by clontech (usa). once extracted, rna concentration and purity were first verified by uv measurement, using the ultrospec pro (amersham biosciences, usa) and mm cuvettes. the absorbance (a) spectra were measured from to nm. a , a and a were determined. a :a and a :a ratios were calculated. for microcapillary electrophoresis measurements, the agilent bioanalyzer (agilent technologies, usa) was used in conjunction with the rna nano and the rna pico labchip kits. in total, assays were run in accordance with the manufacturer's instructions (see supplementary notes online).
to evaluate the reliability of the classifier systems described in this study, replicate runs were done on a set of rna samples loaded on different chips, resulting in (n = ), (n = ), (n = ) and (n = ) data points per sample. human rna integrity categorization rna integrity checking was performed by expert operators who classified each total rna sample within a predefined discrete category from to , examining the integrity of the rna from electropherograms (see supplementary table online). a low number indicates high integrity. reference criteria parameters include ribosomal peaks definition, baseline flatness, existence of additional or noise peaks between ribosomal peaks, low molecular weight species contamination and genomic dna presence suspicion. a smearing of either s and s peaks, or a decrease in their intensity ratio indicate degradation of the rna sample and results in the classification into the higher categories. to evaluate the robustness of this human interpretation, five highly experienced operators, trained in these cataloging steps, separately classified a subset of samples from breast cancers. it included samples with varying levels of integrity: intact rna ( %), low quality samples ( %) and a wide range of degradation ( %). bioanalyzer electrophoretic data were exported in the degradometer software folder (.cld format). for comparison of samples, the original data were re-scaled by the classifier system, first along the time-axis to compensate for differences in migration time, then along the fluorescence intensity-axis to compensate for variation in total rna amount. as a result, fluorescence curves that have the same shape will have the same peak heights after re-scaling. then, degradation factors (degfact) and corrected s: s ratios were calculated (see supplementary table online) using the mathematical model developed by auer et al. 
( ) , examining additional 'degradation peak signals' appearing in the lower molecular weight range and comparing them to ribosomal peak heights. calculation of the degfact is based on a numbering of continuous metrics, ranging from to ∞; increasing degfact values correspond to more degradation, and a new group of integrity is defined after graduation steps. once the classification of the rna samples is completed, groups of integrity are displayed, showing an alert warning indicative of some measurable degradation (yellow: - , orange: - and red: > ), while all non-reliable data come together and form the fourth group (black). we introduced a fifth class labeled white (< ), when no alert was produced by the software. software and manual are freely available at http://www.dnaarrays.org/downloads.php. degradometer version . . (released in may ) of the software was used. bioanalyzer electrophoretic sizing files (.cld format) collected with biosizing software version a. . .si (released in march ) were imported in the agilent expert software (rin beta release). the rin algorithm allows calculation of rna integrity using a trained artificial neural network based on the determination of the most informative features that can be extracted from the electrophoretic traces out of features identified through signal analysis. the selected features which collectively catch the most information about the integrity levels include the total rna ratio (ratio of area of ribosomal bands to total area of the electropherogram), the height of the s peak, the fast area ratio (ratio of the area in the fast region to the total area of the electropherogram) and the height of the lower marker. a total of electropherograms of rna samples from various tissues of three mammalian species (human, mouse and rat), showing varying levels of degradation and an adaptive learning approach were used in order to assign a weight factor to the relevant features that describe the rna integrity.
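Two of the features named above, the total rna ratio and the fast area ratio, are plain area ratios over the electropherogram, and can be sketched as below. This is an illustration only and not the agilent implementation: the time windows for the ribosomal bands and the fast region are hypothetical inputs, the trace is integrated with a simple trapezoidal rule, and the neural network that weights the features is not reproduced.

```python
from typing import Sequence, Tuple

Window = Tuple[float, float]


def area(times: Sequence[float], signal: Sequence[float],
         window: Window) -> float:
    """Trapezoidal area of the trace over the segments lying fully
    inside window = (start, end)."""
    lo, hi = window
    total = 0.0
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 >= lo and t1 <= hi:
            total += 0.5 * (signal[i] + signal[i + 1]) * (t1 - t0)
    return total


def area_ratios(times: Sequence[float], signal: Sequence[float],
                ribosomal_windows: Sequence[Window],
                fast_window: Window) -> Tuple[float, float]:
    """Return (total RNA ratio, fast area ratio) for one trace.

    The window boundaries are assumptions supplied by the caller.
    """
    whole = area(times, signal, (times[0], times[-1]))
    if whole <= 0:
        raise ValueError("trace has no positive area")
    ribosomal = sum(area(times, signal, w) for w in ribosomal_windows)
    fast = area(times, signal, fast_window)
    return ribosomal / whole, fast / whole
```

Both ratios are normalized by the total area, so they do not depend on the amount of rna loaded, which is one reason such features suit a user-independent metric.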
a rin number is computed for each rna profile (see supplementary table online) resulting in the classification of rna samples in numerically predefined categories of integrity. the output rin is a decimal or integer number in the range of - : a rin of is returned for a completely degraded rna sample whereas a rin of is achieved for an intact rna sample. in some cases, the measured electropherogram signals are of an unusual shape, showing for example peaks at unexpected migration times, spikes or abnormal fluctuation of the baseline. in such cases, a reliable rin computation is not possible. several separate neural networks were trained to recognize such anomalies and display a warning to the user or even suppress the display of a rin number. combining the results of the neural network for the rin computation and the neural networks to detect anomalies, the rin algorithm achieves a mean square error of . and a mean absolute error of . on an independent test set. the beta release of the software and manual are freely available at http://www.agilent.com/chem/rin. agilent expert version b. . .si (released in november ) of the software was used. expression levels of three housekeeping genes (hkg), gapd, gusb and tfrc, were measured by quantitative pcr using the taqman gene expression assays according to the manufacturer's instructions (applied biosystems, usa). sixteen aliquots of a unique batch of rna sample (universal human reference rna, stratagene, usa) of various levels of integrity (cf. table ) were used to test the influence of rna quality on the relative expression of those three genes. in parallel, a to comparison was done using two separate gusb and tfrc taqman probes. a homogeneous quantity ( . - mg) of the rna samples was subjected to a reverse transcription step using the highcapacity cdna archive kit (applied biosystems, usa) as described by the manufacturer.
single-stranded cdna products were then analyzed by real-time pcr using the taqman gene expression assays according to the manufacturer's instructions (applied biosystems, usa). single-stranded cdna products were analyzed using the abi prism sequence detector (applied biosystems, usa). the efficiency and reproducibility of the reverse transcription were tested using s rrna taqman probes. five assays were used, gapdh- (hs _m ), gusb- (hs _gh), gusb- (hs _m ), tfrc- (hs _m ) and tfrc- (hs _m ). in each case, duplicate threshold cycle (ct) values were obtained and averaged; then expression levels were evaluated by a relative quantification method ( ) . the fold change in one tested hkg (target gene) was normalized to the s rrna (reference gene) and compared to the highest quality sample (calibrator sample), using the following formula: fold change = $2^{-\Delta\Delta C_t}$, where $\Delta\Delta C_t = (C_{t,\mathrm{target}} - C_{t,\mathrm{reference}})_{\text{sample-n}} - (C_{t,\mathrm{target}} - C_{t,\mathrm{reference}})_{\text{calibrator-sample}}$. sample-n corresponds to any sample for the target gene normalized to the reference gene and calibrator-sample represents the expression level ( ×) of the target gene normalized to the reference gene considering the highest quality sample. mean $2^{-\Delta\Delta C_t}$ and sd were calculated, considering the samples either individually or grouped by quality metrics categories, based on rin metrics or degfact values, together with the lower and upper bound mean of % intervals of confidence (ic). using this analysis, if the expression levels of the hkg are not affected by the rna degradation, the values of the mean fold change at each condition should be very close to 1 (since $2^{0} = 1$) ( ) . descriptive statistics were executed using the xlstat software, version . (addinsoft, usa), p = . .
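The relative quantification described above can be written out as a short sketch. It assumes the standard comparative ct form of the calculation (fold change equal to 2 raised to minus the double difference of threshold cycles); the function and argument names are illustrative and do not come from the original analysis.

```python
from statistics import mean
from typing import Sequence


def mean_ct(duplicates: Sequence[float]) -> float:
    """Average the duplicate threshold-cycle (Ct) readings of one assay."""
    return mean(duplicates)


def fold_change(ct_target_sample: float, ct_reference_sample: float,
                ct_target_calibrator: float,
                ct_reference_calibrator: float) -> float:
    """Fold change of a target gene relative to the calibrator sample.

    Each Ct difference normalizes the target gene to the reference gene;
    subtracting the calibrator's difference gives the double difference.
    """
    d_sample = ct_target_sample - ct_reference_sample
    d_calibrator = ct_target_calibrator - ct_reference_calibrator
    ddct = d_sample - d_calibrator
    return 2.0 ** (-ddct)
```

For example, a sample whose normalized ct is one cycle later than the calibrator's gives a fold change of 0.5, while a sample identical to the calibrator gives exactly 1.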
mean, sd and coefficient of variation (variation or cv) between and within groups of samples were calculated, together with a measure of the dispersion (range), inter-quartile range ( st and rd quartiles, q -q ) and evaluation of the lower and upper bound mean of % interval of confidence (ic). comparative statistical analyses between groups were completed, p = . , using non-parametric statistical tests: two-independent mann-whitney u-test and k-independent kruskal-wallis test. we analyzed total rna sample profiles from various human tissues ( %) and cell lines ( %) of either tumoral ( %) or normal ( %) origin, with varying levels of rna integrity. supplementary table online for details). significant differences in a :a ratios were observed between specific groups of samples (i.e. tumoral versus normal or tissues versus cell lines). for instance, rna extracted from normal samples displayed an improved ratio of . , with % falling within the desired range ( figure a ). in contrast, the distribution of a :a ratios was not found to correlate with either purification methods or tissues of origin. rna integrity was further assessed by resolving the s and s ribosomal rna bands using the agilent bioanalyzer and the rna protocol. the analysis was done on rna profiles; data from samples was not obtained due to device problems during the runs. the system automatically provided s: s ratios for ( %) of the profiles. figure b shows the distribution of the s: s computed values, with a median ratio around . and a variation of % from the mean (ic . - . and q -q . - . ). in addition, a significant degree of variability of the s: s ratio ( - %) was found for identical samples from replicate runs ( - times). among those rna samples, s: s ratios of . or greater were rare, less than % of the values measured being within the theoretically desired range, except for the samples prepared from cultured cells ( figure b ). 
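The descriptive statistics listed at the start of this passage (mean, sd, coefficient of variation, range and inter-quartile range) can be reproduced with the python standard library, as sketched below. This stands in for the xlstat software named in the text and is not the authors' workflow; the non-parametric tests are not included, and quartile values depend on the interpolation convention of `statistics.quantiles`, which may differ from xlstat's.

```python
from statistics import mean, quantiles, stdev
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Summarize replicate measurements, e.g. ribosomal ratios of one
    sample measured on several chips."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    m = mean(values)
    sd = stdev(values)                  # sample standard deviation
    q1, _median, q3 = quantiles(values, n=4)
    return {
        "mean": m,
        "sd": sd,
        "cv_percent": 100.0 * sd / m,   # coefficient of variation
        "range": max(values) - min(values),
        "q1": q1,
        "q3": q3,
    }
```

A large `cv_percent` across replicate runs of the same sample is the kind of instability the text reports for area-based ribosomal ratios.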
the integration failed in the remaining cases, displaying an atypical migration, with no clear s and s rrna bands, and no s: s ratio was computed (data not shown). expert operators categorized the set of rna samples by inspecting the electrophoretic traces of successful assays. over the rna profiles checked, ( %) were scored within predefined categories ( figure c ), namely good [human categorization (hc)-level ], regular (hc-level ), moderate (hc-level ), low (hc-level ) and degraded (hc-level ). the remaining ( %) were flagged as displaying a temperature-sensitive profile: rna samples initially found intact became highly degraded when heated, although no rnase contamination was observed (data not shown). estimation of the robustness of this cataloging was done through comparison of qualifying criteria using a set of breast cancer samples (see materials and methods). integrity of the samples was evaluated independently by five expert operators, and categorization was found highly reliable with a coefficient of variation (cv) of ~ %. this is low considering that individual interpretation is involved, but can be explained by the fact that very experienced operators accomplished the scoring based on a clearly defined set of instructions, thus limiting frequently observed subjective visual interpretation and inconsistency of human categorization. predictably, a s: s ratio of . denoted high quality for a majority of rna samples, % being classified in hc-levels to . however, % of total rnas with s: s > . but a low baseline between the s and s rrna or front marker were also classified in hc-levels - (see figure d ) and could be considered suitable for most downstream applications. rna degradation was first assessed using the degradometer software (see materials and methods). over the rna profiles checked, all were scored in one of the five predefined classes ( figure a) .
altogether, ( %) degradation factors (degfact) values were computed, the remaining rna samples ( %) displaying profiles that could not be interpreted reliably; no degfact values could be scored, and samples were flagged in the black category ( figure a ). most of them ( %) correspond to samples previously classified by our operators as degraded (hc-level ). the remaining cases had an average degradation factor of . (ic . - . ) with large variations over the entire set of samples (over % from the mean, range - ). a lower variability was persistently found when identical samples from replicate runs were considered, resulting in observed degfact values with a - % cv. in addition, statistically significant differences were found between degfact values of samples sorted by types. the highest degfact values were found characteristic of tissue samples, % of them displaying a degfact > , as compared with % for the cell lines (data not shown). remarkably, we found a significant linear relationship between the degfact values distribution and the explicit human categorization. most hc classes corresponded to an unambiguous degfact distribution ( figure b ), while hc-levels and form a single class: hc-level , mean degfact of . , sd of . (ic . - . ); hc-level and , mean degfact of . , sd of . (ic . - . ); hc-level , mean degfact of . , sd of . (ic . - . ); hc-level , mean degfact of . , sd of . (ic . - . ). it is worth mentioning that the normalized heights of s and s peaks, and the interval between them after rescaling gradually decrease and then reverse with increasing degradation ( figure b ). integrity of rna samples was measured in parallel based on the rna integrity number metrics using an artificial neural network trained to distinguish between different rna integrity levels by examining the shape of the microcapillary electrophoretic traces (see materials and methods). over the rna profiles checked, ( %) were scored successfully ( figure a) , with an average rin of .
(ic . - . ). the remaining ( %) samples were associated with various unexpected signals, disturbing computation of the rin using default anomaly detection parameters. in each case, a flag alert was added corresponding to critical anomalies including unexpected data in sample type, (or) ribosomal ratio, (or) baseline and signal in the s region (data not shown). rin categorization was found regular, variability between replicate runs, compared to the other methods, being consistently very small (cv - %). as expected, the highest rin were characteristic of cell line samples, % of them displaying a rin > , as compared with % for the tissue samples (data not shown). a first group, corresponding to ( %) of the rna profiles, was analyzed using the default settings of the rin system, but with a lower threshold of rna quantity loaded ( ng) for reliable detection of anomalies than that recommended by the manufacturer ( ng). a significant linear relationship was found between the rin number and both the explicit human classification provided by our operators, and the degfact values calculated by the degradometer software ( figure b ).
figure . rna degradation characterization. integrity of rna sample profiles was scored using the degradometer software. (a) a total of rna profiles were successfully categorized into predefined alert classes using a mathematical model that quantifies rna degradation and computes a degradation factor (degfact). four classes (white, yellow, orange and red) are associated with different levels of degradation. a fifth class, black alert corresponds to samples that the system was not able to qualify with accuracy (n.d.). the distribution is represented by the number of records in each class. (b) comparative analysis was done using human evaluation (x-axis) based on electrophoresis analysis as a reference for rna integrity classification; observations of rrna peak heights and degfact values were taken at each of the hc levels. histograms refer to the mean s and s rrna peak heights and % confidence intervals (fluorescence intensities; left scale). mean degfact values and % confidence intervals (arbitrary unit, right scale) are plotted with the means joined.
each distinct hc class corresponds to an explicit rin number, with hc-levels and forming once again a single class: hc-level , mean rin of . , sd of . (ic . - . ); hc-level and , mean rin of . , sd of . (ic . - . ); hc-level , mean rin of . , sd of . (ic . - . ); hc-level , mean rin of . , sd of . (ic . - . ). for the remaining samples (assay done with < ng of rna), two separate groups were considered: samples with a computed rin below . , and above . . all samples in the first group were derived from rna nano assays, with mean rna quantities loaded below ng (q -q , - ng), i.e. below the lower limit of quantitation indicated by the manufacturer. all but of these samples were estimated by our operators to be of poor quality (hc-level ; n = ) or degraded (hc-level ; n = ), and all but were flagged black by the degradometer software and no degfact values were scored. these rna profiles could not be interpreted reliably, possibly due to either the low rna concentration or the unusual migration behavior and shifted baseline values of degraded samples. thus, the two automated systems were in disagreement for these samples; while human interpretation was in most cases in agreement with the rin system, with less than % of inconsistency. in the second group of samples, of the profiles were derived from rna pico assays with rna quantities loaded being on average below ng (q -q , . - . ng), which is within the manufacturer specifications. all but of them were estimated by our operators to range from high (hc-level ; n = ) to correct (hc-level and ; n = ) quality levels.
in addition, all rna profiles except were scored by the degradometer software, most of them displaying an alert flag (n = ); some slight degradation was detected, associated with a low mean degfact value of . (ic . - . ; q -q , . - . ). thus, both automated systems and human interpretations agreed in most of these cases, with < % of inconsistency. the influence of rna quality categorization obtained with both user-independent classifiers on gene expression profiling was explored using real-time rt-pcr. the expression levels of three housekeeping genes (hkg), gapdh, gusb and tfrc, were measured in aliquots of a unique rna displaying various integrity metrics ( table ). the mean correlation coefficient (r) between the threshold cycle (ct) among the samples and both quality metrics was found high: r = − . considering the rin metrics and r = . considering the degfact values. the values of the mean fold changes, calculated according to the $2^{-\Delta\Delta C_t}$ quantification method (see materials and methods), were found lower than . , corresponding to the expression level ( ×) in the sample exhibiting the highest rna quality (table and figure ). considering that hkg expression was measured relative to the reference sample, an obvious decline of the relative expression levels was observed, up to , and %, in samples categorized according to the rin metrics ( figure a) and degfact values ( figure b ). these results indicate that -to -fold differences may be expected in the relative expression levels of genes in samples that differ only by their quality (table ). these fold differences are much larger than those measured for rna samples of comparable integrity, consistently lower than . (table and figure ). in addition, an unambiguous gap in the distribution may be defined ( figure a and b) , distinguishing the rna samples of the higher quality categories (rin > and degfact values < ) from those of the lower categories (rin < and degfact values > ).
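The correlation between threshold cycle and each quality metric reported above is a pearson coefficient, which can be sketched as follows. The function is a generic textbook implementation, not the authors' software, and any numbers passed to it in an example are hypothetical.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between paired observations, e.g. the RIN of
    each aliquot (xs) and the Ct measured on that aliquot (ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series of at least 2 points")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)
```

The signs reported in the text follow from how the metrics are oriented: ct rises as template is lost, so it correlates negatively with rin (higher means more intact) and positively with degfact (higher means more degraded).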
it would be expected that measuring expression of an intact mrna would yield approximately equal results regardless of the region being probed, and if mrna fragmentation had occurred, then some sequences may be more abundant than others. we thus tested the effect of pcr probe location on the rnas. the and gusb probes, separated by nt, were associated with highly similar threshold cycle (ct) measures (r = . , b parameter = . ) ( figure c ). similar results were obtained for tfrc, with probes separated by nt (r = . , b parameter = . , data not shown). it seems therefore that the region being probed is not a source of variation in our results. it is universally accepted that rna purity and integrity are of foremost importance to ensure reliability and reproducibility of downstream applications. in the biomedical literature (pubmed, november ), from the articles that relate to rna, and the or including respectively the 'quality' or 'integrity' term, less than were found to contain 'rna quality' or 'rna integrity' terms. interestingly, half of them were published between and ; but none is proposing a standard operational procedure for rna quality assessment to the scientific community. except for two studies ( , ) , those reports are based on to years old methods ( ), indicating that they represent the established and currently mostly used methods. our results strongly challenge the reliability and usefulness of those conventional methods, demonstrating their inconsistency in evaluating rna quality. first, the a :a and a :a ratios reflect rna purity, but are not informative regarding the integrity of the rna. available rna extraction and purification methods yield highly pure rna with very little dna or other contaminations, resulting most often in both ratios ≥ . , although % of the samples were found degraded and % more of poor quality.
the high a :a ratios are indicative of limited protein contaminations, whereas high a :a ratios are indicative of an absence of residual contamination by organic compounds such as phenol, sugar or alcohol, which could be highly detrimental to downstream applications. nonetheless, samples displaying low a :a ratios (≤ . ) did not exhibit any inhibition during downstream applications, such as cdna synthesis and labeling or in vitro transcription (data not shown). second, due to a lack of reliability, the s: s rrna ratios may not be used as a gold standard for assessing rna integrity. when ribosomal ratios were calculated from identical samples but through independent runs, a large degree of variability (cv - %) was observed. moreover, using the biosizing software, we found s: s rrna ratios evaluation compromised by the fact that their calculation is based on area measurements and therefore heavily dependent on definition of start and end points of peaks. in % of the cases, the system was unable to localize the ribosomal peaks, and therefore no s: s ratios were computed. for the remaining samples, no clear correlation between s: s ratios and rna integrity was found although rnas with s: s > . were usually of high quality. most of the rnas we studied ( %), displaying a s: s > . , could be considered of good quality. interestingly, auer et al. ( ) in a study on tissues from seven organisms, reported that an objective measurement of the rna integrity may possibly be done through comparison of re-scaled s and s peak heights, but not of the corresponding areas. actually, we observed a linear relationship between rna integrity and differences in normalized s and s peak heights. increased degradation resulted in a significant decrease in the scaled corrected heights of the ribosomal peaks, with inversion of the ratio at the highly degraded stages (cf. figure b ).
in comparison to the area computation, s: s rrna re-scaled peak height measurement produced more consistent values, with a cv reduced to - %, and displayed clear concentration-independent values (see supplementary tables and online) . human evaluation of the integrity of rna through visual inspection of the electrophoresis profiles provided very consistent data. variability between classifications produced by five independent expert operators (cv %) was lower than with automated management of more conventional control s: s area values (cv - %). it is, however, very time-consuming and strongly dependent on individual competence. even with highly trained specialists, % of the set of rna samples could not be allocated to any of the five predefined categories; their corresponding profiles were considered by our experts as atypical, displaying a temperature-sensitive shape (data not shown). these strategies appear unsuitable for standardization and quality control of rna integrity assessment, which require simple but consistent expert-independent classification, facilitating information exchanges between laboratories.
the n-value corresponds to the number of samples by category. the mean quality metrics, i.e. rin and degfact and the mean fold change ($2^{-\Delta\Delta C_t}$) relative to the reference sample are indicated, together with the % confidence intervals. observed technical variation (ic-rep, p = . ) is also specified, considering duplicate (two per gene per target sample) and replicate (six per gene per calibrator sample) measures. the reference sample exhibits a rin of , a degfact value of . and by default mean fold change set to . the observed decrease in the expression (relative expression, %) relative to the reference sample is calculated. the fold differences refer to the fold-ratios that are expected in the expression levels for a gene, across categories (between categories), given that the samples only differ by their quality, and within each category (within categories), considering rna of comparable integrity. the fold-ratios (technical variation) that may be expected by chance in the gene expression levels, p = . , from some technical reasons, are also considered.
we therefore investigated the performance of two recently developed user-independent software algorithms ( , ) . the degradometer software provided a reliable evaluation of rna integrity based on the identification of additional 'degradation peak signals' and their integration in a mathematical calculation together with the ribosomal peak heights. it allowed characterization of the integrity of % of the samples tested, one-third with an alert flag, which was first found to be fairly informative, as it strongly reduces the complexity of the metrics by introducing three distinct classes labeled yellow, orange and red, and can be used as a first straightforward simple filtering step. however, degradation factors (degfact) metrics yield precise measures with less than % cv and are much more valuable than flag alerts for the purpose of standardization. the same is true for the rna integrity number 'rin' software which allowed the characterization of the integrity of % of the rna samples tested, with a rin value for rna sample profiles with less than % cv. in general, there was a good agreement between the human classification, the degradation factor and the rin (see figure b ). this provided a cross-validation of the user-independent qualification systems tested. both resulted in the refinement of human interpretations, validating four statistically relevant classes of samples, namely good (hc-level ), regular/ moderate (hc-level and ), poor (hc-level ) and degraded (hc-level ).
moreover, the % rna samples previously flagged by the operators as displaying an atypical temperature-sensitive shape were unambiguously assigned to one or the other category of samples [rin = . (ic . - . ); degfact = . (ic . - . ); data not shown].

figure . workflow of operational procedure for rna quality assessment. integrity of the rna, once extracted and purified from cell lines, clinical or biological tissue samples, is controlled from the widely used bioanalyzer electrophoretic traces. as a standard part of the agilent analysis software ( ), a rin metric is first calculated, scoring each rna sample into numerically predefined categories of integrity (rin, from to ; n is a threshold value). as an independent control, a degradation factor metric (degfact, from to ∞; n is a threshold value) may optionally be allocated to each rna sample using the bioanalyzer-independent degradometer software ( ). in a standard operating procedure, rin and/or degfact metrics will first be used as a standard exchange language to document rna integrity and degradation, second to classify the rna in homogeneous groups, and finally to select samples of comparable rna integrity to improve the scheme of meaningful downstream experiments. the standard operating procedure will benefit from feedback information that will help users to define threshold integrity metrics values based on the results of rna-based analyses.

altogether, we found the degradometer and rin algorithms to be highly reliable user-independent methods for automated assessment of rna degradation and integrity. the rin system is a slightly more informative tool, able to compute assessment metrics for % of the rna profiles, compared to % with the degradometer software; the remaining profiles were flagged respectively as n/a or black alert. for samples available below a low limit of ng (n = ), the rin system provided metric values for % of them, compared to only % with the degradometer software.
similarly, the rin system was able to provide metric values for % of poor quality samples (including low quality and degraded samples; n = ), whereas the degradometer software could classify only % of them. another advantage with the rin classifier is that, if there are critical anomalies detected (including genomic dna contamination, wavy baseline, etc.), threshold settings may be changed and a reliable rin value computed. this was the case for of the rna sample profiles successfully classified by the system. while intact rna obviously constitutes the best representation of the natural state of the transcriptome, there are situations in which gene expression analysis may be desirable even on partially degraded rna. some studies report collection of reasonable microarray data from rna samples of impaired quality ( ) , leading to meaningful results if used carefully. moreover, auer et al. ( ) recently concluded that degradation does not preclude microarray analysis if comparison is done using samples of comparable rna integrity. we confirmed the direct influence of the rna quality on the distribution of gene expression levels, by detecting using q-pcr a significant (up to -fold) difference in the relative expression of genes in samples of slightly decreased rna integrity, which is much larger than the variation within comparable rna quality categories (cf. figure and table ). this may correlate with ratio discrepancies in gene expression experiments, and therefore with false positive and false negative rates of differential gene expression when comparing two samples. therefore, computing reliable metrics of rna integrity, even if the rna is found to be partially degraded, may be highly valuable. 
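as an illustration of the classification and selection steps in the workflow of figure , the following python sketch bins samples by rin and groups those of comparable integrity. it is only a sketch under stated assumptions: the cut-offs (4, 6 and 8 on the 1 to 10 rin scale), the way the four class labels are attached to those cut-offs, the function names and the sample values are all hypothetical and are not taken from this study.

```python
from bisect import bisect_right

# Hypothetical cut-offs on the 1-10 RIN scale; real thresholds should be set
# from feedback on downstream results, as the workflow above suggests.
RIN_THRESHOLDS = (4.0, 6.0, 8.0)
RIN_LABELS = ("degraded", "poor", "moderate", "good")


def classify_rin(rin, thresholds=RIN_THRESHOLDS, labels=RIN_LABELS):
    """Return the integrity class for one RIN value."""
    if not 1.0 <= rin <= 10.0:
        raise ValueError(f"RIN must lie between 1 and 10, got {rin}")
    # bisect_right counts how many thresholds are <= rin,
    # which is exactly the index of the matching class label.
    return labels[bisect_right(thresholds, rin)]


def group_by_integrity(samples):
    """Map each integrity class to the sample names that fall into it."""
    groups = {}
    for name, rin in samples.items():
        groups.setdefault(classify_rin(rin), []).append(name)
    return groups


# Illustrative, made-up RIN values.
samples = {"tumor_01": 9.2, "tumor_02": 8.4, "biopsy_03": 6.5, "biopsy_04": 3.1}
groups = group_by_integrity(samples)
print(groups)
```

with such a grouping, only samples that share a class (here `tumor_01` and `tumor_02`) would be compared directly in a gene expression experiment.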
the direct and unambiguous relationships established between human interpretations and both rin and degfact distributions indicate that, using these metrics, it should be possible to identify specific samples that are too disparate to be included in comparative gene expression analyses without compromising the results. although the information provided by these user-independent classifiers is not a guarantee of successful downstream experiments, it gives a more comprehensive picture of the samples and can be used as a safeguard against performing useless and costly experiments. thus, the rin system may be used as a simple metric that can be easily integrated in any sample tracking information system for definition of standard operating procedures under quality assurance, following a scheme such as the one described in figure . in this context, we suggest that the growing number of laboratories performing rna quality control by microcapillary electrophoresis should be offered the option to report objective rna quality metrics as part of the 'minimum information about a microarray experiment' (miame) standards ( ). through registration of rna profiles in a public electronic repository, such standardized information should enable and facilitate comparisons of rna-based bioassays performed across laboratories with rna samples of similar quality, in much the same way as sequencing traces are compared.
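the fold changes reported relative to the reference sample follow the 2^-ΔΔct calculation cited in the reference list. a minimal python sketch of that arithmetic is given below, assuming a single reference (housekeeping) gene and 100% amplification efficiency; the function name and the ct values are made up for illustration and are not data from this study.

```python
def fold_change(ct_target_sample, ct_reference_sample,
                ct_target_calibrator, ct_reference_calibrator):
    """Relative expression of a target gene by the 2^-ddCt method."""
    # Normalize the target gene to the reference gene within each sample.
    delta_ct_sample = ct_target_sample - ct_reference_sample
    delta_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    # Compare the test sample with the calibrator (reference) sample.
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)


# Made-up Ct values: the target gene crosses the threshold two cycles later
# in the test sample than in the calibrator, with an unchanged reference gene.
fc = fold_change(26.0, 18.0, 24.0, 18.0)
relative_expression_pct = 100.0 * fc
print(fc, relative_expression_pct)
```

a two-cycle delay therefore corresponds to a fourfold apparent decrease in expression, which shows how a modest loss of rna integrity could be mistaken for differential expression.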
molecular cloning: a laboratory manual use of uv methods for measurement of protein and nucleic acid concentrations value of a /a ratios for measurement of purity of nucleic acids validity of nucleic acid purities monitored by nm/ nm absorbance ratios the effect of sodium ion concentration on intrastrand base-pairing in single-stranded dna a new fluorometric method for rna and dna determination fractionation of ribonucleic acids by 'sephadex' agarose gel electrophoresis rna molecular weight determinations by gel electrophoresis under denaturing conditions, a critical reexamination a rapid, accurate, nonradioactive method for quantitating rna on agarose gels quantitative detection of reverse transcriptase-pcr products by means of a novel and sensitive dna stain a microfluidic system for high-speed reproducible dna sizing and quantitation quantification of mrna using real-time reverse transcription pcr (rt-pcr): trends and problems increase in the ratio of s rna to s rna in the cytoplasm of mouse tissues during aging fine mapping of s rrna sites specifically cleaved in cells undergoing apoptosis rna extraction from gastrointestinal tract and pancreas by a modified chomczynski and sacchi method rapid isolation of total rna from small samples of hypocellular, dense connective tissues ribosomal rna in alzheimer's disease and aging rnase l-independent specific s rrna cleavage in murine coronavirus-infected cells quality of nucleic acids extracted from fresh prostatic tissue obtained from turp procedures moderate degradation does not preclude microarray analysis of small amounts of rna total rna suitable for molecular biology analysis evaluation of quality-control criteria for microarray gene expression analysis a two-step method for the extraction of high-quality rna from endoscopic biopsies chipping away at the chip bias: rna degradation in microarray analysis rna integrity number (rin)-standardization of rna quality control. 
agilent application note, publication number- - en single-step method of rna isolation by acid guanidinium thiocyanate-phenol-chloroform extraction analysis of relative gene expression data using real-time quantitative pcr and the (-delta delta c(t)) method changes in differential gene expression because of warm ischemia time of radical prostatectomy specimens minimum information about a microarray experiment (miame)-toward standards for microarray data

we would like to thank herbert auer and karl kornacker for useful discussions and technical assistance with the degradometer tool. we are very grateful to raphaël saffroy for having given access to the abi prism instrument and for his helpful advice concerning the implementation of the q-pcr processes. this work was supported by cnrs. funding to pay the open access publication charges for this article was provided by agilent technologies and the cnrs.

conflict of interest statement. none declared. supplementary material is available at nar online.

key: cord- -uvf qzfd authors: kenworthy, rachael; lambert, diana; yang, feng; wang, nan; chen, zihong; zhu, haizhen; zhu, fanxiu; liu, chen; li, kui; tang, hengli title: short-hairpin rnas delivered by lentiviral vector transduction trigger rig-i-mediated ifn activation date: - - journal: nucleic acids res doi: . /nar/gkp sha: doc_id: cord_uid: uvf qzfd

activation of the type i interferon (ifn) pathway by small interfering rna (sirna) is a major contributor to the off-target effects of rna interference in mammalian cells. while ifn induction complicates gene function studies, immunostimulation by sirnas may be beneficial in certain therapeutic settings. various forms of sirna, meeting different compositional and structural requirements, have been reported to trigger ifn activation. the consensus is that intracellularly expressed short-hairpin rnas (shrnas) are less prone to ifn activation because they are not detected by the cell-surface receptors.
in particular, lentiviral vector-mediated transduction of shrnas has been reported to avoid ifn response. here we identify an shrna that potently activates the ifn pathway in human cells in a sequence- and ′-triphosphate-dependent manner. in addition to suppressing its intended mrna target, expression of the shrna results in dimerization of interferon regulatory factor- , activation of ifn promoters and secretion of biologically active ifns into the extracellular medium. delivery by lentiviral vector transduction did not avoid ifn activation by this and another, unrelated shrna. we also demonstrated that retinoic-acid-inducible gene i, and not melanoma differentiation associated gene or toll-like receptor , is the cytoplasmic sensor for intracellularly expressed shrnas that trigger ifn activation.

a specific double-stranded rna (dsrna) structure, ~ - bp dsrna with overhangs, plays a critical role in initiating both microrna (mirna)- and small interfering rna (sirna)-mediated gene silencing, as it is the structure recognized by the rna interference (rnai) machinery, the rna-induced silencing complex (risc) ( ) ( ) ( ) . except for preformed sirna duplexes of ~ bp, the risc-loaded small rnas are generated by a ribonuclease (rnase) iii-like enzyme that is found in virtually all eukaryotic organisms. this enzyme, aptly named dicer for its ability to cleave a variety of larger (> bp) dsrna molecules into the ~ bp dsrna with a characteristic overhang of nt, is a multidomain rna-binding protein and itself a component of risc. the primary sequence of the rnas is not important in risc formation, and rnai can suppress virtually any target as long as the rules of sequence complementarity between the small rna and the target rna are satisfied. dsrnas are also a type of pathogen-associated molecular pattern (pamp) that is detected by cellular innate immunity sensors named pattern recognition receptors (prrs) ( ) .
the interaction between a pamp and a prr triggers activation of the interferon (ifn) pathway in mammalian cells, which significantly changes the gene-expression profile in the cells and contributes to the well-documented off-target effect of rnai. ifn induction is especially problematic in antiviral studies employing rnai, where the antiviral effect of ifn must be distinguished from that of rnai. typical ifn-inducing structure patterns include dsrna of a certain length, single-stranded rna (ssrna) containing -triphosphates ( -ppp), the dsrna analogue polyinosinic-polycytidylic acid (poly i:c), and certain dsdna molecules. these rna patterns are generally believed to possess 'non-self' properties that allow the cell to recognize foreign (often viral) rnas specifically. various forms of sirna duplexes have been reported to trigger ifn induction both in vitro and in vivo ( ) ( ) ( ) ( ) ( ) , probably through the cell surface- and/or endosome-expressed toll-like receptors (tlrs), including tlr and tlr ( , , ) . short-hairpin rnas (shrnas) expressed from a dna plasmid have also been shown to activate ifn ( ) . the double-stranded form of these rnas is below the size limit of the stem-loop rnas that can be detected by the rna-activated protein kinase (pkr) ( ) and is probably detected by other cytoplasmic prrs. two cytoplasmic rna helicases, retinoic-acid-inducible gene i (rig-i) and melanoma differentiation associated gene (mda ), signal to the ifn-b promoter when activated by specific rna structures ( ) ( ) ( ) . although both prrs signal through the mitochondrial antiviral signaling protein mavs/cardif/visa/ips- ( ) ( ) ( ) ( ) , studies of ligand specificity suggest that rig-i and mda are parallel sensors with overlapping substrates. for example, although both prrs are activated by poly i:c in cell culture systems ( , ( ) ( ) ( ) ( ) ( ) , mda appears to be more important in mediating the poly i:c response in vivo ( , ) .
in addition, rig-i can bind and respond to ssrnas bearing -ppp, whereas mda is not activated by -ppp-containing rna ( , ) . finally, several cytosolic sensors for dsdna have recently been reported ( ) ( ) ( ) ( ) ( ) ( ) . nevertheless, current data on what constitutes effective substrates for either prr are incomplete and sometimes controversial. here we report for the first time that shrnas delivered by lentiviral transduction triggered ifn activation and that rig-i and mavs, but not mda or tlr , mediated the ifn activation triggered by intracellularly expressed shrna, which could activate both ifn-a and ifn-b promoters. ifn activation depended on sequence, a -ppp and correct processing of the rna hairpin by dicer; it was independent of promoter choice, presence of blunt ends, route of delivery and rnai potency.

gs and lh cells have been described earlier ( , ) . huh- and ft cells were maintained in dmem supplemented with % fbs. we used the following antibodies: anti-cypa (biomol, plymouth meeting, pa, usa); anti-cypb (affinity bioreagents, rockford, il, usa); anti-ku , anti-flag and anti-actin (sigma-aldrich, st louis, mo, usa); anti-ifn-stimulated gene (isg) (rockland immunochemicals, gilbertsville, pa, usa); anti-ns a (virogen, watertown, ma, usa) and anti-ns (in-house). gsb and h cells have been described earlier ( ) . poly i:c was purchased from sigma-aldrich, and synthetic hairpin rna was purchased from integrated dna technologies (coralville, ia, usa). synthetic sirna was purchased from ambion (austin, tx, usa). protein contents of cell lysate were quantified with the bio-rad dc protein assay (bio-rad, hercules, ca, usa), and an equal amount of total protein was loaded in each lane. samples for irf- dimerization assay were run on a polyacrylamide gel under non-denaturing conditions ( ) . other samples were denatured and separated by sodium dodecyl sulfate polyacrylamide gel electrophoresis (sds-page).
proteins were then transferred onto a nitrocellulose membrane and stained with the appropriate antibodies with the snap i.d.™ system (millipore, worcester, ma, usa) according to the manufacturer's instructions. for luciferase assays, cells were seeded to a confluency of %, and for all other assays, cells were seeded to a confluency of %. the next day, transfections of dna plasmids and synthetic rnas were performed with lipofectamine™ (invitrogen, carlsbad, ca, usa) according to the manufacturer's instructions. plasmids pgl -ifna , pgl -ifnb, prl-tk, pcmv-flag-irf- and pcr . -irf- a have been described earlier ( ) . shrnas were expressed from a human immunodeficiency virus (hiv)-based lentiviral vector ( , ) , and sh-pcaf was constructed on the basis of a previously reported sequence ( ) . plasmid sh-b /h was constructed by cloning of the dna fragment encoding the sh-b rna into psilencer . -h (ambion, austin, tx, usa) according to the manufacturer's instructions. the rig-i and tlr constructs have been described ( , ) . the rig-i c construct encodes flag-tagged, c-terminal aa of human rig-i cloned into a bicistronic expression vector modified from pbicep-cmv- (sigma-aldrich, st louis, mo, usa), in which the cmv promoter was replaced with the elongation-factor- promoter. the mda and mda -c constructs were kindly provided by fujita ( ) . hcv genotype a ns - a protease was expressed from the pcmv- tag- a plasmid (stratagene, la jolla, ca, usa). ft cells were seeded in -well plates and were transfected h later with ng of a shrna expression vector, ng of pgl -ifna or pgl -ifnb, ng of prl-tk and ng of pcr . -irf- a. cells were collected h after transfection. luciferase assays were performed with the dual-glo® luciferase assay system reagents (promega, madison, wi) and luminescence was quantified with a modulus microplate reader (turner biosystems, sunnyvale, ca, usa).
ratios of firefly luciferase (from the pgl vectors) to renilla luciferase (from the prl-tk vector) were calculated, and that of the sh-b sample was normalized to %. sequences of shrna are shown in table . lentiviral vector production and transduction were performed as described earlier ( ) . viral vectors were pelleted by ultracentrifugation at g at c for h and resuspended in a volume of pbs that was % of the original medium volume. the titers of the concentrated vectors were then measured with a p elisa kit (zeptometrix, buffalo, ny, usa). real-time reverse transcription pcr (rt-pcr) was performed as described earlier ( ) . the primers used were oas forward, -agg tgg taa agg gtg gct cc- and oas reverse -aca acc agg tca gcg tca gat- ; rig-i forward -gag gca gag gaa gag caa gag g- and rig-i reverse -cgc ctt cag aca tgg gac gaa g- ; gapdh forward -tca ctg cca ccc aga aga ctg- and gapdh reverse -gga tga cct tgc cca cag c- . the primers for hcv detection were -cgc tca atg cct gga gat ttg- and -gca ctc gca agc acc cta tc- . for flow cytometry, gs cells were fixed h after treatment in a solution of % paraformaldehyde and analyzed with a facscanto flow cytometer (bd biosciences, san jose, ca, usa). mean gfp intensity was plotted, and that of the sh-ntc sample was normalized to %. total rna from transiently transfected ft cells was extracted with rna stat- (tel-test, friendswood, tx, usa) and separated on a . % urea polyacrylamide gel. the transfer of rna onto nitrocellulose membrane and hybridization were performed according to standard molecular biology protocols. the probe for detecting the expression of sh-b and its variants was a synthetic dna oligomer corresponding to the bottom strand of sh-b . radioactive labeling of the probe was performed with an end-labeling protocol with t polynucleotide kinase (ambion, austin, tx, usa). 
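the reporter normalization described above (the firefly-to-renilla ratio for each sample, scaled to one chosen sample) can be sketched in python as follows. this is a minimal illustration rather than the authors' analysis script: the function name, the sample names and the luminescence readings are hypothetical, and scaling the reference sample to 100% is an assumption made for the example.

```python
def normalized_reporter_activity(readings, reference_sample):
    """Firefly/Renilla ratio per sample, scaled so the reference equals 100%."""
    ratios = {}
    for name, (firefly, renilla) in readings.items():
        if renilla <= 0:
            raise ValueError(f"Renilla reading for {name} must be positive")
        # Dividing by the Renilla signal corrects for transfection efficiency.
        ratios[name] = firefly / renilla
    reference_ratio = ratios[reference_sample]
    return {name: 100.0 * ratio / reference_ratio for name, ratio in ratios.items()}


# Made-up luminescence readings as (firefly, Renilla) pairs.
readings = {"sh_test": (8000.0, 2000.0), "sh_control": (1000.0, 2000.0)}
activity = normalized_reporter_activity(readings, reference_sample="sh_test")
print(activity)
```

expressing every sample relative to one reference makes promoter activities comparable across wells that differ in transfection efficiency.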
the exposure and detection of the radioactive signal were performed with a typhoon imager (ge healthcare, piscataway, nj, usa) with quantity one software (bio-rad, hercules, ca, usa).

a short-hairpin rna directed at cypb induces ifn production in human embryonic kidney cells

to investigate the potential role of the cyclophilins (cyps) in hcv replication ( ), we delivered several shrnas directed at mrnas of three cyps into hcv replicon cells by means of a lentiviral vector, using a murine u promoter to drive the expression of the shrna ( figure a ) ( ) . we observed a discrepancy between two anti-cypb shrnas (b and b ) in their relative efficiency in knocking down cypb expression and in suppressing hcv. lentiviral vector sh-b was less efficient in knocking down cypb expression but potently inhibited hcv ns a expression in a human hepatoma cell line containing replicating hcv rna ( figure b , left). viral inhibition was independent of cypb knockdown, as control medium from transfected ft cells that did not contain any lentiviral vector particles, generated by omission of the packaging plasmids during transfection, also inhibited hcv replication ( figure b , right) without affecting cypb expression. the fast kinetics of viral inhibition (complete inhibition within h, data not shown) was also more consistent with ifn than with rnai-based inhibition. the presence of ifn in the lentiviral vector preparation of sh-b was confirmed by strong induction of - -oligoadenylate synthetase (oas ), a classic ifn-induced gene, in both naïve huh- and the hcv replicon cell line (gs ) treated with the medium ( figure c ). in addition, hcv replication in an ifn-resistant hcv replicon cell line (h ), in contrast to that in a wild-type replicon cell line (gsb ) ( ), was not inhibited by the sh-b medium ( figure d ), suggesting the lack of additional viral inhibiting agents in the sh-b medium.
expression of sh-b in ft cells also induced dimerization of irf- , confirming the activation of the ifn production pathway in these transfected cells ( figure e ). finally, sh-b was able to activate both ifn-a and ifn-b promoters, although the activation of the ifn-a promoter required coexpression of irf- , which is normally expressed at very low levels in -based cells ( figure f ). these results demonstrate that sh-b is a potent activator of irf- and irf- , master regulators of ifn expression in human cells. we next investigated the role of the different viral/exogenous rna sensors, rig-i, mda and tlr , in sh-b -triggered ifn production. mammalian expression plasmids encoding each of these proteins, as well as the dominant negative (dn) mutants of rig-i and mda , were transfected into ft cells with shrnas and an ifn-b promoter reporter construct. signaling to the ifn-b promoter and the expression of the prr proteins were then examined h after transfection. in the absence of sensor proteins, sh-b increased activation of the ifn-b promoter by . -fold ( figure a ). coexpression of mda or tlr did not increase or decrease the ability of sh-b to activate the ifn-b promoter relative to the negative control shrna (sh-ntc), but in the presence of rig-i coexpression, the induction of the ifn-b promoter by sh-b was increased to ~ -fold. moreover, ectopic expression of a dn mutant of rig-i (rig-i c), but not that of mda (mda -c), completely abrogated ifn promoter activation by sh-b . with the exception of tlr , which required prolonged exposure of the western blot to be detected, the cytoplasmic sensors and their mutants were expressed at comparable levels ( figure b ). moreover, activation of irf- ( figure e ) and ifn promoters ( figure f ) in ft cells, which do not contain a functional tlr signaling pathway ( ) , indicates that tlr plays a negligible role, if any, in ifn induction by sh-b .
the combination of sh-b and rig-i produced the highest level of ifn-b promoter activity, which was confirmed by western blotting showing that endogenous isg induction was only detectable in cells cotransfected with sh-b and wild-type rig-i ( figure b ). to confirm further that biologically active ifn was released from these cells, we applied the culture medium of the transfected ft cells to an hcv replicon cell line (gs ) in which ns a-gfp expression is used for monitoring viral rna replication ( ) . hcv replication in this cell line is extremely sensitive to ifn, and the effect of the cytokine can be readily measured as the change in the mean gfp intensity of the treated cells. as shown in figure c , culture medium from sh-b efficiently suppressed hcv replication, resulting in a decrease in the ns a-gfp intensity within h of treatment. cotransfecting wild-type rig-i produced a medium with stronger inhibition, whereas rig-i c drastically suppressed the antiviral effect of the medium. finally, real-time rt-pcr analysis revealed that sh-b , but not the negative control shrna, strongly activated expression of endogenous rig-i, a well-characterized isg whose induction requires paracrine/autocrine action of ifn ( , ) . as expected, poly i:c activated rig-i expression in the same assay ( figure d ). these results, taken together, show that rig-i is the cellular sensor that mediates the ifn induction by sh-b .

figure . a small-hairpin rna directed at cypb induces ifn production in human embryonic kidney cells. (a) sequence of sh-b , which was expressed from a self-inactivating human immunodeficiency virus (hiv) vector with a murine u promoter ( ) . (b) inhibition of hcv expression by culture media of sh-b -transfected ft cells. gs cells were treated with culture supernatant taken from ft cells transfected with various shrna plasmids with (left) or without (right) the packaging plasmids overnight. cells were then cultured in fresh media for an additional days before being lysed for western blotting. (c) oas induction by culture supernatant from ft cells transfected with sh-b . huh and gs cells were treated with culture supernatant from ft cells transfected with either sh-luc or sh-b for h before rna extraction and real-time rt-pcr analysis. oas rna level was normalized to that of gapdh rna. (d) transfected culture media failed to suppress hcv replication in an ifn-resistant cell line. hcv replicon cells were cultured as described earlier ( ) and then treated with the indicated culture medium from transfected ft cells. hcv rna was analyzed with real-time rt-pcr. (e) irf- dimerization in response to sh-b expression. flag-irf- was cotransfected with a shrna into ft cells. cells were lysed h after transfection, and total cell lysate was separated on a polyacrylamide gel under non-denaturing conditions, transferred and stained with an anti-flag antibody. (f) ifn-a and ifn-b promoter activation by sh-b expression. sh-ntc, sh-c (an shrna directed at cypc), or sh-b was cotransfected along with luciferase reporter plasmids with or without irf- . the ratios of firefly luciferase readings to renilla luciferase readings were plotted.

the majority of the shrnas that we use in the lab do not activate rig-i expression and ifn signaling despite having essentially the same structure as sh-b , so we wanted to determine whether the sequence of sh-b is distinctive enough to trigger the production of ifn. we first tested a synthetic sirna duplex with the same target sequence as sh-b . this sirna (si-b -syn) should resemble the final dicer product of sh-b except for the -ends. the synthetic sirna contains -oh groups, whereas the dicer products probably figure a ) while failing to activate ifn production, as measured by the gfp-hcv assay ( figure b ).
to determine whether the sequence of the intact hairpin rna before dicer cleavage is sufficient to trigger ifn, we tested a synthetic shrna (sh-b -syn) that had exactly the same sequence as the predicted intracellular sh-b transcript generated by the u promoter. again, the -end of the synthetic sh-b had a -oh group instead of any phosphate. sh-b -syn behaved similarly to si-b -syn in that it knocked down cypb expression without activating ifn response (figure ). these results suggest that the -end status of sh-b is important for ifn activation, consistent with the previous finding that a -triphosphate is required for rig-i activation ( , ) . to determine the contribution of the individual residues of the sh-b sequence, we introduced a series of point mutations into the shrna and tested them for ifn induction. we changed the first nucleotide from a to g, c, or t while maintaining base-pairing between nucleotides + and + . these mutant shrnas lacked the ability to activate ifn production (table ) . changing the + nucleotide to g while leaving the + nucleotide intact also abolished ifn activation by the shrna (a /g), as did the reciprocal mutation u /c. the importance of the first nucleotide was further confirmed by the inability of sh-b + to activate ifn. the target of sh-b + was shifted nt downstream on the cypb mrna, producing an shrna starting with a g at the + position. the presence of an a at the + position was not, however, sufficient to render an shrna competent for ifn activation, as replacing the first nucleotide of the sh-ntc with an a did not generate an ifn-inducing shrna (ntc-a and ntc+ ). these results indicate that a protruding/unpaired a at the end of the hairpin or the rna duplex, a potential result of 'breathing' at the end of the dsrna, is not sufficient to trigger ifn induction as previously suggested ( ) .
two point mutations located farther into the stem structure of the shrna ( g and b a ) also reduced its ability to induce ifn even though the base-pairing was perfectly maintained in these mutants. finally, replacing the -nt hairpin loop with a -nt loop that had been previously shown to abolish shrna-mediated rnai (loop a mutant) ( ) eliminated the ability of sh-b to induce ifn, suggesting the importance of rna processing in the induction. to determine whether the inability of the mutant shrnas to induce ifn was due to lower expression levels, we performed northern blotting analysis of shrna expression for the wild-type and two mutants. the mutants a /g and loop a were chosen because their final sirna products have exactly the same sequence as that of the wild-type sh-b and can thus be detected with the same efficiency by the same probe. although sh-a/g and sh-loop a were clearly unable to activate the ifn-b promoter ( figure a ), they were both expressed at levels comparable to those of the wild-type sh-b product ( figure b). interestingly, the final sirna product of sh-loop a was slightly smaller than those of sh-b and sh-a /g, suggesting that cleavage did occur and perhaps occurred one or nt into the stem to compensate for the shorter loop. blunt-ended sirnas have previously been reported to be stronger inducers of ifn than sirnas with overhangs ( ) . indeed, a previously reported ifn-inducing shrna, sh-pcaf (p /creb-binding protein-associated factor), contains a blunt end ( ) and was more potent in activating ifn than sh-b ( figure a ), which is predicted to form an overhang of - ts at each end of the final sirna. we therefore constructed a version of sh-b that would be blunt at the end that is not processed by dicer by adding two extra as to the -end of the shrna. this modification (blunt sh-b ) did not increase the ability of sh-b to activate the ifn-b promoter ( figure a ).
we confirmed, in two independent experiments, that ifn induction by sh-pcaf was also mediated by rig-i. first, cotransfection of dn rig-i resulted in a - to -fold inhibition of ifn induction by sh-pcaf ( figure b ), whereas wild-type rig-i increased ifn induction by several fold in the same assay. second, when hcv ns - a protease, which cleaves mavs, thereby blocking the rig-i pathway, was coexpressed with either sh-b or sh-pcaf, ifn induction by these shrnas was severely compromised ( figure c ), further substantiating a role of the rig-i and mavs pathway in mediating ifn induction by both the blunt-ended sh-pcaf and the sh-b with overhang. the proper expression of ns - a protease was confirmed by western blotting ( figure d ). to assess the contribution of the promoter choice in ifn activation by intracellularly expressed shrna, we expressed sh-b from another commonly used pol iii promoter, the human h promoter. both the original, mu -driven sh-b and the h -driven sh-b activated the ifn-b promoter ( figure a ) and resulted in secretion of ifn into the transfected cell-culture media, which in turn suppressed hcv replication ( figure b ). proper expression of the sirna ( figure c ) and the subsequent knockdown of cypb expression ( figure d ) both appeared normal for sh-b expressed from the h promoter plasmid, which has a backbone different from that of our lentiviral vector carrying the mu promoter. these data suggest that ifn induction by sh-b is not restricted to a particular promoter or expression construct. further supporting this conclusion was the observation that the expression cassette by itself, removed and isolated from the lentiviral plasmid by restriction digestion, could also activate ifn production in transfected ft cells (data not shown). to this point, all the ifn induction experiments were done with transient transfection of dna vectors, and it was possible that certain features of the double-stranded plasmid dna were responsible for ifn induction.
we first tried to address this point by transfecting just the shrna-expressing cassette, generated either by pcr or restriction enzyme digestion, into ft cells and confirming that these fragments of ~ bp were sufficient to trigger ifn induction (supplementary figure s ). to definitively rule out any contribution by dsdna, we used a lentiviral transduction system, which has been suggested to express shrnas that can escape detection by prrs and avoid ifn activation ( ). we produced lentiviral particles containing shrnas from ft cells using standard methods, centrifuged them to separate the vectors from the ifn-containing media, and then used them to infect naïve ft cells ( figure a ). both sh-b and sh-pcaf vectors induced ifn production when delivered as concentrated lentiviral particles, measured both by hcv suppression ( figure b ) and by oas induction ( figure c ) in huh- cells. to rule out the possibility that residual ifn in the concentrated viral particles was responsible for these results, we added u/ml ifn to the negative control vector sample before the concentration step. this preparation, designated sh-ntc*, was not able to trigger ifn production in naïve ft cells, suggesting that the concentration step effectively removed the soluble ifn from the viral particle pellet. proper knockdown of the sirna target of sh-b was confirmed by this route of shrna delivery ( figure d ).
figure caption: sh-b expressed from an h promoter triggers ifn activation. sh-b expressed from an h promoter was capable of (a) activating ifn-b promoter and (b) triggering ifn production to inhibit hcv replication in gs cells. (c) intracellular levels of u -and h -driven sh-b products. rna extraction and northern blotting were performed as described in figure b . (d) knockdown of cypb expression by sh-b expressed from an h promoter.
to prove definitively that ifn induction by the shrnas was mediated by the lentiviral infection route, we tested the effect of an inhibitor of hiv reverse transcriptase, nevirapine, on ifn induction by sh-b and sh-pcaf. as shown in figure e , inclusion of nevirapine at the time of transduction effectively blocked the ability of both shrnas to induce ifn in the transduced cells, suggesting the importance of the reverse transcription step in the expression of the shrnas delivered by the lentiviruses. to determine whether lentiviral vector-delivered shrna can trigger ifn induction in cells other than ft cells, we transduced a human hepatoma cell line, lh , which has been reported to produce ifn upon viral infection ( ) , and examined ifn induction in these cells. culture medium from lh cells transduced with sh-pcaf contained biologically active ifn, which suppressed hcv replication in gs cells ( figure f ), indicating that the ability of shrnas delivered by lentivirus to induce ifn response was not limited to ft cells. it has been reported that certain chemically synthesized and phage polymerase in vitro transcribed sirnas can non-specifically induce ifn responses and produce offtarget effect via various prrs, including tlrs. however, the induction of ifn response by shrnas and its underlying mechanisms have not been as well studied. the actual number of shrnas that are capable of triggering ifn response will certainly be larger than the few that have been reported in the literature, yet very little is known about the unique characteristics of the select shrnas and the pathway that they use to activate ifn production. the present study identifies rig-i, but not mda or tlr , as the mediator for activation of ifn responses by two shrnas that are distinct in sequence and structure but both capable of ifn induction in human cells. 
this was demonstrated by induction of irf- dimerization, activation of ifn promoters, induction of endogenous isgs (isg , oas and rig-i), and secretion of ifn, all of which depended on rig-i and its downstream adaptor, mavs. in addition, we show that delivery of these shrnas via lentiviral transduction does not reduce their ifn-inducing capacity, indicating that the ability of lentiviral vector transduction to avoid ifn induction by shrnas, as reported previously ( ), may not apply to all shrnas. specific recognition of dsrnas or ssrnas bearing -triphosphates by rig-i is presumably determined mostly by structural features other than the nucleotide sequence of the rna. yet ifn activation by sh-b exhibited a stringent dependence on specific nucleotides at multiple positions of the shrna. an aa dinucleotide at the beginning of the u transcript has previously been suggested to result in aberrant transcription, and preserving a c/g sequence at positions − /+ has been suggested to avert ifn induction ( ). we indeed observed a strict requirement for an adenylate at the + position of sh-b for rig-i recognition and ifn activation, but we observed no difference in expression levels or the apparent sizes of the sh-b rnas bearing either an a or a g at the + position. furthermore, mutations introduced elsewhere in the shrna also abolished or diminished sh-b 's ability to activate ifn, suggesting additional sequence requirements for efficient rig-i recognition and ifn triggering. despite these results, because we were not successful in cloning and sequencing the vector-expressed sirna, we cannot exclude the possibility that the adenylate at the + position interferes with transcription and that the resultant abnormal transcript contributes to ifn induction.
interestingly, the loop a mutant, which contains a predicted loop of nt, generated a sirna duplex inside the cells that is slightly smaller than that of the shrnas with a wild-type hairpin loop, suggesting the processing by dicer into the stem, perhaps fulfilling the requirement of a length of nt for the hairpin loop ( ) . this mutant form of sh-b was not, however, able to trigger ifn activation. despite the abilities of both sh-b and sh-pcaf to activate the rig-i pathway, the two shrnas are unrelated in sequence. two short stretches of sirna sequences, guccuuccaa and ugugu, that have been previously defined as ifn-or cytokine-activating motifs ( , ) are not found in either sh-b or sh-pcaf. any common sequence motifs of ifn-activating shrnas, if any, remain to be defined. the two shrnas also differ in that one is predicted to contain one blunt end and the other two ends with overhangs. these results suggest that, although blunt ends may increase sirna's ability to be recognized by rig-i ( ), they are not required for ifn activation by an endogenously expressed shrna. the best-characterized rna structure motif recognized by rig-i is the -ppp, which is absent from virtually all the cellular rnas as a result of either -capping or internal cleavage before their appearance in the cytoplasm. a synthetic shrna that has the same sequence as sh-b but lacks the -ppp failed to induce ifn, suggesting the -end status of the intracellularly expressed sh-b contributes to ifn activation. whether or not the -end of an shrna is capped has not been investigated. murine u rna does not contain the trimethylguanosine cap that is present on mrnas and other u small nuclear rnas; instead it contains a g-monomethyl phosphate cap at its -end ( ) . capping of heterologous transcripts produced from the mu promoter, however, requires a stem loop at the -end of the transcript and an auauac sequence immediately after ( ) . 
most shrnas, including sh-b and sh-pcaf, would not meet these requirements and thus should contain unmodified -ppp. similarly, no evidence of a cap structure for h transcripts could be found in the literature. we attempted to express sh-b using a mirna expression cassette and the pol ii promoter ( ) . the primary transcript generated with this construct would be capped at -end by a trimethylguanosine cap and the final sirna duplex would bear a monophosphate at the -ends of both strands because of drosha and dicer cleavage. this version of the sh-b vector was much weaker in its ability to trigger ifn activation. unfortunately the intracellular expression of the rna duplex was also much weaker and barely detectable by northern blotting. in addition, no knockdown of the target cypb mrna was seen with this mirna-based sh-b (data not shown). as a result, whether sh-b , if expressed at higher level from this construct, could effectively activate ifn remains unclear. so far as we know, ours is the first report of ifn activation in the target cells by shrnas delivered by lentiviral transduction. a previous report of ifn induction by lentiviral vector-expressed shrna only examined the ifn generated in the vector-producing cells, which then up-regulated ifn-stimulated genes in the transduced cells ( ) . the distinction is important as lentiviral vectors used in a gene-therapy setting will likely be purified and free of any ifn that has been generated during the vector preparation step, but ifn activation in the target cells would pose a more serious concern. our data suggest the importance of screening shrnas for ifn induction in the transduced cells in vitro before largescale studies. an hiv reverse transcriptase inhibitor efficiently blocked ifn production by both sh-b and sh-pcaf when delivered by transduction, indicating the virion-encapsulated rna was not able to trigger ifn activation. 
in this respect, it is interesting to note that positive-stranded rna viruses, which produce dsrna intermediates in the cytoplasm during replication ( ) ( ) ( ) ( ), often replicate in membrane-enclosed vesicles ( ). this sequestration of viral dsrna in membranous structures may shield the rna from the cytoplasmic prrs and contribute to a successful infection. ifn induction and rnai by shrnas appear to be independent functions of the same rna ( ). our results also showed that ifn induction by sh-b is independent of its ability to suppress target mrna expression through rnai. on the other hand, it might be possible to screen for dual-functional sirnas that confer therapeutic benefits by both rnai and immunostimulation ( ). for example, sirnas that target either viral genomes or cellular cofactors of the viruses can be screened for their ability to trigger ifn activation, in the hope of finding 'super sirnas' with increased efficacy against ifn-sensitive viruses.
an rna-directed nuclease mediates post-transcriptional gene silencing in drosophila cells single-stranded antisense sirnas guide target rna cleavage in rnai human risc couples microrna biogenesis and posttranscriptional gene silencing pathogen recognition and innate immunity activation of the interferon system by short-interfering rnas small interfering rnas mediate sequence-independent gene suppression and induce immune activation by signaling through toll-like receptor interferon induction by sirnas and ssrnas synthesized by phage polymerase sequence-specific potent induction of ifn-alpha by short interfering rna in plasmacytoid dendritic cells through tlr sequence-dependent stimulation of the mammalian innate immune response by synthetic sirna induction of an interferon response by rnai vectors in mammalian cells -triphosphate-dependent activation of pkr by rnas with short stem-loops the rna helicase rig-i has an essential function in doublestranded rna-induced innate antiviral responses differential roles of
mda and rig-i helicases in the recognition of rna viruses essential role of mda- in type i ifn responses to polyriboinosinic : polyribocytidylic acid and encephalomyocarditis picornavirus ips- , an adaptor triggering rig-i-and mda -mediated type i interferon induction cardif is an adaptor protein in the rig-i antiviral pathway and is targeted by hepatitis c virus identification and characterization of mavs, a mitochondrial antiviral signaling protein that activates nf-kappab and irf visa is an adapter protein required for virus-triggered ifn-beta signaling the v proteins of paramyxoviruses bind the ifn-inducible rna helicase, mda- , and inhibit its activation of the ifn-beta promoter shared and unique functions of the dexd/h-box helicases rig-i, mda , and lgp in antiviral innate immunity polyinosinicpolycytidylic acid induces the expression of gro-alpha in beas- b cells ebola virus vp protein binds double-stranded rna and inhibits alpha/beta interferon production induced by rig-i signaling expression of ip- /cxcl is upregulated by double-stranded rna in beas- b bronchial epithelial cells rig-i-mediated antiviral responses to single-stranded rna bearing -phosphates aim recognizes cytosolic dsdna and forms a caspase- -activating inflammasome with asc aim activates the inflammasome and cell death in response to cytoplasmic dna an orthogonal proteomic-genomic screen identifies aim as a cytoplasmic dna sensor for the inflammasome hin- proteins regulate caspase activation in response to foreign cytoplasmic dna dai (dlm- /zbp ) is a cytosolic dna sensor and an activator of innate immune response rna polymerase iii detects cytosolic dna and induces type i interferons through the rig-i pathway characterization of hepatitis c virus subgenomic replicon resistance to cyclosporine in vitro hepatitis c virus triggers apoptosis of a newly developed hepatoma cell line through antiviral defense system defective jak-stat activation in hepatoma cells is associated with hepatitis c 
viral ifn-alpha resistance induction of irf- /- kinase and nf-kappab in response to double-stranded rna and virus infection: common and unique pathways a kaposi's sarcoma-associated herpesviral protein inhibits virus-mediated induction of type i interferon by blocking irf- phosphorylation and nuclear accumulation identification of cellular cofactors for human immunodeficiency virus replication via a ribozyme-based genomics approach determinants of interferonstimulated gene induction by rnai vectors gb virus b disrupts rig-i signaling by ns / a-mediated cleavage of the adaptor protein mavs distinct poly(i-c) and virus-activated signaling pathways leading to interferon-beta production in hepatocytes cyclophilin a is an essential cofactor for hepatitis c virus infection and the principal mediator of cyclosporine resistance in vitro recognition of double-stranded rna and activation of nf-kappab by toll-like receptor effect of cell growth on hepatitis c virus (hcv) replication and a mechanism of cell confluencebased inhibition of hcv rna and protein expression antitumor nk activation induced by the toll-like receptor -ticam- (trif) pathway in myeloid dendritic cells central role of interferon regulatory factor- (irf- ) in controlling retinoic acid inducible gene-i (rig-i) expression a system for stable expression of short interfering rnas in mammalian cells a structural basis for discriminating between self and nonself double-stranded rnas in mammalian cells stable expression of shrnas in human cd + progenitor cells can avoid induction of interferon responses to sirnas in vitro gamma-monomethyl phosphate: a cap structure in spliceosomal u small nuclear rna capping of mammalian u small nuclear rna in vitro is directed by a conserved stem-loop and auauac sequence: conversion of a noncapped rna into a capped rna a lentiviral microrna-based system for single-copy polymerase ii-regulated rna interference in mammalian cells visualization of double-stranded rna in cells 
supporting hepatitis c virus rna replication double-stranded rna is produced by positive-strand rna viruses and dna viruses but not in detectable amounts by negative-strand rna viruses subcellular localization and membrane topology of the dengue virus type non-structural protein b sars-coronavirus replication is supported by a reticulovesicular network of modified endoplasmic reticulum seeking membranes: positive-strand rna virus replication complexes sirna and isrna: two edges of one sword -triphosphate-sirna: turning gene silencing and rig-i activation against melanoma design of hiv vectors for efficient gene delivery into human hematopoietic cells
the authors thank dr andre irsigler and dr jason robotham for technical assistance and dr anne b. thistle for proofreading the manuscript.
supplementary data are available at nar online.
conflict of interest statement. none declared.
key: cord- - zei o authors: deng, zengqin; lehmann, kathleen c.; li, xiaorong; feng, chong; wang, guoqiang; zhang, qi; qi, xiaoxuan; yu, lin; zhang, xingliang; feng, wenhai; wu, wei; gong, peng; tao, ye; posthuma, clara c.; snijder, eric j.; gorbalenya, alexander e.; chen, zhongzhou title: structural basis for the regulatory function of a complex zinc-binding domain in a replicative arterivirus helicase resembling a nonsense-mediated mrna decay helicase date: - - journal: nucleic acids res doi: . /nar/gkt sha: doc_id: cord_uid: zei o
all positive-stranded rna viruses with genomes >∼ kb encode helicases, which generally are poorly characterized. the core of the nidovirus superfamily helicase (hel ) is associated with a unique n-terminal zinc-binding domain (zbd) that was previously implicated in helicase regulation, genome replication and subgenomic mrna synthesis. the high-resolution structure of the arterivirus helicase (nsp ), alone and in complex with a polynucleotide substrate, now provides first insights into the structural basis for nidovirus helicase function.
a previously uncharacterized domain b connects hel domains a and a to a long linker of zbd, which further consists of a novel ring-like module and treble-clef zinc finger, together coordinating three zn atoms. on substrate binding, major conformational changes were evident outside the hel domains, notably in domain b. structural characterization, mutagenesis and biochemistry revealed that helicase activity depends on the extensive relay of interactions between the zbd and hel domains. the arterivirus helicase structurally resembles the cellular upf helicase, suggesting that nidoviruses may also use their helicases for post-transcriptional quality control of their large rna genomes. helicases and nucleic acid translocases are atp-dependent motor proteins capable of moving along their nucleic acid substrates while either unwinding duplexed regions (helicases) or performing other functions (translocases), including protein displacement and the nucleation of larger rna-protein complexes ( , ) . these enzymes are known to be critical players in a wide variety of biological processes and are encoded by all organisms, as well as positive-stranded rna (+rna) viruses with genomes larger than about kb [( ); for reviews, see ( ) ( ) ( ) ]. on the basis of sequence comparisons, helicases/translocases have been classified into six superfamilies (sf to sf ) ( , ) , with +rna viral helicases belonging to sf , sf or sf . based on the direction of translocation, helicases of various superfamilies have been divided into (biochemical) classes a and b, which translocate along their nucleic acid substrates in the - or - direction, respectively ( ) . in the case of sf helicases ( , ) , structurally characterized cellular enzymes of class b (sf b) are further divided into the phylogenetically compact pif -like (pif , recd ), uvrd/rep and upf -like (upf , ighmbp ) groups, with the latter being able to unwind both dna and rna duplexes ( ) . 
helicase sf also includes a large number of (putative) helicases from a dozen +rna virus families belonging to two diverse phylogenetic lineages, known as the alphavirus-like (or sindbis virus-like) supergroup ( ) and the order nidovirales ( ). more detailed studies on the sf helicases of two alphavirus-like viruses have recently been published. the helicase domain of the dendrolimus punctatus tetravirus (an insect virus from the alphatetraviridae family) was found to have dsrna-unwinding activity with - directionality ( ). the helicase domain of the plant virus tomato mosaic virus (tomv; family virgaviridae) was not characterized enzymatically, but its crystal structure revealed the two canonical reca-like α/β domains ( a and a) of the helicase core ( ). accessory domain insertions, an otherwise frequently observed phenomenon among cellular sf helicases, are lacking in the tomv helicase. the sf helicases of nidoviruses (hel ), one of which is the focus of this study, have been characterized in some detail using bioinformatics, molecular genetics and biochemistry (see below), but structural information has been lacking thus far. nidoviruses constitute an order of +rna viruses composed of virus groups targeting a wide variety of mammalian, avian and invertebrate hosts. in mammals, nidovirus infection can be associated with severe respiratory disease, as in the case of porcine reproductive and respiratory syndrome (prrs) ( ), one of the leading swine diseases (caused by arteriviruses), and zoonotic coronavirus infections in humans, like severe acute respiratory syndrome ( ) and middle east respiratory syndrome ( ). the continuing outbreak of the latter disease is currently attracting worldwide attention, in particular because of its ~ % case fatality rate. besides their pathogenic properties, nidoviruses have been studied for their extraordinarily large rna genomes: even the shortest nidovirus genome (the .
-kb rna of the arterivirus equine arteritis virus, eav) outranks almost all other mammalian +rna virus genomes, whereas coronavirus genomes ( . - . kb) are larger than those of any other rna virus group. their large genome size enabled nidoviruses to evolve substantial genetic complexity, which is evident from (among other properties) the acquisition of a variety of enzymatic activities and accessory proteins, many of which are lacking or rare in other +rna viruses ( ) . these proteins appear to contribute to the regulation of the complex rna synthesis of nidoviruses, which occurs exclusively in the cytoplasm of the infected cell, and to the elaborate array of virus-host interactions needed to support efficient virus replication ( , ) . for example, nidoviruses with genomes > kb use a proofreading - rna exonuclease that is proposed to promote the fidelity of viral rna synthesis ( , ( ) ( ) ( ) ( ) ( ) ( ) ( ) . however, it is completely unknown whether and how nidoviruses deal with translational quality control during the expression of their large multicistronic genomic rnas, which also serve as mrnas for the synthesis of the viral replicative enzymes. compared with other +rna viruses, nidovirus replicase genes encode an exceptionally large number of nonstructural proteins (nsps) ( , , , ) . nidovirus nsps are expressed from open reading frames (orfs) a and b, which make up the -proximal - % of the genome rna. orf a encodes polyprotein a (pp a; size ranging from to aa) and following a - ribosomal frameshift pp a can be extended with the orf bencoded polyprotein to give pp ab ( - aa) ( ) (supplementary figure s ). both polyproteins are subject to extensive proteolytic processing by multiple, internally encoded proteinases ( , ) . 
the nidovirus replicase backbone consists of a conserved array of domains, arranged in a nidovirus-specific order and including the orf b-encoded rna-dependent rna polymerase (rdrp) and hel domains, the core enzymes needed for genome rna synthesis (replication) and subgenomic (sg) mrna production (transcription). the latter process yields an extensive nested set of sg mrnas, which are used to express up to a dozen structural and accessory proteins from smaller orfs in the -proximal part of the genome ( ) ( ) ( ) . in both corona-and arteriviruses, sg mrnas contain a common leader sequence that is identical to the end of the genome. their generation from sg negative-stranded templates involves a mechanism of discontinuous negative strand rna synthesis ( , ) . previous studies identified the nsp carrying rna helicase activity (arterivirus nsp and coronavirus nsp ) as one of the two most evolutionarily conserved nidovirus proteins. biochemical studies using recombinant arterivirus and coronavirus helicases revealed similar enzymatic properties, including nucleic acid-stimulated atpase and - duplex unwinding activities on both rna and dna substrates containing single-stranded regions ( , ) . a unique nidovirus helicase feature is the presence of an n-terminal (predicted) complex zincbinding domain (zbd) of - residues. zbd includes or conserved cys/his residues ( ) and is a nidoviral genetic marker not found in any other rna virus group ( ) . zbd is separated from the downstream hel domain by an uncharacterized domain that varies in size and sequence between arteri-and coronaviruses ( ) . for the arterivirus prototype eav, the significance of the nsp zbd was evaluated extensively using sitedirected mutagenesis in combination with biochemical assays and reverse genetics. 
amino acid substitutions in zbd or the adjacent 'spacer' that connects it to the downstream domain can profoundly affect eav helicase activity and rna synthesis, with most replacements of conserved cys or his residues yielding replication-negative virus phenotypes ( , ). intriguingly, some mutations in the spacer region selectively inactivated transcription, while not affecting replication ( , ), strongly suggesting a specific role for nsp in the unique mechanism of discontinuous sg rna synthesis. despite its importance as a key replicative enzyme and antiviral drug target ( ), no d structural information has been reported for any nidovirus helicase. to understand the regulatory role of zbd and the protein's interaction with nucleic acids, we characterized the structure of a helicase-competent derivative of eav nsp , alone and in complex with poly(dt). the multi-domain nsp includes the canonical a and a core domains of a sf helicase, a flexible accessory domain that is sensitive to nucleic acid binding, and a complex zbd displaying a novel structural organization. strikingly, the protein was found to bear structural resemblance to the eukaryotic upf helicases, which are multi-domain proteins involved in rna quality control, including nonsense-mediated mrna decay ( ). thus, our study not only highlights how nidovirus helicase activity depends on the extensive relay of interactions between the zbd, accessory and hel domains but also provides a framework to propose and explore a role for the enzyme in the post-transcriptional quality control of nidovirus rnas. nsp of the eav-bucyrus isolate (ncbi reference sequence nc_ ) is composed of amino acids - of replicase pp ab, which will throughout this study be referred to as nsp residues - . the full-length nsp sequence or a c-terminally truncated version comprising residues - (nsp Δ) was cloned into a modified pet a vector with a tobacco etch virus (tev) protease cleavage site.
mutations were generated using the quikchange protocol and confirmed by dna sequencing. the proteins were overexpressed at c in escherichia coli strain bl (de ) grown to an od of ~ . in luria-bertani medium in the presence of mg/ml kanamycin. protein expression was induced with . mm isopropyl β-d- -thiogalactopyranoside for h at c. cell pellets were resuspended in lysis buffer ( mm hepes, ph . , for nsp Δ or ph . for full-length nsp , mm nacl and mm imidazole), supplemented with protease inhibitor cocktail (roche) and disrupted by sonication. lysates were clarified at g for min and the soluble fraction was applied to a ni + chelating column. after sample loading, the column was washed ( mm hepes, ph . or . , mm nacl and mm imidazole) and the protein was eluted ( mm hepes, ph . or . , mm nacl and mm imidazole). proteins intended for atpase or helicase assays were dialysed against storage buffer ( mm hepes, ph . or . , mm nacl, % glycerol) and stored at − c. truncated protein for crystallization studies was digested with % (w/w) tev protease to remove the his-tag. further purification was performed by size-exclusion chromatography using a superdex column (ge healthcare) with gf buffer ( mm hepes, ph . , mm nacl). the peak fraction was collected and analysed by sodium dodecyl sulphate-polyacrylamide gel electrophoresis. purified nsp Δ was concentrated to mg/ml and initial crystallization trials were performed at c using the sitting-drop vapour-diffusion method by mixing ml of protein solution with ml of reservoir solution. the conditions were then optimized and high-quality crystals were obtained in . m (nh ) so , . m hepes, ph . , mm kcl and % ethylene glycol. to obtain crystals of the protein-dna complex, purified protein and partially double-stranded dna with a single-stranded poly-thymidine overhang (the two partially complementary sequences were -ttttttttttgcagtgctcg- and -cgcgagcactgc- ) were mixed in a : . molar ratio and incubated at c overnight.
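the geometry of the partially double-stranded dna substrate described above can be checked with a short script. the following is a minimal python sketch, not part of the published methods. it assumes that both oligonucleotides are written in the same (conventionally 5'-to-3') orientation and that the first sequence is continuous (any internal space being an extraction artifact). the function names are hypothetical.

```python
# Complement table for DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def paired_length(long_strand: str, short_strand: str) -> int:
    """Longest stretch at the 3' end of long_strand that base-pairs
    (antiparallel) with the 3' end of short_strand."""
    long_strand = long_strand.upper()
    target = reverse_complement(short_strand)
    for k in range(min(len(long_strand), len(target)), 0, -1):
        if long_strand.endswith(target[:k]):
            return k
    return 0


# Sequences as given in the text (assumed orientation: 5' -> 3').
LONG = "TTTTTTTTTTGCAGTGCTCG"
SHORT = "CGCGAGCACTGC"

duplex = paired_length(LONG, SHORT)
long_overhang = LONG[: len(LONG) - duplex]     # unpaired part of the long strand
short_overhang = SHORT[: len(SHORT) - duplex]  # unpaired part of the short strand

print(duplex, long_overhang, short_overhang)  # 10 TTTTTTTTTT CG
```

under these assumptions the two strands form a 10-bp duplex, leaving a ten-thymidine single-stranded overhang on the longer strand and two unpaired nucleotides on the shorter one, which is consistent with the description of a partially double-stranded substrate with a poly-thymidine overhang.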
the complex was further purified by size-exclusion chromatography (superdex , ge healthcare) and concentrated to mg/ml. the condition for obtaining crystals was % peg , . m hepes, ph . , and . m calcium acetate. for data collection, crystals were cryoprotected in mother liquor containing % (v/v) ethylene glycol and flash cooled to − c. the multi-wavelength anomalous diffraction (mad) data for intrinsic zinc atoms were collected on beamline w b at the beijing synchrotron radiation facility. the data for eav nsp Δ and its complex with dna were collected at beamline ne a at photon factory (kek) and beamline bl u at the shanghai synchrotron radiation facility. data were indexed, integrated and scaled using hkl ( ). data collection and processing statistics are summarized in table . the structure of nsp Δ was determined by the mad method. initial phases were calculated by solve, and phases were subsequently improved using resolve ( ). the figure of merit from the mad phasing was . and the z score was . . several segments of the protein could be automatically modelled into the electron-density map by resolve, although in part only as poly-alanine chains. manual rebuilding was performed in coot ( ), and refinement was performed with refmac ( ). further rounds of refinement were done with translation/libration/screw (tls) refinement ( ). the structure was refined to . Å with an r work of . % and an r free of . %. using the structure of free nsp Δ without domain b as input model, the structure of nsp Δ in complex with dna was successfully solved by molecular replacement. the initial model was obtained by molrep from the ccp program suite ( ). a good match for domains zbd, a and a with electron density was found. domain b was manually added with the aid of f o -f c and f o -f c maps using coot ( ). dna molecules were included in the final stages of refinement. difference fourier maps clearly showed electron densities for seven bound deoxyribonucleotides.
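for readers less familiar with the r work and r free statistics quoted for these refinements, the following python sketch shows the standard crystallographic residual, R = Σ| |F_obs| − |F_calc| | / Σ|F_obs|, computed separately over the reflections used in refinement (work set) and over a small held-out set (free set). it is an illustration only and does not reproduce the refinement programs used here. it assumes the calculated amplitudes are already on the same scale as the observed ones, and the amplitudes and flags in the example are invented toy values, not data from this study.

```python
def r_factor(f_obs, f_calc):
    """Crystallographic R-factor: sum(| |Fo| - |Fc| |) / sum(|Fo|).

    Assumes f_calc has already been scaled to f_obs.
    """
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("sum of observed amplitudes is zero")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, free_flags):
    """Split reflections by free_flags and return (R_work, R_free)."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if free]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work])
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free])
    return r_work, r_free


# Toy amplitudes (hypothetical); the last reflection is held out as "free".
f_obs = [100.0, 50.0, 80.0, 40.0]
f_calc = [90.0, 55.0, 80.0, 50.0]
free_flags = [False, False, False, True]

r_work, r_free = r_work_and_free(f_obs, f_calc, free_flags)
print(f"R_work = {r_work:.3f}, R_free = {r_free:.3f}")
```

because the free set is excluded from refinement, r free is normally somewhat higher than r work, and a large gap between the two is commonly read as a sign of over-fitting.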
the final model was refined to . Å with an r work of . % and an r free of . %. all figures in this article displaying molecular structure were made using pymol ( ). full-length eav nsp and a series of truncated variants were overexpressed in and purified from e. coli. after extensive crystallization trials, diffracting crystals could only be obtained for a truncated form of nsp (aa - ) lacking the c-terminal residues. for simplicity, we will hereafter refer to this protein as nsp Δ, which was used throughout this study unless otherwise specified. to verify that nsp Δ, which contained all characteristic sf helicase sequences (motifs), is enzymatically active, we performed in vitro enzyme assays to compare full-length and truncated nsp . in agreement with previously published results ( ), full-length nsp displayed only weak atpase activity in the absence of nucleic acid, but was strongly stimulated by the addition of poly-uridine (polyu). in the absence of polyu, nsp Δ showed a -fold higher atpase activity than the full-length protein ( figure a ), yet this increased atp turnover apparently did not translate into increased helicase activity. unwinding of a partially double-stranded dna substrate by nsp Δ was incomplete, but went to completion when using full-length nsp ( figure b ). as expected, replacement of the conserved lysine of the walker a motif, which is essential for atp hydrolysis ( ), with glutamine (mutant k q) completely abolished atpase and consequently also helicase activity. this confirmed that the observed activities could be completely attributed to the recombinant eav proteins used, rather than to potential trace amounts of contaminating bacterial enzymes.
the observed enzymatic differences between nsp and nsp Δ may be caused by the latter's truncation and could, in principle, be explained by one or more defects, such as decreased unwinding velocity and/or processivity, loss of affinity towards the substrate or uncoupling of atpase from helicase activity. the results of the atpase assay lead us to propose that the observed reduction of duplex unwinding may be due to unproductive atp hydrolysis, originating from the fact that the atpase reaction is independent of nucleic acid substrate binding. accordingly, the input atp in the nsp Δ assay may have been depleted before complete unwinding was achieved. regardless of which interpretation is correct, the c-terminal amino acids clearly are dispensable for the helicase activity of eav nsp . this result is in good agreement with the fact that the truncated protein retained all hel key domains ( figure a ) previously shown to be evolutionarily conserved and essential in both in vitro enzyme assays and in vivo studies with virus mutants.

the crystal structure of eav nsp Δ reveals a multi-domain organization of the arterivirus replicative helicase

because d structures of orthologous proteins were not available, we took advantage of the zinc-binding properties of nsp and used the zinc multiple-wavelength anomalous dispersion (mad) method ( ) to solve the eav nsp Δ structure. the presence and position of three zinc atoms were established with anomalous data collected at the zinc absorption edge ( table ). the final model included eav nsp residues - , three zinc ions in the n-terminal zbd, five sulphate ions and water molecules. two reca-like α/β domains ( a and a) form the structure's c-terminal part ( figure b ; cyan and green) and constitute the helicase core (hel ). domain a contains a parallel five-stranded β-sheet that is sandwiched by three α-helices on one side and two α-helices on the other.
domain a contains a parallel four-stranded β-sheet with five α-helices on the side facing domain a. upstream of domain a, we identified an additional domain with a characteristic β-barrel fold (figure a and b; magenta). it consists of five β-strands arranged as two tightly packed anti-parallel β-sheets and is juxtaposed to domain a ( figure b ). the location of this domain in the protein sequence and its orientation relative to the hel domain resemble those of domain b in helicases of the sf b upf -like subfamily ( figure b and c), and it was therefore named accordingly in our nsp Δ structure. the domain has no counterpart in the only other solved structure of a viral sf helicase, that from tomv ( figure a ) ( ) , whereas its counterpart in helicases of the pif -like subfamily is inserted in domain a ( figure d ) ( ) . our structure further revealed that the n-terminal zbd (figure ; yellow) has a compact fold containing three structural zinc atoms. based on secondary structure analysis with dial ( ), we could partition zbd into three elements (figure ). two adjacent and structurally different zinc fingers, an n-terminal ring-like module (residues - , pink) and a treble-clef zinc finger (residues - , red) constitute the main body of zbd. the third element is a c-terminal linker region (linker ) that includes the long loop l , which crosses the entire domain, and helix a (residues - , yellow), which connects the two zinc fingers with domain b ( figure a ). this classification is further supported by the observation that the connecting residues between the ring module and treble-clef zinc finger are disordered (supplementary figure s and figure d ).

[figure caption fragment: structure of (b) free and (c) nucleic acid-bound nsp Δ. also the f o −f c differential electron density map of the bound single-stranded part of a partially double-stranded dna substrate at . σ is presented. the putative atp binding site is shown as a brown oval.]
only out of the cys/his residues are involved in zinc binding, rather than all residues as proposed previously [( ); figure b and c]. not involved is his , which is not conserved in other arteri-and coronaviruses (supplementary figure s b) . the n-terminal ring-like module has a notable binuclear structure with a cross-brace topology involving cys and his residues that coordinate two zinc ions ( figure a) . a three-stranded antiparallel b-sheet (b -b ) sits in the centre and packs against helix a following b ( figure b ). the first zinc ion (zn ) is coordinated by four cysteine residues (cys , cys , cys and cys ) within a treble-clef zinc finger-like motif. residues cys and cys are provided by the zinc knuckle within loop l , whereas cys is positioned at the c-terminus of b and cys comes from the n-terminus of helix a . the second zinc ion (zn ) is coordinated by residues cys , cys , his and his , which are arranged in an abb zinc finger-like motif. the second pair of the zinc-coordinating residues of both zinc-binding motifs of the ring module may include both his and cys residues in other arteri-and coronaviruses. overall, the ring module of these viruses can be described by a characteristic conserved cys a -cys b -cys[his/cys] a -[his/cys] b pattern (where applicable, a and b refer to residues chelating the first and second zinc ion, respectively; brackets indicate positions at which his and cys can alternate). the c-terminal zinc finger of zbd adopts a treble-clef fold distinct from that of the ring module (see above; figure c ). two one-turn helices a and a are stabilized by a zinc atom (zn ) that is chelated by residues cys and his of a zn-knuckle within loop l , while cys and cys originate from l and a , respectively. an extensive array of hydrogen bonds is observed between the main chains of residues in loop l and thr in a ( figure d ). these multiple hydrogen-bonding interactions play a major role in the formation of a compact zinc finger. 
arteri-and coronaviruses appear to tolerate replacements (cys for his, or vice versa) at the second and fourth residues of this finger ( , ) , which can be described by the characteristic, conserved c[h/c]c[c/h] pattern. finally, linker includes only one structured element (a ), but it plays a central role in the interaction between the main body of zbd and hel , as detailed below. the structural basis for the essential role of zbd in eav nsp helicase function previously, zbd mutagenesis demonstrated the in vitro and in vivo importance of this domain for nsp enzyme activities, genome replication and transcription, and arterivirus viability. the solved structure now provides us with a structural basis for these observations. zbd packs against the hel domains through extensive hydrophobic and hydrophilic interactions ( figure a and b). specifically, residues leu , val , val , leu , pro , val , leu and trp in domain a together with residues ile , leu , leu , leu and ile from a in zbd create an extensive hydrophobic surface. the total interface area between zbd and the hel is Å , as determined by protein interfaces, surfaces and assemblies (pisa) server ( ) . a major part of this interface involves the a helix, which is located in a groove formed by two helices and a loop of domain a, while making extensive contacts to the main body of zbd and, to lesser extent, domain b ( figure ). the interface areas between a and domain a, on the one hand, and the zbd fingers (including zinc ions) on the other hand, are . and . Å , respectively. in addition, four hydrogen bonds between zbd and the hel enhance the interaction (figure b) , and a salt bridge is observed between his in zbd and asp in domain b ( figure b ). the large size of these interface surfaces and the large number of interactions suggest the existence of a signalling network through which zbd could affect both the fold and activity of hel . 
the proposed signalling network can now be used to rationalize, in a structural context, the previously reported phenotypes of eav zbd mutants carrying replacements of residues not directly involved in zn-binding. for instance, a replication-negative phenotype was described for mutant d a ( ) . it is now clear that asp forms two hydrogen bonds with the main and side chain of thr and electrostatically interacts with the side chain of his , which both belong to the ring-like zinc finger ( figure d ). replacement of asp may thus greatly reduce these interactions and disrupt zbd integrity, potentially affecting the structural integrity of the hel . another residue, ser , was probed extensively by mutagenesis after the finding that a virus mutant (eav f) carrying a s p mutation replicates its genomic rna with wild-type efficiency, while being completely defective in sg mrna synthesis ( ) . this transcription-negative phenotype was attributed to the severe structural constraints exerted by pro residues on the local conformation of the proposed hinge region, as various substitutions of ser alone (to ala, cys, gly, his, leu or thr) yielded virus mutants with a wild-type phenotype, while combining the neutral s g mutation with a p g substitution reproduced the specific defect in sg mrna synthesis ( ) . this interpretation is now further supported by the nsp Á structure in which ser and pro are located in the hinge connecting the treble-clef zinc finger and a of zbd. the main chain of ser forms three hydrogen bonds with the treble-clef thr , which is also connected to the pro side chain and lys main chain ( figure d ). owing to the unique properties of the pro residue, the ser -to-thr bonds are likely disrupted by the s p mutation, but are not affected by the alternative replacements tested. 
consequently, also owing to the main chain rigidity associated with the introduction of a pro residue, the orientation of a relative to a and/or the main body of zbd is likely affected in mutant s p, which carries adjacent pro residues at positions and . likewise, the introduction of two gly residues at these positions [double mutant s g/p g; ( ) ] probably gives rise to excessive flexibility of the hinge region, thus compromising nsp function in a similar manner. to further explore the role of zbd, we tested the effect of four mutations (c a and h a in the ring-like module; h a and c a in the treble-clef zinc finger) expected to affect the ability to bind zn , zn or zn , respectively. in agreement with the proposed structural role of these zinc ions, soluble his-tagged proteins containing these mutations could not be obtained and only low yields of gst-nsp fusion proteins carrying the same mutations could be recovered. for mutants c a and h a, band shift analysis revealed a complete loss of binding to a partially double-stranded dna substrate containing a single-stranded poly(dt) overhang (substrate dna-t ; figure c , lane - ). these results complement previous findings, showing a complete loss of both atpase and helicase activity for these mutants ( ). in contrast, the level of nucleic acid binding by mutants h a and c a was comparable with that of the wild-type protein ( figure c , lanes - ), consistent with nsp -h a retaining a limited level of atpase and helicase activity ( ) . on further testing, we observed that the addition of mm edta altered the overall conformation of nsp Á, as detected by changes in circular dichroism (supplementary figure s a) , and reduced its binding to -dna-a (supplementary figure s b ). in summary, these results reveal that zbd interacts extensively with the hel domain and that its integrity is an essential determinant of nsp Á properties in in vitro assays. 
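the conserved treble-clef pattern c[h/c]c[c/h] given above can be read as a rule on the four zinc-chelating residues taken in sequence order: cys at the first and third positions, and either cys or his at the second and fourth. a minimal python sketch of that rule follows. the function name and the one-letter input convention are assumptions made for illustration; the check applies only to the four chelating residues, not to a contiguous stretch of sequence.

```python
import re

# One-letter codes of the four chelating residues, in sequence order:
# position 1 = C, position 2 = H or C, position 3 = C, position 4 = C or H.
TREBLE_CLEF_PATTERN = re.compile(r"C[HC]C[CH]")


def matches_treble_clef_pattern(ligands: str) -> bool:
    """Return True if the four chelating residues fit the C[H/C]C[C/H] pattern.

    `ligands` is a hypothetical four-letter string such as "CHCC";
    residues between the chelating positions are not part of the input.
    """
    return TREBLE_CLEF_PATTERN.fullmatch(ligands.upper()) is not None
```

for example, the arrangement described for the eav treble-clef finger (cys, his, cys, cys) would be written as `"CHCC"` and accepted, whereas `"HCCC"` would be rejected because the first position must be cys.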
next we searched for structural similarity between eav nsp Δ and other proteins by scanning the protein data bank using the dali server ( ) . the structure of the nsp Δ hel domain was found to be most similar [z score, . ; root-mean-square deviation (rmsd), . Å ] to the helicase core of nonsense-mediated mrna decay factor upf and its homolog ighmbp (z score, . ; rmsd, . Å ), which both belong to the upf -like helicase subfamily ( ) . further comparisons revealed that this resemblance extends into the respective n-terminal zbds: the binuclear ring-like module of nsp Δ zbd was found to be most similar to the ring-like module in the ch-domain of upf ( figure e ). this similarity was rather limited (z-score of . and rmsd of . Å ) because only six out of the eight zn-chelating residues in the two domains could be juxtaposed ( figure f ) and because loops l , l and helix a in nsp Δ are shorter than the corresponding elements in upf . we did not detect significant similarity of the treble-clef zinc finger with other proteins, although we note that the upf ch-domain also has a zinc finger (but of a different fold) downstream of the ring module. thus, eav nsp zbd prototypes a novel and complex multi-domain zinc finger with distinct structural properties. on the other hand, eav nsp and upf share a similar domain organization, including structurally similar ring and helicase domains. these similarities are further enhanced by the - directionality of duplex unwinding shared by both these helicases and likely extend to other nidovirus helicases in view of the observed sequence conservation (supplementary figure s b ). we proceeded to solve the crystal structure of nsp Δ in complex with a nucleic acid substrate. nidovirus rna helicases, including eav nsp , were previously found to lack the ability to discriminate between rna and dna substrates, a property shared with only a few other helicases ( , ) .
this substrate promiscuity allowed us to use a partially double-stranded dna substrate ( dna-t ) containing a single-stranded poly(dt) overhang for crystallographic studies. the binding of this substrate was deduced from an increase of the protein's stokes radius in gel filtration chromatography (supplementary figure s ) . the binary complex diffracted to a resolution of . Å in space group p and was solved by molecular replacement (table ) . continuous electron density was found in the enzyme's binding pocket ( figure c ), which apparently corresponded to seven thymidine residues. this part presents in an extended conformation and lies in a channel formed by domains a, b and a, with its end in domain a and its end in domain a. the remaining three unpaired thymidines and the entire double-stranded portion of the substrate could not be located. the asymmetric unit contained four nsp Á-dna binary complexes with a matthews coefficient of . Å /da, corresponding to a solvent content of %. these complexes shared a remarkably similar spatial arrangement with the rmsd of their ca atoms being only . Å . several connecting residues between subdomains were missing in the structure of the complex, indicating apparent structural flexibility of these residues. nucleic acid binding induces profound conformational changes outside the hel domain of nsp " the ca atoms of domains a and a of free nsp Á and the nsp Á-dna complex can be superimposed with an rmsd of . Å , indicating that the relative orientations of these core domains are barely affected by dna binding ( figure a ). however, outside these domains, the effect of dna binding was considerable, with the rmsd between the ca atoms of the two forms of nsp Á increasing to . Å . particularly large conformational changes were observed in domain b, which rotates $ towards zbd in the nsp Á-dna complex ( figure a ). the rmsd between the ca atoms of the two forms of domain b is . 
Å , with loop residues being affected most profoundly (supplementary figure s a) . both width and height of the polynucleotide substrate channel formed by domains a and b (originally $ and Å , respectively) are increased by Å on this rotation. this reorganization makes this channel large enough to accept single-stranded nucleic acids, although it remains too narrow for a nucleic acid duplex ( figure b ). consequently, double-stranded nucleic acids must be unwound at the entrance of the substrate channel to let a single-stranded chain enter. besides this large conformational change, temperature factor calculations suggest that the regions at the surface of domain b not directly involved in dna binding may become flexible (supplementary figure s ) . for example, domain b residues arg , gly and ala become disordered after dna binding ( figures c and b and supplementary figure s ). on dna binding, a structural change was also observed in the treble-clef zinc finger of zbd, as reflected by its relatively high temperature factor (compared with that of domains a and a) in the nsp Á-dna complex as opposed to nsp Á alone (supplementary figure s ) . as outlined above, the single-stranded part of the dna substrate is bound to a nucleic acid-binding channel formed by domains a, b and a ( figure c ). the backbone phosphates of the poly(dt) are located on top of domains a and a, with the thymine bases exposed to the solvent (supplementary figure s a) . the majority of contacts with the bound dna are made via the phosphodiester backbone and non-specific protein-base interactions as depicted in figure . consistent with this observation, the base orientation varies in the four eav nsp -poly(dt) complexes of the asymmetric unit, while the position of the dna backbone is rather rigid (supplementary figure s b and c) . several key residues from domains a and a contact the dna backbone in the channel of the protein (figure a and b) . 
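two of the quantities quoted above for the complex are simple to compute: the rmsd between matched ca atoms of two structures, and the matthews coefficient with its implied solvent content. the python sketch below illustrates both under stated assumptions. the rmsd function expects coordinates that have already been superimposed and paired atom by atom (no fitting is done here). the solvent estimate uses the common protein-only approximation 1 − 1.23/v m, which is only a rough guide for a protein-dna complex. all names and any input values are hypothetical and are not taken from the deposited structures.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def ca_rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """RMSD (in the input units, e.g. Å) between paired, pre-superimposed atoms."""
    if len(coords_a) != len(coords_b) or len(coords_a) == 0:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def matthews_coefficient(
    cell_volume_a3: float, asu_per_cell: int, mass_per_asu_da: float
) -> float:
    """V_M in Å^3/Da: unit-cell volume divided by the total mass in the cell.

    mass_per_asu_da is the summed molecular mass of everything in one
    asymmetric unit (for example, several complexes).
    """
    if asu_per_cell <= 0 or mass_per_asu_da <= 0:
        raise ValueError("asu_per_cell and mass_per_asu_da must be positive")
    return cell_volume_a3 / (asu_per_cell * mass_per_asu_da)


def solvent_fraction(v_m: float, protein_constant: float = 1.23) -> float:
    """Approximate solvent fraction, 1 - 1.23 / V_M (protein-only assumption)."""
    if v_m <= 0:
        raise ValueError("v_m must be positive")
    return 1.0 - protein_constant / v_m
```

as a purely illustrative case, a cell volume of 1,000,000 Å^3 holding 4 asymmetric units of 100,000 da each gives v m = 2.5 Å^3/da and a solvent fraction of about 0.51. an rmsd computed this way depends on which atoms are paired, which is why the values for the helicase core and for the whole protein differ.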
base t , the most one, is exposed to the solvent and protrudes outwards, causing a bend in the dna backbone between t and t . the bases t and t as well as t and t stack with each other at an average distance of . Å . in contrast, base t is almost perpendicular to t , with its edge exposed to protein side chains that make specific contacts. val in domain a forms van der waals contacts with the base and the sugar ring of t and thus stabilizes the dna conformation. moreover, the binding is stabilized by several hydrogen bonds between his , his , thr , ser and the backbone of the dna, and by van der waals contacts between thr , leu , val , tyr and the phosphate groups of the dna. while the interactions described above do not involve specific bases, six further interactions specific for thymine were found. for example, the backbone nh of arg forms a hydrogen bond with the o atom of t . the o and o atoms of base t form hydrogen bonds with the side chains of asp and tyr . also, several residues, such as arg and gln , interact with both the base and the sugar ring. however, no interaction was observed between nsp Á and position c of the ribose ring of the dna substrate. this observation may explain why eav nsp has the ability to unwind both dna and rna, in agreement with the substrate specificity observed for other helicases ( , ) possessing or lacking the ability to interact with the oh moiety of the rna backbone. among +rna viruses, whose rdrps generally have a high error rate, nidoviruses stand out for their large to very large genome size ( - kb) . consequently, the replication fidelity of nidoviruses, in particular coronaviruses, has been the subject of intense study. most recently, the identification of a unique -to- exoribonuclease (exon) activity has provided the basis for the hypothesis that a primitive proofreading mechanism operates to promote the fidelity of rna-dependent rna synthesis in nidoviruses with > kb genomes ( ) ( ) ( ) ( ) ( ) ( ) ( ) . 
despite this recent progress, the two central subunits of the nidovirus replicase, the rdrp and the unique zbd-containing rna helicase, have remained poorly characterized, partly owing to the lack of structural information. remarkably, our present analysis of the arterivirus helicase structure revealed a number of important similarities with upf helicases, eukaryotic enzymes involved in quality control of rnas through multiple pathways, including nonsense-mediated mrna decay ( ) ( ) ( ) . in contrast to the exon-driven control of replication fidelity (see above), the possibility of post-transcriptional quality control of nidovirus mrnas has not been considered thus far. yet, replicase orf ab is extremely large (from to > codons) and its correct expression by translation of the viral genome is a critical first step in the production of the enzymes directing genome replication and expression. therefore, our study not only provides the first insights into the structural basis for nidovirus rna helicase function, but also creates a basis to propose a role for this protein in the post-transcriptional quality control of viral mrnas. this role may be common to all nidoviruses, regardless of their genome size, which would distinguish it from the exon-based proofreading mechanism that appears to be restricted to nidoviruses with a > kb genome. on the time scale of nidovirus evolution, the acquisition of zbd-hel may have been a critical event to facilitate the genome expansion of ancestral small-sized nidoviruses, thus setting the stage for the subsequent exon-driven expansion towards even larger nidovirus genomes ( , ) .

[figure caption fragment: figure a . note that the dna in figure c was extracted from the complex structure of the dna-bound state.]

previously, using bioinformatics, biochemistry and molecular genetics, it was established that nsp of arteriviruses and its orthologs in other nidoviruses are multi-domain proteins.
of its domains, zbd and the hel domains are critical for the enzyme's atpase and helicase activities in vitro and for the regulation of viral replication and transcription in infected cells. our structural and biochemical studies extended the characterization of known domains and delineated two hitherto uncharacterized domains: one (domain b) flanked by zbd and hel , and the other (c-terminal domain) located downstream of the hel , with its structure remaining to be solved. our data show that, along with zbd, these two non-enzymatic domains may regulate hel function. given that nsp /nsp is one of only three proteins whose nidovirus-wide conservation can be detected at the sequence level ( , , , ) , the nsp Á structure should be applicable to other nidovirus helicases, including those of prrs viruses and coronaviruses. however, considerable size differences exist between arteri-and coronaviruses in the most conserved zbd and hel domains, whereas the b and c-terminal domains lack appreciable sequence conservation. thus, helicase structures from other small-and large-genome nidoviruses will be required to fully understand the enzyme's function. the nsp c-terminal domain: coupling atpase and helicase activities? while attempting to solve the eav nsp structure, we were confronted with the low stability of the full-length recombinant protein expressed in e. coli. we solved this problem by characterizing the c-terminally truncated nsp Á, which lacks the residues (c-terminal domain) downstream of the known hel motifs. this protein was found to bind partially double-stranded dna and display the previously reported in vitro atpase and helicase activities. because, compared with full-length nsp , nsp Á appeared to be somewhat more active as an atpase but somewhat less active as a helicase, the c-terminal truncation may have affected the coupling of these two enzymatic activities. 
this suggests that the c-terminal domain may have evolved to (co)regulate nsp helicase-mediated functions in vivo, implying that it must be able to communicate with the nsp active site. this could be achieved either directly, by interacting with the nucleic acid- or atp-binding site (the nsp Δ c-terminus is ~ . Å from the active centre; figure c ), or indirectly, through a protein signal transduction network. importantly, the c-terminal domain is poorly conserved among arteri- and coronaviruses in terms of both sequence and size (supplementary figure s a , and data not shown), arguing that such a putative regulatory function could be executed in a virus- and, possibly, host-specific manner.

the nsp structure: defining a complex zbd

our characterization of the eav nsp structure verified and revised a model of the n-terminal zbd based on prior studies ( , , ) . it resolved the uncertainty about the number of zinc ions bound (now established to be three) and the fold of this domain (a unique structure combining a ring-like module fused with a treble-clef zinc finger). furthermore, it redefined the c-terminal border of zbd and placed it residues downstream to include a third hitherto unrecognized structural element (helix a ). previously, we analysed a variety of eav nsp zbd mutants in which putative zinc-binding residues were replaced in a manner (cys→his or his→cys) that could preserve zinc binding ( , ) . from the solved structure, it is now apparent that the replication-negative phenotypes of these virus mutants can likely be attributed to the detrimental impact of the respective mutations on zbd integrity and, through the extensive interaction network, hel domains. it presently remains unclear why the replacement of his by cys in the treble-clef zinc finger was partially tolerated.
on the other hand, structural superposition of the ring-like modules of nsp and hupf ( figure e ) reveals how the only other similarly tolerated replacement ( , ) , that of the zn -coordinating cys by his (found in the equivalent position in hupf ), could be accommodated by nsp . the ring-like module of upf also shares structural similarity with ring-box domains of e ubiquitin ligases ( ) and the involvement of this module in self-ubiquitination of upf was indeed demonstrated ( ) . it would be interesting to see whether these results are relevant for nsp and its zbd. recently, arterivirus papain-like protease was found to have deubiquitinase activity, which suppresses the innate immune response in infected host cells ( , ) .

the nsp -nucleic acid complex: towards the dsrna unwinding mechanism

to understand how nsp unwinds its natural dsrna substrates, we analysed a complex of nsp Δ with a partially double-stranded dna substrate. only seven thymidine residues could be confirmed in the structure of that complex ( figure c ). the dna-bound nsp Δ structure revealed two possible rna-binding clefts at the surface of nsp , which are formed by domains b and zbd (named putative exit site ), and a and zbd (putative exit site ), respectively (supplementary figure s ) . both have continuous positively charged surfaces, with the latter (supplementary figure s , right panel) being sufficiently large to bind an ssrna > bp, which could be especially suited for unwinding complex secondary structures. this organization suggests that, after unwinding, one of the separated rna chains would be guided through the narrow nucleic acid substrate tunnel formed by domains a, b and a, while the path of the other strand remains to be defined. no matter which cleft is actually used for rna binding, the positively charged zbd, and especially its ring-like module, would be involved.
like the protein-binding surface of the upf c/h domain ( ) , zbd has a putative protein interaction surface composed of two major hydrophobic zones that are almost perpendicular to each other ( figure ). nucleic acid binding induced a conformational change (supplementary figure s b ) of these two zones. in addition, the temperature factor of the treble-clef zinc finger was higher and several residues are disordered in the structure of the nsp Á-dna complex (supplementary figure s ). together, these findings imply that these two zones are readily accessible for interactions with other proteins, which may further influence nucleic acid binding. substrate binding by nsp is accompanied by structural changes in domain b and the treble-clef zinc finger, which may be recognized by yet-to-be identified interaction partners modulating nsp function. the treble-clef finger is fairly distant from the bound substrate, suggesting long-distance signal transduction within nsp , possibly involving helix a , which interacts with a, b and nucleic acid and is directly connected to the treble-clef zinc finger. the flexibility of the hinge region connecting the treble-clef zinc finger and helix a is likely compromised by the previously described s p and s g/p g mutations that, importantly, were found to impair viral sg mrna synthesis but not genome replication ( ) . consequently, the described inter-domain communication channel may be used by nsp and its partners for switching from a role in genome replication to directing viral transcription, a hypothesis that will be the subject of future studies. nidovirus helicase: a role in post-transcriptional quality control of viral mrnas? the observed structural affinity between the eav nsp and upf helicases is most remarkable, in particular because it extends to include the multi-domain organization essential for helicase function. this organization is only found in upf of all eukaryotes ( ) and nidovirus helicases ( , , , ) . 
for upf , its conservation was linked to the protein's universal role in post-transcriptional quality control of eukaryotic rnas through multiple pathways, including nonsense-mediated mrna decay ( ) ( ) ( ) . upf interacts, commonly through its c/h and a domains, with proteins that can modulate its function. for the nidovirus helicase subunit, the functional basis of its domain conservation remains to be firmly established, although zbd, like the c/h domain in upf ( ), affects helicase activity ( , ) . if the nidovirus helicase possesses some of the properties of upf , this could explain the exclusive conservation of zbd in nidoviruses, which stand out for their large to very large single-stranded rna genomes. for instance, by providing post-transcriptional quality control of genomic rna, i.e. detection of nonsense and/or other mutations and elimination of defective molecules, the nidovirus helicase could alleviate the consequences of the generally low fidelity of rna virus genome replication. such a role of zbd-hel may have protected an ancestral nidovirus from the mutational meltdown of its expanding genome, similar to the proposed fixation of the proofreading exon domain at a later stage of nidovirus evolution ( , , , ) . subsequently, the enzyme would have facilitated expansion to the genome size observed in contemporary arteriviruses, and remained a critical factor in the further exon-driven genome expansion to evolve middle- and large-sized nidoviruses. thus, the proposed upf -like role of the nidovirus helicase can be accommodated in a meaningful evolutionary scenario incorporating several of the structural and functional observations made in this study. the structural similarity between nsp and upf establishes a new connection between research on viral and cellular helicases, which could be mutually insightful for understanding the evolution and function of this group of vitally important enzymes.
the coordinates and structure-factor amplitudes of eav nsp Á and eav nsp Á-dna complex have been deposited in the protein data bank with accession codes n n and n o, respectively.
the authors thank the staff at beamline ne a (kek), ssrf beamline bl u and bsrf beamline w b facilities for help with crystallographic data collection, alexander kravchenko, dmitry samborskiy and igor sidorov for viralis management. a.e.g.
thanks dr john ziebuhr for a decade-old discussion of the possible roles of mrna decay regulation in nidoviruses. supplementary data are available at nar online.conflict of interest statement. none declared. key: cord- -wupre uj authors: morgan, brittany s; forte, jordan e; hargrove, amanda e title: insights into the development of chemical probes for rna date: - - journal: nucleic acids res doi: . /nar/gky sha: doc_id: cord_uid: wupre uj over the past decade, the rna revolution has revealed thousands of non-coding rnas that are essential for cellular regulation and are misregulated in disease. while the development of methods and tools to study these rnas has been challenging, the power and promise of small molecule chemical probes is increasingly recognized. to harness existing knowledge, we compiled a list of ligands with reported activity against rna targets in biological systems (r-bind). in this survey, we examine the rna targets, design and discovery strategies, and chemical probe characterization techniques of these ligands. we discuss the applicability of current tools to identify and evaluate rna-targeted chemical probes, suggest criteria to assess the quality of rna chemical probes and targets, and propose areas where new tools are particularly needed. we anticipate that this knowledge will expedite the discovery of rna-targeted ligands and the next phase of the rna revolution. the field of rna biology has exploded in recent years with the discovery of non-coding rnas (ncrnas) that regulate essential processes in all living organisms ( ) . these processes include transcription, translation, and evasion in bacteria and archaea ( , ) as well as replication, persistence, and cellular transformation in viruses ( ) . within the human genome, protein-coding genes are vastly outnumbered by regulatory ncrnas that can influence a wide range of cellular functions ( , ) . 
many of these ncrnas are dysregulated in and implicated as drivers of various human diseases, including metastatic cancers and neurological and neuromuscular disorders ( , , ) . this 'rna revolution' is radically changing our understanding of the role rna plays in fundamental biology and is rapidly driving scientific innovation. methods and tools to structurally and functionally characterize rnas at the molecular level, however, are more difficult and/or lacking as compared to those for proteins ( ) ( ) ( ) ( ) . one important example is the development of chemical probes, which has greatly advanced the study of proteins and related diseases ( , ) but has been challenging for non-ribosomal rnas. this powerful chemical tool requires small molecules with well-defined biological activity, cell permeability, and selectivity to accurately and reliably probe specific mechanistic and phenotypic questions ( , ) . given the potential advantages of small molecule chemical probes over biological approaches (e.g. sirnas, asos and crispr-cas) ( , ) and the power of using both approaches in tandem ( ) , the development of rna-targeted chemical probes has the potential to greatly benefit both chemists and biologists interested in rna. while ligands that bind non-ribosomal rna in vitro have been reported for decades, the development of chemical probes with evidence of specific small molecule:rna engagement in cell or animal models has dramatically increased in the last four years. recent studies report several drug-like small molecules that target a range of rnas in animal models, including riboswitches ( ) , mirnas ( , ) , splice sites ( ) , and mature mrnas ( ) , at least one of which is currently in clinical trials (nct ). multivalent ligands have been reported that target r(cug) exp repeats of myotonic dystrophy type (dm ) in d. melanogaster ( ) and mouse models ( ) .
these recent successes confirm that selective rna targeting is achievable in biological systems; however, the limited examples over years of effort highlight the challenges associated with selectively probing rna. recently, we compiled the rna-targeted bioactive ligand database (r-bind), which comprised organic small molecules that target non-ribosomal rnas and show activity in cell culture or animal models ( ) . for this survey, the compilation was updated to include chemical probe discoveries through may for a total of chemical probes (see supplementary material, r-bind - .xls). aminoglycosides are excluded due to established non-specific binding behavior ( ) ( ) ( ) , as are peptides and oligonucleotides due to their distinctive medicinal chemistry properties ( , ) . the chemical probes were divided into two classes: monovalent, traditional drug-like small molecules (sm) and multivalent ligands (mv) with alkyl, aryl, or peptidyl linkers between multiple binding moieties. previous work compared the physicochemical, structural, and spatial properties of the small molecules to fda-approved drugs as well as rna-binding ligands without reported biological activity ( ) . this analysis revealed several key differences between these libraries that can in turn be used to bias small molecules toward biological rna targeting. in addition to that work, the curation of this collection allowed us to gain insights into: (i) the rna elements targeted; (ii) the design and discovery strategies utilized and (iii) the in cellulo characterization of these chemical probes. herein, we discuss these insights, highlight unique examples, and consider the need to establish standards for cell-based selectivity. we conclude by proposing future directions that utilize our current and prospective chemical biology toolbox to expedite the discovery of chemical probes for rna.
the chemical probes targeted distinct rna elements, including those from bacterial, fungal, human, or viral systems [ figure ]. although some overlap between targets was observed, the small molecules probed a wider range of rnas in cell culture than the multivalent ligands [ figure a ]. the most common small molecule target was the hiv- trans-activation response element (tar) rna, a well-studied and frequently screened rna that binds to the viral protein tat ( ) . disruption of this interaction reduces viral production and represents an alternative strategy against hiv. some of the first rna-targeted chemical probes were developed for tar rna, including a tetraaminoquinozaline ( ) and a -aminoquinolone ( , ) with an ec value of m and an ic value of . m, respectively, in chronically infected hiv cell models. on the other hand, only one bioactive small molecule was identified for a fungal target, specifically the candida albicans lsu group ribozyme ( ). this essential ribozyme is a desirable antifungal target as it leads to failed ribosomal assembly when mutated and is absent in the human genome. further, / small molecule chemical probes demonstrated efficacy in animal models, targeting seven unique rna elements in bacterial and human systems. one recent example ( ) targeted the g-quadruplex structure located in the -untranslated region (utr) of human vascular endothelial growth factor (hvegf) mrna, an angiogenic growth factor involved in tumor progression. in a breast cancer mouse model, the small molecule showed antitumor efficacy similar to that of doxorubicin but with fewer indications of side effects. multivalent chemical probes targeted fewer distinct rna elements, with / unique ligands targeting nucleotide repeat expansions [ figure b ]. 
there are several advantages to targeting these rna repeats: (i) long repeat stretches are typically not present elsewhere in the human genome; (ii) nuclear localization minimizes competition with ribosomal rna and (iii) targetable motifs are separated by a specific distance ( ) . another target of interest was the heat shock response element of the factor mrna in e. coli. this rna element contains a rare, perfectly paired three-way junction that can be stabilized by symmetrical triptycene-based molecules, forming a distinct shape-selective fit ( ) . this stabilization resulted in an ∼ % reduction in translation of an -gfp fusion protein and could potentially lead to antimicrobial activity ( ) . in addition, / chemical probes showed efficacy in animal models, targeting two rna elements: r(cug) exp repeats and pri-mirna- . pri-mirna- is an oncogenic rna that suppresses the translation of a pro-apoptotic protein, foxo . in a mouse model of triple negative breast cancer, a modular ligand designed to target the drosha processing site on the rna led to a statistically significant reduction in tumor size and to changes in rna and protein levels consistent with the proposed mode of action ( ) . notably, examination of this list further exposes the rna-driven processes and diseases that still lack functional chemical probes. ideal rna targets have defined functional sites and/or clear phenotypes while also being of high abundance, and several untargeted rnas, ranging from archaeal ncrna to oncogenic lncrnas, meet these criteria. many of the monovalent ligands were discovered through traditional screening methods. approximately one-third of the rna:ligand interactions were identified by each of the following approaches: focused-screening (fcs), high-throughput screening (hts), and hts followed by lead optimization (hts-lo) [ figure a ].
in this survey, fcs is defined by the use of biased libraries, which are typically based on prior knowledge of a particular chemotype binding to an rna element. in contrast to fcs libraries, molecules specifically designed to explore structure-activity relationships were classified as lead optimization (lo). the starting points for several of the fcs libraries and/or other small molecule identification strategies included rna-binding natural products, chemical similarity searching, and scaffold-based synthesis [ figure b -d]. we caution that the relative success of these various approaches cannot be evaluated since failed attempts are not typically documented in the literature. hit rates are one of the benchmarks used to assess the efficiency of a screen. we note prior to the discussion that comparisons across studies should be interpreted with caution as the definition of a small molecule lead, the specific assays used in primary screens, the controls utilized in the assay, and the number of false-positives and -negatives can be highly variable and are not always reported. of the rna:small molecule interactions discovered through hts and fcs, had reported hit rates, which were compared by screening approach, primary screen, and primary library [ figure ]. the higher hit rates found in some fcs approaches provide compelling evidence that fcs is efficient for rna targets, as it is known to be for protein targets ( , ) . moving forward, characterization of additional rna tertiary structures ( , , , ) and the identification of novel rna-binding chemotypes ( , , ) will expedite the fcs approach for discovering biologically active rna-targeted ligands. while hit rates varied widely within each type of primary screen and primary library, these comparisons support the potential of many distinct paths toward rna ligand discovery. specific aspects of library and screen design are discussed below.
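As a worked illustration of the hit-rate benchmark discussed above, the following minimal sketch computes a hit rate as the percentage of screened molecules that a study called a lead, and averages it per screening approach. The `Screen` record, the function names, and every number are hypothetical placeholders chosen for illustration; they are not data from the survey, and the caveats above about inconsistent lead definitions apply equally to any such calculation.

```python
from dataclasses import dataclass


@dataclass
class Screen:
    """One primary screen; all values used below are hypothetical."""
    name: str
    approach: str       # "HTS" or "FCS"
    library_size: int   # number of small molecules screened
    hits: int           # number of molecules the study called a lead


def hit_rate(screen: Screen) -> float:
    """Return the hit rate as a percentage of the library screened."""
    if screen.library_size <= 0:
        raise ValueError("library_size must be positive")
    return 100.0 * screen.hits / screen.library_size


def mean_hit_rate_by_approach(screens):
    """Average the per-screen hit rates within each screening approach."""
    grouped = {}
    for s in screens:
        grouped.setdefault(s.approach, []).append(hit_rate(s))
    return {approach: sum(r) / len(r) for approach, r in grouped.items()}


# Invented screens for illustration only; not values from the survey.
example_screens = [
    Screen("large corporate library", "HTS", 100_000, 50),
    Screen("biased chemotype library", "FCS", 200, 4),
]

if __name__ == "__main__":
    for approach, rate in mean_hit_rate_by_approach(example_screens).items():
        print(f"{approach}: {rate:.2f}% mean hit rate")
```

Because the denominator differs by orders of magnitude between large and focused libraries, a percentage alone hides how many leads a screen actually produced, so it is usually worth reporting both numbers.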
small molecules discovered by hts were typically from large libraries (n > small molecules), including three corporate libraries and the nih small molecule repository [ table ]. there were select examples of smaller libraries as well: fda-approved drugs (n = ) and ucla academic library (n = ). importantly, some of these reports explicitly stated that libraries were filtered to yield small molecules with favorable medicinal chemistry properties prior to screening. while successful in protein-targeted drug discovery ( ), only one report identified a bioactive ligand from a fragment-based library (commercial library) ( ) . once optimized, the scaffold yielded four additional molecules that targeted the influenza a rna promoter with ic values ranging from to m in a cell-based luciferase assay ( ) . similarly, only two reports yielded bioactive small molecules from natural product-based libraries (synthetic library and academic library) ( , ) . both of these screens contained fewer than small molecules, yet identified ligands that bind and modulate g-quadruplex structures located in the -utr of two distinct mrnas. it is promising that small molecules were discovered from a variety of hts libraries, suggesting that biologically active, rna-binding ligands can be found in a subset of current small molecule chemical space ( ) . further validation and exploration of this space could lead to greater efficiency and success in identifying bioactive leads as well as rna-privileged chemotypes. as expected, fcs used smaller libraries, typically containing fewer than small molecules [ table ]. the largest fcs library (n = , academic library) was designed through a chemical similarity search of the bis-benzimidazole and similar cores, which have shown preferences for × nucleotide internal loops ( ) .
in addition, this library was filtered for favorable medicinal chemistry properties, and the screen resulted in three leads that: (i) bound r(cug) exp in vitro; (ii) led to a statistically significant decrease in ctnt mini-gene exon inclusion in cells and (iii) were selective for a mini-gene with r(cug) repeats compared to a mini-gene without the repeats. fcs also encompassed rna structure-guided design, which included two studies utilizing molecular modeling to identify ligands structurally similar to guanine for the xpt-pbux riboswitch ( , ) . in another structure-guided approach, a small library of p-terphenylene-based ligands was designed to mimic an α-helix of rev, a protein-binding partner of the hiv- rev response element (rre) ( ) . leads were selected by docking, with one ligand disrupting the rev-rre interaction in vitro (ic = . m), inhibiting hiv- replication (ic values of . - . m), and exhibiting on-target effects via a rre-luciferase reporter assay (ic values of - m). further, we did not identify successful reports of biologically active ligands from small molecule libraries biased to general rna-binding. recently, chemical companies have designed such focused libraries (chemdiv, nucleic acid ligands: http://www.chemdiv.com/nucleic-acid-ligands/; otava chemicals, rna targeted library: http://www.otavachemicals.com/products/targeted-librariesand-focused-libraries/rna-binding); however, the success of these libraries is yet to be reported. to date, successful fcs strategies have utilized knowledge of the rna structure and/or a small molecule binder(s), neither of which is known for many therapeutically-relevant rnas. the primary screening assay for each small molecule was categorized as computational, in vitro, or cell-based [ figure a ]. this list contained a wide range of primary screening assays, with limited examples of the same assay being used for multiple targets.
the majority of the chemical probes were discovered by in vitro primary screening assays (n = ) with fewer in cellulo or in silico examples. of those in vitro primary screens, were rna:protein displacement assays [ figure b ]. these included fluorescence-based assays (förster resonance energy transfer (fret) and fluorescence anisotropy) and radiolabel-based methods (mobility shift, scintillation proximity, and filtration assays). one rather unique assay utilized a molecular beacon approach to probe for stabilization of stem loop (sl ), a presumptive structural switch located in the -packing domain in hiv- that is destabilized by binding of the gag protein prior to packaging of the virus ( ) . in this assay, the - and -terminal ends of the sl rna were labeled with a tet fluorophore and a blackhole quencher (bhq ), respectively. in the presence of gag protein, the rna construct became single stranded and the fluorescence was 'turned on'. when a small molecule stabilized the folded hairpin form of sl rna, the gag-promoted rna destabilization was reduced and the fluorescence was quenched. the researchers screened a modest sized library (> small molecules) and discovered a ligand that reduced viral production similar to models with a mutated -packing domain (p = . m). the remaining in vitro assays consisted of two activity-based screens and various rna binding assays [figure (v) competition dialysis and (vi) indicator displacement. we note that a lack of correlation in small molecule activity between in vitro rna binding and rna:protein displacement assays has sometimes been reported ( , , ) , highlighting the importance of multiple assays and/or choosing the most relevant assay for a particular system. we also note that, as in all screens, a lack of correlation can be observed between in vitro activity and cell culture activity.
these differences can be attributed to many factors, including small molecule uptake, localization, and metabolism, specificity or off-target effects, and target availability due to binding of other macromolecules or metabolites. nonetheless, we emphasize the success of the in vitro assays mentioned here in developing rna bioactives and the valuable insights gained from other ligands discovered by in vitro rna assays without reported biological activity. cell-based screens were the second most common primary screening assay for hts or fcs approaches [ figure a ] and often the preferred screen for bacterial and viral rna targets. the exception was a splicing assay of human serotonin receptor c (htr c) mrna where a green fluorescent protein (gfp) reporter was used to evaluate the inclusion or exclusion of a particular exon ( ) . bacterial and viral rna cell-based studies also utilized reporter systems such as gfp or lacz gene as well as more traditional phenotypic screens, such as growth inhibition or cell death. in one particular example, ∼ ligands were screened in a growth inhibition assay against escherichia coli with and without supplementation of riboflavin ( ). this differential supplementation allowed researchers to specifically probe the riboflavin pathway and confirm the fmn riboswitch was targeted by ribocil-b. in addition, one report measured enzyme activity and antigen production in addition to cell death measurements to assess antiviral activity against hiv- ( ). given these successes, cell-based assays likely offer unforeseen promises as primary screening assays to discover rna-targeted probes. high-throughput and focused computational screens were used to identify four small molecules [ figure a ]. two small molecules were identified by docking against experimentally determined structures of hiv- tar ( ) and rre rna ( ) . 
in another example, small molecules were modeled into an x-ray diffraction structure of the xpt-pbue guanine riboswitch aptamer after the native ligand was removed ( ) . criteria such as geometrical constraints, hydrogen bonding patterns, and molecule planarity were used to assess the 'fit' of the ligand, leading to the selection of two small molecules, one of which had antimicrobial activity against of the gram-positive bacteria species tested and was selective for species with the guaa gene under riboswitch control. the fourth example utilized a computationally-predicted d structure of the severe acute respiratory syndrome coronavirus (sars-cov) pseudoknot ( ) . a library of small molecules was docked against the predicted structure, and the highest scoring molecules were tested in an in vitro activity-based assay. the screen resulted in a biologically active ligand with an ic value of . m in cell-based models. additional advances in computational structural prediction and rna:ligand docking will undoubtedly lead to improved computational primary screens and thus more efficient experimental screens ( , ) . in addition to the primary screens, a database of known rna motif:small molecule interactions, inforna ( ) , was utilized to identify seven small molecules. the database was generated using a library versus library approach named -dimensional combinatorial screening ( dcs). in this method, small molecules are immobilized onto a microarray slide and then incubated with libraries of labeled, randomized rna secondary structures. the bound rnas are excised, sequenced, and assigned a fitness score using structure-activity relationships through sequencing (starts). fitness scores reflect the affinity and selectivity of a given rna motif:small molecule interaction and are represented on a numerical scale, where a higher score represents greater selectivity. 
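The idea of matching motifs in a target RNA against a table of scored motif:small molecule interactions can be sketched as a simple lookup. This is a toy illustration only: the table layout, the motif labels, the ligand names, the fitness values, and the `propose_leads` function are all invented for this sketch and do not reflect the actual Inforna schema, interface, or data.

```python
# Toy table of motif:ligand interactions. Motif labels, ligand names and
# fitness scores are invented placeholders, not entries from Inforna.
INTERACTIONS = [
    {"motif": "1x1_UU_internal_loop", "ligand": "ligand_A", "fitness": 92.0},
    {"motif": "1x1_UU_internal_loop", "ligand": "ligand_B", "fitness": 71.5},
    {"motif": "A_bulge", "ligand": "ligand_C", "fitness": 88.0},
]


def propose_leads(target_motifs, interactions, min_fitness=0.0):
    """Return (motif, ligand, fitness) tuples for motifs found in the target.

    Results are sorted so that the highest fitness score comes first;
    interactions scoring below min_fitness are dropped.
    """
    wanted = set(target_motifs)
    matches = [
        (row["motif"], row["ligand"], row["fitness"])
        for row in interactions
        if row["motif"] in wanted and row["fitness"] >= min_fitness
    ]
    return sorted(matches, key=lambda match: match[2], reverse=True)


# Motifs that a hypothetical secondary structure of the target contains.
leads = propose_leads(
    ["A_bulge", "1x1_UU_internal_loop"], INTERACTIONS, min_fitness=80.0
)

if __name__ == "__main__":
    for motif, ligand, fitness in leads:
        print(f"{ligand} -> {motif} (fitness {fitness})")
```

A real workflow would first have to extract motifs from a predicted or experimentally determined secondary structure, which this sketch takes as given.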
to utilize the inforna database, a computational or experimental secondary structure of rna is input, the dcs data is searched and lead molecules are proposed. this strategy identified bioactive ligands for five targets: (i) mapt pre-mrna, -nucleotide bulge; ( ) (ii) pre-mirna- , × internal loop ( ); (iii) pre-mirna- a, -nucleotide bulge ( ); (iv) pre-mirna- , × internal loop ( ) and (v) pre-mirna- , × internal loop ( ) . other examples of ligands not identified from a primary screening assay included the selection and characterization of four metabolite analogs for riboswitch inhibition ( ) ( ) ( ) ( ) . in contrast to small molecules, most multivalent ligands were developed through rational design based on the secondary structure of the rna target [ figure a ]. generally, development began by the identification of monovalent ligands that bound to a particular secondary structure motif(s) by screening, literature search, or using inforna [see section 'other methods of small molecule discovery']. monomers were covalently linked by selecting and optimizing appropriate sized spacers. for several multivalent ligands, the design was inspired by a crystal structure of r(cug) sequences, leading to ligands targeting r(cug) exp in drosophila melanogaster models ( ) . this approach linked acridine and a triaminotriazine unit, the latter of which was proposed to recognize the non-optimal base pairing of u-u mismatches by janus-wedge hydrogen bonding. stacking of the two units was expected to decrease nonspecific intercalative binding. this early design was optimized to yield bisamidinium conjugates that mitigated the glossy and rough eye phenotype observed in a dm transgenic drosophila melanogaster model ( ) . a different approach utilized hoechst , which had been previously reported to bind a cug/ guc internal loop ( ) . hoechst was modified to contain an azide handle and then covalently linked to a peptoid backbone via click chemistry ( ) . 
after extensive linker optimization, multivalent ligands were identified that improved r(cug) exp-related splicing defects in a mouse model of dm . one notable exception to the aforementioned design strategies is the use of dynamic combinatorial chemistry (dcc) [ figure b ] ( ) . several multivalent ligands were derived from a library of resin-bound, cysteine-containing monomers, which were allowed to incubate with the rna of interest, probing thousands of multivalent ligand combinations by forming covalent yet reversible disulfide linkages. the binders with highest affinity were thus enriched and then isolated, characterized, and validated for rna-binding. after replacing the disulfide linkage with more stable bioisosteres, the method yielded bioactive ligands for two rnas of known structure: dm r(cug) exp , where statistically significant improvements in splicing were observed in mouse models ( ) , and hiv- frameshift-stimulating rna ( , ) , where in one example the decrease in viral infectivity (ec values of . and m) correlated to frame-shifting activity (> % at m) in cell culture ( ) . a powerful advantage of dcc is that multivalent ligands can be constructed without knowledge of the rna structure, including larger and complex tertiary folds. in general, both rational design and dcc yield multivalent probes with significantly increased affinity and specificity for rna targets relative to small molecules. while achieving high potency in biological systems with larger molecules may require more development than with traditional small molecules, the examples identified support the possibility of success. evaluating target engagement, off-target effects, potency/appropriate concentration, and other criteria is critical to understanding the quality of a chemical probe and thus any experimental conclusions ( , , , ) .
when curating the collection of chemical probes, strict benchmarks related to these criteria could not be included due to the lack of consistency within the field. in the next paragraphs, the characterization techniques utilized for rna-targeted chemical probes will be described for each biological system and notable examples highlighted. one of the most common validation experiments in bacterial and fungal systems was serial passage ( , , , , , , ) . in this technique, ligand resistant mutants were grown in the presence of compound and mutations were mapped by whole-genome sequencing. in addition to confirming target engagement, the results revealed off-target effects and unexpected modes of action. further, select examples performed the serial passage experiments in multiple bacterial strains and measured binding affinity to the mutants in vitro, which provided added confidence in target engagement ( , , ) . in select cases where serial passage experiments did not yield mutated isolates, the ligands were tested against mutants with established variations in structure or activity ( , ) . another powerful strategy for assessing target engagement in riboswitches was phenotype rescue by addition of the native ligand ( , ) . lastly, several of the targeted rna elements regulated the expression or translation of specific genes, which was assessed by measuring the quantity of the rna or protein, respectively. one study went beyond measuring the expected transcripts and performed a transcriptomic microarray analysis of genes involved in many different cellular processes ( ) . the observed repression was consistent with riboswitch inhibition, although addition of the native ligand failed to rescue the expression of several genes, indicating a potential cellular stress response. 
this strategy and other genome-wide analyses can provide compelling evidence of target engagement, though it must be noted that target specificity does not always lead to biological specificity ( ) . compared to the other systems, chemical probes targeting viral rnas were less characterized in terms of target engagement and specific activity. a select few studies validated probe activity by testing mutant versions of the virus ( , ) or closely related native viruses ( , ) . notably, one report validated probe activity with a second research group to ensure reproducibility of the biological effect ( ) . rna-specific reporter systems were occasionally used to confirm on-target effects, including fusion-induced gene stimulation ( ) , heterologous tethering ( ) , and a viral protein reporter ( ) . a noteworthy example of assessing off-target effects was the use of rna-seq at increasing ligand concentrations in the absence of the target rna ( ) . the experiment in hek t cells revealed that of the transcripts assayed had statistically significant alterations at two concentrations, potentially reflecting a cellular stress response. it is also intriguing that the most biologically potent ligand in this study was the least selective analog in an in vitro trna competition assay. this observation underscores potential differences in cellular activity versus in vitro binding selectivity and thus the importance of progressing multiple ligands to biological assays. there are several noteworthy examples of probing ligand promiscuity and/or selectivity in human systems. multiple studies utilized genome-wide analyses to globally assess changes in mirna ( , , , , ) , mrna ( , ) , or splicing ( , , , ( ) ( ) ( ) ( ) ( ) ) levels compared to the target rna or process. toward a more direct assessment, chemical cross-linking and isolation by pull down (chem-clip) was utilized to investigate proximity-based engagement of rna ( ) .
in this method, the ligand is appended with a nucleic acid-reactive module (e.g. chlorambucil) and a biotin purification tag. after incubating cells with the modified ligand, the cells are lysed, the ligand is captured by streptavidin beads, and the bound targets are characterized by qrt-pcr or rna-seq. this strategy as well as competitive chem-clip ( ) were used to characterize on- and off-target engagement of ligands that modulate repeat expansions ( , , , ) and mirnas ( , , ) . in another study, dual luciferase reporter assays were utilized to compare ligand binding to four g-quadruplex structures, including the rna of interest and three other regulatory rnas ( ) . this study was one of few examples in which selectivity within a target family was assessed in cell culture. various controls were also used to evaluate on-target effects. for example, the impacts of chemical probes have been analyzed following sirna knockdown of foxo mrna ( ) and following inhibition of the mtor signaling pathway modulated by mirna- ( ) . another example overexpressed mirna- and assessed the effect of the chemical probe on the phenotype ( ) . likewise, a ligand targeting r(cug) exp was tested with the rna under conditional expression ( ) . another notable example used rna immunoprecipitation (rip) to detect rna binding at increasing concentrations of ligand and identified a dose-dependent response ( ) . for precipitation, a g-quadruplex-specific antibody, bg , was utilized and the complex was characterized by two complementary methods: dot blotting and qrt-pcr. another important control, particularly for human systems, was to replicate on-target effects in at least two cell lines, though this control was performed for only a limited number of chemical probes ( , , , , ) . for many years, rna has been labeled as 'undruggable' or 'impossible to probe selectively'; however, the reports described herein demonstrate the substantial progress that has been achieved in the past four years. 
this includes the development of chemical probes for unique rna elements, though the number of 'targetable' rna elements is certain to vastly exceed this list ( , , , ) . those in archaea, for example, are unrepresented in reports of rna-targeted chemical probes, despite the established importance of small regulatory rnas in archaeal metabolism, morphology, and adaptation to extreme conditions ( , ) . furthermore, there are several rnas implicated in diseases for which novel treatment strategies are needed, including insect-borne viruses ( ) , genetic disorders ( ) , and metastatic cancers ( , ) as well as bacterial targets amid the antibiotic-resistance crisis ( , ) . there are also many opportunities for rna targeting in fungal systems ( ) , especially as fungal infections are experiencing a rise in cases and a therapeutic plateau ( ) . fundamentally, the development of chemical probes will allow for the rapid and reversible interrogation of novel and complex rna biology in ways not attainable by knockdown and genetic approaches ( ) . while the potential benefits of selective rna targeting are staggering, approaches toward chemical probe development must thoughtfully consider a number of variables, including transcript abundance and tissue-specific expression of the rna target. by total mass and number of molecules, rrna and trna account for greater than % of total cellular rna in humans ( ) , and thus mrnas and other ncrnas exist in a much lower abundance ( , , ) . even within these low abundance transcripts, copy numbers can vary widely within and across cell types with reports of mrna levels spanning four orders of magnitude and some ncrnas averaging less than one copy per cell ( , , ) . a direct impact of rna expression levels was recently reported, in which the authors proposed that ligand occupancy of the mirna target was driven by the relative abundance of structurally similar rna elements ( ) . 
rnas that have a well-defined function and are highly expressed thus represent low-hanging fruit within the field ( , ) . the selection of a library and primary screening strategy is also critical for the success of chemical probe discovery. for rna targets with known structure and/or ligands, the use of focused screening libraries ( , ) has proven to be an efficient strategy, though this approach generally limits the chemical diversity of the library. advances such as the identification of more rna-privileged scaffolds ( ) and biologically relevant rna chemical space ( , ) will facilitate the discovery of chemical probes for additional rna targets. these efforts could be expedited by additional fragment- or natural product-based screening to access vast chemical space with fewer ligands ( ) and to probe known biologically relevant chemical space, respectively ( ) . additional high-throughput screens could likewise discover novel chemotypes, though the expense may exceed the resources available in academia, where focused approaches are more attainable. continued progress in the development and refinement of computational tools will also aid in expanding the boundaries of focused screening and structure-based design ( , ), although the latter will also depend upon advancements for accurately determining atomic resolution structures ( , , , ) . in addition, in vitro or cellular activity-based assays that probe well-studied rna functions (e.g. splicing, translation, or processing) may be more practical starting points to identify chemical probes. this includes screening strategies described herein as well as others recently developed ( , ) . finally, numerous opportunities exist to build upon the established multivalent targeting strategies discussed above, particularly the application of dynamic combinatorial chemistry to rnas of unknown structure. 
when using a chemical probe in a biological system, the quality and specificity of the probe must be well characterized to draw accurate and meaningful conclusions, as recently highlighted by several preeminent chemical biologists ( , , , ) . evaluation of the chemical probes revealed a diverse spectrum of characterization techniques with few ligands meeting the traditional criteria for robust chemical probes, and not all of these gaps can be attributed to a lack of relevant tools. for example, characterization inconsistencies include incomplete reports of cytotoxicity and a lack of attention to cell permeability and localization. in addition, many in vitro assays are accessible to help establish the potency and selectivity of a probe in multiple experiments, and these assays should include evaluation against both specifically mutated targets and a number of other structured rnas. any cell-based observations should be reproducible in multiple cell lines and validated in the absence of the target by utilizing sirna or crispr-cas technologies. further, on-target effects should be established by using a number of spatial-based experiments ( ) . these include the biochemical methods described herein (e.g. chem-clip or rip) and novel applications of other technologies such as in-cell chemical probing ( , ) to observe changes in rna secondary structure upon ligand binding or photoaffinity labeling ( ) to assess target engagement under temporal control. if plausible, serial passage experiments followed by deep sequencing should also be performed, even in human systems ( ) , to identify ligand-escaping mutations that confirm target engagement and/or reveal off-target effects. lastly, it is critical to use an inactive analog and an active analog from a different chemical class, if feasible, to draw conclusions regarding the targeted biology. 
moving toward these standards will be crucial for the rna-targeting field to avoid the scientific pollution that has plagued many others ( , , ) . for comparison, we note that over a decade ago chemical probe discovery for another 'undruggable' target, protein:protein interactions, was in its infancy with only known small molecule inhibitors ( ) . however, novel advances in screening approaches and design strategies led to the rapid discovery of thousands of protein:protein interaction inhibitors with several entering the clinic and most also breaking the rules of 'drug-like chemical space' ( ) . in the next five years, we anticipate that the innovations described herein and the ones yet to be discovered will lead to a similar surge in reports of rna-based chemical probes and therapeutics. we also expect that novel and well-characterized chemical probes will allow rna biologists to uncover many more exciting and unanticipated roles for rna, propelling us into the next phase of the rna revolution. supplementary data are available at nar online. 
we thank the members of the hargrove lab for stimulating discussion and input. key: cord- -b u hp r authors: liu, ying poi; haasnoot, joost; ter brake, olivier; berkhout, ben; konstantinova, pavlina title: inhibition of hiv- by multiple sirnas expressed from a single microrna polycistron date: - - journal: nucleic acids res doi: . /nar/gkn sha: doc_id: cord_uid: b u hp r rna interference (rnai) is a powerful approach to inhibit human immunodeficiency virus type (hiv- ) replication. however, hiv- can escape from rnai-mediated antiviral therapy by selection of mutations in the targeted sequence. to prevent viral escape, multiple small interfering rnas (sirnas) against conserved viral sequences should be combined. ideally, these rna inhibitors should be expressed simultaneously from a single transgene transcript. in this study, we tested a multiplex microrna (mirna) expression strategy by inserting multiple effective anti-hiv sirna sequences in the mirna polycistron mir- - . individual anti-hiv mirnas that resemble the natural mirna structures were optimized by varying the sirna position in the hairpin stem to obtain maximal effectiveness against luciferase reporters and hiv- . we show that an antiviral mirna construct can have a greater intrinsic inhibitory activity than a conventional short hairpin (shrna) construct. when combined in a polycistron setting, the silencing activity of an individual mirna is strongly boosted. we demonstrate that hiv- replication can be efficiently inhibited by simultaneous expression of four antiviral sirnas from the polycistronic mirna transcript. 
these combined results indicate that a multiplex mirna strategy may be a promising therapeutic approach to attack escape-prone viral pathogens. rnai is an evolutionarily conserved and sequence-specific gene silencing mechanism in eukaryotes ( , ) . rnai can be induced by double-stranded rna that is processed by the rnase iii-like enzyme dicer into - bp sirnas ( ) ( ) ( ) . the sirna is incorporated into the rna-induced silencing complex (risc) in the cytoplasm and directs risc to degrade an mrna that is perfectly complementary to one (guide) strand of the sirna ( ) . cellular mirnas are the natural inducers of rnai. most mirnas are synthesized as part of longer primary rna transcripts (pri-mirnas) ( ) ( ) ( ) . the pri-mirnas are cleaved by the nuclear drosha-dgcr complex to produce mirna precursors (pre-mirnas) of ~ nt ( ) ( ) ( ) ( ) . pre-mirnas are transported by exportin- to the cytoplasm, where they are cleaved by dicer to produce the mirna duplex of ~ bp. the single-stranded mature mirna programs risc for mrna cleavage (perfect complementarity) or translational repression (incomplete complementarity) ( , ) . several mirnas are encoded in genomic clusters that are transcribed as polycistronic pri-mirnas, allowing the production of multiple mirnas from a single transcription unit ( , ) . rnai can be induced using rna polymerase iii promoter-driven shrna expression vectors that direct sirna expression. another sirna expression strategy uses a pre-mirna backbone that is transcribed by polymerase ii or iii ( ) ( ) ( ) ( ) . an optimized pre-mirna design includes the single-stranded flanks of the pri-mirna ( ) ( ) ( ) . currently, rnai has been employed to inhibit the replication of a wide range of viruses including hiv- , hepatitis c virus (hcv), hepatitis b virus (hbv), dengue virus, poliovirus, influenza virus a, coronavirus, herpesvirus and picornavirus ( , ) . 
for hiv- , potent inhibition has been reported with single shrna and mirna expression constructs ( , ( ) ( ) ( ) . however, the therapeutic use of a single inhibitor is limited because of the rapid emergence of hiv- escape mutants ( , , ) . minor sequence changes in the target sequence, sometimes even a single-point mutation, are sufficient to overcome rnai-mediated inhibition ( ) , thus demonstrating the exquisite sequence-specificity of rnai. strategies to reduce the chance of viral escape include the simultaneous use of multiple shrnas in a combinatorial rnai approach, which increases the genetic barrier for viral escape ( , ) . similar strategies have previously been validated for antisense dna and ribozymes ( ) ( ) ( ) . however, expression of these shrnas necessitates multiple expression cassettes and the construction of rather complex vectors that will not easily provide equimolar shrna expression levels. furthermore, when the same promoter is reiterated in a lentiviral vector, recombination occurs with high frequency on the repeated sequences ( ) . alternative anti-escape strategies include the use of a second generation of sirnas that target specific escape variants ( ) , the use of tandem sirna transcripts ( ) , long hairpin rnas ( ) or the targeting of cellular co-factors that are critically involved in viral replication ( ) ( ) ( ) ( ) ( ) . another attractive approach is to express multiple antiviral sirnas from a single polycistronic mirna transcript, such as a natural genomic mirna cluster that can be expressed from an rna polymerase ii promoter. this strategy is of particular interest for antiviral purposes because mirna-like transcripts were shown to be more effective antivirals than regular shrnas ( , ) . furthermore, using an rna polymerase ii promoter will allow lower and regulated expression, thereby reducing the risk of toxicity due to oversaturation of the rnai machinery ( ) . 
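The sequence-specificity argument above can be made concrete with a small sketch. It counts the positions at which a guide strand fails to pair with a target site, so that a single-point escape mutation shows up as one mismatch. The sequences and function names are hypothetical and chosen only for illustration; they are not the sirnas or targets used in the study, and mismatch counting is a simplification of how risc actually discriminates targets.

```python
# Minimal sketch: Watson-Crick complementarity between a guide strand and a target site.
# All sequences below are hypothetical examples, not the study's siRNAs.
COMPLEMENT = {"a": "u", "u": "a", "g": "c", "c": "g"}


def reverse_complement(rna: str) -> str:
    """Return the reverse complement of an RNA sequence (a, c, g, u only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.lower()))


def count_mismatches(guide: str, target: str) -> int:
    """Count target positions that do not pair with the guide strand."""
    if len(guide) != len(target):
        raise ValueError("guide and target must have the same length")
    perfect_site = reverse_complement(guide)  # the site the guide pairs with perfectly
    return sum(1 for a, b in zip(perfect_site, target.lower()) if a != b)


# Hypothetical wild-type target site and the guide designed against it.
wild_type_site = "gcuagcuagcauucgaucgau"
guide = reverse_complement(wild_type_site)

# Hypothetical escape variant carrying a single-point mutation at position 11.
escape_site = wild_type_site[:10] + "c" + wild_type_site[11:]

wt_mismatches = count_mismatches(guide, wild_type_site)
escape_mismatches = count_mismatches(guide, escape_site)
```

In this toy case the wild-type site pairs perfectly with the guide, while the escape variant differs at exactly one position, which is the situation the text describes as sufficient to overcome inhibition.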
in this study, we designed a polycistronic transcript based on the mir- - backbone to simultaneously express four anti-hiv sirnas. to generate this transcript, we first constructed individual anti-hiv mirnas that resemble the natural pri-mirna structures. these hairpins were optimized for viral inhibition by varying the sirna position in the hairpin stem. we show that the expression of individual mirnas is greatly enhanced in multiplex hairpin transcripts that are properly processed into functional mirnas. hiv- replication can be potently inhibited by simultaneous expression of four antiviral mirnas. these combined results indicate that the multiplex mirna strategy is a promising therapeutic approach against escape-prone viral pathogens. the wild-type mir- - b polycistron was amplified from genomic dna of t cells with primers oncof and oncor and ta-cloned in topo . (invitrogen). the mir- - b polycistron sequence was verified and is identical to the sequence in the ncbi database (nt_ . ). the topo . /mir- - b construct was used as a template to generate the antiviral mirna constructs. the construction consists of a four-step fusion pcr as shown in the supplementary figure . briefly, the -flank of the pri-mirna was amplified with a forward primer encoding a bamhi site and a reverse primer encoding the hiv- sequence at its -end (step ). similarly, the -flank of the pri-mirna was amplified with a forward primer containing hiv- sequences and a reverse primer encoding bglii and xhoi sites (step ). two complementary oligonucleotides, creating the stem-loop structure of the antiviral mirna, were annealed as described previously (step ) ( ) . the partial sequence similarity between the fragments generated in steps , and allowed their fusion by pcr with the outer forward f and reverse r primers (step ), resulting in the antiviral mirna. supplementary table lists all oligonucleotides used in this study. 
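The fusion pcr described above relies on partial sequence similarity between adjacent fragments. A minimal sketch of that idea at the string level follows: two fragments are joined when the end of one matches the start of the next. The toy sequences, the `fuse` helper and the minimum overlap of six nucleotides are assumptions made for illustration; the real flanks, hairpins and primers are those listed in the supplementary material.

```python
# Minimal sketch of overlap-based fragment fusion (string level only).
# Sequences are toy examples; they are not the study's flanks or hairpins.

def fuse(left: str, right: str, min_overlap: int = 6) -> str:
    """Join two fragments that share a terminal overlap (suffix of left == prefix of right)."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    raise ValueError("fragments do not share a sufficient overlap")


# Hypothetical building blocks.
site_start = "acgtacgtag"   # antiviral sequence at the start of the hairpin
loop = "ttcaagaga"          # hairpin loop
site_end = "ctacgtacgt"     # complementary arm at the end of the hairpin

flank_upstream = "ggatcc" + "ttgacc" + site_start            # BamHI site + flank + overlap
hairpin = site_start + loop + site_end                       # annealed stem-loop oligos
flank_downstream = site_end + "gtcaat" + "agatct" + "ctcgag"  # overlap + flank + BglII + XhoI

# Fuse the three fragments in order, as the final pcr step would.
pri_mirna_unit = fuse(fuse(flank_upstream, hairpin), flank_downstream)
```

The longest-overlap-first search is a design choice of this sketch; it avoids joining fragments on a short accidental match when a longer intended overlap exists.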
pcr amplification was performed in a ml reaction containing × pcr amplification buffer (invitrogen), . - mm mgcl (optimized for each reaction), pmol of each primer, . mm dntps and . units of amplitaq dna polymerase (perkin elmer applied biosystem). the pcr program was as follows: c for min, cycles of min at c, . min at c, . min at c and a final extension for min at c. the pcr products were separated on a % agarose gel stained with ethidium bromide and compared to a standard dna size marker (eurogentec). mirna pcr products were excised from the gel, purified with the qiaquick gel extraction kit (qiagen), digested with bamhi and xhoi and cloned in the corresponding sites of pcdna . -gw/emgfp-mir (invitrogen). all mirna constructs were sequenced with primers gfpseqf and mirr using the bigdye terminator v . cycle sequencing kit (perkin elmer applied biosystem). multimerization of the individual pri-mirna units was performed by digestion of a single mirna hairpin construct with bamhi and xhoi and religation into the bglii/xhoi sites of pcdna -mirna. by repeating this procedure we obtained constructs expressing different combinations of , , , and pri-mirnas. the rna structures formed by the transcripts were predicted with the mfold program ( ) at http://frontend.bioinfo.rpi.edu/applications/mfold/ and found to be similar to the predicted conformation of the wild-type pri-mirnas. the firefly luciferase (fl) reporters containing hiv- target sequences pol (luc-a pol ), pol (luc-b pol ), gag (luc-c gag ), r/t (luc-d r/t ), ldr (luc-e ldr ) and the anti-hiv shrnas have been described previously ( ) . the full-length hiv- molecular clone lai (accession number af . ) ( ) was used to produce wild-type virus and to study its inhibition by the antiviral mirnas and shrnas. 
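The multimerization step works because bamhi and bglii leave compatible cohesive ends. The sketch below illustrates this at the string level: a unit cut with bamhi and xhoi is ligated into a construct opened at its bglii and xhoi sites, which leaves a hybrid junction that neither enzyme recognizes and regenerates a bglii/xhoi acceptor for the next round. The recognition sequences are the standard ones for these enzymes; the hairpin placeholders, the function names and the assumption that the bglii and xhoi sites are directly adjacent are simplifications introduced here, not details from the source.

```python
# Minimal sketch of iterative unit insertion via compatible cohesive ends.
# Recognition sequences (5'->3', top strand) with the top-strand cut offset.
SITES = {
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "BglII": ("AGATCT", 1),  # A^GATCT
    "XhoI": ("CTCGAG", 1),   # C^TCGAG
}


def overhang(enzyme: str) -> str:
    """Return the 5' overhang sequence left by the enzyme."""
    site, cut = SITES[enzyme]
    return site[cut:len(site) - cut]


def make_unit(hairpin: str) -> str:
    """Build a toy unit: BamHI site, hairpin, then adjacent BglII and XhoI sites."""
    return SITES["BamHI"][0] + hairpin + SITES["BglII"][0] + SITES["XhoI"][0]


def add_unit(construct: str, unit: str) -> str:
    """Ligate a BamHI/XhoI-cut unit into the BglII/XhoI sites of the construct."""
    bam, bgl, xho = SITES["BamHI"][0], SITES["BglII"][0], SITES["XhoI"][0]
    acceptor = construct.rfind(bgl + xho)
    if acceptor < 0:
        raise ValueError("construct has no adjacent BglII/XhoI acceptor sites")
    if not (unit.startswith(bam) and unit.endswith(xho)):
        raise ValueError("unit must start with a BamHI site and end with an XhoI site")
    left = construct[:acceptor + 1]               # vector keeps the A of A^GATCT
    right = construct[acceptor + len(bgl) + 1:]   # vector keeps TCGAG of C^TCGAG
    insert = unit[1:len(unit) - 5]                # GATCC ... C after BamHI/XhoI digestion
    return left + insert + right


# Hypothetical hairpin placeholders (they contain none of the three sites).
hairpin_a = "TTGACGTCAA"
hairpin_b = "CCATTAATGG"

one_unit = make_unit(hairpin_a)
two_units = add_unit(one_unit, make_unit(hairpin_b))
three_units = add_unit(two_units, make_unit(hairpin_a))
```

Because each round leaves exactly one bamhi site at the start and one bglii/xhoi acceptor at the end, the same procedure can be repeated to add further units, which matches the stepwise build-up described in the text.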
human embryonic kidney (hek) t cells were grown as a monolayer in dmem (invitrogen) supplemented with % fetal calf serum (fcs) (hybond), minimal essential medium nonessential amino acids, penicillin ( u/ml) and streptomycin ( mg/ml) at c and % co . cells were trypsinized one day before transfection, resuspended in dmem without antibiotics and seeded in -well plates at a density of . × cells per well. cells were co-transfected with the indicated dna constructs using lipofectamine reagent (invitrogen) according to the manufacturer's instructions. one nanogram of prl plasmid (promega) expressing renilla luciferase (rl) from the cmv promoter was added as an internal control for cell viability and transfection efficiency. all transfection experiments were controlled for equal dna input by adding pbluescript sk-(promega). firefly and renilla luciferase expression was measured with the dual-luciferase reporter assay system (promega) according to the manufacturer's instructions. virus production was determined by measuring the ca-p levels in the culture supernatant by elisa (abbott) ( ) . subsequently, the cells were lysed to measure the renilla luciferase activities using the renilla luciferase assay system (promega) according to the manufacturer's protocol. transfection experiments were corrected for between-session variation as described previously ( ) . the human t-cell line supt was cultured in cm flasks in rpmi medium supplemented with % fcs, penicillin ( u/ml) and streptomycin ( mg/ml) at c and % co . supt cells were infected with equal amounts of virus ( . ng ca-p ) produced in t cells. when hiv-induced cytopathic effects were observed, cell and supernatant samples were stored at - c. virus spread was followed by measuring the ca-p levels in the culture supernatant by elisa. one day before transfection . × t cells were plated in -well plates. 
cells were transfected with - ng of the shrna construct and or ng of the mirna-expression construct using lipofectamine reagent (invitrogen) according to the manufacturer's instructions. in all transfection experiments we added pbluescript sk-(promega) to obtain identical dna concentrations. total cellular rna was extracted days post-transfection with the mirvana mirna isolation kit (ambion) according to the manufacturer's protocol. for northern blot analysis, mg total rna was heated for min at c before electrophoresis on a % denaturing polyacrylamide gel (precast novex tbu gel, invitrogen). to check for equal sample loading, the gel was stained with mg/ml ethidium bromide in milliq water for min. destaining was performed by rinsing the gel three times with milliq water for min. the ribosomal rna (rrna) bands were visualized under uv light. the rna samples were electrotransferred to a positively charged nylon membrane (boehringer mannheim, gmbh, mannheim, germany) and crosslinked to the membrane using uv light at a wavelength of nm ( mj × ). lna oligonucleotide probes were -end labeled with the kinasemax kit (ambion) in the presence of ml [γ- p]atp ( . mbq/ml amersham biosciences). to remove unincorporated nucleotides, the probes were purified on sephadex g- spin columns (amersham biosciences) according to the manufacturer's protocol. hybridizations were performed at c with labeled lna oligonucleotides in ml ultrahyb hybridization buffer (ambion) according to the manufacturer's instructions. we used the oligonucleotide probes (lna-positions underlined): -gtgaaggggcagtagtaat- (pol probe), -acaggagcagatgatacag- (pol probe), -gaagaaatgatgacagcat- (gag probe), -atggcaggaagaagcggag- (r/t probe) and -agatgggtgcgagagcgtc- (ldr probe). the membranes were washed twice for min at c in × ssc/ . % sds and twice for min at c in . × ssc/ . % sds. signals were detected and quantified using a phosphorimager (amersham biosciences). 
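Each probe listed above detects the rna strand that is complementary to it. As a small sketch, the reverse complement of each probe can be computed to obtain the sequence it would hybridize to, written as rna. The probe sequences are copied from the text; the dictionary keys (including the "first" and "second" labels for the two pol probes) and the function names are labels introduced here, and the lna modifications are ignored.

```python
# Minimal sketch: derive the RNA sequence each northern blot probe would hybridize to.
# Probe sequences are taken from the text; the keys are labels introduced for this example.
PROBES = {
    "pol (first listed)": "gtgaaggggcagtagtaat",
    "pol (second listed)": "acaggagcagatgatacag",
    "gag": "gaagaaatgatgacagcat",
    "r/t": "atggcaggaagaagcggag",
    "ldr": "agatgggtgcgagagcgtc",
}

DNA_COMPLEMENT = {"a": "t", "t": "a", "g": "c", "c": "g"}


def reverse_complement_dna(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (a, c, g, t only)."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(seq.lower()))


def hybridizing_rna(probe: str) -> str:
    """Return the RNA sequence that pairs perfectly with a DNA probe."""
    return reverse_complement_dna(probe).replace("t", "u")


detected_strands = {name: hybridizing_rna(seq) for name, seq in PROBES.items()}

if __name__ == "__main__":
    for name, rna in detected_strands.items():
        print(f"{name}: {rna}")
```

Comparing these derived sequences with the designed guide strands is one simple way to check that each probe matches the strand it is meant to detect.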
lentiviral vector plasmids are derived from the construct plenti /v -dest (invitrogen), which we renamed plv. the mirna cassettes, containing a gfp marker, were inserted into plv using the gateway-adapted block-it lentiviral pol ii mir rnai expression system (invitrogen) according to the manufacturer's instructions. the sequences of all mirna constructs were verified using the primers cmvf and v r. the mirna inhibitory potential was determined by co-transfection with appropriate luciferase reporters (results not shown). lentiviral vector was produced in t cells ( . × ) seeded in a cm flask. the next day, medium was replaced with . -ml medium without antibiotics. subsequently, plv vectors expressing a mirna ( . mg) were co-transfected with packaging plasmids psyngp ( . mg) ( ), rsv-rev ( . mg) and pvsvg ( . mg) ( ) with ml of lipofectamine reagent and . ml optimem (gibco brl). the second day, medium was replaced with fresh medium. on the third and fourth day, medium containing lentiviral vector was harvested and pooled. cellular debris was removed by filtration through a fp / . ca-s filter (schleicher and schuell microscience) and -ml aliquots were stored at - c. lentiviral stocks were titrated on supt and t cells to determine the vector titer. supt and t cells ( × ) were transduced with the plv vector expressing an unrelated mirna n (invitrogen) or the antiviral mirnas a pol , b pol , c gag , d r/t , e ldr and acde at a multiplicity of infection (moi) of . as described previously ( ) . the plv vector contains the blasticidin resistance gene, which allows the selection of transduced cells using . mg/ml blasticidin. the gfp marker encoded by the mirna cassette was used to select gfp+ cells by facs sorting approximately days post-transduction. the human mir- - cluster on chromosome encodes a kb pri-mirna polycistronic transcript with six pre-mirnas that produces seven mature mirnas ( figure a , upper panel) ( ) . 
the pre-mir- hairpin encodes two mirnas, one on the side of the hairpin (mir- - p) and one on the side (mir- - p). we amplified the sequences encoding the first five pri-mirnas from the mir- - polycistron and cloned it under the control of the cytomegalovirus (cmv) immediate early promoter ( figure a , lower panel). we subcloned each individual pre-mirna with at least -nt flanks on each side of the hairpin and systematically replaced the mature mirna sequences with antiviral sequences as explained in detail in the supplementary figure . the original mirna names were replaced with letters a-e and we inserted -to -nt antiviral sequences against five hiv- genes ( figure b, upper panel) . the hiv- targets represent highly conserved sequences to which we successfully raised potent shrna inhibitors ( ) . we set out to combine , , or antiviral mirnas, which will eventually result in an antiviral pri-mirna polycistron ( figure b , lower panel). we first determined if we could produce an optimal antiviral mirna construct by varying the position and length of the anti-hiv sequence. for the initial experiments we chose mirnas mir- and mir- b (d wt and e wt ) because they contain the smallest number of bulges and thus most resemble the original shrna structure ( figure ). the original mature mirna and the antiviral mirna strand are boxed. we modified the passenger strand of the basepaired stem to mimic structural features (mismatches, bulges and thermodynamic stability) of the natural pre-mirna. we constructed a series of antiviral mirnas with the effective r/t and ldr sirnas in the d wt backbone, which produces the antiviral guide strand from the side of the hairpin duplex. the -nt antiviral sirnas were positioned either at the (d r/t - and d ldr - ) or -end (d r/t - and d ldr - ) of the original mirna sequence (figure , upper panel). 
additional constructs were made in which we extended the antiviral sirnas at the -end to -nt, which is the actual size of the wild-type mature mirna (d r/t - and d ldr - ) ( figure , table ). we repeated this strategy for the r/t inhibitor in the e wt hairpin, which produces the antiviral guide strand from the side of the hairpin stem ( figure , lower panel). we designed a similar set of constructs in which the -nt r/t sirna was positioned at the (e r/t - ) or -end (e r/t - ) of the wild-type mirna and an extended -nt version (e r/t - ). to determine the inhibitory activity of the mirna constructs, we co-transfected t cells with the inhibitors and luciferase reporter constructs containing either the -nt r/t target (luc-d r/t and luc-e r/t ) or the -nt ldr target (luc-d ldr ). a plasmid encoding renilla luciferase (prl) was included to correct for transfection variation and to monitor for cell viability that may be affected by off-target effects of the modified mirnas and/or the antiviral sirnas. firefly and renilla luciferase expression was measured h post-transfection. firefly luciferase expression was normalized to the control renilla luciferase expression. firefly luciferase expression in the presence of pbluescript (pbs) was set at % ( figure a) . a common pattern was observed in that most efficient inhibition was scored for the -and -nt sirna positioned at the -end of the original mirna sequence, which are the and variants of the d hairpin and the and variants of the e hairpin. a similar trend was observed in hiv- inhibition studies ( figure b ). we co-transfected the hiv- molecular clone lai and the mirna constructs into t cells and virus production was measured as the ca-p level in the culture supernatant at days post-transfection. we observed the strongest inhibition when the sirna inhibitor was situated at the -end of the mirna sequence. 
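the reporter normalization described above (firefly signal corrected by the co-transfected renilla signal, then expressed relative to the pbluescript control) can be sketched as follows. this is a minimal illustration, not the authors' analysis script: the function name and arguments are hypothetical, and the control value is assumed to be set at 100%, since the number is missing from the extracted text.

```python
def relative_luciferase(firefly, renilla, control_firefly, control_renilla):
    """Return firefly expression as a percentage of the control.

    Each firefly reading is first divided by the renilla reading from the
    same well to correct for transfection variation. The control ratio
    (e.g. the pBluescript sample) is assumed to define 100%.
    """
    if renilla <= 0 or control_renilla <= 0 or control_firefly <= 0:
        raise ValueError("luciferase readings must be positive")
    sample_ratio = firefly / renilla
    control_ratio = control_firefly / control_renilla
    return 100.0 * sample_ratio / control_ratio


# Hypothetical readings: the inhibitor reduces the normalized signal to 25%.
remaining = relative_luciferase(500, 1000, 2000, 1000)
knockdown = 100.0 - remaining
```

dividing by the renilla signal before comparing with the control means that a well with poor transfection is not mistaken for a well with strong inhibition.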
the optimized hairpins inhibit hiv- production with an efficiency similar to that of the shrna constructs used as positive controls. based on these initial findings, we designed additional mirna constructs with - nt antiviral sequences at the -end of the mirna: a pol - , a pol - , b pol - , b pol - , c gag - , c gag - ' , e ldr - and e ldr - ' (table ) . of the original d r/t , d ldr and e r/t constructs, we selected d r/t - because it is the best inhibitor. as a consequence, the ldr inhibitor was introduced in the e hairpin. we tested the effectiveness of all new constructs against luciferase reporters and hiv- ( table ). the observed inhibition is sequence-specific because non-matching mirnas did not have any effect. furthermore, the expression of the control renilla reporter was stable in all experiments. the mirna constructs with -nt anti-hiv sequences showed a somewhat higher activity than the extended versions.
figure . structure of the antiviral mirnas based on pre-mirnas d wt and e wt . upper panel: the r/t and ldr sirnas were incorporated into the d wt backbone, which produces the guide strand from the -side of the hairpin. the guide strand is marked in grey. we modified the passenger strand to mimic structural features (mismatches, bulges and thermodynamic stability) of the natural pre-mirna. the -nt antiviral sirnas were positioned either at the (d r/t - and d ldr - ) or -end (d r/t - ' and d ldr - ' ) of the original mirna and the length of the sirnas was extended at the -end to -nt (d r/t - and d ldr - ). lower panel: the r/t sirna was similarly incorporated into the e wt pre-mirna, which produces the guide strand from the -side of the hairpin. the structure of the original shrnas sh r/t and sh ldr are presented. hiv- sequences are blue, mature wild-type mirna sequences are red, pre-mirna sequences are black, watson-crick base pairs are shown with dashes and gu wobbles with dots.
we therefore selected the -nt inhibitors for the construction of antiviral mirna polycistrons. for simplicity, we removed the indications from the mirna names (e.g. a pol - becomes a pol ). the original shrna inhibitor sh r/t seems slightly more active than the optimized d r/t mirna molecule ( figure ). this could be due to the use of different promoters (polymerase iii u versus polymerase ii cmv, respectively), but may also be due to differential rna processing efficiencies. we therefore wanted to compare the sirna level expressed from each construct. we titrated - ng sh r/t construct in t cells and compared that with ng d r/t construct. two days post-transfection, total cellular rna was isolated from the transfected cells and analyzed on a northern blot with an r/t probe that detects the guide (antisense) strand ( figure ). the original d wt hairpin was used as a negative control. quantification of the rna bands showed that expression of the antiviral -nt sirna by construct d r/t is about - -fold lower than the expression by construct sh r/t (figure , see numbers below the blot). although the sirna expression level of d r/t is significantly lower than that of sh r/t , the inhibitory effect measured in the luciferase and hiv- inhibition assays is comparable. this result suggests that the intrinsic inhibitory capacity of d r/t is in fact much greater than that of sh r/t . to address the effect of chaining of different hairpins for the silencing activity of the individual hairpins, we coupled the a pol inhibitor to wild-type pri-mirnas. we constructed two variants of a pol with two or six hairpins in a single transcript: a pol b wt and a pol a wt -e wt (a wt -e wt represents the complete wild-type mir- - b polycistron). we tested the ability of these hairpins to inhibit the luc-a pol reporter ( figure a , left). firefly luciferase expression was normalized to the renilla luciferase expression from the co-transfected prl plasmid. 
we set the luciferase expression in the presence of pbs at %. the knockdown efficiency of the single a pol hairpin was ~ %, but chaining it with the b wt hairpin or the a wt -e wt cluster enhanced the silencing activity. as a negative control, we included the a wt -e wt construct. as a positive control, we included the original shrna. a similar pattern was observed for inhibition of hiv- production ( figure a, right) . thus, expression of a mirna inhibitor as part of a mirna polycistronic transcript enhances the silencing activity. next, we constructed polycistronic hairpin constructs with different combinations of two, three or four hairpins of the mirna inhibitors a pol , b pol , c gag , d r/t and e ldr : ab, ac, ad, acc, acd, accd, acde and acdb wt . we first determined the knockdown efficiency of each individual hairpin within the multiplex transcripts by co-transfection with the corresponding luciferase reporter into t cells ( figure b, supplementary table ). constructs encoding the a wt -e wt cistron and individual wild-type or non-matching mirnas were used as negative controls. for inhibitors a pol , c gag and e ldr , we observed a remarkable enhancement of silencing activity when chained to other hairpins. for instance, construct c gag is poorly active compared to construct ac or acde ( figure b ). construct b pol does not exhibit any inhibitory activity, either alone or when combined with another hairpin as in ab. since the b pol hairpin structure is rather unstable as predicted by the mfold algorithm, this could be due to misfolding of the hairpin rna. we therefore will not use hairpin b pol in the final polycistron construct. hairpin d r/t does not benefit from linkage to other hairpins, but this is likely due to the high inhibitory activity of the individual d r/t inhibitor.
table footnote: hiv- sequences are blue; pre-mirna sequences are black; mature mirna sequences are red; guide strand is bold; -, no; +, - %; ++, - %; +++, - %; nd, not determined.
strong inhibition of luc-d r/t was observed with d r/t , ad, acd, accd and acde, indicating that generation of effective sirnas from the multiplex hairpin transcripts does not depend on the mirna position in the polycistron. we further tested the ability of the antiviral mirnas to inhibit hiv- by co-transfection with the hiv- molecular clone lai ( figure c, supplementary table ) . consistent with the luciferase results, we observed a moderate hiv- inhibition by the single mirna constructs, except for construct b pol (inactive) and d r/t (highly active) ( figure c ). multimerization of the hairpins strongly enhanced the inhibition of hiv- production, which is due both to enhancement of the silencing activity of the individual hairpins and to the presence of multiple antiviral sirnas. the a wt -e wt construct was used as a negative control and the shrna constructs were used as positive controls. we next performed northern blot analysis of the antiviral sirnas made by the different polycistron constructs ( figure ). the results are remarkably similar to the activity data in figure . for instance, sirna expression of a pol is greatly increased by linkage to another hairpin as in ad ( figure a ). the northern blot analysis of the b pol and the ab inhibitors provides an explanation for its inactivity as only the $ -nt b pol precursor is observed ( figure b , indicated by an arrow). to study whether the passenger sirna strand of b pol is made, we performed a northern blot analysis with the corresponding probe, but failed to detect any passenger sirnas (results not shown). in contrast, properly processed sirnas are produced from hairpin a pol in construct ab, indicating that the inactive b unit in the two-hairpin transcript does not negatively influence the active a unit ( figure a and b) . in addition, sirna production from the a hairpin is boosted for the polycistronic transcripts compared to the single a pol hairpin transcript. 
for inhibitor c gag , we also observed a strong increase in sirna production when the hairpin is combined with other hairpins in a polycistronic transcript ( figure c ). the d r/t inhibitor is expressed individually and does not benefit from chaining to other hairpins ( figure d ), which correlates nicely with the luciferase results ( figure b ). for inhibitor e ldr , we observed increased sirna levels for acde compared to the single hairpin ( figure e ). the combined luciferase inhibition results and the northern blot analyses demonstrate that the silencing activity of an individual hairpin rna can be significantly enhanced when expressed in a polycistronic transcript. we next created stably transduced supt cells with a lentiviral vector (plv) expressing the individual mirna constructs a pol , b pol , c gag , d r/t , e ldr and the polycistronic acde construct. plv expressing the control mirna n (invitrogen) was used as negative control. to study the impact of the antiviral mirnas on supt cell viability, we set up a sensitive toxicity screen for cells transduced with n, a pol and acde. we cultured the cells for days after lentiviral transduction and followed the percentage of gfp+ (transduced) and gfp-(untransduced) cells by facs (table ) . we did not observe a decrease in the fraction of transduced cells, indicating a similar growth rate as untransduced cells. the stably transduced supt cells were selected by facs sorting and subsequently infected with hiv- ( . ng of ca-p ). virus replication was followed for days by measuring the ca-p level in the culture supernatant. fast virus replication and virus-induced cytopathic effects were observed in cells expressing mirnas n, b pol , c gag and untransduced supt cells (figure ) . hiv- replication was inhibited by the individual hairpins a pol , d r/t and e ldr . 
virus replication was profoundly inhibited in supt cells expressing the polycistronic mirna construct acde and in control cells that express an extended triple shrna construct (e-shrna ) (manuscript in preparation).
in this study, we designed a combinatorial rnai approach against hiv- using the human mir- - polycistron. we first constructed individual pri-mirna transcripts against five conserved regions of hiv- under the control of a cmv promoter. we maintained the secondary structure of the original pre-mirnas and included the single-stranded flanks because these are important for proper mirna processing and subsequent risc loading ( ) . we used the cmv promoter to express the transcripts because most primary mirnas are transcribed by rna polymerase ii ( ) , which also allows inducible or tissue-specific mirna expression ( , , ) . previously, several reports have demonstrated effective gene knockdown in mammalian cells with sirnas derived from mirna precursors ( , , , ) . however, none of these studies addressed the issue that a mature mirna is typically - nt, whereas an sirna is only -nt in length. here, we demonstrate that positioning of - nt antiviral sirna sequences at the -end of the pre-mirna hairpin stem results in optimal hiv- inhibition. despite the optimization of the mirna-like inhibitors, their activity is less than that of the original shrna antivirals that were used to design the mirnas. we therefore addressed whether the mirna-like inhibitor is correctly processed and compared the amount of sirna produced from the mirna versus shrna constructs. the sirna level produced by the shrna construct sh r/t is - -fold higher than that produced by the d r/t mirna, which is likely due to the use of different promoters (rna polymerase iii versus rna polymerase ii). interestingly, the shrna inhibitor showed only marginally higher inhibitory activity, suggesting that the intrinsic inhibitory activity of the mirna-like inhibitor is in fact much greater than that of the shrna variant.
consistent with these results, vectors encoding hairpin structures that closely resemble a natural pre-mirna produced ~ -fold more mature sirnas than vectors encoding simple hairpin structures ( ) . the superior activity of natural mirna-like inhibitors is likely attributable to the intrinsic properties of a mirna, which is processed in the nucleus by drosha, exported to the cytoplasm, processed further by dicer and loaded into risc. in the case of an shrna, the drosha step is bypassed, which could provide a less-efficient entry into the rnai pathway. moderate hiv- inhibitory activity was observed with constructs expressing a single mirna hairpin. interestingly, we demonstrate that co-expression of two or more hairpins in a single transcript greatly enhanced the silencing activity of each individual hairpin in the transcript. northern blot analyses showed that the increased inhibitory activity correlates with higher sirna expression levels. multimerization of different mirna hairpins is of particular interest for targeting of rna viruses such as hiv- and hcv because of their extreme genetic diversity and potential for mutational escape. two recent papers presented the potential of combinatorial rnai using two mirna- hairpins ( , ) . in agreement with these studies, we showed an increase in rnai activity upon multimerization of , , or hairpins in a single transcript. another study used the mir- backbone to multiplex artificial mirnas and reported decreased rnai activity for the tandem plasmid, which may be due to mirna processing problems ( studied extensively. our study focuses on the natural mir- - polycistron and we demonstrate that insertion of four antiviral sirnas creates a transcript that is properly processed into functional antiviral mirnas that effectively inhibit hiv- production and replication. a major concern of the mirna approach is the off-target effect on cellular transcripts with partial sequence complementarity.
such an off-target effect may require only a complementarity of - nt between the seed region of the mirna and the target ( , ) . such a weak restraint results in numerous potential off-target genes for any mirna. when multiple mirnas are used, the number of potential off-targets will increase, increasing the chance of a negative effect on the treated cells. however, we observed no obvious cellular changes in a sensitive toxicity screen. furthermore, the observed inhibition of firefly luciferase reporters clearly showed sequence-specificity as non-matching mirnas did not have any effect and the expression of the control renilla reporter was not affected. in another study, artificial mirnas expressed in arabidopsis thaliana conferred viral resistance without cellular alterations, suggesting that off-target effects are not significant ( ) . furthermore, recent evidence indicates that sirna sequences inserted in a mirna backbone do not compete for transport and incorporation into risc, while competition was observed when the same sirna sequences were presented as synthetic sirnas or shrnas ( ) . nonetheless, off-targeting is a genuine concern for the development of any rnai-based gene therapy against hiv- and the potential risk should be assessed properly in relevant in vivo models prior to an eventual clinical application ( ) . we have shown in stably transduced t-cell lines that multiple effective mirnas inhibit hiv- replication much more strongly than a single mirna. these data, together with our previous shrna studies ( , ) , indicate that a combinatorial rnai approach against hiv- results in an increased magnitude of inhibition and consequently a restriction of viral escape. current strategies combine multiple polymerase iii shrna expression cassettes ( ) , which results in high expression levels. this may not always be desired in a gene therapy setting because of increased toxicity due to saturation of the rnai machinery with sirnas ( ) .
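the seed-based off-target concern raised above can be made concrete with a small search for seed matches in a transcript. this is a hedged sketch rather than a validated off-target predictor: the seed is assumed to span guide positions 2 to 8 (a common convention, exposed here as parameters because the exact length is missing from the extracted text), only perfect watson-crick matches are counted, and the sequences in the usage lines are arbitrary illustrations that are not taken from this study.

```python
# Complement table for RNA; input DNA is converted to RNA first.
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def to_rna(seq):
    """Upper-case a sequence and replace T with U."""
    return seq.upper().replace("T", "U")


def reverse_complement(seq):
    """Return the reverse complement of an RNA (or DNA) sequence as RNA."""
    return to_rna(seq).translate(_COMPLEMENT)[::-1]


def seed_match_positions(guide, transcript, seed_start=2, seed_len=7):
    """Return 0-based transcript positions that pair perfectly with the seed.

    seed_start is 1-based from the 5' end of the guide strand. Both
    defaults are assumptions and should be adjusted to the seed
    definition being used.
    """
    seed = to_rna(guide)[seed_start - 1:seed_start - 1 + seed_len]
    if len(seed) < seed_len:
        raise ValueError("guide is too short for the requested seed")
    site = reverse_complement(seed)  # sequence expected in the target
    target = to_rna(transcript)
    return [i for i in range(len(target) - seed_len + 1)
            if target[i:i + seed_len] == site]


# Arbitrary example guide and transcript fragment (illustrative only).
example_guide = "UAGCUUAUCAGACUGAUGUUGA"
hits = seed_match_positions(example_guide, "GGGAUAAGCUCC")
```

because a seed of this length is so short, scanning a whole transcriptome with such a function returns many candidate sites, which is why the number of potential off-targets grows when several mirnas are expressed together.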
the use of polymerase ii promoters to express the mirna polycistron will reduce this risk because of lower expression levels. in addition, polymerase ii cassettes allow expression in a tissue-specific manner and inducible gene expression, which increases the flexibility for gene therapy and functional genomic applications ( , ) . for hiv- gene therapy applications the use of a hematopoietic or t-cell-specific promoter can increase the target cell specificity. an interesting candidate is the was promoter that is active in human hematopoietic precursor cells (cd +) and t lymphocytes, b cells and dendritic cells ( ) . another intriguing possibility is to use the hiv- ltr promoter to express the mirna polycistron. transcriptional activation of the hiv- ltr requires the viral tat protein, which is produced only in hiv- -infected cells, thereby allowing exquisite target cell specificity. this approach has previously been employed for shrna expression ( ) . thus, several approaches can be tested experimentally in order to optimize the antiviral mirna polycistron strategy for hiv- inhibition. in summary, one can effectively combat hiv- with multiple mirna effector molecules transcribed from a single polycistronic transcript. we showed that expression of the mirna polycistron results in the production of functional mature mirnas that can efficiently and selectively inhibit hiv- . further optimization of this construct by increasing the target cell specificity and inducibility will be a further step towards a gene therapeutic approach against hiv- .
with lentiviral vectors expressing antiviral mirnas a pol , b pol , c gag , d r/t , e ldr and acde. untransduced supt cells and cells transduced with the unrelated mirna n were used as negative controls. cells expressing an extended triple shrna construct (e-shrna ) serve as positive control.
supt cells stably expressing mirnas a pol , b pol , c gag , d r/t , e ldr and acde were infected with hiv- and virus replication was monitored for days by measuring ca-p in the culture supernatant. one representative experiment is shown; similar results were obtained in three independent experiments.
potent and specific genetic interference by double-stranded rna in caenorhabditis elegans
mechanisms of gene silencing by double-stranded rna
rnai: double-stranded rna directs the atp-dependent cleavage of mrna at to nucleotide intervals
a species of small antisense rna in posttranscriptional gene silencing in plants
posttranscriptional gene silencing by double-stranded rna
microrna maturation: stepwise processing and subcellular localization
microrna genes are transcribed by rna polymerase ii
human micrornas are processed from capped, polyadenylated transcripts that can also function as mrnas
the nuclear rnase iii drosha initiates microrna processing
processing of primary micrornas by the microprocessor complex
the drosha-dgcr complex in primary microrna processing
recognition and cleavage of primary microrna precursors by the nuclear processing enzyme drosha
transcription and processing of human microrna precursors
microrna biogenesis: coordinated cropping and dicing
a microrna polycistron as a potential human oncogene
human embryonic stem cells express a unique set of micrornas
enhanced gene silencing of hiv- specific sirna using microrna designed hairpins
both natural and designed micro rnas can inhibit the expression of cognate mrnas when expressed in human cells
micrornas and small interfering rnas can inhibit mrna expression by similar mechanisms
use of rna polymerase ii to transcribe artificial micrornas
second-generation shrna libraries covering the mouse and human genomes
a lentiviral microrna-based system for single-copy polymerase ii-regulated rna interference in mammalian cells
probing tumor phenotypes using stable and regulated synthetic microrna precursors
inhibition of virus replication by rna interference
rna interference: its use as antiviral therapy
bispecific short hairpin sirna constructs targeted to cd , cxcr , and ccr confer hiv- resistance
human immunodeficiency virus type escapes from rna interference-mediated inhibition
inhibition of hiv- replication with designed mirnas expressed from rna polymerase ii promoters
human immunodeficiency virus type escape from rna interference
hiv- can escape from rna interference by evolving an alternative structure in its rna genome
rna interference as an antiviral approach: targeting hiv-
silencing of hiv- with rna interference: a multiple shrna approach
specific inhibition of human immunodeficiency virus type replication by antisense oligonucleotides: an in vitro model for treatment
can hammerhead ribozymes be efficient tools to inactivate gene function?
importance of independence in ribozyme reactions: kinetic behavior of trimmed and of simply connected multiple ribozymes with potential activity against human immunodeficiency virus
lentiviral vectors that carry anti-hiv shrnas: problems and solutions
a novel approach for inhibition of hiv- by rna interference: counteracting viral escape with a second generation of sirnas
design of extended short hairpin rnas for hiv- inhibition
inhibition of human immunodeficiency virus type by rna interference using long-hairpin rna
sirna-directed inhibition of hiv- infection
suppression of chemokine receptor expression by rna interference allows for inhibition of hiv- replication
inhibition of hiv- fusion with small interfering rnas targeting the chemokine coreceptor cxcr
the inner-nuclear-envelope protein emerin regulates hiv- infectivity
"unpaking" human immunodeficiency virus (hiv) replication: using small interfering rna screening to identify novel cofactors and elucidate the role of group i paks in hiv infection
artificial microrna-mediated virus resistance in plants
fatality in mice due to oversaturation of cellular microrna/short hairpin rna pathways
mfold web server for nucleic acid folding and hybridization prediction
changes in growth properties on passage in tissue culture of viruses derived from infectious molecular clones of hiv- lai , hiv- mal , and hiv- eli
functional differences between the long terminal repeat transcriptional promoters of hiv- subtypes a through g
factor correction as a tool to eliminate between-session variation in replicate experiments: application to molecular biology and retrovirology
a rev-independent human immunodeficiency virus type (hiv- )-based vector that exploits a codon-optimized hiv- gag-pol gene
self-inactivating lentivirus vector for safe and efficient in vivo gene delivery
polycistronic rna polymerase ii expression vectors for rna interference based on bic/mir-
a single lentiviral vector platform for microrna-based conditional rna interference and coordinated transgene expression
multi-mirna hairpin method that improves gene knockdown efficiency and provides linked multi-gene knockdown
multiple shrnas expressed by an inducible pol ii promoter can knock down the expression of multiple target genes
an rna polymerase ii construct synthesizes short-hairpin rna with a quantitative indicator and mediates highly efficient rnai
conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microrna targets
utr seed matches, but not overall identity, are associated with rnai off-targets
expression of artificial micrornas in transgenic arabidopsis thaliana confers virus resistance
combinatorial delivery of small interfering rnas reduces rnai efficacy by selective incorporation into risc
rna interference against viruses: strike and counterstrike
lentiviral vector design for multiple shrna expression and durable hiv- inhibition
lentiviral vectors targeting wasp expression to hematopoietic cells, efficiently transduce and correct cells from was patients
negative feedback inhibition of hiv- by tat-inducible expression of sirna
hiv- rna research in the berkhout lab is sponsored by zonmw (vici grant) and nwo-cw (top grant). we thank stephan heynen for performing ca-p elisa and jens gruber for useful discussions. funding to pay the open access publication charges for this article was provided by zonmw and nwo-cw. supplementary data are available at nar online. conflict of interest statement. none declared.
key: cord- -pc x e authors: yu, chien-hung; noteborn, mathieu h. m.; olsthoorn, rené c. l. title: stimulation of ribosomal frameshifting by antisense lna date: - - journal: nucleic acids res doi: . /nar/gkq sha: doc_id: cord_uid: pc x e
programmed ribosomal frameshifting is a translational recoding mechanism commonly used by rna viruses to express two or more proteins from a single mrna at a fixed ratio. an essential element in this process is the presence of an rna secondary structure, such as a pseudoknot or a hairpin, located downstream of the slippery sequence. here, we have tested the efficiency of rna oligonucleotides annealing downstream of the slippery sequence to induce frameshifting in vitro. maximal frameshifting was observed with oligonucleotides of – nt. antisense oligonucleotides bearing locked nucleic acid (lna) modifications also proved to be efficient frameshift-stimulators in contrast to dna oligonucleotides. the number, sequence and location of lna bases in an otherwise dna oligonucleotide have to be carefully manipulated to obtain optimal levels of frameshifting. our data favor a model in which rna stability at the entrance of the ribosomal tunnel is the major determinant of stimulating slippage rather than a specific three-dimensional structure of the stimulating rna element. programmed ribosomal frameshifting is a translational recoding event that increases the versatility of gene expression.
it is mainly utilized by eukaryotic rna viruses ( ) ( ) ( ) , though some prokaryotic ( ) and mammalian genes ( ) ( ) ( ) are also controlled by ribosomal frameshifting. the requirements for -1 ribosomal frameshifting are the presence of a slippery heptanucleotide sequence x xxy yyz (where x can be a, u, g or c; y can be a or u; and z does not equal y; the spaces indicate the original reading frame) ( ) followed by a downstream structural element, such as a pseudoknot, a hairpin or an antisense oligonucleotide duplex [for reviews, see ( ) ]. although the mechanism of frameshifting is still elusive, a promising model has been proposed by brierley and co-workers using cryo-electron microscopy to image mammalian s ribosomes ( ) . in their model, the ribosome is paused by its inability to unwind a pseudoknot structure resulting in a blockage of the a-site by eef- . during translocation, the p-site trna is bent in the -direction by opposing forces. to release the tension, the p-site trna may un-pair and subsequently re-pair in the -1 frame with a certain frequency, followed by a-site trna delivery into the new -1 reading frame. these and other recent data obtained by mechanical unfolding of frameshifter pseudoknots suggest that mrna secondary structures with certain conformational features that resist ribosomal helicase-mediated unwinding and eef- catalyzed translocation are key players in ribosomal frameshifting. small oligonucleotides have been used for several years to regulate gene expression by rnaseh-dependent rna degradation ( ) , blocking translation ( ) , or re-directing splicing ( ) . more recently, micrornas (mirnas) ( ) and small interfering rnas (sirnas) have appeared on the scene of post-transcriptional gene regulation ( ) . sirnas may be effective in treatment of chronic hepatitis-b virus infection ( ) , hiv infection ( ) , cancer ( ) and age-related macular degeneration ( ) .
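the slippery heptanucleotide rule stated above (x xxy yyz, with x any nucleotide, y equal to a or u, and z different from y) translates directly into a short check. the sketch below is an illustration under the definition given in the text, not a frameshift-site predictor: the function names are hypothetical, the scan ignores reading frame, and no downstream structure is evaluated.

```python
def is_slippery(heptamer):
    """Check a 7-nt sequence against the X XXY YYZ pattern.

    X: three identical nucleotides (A, U, G or C)
    Y: three identical nucleotides, A or U
    Z: any nucleotide different from Y
    """
    s = heptamer.upper().replace("T", "U")
    if len(s) != 7 or set(s) - set("ACGU"):
        return False
    x, y, z = s[0], s[3], s[6]
    return s[:3] == x * 3 and s[3:6] == y * 3 and y in "AU" and z != y


def find_slippery_sites(mrna):
    """Return 0-based start positions of every heptamer matching the pattern.

    Reading frame is not considered here; a real analysis would keep only
    sites whose XXY and YYZ codons lie in the original frame.
    """
    s = mrna.upper().replace("T", "U")
    return [i for i in range(len(s) - 6) if is_slippery(s[i:i + 7])]


# The slippery sequence used in the reporter described in this article.
reporter_site_ok = is_slippery("UUUAAAC")
sites = find_slippery_sites("GGUUUAAACGG")
```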
very few antisense oligonucleotides, for example against the bcl- oncogene, have reached the stage of clinical trials ( ) or have actually been approved by the fda, for instance for the treatment of human cytomegalovirus retinitis ( ) . enhancing the stability of small oligonucleotides to prolong circulation while increasing target specificity is a major concern for therapeutic applications. various kinds of modifications in backbones, sugars or even analogs have already been studied extensively [for reviews, see ( , ) ] to meet these requirements. locked nucleic acid (lna) is a relatively novel nucleic acid analog comprising a class of bicyclic high-affinity rna analogs in which the furanose ring of lna monomers is conformationally locked in an rna-mimicking c -endo/ n-type conformation ( ) . the lna modification also resists degradation by cellular nucleases. furthermore, introducing lna into dna or rna oligonucleotides improves the affinity for complementary sequences and increases the melting temperature by several degrees ( ) . a recent study showed that lna/dna mix-mers against mirna- can be acutely administered at high dosage with long-lasting effects without any evidence of lna-associated toxicities or histopathological changes in the studied animals ( ) . these data suggest that lna is a promising candidate for small oligonucleotide applications. we and others have demonstrated that small rna oligonucleotides are able to mimic the function of frameshifter pseudoknots or hairpins by redirecting ribosomes into new reading frames ( , ) . in this article, we have investigated the length and concentration of rna oligonucleotides for optimal frameshifting, as well as the effects of introducing lna-type sugars in dna oligonucleotides. the -1 ribosomal frameshifting events were monitored by the sf reporter construct described earlier ( ) .
complementary oligonucleotides (eurogentec, liege, belgium) sf (ctagttgacctcaacccttggaa) and sf (catgttccaagggttgaggtcaa), and sf (ctagttgagcgcgctggaggccatgg) and sf (catgccatggcctccagcgcgctca), were annealed and ligated into spei/ncoi digested sf reporter to construct the sf and sf templates, respectively. all constructs were verified by dna sequencing on an abi prism xl analyzer (lgtc, leiden, the netherlands). rna oligonucleotides (except for rna which was obtained from invitrogen) were purchased from dharmacon (lafayette, usa). the rnas from dharmacon carried a -o-ace protection group, which was removed by incubation with mm acetic acid ph . and temed at c for min. the sequences of rna oligos were as follows: rna : gcgcgc, rna : ccagcgcgc, rna : ccuccagcgcgc, rna : uggccuccagcgcgc, rna : ccauggccuccagcgcgc, rna: gcgcgcuggaggccaugg, and rna : ccaagggguugagg. dna and lna/dna mix-mers were synthesized by eurogentec. custom oligonucleotides were extracted by phenol/chloroform followed by ethanol precipitation before use. the sequences of dna and lna/dna mix-mers were as follows (lower case represents the lna modification and capital represents dna): plasmids were linearized by bamhi and purified by phenol/chloroform extraction followed by ethanol precipitation. in vitro transcription was conducted by sp rna polymerase and carried out in a ml reaction mixture of: mg of linearized template, mm of rntps, units of rnase inhibitor and units of sp rna polymerase with buffer (all from promega, benelux). after h incubation at c, the integrity and quantity of transcripts were checked by agarose gel and an appropriate amount of the rna was diluted in nuclease-free water for in vitro translation. in vitro translations were carried out in nuclease treated rabbit reticulocyte lysate (rrl) (promega). the amount of mrna was . pmol and different amounts of oligonucleotides ( . - . pmol) were mixed with template for min at room temperature. after incubation, ml of rrl, .
mm amino acids mixture except methionine, mci of s methionine ( mci/ml, mp biomedicals, in vitro translational grade) were added in total volume of ml and incubated at c for h. after translation, samples were mixed with  laemmli buffer, boiled at c for min and resolved by % sds polyacrylamide gels. gels were fixed in % acetic acid and % methanol for min, dried under vacuum, and exposed to phosphoimager screens (biorad). the screen was scanned and the frame and − frameshift protein products were quantified by quantity one software (biorad). frameshift percentages were calculated by dividing the amount of − frameshift product by the amount of -frame and − frameshift products after correction for the number of methionines in the protein sequence, multiplied by . determination of the melting temperature of oligonucleotide duplexes. rna oligonucleotide rna (gcgcgcuggaggccaugg, dharmacon, usa) was mixed in a : molar ratio with rna , dna or one of the various dna/lna mix-mers, in uv-melting buffer ( mm nacl, mm cacodylate acid, ph . ). the analysis was performed on a varian cary spectrophotometer using temperature ramps of . c/min during heating and cooling. the absorbance at nm was recorded and normalized to the blank control. although antisense oligonucleotides were found to induce ribosomal frameshifting ( , ) , the optimal number of base pairs has not been addressed yet. to investigate this, we designed antisense rna oligonucleotides that are , , , and bases complementary to the region downstream of a uuuaaac slippery sequence in our reporter plasmid sf ( figure ). first, titration with rna and rna oligonucleotides revealed that a -fold molar excess of oligonucleotides over mrna resulted in the highest level of frameshifting ( figure a) ; this ratio was used in the following experiments. the shortest oligonucleotide, rna , was not capable of inducing significant levels of frameshifting (figure b) , whereas rna induced ~ . % of frameshifting.
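the frameshift percentages reported here follow the calculation given in the methods, which can be written out as a short routine. the python sketch below is an illustration under stated assumptions: the function and argument names are hypothetical, the band intensities are whatever the quantification software reports, and the methionine counts must be taken from the actual sequences of the two protein products.

```python
def frameshift_percentage(band_0_frame, band_frameshift, met_0_frame, met_frameshift):
    """Methionine-corrected frameshift percentage from two gel band intensities.

    band_0_frame    : intensity of the 0-frame (non-shifted) product
    band_frameshift : intensity of the frameshift product
    met_0_frame     : number of methionines in the 0-frame product
    met_frameshift  : number of methionines in the frameshift product

    The signal comes from labelled methionine, so each band is divided by its
    methionine count before the ratio is taken.
    """
    if met_0_frame <= 0 or met_frameshift <= 0:
        raise ValueError("methionine counts must be positive")
    normalised_0 = band_0_frame / met_0_frame
    normalised_fs = band_frameshift / met_frameshift
    total = normalised_0 + normalised_fs
    if total == 0:
        raise ValueError("no signal in either band")
    return 100.0 * normalised_fs / total
```

without the methionine correction, a longer frameshift product carrying more labelled residues would be over-counted relative to the shorter 0-frame product.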
maximum levels were obtained with rna , rna and rna ; all three induced ~ % of frameshifting. in the following experiments oligonucleotides between and nt in length were used. since nothing was known about the efficacy of lna-induced ribosomal frameshifting, lna/dna mix-mers of nt in length were designed to investigate this ( figure ) . a dna oligonucleotide, as expected, was less capable ( . %) of inducing frameshifting, owing to the lower thermodynamic stability of rna-dna duplexes (see also below). surprisingly, substituting the -cytosine and guanosine in this dna oligonucleotide by their lna analogs enhanced its frameshift-inducing capacity to . %, i.e. as high as an rna oligonucleotide ( . %). increasing the lna content of this oligonucleotide further did not lead to higher frameshifting. on the contrary, the efficiency of lna was, at . %, lower than that of lna , and that of lna was a mere . %. since the overall translation efficiency did not seem to be affected by lna , we suspected an effect of the oligonucleotide itself (see below). to demonstrate that the enhanced effect of lna oligonucleotides is a general feature, we designed another construct (sf ) in which the target sequence was replaced by an unrelated sequence ( figure ). lna/dna mix-mers were designed in which nucleotides starting from the -end were gradually replaced by lna ( figure ). increasing the number of lnas from one to two and four in these dna oligonucleotides improved their frameshift-inducing ability, reaching an apparent optimum of . % with four lna substitutions. a further increase of the lna content to nt (ld ) did not improve frameshift efficiency but, on the other hand, ld also did not lead to the dramatic decrease observed above for the lna oligonucleotide applied in the sf construct. we suspected that (partial) self-complementarity may be limiting the effective concentration of free lna/dna oligonucleotides.
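such self-complementarity can also be screened at the sequence level before any experiment. the python sketch below is a minimal illustration with hypothetical function names: it reports the longest stretch of an oligonucleotide that equals its own reverse complement. it treats all residues as ordinary watson-crick bases, so it ignores the extra duplex stability contributed by lna, and it is applied here to an rna sequence from the methods because the mix-mer sequences are not reproduced in the text.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement, reading U as T and ignoring case."""
    return seq.upper().replace("U", "T").translate(_COMPLEMENT)[::-1]


def longest_self_complementary_stretch(seq):
    """Longest substring equal to its own reverse complement ('' if none).

    Such a stretch can pair with the same stretch on a second copy of the
    oligonucleotide, which is the simplest route to dimer formation.
    """
    s = seq.upper().replace("U", "T")
    best = ""
    n = len(s)
    for i in range(n):
        # only even-length windows can equal their own reverse complement
        for j in range(i + 2, n + 1, 2):
            window = s[i:j]
            if len(window) > len(best) and window == reverse_complement(window):
                best = window
    return best
```

for the oligonucleotide uggccuccagcgcgc listed in the methods, the function returns the six-nucleotide stretch gcgcgc.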
to check this possibility, we ran all the oligonucleotides on a non-denaturing polyacrylamide gel. figure shows that the lna oligonucleotide indeed migrated more slowly, indicative of partial dimer formation, presumably by intermolecular base pairing of the palindromic gcgcgc sequences in each oligonucleotide (compare the migration to that of the full dimer formed by annealing of oligonucleotides dna and dna). the ld series, as predicted, migrated as monomers. we noted that ld , though loaded in an equal amount based on its uv absorbance, showed a higher affinity to ethidium bromide than its counterparts. at present we have no explanation for this unexpected behavior of ld , since its migration, and therefore its conformation, was identical to that of the other lna/dna mix-mers. these results demonstrate that lna modifications indeed enhance the antisense-induced frameshifting efficiency, probably due to higher thermodynamic stability and rna-like structural properties. this phenomenon appears to be general, at least in our experiments. to investigate which positions in a dna oligonucleotide would exert the largest effect when substituted by an lna analog, we designed lna/dna mix-mer mutants based on lna , which is the most efficient lna/dna mix-mer in our experiments and would give a good read-out. when the two lna substitutions were moved two positions more inward (l - ), compared to lna , frameshift efficiency decreased to . % ( figure ). however, when the lna modifications were moved another two positions more inward (l - ), activity dropped to . % ( figure ), which is comparable to an unmodified dna oligonucleotide. similarly, when the lna groups were introduced at the other end of the oligonucleotide, activity was as low as dna ( figure ). finally, l - , in which the first and fourth positions were lna, was only half as efficient as lna .
these results indicate that the choice of the location of the lna modifications is crucial for the frameshift-inducing efficiency of an oligonucleotide. theoretically, the position effect of the lna substitutions could simply be explained by differences in thermodynamic stability of the resulting mrna/oligonucleotide duplexes. to investigate this possibility we carried out uv-melting experiments ( table ). the t m of the rna/rna duplex was the highest, at c, in agreement with its high frameshifting efficiency. the rna/dna duplex had a much lower t m of c, which is expected for an rna/dna hybrid, and also agreed with the lower frameshifting efficiency. the lna-substituted oligonucleotides all had higher t m s (+ to + c) than dna . the t m of l - was, at c, almost as high as that of rna . remarkably, there was no correlation between the t m of the lna oligonucleotides and their frameshift-inducing capacity. for example, the t m of lna was rather low, at c, but it had the highest frameshifting activity, and l - , which had the highest t m , actually had the lowest frameshifting activity. t m s of l - and l - were identical but their frameshifting activities were . and . %, respectively. we also noted that both l - and l - were comparable to dna in frameshifting activity but formed far more stable duplexes. these data suggest that the position effect of the lna substitutions is related to the mechanism of frameshifting and not per se to their thermodynamic stability. previously, we demonstrated that antisense oligonucleotides can induce high levels of − frameshifting ( ) . the optimal length of small antisense oligonucleotides, however, was not investigated. understanding the optimal length of trans-acting oligonucleotides that can induce the most efficient frameshifting and, at the same time, escape rnai interference will be an important issue for future in vivo applications. here we found that maximum levels of frameshifting were obtained with oligonucleotides of nt and more.
this is comparable to the stem lengths (s +s ) of known examples of highly frameshift-inducing h-type pseudoknots, such as the + bp of the simian retrovirus type- pseudoknot ( ) , the + bp of the minimal infectious bronchitis virus (ibv) pseudoknot ( ) , and the + bp chimeric mouse mammary tumor virus (mmtv)-ibv pseudoknot ( ) . in addition, in known examples of hairpin-induced frameshifts, the stem length of hairpins is around bp ( ) ( ) ( ) . this may imply that a full helical turn of an rna helix, either in one single stem or in two stacking stems of a pseudoknot (s +s ), is selected by viruses to induce efficient ribosomal frameshifting. in addition to rna oligonucleotides, we demonstrated that lna/dna mix-mers are also capable of stimulating efficient − ribosomal frameshifting, in contrast to dna oligonucleotides. replacing nt in a dna oligonucleotide by lna was already sufficient to reach the same level of frameshifting as with a comparable rna oligonucleotide. however, the excellent affinity of lna oligonucleotides could be a double-edged sword in certain cases. in our experimental system, the oligonucleotides are partly self-complementary and this resulted in the formation of dimers ( figure ), which were apparently unable to induce frameshifting (lna , figure ). hence, lna substitutions should be optimized in a sequence that is prone to form dimers. in our sf construct (figure ) , the optimal number of lna substitutions to induce the most significant amount of frameshifting is four. ld , with two additional lna substitutions, did not improve the efficiency. thus, our results suggest that the first bp are critical for antisense-induced frameshifting. a likely explanation is that when a ribosome is translating the slippery sequence, the helicase active site is around position + , with respect to the first nucleotide of the p-site, which is close to the first base pair of the mrna/oligonucleotide duplex ( ) .
increasing the local thermodynamic stability in this region may prevent ribosomes from unwinding rna structures, causing ribosomal pausing at the slippery sequence and finally resulting in a higher frequency of ribosomal frameshifting. our data also showed that a single lna modification is not sufficient to turn a dna oligonucleotide into an efficient frameshift inducer but that a second lna is needed. the best position for the second modification also appeared to be close to the -end of the oligonucleotide. although one could expect that spacing of two lna groups by two non-modified sugars, as applied in probes for mirnas, results in the optimal induction of the -endo conformation in the neighboring sugars ( ), this was not the case in our frameshift assays. here such a spacing was less efficient (see data for l - , figure ). however, we have not investigated whether possible differences in self-dimerization behavior of these oligonucleotides account for the different stimulating activities, since such effects were only observed when six lna modifications were introduced in an oligonucleotide of this sequence. the observation that different positions of lna substitutions induced different levels of ribosomal frameshifting is interesting. even though the overall thermodynamic stability of these oligonucleotides is roughly the same, they still create different degrees of barriers for ribosomes to unwind, and these differences could be the reason for the different levels of induced frameshifting. the finding that local stability at the -end of the lna/dna mix-mers is important for frameshifting is in agreement with the observation that most natural examples of frameshift stimulators have a high gc content in the first few nucleotides ( ) . hence, our data support the notion that the stability of the -end of the oligonucleotide, which may reside in the active site of the ribosomal helicase, is critical for frameshift-inducing structural elements.
in pseudoknots this stability is probably attained by triple interactions, since nature has no other way to increase the stability of a gc-rich a-type helix. triplex structures have been documented for a number of frameshifter pseudoknots, e.g. bwyv ( ), srv- ( ) and in a telomerase pseudoknot ( ) . several models of ribosomal frameshifting have been proposed ( , , ) . the consensus from these studies is that ribosomal pausing at shifty sites by downstream structural elements is important but that pausing caused by rna secondary structure does not always result in frameshifting. in addition, a lack of correlation between the extent of pausing and the efficiency of frameshifting by ibv pseudoknots has been observed ( ) . a recent study also showed that pseudoknots with a similar global structure can still induce very different levels of frameshifting although their thermodynamic stabilities were different ( ) . these data complicate the view on the role of the downstream structure. experiments involving simple oligonucleotides such as shown here may be better alternatives to elucidate the role of the downstream element. several groups have correlated the mechanical force of unfolding of a pseudoknot with its frameshifting efficiency by using optical tweezers ( , ( ) ( ) ( ) and suggested that frameshift efficiency is dependent on the unfolding force rather than on differences of thermodynamic stability between folded and unfolded states. since we showed here that antisense oligonucleotides can induce frameshifting, presumably by serving as a physical barrier for the elongating ribosome, it will be interesting to measure the strength of these linear oligonucleotides in complex with (a piece of) mrna by optical tweezers and see if there is a correlation with their frameshifting efficiency.
finally, several properties of lna, including its good aqueous solubility, low toxicity, highly efficient binding to complementary nucleic acids, high biostability and improved mismatch discrimination relative to natural nucleic acid ( ), make lna a promising candidate for in vivo applications of antisense-induced frameshifting.
funding for open access charge: leiden institute of chemistry, leiden university.
structure, stability and function of rna pseudoknots involved in stimulating ribosomal frameshifting
programmed ribosomal frameshifting in hiv- and the sars-cov
a - ribosomal frameshift in a double-stranded rna virus of yeast forms a gag-pol fusion protein
recoding in bacteriophages and bacterial is elements
a functional - ribosomal frameshift signal in the human paraneoplastic ma gene
characterization of the frameshift signal of edr, a mammalian example of programmed - ribosomal frameshifting
mammalian gene peg expresses two reading frames by high efficiency − frameshifting in embryonic-associated tissues
mutational analysis of the ''slippery-sequence'' component of a coronavirus ribosomal frameshifting signal
frameshifting rna pseudoknots: structure and mechanism
a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting
progress in antisense technology
morpholino oligos: making sense of antisense?
functional amounts of dystrophin produced by skipping the mutated exon in the mdx dystrophic mouse
micrornas: target recognition and regulatory functions
small silencing rnas: an expanding universe
potent and persistent in vivo anti-hbv activity of chemically modified sirnas
rna interference as an antiviral approach: targeting hiv-
small interfering rna therapy in cancer: mechanism, potential targets, and clinical applications
ocular delivery of nucleic acids: antisense oligonucleotides, aptamers and sirna
bcl- -targeted antisense therapy (oblimersen sodium): towards clinical reality
antisense technology: a selective tool for gene expression regulation and gene targeting
chemical modification: the key to clinical application of rna interference?
lna (locked nucleic acid): high-affinity targeting of complementary rna and dna
locked nucleic acid (lna): fine-tuning the recognition of dna and rna
lna-mediated microrna silencing in non-human primates
novel application of srna: stimulation of ribosomal frameshifting
efficient stimulation of site-specific ribosome frameshifting by antisense oligonucleotides
identification and analysis of the pseudoknot-containing gag-pro ribosomal frameshift signal of simian retrovirus-
evidence for an rna pseudoknot loop-helix interaction essential for efficient − ribosomal frameshifting
the role of rna pseudoknot stem length in the promotion of efficient − ribosomal frameshifting
characterization of the frameshift stimulatory signal controlling a programmed − ribosomal frameshift in the human immunodeficiency virus type
regulation of − ribosomal frameshifting directed by cocksfoot mottle sobemovirus genome
structural probing and mutagenic analysis of the stem-loop required for escherichia coli dnax ribosomal frameshifting: programmed efficiency of %
mrna helicase activity of the ribosome
sensitive and specific detection of micrornas by northern blot analysis using lna-modified oligonucleotide probes
minor groove rna triplex in the crystal structure of a ribosomal frameshifting viral pseudoknot
solution structure of the pseudoknot of srv- rna, involved in ribosomal frameshifting
triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of − ribosomal frameshifting
structure and function of the stimulatory rnas involved in programmed eukaryotic- ribosomal frameshifting
the -a solution: how mrna pseudoknots promote efficient programmed − ribosomal frameshifting
ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency
the global structures of a wild-type and poorly functional plant luteoviral mrna pseudoknot are essentially identical
correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting
interaction of the hiv- frameshift signal with the ribosome
characterization of the mechanical unfolding of rna pseudoknots
locked nucleic acid oligonucleotides: the next generation of antisense agents?
conflict of interest statement. none declared.
key: cord- -ix du er authors: mouzakis, kathryn d.; lang, andrew l.; vander meulen, kirk a.; easterday, preston d.; butcher, samuel e. title: hiv- frameshift efficiency is primarily determined by the stability of base pairs positioned at the mrna entrance channel of the ribosome date: - - journal: nucleic acids res doi: . /nar/gks sha: doc_id: cord_uid: ix du er
the human immunodeficiency virus (hiv) requires a programmed − ribosomal frameshift for pol gene expression. the hiv frameshift site consists of a heptanucleotide slippery sequence (uuuuuua) followed by a spacer region and a downstream rna stem–loop structure. here we investigate the role of the rna structure in promoting the − frameshift. the stem–loop was systematically altered to decouple the contributions of local and overall thermodynamic stability towards frameshift efficiency.
no correlation between overall stability and frameshift efficiency is observed. in contrast, there is a strong correlation between frameshift efficiency and the local thermodynamic stability of the first – bp in the stem–loop, which are predicted to reside at the opening of the mrna entrance channel when the ribosome is paused at the slippery site. insertions or deletions in the spacer region appear to correspondingly change the identity of the base pairs encountered nt downstream of the slippery site. finally, the role of the surrounding genomic secondary structure was investigated and found to have a modest impact on frameshift efficiency, consistent with the hypothesis that the genomic secondary structure attenuates frameshifting by affecting the overall rate of translation. translation is a high-fidelity process in all organisms. failure to maintain reading frame typically results in incorrect protein synthesis and/or early termination. however, a programmed change in reading frame can result in the translation of new proteins, thereby maximizing genomic coding capacity. many retroviruses, including human immunodeficiency virus type (hiv- ) ( ) , and some coronaviruses, such as severe acute respiratory syndrome ( ) and infectious bronchitis virus (ibv) ( ), use a programmed − ribosomal frameshift (− prf) to control translation levels of their enzymatic proteins ( ) ( ) ( ) ( ) . in the retroviruses, the − prf site lies between the gag and pol open reading frames (orfs), with pol in the − reading frame relative to gag. the gag orf encodes the viral structural proteins, whereas the pol orf encodes the enzymatic proteins. during translation of hiv- mrna, the majority of ribosomes terminate at a stop codon at the end of the gag orf, producing the gag polyprotein ( , ) . however, the hiv − prf induces ~ % of ribosomes to shift into the − reading frame, thus producing the gag-pol polyprotein ( , ( ) ( ) ( ) .
the % frameshift efficiency determines the ratio of viral proteins produced and is important for viral replication and infectivity ( , ( ) ( ) ( ) ( ) . a decrease in frameshift efficiency can inhibit viral replication ( , ) . the hiv- frameshift site is composed of a heptanucleotide slippery sequence (uuuuuua) followed by a downstream rna stem-loop ( figure a ). the slippery sequence follows a general xxxyyyz consensus sequence, where x can be any nucleotide (nt) type, y can be a or u and z is not g in eukaryotes ( , ) . this sequence allows near-cognate and cognate re-pairing of the a- and p-site trna anticodons, respectively, in the − reading frame. hiv- 's slippery sequence is especially 'slippery', and in the absence of a downstream structure increases the basal level of ribosomal frameshifting from ~ . % to . % per codon ( , , ) . however, in order to further stimulate frameshifting to the levels required for viral replication, the slippery site must be followed by a stable rna structure ( , ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) (figure a) . thus, frameshifting is achieved by the cis coupling of the slippery site and downstream structure ( , ( ) ( ) ( ) ) . multiple models have been proposed to explain the frameshift mechanism ( , , ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . common among them are the following steps: (i) during translation, the ribosome pauses when the slippery sequence (uuu uua in the frame) is engaged in the ribosomal a- and p-sites ( , , , ) . the pause is triggered by the downstream structure's resistance to unwinding. (ii) while paused, ~ % of ribosomes slip nt in the -direction and continue elongation in the − reading frame. the proposed models are differentiated by the exact step at which the frameshift occurs: during aminoacylated-trna accommodation ( ) , after accommodation, but before peptidyl transfer ( ), after large subunit translocation ( ) or after peptidyl transfer due to an incomplete translocation ( ) .
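the re-pairing described above for the xxxyyyz heptamer can be made concrete by listing the codons that occupy the p- and a-sites before and after a one-nucleotide slip towards the start of the heptamer. the python sketch below is only an illustration: the function name is hypothetical, and it compares mrna codons rather than modelling trna anticodons or wobble rules.

```python
def codons_before_and_after_slip(heptamer):
    """Compare P- and A-site codons of X XXY YYZ before and after the slip.

    Returns (zero_frame, shifted_frame, changed_positions), where the first
    two are dicts with keys 'P' and 'A', and changed_positions lists, per
    site, the codon positions (0, 1, 2) whose base differs after the slip.
    """
    h = heptamer.upper().replace("T", "U")
    if len(h) != 7:
        raise ValueError("a slippery site is a heptanucleotide")
    zero_frame = {"P": h[1:4], "A": h[4:7]}      # XXY and YYZ
    shifted_frame = {"P": h[0:3], "A": h[3:6]}   # XXX and YYY
    changed_positions = {
        site: [i for i in range(3) if zero_frame[site][i] != shifted_frame[site][i]]
        for site in ("P", "A")
    }
    return zero_frame, shifted_frame, changed_positions
```

for uuuuuua the p-site codon is uuu in both frames, and the a-site codon changes from uua to uuu, i.e. only at the third codon position.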
alternatively, the 'many pathways model' of − prf suggests that frameshift efficiency is the sum of frameshift events occurring, each of which could occur at these different points in elongation ( ) . an important role of the downstream structure is to induce ribosomal pausing on the slippery sequence, which is necessary but not sufficient to promote efficient levels of frameshifting ( ) ( ) ( ) . interestingly, the pause length does not appear to correlate with frameshift efficiency ( ) . interactions with the translational machinery have also been hypothesized to contribute to frameshifting ( , , , , ) . previous studies have observed general trends between hiv- stem-loop thermodynamic stability and frameshift efficiency ( , , ) . however, a quantitative correlation between thermodynamic stability and frameshift efficiency has not been described, and the role of individual base pairs has not been systematically investigated. for frameshift sites with a downstream pseudoknot structure, mechanical stability has been proposed to be a determinant of frameshift efficiency ( ) ( ) ( ) . it has been hypothesized that mechanical tension lowers the energy barrier for frameshifting ( ) , where the amount of tension sensed by the ribosome is proportional to the mechanical stability of the translocation barrier ( , , ( ) ( ) ( ) . however, a recent study found no correlation between pseudoknot mechanical stability and frameshift efficiency, but instead observed a correlation between frameshifting and the ability to form alternative structures ( ) . other factors can modulate the frameshift efficiency, such as translation initiation rates ( , ) . increased translation initiation rates lead to increased polysome density, which can cause ribosomes to stack at the frameshift site. this in turn affects the rate of mrna refolding during translation and leads to a decrease in overall frameshift efficiency ( , ) .
ribosome stacking can be promoted by rna structure that precedes the frameshift site. studies examining the secondary structure of the hiv- genomic rna within capsids have revealed that the frameshift site is part of a conserved three-helix junction ( hj) ( , ) . it has been hypothesized that the role of this secondary structure is to decrease the rate of translation ( ) , which may affect frameshifting by facilitating pausing and inducing ribosome stacking. here, we investigate the role of the hiv- rna structure in frameshifting, focusing on elucidating the relationships between frameshift efficiency and (i) the downstream rna stem-loop thermodynamic stability, (ii) spacer length and (iii) surrounding genomic secondary structure. by systematically altering the base pair composition of the stem-loop, we dissect the contributions of global and local thermodynamic stability on frameshifting. these data reveal that the thermodynamic stability of the first - bp in the stem-loop is a primary determinant of frameshift efficiency. our data further indicate that the base pairs important for frameshifting are located at a distance of nt from the slippery site, which corresponds to the length of the spacer and is consistent with a structural model of the ribosome paused at the frameshift site. finally, we find that the conserved genomic rna secondary structure serves to attenuate the frameshift efficiency, likely by affecting the overall rate of translation. importantly, our study describes the first quantitative and predictive model for frameshift inducing stem-loops, which can be generally applied to many − prf viral systems. dna templates used for the dual-luciferase frameshift assay were cloned into a p luc vector between the rluc and fluc reporter genes. briefly, complementary synthetic oligonucleotides [integrated dna technologies (idt), inc.]
with bamh i and sac i compatible ends were cloned into the p luc vector using the bamh i and sac i sites between the rluc and fluc reporter genes. oligonucleotides comprising the template sequences (supplementary table s ) and their complements were phosphorylated, annealed and ligated into the p luc vector to produce the experimental constructs. this places the fluc gene in the − reading frame relative to rluc, analogous to the orientation of the gag and pol genes in the hiv- genome. for the spacer mutation constructs (ms - ), a compensatory number of nts were added or removed downstream of the frameshift site to maintain the appropriate reading frame of the downstream reporter gene. the 'wild-type' (wt) sequence utilized here corresponds to the most frequently occurring sequence found in hiv- group m subtype b nl - laboratory strain ( ) . positive control sequences and their complements were also cloned into the p luc vector and have two thymidine residues (supplementary table s , bold) in the slippery sequence (supplementary table s , underlined) replaced with cytidines, and an additional nt inserted immediately before the sac i complementary sequence (gagct), which places the rluc and fluc genes in-frame. in all constructs, a pml i restriction site was included at the end of the template to allow for run-off transcription after digestion with the pml i enzyme (neb). resultant products were transformed into escherichia coli competent cells (dh a). plasmid dna was purified from cell cultures (qiagen) and the sequences of all constructs were verified (university of wisconsin-madison biotechnology center). microgram quantities of rna for the frameshift assay were transcribed in vitro using linearized p luc plasmid dna, purified his -tagged t rna polymerase ( ×), . mm ntps and two units of rnasin plus rnase inhibitor (promega), in ml for min at c. pyrophosphate was pelleted by centrifugation ( min, rpm, room temperature) and rna was phenol/chloroform extracted.
unincorporated ntps and salt were separated from the rna using size-exclusion chromatography [two econo-pac p cartridges (bio-rad) in series]. monomeric rna folding was achieved by denaturation at c for min followed by incubation on ice for min. rnas were lyophilized to dryness and resuspended in water to a concentration of mg/ml and stored in aliquots (~ ml) at − c. rna integrity and purity were checked with % agarose gel electrophoresis. finally, rnas used for uv spectroscopy were purchased from idt. in vitro frameshift assays were completed with each rna reporter (experimental and positive control) using a rabbit reticulocyte lysate (rrl) system (promega, nuclease treated, l a). differences from our previously described protocol ( ) include the following: translation reactions contained . mg rna, units of rnasin plus rnase inhibitor (promega, n ), and . ml of rrl in . ml. following a -min incubation at c, reactions were quenched with the addition of . ml . m edta ph . ( mm final), as described previously ( ) . for each reporter, a minimum of three independent frameshift assays were completed. each independent assay included six replicate reactions. luminescence was measured using the dual-luciferase reporter assay (promega) as previously described ( ) . readings were taken with a veritas microplate luminometer equipped with dual-injectors (turner bio-systems) for s after ml of the respective substrate was injected into the reaction mixture ( -s lag time prior to measurement). ratios of firefly/renilla luminescence were calculated for each of the experimental and control translation reactions. the frameshift efficiency was calculated by taking the ratio of the experimental/control luminescence (firefly/renilla). frameshift efficiencies were averaged and their standard deviations were propagated through to yield a standard error of the mean (sem).
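the dual-luciferase calculation described above can be sketched in a few lines of python. this is a minimal illustration under stated assumptions and not the authors' analysis script: the function names are hypothetical, the replicate ratios are averaged before the experimental/control ratio is taken (the averaging order is not specified in the text), and the sem is computed directly from the per-assay efficiencies rather than by the full error propagation mentioned in the methods.

```python
from math import sqrt
from statistics import mean, stdev


def frameshift_efficiency(exp_firefly, exp_renilla, ctrl_firefly, ctrl_renilla):
    """Percent frameshifting for one independent assay.

    Each argument is a list of replicate luminescence readings. The
    firefly/renilla ratio is computed per reaction, averaged within the
    experimental and the in-frame control sets (assumed averaging order),
    and the efficiency is the experimental ratio over the control ratio.
    """
    exp_ratio = mean([f / r for f, r in zip(exp_firefly, exp_renilla)])
    ctrl_ratio = mean([f / r for f, r in zip(ctrl_firefly, ctrl_renilla)])
    return 100.0 * exp_ratio / ctrl_ratio


def mean_and_sem(efficiencies):
    """Mean and standard error of the mean across independent assays.

    Simplification: the SEM is the sample standard deviation of the per-assay
    efficiencies divided by sqrt(n), without propagating replicate errors.
    """
    if len(efficiencies) < 2:
        raise ValueError("need at least two independent assays")
    return mean(efficiencies), stdev(efficiencies) / sqrt(len(efficiencies))
```

normalizing to the in-frame control cancels differences in translation and detection efficiency between the two luciferases, so that the resulting value reflects only the fraction of ribosomes that changed frame.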
for the wt and a subset of the mutant stem-loop (ms) rnas (supplementary table s and figure ), rna overall thermodynamic stability, ΔG global , was measured using uv absorbance at nm as a function of temperature with a cary model bio uv-visible spectrophotometer equipped with a peltier heating accessory and temperature probe. all samples contained mm potassium phosphate buffer, ph . , mm rna, in a volume of ml. for rnas that were too stable to measure ΔG global under these conditions, urea was added to , and m, and the ΔG global was deduced by extrapolating to m urea as described below. prior to data collection, samples were heated from c to c, at c/min, held at c for min and cooled from c back to c at the same rate to ensure homogeneous folding. samples were heated at c/min from c to c. identical traces were obtained by cooling, indicating a lack of hysteresis. a data were collected in . -min intervals and raw data were baseline corrected by subtraction of a values at each temperature. the average hyperchromicity [equation ( ) ] and temperature were calculated from four curves and the sem was determined for each average. for rna with a single melting transition, the average hyperchromicity can be fit by equation ( ) to measure ΔH and t m . here, a f and a u are the temperature-dependent a of the folded and unfolded forms of the rna, determined to be linear functions of the temperature. k is given by equation ( ), where t is the desired temperature (kelvin, k for our calculations) for the ΔG calculation, and r is the gas constant in units of kcal/(mol × k). with an average t m and ΔH extrapolated using equation ( ), assuming a ΔCp of zero, the ΔG global can be calculated using equation ( ) . error in ΔG global is calculated using standard propagation of error (supplementary 'materials and methods' section). average hyperchromicity data for each rna were fit using equation ( ) and overall thermodynamic stabilities were calculated using equation ( ) (prism . 
, graphpad). for rnas with melting temperatures approaching or > c, a linear extrapolation of ΔG global versus urea concentration was applied to determine the ΔG global at m urea (supplementary figure s ) ( ) . for all other rnas, determination of ΔG global at standard buffer conditions was sufficient to produce minimal error in ΔG global . all the data used to calculate the reported ΔG global values were established using a minimum of three independently prepared samples at each buffer condition. a small and non-cooperative transition was observed in the - c range for rnas with less stable terminal base pairs (wt, ms , ms , ms , ms and ms ). this transition was not present for rnas with very stable terminal base pairs (e.g. ms and ms ) and can be attributed to helical fraying. for all rnas, only the major cooperative unfolding transitions were used in data fitting. starting values were determined by examining the first derivative plots for each set of averages. local stabilities, ΔG local , for base pairs were calculated using nearest-neighbor parameters at m nacl, c ( - ) (supplementary table s ). frameshift efficiency was plotted as a function of overall and local thermodynamic stability. overall and local thermodynamic stabilities were predicted at m nacl. data were fit to a one-phase exponential decay function [equation ( )] (prism . , graphpad). here, the amplitude, k and plateau are variables and x o is used to offset the exponential fit. x o was set to the most negative x value in the data set. errors were determined using a % confidence interval. modeling the hiv- frameshift site onto the eukaryotic ribosome we utilized a well-established dual-luciferase frameshift assay ( , , ) to quantitatively measure frameshift efficiency in rrl. the sequence of the frameshift site stem-loop was varied to dissect the relative contributions of local and overall rna stability on hiv- frameshift efficiency.
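The thermodynamic analysis in the methods above (two-state melting with ΔCp = 0, linear extrapolation to zero urea, and a one-phase exponential decay) can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' Prism workflow: the function names are hypothetical, the equations are the standard van 't Hoff two-state forms (the paper's own numbered equations are not reproduced in this text), ΔH is taken as the unfolding enthalpy, and the decay function assumes the form y = plateau + amplitude · exp(−k (x − x0)).

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def equilibrium_constant(delta_h, t_m, temp):
    """Two-state unfolding constant K(T), van 't Hoff form with delta Cp = 0.

    delta_h: unfolding enthalpy in kcal/mol; t_m and temp in kelvin.
    """
    return math.exp((delta_h / R_KCAL) * (1.0 / t_m - 1.0 / temp))


def delta_g_unfolding(delta_h, t_m, temp):
    """Unfolding free energy, delta G = delta H * (1 - T / Tm).

    Equivalent to -R*T*ln(K); the folding free energy is its negative.
    """
    return delta_h * (1.0 - temp / t_m)


def extrapolate_to_zero(concentrations, values):
    """Least-squares line through (concentration, value); returns the intercept.

    Used here to illustrate extrapolating delta G to zero denaturant.
    """
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(concentrations, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x


def one_phase_decay(x, amplitude, k, plateau, x0):
    """Assumed one-phase decay: plateau + amplitude * exp(-k * (x - x0))."""
    return plateau + amplitude * math.exp(-k * (x - x0))
```

At the melting temperature the sketch gives K = 1 and ΔG = 0, and at x = x0 the decay function returns plateau + amplitude, which is consistent with setting x0 to the most negative x value in the data set.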
we hypothesized that once the ribosome is paused on the slippery sequence, the thermodynamic stability of the base pairs encountered at the base of the stem-loop should be a critical determinant for frameshifting. after the ribosome has completed one translocation step, it moves away from the slippery site and the reading frame is established. therefore, we further hypothesized that downstream base pairs in the stem-loop should have a much lower impact on frameshifting. to test this hypothesis, ms constructs ( figure b) were created using nearest-neighbor parameters ( ) ( ) ( ) to systematically alter the stability of different regions of the stem-loop. we define local stability (ΔG local ) as the thermodynamic stability of base pairs directly downstream of the spacer ( figure a) , as determined by their nearest-neighbor interactions ( ) ( ) ( ) . global stability (ΔG global ) is defined as the overall thermodynamic stability of the stem-loop. the thermodynamic stabilities (ΔG global ) were experimentally determined for a subset of rnas (wt, ms - , ms - , ms and ms ) using uv-monitored thermal denaturation (figure , supplementary figure s and supplementary table s ). owing to the extreme stabilities of the structures ( ), thermal denaturation curves were measured at low ionic strength ( mm potassium phosphate buffer) in the presence of varying concentrations of urea and extrapolated back to m urea ( ) . results followed the same trend as those predicted from nearest-neighbor parameters ( - , ) (r = . ) ( figure ). as expected, the measured stabilities were lower than the predicted values at m nacl ( ) . upon correction of the experimental values to m monovalent ionic strength ( ), we observe an excellent agreement (r = . ) between experimental and predicted free energies ( figure ) . indeed, free energy prediction is robust for small, stable rnas with no competing suboptimal folds ( ) .
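The notion of local stability as a sum of nearest-neighbor interactions can be illustrated with a short Python sketch. Several things here are assumptions rather than the authors' procedure: the function names are hypothetical, the stacking free energies are the widely used Watson-Crick values for 1 M NaCl at 37 °C (kcal/mol) and should be checked against the cited parameter tables, only Watson-Crick pairs are handled (no u-g wobble pairs), and no terminal or loop corrections are applied.

```python
# Watson-Crick stacking free energies, keyed by the 5'->3' dinucleotide of one
# strand (kcal/mol; assumed 1 M NaCl, 37 C values, verify before use).
NN_STACK = {
    "AA": -0.93, "AU": -1.10, "UA": -1.33, "CU": -2.08, "CA": -2.11,
    "GU": -2.24, "GA": -2.35, "CG": -2.36, "GG": -3.26, "GC": -3.42,
}
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def stack_energy(dinucleotide):
    """Free energy of one stacked pair of Watson-Crick base pairs."""
    if dinucleotide in NN_STACK:
        return NN_STACK[dinucleotide]
    # The same stack read 5'->3' along the opposite strand.
    other_strand = COMPLEMENT[dinucleotide[1]] + COMPLEMENT[dinucleotide[0]]
    return NN_STACK[other_strand]


def local_stability(strand):
    """Sum of stacking energies for a fully paired helix segment.

    strand: 5'->3' sequence of one strand; the other strand is assumed
    to be its exact Watson-Crick complement.
    """
    s = strand.upper()
    return sum(stack_energy(s[i:i + 2]) for i in range(len(s) - 1))
```

Under these assumed parameters, a ggc/gcc segment sums to about −6.68 kcal/mol and a cug/cag segment to about −4.19 kcal/mol, which matches the text's description of the former as the more stable set of base pairs.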
frameshift efficiencies for the different rna constructs were measured (figure and table ). increases in the local stability of the first bp resulted in significant increases in frameshift efficiency ( figure a, ms - ). in contrast, sequence changes that significantly lowered the local stability of the first bp resulted in decreased frameshift efficiencies ( figure a , ms - and table ). no correlation between frameshift efficiency and overall thermodynamic stability is observed ( figure b ). instead, we observe a strong correlation (r = . ) between frameshift efficiency and local stability of the first bp at the base of the stem-loop using a one-phase exponential decay function ( figure c and supplementary table s ). the frameshift efficiency for each variant frameshift site can be predicted using the parameters derived from the correlation, and each predicted frameshift efficiency falls within sd of its measured value (data not shown). these results support the hypothesis that the stability of the base pairs at the base of the stem-loop is a primary determinant of frameshift efficiency. extremely stable rna structures can promote longterm ribosomal stalling or 'roadblocking', ( , ) . in the dual-luciferase assay, roadblocking would result in decreased translation levels of the downstream firefly luciferase reporter gene product. however, the dual-luciferase assay controls for this, as frameshift efficiencies are normalized relative to in-frame control constructs ( , , ) . nevertheless, we asked if differential degrees of roadblocking might occur for our various constructs. the luminescence data reveal a consistent ratio of firefly/renilla activity (data not shown) for all constructs (table ) . these values were calculated using the luminescence data from the positive control dual-luciferase constructs, where the renilla and firefly genes are in frame and the slippery site is mutated such that the ribosome cannot frameshift. 
the consistency in the relative expression levels of the reporter genes indicates that roadblocking, if occurring, is uniform for all constructs. it has been hypothesized that during frameshifting, the mechanical force of translocation causes a build-up of tension that is transmitted through the spacer region ( figure a ) and sensed at the anticodon-codon level ( , ) . we therefore investigated the influence of nt deletions and insertion in the spacer region ( figure ). the wt construct was compared to a version with an adenosine insertion that increases the spacer length by nt (ms ) (figure ) . additionally, we created spacers with a single nt deletion (ms ) and -nt deletions (ms - ) (figure ). the resulting frameshift efficiencies were measured ( figure d ). interestingly, ms shows a large increase in frameshift efficiency. this cannot be due to the -nt deletion, since ms and ms have the same spacers yet display wt-levels of frameshifting. we hypothesized that the -nt deletion in the spacer of ms increased frameshift efficiency by altering the base pairs in the stem-loop encountered by the ribosome during frameshifting. in other words, by deleting nt in the spacer, a ribosome footprint may extend nt further into the stem. in support of this hypothesis, a very stable set of base pairs are located nt from the base of the stem ( -ggc- / -gcc- ). in ms and ms , we replaced these base pairs with the less stable base pairs normally found at the base of the stem ( -cug- / -cag- ) ( figure c ). indeed, when these changes are made, the frameshifting efficiency is indistinguishable from wt, despite the apparent -nt spacer deletion ( figure d) . interestingly, the overall stability of ms is increased relative to ms ( figure c ), yet the frameshift efficiency is unaltered ( figure d ). these data indicate that changes in the spacer region correspondingly alter the base pairs encountered by the ribosome when it is engaged with the slippery site. 
when plotting the data from all rna constructs studied (including the ms - rnas) as a function of overall rna stability, no correlation is observed ( figure e ). however, we observe a strong correlation between frameshifting and the thermodynamic stability of the first - bp nt downstream of the slippery site ( figure f and g) . conversely, the correlations grow considerably weaker as more base pairs are considered in the analysis (supplementary figure s ) . likewise, no correlation is observed between local stability of base pairs at the top of the stem-loop and frameshift efficiency (supplementary figure s l) . the observed correlations are exponential functions with baselines of - % frameshifting, which correspond to the lowest observed frameshift efficiencies in the presence of a stem-loop secondary structure downstream of the slippery site. the stem-loop is flanked by a -u and -g ( figure a) , that could potentially form a u-g wobble at the base of the stem. inclusion of this wobble pair in the local stability term produces consistently weaker correlations between frameshift efficiency and local stability (supplementary figure s ) . to further address whether or not this u-g wobble pair can form during frameshifting, we modeled the frameshift site stem-loop and spacer onto the eukaryotic ribosome ( figure ). the spacer was connected to the terminal nt in the a-site, to recapitulate the position of the ribosome when it is engaged on the slippery sequence in the reading frame. the model indicates that the minimal spacer distance between the slippery site and the stem-loop is nt; however, formation of a u-g wobble at the base of the stem is blocked by steric clash with the ribosomal s protein ( figure b and d). therefore, experimental data and structural modeling support an hiv- frameshift site spacer length of nt. within viral capsids, the hiv- frameshift site rna is part of a conserved hj secondary structure ( figure a ) ( , ) . 
it has been hypothesized that the role of this secondary structure is to slow down the rate of translation ( ) , which in turn may modulate frameshift efficiency. we therefore compared the hj secondary structure ( hj wt) to a similar construct with mutations designed to disrupt secondary structure formation in the p and p helices ( figure b , hj mut). we observe a significant decrease in frameshift efficiency, from . ± . % to . ± . %, when the hj secondary structure is present ( figure c and table , compare hj wt to hj mut). as expected, there is no significant difference between the observed frameshifting efficiencies of the hj mut and the wt construct used above ( . ± . and . ± . , respectively). the observed frameshift efficiencies for our hj wt and wt reporter constructs in rrl both fall within the range of previously measured frameshifting efficiencies for hiv- in vivo, which range from % to % ( , , , ). next, we tested the local stability hypothesis in the context of the hj secondary structure by increasing the local stability of bp in the p helix ( figure a ). the -bp change ( hj ms ) results in a large, -fold increase in frameshift efficiency ( figure c and table ). this increase is similar to the -fold difference between the ms and wt rnas ( figure a and table ). we conclude that the hj secondary structure indirectly modulates frameshifting, likely by altering the kinetics of translation, as previously hypothesized ( ) . this effect must happen prior to frameshifting, as the ribosome disrupts the hj secondary structure as it encounters the slippery site. once the ribosome is engaged with the slippery site, local stability is the primary determinant of frameshifting efficiency, as illustrated by comparison of hj wt to hj ms ( figure c ). in this work, we report a strong correlation between the thermodynamic stability of the first - bp at the base of the stem-loop and frameshift efficiency in hiv- . 
we therefore hypothesize that the frameshift mechanism involves a thermodynamic block to translocation, determined by the local stability of base pairs positioned directly at the mrna entrance channel. this is in agreement with previous studies investigating antisense-induced frameshifting using either mixed locked nucleic acid/dna ( ) or morpholino/rna ( ) oligonucleotides. when these oligonucleotides were used to direct antisenseinduced frameshifting, the local stability of the duplex was also critical to frameshift stimulation ( , ) . in light of the local stability hypothesis, we can re-examine data from prior studies that investigated trends between thermodynamic stability of the hiv- stem-loop and frameshift efficiency ( , , ) . indeed, we find that these results are generally consistent with local stability being the primary determinant in frameshift efficiency. for example, bidou et al. mutations in the frameshift site that arise in response to cytotoxic t-cell escape ( ) and protease inhibitor resistance ( , ) are also consistent with our results. prado et al. ( ) investigated the frameshift efficiency of four hiv- strains with mutations in the frameshift site. in this study, the only mutation that produced a significant change (decrease) in frameshift efficiency was one with a decreased local stability due to disruption of the first base pair in the stem-loop. nijhuis et al. if the local stability of - bp is sufficient to determine frameshift efficiency, why does the hiv- frameshift site stem-loop contain watson-crick base pairs? we can see two possible explanations for this. first, the additional base pairs ensure that the stem-loop has a high probability of folding and cannot be out-competed by suboptimal folds, which can severely impact hiv replication ( ) . in the > -nt genome, there are a total of only helices that are equal or larger in size ( ) . 
second, the additional base pairs serve to cooperatively stabilize the base pairs at the base of the stem. these effects may explain why severely truncated constructs produce lower frameshift efficiencies compared to stem-loops with identical local stability (ms versus wt, ms versus ms ). cooperative stabilization of local stability may also explain why severe truncations of a hairpin downstream of the simian retrovirus type- (srv- ) slippery site result in lower frameshift efficiency ( ) . the observed frameshift efficiencies for the hj wt and wt reporter constructs in rrl both fall within the range of previously measured frameshifting efficiencies for hiv- in vivo, which range from % to % ( , , , ) . the wide range of observed frameshifting efficiencies in vivo is likely influenced by viral and cellular factors, for example, modulation of translation initiation by the hiv- tar rna structure ( ) and polysome density ( ) . we find that the conserved hj secondary structure in the hiv- genomic rna ( ) causes a significant decrease in frameshift efficiency ( figure ). this observation is consistent with the previous hypothesis that the hj secondary structure induces ribosomal pausing ( ) . pausing at the upstream secondary structure may promote stacking of consecutive ribosomes ( ) , promoting a net decrease in frameshift efficiency because the mrna would have less time to refold between ribosomes. our data support this model and also indicate that local stability has a far greater impact on frameshift efficiency ( figure ). prokaryotic ribosomes use two active mechanisms during translation to unwind rna ( ) . in the first mechanism, the ribosomal helicase activity raises the free energy of an encountered base pair by + . kcal/mol ( ) . this destabilizes the base pair, which can then be opened by the mechanical force generated by translocation. 
if the base pair is resistant to this force, tension may be created which is sensed at the codon-anticodon base pairs ( , , , , ) . because g-c pairs require more force for unwinding ( ) , the tension sensed in the decoding center would be proportional to the local rna stability ( ) . the mechanical tension may either cause the trnas to slip nt in the -direction ( , , , , ) or, alternatively, cause the ribosome to translocate incompletely by nt instead of nt ( ), which would also result in a − frameshift. when the hiv- frameshift site is modeled onto the eukaryotic ribosome, base pairs critical for frameshifting are positioned at the entrance to the mrna entry channel ( figure b and d), in agreement with chemical probing and toeprinting results with a prokaryotic ribosome stalled on the hiv- frameshift site ( ) . interestingly, the decoding center and the mrna channel are highly conserved between prokaryotes and eukaryotes ( ) and a bacterial translation system produces similar levels and changes in frameshift efficiency in response to changes in the hiv- stem-loop sequence ( ) . our data support an -nt spacer length between the slippery site and the stem-loop, as the effect of deletions in the spacer correlates with corresponding changes in stem-loop local stability -nt downstream of the slippery site ( figure ). consistent with this idea, deletion of nt in the spacer region of the beet western yellow virus (bwyv) − prf site promotes the melting of the first base pair in the downstream structure ( ) . if the mrna channel length is maintained, it follows that spacer lengths in all − prf sites should be ≥ nt in length. yet, some frameshift sites have been drawn with -to -nt spacers [reviewed in ( , ) ], including that of bwyv ( ) . our data suggest that these frameshift site structures may be partially unwound at the time of frameshifting, in order to accommodate the requisite spacer length and positioning of the slippery sequence in the ribosomal decoding center.
unfortunately, there are currently no high-resolution structural views of ribosomes engaged with frameshift site structures. in conjunction with functional studies such as the one presented here, high-resolution structural views will ultimately be required to define the frameshifting mechanism. prior studies have observed relationships between spacer length and − prf efficiency in various systems ( , , , ) and are consistent with a spacer length of - nt and local stability being the primary determinant in frameshift efficiency. spacer lengths of - nt produced the highest level of − prf for the antisense oligonucleotides, stem-loop and pseudoknot stimulatory structures ( ) . in agreement with our conclusions, these spacers position base pairs with strong local stabilities at the entrance to the mrna channel. hiv- group m subtype b is the dominant form of hiv- in north and south america, europe, japan, thailand and australia. the less common non-b subtypes (a, c, d, e, f, g, h, j and k) have decreased local stabilities; for example, a frequent c to u mutation in the first base pair of the stem-loop in these subtypes results in formation of a u-g wobble pair in place of a c-g ( , ) . interestingly, the exponential relationship we observe predicts that such a change would have little effect on frameshift efficiency. for instance, mutants ms - all incorporate u-g wobble pairs at these positions, which significantly reduce the local stability, by + . kcal/mol relative to wt ( figure ). nevertheless, these mutations result in near wt frameshift efficiencies ( figure a ), owing to the exponential relationship between local stability and frameshift efficiency (figures c, f and g). these results are consistent with the observed frameshift efficiencies of the less common subtypes ( ) .
finally, a randomized trial of hiv patients receiving protease inhibitor therapy examined mutations in the gag-pol frameshift site and found no relationship between overall stability of the stem-loop and virological response ( ) . this observation is consistent with our results that show no correlation between overall stability and frameshifting (figures b and e) . characterization of ribosomal frameshifting in hiv- gag-pol expression programmed ribosomal frameshifting in hiv- and the sars-cov an efficient ribosomal frame-shifting signal in the polymerase-encoding region of the coronavirus ibv ribosomal frameshifting on viral rnas frameshifting rna pseudoknots: structure and mechanism programmed translational frameshifting rna pseudoknots and the regulation of protein synthesis structure and function of the stimulatory rnas involved in programmed eukaryotic- ribosomal frameshifting a heptanucleotide sequence mediates ribosomal frameshifting in mammalian cells decreasing the frameshift efficiency translates into an equivalent reduction of the replication of the human immunodeficiency virus type a reassessment of the response of the bacterial ribosome to the frameshift stimulatory signal of the human immunodeficiency virus type importance of ribosomal frameshifting for human immunodeficiency virus type particle assembly and replication characterization of human immunodeficiency virus type- (hiv- ) particles that express protease-reverse transcriptase fusion proteins maintenance of the gag/gag-pol ratio is important for human immunodeficiency virus type rna dimerization and viral infectivity the human immunodeficiency virus type ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy overexpression of the hiv- gag-pol polyprotein results in intracellular activation of hiv- protease and inhibition of assembly and budding of virus-like particles overexpression of the gag-pol precursor from human immunodeficiency 
virus type proviral genomes results in efficient proteolytic processing in the absence of virion production mutational analysis of the ''slippery-sequence'' component of a coronavirus ribosomal frameshifting signal comparative mutational analysis of cis-acting rna signals for translational frameshifting in hiv- and htlv- ribosome structure: revisiting the connection between translational accuracy and unconventional decoding translational frameshifting at the gag-pol junction of human immunodeficiency virus type is not increased in infected t-lymphoid cells the sequences of and distance between two cis-acting signals determine the efficiency of ribosomal frameshifting in human immunodeficiency virus type and human t-cell leukemia virus type ii in vivo interaction of the hiv- frameshift signal with the ribosome characterization of the frameshift stimulatory signal controlling a programmed - ribosomal frameshift in the human immunodeficiency virus type solution structure and thermodynamic investigation of the hiv- frameshift inducing element solution structure of the hiv- frameshift inducing stem-loop rna structure of the rna signal essential for translational frameshifting in hiv- efficiency of a programmed - ribosomal frameshift in the different subtypes of the human immunodeficiency virus type group m human immunodeficiency virus type gag-pol frameshifting is dependent on downstream mrna secondary structure: demonstration by expression in vivo in vivo hiv- frameshifting efficiency is directly related to the stability of the stem-loop stimulatory signal the -a solution: how mrna pseudoknots promote efficient programmed - ribosomal frameshifting a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting the three transfer rnas occupying the a, p and e sites on the ribosome are involved in viral programmed - ribosomal frameshift the mechanics of translocation: a molecular ''spring-and-ratchet'' system achieving a golden mean: mechanisms 
by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins the many paths to frameshifting: kinetic modelling and analysis of the effects of different elongation steps on programmed À ribosomal frameshifting targeting frameshifting in the human immunodeficiency virus mechanisms and implications of programmed translational frameshifting characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an rna pseudoknot ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency ribosomal movement impeded at a pseudoknot required for frameshifting ribosomal pausing during translation of an rna pseudoknot torsional restraint: a new twist on frameshifting pseudoknots analysis of natural variants of the human immunodeficiency virus type gag-pol frameshift stem-loop structure proline residues within spacer peptide p are important for human immunodeficiency virus type infectivity, protein processing, and genomic rna dimer stability correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of - ribosomal frameshifting rna reactions one molecule at a time the ribosome uses two active mechanisms to unwind messenger rna during translation characterization of the mechanical unfolding of rna pseudoknots predicting ribosomal frameshifting efficiency programmed - frameshifting efficiency correlates with rna pseudoknot conformational plasticity, not resistance to mechanical unfolding the presence of the tar rna structure alters the programmed - ribosomal frameshift efficiency of the human immunodeficiency virus type (hiv- ) by modifying the rate of translation initiation architecture and secondary structure of an entire hiv- rna genome high-throughput shape analysis reveals structures in hiv- genomic rna strongly 
conserved across distinct biological states selection and characterization of small molecules that bind the hiv- frameshift site rna programmed ribosomal frameshifting in siv is induced by a highly structured rna stem-loop urea and guanidine hydrochloride denaturation of ribonuclease, lysozyme, a-chymotrypsin, and b-lactoglobulin thermodynamic parameters for an expanded nearest-neighbor model for formation of rna duplexes with watson-crick base pairs expanded sequence dependence of thermodynamic parameters improves prediction of rna secondary structure testing the nearest neighbor model for canonical rna base pairs: revision of gu parameters structural aspects of messenger rna reading frame maintenance by the ribosome the structure of the eukaryotic ribosome at . Å resolution the path of messenger rna through the ribosome one core, two shells: bacterial and eukaryotic ribosomes crystal structure of the eukaryotic ribosome a dual-luciferase reporter system for studying recoding signals an in vivo dual-luciferase assay system for studying translational recoding in the yeast saccharomyces cerevisiae rnastructure: software for rna secondary structure prediction and analysis rna secondary structure prediction a unified view of polymer, dumbbell, and oligonucleotide dna nearest-neighbor thermodynamics ) mrna pseudoknot structures can act as ribosomal roadblocks stimulation of ribosomal frameshifting by antisense lna functional consequences of human immunodeficiency virus escape from an hla-b* -restricted cd + t-cell epitope in p gag protein a novel substrate-based hiv- protease inhibitor drug resistance mechanism mutational patterns in the frameshift-regulating site of hiv- selected by protease inhibitors opening of the tar hairpin in the hiv- genome causes aberrant rna dimerization and packaging stem-loop structures can effectively substitute for an rna pseudoknot in À ribosomal frameshifting the utr of hiv- full-length mrna and the tat viral protein modulate the 
programmed À ribosomal frameshift that generates hiv- enzymes ribosome pausing and stacking during translation of a eukaryotic mrna footprinting analysis of bwyv pseudoknot-ribosome complexes spacer-length dependence of programmed - or - ribosomal frameshifting on a u a heptamer supports a role for messenger rna (mrna) tension in frameshifting identification and analysis of the gag-pol ribosomal frameshift site of feline immunodeficiency virus differential stability of the mrna secondary structures in the frameshift site of various hiv type viruses gag mutations can impact virological response to dual-boosted protease inhibitor combinations in antiretroviral-naive hiv-infected patients the authors thank jordan burke, ashley richie and lauren michael for helpful discussions. they also thank raymond gesteland (university of utah) for the generous gift of the p luc plasmid dna and prof. a.c. palmenberg and her laboratory (university of wisconsin-madison) for equipment use. conflict of interest statement. none declared. key: cord- - ncgldaq authors: elworth, r a leo; wang, qi; kota, pavan k; barberan, c j; coleman, benjamin; balaji, advait; gupta, gaurav; baraniuk, richard g; shrivastava, anshumali; treangen, todd j title: to petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics date: - - journal: nucleic acids res doi: . /nar/gkaa sha: doc_id: cord_uid: ncgldaq as computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. in recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. for instance, sketching algorithms such as minhash have seen a rapid and widespread adoption. 
these techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. we also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. we then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions. thanks to advances in sequencing technology, the amount of next-generation sequencing data for genomics has increased at an exponential pace over the last decade. while this explosion of data has yielded unprecedented opportunities to answer previously unanswered questions in biology, it also creates new challenges. for instance, a key challenge is in designing new algorithms and data structures that are capable of handling analyses on such large and numerous datasets (table ) . one approach for solving this big data problem is the development and adoption of probabilistic algorithms and data structures. when applying probabilistic methods to genomic analyses, input sequences are frequently decomposed into sets of overlapping subsequences with length k, referred to as k-mers. this large set of k-mers is then compressed into matrices using techniques from compressed sensing and sketching. genomic analyses such as clustering and taxonomic classification can be performed directly on the compact matrices ( figure ). in this paper, we review the great strides that have already been made in these areas and look forward to future possibilities. many novel probabilistic and signal processing approaches for handling these massive amounts of genetic data have been previously reviewed ( ) ( ) ( ) ( ) ( ) .
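The decomposition of an input sequence into overlapping k-mers, mentioned above, can be shown with a minimal Python sketch. The function name is hypothetical and the example makes no attempt at canonical (strand-independent) k-mers, which practical tools often use.

```python
def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order of position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

For example, `kmers("ACGTAC", 3)` yields `["ACG", "CGT", "GTA", "TAC"]`: a sequence of length n contributes n − k + 1 k-mers, which is why compact summaries of the k-mer set become attractive for large datasets.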
for instance, in ( ) a comprehensive review was performed covering probabilistic algorithms and data structures such as minhash ( ) and locality sensitive hashing (lsh) ( ) , count-min sketch (cms) ( ) , hyperloglog ( ) and bloom filters ( ) . this review includes extensive details of how these data structures work, supporting theory behind each of them, as well as a brief discussion of their applications. however, the genomics applications for each approach were not thoroughly covered. other more biologically motivated reviews include a review of compressive algorithms in ( ) and sketching approaches in ( ) . in ( ) , techniques are covered such as the burrows-wheeler transform (bwt) ( ) , the fm-index ( ) , and other techniques based around exploiting redundancy in large datasets. a more in depth discussion of many of these topics can also be found in ( , ) , including a thorough review of compressed string indexes, lsh via sketches, cms, bloom filters, and minimizers ( ) , with accompanying applications in genomics for each. while many techniques focus on efficient ways to represent a dataset, the compressed sensing (cs) technique from signal processing exploits the sparsity of signals for their efficient acquisition and interpretation. cs's measurement efficiency often translates to significant reductions in cost and time. cs has previously found biomedical applications in microscopy ( ) and rapid mri acquisition ( ) . in this review, we summarize the essentials of cs, relate the technique to the other probabilistic data structures and algorithms, discuss relevant recent advances, and highlight corresponding applications in metagenomics.

table . metahit ( ) ; tara oceans ( ) ; terragenome ( ) ; jgi img ( ) ; human microbiome project ( ) ; the european nucleotide archive (ena) ( ) ; ncbi sequence read archive ( ) .
we direct interested readers to ( ) for further discussion of the core concepts of cs and to the seminal works of ( ) and ( ) for more thorough analyses. most recently, a comprehensive review of sketching algorithms in genomics was performed in ( ) . this review covers approaches like minhash, bloom filters, cms, hyperloglog, the biological applications and implementations of each, and even includes a set of live, interactive notebooks with code examples of each approach. given the wealth of previously performed reviews on these topics, we refer readers to the works above for more in depth explanations of these approaches along with their applications, implementations, and theory. instead, we include only a brief review of these fundamental methodologies, followed by more recent advances in these areas, and finally their applications to metagenomics. previous studies have often neglected more novel applications in metagenomic data given the new challenges it poses. metagenome sequencing and analysis not only complicates established fundamental problems in comparative genomics but also adds entirely new problems. therefore, we focus on how the aforementioned techniques can overcome unique hurdles in metagenomics. recently, more attention has been given to the study of probabilistic algorithms ( ) as a means to circumvent the widening gap between the explosion of data and our computing capabilities. algorithms based on hashing and sketching ( ) ( ) ( ) ( ) ( ) ( ) have been extensively used in the theoretical computer science and database literature for reducing the computations associated with processing massive web-scale datasets ( ) ( ) ( ) ( ) ( ) . hashing algorithms are typically associated with a random hash function that takes the input (usually the data vector) and outputs a discrete value. usually, this output serves as a (small memory) fingerprint which, being discrete, can be used for 'smart' indexing.
these indices are most notably used for sub-linear time near-neighbor searches ( , ) . sketching algorithms work by creating a dynamic probabilistic data structure popularly known as a sketch ( ) . the sketch is a small memory summary of a given set of items, which typically requires logarithmic memory for summarizing them ( ) . these sketches can support dynamic updates ( ) and the dynamic query operation which returns an approximate estimate for a quantity of interest. to begin, we give a concise overview of core probabilistic data structures and algorithms (figure ). we then include a review of a wide array of more recent variations, extensions, and recent advancements of these fundamental methodologies. finally, we include a more in depth discussion on promising applications to genomic and metagenomic data. ( ) locality sensitive hashing (lsh) was first introduced to solve the nearest neighbor search (nns) problem in high dimensions ( ) . lsh functions are a subset of hash functions that seek to hash similar input values to the same hash values. essentially, for an lsh function f, if two input items x_1 and x_2 are very similar to each other, then applying the lsh function to both should cause them to collide (f(x_1) = f(x_2)) with high probability. the main idea behind efficient retrieval is to use f to structure the data as an efficient dictionary or hash table by indexing data point x_i with key f(x_i). given any query q, f(q) naturally becomes a favorable key for lookup. this is because any x_j with the same key will have f(q) = f(x_j), and hence, is likely to have high similarity with query q. ( ) minhash is arguably one of the most popular lsh functions for genomic and metagenomic data. minhash takes a set as input and outputs a set of integer hash values. specifically, minhash applies p different hash functions to each element in a set and returns the minimal hash value from each of the p hash functions as the sketch of the set.
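
the minhash scheme just described can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the implementation of any tool discussed in this review: salted blake2b digests stand in for the p hash functions, and the names `kmers`, `minhash_sketch` and `estimate_jaccard` are hypothetical.

```python
import hashlib


def kmers(seq, k):
    """Decompose a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(item, seed):
    # Assumption: a salted BLAKE2b digest plays the role of the seed-th hash function.
    digest = hashlib.blake2b(
        item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(items, p=128):
    """Return the minimal hash value under each of p hash functions."""
    return [min(_hash(x, seed) for x in items) for seed in range(p)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions where the two minimal hash values collide."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)
```

in practice one would call `minhash_sketch(kmers(sequence, k))` once per sequence and compare only the sketches. this naive version evaluates all p hash functions on every element, which is the cost that one-pass variants aim to remove.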
the probability that two sets have the same minimal hash value under a given hash function is equal to the fraction of elements in the union of both sets that are common to both sets, i.e. their jaccard similarity. as a consequence, we can quickly approximate the similarity between two sets by simply computing the ratio of the number of minhash collisions between the sets and the total number of minhashes. with minhash we can compute a small approximate summary of each set, referred to as a sketch, and then calculate the similarity of any two sets as the distance between their sketches. sequencing data are often conveniently represented as sets of tokens (or k-mers). as a result, minhash is frequently used to quickly compare the similarity between two large sequencing datasets by applying the p hash functions to their k-mers.

figure . overview of applying probabilistic data structures and compressed sensing in metagenomic sequence analysis. given a set of sequences, each sequence is usually first decomposed into a series of consecutive k-mers. then the probabilistic algorithm compresses the k-mers into sketches. the sketches can be analyzed to evaluate characteristics of the input sequences, such as sequence similarity. in compressed sensing (cs), the aggregate k-mer frequencies for the whole sample are treated as measurements. elements of a database (e.g. microbial genomes) have individual k-mer frequency distributions that are stored in columns of a matrix. cs finds the elements of the database that comprise the sample measurements.

( ) minimizers are another widely used technique within the family of lsh-algorithms to reduce the total number of k-mers for sequence comparison applications. a minimizer is a representative sequence of a group of adjacent k-mers in a string and can help memory efficiency by storing a single minimizer in lieu of a large number of highly similar k-mers. minimizers will sample the sequence by choosing the smallest (lexicographically, for instance) k-mer within a sliding window.
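
as a rough illustration of minimizer sampling, the following python sketch selects the lexicographically smallest k-mer in each window. it assumes that a window spans w consecutive k-mers and that ties are broken by taking the leftmost k-mer; both conventions, and the function name `minimizers`, are assumptions made for this example.

```python
def minimizers(seq, k, w):
    """Return (position, k-mer) pairs for the minimizers of seq.

    Assumption: each window covers w consecutive k-mers, and the
    lexicographically smallest k-mer (leftmost on ties) is chosen.
    """
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = []
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        offset = min(range(w), key=lambda j: window[j])  # leftmost smallest k-mer
        pick = (start + offset, window[offset])
        # Adjacent windows often share a minimizer; record it only once.
        if not selected or selected[-1] != pick:
            selected.append(pick)
    return selected
```

for the sequence `ACGTACGT` with k = 3 and w = 2, six k-mers are reduced to four minimizers; larger windows reduce the number of stored k-mers further.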
in figure , the minimizer portion demonstrates the sliding window that moves across the sequence, creating the set of minimizer k-mers for the sequence by taking the smallest k-mer within the window as it slides. the window length w and the k-mer size k of the minimizers are parameters that can be adjusted for the application. several techniques employ hashing to compress the representation of a dataset. from these new representations, information can be rapidly queried. ( ) bloom filter (bf) is a data structure that compresses a set while still being able to query if an element exists in the set. the sketch for a bf is a bit array of w bits. the bits are given an initial value of 0. to record an element into the sketch, p different hash functions are used to map the input element to p positions in the array. after evaluating the hash functions, the bf sets the bits to 1 at all mapped positions. to search for an element, the query element is hashed by the same p hash functions. then, every bit that the hash values map to in the bf is checked. if any of the mapped bits is not equal to 1, the input element is definitely not in the set. if all the mapped bits are 1, the element is likely in the set. however, this result can also be caused by random hash collisions while inserting other elements. thus, the bf can have false positives. ultimately, bfs can quickly evaluate the presence of a given element using very little memory. ( ) hyperloglog is designed to estimate the number of distinct elements in a set using minimal memory. the essence of hyperloglog is to keep track of the maximum number of leading zeros observed in the binary representation of each element in the set. if the maximum number of leading zeros observed is n, a crude estimate for the number of distinct elements in the set is 2^n.
this style of cardinality estimation only works for data distributed uniformly at random, so each element passes through a hash function before being evaluated and incorporated into an extremely compact sketch for the set. the process of cardinality estimation based on leading zeros can have a high variance, so the hyperloglog sketch distributes the hashed elements into multiple counters, whose harmonic mean yields a final cardinality estimation (after correcting for using multiple counters and hash collisions). using multiple counters increases the memory used, but this memory is still logarithmic in the total number of distinct elements. on the other hand, calculating the exact cardinality requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. alternatively, condensed representations may summarize the structure of the dataset by analyzing the frequency of components of the set. new datapoints that are assumed to exhibit the same structure can be efficiently acquired.

figure . count-min sketch: three pairwise independent hash functions are applied to each k-mer. each hash function is responsible for a row in the sketch and maps the hash values to the bins in its row. to encode an element into the sketch, the count-min sketch increases the numeric value in the mapped bins. to return the number of occurrences of a given k-mer, it hashes the k-mer using the same hash functions and returns the smallest value. bloom filter: all the values in the array are initialized to 0. to record the presence of a k-mer in the dataset, the k-mer is mapped to bits in the bloom filter using three pairwise independent hash functions, and the mapped bits are changed from 0 to 1. minimizer: a given sequence can be compressed into a list of minimizers. to do that, a window slides across the sequence. in each window, the sequence inside the window is decomposed into k-mers. a minimizer is selected among the list of k-mers for the window at each position.
hyperloglog: each k-mer is represented by a hash value with length . the first three bits of a hash value are used to locate a register and the last bits are saved in the corresponding register. the maximum number of leading zeros among all the values that are stored in the register is used to estimate the cardinality of each register.

( ) compressed sensing is a signal processing technique that enables the acquisition of high-dimensional signals from low-dimensional measurements by leveraging the sparsity of many natural signals ( ) ( ) ( ) . sparse signals have only a few nonzero elements. in metagenomics, a signal of interest may be the relative abundance of microbes in a sample. these signals are sparse because only a small fraction of all known species are present (i.e. have nonzero abundance) in any given sample. figure illustrates the process of cs in this context. the cs problem can be represented concisely with linear algebra: y = Φx, where an m × n sensing matrix Φ captures an n-dimensional signal x with m linear measurements that are stored in y. sparse recovery algorithms find the sparsest x that obeys y = Φx either through a convex relaxation (e.g. a lasso regression ( )) or a greedy algorithm (e.g., matching pursuit ( ) ( ) ( ) ( ) ). theory shows that cs can make very efficient use of linear measurements; m scales logarithmically with n ( , ). ( ) count-min sketch (cms) is a specialized cs algorithm where the projection matrix Φ is a structured ( - ) random matrix derived from cheap universal hash functions. due to this carefully designed matrix, it is possible to compute the projection y = Φx as well as perform recovery of x from y without materializing the matrix in memory and instead only use a few universal hash functions, each of which needs only two integers. as a result, we get a provably logarithmic memory algorithm for compressing x and recovering its heavy elements.
the cms is popular for estimating the frequencies of different elements in a data set or stream. the cms algorithm is remarkably simple and has a striking similarity with the bloom filter. the cms is a matrix of counters with w columns and d rows. it can be thought of as a collection of d bloom filters, one for each row, each using a single hash function. the only difference is that we use counters in cms instead of bits in bloom filters. given an input data element x to the cms, it is hashed by d independent hash functions, one per row. the i-th hash function generates a hash value hash_i(x) within range w and increments the counter stored at row i, column hash_i(x). querying the count of an element consists of simply taking the minimum of the counters that the element hashes to in the cms. a tremendous amount of study and follow-up work has been performed by the scientific community to improve the fundamental probabilistic data structures and algorithms. here, we give a brief overview of relevant variations, extensions, and recent advancements to the methodologies described above. there has been a significant advancement in improving the computing cost of minhash, which became a central tool in bioinformatics after the introduction of mash ( ) and other toolkits that then followed ( , ) . minhash requires p hash functions, and p passes over the data to compute p signatures. recently, using a novel idea of densification ( ) ( ) ( ) , densified-minhash was developed. densified-minhash only requires one hash function and one pass over the set to generate all the p signatures of the data with identical statistical properties as p independent minhashes, for any given p. several improvements have been made for efficiently computing weighted minhash as well ( ) , where the elements of sets are allowed to have importance weights. these recent advances have made it possible to convert data into minhashes at the same cost as reading the data; previously, this conversion was the main bottleneck step.
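
returning to the count-min sketch described at the start of this passage, its update and query rules can be sketched as follows. this is a minimal illustration, assuming salted blake2b digests in place of the cheap universal hash functions mentioned in the text; the class name `CountMinSketch` and its methods are hypothetical.

```python
import hashlib


class CountMinSketch:
    """A d-row by w-column table of counters with one hash function per row."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, item, row):
        # Assumption: a salted BLAKE2b digest stands in for the row's hash function.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(item, row)] += count

    def query(self, item):
        # Collisions can only inflate counters, so the minimum is the best estimate.
        return min(
            self.table[row][self._column(item, row)] for row in range(self.depth)
        )
```

because counters are only ever incremented, `query` never underestimates the true count; widening the table lowers the chance that every row of an element is inflated by collisions.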
genomic applications also use many lsh functions beyond minhash. simhash ( ) was invented by google to find near-duplicates over large string inputs using cosine similarity. it was shown in ( ) that for sequence and string datasets minhash is provably and empirically superior to simhash, even for cosine similarity. b-bit minwise hashing is a variation of minhash that saves only the lowest b bits of each hashed value ( ) . it requires less memory to store each hash code and can be used to accurately estimate the similarities among high-dimensional binary data. sectional minhash (s-minhash) ( ) includes information about the location of k-mers or tokens in a string to improve duplicate detection performance. universal (or random) hash functions seek to quickly and uniformly map inputs to hash codes. universal hash functions are important building blocks for the cms, bloom filter, hash table, and other fundamental data structures. murmurhash (https://sites.google.com/site/murmurhash, accessed march ) is a very well-known universal hash that has been widely used in many bioinformatic software packages, including mash ( ) . although previous murmurhash versions were vulnerable to hash collision, murmurhash (https://github.com/aappleby/smhasher/wiki/murmurhash , accessed march ) is a good general-purpose function that is particularly well-suited to large binary inputs. however, there are other options such as xxhash (https://github.com/cyan /xxhash, accessed march ), which can be faster than murmurhash, and cityhash (https://opensource.googleblog.com/ / / introducing-cityhash.html, accessed march ). cityhash is relevant to genomics because it is optimized for strings. it outperforms murmurhash for short string inputs but is appropriate for any length input. farmhash is the successor to cityhash and also focuses on improved string hashing performance (https://opensource.googleblog.com/ / /introducing-farmhash.html, accessed march ).
nthash ( ) is a specialized dna hashing function. it recursively calculates the hash values for the consecutive k-mers in a given sequence. while nthash can be faster than xxhash, cityhash and murmurhash, it is only appropriate for sequence data. minimal perfect hash functions (mphf) and perfect hash functions (phf) map inputs to a set of hash codes without any collisions. a phf maps n inputs, or keys, to a set of >n hash codes, some of which are unused. an mphf maps n inputs to n codes. although mphfs have been used to improve many bioinformatics applications, such as the quasi-dictionary ( ), the mphf construction process is often resource-intensive. critically, all of the inputs must be known in advance to construct an mphf, and many construction methods based on hypergraph peeling fail to scale. bbhash is an mphf construction method that was introduced to scale to massive key sets ( ) . bbhash is constructed by a simple procedure that maps each key to a fixed-size bit array using a universal hash. if two keys collide in the bit array, the corresponding location is set to . otherwise, the bit remains . this recursive process is repeated with all of the colliding keys until there are no more collisions. due to the simplicity of the algorithm, bbhash construction is much faster at the scale typically encountered in genomics. mphfs are usually used to implement fast, read-only hash tables with constant-time lookups. however, clever open addressing schemes can also be used to achieve similar query performance without knowing the key set in advance. rather than avoid hash collisions, open addressing attempts to rearrange elements in the hash table for optimal performance. for instance, hopscotch hashing ( ) ensures that a key pair is always found within a small neighborhood of its hash code. since only a small collection of consecutive buckets need to be searched when a query is issued, hopscotch hashing has very strong query-time performance. 
robin hood hashing ( ) is another open addressing method. the key feature of this algorithm is that it minimizes the distance between the hash code location and the actual key-value pair, reducing worst-case query time. cuckoo hashing ( ) uses two hash functions and guarantees that the element will always be found at one of the two hash indices. some fundamental advances in lsh have also been seen with minimizers. traditionally, minimizer selection is executed according to lexicographic order. however, this procedure may cause 'over-selection' where more k-mers than necessary become minimizers. instead, researchers recently proposed to select minimizers from a set of k-mers based on a universal hitting set or a randomized ordering ( ) . if minimizers are picked from the universal hitting sets, which are the minimum sets of k-mers that cover every possible l-long sequence ( ) , the expected number of minimizers in a given sequence would decrease. there is also recent progress in techniques to rapidly characterize datasets. hyperloglog has risen to prominence recently thanks to its ability to efficiently count distinct elements in large data sets and databases. many new algorithms have since been developed based on hyperloglog to adapt to different scenarios. for instance, hyperloglog++ ( ) was introduced to reduce the memory usage and increase the estimation accuracy for an important cardinality range. sliding hyperloglog ( ) adds a sliding window to the original algorithm for more flexible queries, but it requires more memory storage. bloom filters are attractive because they can substantially compress a dataset, but this approach can return false positive answers. cascading bloom filters ( , ) improve the accuracy of the standard bloom filter. a cascading bloom filter recursively creates child bloom filters to store the false positives from a parent bloom filter. this reduces the false positive rate (fpr) of the overall system at a small memory cost.
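
the standard bloom filter that these variants build on can be sketched as follows. this is a minimal illustration rather than a cascading filter: it assumes salted blake2b digests as the hash functions and stores one byte per bit for readability, and the class name `BloomFilter` is hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array of num_bits positions probed by num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, all initially 0

    def _positions(self, item):
        # Assumption: salted BLAKE2b digests stand in for independent hash functions.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))
```

a cascading variant would, in outline, insert the known false positives of this filter into a second, smaller filter and consult both at query time.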
an alternative fpr reduction strategy is the kmer bloom filter (kbf) ( ) . each k-mer in a sequence overlaps with its adjacent k-mers by k − 1 base pairs. therefore, the existence of two k-mers in a sequence is not independent, and the presence of a particular k-mer in the bloom filter can be verified by the co-occurrences of its neighbors. based on this information, kbf lowers the fpr by checking, for instance, the query's eight possible neighboring k-mers (four to the left and four to the right). if none of the query's neighbors exist in the bloom filter, kbf rejects the query as a false positive. there are also many algorithms built around the generalized bloom filter data structure. these methods give the bloom filter different functions, but maintain its simplicity and memory-efficiency. the counting bloom filter (cbf), for instance, was developed to detect whether the count of an element is below a certain threshold ( ) . the only difference between the bf and cbf is that when adding an element, all the counters for that element increase by 1. the spectral bloom filter (sbf) ( ) functions similarly to a cbf, but the sbf only increases the minimum value in the table when inserting an element. this modification causes sbf to have a lower error rate when compared to the cbf. in addition to extensions and variations of fundamental methods, recent advances have been developed by combining several core data structures and techniques. for instance, race ( ) is an algorithm to downsample sets of genetic sequences while preserving metagenomic diversity. race replaces the universal hash function in the cms with an lsh function. using minhash, race can identify frequent clusters of sequences rather than frequent elements. since race is robust to sequence perturbations, it can be used to implement diversity sampling.
by adjusting the lsh collision properties, race can create a sampled set of sequences that retains metagenomic diversity while substantially downsampling a data stream. the race diversity sampling algorithm is attractive because it can downsample accurately with high throughput, low memory overhead, and only one online pass through the dataset. for each sequence in an input stream, race checks to see whether the sequence belongs to a frequent cluster. this is done by replacing the minimum operation in the cms with an average over the count values. due to a deep connection between race and kernel density estimation, the average is a measure of the number of nearby sequences in the dataset, otherwise known as a density estimate. if the density is low, then race has not seen many similar sequences and the sequence is kept. otherwise, the sequence is discarded. in theory and practice, race attempts to select a constant number of sequences from each cluster. when minhash is properly tuned to differentiate between species, the clusters in the race algorithm correspond to different species in the dataset. as a result, race provides a fast, online and robust way to downsample sequence datasets while retaining important metagenomic properties. another important development comes from the cms and bloom filters. rambo (repeated and merged bloom filter) ( ) is a recent development in multiple set compression for fast k-mer and genetic sequence search. the rambo data structure is inspired by the cms, but the goal is to report the sequence containment status rather than sequence frequency. rambo consists of a set of b × r bloom filters. rather than maintain one bloom filter for each set of k-mers, rambo uses a -universal hash function to randomly merge k datasets into b groups ( ≤ b k ) so that each group has approximately k/b datasets. each partition is compressed using a bloom filter. this process is independently repeated r times with different partitions. 
to determine which sets contain a query sequence, rambo queries each bloom filter. because the groupings are random, each repetition reduces the number of candidates by a factor of 1/b until only the correct datasets are reported at the end of the algorithm. the key insight is that with this approach, rambo can determine which datasets contain a given k-mer or sequence using far fewer bloom filter queries, yielding a very fast sublinear-time sequence search algorithm ( ) . rambo also inherits many desirable features from the cms and the bloom filter. this includes a low false positive rate, zero false negative rate, cheap update process for streaming inputs, fast query time, and a simple systems-friendly data structure that is straightforward to parallelize. in addition to methods that enable the scalable processing of high dimensional data, there are fundamental extensions of and considerations for cs that enable its efficient acquisition. while applications of cs are constrained to those where the sparsity assumption is appropriate, seemingly irrelevant signals may have a hidden sparse representation in some basis. for example, jpeg image compression exploits the fact that natural images can be sparsely represented (or at least approximated) in a discrete cosine basis (a cousin of the fourier transform). when the sparsity basis is known in advance, the canonical cs problem can be reformulated from y = Φx to y = ΦΨs, where s is the sparse representation of x in the basis defined by the columns of Ψ. this transformation was recently demonstrated in transcriptomics ( ) and may soon find an analogous application in metagenomics. aside from signal sparsity, cs also imposes constraints on the sensing matrix. specifically, Φ must adequately preserve signals' separation distances; highly distinct n-dimensional signals should not be forced into close proximity in m-dimensional space once projected by Φ ( , ) .
while gaussian and other classes of random matrices have been shown to work well in the general case, recent techniques indicate that Φ can be iteratively optimized for a given task by simulating measurements and sparse recovery of signals ( ) . however, as we discuss below, practitioners generally do not have full control of Φ in most applications. in metagenomics, the values in Φ are constrained by the nucleic acid content of natural organisms. because each chosen sensor makes up a row of Φ, a new algorithm can select m sensors (e.g. k-mers or probes) from a set of options to optimize the properties of Φ for cs ( ) . very recent techniques in cs are also exploring how to merge machine learning with cs. given a dataset, recent work indicates that both the sensing matrix Φ and the procedure that recovers x from y = Φx can be learned from specially designed deep neural networks ( ) ( ) ( ) ( ) , even in cases where the signal's sparsity structure is nonlinear. datasets in metagenomics are known to be highly structured and could thus be positively impacted by these recent advances in cs in the near future. most, if not all, of the approaches described above have found their way into previously published bioinformatics methods. however, method development to date has been primarily focused on genome sequencing for a single individual or isolate genome. findings suggesting links between microbiomes, such as the human gut microbiome, and human disease ( , ) have led to increased metagenomic sequencing. the rapid growth of this type of sequencing, where the set of reads is from a complex community of organisms, adds additional complexity and new challenges to fundamental comparative genomics problems. here we list a core set of these fundamental problems faced when performing metagenomic sequence analysis: (i) sequence resemblance, (ii) sequence containment, (iii) sequence classification, (iv) sequence downsampling, (v) sequence profiling, (vi) sequence probe design.
for each problem, we discuss the role of the previously described approaches and newer tools incorporating recent advances (table ) . one of the recent breakthroughs in the area of large-scale biological sequence comparison is in the use of locality-sensitive hashing, or specifically minhash and minimizers, for efficient average nucleotide identity estimation, clustering, genome assembly, and metagenomic similarity analyses. mash. in response to the high computational expense of large-scale sequence similarity calculations, researchers have begun to apply probabilistic approaches such as using minhash to approximate the similarity between sequences ( ). in the seminal work of mash ( ) , it was shown that minhash could be used as an extremely efficient estimator for genome similarities in both speed and resource use. it was also shown how mash could be applied to similarity estimates between entire metagenomes. in addition, mashtree has experimented with building phylogenetic trees based on the genomic similarity estimated using mash ( ) . these and other applications led to a quick and widespread adoption of mash throughout the research community for rapid sequence similarity calculations. despite representing a paradigm shift, one of the shortcomings of minhash is that its similarity estimation is most accurate when the two sets have similar sizes and their intersection region is large ( ) . in the paper ( ), the authors also point out that the genomic similarity estimated via jaccard distance is sensitive to the data set size. another limitation of minhash applied to metagenomics is that large amounts of rare k-mers can dominate the sample sketches. these k-mers, which only occur a few times, could be the result of sequencing errors or could come from actual rare species present in a metagenome. we will now review several other recent bioinformatic tools that have accelerated sequence similarity in the era of terabyte-scale datasets.
bindash ( ) , like mash, takes in sequences, compresses them into sketches and then compares sketches to estimate the genome similarities. specifically, bindash focuses on accelerating the sketch construction and sketch comparison time. to do this, bindash uses the b-bit one-permutation minhash algorithm to compress sequences. given a sequence, bindash first decomposes the sequence into k-mers. each k-mer of the sequence is hashed by one predefined hash function. the hash values of k-mers are then pooled together into B buckets (we write B for the number of buckets and b for the number of retained bits). after all the k-mers are hashed and then grouped into B buckets, bindash selects the smallest hash value from each bucket and stores the b lowest bits of each selected hash value as the sketch of a sequence. to account for potentially empty buckets, the sketch process is optimized by the densification operation as mentioned in the previous section. the sketch similarities are then estimated using jaccard indices based on the B · b bit sketch. the experiments show that, compared to mash, bindash can characterize the same data set with less error, less memory and higher speed. dashing. the recently introduced work of dashing uses hyperloglog (hll) sketching to approximate genomic distances ( ) . one main motivation behind dashing is to improve the similarity estimation accuracy across input sequence datasets with different sizes. dashing represents the first time that hll has been applied to estimate the overall similarity between sequence samples. given that hll is used to estimate set cardinality, to use hll to estimate genomic sequence similarities one must estimate the cardinality of the intersection of the two sequence data sets' k-mer sets. dashing first sketches the k-mers of each given sequence data set using hll. it then creates a union sketch using basic register maximum operations between the two hll sketches.
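
the register-maximum union, together with the inclusion-exclusion step that such a union enables, can be sketched as follows. this is a minimal illustration and not dashing's implementation: it uses the classic corrected harmonic-mean estimator with a small-range correction rather than a maximum-likelihood estimator, assumes a 64-bit blake2b digest as the hash function, and the names `HyperLogLog` and `jaccard` are hypothetical.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                    # number of index bits (assumed p >= 7)
        self.m = 1 << p               # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = number of leading zeros in the remaining bits, plus one.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def union(self, other):
        # The sketch of a union is the element-wise maximum of the registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)     # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def jaccard(a, b):
    """Estimate |A ∩ B| / |A ∪ B| via the inclusion-exclusion principle."""
    union_size = a.union(b).cardinality()
    intersection = a.cardinality() + b.cardinality() - union_size
    return max(intersection, 0.0) / union_size
```

because the intersection is obtained as a difference of three estimates, its relative error grows when the two sets overlap only slightly, which is one reason a more accurate cardinality estimator is attractive in this setting.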
now, having access to the cardinalities of both individual sets as well as the size of their union, the inclusion-exclusion principle yields the cardinality of the intersection between the two sequence datasets. the hll cardinalities in dashing are estimated using a maximum-likelihood-based approach, which has higher accuracy than the traditional corrected harmonic mean estimator. dashing is able to sketch metagenomes faster than previous approaches, but it requires more cpu time to calculate the genomic distances. overall, compared to mash, dashing is faster, more accurate and has a lower memory footprint.
finch. rare k-mers can distort sequence comparisons and estimates of inter-metagenomic distances. to solve this problem, finch ( ) uses minhash with a larger sketch size in order to evaluate the abundance of each k-mer. it then sets thresholds based on the estimated abundances to filter out low-abundance k-mers. it also removes k-mers with unequal frequencies of forward and reverse sequences. by deleting erroneous or rare k-mers, finch can robustly estimate the distances between metagenomic samples. finch also reports including a correction for sequencing depth bias.
hulk estimates the similarities among metagenomic samples while taking k-mer frequencies into account ( ) . in hulk, a metagenomic sample is sketched via histogram sketching ( ) into a final histosketch, which preserves k-mer frequency information. to build a histosketch for a given metagenome, reads are first decomposed into k-mers and then streamed in a distributed fashion into independent count-min sketch (cms) counters. once a large number of reads have been counted, hulk sends the cms data to be histosketched and resets the cms counts to their initial values.
in order to create the final histosketch, hulk first summarizes the count-min sketch counters into a k-mer spectrum and then applies consistent weighted sampling (https://www.microsoft.com/en-us/research/publication/consistent-weighted-sampling/, accessed march ) methods. hulk can successfully cluster metagenome samples based on the similarity between histosketches, and it is faster than naive k-mer counting.
kwip is yet another recent approach that tries to improve the accuracy of estimating sequence dataset similarity, using the k-mer weighted inner product (kwip) ( ) . kwip first uses khmer ( ) , a k-mer counting software relying on the count-min sketch, to compress each metagenomic read sample into a sketch. each sketch is an array consisting of m bins. each bin counts the number of occurrences of some of the k-mers (with collisions) in the sample. to calculate the distance between two samples, each of the m bins is assigned a weight to be used in a weighted inner product. in order to assign weights to individual bins, kwip first counts, for each bin, the number of samples among all n samples in which that bin is non-zero. the resulting m-length vector of frequencies is then converted by kwip into another m-length vector of weights based on shannon entropy. this entropy conversion causes bins that have k-mers present in roughly half of the samples to be heavily weighted versus bins that have k-mers present in all or none of the samples (which get a weight of zero). genetic similarity is then approximated by the kwip distance, which is calculated using the inner product between two sample sketches, with each bin weighted by the shannon entropy for that bin. the authors show that kwip can produce more accurate results than mash, especially for metagenomic samples with low divergence. of note, kwip is specifically designed to create a distance matrix from multiple samples, using all samples in the sketching process, as opposed to comparing individual sketches for individual samples like most other methods discussed here.
table . metagenomics software based on probabilistic and signal processing algorithms. six main application areas are highlighted: containment, downsampling, probe design, profiling, resemblance and taxonomic classification. speed indicates the relative computational speed of cpu operations, memory the relative maximum ram used during index construction/query steps and year the publication year. more stars mean better time and memory efficiency; fewer stars indicate more resource-intensive tools. performance estimates using only literature-based comparison are marked in gray. the stars ( - ) correspond roughly to time (days, hours, minutes, seconds and milliseconds) and memory (> gb (server), > gb (workstation), > gb, > mb and < mb). datasets used were shakya et al. ( ) . biobloom tools and opal were indexed using the training data provided by opal, which is much smaller than the dbs other tools use. metamaps is a classifier specifically for long read sequences as compared to the other tools in the category. the datasets and results for each tool can be found at https://gitlab.com/treangenlab/hashreview
order min hash (omh) introduces a new way of sketching a sequence that estimates the edit distance between sequences ( ) . unlike most other hashing-based techniques for similarity calculations, which treat the k-mers without respect to the order in which they occur, omh preserves the k-mer ordering in its sketching process. the sketch for a given sequence consists of n vectors of length l. each of the n vectors contains l representative k-mers, which are selected according to a pre-defined permutation function, and whose relative ordering is maintained from the original sequence.
the distance calculation uses the weighted jaccard distance, where the number of appearances of a k-mer is taken into account.
sourmash ( ) is closely related to mash and based on minhash. it modifies the sketching procedure such that the sketch size can vary between sequences. in this approach, the size of the sketch depends on the number of unique k-mers, unlike the fixed-size minhash sketch. additionally, sourmash includes functionalities such as k-mer frequency calculations as well as a sequence containment method that combines the sequence bloom tree and minhash methodologies.
searching for the containment of a read, gene fragment, gene, operon, or genome within a metagenomic sample or sequence database is a frequent computational task in bioinformatics. this is an open challenge for two key reasons. first, the size of metagenomic and sequence repositories is on the scale of terabytes to petabytes. thus, methods able to quickly eliminate all the non-matching sequences in the database are crucial. second, sequences evolve over time and will rarely, if ever, be an exact match, especially as metagenomes and sequence databases contain a huge amount of sequence diversity. methods that tolerate mismatches and indels have much improved sensitivity compared to methods that require exactly matching sequences to satisfy containment. despite the breakthroughs made via bloom tree inspired structures in sequence search, these approaches are not without drawbacks. first, they have to make a trade-off between false positives and the filter size due to the inherent limitations of the bloom filter. second, they commonly lack flexibility; once the filter size is determined, it cannot be changed based on the size of the input sequences. no matter how many k-mers a sequence has, they all have to be sketched into a fixed size array.
finally, as the size of the input data increases, the precision of the bloom filter-based sequence search typically declines. we will now review a few recent approaches that have tackled this important task in computational biology. sequence bloom tree (sbt) ( ) is a binary tree where each node in the tree is a bloom filter. an sbt is used to index large sequence databases for efficient containment check of a query sequence within the database sequences or datasets. to construct an sbt, each sequence or dataset is added one by one, beginning with adding the first dataset as the root of the sbt. for each additional sequence or dataset, you first compute the bloom filter for the contained k-mers, and then scan from the root of the sbt to the leaves, inserting the dataset's representative bloom filter at the bottom of the tree. at each bifurcation, the insertion traversal follows the path of the child with the closest hamming distance similarity to the bloom filter for the current dataset. after insertion is finished, the new dataset's bloom filter is added as a leaf node, and each node in the sbt contains the union of the bloom filters of its children. to be specific, if a k-mer is present in node u, it should also exist in all the direct ascending nodes' bloom filters from u to the root. therefore, as a bloom filter gets closer to the root, it becomes more populated and the false-positive rate of the bloom filter is higher (a process known as saturation). querying for sequence containment proceeds by querying each node's bloom filter, starting with the root, and determining if enough k-mers are contained from the query's k-mers. if the bloom filter contains enough of the query's k-mers, then each child node's bloom filter is queried for containment. the process proceeds until each sequence or dataset containing the query at the leaves of the sbt is determined. 
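the query procedure described above can be sketched in a few lines of python. this is a toy illustration, not the sbt implementation: the `BloomFilter` and `Node` classes, the `blake2b`-based hash functions, the filter size and the threshold parameter `theta` are assumptions made for this example, and the tree is assembled by hand rather than by the hamming-distance-guided insertion.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, size=1 << 16, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class Node:
    """SBT node: a leaf carries a dataset name, an internal node two children."""

    def __init__(self, bloom, name=None, left=None, right=None):
        self.bloom, self.name, self.left, self.right = bloom, name, left, right


def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def make_leaf(name, seq, k):
    bloom = BloomFilter()
    for kmer in kmers(seq, k):
        bloom.add(kmer)
    return Node(bloom, name=name)


def make_parent(left, right):
    # A parent's filter is the bitwise union of its children's filters.
    bloom = BloomFilter(left.bloom.size, left.bloom.num_hashes)
    bloom.bits = left.bloom.bits | right.bloom.bits
    return Node(bloom, left=left, right=right)


def query(node, query_kmers, theta):
    """Names of leaves holding at least a fraction theta of the query k-mers."""
    if not query_kmers:
        return []
    hits = sum(1 for kmer in query_kmers if kmer in node.bloom)
    if hits < theta * len(query_kmers):
        return []  # prune: no dataset below this node can match
    if node.name is not None:
        return [node.name]
    return query(node.left, query_kmers, theta) + query(node.right, query_kmers, theta)
```

because each parent filter is a superset of its children's filters, a node that fails the threshold allows its whole subtree to be skipped, which is where the search-time savings come from.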
split sequence bloom tree (ssbt) ( ) was implemented to quickly search for short transcripts within a large database. although the ssbt was originally designed for rna-seq data, it can be adapted to other sequence containment problems just like sbts. the ssbt is an improvement over the sequence bloom tree (sbt) data structure ( ) . similar to sbts, each sequence or dataset in the database is inserted into the ssbt by traversing from the root of the tree to the bottom. the ssbt is also a binary tree, but each node has two bloom filters instead of one. the first filter, called the similarity filter, saves k-mers shared by all the datasets in the subtree under a particular node. the second filter, named the remainder filter, stores the k-mers that are not universally shared among all the datasets but are specific to at least one dataset in the subtree for a node. the union of the similarity filter and the remainder filter is a single bloom filter for the node similar to the nodes of an sbt. ssbt is a clever re-organization of sbt resulting in accuracy similar to an sbt but with reduced space occupancy and search time.
bigsi represents a significant advance in sequence containment search; it was introduced to allow efficient search for a query sequence among a large bacterial and viral genome database ( ) . it also relies on bloom filters to solve this problem but, instead of using a tree-like structure (e.g. sbt), bigsi employs a flat bloom filter-based data structure. bigsi first indexes the reference datasets, where these datasets are raw fastq read datasets or assemblies in which to search for the presence of a query sequence. to index the reference datasets, bigsi first extracts a set of non-redundant k-mers from each dataset, and then builds a corresponding bloom filter. after this initial step, bigsi concatenates all the bloom filters together. bigsi compresses the whole database into a matrix, in which each column is a bloom filter for a given dataset.
to conduct an exact search for a sequence, bigsi is expected to find the index of all the k-mers of the query sequence inside the matrix. for inexact search, as referenced above, bigsi just needs to find the index for a subset of the k-mers present in a sequence of interest. bigsi can also dynamically update the size of the sketch based on the number of input datasets. when new datasets arrive, bigsi can add a new column to the matrix for each new dataset.
rambo ( ) is a very recent method which also allows indexing new sequences and new datasets in a streaming fashion. contrary to bigsi, whose query time is O(k), where k is the number of datasets, rambo is sublinear in query time with a slight increase in memory.
mash screen ( ) was developed to determine which reference sequences are contained within a metagenomic sample using minhash, though the methodology is also presented as a method for sequence similarity. similar to metapallette (described below), it uses references found to be contained in a metagenome to describe the metagenome's taxonomic composition, but does not classify individual reads. mash screen first converts a reference sequence and a given metagenomic sample into two sets of k-mers, $A$ and $B$. following that, mash screen compresses the set of reference k-mers $A$ into a minhash sketch $S(A)$. the ratio $|S(A) \cap B| / |S(A)|$ represents the fraction of k-mers in the sketch of $A$ contained in $B$, and is referred to as the containment index. finally, the containment index is converted to a score that approximates sequence similarity. this final score is referred to as the mash containment score. the presence or absence of one or more reference sequences in a metagenomic sample is then determined by this mash containment score. an example is given, for instance, of searching for a set of reference viral sequences in hundreds of metagenomes by calculating the mash containment score between each reference and each metagenome.
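the containment index described above, the fraction of a reference's sketched k-mers that also occur in the metagenomic sample, can be illustrated with a short python sketch. this is a simplified illustration under stated assumptions and not the mash screen implementation: the function names, the `blake2b` hash and the parameters `k` and `s` are choices made for this example, and the subsequent conversion of the index into the mash containment score is omitted.

```python
import hashlib


def kmer_hashes(seq, k):
    """Return the set of 64-bit hashes of all k-mers in seq."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def containment_index(reference, reads, k, s):
    """Fraction of the reference's bottom-s sketch found among the reads' k-mers."""
    # Sketch only the reference: keep its s smallest k-mer hashes.
    ref_sketch = sorted(kmer_hashes(reference, k))[:s]
    if not ref_sketch:
        return 0.0
    # The sample is not sketched; all of its k-mer hashes are collected.
    sample_hashes = set()
    for read in reads:
        sample_hashes |= kmer_hashes(read, k)
    found = sum(1 for h in ref_sketch if h in sample_hashes)
    return found / len(ref_sketch)
```

only the reference is reduced to a sketch here, so the index stays meaningful even when the sample is far larger than the reference, a situation in which a symmetric jaccard estimate becomes unreliable.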
metagenomic sequence classification software typically uses reads to search against known genomes and perform lowest common ancestor (lca) based taxonomic classification. as the size of the reference databases (terabytes to petabytes) and the number of reads ( s of millions to billions) in metagenomic samples increase, it becomes computationally intractable to perform an exhaustive comparison of all k-mers in the reads against all k-mers within the reference databases, opening the door for efficient new tools. kraken ( ) and diamond ( ) were two of the first ultra-efficient tools for fast metagenomic classification. we now review a few recently developed approaches for metagenomic sequence classification.
krakenuniq is built on kraken and its main goal is to decrease the false-positive read classification rate ( ) . compared to kraken, one of the additional features of krakenuniq is that the number of unique k-mers of each taxon is recorded while processing all reads of a metagenomic data set. krakenuniq uses hyperloglog to efficiently estimate these unique k-mer counts. by tracking the number of unique k-mers for a taxon alongside the coverage for that taxon across all the reads in a metagenome, krakenuniq can identify likely false-positive read classifications caused by events such as sample contamination, low-complexity regions, and contaminated database sequences.
kraken 2 substantially reduces memory usage, while simultaneously gaining a significant boost in classification speed, when compared with kraken 1 ( ) . this advancement in memory use and speed comes from using a compact hash table that stores lca assignments for hashed minimizers of k-mers instead of a table storing lca assignments for all k-mers as in kraken 1. while this hash table saves significant memory, it comes at a small specificity and accuracy cost given that it only stores pairs of minimizers and lcas which are further subsampled through hashing.
this hashing process includes adding spaced seed masking to the minimizer before hashing. the size of this new compact hash table can be specified by the user, with smaller sizes reducing the memory footprint and increasing speed but lowering classification accuracy. when compared with other state-of-the-art tools, kraken 2 ultimately provides similar or better classification accuracy alongside its memory and speed improvements.
biobloom tools (bbt) ( ) is novel in that it applies a multi-index bloom filter (mibf) to the sequence classification problem. the mibf is a bloom filter-like data structure that consists of three arrays. the first array serves as a traditional bloom filter, recording the existence of hashed items in a set. the second array, named the rank array, tracks the number of non-zero bits stored in the first bloom filter array at certain intervals (by default, the number of non-zeros every bits in the bloom filter is stored). to reduce memory usage, the rank array is ultimately interleaved with the first bloom filter. the third array, also referred to as the id array, saves the integer identifiers (ids) for reference sequences inserted into the mibf. these ids allow the mibf to additionally store associated taxonomic classification information for entries so that it can be used as a classifier. for each reference sequence, bbt hashes spaced seeds into the mibf rather than contiguous k-mers. spaced seeds, unlike k-mers, allow mismatches between the references and the queries, which can increase the sensitivity of approximate sequence search ( ) . to classify a given read, spaced seeds from the read are looked up in the bloom filter. the rank array is then used to help retrieve ids from the id array. ultimately, the retrieved ids lead to a final taxonomic classification. to reduce the false positive rate, bbt makes use of nearby spaced seeds within adjacent sliding windows, referred to as frames, when performing its classifications.
bbt also intelligently populates the id array in multiple passes such that the effects of data loss from hash collisions are minimized.
ganon ( ) focuses on quick database indexing in order to ensure that the most up-to-date sequence database is used to accurately classify reads. many existing tools apply static, out-of-date versions of databases to assign reads. this approach can miss, for instance, classifications for species that have been newly sequenced and very recently added to existing databases. to overcome this problem, ganon employs interleaved bloom filters (ibf) ( ) to index up-to-date reference genomes efficiently. an ibf is an array of length b · n that encompasses b bloom filters of length n. to index the references, ganon first groups the sequences into clusters. these clusters should roughly mirror different groups for a given taxonomic rank such as different species or strains. it then sketches each cluster into a single bloom filter. lastly, all the bloom filters are interleaved into one ibf. reads are classified if they pass a minimum threshold for the number of matches found between the read and the references. if a given read can map to multiple references, an optional lowest common ancestor (lca) approach can be applied.
metamaps was designed to perform classification on noisy long read data, including making both classifications and abundance estimates down to the strain level ( ) . metamaps classifies long reads by mapping them to reference genomes. given that reads could map to many closely related references, metamaps simultaneously performs mapping and estimates the community composition of a metagenome sample. thus, when determining the probability of mapping a read to a reference, the probability is a combination of both a probabilistic mapping quality to the reference as well as the estimated abundance of the reference's taxonomic unit in the sample.
to quickly find mapping locations for reads across all reference genomes, an efficient probabilistic approach is used that generates initial candidate mappings using minimizers followed by a winnowed-minhash statistical modelling approach for further ani estimation ( ) . the read mappings and metagenome abundance estimates are then iteratively updated through an expectation-maximization (em) algorithm.
metaothello ( ) is one of the latest efforts to improve the speed of metagenomic classification. similar to kraken 2, metaothello reports significant improvements in both memory use and speed when compared to, for instance, kraken 1. metaothello applies the recently developed l-othello data structure, a hashing-based classifier, to speed up the process. metaothello uses k-mers that act as signatures for taxa to make its classifications. a k-mer is a signature for a taxon if it is only present in that taxon or that taxon's subtree, and nowhere else in the tree of life (it is taxon specific). metaothello indexes all reference sequences, finds all taxon signature k-mers and their taxonomic mappings, and populates an l-othello data structure that efficiently maps from signature k-mers to taxa. the l-othello, once built, maintains two arrays $a$ and $b$ populated with binary values. when looking up a k-mer's taxon mapping in the l-othello, the k-mer is hashed by two hash functions $h_a$ and $h_b$ that map to the matching positions in $a$ and $b$. the final corresponding taxon value $t$ for the k-mer is calculated through a bit-wise xor operation of the two values found in $a$ and $b$. thus the classification step of metaothello operates similarly to other approaches. a query sequence is decomposed into its constituent k-mers and the corresponding taxon for each k-mer is looked up using the l-othello data structure. then, differing from other approaches, metaothello uses a windowed approach to make the final classification.
for a given taxonomic rank, the classification takes into account the maximum number of taxon assignments that occur consecutively within the query sequence.
opal ( ) is an lsh-based metagenomic classifier that uses low-density parity-check (ldpc) codes. the rationale for using an ldpc lsh approach is to ensure even coverage of all positions in the k-mer while using as few hash functions as possible. the authors highlight that this is the first application of low-density lsh in bioinformatics. low-density lsh is expected to avoid coverage bias issues and offer increased accuracy when using long k-mers.
in addition to newer, more efficient methods for analyzing large metagenomic data sets, a parallel effort has been emerging that instead reduces the data set size before running further downstream analyses. intelligently downsampling a read data set, for instance, can dramatically speed up any further computations performed, while ideally preserving the important characteristics of the metagenome. an alternative approach to analyzing less data than a full metagenome is to restrict sequencing to a small subset of regions in the metagenome such as the 16s rrna. this sequencing approach, referred to as metabarcoding ( ) or amplicon sequencing, can help to simplify other downstream tasks such as community profiling and taxonomic assignments of reads. here, however, we consider only the recent computational approaches that shrink large metagenomic datasets, either previously generated or in an online streaming fashion.
diginorm ( ) is a cms-based method for downsampling shotgun sequencing data. diginorm is a streaming algorithm that can select a small set of reads from a large dataset using relatively few computational resources without substantial information loss. this improves the speed of downstream tasks. diginorm begins by finding the frequencies of all k-mers in a sequence using a cms.
if the median frequency value is larger than a threshold, usually , the sequence is discarded. this process discards reads with k-mers that have already been observed in other reads. since rare reads have many rare k-mers, they will have a lower median count than common reads and will be kept. an easy-to-use python implementation is provided in the khmer package. bignorm ( ) is an extension of the ideas behind diginorm. bignorm obtains better downsampling performance by including additional information, such as quality scores and common error modalities, when determining whether to accept a read. while bignorm is still based on k-mer abundance counts and the cms, the decision threshold is based on a weighted summary of k-mer counts rather than simply the median. the decision process attempts to remove bias in diginorm that may incorrectly accept a read. for instance, bignorm attempts to differentiate between rare k-mers caused by single substitution errors and authentic uncommon reads. while diginorm and bignorm are both efficient streaming algorithms, bignorm is implemented in c++ and uses parallelism to achieve faster processing times. race ( ) is a recent downsampling method based on lsh and the cms. rather than consider explicit k-mer abundance statistics, race is based on jaccard similarity. diginorm and bignorm both discard reads which contain many k-mers that have already been observed. race discards reads that have a high jaccard similarity with many observed reads. while these decision criteria are similar, density estimation with jaccard similarity is incredibly efficient using the race algorithm. quikr/wgsquikr ( , ) are cs-based approaches that leverage differences in bacterial k-mer frequencies to recover the relative abundances of bacteria in complex samples. the setup of the cs problem is similar to our depiction in figure . in quikr, each column of the sensing matrix is populated with the -mer frequency profile of a bacterial species' s gene. 
sequence measurements across a whole sample are converted to raw -mer frequencies ($y$) from which the sparse combination of species can be recovered using cs with sparsity-based optimization. quikr was soon followed by wgsquikr ( ) , which leveraged the same core method except with -mer analysis of whole-genome shotgun sequencing data. at the time of publication, these techniques achieved competitive accuracy with orders of magnitude improvement in speed over state-of-the-art read-by-read classifiers. however, they were limited to genus-level taxonomic depth and exhibited difficulty in recovering rare organisms.
metapallette ( ) takes a cs-inspired approach similar to wgsquikr for metagenomic community reconstruction with a few subtle but significant differences. the authors define a matrix $A$ created from k-mers of database reference genomes, known as the common k-mer training matrix. this matrix is analogous to the sensing matrix in cs, but $A$ stores pairwise similarities of reference genomes based on shared k-mers. $A$ can be efficiently constructed for long k-mers by using bloom count filters. ultimately, the relative taxa abundances $x$ are recovered from the aggregate sample k-mer counts $y$ by solving $Ax = y$ for a sparse $x$. while we only discuss a single $A$, $x$ and $y$ here, metapallette in fact creates multiple $A$ and $x$ for different values of k for k-mers ( and ) . the authors also augment $A$ with artificial 'hypothetical organisms' of similar k-mer profiles. the use of long k-mers and the mathematical representation of unknown organisms enables metapallette to classify even novel organisms at the strain level.
mission ( ) is a hybrid compressed sensing and hashing-based approach. specifically, mission uses a count-sketch data structure to identify the heavy hitters in the data and applies stochastic gradient descent to update the data structure. the sparsity constraint on the features keeps the top heavy hitters while setting the rest to zero.
this algorithm was used for metagenomic classification on the dataset from ( ) and showed how many features of the data would be adequate relative to performance. metagenomic sequencing has opened the gate for biologists to detect novel or rare organisms in different environments. however, detection with high sensitivity can demand extensive sequencing runtimes to capture novel fragments among the innumerable metagenomic background data ( ) . to circumvent these challenges, single stranded nucleic acid probes can enrich or sense dna fragments by binding to intended target strands. many software packages have been developed for designing probes for a specific target genome, but generating probes for metagenomic analysis is difficult because of the uneven and diverse sequences in metagenomic samples. capturing rare sequences while excluding highly similar sequences is challenging. therefore, metagenomics requires probe design techniques that scale well with the number of organisms found in samples. catch is a newly developed method to design optimal probes for targeted microbial enrichment to facilitate downstream detection in sequencing ( ) . this approach is particularly important for viral detection in samples with low titers; without probe-based enrichment, low abundance viruses may evade detection. moreover, catch pursues a set of probes that can scalably capture the full diversity of viruses. catch first yields a set of candidate probes from the input sequences and then collapses the probes with high similarity using lsh. specifically, it detects nearly-identical probes through either hamming distance or minhash, and then removes the similar candidate probes. to make sure that the final set of probes encapsulates the diversity of the input sequences, catch computes the smallest set of probes needed to cover the whole set of target sequences. catch treats this as a set cover problem and solves it using the canonical greedy solution ( ) . 
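the canonical greedy strategy for set cover can be sketched as follows. this is a generic illustration and not catch's code: the function name, the probe names and the representation of each candidate probe as the set of target positions it covers are hypothetical, and catch's actual hybridization model and coverage constraints are not reproduced.

```python
def greedy_set_cover(targets, candidates):
    """Greedy approximation to set cover.

    targets:    set of elements that must be covered (e.g. target positions)
    candidates: dict mapping a probe name to the set of elements it covers
    Returns the list of chosen probe names, in the order they were picked.
    """
    uncovered = set(targets)
    chosen = []
    while uncovered:
        # Pick the probe that covers the most still-uncovered elements.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("remaining targets cannot be covered by any probe")
        chosen.append(best)
        uncovered -= gained
    return chosen


probes = {
    "p1": {1, 2, 3, 4},
    "p2": {3, 4, 5},
    "p3": {5, 6},
    "p4": {1, 6},
}
selected = greedy_set_cover({1, 2, 3, 4, 5, 6}, probes)
```

in this toy instance the greedy rule first takes `p1`, which covers four positions, and then `p3`, which covers the remaining two. the greedy solution is not guaranteed to be the smallest cover, but it carries a well-known logarithmic approximation guarantee.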
ultimately, thousands of probes are chosen to cover the targets based upon the optimization criteria.
insense. while catch focuses on probe design for enrichment of target sequences in a complex sample before metagenomic sequencing, applying cs permits another workflow with orders of magnitude fewer probes at the cost of some taxonomic depth. if a sample is known to be v-sparse, i.e. to contain a subset of v or fewer of the n possible microbes, cs can be applied with $m = O(v \log(n/v))$ mismatch-tolerant dna probes. the sensing matrix is populated by the expected number of binding events between each probe (in rows) and each target organism (in columns). these nonspecific probes can be thought of as directly measuring the abundance of soft-matching k-mers. proof-of-concept work was first explored in a cs microarray (csm) format ( ) . the same principle has also been demonstrated for sensing bacterial pathogen genomes at species resolution in bulk solution with less than a dozen fluorescent, random dna probes ( ) . although fewer probes can be resolved in bulk solution compared to a microarray (m is limited), such an approach may find applications in rapid infection diagnostics where the species library is constrained to pathogens (n is much smaller) and patient samples are very sparse with at most a few unique species ( ) . given a set of possible microbes (library), a set of probes, and the simulated hybridization behavior between them, a subset of probes can be selected with the insense algorithm ( ) . insense optimizes for the incoherence of the sensing matrix, a common quality metric for cs sensing matrices, with a convex relaxation. this cs approach bypasses sequencing by capturing information directly from probe-target hybridization events, and it will be exciting to see how it performs in real patient and environmental samples.
if the sensing matrix can be accurately predicted from probe and target sequences, it is plausible that future applications can synergize with sequencing databases by automatically updating based on known trends in microbial evolution. however, nonspecific hybridization mandates a thorough understanding of the library of possible species and perhaps careful sample processing; out-of-library, unexpected nucleic acids that interact with nonspecific probes would corrupt the measurements and downstream sparse recovery.
despite the nascent state of metagenomic sequencing and analysis, its accelerated adoption has led to both an explosion in available data and an ever-increasing demand for new data analysis methodologies. in this survey, we have covered what we believe to be a core set of fundamental probabilistic data structures and algorithms that are uniquely positioned to tackle the burgeoning growth of metagenomic data, as well as the added nuances of analyses involving a diverse community contained within a metagenome. despite the relative youth of the field of metagenomics, many fast methods have already emerged that can be used or were designed for this area. for instance, as seen in table , methods like bindash and dashing are being developed in an effort to further accelerate sequence similarity estimation beyond the speed of the seminal mash tool. similarly, recent advances like bigsi, rambo, and ssbt are opening the door to petabyte-scale sequence searches among vast sequencing datasets. however, continued breakthroughs are still needed to better handle metagenomic-specific intricacies such as sequencing error, low abundance community members, and uneven coverage. in addition, probabilistic approaches as discussed in this paper generally come with an accompanying set of pros and cons. for instance, most bloom filter algorithms involve a fundamental trade-off between memory, query cost, and quality.
standard bloom filters balance the size of the bit array with the possibility of false positives. the tradeoff is implicit for any algorithm using this data structure. the fpr can be reduced by choosing the right number of hash functions, which may increase query time, or by making assumptions about the input data, as with k-mer bloom filters. cascading bloom filters provide an alternative way to trade query time and memory for fpr at the expense of a more complex hierarchical structure. additionally, cs approaches come with their own set of tradeoffs. while cs confers measurement efficiency for cost and time savings, it is inherently database-dependent. for instance, in some of the applications we discussed, the sensing matrix was precomputed by leveraging a sequence database (sequences at a specified position, k-mer frequencies, response to a set of probes etc.). similarly, the discovery of sparse representations requires a training set of signals. this requirement for a dataset becomes limiting in chaotic applications such as the identification of organisms that evolve rapidly through either vertical or horizontal gene transfer. such novel differences that real-world samples may exhibit would likely be treated as noise in sparse recovery and ignored until the database is updated. cs is therefore likely limited to applications that exhibit an acceptable level of stability in the dataset. more generally, while the cs technique is provably robust to errors (noise) in the low-dimensional measurements y, any errors in the signal x are amplified by the factor n/m ( ) . in metagenomics, measurement noise may be attributed to whether an expected nucleic acid fragment in the sample generates a read during sequencing, and signal noise could be the result of unforeseen mutations or contamination. in applications featuring significant signal noise, the ratio n/m controls the tradeoff between the efficiency of the measurement process and signal-to-noise ratio degradation.
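the bloom filter tradeoff described above can be made concrete with the standard textbook approximation of the false positive rate, (1 - e^(-k*n/m))^k. the sketch below is a minimal illustration of that formula, not code from any of the tools discussed; the names `m_bits`, `n_items` and `k_hashes` are assumptions chosen here, and they are unrelated to the n and m used for cs in the preceding paragraph.

```python
import math


def bloom_false_positive_rate(m_bits, n_items, k_hashes):
    """Approximate FPR of a standard Bloom filter: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_hash_count(m_bits, n_items):
    """Number of hash functions that minimizes the approximate FPR."""
    return max(1, round(m_bits / n_items * math.log(2)))


# Illustrative sizing: 10 bits of memory per stored k-mer.
best_k = optimal_hash_count(m_bits=1000, n_items=100)
best_fpr = bloom_false_positive_rate(m_bits=1000, n_items=100, k_hashes=best_k)
```

with 10 bits per item, too few hash functions leave the fpr high, while too many fill the bit array and raise it again; every additional hash function also adds one memory probe per query, which is the query-time cost mentioned above.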
in addition to the considerations directly involved in the inner workings of the discussed methods, there are many considerations surrounding these methods that can also greatly affect both their accuracy and scale. while we have discussed various tradeoffs involved in probabilistic approaches, many of these tradeoffs involve carefully selected hyperparameters. to a non-expert user of the methods, it may not be obvious how to set the various parameters for each method, and even advanced users may struggle to find the truly optimal parameter settings derived from underlying theory. another consideration is the modeling of processes such as natural genome evolution. many k-mer based approaches and hashing techniques are initially developed in a way that is blind to underlying biological processes such as evolutionary drift, which gradually introduces point mutations, insertions, and deletions into closely related genomes that otherwise might be identical. conversely, phylogenetic methods that explicitly model events like drift and recombination have been slow to incorporate recent advances discussed in this survey. consideration can also be given to the actual data collection procedures, such as how the dna sequencing is performed. one new advance on the sequencing side of metagenomics is the concept of genome skimming ( ) , which is a technique to lightly sequence metagenomic samples. similarly, metabarcoding ( ) or amplicon sequencing can reduce metagenomic data by only sequencing a small set of amplified regions, potentially speeding up and simplifying downstream analyses. a final consideration surrounding newer methodologies is that of the sequence databases that nearly all metagenomics tools rely on for sequence classification.
while recent advances in probabilistic data structures and algorithms may drastically shrink computational requirements, these speedups can be easily offset and even outpaced by exponential growth in sequence databases that these algorithms must interact with. new methods should also seek to overcome database quality issues such as misassembled or mislabelled genomes or sets of reads. following methodologies such as simple uniform random downsampling and more intelligent downsampling like diginorm ( ) , recent advances like the race method ( ) attempt to address the need to shrink databases and remove contaminants and error, while preserving biologically important characteristics like diversity.

the views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the odni, iarpa, aro or the us government.

key: cord- - c elh authors: tükenmez, hasan; xu, hao; esberg, anders; byström, anders s.
title: the role of wobble uridine modifications in + translational frameshifting in eukaryotes date: - - journal: nucleic acids res doi: . /nar/gkv sha: doc_id: cord_uid: c elh in saccharomyces cerevisiae, out of trna species contain -methoxycarbonylmethyl- -thiouridine (mcm( )s( )u), -methoxycarbonylmethyluridine (mcm( )u), -carbamoylmethyluridine (ncm( )u) or -carbamoylmethyl- ′-o-methyluridine (ncm( )um) nucleosides in the anticodon at the wobble position (u( )). earlier we showed that mutants unable to form the side chain at position (ncm( ) or mcm( )) or lacking sulphur at position (s( )) of u( ) result in pleiotropic phenotypes, which are all suppressed by overexpression of hypomodified trnas. this observation suggests that the observed phenotypes are due to inefficient reading of cognate codons or an increased frameshifting. the latter may be caused by a ternary complex (aminoacyl-trna*eef a*gtp) with a modification deficient trna inefficiently being accepted to the ribosomal a-site and thereby allowing an increased peptidyl-trna slippage and thus a frameshift error. in this study, we have investigated the role of wobble uridine modifications in reading frame maintenance, using either the renilla/firefly luciferase bicistronic reporter system or a modified ty frameshifting site in a his a::lacz reporter system. we here show that the presence of mcm( ) and s( ) side groups at wobble uridines are important for reading frame maintenance and thus the aforementioned mutant phenotypes might partly be due to frameshift errors. transfer of genetic information from mrna into proteins is the most energy consuming process in the cell and the translation machinery needs to decode mrnas with high efficiency and fidelity ( ) . even though the translational machinery transfers the information in mrna into protein with high fidelity, errors occur at a low frequency. 
missense errors are in most cases not harmful to the function of a protein, since such an error alters only a single amino acid, which will not interfere with the function or stability of the protein if it occurs in a non-critical position. in contrast, processivity errors, like frameshift errors, are detrimental, since they completely change the amino acid sequence downstream of the frameshift site. moreover, following such an error, the ribosome frequently encounters a stop codon in the new reading frame, resulting in premature termination of translation. accordingly, the frequency of frameshift errors is about -fold lower than the frequency of missense errors ( , ) . there are many examples where alterations in the trna structure, e.g. lack of a modified nucleoside, affect the fidelity of reading frame maintenance ( , ) . in bacteria, modified nucleosides of different chemical structures, present in different positions and in different species of trna, all prevent frameshift errors ( , ) . in eukaryotes, both wybutosine (yw) and queuosine (q) in rabbit reticulocytes as well as other modified nucleosides present in the anticodon loop of eukaryotic trnas are important to maintain the reading frame ( , ) . synthesis of yw in yeast trna occurs in several steps, and whereas fully modified yw gives a low frequency of frameshifting, the presence of any of the various intermediates in the synthesis of yw increases frameshifting ( ) . also, lack of either cyclic n -threonylcarbamoyladenosine (ct a) at position or pseudouridine (Ψ) at position and in yeast trna increases + frameshifting ( ) ( ) ( ) ( ) . relevant for this study, the modified wobble nucleoside methylaminomethyl- -thiouridine (mnm s u ) present in bacterial trna specific for gln, lys and glu, is important for proper reading frame maintenance (the wobble nucleoside is in position of the trna and we denote such a nucleoside as n where n is any nucleoside.) ( , ( ) ( ) ( ) ( ) .
apparently, modification status both in bacteria and in eukaryotes is important for proper reading frame maintenance ( , ) . a peptidyl-trna slippage model of how trna modification deficiency may induce frameshifting errors is well established ( , , , ( ) ( ) ( ) ( ) ( ) ( ) ( ) .

figure . dual-error frameshifting model. modification deficient trnas can induce frameshifting by either an a- or a p-site effect, or a combination thereof. (a) lack of wobble uridine modification reduces the efficiency of the ternary complex (aminoacyl-trna*eef a*gtp, here shortened to aminoacyl-trna) to be accepted to the a-site, allowing a near cognate aminoacyl-trna to be accepted in the a-site. after translocation to the p-site, the near cognate trna slips into an alternative reading frame, as it does not perfectly fit in the p-site. (b) lack of wobble uridine modification reduces the efficiency of the cognate aminoacyl-trna to be accepted to the a-site, which induces a pause that allows the trna in the p-site to frameshift. (c) the hypomodified aminoacyl-trna is able to enter the a-site and translocate to the p-site where it then slips into an alternative reading frame due to a reduced ribosomal grip.

according to this model (figure ), modification deficient aminoacyl-trnas present in a ternary complex, i.e. aminoacyl-trna*eef a*gtp (here shortened to aminoacyl-trna), induce frameshifts either by causing an a- or a p-site effect, or a combination thereof. lack of modification causes a defect in the cognate aminoacyl-trna selection step (we denote such an error as an a-site effect by modification deficiency), allowing a ternary complex with a near cognate wild type aminoacyl-trna instead of a cognate aminoacyl-trna to be accepted in the a-site. after translocation to the p-site, the fit of the near cognate peptidyl-trna is not optimal, so it slips one nucleotide forward (+ frameshift) ( figure a ).
alternatively, lack of a modified nucleoside reduces the efficiency by which a cognate aminoacyl-trna is accepted to the a-site, which induces a ribosomal pause allowing the wild type peptidyl-trna to slip forward one nucleotide (denoted an a-site effect by modification deficiency, figure b ). when frameshifting is caused by a p-site effect, the hypomodified trna is efficiently accepted to the a-site and translocates to the p-site, where its fit is not optimal, so it slips into an alternative reading frame due to a reduced ribosomal grip (p-site effect by modification deficiency, figure c ) ( , , , , ). thus, in some cases, modification deficiency both reduces the rate of selection of the aminoacyl-trna (a-site effect) and reduces the ribosomal grip in the p-site (p-site effect). note that in all cases explained above, the error in reading frame maintenance is due to a peptidyl-trna slippage. modifications of uridines in the wobble position of trnas are frequent in all three domains of life. in saccharomyces cerevisiae, there are trna species having four related modified uridine nucleotides at wobble position ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . these modified nucleosides are -carbamoylmethyluridine (ncm u ) present in five ( , , ) , -carbamoylmethyl- -o-methyluridine (ncm u m) present in one ( ), methoxycarbonylmethyluridine (mcm u ) present in two ( , ) and -methoxycarbonylmethyl- -thiouridine (mcm s u ) present in three trna species (figures and ) ( , , ) . the first step in the synthesis of the mcm and ncm groups of the uridine modifications mentioned above requires the six-subunit elongator complex and its seven associated proteins (reviewed in karlsborn et al. ( ) ). mutations in any of the corresponding genes result in deficiency of these xm -uridine modifications without affecting stability or aminoacylation of trna ( ) .
these mutants also show strong pleiotropic phenotypes, such as defects in growth, transcription, chromatin remodelling, dna repair and secretion (reviewed in karlsborn et al. ( ) ). all these phenotypes, except lack of xm side chains, are suppressed by overexpression of hypomodified trnas specific for gln, lys and glu that in a wild type contain mcm s u ( , ) . it was concluded that lack of this wobble nucleoside reduces the efficiency with which these trnas recognize their cognate codons, which is compensated by an increased concentration of the modification deficient trna. thus, the many different phenotypes of elongator mutants are due to reduced efficiency in translating some key mrnas encoding proteins important for manifesting a correct phenotype. in bacteria, modified wobble uridines are important to prevent + frameshifting ( , ) . in eukaryotes, only a limited study has been done, which focused on the influence of the esterified methyl group of mcm u in reading frame maintenance ( ) . however, no specific conclusion could be drawn about where the frameshift errors occur, since the frameshift window used was very large. therefore, no extensive information on the role of modified wobble uridines in reading frame maintenance is available for eukaryotic trna. it was therefore important to investigate whether or not the xm u or mcm s u modifications are crucial for reading frame maintenance. here, we show that presence of xm -(x, any substitution) or s side groups at wobble uridines in yeast is pivotal in maintaining the translational reading frame. the source and genotypes of yeast strains used in this study are listed in table . the e. coli strain used was dh α (bethesda research laboratories). yeast transformation ( ) , media and genetic procedures have been described previously ( ) . plasmid pjd contains a renilla/firefly luciferase bicistronic reporter system ( ) .
to introduce various frameshifting windows between the luciferase genes, a bamhi-xhoi fragment from plasmid pjd containing the firefly luciferase gene was cloned into corresponding sites of ycp , generating plasmid ycp -firefly. two complementary oligonucleotides carrying various frameshifting windows (see supplementary table s ) were annealed and cloned into the bamhi and saci sites of ycp -firefly. the newly constructed plasmids were digested with restriction enzymes (bamhi and xhoi) and fragments containing the frameshifting sites linked to the firefly luciferase gene were cloned back into the corresponding sites of pjd , restoring the bicistronic reporter system with the frameshifting window. plasmids pmb - mer (ff and wt) contain a his a::lacz reporter cassette. in pmb - merff (in-frame control construct), the lacz gene is in frame, while in pmb - merwt (test construct), the lacz gene is in + frame ( figure b ) ( ) . these plasmids were used as templates for pcr oligonucleotide directed mutagenesis to alter the ty sequence (ctt-agg-c) ( figure b and supplementary table s ) . for the overexpression of the lys-trna encoded by the tk(uuu)l gene, we first introduced sphi and nhei restriction sites to plasmids pmb - mer (ff and wt) carrying the 'cuu-aaa-c' sequence by pcr oligonucleotide directed mutagenesis. oligonucleotides used were -ggtgtcggggcgcatgcatgacccagtcac- and -agagtgcaccatatgcggtgtgagct agcgcacagatgcg- . the tk(uuu)l gene was amplified from strain umy by using oligonucleotides aaaagcatgccggtagagtctctt-cttggtc- and aaaagctagccggta-agagagaaacctcca- and cloned between sphi and nhei sites of these plasmids. three individual transformants of each dual luciferase assay construct (biological replicates) were grown at °c in synthetic complete (sc)-ura medium to an optical density at nm (od ) of . . for each transformant, triplicate samples (technical replicates) of l cells were collected and kept at - °c.
the luciferase assays were performed according to the instructions of the dual-luciferase reporter assay system (promega, catalog no. e ). the luciferase activities were determined in a white -well plate (thermo scientific, # ) using a tecan infinite luminometer. the levels of + frameshifting (%) were determined by normalization of each biological test replicate with the average of the three biological replicates of the in-frame control. each value of the biological replicates was determined by taking the median of the three technical replicates. the significant differences between wild type and mutant were determined by two-tail t-test. three transformants of each ty assay construct (biological replicates) were grown in sc-ura to od ≈ . and od -units were collected and kept at - °c. for each transformant, β-galactosidase measurements were done three times (technical replicates). β-galactosidase activities were determined as described previously ( ) . values of the biological replicates were determined by taking the median of the technical replicates. the levels of + frameshifting (%) were determined by normalization of each biological test replicate with the average of the three biological replicates of the in-frame control. the significant differences between wild type and mutant were determined by two-tail t-test.

figure . (a) schematic drawing of the dual-luciferase assay system. transcription of the genes encoding the renilla- and firefly-luciferase is under the adh promoter and terminated by cyc terminator. frameshift sites were cloned between the luciferase genes and expression of the firefly luciferase gene requires + frameshifting. the frameshifting site is as follows: xxx-slippery site, nnn-assay codon and uag-stop codon (all in-frame). an upstream stop codon (uag) was placed in the + frame to eliminate frameshifting events occurring before the assay site. the frame of the different luciferase genes is indicated. in the in-frame control construct, renilla- and firefly-luciferase genes are in-frame. (b) schematic drawing of the ty assay system. transcription from his promoter generates a transcript containing the first nucleotides of the his gene in the in-frame and the lacz gene of escherichia coli in the + frame. expression of the lacz gene is dependent on a + ribosomal frameshift event taking place within ty sequence. an upstream stop codon (uga) was placed in the + reading frame to eliminate frameshifting events occurring before the assay site. in the in-frame control construct, the first nucleotides of the his gene and lacz gene are in-frame.

to analyze the role of wobble uridine modifications ncm u, ncm um, mcm u or mcm s u in reading frame maintenance, we used defined yeast mutants unable to form the s group (tuc Δ; also denoted as ncs Δ), the ncm or mcm groups (elp Δ) or the esterified methyl group (trm Δ) of the mcm side chain. the ribosomal + frameshifting assay system used contains a renilla luciferase (r-luc)/ firefly luciferase (f-luc) bicistronic reporter system ( figure a ) (see material and methods) ( ) . this bicistronic mrna synthesizes a two domain protein with the indicated enzymatic activities. to analyze a + frameshift event, a sequence is introduced between these two cistrons in such a way that translation of r-luc is in the zero frame and the f-luc is in the + frame ( figure a ). to obtain f-luc activity the ribosome must shift into the + frame before entering the f-luc gene. the inserted sequence between the r-luc and the f-luc reporter genes consists of a slippery codon (xxx) at which the peptidyl-trna will slip, the codon to be assayed for a-site selection (nnn), followed by a stop codon in zero frame (uag) ( figure a ). to terminate all ribosomes that have accidentally slipped into the + frame upstream of the slippery codon, a stop codon was inserted in the + frame just a few nucleotides upstream of the slippery codon (see figure ) .
thus, to obtain f-luc activity a + frameshift must occur at the + frameshift sequence upstream of the stop codon in the zero frame. this construct results in a very short frameshifting window between the upstream stop codon in the + frame and the downstream in-frame stop codon. the slippery codon is determined individually for different assay sites in order to optimize the slippage of the peptidyl trna at the p-site. we chose uuu, ccc or ggg codons as the slippery codons (supplementary figure s and table and supplementary table s ). codon uuu is decoded by trna phe gaa , which has the wobble nucleoside gm ( ) , and its structure is not affected by the elp , tuc or trm mutations. codon ccc is read by the i (inosine) containing trna pro igg and the ncm u containing trna pro ncm ugg and the slippery codon ggg is read by the mcm u containing trna gly mcm ucc and the c containing trna gly ccc (figure ) ( , , ) . note that the structures of the ncm u containing trna pro ncm ugg reading the slippery codon ccc and the mcm u containing trna gly mcm ucc reading the slippery codon ggg are affected by the elp mutation and might therefore obscure the monitoring of an a-site effect at these test codons. these issues will be addressed below. as a control, we used a construct carrying the r-luc and f-luc genes in-frame. by dividing the ratio of f-luc/r-luc activities generated from the frameshifting construct with the ratio of activities from the f-luc/r-luc in-frame control, the level of frameshifting was revealed. using these reporter systems, the level of frameshifting for specific trna isoacceptors was investigated in the presence or absence of s , ncm , mcm groups or the esterified methyl group of the mcm side chain at u . in the bacterial system, modification deficiency of aminoacyl-trna in the ternary complex causes in most cases a slow entry of it to the a-site and thereby induces a peptidyl-trna slippage ( figure a and b) ( ) . 
therefore, we suspected that in the cases below where we observed an effect on the frequency of frameshifting in the modification deficient mutants, it would primarily be due to an a-site effect, i.e. slow entry of the ternary complex containing aminoacyl-trna cognate to the test codon allowing a peptidyl-trna interacting with the slippery codon xxx to slip ( figure a and b) . all constructs used have a uag stop codon just after the test codon nnn (i.e. the sequence is in zero frame -xxx-nnn-uag). translational termination in yeast is controlled by two interacting protein chain release factors, erf and erf . whereas erf recognizes all three stop codons, binds to the ribosomal a-site, and promotes hydrolysis of the p-site located peptidyl-trna, erf stimulates the termination activity of erf (reviewed in kisselev and buckingham ( )). poor erf binding to the uag in the a-site may induce slippage by the modification deficient peptidyl-trna from cognate nnn codon to nn-u codon. therefore, the + frameshifting observed using the luciferase assay may be caused by either an a-site or a p-site effect or both. note that in all these cases the error occurs in the p-site (either by the trna reading the slippery codon or a trna cognate to the test codon). although the luciferase system used by us is unable to distinguish between an a- or a p-site effect caused by modification deficiency, it is still a valuable method to address whether or not modification is important for maintaining the reading frame. to address specifically if modification deficiency induces an a- or a p-site effect, we used the ty system, which is explained below. in yeast, there are trna species having mcm s u , mcm u , ncm u or ncm u m nucleosides at wobble position (figures and ) . the role of these modified uridines was analyzed for ribosomal + frameshifting using the renilla/firefly luciferase bicistronic reporter system described in the previous section.
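the quantification used with this reporter system (median of the technical replicates, then normalization to the mean of the in-frame control, as described under materials and methods) can be sketched in python as follows. this is a minimal illustration rather than the authors' analysis script: the function names are hypothetical, the example ratios are invented, and it is an assumption that the f-luc/r-luc ratio is formed for each technical replicate before the median is taken.

```python
from statistics import mean, median


def replicate_values(technical_replicates):
    """Collapse technical replicates: one median per biological replicate."""
    return [median(values) for values in technical_replicates]


def frameshift_percent(test_ratios, control_ratios):
    """Percent +1 frameshifting for each biological test replicate.

    Both arguments hold one list of technical-replicate F-luc/R-luc ratios
    per biological replicate. Each test value is divided by the mean of the
    in-frame control values and expressed as a percentage.
    """
    control_mean = mean(replicate_values(control_ratios))
    return [100.0 * value / control_mean
            for value in replicate_values(test_ratios)]


# Hypothetical F-luc/R-luc ratios, invented for illustration only.
in_frame_control = [[0.98, 1.00, 1.02], [1.08, 1.10, 1.12], [0.88, 0.90, 0.92]]
test_construct = [[0.010, 0.012, 0.011], [0.020, 0.019, 0.021], [0.015, 0.015, 0.016]]

percent_by_replicate = frameshift_percent(test_construct, in_frame_control)
```

the resulting per-replicate percentages for a wild type and a mutant strain are the values that would then be compared with the two-tail t-test.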
in an elp mutant these trna species are missing the ncm and mcm groups at wobble position (u ) ( , ) . the role of the ncm and mcm groups present in these trnas in reading frame maintenance was investigated in a wild type and in an elp mutant strain using cognate or near cognate codons as test codons. lack of the mcm side chain in trna arg mcm ucu , trna gly mcm ucc , trna lys mcm s uuu and trna glu mcm s uuc resulted in significantly higher levels of + frameshifting with either a-ending cognate or g-ending near cognate codons ( table and supplementary figure s ). however, absence of the mcm group in trna gln mcm s uug did not have any significant effect on reading frame maintenance for the gln codons caa or cag ( table ). lack of the ncm group of u in trna val ncm uac and trna ser ncm uga resulted in an increased level of + frameshifting with near cognate val codon gug or cognate ser codon uca. in contrast, absence of the ncm group of u in trna pro ncm ugg resulted in a decreased level of + frameshifting with near cognate pro codon ccg. lack of the ncm group of u of the remaining trnas did not cause a significant difference in levels of + frameshifting ( table and supplementary figure s ). we conclude that in the xm u and mcm s u trna isoacceptors, the mcm group plays a more vital role than the ncm group in reading frame maintenance. codon ccc that can be read by ncm u containing trna pro ncm ugg is used as a slippery codon upstream next to gln-, lys, arg-, gly-and thr-test codons (table and supplementary figure s ). thus, the ncm u present in the potential peptidyl-pro-trna might influence the ribosomal grip in the p-site and thus influence the slippage. to test directly the influence of the ncm group in pro-trna in peptidyl-pro-trna slippage, we used a construct -uuu-ccc-uag-in the luciferase system. the stop codon uag is in the zero frame just after the pro codon ccc. 
since eukaryotic release factor (erf ) acts in the a-site (reviewed in kisselev and buckingham ( )), a possible + frameshift by pro-trna lacking ncm u would occur in the p-site. however, no significant + frameshifting was observed when the ccc codon was just upstream of the stop codon and thus in the p-site (table and supplementary figure s ). we conclude that the ncm group in pro-trna does not increase peptidyl-trna slippage at the slippery codon ccc. the slippery codon ggg is decoded by both c containing trna gly ccc and mcm u containing trna gly mcm ucc . the structure of the latter trna is affected by the elp mutation and this trna reads the ggg codon very inefficiently compared to the cognate trna gly ccc ( ) . therefore, in the elp mutant it is not likely that trna gly mcm ucc lacking the mcm side chain at u will out-compete the efficiently decoding cognate trna gly ccc at ggg codons. consequently, the observed + frameshifting for trna glu mcm s uuc and trna val ncm uac (table ) is most likely not caused by slippage of an unmodified peptidyl-trna gly mcm ucc but rather a poor a-site entry by trna glu mcm s uuc and trna val ncm uac , respectively.

table . influence of trna modifications mcm s u , mcm u , ncm u and ncm u m on reading frame maintenance based on data using the renilla/firefly luciferase bicistronic reporter system.
a bold indicates a significant difference in frameshifting levels between the indicated mutant and wild type as determined by two-tail t-test. (*) indicates p < . and (**) indicates p < . .
b trna tyr g a has an unmodified g nucleoside at wobble position and its structure is not affected by the elp , trm or tuc mutations, thus it is used as a negative control ( ) .
c in an elp mutant, the level of the s group is reduced ( ) ( ) ( ) .
d in a trm mutant, trna arg mcm ucu and trna glu mcm s uuc contain a mixture of ncm u/cm u and ncm s u/cm s u nucleosides, respectively ( ).
n.a., not applicable.
the formation of the esterified methyl group of mcm u, which is the last step in the synthesis of the mcm side chain, is catalyzed by the dimeric trm /trm protein complex ( , ) . the influence of the esterified methyl group of the mcm u and mcm s u nucleoside in reading frame maintenance was investigated in a wild type and in a trm mutant strain using cognate or near cognate codons as test codons (table ). lack of the esterified methyl group of the mcm side chain of u in trna gln mcm s uug resulted in significantly decreased + frameshifting at the gln codon caa (table and supplementary figure s ). there were no significant differences in the levels of frameshifting between wild type and trm mutant in the remaining test constructs (table and supplementary figure s ). we conclude that the presence or absence of the esterified methyl group in the mcm side chain seems to play only a minor role in + frameshifting.

in a tuc mutant, the s group of mcm s u at the wobble position (u ) is absent in trna gln mcm s uug , trna lys mcm s uuu and trna glu mcm s uuc ( ) . the role of the s group present in these trnas in reading frame maintenance was investigated in a wild type and a tuc mutant strain using cognate or near cognate codons as test codons (table ). absence of the s group in trna lys mcm s uuu and trna glu mcm s uuc resulted in significantly higher levels of + frameshifting with either a-ending cognate or g-ending near cognate codons (table and supplementary figure s ). however, lack of the s group in trna gln mcm s uug resulted in significantly decreased + frameshifting with the near cognate cag (table and supplementary figure s ).

in the cases stated above, gln- and pro-trnas showed reduced levels of + frameshifting due to lack of esterified methyl, s or ncm groups (table ). this reduced level of frameshifting might be surprising, but similar observations were noted earlier.
in bacteria the gln-, lys- and glu-trna contain as wobble nucleoside the mnm s u, which is structurally related to the mcm s u present in the corresponding yeast trnas. lack of either the mnm side chain or the sulphur at position reduced frameshifting, similar to what we noted for the two aforementioned cases ( , ) . although these results seem counterintuitive, one has to remember that the structure of the different trna species is optimized and in fact has evolved to give similar decoding activity, which is achieved partly through modification ( ) . therefore, a modification may improve the activity of one trna whereas it might reduce the activity of another trna species (see discussion of this issue in björk and hagervall ( )). from such considerations, one would expect that when measuring a specific activity of a trna, such as its influence on reading frame maintenance, a modification might either improve or reduce its fidelity.

a key feature of the peptidyl-trna slippage model is that the error in reading frame maintenance, induced either by an a- or a p-site effect due to modification deficient trna, occurs in the p-site by peptidyl-trna slippage. there are two ways to establish whether the frameshift errors are due to an effect in the ribosomal a- or p-site: one either determines the amino acid sequence of the frameshift peptide covering the frameshift window or overexpresses the trna cognate to the a-site codon. in the latter case, if the frameshift error occurs due to an a-site effect, such overexpression would decrease the frameshift error, since it reduces the ribosomal pause and thereby reduces the ability of the peptidyl-trna to slip forward. we chose the latter method, since this approach is relevant for this study: such a treatment also suppresses all the pleiotropic phenotypes induced by a mutation in, e.g. the elp gene.
thus, the strong pleiotropic phenotypes observed in an elp mutant might be due, at least partly, to errors in reading frame maintenance of some key mrnas.

because, as stated in the description of the assay system, the dual-luciferase assay system is not designed to distinguish between an a- or a p-site effect caused by modification deficiency, we decided to use the ty assay system to address this question. the expression of the tyb gene of the yeast ty retrotransposon requires a ribosomal + frameshift event caused by a peptidyl-trna slippage ( ) . only a seven nucleotide sequence, cuu-agg-c, is required for the + frameshift event to occur and thus only two trna species (trna leu uag and trna arg ccu ) participate in this event. in the yeast strain used, the availability of trna arg ccu is low, resulting in a low rate of ribosomal a-site selection, which induces a slippage by trna leu uag at the cuu p-site codon into the + frame (uu-a) ( ) . therefore, we decided to use an altered version of the ty + frameshift system to study whether or not the + frameshifting caused by lack of the mcm side chain in the r-luc-f-luc system is due to a peptidyl-trna slippage. we altered the 'cuu-agg-c' + frameshift site by changing the arg codon (agg) into either a lys codon aaa decoded by trna lys mcm s uuu or an arg codon aga decoded by trna arg mcm ucu to test whether or not lack of the mcm side chain of these trnas induces + frameshifting (table ). if the hypomodified trna is inefficiently accepted to the a-site in an elp mutant, the aaa (lys) and/or aga (arg) test codons will act similarly to codons decoded by the poorly available trna arg ccu , resulting in a slow entry to the a-site by the ternary complex containing the unmodified trna. if so, the trna leu uag in the p-site will slip into the + frame (from cognate cuu to non-cognate uu-a). all alterations of the ty sequence were made in the his a::lacz frameshift reporter plasmid (see materials and methods) (figure b).
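the geometry of the seven-nucleotide site can be made concrete with a small sketch that splits a site into codons in the zero frame and in the shifted frame. the helper name `codons` and the dictionary labels are hypothetical; the three sequences are the original site and the two altered sites named in the text.

```python
def codons(seq, frame=0):
    """Split seq into complete triplets, starting `frame` nucleotides in."""
    shifted = seq[frame:]
    return [shifted[i:i + 3] for i in range(0, len(shifted) - 2, 3)]


# Frameshift sites written without the hyphens used in the text.
sites = {
    "original site (arg agg)": "cuuaggc",
    "lys test codon (aaa)": "cuuaaac",
    "arg test codon (aga)": "cuuagac",
}

for label, site in sites.items():
    zero_frame = codons(site, frame=0)     # p-site codon, then a-site codon
    shifted_frame = codons(site, frame=1)  # codons read after the slip
    print(f"{label}: zero frame {zero_frame}, shifted frame {shifted_frame}")
```

in every variant the p-site codon cuu becomes uua after a one-nucleotide forward slip, which is the cuu to uu-a realignment described above; only the a-site codon, and therefore the trna whose entry rate is being tested, differs between constructs.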
the levels of frameshifting were calculated by dividing the β-galactosidase values generated from the test construct by the values from the in-frame control construct. table shows that for the 'cuu-aaa-c' lys codon test construct, lack of the mcm side group in the mcm s u nucleoside of trna lys mcm s uuu resulted in -fold increased + frameshifting in the elp mutant compared to wild type. in contrast, for the 'cuu-aga-c' arg codon test construct, lack of the mcm side group in trna arg mcm ucu did not increase frameshifting in the elp mutant compared to the wild type (table ). thus, similar to the results obtained with the luciferase system, lack of the mcm group of trna lys mcm s uuu induced increased + frameshifting. although we observed an increased frameshifting for mcm deficient trna arg mcm ucu in the luciferase assay system (table ), this was not the case using the ty assay system (table ).

to analyze whether lack of the mcm group of trna lys mcm s uuu could induce + frameshifting due to a p-site effect of modification deficiency, we placed a lys codon aaa instead of the cuu codon in the ty assay system and varied the following codon. the concentration of a trna species is proportional to the number of the corresponding trna genes in the yeast genome ( ) . accordingly, by placing different codons in the a-site, the concentration of the trnas in the cell reading this codon is changed and thereby the efficiency of reading the a-site codon is altered. thus, to test for a possible p-site effect induced by a lack of the mcm group of mcm s u in lys-trna, we placed an arg codon agg, read by the rare cognate trna arg ccu ( genomic copy) and the near cognate trna arg mcm ucu ( genomic copies), after the lys codon aaa. in an elp mutant trna arg ccu is essential, demonstrating that the mcm group of the near cognate arg-trna is required for efficient reading of the arg codon agg ( ) .
therefore, in an elp mutant, a situation is generated where the agg codon is read slowly since it is read mainly by the rare cognate trna arg ccu and inefficiently by the more abundant modification deficient near cognate trna arg mcm ucu . such a condition would allow trna lys mcm s uuu at the p-site to slip to the + translational frame. furthermore, we made test constructs to increase the rate of a-site selection by introducing either an ile codon auu decoded by cognate trna ile aau present in genomic copies or an arg codon cgu decoded by cognate trna arg acg present in genomic copies after the lys codon aaa (table ). when varying the concentration of the potential a-site decoding trnas from genomic copy to genomic copies, we did not observe any significant difference in the levels of + frameshifting between wild type and elp mutant (table ). apparently, the possible peptidyl-trna lys mcm s uuu slippage is not sensitive to the rate of a-site selection, suggesting that lack of mcm s u does not cause any p-site effect and thus no increased peptidyl-trna slippage.

if the frameshifting event occurring at the modified ty site 'cuu-aaa-c' was caused by a slow entry of the ternary complex containing the hypomodified trna lys mcm s uuu causing a peptidyl-trna leu uag slippage to the + translational frame, an elevated level of the hypomodified trna lys mcm s uuu should increase the rate of a-site selection and thereby reduce + frameshifting (figure b). we therefore cloned the tk(uuu)l gene, which encodes trna lys mcm s uuu , into either plasmid pmb - merwt (test construct, containing the cuu-aaa-c frameshift site) or pmb - merff (corresponding in-frame control construct, figure and supplementary table s ). thus, the plasmids harbor both the trna gene and the β-galactosidase gene with either a frameshift site or an in-frame control.
the plasmid encoded tk(uuu)l gene resulted in overexpression of trna lys mcm s uuu and concomitantly reduced the levels of + frameshifting in the elp mutant from - to -fold compared to wild type (table ). these data strongly suggest that the + frameshifting event at the 'cuu-aaa-c' lys codon test construct occurs by peptidyl-trna leu uag slippage due to an a-site effect caused by a slow entry of the hypomodified trna lys mcm s uuu . as was suggested earlier by us ( , ) and confirmed by rezgui et al. ( ) , the major function of the mcm s u nucleoside in lys-trna is to improve the reading of the cognate codon. thus, mcm s u deficiency results in slow decoding and reduced translation elongation rate but also, as shown here, induces + frameshifting by reducing the rate of a-site selection.

among the trna isoacceptors having xm u or xm s u wobble uridine nucleosides, only lys- and gln-trnas have been investigated for + frameshifting in both bacteria and yeast. the modified wobble nucleoside -methoxycarbonylmethyl- -thiouridine (mcm s u ) present in yeast trnas specific for gln, lys and glu has a chemically related form, -methylaminomethyl- thiouridine (mnm s u ), present in the corresponding bacterial trnas. in bacteria, lack of the mnm group in gln-trna results in increased + frameshifting at both cognate (caa) and near cognate (cag) codons, whereas absence of the s group results in + frameshifting only at the cognate (caa) codon ( ) . in contrast, lack of mcm or s groups in yeast gln-trna does not result in increased + frameshifting at either caa or cag codons. instead, absence of the s group results in reduced + frameshifting at the cag codon. in bacteria, lack of mnm or s groups in lys-trna causes increased + frameshifting at both cognate (aaa) and near cognate (aag) codons by a- and p-site effects ( ) . in yeast, we also observed an increased + frameshifting due to lack of mcm or s groups of lys-trna at aaa and aag codons.
however, we show that + frameshifting at the cognate (aaa) codon is induced by an a-site effect, not a p-site effect. it has been shown that the presence of modified nucleosides in trnas is required for tuning the decoding activity in order to maintain uniformity in translation ( ) . an in vitro study in yeast showed that the presence of the mcm and s groups of lys-trna is required for efficient a-site binding ( ) . consistent with these observations, our in vivo studies show that the presence of the mcm group of lys-trna promotes its entry to the ribosomal a-site and thereby avoids + frameshift errors. thus, wobble uridine modifications are required to optimize the function of trnas and thereby promote proper reading frame maintenance.

supplementary data are available at nar online.

we acknowledge prof. g. r. björk

nucleic acids research, , vol. , no. ( , ) .
c difference in frameshifting between elp mutant and wild type was significant as determined by two-tail t-test (p < . ).
d difference in frameshifting between elp mutant and wild type was not significant as determined by two-tail t-test (p > . ).

key: cord- -d sgnxc authors: tan, yong wah; hong, wanjin; liu, ding xiang title: binding of the ′-untranslated region of coronavirus rna to zinc finger cchc-type and rna-binding motif enhances viral replication and transcription date: - - journal: nucleic acids res doi: .
/nar/gks sha: doc_id: cord_uid: d sgnxc coronaviruses rna synthesis occurs in the cytoplasm and is regulated by host cell proteins. in a screen based on a yeast three-hybrid system using the ′-untranslated region ( ′-utr) of sars coronavirus (sars-cov) rna as bait against a human cdna library derived from hela cells, we found a positive candidate cellular protein, zinc finger cchc-type and rna-binding motif (madp ), to be able to interact with this region of the sars-cov genome. this interaction was subsequently confirmed in coronavirus infectious bronchitis virus (ibv). the specificity of the interaction between madp and the ′-utr of ibv was investigated and confirmed by using an rna pull-down assay. the rna-binding domain was mapped to the n-terminal region of madp and the protein binding sequence to stem–loop i of ibv ′-utr. madp was found to be translocated to the cytoplasm and partially co-localized with the viral replicase/transcriptase complexes (rtcs) in ibv-infected cells, deviating from its usual nuclear localization in a normal cell using indirect immunofluorescence. using small interfering rna (sirna) against madp , defective viral rna synthesis was observed in the knockdown cells, therefore indicating the importance of the protein in coronaviral rna synthesis. during the replication of mammalian viruses, it is inevitable for host proteins to be involved in the viral life cycles. in fact, coronaviruses require host proteins to aid in the stages from virus entry to progeny release. entry of the virus particle into a host cell requires the recognition of specific cell surface proteins, which act as receptors for the virus spike (s) protein ( ) ( ) ( ) ( ) ( ) ( ) . upon entry into host cells, the ribonucleocapsid uncoats and releases the -capped viral genome, a single-stranded positive-sense rna. the genomic rna ranges from to kb in length, is the largest known of its kind and is structurally similar to host mrna ( ) . 
the replicase gene, which spans two-thirds of the genome, is translated by host ribosomes into two large polyproteins, pp a and pp ab, via a frameshift event ( ) ( ) ( ) . the polyproteins are autoproteolytically processed into a maximum of nonstructural proteins ( , ( ) ( ) ( ) ( ) ( ) ( )), which are assembled into replication-transcription complexes, including the main enzyme rna-dependent rna polymerase (nsp ) ( , ) . this complex is required for generating new full-length virus rna in replication as well as subgenome-length rnas to be used for translation of virus structural and accessory proteins. in addition to their role in rna synthesis, these nonstructural proteins may have multiple functions, such as the suppression of host mrna translation as well as mrna degradation by nsp of sars coronavirus (sars-cov; [ ] [ ] [ ]), which may play a role in the suppression of the immune response mounted by the host upon infection. the replication-transcription complex (rtc), which is located on membrane-bound vesicles in the cytoplasm ( ) , is required for genome replication through continuous transcription and subgenomic rna synthesis via discontinuous transcription ( , , ) . apart from the replicase gene products, a viral structural protein, the nucleocapsid (n), is also required for efficient viral rna synthesis ( , ) . the resulting genome-size transcripts are destined to be packaged into progeny virions while the subgenomic, positive-sense transcripts are translated into four structural proteins, spike (s), nucleocapsid (n), membrane (m) and envelope (e) proteins, as well as other accessory proteins. in virus rna synthesis, the replicase complex is indispensable but not an exclusive participant. several host proteins have been found to interact with regulatory signals within the untranslated regions in the viral genome of betacoronavirus mhv.
these include the polypyrimidine tract-binding protein (ptb) ( , ) with the leader sequence, hnrnp a ( , , ) and hnrnp q ( ) with the -utr. more recently, poly(a)-binding protein (pabp), hnrnp q and glutamyl-prolyl-trna synthetase (eprs) were found to play a role in coronavirus rna synthesis through their interaction with the -utr of alphacoronavirus tgev ( ) . in addition, interaction of viral proteins with host proteins, such as the recently identified interaction between coronavirus nsp and ddx ( ) , may also play important enhancing roles in coronavirus replication and infection cycles. in this study, we describe the interaction of a cellular protein, madp (zinc finger cchc-type and rna binding motif ), with the -utr of ibv and sars-cov, using a yeast-based three-hybrid screen ( ) and rna-binding assays. subsequently, the rna-binding domain of madp and the rna secondary structure responsible for the interaction were mapped and defined. using indirect immunofluorescence, we found that madp , despite being reported as a nuclear protein ( ) , was detected in the cytoplasm of virus-infected cells and partially co-localized with the rtcs. upon silencing of madp using sirna, viral rna synthesis in general was affected, resulting in a lower replication efficiency and infectivity.

all wild-type and mutant madp expressing constructs were based on the vector pxj flag, which contains both the cmv and t promoter, and all expressed proteins were n-terminally tagged with the flag epitope. for the over-expression of the wild-type and mutant madp proteins, h cells grown to % confluency were infected with recombinant vaccinia-t virus for h (h), and the constructs were transfected into the infected cells using effectene transfection reagent (qiagen). cells were lysed with lysis buffer [ mm nacl, mm tris (ph . ), % np- ] h post-transfection.
template dna was amplified from plasmid dna encoding the end of ibv genome with various sets of primers targeting different regions of the -utr (tables and ) , with the sense primers containing the t promoter sequence ( ) . biotinylated rnas were in vitro transcribed with t rna polymerase (roche applied science) in the presence of biotin rna labeling mix (roche applied science) at c for h. template dnas were removed by digestion with rnase-free dnase i (roche applied science) and the labeled rnas purified with ultrapure phenol:chloroform:isoamyl alcohol (invitrogen) then solubilized in nuclease-free water. biotinylated rna at . mm was incubated with cell lysates over-expressing egfp, flag-tagged madp or its mutant proteins, respectively, in the presence of mm dithiothreitol (dtt), mg/ml yeast trna (ambion) and u/ml protector rnase inhibitor (roche applied science) in a final volume of ml at room temperature for min. the mixtures were incubated with ml ( % slurry) of streptavidin agarose beads (sigma aldrich) at room temperature for min. the beads were collected by centrifugation and washed three times with rnase p (rp) buffer ( mm kcl, mm mgcl , mm hepes, ph . ), suspended in ml of sodium dodecyl sulfate (sds) sample buffer with mm dtt. bound proteins were resolved by sds-polyacrylamide gel electrophoresis (sds-page) and detected with appropriate antibodies. african green monkey kidney cells (vero) grown to % confluency in four-chamber glass slides were transfected to over-express flag-tagged madp or vector control using effectene for h. transfected cells were infected with wild-type ibv or mock-infected with vero cell lysate (vero cells with serum-free medium subjected to three freeze-thaw cycles at minus c and room temperature, respectively) for h. 
infection was allowed to progress for h after virus removal and the cells were treated with actinomycin d at mg/ml (sigma aldrich) for h; mm of brutp (sigma aldrich) was transfected into the cells with superfect (qiagen) for h. cells were fixed at h post-infection with % paraformaldehyde for min and permeabilized with . % triton-x for min. treated cells were blocked in % goat serum, stained with primary antibodies mouse anti-brdu and rabbit anti-flag (sigma aldrich) and subsequently probed with alexafluor anti-rabbit and anti-mouse (invitrogen) antibodies. images were captured with an olympus fluoview upright confocal microscope using a sequential laser scanning protocol.

h cells grown to % confluency were transfected with nm of either siegfp ( -gcaacgugacccugaaguucdtdt- ) or simadp ( -caaugacuuguaccggauadtdt- ) using dharmafect sirna transfection reagent (dharmacon) for h. cells were infected with recombinant ibv-luc at a multiplicity of infection of approximately (moi ≈ ) and incubated for h at c, % co . the virus-containing medium was replaced with fresh serum-free medium and the cells were either harvested immediately ( h) or incubated further at c until specific time points post infection ( , , , , or h). infected cells were subjected to lysis, either through three freeze-thaw cycles (at − c and room temperature, respectively) without removal of media, or using lysis buffer after removal of media. firefly luciferase activity, which was used as an indication of viral activity for the recombinant virus, was measured in the cell lysates using the luciferase assay system (promega) according to the manufacturer's instructions. virus titer was measured by an end-point dilution assay, the % tissue culture infectious dose (tcid ), defined as the amount of virus that will produce pathological change in % of inoculated cell cultures. the tcid of the infected cells at each time point was determined by using the freeze-thawed infected cells.
for each sample, a -fold serial dilution was performed and five wells of vero cells on -well plates were infected with each dilution. the numbers of infected wells were collated and the tcid of each sample was calculated using the reed-muench method ( ) .

reverse transcription-polymerase chain reaction determination of the replication and sub-genomic transcription efficiency of ibv

total rnas were prepared from the infected cells at their specified time points using trizol reagent (invitrogen). reverse transcription (rt) was performed with expand reverse transcriptase (roche) according to the manufacturer's instructions using the sense primer ibv leader ( - ctattacactagccttgcgct - ) for the detection of negative-stranded subgenomic rna (sgrna) and the antisense primer ibv -r ( - ctctggatccaataacctac - ) for the detection of positive-stranded sgrna. both primers were then used for pcr. if transcription of subgenomic mrnas did occur, a -bp pcr product corresponding to the -terminal region of subgenomic mrna and a -bp fragment corresponding to the -terminal region of subgenomic mrna would be expected. similarly, rt was carried out with the sense primer ibv -f ( - gcttatccactagtacatc - ) for the detection of negative-stranded genomic rna. sense primer ibv -f and the antisense primer ibv -r ( - cttctcgcacttctgcactagca - ) were used for pcr. if replication of viral rna occurred, a -bp pcr fragment would be expected.

oligonucleotides were designed based on the simadp sequence and cloned into psilencer . neo (ambion) according to the manufacturer's instructions. the negative control silencer construct was supplied with the cloning kit. constructs psilencer-nc (negative control) and psilencer-madp were transfected into h cells with effectene transfection reagent. transfected cells were selected with mg/ml g (sigma) and the selected clones were subjected to screening for madp knockdown efficiency.
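the reed-muench end-point calculation named in the tcid assay above can be sketched as follows. this is a minimal illustration of the standard cumulative-count procedure, not the authors' own script; the function name, argument layout and the example well counts are hypothetical.

```python
def reed_muench_log10_endpoint(log10_dilutions, infected, wells_per_dilution):
    """Return the log10 dilution at which 50% of wells would be infected.

    log10_dilutions: ordered from least to most dilute, e.g. [-1, -2, -3].
    infected: number of infected wells observed at each dilution.
    """
    uninfected = [wells_per_dilution - n for n in infected]
    count = len(infected)
    # A well infected at a high dilution is assumed infected at all lower
    # dilutions, so infected counts accumulate from the most dilute end.
    cum_infected = [sum(infected[k:]) for k in range(count)]
    # Uninfected counts accumulate from the least dilute end.
    cum_uninfected = [sum(uninfected[:k + 1]) for k in range(count)]
    percent = [100.0 * i / (i + u) for i, u in zip(cum_infected, cum_uninfected)]
    for k in range(count - 1):
        if percent[k] >= 50.0 > percent[k + 1]:
            # Proportionate distance between the two bracketing dilutions.
            distance = (percent[k] - 50.0) / (percent[k] - percent[k + 1])
            step = log10_dilutions[k + 1] - log10_dilutions[k]
            return log10_dilutions[k] + distance * step
    raise ValueError("the 50% end point is not bracketed by the dilutions")


# Hypothetical plate: five wells per dilution, ten-fold steps (illustration only).
endpoint = reed_muench_log10_endpoint([-1, -2, -3, -4, -5], [5, 5, 3, 1, 0], 5)
print(f"log10 end-point dilution: {endpoint:.2f}")
```

the titer per inoculum volume is the reciprocal of the end-point dilution, i.e. ten raised to the negated return value.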
selected h -shnc and h -shmadp stable cell lines were maintained in media containing mg/ml of g . in order to find candidate host proteins that may be involved in the replication and transcription of coronavirus rna, a yeast-based three-hybrid ( ) screen against a human cdna library using the -utr of sars-cov rna as bait was performed. screens were also performed using the negative sense -utr and -utr as bait. each screen yielded about six to eight colonies which were sequenced and non-sense sequences of the candidates were eliminated. in total, the screen identified three candidates, madp , hax and ribosomal protein l a as binding partners to sars-cov positive sense -utr, negative sense -utr and negative sense -utr, respectively. although it was interesting to find ribosomal protein l a interacting with the table . nucleotide sequences of primers used to amplify dna templates for in vitro transcription primer name sequence t _egfp_ - r pt_egfp_f anti-sense -utr, which was not required for viral protein translation, subsequent functional studies of the protein would prove to be complicated as the virus itself relies heavily on the host ribosome to translate viral proteins, necessary for the infection to proceed. therefore, it was not chosen for further studies. hax was reported to function as an anti-apoptotic protein, which was not the focus of our screen and was therefore not chosen for further studies as well. madp was reported as a member of the alternative splicing pathway, which implied a possible role in facilitating distal rna sequences to be brought into close proximity, corresponded well with current evidence on the mechanism of discontinuous transcription. therefore, it was chosen as the sole target for this study. the -utr of coronavirus genomic rna interacts specifically with madp the interaction between madp and the coronavirus -utr was confirmed by using over-expressed flag-tagged madp in a biotin-rna pull-down assay. 
based on the efficiency of flag-tagged madp co-purification with the biotinylated rna, the full-length, mammalian-expressed madp was found to be able to interact with the -utr of ibv and sars-cov rna ( figure a ). over-expressed flag-tagged protein was used to facilitate detection, as there was no commercially available antibody to the protein at that time. it was noted that ibv -utr showed higher binding affinity to the flag-tagged madp than did sars-cov -utr ( figure a ). the specific interaction between ibv -utr and madp and its functional implication in coronavirus replication were therefore chosen for subsequent characterization. to check the specificity of the interaction, a competition assay based on the biotin-rna pull-down assay was performed. total cell lysates containing flag-tagged madp were incubated with . mm biotinylated ibv -utr in the presence of increasing concentrations of either unlabeled specific competitor rna probe (ibv -utr) or unlabeled non-specific probe (egfp rna) composed of nucleotides - of the egfp coding sequence, from to . mm. western blot analysis of the co-purified flag-tagged protein showed that increasing concentrations of unlabeled specific competitor rna led to the decreasing co-purification of madp with the biotinylated rna probe ( figure b) . however, increasing concentrations of unlabeled non-specific competitor rna did not result in detectable change to the efficiency of madp co-purification ( figure b) . simultaneously, a protein exhibiting a non-specific rna-binding activity, the flag-tagged ibv-n, was used as a control. total cell lysates containing the flag-tagged ibv n protein was incubated with . mm of the biotinylated ibv -utr, in the presence of increasing concentrations of either the unlabeled specific probe or an unlabeled non-specific probe, egfp rna, of an equal length. 
western blot detection of the co-purified flag-tagged n protein revealed that increasing concentrations of both unlabeled rna probes increasingly reduced the efficiency of n protein co-purification with the biotinylated rna probes ( figure b ). these results confirmed that madp could interact specifically with the -utr of ibv rna. over-expressed flag-tagged madp translocates from the nucleus to the cytoplasm madp was identified as a component of the s u / snrnp ( ) and its subcellular localization was determined to be in the nucleoplasm ( ) . ibv replication and transcription, on the other hand, take place in the cytoplasm of the infected cells. therefore, to validate the likelihood of madp interacting with the viral -utr, immunofluorescence was used to track the subcellular localization of both flag-tagged madp and de novo synthesized viral rna in both mock-infected and ibv-infected cells. flag-tagged madp was overexpressed in cultured vero cells, which were then infected with ibv and treated with actinomycin d to inhibit host transcription. the newly synthesized viral rna, a marker for the rtcs, was labeled with brutp. the cells were fixed at h post-infection to allow sufficient labeling of the newly synthesized viral rna and to minimize the formation of large syncytial cells. in uninfected cells, flag-tagged madp was localized in the nucleus exclusively ( figure ). upon infection by ibv, flag-tagged madp appeared to be present in the cytoplasm as well ( figure ) . interestingly, the cytoplasmic localization pattern of flag-tagged madp appears to be partially overlapped with that for the rtcs, although further studies would be required to ascertain if madp would be a part of the rtcs ( figure ). as a negative control for the over-expressed protein, vector transfected cells probed with flag antibody showed negative staining for the over-expressed protein ( figure ). similar colocalization patterns were also observed in ibv-infected h cells (figure ). 
to define the segment and structural elements of ibv -utr required for its interaction with madp , four truncated mutant rna fragments were synthesized by in vitro transcription, as shown in figure a . -utrΔ contains stem-loops i-iv ( ), -utrΔ and -utrΔ span stem-loops i to iii and ii to iv, respectively, whereas -utrΔ spans the rest of the nucleotides. the biotin-labeled rna transcripts were used in the biotin-rna pull-down assay ( figure b ) to check the efficiency of flag-tagged madp co-purification with rna. results showed that madp was co-purified only with transcripts that contain stem-loops i-iii of the -utr ( -utrΔ and -utrΔ ). in addition, stem-loop i appeared to be essential for interacting with madp , as its absence in -utrΔ abolished the interaction ( figure b ). the remaining region of the -utr ( -utrΔ ) did not appear to interact with madp ( figure b ). to confirm further the role of stem-loop i in the interaction between madp and ibv -utr, two mutants were constructed, based on -utrΔ . -utrΔ m carried two point mutations at nucleotide residues and from ga to cu, which would disrupt the structure of stem-loop i ( figure c ), and -utrΔ m carried additional mutations at residues and from uc to ag ( figure c ), which would restore the secondary structure of stem-loop i. the mutant rnas spanning stem-loops i-iii were assessed for their ability to bind madp . the result indicated that the integrity of stem-loop i may be essential for the interaction between the -utr and madp ( figure d ), as the stem-loop-disrupting mutant ( -utrΔ m ) failed to interact with madp . the stem-loop-restoring mutation at nucleotide residues and from uc to ag was able to restore the interaction partially ( -utrΔ m ) ( figure d ). this result supports the conclusion that the secondary structure of stem-loop i of ibv -utr is indispensable for its interaction with madp .
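the logic of the disrupting (ga to cu) and compensatory (uc to ag) mutations described above can be checked with a short base-pairing sketch. the five-nucleotide stem arms below are hypothetical and are not the ibv stem-loop i sequence; only the ga/cu and uc/ag changes are taken from the text, and the function name is an assumption.

```python
# Watson-Crick pairs plus the G-U wobble pair commonly allowed in RNA stems.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def count_stem_pairs(five_prime_arm: str, three_prime_arm: str) -> int:
    """Count paired positions between two antiparallel stem arms.

    Both arms are written 5'->3', so the 3' arm is read in reverse
    when it is aligned against the 5' arm.
    """
    if len(five_prime_arm) != len(three_prime_arm):
        raise ValueError("stem arms must have equal length")
    return sum(
        (a, b) in PAIRS for a, b in zip(five_prime_arm, reversed(three_prime_arm))
    )


# Hypothetical stem (NOT the IBV sequence): GA in the 5' arm pairs with UC in the 3' arm.
wild_type = ("CUGAU", "AUCAG")
disrupted = ("CUCUU", "AUCAG")  # GA -> CU in the 5' arm breaks two pairs
restored = ("CUCUU", "AAGAG")   # additional UC -> AG in the 3' arm restores them

if __name__ == "__main__":
    for name, (arm5, arm3) in [
        ("wild type", wild_type),
        ("disrupted", disrupted),
        ("restored", restored),
    ]:
        print(name, count_stem_pairs(arm5, arm3))
```

in this toy stem the compensatory change restores pairing but not the original sequence, which mirrors why a structure-restoring mutant can still bind less well than wild type if primary sequence also contributes.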
the rna recognition motif (rrm) of madp is responsible for its interaction with ibv -utr

madp contains two nucleic acid binding domains, the rna recognition motif (rrm) in the n-terminal region and the universal minicircle sequence binding protein (umsbp) domain in the central region. in order to identify the domain involved in the interaction between madp and ibv -utr, a series of truncation mutants of the protein was created ( figure a ). the first three mutants, madp n, which contains the rrm domain, madp m, which spans the zinc finger domain, and madp c, which contains mostly phosphorylation sites, were assessed for their ability to interact with ibv -utr. only madp n retained a low level of rna-binding activity ( figure b , n) and negligible activity was detected for the other two truncated proteins ( figure b, m, c) . as the rna-binding activity of the madp n fragment was much lower than that of the full-length protein, three more mutants were created to extend the madp n fragment ( figure a ). an extension of or amino acid residues was made for mutants madp x and madp z, respectively. a truncation at the n-terminus by residues as well as an extension by amino acid residues was made for madp y. both madp x and madp z bound to ibv -utr more strongly than did either the full-length protein or the madp n mutant protein ( figure b, x, z) . madp y, on the other hand, bound weakly to the rna fragment ( figure b, y) . hence, the amino acid extension beyond the rrm (madp x) may be required to preserve the integrity of the protein structure, and the amino acid residues at the n-terminus of madp appear to be required for efficient rna binding. as the rrm domain was determined to be responsible for the interaction, the available information on this domain was examined; it indicated three amino acids at the active site, which interact with nucleic acid residues via their aromatic and hydrophobic side chains.
for madp , the identified active site was composed of phenylalanine and valine , respectively, while tyrosine may have acted as an anchor for the phosphate backbone via electrostatic interactions. hence, three mutants with either a single alanine substitution for tyrosine (y a), a double alanine substitution for valine and phenylalanine (v f a) or triple alanine substitutions for all three residues (yvf), were constructed ( figure a ). these three mutants were over-expressed in h cells as flag-tagged proteins, and the lysates were assessed for their respective rna-binding affinities for full-length ibv -utr ( figure c ).

interaction of deletion mutants of madp with ibv -utr. cell lysates prepared from h cells over-expressing flag-tagged wild-type madp or its truncation mutants were used for biotin-rna pull-down assay using the full-length ibv -utr. both the crude lysates (labeled c) and protein bound on the streptavidin beads (labeled e) were resolved by sds-page and detected by western blot with anti-flag antibody. egfp over-expressed cell lysate was included as a negative control. (c) interaction of three madp mutant constructs, y a, v f a and yvf, with ibv -utr. the three full-length madp constructs with amino acid mutations at the predicted rna-binding sites were transfected into h cells and used in a biotin-rna pull-down assay with the full-length ibv -utr.

all mutants resulted in a reduction in rna-binding affinity for the biotinylated rna molecule and the reduction was most dramatic for triple residue mutant yvf ( figure c ), implying cooperative binding demonstrated by the three residues. this finding confirms that the madp rrm is involved in the interaction with ibv -utr.

to demonstrate the significance of the interaction between madp and ibv -utr, an sirna duplex designed to silence madp expression (simadp ) and a negative control sirna targeting egfp protein (siegfp) were figure a ).
densitometric analyses identified a reduction between % and % of madp mrna was achieved by this sirna which resulted in a reduction between % and % of negative stranded genomic viral rna, - % of negative stranded subgenomic viral rna and - % of positive stranded subgenomic viral rna. western blot analysis also noted a reduction in the expression of viral structural genes, between % and % reduction for s and n proteins, with a reduction between % and % of madp protein ( figure b ). virus titers as represented by the tissue culture infectious dose (log tcid ) at each infection time point was reduced by a minimum of -fold and up to -fold compared to siegfp-transfected cells beyond h of infection ( figure c ). firefly luciferase activity of cell lysates harvested at different time points showed a minimum of % reduction upon the silencing of madp , which supports further the observation that the total viral protein production was much reduced ( figure d ). to eliminate the possibility that the phenotype observed in madp -silenced cells during ibv infection was due to an off-target effect of the sirna duplex used, four additional sirna duplexes targeting different regions of madp were used in various combinations with simadp ( figure ) to check their effect on ibv infection, as illustrated by the expression of the luciferase gene ( figure b ). all six combinations of five different sirna duplexes resulted in a reduction in the luciferase activity of the infected cells by either % (sicombi and ), without simadp or more than % (sicombi , , and ) with simadp , compared to negative control, siegfp-transfected cells ( figure b ). this implies that, in general, knocking down madp with any sirna results in a reduction of virus infection. a stable cell clone expressing short hairpin rna to madp (shmadp ) was selected from h cells and the madp mrna level was confirmed using northern blot ( figure a ). 
the expression of madp and the effect of madp -knockdown on ibv infection were tested by comparing with a g -selected cell line without expression of shmadp (non-targeting control, shnc). the results showed that, in general, silencing of madp with shrna reduced the amount of viral mrna production before h post-infection ( figure b ). the amount of virus mrna is higher in shmadp cells compared to shnc cells beyond h of infection as infection in shnc cells progressed much faster and most cells died and detached ( figure b ). the shmadp cell line was then transfected with constructs expressing flag-tagged wild type madp (fm), triple residue mutant (fm(yvf)), two mrna mutants resistant to silencing by simadp based on wild-type madp (fmmut) and the triple residue mutant (fmmut(yvf)), negative vector control (f) and egfp (e), respectively. the two sirna-resistant mutants were constructed by mutating the sirna-targeting sequence with degenerate codons, so that the protein sequence of madp was maintained. these transfected cells were subsequently infected with ibv-luc and harvested at h post-infection. western blotting results showed an obvious increase in the amount of ibv n expression in cells over-expressing silencing-resistant wild-type madp (fmmut) as well as a slight increase in cells over-expressing both normal triple residue mutant (fm(yvf)) and silencing-resistant triple residue mutant (fmmut(yvf)) ( figure c ). an assessment of the luciferase activity of total cell lysates showed that over-expression of triple residue mutants fm(yvf) and fmmut(yvf) resulted in a slight increase of the luciferase activity in shmadp cells, whereas over-expression of silencing-resistant wild-type madp (fmmut) resulted in a more drastic increase of the luciferase activity in shmadp cells ( figure d ). it was noted that although madp interacted with both sars-cov and ibv -utr, the interaction was rather weak for the former. 
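the construction of sirna-resistant mutants by degenerate (synonymous) codon substitution, described above, can be sketched as follows. this is an illustrative sketch only: the target sequence is hypothetical and is not the madp coding sequence, the function names are assumptions, and the first alphabetical synonymous codon is chosen purely for simplicity, with no attention to codon usage.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA sequence into one-letter amino acid codes."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode_synonymous(dna: str) -> str:
    """Swap each codon for a different synonymous codon where one exists."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    recoded = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE[codon]
        alternatives = sorted(
            c for c, aa in CODON_TABLE.items() if aa == amino_acid and c != codon
        )
        # Met and Trp have a single codon, so they are left unchanged.
        recoded.append(alternatives[0] if alternatives else codon)
    return "".join(recoded)


if __name__ == "__main__":
    target = "ATGGCTTTGTACCGGAAA"  # hypothetical in-frame siRNA target region
    resistant = recode_synonymous(target)
    mismatches = sum(a != b for a, b in zip(target, resistant))
    print(target, translate(target))
    print(resistant, translate(resistant), f"{mismatches} mismatches to the siRNA")
```

the recoded region encodes the same protein while carrying several mismatches to the sirna, which is the property the silencing-resistant constructs rely on.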
a comparison of the predicted stem-loop i structures from both coronaviruses indicated a marked difference in their primary sequences as well as their secondary structures. hence, a third coronavirus, hcov-oc , whose stem-loop i deviates further from that of ibv than does that of sars-cov, was assessed for its binding to madp ( figure a ). the binding of madp to the utr of hcov-oc was as weak as, if not weaker than, its binding to the sars-cov utr. it was also noted that the predicted stem-loop i structure of hcov-oc contained a bulge which encompassed a larger area of the stem compared to sars-cov ( figure b ). bulges were conspicuously absent from the ibv stem-loop i ( figure b ). in addition to the differences in the secondary structures between the coronaviruses, there was a lack of sequence similarity as well ( figure b ).

previous studies on the involvement of host proteins in viral rna synthesis have revealed a number of proteins which are able to interact with the utrs of viral genomes ( , , , ( ) ( ) ( ) ( ) . some of these proteins may also interact with other viral proteins ( , ) . our attempts to identify host proteins involved in this early process of the coronavirus life cycle yielded madp . this protein was shown to be localized to the nucleoplasm and excluded from the nucleolus, but its role in rna splicing remains to be determined ( ) . madp contains two conserved rna-binding domains, the rna recognition motif (rrm) and universal minicircle sequence binding protein (umsbp) domains (a zinc finger cchc-type) ( ) . the former was determined to be the domain responsible for the interaction between madp and ibv -utr. the madp rrm domain interacts with nucleic acid residues via aromatic and hydrophobic side chains at its active site, which in this case are supplied by phenylalanine and valine , respectively. tyrosine may have acted as an anchor for the phosphate backbone via electrostatic interactions.
in this study, interaction between madp and the sars-cov and ibv -utr was initially identified by a yeast-based three hybrid screen and subsequently confirmed using an in vitro rna pull-down assay with ibv -utr. a deeper look at the details of this interaction revealed that the rna recognition motif, but not the zinc finger motif, of madp , is responsible for the interaction. this interaction is also shown to be specific and stem-loop i of ibv -utr is essential for the interaction to occur. although madp was reported to be a nuclear protein ( ) , it could be detected in the cytoplasm of ibv-infected cells and partially overlaps with the de novo synthesized viral rna, which marks the location of the rtcs in infected cells in the presence of actinomycin d. silencing of madp resulted in a marked reduction in syncytium formation upon ibv infection. a closer examination revealed that the synthesis of both genome-(grna) and subgenome-length rnas (sgrna) was compromised, resulting in a drastic reduction of viral structural protein expression and release of viral progeny (titers), hence the overall reduction of viral infectivity in the cells. across different coronaviruses, the leader sequence situated in the extreme end of the genome, is composed of stem-loops i and ii. mutations introduced into either stem-loop i or ii resulted in non-viable viruses, impaired (sense and anti-sense) sgrna synthesis, but not the full-length grna synthesis ( , ) . it was, however, observed in this study that silencing of madp did render an impact on grna synthesis, although to a lesser extent compared to sgrna synthesis. this might have been due to a secondary effect of decreased sgrna synthesis, as proteins encoded by sgrnas may enhance viral rna synthesis ( ) . the predicted structure of stem-loop ii indicated a strong secondary interaction, which is highly conserved across different groups of coronaviruses. 
the predicted stem-loop i structure, on the other hand, appears to fold into a hairpin of low thermodynamic stability, shows a wider sequence variation and is characterized by the presence of bulges, non-canonical base pairing as well as a prevalence of a-u base pairing ( ) . it has been shown in mhv that the structural lability of stem-loop i is a critical driving force in the -and -utr interaction ( ) . comparing the predicted stem-loop i structure of ibv with those of sars-cov and hcov-oc ( figure b) , it was noted that the loop sequences differ. in addition, ibv stem-loop i has a shorter stem and lacks bulges, although the structure may be as unstable thermodynamically as that of sars-cov and hcov-oc , due to the extremely high prevalence of weak base pairing between a and u as well as the presence of a non-canonical base pair at the base of the stem ( ) . hence, sequence and structural differences may be one possible explanation for the weaker binding of madp to the sars-cov or hcov-oc -utr than to the ibv -utr. in fact, the relatively weaker binding of madp to the stem-loop i restoring mutant ( -utrΔ m ) demonstrated in this study supports the idea that primary sequences in the -utr may play a certain role in this interaction. most studies on host involvement in coronaviral rna synthesis have so far been performed using mhv ( ) ( ) ( ) , ) . the identification in this study of the interaction between madp and the -utr, together with its functional involvement in coronavirus replication, may therefore make madp the first host protein shown to play a role in viral rna synthesis by interacting with the -utr of the viral rna in a gammacoronavirus. the functional implication of the interaction between madp and ibv -utr may be extended to the rest of the members of the coronavirus family. in the case of hnrnp a , it was initially reported to be functionally important for viral rna synthesis for group ii virus mhv ( , ) .
subsequently, its involvement in viral rna synthesis was also confirmed in tgev, a group i coronavirus ( ) . in this study, we have shown that betacoronaviruses hcov-oc , sars-cov and gammacoronavirus ibv can bind to madp , albeit with different affinities. due to the lack of a high containment facility, the functional implication of the relatively weaker interaction between sars-cov -utr and madp was not further studied. it is, therefore, yet to be demonstrated if this weaker binding dictates less dependency on madp in sars-cov rna replication and infectivity. current evidence indicates that madp is compartmentalized in the nuclei of cultured cells ( ) , markedly differing from the cytoplasmic, perinuclear localization of the coronavirus rtcs ( ) ( ) ( ) . as there was no report on the possibility of madp shuttling between the nucleus and cytoplasm, our observation using indirect immunofluorescence that over-expressed madp upon ibv infection became partially localized in the cytoplasm may represent a first report that madp could be localized outside the nucleus. this could have been achieved with either an existing shuttling mechanism used by a nuclear protein or the assistance of viral factors. for example, ibv n protein is known to enter the nucleus while maintaining a predominantly cytoplasmic localization ( , ) . alternatively, binding of viral rna may partially retain the newly synthesized madp in the cytoplasm, as observed in this study. it was observed that over-expression of flag-tagged madp was unable to fully restore ibv infection in madp -knockdown cells, even though the expression level of the introduced madp construct far surpassed the endogenous level, as observed by western blot analysis. 
considering that only % of cells were transfected and over-expressed madp protein, despite the higher level of the protein in the transfected cells, it is understandable that the expression of viral proteins could not be fully restored when transfected and untransfected cells were analyzed together. interestingly, over-expression of silencing-sensitive madp was unable to cause an increase in virus infection, in contrast to what was observed for silencing-resistant madp (fmmut) in shmadp cells, even though their expression levels were comparable. this lends further support to the conclusion that madp is actively involved in the replication and infectivity of ibv. although the functional studies involving ibv, a chicken coronavirus, and a human protein, madp , were conducted in human and african green monkey cells, which are non-native hosts, it is noteworthy that madp (homologene ) is conserved in humans (homo sapiens), chimpanzees (pan troglodytes), wolves (canis lupus), cattle (bos taurus), mice (mus musculus), rats (rattus norvegicus) and chickens (gallus gallus). the african green monkey genome is not available at ncbi, but an alignment search using the basic local alignment search tool (blast) of the madp amino acid sequence against the rhesus macaque (macaca mulatta) refseq protein library yields a % sequence similarity between the two species. the chicken homolog, on the other hand, bears % amino acid sequence similarity to the human madp protein, but with an almost identical match in the n-terminal amino acids. as the predicted interaction domain lies in the n-terminus, it is highly likely that the homologs from other species could replace human madp in the interaction studies. in conclusion, the involvement of madp in coronavirus rna synthesis and its significance are demonstrated in this study in the tissue culture system.
further studies with an madp knock-out animal system, which is currently not available, would be required to confirm further the involvement of madp in coronavirus rna synthesis. human coronavirus nl employs the severe acute respiratory syndrome coronavirus receptor for cellular entry angiotensin-converting enzyme is a functional receptor for the sars coronavirus a transmembrane serine protease is linked to the severe acute respiratory syndrome coronavirus receptor and activates virus entry crystal structure of nl respiratory coronavirus receptor-binding domain complexed with its human receptor proteolytic activation of the spike protein at a novel rrrr/s motif is implicated in furin-dependent entry, syncytium formation, and infectivity of coronavirus infectious bronchitis virus in cultured cells acquisition of cell-cell fusion activity by amino acid substitutions in spike protein determines the infectivity of a coronavirus in cultured cells coronavirus genome structure and replication the role of programmed- ribosomal frameshifting in coronavirus propagation identification of heptaand octo-uridine stretches as sole signals for programmed + and - ribosomal frameshifting during translation of sars-cov orf a variants a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting virus-encoded proteinases and proteolytic processing in the nidovirales an arginine-to-proline mutation in a domain with undefined functions within the helicase protein (nsp ) is lethal to the coronavirus infectious bronchitis virus in cultured cells proteolytic processing of polyproteins a and ab between non-structural proteins and / of coronavirus infectious bronchitis virus is dispensable for viral replication in cultured cells functional and genetic studies of the substrate specificity of coronavirus infectious bronchitis virus c-like proteinase functional screen reveals sars coronavirus nonstructural protein nsp as a novel cap n methyltransferase identification 
of a novel cleavage activity of the first papain-like proteinase domain encoded by open reading frame a of the coronavirus avian infectious bronchitis virus and characterization of the cleavage products characterization of viral proteins encoded by the sars-coronavirus genome the molecular biology of coronaviruses a two-pronged strategy to suppress host protein synthesis by sars coronavirus nsp protein severe acute respiratory syndrome coronavirus nsp suppresses host gene expression, including that of type i interferon, in infected cells severe acute respiratory syndrome coronavirus nsp protein suppresses host gene expression by promoting host mrna degradation characterization of the expression, intracellular localization, and replication complex association of the putative mouse hepatitis virus rna-dependent rna polymerase nidovirus transcription: how to make sense coronavirus transcription: a perspective the coronavirus nucleocapsid protein is dynamically associated with the replication-transcription complexes coronavirus nucleocapsid protein facilitates template switching and is required for efficient transcription viral and cellular proteins involved in coronavirus replication polypyrimidine-tractbinding protein affects transcription but not translation of mouse hepatitis virus rna heterogeneous nuclear ribonucleoprotein a binds to the -untranslated region and mediates potential '- -end cross talks of mouse hepatitis virus rna heterogeneous nuclear ribonucleoprotein a regulates rna synthesis of a cytoplasmic virus syncrip, a member of the heterogeneous nuclear ribonucleoprotein family, is involved in mouse hepatitis virus rna synthesis host cell proteins interacting with the end of tgev coronavirus genome influence virus replication the cellular rna helicase ddx interacts with coronavirus nonstructural protein and enhances viral replication a three-hybrid screen identifies mrnas controlled by a regulatory protein isolation, expression, and characterization of 
the human zcrb gene mapped to q amino acid residues critical for rna-binding in the n-terminal domain of the nucleocapsid protein are essential determinants for the infectivity of coronavirus in cultured cells the human s u /u snrnp contains a set of novel proteins not found in the u -dependent spliceosome a u-turn motif-containing stem-loop in the coronavirus ' untranslated region plays a functional role in replication towards construction of viral vectors based on avian coronavirus infectious bronchitis virus for gene delivery and vaccine development the nucleocapsid protein of sars coronavirus has a high binding affinity to the human cellular heterogeneous nuclear ribonucleoprotein a host protein interactions with the end of bovine coronavirus rna and the requirement of the poly(a) tail for coronavirus defective genome replication mitochondrial hsp , hsp , and hsp bind to the untranslated region of the murine hepatitis virus genome mitochondrial aconitase binds to the untranslated region of the mouse hepatitis virus genome structural lability in stem-loop drives a utr- ' utr interaction in coronavirus replication selective replication of coronavirus genomes that express nucleocapsid protein group-specific structural features of the -proximal sequences of coronavirus genomic rnas membrane association and dimerization of a cysteine-rich, -kilodalton polypeptide released from the c-terminal region of the coronavirus infectious bronchitis virus a polyprotein further identification and characterization of novel intermediate and mature cleavage products released from the orf b region of the avian coronavirus infectious bronchitis virus a/ b polyprotein further characterization of the coronavirus infectious bronchitis virus c-like proteinase and determination of a new cleavage site sumoylation of the nucleocapsid protein of severe acute respiratory syndrome coronavirus the coronavirus infectious bronchitis virus nucleoprotein localizes to the nucleolus conflict of 
interest statement. none declared. key: cord- -ftlb b authors: mroczek, seweryn; kufel, joanna title: apoptotic signals induce specific degradation of ribosomal rna in yeast date: - - journal: nucleic acids res doi: . /nar/gkm sha: doc_id: cord_uid: ftlb b organisms exposed to reactive oxygen species, generated endogenously during respiration or by environmental conditions, undergo oxidative stress. stress response can either repair the damage or activate one of the programmed cell death (pcd) mechanisms, for example apoptosis, and finally end in cell death. one striking characteristic, which accompanies apoptosis in both vertebrates and yeast, is a fragmentation of cellular dna and mammalian apoptosis is often associated with degradation of different rnas. we show that in yeast exposed to stimuli known to induce apoptosis, such as hydrogen peroxide, acetic acid, hyperosmotic stress and ageing, two large subunit ribosomal rnas, s and . s, became extensively degraded with accumulation of specific intermediates that differ slightly depending on cell death conditions. this process is most likely endonucleolytic, is correlated with stress response, and depends on the mitochondrial respiratory status: rrna is less susceptible to degradation in respiring cells with functional defence against oxidative stress. in addition, rna fragmentation is independent of two yeast apoptotic factors, metacaspase yca and apoptosis-inducing factor aif , but it relies on the apoptotic chromatin condensation induced by histone h b modifications. these data describe a novel phenotype for certain stress- and ageing-related pcd pathways in yeast. gene expression in all organisms is regulated at multiple levels, including transcription initiation, mrna stability and turnover, translation and protein degradation. 
not surprisingly, rapid changes in cell metabolism and most responses to environmental stimuli involve significant flux of many cellular rnas, of which mrna transcriptome profiles have been most extensively studied to date ( , ) . however, also stable rnas, such as ribosomal, transfer, nuclear and nucleolar rnas (rrnas, trnas, snrnas and snornas), are likely to undergo specific transformations in altered conditions. it has been demonstrated that certain pathways of cell death are accompanied by the destruction of nucleic acids. for example, in metazoans the programmed cell death (pcd) called apoptosis, in addition to irreversible dna damage, which is considered an apoptotic hallmark ( ) , also involves specific cleavage of several rna species, including s rrna, u snrna or ro rnp-associated y rnas ( ) . it was proposed that rrna degradation could contribute to cell autodestruction, whereas degradation of anti-apoptotic factors mrnas would accelerate apoptosis. in higher eukaryotes, rna cleavage is probably carried out by rnase l, a -oligoadenylate-dependent endoribonuclease, which functions in rna decay during the interferon-induced response to viral infection, and whose activation in animal cells causes apoptosis ( ) ( ) ( ) . however, rnase l-independent cleavage of s rrna in virus infected cells has also been reported ( ) . the occurrence of apoptosis was assumed to be limited to metazoans, where elimination of single cells does not kill the whole organism. nevertheless, recent studies revealed the existence of cell death pathway in yeast, saccharomyces cerevisiae, and other unicellular eukaryotes, with typical hallmarks of apoptosis: dna fragmentation, externalization of phosphatidyl serine and chromatin condensation. pcd in yeast is triggered by several different stimuli, including ageing, expression of mammalian pro-apoptotic proteins, exposure to low doses of h o , acetic acid, hyperosmotic stress and mating-type a-factor pheromone ( ) ( ) ( ) . 
unicellular organisms are believed to undergo pcd for a variety of reasons, including elimination of old, infected and damaged cells in growth-limiting conditions for better survival of the remaining population, and adaptation of the more fit subpopulation to the ever changing and challenging environment ( ) . orthologues of core regulators of mammalian apoptosis, such as the caspase-related protease yca , a homologue of mammalian pro-apoptotic mitochondrial serine protease htra (nma ), a yeast endog nuclease nuc , apoptosis inducing factor (aif ) involved in chromatin condensation, aif-homologous mitochondrion-associated inducer of death (amid) ndi and an inhibitor of apoptosis (iap) bir , are conserved in yeast ( ) ( ) ( ) ( ) ( ) ( ) . in addition, in both human and yeast cells, histone modifications (histone h b phosphorylation at serine in human and at serine in yeast and h b deacetylation at lysine in yeast) play an important role in apoptotic chromatin condensation and cell death ( ) ( ) ( ) . however, several apoptotic factors are missing in yeast, including the bcl- /bax family and the apoptosis protease activator factor apaf- . also, there is no good homologue of rnase l to execute possible rna degradation. nevertheless, yeast apoptosis has recently been shown to be activated in mrna decay mutants (dcp and lsm) ( , ) , supporting the notion that rna metabolism and apoptosis are linked. pcd occurs via a number of different mechanisms, e.g. caspase-dependent or caspase-independent; however, in all eukaryotes it is thought to be correlated with high levels of reactive oxygen species (ros). ros can either be generated endogenously through respiration or originate from exogenous sources such as exposure to hydrogen peroxide (h o ), superoxide anions or hydroxyl radicals. excessive ros result in damage to cellular components (dna, lipids and proteins), cell cycle arrest, ageing and finally cell death ( ) .
cells have developed a complex network of defence mechanisms, both enzymatic and non-enzymatic, against adverse consequences of oxidative stress ( ) . non-enzymatic system comprises a set of small molecules acting as ros scavengers (e.g. glutathione, thioredoxin, glutaredoxin and ascorbic acid), whereas enzymatic system eliminates oxygen radicals by the action of specialized cytosolic or mitochondrial enzymes (e.g. catalases, superoxide dismutases, glutathione peroxidases and thioredoxin peroxidases) ( ) . most genes encoding components of these systems are induced in response to oxidative stress and are under transcriptional control of specific factors, for example yap , msn /msn and skn in budding yeast ( ) . however, it appears that there is no general oxidative stress response. in s. cerevisiae, different response pathways are triggered by specific oxidants and different genes are involved in maintaining efficient cellular resistance to various sources of ros ( , ) . interestingly, recent genomic approaches to identify these genes showed that strains lacking proteins which function in rna metabolism were oversensitive to oxidative stress ( , ). these included genes encoding rrna helicases (dbp , dbp ), rrna processing factors (nop , nsr ), mrna deadenylases (ccr , pop ) and several mitochondrial rna splicing components ( , ) . this indicates that rna processing and degradation may have a role in cellular response to ros. in this study, we examined the effects of elevated ros levels generated by oxidative stress, ageing and other apoptotic-inducing treatments on the status of ribosomal rna in yeast s. cerevisiae and we have shown that mature rrnas become specifically fragmented as a result of the cell response to these conditions. rna degradation coincides with fragmentation of chromosomal dna but occurs considerably earlier and most likely upstream of the activation of major apoptotic regulators, yca and aif . 
the existence of this mechanism underscores the role of gene expression, namely rrna turnover, in regulating certain pathways of cell death, in this case most likely through destruction of ribosomes and subsequent inhibition of translation in the early stages of apoptosis. yeast strains and plasmids used in this work are listed in supplementary table s . the transformation procedure was as described ( ) . strains were grown at c either in ypd, ypgal or ypgly medium ( % yeast extract, % bacto-peptone, % glucose or % galactose or % glycerol, respectively) or in synthetic complete medium (sc, . % yeast nitrogen base, % glucose or % galactose, supplemented with required amount of amino acids and nucleotide bases). strains w - a-bax, w - a-bcl-x l , w -rdna were grown in sc media without leucine, tryptophan or uracil, respectively. yeast cultures in early logarithmic phase (od $ . ) were stressed with h o ( . - mm), menadione ( . - . mm), cumene hydroperoxide (chp, . - . mm), tert-butyl hydroperoxide (t-bhp, - mm), paraquat ( . - mm), diamide ( . - mm) and linoleic acid hydroperoxide (loaooh, . - . mm) for min (all reagents from sigma). treatment with acetic acid stress was performed as described ( , ) . cells were grown in sc media to early exponential stage, shifted to sc media, ph = , and treated with mm acetic acid (sigma) for min. hyperosmotic shock was achieved by growth of exponential cells in sc complete media containing % (wt/wt) glucose (fluka) or % (wt/wt) sorbitol and % glucose (sigma) for - h ( ). chronological ageing was performed by constant growth of yeast cultures in sc complete medium for - days ( ) . for expression of murine bax or bcl-x l proteins, cells were grown to early exponential phase in sc-leu or sc-trp, respectively. expression of murine bax was induced by shifting the cells grown to early exponential phase in sc-leu medium containing glucose to sc-leu medium containing galactose for h. 
pre-treatment with ascorbic acid ( mm, fluka) and respiratory chain inhibitors oligomycin a ( . mg/ml, sigma) and sodium azide ( . mm, sigma) was performed for or min prior to treatment with h o . for inhibition of protein synthesis, cells were treated with mg/ml or mg/ml of cycloheximide (sigma) for min. cell fixation was achieved by addition of formaldehyde (final concentration %) or ethanol (final concentration %) to exponentially growing yeast and incubation at room temperature for min or min, respectively. formaldehyde was quenched by addition of glycine to the final concentration of . m for min. following the removal of fixation agents, cells were exposed to h o for min as described earlier. preparation of samples and analysis of chromosomal dna fragmentation by pulsed field gel electrophoresis (pfge) was performed exactly as described ( ) . pfge was conducted in a chef-driii chiller system (bio-rad). one percent agarose gels were run in . % tris borate-edta buffer at c with an angle of and a voltage of v/cm and switch times of - s for h. gels stained in ethidium bromide were analysed after destaining using the syngene gene genius bioimaging system. rna extraction, northern hybridization and primer extension were essentially as described ( , ) . low-molecular-weight rnas were separated on % acrylamide gels containing m urea and transferred to a hybond n+ membrane by electrotransfer. high-molecular-weight rnas were analysed on . % agarose gels and transferred by capillary elution. oligonucleotides used for rna hybridization and primer extension (w and w ) are listed in supplementary table s . quantification of northern blots was performed using a storm phosphorimager and imagequant software (molecular dynamics). dideoxy-dna sequencing was performed on pcr-templates prepared from genomic yeast dna using the same primers as for primer extension (w and w ) and a fmolseq kit (promega) according to the manufacturer's instructions.
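as an illustration of how phosphorimager band signals from a northern blot might be turned into relative rna levels, the following python sketch divides each band by a loading-control band from the same lane and scales the result to a reference lane. this is a minimal sketch under stated assumptions, not the procedure used in this study: the text does not say whether or how signals were normalized, and the `Lane` type, the `relative_levels` function and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    label: str
    band: float      # background-corrected signal of the band of interest
    loading: float   # background-corrected signal of a loading-control band (assumed)


def relative_levels(lanes, reference):
    """Return band/loading ratios scaled so that the reference lane equals 1.0."""
    ratios = {}
    for lane in lanes:
        if lane.loading <= 0:
            raise ValueError(f"lane {lane.label!r}: loading signal must be positive")
        ratios[lane.label] = lane.band / lane.loading
    if reference not in ratios:
        raise KeyError(f"reference lane {reference!r} not found")
    ref = ratios[reference]
    if ref <= 0:
        raise ValueError("reference lane has no signal to normalize against")
    return {label: ratio / ref for label, ratio in ratios.items()}


# illustrative numbers only, not data from this study
lanes = [
    Lane("untreated", band=1000.0, loading=500.0),
    Lane("treated", band=300.0, loading=600.0),
]
levels = relative_levels(lanes, reference="untreated")
```

with these made-up inputs the treated lane comes out at a quarter of the untreated level. expressing every lane relative to one reference lane keeps values comparable within a blot, although comparisons between blots would still require a shared control.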
the race assay was carried out on total rna ( mg) isolated from untreated cells and from cells treated with mm h o . dna 'adaptor' oligonucleotide (w ) carrying an aminolinker at the -end was ligated with the -end of total rna using u of t rna ligase (neb). ligation was performed in the presence of % peg (sigma) at c. rna was extracted with phenol: chloroform: isoamyl alcohol (v/v : : ), precipitated and used as a template for cdna synthesis using w primer complementary to the anchor sequence and the enhanced avian hs rt-pcr kit (sigma) according to the manufacturer's instructions. cdna was amplified using primers w and w , the resulting pcr product was gel purified, cloned into pgem-t easy vector and sequenced using primer w . rnase h cleavage was performed essentially as described ( ) . samples of mg of total rna were annealed with ng of oligonucleotide complementary to the specific regions within rrna at c for min and digested with . u rnase h at c for h. for detection, samples were separated on polyacrylamide gels and analysed by northern hybridization using probes located upstream of rnase h cleavage. to examine the existence of the rna degradation pathway in yeast under oxidative stress, we have performed treatments with low doses of oxidative agents generating different ros. these included the inorganic h o (concentrations . - mm), superoxide-generating menadione (concentrations . - . mm), paraquat (concentrations . - mm), thiol oxidant diamide (concentrations . - mm), organic chp (concentrations . - . mm), t-bhp (concentrations - mm) and loaooh (concentrations . - . mm). such concentrations of oxidants result in - % cell death ( , , ) . total rna from wild-type w or by cells grown to early exponential phase (od = . ) in ypd media and treated with chemical compounds for min was separated on . % denaturing agarose/formaldehyde gels and analysed by northern hybridization using a probe complementary to the -end of mature s rrna (starting at position + ).
this rna species was chosen in the first place, since effects on s rrna have been observed in apoptotic mammalian cells ( , , ) . on treatment with two oxidants, h o and menadione, extensive decay of the mature s and accumulation of specific degradation products was observed, whereas little or no degradation occurred for other chemicals tested ( figure a , data shown only for w strain and treatment with h o , menadione, chp and t-bhp). this indicates that rna cleavage accompanies some oxidative stress pathways, as it is known that different oxidants elicit specific cellular responses that, though partly overlapping, induce different groups of genes and require individual sets of specialized defence functions to maintain resistance ( , , ) . in addition to the s rrna, other rrna species were also probed for undergoing specific decay. after treatment with h o , rna damage with accumulation of characteristic breakdown products occurred for . s and, to a much lesser extent, for s, but not for s, (figure b and c; data not shown). in the case of s, hardly any degradation intermediates were detected, there was some decay of the mature rna, however, it was approximately . to -fold weaker than for the mature s. therefore, we conclude that mainly the two components of the large ribosomal subunit, s and . s, undergo specific apoptotic degradation. the oxidants utilized, except for h o , had not been tested for apoptotic effects in yeast. one of the most recognized apoptotic markers is fragmentation of chromosomal dna. internucleosomal dna laddering, typical for mammalian apoptosis, has not been detected during pcd in yeast; nevertheless, a higher order chromatin fragmentation to segments of several hundred kilobases also occurs in yeast ( , ) . this dna breakdown can be monitored either by the terminal deoxynucleotidyl transferase dutp nick-end labelling (tunel) assay or by using pfge of genomic dna. 
the latter approach was applied to verify which oxidative agents lead to apoptotic phenotypes. chromosomal dna from cells treated with h o ( mm), menadione ( . mm), chp ( . mm), paraquat ( mm), diamide ( mm) and t-bhp ( mm) for min was analysed using pfge ( figure d ). clear dna degradation was observed only for cells exposed to h o and menadione; other treatments did not result in visible apoptotic fragmentation. this is in striking agreement with rrna degradation that occurred only in h o - and menadione-treated cells. this strongly indicates that the rrna decay phenotype can be related to apoptosis.
to confirm this, we have examined other conditions known to provoke apoptosis in yeast, i.e. acetic acid, ageing and hyperosmotic shock ( figure e -h). for treatment with acetic acid, cells were grown in sc complete medium (ph ) to exponential phase and exposed to mm acetic acid for up to min ( , ). hyperosmotic shock was achieved by growth of exponential cells in sc complete media supplemented with % (wt/wt) glucose or % (wt/wt) sorbitol ( ) . and finally, chronological ageing was performed by constant growth of yeast cultures in sc complete medium for - days ( ) . this analysis revealed that all apoptotic stimuli tested resulted in the s and . s rrna fragmentation with the degradation pattern specific for each condition (shown in figure e -g for s in all apoptotic conditions and in figure h for . s during ageing). the accumulating intermediates generated by some factors were comparable (see for example, cleavages mediated by % glucose and % sorbitol, acetic acid and h o , figure e and f); however, the general outcome of each treatment indicated differences in the course of events during each response.
the occurrence of rna degradation triggered by h o was monitored during a time course between and min and over a broad range of concentrations ( . - mm for min) ( figure e , lanes - and figure a ). degradation was initiated relatively fast, since it was apparent at min for . mm h o ( figure e , lanes - ), - min for mm acetic acid ( figure e , lanes - ), h for % glucose and % sorbitol ( figure f ) and days for ageing ( figure g ) following the treatment. this onset of rrna degradation distinctly precedes the timing of dna damage characterized in apoptotic yeast exposed to the same stimuli ( ) , indicating that the rna decay process is activated early during the response. also, in the case of h o , low doses of the oxidant, starting with . mm and optimal at . - mm, were sufficient to initiate rrna degradation with the appearance of specific bands. when a high concentration of h o ( mm), believed to result in cell necrosis, was used, these specific degradation products were absent; however, the level of mature s and s rrnas was also significantly reduced ( figure a ; data not shown). these data show that different ros-generating treatments that lead to yeast apoptosis, namely h o and acetic acid, ageing and hyperosmotic shock, induce rna fragmentation that most likely precedes the dna damage and, as in higher eukaryotes, can be considered a hallmark of the induction of pcd in yeast.
figure legend: hybridizations with probes against s (probe , position + , lanes - ; probe w , position + , lanes - ; probe w , position + , lanes - ; probe w , position + , lanes - ; probe w , position + , lanes - and probe w , position + , lanes - ). asterisks above the arrows indicate the products that were further analysed. arrow marked with a hatch shows a band matching the potential product of the major cleavage, product is marked with one asterisk. (b-c) primer extension analysis for two main cleavage sites in the s rrna in w cells treated with mm h o (a) and in -day old chronologically aged rho w cells (b). primer extensions were performed using primers w for sites around positions + and + and w for sites around position + relative to the end of the mature s. dna sequencing on a pcr product encompassing the end of the s from + to + , using the same primers was run in parallel on % sequencing polyacrylamide gels (lanes - ). the sequences with primer extension stops are shown on the right. secondary structures of the regions in the vicinity of the cleavages, indicated by arrowheads and shown beside corresponding primer extension reactions, were adapted from the website http://rna.icmb.utexas.edu/. (d-e) ends of cleaved-off products for the major cleavage at positions + - were mapped by race. (d) pcr reactions on cdna prepared using total rna from untreated control (lane , c) and cells treated with mm h o (lane ). to generate cdna, total rna that had been ligated to an 'anchor' oligonucleotide (w ) with t rna ligase, was reverse transcribed using a primer specific for the anchor (w ). this was followed by pcr reaction using the same primer and the primer starting at position + in the s rrna (w ). arrows indicate products corresponding to fragments cleaved at + - (lower) and + - (upper). pcr fragments were cloned into pgem-teasy and sequenced. (e) sequences obtained by the race analysis for fragments cleaved at site + - ( independent clones) and site + - ( independent clones). the corresponding regions of the s with cleavage sites mapped by primer extension and indicated with empty arrowheads are shown above in grey. figures in parentheses show the number of identical clones. (f) mapping ends of two major cleavages sites using rnase h cleavage on total rna extracted from wild-type, rrp - and ski d cells treated with mm h o (lanes, - ) and from wild-type untreated control (lane , c). rnase h treatment was performed on rna samples annealed to dna oligonucleotides w and w complementary to positions + and + , respectively. samples were separated on a % acrylamide gel and hybridized with probe w (f-i) and probe w (f-ii) to detect ends of fragments cleaved at + - (f-i) and at + - (f-ii), respectively. arrows show more defined ends of products cleaved at + - for all strains and at + - in the mutants; vertical bar in f-i indicates heterogeneous ends of products cleaved at + - in wild-type cells.
cleavages in the s rrna are endonucleolytic and require cellular machinery specific cleavages within the s rrna generated in the presence of hydrogen peroxide were monitored by northern hybridization with probes located along the molecule to narrow down the regions to be further analysed ( figure a ). this analysis showed the accumulation of diverse degradation products, some of which extended from the -end of the molecule (figure a, lanes - ) , whereas others were also truncated at their ends ( figure a, lanes - ) . the striking decrease in the level of the mature s rrna at higher doses of the oxidant ( - mm) indicates that following specific cleavages the majority of rrna becomes degraded, possibly by the exosome complex of 3'→5' exonucleases that participates in the decay of rrna precursors and excised transcribed spacers ( ) . the emergence of the characteristic cut-off in the signal at the fragment size corresponding to the position of the probe indicated that major cleavage sites are located around positions + , + and + with respect to the -end of the molecule. two of these cleavages, at positions + - and + - , were mapped for treatment with mm h o by primer extension using primers w and w situated downstream of the expected cleavage sites ( figure b ). similarly, major cleavage sites were analysed in -day old chronologically aged cells using the same primers ( figure c ) and mapped at positions + - and + - . according to the secondary structure of the s rrna taken from ( ), the regions where mapped cleavages occur (shown in figure b and c beside corresponding primer extension reactions) are located at unpaired nucleotides in loops or bulges. this points to the action of single-stranded rna nucleases.
to establish the nature of the observed rna fragmentation, -ends of the products generated by the h o -mediated cleavage in the s at positions + - and + - were determined by race. to this end, dna 'anchor' oligonucleotide (w ) was ligated with t rna ligase to total rna from untreated and treated w cells to prepare cdna using a primer specific for the anchor (w ). this served as a template to amplify products containing required fragments using the same primer and a primer that covers the -end of s rna starting at position + (w ). the ensuing pcr fragments ( figure d ) were cloned into pgem-teasy and sequenced. the results of sequenced clones for the cleavage at + - and clones for the cleavage + - are shown in figure e . in the case of the major site (cuts at positions + - ), this analysis confirms that the and ends of this degradation product overlap ( figure e , lower panel), which is consistent with the endonucleolytic mechanism of the cleavage. mapping the and ends at site + - by primer extension and race produced a different pattern: these ends do not match ideally but the products get progressively shorter, pointing at the action of 3'→5' exonucleases ( figure e, upper panel) . the most likely candidate is the exosome, a large complex with a 3'→5' exonucleolytic activity involved in the processing and degradation of mrna, rrna and other rna substrates ( ) . mutants in the exosome core component rrp , the nuclear subunit rrp and the cytoplasmic cofactor ski , were used to assess the status of the product ends by a specific rnase h cleavage. this cleavage, directed by a dna-rna hybrid between oligonucleotide w and a complementary region in the s starting at residue + , allows higher resolution of analysed rnas. this analysis shows that products generated at site + - in the exosome mutants rrp - and ski d, but not in rrp d, are extended and less heterogeneous than corresponding fragments in the wild-type strain (figure f-i; data not shown).
in contrast, positions of cleavages at site + - are not affected by mutations in the exosome (fig. f-ii) . this suggests that the cytoplasmic exosome may contribute to rrna decay by digesting ends of at least some cleavage products. to ascertain that rna degradation process is enzymatic and not chemically induced by various reactive compounds, yeast cells were fixed with % formaldehyde for min or with % ethanol for min prior to exposure to increasing concentrations of hydrogen peroxide ( figure a and b) . both fixation procedures preserve cellular structures, however, it is known that most fixatives have harmful consequences, e.g. cause some loss of cellular components, including ribosomes. nevertheless, a similar approach had been used to demonstrate that dna damage in apoptotic yeast cells was an enzymatic process ( ) . also, in the case of rna, the appearance of specific h o -induced degradation products was prevented by fixation, although the overall level of rrna was reduced. some faster migrating rna species were detected in formaldehyde or ethanol fixed cells, however, these were generated also in the absence of the oxidant and did not intensify after treatment ( figure a and b, lane ) . finally, to check whether rrna destruction during apoptotic response is not due to cessation of translation in dying cells, cells were treated for min with the translation elongation inhibitor cycloheximide ( mg/ml) and this did not lead to an apparent rrna degradation (supplementary figure s a) . this is also supported by our earlier observations that exposure of yeast cells to many oxidative agents that cause cell death does not result in rrna decay ( figure a) . together, this strongly suggests that rrna degradation observed in apoptotic and oxidative stress conditions is not simply a result of cell death but is produced in the process that requires enzymatic activity and functional cellular machinery. 
rrna is most likely cleaved endonucleolytically and in some cases, dictated probably by the rna structure, this is followed by exonucleolytic digestion by the exosome. rrna degradation is strongly correlated with ros levels and is connected with oxidative stress response and apoptosis pathways treatment with oxidative agents and apoptotic stimuli generate elevated levels of ros in the cell. to test whether there is a direct link between rna fragmentation and ros, a potent ros scavenger, l-ascorbic acid (vitamin c), was used ( ) . the presence of mm ascorbic acid prior to treatment with standard doses of h o almost totally abrogated degradation of the s rrna ( figure a ). in addition, the ectopic expression of murine bcl-x l protein of the anti-apoptotic mammalian bcl- family, known to have a protective effect against ros in yeast ( , ) , also strongly safeguarded the s rrna from rapid degradation by h o ( figure b ). in contrast, expression of the mammalian pro-apoptotic bax protein that increases ros level ( , ) , additionally enhanced the degradation phenotype ( figure c ). this confirms the direct correlation between the production of ros and the fate of cellular nucleic acids, leading not only to dna but also rrna damage and destruction of ribosomes. from the data presented so far, it appears that the observed rrna fragmentation may possibly represent a part of the cellular oxidative stress and apoptotic responses. this was assessed by testing the extent of the s rrna degradation in different mutants defective in these pathways. in the first place, yca d strain, lacking the only identified apoptotic metacaspase yca in yeast, and aif d cells not expressing the yeast apoptosis inducing factor aif ( , ), were assayed for h o -induced rna decay, however, no significant differences were observed (supplementary figure s b and c) . 
similarly, addition of a broad-range caspase inhibitor z-vad-fmk ( mm) that prevents yca -dependent cell death in yeast ( , ) had no effect on rrna fragmentation (data not shown). this indicates that rrna degradation detected in all apoptotic conditions tested is independent of the two major apoptosis mediators, yca and aif , and of other potential yeast caspases. likewise, treatment with translation inhibitor cycloheximide, that has been shown to prevent apoptotic cell death induced by h o and acetic acid ( , ) , had little or no effect on h o -mediated rrna degradation (supplementary figure s a) . however, it appears that events in the course of apoptosis that require protein synthesis are rather late, for example dna fragmentation and chromatin condensation, whereas rrna decay is initiated relatively fast. in contrast, different outcome was observed for mutants inhibiting chromatin condensation during h o -induced apoptosis. phosphorylation of serine and deacetylation of lysine , both in histone h b, were reported to have an essential role for the progress of cell death in yeast ( , ) . in agreement, s a or k q mutations in histone h b that prevent these modifications and abrogate apoptosis resulted in the significant reduction of the s rrna degradation, both the decay of the mature rrna and the amount of degradation products ( figure a ). together, these data show that the destruction of ribosomal rna in cells treated with h o , and possibly with other apoptotic stimuli, is a part of a yeast cell death pathway that involves histone modification and not the caspase-dependent pathway. remarkably, nuc -mediated apoptosis resulting from over-expression of yeast endog homologue nuc , a major mitochondrial nuclease, was also reported to be yca -and aif -independent and related to histone modifications ( ) . 
alternatively, it can be envisaged that damage of ribosomes triggered by apoptotic stimuli is an early event during the response, does not require protein synthesis, precedes caspase activation and acts as an upstream signal in the apoptotic pathway. this scenario is consistent with most observations so far. as the oxidative stress in yeast proceeds through multiple pathways that involve different response mechanisms ( , ), we tested several known enzymes and factors that regulate these responses. these included two major transcription factors, yap and skn , that control expression of several genes induced by oxidative stress and in this way participate in ros sensing ( , , ( ) ( ) ( ) , as well as components of antioxidant pathways, e.g. superoxide dismutases sod - , glutathione peroxidases gpx - , glutaredoxins grx - , peroxiredoxins tsa - , prx , dot and ahp , thioredoxins trx - and thioredoxin reductases trr - ( , ). these enzymes are required for protection against ros either by catalysing the breakdown of oxidative compounds or by restoring natural intracellular redox equilibrium. in addition, as glutathione protects cells against ros, we also used a gsh d strain lacking a γ-glutamylcysteine synthetase, which, when grown on glutathione-free synthetic medium, undergoes glutathione depletion and cell death ( , ) . strains lacking these proteins are more sensitive to several oxidants than their isogenic wild-types ( ), and following treatment with h o they showed a marked increase in the s rrna degradation, although to different degrees depending on the mutant. in figure b -d, yap d, skn d, gpx / / d, sod / d, grx / -d, prxd (tsa / d/prx d/ ahp d/dot d) and gsh d strains are shown, which gave the most evident effects in comparison with their respective isogenic wild-types, particularly when considering the decay rate of the mature s rrna.
these data indicate that a properly functioning oxidative stress response also protects cellular components such as nucleic acids from the attack by ros and that defects at any step of this defence result in a more severe and faster breakdown. it is noteworthy that multiple anti-oxidant mutants lacking all components of each enzymatic pathway exhibit a stronger effect on the s degradation than single mutants, pointing to the additive protective actions of these systems. striking effects on rrna stability in strains lacking stress response transcription factors yap and skn indicate that the synthesis of new anti-oxidant proteins that are induced by oxidative stress might be required for protection of ribosomes. consistently, blocking protein synthesis by pre-treatment with cycloheximide resulted in somewhat stronger rrna degradation (supplementary figure s a, lanes - and - ). taken together, this indicates that targeted rrna degradation during oxidative stress may directly contribute to cell death. to test whether the level of rrna, which reflects the amount of cellular ribosomes, may be somehow linked with cell survival under oxidative stress, we attempted to create a situation where the steady-state level of mature rrnas would be increased or decreased. however, additional copies of rdna present on a multicopy pnoy plasmid under control of the inducible gal promoter ( ) did not affect the amount of any mature rrna species, possibly due to mechanisms that regulate ribosome abundance (data not shown). in contrast, the level of total genomic and plasmid-derived s and s rrnas was, to our surprise, reduced to % in a strain transformed with the multicopy pjv plasmid expressing a tagged rdna gene under the control of the constitutive pgk promoter ( ) , when compared to a strain transformed with vector alone ( figure a ).
the basis of this effect is unclear, particularly given that it was seen even though the tagged rrna versions were expressed, as confirmed by northern blots using probes specific for plasmid-borne s rrna ( figure a ). nevertheless, the strain carrying pjv showed decreased viability even in the absence of oxidant ( . -fold), and its viability was reduced even more strikingly following treatment with different concentrations of h o ( . -fold for . mm and . for mm, respectively) relative to the strain with vector alone ( figure b ). in another approach, we used the noy strain, which carries a temperature sensitive (ts) rna polymerase i ( ) . at c, this strain ceases to grow but it sustains slow growth at c due to reduced levels of mature rrna. expression of additional copies of rdna from pnoy or pjv plasmids improves growth at all temperatures and rescues the ts-lethal phenotype ( , ) . growth of noy expressing additional rdna under the control of gal (pnoy ) or pgk (pjv ) promoters resulted in total rrna levels lower by % in the latter case ( figure c ). this relatively modest difference in rrna abundance led to a % decrease in survival of cells exposed to oxidative stress ( figure d ). this indicates that there may exist a correlation between the quantity of ribosomal subunits and the capacity of the cell to elicit functional defence mechanisms and prevent cell death. it is possible that there is a feedback mechanism that controls this relationship: a healthy cell that contains an adequate number of ribosomes is able to respond more efficiently to stress stimuli to protect cell components from damage, including ribosomes themselves. therefore, provided that the level of ribosomal rna monitors cell fitness, its sudden reduction may act as one of the signals to initiate cell death mechanisms, including apoptosis.
rrna degradation depends on the mitochondrial activity in the cell
mitochondria are the major source of endogenous ros generated by oxidative phosphorylation. 
the extent to which mitochondria are involved in mammalian or yeast apoptosis is still questionable, although it appears that mitochondrial ros could be important in some signalling pathways ( , ) and have a central role in some apoptotic pathways and less crucial in others ( ) . for example, apoptotic cell death in yeast induced by acetic acid, pheromone and bax expression was shown to be mediated by mitochondria ( , ) . the correlation between mitochondria and rrna stability was assessed, in the first place, by checking rrna level in respiratory-deficient rho cells lacking mtdna in conditions inducing apoptosis, i.e. exposed to h o ( figure a ), mm acetic acid ( figure b ), hyperosmotic stress ( % glucose, figure c ) and during chronological ageing ( figure d ). all treatments resulted in a remarkably robust degradation of the s and . s rrnas in rho strains when compared to the parental w and by strains (figure , supplementary figure s e ; and data not shown). this points to the importance of the functioning mitochondria in the stress-induced rrna degradation. to test the contribution of the oxidative phosphorylation, two mutants in these pathways were used, op with a point mutation in a major adp/atp carrier aac (arg →his ), and a triple aac / / Δ deletion mutant ( ) ( figure e ). both mutants behaved in a similar manner as rho cells and exhibited more pronounced rrna degradation in the presence of h o than the isogenic wild-type; however, the phenotype was stronger for op than upon deletion of the three carrier proteins, possibly due to a dominant negative effect in the point mutant. in addition, treatment with f -f atpase proton-pump inhibitors, oligomycin a ( . mg/ml) and sodium azide (nan , . mm) that block electron transfer and the synthesis of mitochondrial atp ( - ), resulted in a moderate increase in the rrna cleavage ( figure f and supplementary figure s d ). 
all these experiments indicate that the process of respiration, though generating the endogenous ros, is also vital for counteracting its adverse effects. this was further supported by the degree of rrna protection against oxidative damage caused by h o observed for yeast cells grown on different carbon sources, which are known to affect the level of respiration ( ) . the most extensive rna decay was observed in glucose, where a process called glucose repression discourages respiration. it was less pronounced in galactose and least of all in the nonfermentable source, glycerol, where mitochondrial respiration is forced ( figure g) . these experiments directly correlate functional mitochondria and the process of respiration with the defence against oxidative stress triggered by h o , acetic acid, hyperosmosis and ageing that, among others, prevents destruction of cellular components, including rrna. several cellular responses are regulated at the translational level, particularly by selective translation of specific mrnas or by inhibition of the ribosome and protein synthesis, as these processes consume a large amount of energy. such inhibition can follow various stimuli, including endoplasmic reticulum stress and unfolded protein response (upr), transition into quiescence and different stress-related and cell death-related signals ( ). the most straightforward and fastest way to achieve translation inhibition is to target ribosomal rna. interestingly, it has been proposed that repression of protein synthesis during upr in human cells is due to the cleavage of s rrna by hire b, a second homologue of ire ( ) . also, during apoptosis in some mammalian cells degradation of s rrna by rnasel, but also of other rnas such as y rna or some mrnas, has been suggested to block protein synthesis that contributes to, but could even initiate, cell death. 
furthermore, damage to the s rrna may act as a ribotoxic stress and induce an early death-committing signal through activation of sap and map kinases ( , ) . these possibilities were not examined in yeast pcd pathways. we have analysed the behaviour of ribosomal rnas during oxidative stress and in apoptotic pathways that are induced by different stimuli. we have observed that cells exposed to all apoptotic conditions tested, such as h o , acetic acid, hyperosmotic stress ( % glucose) and ageing, reveal a significant degradation of the s and some of . s rrnas, with a much lesser effect on the s rrna. the decay of mature rrnas was accompanied by the accumulation of treatment-specific, yet partly overlapping degradation intermediates. although there is no evidence so far that rrna damage during apoptotic conditions in yeast can directly initiate cell death by activating signalling pathways, it is tempting to speculate that this might be the case. such signalling could be conveyed either by a critical decrease of the mature s rrna or, alternatively, by the accumulation of degradation intermediates/products, which can function as signal molecules triggering a specific pcd pathway. in most cases, when ribosomal rnas are depleted (e.g. in pre-rrna processing mutants), rrna degradation is conducted rapidly with no or little rrna fragments detectable; however, lack of lsm proteins has been reported to result in degradation of ribosomal rnas with accumulation of unusual intermediates ( ) . it is noteworthy that one apoptotic pathway that is linked with rna metabolism is triggered by the defect in mrna turnover caused by mutations in enzymes involved in decapping, including components of dcp - and lsm - complexes ( ) . each of the applied apoptosis-inducing conditions resulted in a clear-cut pattern of the s degradation intermediates or products. 
this points to the endonucleolytic nature of the reactions, though exonucleolytic destruction, with certain rna fragments temporarily protected by compact structures or tight interactions with proteins, cannot be excluded. however, mapping and boundaries of the major h o -induced cleavage product by primer extension and race, respectively, confirmed that both ends strictly overlap and thus result from the endonucleolytic cut. it has been shown that apoptotic stimuli in yeast generate ros that are closely correlated with the onset or progression of pcd ( ) . we saw that the degree of rrna decay also corresponded to the cellular level of ros, which was modified by using a ros scavenger (ascorbic acid) or ectopic expression of pro- or anti-apoptotic proteins (bax and bcl-x l , respectively) known to affect ros generation. also, rrna degradation was more robust in oxidative stress defence mutants, both enzymatic and non-enzymatic, where intracellular ros is not properly neutralized. all these observations argue that there is a direct link between ros production and rrna fragmentation. it can be envisaged that various reactive species themselves are able to produce endonucleolytic nicks in rna molecules that will lead to breakdown, particularly as all mapped cleavages occur in single-stranded regions that constitute loops and bulges and are more accessible to chemical compounds in the solvent. however, this is not the case, given that specific rna fragmentation did not occur in cells fixed with formaldehyde or ethanol prior to treatment with h o . in addition, oxidative agents used in this work generate different forms of ros such as h o , hydroperoxide (loaooh, chp and t-bhp), superoxide anion (menadione and paraquat) and hydroxyl free radical (produced from h o or superoxide anion). 
although all were applied at toxic doses, only two of them, h o and menadione, mediated rrna degradation, supporting the notion that oxidative compounds as such do not provoke cuts in rna molecules within the cell. taken together, this strongly suggests that active cellular machinery, such as signalling factors and rna degrading enzyme(s), is required for this process. nevertheless, these enzymatic activities are still to be identified. the closest yeast homologue of mammalian rnase l, ire , a sensor of the unfolded protein response, functions in the unconventional splicing of hac pre-mrna and contains protein kinase and endoribonuclease domains similar to those in rnase l ( ) . although our unpublished data show that deletion of ire has no effect on the h o -induced rrna degradation (m.s. and j.k.), this does not exclude its participation in other apoptotic pathways, for example those related to endoplasmic reticulum stress and unfolded protein response. we are currently testing several known yeast endo- and exonucleases for participation in apoptosis-related rrna degradation. preliminary observations suggest that, contrary to expectations, this function may involve not one but a number of unspecific nucleases, including mitochondrial nuc p, that act in a redundant fashion (m.s. and j.k., unpublished data). the reason why treatment with some oxidative agents but not others produces fragmented rna is not entirely clear; however, it is known that different chemicals induce specific responses leading to expression of distinct sets of genes ( , ) , and possibly only a few activate pathways that involve rna destruction. our results suggest that the rrna degradation phenotype most likely accompanies apoptosis, and only apoptosis-inducing oxidants result in rrna decay, whereas cell death caused by others probably occurs via a different mechanism. another question is why each apoptotic stimulus resulted in a slightly different set of cleavage products. 
those which were mapped for treatment with h o and in ageing cells illustrate that cleavages often occur in loops and bulges in closely located regions within the s rrna, or rather within the accessible rna elements in the compact rnp structures. the difference in cleavage patterns could be due to stress-induced subtle or severe alterations in ribosome particles that change the local accessibility of the rrna components presented for cleavage. alternatively, if rrna decay is indeed carried out by more than one nuclease, these differences may reflect varying enzyme specificities in each death-inducing condition. it is also possible that a somewhat different set of nucleases is activated or recruited to rrna substrates during apoptosis triggered by oxidative stress, acetic acid treatment and ageing.
mitochondrial respiration protects rrna against deleterious consequences of ros
certain aspects of apoptosis, such as the change in mitochondrial membrane potential, fragmentation of mitochondria and the requirement of cyt c and aif release to the cytoplasm, are strongly conserved among different organisms and point to the pivotal role of mitochondria. in yeast, pcd pathways triggered by acetic acid, bax expression and pheromone are strictly correlated with these events and do not proceed in cells devoid of mtdna (rho ) ( , ) . during chronological ageing and hyperosmotic shock, rho strains were reported to have somewhat higher survival, which can be attributed to their long doubling time ( . to -fold), but they still die apoptotically ( , ) . however, mitochondrial function is required for resistance to oxidative stress by way of detoxification or repair of the oxidative damage and, consequently, rho cells are more sensitive to several oxidants, have a higher level of endogenous ros and undergo apoptosis caused by h o or amino-acid starvation ( , , ( ) ( ) ( ) ( ) . also, mammalian rho cell lines undergo apoptosis in response to some but not all cell death-activating stimuli. 
it has been postulated that these differences may arise from a distinct mechanism by which rho cells maintain membrane potential by way of atp consumption ( ) . our results show that, in all conditions tested, rrna degradation is tightly connected with mitochondrial function and the active process of respiration. rho cells that are less resistant to oxidative stress suffer severe rrna degradation in the presence of h o , acetic acid and % glucose and during chronological ageing. also, impediment of oxidative phosphorylation by proton-pump inhibitors or by mutations in atp/adp carriers gives a similar outcome. in contrast, rrna is markedly more stable when mitochondrial respiration is enhanced (e.g. by growth on glycerol versus glucose). this is consistent with the protective role of active mitochondria against the damaging effects of ros on cellular components, rrna destruction included. to begin with, several anti-oxidant enzymes localize to mitochondria and are more abundant in respiring cells. in addition, it has been proposed that some anti-oxidant activities may require energy ( ) . as some apoptotic pathways depend on the release of mitochondrial factors to the cytoplasm and are not induced in rho cells, this poses an important question regarding the link between rrna degradation and apoptosis.
rrna degradation: apoptotic or not?
although several apoptotic mediators (yca , aif , nma , bir and ndi ) have been identified in yeast, their networking in regulation of apoptosis is not yet fully understood. they may interact and function in a similar fashion as their mammalian counterparts; however, in contrast to mitochondrial mammalian proteins, nma and bir are located in the nucleus, so in yeast there might be some deviations from the mammalian model ( , , ) . 
nevertheless, apoptotic cell-death pathways in yeast induced by h o , acetic acid, hyperosmotic shock, ageing and increased mrna stability were reported to require metacaspase yca , nominating them as caspase-dependent pathways ( , , ) . still, deletion of yca in the mrna turnover mutant lsm d does not attenuate mrna decay, placing yca action downstream of the signal arising from mrna level ( ) . in contrast, histone phosphorylation at ser , which requires prior deacetylation at lys , has been shown to mediate h o -induced apoptosis independently of yca ( , ) . also, the activity of the major mitochondrial nuclease, nuc , in the cell-death pathway does not require either apoptotic mediator, yca or aif , but is affected by h b modifications ( ) . moreover, two pcd pathways, namely those induced by defects in protein n-glycosylation and triggered by ammonia in multicellular yeast colonies, do not rely on yca but on as yet unknown caspase-like activity ( , ) . degradation of the s/ . s rrnas is observed in all conditions inducing apoptosis; however, this process is not dependent on yca and aif . on the other hand, mutations in histone h b that inhibit phosphorylation at ser and block the progress of h o -induced apoptosis also severely affect rrna degradation. more importantly, rrna degradation coincides with apoptotic dna fragmentation; of several compounds that lead to oxidative stress and cell death, only those that provoked dna destruction also triggered rrna decay. furthermore, at least some apoptotic pathways strictly require the involvement of mitochondria and do not occur in yeast lacking mtdna, whereas rrna degradation in all stress conditions tested is more pronounced in rho cells. to sum up: rrna decay induced by apoptotic stimuli occurs during a cell death pathway that involves ros generation, mitochondrial activity, fragmentation of chromosomes and histone modification but not the apoptotic regulators yca and aif . 
this could be due to the existence of numerous different but partly overlapping pcd mechanisms, whether caspase- and mitochondria-dependent or independent. ever increasing numbers of such pathways have been described in the literature in recent years. however, we favour a different model, where all these elements function in concert at different steps of the whole scenario. stress stimuli induce signals, possibly via ros, affecting different levels of gene expression, namely chromatin modifications, transcription and translation, which as a result activate the defence response. at this level, mitochondrial respiration helps to protect cellular components via adaptive mechanisms with anti-oxidant functions. even so, when the attack is not successfully pacified, generated ros molecules initiate the destruction of cellular machineries, targeting in the first place crucial elements such as protein synthesis (i.e. ribosomes). now, the progression of the response comes to the crossroads: if the conditions are appropriate, the cascade of events leading to apoptosis is triggered (e.g. release of cyt c and other mitochondrial factors to the cytoplasm) resulting in cell death with characteristic apoptotic markers. alternatively, when apoptotic prerequisites are not met, cells do not enter this pathway. the more fit cells escape death, whereas others die anyway, maybe less rapidly and through a different pathway. in this scenario, certain events, including generation of ros, histone modifications and possibly also rna fragmentation, occur upstream of subsequent steps, such as activation of caspases and other apoptotic regulators with resulting apoptotic phenotypes. 
we
thank m. nomura (university of california, irvine) for strain noy and plasmid pnoy ; h. raué and j.c. vos (university of vrije, amsterdam) for plasmid pjv ; j. kolarov (comenius university, bratislava) for plasmids prs-bcl-x l and yep -bax and strains aac / / -d and op ; y. inoue (kyoto university) for strains yph , grx / -d and gpx / / -d; e.b. gralla (university of california, los angeles) for strains eg and sod / -d; c.d. allis (the rockefeller university, new york) for strains jhy , say and say ; r. wysocki (university of wroclaw) for plasmid yep -yap .
this work was funded by the wellcome trust ( /z/ /z). funding to pay the open access publication charges for this article was provided by the wellcome trust.
supplementary data are available at nar online.
conflict of interest statement. none declared.
key: cord- -l r w authors: hou, linlin; klug, gabriele; evguenieva-hackenberg, elena title: archaeal dnag contains a conserved n-terminal rna-binding domain and enables tailing of rrna by the exosome date: - - journal: nucleic acids res doi: . /nar/gku sha: doc_id: cord_uid: l r w
the archaeal exosome is a phosphorolytic ′– ′ exoribonuclease complex. in a reverse reaction it synthesizes a-rich rna tails. its rna-binding cap comprises the eukaryotic orthologs rrp and csl , and an archaea-specific subunit annotated as dnag. in sulfolobus solfataricus dnag and rrp but not csl show preference for poly(ra). archaeal dnag contains n- and c-terminal domains (ntd and ctd) of unknown function flanking a toprim domain. we found that the nt and toprim domains have comparable, high conservation in all archaea, while the ctd conservation correlates with the presence of exosome. we show that the ntd is a novel rna-binding domain with poly(ra)-preference cooperating with the toprim domain in binding of rna. consistently, a fusion protein containing full-length csl and ntd of dnag led to enhanced degradation of a-rich rna by the exosome. 
we also found that dnag strongly binds native and in vitro transcribed rrna and enables its polynucleotidylation by the exosome. furthermore, rrna-derived transcripts with heteropolymeric tails were degraded faster by the exosome than their non-tailed variants. based on our data, we propose that archaeal dnag is an rna-binding protein, which, in the context of the exosome, is involved in targeting of stable rna for degradation.
the rna degrading exosome is a protein complex found in eukarya and archaea ( ) ( ) ( ) . it is composed of a structurally conserved nine-subunit core, which also shows similarities to bacterial polynucleotide phosphorylase (pnpase), and contains additional subunits ( ) ( ) ( ) ( ) ( ) ( ) . the nine-subunit core of the eukaryotic exosome is essential but catalytically inactive and additional eukarya-specific subunits are responsible for the ribonucleolytic activity ( ) ( ) ( ) . in contrast, the archaeal nine-subunit exosome is a - -exoribonuclease like pnpase ( , ) and strongly interacts with a protein annotated as dnag ( , , , ) . the archaeal exosome and bacterial pnpase have not only structural but also functional similarities: they degrade rna phosphorolytically using inorganic phosphate and producing rndps, and in a reverse reaction they synthesize heteropolymeric rna tails ( , ( ) ( ) ( ) . it was suggested that the heteropolymeric rna tails found in prokaryotes destabilize rna, enabling efficient binding of - exoribonucleases including pnpase or exosome ( , ) . such a destabilization mechanism is known for short poly(a)-tails synthesized by poly(a)-polymerase in enterobacteria ( , ) and by non-canonical poly(a)-polymerases in eukaryotes, where the polyadenylation of rrna precursors is a prerequisite for their degradation by the eukaryotic exosome ( , ) . 
figure caption (fragment): nine-subunit exosomes with homotrimeric, rrp or csl containing caps ( , ) and biochemical data for dnag-containing exosomes ( ) . the csl -nt-exosome contains a homotrimeric cap built of the fusion protein csl -nt, which comprises full-length csl and the ntd of dnag.
while the structure and function of the archaeal nine-subunit exosome are well understood ( , , ( ) ( ) ( ) , little is known about the role of archaeal dnag in the context of the exosome. its annotation is based on its central topoisomerase/primase (toprim) domain ( , ) and nothing is known about the function of its n-terminal and c-terminal domains (ntd and ctd, respectively, figure a) . the archaeal nine-subunit exosome is formed by orthologs of the eukaryotic exosomal subunits rrp , rrp , rrp and csl . the rnase ph-domain containing subunits rrp and rrp are arranged in a catalytically active hexamer, on the top of which a trimeric cap composed of the rna-binding proteins rrp and csl is bound ( figure b ; , , ) . the rna-binding cap increases the efficiency of degradation of poly(a) and heteropolymeric rna by the recombinant archaeal exosome ( , ( ) ( ) ( ) ( ) . while in vivo the exosome contains both rrp and csl ( ) , in vitro complexes with homotrimeric, rrp or csl containing caps (rrp exosome or csl exosome) can be reconstituted ( figure b ; , ) . their comparative analysis revealed that rrp confers poly(a)-preference to the exosome of the hyperthermophilic and acidophilic archaeon sulfolobus solfataricus ( ) , while csl is needed for the interaction of the complex with dnag ( ; figure b ). furthermore, it was shown that dnag preferentially binds poly(a) rna in electrophoretic mobility shift assay (emsa) and increases the poly(a)-preference of the s. solfataricus exosome even in the presence of rrp ( ) . this suggested that dnag is a part of the rna-binding platform of the s. solfataricus exosome and modulates its substrate specificity ( ) . 
however, it remained unknown how dnag interacts with the exosome and with rna substrates. the tight interaction between archaeal exosome and dnag was documented for several archaeal species ( , , , ) , and fractionation of cell-free extracts followed by co-immunoprecipitation (coip) strongly suggested that in s. solfataricus dnag is an indispensable part of the exosome ( ) . on the other hand, dnag is ubiquitous in all genome-sequenced archaea, while the exosome is missing in methanococci, halobacteria and some methanomicrobia (figure and supplementary figure s ; ref. , , ) . the high conservation of dnag in archaea can be explained by the assumption that it plays an important role in rna metabolism even in the absence of exosome, and/or by its putative role as a primase, in accordance with its annotation and recent biochemical data ( , ) . the primase synthesizes de novo short rna primers during chromosome replication ( ) . archaea possess a two-subunit primase pris/pril of eukaryotic type, which was characterized in vitro ( , ) . this primase shows strong interactions with components of the archaeal replication network in pulldown assays with thermococcus kodakarensis cell-free extracts, while the putative bacterial-type primase dnag interacts with the exosome instead ( ) . however, it has been reported that dnag of s. solfataricus exhibits primase activity in vitro, and this activity is decreased by mutations of conserved residues in the toprim domain. furthermore, an interaction was detected between s. solfataricus dnag and the archaeal minichromosome maintenance (mcm) helicase in a yeast two-hybrid system and in vitro pull-down assays. based on this, it was suggested that archaeal dnag may have a dual function in the cell, as a part of the exosome and as a bacterial-type primase ( , ) . 
the bacterial primase dnag is composed of an ntd containing a zn-finger motif involved in dna binding, the central, catalytic toprim domain and a ctd necessary for the interaction with the replicative helicase dnab ( figure a , refs. , , , ) . assuming that a primase needs a dna-binding domain while a protein important for rna metabolism should possess an rna-binding domain, we decided to characterize the ntd and ctd of s. solfataricus dnag. we found that the ntd is a conserved archaeal rna-binding domain cooperating with the toprim domain in binding of rna substrates, while the ctd is important for the binding to the exosome. furthermore, we show that in vitro the exosome needs dnag for post-transcriptional tailing of native rrna, and that heteropolymeric tails enhance the degradation of rrna transcripts. our data strongly suggest that dnag is a conserved archaeal rna-binding protein, which participates in the degradation of stable rnas in s. solfataricus.

sequences of dnag proteins were obtained from ncbi (http://www.ncbi.nlm.nih.gov/) and aligned using clustal x . (http://www.clustal.org/clustal /). the neighbor-joining phylogenetic tree of dnag proteins was generated by using mega . with bootstrap replicates (mega . http://www.megasoftware.net/). the poisson correction method was used to compute the evolutionary distances, which are in the units of the number of amino acid substitutions per site.

recombinant hexahistidine-tagged dnag, rrp , csl , rrp and rrp , and streptavidin-tagged (strep-tagged) csl were expressed and purified as previously described ( ) . dnag-e q was kindly provided by dr. michael a. trakselis (pittsburgh, usa) and purified as previously described ( ) . primers used for the construction of mutant proteins are shown in supplementary table s . dnag-k ay a, dnag-k a and dnag-y a genes were generated by standard overlap extension polymerase chain reaction (pcr) ( ) and cloned into pet b vector using ncoi and ndei restriction sites.
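the poisson correction named in the phylogenetic methods above can be illustrated with a short sketch. it assumes the standard textbook form of the correction, d = -ln(1 - p), where p is the proportion of gap-free aligned sites at which two protein sequences differ. the function names and the toy sequences are hypothetical and are not taken from mega or from the data of this study.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of gap-free aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    different = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # ignore columns with a gap in either sequence
        compared += 1
        if a != b:
            different += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return different / compared


def poisson_distance(seq_a, seq_b, gap="-"):
    """Poisson-corrected distance d = -ln(1 - p), in substitutions per site."""
    p = p_distance(seq_a, seq_b, gap)
    if p >= 1.0:
        raise ValueError("distance is undefined when all compared sites differ")
    return -math.log(1.0 - p)


# Hypothetical toy alignment: one difference in ten sites, so p = 0.1.
d = poisson_distance("ACDEFGHIKL", "ACDEFGHIKV")
```

the correction only matters noticeably for divergent pairs: for small p the corrected distance stays close to p, and it grows faster than p as more sites differ, which accounts for multiple substitutions at the same site.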
overlap extension pcr was also used for the fusion of dna encoding the n-terminal amino acid residues of dnag to the -end of the csl gene. the pcr product was cloned into pet b vector using ndei and bamhi restriction sites. both constructs were expressed in escherichia coli bl -goldenplus (de ). cells producing dnag-k ay a, dnag-k a or dnag-y a were sonicated in buffer containing mm hepes (ph . ), mm nacl, mm β-mercaptoethanol and % glycerol. the cell-free extract was heated at °c for min and the soluble protein was purified through hitrap hp q and hiload® / superdex® pg columns. dnag-nt was purified using the same lysis buffer, heat treatment and ni-nta resin. cells producing the fusion protein csl -dnag named csl -nt were sonicated in buffer containing mm hepes (ph . ) and m nacl. after incubation at °c for min, the soluble protein was purified using ni-nta resin.

interactions between the csl exosome and his -dnag-ct or dnag-k ay a were analyzed by pull-down assays, in which reconstituted csl exosome and cell-free extracts of e. coli expressing one of the dnag variants were used ( ) . the csl exosome was reconstituted by mixing his -rrp and his -rrp with strep-tagged csl ( . mg of each protein) in a final volume of ml in buffer p ( mm tris (ph . ), mm mgcl , . mm ethylenediaminetetraacetic acid (edta), mm nacl, % glycerol, . % tween and . mm dithiothreitol (dtt)) and incubating at room temperature for h. after treatment at °c for min and centrifugation at g for min, the supernatant containing reconstituted csl exosome was collected. to prepare cell-free extracts of e. coli, l culture expressing his -dnag-ct or dnag-k ay a was harvested at od = . after h of induction with mm isopropyl-beta-d-thiogalactopyranoside (iptg). the cell pellet was resuspended in ml of buffer containing mm hepes (ph . ), mm nacl, mm β-mercaptoethanol and % glycerol. after sonication and centrifugation at g for min, ml of the supernatant was mixed with the csl exosome.
the mixture was incubated in buffer p for h at room temperature. then it was passed twice through a column with ml strep-tactin® sepharose®. the strep-tactin sepharose was washed with buffer p and eluted with l buffer containing mm tris (ph . ), mm nacl, mm edta and mm d-desthiobiotin. interaction between the csl exosome and his -dnag-nt was analyzed by coip assay using rrp -directed serum as previously described ( , , ) . all proteins in this assay were his-tagged. protein fractions were analyzed by sodium dodecylsulphate-polyacrylamide gel electrophoresis (sds-page) and silver-staining. for western blot analysis, protein samples were separated in % sds-polyacrylamide (paa) gel and then transferred to protran nitrocellulose membrane (whatman). western blot analysis was performed as described ( ) .

circular dichroism (cd) spectra were recorded in a jasco j- circular dichroism spectrophotometer at ambient temperature. dnag ( . m) and the variant dnag-k ay a ( . m) were measured in a cell with . cm path in mm na hpo -nah po (ph . ), mm nacl .

generation and purification of -labeled poly(ra) and the following internally labeled or unlabeled in vitro transcripts was previously described ( , ) : (i) native tail rna of nt (corresponding to an rna tail detected in s. solfataricus), (ii) mcs-rna of nt (corresponding to a part of a multiple cloning site of a plasmid) and (iii) -end s rrna transcript of nt. native s rrna was purified and labeled as follows. total rna was isolated using trizol, separated on % polyacrylamide-urea gel and stained with ethidium bromide. the gel slice containing s rrna was cut out, and rna was eluted overnight in buffer composed of mm naoac (ph . ), mm edta and . % phenol/chloroform. after phenol-chloroform extraction and ethanol precipitation, s rrna was labeled at the -end using [α- p] atp. for the generation of internally labeled s rrna (sequence according to the comparative rna web site and project, http://www.rna.ccbb.utexas.edu, ref.
) , the s rrna gene was amplified with the primers indicated in supplementary table s and in vitro transcription in presence of [α- p] ribonucleoside uridine triphosphate (rutp) was performed as described ( , ) . the sequence of the nt heteropolymeric tail added at the -end of the s rrna and s rrna transcripts is aaagggggauaaaauaaaga and corresponds to a tail previously detected in s. solfataricus ( ) .

degradation and polyadenylation assays were carried out with . counts per minute (c.p.m.) of radioactively labeled substrate in a l reaction mixture containing mm hepes (ph . ), mm kcl, mm mgcl , . mm edta, mm dtt and mm k hpo (degradation assays) or mm ribonucleoside adenine diphosphate (radp) (polyadenylation assays). in each assay, . pmol/l of a reconstituted complex was used. the concentration of substrate in the assays is indicated in the figure legends. for the assays, csl exosome, dnag/csl exosome and csl -nt exosome were reconstituted using his -csl , dnag-his or his -csl -nt and equimolar amounts of thawed his -rrp /his -rrp hexamer. the hexamer was prepared in buffer containing mm tris-hcl, ph . and mm nacl, heat treated at °c for min, purified through gel filtration and stored at − °c in aliquots ( ) . repeated thawing was avoided. rrp /csl exosome and dnag/rrp /csl exosome were reconstituted using strep-tagged csl and were purified by tandem chromatography using strep-tactin and ni-nta-agarose as described ( ) . enzymatic reactions were carried out at °c for the indicated time (min). samples were analyzed in or % denaturing paa gels at v and visualized by phosphorimaging. signals were detected and quantified using a bio-rad molecular imager and quantity one (bio-rad). for graphical representation, the radioactivity per lane was set to % and % remaining substrate was calculated.

binding assays were carried out at room temperature for min in a l reaction mixture containing mm hepes (ph . ), mm kcl, mm mgcl , % glycerine, mm dtt and .
mm edta with the indicated amounts of proteins and rna substrates. the reaction samples were resolved in % native paa gels at v and °c, and were visualized by phosphorimaging using a bio-rad molecular imager and quantity one (bio-rad) ( , ) .

genes encoding archaeal dnag proteins are found in all genome-sequenced archaea regardless of presence or absence of an exosome ( , ) . to learn more about the evolution of the archaeal dnag proteins, we created a phylogenetic tree based on the sequences of dnag proteins from representative archaeal species ( figure ) and compared this tree to the s rrna-based phylogenetic tree of genome-sequenced archaea (supplementary figure s ). both trees are congruent in the delineation of the phyla euryarchaeota, crenarchaeota, nanoarchaeota, korarchaeota and thaumarchaeota. interestingly, the absence of exosome leads to major differences in the dnag subtree of euryarchaeota, which comprise exosome-containing and exosome-less representatives, when compared to the s rrna tree. an informative example is the methanomicrobia. in the dnag tree, exosome-less methanomicrobia form a well-delineated cluster together with the exosome-less halobacteria, while exosome-containing methanomicrobia cluster together with archaeoglobi and other exosome-containing archaea ( figure ). this is in contrast to the s rrna tree, where methanomicrobia and halobacteria are in a cluster well separated from archaeoglobi and other euryarchaeota (supplementary figure s ) . methanococci, which according to the s rrna tree are distantly related to methanomicrobia and halobacteria, also do not have an exosome. this may explain other differences between the dnag- and s rrna-based subtrees of euryarchaeota (compare figure to supplementary figure s ). to get insight into similarities and differences between the individual domains of dnag in different archaea, multiple alignment of eight dnag sequences from species with and without exosome was performed ( figure ).
we found that the conservation of the ntd of dnag is very high and is comparable to that of the toprim domain. the ctd is less conserved and the conservation is even lower in exosome-less archaea. additional alignments were performed with dnag sequences from exosome-less archaea only (supplementary figure s ) and with dnag sequences from exosome-containing archaea only (supplementary figure s ). these alignments confirmed the highly conserved nature of the ntd and the toprim domain, and the lower conservation of the ctd, especially in exosome-less archaea. three invariant residues were found in the ctd of exosome-less archaea, but it should be taken into account that in this case only sequences from euryarchaeota were compared (supplementary figure s ) . these residues are also present in dnag from the exosome-containing euryarchaeota and the crenarchaeon s. solfataricus shown in figure . in the last aa of the ctd of exosome-containing archaea belonging to all five archaeal phyla, an invariant aspartate residue (d in s. solfataricus) and a cluster of conserved residues (f to d in s. solfataricus) were detected. this cluster is present in all analyzed exosome-containing species except nanoarchaeum equitans (supplementary figure s ) . the data suggest that in exosome-containing archaea, with the exception of n. equitans, the ctd of dnag is involved in the interaction with the exosome. we also searched for similarities between the ntd and ctd of archaeal dnag and other proteins using phyre (http://www.sbg.bio.ic.ac.uk/phyre ). the analysis was performed with dnag from the exosome-containing s. solfataricus and the exosome-less methanocaldococcus jannaschii (supplementary table s ). this analysis revealed that the ntd of dnag from both archaeal species harbors a region with similarity to bacterial rna helicases (in agreement with ref. ) and another region with similarity to mammalian ribosomal protein l .
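the notion of an invariant residue used in the alignment comparison above can be made concrete with a minimal sketch. it assumes that a column counts as invariant only when every aligned sequence carries the same residue and none has a gap at that position. the function name and the toy alignment are hypothetical and do not reproduce the dnag alignments of this study.

```python
def invariant_columns(alignment, gap="-"):
    """Return (column_index, residue) for every fully conserved, gap-free column.

    `alignment` is a list of equal-length strings, one per aligned sequence.
    Column indices are 0-based positions in the alignment, not residue numbers
    of any particular protein.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    conserved = []
    for i in range(length):
        column = {seq[i] for seq in alignment}
        # a single distinct symbol that is not a gap means the column is invariant
        if len(column) == 1 and gap not in column:
            conserved.append((i, alignment[0][i]))
    return conserved


# Hypothetical toy alignment of three short sequences.
toy = ["MKYD", "MRYD", "MK-D"]
columns = invariant_columns(toy)
```

as the text notes for the comparison restricted to euryarchaeota, the number of invariant columns depends strongly on how many and how divergent the aligned sequences are, so counts from different sequence sets are not directly comparable.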
the most conserved region of the ctd of both species shows similarity to the transcription elongation factor spt / interacting with rna polymerase (supplementary table s , figure , ref. ) . for essentially the same region of the ctd of s. solfataricus, similarity to the rossmann fold was found (supplementary table s ).

to test experimentally which of the dnag domains is responsible for the binding to the exosome, dnag variants lacking either the ntd (his -dnag-nt) or the ctd (his -dnag-ct; see figure a ) were generated and used in protein-protein interaction assays with the exosome containing a homotrimeric csl cap (csl exosome). since both truncated dnag variants have the same length as rrp , it was necessary to discriminate them from his -rrp by western blot analysis with dnag-specific serum. figure b shows that both truncated his -tagged dnag variants but not his -rrp were detected using the dnag-specific serum. furthermore, we noticed that the serum shows stronger signals for his -dnag-nt than for his -dnag-ct. we conclude that the specificity of the dnag-directed serum is sufficient for our analysis. interaction between his -dnag-nt and the csl exosome was analyzed by coip with rrp -specific antibodies coupled to protein a-sepharose beads. previously we have shown that binding of full-length dnag to the csl exosome is easily detectable with this assay ( ) . since all proteins used carry a his -tag and the polyclonal antibodies were raised against his -rrp , we performed a control immunoprecipitation experiment with his -dnag-nt only. figure c shows that his -dnag-nt did not interact with the antibodies. next, his -dnag-nt and the csl exosome were mixed and coip was performed. sds-page and western blot with the anti-dnag serum revealed that his -dnag-nt was not present in the last washing fraction but was well detectable in the elution fraction (figure d). we conclude that his -dnag-nt interacts with the exosome.
since the purified his -dnag-ct protein shown in figure b was highly unstable, a cell-free extract of the e. coli strain, in which the protein was produced, was directly used for interaction tests. the extract was mixed with the csl exosome containing a strep-tagged variant of csl . all other recombinant proteins were his -tagged. exosomal complexes were purified with strep-tactin sepharose beads, and sds-page and western blot analysis with dnag-specific antibodies were performed. his -dnag-ct was well detectable in the input, flowthrough and the first washing fraction but was not detected in the elution fraction ( figure e ). since interaction between the strep-tagged csl exosome and full-length dnag in e. coli cell-free extract was easily detectable by pull-down assays with strep-tactin sepharose beads (for an example see figure a below), we conclude that the ctd of dnag is important for the binding to the archaeal exosome.

the results of the phylogenetic analysis and the multiple alignments strongly suggest that the ntd of archaeal dnag has a highly conserved physiological role. since dnag from s. solfataricus binds poly(ra) ( ) , we assumed that the ntd may be involved in binding of rna. this assumption was strengthened by the similarities between the ntd of archaeal dnag and other proteins interacting with rna found by phyre (supplementary table s , figure ). there are several invariant amino acid residues in the ntd of archaeal dnag, among them a lys (k) and a tyr (y) residue. therefore we decided to generate a k ay a mutant of dnag and to test its rna binding activity by emsa. the non-tagged, mutated protein was purified and analyzed by circular dichroism spectroscopy in comparison to the recombinant, wild-type dnag, which carries a his -tag at the c-terminus. no disorder of the secondary structure was detected ( figure a and b), allowing us to conclude that the mutated protein is suitable for our analyses.
emsa assays were performed with the recombinant, wild-type dnag-his , the k ay a mutant and the previously published e q mutant of dnag, which is impaired in the primase activity. as an rna substrate, poly(ra) , which is easily shifted by dnag in emsa, was used ( ) . for comparison, labeled poly(da) was used as a dna substrate. we found that under the applied conditions, poly(da) was not bound, while as expected, poly(ra) was strongly bound by wild-type dnag (compare lane to in figure c ). furthermore, the rna binding activity of the toprim domain mutant dnag-e q was weaker when compared to wild-type dnag and rna binding by the ntd mutant dnag-k ay a was completely abolished (lanes to in figure c ). single mutants dnag-k a and dnag-y a were also prepared. they showed very low rna binding activities (supplementary figure s a and b) . to test whether the e q mutant still retained the preference for poly(ra), which is characteristic for the wild-type dnag, competition assays were performed. wild-type dnag and the e q mutant were incubated with a mixture of low amount of labeled poly(ra) and excess of unlabeled poly(ra) or heteropolymeric mcs-rna of nt as competitors. both proteins shifted the labeled poly(ra) in presence of the mcs-rna competitor but not in the presence of the poly(ra) competitor, showing that dnag-e q has poly(ra) preference like wild-type dnag (figure d) . we also performed competition experiments with excess of poly(da) . figure e shows that pmol of unlabeled poly(da) did not abolish the binding of the labeled rna, whereas pmol of unlabeled poly(ra) did. this shows that dnag is an rna-binding rather than dna-binding protein. we conclude that the ntd of s. solfataricus dnag is a novel, conserved archaeal rna binding domain and its k and y residues are important for binding of rna. furthermore, both the ntd and the toprim domain of archaeal dnag are involved in rna binding.
since wild-type dnag shows a poly(a) preference which is not affected by the e q exchange in the toprim domain, we assumed that the ntd is responsible for this preference. to verify this we generated a fusion protein composed of csl , which does not bind poly(ra) strongly and does not show poly(a) preference ( , ) , and the ntd of dnag. as the ntd of csl is the main anchor to the hexameric ring of the exosome ( ) and for degradation assays the fusion protein should be able to interact with the ring, the ntd of dnag was fused to the c-terminus of csl . the his-tagged fusion protein was named csl -nt. in order to analyze whether the ntd of dnag influences the rna binding capability of csl , emsa assays were performed with labeled poly(ra) . the substrate was not shifted by csl (lane in figure a ) but was successfully shifted by csl -nt and dnag (lanes and in figure a) . competition with unlabeled mcs-rna of nt and poly(ra) in concentrations -fold higher than the concentrations of the used proteins revealed that both csl -nt and dnag show poly(ra)-preference ( figure a ). dnag increases the efficiency of degradation of poly(ra) and a-rich rna by the csl exosome and by the exosome containing both csl and rrp in vitro ( ) . here we tested whether the fusion of the ntd of dnag to csl will have a similar effect. indeed, the csl -nt exosome degraded poly(ra) faster than the csl exosome and even faster than the csl exosome containing wild-type dnag ( figure b and c) . the faster rna degradation by the csl -nt exosome was not due to rnase contamination of the csl -nt protein fraction used for reconstitution of the complex, since incubation of poly(ra) with the csl -nt only did not result in degradation (supplementary figure s ) . actually, contamination of the degradation assays by spurious rnases originating from e. coli was excluded in our assays performed at °c ( ) . similar results were obtained from degradation assays with an a-rich transcript of nt, which corresponds to a native rna tail of s. solfataricus ( figure d ). the csl -nt containing exosome was the most efficient rnase complex, followed by the csl exosome with dnag and the csl exosome without dnag. in conclusion, the above results show that the ntd of dnag confers strong binding of poly(ra) and poly(a)-specificity to the fusion csl -nt protein.

the presence of dnag stimulates the degradation of a-rich rna by the csl exosome, most probably because dnag helps the csl exosome to recruit a-rich substrates ( ) . we decided to test this assumption experimentally using the dnag-k ay a mutant which cannot bind rna ( figure ). first it was necessary to verify that the dnag-k ay a mutant protein still interacts with the exosome. for this, a cell-free lysate of the e. coli strain producing the dnag-k ay a protein was mixed with reconstituted strep-csl exosome and purification of strep-csl containing complexes was performed with strep-tactin sepharose beads. csl was detected in the elution fraction together with his -rrp , his -rrp and dnag-k ay a ( figure a , lane ). in the control experiment without addition of exosome, dnag-k ay a was not present in the elution fraction ( figure a, lanes to ) . we conclude that the dnag-k ay a protein was specifically co-purified with the csl exosome.

[figure legend fragment: (lanes - ). as a negative control, the assay was performed with the cell-free extract only (lanes - ). m, marker; in, input, the mixture of proteins used; ft, flow-through; w , w , the first and the last washing fractions; e, the elution fraction. the protein fractions were analyzed by % sds-paa gel and silver stained. relevant proteins are marked on the right side of the panel. the size of marker proteins in kda is given on the left side. a protein copurifying with strep-csl is marked by an asterisk. (b) a phosphorimage of a denaturing % paa gel with degradation assays with pmol radioactively labeled poly(ra) ]
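the comparisons of degradation efficiency in this section rest on the quantification described in the methods, in which the radioactivity per lane is normalised and the percentage of remaining substrate is calculated. the following minimal sketch shows that arithmetic, assuming that the per-lane total is taken as 100% and that band intensities have already been exported from the imaging software as plain numbers. the function names and the example values are hypothetical and are not measurements from this study.

```python
def percent_remaining(substrate_signal, lane_total):
    """Substrate band signal as a percentage of the total signal in its lane."""
    if lane_total <= 0:
        raise ValueError("lane total must be positive")
    if substrate_signal < 0 or substrate_signal > lane_total:
        raise ValueError("substrate signal must lie between 0 and the lane total")
    return 100.0 * substrate_signal / lane_total


def time_course(lanes):
    """Convert (minutes, substrate_signal, lane_total) tuples to (minutes, percent)."""
    return [(t, percent_remaining(s, total)) for t, s, total in lanes]


# Hypothetical intensities for three time points of one degradation assay.
example = time_course([(0, 950.0, 1000.0), (5, 400.0, 1000.0), (10, 120.0, 800.0)])
```

normalising each lane to its own total, rather than to the zero time point, compensates for unequal loading between lanes, which is presumably why the per-lane convention is used.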
next, degradation assays were performed with labeled poly(ra) and csl exosome, dnag-containing csl exosome or dnag-k ay a-containing csl exosome. figure b and c show that poly(ra) is degraded faster in the presence of wild-type dnag in the protein complex, while there was no significant difference in the degradation of the substrate by the exosome containing dnag-k ay a and the exosome without dnag. we conclude that the rna binding capability of dnag is crucial for its positive influence on rna degradation by the exosome. ribosomal rna is one of the major substrates of the eukaryotic exosome and of bacterial pnpase ( , ) . thus we assumed that in exosome-containing archaea rrna is also a substrate of the exosome. this assumption is supported by the detection of heteropolymeric a-rich tails, which are most probably synthesized by the exosome, at the -end of s rrna and its fragments in s. solfataricus and methanopyrus kandleri ( , ) . however, in a previous study a transcript corresponding to the -end of s rrna ( s rrna) was not degraded nor polyadenylated in vitro by the hexameric rrp /rrp ring, the rrp exosome and csl exosome of s. solfataricus ( ) . to test whether dnag influences the interaction of the exosome with the s rrna transcript, we performed degradation and polyadenylation tests using the csl exosome with or without dnag. interestingly, dnag enabled polyadenylation of this substrate by the exosome. even after min of incubation in the presence of radp, the s rrna substrate was not polyadenylated by the csl exosome (lanes to in figure a ), while after min of incubation with the dnag-containing csl exosome, the majority of the substrate was prolonged (lanes to in figure a ). in contrast dnag did not enable degradation of the s rrna transcript by the exosome (supplementary figure s ) . 
to see whether the rna binding activity of dnag is important for the positive influence of dnag on the polyadenylation of the s rrna transcript by the exosome, dnag-e q and dnag-k ay a were used in the assays instead of wild-type dnag. less substrate was polyadenylated by the dnag-e q containing exosome (lanes to in figure a ) and the dnag-k ay a containing exosome did not polyadenylate at all (lanes to in figure a ). this suggests that binding of the s rrna transcript by dnag is necessary for its polyadenylation by the exosome. to test directly whether dnag binds this transcript, emsa analyses were performed ( figure b ). the transcript was completely shifted by the wild-type dnag and the exosome containing wild-type dnag, while no comparable shift was observed when the dnag-k ay a protein was used, alone or in the context of the exosome. when dnag-e q was used, the rna shift was weaker than with the wild-type dnag, resembling the results obtained with poly(ra) (compare figures c- b) . polyadenylation and emsa assays were also performed with native s rrna, which was isolated from total rna of s. solfataricus after separation in a % urea-polyacrylamide gel and labeled radioactively at the end. the results were very similar to those obtained with the s rrna-derived transcript: the csl exosome with dnag polyadenylated the native s rrna, while the exosome without dnag or with dnag-k ay a did not (figure c ). in accordance with this, the native s rrna was strongly shifted by dnag alone or in combination with the csl exosome in emsa assays, while no shift was observed when the mutant protein dnag-k ay a was used, and a very weak shift was observed with the csl exosome alone ( figure d ). furthermore, we verified that dnag is also needed for the polyadenylation of in vitro transcribed s rrna ( figure e ).
we noticed that although similar amounts of substrate and enzyme were used in the assays shown in figure c and e, the in vitro transcript was polyadenylated with higher efficiency than the native s rrna. in contrast to the wild-type dnag, the double mutant dnag-k ay a and the single mutants did not enable polyadenylation of the s rrna transcript by the exosome (supplementary figure s c) . the above experiments revealed that dnag enables polyadenylation of rrna by the csl exosome in vitro. however, in vivo the exosome contains both rrp and csl ( ) , and thus we decided to test whether a recombinant exosome containing the two rna-binding proteins also needs dnag for polyadenylation of the s rrna transcript. figure f shows that indeed dnag was necessary for polyadenylation of in vitro transcribed s rrna by the exosome containing rrp and csl .

previously we have shown that in contrast to the non-tailed s rrna transcript, a tailed variant containing adenine residues at the -end ( s rrna-a ) can be degraded by the rrp /rrp hexamer as well as by rrp exosome and csl exosome ( ) . here we tested whether the presence of dnag influences the degradation of the s rrna-a transcript by the exosome containing both rrp and csl . we found that dnag slightly increases the degradation of the tailed transcript by the exosome. furthermore, distinct intermediate degradation products were detected only when dnag was present in the exosome (supplementary figure s ). next we analyzed the influence of a heteropolymeric tail on the degradation of s rrna by the exosome containing rrp , csl and dnag. we compared the degradation of the non-tailed transcript to that of its tailed derivatives s rrna-a and s rrna-hetero containing a poly(a) tail or a heteropolymeric tail of nt, respectively. the sequence of the heteropolymeric tail corresponds to a tail sequence previously detected in s. solfataricus ( ) .
we observed that the degradation of the tails, which restores the non-tailed transcript, was faster than degradation of the body of the transcript (compare the two panels of different exposure in figure a ). furthermore, considering degradation products shorter than s rrna, we found that both tailed transcripts are degraded faster than the non-tailed one and that both tails equally enhance the degradation ( figure a and b) . we also tested whether the heteropolymeric tail leads to faster degradation of the s rrna transcript by the exosome. as expected, the tailed variant was degraded faster (figure c and d).

our phylogenetic analysis suggests that archaeal dnag is an ancient protein predating the origin of the archaeal kingdom, since the five archaeal phyla euryarchaeota, crenarchaeota, nanoarchaeota, korarchaeota and thaumarchaeota were delineated in a very similar way in the dnag and s rrna phylogenetic trees (figure and supplementary figure s ). however, our analysis also shows that the presence or absence of exosome had an influence on the evolution of dnag in archaea. previously, archaeal dnag sequences were used for phylogenetic analysis of methanogenic consortia, leading to very similar results when compared to s rrna-based analysis ( ) . probably this was due to the phylogenetic homogeneity of the studied archaeal group, in which no differences in respect of the exosome content are expected. we found substantial differences in the subtree of euryarchaeota comprising archaea with and without exosome. thus, despite its high conservation, archaeal dnag is not suitable as a phylogenetic marker. protein-protein interaction studies with truncated dnag proteins revealed that the ntd is not essential for the interaction with the exosome and that the ctd is important for this interaction (figure ).
an involvement of the ctd in binding of dnag to the exosome is also supported by the sequence comparisons shown in figure , supplementary figures s and s , since higher conservation of the ctd was found in exosome-containing than in exosome-less archaea. not only dnag-ct but also dnag-nt is impaired in its interaction with the csl -exosome (compare figures d- a) . thus, the integrity of dnag is important for a strong binding to the exosome. most probably the conformation of the ctd is changed in the truncated dnag-nt protein, preventing efficient binding to the protein complex. alternatively or in addition, each domain may contribute to the interaction with the exosome. it is known that the toprim domain of the recr protein from e. coli is responsible for the interaction with other protein partners and with dna ( ) . the overall spatial structure of the archaeal dnag-containing exosome is still not known. in exosome-less archaea the ctd of dnag may contribute to the integrity of the protein and/or to the interaction with other proteins. our data clearly show that the ntd is a novel, conserved archaeal rna-binding domain, which is essential for the interaction of s. solfataricus dnag with rna (figure ) . the experiments with the chimeric csl -nt protein revealed that the ntd of dnag is a separate rna binding domain with poly(a)-preference, which can exert this function in the context of different proteins ( figure ). we also show that the ntd is needed for strong binding of s rrna and rrna-derived transcripts ( figure b and d) . thus, despite its poly(a)-preference, this protein domain is a general rna-binding domain necessary for interaction of archaeal dnag with heteropolymeric substrates. interestingly, we found that the toprim domain is also involved in the interaction of dnag with rna. notably, the conserved residue e in the toprim domain, which is crucial for the primase activity of the protein ( ) , is important for strong rna binding by dnag (figures and ) .
these results strongly suggest that the ntd and toprim domains cooperate in binding of rna substrates. cooperation of multiple rna binding domains, each with a weak affinity for rna, is known to result in a strong rna binding by other proteins involved in rna metabolism like lin , a major regulator in mammalian cells, and the eukaryotic mrna export factor tap (tip-associated protein; tip is a tyrosine kinase-interacting protein) ( ) . the toprim domain is characteristic for bacterial type primases, topoisomerases, old family nucleases and recr proteins, all of them proteins involved in interactions with dna ( ) . however, archaeal dnag is not the only protein with a toprim domain which binds rna. a prominent example is ribonuclease (rnase) m from bacillus subtilis, in which a toprim domain contains the active site. both the toprim and the ctd of rnase m are important for binding of rna ( ) . binding of rna by dnag is important for the observed faster degradation of poly(ra) (figure ) and is a prerequisite for the polyadenylation of rrna and rrna-derived transcripts by the dnag-containing exosome in vitro (figure ) . therefore we propose that in archaea harboring an exosome, dnag not only participates in the efficient interaction of a-rich rna with the exosome ( ), but is also responsible for the polynucleotidylation of rrna. it is assumed that the heteropolymeric, a-rich rna tails in exosome-containing archaea have a destabilizing function ( , ) like the short poly(a) tails in enterobacteria and eukarya ( , , , ) . our data, showing that a heteropolymeric tail leads to faster degradation of rrna transcripts by the exosome in vitro, are in agreement with this assumption. the destabilizing effect of the heteropolymeric tail was comparable to the effect of a poly(a) tail of the same length ( figure b) . similarly, both a heteropolymeric tail and a poly(a) tail equally enhanced the degradation of structured rna by the bacterial degradosome in vitro ( ) .
analyses of the nucleotide composition of bacterial and archaeal heteropolymeric rna tails suggested that the tails do not have the potential to form strong secondary structures ( , ). together, these data support the view that prokaryotic, heteropolymeric tails function as single-stranded regions enabling fast initial interaction of rna substrates with - exoribonucleases. although in vivo data demonstrating the destabilizing role of heteropolymeric tails in prokaryotes are still missing ( ), we suggest that dnag plays an important role in degradation of rrna in exosome-containing archaea. this suggestion is based on the data shown in figures and . degradation of rrna in the course of quality control during ribosome biogenesis or as an adaptation to changing environmental conditions is of pivotal importance for the cell ( , , ). the in vitro polyadenylation of the native s rrna was less efficient than the polyadenylation of the in vitro transcribed s rrna, although comparable substrate amounts were used in the assays (compare lanes to in figure c and e). this can be explained by the failure of some in vitro transcripts to adopt the native rrna structure. additionally, missing rna modifications can lead to less stable rna structures ( ), and this can lead to higher accessibility of the -end of the transcript for tailing by the exosome. furthermore, t polymerase adds a non-templated nucleotide at the -end of in vitro transcripts ( ), which may facilitate addition of poly(a) by the dnag-containing exosome. we also observed that the s rrna transcript is polyadenylated much faster by the dnag/csl exosome than by the dnag/csl /rrp exosome (compare figure e, lane , to figure f, lane ). most probably this is due to the lower amount of dnag in the rrp -containing exosome (figure b). in vivo, exosomal complexes with different stoichiometric amounts of rrp and dnag/csl are present ( ).
probably archaeal exosomes of different compositions exhibit different functions, and it is possible that the exosomal complexes with higher relative amounts of dnag are responsible for tailing of stable rna. our results characterizing s. solfataricus dnag as an rna-binding subunit of the archaeal exosome do not necessarily exclude a function of dnag as a primase in the cell ( , ). it is possible that archaeal dnag is a moonlighting protein like some other proteins with more than one function in prokaryotes and eukaryotes ( ). however, there are several reasons to believe that it is involved in rna metabolism rather than in replication ( , ). the strong in vivo interaction with the exosome in several archaea was already mentioned in the introduction ( , , , ). in vitro this interaction leads to a clear and strong effect of dnag on the polynucleotidylation of native rrna and rrna-derived transcripts by the exosome (figure ). in comparison, the documented interaction between s. solfataricus dnag and the mcm helicase is weak ( ). importantly, this interaction does not influence the priming activity of dnag and specifically inhibits the helicase activity of mcm ( ). this is in contrast to the enhanced helicase activity of dnab and the priming activity of dnag upon interaction between bacterial dnag and dnab ( ) ( ) ( ). additionally, the phyre analysis of the ntd and ctd domains of dnag from the exosome-containing s. solfataricus and the exosome-less m. jannaschii revealed similarities between archaeal dnag and bacterial and eukaryotic proteins involved in rna metabolism (supplementary table s ), but no connection to the archaeal replication network was found. together with the high, exosome-independent conservation of the ntd in archaea, the strong affinity of this ntd for rna but not dna, and the involvement of the toprim domain in rna binding, this implies that dnag functions as an rna-binding protein even in archaea lacking an exosome.
in exosome-less archaea dnag may play a role in the process of rna degradation together with archaeal homologs of the bacterial rnases r and j, or of the eukaryotic cleavage and polyadenylation specificity factor ( , ( ) ( ) ( ) ( ) . according to our data, dnag is most probably involved in tailing and degradation of stable rnas in exosome-containing archaea. supplementary data are available at nar online. we thank tom rische (institute for microbiology and molecular biology, university of giessen) for help and discussion, wolfgang wende (institute for biochemistry, university of giessen) for the opportunity to perform cd measurements and for helpful advice, and stephanie glaeser (institute for applied microbiology, university of giessen) for creating the s rrna phylogenetic tree. we are grateful to christian lassek (institute for microbiology and molecular biology, university of giessen) for cloning of truncated dnag proteins and to michael a. trakselis (university of pittsburgh) for sending us the plasmid for expression of recombinant dnage q.
the exosome: a conserved eukaryotic rna processing complex containing multiple -> exoribonucleases prediction of the archaeal exosome and its connections with the proteasome and the translation and transcription machineries by a comparative-genomic approach an exosome-like complex in sulfolobus solfataricus the archaeal exosome core is a hexameric ring structure with three catalytic subunits structural framework for the mechanism of archaeal exosomes in rna processing reconstitution, activities, and structure of the eukaryotic rna exosome a duplicated fold is the structural basis for polynucleotide phosphorylase catalytic activity, processivity, and regulation characterization of native and reconstituted exosome complexes from the hyperthermophilic archaeon sulfolobus solfataricus crystal structure of an rna-bound -subunit eukaryotic exosome complex a single subunit, dis , is essentially responsible for yeast exosome core activity processing of -extended read-through transcripts by the exosome can generate functional mrnas protein complexes in the archaeon methanothermobacter thermautotrophicus analyzed by blue native/sds-page and mass spectrometry affinity purification of an archaeal dna replication protein network purification and characterization of polynucleotide phosphorylase from escherichia coli. 
probe for the analysis of sequences of rna rna polyadenylation in archaea: not observed in haloferax while the exosome polynucleotidylates rna in sulfolobus polynucleotyde phosphorylase functions as a - exonuclease and a poly(a) polymerase in escherichia coli polynucleotide phosphorylase and the archaeal exosome as poly(a)-polymerases rna quality control: degradation of defective transfer rna characterization of the role of ribonucleases in salmonella small rna decay rna degradation by the exosome is promoted by a nuclear polyadenylation complex a new yeast poly(a) polymerase complex involved in rna quality control rna channelling by the archaeal exosome insights into the mechanism of progressive rna degradation by the archaeal exosome the archaeal exosome the complete genome of the crenarchaeon sulfolobus solfataricus p toprim-a conserved catalytic domain in type ia and ii topoisomerases, dnag-type primases, old family nucleases and recr proteins quantitative analysis of processive rna degradation by the archaeal rna exosome the pyrococcus exosome complex: structural and functional characterization identification of archaeal proteins that affect the exosome function in vitro rrp and csl are needed for efficient degradation but not for polyadenylation of synthetic and natural rna by the archaeal exosome heterogeneous complexes of the rna exosome in sulfolobus solfataricus the evolutionarily conserved subunits rrp and csl confer different substrate specificities to the archaeal exosome the archaeal dnag protein needs csl for binding to the exosome and enhances its interaction with adenine-rich rnas rna polyadenylation and degradation in different archaea; roles of the exosome and rnase r characterization of a functional dnag-type primase in archaea: implications for a dual-primase system novel interaction of the bacterial-like dnag primase with the mcm helicase in archaea organization and evolution of bacterial and bacteriophage primase-helicase systems the dna 
primase of sulfolobus solfataricus is activated by substrates containing a thymine rich bubble and has a -terminal nucleotidyl-transferase activity the heterodimeric primase of the hyperthermophilic archaeon sulfolobus solfataricus possesses dna and rna primase, polymerase and -terminal nucleotidyl transferase activities structure of the rna polymerase domain of e. coli primase a toprim domain in the crystal structure of the catalytic core of escherichia coli primase confirms a structural link to dna topoisomerases a general method of in vitro preparation and specific mutagenesis of dna fragments: study of protein and dna interactions the comparative rna web (crw) site: an online database of comparative sequence and structure information for ribosomal, intron, and other rnas structure and function of the archaeal exosome hold on!: rna polymerase interactions with the nascent rna modulate transcription elongation and termination structure of the trp rna-binding attenuation protein, trap, bound to rna amino acid residues critical for rna-binding in the n-terminal domain of the nucleocapsid protein are essential determinants for the infectivity of coronavirus in cultured cells quality control of ribosomal rna mediated by polynucleotide phosphorylase and rnase r retrieval of first genome data for rice cluster i methanogens by a combination of cultivation and molecular techniques identification of the recr toprim domain as the binding site for both recf and reco. 
a role of recr in recfor assembly at double-stranded dna-single-stranded dna junctions dynamics in multi-domain protein recognition of rna the s rrna maturase, ribonuclease m , is a toprim domain family member differential sensitivities of portions of the mrna for ribosomal protein s to -exonucleases dependent on oligoadenylation and rna secondary structure polyadenylation promotes degradation of -structured rna by the escherichia coli mrna degradosome in vitro bacterial/archaeal/organellar polyadenylation degradation of ribosomal rna during starvation: comparison to quality control during steady-state growth and a role for rnase ph trna stabilization by modified nucleotides oligoribonucleotide synthesis using t rna polymerase and synthetic dna templates moonlighting proteins: an intriguing mode of multitasking thermococcus kodakarensis dna replication global phylogenomic analysis disentangles the complex evolutionary history of dna replication in archaea direct physical interaction between dnag primase and dnab helicase of escherichia coli is necessary for optimal synthesis of primer rna mapping protein-protein interactions within a stable complex of dna primase and dnab helicase from bacillus stearothermophilus dnab helicase stimulates primer synthesis activity on short oligonucleotide templates comparative genomics and evolution of proteins involved in rna metabolism identification of an rnase j ortholog in sulfolobus solfataricus: implications for -to- directional decay and -end protection of mrna in crenarchaeota distinct activities of several rnase j proteins in methanogenic archaea archaeal ␤-casp ribonucleases of the acpsf family are orthologs of the eukaryal cpsf- factor key: cord- - y ho x authors: bekaert, michaël; firth, andrew e.; zhang, yan; gladyshev, vadim n.; atkins, john f.; baranov, pavel v. title: recode- : new design, new search tools, and many more genes date: - - journal: nucleic acids res doi: . 
/nar/gkp sha: doc_id: cord_uid: y ho x ‘recoding’ is a term used to describe non-standard read-out of the genetic code, and encompasses such phenomena as programmed ribosomal frameshifting, stop codon readthrough, selenocysteine insertion and translational bypassing. although only a small proportion of genes utilize recoding in protein synthesis, accurate annotation of ‘recoded’ genes lags far behind annotation of ‘standard’ genes. in order to address this issue, provide a service to researchers in the field, and offer training data for developers of gene-annotation software, we have gathered together known cases of recoding within the recode database. recode- is an improved and updated version of the database. it provides access to detailed information on genes known to utilize translational recoding and allows complex search queries, browsing of recoding data and enhanced visualization of annotated sequence elements. at present, the recode- database stores information on approximately genes that are known to utilize recoding in their expression—a factor of approximately three increase over the previous version of the database. recode- is available at http://recode.ucc.ie the term 'translational recoding' describes the utilization of non-standard decoding during protein synthesis and encompasses such processes as ribosomal frameshifting, codon redefinition, translational bypassing and stopgo ( ) ( ) ( ) ( ) ( ) ( ) ( ) . what is often considered as a decoding error-e.g. a frameshifting error or mistranslation of a particular codon-may occasionally benefit the organism by increasing its fitness and survival. in such instances the propensity for the decoding 'error' may be selected for during evolution, leading to the formation of a particular sequence context that elevates the frequency of the 'error'. to discriminate such cases of programmed decoding 'misbehaviour' from promiscuous translational errors or translational noise, the term recoding is used. 
the position within an mrna where a recoding event takes place is termed the 'recoding site'. sequence elements responsible for increasing the efficiency of recoding events are termed 'recoding stimulatory signals', and a minimal sequence fragment that allows recoding to take place at the natural efficiency (i.e. relative to the level of standard decoding at the recoding site) is termed a 'recoding cassette'. recoding can benefit gene expression in a number of ways. it can regulate gene expression by being part of a sensor for particular cellular conditions. prominent examples include ribosomal frameshifting in bacterial release factor (rf ) and eukaryotic antizyme mrnas. in both instances, ribosomal frameshifting is required for the production of the corresponding active full-length protein products. in the rf mrna, the efficiency of frameshifting is negatively regulated by the cellular concentration of its product, rf , providing an autoregulatory circuit for its biosynthesis ( ) ( ) ( ) . in the antizyme mrna, the efficiency of frameshifting is modulated by cellular levels of polyamines, whose concentration in turn is controlled by antizyme ( , ) . thus, this mechanism ensures the maintenance of antizyme production at the levels required to support physiologically appropriate concentrations of polyamines. recoding can also be used for the diversification of protein products encoded by a single gene. an illustrative example is in bacterial dnax mrna, where frameshifting allows synthesis of two different protein subunits-sharing the same n-terminal part-from a single open reading frame (orf) in its mrna ( ) ( ) ( ) . a presumed constant ratio of frameshifting in dnax ensures a fixed stoichiometric balance between these two subunits ( ) . this balance, then, is independent of the absolute levels of dnax transcription and translational initiation on its mrna. 
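the stoichiometry argument for dnax can be made concrete with a small calculation. the sketch below is illustrative only: the function name `subunit_yields` and the efficiency value of 0.5 are assumptions chosen for the example, not values taken from the text. it shows that, if the frameshifting efficiency is constant, the ratio of the two products does not change when the number of translating ribosomes changes.

```python
def subunit_yields(initiations, frameshift_efficiency):
    """Return (frameshifted, standard) product counts for one mRNA pool.

    initiations: number of ribosomes that start translating (any scale).
    frameshift_efficiency: fraction of ribosomes that shift frame at the
        recoding site; assumed constant, as the text presumes for dnaX.
    """
    if not 0.0 <= frameshift_efficiency <= 1.0:
        raise ValueError("frameshift_efficiency must lie in [0, 1]")
    frameshifted = initiations * frameshift_efficiency
    standard = initiations - frameshifted
    return frameshifted, standard


if __name__ == "__main__":
    efficiency = 0.5  # hypothetical value, for illustration only
    # Vary the expression level over four orders of magnitude.
    for initiations in (100, 10_000, 1_000_000):
        shifted, standard = subunit_yields(initiations, efficiency)
        print(initiations, shifted / standard)
```

the printed ratio is identical for every expression level, because the number of initiations cancels out of `shifted / standard`; only the efficiency at the recoding site sets the balance.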
similarly, in many viruses recoding is responsible for setting a ratio between protein products (such as those encoded by gag-pro-pol genes in retroviruses) produced from a single mrna ( ) . recoding also provides rna viruses with a mechanism for the translation of downstream orfs on polycistronic rnas [other mechanisms include leaky scanning, shunting, reinitiation, iress and the production of subgenomic rnas ( ) ] and may also be involved in global regulation mechanisms, such as mediating the switch between translation and replication on the same genomic rna ( ) . finally, recoding provides a way for the incorporation of non-standard amino acids-e.g. amino acids that share their codons with termination signals (the most prominent example of which is selenocysteine, encoded by uga) ( ) ( ) ( ) . for further information on the diverse variety of recoding functions, see recent reviews ( , , , , ) . recoding cassettes may be composed of a variety of diverse sequence elements. for example, primary nucleotide sequences may promote re-arrangements of trna molecules relative to their codons in mrna inside the ribosome or affect recognition of trnas or release factors in the ribosomal a-site. on the other hand, many recoding signals act in the form of rna secondary structures, such as simple stem-loops, or more complex pseudoknots, kissing stem-loops and other structures that involve interactions between considerably distant rna regions ( , ( ) ( ) ( ) ( ) . trans-acting rna signals affecting ribosomal decoding through complementary interactions with ribosomal rna ( - ), or through the nascent peptide acting within the ribosome exit tunnel ( , , ) , are also known. some recoding events-such as selenocysteine insertion-require the presence of additional specialized machinery such as selenocysteine trnas, selenocysteine-specific translation factors and several other components of the selenocysteine biosynthesis and insertion pathway ( , ( ) ( ) ( ) . 
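codon redefinition can be sketched in a few lines of code. the example below is a simplified illustration, not a model of the real mechanism: it uses the standard genetic code and a hypothetical flag, `uga_as_sec`, that redefines every in-frame uga as selenocysteine (one-letter code u). real selenocysteine insertion depends on the specialized machinery described above, which this sketch ignores; the test sequence is invented for the example.

```python
# Standard genetic code, codons ordered with bases U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(mrna, uga_as_sec=False):
    """Translate an mRNA string read in frame from its first base.

    With uga_as_sec=False, UGA terminates translation (standard decoding).
    With uga_as_sec=True, UGA is decoded as selenocysteine ("U") instead.
    """
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3]
        if codon == "UGA" and uga_as_sec:
            protein.append("U")  # redefined codon: insert selenocysteine
            continue
        residue = CODON_TABLE[codon]
        if residue == "*":  # ordinary stop codon: terminate
            break
        protein.append(residue)
    return "".join(protein)


if __name__ == "__main__":
    mrna = "AUGGCUUGAGGCUAA"  # invented sequence: AUG GCU UGA GGC UAA
    print(translate(mrna))                   # standard decoding
    print(translate(mrna, uga_as_sec=True))  # UGA redefined
```

under standard decoding the invented sequence yields a two-residue peptide that ends at uga; with redefinition, translation continues through uga to the next stop codon.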
recent reviews on stimulatory signals involved in the modulation of recoding events and molecular mechanisms of recoding provide further details ( , , , , ) . despite considerable progress in the development of computational tools for the prediction of protein coding genes in sequenced genomes, the identification and annotation of recoded genes lags far behind. the hurdle lies not so much in the fact that recoded genes do not obey standard rules of genetic readout but, rather, in the considerable diversity of recoded genes and sequence elements responsible for recoding. even among evolutionarily related genes, all utilizing recoding, the diversity of recoding signals can be considerable. an extreme example is when orthologous genes utilize recoding at different stages of gene expression to achieve the same goal. an example is in dnax, where ribosomal frameshifting is employed by enterobacteria, but transcriptional slippage is used in thermus thermophilus ( ) . a similar situation occurs in bacterial insertion sequence (is) elements, where a certain group of is elements utilizes transcriptional slippage to produce orfa-orfb fusions, while many other is elements utilize ribosomal frameshifting for the same purpose ( ) . the diversity of recoding functions, combined with the wide spectrum of unrelated sequence elements involved in recoding, makes the design of a uniform model of recoding intractable. nonetheless, in recent years, we have witnessed the development of specialized models and computational tools for the identification of particular subsets of recoding cassettes, or tools that are specific to recoding events in particular groups of homologous genes ( ) ( ) ( ) ( ) . these developments, at least partially, were facilitated by the availability of a compiled dataset of known recoded genes collected together in the recode database (http://recode.genetics.utah.edu), which was initially launched years ago ( , ) . 
to facilitate further development of computational tools for the prediction of recoded genes in the ever faster growing body of sequence data, as well as to provide bench researchers with up-to-date information on recoding, an efficient means of recode database population and annotation is now required. in this article, we describe the incarnation of the database, recode- . the major advances of recode- (hosted in a new location, http://recode.ucc.ie) over previous versions include a new web design allowing enhanced visualization of stimulatory signals, a uniform recodeml format for the annotation of recoded genes, and a significantly larger number of entries (including many recently identified cases) that altogether have more than doubled the size of the database since its last published update. the data are stored in a local postgresql database that is queried by php scripts embedded in the web interface. the schema of the postgresql database is shown in figure . the database stores information on individual genes that utilize recoding, the mechanisms and stimulatory signals involved, and references to the original literature sources that describe the recoding events. in order to facilitate the uniform annotation of recoding events, we have designed an xml-based format for the annotation of recoded genes, recodeml. the document type definition for recodeml is available at the recode- web site at http://recode.ucc.ie/dtd. the extensibility of the recodeml format will allow incorporation of new annotation, if required, for newly discovered types of recoding and their associated features. the database handles batch importation of properly designed recodeml entries into the postgresql database, thus facilitating rapid population of the database with new data. the data in the database may be explored in two ways.
they may be browsed by one of three categories: kingdom (archaea, bacteria, eukaryotes and viruses), organism and type of recoding. the data may also be searched directly by key words entered into the search field. searches that use regular expressions are allowed. the output of a database search is a list of recode- entries in a short format that includes organism name, kingdom, genus, type of recoding event, status of [...]. figure shows an example of sequence annotation for the human oaz gene, alongside a diagram of a stimulatory rna secondary structure, and the recode- logo. unlike recode- , where all data on recoding events were introduced manually, recode- also utilizes automated identification of recoding events by the recently developed computer programs arfa ( ) and oaf ( ), which are able to identify and annotate + frameshifting events in mrnas of bacterial rf s and eukaryotic antizymes (oazs), respectively. however, serendipitous discoveries in experimental studies, sometimes complemented by more systematic studies of large groups of similar genes ( , ), remain a significant source of recoding events. therefore, a large proportion of new data are still entered manually or semi-manually. to ease manual entry of recoding events, a special form has been designed that is available in the database upon user registration. user registration needs to be approved by one of the database contributors. the novel data in the database include rf mrnas identified by arfa, events identified by oaf, new selenoprotein genes ( ) ( ) ( ) ( ) and new viral annotations ( ) including the newly discovered frameshift cassettes in potyviruses ( ), alphaviruses ( ) and the japanese encephalitis group of flaviviruses ( ). the database will expand in accordance with the growth of available sequence information that will be scanned by one of the existing programs for recode annotation.
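the regular-expression search described above can be pictured with a short sketch. everything in it is hypothetical: the records, the field names and the function `search_entries` are invented for illustration and do not reflect the actual postgresql schema or the php scripts behind the web interface.

```python
import re

# Hypothetical short-format records; field names are assumptions.
ENTRIES = [
    {"organism": "Escherichia coli", "kingdom": "Bacteria",
     "gene": "prfB", "recoding_type": "+1 frameshifting"},
    {"organism": "Escherichia coli", "kingdom": "Bacteria",
     "gene": "dnaX", "recoding_type": "-1 frameshifting"},
    {"organism": "Homo sapiens", "kingdom": "Eukaryotes",
     "gene": "OAZ1", "recoding_type": "+1 frameshifting"},
]


def search_entries(entries, pattern, field=None):
    """Return entries where the regex matches the given field.

    If field is None, every field of each entry is searched.
    Matching is case-insensitive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for entry in entries:
        values = [entry[field]] if field else list(entry.values())
        if any(regex.search(value) for value in values):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # "+" is a regex metacharacter, so it must be escaped.
    for hit in search_entries(ENTRIES, r"^\+1", field="recoding_type"):
        print(hit["organism"], hit["gene"], hit["recoding_type"])
```

browsing by kingdom, organism or type of recoding corresponds to restricting the search to a single field, while a free key-word search corresponds to leaving `field` unset.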
we also plan to continue developing tools for the automatic identification of recoding events from nucleotide sequences. as the field grows and the number of recoded genes progressively increases, it becomes harder to extract data from the relevant literature and a number of novel recoded genes may escape the database. therefore, we encourage users and researchers in the field to submit their data directly to the recode- database. we are also willing to provide help with the analysis of potential new recoding events. reprogrammed genetic decoding in cellular gene expression programmed translational frameshifting recoding: translational bifurcations in gene expression programmed ribosomal frameshifting goes beyond viruses: organisms from all three kingdoms use frameshifting to regulate gene expression, perhaps signaling a paradigm shift a case for ''stopgo'': reprogramming translation to augment codon meaning of ggn by promoting unconventional termination (stop) after addition of glycine and then allowing continued translation (go) coupling of open reading frames by translational bypassing recoding: expansion of decoding rules enriches gene expression expression of peptide chain release factor requires high-efficiency frameshift the function, structure and regulation of e. 
coli peptide chain release factors release factor frameshifting sites in different bacteria ribosomal frameshifting in decoding antizyme mrnas from yeast and protists to humans: close to cases reveal remarkable diversity despite underlying conservation autoregulatory frameshifting in decoding mammalian ornithine decarboxylase antizyme the gamma subunit of dna polymerase iii holoenzyme of escherichia coli is produced by ribosomal frameshifting translational frameshifting generates the gamma subunit of dna polymerase iii holoenzyme programmed ribosomal frameshifting generates the escherichia coli dna polymerase iii gamma subunit from within the tau subunit reading frame structural probing and mutagenic analysis of the stem-loop required for escherichia coli dnax ribosomal frameshifting: programmed efficiency of % programmed ribosomal frameshifting in hiv- and the sars-cov alternative translation strategies in plant viruses long-distance rna-rna interactions in plant virus gene expression and replication eukaryotic selenoprotein synthesis: mechanistic insight incorporating new factors and new functions for old factors selenoprotein synthesis: uga does not end the story selenium: its molecular biology and role in human health recoding in bacteriophages and bacterial is elements the role of programmed- ribosomal frameshifting in coronavirus propagation frameshifting rna pseudoknots: structure and mechanism structure, stability and function of rna pseudoknots involved in stimulating ribosomal frameshifting rna pseudoknots and the regulation of protein synthesis a - ribosomal frameshift element that requires base pairing across four kilobases suggests a mechanism of regulating ribosome and replicase traffic on a viral rna slippery runs, shifty stops, backward steps, and forward hops: - , - , + , + , + , and + ribosomal frameshifting upstream stimulators for recoding overriding standard decoding: implications of recoding for ribosome function and enrichment of gene 
expression use of trna suppressors to probe regulation of escherichia coli release factor translational bypassing without peptidyl-trna anticodon scanning of coding gap mrna a nascent peptide is required for ribosomal bypass of the coding gap in bacteriophage t gene protein factors mediating selenoprotein synthesis solution structure of secis, the mrna element required for eukaryotic selenocysteine insertion-interaction studies with the secis-binding protein sbp selenocysteine inserting trnas: an overview p-site trna is a crucial initiator of ribosomal frameshifting a new kinetic model reveals the synergistic effect of e-, p-and a-sites on + ribosomal frameshifting nonlinearity in genetic decoding: homologous dna replicase genes use alternatives of transcriptional slippage or translational frameshifting transcriptional slippage in bacteria: distribution in sequenced genomes and utilization in is element gene expression knotinframe: prediction of - ribosomal frameshift events arfa: a program for annotating bacterial release factor genes, including prediction of programmed ribosomal frameshifting ornithine decarboxylase antizyme finder (oaf): fast and reliable detection of antizymes with frameshifts in mrnas predicting genes expressed via - and + frameshifts recode: a database of frameshifting, bypassing and codon redefinition utilized for gene expression database resources of the national center for biotechnology information pseudoviewer : generating planar drawings of large-scale rna structures with pseudoknots sequences that direct significant levels of frameshifting are frequent in coding regions of escherichia coli conserved translational frameshift in dsdna bacteriophage tail assembly genes comparative genomics of trace elements: emerging dynamic view of trace element utilization and function dynamic evolution of selenocysteine utilization in bacteria: a balance between selenoprotein loss and evolution of selenocysteine from redox active cysteine residues 
trends in selenium utilization in marine microbial world revealed through the analysis of the global ocean sampling (gos) project the selenoproteome of clostridium sp. ohilas: characterization of anaerobic bacterial selenoprotein methionine sulfoxide reductase a an extended signal involved in eukaryotic - frameshifting operates through modification of the e site trna an overlapping essential gene in the potyviridae discovery of frameshifting in alphavirus k resolves a -year enigma a conserved predicted pseudoknot in the ns a-encoding sequence of west nile and japanese encephalitis flaviviruses suggests ns ' may derive from ribosomal frameshifting we would like to express our appreciation to the colleagues who have contributed data for the previous versions of the database. conflict of interest statement. none declared. key: cord- - sg hv w authors: yeung, siu-wai; lee, thomas ming-hung; cai, hong; hsing, i-ming title: a dna biochip for on-the-spot multiplexed pathogen identification date: - - journal: nucleic acids res doi: . /nar/gkl sha: doc_id: cord_uid: sg hv w miniaturized integrated dna analysis systems have largely been based on a multi-chamber design with microfluidic control to process the sample sequentially from one module to another. this microchip design in connection with optics involved hinders the deployment of this technology for point-of-care applications. in this work, we demonstrate the implementation of sample preparation, dna amplification, and electrochemical detection in a single silicon and glass-based microchamber and its application for the multiplexed detection of escherichia coli and bacillus subtilis cells. the microdevice has a thin-film heater and temperature sensor patterned on the silicon substrate. an array of indium tin oxide (ito) electrodes was constructed within the microchamber as the transduction element. 
oligonucleotide probes specific to the target amplicons are individually positioned at each ito surface by electrochemical copolymerization of pyrrole and pyrrole−probe conjugate. these immobilized probes were stable to the thermal cycling process and were highly selective. the dna-based identification of the two model pathogens involved a number of steps including a thermal lysis step, magnetic particle-based isolation of the target genomes, asymmetric pcr, and electrochemical sequence-specific detection using silver-enhanced gold nanoparticles. the microchamber platform described here offers a cost-effective and sample-to-answer technology for on-site monitoring of multiple pathogens. decentralized medical testing plays a vital role in today's health care system. the blood glucose meter, which was developed three decades ago as the first commercial handheld device for medical diagnostics, is by far one of the most successful examples in point-of-care testing (poct). in the years to come, driven by the ever-growing threats from emerging infectious diseases (e.g. avian flu and severe acute respiratory syndrome), the development of small-size instruments for on-the-spot pathogen detection is expected to be an important segment of the poct market. this trend has already commenced with a few companies having launched self-test products for hepatitis and human immunodeficiency virus detection based on antibody-antigen interactions (e.g. orasure technologies and acon laboratories). these systems give a visual readout indicating the presence or absence of the target virus in ~ min. one of the main shortcomings of these immunological techniques is their limited sensitivity. to address this issue, there have been significant efforts to develop nucleic acid (na)-based analyzers ( ) ( ) ( ). the miniaturization of na analytical platforms has many advantages over the conventional bench-top counterparts.
these include low sample/reagent consumption (volumes from microliters down to picoliters) as well as short assay time (minutes rather than days). most importantly, they permit the integration of a number of functions including sample preparation, target amplification, and product detection, thus enabling a fully automated operation that can be used by untrained individuals. to date, several integrated na-based analytical systems have been commercialized [http://www.cepheid.com/sites/cepheid/content.cfm?id= (genexpert from cepheid), http://www.gen-probe.com/prod_serv/inst_dts.asp (direct tube sampling systems from gen-probe), http://www.idahotech.com/razor/index.html (razor from idaho technology) and http://www.iquum.com/products/analyzer.shtml (liat analyzer from iquum)]. despite their wide use in clinical/central laboratories, their application for on-site pathogen monitoring on a routine basis is still limited due to the large footprint and high instrument cost (mainly the complex optics). a promising alternative would be the inherently simple and low-cost electrochemical method. over the past decade, despite a great deal of work having been carried out on electrochemical sequence-specific na sensing ( ) ( ) ( ) , little work has been undertaken on the integration of these sensors with upstream functionalities. in , liu et al. ( ) successfully demonstrated a fully integrated biochip for cell isolation and lysis, target amplification, as well as electrochemical amplicon detection. one relevant feature of their approach is the incorporation of on-chip mixers, valves, and pumps in a self-contained device. however, the design and fabrication of the chip involves many complicated steps, which limits its practical application. 
in a previous study, our group demonstrated in a proof-of-concept experiment that both dna amplification by the pcr and sequence-specific electrochemical amplicon detection could be done in a single microchamber ( ) , in contrast to the commonly used devices, which involve multiple chambers with complex microfluidic control elements ( ) . this microdevice had an ml reaction chamber etched in a silicon substrate with a thin-film heater and temperature sensor patterned on top for rapid thermal cycling. an oligonucleotide capture probe-modified detection electrode was placed on a glass substrate used to seal the microchamber. to develop this prototype device for practical use, sample preparation functionality as well as the ability to perform multiplexed analysis would need to be addressed. in this work, we present a complete dna-based assay in a single silicon-glass microchamber for multiple pathogen detection. a model system of e.coli and b.subtilis was used. the assay involves the following steps: (i) sample preparation using thermal cell lysis and magnetic particle-based target genome isolation; (ii) target dna amplification by the pcr; (iii) hybridization of the amplicons to their complementary oligonucleotide capture probes immobilized onto individual detection electrode surfaces and (iv) electrochemical transduction of the recognition event via gold nanoparticles with signal amplification using electrocatalytic silver deposition ( ) . an important issue to be addressed was the compatibility of all the materials and processing steps. in particular, the chemistry used for the site-specific probe immobilization together with that of the magnetic particles used for genome isolation should be pcr-compatible. oligonucleotides and pcr reagents were obtained from invitrogen (carlsbad, ca, usa), unless otherwise stated. other chemicals were purchased from sigma-aldrich (st. louis, mo, usa). 
electrochemical measurements were performed using a vmp multichannel potentiostat (princeton applied research, oak ridge, tn, usa) controlled by ec-lab software (version . , bio-logic science instruments, claix, france). the thermal control system for the pcr consisted of a data acquisition card (pci-mio- e- , national instruments, austin, tx, usa) along with a signal conditioning board (sc- -rtd, national instruments) connected to the temperature sensors. a digital feedback proportional-integral-derivative (pid) control algorithm was implemented in labview software (national instruments) to control the voltage supplied to the heater from a power source (hp a, hewlett-packard, rockville, md, usa). the silicon chip (thickness of mm) had two fluid injection holes (top side, diameter of mm, depth of mm) and a chamber (bottom side, length and width of mm, depth of mm) etched by the inductively coupled plasma/deep reactive ion etching (icp/drie) process (see figure a , left panels). thin-film platinum ( nm) was patterned on top of the silicon substrate as heater and temperature sensors ( figure a , upper left). the glass chip had platinum pseudoreference and counter electrodes (thickness of nm) as well as four working electrodes made of indium tin oxide (ito) (thickness of nm) (see right panel of figure a ). ultra-violet curing optical cement (type uv- , summers optical, hatfield, pa, usa) was used to bond the silicon and glass chips; the curing procedure followed the manufacturer's instructions. pipet tips were glued to the fluid injection holes with epoxy ( figure b ). an oligonucleotide capture probe specific for e.coli amplicon [pyrrole- -acaacacgtttagcctgacc- (pyrrole-ec), apibio, france] was first electrochemically polymerized onto two of the four ito working electrodes. a mixture of mm pyrrole, mm pyrrole-ec, and . m liclo was introduced into the microchamber, followed by a cyclic voltammetric scan of the two electrodes between − . and + . 
v at a scan rate of mv/s, repeated three times. the microchamber was then washed with deionized water and dried with nitrogen gas. the same procedure was repeated for the other two ito working electrodes, except that an oligonucleotide capture probe specific for b.subtilis amplicon [pyrrole- -cctacgggaggcagcag- (pyrrole-bs), apibio, france] was used. ten microliters of avidin-coated magnetic particles ( . mm, vms- - , spherotech, libertyville, il, usa) were washed with an equal volume of saline/sodium citrate buffer (ssc, mm nacl/ mm sodium citrate, ph . ). after centrifugation and pipetting, the supernatant was removed and the magnetic particles were incubated with ml of nm biotinylated genome capture probe overnight at room temperature. the oligonucleotide sequence of the genome capture probe for e.coli was -biotin-gacaagaaaatc-tccaacatcc- while that for b.subtilis was -biotin-ccagtttccaatgaccctcccc- . the capture probe-functionalized magnetic particles were finally washed with ml of the ssc buffer, resuspended in ml of the ssc buffer, and stored at c. genome isolation. the sample containing e.coli or b.subtilis or both ( ml), which was cultured in luria-bertani broth overnight at c, was mixed with ml each of the biotinylated genome capture probe ( nm) for e.coli and b.subtilis along with ml of the ssc buffer. the mixture was injected into the reaction chamber and sealed with bostik's blu-tack. then, the silicon-glass device was placed in a plexiglass holder with contact pins for electrical connections to the heater and temperature sensors ( figure c ). the chamber was maintained at c for min to lyse the cells and at the same time denature the genomic dnas. after that, the chamber was cooled to c and held for min to allow specific hybridization between the denatured genomic dnas and capture probes. subsequently, ml of the functionalized magnetic particles were added into the chamber and incubated for min to capture the specific genomes onto the magnetic particles. 
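the thermal steps above (a high-temperature lysis hold followed by a lower-temperature hybridization hold) rely on the digital pid feedback loop mentioned for the on-chip heater. a minimal python sketch of such a loop is given below for illustration only. the gains, the voltage limits, the setpoints and the first-order heater model are all hypothetical; none of them is taken from the authors' labview implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Discrete PID controller with a clamped output (illustrative sketch)."""
    kp: float
    ki: float
    kd: float
    out_min: float = 0.0    # hypothetical lower heater-voltage limit
    out_max: float = 10.0   # hypothetical upper heater-voltage limit
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the heater voltage for one control step of length dt."""
        error = setpoint - measured
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        trial_integral = self.integral + error * dt
        raw = self.kp * error + self.ki * trial_integral + self.kd * derivative
        out = min(self.out_max, max(self.out_min, raw))
        if out == raw:
            # Anti-windup: accumulate the integral only while the output is unsaturated.
            self.integral = trial_integral
        self.prev_error = error
        return out


def simulate_hold(setpoint: float, seconds: float = 60.0, dt: float = 0.01) -> float:
    """Drive a toy first-order heater model to `setpoint`; return the final temperature."""
    # Assumed plant: ambient (deg C), gain (deg C per volt), time constant (s).
    ambient, gain, tau = 25.0, 10.0, 2.0
    pid = PID(kp=2.0, ki=1.0, kd=0.05)  # hypothetical gains
    temp = ambient
    for _ in range(int(seconds / dt)):
        volts = pid.update(setpoint, temp, dt)
        temp += dt * (ambient + gain * volts - temp) / tau
    return temp
```

calling `simulate_hold(95.0)` and then `simulate_hold(60.0)` mimics a lysis hold followed by a hybridization hold in this toy model; on real hardware the gains would have to be tuned against the measured response of the chip.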
finally, ssc buffer was used to remove any unwanted materials with an external magnet to keep the particles within the microchamber. electrochemical amplicon detection. after the asymmetric pcr, the solution was allowed to stand at room temperature for h. unhybridized amplicons were washed away with the ssc buffer. gold nanoparticle label was bound to the hybridized amplicons by exposing the electrode to a streptavidin-gold nanoparticle ( nm) solution (the stock was diluted times with . m -( -hydroxyethyl)- piperazineethanesulfonic acid/ . m nacl) for min at room temperature. the unbound gold nanoparticles were removed by flushing the microchamber with phosphate-buffered nitrate solution ( . m nano / mm sodium phosphate, ph . ). electrocatalytic silver deposition onto the hybrid-bound gold nanoparticles was then achieved by applying a potential of − . v in a silver nitrate solution ( mm agno / m kno ) for s. finally, the amount of deposited silver was determined by measuring the oxidative silver dissolution response with an applied anodic current of ma in the same silver nitrate solution, and the time to reach a potential of + . v was taken as the signal. the single microchamber design poses particular challenges to the electrochemical platform used for sequence-specific pcr amplicon detection. addressability and compatibility are two important considerations regarding immobilization chemistry for the oligonucleotide detection capture probes. for a multiplexed assay, it is necessary to individually modify the detection platform so that each individual electrode in an electrode array has a specific capture probe. when using either high temperature or ultra-violet glue to seal the microchamber, it is recommended that immobilization be carried out after the silicon-glass bonding process so as to prevent damage to the capture probes. 
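the signal definition used in the detection protocol above (the time at which the electrode potential reaches a threshold during constant-current silver dissolution) can be sketched in python as follows. this is an illustration under stated assumptions: the function name, the linear interpolation between samples and the example numbers are hypothetical, and the source does not describe how the crossing time was extracted from the potentiostat trace.

```python
from typing import Optional, Sequence


def stripping_time(times: Sequence[float],
                   potentials: Sequence[float],
                   threshold: float) -> Optional[float]:
    """Return the first time at which the potential reaches `threshold`.

    The crossing is located by linear interpolation between the two
    neighbouring samples. Returns None if the threshold is never reached.
    """
    if len(times) != len(potentials):
        raise ValueError("times and potentials must have the same length")
    if not times:
        return None
    if potentials[0] >= threshold:
        return times[0]
    for i in range(1, len(times)):
        v0, v1 = potentials[i - 1], potentials[i]
        if v0 < threshold <= v1:
            frac = (threshold - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None
```

a longer stripping time corresponds to more deposited silver, so in this sketch the returned value plays the role of the analytical signal for one electrode.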
in doing so, the more common chemical attachment (spotting) method cannot be used because all the active electrode surfaces are embedded within the same microchamber and they would receive identical modifications.
figure . a schematic representation of the assay protocol in the silicon-glass microchamber. the three main steps were (a) sample preparation: thermal cell lysis and magnetic particle-based isolation of specific genomic dnas; (b) target dna amplification: generation of single-stranded rich amplicons by asymmetric pcr; (c) product detection: gold nanoparticle labeling, electrocatalytic silver deposition, and electrochemical silver dissolution.
site-specific probe immobilization onto individual electrode surfaces can instead be achieved simply by electrochemical copolymerization of pyrrole and pyrrole-oligonucleotide ( ) . figure illustrates the strategy to immobilize different capture probes onto each individual electrode. a solution of pyrrole and oligonucleotide bearing a pyrrole group is introduced into the microchamber. when a cyclic voltammetric scan is applied to electrode , with other electrodes disconnected or grounded, oligonucleotide is selectively deposited on this particular electrode. then, the microchamber is washed with water to ensure that no pyrrole-oligonucleotide monomer is left. this procedure is repeated for the other electrodes with different pyrrole-oligonucleotide polymerization solutions. in our model system with two target analytes and four working electrodes, the capture probes specific to e.coli and b.subtilis are immobilized in duplicate. before proceeding to the complete analytical protocol, the ability of these immobilized capture probes to recognize their complementary targets should be tested. 
figure shows the fluorescence images of the four functionalized electrodes (a and d: b.subtilis probe; b and c: e.coli probe) exposed to a sample containing a fluorescently-labeled sequence complementary to the e.coli probe. it is clear that electrodes b and c exhibit much higher fluorescence intensity than electrodes a and d, indicating the highly specific probe immobilization as well as hybridization offered by the electrochemical pyrrole-based attachment chemistry. another criterion for the selection of immobilization method is the compatibility with other processes, in particular the pcr. because the detection electrodes are within the reaction chamber, the linkage between the immobilized capture probe and electrode surface must be strong enough to survive the thermal cycling process (especially the high denaturation temperature). moreover, the detector surface should interact only with the specific amplicon but not with other components employed in the assay protocol. the assay procedure used in this work is schematically represented in figure . it involves three main steps: sample preparation, target dna amplification, and product detection, all performed within the same microchamber. intact cells are first broken down by applying a high temperature ( c, controlled by the on-chip heater and temperature sensor) to free the genomic dna. to remove all the interfering substances (e.g. cell debris and protein) that may affect the subsequent dna amplification process, magnetic particles are used to isolate the specific genomes. biotinylated genome capture probes for the two model species are mixed with the intact cells before injecting into the microchamber. when the temperature is lowered to c after the thermal lysis step, these probes hybridize to their complementary target genomes. these probe-genome hybrids are then isolated by the addition of the avidin-coated magnetic particles, followed by thorough washing. 
it is worth noting that the magnetic particles are pretreated with a small amount of the genome capture probes to minimize nonspecific adsorption of the interfering substances and other genomic dnas. subsequently, with the genomes captured on the magnetic particles serving as the template, asymmetric pcr is conducted to generate single-stranded rich target amplicons. after the amplification step, these amplicons hybridize to their corresponding detection electrodes. next, the hybridized amplicons are labeled with gold nanoparticles via biotin-avidin interaction. finally, silver metal is electrocatalytically deposited onto the gold nanoparticles and the amount is determined by the electrochemical oxidative dissolution technique ( , ) . the detailed procedure for the sample preparation, target dna amplification, and product detection steps is given in the materials and methods section. the ability of this single microchamber to detect a specific cell type using the above protocol is demonstrated by running a series of experiments with e.coli cells of different concentrations, taking the signal from the b.subtilis detection capture probe-modified electrode as the background. figure gives a semi-log plot of the sample-to-background ratio against the number of cells in the sample. a linear relationship is obtained in the concentration range investigated ( - cells/sample). this result confirms the successful isolation of the genome with the magnetic particle-based approach, compatibility of the magnetic particles with the pcr, thermal stability of the immobilized detection capture probe through the temperature cycling process, as well as negligible nonspecific adsorption on the electrode surface. another attractive feature of this microdevice is multiplexing. by constructing an electrode array, it is possible to identify several species in a single run. the results for the detection of the two model species are presented in figure . 
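the semi-log calibration described above (sample-to-background ratio plotted against the logarithm of the cell number, with a linear relationship over the range tested) can be sketched in python as an ordinary least-squares fit. this is only an illustration: the function names are hypothetical, the fitting method is an assumption since the source does not state how the line was obtained, and no data from the paper are reproduced.

```python
import math
from typing import Sequence, Tuple


def fit_semilog(cell_counts: Sequence[float],
                ratios: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ratio = slope * log10(cells) + intercept."""
    if len(cell_counts) != len(ratios) or len(cell_counts) < 2:
        raise ValueError("need at least two paired points")
    xs = [math.log10(n) for n in cell_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ratios) / len(ratios)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ratios))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_cells(ratio: float, slope: float, intercept: float) -> float:
    """Invert the calibration line to estimate a cell number from a ratio."""
    return 10 ** ((ratio - intercept) / slope)
```

here each ratio would be the silver stripping time of the target electrode divided by that of the background electrode; the inverse function is only meaningful inside the calibrated range.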
if the sample contains e.coli cells only, there is a significant increase in the analytical signal (silver metal stripping time) for the e.coli detection capture probe-functionalized electrode while that for the b.subtilis remains the same as the background signal.
figure . a calibration plot of the sample-to-background ratio against the logarithmic number of e.coli cells in the sample. in our four working electrode design, the silver stripping time to reach a potential of + . v (versus pt pseudo-reference electrode) for the e.coli detection capture probe-modified electrodes was taken as the sample signal whereas that for the b.subtilis detection capture probe-modified electrodes was taken as the background signal. note that the number of cells stated was the amount being introduced into the microchamber.
when the sample contains b.subtilis cells only, the opposite results are obtained from the two different electrodes. another case is the inclusion of both cell types; not surprisingly, both electrodes have much higher signals than the background. we have demonstrated the utilization of a silicon-glass-based microchamber for dna-based detection of e.coli and b.subtilis cells. thermal cell lysis, magnetic particle-based target genome isolation, dna amplification, and electrochemical sequence-specific amplicon detection have been successfully implemented in this microdevice platform. the portable electrochemical instrumentation as well as a simple microchip design are conducive to the realization of on-site pathogen detection. the selective immobilization of capture probes using the pyrrole-based electropolymerization process provides good thermal stability and pcr process compatibility, which are crucial for the multiplexed analysis. future work will be directed towards interfacing the microchip with the macro-world so as to achieve a fully automatic device that can be used by untrained individuals. 
miniaturised nucleic acid analysis
microfabricated systems for nucleic acid analysis
dna-based bioanalytical microsystems for handheld device applications
electrochemical dna sensors
electrochemical nucleic acid biosensors
nanomaterial-based electrochemical biosensors
self-contained, fully integrated biochip for sample preparation, polymerase chain reaction amplification, and dna microarray detection
microfabricated pcr-electrochemical device for simultaneous dna amplification and detection
an integrated nanoliter dna analysis device
effects of gold nanoparticle and electrode surface properties on electrocatalytic silver deposition for electrochemical dna hybridization detection
preparation of a dna matrix via an electrochemically directed copolymerization of pyrrole and oligonucleotides bearing a pyrrole group
gold nanoparticle-catalyzed silver electrodeposition on an indium tin oxide electrode and its application in dna hybridization transduction
results of the two different types of electrodes (functionalized with e.coli and b.subtilis detection capture probe) when subjected to different sample solutions. (a) e.coli cells only; (b) b.subtilis cells only
the authors thank the funding support from the research grants council of the hong kong special administrative region government (rgc cerg project# and ). laboratory facilities provided by the nanoelectronics fabrication facility (nff) and bioengineering laboratory (belab) at hkust, for the chip fabrication and biomaterials processing, respectively, are also acknowledged. funding to pay the open access publication charges for this article was provided by rgc, hong kong sar government. conflict of interest statement. none declared. 
key: cord- -zt o bcz authors: rolando, justin c; jue, erik; barlow, jacob t; ismagilov, rustem f title: real-time kinetics and high-resolution melt curves in single-molecule digital lamp to differentiate and study specific and non-specific amplification date: - - journal: nucleic acids res doi: . /nar/gkaa sha: doc_id: cord_uid: zt o bcz isothermal amplification assays, such as loop-mediated isothermal amplification (lamp), show great utility for the development of rapid diagnostics for infectious diseases because they have high sensitivity, pathogen-specificity and potential for implementation at the point of care. however, elimination of non-specific amplification remains a key challenge for the optimization of lamp assays. here, using chlamydia dna as a clinically relevant target and high-throughput sequencing as an analytical tool, we investigate a potential mechanism of non-specific amplification. we then develop a real-time digital lamp (dlamp) with high-resolution melting temperature (hrm) analysis and use this single-molecule approach to analyze approximately . million amplification events. we show that single-molecule hrm provides insight into specific and non-specific amplification in lamp that is difficult to deduce from bulk measurements. we use real-time dlamp with hrm to evaluate differences between polymerase enzymes, the impact of assay parameters (e.g. time, rate or fluorescence intensity), and the effect of background human dna. by differentiating true and false positives, hrm enables determination of the optimal assay and analysis parameters that lead to the lowest limit of detection (lod) in a digital isothermal amplification assay. isothermal methods, such as loop-mediated isothermal amplification (lamp), are attractive for nucleic acid amplification tests (naats) in point-of-care and limited-resource settings ( , ) . lamp in particular shows promise as a naat with fewer hardware requirements compared with polymerase chain reaction (pcr) ( ) . 
despite advancements, the ability to optimize lamp naats for a specific target sequence and primer set (specific to a target organism) remains constrained by a limited understanding of how amplification is affected by myriad factors, including polymerase choice, primer design, temperature, time and ion concentrations. in particular, addressing non-specific amplification remains a core problem as it constrains an assay's limit of detection (lod). in reactions containing template target molecules, both specific and non-specific amplification reactions may occur. unlike pcr, lamp lacks a temperature-gating mechanism, so non-specific reactions consume reagents and compete with specific amplification impacting its kinetics. the presence of non-specific amplicons therefore adversely impacts both the assay's analytical sensitivity (the fewest template molecules that can be detected) and its analytical specificity (ability to detect the target template in the presence of competing reactions). classifying reactions as either specific or non-specific amplification would therefore be invaluable both during assay optimization and assay deployment in clinical diagnostics. substantial research is focused on using isothermal amplification chemistries for diagnosis of infectious disease. for example, chlamydia (caused by the pathogen chlamydia trachomatis, ct) is the most common sexually transmitted infection worldwide, with more than million cases reported annually ( ) . diagnosis of ct infections is challenged by a lack of standard symptoms (many infections are asymptomatic) ( ) and the presence of mixed flora (particularly in the female reproductive tract) ( ) . thus, rapid naats with high sensitivity and specificity are critically needed, especially naats that can deal with the high levels of host or background dna likely to be present in clinical samples such as urine samples and swabs ( , ) . 
optimizing lamp for ct and other infectious pathogens requires either reducing non-specific amplification or a method for separating non-specific reactions from specific amplification. reactions run in bulk (i.e. in a tube) in the absence of template can provide information on the behavior of non-specific amplification. another method to identify non-specific amplification combines mathematical modeling with electrophoresis to distinguish between non-specific and specific banding patterns ( ) . however, in the presence of template, although specific and non-specific reactions occur simultaneously, they cannot be monitored simultaneously. thus, bulk reactions have three important limitations with regard to assay optimization: (i) differences in the kinetics of specific and non-specific reactions cannot be separated; (ii) rare but significant events, such as early but infrequent non-specific amplification, cannot be easily characterized; and (iii) testing the full design space requires many hundreds of replicates to obtain statistically significant data. to improve an assay's analytical specificity and sensitivity, one strategy is to eliminate the detection of non-specific amplification. in bulk lamp experiments, non-specific amplification can be excluded from detection by using probes, beacons, fret or reporter-quencher schemes that show only specific amplification of the target ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) . although these methods improve the assay, they do not capture non-specific reactions and thus cannot give insights into the origin of non-specific amplification or the conditions that led to non-specific amplicons. moreover, probes and beacons do not eliminate non-specific amplification; non-specific amplification still competes for reagents and can limit the extent of the signal generated by specific amplification events ( ) . 
hence, it is highly desirable to distinguish specific from non-specific amplification. in this study, we combined sequencing and digital single-molecule lamp (dlamp) with high-resolution melting temperature (hrm) to probe the fundamental mechanics of amplification reactions. we used dlamp to extract real-time kinetic information to identify the digital threshold data-processing parameters that minimize non-specific amplification events and elucidate how an interfering molecule impacts amplification. digital single-molecule methods separate individual amplification events into discrete compartments, eliminating interference among individual amplification events ( , ) . furthermore, digital experiments consist of thousands of reactions that run in parallel and thus provide valuable statistical information ( ) ( ) ( ) . we used real-time imaging to monitor the kinetics of , dlamp reactions per experiment and observe ∼ . × reactions in total. we hypothesized that high-resolution melting analysis (hrm) could be a tool for separating specific from non-specific amplification events and for identifying the optimal digital threshold data-processing parameters to distinguish specific and non-specific amplification events (even when an assay is deployed without hrm). to test this hypothesis, we used a dlamp assay with ct dna as the target (combined with sequencing to identify the products of bulk reactions) to analyze both specific and non-specific amplification under conditions that include clinically relevant concentrations of background human dna. isoamp i (#b s), isoamp ii (#b s), mgso (#b s), deoxynucleotide solution (#n s), bovine serum albumin (bsa, #b s ), bst . ( , u/ml, #m s) and bst . ( u/ml, #m s) were purchased from new england biolabs (ipswich, ma, usa). ambion rnase cocktail (#am ), ambion nuclease-free water (#am ), invitrogen syto (s ) and invitrogen rox reference dye (# ) were purchased from thermo fisher scientific (waltham, ma, usa). 
we found it important to use syto dilutions within one week of preparation. primer sequences were targeted against the chlamydia trachomatis s ribosomal gene using primer explorer v (eiken chemical, tokyo, japan) and checked in snapgene (gsl biotech, chicago, il, usa) to ensure the sequences were in a mutation-free region from the available genbank sequences of ct. primers were purchased from for both enzymes, after min of amplification, reactions were ramped to °c at maximum output and held for s to inactivate the enzymes. chips were cooled to °c and the melt was performed at a ramp rate of °c per image from - °c, and a ramp rate of . °c per image from - °c. a frozen stock of live ct (d-uw , z , zeptometrix, buffalo, ny, usa) was re-suspended in pre-warmed ( °c) spg buffer ( mm sucrose, . mm kh po , . mm na hpo , and . mm l-glutamate) to × ifu/ml. it was then diluted -fold into a freshly donated urine sample to × ifu/ml. urine from a healthy human donor (> years of age) was acquired and used in accordance with approved caltech institutional review board (irb) protocol - . written informed consent was obtained from all participants, donations were never tied to personal identifiers and all research was performed in accordance with relevant institutional biosafety regulations. a l aliquot from this ct-spiked urine sample was then extracted following the zr viral dna/rna kit protocol (#d , zymo research, irvine, ca, usa). briefly, l of ct-spiked urine was mixed with l dna/rna shield and l dna/rna viral buffer. a total of l ( l × ) was added to the column and centrifuged at × g for min. then, l viral wash buffer was added to the column and centrifuged at × g for min. then, l dnase/rnase-free water was added directly to the column and centrifuged at × g for s. the eluent was treated by adding . l ambion rnase cocktail (#am , thermofisher) to . l template. stocks were prepared in . 
× te buffer and dilutions were quantified using the qx droplet digital pcr system (bio-rad laboratories, hercules, ca, usa), with outer primers at nm each and × evagreen supermix (bio-rad). a thermoelectric module (vt- - . - . - ), thermistor (mp- ), controller (tc- ) and v power supply (ps- - . ; te tech, traverse city, mi, usa) were wired according to the manufacturer's instructions. while the peltier can be used out of the box, we manufactured a heat plate and sink to improve the efficiency in the cooling mode. instructions for fabrication can be found in the supplementary materials and methods, 'fabrication of thermoelectric unit mount.' the ability of the embedded thermocouple to accurately assess temperature of the aluminum block was verified with an independent k-type mini-thermocouple read through a general irt k [ir] thermometer. human genomic dna from buffy coat leukocytes (roche (via sigma aldrich), # ) was fragmented using a covaris focused ultrasonicator m (woburn, ma, usa) equipped with l microtube afa fiber snap-cap at w peak power, % duty factor, cycles per burst, for s. fragment concentration was determined using a qubit fluorometer (thermofisher, #q ) with dsdna hs assay kit (thermofisher, #q ) and mean fragment size determined as bp using an agilent tapestation (#g aa, agilent, santa clara, ca, usa) and high sensitivity d screentape (# - ) with ladder (# - ), and d screentape (# - ) with high sensitivity d reagents (# - ). dilutions were prepared with a final concentration of . × te buffer. microfluidic chips for dlamp (#a ; applied biosystems, foster city, ca, usa) were loaded as we have described previously ( ) at a concentration where ∼ % of partitions would fluoresce (corresponding to the poisson maximum single template per partition loading of cp/l). we estimated the volume of each partition to be pl. to achieve this concentration of template molecules, we diluted template stocks from storage in . × te to ∼ . × te for all experiments. 
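the loading condition mentioned above can be related to standard poisson partitioning statistics. if templates are distributed at random with a mean of λ copies per partition, the fraction of partitions holding exactly one template is λe^(−λ), which is largest at λ = 1, and the fraction holding at least one template is 1 − e^(−λ) (about 63% at λ = 1). the python sketch below illustrates these relations; the function names are hypothetical, and the partition volume and concentrations used by the authors are not reproduced here.

```python
import math


def fraction_positive(mean_copies: float) -> float:
    """Fraction of partitions containing at least one template."""
    return 1.0 - math.exp(-mean_copies)


def fraction_single(mean_copies: float) -> float:
    """Fraction of partitions containing exactly one template."""
    return mean_copies * math.exp(-mean_copies)


def mean_copies_from_fraction_positive(fraction: float) -> float:
    """Invert the Poisson relation: observed positive fraction -> mean copies."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -math.log(1.0 - fraction)


def copies_per_volume(mean_copies: float, partition_volume: float) -> float:
    """Template concentration, in copies per unit of `partition_volume`."""
    return mean_copies / partition_volume
```

in a digital experiment the last two functions are typically chained: the observed fraction of fluorescent partitions gives the mean copies per partition, which is then divided by the partition volume to give a concentration.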
genomic dna (gdna) stocks, also stored in . × te, were diluted to a final concentration of . ×. thus, the total final concentration of te for all experiments was ∼ . × te buffer. data were collected in -s intervals using a dmi- b microscope (leica, buffalo grove, il, usa) equipped with a . × . na hcx pl fluotar objective and . × coupler (leica c-mount ). the response from syto was recorded using a . -s exposure through an l (gfp) nomarski prism, while the rox reference dye was collected using a -s exposure through a texas red prism. images were collected using a hamamatsu orca-er ccd camera (hamamatsu photonics k.k., hamamatsu city, japan) at gain. temperature was recorded using the built-in features of the tc- controller in ∼ s intervals and correlated to the images via image metadata. in these experiments, we chose to use a microscope, instead of the custom real-time amplification instrument we used previously ( , ) , because the microscope has superior optical properties (greater pixels per partition and lower exposure time requirements) to access higher temporal resolution and enhanced kinetic measurements. the matlab script processes a .txt file with temperature-time data generated from the te tech controller and a tif stack containing -channel images of the lamp and melt curve from the leica microscope. partitions are identified using a custom iterative thresholding algorithm and labels are propagated throughout the tif stack using a custom labeling algorithm. average well intensity is tracked over time to generate lamp curves and plotted against temperature to generate the melt curves. complete details of the script are in the supplementary materials and methods, 'matlab script.' bulk lamp reactions were conducted in l volumes within a well plate on a cfx real-time thermocycler (bio-rad) at buffer conditions and temperatures matching the dlamp reactions. 
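the melt-curve step of the image analysis described above can be illustrated with a short python sketch. it assumes that a per-partition intensity trace has already been paired with the recorded temperatures, and it takes the melting temperature as the temperature of steepest fluorescence loss (the maximum of −dF/dT), which is one common convention. the authors' matlab script is not reproduced here, and the function names, the reference t m and the tolerance are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def melting_temperature(temps: Sequence[float],
                        intensities: Sequence[float]) -> Optional[float]:
    """Return the midpoint of the interval with the steepest intensity drop."""
    if len(temps) != len(intensities) or len(temps) < 2:
        raise ValueError("need at least two paired samples")
    best_slope, best_mid = None, None
    for i in range(1, len(temps)):
        d_temp = temps[i] - temps[i - 1]
        if d_temp <= 0:
            continue  # skip samples that are not strictly increasing in temperature
        slope = -(intensities[i] - intensities[i - 1]) / d_temp
        if best_slope is None or slope > best_slope:
            best_slope, best_mid = slope, 0.5 * (temps[i] + temps[i - 1])
    return best_mid


def classify_partition(tm: float, specific_tm: float, tolerance: float) -> str:
    """Label a partition by comparing its Tm with a reference Tm (hypothetical rule)."""
    return "specific" if abs(tm - specific_tm) <= tolerance else "non-specific"


def synthetic_melt(temps: Sequence[float], tm: float, width: float = 0.5) -> List[float]:
    """Toy sigmoidal melt curve, used only to exercise the functions above."""
    return [1.0 / (1.0 + math.exp((t - tm) / width)) for t in temps]
```

real traces are noisy, so a practical implementation would normally smooth the curve before differentiating; that step is omitted here to keep the sketch short.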
enzymatic digestions of bulk lamp products were conducted using cac i (#r s), hpy ii (#r s), acci (#r s), acii (#sr s), msei (#r s) and hpych iii (#r s) purchased from new england biolabs and were conducted in l reaction volumes containing l enzyme, g dna, in × cut smart buffer and incubated for h at • c. samples were inactivated for h at • c and diluted to ng/l (∼ : ) to run on an agilent tapestation using high sensitivity d screentape (# - ) with ladder (# - ), and d screentape (# - ) with high sensitivity d reagents (# - ). the - ng of amplified dna products were fragmented to an average size of bp with a qsonica q r sonicator (power: %; pulse: s on/ s off; sonication time: min) and libraries were constructed using nebnext ultra™ ii dna library prep kit (neb, #e ) following manufacturer's instructions. briefly, fragmented dna was end-repaired, da-tailed and ligated to nebnext hairpin adaptors (neb, #e ). after ligation, adapters were converted to the 'y' shape by treating with user enzyme and dna fragments were size selected using agencourt ampure xp beads (beckman coulter, #a ) to generate fragment sizes between and bp. adaptor-ligated dna was pcr amplified with five cycles followed by ampure xp bead clean up. libraries were quantified with qubit dsdna hs kit (thermofisher scientific, #q ) and the size distribution was confirmed with high sensitivity dna kit for bioanalyzer (agilent technologies, # ). libraries were sequenced on illumina hiseq in single-read mode with the read length of nt to the sequencing depth of million reads per sample, following manufacturer's instructions. base calls were performed with rta . . followed by conversion to fastq with bcl fastq . . . raw fastq files were first analyzed with fastqc v . . . overrepresented sequences were compared with input primer sequences to find reads consisting of potential products from the lamp reactions.
to verify that all adjoining products were accounted for the fastq files were aligned to the predicted products using bowtie v . . . with global very-sensitive settings. unaligned reads were checked for any remaining possible amplification products. all regions consisting of sequences from multiple primers were tallied by counting the reads with a substring of n = from the end of each primer. one adjoining region between primers contained a random insertion of nucleotides and was analyzed by first extracting all reads containing the primer before and after the random nucleotides. the length and sequence distribution of random inserts was then analyzed from the extracted reads. we first wished to test whether melting temperature (t m ) could be used to separate specific and non-specific amplification in a lamp assay run in bulk. to start, we selected a concentration near the lod where we might observe both specific and non-specific amplification. we used extracted ct genomic dna in the presence of two commercially available polymerases, bst . and bst . , with ct s as the amplification target. at target molecule concentrations of copies per l (cp/l), amplification using bst . polymerase began between - min ( figure a ) and had uniform t m ( figure b ). amplification using bst . polymerase ( figure c ), also yielded amplification from - min; however, we also observed a non-specific amplification at min, defined as having a different t m than the specific amplification events ( figure d ). this indicated bst . could be a useful model for studying non-specific amplification. we observed that early amplifying products corresponded to specific amplification events, and the later products corresponded to non-specific amplification, supporting our prediction that we could use t m as a proxy for sequence identity, as is common with pcr and has been used previously in lamp ( ) ( ) ( ) ( ) ( ) . using bst . 
at low concentrations of target is a useful system to study non-specific amplification. to investigate the role of the concentration of the target on the incidence of non-specific amplification, we performed half-log dilutions of template from to . cp/l. at . cp/l ( figure e and f), only specific amplification occurred ( replicate wells/plate). however, once template concentrations reached cp/l ( figure g and h), non-specific amplification occurred with greater frequency than specific amplification ( of the replicates generated false positives). similarly, for . cp/l ( figure i and j) of the replicates generated false positives. we next ran the same assay in the absence of template (no-template control, ntc) ( figure k and l). even though we did not expect amplification, we observed all reactions amplified. a total of of replicates amplified at a t m of • c, consistent with the t m of non-specific amplification in the presence of template. although it is possible for a reaction to generate multiple different non-specific amplification products, even ones with t m matching to the specific products, the single amplicon observed at • c in the ntc was a contaminant that appeared to have the same sequence as the specific products ( figure a [well f ]). in general, when the specific target was present, it amplified sooner and outcompeted the nonspecific amplification, thereby reducing the number of observations of non-specific amplification. to determine if the non-specific amplification was inherent to the polymerase or a consequence of buffer selection, we conducted additional studies using both bst polymerases (supplementary figure s and table s ). to better understand non-specific amplification in lamp, we investigated the sequence identity of the nonspecific products with high t m using sequencing and gel analysis and compared them with the specific products. the t m of specific amplification differed between the two polymerases tested. 
specific amplification for bst . had a t m of . • c, whereas specific amplification using bst . had a t m of • c, and demonstrated non-specific amplification at t m of • c. the non-specific amplification had identical t m to amplification in absence of template ( figure k and l). despite the specific amplification products of bst . and bst . producing similar gel banding patterns (figure ) and the same sequencing results (see figure b ), they had different t m ( figure b and d, respectively). we determined the difference in t m was due to differences in buffer conditions (supplementary figure s and table s ). in all bulk reactions, we observed non-specific products with high tm. this was surprising because in pcr primer dimers have low tm; moreover, in previous demonstrations of lamp, t m was lower for non-specific compared with specific products ( ) . thus, we investigated the sequence identity of the non-specific product with high t m . we ran the lamp products on a gel and observed that the characteristic pattern of the specific amplification products differed substantially from the banding pattern seen in the high-t m non-specific products ( figure ) . interestingly, the high-t m non-specific product had a ladder pattern resembling that of specific lamp products. to determine the identity of the high-t m non-specific products, we performed next generation sequencing (ngs). we observed that the non-specific products lacked the corresponding target sequence and identified the product as a mixture of full-length fip, bip and their complements, as well as fragments of bip ( figure a) . to confirm the sequence identity of the amplicon, we targeted the fip and bip regions using several restriction endonucleases. digestion of the specific and non-specific products resulted in different banding patterns than the undigested samples, and was consistent with the presence of both fip and bip endonuclease recognition sites within the sequence (supplementary figure s ) . 
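the junction tallies behind these sequencing results can be illustrated with a short python sketch. it is not the authors' pipeline: the function names, the choice to define a junction probe as the last n bases of one element followed directly by the first n bases of the next, and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def junction_probes(elements, n):
    """Build one probe per ordered pair of elements.

    Assumption: a junction is recognized by the last n bases of the
    upstream element followed directly by the first n bases of the
    downstream element.
    """
    return {
        (up, down): up_seq[-n:] + down_seq[:n]
        for up, up_seq in elements.items()
        for down, down_seq in elements.items()
    }


def count_junctions(reads, elements, n):
    """Count the reads that contain each junction probe as a substring."""
    probes = junction_probes(elements, n)
    counts = Counter()
    for read in reads:
        read = read.upper()
        for pair, probe in probes.items():
            if probe in read:
                counts[pair] += 1
    return counts


# toy sequences (hypothetical, not the primers used in the study)
elements = {"A": "ACGTACGGTT", "B": "GGCATTCAGA"}
reads = ["TTACGTACGGTTGGCATTCAGATT", "GGCATTCAGAACGTACGGTT"]
counts = count_junctions(reads, elements, n=4)
```

to tally both strands, the reverse complement of each primer can be added to the element dictionary with `reverse_complement` before counting.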
specific amplification products were % gc; non-specific amplification products were % gc. we hypothesize a mechanism for the formation of the non-specific product with high t m originating as a consequence of interactions of the bst polymerase and lamp inner primers. other potential mechanisms include lima ( ) and uima ( ) , but these are inconsistent with our sequencing results, which show nearly equal reads of the forward and reverse strand as measured by counting the complementary sequences between each junction. our proposed mechanism requires properties that have been observed with bst enzymes: a strand-displacing polymerase lacking - exonuclease activity (common to polymerases from thermophilic bacteria) ( , ) , template switching ability to allow synthesis across a discontinuous template ( ), terminal transferase activity, or the ability to perform non-templated synthesis ( , ( ) ( ) . briefly, the non-specific product likely arises from extension of a low probability homo-dimerization of the backward inner primer (bip), followed by elongation across a discontinuous junction ('template switching') to form a double-stranded product incorporating forward inner primer (fip). through breathing of the molecule, the of one strand may form a second hairpin and amplify. some of these amplification events incorporate several random nucleotides via terminal nucleotidyl transferase activity resulting in a pool of hairpins with randomers. sequences with complementary randomers are selected in vitro to amplify. the double-stranded product of this amplification can, through intramolecular hydrogen bonding, form two dumbbell-like structures and amplify in a fashion similar to the standard lamp mechanism, but primed by bip. repetitive cycles of self-priming and hairpin priming by bip result in numerous sequences with complementarity and the possibility of multiple replication loci within a single molecule.
this process can give rise to very long amplicons, and even a branched, mesh-like network from the multimeric sequences annealing to their neighbors or in a self-complementary fashion. a simplified version of this mechanism, annotated with sequencing data, can be found in supplementary figure s .

figure . quantification of junctions using next-generation sequencing of select chlamydia trachomatis amplification products from bulk reactions. non-specific amplification from the no-template control using bst . (a), including amplification of a specific target contamination (well f ) corresponding to figure k and l. amplification in the presence of cp/l template (b), using bst . (wells a -a ) corresponding to figure a and b, and bst . (wells c -c ) corresponding to figure c and d. non-specific amplification in the presence of cp/l template and bst . (well c ) corresponding to figure c and d. for a complete list of abbreviations used in this figure, see supplementary table s .

composite image of select chlamydia trachomatis amplification products from a bulk reaction. products were collected using d tape on an agilent tapestation. amplification in the presence of cp/l template using bst . (lanes a -a ) corresponding to figure a and b, and bst . (lanes c -c , c ) corresponding to figure c and d. non-specific amplification in the no-template control (ntc; lanes e -h ) corresponds to figure k and l. contrast was determined using the automatic 'scale to sample' feature in the agilent tapestation analysis software.

in more detail, a potential mechanism of formation of non-specific products is as follows: initially, a double-stranded amplicon is generated by homo-dimerization of bip, and extension of the homodimer to produce a partial reverse complement of bip (prcbip) (figure - ). bst polymerase is highly prone to mismatched extension ( ) , and the two base pairs of cg provide a sufficient anchoring in the to start elongation.
multiple primer analyzer (thermofisher) does not identify the bip homodimer, unless maximum sensitivity is used. alternatively, bip-prcbip product may arise from a single stranded bip-hairpin, as has been observed by others ( ) , although unafold (idt) does not predict the formation of the hairpin for this primer. these structures may not need to be abundant at equilibrium, but as long as they are extended by the polymerase, the product will be stabilized and will accumulate. upon accumulation of the bip-prcbip construct, the reverse complement of fip (rcfip) is incorporated by template switching (figure - ). the of fip is within spatial proximity of the homo-bip sequence due to microhomology of to end of the double-stranded sequence coupled with rapid breathing of two base pairs of ta. this allows temporary insertion and hybridization of fip with the double-stranded bip-prcbip sequence (figure - ) . when the polymerase is also in proximity of this reaction, fip slips out of the junction, and the polymerase elongates across the discontinuous junction ( , ) templated by fip (figure - ) . we confirmed the interaction of fip and bip produced the high-t m non-specific amplification, and that elimination of microhomology could significantly reduce high-t m non-specific amplification (supplementary figures s - and tables s - ). after elongation, the fip which has served as template, is poised to prime in the opposite direction (figure - ) . this either displaces the initial bip mispairing (bip*) or opens the hairpin, resulting in a double-stranded bip-prcbip-fip product (figure - ). this three part junction is observed as a complete product in ngs data. breathing of double-stranded bip-prcbip-fip is prone to formation of an intramolecular self-priming hairpin of rcbip-pbip ( figure . with each amplification, and re-prime by fip, a single product is generated. 
this process of hairpin accumulation would cause the linear 'rising' baseline observed by other researchers ( ) . within this pool of linear amplifying products, the bst enzyme will randomly incorporate additional nucleotides at the end of fip-pbip-rcbip-rcfip via terminal transferase activity ( figure - ). our sequencing methods are unable to observe a fip-randomer hairpin because adapter ligation requires double-stranded products. this pool of hairpins with random sequences will accumulate until lamp selects for sequences that amplify by sharing complementary 'toe holds' (figure - ) . much like in vitro evolution, those sequences with the highest probability of amplification are selected ( ) . the lack of a thermal gating mechanism in lamp and lack of - exonuclease activity make the amplification reaction especially prone to in vitro evolution of self-amplifying products. when considered in this light, it is unsurprising that non-specific amplification could arise from mechanisms similar to those of the specific products. within a given bulk reaction, variation in randomer sequence length and identity was low. however, between different samples, randomer sequences of multiple lengths and identities were observed. these two results further suggest that in bulk reactions amplification occurs from one or a few sequences (supplementary table groups s - ) . elongation from the randomer overhang results in double-stranded products, leading to dumbbell structures, and lamp-like amplification. first, elongation of hairpins with complementary randomer toe holds produces a dimer of fip-bip-prcbip-rcfip coupled through the randomer (figure - ) . a second priming of the rcbip-pbip hairpin by bip and subsequent elongation creates a new double-stranded product and reveals a self-priming hairpin of the original strand (figure - ) . as previously, upon elongation, the sequence primed by bip is displaced (figure - ) .
simultaneously, the self-priming event turns the fip-bip-prcbip trimer to a pentamer, which may continue to be amplified by bip. the released sequence (figure - ) is again self-priming, and whose product is equivalent to figure - to restart the cycle. further, amplified hairpins may, in addition to bip priming of the hairpin, duplicate through self-priming by breathing and formation of a rcbip-pbip hairpin (figure - ) . the products of these reactions are capable of forming a branched, mesh-like network resulting in the observed high temperature melting. products may experience random internal priming by through hairpin formation (e.g. . furthermore, in addition to intramolecular bonding, the highly repetitive nature of these products allows for melting of internal fragments, which reanneal to self in a different conformation, or a neighboring strand. though the initial steps of generating a double-stranded hairpin will be unique to our particular primer set, once a seed is generated, the processes of template switching and terminal transferase activity should be a general phenomenon associated with non-specific amplification of thermophilic polymerase resulting in exponential amplification. as evidence, when the mechanism of seed formation is disrupted through elimination of the microhomology, amplicons with high t m still occur, albeit with lower frequency and delayed occurrence (supplementary figures s - and tables s - ). template switching and non-template synthesis are × slower than template extension ( ). however, once the self-amplifying products are selected, the reaction follows standard exponential lamp enrichment. thus, accumulation of a sufficient pool of randomers may take time, but still result in a delayed bulk exponential amplification event. 
furthermore, should a hairpin with attached randomer form, it is possible that the rising baseline, attributed to hairpin formation ( ) , may also, through in vitro selection of the products, result in spontaneous exponential amplification. to study specific and non-specific amplification events at the digital single-molecule level, we developed a new approach that enabled hrm analysis (obtaining 'melt curves') to be performed on each partition. we used a commercially available microfluidic chip with , partitions and a previously published open-source dlamp method accessible to most standard laboratories ( ) with the following improvements: incorporation of an off-the-shelf thermoelectric unit to both heat and cool the chips, and an enhanced matlab script to allow for multicolor tracking. we used the temperature-independent fluorophore rox to track each partition's location and the dsdna intercalating fluorophore syto to follow amplification and hybridization status. this two-channel approach is required to follow a partition through both amplification and the entirety of the hrm when fluorescence from syto is lost. as an illustration of the capabilities of our approach, we first used real-time dlamp to study the kinetic parameters of individual reactions and we used t m to classify reaction outcome ( figure ). using real-time dlamp, we followed individual partitions as they amplified as a function of time ( figure a ) and then by temperature as they went through hrm ( figure b ). real-time imaging of individual partitions enables us to reconstruct the standard amplification curves of intensity for each partition as a function of time ( figure c ), and plotting the fluorescence intensity as a function of temperature yields an hrm trace ( figure d ); the negative derivative plot ( figure e ) of this melt trace is the standard melt curve. analogous to bulk measurements, the standard melt curve is used to classify reactions as specific or non-specific.
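the step from an hrm trace to a melt curve and a t m can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the matlab script used in the study: the function names are hypothetical, the derivative is a simple finite difference between consecutive readings, t m is taken as the midpoint of the interval with the steepest loss of fluorescence, and the trace values are invented.

```python
def melt_curve(temps, fluor):
    """Return interval midpoints and -dF/dT from an HRM trace.

    Finite differences are taken between consecutive readings, so
    unevenly spaced temperature steps are handled.
    """
    mids, neg_deriv = [], []
    for i in range(len(temps) - 1):
        d_temp = temps[i + 1] - temps[i]
        mids.append((temps[i] + temps[i + 1]) / 2)
        neg_deriv.append(-(fluor[i + 1] - fluor[i]) / d_temp)
    return mids, neg_deriv


def melting_temperature(temps, fluor):
    """Estimate Tm as the temperature at the melt-curve peak."""
    mids, neg_deriv = melt_curve(temps, fluor)
    peak = max(range(len(neg_deriv)), key=neg_deriv.__getitem__)
    return mids[peak]


# invented trace for one partition (temperatures in deg C, intensity in RFU)
temps = [80, 82, 84, 86, 88, 90]
fluor = [100, 98, 95, 40, 10, 8]
tm = melting_temperature(temps, fluor)
```

in practice the trace would usually be smoothed before differentiation; that step is omitted here to keep the sketch short.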
we used these classifications to identify important patterns in the kinetics of each type of amplification ( figure f -h). we next used real-time dlamp with hrm to determine whether differences in time to positive (ttp) were due to a difference in amplification initiation or in rate. we expect this information would be valuable for elucidating whether the molecules that lead to bulk amplification are the ones that are first to initiate or the ones that initiate with the fastest rates. we found that ttp can be heterogeneous while t m is constant ( . ± . min with . ± . • c), indicating that the same product may initiate at different times ( figure f ). this is consistent with our knowledge of the stochastic initiation of lamp ( , ( ) ( ) . further, we observed some variability in the maximum rate despite similar t m ( . ± . rfu/ s, with . ± . • c t m ), which indicates the same product may amplify at different velocities ( figure g ). in general, we observed that maximum rate often corresponded to the point when the reaction first began to amplify. by plotting rate as a function of ttp ( figure h ) we observed little fluctuation in rate across a range of different ttps ( . ± . rfu/ s with . ± . min), indicating that the differences in ttp are mostly delays in the initiation of amplification rather than differences in the rate of amplification. the use of real-time data revealed heterogeneity in the timing of amplification initiation and the amplification rate, but homogeneity in t m , indicating stochasticity in initiation of amplification. in some cases, outlier data points for rate occurred. to determine whether removing these outliers impacted the distribution of enzymatic rates, we performed a non-parametric test (supplementary figure s ) and found no significant differences in enzymatic rates when these outliers were excluded. 
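the two kinetic parameters compared above can be extracted from a single partition's amplification curve as sketched below in python. the definitions are assumptions for illustration: ttp is taken as the first crossing of a fixed intensity threshold, linearly interpolated between frames, and maximum rate as the largest slope between consecutive frames. the threshold, the function names and the curve are hypothetical.

```python
def time_to_positive(times, intensity, threshold):
    """Return the interpolated time at which intensity first reaches
    the threshold, or None if the partition never crosses it."""
    for i, value in enumerate(intensity):
        if value >= threshold:
            if i == 0:
                return times[0]
            prev = intensity[i - 1]
            frac = (threshold - prev) / (value - prev)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None


def max_rate(times, intensity):
    """Return the largest slope between consecutive frames."""
    return max(
        (intensity[i + 1] - intensity[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    )


# invented curve for one partition (time in s, intensity in RFU)
times = [0, 30, 60, 90, 120]
intensity = [10, 10, 50, 250, 300]
ttp = time_to_positive(times, intensity, threshold=100)
rate = max_rate(times, intensity)
```

with both values per partition, rate can be plotted against ttp to ask whether late partitions are slow or merely delayed.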
we next asked whether we could observe in dlamp the same pattern of high-t m non-specific amplification and low-t m specific amplification that we observed in bulk. we performed dlamp using three chips containing template, and three chips lacking template (ntc) and observed ∼ , partitions for each condition. although partitions are possible, not all partitions filled nor can all partitions be tracked for the full duration of an experiment. for the melt curve, fluorescence readings were taken at • c increments from - • c; and at . • c increments from - • c to give higher resolution. due to slight differences in the timing between the heating element and the image collection, some chips were observed at slightly different temperatures (< . • c). our approach enabled us to differentiate specific and non-specific amplification events using hrm. when using the polymerase bst . and template ( figure a , blue points), we observed a large band of amplification in the temperature range . - . • c, in agreement with the t m observed when performing the reaction in bulk (figure ). in contrast, the ntc ( figure a , red points) had very few amplification events in that temperature range ( out of partitions). hence, we defined events that occurred in the t m range . - . • c as true positives (specific amplification events) and we defined those that occurred outside this range (in both the ntc and in the presence of template) as false positives (non-specific amplification events). when using the polymerase bst . , we observed a large band of amplification from . to . • c in the presence of template ( figure b , blue points) that did not correspond with amplification in the ntc ( figure b , red points) so we defined these as specific amplification events. as with bulk measurements, we determined the difference in t m between specific amplification events between bst . and bst . was due to the difference in buffer composition (supplementary figure s and table s ).
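the t m window rule defined above lends itself to a simple per-partition classifier. the python sketch below is illustrative only: the window bounds and t m values are placeholders rather than the values used in the study, the function names are hypothetical, and a partition with no t m is assumed to mean that no amplification was detected.

```python
def classify_partition(tm, window):
    """Label a partition by its Tm relative to the specific window."""
    if tm is None:
        return "negative"
    low, high = window
    return "specific" if low <= tm <= high else "non-specific"


def tally(tms, window):
    """Count partitions in each class for one chip or condition."""
    counts = {"specific": 0, "non-specific": 0, "negative": 0}
    for tm in tms:
        counts[classify_partition(tm, window)] += 1
    return counts


# placeholder window and Tm values (deg C), for illustration only
window = (85.0, 88.0)
tms = [86.1, 91.5, None, 87.9, 80.2]
counts = tally(tms, window)
```

running the same tally on the ntc chips gives the false-positive count inside and outside the window, which is how the window itself can be checked.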
during these experiments, we observed two common patterns. first, the t m for specific amplification events was - • c lower in digital compared with bulk measurements. we attribute this difference to temperature calibration; the thermocycler is calibrated to the liquid temperature, whereas the thermoelectric element measures the temperature of the heating element. second, false positives in the ntc had predominantly high t m , which we attribute to the non-specific product we identified in the bulk reactions. we also observed differences in total amplification events between the two polymerases. assays with bst . resulted in substantially more non-specific amplification than those with bst . and we confirmed this was not an issue with buffer selection (supplementary figure s and table s ). after min, next, we tested whether ttp is different for specific and non-specific amplification. because lamp follows a 'winner-takes-all' format, frequent and early non-specific amplification events may dominate bulk amplification. in general, for both bst . and bst . , specific amplification had earlier ttp than non-specific amplification, although there was some overlap, mostly > . • c ( figure a and b). we were able to distinguish the clustering of high-t m non-specific products separately from specific amplification using a threshold of . - . • c ( figure c and supplementary figure s a ). we illustrate each partition with only partial opacity so that when false positives in the ntc (red) overlap with false positives in the template-containing sample (blue), the overlap of multiple colors appears purple ( figure d ). color intensity indicates the abundance of partitions at a given ttp and temperature. to further illustrate how this approach can be used to differentiate specific and non-specific amplification, we next selected a region where both specific and non-specific products were observed. for bst . 
, we were able to distinguish the clustering of high-t m non-specific products separately from specific amplification using the threshold of . . • c ( figure e ) and we observed better separation of specific and non-specific amplification than with bst . ( figure f and supplementary figure s b) . both enzymes had highly variable ttp, which we have observed previously ( ) and attribute to stochastic initiation of lamp. bst . had both earlier specific amplification and later non-specific amplification than bst . . bst . reactions containing template generally started at min, whereas non-specific amplification began at ∼ min. in contrast, bst . reactions containing template began at . min and non-specific amplification began at ∼ min. next we asked whether there is a difference between the maximum rates of specific and non-specific amplification. previously, we demonstrated that rate could be used to correct for some non-specific amplification using escherichia coli s primers ( ), so we wished to test whether we could use maximum rate as a way to differentiate specific and nonspecific amplification. generally, specific and non-specific amplification reactions did not have the same maximum rate. for bst . , non-specific amplification tended to have a slower max rate than specific amplification, although there was some overlap (figure g ). at high t m , the clustering of non-specific amplification in both the presence of template and in the ntc were observed at > . • c and below ∼ rfu/ s ( figure h ). for bst . , although there was substantial overlap, we again observed that nonspecific amplification tended to have slower maximum rate than specific amplification ( figure i ). examining the high-t m amplification events, non-specific amplification collects above . • c and has maximum rate extending out to rfu/ s ( figure j ). for both enzymes, overlap between specific and non-specific amplification was similar and specific amplification tended to be faster. 
however, the maximum rate of specific amplification between the two enzymes differed; bst . had a maximum rate of rfu/ s, whereas bst . did not exceed rfu/ s. bst . performing faster than bst . is consistent with our previous observations using an e. coli s primer set ( ) . additionally, the maximum rate of non-specific amplification in bst . tended to be lower than non-specific amplification in bst . ( and rfu/ s, respectively). consequently, the extent of overlap of specific and non-specific amplification was greater for bst . than bst . . we observed an unexpected relationship between the final intensity of each partition and the maximum rate of that partition. after min of amplification, a partition should theoretically reach a fluorescence maximum whereby all reagents are consumed, amplification plateaus and thus the final intensity would be independent of the maximum rate of amplification. however, surprisingly, we observed a general scaling between the maximum rate and the final intensity of the partition. for bst . , all amplification in the ntc has final intensity < rfu and maximum rate < . rfu/ s. in the presence of template, . % of non-specific amplification and . % of specific amplification had final intensity and maximum rate less than these thresholds. using bst . , . % of specific amplification fell within these thresholds. thus, false positives were generally dimmer and had slower maximum rates than most true-positive events. when examining the brightest partitions, bst . ( figure k) and bst . (figure l ) exhibit a similar maximal final intensity near rfu. these maxima are also surprising, considering our -bit camera is capable of imaging up to rfu (the detector was not at saturation).
we suspect that this maximum corresponds to consumption of one of the reagents, whereas scaling between maximum rate and final intensity occurs when stochastically initiated reactions have not completely amplified, resulting in partitions dimmer than the maximum and proportional to their rate of amplification. during these dlamp experiments, we also observed a relationship between maximum rate and ttp. in bulk reactions, the first and fastest amplification event determines the reaction outcome by consuming all of the reagents. thus, we hypothesized that reaction conditions that promote fast and early amplification in the ntc would lead to a high false-positive rate in bulk and thus misidentification of amplification. in both bst . ( figure m) and bst . ( figure n ) we observed a general trend of fast amplification events occurring earlier, and slow events occurring later. in bst . , we observed greater heterogeneity in ttp and rate than in bst . . furthermore, non-specific amplicons in the ntc tended to produce slower and later amplification events. occasional outliers occurred at both fast and early times. next, to explicitly test whether fast and early events correspond to specific amplification, we analyzed the relationship between a partition's ttp, its maximum rate, and t m . in the first min of amplification, we observed six non-specific amplification events in bst . (four in the presence of template; two in the ntc; figure o ), and we observed non-specific events in bst . ( in the presence of template; three in the ntc; figure p ). for both polymerases, we were able to distinguish the rare, fast and early non-specific amplicons from true positives. for bst . , these non-specific amplifications were slower than the fastest true positives, and occurred at similar times. in contrast, for bst . , the earliest amplification events were false positives and tended to have similar rates to the true positives.
we hypothesize that in bulk reactions, the fast and early non-specific amplification events (as seen in bst . , figure p ) lead to non-specific measurements, whereas non-specific amplification that coincides with specific amplification, but proceeds at a slower rate (as seen in bst . , figure o ), would still produce specific amplification in bulk. this hypothesis is corroborated by sequencing of bulk lamp reactions (figure ) . though individual bulk reactions may be assigned a homogeneous label as 'true positive' or 'false positive' by t m , sequencing identifies multiple products within each reaction and the t m is determined by the dominant product. for example, we observed a 'false positive' by t m ( figure c and d) , despite the presence of template. the sequencing of this product contained non-specific product sequences, similar to those observed in the ntc, at high prevalence, as well as the specific target sequences in low abundance (figure [well c ]) . similarly, though 'true positive' is assigned to other bulk reactions in the presence of template, the non-specific products are still observed at low abundance (e.g. figure [well f ] ). further, a greater number of non-specific partitions in digital using bst . than bst . is correlated with a greater number of non-specific reads despite the presence of template in the sequencing data (comparing figures a-b and b group a versus c) . we hypothesize that the combination of real-time parameters (such as rate and ttp), combined with the ability of digital assays to yield probabilities and to assign reaction identity through hrm, may ultimately help researchers optimize bulk reaction conditions. to better visualize how ttp, max rate, final intensity and t m data are interrelated, we next plotted these data in a four-dimensional ( d) space ( figure q -r, supplementary videos s and ).
we observed that among all partitions, regardless of whether the product was specific or non-specific amplification, fluorescence was brighter when amplification occurred earlier and faster. this was true for both polymerases. additionally, we observed two types of non-specific amplification. the first type of non-specific amplification was the traditional 'primer-dimer' cloud, which is characterized by a low t m , low final fluorescence intensity, a slow max rate and a generally late ttp. the second type of non-specific cloud matches only in its high t m , and spans a wide range of rates, ttp and final intensities. the high-t m non-specific amplification occurs with greater frequency than the low-t m non-specific amplification. the major differences between the polymerases can also be resolved with this visualization. the number of non-specific amplification events is much lower for bst . than for bst . . further, these non-specific events in bst . never achieve the same fluorescence intensity or maximum rate as with bst . . we include the d representation as part of our matlab code, and as videos in the supplementary data.
we next asked whether a combination of digital real-time parameters, in conjunction with t m , could be used to improve the performance (lod) of a dlamp assay. for any given assay, there is a large combination of possible parameters (e.g. amplification rate, ttp, fluorescence intensity) that are used to determine when a digital partition is 'on' or 'off.' use of these parameters and selection of thresholds will influence assay performance (analytical specificity and sensitivity). assay performance is affected by amplification time and by the combination of parameters used to process the data, which together impact lod, the probability of detecting a molecule (efficiency), and the clinical sensitivity and specificity.
having established that there is a direct relationship between t m , sequence identity and structure, we determined that t m allows us to explicitly differentiate specific and non-specific amplification in dlamp, and thus, differentiate true from false positives. we foresee two separate situations of dlamp analysis using hrm. first, where hrm is not incorporated in the final assay, but is used during assay development. second, the ideal situation for quantitative performance, where hrm is incorporated into the final lamp assay. we expect the first group of lamp assays to exist because collecting t m data adds time to an assay and requires more advanced hardware to run. this may not be ideal in situations requiring more rapid diagnostics or in limited-resource and field settings where the hardware may be impractical. nonetheless, running hrm is still useful during lamp assay development to select the optimal combination of parameters for end-point or real-time lamp without using t m . hence, t m allows one to identify the correct combination of assay parameters, and how to analyze the data for best lod. lod is a key parameter when optimizing clinical assays because pathogen load is low in many infections (e.g. in blood infections or asymptomatic sexually transmitted infections). we thus illustrated the optimization of parameters using improved lod as the selection criterion.
the combination of real-time dlamp with hrm can uniquely define lod because of the combination of digital and t m . unlike bulk assays, which require a concentration titration curve (and are thus dependent on integrated signal intensity and enzymatic turnover), digital assays only require that an event (target molecule) is or is not observed and can be counted relative to the partition volume ( , ) . the minimum lod for any digital assay corresponds to one target or amplification event per partition volume.
hence, we can define lod from a single concentration point by equation ( ), where c true is the concentration of target molecules loaded, as determined by ddpcr counts, in copies per microliter, n true is the number of true-positive (specific) amplification events observed on a chip, n false is the number of non-specific amplification events observed on a chip and n ci is the number of expected molecules for a given confidence interval. in this equation, n true and n false are chip-specific, and take into account the total volume of the chip, the number of partitions and the volume of partitions. furthermore, in equation ( ), amplification efficiency is implicitly taken into account via the n true parameter (in other words, for a less efficient amplification process, a given c true on a given chip would lead to a lower value of n true ). for simplicity, equation ( ) makes the assumption that the measurements are performed at sufficiently low concentrations (as is typical for lod experiments) that only a very small fraction of occupied partitions contain more than one molecule and therefore there is a linear relationship between c true and n true . the concentration loaded, c true , generates n total counts of both true- and false-positive events. we can divide this concentration by the minimal number of counts needed to identify a specific amplification event and define this as the lod. the minimum number of counts needed to guarantee that a specific amplification event is observed is determined by n true , n false and n ci . n true and n false are determined empirically, whereas n ci is calculated from the poisson equation as the expected number of molecules that will yield at least one detection event for a given confidence interval. if we require a % ci to observe a true positive across an entire chip, the minimum number of counted events is three (i.e. % of the time, the poisson expected loading of three target molecules will still measure zero events.)
for a % ci, n ci would be four counts. hence, all true-positive counts in excess of n ci are counts observed above the lod. uncertainty in the lod is given by supplementary equations s - . counting only true positives does not account for interference from false positives. in order to meet our minimum counts for detection, our equation must remove false counts (n false ). the generally accepted procedure for lod calculations with a . % ci is to assign n true only when the counts exceed the background plus three standard deviations of the background (n false + 3 × √ n false ). we approximate the variation in the background using the counting error, taken as three times the square root of the number of false-positive events counted, and subtract those counts from the true-positive counts to yield the equation.
using this calculation of lod to optimize an assay has three limitations. first, equation ( ) fails to produce a number with physical meaning when the number of true-positive events (n true ) is less than the number of false-positive events plus three times the standard deviation in false amplification (n false + 3 × √ n false ). in this case, it is not possible to conclusively observe a true positive, and the lod becomes irrelevant. second, equation ( ) gives an absolute lod. the numerator (concentration of template molecules loaded on the chip, as determined by pcr) is corrected for the probability of observing a molecule amplify (efficiency) by the true-positive counts. n false accounts for the non-specific amplification, and n ci accounts for the poisson probability associated with loading a target molecule. third, this equation is specific to digital assays.
we first sought to demonstrate the selection of optimal parameters for situations where hrm is not incorporated into the final assay. using this process, one can pick any threshold and use t m to determine the optimal trade-off between true and false positives.
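the lod calculation described above can be illustrated with a minimal python sketch. the printed form of equation ( ) is not reproduced in this text, so the expression used here, c true × n ci / (n true − n false − 3 × √ n false ), is an assumption reconstructed from the verbal description (divide the loaded concentration by the usable true-positive counts, expressed in units of n ci ). the function names `n_ci` and `lod` are hypothetical and are not part of the authors' matlab script.

```python
import math


def n_ci(confidence: float) -> int:
    """Smallest whole Poisson mean for which the probability of seeing
    zero events, exp(-mean), is no larger than 1 - confidence."""
    return math.ceil(-math.log(1.0 - confidence))


def lod(c_true: float, n_true: int, n_false: int, confidence: float = 0.95):
    """Limit of detection from a single concentration point.

    ASSUMED form (reconstructed from the text, not copied from the paper):
        lod = c_true * n_ci / (n_true - n_false - 3 * sqrt(n_false))
    Returns None when true positives do not exceed the background plus
    three counting-error standard deviations, where the text says the
    lod has no physical meaning.
    """
    usable = n_true - n_false - 3.0 * math.sqrt(n_false)
    if usable <= 0:
        return None
    return c_true * n_ci(confidence) / usable


if __name__ == "__main__":
    # Illustrative numbers only; they are not data from the study.
    print(n_ci(0.95), n_ci(0.98))
    print(lod(c_true=10.0, n_true=30, n_false=4))
```

under this assumed form, a chip with no false positives reduces to c true × n ci / n true , which matches the statement that true-positive counts in excess of n ci are counts observed above the lod.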
all initial experiments testing the utility of lod, juxtaposed against receiver operating characteristic (roc) curves, to identify optimal parameters were done using bst . . we began by determining the optimal thresholds for max rate, fluorescence intensity, and amplification time. we demonstrate optimization of all three parameters, using t m as the arbiter, to illustrate the utility of our method. we tested the use of roc curves (commonly used to indicate clinical sensitivity and specificity) to compare the performance in response to a given parameter. roc curves provide a visual representation of the ability to distinguish between a true-positive and false-positive event, as a function of a given threshold, but can be difficult to use for optimal selection of lod. roc curves show the fractions of true and false positives, where the true-positive fraction is the number of true positives at a given threshold out of the total number of true positives observed by t m ; and the false-positive fraction is the number of false positives counted at the given threshold, divided by the total number of false positives observed by t m . a perfect classifying test will yield the largest true-positive fraction and smallest false-positive fraction. when plotting the roc curve for maximum rate (supplementary figure s a ), we observed that rate initially performs very well for eliminating false positives (the false-positive fraction is very small for very high rates). however, as the digital threshold (analogous to roc 'cut-point') for rate decreases, a greater number of both false and true-positive values are counted. closer examination of this range of thresholds (supplementary figure s b) emphasizes the youden index at . true-positive fraction and . false-positive fraction as a possible choice for optimum threshold, although the assay performance in terms of lod is unclear.
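the roc construction described above can be sketched as follows. this is a minimal illustration, assuming each partition has already been labeled true or false by t m and that a partition is counted as 'on' when its value (for example, max rate) is at or above the threshold. the function names are hypothetical, and the youden index is taken in its standard form, true-positive fraction minus false-positive fraction.

```python
def roc_points(true_values, false_values, thresholds):
    """Return (threshold, true_fraction, false_fraction) for each threshold.

    true_values / false_values hold one measurement (e.g. max rate) per
    partition classified as specific / non-specific by t_m.
    """
    points = []
    for t in thresholds:
        tpf = sum(v >= t for v in true_values) / len(true_values)
        fpf = sum(v >= t for v in false_values) / len(false_values)
        points.append((t, tpf, fpf))
    return points


def youden_optimum(points):
    """Pick the point that maximizes J = true_fraction - false_fraction."""
    return max(points, key=lambda p: p[1] - p[2])


if __name__ == "__main__":
    # Illustrative values only; they are not data from the study.
    true_rates = [10, 12, 15, 20]
    false_rates = [2, 3, 11, 4]
    pts = roc_points(true_rates, false_rates, thresholds=[1, 5, 11, 16])
    print(youden_optimum(pts))
```

the youden point balances the two fractions but carries no information about absolute counts, which is one reason the text turns to lod as the selection criterion.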
the choice for optimal final-intensity threshold is even less clear with the roc curve ( supplementary figure s c) , as the roc curves do not give clear indication of the optimal lod (the roc curve is a gentle concave slope). even relatively high fluorescence thresholds do not give indications of the optimal cut-point (supplementary figure s d) . filtering using lod revealed a clear optimum. we plot the total number of events for both true and false positives and lod as a function of maximum rate ( figure a ). the lod curve revealed a clear minimum, corresponding to the optimal cut-point using rate. selecting the threshold of . rfu/ s generated an lod of . ± . cp/l. similarly, plotting lod against final intensity resulted in a clear minimum, despite the histogram appearing as a continuum and the cut-point being thus ambiguous ( figure b ). using final intensity, an lod of . ± . cp/l can be achieved at rfu. the roc curve for ttp presented a narrow range of thresholds, with ∼ % true-positive fraction and % false-positive fraction, although the precise optimal threshold was not obvious (supplementary figure s e) . to refine this threshold, we plotted the lod and the cumulative counts as a function of time in both linear ( figure c ) and logarithmic scales ( figure d ). assays employing hrm only during the development of the assay can improve the lod of the final assays by making an informed choice of the optimum threshold. the lod decreases (blue curve) as the true positives begin to amplify (blue dashed) and increases as the false positives amplify (red dashed). the minimum for this system occurs at min and . ± cp/l, striking a balance between allowing many true positives to amplify and only a small number of false positives to occur ( . % true-positive fraction and . % false-positive fraction) and is clearly defined using the linear scale ( figure c ).
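the search for the lod minimum as a function of assay time can be sketched as below. this is a hedged illustration, not the authors' matlab implementation: it assumes each partition's ttp and its t m classification are known, counts events cumulatively up to each candidate stop time, and applies the same assumed lod form, c true × n ci / (n true − n false − 3 × √ n false ). the function names and the default n ci of three are assumptions for the example.

```python
import math


def lod_vs_time(c_true, true_ttp, false_ttp, times, n_ci=3):
    """Return [(time, lod_or_None)] using events with ttp <= time.

    The lod expression is an ASSUMED reconstruction of the paper's
    equation; None marks times where true positives do not exceed the
    background plus three counting-error standard deviations.
    """
    curve = []
    for t in times:
        n_true = sum(x <= t for x in true_ttp)
        n_false = sum(x <= t for x in false_ttp)
        usable = n_true - n_false - 3.0 * math.sqrt(n_false)
        curve.append((t, c_true * n_ci / usable if usable > 0 else None))
    return curve


def best_time(curve):
    """Return the (time, lod) pair with the lowest defined lod, if any."""
    valid = [p for p in curve if p[1] is not None]
    return min(valid, key=lambda p: p[1]) if valid else None


if __name__ == "__main__":
    # Illustrative ttp values in minutes; they are not data from the study.
    true_ttp = [8, 9, 9, 10, 10, 10, 11, 12, 13, 14]
    false_ttp = [15, 20, 25, 30]
    curve = lod_vs_time(10.0, true_ttp, false_ttp, times=[5, 10, 14, 20, 30])
    print(best_time(curve))
```

in this toy example the lod falls while true positives accumulate and rises again once late false positives are counted, mirroring the shape described for the linear-scale plot. the same scan could be run over a rate or intensity threshold by swapping the comparison.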
plotting of lod on the logarithmic scale ( figure d ) emphasizes that improperly selecting a threshold can result in several orders of magnitude loss in assay performance (for example, stopping the assay too early or allowing the assay to run for too long). although dlamp is robust to perturbations, selecting the appropriate duration for amplification is important. in contrast, assays using hrm as part of the final readout can distinguish false positives from the true positives and improve lod further by excluding non-specific amplification from the analysis. in some instances, an ntc may incorrectly identify partitions as true positives by t m (black dashed). we incorporate these events as non-specific amplification in the case hrm is used in the final readout. if non-specific amplification is eliminated, the assay lod ( figure c and f, black solid) continues to improve with time, and is only dependent on the stochastic probability that a true positive will initiate and amplify. in this scenario, there is no penalty for allowing the assay to amplify for extended periods of time, and the lod equation simplifies because the terms involving n false vanish.
additionally, there is no limitation on the number of parameters that can be used to identify the optimal lod. using multiple parameters to filter the data may be useful for individuals not employing hrm in the final assay or in assays only employing end-point measurement (e.g. an assay without real-time measurements will be unable to generate data on rate, but will still benefit from selecting optimal assay time and fluorescence threshold). as a demonstration, we filtered first by optimal ttp, then for the optimum of a second parameter. in this case, we selected the optimal ttp of min, and scanned for optimal fluorescence threshold. we plotted lod as a function of fluorescence threshold and determined that the optimal fluorescence threshold at min would be rfu and correspond to an lod of . ± . cp/l ( figure e ).
do filter parameters exhibit the same lod minima when using bst . , as they did for bst . ? bst . had much lower non-specific background than bst . , and could behave similarly or may behave differently. first, does the roc curve for ttp display a clear optimum? similar to the ttp roc for bst . (supplementary figure s e ), the ttp roc for bst . has a concave slope, making choice of the optimum a matter of computation (supplementary figure s f) . we can visually estimate the balance of true and false-positive fraction in the range of % true and % false. similar curves for max rate and final intensity could be generated but are not shown here. second, is there an advantage to using hrm in the final assay with bst . ? to answer this question, we plot lod and the cumulative counts of true and false positives as a function of time for bst . ( figure f ). similar to bst . , we observe that lod improves rapidly as true-positive events are counted. however, unlike bst . , the non-specific amplification events are few and their presence does not have an impact on lod. thus, when using bst . , the curves representing lod with or without hrm in the final assay overlay, indicating that using hrm in the final assay has no additional benefit. furthermore, the continuously decreasing lod with time for either case indicates that use of roc curves to determine an optimum can be misleading. while the roc implies that an optimum exists, the false-positive incidence is rare enough that a ttp optimum selected by lod does not exist. hence, assay developers may select assay time based on requirements other than lod.
we next assessed whether we could use hrm to compare the performance of the two polymerases, to see which one would give the best lod and which combination of hardware components would give the optimum assay performance (figure g ). for both polymerases we observed a similar, rapid decrease in lod in the initial moments as true-positive events are detected.
however, we also noticed several differences. bst . has a lower lod than bst . at any amplification time. we attribute this difference to the higher incidence of false positives when using bst . compared with bst . . an additional consequence of the low false-positive incidence using bst . , regardless of the use of hrm in the assay, is that the lod continues to improve with time as additional true positives are counted. in contrast, bst . benefits greatly from use of hrm in the final assay. if hrm is not included in the assay ( figure g , red dashed), a clear optimum for lod occurs at min and . ± . cp/l. however, if hrm is employed in the assay, the lod more closely resembles the lod curve for bst . and improves with increased detection of true-positive events.
we made several overarching conclusions regarding improving the lod of dlamp using a combination of digital real-time parameters and t m . first, filter parameters can be used singly or in combination to improve the performance (lod) of dlamp. in certain assays one parameter may perform better than another for this selection. for this primer set, lod for bst . was lower (better) when using ttp ( . ± . cp/l) than max rate ( . ± . cp/l) or final intensity ( . ± . cp/l). second, incorporation of hrm into the final assay readout will benefit some assays more than others. we observed that incorporation of hrm as a part of the final assay improved the performance of bst . more than the performance of bst . , and was vital for long assay times.
assays with high clinical sensitivity and specificity are critically needed. clinical samples of ct, originating from urine and swabs, pose an intrinsic challenge because they contain variable levels of host dna, and dna from other flora.
the analysis of these clinical samples needs to be not only sensitive (good lod), but also able to function in the presence of non-specific, potentially amplifiable genomic secondary structures and other possible environmental contaminants, while remaining consistent between samples. we sought to investigate the impact of host human genomic dna (hgdna) on non-specific background amplification. we hypothesized that non-specific structures (like hairpins and regulatory elements) may amplify under lamp conditions and contribute to non-specific background amplification. we titrated sheared buffy coat gdna (i.e. leukocytes) concentrations from zero to . × cells per l, a concentration . × greater than that expected to cause interference ( ) and observed the impact on specific and non-specific amplification of ct ( figure ). we measured the concentration of hgdna in human haploid genome equivalents (hhge), or half the total amount of hgdna in a diploid cell. for each concentration of host dna and enzyme, we ran at least three chips in the presence of ct template and three in the absence of template, across multiple days and sample lots. in total, we observed different reaction partitions. at the highest concentration of hgdna, there was times more hgdna than bacterial dna by mass.
we first asked how background dna impacted ttp qualitatively. we observed that for both bst . and bst . enzymes, specific and non-specific amplification were qualitatively similar, independent of background dna concentration, below hhge per l. as with previous measurements, bst . rarely produced low-t m non-specific events, whereas bst . produced both high- and low-t m non-specific events. further, there were more non-specific amplification events for bst . than bst . at both high and low t m . we next asked how background hgdna impacts specific and non-specific amplification quantitatively. we categorized amplification events as specific and non-specific based on t m as described previously.
first, we asked: is there a relationship between fraction of template molecules amplified in dlamp and amplification time? we then determined the total number of template copies loaded into a chip relative to the copies measured by ddpcr. if amplification initiation is stochastic, as observed in figures f and a -b, does longer assay time increase 'efficiency' and thereby improve lod when using t m (as seen in figure c and f)? we observe that for bst . a large number of partitions amplify in the first . min, followed by a second phase after min where additional partitions amplify with lower frequency ( figure a ). the mode ttp for concentrations less than hhge per l was ∼ . ± . min (supplementary table s and figures s a- c) . after the mode ttp, the frequency of observing specific amplification in the absence of hhge decreases from a maximum frequency of . ± . % copies detected per s to a lower average frequency of . ± . % copies per s from to min ( figure a ). for bst . ( figure a ), we observe a similar trend temporally, though mode ttp was at least min slower and had greater variability than bst . (supplementary table s , figures s b- d) . further, bst . consistently amplified fewer target molecules than bst . at all time points. this highlights the stochastic nature of amplification using lamp, and the importance of enzyme choice for sensitivity. in theory, assays employing t m could be run until all partitions amplify as either a false or true positive. allowing all partitions to amplify would give the highest possible number of target copies amplified and lowest possible lod when using t m in the final assay.
second, we asked what is the impact of hgdna on efficiency as a function of time? for both bst . and . (figures a and a) , when comparing within a given enzyme, we observed that the fraction of copies detected and the moment the majority of reactions initiate were indistinguishable for concentrations less than hhge per l.
at hhge per l, a decrease in the fraction of copies detected and a delay in amplification initiation was observed (see also supplementary figure s c and d) . for bst . , the mode ttp was delayed by . min to . ± . min, whereas in bst . , the mode ttp was . ± . min at hhge per l (supplementary table s and figure s ). thus, high concentrations of hgdna may suppress specific amplification.
third, we asked what is the impact of hgdna and time on non-specific amplification? for bst . , we observed consistent non-specific amplification products with high and low t m , regardless of concentration of hgdna. single digital partition counts were observed at low-t m non-specific amplification in both the presence of template and the ntc and independent of hgdna concentration ( figure b and c). the fraction of partitions generating a false-positive amplification at low t m was less than . × − through min (i.e. or fewer events in , partitions per chip). similarly, partition counts of high-t m non-specific amplification are < per chip until min. after min, high-t m non-specific amplification is more prevalent than low-t m non-specific amplification and the reactions finish with fewer than non-specific counts in , partitions corresponding to a false-positive fraction of . × − . one exception is the non-specific high-t m amplification in the absence of template and hhge. this condition appears to have lower non-specific background than other conditions. we collected each replicate on separate days and are able to observe the experimental variability between the presence and absence of template, which might be otherwise lost when examining the ntc alone. this experiment emphasizes the advantage of determining non-specific amplification using t m from the same experiment in which specific amplification is counted. at low background rates, such as when using bst . , inherent variability exists in the false-positive fraction and can impact lod.
measuring non-specific amplification from within an experiment eliminates the assumption that the false-positive rate remains identical to the ntc or between experimental runs. for bst . , non-specific amplification was variable, but tended to be lower for higher concentrations of hgdna. at any given time, high-t m non-specific amplification was on average ∼ -fold more likely to occur than a low-t m non-specific product. at min, low-t m non-specific amplification had false-positive fraction < . × − ( or fewer events per chip); amplification events with high t m had a false-positive fraction < . × − ( or fewer events per chip). at the completion of the experiment, high-t m non-specific amplification events account for as much as % of the total partitions per chip, a value exceeding the total observed true-positive events. in these scenarios, utilization of t m to identify true and false amplification will be critical to successful quantification of target analytes. for this ct primer set, both bst . and bst . similarly demonstrate that the presence of high concentrations of hgdna may suppress the likelihood of non-specific amplification occurring. in general, for this primer set and target, we find that bst . performs significantly better than bst . as a consequence of having higher probability of detecting a target molecule and low likelihood of generating a non-specific amplification event.
fourth, we asked whether maximum rate is impacted by the concentration of hgdna. we hypothesize that background hgdna may compete for the binding site of the polymerase with the target dna or generate competing amplification events and thus decrease the maximum observed velocity in a given partition. this phenomenon would be challenging to untangle in bulk. we find that maximum rates are similar for a given enzyme, until hhge per l for bst . (supplementary figure s a ) and above hhge per l for bst . (supplementary figure s b) .
this demonstrates that high concentrations of hhge may slow the rate of amplification. furthermore, in general, and echoing the conclusions of figure g and i, we observe that bst . has faster maximum rate than bst . , regardless of the hgdna concentration.
fifth, we asked how lod is impacted by the concentration of hgdna. for bst . (supplementary figure s e) , the lod at a given time was similar for concentrations < hhge per l, while the lod in the presence of hhge per l was slightly worse owing to the detection of fewer target molecules (e.g. . versus . cp/l at min). as shown previously, incorporation of hrm into the final assay does not impact the lod when using bst . . when using bst . (supplementary figure s f) and hrm to remove non-specific amplification, lod tracks with the number of true-positive events. thus, lod becomes worse when efficiency is lower (i.e. at hhge per l). similarly, when hrm is not incorporated in the assay, higher concentrations of hhge tend to result in a worse lod. however, at long amplification times, high concentrations of hhge suppress non-specific amplification more than specific amplification, resulting in lod enhancement relative to low concentrations of hhge.
cumulatively, these data show that high background dna may reduce the probability of detecting a specific molecule (analytical sensitivity), suppress the false-positive fraction (analytical specificity), reduce the velocity of amplification, and delay the start of amplification at clinically relevant concentrations of hgdna. thus, we conclude that background hgdna impacts dlamp for this primer set. generally, investigators should examine their own primer sets in the presence of high concentrations of hgdna and take caution when examining clinical samples with high leukocyte concentrations (as reported by urinalysis). for example, ct infection is not inherently associated with high concentrations of leukocytes and many infections are asymptomatic.
ultimately, these experiments underscore the value of quantifying non-specific amplification variability, using hrm, from within the same experiment as a target is quantified. because non-specific amplification is measured within a given sample, one no longer needs to assume it remains identical to the ntc or between experimental runs.
we predict that the combination of hrm and real-time dlamp will be invaluable for answering many questions across a wide variety of applications, and thus our approach was designed to be accessible to most standard labs. we employed commercial chips for digitization, a commercial thermoelectric unit for heating and cooling, a commercial microscope for optical analyses and we made our data-processing script freely available. our intention was to design an accessible system with readily available components to enable others to access the advantages of digital microfluidics to study and optimize primer sets, enzymes, and reaction conditions of interest to them. we predict these capabilities will be particularly valuable for people working with variable sample matrices, high background dna, poorly performing primer sets, or poorly performing enzymes.
we derived four major lessons from this study. first, lamp can produce non-specific amplicons with high t m . the formation of these non-specific amplicons occurs from the interaction of multiple primers and the use of a polymerase with template switching ability, terminal transferase activity and lacking - exonuclease activity. interaction of primers may lead not only to rising background fluorescence ( ), but also to spontaneous exponential amplification. primer design and enzyme selection therefore should be judicious to avoid formation of hairpins within primers, as well as microhomology at the with any other primer, in order to prevent non-specific amplification. second, hrm in lamp is a useful method for differentiating specific and non-specific amplification events.
digital experiments measure the fate and rate of each template; in contrast, bulk experiments are biased toward early amplification events. the combination of dlamp and hrm allows observation of many amplification events and assignment of the nature of that amplification as true or false. further, dlamp with hrm quantifies non-specific amplification experimentally in the presence of specific amplification, eliminating the assumption that incidence of false positives in the presence of template remains identical to the ntc or between experimental runs. third, by differentiating specific and non-specific amplification, hrm is helpful in determining the combination of processing and assay parameters that will lead to the best lod in a digital assay. when hrm is incorporated into a dlamp assay, true and false-positive amplification events can easily be separated. lod is improved by elimination of non-specific background and thus becomes dependent on the number of molecules that amplify (i.e. amplification efficiency or fraction of copies detected), without dependence on the incidence of false positives. in contrast, if hrm were employed in a bulk reaction, the lod would still be limited by the competition between specific and non-specific amplification (which amplifies first) and would require a high number of trials to achieve sufficient statistical power. importantly, even when hrm will not be used in the final assay, it can still be incorporated during the assay-development stage to improve the assay's lod by determining the optimal choice of parameters based on rate, ttp, final intensity or any combination of these parameters. furthermore, our mathematical description of lod is generalizable to other amplification methods that are measured in digital format and can separate specific and non-specific amplification. fourth, high levels of non-specific host gdna suppress analytical sensitivity and specificity, reduce amplification velocity, and delay the start of amplification.
however, low-to-moderate levels of non-specific host gdna do not impact the analytical specificity or sensitivity of dlamp. to demonstrate the clinical utility of real-time dlamp with hrm, we ran our assays through clinically relevant concentrations of background dna and did not observe interference until the upper range of concentrations expected to cause interference.
real-time dlamp with hrm will enable the mechanistic optimization of primers and myriad assay conditions (such as buffer, mg + and reaction temperature). because real-time dlamp with hrm reveals the incidence of non-specific amplification products with high and low t m as a function of time, dlamp with hrm can be used to investigate approaches that will eliminate different non-specific products. for example, fast or early non-specific events in digital may indicate primers or conditions that will be especially vulnerable to failure in a bulk reaction. thus, real-time dlamp with hrm could be used to design primers that will suppress non-specific amplification in bulk, by generating only non-specific amplicons that occur at slow rates and late ttp. future efforts should investigate the combination of real-time dlamp (and other digital isothermal amplification technologies) and hrm as a way to increase multiplexing of dlamp when using a single reporter. in pcr, hrm has been used to differentiate among multiple amplification products by measuring differences in t m ( ) ( ) ( ) ( ) ( ) , with applications that include, among others, multiplexed pathogen identification and antibiotic susceptibility testing. finally, studies with clinical samples should be performed using the dlamp with hrm method to understand the carryover effects from relevant matrices.
the complete sequencing data generated during this study are available in the national center for biotechnology information sequence read archive repository with the bioproject id: prjna .
the matlab script described here has been deposited in the open-access online repository github and may be accessed using the following direct link: https://github.com/ ismagilovlab/digital naat ch meltcurve analyzer. highly stable and sensitive nucleic acid amplification and cell-phone-based readout strand displacement probes combined with isothermal nucleic acid amplification for instrument-free detection from complex samples loop-mediated isothermal amplification of dna rapid and sensitive detection of chlamydia trachomatis sexually transmitted infections in resource-constrained settings in thailand at the point-of-care chlamydia trachomatis and genital mycoplasmas: pathogens with an impact on human reproductive health diagnosis of chlamydia infection in women supplemental tables for interference testing in clinical chemistry, st edition, ep interference testing in clinical chemistry mathematical model to reduce loop mediated isothermal amplification (lamp) false-positive diagnosis real-time target-specific detection of loop-mediated isothermal amplification for white spot syndrome virus using fluorescence energy transfer-based probes real-time detection and monitoring of loop mediated amplification (lamp) reaction using self-quenching and de-quenching fluorogenic probes quenching of unincorporated amplification signal reporters in reverse-transcription loop-mediated isothermal amplification enabling bright, single-step, closed-tube, and multiplexed detection of rna viruses simultaneous multiple target detection in real-time loop-mediated isothermal amplification homogenous, real-time duplex loop-mediated isothermal amplification using a single fluorophore-labeled primer and an intercalator dye: its application to the simultaneous detection of shiga toxin genes and in shiga toxigenic escherichia coli isolates fret-based assimilating probe for sequence-specific real-time monitoring of loop-mediated isothermal amplification (lamp) robust strand exchange reactions 
for the sequence-specific, real-time detection of nucleic acid amplicons establishment of an accurate and fast detection method using molecular beacons in loop-mediated isothermal amplification assay real-time sequence-validated loop-mediated isothermal amplification assays for detection of middle east respiratory syndrome coronavirus (mers-cov) adapting enzyme-free dna circuits to the detection of loop-mediated isothermal amplification reactions a digital microfluidic system for loop-mediated isothermal amplification and sequence specific pathogen detection increased robustness of single-molecule counting with microfluidics, digital isothermal amplification, and a mobile phone versus real-time kinetic measurements mechanistic evaluation of the pros and cons of digital rt-lamp for hiv- viral load quantification on a microfluidic device and improved efficiency via a two-step digital protocol real-time, digital lamp with commercial microfluidic chips reveals the interplay of efficiency, speed, and background amplification as a function of reaction temperature and time instrument for real-time digital nucleic acid amplification on custom microfluidic devices loop-mediated isothermal amplification (lamp) method for rapid detection of trypanosoma brucei rhodesiense enhancing melting curve analysis for the discrimination of loop-mediated isothermal amplification products from four pathogenic molds: use of inorganic pyrophosphatase and its effect in reducing the variance in melting temperature values development of a loop-mediated isothermal amplification method for diagnosing pneumocystis pneumonia development of a multiplex loop-mediated isothermal amplification method for the simultaneous detection of salmonella spp. 
and vibrio parahaemolyticus novel loop-mediated isothermal amplification (lamp) assay with a universal qprobe can detect snps determining races in plant pathogenic fungi isothermal amplification and multimerization of dna by bst dna polymerase unusual isothermal multimerization and amplification by the strand-displacing dna polymerases with reverse transcription activities n.bspd i dna nickase strongly stimulates template-independent synthesis of non-palindromic repetitive dna by bst dna polymerase analysis of non-template-directed nucleotide addition and template switching by dna polymerase de novo dna synthesis by human dna polymerase , dna polymerase and terminal deoxyribonucleotidyl transferase ab initio synthesis by dna polymerases interrogation of multimeric dna amplification products by competitive primer extension using bst dna polymerase (large fragment) impact of primer dimers and self-amplifying hairpins on reverse transcription loop-mediated isothermal amplification detection of viral rna lack of correlation between reaction speed and analytical sensitivity in isothermal amplification reveals the value of digital methods for optimization: validation using digital real-time rt-lamp loop-mediated isothermal amplification for detection of nucleic acids single-molecule enzyme-linked immunosorbent assay detects serum proteins at subfemtomolar concentrations digital concentration readout of single enzyme molecules using femtoliter arrays and poisson statistics integrated bacterial identification and antimicrobial susceptibility testing using pcr and high-resolution melt massively parallel digital high resolution melt for rapid and absolutely quantitative sequence profiling facile profiling of molecular heterogeneity by microfluidic digital melt nanoarray digital polymerase chain reaction with high-resolution melt for enabling broad bacteria identification and pheno-molecular antimicrobial susceptibility test highly efficient real-time droplet analysis 
platform for high-throughput interrogation of dna sequences by melt
this project benefited from the use of instrumentation at the jim hall design and prototyping lab and the millard and muriel jacobs genetics and genomics laboratory. we thank daan witters and pedro ojeda for initial selection of ct primers, eric liaw for helpful discussions of error propagation and natasha shelby for help with writing and editing this manuscript. conflict of interest statement. the content of this manuscript is the subject of a patent application filed by caltech. supplementary data are available at nar online.
key: cord- - qxf xo authors: chun, jong-yoon; kim, kyoung-joong; hwang, in-taek; kim, yun-jee; lee, dae-hoon; lee, in-kyoung; kim, jong-kee title: dual priming oligonucleotide system for the multiplex detection of respiratory viruses and snp genotyping of cyp c gene date: - - journal: nucleic acids res doi: . /nar/gkm sha: doc_id: cord_uid: qxf xo
successful pcr starts with proper priming between an oligonucleotide primer and the template dna. however, the inevitable risk of mismatched priming cannot be avoided in the currently used primer system, even though considerable time and effort are devoted to primer design and optimization of reaction conditions. here, we report a novel dual priming oligonucleotide (dpo) which contains two separate priming regions joined by a polydeoxyinosine linker. the linker assumes a bubble-like structure which itself is not involved in priming, but rather delineates the boundary between the two parts of the primer. this structure results in two primer segments with distinct annealing properties: a longer 5′-segment that initiates stable priming, and a short 3′-segment that determines target-specific extension.
this dpo-based system is a fundamental tool for blocking extension of non-specifically primed templates, and thereby generates consistently high pcr specificity even under less than optimal pcr conditions. the strength and utility of the dpo system are demonstrated here using multiplex pcr and snp genotyping pcr. since the development of the polymerase chain reaction (pcr), a variety of modifications in primer design and reaction conditions have been proposed to enhance and optimize specificity ( - ), but a fundamental solution for eliminating non-specific priming still remains a challenge and limits the versatility of pcr in nucleic-acid-based tests (nats). achieving a high level of specificity in priming currently requires very rigid primer search parameters and optimization of pcr conditions. even when these requirements are satisfied, a high risk of nonspecific priming is inevitable. in order to fundamentally block non-specific priming, we have developed a novel dual priming oligonucleotide (dpo) system that is structurally and functionally different from the primer system currently in widespread use. the conventional priming system is based on a single priming event between the primer and template, in which mismatches in priming can lead to extension of non-specific products. in contrast, the dpo system has two separate primer segments, one of which is longer than the other, joined by a polydeoxyinosine [poly(i)] linker ( figure a ). deoxyinosine (i) is known to have a relatively low melting temperature compared to the natural bases, due to weaker hydrogen bonding ( ) . thus, we hypothesized that a poly(i) linker inserted between two stretches of natural bases would form a bubble-like structure and separate a single primer into two functional regions at a certain annealing temperature: a -segment - nt in length and a -segment - nt in length. this unequal distribution of nucleotides leads to different annealing preferences for each segment. 
the longer 5′-segment preferentially binds to the template dna and initiates stable annealing, whereas the short 3′-segment selectively binds to its target site and blocks non-specific annealing. therefore, only target-specific extension will result from the successive priming of the 5′-segment and the 3′-segment of the dpo. in this article, we describe and demonstrate how effectively dpo eliminates extension of non-specifically primed templates and generates high pcr specificity under a range of sub-optimal or stringent reaction conditions. in particular, we show that dpo can be successfully applied to two complex pcr manipulations that have not been perfected to date: multiplex pcr of respiratory viruses and single nucleotide polymorphism (snp) genotyping pcr of the cyp c gene.
*to whom correspondence should be addressed. tel: ; fax: ; email: chun@seegene.com © the author(s) this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/ by-nc/ . /uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
the dpo is composed of three regions: a longer 5′-segment, a shorter 3′-segment and a poly(i) linker that bridges these two segments. in designing the dpo ( figure a ), the position of the 3′-segment was determined first, at a site where - bases had a - % gc content and the t m was not considered. five deoxyinosines were designated for the poly(i) linker since they had generated the best result when - deoxyinosines were tested to determine the optimum length of the linker. the 5′-segment of the dpo was automatically determined by the sequence of the bases upstream of the 3′-segment and extended - bases, until the t m was c.
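the design steps described above can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the authors' procedure: the segment labels are read as a longer 5′-segment and a shorter 3′-segment; the linker is modelled as five inosines (written `I`) lying opposite five template positions; the segment lengths and the target t m are caller-supplied parameters because their values are not legible in this text; and the wallace rule is a crude stand-in for a proper nearest-neighbor t m calculation. the names `wallace_tm`, `gc_fraction` and `design_dpo` are hypothetical.

```python
def wallace_tm(seq: str) -> float:
    """Rough melting temperature in deg C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def design_dpo(target: str, three_start: int, three_len: int,
               five_min_len: int, five_max_len: int,
               five_tm_goal: float, linker_len: int = 5) -> dict:
    """Assemble a DPO-style primer on `target`.

    `target` is written 5'->3' in the same sense as the primer.
    All length and Tm arguments are assumptions supplied by the caller.
    """
    if three_start + three_len > len(target):
        raise ValueError("3'-segment runs past the end of the target")
    # Step 1: fix the short 3'-segment first.
    three = target[three_start:three_start + three_len]

    # Step 2: reserve linker_len positions upstream for the poly(I) linker.
    linker_start = three_start - linker_len
    if linker_start < five_min_len:
        raise ValueError("not enough upstream sequence for the 5'-segment")

    # Step 3: extend the 5'-segment upstream until the Tm goal or the
    # maximum length is reached, or the target sequence runs out.
    length = five_min_len
    while (length < five_max_len
           and length < linker_start
           and wallace_tm(target[linker_start - length:linker_start]) < five_tm_goal):
        length += 1
    five = target[linker_start - length:linker_start]

    return {
        "five_segment": five,
        "three_segment": three,
        "three_gc": gc_fraction(three),
        "primer": five + "I" * linker_len + three,
    }
```

in practice the returned `three_gc` value would be checked against whatever gc window is chosen for the 3′-segment, and `wallace_tm` would be swapped for a thermodynamic model that handles inosine.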
the secondary structure and dimers were not considered in the design of the dpo because the -segment alone, which is physically separated by the linker from the -segment, is too short to form such structures stably. viral genes are highly variable; therefore, in order to generate virus-specific dpos, the length of the -segment of the dpo was increased so that it had an even higher t m ( - c). the long conventional primers comprise sequences identical to the dpo primers except the poly(i) linker regions. the sequences of the primers, as well as their annealing temperatures, are given in table . for correct prediction of the t m values of primers which contains deoxyinosine, we used hyther tm (http://ozone .chem.wayne.edu/) which adopts the nearest neighbor model with optimized thermodynamic parameters for deoxyinosine pairs in dna duplex ( ) . total rna extracted from an embryonic day . (e . ) mouse embryo was reverse-transcribed using the primer dt -acp (seegene, korea) and m-mlv reverse transcriptase (promega), according to the manufacturer's instructions. -race fragments containing partial sequences from the ndufs gene ( bp) were amplified using  mastermix (solgent, korea) with an ndufs specific primer and dt table . conditions: denaturation at c for s, annealing at - c for s and extension at c for s. amplification was completed with a final extension step at c for min. total rna extracted from nasal aspirate samples from five patients was reverse-transcribed using random hexamer primers (fermentas) and m-mlv reverse transcriptase (promega). multiplex rt-pcr was performed on the five cdna samples to detect five different virus-specific genes using  mastermix (solgent). the following gene segments were amplified: the segment gene of the influenza a virus ( bp), the segment gene of the influenza b virus ( bp), the f gene of the respiratory syncytial virus b ( bp), the f gene of the respiratory syncytial virus a ( bp) and the m gene of coronavirus oc ( bp). 
as a negative control, sterile deionized water was used as the template instead of nucleic acid. as a positive control, plasmids containing amplicons of the same length were used. after preheating at c for min, amplification cycles were carried out in the thermal cycler (same as above) under the following conditions: denaturation at c for s, annealing at c for s and extension at c for s. amplification was completed with a final extension step at c for min. genomic dna was extracted from nine human blood samples, which consisted of three samples for each of three different genotypes at bp in exon of cyp c (allele /allele , allele /allele , allele /allele ). the allele-specific dpo primers were designed to have an snp in the middle of the 3′-segment because such a position maximizes disturbance of the 3′-segment annealing. multiplex pcr analysis of the genomic dna was performed to detect allele ( bp) and allele ( bp), together with a general primer to detect cyp c ( bp) using mastermix (solgent). after a preheating step at c for min, amplification cycles were carried out in the thermal cycler (same as above) under the following conditions: denaturation at c for s, annealing at c for s and extension at c for s. amplification was completed with a final extension step at c for min. the difference in overall strategy of the conventional and the dpo-based pcr approaches is illustrated in figure a . conventional primers have a single priming region and extension may proceed even in the presence of mismatches between a primer and a template. in contrast, dpo has two priming regions and extension proceeds only when the two priming segments are perfectly matched with the sequence of a template ( figure a , perfect match). even though the 5′-segment of the dpo binds to the template, extension will not proceed if there is any mismatch in the 3′-segment of the dpo ( figure a , mismatch at 3′-end).
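the extension rule just described can be written as a toy model. the sketch below is a deliberate simplification and not a thermodynamic simulation: it treats dpo priming as all-or-nothing (both segments must match perfectly), and it lets a conventional primer extend whenever its mismatch count stays within a tolerance. the function names and the default tolerance of three mismatches are hypothetical choices made for illustration; sites are written in the same sense as the primer.

```python
def count_mismatches(primer_segment: str, template_site: str) -> int:
    """Number of positions at which the primer segment differs from its site."""
    if len(primer_segment) != len(template_site):
        raise ValueError("segment and site must have the same length")
    pairs = zip(primer_segment.upper(), template_site.upper())
    return sum(1 for p, t in pairs if p != t)


def dpo_extends(five_segment: str, three_segment: str,
                five_site: str, three_site: str) -> bool:
    """Toy DPO rule: extension only if both segments match perfectly."""
    return (count_mismatches(five_segment, five_site) == 0
            and count_mismatches(three_segment, three_site) == 0)


def conventional_extends(primer: str, site: str,
                         tolerated_mismatches: int = 3) -> bool:
    """Toy conventional rule: extension despite a few mismatches (assumed tolerance)."""
    return count_mismatches(primer, site) <= tolerated_mismatches
```

for example, `dpo_extends("ACGTACGTAC", "GGATCC", "ACGTACGTAC", "GGTTCC")` is false because of the single mismatch in the short segment, whereas `conventional_extends("ACGTACGTACGGATCC", "ACGTACGTACGGTTCC")` is true under the assumed tolerance.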
if the 5′-segment does not bind to the template due to mismatches, the 3′-segment alone, which is - nt in length, has a t m too low (below c) to bind its template at the generally used annealing temperatures of - c ( figure a , mismatch at 5′-end). to test our hypothesis, we conducted 3′-race of ndufs with dpo and long conventional primers carrying three mismatched bases at the 5′- or 3′-ends ( figure b) . the conventional primers generated non-specific products under all conditions (lanes , and ), whereas the perfectly matched dpo primer generated only one target product (lane ), and did not amplify any pcr product when three mismatched bases were incorporated in either the 5′-segment (lane ) or the 3′-segment (lane ). we also hypothesized that the dpo primer would maintain a high level of specificity over a wide range of annealing temperatures due to its structural feature. the 3′-race of ndufs was further conducted at a low ( c) and a high ( c) annealing temperature. as shown in figure c , the long conventional primers gave fewer non-specific products only when the annealing temperature was increased to c (lanes and ) , while the dpo primer produced only one target product over a wide range of annealing temperatures ( and c) (lanes and ) . these results clearly support our hypothesis that the unique dual-specificity annealing properties of the dpo primer block non-specific priming events even under less than optimal pcr conditions. we evaluated the dpo-based system in a multiplex pcr application for the detection of five different human respiratory viruses. most respiratory tract infections are fairly mild, but they are highly infectious, and can sometimes result in severe symptoms requiring hospitalization, or even lead to death. traditional detection tools such as cell culture and antigenic detection are usually slow and may be inaccurate as well.
various multiplex pcr-based detection tools have been developed and introduced to direct appropriate therapy and to avoid the use of unnecessary antibiotics ( , ) . however, current multiplex pcr-based assays require further validation, such as nested pcr or a probe hybridization assay, due to their high rate of false positives ( , ) . we developed a dpo-based multiplex pcr assay for the detection of five respiratory viruses: influenza a, influenza b, respiratory syncytial virus a, respiratory syncytial virus b and coronavirus oc ( figure ). as expected, long conventional primers generated many non-specific bands, most likely due to non-specific annealing or primer competition. in contrast, the dpo primer generated target-specific viral fragments, and no false positives. in addition, dpo primers detected the influenza a virus in patient (lane ), which was not detected using the conventional primer system (lane ). the presence of respiratory viruses in each patient was further validated by sequencing the fragment. these results indicate that a dpo-based multiplex pcr approach is a reliable tool for detecting multiple pathogens. we further evaluated the dpo-based multiplex pcr system for the detection of a single nucleotide polymorphism (snp) in cyp c . this polymorphism is one of the most thoroughly characterized snps, in which a single base pair substitution (g → a) at position in exon of cyp c results in a non-functional protein and affects the metabolism of a number of commonly used drugs ( ) . snps are the most common type of dna sequence variation in the human genome, and are an important genetic factor in the origin and development of complex genetic traits in humans ( , ) . appropriate methods for snp genotyping are important in many fields of biological science, and several different methods for detecting snps in the human genome have been introduced ( ) .
however, an ideal pcr-based snp detection method, one that is simple and provides an accurate genotype without additional verification steps such as sequencing, does not currently exist. we carried out multiplex pcr analysis of nine human genomic dna samples with known genotypes at the cyp c locus ( figure , lanes - and - , allele /allele ; lanes - and - , allele /allele ; lanes - and - , allele /allele ) using two different dpo primer sets to detect the three different allelic combinations (figure ). dpo-based multiplex pcr clearly distinguished between the different alleles of cyp c , while long conventional primer-based multiplex pcr did not. we have developed a new primer technology, the dpo system. the dpo system differs structurally and functionally from the conventional primer system by including a poly(i) linker, which is one of the most commonly used universal bases ( ) , between two segments of primer sequences ( figure a ). in general, primers bases are rarely used, since the t m s of mers or the longer primers can be over c, which is too high for proper annealing ( ) . this is a fundamental limitation in current conventional primer design. however, the long dpo primer ( - mer) is divided into two distinct target-specific priming segments by the presence of the poly(i) linker, and, thus, does not suffer from the limitations of a high t m . in addition, the two priming segments of differing lengths have distinct priming functions. for example, at the general annealing temperatures of - c, stable annealing is initiated only by the longer -segment since it has a high enough t m (over c) to bind to the template. target-specific extension is then determined by the shorter -segment, resulting in unparalleled high specificity. in order to demonstrate the usefulness of dpo, we compared the performance of the dpo system to that of the conventional primer system in multiplex pcr and snp genotyping pcr. 
overall, our results demonstrate that the dpo system has dramatically improved performance. dpo primers are easier to design than conventional primers. in designing conventional primers, the sequence should be carefully checked for certain primer design features such as primer length, melting temperature, gc content and secondary structure ( ) . in contrast, we believe that a dpo can be designed based on almost any sequence of interest since the poly(i) linker prevents formation of secondary structure and effectively eliminates non-specific priming. this is very advantageous, particularly for detection of pathogens such as viruses and bacteria, since their sequences are highly variable and available primer sites are highly restricted. we have demonstrated the successful application of dpo primers in one multiplex reaction. multiplex pcr is a rapid and economical tool ( ), but when a large bank of genes is amplified with multi-primer sets, conventional primers often produce false positives due to primer competition, to primer dimers or to the different melting temperatures of the different primers. therefore, current multiplex pcr-based assays require further validation, such as nested pcr or a probe hybridization assay ( , ) . however, dpo allows specific detection of a large number of pathogens without false results because the bubble-like structure of the poly(i) linker in dpo efficiently prevents primer-dimer and hairpin structure formation. the example presented here (figure ) demonstrates the successful use of the dpo-based multiplex pcr for simultaneous detection of multiple respiratory viruses with one pcr step. the high specificity without production of any non-specific bands or false-positive products clearly demonstrates the great potential of dpo-based multiplex pcr approaches to be a reliable, rapid, practical and cost-effective detection method. we have demonstrated the successful application of dpo primers for one snp genotyping pcr.
in general, conventional primers cannot offer reliable results for snp genotyping and current snp genotyping methods require additional steps after amplification of an snp-containing region such as rflp ( ), sequencing ( ) or hybridization ( ) . these are neither rapid nor easy to manipulate, and they require large initial investments in equipment. in contrast, the example presented here (figure ) demonstrates that snp genotyping can be simply achieved in one pcr step by using the dpo system. finally, dpo can be used for any pcr approach which makes use of the ability of primers to specifically hybridize to complementary sequences except for the case of long and accurate pcr (la-pcr) which is achieved by dna polymerases having proofreading activity (e.g. pfu). it was found that such dna polymerases have a tendency to completely drop off the dna when they encounter inosine in a pcr primer ( ) . in this article, we demonstrated that the dpo system can be successfully used in multiplex pcr and snp genotyping pcr. we propose that combination of dpobased multiplex pcr with high-throughput analysis methods such as dna microarray amplification or quantitative analysis methods such as real-time pcr will further extend the potential of this approach. 
the use of a thermally activated dna polymerase pcr gives improved specificity, sensitivity and product yield without additives or extra process steps the elimination of primer-dimer accumulation in pcr controlled hot start and improved specificity in carrying out pcr utilizing touch-up and loop incorporated primers (tulips) comparison of the base pairing properties of a series of nitroazole nucleobase analogs in the oligodeoxyribonucleotide sequence -d(cgcxaattygcg)- nearest-neighbor thermodynamics of deoxyinosine pairs in dna duplexes multiplex pcr: optimization and application in diagnostic virology rapid identification of nine microorganisms causing acute respiratory tract infections by single-tube multiplex reverse transcription-pcr: feasibility study evaluation of a multiplex real-time reverse transcriptase pcr assay for detection and differentiation of influenza viruses a and b during the - influenza season in israel development of three multiplex rt-pcr assays for the detection of respiratory rna viruses the major genetic defect responsible for the polymorphism of s-mephenytoin metabolism in humans the birth and death of human single-nucleotide polymorphisms: new experimental evidence and implications for human history and medicine a map of human genome sequence variation containing . 
million single nucleotide polymorphisms methods for genotyping single nucleotide polymorphisms generic detection and differentiation of tobamoviruses by a spot nested rt-pcr-rflp using di-containing primers along with homologous dg-containing primers the effect of temperature and oligonucleotide primer length on the specificity and efficiency of amplification by the polymerase chain reaction a computer program for selection of oligonucleotide primers for polymerase chain reactions cyp c polymorphism and risk for essential tremor snapshot for pharmacogenetics by minisequencing detection of single nucleotide substitution by competitive allele-specific short oligonucleotide hybridization (cassoh) with immunochromatographic strip pcr with degenerate primers containing deoxyinosine fails with pfu dna polymerase we thank sun-hwa joung for critical reading of the manuscript. funding to pay the open access publication charge was provided by seegene. conflict of interest statement. none declared. key: cord- -ya siivm authors: liu, weichi; shi, xiaoling; gong, peng title: a unique intra-molecular fidelity-modulating mechanism identified in a viral rna-dependent rna polymerase date: - - journal: nucleic acids res doi: . /nar/gky sha: doc_id: cord_uid: ya siivm typically not assisted by proofreading, the rna-dependent rna polymerases (rdrps) encoded by the rna viruses may need to independently control its fidelity to fulfill virus viability and fitness. however, the precise mechanism by which the rdrp maintains its optimal fidelity level remains largely elusive. by solving . – . Å resolution crystal structures of the classical swine fever virus (csfv) ns b, an rdrp with a unique naturally fused n-terminal domain (ntd), we identified high-resolution intra-molecular interactions between the ntd and the rdrp palm domain. in order to dissect possible regulatory functions of ntd, we designed mutations at residues y and e to perturb key interactions at the ntd–rdrp interface. 
when crystallized, some of these ns b interface mutants maintained the interface, while the others adopted an ‘open’ conformation that no longer retained the intra-molecular interactions. data from multiple in vitro rdrp assays indicated that the perturbation of the ntd–rdrp interactions clearly reduced the fidelity level of the rna synthesis, while the processivity of the ns b elongation complex was not affected. collectively, our work demonstrates an explicit and unique mode of polymerase fidelity modulation and provides a vivid example of co-evolution in multi-domain enzymes. processive nucleic acid polymerases are essential for the preservation, passage, and evolution of the genetic information. optimal fidelity levels of processive polymerase synthesis, in some cases coupled to proofreading and/or repairing processes carried out by the polymerase itself or other machineries ( ) ( ) ( ) ( ) , are critical for nearly all forms of life. the rna viruses are a large and unique group of species whose genetic information is solely carried in the form of rna, and the related genome replication process is dependent on the virally encoded rna-dependent rna polymerase (rdrp) and typically not assisted by proofreading mechanisms ( , ) . well known as quasi-species, the rna viruses undergo relatively rapid evolution and exist as populations bearing genome-wide distributed mutations ( ) ( ) ( ) . it was proposed that the rna viruses live with a narrow but optimal range of replication error frequency, as higher error rates may lead to extinction of the species and lower error rates may fail to overcome selection pressure ( ) . as primary machineries that contribute to the replication error of the rna viruses, viral rdrps are unique systems for understanding how optimal fidelity is achieved. viral rdrps all contain a catalytic core that is analogous to an encircled human right hand comprising palm, fingers and thumb domains ( ) ( ) ( ) .
seven catalytic motifs a-g surround the active site with a-e in the most conserved palm and f/g in the fingers ( ) ( ) ( ) . the encirclement created by the interactions between the finger tips and thumb makes the distinction from other right-hand polymerases such as the klenow fragment of dna polymerase i, the bacteriophage t rna polymerase and the human immunodeficiency virus reverse transcriptase (hiv- rt) ( ) ( ) ( ) . accordingly, the fingers-thumb interactions in the rdrp may restrict large-scale fingers domain conformational changes typically seen in other polymerases ( ) ( ) ( ) . indeed, relative local rearrangement in the palm domain is responsible for the active site closure in viral rdrp nucleotide addition cycle ( ) . this rearrangement primarily involves coordinated backbone movement of motifs a and d and key side chain rotamer changes within motifs a, b and f ( ) ( ) ( ) ( ) . as the major determinants of polymerase fidelity, the ntp-binding induced pre-chemistry active site closure is the key process for fidelity modulation ( ) . to date, fidelity variants both in the levels of the rdrp and the full-length virus have been identified through approaches including rdrp structure-based rational mutation design and virus fidelity variant screening ( ) ( ) ( ) ( ) ( ) ( ) . somewhat unexpectedly, the variation/mutation sites have been found widely distributed in the rdrp core, not limited to the aforementioned catalytic motifs or key residues known to participate in active site closure. some of the mutations were believed to modulate fidelity through indirect interactions with or longrange transmission to the active site ( , ) . it was suggested that the mutations in the rdrp palm domain, where the majority of the conformational changes occur during the active site closure, had greater impact on fidelity than those in the fingers domain had ( ) . 
however, the precise mechanism by which each mutation/variation modulates fidelity and whether and how fidelity can be directionally controlled by engineering for purposes including attenuated vaccine development remain poorly understood ( , ) . the pestiviruses, including classical swine fever virus (csfv) and bovine viral diarrhea virus (bvdv), are a small group of livestock pathogens belonging to the pestivirus genus and flaviviridae family and their rdrps were given the name of ns b. compared to the rdrps of other flaviviridae representatives such as the ns of the japanese encephalitis virus (jev) and dengue virus (denv) in the flavivirus genus and the ns b of the hepatitis c virus (hcv) of the hepacivirus genus, the pestivirus ns b contains a unique ∼ -residue n-terminal domain (ntd) that does not have notable sequence homology to any other viral or host proteins. previously determined pestivirus ns b crystal structures (of bvdv) do not include the ntd ( , ) , and show a global architecture similar to the rdrp module of hcv ns b and flavivirus ns ( , , ) . a very recent work reported an ntd-containing csfv (eystrup strain) ns b crystal structure of moderate resolution ( . Å), revealing the overall fold and the intra-molecular interactions between the ntd and the rdrp module ( ) . functional studies have implied that the ntd contributes to the polymerase activity ( ) ( ) ( ) , but the precise mechanism of how the ntd regulates polymerase catalysis remains elusive. in this work, by solving three high-resolution (up to . Å) crystal structures of the ntd-containing csfv (shimen strain) ns b, we provide detailed structural information of the ntd and its intra-molecular interactions with the rdrp core. bearing a unique α/β fold, the ntd interacts with the rdrp palm domain in the vicinity of motifs a and d.
crystallographic and enzymatic characterizations of csfv ns b and its ntd-rdrp interface mutants further demonstrated that the ntd contributes to the optimal fidelity of the rdrp, defining a unique mechanism of fidelity modulation. likely as a consequence of co-evolution of the ntd and the rdrp, the two parts of pestivirus ns b with distinct origins have worked coordinately to create a unique mode of intra-molecular fidelity modulation. the dna fragment corresponding to ns b residues - was amplified from the csfv dna clone psm (shimen strain) and cloned into a pet b vector. the resulting plasmid pet b-csfv-ns b was used as the template for construction of all mutant plasmids. ns b point mutations were introduced using the quikchange site-directed mutagenesis method ( ) . n-terminal and c-terminal deletions were achieved through a site-directed ligase-independent mutagenesis (slim) method ( ) . all plasmids were transformed into escherichia coli strain bl -codonplus(de )-ril for overexpression. cells were grown overnight at °c in nzcym medium with μg/ml kanamycin (kan ) and μg/ml chloramphenicol (chl ). the overnight culture was used to inoculate l of nzcym medium with kan and chl . the cells were grown at °c until the od reached . , and then were cooled to °c. isopropyl-β-d-thiogalactopyranoside (iptg) was added at a final concentration of . mm, and the cells were grown for an additional h before harvesting. each ns b construct contains a c-terminal hexahistidine tag. cell lysis, protein purification and protein storage were performed as previously described for the jev ns study ( ) , except that tris (ph . ) was used as the buffering agent in the cation exchange chromatography and the final protein samples were stored in a buffer with a higher concentration of nacl ( mm) and % (v/v) glycerol. 
the molar extinction coefficients for the ns b constructs were calculated based on protein sequence using the expasy protparam program (http://www.expasy.ch/tools/protparam.html). the yield is typically about mg of pure protein per liter of bacterial culture. crystals of the wild-type (wt) csfv ns b or its variants were grown by sitting drop vapor diffusion at °c using and mg/ml protein. within weeks, quadrangular-shaped crystals (form ) grew with a precipitant solution containing . m tris (ph . ) and % (v/v) poly(propylene glycol) for the wt, c- mutant and c- mutant, while spindle-shaped crystals (form ) grew with a precipitant solution containing . m lioac, . m bis-tris (ph . ) and % (w/v) sokalan cp for the c- aa mutant. the growth of form crystals was further optimized by supplementing the precipitant solution with - % volume of a solution containing . m tris (ph . ), % (w/v) polyvinylpyrrolidone, and % (w/v) poly(ethylene glycol) methyl ether for the wt; . m nacl, . m mes (ph . ), % (v/v) pentaerythritol propoxylate for the c- mutant; or . m hepes (ph . ), % (v/v) jeffamine m- for the c- mutant. crystals were flash cooled, except that the c- aa crystals were transferred to a cryo-solution (precipitant solution supplemented with % (v/v) glycerol) by incremental buffer exchange prior to flash cooling in liquid nitrogen. single crystal x-ray diffraction data were collected at the shanghai synchrotron radiation facility (ssrf) beamlines bl u (wavelength = . Å, temperature = k) and bl u (wavelength = . Å, temperature = k). at least - ° of data were typically collected in . - . ° oscillation steps. reflections were integrated, merged and scaled using hkl (table ) ( ). the initial structure solution was obtained using the molecular replacement program phaser ( ) with coordinates derived from bvdv ns b structures (pdb entries s f and cjq) as the search model ( , ) . manual model building and structure refinement were done using coot and phenix, respectively ( , ) . 
the , k composite simulated-annealing omit f o -f c electron density maps were generated using cns ( ) . unless otherwise indicated, all polymerase superimpositions were done using the maximum likelihood-based structure superpositioning program theseus ( ) . the chemically synthesized -mer template strand (t , integrated dna technologies) was purified by % (w/v) polyacrylamide/ m urea gel electrophoresis, excised from the gels, and electro-eluted using an elu-trap device (ge healthcare). purified t was stored in an rna annealing buffer (rab: mm nacl, mm tris (ph . ), mm mgcl ) at − °c after a self-annealing process (a -min incubation at °c followed by snap-cooling to minimize intermolecular annealing). for all the in vitro rdrp assays, t was annealed with a gg dinucleotide primer bearing a -phosphate (p , jena biosciences) at a : . molar ratio via a -min incubation at °c followed by slow cooling to room temperature in the rab to yield the t /p construct. reaction quenching, sample processing, denaturing polyacrylamide gel electrophoresis (page), rna visualization by stains-all (sigma-aldrich) staining and quantification were as previously described in a jev rdrp study ( ) . all stains-all based gels are shown in greyscale mode, converted from the original rgb mode without any brightness/contrast adjustment. two types of misincorporation assays, derived from the regular assays described above, were carried out, corresponding to a guanosine-directed ump misincorporation (g:u mis ) at the th nucleotide of the product or a cytosine-directed ump misincorporation (c:u mis ) at the th nucleotide of the product. for radioactive labeling, [α- p]atp (perkinelmer life sciences) was supplied with atp/utp ( m each) for the g:u mis assays and with atp/utp/ctp ( m each) for the c:u mis assays. 
for the g:u mis and c:u mis assays, experiments were performed either in a time course format (typically seven time points) for representative ns b constructs or with two representative time points for all constructs. the sng:u mis assays were performed in a two-step format. in the first step, the reaction proceeded for min in the presence of atp/utp ( m each); the mixture was then centrifuged at g, the supernatant was removed, and the precipitate containing the ns b elongation complex (ec) was washed twice with a reaction buffer ( mm nacl, mm tris (ph . ), mm mgcl , mm dtt). in the second step, the ump misincorporation reaction was conducted at °c at various utp concentrations in the range between and m. the reaction was quenched at various time points. the radioactive rna products in the g:u mis and c:u mis assays were visualized using a cyclone plus storage phosphor system (perkinelmer life sciences), and regular products in the sng:u mis assays were visualized by stains-all staining. band intensity quantification was performed using imagej (https://imagej.nih.gov/ij). to estimate the single-nucleotide misincorporation rate (r mis ) corresponding to the conversion from -mer to -mer in the sng:u mis assays, the values representing the fraction of -mer intensity (f) at all time points (t) were fitted to a single exponential rise equation: f = offset + amplitude × [1 - exp(-r mis × t)], where the offset is related to the portion of -mer contributed by the minimum amount of g:u mis prior to the addition of utp and the amplitude is related to the possibility of a -mer that eventually failed to extend to an -mer. the sng:c assays assessing the p to p conversion rate and the stability assays were performed in a two-step format as in the sng:u mis assays, with the first step identical to that of the sng:u mis assays. in the second step of the sng:c assays, the precipitate was resuspended in the regular reaction buffer or in reaction buffer with the nacl concentration elevated to mm. 
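as an illustration of the single exponential rise fit described above for the sng:u mis assays, the following python sketch estimates the offset, the amplitude and r mis from a set of (t, f) values. it is not the fitting routine used in this work, which is not specified in the text; the function names, the grid-search strategy and the synthetic data in the usage note are assumptions made only for illustration. the sketch exploits the fact that, for a fixed rate, the model is linear in the offset and the amplitude.

```python
import math


def exponential_rise(t, offset, amplitude, rate):
    """Single exponential rise: f = offset + amplitude * (1 - exp(-rate * t))."""
    return offset + amplitude * (1.0 - math.exp(-rate * t))


def solve_linear_terms(times, fractions, rate):
    """For a fixed rate, solve offset and amplitude by ordinary least squares."""
    g = [1.0 - math.exp(-rate * t) for t in times]
    n = len(g)
    sg, sgg = sum(g), sum(x * x for x in g)
    sf, sgf = sum(fractions), sum(x * y for x, y in zip(g, fractions))
    det = n * sgg - sg * sg
    if abs(det) < 1e-12:  # degenerate design (all g identical): skip this rate
        return None
    amplitude = (n * sgf - sg * sf) / det
    offset = (sf - amplitude * sg) / n
    sse = sum((exponential_rise(t, offset, amplitude, rate) - y) ** 2
              for t, y in zip(times, fractions))
    return offset, amplitude, sse


def fit_exponential_rise(times, fractions, rate_lo=1e-4, rate_hi=1e2,
                         rounds=4, points=200):
    """Coarse-to-fine search over the rate on a logarithmic grid.

    The rate bounds are hypothetical defaults; choose them to bracket the
    expected r_mis in the time unit of the data.
    """
    best = None  # (offset, amplitude, rate, sse)
    for _ in range(rounds):
        step = (math.log(rate_hi) - math.log(rate_lo)) / (points - 1)
        for i in range(points):
            rate = math.exp(math.log(rate_lo) + i * step)
            solved = solve_linear_terms(times, fractions, rate)
            if solved is None:
                continue
            offset, amplitude, sse = solved
            if best is None or sse < best[3]:
                best = (offset, amplitude, rate, sse)
        # narrow the window to one grid step on either side of the best rate
        rate_lo, rate_hi = best[2] / math.exp(step), best[2] * math.exp(step)
    return best[:3]
```

for example, with noise-free synthetic values generated from offset = 0.05, amplitude = 0.6 and rate = 0.5 (arbitrary units), `fit_exponential_rise` returns parameters close to these inputs. for real gel data, a dedicated nonlinear least-squares routine with parameter uncertainties would normally be preferable.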
the sng:c incorporation was quenched immediately ( min) following the manual mixing of ctp (for a final concentration of m) or after min. in the second step of the stability assays, the precipitate was resuspended in a high salt buffer ( / mm nacl, mm tris (ph . ), mm mgcl , mm dtt), and incubated at °c for to days. following the incubation, ctp was supplied at m final concentration and the reaction proceeded for min at °c. after subtracting the intensity of the p misincorporation product from the first step, the intensity fraction of p among the total amount of p and p ([p int -p m,int ]/[p int + p int -p m,int ]) was used to estimate the fraction of the ns b ec that survived the incubation. crystal structures of the highly homologous bvdv ns b (sequence identity = %) and related flavivirus ns help designate residues - as the rdrp core of the residue csfv ns b ( figure a ) ( , , ) . for a better description of the rdrp structure, we followed a nomenclature first used for describing the picornavirus rdrps by defining individual finger subdomains as index, middle, ring and pinky ( figure a ) ( ) . similar to hcv ns b but different from flavivirus ns , the pestivirus rdrp thumb domain contains a two-component priming element: an insertion (residues - ) between two thumb helices and a c-terminal tail (residues - ) ( figure a ). this priming element of the flaviviridae rdrp plays essential roles in de novo initiation through interactions with the template rna and the initiating ntps ( , , ). [table . x-ray diffraction data collection and structure refinement statistics.] residues beyond the rdrp core can be divided into three regions. residues - containing the ntd are not included in the bvdv ns b constructs used to determine the crystal structures ( , ) . residues - are structurally analogous to the 'n-terminal extension' (ne) of the flavivirus ns that contributes to the flavivirus rdrp activity ( , , ) . 
residues - at the c-terminus are highly hydrophobic and are likely analogous to the membrane anchoring helix of the hcv ns b ( , ) . in order to investigate the function of the ntd of csfv ns b, we first made an ns b construct comprising residues - with only the c-terminal hydrophobic region removed. for descriptive purposes, we herein refer to this construct as the wild type (wt). this construct was soluble and capable of gg dinucleotide-driven rna synthesis using a template rna sequence derived from the rdrp assays established for the hcv ns b and the jev ns ( , ) ( figure b and c). assays of this type, although not identical to de novo initiation assays, have typically been used to assess de novo-mode rna synthesis by viral rdrps and are different from assays using longer oligonucleotides as primers. using a -mer rna template (t ), we compared the rdrp activity in the presence and absence of the dinucleotide primer (p ), and found that in the latter case the overall activity was low and the sequences of the dominant products were not faithfully directed by the template sequence (supplementary figure s ). hence, we decided to use the p -based assays as the primary approach for the in vitro characterization of ns b. when t and p were used to generate the t /p rna substrate, a -mer product (p ) was expected when atp and utp were provided as the only ntp substrates ( figure b) . deletion of the ntd ( figure a , construct n- ) from the wt backbone did not apparently affect the p product level ( figure c , compare lanes and ). in contrast, deletion of residues - (figure a , construct c- ) in the c-terminal tail resulted in an apparent reduction of product level ( figure c , compare lanes and ). in order to precisely determine which region within residues - is critical for de novo synthesis, incremental c-terminal truncations ( figure d , seven constructs from c- to c- ) were made on the wt backbone. 
the results indicated that residues - are not essential, as residue removal in this region did not apparently affect product level ( figure d , compare lanes - to lane ). further truncations beyond residue led to an apparent reduction of product level ( figure d , compare lanes - to lane ), suggesting that the n-terminal half (residues - ) of the c-terminal tail is likely required for optimal de novo synthesis. very interestingly, the removal of the ntd resulted in an obviously higher level of a -mer misincorporation product (p m ) ( figure c , compare lanes and ), indicating that the ntd likely plays important roles in controlling rdrp fidelity. in the following sections, crystallography and enzymology were utilized to dissect the structure and function of the ntd. all ns b constructs used in the following polymerase assays have an intact c-terminal tail (i.e. ending at residue ) to preserve the capability of de novo-mode synthesis. 
[figure caption: (b) a schematic diagram of the de novo-mode rdrp assays. construct t /p was used as the rna substrate. when atp and utp were supplied as the only ntp substrates, the template strand t directed a -nucleotide (grey) extension of the dinucleotide primer p (black) to produce a -mer product (p ). the -mer product was generated through a g:u mis event. (c and d) comparison of the de novo rdrp activity for the wt ns b and its n-/c-terminal truncated forms. the oval-shaped band below the -nt marker (m) is the bromphenol blue mixed with the marker sample. note that the -nt marker was chemically synthesized and bears hydroxyl groups at the -end, and therefore migrated slower than the -nt product bearing a -phosphate. (e) global views of the csfv ns b crystal structure. the structure of the ns b c- construct is shown in orientations viewing into the front channel (left) and the ntp entry channel (right). the coloring scheme is consistent with that in panel a. (f) a structural comparison of the flaviviridae rdrp ring finger (motif f). the ring finger is shown as noodles with the α-carbon atoms of two highly conserved motif f residues in spheres. top row: csfv ns b constructs; bottom row: representative flaviviridae ns b constructs. pdb entries: yf (wt); yf (c- ); yf (c- ); y r (csfv) ( ); s f (bvdv) ( ); nb (hcv) ( ); k m (jev) ( ) .] 
with an aim to study the structure-function relationship of the csfv ns b including the ntd, we screened crystallization conditions for three constructs (wt, c- , c- ) and obtained single crystals for all constructs after multiple rounds of optimization from a single initial crystallization condition. the structure of the wt was solved at . Å resolution in space group p by molecular replacement using a bvdv ns b structure comprising residues - as the search model ( ) ( table ). the structures of the c- and c- were solved at . and . Å resolution, respectively, in the same crystal form by molecular replacement using the structure of the wt as the search model (table ) . these three structures, each containing one ns b molecule in the crystallographic asymmetric unit, are highly similar, with root mean square deviation (rmsd) values of . - . Å between the wt and the two c-terminal truncation mutants for all superimposable α-carbon atoms with % coverage of the resolved residues in the structure of the wt. we hereinafter choose the highest-resolution c- structure as the primary structure for illustration, with differences between structures discussed where necessary ( figure e and f). in the structures of the wt and c- , residues beyond are disordered. therefore, the three structures are not sufficient to provide a structural basis for why residues up to residue are essential for the de novo-mode synthesis, but are valid for assessing the ntd-rdrp interactions described as follows. the csfv ns b structures are relatively complete, with more than residues resolved for all - ns b residues ( figure e ). 
the disordered regions mainly include residues - at the n-terminus, residues - in the index finger, residues - in the thumb and c-terminal residues beyond position . for some of the constructs, the tips of the ring finger (residues - ) and the loop-like priming element insertion (residues - ) are disordered. the rdrp core of csfv ns b is structurally consistent with the bvdv structures, with an rmsd value of . Å ( % coverage) between the bvdv ns b n duplication mutant structure and the csfv c- structure ( ) . aside from global structural differences brought by small-scale rigid body movement between the rdrp domains, the most notable difference is the conformation of the ring finger. different from the observations in the wt bvdv ( ) , full-length jev ( ), hcv ( ) and the recently reported csfv rdrp structures ( ) that have a canonical fold optimal for ntp entry and binding, the csfv ns b ring finger in our structures bends toward the pinky finger, partially occupying the template rna binding channel ( figure f ). although this conformation is likely not compatible with de novo initiation, the normal activity of the wt and c- observed in the polymerase assays suggests that the canonical conformation also exists in solution and may be in equilibrium with the bent conformation observed in the crystal structures ( figure c and d) . the global conformation of our csfv ns b structures is consistent with the recently reported csfv structure, with an rmsd value of . Å ( % coverage) between the reported structure and our representative c- structure ( ) . these structures together reveal that the ntd adopts a globular α/β fold with an α-β-α-β-β-α pattern (figure ) . we were not able to identify any known structural domain highly homologous to the ntd using the dali server ( ) . 
considering that sequence homology of the ntd has also not been reported beyond pestiviruses, the ntd represents a highly unique viral rdrp fusion partner that may play important regulatory roles in rdrp function. the ntd interacts with the palm of the rdrp core intra-molecularly through a mixture of hydrophobic and hydrogen bonding interactions ( figure a and b) . these interactions mainly involve ntd residues - and - and the rdrp core residues - and - in the vicinity of motifs a and d, occluding ∼ Å of solvent accessible surface area ( figure b ). lying at the heart of this ntd-rdrp interface are two adjacent rdrp residues, y and e . the aromatic ring of the y side chain is wrapped by ntd residues - through hydrophobic interactions and its phenolic hydroxyl group forms a hydrogen bond with the carbonyl oxygen of the c backbone, while the carboxyl group of the e side chain forms two hydrogen bonds with the backbone amide nitrogen atoms of residues m and g and its two-carbon aliphatic side chain region interacts with p and v through hydrophobic interactions ( figure b ). the intra-molecular ntd interactions with the rdrp palm that controls the active site closure, together with the observation that the ntd deletion mutant n- exhibited a higher level of misincorporation in de novo rna synthesis, suggest a unique mode of fidelity modulation. the ntd-rdrp interface is different from the two types of methyltransferase (mtase)-rdrp interface first identified in the ns of jev and denv serotype (denv ) that occlude relatively large surface areas ( - Å ) ( , ) . firstly, the flavivirus mtase interacts with the rdrp fingers domain, while the pestivirus ntd interacts with the rdrp palm. secondly, the nature of the interface interactions is different, with the jev-interface featuring a conserved hydrophobic core, the denv -interface being primarily polar, and the csfv-interface having a mixture of interactions as mentioned above. 
to better understand the nature of the ntd-rdrp intra-molecular interface interactions and whether they regulate rdrp fidelity, we designed point mutations at the y and e sites in the context of the wt ns b and the n- and c-terminal deletion mutants ( figure c ). crystallization screenings were performed for all the ntd-containing mutant constructs and four of them were successfully crystallized, with one construct crystallized under two different conditions (table and supplementary table s ). among these five structures, three (y a and two forms of c- y a) maintain the wt conformation with the ntd-rdrp intra-molecular interface preserved. the other two structures, obtained using the c- construct bearing the y a-e a double mutation (c- aa) and the c- construct with the e a single mutation (c- e a), were solved in a space group different from that of the wt, c- and c- structures (table and supplementary table s ) ( figure d ). very interestingly, the c- aa and c- e a adopt a drastically different global conformation with the ntd-rdrp interface no longer maintained (c- aa structure shown in figure d ). the ntd-rdrp intra-molecular interactions are fully disrupted, and the ntd is instead involved in a non-intensive three-way interaction involving two symmetry-related neighboring ns b molecules in the crystal lattice. these structural data together suggest that the selection of the y -e mutation sites is valid for perturbing the ntd-rdrp interface, but the interface may not necessarily be fully disrupted by some of the mutations. in order to quantitatively assess whether the fidelity levels are modulated by the intra-molecular ntd-rdrp interface interactions, we established p-radioactivity based ntp misincorporation assays using the t /p construct utilized in the aforementioned de novo-mode synthesis assessment ( figure b and c) . when atp and utp were provided as the only ntp substrates, the -mer product (p ) was expected based on correct nmp incorporation. 
the slow accumulation of the -mer product (p m ) was derived from a g:u mis event (see materials and methods), since providing utp but not atp as the only substrate to the p -containing complex led to slow accumulation of the -mer (supplementary figure s ) . we therefore used the molar fraction of g:u mis -derived p m among the total amount of p and p m (mismatch fraction) to assess the fidelity of ns b ( figure a ). in a time course reaction, wt ns b exhibited a low level of misincorporation, with the mismatch fraction gradually increasing over time and reaching about . at the -min time point. when the ntd was absent (constructs n- and n- aa) or alanine mutations were simultaneously introduced at residues and (construct aa), ns b consistently exhibited a higher level of misincorporation than the wt, with mismatch fractions around . at the -min time point. when an alanine mutation was introduced at residue e , which is also on the rdrp surface but does not participate in the ntd-rdrp intra-molecular interactions ( figure a and c), the misincorporation level was comparable to that of the wt (figure a ). the -min and -min time points were chosen as representative for a comparison that also included single-point alanine mutants (y a, e a, and the n- form mutants) ( figure b and supplementary figure s a ). consistent with the observation in the time course experiments, every mutant either lacking the ntd or bearing mutation(s) at residues y and e showed a much higher level of misincorporation than the wt. the effect of the e a mutation was highly consistent with that of the aa mutation ( figure b and supplementary figure s a). since the polymerase misincorporation level can be affected by the type of misincorporation and the sequence context of the misincorporation site, we established a second type of regular ntp misincorporation assay using the same t /p construct for a more adequate assessment of the rdrp fidelity modulation brought by the ntd. 
when atp, utp and ctp were provided as the only ntp substrates, ns b was expected to synthesize a -mer product (p ) through correct nmp incorporation. a -mer was also observed, primarily derived from a c:u mis (see materials and methods) followed by two correct ump incorporation events, while only a minority of the -mer may arise from a cytosine-directed amp misincorporation (c:a mis ) or cytosine-directed cmp misincorporation (c:c mis ) (figure c and supplementary figure s c and d) . for simplicity, here we use c:u mis , the major misincorporation event, to describe these misincorporation assays. similar to the observation in the g:u mis assays, the aa mutant exhibited a much higher misincorporation level than the wt in the time course experiments (figure c ). at the -min time point, the aa mutant had a mismatch fraction of ∼ . (defined by the molar fraction of - -mer misincorporation products among the - -mer products) while the wt had a value of about . ( figure c and supplementary figure s b, compare lanes and ) . for assessment of all mutants, the -min and -min time points were chosen. the effects brought by ntd removal or point mutation(s) were consistent with those observed in the g:u mis assays ( figure d and supplementary figure s b) . overall, the mismatch fractions of the c:u mis reactions were higher than those of the g:u mis reactions, likely reflecting the differences in misincorporation types (g:u versus c:u) and sequence contexts in these assays. collectively, these biochemical data indicated that the rna synthesis fidelity of csfv ns b is fine-tuned by its ntd through the intra-molecular interactions with the rdrp palm. the misincorporation events in both the g:u mis and c:u mis assays were coupled to the slow accumulation of the correct product (p and p in g:u mis and c:u mis assays, respectively) through p -driven initiation. 
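the mismatch fraction used in the g:u mis and c:u mis assays above is a simple ratio, and can be sketched in python as follows. this is a minimal illustration, not the quantification script used in this work; the function names are hypothetical, and the sketch assumes that the supplied band intensities have already been converted to values proportional to molar amounts (for example, by correcting for the number of labeled nucleotides per product).

```python
def mismatch_fraction(correct_amount, mismatch_amount):
    """Molar fraction of misincorporation product among correct plus mismatch products.

    Both inputs are assumed to be proportional to molar amounts.
    """
    total = correct_amount + mismatch_amount
    if total <= 0:
        raise ValueError("no product signal to quantify")
    return mismatch_amount / total


def mismatch_time_course(points):
    """Map (time, correct_amount, mismatch_amount) tuples to (time, mismatch fraction)."""
    return [(t, mismatch_fraction(c, m)) for t, c, m in points]
```

for instance, `mismatch_fraction(90.0, 10.0)` gives 0.1, and `mismatch_time_course` applies the same ratio to each time point of a time course so that constructs can be compared at matched times.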
in order to assess the misincorporation that occurred solely in the elongation phase and to find out whether processivity, another key polymerase property, was affected by the ntd-rdrp interface mutations, we first needed to explicitly assess whether the p -containing complex had entered the elongation phase. as a fast catalytic rate and high stability are the two hallmarks of a polymerase elongation complex (ec), we tested the p to p single nucleotide extension (corresponding to the guanosine-directed cmp incorporation, or g:c) rate and the p -containing complex stability of the wt ns b (figure ) . the single-nucleotide g:c (sng:c) assay was designed in a two-step format. in the first step atp and utp were supplied to generate the p -containing complex, and cmp incorporation was initiated after the removal of the originally supplied atp/utp ( figure a ). immediately after the manual addition of ctp (corresponding to ' min'), p to p conversion was complete under the two different nacl concentrations tested ( figure b, lanes and ) , suggesting that the catalytic rate of this single nucleotide addition is much faster than the p and p accumulation observed in the g:u mis and c:u mis assays. to test the stability of the p -containing complex, we used nacl and/or heparin as the challenging agent in the stability assays. when the nacl concentration was at least mm, the p -driven p formation with atp/utp was not detected (supplementary figure s a and b). in contrast, the majority of the p -containing complex survived long-time nacl challenge at mm or mm concentration in the stability assays ( figure c-e) . after a -day nacl challenge, ∼ % of the p could be rapidly converted to p when ctp was supplied, while after a -day challenge ∼ % of the p could still be converted. when g/ml heparin or g/ml combined with mm nacl was supplied for the challenge, similar results were obtained (supplementary figure s c -e). 
these data together suggest that the p -containing complex is highly stable and can rapidly elongate, and therefore can be considered an ec. in order to test whether perturbing the ntd-rdrp interface interactions affects the processivity of the ec, we carried out the sng:c assays and the stability assays for the double mutant aa and the ntd-truncated construct n- . the results showed that both ns b variants behaved similarly to the wt, with very fast conversion of p to p (figure b ) and comparable stability upon challenge with nacl and/or heparin ( figure d and e; supplementary figure s d and e). these data indicate that ec processivity is likely not modulated by the ntd-rdrp intra-molecular interactions. to further dissect the mechanism of fidelity modulation by the ntd, we determined the misincorporation rate constants (k mis ) and the apparent michaelis constants (k m app ) for the wt and aa mutant using single-nucleotide g:u mis (sng:u mis ) assays. similar to the p to p conversion assays, the sng:u mis assay was designed in a two-step format ( figure a ). the ump misincorporation reactions converting p to p m were performed at °c under a series of utp concentrations after the removal of the originally supplied atp/utp ( figure a and b, step ). when utp was supplied at very high concentrations (e.g. and m), an inhibitory effect was observed for both ns b constructs ( figure c ). therefore, the misincorporation rates (rate mis ) measured under these concentrations were not used in the michaelis-menten curve fitting routines. the k mis value of the aa mutant is about . fold that of the wt ( . h − versus . h − ), while the k m app values for the wt and aa mutant are highly consistent ( m versus m) ( figure c ). these data together suggest that the fidelity modulation by the ntd likely does not act through initial ntp binding but is related to subsequent events leading to active site closure and the phosphoryl transfer reaction. 
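to make the michaelis-menten analysis above concrete, the following python sketch fits rate mis = k mis × [utp] / (k m app + [utp]) to a set of rate measurements while leaving out the high utp concentrations at which inhibition was observed. it is a sketch under stated assumptions rather than the curve fitting routine used in this work: the function names, the `max_conc` cutoff parameter, the search bounds and the grid-search approach are all hypothetical choices made for illustration.

```python
import math


def michaelis_menten(conc, k_mis, km_app):
    """Hyperbolic rate law: rate = k_mis * conc / (km_app + conc)."""
    return k_mis * conc / (km_app + conc)


def fit_michaelis_menten(concs, rates, max_conc=None,
                         km_lo=1e-3, km_hi=1e5, rounds=4, points=200):
    """Least-squares fit of k_mis and km_app.

    Data points with conc > max_conc (e.g. concentrations showing substrate
    inhibition) are excluded. For each candidate km_app on a logarithmic grid,
    the best k_mis follows in closed form because the model is linear in k_mis.
    All concentrations are assumed to be positive.
    """
    data = [(s, v) for s, v in zip(concs, rates)
            if max_conc is None or s <= max_conc]
    best = None  # (k_mis, km_app, sse)
    for _ in range(rounds):
        step = (math.log(km_hi) - math.log(km_lo)) / (points - 1)
        for i in range(points):
            km = math.exp(math.log(km_lo) + i * step)
            h = [s / (km + s) for s, _ in data]
            k = sum(a * v for a, (_, v) in zip(h, data)) / sum(a * a for a in h)
            sse = sum((k * a - v) ** 2 for a, (_, v) in zip(h, data))
            if best is None or sse < best[2]:
                best = (k, km, sse)
        # narrow the window to one grid step on either side of the best km_app
        km_lo, km_hi = best[1] / math.exp(step), best[1] * math.exp(step)
    return best[0], best[1]
```

comparing two constructs then amounts to calling `fit_michaelis_menten` on each data set with the same cutoff: a change in the fitted k mis with little change in k m app would correspond to the pattern reported above for the aa mutant relative to the wt.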
this is consistent with the structural observation that the ntd interacts with the rdrp palm in the vicinity of the active site closure-modulating motifs a and d but not the motif f-containing ring finger involved in ntp binding. the current work unravels the functional relationship between the pestivirus ns b ntd and its natural fusion partner rdrp. as a small and unique globular domain, the ntd is connected to the rdrp through a flexible five-residue linker that could allow global ns b conformational switching leading to disengagement of the ntd and rdrp, as suggested by the two crystallographic conformational states identified in this study. very interestingly, the intra-molecular interactions between the ntd and rdrp appear to have important roles in maintaining the rdrp fidelity at a relatively high level. structurally, the fidelity modulation by the ntd is likely achieved through its close proximity, in the closed conformation, to motifs a and d, which are the only regions undergoing backbone movement during the rdrp active site closure ( figure a and b) ( ) . when the ntd-rdrp intra-molecular interactions were perturbed or absent, the rdrp fidelity was apparently impaired, as suggested by data from all tested types of in vitro misincorporation assays. the pestivirus ns b therefore represents a unique rdrp that may modulate its fidelity level through the interaction with a naturally fused domain. although ntd-rdrp disengagement, a situation mimicked by the crystallographic open conformation state in a crystal lattice or the n- construct in solution, results in fidelity reduction, an engaged ntd-rdrp perturbed by point mutations can achieve a similar level of fidelity reduction. as suggested by the two open conformation crystal structures (c- aa and c- e a), ntd-rdrp disengagement might occur in solution. 
in order to find out which conformational state is dominant in solution, we performed gel filtration chromatography and trypsin proteolysis analyses for the wt and representative mutants. in the chromatography analysis, the crystallographic open conformation c- aa and the c- , or the aa mutant and the wt, had consistent retention volumes (supplementary figure s a-c) . in the trypsin proteolysis analysis, the wt and aa mutant had largely consistent proteolytic profiles (supplementary figure s d and e) . compared to the wt, the n- mutant clearly had characteristic proteolytic products, presumably related to the exposed rdrp palm surface due to the absence of the ntd (supplementary figure s f ). a careful comparison of the proteolytic profiles at representative time points for all three constructs showed that the aa mutant also produced a few proteolytic products that either differed obviously in amount from those of the wt or were consistent with the characteristic products of the n- . these observations suggest that although the closed conformation is dominant for the aa mutant in solution, the open conformation likely exists in solution as a minor fraction. we propose that the pestivirus ns b achieves its optimal fidelity level through the maintenance of native ntd-rdrp intra-molecular interactions. alteration of these interactions, either by naturally occurring or engineered mutations or through ntd interactions with viral and host factors, can result in a change in fidelity level. live attenuated vaccine design based on fidelity-adjusting mutagenesis has been a valid rational approach to develop vaccines for rna viruses ( , , ) . however, the majority of these attempts utilized fidelity modulation sites within the rdrp core, and identifying optimal mutation sites has not been straightforward. 
due to the unique fidelity regulation mechanism, the pestivirus ns b may serve as an ideal system to test and achieve fidelity alteration through mutagenesis design based on the intra-molecular ntd-rdrp interface, and sites on both sides of the interface can be utilized for mutations.
figure . mn + and the starting template sequence affected the overall synthesis activities of ns b and the effect brought by ntd deletion. (a) reaction flow-charts and product designations of the p -based assays and the de novo (p -free) assays both using t as the template. (b) assessment of the impact on the ns b synthesis by mn + and the ntd deletion. in the p -based assays the primary product was a -mer mismatch product (p m ) instead of the -mer correct product (p ) when atp and utp were supplied and mm mn + was supplemented. longer products (solid triangle in the top left gel) likely derived from mn + -induced misincorporation were also evident. the wt and n- read through the template t (indicated by band intensity enhancement at the migration position of the template and the solid triangle in the top right gel) in the de novo assays when atp, utp and gtp were supplied and mm mn + was supplemented. (c) the reaction flow-chart and product designations of the de novo assays using a -nt template (t ) with a different starting template sequence from the t . (d) assessment of the wt and n- synthesis using the t -based assays. the wt and n- had very low activities under mn + -free condition compared to data of the t -based assays. when mm mn + was supplemented, both the wt and n- read through (indicated by the solid triangles) the template t with atp, utp and gtp supplied. the reactions with atp, utp, gtp and -deoxy-ctp ( dctp) supplied were designed to largely inhibit the read-through activity and to allow a comparison of the synthesis levels of the wt and n- in the presence of mn + . in panels a and c, the starting trinucleotides of the t and t rna are underlined for comparison.
functional characterization of the ntd has been reported in both the bvdv and csfv systems, largely by comparing the in vitro rdrp activities of the wt ns b and their n-terminal truncated mutants ( , , ) . however, these studies focused on the overall de novo synthesis activities and the conclusions drawn by these studies have not been consistent. in a bvdv study using a -nt self-priming prohibited template derived from the -terminal sequence of the viral minus strand rna (starting template sequence: cau), ns b constructs with the n-terminal and residues deleted had % and % of de novo synthesis activities of the wt level with mm mn + present in the assay buffer ( ) . in a csfv study using a long template derived from viral sequences (length and sequence sense not specified), ns b constructs with the n-terminal and residues deleted had about % and % of the de novo synthesis activities of the wt level ( ) . in the recent csfv study reporting the ntd-containing crystal structure, a nt template very similar to the bvdv study (self-priming prohibited; starting template sequence: cau) was used and an ns b construct with the n-terminal residues deleted produced % wt-level -mer products with mm mn + present in the assay buffer ( ) . however, a -mer that likely arose from mn + -induced activities was the dominant product in this study. since mn + is known to facilitate the viral rdrp initiation and misincorporation ( ) and the starting template sequence may also affect the initiation level, we performed comparative analysis of our t /p -based assays, the de novo (p -free) t assays, and a type of de novo assays using a -nt template (t ) with a different starting sequence (cuu versus ccu in t ), using either our mn + free assay condition or with mm mn + supplemented (figure ) . 
in every assay type tested, the presence of mn + resulted in obviously enhanced overall activities and pronounced misincorporation activities (figure , compare lanes - , - , - , - with lanes - , - , - , - ) . for the two p -free assays with different starting template sequence under mn + -free conditions, the overall synthesis activities were very different. while the t -based assays had moderate amount of products (figure b, lanes and ) , the product level of the t -based assays was very low and barely detectable even under mm initiating gtp concentration (figure , lanes and ) . we compared the activities of the wt and the n- and found that the n- had a - % of the wt-level activities, no matter which assay types were tested or whether the mn + was used. collectively, these data confirmed that the mn + and the starting template sequence can affect both the overall activity level and the effect brought by the ntd deletion. when mn + was present, we did not observe obvious evidence of fidelity difference between the wt and the n- . it is probable that in these assays and in the mn +based assays in the literature, the fidelity effect brought by the ntd deletion identified in our p -based mn + -free assays can be masked by the presence of mn + . our data, in particular the data of the sng:u assays that specifically showed the fidelity modulation by the ntd-rdrp interface perturbation in the elongation phase ( figure ), provide unambiguous evidence for the linkage between the ntd and the rdrp fidelity. the viral rdrps have versatile global architecture beyond the conserved catalytic core comprising the palm, fingers and thumb. the picornaviridae rdrp (e.g. poliovirus d pol ) and the hcv ns b represent the least complicated rdrp without fused domains beyond the rdrp core ( , ) . the pestivirus ns b represents rdrps that have a small-size (∼ residues) fused domain, while the flavivirus rdrp (e.g. 
jev ns ) and the coronaviridae rdrp (namely nsp ) represent rdrps that have a medium-size (∼ - residues) fused domain/region ( , ) . the bunyavirales rdrp (namely l protein) represents rdrps that have several fused domains/functional regions ( ) , while the orthomyxoviridae rdrp complex (pa-pb -pb ) represents rdrps that function with intensive structural folding with other proteins ( ) . the diversity in rdrp global organization and its functional coupling to its fusion/folding partner(s) likely reflect the diversity in evolutionary origin of the viruses and virus-host co-evolution. to the best of our knowledge, the pestivirus ntd has neither high sequence homology nor high structural homology with proteins in any other systems. however, two structural implications likely support the evolutionary relationship between the pestivirus ns b and the flavivirus ns . as the first implication, the structures of the rdrp ne of pestiviruses and flaviviruses are highly analogous, although the ne sequences of the two virus genera are not obviously related ( ) . to date, no ne-like structures have been identified beyond these two viral genera. in amino acid sequence, pestivirus and flavivirus nes also similarly connect the rdrp core and an n-terminal region: the ntd in pestivirus or the mtase in flavivirus. although about three times the size of the pestivirus ntd, the mtase is also a single-domain module and also adopts an α/β fold with a seven-strand β-sheet flanked by several α-helices ( ). although quite speculative, the pestivirus ntd and the flavivirus mtase might come from the same origin and have achieved their current functions through divergent evolution. the atomic coordinates and structure factors for the reported crystal structures of the wt csfv ns b and its variants c- , c- , c- aa, y a, c- y a (form ), c- y a (form ) and c- e a have been deposited in the protein data bank under accession numbers yf , yf , yf , yf , ae , ae , ae and ae respectively. 
we thank dr zishu pan for providing the cloning material for the csfv ns b gene, dr zhongzhou chen and dr xiulian sun for helpful discussions, dr bo shu and dr guoliang lu for synchrotron data collection and helpful discussions, and liu deng for laboratory assistance. we thank synchrotron ssrf (beamlines bl u and bl u , shanghai, china) for access to beamlines, and the core facility and technical support of the wuhan institute of virology for access to instruments. supplementary data are available at nar online. key: cord- -noscodew authors: wu, rebecca p.; youngblood, derek s.; hassinger, jed n.; lovejoy, candace e.; nelson, michelle h.; iversen, patrick l.; moulton, hong m. title: cell-penetrating peptides as transporters for morpholino oligomers: effects of amino acid composition on intracellular delivery and cytotoxicity date: - - journal: nucleic acids res doi: . 
/nar/gkm sha: doc_id: cord_uid: noscodew arginine-rich cell-penetrating peptides (cpps) are promising transporters for intracellular delivery of antisense morpholino oligomers (pmo). here, we determined the effect of l-arginine, d-arginine and non-α amino acids on cellular uptake, splice-correction activity, cellular toxicity and serum binding for cpp−pmos. insertion of -aminohexanoic acid (x) or β-alanine (b) residues into oligoarginine r( ) decreased the cellular uptake but increased the splice-correction activity of the resulting compound, with a greater increase for the sequences containing more x residues. cellular toxicity was not observed for any of the conjugates up to μm. up to μm, only the conjugates with ⩾ xs exhibited time- and concentration-dependent toxicity. substitution of l-arginine with d-arginine did not increase uptake or splice-correction activity. high concentration of serum significantly decreased the uptake and splice-correction activity of oligoarginine conjugates, but had much less effect on the conjugates containing x or b. in summary, incorporation of x/b into oligoarginine enhanced the antisense activity and serum-binding profile of cpp−pmo. toxicity of x/b-containing conjugates was affected by the number of xs, treatment time and concentration. more active, stable and less toxic cpps can be designed by optimizing the position and number of r, d-r, x and b residues. steric-blocking antisense oligonucleotides (aos) are considered potential therapeutics for genetic diseases such as duchenne muscular dystrophy (dmd) and β-thalassemia. for their potential to be realized, however, the aos must be effectively delivered to cell nuclei. cationic lipoplex- or pei-based transfection methods used to deliver charged aos are not suitable for the delivery of uncharged aos such as phosphorodiamidate morpholino oligomers (pmo, figure ) ( ) and peptide nucleic acids (pnas) ( ) . 
conjugation of pmo to short cpps is a good method to enhance the cytoplasmic and nuclear delivery of pmo because the conjugates are simple to use and because the short peptides and their ao conjugates can be easily manufactured and characterized in a quality-controlled manner. examples of well-studied cpp−pmo conjugates include those with tat and oligoarginine peptides ( , ) . important considerations in the design of effective cpps include the ability to deliver ao efficiently, stability in living systems and toxicity. we have reported that tat and oligoarginine peptides are not stable in human serum ( ) , and are therefore ill-suited for in vivo applications. oligoarginine peptides incorporating non-α amino acids have been proven superior to oligoarginine alone. cpps containing -aminohexanoic acid (x) and β-alanine (b) were more stable in human serum than tat or oligoarginine peptides ( ) . a cpp−pmo conjugate, (rxr) −pmo, has been shown to be more efficient in the correction of pre-mrna mis-splicing ( ) and in inhibition of the replication of mouse hepatitis virus in vivo ( ) than an oligoarginine peptide. in addition, (rxr) −pmo conjugates have been shown to cause effective exon skipping in muscle cells from dmd dogs ( ) , in human muscle explants ( ) and in mdx mice ( ) , and to inhibit the replication of various viruses in cell cultures ( , ( ) ( ) ( ) and in mice ( , ) . the above studies have helped make it clear that unnatural amino acids can confer enhanced stability and activity, and therefore improve the potential of cpps to deliver therapeutic pmo. in pursuit of cpps with improved characteristics, we have carried out a structure-activity relationship study to investigate the effects of unnatural amino acid insertions in oligoarginine peptides on cellular delivery, nuclear antisense activity, toxicity and serum-binding characteristics of the resulting cpp−pmo conjugates. the unnatural amino acids studied here are x, b and d-arginine (r). 
we chose to study the x amino acid based on the successes of the (rxr) cpp in several studies as described in the previous paragraph. b and r amino acids were chosen because they have good enzymatic stability ( ) . the cpps are (i) the oligoarginine sequences, r and r , (ii) sequences with rxr, rx and rb repeats, as well as various combinations thereof, and (iii) sequences containing d-arginine, r , (rx) (rxr) , (rxr) and (rb) . the cpp−pmo conjugates were evaluated for their relative (a) cellular uptake, as determined by flow cytometry, (b) antisense activity, as determined by a splice correction assay ( ) and (c) cellular toxicity, as determined by mtt cell viability, propidium iodide membrane integrity and hemolysis assays, as well as by microscopic imaging. cpp nomenclature and sequences are listed in table . chemical structures of pmo and (rx) −pmo are shown in figure . the antisense pmo (cct ctt acc tca gtt aca) is designed to target a β-thalassemic mutant splice site present in the human β-globin intron of a positive-readout antisense activity assay system ( ) as described in the results section. synthesis of pmo, described previously ( , ) , and of the cpps, using standard fmoc chemistry ( ) , was performed at avi biopharma, achieving purities of % as determined by hplc and mass spectrometry analysis. conjugation of a cpp to a pmo through an amide linker, described previously ( ) , was followed with an additional purification step to remove nonconjugated peptide. samples were loaded on source s resin (amersham biosciences, pittsburgh, pa) in a ml biorad (hercules, ca) mt column at ml/min with running buffer a ( mm na hpo , % acetonitrile, ph . ) and purified into -s fractions with - % buffer gradient (buffer b: . m nacl, mm na hpo , % acetonitrile, ph . ) over min, using a biorad biologic low pressure chromatography system. the desired fraction was desalted by a method described previously ( ) . 
hplc and ms analyses revealed that the final product contained % cpp conjugated to full-length pmo, with the balance composed of cpp conjugated to incomplete pmo sequence, nonconjugated full-length or incomplete pmo. the hela pluc (pluc ) ( ) cell line was obtained from gene tools, llc (philomath, or). human liver cell line hepg was from american type culture collection (atcc, manassas, va). cells were cultured in rpmi medium supplemented with mm l-glutamine, u/ml penicillin and % fetal bovine serum (fbs) (hyclone, ogden, ut) at c in a humidified atmosphere containing % co . all treatments were carried out in optimem medium (gibco, inc., carlsbad, ca.) with or without fbs. cell uptake assay. pluc cells were seeded h prior to treatment in -well plates at cells/well. cells were treated with μm fluorescein-tagged cpp−pmo conjugates for h. after treatment, cells were washed with ml of
table . cpp nomenclature and sequences (name, then sequence).
rxrxrxrxrxrxrxb
(rx)    rxrxrxrxrxb
(rx)    rxrxrxb
(rxr)    rxrrxrrxrrxrxb
(rxr)    rxrrxrrxrrxrxb
(rxr)    rxrrxrrxrrxrxb
(rb)    rbrbrbrbrbrbrbrbb
(rb)    rbrbrbrbrbrbrbrbb
(rb)    rbrbrbrbrbrbrbb
(rb)    rbrbrbrbrbb
(rb)    rbrbrbb
the sequences of peptides are written from n to c terminus. r = arginine, r = d-arginine, x = -aminohexanoic acid, b = β-alanine. each peptide had an acetyl group at the n-terminus and a carboxyl group at the c-terminus.
cell viability assay and microscopy. pluc cells were seeded h before the treatment in -well plates at cells/well and then treated with the conjugates. the microscopic phase images of treated cells were visualized by a nikon diaphot inverted microscope (melville, ny), captured by an olympus digital camera and processed by the magnafire software (optronics, goleta, ca). after imaging the cells, the cell viability was determined by the methylthiazoletetrazolium (mtt, sigma, st. louis, mo) assay. mtt solution ( mg/ml) was added to the treatment medium to a final concentration of . mg/ml and incubated for h at c. 
% of the media of each well was then replaced with dmso containing . m hcl and further incubated for min at c, and the absorbance was measured at nm. percent cell viability was determined by normalizing the absorbance of each treated sample to the mean of untreated samples. propidium iodide membrane integrity assay. pluc cells were seeded h before treatment in -well plates at cells/well. cells were treated by removing the medium, washing with ml pbs and incubating with medium containing cpp−pmo conjugates. treatment medium was collected in tubes and cells were washed with pbs once, and then treated with ml of % trypsin for min at c. trypsin was neutralized with ml of the serum-containing medium. cells were transferred to the tubes containing previously collected treatment medium, pelleted by centrifugation at g for min, washed with pbs once, and re-suspended in ml of . mg/ml propidium iodide (pi) in pbs. cells were further incubated at c for min and analyzed by the beckman coulter cytometer ( events/sample collected). the hemolytic activities of the conjugates were determined in fresh rat blood according to a method described elsewhere ( ) . cellular uptake of cpp−pmo conjugates was investigated using the -carboxyfluorescein-tagged pmo (pmof) and flow cytometry. we chose μm as the treatment concentration because none of the conjugates caused any detectable cytotoxicity at this concentration, as demonstrated by the mtt and pi uptake assays. after treating with the conjugates, cells were treated with trypsin ( ) to remove membrane-bound conjugates. we found that a heparin sulfate washing step prior to trypsin treatment did not remove additional membrane-bound conjugates but caused some cellular toxicity (data not shown); therefore, only the trypsin treatment step was used in this study. 
to determine the effect of serum on cellular uptake of the various conjugates, uptake evaluation assays were carried out in medium containing various concentrations of serum or no serum. cellular uptake of cpp−pmof conjugates increased with the number of arginines and decreased with the x and/or b residue insertion (figure a and b). the oligoarginine r −pmof had a mean fluorescence (mf) of , nearly -fold higher than the produced by r −pmof, indicating that a difference of a single arginine can make a substantial difference in the biological properties of a cpp. insertion of an x or b residue in the r sequence reduced the mf from of r −pmo to , and of (rx) −, (rxr) − and (rb) −pmof, respectively ( figure a ). the number of rx or rb repeats affected cellular uptake, with conjugates having fewer rx or rb repeats generating lower mf ( figure b) . while the addition of % serum to the medium caused a decrease in the uptake of the r − or r −pmof conjugates, it increased the uptake of conjugates containing rx, rb or rxr motifs (figure a and c) . serum reduced the mf of r − and r −pmof from and to and , respectively, and increased the mf of (rx) −, (rxr) − and (rb) −pmof from , and to , and , respectively. these differences were statistically significant (figure a ). however, higher serum concentrations ( and %) decreased the uptake of (rxr) −pmof and oligoarginine−pmof ( figure c ). arginine stereochemistry (d versus l) had little effect on the uptake of cpp−pmof conjugates. we compared the mf of r −, (rb) − and (rx) −pmof with their respective d-isomer conjugates, r −, (rb) − and (rx) −pmof and found that there was no significant difference within each pair, as shown in figure d for the (rx) − and (rx) −pmof pair. nuclear antisense activity. the effectiveness of each cpp−pmo conjugate was determined in a previously described splicing correction assay ( ) , considered a reliable method to assess nuclear antisense activity of a steric-blocking ao. 
this assay utilizes the ability of steric-blocking aos to block a splice site created by a mutation in order to restore normal splicing. the luciferase coding sequence was interrupted by the human β-globin thalassemic intron which carried a mutated splice site at nucleotide . hela cells were stably transfected with this plasmid and are therefore referred to as pluc cells. in the pluc system, steric-blocking aos must be present in the cell nucleus for splicing correction to occur. advantages of this system include the positive readout and high signal-to-noise ratio. with this system the relative efficiencies of various cpps to deliver an ao with sequence appropriate for splice-correction to cell nuclei can be easily compared.
figure . cellular uptake of the cpp−pmof conjugates. pluc cells were treated with carboxyfluorescein-tagged pmos conjugated to cpps in optimem with or without % serum for h, followed by flow cytometry analysis. data are presented as mean fluorescence (mf) ± sd of six data points from two independent experiments. (a) cells treated with μm of r , r , (rxr) , (rx) or (rb) pmof conjugates. (b) cells treated with μm of (rx) , (rx) , (rx) , (rb) , (rb) or (rb) pmof conjugates in the absence of serum. (c) cells treated with μm of (rxr) or r pmof in media containing , , or % serum. (d) cells treated with (rx) or (rx) pmof in media containing % serum.
oligoarginine, rx, rxr and rb panels. the cpp conjugates with the highest nuclear antisense activities were (rxr) − and (rx) −pmo. figure a and b show luciferase activity normalized to protein of cells treated with various conjugates at and μm for h. at both concentrations, (rx) − and (rxr) −pmo were more effective than the other conjugates tested, with the difference more prominent in serum-containing medium at μm than at μm. cells treated with μm of either conjugate had luciferase activity - -fold over the background while the remaining conjugates yielded about - -fold over the background (figure a ). at μm, all conjugates generated higher luciferase activity than at μm, with (rx) −pmo and (rxr) −pmo again the most effective, followed by (rb) −pmo ( figure b ). figure c shows that at μm, the activity of rx or rb conjugates decreased as the number of rx or rb repeats in the cpp decreased. the peptides with or rx or rb repeats, (rx) , (rx) (rb) or (rb) , generated much lower luciferase activity than those with and repeats. number and position of x residues. having shown that (rb) −pmo had less activity than (rxr) −pmo or (rx) −pmo, we further investigated the effect of the number and position of x residues on the activity of conjugates. eleven cpp−pmo conjugates containing , , , , or xs were compared (figure ) . generally, cpps containing a higher number of xs had higher activities. at μm, (rx) −pmo ( x residues) had the highest activity followed by (rxr) −pmo ( x residues) and the conjugates with fewer xs had lower activities. at μm, three conjugates containing ( d), ( c) and ((rx) ) x residues had the highest activities, suggesting that the position of x residues affects activity. l-arginine versus d-arginine. arginine stereochemistry had little effect on the nuclear activity of the r − and (rb) −pmo conjugates but affected the (rx) −pmo ( figure ). replacement of the eight l-arginine residues in r − or (rb) −pmo with d-arginine residues did not change the luciferase activity generated over the - μm range ( figure a and b) . however, the replacement did cause a small but statistically significant decrease in the activity for (rx) −pmo at μm (p = . ) and μm (p = . ) ( figure c ). serum effect on activity. the effect of serum on the antisense activity of the conjugates depended on the cpp sequences, as shown in figure a -d. addition of % serum to the medium decreased the activity of oligoarginine−pmo conjugates (r −pmo and r −pmo) but increased activity of conjugates containing rxr, rx and rb repeats. 
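the luciferase readout reported above is described as activity normalized to protein and expressed as fold over the background. a minimal python sketch of that arithmetic is given here for illustration only; it is not the authors' analysis code. the function names and the numeric readings are hypothetical, and the background is assumed to be the protein-normalized signal of cells that received no conjugate.

```python
def rlu_per_ug(rlu: float, protein_ug: float) -> float:
    """Relative luciferase units per microgram of protein."""
    if protein_ug <= 0:
        raise ValueError("protein amount must be positive")
    return rlu / protein_ug


def fold_over_background(sample: float, background: float) -> float:
    """Ratio of a normalized sample signal to the normalized background."""
    if background <= 0:
        raise ValueError("background must be positive")
    return sample / background


# Hypothetical readings: raw RLU and micrograms of protein in each lysate.
background = rlu_per_ug(2_000.0, 20.0)   # assumed: cells without conjugate
treated = rlu_per_ug(30_000.0, 15.0)     # assumed: conjugate-treated cells
print(fold_over_background(treated, background))
```

normalizing to protein before taking the ratio keeps differences in cell number or lysis efficiency between wells from being read as differences in splice correction.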
the addition of % serum nearly doubled the luciferase activity of (rxr) −, (rx) − and (rb) −pmo at μm ( figure b ). we further studied this effect for (rxr) −pmo up to % serum. while the activity almost doubled as the serum concentration increased from to %, it gradually decreased as the serum concentration increased to %, at which point activity was similar to that in % serum and still significantly above the background. this 'up and down' profile was also observed with the μm (rxr) −pmo treatment. unlike (rxr) −pmo, the luciferase activity of r −pmo or r −pmo (data not shown) only decreased as the serum concentration increased, with an approximately % reduction in % serum and no activity in % serum. r −pmo or r −pmo did not display any detectable activity at μm, regardless of the serum concentration (data not shown). the cellular toxicity of the various cpp−pmo conjugates was determined by mtt-survival, propidium iodide (pi) exclusion and hemolysis assays and microscopic imaging. the mtt and pi exclusion assays measure metabolic activity and membrane integrity of cells, respectively. the hemolysis assay determines compatibility with blood. microscopic images were used to verify the mtt results and observe the general health of the cells. mtt assay. pluc cells were treated at concentrations ranging from to μm for h. as shown in figure , all conjugates, except (rx) and (rxr) , had no toxicity at up to μm. up to μm, (rx) and (rxr) conjugates exhibited no toxicity; at higher concentrations they reduced cell viability in a concentration-dependent manner, with (rx) being more toxic than (rxr) ( figure c and d) . replacement of l-arginine with d-arginine in r −, (rb) − and (rxr) −pmo did not change the viability profiles of these conjugates ( figure a-c) . surprisingly, the l→d replacement in (rx) −pmo decreased the toxicity. cell viability with μm treatment was % for (rx) −pmo, but % for (rx) −pmo ( figure d ). 
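the percent viability values quoted for the mtt assay follow the normalization given in the methods: the absorbance of each treated sample is divided by the mean absorbance of the untreated samples. a minimal python sketch of that calculation is shown here, assuming plain absorbance readings with no blank subtraction; the function name and the example readings are hypothetical.

```python
from statistics import mean


def percent_viability(treated, untreated):
    """Express each treated absorbance as a percentage of the untreated mean."""
    if not untreated:
        raise ValueError("at least one untreated control well is required")
    control = mean(untreated)
    if control <= 0:
        raise ValueError("mean untreated absorbance must be positive")
    return [100.0 * a / control for a in treated]


# Hypothetical absorbance readings, not data from the study.
untreated_wells = [0.80, 0.84, 0.76]
treated_wells = [0.40, 0.60, 0.80]
viability = percent_viability(treated_wells, untreated_wells)
print([round(v, 1) for v in viability])
```

using the mean of several untreated wells as the denominator means that well-to-well scatter in the controls is averaged out before any treated sample is scored.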
the eight conjugates containing fewer than x residues did not inhibit cell proliferation up to mm ( figure e ). monomers of arginine or x, individually or in combination, at mm each, produced no inhibition of cell proliferation ( figure f ). the toxicities of the cpp-pmo conjugates, (rxr) -pmo, d-pmo and c-pmo, were also evaluated in human liver hepg cells. we found that only (rxr) -pmo caused dose-dependent inhibition of cell proliferation, while the other two conjugates had no toxicity up to mm, the highest concentration tested in this study (data not shown).
microscopic images. we sought to verify the mtt results by collecting microscopic images of cells treated with mm of the conjugates. the images correlated well with the cell viability data. images of (rx) -, (rx) -, (rxr) rbr- ( c), (rxr) -pmo and vehicle-treated cells are shown in figure . cells treated with (rx) -pmo and (rxr) -pmo appeared rounded and detached from the culture well, and appeared to have fewer live cells. interestingly, cells treated with (rx) -pmo appeared to have normal morphology and cell density. the replacement of one x of (rxr) -pmo .
nuclear activity of cpp-pmo conjugates: number and position of x residues. pluc cells were treated with the conjugates having , , ( a, b, c and d), ( a, b and c), and x residues (see sequences in table ) in optimem medium with % serum for h. nuclear activity of a conjugate is indicated by relative luciferase activity (rlu) per microgram of protein. data represent a mean ± sd of - data points from four independent experiments.
propidium iodide exclusion assay. the effect of the conjugates on integrity of cell membranes was investigated by a propidium iodide (pi) exclusion assay. pi can only permeate unhealthy/damaged membranes, so positive pi fluorescence indicates compromised cell membranes. only (rxr) - and (rx) -pmo conjugates were found to significantly affect membrane integrity at higher concentrations (up to mm tested). 
figure a shows the histograms of pluc cells treated with (rxr) -pmo at mm for . , and h. the pi positive (pi+) region was defined by the cells permeabilized with ethanol (positive control) as indicated by the gate in the histogram. the pi histogram shifts from the pi-negative region to the pi-positive region in the longer incubations, indicating the conjugate caused membrane leakage in a time-dependent manner. the . - and -h treatments caused a slight shift towards the pi+ region, while the -h treatment produced a distinct peak which corresponded to % of cells that were in the pi+ region. figure b shows the histograms of cells treated with (rxr) -pmo at concentrations of , , , and mm for h. there was no significant pi uptake at concentrations up to mm. at higher concentrations, the pi+ population appeared and the percentage of pi+ cells increased as the treatment concentration increased, indicating that there were more leaking cells at the higher treatment concentration. similar concentration- and time-dependent pi uptake profiles were observed for (rx) -pmo but not for (rb) -pmo and the remaining conjugates (data not shown). addition of % serum to the treatment medium significantly reduced membrane toxicity for the (rxr) - ( figure c ) and (rx) -pmo conjugates (data not shown).
hemolysis assay. the (rxr) - and (rx) -pmo conjugates were tested in a hemolysis assay and found to be compatible with red blood cells. fresh rat red blood cells were treated with the conjugates at mm, pbs (background) or . % tx- (positive control). the supernatants of conjugate- and pbs-treated samples had small and similar amounts of free hemoglobin released, far lower than that of the tx- -treated samples ( figure d ).
the naturally occurring cpps such as tat peptide are not stable in blood and neither are oligolysine/oligoarginine ( ) , rendering these cpps unfavorable as transporters for therapeutic aos. 
we reasoned that one approach to improve stability would be to use non-α amino acids or d-amino acids. in this study, we investigated whether incorporation of -aminohexanoic acid (x), β-alanine (b) and d-arginine (r) amino acids into the cpp would affect cellular delivery, antisense activity, toxicity and serum binding of the resulting cpp-pmo conjugates. we found that cpp-pmof conjugates containing x/b residues did not enter cells as efficiently as r - and r -pmo conjugates. this is consistent with our previous finding for the (rxr) conjugate ( ). we have found that cell surface proteoglycans were involved in binding of the tat-, r f - and (rxr) -pmo conjugates, with the (rxr) conjugate having the lowest binding affinity. insertion of x into an oligoarginine cpp reduces the charge density and may lead to decreased binding affinity for proteoglycans. despite the lower cellular uptake of x/b-containing cpp-pmo conjugates, they generated higher antisense activities in the cell nucleus than oligoarginine-pmo. we have found that endocytosis was the internalization mechanism (at least primarily) for oligoarginine- and (rxr) -pmo conjugates. indication of different uptake mechanisms was not found among these conjugates ( ) . therefore we hypothesize that x/b-containing conjugates have a greater ability to escape from endosomes/lysosomes than oligoarginine conjugates by a mechanism yet to be studied. the number of x residues affects both the nuclear antisense activity and the toxicity of conjugates. the cpp-pmo conjugate with x residues [(rx) -pmo] had the highest activity followed by one with xs [(rxr) -pmo] (figures and ) . however, these conjugates were toxic to cells at higher concentrations, which may be a concern when considering potential applications for in vivo delivery of pmo. replacement of all xs with bs decreased both toxicity and antisense activity. 
the combination of - xs with several b residues yielded cpps with no detectable toxicity, and at some concentrations several of them had antisense activity similar to that of (rx) -pmo. we think this type of cpp, having bs and fewer than xs, will offer balanced activity and low toxicity as well as stability, and has considerable potential for delivery of therapeutic aos. further investigation into the toxicity and activity versus dosing levels of these cpps in vivo is warranted. surprisingly, the replacement of l-arginine with d-arginine enhanced neither uptake nor antisense activity for oligoarginine, or x- and b-containing conjugates. in the case of (rx) -pmo, the replacement actually caused a small but statistically significant decrease in activity. our observation is different from the results reported by others ( , ) who found that d-cpps had higher cellular uptake than l-cpps, although no biological functional cargo was used in their study. the difference between results may be due to the type and size of cargos and the cell lines used for the assays. whether the use of d-arginine-containing peptides results in superior cpp-pmo functional activity in vivo remains to be tested. we attempted to understand the nature of (rx) - and (rxr) -pmo toxicity. it is apparent that these two conjugates caused little immediate membrane damage with . or h treatment at concentrations as high as mm (figure ). however, these two conjugates had dose-dependent toxicity with h treatment as shown by the leaky cell membranes and fewer cells compared to controls ( figure c and d, figure ). interestingly, the replacement of xs with bs in (rx) -pmo abolished the toxicity, and the replacement of l-arginine with d-arginine reduced the toxicity of (rx) -pmo ( figure ). we have found that (rx) -pmo was completely stable and the peptide portion of (rb) -pmo was only partially degraded, whereas the peptide portion of (rx) -pmo was completely degraded in cells ( ) . 
we wondered whether the difference in toxicity among (rx) , (rb) and (rx) -pmo conjugates was caused by differences in intracellular stability, resulting in the metabolized products of (rx) -pmo producing toxicity. the identifiable metabolized products of (rx) -pmo were xrxb-pmo and xb-pmo ( ) but neither product had any detectable toxicity as measured by mtt assay (data not shown). it is possible that the cpp portion was degraded into free amino acids and/or smaller peptide fragments which were toxic. however, our investigation revealed that neither free r nor x, alone or in combination, caused cellular toxicity. another possibility is that because of the high hydrophobicity of x compared to b, x in combination with positively charged arginine residues leads to toxicity not generated by b residue combinations. however, this explanation does not account for the difference in toxicity observed between (rx) -pmo and (rx) -pmo, which have the same hydrophobicity. perhaps the toxicity of (rxr) -pmo and (rx) -pmo was caused by the peptide fragments that we could not identify by mass spectrometry. unlike the toxicity difference between (rx) - and (rx) -pmo, the l→d replacement did not change the toxicity of (rxr) -pmo ( figure ). substitution of either one r (rxr) or two r (rxr) from the rxr repeat neither reduced nor increased the toxicity profile of (rxr) -pmo. at this point, we do not fully understand the mechanisms of (rxr) - and (rx) -pmo conjugate toxicity, but look forward to studying this topic further. the serum effect on the activity of cpp-ao conjugates is an important issue when considering potential in vivo applications. x/b-containing conjugates were still active in % serum while oligoarginine conjugates were not. the greater stability of the x/b-containing conjugates to serum enzymes is likely a factor contributing to their high activity. 
the loss of activity in high serum concentrations makes oligoarginine cpps undesirable as potential therapeutic ao carriers. in summary, we have found that the x/b-containing cpp-pmo conjugates are superior to oligoarginine-pmo conjugates for the following reasons: they display higher activity in cell nuclei, are less affected by serum and are more stable in blood ( ) . the toxicity of the x/b-containing cpps can be reduced by keeping the number of x residues below while still maintaining a reasonable delivery efficacy and stability. this study provides a basis for further optimization of cpp sequence using r, r, x and b residues in the interest of further reducing toxicity and increasing antisense activity, which will likely lead to more effective ao transporters for potential therapeutic applications.
morpholino antisense oligomers: the case for an rnase h-independent structural type
sequence-selective recognition of dna by strand displacement with a thymine-substituted polyamide
hiv tat peptide enhances cellular delivery of antisense morpholino oligomers
cellular uptake of antisense morpholino oligomers conjugated to arginine-rich peptides
stability of cell-penetrating peptide-morpholino oligomer conjugates in human serum and in cells
vectorization of morpholino oligomers by the (r-ahx-r)( ) peptide allows efficient splicing correction in the absence of endosomolytic agents
antiviral effects of antisense morpholino oligomers in murine coronavirus infection models
antisense oligonucleotide-induced exon skipping restores dystrophin expression in vitro in a canine model of dmd
induced dystrophin exon skipping in human muscle explants
morpholino oligomer mediated exon skipping averts the onset of dystrophic pathology in the mdx mouse
inhibition of flavivirus infections by antisense oligomers specifically suppressing viral translation and rna replication
inhibition of multiple subtypes of influenza a virus in cell cultures with morpholino oligomers
inhibition of coxsackievirus b in cell cultures and in mice by peptide-conjugated morpholino oligomers targeting the internal ribosome entry site
up-regulation of luciferase gene expression with antisense oligonucleotides: implications and applications in functional assay development
morpholino antisense oligomers: design, preparation, and properties
uncharged morpholino-based polymers having phosphorus containing chiral intersubunit linkage
fmoc solid phase peptide synthesis
in vitro cytotoxicity testing of polycations: influence of polymer structure on cell viability and hemolysis
cell-penetrating peptides. a reevaluation of the mechanism of cellular uptake
arginine-rich molecular transporters for drug delivery: role of backbone spacing in cellular uptake
the design, synthesis, and evaluation of molecules that enable or enhance cellular uptake: peptoid molecular transporters
we would like to thank dr jon d. moulton and david a. stein for their critical reading of the manuscript. we are grateful to dr adams amantana for providing us with fresh rat blood and the chemistry team at avi biopharma for the synthesis, purification and analysis of peptides and pmo. funding to pay the open access publication charges for this article was provided by avi biopharma, inc.
conflict of interest statement. none declared.
key: cord- -klsl nzn authors: zhang, dapeng; iyer, lakshminarayan m.; aravind, l. title: a novel immunity system for bacterial nucleic acid degrading toxins and its recruitment in various eukaryotic and dna viral systems date: - - journal: nucleic acids res doi: . /nar/gkr sha: doc_id: cord_uid: klsl nzn
the use of nucleases as toxins for defense, offense or addiction of selfish elements is widely encountered across all life forms. 
using sensitive sequence profile analysis methods, we characterize a novel superfamily (the sukh superfamily) that unites a diverse group of proteins including smi /knr , pgs , fbxo , skip , syd, herpesviral us , irs and trs , and their bacterial homologs. using contextual analysis we present evidence that the bacterial members of this superfamily are potential immunity proteins for a variety of toxin systems that also include the recently characterized contact-dependent inhibition (cdi) systems of proteobacteria. by analyzing the toxin proteins encoded in the neighborhood of the sukh superfamily we predict that they possess domains belonging to diverse nuclease and nucleic acid deaminase families. these include at least eight distinct types of dnases belonging to hnh/endovii- and restriction endonuclease-fold, and rnases of the endou-like and colicin e -like cytotoxic rnases-folds. the n-terminal domains of these toxins indicate that they are extruded by several distinct secretory mechanisms such as the two-partner system (shared with the cdi systems) in proteobacteria, esat- /wxg-like atp-dependent secretory systems in gram-positive bacteria and the conventional sec-dependent system in several bacterial lineages. the hedgehog-intein domain might also release a subset of toxic nuclease domains through auto-proteolytic action. unlike classical colicin-like nuclease toxins, the overwhelming majority of toxin systems with the sukh superfamily is chromosomally encoded and appears to have diversified through a recombination process combining different c-terminal nuclease domains to n-terminal secretion-related domains. across the bacterial superkingdom these systems might participate in discriminating `self’ or kin from `non-self’ or non-kin strains. using structural analysis we demonstrate that the sukh domain possesses a versatile scaffold that can be used to bind a wide range of protein partners. 
in eukaryotes it appears to have been recruited as an adaptor to regulate modification of proteins by ubiquitination or polyglutamylation. similarly, another widespread immunity protein from these toxin systems, namely the suppressor of fused (sufu) superfamily has been recruited for comparable roles in eukaryotes. in animal dna viruses, such as herpesviruses, poxviruses, iridoviruses and adenoviruses, the ability of the sukh domain to bind diverse targets has been deployed to counter diverse anti-viral responses by interacting with specific host proteins.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: aravind@ncbi.nlm.nih.gov
the use of toxins as a defensive, offensive or selfish addictive strategy is observed across the tree of life. interestingly, a diverse set of protein toxins from distantly related organisms have a propensity to catalyze nucleic acid modifying or cleaving reactions in their target cells. well-known examples come from across the phylogenetic spectrum: plants deploy toxins such as ricin, abrin and modeccin to protect their seeds; these are rna n-glycosidases that remove a specific purine base from eukaryotic s rrna to render it non-functional ( , ) . in a similar vein, the fungal toxin α-sarcin, produced by fungi such as aspergillus giganteus, acts as a specific endonuclease that cleaves the s rrna at a position close to the site of action of the above plant toxins ( ) . among animals the use of nucleic acid-targeting enzymes is observed in the venoms of snakes ( ) . several animals, including vertebrates, are known to deploy cytotoxic rnases, such as rnase a, which potentially target rna from bacteria and viruses ( ) . bacteria are a particularly rich source of nucleic acid-targeting toxins, which are deployed in various contexts. pathogenic bacteria secrete rna n-glycosidases that target the s rrna of eukaryotic hosts similar to the ricin-like plant toxins ( ) . 
bacteria are also known to deploy rnase and dnase bacteriocins in intra-and possibly inter-specific competition that target molecules such as trna and genomic dna ( ) . the best known are the plasmid-borne toxins of the model bacterium escherichia coli, which kill closely related competing strains. of these colicins e , e and e cleave rrna, colicins e and d cleave trna and colicins e , e , e and e cleave dna ( ) . additionally, bacterial genomes are also colonized by systems such as the toxin-antitoxin systems and restriction-modification systems which produce enzymes that function as nucleic acid-targeting toxins ( ) ( ) ( ) ( ) . in these systems the primary function of the toxin is to kill the host bacterial cell if the toxin encoding system is genetically disrupted in some way ( , ) . thus, they act as selfish elements that forcibly 'addict' the host to maintain them in genomes or plasmids. in many of these cases, organisms or genetic elements that produce the toxin also produce an antitoxin or immunity protein that renders the 'self' resistant to the action of the toxin. the study of these toxins and antitoxins or immunity proteins has not only expanded our understanding of the evolution of inter-species competition but also thrown considerable light on the biochemistry of nucleic acids and other molecules that interact with them ( ) ( ) ( ) ( ) . in practical terms these nucleic acid-targeting toxins and antitoxins/immunity proteins are potential reagents that could be utilized in numerous biotechnological contexts ranging from chemical analysis of nucleic acids to bio-defense. availability of an enormous wealth of genomic sequence data provides opportunities to identify novel versions of such toxins and associated immunity and delivery systems through computational analysis, thereby opening the door for new investigations on nucleic acid-modifying enzymes. 
the first step in this process requires detailed case-by-case analysis of protein sequences and structures using the best available methods for detecting sequence and structure similarity. results from such an analysis of protein structures need to be further combined with in-depth analysis of genomic contexts and domain architectures to glean novel functional associations. finally, these results need to be placed in the context of phyletic patterns of the occurrence of various components of the system in order to reconstruct a total picture of their natural history and predict aspects of their biochemistry and biological functions. indeed, such a strategy has allowed the prediction of novel biochemical activities and has laid the foundations for further systematic investigations of the toxin-antitoxin and peptide-modification systems of prokaryotes ( , ( ) ( ) ( ) . in this article we present the results of such a strategy that helped us uncover and characterize a remarkable, diverse class of nuclease toxins, whose immunity appears to depend primarily on a protein superfamily prototyped by the saccharomyces cerevisiae protein smi /knr . the smi /knr protein was first recovered in a screen for s. cerevisiae mutants that confer resistance to the killer toxin produced by the competing yeast species hansenula mrakii ( , ) . smi /knr was shown to physically interact with the tyrosyl trna synthetase and it appears to functionally interact with the non-ribosomal peptide ligase dit , with a trna-synthetase-like catalytic domain, in the efficient synthesis of dityrosine, a peptide metabolite that is typical of fungal spore-walls ( ) . interestingly, it also shows synthetic lethal and physical interactions with a great number of proteins ( ) . nevertheless, its exact significance and biochemical action have remained poorly understood ( ) . 
parallel studies recovered other smi /knr eukaryotic homologs namely fbxo , a subunit of a scf-type e ubiquitin ligase in vertebrates ( ) , and pgs , a subunit of the tubulin polyglutamylase, which is a non-ribosomal peptide-ligase that links multiple glutamates to the g-carboxyl group of target proteins ( , ) . exploratory sequence surveys suggested that smi /knr homologs are also abundantly represented in bacteria (smi /knr domain, pfam: pf ). furthermore, our preliminary contextual analysis of conserved gene neighborhoods of these representatives suggested that they might be functionally linked to potential nucleases. very recently, a novel contact-dependent inhibitory (cdi) toxin system has been reported in proteobacteria that delivers multiple nuclease toxins into target cells ( , ) . our observations indicated that smi /knr homologs are potential immunity proteins in a subset of these cdi systems. together, these observations prompted us to systematically investigate both the bacterial and eukaryotic smi /knr homologs and explore their potential connection to nuclease toxins, their delivery and immunity against them. as a result we were able to identify a diverse group of previously unknown nuclease toxins and immunity proteins that are present across all the major bacterial lineages with considerable significance for intra-specific and host interactions. this investigation also allowed us to uncover diverse, previously unknown nuclease and deaminase domains in bacterial toxins and predict their folds and biochemical mechanisms. we also show that the smi /knr homologs, which were ultimately derived from bacterial toxin-immunity systems, have been recruited by eukaryotic double-stranded dna viruses to perform multiple roles in intracellular survival and morphogenesis of these viruses. 
finally, we present evidence that the ability of the conserved domain in the smi /knr superfamily of proteins to bind structurally diverse protein partners has been re-used in eukaryotes as a means to recruit targets to peptide-modifying systems such as the ubiquitin and the polyglutamylase systems.
iterative sequence profile searches were run using the psi-blast program ( ) against the non-redundant (nr) protein database of national center for biotechnology information (ncbi). similarity based clustering for both classification and culling of nearly identical sequences was performed using the blastclust program (ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html). the hhpred program was used for profile-profile comparisons ( ) . structure similarity searches were performed using the dalilite program ( ) . multiple sequence alignments were built by muscle ( ) , promals ( ) , kalign ( ) and pcma ( ) programs, followed by manual adjustments on the basis of profile-profile and structural alignments. the consensus for each alignment was calculated and colored by the chroma program ( ) . secondary structures were predicted using the jpred and psipred programs ( , ) . for earlier known domains the pfam database ( ) was used as a guide, though the profiles were often augmented by addition of newly detected divergent members that were not detected by the original pfam models. clustering with blastclust followed by multiple sequence alignment and further sequence profile searches were used to identify other domains that were not present in the pfam database. signal peptides and transmembrane segments were detected using the tmhmm and phobius programs ( , ) . contextual information from prokaryotic gene neighborhoods was retrieved by a custom perl script that extracts the upstream and downstream genes of the query gene and uses blastclust to cluster the proteins to identify conserved gene-neighborhoods. 
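the perl script itself is not given in the text. as a rough illustration only, the neighborhood step could be sketched in python as below, assuming each contig is available as an ordered list of gene identifiers and that a gene-to-cluster mapping (standing in for parsed blastclust output) has already been built. the function names, the `flank` window and the `min_count` cutoff are hypothetical and are not taken from the authors' scripts.

```python
from collections import Counter


def neighborhood(genes, query, flank=2):
    """Return up to `flank` genes on each side of `query`.

    `genes` is the list of gene ids in their order along one contig.
    """
    i = genes.index(query)
    upstream = genes[max(0, i - flank):i]
    downstream = genes[i + 1:i + 1 + flank]
    return upstream + downstream


def conserved_clusters(contigs, queries, cluster_of, flank=2, min_count=2):
    """Count, per protein cluster, how many query neighborhoods contain it.

    `cluster_of` maps gene id -> cluster label (hypothetical stand-in for
    clustering output). Clusters seen in at least `min_count`
    neighborhoods are reported as candidate conserved neighbors.
    """
    counts = Counter()
    for genes, query in zip(contigs, queries):
        # Use a set so a cluster is counted once per neighborhood.
        seen = {cluster_of[g] for g in neighborhood(genes, query, flank)
                if g in cluster_of}
        counts.update(seen)
    return {c: n for c, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy data: two contigs, each with one query gene (q1, q2).
    contigs = [["a1", "a2", "q1", "a3", "a4"], ["b1", "q2", "b2"]]
    queries = ["q1", "q2"]
    cluster_of = {"a1": "Z", "a2": "nuclease", "a3": "X",
                  "b1": "nuclease", "b2": "Y"}
    print(conserved_clusters(contigs, queries, cluster_of, flank=1))
```

in this toy input only the cluster labelled "nuclease" sits next to both query genes, so it is the only one reported. a real analysis would also need strand and intergenic-distance information, which this sketch ignores.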
phylogenetic analysis was conducted using an approximately-maximum-likelihood method implemented in the fasttree . program under default parameters ( ) . the modeller v program ( ) was utilized for homology modeling of the structure of the irs n-terminal domain. structural visualization and manipulations were performed using vmd ( ) and pymol (http://www.pymol.org) programs. the in-house tass package, a collection of perl scripts, was used to automate aspects of large-scale analysis of sequences, structures and genome context (anantharaman, v., balaji, s., and aravind, l., unpublished data).
sequence profile searches and structural comparisons reveal a vast superfamily of smi -related proteins
as a first step to computationally characterize the smi /knr protein, we analyzed it using the seg program to identify potential globular regions in it ( ) . this indicated the presence of a single globular domain that was then used as a seed in iterative sequence profile searches of the nr database with psi-blast and jackhmmer from the hmmer package. in addition to recovering other eukaryotic proteins with a homologous region, such as fbxo from animals, skip from plants and pgs , a subunit of tubulin polyglutamylase complex, the search also recovered a large number of bacterial proteins such as the bacillus subtilis yobk. given the great diversity of sequences recovered prior to convergence from bacteria, we initiated transitive sequence profile searches with several distinct bacterial starting points to achieve maximal coverage in terms of detection. we also noted that a crystal structure for yobk has been solved by the joint structural genomics initiative (pdb: prv). we used this structure as a query for structure similarity searches using the dalilite program and recovered hits to four other homologous structures ( ffv, pag, icg, d p; z > . ). of these, ffv was the structure of the earlier characterized protein syd from e. 
coli which interacts with secy, a key component of the sec-dependent protein secretion system that traffics proteins across the bacterial inner membrane ( ) ( ) ( ) . consistent with this, we also found that syd homologs were recovered with borderline e-values (e $ . - . ) in the above jackhmmer and psi-blast searches. hence we included the syd homologs in the profiles to further expand the relationships of the group of proteins homologous to smi /knr . at convergence, some of these searches also recovered, with borderline e-values (e $ . ), proteins from certain dna viruses such as fpv (gi: ) from the fowl poxvirus, and the us family of proteins (e.g. us , ul , irs and trs ) from herpesviruses. to confirm the relationship of these proteins to smi we used them in a profile-profile comparison search with the hhpred program against a library of hmms created using the sequence of polypeptides in the pdb database as a query. these searches recovered the structures prv, ffv and icg as the best hits with significant p-values (p = - to - ). furthermore, examination of the hits produced by the viral proteins in profile-profile comparisons showed that most of the versions from herpesviruses possessed two tandem repeats of the domain homologous to smi . additional transitive searches with these viral proteins revealed that homologous proteins are present in a number of distantly related or unrelated dna viruses. finally, the above searches also recovered hits to two distinct groups of proteins each with over representatives in the nr database, predominantly from bacteria, typified respectively by ca_c (gi: ) from c. acetobutylicum and sgr_ (gi: ) from s. griseus. profile-profile comparisons with the hhpred program using alignments of each of these groups of proteins also confirmed their relationship to the smi -like proteins via recovery of significant hits (e = - to - ) to hmms generated using the sequences of prv and ffv as best hits. 
thus, it became clear that smi /knr defines a large superfamily of conserved domains that is widespread in bacteria, eukaryotes and various dna viruses but practically absent in currently sequenced archaeal genomes. we accordingly named it the sukh (for syd, us , knr homology) domain superfamily. despite the low average pairwise sequence similarity across this superfamily, all representatives are known or predicted to possess a similar core fold comprising four conserved helices and six strands ( figure , supplementary data). strands and form a β-hairpin and the strands - form a -stranded β-meander; however, the β-hairpin and the β-meander show only limited or no hydrogen-bonding along their length, despite being spatially beside each other. thus, the structural core of the sukh domain can be described as a split β-sheet with only weak interaction between its two parts. this structural peculiarity could potentially be critical for the functional interactions of the domain (see below). based on sequence-similarity-based clustering and phylogenetic analysis five major groups can be recognized within the sukh domain superfamily (figure , supplementary data). the first of these, and the most widespread, is the one typified by smi /knr , fbxo , skip , pgs and yobk (that entirely includes the pfam model pf , 'smi /knr family', and additional proteins not detected by that model within it) and is seen in both bacteria and eukaryotes. this ensemble, which we term the smi -like or sukh- group, includes the majority of the sukh domains. we term the second group, prototyped by syd, the syd-like or sukh- group. this group is largely restricted to the gammaproteobacteria and firmicutes. the sukh- group prototyped by ca_c (gi: ) is widely distributed across most bacterial lineages. the group prototyped by sgr_ (gi: ), the sukh- group, is again seen in several bacteria and sporadically in fungi. 
the sukh- or us -like group is present in fowl adenoviruses, various vertebrate iridoviruses, archosaur poxviruses (crocodilepox virus and fowlpox virus), and in multiple copies in several herpesviruses (representatives of the alphaherpesvirus, betaherpesvirus and alloherpesvirus clades). members of this group are also encoded by genomes of the early-branching chordate branchiostoma, the salmon, the frog rana catesbeiana and the duck-billed platypus, where they appear to have been acquired from the genomes of integrated herpesviruses ( ) . phylogenetic analysis of each group, along with the phyletic patterns, strongly suggests that sukh domain proteins have been widely disseminated both within and across the superkingdoms via extensive lateral transfer (supplementary data). in light of this pattern, the near complete absence of this superfamily in archaea suggests that there could be certain specific functional barriers that prevent acquisition of the sukh domain by that superkingdom. phylogenetic analysis strongly suggests that the groups sukh- - are monophyletic clades. the largest group, sukh- , is likely to represent the ancestral group from within which the above clades have diversified through rapid sequence divergence. contextual information gleaned from gene neighborhoods in prokaryotes and domain architectures of proteins, when combined with sequence analysis, can be a powerful means of discerning protein function ( ) . indeed, this method has proven particularly effective in both function prediction and identification of new analogous systems, using the organizational syntax of tightly linked genes, in the case of toxin-antitoxin and restriction-modification systems ( , , , , ) . to better understand the role of the sukh domain we performed a detailed analysis of the gene-neighborhoods of all bacterial genes encoding a protein with this domain ( figure ). 
consequently, we were able to identify at least three striking themes among the gene-neighborhoods of this superfamily. firstly, across the bacterial phylogenetic tree we found numerous genomic neighborhoods that linked two or more adjacent genes encoding sukh domain proteins. in certain cases, e.g. b. grahamii (gi: ), we found tandem arrays with up to six paralogous sukh superfamily genes ( figure ). we found that in several instances these paralogous versions are not closely related and in certain cases adjacent paralogs might belong to completely different sukh groups. for example, we found combinations of genes encoding proteins belonging to the smi -like (sukh- ), syd-like (sukh- ), sukh- and sukh- groups in the same neighborhood in several bacteria such as b. cereus mm and various streptomyces species (figure ). this observation suggests that there is selective pressure for the diversification of the linked sukh domain proteins encoded in a gene neighborhood, either via sequence divergence or via independent assembly of neighborhoods from distantly related paralogs of different groups. this situation, wherein multiple paralogous genes are linked together as tandem arrays in a neighborhood, is relatively rare in bacteria ( ) . given that products of genes linked in conserved gene-neighborhoods physically interact, it is possible that these paralogs interact to form a single complex ( ) . on the other hand, the multiple paralogs could also represent different alternative versions of the same component of a system which is under selection to display diversity. given the great variability in the numbers and types of paralogous versions of the sukh superfamily encoded by these neighborhoods, we favor the latter explanation in this case (for details see below). the second major feature that emerged from the analysis of gene neighborhoods was the linkage of genes encoding diverse sukh superfamily members to genes encoding different types of nucleases ( figure ). 
among these, we observed multiple linkages in distantly related bacteria, such as b. thuringiensis, m. marina and s. griseoflavus, to genes for nucleases of the metal-dependent nuca family, which includes the well-studied s. marcescens secreted endonuclease ( ) and the anabaena non-specific endonuclease nuca, which degrades both rna and dna ( ) . another prominent linkage observed in several bacteria, such as m. infernorum, various bacillus species and n. mucosa, was to genes encoding proteins with a hnh superfamily nuclease domain ( figure ). sequence analysis showed that several of the hnh domains were related to similar nuclease domains found in previously studied bacteriocins such as pyocin ap of p. aeruginosa, klebsiella klebicin b and colicin e of e. coli ( ) . these linkages involved members of both the smi -like and syd-like groups; thus, despite their diversity, potential functional interactions with different types of nuclease domains are a common feature of the bacterial representatives of the sukh superfamily. the third major linkage we observed was between sukh superfamily genes and those encoding gigantic bacterial surface proteins with repetitive motifs such as the hemagglutinin-repeats, rhs repeats (yd) and another previously uncharacterized a-helical repeat motif. all these proteins showed a characteristic feature of possessing a highly variable but globular domain at the extreme c-terminus of the protein, downstream of the repetitive region. these proteins also usually contain certain domains related to adhesion and the two-partner secretory (tps) system n-terminal to the repetitive region, such as paar (pfam: pf ) and the tpsa-secretion domain (tpsa-sd, also known as the filamentous hemagglutinin fhab secretory domain; pfam: pf ) with a pectate lyase-like fold ( ) ( ) ( ) . 
[figure legend fragment: ... ('h' in red, a-helix). the numbers in brackets indicate the residues excluded from the sequences. 'hash' indicates the residues involved in metal ion-binding, the 'percent' symbol indicates the conserved histidine required for activation of the water molecule for hydrolysis and 'asterisk' indicates the conserved asparagines. on the right, structures of the hnh and endog families are shown as cartoon representations with the central structural core colored by structural element type (a-helices in purple, b-sheets in yellow) and key catalytic residues highlighted. for the newly identified families, inferred topology diagrams of their core nuclease domains are shown with conserved catalytic residues.]
some of these proteins with repetitive domains, which were recovered in our analysis of sukh superfamily neighborhoods, are representatives of toxins of the cdi systems ( figure ) that were reported even as this study was being prepared for submission ( , ) . like the above proteins, the cdi toxins are characterized by multiple n-terminal tpsa-sd domains and hemagglutinin-repeats combined with polymorphic c-terminal domains that vary greatly between different cdi toxins. in all these cdi proteins the polymorphic c-terminal domain is separated from the repetitive region by either or both of two small a-helical domains annotated as domains of unknown function in the pfam database (duf or duf ). furthermore, it was shown that the gene following the cdi toxin gene was an immunity gene, whose product provided resistance against the toxin to the cell that was producing it ( ) . by this criterion it became clear that the sukh superfamily genes in the cdi operons actually encode immunity proteins for the toxins encoded by the upstream genes. however, in contrast to the pan-bacterial distribution of the sukh superfamily, the cdi operons were only observed in proteobacteria ( ) . 
furthermore, we observed that polymorphic c-terminal domains of the cdi toxins, which are found linked to the sukh superfamily immunity proteins in cdi systems, are also seen in bacterial lineages outside of proteobacteria, where too they are linked to sukh superfamily genes. in these cases they are linked to other n-terminal domains that are distinct from the tpsa-sd and hemagglutinin repeat domains. studies on cdi systems indicated that the toxin function resides in the polymorphic c-terminal domains and at least two of these domains are nuclease toxins that cleave both trnas and dna ( ) . our above observations indicate that outside of cdi systems, the sukh superfamily genes are linked to genes encoding the hnh and nuca nucleases; hence, it is likely that even these nucleases function as distinct but analogous toxins that cleave nucleic acids in target cells. together, the above observations raised the possibility that the sukh superfamily protein might serve as immunity proteins, not just in certain proteobacterial cdi systems, but also more generally function, across all major bacterial lineages, to protect against linked genes, which are predicted to act as toxins. interestingly, in addition to gene-neighborhoods with multiple tandem divergent sukh superfamily genes, in several bacteria, we also observed notable lineage-specific expansions of sukh domain proteins (e.g. paralogs in gemmata obscuriglobus, paralogs in c. gingivalis and in s. albus). these observations also make sense in light of the above toxin-immunity protein hypothesis: while the sukh superfamily gene adjacent to a nuclease or cdi toxin gene is likely to provide immunity to the 'self' toxin, the supernumerary sukh superfamily genes, which occur as tandem arrays or as isolated versions, might provide immunity against other 'non-self' toxins delivered by competing bacteria in the environment. 
such associations of multiple distinct immunity genes have also been observed in the case of plasmid-borne colicin gene operons ( ) . other features of the genomics of the sukh superfamily also support this proposal. firstly, gene neighborhoods encoding sukh proteins and linked nucleases or cdi toxins are highly variable in terms of being present or absent between different strains of the same species or between different closely related species which share an otherwise similar genomic organization. secondly, there appear to have been recent duplications of entire loci encompassing these gene-neighborhoods within the same genome in several bacteria (supplementary data). this kind of phyletic and genomic polymorphism is also typical of loci involved in inter- and intra-genomic competition such as toxin-antitoxin, restriction-modification and virulence toxin systems ( , , , ) , suggesting that even systems with sukh superfamily proteins might have comparable roles. to test this proposal further, as the first line of investigation, we aimed at exploring the link between nucleases and the sukh domain proteins. while the polymorphic c-terminal domains of two cdi toxins have been characterized as nucleases, the c-terminal domains of those cdi toxins which are found linked to the sukh superfamily immunity proteins have not been characterized. we speculated that these domains, along with some of the other uncharacterized domains in proteins encoded by conserved gene-neighborhoods containing a sukh superfamily gene, might be as yet uncharacterized nuclease domains. as a second line of investigation we sought to uncover those among the associated uncharacterized domains which might have a role in distinct toxin-trafficking mechanisms, comparable to the two-partner system used by the proteobacterial cdis. 
therefore, to accomplish these two objectives and identify other components of these systems we resorted to systematic sequence analysis of the uncharacterized proteins recovered in the above gene-neighborhood analysis. 
sequence analysis reveals the presence of distinct families of nuclease toxins encoded by genes adjacent to those of the sukh superfamily. 
sequence analysis indicated that at least distinct families of domains recovered in our searches in proteins encoded by genes adjacent to one encoding a sukh domain protein are potential nucleases. while some of these, as noted above, belong to the earlier characterized families, several of those identified here belong to entirely new families or are highly distinctive previously unrecognized versions of previously known families (figures - and supplementary data). identification of this diverse panoply of nuclease domains as being functionally linked to the sukh domain lends critical support to the proposal that this domain functions primarily as an immunity protein against nucleic acid-targeting toxins in bacteria. we briefly describe below these newly identified nuclease domains. nuclease toxins of the hnh/endovii fold. the hnh or the endovii fold is a version of the treble-clef fold. the treble-clef fold is one of the most prevalent zn-binding motifs across the three superkingdoms of life ( ) . classical hnh nucleases, like the restriction endonuclease (rease) mcra and the t endonuclease vii, contain the four conserved, zn-chelating cysteines of the treble-clef fold ( ) . however, these cysteines are lost in several forms, such as the rease mboii, colicin e and the nuca family, but these domains still retain the characteristic structural geometry of the treble-clef ( , ) . 
the active site of these enzymes is formed at the interface of the characteristic helix and b-hairpin and contains a divalent cation, which is chelated by three polar residues usually from the first strand of the b-hairpin and the c-terminal helix of the treble-clef fold. the residues chelating the metal are typically histidine, aspartate and asparagine, but their exact configuration can vary greatly between different members of this fold, making them difficult targets for identification through sequence analysis ( ) . among the nucleases of this fold occurring in the neighborhood of the sukh superfamily we observed eight distinct families spanning the entire gamut from conventional hnh nucleases to certain highly derived forms that have not been identified before. the conventional hnh versions (e.g. am _ , gi: from the cyanobacterium a. marina) retain all four cysteines of the treble-clef fold and a typical arrangement of residues chelating the catalytic metal. others, like the nuclease domains of the pspto_ protein from p. syringae (gi: ) and some cdi proteins, belong to the colicin e /e /e family (figure ) . a highly derived version is represented by the nuca family ( ) , where structural analysis reveals that a treble-clef domain which has lost the characteristic cysteines is inserted between two copies of a three-stranded domain with distinct loop-like c-terminal extensions (figure ). we uncovered several divergent, earlier unrecognized nuca family nuclease domains in both the sukh superfamily neighborhoods and cdi systems, such as those typified by the b. subtilis protein yeef (gi: ). the structural organization of the nuca domain suggests that it arose from an ancestral hnh/endovii domain, which 'carried' these duplicated three-stranded units along with it to form a more complex domain. 
consistent with this proposal, we discovered a family of novel hnh fold nucleases in our gene-neighborhoods, which contain an active site similar to the nuca nucleases, but are standalone versions without the two flanking three-stranded units. we called this family gh-e after the three conserved residues associated with the active site typical of these domains. interestingly, a subset of the gh-e family preserves the conserved cysteines of the treble-clef suggesting that they indeed represent the potential evolutionary intermediate from a classical hnh domain to the derived nuca-like forms (figure ) . we also recovered three other novel families of domains, which are respectively typified by nearly absolutely conserved tripeptide sequence motifs lhh, whh and ahh (figure ). most cdi operons, which encode a sukh domain immunity protein, have proximal toxin genes with a lhh domain as the polymorphic c-terminal unit of their products ( figure ) . additionally, the lhh domain is found in products of genes adjacent to the sukh superfamily gene outside of proteobacteria in several other bacterial lineages such as firmicutes, actinobacteria, bacteroidetes and planctomycetes ( figure ). although we also found the whh domain as the polymorphic toxin unit of a subset of proteobacterial cdi systems, none of these have a sukh superfamily immunity protein. however, we found several non-cdi gene neighborhoods, which are likely to define distinct but analogous toxin systems, in proteobacteria, firmicutes, actinobacteria, synergistetes and bacteroidetes that combine genes for whh and sukh domain proteins (figure ). the ahh domain is also found in similarly organized gene-neighborhoods from the same bacterial lineages as those in which the whh and lhh domains are found. profile-profile comparisons with multiple alignments of all these three novel domains indicated that the best matches are families of the hnh fold. 
indeed, a visual examination of the conservation patterns of these three domains showed that the hh dyad shared by them corresponds to the hh or dh dipeptide found in the first strand of the treble-clef fold of the classical hnh domains ( ). the first h forms one of the catalytic metal-chelating ligands and the second h contributes to the active site that directs the water for phosphoester hydrolysis ( ) . further, the sequence alignments of the lhh, whh and ahh motifs revealed two further conserved histidines, which were associated with the helix of the treble-clef fold and aligned with the two c-terminal metal-chelating residues in the profile of the classical hnh domains ( , ) . these observations indicated that the lhh, whh and ahh domains are highly derived versions of the hnh fold. the eighth family of hnh fold enzymes emerging from this analysis comprises proteins typified by the protein dd _ (gi: ) from d. dadantii, found in predicted toxins in sukh neighborhoods and also in cdi operon products which do not contain a sukh-type immunity protein (figure ) . a subset of these domains constitutes the pfam model for a 'domain of unknown function', duf , that does not define the boundaries of this domain precisely. we were able to define the proper boundaries of this domain by using the diversity of distinct architectural contexts in which we detected it and used the refined alignment for profile-profile comparisons. this comparison revealed the representatives of hnh domains as the best hits and indicated a perfect match between the polar residues conserved in this domain and the catalytic and active-site metal-chelating residues of the classical hnh domains. we named this family of hnh domains dh-nnk after the conserved dh dyad in the strand- and the two asparagines and lysine which are conserved in the helix of the core treble-clef fold (figure ). 
while all these above versions have lost the cysteines of the ancestral treble-clef, they nevertheless, retain the catalytic configuration typical of those nucleases. hence, we predict that these domains are likely to be nucleases with a similar catalytic mechanism. practically all characterized hnh fold nucleases, barring those of the nuca family, which show a distinct active metal chelating site ( ), have a preference for dna substrates. hence, it is likely that most of these domains are the active components of toxins that hydrolyze dna in the target cells. nuclease toxins of the endou fold. the endou nuclease domain is typified by the nuclease domain previously identified in the u-specific, metal-dependent endonuclease, which in eukaryotes processes intron-encoded u and u snornas and generates products with - cyclic phosphate and -oh termini ( ) . a related endonuclease was identified in nidoviruses, such as the severe acute respiratory syndrome coronavirus where it appears to process rnas as a part of the replication complex ( , ) . our structural analysis revealed that the catalytic domain of these enzymes contains two elements each comprised of a single helix followed by a three-stranded unit. this suggests that it is likely to have emerged through duplication of the simple helix-three-strand structural element, followed by flipping of the sheet in one of the units ( figure a ). the catalytic residues, i.e. two histidines, appear to have emerged asymmetrically in a peculiar hairpin insertion within the helix of the first repeat. this hairpin insertion appears to be mobile and adopts different conformations in structures ( , )-this mobility might have a role in accommodating the substrate between the helix and the sheets formed by the three-stranded units of the repeats ( figure a ). we found that the bacterial members of the endou family are linked to genes of the sukh superfamily mainly in firmicutes and proteobacteria ( figure ). 
other than sukh superfamily gene-neighborhoods, related versions also comprise the polymorphic c-terminal domain of the cdi toxins from moraxella and mannheimia that, however, lack a sukh superfamily immunity gene. a further set of bacterial nucleases of this family are predicted secreted versions encoded by intracellular symbiotic and pathogenic bacteria, such as wolbachia (gi: ) and ehrlichia (gi: ). most bacterial versions that we identified are extremely divergent relative to the eukaryotic and viral forms and are not recognized by the previously available hmm models for this nuclease (pf ). hence, the identification of these relationships represents a significant extension of this superfamily ( figure a , supplementary data). versions within these gene-neighborhoods show considerable variability including loss of strands from the first unit. this variability suggests that the endou fold is rather flexible in accommodating drastic modification, which in turn might help it recognize a diverse spectrum of substrates. on the precedent of the eukaryotic endou and the nidoviral nuclease and their genomic organization we suggest that the majority of the bacterial endou homologs are nuclease toxins that cleave rnas in the competitor cells. those secreted by intracellular bacteria could be deployed as toxins or regulators to manipulate host physiology by cleaving specific transcripts. with the identification of these new endou homologs it becomes clear that the bacteria contain the greatest diversity of this superfamily, with certain versions closer to the eukaryotic and nidoviral versions and others that are more divergent (supplementary data). this suggests that the original radiation of this superfamily probably happened within the bacterial toxin systems and that its members were subsequently acquired, perhaps from intracellular symbiotic bacteria, by eukaryotes and viruses. in the latter they appear to have been recruited as rna processing enzymes. 
nuclease toxins of the rease fold. the rease fold is a highly versatile fold that accommodates considerable structural diversity and has, not surprisingly, been used as the primary fold from which reases of restriction-modification systems are derived ( , ) . we also found several proteins with this fold to be encoded by genes that are neighbors of sukh superfamily genes ( figure ). these versions were originally identified as a distinct conserved domain of unclear affinities: both psi-blast and hmmer searches failed to identify any relationship with previously known domains. however, we observed that the multiple sequence alignment of this domain showed a characteristic signature of conserved residues of the form ge-d-exk-q ( figure b ) that matched the pattern of similar conserved residues in the lambda exonuclease and the recb family of the rease fold ( , ) . the predicted secondary structure pattern of these domains also closely matched the rease fold, with the conserved d and exk motif falling on a b-hairpin as is typical of the rease fold ( figure b ). these observations prompted us to use the alignment of this domain in a profile-profile comparison with the hhpred program, and we recovered a composite profile made of diverse rease fold superfamilies such as the vrr-nuc, lambda exonuclease, the archaeal holliday junction resolvase and recb as the best hits (p = À ). this suggested that this family defines a novel group of rease-fold nucleases. given that the majority of the rease-fold enzymes are dnases, we predict that these toxins are likely to cleave the dna of the target cells. nuclease toxins of the cytotoxic rnase fold. the last family of nucleases that we found encoded by genes linked to the sukh superfamily genes was the cytotoxic rnase family ( ) . 
this nuclease domain was first characterized as the toxin domain of the colicins e and e and is typified by a conserved active site configuration with an aspartate followed by a glutamate sandwiched between two histidines (supplementary data). the version of this domain in colicin e has been demonstrated to function as an endornase that specifically cleaves the phosphoester bond between bases and of s ribosomal rna ( ) . given that versions detected in systems characterized in our current study are closely related to the version found in colicin e and e , we posit that these nuclease domains act as rnases that similarly cleave rna in the target cells. other domains with a possible role in nucleic acid modifications. we found three other families of domains in proteins that were encoded by genes which occupied positions adjacent to sukh superfamily genes in certain predicted operons, equivalent to positions of the genes encoding the above nucleases. additionally, these families of domains are also found as representatives of the polymorphic c-terminal module of the proteobacterial cdis. together these observations hinted that they are potentially uncharacterized enzymatic domains operating on nucleic acids. psi-blast and jackhmmer searches showed that the first of these families belonged to the nucleotide deaminase superfamily that includes rna-editing enzymes, such as the apobecs and dna-modifying enzyme aid of vertebrates. hence, like the nucleases, these enzymes are likely to function as toxins that mutate nucleic acids in the target cells. we discuss the natural history of these enzymes in a separate article (iyer lm, zhang d, aravind l, manuscript in preparation). the second of these families prototyped by the b. cereus protein bcere _ (gi: ) is characterized by a conserved signature [ns]hh followed by another conserved histidine (supplementary data). 
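a signature of this kind can be used as a crude pre-filter before alignment-based searches. the sketch below is only an illustration and is not how the family was defined here (that relied on psi-blast and jackhmmer searches); it scans a protein sequence for [ns]hh followed by a downstream histidine. the window `max_gap` and the function name are hypothetical, since the spacing to the additional histidine is not stated in the text.

```python
import re

# [NS]HH tripeptide, as described for the family prototyped by the B. cereus protein.
NSHH = re.compile(r"[NS]HH")


def scan_nshh(seq, max_gap=40):
    """Return (motif_start, downstream_his) index pairs, 0-based.

    A hit requires an [NS]HH tripeptide followed by at least one further
    histidine within `max_gap` residues (a hypothetical window).
    """
    seq = seq.upper()
    hits = []
    for m in NSHH.finditer(seq):
        tail_start = m.end()
        offset = seq[tail_start:tail_start + max_gap].find("H")
        if offset != -1:
            hits.append((m.start(), tail_start + offset))
    return hits
```

for the made-up sequence `"MKTNHHAAGGHLLSHHQ"`, the first nhh at index 3 is followed by a histidine at index 10 and is reported, whereas the c-terminal shh has no downstream histidine and is skipped. a pattern match alone says nothing about fold or function; it only nominates candidates for profile-based comparison.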
although we were unable to unify this family with any of the other nuclease folds, the presence of the hh motif typical of many of the above families of hnh/endovii fold nucleases might point to a divergent relationship with those proteins. the third of these families, typified by the cdi system from p. luminescens (gi: ), includes a globular domain of - amino acids that might define yet another uncharacterized nucleic acid-modifying domain (cdiac in figure , supplementary data). earlier characterized toxin systems such as the classical plasmid-encoded bacteriocins and the recently characterized cdi systems use thematically comparable, albeit biochemically distinct, mechanisms for trafficking of nuclease toxins. while these systems have been used as models to understand bacterial protein trafficking, the complete set of events, from the extrusion of the 'pro-toxin' by the producing cell to its recognition at the target cell surface and delivery into the target cell, is only partially understood ( ) . classical plasmid-borne colicins and cognate bacteriocins from other bacteria do not have secretory mechanisms and their release appears to occur primarily through cell-lysis mediated by the colicin-release proteins ( ) . colicin-like bacteriocins are multidomain proteins with an extreme c-terminal toxin module, which is either a nuclease or a membrane-perforating domain (e.g. colicin e and a) ( ) . they typically possess two additional n-terminal modules, of which the first facilitates translocation across the target cell membrane and the second (i.e. the central module) facilitates binding to a membrane receptor on the target cell. these colicins hijack either the tol or the ton-dependent molecular import systems to enter the target cells ( ) . 
the chromosomally encoded proteobacterial cdi system toxins do not require lysis; instead they are trafficked out of the cell which produces them via the two-partner system that depends on the cdib proteins belonging to the tpsb class of outer-membrane trafficking proteins ( ) . these latter proteins contain n-terminal periplasmic polypeptide-transport-associated (potra) domains linked to a c-terminal b-barrel transmembrane domain. they recognize the secretory domains such as the tpsa-sd in the extreme n-terminal region of the cdi 'pro-toxins' to deliver them across the outer membrane of proteobacteria ( ) . this n-terminal region is separated from the c-terminal regions by repetitive regions with rhs- or filamentous hemagglutinin-type repeats. their uptake by the target cell is less clearly understood. in the well-studied examples, the first step of this process appears to depend on the outer membrane-biogenesis protein bama recognizing a conserved a-helical domain immediately n-terminal to the toxin module, with a venn signature that overlaps with the pfam model termed 'duf '. subsequently, the inner-membrane protein acrb, a transporter, appears to be necessary for uptake into the target cell cytoplasm ( ) . additionally, it is posited that a proteolysis step at the cell surface releases just the c-terminal nuclease module for uptake by the target cell ( ) . thus, despite the differences between the cdi and classical colicin-like systems, they share a common feature of the toxin activity being borne by the extreme c-terminal domain in a multidomain polypeptide. further, the modules located immediately n-terminal to the nuclease domain (e.g. the a-helical domain with the venn motif, roughly equivalent to pfam duf ) are involved in association with receptors on the target cell. hence, we term these domains collectively the pre-toxin (pt) domains. the extreme n-terminal domains appear to play a critical role in export from the host cell in the cases where lysis is not involved, i.e. 
typically chromosomally borne versions. these observations accordingly presented the organizational logic for these systems, wherein there are usually three functionally distinct sets of modules in the pro-toxin going from the n- to the c-terminus of the protein. analysis of the domain architectures of the nuclease domain-containing proteins encoded in the sukh superfamily neighborhoods revealed that the majority of the proteins followed an architectural logic which was consistent with the above-described organization of these earlier studied toxin systems ( figure ) . however, only a relatively small subset of the sukh domain-associated systems overlaps with the cdi systems. further, the sukh superfamily proteins and functionally linked toxins are also found outside of proteobacteria, in lineages lacking outer membranes and cdib-like delivery systems. we reasoned that analysis of these distinct pre-nuclease and extreme n-terminal domains might reveal features pertaining to the trafficking of toxins in non-cdi systems and point to alternative delivery mechanisms. identification of multiple distinct trafficking systems for toxins encoded in sukh superfamily neighborhoods. we observed that in gram-positive bacteria, proteins with the c-terminal nuclease typically possessed one of a set of several distinct domains at the extreme n-terminus of the protein ( figure ) . a significant subset of these could be unified using sequence profile searches with the psi-blast and jackhmmer programs to the wxg/esat superfamily of a-helical domains ( ) . these domains are a specific signal recognized by the yuea-like atpases of the hera-ftsk superfamily that secrete them in an atp-dependent manner ( , ) . this indicated that the wxg/esat- domain-containing toxins in gram-positive bacteria are extruded by yuea-like pumps using an atp-dependent mechanism. 
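the three-part organizational logic described above (n-terminal export module, pre-toxin domain, extreme c-terminal toxin) can be expressed as a simple check on an ordered list of domain labels. this is a hedged sketch rather than a method from this study: the label strings, the sets below and both function names are hypothetical, and only the two export routes discussed up to this point are encoded.

```python
# Hypothetical domain labels; export routes follow the text above.
EXPORT_ROUTES = {
    "TpsA-SD": "two-partner secretion via a CdiB/TpsB-type outer-membrane protein",
    "WXG": "ATP-dependent extrusion by a YueA-like HerA-FtsK pump",
}
REPEATS = {"RHS", "hemagglutinin-repeat"}
PRE_TOXIN = {"PT-VENN"}
TOXINS = {"HNH", "NucA", "EndoU", "REase", "cytotoxic-RNase"}


def follows_protoxin_logic(domains):
    """Check an N->C ordered domain list against the pro-toxin layout.

    Requires a toxin domain at the extreme C-terminus and, if export
    domains are present, that they form an uninterrupted N-terminal block.
    """
    if not domains or domains[-1] not in TOXINS:
        return False
    export_positions = [i for i, d in enumerate(domains) if d in EXPORT_ROUTES]
    return export_positions == list(range(len(export_positions)))


def predicted_route(domains):
    """Return the export route implied by the first domain, or None if unrecognized."""
    return EXPORT_ROUTES.get(domains[0]) if domains else None


cdi_like = ["TpsA-SD", "hemagglutinin-repeat", "PT-VENN", "HNH"]
gram_positive_like = ["WXG", "EndoU"]
```

both example architectures pass the check, while a list such as `["HNH", "WXG"]` fails because the toxin is not c-terminal. an architecture with no recognized export module returns `None` from `predicted_route`, which leaves open release by lysis or a route not encoded in this table.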
a significant subset of toxin proteins from firmicutes possessed a distinctive n-terminal domain that could not be unified with any earlier known domain (a subset of these have been included in the erroneously annotated model transposase_ of pfam; pf ). sequence searches showed that this domain possessed a conserved [lf]xg sequence motif and it was predicted to assume an a-helical bundle fold based on the multiple sequence alignment (supplementary data). we accordingly termed it the lxg domain ( figure ) and were able to unify it with the wxg domain by means of profile-profile comparisons with the hhpred program (p = À ). contextual analysis indicated that this domain is encoded by certain conserved gene-neighborhoods across firmicutes, where it is associated with genes coding for a yuea-like hera-ftsk superfamily protein pump and a small protein related to the s. aureus esac protein (gi: , supplementary data). through profile-profile comparisons we showed that the esac-like superfamily is a bacterial version of the eukaryotic evh peptide-binding domains with the ph-like fold (hhpred p-value: À ) ( ). these observations suggest that the lxg domain is comparable to the wxg/esat- domain, and is likely to utilize the atp-dependent yuea pumps and the potential peptide-binding esac domain as partners for extrusion from the producing cell. the protein srot_ (gi: ) from the actinobacterium s. rotundus contains two copies of a distinct domain n-terminal to the gh-e nuclease domain ( figure ). this domain is also widely found in several actinobacteria at the n-termini of putative cell-surface proteins. profile-profile comparisons suggested a possible relationship between these n-terminal domains and the wxg domain, suggesting that it might be yet another representative of the wxg-like superfamily (p = À ) and might utilize a similar atp-dependent mechanism for its extrusion. a fourth group of proteins, restricted to certain firmicutes (e.g. s. 
aureus sacol protein; gi: ), is typified by yet another n-terminal α-helical domain (ldxd in figure ) that is also found in domain architectural contexts very similar to those of the wxg and lxg domains. it is conceivable that this domain is comparable to them and functions similarly as a mediator of export via the hera-ftsk superfamily pumps. thus, a notable mode of export of nuclease toxins in gram-positive bacteria appears to be via the atp-dependent extrusion system, which, while biochemically distinct from the tps of the proteobacteria, is thematically comparable. in actinobacteria, but not firmicutes, we observed several large proteins with architectures similar to the cdis of the proteobacteria. these typically contain rhs repeats; however, their extreme n-terminal domains did not bear any close relationship to the proteobacterial tpsa-sd. instead they were found to contain an n-terminal signal peptide and some of these proteins (e.g. gi: , a protein from s. griseus) contain multiple laminin g domains embedded within repetitive regions. the protein dip (gi: ) from c. diphtheriae shows another distinct low complexity repeat n-terminal to the nuclease domain ( figure ) and, as in the above case, it also possesses a conventional signal peptide. likewise, a distinctive signal peptide, which is highly conserved in multiple proteins only within the genus planctomyces, is seen in predicted nuclease toxins from this organism (e.g. gi: ). another group of large toxin proteins with rhs repeats, which predominantly occurs in proteobacteria, is defined by the presence of repeats of the paar domain (pfam: pf ) n-terminal to the rhs repeats. all these proteins are typified by the presence of a conserved transmembrane domain with two tm segments ( figure and supplementary data) just n-terminal to the paar domains.
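as an illustration of the conserved [lf]xg motif used above to name the lxg domain, the following is a minimal sketch of a pattern scan over a protein sequence. it is not the method used in this study, which relied on profile searches and profile-profile comparisons; a three-residue pattern of this kind will match by chance in many unrelated proteins. the function name, the regular expression and the toy peptide are all assumptions introduced here for illustration.

```python
import re

# Hypothetical rendering of the [lf]xg motif: L or F, any residue, then G.
# The lookahead lets overlapping hits be reported as well.
LXG_MOTIF = re.compile(r"(?=([LF].G))")


def find_lxg_motifs(sequence):
    """Return (0-based start, matched tripeptide) for every motif hit."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in LXG_MOTIF.finditer(seq)]


# Made-up peptide, not taken from any protein discussed in the text.
toy_peptide = "MKLSGAAFTGW"
print(find_lxg_motifs(toy_peptide))  # [(2, 'LSG'), (7, 'FTG')]
```

a scan like this can only flag candidate positions; deciding whether a protein actually carries the domain would still require alignment-based evidence of the kind described in the text.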
we propose that these tm segments are required for their trafficking to the cell membrane, following which they might be processed in the periplasm for release via the outer membrane in a process that might depend on the paar domains. we also noticed a comparable domain with two tm segments in a few firmicutes (e.g. gi: from c. thermocellum) and in chlamydiae (e.g. from m. infernorum, which is a rare case of the nuclease domain occurring n-terminal to the two-tm domain; figure ). these proteins lack paar domains but the firmicute versions have additional hedgehog-intein (hint) peptidase domains (see below) that could aid in their release on the cell-surface ( figure ). these observations suggest that at least some nuclease toxins in bacterial lineages such as actinobacteria, bacteroidetes and planctomycetes with conventional signal peptides, and those in proteobacteria, chlamydiae and firmicutes with two-tm domains, are probably delivered to the cell using the conventional sec-dependent system ( ) . in the context of the above cases, it is of interest to note that e. coli syd, an archetypal member of the sukh superfamily, was first identified as a possible proof-reading component of the sec-dependent export system ( ) ( ) ( ) . in this context it is possible that the binding of certain members of the sukh superfamily (at least the syd-like group) in the producing cell might help not only in conferring immunity to 'self' but also in guiding the 'pro-toxin' to the sec-dependent export machinery. neither actinomycetes nor firmicutes display proteins with a pt domain with the venn motif (pt-venn). however, we observed that in both these lineages there was a conserved α-helical domain that frequently occurred just n-terminal to several distinct nuclease modules in different predicted toxins.
this domain had a conserved tg motif and we accordingly named it the pt-tg domain ( figure ). another domain, which we found frequently associated with several unrelated or distantly related nuclease domains from gram-positive bacteria, was the nuclease_n domain ( figure , supplementary data) . it is predicted to be an α-helical domain and might also play a role in the delivery of the toxin module into the host cells. toxins in the sukh superfamily neighborhoods, irrespective of the type of the nuclease domain, can also be distinguished into two major architectural groups: one comprised of relatively small proteins with no notable stretches of repetitive sequence separating the n- from the c-terminal regions, and the second in which such repetitive sequences, such as the rhs and the filamentous hemagglutinin repeats, are present ( figure ). this might reflect a mechanistic difference in their mode of action: the smaller proteins could be soluble toxins that diffuse away from the cell producing them. in contrast, the large proteins with repetitive elements might form filamentous appendages that stick out from the cell-surface and depend primarily on contact with target cells for delivery [hence, the latter group includes the recently characterized cdis ( ) ]. alternatively, this difference might reflect the differences in the cell-wall structures of the bacterial lineages, with the smaller toxin proteins being more prevalent in the firmicutes. a subset of the smaller proteins with nuclease domains lack noticeable trafficking-related (n-terminal) domains. the corresponding genes could represent cassettes for alternative toxin modules that are linked by recombination to the larger full-length genes ( figure , see below).

other auxiliary domains which might play a role in resistance, trafficking or processing of toxins.
several other domain families were found to be encoded by genes having persistent association with the sukh superfamily neighborhoods across distantly related bacterial species. one of these is the sufu superfamily ( figure and supplementary data) prototyped by the suppressor of fused protein from drosophila ( ). in addition, we also detected members of this superfamily to be encoded by cdi-like operons, such as the one from n. gonorrhoeae that encodes a toxin with a distinct version of the hnh fold nuclease domain (toxin ngo , gi: ; supplementary data). in these cases the sufu superfamily gene occupies a position equivalent to that of the sukh superfamily gene, suggesting that they might be functionally comparable. we also found several examples wherein the sufu and sukh domains are combined in the same polypeptide (figures and ) . based on these associations we propose that the sufu domain represents a second widely conserved domain that functions as an immunity protein for diverse nuclease toxins. two other conserved protein families are encoded in the toxin neighborhoods (sukh-neighborhood conserved family and ; sncf and sncf , supplementary data) that occupy positions similar to the sukh and sufu superfamily genes ( figure ). they were not found in multi-domain architectures typical of the nuclease toxins and always occurred as proteins with standalone domains. this suggested that they were unlikely to be novel toxins but instead act as alternative immunity proteins just like the sufu and sukh superfamily proteins. the hint domain, prototyped by the peptidase domains of the animal hedgehog proteins and protein-splicing inteins, is also frequently associated with sukh superfamily neighborhoods ( ) ( ) ( ) . these versions of the hint domain are closer to those found in several bacterial surface proteins and the secreted animal proteins such as hedgehog and the c. elegans hog proteins ( ) .
when present in a multidomain 'pro-toxin' protein, the hint domain always occurs sandwiched between a pt domain, such as pt-venn or pt-tg, and the nuclease toxin domain. this location of the hint domain suggests that it is likely to serve as a peptidase that undergoes autoproteolytic cleavage, similar to what is observed in hedgehog and the inteins ( ) , to release the c-terminal nuclease domain for uptake by the target cell. it is conceivable that this cleavage step is regulated by the interaction of the pt domains with the surface receptor on the target cell.

eukaryotic/dna viral members and structure-function analysis of the sukh superfamily. while sukh superfamily neighborhoods are very widespread in bacteria, they are largely absent in archaea. although we uncovered potential extruded nuclease toxins in certain halophilic archaea such as h. borinquense (gi: , with a gh-e nuclease domain), which are delivered by means of a distinctive n-terminal metallopeptidase domain, we did not find any immunity proteins of the sukh or sufu superfamilies. although the exact reason for this exclusion is unclear, it is conceivable that these immunity proteins are ineffective in the context of the distinct archaeal secretory systems. however, several eukaryotes possess one or more sukh superfamily members. phylogenetic analysis and phyletic patterns suggest that there are two major eukaryotic lineages of the sukh superfamily that are nested within the radiation of the bacterial versions (supplementary data). they are respectively prototyped by the polyglutamylase subunit pgs ( ) , and the vertebrate scf ubiquitin e ligase subunit fbxo together with yeast smi /knr ( , ) . the pgs version is found in basal eukaryotes such as giardia and spironucleus, animals and chlorophyte algae, suggesting that it was likely to have been acquired prior to the last eukaryotic common ancestor (leca) and subsequently lost in several lineages.
the fbxo lineage is present in animals, fungi, plants, stramenopiles and ciliates. however, it does not group with the pgs lineage, instead grouping with other bacterial forms. hence, it was probably acquired relatively early in eukaryotic evolution via an independent transfer from bacteria. in both plants and animals the fbxo version is fused to an n-terminal f-box domain and a distinctive c-terminal immunoglobulin superfamily domain (overlaps with the pfam model duf ), suggesting that it was recruited as an e subunit prior to the radiation of these eukaryotic groups. in addition to these versions, there appear to have been other sporadic transfers of sukh superfamily members to eukaryotes. for example, land plants contain a version typified by the arabidopsis protein at g (gi: ) which seems to have been independently acquired by them from a bacterial source. another sporadic transfer is seen in certain filamentous fungi, which acquired a version of the sukh- group that has been independently fused to an n-terminal f-box domain (e.g. a. oryzae gi: ). dna viral versions show no specific relationship with eukaryotic forms; instead, they share specific sequence motifs with the sukh- group, recover them as best hits in profile-profile comparisons, and group with them in the phylogenetic tree (supplementary data). within viruses they are most widespread and abundant in herpesviruses, with the versions from adenoviruses, poxviruses and iridoviruses being nested within the herpesviral radiation of the family (supplementary data). thus, they appear to have been acquired first by an ancestral herpesvirus, similar to that inserted in the amphioxus genome ( ) , from a bacterial source and subsequently disseminated across diverse dna viruses. although there has been gene loss in several eukaryotic lineages, at least the two ancient versions, namely pgs and fbxo appear to have been largely vertically inherited and show no lineage-specific expansions within eukaryotes. 
this is in sharp contrast to the high propensity for lateral transfer and for lineage-specific expansions of the sukh superfamily that is observed in bacteria. this feature, together with the available functional evidence suggests that these conserved eukaryotic versions have acquired a biological role distinct from that in the toxin-immunity systems of bacteria. nevertheless, there were several features that suggested to us that biochemically the eukaryotic versions might be exploiting an ancient functional template provided by the sukh domains in bacterial nuclease toxin systems. firstly, the studies on yeast smi /knr have shown that it interacts with a large number of structurally and functionally distinct proteins ( ) . in fbxo , and independently in the above-mentioned fungal proteins, it appears in a domain architectural context corresponding to the part of the e f-box subunit that recognizes the substrate for ubiquitination ( ) . this suggests that it might be deployed as a recognition domain to recruit particular substrates for ubiquitination. in bacteria the sukh superfamily domains are one of the most widespread immunity proteins that appear to function in conjunction with a repertoire of nuclease toxins that are extremely diverse in sequence and structure (figures and ) . taken as a whole, these observations indicate that the sukh domain contains a scaffold that has been adapted to recognize a diverse set of protein partners. a possible clue for the structural basis of this capability is offered by studies on the e. coli syd protein: it has been shown to contain a prominent negatively charged cleft with which it could interact with partner proteins ( ) . examination of the structure of this protein indicates that this cleft is formed by the space between the conserved helix h and the fissure in sheet between the two-stranded n-terminal unit and the c-terminal -stranded meander ( figure ). 
given that this unusual feature is seen across the fold, we examined the surface renderings of different sukh superfamily members and observed a corresponding cleft in most of them ( figure ). although this cleft is not necessarily negatively charged as in syd, and might vary in depth and shape, its widespread presence suggests that it might be the means by which the sukh superfamily is able to accommodate different protein partners. in support of this hypothesis we observed that in the case of two distantly related members of the sukh superfamily, namely syd (pdb: ffv) and yobk (pdb: prv), this cleft is used in protein-protein interactions. in both these crystal structures one of the monomers is bound in the cleft of the other monomer, resulting in an asymmetric dimer (supplementary data). these dimers are unlikely to represent biologically native dimeric states, but in any case illustrate the ability of the conserved cleft of the sukh fold to accommodate other proteins. interestingly, the sufu superfamily also shows a comparable kind of sheet with a fissure between two sets of strands ( ) . experimental studies on the drosophila sufu show that it also functions as a protein tether which holds the zn-finger transcription factor gli in the cytoplasm in the absence of the hedgehog signal ( ) . in vertebrates the sufu ortholog has been shown to bind gli and gli and prevent their degradation due to ubiquitination by f-box e ligases ( ) . thus, the presence of comparable binding interfaces that have the flexibility to recognize a wide range of protein ligands might be a common feature shared by both the sukh and the sufu superfamilies of immunity proteins. it is this feature that appears to have resulted in them being utilized as adaptors for recruiting other proteins in eukaryotic regulatory systems.
the extensive spread of the us group of the sukh superfamily across unrelated or distantly related dna viruses of animals suggests that it confers an important advantage to these viruses. this is also supported by the lineage-specific expansion in betaherpesviruses of the sukh superfamily in the form of multigene arrays similar to what is seen in bacteria ( figure ). indeed, multiple studies suggest that distinct copies of the proteins in herpesviruses are required for effective survival and replication of the virus in its hosts. for instance, mutagenesis showed that two sukh superfamily paralogs, m and m , of the murine cytomegalovirus are essential for survival of the virus itself, whereas mutagenesis of other paralogs, m , m and m , specifically prevents its replication in macrophages ( ) . other studies indicated that m and m form a heterotetrameric complex which counters the action of the host protein kinase r (pkr) in shutting down viral protein synthesis ( ) ( ) ( ) ( ) . the human cytomegalovirus sukh superfamily proteins trs and irs have been shown to similarly counter the pkr and the dsrna dependent arm of the anti-viral response ( , ( ) ( ) ( ) ( ) ( ) . another paralog, ul , inhibits the host cell stress responses by antagonizing the tuberous sclerosis protein complex in the endoplasmic reticulum ( , ) and counters apoptosis in conjunction with yet another paralog, ul ( , ) . in light of these observations it appears that the viral versions of the sukh superfamily are deployed to counter different facets of the host anti-viral and stress response. by analogy to the bacterial versions, which function as immunity proteins, we propose that the viral sukh domain proteins in general bind diverse host proteins that are used against the virus. here again the special ability of the sukh scaffold to bind diverse proteins appears to have been exploited by the virus as a flexible binding interface to neutralize a diverse group of host anti-viral defenses.
identification of the sukh superfamily and associated nucleic acid modifying toxin systems has considerable implications for understanding bacterial genetic conflicts, evolutionary forces acting on strongly linked multi-gene loci, and potential biotechnological applications. we briefly discuss some of these implications that emerge directly from our observations.

relationship of toxin systems to genetic conflicts in the bacterial world. classical colicins and earlier characterized cdis act primarily on related bacterial strains of the same 'species'. although the systems identified in our studies are abundantly represented in extracellular pathogenic bacteria, they are rare in intracellular symbionts or pathogens. this might be because intracellular bacteria are much less likely to encounter a heavy load of competing cells in the same niche. the bacterial toxin systems which we uncovered in this study and the related cdis also differ in certain features from the classical colicin-like systems. classical colicins are in large part encoded on plasmids, which might be either single copy, medium-sized conjugative plasmids or small multi-copy plasmids that depend on the conjugative plasmids for their transmission ( ) . such bacteriocins are relatively rare on chromosomes. in contrast, . % of the systems recovered in our study are chromosomally encoded. the majority of the plasmid-encoded classical colicin-like toxins are accompanied by a gene encoding a lysis protein and their release is concomitant with the lysis of the host cell. however, none of the systems identified in this study or the cdis have lysis genes in their neighborhoods ( ) . this difference suggests that, while both the plasmid-borne bacteriocins and these systems might be directed at close relatives, they appear to be geared toward distinct genetic conflicts.
the lysis of the cell nullifies the fitness of the chromosome; hence, it would be largely deleterious for the chromosome to encode systems that require lysis. the plasmid, being a selfish element, is not completely affected by loss of fitness of the host as long as it can offset it by holding on to, or spreading in, the host population (i.e. the plasmid's own fitness is enhanced or maintained). cells of the host type without the bacteriocinogenic plasmid are competitors that affect the plasmid fitness, especially under stationary phase or starvation conditions. hence, the plasmid-borne colicin would be primarily selected to act against host cells that have lost the plasmid or lack it by default under these stress conditions. further, the plasmid toxins are unlikely to have ready access to trafficking by the host because, given the large amounts in which the colicins are produced ( ) , their export is likely to impair host fitness. further, it has been shown that under starvation only ~ % of the cells produce colicin ( ) . although the loss of the cells producing the colicin would endanger the resident plasmid, a relatively small fraction of the host population is affected. by the principle of inclusive fitness of kin ( ), the plasmid could still have an enhancement of fitness from the copies in the surviving cells along with the elimination of competitors by the released toxin. on the other hand, the toxin domains of many of the chromosomal versions like the cdis and those identified in this study appear to be borne on filamentous structures that are primarily geared toward elimination of competitors that come in physical contact with the cell-surface ( , ) . therefore, these systems are likely to be critical in the context of the formation and organization of biofilms and solid substrate colonies. when bacterial cells are aggregating in the above contexts it would be beneficial to eliminate resource sharing with non-kin competitors.
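the inclusive-fitness argument made above for the plasmid-borne colicins can be made concrete with hamilton's rule, under which a costly trait is favored when relatedness multiplied by the benefit to recipients exceeds the cost to the actor. the sketch below is only a toy illustration of that inequality; the function name and all numerical values are assumptions chosen for the example and are not estimates from this study or from the cited work.

```python
def hamilton_favors(relatedness, benefit, cost):
    """Hamilton's rule: a costly trait is favored when r * b > c."""
    return relatedness * benefit > cost


# Toy numbers (assumptions, not measurements): the surviving cells carry
# identical plasmid copies (r = 1.0), they gain a 20% advantage from the
# removal of competitors, and the plasmid pays a cost of 3% through the
# small fraction of producer cells that lyse.
print(hamilton_favors(relatedness=1.0, benefit=0.20, cost=0.03))  # True

# With low relatedness between producers and beneficiaries the same
# numbers no longer favor the trait.
print(hamilton_favors(relatedness=0.1, benefit=0.20, cost=0.03))  # False
```

the contrast between the two calls mirrors the point of the passage: lysis-dependent release can pay off for a plasmid whose copies persist in surviving kin cells, whereas it offers little to a lineage that does not share in the benefit.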
hence, presence of a chromosomally encoded toxin that acts at a short range is likely to be selected, resulting in the proliferation of systems such as those described here. nevertheless, it would also benefit 'cheater cells' to evade such defensive mechanisms. hence, they would be selected to maintain a wide diversity of immunity proteins to counter different non-self toxins, which might explain the arrays of diverse sukh genes in several bacterial genomes. potential evolutionary processes in diversification of toxins and immunity proteins. imprints of the evolutionary arms race arising from the above processes are readily observed in our systems. the toxin proteins appear to show a rather peculiar pattern of diversification. the n-termini, which are typically associated with trafficking, tend to be relatively conserved while c-terminal nuclease domains show major diversity ( figure ). this is consistent with a recent study on the diversification of rhs proteins in enterobacteria which showed that the rhs proteins undergo c-terminal polymorphism due to rampant recombination with invading cassettes that encode alternative c-terminal modules ( ) . this type of recombination or gene-conversion with polymorphic c-terminal cassettes might explain the presence of smaller loci found in the gene-neighborhoods characterized here that encode just a nuclease domain by itself or with an additional small n-terminal extension (figures and ) . hence, we extend the original proposal for rhs diversification to suggest that, more generally, recombination with cassettes with distinct c-terminal modules is the primary proximal mechanism for diversification of the toxin proteins across all bacterial lineages ( figure ) . furthermore, the presence of nuclease and nucleic acid deaminase domains as the primary toxin modules of these systems raises the possibility that their nucleic acid cleaving or mutating activity is involved in triggering recombination events. 
this appears plausible given the observations that most of these nucleases are likely to be endonucleases, which like their counterparts in the restriction-modification systems could cleave at specific sequences. similarly, deaminase-induced mutations have been implicated in the triggering of class-switching recombination events in vertebrates ( ) . more generally, this ties in with earlier studies which have demonstrated the role for both recombination and positive selection in the evolution of plasmid-borne bacteriocins ( ) . it has been proposed that pore-forming versions have predominantly utilized recombination for diversification whereas nucleases have mainly evolved through positive selection. in our systems, the evidence points to both these forces being active at different levels in the evolution of the toxin proteins ( ) . while the basic architectures evolve through recombination generating c-terminal polymorphism, the c-terminal nucleases themselves show evidence for considerable sequence diversification within each family. indeed, much of the diversification of the hnh/endovii fold appears to have happened within the context of these systems, with several structurally distinct forms evolving amidst the nuclease toxins ( figure ). phyletic and phylogenetic analysis of the sukh superfamily indicates three salient features, namely rampant lateral transfer between different branches of the bacterial tree, gene loss and lineage-specific expansion followed by divergence of the lineage-specific paralogs (supplementary data). this suggests that there is a notable trend for maintaining diversity within the sukh superfamily that probably arises from selection for recognition of a diverse range of nucleic acid-modifying toxins. although there are multiple distinct types of immunity proteins known from plasmid-borne bacteriocins and cdi systems, most show very limited phyletic patterns. 
for example, the cdii immunity protein seen in several cdi systems is entirely limited to proteobacteria ( ) . we observed that it is a protein with two tm segments that is likely to form a membrane channel (supplementary data) and have a mode of action very distinct from the sukh superfamily. as only the sukh superfamily and, to a certain extent, the sufu superfamily show a pattern of wide dissemination across bacteria, it is likely that only these scaffolds can support sufficient diversification that goes hand in hand with the polymorphism of the toxin domains.

implications for eukaryotic and viral functions. our observations also suggest that the biochemical diversity generated within these bacterial toxin systems has been taken up and utilized for very different functions by eukaryotes and their viruses. both the sukh and the sufu superfamily domains have been utilized as adaptors that regulate recognition of different substrates by protein modification systems such as ubiquitination and polyglutamylation. in a completely different context, the hint domains derived from such bacterial toxin systems appear to have been used to release peptide messengers in animal signaling pathways, like the hedgehog pathway ( ) . the nuclease domains ultimately derived from various toxins also appear to have been used for different functions by eukaryotes and their viruses. the endou nuclease domain, which ultimately emerged from these toxin systems, has been recruited by the nidoviruses for the replication of their negative-strand rna genome, whereas a related domain was recruited by eukaryotes for processing of certain snrnas. we also observed that a hnh/endovii fold nuclease found in the bacterial toxin typified by the n. gonorrhoeae protein ngo is found in several eukaryotic lineages such as animals, plants, stramenopiles and apicomplexans (supplementary data). given its conservation and relatively lower divergence, it is unlikely that the nuclease functions as a toxin in eukaryotes.
however, it is possible that it has been recruited as a dna-repair enzyme, as has been previously observed in the case of certain nucleases of bacterial restriction-modification and phage replication systems ( ) . in general terms, these observations suggest that the origin of key systems in eukaryotes, including those related to the emergence of certain lineages, such as animals (i.e. the hedgehog pathway), appear to have extensively benefited from the availability of 'pre-adaptations' in the form of components whose ultimate origins lay in these toxin systems. the current study points to the remarkable flexibility of sukh domains in mediating different protein-protein interactions. in a sense, this situation resembles what has earlier been observed with certain scaffolds like the immunoglobulin domain and the leucine-rich repeats of various immunity-related proteins of eukaryotes ( , ) . the ability of the sukh scaffold to accommodate diverse binding partners makes it a potential candidate as a template for protein engineering to generate novel binding capabilities. likewise, the c-terminal diversification of the toxin domain could also have biotechnological utility as a model for generating secreted proteins that differ extensively in a given module but retain a constant n-terminal part. we hope that this characterization of the sukh superfamily and identification of the associated nuclease toxin families provides new leads for the future exploration of the manifold implications of the systems discussed here. ribosome-inactivating proteins from plants: present status and future prospects mechanism of action of ricin and related toxic lectins on eukaryotic ribosomes the ribonuclease activity of the cytotoxin alpha-sarcin. 
supplementary data are available at nar online. conflict of interest statement. none declared. key: cord- -l brhmzq authors: munnur, deeksha; bartlett, edward; mikolčević, petra; kirby, ilsa t; matthias rack, johannes gregor; mikoč, andreja; cohen, michael s; ahel, ivan title: reversible adp-ribosylation of rna date: - - journal: nucleic acids res doi: . /nar/gkz sha: doc_id: cord_uid: l brhmzq adp-ribosylation is a reversible chemical modification catalysed by adp-ribosyltransferases such as parps that utilize nicotinamide adenine dinucleotide (nad(+)) as a cofactor to transfer monomer or polymers of adp-ribose nucleotide onto macromolecular targets such as proteins and dna. 
adp-ribosylation plays an important role in several biological processes such as dna repair, transcription, chromatin remodelling, host-virus interactions, cellular stress response and many more. using biochemical methods we identify rna as a novel target of reversible mono-adp-ribosylation. we demonstrate that the human parps (parp , parp and parp ), as well as a highly diverged parp homologue trpt , adp-ribosylate phosphorylated ends of rna. we further reveal that adp-ribosylation of rna mediated by parp and trpt can be efficiently reversed by several cellular adp-ribosylhydrolases (parg, targ , macrod , macrod and arh ), as well as by macrod-like hydrolases from veev and sars viruses. finally, we show that trpt and macrod homologues in bacteria possess activities equivalent to the human proteins. our data suggest that rna adp-ribosylation may represent a widespread and physiologically relevant form of reversible adp-ribosylation signalling. adenosine diphosphate (adp)-ribosylation is a covalent modification in which the adp-ribose (adpr) group from nicotinamide adenine dinucleotide (nad+) is transferred to diverse target molecules: proteins, nucleic acids and small molecules such as phosphate or acetate ( ) . this modification changes physical and chemical properties or localization of target molecules and regulates many important cellular processes in both prokaryotes and eukaryotes ( ) . adp-ribosylation was first described as a mechanism of pathogenicity used by bacterial exotoxins that irreversibly modify crucial host cell proteins ( ) . two divergent bacterial toxins, diphtheria toxin and cholera toxin, are the founders of two major adp-ribosyl transferase (art) groups ( , ) . poly(adp-ribose) polymerases (parps), the best studied and largest art subgroup, belong to the diphtheria toxin-like adp-ribosyl transferases. 
parps are present in all eukaryotes (except yeast) and sporadically in bacteria; they regulate important cellular processes such as dna damage repair, transcription, protein degradation, cell-cycle progression, host-virus interaction, cell division, ageing, cell death and bacterial metabolism ( ) ( ) ( ) ( ) . the human genome encodes seventeen parps with different domain architectures and functions ( ) . trna phosphotransferase (trpt /tpt /kpta) is sometimes referred to as the eighteenth parp family member ( ) . several parp family members (parp , parp and tankyrases) synthesize long chains of poly-adpr, while the other parp family members (such as parp and parp ) transfer a single adpr group onto targets ( ) . adp-ribosylation is a dynamic chemical modification that is regulated at the level of both addition and removal of adpr groups. parps have been shown to target mostly glu/asp or ser residues ( ) ( ) ( ) ( ) ( ) . poly-adp-ribosylation can be removed by the action of two divergent enzymes, poly(adp-ribose) glycohydrolase (parg) and adp-ribosylhydrolase (arh ) ( , ) . parg is unable to remove the last adpr group attached to target proteins ( ) while arh is the only hydrolase that can completely remove both poly- and mono-adpr signals from serine residues ( ) , a modification catalysed by parp /hpf and parp /hpf complexes ( , ) . terminal adpr linked to glu/asp is removed by macrodomain-containing enzymes, namely terminal adpr glycohydrolase (targ /oard ), macrod and macrod ( ) ( ) ( ) ( ) . enzymes with phosphodiesterase activity, nudt (nucleoside diphosphate-linked moiety x-type motif ) and enpp (ectonucleotide pyrophosphatase/phosphodiesterase ), can cleave pyrophosphate from both poly- and mono-adpr modified targets, leaving phosphoribose tags on the proteins ( , ) . although adp-ribosylation has historically been considered to mostly target proteins, there has been increasing evidence that dna can also be a target for adp-ribosylation. 
the first enzymes reported to adp-ribosylate dna were pierisins, toxins expressed by the cabbage butterfly and related species. pierisins irreversibly adp-ribosylate dna on guanines ( , ) . more recently, it was discovered that the bacterial toxin-antitoxin system dart-darg mediates reversible dna adp-ribosylation on thymidine residues in single-stranded dna (in a sequence-specific manner) ( ) . furthermore, it was shown that dna repair parps (parp , parp and parp ) can modify dna on phosphates at dna breaks ( ) ( ) ( ) ( ) . the adpr groups on phosphates at dna breaks are efficiently removed by several cellular hydrolases, most notably by parg, macrod / , targ and arh ( , , ) . the pool of cellular substrates for adp-ribosylation has continued to expand and it was recently shown that the parp-like proteins trpt /tpt /kpta from bacteria and fungi can adp-ribosylate rna and dna ends ( ) . in this paper, we reveal that adp-ribosylation of rna at the terminal phosphate is more widespread than initially thought. we demonstrate that homologues of trpt in higher organisms as well as human parp , parp and parp can adp-ribosylate phosphorylated ends of rna. we also show that rna adp-ribosylation is a reversible process: the modification can be removed by several human hydrolases as well as by some viral and bacterial macrodomains. thus, this study provides the first evidence of reversible adp-ribosylation of rna. full length (fl) parp was cloned into the pdest vector with a his tag. cdna encoding the human parp catalytic and brct domains ( - aa) was obtained using gblock gene fragments and cloned into a pet-his-sumo-tev vector using ligation-independent cloning. parp bcat (tankyrase catalytic domain) was cloned into pet-his vector. parp catalytic domain ( - aa) wt and g w mutant genes were cloned into pgex- t vector with gst tag. parp fl and parp fl genes were cloned into pet-his -sumo-tev vector. parp catalytic domain ( - aa) was cloned in pet a vector. 
the catalytic domain of human parp ( - aa) was pcr-amplified from the cdna library using primers with non-complementary restriction enzyme sites located at the (ecori) and (xhoi) ends. the amplified product was cloned into pet- b+ (novagen). parp catalytic domain ( - aa) was pcr-amplified and cloned into pet a vector using bamhi and xhoi cloning sites. parp wwe and catalytic domain ( - aa) and parp catalytic domain ( - aa) were cloned into pnic-bsa - xhis vector. trpt gene (uniprot q tn ) was codon optimized and synthesized with a his tag by invitrogen geneart gene synthesis and then further cloned into pet a vector using ncoi and xhoi restriction enzyme cloning. point mutants of trpt were prepared using the agilent quikchange lightning site-directed mutagenesis kit. streptomyces coelicolor kpta homologue (sco ) was cloned from s. coelicolor genomic dna into pet b. parp fl was purified as mentioned earlier ( ) . parp , tankyrase cat ( ), parp cat and parp fl plasmids were transformed into escherichia coli bl (de ) competent cells (millipore) and grown on lb agar plates with kanamycin ( mg/ml) and chloramphenicol ( mg/ml) overnight at °c. a swath of cells was inoculated into a ml starter culture of lb media with kanamycin and chloramphenicol at rpm, °c overnight. for each protein of interest l of terrific broth (tb) media ( g bacto tryptone, g yeast extract, . % glycerol, mm kh po , mm k hpo , % glucose, g/ml kanamycin, g/ml chloramphenicol) was inoculated with the starter culture and grown to . - . od at °c, rpm. iptg (sigma-aldrich) was added to . mm to induce protein expression for - h at °c, rpm. cells were harvested by centrifugation, resuspended in lysis buffer ( mm hepes, ph . , mm β-mercaptoethanol, mm benzamidine, . % np- , . % tween- , mm nacl, mm phenylmethylsulfonyl fluoride (pmsf), . mg/l dnase i (roche)) and lysed by sonication at °c (branson sonifier ). 
lysates were incubated with pre-washed ni-nta agarose resin ( % slurry, qiagen) with end-over-end rotation at °c for h. following extensive washing with buffer b + ( mm hepes, ph . , mm β-me, mm pmsf, mm benzamidine, mm nacl, mm imidazole) protein was eluted in four fractions of b containing - mm imidazole. fractions containing protein were collected and dialysed against mm tris-hcl, ph . , . mm edta, mm β-me, . m nacl at °c. parp catalytic domains (wt and g w mutant) were purified as mentioned earlier ( ) . in short, the gst-tagged parp plasmid was transformed into rosetta de competent cells and grown in lb media supplemented with ampicillin and chloramphenicol. cultures were induced with . mm iptg at . - . od and grown overnight at °c. following centrifugation, parp cat bacterial cell pellet was resuspended in pbs buffer supplemented with bugbuster protein extraction reagent, benzonase, % glycerol, mm dtt and complete protease inhibitor cocktail and allowed to lyse by incubation at °c for h. lysate was further centrifuged and cleared lysate was applied to glutathione sepharose beads for h at °c. gst-tagged parp was eluted using lysis buffer supplemented with mm reduced glutathione. eluted protein was further dialysed against mm tris-hcl ph . , mm nacl, % glycerol and mm dtt. parp cat wt and g w mutant were further purified on superdex column. parp fl was transformed into rosetta de competent cells and grown in l × yt media supplemented with kanamycin. cultures were induced with . mm iptg at . - . od and grown overnight at °c. bacterial pellet was resuspended in lysis buffer ( mm hepes ph , mm nacl, mm imidazole and . mm tcep). parp fl was purified via a three-step purification process involving nickel column, heparin column and gel filtration chromatography. cells were lysed by addition of bugbuster, protease inhibitor cocktail, benzonase and lysozyme followed by incubation for h at °c. 
cleared lysate was incubated with pre-washed ni-nta agarose resin ( % slurry, qiagen) with end-over-end rotation at °c for h. beads were then further washed with high salt buffer ( mm hepes ph , m nacl, mm imidazole and . mm tcep) followed by gradient elution over mm- m imidazole. eluted protein was assessed by sds-page and further dialysed against mm tris ph . , mm nacl, mm edta and . mm tcep. dialysed protein sample was further diluted using no-salt buffer ( mm tris ph . , mm edta and . mm tcep) to bring the salt concentration to mm nacl and was applied onto a heparin column to remove any nucleic acid contamination. a small fraction of the protein bound to the heparin column, while most ran out as flow-through under the conditions tested. the heparin column-bound protein was eluted with a gradient of mm- m nacl. at this stage, the purity of eluted protein was tested by sds-page. protein fractions were concentrated and further subjected to size exclusion chromatography using superdex column. parp cat ( - aa) and parp wwe and cat ( - aa) plasmids were transformed into rosetta de competent cells and grown in × yt media supplemented with kanamycin. induction was carried out at . - . od using . mm iptg and cells were allowed to grow overnight at °c. bacterial pellet was lysed in lysis buffer ( mm hepes ph , mm nacl, mm imidazole and . mm tcep) supplemented with bugbuster, protease inhibitor cocktail, benzonase and lysozyme. cleared lysate was then bound to pre-washed ni-nta agarose resin followed by washes with lysis buffer. proteins were eluted using elution buffer ( mm hepes ph , mm nacl and . mm tcep) with an incremental gradient of - mm imidazole. protein purity was assessed by sds-page. parp protein was dialysed overnight against mm tris ph . , mm nacl, mm edta and . mm tcep buffer. parp protein was dialysed overnight against mm hepes ph . , mm nacl and . mm tcep and further subjected to superdex column for size exclusion chromatography. 
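the dilution of the dialysed sample with no-salt buffer before the heparin column follows the usual c1 × v1 = c2 × v2 relation. the python sketch below is only an illustration of that arithmetic: the function name and the example concentrations and volume are hypothetical, because the values used in the protocol are not recoverable from this text.

```python
def no_salt_buffer_volume(start_nacl_mm: float,
                          target_nacl_mm: float,
                          sample_volume_ml: float) -> float:
    """Return the volume (ml) of salt-free buffer to add to a sample.

    Uses C1 * V1 = C2 * V2, assuming the diluent contains no NaCl.
    All argument values passed by a caller are illustrative assumptions.
    """
    if start_nacl_mm <= 0 or sample_volume_ml <= 0:
        raise ValueError("starting concentration and volume must be positive")
    if not 0 < target_nacl_mm <= start_nacl_mm:
        raise ValueError("target must be positive and not above the start")
    # Final volume needed to reach the target concentration.
    final_volume_ml = start_nacl_mm * sample_volume_ml / target_nacl_mm
    # Only the difference is added as diluent.
    return final_volume_ml - sample_volume_ml


if __name__ == "__main__":
    # Hypothetical example: 10 ml at 500 mM NaCl diluted to 100 mM NaCl.
    print(no_salt_buffer_volume(500.0, 100.0, 10.0))
```

with these made-up inputs the final volume is 50 ml, so 40 ml of no-salt buffer would be added.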
the catalytic domain of parp ( - aa) was purified as mentioned earlier ( ) . trpt was purified as described earlier ( ) . in short, trpt plasmid was transformed into rosetta de competent cells and grown in lb media supplemented with kanamycin. cultures were induced with . - . mm iptg at . od and grown overnight at °c. following centrifugation, trpt bacterial pellet (wt or mutants) was resuspended in lysis buffer ( mm tris-hcl ph . , mm nacl, % glycerol and mm imidazole) supplemented with bugbuster, benzonase, . mm tcep and complete protease inhibitor cocktail and lysed by mixing for h at °c. lysate was centrifuged at rpm for min and the cleared lysate was incubated with pre-washed nickel nta agarose beads for h at °c. his-tagged trpt protein was eluted using elution buffer ( mm tris-hcl ph , . m nacl, % glycerol) with an incremental gradient of - mm imidazole. eluted trpt protein was dialysed overnight against column buffer ( mm tris-hcl ph , mm nacl, mm dtt and % glycerol). trpt wt protein was further purified by size exclusion chromatography using superdex column. streptomyces coelicolor kpta homologue (sco ) gene was expressed in escherichia coli bl (de ) cells, which were grown for h at °c with . mm iptg added at . od . recombinant protein was purified using talon affinity resin according to the standard procedure. streptomyces coelicolor macrod homologue (sco ) was obtained as described earlier ( ) . proteins listed below were gifts from other members of the lab. mycobacterium tuberculosis (mtb) darg-macro was cloned with n-terminal amino acids as described earlier ( ) . catalytic domains of parp ( - aa) ( ), macrod ( , ) , macrod ( ) , targ ( ), parg ( ), nudt ( ) and arh - ( ) were purified as described earlier. viral macrodomain-containing hydrolases from veev ( ) and sars coronavirus ( ) were prepared as described earlier. 
single-stranded (ss) rna and dna oligos used in this study were commercially ordered from sigma-aldrich and invitrogen, respectively, and are listed in table . oligonucleotides were diluted to m stock solution in mm hepes-koh (ph . ) and mm kcl buffer. double-stranded (ds) dna was prepared by annealing complementary strands of dna (ssdna oligo with rext) at °c for min, followed by gradual cooling to room temperature. pmol nop ssrna (with or without cyanine tag at end) was radioactively labelled at end using t polynucleotide kinase phosphatase minus (neb) in the presence of γ p atp (perkinelmer) and heated at °c for min followed by heat inactivation at °c for min. radiolabelled oligo was further desalted on g column to remove any unincorporated atp. this radiolabelled oligo was used as a size marker as indicated in the figures. adp-ribosylation assays with rna were performed as described previously for dna adp-ribosylation ( ) . all buffers were made in dnase/rnase-free water and filter-sterilized prior to use. in short, l reaction mix was prepared in buffer containing mm hepes-koh (ph . ), mm kcl, mm mgcl and mm dtt. rna substrate ( m) was added along with m protein, m nad+ (trevigen) and kbq p labelled nad+ (perkinelmer) per reaction. protein and nad+ concentrations were used as mentioned above unless stated otherwise. reactions were incubated at °c for min and stopped by addition of ng/l proteinase k and . % sds and heating the reaction at °c for min, unless stated otherwise. reactions that were treated with benzonase or calf intestinal phosphatase (cip) were heated at °c for min. samples were further heated at °c for min with × tbe urea sample buffer ( m urea, m edta ph . , m tris ph . and bromophenol blue). the samples were loaded on a pre-run denaturing urea page gel made of % (w/v) polyacrylamide, m urea and × tbe. the gel was run at w/gel in . × tbe buffer. the gel was dried under vacuum and visualized by autoradiography. 
the non-radioactive rna adp-ribosylation assay was performed using cyanine-labelled rna, essentially as for the radioactive assay, except that m labelled rna oligo and m nad+ were used. the gel was visualized using a molecular imager pharosfx system with laser excitation for the cyanine fluorophore at nm wavelength. all adp-ribosylation assays were individually repeated three times. to study the time dependence of parp cat and trpt mediated rna adp-ribosylation, reaction samples were prepared as mentioned earlier. aliquots were taken out at different time points ( , and min). the min time point was obtained by placing the reaction on ice. reactions were stopped by addition of × tbe urea sample buffer. to study the nad+ concentration dependence of parp cat and trpt , the assay was performed as described earlier with different concentrations ( - m) of nad+ in the reaction. the nad+ concentration dependence study was performed in a non-radioactive setup using cyanine-labelled rna. the effect of adenosine mono-phosphate (amp) or phosphoadenosine phosphate (pap) on parp mediated rna modification was studied by supplementing the rna adp-ribosylation reaction with , or m amp or pap. parp catalysed rna adp-ribosylation reactions were stopped by addition of the parp inhibitor -aminobenzamide ( aba) before treating with hydrolases. m hydrolase enzyme was added per reaction, and the reactions were heated at °c for min. reactions containing nudt were supplemented with mm mgcl ( ) . in recent years there has been increasing evidence of dna as a new target for reversible adp-ribosylation. we wanted to investigate whether rna could also be similarly adp-ribosylated by any known arts. we decided to initially test parp as it was recently demonstrated that this protein has robust art activity on dna ( ) ( ) ( ) . in addition, we focused on another member of the parp family, parp , which contains an rna-recognition motif (rrm domain) ( ) . 
purified parp and parp (catalytic domain) were first tested with a nucleotide single-stranded rna (ssrna) oligo with or without a phosphate group at the end in the presence of p labelled nad+ as an adpr donor. double-stranded dna was used as a positive control (figure a, lane ) for adp-ribosylation on dna by parp ( ) . strikingly, parp substantially modified the phosphorylated ssrna oligo (figure a, lane ), reducing its mobility compared to the phosphorylated oligo labelled on its phosphate using p labelled gamma-atp (figure a, lane ). parp was not able to adp-ribosylate ssrna without a phosphate group (figure a, lane compared to lane ), suggesting that rna adp-ribosylation by parp occurs on the phosphate. importantly, parp could only adp-ribosylate dna ends and did not have any activity on rna oligos, while parp specifically modified phosphorylated ssrna oligo in the conditions tested (figure a and b). we further demonstrate that the rna adp-ribosylation activity of parp is time and nad+ concentration dependent (supplementary figure s a and b) but independent of the presence of mgcl (supplementary figure s c, lanes and ) . to further ascertain the specificity of adp-ribosylation by parp on dna and/or rna ends we tested both single-stranded oligos with or without a phosphorylated moiety at either or end. parp modified ssrna oligos phosphorylated at either end but did not show activity on ssdna oligo irrespective of the terminal phosphorylation state (figure c, lanes and ) . the product of the reaction catalysed by parp in the presence of phosphorylated ssrna was stable against proteinase k treatment (figure d, lanes and ) but not against treatment with benzonase (figure d, lanes and ), confirming that the modification was on nucleic acid and not on protein. next we wanted to assess the specificity of parp for modification of the phosphate groups on rna. 
for this, we analysed rna adp-ribosylation catalysed by parp catalytic domain in the presence of excess adenosine mono-phosphate (amp) or -phosphoadenosine phosphate (pap) as potential competitors, and we observed no significant change in rna modification in the presence of excess nucleotide analogue (supplementary figure s d) . to further demonstrate that adp-ribosylation of rna by parp occurs on terminal phosphates we designed and phosphorylated ssrnas with the cyanine (cy ) label at the opposite end to the phosphorylation for more sensitive detection. we also produced non-phosphorylated versions of these oligos as additional controls. the phosphorylated oligos p ssrna cy and p ssrna cy treated with parp produced a slower-migrating adp-ribosylated product (figure e and f, lane ) . when the reaction was further treated with calf intestinal phosphatase (cip), the slower-migrating adp-ribosylated band remains intact while the lower unmodified band shifts upwards due to the removal of a charged phosphate group by cip treatment (figure e and f, lane ) and now migrates the same as the non-phosphorylated oligo control (figure e and f, lane ) . when the phosphorylated oligos p ssrna cy and p ssrna cy (figure e and f, lane ) are treated directly with cip in the absence of parp , dephosphorylation of the oligo is observed, as confirmed by a migration pattern similar to that of the non-phosphorylated oligo (figure e and f, lanes and ). parp modified phosphorylated end more efficiently than phosphorylated rna end. these results show that adp-ribosylation of rna by parp occurs on the phosphate group, thereby protecting the phosphate group from the dephosphorylation activity of cip. since the catalytic domain of parp lacks two-thirds of the protein including the rrm, we wanted to assess the activity of the fl parp on the rna substrate. we expressed and purified full length parp and tested it for rna modification activity. 
we observed that full length parp can also adp-ribosylate phosphorylated rna ends (figure g, lane ); however, in comparison to the wt parp catalytic domain, the activity was much weaker (figure g, lane ) . a possible explanation for this finding is that the isolated fl parp exists in an autoinhibited state, due to the inhibitory function of another domain within the protein. such an autoinhibitory property has already been well characterized for parp ( ) . we observed no rna modification by the previously characterized catalytic mutant parp g w (figure g, lane ) . adp-ribosylation of proteins and dna is reversible. we wanted to investigate whether rna adp-ribosylation by parp is also a reversible process. since parp can adp-ribosylate both and phosphorylated ends of rna, we tested both of these modified oligos as substrates for well-characterized human adp-ribosylhydrolases: parg, targ , macrod , macrod and arh - . we also tested human nudt , which is known to cleave the pyrophosphate bond in adp-ribosylated proteins to generate phospho-ribose modified proteins ( ). all of the above-mentioned hydrolases, except for arh and arh , were able to remove adp-ribosylation from either or phosphates at the end of rna substrates (figure a and b) . importantly, catalytically inactive mutants of macrod (g e) and arh (d a) ( , ) did not remove adpr from phosphorylated-ssrna (figure c) , demonstrating that the enzymatic activity of these adp-ribosylhydrolases is required for efficient removal of adp-ribose from rna phosphorylated ends. together, these results demonstrate that adp-ribosylation of phosphorylated-rna oligos catalysed by parp is a reversible process. previous studies have shown parp to be an interferon-induced gene that inhibits replication of venezuelan equine encephalitis virus (veev) and other alphaviruses ( ) , yet the physiological substrates for parp antiviral activity remain unknown. 
thus, we tested viral macrodomain-containing hydrolases from veev and severe acute respiratory syndrome coronavirus (sars cov) for their adp-ribosylhydrolase activity on rna substrates. these viral hydrolases are known to support the ability of viruses to replicate in host cells ( , , ( ) ( ) ( ) ) , but their physiological substrates have yet to be identified. strikingly, adp-ribosylation of and phosphorylated ssrna by parp could be efficiently reversed by the addition of viral macrodomain proteins (figure d). the ability of viral macrodomains to remove adpr from parp -modified phosphorylated-ssrna could indicate a potential biological role for parp in antiviral response acting on viral rnas and the role of viral macrodomains in suppressing this function. viral macrodomains could also reverse adp-ribosylation of both double-stranded and single-stranded dna, as with the single-stranded rna modification (figure e). next we decided to check several other human parps for their ability to modify rna. we tested full length parp and parp ( ) ; catalytic domains of parp , tankyrase ( ), parp , parp , parp , parp , parp and parp ; and a highly diverged parp-like protein sometimes annotated as the th human parp, trpt ( , ) , for rna adp-ribosylation activity. we observed that, in addition to parp , parp , parp and human trpt were also able to adp-ribosylate phosphorylated ssrna (figure a, lanes , , and ). however, in the conditions tested, the other parps were unable to adp-ribosylate phosphorylated ssrna (figure a). we focused on the adp-ribosylation activity of trpt . we wanted to test whether the adp-ribosylation by trpt was phosphate-dependent and had specificity towards dna and/or rna. for this, we tested ssdna and ssrna oligos with a phosphate group at either or end or without a phosphate group (nop). we observed that trpt can adp-ribosylate both dna and rna but only in the presence of a phosphorylated end (figure b). 
we observe that trpt-based rna modification is time and nad+ concentration dependent (supplementary figure s a and b) but independent of mgcl (supplementary figure s c, lanes and ) . conserved amino acid residues arg-his-arg-arg are essential for phosphotransferase function of yeast tpt ( , ) . based on sequence alignment we mutated the corresponding residues in human trpt individually to alanine (r a, h a, r a and r a). these point mutants were unable to adp-ribosylate the phosphorylated end of rna (figure c). this suggests that adp-ribosylation at the end of rna is also mediated via the same active site as originally studied for the nad-dependent rna phosphotransferase activity of yeast tpt . to further establish that the adp-ribosylation signal observed in the presence of trpt was an rna-dependent modification, we further treated the reaction with proteinase k or benzonase. the band observed in the presence of trpt and p ssrna was resistant to proteinase k treatment but not to benzonase, which confirms that the band corresponds to adp-ribosylated nucleic acid (figure d) . we treated p ssrna cyanine-tagged oligo ( p ssrna cy ) with trpt , which generated % of the adp-ribosylated product of slower mobility in our conditions (figure e, lanes and ) . treatment of unmodified p oligo with cip led to a band that migrated slower, similar to the non-phosphorylated oligo (figure e, lane versus lanes and ). cip treatment of trpt catalysed rna substrate leaves the upward-shifted adp-ribosylated rna oligo intact, but the lower unmodified rna oligo migrates similarly to the non-phosphorylated oligo (figure e, lanes and ) . these results confirm that adp-ribosylation mediated by trpt is on the rna oligo and that the modification occurs on the phosphate group at the end, which protects rna against further dephosphorylation by cip. we also tested smaller mer and mer rna oligos and established that trpt activity is not affected by the length of the rna oligo (figure f). 
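the proportion of adp-ribosylated product described above for the cyanine-labelled oligo can be estimated from the intensities of the shifted and unshifted bands in one lane. the python sketch below is an assumption about how such a quantification could be done, since the text does not describe its densitometry procedure; the function name, the background parameter and the example intensities are all hypothetical.

```python
def fraction_modified(shifted_intensity: float,
                      unmodified_intensity: float,
                      background: float = 0.0) -> float:
    """Estimate the fraction of oligo in the slower-migrating band.

    Both band intensities are background-corrected (clamped at zero),
    then the shifted band is divided by the lane total. Inputs are
    arbitrary fluorescence units from any gel-imaging software.
    """
    shifted = max(shifted_intensity - background, 0.0)
    unmodified = max(unmodified_intensity - background, 0.0)
    total = shifted + unmodified
    if total == 0:
        raise ValueError("no signal above background in this lane")
    return shifted / total


if __name__ == "__main__":
    # Hypothetical intensities for one lane (not taken from the paper).
    print(f"{fraction_modified(300.0, 700.0):.2f}")
```

a ratio within a single lane is used here so that loading differences between lanes do not affect the estimate.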
similar to parp mediated rna adp-ribosylation, we wanted to test whether the trpt catalysed rna modification could be reversed by known human adp-ribosylhydrolases. we observed removal of the adpr signal by parg, targ , macrod , macrod , arh and nudt ( figure g ). arh and arh were unable to reverse the rna modification. catalytically inactive mutants of macrod (g e) and arh (d a) were inactive compared with the wild-type hydrolases, as seen earlier for parp ( figure h ). trpt mediated dna modification could also be reversed by the human hydrolases tested above (supplementary figure s c) . we also tested hydrolase function on the adp-ribosylated p ssrna cy oligo. the phosphorylated oligo, when treated with trpt , produces a slow migrating band and a faster migrating unmodified rna band ( figure i , lane ). further treatment with the hydrolases parg, macrod and macrod reverses the modification, as shown by the loss of the slow migrating adp-ribosylated band ( figure i , lanes [ ] [ ] [ ] ). however, the reversal of the modification by the hydrolases does not change the migration pattern of the lower unmodified band ( figure i , lanes [ ] [ ] [ ] ), which still matches the migration pattern of phosphorylated ssrna (as seen in figure i , lane versus lanes [ ] [ ] [ ] ). this confirms that hydrolytic cleavage of the adpr group by the hydrolases does not affect the phosphate group to which the modification is covalently attached. trpt homologues are distributed across the eukaryal, archaeal and bacterial domains of life. in e. coli, these proteins are usually referred to as kpta. we wanted to test whether rna adp-ribosylation by kpta homologs is conserved across different species. in addition to trpt , we also tested the kpta homolog from s. coelicolor (sco kpta/ sco ) with different rna substrates. as seen earlier for trpt , sco kpta could also adp-ribosylate rna exclusively at the phosphorylated end ( figure a , lanes and ). since s.
coelicolor also possesses a macrod-like protein sco similar to human macrod and macrod we were interested to investigate if this macrodomain could function as potential hydrolase to remove adp-ribosylation mediated by sco kpta. using macrod as a positive control for removal of adp-ribosylation mediated by both sco kpta and trpt we tested sco and another known bacterial macrodomain fold containing hydrolase darg ( ) . we observed that sco was proficient at reversing rna adp-ribosylation mediated by both sco kpta and trpt , however, darg was inactive against kpta mediated modification ( figure b and c). adp-ribosylation is an important chemical modification which helps cells to adapt and survive while maintaining their genomic integrity when faced with challenging environmental conditions. classic macromolecular targets of adp-ribosylation have been proteins, however, there have been several studies in the past few years that have demonstrated dna as an important target for adp-ribosylation ( ) ( ) ( ) ( ) ( ) . here, we set out to uncover if any member of the adp-ribosyltransferase family could also adp-ribosylate rna. we demonstrate, for the first time, that adpribosylation of rna can be catalysed by a few members of parp family--parp , parp , parp and a parplike protein -trpt previously characterized as an nad + dependent phosphotransferase ( , ) . parp was one of the first intracellular mono(adpribosyl)ating arts identified ( ) . in addition to the catalytic domain, parp also contains a rna recognition motif (rrm), two functional ubiquitin interaction motifs (uim), a sequence that promotes nuclear targeting as well as nuclear export and a motif that mediates interaction with pcna (pip) ( , ( ) ( ) ( ) ( ) . while several protein targets of parp have been suggested ( ) , the physiological role of parp is unclear. our study sheds further light on the potential biological function of parp through modification of rna. 
our results show parp can adp-ribosylate phosphorylated rna ends with a modest preference for over ends. parp mediated rna adp-ribosylation is resistant to phosphatase treatment which would indicate a novel rna capping mechanism possibly protecting the rna against the nuclease attack. while the biological relevance for this rna based modification is currently unknown we postulate a potential role in the innate immune response. parp has previously been shown to be induced by interferon and can inhibit viral replication ( , , ) . parp can also inhibit the activation of nf-b which is activated during infection ( ) . adp-ribosylation of rna by parp could act as a signal/marker to initiate an appropriate immune response. parp has an inhibitory effect on alphavirus replication and on protein biosynthesis ( , ) . these inhibitory effects could potentially be mediated via rna adp-ribosylation, where the adpr moiety acts as a rna cap thereby preventing rna translation or triggering signal transduction. the presence of rrm domain in parp could have a role in differentiating foreign rna of invading pathogens from host rna to work in tandem with the catalytic domain to adp-ribosylate rna and to further initiate the immune response. similar function has been observed for the apoptotic role of parp whereby the rrm domain contributed to pro-apoptotic activity together with the catalytic domain ( ) . while rna adp-ribosylation could provide an interesting link towards explaining the anti-viral role of parp , equally this activity could function in initiating or inhibiting translation thus effecting a cascade of signal transduction. several human adp-ribosyl hydrolases parg, targ , macrod / and arh can reverse rna adpribosylation mediated by parp . localization of these hydrolases to nucleus, cytoplasm and mitochondria ( , , , ) suggests that rna adp-ribosylation is utilized in different cellular compartments. 
interestingly, in addition to the human hydrolases, we also observe that the veev and sars viral macrodomain-containing hydrolases can remove rna adp-ribosylation mediated by parp . this ability of viral macrodomains could indicate a mechanism of pathogenesis in which the antiviral activity of parps is counteracted, which would make viral macrodomains good candidate drug targets. in addition to parp , parp and parp , we also show adp-ribosylation of rna catalysed by trpt , an ancestral relative of the parp superfamily that is sometimes referred to as the th member of the parp family. this gene is highly conserved in the eukaryotic, archaeal and bacterial domains of life. while the human gene is known as trpt , the yeast and bacterial versions of trpt are referred to as tpt and kpta, respectively. the yeast homologue has been characterized for its role in trna splicing, acting as an nad + dependent -phosphotransferase ( , ) . the removal of the -phosphate by tpt occurs in two chemical steps: first, the -phosphate reacts with nad + to release nicotinamide and form a -phospho-adp-ribose rna intermediate; second, adp-ribose - cyclic phosphate is generated, leaving behind the rna with a hydroxyl group at the end. while -phosphotransferase activity is conserved across all diverse homologues of trpt ( ) , there is no evidence of intron-containing trna (which would need trpt activity for splicing) and/or of a pathway that would generate rna with a -phosphate in most organisms, except in plants and fungal species ( , ) . furthermore, trpt knockout cells from mouse exhibit levels of trna splicing comparable to those of wild-type cells ( ) . although some bacteria possess introns in their trnas, these are self-splicing introns with a very limited distribution, restricted to several representatives of proteobacteria and cyanobacteria ( ) . therefore, the functional role of these widely conserved trpt genes in other species remains elusive.
a recent study has demonstrated that several archaeal species such as aeropyrum pernix, pyrococcus horikoshii and archaeoglobus fulgidus and bacterial clostridium thermocellum possess tpt protein that can adp-ribosylate rna at -phosphorylated ends ( ) . in our study, we show that trpt from a higher eukaryote (human) and from a bacterium (streptomyces species) can also adp-ribosylate -phosphorylated rna--revealing that rna adp-ribosylation activity is widespread among trpt proteins. to summarize, our results identify rna as a novel target of reversible adp-ribosylation that can be catalysed by both parp and trpt classes of arts in vitro. this modification of rna occurs on phosphorylated terminal ends of rna; it can be made by parp and trpt arts and reversed by several known adp-ribosylhydrolases. efficient in vitro activities on rna substrates by these enzymes suggest that rna adp-ribosylation reactions could be relevant in vivo. we hypothesize that trpt /parp could potentially mediate adp-ribosylation signalling on rna substrates as an on/off switch thereby controlling the functional state of rna, protecting rna ends or act as a platform for recruiting other proteins. in addition we also demonstrate other parps--parp and parp to adp-ribosylate phosphorylated rna ends, however further characterization is required to reveal the functional role of these proteins. 
key: cord- - ge kmr authors: routh, andrew; johnson, john e. title: discovery of functional genomic motifs in viruses with virema–a virus recombination mapper–for analysis of next-generation sequencing data date: - - journal: nucleic acids res doi: . /nar/gkt sha: doc_id: cord_uid: ge kmr we developed an algorithm named virema (viral-recombination-mapper) to provide a versatile platform for rapid, sensitive and nucleotide-resolution detection of recombination junctions in viral genomes using next-generation sequencing data.
rather than mapping read segments of pre-defined lengths and positions, virema dynamically generates moving read segments. virema initially attempts to align the ′ end of a read to the reference genome(s) with the bowtie seed-based alignment. a new read segment is then made by either extracting any unaligned nucleotides at the ′ end of the read or by trimming the first nucleotide from the read. this continues iteratively until all portions of the read are either mapped or trimmed. with multiple reference genomes, it is possible to detect virus-to-host or inter-virus recombination. virema is also capable of detecting insertion and substitution events and multiple recombination junctions within a single read. by mapping the distribution of recombination events in the genome of flock house virus, we demonstrate that this information can be used to discover de novo functional motifs located in conserved regions of the viral genome. viruses are renowned for their ability to mutate and rapidly adapt to new environments. recently, the use of next-generation sequencing (ngs) has risen dramatically in virus discovery and the identification of emerging pathogens ( ) ( ) ( ) , the characterization of the human virome ( ), the analysis of established infectious agents ( , ) , the quality control of live attenuated viruses ( , ) and to understand the mutant spectra of viruses. ngs can be used to discover and characterize both homologous ( ) and non-homologous recombination ( ) . viral recombination generates considerable genetic diversity and plays a central role in the evolution and emergence of new viruses ( ) . recombination can reshuffle single mutations that originally occurred on different, but homologous, viral genomes, resulting in the accumulation of advantageous mutations or the removal of deleterious ones. 
homologous recombination between co-infecting viruses may also result in the evolution of new virus strains, as was observed among picornaviruses including human rhinoviruses ( ) and between vaccine-derived polioviruses and circulating enteroviruses ( ) . non-homologous recombination has the potential to mutate large swathes of the viral genome by deleting large portions to form 'defective genomes' or by inserting foreign genetic material from other viruses or from the host. defective genomes evolve during persistent and acute infections in cell culture ( ) as well as during wild infections ( , ) . the evolution of defective genomes was proposed to be critical in the transition of acute to chronic viral infections ( ) and was found in patients persistently infected with measles virus ( ) , dengue virus ( ) and hepatitis c virus ( ) . virus-to-host recombination events are relatively rare as compared with intra-viral genome recombinations; however, such events may have important biological consequences such as the selective advantage conferred to hepatitis e virus on insertion of a -nt fragment from human s ribosomal protein mrna ( ) . we have developed a new algorithm called virema (viral-recombination-mapper) to provide a versatile platform for the discovery of recombination events in deep sequencing datasets. virema is compatible across a number of sequencing platforms that produce either long (e.g. technologies) or short (e.g. illumina) reads and can work with a variety of viral genomes (dna or rna, single-or double-stranded, multi-or single partite, short or long, etc). virema does not require any pre-treatment of the original dataset beyond standard quality-filtering and does not require any special reference library generation. in addition to single recombination events, multiple recombination events within a single read are detected, as are virus-to-host recombination and insertion and substitution events. 
using virema, we demonstrate that by mapping the distribution and frequency of recombination events in the genome of flock house virus (fhv), we can discover de novo functional genomic motifs required for viral replication and encapsidation. source code and updates for virema can be found at sourceforge.net/projects/virema the dataset analyzed in this study was generated as part of a previous analysis in our laboratory ( ) and is publicly available at the ncbi small read archive with the accession number srp . these reads are -nt single reads generated on an illumina hiseq using standard cdna library generation protocols and directional rnaseq. briefly, authentic fhv particles were amplified in drosophila s cells grown in suspension, harvested days after infection and then purified over a series of centrifugation steps consisting of one % sucrose cushion and two - % sucrose gradients in the presence of mm hepes, ph . . an additional -h nuclease digestion was performed with u dnase i and . ug rnase a in between the two sucrose gradient spins to remove any contaminating non-encapsidated rna or dna. after the final sucrose gradient, rna was extracted using standard phenol/chloroform extraction, ethanol precipitated and re-suspended in pure water. in all, ng rna was used for cdna library generation using standard truseq adaptors and barcodes, polymerase chain reaction amplified for cycles and then purified by agarose gel electrophoresis to yield inserts of nt. the cdna library was loaded onto a hiseq v single read flowcell and sequenced for nt of the insert and nt of the index sequence on an illumina hiseq . reads were processed using casava . . . before analysis, the raw reads were processed by removing the adaptors (sequence = tggaattctcg ggtgccaagg) using cutadapt ( ) with default parameters and then removing any reads containing any nucleotide with a phred score < using the fastx toolkit (http://hannonlab.cshl.edu/fastx_toolkit/). 
any read < nt was discarded, leaving reads ranging from to nt in length. for the analysis using pseudo-reference libraries to detect rna recombination, the last five nucleotides of each read were trimmed away and then any read containing < nt was discarded. this yielded a dataset containing reads, all exactly nt in length. these datasets were then aligned end-to-end to the fhv genome (nc_ and nc_ ) using bowtie (version . . ) with parameters -v --best, and then to the drosophila melanogaster genome (fb _ ) using parameters -v --best. any unaligned reads were analyzed for evidence of recombination as described in the main text. virema is a python script that iteratively calls the small read alignment program, bowtie ( ) , to try and map all portions of a candidate read. the process used is similar to other recombination or fusion detection algorithms such as tophat and tophat-fusion ( , ) , fusionmap ( ) or mapsplice ( ) that split up a read into segments of a specific length, which are then mapped independently. however, virema initially attempts to align the end of a read to the reference genome(s) and then dynamically generates a new read segment by either extracting nucleotides at the end of the read that fail to align or by trimming the first nucleotide from the read. this continues iteratively until all portions of the read are either mapped or trimmed or a combination of both as summarized in the flow diagram in figure . this process is illustrated step by step using an example read in figure . virema is split into two phases. the first phase searches and reports recombinations found for each read of a given dataset. the second phase compiles these results into several output files based on the recombination events that they describe. virema uses bowtie's seed-based alignment: '-n' mode. a seed, typically comprising - nt, is extracted from the beginning of a sequence read and aligned to a reference genome using a burrows-wheeler indexing technique ( ) . 
if a valid mapping location is found for the seed, the remaining nucleotides are also aligned. bowtie takes into account the quality scores for each nucleotide in the read and reports a successful mapping provided that the sum of the quality scores for each discovered mismatch does not exceed a user-defined value: the '-e' value. if multiple mapping locations are found in the reference genome, bowtie (in 'best' mode) reports the mapping that raises the fewest mismatches in and after the seed. virema exploits this method of mapping by purposely specifying a very high '-e' value. consequently, bowtie will report the locations of a successfully mapping seed regardless of the number of mismatched nucleotides that follow it. these mismatches are reported in the th field of the standard sequence alignment map (sam) output ( ) . for example, a -nt read that successfully aligns to a reference sequence but with only one mismatch might report 'md:z: c ' in this field. however, a -nt read that maps over a recombination junction might read: 'md:z: c g g c c g g a g g a t c c a -a t t t a g t c '. from this, we can determine that the first nucleotides from the read have been successfully aligned, but the remaining nucleotides are predominantly mismatched to the reference sequence. (note that the cigar string will still read ' m'.) the first aligned nucleotides thus correspond to the region of a putative recombination junction up until the recombination breakpoint. the remaining nucleotides in the read after a detected breakpoint can then be extracted from the sequence alignment map file and used to generate a new read segment for use in another bowtie alignment. a new segment will be generated from the remaining nucleotides beginning with the first disqualifying mismatch. virema can tolerate up to two mismatches ('--n' parameter in command line) in the seed during the bowtie mapping and in the remaining aligned nucleotides. 
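the way a breakpoint can be read off the md field described above can be sketched in a few lines of python. this is a minimal illustration, not virema's own code: the function names and the example md string are hypothetical, only the standard sam md syntax is parsed, and the additional '--x' restriction on mismatches near a junction is not modelled.

```python
import re

# One MD token is a run of matches (digits), a deletion (^ACGT...) or a
# single mismatched reference base.
_MD_TOKEN = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")


def mismatch_positions(md_field):
    """Return (0-based read positions of mismatches, aligned read length)."""
    md = md_field[5:] if md_field.startswith("MD:Z:") else md_field
    pos = 0
    mismatches = []
    for run, deletion, base in _MD_TOKEN.findall(md):
        if run:
            pos += int(run)          # a stretch of matching nucleotides
        elif deletion:
            continue                 # deleted reference bases use no read bases
        else:
            mismatches.append(pos)   # one mismatched read nucleotide
            pos += 1
    return mismatches, pos


def matched_prefix_length(md_field, allowed_mismatches=0):
    """Number of read nucleotides before the first disqualifying mismatch.

    Assumption: a mismatch is 'disqualifying' once more than
    `allowed_mismatches` mismatches have been seen.
    """
    mismatches, aligned_len = mismatch_positions(md_field)
    if len(mismatches) <= allowed_mismatches:
        return aligned_len           # the whole read is accepted
    return mismatches[allowed_mismatches]


# Hypothetical MD string: 25 matches, then mostly mismatches.
example = "MD:Z:25C0G0G1A3"
breakpoint_offset = matched_prefix_length(example)
```

the nucleotides from `breakpoint_offset` onwards would then form the next read segment to be passed back to the aligner.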
however, these mismatches cannot occur in the nucleotides immediately preceding or following a putative recombination event ('--x' parameter in command line). this ensures confidence in the discovery of a recombination junction while allowing maximum sensitivity of the segment mapping. this also prevents a segment from mapping beyond the true recombination junction by claiming mismatched nucleotides at this location. not only would this introduce noise, it would also reduce the number of nucleotides used in subsequent mapping iterations that can be mapped to the side of the recombination junction, which would therefore reduce overall sensitivity. virema also has the capacity to search a host genome for putative recombination events. virema will first attempt to align the seed to the virus genome. if a mapping cannot be found, virema will then attempt to map the read to the host genome. this will give preference to a mapping found in the virus genome, even if a better match may have been found in the host genome. additionally, it is possible to specify different seed lengths when mapping to either the host or viral genome.
[figure: step-by-step mapping of an example read. panel annotations recovered from the image: 'all nts after the seed are aligned until a mismatch is found'; 'all nts after the seed are aligned until a disqualifying mismatch is found'; 'no mapping found for new seed, first nt is trimmed'; 'remaining nts are too small to make a new seed'.] (c) ten further nucleotides are found after the seed that map at this location in the reference genome until a mismatch is found. the remaining nucleotides thus form a new segment. (d) a new seed is extracted from the new segment, but a successful mapping cannot be found. consequently, the first nucleotide is trimmed, and again a new segment is generated. (e) for a second time, a mapping cannot be found for the seed and the first nucleotide is trimmed to form a third segment. (f) finally, a mapping is found for the third segment at nt - on fhv rna that contains one mismatch at nt . (g) nine further nucleotides are found after the seed that map at this location in the reference genome until a mismatch is found. (h) the remaining nucleotides form a new segment, but it is shorter than the required seed length and so is not aligned to the reference genome. (i) finally, the results of these mapping events are recorded and appended with a summary code.
if a successful alignment is not found during the mapping of a segment, then the first nucleotide is trimmed and a new segment is formed from the remaining nucleotides. this trimming process is crucial to remove short pads or adaptor sequences from the beginning of a sequence read that might prevent a seed from being mapped. trimming may also occur in the middle of a read after one or more segments have already been mapped. this enables virema to read through insertion events or regions of extensive mismatching. the nature of these trimmed nucleotides is scrutinized in the second phase of the program and is reported according to whether they constitute insertion or substitution events. this process will also find regions where multiple recombinations have occurred, as discussed later, in compound_handling. in a worst-case scenario, virema will be unable to map any portion of a candidate read and so would trim off each nucleotide from the beginning of the read in multiple iterations until there are too few nucleotides remaining to form a segment long enough to be mapped. consequently, a poor-quality dataset or an incomplete or inaccurate reference genome will result in a large number of bowtie calls and a corresponding increase in run time. once an entire read has been mapped, or there are fewer nucleotides left in a segment than required by the seed length, the mapping results are written to an output file.
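the iterative map-or-trim loop described above can be sketched as follows. this is a toy illustration under loud assumptions: an exact string search stands in for the bowtie call, no mismatches are tolerated, only one reference is searched, and the function names and the count-plus-letter form of the summary code are hypothetical rather than taken from virema.

```python
def map_read(read, ref, seed_len):
    """Map a read segment by segment; return a list of events.

    Events are ("M", ref_start, ref_end) for a mapped segment and
    ("U", nucleotides) for trimmed or leftover nucleotides.
    """
    events = []
    trimmed = ""
    segment = read
    while len(segment) >= seed_len:
        # Stand-in for the aligner: exact match of the seed in the reference.
        start = ref.find(segment[:seed_len])
        if start == -1:
            # No mapping for this seed: trim the first nucleotide and retry.
            trimmed += segment[0]
            segment = segment[1:]
            continue
        if trimmed:
            events.append(("U", trimmed))
            trimmed = ""
        # Extend past the seed until the first mismatch (toy rule).
        n = seed_len
        while (n < len(segment) and start + n < len(ref)
               and segment[n] == ref[start + n]):
            n += 1
        events.append(("M", start, start + n))
        # The remaining nucleotides form the next segment.
        segment = segment[n:]
    leftover = trimmed + segment
    if leftover:
        events.append(("U", leftover))  # too short to form a new seed
    return events


def summary_code(events):
    """Compact code for a read, e.g. '8M2U10M' (format is an assumption)."""
    parts = []
    for event in events:
        if event[0] == "M":
            parts.append(f"{event[2] - event[1]}M")
        else:
            parts.append(f"{len(event[1])}U")
    return "".join(parts)


# Hypothetical reference and a read joining ref[0:8] to ref[20:30].
REF = "ACGTTGCAATCGGATCCTAGGCTTAACGGTCA"
deletion_read = REF[0:8] + REF[20:30]
insertion_read = REF[0:8] + "TT" + REF[20:30]
deletion_events = map_read(deletion_read, REF, seed_len=6)
insertion_events = map_read(insertion_read, REF, seed_len=6)
```

in this sketch a read that never maps is trimmed one nucleotide at a time until fewer than `seed_len` nucleotides remain, which mirrors the worst-case behaviour noted in the text.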
each entry is appended by a short code to summarize the results of the mapping: 'm' corresponds to mapped nucleotides; 'x' denotes mismatches within mapped segments; and 'u' denotes unmapped nucleotides that were either trimmed or could not form a new segment as required by the seed length. this code aids the second phase of the algorithm that analyzes the mapping results for each read and identifies what types of events have occurred. examples of the types of output generated by the first phase of virema are given in table . in the simplest case, a recombination event is found when a read is mapped to two separate locations on the same gene and there are no other unmapped nucleotides present. such an example is given in table (a). here, a read that is nt in length has been broken into two segments, each of which has been perfectly matched to the virus genome as denoted by its code ' m m'. the first segment of nt has been mapped to fhv rna , nt - , and all the remaining nt have mapped to fhv rna , nt - . therefore, this read maps across a single recombination event between nt and . in a similar manner, multiple or inter-species recombination events can also be detected [as in e.g. table (b or c) ]. virema is also sensitive to complex recombination events, insertions, substitutions and can also map reads that contain short pads at either or both of the and end of the read. examples of these are given in table . the exact site of recombination is sometimes ambiguous due to the inherent 'fuzziness' of recombination junctions. this occurs when the nucleotides immediately upstream of the acceptor site are identical to the nucleotides immediately upstream of the donor site. this leaves a number of possible sites where the original recombination event may have occurred but would still produce an identical resulting sequence. 
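the ambiguity of 'fuzzy' junctions described above can be made concrete with a short python sketch. it is an illustration only: the coordinate convention (0-based, with the donor given as an exclusive end index and the acceptor as a start index), the function names and the example sequence are all assumptions and are not taken from virema.

```python
def fuzzy_range(ref, donor_end, acceptor_start):
    """Return (left, right): how far a junction can slide without changing
    the recombinant sequence ref[:donor_end] + ref[acceptor_start:]."""
    left = 0
    # Slide left while the nucleotide just upstream of the donor site
    # equals the nucleotide just upstream of the acceptor site.
    while (donor_end - left > 0 and acceptor_start - left > 0
           and ref[donor_end - left - 1] == ref[acceptor_start - left - 1]):
        left += 1
    right = 0
    # Slide right while the nucleotides just downstream of both sites agree.
    while (donor_end + right < len(ref) and acceptor_start + right < len(ref)
           and ref[donor_end + right] == ref[acceptor_start + right]):
        right += 1
    return left, right


def equivalent_junctions(ref, donor_end, acceptor_start):
    """All (donor_end, acceptor_start) pairs giving the same recombinant."""
    left, right = fuzzy_range(ref, donor_end, acceptor_start)
    return [(donor_end + k, acceptor_start + k)
            for k in range(-left, right + 1)]


# Hypothetical reference with a deletion joining position 5 to position 14.
REF_FUZZ = "GATCACGTTTTTTACGCCA"
candidates = equivalent_junctions(REF_FUZZ, 5, 14)
recombinants = {REF_FUZZ[:d] + REF_FUZZ[a:] for d, a in candidates}
```

every pair in `candidates` yields the same recombinant sequence, so a tool has to pick one of them by convention when reporting the junction.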
in virema, such 'fuzzy' junctions can be reported either at the end, the middle or at the end of the fuzzy nucleotides present in the reference sequence, as chosen by the user using the --defuzz parameter. once each read has been scrutinized, each type of recombination event is tallied and these results are written to specific output files used for downstream analyses: insertions, microinsertions, microdeletions, recombination_results, single_alignments, substitutions, unmappedreads and unknown_recombinations. if both host and virus reference genomes are used, there will be one of each file for both reference genomes, plus an extra file for virus-to-host recombinations.
[table (continued): examples of output from the mapping phase of virema. the output gives the name of the read, followed by the details of each successfully mapped segment or the nucleotides that were trimmed, and is appended with a code to describe these mappings.]
(description of the preceding row) a read with a single mapping, but with pads at either end.
(g) output: @read:name fhvrna - acgccgaactacgacgactattcgatc m u. description: unknown or ambiguous recombination event: initially mapped, but then pad is longer than seed (e.g. nt) and would be long enough to blast.
(h) output: @read:name fhvrna - acgccgggcag fhvrna - m u m. description: unknown recombination: recombination has taken place, but unidentified nucleotides are present that are smaller than the chosen seed (e.g. nt). this may be the result of two recombination events having occurred within proximity, which can be tested using the optional command: '--compound_handling'.
(i) output: @read:name atagcatgcagcgttatttagcacgacagaatcatcgactagctacgat u. description: an unmapped read.
the accumulation of multiple recombinations within single template can result in a highly fragmented and complex recombinant genome. this raises problems when trying to identify the original individual recombination events that took place. the output of such a scenario may look like the example in table (h).
here, segments at the and the end of a complex recombination event have been mapped to nt - and nt - of fhv rna , but there remain a small number of trimmed nucleotides in the middle. during the second phase of virema, the '--compound_handling' option can be used to attempt to align these short fragments back to the virus genome. the two flanking aligned segments provide a small window in which to search for a new alignment. as this window would be considerably smaller than the entire viral genome, a reliable mapping for this fragment may therefore be found even if the fragment is smaller than the seed length. if a single perfect match is found, then virema will report two separate recombination events as opposed to one 'unknown' recombination. for example, in the case of the example in table (h), an unknown sequence is flanked by two segments that map to nt - and nt - of fhv rna . using '--compound_handling', this sequence is found to correspond to fhv rna , nt - and so virema will report this as two recombination events: one from fhv rna , nt to ; and the other from fhv rna , nt to . to assess the sensitivity and error rate of virema, we generated a simulated dataset containing randomized recombination events in fhv rna . to generate a simulated read, a -nt fragment was randomly selected from fhv rna and then appended to another randomly selected -nt fragment. from this -nt fragment, a read of nt was extracted starting from a randomly chosen nucleotide. in this manner, a recombination event could occur at any position along the synthetic read. for each -nt fragment, this was repeated a random number of up to times. there are possible 'cutting' sites in a -nt read at which a recombination may occur. with a search seed of nt, recombination events occurring in the first or last cutting sites of the reads will not be detected leaving possible sites. therefore, a theoretical maximum efficiency of recombination detection would be / = . %. 
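the simulation and the theoretical maximum described above can be sketched as follows. this is a minimal sketch under stated assumptions, not the authors' script: the function names, the random stand-in genome and every numeric value (fragment length, read length, seed length) are illustrative choices made here, because the original values are not recoverable from the text. the detectable fraction assumes that a junction is found only if both flanking segments are at least one seed long.

```python
import random


def max_detectable_fraction(read_len, seed_len):
    """Fraction of cutting sites that leave >= seed_len nt on both sides."""
    sites = read_len - 1
    detectable = max(0, read_len - 2 * seed_len + 1)
    return detectable / sites


def simulate_read(genome, frag_len, read_len, rng):
    """Join two random fragments and cut a read from the joined sequence.

    Returns (read, junction), where junction is the offset of the join inside
    the read, or None if the read does not span the join.
    """
    a = rng.randrange(len(genome) - frag_len + 1)
    b = rng.randrange(len(genome) - frag_len + 1)
    joined = genome[a:a + frag_len] + genome[b:b + frag_len]
    start = rng.randrange(len(joined) - read_len + 1)
    read = joined[start:start + read_len]
    offset = frag_len - start
    junction = offset if 0 < offset < read_len else None
    return read, junction


rng = random.Random(0)  # fixed seed so the sketch is reproducible
genome = "".join(rng.choice("ACGT") for _ in range(500))  # stand-in genome
read, junction = simulate_read(genome, frag_len=60, read_len=40, rng=rng)
print(len(read), junction, round(max_detectable_fraction(100, 20), 3))
```

with the illustrative values of a 100-nt read and a 20-nt seed, 61 of the 99 cutting sites leave a full seed on both sides.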
we generated synthetic reads containing unique recombination events and aligned these reads to the fhv genome with virema using a seed length of nt. we detected recombination events, which correspond to a detection sensitivity of . %-just below the theoretical maximum. as we know the nature of the simulated recombination events, we can determine that no incorrect events were reported. to further test the sensitivity and error rate of virema, we generated another simulated dataset containing reads, but including a random mismatch rate of . nt per wild-type nucleotides. this approximates the error frequency previously reported for rnaseq datasets of rna viruses including fhv ( , ). we aligned these reads to the fhv genome allowing either one or two mismatches per read segment ('--n' parameter = or ). in addition, we varied the number of nucleotides at both the beginning and end of each read segment in which these mismatches were disallowed (the '--x' parameter). as can be seen in figure , the sensitivity of recombination detection is close to the theoretical maximum when mismatches are allowed to occur anywhere in the read segment. however, the error rate of recombination junction detection is poor due to the first one or two nucleotides upstream of a recombination junction being counted as mismatches but in the downstream segment. by increasing the stringency of the recombination mapping with the '--x' parameter, the error rate improves dramatically, but with a small penalty in sensitivity. the improvement in error rate plateaus with an '--x' value of nt for --n = , and an '--x' value of nt for --n = ( figure ) . although this analysis does not account for the many other sources of potential error in deep-sequencing datasets, it can act as a guide to optimize search parameters in subsequent analyses. for example, from simulated reads, we detected recombination events using a seed length of nt, --n = and --x = . this results in a sensitivity of . %. 
from these reported events, were inaccurate, thus giving an error rate of . %. these errors are due to mismatches overlapping the 'fuzzy' region of a recombination junction (as described earlier) and so are inaccurate by only a small number of nucleotides as limited by the size of the 'fuzzy' region.

figure . the sensitivity and error rate of virema determined using a simulated dataset. a simulated dataset containing reads was analyzed by virema using either an allowed mismatch rate of n = (circles) or n = (diamonds) and an '--x' value of between and . the sensitivity (left axis, dotted line) decreases linearly with increasing mapping stringency as imposed by the '--x' value. the error rate (right axis, dashed lines) decreases dramatically with increased mapping stringency.

we recently reported that by deep-sequencing the rna encapsidated by fhv, it is possible to detect a plethora of recombination events within the viral genome with a frequency approaching that of mismatch mutation ( ) . in that study, we collected single reads that were quality filtered and all trimmed to exactly nt. using an end-to-end alignment, we removed all reads that mapped to either the fhv or d. melanogaster genomes (table ) , leaving a dataset containing reads. next, we generated a pseudo-library containing million short reference sequences corresponding to all the possible recombination events that might occur within the fhv genome. by aligning the unmapped reads to this pseudolibrary (allowing two mismatches per read), we identified recombination events (excluding insertions and deletions smaller than nt, 'microindels', that were found to be artifacts). using these results, we generated a second pseudo-library corresponding to recombination junctions that occurred within nt of one another, enabling us to identify an additional recombination events. in total, we detected recombination events using the pseudo-libraries with end-to-end mapping ( table ) .
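the pseudo-library idea described above can be illustrated with a short generator. this is a sketch under assumptions made here, not the published pipeline: the name `pseudo_library`, the `flank` parameter, the 0-based half-open coordinates and the 10-nt toy reference are all hypothetical. each entry joins the nucleotides upstream of a donor site to the nucleotides downstream of an acceptor site.

```python
def pseudo_library(ref, flank):
    """Yield (donor_end, acceptor_start, sequence) for every possible junction.

    Coordinates are 0-based and half-open. The case donor_end == acceptor_start
    is the uninterrupted wild-type sequence and is skipped.
    """
    n = len(ref)
    for donor_end in range(flank, n + 1):
        for acceptor_start in range(0, n - flank + 1):
            if donor_end == acceptor_start:
                continue
            yield (donor_end, acceptor_start,
                   ref[donor_end - flank:donor_end]
                   + ref[acceptor_start:acceptor_start + flank])


ref = "ACGTTGCAAG"  # hypothetical 10-nt reference
library = list(pseudo_library(ref, flank=3))
print(len(library), library[0])
```

the number of entries grows roughly with the square of the genome length, which is consistent with the very large pseudo-library needed even for a small viral genome.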
to compare the sensitivity of virema with this approach, we analyzed the same dataset ( reads) using similar parameters (seed length of nt for virus alignment and nt for host alignment, mismatch allowed per segment and an '--x' value of ). using a standard apple intel workstation with gb ram and physical core processors, this analysis was performed in < min. this yielded recombination events (table ) , an improvement in sensitivity of %. this small improvement was obtained even though virema uses a stricter mapping procedure. when determining which reads had been successfully mapped with the virema algorithm, but not by the pseudo-library alignment, we found that all of these reads contained multiple events within the read. this is why the majority of the extra events were detected in fhv rna , which harbored the largest number of recombination events in the fhv genome, many of which occurred in proximity to one another. furthermore, in addition to the improved sensitivity, we were also able to detect several other events, including intra-host and host-to-virus recombination events, insertion and substitution events, and some 'unknown' events that contain fragments of either viral or host rna as well as a large number of trimmed nucleotides (table ) . virema allows the analysis of datasets containing reads of variable lengths. in our previous analysis, we had trimmed all of our reads to a uniform length as is required when using the pseudo-library approach. however, a large improvement in sensitivity was achieved by analyzing the raw dataset that contained . million reads ranging from to nt in length. with this, we found recombination events (table ), a dramatic improvement over the initial analysis. after this, < reads remained that were completely unmapped.

de novo discovery of functional motifs in the genome of fhv

fhv readily produces defective genomes, even during limited passaging in cell culture.
defective genomes have lost their ability to independently encode functional viral proteins and are dependent on the wild-type 'helper' virus for replication and propagation. the study of defective genomes has been highly important in establishing the regions of viral genomes required for replication and encapsidation ( ) . defective genomes that maintain the sequence information required for replication or encapsidation gain a strong selective advantage over those that cannot and so will be successfully propagated and thus highly represented in our dataset ( ) . conversely, recombination events that remove functional motifs are negatively selected and so these events are seldom observed. some of the recombination events detected in our analysis correspond to those previously observed to be present in fhv-defective genomes ( ) . however, deep sequencing reveals a far richer distribution or quasispecies of defective genomes than has previously been observed using standard molecular cloning and polymerase chain reaction techniques.

table (note). the pseudo-library-based mapping was performed in routh et al. ( ) and its results are compared here against the number of recombination events found by virema when using the same initial dataset containing only -nt long reads. as virema does not require uniform read lengths, the raw dataset containing million reads was also mapped with virema. virema can also detect virus to host recombination events as well as recombination events in the host genome by including a second reference genome during the mapping phase.

to see which regions of the fhv genome were retained during passaging, we plotted the frequency with which every nucleotide was excised among all of the detected recombinant rnas. this reveals conserved regions of the viral genome. as can be seen in figure , the two genomic rnas show conservation of the 5' and 3'-utrs and some short internal motifs.
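the per-nucleotide excision frequency described above can be computed with a simple difference array. this is a minimal sketch, assuming that each detected deletion is available as a (donor, acceptor, read count) triple in 0-based half-open coordinates; the function name `excision_counts` and the example events are made up for illustration and are not taken from the dataset.

```python
def excision_counts(genome_len, events):
    """Count, per nucleotide, how many recombinant reads skip that position.

    events: iterable of (donor_end, acceptor_start, n_reads), 0-based and
    half-open, so positions donor_end .. acceptor_start - 1 are excised.
    Only deletions (donor_end < acceptor_start) are counted.
    """
    diff = [0] * (genome_len + 1)
    for donor_end, acceptor_start, n_reads in events:
        if donor_end < acceptor_start:
            diff[donor_end] += n_reads
            diff[acceptor_start] -= n_reads
    counts, running = [], 0
    for delta in diff[:genome_len]:
        running += delta
        counts.append(running)
    return counts


# Made-up events; the last one is a back-splice and is ignored.
events = [(2, 6, 10), (4, 8, 5), (7, 3, 99)]
print(excision_counts(10, events))
```

positions whose count stays near zero across many events are the ones that are rarely excised, which is how conserved regions would show up in such a plot.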
there are three conserved regions in fhv rna : nt - , nt - and nt - ; and three conserved regions in fhv rna : nt - , nt - and nt - . these observations correlate well with previous studies that have demonstrated the necessity for the 5' and 3'-utrs for efficient rna replication ( ) ( ) ( ) ( ) ( ) . similarly, a short motif is present in rna at nt - that forms a bulged stem loop and acts as a signal for rna packaging into virions ( ) . conservation of this packaging motif is expected, as we only sequenced encapsidated rnas. the internal regions at nt - in rna and at nt - , nt - and nt - in rna have also previously been demonstrated to contain sequence motifs that are required for replication ( , , ) . the fact that we find all of these previously described motifs to be so well conserved in our analysis confirms their necessity for replication and packaging. this simple analysis demonstrates how regions of functional importance can be discovered through the use of ngs data without any prior knowledge of the lifecycle or genomic structure of a virus. there are a number of highly cited software packages already available that address the complex issue of detecting recombination junctions. in contrast to these packages, which extract individual segments of a pre-determined length and position from a read, virema provides a unique approach by dynamically generating moving segments for alignment. after an initial seed-based alignment, a new read segment is obtained from any unaligned nucleotides at the end of the read. alternatively, if a mapping cannot be found, the seed position is adjusted by trimming a single nucleotide from the beginning of the current segment. consequently, virema provides a highly sensitive and versatile platform for recombination discovery that is not limited to specific read lengths and can handle reference genomes of any size.
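the moving-segment strategy summarized above can be sketched as a toy mapper. this is not virema's implementation: exact matching with python's `str.find` stands in for a real short-read aligner, mismatches are not allowed, only the first hit is used, and the names `map_read`, 'M' and 'U' as well as the reference sequence are assumptions made for illustration.

```python
def map_read(read, ref, seed_len):
    """Map a read as a series of exact segments found with a moving seed.

    Returns a list of ("M", ref_start, ref_end) for mapped segments
    (0-based, half-open) and ("U", length) for trimmed nucleotides.
    """
    segments, pos, unmapped = [], 0, 0
    while pos < len(read):
        seed = read[pos:pos + seed_len]
        hit = ref.find(seed) if len(seed) == seed_len else -1
        if hit == -1:
            # No mapping: trim one nucleotide and move the seed along.
            unmapped += 1
            pos += 1
            continue
        if unmapped:
            segments.append(("U", unmapped))
            unmapped = 0
        # Extend the seed match for as long as read and reference agree.
        length = seed_len
        while (pos + length < len(read) and hit + length < len(ref)
               and read[pos + length] == ref[hit + length]):
            length += 1
        segments.append(("M", hit, hit + length))
        pos += length  # the next segment starts at the unaligned remainder
    if unmapped:
        segments.append(("U", unmapped))
    return segments


ref = "ATGCGTACCTGAAGTCCATGGTTACGCAAT"  # hypothetical reference
recombinant = ref[2:10] + ref[18:27]    # deletion of positions 10..17
print(map_read(recombinant, ref, seed_len=5))
print(map_read("XX" + ref[2:10], ref, seed_len=5))
```

two consecutive mapped segments whose reference coordinates are not contiguous indicate a recombination junction, here between positions 10 and 18 of the toy reference.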
virema is also capable of detecting multiple recombination events within a read, insertions and substitutions as well as more complex recombination events where there may be a small number of inserted nucleotides between recombination junctions. our algorithm is aimed at detecting the maximum number of recombination events and is almost exhaustive in its search for recombination events. in our analysis of over million reads, > . million recombination events were detected, recombination events could not be unambiguously identified and < reads could not be mapped. owing to artifactual recombination events that inevitably creep into cdna-sequencing libraries ( ) , such an exhaustive search prompts consideration of the handling of false-positive recombination events. many recombination detection algorithms are focused on the detection of splice-junctions in eukaryotic mrna or on the detection of chromosomal rearrangements or fusions in the dna of tumorigenic cells. these programs implement strict noise-reducing filters to remove any potential false-positive hits, and any putative junction must be confirmed by the presence of multiple aligning reads or by paired reads that span a recombination junction. this is suitable in cases such as eukaryotic rna splicing, as there should be only one or a limited number of biologically correct fusion junctions and as these junctions are likely to contain canonical splicing site consensus sequences. however, this is not a good assumption in the case of rna or dna recombination in viral genomes. as many thousands of genome copies may be generated in a single replication cycle and as a deeply-sequenced viral genome will contain reads derived from a large pool of viral quasispecies, we should expect to find a wide range of possible recombination sites. moreover, the mechanisms of dna or rna recombination may vary between species.
for example, rna template switching has been shown to occur most frequently in au-rich regions of the brome mosaic virus rna genome ( ) , whereas poliovirus has been demonstrated to favor gc-rich tracts ( ) . consequently, we cannot exclude events based on the sequence information alone. therefore, suitable controls must be used to obtain a base-line rate for artifactual recombination during the generation of the cdnasequencing libraries. this can be achieved by comparing the recombination in the viral genome with non-viral templates such as in vitro transcribed rna. similarly, artifactual recombination can be directly detected by mixing separate samples before cdna library generation ( ) . our analysis of fhv demonstrates that by isolating a small number of virus particles, deep sequencing the encapsidated rna and mapping the positions of recombination events, functional rna motifs can be discovered. in principle, the approach laid out here would be possible even without knowledge of the genome sequence of the virus, as this can be assembled de novo from the sequence dataset ( ) . also, the sample would not need to be purified as with sufficient sequencing read depth, the non-viral sequence reads can be removed computationally to reveal just the relevant virus data ( ) . we are constantly facing the threat of emerging pathogens, as exemplified by recent coronavirus and influenza virus outbreaks. it is therefore critical to develop methodology that allows researchers to quickly identify and characterize important features of a viral genome while only having available limited knowledge or means to study an outbreak virus. 
virus identification in unknown tropical febrile illness cases using deep sequencing metagenomics for the discovery of novel human viruses virus discovery by deep sequencing and assembly of virus-derived small silencing rnas viruses in the faecal microbiota of monozygotic twins and their mothers beyond the consensus: dissecting within-host viral population diversity of foot-and-mouth disease virus by using next-generation genome sequencing host rnas, including transposons, are encapsidated by a eukaryotic singlestranded rna virus ensuring the safety of vaccine cell substrates by massively parallel sequencing of the transcriptome viral nucleic acids in live-attenuated vaccines: detection of minority variants and an adventitious virus identification and manipulation of the molecular determinants influencing poliovirus recombination nucleotide-resolution profiling of rna recombination in the encapsidated genome of a eukaryotic rna virus by next-generation sequencing why do rna viruses recombine? 
sequencing and analyses of all known human rhinovirus genomes reveal structure and evolution natural genetic exchanges between vaccine and wild poliovirus strains in humans molecular characterization of drosophila cells persistently infected with flock house virus defective interfering viral particles in acute dengue infections internally deleted wnv genomes isolated from exotic birds in new mexico: function in cells, mosquitoes, and mice defective viral particles and viral disease processes biased hypermutation and other genetic changes in defective measles viruses in human brain infections long-term transmission of defective rna viruses in humans and aedes mosquitoes characterization of hepatitis c virus deletion mutants circulating in chronically infected patients adaptation of a genotype hepatitis e virus to efficient growth in cell culture depends on an inserted human gene segment acquired by recombination cutadapt removes adapter sequences from high-throughput sequencing reads ultrafast and memory-efficient alignment of short dna sequences to the human genome tophat: discovering splice junctions with rna-seq tophat-fusion: an algorithm for discovery of novel fusion transcripts fusionmap: detecting fusion genes from next-generation sequencing data at base-pair resolution mapsplice: accurate mapping of rna-seq reads for splice junction discovery the sequence alignment/map format and samtools defective interfering rnas: foes of viruses and friends of virologists flock house virus replicates and expresses green fluorescent protein in mosquitoes long-distance base pairing in flock house virus rna regulates subgenomic rna synthesis and rna replication requirements for the self-directed replication of flock house virus rna cis-acting requirements for the replication of flock house virus rna replication of the genomic rna of a positive-strand rna animal virus from negative-sense transcripts a terminal stem-loop structure in nodamura virus rna forms an essential 
cis-acting signal for rna replication evidence that the packaging signal for nodaviral rna is a bulged stem-loop cis elements direct nodavirus rna recruitment to mitochondrial sites of replication complex formation the impact of pcr-generated recombination on diversity estimation of mixed viral populations by deep sequencing engineering of homologous recombination hotspots with au-rich sequences in brome mosaic virus clonal integration of a polyomavirus in human merkel cell carcinoma the authors thank brad kearney and tatiana domitrovic for advice and discussions. conflict of interest statement. none declared. key: cord- -tx lqgff authors: te velthuis, aartjan j.w.; van den worm, sjoerd h. e.; snijder, eric j. title: the sars-coronavirus nsp +nsp complex is a unique multimeric rna polymerase capable of both de novo initiation and primer extension date: - - journal: nucleic acids res doi: . /nar/gkr sha: doc_id: cord_uid: tx lqgff uniquely among rna viruses, replication of the ∼ -kb sars-coronavirus genome is believed to involve two rna-dependent rna polymerase (rdrp) activities. the first is primer-dependent and associated with the -kda non-structural protein (nsp ), whereas the second is catalysed by the -kda nsp . this latter enzyme is capable of de novo initiation and has been proposed to operate as a primase. interestingly, this protein has only been crystallized together with the -kda nsp , forming a hexadecameric, dsrna-encircling ring structure [i.e. nsp( + ), consisting of copies of both nsps]. to better understand the implications of these structural characteristics for nsp -driven rna synthesis, we studied the prerequisites for the formation of the nsp( + ) complex and its polymerase activity. we found that in particular the exposure of nsp 's natural n-terminal residue was paramount for both the protein's ability to associate with nsp and for boosting its rdrp activity. 
moreover, this ‘improved’ recombinant nsp was capable of extending primed rna templates, a property that had gone unnoticed thus far. the latter activity is, however, ∼ -fold weaker than that of the primer-dependent nsp -rdrp at equal monomer concentrations. finally, site-directed mutagenesis of conserved d/exd/e motifs was employed to identify residues crucial for nsp( + ) rdrp activity.

in the replicative cycle of rna viruses, the crucially important process of rna-templated rna synthesis is generally performed by an rna-synthesizing complex of viral enzymes ( , ) . commonly, its core subunit is a single rna-dependent rna polymerase (rdrp) that drives the production of template strands for replication, new genome molecules, and, in many rna virus groups, also subgenomic (sg) mrnas. this canonical rdrp is structurally conserved among rna viruses and widely accepted to drive catalysis of phosphodiester bond formation via a well-established reaction mechanism involving two metal ions that are coordinated by aspartate residues in its motifs a and c ( ) ( ) ( ) . uniquely among rna viruses, however, current evidence suggests that at least two rdrp activities are encoded by the genomes of members of the coronavirus (cov) family, the +rna virus group that infects a wide range of vertebrates and is renowned for its exceptionally large polycistronic genome of ∼ kilobases ( ) . both cov rdrps belong to the set of non-structural proteins (nsps) that are produced through proteolytic processing of the pp a and pp ab replicase precursor polyproteins, which both derive from translation of the genomic rna ( , ) . for the severe acute respiratory syndrome-associated coronavirus (sars-cov), which emerged in and caused worldwide concern due to the ∼ % mortality rate associated with infection of humans ( , ) , the two replicase subunits with rdrp activity have been studied in some detail.
the first is the -kda nsp , which contains the canonical viral rdrp motifs in its c-terminal part and employs a primer-dependent initiation mechanism ( , ) . the second polymerase, the -kda nsp , is unique for covs and was reported to be only capable of de novo rna synthesis with a low fidelity on ssrna templates ( ) . together, these observations inspired a hypothesis in which nsp would serve as an rna primase, i.e. would synthesise short oligonucleotide primers for subsequent extension by the nsp 'main rdrp' ( ) . in spite of this attractive model, however, many questions regarding cov rna synthesis remain unanswered thus far. for instance, it is unclear whether the homomeric form of nsp , for which in vitro rdrp activity was previously documented ( ) , actually occurs in vivo, as nsp was also shown to co-crystallize and form a unique hexadecameric ring-structure with the -kda nsp subunit that resides immediately upstream in the replicase polyprotein precursors (figure ) ( ) . in a similar fashion, it is presently unknown whether the postulated double-stranded rna (dsrna) binding channel of this complex plays a role in the rdrp activity of nsp and whether this activity is influenced by nsp , particularly given the observed low fidelity and low processivity of nsp ( ) . to investigate the properties of the nsp +nsp [nsp( + )] hexadecamer in more detail, and seek answers to the above questions, we here generated and purified recombinant forms of sars-cov nsp and nsp( + ) that have natural n-terminal residues. this technical refinement was found to greatly improve nsp 's ability to associate with nsp . moreover, and in contrast to previous observations ( ) , exposure of the natural n-terminus proved crucial for the enzymatic activity of the complex on partially double-stranded rna templates, demonstrating that nsp( + ) is capable of primer-dependent rdrp activity as well.
site-directed mutagenesis of nsp in the context of the nsp( + ) complex identified a conserved d/exd/e motif that is important for catalysis in vitro, possibly providing a first indication of the location of the presently unknown nsp active site. overall, these results define the sars-cov nsp( + ) complex as an intriguing multimeric rna polymerase that is capable of primer extension.

for sars-cov nsp -nsp expression, the sequence encoding amino acids - of the sars-cov replicase pp a was amplified by reverse transcription-polymerase chain reaction (rt-pcr) from the genome of sars-cov isolate frankfurt- (genbank accession number ay ). the primers used were sav and sav (supplementary table s ). for nsp expression, the sequence encoding pp a residues to was amplified by rt-pcr using sav and sav as primers (supplementary table s ). both pcr products were digested with sacii and bamhi, and ligated into expression vector pask -ub-chis ( ). this vector was originally derived from the pet -ub-chis vector ( ) , but drives expression of n-terminally ubiquitin-tagged and c-terminally his -tagged fusion proteins via a tetracycline-inducible promoter, to rule out the potential t polymerase contaminations that are known to cause false positive results when using t promoter-driven systems for recombinant rdrp expression. all described nsp mutants were engineered via site-directed mutagenesis according to the quikchange protocol (stratagene) using the primers listed in supplementary table s . for nsp - or nsp expression, escherichia coli c cells (new england biolabs) were transformed with the plasmids pask -ub-nsp - -chis or pask -ub-nsp -chis together with the ubp protease expression plasmid pcg ( ) . routinely, ml of luria broth, containing ampicillin ( mg/ml) and chloramphenicol ( mg/ml), was inoculated : with o/n precultures, and cells were grown to od > . at c.
subsequently, the cells were slowly cooled to c, followed by induction with anhydrotetracycline (fluka) at a final concentration of ng/ml for h. expression at c was, however, only crucial for the preparation of certain nsp mutants and similar yields of active wild-type protein could be obtained by expression at c for - h. cells were harvested by centrifugation and stored at c until protein purification was started. the expression of sars-cov nsp with a c-terminal his -tag (nsp -his) was achieved from plasmid pdest -nsp -his according to the protocol previously described for eav nsp ( ) . sars-cov nsp -his (nsp -his) was expressed as a self-cleaving maltose binding protein (mbp)-fusion protein and was purified via its c-terminal his -tag ( ) . the pask -his-nsp plasmid for expression of the n-terminally his -tagged nsp was kindly provided by dr imbert and dr canard (university of marseille, france).

figure . (a) the coronavirus genome contains two large 5'-proximal orfs (orf a and b) that encode the two replicase polyproteins, whose mature products assemble into the viral replication and transcription complex. both polyproteins are cleaved (cleavage sites indicated with arrow heads) by the proteinase activities of nsp (black lines) and nsp (red lines), which releases the mature nsps. also indicated are the cap structure and the polya tail (a n ). (b) the sars-cov nsp crystal structure (pdb ahm) resembles a 'golf club-like' shape, as presented by the yellow ribbon structure. this nsp conformation connects to a much larger, hexadecameric structure that is composed of seven additional nsp subunits (grey) and eight nsp subunits (green). the hollow hexadecameric ring structure has a positively charged channel (blue background shading) that was proposed to mediate rna binding. the outside of the structure is predominantly negatively charged (red background shading).

purification of sars-cov nsp , nsp - and nsp

bacterial pellets were thawed on ice, resuspended in buffer a [ mm hepes ph .
, mm imidazole, . % tween- , mm b-mercaptoethanol and edta-free protease inhibitor cocktail (roche)] containing mm nacl, and lysed by sonication. the supernatant was cleared by ultracentrifugation at g for min and subsequently incubated with talon beads (clontech) for h at c. the beads were washed four times min with volumes of binding buffer. ultimately, the c-terminally his -tagged proteins were eluted with mm imidazole in buffer a containing mm nacl, or cleaved off of the column during a -h digestion with sars-cov nsp in the presence of mm mgcl . the eluates were analysed by sodium dodecyl sulfate polyacrylamide gel electrophoresis (sds-page) and typically found to be > % pure. elution fractions containing nsp -, nsp - or nsp were subsequently pooled, dialysed, stored and analysed as described previously for sars-cov nsp ( ) . to study sars-cov nsp( + ) complex formation, different nsp :nsp ratios were mixed in binding buffer ( mm hepes ph . , mm nacl, % glycerol, . % triton x- and mm dtt) to give a final reaction volume of ml. the proteins were pre-incubated for min at c, after which cross-linking was initiated through the addition of . ml of a freshly prepared . % glutaraldehyde solution. the reactions were incubated for a further min at c and then terminated with ml m tris ph . . analysis of complex formation was performed on sds-page gels, which were stained with coomassie g- dye. a dilution series of - mm sars-cov nsp in storage buffer was incubated for min at c with . nm of p-labelled duplex rna. subsequently, samples were directly loaded onto % polyacrylamide gels containing % glycerol and . x tge ( mm tris, mm glycine and mm edta) buffer and run at v for h at c. gels were dried on whatman filter paper and bands were quantified by phosphorimaging using a typhoon variable mode scanner (ge healthcare) and imagequant tl . software (ge healthcare) as described elsewhere ( ) . 
using the matlab a curve fitting toolbox, the percentage of bound rna was fit to the hill equation, which is defined as:

$$\mathrm{RNA}_{\mathrm{bound}} = \frac{b\,[\mathrm{nsp}]^{n}}{K_d^{\,n} + [\mathrm{nsp}]^{n}}$$

here $b$ is the upper binding limit, $[\mathrm{nsp}]$ the nsp concentration, $n$ the hill coefficient and $K_d$ the dissociation constant. the oligoribonucleotide substrates used for polymerase assays are listed in table and were prepared as described previously ( ) . primer-extension assays for nsp , the nsp - polyprotein, and the nsp( + ) complex were essentially performed as described previously for sars-cov nsp ( , ) . in each primer-extension reaction, typically mm wild-type or mutant nsp was incubated with mm mgcl , mm gtp, mm atp, . mm [α- p]atp, mm dtt, . % triton x- , mm kcl and mm tris (ph . ). at most, mm nacl and % glycerol were introduced with the nsp storage buffer. gels were run and analysed as described previously ( ) . to convert the phosphorimager signal into the amount of [α- p]amp incorporated, a − to − dilution series of the [α- p]atp stock was spotted in triplicate on whatman filter paper and exposed alongside the page gel. the amount of incorporated label was ultimately corrected for the concentration of competing, unlabelled nucleotides present in the reaction mixture. de novo initiation assays were essentially performed as described by imbert et al. ( ) , with small modifications for optimisation. briefly, mm wild-type or mutant nsp was incubated with mm mgcl , mm mncl , mm gtp, mm atp, . mm [α- p]atp and mm of oligo afmb . alignments of nsp sequences were made using muscle ( ) . sequences used included the alphacoronaviruses human cov e (nc_ ), human cov nl (nc_ ), and bat cov hku (nc_ ); the betacoronaviruses sars-cov frankfurt- (ay ), mouse hepatitis virus a (mhv, nc_ ) and human cov oc (nc_ ); and the gammacoronaviruses beluga whale cov sw (nc ), turkey cov (nc_ ) and avian infectious bronchitis virus (ibv, aj ).
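the hill equation used for the binding curves in this section can be evaluated and fitted with a few lines of code. this is a sketch under assumptions made here, not the matlab procedure used in the study: the function names, the coarse grid search (a stand-in for a proper non-linear least-squares fit) and all concentrations and parameter values are hypothetical and chosen only to make the example self-checking.

```python
def hill(conc, b, kd, n):
    """Percentage of RNA bound at protein concentration conc."""
    return b * conc ** n / (kd ** n + conc ** n)


def fit_hill(concs, bound, b_grid, kd_grid, n_grid):
    """Exhaustive least-squares search over candidate (b, kd, n) values."""
    best, best_sse = None, float("inf")
    for b in b_grid:
        for kd in kd_grid:
            for n in n_grid:
                sse = sum((hill(c, b, kd, n) - y) ** 2
                          for c, y in zip(concs, bound))
                if sse < best_sse:
                    best, best_sse = (b, kd, n), sse
    return best


# Synthetic, noise-free data generated from known (made-up) parameters.
concs = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
bound = [hill(c, 100.0, 3.0, 2.0) for c in concs]
params = fit_hill(concs, bound,
                  b_grid=[80.0, 90.0, 100.0],
                  kd_grid=[1.0, 2.0, 3.0, 4.0],
                  n_grid=[1.0, 1.5, 2.0, 2.5])
print(params)
```

at a protein concentration equal to the dissociation constant the curve reaches half of the upper binding limit, which gives a quick sanity check on any fitted value.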
n-terminal processing defines nsp multimerization and nsp( + ) complex formation

sars-cov nsp and nsp were previously reported to interact and form a hollow ring structure that is composed of an intricate nsp octamer supported by eight copies of nsp ( , ) ( figure b ). based on the large diameter, positive charge of the hexadecamer's channel and in silico docking, it was proposed to be able to encircle dsrna ( figure b) . however, the functional significance of the compound interactions between nsp and nsp is poorly understood, as are the polymerase activities associated with monomeric nsp or nsp -containing multimers. so far, strategies for the purification of recombinant nsp have involved the use of affinity tags [e.g. his or glutathione-s-transferase (gst) ( , ) ] that were fused to one terminus to facilitate protein recovery. inadvertently though, such tags or other exogenous sequences may significantly impede the correct folding of enzymes and thus alter their stability or activity, as exemplified by studies of the poliovirus ( d pol ) and sars-cov (nsp ) rdrp subunits ( , , ) . to circumvent this issue, we developed a protocol in which sars-cov nsp was expressed as a ubiquitin (ub) fusion protein carrying a c-terminal his -tag (ub-nsp -his), which was subsequently processed at both termini in two steps. the first step was co-translational and involved the release of the n-terminal ub fusion partner by the co-expressed ubiquitin carboxyl-terminal hydrolase (ubp , figure a ) ( , ) . the second proteolytic step, catalysed by a recombinant form of the sars-cov nsp main protease ( ) , removed the c-terminal his -tag and was performed either in solution (figure a and b) or when nsp -his was immobilised to talon beads.
this procedure yielded sars-cov nsp with its exact natural n- and c-terminus (replicase residues ala- and gln- , respectively; figure a ), the product that is normally liberated by the nsp -driven autoprocessing of the sars-cov replicase polyproteins ( ) . in accordance with the octameric state observed in cross-linking experiments using glutaraldehyde (supplementary figure s ) or ethylene glycolbis ( ) , the hydrodynamic profile of the untagged nsp corresponded to a mass of ~ kda ( figure d ). to identify and explain differences with previously published observations, we also produced and characterised n- and c-terminally tagged forms of nsp ( figure c ). importantly, under the same assay conditions, the n-terminally his -tagged nsp (his-nsp ) that was used in the original nsp rdrp activity study ( ) showed a marked difference in multimerization behaviour ( figure d and supplementary figure s ). on the other hand, little difference was observed between untagged nsp and a c-terminally his -tagged version of the protein (nsp -his; figure e ). to investigate whether nsp could influence the change in multimerisation behaviour, we next added separately purified and c-terminally processed nsp to the different nsp preparations. interestingly, we found that nsp and nsp -his could both associate with this protein, in accordance with published data ( ) , but that his-nsp was unable to do so within the frame of our experimental conditions ( figure f ). consequently, although various lines of evidence support the observation that nsp and nsp can form a hexadecamer, it now appears that the correct n-terminal processing of nsp is a significant factor in determining the final oligomeric state of the protein. a unique feature of the hexadecameric sars-cov nsp( + ) structure is the fact that it does not derive from stacking of its protein subunits, but rather from stable inter-connections of the 'golf club-like' nsp molecules ( figure b) ( ) .
the structural support of the nsp octamer by eight copies of nsp thus appears to be redundant, in line with the critical role for the nsp n-terminal domain described above. we surmised therefore that the additional complexity must have evolved to improve nsp 's function and set out to compare the rna binding capabilities of the purified nsp octamer and nsp( + ) hexadecamer. by analysing the steady-state ribonucleotide-protein (rnp) complexes formed through binding of nsp to p-labelled dsrna ( figure a) , we estimated the nsp dissociation constant (k d ) for dsrna to be ~ . mm ( figure f) , which is about -fold higher than the apparent k d of nsp under comparable conditions ( ) .

[...] followed by purification and cleavage by recombinant sars-cov nsp main protease to remove the c-terminal his -tag and its upstream gssg linker. (b) eighteen percent sds-page analysis of nsp -treated, purified nsp -his demonstrates near-complete release of the c-terminal his -tag within min. the maltose binding protein (mbp) was added to the reaction to serve as an independent loading control. asterisks indicate non-specific bands. (c) in addition to the tag-less nsp and nsp -his, we also produced the n-terminally his -tagged nsp (his-nsp ) used by imbert et al. ( ) . (d) comparative gel filtration analysis of nsp ( kda as a monomer) versus his-nsp and (e) nsp versus nsp -his. in all three cases, nsp formed multimers in solution, but the apparent molecular mass of complexes formed by both nsp and nsp -his was ~ -fold higher than for complexes formed by his-nsp . (f) comparative analysis of nsp , nsp( + ), his-nsp and nsp +nsp -his. only nsp( + ) showed a molecular weight shift to the ~ -kda size range with a standard deviation of -kda (n = ). this size is indicative of hexadecamer formation, whereas the analysis of nsp +nsp -his showed dominant peaks of nsp -his and nsp (which is ~ kda as a monomer).
a comprehensive analysis of the influence of nsp on nsp -dependent rna binding required an nsp mutant that was incapable of rna binding. to this end, we engineered an alanine substitution of the conserved residue k , which resides in nsp 's proposed dsrna-binding channel [residues - ( )]. as is evident from the electromobility shift assay in figure b , this mutation was sufficient to significantly disrupt rna binding. as a control, we also performed an aspartate-to-alanine substitution at position , which is partially conserved, yet not expected to participate in rna backbone binding due to its negative charge and position just outside the proposed rna binding channel. indeed, the d a mutation only induced a migratory shift of the dominant rnp signal towards the anode, likely as a result of the lost negative charge ( figure c ). with the results obtained with these control proteins in mind, we next explored the contribution of nsp to rna binding by the nsp( + ) complex. we used a fixed concentration of nsp and added either wild-type or mutant nsp up to the point where the nsp :nsp ratio reached equimolarity. no rna binding was observed in the absence of nsp , but upon nsp( + ) complex formation the amount of bound dsrna rapidly increased ( figure d ). indicative of successful complex formation, we also observed a shift in the molecular weight of the major rnp complex formed ( figure d ). western blot analysis confirmed that both nsp and nsp were present at this position in the gel (not shown), but due to the generally unpredictable migration behaviour of proteins and rnps in native page, it was not possible to assess whether this band indeed corresponded to the nsp( + ) hexadecamer. the k d of the nsp( + ) complex was estimated at $ . mm, about -fold lower than that of nsp alone ( figure f ). when we next added an equimolar amount of nsp to the nsp rna-binding mutant k a, we observed a minor increase in the binding affinity for rna (compare figure b with e) . 
mutant d a, on the other hand, behaved similar to the wild-type protein ( figure e ). together, these results complement the observation that various positively charged nsp residues line the inside of the nsp -scaffolded rna binding channel ( ) , and they provide the first direct evidence for a functional role of nsp in the sars-cov nsp( + ) structure. given nsp( + )'s ability to bind dsrna, we wondered whether this protein complex would also be catalytically active on this type of template and able to incorporate nucleoside monophosphates (nmps) into partially double-stranded rna molecules, i.e. primed templates. we therefore examined the ability of nsp to extend a -nt primer that was pre-annealed to a heteromeric template with relatively low secondary structure, to rule out potential adverse effects of hairpins ( figure a ). interestingly and in contrast to previous observations ( ) , the nsp( + ) complex readily extended the primer up to template length, resulting in the formation of a -base pair rna duplex ( figure b ). the negatively charged and helical polymer heparin is able to occupy the binding sites of rna and dna polymerases, and can thus directly compete with rna and dna templates. to verify that the full-length and longer rna products were derived from single nsp( + ) complexes bound to the template (i.e. from a processive activity), and not from multiple binding and extension events (i.e. a distributive activity), we performed the primer extension reaction in the presence of heparin to trap any unbound nsp( + ). we first tested the concentration required to saturate all nsp( + ) complexes in the reaction by titrating - mm into the reaction (supplementary figure s a) and observed that the incorporation levels were stable above mm (supplementary figure s b) , suggesting that these reactions represent single initiation-extension events. 
we next assessed whether the activity of nsp or nsp( + ) was distributive or processive by quantifying the incorporated signal in full-length or longer products in the presence of mm heparin ( figure c ). as shown in figure d , ± % (mean ± standard deviation) of the nsp products were full length compared to ± % of the nsp( + ) products, suggesting that both enzyme complexes are mostly processive and that nsp does not confer additional processivity to nsp . interestingly, both nsp and nsp( + ) are able to extend the rna primers beyond template length in the presence of heparin ( figure d and supplementary figure s b ), suggesting that these extensions result from terminal transferase activity and not from template switching, as was previously observed for poliovirus d pol ( ) . intrigued by the primer extension activity of the sars-cov nsp( + ) complex described above, we next designed a set of mutations to verify that the activity was indeed nsp( + ) derived and to identify the most critical residues for activity in the complex. we first tested rna-binding mutant k a (figure ) at varying concentrations and observed a ~ % loss of nucleotide incorporation activity compared to the wild-type protein ( figure ). other likely candidates for a direct role in rdrp catalysis generally are mg + -coordinating aspartate residues and lysine or histidine residues that can function as a general acid ( ). in canonical rna polymerases, the aspartates commonly reside in motifs a and c ( , ) , while in dna-dependent rna primases they are usually found in a central d/exd/e motif ( ) . given the absence of classical rdrp a and c motifs in the nsp sequence ( ), we screened an alignment of cov nsp sequences for conserved d/exd/e motifs. interestingly, we found such a motif in both the n-terminal and the c-terminal domain ( figure a ).
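a minimal sketch of how the processivity estimate described above could be computed from gel quantification, assuming each lane yields one summed signal for products shorter than the template and one for products of full length or longer. the function names and the lane values in the usage note are hypothetical; they are not the authors' code or data.

```python
from statistics import mean, stdev


def percent_full_length(shorter_signal, full_or_longer_signal):
    """Share (in %) of incorporated signal found in full-length-or-longer products."""
    total = shorter_signal + full_or_longer_signal
    if total <= 0:
        raise ValueError("no incorporated signal in this lane")
    return 100.0 * full_or_longer_signal / total


def summarise_replicates(lanes):
    """Mean and sample standard deviation of the full-length share across lanes.

    `lanes` is a list of (shorter_signal, full_or_longer_signal) pairs,
    one pair per replicate reaction run in the presence of the heparin trap.
    At least two lanes are needed for a standard deviation.
    """
    shares = [percent_full_length(short, full) for short, full in lanes]
    return mean(shares), stdev(shares)
```

for three hypothetical lanes with (shorter, full-or-longer) signals of (20, 80), (25, 75) and (30, 70), `summarise_replicates` returns a mean of 75 % with a standard deviation of 5 %.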
subsequent alanine substitution of the n-terminal d/ exd/e motif, composed of d and d in sars-cov, greatly affected primer extension activity on the cu template as shown in figure c . mutation of the downstream domain (residues d and d in sars-cov), however, had a much smaller effect on polymerase activity, suggesting that this c-terminal d/exd/e motif is not critical for catalysis. controls included mutant k a and a mutant carrying a lysine-to-alanine substitution of the non-conserved residue . in line with the observation of the u template and its conservation in covs, the loss of a lysine at position resulted in a near complete loss of rdrp activity, whereas mutation of k positively influenced rna synthesis ( figure ) . as outlined above, magnesium ions are well-known cofactors of nucleic acid polymerases and assist in the coordination and activation of incoming nucleoside triphosphates. also the activity of sars-cov nsp( + ) was found to be positively correlated with the mg + concentration, albeit with a broad optimum running from - mm ( figure a ). at this optimum, nsp( + ) incorporates $ mm nmp into the primed template per mm of monomeric nsp and nsp present in the reaction. similar to the presence of divalent cations, the ph greatly affects the activity of rdrps and has been shown to play a role in both catalysis and fidelity ( , ) . to investigate the influence of the ph on nsp( + ), we tested the activity of the complex in a ph range of - . as shown in figure c , we observed a sharp optimum at ph . , which is considerably higher than the optimum that was previously observed for the sars-cov nsp -rdrp and the his-nsp homomer (ph optimum . and . , respectively) ( , ) . interestingly, the primer extension activity of nsp( + ) did not require manganese ions as was previously reported for the his-nsp homomer ( ) . 
in fact, similar to the sars-cov nsp -rdrp ( ), the addition of mn + was found to reduce the fidelity of nsp( + ) and induce both transversional and transitional misincorporations in a pulse-chase experiment ( figure e and f) . interestingly, the assay also revealed a discrimination against the widely used atp and gtp analogue ribavirin triphosphate (rtp) ( , ) . whether this may offer an explanation for sars-cov's relative resistance to this antiviral drug ( , ) remains an open question for future research.

figure . mutagenesis of sars-cov nsp . (a) alignment of nsp sequences from representative alpha-, beta- and gammacoronaviruses. fully conserved residues are shaded red, while partially conserved residues are boxed. the residues targeted by mutagenesis are indicated with asterisks. please see 'material and methods' section for the genbank accession numbers associated with the presented sequences. (b) to verify that the observed extension activity was nsp -dependent, we tested the incorporation of amp into the primed u template by , or mm of wild-type nsp or template-binding mutant k a. mutation of k resulted in a ~ % reduction of amp incorporation. (c) to assess the importance of the two d/exd/e motifs in nsp , we engineered alanine substitution mutants of these residues and tested their primer extension activity on the primed uc template (see figure ). reactions were stopped after min and compared to the activity of the wild-type nsp( + ) complex on a % page/[...]

the primer-extension and terminal transferase activity documented in figure for the complex containing the untagged nsp was not observed by imbert et al. ( ) when they first purified and analysed his-nsp . to investigate whether this difference could be attributed to complex formation with nsp or the removal of the affinity tag, we performed the primer extension assay with three different recombinant nsp versions of which the gel filtration analysis is documented in figure .
interestingly, for all three variants primer-extension activity was observed ( figure a ), but the activity was most pronounced for nsp -his and the untagged nsp ( figure a ). to estimate the effect of nsp on the nsp -driven primer extension activity, we performed a direct comparison of the two enzyme complexes and found that the activity of nsp alone was > -fold lower than when nsp and nsp were present at equal molarity in the reaction (figures d and b) . a similar comparison was performed for the de novo activity of nsp , using the assay published by imbert et al. ( ) and taking the first dinucleotide (pppgpa) product as readout. interestingly, both nsp and nsp( + ) synthesized equal amounts of the pppgpa dinucleotide ( figure c) , suggesting that the effect of nsp is limited to the primer-extension activity of nsp . in addition, we observed that the de novo initiation activity of nsp was $ -fold higher than that of his-nsp ( figure d ). our comparative study revealed that the n-terminal his -tag of his-nsp greatly influences the primerextension activity of nsp ( figure a ), its multimerization profile and its association with nsp ( figure ). to test if this inhibitory effect was his -tag specific, we assessed the activity of a ub-nsp -his fusion protein. at the same time, control reactions were performed in which we (i) followed the activity of this protein as it was being processed by a recombinant form of the ubiquitin-cleaving nsp protease of equine arteritis virus ( ) or (ii) monitored the activity of nsp -his. as shown in supplementary figure s , the presence of the ub-tag decreased nsp activity to a level that was comparable to that of n-terminally his -tagged nsp . upon cleavage by eav nsp , however, a partial recovery of the primer extension activity was observed (supplementary figure s ) . 
unfortunately, we were not able to perform the same experiment with purified ub-nsp , since our recombinant nsp removed the n-terminal ub-tag with similar efficiency as the c-terminal his -tag (supplementary figure s ) . extrapolating to the situation in the viral pp a and pp ab precursor polyproteins, in which the nsp n-terminus is initially fused to nsp ( figure a) , our observations suggested that nsp may thus be inactive in the polyprotein context. this would constitute a form of regulation of viral enzyme activity that is not without precedent, since also the poliovirus dpol is inactive as long as it is fused to the c protease in the cd precursor ( ) . to verify this hypothesis, we expressed nsp - -his and tested this protein for rdrp activity.

[...] figure a , presented as the amount of ntp incorporated per mm nsp monomer. error bars represent standard deviations (n = ). (c) the influence of the ph on nsp( + ) activity was tested for a ph range of - . a clear optimum was observed around . . (d) quantification of the results in figure c , presented as the amount of ntp incorporated per mm nsp monomer. error bars indicate standard deviations (n = ). (e) schematic presentation of the pulse-chase experiment that was used to test the nsp( + ) nucleotide incorporation specificity on a primed poly(u) template (see table ). the reactions were initiated with a limiting concentration of [α- p]atp to allow the formation of a stable polymerase-template complex. unlabelled nucleotides were used at a final concentration of mm. (f) sars-cov nsp( + ) allowed only limited transversional and transitional mutations. use of manganese ions as cofactor for polymerase activity resulted in a minor, though noticeable loss of fidelity. lane represents the input signal to which no unlabelled nucleotides were added. nucleoside triphosphates are abbreviated to single letters (i.e. a for atp, g for gtp, u for utp, c for ctp and r for rtp).
interestingly, this fusion protein, a potential intermediate of cov replicase polyprotein processing and a multimer in solution ( figure e ), showed primer extension activities that were comparable to or higher than the activity of nsp( + -his) ( figure f ). the de novo initiation activity of nsp - -his was, however, $ -fold lower than the activity of nsp and nsp( + ) ( figure d ). in conclusion, this result clearly underlines that the two n-terminal fusion partners other than nsp are specifically detrimental to sars-cov nsp primer-dependent rdrp activity in vitro. it also demonstrates that nsp alone may be sufficient to act as a primase. the complex replication and transcription process that coronaviruses initiate upon infection involves up to viral nsps and at least one host factor ( ) ( ) ( ) . both individually as well as in complex with each other, these subunits engage in numerous protein-protein interactions ( , ) and embody various enzymatic activities, including proteolytic ( , ) , atpase ( ) , and cap modifying reactions ( ) . remarkably though, the mechanism and enzymes required to catalyse rna synthesis in the cov rtc remain very poorly understood. moreover, uniquely among rna viruses which generally employ a single rna polymerase to drive their rna synthesis ( , ) , the polymerase activity assays and nsp mutagenesis documented in this and other studies suggest that, in addition to the presumed nsp 'main rdrp', other polymerase activities could play a critical role in the synthesis of sars-cov rnas ( , , , ) . following up on the description of an nsp -and nsp -containing hexadecameric ring structure ( ) and the nsp -associated polymerase activity ( ), we here demonstrate that the nsp( + ) hexadecamer is the most probable conformation of the second sars-cov polymerase, given the near-complete association of nsp and nsp when mixed : in solution ( figure f ). 
significant for our understanding of cov rna synthesis, we find that this complex is capable of binding dsrna molecules and extending partially double-stranded rna templates. this activity is therefore essentially comparable to the activity reported for the nsp -rdrp ( ) . a direct comparison with the nsp activity is difficult, however. in the course of a one-hour reaction, . mm monomeric nsp -rdrp incorporates ~ mm nmp into a primed (cu) template ( ). the nsp( + ) complex, at a mm concentration of nsp and nsp monomers, incorporates ~ mm nmp. per monomer, the activity difference is therefore -fold, but if we assume that most nsp and nsp monomers assemble into hexadecamers and that each hexadecamer contributes only one functional active site per incorporation event, the difference would be much smaller and only ~ . -fold. presently, however, we do not yet have an estimate for the efficiency and stability of the nsp( + ) complex, nor do we know the number of active sites in the complex that determine its overall activity. mutagenesis of nsp was performed to identify residues that may contribute to the catalytic centre of the nsp( + ) polymerase, while differently tagged nsp recombinant proteins were constructed to explain some striking differences with previous observations. these efforts resulted in two intriguing observations.

[...] table ), using the synthesis of the first dinucleotide pppgpa, as previously described by imbert et al. ( ) , as readout. nsp template binding mutant k a was used as negative control. the amp contaminant present in the used [α- p]atp label is marked as loading control and size reference. (d) side-by-side comparison of the de novo initiation activities of nsp , his-nsp and nsp - -his. (e) elution profile of the nsp - -his fusion protein relative to nsp -his. (f) primer-extension activities of putative cleavage intermediate nsp - on the u template (see figure and table ).

firstly, mutation of the conserved n-terminal d/exd/e motif, comprising d
and d in sars-cov, abolished rdrp activity, whereas mutation of the c-terminal motif, including sars-cov residues d and d , did not affect polymerase activity ( figure ). given the general importance of acidic residues for metal-ion coordination in polymerase active sites ( ) ( ) ( ) ) and the d/exd/e consensus sequence in coronaviruses at positions - , we now postulate that these residues are part of the mg + -binding active site in spite of the more conserved nature of d and d ( figure ) , and their position in the nsp( + ) structure (see below for further discussion). secondly, the presence of n-terminal extensions other than nsp , such as ubiquitin and his , severely affected the primer extension activity of nsp (figure ) , potentially by changing its oligomeric state ( figure ) . however, the relatively strong activity of nsp - (figure ) , a potential naturally occurring replicase processing intermediate, implies that nsp 's activity is unlikely to be directly controlled by an n-terminal cleavage event, as was observed for, e.g. the poliovirus polymerase ( ) . in addition, these observations suggest that a more diverse array of nsp -containing rdrps may be involved in cov replication and transcription. comparing our data against the background of the previously published nsp( + ) structure ( ), we made four main observations. firstly, we note that in the published nsp( + ) crystal structure four of the eight n-terminal d/exd/e motifs in the complex reside at the border of partially unresolved n-terminal nsp domains, where the coordinates of up to nsp residues and exogenous amino acids derived from the removed gst fusion partner were not determined. in light of our own finding that unnatural n-terminal extensions severely impair nsp 's rdrp activity ( figure ), this suggests that the published crystal structure may not represent an active conformation of the nsp( + ) polymerase. 
secondly, we observe that residues d and d , which are both crucial for nsp( + ) activity, are residing in an a-helix in the nsp( + ) structure (supplementary figure s ) , whereas in canonical primases and polymerases, the catalytic centre is preferentially located on b-strands or turns ( , ) . thirdly, we note that mg + was lacking from the published nsp( + ) crystal structure ( ) , even though it is required for nsp( + ) activity. fourthly and last, we observe that a : ratio of nsp :nsp is sufficient to capture all nsp in a higher molecular weight complex ( figure f ) whereas previously a : ratio was required ( ) , potentially due to the additional n-terminal residues that altered the dynamics of complex formation. the (functional) implications of these observations are not clear at present, but additional structural studies will likely be required to address these issues in detail, and gain insights that may aid in explaining the in vitro results presented here. likely, such experiments will also offer further information regarding the residues that are involved in nucleotide positioning, mg + coordination and rdrp chemistry. in summary, our results provide important novel insights into the functionality of the sars-cov hexadecameric nsp( + ) complex and demonstrate its activity as an rna polymerase. in addition, our experiments and controls revealed and address a number of disparities between previous claims and hypotheses ( ) , and our own observations. the 'primase hypothesis' previously formulated by imbert and co-workers ( ) remains an intriguing model to explain the initiation of sars-cov rna synthesis and is a topic that will be addressed in detail elsewhere. 
nevertheless, based on the primer extension activity of nsp( + ) on non-structured rna templates, we can no longer exclude the possibility that nsp( + ) may synthesise substantially longer products than mere oligonucleotide primers in vivo, possibly stimulated by the presence of additional viral protein factors that could e.g. provide rna-unwinding activity. consequently, it is now a distinct possibility that cov rna synthesis involves structurally different and functionally separable rna synthesising complexes [e.g. containing nsp or nsp( + )], each possessing their own dedicated rdrp characteristics and function in viral plus or minus strand rna synthesis. it will therefore be crucial to study whether these different polymerase activities are part of the same enzyme complex and, if so, whether they can influence each other's activity or are subject to additional control mechanisms.

host factors in positive-strand rna virus genome replication
synthesis of subgenomic rnas by positive-strand rna viruses
two proton transfers in the transition state for nucleotidyl transfer catalyzed by rna- and dna-dependent rna and dna polymerases
mutation of the aspartic acid residues of the gdd sequence motif of poliovirus rna-dependent rna polymerase results in enzymes with altered metal ion requirements for activity
molecular model of sars coronavirus polymerase: implications for biochemical functions and drug design
coronaviruses post-sars: update on replication and pathogenesis
coronaviruses: molecular biology and diseases
unique and conserved features of genome and proteome of sars-coronavirus, an early split-off from the coronavirus group lineage
the rna polymerase activity of sars-coronavirus nsp is primer dependent
zn + inhibits coronavirus and arterivirus rna polymerase activity in vitro and zinc ionophores block the replication of these viruses in cell culture
a second, non-canonical rna-dependent rna polymerase in sars coronavirus
insights into sars-cov transcription and replication from the structure of the nsp -nsp hexadecamer
production of ''authentic'' poliovirus rna-dependent rna polymerase ( d(pol)) by ubiquitin-protease-mediated cleavage in escherichia coli
a new lead for nonpeptidic active-site-directed inhibitors of the severe acute respiratory syndrome coronavirus main protease discovered by a combination of screening and docking methods
muscle: multiple sequence alignment with high accuracy and high throughput
genome-wide analysis of protein-protein interactions and involvement of viral proteins in sars-cov replication
structural basis for proteolysis-dependent activation of the poliovirus rna-dependent rna polymerase
virus-encoded proteinases and proteolytic processing in the nidovirales
poliovirus rna-dependent rna polymerase ( dpol) is sufficient for template switching in vitro
coronavirus genome: prediction of putative functional domain in the non-structural polyprotein by comparative amino acid sequence analysis
origin and evolution of the archeo-eukaryotic primase superfamily and related palm-domain proteins: structural insights and new members
fidelity of dna synthesis catalyzed by human dna polymerase alpha and hiv- reverse transcriptase: effect of reaction ph
the broad-spectrum antiviral ribonucleoside ribavirin is an rna virus mutagen
mechanisms of action of ribavirin against distinct viruses
inhibitory effect of mizoribine and ribavirin on the replication of severe acute respiratory syndrome (sars)-associated coronavirus
ribavirin and interferon-beta synergistically inhibit sars-associated coronavirus replication in animal and human cell lines
ovarian tumor domain-containing viral proteases evade ubiquitin- and isg -dependent innate immune responses
crystal structure of poliovirus cd protein: virally encoded protease and precursor to the rna-dependent rna polymerase
sars-coronavirus replication/transcription complexes are membrane-protected and need a host factor for activity in vitro
the coronavirus replicase: insights into a sophisticated enzyme machinery
nidovirus transcription: how to make sense
the sars-coronavirus plnc domain of nsp as a replication/transcription scaffolding protein
a noncovalent class of papain-like protease/deubiquitinase inhibitors blocks sars virus replication
multiple enzymatic activities associated with severe acute respiratory syndrome coronavirus helicase
in vitro reconstitution of sars-coronavirus mrna cap methylation
processing of open reading frame a replicase proteins nsp to nsp in murine hepatitis virus strain a replication
genetic interactions between an essential cis-acting rna pseudoknot, replicase gene products, and the extreme end of the mouse coronavirus genome

the authors thank dr danny nedialkova, lorenzo subissi, dr isabelle imbert, dr bruno canard and dr alexander gorbalenya for stimulating discussions; linda boomaars-van der zanden and dr clara posthuma for assistance with nsp purification; puck van kasteren and marjolein kikkert for providing the eav nsp protease; and jos van vugt for his initial work on nsp in our lab. supplementary data are available at nar online. conflict of interest statement. none declared.

key: cord- -xc osdx authors: qureshi, abid; thakur, nishant; tandon, himani; kumar, manoj title: avpdb: a database of experimentally validated antiviral peptides targeting medically important viruses date: - - journal: nucleic acids res doi: . /nar/gkt sha: doc_id: cord_uid: xc osdx

antiviral peptides (avps) have exhibited huge potential in inhibiting viruses by targeting various stages of their life cycle. therefore, we have developed avpdb, available online at http://crdd.osdd.net/servers/avpdb, to provide a dedicated resource of experimentally verified avps targeting over medically important viruses including influenza, hcv, hsv, rsv, hbv, denv, sars, etc. however, we have separately provided hiv inhibiting peptides in ‘hipdb’.
avpdb contains detailed information of peptides, including modified peptides experimentally tested for antiviral activity. in modified peptides a chemical moiety is attached to increase their efficacy and stability. the detailed information includes: peptide sequence, length, source, virus targeted, virus family, cell line used, efficacy (qualitative/quantitative), target step/protein, assay used in determining the efficacy and pubmed reference. the database also furnishes physicochemical properties and predicted structure for each peptide. we have provided user-friendly browsing and search facilities along with other analysis tools to help the users. the entry of many synthetic peptide-based drugs into various stages of clinical trials underscores the importance of avp resources. avpdb is anticipated to cater to the needs of the scientific community working on the development of antiviral therapeutics.

viruses are the causative agents of various dreadful diseases in humans and animals ( , ) . for the majority of viruses like hepatitis c virus (hcv), influenza, dengue virus (denv), severe acute respiratory syndrome (sars), herpes simplex virus (hsv), etc., antiviral therapeutics are limited or lacking ( ) . moreover, owing to increasing drug resistance, conventional antiviral therapy is continuously challenged these days ( , ) . therefore, scientific efforts are underway to search for novel antivirals ( , ) . antiviral peptides (avps) are regarded as promising new entities to combat viral infections. avps are a subset of antimicrobial peptides (amps), which act as the first line of defence in many organisms as part of the innate immune response and are host defence peptides generated in response to pathogenic disease conditions ( ) ( ) ( ) . avps are known to act either directly or by eliciting an immune response ( ) .
they usually inhibit directly one or more stages in the life cycle of a virus, viz., entry, attachment, replication, transcription, translation, maturation, release, etc.; thereby exhibiting the antiviral effects ( , ) . one of the earliest reports stating the direct involvement of peptides in inhibiting herpes simplex virus (hsv) multiplication dates back to ( ) . since then researchers have been extensively working on peptidebased antiviral development. bultmann et al. ( ) used fgf signal peptide derivatives to inhibit hsv- entry and the best performing avp had a half maximal inhibitory concentration (ic ) of . mm. budge and graham ( ) used r-a derived peptides to inhibit respiratory syncytial virus (rsv) replication and achieved a maximum ic of . mm. also, a peptide derived from spike (s) protein of sars-cov has been proved to be effective against sars virus entry with an efficacy of mm ( ). the peptide 'flupep' inhibits influenza virus attachment to the cells with an ic of . mm ( ) . similarly, xu et al. ( ) were able to inhibit denv protease using avps with a minimum ic of . mm. an avp named 'ctry ' has been synthesized, which possesses anti-hcv activity with an ec of . mg/ml ( ) . therapeutic potential, mode of action and importance of avps has been further reviewed ( , , ) . peptide-based drugs are advantageous over conventional drugs in having lesser molecular weight, higher efficiency, lower toxicity and minor side effects ( ) . avps are usually derived from natural sources but they can be readily modified by adding chemical groups or non-natural amino acids to further enhance their activity and stability ( ) . due to high potential, an estimated peptidebased therapeutics as antimicrobial/immunomodulatory are under clinical trials ( ) . the first avp to pass the clinical trials was 'enfuvirtide' (t ), an hiv fusion inhibitor that is being sold under the name of 'fuzeon' ( , ) . 
bioinformatics resources are required to accommodate and analyse the enormous data being generated on avps. although a number of resources exist for general antimicrobial peptides, such as apd ( ) , camp ( ) , dampd ( ) , yadamp ( ) , lamp ( ), etc., specific resources on avps are lacking. therefore, to fill this void we have recently developed avppred ( ) and hipdb ( ) . avppred is the first avp prediction algorithm developed using a support vector machine (svm), whereas hipdb is a specific database of experimentally validated hiv inhibiting peptides, which is freely available at http://crdd.osdd.net/servers/hipdb. hipdb harbours information on peptides and modified peptides experimentally tested for hiv inhibiting activity. besides these, no other resource is available for avps. hence, we developed avpdb, a comprehensive resource of peptides experimentally validated for their antiviral activities. relevant data were retrieved from the pubmed database, a free repository of abstracts and references on biomedical and life sciences. an exhaustive literature search was accomplished by building search queries combining many keywords including virus, viral, peptide, inhibit, block, etc. a typical text mining query is given below: full text search returned articles as on july . in the initial screening, we found that the majority of the articles did not furnish the desired data. this could be because the above keywords are quite frequent in the literature. therefore, we limited our query to the title/abstract fields and retrieved articles using the advanced search option of pubmed. these articles were manually examined in detail based on their abstracts/full paper to extract the desired data. besides, we also searched these keywords in the patentlens database and included data from eight relevant patents. reviews, general methodological and non-english articles were not considered. 
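The query itself is not reproduced in the text above, so the following is only a minimal sketch of how a title/abstract-restricted query could be assembled from the listed keywords. The function name `build_pubmed_query` and the grouping of keywords into three sets are assumptions made for illustration, not the authors' actual query; `[Title/Abstract]` is the PubMed field tag for restricting a term to those fields.

```python
def build_pubmed_query(virus_terms, peptide_terms, action_terms,
                       field="Title/Abstract"):
    """Combine keyword groups into one boolean PubMed query string.

    Terms inside a group are OR-ed; the groups are AND-ed together.
    The three-group layout is a hypothetical choice for illustration.
    """
    def group(terms):
        return "(" + " OR ".join(f"{term}[{field}]" for term in terms) + ")"

    return " AND ".join(group(g) for g in (virus_terms, peptide_terms, action_terms))


# Keywords taken from the text; how they were combined is an assumption.
query = build_pubmed_query(["virus", "viral"], ["peptide"], ["inhibit", "block"])
print(query)
```

The resulting string can be pasted into the PubMed advanced search box; dropping the field tag would correspond to the broader full-text style search that returned too many irrelevant articles.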
in addition, articles that reported only predicted peptides, peptide design, peptide structures or peptide analogues were excluded. dendrimeric peptides, complex peptide conjugates and peptide/drug combinations were also removed, as were articles lacking a peptide sequence or experimental efficacy. peptides targeting hiv were left out of the database because these data were already published in our recent database, hipdb ( ) . emphasis was laid on articles reporting experimentally validated peptides and covering all or most of the avpdb fields. after filtering out the above articles, the remaining research articles were used to collect peptides experimentally tested for virus inhibiting activity. further, modified peptides were also extracted and are provided separately in avpdb. in our database, complete avp data of almost all human viruses reported in the literature have been included. avpdb is a manually curated, open source database of avps targeted against diverse viruses of therapeutic importance. the database comes with easy-to-operate browsing as well as searching with sorting and filtering functionalities. avpdb also provides physicochemical properties and predicted structures of avps, along with tools for data analysis such as blast and map, as well as links to major peptide resources. the physicochemical properties displayed are charge, polarity, composition, hydrophobicity and secondary structure preference. the values used for calculating these properties were retrieved from the aaindex database. structures of the peptides were predicted using the pepstr algorithm ( ) and the pep-fold ( ) server. structures are displayed in a jmol applet. 
to view the structures, the java plugin must be installed in the browser and javascript enabled. blast and map tools help in finding similar peptides reported in the database. overall database architecture is shown in figure . avpdb currently archives the following fields extracted from the literature: sequence: all peptide sequences are formatted in standard one letter amino acid notation along with their respective string length. figure . these facts were also separately calculated for modified peptides as shown in figure . on further analysing the overall amino acid composition of the database, it was noticed that some amino acids like leu, lys, ala, arg and val were more abundant while others like his, met, trp and tyr were present less frequently. these results are shown in supplementary figure s . peptide efficacy statistics for natural as well as modified peptides are presented in table . also, the top sources of the natural and modified peptides are given in table . the 'avpdb map' is a user-friendly tool that fetches the peptides in our database that perfectly match a user-provided protein sequence. it thus helps the user to find how many peptides against that protein sequence are available in our database. the output of this tool displays the avpid, its source, sequence and its target. also mentioned is the start position where the match is found in the user-provided sequence. (ii) avpdb blast: additionally, blast allows alignment of a user-provided peptide sequence against all the peptide sequences available in our database. this helps the user to confirm whether a given peptide sequence or a similar one has already been reported. the output is given in the standard format with the blast score and e-value. the alignment is shown for the peptides found to be identical or similar in the database. the output can be formatted based on the options provided by the user. 
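The exact-match lookup described for the 'avpdb map' tool can be sketched in a few lines. This is an illustration under stated assumptions, not the server's implementation, which the text does not describe: the function name `map_peptides`, the peptide identifiers and the sequences are all hypothetical.

```python
def map_peptides(protein, peptides):
    """Find every exact occurrence of each peptide in a protein sequence.

    protein  -- user-provided protein sequence (one-letter code)
    peptides -- dict mapping a peptide id to its sequence
    Returns a list of (peptide_id, sequence, start), with start 1-based.
    """
    protein = protein.upper()
    hits = []
    for peptide_id, seq in peptides.items():
        seq = seq.upper()
        start = protein.find(seq)
        while start != -1:
            hits.append((peptide_id, seq, start + 1))
            # Continue one residue further so overlapping matches are found too.
            start = protein.find(seq, start + 1)
    return hits


# Hypothetical ids and sequences, for illustration only.
peptides = {"AVP_X1": "KLVFF", "AVP_X2": "GGW"}
hits = map_peptides("MKLVFFAEGGWKLVFF", peptides)
for hit in hits:
    print(hit)
```

Reporting a 1-based start position mirrors the output described above, where the tool lists where in the user-provided sequence the match is found.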
various important physicochemical properties such as amino acid composition, hydrophobicity, preference for β-sheets, frequency of α-helix, amino acid charge and polarity can be calculated using aaindex ( ) . these properties can be calculated for any user-provided peptide sequence by submitting it on the analysis page available under the tools column. a user-friendly 'browse by' option allows the user to explore the data for normal peptides by any of the fields categorized in the database, viz., virus, family, peptide source, cell line, target and assay. for modified avps, a separate browse option is provided where the data can be sought by virus, modification, peptide source, cell line, target and assay. to specifically retrieve hiv inhibiting peptides, extensive links to hipdb are provided from avpdb pages. avpdb has been incorporated with four different searches: (i) field search: here the user can enter the query in the box and either specify the field to search against or keep the default 'all' option, which searches against all the fields in the database. besides the choice of fields, the search type option allows retrieval of either an exact match or matches containing the query. the results obtained from this search display fields, of which the first nine contain the experimental data and the last one, 'analysis', has links to blast results, physicochemical properties and predicted peptide structure. as more and more avps are being published, interested workers may submit data to avpdb via the online submission form provided in the database. once the information is cross-verified by our team, it will be included in the updates of the database. avpdb is implemented using the open source lamp solution stack on red hat enterprise linux (ibm sas x machine) with mysql ( . . b) and apache ( . . ) in the back-end; the front-end of the web interface is implemented with php ( . . ). 
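Two of the physicochemical properties listed above, amino acid composition and charge, can be illustrated with a short sketch. It is not the avpdb analysis page: the aaindex scales are not reproduced here, and the charge rule (+1 for Lys/Arg, -1 for Asp/Glu, ignoring His, the termini and pH) is a deliberately crude assumption. The function names are hypothetical.

```python
from collections import Counter


def composition(seq):
    """Percent composition of each residue present in a peptide."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def net_charge(seq):
    """Crude net charge: +1 per K or R, -1 per D or E (an assumed rule)."""
    seq = seq.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


peptide = "KKLAEDR"  # hypothetical peptide, for illustration only
print(composition(peptide))
print(net_charge(peptide))
```

A scale-based property such as hydrophobicity would follow the same pattern, averaging a per-residue value looked up from the chosen aaindex entry.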
the database is freely available at http://crdd.osdd.net/servers/avpdb. a vast amount of data regarding avps both natural as well as modified is reported every year. to cope with these valuable data, we would like to include more viruses or newly discovered unique peptides to our database as appropriate information becomes available in the scientific literature. also, a tool to predict the ic value of virus inhibitory peptides shall be plugged in the database in near future. supplementary data are available at nar online. mechanisms of viral emergence emerging viral diseases rates of evolutionary change in viruses: patterns and determinants herpes simplex virus resistance to acyclovir and penciclovir after two decades of antiviral therapy antiviral resistance and the future landscape of hepatitis c virus infection therapy hivsirdb: a database of hiv inhibiting sirnas virsirnadb: a curated database of experimentally validated viral sirna/shrna anti herpes simplex virus activity of lactoferrin/ lactoferricin -an example of antiviral activity of antimicrobial protein/peptide inhibition of respiratory syncytial virus by rhoa-derived peptides: implications for the development of improved antiviral agents targeting heparinbinding viruses antimicrobial and hostdefense peptides as new anti-infective therapeutic strategies the gamma interferon (ifn-gamma) mimetic peptide ifn-gamma ( - ) prevents encephalomyocarditis virus infection both in tissue culture and in mice hipdb: a database of experimentally validated hiv inhibiting peptides avppred: collection and prediction of highly effective antiviral peptides inhibition of virus multiplication by immunoactive peptides modified fgf signal peptide inhibits entry of herpes simplex virus type identification of a new region of sars-cov s protein critical for viral entry a novel family of peptides with potent activity against influenza a viruses critical effect of peptide cyclization on the potency of peptide inhibitors against 
dengue virus ns b-ns protease design of histidine-rich peptides with enhanced bioavailability and inhibitory activity against hepatitis c virus antimicrobial peptides of multicellular organisms phage display of combinatorial peptide libraries: application to antiviral research chemical modifications designed to improve peptide stability: incorporation of non-natural amino acids, pseudo-peptide bonds, and cyclization designing antimicrobial peptides: form follows function a phase ii clinical study of the long-term safety and antiviral activity of enfuvirtide-based antiretroviral therapy enfuvirtide (fuzeon): the first fusion inhibitor apd : the updated antimicrobial peptide database and its application in peptide design camp: a useful resource for research on antimicrobial peptides dampd: a manually curated antimicrobial peptide database yadamp: yet another database of antimicrobial peptides lamp: a database linking antimicrobial peptides pepstr: a de novo method for tertiary structure prediction of small bioactive peptides pep-fold: an updated de novo structure prediction server for both linear and disulfide bonded cyclic peptides aaindex: amino acid index database, progress report conflict of interest statement. none declared. these peptides are comprised of non-natural or chemically modified amino acids. key: cord- -lz rh authors: li, jin; berbeco, ross; distel, robert j.; jänne, pasi a.; wang, lilin; makrigiorgos, g. mike title: s-rt-melt for rapid mutation scanning using enzymatic selection and real time dna-melting: new potential for multiplex genetic analysis date: - - journal: nucleic acids res doi: . /nar/gkm sha: doc_id: cord_uid: lz rh the rapidly growing understanding of human genetic pathways, including those that mediate cancer biology and drug response, leads to an increasing need for extensive and reliable mutation screening on a population or on a single patient basis. 
here we describe s-rt-melt, a novel technology that enables highly expanded enzymatic mutation scanning in human samples for germline or low-level somatic mutations, or for snp discovery. gc-clamp-containing pcr products from interrogated and wild-type samples are hybridized to generate mismatches at the positions of mutations over one or multiple sequences in-parallel. mismatches are converted to double-strand breaks using a dna endonuclease (surveyor™) and oligonucleotide tails are enzymatically attached at the position of mutations. a novel application of pcr enables selective amplification of mutation-containing dna fragments. subsequently, melting curve analysis, on conventional or nano-technology real-time pcr platforms, detects the samples that contain mutations in a high-throughput and closed-tube manner. we apply s-rt-melt in the screening of p and egfr mutations in cell lines and clinical samples and demonstrate its advantages for rapid, multiplexed mutation scanning in cancer and for genetic variation screening in biology and medicine. screening for genetic changes to unveil molecular attributes of human specimens is important for a variety of medical applications, including genotyping for inherited disorders, prediction of the pathologic behavior of malignancies, identification of cancer biomarkers and can affect treatment decisions for individual patients ( ) ( ) ( ) . for example, mutations in genes like egfr can profoundly influence chemotherapeutic response in lung cancer ( ) ( ) ( ) ( ) and the response is modulated by mutations in other genes of the same signaling pathway [e.g. k-ras, her , erbb- ( , ) ]. therefore there is a need for efficient and high-throughput mutation screening of multiple genes along identified signal transduction pathways in tumor samples. because a large portion of cancer-causing genetic changes remains unknown and can occur in numerous positions along tumor suppressor genes (e.g. 
p , atm, pten) mutation scanning rather than detection of specific mutations is frequently required for molecular cancer profiling. sequencing is often considered the gold standard for comprehensive mutation analysis. multi-capillary electrophoresis, re-sequencing arrays or pyrosequencing provide platforms for highly parallel genetic analysis ( ) ( ) ( ) ( ) ( ) ( ) ( ) . however, the expense associated with these techniques is currently high both for instrumentation and for running costs. since somatic mutations for most genes are relatively rare events it can be inefficient to scan for mutations using expensive approaches that in several cases provide unnecessary data ( , ) . another issue with direct sequencing or re-sequencing arrays is the difficulty in detecting a small fraction of mutated alleles in the presence of a high excess of normal alleles, which is frequently the case with clinical cancer samples ( ) . as a less expensive alternative, rapid pre-screening methods such as sscp, dgge, dhplc, ccm, cdce or hr-melting are widely utilized to identify dna fragments that contain mutations prior to performing full sequencing ( , ( ) ( ) ( ) ( ) ( ) . enzymatic mutation detection based on mismatch scanning enzymes like muty, tdg or t endonuclease vii for mutation pre-screening has also been employed ( ) ( ) ( ) ( ) ( ) , albeit with modest success since these enzymes cannot detect all possible mutations and deletions ( ) and some of them have substantial activity on homoduplex dna ( ) . *to whom correspondence should be addressed. tel: + - - - ; fax: + - - - ; email: mmakrigiorgos@partners.org © the author(s) this is an open access article distributed under the terms of the creative commons attribution non-commercial license (http://creativecommons.org/licenses/ by-nc/ . /uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. 
recently an enzymatic mutation scanning method based on the surveyor™ (celi/ii) nuclease ( , ) combined with dhplc or gel electrophoresis detection was introduced that shows satisfactory selectivity and reliability ( % mutant to wild-type alleles is detectable) while it also identifies all base substitutions and small deletions that are important to cancer ( , ) or to biotechnology and plant genetic applications [tilling method ( ) ( ) ( ) ( ) ( ) ( ) ]. while reliable, the use of dhplc for examining surveyor™-generated dna fragments is a slow endpoint detection method restricted to examining a single dna fragment at a time and the resulting dna fragments cannot be sequenced. this limits analysis of cancer specimens when numerous samples or genetic regions need to be screened. we introduce a new approach that enables surveyor™ to scan for mutations over one or several pcr products simultaneously and selectively amplify and isolate the mutation-containing dna fragment(s) via linker-mediated pcr. by selectively amplifying mutation-containing dna from wild-type fragments, the present approach de-couples enzymatic mutation scanning from the endpoint detection step. as a result, following enzymatic action on mismatches any chosen dna detection method (real-time pcr, gel/capillary electrophoresis, microarray-based detection) can potentially be used to identify the mutated dna fragments in a simplex or multiplex fashion. here we utilize real-time pcr coupled with melting curve analysis (surveyor™-mediated real time melting, s-rt-melt) to validate the new technology. we demonstrate that this approach increases the mutation scanning throughput by - orders of magnitude when several ( ) samples are to be pre-scanned for mutations, enables mutation scanning over several pcr fragments simultaneously and allows mutation-positive samples to be directly sequenced when somatic mutations are at a low level (~ - % mutant-to-wild-type ratio) in surgical cancer specimens. 
genomic dna from cell lines with defined mutations in p exons, du (exon ), sw (exons and ), dld (exon ) and bt (exon ) was extracted from cell lines purchased from the american type culture collection (atcc), or purchased as purified dna when available. surgical colon and lung cancer tumor samples were obtained from the massachusetts general hospital tumor bank following internal review board approval. dna from the egfr mutation-positive cell lines a , hcc , h and lu and from formalin-fixedparaffin-embedded lung cancer samples were obtained from the lowe center for thoracic oncology, dana farber cancer institute following internal review board approval. we isolated genomic dna using dneasy tm tissue kit (qiagen). pcr with primers containing -gc-clamp and -m sequences for the m and gc-clamp portion of the primers, as well as the gene-specific portion of the primers used in this investigation are listed in supplementary table . the m f and gc-clamp sequence was added to the end of forward and reverse gene-specific primers respectively, or vice versa. twenty microliter pcr reactions were performed from genomic dna with final concentrations of reagents as follows: x jumpstart tm buffer (sigma), . mm each dntp, . mm forward and reverse primer, x jumpstart tm taq polymerase (sigma). pcr cycling was done on a perkin elmer pcr machine. the cycling conditions were: c, s; ( c, s/ c, s/ c, min)  cycles, with annealing temperature decreasing c/cycle, touch-down pcr; ( c, s/ c, s/ c, min)  cycles; c, min. this pcr program was linked to a program for denaturation and re-annealing of the pcr product over min. five-microliter pcr product ( - ng) was mixed with . ml enhancer tm and . ml surveyor tm (transgenomic) and incubated at c for min followed by adding . ml stop-solution, as per manufacturer's protocol. the inactivated surveyor tm -digested product was purified with pcr qiaquick tm purification kit (qiagen) and eluted in ml water. 
in some experiments, the pcr product was mixed with an approximately equal amount of pcr product from wild-type dna prior to forming cross-hybridized sequences, to facilitate detection of homozygous mutations. addition of polya-tail on the -end: following purification of the surveyor™-treated sample, a poly-adenine 'tail' was added to the -ends of dna fragments. for each reaction, we added ml purified surveyor™-digested pcr product to a final volume of ml with final concentration of x reaction buffer- , x cocl , . mm datp, u terminal transferase (new england biolabs). the reaction was incubated at c for min and inactivated by heating at c for min. the real-time pcr amplification was performed using titanium-taq™ polymerase (bd-biosciences-clontech) in a smart cycler (cepheid) real-time pcr machine. for each reaction, we added . ml polya-tailed dna to a final volume of ml with final concentration of x titanium buffer, . mm each dntp, . x lcgreen (idaho technologies), . mm m f primer, . mm oligodt-anchor mix gaccacgcgtatcgatgtcgacttttttttttttttttv [v represents a, c and g; each oligodt-anchor concentration is . mm, as per race protocol ( )], x titanium™ polymerase (clontech-bd biosciences). the thermocycling program was as follows: cycle of c for min, cycles of c for s, c for s and c for s for reading fluorescence. temperature titration was performed using different denaturation temperatures, - c, to experimentally determine conditions that selectively enable mutation-containing fragments to amplify. the real-time pcr step was immediately followed by real-time differential melting curve analysis using the smartcycler™ machine. dna melting was performed immediately following pcr on the smart cycler i machine. samples were heated from c to c at . c/s. differential fluorescent intensity curves (−df/dt) were smoothed and used for identification of melting peak(s). 
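The melting-peak identification described above (differentiate the fluorescence trace, smooth it, locate the maxima) can be sketched as follows. This is not the instrument software: the central-difference derivative, the moving-average window of five points, the `min_height` threshold and the synthetic sigmoid trace centred at a hypothetical 85 degrees are all assumptions chosen for illustration.

```python
import math


def neg_derivative(temps, fluor):
    """-dF/dT by central differences (one-sided at the two ends)."""
    n = len(temps)
    out = []
    for i in range(n):
        lo, hi = max(i - 1, 0), min(i + 1, n - 1)
        out.append(-(fluor[hi] - fluor[lo]) / (temps[hi] - temps[lo]))
    return out


def moving_average(values, window=5):
    """Centred moving average; the window shrinks near the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def melting_peaks(temps, fluor, window=5, min_height=0.1):
    """Return (temperature, height) for local maxima of the smoothed -dF/dT."""
    curve = moving_average(neg_derivative(temps, fluor), window)
    return [
        (temps[i], curve[i])
        for i in range(1, len(curve) - 1)
        if curve[i] > curve[i - 1]
        and curve[i] >= curve[i + 1]
        and curve[i] >= min_height
    ]


# Synthetic single-transition trace (hypothetical values, not measured data).
temps = [70 + 0.1 * i for i in range(251)]
fluor = [1 / (1 + math.exp((t - 85.0) / 0.5)) for t in temps]
peaks = melting_peaks(temps, fluor)
print(peaks)
```

A sample whose trace yields no peak above the threshold would be scored as lacking an amplified product, while a sample with a peak would be flagged for follow-up; the threshold itself has to be set empirically against controls.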
alternatively, real-time pcr products were examined via dhplc chromatography on a wave™ system (transgenomic). mutation-positive pcr products were purified via pcr purification kit (qiagen) and sequenced using the m f primer. all experiments were repeated at least three times in independent runs from genomic dna. the openarray™ high-throughput, massively parallel real-time pcr platform ( ) (biotrove) was tested for compatibility with s-rt-melt. forty-eight samples of p exon pcr products were generated from different lung adenocarcinoma samples and mutation-containing cell lines and processed via the hybridization and enzymatic steps of s-rt-melt. real-time pcr in the openarray™ platform was performed with the lightcycler faststart™ dna master sybr green™ i (roche) using . mm m f and . mm oligodt-anchor-mix as primers pre-positioned on the array through-holes ( ) and polya-tailed dna as template. the cycling conditions were as follows: cycle at c for min, cycles of c for s, c for s and c for s for reading fluorescence using a high sensitivity imaging camera ( ) . the real-time pcr step was immediately followed by real-time differential melting curve analysis. raw data were exported to excel software for further analysis. the openarray™ experiment was repeated twice at the company's headquarters. to estimate t m,min , the pcr denaturation temperature below which pcr is not efficient, it was assumed, as an initial approximation, that % hypochromicity must be present for pcr to work (i.e. any given sequence must be completely denatured, otherwise it re-forms immediately when temperature is lowered in the reaction and inhibits primer binding). the percent melting (hypochromicity)-versus-temperature relations for gc-clamp-containing pcr products and surveyor™ activity-generated products were estimated using the poland algorithm ( ) , and the thermodynamic parameters determined by blake and delcourt for mm nacl in the solution ( ) were used. 
in order to force agreement at a single point, predicted and observed values for a p exon sequence containing a short gc-clamp were normalized at c. this shift accounts for the influence on t m,min of nacl and mg++ content in the reaction, the presence of the sybr-green/lc-green dyes and the proprietary composition of pcr buffers. the t m,min of all other pcr products was then estimated using these semi-empirically determined parameters. the 'enriched pcr' method by behn et al. ( ) was used to sequence codon mutation of p exon from sample ct and wild-type samples. in addition, a second method [amplification via primer-ligation at the mutation ( , ) ] was used to distinguish mutant and wild-type samples by virtue of the de novo nla-iii site generated in the mutant sample by the p codon g a mutation. the s-rt-melt assay converts pcr fragments generated at positions of mutations by the surveyor tm enzyme to fully amplifiable sequences that enable selective pcr amplification in a subsequent quantitative pcr detection method. following denaturation and re-annealing of pcr products that leads to formation of cross-hybridized sequences at the positions of mutations ( figure a ) the sample is exposed to surveyor tm endonuclease that recognizes base pair mismatches or small loops with high specificity ( ) and generates a break on both dna strands to the mismatch. the resulting dna fragments participate in a terminal transferase (tdt) reaction that leads to polynucleotide 'tailing' (sequential addition of adenine, poly-a-tail) at the -ends. a real-time pcr reaction is subsequently performed using adjusted conditions that enable selective amplification of the mutantonly fragments, followed by real-time melting curve analysis for identification of mutations in the presence of sybr-green tm or lc-green tm dna dye. 
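The real-time pcr step pairs the m primer with an oligodt-anchor mix whose last base is the degenerate symbol v (a, c or g, as stated in the methods). A minimal sketch of expanding such a degenerate primer into its concrete sequences is given below. The function name `expand_degenerate` is hypothetical, the lookup table holds the standard iupac nucleotide codes, and the anchor sequence is copied from the methods text with the extraction space removed.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """List every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]


# Anchor sequence as printed in the methods (ends in the degenerate base V).
anchor = "gaccacgcgtatcgatgtcgacttttttttttttttttv"
primers = expand_degenerate(anchor)
for p in primers:
    print(p)
```

The expansion yields three primers that differ only in their final base, which is why the mix behaves like a small allele-specific primer set at the junction between the poly-a tail and the fragment.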
to enable selective amplification of the mutationcontaining fragments in the real-time pcr step, modified primers are employed for the original amplification from genomic dna ( figure b ). the forward primer contains a region specific to the target gene and a high melting domain (gc-clamp), while the reverse primer contains a region specific to the target gene and an m tail (or vice versa). following the tdt tailing reaction, the m primer is used for real-time pcr in conjunction with a primer that binds to the poly-a tail. the denaturation temperature of the real-time pcr reaction is lowered to enable pcr amplification only for fragments that do not contain gc-clamps. because the pcr products that escape digestion by surveyor tm contain gc-clamps ( figure b ), these fragments do not amplify efficiently during pcr, thereby enabling selective amplification of surveyor tm -selected fragments, i.e. an effective 'purification' of mutation-containing fragments. the subsequent closed-tube melting curve analysis enables clear separation of true mutant sequences from pcr dimers or other artifacts. because s-rt-melt does not require size-separation for identification of enzymatically generated fragments, more than one sequence can be scanned in parallel for unknown mutations in a single-tube reaction of surveyor tm . this simple procedure enables the specificity of the surveyor tm enzyme to be combined with the throughput and convenience of real-time pcr for rapid mutation scanning. finally, because the amplified mutated sequences contain defined primers at their ends, direct sequencing of enzymatically selected pcr products is readily possible following the real-time melting step, enabling sequencing of low-level mutations identified by surveyor tm . to provide initial proof of principle for unknown mutation scanning using s-rt-melt we utilized cell lines and tumor samples containing sequencing-identified mutations at several positions of p exon . 
figure a depicts dhplc chromatograms of the products obtained using a sample containing a p exon g a mutation or a wild-type sample. the standard surveyor™-dhplc approach ( ) was first employed to identify the mutation following pcr amplification of exon from genomic dna. the resulting dhplc traces contain a single product for the wild-type and two products for the mutation-containing sequences (figure a , curves and , respectively). next, s-rt-melt was used to screen the same p exon sequence. following pcr amplification with gc/m -modified primers we cross-hybridized pcr products and exposed them to surveyor™ and tdt tailing. [figure: s-rt-melt workflow. simplex or multiplex pcr amplification of one or more exons with gc-clamp and m -modified primers; self-hybridize or cross-hybridize with wild-type dna to generate mismatches at positions of mutations in one or more pcr fragments; scan all fragments simultaneously for mismatches using cel i/surveyor™ enzyme; use tdt enzyme to add an oligonucleotide tail (e.g. oligo-da) to ′oh ends of digested fragments, to serve as primer anchor; amplify only mutated fragment(s), coupled with real time melting analysis (see b); detect mutations via closed-tube, high-throughput melting curve analysis; if positive, sequence the amplified mutated dna fragment.] the subsequent real-time pcr was run at different denaturation temperatures and the products were examined either via dhplc or via real-time melting-curve analysis. at the standard denaturation temperature of c the mutation-containing sample contains two peaks, corresponding to the anticipated amplification of both surveyor™-digested and un-digested fragments (figure a, curve ) . however, when the pcr denaturation temperature is lowered (e.g. - c) a single pcr product is generated for the mutant sample, while the wild-type sample demonstrates no product (figure a , curves - ). 
in figure b , real-time differential melting curves for the pcr reaction run at c are depicted. a peak corresponding to the pcr product from the mutant sample is again clearly evident, which is absent in the wild-type sample. finally, figure c depicts sequencing of the s-rt-melt-generated pcr fragment, as well as the direct sequencing from genomic dna. the g a mutation is evident in both samples. in the s-rt-melt product the anticipated addition of the poly-a tail at the -position next to the mutation is illustrated. to examine the selectivity of s-rt-melt, dilutions of mutant to wild-type dna were performed using dna from sw- cells that harbor a p exon g a homozygous mutation. the real-time pcr reaction was again performed at c and mutant-to-wild-type ratios of ~ - % were distinguished from the wild-type using either dhplc ( figure d ) or melting curve analysis ( figure e ). in these samples, direct di-deoxy-sequencing could not identify a mutation if the ratio of mutant-to-wild-type was ~ - % (data not shown). on the other hand, sequencing of s-rt-melt products was possible, including at the lower dilutions ( figure f ). s-rt-melt sequencing generated traces with poly-a tails depicting the presence and the position of the mutation, although the exact nucleotide change was less clear than the one in exon (i.e. the position ± base from the mutation might also be mistaken for a mutation). the reason for this ± base ambiguity in the exact position of the mutation can probably be understood as follows. the pcr performed following poly-a tail addition contains an equimolar mixture of three reverse primers ( ending in v = g, a or c). depending on the exact nucleotide at the mutation, the correct primer should in theory be preferred, while the other two primers should not allow efficient polymerase extension due to the mismatched -end. 
however, in practice this 'allele-specific pcr' step occasionally allows -mismatched primer extension, enabling more than one version of the primer to amplify over the position of the mutation, or alternatively the incorporation of the poly-a tail may occur ±1 base from the exact position of the mutation. we conclude that in certain cases s-rt-melt indicates the position of the mutation to within one base, while in others (e.g. p exon ) it indicates both the position and the actual nucleotide change. next, p exon was amplified using dna from a group of surgical lung adenocarcinoma samples and s-rt-melt was used for the screening of unknown mutations via melting curve analysis. mutations at different positions along exon were present in several of these clinical samples, as indicated by the shift in melting profiles obtained (figure h) and subsequently verified via sequencing. in this set of samples, s-rt-melt-sequencing detected a low-level mutation in a colon cancer specimen (ct ) that direct sequencing failed to identify (figure i). as with figure f, sequencing of sample ct indicated the position of poly-a tail addition to within one base, but the actual nucleotide change was difficult to identify. to exclude the possibility of a false positive, two independent rflp-based methods were used to verify the presence of the mutation. thus, since the position of poly-a tail addition was known (figure i, codon of p exon ), the mismatched primer approach by behn et al. ( ) was used to introduce an mlui restriction site for the wild-type p sample but not for the codon mutants. subsequent restriction with mlui enzyme followed by pcr generated a product with a g a mutation for the ct sample but not for the wild-type sample (supplementary figure , frame a). as an additional verification for the low-level ct mutation, we observed that the g a mutation generates a de-novo nla-iii site at the position of the mutation.
accordingly, we applied 'amplification via primer-ligation at the mutation', a method that we described previously ( , ), to ligate a primer at the nla-iii-digested site, and preferentially amplified the mutant fragment in a second pcr. the sequenced pcr product again identified the g a mutation (supplementary figure , frame b). in conclusion, s-rt-melt correctly identified a p codon low-level mutation in ct that was missed by regular sequencing. this is significant, as p exon mutations at codon have been associated with poor prognosis in cancer ( , ). table of supplementary data depicts a good agreement between standard surveyor screening, s-rt-melt screening and di-deoxy-sequencing, except for the low-level mutation discovered in sample ct via s-rt-melt. s-rt-melt-sequencing traces for two samples with p exons and mutations are also depicted. the data in figures a-d and h indicate a lack of substantial pcr amplification at denaturation temperatures c for fragments containing the gc-clamp and a selective amplification of the mutation-containing fragments for several different mutation positions on p exon . to estimate the influence of the gc-clamp length on pcr efficiency versus temperature and the pcr amplification of fragments generated for mutations lying at different positions along the sequence, a calculation based on the poland algorithm ( ) was performed. the predicted minimum temperatures for substantial pcr amplification were then plotted versus the experimentally observed values. three possibilities were simulated: no gc-clamp, a -nucleotide (nt) gc-clamp and a -nt gc-clamp. dna fragments corresponding to mutations at several positions along exon were also simulated and compared to the experimentally observed minimum temperatures for generating a pcr product for three samples that contained mutations at different positions along p exon (sw , ct and tl ). the results (figure g) indicate agreement to within ∼ .
c between theoretical prediction and experimental observation. for denaturation temperatures in the region - c in combination with a -nt gc-clamp all the available mutations on p exon are predicted to result in selective amplification of the mutation-containing fragment and inhibition of the gc-clamp-containing fragment. this prediction is consistent with the experimental results obtained from pcr temperature-titration experiments (figure g). the developed calculation algorithm can thus be used to predict the appropriate pcr denaturation temperature for additional pcr fragment/gc-clamp combinations. as a further validation for s-rt-melt, we utilized the method to identify mutations in additional p exons. figure a depicts the chromatograms obtained when a : mixture of dna from sw- cells (homozygous mutation at p c t exon ) and from wild-type cells was screened. the real-time pcr reaction was performed at different denaturation temperatures and the products were examined both via dhplc and via melting curve analysis for comparison. as also observed for p exon , at c denaturation temperature both the surveyor tm -digested and the undigested pcr products are amplified during real-time pcr (figure a, curves and , mutant and wild-type, respectively). by lowering denaturation temperature to c or c, a single pcr product is obtained from the mutant while no product, other than primer dimer, is obtained from the wild-type sample (figure a, curves - ). figure b depicts the melting curves obtained following real-time pcr at c denaturation temperature for the mutant and wild-type samples. s-rt-melt was subsequently applied in the same manner to screen for p mutations in exons - from cell lines and surgical colon samples harboring sequencing-identified mutations including a single-base frameshift mutation in exon (listed in supplementary table ). the melting curves from mutant and wild-type samples in p exons - are depicted in figure c-e.
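To make the role of the gc-clamp in the temperature prediction above more concrete, the sketch below compares a fragment with and without a clamp. It does not implement the Poland algorithm used in the text. It substitutes a much cruder, widely quoted empirical estimate of whole-duplex melting temperature based on length and GC content. The salt concentration, the clamp sequence, the example fragment and all function names are assumptions made for illustration.

```python
import math

# Crude stand-in for the Poland-algorithm calculation described in the text:
# an empirical GC-content/length estimate of duplex melting temperature.
# Salt concentration and clamp composition are hypothetical values.


def approx_tm(seq, na_molar=0.05):
    """Rough melting temperature (deg C) of a duplex from %GC and length."""
    seq = seq.upper()
    n = len(seq)
    gc_percent = 100.0 * (seq.count("G") + seq.count("C")) / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * gc_percent - 675.0 / n


def with_gc_clamp(seq, clamp_len):
    """Prepend an alternating GC clamp of the requested length."""
    clamp = ("GC" * clamp_len)[:clamp_len]
    return clamp + seq


def is_denatured(seq, denaturation_c, na_molar=0.05):
    """True if the estimated Tm lies below the PCR denaturation temperature."""
    return approx_tm(seq, na_molar) < denaturation_c
```

With a hypothetical 100-nt fragment of 40% GC content, the estimate rises by several degrees once a 40-nt clamp is attached. A denaturation temperature chosen between the two values would therefore be expected to melt only the clamp-free (enzyme-digested) fragment, which is the qualitative behaviour the temperature-titration experiments rely on.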
the data indicate that results similar to those obtained for p exon are also obtained for p exons , , and . detection of mutations in egfr exons - is of particular clinical interest as these alterations can modulate response to egfr inhibitors in lung adenocarcinoma patients ( , ). figures f, g and h depict the application of s-rt-melt for screening dna from lung cancer cell lines that harbor dhplc-identified alterations in egfr exons - , including a two-codon deletion (del l -e , exon ). the ability of s-rt-melt to detect low-level egfr mutations was evaluated by performing dna dilutions of a heterozygous egfr exon into a homozygous sample. a - % mutant-to-wild-type ratio was detectable in this dilution experiment (figure f). finally, the application of s-rt-melt in detecting mutations in dna from formalin-fixed paraffin-embedded (ffpe) samples was examined by screening four clinical ffpe lung adenocarcinoma specimens. two of these samples were known to harbor egfr exon mutations (l r), while the other two were negative for mutations when independently evaluated via dhplc ( ). figure i demonstrates the identification of the mutational status of these samples via s-rt-melt.
multiplex s-rt-melt or openarray tm -based s-rt-melt increases the throughput of mutation scanning
a significant potential advantage of enzymatic mutation scanning is the ability to screen several sequences simultaneously for mutations. to demonstrate that s-rt-melt can be used for parallel scanning of mutations in several pcr products, we mixed equimolar amounts of pcr products from p exons - containing mutations either in exon or in exon . we then formed 'cross-hybridized sequences' and screened the mixture for mutations in p exons - in a single tube using s-rt-melt, as depicted in figure a. following real-time pcr and melting curve analysis, the exon or exon mutants were clearly distinguished from the wild-type sample (figure a, curves - ).
next, the mutant exon dna sample was first diluted -fold into wild-type exon and the equimolar mixture of p exons - was prepared and screened again in a single tube via s-rt-melt. the exon mutation was again distinguished from the wild-type mixture of exons (figure b, curves - ). since % of p mutations in human tumors are encountered in exons - ( ), the multiplex single-tube s-rt-melt reaction could be used to identify most p mutations encountered in clinical tumor samples. combined with multiplex pcr directly from genomic dna, this approach could result in a convenient, high-throughput method for mutation scanning. by adopting a real-time pcr platform as endpoint detection for s-rt-melt, the throughput for mutation scanning increases drastically over other mutation prescreening approaches that utilize dhplc, or capillary and gel electrophoresis. to better demonstrate this point, a highly parallel nano-technology platform was adopted for the real-time pcr step of s-rt-melt that enables an array of nl volume real-time pcr reactions (openarray tm system) to be carried out simultaneously, followed by differential melting curve analysis ( ). as a proof of principle of the compatibility of s-rt-melt with openarray tm , p exon pcr products were generated from different lung adenocarcinoma samples and mutation-containing cell lines and processed via the hybridization and enzymatic steps of s-rt-melt. the samples were each dispensed in replicate nano-liter volume reactions on openarray tm plates pre-fabricated to contain the appropriate primers and amplified in real-time pcr reactions using a denaturation temperature of c in the presence of sybr-green i dye. melting curves were subsequently obtained using the openarray tm melting curve analysis mode. the pcr growth curves and smoothed differential melting curves obtained clearly distinguish the mutation-containing samples from wild-type samples (figure c and d, representative results from reactions).
furthermore, identification of mutation-containing samples is in good agreement between the conventional and the nano-technology platforms (figure d versus figure h). these data indicate that s-rt-melt is compatible with high-throughput nano-technology detection formats and reiterate the advantage of de-coupling enzymatic selection from the detection step. comparison of the throughput using conventional pre-screening methods (dhplc or dhplc/surveyor tm ) to s-rt-melt (table ) indicates that s-rt-melt is - orders of magnitude faster when a large number of samples ( ) are screened for mutations. if the multiplex s-rt-melt format is adopted, the throughput can increase further.
the intrinsic potential of enzymatic mutation scanning for parallel identification of mutations can, in principle, be very high since the enzyme operates on numerous distinct mismatch-containing sequences on a molecule-to-molecule basis, thus providing highly parallel mutation scanning. however, in the past the selectivity of the enzymes used and the endpoint detection method have limited the realization of this potential. here we enabled surveyor tm , an endonuclease that selectively recognizes mismatches formed by mutations and small deletions following 'cross-hybridized sequence' formation, to generate mutation-specific dna fragments that are amplified and screened via differential melting curve analysis. the replacement of size-separation methods (capillary/gel electrophoresis, dhplc) by real-time pcr technology as the endpoint detection platform and the ability to scan more than one sequence in parallel result in a highly increased throughput for s-rt-melt while retaining the ability to detect diverse mutations at low levels. cel i/ii endonucleases have also been known to have exonuclease activity on dna-ends ( , ). for this reason, s-rt-melt was designed to attach an oligonucleotide linker to the -dna ends via terminal transferase (tdt) instead of using the -dna ends.
the exonuclease activity also tends to degrade the attached -gc-clamps in s-rt-melt, thereby eliminating their influence in reducing amplification of un-digested fragments. we found that if exposure of dna 'cross-hybridized sequences' to surveyor tm is limited to - min, substantial degradation of the -gc-clamps is avoided. for multiplexing mutation detection using several pcr products simultaneously, the size of the gc-clamp on each pcr amplicon may need to be individually adjusted to ensure that mutations along all sequence positions of the pcr products included in the mixture can be screened at a single real-time pcr temperature and that undigested fragments do not amplify. the computational tools developed in this work can be used to guide the individual design of gc-clamps. s-rt-melt detects heterozygous snps as well as mutations. as with other mutation prescreening techniques, the presence of a snp concurrently with a mutation might be difficult to identify without performing sequencing. because snps occur at fixed positions, melting peaks originating from snps have a reproducible pattern and melting temperatures ( , ); thus, in many cases they should be distinguishable from mutations. finally, it is noteworthy that s-rt-melt is a general methodology that may also be applied to isolate mutations using mismatch-cutting enzymes other than surveyor tm when enzymes with satisfactory properties for mutation detection become available. detection platforms other than real-time pcr/melting (e.g. dna microarray-based) may also be envisioned following enzymatic mutation selection. in summary, we developed a new method for rapid mutation scanning, s-rt-melt, that utilizes the cel i/ii (surveyor tm ) and terminal deoxy-nucleotide transferase (tdt) enzymes to isolate and amplify mutation-containing dna fragments without the requirement of dna size-dependent techniques.
besides enabling highly increased throughput, multiplexed mutation screening and direct sequencing of the identified mutant dna fragments, s-rt-melt also retains the advantages of the surveyor endonuclease over alternative pre-screening methods, such as reliability and identification of genetic alterations present at low ( - %) fractions in the sample. s-rt-melt provides a significant advancement in unknown mutation scanning in cancer research and diagnostics as well as for general medical, biological and biotechnology applications.
targeting tyrosine kinases in cancer: the second wave
egfr mutations in lung cancer: correlation with clinical response to gefitinib therapy
egfr mutation and resistance of non-small-cell lung cancer to gefitinib
activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib
egf receptor gene mutations are common in lung cancers from ''never smokers'' and are associated with sensitivity of tumors to gefitinib and erlotinib
somatic mutations of epidermal growth factor receptor signaling pathway in lung cancers
capillary array electrophoresis dna sequencing
the application of capillary electrophoresis for dna polymorphism analysis
resequencing and mutational analysis using oligonucleotide microarrays
tracking the evolution of the sars coronavirus using high-throughput, high-density resequencing arrays
pyrosequencing: history, biochemistry and future
sensitive mutation detection in heterogeneous cancer specimens by massively parallel picoliter reactor sequencing
genome sequencing in microfabricated high-density picolitre reactors
recent developments in high-throughput mutation screening
somatic mutations of the protein kinase gene family in human lung cancer
methods for detection of point mutations: performance and quality assessment. ifcc scientific division, committee on molecular biology techniques
enzymatic mutation detection technologies
reactivity of cytosine and thymine in single base pair mismatches with hydroxylamine and osmium tetroxide and its application to the study of mutations
high resolution analysis of point mutations by constant denaturant capillary electrophoresis (cdce)
a comparison of high-resolution melting analysis with denaturing high-performance liquid chromatography for mutation scanning: cystic fibrosis transmembrane conductance regulator gene as a model
novel non-isotopic detection of muty enzyme-recognized mismatches in dna via ultrasensitive detection of aldehydes
an amplification and ligation-based method to scan for unknown mutations in dna
pcr-based detection of minority point mutations
screening for mutations by enzyme mismatch cleavage with t endonuclease vii
genetic mapping of thymine dna glycosylase (tdg) gene and of one pseudogene in the mouse
purification, cloning, and characterization of the cel i nuclease
mutation detection using a novel plant endonuclease
a rapid and sensitive enzymatic method for epidermal growth factor receptor mutation screening
large-scale discovery of induced point mutations with high-throughput tilling
mismatch cleavage by single-strand specific nucleases
a tilling reverse genetics tool and a web-accessible collection of mutants of the legume lotus japonicus
the wnt/beta-catenin pathway regulates cardiac valve formation
high-throughput discovery of rare human nucleotide polymorphisms by ecotilling
tilling is an effective reverse genetics technique for caenorhabditis elegans
nanoliter high throughput quantitative pcr
thermal denaturation of double-stranded nucleic acids: prediction of temperatures critical for gradient gel electrophoresis and polymerase chain reaction
thermal stability of dna
frequent detection of ras and p mutations in brush cytology samples from lung cancer patients by a restriction fragment length polymorphism-based ''enriched pcr'' technique
ligation of a primer at a mutation: a method to detect low level mutations in dna
detection of hotspot mutations and polymorphisms using an enhanced pcr-rflp approach
frequent detection of ras and p mutations in brush cytology samples from lung cancer patients by a restriction fragment length polymorphism-based ''enriched pcr'' technique
mutations in exon and of p as poor prognostic factors in patients with non-small cell lung cancer
mutations of p and k-ras genes as prognostic factors for non-small cell lung cancer
thermal denaturation of double-stranded nucleic acids: prediction of temperatures critical for gradient gel electrophoresis and polymerase chain reaction
iarc p mutation database: a relational database to compile and analyze p mutations in human tumors and cell lines. international agency for research on cancer
dna melting analysis for detection of single nucleotide polymorphisms
genotyping of single-nucleotide polymorphisms by high-resolution melting of small amplicons
the assistance of mohamet miri and frank haluska, md in obtaining tissue specimens from the massachusetts general hospital tumor bank is gratefully acknowledged.
this work was supported by nci grants r ca - and ca - , by training grant t ca (jl) and by the joint center for radiation therapy foundation.
supplementary data are available at nar online.
conflict of interest statement. none declared.
key: cord- -z kx h authors: métifiot, mathieu; amrane, samir; litvak, simon; andreola, marie-line title: g-quadruplexes in viruses: function and potential therapeutic applications date: - - journal: nucleic acids res doi: . /nar/gku sha: doc_id: cord_uid: z kx h
g-rich nucleic acids can form non-canonical g-quadruplex structures (g s) in which four guanines fold in a planar arrangement through hoogsteen hydrogen bonds.
although many biochemical and structural studies have focused on dna sequences containing successive, adjacent guanines that spontaneously fold into g s, evidence for their in vivo relevance has recently begun to accumulate. complete sequencing of the human genome highlighted the presence of ∼ sequences that can potentially form g s. likewise, the presence of putative g -sequences has been reported in various viral genomes [e.g., human immunodeficiency virus (hiv- ), epstein–barr virus (ebv), papillomavirus (hpv)]. many studies have focused on telomeric g s and how their dynamics are regulated to enable telomere synthesis. moreover, a role for g s has been proposed in cellular and viral replication, recombination and gene expression control. in parallel, dna aptamers that form g s have been described as inhibitors and diagnostic tools to detect viruses [e.g., hepatitis a virus (hav), ebv, cauliflower mosaic virus (camv), severe acute respiratory syndrome virus (sars), simian virus (sv )]. here, special emphasis will be given to the possible role of these structures in a virus life cycle as well as the use of g -forming oligonucleotides as potential antiviral agents and innovative tools.
almost a century ago, the ability of guanosine, but not guanine, to form viscous gels was described ( ). fifty years later, x-ray diffraction data clearly showed that the guanosine moieties in these gels were arranged in a tetrameric organization linked by eight hoogsteen hydrogen bonds (figure ) ( , ). these hydrogen bonds differ from the bonds observed in canonical watson-crick pairing and involve the interaction of the n group from one guanine with the exocyclic amino group from a neighboring base (figure a). therefore, a g-tetrad or a g-quartet results from planar association between four guanines that are held together by eight hydrogen bonds and coordinated with a central na+ or k+ cation ( ) ( ) ( ) ( ) ( ).
in addition, nucleoside derivatives were also used to confirm the structural properties of g-quartets ( ) ( ) ( ) ( ) ( ) ( ). conversely, a g-quadruplex or g is formed by nucleic acid sequences (dna or rna) containing g-tracts or g-blocks (adjacent runs of guanines) and composed of various numbers of guanines. depending on the nucleotide sequence, the way g s can be formed presents a high degree of diversity. the core of a g is based on stacking between two or more g-tetrads, wherein the guanines can adopt either a syn or an anti glycosidic bond angle conformation. consequently, each of the four g-tracts that form the core of the structure can run in the same or opposite direction with respect to its two neighbors, forming parallel, antiparallel or hybrid core conformations. depending on these orientations, the g-blocks delimit four negatively charged grooves of different sizes: narrow, medium or wide (figure b-e). for intra-molecular structures (figure b and c), the four g-tracts belong to the same oligonucleotide and are attached by linkers with variable nucleotide sequences and lengths. these loops can adopt three different conformations: lateral, diagonal or propeller (figure b-d). the bi- or tetra-molecular g structures (figure d-e) are assembled from g-tracts belonging to two or four different strands. the g-blocks can also be interrupted by one to seven non-g nucleotides, which result in bulges that protrude from the g core (figure e). in contrast to the almost mono-morphic canonical duplex, these variable structural parameters are directly related to the nucleotide primary sequence. this unique family of globular-shaped nucleic acid structures (figure f-j) presents a high level of plasticity that enables various applications (see 'g s as antiviral agents' section). however, a potential physiological role for g s has remained controversial for many years.
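The relationship between g-tract orientation and core conformation described above can be written as a small classifier. This sketch assumes the usual naming convention (all four tracts in the same direction: parallel; two in each direction: antiparallel; three in one direction and one in the other: hybrid), which the text implies but does not spell out. The function name and the '+'/'-' encoding are illustrative choices.

```python
def core_topology(directions):
    """Classify a quadruplex core from the 5'->3' direction of its four G-tracts.

    `directions` is a string or list of four symbols, '+' or '-', one per
    tract. The mapping below follows the usual convention and is an
    assumption, not a rule stated in the source text.
    """
    if len(directions) != 4 or set(directions) - {"+", "-"}:
        raise ValueError("expected four '+'/'-' symbols, one per G-tract")
    same_way = directions.count("+")
    if same_way in (0, 4):
        return "parallel"      # all four tracts share one orientation
    if same_way == 2:
        return "antiparallel"  # two tracts up, two tracts down
    return "hybrid"            # three tracts one way, one the other
```

For instance, `core_topology("+++-")` gives `"hybrid"`. The classifier ignores loop type (lateral, diagonal or propeller) and groove widths, which depend on more than strand direction.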
nonetheless, interest in g-quadruplexes has increased over time, with thousands of reports and reviews published on several aspects of g s, including the biophysical and chemical characteristics as well as biological function in prokaryotes and eukaryotes ( , ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ). with an increasing volume of sequencing data, databases and algorithms have also been developed to enable search and mapping of g in a few mammals as well as hundreds of bacterial species ( ) ( ) ( ) ( ). additionally, g s present an intrinsic resistance to 'regular' nucleases. however, g specific nucleases have been isolated, first in yeast with kem /sep ( ) and later in humans with gqn ( ). later, a dna binding protein (g r /rhau) that can unfold rna/dna quadruplexes was isolated in human cells ( , ), which further highlights the biological relevance of g s. finally, development of g specific probes such as monoclonal antibodies ( ) ( ) ( ) and small chemical ligands ( ) ( ) ( ) ( ) ( ) ( ) facilitated in vivo studies that support quadruplex formation in cells. taken together, g s are likely present in several significant genomic regions and may be a key component in important cellular processes ( , ). the sequencing of the human genome highlighted the presence of many sequences enriched in guanines that can potentially form quadruplexes [for a review, see ( ) ]. in theory, more than putative g-tetrads may be formed with loops of - nucleotides, and over , with loops of up to nucleotides. however, the genomic dna is dynamic, and quadruplexes, duplexes or other structural forms at those sites will be influenced by chromatin and other dna-binding proteins. thus, g formation depends on the cell type and cell cycle, ultimately impacted by environmental conditions and stresses.
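As a rough illustration of how putative quadruplex-forming sequences can be searched for, the sketch below scans a sequence for four runs of guanines separated by short loops. The thresholds (runs of at least three guanines, loops of one to seven nucleotides) are assumptions chosen for the example, because the loop lengths in the passage above are not recoverable. The function name is hypothetical, and the pattern does not reproduce any specific published algorithm or database.

```python
import re


def find_putative_quadruplexes(seq, min_tract=3, max_loop=7):
    """Return (start, matched_sequence) for non-overlapping motif hits.

    Motif: four G-runs of at least `min_tract` guanines separated by three
    loops of 1..`max_loop` nucleotides. Both thresholds are illustrative.
    Only the given strand is scanned; C-rich hits on the complementary
    strand are not reported.
    """
    pattern = re.compile(
        r"G{%d,}(?:[ACGTU]{1,%d}G{%d,}){3}" % (min_tract, max_loop, min_tract)
    )
    return [(m.start(), m.group(0)) for m in pattern.finditer(seq.upper())]
```

Applied to four telomeric repeats, `find_putative_quadruplexes("TTAGGG" * 4)` reports a single hit starting at position 3. A sequence match of this kind only marks a candidate: whether a quadruplex actually forms depends on the cellular context discussed in the text.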
without clear signaling and tight regulation, extremities present in linear chromosomes could be recognized as damaged dna and it would be deleterious for the cell if processed as such by repair mechanisms ( ). telomere length has also been linked to the lifespan of organisms. telomeres are nucleoprotein structures found at the end of chromosomes protecting the genome from instability. this terminal region of chromosomes contains long tracts (several kilobases) of double stranded ttaggg:ccctaa repeats ending with a protrusion of single stranded ttaggg repeats ( - repeats). telomerase is a ribonucleoprotein enzyme that adds ttaggg repeats to the end of dna strands in the telomere regions. human telomerase is a heterodimeric complex composed of (i) telomerase reverse transcriptase (tert), an rna-dependent dna polymerase, (ii) a single-stranded rna template referred to as telomerase rna component (terc) and (iii) dyskerin [dyskeratosis congenita (dkc )], a pseudouridine synthase binding terc through the h/aca motif and stabilizing the telomerase complex ( , ). accumulated experimental data indicate the presence of g s in telomeric dna ( , ( ) ( ) ( ) ( ) ( ). for example, dna , a helicase/nuclease that cleaves g s, is involved in maintaining telomere integrity ( , ). altogether, these observations clearly establish that telomeric g s are crucial structures for regulating telomere maintenance, thus providing a mechanism for controlling cell proliferation ( , ). at the ends of telomeres, if the g-rich overhang is longer than four ttaggg repeats (> nucleotides), it can fold over itself and form secondary structures, including g s (figure a). those structures prevent telomere elongation by the telomerase complex ( ). accordingly, small g -ligands (e.g. telomestatin) and g -binding proteins (e.g. trf ) show anti-proliferative and potential anti-tumor activities through telomere interference ( , ( ) ( ) ( ) ( ).
however, the exact mechanism is more complicated as telomestatin derivatives present telomerase independent activity, most likely through targeting g s involved in tumor growth elsewhere in the genome ( ) ( ) ( ). dna replication is highly regulated, which ensures faithful reproduction of genetic information through each cell cycle. in eukaryotes, this process is initiated at thousands of dna replication origins (oris) distributed along each chromosome ( ). initially, a pre-replicative complex (pre-rc) is assembled during late mitosis and the early g phase; it evolves during the s phase to a pre-initiation complex through initiator protein activity (e.g., cyclin-dependent kinases, cdc , cdc and minichromosome maintenance complex (mcm) proteins). this step involves unwinding the origin and dna polymerase recruitment. a tissue-specific temporal system exists, and only a few origins undergo 'firing' (replication fork elongation). the remaining origins are dormant backups used under stress (e.g., dna damage and collisions with transcription machinery). studies on organisms ranging from drosophila to humans have highlighted that repeated elements are enriched with g residues ( ). in vitro, origin g-rich repeated element (ogre) sequences form g s, and point mutations that affect the stability of these g s also impair origin function. even if a -bp cis-regulatory element is necessary for efficient initiation, g structure formation at oris might be the key to selecting the firing origins ( , ). additionally, g s formed in the lagging strand can lead to the stalling of the replication fork (figure d). on the other hand, formation of g s in the leading strand can displace it from the template and cause polymerase slippage (figure d). such events could contribute to the unusual expansions of g forming mini-satellite sequences ( ). altered gene expression levels are implicated in many human diseases.
g s have been found in the promoter region of many genes and implicated in transcription and protein translation regulation [for a review, see ( ) ]. of particular interest, the p region, which is upstream of the c-myc promoter, is highly sensitive to dnase i and s nucleases (nuclease-hypersensitive element or nhe). this purine-rich sequence can form an intramolecular g ( ) ( ) ( ) ( ), the structure of which has been determined by nuclear magnetic resonance (nmr) ( ). this nheiii region controls up to % of gene transcription activation, and the g acts as a transcriptional repressor element ( , ). because interactions between dna-binding proteins and local dna supercoiling can impact the equilibrium between duplex/single-stranded dna (active transcription) and quadruplex dna (silent), c-myc expression and, ultimately, cell proliferation may be modulated. nucleolin (a nucleolar phosphoprotein) and adar (a z-dna binding/rna editing protein) both bind the c-myc promoter in vivo and stabilize the g structure of the promoter ( , ). due to the high therapeutic potential of genomic g s, studies have also been performed on promoters of other genes implicated in cancer biology. for example, kras is one of the most frequently mutated genes in human cancer. its promoter contains a g-rich nuclease hypersensitive element that is critical for transcription ( ). the polypurine strand forms g structures that bind crucial dna repair proteins (e.g., parp- , ku ) ( ). direct evidence for g formation was also reported for the proximal promoter region of the ret proto-oncogene, and targeting this region with a small molecule represses ret proto-oncogene transcription ( ). many g-rich sequences have been identified in the genome, and they give rise to corresponding g-rich rnas when the sequences are located in transcription units. ultimately, these rnas can fold into quadruplex structures and impact gene expression ( ).
in untranslated regions ( -utrs), g s can act as translation repressors ( ) ( ) ( ). g structures have also been proposed to regulate translation driven by ires domains in the absence of an mrna cap [for a recent review, see ( ) and the references therein]. finally, it has recently been shown that g rnas in coding regions can stimulate ribosomal frameshifts in vitro and in cultured cells, providing a role for g in translational regulation ( , ). a major challenge for future studies is to better understand the origins of genome instability, cancer and severe neurological/neuromuscular defects. repeated sequences, such as polynucleotide repeats (three and above), mini- and mega-satellites, are a source of genome instability through either expansion or contraction ( ). however, dna secondary structures associated with those tandem repeats may be crucial elements in the expansion phenomenon. for progressive myoclonus epilepsy type- (epm ), expansion of the dodecamer sequence (cgcg cg ) within the cystatin b promoter is thought to depend on g formation ( ). in addition to this potential deleterious consequence, g s have also been implicated in development of specific immune responses. in antigen-activated b cells, class switch recombination promotes deletion of a dna fragment (several kilobases) that joins a new constant region to the variable immunoglobulin chain ( ). these switch regions (s regions) are intronic, g-rich and repetitive sequences that form g-loops upon transcription (figure e). in vivo, class switch recombination depends on the activity of aid (a cytidine deaminase) and mutsα (msh /msh , conserved dna repair factors involved in mismatch repair). aid promotes creation of u:g mismatches, and mutsα binds with high affinity the g dna formed upon transcription of the s regions. in parallel, pathogens have also developed countermeasures to evade the immune system. the role of g-quartets in this evasion is related to antigenic variation.
for plasmodium falciparum, g-rich sequences that can form stable g s in vitro have been found in the upstream region of group b var genes ( ) . similarly, a g -forming sequence ( -g tg ttg tg ) is located upstream of the pile expression locus in neisseria gonorrhoeae ( ) . formation of g s is necessary for initiating this pilin antigenic variation through recombination, and interactions with reca could facilitate this specialized reaction. overall, immune evasion and immunoglobulin gene class switch recombination in vertebrates are clearly analogous mechanistically ( ) . searches for g s and elucidation of their function in viral genomes have mainly focused on the human immunodeficiency virus (hiv), which causes acquired immunodeficiency syndrome (aids). currently, ∼ million people are infected with hiv worldwide. with over million new infections and ∼ . million deaths from aids per year, the pandemic continues to spread. even without a vaccine, development of highly active anti-retroviral therapies has allowed people to live with hiv as a chronic disease ( ) . however, viruses within t cells remain fully capable of replicating and infecting other cells if the drug pressure is removed or when resistance emerges. thus, new drugs must be developed to overcome the treatment's genetic barrier. hiv- is an rna virus in the lentivirus genus and is part of the retroviridae family. lentiviruses are single-stranded, positive-sense, enveloped rna viruses. hiv- particles contain two molecules of genomic rna that are converted into double-stranded dna by the viral reverse transcriptase (rt). the resulting viral dna is then imported into the nucleus and insertion into the cellular dna is catalyzed by the virally encoded integrase (in). once the viral dna is integrated, transcription from the viral promoter at the long terminal repeat generates mrnas that encode various viral proteins and genomic rna. 
alternatively, the provirus may become latent, which allows the virus and its host cell to avoid detection by the immune system. the presence of g structures has been highlighted at both rna and dna levels with implications throughout the viral life cycle. retroviral rnas dimerize in the cytoplasm of an infected cell, allowing two copies of the genome to be encapsidated in the newly produced virion ( ) . while a single copy of the genome is sufficient for viral replication, the second copy is also used during reverse transcription, and the viral rt switches multiple times between the two rna molecules ( , ) . the strand transfers are partially responsible for the viral variability through production of recombinant molecules. therefore, understanding the mechanisms that drive dimerization and recombination is essential. dimerization is a two-step process that involves sequences upstream of the splice donor site ( , ) . the sequences involved in initial dimerization and encapsidation partially overlap at the end of the viral genome. one of the sequences is a highly conserved dimer initiation site (dis) that forms a stem loop. a concentration-dependent kissing-loop interaction is initiated from contacts between consecutive guanines ( ); the interaction then spreads to the stems. however, this interaction does not seem sufficiently strong to keep the two copies together during reverse transcription. several studies have identified g-rich sequences that form bi-molecular g structures in the gag region of the hiv- genome, near the dis ( - ). these g -forming sequences colocalize with recombination hot spots and exhibit an increased rate of template switching, which points to a potential role for these structures ( ) . supporting this hypothesis, recombination in the u domain is cation-dependent and is lower in the presence of li + , which is a metal ion that fails to stabilize g s ( ). 
short rna templates from the central region of the hiv- genome contain g-rich sequences near the central polypurine tract (cppt) at the end of the pol gene (in coding sequence); this is a region where one of the two primers used for synthesizing the (−) strand dna is produced during reverse transcription. these sequences can form both intramolecular and dimeric g structures. moreover, reconstituted systems have confirmed that g structures near the cppt facilitate strand transfer and promote template switching by the rt ( ) . interestingly, certain elements from the cppt region are involved in forming the cppt flap, which is a region that plays an important role in nuclear entry of the double-stranded dna ( , ) . taken together, the g-rich regions located at the end of the genome and in the central region are likely maintained in proximity through inter-rna g formation with a crucial role in hiv- replication. retroviral nucleocapsid proteins (ncp) are multifunctional elements encoded in the gag gene. notably, ncp participate in many retroviral cycle steps by remodeling nucleic acid structures to favor thermodynamically stable conformations. they are referred to as nucleic acid chaperones and interact with nucleic acid phosphodiester backbones through electrostatic interactions involving basic residues (especially residues in the n-terminus) [for reviews, see ( , ) ]. moreover, hiv- ncp (ncp ) exhibits sequence-specific binding to runs of gs, ugs or tgs through interactions involving its two zinc fingers (cchc motifs separated by a proline-rich linker). although there is no doubt that ncp tightly binds g sequences, data in the literature show that a hydrophobic interaction engaged by the c-terminal zinc finger of the protein may lead to g stabilization ( ) , while a high concentration of ncp promotes g unfolding ( ) . 
a recent biophysical study used high-speed atomic force microscopy (hs-afm) to investigate the molecular chaperone activity directly and in real time at the single-molecule level ( ) . ncp can efficiently promote bimolecular g formation and is able to anneal the g structures. the g structure is induced by both unprocessed ncp and mature ncp , which indicates that both proteins may participate in genome recognition, recombination, dimerization and packaging. ncp could act through a potential mechanism that involves synaptic g intermediates, as illustrated in figure ( ) . transcription of the viral dna is performed by the cellular rna polymerase ii from the viral promoter located in the -long terminal repeat (ltr) of the proviral genome. the u region (figure a) (tss) and close to the tata box (figure b ). this sequence overlaps with the so-called minimum promoter, which is composed of three sp as well as two nf-kb binding sites and is crucial for transcription initiation. the presence of eight blocks of guanines suggests that this region is a good candidate for g formation. two independent biophysical studies based on dms-mediated foot-printing assays or nmr recently evaluated the ability of this -nt g-rich sequence to form g structures ( , ) . dna fragments spanning this region (corresponding to two or three sp or nf-kb binding sites with oligonucleotides ranging from to bases) were all able to form stable parallel and anti-parallel g topologies with melting temperatures ranging from °c to °c in -mm kcl solutions. the three models are formed with rather long sequences (up to nucleotides), allowing the formation of long loops of to nucleotides (figure d, e and f). interestingly, a g structure was also observed in the nmr study with an anti-parallel g core composed of only two tetrads and an additional watson-crick base pair that stacks on top of the upper tetrad (figure g) . 
notably, these four topologies are mutually exclusive, and forming one of these g s in the promoter will prevent formation of the three alternative conformations. thus, the equilibrium between these forms may play a role in regulating promoter activity. recently, the interaction between the sp protein and a fragment of the hiv- promoter sequence folded into a g was studied ( ) . piekna-przybylska et al. used an affinity-based selection approach with biotinylated g s immobilized on streptavidin-coated magnetic beads. the pull-down experiments followed by western blotting revealed that the sp protein can bind the hiv- promoter sequence when it adopts a g conformation. perrone et al. analyzed the effect of point mutations that disrupt the g structures formed in the promoter ( ) . the wild-type (wt) and mutated ltr promoters were cloned upstream of a firefly gene in a promoter-free plasmid. in hek t cells, the promoter activity of the mutated sequences (unable to form g ) was twice as high as that of the wt ltr. these data suggest that g s act as repressor elements in the transcriptional activation of hiv- . therefore, g structures might be critical for hiv- fitness and represent novel targets for antiviral drug development (see below). a better understanding of the role of g s in regulating the hiv- promoter activity would also shed light on hiv- latency and reactivation mechanisms, from which a new landscape may emerge in clinical research to eradicate hiv from reservoirs ( ) . a g-rich sequence composed of three conserved clusters has been identified in the reading frame of the negative regulatory factor (nef) ( ) . this coding sequence is located at the -end of the genome and overlaps with the -ltr (figure a ). isolated nef g-stretches can form g s in vitro. moreover, their stabilization represses nef expression and decreases viral replication. 
nef has multiple roles during hiv infection and has been implicated in immune evasion by promoting cd and major histocompatibility complex (mhc) molecule down-regulation. importantly, the absence of nef seems to correlate with low viral load and inhibition of disease progression. thus, targeting the g s located in the nef coding sequence is an attractive new therapeutic opportunity. sv is a polyomavirus with a -kb, closed circular and double-stranded dna genome that codes for six proteins and includes a non-coding regulatory region (ncrr). this regulatory region contains the ori and the encapsidation sequence (ses) but also controls the transcription direction (early versus late transcription). notably, six gc boxes (gggcgg) are present in this region, which can form an unusual quadruplex structure containing a c-tetrad stacked between two g-tetrads as determined by nmr ( ) . these repeated sequences are binding motifs for sp and, therefore, play an important role in early transcription. on the other hand, sv genome replication requires the large t-antigen (tag), which is a multifunctional protein that binds the ori, has adenosine triphosphate-dependent helicase activity and interacts with cellular proteins such as p and rb ( ) . interestingly, tag can unwind g dna structures ( , ) ; thus, it might play a crucial role in regulating replication as well as early and late transcription. perylene di-imide derivatives (pdi) stabilize g structures and inhibit both the g and duplex dna helicase activities of tag. hence, pdi provide tools for probing the role of the g helicase activity in sv replication ( , ) and offer new insights into the link between helicases and tumorigenesis ( ) or other human genetic diseases ( ) . the human papillomaviruses (hpv) family consists of more than viruses; approximately half are sexually transmissible and are considered high risk due to their carcinogenic properties. 
these viruses cause one of the most common sexually transmitted infections and induce cervical cancer. notably, hpv and hpv contribute to % of cervical cancer induced by hpv infections. the hpv genome consists of kb, circular, double-stranded dna, and its integration into the infected cell induces dramatic genome instability ( ) . the open reading frames encode six 'early' proteins (e , e , e , e , e and e ) and two 'late' structural proteins (l and l ). the late proteins are the major and minor capsid proteins, respectively, and association with l /l heterodimers forms star-shaped capsomeres. however, l can self-assemble into - nanometer virus-like particles, which can be used in prophylactic hpv vaccines to protect against an initial hpv infection. g-rich regions have been found in hpv genomes, and their potential to fold into a g structure has been described ( ) . g-rich loci that fulfill the criteria for g formation have only been found in eight types of hpv. however, a strong argument for the relevance of g s in hpv biology is that the viral protein e is a helicase that resembles sv tag ( ) . consequently, e may also present a g unwinding activity. for hpv and hpv , potential g -forming sequences are located in the long control region (lcr), which is a regulatory sequence composed of nearly kb, suggesting a potential role in transcription and replication. the presence of g s in the sequence coding for the l protein (hpv ), e (hpv , hpv ) and e (hpv , hpv , hpv ) suggests that g formation may also alter alternative splicing necessary for producing viral proteins from the overlapping open reading frames (orfs). targeting these g s could potentially serve as a basis for novel antiviral therapies. hodgkin's lymphoma, burkitt's lymphoma and nasopharyngeal carcinoma ( ) . in the case of hiv- co-infection, ebv is also associated with hairy leukoplakia and central nervous system lymphoma. 
the virus is ∼ to nm in diameter and is composed of a kb dna double helix that circularizes upon entry into the nucleus and becomes a viral episome. human herpes virus infections are mostly asymptomatic due to latency. the epstein-barr virus encodes a genome maintenance protein (nuclear antigen , ebna ), expressed in all ebv-associated malignancies. ebna binds g-rich sequences at the viral replication origin, recruits the replication complex and is involved in metaphase chromosome attachment, which ensures maintenance throughout mitosis ( ) . thus, formation of secondary structures, such as g s, may play a role in regulating ebv replication. the level of ebna synthesized is tightly controlled; it is sufficiently high to maintain viral infection but sufficiently low to avoid immune recognition by the host's virus-specific t cells. regulation occurs during translation due to secondary structures present in the mrna ( ) . destabilization of the g -forming sequence through antisense oligonucleotide annealing increases the translation rate and, consequently, promotes antigen presentation. on the other hand, stabilization of the g with a g ligand (pyridostatin) decreases ebna synthesis and allows immune evasion. interestingly, g s are similarly observed in mrna for other maintenance genes in the gammaherpesvirus family. these findings suggest that this mode of translational regulation may be more general among proteins that self-regulate synthesis. in addition, one could imagine alternative therapeutic strategies focused on targeting rna structures within viral orfs to interfere with the virus cycle as well as to promote antigen presentation and to stimulate the host immune response. recently, a g ligand called quarfloxin (cx- ) entered phase ii clinical trials. it can suppress myc transcription by inhibiting the interaction between nucleolin and the c-myc g motifs ( ) . 
although cx- failed trials because of bioavailability issues, original anti-cancer applications based on such ligands seem to be underway ( ) . taken together, all these studies suggest that targeting these g s could potentially serve as a basis for novel antiviral therapies. hence, as described for cellular targets ( ) , small ligands that can stabilize the g structure may constitute a potential new class of therapeutic agents to fight viral infections ( , ) [for reviews, see ( , ) ]. recent studies showed that g ligands, such as braco- and tmpyp , inhibited hiv- replication ( , ) . exhaustive viral assays demonstrated that braco- acts at both reverse-transcription and post-integration steps by targeting the viral promoter ( ) . several laboratories are already working to identify g ligands that preferentially interact with the pathogen's g s over cellular g s. this new antiviral strategy presents many advantages: (i) these compounds target viral dnas and rnas, which are the source of the disease; (ii) high conservation of these targets across subspecies suggests that they are important for viruses and (iii) mutations will likely impact viral fitness, limiting the emergence of resistant strains. nucleic acids have already been validated as therapeutics ( ) . the first example is fomivirsen (vitravene®), a phosphorothioate anti-sense oligonucleotide approved from to to treat retinitis caused by cytomegalovirus (cmv) ( ) . the second is pegaptanib (macugen®), used in the clinic from to . this -fluoropyrimidine rna-based aptamer targets vegf to treat neovascular age-related macular degeneration ( ) . however, those two drugs have been withdrawn due to the insufficient benefit associated with their use. regarding g s, certain g oligonucleotides present therapeutic potential, including the thrombin-binding aptamer (tba, g t g tgtg t g ) ( , ) . the most promising g to date is as (agro ), a -base oligonucleotide targeting nucleolin with anti-proliferative properties ( ) . 
as recently completed phase ii clinical trials as an anticancer drug (nct ) with low toxicity, highlighting the high therapeutic potential of g ( ) . for most applications, g -forming molecules have been selected through combinatorial methods (e.g. pegaptanib, anti-tba). two key protocols have been developed. first, polynucleotide arrays facilitated rapid detection of molecules with an affinity for the target. then, the development of selex (systematic evolution of ligands by exponential enrichment) facilitated isolation of oligonucleotide sequences with the capacity to recognize virtually any class of target molecule with high affinity and specificity ( ) . these oligonucleotides, 'aptamers', are emerging as a class of molecules that rival antibodies in both therapeutic and diagnostic applications ( , ) . in this section, we describe a few examples of aptamers specifically selected to interact with viral components and interfere with viral replication. hepatitis a virus (hav) belongs to the picornaviridae family. although there is no specific treatment for hav infection, a vaccine has been produced to protect against initial infection. however, only a limited number of countries recommend vaccination because hav is rarely lethal. the hav genome comprises a . -kb single-stranded (+) rna with a -utr and poly(a) tail. it encodes a single polyprotein, which includes the c protease. this enzyme is crucial for the virus because it cleaves the polyprotein into several capsid proteins and non-structural proteins. additionally, the c protease binds regulatory structural elements at the -utr, which control viral genome replication. thus, the c protease is an attractive target for developing a specific antiviral treatment. recently, hexadeoxyribonucleotides that specifically bind the hav c protease were identified through a hexanucleotide array ( ) . 
in vitro experiments showed that the hexanucleotide gggggt (g t) forms a tetramolecular g , binds the c-terminal domain of hav protease and is a potent protease inhibitor ( ) . every year in winter, influenza viruses cause seasonal respiratory disease epidemics ( ) . three subtypes have been defined (a, b and c), which depend on the surface proteins (hemagglutinin and neuraminidase). sixteen hemagglutinin (h) and nine neuraminidase (n) variants have been discovered, but only h , and as well as n and are commonly found in humans. these viruses belong to the orthomyxoviridae family, which comprises enveloped viruses with a single-stranded negative-sense rna. the genome is to kb but is segmented into multiple ( or ) molecules. each rna is . to . kb and encodes or proteins. the non-structural protein (ns ) is a kda multifunctional protein and participates in protein-protein and protein-rna interactions ( ) . during infection, ns interferes with cellular mrna biology (splicing, maturation and translation). as a result, ns prevents interferon (ifn) production, which leads to inhibition of the host's innate immunity ( ) . after cycles of selex, which employed a -base variable sequence, a g -forming aptamer was selected that presents high affinity for the ns rna binding domain (kd ≈ nm) ( ) . in a cellular context, this g can block ns and restore ifn production, which results in antiviral activity without cellular toxicity. severe acute respiratory syndrome coronavirus (sars-cov) is an enveloped virus with a single-stranded rna genome ∼ kb long. it encodes poly-cistronic orfs, which code for non-structural proteins (nsp). since the outbreak, the three-dimensional structures of several key proteins have been determined (e.g. rdrp, c-like protease). the non-structural protein (nsp ) is one component of the viral replicase complex and contains a domain referred to as sars unique domain (sud), which interacts with g s. 
thus, g s are also relevant for the coronavirus and might be involved either in viral replication or host immune evasion ( ) . on the other hand, using a selex approach with oligonucleotides that harbor a nucleotide variable sequence and following rounds of selection, dna aptamers against the sars-cov helicase were isolated ( ) . these aptamers fall into two distinct classes, g -forming and non-g -forming sequences, as determined by circular dichroism and gel electrophoresis. their inhibitory effect on viral replication is being studied with encouraging results. combinatorial approaches have been used to identify several aptamers that target hiv- . the first rna aptamer isolated using the selex approach was an rna pseudoknot inhibitor of hiv- rt ( ) . using another procedure, surf (synthetic unrandomization of randomized fragments) ( ) , isis was selected and exhibited sub-micromolar inhibition of hiv- . isis is a phosphorothioate-containing dna octamer (t g t ) that forms a g with anti-hiv properties. more specifically, it inhibited cell-to-cell and virus-to-cell spread of hiv- by interacting with the v loop located on the viral glycoprotein gp ( ) ( ) ( ) . its g structure and phosphorothioate backbone were reported as essential to this inhibition. later studies showed that phosphodiester oligonucleotides containing only g and t inhibit hiv- replication and the most potent molecule, gtg tg tg tg t (t , ec at ∼ nm), formed a g in vitro ( ) ( ) ( ) . this oligonucleotide and related molecules, such as t (g tg tg tg t, figure j ), are potent inhibitors of hiv- integrase (in) in vitro. t was the first in inhibitor tested in clinical trials (zintevir™ developed by aronex pharmaceuticals in ) ( ) . however, the action mechanism in cells is more complex because it also targets viral entry ( ) . in attempts to obtain natural-type oligonucleotides, hotoda et al. identified a hexamer (tg ag) also targeting hiv- entry through gp binding ( ) . 
this 'hotoda's sequence' adopts a tetramolecular g structure and submicromolar hiv- inhibition was described for derivatives with -end substitutions. once more, the antiviral activity of the molecule was directly linked to its capability to form g s. later, selex approaches were developed to isolate dna aptamers with high affinity for the rnase h domain of hiv- rt. the target protein used was either the isolated rnase h domain (p ) or the functional heterodimer p /p with a counter-selection that used the p /p form (lacking the rnase h domain) ( , ) . interestingly, these selections mainly led to aptamers with g-rich sequences capable of forming g s. some, but not all, of these g s inhibited the rnase h activity of hiv- rt in vitro with an ic in the nm range ( ) . surprisingly, these g aptamers ( del and del) were also potent integrase (in) inhibitors in vitro with ic values in the range - nm ( ) . this dual inhibition can be explained by the structural similarities between the in active site and rt rnase h domain ( ) . the aptamer del can form an original dimeric interlocked g (figure i) , which is stable even at temperatures over °c ( ) . a comparison of the structures of the g -forming aptamers that inhibit in suggests that in inhibition likely requires a stack of g-tetrads ( ) . similar to t , del is a potent antiviral agent with multimodal inhibition and, in cells, targets the viral entry step, reverse transcription and integration ( ) . additional in vitro studies indicated that free del and t can enter human cells, including epithelial (hela), hepatic (huh ) and lymphocyte (h ) cells ( ) . however, striking differences were observed in the presence of viral particles; hiv- strongly stimulates cellular uptake of aptamer ( ) . this latter observation opens an opportunity for specific drug delivery to cells that are infected, which may prevent intracellular side effects from g off-targeting. nucleic acids research, , vol. , no. 
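the select-and-amplify logic shared by the selex experiments described in this section can be sketched as a toy simulation. everything in it is an assumption made for illustration: the scoring function toy_affinity (fraction of guanines) is an arbitrary stand-in for binding to a target and does not model any real protein, and the pool size, survivor fraction and mutation rate are hypothetical parameters. real selex partitions bound from unbound molecules experimentally and may add counter-selection, both of which this sketch omits.

```python
import random

BASES = "ACGT"


def toy_affinity(seq):
    """Hypothetical binding score: fraction of G in the sequence."""
    return seq.count("G") / len(seq)


def mean_affinity(pool):
    """Average toy score across the pool."""
    return sum(toy_affinity(s) for s in pool) / len(pool)


def selex_round(pool, rng, keep_fraction=0.1, mutation_rate=0.01):
    """One round: keep the best-scoring sequences, then amplify them."""
    # Selection step: rank by the toy score and keep the top fraction.
    ranked = sorted(pool, key=toy_affinity, reverse=True)
    survivors = ranked[: max(1, int(len(pool) * keep_fraction))]
    # Amplification step: copy survivors back up to the original pool size.
    # Each position is redrawn at random with a small probability, a crude
    # stand-in for copying errors (the redraw may return the same base).
    amplified = []
    while len(amplified) < len(pool):
        parent = rng.choice(survivors)
        child = "".join(
            rng.choice(BASES) if rng.random() < mutation_rate else base
            for base in parent
        )
        amplified.append(child)
    return amplified


def run_selex(pool_size=500, length=20, rounds=5, seed=0):
    """Run several rounds and record the mean toy score after each one."""
    rng = random.Random(seed)
    pool = [
        "".join(rng.choice(BASES) for _ in range(length))
        for _ in range(pool_size)
    ]
    history = [mean_affinity(pool)]
    for _ in range(rounds):
        pool = selex_round(pool, rng)
        history.append(mean_affinity(pool))
    return pool, history


if __name__ == "__main__":
    final_pool, history = run_selex()
    print([round(value, 3) for value in history])
```

the recorded history shows how repeated selection shifts the pool toward whatever the scoring function rewards; with a different, equally hypothetical score, the same loop would enrich a different sequence family.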
given the remarkable physical properties of g s, scientists have engineered innovative tools using g folding to control ribozyme activity and constructed probes for state-of-the-art imaging as well as quantification techniques, among other approaches. as an example, the aforementioned g del was engineered into aptamer beacons to visualize endogenous hiv- reverse transcriptase protein in living cells ( ) . the following section describes two such innovations related to viruses. hdv is a small, enveloped virus that depends on hepatitis b virus (hbv) for propagation ( ) . its genome is a circular, single-stranded (−) rna molecule of nucleotides. however, due to self-complementarity over ∼ % of its length, it folds into a partially double-stranded molecule (rod-like rna structure). because hdv does not encode a polymerase, replication of its genome relies on cellular enzymes. upon polymerization following the 'double rolling circle' mechanism, genomic and anti-genomic circular rna is produced with linear polyadenylated mrna that only codes for the hdv protein, delta antigen (hdag). however, hdv also encompasses an nucleotide ribozyme with self-cleavage activity. molecular engineering led to development of a 'g-quartzyme', which is a ribozyme controlled by a g structure ( , ) . stabilization of the g at the end by monovalent ions (k + ) activates the ribozyme. cauliflower mosaic virus (camv) is a pararetrovirus that infects plants ( ) . its genome is an -kb, circular, double-stranded dna molecule produced through reverse transcription of a pre-genomic mrna. therefore, transcription of the genome is a crucial step for two reasons: (i) it is necessary to code for several viral proteins and (ii) it generates s rna that is slightly longer than the genome (terminally redundant), which serves as a template to replicate the dna genome. 
because the unique camv promoter, the s promoter, is a strong and constitutive promoter, it has been widely used to develop genetically modified organisms (gmos). consequently, there is a growing need to detect gmos in the food industry, and many methods have been developed based on elisa, polymerase chain reaction and other techniques. notably, a fluorescent assay has been engineered based on g formation. it involves two specific primers that recognize the s promoter. each primer is linked to a double repeat of the human telomeric sequence ttaggg. as a consequence, binding the two primers facilitates inter-strand quadruplex formation. upon adding berberine (a selective g -binder), a strong fluorescent signal is produced ( ) , which facilitates promoter detection. the x-ray structure of g-tetrads was determined in the early s; in the s, g s remained an intriguing deviation from the watson-crick canonical structure. however, a remarkable quantity of information has been obtained over the past two decades, which has allowed this field to grow from basic science to clinical application ( , ) . thus, the presence of g s in telomeres and oncogenic promoters has opened broad opportunities for understanding and generating new treatments against cancer. the viral world is extensive and includes viruses with replication schemes based on rna (e.g., flaviviruses), dna (e.g., papillomavirus) or intermediates thereof (e.g., retroviruses) and g s are part of their life cycle. first, they must address the presence of g s in their hosts, and second, they must also contain g -forming sequences that regulate important replication steps. several challenges remain to better define the features that provide specific g motifs with the ability to function as structural elements. nevertheless, g sequences and g -binders have been identified in some of the most pathogenic viruses; thus, they are attractive targets for controlling viral infections. pdb ids: kf , lpw, m p, y d and le . 
untersuchungenüber die guanylsäure helix formation by guanylic acid the structure of helical -guanosine monophosphate poly(inosinic acid) helices: essential chelation of alkali metal ions in the axial channel guanine quartets telomeric dna dimerizes by formation of guanine tetrads between hairpin loops a phase diagram for sodium and potassium ion control of polymorphism in telomeric dna a k cation-induced conformational switch within a loop spanning segment of a dna quadruplex containing g-g-g-c repeats physicochemical properties of nucleosides . gel formation by -bromoguanosine physico-chemical properties of nucleosides. . gel formation by quanosine and its analogues nucleoside conformations. x. an x-ray fiber diffraction study of the gels of guanine nucleosides nucleoside conformations. xi. solvent effects on optical properties of guanosine and its derivatives in dilute solutions nucleoside conformations. . an infrared study of the polymorphism of guanine nucleosides in the solid state nucleoside conformations. xiii. 
key: cord- - ts zpdb authors: ruggiero, emanuela; richter, sara n title: g-quadruplexes and g-quadruplex ligands: targets and tools in antiviral therapy date: - - journal: nucleic acids res doi: . /nar/gky sha: doc_id: cord_uid: ts zpdb g-quadruplexes (g s) are non-canonical nucleic acids secondary structures that form within guanine-rich strands of regulatory genomic regions. g s have been extensively described in the human genome, especially in telomeres and oncogene promoters; in recent years the presence of g s in viruses has attracted increasing interest. indeed, g s have been reported in several viruses, including those involved in recent epidemics, such as the zika and ebola viruses. viral g s are usually located in regulatory regions of the genome and implicated in the control of key viral processes; in some cases, they have been involved also in viral latency. in this context, g ligands have been developed and tested both as tools to study the complexity of g -mediated mechanisms in the viral life cycle, and as therapeutic agents. in general, g ligands showed promising antiviral activity, with g -mediated mechanisms of action both at the genome and transcript level. this review aims to provide an updated close-up of the literature on g s in viruses. 
the current state of the art of g ligands in antiviral research is also reported, with particular focus on the structural and physicochemical requirements for optimal biological activity. the achievements and the to-dos in the field are discussed. g-quadruplexes (g s) are nucleic acid secondary structures that can form within dna ( ) or rna ( ) guanine (g)-rich strands, when two or more g-tetrads stack on top of each other and coordinate monovalent cations, such as k+ and na+ . each tetrad is composed of four g residues that are linked by the sugar-phosphate backbone and connected through hoogsteen-type hydrogen bonds. g s are highly polymorphic structures whose topology can be influenced by variations in strand stoichiometry and polarity, as well as by the nature and length of loops and their location in the sequence. g s can fold intramolecularly from a single g-rich strand, or intermolecularly through dimerization or tetramerization of separate filaments: research on biologically relevant g s has mainly focused on monomolecular g s ( , ) ; however, intermolecular g s are gaining increasing attention ( ) ( ) ( ) . strand orientation defines the parallel, antiparallel or mixed topology of g s, which is directly correlated to the conformational state, anti or syn, of the glycosidic bond between the g base and the sugar ( ). the anti conformation characterizes a parallel folding, while antiparallel g s are found to adopt both syn and anti orientations ( ) (figure ). while rna g s are mostly locked in a parallel conformation due to the -hydroxyl group in the sugar which exclusively allows the anti orientation ( ), dna g s are in principle characterized by higher topological diversity, even though the majority of dna g s examined so far adopt the parallel topology. computational analysis using different algorithms ( , ) indicated that and up to around potential g -forming sequences may form in the human genome, correlated with specific gene functions ( ) . 
these data have been corroborated by 'g -seq' high-throughput sequencing method, which identified about g s ( ) . however, mapping of g s in chromatin by g chip-sequencing with an anti-g antibody ( ) or footprinting ( ) retrieved only about g s in highly transcribed regulatory nucleosome-depleted chromatin regions. these data indicate that g s are mostly suppressed in chromatin and that, in turn, they may influence the occupancy and positioning of nucleosomes. in general, g sequences are non-randomly distributed but mainly clustered in pivotal genomic regions, namely telomeres, gene promoters and dna replication origins ( ) . moreover, putative g -forming sequences have been found in coding and non-coding regions of the human transcriptome, i.e. open reading frames and untranslated regions (utrs), and in the telomeric repeat-containing rna ( ) . this evidence suggests that g s are likely involved in the regulation of different biological pathways such as replication, transcription, translation and genome instability. in the past years, the resolution of g structures ( ) ( ) ( ) and the employment of novel visualization approaches ( ) ( ) ( ) helped researchers to validate the previous computational predictions, disclosing new aspects of the multifaceted g s world, e.g. the effective occurrence of g s within patient-derived cancer tissues ( ) or the key role in the pathogenesis of two incurable neurodegenerative diseases, amyotrophic lateral sclerosis and frontotemporal dementia ( ) . indeed, the presence of g s in the human genome and their potential in disease modulation have been extensively investigated, resulting in many good and exhaustive reviews focused on g structures ( , , , ) and their biological role, particularly in telomeres ( ) ( ) ( ) ( ) and oncogene promoters ( ) ( ) ( ) ( ) ( ) ( ) . 
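the computational predictions of putative g-quadruplex-forming sequences mentioned above are commonly based on a simple sequence motif: four runs of guanines separated by short loops. the python sketch below illustrates that idea only; it is not the algorithm of any specific tool cited here. the thresholds (runs of at least three guanines, loops of one to seven nucleotides) and the names `find_putative_g4` and `reverse_complement` are assumptions made for illustration, and a motif hit says nothing about whether the structure actually folds in cells.

```python
import re

# Illustrative thresholds (assumptions, not taken from the review):
G_RUN_MIN = 3   # minimum guanines per G-run
LOOP_MAX = 7    # maximum loop length between G-runs

# Four G-runs separated by three loops of 1..LOOP_MAX nucleotides.
PATTERN = re.compile(
    r"(?:G{%d,}[ACGTU]{1,%d}){3}G{%d,}" % (G_RUN_MIN, LOOP_MAX, G_RUN_MIN)
)

# Complement table; U is treated like T, so the output uses DNA letters.
_COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence (DNA letters)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_putative_g4(seq):
    """Scan both strands for non-overlapping motif hits.

    Returns a list of (strand, start, matched_sequence) tuples, where
    start is the 0-based position on the given (plus) strand and the
    matched sequence is written 5'->3' on the strand carrying the G-runs.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [("+", m.start(), m.group()) for m in PATTERN.finditer(seq)]
    # G-runs on the opposite strand appear as C-runs on the given strand,
    # so search the reverse complement and map coordinates back.
    for m in PATTERN.finditer(reverse_complement(seq)):
        hits.append(("-", n - m.end(), m.group()))
    return hits


if __name__ == "__main__":
    telomeric = "TTAGGGTTAGGGTTAGGGTTAGGG"  # human telomeric repeat, 4 copies
    for strand, start, motif in find_putative_g4(telomeric):
        print(strand, start, motif)
```

scanning both strands matters because a g-rich run on the complementary strand shows up as a c-rich run in the sequence as written. dividing the number of hits by the genome length would give a rough density of putative sites, which is one way such genome-wide comparisons can be summarized.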
besides humans, putative g -forming sequences have been found in other mammalian genomes ( ) , yeasts ( ), protozoa ( ) , bacteria ( , ) and viruses, therefore implicating g s in many human infectious diseases. one review has been published in on the possible role of g s in the antigenic variation systems of bacteria and protozoa and silencing of two viruses ( ) . the possible role of g s in viruses and the use of g -forming oligonucleotides as antiviral agents have been discussed in ( ) . 
figure (general simplified virus life cycle): the virus recognizes and binds the host cell surface receptors (step ) to enter the cell (step ). after penetration, the viral genome is uncoated (step ) and its dna or rna nature determines where and how the genome is replicated (step ): most dna viruses replicate in the cell nucleus, while the majority of rna viruses replicate in the cytoplasm of infected cells. after viral mrna production, viral proteins are expressed in the cytoplasm (steps - ). the newly synthesized viral genomes and proteins are then assembled into new virions (step ), which are released outside the cell (step ). 
since the number of reports describing the presence of g s in virus genomes has boomed in the past years and treatment with several g ligands has shown potentially interesting therapeutic activity, we here aim at presenting, organizing and discussing an up-to-date close-up of the literature on g s in viruses and the classes of molecules that have shown antiviral activity by viral g targeting. in particular, we first focus on the presence and proposed function of g s in virus genomes. next, we present the classes of g ligands that have reported successful antiviral activity, with special emphasis on the structural and physicochemical properties that characterize the viral g /g ligand interaction. a general simplified virus life cycle is schematically depicted in figure ; a summary of the viruses in which g s have been reported and of the corresponding g s is shown in figure . 
since the use of g -forming oligonucleotides as antiviral agents has been more recently addressed by musumeci et al. ( , ) , this topic has not been considered in the present review. 
figure (viruses in which g s have been reported): for each virus the following information is shown: virion structure and dimension, genome size and organization; schematic representation of the g (red dots) location in the viral genomes or in the mrna and g binding proteins; number of g s assessed through bioinformatics analysis, according to the corresponding references; g ligands reported to date to display antiviral effect and corresponding references. 
the human immunodeficiency virus (hiv) is the etiological agent of the acquired immune deficiency syndrome (aids), which to date affects more than million people worldwide. although the current anti-retroviral therapy keeps the disease progression under control, people still die from hiv-related causes; it is therefore necessary to find alternative and effective antiviral targets. the hiv belongs to the retroviridae family; the single-stranded rna genome is processed by the viral reverse transcriptase and the newly formed double-stranded dna is integrated into the host cell chromosomes to form the proviral genome, from which viral mrnas and new genomes are transcribed. the research of g s in the hiv- genome has been quite productive, concerning not only the two rna viral genome copies, but also the integrated proviral genome, specifically in the long terminal repeat (ltr) promoter region ( ) ( ) ( ) and in the nef coding region ( ) , as properly reviewed by metifiot et al. ( ) . briefly, the ltr promoter is characterized by a highly conserved g-rich sequence in the u region, corresponding to sp and nf-κb binding sites, where three mutually exclusive g structures can form, i.e. ltr-ii, ltr-iii and ltr-iv ( ) . ltr-iv is a parallel g with a bulge at its end, as ascertained by nuclear magnetic resonance (nmr) characterization ( ) . 
ltr-iii and ltr-iv exert opposite effects on ltr promoter activity, which is silenced when ltr-iii is folded and enhanced by ltr-iv stabilization ( ) . in addition, the ltr g region is under the control of two nuclear proteins: nucleolin, which upon binding increases ltr g stability and thus silences transcription ( ) and the human ribonucleoprotein (hnrnp) a /b , which unwinds the ltr region, decreasing its promoter activity ( ) : these data suggest that the balance between g s acts as a regulatory mechanism in hiv- promoter activity. interestingly, g -forming sequences are present in the ltr promoter of all primate lentiviruses and display binding sites for transcription factors that are related to g regulation ( , ) , supporting a role for g s as crucial control elements for viral transcription, conserved throughout evolution ( ) . g s were also evidenced in the u region of the hiv- rna genome, where multiple highly stable parallel g s can form ( ) . rna sequences can dimerize through an intermolecular g interaction ( ) , suggesting that the u region could represent an additional point of contact between the two viral genome copies. additionally, such rna g s likely contribute to the observed increased genetic recombination rate in the u ( ) . nef, a viral accessory protein, is an essential factor in proviral dna synthesis ( ) and in the establishment of a persistent state of infection ( ) . its coding region is located at the -end of the viral genome and partially overlaps with the -ltr. three g sequences have been identified in the most conserved region of the gene ( ). g-rich sequences able to form g s were reported in the hiv central dna flap overlapping positive-strand and were found to protect the pre-integrated genome from nuclease degradation ( ) . 
stabilization of hiv g s by small molecules showed antiviral effects at different levels: g ligand binding to dna ltr g s decreased viral transcription, while binding to rna ltr g s inhibited the reverse transcription process, leading in both cases to strong antiviral effects ( , , ) . g ligand-mediated stabilization of the nef g s induced nef-dependent antiviral activity ( ) . very recently, g stabilizing agents were also employed in cells infected with latent hiv- , where their activity resulted in a strong antiviral effect, especially in combination with a dna repair inhibitor, revealing new aspects of hiv- latent infection ( ) . the specific molecules that were used as anti-hiv- agents are discussed in the 'antiviral g ligands' section of this review. herpesviridae is a large family of viruses with long linear double-stranded dna genomes. among the nine herpesvirus species that can infect humans, at least five are extremely widespread, i.e. herpes simplex virus and (hsv- and hsv- ), varicella zoster virus, epstein-barr virus (ebv) and cytomegalovirus, which cause orolabial and genital herpes ( ) , chickenpox and shingles ( ), mononucleosis ( ) and some cancers ( ) . more than % of adults have been infected with at least one of these ( ) . herpesviruses also tend to display latent, recurring infections, with the virus remaining in some part of the infected organism and typically maintaining its genome as extrachromosomal nuclear episome ( ) . recent genome-wide bioinformatics analysis revealed an impressively high density of putative g -forming sequences in all herpesvirus species ( ) . indeed, the presence of g s has been experimentally reported for hsv- , ebv, kaposi's sarcoma associated herpesvirus (kshv) and human herpesvirus (hhv- ). hsv- establishes life-long persistent infections with a viral lifecycle that involves latency and reactivation/lytic replication. 
more than half of the world population suffers from hsv infections, the outcome of which may become severe in immunocompromised patients. anti-hsv- therapy can be very effective; however, the emergence of drug-resistant viral strains urges the discovery of anti-herpetic drugs with innovative mechanisms of action. the hsv- genome, characterized by % gc-content, was found to contain numerous and highly stable g -forming sequences that are mainly located in the repeated regions ( ) . these hsv- g s, visualized through a g -specific antibody in infected cells at different time points post-infection, were shown to form in a virus cycle-dependent fashion: viral g s form massively in the cell nucleus during viral replication, and localize in different cell compartments according to the viral genome movements ( ) . ebv is associated not only with the well-known infectious mononucleosis, but also with a wider spectrum of illnesses, including several lymphoid malignancies. studies on the presence and role of g s in ebv proved that the genome maintenance protein ebv-encoded nuclear antigen- (ebna ) stimulates viral dna replication by recruiting the cellular origin recognition complex through an interaction with rna g s ( ) . the ebna mrna itself is rich in g clusters able to fold into parallel g s, which behave as cis-acting regulators of viral mrna translation, producing ribosome dissociation. g s in ebna mrna have been shown to modulate the endogenous presentation of ebna -specific cd + t-cell epitopes, which are involved in persistent infections ( ) . the cellular protein nucleolin counteracts this mechanism by interacting with ebna mrna g s and thus downregulating ebna protein expression and antigen presentation ( , ) . g s can also be observed in the mrnas of other genome maintenance proteins that are known to regulate their self-synthesis, suggesting that g s are exploited as structural regulatory elements by the virus ( ) . 
kshv is the etiological agent of all forms of kaposi's sarcoma and other numerous lymphoproliferative disorders, which mostly concern aids patients, and at the moment, no treatments for the lytic or latent infections are available ( ) . the kshv genome is organized in a kb long unique region, flanked by the terminal repeats, which are rich in g residues and able to form stable g s, both in the forward and reverse strands ( ) . hhv- is a ubiquitous virus that infects almost % of the human population. the diseases associated with hhv- include the febrile illness roseola infantum, also known as the sixth childhood eruptive disease ( ) . reactivation of hhv- in immunosuppressed individuals is associated with adverse clinical outcomes, comprising life-threatening encephalitis or graft rejection in transplant patients ( ) . the hhv- genome presents telomeric regions at its termini, which can integrate into the telomeres of human chromosomes: integration is considered one possible mode of latency ( ) . since telomeres can fold into g s, these structures may be involved in the mechanism of hhv- integration. indeed, stabilization of telomeric g s by a g ligand inhibited hhv- chromosomal integration ( ) . stabilization of herpesvirus g s by g ligands led to antiviral activity. in hsv- , inhibition of dna replication and reduction of late viral transcripts were observed ( , ) . in ebv, a g ligand inhibited ebna -dependent stimulation of viral dna replication ( ) and ebna synthesis ( ) . in contrast, another g ligand reduced nucleolin binding to ebna mrna ( ) , which in turn resulted in enhanced ebna synthesis and antigen presentation ( , ) . treatment of latently infected cells with g stabilizing compounds proved to negatively regulate viral replication, leading to a reduction in the kshv genome copies ( ) . g ligands used against herpesviruses are discussed in the 'antiviral g ligands' section of this review. dna viruses. 
the human papillomavirus (hpv) is a double-stranded dna virus that can cause skin and genital warts and some types of cancer. its genome displays several g-rich sequences: stable g s form in only eight out of identified hpv types; however, the g -forming hpvs include some of the highest-risk hpv types, responsible for the majority of cases of cervical cancer ( , ) . the hepatitis b virus (hbv) is a partially double-stranded dna virus, the best known member of the hepadnaviridae family. it causes the hepatitis b disease, which may lead to cirrhosis and hepatocellular carcinoma. a single putative g -forming sequence was discovered in the promoter region of the pres /s gene in hbv genotype b and was found to fold into an intramolecular hybrid g structure. surprisingly, the g acted as a positive regulator of hbv transcription, as revealed by luciferase reporter assays ( ) . adeno-associated viruses (aav) are single-stranded dna viruses of the parvoviridae family. aav are not currently linked to human diseases and have been used as delivery vectors for gene therapy. a recent study reported the presence of g s in the aav genome. the dna binding protein nucleophosmin (npm ), which is known to enhance aav infectivity, directly interacts with g s: putative g s were identified, located within the inverted terminal repeat region ( ) . amongst rna viruses, putative g sequences have been identified in three positive and single-stranded ones, namely the severe acute respiratory syndrome coronavirus (sars-cov), the hepatitis c virus (hcv) and the zika virus (zikv). the sars-cov belongs to the family of coronaviridae; its genome is about . kb, which is one of the largest among rna viruses. it has been identified after a massive outbreak in and is considered one of the most pathogenic coronaviruses in humans. 
within the nonstructural protein , the so-called sars unique domain (sud), which plays an essential role in viral replication and transcription, was found to preferentially bind g -forming oligonucleotides ( , ) . these may be found in the -nontranslated regions of mrnas coding for host-cell proteins involved in apoptosis or signal transduction; therefore, it has been proposed that sud/g interaction may be involved in controlling the host cell's response to the viral infection. the hcv belongs to the flaviviridae family; it can cause both acute and chronic hepatitis, possibly leading to cirrhosis and liver cancer. bioinformatics and biophysical analysis demonstrated the existence of two highly conserved g sequences in the c gene of hcv ( ) . the zikv is also included in the family of flaviviridae. it is transmitted to humans by mosquito bites; while in an adult it may cause mild symptoms or even be symptomless, it may be devastating in a pregnant woman as it causes microcephaly in the unborn child. several g sequences were discovered in the positive strand of the zikv genome: of these are conserved within more than flavivirus genomes, suggesting an important role in the life cycle of these viruses. furthermore, zikv presents an additional g in the unique -utr region, crucial for initial viral replication of the negative-sense strand ( ) . finally, g s have been investigated in the ebola virus (ebov) and marburg virus (marv), two negative and single-stranded rna viruses belonging to the filoviridae family. these are deadly pathogens that cause haemorrhagic fever in humans and primates ( ) . the presence of g sequences in the negative strand of ebov and marv was assessed by a fluorescent probe ( ) . both zikv and ebov went through massive outbreaks in the past three years, which makes them two of the most dangerous agents of viral epidemics of the current decade. in figure , all the viruses in which g s have been investigated are displayed. 
the stabilizing g ligands tested in some of these viruses are thoroughly described in the section below.
in the past few years much effort has been directed toward the design of small molecules able to target g s, leading to very promising potential therapeutics, especially against cancer. several updated reviews describe the use of g ligands that target telomeres and oncogenes to treat cancer ( , , ( ) ( ) ( ) . despite the considerable achievements in antiviral research, viral infections still represent a major global threat for human health, causing significant morbidity and mortality. the recurrent onset of drug-resistant pathogens, combined with the fact that the majority of viruses still lack a specific vaccine, urges the development of novel therapeutic approaches for the management of viral diseases. to this end, g ligands provide both compounds with an innovative mechanism of action in antiviral treatment and valuable tools to better understand virus mechanisms. in the section below, g ligands reported to exert antiviral activity have been grouped based on the chemical nature of their core. a description of their discovery, general g binding activity and biological effects in cells is initially provided. antiviral properties, activity and selectivity are then discussed.
the n,n'-( -(( -(dimethylamino)phenyl)amino)acridine- , -diyl)bis( -(pyrrolidin- -yl)propan-amide), labeled braco- (b ) ( , figure ), is to date one of the most studied g ligands. it is the outcome of a complex and thorough medicinal chemistry investigation that started with the introduction of an acridine moiety as a new chromophore in the search for g binders. read and colleagues demonstrated that the acridine core was more active than the previously developed anthraquinone core ( , ) , because of the presence of a nitrogen atom in the heterocyclic scaffold that could be protonated at physiological conditions.
as a result, the electron deficiency in the chromophore was increased, with consequent enhancement of the g interaction ( ) . in-depth structure-activity relationship (sar) analysis supported by molecular modeling techniques next led to the development of bi- and tri-substituted derivatives ( , ) . these classes of compounds are characterized by a central planar pharmacophore that binds g-tetrads through π-π interactions ( figure a ). additionally, two side chains functionalized with a tertiary amine moiety are needed to interact with the grooves: the amine group is crucial for activity since it is protonated at physiological ph, while it disrupts the g when substituted with bulky residues ( ). the , , -trisubstituted acridines emerged as the most potent compounds among all the possible regioisomeric series that have been evaluated: they proved to act as g -mediated telomerase inhibitors. b showed telomerase inhibition at nanomolar concentration, with higher affinity for g than for duplex dna, and lower cytotoxicity when compared to first-generation acridines. it induced long-term growth arrest and replicative senescence in the nt breast carcinoma cell line and was the first g ligand to prove anticancer activity in vivo, against human tumor xenograft models ( , ) .
the use of b in a viral environment was first analyzed in ebv, to investigate the functional and biochemical characteristics of ebna . results showed that b stabilized the viral rna g and, during infection, was able to reduce ebv genome copy numbers in raji cells. it was also found to induce a modest reduction of the transcription levels of ebna and ebna a and inhibition of ebna -dependent dna replication. these data indicate that g -interacting molecules can block functions of ebna that are critical for viral dna replication ( ) .
in the ltr promoter region of the hiv- proviral genome, b was able to significantly stabilize the naturally occurring g s, ltr-ii and ltr-iii, and to induce an additional g , ltr-iv.
in the presence of increasing concentrations of b , ltr promoter activity was decreased by almost % with respect to the untreated control, while no activity was detected in a mutated sequence unable to fold into g s ( ) . these results confirmed a g -mediated mechanism of action. the anti-hiv- activity of b (ic < . m) was tested in various cell lines and against different viral strains, and was demonstrated to be g -mediated. since g structures also formed in the pre-integration viral rna ( ), a dual mode of action at both the pre- and post-integration level was proposed ( figure ) . b antiviral activity was tested and confirmed in latent hiv- infected cells, where the acridine was able to reduce the viral titer to undetectable levels, also upon long-term treatment ( ) .
b exerted its g -stabilizing activity also in the hsv- genome, where multiple g s can form. treatment with b led to a significant antiviral effect (ic = m), with reduction in viral dna synthesis and late protein production ( ) . moreover, b was used in hhv- a infected cells to evaluate the ability of g ligands to impair viral integration in the telomeric region, through stabilization of telomeric g s. interestingly, in telomerase-expressing cell lines, the frequency of chromosomal integration was reduced by up to % upon treatment. however, the effects of g ligands on hhv- replication and gene expression are yet to be discovered ( ) . recently, b was employed in a luciferase reporter assay to analyze the role of g s in hbv, where it enhanced promoter activity, suggesting a positive regulatory role of g s in hbv transcription ( ) .
despite its good solubility in aqueous solutions and strong g binding, poor permeability across biological barriers, which characterizes most g ligands, restricts the pharmacological application of b ( ) . nonetheless, b is still considered a reference compound in g research.
the cationic porphyrin compound , , , -tetrakis-(nmethyl- -pyridyl)porphine (tmpyp , , figure ) was proposed as a g binder because of its suitable physical properties, such as molecular size, planar core, positive charges and hydrophobicity, favorable for stacking with the g tetrads ( ) ( figure b ). biophysical analysis demonstrated that tmpyp was actually able to stack and stabilize both parallel and antiparallel g s, with mild selectivity for quadruplex over duplex dna ( ) . since then, it has been widely employed as a tool to study g s, especially because of the availability of a negative control compound, tmpyp ( , figure ) , which is a structural isomer with n-methyl- -pyridyl residues on the porphine core. intriguingly, tmpyp is sterically hindered from external stacking on the g with respect to tmpyp , producing no biological effects ( , ) . in biological assays, tmpyp was shown to inhibit human telomerase (ic = . ± . m) ( ) and downregulate the expression of the proto-oncogene c-myc as well as several c-myc-regulated genes containing g -forming sequences. such modulation resulted in in vivo antitumor activity in different models where the porphyrin was able to decrease tumor growth and prolong survival ( ) .
in viruses, tmpyp was shown to stabilize g s in the hiv- nef coding region and to induce their formation within the double-helix conformation. interestingly, in the tzm-bl reporter cell line, which supports nef-dependent hiv- replication, the porphyrin inhibited viral infectivity in a dose-dependent manner ( ) . in addition, tmpyp administration was able to block viral replication in two different jurkat-derived t-cell lines with established hiv- latency. bambara's research group demonstrated that the antiviral activity was coupled with an increased rate of apoptosis/death when compared to untreated cells, and that this effect was enhanced by association with dna damage repair inhibitors ( ) .
in hcv, tmpyp was found to stabilize rna g s and inhibit hcv c gene expression through a g -mediated mechanism of action confirmed by an enhanced green fluorescent protein reporter gene system. in addition, in an infectious hcv culture system, administration of the porphyrin led to a dose-dependent decrease of viral rna levels ( ) . tmpyp was also employed to investigate the role of g s in the ebov l gene. it exerted high stabilization of the target g rna in circular dichroism and rna stop assays. more importantly, after treatment with increasing concentrations of the compound, transcription of the l gene was gradually reduced. to confirm target selectivity, a mutant non-g -forming sequence was used as a negative control, where tmpyp did not produce significant inhibition of transcription. in addition, the porphyrin was found to inhibit replication of the ebov mini-genome, a cell-based approach that uses firefly luciferase as reporter protein and thus can be used as an efficient antiviral screening system ( ) . it is worth noting that the low selectivity of tmpyp towards g structures versus duplex dna ( ) suggests that its antiviral activity may arise from multiple mechanisms of action, which limits its biological and clinical application.
perylenes represent a well-known family of g ligands, containing a differently substituted, large fused aromatic ring system: they are characterized by a hydrophobic heptacyclic central core, which is responsible for the binding to g quartets through π-π interactions, and by up to four protonated side chains. accurate sar studies on this scaffold pointed out two crucial features for g binding: the basicity of the system, which prevents the compound from self-aggregation, and the distance between the aromatic central core and the quaternized nitrogen residue in the side chain, which modulates ligand solubility and affects g recognition.
the cationic amino moieties in the lateral substituents are thought to regulate specificity for g versus duplex dna ( ) . piper, n,n'-bis[ -( -piperidino) ethyl]- , , , -perylenetetracarboxylic diimide ( , figure ) is the lead compound of this class; it was shown to induce and stabilize g structures in telomeres ( ) , leading to telomere shortening, reduction of cell proliferation and tumorigenicity, and senescence ( ) .
in the effort to improve the physicochemical properties of the perylene scaffold, progressive surface reduction led to the more promising class of naphthalene diimide (ndi) derivatives. indeed, it was demonstrated that the dimensions of the planar core modulate the ability of this class of compounds to recognize different dna conformations. in particular, in the cyclic condensed system at least four rings are required to efficiently target g s ( ) . in addition, the ndi planar core can accommodate up to four side chains to enhance g affinity. these compounds were found to inhibit telomerase activity in the low micromolar range and to produce short-term cell growth inhibition against mcf- and a cancer cell lines ( ) . to improve dna g alkylating properties, further modifications were introduced on the ndi scaffold, which include quinone methide precursors ( , ) . these ligands revealed both reversible and irreversible binding properties toward telomeric dna, with promising selectivity for quadruplex over duplex dna ( ) , and were found to impair the growth of different telomerase-positive cancer cell lines following telomerase activity inhibition ( ) ( ) ( ) . crystallographic analyses of various ndi-telomere complexes provided a turning point for rational optimization of this class of compounds ( , ) . neidle et al. reported that the tested ligands promoted a parallel g topology, forming a : complex with the oligonucleotide.
this stoichiometry resulted from the combination of binding site affinity and direct groove interactions that are highly influenced by the protonated moiety in the side chains, which interacts with dna phosphates in the grooves. despite their high molecular weight, ndis are highly versatile structures, suitable for further medicinal chemistry modifications to improve their pharmacological profile ( , ) .
in the antiviral field, piper induced and stabilized g structures in the nef coding region of the hiv- genome ( ) . however, the best results were obtained with core-extended ndi derivatives (c-exndis, , figure ). this series of compounds, endowed with exceptional solubility properties, was obtained by fusing the ndi core with a , -dihydroquinoxaline heterocycle. interestingly, the newly developed ligands displayed greater in vitro binding and stabilization activity on viral hiv- ltr g s than on the human telomeric sequence, used as a cellular reference g . most importantly, the c-exndis exhibited very promising antiviral activity in the low nanomolar range (ic < nm) against different strains of hiv- , with very low cytotoxicity, yielding a wide and encouraging therapeutic window. the g -related mechanism of action was demonstrated by combining time-related antiviral and reporter assays, using a non-g -forming ltr-mutant sequence as control. it is plausible that the higher antiviral activity depends on the selectivity toward the viral g s, as, during the infection, ltr and telomeric g s are likely the most abundant species in the cell ( ) .
the most active c-exndi was also analyzed in hsv- infection. in vitro cd and taq-polymerase stop assays indicated that the compound was able to bind and stabilize various g -forming sequences of the hsv- genome. mass spectrometry competition analysis revealed a stronger preference for hiv- g s over hsv- , but generally, viral g s were preferentially bound, when compared to the telomeric g .
indeed, c-exndi showed remarkable antiviral activity (ic = . ± . nm). the anti-herpetic effect was ascribed to inhibition of viral dna replication, as gathered by time-of-addition assay and flow cytometry analysis using acyclovir as reference compound ( ) . since c-exndi selectivity towards hsv- g s in vitro proved good but not outstanding, the marked anti-hsv- activity was likely also due to the massive presence of viral g s in the cell nucleus, which was demonstrated to occur during hsv- replication ( ) .
pyridostatin (pds, , figure ) was rationally designed on the basis of the structural features shared by known g -binding ligands, as it comprises a potentially planar electron-rich aromatic surface and the ability to participate in hydrogen bonding. moreover, the rotatable bonds provide a degree of flexibility that allows pds to adapt to the dynamic nature of g s. pds strongly stabilized telomeric g with no effect on double-stranded dna: as a result, the shelterin complex integrity was altered, triggering a dna-damage response at telomeres ( ) . numerous modifications have been introduced in the pds scaffold to further explore the role of this class in anticancer therapy. indeed, the obtained analogues showed remarkable growth-inhibitory effects in cancer cell lines and a complete arrest after long-term exposure to the drug. these results emphasize the high potential of these compounds to fine-tune their biological activity ( , ) .
in antiviral research, pds has been used to study the role of g s in ebv ebna mrna, where it enhanced the stability of the g -forming sequence, decreasing ebna synthesis level in a concentration-dependent fashion, both in vitro and in vivo. as a consequence, ebv-infected cells were less efficiently recognized by virus-specific t cells, although the mechanism of action still needs to be clarified ( ) . in hbv, pds was used to unravel the positive regulatory role of g s within the pres /s gene promoter ( ) .
a pds analogue, namely pdp ( , figure ) , was employed in hcv g research, along with tmpyp . the pdp-induced stabilization of g structures located in the hcv rna downregulated c gene expression. in vivo, pdp inhibited intracellular replication of different hcv genotypes through a confirmed g -related mechanism of action, resulting in antiviral activity in the low micromolar range ( ) .
bisquinolinium compounds are characterized by an aromatic nucleus substituted with two protonated quinoline moieties. the first reported compounds present a dicarboxamide-pyridine or -triazine ring as central core: the most promising of these ligands were shown to increase g stability in telomeres, with great selectivity over duplex dna ( ) . these compounds are able to adopt an intramolecular syn-syn h-bond, which was proposed to be critical for g recognition, likely because the consequent rigidity of the compound promotes g-quartet overlap. on these bases, the central core was expanded without disrupting the h-bonds, leading to a new disubstituted- , phenanthroline series that displays exceptional selectivity for g s ( ) , due to the crescent-like shape which prevents such compounds from intercalating into duplex dna ( figure c ) ( , ) . phendc ( , figure ), the best representative of this class, is a potent telomeric g ligand able to reduce telomerase processivity ( ) .
phendc was used in kshv to evaluate its potential role in inhibiting latent viral replication. the ligand was found to elicit a stress response in infected bcbl- cells and to stall the replication machinery both in the leading and lagging strands of the kshv genome. furthermore, treatment with phendc resulted in a dramatic reduction ( %) of episome copy number, with no effect on cell growth and proliferation. these data represent the first use of g ligands in targeting latent viral infections ( ) .
phendc was also used in ebv, where it prevented binding of nucleolin to ebna mrna g and increased the endogenous ebna levels in ebv-infected b cells and in cells derived from a nasopharyngeal carcinoma. these results indicate that the nucleolin-ebna mrna interaction can also be targeted by antiviral g -ligands ( ) . a summary of g ligands and the viruses against which they have been tested is reported in figure .
in the last decades, research on the role of g s in the human genome has been quite challenging and promising, leading to the awareness that these higher-order structures play key regulatory roles in biological pathways such as transcription, replication, translation and telomere maintenance. the development of g binders with encouraging anti-cancer activity has prompted researchers to identify new ways to exploit g structures in human diseases, e.g. viral infections. because g s are present both in cell and virus genomes, the challenge in developing antiviral g ligands consists in achieving selectivity for viral versus cellular g s. a major limitation of the g ligands described so far is their large flat aromatic core that stacks on the g tetrad, which reduces the chances to discriminate among different g s. moreover, they are generally characterized by high molecular weights and protonated side chains, which are necessary for loop and groove interactions but may hamper cellular uptake. indeed, because of the low selectivity profile and poor drug-like properties, no g ligand has advanced beyond phase ii in the drug discovery pathway. quarfloxin, a fluoroquinolone derivative compound developed by hurley's research group ( ) , is to date the only g ligand that has reached phase ii clinical trials, but it was withdrawn due to bioavailability-related problems ( ) .
however, several studies in the literature indicate that, in general, a certain degree of selectivity is achievable towards the viral g of interest in comparison to the telomeric g , i.e. the most abundant cellular g ( ) . in the case of hiv- g s and c-exndi compounds, the higher affinity towards the viral structure is likely caused by the extension of the ndi core and thus by the interaction with the viral g loop region, which is unique for this g ( ) . in general, loop and groove regions characterize each g and thus are amenable to selective recognition. structural studies on cellular g /g ligand complexes indicated that most g -binding molecules interact with g s through quasi-external stacking, in which the heteroaromatic chromophore of the small molecules is π-stacked onto the face of an external g-quartet ( ) ( figure ) and onto the side chains positioned in the g grooves ( ) . it is therefore conceivable that the reported antiviral activity of g ligands is mediated by an increased interaction, hence affinity, with the groove/loop moiety of the viral g s. to date, only one viral g structure has been resolved through nmr spectroscopy ( ); therefore, future nmr and crystallographic resolutions of viral g s and g /g ligand complexes are necessary to define the architecture of viral g s. this could help researchers identify possible unique g structures, which could lead to the design and development of selective molecules.
in other cases, g ligands did not show significant selectivity for the viral versus telomeric g s, and the g s present in oncogene promoters were usually strongly bound by the tested compounds ( ) . nonetheless, the data so far presented on the antiviral use of g ligands have shown in general very promising activity against a wide range of virus species. one possible explanation is that the amount of the viral g s in the infected cells largely surpasses that of the cellular g s ( ) .
indeed, cells are usually exploited to function as factories in the production of new viral genomes that are eventually assembled into new mature virus particles (see figure for the viral infection cycle). it is thus conceivable that the viral g s become far more abundant than the cellular g s during virus replication. at least in one case this eventuality has been demonstrated: in hsv- there is a sharp increase in the number of viral g s during viral dna replication ( ) . combining the abundance of g s per genome and the number of new genomes, the amount of viral g s could exceed that of cellular g s by several logs per cell. in addition, the viral g s identified so far are usually key regulatory elements of the virus life cycle, and their stabilization/unfolding by g ligands can likely explain the resulting massive virus inhibition. if this behaviour is also demonstrated in other viruses, it would be possible to exploit g ligands that are not strictly selective for the viral g s. this scenario would greatly and rapidly expand the research and pharmacological application of g ligands as antiviral agents.
a further point to be addressed is the necessity to standardize methods to study the antiviral activity of the g ligands. one starting point should be the detection of the inhibitory activity of the ligand on the virus life cycle. if an effect is obtained, the mechanism of action should be investigated further. in this regard, the time of addition method ( ) can be of assistance, as it indicates the last viral step at which the compound is active and thus narrows the possible molecular targets. however, because of the complexity and uniqueness of each virus, the investigation of the target and mechanism of action at the molecular level may not be straightforward. for example, pds inhibited ebna synthesis in vitro but not in cells, while phendc in cells led to the exact opposite effect, i.e. enhanced ebna synthesis ( , ) .
it is likely that multiple g -mediated mechanisms are involved in the observed outcomes.
finally, targeting g s in the viral genomes leads to the exciting possibility of affecting viruses that undergo latency. these viruses, such as hiv and the herpes and papilloma virus families, comprise an initial acute infection and a subsequent latent infection. the latter is characterized by the maintenance of the virus genome in the human host for the entire life of the host. the latent virus may reactivate from time to time to produce new mature virus. current therapies, which normally target viral proteins, fail to remove the latent virus, i.e. the virus genome, from its host. selectively targeting the viral genome in a g -mediated approach would allow removing not only the replicating virus but also the latent one, therefore eradicating so far incurable infective agents. in this picture it is worth considering the virus-induced manipulation of host chromatin. in recent years, studies about the role of chromatin in viral infections showed dynamic virus-host chromatin interactions and chromatin machinery modulation by virus-encoded proteins ( ) . for example, in hsv- , epigenetic regulation of viral chromatin by viral gene products plays a key role in determining whether the virus develops a lytic or latent infection ( ) . considering the recent evidence reported by hänsel-hertsch et al. that g formation reflects the suppressive role of heterochromatin and that it occurs only in highly transcribed regulatory nucleosome-depleted chromatin regions ( ) , it would be interesting to understand how the virus and its g s affect and could be affected by such a complex mechanism.
to conclude, all the data reported in this review indicate that: i) g structures are crucial elements in the regulation of viruses' life cycle, both in lytic and latent states; ii) g ligands efficiently act as antiviral agents.
this should encourage researchers to continue investigating g -binding small molecules: as a matter of fact, although the clinical evaluation of quarfloxin did not progress, its success in phase i clinical trial, i.e. optimal toxicity profile ( ) , suggests that improvements of g ligand pharmacological profiles will very likely lead to concrete clinical applications of these compounds. therefore, research in the near future will need to improve i) the understanding of g activity and regulation at the viral level, ii) the selectivity of g ligands toward the viral versus cellular g s, iii) the drug-like properties of the antiviral g ligands to be employed in in vivo studies. g -mediated antiviral drugs may represent a significant turning point in the management of viral infections, especially for people who cannot access immunization, like immunocompromised patients or elderly people. in addition, the g -mediated antiviral effects reported in latent infections ( ) may pave the way for cutting-edge therapeutic approaches in the treatment of fatal human malignancies related to latent viruses, such as aids and herpes- and hpv-related cancer.
quadruplex dna: sequence, topology and structure rna g-quadruplexes in biology: principles and molecular mechanisms g-quadruplexes: prediction, characterization, and biological application targeting unimolecular g-quadruplex nucleic acids: a new paradigm for the drug discovery?
g-quadruplexes involving both strands of genomic dna are highly abundant and colocalize with functional sites in the human genome formation of dna:rna hybrid g-quadruplex in bacterial cells and its dominance over the intramolecular dna g-quadruplex in mediating transcription termination co-transcriptional formation of dna:rna hybrid g-quadruplex and potential function as constitutional cis element for transcription control four-stranded nucleic acids: structure, function and targeting of g-quadruplexes re-evaluation of g-quadruplex propensity with g hunter prevalence of quadruplexes in the human genome gene function correlates with potential for g dna formation in the human genome high-throughput sequencing of dna g-quadruplex structures in the human genome g-quadruplex structures mark human regulatory chromatin permanganate/s nuclease footprinting reveals non-b dna structures with regulatory potential across a mammalian genome g-quadruplexes and their regulatory roles in biology high-resolution three-dimensional nmr structure of the kras proto-oncogene promoter reveals key features of a g-quadruplex involved in transcriptional regulation structure of two intramolecular g-quadruplexes formed by natural human telomere sequences in k+ solution crystal structure of parallel quadruplexes from human telomeric dna quantitative visualization of dna g-quadruplex structures in human cells detection of g-quadruplex dna in mammalian cells visualization of rna-quadruplexes in live cells elevated levels of g-quadruplex formation in human stomach and liver cancer tissues g-quadruplex-binding small molecules ameliorate c orf ftd/als pathology in vitro and in vivo g-quadruplex structures and their interaction diversity with ligands structure, location and interactions of g-quadruplexes telomere g-quadruplex as a potential target to accelerate telomere shortening by expanding the incomplete end-replication of telomere dna telomeric g-quadruplex architecture and interactions with 
potential drugs human telomeric g-quadruplex: the current status of telomeric g-quadruplexes as therapeutic targets in human cancer cell cycle regulation of g-quadruplex dna structures at telomeres g-quadruplexes in human promoters: a challenge for therapeutic applications g dna in ras genes and its potential in cancer therapy small molecules targeting c-myc oncogene: promising anti-cancer therapeutics making sense of g-quadruplex and i-motif functions in oncogene promoters g-quadruplex structures in the human genome as novel therapeutic targets targeting g-quadruplexes in gene promoters: a novel anticancer strategy? genome-wide computational and expression analyses reveal g-quadruplex dna motifs as conserved cis-regulatory elements in human and related species genomic distribution and functional analyses of potential g-quadruplex-forming sequences in saccharomyces cerevisiae putative dna g-quadruplex formation within the promoters of plasmodium falciparum var genes genome-wide study predicts promoter-g dna motifs regulate selective functions in bacteria: radioresistance of d. radiodurans involves g dna-mediated regulation mapping and characterization of g-quadruplexes in mycobacterium tuberculosis gene promoter regions g-quadruplexes in pathogens: a common route to virulence control? 
key: cord- -sjab zsk authors: mendez, aaron s; vogt, carolin; bohne, jens; glaunsinger, britt a title: site specific target binding controls rna cleavage efficiency by the kaposi's sarcoma-associated herpesvirus endonuclease sox date: - - journal: nucleic acids res doi: . /nar/gky sha: doc_id: cord_uid: sjab zsk a number of viruses remodel the cellular gene expression landscape by globally accelerating messenger rna (mrna) degradation. unlike the mammalian basal mrna decay enzymes, which largely target mrna from the ′ and ′ end, viruses instead use endonucleases that cleave their targets internally.
this is hypothesized to more rapidly inactivate mrna while maintaining selective power, potentially through the use of a targeting motif(s). yet, how mrna endonuclease specificity is achieved in mammalian cells remains largely unresolved. here, we reveal key features underlying the biochemical mechanism of target recognition and cleavage by the sox endonuclease encoded by kaposi's sarcoma-associated herpesvirus (kshv). using purified kshv sox protein, we reconstituted the cleavage reaction in vitro and reveal that sox displays robust, sequence-specific rna binding to residues proximal to the cleavage site, which must be presented in a particular structural context. the strength of sox binding dictates cleavage efficiency, providing an explanation for the breadth of mrna susceptibility observed in cells. importantly, we establish that cleavage site specificity does not require additional cellular cofactors, as had been previously proposed. thus, viral endonucleases may use a combination of rna sequence and structure to capture a broad set of mrna targets while still preserving selectivity. viral infection dramatically reshapes the gene expression landscape of the host cell. by changing overall messenger rna (mrna) abundance or translation, viruses can redirect host machinery towards viral gene expression while simultaneously dampening immune stimulatory signals ( ) ( ) ( ) . suppression of host gene expression, termed host shutoff, can occur via a variety of mechanisms, but one common strategy is to accelerate degradation of mrna ( ) ( ) ( ) . this occurs during infection with dna viruses such as alphaherpesviruses, gammaherpesviruses, and vaccinia virus, as well as with rna viruses such as influenza a virus and sars and mers coronaviruses ( , , ) . in the majority of these cases, a viral factor promotes endonucleolytic cleavage of target mrnas.
this strategy bypasses the normally rate limiting steps of deadenylation and decapping to effect rapid mrna degradation by host exonucleases ( ) . virally encoded host shutoff endonucleases are usually specific for mrna, yet broad-acting in that they target the majority of the mrna population. this is exemplified by herpesviral nucleases, including the sox endonuclease encoded by kaposi's sarcoma-associated herpesvirus (kshv), an oncogenic human gammaherpesvirus that causes kaposi's sarcoma and b cell lymphoproliferative diseases ( , ) . kshv sox is a member of the pd-(d/e)xk type ii restriction endonuclease superfamily that possesses mechanistically distinct dnase and rnase activities ( ) ( ) ( ) . the rnase activity of the gammaherpesvirus sox protein has been shown to play key roles in various aspects of the viral lifecycle, including immune evasion, cell type specific replication, and controlling the gene expression landscape of infected cells ( ) ( ) ( ) ( ) . however, the mechanism by which sox targets mrnas remains largely unknown. sequencing data indicate that within the mrna pool there appears to be a range of sox targeting efficiencies; some transcripts are efficiently cleaved in cells, while others are partially or fully refractory to cleavage ( ) ( ) ( ) ( ) ( ) . additionally, sox has been shown to cut within specific locations of mrnas in cells, further emphasizing that there must be transcript features that confer selectivity ( , ) . indeed, a transcriptome-wide cleavage analysis indicated that sox targeting is directed by a relatively degenerate motif, often containing an unpaired polyadenosine stretch shortly upstream of the cleavage site, which is located in a loop structure ( ) . cleavage within an unpaired loop was confirmed in a recent crystal structure of sox with rna, although additional contacts that could confer sequence specificity were not observed ( ) . 
thus, a major outstanding question is how rna sequence and/or structure contribute to sox target recognition. in this context, it is unclear how sequence features surrounding the rna cleavage site might impact sox targeting, for example by changing its affinity for a given rna or the efficiency with which cleavage occurs. to address these questions, we sought to reconstitute the sox cleavage reaction in vitro using purified components. using an rna substrate that is efficiently cleaved by sox in cells, we revealed that specific rna sequences within and outside of the cleavage site significantly contribute to sox binding efficiency and target processing. in particular, we found that the polyadenosine stretch adjacent to the cleavage site is critical for sox binding, and we experimentally verified the importance of an open loop structure surrounding the cleavage site. finally, we demonstrated that this in vitro system faithfully recapitulates the initial endonucleolytic cleavage event that is an essential component of mrna target specificity in vivo. collectively, our data reveal that specific sequence features potently impact sox binding, and thus provide key insight into the breadth of sox targeting efficiency observed across the transcriptome. more broadly, this information provides a framework for better understanding the target specificity of endonucleases, which play central roles in mammalian quality control processes and viral infection outcomes. kshv sox was codon optimized for sf expression and synthesized by genewiz. sox was then subcloned using restriction sites bamhi and sali (new england biolabs) into pfastbac htd. this vector was modified to carry a gst affinity tag and prescission protease cut site as described ( ) . all sox mutants were generated using single primer site-directed mutagenesis ( ) . sequences were validated using standard pgex forward and reverse primers.
generation of viral bacmids and transfections were prepared as described in the bac-to-bac ® baculovirus expression system (thermo fisher scientific) manual. after transfection, sf cells (thermo fisher scientific) were grown for h at • c using sf- smf media (gibco) substituted with % fetal bovine serum (fbs) and % antibiotic antimycotic (aa). supernatant was transferred to a six-well tissue culture plate containing ml of × ∧ cells/well. cells were incubated for hr to generate passage (p ). the p supernatant was transferred to a flask containing ml of × cells/ml and incubated for h, a time point sufficient to yield mg of sox per ml of cells. protein expression was confirmed by western blot with an anti-gst antibody (ge health care life sciences). sf cell pellets were suspended in lysis buffer containing mm nacl, % glycerol, . % triton x- , mm dtt, mm hepes ph . with a complete, edta-free protease inhibitor cocktail tablet (roche). cells were sonicated on ice using a macro trip for s bursts with min rests for min at a. cell lysate was cleared using a pre-chilled ( • c) sorvall lynx superspeed centrifuge spun at rpm for min. the cleared lysate was incubated for h at • c with rotation with ml of a gst bead slurry (ge healthcare life sciences) that had been pre-washed × with wash buffer (wb) containing mm nacl, % glycerol, mm dtt, mm hepes ph . . the bead-protein mixture was washed × with ml of wb, then transferred to a ml disposable column (qiagen) and washed with an additional ml of wb followed by ml of low salt buffer (lsb) containing mm nacl, % glycerol, mm dtt, mm hepes ph . with periodic resuspension to prevent compaction. sox was then cleaved on column with prescission protease (ge healthcare life sciences) overnight at • c, and protein eluate was collected for a final volume of ml in lsb. 
cleaved protein was concentrated to ∼ ml using amicon filter concentrator membrane cut off kda (emd millipore), then loaded onto a hiload superdex s pg gel filtration column (ge healthcare life sciences). protein elutions were concentrated using an amicon concentrator as described above to mg/ml and l aliquots were snap frozen in liquid n using nuclease-free . ml microfuge tubes (ambion life technologies) and stored at - • c. all rna substrates (sequences in supplementary table s ) unless stated otherwise were synthesized by dharmacon (ge healthcare) with hplc and page purification. rnas were end labeled with γ-[ p]-atp- ci/mmol mci/ml (perkin elmer) using t pnk (new england biolabs). rnas were end labeled with -[ p]-pcp ci/mmol mci/ml using t rna ligase (new england biolabs). labeled rna substrates were purified using % urea-page and were isolated from gel slices by incubating overnight at • c in a buffer containing mm tris-hcl, mm edta ph . . eluted rnas were ethanol precipitated and resuspended in rnase-free ddh o. k obs and hill coefficients of sox were determined from the cleavage kinetics of [ p]-labeled rna substrates as previously described ( ) . briefly, l (≤ pm) of [ p]-labeled rna was added to l of premixture containing mm hepes ph . , mm nacl, mm mgcl , mm tcep, % glycerol, and increasing concentrations of purified sox. reactions were performed at room temperature under single turnover conditions, and quenched at the indicated time intervals with l stop solution ( m urea, . % sds, . mm edta, . % xylene cyanol, . % bromophenol blue). samples were resolved by % urea-page, imaged using a typhoon variable mode imager (ge healthcare), and quantified using imagequant and gelquant software packages (molecular devices). the data were plotted and fit to exponential curves using the prism software package (graphpad) to determine observed rate constants.
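As an illustration of how an observed rate constant can be extracted from a single-turnover cleavage time course, the sketch below fits f(t) = A(1 - exp(-k_obs t)) to the fraction of substrate cleaved. It is a minimal example under stated assumptions and not the authors' analysis: the fitting in the study was done in Prism, whereas this sketch uses a log-linearized least-squares fit through the origin with a fixed amplitude A. The function name `fit_k_obs`, the amplitude default, and the time course values are hypothetical and chosen only for demonstration.

```python
import math


def fit_k_obs(times, fraction_cleaved, amplitude=1.0):
    """Estimate k_obs for f(t) = amplitude * (1 - exp(-k_obs * t)).

    Assumption: the amplitude (maximum fraction cleaved) is known.
    The model is linearized as -ln(1 - f/amplitude) = k_obs * t and
    fitted by least squares through the origin.
    """
    sum_ty = 0.0
    sum_tt = 0.0
    for t, f in zip(times, fraction_cleaved):
        remaining = 1.0 - f / amplitude
        if t <= 0 or remaining <= 0:
            continue  # points that cannot be log-transformed are skipped
        y = -math.log(remaining)
        sum_ty += t * y
        sum_tt += t * t
    if sum_tt == 0:
        raise ValueError("need at least one usable time point")
    return sum_ty / sum_tt


# Synthetic, noise-free time course (illustrative values, not data from the study).
true_k = 0.25  # per minute, hypothetical
times = [0.5, 1.0, 2.0, 4.0, 8.0]
fractions = [1.0 - math.exp(-true_k * t) for t in times]

k_obs = fit_k_obs(times, fractions)
print(f"k_obs = {k_obs:.3f} per minute")
```

The linearization weights late time points more heavily than a direct nonlinear fit would, so for noisy gel quantifications a nonlinear regression is usually the better choice.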
a fret probe with excitation at nm and emission at nm (limd flo) was purchased from dharmacon (supplementary table s ). the rna fret probe was added at a final concentration of nm to l of premixture containing mm hepes ph . , mm nacl, mm mgcl , mm tcep, % glycerol with m of sox ( ) . terminator exonuclease (lucigen) was added to reactions using a : dilution of the enzyme. reactions were quenched at indicated time intervals with equal volumes of stop solution containing % formamide and mm edta, then resolved using urea-page and visualized using a typhoon variable mode imager (ge healthcare). the data were plotted using the prism software package (graphpad). all experiments were repeated > times and mean values were computed. for assays designed to detect endonucleolytic cleavage intermediates, l of labeled rna substrate was combined with l of reaction solution ( mm hepes ph . , mm nacl, . mm cacl, . mm mgcl , % glycerol, . mm tcep) in the presence or absence of m sox for min at room temperature. rna was then ethanol precipitated, resuspended in % formamide solution containing mm edta, and resolved on a % urea-page analytical grade sequencing gel together with a ss-rna decade ladder (ambion life technologies) for . h at w before imaging as described above. the sequence surrounding the cut site in limd was inserted into a pbssk (-) backbone using the bamhi and xbai restriction sites. mutations were introduced by the quikchange site directed mutagenesis protocol (agilent). the nt sequence surrounding the gfp cut site was inserted using the bamhi and xhoi restriction sites. in-line probing was performed as described previously ( ) . briefly, pbssk(-) plasmids containing the indicated sequences (see supplementary table s ) were linearized by digestion with xhoi and scai for gfp or blpi and saci (neb) for limd , gel purified, phenol/chloroform extracted, and ethanol precipitated.
the fragments were then used as templates for in vitro transcription with the hiscribe t high yield rna synthesis kit (neb) and afterwards subjected to turbo dnase (ambion by life technologies) treatment. rna was resolved by % urea page, and full length transcripts were excised from the sybr gold stained gel (thermo fisher scientific), eluted overnight in g buffer ( mm tris hcl ph . , mm naoac, mm edta, . % sds), phenol/chloroform extracted, and ethanol precipitated. the rna (∼ pmol) was dephosphorylated using shrimp alkaline phosphatase (rsap, neb), labeled with l [γ p] atp ( mci/ml) using usb optikinase (affymetrix), then gel purified as described above and dissolved in l of nuclease free water. for the in-line probing reaction, l rna (≥ cpm) was incubated in × reaction buffer ( mm tris-hcl ph . , mm mgcl , mm kcl) at room temperature for or h. the reaction was quenched with × loading buffer ( m urea, . mm edta ph . ). to generate ladders, l of the purified rna was separately subjected to hydrolysis using the next magnesium rna fragmentation module (-oh) or rnase t digestion (t ) (neb). reactions were resolved by % urea-page, exposed on a phosphorimager screen, and scanned using the storm imaging system (ge healthcare). deduced rna structures were drawn using the rna secondary structure visualization tool forna (vienna rna web services). rna probes used in emsa experiments were radiolabeled using the protocol described for ribonuclease activity assays. reactions were incubated at rt for min in buffer containing mm hepes ph . , mm kcl, mm cacl , . % tween- , . tcep, . mg/ml bsa (sigma-aldrich), g/ml of yeast trna (ambion thermo fisher), and the indicated amount of purified sox protein. calcium chloride was used in these binding assays to prevent substrate processing and stabilize rna-protein interactions. reaction volumes were kept at l and stopped with l × emsa loading dye ( mm hepes ph . , mm kcl, % glycerol).
reactions were resolved by % native page, and gels were imaged on a typhoon multivariable imager (ge healthcare) and quantified using gelquant software package (molecular dynamics). limd - rna was end labeled with ␥ -[ p]-atp- ci/mmol mci/ml (perkinelmer) using t pnk (new england biolabs). rna was then gel purified as stated previously. emsa gel shifts were first used to determine optimal binding conditions (> % binding, homogeneous complexes of rna-protein). binding buffer contained . % tween (sigma-aldrich), mm cacl , mm kcl, mm nacl, . mm tcep, mm hepes ph . , . mg/ml yeast trna (ambion), . mg/ml nuclease free bovine serum albumin (bsa) (ambion). a dilution series of sox ( - . m) was incubated with l of radiolabeled limd - in the presence of . unit of rnase t (epicentre illumina). reactions were incubated at rt for a total of min before being ethanol precipitated. rna pellets were then resuspended in l of % formamide solution containing mm edta and boiled for min. samples were then loaded onto a % analytical grade urea-page gel and run at w for . h. gels were imaged and analyzed as stated above. in order to produce an rnase t ladder, l of limd - was incubated with . units of rnase t . reactions were incubated at rt for min before being quenched and prepared as stated previously. the limd - hydrolysis ladder was generated as stated in the in-line probing methods. rna probes ( end labeled with biotin) were synthesized from dharmacon (ge healthcare) and hplc and page purified (see supplementary table s ). the octet red e bio-layer interferometry instrument and streptavidin (sa) biosensors were available from fortebio (menlo park, ca, usa). all steps were performed in reaction buffer similar to emsa binding conditions. biosensors were incubated with nm of the biotinylated rna substrate for containing no rna. sox protein was incubated with the rna conjugated biosensors for - s in order to reach saturation. 
indicated protein concentrations for each biosensor are located on corresponding binding curves. complexes were dissociated for a minimum of min. response curves for each biosensor were normalized against biosensors conjugated to rna in the absence of sox (buffer only control). normalized response curves were processed using octet software version by fitting the group of selected biosensors to a nonlinear regression model ( ) . dissociation constants (k d ) were determined from k on and k dis values derived from the fitted curves (k d = k dis / k on ). a complete table of all values is provided in supplementary table s . in cells, the mrna fragments resulting from the primary sox endonucleolytic cleavage are predominantly cleared by the host - exonuclease xrn , while in vitro, rna fragments are rapidly degraded by - exonucleolytic activity intrinsic to purified sox ( ) . thus, it has been challenging to analyze the initial endonucleolytic cleavage event that is an essential component of mrna target specificity in vivo. here, we sought to develop a biochemical system to address these questions. our prior analysis of sox targets in cells identified the human limd mrna, which codes for a protein essential for p body formation and integrity, as being highly susceptible to cleavage by sox ( ) . the minimum sequence required to directly cut the putative cleavage site in limd in cells was mapped to a -nucleotide segment (limd - ), and we therefore chose this as our model substrate to study sox targeting in vitro ( ) . we first expressed and purified kshv sox to greater than % purity from sf insect cells (supplementary figure s a ). using the limd - substrate, we plotted the observed rate constant (k obs ) as a function of sox concentration, yielding a hill coefficient of n = . ( figure a ). thus, in agreement with previous observations ( , ), sox appears to function predominantly as a monomer.
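To make the Hill analysis above concrete, the following sketch estimates a Hill coefficient from observed rate constants measured at increasing enzyme concentrations, using the model k_obs = k_max [E]^n / (K_half^n + [E]^n). It is illustrative only and rests on assumptions that are not taken from the study: k_max is treated as known, the fit uses the classical linearized Hill plot (log of k/(k_max - k) against log [E]) instead of nonlinear regression, and the function name `fit_hill` and all numerical values are hypothetical. A fitted n close to 1 indicates non-cooperative behavior, which is the kind of result that is consistent with an enzyme acting as a monomer.

```python
import math


def fit_hill(concentrations, rates, k_max):
    """Return (n, k_half) from a linearized Hill plot.

    Model: rate = k_max * c**n / (k_half**n + c**n), which rearranges to
    log10(rate / (k_max - rate)) = n * log10(c) - n * log10(k_half).
    Assumption: k_max (the plateau rate) is known in advance.
    """
    xs, ys = [], []
    for c, k in zip(concentrations, rates):
        if c <= 0 or k <= 0 or k >= k_max:
            continue  # outside the domain of the log transform
        xs.append(math.log10(c))
        ys.append(math.log10(k / (k_max - k)))
    if len(xs) < 2:
        raise ValueError("need at least two usable points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("concentrations must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # Hill coefficient n
    intercept = mean_y - slope * mean_x    # equals -n * log10(k_half)
    k_half = 10 ** (-intercept / slope)
    return slope, k_half


# Synthetic, noise-free titration with n = 1 (illustrative values only).
conc = [0.25, 0.5, 1.0, 2.0, 4.0, 8.0]    # hypothetical enzyme concentrations
rates = [c / (2.0 + c) for c in conc]     # k_max = 1, k_half = 2, n = 1

n_hill, k_half = fit_hill(conc, rates, k_max=1.0)
print(f"Hill coefficient n = {n_hill:.2f}, half-maximal concentration = {k_half:.2f}")
```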
under conditions of half maximal activity ( m; figure a ), sox displayed a strong preference for the 'hard' divalent metal mg + and a weaker preference for the 'softer' and larger metals mn + , co + and zn + ( figure b ). this is again consistent with other characterized members of the p/dexk family of enzymes ( , ) . notably, sox activity in the presence of mg + was inhibited in a dose-dependent manner upon competitive addition of ca + ( figure c and supplementary figure s d ). this is likely the result of increased coordination partners engaged by ca + , which decreases the ability of catalytic residues to promote proper base hydrolysis ( ) ( ) ( ) . finally, increasing the nacl concentration above mm led to substantially decreased sox activity ( figure d ), in accordance with the observation that high salt concentrations frequently inhibit nuclease activity by disrupting protein-protein or protein-substrate interactions ( ) . given that recombinant sox displays robust - exonuclease activity ( , ), we sought to confirm that limd - was subject to endonucleolytic sox cleavage, as this is the predominant event that directs mrna turnover in sox expressing cells ( , ) . both the and ends of limd - were blocked by capping the end with a cy fluorophore and the end with an iowa black quencher (limd - flo). we confirmed this rna was resistant to degradation by the -phosphate dependent exonuclease terminator ( figure e, lane ) . however, in the presence of sox, a cleavage product was observed that correlated with an endonucleolytic cut ( figure e, lane ) . to confirm this processing event was not a result of contamination, we purified a sox mutant containing mutations within two key residues of the sox active site (d n/e q). incubation of this mutant with limd - over the course of . h yielded no rna cleavage (supplementary figure s e ). thus, recombinant sox appears to target limd - for endonucleolytic cleavage in vitro, as has been observed for this substrate in cells.
to analyze rna substrate selectivity using our in vitro assay, we first compared sox degradation of limd - to a -nucleotide sequence of the mrna encoding gfp (gfp- ). we have previously shown that gfp mrna is cleaved by sox in cells, and that gfp- is the minimal sequence required to elicit cleavage ( , ) . the cleavage sites for limd - and gfp- are predicted to occur in an open loop region (figure a, red arrow) . upon direct comparison of these two rnas, we observed a ∼ -fold increase in the catalytic efficiency of sox for the limd - substrate compared to gfp- ( figure b ). this difference was not exclusively due to the fact that the gfp substrate was slightly shorter than limd - , as sox also displayed a fold reduction of catalytic efficiency on a longer, nt gfp substrate (gfp- ; figure b ). electrophoretic mobility shift assays (emsa) further revealed a -fold increase in sox binding to limd - compared to gfp- ( figure c ). given that both substrates contain the requisite unpaired bulge at the predicted cleavage site (see figure a and supplementary figure s ), these observations suggest that additional sequence or structural features impact sox targeting efficiency on individual rnas. two sox point mutants, p s and f a, located in an unstructured region of the protein that bridges domains i and ii have been shown to be selectively required for its endonucleolytic processing of rna substrates (supplementary figure s a and s b) ( , ) . structural data indicate that residue f forms a stacking interaction with an adenine base in the rna, likely stabilizing the protein-rna interaction, while p is hypothesized to contribute to structural rearrangements required for f engagement ( ) . we purified both mutants to evaluate their relative rna processing and rna binding activity against the optimal limd - substrate. both mutants displayed purity and elution profiles similar to wild type (wt) sox (see supplementary figure s a-c) . 
however, the catalytic efficiency of each mutant was > -fold less than wt sox ( figure d ). furthermore, rna binding was severely perturbed; the binding affinity of wt sox for limd - is in the single digit nanomolar range (k d = nm), while p s and f a display > log defects (k d = nm and nm, respectively) ( figure e and supplementary figure s a c). thus, the large defect in rna binding likely explains the decreased efficiency of rna processing. notably, while there was a dramatic decrease in the relative affinities of the two mutants for limd - , there was not a complete loss of binding or rna processing. this could be a result of secondary nonspecific interactions and/or nonspecific exonucleolytic degradation by sox from the monophosphorylated end of the probe. in silico rna folding predictions of sox targeting motifs, coupled with rna mutagenesis experiments, have indicated that an rna stem loop structure is an important determinant in sox targeting both in vitro and in vivo ( , ) . given the importance of this predicted motif, and in particular the proposed requirement for unpaired sequence at the cut site, we sought to experimentally determine the structure of limd - using chemical based in-line probing ( figure a ). this showed that the limd - structure contains a largely base paired stem region, followed by a loop at positions - that encompasses the predicted sox cleavage site between nt and , and a short hairpin structure at positions - ( figure b) . notably, some differences exist between the predicted and observed structures of limd - , including a larger loop region and the subsequent short stem-loop (compare figure b to figure a ). however, in both cases the predicted cleavage site of sox resides in a loop region. recently, a high-resolution crystal structure was solved of sox bound to a nt fragment of the kshv pre-microrna k - (k - ).
in this structure, the only observed contacts between sox and k - occurred between the four active site residues of sox (y , r , c , f ) and the ugaag motif surrounding the cleavage site of the rna ( ) . it was therefore hypothesized that no other residues beyond this unpaired ugaag motif were involved in transcript recognition ( ) . however, the binding affinity we observed for limd - was -fold stronger than what was previously reported for k - ( ) , suggesting that a more extended interaction surface might distinguish optimal from sub-optimal rna substrates. we therefore used rna footprinting to map the sox binding sites on limd - . indeed, sox protected a region of limd - that included the three adenosine stretch (positions - ) from rnase t digestion in a dose dependent manner ( figure ) . notably, this mapped binding region is the same region predicted from in vivo pare-seq data to be important for sox targeting, although the reason for its importance remained unknown ( ) . we also observed a modest protection of base (g) located directly adjacent to the predicted cleavage site of sox, which represents the region detected in the crystal structure of k - bound to sox. collectively, these findings suggest that while sox may interact with residues directly adjacent to the cut site, a more extensive interaction interface exists for its preferred in vivo targets. to explore the importance of the residues involved in sox binding and cleavage, we engineered mutants of the limd - substrate ( figure a ). first, we preserved the loop structure but replaced the three adenosines bound by sox (residues - ) with guanosines (limd - xa-g). second, we largely abolished the loop structure by providing complementary base pairing (limd - zipper). third, we mutated the residue located at the predicted sox cut site that was also protected in the footprinting assay (limd - a-g). this mutant has been previously identified to block sox cleavage in vivo ( ) . (supplementary figure s ).
real-time binding kinetics for sox with wt limd - and each of the three mutant substrates were then measured using bio-layer interferometry (bli). all rna probes were biotinylated and immobilized to a streptavidin-coated bli probe, whereupon the binding and dissociation of sox were measured. to prevent degradation of the probe, excess calcium ion was used in place of magnesium (supplementary figure s e) . sox retained similar binding affinity to the cut site mutant table s ). to rule out the possibility that the effect on binding affinity to the limd - zipper mutant was a result of altered residues within the binding site, we also engineered an additional zipper mutant (limd - zipper ) that did not disrupt the polyadenosine sequence. in agreement with the loop structure playing a critical role in target recognition, this limd - zipper mutant also displayed a substantial defect in binding (k d = . m; supplementary figure s a , b, supplementary table s ). finally, we measured sox binding to the kshv pre-mirna sequence used to obtain the sox-rna cocrystal structure (k - ) ( ) . notably, the affinity of sox for k - was within the range of the limd - structural mutants (k d = . m), suggesting that despite having a ugaag motif upstream of a predicted bulge, this is unlikely to be a sox target ( figure b , supplementary figure s e and supplementary table s ) . we next quantitatively measured the catalytic efficiency of sox towards each of the above rna substrates. despite sox having wt binding affinity for the predicted cleavage site mutant limd - a-g, there was a -fold defect in its ability to degrade this mutant ( figure c ). even more marked defects in sox catalytic efficiency were observed for the binding site mutant limd - xa-g, the loop mutants limd - zipper and limd - zipper , and the pre-mirna k - ( figure c and supplementary figure s ).
collectively, these data indicate that efficient rna cleavage requires both an appropriate sox binding site and a suitable cut site. in cells, sox cleaves its mrna substrates site-specifically. mutagenesis of residues in mapped cleavage sites generally abolishes sox cleavage at that location ( ) . to determine if our in vitro assay faithfully recapitulated the site specificity of sox endonucleolytic targeting observed in cells, we established reaction conditions that enabled trapping of the early cleavage events. by combining ca + and mg + in our reaction buffer, we were able to sufficiently slow sox processing to visualize cleavage products derived from p labeled substrates. indeed, we observed a predominant nt band, which is the size of the product released upon limd - cleavage at the predicted cut site ( figure a , lane ). additional bands also appeared, likely representing subsequent processing events. importantly, when we incubated sox with the cut site mutant limd - a-g, there was a complete loss of this nt product, as well as the additionally processed intermediates ( figure a, lane ) . production of these cleavage intermediates required sox, as no decay was observed in the rna-only controls ( figure a , lanes - ). finally, we sought to verify that the predominant nt cleavage product we observed was a result of an endonucleolytic cleavage and not end processing. to this end, we generated a limd - substrate containing a p pcp label and a free oh to block end processing. again, in the presence of sox, wt limd - but not the a-g mutant produced a cleavage product whose size corresponded to cleavage at the predicted site ( figure b ). taken together, these data confirm that our in vitro assay faithfully recapitulates sox cleavage site specificity on a true substrate.
endonuclease-directed mrna degradation plays key roles in the lifecycle of gammaherpesviruses, yet the fundamental principles governing target specificity by sox and other viral endonucleases are not well understood. here, through the development of the first biochemical system to faithfully recapitulate the internal cleavage specificity observed for sox in cells, we revealed how both rna sequence and structure contribute to targeting. these findings resolve a central feature of the current model of sox activity ( figure ). previous observations established that sequences flanking the cut site were required to direct cleavage by sox ( , ) . however, it was unresolved whether they played a strictly structural role in presenting an exposed loop for cleavage, served as a platform for sox binding, or created a binding site for one or more cellular factors that then indirectly recruited sox to its targets. through a combination of mutational analyses, rna structure probing, and rna footprinting assays, we showed that efficient sox targeting requires both an exposed loop structure and upstream sequences that serve as a sox binding platform. this combination of sequence and structural features within the targeting motif helps explain why some mrnas are efficiently cleaved by sox, whereas others are weaker substrates. a key open question related to sox function is how it can target the majority of mrnas in cells, yet with significant site specificity. our observations suggest that there must be specific mrna features that influence targeting. indeed, pare-seq analyses of cleavage intermediates in sox expressing cells revealed that cleavage sites were associated with a degenerate sequence motif ( ) . sequences proximal to the cleavage site were predicted to be un-base paired and frequently contained a polyadenosine stretch followed by a purine ( ) . the requirement for these sequence features for sox targeting was validated for the limd transcript in cells ( ) . 
because limd has been established as a particularly robust sox target in cells ( ) , we reasoned that it must contain features optimal for sox processing and therefore would be an ideal substrate to dissect biochemically why these features are important. indeed, sox binding to limd - was -fold better than to the commonly used reporter substrate gfp, and ∼ -fold better than to the k - pre-mirna, which has not been demonstrated to be processed by sox in cells. importantly, these binding differences correlated with the efficiency of sox cleavage in vitro, arguing that the ability to bind the targeting motif is a key step in target recognition. through rna footprinting assays, we were able to show that sox binds to a bulge structure proximal to the cleavage site containing the polyadenosine stretch previously predicted to be important for mrna cleavage by sox in cells ( ) . mutating either just the bulge structure (limd - zipper ) or maintaining the bulge but mutating the polyadenosine stretch (limd - xa-g) resulted in a ∼ -fold reduction in binding affinity, correlating with a dramatic decrease in cleavage efficiency. collectively, these data demonstrate that variability in the efficiency of sox targeting observed in cells is likely due to differences in rna sequences that mediate sox binding. a recent crystal structure of sox bound to the k - pre-mirna captured the importance of the exposed loop region for sox cleavage ( ) . however, the structure did not reveal additional interactions between sox and the rna beyond the three residues surrounding the cut site. our data suggest that this is likely because the k - rna lacks the additional residues necessary for sox binding site found in both limd and gfp. while the k - rna does contain adenosines upstream of the cleavage site, structural predictions indicate these residues are within a stem region ( ) , rather than in an exposed loop as is the case for limd and gfp. 
together, these observations indicate that while upstream adenosines are important for binding, they must be present in an unpaired state to promote sox binding. it is notable that prior studies reported much weaker interactions between sox and rna (k d = m) compared to its dna substrates (k d = m) ( , , ) . however, in these cases binding assays were conducted with scrambled rna sequences. we found that sox binding affinities to rna substrates vary over several orders of magnitude, in a manner that correlates with cleavage efficiency. interestingly, the crystal structure of sox bound to dna showed more dynamic interactions along the length of the protein (∼ Å interaction surface), when compared to the k - rna bound structure (∼ Å interaction surface). it is therefore possible that more interaction along the length of sox protein might occur with optimal substrates such as limd that are more tightly bound. the fact that purified sox endonucleolytically cleaved limd - at the precise site observed in sox-expressing cells demonstrates that cleavage site selection on an mrna is not mediated by a cellular cofactor. instead, targeting at particular rna motifs is strongly influenced by the strength of sox binding. our observation that the p s and f a sox mutants display significant rna binding defects indicates that their failure to cleave mrnas in cells is due to an inability to efficiently bind the targeting motif.
figure . model of mrna targeting by sox (panel labels: target identification; exonucleolytic degradation by xrn / dis l ). sox is able to distinguish mrna from other types of rna in cells by an as yet unknown mechanism. subsequently, it endonucleolytically cleaves its targets at specific sites, whereupon the fragments are degraded by host exonucleases such as xrn and dis l .
here, we revealed that in addition to the requirement for an unpaired loop at the cleavage site, additional upstream rna sequences increase the affinity of sox for individual targets, thereby controlling cleavage efficiency.
the mechanism by which sox initially distinguishes rna polymerase ii transcribed mrnas from other types of rna in cells remains an important open question, as this feature of sox selectivity is not preserved in vitro. we hypothesize that cellular co-factors, perhaps through interactions with sox, enable this distinction. more broadly, endonucleases are instrumental in rna processing and degradation. nuclease processing defects lead to several human pathologies ranging from cancer to neurodegeneration ( ) ( ) ( ) ( ) ( ) , and our study provides a framework for better understanding the mechanistic features governing endonuclease targeting.
emerging roles for rna degradation in viral replication and antiviral defense
modulation of the translational landscape during herpesvirus infection
a common strategy for host rna degradation by divergent viruses
a two-pronged strategy to suppress host protein synthesis by sars coronavirus nsp protein
influenza a virus protein pa-x contributes to viral growth and suppression of the host antiviral and immune responses
increasing incidence of cancers associated with the human immunodeficiency virus epidemic
human herpesvirus- : kaposi sarcoma, multicentric castleman disease, and primary effusion lymphoma
the exonuclease and host shutoff functions of the sox protein of kaposi's sarcoma-associated herpesvirus are genetically separable
crystal structure of the shutoff and exonuclease protein from the oncogenic kaposi's sarcoma-associated herpesvirus
crystal structure of a kshv-sox-dna complex: insights into the molecular mechanisms underlying dnase activity and host shutoff
global mrna degradation during lytic gammaherpesvirus infection contributes to establishment of viral latency
gammaherpesviral gene expression and virion composition are broadly controlled by accelerated mrna degradation
host shutoff during productive epstein-barr virus infection is mediated by bglf and may contribute to immune evasion
aberrant herpesvirus-induced polyadenylation correlates with cellular messenger rna destruction
lytic kshv infection inhibits host gene expression by accelerating global mrna turnover
coordinated destruction of cellular messages in translation complexes by the gammaherpesvirus host shutoff factor and the mammalian exonuclease xrn
deep sequencing reveals direct targets of gammaherpesvirus-induced mrna decay and suggests that multiple mechanisms govern cellular transcript escape
an rna element in human interleukin confers escape from degradation by the gammaherpesvirus sox protein
nuclease escape elements protect messenger rna against cleavage by multiple viral endonucleases
transcriptome-wide cleavage site mapping on cellular mrnas reveals features underlying sequence-specific cleavage by the viral ribonuclease sox
kshv sox mediated host shutoff: the molecular mechanism underlying mrna transcript processing
t cell costimulatory receptor cd is a primary target for pd- -mediated inhibition
the unfolded protein response signals through high-order assembly of ire
in-line probing analysis of riboswitches
structure of the atp synthase catalytic complex (f( )) from escherichia coli in an autoinhibited conformation
identification of new homologs of pd-(d/e)xk nucleases by support vector machines trained on data derived from profile-profile alignments
crystal structures of lambda exonuclease in complex with dna suggest an electrostatic ratchet mechanism for processivity
why do divalent metal ions either promote or inhibit enzymatic reactions? the case of bamhi restriction endonuclease from combined quantum-classical simulations
cofactor-mediated conformational control in the bifunctional kinase/rnase ire
the rna exosome and rna exosome-linked disease
mutations of exosc /rrp p associated with neurological diseases impact ribosomal rna processing functions of the exosome in s. cerevisiae
the rnase ii/rnb family of exoribonucleases: putting the 'dis' in disease
nonsense-mediated mrna decay and cancer
applying nonsense-mediated mrna decay research to the clinic: progress and challenges
mfold web server for nucleic acid folding and hybridization prediction
we thank members of the glaunsinger lab for their suggestions and critical reading of the manuscript. we would like to thank the university of california, berkeley tissue culture facility for sf cell maintenance and the university of california san francisco quantitative biosciences institute, antibiome center for use of their octet red e for binding kinetics measurements.
key: cord- - vi skvh authors: horejsh, douglas; martini, federico; poccia, fabrizio; ippolito, giuseppe; di caro, antonino; capobianchi, maria r. title: a molecular beacon, bead-based assay for the detection of nucleic acids by flow cytometry date: - - journal: nucleic acids res doi: . /nar/gni sha: doc_id: cord_uid: vi skvh
molecular beacons are dual-labelled probes that are typically used in real-time pcr assays, but have also been conjugated with solid matrices for use in microarrays or biosensors. we have developed a fluid array system using microsphere-conjugated molecular beacons and the flow cytometer for the specific, multiplexed detection of unlabelled nucleic acids in solution. for this array system, molecular beacons were conjugated with microspheres using a biotin-streptavidin linkage. a bridged conjugation method using streptavidin increased the signal-to-noise ratio, allowing for further discrimination of target quantitation.
using beads of different sizes and molecular beacons in two fluorophore colours, synthetic nucleic acid control sequences were specifically detected for three respiratory pathogens, including the sars coronavirus in proof-of-concept experiments. considering that routine flow cytometers are able to detect up to four fluorescent channels, this novel assay may allow for the specific multiplex detection of a nucleic acid panel in a single tube. with the continual emergence of new pathogens, the differential diagnosis or identification of etiological agents is the important first step to control the spread of infection. the sars coronavirus (sars-hcov) tested the ability of the scientific community to develop methods to isolate, identify and characterize an emerging virus ( , ) . the most powerful etiological diagnostic was arguably the use of a microarray 'viro-chip', which was able to quickly reveal that this pathogen was a coronavirus, even as it also found that this particular coronavirus had never been described previously ( ) . it appears that this type of solid-phase array technology will become more routine as the costs decrease, the procedures become streamlined for practical use and the technology becomes better disseminated. current tests for nucleic acid detection are based upon real-time pcr assays ( ) . in these assays, non-specific (sybr) or probe-specific fluorescence is measured throughout the pcr reaction [reviewed in ( ) ]. tyagi and kramer ( ) published work describing single-stranded 'loop-and-stem' molecules carrying both a fluorochrome and a quencher in close proximity. in this configuration, energy emitted by the excited fluorochrome is absorbed by the quencher and dissipated as heat in a process called fluorescence resonance energy transfer (fret). when the loop binds to a complementary nucleic acid strand, the molecule changes its conformation to distance the fluorochrome from the quencher, allowing unquenched fluorescence. 
these molecules were called 'molecular beacons' because they emit a fluorescent signal only when the probes are hybridized to their targets ( ) . it was later shown that it was possible to build multiplexed assays with this method; differently labelled molecular beacons could recognize different targets in the same reaction tube ( ) . it was also shown that the specificity of the assay was very high, so that probes differing in only nt could be resolved ( ) . various assays were then published using the molecular beacons technology, ranging from mrna in situ visualization ( , ) to nucleic acid sequence-based amplification detection ( ) , and multiplex detection of four pathogenic retroviruses ( ) . in other applications, molecular beacon probes were designed for use as dna biosensors by binding molecular beacons to glass beads or cover slips ( ) , ultra small optical fibre probes ( ) and gold surfaces ( ) , allowing the specific detection of complementary sequences. flow cytometers are common diagnostic tools used in the immunophenotyping of immune system cells or in the characterization of blood malignancies. with a routine configuration, they allow for the contemporary measurement of four different fluorescent wavelengths, in addition to two signals related to cell size and internal complexity. as such, their application is possible only when the fluorescent signal is associated with discrete particles, such as cells in suspension or microparticles. the flow cytometer has recently been used to detect nucleic acids in a multiplexed format using secondary reagents for signal detection ( ) . in this report, we describe the construction of molecular beacon-conjugated beads that we have called 'beadcons', whose specific hybridization with complementary target sequences can be resolved by flow cytometry (see figure ). assay sensitivity is achieved through the concentration of fluorescence signal on discrete particles. 
we first obtained evidence that the method could work in principle, with the ability to detect single, synthetic target sequences. then, we set up the system in a multiplex format (array), and applied the method to nucleic acid oligonucleotides mimicking respiratory diagnostic sequences. the results indicated that this method could allow the detection of the corresponding oligonucleotide, even when diluted in a complex mixture of nucleic acids. in fact, the versatility of flow cytometers allows the resolution of very complex analytical mixtures, in which the hybridized beads of a specific size and/or colour can be readily distinguished from the others that are unbound. the short assay time and ease of use make this method a good candidate for further development of its diagnostic capabilities and use in the routine laboratory.
figure . schematic representation of the interaction of the bead-bound molecular beacon with a complementary nucleic acid. the molecular beacon contains a probing loop sequence embedded within complementary arm sequences. these arms form a hairpin stem that keeps a terminal fluorophore and a terminal quencher molecule in close proximity in the absence of nucleic acid that is complementary to the loop. this state allows for fret, or a transfer of energy from the fluorophore to the quencher rather than excitation upon absorbance. after target addition, the complementary target forms a duplex with the loop portion of the beacon that pulls the fluorophore and quencher from close proximity, allowing fluorescence. the molecular beacon sequences are linked to the microspheres using a biotinylated thymidine in the stem sequence proximal to the quencher.
previously published molecular beacon and ssdna complementary control sequences were synthesized for hepatitis c virus (hcv) ( ) .
molecular beacons and ssdna complementary controls were designed for parainfluenza virus type (piv- ), respiratory syncytial virus (rsv), sars-hcov-m and sars-hcov-n using beacon designer . software (premier biosoft international, palo alto, ca). all beacons were designed with a thymidine on the stem proximal to the quencher molecule, allowing for the addition of biotinylated thymidine in the synthesis process (see table for beacon sequences). the molecular beacons were ordered with -fam as the reporter dye, bhq- as the dark quencher and the biotinylated thymidine on the stem (biolegio, the netherlands). the molecular beacons for the sars-hcov were also ordered with the cy /bhq- fluorescence/quencher pair. all molecular beacon and positive control sequences are listed in table . aliquots containing . mm streptavidin-coated microspheres (g. kisker gbr, steinfurt, germany) were used in a direct conjugation with the biotinylated molecular beacons. aliquots containing . and . mm biotinylated microspheres (g. kisker gbr) were used in a streptavidin-bridged bead design. streptavidin ( ml of mg/ml; roche) was diluted to a volume of ml in facsflow solution ( mm phosphate-buffered saline, mm nacl, ph . ; becton-dickinson), the biotinylated beads ( ml of . % w/v) were added to the buffer and this mixture was incubated for min at room temperature. the beads were then washed twice with facsflow to eliminate unbound streptavidin. the biotinylated molecular beacon ( ml of mm) was added to the streptavidin-bridged or streptavidin-coated beads ( ml of . % w/v), diluted to a total of ml in facsflow and incubated for min at room temperature. these beadcons were washed twice in facs-flow and stored at room temperature until use. control samples were analyzed using oligonucleotides complementary to the hybridizing loop sequence ( table ) . the beads were washed with facsflow and aliquoted to · beads/test in ml. 
the complementary nucleic acid ( ml of a mm stock) was then added to the test beads. in the multiplex detection experiment, the test sample contained . ml of the positive oligo dna ( mm stock) diluted in . ml of a complex mixture of oligonucleotides (equimolar levels of mm each, equalling a mm total concentration; sequences listed in supplementary table ). for the mismatch analysis, ml of a nm stock for each oligonucleotide was used (sequences listed in supplementary table ). the samples were hybridized with gentle agitation for min at room temperature before reading. an aliquot of ml of facsflow was added to each tube to run the sample on the flow cytometer. a minimum of · events were collected for each sample (flow cytometry conditions for the facscalibur machine are listed in supplementary table ). the data were analyzed using cellquest . (bd biosciences, san jose, ca). a positive threshold was set for each beadcon, based upon the highest fluorescent point seen in the negative control sequence (for all beadcons, an hiv complementary sequence was used; the sequence is listed in supplementary table ). mean fluorescence intensities (mfis) were recorded for the fl ( -fam) and fl (cy ) channels on gated beadcon populations.
a marked shift in fluorescence intensity is seen using flow cytometry after the addition of loop-specific sequence
beadcons were prepared as described in materials and methods to test whether the system is functional. a molar excess of biotinylated molecular beacon specific for hcv was added to . mm streptavidin-bound microspheres. it is important to note that the biotinylated thymidine in the molecular beacon is on the stem proximal to the quencher molecule as described previously ( ) . target dna was then added to the washed beads to determine whether there was a shift in the mean fluorescence of the beads. an hcv-specific positive control dna yielded an mfi of . - . compared with an mfi of . - . for an hiv target sequence (see figure ).
therefore, the signal-to-noise ratio (stnr) for the molecular beacons with a direct conjugation to the beads was . , higher than the - stnr of other molecular beacon systems that use a solid or immobilized phase ( , ) , but lower than the standard stnr that is typically seen in solution ( , , ). the commercial availability of biotinylated beads led us to test the possibility of using streptavidin to bridge the beacon and bead. such an approach has previously been used to increase the signal of molecular beacons as functional biosensors ( ) . streptavidin was added to . mm biotinylated beads in excess to limit the possibility of bead cross-linking, with a high enough level of streptavidin so that no two beads would share a single streptavidin molecule. initial experiments with lower streptavidin levels revealed the production of a doublet population, formed when a single streptavidin molecule bridged two biotinylated beads. when prepared and thoroughly washed beads were hybridized with the hcv-specific positive control dna, an mfi of . - . was seen compared with an mfi of . - . for the unrelated hiv control. therefore, an stnr of . was achieved through the use of a streptavidin bridge, comparable with the typical, optimized results in solution for this fluorophore/quencher set. an asymmetrical pcr product derived from an hcv-infected serum sample yielded an mfi of . - . , for an stnr of . (supplementary figure ) . the reduction in stnr is probably due to self-annealing of the amplified asymmetric pcr strand; however, qualitative gating showed that the binding of the single-stranded product was much more effective than that of the symmetric pcr product (negative using the current hybridization conditions). all the test beadcons were assayed for their dynamic detection range.
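The positive threshold and the stnr values reported above can be illustrated with a short sketch. It assumes that the stnr is the ratio of the mfi obtained with the complementary target to the mfi obtained with the unrelated (hiv) control, and that the mfi is an arithmetic mean; the analysis software may report a different statistic. The function names and the event values in the usage note are hypothetical.

```python
from statistics import mean


def positive_threshold(negative_events):
    """Threshold set at the highest fluorescence seen in the negative control."""
    return max(negative_events)


def signal_to_noise(target_events, negative_events):
    """STNR taken here as MFI(target) / MFI(negative control) (an assumption)."""
    return mean(target_events) / mean(negative_events)


def fraction_positive(events, threshold):
    """Fraction of gated bead events whose fluorescence exceeds the threshold."""
    return sum(1 for value in events if value > threshold) / len(events)
```

For example, with made-up negative-control events `[2, 3, 4, 3]` and target events `[30, 36, 24, 30]`, the threshold is 4, every target event is positive, and the stnr is 10.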
after an initial -fold dilution of a mm stock, -fold dilutions of positive oligonucleotide controls were made in facsflow buffer and ml were added to the beadcons that were built using the streptavidin bridging. it must be noted that the pmt voltage setting was slightly increased for the fl channel (from to ) to increase the mfi and, thus, differences between the beads with little or no positivity. a representative detection curve is shown in figure . for this example, the limit of detection ( sds above the background mfi of . ) for this beadcon specific for sars-n was found to be fmol at an mfi of . . the dynamic detection of this assay extended to pmol, revealing a working range of nearly four orders of magnitude. the other beadcons tested using this method (including those for hcv, piv- , rsv and sars-m) showed an average limit of detection at fmol (range: - fmol). previous work with molecular beacons demonstrates that single nucleotide polymorphisms can be detected in pcr assays ( ) . using a previously published hcv sequence as the positive control, primers with one, three and five mismatches were studied for their binding characteristics (sequences listed in supplementary table ; for results, see figure ). using this assay under initial conditions, a single nucleotide difference could not be statistically distinguished from the positive control sequence. the same experiment was repeated using a min incubation at c for hybridization, with or without the addition of mm mgcl . these hybridization conditions allowed a better resolution of nucleotide mismatches, although there was a small loss in the overall stnr in the assay (see figure a ). finally, the analysis was repeated using the sars-n beadcon. using the new hybridization conditions ( c with mm mgcl ), the single nucleotide mismatch for sars-n generated a fluorescence that was % lower than the complementary positive control. 
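The limit-of-detection rule described above (a fixed number of sds above the background mfi, applied to a serial dilution) can be sketched as follows. The multiplier and the dilution factor are not legible in this text, so both are left as parameters; the function names and the numbers in the usage note are hypothetical.

```python
from statistics import mean, stdev


def serial_dilution(start_amount, factor, steps):
    """Amounts obtained by repeatedly diluting start_amount by a fixed factor."""
    return [start_amount / factor ** i for i in range(steps)]


def detection_cutoff(background_mfis, n_sd):
    """MFI cutoff: background mean plus n_sd sample standard deviations."""
    return mean(background_mfis) + n_sd * stdev(background_mfis)


def limit_of_detection(amount_mfi_pairs, background_mfis, n_sd):
    """Smallest target amount whose MFI exceeds the cutoff, or None if none does."""
    cutoff = detection_cutoff(background_mfis, n_sd)
    detected = [amount for amount, mfi in amount_mfi_pairs if mfi > cutoff]
    return min(detected) if detected else None
```

With background replicates `[3.0, 3.2, 2.8]` and `n_sd = 3`, the cutoff is about 3.6, so a dilution series with mfis of 50.0, 12.0, 4.1 and 3.1 at 1000, 100, 10 and 1 units would give a limit of detection of 10 units. The working range is then the span between this value and the highest amount tested.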
a nt mismatch showed a % loss of fluorescence compared with the positive control, at a level very similar to that of the complete mismatch oligo from hiv (see figure b) . to test whether both multiple bead sizes and fluorochromes could be used simultaneously, molecular beacons that incorporated the cy /bhq- fluorochrome/quencher set were synthesized for the sars-hcov-m and -n open reading frames. the beadcons for these targets were constructed using the same conditions as for the other probes, with conjugation to streptavidin-bridged . and . mm beads. beadcons against rsv ( . mm) and piv- ( . mm) were also constructed using the -fam/bhq- fluorochrome/ quencher set. an aliquot of pmol of each pathogen oligo sequence was diluted in pmol of a mixture of oligonucleotides with no specificity for the beacon sequence (oligonucleotide sequences are listed in supplementary table ). when these dnas were added to the beadcons mixture of four targets, the corresponding beadcon showed a fluorescent shift above background, confirming presence of that target in the tube (see figure ). it should be noted that the 'sars' tube contained pmol of both sars-m and sars-n oligonucleotides, and both beadcons showed a positive shift in fluorescence. without specific target addition, the beadcons failed to fluoresce above background. the mfi, stnr and corresponding sds are listed for three experimental replicates of the shown, representative dot plots. all fluorescence shifts were significant to the levels shown in the figure by utilizing the unpaired t-test using welch's correction. the development of a highly accessible and easily adaptable multiplex system for the detection of pathogens remains the ultimate goal for the molecular diagnostic laboratory. realtime pcr has been a very useful tool in the research field as it allows for the rapid, simultaneous detection of pathogens in multiplex. 
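The unpaired t-test with Welch's correction mentioned for the replicate fluorescence shifts can be sketched with the standard library alone. This block computes only the test statistic and the Welch-Satterthwaite degrees of freedom; converting them to a p-value needs the t distribution, which a statistics package would supply. The function name and the replicate values in the usage note are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error contributed by each group.
    se_a = variance(sample_a) / n_a
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, df
```

For three made-up replicate mfis per group, `welch_t([10.0, 12.0, 14.0], [1.0, 2.0, 3.0])` gives a t statistic of about 7.75 with roughly 2.94 degrees of freedom. Unlike Student's t-test, this form does not assume that the target and background groups share a variance.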
unfortunately, the multiplex pcr system is complex for assay development, as primer/probe sets must be matched for primer melting temperature (t m ), probe t m , amplicon length and amplicon t m . obviously, each additional target also adds another level of complexity to the assay when it comes to primer dimerization and mispriming. microarrays will probably allow for the rapid screening of thousands of possible pathogens, but the cost, equipment and expertise make current routine use impossible, and this type of solid-phase array technology will not be practical as a clinical diagnostic for some years. there remains a need for the continual development and refinement of assays that can be used to detect nucleic acids in a multiplexed format. we have developed molecular beacon-labelled microspheres with a read-out on the flow cytometer for the multiplexed detection of nucleic acids in solution. a bridging approach allowed the specific binding of the beadcon by the proper complementary dna sequence. the bridging of the beadcon using a streptavidin molecule increased the stnr -fold to levels that are similar to those seen for molecular beacons in solution. the commercial source of the beads that were used in these studies confirmed that the streptavidin beads carried three times as many binding sites for biotin as the biotinylated beads did for streptavidin. therefore, if each streptavidin of the bridged molecule allows complete accessibility of all three available sites for biotin, the stnr should be roughly about the same. as the stnr was twice as high for the bridged molecule, it is likely that the bridging helps to overcome the surface effects that are often seen using solid state nucleic acid detection systems ( ) . as the molecular beacons are more accessible for the target, there is a more effective loss of fret and, thus, an increase in maximal fluorescence. 
the specificity and sensitivity of each molecular beacon must be optimized individually for this assay. it appears that designs currently used for pcr must be modified if increased specificity is required. for the hcv beadcon used in our studies at room temperature, a nt mismatch was readily detectable at a level that was virtually indistinguishable from the matched positive control. because this reaction was hybridized and measured at room temperature, mismatched hybridization is expected: it is difficult to distinguish mismatches at temperatures below the 'window of discrimination' for each beacon ( , , ) . within this temperature window, it is possible to discriminate a single nucleotide polymorphism; however, as the temperature decreases to c, the difference in fluorescence between mismatched targets is negligible. currently designed beacons are used to measure fluorescence at an elevated annealing temperature, which prevents mismatched target duplexes from forming and, thus, from fluorescing ( ) . an assay that tolerates mismatches can be useful for detecting nucleic acids that carry point mutations relative to their homologous sequence. alternatively, the dynamics of the binding reaction could be altered by changing the gibbs free energy assignments of the stem and loop regions of the molecular beacon. the sars-n molecular beacon that we designed showed better discrimination of mismatched targets, although it appears that the spacing of the mismatches may play a role. the binding properties can be modified by altering the backbone of the beacon (such as for a peptide nucleic acid), substituting high-affinity nucleotides for normal ones (such as in a locked nucleic acid), or by changing the lengths or recognition portions of these sequences ( ) . these types of molecular beacons will probably be used more in the future, as these complex synthesis technologies improve and become more widely available. in assay development for each group of targets, it will be necessary to consider whether single nucleotide polymorphism detection should be developed through these means, or whether it is better for the system to tolerate small nucleotide changes so that detection remains possible even when the target sequence contains mutations. unique sequences can be selected for each target, allowing specific detection. in addition, redundancy can be included within the assay by conjugating molecular beacons for multiple sequences of the same pathogen or gene onto a single type of bead. the further development of this assay to detect a single target from a complex mixture may be important in several fields of use. we hope to facilitate differential diagnosis, genetic testing, genotyping and gene expression studies through the use of this technology.
figure . qualitative detection of three respiratory virus sequences in multiplex. a beadcon array was generated using two bead sizes and two fluorescence markers as described in the text. the addition of piv- , rsv or sars-h-cov (genes m and n) dna control sequences that were diluted in a complex dna mixture allowed the specific identification of the target compared with negative controls. in effect, each panel shows an internal negative control, as each test was carried out using a complex dna mixture. the black arrows note the positive shift in fluorescence.
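the array layout described for the multiplex figure (two bead sizes combined with two fluorescence markers) can be sketched as a small lookup plus a positive call against the negative control. this is an illustrative sketch only: the bead and marker labels, the fluorescence values and the twofold threshold are assumptions, not values from this study.

```python
from itertools import product
from statistics import median

BEAD_SIZES = ("small", "large")      # hypothetical labels for the two bead sizes
MARKERS = ("marker_1", "marker_2")   # hypothetical labels for the two markers


def build_array(targets):
    """Assign each target to one distinguishable (size, marker) bead population."""
    codes = list(product(BEAD_SIZES, MARKERS))
    if len(targets) > len(codes):
        raise ValueError("more targets than distinguishable bead populations")
    return dict(zip(codes, targets))


def call_positives(array, sample, negative, min_ratio=2.0):
    """Return targets whose median beacon fluorescence in the sample is at least
    min_ratio times the median of the matching negative-control population."""
    hits = []
    for code, target in array.items():
        baseline = median(negative[code])
        if baseline <= 0:
            raise ValueError(f"non-positive baseline for bead population {code}")
        if median(sample[code]) / baseline >= min_ratio:
            hits.append(target)
    return hits


# Hypothetical data: only the population carrying the RSV beacon shifts.
array = build_array(["PIV", "RSV", "SARS-M", "SARS-N"])
negative = {code: [10.0, 12.0, 11.0] for code in array}
sample = dict(negative)
sample[("small", "marker_2")] = [40.0, 45.0, 50.0]
print(call_positives(array, sample, negative))
```

two sizes and two markers give four distinguishable populations; adding a size or a marker multiplies the number of addressable targets, and redundancy can be modelled by mapping several beacons to the same bead code.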
identification of a novel coronavirus in patients with severe acute respiratory syndrome
a novel coronavirus associated with severe acute respiratory syndrome
viral discovery and sequence recovery using dna microarrays
kinetic pcr analysis: real-time monitoring of dna amplification reactions
quantification using real-time pcr technology: applications and limitations
molecular beacons: probes that fluoresce upon hybridization
multiplex detection of single-nucleotide variations using molecular beacons
multicolor molecular beacons for allele discrimination
in situ visualization of messenger rna for basic fibroblast growth factor in living cells
real time detection of dna.rna hybridization in living cells
molecular beacon probes combined with amplification by nasba enable homogeneous, real-time detection of rna
multiplex detection of four pathogenic retroviruses using molecular beacons
ultrasensitive optical dna biosensor based on surface immobilization of molecular beacon by a bridge structure
molecular beacons for dna biosensors with micrometer to submicrometer dimensions
hybridization-based unquenching of dna hairpins on au surfaces: prototypical ''molecular beacon'' biosensors
multiplexed, particle-based detection of dna using flow cytometry with dna dendrimers for signal amplification
real-time rt-pcr for quantitation of hepatitis c virus rna
a fiber-optic evanescent wave dna biosensor based on novel molecular beacons
molecular-beacon-based array for sensitive dna analysis
thermodynamic basis of the enhanced specificity of structured dna probes
molecular beacons: novel fluorescent probes
using molecular beacons to detect single-nucleotide polymorphisms with real-time pcr
target discrimination by surface-immobilized molecular beacons designed to detect francisella tularensis
structure-function relationships of shared-stem and conventional molecular beacons
this work was supported by funding from the 'ministero della salute' of the italian government, ricerca
corrente e finalizzata. funding to pay the open access publication charges for this article was provided by 'ministero della salute' of the italian government, ricerca finalizzata. supplementary material is available at nar online. key: cord- -d dqr e authors: yang, jie; cheng, zhenyun; zhang, songliu; xiong, wei; xia, hongjie; qiu, yang; wang, zhaowei; wu, feige; qin, cheng-feng; yin, lei; hu, yuanyang; zhou, xi title: a cypovirus vp displays the rna chaperone-like activity that destabilizes rna helices and accelerates strand annealing date: - - journal: nucleic acids res doi: . /nar/gkt sha: doc_id: cord_uid: d dqr e for double-stranded rna (dsrna) viruses in the family reoviridae, their inner capsids function as the machinery for viral rna (vrna) replication. unlike other multishelled reoviruses, cypovirus has a single-layered capsid, thereby representing a simplified model for studying vrna replication of reoviruses. vp is one of the three major cypovirus capsid proteins and functions as a clamp protein to stabilize cypovirus capsid. here, we expressed vp from type helicoverpa armigera cypovirus (hacpv- ) in a eukaryotic system and determined that this vp possesses rna chaperone-like activity, which destabilizes rna helices and accelerates strand annealing independent of atp. our further characterization of vp revealed that its helix-destabilizing activity is rna specific, lacks directionality and could be inhibited by divalent ions, such as mg( +), mn( +), ca( +) or zn( +), to varying degrees. furthermore, we found that hacpv- vp facilitates the replication initiation of an alternative polymerase (i.e. reverse transcriptase) through a panhandle-structured rna template, which mimics the ′- ′ cyclization of cypoviral positive-stranded rna. 
given that the replication of negative-stranded vrna on the positive-stranded vrna template necessitates the dissociation of the ′- ′ panhandle, the rna chaperone activity of vp may play a direct role in the initiation of reoviral dsrna synthesis.
reoviruses are a family of viruses (reoviridae) that contain a segmented double-stranded rna (dsrna) genome packaged by a single- or double-layered inner capsid and include several important pathogens that are responsible for diseases in humans, livestock animals, insects and plants ( ) . a common characteristic shared by reoviruses as well as many other dsrna viruses is that their inner capsid contains several copies of rna-dependent rna polymerases (rdrps) and mrna-capping enzymes, and the process of reoviral positive-strand (+) rna (mrna) transcription from minus-strand (−) rna, followed by mrna capping, takes place within the reoviral inner capsid ( , ) . the nascently synthesized and capped (+)rna is then released into the cytoplasm of infected cells for protein translation. after that, reoviral (+)rnas are encapsidated into newly assembled inner capsids, and are used by rdrps as templates to synthesize genomic dsrna segments within reoviral inner capsids ( , , ) . thus, unlike most single-stranded rna (ssrna) viruses, reoviruses use their inner capsids as the machinery for viral rna (vrna) synthesis and the shield for evading host antiviral defenses ( , , ) . cypovirus (cytoplasmic polyhedrosis virus, cpv), which contains genomic dsrna segments, is one of the genera within the family reoviridae and is further classified into types (cypovirus - ) based on the electrophoretic migration profiles of their genome segments ( ) (http://www.ictvonline.org/virustaxonomy.asp?version= ). the cpv particles are embedded in polyhedra, which allows them to survive unfavorable external environments ( ) .
*to whom correspondence should be addressed. tel: + ; fax: + ; email: zhouxi@whu.edu.cn
unlike other multishelled reoviruses, cypovirus is the simplest genus of the family reoviridae because it has only a single-layered capsid ( ) , and this characteristic makes it an ideal and simplified model for studying the mrna/dsrna synthesis and mrna-capping mechanisms of reoviruses and probably other dsrna viruses ( ) . for this reason, extensive molecular and structural studies have focused on cpvs in recent years ( ) ( ) ( ) ( ) ( ) . however, our knowledge about the molecular mechanisms that orchestrate the segmented dsrna genome replication of cpvs and other reoviruses is still limited. the cpv capsid comprises three major capsid proteins: vp , vp and vp ( ) . previous studies have shown that each cpv capsid contains copies of vp , copies of vp and copies of vp . vp is the shell protein that forms the cpv capsid shell; vp is the mrna-capping enzyme; and vp serves as the clamp protein to enhance the stability of the cpv capsid shell ( , ) . in addition, one cypoviral rdrp, vp , is located at the inner surface of the cpv capsid shell at each -fold axis ( , ) . moreover, cheng and colleagues recently found that during cypoviral mrna transcription, both vp and vp undergo conformational changes, resulting in an enlarged capsid chamber and a wider channel in the capsid for mrna transport and capping ( ) . for vp , no structural change has been detected ( ) , and it was not known whether vp has any activity other than stabilizing the capsid shell. for rna viruses including reoviruses, vrna molecules require proper secondary and tertiary structures to form diverse cis-acting elements within their untranslated region, -untranslated region or protein coding region, which are important for vrna functions such as translation, replication and encapsidation ( ) ( ) ( ) .
moreover, the interactions (also termed 'cyclization') between the - and -termini of vrnas have been recognized as an important prelude for efficient vrna replication of many (+)rna viruses like flaviviruses, (−)rna viruses like hantaviruses and dsrna viruses like reoviruses ( ) ( ) ( ) . for reoviruses, the - cyclization of reoviral (+)rna forms a panhandle-like structure ( , ) . this - panhandle is required for efficient dsrna replication, probably by allowing reoviral rdrp to recognize the -terminal of the (+)rna template to initiate (−)rna synthesis ( ) . on the other hand, like other - cyclized vrnas ( ) , the panhandle should be disrupted when rna replication proceeds on the cyclized template ( ) . the formation and dissociation of rna tertiary structures should be highly regulated for rnas to function properly in various processes. however, correct folding or unfolding of rnas is challenging, because rnas are thought to be easily trapped in local intermediate structures that are thermodynamically stable ( , ) . in response, cells or viruses encode various rna remodeling proteins, generally including adenosine triphosphate (atp)-dependent rna helicases and atp-independent rna chaperones, which are proposed to help overcome the thermodynamic barriers of kinetically trapped rna molecules. rna helicases are thought to be involved in most atp-dependent structural rearrangements of rnas ( ) . on the other hand, rna chaperones are a heterogeneous group of proteins that are able to destabilize or unwind rna helices and accelerate the formation of correctly folded rna structures by helping misfolded rnas escape kinetic traps. for the viruses having vrna cyclization, both hantavirus nucleocapsid (n) protein and flavivirus core (capsid) protein are rna chaperones, and hantavirus n has also been reported to unwind the - panhandle structure in vitro ( ) . however, no capsid protein of cpvs or other reoviruses has been found to be an rna chaperone.
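the panhandle described above arises when the two termini of one rna strand base-pair with each other. as a rough illustration, the sketch below counts how many consecutive base pairs the two ends of a sequence could form, walking inward from both termini. it is a simplification: it considers only uninterrupted pairing from the very ends, ignores thermodynamics, and the sequences and the minimum loop size are hypothetical. structure prediction tools remain the appropriate choice for real templates.

```python
WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}


def panhandle_stem_length(rna, allow_gu=False, min_loop=3):
    """Count consecutive base pairs formed between the two termini of one strand.

    Pairing starts at the outermost nucleotides and moves inward, stopping at
    the first non-pairing position or when fewer than min_loop unpaired
    nucleotides would remain between the two arms.
    """
    s = rna.upper().replace("T", "U")
    i, j = 0, len(s) - 1
    pairs = 0
    while j - i > min_loop:
        a, b = s[i], s[j]
        paired = WATSON_CRICK.get(a) == b
        if allow_gu and {a, b} == {"G", "U"}:
            paired = True  # optional G-U wobble pair
        if not paired:
            break
        pairs += 1
        i += 1
        j -= 1
    return pairs


# Hypothetical sequences, not taken from any cypovirus segment.
print(panhandle_stem_length("GGGAAAACCC"))
print(panhandle_stem_length("AGUAAAAAAAAGCU", allow_gu=True))
```

a longer terminal stem implies more base pairs that must be opened before a polymerase can proceed along the cyclized template, which is the step a helix-destabilizing protein would be expected to assist.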
the type helicoverpa armigera cypovirus (hacpv- ) was initially isolated by our laboratory in from a mixture of hacpvs and can quickly cause lethal disease in h. armigera, which is one of the most serious agricultural pests in china ( ) . thus far, the rna genome of hacpv- has been completely sequenced ( , ) . hacpv- rna segment contains a single open reading frame that encodes a -kda protein, and further sequence and structural analysis using bioinformatic tools has revealed that this -kda protein is the vp capsid protein ( , , ) . in this study, we expressed hacpv- vp in a eukaryotic expression system and determined that this cpv vp possesses an rna chaperone-like activity to atp-independently destabilize rna helices and accelerate strand annealing. our further characterization of vp revealed that its helix-destabilizing activity is rna specific, lacks directionality and could be inhibited by divalent metallic ions, such as mg + , mn + , ca + or zn + , to various degrees. moreover, we found that hacpv- vp could facilitate the transcription initiation of an alternative polymerase (i.e. reverse transcriptase) through a cpv panhandle-structured rna template, thereby strongly suggesting a direct role of the rna chaperone activity of vp in the initiation of cypoviral dsrna replication. standard procedures were used for extraction of viral genome rna and reverse transcription (rt)-polymerase chain reaction ( ) . a cdna fragment of hacpv- rna segment open reading frame (vp ) (accession no. dq ) was inserted into the vector pfastbac tm htb-mbp that was originated from pfastbac tm htb (invitrogen, carlsbad, ca), in which the maltose binding protein (mbp) was n-terminally fused by us as previously described ( , ) . point mutations were introduced via polymerase chain reaction-mediated mutagenesis as described previously ( ) ( ) ( ) . the primers used in this study are shown in supplementary table s . 
the constructed plasmids were subjected to the bac-to-bac system (invitrogen) to express wild type and mutant mbp fusion vp (mbp-vp ) proteins. the expression and purification of recombinant mbp-vp and its derivatives as well as the negative-control protein mbp were carried out as previously described ( , ) . briefly, sf cells were infected with the recombinant baculoviruses and harvested at h postinfection. cell pellets were resuspended, lysed by sonication and subject to centrifugation for min at g to remove debris. the protein in the supernatant was purified using amylose affinity chromatography (new england biolabs, ipswich, ma), according to the manufacturer's protocol. fractions containing the recombinant protein were combined, followed by concentration using an amicon ultra- membrane column (millipore, schwalbach, germany). after that, the buffer was exchanged to mm -[ -( -hydroxyethyl)- -piperazinyl] ethanesulfonic acid (hepes)-koh (ph . ), mm nacl, and the protein was stored at À c in aliquots. all proteins were quantified by the bradford method. sodium dodecyl sulphate-polyacrylamide gel electrophoresis and western blot analysis sodium dodecyl sulphate-polyacrylamide gel electrophoresis (sds-page) and western blot assays were performed as described previously ( ) . the anti-mbp polyclonal antibody was purchased from new england biolabs, and used at the dilutions of : . the d structure of hacpv- vp was modeled by submitting its amino acid sequence to the hmmstr/ rosetta server [from robetta, university of washington (http://robetta.bakerlab.org/)] ( ) . five models were obtained, and the best one was chosen as a template based on its score, assessed by submitting it to the swiss-model server [from swiss institute of bioinformatics and the biozentrum, university of basel, switzerland (http://swissmodel.expasy.org)]. the figure of the modeled hacpv- vp d structure was drawn by pymol program . 
(delano scientific llc, south san francisco, ca) from coordinate file. the surface representation of modeled vp on cpv particle are drawn by pymol . based on the atomic cryoem structure of bmcpv capsid (pdb number izx) by replacing bmcpv vp with hacpv- vp . preparation of oligonucleotide helix substrate rna or dna helix substrates were prepared by annealing two complementary nucleic acid strands. one strand was labeled at -end with hexachloro fluorescein (hex) (takara, dalian), and the other strand was unlabeled. the two strands were mixed in a proper ratio, and annealed through heating and gradually cooling. all unlabeled dna strands were synthesized by invitrogen, and hexlabeled dna and rna strands were purchased from takara. unlabeled rna strands were synthesized by us from the in vitro transcription using t rna polymerase (promega, madison, wi). the transcribed rna strands were purified by poly-gel rna extraction kit (omega bio-tek, guangzhou, china) according to the manufacturer's instruction. standard rna helix substrate was annealed with rna and rna , d*/r substrate was annealed with dna and rna , r*/d substrate was annealed with rna and dna , d*/d substrate was annealed with dna and dna , -tailed substrate was annealed with rna and rna , -tailed substrate was annealed with rna and rna , and blunt-ended substrate was annealed with rna and rna . the -tailed rna helix substrates with different lengths of -tails were prepared by annealing rna with and rna , rna , rna , rna and rna . the sequences of all dna and rna strands are listed in supplementary table s . gel mobility shift was performed in mm hepes-koh (ph . ), mm nacl, in a volume of ml with indicated amount of protein and . pmol of ssrna (rna ) or blunt-ended dsrna (rna /rna ). rna is hex-labeled. reactions were incubated for min at room temperature. for the competition experiments, unlabeled ssrna or dsrna competitor was added together with the hex-labeled rna probe (rna ) to the binding reaction. 
the reactions were terminated by the addition of . ml  sample buffer [ mm tris-hcl (ph . ), % glycerol and . % bromophenol blue]. the nucleic acidprotein complexes were separated by electrophoresis on % agarose gels, and gels were scanned by a typhoon imager (ge healthcare, piscataway, nj). the values of hill coefficient, an indicator of the cooperativity of rna binding, were calculated by applying hill transformation to the rna binding data obtained in repeated experiments. chemical cross-linking assays were performed as previously described ( ) . mbp-tagged proteins were chemically cross-linked in the cross-linking buffer ( mm hepes [ph . ], mm nacl and final . % [vol/vol] glutaraldehyde) for min. the complexes were then analyzed via % sds-page and western blots. to detect which form of mbp-vp binds rna, mbp-vp were first cross-linked with the hex-labeled rna probe using a -nm ultraviolet (uv) light for min, and the light source was cm away from the samples. after that, the samples were subjected to chemical cross-linking as described above. the complexes were then analyzed via % sds-page and scanned by a typhoon imager (ge healthcare). the standard helix destabilizing assay was performed as previously described ( , ) with minor modifications. in brief, pmol of protein and . pmol of helix substrate were added to a mixture containing a final concentration of mm hepes-koh (ph . ), . % bovine serum albumin (bsa), mm nacl, mm mgcl , mm dithiothreitol (dtt) and u rnasin (promega) and incubated at c for min unless otherwise indicated. the reactions were terminated by adding u proteinase k and . ml  loading buffer [ mm tris-hcl (ph . ), % glycerol and bromophenol blue]. mixtures were electrophoresed on % native-page gels. gels were scanned with a typhoon imager (ge healthcare). the ratio of released single strands versus the total substrates was quantified with imagequant software. the standard rna strand hybridization assay was performed as previously described ( ) . 
in brief, indicated amount of mbp-vp was incubated with hex-labeled and unlabeled rna strands ( . pmol each strand) in c in a buffer containing mm hepes-koh (ph . ), . mm mgcl , mm dtt, . % bsa and u rnasin. the reaction was terminated and analyzed as described above. for the hybridization assay of the stem-loop structured rna strands, the sequences of the two stem-loop rna strands were indicated in figure a , and the stem-loop structures were predicted by mfold (http://mfold.rna.albany.edu/?q=mfold). for the hybridization of the hex-labeled -nt and unlabeled -nt rna strands, their sequences are listed in supplementary table s . then mixtures were also resolved on % native-page gels ( figure ) or % native-page gels ( figure ), following by scanning with a typhoon imager (ge healthcare). the rt assay was previously described by mir et al. ( ) with some modifications. briefly, the structured rna mimicking cypovirus panhandle was generated using nt from the -end and nt from the -end of hacpv- rna segment . the sequences of the panhandle and dna primer were indicated in figure a . the mixture of . pmol panhandle, pmol dna primer and . mm deoxynucleoside triphosohates (dntps) with digoxigenin (dig)-dutp was incubated with indicated amount of mbp-vp or mbp alone without heating and cooling, while for the positive control, the mixture was heated at c for min and placed on ice to allow the annealing of the primer to the panhandle. after incubation, ml of the rt buffer, ml of . m dtt, u rnasin and . ml m-mlv reverse transcriptase (promega) were supplemented and reacted at c for min. reactions were terminated by heating at c for min. the samples were analyzed on % urea page and northern blot. the northern blot was performed as previously described ( ) . 
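the hill transformation mentioned in the gel mobility shift methods above can be sketched as follows. this is a minimal illustration and not the analysis script used in this study. it assumes that the fraction of rna bound (theta) has already been measured from band intensities at each protein concentration, and it fits log10(theta / (1 - theta)) against log10 of the protein concentration by ordinary least squares. the slope of that line is the hill coefficient, and a value above 1 is conventionally read as positive cooperativity. the function name and example values are hypothetical.

```python
import math


def hill_coefficient(protein_conc, fraction_bound):
    """Estimate the Hill coefficient as the least-squares slope of
    log10(theta / (1 - theta)) versus log10(protein concentration)."""
    xs, ys = [], []
    for conc, theta in zip(protein_conc, fraction_bound):
        if conc <= 0 or not 0 < theta < 1:
            continue  # the transformation is undefined at these points
        xs.append(math.log10(conc))
        ys.append(math.log10(theta / (1 - theta)))
    if len(xs) < 2:
        raise ValueError("need at least two usable data points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all usable concentrations are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical binding data generated from an ideal Hill curve with n = 2.
concentrations = [0.25, 0.5, 1.0, 2.0, 4.0]
bound = [c ** 2 / (1 + c ** 2) for c in concentrations]
print(hill_coefficient(concentrations, bound))
```

fitting separate concentration ranges, as done for the low and high protein ranges in this study, amounts to calling the function on the corresponding subsets of the data.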
among hacpv- rna segments, segment encodes a -kda protein that is homologous with vp proteins of other cpvs based on amino acid sequences ( figure a ) but shares no sequence homology with any other non-vp cpv proteins (data not shown). to further compare this protein with other cpv vp proteins, we modeled its d structure by submitting its amino acid sequence to the hmmstr/rosetta server ( ) and found that its modeled structure is highly similar to that of bmcpv- vp ( figure d ), which has been solved and extensively studied ( ) . altogether, these results show that this protein is the vp capsid protein of hacpv- . to determine the potential function of hacpv- vp , we expressed this protein as an n-terminal maltose binding fusion protein (mbp-vp ) in a eukaryotic (baculovirus) expression system and then purified the protein ( figure b and c). because many capsid proteins of rna viruses have rna binding activity ( , , ) , we sought to determine whether vp could also bind to rna. to this end, a gel mobility shift assay with a hex-labeled -nt ssrna (rna ) was conducted. mbp-vp did bind to rna (figure a, lane ) , showing that vp has ssrna binding activity. moreover, to determine whether vp also has dsrna binding activity, we constructed the rna helix by annealing hex-labeled rna with a nonlabeled rna (rna ) that is complementary to rna and found that vp efficiently shifted the rna helix ( figure a , lane ). after determining that mbp-vp can bind to both ssrna and dsrna, it is intriguing for us to examine whether this protein has a higher affinity for ssrna or for dsrna. to this end, rna probe competition assays were performed. briefly, . pmol hex-labeled ssrna probe and pmol mbp-vp were incubated in the presence of increasing amounts of unlabeled ssrna (rna ) or dsrna (rna /rna ). as shown in figure b , only the ssrna competitor efficiently competed with hexlabeled rna , while dsrna had a minimal competing effect, indicating that mbp-vp has a higher affinity for ssrna. 
moreover, we incubated . mm of hex-labeled rna with increasing concentrations of mbp-vp ( . - . mm), and our result showed that mbp-vp binds to ssrna in a dose-response manner ( figure c ). the rna binding data were then quantified, and a hill transformation was applied to determine whether the binding of mbp-vp to ssrna is cooperative [a > hill coefficient indicates positive cooperativity ( , ) ]. as shown in figure d , mbp-vp at low concentrations ( . - . mm) had a hill coefficient value of ~ . , whereas high protein concentrations ( . - . mm) had a higher hill coefficient value of ~ . . these results indicate that vp binds to ssrna in a cooperative manner. taken together, these results show that hacpv- vp has both ssrna and dsrna binding capacity, has higher affinity for ssrna and binds ssrna in a cooperative manner. we also examined whether mbp-vp can bind to dna, and our results showed that this protein also has ssdna and dsdna binding activities (supplementary figure s ) . after determining the rna binding activity of vp , we sought to examine the form of mbp-vp in solution. to this end, the in vitro chemical cross-linking assay was performed, followed by sds-page and western blot analysis. our result showed that mbp-vp has both monomeric and dimeric forms in solution ( figure a) . moreover, this result was further confirmed by the gel filtration assay (supplementary figure s ) . we next asked which form of mbp-vp binds to rna. to this end, mbp-vp was incubated with hex-labeled rna probe and then subjected to uv cross-linking. the purpose of this step is to label the vp proteins that are binding to rna. after that, the samples were subjected to chemical cross-linking with . % [vol/vol] glutaraldehyde, analyzed via sds-page and scanned by a typhoon to visualize the presence of hex-labeled rna. our results showed that both monomeric and dimeric forms of mbp-vp can bind to ssrna in vitro ( figure b ), indicating that the rna binding activity of vp is independent of its monomeric or dimeric form.
the d structure of hacpv- vp was modeled by the hmmstr/rosetta server, and drawn by pymol as described in 'materials and methods' section. the three mutation sites as well as n- and c-termini were labeled as indicated. (e and f) the surface representations of modeled vp , which indicate the orientation of vp on the cpv particle (e), or in the asymmetric unit formed by vp , vp and vp (f). the t on the vp is shown in blue. color-coded by protein subunits, vp is in yellow, vp has two conformers in cyan and magenta, while vp is in yellow and salmon. the maps of cpv particle and the asymmetric unit are drawn by pymol based on the atomic cryoem structure of bmcpv capsid (pdb number izx) by replacing bmcpv vp with hacpv- vp .
the finding that hacpv- vp contains both ssrna and dsrna binding activities led us to question whether this protein also has nucleic acid helix-destabilizing activity like flavivirus core (capsid) and hantavirus n do ( , ) . for this purpose, the hex-labeled rna and a long nonlabeled -nt rna (rna ) were annealed to generate a standard rna helix substrate with both ( bases) and ( bases) single-stranded tails ( figure a ). the helix-destabilizing assay was performed by incubating the standard rna helix substrate with purified mbp-vp in the standard destabilizing reaction mixture and then evaluating the substrate via gel electrophoresis. the hex-labeled rna strand was released from the rna helix substrate in the presence of mbp-vp ( figure a, lane ) , whereas the rna helix was stable when mbp alone was added to the reaction mixture (lane ). the boiled helix substrates were used as the positive controls for helix destabilization in this (lane ) and the subsequent assays. these results indicate that vp has rna helix-destabilizing activity.
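the helix-destabilizing assays are read out as the ratio of released single strands to total substrate, as stated in the methods. a minimal sketch of that calculation is given below. it assumes that the intensities of the released-strand band and the intact-duplex band have already been measured for each lane; the optional background subtraction, the function name and the example intensities are assumptions made for illustration and do not describe the imagequant workflow.

```python
def fraction_released(released_band, duplex_band, background=0.0):
    """Fraction of labeled strand released from the helix substrate.

    released_band: intensity of the free hex-labeled single strand
    duplex_band:   intensity of the intact helix substrate
    background:    lane background subtracted from both bands (assumed step)
    """
    released = max(released_band - background, 0.0)
    duplex = max(duplex_band - background, 0.0)
    total = released + duplex
    if total == 0:
        raise ValueError("no signal above background in this lane")
    return released / total


# Hypothetical band intensities for three lanes.
lanes = {
    "boiled control": (95.0, 5.0),
    "mbp alone": (4.0, 96.0),
    "mbp-vp": (70.0, 30.0),
}
for name, (released, duplex) in lanes.items():
    print(f"{name}: {fraction_released(released, duplex):.2f}")
```

because the value is a ratio within one lane, it is insensitive to lane-to-lane loading differences, which makes lanes from different conditions directly comparable.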
because some viral proteins with rna helix-destabilizing activities can also unwind dna helices or rna-dna hybrids, we sought to determine whether vp can destabilize nucleic acid helices containing dna. to this end, we constructed three different helix substrates: d*/d ( figure b ) by annealing hex-labeled dna and nonlabeled dna ; d*/r ( figure c ) by annealing hex-labeled dna and the long nonlabeled rna ; and r*/d ( figure d ) by annealing hex-labeled rna and nonlabeled dna . each substrate was incubated with mbp-vp under the conditions described in figure a . mbp-vp was unable to destabilize any of these dna-containing helices ( figure b-d) , showing that the helix-destabilizing activity of vp is rna specific. interestingly, we showed above that vp can bind both rna and dna (figure a and supplementary figure s ), suggesting that binding to nucleic acids is a prerequisite but not the only determinant for helix destabilization. to further confirm the helix-destabilizing activity of vp , we generated point mutations (t a, f a and q a) at the highly conserved residues within the consensus regions of cpv vp ( figure a ). these mutants were expressed in a eukaryotic expression system as mbp-fusion proteins, purified ( figure b ) and then assessed for their helix-destabilizing activities. our results showed that the f a or q a mutation dramatically reduced the destabilization of the standard rna helix substrate ( figure c, lanes and ) , whereas the t a mutation almost abolished the unwinding activity of vp ( figure c, lane ) . moreover, we examined the rna binding activities of these mutants, and found that these mutations resulted in the loss or dramatic decrease of rna binding ( figure d ). based on the modeled surface representation of these mutant residues on vp ( figure d -f), t is located on the surface of vp , while f and q residues are located within the protein, suggesting that these mutations inhibit the rna binding and helix-destabilizing activities of vp through different mechanisms.
to characterize the helix-destabilizing activity of vp , we designed three different rna helix substrates: one containing a single-stranded tail ( bases), one containing a single-stranded tail ( bases) and one with blunt ends ( figure a -c, left panels). each substrate was incubated with mbp-vp in the standard destabilizing reaction mixture. our results showed that both the -tailed and -tailed rna helices could be efficiently unwound by mbp-vp ( figure a and b) , whereas the blunt-ended rna helix could not be unwound ( figure c ). these experiments were independently repeated several times. overall, these results show that hacpv- vp destabilizes rna helices in both -to- and -to- directions. next, we sought to determine the impact of the length of the single-stranded tail on its helix-destabilizing activity. to this end, a series of rna helix substrates with different lengths ( - nt) of single-stranded tails were generated ( figure a ), and unwinding assays were performed by incubating pmol mbp-vp with . pmol indicated helix substrates for min ( figure b ) or for different time points ( , , , and min) ( figure c ). our results showed that all these single-stranded -tails could similarly support the helix destabilization by vp and that the length of the -tail had little effect on the helix destabilization. this phenomenon is consistent with our previous observation of ectropis obliqua picorna-like virus (eov) nonstructural protein c, which also displays rna chaperone activity ( ) . the assays above established the rna helix-destabilizing activity of hacpv- vp in the absence of atp. thus, we sought to determine whether the presence of atp has any effect on the unwinding activity of vp . for this purpose, the standard rna helix substrate with both and tails ( figure a ) was incubated with mbp-vp in the presence of increasing concentrations ( . - mm) of atp for min ( figure b ) or min ( figure c ).
in either condition, the presence of atp had no positive effect on the unwinding activity of vp ( figure b and c); in contrast, higher atp concentrations (> . mm) actually inhibited rna helix destabilization in a dose-response manner ( figure b , lanes - ; c, lanes - ). these experiments were independently repeated several times, and the gradual inhibitory effect of increasing concentrations of atp was plotted ( figure b and c, right panels). after determining that atp has an inhibitory effect on helix destabilization by vp , we sought to determine whether other nucleoside triphosphates (ntps) or dntps have similar or different effects. to this end, the destabilizing activity of vp was assessed in the presence of atp, gtp, ctp, utp, datp, dgtp, dctp or dttp at a final concentration of . mm. as shown in figure d , atp, datp, gtp or dgtp had an obvious inhibitory effect on the destabilizing activity of vp , but other ntps or dntps had negligible inhibitory effect. altogether, these data further confirm the atp independence of vp helixdestabilizing activity, thereby excluding the possibility that vp may contain atp-dependent rna helicase activity. to characterize the activity of vp in destabilizing rna secondary structures, we adapted a canonical assay, which was initially developed by destefano and colleagues for measuring the helix-destabilizing and annealing acceleration activities of vrna chaperones ( ) ( ) ( ) , for this cypoviral vp . for this purpose, two -nt complementary rna strands that form defined stem-loop structures were used ( figure a ), and one strand was hex labeled ( figure a, right) . hex-labeled ( . pmol) and nonlabeled ( . pmol) strands were mixed and incubated with mbp alone or mbp-vp in the standard destabilizing mixture, and then the hybridization of the two complementary strands was measured via gel electrophoresis ( figure b-d) . 
in the presence of mbp alone, little spontaneous hybridization was detected even as the reaction time increased (figure b, lanes - ), whereas the presence of pmol vp promoted the hybridization of the two rna strands (figure c, lanes - ). moreover, when the amount of mbp-vp was increased to pmol, a more dramatic stimulation of rna strand annealing was observed (figure d, lanes - ). these data show that vp has a helix-destabilizing activity that can unwind rna secondary structures and stimulate the formation of more stable rna hybrids.
to further examine the strand-annealing stimulation activity of vp , we generated a shorter hex-labeled rna strand ( nt in length) and a longer nonlabeled complementary rna strand ( nt in length; figure a). equal amounts ( . pmol) of these two strands were incubated for min in the presence of mbp alone or mbp-vp , and gel electrophoresis was used to detect annealing of the two complementary rna strands. annealing in the presence of mbp alone was almost undetectable (figure b, lane ), whereas the presence of pmol mbp-vp dramatically stimulated strand annealing (figure b, lane ). moreover, a nucleic acid chaperone normally functions well when it is present in large excess over its substrate ( ). to determine whether this also applies to vp , we incubated increasing amounts ( - pmol) of mbp-vp with the two rna strands ( . pmol each). as shown in figure b (lanes - ), increasing the amount of mbp-vp led to stronger stimulation of strand annealing in a dose-dependent manner. taken together, our data show that hacpv- vp has rna chaperone-like activity that can destabilize rna helices and stimulate rna strand annealing.
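the gel-based unwinding and annealing assays above are read out as the share of labeled rna that has moved from the substrate band to the product band. the text does not state how band intensities were quantified, so the following is only a minimal python sketch under assumed conventions: the function names (`fraction_converted`, `dose_response`), the flat background subtraction and the tuple layout of a lane are hypothetical and are not taken from the study.

```python
def fraction_converted(product_band, substrate_band, background=0.0):
    """Fraction of labeled RNA found in the product band of one gel lane.

    product_band / substrate_band are raw band intensities (arbitrary units);
    background is an assumed flat value subtracted from each band.
    """
    product = max(product_band - background, 0.0)
    substrate = max(substrate_band - background, 0.0)
    total = product + substrate
    if total == 0:
        raise ValueError("lane has no signal above background")
    return product / total


def dose_response(lanes, background=0.0):
    """Map protein amount (pmol) to fraction converted.

    lanes is an iterable of (protein_pmol, product_band, substrate_band)
    tuples, one per lane (hypothetical layout).
    """
    return {
        pmol: fraction_converted(product, substrate, background)
        for pmol, product, substrate in lanes
    }
```

with such a helper, a dose-dependent effect would appear as a fraction that rises with the amount of protein across lanes; the choice of a ratio within each lane, rather than absolute intensity, is meant to make lanes comparable even if loading differs slightly.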
our results showed that the destabilizing activity was optimal at mm for mg2+ or mn2+ (figure b and c), and increasing the mg2+ or mn2+ concentration from to mm led to a gradual decrease in the helix-destabilizing activity of mbp-vp . notably, this observation also excludes the possibility that the inhibitory effect of higher atp concentrations (> . mm) on vp -mediated helix destabilization (figure ) was due to the chelation of mg2+, which could be caused by adding large quantities of atp or other ntps ( , ), since lower mg2+ concentrations actually enhance, rather than inhibit, helix destabilization. for ca2+, the helix-destabilizing activity of mbp-vp was optimal at . mm, and increasing the ca2+ concentration from . to mm also led to a gradual decrease in rna helix destabilization (figure d). the inhibitory effect of zn2+ on helix destabilization was more dramatic than that of the other three divalent ions, and the helix-destabilizing activity of vp was almost abolished when the concentration of zn2+ reached mm (figure e). moreover, all of these divalent metal ions conferred inhibitory effects in a dose-dependent manner. furthermore, we sought to determine the optimal ph for the helix-destabilizing activity of vp . vp prefers a neutral or mildly basic ph, as the reaction conditions were optimal at ph . - . (figure f). last, we assessed the effects of the molar ratio of vp to rna duplex substrate and of the incubation time on the destabilizing activity of vp . at a molar ratio of : , the helix-unwinding efficiency of vp reached ~ %, and vp completely unwound the rna duplex substrates at a ratio of : (for min incubation; figure g). a subsequent experiment showed that the helix-destabilizing activity of vp (at the molar ratio of : ) was dependent on the incubation time (figure h).
after determining the rna chaperone activity of vp , we sought to determine its potential role in cpv rna replication. since the destabilization of the - panhandle structure of cpv (+)rna should occur before or when (−)rna synthesis proceeds on the cyclized (+)rna template ( ), it is plausible that vp functions to unwind the reoviral - panhandle to allow replication initiation at the accessible -end by an rdrp. to assess this possibility, we adapted for vp a canonical assay that was developed by panganiban and colleagues to assess the ability of hantavirus n protein to facilitate rna replication initiation through the hantaviral panhandle structure ( ). we used this assay to determine whether vp can enhance in vitro primer-dependent replication initiation by an alternative polymerase (i.e. reverse transcriptase) through the cypoviral panhandle structure. to this end, a structured rna mimicking the cypoviral panhandle structure was constructed using nt from the -end and nt from the -end of hacpv- rna segment , and was then used as the template for rt (figure a, upper panel). moreover, because rt reactions normally involve thermal cycles at c to denature the rna template and anneal it with a primer, the rt reactions were carried out at c in the presence or absence of mbp-vp to assess the potential role of the rna chaperone activity of vp in unwinding the panhandle structure of the rna template. dig-labeled dutp was added to the reaction mixture to visualize the rt products. as shown in figure b, the presence of vp dramatically stimulated the synthesis of rt products (lane ), as did the thermal annealing treatment at c (lane ); by contrast, the reaction product was barely detectable in the absence of vp (figure b, lanes and ). to further confirm the capacity of vp to facilitate rt through the panhandle structure, we conducted the reactions at varying molar ratios of vp to the panhandle-structured template. the stimulating effect of vp on rt was optimal at molar ratios of : - : (figure c, lanes and ).
taken together, these results strongly suggest that vp has a direct role in cpv dsrna synthesis by destabilizing the - panhandle structure to promote the accessibility of the -end of the (+)rna template for replication initiation. as one of the three major cpv capsid proteins, vp is also named as clamp protein or large protrusion protein. previous studies have demonstrated that copies of vp are located on the surface of the cpv capsid, exist in two conformations and function as molecular clamps to interact with and tie together neighboring vp shell proteins, thus enhancing the stability of the cpv capsid ( , ) . moreover, the structural homologs of cpv vp are commonly present in the inner capsids of other turreted reoviruses, including orthoreovirus s and aquareovirus vp , which also function as molecular clamps to stabilize inner capsids ( , , ) . however, whether vp , a major component of cpv dsrna replication machinery, and other reoviral clamp proteins contain activities other than structural stabilization was not known. in this report, we show that the vp protein from hacpv- is a novel rna chaperone that destabilizes rna helices and stimulates strand annealing in an atp-independent manner, implying a direct role of vp in the replication of cpv dsrna. it is believed that the vrna molecules of many rna viruses could be kinetically trapped in incorrect/inactive intermediate structures, thereby probably requiring rna chaperones to facilitate proper rna folding and efficient vrna replication/translation ( , , ) . so far, hiv- nucleocapsid (nc), vif and tat, flavivirus core, hantavirus n, poliovirus ab, coronavirus nucleocapsid (n), tombusvirus p , hepatitis d virus small delta antigen and eov c have been determined to contain rna chaperone activities ( , ) . the list of virus-encoded rna chaperones is still growing, and our current study adds hacpv- vp as a new member of it. 
in the family reoviridae, cpv vp may not be the only rna chaperone, as rotavirus nonstructural protein (nsp ), a multifunctional enzyme involved in rotaviral dsrna replication, was previously shown to have atp-independent nucleic acid helix-destabilizing activity ( ). as observed with cpv vp , the helix-destabilizing activity of rotavirus nsp has no nucleotide sequence specificity and lacks unwinding directionality. it is generally believed that the helix-destabilizing activity of rna chaperones normally requires a large excess of protein over rna helix substrates ( , ), and this notion also applies to both cpv vp and rotavirus nsp . the optimal unwinding activity of vp was observed at a protein-to-nucleic acid molar ratio of ~ : - : , whereas the optimal activity of rotavirus nsp was detected at much higher molar ratios of ~ : - : ( ). on the other hand, some features of rotavirus nsp and cpv vp differ. for instance, vp binds to both rna and dna and exhibits rna-specific destabilizing activity (figure ), whereas nsp is an ssrna-binding protein and can unwind both rna-rna and dna-rna helix substrates ( , ). these differences imply that the two proteins may not use the same mechanism to destabilize rna helices. moreover, because no previous study has determined whether rotavirus nsp possesses strand-annealing acceleration activity, an important characteristic of rna chaperones, further characterization of nsp is needed to determine whether this protein can be regarded as an rna or nucleic acid chaperone. both cpv vp and rotavirus nsp are components of the reoviral rna replication machinery: the former is a major cpv capsid protein, and the latter is associated with the rotaviral rdrp ( , ). rotavirus nsp has no homology with any cpv structural or nonstructural protein; conversely, cpv vp has no sequence similarity with any rotavirus protein.
moreover, as a 'nonturreted' reovirus, rotavirus does not encode clamp proteins ( ). these observations led us to question whether cpv or rotavirus encodes any rna chaperone or helix-destabilizing protein other than vp or nsp , respectively. however, since it is not uncommon for a virus (e.g. hiv- ) to encode multiple rna chaperones, the possibility that a reovirus encodes more than one rna chaperone cannot be ruled out, and it would be interesting to investigate this issue in the future.
during the rna replication of reoviruses, (+)-vrnas (mrnas) are encapsidated into the nascently assembled inner capsids and are then used by reoviral rdrps as replication templates to synthesize genomic dsrna segments within the inner capsids ( , , ). as seen with the (−)-vrna of hantavirus, reoviral (+)-vrna can be cyclized, and its - and -ends undergo base pairing to form a panhandle-like structure ( ). previous studies revealed that the panhandle structure of cyclized reoviral (+)-vrna is required for efficient dsrna synthesis, probably by recruiting rdrps to the -end of the (+)-vrna template ( ). moreover, as with other - cyclized vrnas, the vrna panhandle structure should be dissociated before or when rna replication initiates, to make the -end of the template accessible to rdrps ( ). considering that hantavirus n, a nucleocapsid protein and rna chaperone, can efficiently unwind the - panhandle of cyclized hantaviral (−)-vrna and subsequently promote transcription/replication initiation ( , ), we propose that the rna chaperone activity of cpv vp is directly involved in the initiation of cpv dsrna replication by destabilizing the cypoviral panhandle structure in a similar manner. in accordance with this proposal, we found that when an rna structure mimicking the cypoviral panhandle was used as a template, hacpv- vp effectively promoted primer-dependent transcription initiation by an alternative polymerase (i.e. reverse transcriptase) (figure ).
as a heterogeneous group of proteins that share no consensus sequences or motifs, rna or nucleic acid chaperones are poorly understood in regard to the mechanism(s) governing their atp-independent helix-destabilizing and annealing stimulation activities. to explain the rna chaperone activities, an 'entropy transfer' model has been proposed. according to this model, rna chaperones contain intrinsically disordered (unstructured), highly flexible regions that can transfer their disorder or entropy to misfolded rna molecules on binding to rna. such an entropy transfer can destabilize kinetically trapped misfolded rna molecules, leading to the rearrangement of rna folding in an atp-independent manner ( , ) . so far, many virus-encoded rna chaperones, including hiv- nc, vif and tat, hantavirus n, flavivirus core, coronavirus n and tombusvirus p , have been predicted to contain intrinsically disordered regions ( , ) . however, since the regions responsible for chaperoning activities have not been accurately mapped in many rna chaperones, the relationship between intrinsic disorder and rna chaperone activities is difficult to formally determine ( , ) . furthermore, alternative models, such as transient ionic or electrostatic interactions, have also been proposed to explain the rna chaperone activity ( ) . our disorder prediction of cpv vp using pondr vl-xt ( , , ) indicates that vp proteins from three different types of cypoviruses-cpv- , cpv- and cpv- -contain similar distribution patterns of potentially disordered regions (data not shown), thereby suggesting that the 'entropy transfer' applies to hacpv- vp , and vp proteins of other cpvs may also contain rna chaperone activity. as a single-shelled member of the family reoviridae, cpv is well recognized as an ideal and simplified model for studying the mechanisms of vrna replication and mrna capping of the reoviridae and probably other dsrna viridae. 
in this study, we found that a cpv capsid clamp protein, vp , possesses a novel rna chaperone activity, which is thought to be directly involved in the initiation of cpv dsrna replication. these findings show that cpv capsid proteins include not only an rna polymerase and an mrna capping enzyme, but also an rna chaperone that is important for vrna replication and/or translation. this study should extend our understanding of the rna replication of cpv, reoviridae and dsrna viridae, and also enrich our knowledge of virus-encoded rna chaperones. furthermore, since (+)-vrna cyclization and panhandle structures commonly exist in the family reoviridae, encoding an rna chaperone or another rna remodeling protein, such as a helicase, might be a common strategy for all reoviruses. future studies by our group and others should reveal whether rotavirus nsp , the clamp proteins of other turreted reoviruses, and other reoviral structural or nonstructural proteins can also function in - panhandle unwinding and rna replication initiation.
references
the dsrna viruses
the dsrna viridae and their catalytic capsids
structural comparisons of empty and full cytoplasmic polyhedrosis virus. protein-rna interactions and implications for endogenous rna transcription mechanism
viral molecular machines
virus taxonomy: ninth report of the international committee on taxonomy of viruses
the molecular organization of cypovirus polyhedra
visualization of protein-rna interactions in cytoplasmic polyhedrosis virus
cryo-em structure of a transcribing cypovirus
the structure of a cypovirus and the functional organization of dsrna viruses
cytoplasmic polyhedrosis virus structure at Å by electron cryomicroscopy: structural basis of capsid stability and mrna processing regulation
Å structure of cytoplasmic polyhedrosis virus by cryo-electron microscopy
atomic model of a cypovirus built from cryo-em structure provides insight into the mechanism of mrna capping
rna chaperones, rna annealers and rna helicases
role of rna chaperones in virus replication
rna remodeling by chaperones and helicases
conserved elements in the ' untranslated region of flavivirus rnas and potential cyclization sequences
the ends of la crosse virus genome and antigenome rnas within nucleocapsids are base paired
a base-specific recognition signal in the ' consensus sequence of rotavirus plus-strand rnas promotes replication of the double-stranded rna genome segments
rotavirus rna replication requires a single-stranded ' end for efficient minus-strand synthesis
rna structure and the replication of the rotavirus segmented double-stranded rna genome
characterization of the rna chaperone activity of hantavirus nucleocapsid protein
taming free energy landscapes with rna chaperones
structure and mechanism of helicases and nucleic acid translocases
identification and genome characterization of heliothis armigera cypovirus types and and heliothis assulta cypovirus type
the complete nucleotide sequence of the type helicoverpa armigera cytoplasmic polyhedrosis virus genome
fully automated ab initio protein structure prediction using i-sites, hmmstr and rosetta
genetic modification of baculovirus expression vectors
identification and characterization of rna duplex unwinding and atpase activities of an alphatetravirus superfamily helicase
rna binding by a novel helical fold of b protein from wuhan nodavirus mediates the suppression of rna interference and promotes b dimerization
targeting of dicer- and rna by a viral rna silencing suppressor in drosophila cells
membrane association of wuhan nodavirus protein a is required for its ability to accumulate genomic rna template
the nonstructural protein c of a picorna-like virus displays nucleic acid helix destabilizing activity that can be functionally separated from its atpase activity
atomic model of cpv reveals the mechanism used by this single-shelled virus to economically carry out functions conserved in multishelled reoviruses
hiv- nucleocapsid protein as a nucleic acid chaperone: spectroscopic study of its helix-destabilizing properties, structural binding specificity, and annealing activity
analysis of the rna chaperoning activity of the hepatitis c virus core protein on the conserved 'x region of the viral genome
in vivo detection, rna-binding properties and characterization of the rna-binding domain of the p putative movement protein from carnation mottle carmovirus (carmv)
characterization of the rna-binding domains in the replicase proteins of tomato bushy stunt virus
differing roles of the n- and c-terminal zinc fingers in human immunodeficiency virus nucleocapsid protein-enhanced nucleic acid annealing
poliovirus protein ab displays nucleic acid chaperone and helix-destabilizing activities
the twenty-nine amino acid c-terminal cytoplasmic domain of poliovirus ab is critical for nucleic acid chaperone activity
chelation of divalent cations by atp, studied by titration calorimetry
subnanometer-resolution structures of the grass carp reovirus core and virion
rna chaperoning and intrinsic disorder in the core proteins of flaviviridae
predicting intrinsic disorder from amino acid sequence
identification and characterization of the helix-destabilizing activity of rotavirus nonstructural protein nsp
multimers formed by the rotavirus nonstructural protein nsp bind to rna and have nucleoside triphosphatase activity
the rotavirus rna-binding protein ns (nsp ) forms s multimers and interacts with the viral rna polymerase
structural evolution of reoviridae revealed by oryzavirus in acquiring the second capsid shell
genome replication and packaging of segmented double-stranded rna viruses
the bunyavirus nucleocapsid protein is an rna chaperone: possible roles in viral rna panhandle formation and genome replication
identification of a region of hantavirus nucleocapsid protein required for rna chaperone activity
the role of structural disorder in the function of rna and protein chaperones
rna chaperone activity of the tombusviral p replication protein facilitates initiation of rna synthesis by the viral rdrp in vitro
combining prediction, computation and experiment for the characterization of protein disorder
we thank dr xiangdong fu (san diego, ca, usa) and dr qijia wu (wuhan, china) for technical assistance, and ms markeda wade (houston, tx, usa) for professionally editing the manuscript.
supplementary data are available at nar online.
key: cord- - q pjw authors: lew, qiao jing; chu, kai ling; lee, jialing; koh, poh ling; rajasegaran, vikneswari; teo, jin yuan; chao, sheng-hao title: pcaf interacts with xbp- s and mediates xbp- s-dependent transcription date: - - journal: nucleic acids res doi: . /nar/gkq sha: doc_id: cord_uid: q pjw
x-box binding protein (xbp- ) is a key regulator required for cellular unfolded protein response (upr) and plasma cell differentiation. in addition, involvement of xbp- in host cell–virus interaction and transcriptional regulation of viruses, such as human t-lymphotropic virus type (htlv- ), has been revealed recently. two xbp- isoforms, xbp- u and xbp- s, which share an identical n-terminal domain, are present in cells. xbp- s is a transcription activator while xbp- u is the inactive isoform. although the transactivation domain of xbp- s has been identified within the xbp- s-specific c-terminus, the molecular mechanism of the transcriptional activation by xbp- s remains unknown. here we report the interaction between p /cbp-associated factor (pcaf) and xbp- s through the c-terminal domain of xbp- s. no binding between xbp- u and pcaf is detected. in a cell-based reporter assay, overexpression of pcaf further stimulates the xbp- s-mediated cellular and htlv- transcription while knockdown of pcaf exhibits the opposite effect.
expression of endogenous xbp- s cellular target genes, such as bip and chop, is significantly inhibited when pcaf is knocked down. furthermore, pcaf is recruited to the promoters of xbp- s target genes in vivo, in a xbp- s-dependent manner. collectively, our results demonstrate that pcaf mediates the xbp- s-dependent transcription through the interaction with xbp- s. introduction x-box binding protein (xbp- ) belongs to the cyclic amp response element binding protein/activating transcription factor (creb/atf) family of transcription factors. xbp- plays a major role in regulating unfolded protein response (upr), which is triggered when endoplasmic reticulum (er) is under stress ( ) . xbp- has two protein isoforms, xbp- u and xbp- s. both isoforms share a common n-terminus containing a basic-region leucine zipper (bzip) domain which is required for dna binding. xbp- u is the dominant isoform under non-stress conditions. activation of upr induces the endoribonuclease activity of inositol requiring enzyme , an er transmembrane protein, which removes nts from the open-reading frame of xbp- mrna ( ) . this unconventional splicing occurs in cytoplasm and causes a frame shift at amino acid of xbp- , leading to the generation of xbp- s by replacing the c-terminus of xbp- u with a strong transactivation domain ( , ) . xbp- s is a transcription activator that up-regulates the expression of er chaperones and other genes involved in membrane synthesis and the pathway of protein secretion ( , ) . overexpression of xbp- s increases the secretory capacity of the cell and improves recombinant protein productivity in secretion-limited mammalian cells by expanding the surface area and volume of er ( , ) . it has been shown recently that high-level expression of recombinant secreted proteins in cells and environmental stresses during culture also induce the generation of xbp- s ( ) . 
xbp- s is also essential for the terminal differentiation of antibody-producing plasma cells, acting by enhancing the secretory machinery of the cell ( , ). xbp- -knockout b cells display impaired immunoglobulin secretion, which can be restored by ectopic expression of xbp- s ( ). furthermore, the involvement of xbp- in tumorigenesis has been reported recently ( ) ( ) ( ). recent studies show that cellular upr can be induced by infection with various viruses, including kaposi's sarcoma-associated herpesvirus ( ), west nile virus ( ), japanese encephalitis virus (jev) ( ), hepatitis c virus ( , ), human cytomegalovirus (hcmv) ( , ), dengue virus serotype (den- ) ( ), severe acute respiratory syndrome coronavirus ( ), coronavirus ( ), epstein-barr virus ( ) and semliki forest virus ( ). some viruses, such as jev and den- , use the er of host cells as the primary site of glycoprotein synthesis, genomic rna replication and virus particle maturation, and thus trigger er stress as well as upr ( , ). in other cases, viral proteins, such as hcmv us , traffic to the er of host cells and induce upr ( ). the transactivator of human t-lymphotropic virus type (htlv- ), tax, has been shown to localize to organelles associated with protein secretion, including the er and golgi complex ( ), raising the possibility that htlv- may affect cellular upr as well. we previously discovered that xbp- s stimulates basal and tax-activated transcription of htlv- . infection with htlv- was found to induce upr and up-regulate the expression of several upr genes, including xbp- . furthermore, xbp- was identified as one of the tax target genes in cells ( ). our results not only revealed a positive feedback loop between htlv- and the host cells, but also suggested an important role for xbp- in transcriptional regulation of htlv- . the localization of a transactivation domain within the c-terminus of xbp- s helps to explain the transactivating ability of xbp- s.
however, the molecular mechanism of xbp- s transactivation remains to be determined. one possibility is that the c-terminus of xbp- s interacts with a specific cellular co-activator, which is responsible for the up-regulation of xbp- s target genes. here, we identify a histone acetyltransferase (hat), p /cbp-associated factor (pcaf), as an xbp- s-specific binding protein and demonstrate the functional significance of the pcaf-xbp- s interaction in xbp- s-mediated transcription.
cells, short interfering rnas, short hairpin rnas and plasmids
hek , t and mcf cells were obtained from american type culture collection. the short interfering rnas (sirnas) targeting pcaf (sipcaf- : -cggagtgtactccgcctgcaa- and sipcaf- : -cagcaaataattgtcagtcta- ) and p (sip - sirna: -ttggactaccctatcaagtaa- and sip - : -cccggtgaactctcctataat- ) were purchased from qiagen, and the short hairpin rna (shrna) against pcaf (shpcaf: -tagatgaggtgctttgagcagttctgaaa- ) was obtained from origene. human xbp- s and xbp- u expression plasmids were previously described ( ). the plasmids for expression of human pcaf and p were obtained from open biosystems. the plasmids containing a series of hemagglutinin (ha)-tagged xbp- deletions were generous gifts from dr hiderou yoshida ( ). the firefly luciferase reporter plasmids, htlv-luc and bip-luc [including wild-type and er stress response element (erse) mutant bip-luc plasmids], were kindly provided by dr arnold rabson and dr kazutoshi mori, respectively ( , ). transient transfections of dna plasmids into hek , t and mcf cells were performed using fugene (roche) according to the manufacturer's instructions. for the cell-based overexpression assays, cells were grown to - % confluence in -well plates and co-transfected with a luciferase reporter and an expression plasmid. lipofectamine reagent (invitrogen) was used to co-transfect cells with dna plasmids and sirnas for the cell-based knockdown experiments.
firefly luciferase activities were measured h post-transfection using the bright-glo assay system (promega) and the activities were determined using an infinite multiplate reader (tecan). hek cells were used in the cell-based luciferase assays. t cells were transiently co-transfected with indicated expression plasmids and the cell lysates were prepared days post-transfection for co-immunoprecipitation (co-ip). to get the high levels of ectopic expression, t, a highly transfectable derivative of hek , was chosen for the co-ip study. the ip kit was purchased from roche and co-ip was performed according to the manufacturers' instructions. the immunoprecipitated complexes were analyzed by western blotting. western blotting was carried out according to the standard protocols. all the antibodies used in our study were obtained from santa cruz biotechnology, except the anti-ha antibody (sigma). the upr inducing compounds, tunicamycin (tm) (assay designs) and thapsigargin (tg) (sigma), were dissolved in dimethyl sulfoxide (dmso) to mg/ml and mm, respectively. all three cell lines, hek , t and mcf , exhibited upr after treating with tm or tg. induction of the upr genes in the treated cells were confirmed by quantitative reverse transcriptase-polymerase chain reaction (qrt-pcr) (data not shown). among the cell lines used in this study, the endogenous xbp- s target genes in mcf cells showed the highest sensitivity to the ectopic expression of xbp- s (data not shown). therefore, mcf cells were selected for the xbp- s overexpression experiments followed by the examination of the transcriptional regulation of xbp- s-dependent genes in vivo. total rnas of the transfected mcf cells or the tm ( mg/ml)/ tg ( nm) treated hek cells were isolated using rneasy mini kit (qiagen). one microgram of the total rnas was converted into complementary dna (cdna) using improm tm -ii reverse transcription system (promega). 
specific cdnas were amplified using sybr green pcr master mix (applied biosystems). the primer pairs used in this study were: bip ( -ggtgaaagacccctgacaaa- and -gtcaggcgattctggtcatt- ), chop ( -cttctctggcttggctgact- and -cccttggtcttcctcctctt- ), edem ( -aggtgctgataggagatgtgg- and -ggattcttggttgcctggta- ) and glyceraldehyde -phosphate dehydrogenase (gapdh) ( -aacagcctcaagatcatcagc- and -ggatgatgttctggaggacc- ). gapdh was used as a control to normalize the cdna inputs. amplification and detection of the cdnas were performed using an abi prism thermal-cycler (applied biosystems).
chromatin immunoprecipitation (chip) assays were carried out using the ez chip kit (millipore) according to the manufacturer's protocol with some modifications. hek cells were treated with tm ( mg/ml) or tg ( nm) for h prior to cross-linking. dna fragments of around - bp were obtained by sonication with a microson ultrasonic cell disruptor (misonix). for the ip, the indicated antibodies (i.e. anti-xbp- or anti-pcaf antibodies) were added individually to the sheared chromatin and incubated at c overnight. the dna/protein/antibody complex was then pulled down with protein g agarose, and the dna in the complex was purified using the qiaquick pcr purification kit (qiagen). quantitative pcr was performed to determine the relative amount of dna immunoprecipitated by anti-xbp- or anti-pcaf antibodies in the presence of tm or tg. the primer pairs used to amplify the promoter regions of the bip and chop genes were: bip ( -gatggggcggatgttatcta- and -ctctcacactcgcgaaacac- ) and chop ( -gacactacgtcgacccccta- and -ggttccagctctgattttgg- ). cells treated with dmso served as a negative control. for the overexpression study, mcf cells were co-transfected with a pcaf expression vector and one of the xbp- plasmids (xbp- s or xbp- u) days prior to cross-linking. cells co-transfected with a pcaf plasmid and an empty vector served as a negative control.
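the methods above state that gapdh was used to normalize the cdna inputs, but they do not give the formula used to turn threshold cycles into relative expression. one widely used convention is the comparative ct calculation, sketched below in python as an illustration only: the function name `relative_expression`, its parameters and the assumed amplification efficiency of 2 (perfect doubling per cycle) are hypothetical choices, not details reported in the study.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control,
                        efficiency=2.0):
    """Fold change of a target gene by the comparative Ct calculation.

    Each Ct is a threshold cycle; the reference gene (e.g. GAPDH) corrects
    for differences in cDNA input. efficiency=2.0 assumes the amount of
    product doubles every cycle (an assumption, not a measured value).
    """
    # Normalize the target to the reference gene within each sample.
    delta_treated = ct_target_treated - ct_ref_treated
    delta_control = ct_target_control - ct_ref_control
    # Compare the treated sample with the control sample.
    delta_delta = delta_treated - delta_control
    # A lower Ct means more template, hence the negative exponent.
    return efficiency ** (-delta_delta)
```

under these assumptions, a target whose normalized ct drops by two cycles after treatment corresponds to a four-fold increase in transcript level; when the two primer pairs amplify with clearly different efficiencies, this shortcut is no longer appropriate and an efficiency-corrected calculation would be needed.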
the data shown (including luciferase assays, qrt-pcr and quantitative chip) were analyzed using student's t-test at the % significance level (p < . ).
we previously demonstrated that xbp- s, a member of the creb/atf family of proteins, stimulates basal and tax-activated htlv- transcription ( ). it has been reported that two histone acetyltransferases (hats), pcaf and p , are required to activate htlv- transcription through three -bp repeats known as tax-responsive elements (tres) located within the htlv- promoter ( ). each tre contains a binding site for creb/atf proteins, suggesting a potential functional connection between hats and xbp- s. we first investigated the interaction between pcaf and the two xbp- isoforms. cells were transfected with an xbp- s or xbp- u expression plasmid followed by ip analyses (figure a).
figure . pcaf associates with xbp- s. (a) t cells were transfected with an expression plasmid to ectopically express xbp- s, xbp- u or creb . ip was performed using the cell lysates prepared from the transfected cells and the indicated antibodies. normal igg (igg) was used as a negative control. the immunoprecipitated complexes and the protein inputs were analyzed by western blotting. (b) the cell lysates of xbp- s-expressing cells were used for ip with an anti-pcaf antibody. the presence of xbp- s in the immunoprecipitates was determined by western blotting. (c) cells were co-transfected with a p expression vector and the indicated plasmid (i.e. xbp- s, xbp- u or creb plasmid). ip was carried out using an anti-p antibody followed by western blotting.
the anti-xbp- antibody used in the assays recognizes both xbp- isoforms. the association between pcaf and another member of the creb/atf protein family, creb , was also examined (figure a). pcaf was found in the immunoprecipitated complexes of xbp- s-expressing cells, but not in those of xbp- u- or creb -expressing cells (figure a).
reciprocal ip was carried out using an anti-pcaf antibody and xbp- s was detected in the immunoprecipitated complexes, confirming the interaction between pcaf and xbp- s ( figure b) . interaction between xbp- s and another hat, p , was examined next. however, no association between p and xbp- s, xbp- u or creb was detected ( figure c) , indicating specific binding between pcaf and xbp- s. domain study of xbp- was carried out using a series of ha-tagged xbp- truncations (figure a ). cells were transfected with an individual xbp- truncation plasmid followed by ip using anti-pcaf antibodies. as shown in figure b , only the xbp- s-specific c-terminal region, which contained the transcriptional activation domain of xbp- s, was found to associate with pcaf, but not the xbp- u-specific c-terminus or any other regions of xbp- . the heavy chains of anti-pcaf antibodies were also recognized by the secondary antibody used for the immunoblotting. since the molecular weights of heavy chains and ha-tagged xbp- s were similar ( kda), the blot could not reveal the presence of ha-xbp- s in the immunoprecipitates. we performed an additional western blot using an anti-xbp- s antibody recognizing the common domain of xbp- s and xbp- u and confirmed the interaction between pcaf and ha-xbp- s ( figure b , the anti-xbp- blot). it was noted that the interaction between pcaf and endogenous xbp- s proteins was also detected in the ha- s( - )- and ha-xbp- u-transfected cells ( figure b) . collectively, the results demonstrate that pcaf binds to xbp- s through the transcriptional activation domain of xbp- s located in its c-terminal region.
pcaf is required for xbp- s-mediated activation of htlv- and bip transcription
functional significance of the pcaf-xbp- s interaction was assessed in the xbp- s-dependent transcription assays. xbp- s is known to regulate the transcription of htlv- and cellular gene bip ( , ) . 
the luciferase reporters, in which the expression of luciferase was driven by htlv- and bip promoters (i.e. htlv-luc and bip-luc, respectively), were utilized in the study. in the xbp- s co-transfected cells, more than -fold increases in luciferase expression were observed in htlv- and bip promoters ( figure a and b). further induction (more than -fold) of the xbp- s-mediated activation of htlv- and bip promoters was detected in the pcaf-expressing cells ( figure a and b) . however, overexpression of p had no significant effects on xbp- s-dependent transcription ( figure a and b) . the impact of pcaf knockdown on the activation of htlv- and bip transcription by xbp- s was studied next. the knockdown experiments were carried out using the sirnas specifically targeting pcaf. the effectiveness of two pcaf sirnas, pcaf- and pcaf- , was confirmed by western blotting ( figure a ). two p sirnas (i.e. p - and p - ) were utilized as controls since no association between p and xbp- s was observed ( figure c ). however, the protein levels of endogenous p in the cells were not high enough to be clearly revealed by western analyses. qrt-pcr was then used to confirm the actions of two p sirnas. forty to fifty percent decrease in p mrna levels were detected in the cells transfected with the p sirnas (data not shown). cells were co-transfected with the luciferase reporter (i.e. htlv-luc or bip-luc), a xbp- s plasmid, and an indicated sirna ( figure b and c) . compared to the transfection excluding the xbp- s expression vector, and -fold enhancement in the activation of htlv and bip promoters was observed ( figure b and c, the first two transfections). the gl sirna, which specifically targeted the gl luciferase used in the htlv-luc and bip-luc reporters, was used as a positive control and caused % decreases in luciferase expression under the control of htlv and bip promoters ( figure b and c, the second and third transfections). 
the two pcaf sirnas inhibited % of the luciferase expression driven by the htlv promoter, while no significant effects were caused by either p sirna ( figure b ). similar observations were found in the bip-luc reporter assays ( figure c ). results obtained from the pcaf overexpression and knockdown reporter assays (figures and ) demonstrate the functional involvement of pcaf in the genes regulated by xbp- s. xbp- s regulates the transcription of bip by binding to the erse element located within the bip promoter ( , ) . we next wished to determine if the transcriptional activation of the bip promoter by pcaf was mediated through erse as well. the wild-type and erse-mutant bip-luc reporter plasmids were utilized in the experiments. ectopic expression of pcaf significantly activated the luc expression driven by the wild-type bip promoter, while little or no effect was detected on the transcription driven by the erse-mutant bip promoter ( figure a) . since the protein level of endogenous xbp- s was low in the er stress-free cells, only up to a % increase in bip transcription was observed ( figure a ). in the xbp- s-overexpressing cells, pcaf exhibited stronger activation of the expression of luciferase driven by the wild-type bip promoter ( figure b, up to -fold) . however, no activating effects on the erse mutant bip promoter were detected when both pcaf and xbp- s were overexpressed ( figure b ). collectively, these results suggest that pcaf interacts with xbp- s and mediates bip transcription in an erse-dependent manner. the requirement for pcaf in the mediation of endogenous xbp- s target genes, including bip, chop and edem ( ), was investigated next. we performed qrt-pcr assays to determine the impact of pcaf on the activation of xbp- s target genes by knocking down the expression of pcaf. compared with transfection of dna alone, co-transfection of dna plasmids and sirnas was much more cytotoxic (data not shown). 
therefore, an shrna plasmid against pcaf was used to co-transfect cells along with an xbp- expression vector. effectiveness of the pcaf shrna was confirmed by western blotting ( figure a ). overexpression of xbp- s resulted in - to -fold increases in the mrna levels of bip, chop, and edem ( figure b ). co-transfection of the pcaf shrna in the xbp- s-expressing cells led to , and % inhibition of bip, chop, and edem transcription, respectively ( figure b ), demonstrating the involvement of pcaf in the xbp- s-dependent transcription. the in vivo recruitment of pcaf to the xbp- s endogenous target genes was examined next. cells were transfected with a pcaf expression plasmid and an indicated vector (i.e. empty, xbp- u, and xbp- s plasmids). distribution of pcaf and xbp- on the promoters of bip and chop was analyzed by quantitative chip. fewer xbp- and pcaf proteins were located on bip and chop genes when xbp- u was overexpressed ( figure a and b) . in the xbp- s/pcaf co-transfected cells, more xbp- s proteins were found to bind to the promoter region of bip and chop genes ( figure a and b). this was expected, since overexpression of xbp- s activated the transcription of bip and chop ( figure b ). in addition, a -fold increase in pcaf binding to bip and chop genes was detected in the xbp- s/pcaf co-expressing cells ( figure a and b) , providing evidence that pcaf was recruited to bip and chop promoters through the interaction with xbp- s. upr induces the generation of xbp- s which up-regulates its target genes required for the secretory pathway, membrane synthesis, protein folding and er-associated degradation ( ). the involvement of pcaf in xbp- s activation during upr was studied by examining the expression of the bip and chop genes. cells were transfected with a control or pcaf shrna followed by treatment with tm to induce upr. the mrnas isolated from the cells were analyzed by qrt-pcr. 
the mrna levels of bip and chop increased - and -fold, respectively, after tm incubation ( figure a ). knockdown of pcaf only led to minor inhibition of the transcription of the two genes ( figure a ). an identical set of assays was performed using tg as the upr inducing reagent. little or no significant effect on bip and chop mrnas was detected in the pcaf shrna-transfected cells ( figure b) . we further carried out quantitative chip to examine the distribution of xbp- s and pcaf on bip and chop genes during upr. incubation with tm resulted in - and -fold increases in xbp- s binding to bip and chop promoters, respectively, while only - and . -fold increases in pcaf association with the two genes were observed ( figure c ). in another set of experiments with tg treatment, < -fold increases in pcaf binding to endogenous bip and chop genes were detected, whereas xbp- s binding was enhanced more than - (bip) and -fold (chop) ( figure d) . taken together, the qrt-pcr and quantitative chip analyses suggest the limited involvement of pcaf in the mediation of xbp- s target genes during upr. a recent study demonstrated that the association between xbp- s and its binding protein could be upr-dependent and that such protein-protein interaction was disrupted after treating cells with the upr-inducing compound, tm ( ) . we examined the influence of upr on the pcaf-xbp- s interaction by treating cells with tm followed by ip analyses. no changes in the binding of pcaf to xbp- s were detected under the treatment of tm (figure ), suggesting that the pcaf-xbp- s protein complexes persist during upr. in this study, we investigate the molecular mechanism underlying the distinct functions of the inactive xbp- u and active xbp- s. both isoforms have an identical n-terminus and an isoform-specific c-terminal region (figure a ). we identify pcaf as a novel xbp- s binding protein and demonstrate the biological importance of pcaf in regulating the xbp- s-mediated cellular and viral transcription. 
pcaf binds to xbp- s through the interaction with the xbp- s-specific c-terminal domain but fails to associate with the full-length xbp- u or the xbp- u-specific c-terminus (figures and ) , providing an explanation for the transactivating ability of xbp- s on gene expression. basal transcription of htlv- occurs after proviral integration into the host cell genome and induces the initial expression of htlv- proteins, including the transactivator, tax, followed by tax transactivation to boost the synthesis of viral transcripts. two hats, pcaf and p , have been shown to interact with tax and play a role in tax-activated viral transcription ( , ( ) ( ) ( ) . tax, which does not bind to dna by itself, activates htlv- transcription through three -bp repeats known as tre, located within the promoter of htlv- . each -bp tre repeat contains a creb/ atf binding site and is known to associate with creb/ atf family proteins ( , ) . tax binds to tres through the interaction with creb/atf family proteins (including xbp- s, creb and creb ) and recruits pcaf/p to htlv- promoter, resulting in tax transactivation ( , ( ) ( ) ( ) . we previously found that xbp- s bound to tax and induced stronger tax transactivation than other creb/atf family proteins ( ) . interestingly, xbp- s also stimulated basal transcription of htlv- , while creb and creb did not show any activating effects, suggesting a crucial role for xbp- s during the early phase of viral transcription as well ( ) . no interaction between pcaf and creb was detected in co-ip analyses ( figure a ). this observation could explain why creb and other creb/atf family proteins fail to up-regulate htlv- transcription in the absence of tax. in contrast, the requirement for pcaf in the xbp- s-dependent htlv- basal transcription was clearly demonstrated in the cell-based reporter assays ( figures a and a) . 
functional significance of pcaf-xbp- s interaction on the cellular target genes of xbp- s, including bip, chop and edem, was demonstrated in the pcaf overexpression and knockdown experiments (figures - ). in addition, quantitative chip assays showed that xbp- s recruited pcaf to the promoters of endogenous xbp- s target genes in vivo, establishing a direct functional connection between pcaf and xbp- s ( figure ) . however, knockdown of pcaf by sirna or shrna did not completely inhibit the elevated transcription caused by xbp- s (figures and ). there are two possible explanations for these observations. first, neither sirna nor shrna against pcaf completely blocked the protein synthesis of pcaf ( figures a and a ). therefore, it is possible that the sirna- and shrna-transfected cells still have sufficient pcaf left to participate in the gene activation by xbp- s. secondly, pcaf may be only one of several cellular co-factors responsible for xbp- s-dependent transcription. therefore, elimination of pcaf by rna interference could only partially inhibit the transactivation of xbp- s. discovery of the involvement of pcaf in the transcriptional regulation of bip and edem genes is novel. pcaf has been identified as a co-factor of atf (or creb ) for the expression of chop ( ) . in response to amino acid starvation, atf binds to the amino acid response element located in the chop promoter and recruits pcaf to the promoter, leading to the activation of chop transcription ( ) . besides pcaf, atf also interacts with other hats, including p and cbp, through its n-terminal transactivation domain ( , ) . as shown in figure , xbp- s shows more stringent protein binding than atf and fails to associate with p . future studies are required to investigate the interaction between xbp- s and other hats to further determine the binding specificity of the xbp- s transactivation domain.
figure . requirement of pcaf for the mediation of xbp- s target genes under upr. mcf cells were co-transfected with a non-specific (i.e. control) or pcaf shrna, and incubated with mg/ml tm (a) or nm tg (b). both tm and tg were dissolved in dmso and the final concentration of dmso in the culture was kept at . %. expression of endogenous bip and chop genes was determined by qrt-pcr. cells transfected with a control shrna with . % dmso served as a negative control. for quantitative chip assays, hek cells were treated with mg/ml tm (c) or nm tg (d) and the binding of xbp- s and pcaf to the endogenous bip and chop genes was analyzed by quantitative pcr. cells incubated with . % dmso were used as a negative control. fold changes were determined by comparing to the negative controls. *p < . versus negative controls.
figure . interaction between xbp- s and pcaf under upr. t cells were transiently transfected with an xbp- s expression vector and incubated with mg/ml tm or . % dmso (i.e. the negative control) for h. ip was performed using the cell lysates prepared from the transfected cells and the antibody against xbp- . normal igg (igg) was used as a negative control. the immunoprecipitated complexes and the protein inputs were analyzed by western blotting.
collectively, the findings of our group and others indicate that pcaf may play an important role in transcriptional activation of chop through the xbp- s- as well as atf -dependent pathways. it has been reported that p is recruited to the endogenous bip promoter in the tg-treated cells by chip assays ( ) . co-overexpression of p , yy and atf showed synergistic activation of luciferase expression driven by the bip promoter, suggesting that p might be required for yy -/atf -mediated activation of bip ( ) . similar cell-based reporter assays (i.e. using the bip-luc reporter plasmid) were performed to determine the requirement of p for xbp- s-mediated transactivation. 
neither overexpression nor knockdown of p showed any significant effects on xbp- s-dependent luciferase expression (figures and ) , suggesting that p might function in an xbp- s-independent manner. these results were further supported by co-ip data, in which no interaction between p and xbp- s was detected ( figure c ). furthermore, we assessed the requirement of p for transcriptional activation of bip and chop genes under upr. in contrast to the report by baumeister et al. ( ) , our results obtained from the quantitative chip analyses did not show any increased p binding to either bip or chop promoters in the tm- or tg-stressed cells (data not shown), raising questions regarding the involvement of p in the regulation of xbp- s target genes. under the er stress-free condition, our data clearly indicated that pcaf was required for xbp- s-mediated transcriptional regulation (figures - ) . however, results from qrt-pcr and quantitative chip showed that pcaf only exhibited limited involvement in the expression of xbp- s target genes when upr was induced ( figure ) . a recent study identified the regulatory subunit of phosphoinositide -kinase (pi k) as a novel xbp- s binding protein and demonstrated that the association between xbp- s and the pi k subunit could be upr-dependent ( ) . we examined the xbp- s-pcaf protein-protein interaction under the normal or tm-stressed conditions. as shown in figure , the interaction between xbp- s and pcaf was not disrupted during upr. ongoing research focuses on the identification of the xbp- s co-factor(s) required for the transactivation caused by xbp- s once upr is induced. gcn , a hat which shares % identity in amino acid sequence with pcaf ( ), is a possible candidate for an xbp- s binding partner. the involvement of gcn in xbp- s-dependent transcription and during upr is currently under investigation. the tumor microenvironment is hypoglycemic and hypoxic, resulting in induction of upr and overexpression of xbp- . 
recent studies show the involvement of xbp- in tumorigenesis of various cancers and suggest xbp- as a potential target for anti-cancer therapeutics ( ) ( ) ( ) . fujimoto et al. investigated the expression of xbp- in primary breast cancers and five breast cancer cell lines, including mcf . the increased expression of xbp- was detected in all breast cancers and cell lines examined, but not in the non-cancerous breast tissue ( ) . in addition, clinical results showed that high levels of xbp- s increased the survival of breast cancers ( ) . our data presented here demonstrate the functional importance of pcaf in mediating the expression of xbp- s target genes in mcf cells (figures and ) , suggesting a potential role of pcaf in xbp- s-mediated tumorigenesis of breast cancers. furthermore, pcaf may be an essential factor for other xbp- s-mediated signaling pathways. for example, xbp- s is one of the key components in the transcriptional program controlling plasma cell differentiation ( ) . it would be worthwhile to examine the importance of pcaf-xbp- s during the development of plasma cells. 
building an antibody factory: a job for the unfolded protein response
xbp mrna is induced by atf and spliced by ire in response to er stress to produce a highly active transcription factor
unconventional splicing of xbp mrna occurs in the cytoplasm during the mammalian unfolded protein response
xbp- regulates a subset of endoplasmic reticulum resident chaperone genes in the unfolded protein response
xbp : a link between the unfolded protein response, lipid biosynthesis, and biogenesis of the endoplasmic reticulum
effects of overexpression of x-box binding protein on recombinant protein production in chinese hamster ovary and ns myeloma cells
regulation of xbp- signaling during transient and stable recombinant protein production in cho cells
plasma cell differentiation and the unfolded protein response intersect at the transcription factor xbp-
plasma cell differentiation requires the transcription factor xbp-
the role of x-box binding protein- in tumorigenicity
targeting xbp- as a novel anti-cancer strategy
x box-binding protein regulates angiogenesis in human pancreatic adenocarcinomas
kaposi's sarcoma-associated herpesvirus-infected primary effusion lymphoma has a plasma cell gene expression profile
west nile virus infection activates the unfolded protein response leading to chop induction and apoptosis
japanese encephalitis virus infection initiates endoplasmic reticulum stress and an unfolded protein response
hepatitis c virus subgenomic replicons induce endoplasmic reticulum stress activating an intracellular signaling pathway
hepatitis c virus suppresses the ire -xbp pathway of the unfolded protein response
human cytomegalovirus protein us provokes an unfolded protein response that may facilitate the degradation of class i major histocompatibility complex products
human cytomegalovirus infection activates and regulates the unfolded protein response
flavivirus infection activates the xbp pathway of the unfolded protein response to cope with endoplasmic reticulum stress
modulation of the unfolded protein response by the severe acute respiratory syndrome coronavirus spike protein
coronavirus infection modulates the unfolded protein response and mediates sustained translational repression
endoplasmic reticulum stress triggers xbp- -mediated up-regulation of an ebv oncoprotein in nasopharyngeal carcinoma
semliki forest virus induced endoplasmic reticulum stress accelerates apoptotic death of mammalian cells
secretion of the human t cell leukemia virus type i transactivator protein tax
xbp- , a novel human t-lymphotropic virus type (htlv- ) tax binding protein, activates htlv- basal and tax-activated transcription
(u) encoded in xbp pre-mrna negatively regulates unfolded protein response activator pxbp (s) in mammalian er stress response
activation of human t cell leukemia virus type ltr promoter and cellular promoter elements by t cell receptor signaling and htlv- tax expression
atf activated by proteolysis binds in the presence of nf-y (cbf) directly to the cis-acting element responsible for the mammalian unfolded protein response
transcriptional and post-transcriptional gene regulation of htlv-
identification of the cis-acting endoplasmic reticulum stress response element responsible for transcriptional induction of mammalian glucose-regulated proteins. involvement of basic leucine zipper transcription factors
a regulatory subunit of phosphoinositide -kinase increases the nuclear accumulation of x-box-binding protein- to modulate the unfolded protein response
and p /camp-responsive element-binding protein associated factor interact with human t-cell lymphotropic virus type- tax in a multi-histone acetyltransferase/activator-enhancer complex
pcaf interacts with tax and stimulates tax transactivation in a histone acetyltransferase-independent manner
control of camp-regulated enhancers by the viral transactivator tax through creb and the co-activator cbp
identification of p x-responsive regulatory sequences within the human t-cell leukemia virus type i long terminal repeat
characterization of cellular factors that interact with the human t-cell leukemia virus type i p x-responsive -base-pair sequence
the p /cbp-associated factor (pcaf) is a cofactor of atf for amino acid-regulated transcription of chop
characterization of human activating transcription factor , a transcriptional activator that interacts with multiple domains of camp-responsive element-binding protein (creb)-binding protein
modulates atf stability and transcriptional activity independently of its acetyltransferase domain
endoplasmic reticulum stress induction of the grp /bip promoter: activating mechanisms mediated by yy and its interactive chromatin modifiers
distinct gcn /pcaf-containing complexes function as co-activators and are involved in transcription factor and global histone acetylation
the differentiation and stress response factor xbp- drives multiple myeloma pathogenesis
upregulation and overexpression of human x-box binding protein (hxbp- ) gene in primary breast cancers
expression and splicing of the unfolded protein response gene xbp- are significantly associated with clinical outcome of endocrine-treated breast cancer
we would like to thank dr kazutoshi mori, dr hiderou yoshida and dr arnold rabson for providing the expression and 
reporter plasmids, dr niki wong for critical review of the manuscript, and ms yi ling chia for expert technical assistance. conflict of interest statement. none declared. key: cord- -xfzhn n authors: jabado, omar j.; liu, yang; conlan, sean; quan, p. lan; hegyi, hédi; lussier, yves; briese, thomas; palacios, gustavo; lipkin, w. i. title: comprehensive viral oligonucleotide probe design using conserved protein regions date: - - journal: nucleic acids res doi: . /nar/gkm sha: doc_id: cord_uid: xfzhn n oligonucleotide microarrays have been applied to microbial surveillance and discovery where highly multiplexed assays are required to address a wide range of genetic targets. although printing density continues to increase, the design of comprehensive microbial probe sets remains a daunting challenge, particularly in virology where rapid sequence evolution and database expansion confound static solutions. here, we present a strategy for probe design based on protein sequences that is responsive to the unique problems posed in virus detection and discovery. the method uses the protein families database (pfam) and motif finding algorithms to identify oligonucleotide probes in conserved amino acid regions and untranslated sequences. in silico testing using an experimentally derived thermodynamic model indicated near complete coverage of the viral sequence database. the capacity of dna microarrays to simultaneously screen for hundreds of viral agents makes them an attractive supplement to traditional methods in microbiology. 
their utility has been demonstrated through detection of papilloma virus in cervical lesions ( ) , sars coronavirus in tissue culture ( ) , parainfluenza virus in nasopharyngeal aspirates ( ) , influenza from nasal wash and throat swabs ( , ) , gammaretrovirus in prostate tumors ( ) , coronaviruses and rhinoviruses from nasal lavage ( ) , metapneumovirus from bronchoalveolar lavage ( ) , filoviruses and malarial parasites in blood in hemorrhagic fever ( ) , and a wide variety of respiratory pathogens in nasal swabs and lung tissue ( ) . viral microarrays have increased in density and strain coverage as fabrication technologies have improved. cdna pathogen arrays derived from reference strain nucleic acids ( , ) have been replaced by oligonucleotide arrays due to their increased flexibility. oligonucleotide design strategies have focused on pairwise sequence comparisons to identify conserved regions within a variety of viral pathogens ( ) ( ) ( ) . multiple alignments have been used to design probes for clinically important virus genera, e.g. rotaviruses ( ) , orthopoxviruses ( ) or influenzaviruses ( ) . viral resequencing arrays have recently been introduced that allow single nucleotide resolution ( , ( ) ( ) ( ) . although such tiling arrays enable accurate typing, the number of probes required to build a resequencing array for all viral sequences exceeds current art. a comprehensive viral microarray should address the entire viral sequence database. pairwise nucleic acid comparisons, while rapid, do not scale well with sequence number and ignore valuable coding information. nonoverlapping segments, heterogeneous sizes and the large number of sequences preclude automated multiple alignments of nucleic acids for probe design. protein-protein comparisons are more sensitive for detecting conserved regions due to the power of substitution matrices ( ) ; however, at the time of writing, no reported oligonucleotide design algorithm leverages this information. 
the protein families database (pfam) ( ) is a repository of hand curated protein multiple alignments and hidden markov models (hmms) across all phylogenetic kingdoms. hmms are probabilistic representations of protein alignments that are well suited to identifying homologies ( , ) . beginning with the pfam database as a foundation, we established a tiered method for creating viral probes that uses all sequence information available for viruses. our method for probe design employs protein alignment information, discovered protein motifs, nucleic acid motifs and finally, sliding windows to ensure near complete coverage of the database. we pursued experiments to determine the effects of probe-target mismatch and background nucleic acid concentration on array sensitivity and specificity; results were used to derive parameters for probe design. west nile virus rna (wnv, strain new york , af ) was used as template in hybridization experiments on an agilent oligonucleotide array with complementary probes of length nucleotides (nt). approximately one third of the probes had between and randomly introduced mismatches. the plus and minus (reverse complement) strands of each sequence were deposited in duplicate. in addition to the flaviviral specific probes, the array contained nearly probes for other viral families, negative and positive controls. a volume containing copies of wnv and ng of background nucleic acid (human lung tissue rna) was amplified using random primers and hybridized in four replicate experiments as previously described ( ) . hybridizing a wnv isolate of known sequence allowed prediction of probe-viral hybrid strength and correlation to fluorescence data. to predict hybrids with high accuracy, smith-waterman alignments of the virus sequence against microarray probes were generated using the emboss bioinformatics suite ( ) . the number of mismatches was calculated for each expected probe-target pair. 
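the mismatch count for a probe-target pair can be sketched as follows. this is a minimal illustration, not the authors' code: it assumes the two rows of a local alignment are already available as equal-length gapped strings (for example, parsed from an alignment tool's output), and it treats every probe base that is not matched in the alignment, including bases left outside a local alignment, as a mismatch. the function name and this counting convention are assumptions.

```python
def count_mismatches(aligned_probe, aligned_target, probe_length):
    """Count probe bases that are not identically paired with the target.

    aligned_probe / aligned_target: the two rows of a pairwise alignment,
    with '-' marking gaps. probe_length: full (ungapped) probe length, so
    that probe bases outside a local alignment are counted as mismatches.
    """
    if len(aligned_probe) != len(aligned_target):
        raise ValueError("alignment rows must have the same length")
    # Identical, non-gap columns are the only positions counted as matches.
    matches = sum(
        1
        for probe_base, target_base in zip(aligned_probe.upper(), aligned_target.upper())
        if probe_base == target_base and probe_base != "-"
    )
    if matches > probe_length:
        raise ValueError("probe_length is shorter than the aligned probe")
    return probe_length - matches
```

counting unaligned probe ends as mismatches is a conservative choice; a different convention (for example, ignoring overhangs) would give lower counts for partially aligned probes.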
the change in gibbs free energy at c (hybridization temperature) was calculated using pairfold version . ( ) as a separate measure of probe-template binding strength. pairfold employs a dynamic programming algorithm to compute the minimum free energy structure (excluding pseudo-knots); the standard free energy model is used ( ) with empirical nearest neighbor energies ( ) . the arrays were visualized with an agilent slide scanner, then processed with the quantile normalization technique ( ) . spss version was used for statistics and data plots (http://www.spss.com/); fluorescence data are available as supplementary material. the embl nucleotide sequence database [july , release ; , nucleic acid sequences ( ) ] was chosen as the reference for this study because it is tightly integrated with the pfam protein family database ( , ). taxon growth was estimated using a standard least squares method, with the spss statistical package. a non-redundant database comprising sequences was generated with cd-hit ( ), using a similarity cutoff of % to define sequences as identical. bacteriophages were not included in the analysis; however, data were retained to allow probe design using the embl phage database. the pfam database comprises hand curated seed protein alignments that are converted to a probabilistic representation using hmms. these hmms are used to search the protein database for homologues that can be added to the seed to create a comprehensive alignment ( , ) . pfam domains were analyzed to identify short, conserved protein regions and corresponding nucleic acid sequences. in the first step, the log-odds score for each position of the hmm built from the seed alignment was summed; lower scores were considered to indicate conservation. the most conserved, non-overlapping amino acid (aa) regions were identified. 
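the window selection just described can be sketched as below. this is an illustrative sketch under stated assumptions: each alignment position is assumed to have already been reduced to a single number (lower meaning more conserved, as in the text), and windows are chosen greedily in order of increasing summed score, which is simple but not guaranteed to give the globally best non-overlapping set. the function name, window width and number of regions are hypothetical parameters.

```python
def conserved_windows(position_scores, width, max_regions):
    """Return start indices of the lowest-scoring non-overlapping windows.

    position_scores: one score per alignment column (lower = more conserved).
    width: window length in residues; max_regions: how many windows to keep.
    """
    n = len(position_scores)
    if width <= 0 or width > n or max_regions <= 0:
        return []
    # Summed score of every window, computed with a running sum.
    current = sum(position_scores[:width])
    window_sums = [(current, 0)]
    for start in range(1, n - width + 1):
        current += position_scores[start + width - 1] - position_scores[start - 1]
        window_sums.append((current, start))
    # Greedy choice: take the best window, then the next best one that
    # does not overlap anything already chosen, and so on.
    chosen = []
    for _, start in sorted(window_sums):
        if all(abs(start - other) >= width for other in chosen):
            chosen.append(start)
            if len(chosen) == max_regions:
                break
    return sorted(chosen)
```

the same routine could in principle serve a sliding-window step over a single protein sequence, given per-residue scores derived from a substitution matrix instead of an hmm.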
in the second step, protein alignments of all pfam-a families were extracted and mapped to their underlying nucleotide sequences by cross reference to the embl records. hmm parsing modules from the bioperl package were used. in the third step, the underlying nucleotide sequences were extracted and stored. in cases where the region contained gaps, flanking nucleotides were brought together to yield sequences of length . these sequences formed the basis for downstream probe design. domain alignments in pfam-b were not used in probe design because they are of lower quality; also, as domain quality improves these alignments will be integrated into pfam-a ( ). all coding nucleic acid sequences that were not part of a pfam-a alignment were extracted. in this step, the most conserved regions within homologous genes were identified for probe design. sequences were clustered at the protein level with cd-hit, using a similarity threshold of %. all sequence clusters were subjected to a meme motif search ( ) using the following parameters: motif width of , zero or one motif allowed per sequence, a minimum of two sequences per motif. three motifs were selected for each sequence cluster. the underlying nucleic acid sequence extracted for each protein motif was used for probe design. a sliding window approach was used for highly divergent sequences that did not share any motifs. using the pam matrix ( ), a summed log-odds score for every aa subsequence in the protein was calculated; the three least likely to vary (lowest log-odds score) were selected as regions for probe design. viruses often have highly conserved non-coding regions at the termini of their genomes or genome segments that serve critical roles in replication, transcription, and packaging. we reasoned that probes based in these regions may be useful in microarray design. we identified conserved probes across homologous regions in sequences annotated as utr, utr, ltr, and those without annotation. 
sequences were first clustered at the % threshold. clustered sequences were then subjected to a motif search using the same parameters employed for proteins, except that a length of nt per motif was specified. we addressed sequences that did not contain a shared motif separately; three non-overlapping nt subsequences were chosen as probes.

probe selection and minimization with set cover algorithm

an algorithm was designed to automate identification of the minimum set of probes required to address a repertoire of potential viral targets ( ) . in the first step of analysis, the number of mismatches between a probe and its viral target was computed; the algorithm considered a probe to be 'covering' if it had mismatches to the template. coverage data were converted to a matrix of binary values. a greedy algorithm was implemented to choose a probe combination from the matrix, minimizing the number of required probes. candidate probes were further screened to ensure a t m > c, no repeats exceeding a length of nt, no hairpins with stem lengths exceeding nt, and < % overall sequence identity to non-viral genomes. because it is not feasible to test all probes with all known viruses, we tested probe validity using a gibbs free energy model of hybridization. all probe sequences were compared to the non-redundant set of viral sequences by blastn ( ) . probe-target pairs were aligned by smith-waterman to ensure accuracy; mismatches and change in gibbs free energy at c (hybridization temperature) were then calculated. to gauge the performance of our probe selection algorithm, another comprehensive method was devised that used only nucleic acid sequence. sequences in the reference sequence viral genomes project ( ) are evenly distributed among viral families; therefore, we reasoned that probes derived from these sequences would provide broad coverage. to contrast with our method, we selected nt oligonucleotides end-to-end along all viral genomic refseq sequences ( viruses).
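as a rough illustration of the end-to-end tiling used for the comparison set, the sketch below cuts a sequence into consecutive, non-overlapping probes. the probe length is not recoverable from the text here, so it is left as a parameter; the function name and the toy genome are hypothetical, and discarding a trailing fragment shorter than one probe is an assumption.

```python
def tile_probes(genome, probe_length):
    """Cut a genome into consecutive, non-overlapping probes of fixed length.

    A trailing fragment shorter than probe_length is discarded (an assumption).
    """
    if probe_length <= 0:
        raise ValueError("probe_length must be positive")
    last_start = len(genome) - probe_length
    return [genome[i:i + probe_length] for i in range(0, last_start + 1, probe_length)]


toy_genome = "ACGTACGTACGTACGTACGTAC"  # 22 nt, made-up sequence
probes = tile_probes(toy_genome, probe_length=5)
print(len(probes), probes)
```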
this resulted in a tiling probe-set in which the number of interrogating probes was proportional to the length of the sequence.

the viral sequence database is dominated by gene fragments

we queried the embl viral database to determine the frequencies of coding sequences and full genomic sequences. the majority of viral sequences were < kb. a commonly used method to reduce sequence complexity is generation of a non-redundant sequence set by clustering ( ) . we grouped sequences at the % identity level and selected the longest sequences as unique representatives of each group. this method was used to assess the growth of sequence diversity between january and the current release of july . the database grew % in the -year period, doubling every three years. unique sequences decreased as a proportion of the database, from % to %; overall growth of unique sequences was % (figure b) . the current database comprised unique sequence representatives at the % similarity level. thus, the growth in the number of sequences in the viral database has been rapid, while growth in diversity has been more modest. one hypothesis to explain this slower growth of sequence diversity is that many of the existing viruses infecting humans have already been discovered and new isolates deposited are variants of well studied viruses. we charted the growth of viral taxonomic groups as a function of time to visualize trends in viral discovery (figure c ). the number of families and genera has remained stable since ; however, the number of sequences that have been classified as a new species has steadily risen. a least squares fit of this growth indicates that the steady increase in new species characterization is likely to continue, while the discovery of new viral families will be less common.
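the growth estimates above rest on a least squares fit. the sketch below shows one way a doubling time could be derived from yearly sequence counts, by fitting log2(count) against time; the study used spss, so this is only an illustration, and the function names and counts are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def doubling_time(years, counts):
    """Doubling time (in years) from a least-squares fit of log2(count) against year."""
    slope, _ = fit_line(years, [math.log2(c) for c in counts])
    return 1.0 / slope


# Made-up counts constructed to double every three years.
years = [0, 3, 6, 9]
counts = [1000, 2000, 4000, 8000]
print(round(doubling_time(years, counts), 3))
```

fitting on a log scale assumes exponential growth; a linear fit on raw counts would suit a steady, additive increase such as the species counts described in the text.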
a tiered, protein-motif-based approach to probe design addresses all viral sequences

nucleotide sequences were divided into four subtypes: (i) coding sequences that corresponded to pfam-a alignments (cpf), (ii) coding sequences not in the pfam-a (cnpf), (iii) sequences that were annotated as untranslated regions (utr) or long terminal repeats (ltr) and (iv) sequences that were unannotated (ua). we sought to match the quality of pfam-a alignments in the non-pfam coding sequences by clustering them into groups of related sequences, approximating homologous genes. these were then subjected to a protein motif finding program to identify the conserved regions within each cluster. the untranslated and unannotated sequences were subjected to a similar clustering analysis, but at the nucleotide level. all four subtypes were subjected to the same three-step design method: identification of conserved regions, extraction of nucleotide probe sequences, and minimization of covering probes. by allowing a limited number of mismatches to cognate templates, the number of probes required can be reduced. the mismatch threshold was determined based on experiments with west nile virus (strain new york , af ), which indicated that high, homogeneous fluorescence signal was observed when probes had five or fewer mismatches to the viral template ( figure ). the probe minimization technique serves to lower microarray printing costs and simplify analysis while maintaining sequence coverage. a flowchart of the design method is depicted in figure . the most recent pfam-a release (version ) comprised families, of which had viral members. of annotated protein sequences with length > aa, ( . %) belonged to a pfam-a family, while ( . %) did not. three probes were chosen for each gene, yielding a total of cpf and cnpf probes. of sequences not contained in pfam-a, only . % ( ) were found in pfam-b alignments.
thus, due to the lower quality of alignments ( ) and poor viral representation, the pfam-b was not used for probe design. the untranslated regions processed yielded probes. for the unannotated sequences processed, probes were designed. sequences that were not covered due to high/low gc%, low complexity, repetitive sequence or a preponderance of ambiguous nucleotides ( ) were processed with a sliding window strategy; probes were designed. overall, the number of probes required to address all viral sequences was . sequence counts and probe counts for the most recent embl/pfam release are detailed in figure . an example of typical probe distribution is shown with respect to the dengue virus genome (nc_ ; figure ). probe sequence composition is a major determinant of hybridization signal and is responsible for much of the variance between probes that target the same nucleic acid strand. probe-target thermodynamics have been successfully modeled to predict fluorescence ( , ) , control for variance ( ) and even estimate concentrations of target detected in samples ( ) . observing that some probes with more than five mismatches to their targets showed strong fluorescent signal, we concluded that sequence composition is a major factor in our array platform. we sought to validate the probe design method by generating a simple thermodynamic model to predict hybridization signal based on sequence composition. we computed the change in gibbs free energy (Δg) for all expected probe-viral nucleic acid pairs in the west nile virus hybridization experiments described above. the calculation method employed finds the most thermodynamically stable structure (minimum free energy) ( ) based on empirically established nearest neighbor energies ( ) . strong signal was observed from probe-virus hybrids with Δg of − . kj or less. thus, this value was chosen as the threshold to classify a probe as likely to generate high signal when the cognate viral target is present (figure ).
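to make the idea of a nearest neighbor free energy threshold concrete, the sketch below sums nearest neighbor terms for a perfectly matched dna duplex and compares the result with a cutoff. this is a deliberately simplified stand-in, not the pairfold calculation used in the study: it ignores mismatches, secondary structure, the symmetry correction and the actual hybridization temperature. the parameter table holds the commonly quoted unified dna nearest neighbor values at 37 °c in kcal/mol, entered here as an assumption that should be checked against the published table before any real use; the threshold in the example is made up.

```python
KCAL_TO_KJ = 4.184

# Assumed unified DNA nearest-neighbor free energies (kcal/mol, 37 degrees C),
# keyed by the 5'->3' dinucleotide on one strand of a perfectly matched duplex.
NN_DG37 = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}
INIT_TERMINAL_GC = 0.98  # initiation term for a terminal G-C pair (assumed value)
INIT_TERMINAL_AT = 1.03  # initiation term for a terminal A-T pair (assumed value)


def duplex_dg37_kj(seq):
    """Approximate free energy (kJ/mol) of a perfectly matched DNA duplex."""
    seq = seq.upper()
    total = sum(NN_DG37[seq[i:i + 2]] for i in range(len(seq) - 1))
    for end in (seq[0], seq[-1]):
        total += INIT_TERMINAL_GC if end in "GC" else INIT_TERMINAL_AT
    return total * KCAL_TO_KJ


def likely_high_signal(seq, threshold_kj):
    """True if the duplex free energy is at or below the (user-supplied) threshold."""
    return duplex_dg37_kj(seq) <= threshold_kj


print(round(duplex_dg37_kj("ACGTTGCA"), 2))
print(likely_high_signal("ACGTTGCA", threshold_kj=-20.0))  # threshold is hypothetical
```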
probes will be designed in the area of short motifs of aa or nt

figure . comprehensive motif-based probe design. the embl viral database is clustered with a threshold of % nucleotide identity to create a non-redundant sequence database. coding sequences are subjected to an amino acid motif search, and then probes are made from the underlying nucleic acid sequences. similarly, nucleic acid motifs are found in non-coding sequences and used to make probes. database coverage is checked; supplementary probes for highly divergent sequences are designed as necessary. acronyms: pfam, protein families database; meme, multiple expectation maximization for motif elicitation; utr, untranslated region; ltr, long terminal repeat.

motif-based probe design provides higher coverage than virus genome tiling

use of motif finding and set cover minimization markedly increases the computational resources needed to generate probe sets. to determine whether increased complexity results in a more comprehensive probe set, we compared our method to a genome tiling strategy. probes of nt were designed end-to-end along the entire genome for all reference sequence viral strains available as of may . the tiling probe set served as a contrast to our design method since it was based on nucleic acid sequence, had more probes per gene, required less computation, and included viruses from all genera. in comparison of the methods, the following rules were used to compute database coverage: sequences > nt in length were considered covered if six or more probes met hybridization criteria; sequences < nt in length were considered covered if two probes met hybridization criteria; sequences < nt in length were considered covered with a single probe meeting hybridization criteria. coverage of the entire database was gauged by computing probe-template Δg for all unique sequence representatives. database coverage using the tiling method was . % and required probes; coverage using the motif-based method was .
% and required probes (table ) . whereas probe design in motif-based arrays can exploit partial genome sequences, probes in tiling arrays are based on full length genome sequences. complete reference sequence genomes represent % of embl sequence entries. although at least one full length genome sequence is described for all viral genera, only % ( of ) of viral species have a fully sequenced representative genome. the impact of differences between the motif-based and tiling-based strategies for probe design is reflected in differences in coverage. for the tiling-based probe-set, of families with < % sequence coverage included species lacking representative genomes. coverage with motif-based probe-sets for these same species was ≥ %. there is an increasing appreciation for the power of microarray technology in clinical microbiology, public health and environmental surveillance. viral microarray probe design poses unique challenges due to the rapid increase in sequence data and the high propensity for sequence divergence within viral taxa. to ensure coverage of the newest isolates it is essential to consider partial as well as complete genomic sequences in probe design. probe design based on multiple alignments or pairwise comparisons of nucleic acids for all known sequences is computationally intensive and scales poorly with database size. protein sequence comparisons are rapid and incorporate rich evolutionary models, but require a cumbersome mapping step to extract underlying nucleic acid sequence. we have described a method that capitalizes on the pfam protein alignment database and a motif finding algorithm to automate the extraction of nucleic acid sequence for probes from conserved protein regions.
the protein motif-centric method has several advantages: (i) the majority of viral nucleic acid sequences encode proteins; thus, using this information leverages knowledge about function; (ii) protein sequence comparison and the resulting probesets are independent of viral taxonomy; this may enable incorporation of misclassified sequences; (iii) the pfam is a well established and highly annotated database that will provide a basis for future design efforts; and (iv) probes designed in conserved regions may be able to detect novel viruses. a second application of this design method is viral expression profiling. insights into the replication cycle, host evasion and virulence factors may be obtained by monitoring viral transcript levels during infection. to this end, arrays could be synthesized that combine probes for a single viral family and all host genes. because the viral probe sets generated by our method account for known variants across all genes, a variety of strains could be profiled with a single array. this would provide a unique experimental platform for investigating virus biology, while minimizing fabrication cost and simplifying analysis.

figure . gibbs free energy model of hybridization signal. the change in gibbs free energy of probe-west nile virus hybrids was computed. aliquots of west nile virus (new york strain rna) at copies were spiked into ng of human lung (background) rna. the fluorescent signal values of replicate arrays were log transformed, normalized, and converted to z-scores. % confidence intervals of the mean for fluorescence versus gibbs energy are plotted. probe-virus hybrids with free energy of − . kj or less had high fluorescence; this value was chosen as the threshold for considering a probe likely to generate a strong signal when the target virus is present (dotted line).

the thresholds used to design and validate probes were experimentally determined for the agilent technologies array platform and the types of clinical samples our laboratory encounters. probe length can be selected to emphasize efficient coverage of higher order taxa or speciation. the goal of this project is to cover all known viral sequences and optimize potential for detecting related viral sequences. thus, we designed nt probes because they can better tolerate mismatched templates than nt oligonucleotide probes ( ) . using an empirical approach, appropriate thresholds can be determined for other array platforms, hybridization conditions, and probe lengths. the method of probe design and set cover minimization is flexible and agnostic of platform; application to bead, solution, or surface-based hybridization technology should be straightforward. although the growth of the public sequence databases has been rapid, sequence diversity has not grown as quickly. if this trend continues, we anticipate that only incremental updates to a core set of probes will be needed to maintain array integrity. an update strategy would require periodic testing of probe sets against newly deposited sequences and fresh design only in the cases of high sequence divergence. supplementary data are available at nar online.
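the set cover minimization referred to in this article can be sketched as follows: count mismatches between each probe and each target, convert the result to a binary coverage matrix, and pick probes greedily until every coverable target is covered. this is an illustration under stated assumptions rather than the authors' code. function names and the toy sequences are hypothetical, mismatches are counted by ungapped comparison at every offset (the study used blastn and smith-waterman alignment), and the mismatch limit is a parameter.

```python
def mismatches(probe, site):
    """Number of differing positions between a probe and an equal-length target site."""
    return sum(a != b for a, b in zip(probe, site))


def coverage_matrix(probes, targets, max_mismatches):
    """Binary matrix: entry [i][j] is True if probe i covers target j at some offset."""
    matrix = []
    for probe in probes:
        row = []
        for target in targets:
            sites = (target[s:s + len(probe)]
                     for s in range(len(target) - len(probe) + 1))
            row.append(any(mismatches(probe, site) <= max_mismatches for site in sites))
        matrix.append(row)
    return matrix


def greedy_set_cover(matrix):
    """Choose probe indices greedily until every coverable target is covered."""
    n_targets = len(matrix[0]) if matrix else 0
    # Targets that no probe covers are left out so the loop always terminates.
    uncovered = {j for j in range(n_targets) if any(row[j] for row in matrix)}
    chosen = []
    while uncovered:
        best = max(range(len(matrix)),
                   key=lambda i: sum(matrix[i][j] for j in uncovered))
        chosen.append(best)
        uncovered -= {j for j in uncovered if matrix[best][j]}
    return chosen


toy_probes = ["ACGT", "TTTT", "ACGA"]
toy_targets = ["GGACGTGG", "CCACGACC", "AATTTTAA"]
toy_matrix = coverage_matrix(toy_probes, toy_targets, max_mismatches=1)
print(greedy_set_cover(toy_matrix))
```

the greedy rule gives an approximation to the minimum set, not a guaranteed optimum; in the toy example the third probe is dropped because the first already covers the same two targets within one mismatch.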
correlation of cervical carcinoma and precancerous lesions with human papillomavirus (hpv) genotypes detected with the hpv dna chip microarray method
viral discovery and sequence recovery using dna microarrays
microarray detection of human parainfluenzavirus infection associated with respiratory failure in an immunocompetent adult
broad-spectrum respiratory tract pathogen identification using resequencing dna microarrays
experimental evaluation of the fluchip diagnostic microarray for influenza virus surveillance
identification of a novel gammaretrovirus in prostate tumors of patients homozygous for r q rnasel variant
pan-viral screening of respiratory tract infections in adults with and without asthma reveals unexpected human coronavirus and human rhinovirus diversity
diagnosis of a critical respiratory illness caused by human metapneumovirus by use of a pan-virus microarray
panmicrobial oligonucleotide array for diagnosis of infectious diseases
detection of respiratory viruses and subtype identification of influenza a viruses by greenechipresp oligonucleotide microarray
dna microarrays for virus detection in cases of central nervous system infection
detection of potato viruses using microarray technology: towards a generic method for plant viral disease diagnosis
microarray-based detection and genotyping of viral pathogens
database to dynamically aid probe design for virus identification
design of microarray probes for virus identification and detection of emerging viruses at the genus level
detection and genotyping of human group a rotaviruses by oligonucleotide microarray hybridization
detection and discrimination of orthopoxviruses using microarrays of immobilized oligonucleotides
robust sequence selection method used to develop the fluchip diagnostic microarray for influenza virus
sequence-specific identification of pathogenic microorganisms using microarray technology
tracking the evolution of the sars coronavirus using high-throughput, high-density resequencing arrays
genechip resequencing of the smallpox virus genome can identify novel strains: a biodefense application
amino acid substitution matrices from an information theoretic perspective
pfam: a comprehensive database of protein domain families based on seed alignments
profile hidden markov models
sequence comparison and protein structure prediction
emboss: the european molecular biology open software suite
rnasoft: a suite of rna secondary structure prediction and design software tools
calculating nucleic acid secondary structure
a unified view of polymer, dumbbell, and oligonucleotide dna nearest-neighbor thermodynamics
a comparison of normalization methods for high density oligonucleotide array data based on variance and bias
embl nucleotide sequence database: developments in
pfam: clans, web tools and services
cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
an artificial intelligence approach to motif discovery in protein sequences: application to steroid dehydrogenases
atlas of protein sequence and structure
greene scprimer: a rapid comprehensive tool for designing degenerate primers from multiple sequence alignments
gapped blast and psi-blast: a new generation of protein database search programs
national center for biotechnology information viral genomes project
rationale and uses of a public hiv drug-resistance database
global epidemiology of hiv
modeling of dna microarray data by using physical properties of hybridization
thermodynamic calculations and statistical correlations for oligo-probes design
improving comparability between microarray probe signals by thermodynamic intensity correction
absolute mrna concentrations from sequence-specific calibration of oligonucleotide arrays
expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer

the work presented here was supported by national institutes of health awards (ai , northeast biodefense center u -ai -lipkin,
ai , hl ey and t gm ). we thank carolyn morrison for excellent technical assistance. funding to pay the open access publication charges for this article was provided by nih u -ai -lipkin. conflict of interest statement. none declared.

key: cord- -il mz na authors: rodnina, marina v; korniy, natalia; klimova, mariia; karki, prajwal; peng, bee-zen; senyushkina, tamara; belardinelli, riccardo; maracci, cristina; wohlgemuth, ingo; samatova, ekaterina; peske, frank title: translational recoding: canonical translation mechanisms reinterpreted date: - - journal: nucleic acids res doi: . /nar/gkz sha: doc_id: cord_uid: il mz na

during canonical translation, the ribosome moves along an mrna from the start to the stop codon in exact steps of one codon at a time. the collinearity of the mrna and the protein sequence is essential for the quality of the cellular proteome. spontaneous errors in decoding or translocation are rare and result in a deficient protein. however, dedicated recoding signals in the mrna can reprogram the ribosome to read the message in alternative ways. this review summarizes the recent advances in understanding the mechanisms of three types of recoding events: stop-codon readthrough, – ribosome frameshifting and translational bypassing. recoding events provide insights into alternative modes of ribosome dynamics that are potentially applicable to other non-canonical modes of prokaryotic and eukaryotic translation. ribosomes produce proteins by translating the sequence of an mrna into the amino acid sequence of a protein. to make a protein that is encoded by a given open reading frame (orf) of an mrna, the ribosome has to select the correct aug codon to start translation, ensure the collinearity of the mrna and the protein sequences during translation elongation, and terminate translation at a stop codon marking the end of the orf. cells have evolved sophisticated control mechanisms that ensure fidelity of each translation phase.
however, in special cases, signals encoded in an mrna reprogram the ribosome to read the message in an alternative way, a phenomenon called translational recoding. in this review, we will focus on three types of recoding: (i) stop-codon readthrough; (ii) ribosome frameshifting and (iii) translational bypassing (figure ). during translation elongation, the mrna is decoded with the help of aminoacyl-trnas (aa-trna) that are delivered to the ribosome in complex with an elongation factor (ef-tu in bacteria or eef in eukaryotes) and gtp. the ribosome selects the trnas according to the match between the mrna codon and the trna anticodon. failing to discriminate against incorrect aa-trna results in missense errors of translation. generally, the fidelity of decoding is very high, with a frequency of missense errors in the range from < − to − per codon depending on the type of mismatch and the position of the amino acid in the protein ( ) ( ) ( ) ( ) . at the end of the open reading frame, stop codons (uaa, uag and uga) are recognized by termination (release) factors (rf and rf in bacteria or erf in eukaryotes). the frequency of occasional readthrough is low, < − per stop codon ( , ) , but it can increase dramatically, up to . - . , when induced by sequence and structural elements in the mrna and by trans factors ( ) ( ) ( ) . missense and nonsense errors are mistakes of decoding. after peptide bond formation, the ribosome moves along the mrna to read the next codon in a tightly orchestrated process of translocation. in order to produce a correct protein, the ribosome must be translocated by exactly one codon at a time. failing to maintain the correct reading frame results in ribosome frameshifting in the - or + direction. depending on the conditions, frameshifting errors can occur during decoding or translocation. the frequency of spontaneous frameshifting is rather low, i.e. < − ( ) ( ) ( ) .
signals in the mrna provide a context in which frameshifting is greatly enhanced, which is referred to as programmed ribosome frameshifting (prf). the efficiency of prf can vary in a wide range between . % and %, depending on the organism and the frameshifting sequence (for reviews, see ( ) ( ) ( ) ( ) ). finally, translational bypassing is a recoding phenomenon that produces a single protein from a discontinuous reading frame. bypassing is a post-decoding event that requires multiple signals in the mrna. in the following, we will discuss the mechanisms of each of these recoding events in light of the recent progress of biochemical, kinetic and structural studies.

figure . three types of recoding events. translational readthrough extends the polypeptide c-terminally allowing the production of two protein isoforms from the same transcript. frameshifting typically produces two functional polypeptides from different reading frames of the same mrna. bypassing is a recoding event that synthesizes one protein from two discontinuous open reading frames.

stop codon readthrough can result from decoding of a stop codon as a sense codon by a near-cognate trna. natural trnas that are prone to readthrough usually have an anticodon that has a single mismatch upon pairing to a stop codon, such as trna gln , trna tyr , trna cys or trna trp ( ) . translational readthrough is widely employed by viruses to expand the coding potential of their limited genome ( , ( ) ( ) ( ) . readthrough does not alter the translational reading frame, but rather extends the polypeptide c-terminally allowing the production of two protein isoforms from the same transcript. the c-terminal extension can carry cellular localization signals or homo/heterodimerization domains or alter the function of the protein such as its ligand-binding properties ( table ) . the minimal mrna sequence motif that modulates readthrough comprises the stop codon (nt + , + , + ) and its context from nt - to + ( figure ).
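the effect of readthrough on the protein product can be shown with a toy translation routine: in normal decoding translation stops at the first in-frame stop codon, whereas in readthrough that stop is decoded as an amino acid and translation continues to the next in-frame stop. the sketch uses the standard genetic code; the function name, the example mrna and the choice of glutamine as the inserted residue are illustrative assumptions only.

```python
BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna, readthrough_as=None):
    """Translate from position 0 in frame.

    If readthrough_as is given, the first stop codon is decoded as that amino acid
    and translation continues to the next in-frame stop codon.
    """
    protein = []
    first_stop_seen = False
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            if readthrough_as is None or first_stop_seen:
                break
            first_stop_seen = True
            aa = readthrough_as
        protein.append(aa)
    return "".join(protein)


toy_mrna = "AUGGCUUAGCAAUUUUAA"  # AUG GCU UAG CAA UUU UAA, made-up sequence
print(translate(toy_mrna))                      # canonical product
print(translate(toy_mrna, readthrough_as="Q"))  # c-terminally extended isoform
```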
the propensity for readthrough is lowest on the uaa and highest on the uga codon ( ) ( ) ( ) ( ) . the context of stop codons shows a non-random distribution of nucleotides in escherichia coli and in humans ( ) . the presence of two adenines at positions - and - favors readthrough ( ) . the presence of a cytidine at position + (c+ ) is associated with leaky termination in various organisms, in particular on uga codons ( , ( ) ( ) ( ) ( ) ( ) ; notably, uga c or uag c are rare in mammals ( ) . the effect of bases other than c+ varies between the three stop codons ( , , ) . the nucleotides + to + in the context of uga-cua or uga-cgg induce readthrough in a number of viral and eukaryotic genes ( , , , ) . in several cases, the mrna context up to nt + can modulate readthrough ( , ) . for example, in the tobacco mosaic virus (tmv) replicase gene, the consensus sequence caryya (r, purines; y, pyrimidines) triggers readthrough at all stop codons ( ) . the structural basis for sequence effects in readthrough is unclear. recognition of stop codons by rfs is achieved by sequence- and shape-specific recognition of the three nucleotides of the stop codons (nt + to + ) and, in eukaryotes, of the adjacent nucleotide + ( , ) . nucleotides + and + are involved in stacking interactions with rrna bases around the decoding center, which are more stable with purines than pyrimidines ( , ) . this might suggest that c or u at positions + and/or + decreases the stability of the decoding complex and interferes with the compaction of the mrna in the a site, which is a hallmark of stop-codon recognition by erf in eukaryotes ( ) . although the details of stop codon recognition differ between bacteria and eukaryotes, there are indications that adenosines in positions + and + interact with the s rrna, which might account for the reported context bias in prokaryotes ( ) .
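the context rules above lend themselves to a simple scan. the sketch below locates the first in-frame stop codon of a reading frame and reports the nucleotide that follows it, flagging a cytidine there as a possibly leaky context. it assumes that the position discussed in the text is the base immediately 3' of the stop codon; the function name and example sequence are hypothetical, and a flag from this scan is only a crude indicator, not a readthrough prediction.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def first_stop_context(mrna, orf_start=0):
    """Find the first in-frame stop codon and describe the base that follows it.

    Returns None if the frame contains no stop codon.
    """
    mrna = mrna.upper().replace("T", "U")
    for i in range(orf_start, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon in STOP_CODONS:
            following = mrna[i + 3] if i + 3 < len(mrna) else None
            return {
                "position": i,
                "stop": codon,
                "next_base": following,
                "c_after_stop": following == "C",
            }
    return None


print(first_stop_context("AUGGCUUGACUAG"))  # made-up sequence: AUG GCU UGA C...
```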
in addition to the immediate context, more distal stimulatory cis elements involving mrna structures regulate readthrough in several viral and eukaryotic mrnas ( , ( ) ( ) ( ) ( ) . for example, an -nucleotide sequence downstream of the stop codon in the drosophila hdc gene forms a stem-loop (sl) structure that stimulates readthrough ( ) . cis-acting rna structures can modulate readthrough by (i) interfering with release factor recruitment to the ribosome; (ii) modulating ribosome function by interacting with ribosomal proteins or rrnas; (iii) inducing ribosome stalling or (iv) recruiting trans factors ( , , ) . we note that the sequences downstream of stop codons evolved to limit the negative consequences of leaky termination, as in-frame stop codons are significantly over-represented immediately downstream of the primary stop signal, which ensures termination in close proximity of the correct end of the orf ( ) . in addition to elements in the mrna, several trans factors may influence the efficiency of termination by various mechanisms. for example, readthrough of the mammalian vascular endothelial growth factor a (vegfa) mrna is facilitated by the heterogeneous nuclear ribonucleoprotein (hnrnp) a /b that binds the hnrnp a /b recognition element (a re) in the termination region ( ) (figure ). recently, eif was proposed to promote readthrough at all three stop codons in leaky context by preventing erf from recognizing the third position of the stop codon ( ) . depletion of termination factors erf and/or erf results in increased levels of readthrough in humans independent of the codon context ( , ) . the [psi + ] strain of saccharomyces cerevisiae exhibits the epigenetically inherited prion state of termination factor erf where translation termination is compromised. in these strains, erf forms amyloid fibrils that sequester a part of the release factor pool ( ) ( ) ( ) . the abundance and properties of trnas also influence readthrough efficiency ( , , ) .
for example, the relative abundance of the major trna gln isoacceptor with the -uug- anticodon compared to the minor trna gln with -cug- in s. cerevisiae explains why glutamine is preferentially incorporated at uaa compared to uag, despite the same non-conventional g-u base pairing that forms upon decoding. the modification of the trna bases within the anticodon or in its vicinity affects its ability to read stop codons ( , ) .

figure . factors affecting translational readthrough in eukaryotes. cis factors that affect readthrough include sequences upstream of the stop codon (light gray), the identity of the stop codon (red-orange), the + nucleotide (blue) and the downstream sequences that occupy the mrna channel (green). distal cis element includes downstream mrna secondary structure. among several trans factors that affect readthrough, the specific case of hnrnp a /b is depicted. hnrnp a /b promotes readthrough by binding to a cis element in the utr of mammalian gene vegfa. a, p and e depict the three stable trna-binding sites. ssu, small ribosomal subunit; lsu, large ribosomal subunit.

the prevalence of readthrough varies between organisms. analysis of the stop codon contexts of drosophila species and ribosome profiling studies suggested potential readthrough in several hundred drosophila genes ( , ) . however, similar genomic analyses and profiling studies of human genes have so far found only a few candidate genes ( , , ) . computational analysis of readthrough protein isoforms suggests that these are mostly long, modular proteins with intrinsically disordered c-termini of low sequence complexity ( , ) . the lack of a structurally ordered c-terminus might provide conformational flexibility that allows the readthrough extensions to perform functions without distorting the native protein. the majority of readthrough genes identified in d.
melanogaster have regulatory roles, and appending a functional c-terminal extension may confer conditional advantage to protein function. in addition to readthrough by near-cognate aa-trnas, stop codons can be recoded by the specialized cognate trnas with an anticodon that is complementary to the stop codon, such as trna pyl or trna sec ( , ) . pyrrolysine and selenocysteine are natural proteinogenic non-canonical amino acids that are not encoded by a sense codon. pyl-specific trna pyl reads the uag stop codon, whereas sec-specific trna sec reads the uga codon ( ) ( ) ( ) ( ) . the pyl trait is restricted to several microbes, mostly methanogenic archaea, which encode a trna pyl (pylt) and the dedicated aa-trna synthetase (pyls). pyl-trna pyl is recognized by ef-tu. genome analysis of pyl-containing organisms suggested that uag is not a typical stop signal in pyl-utilizing archaea and that pyl insertion can effectively compete with translation termination for uag codons, obviating the need for specific mrna structures that recruit trna pyl to a specific stop codon ( ) . in contrast to trna pyl , trna sec is found in bacteria, archaea and eukaryotes. sec is required for synthesis of a specialized group of proteins, selenoproteins. sec-trna sec is delivered to the ribosome by the specialized elongation factor selb (efsec in eukaryotes), a gtp-binding protein that belongs to the family of translational gtpases ( , ) . the key element for recruitment of the selb-gtp-sec-trna sec to the stop codon on bacterial ribosomes is a selenocysteine insertion sequence (secis) in the mrna, a sl structure located immediately downstream of the in-frame uga codon at which sec is incorporated ( ) . recent cryo-em structures revealed how sec-trna sec -selb-gtp is recognized by the ribosome ( ) ( figure a ). because trna sec is cognate for the uga codon, the codon-anticodon recognition initiates the same ribosome rearrangements as the canonical aa-trna-ef-tu-gtp complex ( ) ( ) ( ) .
this includes the domain movements of the ssu, gtpase activation of the factor by the interaction with the sarcin-ricin loop on the lsu and the accommodation of aa-trna on the lsu upon dissociation of selb-gdp ( ) . however, some details of the interaction are sec-specific. secis recruits selb domain . the specific recognition of sec-trna sec by selb is achieved by interactions of unique regions in selb with the extra-long variable arm of trna sec and the acceptor and t-stems of trna sec ( ) . these elements distinguish trna sec from canonical trnas ( ) . finally, the amino acid-binding pocket of selb is lined with positively charged residues, allowing selb to specifically recognize the negatively charged selenol group and to discriminate against ser-trna sec ( , ) . the affinity of selb-gtp for sec-trna sec is very high, with a k d in the picomolar range ( ) . also, selb binding to the secis is in the nanomolar range and is rapid (k on = m − s − ) ( ). this implies that in the cell the sec-trna sec -selb-gtp complex can bind to the secis before it enters the ribosome, thereby facilitating the recruitment of sec-trna sec to the uga codon preceding the secis. although trna sec is recognized by the ribosome as a cognate aa-trna ( , ) , the efficiency of sec incorporation is only about %, whereas % of the ribosomes terminate translation with the help of rf ( ) . why some translating ribosomes incorporate sec and others do not remains unclear. surprisingly, rf does not act as a direct competitor of sec, but rather terminates translation on the ribosomes that failed to incorporate sec. it is possible that when the ribosome arrives at the uga, the secis-bound sec-trna sec -selb-gtp blocks the entrance of rf to the a site ( figure b ). however, if the attempt to deliver sec is unsuccessful, the interaction of selb with the secis will be lost eventually, thereby freeing the access for rf to the stop codon. 
alternatively, conformational heterogeneity of translating ribosomes and the folding-unfolding dynamics of the secis may define the preference for sec binding on one fraction of ribosome complexes, whereas the other fraction favours rf ( ) . the propensity of the ribosome for spontaneous frameshifting depends on the stability of the codon-anticodon complexes. early studies suggested that in solution even fully matched codon-anticodon complexes dissociate very rapidly, at - s − ( ) . in the a site of the ribosome, the dissociation is much slower, about . s − ( ) . however, when these stabilizing ribosome interactions are released during translocation, the trna may unpair from the mrna within the time of translocation and thus the inherent stability of the codon-anticodon complex may be insufficient to hold the trna in frame. at mrna sequences where trna pairing with its -frame codon is favored over - or + alternative frames, transient loss of base pairing may be unimportant, because even if the anticodon dissociates from the codon, the -frame codon is the most likely target for it to rebind. however, when the mrna sequence is 'slippery', i.e., allows trna base pairing with the codon in the - - or + -frame, the loss of interactions with the codon, together with the movements of the elements of the ssu that occur during translocation, may result in frameshifting. a recent crystal structure of a translocation intermediate formed in the absence of ef-g indeed shows that the interactions of the ribosome with the codon-anticodon complex are disrupted and the a-site trna in the complex is shifted by one nucleotide toward the - -frame of the mrna ( ) (figure ) . in comparison, in crystal structures obtained in the presence of ef-g, residues at the tip of domain of ef-g interact with the a-site trna and prevent it from shifting ( ) . the interacting residues at the tip of ef-g domain , h and q (e. coli numbering), are known to play a key role in translocation ( ) . 
figure caption, continued: the gtpase of selb is activated by the sarcin-ricin loop (srl) of s rrna. (b) secis-mediated sec insertion versus rf -dependent termination at uga. the sec-trna sec -selb-gtp complex is rapidly recruited to the secis while still distant from the ribosome. step : while the ribosome moves along the mrna toward the uga codon, the lower part of the secis becomes unwound and the sec-trna sec -selb-gtp complex occupies the entry to the a site, thereby hindering the recruitment of rf to the stop codon. step : after delivery of sec-trna sec to the a site and sec insertion into the growing peptide, the ribosome can recruit the next ef-tu-gtp-aa-trna complex (gray) and continue translation. alternatively (step ), if sec incorporation fails, the a site becomes accessible for rf , which promotes termination and peptide release.

in contrast to spontaneous frameshifting, which produces non-functional polypeptides, prf typically leads to the synthesis of a functional polypeptide from an altered frame. prf was initially identified in viral genomes, where it plays an important role in viral propagation by modulating synthesis of viral proteins in specific stoichiometric ratios ( , ) . examples of - prf were found in all three domains of life ( ) ( ) ( ) ( ) ( ) ( ) . in eukaryotes, frameshifting can regulate the stability of an mrna. after a frameshifting event, the translating ribosome soon encounters an out-of-frame stop codon, causing premature termination of translation and thereby recruiting the machinery of the nonsense-mediated decay pathway ( ) . in most cases, - prf is facilitated by two regulatory elements in the mrna sequence, a slippery site and a secondary structure element (a pseudoknot, a sl or a kissing loop) at a precisely defined distance of to nt from the slippery site ( , ( ) ( ) ( ) . the mrna structure element stalls the ribosome, which facilitates slippage ( , ). - prf can
also be facilitated by binding of mirnas ( ) or proteins ( ) ( ) ( ) ( ) to the sequence following the slippery site. recent mechanistic studies suggested that despite the great variety of the frameshifting sequences, - frameshifting follows one of two main pathways ( - ) ( figure ). one route is predominant under translation conditions where the trnas that read the slippery sequence codons are abundant. in this case, frameshifting occurs at the late stage of translocation, with two trnas moving through the ribosome, and requires the presence of the stimulatory element within the mrna sequence. the other route is favored under conditions of aa-trna limitation and occurs via one-trna slippage of the p-site trna when the a site is vacant; its efficiency is independent of the downstream mrna stimulators. the latter mechanism is often called 'hungry' frameshifting, because it can be triggered by aa-trna limitation due to starvation ( ) ( ) ( ) . the detailed insights into the kinetic mechanism of translocation-dependent - prf came from ensemble and single molecule kinetic studies on a/ b mrna of the avian infectious bronchitis virus (ibv) and dnax mrna from e. coli ( , , , ) . despite differences in sequence and structure in those mrnas, frameshifting proceeds by a very similar mechanism. the frameshifting motif of the a/ b mrna consists of a slippery site u uua aag encoding leu (uua) and lys (aag) in -frame followed by a pseudoknot ( ) . the dnax frameshifting motif has the slippery site a aaa aag encoding two lys (aaa and aag) in -frame preceded by a shine-dalgarno-like sequence and followed by a sl ( ) . in both cases, the role of the downstream secondary structure element is to slow down the late stages of translocation ( , , , ) . 
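the two slippery sites just described can be compared with a small sketch. the python snippet below is only an illustration and not a method taken from the cited studies: the helper names `frame_codons` and `unchanged_positions` are hypothetical, and counting identical nucleotides between the 0-frame codon and the codon one nucleotide upstream is a crude stand-in for real codon-anticodon re-pairing, which also tolerates wobble pairs.

```python
def frame_codons(heptamer):
    """Split a 7-nt slippery site (written N NNN NNN) into codon pairs.

    Returns ((P, A) codons in the 0 frame, (P, A) codons after a -1 shift).
    """
    s = heptamer.replace(" ", "").upper()
    if len(s) != 7:
        raise ValueError("expected a 7-nt slippery site")
    zero_frame = (s[1:4], s[4:7])    # codons read by the P- and A-site tRNAs
    minus_one = (s[0:3], s[3:6])     # codons one nucleotide toward the 5' end
    return zero_frame, minus_one


def unchanged_positions(heptamer):
    """For each tRNA, count codon positions that stay identical after a -1 shift.

    Assumption: identity of nucleotides is used as a rough proxy for the
    ability of the anticodon to re-pair in the new frame.
    """
    zero_frame, minus_one = frame_codons(heptamer)
    return tuple(
        sum(a == b for a, b in zip(old, new))
        for old, new in zip(zero_frame, minus_one)
    )


if __name__ == "__main__":
    for site in ("U UUA AAG", "A AAA AAG"):
        print(site, frame_codons(site), unchanged_positions(site))
```

under these assumptions, u uua aag keeps two of three positions for each trna after the shift (uua to uuu, aag to aaa), and a aaa aag keeps three and two positions, respectively.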
at this point the ribosome is stalled in a rotated or even hyper-rotated state in which the stabilizing contacts between the ribosome and the codon-anticodon complexes are disrupted, which allows the trna to sample alternative reading frames ( , , ) . both the dissociation of the e-site trna and the backward rotation of the ribosomal subunits are slow, but the e-site trna is released before the ribosome rotates backwards ( ) . ef-g, which usually restricts the a-site trna in the -frame position ( ) , can also dissociate prior to the completion of translocation ( ) . when both ef-g and the deacylated trna have been released, a single trna in transit from the a to the p site may be particularly prone to frameshifting ( ) . there are two ways to resolve the metastable stalled state, either by spontaneous unwinding of the mrna secondary structure element that hinders the progression of the ribosome, which would allow the ribosome to resume its progression in the -frame, or by slippage in the - direction ( ) . the latter scenario may be kinetically advantageous because this would move the base of the pseudoknot to the entrance of the mrna tunnel where the helicase center of the ribosome can actively unwind the mrna secondary structure ( , ) . the choice of frameshifting pathway on the dnax mrna is dictated by environmental conditions, i.e. the availability of nutrients ( figure ) . however, there are cases where both pathways are constitutive. one prominent example is the gag-pol mrna of human immunodeficiency virus type (hiv- ). here, the function of - prf is to produce viral structural proteins (gag, -frame) and enzymes (gag-pol, - -frame) at a defined ratio ( ) . the gag-pol mrna contains the slippery sequence u uuu uua encoding phe (uuu) and leu (uua) in -frame followed by a sl ( ) . 
the - frameshifting efficiency in hiv- is modulated by the availability of the leu-trna leu(uaa) isoacceptor that is rare in cd + t-lymphocytes, the cells infected by the virus in the human host ( ) . when trna leu is abundant, it is rapidly accommodated at its cognate codon uua, and - prf takes place during the late stage of translocation by two-trna slippage of trna phe and trna leu . the frameshifting scenario changes markedly when leu-trna leu(uaa) is limiting. during translational pausing due to the 'hungry' uua codon in the a site, the p-site trna phe can slip into the - -frame, which exposes a uuu phe codon in the a site and bypasses the limitation for leu-trna. taking into account the low level of leu-trna leu(uaa) in hiv- target cells and potential changes in trna profiles upon viral infection and interferon signaling activation ( ) ( ) ( ) , the alternative mechanism could act as a rescue pathway to allow for frameshifting under the limitation of the key trna ( ) . most likely, hiv- has evolved to use both mechanisms to maintain the efficiency of - prf at a constant value, which is critical for viral replication and infectivity ( , ) . rescue pathways regulated by trna availability may be operational in other viruses, as recent studies of - prf on the k mrna of the alphavirus semliki forest virus (sfv) identified a very similar switch between frameshifting pathways, also operated by trna leu(uaa) ( ) . manipulation of the frameshifting efficiency opens new perspectives in developing antiviral therapies and controlling gene expression of cellular mrnas ( ) . an intriguing example is the interferon-stimulated cellular protein shiftless. - prf in retroviruses (hiv) and alphaviruses (sfv) seems to be suppressed by this protein, which is thought to bind to both the translating ribosome and the frameshifting mrna motif by a mechanism that is not fully understood ( ) . 
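the switch between two-trna and one-trna slippage on the gag-pol slippery sequence (u uuu uua) described above can be made concrete with a toy model. the python sketch below is an assumption-laden illustration: the function name `frameshift_route` and the boolean flag `a_site_trna_abundant` are hypothetical, and the real choice between routes is a matter of kinetic partitioning rather than a binary switch.

```python
def frameshift_route(heptamer, a_site_trna_abundant):
    """Toy model of -1 frameshifting on a slippery heptamer N NNN NNN.

    Assumption: the route is chosen solely by whether the tRNA that reads
    the 0-frame A-site codon is abundant.
    """
    s = heptamer.replace(" ", "").upper()
    if len(s) != 7:
        raise ValueError("expected a 7-nt slippery site")
    p_codon, a_codon = s[1:4], s[4:7]   # 0-frame codons in the P and A sites
    if a_site_trna_abundant:
        # The A-site tRNA is accommodated quickly; both tRNAs slip together
        # during the late stage of translocation.
        return {
            "route": "two-tRNA slippage",
            "before": (p_codon, a_codon),
            "after": (s[0:3], s[3:6]),
        }
    # The A site stays vacant at the "hungry" codon; only the P-site tRNA slips,
    # which changes the codon presented in the A site.
    return {
        "route": "one-tRNA slippage",
        "before": (p_codon,),
        "after": (s[0:3],),
        "a_site_codon_exposed": s[3:6],
    }


if __name__ == "__main__":
    print(frameshift_route("U UUU UUA", a_site_trna_abundant=True))
    print(frameshift_route("U UUU UUA", a_site_trna_abundant=False))
```

with the a-site trna limiting, the model reports that the codon exposed in the a site changes from uua to uuu, which matches the description of how slippage of trna phe bypasses the need for leu-trna.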
multiple attempts have been made to design synthetic drugs targeting the frameshifting motif of hiv- ( ) ( ) ( ) ( ) ( ) and sars coronavirus ( ) . recently, matsumoto et al. have developed a small-molecule tool that can induce pseudoknot formation and activate - prf both in vitro and in vivo in human cells ( ) . such inducible - prf was previously reported for hiv- using prf stimulation by antisense nucleotides ( ) and can serve to control viral propagation and gene expression using small synthetic molecules. another remarkable example of recoding is translational bypassing, which involves skipping of a portion of the mrna by the translating ribosome, leading to the production of one polypeptide from a discontinuous frame. translational bypassing was first identified in gene of bacteriophage t ( ) , which remains the best-studied example of bypassing, and was later found in the mitochondrial genome of the yeast magnusiomycetes ( ) . the mrna of gene contains two open reading frames (orf and orf ) separated by a non-coding gap ( figure a ). chemical and enzymatic probing of the mrna structure suggested that the mrna of both orfs is highly structured, whereas the gap is largely unfolded and forms a module that is structurally independent of the two orfs ( ) . the gap appears to represent a mobile genetic element inserted into the gene mrna to inhibit cleavage by homing endonuclease moba ( ) . the ribosome translates the first mrna codons of orf up to a gga triplet coding for the amino acid glycine. the subsequent codon is a stop codon uag, but instead of terminating protein synthesis, the ribosome slides over a nt-long non-coding gap, lands at a distal gga codon and resumes translation to the end of orf ( ) . gene mrna elements that stimulate bypassing are located of the take-off site, in the take-off sl and of the landing site ( ) ( ) ( ) ( ) ( ) . 
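the take-off and landing logic described above can be sketched in a few lines. the python snippet below is a simplified illustration, not a model of the phage mrna: the function name `translate_with_bypass` and the short test sequence are hypothetical, the landing site is assumed to be simply the nearest downstream copy of the take-off codon, and the stimulatory mrna structures and the nascent peptide are ignored.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_with_bypass(mrna, takeoff="GGA"):
    """Translate from position 0, bypassing once at a take-off codon.

    Assumption: bypassing starts when the take-off codon is directly
    followed by a stop codon, and the peptidyl-tRNA re-pairs with the
    nearest identical codon downstream (the landing codon).
    """
    mrna = mrna.upper()
    peptide, i, bypassed = [], 0, False
    while i + 3 <= len(mrna):
        codon = mrna[i:i + 3]
        aa = CODE[codon]
        if aa == "*":
            break                       # ordinary termination
        peptide.append(aa)
        next_codon = mrna[i + 3:i + 6]
        if not bypassed and codon == takeoff and CODE.get(next_codon) == "*":
            landing = mrna.find(takeoff, i + 3)
            if landing != -1:
                i = landing             # slide over the gap to the landing codon
                bypassed = True
        i += 3                          # continue with the codon after the current one
    return "".join(peptide)


if __name__ == "__main__":
    # AUG UUU GGA | UAG + 8-nt gap | GGA AAA UAA  (hypothetical sequence)
    print(translate_with_bypass("AUGUUUGGAUAGCCUUAACCGGAAAAUAA"))
```

in this toy sequence the glycine decoded at the take-off gga is counted once, and the residue encoded after the landing gga is appended directly to it, so the two reading frames are joined into one polypeptide.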
remarkably, the key bypassing signals, such as the take-off sl element and the matching take-off and landing codons, are present also in yeast mitochondrial bypassing mrnas ( ) , suggesting a similar mechanism of bypassing to that in bacteriophage t . recent biochemical, single molecule and structural work suggests how translational bypassing works. translation of orf is a non-uniform process: at the beginning, translation of orf is rapid but then gradually slows down ( ) , probably because the ribosome has to unwind the secondary structure elements on its way along the mrna. the ribosome pauses at the take-off gga codon ( ) . to start bypassing, the ribosome requires the action of ef-g accompanied by gtp hydrolysis and a rotation of the ribosomal subunits relative to each other into an unusual hyperrotated conformation ( ) . the cryo-em structure of the take-off complex reveals that the nascent peptide, which is known to be a key determinant for bypassing ( ) , forms numerous interactions with the polypeptide exit tunnel of the ribosome ( ) (figure b ). these contacts help to hold the peptidyl-trna on the ribosome during sliding and likely contribute to the slowdown at the take-off codon. in addition, the interactions of the nascent peptide residues with the ribosome lock an inactive conformation of the peptidyl transferase center, thus preventing the premature termination and readthrough at the take-off site. another remarkable feature of the take-off complex is a short dynamic sl formed by the mrna in the decoding site of the ssu ( ) ( figure b ). the short sl hinders access of the translation termination factor or near-cognate aa-trnas to the a site ( ) . in addition, the sl serves as a mimic of an a-site trna to help ef-g to promote a pseudo-translocation event ( figure c ). this displaces the p-site peptidyl-trna from its codon and starts ribosome sliding. 
as the ribosome moves forward, the mrna upstream of the take-off site starts to emerge from the ribosome and can re-fold, thereby preventing backward sliding of the ribosome ( ) . the directionality of the ribosome movement may also be facilitated by cycles of ef-g binding and gtp hydrolysis ( ) . in fact, the kinetics of gtp hydrolysis by ef-g and bypassing are identical. ef-g appears to hydrolyze, on average, about molecules of gtp for each ribosome that completes bypassing. considering the length of the non-coding gap ( nt), ef-g hydrolyzes on average . molecules of gtp per nucleotide of the sliding sequence. this gtp expenditure may be required to maintain the ribosome conformation that is prone to sliding or to facilitate the forward direction of sliding, similarly to the power-stroke action of ef-g in translocation ( ) . although all ribosomes disengage from the take-off gga codon and start sliding, only - % of them synthesize the full-length protein, while the remaining ribosomes stop translation due to termination or spontaneous dropoff of the peptidyl-trna gly ( , , ) . at the end of the non-coding mrna gap, the ribosome lands at the gga codon guided by the sl in the mrna downstream of the landing codon ( ) . the ribosome adopts a rotated conformation into which the next aa-trna accommodates ( , ) . after peptide bond formation and subsequent translocation, the ribosome returns into a canonical nonrotated state and resumes translation of orf . although at first glance recoding events seem to be a heterogeneous group of different phenomena facilitated by specific regulatory elements, collectively they provide insights into the dynamic modes of translation. comparison of aa-trna recognition during canonical decoding and uga recoding by sec-trna sec shows that the major key steps on the ribosome are identical ( ) . 
specific recognition of sec-trna sec and the discrimination against all other similar aa-trnas occur at the preceding, pre-ribosomal steps of sec-trna sec recruitment to selb and secis. this probably reflects the evolution of the ribosome as a universal decoder for all different trnas and mrna codons, which relies on the geometry of the codon-anticodon complex, rather than on the structural specifics of each trna-codon pair. the mechanism of programmed readthrough is remarkably unclear, except for the fact that the near-cognate aa-trna and the rf must compete with each other. in contrast, comparison between canonical translocation, spontaneous frameshifting, prf and translational bypassing shows common mechanisms underlying these processes. one general theme is the importance of ribosome dynamics. for example, the hyper-rotated state is found not only in ribosomes starting bypassing, but also during frameshifting ( ) or in complexes stalled by the secm peptide ( ) , suggesting that a hyper-rotated state may be a hallmark for stalled ribosomes resuming translation. ribosome stalling is another important factor that defines the outcome of translation, as it regulates the efficiency of spontaneous frameshifting and prf, as well as bypassing. in these three cases, ef-g has a key role by either holding and escorting the trna or facilitating a pseudo-translocation of a trna-like a-site sl. formation of the short dynamic sl in the a site may regulate ribosome pausing. in contrast to normal translation where ribosomes move by one codon at a time, during bypassing the ribosome slides over the mrna. similarly, ribosomes can move along the untranslated regions of eukaryotic mrnas ( , ) . ribosome sliding exploits conserved elements of the translational machinery, such as the decoding center of the ribosome and ef-g. 
thus, bypassing may explain how the ribosome changes from canonical decoding to unconventional ef-g-promoted movement through noncoding regions on the mrna and suggests several new modes of ribosome dynamics that are potentially applicable in prokaryotic and eukaryotic translation.
key: cord- -ujhgb b authors: huang, yi; lau, susanna k. p.; woo, patrick c. y.; yuen, kwok-yung title: covdb: a comprehensive database for comparative analysis of coronavirus genes and genomes date: - - journal: nucleic acids res doi: . /nar/gkm sha: doc_id: cord_uid: ujhgb b the recent sars epidemic has boosted interest in the discovery of novel human and animal coronaviruses. by july , more than coronavirus sequence records, including complete genomes, are available in genbank. the number of coronavirus species with complete genomes available has increased from in to in , of which six, including coronavirus hku , bat sars coronavirus, group bat coronavirus hku , groups c and d coronaviruses, were sequenced by our laboratory. to overcome the problems we encountered in the existing databases during comparative sequence analysis, we built a comprehensive database, covdb (http://covdb.microbiology.hku.hk), of annotated coronavirus genes and genomes. 
covdb provides a convenient platform for rapid and accurate batch sequence retrieval, the cornerstone and bottleneck for comparative gene or genome analysis. sequences can be directly downloaded from the website in fasta format. covdb also provides detailed annotation of all coronavirus sequences using a standardized nomenclature system, and overcomes the problems of duplicated and identical sequences in other databases. for complete genomes, a single representative sequence for each species is available for comparative analysis such as phylogenetic studies. with the annotated sequences in covdb, more specific blast search results can be generated for efficient downstream analysis. coronaviruses are found in a wide variety of animals and are associated with respiratory, enteric, hepatic and neurological diseases of varying severity. based on genotypic and serological characterization, coronaviruses were divided into three distinct groups ( ) ( ) ( ) . as a result of the unique mechanism of viral replication, coronaviruses have a high frequency of recombination ( , ) . the recent severe acute respiratory syndrome (sars) epidemic, the discovery of sars coronavirus (sars-cov) and identification of sars-cov-like viruses from himalayan palm civets and a raccoon dog from wild live markets in china have led to a boost in interest on discovery of novel coronaviruses in both humans and animals ( - ) ( figure ). for human coronaviruses, a novel group human coronavirus, human coronavirus nl (hcov-nl ) was reported in ( , ) , while we described the discovery, complete genome sequence and genetic diversity of a novel group human coronavirus, coronavirus hku (cov-hku ) in ( , ( ) ( ) ( ) . as for animal coronaviruses, six group ( ) ( ) ( ) , four group , including bat sars-cov and two new subgroups of group coronaviruses ( , , , ) , and group ( - ) coronaviruses have recently been described. 
by july , more than coronavirus sequence records, including a total of complete genomes, are available in genbank ( ) . among the coronavirus species with complete genome sequences available, six were sequenced by our group, including cov-hku and bat sars-cov ( , , , ) . furthermore, we defined two novel subgroups of group coronavirus ( ) . during batch sequence retrieval for comparative genome analysis of the coronavirus genomes that we sequenced, we encountered several major problems with the coronavirus sequences in genbank as well as in other coronavirus databases (coronaviridae bioinformatics resource, http://athena.bioc.uvic.ca/database.php?db=coronaviridae; patric, http://patric.vbi.vt.edu) ( ) . first, in genbank, the non-structural proteins in the polyprotein encoded by orf ab were not annotated. second, in all databases, the annotations of the non-structural proteins encoded by orfs downstream of orf ab are often confusing because they do not follow a standardized system. third, multiple accession numbers are often present for reference sequences ( ) . these problems often lead to confusion when sequence retrieval is performed. fourth, coronaviruses, especially sars-cov, amplified from different specimens may contain the same genome or gene sequences. these sequences usually lead to redundant work when they are analyzed. in view of these problems, we started to develop our own database for coronavirus gene and genome sequences in . in this database, covdb, we sought to create a user-friendly platform for efficient batch sequence retrieval, which is crucial for comparative genome analysis. in this article, we describe this comprehensive database of annotated coronavirus genes and genomes, which provides a central source of information about coronaviruses. to further increase the usefulness of covdb, commonly used bioinformatics tools were also included for analysis of the sequence data.
sequence data.
covdb is a web-based coronavirus database. data of covdb is stored and managed by mysql database management system. by july , covdb contains coronavirus sequences and one torovirus genome sequence. two hundred and sixty-four of them are complete genomes and the rest are partial genomes or genes. all data were retrieved from genbank using modules of bioperl. we annotated sequences without gene information or non-structural protein boundary and labeled the and untranslated regions (utrs) of the genomes. by july , covdb contains genes and utrs. information on coronavirus genome characteristics. in addition to the two sequence retrieval pages, covdb collects information on coronavirus sequence characteristics, including genome organization, a brief description on each complete coronavirus genome, gc content, polyprotein cleavage sites, transcription regulatory sequences, acidic tandem repeat sequences and known rna structures. these pieces of information can be accessed by clicking 'genome' in the top menu bar of covdb. in the 'tools' page, blast similarity search ( ) against annotated coronavirus sequences in covdb can be performed and other commonly used tools are also provided. batch sequence retrieval. the main goal for setting up covdb is to provide a convenient and efficient platform for retrieving batches of coronavirus gene sequences. the interfaces of the database are simple and user friendly. all genes and genomes contain links to genbank and/or pubmed. covdb contains two main pages for sequence retrieval. from the homepage, one can enter the first main page for retrieval of complete genomes and their genes by clicking 'covdb' (figure a) . from this page, users can obtain genes from specific coronavirus species by selecting the corresponding check boxes. we defined one representative genome from each species as the 'type strain'. most of the time, this 'type strain' is the one assigned as the reference sequence in genbank. 
by choosing the 'type strain only' option, users can obtain one gene sequence per species and construct a phylogenetic tree or perform other comparisons. an example of retrieving the complete genome, or a specific gene from the complete genome, of selected species is shown in figure b and c. from the page for retrieval of complete genomes and their genes, one can enter the second main page for retrieval of all complete and/or incomplete genes of a coronavirus (figure a) by clicking 'from all groups of genes'. on this page, all the gene sequences are grouped vertically according to the coronavirus group and subgroup they belong to, and horizontally by the names of the genes. the option 'exclude partial cds' can be used if only complete genes are required. an example of retrieving all the sequences of a particular gene for a group of coronaviruses is shown in figure b. if the translated sequence of a selected gene has more than one stop codon, which is probably due to sequencing error, the number in the 'length' column of this gene will be marked in red.
polyprotein annotation. in all coronavirus genomes, orf ab occupies two-thirds of the genome and is translated as a polyprotein. this polyprotein is posttranslationally cleaved by c-like protease (cl pro) and papain-like protease (pl pro) into - non-structural proteins. some of the non-structural proteins, such as rna-dependent rna polymerase, helicase, cl pro and pl pro, are essential for replication or virulence of the coronavirus, although the functions of others are still unclear. because these non-structural proteins are essential, their sequences are often used for evolutionary analysis, primer design, etc. however, except for the reference sequences, detailed cleavage site information is not provided for the non-structural proteins in other sequences in genbank.
since it has been shown that cl pro and pl pro of coronavirus cleave at conserved specific amino acids, the putative cleavage sites of the - non-structural proteins can be predicted by multiple sequence alignment. using these pieces of information, we have annotated these non-structural proteins in all the coronavirus sequences for easy retrieval in covdb. protein/gene name unification. by convention, all nonstructural proteins in the polyprotein encoded by orf ab are named as 'nsp', with each protein numbered consecutively starting from the end (nsp -nsp ). the structural proteins after the polyprotein are hemagglutinin esterase (he, in group a coronaviruses), spike glycoprotein (s), envelope protein (e), membrane protein (m) and nucleocapsid protein (n). however, there is no unified naming system for the non-structural proteins encoded by orfs downstream to orf ab. this lack of a unified system greatly reduces the stability and accuracy of ortholog retrieval. in covdb, with the aim of facilitating gene retrieval, we tried to unify the naming of these non-structural proteins from different groups of coronaviruses. on the other hand, we have also tried to avoid radical changes in the names that may lead to confusion. in covdb, these non-structural proteins are named as ns a, ns x, ns x, ns x and ns x (x = a, b, c,. . .). ns a denotes the orf between orf ab and he of group a coronaviruses. ns x denotes the orfs between s and e of groups , c, d and coronaviruses. in most of these coronaviruses, there are two ns x, named ns a and ns b. however, in group coronaviruses, the genomes of some members (e.g. hcov-nl , pedv) contain only one orf between s and e. 
when we compared their putative amino acid sequences to the corresponding ones in other group coronavirus genomes using blast, as well as searching for conserved domains using motifscan, results showed that the putative proteins encoded by these orfs belonged to a protein family in pfam originally assigned as 'corona_ns b' (accession number pf ). therefore, we named these orfs as ns b. ns x denotes the orfs between s and e of group a coronaviruses. ns x denotes the orfs between m and n of group coronaviruses. one exception is ns a of group a coronaviruses. traditionally, this name denotes an orf upstream of e in group a coronaviruses. therefore, we have kept this name for that orf in covdb. ns x denotes the orfs downstream of n gene. it is important to note that due to variations in genome organizations among different groups of coronaviruses (table ) , ns genes with the same name in different coronavirus groups may not be orthologs of each other. the complete genome gene search page of covdb contains a link to a gene synonyms page, which includes a list of synonymous names of the various genes in the coronavirus genomes. identical sequence labeling. sequence redundancy is another problem of coronavirus sequences in public nucleotide databases. different strains of the same species from samples collected in different locations or at different times may possess completely or partially identical sequences. these sequences, though containing important epidemiological information, increase the workload during sequence analysis. in covdb, we compared all nucleotide sequences and labeled the identical ones to mitigate this problem. users can choose to show or not to show strains with identical sequences by clicking on the check boxes to the left of the page (figure b ). blast similarity search. 
during the process of coronavirus gene sequences analysis, we encountered a major problem when coronavirus gene sequences, especially those of orf ab, were used for blast search against genbank or any other coronavirus databases. when part of the orf ab gene (e.g. nsp ) is used as the query sequence, instead of getting the gene for the specific non-structural protein that the query sequence is homologous to, the results will only show that the hits are within orf ab, or in some cases, shown to be within the entire coronavirus genome. much time will be needed for further analyzing the results manually in order to locate the positions of the cleavage sites of the corresponding genes for the nonstructural proteins, making it very inefficient for further downstream work. this problem has been overcome by the annotated sequences in covdb. the blast search page of covdb is an interface for facilitating coronavirus similarity search. the background support program, blastall, is from the ncbi blast package. the blast search page can be entered by clicking 'tools' in the top menu bar in any page of covdb. since all sequences in covdb are annotated, they can be grouped into different datasets for blast search. users can choose one of the three nucleotide and two protein sequence datasets as the database for comparison (figure ) . the three nucleotide sequence datasets are: cov genes (nsp + genes after ab), cov genes ( ab + genes after ab) and cov genbank strains, which are the original sequences retrieved from genbank. the two protein sequence datasets are the translated sequences of the first two nucleotide datasets: cov proteins (nsp + aa after ab) and cov proteins ( ab + aa after ab). myblast. 'myblast' employs the same blast program as the blast page mentioned above. however, instead of selecting a predefined nucleotide or amino acid sequence database, multiple sequences can be pasted into the second sequence input box to generate a temporary sequence database. 
one or more query sequences can be pasted into the first sequence input box for blastn or blastp search against the temporary sequence database.
orf finder for coronavirus. this orf finder is specifically designed for coronavirus genome analysis. the result page shows the positions and lengths of each putative orf and the position of the putative ribosomal frameshift site for translation of orf ab. the nucleotide or amino acid sequences of the orfs can be shown by selecting the corresponding check boxes. to facilitate genome comparison and annotation, the most closely related coronavirus, which had been annotated in covdb, can be chosen from a pull-down list for comparison using blast search. this function is particularly useful for determining the range of nsp in orf ab.
the first column is covdb gene id. in the uniq column, 'uniq' will be shown if there is no other identical sequence in covdb. otherwise, gene id of the sequences identical to it will be shown.
rapid and accurate batch sequence retrieval is both the cornerstone and bottleneck for comparative gene or genome analysis. during the process of complete genome sequencing and comparative analysis of the various novel human and animal coronavirus genomes in the past years, we have developed a comprehensive database, covdb, of annotated coronavirus genes and genomes, which offers efficient batch sequence retrieval and analysis. as shown by our experience in using covdb for comparative genome analysis of novel coronaviruses we have discovered ( , , , , ) , we find that covdb is more rapid and efficient than other existing coronavirus databases for batch sequence retrieval for the following reasons. first, we have performed annotation on all non-structural proteins in the polyprotein encoded by orf ab of every single sequence.
second, the non-structural proteins encoded by orfs downstream of orf ab were annotated using a standardized system, with exceptions for a few long-established names so as to minimize confusion. third, all identical nucleotide sequences were labeled, so users can choose whether or not to show strains with identical sequences. fourth, covdb contains not only complete coronavirus genome sequences, but also incomplete genomes and their genes. some coronavirus genes, such as pol, spike and nucleocapsid, are sequenced much more frequently than others because they are either the most conserved or the least conserved. these gene sequences are particularly important for evolutionary analysis, single nucleotide polymorphism studies and design of primers for rt-pcr or quantitative rt-pcr amplification.
covdb was constructed by the department of microbiology, the university of hong kong. it is available at no charge at http://covdb.microbiology.hku.hk.
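the identical sequence labeling and the 'uniq' column described above can be sketched in a few lines of python. this is only an illustration under stated assumptions, not the covdb implementation (the article says only that data were retrieved with bioperl and stored in mysql). the function name `label_identical` and the `{gene_id: sequence}` layout are hypothetical, and only completely identical sequences are grouped; partially identical sequences are not handled.

```python
from collections import defaultdict


def label_identical(records):
    """Map each gene id to 'uniq' or to the ids of sequences identical to it.

    `records` is a dict of {gene_id: nucleotide_sequence}. Both the function
    name and this layout are hypothetical, chosen only for illustration.
    """
    groups = defaultdict(list)
    for gene_id, seq in records.items():
        # Normalize case and whitespace so formatting differences do not
        # hide an exact match.
        key = "".join(seq.split()).upper()
        groups[key].append(gene_id)

    labels = {}
    for ids in groups.values():
        for gene_id in ids:
            others = [other for other in ids if other != gene_id]
            # 'uniq' when nothing else matches, otherwise the matching ids.
            labels[gene_id] = others if others else "uniq"
    return labels


if __name__ == "__main__":
    demo = {
        "g1": "ATGGCTAGC",
        "g2": "atggctagc",  # same sequence as g1, different case
        "g3": "ATGGCTAGA",
    }
    for gene_id, label in label_identical(demo).items():
        print(gene_id, label)
```

grouping by a normalized sequence string keeps the comparison linear in the number of records; for very long genomes a hash of the sequence could serve as the key instead.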
screenshot of blast similarity search page. five datasets can be chosen as the database for comparison.
coronavirus genome structure and replication
the molecular biology of coronaviruses
molecular biology of severe acute respiratory syndrome coronavirus
comparative analysis of coronavirus hku genomes reveals a novel genotype and evidence of natural recombination in coronavirus hku
isolation and characterization of viruses related to the sars coronavirus from animals in southern china
the genome sequence of the sars-associated coronavirus
coronavirus as a possible cause of severe acute respiratory syndrome
characterization of a novel coronavirus associated with severe acute respiratory syndrome
relative rates of non-pneumonic sars coronavirus infection and sars coronavirus pneumonia
a previously undescribed coronavirus associated with respiratory disease in humans
identification of a new human coronavirus
in silico analysis of orf ab in coronavirus hku genome reveals a unique putative cleavage site of coronavirus hku c-like protease
characterization and complete genome sequence of a novel coronavirus, coronavirus hku , from patients with pneumonia
clinical and molecular epidemiological features of coronavirus hku -associated community-acquired pneumonia
molecular diversity of coronaviruses in bats
complete genome sequence of bat coronavirus hku from chinese horseshoe bats revealed a much smaller spike gene with a different evolutionary lineage from the rest of the genome
prevalence and genetic diversity of coronaviruses in bats from china
comparative analysis of twelve genomes of three novel group c and group d coronaviruses reveals unique group and subgroup features
severe acute respiratory syndrome coronavirus-like virus in chinese horseshoe bats
coronaviruses from pheasants (phasianus colchicus) are genetically closely related to coronaviruses of domestic fowl (infectious bronchitis virus) and turkeys
coronavirus infection of spotted hyenas in the serengeti ecosystem
molecular identification and characterization of novel coronaviruses infecting graylag geese (anser anser), feral pigeons (columbia livia) and mallards (anas platyrhynchos)
isolation of avian infectious bronchitis coronavirus from domestic peafowl
patric: the vbi pathosystems resource integration center
ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins
basic local alignment search tool
conflict of interest statement. none declared.
key: cord- -wjf t vp authors: brister, j. rodney; ako-adjei, danso; bao, yiming; blinkova, olga title: ncbi viral genomes resource date: - - journal: nucleic acids res doi: . /nar/gku sha: doc_id: cord_uid: wjf t vp recent technological innovations have ignited an explosion in virus genome sequencing that promises to fundamentally alter our understanding of viral biology and profoundly impact public health policy. yet, any potential benefits from the billowing cloud of next generation sequence data hinge upon well implemented reference resources that facilitate the identification of sequences, aid in the assembly of sequence reads and provide reference annotation sources. the ncbi viral genomes resource is a reference resource designed to bring order to this sequence shockwave and improve usability of viral sequence data. the resource can be accessed at http://www.ncbi.nlm.nih.gov/genome/viruses/ and catalogs all publicly available virus genome sequences and curates reference genome sequences.
as the number of genome sequences has grown, so too have the difficulties in annotating and maintaining reference sequences. the rapid expansion of the viral sequence universe has forced a recalibration of the data model to better provide extant sequence representation and enhanced reference sequence products to serve the needs of the various viral communities. this, in turn, has placed increased emphasis on leveraging the knowledge of individual scientific communities to identify important viral sequences and develop well annotated reference virus genome sets.
recent outbreaks of ebolavirus ( , ) and middle east respiratory syndrome coronavirus (mers-cov) ( , ) clearly demonstrate the power of sequence analysis in viral surveillance, host reservoir identification and public health policy debate. as these viruses have filled media headlines, their genome sequences have spilled into international public databases. such real time analysis promises to fundamentally alter our understanding of viral biology and significantly impact public health responses to viral disease, but it also places renewed emphasis on the public research infrastructure that is necessary to support the storage and analysis of sequence data. this infrastructure includes the primary databases that together comprise the international nucleotide sequence database collaboration (insdc) ( ) , namely genbank ( ) , the european molecular biology laboratory's european bioinformatics institute (embl-ebi) ( ) and the dna database of japan (ddbj) ( ) , as well as reference databases like the viralzone resource at the swiss institute of bioinformatics (http://viralzone.expasy.org) ( ) and the viral genome resource at the national center for biotechnology information (ncbi) (http://www.ncbi.nlm.nih.gov/genome/viruses/) ( ) .
whereas primary databases are archival repositories of sequence data, reference databases provide curated datasets that enable a number of activities, among them the transfer of annotation to related genomes ( ) ( ) ( ) , sequence assembly and virus discovery ( ) ( ) ( ) ( ) , the study of viral dynamics and evolution ( ) ( ) ( ) and pathogen detection ( , ( ) ( ) ( ) .
the ncbi viral genomes project was established in response to the growing need for a public, virus-specific, reference sequence resource ( ) . the project catalogs all complete viral genomes deposited in insdc databases and creates so-called refseq records for each viral species. each refseq is derived from an insdc sequence record, but may include additional annotation and/or other information. accessions for refseq genome records include the prefix 'nc ', allowing them to be easily differentiated from insdc records. for example, the refseq genome record for enterobacteria phage t has the accession nc but was derived from the insdc record af . typically, the first genome submitted for a particular species is selected as a refseq, and once a refseq is created, other validated genomes for that species are indexed as 'genome neighbors'. as such, the viral refseq data model is taxonomy centric, or more specifically, species centric, and all refseq records and genome neighbors are indexed at the species level. this model requires both the demarcation of individual viral species and the grouping of genome sequences into defined species.
virus genome type | refseq genome segments | total genome segments | total insdc sequences
dsdna viruses, no rna stage
dsrna viruses
ssdna viruses
ssrna negative-strand viruses
ssrna positive-strand viruses, no dna stage
retro-transcribing viruses
a the table does not include influenza virus sequences. these sequences are stored in a specialized database ( , ) .
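the species-centric indexing described above, with one refseq record per species and other genomes attached as genome neighbors, can be illustrated with a short python sketch. it assumes the refseq genome prefix is written `NC_` (the extracted text shows only 'nc '), and it treats every other accession as a genome neighbor, which simplifies the validation step the text mentions. the function names and the accessions in the usage note are made up for illustration.

```python
from collections import defaultdict

# Assumption: the RefSeq genome prefix is "NC_"; the extracted text shows 'nc '.
REFSEQ_GENOME_PREFIX = "NC_"


def is_refseq_genome(accession):
    """Return True when the accession carries the RefSeq genome prefix."""
    return accession.upper().startswith(REFSEQ_GENOME_PREFIX)


def index_by_species(records):
    """Group (accession, species) pairs into RefSeq records and neighbors.

    Every accession without the RefSeq prefix is treated as a genome
    neighbor; real curation also validates each genome first.
    """
    index = defaultdict(lambda: {"refseq": [], "neighbors": []})
    for accession, species in records:
        slot = "refseq" if is_refseq_genome(accession) else "neighbors"
        index[species][slot].append(accession)
    return dict(index)
```

for example, `index_by_species([("NC_000000", "virus a"), ("AF000000", "virus a")])` files the first placeholder accession under 'refseq' and the second under 'neighbors' for the same species.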
there are now validated viral and viroid genome segments deposited within insdc databases, not including influenza sequences, which are stored in a specialized database ( , ) . this figure represents a nearly fold increase since (figure ), and this rise reflects both steady increases in the number of novel viruses sequenced--as measured by the number of refseq genome segments--and a large increase in the number of genome neighbors, i.e. genome sequences belonging to viral species already represented by a refseq (figure ). as shown in table , refseq genome segments are distributed among all viruses, but genome neighbor segments are concentrated among smaller, ssdna, rna, and retro-transcribing viruses. although many of these neighbor genomes are concentrated among human pathogens, there are also several viruses of agricultural importance with high numbers of sequenced genomes ( table ). while most of the viruses in table are well studied in the laboratory, many other sequenced viruses are not. the refseq data model for most organisms underscores the importance of very well annotated reference sequence records ( ) . unfortunately, a minority of viral systems are experimentally well defined, so there is often little primary data on which to base genome annotations. in some cases, sequence homologies allow the transfer of annotation from experimentally defined to poorly characterized genomes ( ) ( ) ( ) . yet, often genomes are annotated by purely ab initio processes ( ) ( ) ( ) . given the difficulty of implementing a purely well annotated representation of viral genome sequences, the viral refseq model has evolved into a more flexible approach that includes both reference and representative sequences. reference refseq records provide sources of well annotated sequence features, whereas representative records provide coverage of extant sequence variation. 
the comment 'reviewed refseq' is added to refseq records to highlight those that include additional annotation, and as of this writing, there are reviewed viral refseq records, including references for several human pathogens, such as human immunodeficiency virus (nc ), measles virus (nc ) and poliovirus (nc ) and several other important viral systems such as enterobacteria t (nc ), enterobacteria t (nc ) and tobacco mosaic virus ( ) ( ) . moreover, some viral communities are developing well defined subspecies classification such as the genotyping schemes for hepatitis b virus and hepatitis c virus ( ) ( ) ( ) . these genotyping schemes can provide an important framework for the interpretation of genome sequence data ( ) , and more communities are expected to develop genotyping schemes in the coming years. finally, there are cases when the best characterized viral isolate is a laboratory variant, and it may be important to create multiple refseq records in order to provide both experimentally annotated references and sufficient sequence representation of circulating isolates. together these cases highlight the need for both reference genome sequences that capture the best possible annotation and representative genome sequences that capture important intraspecies variation or define subspecies categories. therefore the viral refseq model has expanded to include both reference and representative genome sequences to better serve community needs. the rising pace of viral discovery has a number of implications for data processing by the viral genomes group. viral taxonomy within the ncbi taxonomy database is based on the list of valid species names and classifications provided by the international committee for the taxonomy of viruses (ictv) ( , ) . when the viral genomes project was initiated, there were many more viral species recognized by the ictv than viral refseq genome sequence records ( figure ). 
however, as the rate of viral genome sequencing has increased over the past decade, so too has the pace of viral discovery. as a result many refseqs are made from viruses clearly distinct from existing ones but without official taxonomy designation. taxonomy also affects the interpretation of genome sequence data, and technical difficulties encountered when sequencing the termini of some ssrna and ssdna viruses often lead to differing community standards for 'complete genomes' ( ) . this means that some difficult-to-sequence genomes are considered complete if they include the entire coding region but are missing some terminal sequence. improved methods may eventually resolve this issue ( ) , but in the meantime it would be useful for communities to define completeness standards with regard to current technology. in addition to manual selection based on genome length, the taxonomy of both refseq genome records and insdc genome neighbor records is validated. indeed, given that many novel virus genome sequences are submitted before analysis by the ictv (see figure ), validation of taxonomy assignment is a major facet of curation. taxonomy is important to the overall usability of ncbi viral genome resources and, when properly implemented, creates a framework for groups of related sequences. using standards established by individual ictv study sections ( ) and published reports, the taxonomy of each viral genome is validated and updated as necessary. newly submitted viral genomes without official ictv assignment are placed in 'uncharacterized' taxonomy bins that are easily distinguished from those recognized by the ictv. often little information is included in the insdc sequence record and a growing number of sequences do not include any linked publications. using sequence analysis and comparative genomics, every attempt is made to place new genomes into a family (i.e. the 'uncharacterized' bin associated with a specific family) or lower order classification bin.
however, some genomes are very distinct from previously characterized ones and only higher order classification is possible.
reference viral refseq records are generally curated by biologists using in-house annotation tools and the scientific literature as guides. a panel of viral genome advisors from outside ncbi bolsters curation efforts by offering expert guidance or taking responsibility for specific refseq records themselves. this approach is used for the maintenance of adenovirus and herpesvirus refseq records ( ) and could be extended to other virus genomes ( ) . even with these efforts, the growing number of viral genomes submitted to insdc databases and the rapid pace of scientific discovery make maintenance of up-to-date references difficult. therefore collaboration with scientific communities is critical to providing accurate annotation. sometimes these collaborative efforts are directed at curating a single refseq record, and all of the reviewed refseq records mentioned in the previous section were curated in collaboration with experts from individual viral communities. other times these collaborations are more extensive and touch many sequence records. for example, overlapping gene annotations were corrected on refseq records from virus families (arteriviridae, arteriviridae, bunyaviridae, caliciviridae, circoviridae, dicistroviridae, flaviviridae, luteoviridae, paramyxoviridae, parvoviridae, picornaviridae, potyviridae, reoviridae, togaviridae) as directed by experimental or predictive analysis ( , ) . a new emphasis has been placed on initiating annotation collaborations at the beginning of a large genome sequencing program so that reference annotations, isolate naming schemes and other standards can be established prior to sequence submission ( ) ( ) ( ) .
these collaborations often include members of the uniprot viral protein annotation program ( ) ( ), and/or curators from sequencing centers and other databases ( ) in addition to members of the relevant viral communities and effectively ensure both well annotated references and consistently annotated insdc sequence records. such arrangements underscore the extensive impact of viral genome annotation issues--from public databases to sequencing centers to individual researcher communities--and were formalized within the viral genome annotation working group, which brings together stakeholders and provides a forum for the discussion of annotation issues ( , ) . in addition to protein annotation and isolate naming issues, this group is working to define standards for viral genome sequence data. as the number of viral sequences has risen, so has the demand for curated metadata describing sequences. the viral genomes group has implemented two models designed to capture and standardize metadata. in the first model exemplified by the virus variation resource, host, isolation country and other important metadata are parsed from individual sequence records, mapped against vocabulary lists and standardized ( , ) . sequences can then be searched using these standardized metadata terms. currently, only a small subset of viral sequences are included in the virus variation resource, including those for influenza, dengue and west nile viruses, but the ultimate goal is to expand this semi-automated model to include more viruses. the second model captures and standardizes host information for all viruses, and whenever a new refseq record is created, a manually curated 'viral host' property is assigned to the relevant species within the ncbi taxonomy database. 
the property defines higher order, biologically relevant taxonomic host groups--algae, archaea, bacteria, diatom, environment, fungi, human, invertebrates, plants, protozoa and vertebrates--and enables sorting and selection of sequences within the ncbi taxonomy (http://www.ncbi.nlm.nih.gov/taxonomy) and viral genomes resource. for example, searching the ncbi taxonomy database with the term 'vhost fungi'[properties] (quotes included) will return a list of taxonomy groups comprised of viruses that infect fungi. users can then select the 'genome' database from the 'find related data' link on the taxonomy search page to view all viral genomes associated with viruses retrieved from the search. in cases where a virus infects multiple types of organisms, multiple terms are assigned, for example 'invertebrates, plants'. to search ncbi taxonomy for viruses that infect multiple hosts simply include 'and' between search terms, for example 'vhost invertebrates' [properties] and 'vhost plants' [properties] (quotes included). the current distribution of assigned viral host terms is shown in figure . the ncbi viral genome resource can be accessed at www.ncbi.nlm.nih.gov/genome/viruses/. on this home page, users will find ftp links for downloading accession lists of all viral and viroid genomes (refseq and genome neighbors) and the complete viral and viroid refseq dataset. perhaps the central features of the resource are the viral and viroid genome browsers. these tables list all viral and viroid species represented by a reference sequence and include links to genome neighbor sequences. users can navigate to specific taxonomic groups and sort the table by viral host type. once a dataset has been defined by taxonomy and host types, users can download the resultant table, the list of refseq accessions in the table, or a list that includes refseq and genome neighbor accessions as well as taxonomy and viral host information. 
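the viral host searches described above can also be composed programmatically. the following is a minimal sketch, not an ncbi-provided tool: it only builds the query term and a url for the public e-utilities esearch endpoint, without sending any request. the helper names vhost_term and esearch_url are hypothetical, and the use of double quotes as the phrase delimiter is an assumption about how the 'quotes included' instruction maps onto entrez query syntax.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def vhost_term(*hosts):
    """Build a Taxonomy query such as "vhost fungi"[properties].

    Several host groups are joined with AND, mirroring the
    multi-host search described in the text (hypothetical helper).
    """
    if not hosts:
        raise ValueError("at least one host group is required")
    return " AND ".join(f'"vhost {host}"[properties]' for host in hosts)


def esearch_url(term, db="taxonomy", retmax=20):
    """Return an esearch URL for the given term; no request is sent."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


if __name__ == "__main__":
    print(vhost_term("fungi"))
    print(esearch_url(vhost_term("invertebrates", "plants")))
```

the same term string can be pasted directly into the taxonomy search box; the url form is only useful for scripted retrieval, and the accepted host group names are those listed in the text.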
several specialized viral resources and tools are also linked through the viral genomes resource home page. these include specialized resources for influenza, dengue and west nile and other viruses that are part of the virus variation resource (http://www.ncbi.nlm.nih.gov/genome/viruses/variation/) ( , , ) . the link to the retrovirus resource (http://www.ncbi.nlm.nih.gov/genome/viruses/retroviruses) provides access to the retrovirus genotyping tool and hiv- , human interaction database ( , ) . these tools are designed to assist retroviral researchers in the identification and classification of sequences and to document hiv- and human protein and replication interactions through a searchable interface. finally, there is a link to the pairwise sequence comparison tool (pasc) (http://www.ncbi.nlm.nih.gov/sutils/pasc), a blast-based tool with graphical output that can be used to establish taxonomic classification criteria of some viruses and classify viruses with newly sequenced genomes ( , ) . both refseq records and other genomes for species are linked throughout ncbi resources and can be used in a variety of operations. among these, the refseq dataset can be used to reduce the redundancy of blast searches (http://blast.ncbi.nlm.nih.gov/blast.cgi) ( ), providing fewer, higher quality sequences within search results. to restrict nucleotide blast searches to include only viral refseq genomes, employ the 'choose search set' options in the blast search interface ( ): select 'reference genomic sequences (refseq genomic)' in the database field and enter 'viruses' in the 'organism' field text box. for protein blast searches, the viral refseq protein set can be used by selecting 'reference proteins' (refseq proteins) in the database field and entering 'viruses' in the 'organism' field text box. data derived from viral refseqs are also used to support a number of other databases including gene ( ) and protein clusters ( ) . 
each species that includes a refseq can be found in the genome database (http://www.ncbi.nlm.nih.gov/genome) ( ) . this resource can be searched by taxonomy names, and retrieved genome records include links to all refseqs for that species. each individual genome record also includes links to neighbor sequences for that species under 'related information', and these can be viewed by selecting the 'other genomes for species' option. these links display all genome neighbor records in the nucleotide database where they can be viewed and/or downloaded. genome neighbor records can also be retrieved from multiple genome records using the 'find related data' options, allowing the user to search for an entire viral family or similar and then retrieve all genome neighbor records defined by the original search criteria. simply select 'nucleotide' in 'database' pull down menu and 'other genomes for species' from the 'option' pull down menu to return all genome neighbors for all the species listed in the search results. as the sequencing revolution continues to gather steam, and the rate of viral genome sequencing increases, reference databases will be pressed to serve growing community needs. meeting these will require further collaboration with individual viral communities and across public databases. data models will also need to shift to better represent the extant sequence universe and provide better standardized sequence annotation. once annotated, large-scale genome sequence data will need to be presented in ways that facilitate human data sorting and discovery operations. this will require semiautomated metadata capture and standardization, as well as innovative interfaces and tools that leverage metadata in discovery operations. 
many of these approaches and processes are currently being tested within the ncbi virus variation resource ( ) where users can readily find sequences based on specific, standardized sequence descriptors, greatly improving the accessibility and utility of viral sequence data. while currently limited to a handful of human pathogens, our intent is to expand the virus variation data model to include more viruses from more viral communities. this should open up a number of possibilities and will support the aggregation and retrieval of sequences based on community-defined criteria like genotypes or complete genome sets as is currently possible for influenza virus sequences ( , ) . the growing cloud of viral genome sequences also poses significant barriers to the maintenance of reference genome records. the pace of experimental discovery and the number and breadth of viral genomes make it increasingly difficult to provide well annotated, up-to-date reference sequences. to counter, we must leverage community knowledge and activities against the goal of better refseq viral resources and must collaborate with viral communities to maintain well annotated reference sequences, develop community-accepted gene and protein naming standards and define community-established subspecies classification schemes. though collaborations have been initiated within some communities ( , ( ) ( ) ( ) ) , they need to be scaled to include more groups. as a public resource, we serve a range of communities--from the public health to the basic research--and rely on them to both better inform our mission and help support it. only by engaging our stakeholders and working together on shared goals can we provide the rigorous resources necessary to support viral sequence data activities. 
emergence of zaire ebola virus disease in guinea--preliminary report genomic surveillance elucidates ebola virus origin and transmission during the outbreak middle east respiratory syndrome coronavirus in dromedary camels: an outbreak investigation transmission and evolution of the middle east respiratory syndrome coronavirus in saudi arabia: a descriptive genomic study the international nucleotide sequence database collaboration the european bioinformatics institute's data resources ddbj progress report: a new submission system for leading to a correct annotation viralzone: recent updates to the virus knowledge resource ncbi reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins flan: a web server for influenza virus genome annotation vigor extended to annotate genomes for additional different viruses vigor, an annotation program for small viral genomes evaluation of alignment algorithms for discovery and identification of pathogens using rna-seq identification of a novel polyomavirus from patients with acute respiratory tract infections klassevirus , a previously undescribed member of the family picornaviridae, is globally widespread a highly abundant bacteriophage discovered in the unknown sequences of human faecal metagenomes deep sequencing of norovirus genomes defines evolutionary patterns in an urban tropical setting molecular epidemiology of contemporary g p[ ] human rotaviruses cocirculating in a single u.s. 
community: footprints of a globally transitioning genotype going viral: next-generation sequencing applied to phage populations in the human gut pathseq: software to identify or discover microbes by deep sequencing of human tissue virusfinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data a cloud-compatible bioinformatics pipeline for ultrarapid pathogen identification from next-generation sequencing of clinical samples national center for biotechnology information viral genomes project virus variation resource--recent updates and future directions ncbi reference sequences (refseq): current status, new features and genome annotation policy improving gene annotation of complete viral genomes identification of proteins associated with murine cytomegalovirus virions microbial virus genome annotation-mustering the troops to fight the sequence onslaught imbroglios of viral taxonomy: genetic exchange and failings of phenetic approaches molecular identification of hepatitis b virus genotypes/subgenotypes: revised classification hurdles and updated resolutions consensus proposals for a unified system of nomenclature of hepatitis c virus genotypes expanded classification of hepatitis c virus into genotypes and subtypes: updated criteria and genotype assignment web resource is there any value to hepatitis b virus genotype analysis? 
the ncbi taxonomy database virus taxonomy: classification and nomenclature of viruses: ninth report of the international committee on taxonomy of viruses rapid cdna synthesis and sequencing techniques for the genetic study of bluetongue and other dsrna viruses a new approach to determining whole viral genomic sequences including termini using a single deep sequencing run herpesvirus systematics evolution of viral proteins originated de novo by overprinting overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation uniformity of rotavirus strain nomenclature proposed by the rotavirus classification working group (rcwg) virus nomenclature below the species level: a standardized nomenclature for natural variants of viruses assigned to the family filoviridae virus nomenclature below the species level: a standardized nomenclature for laboratory animal-adapted strains and variants of viruses assigned to the family filoviridae the universal protein resource (uniprot) in vipr: an open bioinformatics database and analysis resource for virology research towards viral genome annotation standards virus variation resources at the national center for biotechnology information: dengue virus the influenza virus resource at the national center for biotechnology information a web-based genotyping resource for viral sequences human immunodeficiency virus type , human protein interaction database at ncbi pairwise sequence comparison (pasc) and its application in the classification of filoviruses improvements to pairwise sequence comparison (pasc): a genome-based web tool for virus classification blast: a more efficient report with usability improvements ncbi blast: a better web interface database resources of the national center for biotechnology information the national center for biotechnology information's protein clusters database we would like to thank vyacheslav chetvernin, boris fedorov, sergey resenchuck, igor tolstoy, 
tatiana tatusova and jim ostell for their development and support. key: cord- -zrbn z authors: ishimaru, daniella; plant, ewan p.; sims, amy c.; yount, boyd l.; roth, braden m.; eldho, nadukkudy v.; pérez-alvarado, gabriela c.; armbruster, david w.; baric, ralph s.; dinman, jonathan d.; taylor, deborah r.; hennig, mirko title: rna dimerization plays a role in ribosomal frameshifting of the sars coronavirus date: - - journal: nucleic acids res doi: . /nar/gks sha: doc_id: cord_uid: zrbn z messenger rna encoded signals that are involved in programmed - ribosomal frameshifting (- prf) are typically two-stemmed hairpin (h)-type pseudoknots (pks). we previously described an unusual three-stemmed pseudoknot from the severe acute respiratory syndrome (sars) coronavirus (cov) that stimulated - prf. the conserved existence of a third stem–loop suggested an important hitherto unknown function. here we present new information describing structure and function of the third stem of the sars pseudoknot. we uncovered rna dimerization through a palindromic sequence embedded in the sars-cov stem . further in vitro analysis revealed that sars-cov rna dimers assemble through ‘kissing’ loop–loop interactions. we also show that loop–loop kissing complex formation becomes more efficient at physiological temperature and in the presence of magnesium. when the palindromic sequence was mutated, in vitro rna dimerization was abolished, and frameshifting was reduced from to . %. furthermore, the inability to dimerize caused by the silent codon change in stem of sars-cov changed the viral growth kinetics and affected the levels of genomic and subgenomic rna in infected cells. these results suggest that the homodimeric rna complex formed by the sars pseudoknot occurs in the cellular environment and that loop–loop kissing interactions involving stem modulate - prf and play a role in subgenomic and full-length rna synthesis. 
a novel coronavirus was responsible for the sudden epidemic, severe acute respiratory syndrome (sars) outbreak, in . coronaviruses are positive-strand rna viruses with large genomes [~ nucleotides (nt)] that serve as templates for translation of viral proteins and for replication. the production of proteins from these viral rnas does not follow the usual rules governing translation. the first polyprotein encoded by open reading frame (orf) a, which encodes non-structural proteins, is defined by initiation and termination codons and is translated normally. signals embedded within the rna just before the termination codon of orf a redirect a fraction of translating ribosomes to bypass the stop codon and continue translation in the - reading frame, thus creating the larger orf ab polyprotein ( - ). these programmed - ribosomal frameshift (- prf) stimulating signals are typically composed of a heptameric slippery site, on which the ribosome can change register by nt in the direction, followed by a pseudoknot. slippery site sequence requirements have been characterized for several cell types ( ) but the range and diversity of frameshift-stimulating pseudoknots continues to grow ( ) . most frameshift-stimulating pseudoknots are two-stemmed h-type structures. however, we and others have shown that the sars coronavirus (sars-cov) - prf signal is composed of three stems [ figure a ; ( , , ) ]. secondary structure predictions indicate that the potential to form the third stem is conserved among group ii coronaviruses even though the rna sequences themselves are not well conserved ( ) . interestingly, removal of the third stem from the coronavirus frameshift signal still allows for frameshifting ( , , , ) . thus, it is not clear what the molecular role of the additional stem-loop (stem -loop [s l ]) is, and this requires further study. 
here we scrutinize features of the third stem of the sars-cov frameshift-stimulating pseudoknot that are important for rna structure and frameshifting efficiency. we demonstrate the importance of the capping loop sequence in promoting stem stability and maintaining near-wild-type levels of frameshifting. specifically, a hexanucleotide, self-complementary sequence in the loop capping stem raises the possibility that dimerization of the pseudoknot may play a role in viral lifecycle. while the palindromic sequence embedded in the sars-cov stem is not strictly conserved among severe acute respiratory syndrome-related coronaviruses (sarsr-cov), genomic sequences from rhinolophus chinese horseshoe bats (sarsr-rh-batcov) ( ) that were previously identified as a natural reservoir of sars-cov-related cov ( ) also encode the hexanucleotide palindrome. 'loop-loop kissing' interactions involving watson-crick base pairing between complementary rna loops are common rna-tertiary structural motifs and are used e.g. in retroviral replication ( ) ( ) ( ) ( ) ( ) . here, the retroviral nucleocapsid proteins can chaperone conversion of the non-covalently linked kissing dimer to a more thermodynamically stable extended duplex-mediated dimer linkage in a structural rearrangement suggested to be associated with viral particle maturation ( ) . pioneering work on infectious bronchitis virus suggested that the genomic rna (grna) of covs is not packaged as a dimer ( ) . however, cov replication is mediated by the synthesis of a negative-strand rna and also includes a discontinuous step involving synthesis of five to eight nested subgenomic rna (sgrna) intermediates ( ) . the three-stemmed wild-type sars pseudoknot. stems are labeled s , s and s in the order that they occur to along the rna. accordingly, loops are labeled l , l and l . note that l and l join adjacent stems, while l closes s (highlighted using gray box). 
only the last two digits of the wild-type sequence numbering are used for clarity. the palindromic sequence -acuagu- embedded into l is indicated using white circles. dashes represent watson-crick and the dot gu wobble base-pairing as confirmed by nmr spectroscopy. (b) stem deletion mutant Δs pk. (c) s l hairpin construct s l spanning nucleotides g to c . (d) s l hairpin constructs s l -acuucc and s l -acuagc with l mutations that render the palindromic sequence asymmetrical (highlighted using gray circles) while conserving a serine codon. (e) s l hairpin constructs s -cuug and s -gaaa where the nt l is replaced with the smaller tetraloops -cuug- and -gaaa- , respectively. (f) sars pseudoknot variants s - bp-cuug pk with a shortened stem is capped with a -cuug- tetraloop. s -cuug and s l -acuucc variant constructs highlighted with an asterisk (*) were also generated in the context of full-length pk. in the current work, we demonstrate that a previously overlooked loop-loop kissing interaction involving the conserved stem-loop embedded within the sars-cov pseudoknot occurs under physiological conditions in vitro. we further show that kissing dimer formation plays a role in frameshift-stimulation and modulates the relative abundance of full-length and subgenomic viral rnas. plasmids containing wild-type pseudoknot as well as the Δs pk mutant were described in plant et al. ( ) . the wild-type pk plasmid was used as a template for the generation of s l mutants. s l -only transcripts incorporated a cis-acting, -hammerhead ribozyme. this plasmid was the template for site-directed mutagenesis replacing loop -gcactagta- with gcactag ca, gcacttcca, cttg or gaaa. in vitro transcriptions were optimized and performed as described ( , , ) using unlabeled nucleotide triphosphates (mp biomedicals). rna transcripts were purified by fast-performance liquid chromatography with hitrap q column (ge healthcare) ( , ) . 
s l -only transcripts and variants were purified with a hitrap q (ge healthcare), followed by a dnapac pa columns (dionex) ( ) . purified rna was equilibrated with nuclear magnetic resonance (nmr) buffer [ mm kcl unless otherwise noted, mm sodium phosphate (ph . ), mm edta (ethylenediaminetetraacetic acid), mm sodium azide, : h o:d o]. all nmr spectra were recorded at k, k or k on bruker avance , , or mhz spectrometer equipped with either a triple resonance inverse detection cryoprobe ( and ) or standard triple resonance inverse detection-probeheads. nmr experiments were performed on samples of ml volume containing . - . mm sars-cov pk and stem-loop s l variant rna. data were processed using nmrpipe ( ) and analysed using sparky ( ) . one-dimensional imino proton spectra were acquired using a jump-return echo sequence ( ) . imino resonances were assigned sequence specifically from water flip-back, watergate d nuclear overhauser effect spectroscopy (noesy) spectra (t mix = - ms) ( ) . typically, for the h, h-noesy spectra, complex points were recorded with an acquisition time of ms for h (o ), and complex points with an acquisition time of ms for h (o ). repetition delays ranging from . (pk variants) to . s (s l variants) were used between transients, with scans per increment (total measuring times - h, respectively). unlabeled ( mm) and p-labeled ( pm) sars-cov pk and s l rna transcripts were annealed in nmr buffer unless stated otherwise. when mm mgcl was added, samples were incubated for h at c. temperature, time of incubation and rna concentration varied, as specified in the text. rna samples were separated on % native polyacrylamide gel in tris borate buffer ph . at c when mgcl was added to the dimerization reaction. otherwise, tris borate edta ph . was used and the gel analysis performed at c or c, as specified in the text. gels were dried and analysed by phosphorimaging. in figures a and b , gels were analysed by ethidium bromide staining. 
rna samples were incubated overnight at c to promote dimerization. samples were crosslinked in an ultraviolet (uv) crosslinker (spectroline) for min, nm, . mw/cm on ice ( ) . crosslinked rna was separated on denaturing page (polyacrylamide gel electrophoresis) and eluted with . m sodium acetate, % sodium dodecyl sulfate (sds), ph . at c for h. precipitated rna was subjected to partial alkaline hydrolysis. control, non-crosslinked, p-labeled rna was subjected to rnase t digestion or partial alkaline hydrolysis (applied biosystems). all samples were separated on denaturing % page. gels were dried and densitometry of bands was determined using imagequant tl (ge healthcare) ( ) . transfected veroe cells were grown overnight in dulbecco's modified eagle medium supplemented with % fetal bovine serum at c. cells were disrupted using the passive lysis buffer (promega). luminescence reactions were initiated by addition of - ml of cell lysates to ml of the promega lar ii buffer and completed by addition of ml stop-n-glo reagent. luminescence was measured using a turner design td / . at least three replicates were performed within each assay, and all assays were repeated at least three times until the data were normally distributed ( ) . the frequency of frameshifting is expressed as a ratio of firefly to renilla luciferase from a test plasmid divided by the analogous ratio from the read-through control plasmid multiplied by %. fold change, standard error and estimates of the p-values for ratiometric analyses were performed as previously described ( ) . veroe cells were inoculated with sars-cov or s l -acuucc pk (multiplicity of infection of ) or were mock-infected and incubated at c. media were harvested at , , or h post-infection (pi) and titers assessed by plaque assay as previously published ( ) . viral detection limit was pfu/ml. error bars are the standard deviation of three measurements. 
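the frameshifting frequency defined above (firefly/renilla ratio of the test plasmid divided by the same ratio of the read-through control, expressed as a percentage) can be written as a short calculation. this is a minimal sketch of that definition only; the function name frameshift_percent and the luminescence readings in the example are hypothetical and are not data from this study, and the statistical treatment of replicates described in the text is not reproduced.

```python
def frameshift_percent(test_firefly, test_renilla, ctrl_firefly, ctrl_renilla):
    """Frameshifting frequency in percent from dual-luciferase readings.

    The firefly/Renilla ratio of the test plasmid is normalised to the
    same ratio of the read-through control plasmid.
    """
    readings = (test_firefly, test_renilla, ctrl_firefly, ctrl_renilla)
    if any(value <= 0 for value in readings):
        raise ValueError("luminescence readings must be positive")
    test_ratio = test_firefly / test_renilla
    ctrl_ratio = ctrl_firefly / ctrl_renilla
    return 100.0 * test_ratio / ctrl_ratio


if __name__ == "__main__":
    # Hypothetical readings in arbitrary luminescence units.
    value = frameshift_percent(150.0, 1000.0, 1000.0, 1000.0)
    print(f"frameshifting: {value:.1f}%")
```

normalising to the read-through control corrects for differences in translation and reporter stability that are unrelated to the frameshift signal itself.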
total rna from sars-cov and s l -acuucc pk was isolated from infected cell monolayers (trizol invitrogen) and purified using oligotex mrna spin column reagents (qiagen). rna was separated on an agarose gel using northern-max-gly (ambion), transferred to a brightstar-plus membrane (ambion) and cross-linked to the membrane with uv light. the blot was pre-hybridized and probed with a sars-cov nucleocapsid-specific oligodeoxynucleotide probe ( -cttgactgccgcctc , where biotinylated nucleotides are designated with a superscript b). blots were hybridized overnight and washed with low- and high-stringency buffers. filters were incubated with streptavidin-ap, washed and then developed using the chemiluminescent substrate cdp-star (new england biolabs). at , and h pi, sars-cov and s l -acuucc pk infected or mock-infected cells were washed and lysed in buffer containing mm tris-hcl (ph . ), mm nacl, . % deoxycholine, % nonidet p- , . % sds and post-nuclear supernatants added to an equal volume of mm edta and . % sds, resulting in a final sds concentration of . %. samples were heat inactivated twice before usage. on - % criterion gradient gels (bio-rad), mg of protein was loaded and transferred to a polyvinylidene difluoride membrane. blots were probed with polyclonal rabbit antisera directed against nsp (diluted : ) or nsp (diluted : ) ( ) and developed using enhanced chemiluminescence reagents (amersham biosciences). all cov genome sequences were obtained from genbank (http://www.ncbi.nlm.nih.gov/genbank/). accession numbers for sequences discussed are summarized in supplementary table s . multiple sequence alignments were performed using clustalw (version . . ) (http://www.ebi.ac.uk/tools/msa/clustalw /) ( ) . free energies of the proposed s l structures at c in m nacl were calculated with mfold ( ) . our previous nmr analysis of exchangeable imino protons of the sars-cov pseudoknot ( figure a , wild-type pk) provided unequivocal evidence for the existence of stem ( ). 
in the present study, secondary structure analysis by nmr provided further insight into the complex global architecture of the wild-type pk. to establish experimentally whether stem interacts with the two-stemmed h-type structure, we prepared a transcript lacking the base-paired region of stem while retaining the hexanucleotide -acuagu- palindromic sequence ( figure b , Δs pk) and compared this construct with wild-type pk. the imino noesy spectrum of the Δs pk transcript is virtually identical to the wild-type pk native construct with a few marked exceptions. the missing sequential imino assignment path for the mutant Δs pk is indicated by dashed lines in the superposition of wild-type pk and Δs pk noesy spectra shown in figure a . for clarity, these nuclear overhauser effect (noe) connectivities including the characteristic g -u wobble pair are highlighted in the schematic secondary structure of the three-stemmed pseudoknot structure (dashed box) and correspond to stem . no significant chemical shift or linewidth changes can be observed for imino protons located outside of stem . the only notable exception is the severely broadened crosspeak for the g -c base-pair in the Δs pk construct located in the vicinity of the stem -stem junction (figure a , red box). a possible explanation for this observation is that a longer unpaired loop consisting of (Δs pk) rather than nt (wild-type pk) may affect the degree of overrotation at the s -s junction. addressing this question in more detail would require a complete structure determination. at this stage, the large intrinsic linewidth of a -nt wild-type pk rna in combination with severe line-broadening observed for nmr samples concentrated to > mm made this procedure unrealistic (data not shown). 
however, the overall comparison of exchangeable imino proton spectra of the wild-type pk with Δs pk suggests that stem does not noticeably engage in stable tertiary interactions involving the two-stemmed h-type structure and likely constitutes an autonomous substructure within the frameshift signal. the nmr line-broadening observed motivated a closer examination of the sars pseudoknot and revealed a palindromic sequence in the loop capping stem , designated loop (l ) ( figure a ). to determine whether this palindromic sequence could mediate dimerization, stem transcripts ( figure c , s l ) were incubated at c for min in the presence of kcl and subjected to native page. lane of figure a shows two bands and demonstrates that s l transcripts form homodimers when analysed at c. to evaluate the role of the palindromic sequence for dimer formation, a series of mutations were generated ( figure d -f) with the intention of altering the palindromic sequence ( figure d ), replacing the entire loop ( figure e ) or significantly reducing the size of both stem and loop ( figure f ). when the palindromic sequence was mutated from -acuagu- to acuagc or acuucc, the resulting s l -acuagc and -acuucc transcripts migrated as single species, indicating that dimer formation was abolished ( figure a , lanes and ). unexpectedly, replacement of the entire loop ( -gcacuagua- ) with the stable tetraloop -cuug- generated a transcript migrating as a dimeric species ( figure a , lane ). this could be explained by the formation of an extended duplex featuring two central u-u wobble mismatches and was further investigated using nmr methods. in contrast, a gaaa- tetraloop containing transcript, s -gaaa, efficiently prevented dimer formation in vitro ( figure a , lane ). to verify whether dimers could be observed in the context of the full-length pseudoknot, sars wild-type pk transcripts were incubated in the same conditions described above. 
full-length pk homodimers were observed ( figure c , lane ) as demonstrated for s l constructs ( figure c , lane ). to confirm that pk dimerization occurs via loop , full-length pk, Δs pk and s - bp-cuug pk were incubated with s l transcripts ( figure c ). detection of heterodimers was performed by incubation of unlabeled pk with p-labeled s l (s l *) transcripts ( figure c , lane ), where the radioactive-labeled transcripts showed mobility shifts compatible with both homo- and heterodimers. similarly, Δs pk, a construct that lacks stem but retains loop , was found to self-associate ( figure c , lane ) and to form heterodimers when incubated with s l ( figure c , lane ), albeit weakly. s l was unable to form heterodimers with s - bp-cuug pk ( figure c , lane ), a variant lacking sequence complementarity in loop . these results collectively indicate that the sars pseudoknot can homodimerize in vitro via the palindromic sequence located in loop . because Δs pk is able to self-associate weakly, we conclude that stable s formation facilitates dimerization but is not a requirement. the stability of loop-loop kissing interactions in mitochondrial transfer rna ( ) and viral rna ( , ) is highly dependent on cation concentration. to test whether sars s l dimer formation is favored in the presence of mg + -ions ( ), [ p] -end-labeled transcripts were incubated in the absence or presence of mgcl at c and separated by native page. comparison of figures a and b shows that s l self-association is a mg + -dependent event. in the absence of mg + , no appreciable change in dimer population was observed between and h at c ( figure a ). next, we investigated s l self-association as a function of increasing monovalent potassium cation concentration ( figure e ). 
we found that s l self-association was predominantly driven by the presence of mg2+ because significant but slow dimerization resulted from the addition of - mm kcl in the absence of mg2+ ( figure e , open circles), while responses to varying kcl concentrations were negligible in the presence of mm mg2+ ( figure e , closed circles). to examine the influence of temperature on dimer formation, s l rnas were incubated in the presence of mgcl at c and c (figures c and d ). aliquots were collected at various time points and stored at − c until all samples were analysed together by native page. as shown in figure d , s l dimers readily formed at physiological temperatures (t ½ = ± min), while dimer formation was considerably slower at room temperature (t ½ = . ± . h; figure c ). we also quantified the concentration dependence of s l dimerization in the presence of mgcl and determined the dissociation constant for the dimer to be . ± . mm at c ( figure f ). taken together, these results suggest that stem /loop readily dimerizes under physiological conditions, i.e. at c in the presence of mm mgcl , and that homodimers tolerate a broad range of ionic strengths.
figure . nmr secondary structure comparison of wild-type sars-cov pk, Δs pk, s and s l -acuucc mutants. (a) imino regions of d h, h-noesy experiments collected on wild-type sars-cov pk (black contours) and Δs pk mutant (red contours), respectively. dashed black lines show the imino proton walk in the s stem. the base-paired region of s is deleted in the Δs pk mutant; however, l is left intact. solid red lines show the sequential noe correlations involving the imino proton u located in s , and the red box highlights the cross peak connecting imino protons u and g adjacent to the s -s junction, which is absent in the Δs pk mutant. only the last two digits of the wild-type sequence numbering are used for clarity. the schematic sars-cov pk inset highlights the corresponding s stem (dashed box) as well as the g -c basepair location in s (solid red box). (b) imino regions of d h, h-noesy experiments collected on wild-type sars s (black contours) and the s l -acuucc mutant (red contours), respectively. dashed black lines show the imino proton walk in the lower portion of the s stem. solid red lines highlight the sequential cross peaks in the upper portion of s correlating imino protons g , u and g adjacent to l , which are broadened beyond detection in the s l -acuucc mutant. the schematic sars s l inset highlights the corresponding lower s stem (dashed box) as well as the base-paired region in the upper s stem (solid red box).
intermolecular loop-loop kissing complexes of retroviral grna are initially metastable and are subsequently converted to more stable mature duplexes in a process catalysed by nucleocapsid proteins. such processes involving palindromic -nt sequences have been extensively studied and described for moloney murine leukemia virus ( ), hepatitis c virus ( ) and human immunodeficiency virus (hiv) dimers ( ). we thus asked whether s l loop-loop kissing complexes ( figure b ) can potentially form extended duplexes ( figure c ). seminal work by laughrea and jetté has established that loose (loop-loop kissing) dimers are unstable when subjected to native electrophoresis at c, while tight (extended) duplexes resist these conditions ( , ). as shown in lane of figures a and b , s l dimers, while detectable at c, are not favored at c, suggesting the formation of loop-loop kissing complexes. to further demonstrate that s l dimerizes via l -mediated loop-loop kissing interactions, uv-crosslinked dimers were subjected to partial hydrolysis. normalized densitometry of bands in each lane revealed a significant reduction in p-signal beginning at nucleotide ( figure b ), indicating that this nucleotide is in close proximity to the dimerization interface.
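The dissociation constant reported above for the stem-loop dimer can be related to the fraction of RNA strands that sit in dimers at a given total strand concentration. The sketch below assumes a simple two-state monomer-dimer equilibrium, 2M ⇌ D with Kd = [M]²/[D]; this is an illustrative assumption, not the fitting procedure used in the study, and the function name and example concentrations are hypothetical.

```python
import math


def dimer_fraction(total_conc, kd):
    """Fraction of RNA strands found in dimers for 2M <-> D.

    Assumes Kd = [M]^2 / [D] and mass balance total = [M] + 2[D].
    total_conc and kd must be given in the same concentration unit.
    """
    if total_conc <= 0 or kd <= 0:
        raise ValueError("concentrations must be positive")
    # Substituting [D] = [M]^2 / Kd into the mass balance gives
    # 2[M]^2 + Kd*[M] - Kd*total = 0; take the positive root.
    monomer = (-kd + math.sqrt(kd * kd + 8.0 * kd * total_conc)) / 4.0
    return 1.0 - monomer / total_conc


# Illustrative only: total strand concentration expressed in units of Kd.
for ratio in (0.1, 1.0, 10.0):
    print(ratio, round(dimer_fraction(ratio, 1.0), 3))
```

Under this model, half of the strands are dimerized when the total strand concentration equals Kd, and 80% are dimerized at ten times Kd.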
the nucleotide at position represents the adenosine -neighbor of the -nt palindrome, which is consistent with sars s l loop-loop kissing formation ( figure b ) and cannot be explained on the basis of extended duplex formation ( figure c ). detection of loop-loop kissing dimers for s l transcripts prompted us to investigate whether pk, Δs pk and s - bp-cuug pk homodimers formed loose or tight complexes. as shown in figure d , all three homodimers remained stable when subjected to native gel electrophoresis at room temperature, an indication of tight dimer formation ( , ). however, as observed previously, only faint bands were detected for Δs pk homodimers ( figures c and d, lane ). altogether, these results suggest that s l initially forms loop-loop kissing dimers and, when embedded into larger sars pk constructs, can mature to form tight dimers. extended duplex formation can also be observed in the case of constructs capped by stable -cuug- tetraloops. to obtain direct structural information about the observed homodimers, a series of rna transcripts was further investigated by nmr. watson-crick base pairing for the isolated stem rna was confirmed by two-dimensional nuclear noesy spectra, and a sequential imino walk is indicated in figure b . as compared with the native s l construct, a number of exchangeable imino resonances disappeared at k in the noesy spectrum of the asymmetric s l -acuucc loop mutant. the missing sequential imino assignment path for the mutant s l -acuucc corresponds to the upper portion of stem and is indicated by solid red lines in the superposition of s l and s l -acuucc imino noesy investigation, we attempted to monitor the conversion by nmr with nucleotide-specific resolution in the presence of only potassium chloride. therefore, we recorded a series of consecutive one-dimensional jump-return echo nmr spectra (figure ) to monitor s l imino proton resonances in real time.
the overlapping g and u as well as g and g imino proton resonance assignments were confirmed using d heteronuclear h, n-hmqc correlations (supplementary figure s ). dimer formation at c was induced by addition of mm potassium chloride to an nmr sample containing mm s l rna, and (interrupted) d spectra were recorded over a period of > days. figure shows that half of the kissing complex is formed after ~ h and conversion progresses for > days before reaching a plateau characterized by a s l :(s l ) ratio of : . no evidence for further progression to a potential extended duplex structure could be detected. curiously, the replacement of the -nt loop l with the stable -cuug- tetraloop (s -cuug, figure e ), designed to stabilize hairpin formation, generated a transcript that efficiently dimerized ( figure a, lane ). the imino assignments through the stem could be easily followed based on the resonance assignments and noe patterns observed in s l (supplementary figure s ). close inspection of the sequential assignments and noe patterns revealed the existence of tandem u-u wobble pairs involving the two uridine nucleotides of the tetraloop. in addition, substantial line broadening was observed, consistent with extended duplex formation of s -cuug transcripts (supplementary figure s ). subsequently, we examined the noesy spectra of the larger, three-stemmed, -cuug- tetraloop-containing sars mutants s - bp-cuug ( figure f ) and s -cuug pk that were subjected to functional frameshifting analysis. we confirmed the formation of the tandem u-u wobble pairs by d noesy. the noe patterns for the u-u wobbles are an almost perfect subspectrum of the same region of the noesy for the isolated s -cuug transcripts (supplementary figure s ). other exchangeable protons in the stem and stem regions comprising the h-type pseudoknot overlay with the corresponding imino protons observed in the context of wild-type pk.
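The real-time NMR series described above reports the time at which half of the kissing complex has formed and a plateau reached after several days. As a rough illustration of how a half-time could be extracted from such a time course, the sketch below assumes a single-exponential approach to a known plateau. That is a simplification (dimerization is bimolecular), it is not the authors' analysis, and the function name and any input values are hypothetical.

```python
import math


def half_time(times, fractions, plateau):
    """Estimate t1/2 assuming f(t) = plateau * (1 - exp(-k * t)).

    The model is linearised as -ln(1 - f/plateau) = k * t and k is fitted
    by least squares through the origin. Points at t <= 0 or at/above the
    plateau carry no usable information and are skipped.
    """
    num = 0.0
    den = 0.0
    for t, f in zip(times, fractions):
        remaining = 1.0 - f / plateau
        if t <= 0 or remaining <= 0:
            continue
        num += t * (-math.log(remaining))
        den += t * t
    if den == 0 or num <= 0:
        raise ValueError("no usable time points")
    k = num / den  # apparent first-order rate constant
    return math.log(2) / k
```

The returned half-time has the same unit as the supplied time points; with noisy data a nonlinear fit of the untransformed curve would usually be preferable to this linearisation.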
the excellent agreement in the overlay of the imino connectivities traced by the noesy experiments indicates that the stable -cuug- tetraloop capping an otherwise wild-type stem facilitates the formation of extended duplex structures in the context of larger s - bp-cuug and s -cuug pk mutants and locally retains the same structure as s -cuug in isolation. in contrast to the sars s l variants capped with -cuug- tetraloops, s -gaaa ( figure e ) nmr samples showed pronounced differences and narrower imino proton linewidths. based on noesy spectra and the observation of an additional upfield-shifted guanosine resonance located in the sheared g-a loop basepair, we concluded that a construct containing a -gaaa- tetraloop predominantly adopts a hairpin structure (supplementary figure s ), which is consistent with the native gel mobility, indicating a monomeric species (lane , figure a ). to determine the importance of l -mediated self-association for sars-cov - prf, several pseudoknot variants were subjected to a dual luciferase reporter-based frameshifting analysis as previously described ( figure a ) ( , ). a loop acuucc silent (serine codon) mutation was made to investigate the role of the palindrome (s l -acuucc; figure d ) in promoting - prf. phylogenetic analysis indicated that the sequence ucc is present at a ( ). dual luciferase measurements of - prf from the s l -acuucc pk construct showed a % reduction in frameshift stimulation compared with wild-type pk ( . % versus . %, figure a ). next, we tested two s mutants for frameshift stimulation in the luciferase assay. s -cuug ( figure e ), which we showed to readily self-associate ( figure a, lane ), demonstrated a nominal increase in frameshift frequency ( . %), while a s truncation of the same construct (s - bp-cuug, figure f ) exhibited near wild-type efficiency ( . %).
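The frameshift percentages above come from a dual luciferase reporter assay. The sketch below shows the normalisation commonly used for such assays, in which the downstream/upstream reporter ratio of the test construct is divided by the same ratio from an in-frame (zero-frame) control. Whether the study used exactly this calculation is an assumption here (the text only says "as previously described"), and the function names and readings are hypothetical.

```python
def frameshift_efficiency(test_downstream, test_upstream,
                          control_downstream, control_upstream):
    """Percent frameshifting from dual-reporter luminescence readings.

    The downstream reporter is only made after a frameshift in the test
    construct, and is always made in the in-frame control construct.
    """
    if min(test_upstream, control_downstream, control_upstream) <= 0:
        raise ValueError("upstream and control readings must be positive")
    test_ratio = test_downstream / test_upstream
    control_ratio = control_downstream / control_upstream
    return 100.0 * test_ratio / control_ratio


def percent_reduction(mutant_efficiency, wild_type_efficiency):
    """Relative loss of frameshift stimulation versus wild type, in percent."""
    return 100.0 * (1.0 - mutant_efficiency / wild_type_efficiency)
```

Dividing by the upstream reporter corrects for differences in transfection and translation between samples, which is the main reason a two-reporter design is used.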
it is important to note that the previously characterized s -deletion Δs pk construct maintained dimerization capability, albeit with reduced efficiency (figures b and c, lane ). our results are consistent with previous analyses by three different groups that have shown mutations to the third stem of a three-stemmed pseudoknot to have less impact on frameshifting efficiency than mutations to stem ( , , , ). however, our data indicate that silent codon changes to loop that disrupt the palindrome play a role in regulating the frequency of frameshifting. having shown that the sars-cov pseudoknot dimerizes in vitro and alters frameshifting in vero cells, we asked whether the ability to self-associate is important for viral propagation. site-directed mutagenesis was used to introduce the loop acuucc silent mutation into sars subclone d ( ), and the recombinant virus mutants were recovered from full-length virus transcripts. because the mutation is silent, the observed effects should reflect changes in rna structure within the frameshifting signal rather than changes to the encoded proteins. to determine the effects on viral growth kinetics, stocks of rescued wild-type sars-cov and s l -acuucc pk (multiplicity of infection (moi) of ) were used to infect veroe cells; media were harvested at , and h pi and tested by plaque assay. viral titers for sars-cov increased by . logs at h pi and by . logs at h pi, while the growth kinetics for s l -acuucc, a mutant that does not form dimers, lagged behind at ~ log increase at h pi and ~ . logs by h pi ( figure b ). our analysis demonstrates that disruption of the palindrome via silent codon changes does impede growth kinetics for s l -acuucc pk at early times pi; however, by h pi, similar titers were detected (data not shown).
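The growth kinetics above are expressed as log increases in plaque titer. A minimal sketch of that arithmetic follows, assuming titers are compared with the earliest time point of each series; the function names and any titer values are hypothetical and are not data from the study.

```python
import math


def log10_increase(titer_start, titer_end):
    """Log10 fold change in titer (e.g. PFU/ml) between two time points."""
    if titer_start <= 0 or titer_end <= 0:
        raise ValueError("titers must be positive")
    return math.log10(titer_end / titer_start)


def growth_lag(wild_type, mutant):
    """Wild-type minus mutant log10 increase at each shared time point.

    Both arguments map time post infection (h) to titer. The earliest
    time point of each series serves as its own baseline.
    """
    t0_wt, t0_mut = min(wild_type), min(mutant)
    lag = {}
    for t in sorted(set(wild_type) & set(mutant)):
        if t == t0_wt or t == t0_mut:
            continue  # baseline points have no increase to compare
        wt_gain = log10_increase(wild_type[t0_wt], wild_type[t])
        mut_gain = log10_increase(mutant[t0_mut], mutant[t])
        lag[t] = wt_gain - mut_gain
    return lag
```

A positive lag at early times that shrinks to about zero later would correspond to the pattern described in the text, where the mutant catches up with wild type at late times post infection.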
to determine if the differences in viral growth kinetics were due to reduced levels of rna replication and/or synthesis, we harvested total rna at , , and h pi and performed northern blot analysis. total rna was probed using a nucleocapsid-specific biotinylated probe ( figure a ) to examine levels of grna. s l -acuucc grna levels were reduced ~ -fold when compared with sars-cov at and h pi, suggesting that reduced rna replication contributed to the reduction in viral titers ( figure b ). to determine the quantities of sgrna species, the total rna was enriched for poly a containing mrna species ( figure c ) before separation on agarose gel. all sgrna species were readily detectable in both sars-cov and s l -acuucc samples; however, a ~ . -fold reduction in the quantity of sgrna species was detected at and h pi ( figure d ), suggesting that reduced levels of rna transcription also contributed to the titer loss. finally, we compared the levels of orf a and orf b replicase proteins to determine if these protein levels were altered as well. equivalent protein concentrations were probed with polyclonal rabbit antisera directed against nsp or nsp as indicated ( figure e and f). reduced orf a and orf b replicase protein levels were seen at and h pi, but similar levels were detected by h pi (supplementary figure s ). together, these data indicate that dimerization is not required for viral viability but is important for maintaining efficient rates of virus growth and of genomic and subgenomic mrna synthesis. while stem is conserved in group coronaviruses, the palindrome in loop is not. the common ancestor of the civet and human strains seems to have been a bat virus ( , ). therefore, we attempted to obtain information about the combined conservation of sequential and structural features of stem from a compilation of sequences from related human [genbank accession codes ay ( ) and nc_ ( , ) ], masked palm civet [genbank accession code ay ( ) ] and bat coronaviruses.
for this purpose, mfold ( ) was used to predict lowest energy structures of aligned cov s sequences. while out of nt of the s l region are absolutely invariant between the viral species (supplementary figure s ), the variation that does occur most frequently disrupts the hexanucleotide palindromic sequence at its -end through a u-to-c substitution (supplementary figure s ). the u-to-c substitution stabilizes the hairpin and lowers the calculated Δg from − . kcal/mol to − . kcal/mol (supplementary table s ). we previously showed that this 'ancestor' s l -aguagc variant was dimerization incompetent (lanes , figure a, d). on the basis of phylogeny clustering, covs are classified into three groups (group - ). the sars-cov lineage has been proposed to cluster with group ( , ). to further analyse conservation patterns within the s element in a broader context, we aligned complete genomic cov sequences (summarized in supplementary table s ). as previously shown, pseudoknots containing a structured stem can be predicted for all group coronaviruses ( ). the calculated average free energy for predicted stem s of group cov is Δg = − . ± . kcal/mol. except for feline cov (Δg = − . kcal/mol), all group sequences are not predicted to form stable stem substructures and are characterized by average free energies around zero (Δg = − . ± . kcal/mol). similarly, in group coronaviruses the s l region is most likely a single-stranded loop, as indicated by a predicted Δg of − . kcal/mol ( ). remarkably, among the predicted s l structures of all group coronaviruses investigated, sars-cov codes for the second most labile stem with a Δg of − . kcal/mol; only the human cov hku stem is predicted to be less stable, as indicated by its Δg of − . kcal/mol (supplementary figure s ). thus, sars-cov may form homodimers using its unique hexanucleotide palindromic sequence to compensate for a lack of thermodynamic stability.
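The predicted folding free energies above can be read as approximate folded populations. The sketch below assumes a two-state model (folded versus unfolded stem) at 37 °C, with equilibrium constant K = exp(−ΔG/RT); the two-state assumption, the default temperature and the function name are illustrative choices, and no energies from the supplementary table are reproduced here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def fraction_folded(delta_g_kcal, temp_c=37.0):
    """Two-state folded fraction for a folding free energy in kcal/mol.

    Negative delta_g favours the folded stem. Written as
    1 / (1 + exp(dG/RT)), which equals K/(1+K) with K = exp(-dG/RT).
    """
    temp_k = temp_c + 273.15
    return 1.0 / (1.0 + math.exp(delta_g_kcal / (R_KCAL * temp_k)))


# Illustrative energies only (kcal/mol), not values from the study.
for dg in (0.0, -1.0, -3.0, -6.0):
    print(dg, round(fraction_folded(dg), 3))
```

In this picture, a predicted free energy near zero corresponds to a stem that is folded only about half of the time, which is why sequences with energies around zero are described as not forming a stable substructure.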
although a coronavirus frameshift signal was among the first pseudoknots to be identified and has been extensively studied using mutagenesis ( ), the size of the cov pseudoknots has limited structural analyses at the atomic level. nmr and x-ray crystallography have been used to describe several smaller pseudoknots, but the differences between the pseudoknots make it impossible to extrapolate features from the smaller pseudoknots to the larger ones ( ). from studies of hiv and other retroviruses, it is clear that modulation of frameshift efficiency can have a dramatic effect on virus viability, and the same is true for positive-strand rna viruses ( , ). recently, it was demonstrated that the sars-cov pseudoknot could potentially be a target for antiviral agents, making it imperative to understand its structure and function in greater detail ( ). our studies of the third stem of the sars coronavirus pseudoknot illuminate features of stem that affect rna structure, frameshifting frequency and viral replication. typically, g-c-rich watson-crick complementarity is required for stable loop-loop base pairing interactions between two rna hairpins. among all natural hiv isolates, two palindromic hexanucleotide sequences are most commonly found: gcgcgc and gugcac ( , ). however, mutagenesis work investigating hiv rna dimerization also demonstrated that the guuaac palindrome found in siv mnd (simian immunodeficiency virus) could yield up to % dimers in the presence of mm mgcl ( ). in good agreement with those studies, we confirm that weaker palindromic sequences such as ac uagu can readily facilitate intermolecular loop-loop kissing rna-rna interactions. rna pseudoknots implicated in - prf stimulation are not static and are in fact dismantled during translation ( , ). if a frameshift event does not occur during the first round of translation, then the pseudoknot will have to refold for frameshifting to occur in subsequent rounds.
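The palindromic hexanucleotides discussed above are "palindromes" in the nucleic acid sense: each sequence equals its own reverse complement, so two copies can pair with each other in a loop-loop kissing complex. The sketch below checks that property for the sequences named in the text, assuming Watson-Crick pairs only (no G·U wobble) and reading the loop sequences with their stripped position numbers ignored; the function names are illustrative.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA sequence (Watson-Crick pairs only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.upper()))


def is_self_complementary(rna):
    """True if the sequence can pair with a second copy of itself."""
    return rna.upper() == reverse_complement(rna)


# HIV/SIV palindromes, the SARS-CoV loop palindrome, and the two loop mutants.
for seq in ("GCGCGC", "GUGCAC", "GUUAAC", "ACUAGU", "ACUAGC", "ACUUCC"):
    print(seq, is_self_complementary(seq))
```

The first four sequences are self-complementary, while ACUAGC and ACUUCC are not, in line with the loss of dimer formation reported for those two mutants.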
long unstructured loops between the portion of stem and portion of stem would be expected to reduce the chances of stem forming and reduce frameshifting efficiency. if this sequence can form a stable restricting s l substructure, as is the case for the sars-cov pseudoknot steadied by an intermolecular kissing complex, then the rapid re-formation of stem , which is required for efficient frameshifting, will be more probable. in previous work, the functional importance of the nt comprising stem and loop in the sars-cov pseudoknot (c through a ) was investigated and no obvious determinants for efficient frameshift stimulation for this specific element could be identified. deletion analysis characterizing variants with ( ) and nt ( ) crossing the minor groove side of stem yielded and % of wild-type frameshifting levels. however, because of the common lack of precise structural information for pseudoknot variants, determining the exact contribution of specific nucleotides of the pseudoknot to frameshifting is often difficult. there is a potential pitfall in mutational methods: longer-range distortions, which are difficult to predict using e.g. mfold, especially in the complicated three-stemmed architectures of group ii cov, could lead to false hypotheses regarding the physical mechanism of frameshifting. secondary structures and basepairing patterns for all sars-cov variants investigated in this study were characterized using nmr methods. because the -acu agu- palindromic sequence was left intact, even the s -deletion Δs pk mutant could dimerize and thus efficiently stimulated - prf. on the other hand, the s l -acuucc mutant did not engage in loop-loop kissing interactions and was characterized by an open loop structure, which destabilized stem ; its ability to promote - prf was reduced almost -fold.
when we capped stem with a stable cuug tetraloop that normally facilitates formation of stable hairpin structures ( ) and is highly conserved ( ) , the frequency of frameshifting almost returned to wild-type levels ( figure ) . surprisingly, in the context of the sars-cov stem sequence, -cuug- tetraloop-capped mutants readily formed extended duplex structures as revealed by native gel and nmr analysis. thus, the promotion of wild-type levels of frameshifting exhibited by s -cuug and s - bp-cuug sars pseudoknot variants is consistent with a mechanism involving dimer formation. the metastable sars-cov stem sequence, which features a g · u wobble pair and a bulged a , apparently evolved to facilitate a certain degree of dimerization if seeded through loop-loop kissing interactions that can even tolerate tandem u-u mismatches. the s l -acuucc pk mutant, in which frameshifting was reduced nearly -fold, was viable. however, while the mutant replicated to levels similar to the wild-type virus at later times pi, the inability to dimerize in infected vero cells significantly changed the viral growth kinetics, rna species and replicase protein levels. the fact that a silent codon change in the loop of sars-cov reproducibly affected the levels of grna and sgrna strongly suggests that dimer formation occurs in the cellular environment and that loop-loop kissing interactions involving stem of the pseudoknot are important for accumulation of sgrna. the in vitro findings of reduced total rna levels and orf a/b protein following infection leading to a lag in replication are consistent with reduced levels of orf b translation products as a result of the -fold reduction in - prf. the downregulation of the orf a gene product nsp is likely a result of the decrease in the amount of grna in the infected cells. 
several studies have demonstrated that facile formation of the h-type skeleton involving stem and of a frameshift-stimulating pseudoknot structure is critical for efficient frameshifting ( , ). intriguingly, long-distance intramolecular loop-loop kissing has also been implicated in - prf stimulation. specifically, the group human cov e has been shown to use a kissing stem loop to promote frameshifting rather than a compact h-type pseudoknot ( ). based on sequence alignments, such 'elaborated' pseudoknots could be used by human cov nl , porcine epidemic diarrhea virus and transmissible gastroenteritis virus ( ). similarly, the p -p pseudoknot from the luteovirus barley yellow dwarf virus can be considered a variation on the 'elaborated' cov pseudoknot structure. it contains a loop l of nearly kb, with s formed by long-range kissing interactions with nucleotides near the end of the genome ( ). such a long-distance rna-rna kissing interaction is also implicated in - prf of red clover necrotic mosaic virus ( ). thus, intermolecular dimerization in sars-cov may simply represent a variation of the 'elaborated' pseudoknot. in conclusion, we demonstrate that the structured third stem of sars-cov represents an integral part of the - prf stimulating pseudoknot topology. while the exact functional interplay in the viral lifecycle and the timing of stem folding-unfolding interconversions remain to be determined, our results provide important first insight into a palindromic sequence element embedded into stem that permits loop-loop kissing and thereby supports efficient - prf stimulation and efficient translation of orf a/b encoded proteins, and keeps relative amounts of sgrna close to wild-type levels.
a three-stemmed mrna pseudoknot in the sars coronavirus frameshift signal an atypical rna pseudoknot stimulator and an upstream attenuation signal for - ribosomal frameshifting of sars coronavirus mechanisms and enzymes involved in sars coronavirus genome expression comparative study of the effects of heptameric slippery site composition on - frameshifting among different eukaryotic systems frameshifting rna pseudoknots: structure and mechanism programmed ribosomal frameshifting in decoding the sars-cov genome mutational analysis of the rna pseudoknot component of a coronavirus ribosomal frameshifting signal ecoepidemiology and complete genome comparison of different strains of severe acute respiratory syndrome-related rhinolophus bat coronavirus in china reveal bats as a reservoir for acute, self-limiting infection that allows recombination events a review of studies on animal reservoirs of the sars coronavirus requirements for kissing-loop-mediated dimerization of human immunodeficiency virus rna evidence that a kissing loop structure facilitates genomic rna dimerisation in hiv- a -nucleotide sequence upstream of the major splice donor is part of the dimerization domain of human immunodeficiency virus genomic rna identification of the primary site of the human immunodeficiency virus type rna dimerization in vitro impact of human immunodeficiency virus type rna dimerization on viral infectivity and of stem-loop b on rna dimerization and reverse transcription and dissociation of dimerization from packaging dimerization of retroviral rna genomes: an inseparable pair genome of infectious bronchitis virus biochemical aspects of coronavirus replication and virus-host interaction rna structure determination by nmr nmrpipe: a multidimensional spectral processing system based on unix pipes university of california absorption mode two-dimensional noe spectroscopy of exchangeable protons in oligonucleotides use of a water flip-back pulse in the homonuclear noesy experiment 
pattern of -thiouridine-induced cross-linking in s ribosomal rna in the escherichia coli s subunit systematic analysis of bicistronic reporter assay data severe acute respiratory syndrome coronavirus infection of human ciliated airway epithelia: role of ciliated cells in viral spread in the conducting airways of the lungs identification and characterization of severe acute respiratory syndrome coronavirus replicase proteins clustal w and clustal x version . mfold web server for nucleic acid folding and hybridization prediction dimerization of a pathogenic human mitochondrial trna dimerization of hiv- genomic rna of subtypes a and b: rna loop structure and magnesium binding molecular dynamics simulations of rna kissing-loop motifs reveal structural dynamics and formation of cation-binding pockets rna lego: magnesium-dependent formation of specific rna assemblies through kissing interactions bipartite signal for genomic rna dimerization in moloney murine leukemia virus hepatitis c virus genomic rna dimerization is mediated via a kissing complex intermediate kissing-loop model of hiv- genome dimerization: hiv- rnas can assume alternative dimeric forms, and all sequences upstream or downstream of hairpin - are dispensable for dimer formation hiv- genome dimerization: kissing-loop hairpin dictates whether nucleotides downstream of the splice junction contribute to loose and tight dimerization of human immunodeficiency virus rna achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins reverse genetics with a full-length infectious cdna of severe acute respiratory syndrome coronavirus recombination, reservoirs, and the modular spike: mechanisms of coronavirus cross-species transmission isolation and characterization of viruses related to the sars coronavirus from animals in southern china the genome sequence of the sars-associated coronavirus unique and conserved features of genome and proteome of 
sars-coronavirus, an early split-off from the coronavirus group lineage maintenance of the gag/gag-pol ratio is important for human immunodeficiency virus type rna dimerization and viral infectivity analysis of natural variants of the human immunodeficiency virus type gag-pol frameshift stem-loop structure interference of ribosomal frameshifting by antisense peptide nucleic acids suppresses sars coronavirus replication structure and function of the human immunodeficiency virus leader rna dimerization of retroviral genomic rnas: structural and functional implications variant effects of non-native kissing-loop hairpin palindromes on hiv replication and hiv rna dimerization: role of stem-loop b in hiv replication and hiv rna dimerization correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of - ribosomal frameshifting solution structure of the cuug hairpin loop: a novel rna tetraloop motif architecture of ribosomal rna: constraints on the sequence of ''tetra-loops programmed translational frameshifting structure, stability and function of rna pseudoknots involved in stimulating ribosomal frameshifting an 'elaborated' pseudoknot is required for high frequency frameshifting during translation of hcv e polymerase mrna a sequence required for - ribosomal frameshifting located four kilobases downstream of the frameshift site a long-distance rna-rna interaction plays an important role in programmed - ribosomal frameshifting in the translation of p replicase protein of red clover necrotic mosaic virus we acknowledge the support of the hollings marine laboratory nmr facility for this work. conflict of interest statement. none declared. supplementary data are available at nar online: supplementary table and supplementary figures - . key: cord- -wwq sd r authors: liao, pei-yu; choi, yong seok; dinman, jonathan d.; lee, kelvin h. 
title: the many paths to frameshifting: kinetic modelling and analysis of the effects of different elongation steps on programmed – ribosomal frameshifting date: - - journal: nucleic acids res doi: . /nar/gkq sha: doc_id: cord_uid: wwq sd r several important viruses including the human immunodeficiency virus type (hiv- ) and the sars-associated coronavirus (sars-cov) employ programmed − ribosomal frameshifting (prf) for their protein expression. here, a kinetic framework is developed to describe − prf. the model reveals three kinetic pathways to − prf that yield two possible frameshift products: those incorporating zero frame encoded a-site trnas in the recoding site, and products incorporating − frame encoded a-site trnas. using known kinetic rate constants, the individual contributions of different steps of the translation elongation cycle to − prf and the ratio between two types of frameshift products were evaluated. a dual fluorescence reporter was employed in escherichia coli to empirically test the model. additionally, the study applied a novel mass spectrometry approach to quantify the ratios of the two frameshift products. a more detailed understanding of the mechanisms underlying − prf may provide insight into developing antiviral therapeutics. programmed ribosomal frameshifting (prf) is a process where specific signals in the mrna direct the ribosome to switch reading frame at a certain efficiency. in − prf, the ribosome slips nt towards the -end of the mrna during translation. several viruses, including human immunodeficiency virus type (hiv- ) and the coronavirus responsible for severe acute respiratory syndrome (sars-cov), employ − prf to synthesize precursors of enzymes for their replication ( , ) , and the ratio of the zero frame to − frame encoded products is important to the vitality of viruses ( ) ( ) ( ) . as such, altering − prf efficiency may damage viral replication [reviewed in ( ) ].
this suggests − prf as a target for the development of antiviral therapeutics. programmed − ribosomal frameshifting signals usually contain three essential mrna elements: (i) a 'slippery' heptanucleotide sequence x xxy yyz (x can be any three identical nucleotides, y is a or u and z is not g in eukaryotes; spaces denote the initial reading frame), where the ribosome changes the reading frame ( , ) ; (ii) a downstream stimulatory mrna secondary structure, typically a pseudoknot ( ) ( ) ( ) ; and (iii) a spacer between the slippery sequence and the stimulatory signal. it has been suggested that the stimulatory structural element promotes − prf by positioning the ribosome to pause over the slippery sequence ( ) ( ) ( ) . the length of the spacer has also been shown to affect frameshift efficiency ( , , ) . as prf occurs during translation elongation, models of − prf should be described within this context. the elongation cycle can be divided into four stages. first, the ribosome selects the cognate aminoacyl-trna (aa-trna) according to the codon at the decoding center (decoding, dc in figure ). second, the aa-trna moves from a/t entry state into the a/a state to be accommodated into the ribosome (aa-trna accommodation, aa in figure ). third, the ribosome catalyses peptidyltransfer, resulting in a peptidyl trna in the a-site and a deacylated trna in the p site (peptidyltransfer, pt in figure ). fourth, the peptidyl-trna moves from the a-site to the p-site, carrying the mrna along, and the deacylated trna moves out of the p-site into the e-site from where it dissociates (translocation, tl in figure ). translocation opens up the ribosomal a-site and the ribosome moves on to another round of aa-trna selection.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: khl@udel.edu
three major models have been proposed for the mechanism of − prf ( figure ). one hypothesis proposes that − prf takes place during accommodation of the aa-trna ( , , ) .
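The slippery heptanucleotide rule given in element (i) above (X XXY YYZ, with XXX any three identical nucleotides, YYY either AAA or UUU, and Z not G in eukaryotes) can be written as a small pattern check. This is only a sketch of the rule as stated in the text; the function name is hypothetical, and real slippery sites include exceptions that this simple test does not capture.

```python
def is_slippery_heptamer(site):
    """Check a heptamer against the X XXY YYZ rule quoted in the text.

    Spaces marking the initial reading frame are ignored and DNA-style
    input (T) is converted to RNA (U).
    """
    s = site.replace(" ", "").upper().replace("T", "U")
    if len(s) != 7 or any(base not in "ACGU" for base in s):
        return False
    x_ok = s[0] == s[1] == s[2]                   # XXX: three identical bases
    y_ok = s[3] == s[4] == s[5] and s[3] in "AU"  # YYY: AAA or UUU
    z_ok = s[6] != "G"                            # Z: anything but G
    return x_ok and y_ok and z_ok


print(is_slippery_heptamer("U UUU UUA"))  # heptamer of the form described
print(is_slippery_heptamer("U UUU UUG"))  # Z is G, so the rule is violated
```

Because the spaces only mark the initial reading frame, stripping them before the comparison leaves the seven positions in the order X, X, X, Y, Y, Y, Z.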
we have denoted this pathway ii. the simultaneous-slippage model ( ) originally suggested that peptidyl- and aa-trnas simultaneously slip by one base in the -direction to base pair with the −1 frame codons in the slippery site. in a refinement of this model ( ) , −1 prf was posited to occur when aa-trna and peptidyl-trna are located in the a/t entry and p/p site. the -Å model of −1 prf ( ) built upon both this and newly available structural data to propose that the ~ -Å movement of the anticodon loop in the -direction during aa-trna accommodation is constrained by the presence of the downstream stimulatory rna structural element. this creates tension on the mrna between the decoding center and the stimulatory element that can be relieved by decoupling of the a- and p-site trnas from the mrna followed by subsequent slippage of the mrna by one base in the -direction relative to the trnas, resulting in a net shift of the reading frame by −1 base. consistent with this model, mutations altering aa-trna accommodation were found to affect −1 prf ( ) ( ) ( ) . however, the simultaneous slippage-based models do not explain the role of sequences upstream of the slippery site, which have also been shown to affect the −1 prf efficiency ( , ) . a second general hypothesis proposes that −1 prf occurs during translocation. this can be modeled through two discrete kinetic pathways. the first suggested that after peptidyltransfer, the two trnas move to p/e and a/p states, followed by an incomplete, two-base translocation event promoted by the downstream mrna stimulatory structure ( ) . during this incomplete translocation event, the trnas dissociate from the mrna and re-pair with the −1 frame codons in the slippery site. we call this pathway iii. in support of this model, cryoelectron microscopy imaging revealed that a −1 prf stimulating pseudoknot can interact with the ribosome to block the mrna entrance channel, compromising the translocation process during −1 prf ( ) . 
the second co-translocational model proposed that incomplete translocation occurs one elongation cycle prior to the model by weiss et al. ( ) , and that trnas in the ribosomal e-, p- and a-sites are all involved in the process ( ) . this model suggests that incomplete translocation promotes formation of a transition intermediate, and that entry of the new aa-trna into the ribosome and the tendency of trnas to revert to stable states drives the shift in reading frame. this is pathway i. this model is supported by the demonstration that mutations altering e-site trna binding affect −1 prf ( ) . however, neither of the co-translocation models explains the presence of two species of frameshift proteins produced by hiv- frameshifting (see next paragraph).
figure . a mechanistic model of −1 programmed ribosomal frameshifting. two translation elongation cycles are depicted at the top: the ribosome undergoes decoding (dc), aa-trna accommodation (aa), peptidyltransfer (pt) and translocation (tl) twice to add two amino acids into the polypeptide sequences. a shift in reading frame may occur at the first tl step and the ribosome decodes a −1 frame a-site codon at the recoding site. additionally, −1 prf may occur during the second aa step, in which the ribosome has decoded the zero frame a-site codon. incorporation of the −1 reading frame aa-trna starts at the following cycle. moreover, the shift in reading frame may occur at the second tl step and incorporation of the −1 reading frame aa-trna starts at the following cycle.
protein sequencing was originally employed to generate the simultaneous slippage model, and to confirm that the −1 prf site for hiv- is u uuu uua located within the gag/pol overlap (where the p-site of the ribosome during frameshifting is underlined) ( ). 
interestingly, ~ % of the frameshift products contained phe-leu (derived from decoding the zero-frame uuu uua sequence), while ~ % of the products contained phe-phe (derived from decoding the −1 frame uuu uuu sequence) at the frameshift site ( , ) . previous studies suggested that the product with phe-phe at the frameshift site could result from slippage of the p-site trna alone ( , , ) , i.e. the product predicted by pathway i, and that the −1 frame aa-trna is subsequently recruited to the ribosome. however, the precise mechanism driving this process remained unclear, and no model has been proposed to date explaining the simultaneous formation of different frameshift proteins. here, we have developed a kinetic model of −1 prf to explain all of the experimental observations. this model reveals the major steps in the translation elongation cycle that affect −1 prf, and reconciles all three models of −1 prf. in addition, −1 prf efficiency was monitored in vivo using a dual fluorescence reporter ( ) and the compositions of different frameshift proteins were analysed by mass spectrometry. the experimental approach was also applied to study human t-cell leukemia virus type (htlv) pro-pol frameshift sequence. this is the first study to demonstrate and quantify the ratio of frameshift products incorporating −1 frame a-site trna at this −1 prf sequence. in agreement with the model predictions, experimental perturbation of different translation steps resulted in different levels of −1 prf efficiency as well as in the relative ratios of two types of frameshift proteins. our findings demonstrate that all three kinetic pathways are operative during −1 prf. in our earlier study, a kinetic model successfully described the effects of ribosome e-, p- and a-site interactions on +1 prf ( ) . a similar approach can be applied to understand the mechanism of −1 prf. the mechanistic model in the present study proposes that −1 prf can occur during translocation and/or aa-trna accommodation. 
figure describes the overall framework using abc dex xxy yyz fgh sequence as an example, where spaces separate zero frame codons and the slippery sequence is underlined. when −1 prf occurs during translocation, the presence of the downstream stimulatory structure forces the ribosome to translocate by two, rather than three, bases toward the -end of the mrna, thus shifting the reading frame. if this 'incomplete' translocation occurs to the pre-translocational ribosome aligning with dex xxy, translation of the −1 frame begins at yyy. alternatively, if incomplete translocation occurs to the pre-translocational ribosome aligning with xxy yyz, translation of the −1 frame starts at zfg. when −1 prf occurs during aa-trna accommodation, the two trnas interacting with xxy yyz slip to base pair with xxx yyy. consequently, translation of −1 frame starts at zfg (figure ). an elegant series of biochemical analyses has established detailed kinetic models of translocation ( ) and aa-trna selection ( ) . translocation involves ef-g binding to the pre-translocational ribosome, gtp hydrolysis, unlocking conformation change, pi release, trna movement, relocking conformation change and dissociation of ef-g from the post-translocational ribosome. this concept is illustrated along the top of figure from component pa (pre-translocational ribosome) to e p (post-translocational ribosome). detailed descriptions for each rate constant are shown in supplementary table s . selection and accommodation of aa-trna involves initial binding of the ternary complex ef-tu:aa-trna:gtp, codon recognition, ef-tu gtpase activation, gtp hydrolysis, dissociation of ef-tu from the ribosome and accommodation of the acceptor end of the aa-trna into the a-site or the rejection of the aa-trna by proofreading. detailed descriptions for each rate constant are shown in supplementary table s . 
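The frame bookkeeping in the abc dex xxy yyz fgh example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `post_translocation_codons` and the idea of tracking the P-site by its start index are assumptions introduced here, not part of the published model. A full translocation advances the ribosome by three bases and an 'incomplete' one by two.

```python
def post_translocation_codons(mrna: str, p_site_start: int, step: int = 3):
    """Return (P-site codon, A-site codon) after the ribosome advances `step` bases.

    `p_site_start` is the index of the first base of the P-site codon
    before translocation (hypothetical bookkeeping for illustration).
    """
    new_p = p_site_start + step
    new_a = new_p + 3
    return mrna[new_p:new_p + 3], mrna[new_a:new_a + 3]


# Placeholder sequence from the text, written without spaces.
MRNA = "ABCDEXXXYYYZFGH"

# Pre-translocational ribosome aligned with DEX (P-site) and XXY (A-site).
normal = post_translocation_codons(MRNA, MRNA.index("DEX"), step=3)
early_shift = post_translocation_codons(MRNA, MRNA.index("DEX"), step=2)

# Pre-translocational ribosome aligned with XXY (P-site) and YYZ (A-site).
late_shift = post_translocation_codons(MRNA, MRNA.index("XXY"), step=2)

print("normal:     ", normal)
print("early shift:", early_shift)
print("late shift: ", late_shift)
```

In this sketch the early two-base step leaves YYY in the A-site and the late one leaves ZFG there, matching the two starting points of −1 frame translation described in the text.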
in the absence of frameshifting, progression through these steps of the elongation cycle results in synthesis of the non-frameshift protein, called nfs ( figure ). our kinetic model suggests three possible reaction pathways that could generate −1 frameshift proteins ( figure ). in pathway i, blockage of the mrna channel entrance by the downstream stimulatory structure induces incomplete translocation with the pre-translocational ribosome positioned at dex xxy. specifically, the reading frame shift occurs between the trna movement and pi release (rate constant r tl ), and the relocking step (rate constant r ). weiss et al. ( ) suggested that when the two trnas move from p/e and a/p to the e/e and p/p states, they can un-pair from the mrna and re-pair with the −1 reading frame. in our model, r t represents the rate constant for a ribosome:ef-g:gdp complex with two trnas in the e- and p-sites (e p efggdp ) to re-pair with the −1 reading frame (e p efggdp ). this motion is reversible, as denoted by the r -t rate constant. this step is followed by a relocking conformational change and ef-g release from the ribosome complex. the resulting e p or e p (a-site unoccupied) then moves on to the aa-trna selection step. here, e p is the post-translocational ribosome aligning with dex xxy (zero frame) and e p is the post-translocational ribosome aligning with cde xxx (−1 frame), where subscript means a zero frame trna pairing with the zero frame; subscript means a zero frame trna pairing with the −1 frame. e p may generate non-frameshift product nfs, or enter pathway ii or iii described below. e p can generate frameshift product fs m , which incorporates the −1 frame aa-trna in the recoding site (yyy). in addition, it is also possible for e p to recruit a zero frame aa-trna for yyz (a ) and accommodate this aa-trna into the −1 frame. in this case, frameshift product fs z , which incorporates the zero frame aa-trna in the recoding site (yyz), is produced (pathway ia). 
in the second pathway, the downstream stimulatory structure induces ribosome pausing and promotes −1 prf during aa-trna accommodation. pathway ii suggests that simultaneous slippage of p- and a-site trnas occurs during accommodation and/or before peptidyltransfer. in figure , the process from p a to p a with the rate constant k pas describes the slippage in pathway ii. p a then proceeds through peptidyltransfer to generate fs z . in pathway iii, incomplete translocation occurs when the pre-translocational ribosome is positioned at xxy yyz. consequently, translation of the −1 frame is one codon downstream of yyz and the ribosome produces fs z . it is important to note that while both pathway i and iii involve incomplete translocation, the pathway i slip occurs one elongation cycle before the pathway iii slippage event.
figure . the kinetic framework for programmed −1 ribosomal frameshifting. top: the procedure from pa to e p represents translocation, which involves r , r - (reversible ef-g binding), r tl (gtp hydrolysis, unlocking conformation change, trna movement and pi release), and r (re-locking conformation change and ef-g dissociation). the e p complex then undergoes aa-trna selection: from e p to p a . the selection of aa-trna involves: k , k - (reversible ef-tu binding), k , k - (reversible codon recognition), k (gtpase activation, gtp hydrolysis, ef-tu conformation change and dissociation), and k (aa-trna rejection by proofreading), or k (aa-trna accommodation). the elongation cycle without a −1 prf event results in synthesis of non-frameshift proteins (nfs). pathway i in green suggests that −1 prf occurs during the relocking step in the first translocation, leading to the formation of fs m . pathway ia indicates that the e p complex may interact with a zero frame aa-trna and eventually produce fs z . pathway ii suggests that −1 prf occurs during aa-trna selection and accommodation, resulting in fs z . pathway iii suggests that −1 prf occurs during the second translocation step, resulting in fs z production.
all pathways were mathematically described as systems of ordinary differential equations (supplementary data). assuming steady state, the expressions of intermediate concentrations in terms of initial reactant (pa) were solved by matlab v.r a (mathworks inc., natick, ma, usa). by applying the empirically-determined rate constants and assumed ranges of rate constants of incomplete translocation, p- and a-site trna slippage (supplementary tables s -s ) , the amount of non-frameshift proteins nfs (p a pt in the kinetic model) and two types of frameshift proteins, fs m (p a pt in the kinetic model) and fs z (p a pt and p a pt in the kinetic model), were identified. the frameshift efficiency (fs%) in the model is defined as the amount of frameshift proteins divided by the amount of total proteins and multiplied by % [equation ( )]: $\mathrm{FS\%} = \dfrac{FS_M + FS_Z}{NFS + FS_M + FS_Z} \times 100\%$. the fraction of fs m is calculated as the amount of fs m divided by the amount of total frameshift proteins and multiplied by % [equation ( )]: $\text{fraction of } FS_M = \dfrac{FS_M}{FS_M + FS_Z} \times 100\%$. a program was developed in matlab v.r a to perform an n-way analysis of variance (anova). each parameter in the model was varied over five levels: a base line value, ± % of the base line, and ± % of the base line. randomly selected parameter sets were used to calculate fs%. a higher f statistic indicates a larger impact of the parameter on fs%. escherichia coli xl blue mrf (stratagene, la jolla, ca, usa) was used in all experimental studies. all constructs were verified by dna sequencing at the cornell bioresource center. construction of the dual fluorescence reporter was described earlier ( , ) , except that different linker sequences were incorporated into the reporter plasmid (table ). cells with the appropriate plasmids were cultured in ml luria-bertani (lb) medium containing mg/ml ampicillin with or without . 
mg/ml chloramphenicol in -well plates for h at rpm and c. fluorescence was measured using a plate reader (spectramax m , molecular devices, sunnyvale, ca, usa). fluorescence measurements were performed as described earlier ( ) . experimental frameshift efficiency (fs% exp ) was obtained as the ratio of green fluorescence to red fluorescence for the test strains and normalized against the fluorescence ratio of the control strain. statistical analyses were applied to all data sets as described earlier ( ) . a total of - replicates for test strains and control strains were performed to satisfy the minimum sample requirement for statistical significance. test strains were grown in ml lb medium containing mg/ml ampicillin in ml flasks at rpm and c. after h, od units of cells were collected by centrifugation at g and c for min. cells were lysed and purified by ni-nta under native conditions according to the manufacturer's protocol (qiagen). purified protein samples were resolved by sds-page ( % w/v polyacrylamide). gel band excision and in-gel trypsin digestion were performed using a previously described standard method ( ) . a representative flow chart of the mass spectrometry analysis is shown in supplementary data (supplementary figure s ). trypsin-digested frameshift protein samples resulted in target peptides spanning the recoding sites with a single amino acid difference between fs z and fs m . these peptides were analysed by nano-flow liquid chromatography tandem mass spectrometry using multiple reaction monitoring (nlc-mrm/ms). the digested sample was vacuum dried, reconstituted with ml of . % formic acid (fa), and a portion of each reconstituted sample was injected into dionex nlc system (sunnyvale, ca, usa). first, the sample was loaded onto an acclaim pepmap c trap column ( mm  mm, mm) and on-line desalting was carried out with water ( . % fa) at a flow rate of ml/min for min. 
table . frameshift sequences incorporated into the reporter plasmids (strain: sequence)
mb: gct aat ttt tta ggg aag atc tgg cct tcc tac aag gga agg cca ggg aat ttt ctt gga taa ag
mb ucc: gcu cct ttt tta ggg aag atc tgg cct tcc tac aag gga agg cca ggg aat ttt ctt gga taa ag
mb ccc: gcc cct ttt tta ggg aag atc tgg cct tcc tac aag gga agg cca ggg aat ttt ctt gga taa ag
tlv: ttc cct tta aac cag aac gcc tcc agg cct tgc aac act tgg tcc gga agg ccc tgg agg cag gcc taa
then, peptides trapped in the trap column were ( )]. samples were analysed in triplicate (except duplicates of ccc and tlv). the fraction of fs m (%) observed by ms was calculated as $\text{fraction of } FS_M\,(\%) = \dfrac{A_{FSM}}{A_{FSM} + A_{FSZ}} \times 100$, where a fsm is the sum of peak areas at different charge states for an fs m target peptide and a fsz is the same for an fs z target peptide in ms. the kinetic model allows for the evaluation of the effects of different translation elongation cycle steps on fs% and the fraction of fs m . sensitivity analysis revealed several parameters that have a greater influence on fs% ( figure ) . therefore, the model results will focus on these higher impact parameters in different pathways. in pathway i, −1 prf occurs during translocation while the pre-translocational complex is aligned with dex xxy. two parameters play important roles in pathway i in the kinetic model. here, r t represents the rate constant for incomplete translocation. an increase in r t while other parameters in the model remain constant leads to an increase in fs% (blue line in figure a ). both the levels of fs m and fs z increase when r t increases (green and red lines in figure a ). because the rise in the fs m level is larger, increasing r t results in a larger fs m fraction (figure b ). it is also interesting to note that the majority of fs z comes from pathway iii when r t is < s⁻¹ , while the majority of fs z is from pathway ii when r t is > s⁻¹ (figure b) . the rate constant r accounts for the relocking step during translocation. a decrease in r while other parameters in the model remain constant results in an increase in fs% (blue line in figure a ). 
both the levels of fs m and fs z increase when r decreases (green and red lines in figure a ). however, in this case the increase in the fs m level is larger, leading to a larger fs m fraction with a decrease in r (figure b) . here, the majority of fs z is from pathway iii when r is < s⁻¹ , but the majority of fs z results from pathway ii when r becomes > s⁻¹ ( figure b) . these results suggest that translocation perturbations by a downstream mrna secondary structure, by mutations, or by chemical inhibitors may result in a higher fs%, primarily due to production of a larger amount of fs m . notably, manipulating r t values causes larger changes in fs% and in the fs m fraction compared to the effect of r , suggesting a dominant role of r t in −1 prf in pathway i. consistent with our model, experimental studies demonstrated that mutating the e-site codon in the recoding site, or the use of a translocation inhibitor, altered fs% ( ) . in pathway ii, −1 prf occurs during aa-trna accommodation and the slippage occurs before peptidyltransfer. figure a shows that a higher k pas results in a higher fs%. interestingly, the larger fs% results from an increase in fs z while the level of fs m remains at a similar level (figure a) . therefore, the fraction of fs m is predicted to decrease as k pas increases ( figure b) . here, the majority of fs z is generated from pathway iii when k pas is < s⁻¹ , while the majority of fs z is produced from pathway ii when k pas is > s⁻¹ (figure b) . in pathway iii, −1 prf occurs during translocation while the pre-translocational complex is aligned with xxy yyz. the rate constant for the incomplete translocation step is denoted by r t . figure a shows that a higher r t promotes increased fs%. interestingly, the larger fs% results from an increase in fs z while the level of fs m remains relatively constant (figure a) . therefore, the fraction of fs m is predicted to decrease as r t increases (figure b ). 
in this case, the majority of fs z is generated from pathway ii when r t is < s⁻¹ , but the majority of fs z comes from pathway iii when r t is > s⁻¹ (figure b ). in the model, k pt represents the rate constant for peptidyltransfer, the last step in all three pathways. the model predicts that a decrease in k pt would result in a higher fs% due to increased production of fs z , while fs m synthesis remains relatively constant (figure a ). consequently, a smaller fraction of fs m is observed as k pt decreases (figure b ). in this scheme, the majority of fs z is synthesized from pathway ii when k pt is < s⁻¹ , while the majority of fs z comes from pathway iii when k pt is > s⁻¹ (figure b) . the model results are consistent with previous experimental observations that peptidyltransferase inhibitors affect fs% ( ) . to examine the model predictions, −1 prf efficiency was monitored in vivo using a dual fluorescence reporter system. in addition, compositions of the frameshift protein products were analysed by mass spectrometry. analysis of the frameshift products revealed that the ratio of fs z to fs m was ~ : in mb cells (figure ), thus indicating that the vast majority of −1 prf events naturally occur through pathways ii and/or iii. the model predicts that a smaller k pt should cause higher fs% and a lower fraction of fs m (figure ) . a prior study using yeast demonstrated that inhibition of peptidyltransfer promoted increased rates of −1 prf, but did not differentiate between fs m and fs z products ( ) . the model predicts that addition of chloramphenicol, a potent peptidyltransferase inhibitor in bacteria ( ), should promote increased fs%. consistent with the model, a . -fold increase in fs% exp was observed in the e. coli culture containing . mg/ml chloramphenicol compared to the culture without the drug. the fractions of fs m for the culture with and without chloramphenicol were . and . %, respectively (figure a) . 
although a slight decrease in the fraction of fs m was observed in the presence of the drug, the difference was not statistically significant (p > . ). the frameshift sequence for hiv- is u aau uuu uua, where a space separates each zero frame codon and the p-site of the recoding site is underlined. the e-site trna guu asn may form one canonical base pairing with the −1 frame uaa. in the mb ucc strain, the sequence was mutated to u ccu uuu uua (mutations shown in bold) where the e-site trna ggg pro can potentially form one g:u and two c:g interactions. in the mb ccc strain, the sequence was mutated to c ccu uuu uua (mutations shown in bold) where the e-site trna ggg pro can form three canonical base pairings with the −1 frame ccc. because pathway i requires that the e- and p-site trnas interact with the −1 frame, ucc and ccc as the −1 frame e-site codons may enhance this reaction, i.e. these codons would promote an increase in r t . the model predicts that a larger r t should result in a higher fs% due to increased production of fs m . consistent with this prediction, . - and . -fold increases in fs% exp were observed for the mb ucc and mb ccc strains, respectively, compared to the mb strain (figure b ). in the mb ucc strain, . % of the frameshift products were fs m , and in the mb ccc strain, . % of the frameshift products were fs m . these results suggest that changing the sequence to favor incomplete translocation, i.e. to favor pathway i, can dramatically alter the composition of the frameshift product. to further our understanding of two types of frameshift proteins, the htlv- pro-pol frameshift sequence was cloned into the reporter system. the extended frameshift sequence for htlv- pro-pol is c ccu uua aac (where spaces separate zero frame codons and the slippery sequence is underlined). similar to mb ccc, the e-site trna ggg pro can potentially form three canonical base pairings with the −1 frame ccc, which may create a favorable condition for pathway i. 
consequently, a significant amount of fs m among total frameshift proteins can be produced. consistent with the model, the frameshift efficiency for htlv- was . % and the fraction of fs m was . %. in this study, a mathematical framework was developed for −1 prf. to our knowledge, this is the first kinetic model to explain the production of two types of −1 frameshift proteins through three distinct kinetic pathways. using dex xxy yyz fgh as an example, pathway i predicts that a pre-translocational ribosome aligning with dex xxy may shift reading frame during incomplete translocation, producing the frameshift product (fs m ) incorporating a −1 frame aa-trna in the frameshift site (codon yyy). additionally, pathway ii predicts that a ribosome can change reading frame due to simultaneous slippage of p- and a-site trnas for a ribosome aligning with xxy yyz, generating a frameshift product (fs z ) incorporating the zero frame aa-trna in the frameshift site (codon yyz). lastly, in pathway iii, a pre-translocational ribosome aligning with xxy yyz may also undergo incomplete translocation to generate fs z . the kinetic model suggests that incomplete translocation of the pre-translocational ribosome aligning with dex xxy produces fs m . previous studies suggested that fs m may result from slippage of a single p-site trna ( , , ) . however, it is not clear when and/or how this could occur. in addition, the single slippage model does not explain experimental evidence regarding the influence of translocation on −1 prf ( , ) . our model suggests that both mechanisms, incomplete translocation and slippage of p- and a-site trnas, participate in synthesizing frameshift proteins to varying extents for different −1 prf signals. frameshifting at the hiv- sequence was reported to generate ~ % fs z and % fs m ( , ) , indicating that pathways ii and/or iii exert stronger influence on fs% than pathway i. 
notably, our protein analysis showed ~ % fs z and % fs m for the frameshifting signal in hiv- group m subtype b. this small discrepancy may be due to the use of different reporter systems, or to differences in the quantitative methods employed for the assay. for the htlv- pro-pol frameshift sequence, this study observed . % fs m in total frameshift protein. this is the first study to demonstrate another frameshift sequence that generates a significant amount of fs m in addition to hiv- . interestingly, a small but observable lysine peak appeared at the corresponding position (−1 frame a-site codon in the recoding site) when htlv- pro-pol frameshift proteins were sequenced in the previous study ( ) , supporting the production of fs m . the stimulatory rna of htlv- pro-pol has been suggested to be a pseudoknot ( ) , although no direct evidence for the rna structure was shown in the same study. interestingly, the mouse edr frameshift sequence, which shares high similarity with the htlv- pro-pol sequence, was suggested to involve a pseudoknot ( ) . because the length of htlv- pro-pol stimulatory signals is not well-defined, the frameshift sequence incorporated in the tlv strain may not represent the whole stimulatory signal. however, the possible absence of a portion of the stimulatory signal did not prevent us from observing a significant frameshift efficiency and a significant ratio of fs m to the total frameshift protein. frameshift products from other −1 prf signals were analysed previously ( , ) . for sars-cov frameshifting, fs m was not found ( ) . however, for the alphavirus coding sequence k, both fs z and fs m were identified in the frameshift products, although the exact ratio was not determined ( ) . the current study shows that all three kinetic pathways are operative during −1 prf. for hiv- and htlv- , experimental results indicate that . and . %, respectively, of frameshift proteins are fs m , i.e. the contribution of pathway i. 
although the experimental approach cannot differentiate between the relative contributions of pathways ii and iii to fs z production, the kinetic model can be used to discriminate between the effects of these two pathways. in the kinetic model, r t , k pas and r t represent non-regular events, which are likely to be rate-limiting steps. these rate constants are thus expected to be small. the values for r and k pt are and s⁻¹ , respectively (supplementary tables s and s ). as shown in figures - , when r t , r , k pas and r t are all in the small value ranges and k pt is at s⁻¹ , the model suggests that pathway iii contributes approximately equally or more to fs z production than does pathway ii. interestingly, inhibition of peptidyltransfer (decreasing k pt ) can switch to a condition in which pathway ii contributes more to fs z production than pathway iii (figure b ). these findings are consistent with the hypothesis that the −1 prf signals of different viruses have evolved within these kinetic parameters so as to produce the optimal ratios of shifted to unshifted products according to their specific biological requirements. the effect of incomplete translocation can be understood in two ways. incomplete translocation at the pre-translocational ribosome aligning with dex xxy produced more fs m (pathway i), consistent with the model by leger et al. ( ) . incomplete translocation at the pre-translocational ribosome aligning with xxy yyz resulted in more fs z (pathway iii), consistent with the models proposed by weiss et al. ( ) and namy et al. ( ) . enhancing incomplete translocation in pathway i promoted an increase in the fraction of fs m among total frameshift proteins. on the other hand, enhancing incomplete translocation in pathway iii decreases the fraction of fs m . while these two pathways were supported by previous studies, the direct observation of fs m for hiv- and htlv- pro-pol frameshift sequences provided proof for the validity of pathway i. 
interestingly, altering the extended frameshift sequence resulted in a significant change in frameshift protein compositions, supporting the role of the sequence upstream of the traditional slippery site xxy yyz. in the presence of chloramphenicol, a . -fold increase in fs% exp was observed while the fraction of fs m was not significantly different compared to the culture condition without the chemical (figure a) . the model predicts that k pt has a relatively smaller effect on fs% and the fraction of fs m than r t and k pas (figure ) . a dual fluorescence reporter can sensitively detect small changes in fs% in e. coli and mammalian cells ( , , ) . on the other hand, analysing the composition of the frameshift products relies on multiple manipulations including protein purification, gel electrophoresis, in-gel digestion, liquid chromatography and mass spectrometry. the multistage preparation may thus affect sample yields, complicating the detection of small changes in protein composition. a significant increase in the fraction of fs m was observed in both mb ucc and mb ccc strains (figure b) . mutation of the −1 frame e-site sequence to ucc and ccc in the hiv- frameshift site may enhance incomplete translocation in pathway i by allowing more interactions between e-site trna and the −1 frame. this result is consistent with the model that different mechanisms exist and participate in making frameshift proteins to different extents. the creation of a favorable condition for one pathway can affect the composition of frameshift proteins significantly. to date, no mutations affecting the composition of frameshift proteins have been reported in the literature. notably, our experimental results show that in one case, fs% increases significantly without a change in the composition of frameshift products (figure a) , while in another condition fs% increases by a smaller amount but the composition of frameshift products changes dramatically (figure b) . 
several parameters in the kinetic model, such as r t , r -t , k pas , k -pas , r t and r -t , have not been measured experimentally in the literature. the test ranges for these parameters are based on our current kinetic understanding of translation elongation. the rate constants for translocation and aa-trna selection range from . to s⁻¹ (supplementary tables s -s ). the average protein synthesis rate is - s⁻¹ in prokaryotic cells but some codons are translated at a rate < s⁻¹ ( ). therefore, a broad range ( - s⁻¹ ), within the scope of the known elongation rate constants, was tested to understand the impact of each of these unknown parameters on −1 frameshifting (figures , and ) . notably, these figures do not show the impact of these parameters > s⁻¹ because the curves level off for the parameters at this range.
figure . the effect of the rate constants related to sequential trna movement (a) r t , (b) r -t , (c) r t , and (d) r -t on the rate constants related to single step sequential trna movement (r t and r -t ). the base point is assumed as r t = s⁻¹ , r -t = s⁻¹ , r t = s⁻¹ , and r -t = s⁻¹ . the inset in (d) shows a zoom in of the plot.
although the investigation of whether the slippage of the two trnas is a simultaneous or a sequential process is beyond the scope of this study, a similar kinetic approach can be used to understand the effect of sequential trna movement on the overall process. using the movement of e- and p-site trnas as an example, the overall movement and the sequential movement can be described as the following: one step trna movement (rx. assuming steady state, r t and r -t can be represented by r t , r -t , r t and r -t (supplementary data). figure shows how a change in the rate constants in rx. can affect the overall rate constants in rx. . 
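A generic way to see how a two-step, sequential movement collapses into one overall reversible step is the textbook steady-state reduction of S0 ⇌ I ⇌ S1. The sketch below is not taken from the supplementary data, and the authors' exact expressions may differ. The names `k1_fwd`, `k1_rev`, `k2_fwd`, `k2_rev` and the base values of 1.0 are hypothetical placeholders for the first and second repositioning steps.

```python
def effective_rates(k1_fwd, k1_rev, k2_fwd, k2_rev):
    """Collapse S0 <-> I <-> S1 into S0 <-> S1.

    Assumes the intermediate I is at steady state, so a molecule in I
    goes forward with probability k2_fwd / (k1_rev + k2_fwd).
    """
    denom = k1_rev + k2_fwd
    k_fwd = k1_fwd * k2_fwd / denom   # overall S0 -> S1
    k_rev = k2_rev * k1_rev / denom   # overall S1 -> S0
    return k_fwd, k_rev


def elasticity(rates, index, output, rel_step=1e-6):
    """Numerical d ln(overall rate) / d ln(rates[index]).

    output = 0 selects the overall forward rate, 1 the overall reverse rate.
    """
    base = effective_rates(*rates)[output]
    bumped = list(rates)
    bumped[index] *= 1.0 + rel_step
    changed = effective_rates(*bumped)[output]
    return (changed - base) / (base * rel_step)


# Hypothetical base point: all four step rate constants equal (arbitrary units).
BASE = (1.0, 1.0, 1.0, 1.0)
NAMES = ("k1_fwd", "k1_rev", "k2_fwd", "k2_rev")

for i, name in enumerate(NAMES):
    print(name,
          "forward elasticity:", round(elasticity(BASE, i, 0), 3),
          "reverse elasticity:", round(elasticity(BASE, i, 1), 3))
```

Under this reduction the overall forward rate responds at least as strongly to the first forward step as to the second, and the overall reverse rate responds at least as strongly to the reverse of the second step as to the reverse of the first. If the first step is taken to be repositioning of the e-site trna (an assumption made here for illustration), this is the same trend the text reports.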
the result suggests that repositioning of the e-site trna to the −1 frame (represented by r t ) may have a larger impact than repositioning of the p-site trna (represented by r t ) on the slippage toward the −1 frame (represented by r t ). on the other hand, repositioning the p-site trna back to zero frame (represented by r -t ) may have larger impact than repositioning the e-site trna (represented by r -t ) on the slippage toward the zero frame (represented by r -t ). similarly, the same observation also applies to the movement for p-and a-site trnas (k pas , and k -pas ) in the model. a mathematical framework developed upon the translation elongation cycle revealed three distinct kinetic pathways for −1 prf. the model describes how alterations of these kinetic parameters can affect not only changes in frameshift efficiency, but also changes in the composition of frameshift products under different conditions. in addition, the model identifies the dominant parameters, representing steps in the translation elongation cycle, on −1 prf. experimentally targeting these steps resulted in different levels of frameshifting efficiency, consistent with model predictions. a mutation in the −1 frame e-site sequence was shown to dramatically change the composition of frameshift products, suggesting an important role for the sequence upstream of the slippery site. our results suggest that not only the frameshift efficiency, but also the compositions of the frameshift products, are worth investigating to advance our knowledge of −1 prf.
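the steady-state argument used for the sequential trna movement can be illustrated with a minimal sketch. it assumes a generic two-step scheme a ⇌ i ⇌ b (step 1 standing for repositioning of the e-site trna, step 2 for repositioning of the p-site trna) and the textbook steady-state approximation on the intermediate i. the expressions actually used in the study are given in its supplementary data, so the scheme, the function names and the unit rate constants below are assumptions for illustration only.

```python
def effective_rates(k1, k1_rev, k2, k2_rev):
    """Collapse A <-> I <-> B into A <-> B, assuming d[I]/dt = 0.

    k1, k1_rev: forward/reverse rate constants of step 1 (hypothetical names)
    k2, k2_rev: forward/reverse rate constants of step 2 (hypothetical names)
    Returns (k_forward, k_reverse) of the equivalent one-step reaction.
    """
    denom = k1_rev + k2
    k_forward = k1 * k2 / denom
    k_reverse = k1_rev * k2_rev / denom
    return k_forward, k_reverse


def elasticity(index, rates, output, rel_step=1e-6):
    """Relative change of one effective rate per relative change of one input.

    index:  which of (k1, k1_rev, k2, k2_rev) is perturbed (0..3)
    output: 0 for k_forward, 1 for k_reverse
    """
    base = effective_rates(*rates)[output]
    bumped = list(rates)
    bumped[index] *= 1.0 + rel_step
    new = effective_rates(*bumped)[output]
    return (new - base) / base / rel_step


if __name__ == "__main__":
    rates = (1.0, 1.0, 1.0, 1.0)  # arbitrary base point, not the study's values
    names = ("k1", "k1_rev", "k2", "k2_rev")
    for out, label in ((0, "k_forward"), (1, "k_reverse")):
        for i, name in enumerate(names):
            print(f"d ln {label} / d ln {name} = {elasticity(i, rates, out):.3f}")
```

under this assumed scheme the forward effective rate scales linearly with the first forward step but saturates with the second, and the reverse effective rate scales linearly with the reverse of the second step, which is qualitatively in line with the trend described above.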
characterization of ribosomal frameshifting in hiv- gag-pol expression mechanisms and enzymes involved in sars coronavirus genome expression the human immunodeficiency virus type ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy the role of programmed- ribosomal frameshifting in coronavirus propagation achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins translating old drugs into new treatments: ribosomal frameshifting as a target for antiviral agents mutational analysis of the ''slippery-sequence'' component of a coronavirus ribosomal frameshifting signal signals for ribosomal frameshifting in the rous sarcoma virus gag-pol region characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an rna pseudoknot rna pseudoknots: translational frameshifting and readthrough on viral rnas ribosomal movement impeded at a pseudoknot required for frameshifting ribosomal pausing during translation of an rna pseudoknot kinetics of ribosomal pausing during programmed - translational frameshifting the sequences of and distance between two cis-acting signals determine the efficiency of ribosomal frameshifting in human immunodeficiency virus type and human t-cell leukemia virus type ii in vivo programmed alternative reading of the genetic code the -a solution: how mrna pseudoknots promote efficient programmed - ribosomal frameshifting translocation of trna during protein synthesis translational misreading: mutations in translation elongation factor alpha differentially affect programmed ribosomal frameshifting and drug sensitivity an ''integrated model'' of programmed ribosomal frameshifting a reassessment of the response of the bacterial ribosome to the frameshift stimulatory signal of the human immunodeficiency virus type comparative mutational analysis of cis-acting rna signals for translational 
frameshifting in hiv- and htlv- the three transfer rnas occupying the a, p and e sites on the ribosome are involved in viral programmed - ribosomal frameshift e. coli ribosomes re-phase on retroviral frameshift signals at rates ranging from to percent a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting the function of a ribosomal frameshifting signal from human immunodeficiency virus- in escherichia coli p-site trna is a crucial initiator of ribosomal frameshifting a new kinetic model reveals the synergistic effect of e-, p-and a-sites on + ribosomal frameshifting an elongation factor g-induced ribosome rearrangement precedes trna-mrna translocation recognition and selection of trna in translation fsscan: a mechanism-based program to identify + ribosomal frameshift hotspots characterization of ribosomal frameshifting for expression of pol gene products of human t-cell leukemain virus type systematic analysis of bicistronic reporter assay data comparison of automated in-gel digest methods for femtomole level samples probability-based protein identification by searching sequence databases using mass spectrometry data peptidyl-transferase inhibitors have antiviral properties by altering programmed - ribosomal frameshifting efficiencies: development of model systems decreased peptidyltransferase activity correlates with increased programmed - ribosomal frameshifting and viral maintenance defects in the yeast saccharomyces cerevisiae structural basis for the interaction of antibiotics with the peptidyltransferase centre in eubacteria characterization of the frameshift signal of edr, a mammalian example of programmed - ribosomal frameshifting programmed ribosomal frameshifting in decoding the sars-cov genome discovery of frameshifting in alphavirus k resolves a -year enigma a homogeneous cell-based bicistronic fluorescence assay for high-throughput identification of drugs that perturb viral gene recoding and read-through of nonsense 
stop codons absolute in vivo translation rates of individual codons in escherichia coli. the two glutamic acid codons gaa and gag are translated with a threefold difference in rate the authors acknowledge abhinav rabindra jain for assistance in the laboratory. conflict of interest statement. none declared. key: cord- - ulketgy authors: snyder, e. e.; kampanya, n.; lu, j.; nordberg, e. k.; karur, h. r.; shukla, m.; soneja, j.; tian, y.; xue, t.; yoo, h.; zhang, f.; dharmanolla, c.; dongre, n. v.; gillespie, j. j.; hamelius, j.; hance, m.; huntington, k. i.; jukneliene, d.; koziski, j.; mackasmiel, l.; mane, s. p.; nguyen, v.; purkayastha, a.; shallom, j.; yu, g.; guo, y.; gabbard, j.; hix, d.; azad, a. f.; baker, s. c.; boyle, s. m.; khudyakov, y.; meng, x. j.; rupprecht, c.; vinje, j.; crasta, o. r.; czar, m. j.; dickerman, a.; eckart, j. d.; kenyon, r.; will, r.; setubal, j. c.; sobral, b. w. s. title: patric: the vbi pathosystems resource integration center date: - - journal: nucleic acids res doi: . /nar/gkl sha: doc_id: cord_uid: ulketgy the pathosystems resource integration center (patric) is one of eight bioinformatics resource centers (brcs) funded by the national institute of allergy and infectious diseases (niaid) to create a data and analysis resource for selected niaid priority pathogens, specifically proteobacteria of the genera brucella, rickettsia and coxiella, and corona-, calici- and lyssaviruses and viruses associated with hepatitis a and e. the goal of the project is to provide a comprehensive bioinformatics resource for these pathogens, including consistently annotated genome, proteome and metabolic pathway data to facilitate research into counter-measures, including drugs, vaccines and diagnostics.
the project's curation strategy has three prongs: ‘breadth first’ beginning with whole-genome and proteome curation using standardized protocols, a ‘targeted’ approach addressing the specific needs of researchers and an integrative strategy to leverage high-throughput experimental data (e.g. microarrays, proteomics) and literature. the patric infrastructure consists of a relational database, analytical pipelines and a website which supports browsing, querying, data visualization and the ability to download raw and curated data in standard formats. at present, the site warehouses complete sequences for bacterial and viral genomes. the patric website will continually grow with the addition of data, analysis and functionality over the course of the project. bioterrorism became an important national security issue ( ) following the deliberate release of anthrax spores into the us postal system in october ( ). meanwhile, emerging and reemerging infectious diseases ( ) have had profound effects on public health in many parts of the world. recognizing the pathogens responsible for these diseases as threats to homeland security, the national institute of allergy and infectious diseases (niaid) of the us national institutes of health has embarked upon a series of initiatives aimed at developing a comprehensive understanding of the organisms identified as niaid category a, b and c priority pathogens (for a complete list, see http://www .niaid.nih.gov/biodefense/bandc_priority.htm). the virginia bioinformatics institute's pathosystems resource integration center (patric) is one of eight bioinformatics resource centers (brcs) established to study the niaid priority pathogens and develop these information resources for the research community.
while database resources for bacterial (( ) and those cited in ( )) and viral ( , ) genomics have been available for a number of years, this project seeks to integrate genomics with comparative genomics and pathway analysis and ultimately proteomics, transcriptomics, immune epitope mapping, host response and other downstream technologies. the goal is to help researchers and clinicians better detect and respond to biothreat agents (and infectious diseases in general) by facilitating the development of diagnostics, vaccines and therapeutics. this requires access to comprehensive information on the molecular biology, physiology and pathogenicity of these organisms. patric is responsible for the eight organism categories listed in table . the three genera of proteobacteria are all intracellular pathogens that are known or potential biowarfare agents. in the s, brucella suis was the first infectious agent developed for use as a biowarfare agent by the united states. brucellosis, caused by brucella sp., is an important agricultural disease infecting cattle, sheep, goats and swine as well as humans. it is highly contagious and readily dispersed as an aerosol ( ). coxiella burnetii, the causative agent of q fever, is a highly infectious agent of relatively low lethality. its interest as a biowarfare agent stems from its high infectivity, stability to heat and desiccation and potential for aerosol dispersal. the genus rickettsia contains the organisms responsible for numerous types of typhus and arthropod-borne spotted fevers ( , ). rickettsia prowazekii was developed as a bioweapon by the ussr in the s and was used by the japanese in manchuria during world war ii ( ). the five categories of viruses studied by patric are all positive-strand ssrna viruses, with the exception of lyssaviruses, which have negative-strand ssrna genomes.
while there are no reports of any of these viruses being weaponized, they represent the causative agents for a number of emerging and reemerging diseases including severe acute respiratory syndrome (sars), rabies and transmissible gastroenteritis. recombinant vaccines for these viruses are either still in development or unavailable in areas where these infections are endemic or epidemic, compounding the public health risk. the pace of research on these organisms has increased significantly since the turn of the millennium, with outbreaks, such as that of sars in ( , ) , spawning a flurry of scientific activity. the widespread use of automated dna sequencing, microarray gene expression analysis and other high-throughput laboratory technologies has increased the volume of data produced, but not necessarily its accessibility. currently, significant genomics and bioinformatics expertise is required to extract, process and interpret this wealth of data. to address these problems, patric has created an interdisciplinary team of bioinformaticians, software engineers, computational biologists and organism experts to build a publicly accessible resource aimed at providing high quality, analyzed and curated data to the infectious disease community working on these pathogens. to date, we have achieved the following objectives: (i) collection and organization of existing genomic data for the eight pathosystems under a single, unified framework (ii) genome annotation and curation following standardized procedures (iii) visualization of raw data from analytical programs, as well as curated data (iv) creation of orthologous gene groups within each organism category allowing comparative analysis of gene content (v) prediction and visualization of bacterial metabolic pathways to complement functional analysis of proteins (vi) integration of online literature reviews from pathinfo ( ) for selected organisms. 
longer-term goals include integration of data from gene expression and proteomics experiments (including host response), predicted protein and rna secondary and tertiary structures, and well-cataloged literature compilations. ultimately, we hope our website will become an essential tool for researchers working on these pathogens and provide networking opportunities within the pathogen research communities. patric is implemented on oracle i rdbms using the genomics unified schema (gus) version . , developed at the computational biology and informatics laboratory at the university of pennsylvania (see http://www.gusdb.org). gus is used to store all sequence data and associated annotation with the exception of metabolic pathway data. the database is populated with all known full-length or nearly full-length genomic sequences for the eight organism categories listed in table . automated scripts query genbank ( ) daily to identify new or updated records. the corresponding sequences, annotation and associated literature are retrieved from ncbi and loaded following curatorial review to remove redundancies and assign unique names to each genome. refseq ( ) records are used when available to take advantage of their more thorough and consistent annotation. draft genome sequences from joint genome institute (jgi)/los alamos national labs (lanl) and the niaid-funded microbial sequencing centers will also be part of the patric dataset. in addition to genome sequences and primary annotation from the original genbank or refseq entry, the database stores the results of all automated and manual analyses described in the following section. our motivation to invest resources in sequence-level annotation is to maintain a high standard of quality over time. even when good reference annotation is available, there are many reasons to re-annotate microbial genomes ( ).
genbank data are of variable quality and there is a trend towards depositing draft genome sequences with no annotation at all. in-house annotation also allows us to present supporting evidence and keep the annotation up to date. this is of particular importance for alignment-based annotation since databases such as genbank ( ) and uniprot ( ) continue to grow at a prodigious rate. due to the large number of closely related genomes in each organism category, we have adopted an annotation strategy in which automated methods are applied to all genomes while detailed manual curation is applied to a limited number of reference genomes. the species b. suis , c. burnetii rsa and r. prowazekii str. madrid e were chosen as reference genomes for their respective categories. each viral category has (or will have) multiple reference genomes, representing phylogenetically diverse strains. automated nucleic acid and protein sequence annotation is accomplished using a java-based genome annotation pipeline (unpublished), which reads an xml script containing the names and parameters of the analytical applications. the bacterial pipeline executes the gene prediction programs glimmer ( ) and genemark ( , ) followed by start site correction programs rbsfinder ( ) and tico ( ). blastx ( ) searches the non-redundant protein database, complementing the ab initio gene prediction methods. rna genes are identified by trnascan-se ( ) and blastn searching against a ribosomal rna database ( , ). the annotation protocol containing the full list of applications and parameters is available online at https://patric.vbi.vt.edu/documents/ under 'standard operating procedures'. results of the genome analysis pipeline are merged with original genbank or refseq features for automated interpretation. a decision tree is used to classify genes into categories based on the level of agreement between the various prediction methods.
genes that are unambiguously predicted by multiple methods are automatically 'finalized', creating new 'gene', 'cds' and/or '[t/r]rna' features. the remaining genes are marked for manual curation. for viral genomes, an abbreviated pipeline is executed that emphasizes sequence alignment for gene identification and employs genemarkhmm optimized for mammalian (host) genomes. after curatorial review, finalized protein-coding (cds) features are translated and subjected to another pipeline executing interproscan and structure prediction methods such as memsat ( ). currently, each protein is associated with go terms ( ), tigrroles, enzyme commission numbers ( ) based on pfam ( ) and tigrfam alignments (for a description of tigrfam and tigrroles, see: http://www.tigr.org/tigrfams/). the protocol for automated proteome annotation is also available online. manually curated protein sequences will be available in early . once protein sequences are inferred from each genome in an organism category, putative ortholog groups are generated using blastp for all pairwise genome combinations and applying the conventional bidirectional-best-hit (bbh) criterion ( ). while putative ortholog groups within the bacterial categories are generally well defined, many viral proteins cannot be readily clustered using the stringent bbh criterion. this is an active area of curation. using the ortholog groups as a starting point, a reference protein list is created for each bacterial category consisting of the proteins of the reference genome (each representing one ortholog group) plus a representative protein from each ortholog group identified in the associated genomes. a gene occurring in only a single genome constitutes a 'group' of one and would be included in the reference list. the reference protein lists will be manually curated and include, whenever possible, detailed functional descriptions, gene symbols, go terms and ec numbers.
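the bidirectional-best-hit criterion mentioned above can be sketched in a few lines. this is an illustration only, not the patric pipeline: it assumes blastp results have already been reduced to (query, subject, bit score) tuples for each direction of a genome pair, and the protein identifiers and scores are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query_id, subject_id, bit_score) tuples.
    On a tie, the hit seen first is kept (a simplifying assumption).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    # Hypothetical hits: genome A proteins searched against genome B, and back.
    a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
              ("a2", "b2", 180.0), ("a3", "b2", 200.0)]
    b_vs_a = [("b1", "a1", 245.0), ("b2", "a3", 198.0), ("b2", "a2", 175.0)]
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

in this toy input a2 also finds b2 as its best hit, but b2 prefers a3, so a2 is left unpaired; this is the stringency that makes divergent viral proteins hard to cluster with bbh.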
thus, every protein in the database will either be manually curated or be linked to an ortholog group member that has been manually curated. the ortholog groups are further processed to create multiple sequence alignments (msas) using muscle ( ) with default parameters. phylogenetic estimations using the neighbor-joining method ( ) were created based on trimmed alignments using phylip ( ) . trees were validated by bootstrapping ( ) using a minimum of replicates. to help users understand the function of the bacterial proteins in context, we have adopted the pathway tools system ( ) to derive pathways from genome annotation and to fill potential gaps in annotation known as pathway holes. the system takes a list of protein names, descriptions and ec numbers as input. proteins with ec numbers can be assigned roles directly; the roles of other proteins are suggested by lexicographic analysis of descriptive information and/or analysis of gene order from homologous regions of related genomes and confirmed or rejected by the curation staff. the output is a database with integrated web server that allows users to browse and query the organism's metabolic pathways. this system has been integrated with the patric web site, allowing users to access pathway information for all bacterial reference genomes. the current analysis was based on preexisting refseq or genbank annotations; later releases will incorporate data curated in house, unifying the genomic and pathway versions of the data. the analysis of pathways can facilitate the identification of metabolic choke points, critical enzymes that could be targeted by drugs that may have valuable antimicrobial properties. pathway analysis can also yield clues to pathogenesis by comparing virulent and avirulent strains and examining the roles of genes not present in both strains. the patric website is hosted on a sun microsystems v z server running suse linux using the apache web server. 
applications are written in php and perl, accessing data from an oracle i server hosted on a sun microsystems e running sun os. the conceptual organization of the website is described in figure . the website's home page contains news, a navigation bar and the list of patric organisms. users can select their organism of interest from the list to access the corresponding organism category page. this page contains a table of genomes currently in our database with links to the three principal representations of individual genomes: the genome summary, genome browser and gene table. these pages allow users to view a summary of genome sequencing information and to identify specific genes and link to their corresponding gene, protein and pathway information pages. the gene information page displays the output of sequence analysis software run by the annotation pipeline, as well as curated data. similarly, the protein information page displays interproscan and tigrfam alignments and associated information such as go terms and ec numbers. for bacterial genomes, the pathway information page illustrates the protein's position in the organism's metabolic network and links to a wealth of information provided by pathwaytools. the organism category page also contains links to a pathogen summary, ortholog group table and a phylogenetic tree based on s rrnas for bacteria or a selected protein family for viruses. for bacterial genomes, detailed pathosystem information is available, provided by the vbi pathinfo documents ( ).

figure . conceptual map of patric website. arrows show the relationship between the principal datatype on a page and related data on neighboring pages. solid arrows represent 'drilling down' to more specific information (e.g. from genome to gene). dashed arrows represent links between different views of conceptually similar data (e.g. between ortholog group and phylogenetic tree). this figure represents only a subset of the pages and links on the actual website.

the ortholog group table shows the presence or absence of reference gene list proteins for each organism in the organism category and provides links to an msa and tree viewer and the base-by-base msa editor ( ) for every ortholog group. base-by-base allows users to add sequences to the msa, recalculate it using clustal ( ), t-coffee ( ) or muscle and generate the corresponding tree using neighbor-joining or a number of clustering algorithms. the patric website also supports analytical and query tools. a database search page allows user-supplied sequences to be blasted against reference and curated sequences from patric organisms. the page also supports mummer ( ) comparisons between genomes in the database or with a user-supplied sequence. a query tool is available throughout the site by which users can retrieve genes by name, id, description, as well as go and ec identifiers and descriptions. questions, comments and suggestions concerning the website and its contents may be submitted via the 'feedback' page, accessible from the menu bar. the patric database is hosted at the virginia bioinformatics institute at virginia tech and can be accessed via web browser at https://patric.vbi.vt.edu. sequences and annotation in gff format (see http://song.sourceforge.net/gff .shtml) can be downloaded by following the 'downloads' link on the main menu bar. gff files are also available through brc-central at: http://brc-central.org. this paper presents the first detailed description of the patric website. future development will advance on several fronts. genome and proteome curation will continue, complemented by improved tools for query, analysis and visualization. for viruses, we will transition to the more widely accepted ictv taxonomy ( ). the website's user interface is being enhanced to integrate organism-, tool/task- and data-centric approaches to data access, allowing users more efficient and effective access to patric resources.
this will be followed up by prioritized curation targeted at potential drug and vaccine targets, virulence factors and genes with differential representation or polymorphisms associated with clinically significant phenotypes. leveraging another niaid-funded vbi project, the administrative resource for biodefense proteomics research (http://www.proteomicsresource.org/), we plan to integrate expression profiling and proteomics data from pathogen and host to better understand the pathosystem's biology and help the community identify targets for counter-measures. the integration of these disparate data types into a single, easy-to-use system is a goal that we anticipate will enable pathogen researchers to make full use of available data to develop diagnostics, vaccines and therapeutics. biodefence on the research agenda investigation of bioterrorism-related anthrax the challenge of emerging and re-emerging infectious diseases the comprehensive microbial resource ) xbase, a collection of online databases for bacterial comparative genomics new bioinformatics tools for viral genome analyses at viral bioinformatics-canada virgen: a comprehensive viral genome resource bichat guidelines for the clinical management of brucellosis and bioterrorism-related brucellosis the past and present threat of rickettsial diseases to military medicine and international public health rickettsial pathogens and their arthropod vectors principles of the malicious use of infectious agents to create terror: reasons for concern for organisms of the genus rickettsia outbreak of severe acute respiratory syndrome-worldwide aetiology: koch's postulates fulfilled for sars virus piml: the pathogen information markup language the pathway tools software ncbi reference sequence (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins the past, present and future of genome-wide re-annotation the universal protein resource (uniprot): an expanding universe of protein
information improved microbial gene identification with glimmer genemark.hmm: new solutions for gene finding genmark: parallel gene recognition for both dna strands a probabilistic method for identifying start codons in bacterial genomes tico: a tool for improving predictions of prokaryotic translation initiation sites gapped blast and psi-blast: a new generation of protein database search programs trnascan-se: a program for improved detection of transfer rna genes in genomic sequence the european ribosomal rna database the comparative rna web (crw) site: an online database of comparative sequence and structure information for ribosomal, intron, and other rnas a model recognition approach to the prediction of all-helical membrane protein structure and topology the gene ontology (go) database and informatics resource the enzyme database in the pfam protein families database a genomic perspective on protein families muscle: a multiple sequence alignment method with reduced time and space complexity the neighbor-joining method: a new method for reconstructing phylogenetic trees phylogenetic analysis using phylip confidence limits on phylogenies: an approach using the bootstrap base-by-base: single nucleotide-level analysis of whole viral genome alignments multiple sequence alignment with the clustal series of programs t-coffee: a novel method for fast and accurate multiple sequence alignment versatile and open software for comparing large genomes international committee on taxonomy of viruses and the , unassigned species we would like to thank chris upton for making the application base-by-base ( ) available for incorporation into our website and to peter karp for making a similar contribution with his pathway tools software ( ) . this work is funded through niaid contract hhsn c to bruno sobral. funding to pay the open access publication charges for this article was provided by niaid contract hhsn c to bruno sobral. conflict of interest statement. none declared. 
key: cord- -db tze j authors: chkuaseli, tamari; white, k andrew title: activation of viral transcription by stepwise largescale folding of an rna virus genome date: - - journal: nucleic acids res doi: . /nar/gkaa sha: doc_id: cord_uid: db tze j the genomes of rna viruses contain regulatory elements of varying complexity. many plus-strand rna viruses employ largescale intra-genomic rna-rna interactions as a means to control viral processes. here, we describe an elaborate rna structure formed by multiple distant regions in a tombusvirus genome that activates transcription of a viral subgenomic mrna. the initial step in assembly of this intramolecular rna complex involves the folding of a large viral rna domain, which generates a discontinuous binding pocket. next, a distally-located protracted stem-loop rna structure docks, via base-pairing, into the binding site and acts as a linchpin that stabilizes the rna complex and activates transcription. a multi-step rna folding pathway is proposed in which rate-limiting steps contribute to a delay in transcription of the capsid protein-encoding viral subgenomic mrna. this study provides an exceptional example of the complexity of genome-scale viral regulation and offers new insights into the assembly schemes utilized by large intra-genomic rna structures. positive-strand rna viruses comprise a large group of agriculturally and medically important pathogens that infect a wide range of hosts. the successful takeover of their hosts requires multiple steps that involve precise regulation and careful coordination. a critical component of this control is the modulation of different viral processes by rna sequences and structures located within viral genomes ( ) ( ) ( ) ( ). in some cases, the rna-based regulation is mediated by large functional rna folds, some of which span the entire length of a viral genome ( ).
accordingly, overall rna genome architecture and dynamics can contribute significantly to the orchestration of different phases that occur during viral infections ( , ). notably, this large-scale form of riboregulation is employed by many significant plant and animal messenger-sensed rna viruses, including luteoviruses ( , ), carmoviruses ( , ), umbraviruses ( , ), flaviviruses ( ) ( ) ( ) ( ), hepacivirus ( ) ( ) ( ) ( ) ( ) ( ) ( ) and coronaviruses ( ) ( ) ( ). tombusviruses (family tombusviridae) are important model plus-strand rna viruses ( ). studies performed on members of this genus have resulted in pioneering discoveries ( ) ( ) ( ) and led to significant progress in the identification of pro- and antiviral host factors ( ) ( ) ( ). tombusviruses have also been invaluable for investigating how global viral rna genome structure actively controls essential viral processes ( , ). their . kb-long coding-sensed ssrna genomes contain a vast network of intragenomic, base pair-mediated, long-distance rna-rna interactions (ldris) that play different critical roles during the viral reproductive cycle ( ). in particular, two tombusviruses, the prototype of the genus, tomato bushy stunt virus (tbsv), and the closely-related carnation italian ringspot virus (cirv) (figure a), have been instrumental in deducing the structure and function of this complex ldri network ( ). tombusvirus rna genomes are not -capped or polyadenylated, thus they rely on an unconventional mode of translation, which has been studied extensively in cirv ( ) ( ) ( ) ( ). an rna structure in cirv's -untranslated region ( utr), termed the -cap independent translation enhancer ( cite), binds to eukaryotic translation initiation factor f (eif f). the eif f-bound cite then simultaneously base-pairs with the utr via an ldri, which positions eif f near the -end of the genome, where it mediates ribosome recruitment ( ) (figure a).
this results in translation of the auxiliary rna replication protein, p . production of the p rna-dependent rna polymerase (rdrp) requires translational readthrough of the p stop codon. this recoding event involves an extended rna stem-loop (sl) structure, termed the readthrough sl (rtsl), located immediately to the p termination codon, uag ( figure a , green asterisk). the rtsl is not able to direct readthrough on its own and, to function, requires the formation of an ldri between a bulged sequence in rtsl (the proximal readthrough element, prte) and a complementary sequence (the distal readthrough element, drte) in the utr of the genome ( ) ( figure a ). this ldri not only promotes readthrough, it also concomitantly inhibits genomic minus-strand rna synthesis, which would interfere with translation. thus, the rtsl, via an ldri, functions as a dual regulator that coordinates translational recoding and genome replication. ldris are also involved in controlling the production of tombusvirus subgenomic (sg) mrnas, which are small virus genome-derived mrnas that are transcribed by the viral rdrp during infections ( , ) . structurally, sg mrnas are -coterminal with the viral genome, while their -ends map to internal regions. consequently, they encode -proximal orfs that are translationally silent within the context of the full-length genome. by modulating sg mrna transcription, the virus is able to control the amount and timing of viral protein production during infections. tombusviruses transcribe two sg mrnas ( figure a ). the smaller sg mrna is transcribed earlier during infections, and mediates translation of both the p suppressor of gene silencing and the p cell-to-cell movement protein. the larger sg mrna is transcribed later in infections, and encodes the capsid protein (cp) ( , ) . tombusviruses ( ) , nodaviruses (family nodaviridae) ( ) and toroviruses (family tobaniviridae) ( ) transcribe their sg mrnas using a premature termination mechanism ( ) .
in this process, the viral rdrp terminates transcription prematurely while synthesizing a minus-strand from a full-length plus-strand viral rna genome. the stalling of the rdrp occurs when it encounters an rna element within the genome called an attenuation structure. this termination event leads to the production of a -truncated minus-sense rna species that possesses a promoter sequence at its -end ( figure a , pr). the promoter is then recognized by the viral rdrp, which transcribes the coding-sense sg mrnas from the truncated intermediate. the attenuation structures that block the progression of rdrps are helical rna structures that are located ∼ - nt upstream from where the copying rdrp stalls. in some viruses, the inhibitory stem is formed by ldris ( , ) . production of tombusvirus sg mrna involves two sets of ldris. one occurs between activator sequence (as ) and receptor sequence (rs ), spanning ∼ nucleotides ( ) , and the other involves distal element (de) and core element (ce), traversing ∼ nucleotides ( ) ( figure a , turquoise and brown). when viewed in the context of the rna secondary structure model for the tbsv genome ( ) , the de/ce interaction corresponds to the closing stem of a sizable rna domain, termed large domain (ld ), which, along with formation of the adjacent ld , acts to unite the as and rs sequences ( figure b) . efficient sg mrna transcription requires an ldri between as and rs , which spans ∼ nucleotides (figure a , red) and forms a helix just three nucleotides upstream of the sg mrna initiation site (supplementary figure s ) ( ) . the nt long as sequence is the terminal loop of an rna hairpin, designated as -sl, that facilitates its accessibility (supplementary figure s ) ( ) . the as /rs interaction has been verified experimentally to (i) pair and operate in the plus-strand of the genome, (ii) occur in cis and (iii) promote production of sg mrna minus-strand intermediates ( ) .
the as /rs interaction is also predicted to form the closing helix of ld ( figure b) , thus accurate folding of ld is proposed to be important for formation of the as /rs ldri ( , ) . in this study, we show that the attenuation structure for sg mrna is far more complex than previously appreciated, with the as /rs interaction being a component of a group of critical ldris. unexpectedly, the active rna structure includes the recoding rna element, rtsl, as well as specific subsections of ld . formation of a functional attenuation structure requires multiple ldris within ld that generate a discontinuous binding site for rtsl. the docking of rtsl into this binding pocket acts as a linchpin that stabilizes an active conformation of the rna complex. functional and structural aspects of these ldris are discussed and a likely path for assembly of this intragenomic rna attenuation structure is presented. nucleotide substitutions were introduced into a cloned cdna copy of the full-length wt cirv genome ( ) through standard pcr-based site-directed mutagenesis. each of the mutated cirv clones was sequenced over the entire inserted pcr fragment containing the modification to confirm that only the desired change was present. smai-linearized wild type (wt) and mutant full-length cirv genome cdnas were used as templates for in vitro transcription reactions using a t flashscribe transcription kit (cellscript) to synthesize uncapped genomic cirv rnas, as described previously ( ) . in vitro-generated viral genomic rnas ( . pmol) were assessed for translation and readthrough using a wheat germ extract (wge) in vitro translation system (promega) and proteins were monitored by incorporation of [ s]-methionine, as described previously ( , ) . translation products were separated by % sodium dodecyl sulfate polyacrylamide gel electrophoresis, detected using a typhoon fla variable mode imager (ge healthcare), and quantified using quantityone software (bio-rad). 
each wge translation experiment was performed three times independently and averages and standard errors of the mean (sem) were calculated. readthrough levels were calculated as a ratio of the amount of p readthrough product relative to that of its corresponding p pre-readthrough product, with the ratio for the wt genome set as % ( , ) . production of genomic and subgenomic cirv rnas was assessed after protoplast infections, as described previously ( ) . protoplasts were prepared from the cotyledons of day old cucumber plants. for each viral rna genome tested, ∼ protoplasts were transfected with g of cirv transcript using polyethylene glycol and cacl ( ) . transfected protoplasts were incubated under constant fluorescent light at °c for h. total nucleic acids were extracted and separated by agarose gel electrophoresis and plant s ribosomal rna bands were monitored as controls to ensure even loading. total nucleic acids were then transferred to a nylon membrane and plus-strand viral rna accumulation levels were assessed using a [γ- p]-labeled oligonucleotide probe complementary to the -end of the cirv genome and subgenomic mrnas (coordinates - ). northern blots were imaged using a typhoon fla and rna bands were quantified using the quantityone software. relative sg mrna levels were calculated as the ratio of sg mrna levels to their cognate genome levels, with the wt ratio set to %. each set of protoplast transfections was carried out three times independently and averages and sem values were calculated. minus-strand viral rna accumulation was analyzed as described previously ( ) . briefly, total nucleic acids isolated from protoplast infections were denatured with dimethyl sulfoxide and glyoxal and separated by agarose gel electrophoresis in mm sodium phosphate buffer (ph . ).
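The two normalized ratios described above (readthrough product over pre-readthrough product, and sg mrna over its cognate genome, each expressed relative to wt) follow the same arithmetic. The sketch below illustrates that arithmetic under stated assumptions: the function names are hypothetical, the band intensities are made-up numbers, and normalizing each replicate to the wt lane of the same experiment is an assumption, since the text does not say whether normalization was done per experiment or on pooled values.

```python
from math import sqrt
from statistics import mean, stdev


def relative_levels(test_pairs, wt_pairs):
    """Return one percentage per experiment.

    Each pair is (target_band, reference_band), e.g. (sg mrna, genome) or
    (readthrough product, pre-readthrough product). The wt ratio from the
    same experiment is taken as 100% (assumed per-experiment normalization).
    """
    if len(test_pairs) != len(wt_pairs):
        raise ValueError("need one wt pair per test pair")
    percentages = []
    for (t_target, t_ref), (w_target, w_ref) in zip(test_pairs, wt_pairs):
        test_ratio = t_target / t_ref
        wt_ratio = w_target / w_ref
        percentages.append(100.0 * test_ratio / wt_ratio)
    return percentages


def mean_and_sem(values):
    """Mean and standard error of the mean (sample SD / sqrt(n))."""
    if len(values) < 2:
        raise ValueError("SEM needs at least two replicates")
    return mean(values), stdev(values) / sqrt(len(values))


if __name__ == "__main__":
    # Hypothetical band intensities from three independent experiments.
    mutant = [(120.0, 1000.0), (150.0, 1100.0), (90.0, 950.0)]
    wt = [(600.0, 1000.0), (640.0, 1050.0), (580.0, 990.0)]
    avg, sem = mean_and_sem(relative_levels(mutant, wt))
    print(f"relative level: {avg:.1f}% +/- {sem:.1f}% (SEM, n=3)")
```

Dividing by the reference band before comparing with wt is what corrects for differences in overall accumulation or loading between lanes.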
northern blotting, imaging, and data analysis were performed as described earlier, except that an [α- p]-utp-labeled riboprobe, corresponding to the -end of cirv cdna (coordinates - ), was used for detection. dna fragments of rtsl and ld and their derivatives were generated by standard pcr that incorporated a t promoter upstream of the -ends of the rna-encoding region. individual, or mixtures of, in vitro-transcribed rna fragments ( pmol each in . l of water) were heated at °c for min, then combined with . l rna binding buffer (final concentration: mm hepes ph . , mm mgcl , mm kcl, . % glycerol) ( , , ) . the tubes were placed at °c for min and snap-cooled on ice for min. an equal volume of sterile % glycerol was added to each sample and the entire contents were separated by nondenaturing % (or %) polyacrylamide gel electrophoresis in a running buffer containing mm tris, ph . , mm boric acid and mm mgcl ( ) . gels were then stained with mg/ml ethidium bromide ( ) , imaged using a typhoon fla scanner, and rna bands were quantified using the quantityone software. relative binding efficiencies were determined by quantifying the amount of shifted ld or ld -core by comparing their levels in ld -only or ld -core-only lanes with their corresponding unbound levels in mixtures with rtsl. thus, relative binding efficiency is presented as a percentage of shifted ld or ld -core. each emsa experiment was conducted three times independently, with averages and sems provided. in vitro-generated rna transcripts of wt ld -core and wt rtsl were purified using two cycles of the crush-soak rna purification method ( ) . purified transcripts were then dephosphorylated using calf-intestinal phosphatase (neb) and -end labeled using [γ- p]-atp and t polynucleotide kinase (neb). end-labeled transcripts were recovered by g- column chromatography and used for in-line reactions that were carried out at °c for hours in x in-line reaction buffer ( mm tris-hcl ph . , mm kcl, mm mgcl ) ( , ) .
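The relative binding efficiency described above for the emsas can be read as a depletion measurement: the shifted fraction is inferred from how much of the free ld (or ld -core) band disappears when rtsl is added. The sketch below shows one way to compute that percentage. It is an interpretation of the description, not the authors' script; the function name and the intensities are hypothetical, and clamping negative depletion to zero is an added assumption to absorb quantification noise.

```python
def percent_shifted(free_only_lane, unbound_in_mixture):
    """Percentage of the ld fragment shifted into the complex.

    free_only_lane:     band intensity of the fragment run alone.
    unbound_in_mixture: intensity of the same (unshifted) band when the
                        fragment is mixed with rtsl.
    """
    if free_only_lane <= 0:
        raise ValueError("free-only lane intensity must be positive")
    # Assumption: treat apparent negative depletion as no detectable shift.
    depleted = max(free_only_lane - unbound_in_mixture, 0.0)
    return 100.0 * depleted / free_only_lane


if __name__ == "__main__":
    # Hypothetical intensities: 200 units alone, 50 units left unbound.
    print(f"{percent_shifted(200.0, 50.0):.0f}% shifted")
```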
reactions contained labeled fragments individually (∼ pmol) or as a mixture with their unlabeled partner fragment ( pmol) ( , ) . labeled fragments were also used to generate untreated controls, as well as size ladders generated by alkaline hydrolysis or rnase t digestion. all samples were separated in % denaturing polyacrylamide gels ( ) and imaged and quantified as described in the previous sections. in-line probing was performed twice, with consistent results. reactivities were used to generate an in-line-guided secondary structure model for ld -core, rtsl or a complex of both as described in supplementary figures s -s . rna secondary structures presented were generated using rna drawer software ( ). shape analysis of the ld region of the cirv genome was performed using -methyl- -nitroisatoic anhydride ( m ), as described previously ( ) . four primers were used to map the secondary structure of the ld (primer coordinates in the cirv genome: - , - , - and - ). following fluorescent capillary sequencing, the raw data was analyzed using the shapefinder software ( ) to generate relative reactivities at single nucleotide resolution. shape reactions were performed twice for each of the four primers and average reactivities were used. the reactivity data was normalized against the average of the ten highest reactivity values, as described previously ( ) . the rnastructure web server was used to combine shape reactivity data with thermodynamic prediction to generate a secondary structure model of ld as described in supplementary figure s ( ) . rna secondary structures presented were generated using rna drawer software ( ) . translational readthrough for the cirv genome requires a long-distance rna-rna interaction (ldri) between rtsl and the utr, involving the prte and drte partner sequences, respectively ( figure a , b) ( ) . 
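The shape data handling described in the methods above (averaging two replicate reactions per primer, then normalizing against the average of the ten highest reactivity values) can be sketched as follows. This is a minimal illustration with hypothetical function names and made-up reactivities; it does not reproduce the shapefinder or rnastructure processing, and it assumes the replicates are already aligned position by position.

```python
def average_replicates(rep_a, rep_b):
    """Position-wise mean of two replicate reactivity profiles."""
    if len(rep_a) != len(rep_b):
        raise ValueError("replicates must cover the same positions")
    return [(a + b) / 2.0 for a, b in zip(rep_a, rep_b)]


def normalize_to_top(reactivities, top_n=10):
    """Divide every value by the mean of the top_n highest values."""
    if len(reactivities) < top_n:
        raise ValueError("need at least top_n reactivity values")
    highest = sorted(reactivities, reverse=True)[:top_n]
    scale = sum(highest) / top_n
    if scale <= 0:
        raise ValueError("mean of the highest values must be positive")
    return [r / scale for r in reactivities]


if __name__ == "__main__":
    # Hypothetical raw reactivities for 20 nucleotides, two replicates.
    rep1 = [0.1 * i for i in range(1, 21)]
    rep2 = [0.1 * i + 0.05 for i in range(1, 21)]
    normalized = normalize_to_top(average_replicates(rep1, rep2))
    print([round(v, 2) for v in normalized])
```

After this scaling the ten most reactive positions average 1.0, which puts profiles from different primers on a comparable scale before they are combined.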
to investigate the possible involvement of other regions of the rtsl in the readthrough process, silent nucleotide substitutions were introduced into its terminal loop (mutants tc- and tc- ) and closing base pair (tc- ) (figure a ). in vitro translation of cirv genomes containing these modifications showed that readthrough production of p was similar to wt, or moderately affected (∼ % to ∼ %) ( figure b ). however, northern blot analysis of protoplasts transfected with the same mutant viral genomes revealed an unanticipated role for the terminal loop of rtsl (rtsl-tl) in facilitating the accumulation of sg mrna . in these infections, sg mrna levels were quantified relative to the corresponding levels of their cognate genomes, with that for wt set at %. both terminal loop substitutions resulted in a ∼fivefold decrease in relative sg mrna accumulation, whereas alteration in the loop's closing base pair yielded wt levels ( figure c) . notably, the negative effects of the rtsl-tl mutants were specific for sg mrna , as typical levels of sg mrna were maintained. also, because the modifications introduced were not present in sg mrna , altered rna stability was ruled out as a cause for the observed decreases. instead, the results indicated a role for rtsl-tl in regulating the transcriptional efficiency of sg mrna . modulation of sg mrna transcription by rtsl-tl could occur by it interacting with a protein factor or a complementary rna sequence in the cirv genome. as tombusviruses are known for controlling important viral processes via intra-genomic rna-rna interactions, the latter possibility was deemed more probable ( , , , , , ) . to regulate sg mrna transcription, rtsl-tl would likely have to interact with a sequence located near the initiation site for sg mrna transcription. 
in tombusviruses, this initiation site is positioned just downstream from the transcription-promoting as /rs interaction (supplementary figure s ), which forms the closing stem of the rna domain ld , as shown for tbsv ( figure b ) ( , ) . corresponding structure probing analysis ( ) of the cirv genome predicted a comparable as /rs -containing ld ( figure d and supplementary figure s a ) that structurally mimicked that in tbsv (supplementary figure s b ). also, the cirv as /rs ldri was shown, as demonstrated previously for tbsv ( ) , to be necessary for sg mrna transcription (supplementary figure s ). cirv's ld was examined for a potential base-pairing partner for rtsl-tl, and a candidate nt long segment was identified nt upstream from the transcription initiation site for sg mrna . this sequence was present within a predicted rna hairpin structure, sl , located within the -end of the p orf, some ∼ nt away from rtsl-tl ( figure d , green). its partner sequence, rtsl-tl, was present in the terminal loop of rtsl and extended into the adjoining -stem region (figure a , green); thus, for the rtsl/sl interaction to occur, the helical region of rtsl-tl would need to unpair. similarly, to associate with rtsl-tl, the complementary partner sequence in sl , comprising the -half of this hairpin (herein termed sl - , green), would have to unpair from its -half (sl - , pink) ( figure e) . a potential base pairing partner for the displaced sl - was also identified that mapped to the half of as -sl, termed as -sl (pink) ( figure f ). consequently, the binding of rtsl-tl to sl - (green interaction) could be accompanied by an intra-ld interaction (pink) (supplementary figure s a) , both of which (in addition to the as -rs interaction) were supported by comparative sequence analysis showing maintenance of the base pairing, despite sequence variations (supplementary figure s b) . 
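The comparative sequence argument above rests on a simple check: two partner segments still pair when aligned antiparallel, even where their sequences differ between viruses. The sketch below counts pairable positions between two segments, allowing watson-crick and g-u wobble pairs. The sequences are hypothetical placeholders, not the actual rtsl-tl, sl or as -sl sequences, and counting pairs says nothing about helix stability.

```python
# Watson-Crick pairs plus the g-u wobble pair.
PAIRS = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g"), ("g", "u"), ("u", "g")}


def paired_positions(segment, partner):
    """Count pairable positions between two equal-length rna segments.

    Both segments are written 5' to 3'; the partner is reversed so that
    the two strands are compared in antiparallel orientation.
    """
    segment, partner = segment.lower(), partner.lower()
    if len(segment) != len(partner):
        raise ValueError("segments must have the same length")
    return sum((x, y) in PAIRS for x, y in zip(segment, partner[::-1]))


if __name__ == "__main__":
    # Hypothetical 5-nt partners: a single substitution in one strand
    # breaks a pair, and a matching change in the partner restores it.
    print(paired_positions("gcuac", "guagc"))  # original partners
    print(paired_positions("gcucc", "guagc"))  # one strand changed
    print(paired_positions("gcucc", "ggagc"))  # both strands changed
```

The same comparison underlies compensatory mutational analysis: substitutions on one side should reduce the count, and covarying substitutions on both sides should bring it back.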
the binding of rtsl-tl to sl - was investigated functionally by introducing compensatory nucleotide substitutions into the candidate partner sequences and assessing the effects on sg mrna accumulation following transfection of mutant viral rna genomes into protoplasts. disrupting pairing potential in mutants tc- and tc- diminished sg mrna plus- and minus-strand levels below ∼ % of wt, while regenerating pairing capacity with alternate nucleotides in mutant tc- restored levels up to ∼ - % of wt ( figure b, c) . this correlation between base pairing stability and sg mrna accumulation is consistent with a role for the interaction in mediating transcription of sg mrna . notably, the low levels of accumulation of the intermediate minus-strand sg mrna templates in tc- and tc- indicated that disrupting the rtsl-tl/sl - interaction hindered proper formation of the rdrp attenuation structure for sg mrna ( figure c ). similar results were observed when comparable mutational analysis was performed to assess the proposed as -sl /sl - interaction ( figure d-f) . thus, in addition to as /rs , two other ldris, rtsl-tl/sl - and as -sl /sl - , are critical for generating an effective rdrp attenuation structure. since formation of the intra-ld interaction (pink) would free sl - for binding to rtsl-tl (green) (supplementary figure s a) , the probable order of these interactions would be the former followed by the latter. the organisation of ld includes two subdomains, ld -sub and ld -sub , which have closing stems (s , orange, and s , blue, respectively) that are proximal to the sequences involved in the rtsl-tl/sl - and as -sl /sl - interactions ( figure g ). these closing stems, which are maintained in the genus (supplementary figure s ) , could therefore influence formation of the latter-mentioned interactions.
to address this question, compensatory mutational analysis was performed on s and s , which yielded results supporting the importance of their helical stability (supplementary figure s ) . accordingly, the closing stems of both ld subdomains also contribute to the assembly of an effective rdrp attenuation structure. this allowed for approximate delineation of a core region of functional importance at the base of ld ( figure g , black dashed line). having obtained in vivo genetic evidence for the rtsl-tl/sl - interaction, we next sought physical support for this pairing event. to achieve this, fragments of rtsl ( nt) and ld ( nt) ( figure a ) that contained the same compensatory mutations in rtsl-tl and sl - that were tested earlier ( figure a ) were used in rna-rna electrophoretic mobility shift assays (emsas) ( figure b ). incubation of wt fragments of rtsl and ld led to the formation of an rna-rna complex, observed as an upward shift of the ld fragment ( figure b , compare lane with ). combinations of fragments in which the rtsl-tl/sl - interaction was destabilized diminished shifting, while restoration of pairing regenerated the shift (figure b, lanes & and lane , respectively) . thus, formation of an rtsl/ld complex is dependent on the rtsl-tl/sl - (green) interaction. the demonstration of an in trans interaction in vitro raised the possibility that the same could be true during viral infections. to address this prospect, virus genome mutants tc- and tc- ( figure a) , each of which was unable to form the rtsl-tl/sl - interaction in cis, but could potentially form it between each other in trans, were co-transfected into protoplasts. levels of sg mrna in the co-transfection (tc- +tc- ) were similar to those for the individual transfections, i.e. ∼ %, and well below the ∼ % observed for the compensatory mutant tc- ( figure c, lanes - ) . 
as a further test, a small noncoding cirv genome-derived rna replicon, di , containing a wt rtsl was utilized ( figure d, lower section) . di replicates only in the presence of the cirv genome, which provides rdrp for its reproduction ( ) . thus, di amplification is limited to cells occupied by both the replicon and the cirv genome. co-transfection of di and tc- (containing a mutated rtsl-tl and a wt sl - compatible with the wt rtsl in di ) resulted in high levels of accumulation of both viral rnas ( figure c, lane ) . however, despite robust co-accumulation, no increase in sg mrna levels was observed ( figure c , compare lane with lane ). collectively, these results indicate that during cirv infections, the rtsl-tl/sl - interaction occurs as an intra-genomic event. three key interactions are involved in efficient formation of the rdrp attenuation structure, rtsl-tl/sl - (green), as -sl /sl - (pink) and as /rs (red). to investigate the order in which these binding events occur, additional rna-rna emsas were performed (figure ). when the as /rs interaction corresponding to the closing stem of ld ( figure a , red) was assessed via compensatory mutations ( figure b ), the results indicated its requirement for rtsl-tl binding to sl - ( figure c, lanes to ) . interestingly, disruption of as /rs in the ld fragment led to a slight decrease in its mobility, suggesting a more open conformation, consistent with as /rs s role in stabilizing the basal region of this large rna domain ( figure c , compare lanes and with lanes and ). a dependence on the as -sl /sl - (pink) interaction was also observed, as rtsl/ld complex formation was inhibited by the cu mismatch in mutant f ( figure d , e lane ). in contrast, the ag mismatch in f allowed for complex formation ( figure d , e lane ). in the cirv genome, this modification led to strong inhibition of sg mrna levels in protoplast transfections ( figure d -f). 
the differing results observed for the ag mismatch in the emsa are likely the consequence of this common noncanonical base pair being less destabilizing under the higher salt conditions of the assay. notwithstanding, inhibition of complex formation with the cu mismatch and its recovery with the au pair indicate that the as -sl /sl - (pink) interaction is indeed required for rtsl-tl binding to sl - ( figure d , e). these results, when considered along with the requirement for partner sequence accessibility and proximity, support the following sequential order for the formation of the three critical interactions. the as /rs (red) interaction would occur first and position as -sl proximal to sl - . next, formation of the as -sl /sl - (pink) interaction would concurrently liberate sl - (green). lastly, pairing of sl - with rtsl-tl (green) would complete assembly of the attenuation structure.
figure a . positions of positive-sense genome, sg mrnas, and di are indicated on the left and right of the blot. average sg mrna accumulation levels relative to that of the wt are provided below the blot with standard errors obtained from three independent experiments. (d) schematic depiction of the co-transfection involving mutant tc- cirv genome (top) and wt di (bottom, horizontal black bars correspond to regions of the viral genome present in di ). tc- modifications (two burgundy asterisks) in rtsl prevent formation of an intragenomic rtsl-tl/sl - ldri (green double-headed arrow with a black x). the di rna contains wt rtsl in its sequence that, when co-inoculated with tc- , can potentially base pair in trans with the wt sl - sequence in tc- (the curved green arrow connects di 's wt rtsl and tc- 's wt sl ).
infections with cirv genome mutants revealed that efficient activation of sg mrna transcription required the basal region of ld , bounded by as /rs , s and s ( figure g, black dotted line) . to determine if more distal
sequences or structures in ld -sub or ld -sub were required for rtsl-tl binding, a nt-long rna fragment containing only the core region of ld was constructed ( figure a , ld -core). in ld -core, the subdomain sequences beyond s and s were replaced with ultra-stable uncg-type tetraloops. when ld -core and rtsl fragments containing compensatory mutations in the rtsl-tl/sl - interaction ( figure a) were tested by emsa, the results were equivalent to those observed with the complete ld fragment (compare figure b , lanes - , with figure b , lanes - ). ld -core thus accurately recapitulated the binding activity of the full-length ld , implying that all determinants for efficient rtsl binding are present in this smaller fragment. rtsl binding to ld -core also exhibited equivalent binding activities compared to full-length ld in terms of dependence on the as /rs (supplementary figure s b , lanes - , with figure c , lanes - ) and as -sl /sl - interactions (supplementary figure s c , lanes - , with figure e, lanes - ) . thus, ld -core behaves comparably to full-length ld . additionally, the potential involvement of sl in rtsl binding was assessed by deleting it from ld -core and the results indicated no role for this substructure in complex formation (supplementary figure s ) . the portion of rtsl required for binding to ld -core was also sought by generating fragments with increasingly larger truncations of its lower region ( figure c ). the emsa results revealed that the bottom half of rtsl, including the prte, was dispensable for binding ( figure d) . therefore, the portion of rtsl essential for translational readthrough (i.e. the prte) is not required for formation of the rtsl-tl/sl - interaction. with both genetic and physical evidence supporting the formation and function of the rtsl/ld interaction, we next sought to gain additional insights into the nature of this rna complex through solution structure probing analysis.
to this end, in-line probing was used to assess the rna structure of rtsl and ld -core, both individually and in complex. under the assay conditions, residues that are flexible, and thus likely single-stranded, undergo spontaneous hydrolysis ( ) . information gained from the analysis is then used to build structural models consistent with the chemical reactivity data. ld -core was assessed first ( figure a ). in its free state, the structural status of sl - (green), its adjacent partner sequence sl - (pink), and the alternate partner of the latter, as -sl (pink), were of particular interest. the reactivity data ( figure a, lane ) suggested that unbound ld -core likely exists as a conformational mixture that includes sl (figure bi ) and the mutually-exclusive as -sl /sl - (pink) interaction (figure bii) . probing results with free ld -core that were consistent with the formation of sl included (i) high reactivity in the -portion of as -sl (pink, coordinates to ), indicating that a proportion of this sequence does not pair with sl - ( figure a , lower black bar and bi, brown-shaded triangles) and (ii) high reactivity in the loop residues in sl , which would be reactive in the context of sl ( figure a , upper black bar and bi, brown-shaded triangles). further evidence for sl s functional relevance and structural existence was provided, respectively, by comparative sequence analysis supporting its conservation (supplementary figure s ) and rna structure modelling, guided by the in-line reactivity data, that predicted its presence in the optimally-folded ld -core (supplementary figure s a) . 
conversely, the moderate reactivity of residues in sl - (green, - ), which indicated an unpaired state in a proportion of the structural population, was consistent with an alternative non-sl -containing structure ( figure a , white bar and bii, brown-shaded triangles), a concept bolstered by the prerequisite for the sl - -freeing as -sl /sl - (pink) interaction for complex formation ( figure e and supplementary figure s c ).
figure . structural requirements for rtsl/ld complex formation. (a, c) secondary structures of cirv rtsl ( nt) and ld -core ( nt) rna fragments tested in rna-rna emsas. the red nucleotides in the ld -core secondary structure represent the added uncg-type tetraloops that replaced ld sub and sub beyond s and s , respectively. the g-c pair shown in red in rtsl mutant tc- was added to allow for its transcription from pcr templates using t rna polymerase. (b, d) ethidium bromide-stained % native polyacrylamide gels of emsas testing the rna fragments containing modifications shown in figure a and panel c, respectively. the contents of each lane are indicated above the gels, with the fragment type shown to the far left. lane represents a mock lane containing only rna binding buffer and glycerol. the black arrows on the right side of the gels point to the positions where the rtsl/ld complexes migrate. the percentages with standard errors of shifted ld -core rnas are provided below the gels and were obtained from three independent emsa experiments.
collectively, these data suggest that, when unbound, the core region of ld is comprised of a mixture that includes the two structural conformations presented ( figure b ); however, other configurations are also plausible (supplementary figure s b) . probing results for ld -core when in complex with rtsl revealed a notable reduction in reactivity of sl - (green), consistent with it base-pairing with rtsl-tl ( figure a, compare lanes and ) .
correlative results were observed when rtsl was probed individually ( figure c , lane , d, and supplementary figure s ) or in complex, the latter of which showed a corresponding reduction in reactivity of rtsl-tl (green) in the bound state ( figure c , compare lanes and ). the probing results also revealed a potential second inter-fragment interaction involving two nt-long complementary sequences (i.e. corresponding reduced reactivities in the bound states) ( figure a , c, purple) located between s and s in ld -core and in a bulged region of rtsl ( figure b and d, purple, respectively). thus, in addition to the rtsl-tl/sl - (green) interaction, a second interaction between rtsl and ld (purple) could also be functionally relevant, as structurally modeled ( figure e and supplementary figure s ) .
figure . in-line structural probing analysis of cirv ld -core and rtsl rnas. (a, c) sequencing gel analysis following in-line probing of radiolabeled ld -core (ld -core*) and rtsl (rtsl*) rna fragments, respectively. lane contains untreated ld -core* or rtsl* rna samples (nr, no reaction). lane contains the rnase t -digested ld -core* or rtsl* rna samples to generate g ladders. lane contains ld -core* or rtsl* rna samples that were subjected to alkaline hydrolysis reaction (-oh) to generate cleavages at every nucleotide position. lane contains in-line reactions from free ld -core* or free rtsl* rna fragments (free). lane shows in-line reactions when ld -core* or rtsl* was incubated with unlabeled rtsl and ld -core, respectively, to generate a complex. nucleotide positions of selected g residues are indicated on the left. different regulatory sequences are color coded and labeled on the right side of the gels. black bars on the left of lane in panel (a) indicate sl and as -sl sequences that show high cleavage levels in free ld -core. the white bar on the left of lane shows moderate cleavage levels for sl - in free ld -core. (b) two alternative rna secondary structure conformations for free ld -core. the structure on the left (i) was deduced as the optimal structure by in-line-guided folding of ld -core, as described in supplementary figure s a . areas of notable reactivity are indicated by brown arrowheads (which correspond to vertical black bars in panel a). the structure on the right (ii) was generated with folding constraints that maintained nucleotides - as unpaired (brown arrowheads, which correspond to the vertical white bar in panel a). (d) rna secondary structure of free rtsl was deduced as the optimal structure by in-line-guided folding of rtsl, as described in supplementary figure s a . rtsl-tl and rtsl-seq are shown in green and purple, respectively. (e) rna secondary structure of the rtsl/ld -core complex, deduced by in-line probing results from analysis of the rna complex (supplementary figure s ) .
the potential second interaction was initially assessed in protoplast infections with cirv genomes containing compensatory mutations in the partner sequences. the results indicated that base pairing of these sequences was required for both sg mrna plus- and minus-strand synthesis (supplementary figure s a-c) . emsa analysis of the same mutations in the context of the ld -core and rtsl fragments indicated that complex formation was dependent upon complementarity of the sequences in rtsl and ld , termed rtsl-seq and ld -seq , respectively (supplementary figure s d ). additionally, rtsl-seq /ld -seq pairing was found to be well conserved among the members of the genus tombusvirus (supplementary figure s e) . these findings support a critical role for the rtsl-seq /ld -seq (purple) interaction in mediating formation of an effective attenuation structure for sg mrna transcription ( figure e ).
global architecture of viral rna genomes can contribute significantly to the regulation of critical viral functions. accordingly, there is considerable interest in understanding how these genome-level rna structures assemble and function. our investigation of a tombusvirus led to the discovery of a novel intra-genomic rna complex that activates sg mrna transcription. notably, this rna-based attenuation structure is comparatively complex and provides new perspectives into this higher-order level of viral riboregulation. the initial goal of this study was to investigate the possible role of the apical region of rtsl in the readthrough process, thus its observed involvement in sg mrna transcription was unexpected. this additional function hinted at possible regulatory cross-talk between readthrough and transcription activities. indeed, the presence of transcriptional regulatory sequences in the rdrp coding region of the viral genome would require the suppression of readthrough to allow for unimpeded transcription. though an appealing possibility, in vitro translation analysis showed either no effect or minor decreases in readthrough when the rtsl-tl/sl - interaction was disrupted ( figure b ), whereas a notable increase (i.e. derepression) would have been expected if it were involved in coordinating the two processes. nonetheless, the new transcriptional function uncovered adds to its previously known roles in promoting readthrough and inhibiting minus-strand rna synthesis and classifies rtsl as a unique multifunctional rna element controlling three distinct viral processes. the complexity of the rdrp attenuation signal formed between rtsl and ld provided a unique opportunity to explore the assembly of this functional rna complex. the comparatively smaller and localized components involved in the interaction, rtsl, as -sl and sl , are anticipated to fold independently and relatively rapidly after their emergence during progeny viral rna genome synthesis (figure a ). 
in contrast, formation of larger and more complex structures, such as subdomain- and - of ld , would likely require additional time ( figure b) . a role for these subdomains in assembly of the functional complex is supported by the observed importance of their closing stems for mediating efficient transcription (supplementary figure s ) . notably, the establishment of these subdomains unites as and rs (red) to within ∼ nt, thereby markedly reducing their ∼ nt distance of separation in the linear genome (compare figure a with b) . this colocalization would in turn facilitate base pairing of as and rs ( figure b ) and complete formation of ld ( figure c ). the as /rs (red) interaction also mediates formation of the core region of ld that ultimately forms the binding pocket for rtsl. probing data suggests that this core region likely exists as a conformational ensemble that includes incompatible and compatible forms, with respect to rtsl binding ( figure d and e, respectively) . the presence of sl precludes formation of the essential as -sl /sl - (pink) interaction ( figure d ), while formation of the latter is needed to free sl - (green) and ld -seq (purple) to allow their binding with partner sequences in rtsl ( figure e ). sl thus represents an integral but transient component in the folding process. functionally, the formation of sl could prevent its critical halves from interacting with non-cognate complementary sequences that would interfere with correct folding of the binding pocket. in this capacity, sl would provide a safe, temporary, storage form for its component sequences until their requirement for binding pocket formation, initiated by the as /rs interaction. formation of the rtsl binding pocket requires both global folding of the large rna domain ld and detailed conformational arrangements within its basal core region. 
key features of the resulting docking site include two discontinuous sequences (ld -seq , purple and sl - , green) that map to either side of s , the closing stem of subdomain- ( figure f ). the docking of rtsl, via bipartite binding of rtsl-seq and rtsl-tl with these sites, acts as a linchpin in the formation and stabilization of the higher-order rna complex capable of blocking progression of the viral rdrp ( figure g ). this final docking step could confer its effect by bolstering the as /rs (red) interaction by either direct or allosteric means. in the latter case, rtsl pairings could stabilize the adjacent as -sl /sl - (pink) interaction, which, in turn, could structurally support the juxtaposed as /rs (red) helix ( figure g ). alternatively, direct, presumably noncanonical, interactions between rtsl and as /rs could function to stabilize the latter. a third possibility is that an additional part(s) of the rna complex, in addition to the as /rs helix, contacts the rdrp and contributes to the stalling activity. future, higher-resolution structural analysis will be required to investigate further the precise mode of rna-based inhibition of the rdrp. the formation and stability of the rna attenuation structure are highly cooperative, as verified by the strong inhibitory effects of nucleotide mismatches in any of its component interactions (figure and supplementary figures s , s , and s ). although our analyses indicate that these interactions can occur spontaneously in vitro (figure and figure ), viral or host proteins could also assist in folding of the attenuation rna complex during infections (e.g. rna chaperones ( ) ). assembly of the rna complex follows a multistep folding pathway involving the spatial unification of numerous distant regions of the genome ( figure ). in this folding scheme, two steps in particular are likely to be rate limiting, and thus determinants of the timing of active complex formation leading to sg mrna transcription. 
the first is the generation of ld , including formation of the binding pocket, which would depend on overall domain folding and subsequent refinement of the docking site. a second restrictive step would be the docking event, the rate of which would be determined by the facility with which rtsl stochastically and productively encounters the binding pocket. together, these steps could function to delay the production of cp from sg mrna to later in the infection, when packaging is required. 
figure . proposed rna genome folding pathway leading to activation of sg mrna transcription in cirv. note, this is a highly simplified folding pathway based on conjectured temporally-distinct transitions dictated by differences in stability, complexity and the relative spatial positions of the subcomponents. the relative timescales for transitions are not known and the structures shown represent approximations of probable intermediates within ensemble populations. (a) schematic depiction of partially folded sections of the cirv genome including rtsl, as -sl and sl . as and rs are separated by ∼ nucleotides in the linear sequence. orange and blue double-headed arrows point to the complementary sequences involved in formation of s of sub and s of sub , respectively. small secondary structures within sub and sub are not shown, but are anticipated to form on a similar timescale as rtsl, as and sl . (b) rna secondary structure after folding of ld sub (orange) and sub (blue), which are closed by stems s (orange) and s (blue), respectively. in this conformation as and rs are brought within ∼ nt from one another, which facilitates their base pairing (red double headed arrow). (c) formation of the as /rs interaction completes folding of ld and its basal core region. (d) sl , when present in the basal core region, prevents the docking of rtsl into the binding pocket by sequestering its partner sequences (refer also to figure b 
indeed, time course experiments of viral rna accumulation with tombusviruses show that, compared with that for sg mrna , sg mrna transcription is delayed ( , ) . intra-genomic ldris are also used to control sg mrna transcription in other genera of the family tombusviridae, including aureusviruses ( ) and pelarspoviruses ( ) , while dianthoviruses utilize an inter-genomic interaction ( ) . however, in these cases the attenuation rna structures are less complex than that described here; though, based on our unexpected results, further analyses may be warranted. this mode of sg mrna regulation via ldris also extends beyond plant viruses to include plus-strand rna viruses that infect insects and mammals. the sole sg mrna produced by the insect-infecting flock house virus (family nodaviridae) is produced via a premature termination mechanism that utilizes an rna-based attenuation structure composed of a three-helix junction formed by distant sequences ( ) . in contrast, coronaviruses use an alternate discontinuous transcription mechanism for sg mrna production, where and then segments of the viral genome are copied discontinuously during minus-strand synthesis ( ) . in transmissible gastroenteritis coronavirus, the discontinuous step for the production of the mrna encoding the nucleocapsid protein is facilitated by an ldri in the genome that unites the regions where the viral rdrp dissociates and reinitiates ( ) . other plus-strand rna viruses that infect humans and animals also depend on ldris for regulating viral processes, most notably flaviviruses (e.g. dengue virus ( , , ) and zika virus ( )), hepaciviruses (e.g. hepatitis c virus ( , ) ) and aphthoviruses (e.g. foot-and-mouth disease virus ( , ) ). moreover, other categories of rna virus such as retroviruses (e.g. hiv ( ) ) and negative-strand rna viruses (e.g. influenza virus ( )) also rely on ldris. 
understanding the molecular mechanisms of large-scale rna circuits and their structural and functional integration is key to determining how rna viruses regulate their infectious cycles. deciphering such ldri networks, however, has remained a challenge because many reside in coding regions and have multiple functions, as illustrated herein. in this study, we uncovered a new function for the folding of a large viral rna domain in creating a distinctive binding pocket, and showed that subsequent docking of a distal rna structure into this binding site acts as a linchpin that stabilizes an rna complex required for viral transcription. we also proposed a plausible multistep pathway for the formation of the active intra-genomic rna complex, an area of ldri research that remains largely unexplored. these novel findings reinforce the importance and often overlooked underlying role of global rna structure in viruses. indeed, in many instances viral rna genomes should be viewed as large complex rna switches, and tombusviruses, with no fewer than eight functional ldris, serve as valuable prototypes for understanding this intriguing category of rna-mediated regulation. 
we thank members of our laboratory for reviewing the manuscript and baodong wu for assistance during the early stages of this work. supplementary data are available at nar online. 
key: cord- - pou r authors: lin, ya-hui; chang, kung-yao title: rational design of a synthetic mammalian riboswitch as a ligand-responsive - ribosomal frame-shifting stimulator date: - - journal: nucleic acids res doi: . 
/nar/gkw sha: doc_id: cord_uid: pou r metabolite-responsive rna pseudoknots derived from prokaryotic riboswitches have been shown to stimulate − programmed ribosomal frameshifting (prf), suggesting − prf as a promising gene expression platform to extend riboswitch applications in higher eukaryotes. however, its general application has been hampered by difficulty in identifying a specific ligand-responsive pseudoknot that also functions as a ligand-dependent - prf stimulator. we addressed this problem by using the − prf stimulation pseudoknot of sars-cov (sars-pk) to build a ligand-dependent − prf stimulator. in particular, the extra stem of sars-pk was replaced by an rna aptamer of theophylline and designed to couple theophylline binding with the stimulation of − prf. conformational and functional analyses indicate that the engineered theophylline-responsive rna functions as a mammalian riboswitch with robust theophylline-dependent − prf stimulation activity in a stable human t cell-line. thus, rna–ligand interaction repertoire provided by in vitro selection becomes accessible to ligand-specific − prf stimulator engineering using sars-pk as the scaffold for synthetic biology application. rna modules capable of recognizing specific metabolites to regulate gene expression have been identified in the utr of a variety of prokaryotic genes ( ) ( ) ( ) . such riboswitches can control accessibility of shine-dalgarno (sd) sequences and intrinsic transcriptional termination hairpins to tune translation initiation and transcription termination efficiencies, respectively ( ) ( ) ( ) . the ability to control rna conformations by metabolites or artificial organic molecules to regulate specific gene expressions in higher eukaryotic systems could provide new opportunities in biomedical and synthetic biological applications ( , ) . 
however, the fact that eukaryotes have different translation initiation and transcription termination mechanisms from those of prokaryotes has thus far hampered attempts to extend riboswitch applications into eukaryotic systems. there are only a few examples of successfully engineered mammalian riboswitches, and all are involved in the regulation of other rna-mediated processes, such as the control of mirna biogenesis and ribozyme activity ( , ) . recently, metabolite-binding units of some prokaryotic riboswitches grafted into an open reading frame (orf) have been shown to stimulate − programmed ribosomal frameshifting (prf) in response to specific metabolites, suggesting that − prf holds promise as an expression platform for the implementation of an engineered mammalian riboswitch ( , ) . the − prf involves the backward movement of an elongating ribosome by one nucleotide relative to the decoding reading-frame. it leads to a switch of the decoding process into a − reading-frame to generate a protein with its c-terminal domain composition being determined by the new reading-frame. it has been adopted in a variety of viruses to control the ratio between viral proteins crucial for optimal propagation via instrumental frameshifting efficiency ( , ) . − prf occurs on a shifty sequence with a low basal efficiency and can be further enhanced by an rna structure optimally positioned downstream of the shifty sequence ( ) . the downstream rna structure is usually an h-type pseudoknot ( ) composed of an rna hairpin with its loop sequences pairing with complementary sequences downstream of the hairpin stem (stem ) to form a second duplex (stem ). given the critical role of a downstream stimulator in the efficiency of eukaryotic − prf, the ability to modulate stimulator conformation formation by a ligand-binding rna aptamer could result in a ligand-responsive − prf stimulator. however, only a subset of the h-type pseudoknot can stimulate − prf efficiently. 
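the reading-frame change described above can be made concrete with a short sketch. it is only an illustration under stated assumptions: the reporter sequence is made up, the helper names (`codons`, `minus_one_frameshift`) are hypothetical, and the slippery site is assumed to be positioned as x_xxy_yyz so that xxy and yyz are zero-frame codons.

```python
def codons(seq, start=0):
    """Split seq into complete triplets, beginning at index start."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]


def minus_one_frameshift(seq, slippery="TTTAAAC"):
    """Return (zero_frame, shifted_frame) codon lists for seq.

    Assumes the slippery site X_XXY_YYZ is placed so that XXY and YYZ are
    zero-frame codons. After these two codons are decoded the ribosome has
    slipped back by one nucleotide, so Z is read again as the first base
    of the next codon and decoding continues in the -1 frame.
    """
    pos = seq.find(slippery)
    if pos < 0 or pos % 3 != 2:
        raise ValueError("slippery site missing or not in the expected frame")
    end = pos + len(slippery)
    zero_frame = codons(seq)
    shifted_frame = codons(seq[:end]) + codons(seq, start=end - 1)
    return zero_frame, shifted_frame


# Made-up sequence (not from the paper): ATGGC + TTTAAAC + arbitrary bases.
example = "ATGGCTTTAAACGGGTTTAGA"
zero, shifted = minus_one_frameshift(example)
print("zero frame:", zero)
print("-1 frame:  ", shifted)
```

in this toy sequence the codons downstream of the slippery site differ completely between the two frames, which is why the c-terminal portion of the protein is determined by the new reading frame.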
the two riboswitch-derived − prf stimulators both possess ligand-induced base-triple interaction networks that surround the helical junctions of pseudoknot folds ( ) ( ) ( ) ( ) . however, it is challenging to design a specific ligand-dependent base-triple network within an rna pseudoknot as well as to convert the ligand-responsive pseudoknot into a ligand-dependent − prf stimulator. by contrast, the − prf stimulators of coronaviruses belong to a family of well-characterized h-type pseudoknots (ibv-type pseudoknot) with a long stem of at least base pairs essential for stimulating − prf efficiently ( ) . furthermore, in vitro selection methods capable of identifying rna receptors for specific ligands of interest, and rna aptamers for a variety of ligands are available ( ) . thus, the combination of a well-characterized − prf stimulator and an aptamer of a specific ligand could provide a straightforward solution for rational design of a ligand-responsive − prf stimulator. in this study, we take advantage of an extra stem-loop of sars-cov − prf stimulation pseudoknot to show that this stem-loop can be replaced by an rna aptamer to design a ligand-responsive − prf stimulator with activity that rivals those of viral and metabolite-responsive stimulators. we further demonstrate the in vitro improvement of ligand responsiveness and function of the engineered riboswitch as a ligand-responsive − prf stimulator in a stable human cell line. thus, this scaffold should make a repertoire of rna aptamers available for artificial riboswitch construction in ligand-dependent − prf regulation of higher eukaryotes. the genes of designed pseudoknot constructs, with their corresponding slippery sequences and bridging spacer sequences, were generated by a fragment overlapping extension polymerase chain reaction (pcr) ( , ) . 
nucleotide sequences corresponding to sars-cov pseudoknot or core missing -half sequences were pcr-amplified by designed primers using previous sars-cov plasmids ( ) as the templates. different fragments were assembled by pcr via overlapping sequences located in the - and ends of each fragment. a theophylline-responsive element combining theooff with switch- was generated using the same strategy. the final assembled fragment flanked by sali and bamhi restriction sites was digested with restriction enzymes, purified and cloned into compatible sites of puc , p luc dual luciferase reporter ( ) or pninsertc-venus fluorescence reporter ( ) (for pntheooff -switch c-venus). a theooff -switch containing venus was further amplified from pntheooff -switch c-venus and constructed downstream of the tetracycline responsive element (tre) promoter of pb-t-paf vector ( ) to form plasmid pbtpaf-theooff -switch . the pb-rn plasmid that carries reverse tetracycline transactivator gene (rtta), the helper plasmid pbcy that expresses pb transposase and pb-t-paf were gifts from prof. j. m. rini at the university of toronto, canada ( ) . mutants with theophylline binding pocket disruption or read-through control used for calibrating frameshifting efficiency were generated using the quick-change mutagenesis kit from stratagene according to manufacturer's instructions. all the primers were chemically synthesized and purchased from genomics biosci & tech, taiwan. identities of all cloned and mutated genes were confirmed by dna sequencing. rnas were synthesized by in vitro transcription from appropriate dna templates cloned into puc- using t rna polymerase. the purified rnas were dephosphorylated by calf intestine alkaline phosphatase (roche) and p -labeled at the -end using t polynucleotide kinase (neb) in the presence of [γ- p] atp. in-line probing assays were performed following published protocols ( , ) . 
briefly, approximately cpm per reaction of p-labeled rnas were incubated with varied amounts of theophylline ( - mm for final concentration) in in-line probing reaction buffer ( mm tris-hcl, ph . , mm mgcl , mm kcl) at room temperature for h. partial alkaline digested rna ladders were prepared by incubating labeled rnas in alkaline buffer ( mm na co , mm edta, ph ) at °c for min, while guanine-specific sequencing ladders were obtained by procedures described in the next section. all reactions were terminated by adding gel loading buffer ( % formamide, mm edta, . % sds, . % xylene cyanol, . % bromophenol blue). spontaneous rna cleavage products from in-line probing assays and related markers were separated by % denaturing polyacrylamide gel electrophoresis and exposed to a phosphorimager screen after drying of the gels. the phosphorimager screen was scanned by typhoon fla phosphorimager (ge) and the radioactivity of spontaneous rna-cleavage products was analyzed and quantified by imagequanttl software. to calibrate for loading differences, the quantified intensity values of the cleavage bands of interest were normalized against the value of a band corresponding to residue g of the same lane. the fraction of rna cleaved in each band under a specific theophylline concentration was calculated as the difference between sample intensity and minimum intensity divided by the difference between maximum intensity and minimum intensity. the maximum and minimum intensities are the highest and lowest values measured for each nucleotide position over a range of theophylline concentrations. the fraction of rna cleaved was plotted against the logarithm of theophylline concentration and fitted to a logistic dose-response model according to the following equation ( , ) . the results were then plotted using sigmaplot . (systat software, inc), where a and a correspond to the highest and lowest limits reached by the plotted curve, respectively. 
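the normalization and curve fitting described here can be sketched as follows. this is a minimal illustration rather than the procedure actually used: the fitted equation is not reproduced in this text, so the four-parameter logistic below is one common form chosen as an assumption, the function and parameter names (`fraction_cleaved`, `logistic`, `fit_x_half`, `top`, `bottom`, `x_half`, `slope`) are hypothetical, the band intensities are invented, and a coarse grid search stands in for the sigmaplot fit.

```python
def fraction_cleaved(intensities):
    """Min-max normalise band intensities measured across ligand concentrations."""
    lo, hi = min(intensities), max(intensities)
    if hi == lo:
        raise ValueError("intensities do not vary; cannot normalise")
    return [(v - lo) / (hi - lo) for v in intensities]


def logistic(x, top, bottom, x_half, slope):
    """Assumed four-parameter logistic in x = log10(concentration)."""
    return bottom + (top - bottom) / (1.0 + 10.0 ** (slope * (x_half - x)))


def fit_x_half(log_conc, fractions,
               slopes=(-3.0, -2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 3.0),
               steps=200):
    """Grid-search least-squares fit with top=1 and bottom=0 fixed.

    Returns (x_half, slope); 10**x_half approximates the apparent Kd.
    """
    lo, hi = min(log_conc), max(log_conc)
    best = None
    for i in range(steps + 1):
        x_half = lo + (hi - lo) * i / steps
        for slope in slopes:
            sse = sum((logistic(x, 1.0, 0.0, x_half, slope) - y) ** 2
                      for x, y in zip(log_conc, fractions))
            if best is None or sse < best[0]:
                best = (sse, x_half, slope)
    return best[1], best[2]


# Invented intensities for one band over a theophylline dilution series.
log_conc = [-6.0, -5.0, -4.0, -3.0, -2.0]          # log10 of molar concentration
band = [120.0, 180.0, 510.0, 860.0, 900.0]          # arbitrary units
x_half, slope = fit_x_half(log_conc, fraction_cleaved(band))
print("apparent Kd estimate (M):", 10 ** x_half, "slope:", slope)
```

negative slopes are included in the grid because ligand binding may reduce rather than increase cleavage at a given position.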
the y axis in the plot represents the normalized value of the fraction of rna cleaved and x is the logarithm of the concentration of theophylline. the concentration of theophylline needed to induce half-maximal cleavage provided an approximation of the apparent k d for theophylline binding of the analyzed rna. approximately cpm per reaction of p-labeled switch- or switch- rnas were denatured at °c for min under different conditions (with or without ligands). the denatured rnas were refolded on ice for min and then digested with rnase t ( . u), v ( . u) or t ( . u) in structure mapping buffer ( mm tris ph , . m kcl, mm mgcl ) at room temperature for min, and stopped by adding gel loading buffer. the rna alkaline hydrolysis marker was obtained as described above. the guanine-specific and cytosine/uracil sequencing ladders were obtained by denaturing labeled rnas in rna sequencing buffer ( mm sodium citrate ph , mm edta, m urea) at °c for min followed by rnase t ( . u) and rnase a ( − ng/l) digestion, respectively. rnase t digestion was carried out at room temperature for min or min and rnase a digestion was carried out at room temperature for min. a l aliquot from each reaction was loaded for % denaturing polyacrylamide gel electrophoresis and quantified using a similar method to that described for in-line probing assays. a rabbit reticulocyte lysate system (ambion) was used to generate shifted and non-shifted protein products. capped reporter mrnas were in vitro transcribed by t rna polymerase supplemented with a methylated cap analogue (epicentre) in the reaction. the purified capped reporter mrna ( ng) was used in a l in vitro translation reaction containing . l rabbit reticulocyte lysate, . l of translation buffer, . l of rnase inhibitor ( u/l), . l of ci/ l [ s]-labeled methionine (nen) and l of theophylline of varied concentrations. the reaction mixtures were incubated at °c for . 
h, and then loaded into % sodium dodecylsulphate-polyacrylamide (sds-page) gels for electrophoresis analysis. the gels were exposed to a phosphorimager screen after drying and the radioactivity of translated products analyzed. the radioactivity of protein products was calibrated with the methionine content of each protein product. estimated frameshifting efficiency was calculated by dividing calibrated radioactive intensity of full-length shifted protein products by the sum of calibrated radioactive intensity of full-length shifted and non-shifted protein products. because translation products due to ribosome drop-off in the − frame (radioactivity detectable or non-detectable) were difficult to measure accurately and to calibrate for methionine content, they were not included in the calculation, which would lead to underestimation of frameshifting efficiency. by assuming a similar drop-off tendency, we present the effect of theophylline on radioactivity-based − prf activity in terms of relative − prf so that the ribosome drop-off effect can be filtered out ( ) . dual luciferase activity was measured from in vitro translation reactions (without addition of labeled methionine) or lysates of reporter transfected cells for frameshifting efficiency calculations by dual luciferase tm reporter assay (promega) following manufacturer's instructions on a chameleon tm multi-label plate reader (hidex). dual-luciferase based frameshifting efficiency of a specific construct was calculated according to previously described procedures ( ) by comparison with a corresponding read-through control, assuming that a similar extent of ribosome drop-off occurred during translation. each read-through control has the tttaaac slippery sequence replaced by a cttaagaa sequence that disrupts the slippery site and shifts the reading frame to − frame by one extra nucleotide insertion. 
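the two efficiency calculations described above can be written out as a short sketch. the radioactivity-based formula follows the text directly. the dual-luciferase formula is an assumption: the text only states that the construct is compared with a read-through control, and the ratio-of-ratios form below is the commonly used one. all function names, argument names and numbers are hypothetical.

```python
def radioactive_fs_efficiency(shifted_counts, nonshifted_counts,
                              met_in_shifted, met_in_nonshifted):
    """Frameshifting efficiency from band radioactivity.

    Each band is first calibrated by the number of methionines in the
    corresponding protein, then efficiency = shifted / (shifted + non-shifted).
    """
    shifted = shifted_counts / met_in_shifted
    nonshifted = nonshifted_counts / met_in_nonshifted
    return shifted / (shifted + nonshifted)


def dual_luciferase_fs_efficiency(downstream_test, upstream_test,
                                  downstream_ctrl, upstream_ctrl):
    """Assumed ratio-of-ratios form against a read-through control.

    'downstream' is the reporter that is only translated after the
    frameshift; 'upstream' is the reporter translated by every ribosome.
    """
    return (downstream_test / upstream_test) / (downstream_ctrl / upstream_ctrl)


def relative_prf(efficiency_with_ligand, efficiency_without_ligand):
    """Relative -1 PRF: efficiency normalised to the no-ligand condition."""
    return efficiency_with_ligand / efficiency_without_ligand


# Invented numbers for illustration only.
eff_no_ligand = radioactive_fs_efficiency(300.0, 400.0, 6, 4)
eff_ligand = radioactive_fs_efficiency(150.0, 500.0, 6, 4)
print("efficiency without ligand:", eff_no_ligand)
print("relative -1 prf with ligand:", relative_prf(eff_ligand, eff_no_ligand))
```

expressing the ligand effect as a ratio of two efficiencies is what allows a drop-off bias that is similar in both conditions to cancel out.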
unless specified, these readthrough controls (listed in supplementary table s ) were used for calibration in frameshifting activity calculation. hek t cells were plated in dulbecco's modified eagle medium (dmem, gibco) supplemented with % fetal bovine serum (fbs, corning) in a -well plate one day before transfection. one hour before transfection, the medium was changed to minimum essential medium α-medium (α-mem, gibco) containing % fbs. jetprime tm transfection reagent (polyplus) was used to transfect the reporter plasmids into t cells according to manufacturer's instruction. the medium was changed to fresh % fbs α-mem containing final concentrations of . , . or mm of theophylline h after transfection and transfected cells were cultured for another h. a stable cell line harboring theophylline-responsive − prf element embedded fluorescence reporter was established using a piggybac transposon system ( ) . pbtpaf-theooff -switch was cotransfected into t cells with pb-rn and a helper plasmid pbcy ( ) . cells inserted with fluorescent reporter and rtta were selected using a culture medium containing g ( g/ml) and puromycin ( g/ml). after weeks of selection, surviving cells were treated with g/ml tetracycline and reporter fluorescence was detected after h of tetracycline treatment to identify a positive cell colony. this colony was named t-theo . transmitted light images were used for monitoring cell morphologies, and fluorescent images were obtained by epifluorescence microscope (olympus bx ) with an olympus dp camera system. the fluorescence-filter set (olympus) used for venus fluorescence detection was u-myfphq/ nm. cells were lysed in lysis buffer ( mm hepes-ph . , mm nacl, mm ethylenediaminetetraacetic acid, . % triton x- , % glycerol) on ice for min. clear cell lysates were collected after centrifugation and protein concentration was determined by bradford assay (biorad). 
fifteen micrograms of total protein from each treatment was loaded and separated by % sds-page electrophoresis. the separated proteins were transferred to a polyvinylidene difluoride membrane (pvdf; perkinelmer) by a trans-blot semidry blotting system (biorad). after blocking with % skim milk, the membrane was incubated with primary rabbit anti-gfp polyclonal antibody ( : dilution; biovision) or with primary mouse anti-β-actin monoclonal antibody ( : dilution; abcam) at room temperature for h. it was then reacted with horseradish peroxidase-conjugated secondary antibody (goat anti-rabbit immunoglobulin g (igg) or goat anti-mouse igg, : dilution; jackson). the blotting signals were visualized by western lightning plus ecl (perkinelmer) and detected by an imagequant tm las- mini luminescent image analyzer (ge). experiments were performed at least in triplicate and frameshifting activities were reported as the mean with one standard deviation. the variances in each set of data (without or with different dosages of ligands) were analyzed by analysis of variance (anova). when a data set presented an f-value bigger than the critical value from a lookup table for α = . and a p-value smaller than . , significance was further determined by pairwise comparisons to compute the least significant difference (lsd) using a t-test. in order to engineer a ligand-responsive stimulator with efficiency to rival that of a viral − prf stimulator in mammalian cells, we looked for a potent − prf stimulator to be our design template, since the integration of an rna aptamer could compromise stimulation activity. previously, a three-stem pseudoknot, sars-pk, was characterized as the − prf stimulator of sars coronavirus ( , , ) ( figure a and c).
mutagenesis analysis of stems indicated that stem of sars-pk could tolerate modification without severe reduction in − prf stimulation activity ( ) , while an intermolecular kissing-loop interaction involving the loop of stem was shown to affect frameshifting activity ( ) . given that solution nmr and limited nuclease digestion analyses have supported three-stem formation in sars-pk ( , ( ) ( ) ( ) , using it as a scaffold could also provide advantages in detection of a ligand-dependent conformational switch during the design process. as a first step toward constructing a ligand-responsive − prf stimulator, we designed switch- rna with a theophylline aptamer replacing the stem of sars-pk ( figure a and c). a theophylline aptamer was used due to theophylline's cell permeability ( ) and the well-characterized structural features of the aptamer ( ) . the ligand-binding pocket of the theophylline aptamer is composed of an internal loop and an adjacent bulge, with conserved key theophylline-contact sequences distributed within the two motifs. each motif is connected to duplex regions that serve as the carrier of the binding pocket ( figure a and c). importantly, the conservation of primary sequences in the terminal duplex (the 'lower stem' in figure a ) that closes the internal loop is not absolutely required as long as base-pairing complementarity of the duplex is maintained ( ) . this feature thus provides flexibility in designing a ligand-dependent conformational switch. the − prf activity of switch- placed downstream of a slippery sequence is one third that of sars-pk based on in vitro frameshifting assays performed in reticulocyte lysate. furthermore, the − prf efficiencies of both sars-pk and switch- remained virtually unchanged with or without mm theophylline treatment (supplementary figure s a-d) .
however, results from in-line probing analysis of switch- rna (figure a) indicated that the embedded theophylline aptamer remained theophylline-binding competent (supplementary figure s e) . therefore, switch- represents an ideal starting framework to build a theophylline-dependent − prf stimulator. recent simulation studies have indicated that the stabilities of constituent secondary structures determine the folding of rna pseudoknots ( ) . this means that a designed secondary structural element within a pseudoknot could be used to interfere with the folding of stem or stem and thereby control pseudoknot formation. as the -side of pseudoknot stem as well as that of the embedded theophylline aptamer in switch- is bridged by ucu tri-nucleotides, we reasoned that a theophylline-responsive − prf stimulator (switch- ) could be constructed by coupling stem formation with theophylline-binding pocket formation ( figure b and d) . this was achieved by designing sequences flanking ucu to form a stable hairpin, while maintaining base pairing of the lower stem in the theophylline-bound aptamer ( figure d ). we reasoned that such an engineered switch hairpin of reasonable stability (predicted free energy of − . kcal/mole ( )) would be the dominant conformation and could interfere with the formation of pseudoknot stem in the absence of theophylline (supplementary figure s a) . as it is difficult to measure the free energy contribution of stem formation, we mimicked it by a hairpin with a ucu loop closed by the stem ( ) with a predicted free energy of − . kcal/mole (supplementary figure s b) . by contrast, the addition of theophylline could interfere with switch hairpin formation via theophylline aptamer stabilization and help release the trapped -side of stem to facilitate stem pairing for generation of a pseudoknot. in the design of switch- , only the eight nucleotides constituting its lower aptamer stem are different from those of switch- ( figure c and d) .
given the structural information available for sars-pk and the theophylline aptamer, in-line probing was used to evaluate theophylline-binding activity as well as to monitor the extent of ligand-dependent spontaneous rna cleavage of switch- rna. the results were then compared to those of switch- rna (figure and supplementary figure s ). tracking hydrolyzed rna patterns with increasing amounts of theophylline revealed dramatic changes in cleavage patterns in regions corresponding to theophylline-binding pockets in both rnas. this result was consistent with ligand-binding mediated conformational change or protection of cleavage with an apparent kd value of . m for switch- rna (supplementary figure s c) . extra prominent rna hydrolysis signals were observed in sequences involved in aptamer lower stem formation (corresponding to s - to s - in figure d ) as well as in sequences corresponding to the -side of pseudoknot stem in switch- without theophylline treatment ( figure ) . furthermore, the intensities of these unique hydrolysis bands in switch- were reduced upon theophylline addition (supplementary figure s ). however, the reduction of rna hydrolysis signals was observed neither in the presence of caffeine nor in theophylline-treated switch- m rna, in which the theophylline-binding pocket is disrupted (supplementary figure s ) ( ) . by contrast, similar or identical sequences in switch- were much more resistant to hydrolysis ( figure and supplementary figure s ) . as a duplex conformation is more resistant to in-line attack than a single-stranded conformation ( ) , these observations implicate a theophylline-dependent dynamic property as well as a theophylline-induced formation of stem in switch- rna.

figure . sars-pk as a scaffold for engineering a theophylline-dependent − prf stimulator. (a) a schematic drawing shows the replacement of stem (s ) of sars-pk with a theophylline aptamer (boxed in blue) to form switch- . the drawing is based on characterized secondary structures of sars-pk and ligand-bound theophylline aptamer. the secondary structures are designated by 's' for a stem and 'l' for a loop with given numbers corresponding to appearance order from the -end. (b) a scheme shows coupling of pseudoknot stem formation with theophylline binding in switch- by designing a switch hairpin. the - and -complementary sequences of the hairpin stem (in the free form) are designed to participate in the formation of the ligand-binding pocket (colored in green) and stem (colored in blue) upon the binding of theophylline, respectively. (c) sequences and secondary structural models of sars-pk and switch- rna (in theophylline-bound form). the numbering of sequence in sars-pk follows the one described previously ( ) . the numbering system in the sars-pk part of switch- follows that of sars-pk, while numbering in the aptamer domain starts at s - and ends at s - with s standing for stem . characterized secondary structures of sars-pk and theophylline aptamer are used as templates to build the models. (d) sequences and secondary structural models of free and ligand-bound forms of switch- . characterized secondary structures of stem / of sars-pk and theophylline aptamer are used as templates to build the models. numbering logic is the same as that of switch- . the eight nucleotides different from switch- are typed in lower case in both forms.

to clarify the existence of a theophylline-induced conformational switch and stem formation, we tracked the distribution of single-stranded and duplex regions in free and theophylline-bound switch- or switch- rna by limited ribonuclease t and v digestions, respectively (figure ). ribonuclease v cleavages corresponding to the -side of stem for both free and theophylline-bound rnas were in agreement with formation of stem under both conditions for switch- and switch- .
furthermore, v cleavage signals also occurred in the -side of the upper stems of the aptamer domains but were reduced in the presence of theophylline in both switch- and switch- , suggesting ligand-dependent rearrangement in the regions proximal to the binding pockets. consistent with these, reduced ribonuclease t cleavage signals were also observed in binding-related nucleotides downstream of the upper stem upon theophylline treatment for both rnas. however, major differences in t cleavage patterns between switch- and switch- appeared in the sequences covering -sides of both binding pocket and lower stem of the aptamer domains, although the sequences forming binding pockets are identical between the two rnas. furthermore, prominent t cleavages in these sequences observed for switch- were reduced upon theophylline treatment and are consistent with theophylline-induced aptamer stabilization for switch- . importantly, t cleavages also appeared in the stem-loop junction of stem in switch- in the absence of theophylline, and were greatly reduced upon theophylline treatment. no similar t cleavage occurred in the corresponding region of switch- . because these sequences constitute the -side of stem and are identical in both rnas, these results are consistent with theophylline-induced formation of stem in switch- . together, these probing data are consistent with the existence of a theophylline-triggered conformation switch that leads to the formation of a pseudoknot.

left oh − , a and t lanes represent denatured condition sequence markers corresponding to alkaline hydrolysis ladder, c/u residues-specific cut by rnase a and g residues-specific cut by rnase t , respectively. right t , v and t lanes represent limited ribonuclease digestion by corresponding ribonuclease alone (−) or in the presence of either mm theophylline (theo) or mm caffeine (caf). sequences corresponding to the -side of stem , -sides of lower aptamer stem/binding pocket and -sides of binding pocket/lower aptamer stem are typed on the left from bottom to top (from to direction). sequences identical between switch- and switch- are typed in capitals, while the eight nucleotides different between them are typed in lower case. the predicted secondary structural elements along primary sequences are labeled on the right. signals with reduced t and v cleavages in the presence of mm theophylline are annotated by open and filled circles, respectively.

( ) in t cell with or without mm theophylline. − prf activity was calculated by calibrating with the dual-luciferase activity of p luci as a read-through control ( ) . value for each bar is the mean of three independent experiments with standard error of the mean. (f) comparison of relative − prf activity of switch- and sah-pk ( , ) in t cell toward cognate ligand variation. a total of m of adox, an sah hydrolase inhibitor, was used to increase the concentration of sah in t cells ( ) . relative − prf activity was calculated by calibrating with the dual-luciferase activity of p luci as a read-through control while the drug-free activity was treated as (in gray). value for each bar is the mean of three independent experiments with standard error of the mean. for all panels, p-values were determined by a student's t-test with p-value < . designated by an '*'.

next, we measured the in vitro − prf activity of switch- ( figure a) in the presence of different amounts of theophylline using reticulocyte lysate. the − prf activity of switch- responded to theophylline treatment in a dosage-dependent manner and was virtually non-responsive to mm caffeine. the mutant construct with the theophylline-binding pocket disrupted (switch- m ) possessed only a minor increment of frameshifting activity in mm theophylline ( figure b and c) .
in addition, dual-luciferase based − prf activity obtained from t cells transfected with the switch- -containing reporter showed a dosage-dependent trend toward theophylline similar to that of the in vitro analysis, whereas cells transfected with the switch- m reporter lost theophylline dependency for − prf activity ( figure d ). finally, side-by-side comparison indicates that the − prf efficiency of switch- in mm theophylline rivals those of mmtv and srv − prf stimulators ( ) (figure e) , while the dynamic range of theophylline-dependent frameshifting stimulation is close to the level of sah-pk toward sah variation ( , ) ( figure f) . collectively, the results of probing and functional assays demonstrate that switch- is a bona fide mammalian riboswitch using − prf as the expression platform. we then explored whether the stability of a switch hairpin can be used as a guide to improving the design. we first stabilized the switch hairpin of switch- by engineering two extra gc base pairs in the terminal end of the switch stem to form a switch-lock construct (supplementary figure s a) and found that it possessed low − prf activity in theophylline (supplementary figure s b-d) . this is consistent with the stabilized switch hairpin (predicted free energy of − . kcal/mole) being locked even in the presence of theophylline. switch- was then designed by removing base pairs from the lower aptamer stem of the ligand-bound switch- . this disrupted three base pairs in the upper stem of the switch hairpin in the free form of switch- (compare figure d with supplementary figure s a) . this was done with the assumption that the destabilized switch hairpin (predicted free energy of − . kcal/mole) could still compete with the formation of a pseudoknot stem without theophylline, while binding of theophylline would facilitate stem formation. switch- possessed a -fold increase in − prf stimulation in response to mm theophylline (supplementary figure s b-d) .
furthermore, the dynamic range was increased to -fold in the presence of mm theophylline, whereas the dynamic range of switch- did not increase further (supplementary figure s c) . in-line probing analysis of switch- rna indicated theophylline-dependent rna hydrolysis patterns similar to those of switch- rna in ligand-binding pockets and -side sequences of stem (supplementary figure s a and b) . a kd value -fold higher than that of switch- rna (supplementary figure s c) helps explain the higher theophylline concentration required to activate − prf stimulation by switch- . however, experiments showed that switch- behaves similarly to switch- in t cells (data not shown), indicating a missing link between in vitro and cellular experiments. comparison of in vitro activities of switch- and switch- indicated that residual frameshifting activity in the absence of theophylline is the main cause of reduction in the dynamic range of ligand-responsiveness of designed stimulators (supplementary figure s ) . recently, we have identified rna hairpins upstream of a frameshifting site as a negative regulator of − prf ( ) and demonstrated that − prf activity can be regulated by ligand-induced conformational rearrangements of this upstream attenuator ( ) . to improve the dynamic range of ligand response and to see if theophylline aptamers can be functional while existing in both positive and negative regulators of − prf, we fused the previously designed theophylline-dependent upstream attenuator, theooff ( ) , with switch- ( figure a ) and examined theophylline-dependent − prf activity in vitro. for comparison, a construct with an upstream theooff and a downstream sars-pk was also generated. we found that the upstream theooff regulated − prf stimulated by downstream sars-pk in a theophylline-dependent way with a dynamic range better than that of switch- (figure b and c) .
furthermore, a - to -fold increase in in vitro − prf stimulation was observed in the theooff -switch construct when theophylline was increased to mm ( figure c ), suggesting the existence of a synergetic effect for theophylline-dependent − prf stimulation. however, this dynamic range was reduced to -fold in t cells transfected with the theooff -switch construct ( figure d ). further analysis suggests that this was due to the reduced dynamic range of theooff in t cells because the dynamic range of switch- remained virtually the same under both conditions ( figure c and d) . thus, the use of the same ligand-binding aptamer in both upstream attenuator and downstream stimulator results in further enhancement of ligand-responsiveness for − prf activity regulation. we also used a split fluorescent reporter with the coding region of its c terminal domain shifted to the − frame ( ) to monitor theophylline-dependent − prf activity in t cells by using theooff -switch to link the split n and c domains of the fluorescent protein. consistently, elevated full-length fluorescent protein expression was induced by theophylline treatment, whereas the read-through control construct (theooff -rfc ) expressed constitutively ( figure e and f). importantly, these results from transient expression indicate that a combination of theooff and switch- provides tighter theophylline-dependent regulation of − prf compared with theooff alone (figure e and f) . finally, a stable cell line ( t-theo ) harboring a split fluorescent reporter gene embedded with theooff -switch was established via a piggybac-based approach ( ) with the transcription of reporter mrna controlled by tetracycline. we found that prominent venus activity could be observed in the presence of both theophylline and tetracycline ( figure a and b), whereas low venus activity existed in the absence of theophylline.
together, these results clearly demonstrate that the engineered theophylline-responsive − prf stimulator is robust and compatible with existing tools to build a regulatory circuit in the t human cell line. the construction of a ligand-responsive pseudoknot does not necessarily lead to ligand-responsive − prf stimulation activity. here, we present concepts and designs in using the sars-cov − prf pseudoknot stimulator as the framework to rationally build a ligand-dependent stimulator for mammalian application. notably, we demonstrated that the extra stem of this ibv-type pseudoknot variant can be replaced by an rna aptamer to provide a general approach for building a ligand-responsive pseudoknot and a ligand-dependent − prf stimulator simultaneously. this engineered mammalian riboswitch possesses activity that rivals metabolite-responsive and viral − prf stimulators, with potential for further improvement by adding an intermolecular kissing interaction to the terminal loop of the aptamer. finally, the ligand dependency of the mammalian riboswitch engineered in this study could be swapped to other ligands by starting from replacing the extra stem of sars-pk with other aptamer domains.

in the left side of arrow, the attenuator hairpin (in red) of theooff is on to attenuate − prf without theophylline. however, the attenuator hairpin is switched off upon theophylline treatment in the right. as the on and off switches of switch- respond in an opposite way to theophylline treatment, the on switch- and off theooff result in synergistic up-regulation of − prf activity in the presence of theophylline. (b) % sds-page analysis of radioactivity-based − prf activity of an upstream attenuator module (theooff ) in combination with different downstream stimulators in reticulocyte lysate. − prf activities of p luc reporters containing theooff -sarspk, switch- and theooff -switch under different ligand conditions are shown with and − frame products annotated. theooff -rfc and theooff -zfc represent read-through and -frame product controls of theooff -switch , respectively. (c) relative − prf activity of reporter constructs in (b) in reticulocyte lysate with the ligand-free activity being treated as (in gray). − prf efficiency was calculated from dual-luciferase activity calibrated by using theooff -rfc and theooff -rfc as the read-through controls of theooff -sars and theooff -switch , respectively. value for each bar is the mean of seven independent experiments with standard error of the mean. p-values were determined by a student's t-test with p-value < . designated by an '*'. (d) relative − prf activity of t cells transfected with reporter constructs in (b) with the ligand-free activity being treated as (in gray). − prf efficiency was calculated as those in (b). value for each bar is the mean of five independent experiments with standard error of the mean. p-values were determined by a student's t-test with p-value < . designated by an '*'. (e) fluorescence microscopy images of t cells, transfected with a pninsertc-venus − prf reporter harboring theooff -sarspk, theooff -switch or theooff -rfc , with or without mm theophylline (scale bar, m). (f) western blot results of t cell lysates from cells transfected with the − prf vector in (e). n-venus (corresponding to frame product) and fused venus containing full-length product (corresponding to − frame product) were detected by a polyclonal anti-gfp antibody. cellular β-actin was treated as the internal loading control.

with regard to the theophylline aptamer used in this study, the successes in the designs of switch- and switch-lock suggest that the stabilities of the switch hairpins affect the regulatory dynamic ranges of designed variants in vitro. indeed, further destabilizing the switch hairpin in switch- led to improvement in ligand responsiveness of switch- in vitro. however, this improvement might be due to reasons other than the original design.
in particular, the design in switch- could disrupt potential stacking between stem and the lower aptamer stem in the absence of theophylline (see figure d for a predicted secondary structure model), leading to reduced basal frameshifting activity and improved dynamic range. additionally, this design could also reduce the number of base pairs in the lower aptamer stem in the presence of theophylline, thereby affecting theophylline-binding affinity ( ) and raising the concentration of theophylline required to fully stabilize the aptamer conformation. consistently, increased spontaneous rna hydrolysis bands appeared in the lower aptamer stem of switch- rna at higher theophylline concentration (supplementary figure s a and b), whereas no theophylline-dependent in-line cleavage was observed in the corresponding regions of switch- rna (figure and supplementary figure s ). finally, we observed a gap between in vitro and cellular results for switch- . as the theophylline-dependent activation ranges for switch- were well conserved between in vitro translation and t cells, this made it possible to predict that the discrepancy observed in switch- could be caused by the six nucleotides removed from switch- . the predicted free energy for the switch hairpin of switch- (− . kcal/mole) suggests it could be unstable under cell culturing conditions ( °c). by contrast, the switch hairpin (predicted free energy of − . kcal/mole) in switch- should populate significantly at both °c and °c. a more comprehensive analysis of the energy contribution from the formation of stem as well as from theophylline binding would also be required to address this problem. it will be interesting to see if switch- retains its dynamic range in other eukaryotic systems requiring habitation temperatures lower than °c.
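The stability argument above can be made concrete with a two-state (folded versus unfolded) model of the switch hairpin. The sketch below is an illustration under stated assumptions, not the analysis used in the study: it ignores the competing pseudoknot and ligand-bound states, treats enthalpy and entropy as temperature independent, and every numerical value in it is a hypothetical placeholder rather than a value from the text.

```python
import math

GAS_CONSTANT = 1.987204e-3  # kcal/(mol*K)


def delta_g_at(delta_h, delta_s, temperature_celsius):
    """Folding free energy (kcal/mol) at a given temperature.

    delta_h in kcal/mol, delta_s in kcal/(mol*K); both are assumed to be
    constant over the temperature range of interest.
    """
    return delta_h - (temperature_celsius + 273.15) * delta_s


def folded_fraction(delta_g, temperature_celsius):
    """Equilibrium fraction of molecules in the hairpin (folded) state.

    Two-state model: fraction = K / (1 + K) with K = exp(-delta_g / RT),
    written here in the equivalent form 1 / (1 + exp(delta_g / RT)).
    """
    rt = GAS_CONSTANT * (temperature_celsius + 273.15)
    return 1.0 / (1.0 + math.exp(delta_g / rt))


if __name__ == "__main__":
    # Hypothetical hairpin parameters, chosen only to show the trend.
    delta_h, delta_s = -30.0, -0.09
    for temperature in (30.0, 37.0):
        dg = delta_g_at(delta_h, delta_s, temperature)
        print(temperature, round(dg, 3), round(folded_fraction(dg, temperature), 3))
```

Under this model a marginally stable hairpin loses a noticeable share of its folded population as the temperature rises, which is the kind of effect the discrepancy between in vitro and cellular results may reflect.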
a comparison of theophylline-dependent − prf activity between constructs using only one ligand-dependent regulator (theooff -sarspk versus switch- ) indicated that the use of an upstream regulator provided a better dynamic range for theophylline-dependent regulation than that of switch- in vitro. however, the upstream attenuator seems to have reduced regulatory activity in cells, which may be related to its different attenuation activities toward distinct downstream stimulators as observed previously ( ) . nevertheless, combining these two opposite regulators of − prf helps further enhance the dynamic range of theophylline-responsiveness both in vitro and in cells. importantly, given that the only difference between the two aptamers used in theooff and switch- is the base-pairing composition of their lower stems, this suggests that the same ligand can synergistically control a set of negative and positive − prf regulators harboring homologous aptamers. finally, the successful usage of polypeptides encoded by these regulatory − prf modules to create fused functional venus proteins suggests that the modules can be inserted as a linker to bridge the coding sequences of two independent domains of a protein (such as the substrate-binding and catalytic domains of a specific enzyme) for regulation application. with a kd value of . m for interaction between theophylline and switch- rna, it took at least m of theophylline to start observing − prf stimulation in vitro ( figure b -d) and mm of theophylline to reach a − prf efficiency that rivals that of a viral stimulator in the cellular assay ( figure e ). this suggests that the artificial mammalian riboswitch needs to be fully bound by the ligand to effectively stimulate − prf and is consistent with the theophylline concentration required to fully saturate switch- rna in the affinity measurement plot (supplementary figure s c ).
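The relation between the kd value and the ligand concentration needed for near-complete occupancy can be sketched with a simple one-site binding model. This is a hedged illustration: it assumes a single binding site per rna and free ligand in large excess over rna (so that free and total ligand concentrations are approximately equal), the function names are hypothetical, and the concentrations are arbitrary units rather than values from the study.

```python
def fraction_bound(ligand_concentration, kd):
    """Fraction of RNA occupied by ligand for one-site binding.

    Assumes free ligand is approximately equal to total ligand.
    """
    if ligand_concentration < 0 or kd <= 0:
        raise ValueError("concentration must be >= 0 and kd must be > 0")
    return ligand_concentration / (ligand_concentration + kd)


def ligand_for_occupancy(target_fraction, kd):
    """Ligand concentration needed to reach a target bound fraction."""
    if not 0 <= target_fraction < 1:
        raise ValueError("target_fraction must be in [0, 1)")
    return kd * target_fraction / (1.0 - target_fraction)


if __name__ == "__main__":
    kd = 1.0  # arbitrary unit; concentrations below are multiples of kd
    for target in (0.5, 0.9, 0.99):
        print(target, ligand_for_occupancy(target, kd))
```

In this model half of the rna is bound when the ligand concentration equals kd, and each further step toward saturation costs roughly an order of magnitude more ligand, so a requirement for a fully bound riboswitch places the effective concentration well above kd.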
by contrast, the higher concentration required in cellular condition could be due to the cellular uptake efficiency of theophylline and reduced theophylline-binding affinity of switch- at °c. indeed, studies in bacterial riboswitch regulation of transcription termination also indicated that a much higher metabolite concentration is required to observe effective transcription termination in vitro ( ) because a riboswitch is composed of an aptamer and an rna-based gene expression platform. thus, the effective ligand concentration required for gene expression regulation in both bacterial and mammalian riboswitches is much higher than the kd value that saturates % of its receptor, and could vary to different extents due to the involvement of different expression platforms.

gene regulation by riboswitches
riboswitch rnas: using rna to sense cellular metabolism
a decade of riboswitches
riboswitches: discovery of drugs that target bacterial gene-regulatory rnas
engineering biological systems with synthetic rna molecules
design of small molecule-responsive micrornas based on structural requirements for drosha processing
a general design strategy for protein-responsive riboswitches in mammalian cells
stimulation of - programmed ribosomal frameshifting by a metabolite-responsive rna pseudoknot
exploiting preq riboswitches to regulate ribosomal frameshifting
ribosomal frameshifting on viral rnas
achieving a golden mean: mechanisms by which coronaviruses ensure synthesis of the correct stoichiometric ratios of viral proteins
programmed translational frameshifting
viral rna pseudoknots: versatile motifs in gene expression and replication
structural basis for recognition of s-adenosylhomocysteine by riboswitches
cocrystal structure of a class i preq riboswitch reveals a pseudoknot recognizing an essential hypermodified nucleobase
the structural basis for recognition of the preq metabolite by an unusually small riboswitch aptamer domain
structural insights into riboswitch control of the biosynthesis of queuosine, a modified nucleotide found in the anticodon of trna
the role of rna pseudoknot stem length in the promotion of efficient − ribosomal frameshifting
selex-a (r)evolutionary method to generate high-affinity nucleic acid ligands
precise gene fusion by pcr
construction of long dna molecules using long pcr-based fusion of several fragments simultaneously
an atypical rna pseudoknot stimulator and an upstream attenuation signal for - ribosomal frameshifting of sars coronavirus
a dual-luciferase reporter system for studying recoding signals
synergetic regulation of translational reading-frame switch by ligand-responsive rnas in mammalian cells
simple piggybac transposon-based mammalian cell expression system for inducible protein production
relationship between internucleotide linkage geometry and the stability of rna
in-line probing analysis of riboswitches
biochemical and thermodynamic characterization of compounds that bind to rna hairpin loops: toward an understanding of selectivity
direct structural analysis of modified rna by fluorescent in-line probing
a three-stemmed mrna pseudoknot in the sars coronavirus frameshift signal
programmed ribosomal frameshifting in decoding the sars-cov genome
rna dimerization plays a role in ribosomal frameshifting of the sars coronavirus
theophylline restores histone deacetylase activity and steroid responses in copd macrophages
interlocking structural motifs mediate molecular discrimination by a theophylline-binding rna
molecular interactions and metal binding in the theophylline-binding core of an rna aptamer
assembly mechanisms of rna pseudoknots are determined by the stabilities of constituent secondary structures
oligoanalyzer . integrated dna technology
altering molecular recognition of rna aptamers by allosteric selection
a general strategy to inhibiting viral - frameshifting based on upstream attenuation duplex formation
regulation of programmed ribosomal frameshifting by co-translational refolding rna hairpins
an mrna structure that controls gene expression by binding s-adenosylmethionine

the authors thank daniel flynn for reading the manuscript and comments. supplementary data are available at nar online.

key: cord- -worgd xu authors: hatcher, eneida l.; zhdanov, sergey a.; bao, yiming; blinkova, olga; nawrocki, eric p.; ostapchuck, yuri; schäffer, alejandro a.; brister, j. rodney title: virus variation resource – improved response to emergent viral outbreaks date: - - journal: nucleic acids res doi: . /nar/gkw sha: doc_id: cord_uid: worgd xu

the virus variation resource is a value-added viral sequence data resource hosted by the national center for biotechnology information. the resource is located at http://www.ncbi.nlm.nih.gov/genome/viruses/variation/ and includes modules for seven viral groups: influenza virus, dengue virus, west nile virus, ebolavirus, mers coronavirus, rotavirus a and zika virus. each module is supported by pipelines that scan newly released genbank records, annotate genes and proteins and parse sample descriptors and then map them to controlled vocabulary. these processes in turn support a purpose-built search interface where users can select sequences based on standardized gene, protein and metadata terms. once sequences are selected, a suite of tools for downloading data, multi-sequence alignment and tree building supports a variety of user directed activities. this manuscript describes a series of features and functionalities recently added to the virus variation resource. genome sequences have the potential to define evolutionary relationships, elucidate disease determinants and inform public health policy decisions.
the public databases that comprise the international nucleotide sequence database consortium (insdc) are an invaluable resource to a variety of genome-related sequence analysis projects ( ) . this collaboration between the national center for biotechnology information (ncbi), the european bioinformatics institute and the dna databank of japan supports free and unrestricted access to stored sequence data that are maintained as part of the scientific record. as nucleotide sequencing efforts extend into the future, the archival insdc databases will support comparisons between samples collected over generations and provide infrastructure to study the evolution and impact of viruses in real time. despite this potential, there are fundamental issues with archival databases that can only be resolved through resources that provide enhanced data such as the ncbi virus variation resource (http://www.ncbi.nlm.nih.gov/genome/viruses/variation/), which is described in this manuscript. genbank records ( ) and other insdc sequence records are archival by design, and changes to them can be made only by one of the original submitters. hence, it is likely that the gene and protein annotations and information about the source of the sequence will remain unchanged after a sequence is deposited in an insdc database. this is problematic because even if communities develop sequence annotation standards, the pace of biochemical and genetic research effectively guarantees that annotations become outdated as new genetic features are characterized and naming conventions change. for example, while it has been known for some time that flavivirus genomes encode a polyprotein that is cleaved into mature peptides, sometimes with two rounds of cleavage ( ) ( ) ( ) ( ) , recently, several flavivirus proteins have been identified that are translated (at least partially) from alternative reading frames ( ) .
these alternative reading frame proteins and mature peptides, especially the products of the second round of cleavage, are not annotated in the vast majority of current genbank records for flavivirus genomes. the limitations of an archival database can be illustrated by considering a common way in which it might be used: to obtain all of the nucleotide sequences that encode a particular gene of interest. take, for example, the rna-dependent rna polymerase (rdrp) of the ebolavirus. one would need to know that this gene is also sometimes called l-protein or l-polymerase and search the database with all three names to find all relevant protein sequences. in addition, not all genes or proteins are annotated in all database entries, so one would still likely miss some potential sequences. alternatively, a nucleotide blast search could be performed using the rdrp coding region from the zaire ebolavirus reference sequence (refseq accession number nc . ). however, when matching sequences are obtained, there would still be no indication of potential problems with the sequences, such as frameshifts, which may affect the biological function of the resulting protein. even when an annotation pipeline is available to validate retrieved sequences, several additional steps would be needed to associate metadata, such as country of isolation or host, to the sequences. issues regarding the long term usability of sequence data were addressed in the ncbi influenza virus resource ( ) . this resource leveraged machine processing of genbank records, human curation and a unique search and retrieval interface to build a value-added user experience where researchers could search for sequences using defined, standardized terms (table ). an annotation pipeline was added later to standardize gene and protein annotation and nomenclature across all sequences.
this feature not only supports standardized annotation of sequences when submitted, but also provides a mechanism to update previously submitted sequences as new genes and proteins are described. in many ways, the ncbi influenza virus resource paved a path for a variety of other resources that share the common goal of making viral sequence data more accessible ( ) ( ) ( ) ( ) . these include the ncbi virus variation resource where the influenza virus resource data model was extended to include dengue and west nile viruses ( , ) . while the initial release of this resource provided a range of functionalities, the necessity of in-house annotation pipelines and internally developed tools imposed long development cycles, making it difficult to quickly provide new modules in response to emerging outbreaks and associated nucleotide sequencing efforts. here, we document a series of updates and improvements designed to make viral sequences more easily accessible and usable through the virus variation resource, a value-added database, as well as tools that make it simple to analyze genomic relationships. the resource now includes expanded data processing pipelines and analysis tools, and supports selection and retrieval of nucleotide and protein sequences from four new viral groups: ebolaviruses, mers coronavirus, rotavirus, and zika virus (table ). the latest package of updates includes a variety of features designed to improve data usability and ease data retrieval. new processes have been added to parse source descriptor terms from genbank records and map these to controlled vocabulary, and the resource now supports retrieval of sequences based on standardized isolation source and host terms in addition to standardized gene and protein names. a new set of filters has also been developed to identify laboratory isolates, vaccine strains or environmental samples so that they can be included or excluded from searches.
a variety of updates have been made to the search interface and results table to better leverage these features, and a new set of multi-sequence alignment and tree building tools has been implemented to allow robust analysis of retrieved sequences. the ncbi virus variation resource provides users with a convenient way in which to search, download, and analyze viral nucleotide and protein sequences. the resource includes data processing pipelines that retrieve sequences from genbank, provide standardized gene and protein annotation, and map sequence source descriptors (i.e. metadata) to uniform vocabularies. this data processing enables users to select sequences based on standardized gene, protein and metadata terms using a purpose-designed interface. once selected, sequences can then be downloaded with the standardized metadata in a variety of formats or analyzed using web-based alignment and tree building tools. there are currently seven discrete virus variation modules (dengue virus, ebolavirus, influenza virus, mers coronavirus, rotavirus a, west nile virus, and zika virus), and these include a total of nearly nucleotide sequences (see table ). example usages of the resources for dengue virus, ebolavirus, and rotavirus are klema et al. current development efforts have focused on expanding the virus variation model to include more viruses, enhancing the functionality of the resource and providing rapid support to emergent sequencing efforts. this last point has been particularly relevant over the past several years as emerging viral outbreaks of ebola and zika viruses and others have quickly led to large sequencing efforts. there was a clear need to support these sequencing efforts with bioinformatics resources, but timelines prevented traditional development paths where new virus modules and features were added over the course of months.
the first rapid deployment of a virus variation module was during the western african ebola virus outbreak that began in december of . the outbreak was declared a public health emergency of international concern by the world health organization on august , (http://www.who.int/mediacentre/news/statements/ /ebola- /en/). by september, a virus variation resource specific to ebolaviruses was available to help access the sequences that had begun to pour into the insdc databases. similarly, a virus variation resource module was developed in september in response to the outbreak of middle east respiratory syndrome-related coronavirus (mers-cov). most recently, this rapid response model was repeated for the zika virus module, which was put in place in march . this need-based deployment strategy is likely a model for future efforts, and much of our current development is geared toward harmonizing processes and interfaces among individual data and software modules so as to support more virus species within the resource and to respond more efficiently to emergent large-scale sequencing efforts. accurate gene and protein annotation is necessary both to identify sequences of interest and to analyze them. the virus variation resource employs annotation pipelines that support consistent gene and protein naming. initial processing for each annotation pipeline is the same: newly released genbank records are retrieved hourly based on their listed taxonomy. retrieved sequences are compared to nucleotide references for that virus group using blastn, and the best match is determined ( , , ) . this step confirms species taxonomy, identifies segment assignment if applicable and provides information about the lineage, genotype, type or subtype. the references used are listed in table , and sequences that fail to match a reference within established metrics are pushed to a curation interface where they can be reviewed manually.
once a sequence has been matched to a reference, one of three pipelines is employed to determine the span of gene and protein features and to assign standardized names to these features. the first pipeline uses a reference protein guided approach based on the prosplign tool as described previously ( , , ) . here, protein reference sequences are aligned with potential translations of the query sequence. the highest scoring translation alignment to any protein reference is then chosen and parsed to determine that it meets specific criteria: the presence of a start codon, exact matches to mature peptide cleavage sites or premature stop sites. post transcriptional and translational exceptions can be accounted for by this tool by adjusting parameters and allowing multiple transitions from different open reading frames to be assembled into a single alignment. one advantage of this approach is that new viruses can be incorporated by adding new reference protein sequences and adjusting the criteria used for validating a particular translation. such was the case for zika virus annotation where the existing dengue virus pipeline was updated with new zika virus reference sequences (see table ). a second approach to gene and protein annotation was implemented in the ebola virus and mers coronavirus rapid deployment modules. here, there was a need to quickly develop a pipeline that could validate the annotation on genbank records and assign consistent gene and protein names so that these could be accurately used as search criteria. to accomplish this, a blast-based pipeline was developed that compares genes and proteins as annotated on genbank records to reference proteins derived from the best reference nucleotide match. if a protein matches the reference sequence with > % identity as measured by blastp then the presence of this protein is stored. genes are validated in the same manner using blastn and reference nucleotide sequences.
sequences with genes and proteins that cannot be validated are pushed to the curation interface where they can be manually examined. ultimately these approaches support both search and analysis functionality but are not capable of generating standardized annotation across all sequences belonging to a particular virus. our experience has emphasized the importance of accurate annotation pipelines that can be applied to new viruses rapidly in response to emergent needs. though our current pipelines are effective, they are also very specific to particular viruses, and application to new viruses requires much work developing reference sequences, defining processing parameters and manually reviewing annotation results. with that in mind we are now implementing a new, third approach to annotation that can be adapted rapidly when needed and is scalable to multiple virus groups. this new approach is built around two important considerations. first, it uses annotations contained within the so-called reference sequence records ( ) that are created by our group to represent important taxonomic and sequence space groups. the nucleotide and protein sequences within these records can be invaluable for the unambiguous assignment of sequences to defined groups and can also serve as repositories of reference sequence feature annotation maintained by in-house curation efforts, often in collaboration with other scientists ( ) ( ) ( ) ( ) ( ) . second, this approach includes a comprehensive list of error flags that provide extensive information about sequences and can provide warnings about potential problems. this error coding not only allows staff to quickly sort through thousands of annotations during the development of new pipelines, but also gives resource users potential criteria for selecting or filtering sequences.
this new approach was used to annotate polyprotein and mature peptide genomic intervals in west nile virus (wnv), and this annotation will be available soon through the virus variation resource. these annotations were calculated as follows: first, genbank west nile sequences were classified as one of the two common lineages of wnv (lineage or lineage ) using a combination of blastn ( ) against the two refseq sequences and expert knowledge. the principal characteristic that distinguishes lineage from lineage is that the additional protein warf occurs only in lineage wnv genomes and is believed to occur in most of them ( ) . there is some evidence that a small proportion of wnv genomes do not fit neatly into lineage or lineage ( ), but these were classified as lineage in our annotations. second, the annotation pipeline built a covariance model (cm) for each of mature peptides present in the nc refseq annotation and for the mature peptides in the nc refseq. the cms are built using the cmbuild program of the infernal homology search software package ( ) . infernal is typically used for modeling the sequence and secondary structure of rnas, and because the sequences we are modeling lack structure (i.e. basepairs between positions), the cms we created are effectively identical to sequence-only profile hidden markov models. in the current version of our pipeline, each model was derived from the single refseq nucleotide sequence encoding each mature peptide. third, the cms built from the refseq to which that genome was assigned were used to predict each mature peptide coding sequence using infernal's cmscan program. the annotation software then runs a variety of validation checks and produces error codes that assist in curation of sequences. for example, the pipeline checks for the existence of any in-frame stop codons within the predicted regions. if one or more is found, the prediction boundaries are modified to terminate at the -most stop found.
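the in-frame stop codon check described above can be illustrated with a minimal sketch. this is not the resource's own code: the function name, the 0-based coordinate convention and the toy sequence are assumptions made for illustration, and the sketch only reports stop positions, leaving any boundary adjustment to the caller because the text does not fully specify it here.

```python
# illustrative sketch only; names and conventions are hypothetical,
# not taken from the virus variation pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(sequence, start, end):
    """return 0-based start positions of in-frame stop codons.

    `start` and `end` are 0-based, end-exclusive coordinates of a predicted
    coding region; the reading frame is assumed to begin at `start`.
    rna input is accepted by treating u as t.
    """
    region = sequence[start:end].upper().replace("U", "T")
    stops = []
    # step through the region one codon at a time, in frame with `start`
    for offset in range(0, len(region) - 2, 3):
        if region[offset:offset + 3] in STOP_CODONS:
            stops.append(start + offset)
    return stops


# toy sequence: the region 2..17 reads atg gct taa ggc tga
toy = "ccATGGCTTAAGGCTGAcc"
print(in_frame_stops(toy, 2, 17))  # [8, 14]
```

a caller could use the returned positions to raise an error flag or to adjust the predicted boundary, depending on the rule in force.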
coding sequence (cds) coordinates are determined implicitly based on the predicted mature peptide coordinates. lineage (nc ) has three cdss and lineage (nc ) has two cdss. for each cds, the predictions for the corresponding mature peptides that make up each cds are tested for consistency by ensuring that mature peptide coding sequences that are adjacent (separated by nucleotides) in the refseq are also adjacent in the predictions. the start position of the first mature peptide and end position of the final mature peptide that comprise each cds are then used as the start and stop position for that cds. cds annotations are not made if the mature peptide consistency check fails. in addition to checking for early stop codons and the adjacency of mature peptide coding sequences, the annotation pipeline identifies other unusual or unexpected features in each sequence and reports those as 'error codes'. there are possible error codes, which provide an easy way for users to gauge the quality of each sequence and its annotations, and should facilitate the selection of subsets of the sequence data that meet specific user-defined quality standards. a more detailed description of the new annotation pipeline and error flags will be given in a separate manuscript, as well as in the help documents available at the virus variation resource. another important aspect of sequence analysis is to place a given sequence within biological, temporal and geospatial contexts. such associations can provide profound health policy and scientific insights, but unfortunately, descriptors that provide information about the source of nucleotide sequences are notoriously inconsistent. to resolve this issue, the virus variation database loading pipeline parses genbank records, identifies important metadata terms, such as sample isolation host, date, country and source, and maps these to a standardized vocabulary using a hierarchical approach.
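the mature peptide adjacency check and the derivation of cds coordinates described earlier in this passage can be sketched as follows. this is a hedged illustration rather than the pipeline's implementation: the function name, the 1-based inclusive coordinates, the `expected_gap` parameter (the allowed spacing is not given in the text, so it is left configurable) and the toy peptide names and coordinates are all assumptions.

```python
# illustrative sketch only; all names and toy values are hypothetical.

def derive_cds(predicted, adjacent_pairs, members, expected_gap=0):
    """derive (start, end) of one cds from mature peptide predictions.

    predicted      : dict of peptide name -> (start, end), 1-based inclusive
    adjacent_pairs : (upstream, downstream) names that are adjacent in the
                     reference annotation
    members        : ordered names of the mature peptides making up the cds
    expected_gap   : nucleotides allowed between adjacent peptides (assumed)
    returns None if a peptide is missing or the consistency check fails.
    """
    if any(name not in predicted for name in members):
        return None
    for upstream, downstream in adjacent_pairs:
        if upstream not in members or downstream not in members:
            continue  # this pair belongs to a different cds
        gap = predicted[downstream][0] - predicted[upstream][1] - 1
        if gap != expected_gap:
            return None  # adjacency in the reference is not reproduced
    # cds spans from the first peptide's start to the last peptide's end
    return predicted[members[0]][0], predicted[members[-1]][1]


pairs = [("pepA", "pepB"), ("pepB", "pepC")]
consistent = {"pepA": (10, 99), "pepB": (100, 249), "pepC": (250, 400)}
shifted = {"pepA": (10, 99), "pepB": (100, 249), "pepC": (253, 400)}

print(derive_cds(consistent, pairs, ["pepA", "pepB", "pepC"]))  # (10, 400)
print(derive_cds(shifted, pairs, ["pepA", "pepB", "pepC"]))     # None
```

returning `None` mirrors the statement that cds annotations are not made when the consistency check fails; a fuller version would presumably also record which error code applied.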
for example, isolation host terms are first identified in the host field and failing that, then isolate or strain fields, then isolation source, note and finally organism name. this vocabulary mapping strategy follows the insdc practice of separating isolation host from source. in this convention host refers to an organism (and hence has an organism's name that can be mapped to the ncbi taxonomy tree) and isolation source refers to a physical, environmental or local geographic location ( ) . for human pathogens isolation source often refers to a host tissue or bodily fluid, and the virus variation vocabulary mapping strategy attempts to combine similar clinical terms into biologically relevant groups. for example, the parsed terms 'serum,' 'plasma' and 'lymphocytes' are all mapped to the standardized vocabulary term 'blood'. to support more efficient data retrieval, host terms are mapped in a hierarchy, and once a species term such as 'accipiter cooperii' is identified, it is mapped to both the group name 'bird' and the common name 'accipiter.' other metadata terms such as those for disease associations and clinical/laboratory manipulations are more difficult to parse. to this end, laboratory isolates, vaccine strains and environmental samples are identified by searching for key terms, such as 'tissue culture' or 'sewage,' from all fields. disease terms for dengue virus are also found using a similar strategy. in all cases these strategies require extensive examination of sequence records and documentation of specific terms that can be accurately mapped to controlled vocabulary gleaned from established ontologies such as the environmental ontology (https://bioportal.bioontology.org/ontologies/envo) and the infectious disease ontology (https://bioportal.bioontology.org/ontologies/ido).
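the field-priority search and vocabulary mapping described above can be sketched in a few lines. this is an illustration under stated assumptions, not the loading pipeline itself: the field keys, the two small lookup tables and the function names are hypothetical, and the tables contain only the example terms mentioned in the text.

```python
# illustrative sketch only; field keys, tables and names are hypothetical.

# order in which record fields are searched for a host term, per the text
HOST_FIELD_PRIORITY = [
    "host", "isolate", "strain", "isolation_source", "note", "organism",
]

# parsed isolation-source terms -> standardized term (examples from the text)
SOURCE_VOCABULARY = {"serum": "blood", "plasma": "blood", "lymphocytes": "blood"}

# species term -> hierarchical host terms (example from the text)
HOST_VOCABULARY = {
    "accipiter cooperii": {"group": "bird", "common_name": "accipiter"},
}


def find_host(record):
    """return the mapped host from the highest-priority field that matches."""
    for field in HOST_FIELD_PRIORITY:
        value = record.get(field, "").lower()
        for term, mapped in HOST_VOCABULARY.items():
            if term in value:
                return {"species": term, "found_in": field, **mapped}
    return None  # no term recognized; such records would need manual review


def map_isolation_source(raw):
    """map a raw isolation-source string to the standardized vocabulary."""
    return SOURCE_VOCABULARY.get(raw.strip().lower())


# made-up record: the host field is empty, so the strain field is used
record = {
    "host": "",
    "strain": "example/Accipiter cooperii/isolate-1",
    "isolation_source": "Serum",
}
print(find_host(record))
print(map_isolation_source(record["isolation_source"]))  # blood
```

a plain substring match is used here for brevity; handling misspellings and regional spelling variants, which the text mentions, would need a richer term table.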
this process is supported by a curation interface that lists records where parsing fails to identify expected terms, leading to good old-fashioned manual curation and the identification of new terms, common misspellings, regional spelling differences and the manual incorporation of metadata from relevant literature into the virus variation database. in total, these vocabulary remapping strategies can have a profound impact on data usability as large numbers of parsed terms can be mapped to controlled vocabularies (table ). the virus variation annotation and metadata mapping pipelines create standardized terms that can then be leveraged by the resource search interface. a link to this interface can be found on the home page of each virus module, which also includes links to help documents, other ncbi resources, and relevant external resources (for an example, please see http://www.ncbi.nlm.nih.gov/genome/viruses/variation/dengue/). to access the search interface from the module home page, select the link to 'search nucleotide and protein sequences.' here, users can select between protein and nucleotide searches (see figure ). when searching protein sequences, selecting the 'full-length sequences only' filter limits retrieved sequences to those with a complete coding region as determined against the relevant reference. the same filter limits nucleotide searches to full-length genomes, where the completeness of a given genome is operationally determined by comparing the genes/proteins present on a given sequence to those on the relevant, full-length reference genome. currently, noncoding, terminal regions are not included in this determination. during both protein and nucleotide searches, users can define explicitly the genomic regions present on retrieved sequences using drop-down menus that support multiple selections.
additionally, sequences can be filtered using standardized source metadata terms for host, region/country and isolation source using similar pull down menus. the host and country menus are arranged so that aggregate terms are listed in the top portion of the menu and more discrete terms below. in addition to these common filters, there are module-specific filters for species, types, and disease for ebolaviruses and dengue virus respectively. the influenza virus module also provides some module-specific search options. for example, a user can select 'full length only' to include sequences with complete coding regions or 'full length plus' to include sequences with complete coding regions, but no start and/or stop codon. several other specific filters are also available on the influenza module search interface, such as h and n subtypes, minimum or maximum sequence lengths, and inclusion or exclusion of pandemic h n viruses. a second set of functions and filters is included within the 'additional filters' menu. here users can search for keywords in the genbank record deflines or strings within sequences. there are also filters to include or exclude laboratory isolates, vaccine strains, and environmental isolates. one can also select specific rotavirus segment types based on assignment by the rotavirus classification working group ( , ) , or select specific sequences by genbank accession. once the parameters for a specific search are selected, a user can choose to add the query to the query builder and define another search, or they can go directly to the results. several searches can be run and added to the query builder, where the combination of filters and number of retrieved sequences is displayed for each search. the number of unique sequences can be displayed using the 'collapse identical sequences' checkbox. individual searches can then be selected and/or combined and sent to the results page for further refinement and analysis.
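the idea behind the 'collapse identical sequences' option can be sketched as a simple grouping step. the resource describes collapsed sequences as being represented by the oldest sequence; the sketch below is only an approximation under assumptions: the record keys, the use of release date as the measure of age, the case-insensitive comparison and the made-up accessions are all hypothetical.

```python
# illustrative sketch only; record keys and toy accessions are hypothetical.
from collections import defaultdict


def collapse_identical(records):
    """group records that share an identical sequence.

    records: iterable of dicts with 'accession', 'sequence' and
             'release_date' (iso yyyy-mm-dd, so string order is date order).
    returns a list of (representative, group_size) tuples, where the
    representative is the oldest record in each group; groups keep the
    order in which their sequence was first seen.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["sequence"].upper()].append(rec)
    collapsed = []
    for members in groups.values():
        representative = min(members, key=lambda r: r["release_date"])
        collapsed.append((representative, len(members)))
    return collapsed


toy_records = [
    {"accession": "ACC0001", "sequence": "ACGT", "release_date": "2015-03-01"},
    {"accession": "ACC0002", "sequence": "acgt", "release_date": "2014-11-20"},
    {"accession": "ACC0003", "sequence": "ACGA", "release_date": "2016-01-05"},
]
for rep, size in collapse_identical(toy_records):
    print(rep["accession"], size)
# ACC0002 2
# ACC0003 1
```

keeping the group size alongside the representative corresponds to the count shown in an 'identical sequences' column; whether that count includes the representative itself is an assumption here.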
the results page supports selection of sequences from the search set for analysis or download. search parameters are displayed at the top of the results page, and a table displays retrieved sequences and associated metadata. the individual columns within the table can be selected to display specific sets of metadata and hyperlinked genbank and biosample accessions ( ) . biosample records store an extended set of sample descriptors and are linked to sequence read archive (sra) ( ) records, allowing users to easily find sequence read data associated with retrieved genbank sequences when available. one new feature is the ability to collapse identical retrieved sequences for all viruses as described in the preceding section. when identical sequences are collapsed on the query page, they will be represented by a single sequence on the results page with the number of collapsed sequences shown in the 'identical sequences' column (see figure ). clicking the arrows in the 'identical sequences' column displays the individual sequences and makes them selectable. users can now customize sequence titles including the fasta defline of downloaded sequences and tree labels using the 'customize label' tool. the defline can be modified to include various types of data such as the sequence accession number, calculated genomic region, host, isolation source, collection date or country, as well as field-separators such as pipes or slashes. user-selected titles will also be displayed in multi-sequence alignments and trees as described in the following section.
figure . the ebolavirus module search interface with all elements opened and several example searches displayed in the query builder. the search page is divided into three elements. the first element supports selection of protein or nucleotide sequences based on standardized metadata terms generated by processing pipelines described in the text. menus support filtering of sequences based on gene or protein names, host, isolation country and isolation source, and collection and release date ranges can be set with text boxes. additional filters are accessible with a drop-down arrow revealing options for environmental or laboratory isolates, vaccine strains, keyword or sequence string searches, and optional menus tailored to specific viruses. the second element supports searches based on genbank accessions, either using the text box or by uploading a text file of accessions. the third element includes the query builder where the number of sequences retrieved from individual searches can be viewed by clicking one of the 'add query' buttons. when multiple searches are added to the query builder, the total number of unique sequence records is also summed. a checkbox is provided that allows identical sequences to be collapsed and represented by the oldest sequence on the results table. clicking the 'show results' button opens a separate browser tab and displays all of the sequences meeting the criteria in each of the checked queries in the results interface.
users can build multiple sequence alignments or trees from selected sequences, and these in turn can be downloaded in various formats. the influenza module uses previously described tools for these functions ( , , ) , but a new set of tools has been developed for other viruses. multiple sequence alignments are constructed using an optimized version of muscle, and rooted trees are generated using the unweighted pair group method with six base nucleotide or amino acid k-mers ( ) (see figure ). the multiple sequence alignment display includes a navigation map above the alignment, a variation histogram and a consensus sequence. characters are colored to indicate variable positions. the alignment can be downloaded in fasta, clustal, phylip, nexus, or asn. formats.
the tree display supports a variety of layouts including rectangular and slanted cladograms, radial trees and circular trees. the image can be downloaded as a pdf, and the tree file can be downloaded in asn text or binary, newick, or nexus formats. these options are accessible through the 'tools' menu in the viewer. the data labels on multi-sequence alignments and trees can be customized from the results table before the tree is calculated using the 'customize label' options, making it easier to identify the distribution of sample/sequence characteristics. when certain download formats are selected, customized labels will be included in the downloaded files (fasta and asn. for the multiple alignments, and all files for the trees). a url is also provided to make sharing a tree easy. the virus variation resource described here provides a number of features that improve the usability of archival sequence data. the resource now includes more than % of the genbank sequences that are assigned viral taxonomy. further improvement will be dependent on which viruses are added in the future and on updates to the various pipelines, interfaces and tools so that they can further support user needs. our plan is to increase the pace at which new virus species are added to the virus variation resource, and we are currently developing layers of data processing, the least transformative of which could be applied across all viral sequences but still provide basic information about a sequence. the search interface and data displays will be revised so that they better support user-required comparative genomic functions across a much larger number of viral species from the same query page. we also intend to support searches based on author names and more detailed sample information, such as clinical symptoms or laboratory handling.
though we will begin parsing the potentially rich metadata sets from biosample records, the success of this effort will ultimately rest on improved community awareness and more consistent submission of metadata to public databases. given the unbridled growth and clear potential of nucleotide sequencing efforts, one must assume the current virus variation resource is just scratching the surface of future bioinformatic needs. the current resource model is suited to viruses that have experimentally validated annotation, and similar modules are in development for additional viral species. however, the vast majority of viruses do not have strong experimental evidence for protein coding regions, making it difficult to build a virus variation module including an annotation pipeline. in these cases annotation will need to be inferred from related, experimentally studied viruses, requiring new approaches and better ways of standardizing gene and protein information across multiple groups of viruses. our current annotation pipeline development is directed toward these goals, and we intend to extend public access to these pipelines beyond our current influenza virus module. we also intend to reveal resource-derived annotation as tracks on multiple sequence alignments, making annotated sequences available for download and improving access to our data sets. this will also enable users to limit downloads and multiple sequence alignments to selected mature peptides for polyprotein sequences, and trees to be built from selected genomic regions. finally, there are a variety of enhancements to our tools under development. we are developing improved tree visualizations that support better search and markup functions, similar to those currently used in the influenza virus module.
some limitations of the tree function will be addressed at a later time by giving the user the option of viewing the quick tree which is currently offered, or a more sophisticated combination of muscle-multiple sequence alignment and phylogenetic tree. we are also interested in supporting blast-based searches within our data sets to support more precise sequence associations. ultimately, the presumed very large sequencing datasets of the future will require better ways to evaluate data retrieved from searches which, in turn, will require better integration of search functions with data visualizations such as trees. members of the scientific community are encouraged to contact the ncbi help desk (ncbi-help@ncbi.nlm.nih.gov) to make suggestions to improve the virus variation resource, or to assist with establishing annotation or metadata standards.
figure . virus variation resource tree and multi-sequence alignment displays. (a) a sample tree is shown depicting the use of standardized metadata terms as sequence labels. the tree was built from west nile virus complete polyprotein sequences collected since . sequence labels are based on genbank accessions, host, country of isolation and isolation date. left clicking a node highlights the lineage, and hovering over a node with the cursor displays a menu that includes descriptors for that particular sample, including genbank accession and available standardized metadata terms for host, country, isolation source, etc. the menu also includes a function to reroot the tree around that sequence. (b) a multi-sequence alignment is shown for the same west nile polyprotein sequences. individual genbank accessions are listed to the left next to sequences. left clicking the accession displays a menu that includes the standardized metadata label chosen in the results interface, a link to the sequence in genbank, and a function to use that sequence as an anchor for the alignment. differences between residues in a given sequence and the consensus are highlighted in red. a histogram above the alignment shows coverage in blue and the frequency of changes in red.
key: cord- -kjet e authors: lin, zhaoru; gilbert, robert j. c.; brierley, ian title: spacer-length dependence of programmed − or − ribosomal frameshifting on a u( )a heptamer supports a role for messenger rna (mrna) tension in frameshifting date: - - journal: nucleic acids res doi: . /nar/gks sha: doc_id: cord_uid: kjet e programmed − ribosomal frameshifting is employed in the expression of a number of viral and cellular genes. in this process, the ribosome slips backwards by a single nucleotide and continues translation of an overlapping reading frame, generating a fusion protein. frameshifting signals comprise a heptanucleotide slippery sequence, where the ribosome changes frame, and a stimulatory rna structure, a stem–loop or rna pseudoknot. antisense oligonucleotides annealed appropriately ′ of a slippery sequence have also shown activity in frameshifting, at least in vitro.
here we examined frameshifting at the u( )a slippery sequence of the hiv gag/pol signal and found high levels of both − and − frameshifting with stem–loop, pseudoknot or antisense oligonucleotide stimulators. by examining − and − frameshifting outcomes on mrnas with varying slippery sequence-stimulatory rna spacing distances, we found that − frameshifting was optimal at a spacer length – nucleotides shorter than that optimal for − frameshifting with all stimulatory rnas tested. we propose that the shorter spacer increases the tension on the mrna such that when the trna detaches, it more readily enters the − frame on the u( )a heptamer. we propose that mrna tension is central to frameshifting, whether promoted by stem–loop, pseudoknot or antisense oligonucleotide stimulator.
accurate maintenance of the translational reading frame is essential in the production of functional proteins and spontaneous frameshifting occurs rarely, with an estimated frequency (in escherichia coli) of − - − per codon ( ). in some genes, however, mrna elements are present that induce the ribosome to change reading frame at very high frequencies (reviewed in [ ] [ ] [ ]). these sites of programmed ribosomal frameshifting direct ribosomes into an overlapping open reading frame (orf), generating a fusion protein containing the products of both upstream and downstream orfs. most widespread are sites of programmed − ribosomal frameshifting (− fs) where the ribosome slips back one nucleotide (nt) in the -direction on the mrna.
*to whom correspondence should be addressed. tel: + ; fax: + ; email: ib @mole.bio.cam.ac.uk
frameshifting in eukaryotes was first described as the mechanism by which the gag-pol polyprotein of the retrovirus rous sarcoma virus (rsv) is expressed from overlapping gag and pol orfs ( , ) and related signals have since been documented in many other viruses, including the clinically important human immunodeficiency virus types and ( ) (hiv- , hiv- ), human t-cell lymphotropic virus types and ( , ) and the coronavirus responsible for severe acute respiratory syndrome ( ). frameshifting has also been increasingly recognized in conventional cellular genes of both prokaryotes and eukaryotes as well as in other replicating elements, such as insertion sequences and transposons. the mrna signal for − fs is composed of two elements, a slippery sequence with consensus x_xxy_yyz (underlines denote zero frame; x can be any base, y is a or u, z is not g in eukaryotic systems) where the ribosome changes frame, and a downstream stimulatory rna structure, a stem-loop or pseudoknot (reviewed in , ). appropriate spacing (typically - nt) between slippery sequence and stimulatory rna is also required for optimal − fs efficiency ( ) ( ) ( ). there is considerable experimental support for the idea that 'tandem-slippage' of ribosome-bound peptidyl- and aminoacyl-trnas on the slippery sequence occurs upon encounter of the stimulatory rna, with the trnas detaching from the zero frame codons (xxy_yyz) and re-pairing in the − frame (xxx_yyy) ( , ). what actually drives trna movement in frameshifting is uncertain. there is accumulating evidence to suggest involvement of an intrinsic unwinding activity of the ribosome ( ), with the stimulatory rna exhibiting resistance to unwinding, perhaps by presenting an unusual topology.
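The consensus and the tandem-slippage model described above can be made concrete with a short sketch. The function names are illustrative and do not come from the paper; the code only encodes the stated rule (X any base, Y is A or U, Z not G) and lists which codons the P- and A-site tRNAs pair with before and after a single-nucleotide backward slip.

```python
import re

# Consensus X_XXY_YYZ as stated in the text: X is any base repeated three
# times, Y is A or U repeated three times, Z is any base except G.
SLIPPERY = re.compile(r"^([ACGU])\1\1([AU])\2\2[ACU]$")


def normalise(heptamer: str) -> str:
    """Upper-case the sequence and write it as RNA."""
    return heptamer.upper().replace("T", "U")


def is_canonical_slippery(heptamer: str) -> bool:
    """Return True if the heptamer matches the X_XXY_YYZ consensus."""
    return bool(SLIPPERY.match(normalise(heptamer)))


def tandem_slip_codons(heptamer: str):
    """Return ((P, A) zero-frame codons, (P, A) codons after a 1-nt backward slip).

    In the zero frame X_XXY_YYZ is read as XXY (P site) and YYZ (A site);
    after tandem slippage the same tRNAs re-pair with XXX and YYY.
    """
    h = normalise(heptamer)
    if len(h) != 7:
        raise ValueError("expected a 7-nt sequence")
    return (h[1:4], h[4:7]), (h[0:3], h[3:6])


if __name__ == "__main__":
    for site in ("UUUAAAC", "UUUUUUA", "GGGAAAC", "UUUUUCU"):
        print(site, is_canonical_slippery(site), tandem_slip_codons(site))
```

The sketch checks sequence pattern only. It says nothing about frameshifting efficiency, which also depends on the stimulatory RNA and on the spacer.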
failure to unwind the stimulatory rna appropriately has been proposed to induce tension in the mrna leading to uncoupling of the codon:anticodon complexes and realignment of the trnas in the − frame ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ). in recent years it has been discovered that efficient − fs can also be stimulated in some circumstances simply by annealing an rna oligonucleotide downstream of a slippery sequence, at least in vitro ( ) ( ) ( ). this was unexpected as mrna-antisense oligonucleotide (aon) complexes appear to lack the structural features of naturally occurring stimulatory rnas, such as stem-stem junctions, base triplexes or kinks, that have been associated with models implicating resistance to unwinding (reviewed in , ). in an attempt to gain insight into the mechanism of aon-induced − fs, we initiated a study to examine the effect on − fs of modulating the spacing distance between slippery sequence and annealed aon. during the initial in vitro translations carried out to validate the system, we were intrigued to observe 'two' frameshift products in the aon-stimulated frameshift assays. in this article, we describe our examination of the origin of these products. we show that in the experimental system employed, based on that developed by howard and colleagues ( ), both − 'and' − fs can occur efficiently on the hiv- slippery sequence (u a) in response to bound aons. importantly, this effect is also seen when the aon stimulator is replaced by a stem-loop or pseudoknot stimulator. by examining − and − fs outcomes on mrnas with varying slippery sequence-stimulatory rna spacing distances, we found that the spacer-length optimum for maximal frameshifting is different depending upon the kind of stimulatory rna employed, and that − fs is optimal at a spacer length - nt shorter than that optimal for − fs.
we propose that the shorter spacer increases the tension on the mrna such that when the trna detaches, it more readily enters the − frame on the u a heptamer. these experiments provide the first observation of − fs on a eukaryotic viral heptameric slippery sequence and support the view that mrna tension is central to the mechanism of frameshifting, not only with 'traditional' stem-loop or pseudoknot rnas, but also with aon stimulators.
site-directed mutagenesis was performed using the quikchange ii site-directed mutagenesis kit (stratagene) according to the manufacturer's instructions. assessment of in vitro frameshift efficiencies employed plasmids derived from pfscass ( ). this vector contains the bacteriophage sp polymerase promoter driving expression of the influenza a/pr / pb gene, with the minimal infectious bronchitis coronavirus (ibv) frameshifting signal ( ) inserted at the bgl ii site at position of the pb gene. we modified pfscass by introducing a unique mlu i site downstream of the inserted pseudoknot (creating plasmid pfscass ), removed the entire ibv frameshifting signal using bgl ii and mlu i and replaced it with a pair of complementary oligonucleotides encoding the aon-driven frameshifting signal utilized by howard and colleagues ( ), giving plasmid pfshiv-aon (figure ). the frameshift signal in this plasmid comprises the hiv- slippery sequence u a positioned nt upstream of the binding site for a complementary aon. frameshift assays in tissue culture cells employed derivatives of the dual luciferase reporter plasmid p luc ( ). the aon frameshifting signal present in pfshiv-aon was cloned as a pair of complementary oligonucleotides into bam hi and sal i-cut p luc such that expression of the downstream firefly luciferase gene required either a − (p lucaon − ) or a − (p lucaon − ) frameshift at the end of the upstream renilla luciferase gene.
to allow calculation of frameshifting efficiencies, two control plasmids (p lucoinc and p lucop inc) were prepared in which the two luciferases were aligned in-frame in order to obtain readings for normalization of data. expression of the u a slippery sequence-derived − fs product in transfected cells employed plasmid pfsegfp-n . this plasmid was prepared by subcloning a polymerase chain reaction (pcr)-generated dna fragment encoding the u a slippery sequence, a nt spacer and a stimulatory stem-loop structure (see section 'results') into xhoi/bamhi-cut pegfp-n (clontech, genbank accession number u ). in this plasmid, the egfp tag is expressed only in the − frame (following a frameshift event) and the natural start codon of egfp was changed (to tcg) by site-directed mutagenesis to minimize leaky expression of egfp. all plasmid sequences were confirmed by dideoxy sequencing. the morpholino oligonucleotide mo ( -agctcagggaagttgaaggatccca- ) was purchased from gene tools (oregon, usa). the equivalent -o-me oligonucleotide ome ( -agcucagggaaguugaaggauccca- ), a truncated version ( ome; -aguugaaggauccca- ) and an equivalent mer composed of rna bases ( rna) were from thermo scientific (colorado, usa). the primary sequence of mo is identical to that of moab, described by howard et al. ( ). frameshift reporter plasmids were linearized with nco i or bam h and capped run-off transcripts generated using sp rna polymerase as described ( ). messenger rnas were recovered by a single extraction with phenol/chloroform ( : vol/vol) followed by ethanol precipitation. remaining unincorporated nucleotides were removed by gel filtration through a nucaway spin column (ambion). the eluate was concentrated by ethanol precipitation, the mrna resuspended in water, checked for integrity by agarose gel electrophoresis and quantified by spectrophotometry. messenger rnas were translated in nuclease-treated rabbit reticulocyte lysate (rrl; promega) programmed with ~ mg/ml template mrna.
typical reactions were of ml volume and composed of % (vol/vol) rrl, mm amino acids (lacking methionine) and . mbq [ s]-methionine. reactions were incubated for h at c and stopped by the addition of an equal volume of mm ethylenediaminetetraacetic acid (edta), mg/ml rnase a followed by incubation at room temperature for min. samples were prepared for sodium dodecyl sulphate-polyacrylamide gel electrophoresis (sds-page) by the addition of vol of laemmli's sample buffer ( ), boiled for min and resolved by sds-page. dried gels were exposed to a cyclone plus storage phosphor screen (perkinelmer), the screen scanned using a typhoon trio variable mode imager (ge healthcare) in storage phosphor autoradiography mode, and bands were quantified using imagequant tl software (ge healthcare). where used, aons were added to translation reactions on ice at the same time as the mrna, but were not pre-annealed to the mrna. the calculations of frameshifting efficiency (%fs) took into account the differential methionine content of the various products and utilized two formulas. in these, the methionine contents of the stop and frameshift products are denoted by met s , met fs and met fs , respectively; the densitometer values for the same products by i s , i fs and i fs .
rnas derived from nco i-cut pfshiv-aon were translated in rrl (at ~ mg/ml) in the presence of increasing quantities of mo or ome. the products were resolved by % sds-page and visualized by autoradiography. in most lanes, an additional product is evident, thought to be derived from ribosomes that enter the + /− frame (+ /− ) or fall off the template (or are permanently stalled) in the vicinity of the annealed aon (drop-off, d.o.). control translations were also carried out using an mrna derived from pfscass (ref. ; pfscass /ibv pk) which contains the minimal ibv frameshift site. the frameshifting efficiency measured for each signal (to the nearest integer) is indicated below the relevant lanes (%fs; see section 'materials and methods') and takes into account the number of methionines present in each product (stop, ; − fs, ; + /− /d.o., ). due to the close migration of the + /− /d.o. product and the stop product in this experiment, the values represent our best efforts for quantification of each class of event. m; molecular size markers.
frameshifting assays in tissue culture
cos cells were maintained in dulbecco's modification of eagle's medium supplemented with % (vol/vol) fetal calf serum, % (vol/vol) penicillin/streptomycin and % (wt/vol) mm glutamine. plasmids were transfected using a commercial liposome method (lipofectamine , invitrogen). cells were seeded per well in -well plates and grown for - h until % confluency was reached. transfection mixtures (containing plasmid dna, serum-free medium [optimem; gibco-brl] and lipofectamine ) were set up as recommended by the manufacturer and added directly (dropwise) to the tissue culture cell growth medium. the cells were harvested h post-transfection and reporter gene expression determined using a dual luciferase assay system kit (promega). each data point represents the mean value (± sem) from six separate transfections.
plasmid pfsegfp-n expresses a fusion protein comprising n-terminal u a-derived − fs peptide and a c-terminal egfp tag. t cells ( per dish) were seeded onto cm dishes and transfected with pfsegfp-n ( mg per dish) while in suspension using transit (mirus bio llc). after h, the proteasome inhibitor mg was added to the growth medium (to mm), the cells were harvested h later and lysed in ml lysis buffer containing mm tris-hcl, ph , % glycerol, . % igepal ca- , mm nacl, mm dtt, mm pmsf and complete protease inhibitor tablet (roche). after clarification, the supernatant was mixed with ml glutathione sepharose beads and incubated for min at c with rotation to pre-clear the lysate. subsequently, the supernatant was incubated with ml gfp-trap_a (chromotek) for h at c with rotation to bind gfp-tagged proteins.
the beads were washed three times with wash buffer ( mm tris-hcl ph , . % igepal ca- , mm nacl, . mm edta, mm pmsf and complete protease inhibitor tablet), transferred to an ultrafree-mc spin column ( . mm; millipore) and any remaining wash buffer removed by centrifugation. the beads were resuspended in ml of laemmli's buffer ( mm tris ph nucleic acid facility (pnac). the excised band was subjected to reduction with mm tris( -carboxyethyl)phosphine and alkylation by addition of iodoacetamide (to mm), followed by liquid removal and washes with ml mm ammonium bicarbonate with % acetonitrile. the gel pieces were dried in vacuo for min and ml of mm ammonium bicarbonate containing mg/ml modified trypsin (promega) was added for trypsin digestion for h at c. peptides were recovered and desalted using mc ziptip (millipore) and eluted to a maldi target plate using ml alpha-cyano- -hydroxycinnamic acid matrix (sigma) in % acetonitrile, . % trifluoroacetic acid. peptide mass was determined using a maldi micro mx ms (waters) in reflectron mode and analysed with masslynx software. for tandem ms/ms analysis (esi-ms/ms), desalted peptides in % methanol, . % formic acid were delivered to a thermofinnigan lcq classic ion-trap ms using a static nanospray needle (thermo proxeon). peptide masses of interest were manually selected for fragmentation using manufacturer-recommended settings. fragment ions were matched to possible sequence interpretations using ms-product (http://prospector.ucsf.edu/).
our experimental system for studying aon-mediated frameshifting in vitro is based on that of howard et al. ( ) and is shown in figure . we began by confirming that a nt long morpholino ( mo) or -o-me ( ome) aon could stimulate − fs at the slippery sequence of hiv- (u a) when bound nt downstream on the mrna.
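Efficiencies measured in translations of this kind rest on the methionine-corrected band quantification outlined in 'materials and methods'. The two formulas used by the authors are not reproduced in this text, so the following is only a minimal sketch of the general idea, under the assumption that each band intensity is divided by the number of methionines in that product and then expressed as a share of all products. The function name, the dictionary keys and the numbers in the usage example are hypothetical.

```python
def frameshift_efficiencies(intensity: dict, methionines: dict) -> dict:
    """Percentage of ribosomes giving each product, corrected for label content.

    intensity   -- densitometer value per product, e.g. {"stop": ..., "fs1": ...}
    methionines -- number of methionine residues per product (same keys)

    Assumption: signal is proportional to (molar amount x methionine count),
    so dividing by the methionine count gives a relative molar amount.
    """
    molar = {}
    for name, value in intensity.items():
        met = methionines[name]
        if met <= 0:
            raise ValueError(f"product {name!r} needs at least one methionine")
        molar[name] = value / met
    total = sum(molar.values())
    if total == 0:
        raise ValueError("no signal in any band")
    return {name: 100.0 * amount / total for name, amount in molar.items()}


if __name__ == "__main__":
    # Hypothetical values for illustration only, not data from the paper.
    bands = {"stop": 600.0, "fs1": 300.0, "fs2": 100.0}
    mets = {"stop": 6, "fs1": 10, "fs2": 5}
    for product, pct in frameshift_efficiencies(bands, mets).items():
        print(f"{product}: {pct:.1f}%")
```

Without the correction, a long frameshift product carrying more methionines than the stop product would appear more abundant than it is.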
the frameshift reporter mrna was transcribed from nco i-cut pfshiv-aon, a derivative of pfscass ( ; see section 'materials and methods'), and translated in rrl in the presence of increasing concentrations of aon. as can be seen in figure , both non-frameshifted (stop, kda) and − fs product ( kda) were seen, with the − fs efficiency peaking at ~ %, a level similar to that measured for a control, pseudoknot-dependent frameshift signal (pfscass ibv pk; % in this experiment). in the absence of added aon, the baseline − fs efficiency was - %, indicating that the u a heptamer is inherently slippery in rrl, as observed previously ( , ). in control translations with a non-targeting aon, or of mrnas with mismatches in the aon binding site, very little frameshifting was seen, confirming the specificity of aon-mediated frameshifting (data not shown). also evident in the ome titration of figure was an unexpected product migrating just below the stop product. based on the nucleotide sequence of the frameshift region and the position of termination codons in the different reading frames, this protein may correspond to ribosomes that have undergone a + (or − ) frameshift on the u a heptamer. alternatively, it could represent a peptide derived from ribosomes that had irreversibly stalled at the annealed aon, and subsequently dropped off the template (drop-off or d.o.). the unexpected product was also present, albeit to a lesser extent, in the mo titration. given the possibility that it may represent an alternative frameshift product, we examined whether its generation was linked to the homopolymeric nature of the u a slippery sequence by changing the slippery sequence of pfshiv-aon to that present at the ibv (uuuaaac) or simian retrovirus (srv- ) gag/pro (gggaaac) frameshift sites (figure ). with these mrnas, little or no + /− /d.o.
product was evident, indicating that its appearance is most likely linked to the u a slippery sequence rather than to a compromise of elongation arising from the presence of a stably bound aon. examining the aon titrations further (figures and ), it can be seen that the plateau of − fs stimulation with mo was reached slightly earlier (at about mm) than for ome ( - mm). however, ome was the more effective stimulator, promoting % − fs on uuuaaac (c.f. % with mo), % on gggaaac ( % with mo) and, on the assumption that the novel product seen with u a is an alternative frameshift product, a total of % frameshifting on u a ( % with mo). the − fs efficiency engendered by ome on the ibv and srv- gag/pro slippery sequences ( % and %, respectively) was greater than that seen with that of hiv- ( %). however, it is likely that − fs on u a would be greater if a proportion of ribosomes were not entering another frame. in an attempt to disentangle possible origins of the + /− /d.o. product, a variant of pfshiv-aon was prepared (pfshiv-aon stopall) in which three local stop codons in the and + reading frames were changed to sense codons such that the length of the orfs downstream of the u a stretch (in each of the three frames) would allow differentiation of a product of + /− fs from that of a d.o. event based on size (figure ). this experiment revealed that the majority of the + /− /d.o. product appears to be derived from a + /− fs, with a small proportion derived from a product whose size is consistent with prolonged (or permanent) stalling at the u a heptamer (presumably) while the ribosome attempts to unwind the annealed oligonucleotide. consistent with earlier work, with mo there was less + /− fs product and little or no d.o. product, suggesting that morpholino oligonucleotides present less of a barrier to ribosomal elongation than their -o-me counterparts.
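The logic of the stopall design, in which products are told apart by where each reading frame first meets a stop codon, can be illustrated with a small sketch. The sequence in the usage example is a made-up toy, not the pfshiv-aon stopall sequence, and the function names are hypothetical. The sketch also assumes that there is no stop codon between the start of the ORF and the shift site.

```python
STOPS = {"UAA", "UAG", "UGA"}


def codons_until_stop(mrna: str, start: int) -> int:
    """Count sense codons read from `start` up to the first in-frame stop
    codon (or the 3' end of the sequence if no stop is met)."""
    count = 0
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOPS:
            break
        count += 1
    return count


def product_lengths(mrna: str, orf_start: int, slip_pos: int) -> dict:
    """Predicted length in codons of the stop, -1 and -2 frame products.

    slip_pos is the index of the first nucleotide of the zero-frame codon that
    would be read next if no shift occurred; it must be in frame with
    orf_start. A -2 shift and a +1 shift enter the same reading frame, so
    length alone cannot tell those two events apart.
    """
    if (slip_pos - orf_start) % 3 != 0:
        raise ValueError("slip_pos must be in frame with orf_start")
    upstream = (slip_pos - orf_start) // 3  # codons decoded before the shift site
    return {
        label: upstream + codons_until_stop(mrna, slip_pos + shift)
        for label, shift in (("stop", 0), ("-1", -1), ("-2", -2))
    }


if __name__ == "__main__":
    toy = "AUGGCUUUAUGAGCCUAAGGCAUAGCC"  # hypothetical sequence
    print(product_lengths(toy, orf_start=0, slip_pos=6))
```

Because the shifted frames run to different stop codons, the three classes of product differ in predicted size and can be separated on a gel. A drop-off product would instead end near the shift site.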
within the context of pfshiv-aon stopall, three slippery sequence variants were prepared to address the question of whether ome-stimulated + /− fs was strictly dependent on a u-rich slippery sequence. as shown in figure , with slippery sequence u ucu, possessing an a-site codon sub-optimal for frameshifting, both − fs and + /− fs were diminished ( - to -fold) suggesting a requirement for optimal a-site trna re-pairing in each case (see below). with slippery sequence a c, efficient − fs was observed ( %) but the quantity of + /− fs product was reduced (albeit detectable at ~ %). we also tested a c and were surprised to see relatively high expression of both − and + /− fs products in the absence of added aon. the synthesis of these products could potentially be accounted for by transcriptional slippage on a c by sp rna polymerase during synthesis of the reporter rnas ( ). alternatively, on this long homopolymeric a c stretch, the ribosome can frequently lose frame even in the absence of a stimulatory element. a similar observation has been made for a u stretch ( ). significantly, in the presence of ome, expression of the − fs product from the a c mrna rose to a level similar to that seen with the u a mrna and the + /− fs product was also enhanced (to about half the value seen with u a). thus translational frameshifting is certainly taking place on this a-rich stretch and both − fs and + /− fs products are synthesized.
stimulation of efficient − and − fs on the u a heptamer in vitro and in transfected cells
the reduction in + /− fs frequency seen with the u ucu heptamer (figure ) was thought-provoking in that it raised doubts as to whether a 'traditional' + fs event was occurring. in this mrna, the a-site base changes (uua to ucu) would not preclude forward movement of the p-site trna decoding the zero-frame uuu codon onto the overlapping + frame uuu codon, yet production of the + /− product was diminished.
to rule out that this was an effect of the identity of the a-site trna (trna leu versus trna ser ) on + /− fs, we revisited the experiment, but changing the first base of the u a heptamer (to a, c or g) in the context of pfshiv-aon stopall to probe p-site trna re-pairing in the − frame. from previous studies ( , ), we expected that tandem − slippage of p- and a-site trnas would be compromised to a greater or lesser extent, since the p-site codon would be sub-optimal for re-pairing in most cases, whereas a + movement of the p-site trna, as outlined above, would be unaffected. however, as shown in figure , changing the first base to a, c or g effectively abolished the + /− product. the effect on − fs was consistent with earlier work, with a reduction in all cases, especially with c at the first position. these data suggested strongly that the + /− fs product results from − fs and this was confirmed by ms (for convenience, these data are presented at the end of the 'results' section, since acquisition of sufficient trans-frame protein for ms analysis required additional knowledge obtained from experiments outlined in the following sections). the stimulation of both − and − fs on the u a heptamer could also be engendered by an rna-only oligonucleotide ( rna) and a shorter -o-me oligonucleotide ( ome) (supplementary figure s ). thus the capacity to stimulate both frameshift events is not specific to ome. to confirm that aon-mediated − and − fs could also take place in a cellular context, the key elements of pfshiv-aon were cloned into the dual luciferase frameshift reporter plasmid p luc ( ) such that the downstream firefly luciferase gene reported either − (p luchiv-aon − fs) or − (p luchiv-aon − fs) frameshifting, with the spacer length optimized in each case (− fs, nt; − fs, nt; see below). the relevant reporter plasmid together with increasing concentrations of ome were transfected into cos cells and luciferase activities measured h later.
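Dual-luciferase readings of this kind are conventionally converted to a frameshifting efficiency by dividing the firefly/renilla ratio of the frameshift reporter by the same ratio from the matching in-frame control. The text does not spell out the authors' exact calculation, so the sketch below shows the conventional normalization only, with the mean and SEM taken across replicate transfections. The function name and the readings used in the accompanying test are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def dual_luciferase_fs(test_pairs, control_pairs):
    """Frameshifting efficiency (%) from (firefly, renilla) reading pairs.

    test_pairs    -- one pair per transfection of the frameshift reporter
    control_pairs -- one pair per transfection of the in-frame control
    Returns (mean efficiency in %, standard error of the mean).
    """
    # Firefly/renilla ratio expected when every ribosome reaches firefly.
    control_ratio = mean(f / r for f, r in control_pairs)
    values = [100.0 * (f / r) / control_ratio for f, r in test_pairs]
    sem = stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
    return mean(values), sem
```

Normalizing to renilla corrects for differences in transfection efficiency between wells, and normalizing to the in-frame control corrects for the intrinsic difference in activity between the two enzymes.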
both − and − fs were detectable, with peak values of . % (− fs) and . % (− fs) and saturation at around ~ nm aon (figure ). a version of the − fs reporter plasmid with the ibv slippery sequence (p lucibv-aon) was also
based on the published literature, including our own studies of s ribosomes stalled at the ibv frameshift-stimulatory pseudoknot ( , ), we proposed a mechanical model of frameshifting in which a failure of intrinsic ribosomal helicase activity ( , ) to unwind efficiently the stimulatory rna during the translocation step leads to the build up of tension in the mrna and subsequently, breakage of codon:anticodon contacts and realignment of the trnas in the − reading frame. the validity of this model remains to be determined, but one prediction of it is that the magnitude of frameshifting should be influenced by relatively subtle changes to the length of the spacer separating the slippery sequence and stimulatory element. for the ibv frameshift signal, this is known to be the case; altering the natural spacing distance ( nt) between pseudoknot and slippery sequence (uuuaaac) by a single nucleotide either way has a -fold inhibitory effect on frameshifting, and a -fold reduction is seen when nt are added to or removed from the spacer ( , ). regarding aon-stimulated frameshifting, current evidence also supports a requirement for appropriate spacing. in the studies of howard and colleagues ( ), efficient frameshifting was seen with spacer lengths from − to nt, with an optimum at nt (as used in this study). given the observation in the present work of aon-stimulated − and − fs, it was important to ascertain the optimal spacing distance for the two events. to do this, we modified pfshiv-aon stopall to generate plasmids (pfshiv-aon spacer series) with spacers varying in length from to nt (figure panel a).
in addition, for ease of comparison, a single base was added or removed from the mrna downstream of the aon-binding region such that the size of the various translation products (stop, − fs, − fs) was maintained between constructs (figure panel b). in translations of the pfshiv-aon spacer series, with added mo or ome (at mm in this experiment), we observed discrete peaks of frameshifting, with the overall pattern essentially the same, except that frameshifting was more efficient with ome. the − fs product was evident across a broad range, spanning spacer lengths - nt, but with two optima, at nt and at nt (particularly noticeable in the mo titration). the − fs product had an optimal spacing of nt and was evident over the range - nt with ome, but more discrete with mo ( - nt). in the absence of added aon, both frameshift products were detectable although at low levels, and unsurprisingly, there were no spacer effects. we went on to examine whether the − fs product could be engendered by cis-acting stimulatory rnas and the spacing optima for such events. in these experiments, the aon-binding site was replaced by a pseudoknot (the minimal ibv pseudoknot; a functional version of the wild-type ibv pseudoknot with a shorter loop [ ]), generating plasmid pfshiv-pk, or a stable stem-loop structure whose base composition was the same as the two stems of the minimal ibv pseudoknot, generating pfshiv-sl (figure ). a variant of this plasmid with the ibv slippery sequence was also prepared (pfsibv-sl). subsequently, spacing variants of these plasmids were constructed (spacers of - nt) and the mrnas translated in rrl. with pfsibv-sl, only the − fs product was evident (consistent with the uuuaaac slippery sequence being incompatible with − fs), with efficient − fs promoted over a narrow window of spacer length ( - nt), peaking at nt. with pfshiv-sl, both classes of frameshift product were seen.
as with aon-mediated frameshifting, − fs on the u a slippery sequence was observed at most spacer distances, peaking at - nt. the − fs product was more discrete, spanning spacers of - nt with a peak at - nt. also shown in figure (panel d) are the translations of pfshiv-sl-derived mrna with spacers of or nt in which the first base of the slippery sequence was changed to g, a or c. similar to the experiment of figure , we found that disruption of this base inhibited both − and − fs. with the nt spacer, only a trace of − fs is seen, as expected for this spacer length (c.f. figure panel c), and the inhibition of − fs was, if anything, more pronounced than with the nt spacer. to account for the latter, we speculate that with the nt spacer, a greater number of ribosomes have the capacity to frameshift (see figure panel c) and that partitioning occurs, with ribosomes able to enter either the − or − frame. with a nt spacer, the block to − fs brought about by the slippery sequence changes may well direct 'frameshift-competent' ribosomes to partition more into the − frame. however, with an nt spacer that does not promote − fs (see figure panel c), there are no additional ribosomes to partition, and the overall level of − fs is generally lower. the pattern of frameshifting observed with the stem-loop stimulator was also seen with the ibv pseudoknot (pfshiv-pk). again, − fs occurred across a broad range of spacer lengths, peaking at - nt, with the − fs product spanning a more discrete spacer range of - nt with a peak at nt. pseudoknot-induced − and − fs on u a was also examined in constructs containing the srv- gag/pro pseudoknot, one feature of which is a smaller stem than the ibv pseudoknot ( bp c.f. bp). as shown in supplementary figure s , a similar pattern of − and − fs was observed, with optimal spacing distances of - nt (− fs) and - nt (− fs). an interesting aspect of u a-mediated frameshifting was the occurrence of − fs at shorter spacing distances (e.g.
- nt in pfshiv-sl), albeit at a lower efficiency (e.g. figure panel c). this could be a result of frameshifting one codon earlier on the mrna, where the stimulatory rna effect would be most apparent. at this position in pfshiv-sl, the sequence g aau uuu is present, which by chance is compatible (in principle) with tandem − fs. however, this sequence is also present in pfsibv-sl, where − fs at spacers of - nt was less evident (figure panel c). an alternative explanation is that a single p-site trna slip is occurring on u a (and to a lesser extent on uuuaaac) at the shorter spacing distances, which is more favoured with p-site trna phe (u uuu uua) than trna leu (u uua aac). an unexpected observation was the appearance of a third 'recoding' product in translations of pfshiv-pk with a nt spacer. this product (asterisked in figure panel e) corresponds to readthrough of the uga codon immediately downstream of the slippery sequence. thus, at an appropriate spacing distance, a frameshift-promoting pseudoknot can induce stop-codon readthrough. that this was observed at a spacing of nt is consistent with the longer spacing requirements generally observed for those naturally occurring readthrough signals that have a stimulatory rna component ( ).

figure . the effect of slippery sequence-stem-loop (sl) and slippery sequence-pseudoknot (pk) spacing on − and − fs. (a) the spacer of pfshiv-sl and pfsibv-sl was changed from three to nine nucleotides as indicated. in mrnas derived from these plasmids, frameshifting is stimulated by a sl structure (whose length and base-pair composition is identical to the stacked stems of the minimal ibv pk; see text). also shown is a diagrammatic representation of potential translation products of pfshiv-sl mrnas and predicted molecular masses. note that zero frame ribosomes terminate at the stop-codon in the spacer in all cases.
as previously, the sizes of the encoded frameshift products were normalized by appropriate deletion of bases downstream of the sl. slippery sequence variants of pfshiv-sl were also prepared in which the u a sequence was changed to those indicated. (b) the spacer of pfshiv-pk was changed from three to nine nucleotides as indicated. in these plasmids, frameshifting is stimulated by the minimal ibv pk ( ). as with pfshiv-sl and derivatives, zero frame ribosomes terminate at the stop-codon in the spacer in all cases, but note that the reading frames of frameshifted ribosomes differ. (c) messenger rnas derived from nco i-cut pfsibv-sl and pfshiv-sl spacer variants were translated in rrl and products analysed and quantified as in the legend to figure . the numbers above each gel represent the spacer length. the frameshifting efficiency measured for each signal (to the nearest integer) is indicated below the relevant lanes (− % fs; − % fs) and takes into account the number of methionines present in each product (nfs, ; − fs, ; − fs, ). (d) messenger rnas derived from nco i-cut pfshiv-sl variants with nt or nt spacers were translated in rrl and products analysed and quantified as in panel c. also tested were

in an attempt to determine the amino-acid sequence of the presumed − fs product, we carried out a large-scale in vitro translation of a tagged version of pfshiv-aon with optimal ( nt) spacer, but probably due to the low productivity of rrl, it proved impossible to isolate material of sufficient purity and yield for unambiguous n-terminal sequence determination by edman degradation. a similar problem was encountered with a tissue culture expression system in which frameshifting was dependent upon co-transfection of ome. however, it proved possible to purify sufficient + /− product for ms analysis when a frameshift cassette of u a, nt spacer and stem-loop stimulator was expressed as an n-terminal fusion with eukaryotic green fluorescent protein (egfp).
this plasmid (pfsegfp-n , see section 'materials and methods') was transfected into t cells and the − fs product purified by affinity chromatography (utilizing a gfp binding matrix), gel electrophoresis and band excision. following digestion with trypsin, resultant peptides were analysed by maldi mass fingerprinting and subsequent tandem mass spectrometry (esims-ms). peptides corresponding to % of the predicted fusion protein were identified and the sequence spanning the frameshift region was determined as lnflye, indicating − fs (figure ; raw data in supplementary figure s ). the peptide fingerprint data were scanned for other possible events, including the aforementioned p-site + fs (generating lnfye), or sequential − ribosomal frameshifts in consecutive elongation cycles (with the first on g aau uuu and the second on u uuu uua; generating lnffye), but no matches were present. from the determined amino acid sequence, we predict that the − fs product is generated by tandem slippage of p- and a-site trnas. as the slip is − , the p-site trna would re-pair on a codon that includes the base of the u a heptamer (an a in these mrnas) and it should be noted that the post-slippage contacts following this rearrangement are suboptimal, with the codon:anticodon complex ( -auu- / -gmaa- ) mismatched at the first position. altering the base of the slippery sequence to gu a also did not inhibit − fs (supplementary figure s ). however, in − fs, it is known that there is tolerance for mismatches at the first position of the post-slip p-site complex ( ; see also figure ) and this appears to hold true for − fs. studies of programmed − ribosomal frameshifting have focused predominantly on stem-loop and pseudoknot-dependent signals (reviewed in , ).
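the peptide assignments reported above for the frameshift region can be checked with a short simulation. the sketch below is illustrative only: it translates a hypothetical mrna fragment assembled from the sequence elements named in the text (g aau uuu, u uuu uua and the uga codon immediately downstream of the slippery sequence). the leading cug codon and the trailing g are assumptions, chosen so that the toy sequence reproduces the reported peptides; they are not taken from the authors' construct. each recoding event is modelled simply as a change in reading position after a given codon has been decoded, which ignores the underlying trna movements.

```python
# Standard genetic code, bases ordered U, C, A, G ('*' marks a stop codon).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Hypothetical fragment (an assumption, not the authors' construct).
# Zero-frame codons start at indices 0, 3, 6, 9 and 12.
MRNA = "".join(["CUG", "AAU", "UUU", "UUA", "UGA", "G"])


def translate(mrna, shifts=None):
    """Translate from index 0 until a stop codon or the end of the sequence.

    shifts maps the start index of a codon to the number of nucleotides by
    which the reading position is displaced once that codon has been decoded
    (negative values move the ribosome back towards the start of the mrna).
    """
    shifts = shifts or {}
    peptide = []
    pos = 0
    while pos + 3 <= len(mrna):
        residue = CODON_TABLE[mrna[pos:pos + 3]]
        if residue == "*":
            break
        peptide.append(residue)
        pos += 3 + shifts.get(pos, 0)
    return "".join(peptide)


zero_frame = translate(MRNA)                          # no recoding
tandem_minus2 = translate(MRNA, {9: -2})              # slip back 2 nt after UUA
p_site_plus1 = translate(MRNA, {6: +1})               # slip forward 1 nt after UUU
sequential_minus1 = translate(MRNA, {6: -1, 8: -1})   # two consecutive 1-nt slips back

for label, peptide in [
    ("zero frame", zero_frame),
    ("tandem slip of two nucleotides", tandem_minus2),
    ("p-site slip forward", p_site_plus1),
    ("sequential single-nucleotide slips", sequential_minus1),
]:
    print(f"{label}: {peptide}")
```

under these assumptions, the two-nucleotide tandem slip is the only event of the three that yields the observed lnflye (LNFLYE in the code); the forward p-site slip and the sequential single-nucleotide slips give lnfye and lnffye, respectively, matching the alternatives that the peptide data excluded.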
from these investigations, a plausible model of frameshifting has emerged which posits a critical contribution of the stimulatory rna in compromising the activity of the proposed intrinsic helicase activity of the s ribosome, located at the mrna entry channel ( , ). a failure to unwind the stimulatory rna appropriately during the elongation cycle would potentially compromise frame maintenance through the generation of tension in the mrna, effectively pulling the mrna in a -direction while promoting breakage of the trna anticodon:codon interaction and realignment of the trna in the -(− ) direction. in support of this model, ribosomes are known to pause at frameshift-stimulatory stem-loops and pseudoknots, indicating that such structures can act as barriers to elongation ( , , , ( ) ( ) ( ) ( ) ( ) ).

figure . continued. variants of the two plasmids in which the first position of the slippery sequence was changed to g, a or c. (e) messenger rnas derived from bam hi-cut pfshiv-pk spacer variants were translated in rrl and products analysed and quantified as in panel c. the numbers above each gel represent the spacer length. the frameshifting efficiency measured for each signal (to the nearest integer) is indicated below the relevant lanes (− % fs; − % fs) and takes into account the number of methionines present in each product (nfs, ; − fs, ; − fs, ). the asterisk marks the position of the − fs product of pfs cass control mrna (ibv pk) and also an additional product seen in the translation of the pfshiv-pk nt spacer mrna. the size of the latter product is consistent with one that would be synthesised following readthrough of the uga codon that terminates the non-frameshifted product (present in the spacer; see text).

in addition, cryo-em reconstructions of s rabbit ribosomes stalled at the ibv frameshift-promoting pseudoknot have revealed a distorted trna in a hybrid a/p-like state, with the anticodon arm bent markedly towards the a-site of the ribosome ( , , ).
in these reconstructions, density that likely corresponds to the pseudoknot is observed at the mrna entry channel close to the putative s helicase. these features are consistent with the model, with the pseudoknot resisting unwinding during eef -mediated translocation such that tension builds up in the mrna, subsequently placing strain on the trna and resulting in the adoption of a bent conformation ( ). in mechanical unwinding studies, a functional ibv-based pseudoknot has been shown to be a 'brittle' structure, with a shallow dependence of the unfolding rate on applied force and a slower unfolding rate than component hairpin structures (green et al., ). this greater mechanical stability and kinetic insensitivity to force is consistent with a role in resistance to unwinding ( ( ) ( ) ( ) ). indeed, a number of pseudoknot features have been identified that could act in such resistance, such as the unusual topology of stems and loops, the geometry of the junction of the two stems and in some pseudoknots, base triplexes between loop and stem (reviewed in , ). while some or all of these features are readily identifiable in pseudoknot-stimulatory rnas, the situation is not so clear-cut for stem-loop stimulatory rnas. nuclear magnetic resonance (nmr) analysis of the hiv- ( - ) and simian immunodeficiency virus (siv) ( ) structures has revealed a few features (inter-stem kink in hiv- ; stable loop in siv- ) that may be relevant, but some other viral stem-loop stimulatory rnas appear likely to have only regular a-form geometry. the situation is even more convoluted when one considers aon-mediated frameshifting ( ( ) ( ) ( ) ). the mrna-oligonucleotide complexes would appear to lack unusual features like stem-stem junctions, triplexes or kinks and thus the mechanism by which they induce frameshifting is uncertain. it is known that the length of the aon can affect the efficiency of the process ( ); thus, the stability of the mrna:aon duplex plays a role.
however, while the specific chemical modifications present in -o-me, mo, phosphorothioate ( ) and locked nucleic acid ( ) aons can affect binding stability and target specificity, they are not fundamental for activity in frameshifting, as unmodified rna oligonucleotides can also stimulate − fs, at least in vitro ( , ) and also − fs (supplementary figure s ). recently, it has been shown that stem-loop structures can effectively substitute for rna pseudoknots in some circumstances, with frameshift-stimulatory activity driven largely by the thermodynamic stability of the stem, but also influenced by loop size, composition and stem irregularities ( ). the stem-loop employed here is also clearly capable of inducing frameshifting at certain spacer distances (figure ). with this in mind, it is likely that aon-mediated frameshifting is largely determined by stability and duplex length ( ). nevertheless, the encounter between ribosome and aon is clearly different from that of a cis-acting secondary structure. firstly, the s helicase will encounter the -hydroxyl of the annealed aon rather than a constrained duplex (or triplex in some pseudoknots). secondly, the optimal spacing for frameshifting is quite different, being only nt for aon-mediated − fs, but some nt and nt, respectively, for optimal stem-loop- or pseudoknot-mediated − fs. the simplest interpretation of these observations is that the putative s helicase unwinds several base-pairs of the annealed aon before its activity is compromised. further work will be required to confirm a role for the helicase in frameshifting and to understand how it is compromised by what appears to be a regular mrna:aon duplex. it may be that the presence of the free -end of the aon is critical in this regard. it is clear that ome has a significant effect on ribosome progression, as evidenced by the appearance of a polypeptide corresponding to ribosomes stalled at the bound aon. this d.o.
product accounted for up to % of the overall synthesis at high ome or ome concentration ( mm). it has been speculated recently that ribosomal frameshifting frequencies have been generally underestimated owing to a failure to take into account ribosomes that have frameshifted yet failed to progress on the mrna, which are often scored as non-frameshifted products ( ). while the observation here of what appears to be aon-mediated drop-off of ribosomes lends support to the idea that mrna structures can act as roadblocks to the elongating ribosome, such drop-off is far less apparent when ribosomes are challenged with natural frameshift-stimulatory rnas ( ). indeed, in this study, the abundance of the d.o. product was greatly reduced when a mer unmodified rna aon replaced ome (supplementary figure s ). the relatively high proportion of ribosomes that appear to be stalled for an extended period probably reflects the very stable association of the ome aons with the mrna template. one of the unexpected outcomes of this work was the discovery of − fs on the u a heptamer (and potentially, albeit to a lesser extent, on the a c and a c heptamers). initially, we imagined that the protein in question originated by + fs, with the p-site trna phe decoding uuu in the zero frame slipping forwards onto the overlapping + frame codon (also uuu) in a proportion of ribosomes stalled at the aon, stem-loop or pseudoknot, but this was ruled out experimentally. a − fs is consistent with the idea that mrna tension promotes -movement of the trnas in this context. indeed, the spacing analysis of figures and provides further support for this viewpoint. irrespective of the nature of the stimulatory rna, the spacing distance facilitating maximum − fs was consistently ~ - nt less than that promoting − fs. viewed simplistically, with the shorter spacer, the tension on the mrna would be greater, increasing the likelihood of a − shift.
while this hypothesis requires further substantiation, there is a precedent for the importance of mrna tension, from studies of the + programmed frameshifting signal in the e. coli prfb gene encoding release factor . in this system, the interaction of a shine-dalgarno (sd)-like element in the mrna with the anti-sd at the -end of s rrna is important in promoting efficient + fs at a -recoding site ( ). the effect of varying the spacing between sd sequence and p-site codon in the prfb system has been analysed by toeprinting ( ). at a spacer length of nt, noticeably shorter than that found naturally between sd and initiator aug ( nt; ), s pre-translocation complexes could not be formed and instead, the trna added subsequently to fill the a-site moved spontaneously into the p-site, restoring the spacer to the natural length ( nt). these data support a model in which formation of the sd-anti-sd helix in ribosomes stalled at the in-frame uga codon of prfb generates tension on the mrna that destabilizes codon:anticodon pairing in the p site and promotes slippage of the mrna in the -direction. it is plausible that the − and − fs we observe here originate in a similar manner, except that here, the tension pulls the mrna in a -direction, leading to − and − fs. an interesting question is whether − fs is more widely exploited in virus or cellular gene expression. only one example of a − fs signal has been documented to date, involved in the expression of tail assembly chaperone genes in bacteriophage mu ( , ). here, tandem − slippage occurs on a gg ggg cga with the anticodon of the a-site trna arg ( gci , where i in the wobble position is inosine) forming a more stable post-slippage contact with the mrna in the − frame rather than the − frame. another potential − fs signal in trichomonas vaginalis virus has been suggested ( ). in this virus, frameshifting most likely occurs on a conserved cc cuu uuu sequence, compatible with tandem − shifting.
examination of known viral − fs sites possessing a u a or a c slippery sequence, however, reveals that in most cases the spacing distances seem inappropriately long for efficient − fs. in hiv- , where the stimulatory rna forms immediately downstream of a u a slippery sequence ( ), a stop codon in the − reading frame is present directly downstream of the u a heptamer and appears to be present in all isolates of hiv- ( , ). any ribosomes entering the − reading frame would terminate immediately, generating a truncated gag polyprotein lacking viral proteins p and p . as yet, there is no evidence to suggest that such a species is expressed in hiv- infected cells. frameshifting in the expression of mammalian ornithine decarboxylase antizyme has remarkably been shown to be + in mammals and fission yeast, yet − in budding yeast ( , ). precise details of the mechanism of the − fs remain to be elucidated, but as observed by matsufuji and colleagues, lengthening the spacing between the antizyme shift site and its pseudoknot by three bases increased + fs in yeast at the expense of − fs, supporting a link between spacer length and frameshift direction ( ). intriguingly, replacing the -stimulatory element that forms a component of the antizyme frameshifting signal with an annealed aon has been shown also to stimulate − fs when placed with zero spacing ( ). however, this − product is not seen with the natural antizyme frameshift signal. conclusive evidence of a functional − fs signal in virus or cellular genes is therefore still awaited. in the past few years, aons have been increasingly exploited as a tool to examine aspects of ribosomal frameshifting. the observation here of spacer-length-dependent − and − fs events supports the view that mrna tension plays an important role in frameshifting and it will be interesting to see whether the structure of ribosomal complexes stalled at an aon-mrna complex resembles those of ribosomes stalled at other frameshift-stimulatory elements ( , , ).
errors and alternatives in reading the universal genetic code
a gripping tale of ribosomal frameshifting: extragenic suppressors of frameshift mutations spotlight p-site realignment
frameshifting rna pseudoknots: structure and mechanism
pseudoknot-dependent - ribosomal frameshifting: structures, mechanisms and models
expression of the rous sarcoma virus pol gene by ribosomal frameshifting
signals for ribosomal frameshifting in the rous sarcoma virus gag-pol region
characterization of ribosomal frameshifting in hiv- gag-pol expression
characterization of ribosomal frameshifting for expression of pol gene products of human t-cell leukemia virus type i
translation of gag, pro, and pol gene products of human t-cell leukemia virus type
mechanisms and enzymes involved in sars coronavirus genome expression
characterisation of an efficient coronavirus ribosomal frameshifting signal: requirement for an rna pseudoknot
mutational analysis of the 'slippery-sequence' component of a coronavirus ribosomal frameshifting signal
the sequences of and distance between two cis-acting signals determine the efficiency of ribosomal frameshifting in human immunodeficiency virus type and human t-cell leukemia virus type ii in vivo
p-site trna is a crucial initiator of ribosomal frameshifting
mrna helicase activity of the ribosome
mutational analysis of the rna pseudoknot component of a coronavirus ribosomal frameshifting signal
evidence that a downstream pseudoknot is required for translational read-through of the moloney murine leukemia virus gag stop codon
crystal structure of the ribosome at . a resolution
the path of messenger rna through the ribosome
the Å solution: how mrna pseudoknots promote efficient programmed - ribosomal frameshifting
torsional restraint: a new twist on frameshifting pseudoknots
a mechanical explanation of rna pseudoknot function in programmed ribosomal frameshifting
correlation between mechanical strength of messenger rna pseudoknots and ribosomal frameshifting
characterization of the mechanical unfolding of rna pseudoknots
triplex structures in an rna pseudoknot enhance mechanical stability and increase efficiency of - ribosomal frameshifting
interaction of the hiv- frameshift signal with the ribosome
footprinting analysis of bwyv pseudoknot-ribosome complexes
the ribosome uses two active mechanisms to unwind messenger rna during translation
mechanical unfolding of the beet western yellow virus - frameshift signal
efficient stimulation of site-specific ribosome frameshifting by antisense oligonucleotides
novel application of srna: stimulation of ribosomal frameshifting
stimulation of ribosomal frameshifting by antisense lna
a dual-luciferase reporter system for studying recoding signals
cleavage of structural proteins during the assembly of the head of bacteriophage t
transcript slippage and recoding
structure-function analysis of the ribosomal frameshifting signal of two human immunodeficiency virus type isolates with increased resistance to viral protease inhibitors
ribosomal pausing at a frameshifter rna pseudoknot is sensitive to reading phase but shows little correlation with frameshift efficiency
stimulation of stop codon readthrough: frequent presence of an extended rna structural element
ribosomal movement impeded at a pseudoknot required for frameshifting
ribosomal pausing during translation of an rna pseudoknot
kinetics of ribosomal pausing during programmed - translational frameshifting
halting a cellular production line: responses to ribosomal pausing during translation
mrna pseudoknot structures can act as ribosomal roadblocks
the mechanics of translocation: a molecular ''spring-and-ratchet'' system
direct observation of distinct a/p hybrid-state trnas in translocating ribosomes
solution structure of the hiv- frameshift inducing stem-loop rna
structure of the rna signal essential for translational frameshifting in hiv-
solution structure and thermodynamic investigation of the hiv- frameshift inducing element
programmed ribosomal frameshifting in siv is induced by a highly structured rna stem-loop
antisense-induced ribosomal frameshifting
stem-loop structures can effectively substitute for an rna pseudoknot in - ribosomal frameshifting
reading frame switch caused by base-pair formation between the end of s rrna and the mrna during elongation of protein synthesis in escherichia coli
short spacing between the shine-dalgarno sequence and p codon destabilizes codon-anticodon pairing in the p site to promote + programmed frameshifting
determination of the optimal aligned spacing between the shine-dalgarno sequence and the translation initiation codon of escherichia coli mrnas
conserved translational frameshift in dsdna bacteriophage tail assembly genes
recoding in bacteriophages and bacterial is elements
non-canonical translation in rna viruses
selection of peptides interfering with a ribosomal frameshift in the human immunodeficiency virus type
the human immunodeficiency virus type ribosomal frameshifting site is an invariant sequence determinant and an important target for antiviral therapy
mutational patterns in the frameshift-regulating site of hiv- selected by protease inhibitors
reading two bases twice: mammalian antizyme frameshifting in yeast
programmed frameshifting in the synthesis of mammalian antizyme is + in mammals, predominantly + in fission yeast, but − in budding yeast

we thank dr len packman (department of biochemistry, university of cambridge) for his assistance with the mass spectroscopy and associated methodology text.
supplementary data are available at nar online: supplementary figures - .

conflict of interest statement. none declared.

key: cord- - tknscm authors: sztuba-solinska, joanna; diaz, larissa; kumar, mia r.; kolb, gaëlle; wiley, michael r.; jozwick, lucas; kuhn, jens h.; palacios, gustavo; radoshitzky, sheli r.; j. le grice, stuart f.; johnson, reed f. title: a small stem-loop structure of the ebola virus trailer is essential for replication and interacts with heat-shock protein a date: - - journal: nucleic acids res doi: . /nar/gkw sha: doc_id: cord_uid: tknscm

ebola virus (ebov) is a single-stranded negative-sense rna virus belonging to the filoviridae family. the leader and trailer non-coding regions of the ebov genome likely regulate its transcription, replication, and progeny genome packaging. we investigated the cis-acting rna signals involved in rna–rna and rna–protein interactions that regulate replication of egfp-encoding ebov minigenomic rna and identified heat shock cognate protein family a (hsc ) member (hspa ) as an ebov trailer-interacting host protein. mutational analysis of the trailer hspa binding motif revealed that this interaction is essential for ebov minigenome replication. selective ′-hydroxyl acylation analyzed by primer extension analysis of the secondary structure of the ebov minigenomic rna indicates formation of a small stem-loop composed of the hspa motif, a ′ stem-loop (nucleotides – ) that is similar to a previously identified structure in the replicative intermediate (ri) rna and a panhandle domain involving a trailer-to-leader interaction. results of minigenome assays and an ebov reverse genetic system rescue support a role for both the panhandle domain and hspa motif in virus replication.

ebola virus (ebov) can cause large and highly lethal human disease outbreaks. because of the scarcity of effective treatments against ebov, there is an urgent need to identify novel viral inhibitors that target specific viral processes.
as of yet, relatively little is known about the molecular biology of ebov replication, and, in particular, the interplay between host factors and the viral genome. the ebov genome is a negative-sense, single-stranded rna organized into a leader non-coding region (ncr), followed by seven discrete transcriptional units encoding np (nucleocapsid protein), vp (polymerase cofactor and interferon-response modulator), vp (matrix protein), gp , (glycoprotein), vp (transcriptional enhancer), vp (secondary matrix protein, ion channel and interferon-response modulator), l (rna-dependent rna polymerase) and a trailer ncr ( , ). each gene is separated by intergenic regions of varying lengths that modulate transcript levels ( ). the ncrs of rna virus genomes have highly conserved primary and secondary structures ( ) ( ) ( ) ( ). however, the functions of ebov ncrs are not well characterized. replicon systems consisting of the complete trailer and leader sequences and short segments of the l and np protein genes that flank a reporter gene such as chloramphenicol acetyl transferase, green fluorescent protein, or luciferase have been developed ( ) ( ) ( ) ( ). in these systems, transcription and replication from a replicon genome is supported by the ebov proteins np, l, vp , and vp , which are driven from co-transfected expression plasmids. such systems allow examination of virus-specific processes involving the rna and these proteins. addition of vp and gp expression plasmids can drive production of infectious vlps ( ). using the minigenome replicon systems and/or computer-assisted secondary structure predictions, some functions have been attributed to the ncrs. computer modeling of the ebov trailer and leader indicates that a panhandle structure may form between the ncrs starting with base pairing of nucleotide (nt) and . almost identical hairpin structures for the leader of the ebov genome and ri rna are possible ( ).
computer modeling of ebov mini-ri rna and mutational analysis suggested that the terminal nts of the ebov leader are not essential for infection ( ) . chemical probing of the ebov ri rna revealed a stem-loop within the end that is involved in regulating transcription ( ) . additional probing analyses of mini-ri rna suggest that the trailer and leader do not interact, but that the terminal nts of the leader are important for replication ( ) . recently, a direct interaction between vp and the leader has been described, suggesting that vp clamps the rna template to prevent the polymerase complex (vp /l) from dissociation and allows productive transcription initiation in the presence of secondary structures in the template ( , ) . the optimal rna substrate for vp binding is a single-stranded rna that is linked to a stem-loop, as found in the region of the replication promoter element of the ebov genomic leader ( ) . other studies further emphasize the importance of non-terminal ebov rna secondary structures in transcription and replication. for instance, ebov gp mrna editing is dependent on such secondary structures ( ) ( ) ( ) . interrupting secondary structure formation inhibits synthesis of the mrna encoding gp , , which mediates virion entry into host cells. host proteins bind to rna secondary structures to modulate lifecycle processes for many positive-sense rna viruses, such as hepatitis c virus (hcv) ( ) , hepatitis a virus ( ) , poliovirus type ( ) , dengue viruses ( ) , bovine coronavirus ( ) , and murine hepatitis virus ( ) ( ) ( ) . however, few host protein-viral rna interactions have been characterized for negative-sense, single-stranded rna viruses. previously, the la autoantigen (sjögren syndrome antigen b) was shown to interact with the leader of rabies virus ( ) , vesicular stomatitis new jersey virus ( , ) , and rinderpest virus ( ) . 
la autoantigen is an rna polymerase iii transcription factor that shuttles between the nucleus and cytosol and may play a role in mrna stability for translation. interactions between the viral leader and la autoantigen are thought to play a role in replication. replication is increased when the concentration of la autoantigen is increased ( ) indicating a necessary functional role for host proteins to directly interact with the negative-sense viral genome. however, no specific viral rna secondary structures that interact with la or other host proteins have been identified. host dna topoisomerase (top ) facilitates ebov genome transcription and replication ( ) . in the case of retroviruses, such as human immunodeficiency virus- and rous sarcoma virus, top interacts with their viral genome through a genomic rna stem-loop, suggesting that the top -viral genome interaction directly regulates transcription and replication ( ) ( ) ( ) . whether a similar interaction occurs with ebov remains unclear, although, ebov np and l genes both contain a potential top target sequence (tcctt) ( , ) . considering the length ( nt) and structural complexity of ebov trailer, host and viral factors likely interact with trailer rna motifs modulating the virus lifecycle. the goal of this study was to determine the secondary structure of the ebov e- e-enhanced green fluorescent protein ( e- e-gfp) minigenome rna, identify host proteins that interact with the ebov trailer, define their rna binding motifs, and establish a functional role for the protein-rna interactions. e- e-gfp minigenome rna secondary structure and host protein interactions were examined using selective -hydroxyl acylation analyzed by primer extension (shape) ( , ) , antisense-interfered shape (aishape) ( ) , electrophoretic mobility shift assays (emsa), sirna, and mutational analysis, using both the e- e-gfp minigenome system and ebov reverse genetics. 
the human kidney embryonic cell line, t (atcc, manassas, va, crl ) used in e- e-gfp minigenome assays was maintained in dulbecco's modified eagle's medium (lonza) supplemented with % calf serum (sigma-aldrich, st. louis, mo, usa) with % v/v penicillin/streptomycin (lifetechnologies, grand island, ny, usa) at °c in a % co atmosphere. hela cells (atcc ccl- ) were similarly maintained and used for ebov infection assays as described below. grivet (chlorocebus aethiops) vero e cells (atcc #crl- ) were maintained similarly and used for cell lysate preparation for emsa as previously described ( ). lysates were collected by scraping t- flasks of confluent cells into - ml of phosphate-buffered saline (pbs) followed by centrifugation at × g for min. pellets were resuspended in . ml of hypotonic buffer consisting of mm hepes (sigma-aldrich), . mm mgcl (sigma-aldrich), mm kcl (sigma-aldrich) and . mm pmsf (sigma-aldrich) and incubated on ice for min. non-ionic detergent (igepal , sigma-aldrich) was added to a final concentration of . % (v/v), and the suspension was vortexed twice for s followed by incubation on ice for min. the suspension was centrifuged at × g for min to obtain a clarified lysate. supernatant was removed and glycerol was added to a final concentration of % (v/v). supernatants were aliquoted, frozen immediately on dry ice and stored at − °c. a portion of the lysate was quantified using the pierce bca protein assay kit (thermo fisher scientific, waltham, ma, usa). primers listed in table were used for emsa template preparation, protein pull-down assays, and site-directed mutagenesis for minigenome assays as indicated. transcription templates for emsas were generated by polymerase chain reaction (pcr) amplification using the ebov e- e-gfp plasmid as a template. pcr products were resolved by
% low melting temperature agarose gel electrophoresis (seakem gtg, sigma-aldrich) and purified using the qiaquick gel extraction kit (qiagen, valencia, ca, usa). products were dialyzed to remove excess salt and used as templates for transcription with the megascript t kit (thermo fisher scientific, waltham, ma, usa) following the manufacturer's protocol. radiolabeled probes were prepared by transcription in the presence of utp-α p. depending upon the length of the probe, pcr products were purified by either phenol:chloroform extraction and ethanol precipitation or the megaclear transcription cleanup kit (thermo fisher scientific, waltham, ma, usa) and subsequently quantified. five micrograms of protein lysate was incubated with . pmol of radiolabeled probe in a final volume of l. reactions were incubated at °c for min in × reaction buffer consisting of mm kcl, mm tris ph . , mm mgcl , mm dtt, ng poly (i)-(c) (thermo fisher, waltham, ma, usa). reactions with unlabeled specific wild-type (wt) competitor and non-specific trna competitor at , , and × molar excess were carried out in parallel. the l reaction was loaded onto a × tris borate edta buffer, % non-denaturing polyacrylamide gel and electrophoresed for . h. gels were transferred to whatman chr paper (whatman, maidstone, uk), dried at °c under vacuum for h, and exposed to a phosphorimager screen (ge healthcare life sciences, little chalfont, uk). at least three independent assays were performed per probe. typhoon software was used to analyze the gels and determine relative signal intensities of the shifted bands. background-subtracted signal intensity for each sample was determined as a ratio to wt in the absence of competitor. relative average intensities were calculated and compared by paired t-tests using graphpad prism. the - probe ( pmol) was biotinylated using the pierce rna end biotinylation kit (thermo fisher scientific, waltham, ma, usa) according to the manufacturer's protocol.
this probe was purified by size-exclusion chromatography using probequant g- microcolumns (ge healthcare life sciences) and incubated with streptavidin magnetic dynabeads (thermo fisher scientific, waltham, ma, usa) in binding buffer for h at • c with constant rocking. lysate was prepared as described above and incubated with the prepared dynabeads. beads were harvested with a magnetic separator and washed in binding buffer three times. protein was eluted by boiling in × lithium dodecyl sulfate buffer for min. samples were loaded on a - % -(n-morpholino)propanesulfonic acid (mops) polyacrylamide gel and electrophoresed for h at v. the gel was silver-stained followed by mass-spectrometry of the whole lane for control and the - probe samples. peptide fragments were generated, and protein identities were determined by amino acid sequence homology. immunoprecipitation-reverse transcriptase-pcr (ip-rt-pcr) was performed as described previously ( ) . briefly, vero e lysate was incubated with - probe followed by immunoprecipitation with . g of anti-hspa (hsc ) antibody (santacruz biotechnology, dallas, tx, usa) and protein g dynabeads (thermo fisher scientific, waltham, ma, usa) and three rounds of washing with binding buffer. bound material was eluted from the beads in l of elution buffer ( mm tris-hcl, ph . , mm edta, % sodium dodecyl sulphate (sds)) at • c for min. the eluate ( l) was loaded on sds-mops - % gels and probed for hspa . the remainder was digested by g of proteinase k for min at • c, extracted by phenol:chloroform, and precipitated with ethanol. the pellet was resuspended in l of deionized rnase-dnase free water, and pmol of the resuspension were reverse transcribed using superscript iii (thermo fisher scientific, waltham, ma, usa) followed by pcr with hifi taq (thermo fisher scientific, waltham, ma, usa) and the - primer set (table ). negative control samples were processed in parallel by excluding the - rna. 
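The EMSA quantification described in the methods above (background-subtracted band intensity expressed as a ratio to WT without competitor, then compared across replicates by a paired t-test) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis: the function names are hypothetical, the numbers in the usage note are invented, and only the t statistic and degrees of freedom are computed (the p-value lookup that GraphPad Prism performs is omitted).

```python
import math
import statistics


def relative_intensity(band, background, wt_band, wt_background):
    """Background-subtract a shifted band and express it relative to the
    WT, no-competitor lane of the same gel (hypothetical helper)."""
    wt_signal = wt_band - wt_background
    if wt_signal <= 0:
        raise ValueError("WT reference signal must exceed its background")
    return (band - background) / wt_signal


def paired_t_statistic(reference, treated):
    """Return (t, degrees_of_freedom) for paired replicate measurements.

    Pairs are matched by position, e.g. replicate gel 1, 2, 3.
    """
    if len(reference) != len(treated) or len(reference) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [r - t for r, t in zip(reference, treated)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

For example, with three replicate gels in which the WT lane is 1.0 by construction and a competitor lane reads 0.5, 0.4 and 0.6 (made-up values), `paired_t_statistic([1.0, 1.0, 1.0], [0.5, 0.4, 0.6])` returns the t statistic together with 2 degrees of freedom.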
the ebov e- e-gfp minigenome system was kindly provided by dr. kawaoka, university of wisconsin-madison ( , ) . the nt ebov egfp minigenome rna comprises trailer (nt - ), a partial np gene (nt - ), egfp (nt - ), a partial l gene (nt - ) and leader (nt - ) sequences. hek t cells were seeded at . × cells and incubated overnight at °c and % co . cells were transfected using the calcium phosphate method by adding an equal volume of a solution containing . m cacl and minigenome plasmids dropwise to × hbss ( . m nacl, . m hepes, . m na hpo . h o). the plasmids were in the following ratios per well: p e- e-gfp ( g), pcaggsnp ( ng), pcaggsvp ( ng), pcaggsvp ( ng), pcaggsl ( g) and pt ( g). after min incubation, the mixture was added to cells. at h posttransfection, cells were harvested, washed, and resuspended in % pluronic acid and % paraformaldehyde fixative for analysis by flow cytometry (becton dickinson, ca, usa). measurements were gated relative to vp -negative control samples. statistical analysis was performed by one-way anova using graphpad prism . (graphpad software). total rna from minigenome assay cells was extracted with trizol (thermo fisher scientific). five micrograms of total rna was digested with rq dnase (promega) for h at °c. the rna samples were electrophoresed for h at v in a × mops and % formaldehyde agarose gel and transferred to hybond-n+ membrane (ge healthcare life sciences). probes specific for genomic and ri rnas (table ) were labeled with p atp using t polynucleotide kinase (promega). membranes were prehybridized for h in perfecthyb plus buffer (sigma-aldrich), and probe was added and hybridized overnight at °c for genomic and °c for ri, respectively. membranes were exposed to a phosphor screen overnight, imaged on a ge typhoon fla variable mode imager and quantified with imagequant software (ge healthcare). signal was normalized to the vp (−) sample for the genomic probe and to the wt sample for the ri probe.
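The northern blot normalization described above amounts to dividing each lane's quantified signal by that of a reference lane (the VP-omitted control for the genomic probe, WT for the RI probe). A minimal sketch, assuming the quantified band intensities are already available as numbers; the function name, lane labels and values are hypothetical and do not come from the study.

```python
def normalize_lanes(signal_by_lane, reference_lane):
    """Express every lane's signal as a fold value relative to one lane.

    signal_by_lane: dict mapping a lane label to its quantified intensity.
    reference_lane: label of the lane that defines 1.0.
    """
    reference = signal_by_lane[reference_lane]
    if reference <= 0:
        raise ValueError("reference lane signal must be positive")
    return {lane: value / reference for lane, value in signal_by_lane.items()}


# Illustrative (invented) intensities for a genomic-probe blot.
genomic = {"VP_minus": 200.0, "WT": 800.0, "mutant": 400.0}
genomic_fold = normalize_lanes(genomic, "VP_minus")
```

Using the control lane that lacks viral polymerase activity as the reference makes any value near 1.0 read as "no synthesis above the T7-derived background", which is how the genomic-probe results are later interpreted.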
various sirnas targeting hspa (table ) were used to transiently reverse-transfect hela cells ( cells per well, -well format) in triplicate at a nm final concentration, using hiperfect reagent (qiagen) as previously described ( ) . cells were washed the following day. twenty-four hours later, cells were infected with ebov-zsgreen at multiplicities of infection of , or for h. cells were fixed with % formalin (val tech diagnostics), and stained for high-content quantitative image-based analysis. the assay was repeated twice. in three wells on each plate, cells were transfected with a negative control sirna (nt, sicontrol non-targeting sirna # , dharmacon d- - ). cdna clones encoding ebov/h.sapiens-tc/cod/ /yambuku-mayinga (ebov; genbank #af ) wt and variants thereof (a u, a u/a u or the terminal nt deletion variant) were constructed by standard cloning techniques. for recovering recombinant viruses, hek t cells in -well plates were transfected in duplicate with g of full-length ebov cdna clone-encoding plasmid and support plasmids ( g of pcaggs-np, . g of pcaggs-vp , . g of pcaggs-vp , g of pcaggs-l and g of pcaggs-t ) using lipofectamine (invitrogen, carlsbad, ca, usa) according to the manufacturer's instructions. as a negative control, pcaggs-l was omitted from one of the samples. at day post-transfection, supernatants were collected, cell debris was removed by centrifugation, and an aliquot of the virus-containing media (termed passage ) was used to infect a fresh monolayer of vero e cells. one week later, when cytopathic effects were observed in the wt-ebov samples, supernatants (termed passage ) were harvested, cleared by centrifugation, and stored at − °c. vero e cells that did not exhibit cytopathic effects (those transfected with the ebov mutants) were replenished with fresh growth media and incubated for an additional days. supernatants were harvested, cleared by centrifugation and stored at − °c.
all ebov rescue experiments were conducted under biosafety laboratory (bsl- ) conditions. virus-containing supernatants (wt-ebov and mutants thereof) were used to infect vero e or hela cells in well plates ( cells/well). cells were inoculated with virus for h, washed with pbs, and replenished with fresh growth media. cells were fixed h later, blocked with % bovine serum albumin in pbs, and stained with murine monoclonal antibodies against ebov gp , ( d , : dilution) and with alexa fluor -conjugated antibodies ( : dilution, life technologies). infected cells were also stained with hoechst and hcs cellmask deepred (life technologies) for nuclei and cytoplasm detection, respectively. infection rates were determined by high-content quantitative image-based analysis on an opera quadruple excitation high sensitivity confocal reader (model and ; perkin-elmer, waltham, ma, usa) as described ( ) . all infections were conducted under bsl- conditions. rna was extracted using the zymo directzol kit (zymo research) following the manufacturer's instructions, including the optional on-column dnase-treatment. rna was prepared for sequencing and enriched for ebov-specific reads using the illumina truseq rna access kit with modifications to the manufacturer's recommended procedures as described previously ( ) . libraries were sequenced on an illumina miseq desktop sequencer using a version , cycle kit ( × ) and analyzed using in-house scripts. dna templates for in vitro transcription were generated by pcr amplification of plasmids encoding the ebov e- e-gfp minigenome and the corresponding replicon mutants, using primers listed in table . all pcr experiments were performed using platinum ® taq dna polymerase high fidelity (thermo fisher scientific, waltham, ma, usa). transcripts were synthesized with the t -megascript system (thermo fisher scientific, waltham, ma, usa) following the manufacturer's protocol. 
rnas were purified by denaturing m urea/ % polyacrylamide gel electrophoresis, followed by elution and ethanol precipitation. purified rnas were dissolved in sterile water and stored at − • c. six pmol of rna were heated at ºc for min and slowly cooled to • c. the volume was adjusted to l in a final buffer of mm tris-hcl (ph . ), mm nacl, mm mgcl . samples were incubated at • c for min. folded rna was divided into two equal portions ( l each) treated with l of mm -methyl- -nitroisatoic anhydride ( m ) ( ) ( ) . electropherograms were processed using the open-source shapefinder program ver. . following the software developer's protocol, including the required precalibration for matrixing and mobility shift for each set of primers as previously described ( ) . briefly, the area under each negative peak was subtracted from that of the corresponding positive peak. the resulting peak area difference at each nt position was divided by the average of the highest % of peak area differences, calculated after discounting any results greater than the third quartile plus . × the interquartile range. normalized intensities were introduced into open-source rnastructure version . ( ) . locked nucleic acid (lna)/dna chimeras were purchased from exiqon, woburn, ma, usa, and the sequences of which are provided in table . chimeras were added at × molar excess after folding the rna. samples were subsequently incubated at • c for min prior to m treatment (see above). to quantify alterations induced by antisense oligonucleotides, raw data were processed as described above. the secondary structure of the ebov e- e-gfp minigenome rna predicted by rnastructure software version . ( ) and chemical probing data from shape were used to generate three-dimensional ( d) models for the trailer-to-leader panhandle interaction in the wt-ebov genome and variants, a u and a u/a u, using open-source rnacomposer, version . (http://rnacomposer.cs.put.poznan.pl/). 
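The SHAPE normalization described in the methods above (subtract the negative-control peak area from the reagent-treated peak area, discard outliers above the third quartile plus a multiple of the interquartile range, then divide by the mean of the highest remaining fraction) can be sketched as below. This is an illustration only: the function name is hypothetical, and `top_fraction` and `iqr_multiplier` are left as explicit parameters because the values used in the study are not legible in this text. The quartiles here come from Python's `statistics.quantiles`, which may differ slightly from the quartile convention in ShapeFinder.

```python
import statistics


def normalize_shape(plus, minus, top_fraction, iqr_multiplier):
    """Background-subtract and scale SHAPE peak areas.

    plus / minus: per-nucleotide peak areas from the reagent-treated and
    untreated reactions. top_fraction and iqr_multiplier are assumptions
    supplied by the caller, not values taken from the study.
    """
    if len(plus) != len(minus):
        raise ValueError("plus and minus traces must have equal length")
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    diffs = [p - m for p, m in zip(plus, minus)]
    q1, _, q3 = statistics.quantiles(diffs, n=4)
    cutoff = q3 + iqr_multiplier * (q3 - q1)
    # Outliers are excluded from the scaling factor only; they are still
    # reported (as large values) in the returned profile.
    kept = sorted((d for d in diffs if d <= cutoff), reverse=True)
    n_top = max(1, round(len(kept) * top_fraction))
    scale = statistics.mean(kept[:n_top])
    if scale <= 0:
        raise ValueError("scaling factor must be positive")
    return [d / scale for d in diffs]
```

Keeping the outlier rule separate from the scaling step means that a single hyper-reactive nucleotide does not compress the rest of the profile toward zero.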
the quality of predicted models was evaluated using open-source molprobity and king tools ( , ) . identification of regions of the ebov trailer that interact with host proteins was performed by emsa ( figure a ) using the primers listed in table . initially, probes truncated at the end of the trailer were evaluated to identify host protein binding regions (supplementary table). because each of the larger truncated probes was positive, further emsas were performed to define minimal rna regions of the trailer. overlapping probes were evaluated in triplicate (supplementary table). background-corrected signal intensity was compared to wt probe and averaged for the replicates. ribonucleoprotein (rnp) complex formation of probe - was reduced in the presence of × molar excess of trna with a % reduction in binding (p = . ), indicating that the complex can be competed, but only at high concentrations. no reduction in rnp complex formation was observed when competed with wt, unlabeled probe - ( figure b). the - probe was significantly competed with unlabeled wt probe at ×, ×, × and ×, and × trna ( figure c ). the - probe followed a pattern similar to that of the - probe and could be significantly competed with wt unlabeled probe and × and × trna ( figure d ). the data suggest that the rnp complex formation is specific because low to moderate concentrations of wt competitor and the highest concentrations of trna were required for significant competition. the - probe was chosen for use as bait in pull-down assays. host proteins were eluted from the beads, resolved and analyzed by mass spectrometry ( figure a ). rattus norvegicus (peptide score . ), bos taurus (peptide score . ) and mus musculus (peptide score . ) hspa were identified as specifically binding to the - probe by peptide score and selectivity when compared to the control lane ( figure b ).
other host proteins binding specifically to the - probe included atp a, mitofilin, aldehyde dehydrogenase and cops , but less consistently or with lower peptide scores (data not shown). ip-rt-pcr confirmed the hspa : - interaction ( figure c and d). specific - probe pcr products were detected only when cell lysates and - probe were immunoprecipitated with anti-hspa antibody. the - probe could not be detected by pcr in control samples excluding antibody, thus supporting a specific interaction between hspa and the - probe. following identification of hspa , a literature search indicated that hspa interacts with a pentanucleotide motif, auuua ( ) ( ) ( ) ( ) . closer examination of the ebov trailer sequence identified three of these motifs at nt positions - , - and - . here, these are referred to as hspa -binding motifs , and , respectively. based on the emsa data, hspa motif was chosen for further functional analysis. variants of motif were generated containing either a single point mutation or clustered multiple point mutations. the effect of these variants on ebov transcription/replication was examined in the context of the e- e-gfp minigenome assay using t cells. the e- e-gfp minigenome assay is a transfection-based replicon system. expression plasmids encoding the ebov vp , vp , l and np proteins are co-transfected with a plasmid encoding the e- e-gfp minigenome that is driven by a t promoter and an expression plasmid encoding the t polymerase (t pol). the minigenome is described in detail in watanabe et al. ( ) . briefly, the and ends of the ebov genome flank an egfp open reading frame (orf) that serves as a reporter gene, and the minigenome is replication and transcription competent. single-nucleotide changes in motif of the minigenome indicated that an a u ( auuuu ) mutation resulted in a statistically significant (p < .
) decrease in both the total number of gfp-positive cells ( % decrease, figure a ) and in the mean fluorescence intensity of the gfp signal ( % decrease) ( figure b ). the a u/a u double mutant ( uuuuu ) motif resulted in a significant decrease in gfp-positive cells ( %, p < . , figure a and b). northern blots specifically targeting the genome and ri rnas were performed on the wt and mutant minigenome samples to determine which step of the virus lifecycle was impacted by mutagenesis of motif . based on previous identification of vp as essential for replication and transcription in an ebov minigenome assay, ( ), a control omitting expression of vp (vp (−)) was included to determine the level of t pol produced minigenome rna. this control is necessary to determine the level of minigenome produced by the viral proteins. if the mutant does not impact minigenome synthesis, then the northern blot signal intensity will be increased when compared to vp (−) and equal to the wt sample. if the mutant impacts minigenome rna synthesis, then the northern blot signal intensity will be lower than wild type or equal to the vp (−) signal or greater than the wt signal. the data indicate that four of five single point mutations, a u, u a, u a and a u, in motif reduced minigenome rna synthesis. the a u mutant demonstrated the greatest impact and was below the vp (−) signal ( figure c and supplementary figure s a ), however the u a mutant did not impact minigenome rna synthesis. the double mutants a u/u a, a u/u a, and a u/a u also reduced minigenome rna synthesis when compared to wt. these data indicate that motif is involved in minigenome rna synthesis. to assess the impact of motif in ri rna synthesis, a northern blot specifically targeting the ri rna was developed ( figure d and supplementary figure s b ). 
ri rna is not produced by t pol and will only be produced when the appropriate complement of viral proteins and a suitable rna template is present, therefore the northern blot signal intensity for wt ri rna was used as a basis of comparison. mutants u a, a u and a u/a u decreased ri rna synthesis when compared to wt sequence. mutants a u, u a, u a, a u/u a, a u/u a and a u/u a demonstrated variable but near wt levels of ri rna synthesis. these data indicate that motif also plays a role in ri rna synthesis. interestingly, the vp (−) control indicates that vp is also required for ri rna synthesis. transfection-based viral replicons reflect basic viral process, but do not reflect all aspects of the viral lifecycle. therefore, evaluation of the mutations with the greatest effect in the e- e-egfp minigenome assay, a u, a u/a u and an additional - mutant, was carried out using a full-length infectious clone ( table ). the a u/a u and - mutants could not be recovered in four of four replicates. however, the a u ebov variant was rescued in / replicates, although with slower kinetics. sequence analysis of one of three replicates indicated an 'a' insertion at position in % of viral rnas, which is in the gp orf. as expected, wt-ebov was recovered in all replicates. these data support the importance of motif in the virus lifecycle. based on e- e minigenome assay data, the a u mutant was evaluated for rnp formation by emsa, using the - probe. as shown in figure e and f significant competition was observed with cold wt competitor similar to the initial experiments described above. competition with cold trna did not reach significance at or × molar concentration, but was observed at and × molar concentration suggesting specificity of the interaction between hspa and the trailer in this probe. emsa of the - probe containing the a u point mutation indicated a % decrease in rnp formation (p < . , one-way anova std dev . ) ( figure e and f), compared to wt probe. 
rnp complex formation was competed with both unlabeled wt probe and trna, supporting our hypothesis that a of motif is necessary for rnp complex formation (p < . one-way anova, graphpad prism). chemical modulation and sirna screening were performed to verify the role of hspa in the virus lifecycle. cells were pre-treated with oxymatrine, which is used as an hcv inhibitor that modulates hspa mrna stability ( ) , but which failed clinical trial evaluation ( ) . oxymatrine-treated cells were infected at an moi of with ebov. oxymatrine treatment of cells minimally reduced viral titer, and semi-quantitative western blot did not support reduction of hspa expression (data not shown). thus, a previously established sirna screening assay ( ) was used to evaluate four commercially available sirnas against hspa . on-target sirna reduced relative ebov infection when compared to no sirna and negative control sirna at mois of , and ( figure g ). shape interrogates rna secondary structure by examining backbone flexibility (directly related to base pairing) at each nucleotide position via reactivity with a specific electrophilic reagent ( ) . we applied this technique to the ebov e- e-gfp minigenome rna, which contains all essential cis-acting elements for efficient gfp translation and ebov minigenome replication in vitro ( , , ) . reactivity to the shape reagent m for ebov minigenome rna is shown in figure a (supplementary figure s ) . the most reactive, and thus the least structurally constrained, residues have a reactivity > . . nucleotide positions with reactivities < . are indicative of fully base-paired residues. minimum free-energy modeling using shape data as pseudo free energy constraints indicated formation of a panhandle duplex structure between trailer and leader (figure a ). this long-range rna-rna interaction spans the first and last nt of trailer and leader, respectively.
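To illustrate how SHAPE reactivities enter minimum free-energy modeling as pseudo free energy constraints, the sketch below uses the widely used logarithmic form, in which each nucleotide in a stacked base pair contributes slope × ln(reactivity + 1) + intercept. Whether this exact form and which slope and intercept were used in this study is not stated in the text, so both are explicit parameters here; the function name is hypothetical, and clamping negative reactivities to zero is a choice made for this sketch.

```python
import math


def shape_pseudo_energy(reactivity, slope, intercept):
    """Pseudo free energy (kcal/mol) for one nucleotide.

    slope and intercept are assumptions supplied by the caller.
    Highly reactive (flexible) positions get a positive penalty for
    pairing; unreactive positions get a small bonus (the intercept).
    """
    clamped = max(reactivity, 0.0)  # sketch-level choice for negative values
    return slope * math.log(clamped + 1.0) + intercept


# Illustrative parameter values only (not taken from this study).
penalties = [shape_pseudo_energy(r, slope=2.6, intercept=-0.8)
             for r in (0.0, 0.3, 1.0)]
```

With these illustrative parameters an unreactive nucleotide is favored for pairing by the intercept alone, and the penalty grows monotonically with reactivity, which is why strongly modified positions are pushed into loops in the predicted structure.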
the trailer-to-leader panhandle is interrupted by an internal bulge on the leader side (nt - ) and a three-way - and - ) . the three-way junction embeds a short stem-loop (nt - ) containing hspa motif . the region upstream of the leader forms a stem-loop structure (nt - ) with an au-rich apical loop, similar to the ri hairpin previously identified as a putative vp binding site ( ) . residues a -u of this hairpin reacted more strongly with m than residues u -u . thus, the possibility exists that this hairpin (nt - ) forms an h-type pseudoknot structure with the upstream complementary region (nt - ). pseudoknots play a critical role in many biological activities, from regulation of viral gene expression to catalysis of mrna splicing and repeat-addition processivity of human telomerase ( , ) . subsequent experiments involving lna-directed displacement of the putative pseudoknot interaction (aishape) ( ) did not change reactivity of apical loop residues or flanking regions (data not shown). conceivably, weaker reactivities of the apical loop of this hairpin can be attributed to intraloop base-stacking interaction between a:u residues. hspa motifs (nt - ) and (nt - ) are located in an unstructured single-stranded region forming the internal loop (nt - , nt - ) of a hairpin preceding the gfp sequence ( figure a ). importantly, the gfp sequence base pairs independently (supplementary figure s ) . to validate the trailer-to-leader interaction, we applied aishape ( ) . experimentally, one strand of an rna duplex is displaced by hybridizing an antisense dna/lna oligonucleotide, and disrupted rna-rna interactions are characterized by enhanced m reactivity of the displaced nucleotides. two chimeric lna/dna oligonucleotides, -lna and -lna (table ) , were hybridized to the leader (nt - and nt - ) to disrupt base pairing interaction with the trailer (wt + lnas). in the control sample, these oligonucleotides were omitted (wt). 
aishape indicated changes in chemical reactivity within trailer residues ( figure b ). in particular, nts c -a and u -a of the experimental sample (wt + lnas) were more sensitive to m modification (median reactivity . ) than the corresponding residues in the control sample (wt, median reactivity of . ). moreover, nts a -u of wt + lnas were less reactive (median value . ) than their wt counterparts (median value . ). the rnastructure algorithm, when provided with pseudo energy constraints retrieved from aishape experiments, predicted that in the absence of its interacting partner, the trailer forms an independent stem-loop structure involving nts c -g ( figure b ). the stem of this hairpin is mainly composed of a:u base pairs interrupted by only two g:c pairs and one g:u wobble. the specificity of the aishape strategy was verified by formation of an extensive barrier to reverse transcription at the sites of lna/dna hybridization, as revealed during capillary electrophoresis separation of the reverse transcription products (data not shown). the m reactivity profile in the presence of both lnas indicated minor off-site changes, which could reflect perturbation of tertiary contacts within ebov minigenome rna ( supplementary figure s ). we performed structural analysis of ebov minigenome rna mutants - , a u and a u/a u to address how changes introduced within the trailer affect the conformation of the trailer-to-leader panhandle. although the lack of trailer sequence induced structural rearrangements in deletion mutant - ( figure d ; supplementary figures s and ), certain structural domains specific for wt ebov minigenome rna remained unchanged. these include a bifurcated stem-loop at the terminus (nts - ), a stem-loop occluding hspa motifs and (nts - ) and the hairpin structure previously proposed as vp binding site (nts - ) ( figure d ).
in contrast, the a u mutation within motif affected only the conformation of the trailer-to-leader panhandle ( figure e and supplementary figure s ). this single nucleotide substitution eliminated formation of the nt - hairpin (containing hspa motif in wt rna), and introduced an a:c mispair (nt a , c ), and an asymmetric internal loop (nt - and ) ( figure e) . similarly, the a u/a u point mutations caused structural rearrangements of the panhandle duplex, again eliminating the nt - stem-loop and introducing a mismatch at position and an additional internal loop (g -u and c -u ) ( figure f and supplementary figure s ).
figure . three-dimensional projection models of the trailer-to-leader interaction in ebov minigenome wt, a u and a u/a u rnas. the nt of internally deleted ebov wt and mutant minigenome rna are depicted. specific cis-acting motifs and domains are color-coded as shown in the key. the models indicate that the previously defined vp -binding site stem-loop is near to hspa motif . a u and a u/a u mutations affect the spatial arrangement between these stem-loop structures.
a d structural model of the ebov trailer and leader was generated using rnacomposer ( ) (figure ). since the server does not accept sequences > nts, a -nt derivative sequence was created by deleting the sequences downstream of nt and upstream of nt and closing the remaining short helical region (g -a and u -c ) with a -g-a-g-a tetraloop. a dot-bracket notation generated by rnastructure software was manually adjusted to account for this deletion, and subsequently provided to rnacomposer. ten d rna models were generated and analyzed, taking into account their secondary structure topology, sequence homology, structure resolution, and free energy. in addition, the quality of predicted models was evaluated using molprobity ( . ) and king tools ( , ) . the models with the best topological score are presented in figure , indicating the position of trailer, leader, hspa motif and the nt - hairpin.
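The manual dot-bracket adjustment described above (removing an internal stretch of sequence and closing the remaining helix with a tetraloop) can be expressed programmatically. The sketch below is a hypothetical helper, not the procedure the authors used: it assumes plain dot-bracket notation without pseudoknots, 0-based coordinates, and that the two nucleotides flanking the deletion are paired with each other, so that the inserted loop caps an existing helix.

```python
def pair_table(structure):
    """Map each paired position to its partner; raise if unbalanced."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % i)
            j = stack.pop()
            pairs[i], pairs[j] = j, i
        elif ch != ".":
            raise ValueError("unsupported symbol %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' in structure")
    return pairs


def excise_and_cap(sequence, structure, start, end, loop_seq="GAGA"):
    """Replace sequence[start:end] with an unpaired loop.

    Requires position start-1 to pair with position end, which guarantees
    every pair inside the removed region is removed together.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure lengths differ")
    if not 0 < start < end < len(sequence):
        raise ValueError("deletion must be internal")
    if pair_table(structure).get(start - 1) != end:
        raise ValueError("flanking nucleotides must pair with each other")
    new_seq = sequence[:start] + loop_seq + sequence[end:]
    new_struct = structure[:start] + "." * len(loop_seq) + structure[end:]
    pair_table(new_struct)  # re-validate bracket balance
    return new_seq, new_struct
```

For instance, `excise_and_cap("GGGGGGAAAACCCCCC", "((((((....))))))", 2, 14)` keeps the two outermost base pairs and returns `("GGGAGACC", "((....))")`, a shortened sequence and structure that are consistent with each other and could be passed to a modeling server with a length limit.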
the wt model reveals close spatial arrangement of hspa motif (green) with the hairpin (blue). the presence of trailer a u and a u/a u mutations changed the distance between these rna motifs. the model presented here provides useful insight into how rna substructures may interact during the course of ebov replication. accumulation of viral proteins during infections often leads to cellular stress and upregulation of heat shock protein expression ( ) . the role of hsps in the viral lifecycle is only just being unveiled ( ) ( ) ( ) . numerous studies also indicate that specific interactions of host proteins with viral secondary and tertiary rna motifs modulate the lifecycle ( , ) . in this manuscript, we provide novel insight into the structural conformation of ebov ncrs, identify a specific rna motif within the trailer that interacts with a host cell chaperone, hspa , and define the role of this protein/rna interaction in viral replication. emsa (figure ) , followed by protein pull-downs, mass spectrometry, and ip-rt-pcr ( figure ) allowed us to identify and confirm that hspa interacts with the first nts of the ebov trailer region. hspa , a member of the hsc family, is a host chaperone that assists mis-folded polypeptide chains to (re)fold into functional proteins and is crucial for cell survival during stress ( ) . hsc also interacts with hcv particles; hsc downregulation significantly reduced virus production either via modulation of viral assembly or release ( ) . in addition, hsc was shown to be part of a protein complex that includes hcv ns a and host proteins hsp and hsp ( ) and was demonstrated to assist the ns a/hsp complex essential for hcv ires-mediated translation. hsc is also recruited to reovirus viral factories ( ) and is present in influenza a virus ( ) , and vesicular stomatitis indiana virus viral particles ( ) . the pentanucleotide motif auuua is an hspa interacting motif ( ) ( ) ( ) ( ) . 
sequence analysis of the e- e-gfp minigenomic rna suggested three putative hspa motifs in the trailer. however, the exact structural conformation of these regions was undefined. thus, using chemical acylation techniques and site-directed mutagenesis, we determined the secondary structure of e- e-gfp ebov minigenome rna and demonstrated that its and ncrs form complex long-range rna interactions including a trailer-to-leader panhandle ( figure a and supplementary figure s ). closer analysis indicates that motif forms part of a small stem-loop (nts - ) that is in near proximity to the vp binding stem-loop (nts - ). to further investigate the role of the terminal nt in the - interaction and the importance of hspa motif , a deletion mutant was evaluated by chemical probing and in the context of an ebov infectious clone. shape data indicate that deleting trailer sequences caused structural changes within the e- e-gfp minigenome rna, mainly due to the release of the leader sequences for base pairing with complementary regions ( figure d ; supplementary figures s and ). these structural rearrangements did not affect the topology of other rna domains. experiments with the infectious clone system indicated that ebov mutant - was not viable (table ). since hspa motif is located within the terminal nts of the trailer, it is difficult to unambiguously determine if the panhandle structure or the hspa motif is necessary for virus growth, but such data do reinforce the importance of hspa motif . probing analysis of the a u and a u/a u minigenome rnas showed that hspa motif point mutations affect the panhandle conformation, eliminating the nt - stem-loop containing hspa motif ( figure e and f). no other rna domains outside of the - interaction were changed (supplementary figures s and ) , suggesting that the trailer-to-leader interaction forms an independent structural/regulatory element essential for efficient virus replication. 
hybridization of lna/dna chimeras to the leader did not induce extensive structural changes within upstream domains. these data suggest that the trailer-to-leader panhandle might indeed function as an autonomous element (supplementary figure s ) . the infectious clone system experiments also indicated that ebov mutant a u/a u was not viable. on the other hand, the a u mutant was rescued but with slower kinetics. sequence analysis indicated an a insertion into the gp open reading frame in % of the viral rnas at position . the effect of this insertion, if any, on viral replication remains to be determined (table ) . automated d structure modeling of the trailer-to-leader interaction in the wt e- e-gfp minigenome rna implied a close association of hspa motif with the - hairpin, recently shown to bind vp ( ) (figure ). the first nts of the interacting - regions form an extended arm that is positioned orthogonally to the interacting hspa / - motifs. the spatial proximity of these elements suggests a potential molecular bridging between hspa /vp and their specific rna-binding motifs. a u and a u/a u mutations changed the spatial arrangement between these stem-loop structures, possibly disrupting these complex long-range interactions (figure ). mutational analysis of hspa motif further confirmed its importance in the ebov lifecycle ( figure a and b). in particular, single- and double-point mutations indicated that residue a plays an essential role in viral replication and transcription. furthermore, northern blot data indicated a reduction in both minigenome and ri rna production of the a u mutant ( figure d and supplementary figure s ). the a u/a u double mutation also resulted in reduced ri rna and minigenome synthesis. since the ri rna serves as a template for synthesis of the minigenome rna, reduction in minigenome rna most likely directly affects production of ri rna.
in addition, emsa analysis of the a u mutant demonstrated reduced host protein/viral rna/rnp complex formation (figure e). it is interesting that sirna-directed (figure g) or chemical inhibition (data not shown) of hspa moderately reduced infectivity and viral titer, respectively, whereas ebov genome mutagenesis that targets the hspa binding motif led to nonviable virus. taken together, our structural probing, mutagenesis and reverse genetics data indicate that the conformation of the ebov trailer and its interactions with host cell hspa are essential regulators of the ebov lifecycle. our studies indicate that hspa plays a critical role in production of viral genomic and ri rnas. during transcription and replication, the viral genome must become a template for synthesis of progeny rna. this process involves uncoating or at least relaxation of the viral rnp. host factors likely interact with the viral rnp to maintain proximity of necessary factors and aid complex formation to initiate and complete these processes ( ) ( ) ( ). it is likely that many of these interactions are weak and transient, acting as scaffolds for virus-driven processes. these transient interactions may assist proper rna folding for ri rna synthesis, transcription complex formation, or proper conformation of packaging signals to drive discrimination by viral components, ensuring effective packaging of full-length genomes and reduction of defective particles.
we are grateful to peter b. jahrling, jennifer sword, cindy allan, krisztina b. janosko, stacy l. agar, richard bennett, michael r. holbrook and the entire evps and irf-frederick team for their support and fruitful discussions about these experiments. we thank dr john l. casey and brittany l. griffin for their discussions about shape. we thank dr y. kawaoka, university of wisconsin for the ebov e- e egfp minigenome system. we thank laura bollinger and jiro wada for editing the manuscript and for preparing figures, respectively. the content of this publication does not necessarily reflect the views or policies of the us department of the army, the us department of defense, us department of health and human services (dhhs) or of the institutions and companies affiliated with the authors funding supplementary data are available at nar online.nucleic acids research, , vol. , no. key: cord- - su oqbz authors: elmén, joacim; thonberg, håkan; ljungberg, karl; frieden, miriam; westergaard, majken; xu, yunhe; wahren, britta; liang, zicai; Ørum, henrik; koch, troels; wahlestedt, claes title: locked nucleic acid (lna) mediated improvements in sirna stability and functionality date: - - journal: nucleic acids res doi: . 
/nar/gki sha: doc_id: cord_uid: su oqbz therapeutic application of the recently discovered small interfering rna (sirna) gene silencing phenomenon will be dependent on improvements in molecule bio-stability, specificity and delivery. to address these issues, we have systematically modified sirna with the synthetic rna-like high affinity nucleotide analogue, locked nucleic acid (lna). here, we show that incorporation of lna substantially enhances serum half-life of sirnas, which is a key requirement for therapeutic use. moreover, we provide evidence that lna is compatible with the intracellular sirna machinery and can be used to reduce undesired, sequence-related off-target effects. lna-modified sirnas targeting the emerging disease sars show improved efficiency over unmodified sirna on certain rna motifs. the results from this study emphasize lna's promise in converting sirna from a functional genomics technology to a therapeutic platform. double-stranded small interfering rna (sirna) molecules have drawn much attention since it was unambiguously shown that they mediate potent gene knock-down in a variety of mammalian cells ( ). this work followed the discovery of the phenomenon of rna interference (rnai) in caenorhabditis elegans ( ) and the demonstration of sirnas as possible mediators of gene regulation in other eukaryotes ( ) ( ) ( ). sirna works through watson-crick base-pairing of an rna guide sequence to the target rna followed by specific degradation or translational block of the target [reviewed in ( , )]. as such, sirna technology offers the means to rationally design gene-specific inhibitors and in recent years such molecules have found widespread use as tools in functional genomic studies in mammalian cells in vitro. however, application of sirnas in vivo and their possible use as therapeutics still face several critical hurdles that have not yet been comprehensively addressed. 
for instance, sirna delivery, bio-stability, pharmacokinetics and specificity, including off-target effects, will be major topics of further investigation. many of these issues are not new to oligonucleotide-based technologies being developed as drug platforms, such as antisense, aptamers and ribozymes. here, critical advances have come from the development of nucleotide analogues with improved properties over natural nucleotides and recently several of these, such as phosphorothioates ( , ), -o-me ( , ), -o-allyl ( ) and -deoxy-fluorouridine ( , ), have been examined as a means to improve the prospect for sirna therapy. briefly, these studies have demonstrated that sirnas can accommodate quite a number of modifications at both base-paired and non-base-paired positions without significant loss of activity. moreover, some of the modified sirnas were found to exhibit enhanced serum stability ( ) and longer duration of action ( ). modification of the end of the antisense strand with -o-allyl ( ) or chemical blocking of the -hydroxyl group ( ) resulted in a dramatic loss in activity, consistent with the proposed in vivo requirement for end phosphorylation. also, more substantial modifications, such as total modification by -o-me ( ) or ps modifications of every second or all internucleoside linkages ( , ), increased cytotoxic effects and resulted in a significant decrease or complete loss of activity. locked nucleic acid (lna) is a family of conformationally locked nucleotide analogues which, amongst other benefits, confers truly unprecedented affinity and very high nuclease resistance on dna and rna oligonucleotides ( ) ( ) ( ) ( ) ( ). when used in antisense constructs, lna has been reported to combine substantially increased potency in vitro and in vivo with minimal toxicity ( ) ( ) ( ) ( ) ( ) ( ). also, the commonly used lna contains a methylene bridge connecting the -oxygen with the -carbon of the ribose ring. this bridge locks the ribose ring in the -endo conformation characteristic of rna ( ) ( ) ( ) ( ). as such, lna is a prime candidate for introducing critical new features into sirnas without perturbing the overall a-form helical structure they require for activity ( ). recently, braasch et al. ( ) provided the first evidence that lna can be used to increase the thermal stability of sirna molecules without affecting their function. in this report, we expand on these early findings by systematically pursuing the construction and biological testing of lna-modified sirna molecules, hereafter termed silna, aiming at in vivo applications. we show that lna is substantially compatible with the sirna machinery, and that silnas exhibit greatly improved bio-stability and show enhanced inhibition at certain rna targets. we further show that lna can be used to reduce sequence-related off-target effects by lowering incorporation of the sirna sense-strand and/or by reducing the ability of inappropriately loaded sense-strands to cleave the target rna.

*to whom correspondence should be addressed. tel: + ; fax: + ; email: joacim.elmen@cgb.ki.se

the online version of this article has been published under an open access model. users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the journal and oxford university press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. for commercial re-use permissions, please contact journals.permissions@oupjournals.org.
all sirna and silna oligonucleotides used in this study are listed in table . lna-containing oligonucleotides were synthesized by santaris a/s (hørsholm, denmark), sirna was ordered from medprobe (lund, sweden) and dna oligonucleotides from invitrogen (paisley, uk). target sequences have been described elsewhere [firefly luciferase ( ), renilla ( ), npy ( ), sars - ( )]. the different sirna sequences were used as unrelated controls in non-overlapping systems. the plasmids used were pgl -control coding for firefly luciferase and prl-tk coding for renilla luciferase (promega, madison, wi, usa). ps xs and ps xas with sars target in the sense or antisense direction, respectively, were constructed by ligation of a double-stranded dna oligonucleotide corresponding to the sars target site with xba i overhangs into the xba i site in the utr of the firefly luciferase in the pgl -plasmid. the sense or antisense direction of the insert was confirmed after ligation by pcr and sequencing. melting curves were recorded with a perkin elmer uv/vis spectrophotometer lambda attached to a ptp- peltier system. the sirna/silna were dissolved in an rnase-free buffer ( mm phosphate buffer, mm nacl, . mm edta ph ) to a final concentration of . mm and measured in cm path-length cells. samples were denatured at c for min and slowly cooled to c prior to measurements. melting curves were recorded at nm using a heating rate of c/ min, a slit of nm and a response of . s. t m values were obtained from the maxima of the first derivatives of the melting curves. cell lines used were human hek , rat pc and monkey vero. hek cells were maintained in dmem supplemented with % foetal bovine serum, penicillin, streptomycin and glutamine. pc cells were maintained in dmem supplemented with % horse serum, % foetal bovine serum, penicillin, streptomycin and glutamine. vero cells were maintained in eagle's mem supplemented with % foetal bovine serum, penicillin, streptomycin and glutamine (invitrogen). 
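the t m determination described above (maximum of the first derivative of the melting curve) can be illustrated with a minimal sketch. the function name, the central-difference scheme and the synthetic sigmoid curve are assumptions made for illustration only; they are not the instrument software or the authors' data.

```python
import math


def melting_temperature(temps_c, absorbances):
    """Return the temperature at which dA/dT is largest.

    The derivative is approximated by central differences, so the first
    and last readings are never returned. Ties go to the lower temperature.
    """
    if len(temps_c) != len(absorbances) or len(temps_c) < 3:
        raise ValueError("need at least three paired readings")
    best_temp, best_slope = None, float("-inf")
    for i in range(1, len(temps_c) - 1):
        slope = (absorbances[i + 1] - absorbances[i - 1]) / (
            temps_c[i + 1] - temps_c[i - 1]
        )
        if slope > best_slope:
            best_temp, best_slope = temps_c[i], slope
    return best_temp


# Hypothetical melting curve: a sigmoid centred on 60 C (illustrative values only).
example_temps = [float(t) for t in range(20, 96)]
example_abs = [0.5 + 0.1 / (1.0 + math.exp(-(t - 60.0) / 2.0)) for t in example_temps]

if __name__ == "__main__":
    print(melting_temperature(example_temps, example_abs))
```

real melting curves are noisy, so in practice the curve would usually be smoothed before differentiation; that step is omitted here.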
sars-cov, frankfurt isolate (genbank ay ) was amplified as previously described ( ). inhibition of firefly luciferase was performed in hek cells by co-transfection of the target plasmid. hek cells were seeded in ml antibiotic-free medium in -well plates the day before transfection to allow adherence and reach confluence of - % at the time of transfection. the standard co-transfection mix was prepared for triplicate samples by adding ng pgl -control, ng prl-tk and ng sirna to ml opti-mem i (invitrogen) and ml lipofectamine (invitrogen) to another ml opti-mem i. the two solutions were mixed and incubated at room temperature for - min before ml of the mix was added to each of three wells. the final volume of medium plus transfection mix was ml and the final sirna concentration was nm. the cells were incubated with the transfection mix for h and the medium was then replaced with new fully supplemented culturing medium. the cells were harvested h later and luciferase activity was measured. the opposite amounts of plasmids were used in the renilla luciferase assays. the dose-response studies were performed analogously using a final sirna concentration of nm. the effective firefly luciferase sirna was serially diluted with the unrelated sirna targeting neuropeptide y (npy), reducing the effective amount of sirna while keeping the total sirna concentration constant. the plasmids ps xs and ps xas were used instead of pgl -control when assaying for the effects of the sense and antisense strand of sars sirna and silna. sirna and silna inhibition experiments with the endogenous target npy were performed in pc cells as described above but without adding a target plasmid. the final sirna concentration was nm. mrna was extracted h posttransfection. inhibition of sars-cov-induced cytotoxicity followed a similar procedure. sars-cov infections and sirna transfections of vero cells were performed as described previously ( ). 
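the dose-response design described above, in which the effective sirna is serially diluted with an unrelated sirna so that the total sirna concentration stays constant, can be sketched as follows. the function name and all numeric values are hypothetical placeholders; the concentrations and dilution factor used in the study are not recoverable from this text.

```python
def dilution_series(total_nm, start_effective_nm, fold, steps):
    """Return (effective, filler) concentration pairs in nM.

    At every step the effective siRNA is divided by `fold` and the
    unrelated filler siRNA makes up the difference, so the two values
    always sum to `total_nm`.
    """
    if start_effective_nm > total_nm:
        raise ValueError("effective siRNA cannot exceed the total concentration")
    if fold <= 1 or steps < 1:
        raise ValueError("fold must be > 1 and steps must be >= 1")
    series = []
    effective = start_effective_nm
    for _ in range(steps):
        series.append((effective, total_nm - effective))
        effective /= fold
    return series


if __name__ == "__main__":
    # Placeholder values: 10 nM total, two-fold dilution, four points.
    for effective, filler in dilution_series(10.0, 10.0, 2.0, 4):
        print(f"effective {effective:6.2f} nM + unrelated {filler:6.2f} nM")
```

holding the total constant means that any change in reporter activity across the series reflects the amount of effective sirna and not the amount of transfected nucleic acid.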
vero cells were seeded in -well plates, transfected with nm sirna using lipofectamine and thereafter infected with tcid of sars-cov. the cytotoxicity was quantified with an ldh cytotoxicity detection kit (roche, penzberg, germany) h later. luciferase activity was assessed according to the dual-luciferase reporter assay protocol (promega) using a novo-star -well format luminometer with substrate dispenser (bmg labtechnologies, offenburg, germany). a ml sample was placed in each well of a -well plate, after which ml luciferase assay reagent ii (substrate for firefly luciferase) was added to each well by the luminometer and the firefly activity was measured. then ml stop & glo (stop solution for firefly luciferase and substrate for renilla luciferase) was added and renilla luciferase activity measured. the means of the luciferase activities, each measured for s ( readings), were used to calculate ratios between firefly and renilla luciferase. total rna was isolated with rneasy mini and treated with rnase-free dnase according to the manufacturer's protocol (qiagen, hilden, germany). an amount of - ng of dnase-treated total rna was used as template for first strand dna synthesis according to the manufacturer's protocol (applied biosystems, stockholm, sweden). an aliquot (one-twentieth) of the cdna reaction was analyzed by quantitative real-time pcr on an abi prism (applied biosystems). gene-specific primers and probes for the target genes npy and cyclophilin a ( , ) were mixed separately with taqman universal mastermix (applied biosystems) and added to the cdna to be analyzed. samples were run in triplicate and the data obtained were analyzed with abi prism sds software (applied biosystems). duplexes of sirna and silna ( mm) were incubated at c in % fetal bovine serum (invitrogen) diluted in phosphate-buffered saline, % human or % mouse serum. aliquots of ml were withdrawn at different time points and immediately frozen in ml . · tbe-loading buffer. 
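the dual-luciferase readout described above (mean firefly signal divided by mean renilla signal, then expressed relative to the uninhibited control) can be sketched as below. the function names and readings are hypothetical, and setting the uninhibited control to 100% is an assumption, since the percentage is not legible in this text.

```python
from statistics import mean


def normalized_activity(target_readings, reference_readings):
    """Ratio of the mean target signal to the mean reference signal for one well."""
    return mean(target_readings) / mean(reference_readings)


def percent_of_control(sample_ratio, control_ratio):
    """Express a normalized ratio relative to the uninhibited control (assumed 100%)."""
    return 100.0 * sample_ratio / control_ratio


if __name__ == "__main__":
    # Hypothetical luminometer readings (arbitrary units).
    control = normalized_activity([200.0, 400.0], [100.0, 100.0])  # plasmids alone
    treated = normalized_activity([140.0, 160.0], [90.0, 110.0])   # plus siRNA
    print(f"{percent_of_control(treated, control):.1f}% of control")
```

for the renilla-targeting experiments the roles are swapped: renilla readings become the target and firefly readings the reference.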
samples were subjected to electrophoresis in % polyacrylamide-tbe under non-denaturing conditions and visualized by staining with sybr gold and quantified by typhoon hardware and imagequant software (amersham biosciences, uppsala, sweden).

table . sequences of sirna and silna used in the study. top strand depicts the sense strand in the - direction (same as the target sequence). bottom strand depicts the antisense strand in the - direction (complementary to the target). lna, uppercase; rna, lower case; dna, italic lowercase. all lna-c monomers were methyl cytosines.

lna confers both unprecedented affinity as well as very high nuclease resistance on oligonucleotides ( ) ( ) ( ) ( ). a priori, this suggests that lna may be used to increase the functional half-life of sirna in vivo by two different mechanisms, e.g. by enhancing the resistance of the constituent rna strands against degradation by single-stranded rnases and by stabilizing the sirna duplex structure that is critical for activity. to investigate how these mechanisms influence overall biostability, we assessed the integrity in fetal bovine serum, as well as human and mouse serum, of either unmodified sirna or silnas that were modified to enhance exonuclease resistance (silna ) or further modified to also increase duplex stability (silna ). briefly, silna has lna modifications at the ends and exhibits a duplex stability similar to the unmodified sirna (t m = . c for silna and . c for sirna ) whereas silna has six additional duplex stabilizing sense strand modifications at base-paired positions which increase its t m to > c. as shown in figure a and b, unmodified sirna (sirna ) was markedly degraded after h, producing a smear of faster migrating species. a similar diffuse band was not observed with the end protected silna , which in contrast to the unmodified sirna showed only weak signs of degradation after h. 
the more modified silna had a striking stability and showed no signs of degradation even at h. when incubated in either undiluted human or mouse serum (figure c and d), the unmodified sirna was fully degraded within h. an increased degradation rate was also observed with the silna where little full-length product remained at h. in contrast, silna remained intact for the full h of the assay in both human and mouse serum. to verify experimentally the compatibility of lna with the sirna machinery, a range of different lna-modified sirnas (table ) were analysed for their ability to selectively inhibit firefly luciferase in cultured cells expressing both firefly and renilla luciferase. as shown in figure a, the firefly sirna (sirna ) effectively and selectively reduced firefly luciferase activity whereas an unrelated sirna control was essentially without effect. introduction of lna modifications in the overhangs in either or both strands of the firefly sirna revealed no loss of inhibitory effect (silna - ). one lna in the end of the sense strand was fully compatible with activity (silna and ), while an lna at the end of the antisense strand (silna - ) dramatically impaired the inhibitory effect. to exclude the possibility that this impairment was due to a lack of a phosphate, which has been shown to be crucial for sirna function ( ), the end of the antisense strand in silna and was phosphorylated in vitro. however, this procedure did not recover any of the lost effect (data not shown). silnas wherein the sense strand was modified in the overhangs and at as many as seven base-paired positions (silna and ) retained significant inhibitory activity whereas silnas comprising various combinations of either fully modified sense or antisense strands did not have any inhibitory effect (data not shown). two of the firefly silnas, the lightly modified silna and the medium modified silna , were further compared to the unmodified sirna in a dose-response experiment (figure b). 
silna and sirna had the same efficacy but a slight difference in potency (estimated ec silna . nm, sirna . nm) whereas silna had a somewhat lower efficacy at the highest dose tested but similar potency (estimated ec . nm). our finding that an lna at the antisense position substantially impairs the function of the sirna contrasts with the findings of braasch et al. who recently published on the effect of several chemical modifications, including some lna, on sirna against human caveolin ( ) . to determine whether this discrepancy was due to the different choice of targets, we repeated the analysis, this time targeting renilla luciferase. as shown in figure c , we observe the same tendency as with the firefly target. silna and , carrying modifications in the overhangs were as functional as unmodified renilla sirna (sirna ). again, the end of the sense strand could be modified without loss of activity (silna ), while an lna in the antisense end significantly reduced the effect (silna ). interestingly, in this case part of the activity that was lost as a result of the antisense lna (silna ) could be recovered by simultaneously modifying the sense end (silna ) and even more so if modifications also included the sense end (silna ). to substantiate the generality of our finding, we finally appraised silnas targeting an endogenous gene, neuropeptide y (npy) in pc cells. as shown in figure d , the unmodified npy sirna (sirna ) reduced the mrna levels considerably whereas an unrelated control, sirna against dopamine d receptor, did not. sense strand lna modifications were, as before, well tolerated with both lightly modified (silna ) and medium modified (silna ) displaying a similar inhibitory effect as unmodified sirna. again an lna antisense modification substantially impaired activity (silna ), which, however, could be mostly recovered by simultaneous modification of the and sense end (silna ). 
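the ec values quoted at the start of this passage come from the dose-response comparison; the text does not say how they were estimated. as a rough illustration only, the sketch below locates the half-maximal point by linear interpolation on a log-concentration axis, which is a simple stand-in for a proper curve fit and not the authors' method. the function name and data points are hypothetical.

```python
import math


def ec50_by_interpolation(concs_nm, percent_activity):
    """Estimate the concentration giving half-maximal inhibition.

    'Half-maximal' is taken as midway between the activity at the lowest
    and at the highest concentration tested. The crossing point is found
    by linear interpolation between the two bracketing doses on a log10
    concentration axis. Concentrations must be positive.
    """
    points = sorted(zip(concs_nm, percent_activity))
    if len(points) < 2:
        raise ValueError("need at least two dose points")
    top = points[0][1]       # activity at the lowest dose
    bottom = points[-1][1]   # activity at the highest dose
    half = (top + bottom) / 2.0
    for (c_lo, a_lo), (c_hi, a_hi) in zip(points, points[1:]):
        if a_lo >= half >= a_hi and a_lo != a_hi:
            frac = (a_lo - half) / (a_lo - a_hi)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    raise ValueError("activity never crosses the half-maximal level")


if __name__ == "__main__":
    # Hypothetical dose-response data: concentration (nM) vs remaining activity (%).
    print(ec50_by_interpolation([0.01, 0.1, 1.0, 10.0], [100.0, 80.0, 20.0, 10.0]))
```

with this definition, two compounds can share the same potency (ec value) while differing in efficacy (the activity remaining at the highest dose), which is the distinction the passage draws.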
next, we examined the effect of making single rna to lna exchanges at base-paired positions in the antisense strand of the firefly luciferase silna . as shown in figure , such exchanges were tolerated in most of the tested positions. apart from the antisense end (silna ), the notable exceptions are positions (silna ), (silna ) and (silna ), where introduction of lna leads to a clear decrease of inhibitory activity. although we cannot exclude that these modifications somehow prevent loading of the antisense strand into risc, we believe this to be unlikely given the functionality of many significantly more modified silnas. rather, as these positions are all close to the site where rna target cleavage occurs [between pos. and of the sirna strand counting from the end ( )], we suspect that the lna modifications may exert a direct conformational or functional effect on the catalytic site. the lna substitutions at position and exchanged an rna-u for an lna-t and an rna-c for an lna-m c both of which lead to the introduction of an additional methyl-group on the nucleobase. in the a-form helix formed between the sirna/rna-target, these methyl groups will protrude into the major groove with potential effects on helical geometry or accessibility important for catalysis. to investigate this in more detail, we repeated the experiment with lna-u in place of lna-t at position . this new compound displayed activity similar to the unmodified sirna (sirna ), thus lending support to the importance of having native nucleobases close to the cleavage site (data not shown). the replacement at position (sirna ) substituted an rna-a for an lna-a indicating that factors other than helical structure or accessibility are also important for proper catalytic activity. 
given the increased affinity imposed by lna, it seems likely that one such factor may be a changed thermodynamic fingerprint of the sirna in the vicinity of the cleavage site, the importance of which has been indicated by recent reports ( , ).

( nm) and luciferase activity was assessed h later. the firefly luciferase activity was normalized to the renilla luciferase activity and the uninhibited activity (plasmids alone) was set to %. (b) dose-response curves on selected silna targeting firefly luciferase. the total silna/sirna concentration was kept constant at nm; the ratio of effective and irrelevant silna/sirna was varied. the graph shows the log concentration of the effective silna from one of two representative experiments, with mean and sd derived from duplicate samples. (c) renilla luciferase activity was assessed as for firefly luciferase. the renilla activity was normalized to the firefly luciferase activity. (d) npy mrna levels. rat pc cells, endogenously expressing npy, were transfected with silna ( nm). npy mrna was measured h later by quantitative pcr. the npy mrna levels were normalized to cyclophilin a. the uninhibited normalized npy mrna level was set to %. the mean and sd values in the case of luciferase are from two independent experiments performed in triplicate, and from two independent experiments performed in duplicate in the case of npy.

much experimental data supports the notion that cells can incorporate both strands of an sirna into the risc complex but that preference is given to one of the two strands ( , ). an explanation for this strand-bias was provided by schwarz et al. ( ) and khvorova et al. ( ), who proposed that the strand that displays the weakest binding energy at its closing base-pair is incorporated preferentially.
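the strand-bias rule just described can be made concrete with a toy model. the sketch below assumes, following the published asymmetry rule, that the relevant closing base-pair is the one at each strand's 5' end; it ranks g:c above a:u and ignores nearest-neighbour energies, overhangs and wobble pairs. the optional lna flag, which simply raises the rank of the sense 5' end by one, is a hypothetical device for illustrating the idea and is not a thermodynamic calculation. all names and sequences are invented for illustration.

```python
COMPLEMENT = {"a": "u", "u": "a", "g": "c", "c": "g"}
# Ordinal ranking only: higher means a more stable closing base-pair.
PAIR_RANK = {frozenset("gc"): 2, frozenset("au"): 1}


def reverse_complement(rna):
    """Return the reverse complement of a lowercase RNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(rna.lower()))


def closing_pair_rank(base_a, base_b):
    """Rank a Watson-Crick pair; raises for anything else."""
    key = frozenset((base_a.lower(), base_b.lower()))
    if key not in PAIR_RANK:
        raise ValueError(f"not a Watson-Crick pair: {base_a}:{base_b}")
    return PAIR_RANK[key]


def predicted_strand_bias(sense_core, antisense_core, sense_5p_lna=False):
    """Predict which strand is favoured for loading.

    Both strands are written 5'->3' and cover only the base-paired region.
    The strand whose 5' end sits in the weaker closing pair is favoured;
    equal ranks give 'none'.
    """
    if len(sense_core) != len(antisense_core):
        raise ValueError("strands must cover the same base-paired region")
    sense_5p = closing_pair_rank(sense_core[0], antisense_core[-1])
    antisense_5p = closing_pair_rank(antisense_core[0], sense_core[-1])
    if sense_5p_lna:
        sense_5p += 1  # hypothetical: LNA makes the sense 5' end bind more tightly
    if antisense_5p < sense_5p:
        return "antisense"
    if sense_5p < antisense_5p:
        return "sense"
    return "none"


if __name__ == "__main__":
    sense = "acgu"  # toy duplex with a:u closing pairs at both ends
    antisense = reverse_complement(sense)
    print(predicted_strand_bias(sense, antisense))
    print(predicted_strand_bias(sense, antisense, sense_5p_lna=True))
```

in this toy model a duplex with a:u pairs at both ends shows no preference until the sense 5' end is stabilized, at which point the antisense strand is favoured.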
as a functional genomic tool and as a prospective therapeutic, the incorporation of the unwanted, non-target complementary, sense strand is a concern as it is a likely cause of 'off-target' effects ( ) and may lower the potency of the sirna by limiting incorporation of the intended antisense strand. if relative binding energies at the ends of the sirna duplex determine strand bias, it ought to be possible to favour incorporation of the antisense strand by selectively enhancing the affinity of the sense end with lna. this intriguing possibility was already hinted at by the previous observations that activity loss due to lna incorporation at the antisense end could be largely rescued by compensatory modifications in the end of the sense strand (which a priori would serve to restore the relative binding energies of the two ends of the silna). to examine in more detail the ability of lna to direct strand loading, we constructed a plasmid system that made it possible to monitor the activity of both the sense and antisense strand. briefly, a target region derived from the sars virus to which a medium effective sirna (sars -sirna) had previously been identified ( ) was cloned into the utr of the firefly luciferase gene in both the sense (ps xs) and antisense (ps xas) orientation (figure a ). the sars sirna (table ) has identical closing base-pairs at both ends (a:u) making it likely that enough of both the antisense and sense strand would be incorporated into risc to observe activity on the respective targets. as shown in figure b , both sars -sirna and sars -silna (modified at the overhangs and at the sense end) inhibited the sense target (ps xs) and to the same extent indicating that both sirna and silna are effective in loading the antisense strand into risc. however, when tested for sense strand activity, the outcome was different. 
here, the sirna showed clear downregulation of the target (ps xas), albeit the effect was less than that observed with the antisense strand. in contrast, no activity was observed with the silna sense strand, strongly supporting the conclusion that the sense lna modification has altered strand-bias in favour of incorporation of the antisense strand. both sirna and silna were also tested against the control plasmid pgl and found to have no effect on luciferase expression.
improving on low efficacy sirnas
as described above, incorporation of the sense strand may decrease the potency of sirnas by simply lowering the number of risc complexes loaded with the antisense strand. having established that sense lna appeared able to redirect strand loading, we next examined whether these modifications would also be able to improve the potency of two inefficient sirnas targeting renilla luciferase, sirna and ( ). both of these sirnas have a strong g:c base-pair at their antisense end and a weak a:u base-pair at their sense end (table ), making it likely that at least part of the poor activity could be due to incorporation of the sense strand. as shown in figure a, the activity of both sirnas was improved by the sense lna modification, with sirna reducing residual luciferase activity from to % (silna ) and sirna from to % (silna ). to investigate if an lna sense modification could rescue mediocre sirnas against a therapeutically important target, we compared the ability of three sirnas and their corresponding silnas to protect vero cells from death induced by severe acute respiratory syndrome-associated coronavirus (sars-cov). of the three sirnas, the reasonably effective sars- sirna and the ineffective sars- sirna both have a strong g:c base-pair at the sense end and a weak a:u base-pair at the antisense end (table ).
as such, further stabilization by lna of their sense end is not expected to lead to improved activity and, consistent with this notion, none was observed (figure b). in contrast, the modestly effective sars- sirna, which has a:u base-pairs at both ends, showed the expected improvement when modified with lna. decreasing or increasing the viral titres did not change the relative behaviour of the sirnas and silnas, although a generally greater reduction or increase in cytotoxicity was noted at the lower and higher titres, respectively (data not shown). unrelated sirna and silna controls (firefly luciferase and npy) showed no inhibition of virus-induced cytotoxicity.
we have shown that the nucleotide analogue lna is substantially compatible with the sirna intracellular machinery, preserving molecule integrity whilst offering several improvements that are relevant to the development of sirna technology for therapeutic use. notably, lna offers the means to improve dramatically the half-life of sirnas through a combination of enhanced nuclease stability and stabilization of the duplex structure. as this property can be obtained with a modest number of lna modifications that do not affect the ability of the sirna to mediate target knockdown, we expect that silnas may exhibit significantly enhanced efficacy when administered in vivo compared to their unmodified counterparts. off-target effects brought about by inappropriate loading of sirna sense strands constitute a major concern for the use of sirnas as genomic tools and prospective drugs. our data provide evidence that lna can be used to minimize such effects, acting through two different mechanisms. first, lna substitution can alter strand-bias through selectively increasing the affinity of the closing base-pair at the end of the sirna sense strand.
we note that our analyses so far have been confined to the ultimate sense position and we cannot exclude the possibility that even greater strand-bias may be imposed by additional modifications to neighbouring positions. second, lna may be incorporated into positions in the sense strand that, once loaded into the risc complex, impair its ability to participate in target cleavage. we have identified these activity-impairing positions (pos. , and ) by systematically analysing a whole set of single lna insertions in the antisense strand and found them to be in the vicinity of the target cleavage site. the evidence that similar substitutions applied to the sense strand will have a similar effect is indirect. nevertheless, we see no reason to believe this will not be the case. if so, the combination of two or more activity-impairing lna modifications may facilitate complete loss of activity of risc complexes inappropriately loaded with sense strands. the ability to influence strand loading by lna modification at the sense end also provides an opportunity to improve the potency of ineffective sirnas by further enhancing antisense strand incorporation into risc. although our data demonstrate that such enhanced loading is not a general phenomenon, using lna in this way remains an option where the choice of target sequence is constrained. consistent with the findings of braasch et al. ( ), the present data confirm that excessive lna modifications, each of which is permissive when introduced as a separate modification, can reduce the knockdown efficacy of the sirna. based on the present data, we can only speculate as to the underlying causes, which may include changes in sirna structure that affect risc loading, problems with unwinding the duplex due to excessive thermo-stability, changes in release kinetics after substrate cleavage, etc.
whatever the cause, as we have shown, the potential key therapeutic benefits of introducing lna into sirnas can all be achieved with relatively few modifications that do not compromise sirna activity. modifications other than lna have been shown to provide benefits to sirna and could conceivably be combined with lna. in conclusion, the rna-like character of lna combined with its enhanced biophysical characteristics, e.g. increased nuclease resistance and affinity, enabled us to construct hybrid rna-lna molecules with new and favourable properties over unmodified sirna. we anticipate that these new molecules, which we have termed silna, will impact positively on the use of rnai technology in functional genomics and the broader perspective of translating the technology into a drug platform.
figure . lna improvement of medium-efficient sirnas. (a) renilla luciferase activity. effect of sirna and silna depending on the target sequence (sirna and have different target sequences, accordingly also silna and ). (b) effect of sirna and silna on sars-induced cytotoxicity depending on the target sequence (sars - ). vero cells were transfected with nm sirna or silna and then infected with tcid sars-cov. sars-induced cytotoxicity was assessed h later. the untransfected but infected sample was set to % cytotoxicity. mean and sd values in the renilla case are from two experiments performed in triplicate, and in the sars case from three experiments performed in quadruplicate.
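the normalization described in the figure legends (a reporter signal divided by a reference signal, then expressed relative to an uninhibited or untreated control) can be sketched as follows. this is a minimal illustration, not the authors' analysis script: the function names are hypothetical, and the default reference level of 100 % for the control is an assumption, since the value is not legible in this text.

```python
from statistics import mean, stdev


def normalised_ratio(reporter, reference):
    """Divide each reporter reading by its matched reference reading
    (e.g. one luciferase signal by the other from the same well)."""
    return [r / ref for r, ref in zip(reporter, reference)]


def percent_of_control(samples, controls, control_percent=100.0):
    """Express samples relative to the mean of the control readings.

    ASSUMPTION: the control is set to `control_percent` (default 100.0).
    Returns (mean, sample standard deviation) of the scaled readings.
    """
    control_mean = mean(controls)
    if control_mean == 0:
        raise ValueError("control mean is zero; cannot normalise")
    scaled = [control_percent * s / control_mean for s in samples]
    spread = stdev(scaled) if len(scaled) > 1 else 0.0
    return mean(scaled), spread


if __name__ == "__main__":
    # made-up readings, for illustration only
    treated = normalised_ratio([40.0, 44.0, 36.0], [100.0, 100.0, 100.0])
    control = normalised_ratio([80.0, 82.0, 78.0], [100.0, 100.0, 100.0])
    print(percent_of_control(treated, control))
```

the sample standard deviation is used here because each condition has only a few replicates; a different spread measure could be substituted without changing the normalization itself.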
duplexes of -nucleotide rnas mediate rna interference in cultured mammalian cells
potent and specific genetic interference by double-stranded rna in caenorhabditis elegans
rna interference is mediated by -and -nucleotide rnas
a species of small antisense rna in posttranscriptional gene silencing in plants
rnai: double-stranded rna directs the atp-dependent cleavage of mrna at to nucleotide intervals
sirnas: applications in functional genomics and potential as therapeutics
killing the messenger: short rnas that silence gene expression
rna interference in mammalian cells by chemically-modified rna
sequence, chemical, and structural variation of small interfering rnas and short hairpin rnas and the effect on mammalian gene silencing
tolerance for mutations and chemical modifications in a sirna
structural variations and stabilising modifications of synthetic sirnas in mammalian cells
potent and nontoxic antisense oligonucleotides containing locked nucleic acids
locked nucleic acid (lna): fine-tuning the recognition of dna and rna
design of antisense oligonucleotides stabilized by locked nucleic acids
design and characterization of decoy oligonucleotides containing locked nucleic acids
structural studies of lna:rna duplexes by nmr: conformations and implications for rnase h activity
antisense inhibition of gene expression in cells by oligonucleotides incorporating locked nucleic acids: effect of mrna target sequence and chimera design
stability and structural features of the duplexes containing nucleoside analogues with a fixed n-type conformation, -o, -cmethyleneribonucleosides
lna (locked nucleic acids): synthesis of the adenine, cytosine, guanine, -methylcytosine, thymine and uracil bicyclonucleoside monomers, oligomerisation, and unprecedented nucleic acid recognition
lna (locked nucleic acids): synthesis and high-affinity recognition
lna (locked nucleic acid): an rna mimic forming exceedingly stable lna:lna duplexes
rnai in human cells: basic structural and functional features of small interfering rna
effective small interfering rnas and phosphorothioate antisense dnas have different preferences for target sites in the luciferase mrnas
characterization of rna interference in rat pc cells: requirement of gerp
sars virus inhibited by sirna. preclinica
the use of taqman rt-pcr assays for semiquantitative analysis of gene expression in cns tissues and disease models
atp requirements and small interfering rna structure in the rna interference pathway
functional anatomy of sirnas for mediating efficient rnai in drosophila melanogaster embryo lysate
functional sirnas and mirnas exhibit strand bias
rational sirna design for rna interference
asymmetry in the assembly of the rnai enzyme complex
expression profiling reveals off-target gene regulation by rnai
this study was supported by the phd program in biotechnology with an industrial focus and the foundation for knowledge and competence development, which also provided the funding to pay the open access publication charges for this article.