key: cord-0001001-iuglkdcp
authors: Sperschneider, Jana; Datta, Amitava; Wise, Michael J.
title: Predicting pseudoknotted structures across two RNA sequences
date: 2012-12-01
journal: Bioinformatics
DOI: 10.1093/bioinformatics/bts575
sha: 5a054a570f409d5c2006369f19addc98d024b6c1
doc_id: 1001
cord_uid: iuglkdcp

Motivation: Laboratory RNA structure determination is demanding and costly and thus, computational structure prediction is an important task. Single sequence methods for RNA secondary structure prediction are limited by the accuracy of the underlying folding model, if a structure is supported by a family of evolutionarily related sequences, one can be more confident that the prediction is accurate. RNA pseudoknots are functional elements, which have highly conserved structures. However, few comparative structure prediction methods can handle pseudoknots due to the computational complexity. Results: A comparative pseudoknot prediction method called DotKnot-PW is introduced based on structural comparison of secondary structure elements and H-type pseudoknot candidates. DotKnot-PW outperforms other methods from the literature on a hand-curated test set of RNA structures with experimental support. Availability: DotKnot-PW and the RNA structure test set are available at the web site http://dotknot.csse.uwa.edu.au/pw. Contact: janaspe@csse.uwa.edu.au Supplementary information: Supplementary data are available at Bioinformatics online.

Macromolecules such as DNA, RNA and proteins have the ability to form diverse tertiary structures, which enable functionality and thus, life. For many decades, proteins were deemed the global players in the cell until RNA entered the spotlight. For example, RNA structures have been found to be catalytically active, which was assumed to be the privilege of proteins. Furthermore, small RNAs are known to regulate gene expression and RNA viruses employ a plethora of structure elements to invade the host cell.

To gain insight into macromolecule function, one must investigate the structure. The first step in RNA folding is stable base pairing that leads to a secondary structure. As RNA structure formation is of hierarchical nature, secondary structure is the basis for the tertiary fold that produces the functional structure. Especially for RNAs, structure determination by experimental means is an intricate and expensive task. Computational RNA structure prediction is therefore an invaluable tool for biologists. Comparative structure prediction is considered the most reliable approach for computational RNA structure prediction. Single sequence structure prediction is always limited by the accuracy of the underlying folding model. Three main streams have been identified for comparative RNA secondary structure prediction: (i) predict a structure from a pre-computed sequence alignment; (ii) simultaneously compute an alignment and a structure and (iii) alignment-free methods (Gardner and Giegerich, 2004) .

Tools for multiple sequence alignments such as ClustalW (Thompson et al., 1994) are readily available and thus, structure prediction from an alignment is a tempting approach [e.g. RNAalifold (Hofacker et al., 2002) ]. Such methods heavily depend on the sequence conservation and quality of the underlying alignment. However, ncRNAs are conserved rather on the structure level than on the sequence level. The gold standard of RNA comparative structure prediction is the Sankoff approach as it does not rely on a high-quality sequence alignment and captures the structural conservation of ncRNAs. Sankoff (1985) introduced a theoretical dynamic programming algorithm for simultaneous folding and aligning for a set of N sequences that takes O(n 3N ) time and O(n 2N ) space. Practical variants have been derived which more or less retain the Sankoff principle by sacrificing optimality. Alignment-free methods aim to avoid the pragmatic restrictions made in a practical Sankoff approach as well as the reliance on a high-quality alignment [e.g. CARNAC (Perriquet et al., 2003) ]. Note that all of these comparative structure prediction methods exclude the prediction of RNA pseudoknots.

RNA pseudoknots are crossing structure elements with diverse functions. The principle of pseudoknot formation is that bases within a loop region pair with complementary unpaired bases outside the loop. From an algorithmic point of view, even the simplest type of pseudoknot adds considerable computational demands due to crossing base pairs. In fact, the majority of comparative RNA structure prediction methods exclude pseudoknots. Biologists have delivered a wealth of studies, which show that pseudoknots have an astonishing number of diverse functions and occur in most classes of RNA (Staple and Butcher, 2005) . RNA viruses use pseudoknots for hijacking the replication apparatus of the host (Brierley et al., 2007) .

A limited number of RNA comparative structure prediction methods can handle pseudoknots due to the computational complexity. Several of these methods take a sequence alignment as an input. ILM is an algorithm that takes as an input either individual sequences or a sequence alignment (Ruan et al., 2004) . A base pair score matrix is prepared initially and helices are added to the structure in an iterative fashion. In the approach hxmatch, a maximum weighted matching algorithm with combined thermodynamic and covariance scores is used (Witwer et al., 2004) . This program gives the option to be combined with RNAalifold.

KNetFold is a machine learning method, which takes a sequence alignment as an input and outputs a consensus structure allowing pseudoknots (Bindewald and Shapiro, 2006) . Simulfold takes an alignment as an input and simultaneously calculates a structure including pseudoknots, a multiple-sequence alignment and an evolutionary tree by sampling from the joint posterior distributions (Meyer and Miklos, 2007) . Tfold combines stem stability, covariation and conservation to search for compatible stems and subsequently for pseudoknots for a set of aligned homologous sequences (Engelen and Tahi, 2010) . Several comparative structure prediction methods including pseudoknots do not rely on an initial sequence alignment. The graph-theoretical approach comRNA computes stem similarity scores and uses a maximum clique finding algorithm to find pseudoknotted structures (Ji et al., 2004) . SCARNA performs pairwise structural alignment of stem fragments with fixed lengths derived from the probability dot plot (Tabei et al., 2008) .

In the following, a novel comparative approach for predicting structures including H-type pseudoknots called DotKnot-PW will be introduced. The input consists of two unaligned, evolutionarily related RNA sequences. Similarity scores between structure elements will be calculated. Statistically significant pairs will be used to find the set of conserved structure elements common to two sequences, which maximize a combined thermodynamic and similarity score. Using a hand-curated test set of pseudoknotted structures with experimental support, the prediction accuracy of DotKnot-PW will be compared with methods from the literature.

Pseudoknots are functional elements in RNA structures and therefore, the most promising approach for comparative prediction is a structure comparison with less focus on exact sequence matching. In fact, perfect conservation on the sequence level can be more of a curse than a blessing. Especially ncRNAs are known to evolve quickly and so-called consistent and compensatory base pairs in both sequences will give much more confidence for structure conservation than a sequence alignment. One strong point of the DotKnot method for single sequence pseudoknot prediction (Sperschneider and Datta, 2010; Sperschneider et al., 2011) is that the set of possible H-type pseudoknot candidates (and secondary structure elements) is explicitly computed and thus readily available for further investigation. The main steps in the pairwise pseudoknot prediction approach DotKnot-PW are as follows ( Fig. 1) :

(1) Run DotKnot for two unaligned sequences Seq x and Seq y . This returns secondary structure element and Htype pseudoknot candidate dictionaries.

(2) Calculate pairwise base pair similarity scores for secondary structure elements and H-type pseudoknot candidates. Keep significant pairs that have a low estimated P-value.

(3) Use significant pairs to calculate the set of conserved structure elements and pseudoknots for the two sequences that maximizes a combined free energy and similarity score.

The key point of the DotKnot-PW approach is how to score the similarity of stems, secondary structure elements and H-type pseudoknot candidates derived from sequences Seq x and Seq y . Related work has been done for stem finding in unaligned sequences, where stem candidates are assigned a matching score across unaligned sequences, e.g. in SCARNA. Another point is how to assess the significance of a similarity score using Pvalues. These points will be explained in detail in the following section.

For two unaligned RNA sequences Seq x and Seq y , the single sequence prediction method DotKnot (Sperschneider and Datta, 2010; Sperschneider et al., 2011) returns two stem dictionaries D s (x) and D s (y) derived from the probability dot plot. It also returns secondary structure element dictionaries D L s ðxÞ, D L s ðyÞ and D M s ðxÞ, D M s ðyÞ and H-type pseudoknot candidate dictionaries D p (x) and D p (y) (Fig. 1) . To detect conserved structure elements for the two sequences, a pairwise structural comparison is performed. Instead of a full structure-tostructure alignment, which takes O(n 4 ) time and O(n 3 ) space, pairwise base pair similarity scores are calculated using the RIBOSUM85-60 matrix for base pair substitutions (Klein and Eddy, 2003) .

For two given stems s i (x) and s j (y) with fixed lengths in sequences Seq x and Seq y , respectively, the base pair similarity score sim[s i (x), s j (y)] is calculated using an ungapped local structure alignment of the base pairs with the RIBOSUM85-60 matrix. As an example, consider the following optimal ungapped local structure alignment of the two stems with base pair similarity score of sim[s 1 (x), s 2 (y)] ¼ 22.04 using the RIBOSUM85-60 matrix.

)))))--To evaluate the significance of base pair similarity scores instead of the raw score, one has to find out what the underlying probability distribution is. Similar to the case of ungapped local sequence alignments (Karlin and Altschul, 1990) , it is assumed here that the base pair similarity scores follow an extreme value distribution. However, the main difference is that a comparison between fixed-length stem fragments is made. It is important to remember that parameters and K describe the extreme value distribution of optimal local alignment scores in the asymptotic limit of long sequences (Altschul et al., 2001) . Here, the parameters for the generalized extreme value distribution are pre-calculated using maximum Fig. 1 . For two unaligned RNA sequences Seq x and Seq y , DotKnot-PW produces structure element dictionaries derived from the probability dot plot. Similarity scores and P-values are computed to detect conserved elements likelihood fitting of a distribution to the histogram of a large sample of random base pair similarity scores. The maximum likelihood fitting was performed using the ismev package of the R statistical language for a range of stem lengths (see Supplementary Material). The P-value is defined as the probability to obtain a score greater than or equal to the observed score strictly by chance. A stem s i (x) in sequence Seq x and a stem s j (x) in sequence Seq y are a significant pair if the score sim[s i (x), s j (y)] has an estimated P-value less than . Stem pairs with a P-value larger than are not considered in the following.

For two interrupted stems, the base pair similarity score is calculated by deleting bulges and internal loops and scoring stems as consecutive base pairs. Base pair similarity scores for regular and interrupted stems are also calculated if the difference in number of base pairs is less than 5. For example, a stem with one bulge might be a conserved match with a regular stem. A stem s i (x) in sequence Seq x and a stem s j (y) in sequence Seq y are a significant pair if the score sim[s i (x), s j (y)] has an estimated P-value less than .

Calculating the base pair similarity score for two multiloop structures is complex due to the variety of inner loop elements, which may be regular or interrupted stems. A multiloop s M i ðxÞ can be decomposed into an outer stem s o i ðxÞ and a set of inner structure elements

The base pair similarity score sim½s o i ðxÞ, s o j ðyÞ for the outer stems of two multi-loops can be easily obtained from the previously calculated base pair similarity scores. If the outer stem is a conserved match, a local alignment on the set of inner structure elements is used to find the base pair similarity score. Here, gaps are allowed in the local alignment of inner structure elements; however, no gap penalty is used. Let two sets of inner structure elements S i (x) ¼ [s 1 (x),. . ., s n (x)] and S j (y) ¼ [s 1 (y),. . ., s m (y)] be given. Let H(i, j) be the maximum similarity score between a suffix of S i (x) and a suffix of S j (y). The optimal local alignment is calculated as follows:

Hði, 0Þ ¼ 0, 0 i n Hð0, jÞ ¼ 0, 0 j m Hði, jÞ ¼ max 0 Hði À 1, jÞ Hði À 1, j À 1Þ þ sim½s i ðxÞ, s j ðyÞ Hði, j À 1Þ

A multiloop s M i ðxÞ in sequence Seq x and a multiloop s M j ðyÞ in sequence Seq y are a significant pair if the similarity score sim½s M i ðxÞ, s M j ðyÞ has an estimated P-value less than .

A H-type pseudoknot has two pseudoknot stems S 1 and S 2 . The prerequisite for a conserved pseudoknot pair is that both core H-type pseudoknot stem pairs [S 1 (x), S 1 (y)] and [S 2 (x), S 2 (y)] are significant. The base pair similarity score for two H-type pseudoknots p i (x) and p j (y) in sequences Seq x and Seq y , respectively, is the sum of base pair similarity scores for the core pseudoknot stems as well as the base pair similarity score from a gapped local alignment of the recursive secondary structure elements in the loops (as described for multiloops). A pseudoknot p i (x) in sequence Seq x and a pseudoknot p j (y) in sequence Seq y are a significant pseudoknot pair if the similarity score sim[p i (x), p j (y)] has an estimated P-value less than .

The base pair similarity score calculated in the previous sections might not be powerful enough to distinguish true positive conserved structure element pairs from false-positive structure element pairs due to the finite lengths of stems and exclusion of loop sequences in the alignment. Therefore, a dissimilarity score is also used to confirm whether a pair is significant. The dissimilarity for two given structure elements s i (x) and s j (y) in sequences Seq x and Seq y is defined as:

where dissim 1 is the difference in the stem lengths and dissim 2 is the difference in the number of loop lengths. As an example, consider the pseudoknot pair p 1 (x) and p 1 (y) in sequences Seq x and Seq y , respectively, with stems S 1 , S 2 and loops L 1 , L 2 , L3. The pseudoknot pair has dissimilarity of 6. [[[[[[.) )))))) [.[[[[[[[.) ))))). .

A weight is assigned to a significant pair, which is a combination of the free energy, covariation and dissimilarity. The overall weight s of a significant structure element pair [s i (x), s j (y)] in sequences Seq x and Seq y is a combination of the free energy weights w[s i (x)] and w[s j (y)], base pair similarity score sim[s i (x), s j (y)] and dissimilarity dissim[s i (x), s j (y)]: s½s i ðxÞ, s j ðyÞ ¼ Â sim½s i ðxÞ, s j ðyÞ À Â fw½s i ðxÞ þ w½s j ðyÞg À Â dissim½s i ðxÞ, s j ðyÞ Only structure element pairs with positive score s are allowed in the following dynamic programming algorithm. Here, and are set to 0.5 and is set to 1.

Let p x 1 , . . . , p x n be the number of structure elements in the first sequence Seq x and p y 1 , . . . , p y m be the number of structure elements in the second sequence Seq y . Each structure element has a left and right endpoint in the sequence and is a stem, interrupted stem, multiloop or H-type pseudoknot. Structure elements can also be represented as nodes in a graph. In each sequence, the structure elements are ordered by their right endpoints. An edge is drawn between two structure elements in the first and the second sequence if their base pair similarity score has a P-value less than . Given the set of edges between nodes p x 1 , . . . , p x n and p y 1 , . . . , p y m , the goal is to find the set of edges with maximum weight that are non-crossing. This relates to finding the set of non-overlapping structure elements in the two sequences that maximize the score under the requirement that the interval ordering is preserved. A set of structure elements in the first and second sequence, which preserves the interval ordering is called a feasible structure element alignment and must satisfy the following two requirements. Each structure element can be aligned with at most one other structure element in the other sequence. The order of structure elements must be preserved with respect to the alignment. That is, if structure elements p x i and p x j in the first sequence are aligned with p y a and p y b in the second sequence, respectively, the pairs may never overlap: p x i 5p x j^p x a 5p x b (Fig. 2) . Given nodes p x 1 . . . , p x n in the first sequence Seq x and p y 1 , . . . , p y m in the second sequence Seq y , let f(i, a) be the maximum sum of edge weights for nodes between 1 and i in the first sequence and 1 and a in the second sequence such that the edges are non-crossing (i n and a m). The nodes that maximize the sum of edge weights are called an optimal structure element alignment for the two sequences. The optimal structure element alignment is calculated using dynamic programming. For a given structure element p i with start point a i and end point b i , let pre(i) be the non-overlapping predecessor. For each structure element, its predecessor is pre-computed using the sorted list of structure elements. The recursion for calculating the optimal structure element alignment is as follows:

Furthermore, nested structures are taken into account for significant outer stem pairs, which have estimated P-value less than . For each significant outer hairpin loop pair, the optimal structure element alignment of inner elements is computed.

For two unaligned RNA sequences Seq x and Seq y , the single sequence prediction method DotKnot returns structure element dictionaries derived from the probability dot plot. Let n and m be the number of structure elements in sequences Seq x and Seq y , respectively. Calculating the similarity scores and the optimal structure element alignment takes O(nm) time. Furthermore, nested structures are taken into account for significant outer stem pairs, which have estimated P-value less than . Let a be the number of significant stem pairs, where both stems are hairpin loops. For each significant outer hairpin loop pair, the optimal structure alignment of inner elements is computed. In the worst case, this increases time requirements to O(a Â nm). The number of structure elements depends on the base composition of the sequence. Empirically, n and m can be observed to grow linearly with the length of the sequence for uniform base distribution (see Supplementary Material). In practice, DotKnot-PW can be expected to run in the order of minutes for sequences shorter than 500 nt.

Many pseudoknot prediction programs have been evaluated using all the entries in the PseudoBase database (van Batenburg et al., 2000) . There are several caveats in this approach. First, the sequences given in PseudoBase are those which exactly harbor the pseudoknot. However, in practice structure prediction algorithms will be applied to longer sequences without prior knowledge of the pseudoknot location. Second, long-range pseudoknot entries appear in a truncated version in the database. Third, some classes of pseudoknots have a large number of entries (such as short H-type pseudoknots in the 3 0 -untranslated regions of plant viruses), whereas more complex types of pseudoknots only have one representative (such as long-range rRNA pseudoknots). Therefore, a hand-curated dataset of pseudoknot structures will be used here.

When it comes to pseudoknots, many structures have been published based on a secondary structure predicted by free energy minimization. These predicted secondary structures are used as a working model and refined using experimental techniques such as chemical and enzymatic probing. However, the native structure remains unsolved unless tertiary structure determination methods such as X-ray crystallography are used. Testing structures that are based on computer predictions with no experimental support creates a bias in the benchmark and will be avoided in this evaluation.

A total of 16 pseudoknotted reference structures from different RNA types were collected, which have strong experimental support. For each reference structure, a supporting set of 10 evolutionarily related sequences was obtained from the RFAM database (Gardner et al., 2010) . Note that for the vast majority of supporting sequences, no experimentally determined structures are available. The average pairwise sequence identities vary from 55% to 99%. Given a reference structure, the performance of prediction algorithms is evaluated in terms of sensitivity (S), i.e. the percentage of base pairs in the reference structure, which are predicted correctly, as well as positive predictive value (PPV), i.e. the percentage of predicted pairs, which are in the reference structure. The Matthews correlation coefficient (MCC) is also reported and is in the range from À1 to 1, where 1 corresponds to a perfect prediction and À1 to a prediction that is in total disagreement with the reference structure. The performance of each method for predicting the reference structure was evaluated as described in Gardner and Giegerich (2004) .

DotKnot-PW was compared with methods that are freely available and use standard input and output formats. The comparative methods are CARNAC, Tfold and hxmatch (with the -A option using RNAalifold). All of these methods return structure predictions for only the reference structure with regards to the support set of evolutionarily related sequences. Tfold and hxmatch take a sequence alignment as the input. ClustalW with the default parameters was used to produce the initial sequence alignment. DotKnot-PW and CARNAC take a set of unaligned sequences as the input. Furthermore, prediction results for the reference sequence (not the supporting sequences) were obtained from the single sequence methods DotKnot (Sperschneider and Datta, 2010; Sperschneider et al., 2011), ProbKnot (Bellaousov and Mathews, 2010) , IPknot (Sato et al., 2011) and RNAfold (Hofacker et al., 1994) . Note that all methods except CARNAC and RNAfold allow pseudoknot prediction.

The results are shown in Table 1 . DotKnot-PW has the highest average MCC of 0.75 for the test sequences. For each reference structure with the 10 support sequences from the corresponding RFAM family, 10 predictions are returned ordered by the combined free energy and similarity score. If only the Fig. 2 . A set of edges with positive scores is given between nodes p 1 ,. . .,p 7 in the first sequence and p 1 ,. . .,p 6 in the second sequence. The goal is to find the best set of non-overlapping structure elements in the two sequences such that the interval ordering is preserved. The optimal structure element alignment, which preserves the interval ordering includes structure elements p 1 , p 4 , p 7 in the first sequence and p 1 , p 4 , p 6 in the second sequence pairwise prediction with highest combined free energy and similarity score is taken, DotKnot-PW has an improved average MCC of 0.81. Tfold and hxmatch have average MCC of 0.6 and 0.59, respectively. CARNAC has average MCC of 0.45 with much higher average specificity than sensitivity. The prediction results for single sequence structure prediction for each of the reference sequences with experimentally determined structures are also shown in Table 1 . Note that this does not include the prediction for the support sequences from RFAM, as no experimentally determined structures are available. All single sequence pseudoknot prediction methods show improved results over using RNAfold. DotKnot has the highest average MCC of 0.76, followed by IPKnot and ProbKnot.

As an example, consider the S15 mRNA pseudoknot that binds to specific proteins in the autoregulation mechanism of ribosomal protein S15 synthesis (Philippe et al., 1995) . For the reference sequence S15 and 10 support sequences from the corresponding RFAM family, DotKnot-PW returns pairwise predictions ordered by the combined free energy and similarity score. The top two pairwise predictions with the highest scores are shown in Figure 3 .

We presented DotKnot-PW for prediction of structures common to two RNA sequences, including H-type pseudoknots. Both DotKnot and DotKnot-PW have been designed as dedicated pseudoknot prediction tools. In the following, important aspects of pseudoknot prediction will be discussed.

Single sequence prediction methods are always limited by the underlying RNA folding model. This may be the set of free energy parameters used by free energy minimization methods or the underlying methodological framework such as maximum expected accuracy methods. DotKnot-PW shows excellent results on H-type pseudoknots with short interhelix loops. For this type of pseudoknots, DotKnot-PW uses free energy pseudoknot parameters by Chen (2006, 2009 ) based on polymer statistical mechanics. Improvements of the accuracy of free energy parameters, both for secondary structures and pseudoknots, will lead to more accurate prediction methods. However, one has to keep in mind that the algorithms themselves must be designed in such a fashion that novel parameters can be efficiently incorporated. The heuristic framework of DotKnot-PW has been designed such that it can incorporate sophisticated free energy parameters for pseudoknots, secondary structures and coaxial stacking. In the future, DotKnot-PW could also use contributions from basic tertiary structure elements such as base triples around the pseudoknot junction or stem-loop interactions.

Pseudoknot prediction algorithms come in two flavors: either they can predict a certain, restricted class of pseudoknots or they do not have a restriction on the type of pseudoknot that can be predicted. For methods using free energy parameters, the inclusion of general types of pseudoknots might be more of a 

Each reference structure is given by its ID (see Supplementary Material for dot-bracket notation). The following column gives the method of experimental support (NMR, NMR spectroscopy; X-ray, X-ray crystallography; SC, sequence comparison; MG, mutagenesis; SP, structure probing), length of the sequence and number of pseudoknots. For each reference structure, the corresponding RFAM family ID, average sequence length and average pairwise sequence identity is shown. The * symbol means that the method failed to run. The 'first' prediction for DotKnot-PW is the pairwise prediction with highest combined free energy and similarity score.

curse than a blessing, as no reliable free energy parameters for complex pseudoknots are available. DotKnot-PW has restrictions on the type of pseudoknot that can be predicted. However, this does not always lead to poor prediction results in practice. For example, DotKnot-PW shows the best result for the HDV ribozyme, which is a complex double nested pseudoknot.

The results from the benchmark for structure prediction in Table 1 must be interpreted with care. First, the tested methods can be run with different parameters, possibly producing better results. However, as a typical user has no prior knowledge about the structure, the default parameters for each method are used. Of course, a comprehensive benchmark should include a larger number of structures to obtain a more reliable evaluation. However, in this study, the focus has been on a test set where the structures are supported experimentally. Many structures have been published, which were determined using computational tools and this will inevitably create a bias in a benchmark, and thus they were excluded here.

Here, an extension of the single sequence prediction method DotKnot was presented based on the pairwise comparison of structure elements. This approach called DotKnot-PW is designed as an algorithm for finding the structure including H-type pseudoknots common to two sequences. As shown in Table 1 , DotKnot-PW can greatly improve structure predictions for RNA families when compared with the single sequence prediction using DotKnot. In some cases, a comparative approach might have lower sensitivity than a single sequence prediction; however, this should not generally be judged as 'inferior'. For example, ncRNAs might preserve some integral base pairs throughout evolution and only these will be detected by a comparative approach, which returns the set of base pairs common to a set of evolutionarily related sequences. DotKnot-PW uses a set of unaligned sequences as the input; therefore, no expert user intervention is required. In the future, DotKnot-PW will be extended to include intramolecular kissing hairpins. Furthermore, constrained folding will be implemented to predict a structure subject to constraints, e.g. enforce certain base pairs or regions, which must remain unpaired.

DotKnot-PW has been designed as a dedicated pseudoknot prediction tool and should be applied to RNA sequences where pseudoknotted interactions are suspected in the structures. Prediction accuracy will inevitably decrease for sequences, which are longer than say 400 nt for any single sequence structure prediction method (Reeder et al., 2006) . To achieve reliable results, short sequences should be folded using DotKnot and predictions should be compared with results from other methods from the literature. To gain confidence in predictions, subsequent comparative prediction using DotKnot-PW and other comparative methods is highly recommended. Ideally, experimental verification of computationally predicted pseudoknots should be sought. Fig. 3 . Pairwise prediction results for the S15 mRNA pseudoknot (RFAM family RF00114) with the top two combined free energy and similarity scores. The reference structure is shown at the top and folds into two conformations in dynamic equilibrium: a H-type pseudoknot or a series of hairpins.

For the pairwise prediction with highest score, the pseudoknot structure is returned. For the second-best pairwise prediction, the alternative hairpin loop structure is returned

The estimation of statistical parameters for local alignment score distributions

An RNA pseudoknot as the molecular switch for translation of the repZ gene encoding the replication initiator of IncI alpha plasmid ColIb-P9

ProbKnot: fast prediction of RNA secondary structure including pseudoknots

RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers

Viral RNA pseudoknots: versatile motifs in gene expression and replication

Predicting RNA pseudoknot folding thermodynamics

Predicting structures and stabilities for H-type pseudoknots with interhelix loops

Tfold: efficient in silico prediction of non-coding RNA secondary structures

Crystal structure of a hepatitis delta virus ribozyme

A comprehensive comparison of comparative RNA structure prediction approaches

Rfam: Wikipedia, clans and the ''decimal'' release

New insight into RNase P RNA structure from comparative analysis of the archaeal RNA

Fast folding and comparison of RNA secondary structures

Secondary structure prediction for aligned RNA sequences

A graph theoretical approach for predicting common RNA secondary structure motifs including pseudoknots in unaligned sequences

Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes

Structural basis of glmS ribozyme activation by glucosamine-6-phosphate

RSEARCH: finding homologs of single structured RNA sequences

Ribosomal binding to the internal ribosomal entry site of classical swine fever virus

Chloroplast phosphoglycerate kinase, a gluconeogenetic enzyme, is required for efficient accumulation of Bamboo mosaic virus

SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework

Structure of the S-adenosylmethionine riboswitch regulatory mRNA element

An NMR and mutational analysis of an RNA pseudoknot of Escherichia coli tmRNA involved in trans-translation

Terminal RNA replication elements in human parechovirus 1

The stimulatory RNA of the Visna-Maedi retrovirus ribosomal frameshifting signal is an unusual pseudoknot with an interstem element

Finding the common structure shared by two homologous RNAs

Structural basis for ribosome recruitment and manipulation by a viral IRES RNA

Molecular dissection of the pseudoknot governing the translational regulation of Escherichia coli ribosomal-protein S15

A three-stemmed mRNA pseudoknot in the SARS coronavirus frameshift signal

Beyond Mfold: recent advances in RNA bioinformatics

Folding of a transcriptionally acting preQ1 riboswitch

An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots

Simultaneous solution of the RNA folding, alignment and protosequence problems

IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming

DotKnot: pseudoknot prediction using the probability dot plot under a refined energy model

Heuristic RNA pseudoknot prediction including intramolecular kissing hairpins

Pseudoknots: RNA structures with diverse functions

A fast structural multiple alignment method for long RNA sequences

Structure and function of telomerase RNA

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positionspecific gap penalties and weight matrix choice

PseudoBase: a database with RNA pseudoknots

A phylogenetically conserved hairpin-type 3' untranslated region pseudoknot functions in coronavirus RNA replication

Prediction of consensus RNA secondary structures including pseudoknots

We thank the reviewers for their time and thoughtful comments that helped to considerably improve the quality of the manuscript.Funding: International Postgraduate Research Scholarship (IPRS) from the Australian Government (to J.S.).Conflict of Interest: none declared.