key: cord-0035602-p8puu1jk authors: Li, Menglu; Cheng, Micheal; Ye, Yongtao; Hon, Wk; Ting, Hf; Lam, Tw; Tang, Cy; Wong, Thomas; Yiu, Sm title: Predicting RNA Secondary Structures: One-grammar-fits-all Solution date: 2015 journal: Bioinformatics Research and Applications DOI: 10.1007/978-3-319-19048-8_18 sha: e353d4dfa5156858a16a1d7501c41248056af931 doc_id: 35602 cord_uid: p8puu1jk RNA secondary structures are known to be important in many biological processes. Many available programs have been developed for RNA secondary structure prediction. Based on our knowledge, however, there still exist secondary structures of known RNA sequences which cannot be covered by these algorithms. In this paper, we provide an efficient algorithm that can handle all RNA secondary structures found in Rfam database. We designed a new stochastic context-free grammar named Rectangle Tree Grammar (RTG) which significantly expands the classes of structures that can be modelled. Our algorithm runs in O(n (6)) time and the accuracy is reasonably high, with average PPV and sensitivity over 75%. In addition, the structures that RTG predicts are very similar to the real ones. Secondary structures of RNA molecules play important roles in their functionalities [1, 2] . Many methods have been proposed to predict RNA secondary structures. Although the majority of RNAs have simple secondary structures, pseudoknots (base pairs crossing each other) are found in almost all classes of RNAs. Pseudoknots are known to be involved in biological functions such as stimulating ribosomal frameshifting [3, 4] . The existence of pseudoknots make the secondary structure prediction an NP-hard problem, in general [5, 6] . Existing algorithms attempt to solve the problem by considering a restricted set of pseudoknots [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] . Not all existing pseudoknots can be modelled. In terms of prediction accuracy, CentroidAlifold [9] generalized a centroid estimator that maximizes the expected accuracy of structure prediction. Tabei, Yasuo and Kiryu [16] proposed a fast multiple sequence alignment method named MXScarna in which the optimal structure that maximized a heuristic scoring function was found during the group alignments of stem component sequences. RNAaliFold [17] precomputed alignments using a combination of free-energy and a covariation measures, whilst TurboFold [18] utilized an iterative probabilistic method to predict secondary structures for multiple RNA sequences. Despite of so many algorithms to predict RNA secondary structures, there exist secondary structures of known RNAs in Rfam [19] that cannot be covered by existing efficient algorithms 1 . Figure 1 shows such an example. In this paper, we proposed a grammar-based machine learning method to predict secondary structures for all RNA sequences in Rfam. Enlightened by [20] , we designed a new stochastic context-free grammar called Rectangular Tree Grammar (RTG), which can model all possible secondary structures of known RNA sequences in the Rfam database. Each structure can be generated by a unique operation path, that is, the only sequence of operations that yields this sequence. A set of paths is obtained using some real RNA sequences with known structures. Rule transition probabilities and base emission probabilities are calculated based on this set. In order to determine the unknown secondary structure of a RNA sequence, dynamic programming is adopted to generate the most probable structure. This procedure takes O(n 6 ) time, where n is the length of input RNA sequence. The proposed approach was evaluated using several sets of sequences with one containing pseudoknot-free structures and the others with different types of pseudoknots. We compared the performance of RTG with popular prediction algorithms including gfold [7] , CentroidAlifold [9] , pknotsRG [21] , NUPACK [22] , MXScarna [16] , RNAaliFold [17] and TurboFold [18] . The experimental results have shown that our approach outperforms others substantially with high PPV and sensitivity, especially on highly-pseudoknotted sequences. Let S = s 1 s 2 . . . s n be an RNA sequence of length n. M x,y is the set of base pairs in the range [x, y], M x,y = {(i, j)|x ≤ i < j ≤ y, (s i , s j ) is a base pair}. The secondary structure of s x . . . s y is a banding if it satisfies the following conditions: Regular Structure: A structure is a regular structure if no base pair crossing exists, that is, the secondary structure of s x . . . s y is a regular structure if ∃i, j, k, l ∈ [x, y] such that (i, j) ∈ M x,y , (k, l) ∈ M x,y , and i < k < j < l. Standard Pseudoknot of Degree k: A structure is a standard pseudoknot of degree k (k ≥ 3) if it is either a simple standard pseudoknot of degree k or a gapped standard pseudoknot of degree k. For The secondary structure of s x . . . s y is a simple standard pseudoknot of degree k (k ≥ 3) if there exists a set of x 1 , x 2 , . . . , x k−1 that satisfies the following conditions ( Figure 2 ): Based on our analysis to Rfam database, we focus on all standard pseudoknots of degree k (k ≥ 3) in this paper. Three Banding Structure: The secondary structure of s x . . . s y is a three banding structure if we can find x 1 , x 2 , x 3 such that all the following conditions are satisfied. [x,y] , it must belong to one of the sets L 12 , L 23 , L 34 , L 14 as defined below. (iii) for any two pairs (i, j) ∈ L ab and (k, l) ∈ L ab , then i < k < l < j or k < i < j < l. where L ab = L 12 , L 23 , L 34 or L 14 . Let (a) A three banding structure. (Figure 4 (a)) and how it is twisted so that all its base pairs are grouped into four parallel sets ( Figure 4 (b)), i.e., L 12 (blue and cyan pairs), L 23 (red pairs), L 34 (green and lime pairs) and L 14 (magenta pairs). k-Crossing Structure: s x s x+1 . . . s y is a k-crossing structure (k ≥ 3) if it is either a simple k-crossing structure or a gapped k-crossing structure. Intuitively, in a k-crossing structure, there exist k gapped bandings where any two of them crosses each other. For The secondary structure of s x . . . s y is a simple k-crossing structure (k ≥ 3) if there exist x 0 , x 1 , . . . , x 2k that satisfy the following conditions: [1, k] , CH w is a regular structure, a standard pseudoknot or a three banding structure. The secondary structure of s x . . . s y is a gapped k-crossing structure if and only if there exists a, b such that s a+1 . . . s b−1 is a defined structure and s x . . . s a ∪ s b . . . s y satisfies the following conditions: (i) By cutting out the gap s a+1 . . . s b−1 , the secondary structure over this new Figure 5 depicts a 3-crossing structure. Each color denotes a crossing set, i.e., CH 1 (cyan), CH 2 (red) and CH 3 (green). Recall the example in Figure 1 . The difficulty of this structure lies on two mixed substructures called 3-crossing and standard pseudoknots for which none of the existing algorithms can model (the green basepairs form a standard pseudoknot; the green, blue, and the red basepairs form a 3-crossing structure). As a matter of fact, among all classes of Rfam structures we defined below, only gfold [7] can generate some extremely simple 3-crossing structures with CH w = H w . None of the aforementioned algorithms can generate k-crossing structures (k ≥ 4). We have observed that the classic grammar-based algorithm, Simple Linear Tree Adjoining Grammar [23] , is incapable of predicting some highly-pseudoknotted structures (e.g k-crossing structures). To predict these structures, we introduce a new grammar called Rectangle Tree Grammar (RTG). Let V be a finite set of alphabets and Σ be a set of terminal alphabets where Σ ⊂ V . Let γ be a tree over V such that (i) each internal node must be labeled with a nonterminal symbol. (iii) there is only one red edge that defines the insertion point of this tree which is along the backbone. (iv) considering the red edge N 2 − N 3 , N 3 is the only child of N 2 . (v) the path from root to N 2 is the longest path in upper tree, the path from N 3 to N 4 is the longest path in bottom tree. According to the definition above, a rectangle tree can be divided into two parts by splitting through red edge, the yield of upper tree is γ U , the yield of bottom tree rooted at N 3 is γ B . Y (γ) = γ U γ B , the position between γ U and γ B is called an insertion point (where other structures can be inserted in). Figure 6 is an example of a rectangle tree. A tree is complete if it satisfies all the following conditions: (i) only one leaf (labeled as N 4 ) is labeled with nonterminal symbol. (ii) there is no red edge, i.e., no more base pairs will be added. (iii) the path from root to N 4 is the longest path in the tree. By labeling the red edge in Figure 6 , the rectangle tree becomes a complete tree. The yield is f abegcd. To predict the secondary structure of an RNA sequence, we compute the most probable rectangle tree whose yield is exactly the given sequence. A rectangle tree or a complete tree has a unique state. As shown in Table 1 , a state corresponds to the secondary structure represented by this tree. The first seven states are for rectangle trees. The remaining three are for complete trees. There are multiple ways to add bases into a tree. A tree operation defines how a single base, a base pair or another tree are allowed to be added. In this section, banding or gapped banding B3 the structure is three banding, insertion point is the insertion point of the second banding. BL the structure is standard pseudoknot of degree k(k ≥ 3), insertion point is the insert point of the rightmost banding. BR the structure is standard pseudoknot of degree k(k ≥ 3), insertion point is the insert point of the leftmost banding. BLR the structure is standard pseudoknot of degree k(k ≥ 4), insertion point can be insertion point of any banding except the leftmost and rightmost one. G 2-crossing structure, after another Cr operation, it will transit to state H. H k-crossing structure(k ≥ 3). CP P both si and sj are paired bases. CP S si is a paired base, sj is a single base. CSP si is a single base, sj is a paired base. CSS both si and sj are single bases. we introduce tree operations from state to state so that it is clear why each operation is needed. For simplicity, we use S 1 to denote γ U (the yield of upper tree) and S 2 to denote γ B (the yield of bottom tree). A gapped banding is divided by insertion point into two parts: S 1 and S 2 . Basically, base pairs and single bases of a banding is allowed to be added from outmost inwards. For rectangle trees, single bases can only be added at the end of S 1 (at N 2 ) or at the beginning of S 2 (at N 3 ). To obtain a gapped banding, the following tree operations are designed: -L23: add a base pair X into the tree, where the head and tail of X are added to the end of S 1 and the beginning of S 2 , respectively. -Ls2: add a single base to the end of S 1 . -Ls3: add a single base to the beginning of S 2 . Three Banding (State B3). A basic idea to generate a three banding structure is to add gapped bandings in the twisted structure in a top-down manner. Besides L23, Ls2 and Ls3, there are three more legal operations to add a gapped banding (or a base pair) X: -L12: add the head of X to the beginning of S 1 ; tail of X to the end of S 1 . -L34: add the head of X to the beginning of S 2 ; tail of X to the end of S 2 . -L14: add the head of X to the beginning of S 1 ; tail of X to the end of S 2 . To generate standard pseudoknot of degree k(k ≥ 3), we designed operation LR (Figure 7) . Operation LR inserts the upper tree of α above N 1 of β and its bottom tree above N 2 . At the same time, the insertion point is updated to insertion point of β. As a result, base pairs across upper tree and bottom tree in α and β would cross. Moreover, the update of insertion point prevents base pairs in α from crossing base pairs in subsequent trees adjoined with LR later. After (k − 2) LR operations, standard pseudoknot of degree k is generated. k-Crossing (State G and H). In k-crossing structures, without considering embedded substructures, all base pairs can be grouped into different crossing sets CH 1 . . . CH k , where CH w (∀w ∈ [1, k] ) is a regular structure (state B), a standard pseudoknot (state BL, BR or BLR) or a three banding structure (state B3). Standard pseudoknots of state BL and BR have their insertion point within the leftmost and rightmost banding, respectively. Otherwise, if the insertion point comes from neither the leftmost nor the rightmost banding, this standard pseudoknot is in state BLR. Operation LL is designed for state BR and BLR. An LL operation on rectangle tree with ζ inserts the upper tree of ζ above N 3 of and its bottom tree under N 4 . Then by operating LR on α 1 LR . . . LRα i with 1 LL . . . LL j , standard pseudoknot with insertion point in its (i + 1)th banding is generated. After generation of all the crossing sets, we designed the operation Cr to link them up. As is shown in Figure 8 , operation Cr on rectangle tree α with β inserts upper tree of β under N 2 of α and bottom tree under N 4 . So base pairs between upper tree and bottom tree in α and β would cross. After (k − 1) Cr operations, k-crossing can be generated. Embedding and Concatenation (State CP P ). By applying operation Lm to label the red edge of a rectangle tree to black, a complete tree is generated. Embedding operations Lplus2 and Lplus3 insert a complete tree to N 2 and N 3 of a complete tree, respectively. The concatenation operation Cp can concatenate two complete trees. The above three operations are also explained in Figure 9 . Note that if a rectangle tree α embeds (using Lplus2 or Lplus3) some complete trees, it becomes a new rectangle tree α . And the state of α remains the same as that of α. Single Bases at Both Ends (State CP S, CSP and CSS). As required by RTG, gapped bandings are always generated at first. Afterwards, applying proper tree operations as defined above, these gapped bandings compose a more complicated structure. When no base pairs are to be inserted, Lm alters this rectangle tree (representing this complicated structure) to a complete tree of state CP P . Note that the first base s i and the last base s j must have been boundary bases of gapped bandings. When there are single bases at either end of an RNA sequence, operation Ls1 and Ls4 are used to add single bases to the beginning of S 1 and the end of S 2 , respectively. A RTG grammar rule clarifies whether a specific operation is applicable to rectangle trees (in CPP, CPS, CSP or CSS state) or complete trees (in any other state). All the rules are tabulated in Table 2 . In the table, α is a single base. (α,β) is a base pair. (b1, b2) and (b3, b4) are rectangle trees, where comma denotes their insertion points. (c), (c1), and (c2) represent complete trees. After applying RTG grammar rules, the state of the predicted structure transits into another. All valid transitions defined by the grammar rules will be given in the full paper. For the dynamic programming algorithm for structure prediction and the parameter training, we follow the standard techniques (details will be given in the full paper). A total of 564 RNA sequences from 44 families were extracted from Rfam database for our experiments. All these families were classified into three sets D1, D2 and D3. D1 consists of regular structures (15 families). D2 contains standard pseudoknots of degree ≥ 3 (27 families). D3 2 comprises a set of 3-crossing structures (2 families). We carried out a 10-fold cross-validation on D1, D2 and D3 datasets separately. More specifically, take D1 as an example. In each round of validation, a total of 334 sequences in D1 were randomly partitioned into ten equal-size subsets. Out of these ten subsets, one subset was retained to test the model, while the other nine subsets were used to train this model. To eliminate variability, 10 rounds were performed using different partitions. The performance evaluated below is based on the average among 10 rounds. We compared the performance of our RTG method with seven popular softwares. For softwares that take multiple sequences as inputs, like TurboFold, CentroidAlifold and RNAalifold, we provided them with each family of sequences as an input. The performance was measured using positive predictive value (PPV) and sensitivity defined below. PPV = α γ and sensitivity = α β , where α is the number of correctly reported base pairs, β is the total number of reported base pairs, and γ is the total number of base pairs in the Rfam. Table 3 summarized the comparison of secondary structure prediction for RTG and seven other state-of-the-art programs. Our RTG program often outperforms other programs in terms of PPV and sensitivity. The experiment has revealed that 3-crossing dataset is hard to predict for other programs, which is consistent with our analysis of previous algorithms. However, the prediction of RTG program is accurate to a certain extent. Apart from RTG, NUPACK [22] and RNAalifold [17] performed best in estimating the secondary structure for 3-crossing dataset. The performance regarding this dataset is further illustrated in Figure 10 , which presents the predicted structure of NUPACK, RNAalifold and RTG over AE005174-2 as well as the trusted annotation in Rfam. The underlined parentheses(< − >, A − a and B − b) denotes the correctly predicted base pairs. Evidently, RTG behaved the best with PPV = 87.5% and sensitivity = 70.0%. Structural and functional aspects of rna pseudoknots In vivo determination of rna structure-function relationships: analysis of the 790 loop in ribosomal rna Characterization of an efficient coronavirus ribosomal frameshifting signal: requirement for an rna pseudoknot Structure, stability and function of rna pseudoknots involved in stimulating ribosomal frameshifting RNA pseudoknot prediction in energy-based models Complexity of pseudoknot prediction in simple models prediction of rna pseudoknots Hotknots: Heuristic prediction of rna secondary structures including pseudoknots Predictions of RNA secondary structure using generalized centroid estimators Rich parameterization improves rna structure prediction Cylofold: secondary structure prediction including pseudoknots. Nucleic Acids Research suppl Dynamic programming algorithms for rna secondary structure prediction with pseudoknots An o(n(5)) algorithm for mfe prediction of kissing hairpins and 4-chains in nucleic acids A partition function algorithm for nucleic acid secondary structure including pseudoknots A dynamic programming algorithm for rna structure prediction including pseudoknots A fast structural multiple alignment method for long rna sequences Rnaalifold: improved consensus structure prediction for rna alignments Turbofold: iterative probabilistic estimation of secondary structures for multiple rna sequences Rfam: annotating non-coding rnas in complete genomes Stochastic modeling of rna pseudoknotted structures: a grammatical approach pknotsrg: Rna pseudoknot folding including near-optimal structures and sliding windows Nupack: Analysis and design of nucleic acid systems Tree adjoining grammars for rna structure prediction The PPV and sensitivity of RNAalifold were 56.2% and 45.0%, respectively. NU-PACK reached even lower PPV and sensitivity. In addition to its high accuracy evaluated using PPV and sensitivity, RTG predicted a structure much more similar to the ground truth. RTG thought the secondary structure of AE005174-2 is a 3-crossing. Furthermore, it almost pointed out all the bandings correctly. Even for pairs denoted by B − b, the pairing position was very close. However, NUPACK and RNAalifold predicted it as regular structures, which was way far from its real structure.