key: cord-0025989-uuve1e2x authors: Sato, Kengo; Kato, Yuki title: Prediction of RNA secondary structure including pseudoknots for long sequences date: 2021-10-02 journal: Brief Bioinform DOI: 10.1093/bib/bbab395 sha: 26e722ff1d37b910f634206d61f1ce174196ce5d doc_id: 25989 cord_uid: uuve1e2x RNA structural elements called pseudoknots are involved in various biological phenomena including ribosomal frameshifts. Because it is infeasible to construct an efficiently computable secondary structure model including pseudoknots, secondary structure prediction methods considering pseudoknots are not yet widely available. We developed IPknot, which uses heuristics to speed up computations, but it has remained difficult to apply it to long sequences, such as messenger RNA and viral RNA, because it requires cubic computational time with respect to sequence length and has threshold parameters that need to be manually adjusted. Here, we propose an improvement of IPknot that enables calculation in linear time by employing the LinearPartition model and automatically selects the optimal threshold parameters based on the pseudo-expected accuracy. In addition, IPknot showed favorable prediction accuracy across a wide range of conditions in our exhaustive benchmarking, not only for single sequences but also for multiple alignments. Genetic information recorded in DNA is transcribed into RNA, which is then translated into protein to fulfill its function. In other words, RNA is merely an intermediate product for the transmission of genetic information. This type of RNA is called messenger RNA (mRNA). However, many RNAs that do not fit into this framework have been discovered more recently. For example, transfer RNA and ribosomal RNA, which play central roles in the translation mechanism, nucleolar small RNA, which guides the modification sites of other RNAs, and microRNA, which regulates gene expression, have been discovered. Thus, it has become clear that RNAs other than mRNAs are involved in various biological phenomena. Because these RNAs do not encode proteins, they are called non-coding RNAs. In contrast to DNA, which forms a double-stranded structure in vivo, RNA is often single-stranded and is thus unstable when intact. In the case of mRNA, the cap structure at the 5 end and the poly-A strand at the 3 end protect it from degradation. On the other hand, for other RNAs that do not have such structures, singlestranded RNA molecules bind to themselves to form threedimensional structures and ensure their stability. Also, as in the case of proteins, RNAs with similar functions have similar threedimensional structures, and it is known that there is a strong association between function and structure. The determination of RNA three-dimensional (3D) structure can be performed by Xray crystallography, nuclear magnetic resonance, cryo-electron microscopy, and other techniques. However, it is difficult to apply these methods on a large scale owing to difficulties associated with sequence lengths, resolution and cost. Therefore, RNA secondary structure, which is easier to model, is often computationally predicted instead. RNA secondary structure refers to the set of base pairs consisting of Watson-Crick base pairs (A-U, G-C) and wobble base pairs (G-U) that form the backbone of the 3D structure. RNA secondary structure prediction is conventionally based on thermodynamic models, which predict the secondary structure with the minimum free energy (MFE) among all possible secondary structures. Popular methods based on thermodynamic models include mfold [1] , RNAfold [2] , and RNAstructure [3] . Recently, RNA secondary structure prediction methods based on machine learning have also been developed. These methods train alternative parameters to the thermodynamic parameters by taking a large number of pairs of RNA sequences and their reference secondary structures as training data. The following methods fall under the category of methods that use machine learning: CONTRAfold [4] , ContextFold [5] , SPOT-RNA [6] and MXfold2 [7] . However, from the viewpoint of computational complexity, most approaches do not support the prediction of secondary structures that include pseudoknot substructures. Pseudoknots are one of the key topologies occurring in RNA secondary structures. The pseudoknot structure is a structure in which some bases inside of a loop structure form base pairs with bases outside of the loop (e.g. Figure 1A ). In other words, it is said to have a pseudoknot structure if there exist base pairs that are crossing each other by connecting bases of base pairs with arcs, as shown in Figure 1B . The pseudoknot structure is known to be involved in the regulation of translation and splicing, and ribosomal frameshifts [8] [9] [10] . The results of sequence analysis suggest that the hairpin loops, which are essential building blocks of the pseudoknots, first appeared in the evolutionary timescale [11] , and then the pseudoknots were configured, resulting in gaining those functions. We therefore conclude that pseudoknots should not be excluded from the modeling of RNA secondary structures. The computational complexity required for MFE predictions of an arbitrary pseudoknot structure has been proven to be NP-hard [12, 13] . To address this, dynamic programming-based methods that require polynomial time (O(n 4 )-O(n 6 ) for sequence length n) to exactly compute the restricted complexity of pseudoknot structures [12] [13] [14] [15] [16] and heuristics-based fast computation methods [17] [18] [19] [20] have been developed. We previously developed IPknot [21] , a fast heuristic-based method for predicting RNA secondary structures including pseudoknots. IPknot decomposes a secondary structure with pseudoknots into several pseudoknot-free substructures and predicts the optimal secondary structure using integer programming (IP) based on maximization of expected accuracy (MEA) under the constraints that each substructure must satisfy. The threshold cut technique, which is naturally derived from MEA, enables IPknot to perform much faster calculations with nearly comparable prediction accuracy relative to other methods. However, because the MEA-based score uses base pairing probability without considering pseudoknots, which requires a calculation time that increases cubically with sequence length, it is difficult to use for secondary structure prediction of sequences that exceed 1000 bases, even when applying a threshold cut technique. Furthermore, as the prediction accuracy can drastically change depending on the thresholds determined in advance for each pseudoknot-free substructure, thresholds must be carefully determined. To address the limitations of IPknot, we implemented the following two improvements to the method. The first is the use of LinearPartition [22] to calculate base pairing probabilities. LinearPartition can calculate the base pairing probability, with linear computational complexity with respect to sequence length, using the beam search technique. By employing the LinearPartition model, IPknot is able to predict secondary structures while considering pseudoknots for long sequences, including mRNA, lncRNA and viral RNA. The other improvement is the selection of thresholds based on pseudo-expected accuracy, which was originally developed by Hamada et al. [23] . We show that the pseudo-expected accuracy is correlated with the 'true' accuracy, and by choosing thresholds for each sequence based on the pseudo-expected accuracy, we can select a nearly optimal secondary structure prediction. Given an RNA sequence x = x 1 · · · x n (x i ∈ {A, C, G, U}), its secondary structure is represented by a binary matrix y = (y ij ), where y ij = 1 if x i and x j form a base pair and otherwise y ij = 0. Let Y(x) be a set of all possible secondary structures of x including pseudoknots. We assume that y ∈ Y(x) can be decomposed into a set of pseudoknot-free substructures y (1) , y (2) , . . . , y (m) , such that y = m p=1 y (p) . In order to guarantee the uniqueness of the decomposition, the following conditions should be satisfied: (i) y ∈ Y(x) should be decomposed into mutually exclusive sets; that is, for all 1 ≤ i < j ≤ |x|, m p=1 y (p) ij ≤ 1; (ii) every base pair in y (p) should be pseudoknotted with at least one base pair in y (q) for ∀q < p. One of the most promising techniques for predicting RNA secondary structures is the MEA-based approach [4, 24] . First, we define a gain function of predictionŷ ∈ Y(x) with regard to the correct secondary structure y ∈ Y(x) as where TP(y,ŷ) = ii p ij ≤ 1 for 1 ≤ ∀i ≤ |x|. Constraint (9) means that each base x i is paired with at most one base. Constraint (10) disallows pseudoknots within the same level p. Constraint (11) ensures that each base pair at level p is pseudoknotted with at least one base pair at every lower level q < p to guarantee the uniqueness of the decomposition y = m p=1 y (p) . To solve the IP problem (5)-(11), we are required to choose the set of thresholds for each level τ (1) , . . . , τ (m) , each of which is a balancing parameter between true positives and true negatives. However, it is not easy to obtain the best set of τ values for any sequence beforehand. Therefore, we employ an approach originally proposed by Hamada et al. [23] , which chooses a parameter set for each sequence among several parameter sets that predicts the best secondary structure in terms of an approximation of the expected accuracy (called pseudo-expected accuracy) and makes the prediction by the best parameter set the final prediction. The accuracy of a predicted RNA secondary structureŷ against a reference structure y is evaluated using the following measures: Here, TP(y,ŷ) = i 500 nt) sequences. These results suggest that prediction of crossing base pairs is improved by selecting the predicted secondary structure while considering both the pseudo-expected accuracy of the entire secondary structure and the pseudo-expected accuracy of the crossing base pairs. Using our dataset, we compared our algorithm with several previous methods that can predict pseudoknots, including Thresh-Knot utilizing LinearPartition (committed on 17 March 2021) [22] , Knotty (committed on Mar 28, 2018) [22] and SPOT-RNA (committed on 1 April 2021) [6] , and those that can predict only pseudoknot-free structures, including CONTRAfold (version 2.02) [4] and RNAfold in the ViennaRNA package (version 2.4.17) [22] . IPknot has several options for the calculation model for base pairing probabilities, namely the LinearPartition model with CONTRAfold parameters (LinearPartition-C), the LinearPartition model with ViennaRNA parameters (LinearPartition-V), the CONTRAfold model and the ViennaRNA model. In addition, ThreshKnot has two possible LinearPartition models for calculating base pairing probabilities. The other existing methods were tested using the default settings. We evaluated the prediction accuracy according to the Fvalue as defined by Equation (14) for pseudoknot-free sequences (PKF in Table 2 ), pseudoknotted sequences (PK in Table 2 ) and only crossing base pairs (CB in Table 2 ) by stratifying sequences by length: short (12-150 nt), medium (151-500 nt) and long (500-4381 nt). For short sequences, SPOT-RNA archived high accuracy, especially for pseudoknotted sequences. However, a large difference in accuracy between the bpRNA-1m-derived and Rfam 14.5derived sequences can be observed for SPOT-RNA compared with the other methods (See Tables S4-S9 in Supplementary Information) . Notably, bpRNA-1m contains many sequences in the same family as the SPOT-RNA training data, and although we performed filtering based on sequence identity, there is still a concern of overfitting. Knotty can predict structures including pseudoknots with an accuracy comparable to that of SPOT-RNA, but as shown in Figure 3 , it can perform secondary structure prediction for only short sequences, owing to its huge computational complexity. Comparing IPknot using the LinearPartition-C and -V models with its counterparts, the original CONTRAfold model and ViennaRNA model achieved comparable accuracy. However, because the computational complexity of the original models is cubic with respect to sequence length, the computational time of the original models increases rapidly as the sequence length exceeds 1500 bases. On the other hand, the computational complexity of the LinearPartition models is linear with respect to sequence length, so the base pairing probabilities can be quickly calculated even when the sequence length exceeds 4000 bases. In addition to calculating the base pairing probabilities, IP calculations are required, but because the number of variables and constraints to be considered can be greatly reduced using the threshold cut technique, the overall execution time is not significantly affected if the sequence length is several thousand bases. Because ThreshKnot, like IPknot, uses the LinearPartition model, it is able to perform fast secondary structure prediction even for long sequences. However, for the prediction accuracy of crossing base pairs, ThreshKnot is even less accurate. Pseudoknots are found not only in cellular RNAs but also in viral RNAs, performing a variety of functions [8] . Tables S10-S11 in Supplementary Information show the results of the secondary structure prediction by separating the datasets into cellular RNAs and viral RNAs, indicating that there is no significant difference in the prediction accuracy between cellular RNAs and viral RNAs. Few methods exist that can perform prediction of common secondary structures including pseudoknots for sequence alignments longer than 1000 bases. Table 3 and Tables S12-S20 in Supplementary Information compare the accuracy of IPknot that employs the LinearPartition model, and RNAalifold in the Vien-naRNA package. We performed common secondary structure prediction for the Rfam reference alignment and the alignment calculated by MAFFT, as well as secondary structure prediction of single sequences only for the seed sequence included in the alignment, and evaluated the prediction accuracy for the seed sequence. In most cases, the prediction accuracy improved as the quality of the alignment increased (Single < MAFFT < Reference). IPknot predicts crossing base pairs based on pseudo-expected accuracy, whereas RNAalifold is unable to predict pseudoknots. Both IPknot and ThreshKnot use the LinearPartition model to calculate base pairing probabilities, and then perform secondary structure prediction using different strategies. ThreshKnot predicts the base pairs x i and x j that are higher than a predetermined threshold θ and have the largest p ij in terms of both i and j. IPknot predicts the pseudoknot structure with multiple thresholds τ (1) , . . . , τ (m) in a hierarchical manner based on IP (5)- (11) , and then carefully selects from among these thresholds based on pseudo-expected accuracy. Because both the pseudoexpected accuracy of the entire secondary structure as well as the pseudo-expected accuracy of the crossing base pairs are taken into account, the prediction accuracy of the pseudoknot structure is inferred to be enhanced in IPknot. Because the LinearPartition model uses the same parameters as the CONTRAfold and ViennaRNA packages, there is no significant difference in accuracy between using LinearPartition-C and -V and their counterparts, the CONTRAfold and ViennaRNA models. It has been shown that LinearPartition has no significant effect on accuracy even though it ignores structures whose probability is extremely low owing to its use of beam search, which makes the calculation linear with respect to the sequence length [22] . The LinearPartition model enables IPknot to perform secondary structure prediction including pseudoknots of very long sequences, such as mRNA, lncRNA, and viral RNA. SPOT-RNA [6] , which uses deep learning, showed notable prediction accuracy in our experiments, especially in short sequences containing pseudoknots, with F-value of 0.621, which is superior to other methods. However, SPOT-RNA requires considerable computing resources such as GPGPU and long computational time. Furthermore, SPOT-RNA showed a large difference in prediction accuracy between sequences that are close to the training data and those that are not compared with the other methods. Therefore, the situations in which SPOT-RNA can be used are considered to be limited. In contrast, IPknot uses CONTRAfold parameters, which is also based on machine learning, but we did not observe as much overfitting with IPknot as with SPOT-RNA. Approaches that provide an exact solution for limitedcomplexity pseudoknot structures, such as PKNOTS [14] , pknot-sRG [15] , and Knotty [16] , can predict pseudoknot structures with high accuracy but demand a huge amount of computation O(n 4 )-O(n 6 ) for sequence length n, limiting secondary structure prediction to sequences only up to about 150 bases. On the other hand, IPknot predicts the pseudoknot structure using a fast computational heuristic-based method with the linear time computation, which does not allow us to find an exact solution. Instead, IPknot improves the prediction accuracy of the pseudoknot structure by choosing the best solution from among several solutions based on the pseudo-expected accuracy. IPknot uses pseudoknot-free algorithms, such as CON-TRAfold and ViennaRNA, to calculate base pairing probabilities, and its prediction accuracy of the resulting secondary structure strongly depends on the algorithm used to calculate base pairing probabilities. Therefore, we can expect to improve the prediction accuracy of IPknot by calculating the base pairing probabilities based on state-of-the-art pseudoknot-free secondary structure prediction methods such as MXfold2 [7] . It is well known that common secondary structure prediction from sequence alignments improves the accuracy of secondary structure prediction. However, among the algorithms for predicting common secondary structure including pseudoknots, only IPknot can deal with sequence alignments longer than several thousand bases. In the RNA virus SARS-CoV-2, programmed -1 ribosomal frameshift (-1 PRF), in which a pseudoknot structure plays an important role, has been identified and is attracting attention as a drug target [10] . Because many closely related strains of SARS-CoV-2 have been sequenced, it is expected that structural motifs including pseudoknots, such as -1 PRF, can be found by predicting the common secondary structure from the alignment. We have developed an improvement to IPknot that enables calculation in linear time by employing the LinearPartition model and automatically selects the optimal threshold parameters based on the pseudo-expected accuracy. LinearPartition can calculate the base pairing probability with linear computational complexity with respect to the sequence length. By employing LinearPartition, IPknot is able to predict the secondary structure considering pseudoknots for long sequences such as mRNA, lncRNA, and viral RNA. By choosing the thresholds for each sequence based on the pseudo-expected accuracy, we can select a nearly optimal secondary structure prediction. The LinearPartition model realized the predictiction of secondary structures considering pseudoknots for long sequences. However, the prediction accuracy is still not sufficiently high, especially for crossing base pairs. We expect that by learning parameters from long sequences [36] , we can achieve high accuracy even for long sequences. • We reduced the computational time required by IPknot from cubic to linear with respect to the sequence length by employing the LinearPartition model and enabled the secondary structure prediction including pseudoknots for long RNA sequences such as mRNA, lncRNA, and viral RNA. • We improved the accuracy of secondary structure prediction including pseudoknots by introducing pseudoexpected accuracy not only for the entire base pairs but also for crossing base pairs. • To the best of our knowledge, IPknot is the only method that can perform RNA secondary structure prediction including pseudoknot not only for very long single sequence, but also for very long sequence alignments. Supplementary data are available online at Briefings in Bioinformatics. The IPknot source code is freely available at https://githu b.com/satoken/ipknot. IPknot is also available for use from a web server at http://rtips.dna.bio.keio.ac.jp/ipknot++/. The datasets used in our experiments are available at https:// doi.org/10.5281/zenodo.4923158. Mfold web server for nucleic acid folding and hybridization prediction Vien-naRNA package 2.0 RNAstructure: software for RNA secondary structure prediction and analysis RNA secondary structure prediction without physics-based models Rich parameterization improves RNA structure prediction RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning RNA secondary structure prediction using deep learning with thermodynamic integration Viral RNA pseudoknots: versatile motifs in gene expression and replication Pseudoknots: RNA structures with diverse functions Structural and functional conservation of the programmed -1 ribosomal frameshift signal of SARS coronavirus 2 (SARS-CoV-2) Primordia vita. deconvolution from modern sequences Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots RNA pseudoknot prediction in energy-based models A dynamic programming algorithm for RNA structure prediction including pseudoknots Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics Knotty: efficient and accurate prediction of complex RNA pseudoknot structures An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots HotKnots: heuristic prediction of RNA secondary structures including pseudoknots FlexStem: improving predictions of RNA secondary structures with pseudoknots by reducing the search space ProbKnot: fast prediction of RNA secondary structure including pseudoknots IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities Prediction of RNA secondary structure by maximizing pseudo-expected accuracy Prediction of RNA secondary structure using generalized centroid estimators The equilibrium partition function and base pair binding probabilities for RNA secondary structure Robust prediction of consensus secondary structures using averaged base pairing probability matrices Improving the accuracy of predicting secondary structure for aligned RNA sequences bpRNA: large-scale automated annotation and analysis of RNA secondary structure Rfam 12.0: updates to the RNA families database The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs Rfam 14: expanded coverage of metagenomic, viral and microRNA families RNA STRAND: the RNA secondary structure and statistical analysis database CD-HIT: accelerated for clustering the next-generation sequencing data MAFFT multiple sequence alignment software version 7: improvements in performance and usability Thresholded ProbKnot for improved RNA secondary structure prediction Learning to fold RNAs in linear time. bioRxiv K.S. conceived the study, implemented the algorithm, collected the datasets, conducted experiments, and drafted the manuscript. K.S. and Y.K. discussed the algorithm and designed the experiments. All authors read, contributed to the discussion of and approved the final manuscript. This work was partially supported by a Grant-in-Aid for Scientific Research (B) (No. 19H04210) and Challenging Exploratory Research (No. 19K22897) from the Japan Society for the Promotion of Science (JSPS) to K.S. and a Grant-in-Aid for Scientific Research (C) (Nos. 18K11526 and 21K12109) from JSPS to Y.K. The supercomputer system used for this research was made available by the National Institute of Genetics, Research Organization of Information and Systems.