key: cord-0025989-uuve1e2x
authors: Sato, Kengo; Kato, Yuki
title: Prediction of RNA secondary structure including pseudoknots for long sequences
date: 2021-10-02
journal: Brief Bioinform
DOI: 10.1093/bib/bbab395
sha: 26e722ff1d37b910f634206d61f1ce174196ce5d
doc_id: 25989
cord_uid: uuve1e2x

RNA structural elements called pseudoknots are involved in various biological phenomena including ribosomal frameshifts. Because it is infeasible to construct an efficiently computable secondary structure model including pseudoknots, secondary structure prediction methods considering pseudoknots are not yet widely available. We developed IPknot, which uses heuristics to speed up computations, but it has remained difficult to apply it to long sequences, such as messenger RNA and viral RNA, because it requires cubic computational time with respect to sequence length and has threshold parameters that need to be manually adjusted. Here, we propose an improvement of IPknot that enables calculation in linear time by employing the LinearPartition model and automatically selects the optimal threshold parameters based on the pseudo-expected accuracy. In addition, IPknot showed favorable prediction accuracy across a wide range of conditions in our exhaustive benchmarking, not only for single sequences but also for multiple alignments.

Genetic information recorded in DNA is transcribed into RNA, which is then translated into protein to fulfill its function. In other words, RNA is merely an intermediate product for the transmission of genetic information. This type of RNA is called messenger RNA (mRNA). However, many RNAs that do not fit into this framework have been discovered more recently. For example, transfer RNA and ribosomal RNA, which play central roles in the translation mechanism, nucleolar small RNA, which guides the modification sites of other RNAs, and microRNA, which regulates gene expression, have been discovered. Thus, it has become clear that RNAs other than mRNAs are involved in various biological phenomena. Because these RNAs do not encode proteins, they are called non-coding RNAs. In contrast to DNA, which forms a double-stranded structure in vivo, RNA is often single-stranded and is thus unstable when intact. In the case of mRNA, the cap structure at the 5 end and the poly-A strand at the 3 end protect it from degradation. On the other hand, for other RNAs that do not have such structures, singlestranded RNA molecules bind to themselves to form threedimensional structures and ensure their stability. Also, as in the case of proteins, RNAs with similar functions have similar threedimensional structures, and it is known that there is a strong association between function and structure. The determination of RNA three-dimensional (3D) structure can be performed by Xray crystallography, nuclear magnetic resonance, cryo-electron microscopy, and other techniques. However, it is difficult to apply these methods on a large scale owing to difficulties associated with sequence lengths, resolution and cost. Therefore, RNA secondary structure, which is easier to model, is often computationally predicted instead. RNA secondary structure refers to the set of base pairs consisting of Watson-Crick base pairs (A-U, G-C) and wobble base pairs (G-U) that form the backbone of the 3D structure.

RNA secondary structure prediction is conventionally based on thermodynamic models, which predict the secondary structure with the minimum free energy (MFE) among all possible secondary structures. Popular methods based on thermodynamic models include mfold [1] , RNAfold [2] , and RNAstructure [3] . Recently, RNA secondary structure prediction methods based on machine learning have also been developed. These methods train alternative parameters to the thermodynamic parameters by taking a large number of pairs of RNA sequences and their reference secondary structures as training data. The following methods fall under the category of methods that use machine learning: CONTRAfold [4] , ContextFold [5] , SPOT-RNA [6] and MXfold2 [7] . However, from the viewpoint of computational complexity, most approaches do not support the prediction of secondary structures that include pseudoknot substructures.

Pseudoknots are one of the key topologies occurring in RNA secondary structures. The pseudoknot structure is a structure in which some bases inside of a loop structure form base pairs with bases outside of the loop (e.g. Figure 1A ). In other words, it is said to have a pseudoknot structure if there exist base pairs that are crossing each other by connecting bases of base pairs with arcs, as shown in Figure 1B . The pseudoknot structure is known to be involved in the regulation of translation and splicing, and ribosomal frameshifts [8] [9] [10] . The results of sequence analysis suggest that the hairpin loops, which are essential building blocks of the pseudoknots, first appeared in the evolutionary timescale [11] , and then the pseudoknots were configured, resulting in gaining those functions. We therefore conclude that pseudoknots should not be excluded from the modeling of RNA secondary structures.

The computational complexity required for MFE predictions of an arbitrary pseudoknot structure has been proven to be NP-hard [12, 13] . To address this, dynamic programming-based methods that require polynomial time (O(n 4 )-O(n 6 ) for sequence length n) to exactly compute the restricted complexity of pseudoknot structures [12] [13] [14] [15] [16] and heuristics-based fast computation methods [17] [18] [19] [20] have been developed.

We previously developed IPknot [21] , a fast heuristic-based method for predicting RNA secondary structures including pseudoknots. IPknot decomposes a secondary structure with pseudoknots into several pseudoknot-free substructures and predicts the optimal secondary structure using integer programming (IP) based on maximization of expected accuracy (MEA) under the constraints that each substructure must satisfy. The threshold cut technique, which is naturally derived from MEA, enables IPknot to perform much faster calculations with nearly comparable prediction accuracy relative to other methods. However, because the MEA-based score uses base pairing probability without considering pseudoknots, which requires a calculation time that increases cubically with sequence length, it is difficult to use for secondary structure prediction of sequences that exceed 1000 bases, even when applying a threshold cut technique. Furthermore, as the prediction accuracy can drastically change depending on the thresholds determined in advance for each pseudoknot-free substructure, thresholds must be carefully determined.

To address the limitations of IPknot, we implemented the following two improvements to the method. The first is the use of LinearPartition [22] to calculate base pairing probabilities. LinearPartition can calculate the base pairing probability, with linear computational complexity with respect to sequence length, using the beam search technique. By employing the LinearPartition model, IPknot is able to predict secondary structures while considering pseudoknots for long sequences, including mRNA, lncRNA and viral RNA. The other improvement is the selection of thresholds based on pseudo-expected accuracy, which was originally developed by Hamada et al. [23] . We show that the pseudo-expected accuracy is correlated with the 'true' accuracy, and by choosing thresholds for each sequence based on the pseudo-expected accuracy, we can select a nearly optimal secondary structure prediction.

Given an RNA sequence x = x 1 · · · x n (x i ∈ {A, C, G, U}), its secondary structure is represented by a binary matrix y = (y ij ), where y ij = 1 if x i and x j form a base pair and otherwise y ij = 0. Let Y(x) be a set of all possible secondary structures of x including pseudoknots. We assume that y ∈ Y(x) can be decomposed into a set of pseudoknot-free substructures y (1) , y (2) , . . . , y (m) , such that y = m p=1 y (p) . In order to guarantee the uniqueness of the decomposition, the following conditions should be satisfied: (i) y ∈ Y(x) should be decomposed into mutually exclusive sets; that is, for all 1 ≤ i < j ≤ |x|, m p=1 y (p) ij ≤ 1; (ii) every base pair in y (p) should be pseudoknotted with at least one base pair in y (q) for ∀q < p.

One of the most promising techniques for predicting RNA secondary structures is the MEA-based approach [4, 24] . First, we define a gain function of predictionŷ ∈ Y(x) with regard to the correct secondary structure y ∈ Y(x) as

where TP(y,ŷ) = i<j I(y ij = 1)I(ŷ ij = 1) is the number of true positive base pairs, TN(y,ŷ) = i<j I(y ij = 0)I(ŷ ij = 0) is the number of true negative base pairs, and τ ∈ [0, 1] is a balancing parameter between true positives and true negatives. Here, I(condition) is the indicator function that takes a value of 1 or 0 depending on whether the condition is true or false.

Our objective is to find a secondary structure that maximizes the expectation of the gain function (1) under a given probability distribution over the space Y(x) of pseudoknotted secondary structures, as follows:

Here, P(y | x) is a probability distribution of RNA secondary structures including pseudoknots. Because the calculation of the expected gain function (2) is intractable for arbitrary pseudoknots, we approximate Eq.

(2) by the sum of the expected gain function for decomposed pseudoknot-free substructuresŷ (1) , . . . ,ŷ (m) forŷ ∈ Y(x) such that y = m p=1ŷ (p) , and thus, we find a pseudoknotted structureŷ and its decompositionŷ (1) 

where τ (p) ∈ [0, 1] is a balancing parameter between true positives and true negatives for a level p, and C is a constant independent ofŷ. The base pairing probability p ij is the probability that the bases x i and x j form a base pair, which is defined as

See Section S1 in Supplementary Information for the derivation.

Notably, it is no longer necessary to consider the base pairs whose probabilities are at most the threshold τ (p) , which we refer to as the threshold cut. We can choose P (y | x), a probability distribution over a set Y (x) of secondary structures without pseudoknots, from among several options. Instead of using a probability distribution with pseudoknots, we can employ a probability distribution without pseudoknots, such as the McCaskill model [25] and the CON-TRAfold model [4] , whose computational complexity is O(|x| 3 ) for time and O(|x| 2 ) for space. Alternatively, the LinearPartition model [22] , which is O(|x|) in both time and space, enables us to predict the secondary structure of sequences much longer than 1000 bases.

We can formulate our problem described in the previous section as the following IP problem:

Because Equation (5) is an instantiation of the approximate estimator (3) and the threshold cut technique is applicable to Eq. (3), the base pairs y

ij whose base pairing probabilities p ij are larger than τ (p) need to be considered. The number of variables y (p) ij that should be considered is at most |x|/τ (p) because j<i p ji + j>i p ij ≤ 1 for 1 ≤ ∀i ≤ |x|. Constraint (9) means that each base x i is paired with at most one base. Constraint (10) disallows pseudoknots within the same level p. Constraint (11) ensures that each base pair at level p is pseudoknotted with at least one base pair at every lower level q < p to guarantee the uniqueness of the decomposition y = m p=1 y (p) .

To solve the IP problem (5)-(11), we are required to choose the set of thresholds for each level τ (1) , . . . , τ (m) , each of which is a balancing parameter between true positives and true negatives. However, it is not easy to obtain the best set of τ values for any sequence beforehand. Therefore, we employ an approach originally proposed by Hamada et al. [23] , which chooses a parameter set for each sequence among several parameter sets that predicts the best secondary structure in terms of an approximation of the expected accuracy (called pseudo-expected accuracy) and makes the prediction by the best parameter set the final prediction.

The accuracy of a predicted RNA secondary structureŷ against a reference structure y is evaluated using the following measures:

Here, TP(y,ŷ) = i<j I(y ij = 1)I(ŷ ij = 1), FP(y,ŷ) = i<j I(y ij = 0)I(ŷ ij = 1) and FN(y,ŷ) = i<j I(y ij = 1)I(ŷ ij = 0). To estimate the accuracy of the predicted secondary structureŷ without knowing the true secondary structure y, we take an expectation of F(y,ŷ) over the distribution of y:

However, this calculation is intractable because the number of y ∈ Y(x) increases exponentially with the length of sequence x. Alternatively, we first calculate expected TP, FP and FN as follows:

Then, we approximate F by calculating Equation (14) using TP, FP, and FN instead of TP, FP and FN, respectively. In addition to the original pseudo-expected accuracy described above, we introduce the pseudo-expected accuracy for crossing base pairs to predict pseudoknotted structures. Prediction of secondary structures including pseudoknots depends on both the conventional prediction accuracy of base pairs described above and the accuracy of crossing base pairs. A crossing base pair is a base pair x i and x j such that there exists another base pair x k and x l that is crossing the base pair x i and

We define the expectations of true positives, false positives and false negatives for crossing base pairs as follows:

Here, cb(y) is an n×n binary matrix, whose (i, j)-element is y ij itself if there exists k < i < l < j or i < k < j < l such that y kl = 1, and 0 otherwise. Then, we calculate the pseudo-expected F-value for crossing base pairs F cb using Equation (14) 

Then, we calculate their pseudo-expected accuracy F(ŷ t ) + F cb (ŷ t ) and choose the secondary structureŷ t that maximizes the pseudo-expected accuracy as the final prediction.

The average of the base pairing probability matrices for each sequence in an alignment has been used to predict the common secondary structure for the alignment [26, 27] . Let A be an alignment of RNA sequences that contains k sequences and |A| denote the number of columns in A. We calculate the base pairing probabilities of an individual sequence x ∈ A as

The averaged base pairing probability matrix is defined as

The common secondary structure of the alignment A can be calculated in the same way by replacing p ij in Equations (5) with p

ij . While the common secondary structure prediction based on the average base pairing probability matrix has been implemented in the previous version of IPknot [21] , the present version employs the LinearPartition model, which enables the calculation linearly with respect to the alignment length.

Our method has been implemented as the newest version of a program called IPknot. In addition to the McCaskil model [25] and CONTRAfold model [4] , which were already integrated into the previous version of IPknot, the LinearPartition model [22] is also supported as a probability distribution for secondary structures.

To solve IP problems, the GNU Linear Programming Kit (GLPK; http://www.gnu.org/software/glpk/), Gurobi Optimizer (http://gu robi.com/) or IBM CPLEX Optimizer (https://www.ibm.com/ana lytics/cplex-optimizer) can be employed.

To evaluate our algorithm, we performed computational experiments on several datasets. We employed RNA sequences extracted from the bpRNA-1m dataset [28] , which is based on Rfam 12.2 [29] , and the comparative RNA web dataset [30] with 2588 families. In addition, we built a dataset that includes families from the most recent Rfam database, Rfam 14.5 [31] . Since the release of Rfam 12.2, the Rfam project has actively collected about 1400 RNA families, including families detected by newly developed techniques. We extracted these newly discovered families. To limit bias in the training data, sequences with higher than 80% sequence identity with the sequence subsets S-Processed-TRA from RNA STRAND [32] and TR0 from bpRNA-1m [28] , which are the training datasets for CONTRAfold and SPOT-RNA, respectively, were removed using CD-HIT-EST-2D [33] . We then removed redundant sequences using CD-HIT-EST [33] , with a cutoff threshold of 80% sequence identity.

For the prediction of common secondary structures, the sequence selected by the above method was used as a seed, and 1-9 sequences of the same Rfam family and with high sequence identity (≥ 80%) with the seed sequence were randomly selected to create an alignment. Common secondary structure prediction was performed on the reference alignments from Rfam and the alignments calculated by MAFFT [34] . Because there are sequences from bpRNA-1m that do not have Rfam reference alignments, only sequences from Rfam 14.5 were tested for common secondary structure prediction. To capture the accuracy of the common secondary structure prediction, the accuracy for the seed sequence is shown.

A summary of the dataset created and utilized is shown in Table 1 .

First, to show the effectiveness of the automatic selection from among thresholds τ (1) , . . . , τ (m) based on the pseudo-expected accuracy, Figure 2 and Table S1 in Supplementary Information show the prediction accuracy on the dataset of short sequences (≤ 150 nt) using automatic selection and manual selection of the threshold τ values. For IPknot, we fixed the number of decomposed sets of secondary substructures m = 2, and varied threshold parameters τ values for base pairing probability in such a way that {(τ (1) ,

In IPknot with pseudo-expected accuracy, the best secondary structure in the sense of pseudo-expected F is selected from the same range of (τ (1) , τ (2) ) for each sequence. For these variants of IPknot, the LinearPartition model with CONTRAfold parameters (LinearPartition-C) was used to calculate base pairing probabilities. In addition, we compared the prediction accuracy of IPknot with that of ThreshKnot [35] , which also calculates base pairing probabilities using LinearPartition-C. We used {2 −t | t = 1, 2, 3, 4}∪ {0.3} as the threshold parameter θ for ThreshKnot because the default threshold parameter of ThreshKnot is θ = 0.3. IPknot with threshold parameters of τ (1) = 0.125 and τ (2) = 0.125 had the highest prediction accuracy of F = 0.659. IPknot with pseudoexpected accuracy has a prediction accuracy of F = 0.658, which is comparable to the highest accuracy obtained. ThreshKnot with a threshold of 0.25 has an accuracy of F = 0.656, which is also comparable to the best accuracy obtained. The pseudo-expected F-value and "true" F-value are relatively highly correlated (Spearman correlation coefficient ρ = 0.639), indicating that the selection of predicted secondary structure using pseudo-expected accuracy works well.

While the accuracy of the prediction of the entire secondary structure has already been considered, as shown in Figure 2 , for the prediction of secondary structures with pseudoknots, it is necessary to evaluate the prediction accuracy focused on the crossing base pairs. In terms of prediction accuracy limited to only crossing base pairs, IPknot with pseudo-expected accuracy yielded F cb = 0.258, while the highest accuracy achieved by IPknot with the threshold parameters and ThreshKnot was considerably lower at F cb = 0.161 and 0.057, respectively (See Table S1 in Supplementary Information). We can observe the similar tendency to the above in Figures S1 and S2 , and Tables S2 and S3 in Supplementary Information for medium (151-500 nt) and long (> 500 nt) sequences. These results suggest that prediction of crossing base pairs is improved by selecting the predicted secondary structure while considering both the pseudo-expected accuracy of the entire secondary structure and the pseudo-expected accuracy of the crossing base pairs.

Using our dataset, we compared our algorithm with several previous methods that can predict pseudoknots, including Thresh-Knot utilizing LinearPartition (committed on 17 March 2021) [22] , Knotty (committed on Mar 28, 2018) [22] and SPOT-RNA (committed on 1 April 2021) [6] , and those that can predict only pseudoknot-free structures, including CONTRAfold (version 2.02) [4] and RNAfold in the ViennaRNA package (version 2.4.17) [22] . IPknot has several options for the calculation model for base pairing probabilities, namely the LinearPartition model with CONTRAfold parameters (LinearPartition-C), the LinearPartition model with ViennaRNA parameters (LinearPartition-V), the CONTRAfold model and the ViennaRNA model. In addition, ThreshKnot has two possible LinearPartition models for calculating base pairing probabilities. The other existing methods were tested using the default settings.

We evaluated the prediction accuracy according to the Fvalue as defined by Equation (14) for pseudoknot-free sequences (PKF in Table 2 ), pseudoknotted sequences (PK in Table 2 ) and only crossing base pairs (CB in Table 2 ) by stratifying sequences by length: short (12-150 nt), medium (151-500 nt) and long (500-4381 nt).

For short sequences, SPOT-RNA archived high accuracy, especially for pseudoknotted sequences. However, a large difference in accuracy between the bpRNA-1m-derived and Rfam 14.5derived sequences can be observed for SPOT-RNA compared with the other methods (See Tables S4-S9 in Supplementary Information) . Notably, bpRNA-1m contains many sequences in the same family as the SPOT-RNA training data, and although we performed filtering based on sequence identity, there is still a concern of overfitting. Knotty can predict structures including pseudoknots with an accuracy comparable to that of SPOT-RNA, but as shown in Figure 3 , it can perform secondary structure prediction for only short sequences, owing to its huge computational complexity. Comparing IPknot using the LinearPartition-C and -V models with its counterparts, the original CONTRAfold model and ViennaRNA model achieved comparable accuracy. However, because the computational complexity of the original models is cubic with respect to sequence length, the computational time of the original models increases rapidly as the sequence length exceeds 1500 bases. On the other hand, the computational complexity of the LinearPartition models is linear with respect to sequence length, so the base pairing probabilities can be quickly calculated even when the sequence length exceeds 4000 bases. In addition to calculating the base pairing probabilities, IP calculations are required, but because the number of variables and constraints to be considered can be greatly reduced using the threshold cut technique, the overall execution time is not significantly affected if the sequence length is several thousand bases. Because ThreshKnot, like IPknot, uses the LinearPartition model, it is able to perform fast secondary structure prediction even for long sequences. However, for the prediction accuracy of crossing base pairs, ThreshKnot is even less accurate.

Pseudoknots are found not only in cellular RNAs but also in viral RNAs, performing a variety of functions [8] . Tables S10-S11 in Supplementary Information show the results of the secondary structure prediction by separating the datasets into cellular RNAs and viral RNAs, indicating that there is no significant difference in the prediction accuracy between cellular RNAs and viral RNAs.

Few methods exist that can perform prediction of common secondary structures including pseudoknots for sequence alignments longer than 1000 bases. Table 3 and Tables S12-S20 in Supplementary Information compare the accuracy of IPknot that employs the LinearPartition model, and RNAalifold in the Vien-naRNA package. We performed common secondary structure prediction for the Rfam reference alignment and the alignment calculated by MAFFT, as well as secondary structure prediction of single sequences only for the seed sequence included in the alignment, and evaluated the prediction accuracy for the seed sequence. In most cases, the prediction accuracy improved as the quality of the alignment increased (Single < MAFFT < Reference). IPknot predicts crossing base pairs based on pseudo-expected accuracy, whereas RNAalifold is unable to predict pseudoknots.

Both IPknot and ThreshKnot use the LinearPartition model to calculate base pairing probabilities, and then perform secondary structure prediction using different strategies. ThreshKnot predicts the base pairs x i and x j that are higher than a predetermined threshold θ and have the largest p ij in terms of both i and j. IPknot predicts the pseudoknot structure with multiple thresholds τ (1) , . . . , τ (m) in a hierarchical manner based on IP (5)- (11) , and then carefully selects from among these thresholds based on pseudo-expected accuracy. Because both the pseudoexpected accuracy of the entire secondary structure as well as the pseudo-expected accuracy of the crossing base pairs are taken into account, the prediction accuracy of the pseudoknot structure is inferred to be enhanced in IPknot. Because the LinearPartition model uses the same parameters as the CONTRAfold and ViennaRNA packages, there is no significant difference in accuracy between using LinearPartition-C and -V and their counterparts, the CONTRAfold and ViennaRNA models. It has been shown that LinearPartition has no significant effect on accuracy even though it ignores structures whose probability is extremely low owing to its use of beam search, which makes the calculation linear with respect to the sequence length [22] . The LinearPartition model enables IPknot to perform secondary structure prediction including pseudoknots of very long sequences, such as mRNA, lncRNA, and viral RNA.

SPOT-RNA [6] , which uses deep learning, showed notable prediction accuracy in our experiments, especially in short sequences containing pseudoknots, with F-value of 0.621, which is superior to other methods. However, SPOT-RNA requires considerable computing resources such as GPGPU and long computational time. Furthermore, SPOT-RNA showed a large difference in prediction accuracy between sequences that are close to the training data and those that are not compared with the other methods. Therefore, the situations in which SPOT-RNA can be used are considered to be limited. In contrast, IPknot uses CONTRAfold parameters, which is also based on machine learning, but we did not observe as much overfitting with IPknot as with SPOT-RNA.

Approaches that provide an exact solution for limitedcomplexity pseudoknot structures, such as PKNOTS [14] , pknot-sRG [15] , and Knotty [16] , can predict pseudoknot structures with high accuracy but demand a huge amount of computation O(n 4 )-O(n 6 ) for sequence length n, limiting secondary structure prediction to sequences only up to about 150 bases. On the other hand, IPknot predicts the pseudoknot structure using a fast computational heuristic-based method with the linear time computation, which does not allow us to find an exact solution. Instead, IPknot improves the prediction accuracy of the pseudoknot structure by choosing the best solution from among several solutions based on the pseudo-expected accuracy.

IPknot uses pseudoknot-free algorithms, such as CON-TRAfold and ViennaRNA, to calculate base pairing probabilities, and its prediction accuracy of the resulting secondary structure strongly depends on the algorithm used to calculate base pairing probabilities. Therefore, we can expect to improve the prediction accuracy of IPknot by calculating the base pairing probabilities based on state-of-the-art pseudoknot-free secondary structure prediction methods such as MXfold2 [7] .

It is well known that common secondary structure prediction from sequence alignments improves the accuracy of secondary structure prediction. However, among the algorithms for predicting common secondary structure including pseudoknots, only IPknot can deal with sequence alignments longer than several thousand bases. In the RNA virus SARS-CoV-2, programmed -1 ribosomal frameshift (-1 PRF), in which a pseudoknot structure plays an important role, has been identified and is attracting attention as a drug target [10] . Because many closely related strains of SARS-CoV-2 have been sequenced, it is expected that structural motifs including pseudoknots, such as -1 PRF, can be found by predicting the common secondary structure from the alignment.

We have developed an improvement to IPknot that enables calculation in linear time by employing the LinearPartition model and automatically selects the optimal threshold parameters based on the pseudo-expected accuracy. LinearPartition can calculate the base pairing probability with linear computational complexity with respect to the sequence length. By employing LinearPartition, IPknot is able to predict the secondary structure considering pseudoknots for long sequences such as mRNA, lncRNA, and viral RNA. By choosing the thresholds for each sequence based on the pseudo-expected accuracy, we can select a nearly optimal secondary structure prediction.

The LinearPartition model realized the predictiction of secondary structures considering pseudoknots for long sequences. However, the prediction accuracy is still not sufficiently high, especially for crossing base pairs. We expect that by learning parameters from long sequences [36] , we can achieve high accuracy even for long sequences.

• We reduced the computational time required by IPknot from cubic to linear with respect to the sequence length by employing the LinearPartition model and enabled the secondary structure prediction including pseudoknots for long RNA sequences such as mRNA, lncRNA, and viral RNA.

• We improved the accuracy of secondary structure prediction including pseudoknots by introducing pseudoexpected accuracy not only for the entire base pairs but also for crossing base pairs. • To the best of our knowledge, IPknot is the only method that can perform RNA secondary structure prediction including pseudoknot not only for very long single sequence, but also for very long sequence alignments.

Supplementary data are available online at Briefings in Bioinformatics.

The IPknot source code is freely available at https://githu b.com/satoken/ipknot. IPknot is also available for use from a web server at http://rtips.dna.bio.keio.ac.jp/ipknot++/. The datasets used in our experiments are available at https:// doi.org/10.5281/zenodo.4923158.

Mfold web server for nucleic acid folding and hybridization prediction

Vien-naRNA package 2.0

RNAstructure: software for RNA secondary structure prediction and analysis

RNA secondary structure prediction without physics-based models

Rich parameterization improves RNA structure prediction

RNA secondary structure prediction using an ensemble of two-dimensional deep neural networks and transfer learning

RNA secondary structure prediction using deep learning with thermodynamic integration

Viral RNA pseudoknots: versatile motifs in gene expression and replication

Pseudoknots: RNA structures with diverse functions

Structural and functional conservation of the programmed -1 ribosomal frameshift signal of SARS coronavirus 2 (SARS-CoV-2)

Primordia vita. deconvolution from modern sequences

Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots

RNA pseudoknot prediction in energy-based models

A dynamic programming algorithm for RNA structure prediction including pseudoknots

Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics

Knotty: efficient and accurate prediction of complex RNA pseudoknot structures

An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots

HotKnots: heuristic prediction of RNA secondary structures including pseudoknots

FlexStem: improving predictions of RNA secondary structures with pseudoknots by reducing the search space

ProbKnot: fast prediction of RNA secondary structure including pseudoknots

IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming

LinearPartition: linear-time approximation of RNA folding partition function and base-pairing probabilities

Prediction of RNA secondary structure by maximizing pseudo-expected accuracy

Prediction of RNA secondary structure using generalized centroid estimators

The equilibrium partition function and base pair binding probabilities for RNA secondary structure

Robust prediction of consensus secondary structures using averaged base pairing probability matrices

Improving the accuracy of predicting secondary structure for aligned RNA sequences

bpRNA: large-scale automated annotation and analysis of RNA secondary structure

Rfam 12.0: updates to the RNA families database

The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs

Rfam 14: expanded coverage of metagenomic, viral and microRNA families

RNA STRAND: the RNA secondary structure and statistical analysis database

CD-HIT: accelerated for clustering the next-generation sequencing data

MAFFT multiple sequence alignment software version 7: improvements in performance and usability

Thresholded ProbKnot for improved RNA secondary structure prediction

Learning to fold RNAs in linear time. bioRxiv

K.S. conceived the study, implemented the algorithm, collected the datasets, conducted experiments, and drafted the manuscript. K.S. and Y.K. discussed the algorithm and designed the experiments. All authors read, contributed to the discussion of and approved the final manuscript.

This work was partially supported by a Grant-in-Aid for Scientific Research (B) (No. 19H04210) and Challenging Exploratory Research (No. 19K22897) from the Japan Society for the Promotion of Science (JSPS) to K.S. and a Grant-in-Aid for Scientific Research (C) (Nos. 18K11526 and 21K12109) from JSPS to Y.K.

The supercomputer system used for this research was made available by the National Institute of Genetics, Research Organization of Information and Systems.