key: cord-0211103-kto778l5
authors: Gao, James Z. M.; Li, Linda Y. M.; Reidys, Christian M.
title: Inverse folding of RNA pseudoknot structures
date: 2009-05-06
journal: nan
DOI: nan
sha: 081fce9252d6f843d2329db65ae188d41bf4c30f
doc_id: 211103
cord_uid: kto778l5

Background: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA, are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv, which can deal with 3-noncrossing, canonical pseudoknot structures.

Results: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocode. We show that Inv allows one to design, in particular, 3-noncrossing nonplanar RNA pseudoknot structures, a class which is difficult to construct via dynamic programming routines. Inv is freely available at url{}.

Conclusions: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing structures that potentially exhibit cross-serial interactions.

Pseudoknots are structural elements of central importance in RNA structures [1], see Figure 1.
They represent cross-serial base pairing interactions between RNA nucleotides that are functionally important in tRNAs, RNaseP [2], telomerase RNA [3], and ribosomal RNAs [4]. Pseudoknot structures are observed in the mimicry of tRNA structures in plant virus RNAs as well as in the binding to the HIV-1 reverse transcriptase in in vitro selection experiments [5]. Furthermore, basic mechanisms like ribosomal frame shifting involve pseudoknots [6]. Despite playing a key role in a variety of contexts, pseudoknots are excluded from large-scale computational studies. Although the problem has attracted considerable attention in the last decade, pseudoknots are considered a somewhat "exotic" structural concept. As far as is known [8], the ab initio prediction of general RNA pseudoknot structures is NP-complete, and the algorithmic difficulties of pseudoknot folding are compounded by the fact that the thermodynamics of pseudoknots is far from being well understood.

As for the folding of RNA secondary structures, Waterman et al. [9, 10], Zuker et al. [11] and Nussinov [12] established the dynamic programming (DP) folding routines. The first mfe-folding algorithm for RNA secondary structures, however, dates back to the 1960s [13] [14] [15]. For restricted classes of pseudoknots, several algorithms have been designed: Rivas and Eddy [16], Dirks and Pierce [17], Reeder and Giegerich [18] and Ren et al. [19]. Recently, a novel ab initio folding algorithm, Cross, has been introduced [20]. Cross generates minimum free energy (mfe), 3-noncrossing, 3-canonical RNA structures, i.e. structures that do not contain three or more mutually crossing arcs and in which each stack, i.e. sequence of parallel arcs, see eq. (1), has size greater than or equal to three. In particular, in a 3-canonical structure there are no isolated arcs, see Figure 2.

Figure 2: σ-canonical RNA structures: each stack of "parallel" arcs has to have minimum size σ.
Here we display a 3-canonical structure.

The notion of mfe-structure is based on a specific concept of pseudoknot loops and respective loop-based energy parameters. This thermodynamic model was conceived by Tinoco and refined by Freier, Turner, Ninio, and others [14, [21] [22] [23] [24] [25].

1.1 k-noncrossing, σ-canonical RNA pseudoknot structures

Let us turn back the clock: three decades ago Waterman et al. [26], Nussinov et al. [12] and Kleitman et al. [27] analyzed RNA secondary structures. Secondary structures are coarse grained RNA contact structures, see Figure 3. In a diagram, two arcs $(i_1, j_1)$ and $(i_2, j_2)$ are called crossing if $i_1 < i_2 < j_1 < j_2$ holds. Accordingly, a k-crossing is a sequence of arcs $(i_1, j_1), \dots, (i_k, j_k)$ such that $i_1 < i_2 < \dots < i_k < j_1 < j_2 < \dots < j_k$, see Figure 5, where the arcs (1, 7), (4, 9), (5, 11) (drawn in red) form a 3-crossing. We call diagrams containing at most (k − 1)-crossings k-noncrossing diagrams. RNA secondary structures have no crossings in their diagram representation, see Figure 3 and Figure 4, and are therefore 2-noncrossing diagrams. A structure in which any stack has at least size σ is called σ-canonical, where a stack of size σ is a sequence of "parallel" arcs of the form

$((i, j), (i+1, j-1), \dots, (i+(\sigma-1), j-(\sigma-1)))$. (1)

As a natural generalization of RNA secondary structures, k-noncrossing RNA structures [28] [29] [30] were introduced. A k-noncrossing RNA structure is a k-noncrossing diagram without arcs of the form $(i, i+1)$. In the following we assume $k = 3$, i.e. in the diagram representation there are at most two mutually crossing arcs, a minimum arc-length of four and a minimum stack-size of three base pairs. The notion k-noncrossing stipulates that the complexity of a pseudoknot is related to the maximal number of mutually crossing bonds. Indeed, most natural RNA pseudoknots are 3-noncrossing [31].
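The crossing and stack conditions defined above can be checked mechanically. The following is a minimal sketch, not part of Inv or Cross: the function names (`crosses`, `is_k_noncrossing`, `stack_sizes`, `is_sigma_canonical`) are hypothetical, a diagram is assumed to be given as a list of arcs $(i, j)$ with $i < j$, and the k-noncrossing test is a brute-force enumeration meant for small examples only.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a and b cross, i.e. i1 < i2 < j1 < j2 after ordering by start."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2


def is_k_noncrossing(arcs, k):
    """True if no k arcs are mutually crossing (brute force over all k-subsets)."""
    return not any(
        all(crosses(a, b) for a, b in combinations(subset, 2))
        for subset in combinations(arcs, k)
    )


def stack_sizes(arcs):
    """Sorted sizes of the maximal stacks (i, j), (i+1, j-1), ... of the diagram."""
    arc_set = set(arcs)
    sizes = []
    for (i, j) in arc_set:
        if (i - 1, j + 1) in arc_set:
            continue  # (i, j) is not the outermost arc of its stack
        size = 1
        while (i + size, j - size) in arc_set:
            size += 1
        sizes.append(size)
    return sorted(sizes)


def is_sigma_canonical(arcs, sigma):
    """True if every maximal stack has size at least sigma."""
    return all(size >= sigma for size in stack_sizes(arcs))
```

For instance, the arcs (1, 7), (4, 9), (5, 11) are pairwise crossing, so `is_k_noncrossing` rejects them for k = 3 and accepts them for k = 4. The pairwise test suffices because k arcs ordered by start point that cross pairwise automatically satisfy the chain of inequalities in the definition of a k-crossing.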
Before considering an inverse folding algorithm into specific RNA structures, one has to have at least some rationale as to why there exists a sequence realizing a given target as mfe-configuration. In fact this is, on the level of entire folding maps, guaranteed by the combinatorics of the target structures alone. It has been shown in [32] that the number of 3-noncrossing RNA pseudoknot structures satisfying the biophysical constraints grows asymptotically as $c_3\, n^{-5}\, 2.03^n$, where $c_3 > 0$ is some explicitly known constant. In view of the central limit theorems of [33], this fact implies the existence of extended (exponentially large) sets of sequences that all fold into one 3-noncrossing RNA pseudoknot structure, $S$. In other words, the combinatorics of 3-noncrossing RNA structures alone implies that there are many sequences mapping (folding) into a single structure. The set of all such sequences is called the neutral network (footnote 1) of the structure $S$ [34, 35], see Figure 6.

Figure 6: Neutral networks in sequence space: we display sequence space (left) and structure space (right) as grids. We depict a set of sequences that all fold into a particular structure. Any two of these sequences are connected by a red edge. The neutral network of this fixed structure consists of all sequences folding into it and is typically a connected subgraph of sequence space.

1 The term "neutral network" as opposed to "neutral set" stems from giant component results of random induced subgraphs of n-cubes. That is, neutral networks are typically connected in sequence space.
2 Note: we do not consider insertions or deletions.

[...] Indeed, we may [...] where the $u_h$ denote the unpaired nucleotides and the $p_h = (s_i, s_j)$ denote base pairs, respectively, see [...] Analogously, a p-neighbor differs by a compensatory base pair mutation, see Figure 8. Note, however, that a p-neighbor has either Hamming distance one (G-C → G-U) or Hamming distance two (G-C → C-G).
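The remark on Hamming distances can be made concrete by enumerating the neighbors of a compatible sequence. The sketch below is illustrative only: the names `u_neighbors`, `p_neighbors` and `hamming` are hypothetical, positions are 1-based, the input sequence is assumed to be compatible with the given arcs, and a u-neighbor is assumed to differ by a point mutation at a single unpaired position (in analogy to the p-neighbor described above).

```python
PAIRS = ("AU", "UA", "GC", "CG", "GU", "UG")  # Watson-Crick and G-U base pairs
NUCLEOTIDES = "ACGU"


def hamming(s, t):
    """Number of positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(s, t))


def u_neighbors(seq, arcs):
    """Sequences obtained by mutating exactly one unpaired position."""
    paired = {pos for arc in arcs for pos in arc}
    result = []
    for pos in range(1, len(seq) + 1):
        if pos in paired:
            continue
        for nuc in NUCLEOTIDES:
            if nuc != seq[pos - 1]:
                result.append(seq[:pos - 1] + nuc + seq[pos:])
    return result


def p_neighbors(seq, arcs):
    """Sequences obtained by replacing one base pair by a different admissible pair."""
    result = []
    for (i, j) in arcs:
        current = seq[i - 1] + seq[j - 1]
        for pair in PAIRS:
            if pair != current:
                chars = list(seq)
                chars[i - 1], chars[j - 1] = pair[0], pair[1]
                result.append("".join(chars))
    return result
```

For the sequence GAAAC with the single arc (1, 5), the five p-neighbors include GAAAU (G-C → G-U, Hamming distance one) and CAAAG (G-C → C-G, Hamming distance two), in line with the note above.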
We call a u- or a p- [...] work between $s$ and $s'$, respectively. Note that since each neutral path is in particular a compatible path, the compatible distance is always smaller than or equal to the neutral distance.

In this paper we study the inverse folding problem for RNA pseudoknot structures: for a given 3-noncrossing target structure $S$, we search for sequences from $C[S]$ that have $S$ as mfe-configuration. For RNA secondary structures, there are three different strategies for inverse folding: RNAinverse, RNA-SSD and INFO-RNA. They all iteratively generate, via a local search routine, sequences whose structures have smaller and smaller distances to a given target. Here the distance between two structures is obtained by aligning them as diagrams and counting "0" if a given position is either unpaired or incident to an arc contained in both structures, and "1" otherwise, see Figure 9. One common assumption in these inverse folding algorithms is that the energies of specific substructures contribute additively to the energy of the entire structure.

Let us proceed by analyzing the algorithms. RNAinverse is the first inverse folding algorithm that derives sequences that realize given RNA secondary structures as mfe-configuration. In its initialization step, a random compatible sequence $s$ for the target $T$ is generated. Then RNAinverse proceeds by updating the sequence $s$ to $s', s'', \dots$ step by step, minimizing the structure distance between the mfe-structure of $s'$ and the target structure $T$. Based on the observation that the energy of a substructure contributes additively to the mfe of the molecule, RNAinverse optimizes "small" substructures first, eventually extending these to the entire structure.

Cross is an ab initio folding algorithm that maps RNA sequences into 3-noncrossing RNA structures. It is guaranteed to search all 3-noncrossing, σ-canonical structures and derives some (not necessarily unique) loop-based mfe-configuration. In the following we always assume σ ≥ 3.
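Before turning to the details of Cross, the structure distance used by the local search routines above can be sketched in a few lines. This is one possible reading of the verbal definition, stated here as an assumption: a position counts "0" if it has the same pairing status in both structures (unpaired in both, or paired to the same partner in both) and "1" otherwise. The helper names `partner_map` and `structure_distance` are hypothetical, and positions are 1-based.

```python
def partner_map(arcs, n):
    """p[w] is the pairing partner of position w, or 0 if w is unpaired."""
    p = [0] * (n + 1)  # index 0 is unused so that positions are 1-based
    for i, j in arcs:
        p[i], p[j] = j, i
    return p


def structure_distance(s_arcs, t_arcs, n):
    """Number of positions whose pairing status differs between two structures."""
    ps, pt = partner_map(s_arcs, n), partner_map(t_arcs, n)
    return sum(1 for w in range(1, n + 1) if ps[w] != pt[w])
```

As an example, for the structures {(1, 7), (2, 6)} and {(1, 7), (3, 6)} over eight positions, the arc (1, 7) is shared and contributes nothing, while positions 2, 3 and 6 differ, giving distance three.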
The input of Cross is an arbitrary RNA sequence $s$ and an integer $N$. Its output is a list of $N$ 3-noncrossing, σ-canonical structures, the first of which is the mfe-structure for $s$. This list of $N$ structures $(C_0, C_1, \dots, C_{N-1})$ is ordered by free energy, and its first element, the mfe-structure, is denoted by Cross($s$). If no $N$ is specified, Cross assumes $N = 1$ as default.

Cross generates an mfe-structure based on specific loop-types of 3-noncrossing RNA structures. For a given structure $S$, let $\alpha$ be an arc contained in $S$ (an $S$-arc) and denote the set of $S$-arcs that cross $\alpha$ by $A_S(\alpha)$. For two arcs $\alpha = (i, j)$ and $\alpha' = (i', j')$, we next specify the partial order "$\prec$" over the set of arcs: $\alpha' \prec \alpha$ if and only if $i < i' < j' < j$. All notions of minimal or maximal elements are understood to be with respect to $\prec$. An arc $\alpha \in A_S(\beta)$ is called a minimal β-crossing if there exists no $\alpha' \in A_S(\beta)$ such that $\alpha' \prec \alpha$. Note that $\alpha$ can be minimal β-crossing while $\beta$ is not minimal α-crossing.

3-noncrossing diagrams exhibit the following four basic loop-types:
(1) A hairpin-loop [...], where $(i, j)$ is an arc and $[i, j]$ is an interval, i.e. a sequence of consecutive vertices $(i, i+1, \dots, j-1, j)$.
(2) An interior-loop is a sequence [...], where $(i_2, j_2)$ is nested in $(i_1, j_1)$, that is, we have $i_1 < i_2 < j_2 < j_1$.
(3) A multi-loop, see Figure 10 [20], is a sequence [...], where $S^{\tau_h}_{\omega_h}$ denotes a pseudoknot structure over $[\omega_h, \tau_h]$ (i.e. nested in $(i_1, j_1)$) and subject to [...] substructures are just arcs, for all $h$, then we have [...]
(4) A pseudoknot, see Figure 11 [20], consists of the following data: (P1) A set of arcs [...] interior- or multi-loops.

Having discussed the basic loop-types, we are now in position to state

Theorem 1. Any 3-noncrossing RNA pseudoknot structure has a unique loop-decomposition [20].

A motif in Cross is a 3-noncrossing structure having only $\prec$-maximal stacks of size exactly σ, see [...]

[...] If $C^0_0(\lambda^{i-1}) = T$, then Inv returns $\lambda^{i-1}$. Else, in case of $d = d(C^0_0(\lambda^{i-1}), T) < d_{\min}$, we set $d_{\min} = d$ and $seq_{\min} = \lambda^{i-1}$. Otherwise we do not update $seq_{\min}$ and go directly to Step II.

Step II.
The competitors. We introduce a specific procedure that "perturbs" arcs of a given RNA pseudoknot structure [...] Clearly, there are nine perturbations of any given arc $a$ (including $a$ itself), see Figure 16. We proceed by keeping $a$, replacing the arc $a$ by a nontrivial perturbation, or removing $a$, arriving at a set of ten structures $\nu(S, a)$. Now we use this method in order to generate the set $C^1(\lambda^{i-1})$ by perturbing each arc of each structure [...]

Step III. Mutation. Here we adjust $\lambda^{i-1}$ with respect to $T$ as well as the set of competitors $C(\lambda^{i-1})$ derived in the previous step. Suppose $\lambda^{i-1}$ = [...] (see Figure 19); we modify $\lambda^{i-1}$ as follows:
1. unpaired position: If $p(T, w) = 0$, we update [...] See position 6 in Figure 19.
2. [...] (5, 9). The pair G-C retains the compatibility to (5, 9), but is incompatible to (5, 10). By Figure 20 we show feasibility of this step.
3. end-point: If $0 < p(T, w) < w$, then by construction the nucleotide has already been considered in the previous step.
Therefore, updating all the nucleotides of $\lambda^{i-1}$, we arrive at the new sequence $\lambda^i = s^i_1 s^i_2 \dots s^i_n$.

Note that the above mutation steps heuristically decrease the structure distance. However, the resulting sequence is not necessarily incompatible to all competitors. For instance, consider a competitor $C_h$ whose arcs are all contained in $T$. Since $\lambda^i$ is compatible with $T$, $\lambda^i$ is compatible with $C_h$. Since competitors are obtained from suboptimal folds, such a scenario may arise. In practice, this situation poses no problem, since competitors of this type are likely to be ruled out by virtue of the fact that they have an mfe larger than that of the target structure. Accordingly, competitors are eliminated due to two equally important criteria: incompatibility as well as minimum free energy considerations. If the distance of Cross($\lambda^i$) to $T$ is less than or equal to $d_{\min} + 5$, we return to Step I (with $\lambda^i$).
Otherwise, we repeat Step III (at most 5 times), thereby generating $\lambda^i_1, \dots, \lambda^i_5$, and set $\lambda^i = \lambda^i_w$, where $d(\mathrm{Cross}(\lambda^i_w), T)$ is minimal.

The procedure Adjust-Seq employs the negative paradigm [17] in order to exclude energetically close conformations. It returns the sequence $seq_{middle}$, which is tailored to realize the target structure as mfe-fold.

Input: the original start sequence start
Input: the target structure T
Output: an initialized sequence seq_middle
    n ← length of T
    d_min ← +∞, seq_min ← start
    for i = 1 to (1/2)√n do
        ▷ Step I: generate the set C^0(λ^{i−1}) via Cross
        ...
        if d = 0 then
        ...
        ▷ Step II: generate the competitor set C(λ^{i−1})
        ...
        ▷ Step III: mutation
        seq ← λ^{i−1}
        for w = 1 to n do
            if ∃ C_h(λ^{i−1}) ∈ C(λ^{i−1}) s.t. p(C_h, w) = p(T, w) then
                seq[w] ← random nucleotide or pair, s.t. ...

In this section we introduce the two routines Decompose and Local-Search. The routine Decompose partitions $T$ into linearly ordered, energy independent components, see Figure 12 and Section 2.1. Local-Search iteratively constructs an optimal sequence for $T$ via local solutions that are optimal for certain substructures of $T$.

Decompose: Suppose $T$ is decomposed as follows, $B = \{T_1, \dots, T_{m'}\}$, where the $T_w$ are the loops together with all arcs in the associated stems of the target. We define a linear order over $B$ as follows: $T_w < T_h$ if either
1. $T_w$ is nested in $T_h$, or
2. the start-point of $T_w$ precedes that of $T_h$.
In Figure 21 we display the linear order of the loops of the structure shown in Figure 12. Next we define the interval $a_w = [l(T_w), r(T_w)]$, projecting the loop $T_w$ onto the interval $[l(T_w), r(T_w)]$, and $b_w = [l', r'] \supset a_w$, being the maximal interval consisting of $a_w$ and its adjacent unpaired consecutive nucleotides, see Figure 12. Given two consecutive loops $T_w < T_{w+1}$, we have two scenarios:
• either [...]
• or $b_w \subseteq b_{w+1}$, see $b_1$ and $b_2$ in Figure 21.
Let $c_w = \bigcup_{h=1}^{w} b_h$, then we have the sequence of intervals $a_1, b_1, c_1, \dots, a_{m'}, b_{m'}, c_{m'}$.
If there are no unpaired nucleotides adjacent to $a_w$, then $a_w = b_w$ and we simply delete all such $b_w$. Thereby we derive the sequence of intervals $I_1, I_2, \dots, I_m$. In Figure 22 we illustrate how to obtain this interval sequence: here the target decomposes into the loops $T_1, T_2$ and we have $I_1 = [3, 5]$, $I_2 = [3, 6]$, $I_3 = [2, 9]$, and [...]

Local-Search: Given the sequence of intervals $I_1, I_2, \dots, I_m$, we proceed by performing a local stochastic search on the subsequences $seq|_{I_1}, seq|_{I_2}, \dots, seq|_{I_m}$ (initialized via $seq = seq_{middle}$ and where $s|_{[x,y]} = s_x s_{x+1} \dots s_y$). When we perform the local search on $seq|_{I_w}$, only positions that contribute to the distance to the target, see Figure 9, or positions adjacent to the latter, are altered. We use the arrays $U_1$, $U_2$ to store the unpaired and paired positions of $T$. In this process, we allow for mutations that increase the structure distance by five with probability 0.1. The latter parameter is heuristically determined. We iterate this routine until the distance is either zero or some halting criterion is met.

The main result of this paper is the presentation of the algorithm Inv, which is freely available. Its input is a 3-noncrossing RNA structure $T$, given in terms of its base pairs $(i_1, i_2)$ (where $i_1 < i_2$). As discussed in the introduction, an argument has to be given as to why the inverse folding of pseudoknot RNA structures works. While folding maps into RNA secondary structures are well understood, the generalization to 3-noncrossing RNA structures is nontrivial. However, the combinatorics of RNA pseudoknot structures [28, 29, 40] implies the existence of large neutral networks, i.e. networks composed of sequences that all fold into a specific pseudoknot structure.
Therefore, the fact that it is indeed possible to generate via Inv sequences contained in the neutral networks of targets against competing pseudoknot configurations, see Figure 23 and Figure 24, confirms the predictions of [32]. An interesting class are the 3-noncrossing nonplanar pseudoknot structures. A nonplanar pseudoknot structure is a 3-noncrossing structure which is not a bi-secondary structure in the sense of Stadler [31]. That is, it cannot be represented by noncrossing arcs using the upper and lower half planes. Since DP-folding paradigms of pseudoknot folding are based on gap-matrices [16], the minimal class of "missed" structures is exactly that of nonplanar, 3-noncrossing structures. In Figure 25 we showcase a nonplanar RNA pseudoknot structure and 3 sequences of its neutral network, generated by Inv.

Figure 24: The pseudoknot PKI of the internal ribosomal entry site (IRES) region [41]: its diagram representation and three sequences of its neutral network as constructed by Inv:
AUACGACAUCGUAACUUCCUACUCGUUGUGGAACUGGCCGGGAGC
CGGUCUCAGGAGCGAAUGGGUUAGGGGGCUCACGCGCUGUCAUUG
GUUGGUCCUAUCGACAGCCUGAGAGGUCAGAAAGAGAGCGGUUGC

[...] networks. In addition we present in Table 1 [...] Domain structure of the ribozyme from eubacterial ribonuclease P.

All authors contributed equally to this paper. We are grateful to Fenix W.D. Huang for discussions. Special thanks belong to the two anonymous referees whose thoughtful comments have greatly