key: cord-0699256-1lojd0xa authors: Gao, James ZM; Li, Linda YM; Reidys, Christian M title: Inverse folding of RNA pseudoknot structures date: 2010-06-23 journal: Algorithms Mol Biol DOI: 10.1186/1748-7188-5-27 sha: a3778f543ef74ff3a0dd90250221b77ee38d7533 doc_id: 699256 cord_uid: 1lojd0xa BACKGROUND: RNA exhibits a variety of structural configurations. Here we consider a structure to be tantamount to the noncrossing Watson-Crick and G-U-base pairings (secondary structure) and additional cross-serial base pairs. These interactions are called pseudoknots and are observed across the whole spectrum of RNA functionalities. In the context of studying natural RNA structures, searching for new ribozymes and designing artificial RNA, it is of interest to find RNA sequences folding into a specific structure and to analyze their induced neutral networks. Since the established inverse folding algorithms, RNAinverse, RNA-SSD as well as INFO-RNA are limited to RNA secondary structures, we present in this paper the inverse folding algorithm Inv which can deal with 3-noncrossing, canonical pseudoknot structures. RESULTS: In this paper we present the inverse folding algorithm Inv. We give a detailed analysis of Inv, including pseudocodes. We show that Inv allows to design in particular 3-noncrossing nonplanar RNA pseudoknot 3-noncrossing RNA structures-a class which is difficult to construct via dynamic programming routines. Inv is freely available at http://www.combinatorics.cn/cbpc/inv.html. CONCLUSIONS: The algorithm Inv extends inverse folding capabilities to RNA pseudoknot structures. In comparison with RNAinverse it uses new ideas, for instance by considering sets of competing structures. As a result, Inv is not only able to find novel sequences even for RNA secondary structures, it does so in the context of competing structures that potentially exhibit cross-serial interactions. Pseudoknots are structural elements of central importance in RNA structures [1] , see Figure 1 . They represent cross-serial base pairing interactions between RNA nucleotides that are functionally important in tRNAs, RNaseP [2] , telomerase RNA [3] , and ribosomal RNAs [4] . Pseudoknot structures are being observed in the mimicry of tRNA structures in plant virus RNAs as well as the binding to the HIV-1 reverse transcriptase in in vitro selection experiments [5] . Furthermore basic mechanisms, like ribosomal frame shifting, involve pseudoknots [6] . Despite them playing a key role in a variety of contexts, pseudoknots are excluded from large-scale computational studies. Although the problem has attracted considerable attention in the last decade, pseudoknots are considered a somewhat "exotic" structural concept. For all we know [7] , the ab initio prediction of general RNA pseudoknot structures is NP-complete and algorithmic difficulties of pseudoknot folding are confounded by the fact that the thermodynamics of pseudoknots is far from being well understood. As for the folding of RNA secondary structures, Waterman et al [8, 9] , Zuker et al [10] and Nussinov [11] established the dynamic programming (DP) folding routines. The first mfe-folding algorithm for RNA secondary structures, however, dates back to the 60's [12] [13] [14] . For restricted classes of pseudoknots, several algorithms have been designed: Rivas and Eddy [15] , Dirks and Pierce [16] , Reeder and Giegerich [17] and Ren et al [18] . Recently, a novel ab initio folding algorithm Cross has been introduced [19] . Cross generates minimum free energy (mfe), 3-noncrossing, 3-canonical RNA structures, i.e. structures that do not contain three or more mutually crossing arcs and in which each stack, i.e. sequence of parallel arcs, see eq. (1), has size greater or equal than three. In particular, in a 3-canonical structure there are no isolated arcs, see Figure 2 . The notion of mfe-structure is based on a specific concept of pseudoknot loops and respective loop-based energy parameters. This thermodynamic model was conceived by Tinoco and refined by Freier, Turner, Ninio, and others [13, [20] [21] [22] [23] [24] . Let us turn back the clock: three decades ago Waterman et al. [25] , Nussinov et al. [11] and Kleitman et al. in [26] analyzed RNA secondary structures. Secondary structures are coarse grained RNA contact structures, see Figure 3 . RNA secondary structures as well as RNA pseudoknot structures can be represented as diagrams, i.e. labeled graphs over the vertex set [n] = {1, ..., n} with vertex degrees ≤ 1, represented by drawing its vertices on a horizontal line and its arcs (i, j) (i < j), in the upper half-plane, see Figure 4 and Figure 1 . Given an arc (i, j) we refer to (j -i) as its arc-length. Here, vertices and arcs correspond to the nucleotides A, G, U, C and Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively. In a diagram, two arcs (i 1 , j 1 ) and (i 2 , j 2 ) are called crossing if i 1 0 is some explicitly known constant. In view of the central limit theorems of [32] , this fact implies the existence of extended (exponentially large) sets of sequences that all fold into one 3-noncrossing RNA pseudoknot structure, S. In other words, the combinatorics of 3-noncrossing RNA structures alone implies that there are many sequences mapping (folding) into a single structure. The set of all such sequences is called the neutral network of the structure S [33, 34] , see Figure 7 . The term "neutral network" as opposed to "neutral set" stems from giant component results of random induced subgraphs of ncubes. That is, neutral networks are typically connected in sequence space. By construction, all the sequences contained in such a neutral network are all compatible with S. That is, at any two positions paired in S, we find two bases capable of forming a bond (A-U, U-A, G-C, C-G, G-U and U-G), see Figure 8 . Let s' be a sequence derived via a point-mutation of s. If s′ is again compatible with S, we call this mutation "compatible". Let where the u h denotes the unpaired nucleotides and the p h = (s i , s j ) denotes base pairs, respectively, see Accordingly, there are two types of compatible neighbors in the sequence space u-and p-neighbors: a uneighbor has Hamming distance one and differs exactly by a point mutation at an unpaired position. Analogously a p-neighbor differs by a compensatory base pair-mutation, see Figure 9 . We display a 4-noncrossing diagram containing the three mutually crossing arcs (1, 7) , (4, 9) , (5, 11) (drawn in red). Note, however, that a p-neighbor has either Hamming distance one (G-C ↦ G-U) or Hamming distance two (G-C ↦ C-G). We call a u-or a p-neighbor, y, a compatible neighbor. In light of the adjacency notion for the set of compatible sequences we call the set of all sequences folding into S the neutral network of S. By construction, the neutral network of S is contained in C [S]. If y is contained in the neutral network we refer to y as a neutral neighbor. This gives rise to consider the compatible and neutral distance of the two sequences, denoted by C(s, s′) and N(s, s′). These are the minimum length of a C[S]-path and path in the neutral network between s and s′, respectively. Note that since each neutral path is in particular a compatible path, the compatible distance is always smaller or equal than the neutral distance. In this paper we study the inverse folding problem for RNA pseudoknot structures: for a given 3-noncrossing target structure S, we search for sequences from C[S], that have S as mfe configuration. For RNA secondary structures, there are three different strategies for inverse folding, RNAinverse, RNA-SSD and INFO-RNA [35] [36] [37] . They all generate via a local search routine iteratively sequences, whose structures have smaller and smaller distances to a given target. Here the distance between Figure 6 Stems. A stem composed by a sequence of three nested stacks. Note that respective stacks only have to be separated by isolated nucleotides on either the left hand side or the right hand side but not necessarily both. Neutral network in sequence space. We display sequence space (left) and structure space (right) as grids. We depict a set of sequences that all fold into a particular structure. Any two of these sequences are connected by a red path. The neutral network of this fixed structure consists of all sequences folding into it and is typically a connected subgraph of sequence space. two structures is obtained by aligning them as diagrams and counting "0", if a given position is either unpaired or incident to an arc contained in both structures and "1", otherwise, see Figure 10 . One common assumption in these inverse folding algorithms is, that the energies of specific substructures contribute additively to the energy of the entire structure. Let us proceed by analyzing the algorithms. RNAinverse is the first inverse-folding algorithm that derives sequences that realize given RNA secondary structures as mfe-configuration. In its initialization step, a random compatible sequence s for the target T is generated. Then RNAinverse proceeds by updating the sequence s to s′, s′′ ... step by step, minimizing the structure distance between the mfe structure of s′ and the target structure T. Based on the observation, that the energy of a substructure contributes additively to the mfe of the molecule, RNAinverse optimizes "small" substructures first, eventually extending these to the entire structure. While optimizing substructures, RNAinverse does an adaptive walk in order to decrease the structure distance. In fact, this walk is based entirely on random compatible mutations. RNA-SSD inverse folds RNA secondary structures by initializing sequences using three specific subroutines. In the first a particular compatible sequence is generated, where non-complementary nucleotides to bases adjacent to helical regions are assigned. In the second nucleotides located in unpaired positions as well as helical regions are assigned at random, using specific (non-uniform) probabilities. The third routine constitutes a mechanism for minimizing the occurrence of undesired but favourable interactions between specific sequence segments. Following these subroutines, RNA-SSD derives a hierarchical decomposition of the target structure. It recursively splits the structure and thereby derives a binary decomposition tree rooted in T and whose leaves correspond to T-substructures. Each non-leaf node of this tree represents a substructure obtained by merging the two substructures of its respective children. Given this tree, RNA-SSD performs a stochastic local search, starting at the leaves, subsequently working its way up to the root. INFO-RNA constructs sequences folding into a given secondary structure by employing a dynamic programming method for finding a well suited initial sequence. This sequence has a lowest energy with respect to the T. Since the latter does not necessarily fold into T, (due to potentially existing competing configurations) INFO-RNA then utilizes an improved (relative to the local search routine used in RNAinverse) stochastic local search in order to find a sequence in the neutral network of T. In contrast to RNAinverse, INFO-RNA allows for increasing the distance to the target structure. At the same time, only positions that do not pair correctly and positions adjacent to these are examined. Cross is an ab initio folding algorithm that maps RNA sequences into 3-noncrossing RNA structures. It is guaranteed to search all 3-noncrossing, s-canonical structures and derives some (not necessarily unique), loop-based mfe-configuration. In the following we always assume s ≥ 3. The input of Cross is an arbitrary RNA sequence s and an integer N. Its output is a list of N 3-noncrossing, s-canonical structures, the first of which being the mfe-structure for s. This list of N structures (C 0 , C 1 , ..., C N-1 ) is ordered by the free energy and the first list-element, the mfe-structure, is denoted by Cross(s). If no N is specified, Cross assumes N = 1 as default. Cross generates a mfe-structure based on specific loop-types of 3-noncrossing RNA structures. For a given structure S, let a be an arc contained in S (S-arc) and denote the set of S-arcs that cross a by A S ( )  . For two arcs a = (i, j) and a' = (i', j'), we next specify the partial order "≺" over the set of arcs: All notions of minimal or maximal elements are understood to be with respect to ≺. An arc a 3-noncrossing diagrams exhibit the following four basic loop-types: where (i, j) is an arc and [i, j] is an interval, i.e. a sequence of consecutive, isolated vertices (i, i + 1, ..., j -1, j). (2) An interior-loop, is a sequence (3) A multi-loop, see Figure 11 [19] , is the closed structure formed by denotes the substructure over the interval [ω h , τ h ], subject to the condition that if all these substructures are simply stems, then there are at least two of them, see Figure 6 . A pseudoknot, see Figure 12 [19], consists of the following data: (P1) A set of arcs where i 1 = min{i h } and j t = max{j h }, such that (i) the diagram induced by the arc-set P is irreducible, i.e. the dependency-graph of P (i.e. the graph having P as vertex set and in which a and a′ are adjacent if and only if they cross) is connected and (ii) for each (i h , j h ) P there exists some arc b (not necessarily contained in P) such that (i h , j h ) is minimal b-crossing. (P2) Any i 1