key: cord-0106840-4zp1q2os authors: Backofen, Rolf; Fricke, Markus; Marz, Manja; Qin, Jing; Stadler, Peter F. title: Distribution of graph-distances in Boltzmann ensembles of RNA secondary structures date: 2013-07-30 journal: nan DOI: nan sha: c4cc952118977cd6b27cc08b298173190cd88164 doc_id: 106840 cord_uid: 4zp1q2os Large RNA molecules often carry multiple functional domains whose spatial arrangement is an important determinant of their function. Pre-mRNA splicing, furthermore, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium therefore provides useful information on the overall shape of the molecule can provide insights into the interplay of its functional domains. Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between arbitrary nucleotides can be computed in polynomial time by means of dynamic programming. A naive implementation would yield recursions with a very high time complexity of O(n^11). Although we were able to reduce this to O(n^6) for many practical applications a further reduction seems difficult. We conclude, therefore, that sampling approaches, which are much easier to implement, are also theoretically favorable for most real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules. Abstract. Large RNA molecules often carry multiple functional domains whose spatial arrangement is an important determinant of their function. Pre-mRNA splicing, furthermore, relies on the spatial proximity of the splice junctions that can be separated by very long introns. Similar effects appear in the processing of RNA virus genomes. Albeit a crude measure, the distribution of spatial distances in thermodynamic equilibrium therefore provides useful information on the overall shape of the molecule can provide insights into the interplay of its functional domains. Spatial distance can be approximated by the graph-distance in RNA secondary structure. We show here that the equilibrium distribution of graph-distances between arbitrary nucleotides can be computed in polynomial time by means of dynamic programming. A naive implementation would yield recursions with a very high time complexity of O(n 11 ). Although we were able to reduce this to O(n 6 ) for many practical applications a further reduction seems difficult. We conclude, therefore, that sampling approaches, which are much easier to implement, are also theoretically favorable for most real-life applications, in particular since these primarily concern long-range interactions in very large RNA molecules. The distances distribution within an RNA molecule is of interest in various contexts. Most directly, the question arises whether panhandle-like structures (in which 3' and 5' ends of long RNA molecules are placed in close proximity) are the rule or an exception. Panhandles have been reported in particular for many RNA virus genomes. Several studies [28, 8, 2, 13] agree based on different models that the two ends of single-stranded RNA molecules are typically not far apart. On a more technical level, the problem to compute the partition function over RNA secondary structures with given end-to-end distance d, usually measured as the number of external bases (plus possibly the number of structural domains) arises for instance when predicting nucleic acid secondary structure in the presence of single-stranded binding proteins [9] or in models of RNA subjected to pulling forces (e.g. in atom force microscopy or export through a small pore) [10, 23, 11] . It also plays a role for the effect of loop energy parameters [7] . In contrast to the end-to-end distance, the graph-distance between two arbitrarily prescribed nucleotides in a larger RNA structure does not seem to have been studied in any detail. However, this is of particular interest in the analysis of single-molecule fluorescence resonance energy transfer (smFRET) experiments [25] . This technique allows to monitor the distance between two dye-labeled nucleotides and can reveal details of the kinetics of RNA folding in real time. It measures the non-radiative energy transfer between the dye-labeled donor and acceptor positions. The efficiency of this energy transfer, E f ret , strongly depends on the spatial distance R according to E f ret = (1 + (R/R 0 ) 6 ) −1 . The Förster radius R 0 sets the length scale, e.g. R 0 ≈ 54Å for the Cy3-Cy5 dye pair. A major obstacle is that, at present, there is no general and efficient way to link smFRET measurements to interpretations in terms of explicit molecular structures. To solve this problem, a natural first step to compute the distribution of spatial distances for an equilibrium ensemble of 3D structures. Since this is not feasible in practice despite major progress in the field of RNA 3D structure prediction [4] , we can only resort to considering the graph-distances on the ensemble of RNA secondary structures instead. Although a crude approximation of reality, our initial results indicate that the graph distance can be related to the smFRET data such as those reported by [14] . From a computer science point of view, furthermore, we show here that the distance distribution can be computed exactly using a dynamic programming approach. An RNA secondary structure is a vertex labeled outerplanar graph G(V, ξ, E), where V = {1, 2, . . . , n} is a finite ordered set (of nucleotide positions) and ξ : {1, 2, . . . , n} → {A, U, G, C}, i → ξ i assigns to each vertex at position i (along the RNA sequence from 5' to 3') the corresponding nucleotide ξ i . We write ξ = ξ 1 . . . ξ n for the sequence underlying secondary structure and use ξ[i . . . j] = ξ i . . . ξ j to denote the subsequence from i to j. The edge set E is subdivided into backbone edges of the form {i, i + 1} for 1 ≤ i < n and a set B of base pairs satisfying the following conditions: The first condition allows base pairs only for Watson-Crick and GU base pairs. The second condition implements the minimal steric requirement for an RNA to bend back on itself. The third condition enforces that B forms a matching in the secondary structure. The last condition (nesting condition) forbids crossing base pairs, i.e. pseudoknots. The nesting condition results in a natural partial order in the set of base pairs B defined as {i, j} ≺ {k, l} if k < i < j < l. In particular, given an arbitrary vertex k, the set B k = {{i, j} ∈ B|i ≤ k ≤ j} of base pairs enclosing k is totally ordered. Note that k is explicitly allowed to be incident to its enclosing base pairs. Consider a fixed secondary structure G, for a given base pair {i, j} ∈ B, we say a vertex k is accessible from {i, j} if i < k < j and there is no other pair {i , j } ∈ B such that i < i < k < j < j. The unique subgraph L i,j induced by i, j, and all the vertices accessible from {i, j} is known as the loop of {i, j}. The type of a loop L i,j is unique determined depending on whether {i, j} is external or not, and the numbers of unpaired vertices and base pairs. For details, see [26] . Each secondary structure G has a unique set of loops {L i,j |{i, j} ∈ B}, which is called the loop decomposition of G. The free energy f (G) of a given secondary structure, according to the standard energy model [20] , is defined as the sum of the energies of all loops in its unique loop decomposition. The relative location of two vertices v and w in G is determined by the base pairs B v and B w that enclose them. If B v ∩ B w = ∅, there is a unique ≺-minimal base pair {i v,w , j v,w } that encloses both vertices and thus a uniquely defined loop L {iv,w,jv,w} in the loop associated with v and w. then v or w is unpaired and part of L {iv,w,jv,w} . Otherwise, i.e. B v ∩B w = ∅, there are uniquely defined ≺-maximal base pairs {k v , l v } ∈ B v \ B w and {k w , l w } ∈ B w \ B v that enclose v and w, respectively. This simple partition holds the key to computing distance distinguished partition functions below. It will be convenient in the following to introduce edge weights ω i,j = a if j = i + 1, i.e., for backbone edges, and ω i,j = b for {i, j} ∈ B. Given a path p, we define the weight of the path d(p) as the sum of the weights of edges in the path. The (weighted) graph-distance d G v,w in G is defined as the weight of the path p connecting v and w with d(p) being minimal. For the weights, we require the following condition: (W) If i and j are connected by an edge, then {i, j} ∈ E is the unique shortest path between i and j. This condition ensures that single edges cannot be replaced by detours of shorter weight. Condition (W) and property (ii) of the secondary structure graphs implies b < 3a because the closing base pair must be shorter than a hairpin loop. Furthermore, considering a stacked pair we need b < b + 2a, i.e. a > 0. We allow the degenerate case b = 0 that neglects the traversals of base pairs. For a fixed structure G, d G v,w is easy to compute. Here, we are interested in the distribution P r[d G v,w |ξ] and its expected value As first shown in [21] , Q and related quantities can be computed in cubic time. A crucial quantity for our task is the restricted partition function for a given pair v, w of positions in a given RNA sequence ξ. A simple but tedious . . , n. In sections 2.3-2.5 we show that this can be achieved by a variant of McCaskill's approach [21] . For the ease of presentation we describe in the following only the recursion for the simplified energy model for the "circular maximum matching" matching, in which energy contributions are associated with individual base pairs rather than loops. Our approach easily extends to the full model by using separating the partition functions into distinct cases for the loop types. We use the letter Z to denote partition functions with distance constraints, while Q is used for quantities that appear in McCaskill's algorithm and are considered as pre-computed here. Before we continue with the calculation of the partition function, let's first look into problem formulation in more detail. For the FRET application, it is well-known that the rate which with FRET occurs is correlated with the distance. Therefore, only a limited range of distance changes (e.g. 20Å − 100Å for Cy3-Cy5) can be reported by the FRET experiments. Thus the more useful formulation of our problem is not to use the full expected quantity for all positions. Instead, we are interested in the average for all distances within some threshold θ d . As the space and time complexity will depend on the number of distances we consider, we will parametrise our complexity by the number of nucleotides n and the number of overall distances considered D = θ d + 1, as well. An important special case assumes that both v and w are external. This is case e.g. when v and w are bound by proteins. In particular, the problem of computing end-to-end distances, i.e., v = 1 and w = n, is of this type. Assuming (W), the shortest path between two external vertices v, w consists of the external vertices and their backbone connections together with the external base pairs. We call this path the inside path of i, j since it does not involve any vertices "outside" the subsequence ξ[i..j]. For efficiently calculating the internal distance between any two vertices v, w, we denote by Z I i,j [d] the partition function over all secondary structures on ξ[i..j] with end-to-end distance exactly d. Furthermore, let Q B i,j denote the partition function over all secondary structures on ξ[i..j] that are enclosed by the base pair {i, j}. We will later also need the partition function Q i,j over the sub-sequence ξ[i..j], regardless of whether {i, j} is paired or not. Now note that any structure on ξ[i..j] starts either with an unpaired base or with a base pair connecting i to some position k satisfying i < k ≤ j. In the first case, we have d can be split as follows, This gives the recursion [10] , the investigate of loop entropy dependence [7] , the analysis of FRET signals in the presence of single-stranded binding proteins [9] , as well as in mathematical studies of RNA panhandle-like structures [2, 13] . In the following it will be convenient to define also a special terms for the empty structure. Setting a allows us to formally write an individual backbone edge as two edges flanking the empty structure and hence to avoid the explicit treatment of special cases. This definition of Z I also includes the case that i and j are base paired in the recursion (1). This is covered by the case k = j, where we evaluate is the only admissible value here, this refers to Z I j+1,j [−a], which has the correct value of 1 due to our definition. Later on, we will also need Z I under the additional condition that the path starts and end with a backbone edge. We therefore introduce Z I defined as Fig. 1 . Inside and outside paths. The shortest path (violet arrows) from v (green) to w (blue) is not an inside path: inside emphasizes that, in contrast to the shortest path (cyan arrows) between the red region and w, it is not contained in the interval determined by its end points. By our initialization of Z I , we can simply define Z I by This recursion requires O(Dn 3 ) time and space. It is possible to reduce the complexity in this special case by a linear factor. The trick is to use conditional probabilities for arcs starting at i or the conditional probability for i to be singlestranded, which can be determined from the partition function for RNA folding [2] , see Appendix B. The minimal distance between two positions that are covered by an arc can be realized by inside paths and outside paths. This complicates the algorithmic approach, since both types of paths must be controlled simultaneously. Consider Fig. 1 . The shortest path between the green and blue regions includes some vertices outside the interval between these two regions. The basic idea is to generalize Equation (1) to computing the partition function Z v,w [d]. The main question now becomes how to recurse over decompositions of both the inside and the outside paths. Fig. 1 shows that the outside paths are important for the green region, i.e., the region that is covered by an arc. Hence, we have to consider the different cases that the two positions v and w are covered by arcs. The set Ω of all secondary structures on ξ can be divided into two disjoint subclasses that have to be treated differently: Note that this bipartition explicitly depends on v and w. In the following, we will first introduce the recursions that are required in Ω 0 structures to compute Contribution of Ω 0 structures to Z v,w [d] One example of this case is given in Fig. 1 with the red and blue region, where v (vertex in green region) is covered by an arc, and w (vertex in blue region) is external. Denote the ≺-maximal base pair enclosing v by {i, j}. Since at most one of v and w is covered by an arc, we know that j < w. Hence, every path p from v to w, and hence also the shortest paths (not necessarily unique) must run through the right end j of the arc {i, j}. More precisely, there must sub-paths p 1 and In case c.) we have to split the distance d into four contributions and we require two splitting positions for the sequence for all combinations of i, j, v, w. This would result in an O(n 6 D 5 ) algorithm. A careful inspection shows, however, that the split of the distances for the arcs into d and d r is unnecessary. Since we want to know only distance to the left/right end overall, we can simply introduce two matrices Z B,v, i,j Analogously, we compute Z B,v,r i,j [d]. Overall, the contribution to Z v,w [d] for structures in Ω 0 is given by Note that for splitting the distance, we reuse the same indices (e.g., the j in where as for the remaining partition function, we use successive indices (e.g.,the i in . This difference comes from the fact that splitting a sequence into subsequences is done naturally between two successive indices, whereas splitting a distance is naturally done by splitting at an individual position. We have only to guarantee that the substructures which participate in the split do agree on the structural context of the split position. This is guaranteed by requiring that Z I starts and ends with a backbone edge. We note that the incorporation of the full dangling end parameters makes is more tedious to handle the splitting positions. This results in a complexity of O(n 6 D 3 ) time and O(n 3 D) space. However, we do not need to split in i, j, k, j simultaneously. Instead, we could split case (c) at position j and introduce for all v ≤ j and k ≤ w the auxiliary variables Finally, we can replace recursion (3) by We thus arrive at O(n 4 D 2 ) time and O(n 3 D) space complexity for the contribution of Ω 0 structures to Z v,w [d], excluding the complexity of computing Contribution of Ω 1 structures to Z v,w [d] Ω 1 contains all cases where v and w are covered by a base pair. In the following, let {p, q} be the ≺-minimal base pair covering v and w. In principle, this case looks similar to the overall case for Ω 0 . However, we have now to deal not only with an inside distance, but also with an outside distance over the base pair {p, q}. Thus, we need to store the partition function for all inside and outside for each ≺-minimal arc {p, q} that covers v and w, which we will call Y B,v,w In principle, a similar recursion as defined for Z 0 in equation (3) can be derived, with the additional complication since we have to take care of the additional outside distance due to the arc (p, q). Thus, we obtain the following splitting: Again we can avoid the complexity of simultaneously splitting at {i, j} and {k, l} by doing a major split after j. Thus, we get the equivalent recursions as in eqns. (5) (6) (7) : Overall, we get the following recursion: Overall, we can now define Z v,w [d] by This part has now a complexity of O(n 4 D 2 ) space and O(n 5 D 4 ) time. For practical applications, however, we do not need to consider all possible {p, q}. Instead, there are only few base pairs that are likely to form and that cover v, w, especially for v, w where the internal distance of v, w is large enough such that an outside path has to be considered at all. If we assume a constant number of such long-range base-pairs, then the complexity is reduced by an n 2 -factor. For the complexity in terms of distance, recall that D is typically small. So far, we have used Z B,v i,j [d , d r ] as a black box. In order to compute these terms, we distinguish the limiting cases a.) v = i, b.) v = j, c.) is external from the generic case d.): Starting from the limiting cases, we initialize Z B,v v,j [0, d r ] as follows: Finally, we have the following recursion for i = v = j, d > 0 and d r > 0: where Q b i,j is the external partition function over all structures on the union of the intervals ξ[1. .i] ∪ ξ[j..n] so that {i, j} is a base pair. This is equivalent to Q b i,j = P r({i, j}) × Q/Q b i,j . The base pair probability P r({i, j}), and the partition functions Q and Q b i,j are computed by means of McCaskill's algorithm. Recursion (9) apparently has complexity O(n 5 D 4 ) in time and O(n 3 D 2 ) in space. This can be reduced due to the strong dependency between d and d r , however. By construction we have |d − d r | ≤ b since we can always use the bond {i, j} to traverse from one end to the other. Furthermore, assuming integer values for a and b, we can have only In which, the minimum free energy structure is showed in (a), (c) is the real secondary structure which is ranked as the 3rd best sub-optimal structure with RNAsubopt -e. he graphic representations of these structures are produced with VARNA [3] . (B) The corresponding smFRET efficiency (E f ret ) histograms are reported in [14] . From these data, three separate states of the DAse ribozyme can be distinguished, the unfolded (U), intermediate (I) and folded (F) states. (C) The graph distance distribution in the ensemble which is approximated with RNAsubopt -p at temperature 50 • C. The theoretical analysis of the distance distribution problem shows that, while polynomial-time algorithms exist, they probably cannot the improved to space and time complexities that make them widely applicable to large RNA molecules. Due to the unfavorable time complexity of the current algorithm and the associated exact implementation in C, a rather simple and efficient sampling algorithm has been implemented. We resort to sampling Boltzmann-weighted secondary structures with RNAsubopt -p [17] , which uses the same stochastic backtracing approach as sfold [5] . As the graph-distance for a pair of nucleotides in a given secondary structure can be computed in O(n log n) time, even large samples can be evaluated efficiently 10 . As we pointed out in the introduction, the graph distance measure introduced in this paper can serve as a first step towards a structural interpretation of smFRET data. As an example, we consider the graph distance distribution of a Diels-Alderase (DAse) ribozyme (Fig. 2 (A) ). Histograms of smFRET efficiency (E f ret ) for this 49 nt long catalytic RNA are reported in [14] for a large number of surface-immobilized ribozyme molecules as a function of the Mg 2+ concentration in the buffer solution. A sketch of their histograms is displayed in Fig. 2 (B) . The dyes are attached to sequence positions 6 (Cy3) and 42 (Cy5) and hence do not simply reflect the end-to-end distance, Fig. 2 (A)(c) . In this example, we observe the the expected correspondence small graph distances with a strong smFRET signal. This is a particular interesting example, since the minimal free energy (mfe) structure (Fig. 2 (A)(a) ) predicted with RNAfold is not identified with the real secondary structure (Fig. 2 (A)(c) ). In fact, the ground state secondary structure is ranked as the 3rd best sub-optimal structure derived via RNAsubopt -e. The free energy difference between these two structures is only 0.1kcal/mol. However, their graph distances show a relatively larger difference. The 2nd best sub-optimal structure (Fig. 2 (A)(b) ) looks rather similar with the 3rd structure, in particular, they share the same graph distance value. The smFRET data of [14] indicate the presence of three sub-populations, corresponding to three different structural states: folded molecules (state F), intermediate conformation (state I) and unfolded molecules (state U). In the absence of Mg 2+ , the I state dominates, and only small fractions are found in states U and F. Unfortunately, the salt dependence of RNA folding is complex [15, 19] and currently is not properly modeled in the available folding programs. We can, however, make use of the qualitative correspondence of low salt concentrations with high temperature. In Fig. 2 (C) we therefore re-compute the graph distance distribution in the ensemble at an elevated temperature of 50 • C. Here, the real structure becomes the second best structure with free energy −10.82kcal/mol and we observe a much larger fraction of (nearly) unfolded structures with longer distances between the two beacon positions. Qualitatively, this matches the sm-FRET data showed in Fig. 2 (B) . Long-range interactions play an important role in pre-mRNA splicing and in the regulation of alternative splicing [1, 22] , bringing splice donor, acceptor, branching site into close spatial proximity. Fig. 3(A) shows for D. melanogaster pre-mRNAs that the distribution of graph-distances between donor and acceptor sites shifted towards smaller values compared to randomly selected pairs of positions with the same distance. 11 Although the effect is small, it shows a clear difference between the real RNA sequences and artificial sequences that were randomized by di-nucleotide shuffling. The spatial organization of the genomic and sub-genomic RNAs is important for the processing and functioning of many RNA viruses. This goes far beyond the well-known panhandle structures. In Coronavirus the interactions of the 5' TRS-L cis-acting element with body TRS elements has been proposed as an important determinant for the correct assembly of the Coronavirus genes in the host [6] . The matrix of expected graph-distances in Fig. 3(B) shows that TRS-L and TRS-B are indeed placed near each other. More detailed information is provided in Appendix (D). Our first results show that the systematic analysis of the graph-distance distribution both for individual RNAs and their aggregation over ensembles of 11 Due to the insufficiency of the spacial-distance information of structural elements in the secondary structures, we artificially choose a = b = 1 in our experiments. structures can provide useful insights into structural influences on RNA function. These may not be obvious directly from the structures due to the inherent difficulties of predicting long-range base pairs with sufficient accuracy and the many issues inherent in comparing RNA structures of very disparate lengths. Due the complexity of algorithm we have refrained from attempting a direct implementation in an imperative programming language. Instead, we are aiming at an implementation in Haskell that allows us to make use of the framework of algebraic dynamic programming [12] . The graph distance measure and the associated algorithm can be extended in principle to of RNA secondary structures with additional tertiary structural elements such as pseudoknots [24] and G-quadruples [18] . RNA-RNA interaction structures [16] also form a promising area for future extensions. We note finally, that the Fourier transition method introduced in [27] could be employed to achieve a further speedup. A stem structure in fibroblast growth factor receptor 2 transcripts mediates cell-type-specific splicing by approximating intronic control elements Expected distance between terminal nucleotides of RNA secondary structures VARNA: Interactive drawing and editing of the RNA secondary structure Automated de novo prediction of native-like RNA tertiary structures A statistical sampling algorithm for RNA secondary structure prediction Structure and functional relevance of a transcription-regulating sequence involved in coronavirus discontinuous RNA synthesis Impact of loop statistics on the thermodynamics of RNA folding The end-to-end distance of RNA as a randomly self-paired polymer Modeling the interplay of single-stranded binding proteins and nucleic acid secondary structure Force-induced denaturation of RNA Translocation of structured polynucleotides through nanopores Algebraic dynamic programming The 5'-3' distance of RNA secondary structures Mg2+-dependent folding of a Diels-Alderase ribozyme probed by single-molecule FRET analysis Ion-RNA interactions thermodynamic analysis of the effects of mono-and divalent ions on RNA conformational equilibria RNA-RNA interaction prediction based on multiple sequence alignments ViennaRNA Package 2.0. Alg 2d meets 4g: G-quadruplexes in rna secondary structure prediction Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure The equilibrium partition function and base pair binding probabilities for RNA secondary structure RNA structure and the mechanisms of alternative splicing The secondary structure of RNA under tension Topology and prediction of RNA pseudoknots A practical guide to single-molecule FRET From sequences to shapes and back: a case study in RNA secondary structures Using the Fast Fourier Transform to Accelerate the Computational Search for RNA Conformational Switches The ends of a large RNA molecule are necessarily close Acknowledgements. This work was supported in part by the Deutsche Forschungsgemeinschaft proj. nos. BA 2168/2-2, STA 850/10-2, SPP 1596 and MA5082/1-1.