key: cord-016261-jms7hrmp authors: Liu, Chunmei; Song, Yinglei; Malmberg, Russell L.; Cai, Liming title: Profiling and Searching for RNA Pseudoknot Structures in Genomes date: 2005 journal: Transactions on Computational Systems Biology II DOI: 10.1007/11567752_2 sha: doc_id: 16261 cord_uid: jms7hrmp We developed a new method that can profile and efficiently search for pseudoknot structures in noncoding RNA genes. It profiles interleaving stems in pseudoknot structures with independent Covariance Model (CM) components. The statistical alignment score for searching is obtained by combining the alignment scores from all CM components. Our experiments show that the model can achieve excellent accuracy on both random and biological data. The efficiency achieved by the method makes it possible to search for structures that contain pseudoknot in genomes of a variety of organisms. Searching genomes with computational models has become an effective approach for the identification of genes. During recent years, extensive research has been focused on developing computationally efficient and accurate models that can find novel noncoding RNAs and reveal their associated biological functions. Unlike the messenger RNAs that encode the amino acid residues of protein molecules, noncoding RNA molecules play direct roles in a variety of biological processes including gene regulation, RNA processing, and modification. For example, the human 7SK RNA binds and inhibits the transcription elongation factor P-TEFb [17] [25] and the RNase P RNA processes the 5' end of precursor tRNAs and some rRNAs [7] . Noncoding RNAs include more than 100 different families [23] . Genome annotation based on models constructed from homologous sequence families could be a reliable and effective approach to enlarging the known families of noncoding RNAs. The functions of noncoding RNAs are, to a large extent, determined by the secondary structures they fold into. Secondary structures are formed by bonded base pairs between nucleotides and may remain unchanged while the nucleotide sequence may have been significantly modified through mutations over the course of evolution. Profiling models based solely on sequence content such as Hidden Markov Model (HMM) [12] may miss structural homologies when directly used to search genomes for noncoding RNAs containing complex secondary structures. Models that can profile noncoding RNAs must include both the content and the structural information from the homologous sequences. The Covariance Model (CM) developed by Eddy and Durbin [6] extends the profiling HMM by allowing the coemission of paired nucleotides on certain states to model base pairs, and introduces bifurcation states to emit parallel stems. The CM is capable of modeling secondary structures comprised of nested and parallel stems. However, pseudoknot structures, where at least two structurally interleaving stems are involved, cannot be directly modeled with the CM and have remained computationally intractable for searching [1] [13] [14] [18] [19] [20] [21] [24] . So far, only a few systems have been developed for profiling and searching for RNA pseudoknots. One example is ERPIN developed by Gautheret and Lambert [8] [15] . ERPIN searches genomes by sequentially looking for single stem loop motifs contained in the noncoding RNA gene, and reports a hit when significant alignment scores are observed for all the motifs at their corresponding locations. Since ERPIN does not allow the presence of gaps when it performs alignments, it is computationally very efficient. However, alignments with no gaps may miss distant homologies and thus result in a lower sensitivity. Brown and Wilson [2] proposed a more realistic model comprised of a number of Stochastic Context Free Grammar (SCFG) [3] [22] components to profile pseudoknot structures. In their model, the interleaving stems in a pseudoknot structure are derived from different components; the pseudoknot structure is modeled as the intersection of components. The optimal alignment score of a sequence segment is computed by aligning it to all the components iteratively. The model can be used to search sequences for simple pseudoknot structures efficiently. However, a generic framework for modeling interleaving stems and carrying out the search was not proposed in their work. For pseudoknots with more complex structure, more than two SCFG components may be needed and the extension of the iterative alignment algorithm to k components may require k! different alignments in total since all components are treated equally in their model. In this paper, we propose a new method to search for RNA pseudoknot structures using a model of multiple CMs. Unlike the model of Brown and Wilson, we use independent CM components to profile the interleaving stems in a pseudoknot. Based on the model, we have developed a generic framework for modeling interleaving stems of pseudoknot structures; we propose an algorithm that can efficiently assign stems to components such that interleaving stems are profiled in different components. The components with more stems are associated with higher weights in determining the overall conformation of a sequence segment. In order to efficiently perform alignments of the sequence segment to the model, instead of iteratively aligning the sequence segment to the CM components, our searching algorithm aligns it to each component independently following the descending order of component weights. The statistical log-odds scores are computed based on the structural alignment scores of each CM component. Stem contention may occur such that two or more base pairs obtained from different components require the participation of the same nucleotide. Due to the conformational constraints inherently imposed by the CM components, stem contentions occur infrequently (less than 30%) and can be effectively resolved based on the conformational constraints from the alignment results on components with higher weight values. The algorithm is able to accomplish the search with a worst case time complexity of O((k − 1)W 3 L) and a space complexity of O(kW 2 ), where k is the number of CM components in the model, W and L are the size of the searching window and the length of the genome respectively. We used the model to search for a variety of RNA pseudoknots inserted in randomly generated sequences. Experiments show that the model can achieve excellent sensitivity (SE) and specificity (SP) on almost all of them, while using only slightly more computation time than searching for pseudoknot-free RNA structures. We then applied the model and the searching algorithm to identify the pseudoknots on the 3' untranslated region in several RNA genomes from the corona virus family. An exact match between the locations found by our program and the real locations is observed. Finally, in order to test the ability of our program to cope with noncoding RNA genes with complex pseudoknot structures, we carried out an experiment where the complete DNA genomes of two bacteria were searched to find the locations of the tmRNA genes. The results show that our program identified the location with a reasonable amount of error (with a right shift of around 20 nucleotide bases) for one bacterial genome and for the other bacteria search was perfect. To the best of our knowledge, this is the first experiment where a whole genome of more than a million nucleotides is searched for a complex structure that contains pseudoknots. To test the performance of the model, we developed a search program in C language and carried out searching experiments on a Sun/Solaris workstation. The workstation has 8 dual processors and 32GB main memory. We evaluated the accuracy of the program on both real genomes and randomly generated sequences with a number of RNA pseudoknot structures inserted. The RNAs we choose to test the model are shown in Table 1 . Model training and testing are based on the multiple alignments downloaded from the Rfam database [10] . For each RNA pseudoknot, we divided the available data into a training set and a testing set, and the parameters used to model it are estimated based on multiple structural alignments among 5 − 90 homologous training sequences with a pairwise identity less than 80%. The emission probabilities of all nucleotides for a given state in a CM component are estimated by computing their frequencies to appear in the corresponding column in the multiple alignment of training sequences; transition probabilities are computed similarly by considering the rel- Table 2 . The performance of the model on different RNA pseudoknots inserted into a background (of 10 5 nucleotides) randomly generated with different C+G concentrations. TN is the total number of pseudoknotted sequence segments inserted; CI is the number of sequence segments correctly identified by the program (with a positional error less than ±3 bases); NH is the number of sequence segments returned by the program; SE and SP are sensitivity and specificity respectively. The thresholds of log-odds score are predetermined using the Z-score value of 4.0. ative frequencies for different types of transitions that occur between the corresponding consecutive columns in the alignment. Pseudocounts, dependent on the number of training sequences, are included to prevent overfitting of the model to the training data. To measure the sensitivity and specificity of the searching program within a reasonable amount of time, for each selected pseudoknot structure, we selected 10 − 40 sequence segments from the set of testing data and inserted them into each of the randomly generated sequences of 10 5 nucleotides. In order to test whether the model is sensitive to the base composition of the background sequence, we varied the C+G concentration in the random background. The program computes the log-odds, the logarithmic ratio of the probability of generating sequence segment s by the null (random) model R to that by our model M . It reports a hit when the Z-score of s is greater than 4.0. The computation of Z-scores requires knowing the mean and standard deviation for the distribution of log-odd scores of random sequence segments; both of them can be determined with methods similar to the ones introduced by Klein and Eddy [11] before the search starts. As can be seen in Table 2 , the program correctly identifies more than 80% of inserted sequence segments with excellent specificity in most of the experiments. The only exception is the srpRNA, where the program misses more than 50% inserted sequence segments in one of the experiments. The relatively lower sensitivity in that particular experiment can be partly ascribed to the fact that the pseudoknot structure of srpRNA contains fewer nucleotides; thus its structural and sequence patterns have a larger probability to occur randomly. The running time for srpRNA, however, is also significantly shorter than that needed by most of other RNA pseudoknots due to the smaller size of the model. Additionally, while the alpha−RBS pseudoknot has a more complex structure and three CM components are needed to model it, our searching algorithm efficiently identifies more than 95% of the inserted pseudoknots with high specificities. A higher C+G concentration in the background does not adversely affect the specificity of the model; it is evident from Table 2 that the program achieves better overall performance in both sensitivity and specificity in a background of higher C+G concentrations. We therefore conjecture that the specificity of the model is partly determined by the base composition of the genome and is improved if the base composition of the target gene is considerably different from its background. To test the accuracy of the program on real genomes, we performed experiments to search for particular pseudoknot structures in the genomes for a variety of organisms. Table 3 shows the genomes on which we have searched with our program and the locations annotated for the corresponding pseudoknot structures. The program successfully identified the exact locations of known 3'UTR pseudoknot in four genomes from the family of corona virus. This pseudoknot was recently shown to be essential for the replication of the viruses in the family [9] . In addition, the genomes of the bacteria, Haemophilus influenzae and Neisseria meningitidis MC58, were searched for their tmRNA genes. The Haemophilus influenzae DNA genome contains about 1.8 × 10 6 nucleotides and Neisseria meningitidis MC58 DNA genome contains about 2.2 × 10 6 nucleotides. The tmRNA functions in the transtranslation process to add a C-terminal peptide tag to the incomplete protein product of Table 3 . The results obtained with our searching program on the genomes of a variety of organisms. GA is the accession number of the genome; RL specifies the real location of the pseudoknot structure in the genome; SL is the one returned by the program; RT is the running time needed to perform the searching in hours; GL is the length of the genome in its number of bases. a defective mRNA [16] . The central part of the secondary structure of tmRNA molecule consists of four pseudoknot structures. Figure 1 shows the pseudoknot structures on the tmRNA molecule. In order to search the bacterial DNA genomes efficiently, the combined pseudoknots 1 and 2 were used to search the genome first; the program searches for the whole tmRNA gene only in the region around the locations where a hit for Pk1 and Pk2 is detected. We cut the genome into segments with shorter lengths (around 10 5 nucleotide bases for each), and ran the program in parallel on ten of them in two rounds. The result for Neisseria meningitidis MC58 shows that we successfully identified the exact locations of tmRNA. However, the locations of tmRNA obtained for Haemophilus influenzae have a shift of around 20 nucleotides with respect to its real location (7% of the length of the tmRNA). This slight error can probably be ascribed to our "hit-andextend" searching strategy to resolve the difficulty arising from the complex structure and the relatively much larger size of tmRNA genes; positional errors may occur during different searching stages and accumulate to a significant value. Our experiment on the DNA genomes also demonstrates that, for each genome, it is very likely there is only one tmRNA gene in it, since our program found only one significant hit. To our knowledge, this is the first computational experiment where a whole genome of more than a million nucleotides was successfully searched for a complex structure that contains pseudoknot structures. The Covariance Model (CM) proposed by Eddy and Durbin [6] [5] can effectively model the base pairs formed between nucleotides in an RNA molecule. Similarly to the emission probabilities in HMMs, the emission probabilities in the CM for both unpaired nucleotides and base pairs are positional dependent. The profiling of a stem hence consists of a chain of consecutive emissions of base pairs. Parallel stems on the RNA sequence are modeled with bifurcation transitions where a bifurcation state is split into two states. The parallel stems are then generated from the transitions starting with the two states that result respectively. The genome is scanned by a window with an appropriate length. Each location of the window is scored by aligning all subsequence segments contained in the window to the model with the CYK algorithm. The maximum log-odds score of them is determined as the log-odds score associated with the location. A hit is reported for a location if the computed log-odds score is higher than a predetermined threshold value. Pseudoknot structures are beyond the profiling capability of a single CM due to the inherent context sensitivity of pseudoknots. Models for pseudoknot structures require a mechanism for the description of their interleaving stems. Previous work by Brown and Wilson [2] and Cai et al. [4] has modeled the pseudoknot structures with grammar components that intersect or cooperatively communicate. A similar idea is adopted in this work; a number of independent CM components are combined to resolve the difficulty in profiling that arises from the interleaving stems. Interleaving stems are profiled in different CM components and the alignment score of a sequence segment is determined based on a combination of the alignment scores on all components. However, the optimal conformations from the alignments on different components may violate some of the conformational constraints that a single RNA sequence must follow. For example, a nucleotide rarely forms two different base pairs simultaneously with other nucleotides in an RNA molecule. This type of restriction is not considered by the independent alignments carried out in our model and thus may lead to erroneous searching results if not treated properly. In our model, stem contention may occur. We break the contention by introducing different priorities to components; base pairs determined from components with the highest priority win the contention. We hypothesize that, biochemically, components profiling more stems are likely to play more dominant roles in the formation of the conformation and are hence assigned higher priority weights. In order to profile the interleaving stems in a pseudoknot structure with independent CM components, we need an algorithm that can partition the set of stems on the RNA sequence into a number of sets comprised of stems that mutually do not interleave. Based on the consensus structure of the RNA sequence, an undirected graph G = (V, E) can be constructed where V , the set of vertices in G, consists of all stems on the sequence. Two vertices are connected with an edge in G if the corresponding stems are in parallel or nested. The set of vertices V needs to be partitioned into subsets such that the subgraph induced by each subset forms a clique. We use a greedy algorithm to perform the partition. Starting with a vertex set S initialized to contain a arbitrarily selected vertex, the algorithm iteratively searches the neighbors of the vertices in S and computes the set of vertices that are connected to all vertices in S. It then randomly selects one vertex v that is not in S from the set and modifies S by assigning v to S. The algorithm outputs S as one of the subsets in the partition when S can not be enlarged and randomly selects an unassigned vertex and repeats the same procedure. It stops when every vertex in G has been included in a subset. Although the algorithm does not minimize the number of subsets in the partition, our experiments show that it can efficiently provide optimal partitions of the stems on pseudoknot structures of moderate structural complexity. The CM components in the profiling model are generated and trained based on the partition of the stems. The stems in the same subset are profiled in the same CM component. For each component, the parameters are estimated by considering the consensus structure formed by the stems in the subset only. The optimal alignments of a sequence segment to the CM components are computed with the dynamic programming based CYK algorithm. As we have mentioned before, higher priority weights are assigned to components with more stems profiled. The component with the maximum number of stems thus has the maximum weight and is the dominant component in the model. The algorithm performs alignments in the descending order of component weights. It selects the sequence segment that maximizes the log-odds score from the dominant component. The alignment scores and optimal conformations of this segment on other components are then computed and combined to obtain the overall log-odds score for the segment's position on the genome. More specifically, we assume that the model contains k CM components M 0 , M 1 , ..., M k−1 in descending order of component weights. The algorithm considers all possible sequence segments s d that are enclosed in the window and uses Equation (1) to determine the sequence segment s to be the candidate for further consideration, where W is the length of the window used in searching, and Equation (2) to compute the overall log-odds score for s. We use sm i to denote the parts of s that are aligned to the stems profiled in CM component M i . Basically, Log odds(sm i |M i ) accounts for the contributions from the alignment of sm i to M i . The log-odds score of sm i is counted in both M 0 and M i and must be subtracted from the sum. Log odds(s|M ) = Log odds(s|M 0 ) The conformations corresponding to the optimal alignments of a sequence segment to all CM components are obtained by tracing back the dynamic programming matrices and checking to ensure that no stem contention occurs. Since each nucleotide in the sequence is represented with a state in a CM component, the CM inherently imposes constraints on the optimal conformations of sequence segments aligned to it. We hence expect that stem contention occurs with a low frequency. In order to verify this intuition, we tested the model on sequences randomly generated with different base compositions and evaluated the frequencies of stem contentions for pseudoknot structures on which we have performed an accuracy test; the results are shown in Figure 2 . The presence of stem contention increases the running time of the algorithm, because the alignment of one of the involved components must be recomputed to resolve the contention. Based on the assumption that components with more stems contribute more to the stability of the optimal conformation, we resolve the contention in favor of such components. We perform recomputation on the component with a lower number of stems by incorporating conformational constraints inherited from components with more stems into the alignment algorithm, preventing them from forming the contentious stems. Specifically, we assume that stem S j ∈ M i and stem contention occurs between S j and other stems profiled in M i−1 ; the conformational constraints from the component M i−1 are in the format of (l 1 , l 2 ) and (r 1 , r 2 ). In other words, to avoid the stem con- tmRNA-pk34 telomerase-vert tombus-3-IV alpha-RBS srpRNA Fig. 2 . 4000 random sequences were generated at each given base composition and aligned to the corresponding profiling model. The sequences are of about the same length as the length of the pseudoknot structure. The stem contention rates for each pseudoknot structure were measured and plotted. They were the ratio of the number of random sequences in which stem contentions occurred to the number of total random sequences. Left: plots of profiling models observed to have a stem contention rate lower than 20%, right: plots of these with slightly higher stem contention frequencies. The experimental results demonstrate that, in all pseudoknots where we have performed accuracy tests, stem contention occurs with a rate lower than 30% and is insensitive to the base composition of sequences. tention, the left and right parts of the stem must be the subsequences of indices (l 1 , l 2 ) and (r 1 , r 2 ) respectively. The dynamic programming matrices for S j are limited to the rectangular region that satisfies l 1 ≤ s ≤ l 2 and r 1 ≤ t ≤ r 2 . The stem contention frequency depends on the conformational flexibilities of the components in the covariance model. More flexibilities in conformation may improve the sensitivity of the model but cause higher contention frequency and thus increase the running time for the algorithm. In the worst case, recomputation is needed for all nondominant components in the model and the time complexity of the algorithm becomes O((k − 1)W 3 L), where k is the number of components in the model, W and L are the window length and the genome length respectively. In this paper, we have introduced a new model that serves as the basis for a generic framework that can efficiently search genomes for the noncoding RNAs with pseudoknot structures. Within the framework, interleaving stems in pseudoknot structures are modeled with independent CM components and alignment is performed by aligning sequence segments to all components following the descending order of their weight values. Stem contention occurs with a low frequency and can be resolved with a dynamic programming based recomputation. The statistical log-odds scores are computed based on the alignment results from all components. Our experiments on both random and biological data demonstrate that the searching framework achieves excellent performance in both accuracy and efficiency and can be used to annotate genomes for noncoding RNA genes with complex secondary structures in practice. We were able to search a bacterial genome for a complete structure with a pseudoknot in about one week on our Sun workstation. It would be desirable to improve our algorithm so that we could search larger genomes and databases. The running time, however, could be significantly shortened if a filter can be designed to preprocess DNA genomes and only the parts that pass the filtering process are aligned to the model. Alternatively, it may be possible to devise alternative profiling methods to the covariance model that would allow faster searches. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots RNA Pseudoknot Modeling Using Intersections of Stochastic Context Free Grammars with Applications to Database Search Small subunit ribosomal RNA modeling using stochastic context-free grammars Stochastic Modeling of Pseudoknot Structures: A Grammatical Approach Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids RNA sequence analysis using covariance models Ribonuclease P: unity and diversity in a tRNA processing ribozyme Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles Characterization of the RNA components of a Putative Molecular Switch in the 3' Untranslated Region of the Murine Coronavirus Genome Rfam: an RNA family database RSEARCH: Finding Homologs of Single Structured RNA Sequences Hidden Markov models in computational biology. Applications to protein modeling Prediction of RNA Pseudoknots-Comparative Study of Genetic Algorithms RNA pseudoknot prediction in energy based models RNAMotif, an RNA secondary structure definition and search algorithm Functional and structural analysis of a pseudoknot upstream of the tag-encoded sequence in E. coli tmRNA 7SK small nuclear RNA binds to and inhibits the activity of CDK9/cyclin T complexes Design, implementation and evaluation of a practical pseudoknot folding algorithm based on thermodynamics The language of RNA: a formal grammar that includes pseudoknots A Dynamic Programming Algorithm for RNA Structure Prediction Including Pseudoknots An iterated loop matching approach to the prediction of RNA secondary structures with pseudoknots Stochastic Context-Free Grammars for tRNA Modeling An expanding universes of noncoding RNAs Tree adjoining grammars for RNA structure prediction The 7SK small nuclear RNA inhibits the Cdk9/cyclin T1 kinase to control transcription