key: cord-0977129-kkfvwy2t
authors: Li, Pan; Zhou, Xiaolin; Xu, Kui; Zhang, Qiangfeng Cliff
title: RASP: an atlas of transcriptome-wide RNA secondary structure probing data
date: 2020-10-17
journal: Nucleic Acids Res
DOI: 10.1093/nar/gkaa880
sha: f83c0f041b87396c318eba7d66afea4556424c4b
doc_id: 977129
cord_uid: kkfvwy2t

RNA molecules fold into complex structures that are important across many biological processes. Recent technological developments have enabled transcriptome-wide probing of RNA secondary structure using nucleases and chemical modifiers. These approaches have been widely applied to capture RNA secondary structure in many studies, but gathering and presenting such data from very different technologies in a comprehensive and accessible way has been challenging. Existing RNA structure probing databases usually focus on low-throughput or very specific datasets. Here, we present a comprehensive RNA structure probing database called RASP (RNA Atlas of Structure Probing), built by collecting 161 deduplicated transcriptome-wide RNA secondary structure probing datasets from 38 papers. RASP covers 18 species across animals, plants, bacteria, fungi and viruses, and categorizes 18 experimental methods including DMS-seq, SHAPE-Seq, SHAPE-MaP and icSHAPE. In particular, RASP curates up-to-date datasets from several RNA secondary structure probing studies of the RNA genome of SARS-CoV-2, the RNA virus that caused the ongoing COVID-19 pandemic. RASP also provides a user-friendly interface to query, browse and visualize RNA structure profiles, offering a shortcut to accessing RNA secondary structures grounded in experimental data. The database is freely available at http://rasp.zhanglab.net.

RNA is critical across biological processes, and a range of cellular mechanisms act upon it to carefully regulate and refine gene expression (1). The specific secondary structures formed by non-coding RNAs (ncRNAs) are central to their regulation and functions (2) (3) (4). Recent studies have also found that mRNA secondary structures influence gene transcription, translation and decay (5). The secondary structures of many RNA viruses also have important functions. For example, the 3′ UTRs of Flaviviruses produce highly structured noncoding RNAs that are resistant to host nucleases (6). As more and more functions for RNA secondary structure are discovered, deciphering the structures themselves has become a priority.

During the past few decades, many computational methods for predicting RNA secondary structure have been developed (7) (8) (9). These methods work well only on shorter RNA sequences, yet they have been a major source of information for RNA structure studies (10). However, computational prediction and modeling usually cannot take into consideration the complex cellular environment and thus lack the resolution needed for in vivo studies. Small molecule approaches have long been used to quantitatively measure RNA conformation (11). In the last few years, thanks to the development of high-throughput sequencing technology, RNA structure probing has entered the omics era, allowing simultaneous, transcriptome-wide measurement of RNA structure (yielding the so-called 'structuromes'), both in test tubes and in cellular conditions (12) (13) (14).

The principles underlying RNA probing largely fall into two categories: nuclease cleavage and small molecule-based probing (Figure 1). Nucleases P1 and S1 cut single-stranded RNA, whereas RNase V1 cuts double-stranded RNA (15, 16). During reverse transcription, cDNA synthesis stops at the cleavage site, revealing information about single-stranded and double-stranded nucleotides upon high-throughput sequencing. Small chemicals such as 1M7, DMS, N3-kethoxal and NAI-N3 can be used to specifically probe single-stranded RNA bases (17) (18) (19) (20) (21) (22). Upon reverse transcription, the RT enzyme either terminates at the modified site (17, 18, 20, 21) or mis-incorporates nucleotides as a result of the chemical modification (19, 22). By normalizing RT stop values or mutation rates, a structural score can be assigned to each base, measuring the likelihood of that base being single-stranded or double-stranded. RNA secondary structure databases such as RMDB (23) and RSVdb (24) have been developed, but they normally focus on specific datasets with very limited coverage. For example, RMDB contains diverse RNA structural mapping experiments, but focuses on low-throughput experiments (23). RSVdb collects RNA structure data, but is limited to DMS reagent-based datasets (24). Given the increasing volume of experimental RNA structure data (especially data from high-throughput technologies) and their broad relevance to biological processes, a comprehensive database is highly desirable.
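The normalization of RT stop values into a per-base structural score, mentioned above, can be illustrated with a toy sketch. The scheme below (background-subtracted stop rate, scaled by the transcript maximum) is an assumption chosen for simplicity; it is not the pipeline of any specific method in RASP, and the function name and parameters are hypothetical.

```python
from typing import List, Optional


def rt_stop_scores(
    treated_stops: List[int],
    control_stops: List[int],
    treated_cov: List[int],
    control_cov: List[int],
    min_cov: int = 1,
) -> List[Optional[float]]:
    """Toy per-base score from RT stop counts (illustrative scheme only).

    raw score = treated stop rate - control stop rate, floored at 0,
    then divided by the largest raw score in the transcript so that
    values fall in [0, 1]. Positions with coverage below min_cov in
    either library are reported as None (no data).
    """
    raw: List[Optional[float]] = []
    for ts, cs, tc, cc in zip(treated_stops, control_stops, treated_cov, control_cov):
        if tc < min_cov or cc < min_cov:
            raw.append(None)
            continue
        raw.append(max(ts / tc - cs / cc, 0.0))

    observed = [r for r in raw if r is not None]
    top = max(observed) if observed else 0.0
    if top == 0.0:
        # No signal above background anywhere: report zeros where data exist.
        return [None if r is None else 0.0 for r in raw]
    return [None if r is None else r / top for r in raw]
```

Published methods differ in how they handle background, coverage thresholds and outliers, so the original papers should be consulted for the actual scoring rules.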

Here we describe a database, RASP, that collects 161 datasets from 38 papers (Table 1). RASP categorizes and organizes data from 18 species across animals, plants, bacteria, fungi and viruses, generated with 18 different experimental methods. RASP contains almost all currently published transcriptome-scale data, including the most recent studies that probed the RNA secondary structures of the genome of the SARS-CoV-2 RNA virus, with technologies ranging from PARS (16, 25), DMS-seq (17) and Structure-seq (18) to SHAPE (26), icSHAPE (21) and SHAPE-MaP (27). RASP provides a user-friendly interface to query, browse and download data. In addition, RASP implements analytical functions such as multiple sequence alignment, along with RNA secondary structure prediction and visualization. This database will greatly expand the accessibility and cross-comparison of RNA secondary structures, empowering relevant research across fields.

We retrieved published papers containing high-throughput RNA structure probing data from PubMed. For experiments containing multiple conditions, we collected datasets for each condition separately. We classified datasets by species and by the experimental technology used (Table 1). Detailed information including species, cell line, reagents used in the experiment and publication information was also collected and used for classification during construction of the RASP database (Supplementary Table S1).

In principle, all RNA secondary structure probing methods generate a structure score (under different names) that measures the pairing probability of each nucleotide (14). Where processed structural scores were provided, we downloaded them directly from the publications and integrated them into RASP (Figure 2, right). Otherwise, we downloaded the raw data and used the same pipeline as the original paper to calculate the scores (Figure 2, left). This data processing information, along with the number of transcripts with structural scores, is listed in Supplementary Table S1.

RASP provides many ways to interact with the data stored in the database, including search, browse and download, as well as functions for structure prediction, multiple sequence alignment and cross-dataset comparison (Figure 2, bottom).

RASP provides two kinds of inquiry modes: 'search gene' and 'search sequence' (Figure 3A). In the gene-based inquiry mode, a user first selects one or more species, then inputs a gene symbol (such as GAPDH), an Ensembl gene ID or a transcript ID of interest. Clicking 'Search Gene' gives the user a match list (Figure 3B). Each match item includes the organism name, the genome location, the match string (gene or transcript), the gene symbol, a transcript list and a genome browser link. The user has two options here: click any specific transcript to visualize the transcript sequence and structure score (see 'Visualize sequence and structure data for transcript' section below) or click 'Go' to visualize the data in the genome browser (see 'Genome browser' section). In the sequence-based inquiry mode, a user can select a species, input a DNA or RNA sequence and click 'Search Sequence' to search the genome for the query sequence using blastn (28). This action returns a hit list (Figure 3C). Each hit item includes the organism name, the genome location, the E-value, the match between query sequence and target sequence, the gene symbol, a transcript list and a genome browser link. As described above, the user can visualize data in the genome or transcript by clicking the 'Go' button or any transcript.

RASP integrates JBrowse (29), which allows users to visualize and compare structure scores in the genome (Figure 4A). On the 'Browse' page, users select a species and input the gene symbol or Ensembl gene ID of interest. Clicking 'Go' refreshes the browser to display the gene region. There are two options for users to load structure data: (i) a 'Click here to select datasets' button to expand the selection panel; (ii) a 'Select tracks' button on the top left of the browser region to shift the browser selection panel. Users can filter the data on a range of criteria including the name of the technology, the reagents used in the experiment, the journal where the paper was published, specific experimental conditions, the cell line, the strand and the experimental principle. Users can filter out tracks without structural data coverage in the genomic region by ticking the 'Only show tracks with structural data coverage' checkbox.

JBrowse also allows users to download a small amount of data: after highlighting a region of interest, users can click 'save track data' in the track menu to save the structure score as a bedGraph file. Users can also rearrange tracks by dragging the track labels, and visualize their custom tracks by uploading local files or providing hyperlinks. JBrowse also provides convenient ways to compare structure scores. Figure 4B shows two examples of the structure scores of the human GAPDH and yeast RPL32 transcripts obtained by different technologies. Users can also easily compare the structural differences of the same RNA under different conditions, which may help to visually identify the influence of different conditions on RNA structure.

By clicking the transcript on the search results page (see 'Search interface for retrieving gene and sequence' section) (Figure 3B, C), users are taken to a new page to visualize the sequence and structure score (Figure 4C). There are five panels on this page. The 'Summary of transcript' panel displays basic transcript information including the genome location and the transcript biotype. The 'Selection' panel contains buttons that allow users to select a sequence region and perform various operations on it, including copy, download, alignment and structure prediction. The 'Sequence' panel displays the full sequence of the transcript, with the UTR highlighted in orange. Users can click any two bases to select a region and copy or download this region. To search for a subsequence, users enter the subsequence in the text area and click the 'Search' button. If the search succeeds, the subsequence is selected. The 'Load probing data' panel loads probing data. The structure score is displayed with base-specific colors in the 'Sequence' panel. High structure scores are indicated in red, and low scores in blue. The 'Compare probing data' panel draws scatter plots that compare structure scores between any two selected datasets. The Pearson correlation coefficient is shown. If the user selects a region, only the data in the region will be displayed. The 'statistics' panel reports statistics and draws a distribution plot for the structure score of the full transcript or the selected region.
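The cross-dataset comparison can be sketched offline as follows. This is a minimal example, assuming per-base scores are held in two equal-length lists with `None` marking positions without data; the function name and the 0-based half-open `region` argument are hypothetical and do not describe the server's implementation.

```python
import math
from typing import Optional, Sequence, Tuple


def pearson_shared(
    a: Sequence[Optional[float]],
    b: Sequence[Optional[float]],
    region: Optional[Tuple[int, int]] = None,
) -> float:
    """Pearson correlation over positions scored in both datasets.

    region, if given, is a 0-based half-open (start, end) slice that
    restricts the comparison to a selected part of the transcript.
    """
    if region is not None:
        start, end = region
        a, b = a[start:end], b[start:end]

    # Keep only positions where both datasets have a score.
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two shared positions")

    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation undefined for constant scores")
    return sxy / math.sqrt(sxx * syy)
```

Restricting to shared positions matters because different probing methods cover different nucleotides (for example, DMS mainly reports on A and C), so the two score vectors rarely have the same missing positions.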

If a user is interested in the structure of homologous sequences, they can compare the structure scores of homologous sequences through the 'Alignment' page. Multiple sequences and their corresponding structure scores can be input (Figure 5A), and upon clicking 'Submit', the RASP server uses MUSCLE (30) to align the sequences and returns a new page including the aligned sequences and the aligned structure scores (Figure 5B). The user can then click the 'Download' button to download the aligned sequences and structure scores. Users can also select a region of a transcript and click the 'Add' button in the transcript summary page (Figure 4C) to save the data, and then directly load the data in the 'Alignment' page.
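Aligned structure scores can be derived from an aligned sequence by inserting placeholders at gap columns. The sketch below assumes gaps are written as `-` and that scores are given for the ungapped sequence in order; the function name is hypothetical and this is not the server's code.

```python
from typing import List, Optional


def project_scores(
    aligned_seq: str,
    scores: List[Optional[float]],
    gap: str = "-",
) -> List[Optional[float]]:
    """Spread per-base scores over alignment columns.

    Gap columns receive None; every other column receives the next
    score of the ungapped sequence.
    """
    n_bases = sum(1 for c in aligned_seq if c != gap)
    if n_bases != len(scores):
        raise ValueError("score count does not match ungapped sequence length")
    remaining = iter(scores)
    return [None if c == gap else next(remaining) for c in aligned_seq]
```

After projection, scores of homologous sequences share the same column indices and can be compared column by column.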

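The 'Predict' page allows users to enter a sequence and its structure scores in text boxes and to provide the intercept and slope parameters (Figure 5C). The structure score is converted into a pseudo free energy term through the following formula (31):

$$\Delta G_{\mathrm{SHAPE}}(i) = m \ln[\mathrm{score}(i) + 1] + b$$

where score(i) is the structure score of nucleotide i, m is the slope and b is the intercept.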

Deigan et al. used grid search to fit the slope and intercept based on the secondary structure and SHAPE scores of E. coli rRNA (31). We took the values they obtained (slope = 1.8, intercept = −0.6) as default parameters, but users can adjust these parameters to change the weight of the structure score. Users can also select a region of a transcript and click the 'Predict' button in the transcript summary page (Figure 4C) to jump directly to the 'Predict' page for structural prediction. Clicking the 'Submit' button returns a page with the predicted structures (Figure 5D). The 'Summary of query information' panel contains the input sequences, constraints and parameters. The 'Prediction results' panel presents a list of predicted structures ranked by free energy. Users can visualize a structure with forna (32) by clicking the 'Go' button, or by copying a Java command to visualize the structure with VARNA (33).
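To show how the slope and intercept weight the structure score, the sketch below assumes the logarithmic pseudo free energy form m·ln(score + 1) + b used in SHAPE-directed folding (31), with the default parameters quoted above. The function name and the choice to return no bonus for missing scores are assumptions made for this example.

```python
import math
from typing import Optional


def pseudo_free_energy(
    score: Optional[float],
    slope: float = 1.8,
    intercept: float = -0.6,
) -> float:
    """Pseudo free energy term for one nucleotide.

    Computes slope * ln(score + 1) + intercept. A missing score (None)
    contributes 0.0, i.e. no bonus or penalty (an assumption of this
    sketch). Scores at or below -1 are rejected because the logarithm
    is undefined there.
    """
    if score is None:
        return 0.0
    if score <= -1.0:
        raise ValueError("score must be greater than -1")
    return slope * math.log(score + 1.0) + intercept


# A score of 0 yields the intercept alone; higher scores raise the term,
# making it less favorable for that nucleotide to be paired.
low = pseudo_free_energy(0.0)
high = pseudo_free_energy(1.0)
```

With the defaults, the term changes sign at score = exp(0.6 / 1.8) − 1, roughly 0.40: bases scoring below this are rewarded for pairing and bases above it are penalized. Increasing the slope strengthens the influence of the probing data on the prediction.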

The 'Download' page provides genome reference sequence files, annotation files and structure score files. Processed structure data are saved in the bigWig and bed formats. bigWig is a binary file format indexed by genomic coordinates. bigWig files can be converted to text-format bedGraph files using the bigWigToBedGraph program from the UCSC tools package (34). Users can refer to the 'Help' page for detailed operation steps. Users can also download the text-format bed files.
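Once a bigWig file has been converted (the UCSC tool is invoked as `bigWigToBedGraph in.bigWig out.bedGraph`), the resulting bedGraph text can be expanded into per-base scores. The sketch below assumes standard four-column bedGraph lines with 0-based, half-open intervals; the function name is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def bedgraph_to_per_base(lines: Iterable[str]) -> Dict[Tuple[str, int], float]:
    """Expand bedGraph intervals into a {(chrom, position): score} mapping.

    bedGraph columns are: chrom, start (0-based), end (exclusive), value.
    Header lines ('track', 'browser', comments) and blank lines are skipped.
    """
    scores: Dict[Tuple[str, int], float] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        for pos in range(int(start), int(end)):
            scores[(chrom, pos)] = float(value)
    return scores
```

A dictionary is convenient for small regions; for whole-genome files an array-based or indexed representation would be more memory-efficient.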

The recent outbreak of COVID-19 has rapidly spread across the world and caused tremendous damage to society and the economy. COVID-19 is caused by SARS-CoV-2, a highly infectious single-stranded RNA virus. Previous studies of other RNA viruses such as HCV, HIV, Dengue virus and Zika virus have demonstrated that the secondary structures of their RNA genomes are important for the viral life cycle (26, 35-37). Much research effort has therefore been dedicated to determining the secondary structure of the SARS-CoV-2 RNA genome using RNA structure probing methods including SHAPE-MaP, DMS-MaPseq and icSHAPE (38) (39) (40) (41) (42) (43).

Prompted by the urgent need to fight the ongoing pandemic, we made a special effort to collect existing RNA secondary structure data for the SARS-CoV-2 genome (Figure 6). We explored datasets associated with six manuscripts deposited on bioRxiv, including one study from our own laboratory (38), and found that two of the six studies disclosed their processed structural data (39, 40). We have collected these data and integrated them into RASP. Users can easily visualize, compare and download the data through our database server. We are actively monitoring progress in the field and will continue to collect new results and update the database.

High-throughput RNA secondary structure probing has attracted a great deal of research interest (44), and large-scale RNA secondary structure datasets are accumulating rapidly. These datasets are helpful for modeling RNA secondary structure and for analyzing correlations between structure and cellular activities such as transcription rate and translation efficiency. To meet the current need to comprehensively collect and process RNA probing data, we developed RASP, which covers 161 manually curated RNA probing datasets from 18 species and 18 different experimental technologies. In contrast to existing databases, RASP provides a comprehensive data and analysis platform.

The current version of RASP mainly includes datasets from large-scale RNA secondary structure studies, integrated with RNA structurome analysis tools. However, previous low-throughput experiments have also generated a lot of RNA structural data. For comprehensiveness, in the future RASP will include these low-throughput data, as well as results of studies focused on specific RNA targets such as the lncRNA HOTAIR (45). We also aim to integrate more analysis tools, such as RNA covariation analysis, discovery of conserved RNA structure elements, and multiple methods for RNA secondary structure prediction and modeling. We expect that RASP will greatly aid researchers and accelerate RNA structure research, as well as enable follow-up analyses of current large datasets.

Supplementary Data are available at NAR Online.

Non-coding RNA genes and the modern RNA world

Secondary structure of the large subunit ribosomal RNA from Escherichia coli, Zea mays chloroplast, and human and mouse mitochondrial ribosomes

Structure of a ribonucleic acid

The family of box ACA small nucleolar RNAs is defined by an evolutionarily conserved secondary structure and ubiquitous sequence elements essential for RNA accumulation

RNA structure maps across mammalian cellular compartments

A highly structured, nuclease-resistant, noncoding RNA produced by flaviviruses is required for pathogenicity

Fast algorithm for predicting the secondary structure of single-stranded RNA

Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information

The four ingredients of single-sequence RNA secondary structure prediction. A unifying perspective

Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction

Progress and challenges for chemical probing of RNA structure inside living cells

High-throughput determination of RNA structures

Understanding the transcriptome through RNA structure

RNA regulations and functions decoded by transcriptome-wide RNA structure probing

FragSeq: transcriptome-wide RNA structure probing using high-throughput sequencing

Genome-wide measurement of RNA secondary structure in yeast

Genome-wide probing of RNA structure reveals active unfolding of mRNA structures in vivo

In vivo genome-wide profiling of RNA secondary structure reveals novel regulatory features

Pervasive regulatory functions of mRNA structure revealed by high-resolution SHAPE probing

Keth-seq for transcriptome-wide RNA structure mapping

Structural imprints in vivo decode RNA regulatory mechanisms

DMS-MaPseq for genome-wide or targeted RNA structure probing in vivo

An RNA mapping dataBase for curating RNA structure mapping experiments

RSVdb: a comprehensive database of transcriptome RNA structure

Landscape and variation of RNA secondary structure across the human transcriptome

The coding region of the HCV genome contains a network of regulatory RNA structures

Functionally conserved architecture of hepatitis C virus RNA genomes

Database indexing for production MegaBLAST searches

JBrowse: a dynamic web platform for genome visualization and analysis

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Accurate SHAPE-directed RNA structure determination

Forna (force-directed RNA): simple and effective online RNA secondary structure diagrams

VARNA: interactive drawing and editing of the RNA secondary structure

BigWig and BigBed: enabling browsing of large distributed datasets

Architecture and secondary structure of an entire HIV-1 RNA genome

Structure mapping of dengue and Zika viruses reveals functional long-range interactions

Integrative analysis of Zika virus genome RNA structure reveals critical determinants of viral infectivity

In vivo structural characterization of the whole SARS-CoV-2 RNA genome identifies host cell target proteins vulnerable to re-purposed drugs

Genome-wide mapping of therapeutically-relevant SARS-CoV-2 RNA structures

Comprehensive in-vivo secondary structure of the SARS-CoV-2 genome reveals novel regulatory motifs and mechanisms

Comparative analysis of coronavirus genomic RNA structure reveals conservation in SARS-like coronaviruses

Structure of the full SARS-CoV-2 RNA genome in infected cells

Specific viral RNA drives the SARS CoV-2 nucleocapsid to phase separate

Genome-wide analysis of RNA secondary structure

HOTAIR forms an intricate and modular secondary structure