SALTS – SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite

Mohan V Kasukurthi1,§, Dominika Houserova2,§, Yulong Huang2, Addison A. Barchie3, Justin T. Roberts4, Dongqi Li1, Bin Wu5,*, Jingshan Huang1,2,6,*, and Glen M Borchert2,3,*

1 School of Computing, University of South Alabama, Mobile, AL, 36688, USA
2 Department of Pharmacology, University of South Alabama, Mobile, AL, 36688, USA
3 Department of Biology, University of South Alabama, Mobile, AL, 36688, USA
4 Department of Biochemistry and Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
5 First Affiliated Hospital, Kunming Medical University, Kunming, Yunnan, China
6 Qilu University of Technology (Shandong Academy of Science), Jinan, Shandong, China

§ The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
* The authors wish it to be known that, in their opinion, the last three authors should be regarded as joint Corresponding Authors.

To whom correspondence should be addressed: Tel: +1 251 461 1367; Email: borchert@southalabama.edu, Tel: +1 251 460 7612; Email: huang@southalabama.edu, Tel: +86 871 65334106; Email: wu.bin.kmu@qq.com

bioRxiv preprint, doi: https://doi.org/10.1101/2021.02.08.430280; this version posted February 10, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license (http://creativecommons.org/licenses/by-nd/4.0/).

ABSTRACT

The widespread utilization of high-throughput sequencing technologies has unequivocally demonstrated that eukaryotic transcriptomes consist primarily (>98%) of non-coding RNA (ncRNA) transcripts that are significantly more diverse than their protein-coding counterparts. ncRNAs are typically divided into two categories based on their length. (1) ncRNAs less than 200 nucleotides (nt) long are referred to as small non-coding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), etc., and the majority of these are thought to function primarily in controlling gene expression. That said, the full repertoire of sncRNAs remains poorly defined, as evidenced by two entirely new classes of sncRNAs only recently being reported, i.e., snoRNA-derived RNAs (sdRNAs) and tRNA-derived fragments (tRFs). (2) ncRNAs longer than 200 nt are known as long ncRNAs (lncRNAs). lncRNAs represent the second largest transcriptional output of the cell (behind only ribosomal RNAs), and although functional roles for several lncRNAs have been reported, most lncRNAs remain largely uncharacterized due to a lack of predictive tools aimed at guiding functional characterizations. Importantly, whereas high-throughput transcriptome sequencing is now affordable for most active research programs, the tools necessary for interpreting the resulting data typically require significant computational expertise and resources, markedly hindering widespread utilization of these datasets. In light of this, we have developed a powerful new ncRNA transcriptomics suite, SALTS, which is highly accurate, markedly efficient, and extremely user-friendly.
SALTS stands for SURFR (sncRNA) And LAGOOn (lncRNA) Transcriptomics Suite and offers platforms for comprehensive sncRNA and lncRNA profiling and discovery, ncRNA functional prediction, and the identification of significant differential expression among datasets. Notably, SALTS is accessed through an intuitive Web-based interface, can be used to analyze either user-generated, standard next-generation sequencing (NGS) output file uploads (e.g., FASTQ) or existing NCBI Sequence Read Archive (SRA) data, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides. SALTS constitutes the first publicly available, Web-based, comprehensive ncRNA transcriptomic NGS analysis platform designed specifically for users with no computational background, providing a much-needed, powerful new resource capable of enabling more widespread ncRNA transcriptomic analyses. The SALTS WebServer is freely available online at http://salts.soc.southalabama.edu.

GENERAL INTRODUCTION

Cellular metabolism and survival are greatly dependent on how quickly and efficiently the cell can respond to internal and external stimuli. This process often requires tightly orchestrated genome-wide changes in gene expression.
With rapid technological advancements in both genomics and transcriptomics, particularly the development of robust deep sequencing, it is ever more apparent that many regulatory non-coding RNAs (ncRNAs) that help coordinate gene expression changes remain elusive and that the networks they form are far more complex than previously thought(1). As many of these ncRNAs are dynamic and their presence or absence is highly conditional (e.g., dependent on environmental stress, disease, tissue type, etc.), their identification poses a challenge and many remain undescribed(2). As such, we have developed a set of guidelines and parameters to help confidently identify and characterize these molecules. Importantly, by implementing alternative strategies for next-generation sequencing (NGS) analysis based on examining conditional changes in expression and/or fragmentation patterns from individual genomic loci rather than depending on pre-existing annotations, we find that previously elusive ncRNAs can now be readily identified via our platform. In addition, we have developed an array of downstream analyses to more fully characterize identified ncRNAs and predict their functional roles (e.g., molecular targets). To date, several platforms aimed at either small non-coding RNA (sncRNA) or long non-coding RNA (lncRNA) characterization have been developed(3). Although each of these existing platforms possesses some unique advantages, each also carries its own critical limitations (detailed herein). That said, to our knowledge, SALTS is the first resource designed to determine ncRNA expression in both short ncRNA-Seq and standard RNA-Seq datasets and to provide functional predictions for ncRNAs identified in either. Perhaps most importantly, in addition to being highly accurate and efficient, SALTS has been developed to require absolutely no computational background in order to enable widespread ncRNA transcriptomic analysis by a much broader community of researchers.
Of note, a clear, step-by-step user manual for the SALTS platform is provided in Supplemental Information File 1.

SECTION 1. SALTS Tool for Small non-coding RNA Analysis: SURFR

ncRNAs less than 200 nucleotides (nt) in length are referred to as small non-coding RNAs (sncRNAs) and include microRNAs (miRNAs), piwi-interacting RNAs (piRNAs), small nucleolar RNAs (snoRNAs), transfer RNAs (tRNAs), etc.(4). One striking example of the regulatory capabilities of sncRNAs comes from the small yet potent miRNAs. miRNAs are ~22 nt RNAs excised from longer pre-miRNA hairpins that function by associating with the RNA-induced silencing complex (RISC) in order to bind to the 3’ UTRs of their target mRNAs and repress their translation(5). In just the past two decades, thousands of miRNAs have been identified and implicated in regulating cell growth, differentiation, and apoptosis(6), as well as in contributing to tumorigenesis(7) and chemoresistance(8). As this group has been thoroughly examined due to its relevance to various types of cancer(9), it is now widely accepted that a single miRNA is capable of altering the expression of whole cohorts of protein-coding genes(4). Importantly, studies aimed at evaluating the transcriptomic changes of miRNAs have revealed the existence of miRNA-like fragments derived from other ncRNA biotypes and suggest that similar regulatory capacities may be associated with these novel sncRNAs(10–13). As such, we suggest that the SURFR resource described herein represents an intuitive, high-throughput platform capable of revisiting old NGS datasets and identifying novel, relevant, previously overlooked miRNA-like fragments derived from other types of ncRNAs.
Comparably sized, miRNA-like fragments excised from many other types of ncRNAs have now been reported, and many of these have been shown to similarly regulate gene expression and/or chromatin compaction (e.g., piRNAs, rasiRNAs, rRNAs, scRNAs, snoRNAs, snRNAs, RNase P, tRNAs, Y RNAs, and Vault RNAs)(10–13). That said, the expression and functions of the vast majority of specific sncRNA fragments excised from anything other than annotated miRNAs remain largely undefined, although fragments from snoRNAs (sdRNAs) and tRNAs (tRFs) have recently begun to receive considerably more attention(12, 13). In 2008, Ender et al. were the first to report a small RNA fragment originating from a snoRNA, ACA45(14). Although the principal function of snoRNAs had long been characterized as guiding rRNA modifications, they showed that this snoRNA-derived RNA (sdRNA) was not only processed by Dicer, like regular miRNAs, but also capable of silencing the CDC2L6 gene in a miRNA-like manner. Since then, various other studies have described similar fragments arising from other snoRNAs (reviewed in (15)) as well as from other types of ncRNAs. Notably, tRNA-derived fragments (tRFs) have recently gained attention due to their differential abundance under highly specific conditions, such as developmental stage(16), stress(17), or viral infection(18). Moreover, the regulatory capacity of some tRFs has been observed; Zhou et al., for example, showed that a fragment excised from the 5’ end of tRNA-Glu regulates BCAR3 expression in ovarian cancer(19).
It is now clear that ncRNA-derived miRNA-like fragments are precisely processed out of various types of ncRNA transcripts, and that this processing is evolutionarily conserved across species(10–13). While an increasing body of evidence suggests that specifically excised sncRNA fragments from an array of ncRNAs exist and are functionally relevant, there are currently no Web-based, user-friendly resources that offer comprehensive sncRNA fragment profiling and discovery, functional prediction, and the identification of significant differential expression of fragments among datasets. To address this gap, we present SURFR. SURFR refers to our Short Uncharacterized RNA Fragment Recognition tool, which identifies all miRNA, snoRNA, and tRNA fragments (as well as fragments from all other ncRNAs annotated in Ensembl) specifically excised in a given transcriptome provided as either a raw user-generated RNA-Seq dataset or an NCBI SRR file identifier. In addition, SURFR can compare individual fragment expression among as many as 30 distinct datasets (as well as compare the expression of full length (non-fragmented) sncRNAs).

SURFR Features

• Identifies fragments specifically excised from all miRNAs, tRNAs, rRNAs, scaRNAs, scRNAs, snoRNAs, sRNAs, vault RNAs, and any other ncRNAs annotated in the current Ensembl assembly(20) in individual small RNA-Seq datasets.
• Up to ten files can be processed at once, and up to 30 individual files can then be compared after processing for ncRNA fragment differential expression analysis.
• SURFR can also determine and compare the expression of all full length (non-fragmented) sncRNAs in a given transcriptome.
• SURFR results are stored on the server indefinitely, protected by state-of-the-art cryptographic algorithms, and can be instantly recalled by the user by entering their session key in the “Get Results” tab on the SURFR home page.
• OmniSearch-based miRNA analysis of annotated miRNAs(21).
• Direct, intuitive visualization of individual ncRNA fragmentation.
• Easily downloadable Excel files of results from a single RNA-Seq file and/or comparisons among files. These files can be filtered (if desired) and list clearly defined, readily understandable, pertinent data (e.g., fragment expression, host gene links, and the exact fragment sequence excised).
• Contains prepopulated ncRNA databases allowing the identification of ncRNA fragments and/or ncRNA expression in 440 unique animal, plant, fungal, protist, and bacterial species.

In addition, SURFR RNA fragment calls require considerably less processing time than previous ncRNA fragment identification pipelines for two principal reasons. We have (1) developed a novel alignment strategy significantly faster than traditional methods (e.g., BLAST(22)) and (2) designed a novel method to locate the start and end positions of an ncRNA fragment using wavelets. Full details of these novel computational methodologies are described at length in Supplemental Information File 2.

SURFR Workflow

Figure 1. SURFR workflow. Sequence Input (left). The user provides up to ten unmodified small RNA-Seq datasets as input. These datasets can all be uploaded directly by the user or downloaded from the NCBI SRA database by entering SRA IDs. sncRNA Fragment Analysis (middle). SURFR identifies all ncRNA fragments (both annotated and novel) and their expression in up to ten datasets per session. sncRNA Fragment Visualization (top right).
Graphics of individual host ncRNAs and the fragments excised (along with the expression at each nt position) are provided. In addition, tables comparing the expression of all fragments within individual datasets and comparing fragment expression across all datasets are generated. SURFR Cross Section Comparison (bottom right). The user can comprehensively compare all fragment expression identified in up to 30 individual datasets by entering multiple SURFR session IDs from separate analyses.

SURFR Input

Under “Use SURFR”, the user first selects the organism corresponding to the sequences. SURFR small RNA databases have been prepopulated for 440 species, including 286 metazoans, 62 plants, and 92 fungi, protists, and bacteria. As indicated in Figure 1, the user then provides one to ten small RNA sequencing datasets as input. These datasets can all be uploaded directly by the user, all be downloaded from the NCBI SRA database(23) by entering SRA IDs (e.g., SRR6495855, SRR4217122), or be any combination thereof (for example, three datasets uploaded by the user along with seven datasets downloaded from the NCBI SRA database). Importantly, a major strength of SURFR is that users can upload most raw small RNA-Seq files directly as original, unmodified, compressed FASTQ files (as provided by commercial sequencers) with absolutely no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. Allowable formats for uploading are standard FASTA or FASTQ files, either uncompressed or in any major compression format.

SURFR Output

After the user uploads/specifies the small RNA-Seq datasets and clicks the “Let’s SURF” button, the browser is automatically redirected to a report page. Progress indicators for each uploaded dataset are provided under the “Click Here To Choose Your File” drop-down menu at the top of the page (Figure 2A), with individual datasets that have completed analysis indicated by a checkmark.
Following completion of analysis, results for the individual file selected are then displayed on the report page and organized into several sections (Figure 2).

Figure 2. SURFR report page example. (A) The “Click Here To Choose Your File” drop-down menu for selecting individual RNA-Seq files. (B) A summary of the overall composition of the selected small RNA-Seq dataset. (C) The “Create ncR Profile” button automatically populates the derived RNA Profile section at the bottom of the page. (D) The “Derived RNA Fragments” window detailing each fragment identified in the individual, selected small RNA-Seq dataset. (E) The user can download an Excel file detailing the full set of information presented in the “Derived RNA Fragments” window by pressing the “Download Results” button. (F) The “Differential Expression Vector (DEV)” window illustrates each nucleotide within a host gene and indicates the fragment called with a blue rectangle. The x-axis represents the position in the ncRNA selected (e.g., miR-29a), and the y-axis depicts the expression levels of the ncRNA at each position. (G) The “Selected ncRNA & Called RNA Fragment Sequences” window illustrates the full length host ncRNA (miR-29a) highlighting the SURFR-called fragment in yellow. (H) The “Derived RNA Profile” window details each fragment identified in any of the analyzed small RNA-Seq datasets and compares fragment expression across samples.
(I) The “OmniSearch for miRNAs” window lists the top 50 OmniSearch entries (reported targets and PubMed publications) for an individual miRNA selected in the “Derived RNA Profile” window. (J) The “Full Length ncRNA Expression Analyses” button in the upper center of the results page redirects the user to a SURFR window detailing the expression of all full length sncRNAs in the provided datasets.

A summary of the overall composition of the selected small RNA-Seq dataset, including the file size, total number of reads, number of mapped reads, and time taken for analysis, is included just below the file selection window at the top of the page (Figure 2B). The user can compare fragment expression across all datasets by pressing the “Create ncR Profile” button, which automatically populates the derived RNA Profile section at the bottom of the page (Figure 2C).

The “Derived RNA Fragments” window (Figure 2D) details the Ensembl Gene ID, Ensembl Transcript ID, gene annotation (name), the type of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (RPM), and the nucleotide sequence for each fragment identified in the individual, selected small RNA-Seq dataset. The “Derived RNA Fragments” window is an interactive table that allows users to view, sort, and filter small RNA fragments based on any column value.
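For readers who want to relate raw read counts to the RPM values listed in this window, the normalization can be sketched as follows. This is a minimal illustration and not SURFR's own code: the function name is hypothetical, and using the number of mapped reads (rather than the total number of reads) as the denominator is an assumption, since the text does not state which is used.

```python
def reads_per_million(fragment_reads: int, mapped_reads: int) -> float:
    """Scale a fragment's read count to reads per million (RPM).

    Assumption: the denominator is the number of mapped reads in the
    dataset; the source does not say whether total or mapped reads are used.
    """
    if mapped_reads <= 0:
        raise ValueError("mapped_reads must be a positive integer")
    if fragment_reads < 0:
        raise ValueError("fragment_reads cannot be negative")
    return fragment_reads * 1_000_000 / mapped_reads


# Example: 250 reads supporting a fragment in a dataset with 5 million mapped reads.
example_rpm = reads_per_million(250, 5_000_000)
print(example_rpm)
```

Because RPM depends only on the fragment count and the dataset size, values computed this way are comparable across datasets of different sequencing depth.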
Users can also view host gene information available at the RNAcentral browser by selecting a fragment in the table and then clicking the “RNAcentral” button on the toolbar(24). The user can download an Excel file detailing the full set of information presented in the “Derived RNA Fragments” window (Figure 2D) for each fragment identified in the individual, selected small RNA-Seq dataset by pressing the “Download Results” button (Figure 2E). An Excel file containing the derived RNA fragment information in its entirety will be automatically downloaded to the user’s computer (Figure 3).

Figure 3. Derived RNA Fragments “Download Results” File. The first few rows of an example “Download Results” Excel file detailing the full set of information presented in the “Derived RNA Fragments” window: Ensembl “Gene ID”, Ensembl “Transcript ID”, gene “Annotation” (name), the “Type” of gene a fragment was excised from, the start and end positions of a fragment within its host gene, the expression of a fragment in reads per million (RPM), and the nucleotide “Sequence” for each fragment identified in the selected small RNA-Seq dataset.

The “Differential Expression Vector (DEV)” window (Figure 2F) details the expression of each nucleotide within a host gene and indicates the fragment called with a blue rectangle. The x-axis in the graph shown in Figure 2F represents the position in the ncRNA selected (miR-29a), and the y-axis represents the expression level of the ncRNA at each position. The user can also interactively view the expression at each individual nucleotide by panning over the image, zoom in or out using the buttons on the top right, and/or download DEV image files and an Excel file detailing expression at each nucleotide by selecting the menu button on the top right of the window.
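The idea behind a per-nucleotide expression vector can be illustrated with a short sketch. It assumes reads have already been aligned to a host ncRNA and are given as hypothetical `(start, end, count)` tuples in 1-based inclusive coordinates. The boundary call shown here is a simple threshold around the coverage peak, used only for illustration; SURFR itself locates fragment boundaries with a wavelet-based method described in Supplemental Information File 2, which this sketch does not reproduce.

```python
from typing import List, Optional, Tuple

Alignment = Tuple[int, int, int]  # (start, end, read_count), 1-based inclusive


def expression_vector(host_length: int, alignments: List[Alignment]) -> List[int]:
    """Return read coverage at each nucleotide position of the host ncRNA."""
    diff = [0] * (host_length + 2)  # difference array; index 0 is unused
    for start, end, count in alignments:
        if not (1 <= start <= end <= host_length):
            raise ValueError(f"alignment {start}-{end} lies outside the host gene")
        diff[start] += count
        diff[end + 1] -= count
    vector, running = [], 0
    for pos in range(1, host_length + 1):
        running += diff[pos]
        vector.append(running)
    return vector


def call_fragment(vector: List[int], fraction: float = 0.5) -> Optional[Tuple[int, int]]:
    """Illustrative threshold call (NOT the wavelet method used by SURFR).

    Extends left and right from the coverage peak while coverage stays at or
    above `fraction` of the peak; returns 1-based inclusive (start, end).
    """
    peak = max(vector, default=0)
    if peak == 0:
        return None
    cutoff = peak * fraction
    left = right = vector.index(peak)
    while left > 0 and vector[left - 1] >= cutoff:
        left -= 1
    while right < len(vector) - 1 and vector[right + 1] >= cutoff:
        right += 1
    return left + 1, right + 1


# Toy host gene of 12 nt with three groups of aligned reads.
dev = expression_vector(12, [(5, 10, 100), (5, 11, 20), (1, 3, 2)])
fragment = call_fragment(dev)
```

Plotting `dev` against position gives the kind of profile shown in the DEV window, with the called `fragment` corresponding to the highlighted region.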
The “Selected ncRNA & Called RNA Fragment Sequences” window (Figure 2G) illustrates the full length host ncRNA, highlighting the SURFR-called fragment in yellow just as depicted in the preceding DEV window (Figure 2F).

Figure 3 content (example rows of the “Download Results” file):

Gene ID            Transcript ID      Annotation  Type    Fragment(start-end)  Expression(RPM)  Sequence
ENSG00000199135.1  ENST00000362265.1  MIR101-1    miRNA   46 - 66              68602            TACAGTACTGTGATAACTGA
ENSG00000284032.1  ENST00000362111.4  MIR29A      miRNA   41 - 62              64394            TAGCACCATCTGAAATCGGTT
ENSG00000207752.1  ENST00000385019.1  MIR199A1    miRNA   46 - 67              34071            ACAGTAGTCTGCACATTGGTT
ENSG00000207638.1  ENST00000384906.1  MIR99A      miRNA   12 - 33              33760            AACCCGTAGATCCGATCTTGT
ENSG00000288462.1  ENST00000673161.1  MIR23A      miRNA   44 - 63              13936            ATCACATTGCCAGGGATTT
ENSG00000198973.4  ENST00000362103.4  MIR375      miRNA   39 - 60              6214             TTTGTTCGTTCGGCTCGCGTG
ENSG00000199085.3  ENST00000362215.3  MIR148A     miRNA   43 - 64              3774             TCAGTGCACTACAGAACTTTG
ENSG00000199047.3  ENST00000362177.3  MIR378A     miRNA   42 - 63              2166             ACTGGACTTGGAGTCAGAAGG
ENSG00000207713.3  ENST00000384980.3  MIR200C     miRNA   43 - 65              1713             TAATACTGCCGGGTAATGATGG
ENSG00000277864.1  ENST00000516881.1  SCARNA15    scaRNA  65 - 86              1360             AGGTAGATAGAACAGGTCTTG
ENSG00000277947.1  ENST00000619178.1  SNORD3D     snoRNA  194 - 217            1304             GGAGAGAACGCGGTCTGAGTGGT

The “Derived RNA Profile” window (Figure 2H) details the Ensembl Gene ID, Ensembl Transcript ID, gene annotation (name), the type of gene a fragment was excised from, the average start and end positions of a fragment within its host gene (to be considered the same fragment, start and end positions had to agree within 5 nt), the corresponding nucleotide sequence for each “average” fragment listed, the start and end positions of a fragment within its host gene along with the fragment’s expression (RPM) in each individual small RNA-Seq dataset, and finally, the % standard deviation of the expression of individual fragments(20). Importantly, the full list of all fragments identified in any of the datasets is presented. The “Derived RNA Profile” window is an interactive table that allows users to view, sort, and filter small RNA fragments based on any column value. Users can also view host gene information available at the RNAcentral browser(24) by selecting a fragment in the table and then clicking the “Search ncRNA In RNAcentral” button on the toolbar. The user can also download an Excel file detailing the full set of information presented in the “Derived RNA Profile” window by pressing the “Generate Report” button at the top right of the window. An Excel file containing the derived RNA profile information in its entirety will be automatically downloaded to the user’s computer (Figure 4). In addition, Excel file reports can be downloaded following the application of specific filters in the “Derived RNA Profile” window (e.g., only snoRNA fragments can be included or excluded).

Figure 4. Derived RNA Profile “Generate Report” File. The first few rows of an example “Generate Report” Excel file detailing the full set of information presented in the “Derived RNA Profile” window.

The “OmniSearch for miRNAs” window (Figure 2I) returns the top 50 OmniSearch entries(21) (reported targets and PubMed entries) for an individual miRNA selected in the preceding “Derived RNA Profile” window. Finally, when desired, the “Full Length ncRNA Expression Analyses” button (Figure 2J) redirects the user to a SURFR window detailing the expression of all full length sncRNAs in the provided datasets regardless of fragmentation. Importantly, all pertinent features (e.g.,
expression table downloads) described above are similarly available for full length sncRNA analyses via this resource.

SURFR Example Use/Case Study

SURFR allows users to profile and compare the expression of sncRNA fragments (both annotated and novel) across multiple small RNA-Seq experiments in order to identify the top sncRNA fragments significantly differentially expressed in a particular disease, tissue, developmental stage, etc. Our group’s interest in fragments excised from ncRNAs other than miRNAs initially arose from an attempt to identify novel miRNA contributors to breast cancer(12). For this work, we performed small RNA sequencing on several breast cancer cell lines, and while we failed to identify any (traditional) miRNAs of interest, we did identify a snoRNA fragment (which we named sdRNA-93) that was specifically and significantly overexpressed in MDA-MB-231 cells, a widely studied model of a highly invasive and metastatic human cancer. Next, as we found sdRNA-93 to be significantly overexpressed in these cells (≥75x compared to controls), we decided to determine whether sdRNA-93 functionally contributed to the malignant phenotype. Stringently testing sdRNA-93 inhibitors and mimics in MDA-MB-231 cells across multiple time points revealed that sdRNA-93 gain- and loss-of-function had profound effects on invasion within standard matrix-based (matrigel) chemoattractant assays. Remarkably, sdRNA-93 loss-of-function reduced cell invasion by >90% at 48 hours compared to control cells, whereas sdRNA-93 gain-of-function enhanced cell invasion by >100%. Thus, we showed that a single sdRNA (sdRNA-93) strongly and selectively regulates invasion of MDA-MB-231 cells.
These findings link a specific sdRNA (sdRNA-93) to an aggressive malignant phenotype (invasion) within an established cancer cell model that is widely used to study invasive behavior. We next employed a BLAST-based methodology to determine sdRNA-93 expression across small RNA-Seq datasets corresponding to 115 unique breast cancer patients and detected strong overexpression of sdRNA-93 in 92.8% of tumors classified as Luminal B Her2+, compared to normal tissue controls (extremely low expression) and other breast cancer subtypes (modest expression levels of 30-40%). Thus, this work represented the first evidence demonstrating that sdRNAs that regulate specific malignant properties are differentially expressed within divergent molecular subtypes of human breast cancer(12).

Importantly, our initial BLAST-based identification of sdRNA-93 as being significantly overexpressed in MDA-MB-231 cells was highly labor-intensive, taking days to complete. In contrast, when we uploaded our original unmodified FASTQ sequencing files to SURFR, sdRNA-93 was readily identified as the most highly differentially expressed snoRNA fragment between our two cancer cell lines in just 7.9 minutes (Figure 5).

Figure 5. SURFR identification of sdRNA-93. (A) “Derived RNA Fragments” window showing that SNORD93-derived sdRNA-93 was identified as the second most highly expressed sdRNA in the highly invasive breast cancer cell line MDA-MB-231.
(B) Alignment among the human genome (GRCh38 Ch7:22856601:22856699:1) (top), snoRNA-93 (ENSG00000221740) (middle), and a next-generation small RNA sequence read (bottom) obtained by Illumina sequencing of MDA-MB-231 RNA as originally described in (12). All sequences are in the 5′ to 3′ direction. An asterisk indicates base identity between the snoRNA and genome. Vertical lines indicate identity across all three sequences. (C) “Derived RNA Profile” window comparing small RNA-Seq results for MCF-7 and MDA-MB-231 cells. Note that SNORD93-derived sdRNA-93 was identified as the most significantly differentially expressed sdRNA between the weakly and highly invasive breast cancer cell lines.

SURFR Comparison to other Existing Tools

Numerous characterizations of significant regulatory roles for sncRNA fragments excised from various types of ncRNAs other than miRNAs have now been reported(10–13). As new high-throughput small RNA sequencing strategies(25) continue to make small RNA-Seq faster and less expensive, there is a clear need for tools capable of digesting large amounts of small RNA-Seq data in order to detect and characterize all small RNA genes, including specifically excised small RNA fragments. Most existing tools either focus almost exclusively on miRNAs (e.g., miRDeep(26), miRSpring(27), miRanalyzer(28), etc.) or only evaluate existing sncRNA annotations (e.g., sRNAnalyzer(29), Oasis2.0(30), SPAR(31), etc.), and are therefore not capable of fully defining small RNA-Seq ncRNA fragment profiles and the differences among these datasets.
That said, most existing tools capable of characterizing novel ncRNA fragments and their expression, such as FlaiMapper(32), SPORTS(33), and DEUS(34), require fairly extensive computational expertise for utilization, support only pre-aligned file inputs (BAM), and/or require standalone installation (Table 1). As such, we have designed SURFR to address the need for a user-friendly, Web-based, comprehensive small RNA fragment tool requiring no computational expertise to utilize. In stark contrast to most existing platforms, SURFR identifies fragments excised from all types of ncRNAs annotated in Ensembl(20) in a given transcriptome provided as either a raw user-generated RNA-Seq dataset or an NCBI SRA file. In addition, SURFR can compare individual fragment expression among as many as 30 distinct datasets, and we have included ncRNA databases for 440 unique animal, plant, fungal, protist, and bacterial species. Importantly, there are currently no Web-based, user-friendly resources comparable to SURFR that offer comprehensive sncRNA fragment profiling and discovery, functional prediction, and the identification of significant differential expression among datasets. Although two platforms, sRNA toolbox(35) and sRNAtools(36), do offer many of SURFR’s features, SURFR distinguishes itself by providing significantly more intuitive, versatile, and user-friendly results generated in less than 10% of the time required for data upload and processing by these tools. That said, because SURFR was developed specifically for ncRNA fragment identification, it does not provide expression analysis for full length ncRNAs.

Table 1. SncRNA analysis platform feature comparison. Various features offered by SURFR were compared to other existing tools including sRNA toolbox(35), Oasis2.0(30), sRNAtools(36), CPSS2.0(37), SPAR(31), sRNAnalyzer(29), SPORTS1.0(33), DEUS(34), FlaiMapper(32), and featureCounts(38).
Features examined were: “Online,” if tool is available online; “Input,” form of input RNA-Seq dataset, either raw (direct NGS output) or pre-processed (e.g., requires BAM file); “Clear, User-friendly Results/Output,” if interactive and user-friendly results are generated directly; “Library Oligo Sequences Req,” if user knowledge of NGS oligo sequences is required; “TCGA, SRA, GEO, or Encode Input,” if publicly available RNA-Seq datasets can be specified for examination based on identifier alone; “Known Full Length sncRNA Expressions,” detection and quantification of known sncRNAs; “Novel Full Length sncRNA Expressions,” detection and quantification of novel sncRNAs; “Novel sncRNA Fragment Discovery,” detection and quantification of novel ncRNA fragments; “Differential Expression,” ability of the tool to integrate expression data from multiple files (“sRNAde” denotes that expression analyses can be performed in parallel); and “Species,” number of species available for analysis. “user” denotes that the tool has the capacity to perform a given task but requires additional user input or user-directed changes to the program’s code and/or advanced settings.

Notably, as a verification of SURFR’s accuracy, we recreated an analysis of ten prostate cancer small RNA-Seq files previously performed using FlaiMapper(39). Importantly, FlaiMapper-based ncRNA fragment discovery of these ten files originally identified 147 snoRNA-derived fragments that were 18 to 35 nt in length and expressed at > 10 RPM.
Similarly, SURFR analysis of the same files identified 110 snoRNA-derived fragments expressed at > 10 RPM, and strikingly, 104 of these fragments were nearly identically identified (+/- 2 nts) by both methods. Notably, we find the majority of the FlaiMapper-identified sdRNA fragments not present in the SURFR calls were excluded based on SURFR’s 100% sequence identity requirement (in contrast to FlaiMapper’s 2 nt mismatch allowance). SECTION 2. SALTS Tool for Long non-coding RNA Analysis: LAGOOn ncRNAs longer than 200 nt in length are known as long ncRNAs (lncRNAs). This distinction, while somewhat arbitrary and based on technical aspects of RNA isolation methods, serves to distinguish lncRNAs from miRNAs and other sncRNAs. lncRNA loci are present in large numbers in eukaryotic genomes typically comparable to or exceeding that of protein coding genes. Many lncRNAs possess features reminiscent of protein-coding genes, such as having a 5′ cap and undergoing alternative splicing(40). In fact, many lncRNA genes have two or more exons(40), and about 60% of lncRNAs have polyA+ tails. In addition, although numerous long intergenic RNAs (lincRNAs)(41) including eRNAs from gene-distal enhancers have recently been reported(42), the majority of lncRNA genes identified to date are located within 10 kb of protein-coding genes and typically found to be antisense to coding genes or intronic(43). That said, many lncRNAs are expressed at relatively low levels in highly specific cell types(40) both explaining why the majority of lncRNAs were thought to be “transcriptional noise” until quite recently and also representing perhaps the single largest challenge in terms of lncRNA discovery and characterization. NGS has now identified tens of thousands of lncRNA loci in humans alone with the number of lncRNAs linked to human diseases quickly increasing. 
That said, lncRNA functionality is highly contentious, and the number of experimentally characterized and/or disease-associated lncRNAs remains in the low hundreds, or ≤1% of identified loci(44). This has led to a burgeoning focus on elucidating the molecular mechanisms that underlie lncRNA functions(45). Although only a minority of identified lncRNAs have been functionally characterized, several distinct modes of action for lncRNAs have now been described, including functioning as signals, decoys, scaffolds, guides, enhancer RNAs, and short peptide messages(46)(47). Importantly, however, there are currently no Web-based, user-friendly resources that offer comprehensive lncRNA profiling, functional prediction, and the identification of significant differential expressions among datasets. To address this gap, we present LAGOOn. LAGOOn refers to our Long-noncoding and Antisense Gene Occurrence and Ontology tool, which identifies all lncRNAs expressed in a given human transcriptome from either a user-provided RNA-Seq dataset or publicly available SRA file(23). In addition, LAGOOn can also compare lncRNA expressions among datasets and predict likely functional roles for individual lncRNAs.

LAGOOn Features

• Direct, intuitive visualization of significant lncRNA expressions. Determines the expressions of all lncRNAs annotated in the current Ensembl assembly(20) in individual human RNA-Seq datasets.
• Identifies differentially utilized lncRNA exons.
• Up to three files can be processed at once, and up to 15 individual files can then be compared after processing for lncRNA differential expression analysis.
• LAGOOn results are stored on the server indefinitely, protected by powerful state-of-the-art cryptographic algorithms, and can be instantly recalled by entering a previous session key in “Access Your Results” on the LAGOOn home page.
• Easily downloadable Excel files of results profiling a single RNA-Seq file and/or comparisons among various files. These files can be filtered (if desired) and list clearly defined, readily understandable, and pertinent data (e.g., expression, lncRNA Ensembl ID, etc.).
• Detailed, comprehensive lncRNA functional prediction detailing:
  o If a lncRNA serves as a host for a sncRNA(45).
  o Significant potentials for a lncRNA to serve as a specific miRNA sponge(48).
  o All overlaps between a given lncRNA and annotated enhancers(49).
  o Significant potentials for lncRNAs to serve as naturally occurring antisense silencers for genes located on the strand opposite to themselves(50).
  o Associations between individual lncRNAs and ribosomes suggesting microprotein production(51).

Importantly, LAGOOn is the first Web-based, user-friendly resource that offers real-time lncRNA profiling, the identification of significant differential expressions among datasets, and an array of functional prediction assessments beyond standard mRNA interaction characterizations. Full details of these novel computational methodologies are described at length in Supplemental Information File 3.

LAGOOn Workflow

Figure 6. LAGOOn workflow. Sequence Input (left). The user provides up to two unmodified RNA-Seq files and one Ribo-Seq dataset (optional) as input. These datasets can all be uploaded directly by the user or downloaded from the NCBI SRA database by entering SRA IDs. lncRNA Exon Analysis (middle). LAGOOn enumerates all annotated lncRNA expressions in up to three datasets per session.
lncRNA Expression and Functional Prediction Visualization (top right). An interactive table is generated comparing the expressions of all exons within individual datasets and comparing exon expressions across all datasets. Tables indicating putative lncRNA functions are also depicted. LAGOOn Cross Section Comparison (bottom right). The user can comprehensively compare all exon expressions identified in up to 15 individual datasets by entering multiple LAGOOn session IDs from separate analyses.

LAGOOn Input

As summarized in Figure 6, after selecting “Start New Analysis” on the LAGOOn homepage, the browser is redirected to the “Data Transfer Options” page where the user provides one or two RNA sequencing datasets as input and is given the chance to provide an optional, additional input, i.e., a Ribo-Seq dataset for determining microprotein coding potentials. These datasets can all be uploaded directly by the user, all downloaded from the NCBI SRA database(23) by entering SRA IDs (e.g., SRR9729388, SRR6290085), or provided in any combination thereof. Importantly, a major strength of LAGOOn is that users can upload most raw RNA-Seq files directly as original, unmodified, compressed FASTQ files (as provided by commercial sequencers) with absolutely no preprocessing and with no specifics about library generation, linkers, or oligonucleotides required. There is no limit on the size of SRA files, whereas individual user-uploaded files are limited to 18 GB regardless of format. Extremely large sequencing files exceeding this size can be converted to FASTA format and then compressed prior to being uploaded if necessary. Allowable upload formats are uncompressed, standard FASTA or FASTQ files or any major compression of either. In addition to this, the LAGOOn homepage provides links to: (1) “Access Your Results,” where users can retrieve results from previous sessions by providing a session key and then compare results from up to five separate sessions; (2) “LAGOOn Search,” where users can obtain detailed, comprehensive functional predictions for individual lncRNAs; and (3) “Download Our Databases,” where users can download databases containing all the lncRNAs and/or lncRNA exons employed by LAGOOn.

LAGOOn Output

After the user uploads/specifies the RNA-Seq datasets, the browser is automatically redirected to the LAGOOn report page (Figure 7). Initially, a summary of the size and composition of individual RNA-Seq datasets, the number of lncRNAs expressed in a dataset, and the top ten most highly expressed lncRNAs in the specified dataset are shown. Following selection of either one or all of the RNA-Seq files and the Ribo-Seq file (if included) from the file selection toolbar (Figure 7A), results for the selected file(s) are displayed on the report page under the “Results” tab (Figure 7B) and organized into several distinct sections.

Figure 7. LAGOOn report page. LAGOOn report example. (A) The file selection toolbar contains drop-down menus for selecting individual RNA-Seq and Ribo-Seq files. (B) The toolbar allowing selection of either the “Summary” or “Results” tab.
(C) The lncRNA expression window displays a filterable table of all lncRNA exons expressed in any of the user-provided files. Full length lncRNA sequence, individual exon sequence, or Ensembl lncRNA gene information is obtained by selecting an exon in the table and then clicking the “lncRNA Sequence,” “Exon Sequence,” or “Search lncRNA in Ensembl” button on the toolbar. (D) The “Generate Report” button creates and automatically downloads an Excel file detailing the full set of information presented in the expression table window. (E) The “Exon Sponge to (miRNA)” window lists all miRNA complementarities of ten base pairs or greater occurring within the selected lncRNA exon. (F) The “lncRNA host to” window lists all full length ncRNAs contained in any of the selected lncRNA’s exons. (G) The “Enhancer” window lists all overlaps between a selected lncRNA and GeneHancer annotated enhancers (as well as genes with expression linked to individual enhancers). (H) The “lncRNA Overlapping Genes” window lists all genes even partially overlapping a lncRNA locus on either strand.

The table presented in Figure 7C details the Ensembl Gene ID, the Ensembl Exon ID along with gene annotation (name), the expressions (RPM) of all lncRNA exons in each individual RNA-Seq dataset, and finally, the % standard deviation of the expression of individual exons(20). Importantly, the full list of all exons found to be expressed in any of the datasets is presented.
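The “% standard deviation” column described above can be illustrated with a short Python sketch. The exact formula LAGOOn applies is not stated in the text, so this sketch assumes the population standard deviation of an exon's RPM values divided by their mean; the function name `percent_sd` is hypothetical.

```python
import statistics
from typing import Sequence


def percent_sd(rpm_values: Sequence[float]) -> float:
    """Standard deviation of an exon's RPM values as a percentage of their mean.

    Assumption: population standard deviation (pstdev); the source does not
    say whether the population or sample form is used.
    """
    mean = statistics.fmean(rpm_values)
    if mean == 0:
        # An exon with no expression in any dataset has no meaningful spread.
        return 0.0
    return 100.0 * statistics.pstdev(rpm_values) / mean


# RPM values of one exon across three datasets (rounded values as displayed
# in an example report row).
print(round(percent_sd([1, 128, 30]), 1))  # about 102.5
```

With the rounded RPM values shown for the FTX exons in the example report (1, 128, and 30), this sketch gives roughly 102.5, which is close to but not identical with the 102.23 listed there. The difference may simply reflect rounding of the displayed RPM values.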
In addition, the expression table is interactive and allows user to view, sort, and filter based on any column value by clicking the “Filter Table” button on the toolbar. Users can also obtain a full length lncRNA sequence, a specific exon sequence, or view the lncRNA gene information available at Ensembl by selecting an exon in the table and then clicking the “lncRNA Sequence,” “Exon Sequence,” or “Search lncRNA in Ensembl” button on the toolbar. The user can also download an Excel file detailing the full set of information presented in the expression table window by pressing the “Generate Report” button at the top right of the window (Figure 7D). An Excel file containing the expression table window information in its entirety will be automatically downloaded to the user’s computer (Figure 8). In addition, refined Excel file reports can be downloaded following the application of specific filters (e.g., lncRNAs with RPM > 1 in the Ribo-Seq dataset). Figure 8. lncRNA expression table “Generate Report” File. The first few rows of an example “Generate Report” Excel file detailing the full set of information presented in the lncRNA expression window. Finally, putative functional roles for lncRNAs/lncRNA exons selected in the expression table are depicted in Figure 7E-H. As lncRNAs frequently function as miRNA sponges that directly basepair with and effectively inactivate mature miRNAs(48), the “Exon Sponge to (miRNA)” window lists all miRNA complementarities of ten base pairs or greater occurring within the selected lncRNA exon (Figure 7E). Next, as numerous lncRNAs have been shown to encode sncRNAs (e.g., miRNAs and snoRNAs) in their exonic sequences, and sncRNA expression often relies on excision from the host lncRNA transcript(45), the “lncRNA host to” window lists all full length ncRNAs contained in any of the selected lncRNA’s exons (Figure 7F). 
In addition, as several lncRNAs have been reported to function through regulating the accessibility of transcriptional enhancers overlapping their genomic loci(49), all overlaps between a selected lncRNA and GeneHancer(52) annotated enhancer (and genes with expression linked to individual enhancers) are detailed in the “Enhancer” window (Figure 7G). And finally, in addition to lncRNA exonic sequences serving as sncRNA hosts, many sncRNAs are processed from lncRNA introns(45). Furthermore, many lncRNAs serve as naturally occurring antisense silencers of genes located on the strand opposite to themselves(50). For both of these reasons, as well as other potential regulatory relationships, all genes overlapping a lncRNA locus on either the positive or negative strand are detailed in the “lncRNA Overlapping Genes” window (Figure 7H). Importantly, a comprehensive report detailing each of the functional predictions is also available for individual lncRNAs by selecting the “LAGOOn Search” button on the homepage after entering a lncRNA Ensembl gene identifier. Notably, this search functionality does not require full LAGOOn analysis. 
| lncRNA | Exon | SRR8730291 (RPM) | SRR6290085 (RPM) | SRR9729388 (RPM) | % Standard Deviation |
|---|---|---|---|---|---|
| ENSG00000230590 | ENSE00003874886_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003858311_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003847528_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000225470 | ENSE00003808225_JPX_1_JPX transcript, XIST activator [HGNC:37191]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003241026_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003429313_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003861803_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000230590 | ENSE00003849720_FTX_-1_FTX transcript, XIST regulator [HGNC:37190]_lncRNA | 1 | 128 | 30 | 102.23 |
| ENSG00000283117 | ENSE00003789008_AC004949.1_-1_novel transcript_lncRNA | 1 | 35 | 19 | 76.18 |
| ENSG00000284722 | ENSE00003811861_AP003175.1_-1_novel transcript_lncRNA | 1 | 118 | 14 | 118.23 |
| ENSG00000259234 | ENSE00002540221_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57 |
| ENSG00000259234 | ENSE00002554868_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57 |
| ENSG00000259234 | ENSE00002573893_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57 |
| ENSG00000259234 | ENSE00002557951_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57 |
| ENSG00000259234 | ENSE00002541714_ANKRD34C-AS1_-1_ANKRD34C antisense RNA 1 [HGNC:48618]_lncRNA | 1 | 54 | 10 | 106.57 |
| ENSG00000213904 | ENSE00003224994_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31 |
| ENSG00000213904 | ENSE00003062809_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31 |
| ENSG00000213904 | ENSE00001552276_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31 |
| ENSG00000213904 | ENSE00002995358_LIPE-AS1_1_LIPE antisense RNA 1 [HGNC:48589]_lncRNA | 1 | 17 | 8 | 75.31 |
| ENSG00000251259 | ENSE00002021304_AC004069.1_-1_novel transcript_lncRNA | 1 | 65 | 6 | 120.52 |
| ENSG00000272430 | ENSE00003695861_LINC02637_1_long intergenic non-protein coding RNA 2637 [HGNC:54120]_lncRNA | 1 | 9 | 5 | 65.52 |
| ENSG00000237491 | ENSE00002920037_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86 |
| ENSG00000237491 | ENSE00001642276_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86 |
| ENSG00000237491 | ENSE00001741526_AL669831.5_1_novel transcript_lncRNA | 1 | 4 | 3 | 47.86 |
| ENSG00000251562 | ENSE00003717116_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06 |
| ENSG00000251562 | ENSE00002080048_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06 |
| ENSG00000251562 | ENSE00003742980_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2322 | 3 | 141.06 |
| ENSG00000251562 | ENSE00003753954_MALAT1_1_metastasis associated lung adenocarcinoma transcript 1 [HGNC:29665]_lncRNA | 1 | 2146 | 3 | 141.03 |

LAGOOn Example Use/Case Study

ncRNAs are increasingly recognized as major players in the pathogenesis of diseases such as cancer. Metastasis Associated Lung Adenocarcinoma Transcript 1 (MALAT1) is a nuclear enriched lncRNA that is generally overexpressed in patient tumors and metastases. Overexpression of MALAT1 has been shown to be positively correlated with tumor progression and metastasis in a large number of tumor types including breast tumors.
Furthermore, an earlier study evaluating breast cancer patient samples showed that MALAT1 expression is higher in breast tumors as compared to adjacent normal tissues (reviewed in (53)). As such, we elected to compare lncRNA expressions in a breast cancer cell line (MDA-MB-231) RNA-Seq dataset (SRR12101868) with those of a human bone tissue RNA-Seq dataset (SRR12101882) in order to identify significantly differentially expressed lncRNAs and their putative functions. We also screened a Ribo-Seq dataset (SRR10883792) of the BRX-142 cell line, which was established from circulating tumor cells collected from a woman with advanced HER2-negative breast cancer(54), for potential MALAT1 microprotein production. Strikingly, the total time for download and analysis of these three NGS datasets by LAGOOn was only 3 min 52 sec. More importantly, however, LAGOOn identified MALAT1 as the most highly expressed lncRNA in MDA-MB-231 breast cancer cells (Figure 9). In agreement with previous demonstrations that MALAT1 functions (in part) as a miR-145-5p sponge in numerous malignancies including breast cancer(55), LAGOOn identified MALAT1 as a probable miR-145-5p sponge (Figure 9A, top right). In addition, LAGOOn also found that MALAT1 overlaps with, and may therefore potentially be involved in regulating, several distinct genomic enhancers and sncRNAs (Figure 9A, lower windows). Finally, similarly in agreement with previous analyses(56), LAGOOn also identified MALAT1 as one of three lncRNAs significantly represented in the BRX-142 cell Ribo-Seq dataset, strongly suggesting MALAT1 encodes at least one micropeptide.

Figure 9. LAGOOn identification of MALAT1 overexpression in breast cancer. (A) The “Results” window showing MALAT1 was identified as the most highly expressed lncRNA in the highly invasive breast cancer cell line MDA-MB-231 (SRR12101868). (B) The “Generate Report” Excel file showing MALAT1 (yellow) was identified as the most highly expressed lncRNA in MDA-MB-231 cells. Both windows indicate MALAT1 is present in the breast cancer Ribo-Seq dataset (SRR10883792).

LAGOOn Comparison to other Existing Tools

LncRNAs represent the largest single class of ncRNAs. However, unlike sncRNAs, which are thought to mostly function in gene regulation through complementary basepairing with other RNAs, the mechanisms through which lncRNAs function are highly diverse. The relatively low expression and tissue specificity of lncRNAs have significantly hindered lncRNA discovery, our understanding of lncRNA regulation, and characterizations of lncRNA functional mechanisms to date(44)(45)(46)(47). That said, initiatives such as ENCODE(57), FANTOM(58), and GENCODE(40) have now predicted over 60,000 human lncRNAs and identified associations between many of these and specific diseases. Thus far, however, only a handful of these lncRNAs have been examined in the literature, with even fewer being assigned any specific mechanistic function. Expression data often constitute the first level of information of use in studying lncRNAs, as differential expression analysis is clearly of value in prioritizing candidates for further examination. Differential expression, however, provides little in the way of functional insights. That said, the majority of computational platforms currently available are primarily aimed at either detecting and quantifying lncRNAs (e.g., lncRNA-screen(59), RNA-CODE(60), lncRScan(61), etc.)
or predicting lncRNA:mRNA and/or lncRNA:protein interactions (e.g., PLAIDOH(62), LncRNA2Function(63), circlncRNAnet(64), etc.) (Table 2). In contrast, LAGOOn was designed to comprehensively evaluate lncRNA expression as well as the potential for lncRNAs to function through other characterized mechanisms including serving as sncRNA hosts, miRNA sponges, antisense RNAs, microprotein transcripts, and/or regulators of genomic enhancers (as well as providing links to predicted lncRNA:mRNA and/or lncRNA:protein interactions). In short, LAGOOn wholly distinguishes itself from available tools by filling a major gap in available lncRNA functional prediction platforms and eliminating the need of the user to switch platforms during the analysis process. Table 2. lncRNA analysis platform feature comparison. Various features offered by LAGOOn were compared to other existing tools including lncRNA-screen(59), RNA-CODE(60), lncRScan(61), iSeeRNA(65), Annocript(66), UClncR(67), LncRNA2Function(63), and circlncRNAnet(64). Features examined were: “Online”, if tool is available online; “Input”, form of input RNA-Seq dataset - either raw (direct NGS output) or pre-processed (e.g., requires BAM file); “TCGA, SRA, or GEO”, if publically available RNA-Seq datasets can be specified for examination based on identifier alone; “Known lncRNA”, detection and quantification of known lncRNAs; “Novel lncRNA”, detection and quantification of novel lncRNAs; “Differential Expression”, ability of the tool to integrate expression data from multiple files; “ChIP-Seq / Ribo-Seq”, if identified lncRNA occurrences in ChIP-Seq and/or Ribo-Seq datasets can be determined; “Functional Prediction”, if potential functional roles of identified lncRNAs are assessed; and “Interactive Results”, if interactive and user-friendly results are generated directly. 
| Tool | Online | Input | TCGA, SRA, or GEO Input | Known lncRNA | Novel lncRNA | Differential Expression | ChIP-seq / Ribo-seq | Functional Prediction | Interactive Results |
|---|---|---|---|---|---|---|---|---|---|
| LAGOOn | yes | raw | yes | yes | no | yes | yes | yes | yes |
| lncRNA-screen | no | raw | yes | yes | yes | yes | yes | no | yes |
| RNA-CODE | no | raw | no | yes | no | yes | no | no | no |
| lncRScan | no | raw | no | no | yes | yes | no | no | no |
| iSeeRNA | yes | raw | no | no | yes | yes | no | no | yes |
| Annocript | no | raw | no | no | yes | no | no | no | no |
| UClncR | no | pre-processed | no | no | no | no | no | no | no |
| LncRNA2Function | yes | pre-processed | no | yes | no | limited | no | yes | yes |
| circlncRNAnet | yes | pre-processed | no | yes | no | yes | no | yes | yes |

DISCUSSION

Despite a mounting body of evidence supporting the physiological relevance of ncRNAs, most studies performed to date have focused primarily on proteins themselves or on deciphering the pathways associated with annotated ncRNAs. Moreover, due to the perceived insurmountability of the sheer amount of data generated by NGS/TGS analyses, the full extent of regulatory networks created by ncRNAs often gets overlooked(68). In addition, whereas the cost of RNA-seq is now reasonable for most active research programs, tools necessary for the interpretation of these sequencing datasets typically require significant computational expertise and resources, markedly hindering widespread utilization of these tools. As such, the need for real-time, user-friendly platforms capable of making the identification and characterization of the ncRNAome accessible to biologists lacking significant computational expertise becomes clear.
In light of this, we have developed SALTS, a highly accurate, highly efficient, and extremely user-friendly one-stop shop for ncRNA transcriptomics. Notably, SALTS is accessed through an intuitive Web-based interface, can analyze either user-generated, standard NGS file uploads (e.g., FASTQ) or existing NCBI SRA datasets, and requires absolutely no dataset pre-processing or knowledge of library adapters/oligonucleotides. In short, SALTS constitutes the first publicly available, Web-based, comprehensive ncRNA transcriptomic NGS analysis platform designed specifically for users with no computational background, providing a much needed, powerful new resource enabling more widespread ncRNA transcriptomic analysis. That said, an array of platforms and pipelines, each geared towards a specific type of transcript/ncRNA class, have previously been developed. Regardless of the platform, the core of ncRNA transcriptome expression analysis consists of two main steps: transcript detection and expression quantification(1)(3). The first step in this process involves aligning, or mapping, the NGS reads to a reference sequence(s), which can be either an ncRNA sequence library or an entire reference genome. Most standard pipelines use alignment programs such as Bowtie2(69), BWA(70), NCBI’s BLAST(22) or other implementations of existing alignment algorithms like Smith-Waterman (SW)(71), Needleman-Wunsch (NW)(72), and Burrows Wheeler Transform (BWT)(73). These aligners often differ in how alignment mismatches and gaps are scored, and these differences need to be taken into account when dealing with data containing high sequence variability between the individual transcripts originating from the same genomic locus or between the reads and the reference. In the second step, aligned reads are further analyzed to determine the expression, or the number of reads assigned to individual loci or library entries.
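The quantification step described above can be sketched in a few lines of Python. This is a generic illustration rather than the SALTS implementation: the function name is hypothetical, and normalizing by the total number of sequenced reads (rather than by mapped reads only) is an assumption, since the text does not specify the RPM denominator.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def reads_per_million(assignments: Iterable[Optional[str]]) -> Dict[str, float]:
    """Convert per-read reference assignments into reads-per-million (RPM) values.

    `assignments` holds one entry per sequenced read: the ID of the locus or
    library entry the read aligned to, or None if the read did not align.
    """
    assignments = list(assignments)
    total_reads = len(assignments)  # assumed denominator: all reads
    if total_reads == 0:
        return {}
    counts = Counter(ref for ref in assignments if ref is not None)
    return {ref: n * 1_000_000 / total_reads for ref, n in counts.items()}


# Toy example: ten reads, three assigned to "refA", one to "refB", six unaligned.
toy_assignments = ["refA"] * 3 + ["refB"] + [None] * 6
print(reads_per_million(toy_assignments))
```

Normalizing to a per-million scale makes expression values comparable between datasets of different sequencing depth, which is what allows thresholds such as > 10 RPM to be applied uniformly.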
This step often includes or is followed by various statistical analyses to determine differential expression and/or variance between replicates (e.g., baySeq(74) or DESeq2(75)). That said, the strikingly high accuracy and efficiency achieved by our tools as compared to existing platforms is primarily due to a novel computational approach to RNA-Seq alignment and an innovative analysis based on Hilbert and vector spaces developed in the course of this work. Brief overviews of the primary constructs critical to toolkit implementation are described below, with more in-depth descriptions detailed in Supplemental Information Files 2 and 3.

SALTS toolkit implementation. Of note, both SURFR and LAGOOn were developed into real-time processing systems using the following technology stack:
Programming languages used: Python 3.7, Visual C++ 2015, Erlang, JavaScript, PHP, and SPARQL.
Database engines: MongoDB 4.4.
Servers: Apache Web Server, 30+ background servers composed using a Master-Worker model to parallelize the workload, and Apache Jena Fuseki.
Other tools and supporting technologies: RabbitMQ, Flask, Redis, Vue JS, Dropzone JS, Apexcharts JS, Bootstrap 4, IBM Aspera, Axios JS, Moment JS, Tabulator, Matplotlib, NumPy, SciPy, and HTML5.
Architecture: Microservices.
Hardware Specs: Intel® Xeon® CPU-E5-2609 v4 @ 1.70GHz, 64GB RAM, 4TB Hard disk, Windows Server 2016.

SURFR implementation.
With SURFR, users with no computational background can quickly and easily analyze, visualize, and compare small RNA-Seq datasets to generate clear, informative results. With an interactive, user-friendly interface, SURFR is the first Web-based resource that allows users to upload unmodified NGS datasets and/or provide SRA identifiers to perform comprehensive novel ncRNA and ncRNA fragment identification and expression analyses in real time. This is achieved by employing the following three key components:

(1) Hilbert Space (HS). In mathematics, an HS is a complete inner-product vector space (of possibly infinite dimension); in quantum mechanics, the state of a physical system is routinely represented as a vector in an HS. HSs are highly useful in describing the relationships among vector spaces, wavelets, and wave functions(76)(77). For our analyses, “gene expression” is treated as a higher-dimensional function representing the activity of an RNA across its length: within an RNA, expression is represented using four vectors (for A, C, T, and G) and interpreted using HSs.

(2) MoVaK alignment. Building on the aforementioned HSs, we introduced two new data structures, namely Similarity Vectors (SVs) and Differential Expression Vectors (DEVs). MoVaK alignment combines SVs and DEVs to profile the exact transcriptomic activity of a given RNA-Seq dataset and then retrieves an HS for each RNA that is expressed in a sample.

(3) SURFR algorithm. By defining changes in gene expression using the above HS interpretation, we assign a wavelet function with scales of 18 to 38 to each instance of miRNA-like behavior within an sncRNA, i.e., to miRNA-like RNAs ranging from 18 to 38 nt in length.

Importantly, our novel methodology carries several advantages over existing computational methods:
1. Compared to current, purely string-comparison methods, DEVs take significantly less time to obtain.
2. It provides better visualization of ncRNA processing.
3. SURFR data structures consume very little memory, thus allowing real-time calculations.
4. Calculus-based modeling can be applied directly to DEVs to understand ncRNA behavior, thus providing a mathematical means to study transcriptomic functionality.
5. Our methodology is highly effective and accurate. Specifically, our wavelet-based analysis on HSs typically identifies the start and end positions of ncRNA-derived RNAs with ≥95% identity (within 2 nt) to experimentally validated databases such as miRBase, whereas state-of-the-art BAM-based methods such as FlaiMapper have been reported to correctly predict 89% of miRNA start positions and 54% of miRNA end positions(78).
6. We have extended our computational methodology to 400+ organisms and all of their sncRNAs without needing to change any algorithmic criteria.
7. Our method can address the dynamism associated with transcriptomic analysis using topological interpretation.

LAGOOn implementation. As with SURFR, users of LAGOOn with no computational background can quickly and easily analyze and compare raw RNA-Seq datasets to comprehensively evaluate lncRNA expression as well as the potential for lncRNAs to function as sncRNA hosts, miRNA sponges, antisense RNAs, microprotein transcripts, and/or regulators of genomic enhancers. In short, LAGOOn distinguishes itself from existing platforms by offering parallel, real-time expression analysis and functional prediction. Of note, LAGOOn is based on an extended version of MoVaK alignment that similarly employs SVs to perform sequence alignments. In LAGOOn, however, the algorithm was modified during extension to trade off time and space complexity within the alignment. A detailed explanation of these modifications is provided in Supplemental Information File 3.
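As an informal illustration of the SURFR constructs summarized above (per-nucleotide expression vectors and fragment detection at scales of 18 to 38 nt), the following Python sketch builds four count vectors (A, C, G, T) for a reference RNA from pre-placed reads and then scans box-shaped windows of 18 to 38 nt for the one whose mean coverage contrasts most strongly with its flanks. This is a deliberately simplified stand-in, not the MoVaK alignment or the wavelet analysis described in Supplemental Information Files 2 and 3: the function names (expression_vectors, total_coverage, best_fragment), the flank parameter, the box-shaped contrast score, and the toy reference and reads are all assumptions made for illustration.

```python
"""Toy sketch of per-base expression vectors and 18-38 nt fragment detection.

Illustrative only: this is NOT the SURFR/MoVaK implementation.
"""
from typing import Dict, List, Tuple

BASES = "ACGT"


def expression_vectors(ref_len: int,
                       reads: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Return one per-position count vector per base.

    `reads` holds (0-based offset, sequence) pairs that are assumed to be
    already placed on the reference; read placement itself is out of scope.
    """
    vectors = {base: [0] * ref_len for base in BASES}
    for offset, seq in reads:
        for i, base in enumerate(seq):
            pos = offset + i
            if 0 <= pos < ref_len and base in vectors:
                vectors[base][pos] += 1
    return vectors


def total_coverage(vectors: Dict[str, List[int]]) -> List[int]:
    """Sum the four base vectors into a single coverage profile."""
    return [sum(column) for column in zip(*(vectors[b] for b in BASES))]


def best_fragment(coverage: List[int], min_len: int = 18, max_len: int = 38,
                  flank: int = 5) -> Tuple[int, int, float]:
    """Return (start, end, score) of the best window, end exclusive.

    For every window width ("scale") from min_len to max_len, the score is
    the mean coverage inside the window minus the mean coverage in up to
    `flank` positions on each side (a box-shaped, Haar-like contrast).
    Returns (0, 0, -inf) if the profile is shorter than min_len.
    """
    n = len(coverage)
    prefix = [0]
    for value in coverage:
        prefix.append(prefix[-1] + value)  # prefix sums give O(1) window sums

    def window_sum(lo: int, hi: int) -> int:
        lo, hi = max(lo, 0), min(hi, n)    # clip to the reference bounds
        return prefix[hi] - prefix[lo] if hi > lo else 0

    best = (0, 0, float("-inf"))
    for width in range(min_len, max_len + 1):
        for start in range(0, n - width + 1):
            end = start + width
            inside = window_sum(start, end) / width
            flank_len = (start - max(start - flank, 0)) + (min(end + flank, n) - end)
            flank_sum = window_sum(start - flank, start) + window_sum(end, end + flank)
            outside = flank_sum / flank_len if flank_len else 0.0
            score = inside - outside
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical toy data: a 60 nt reference, 3 full-length reads as background,
# and 50 reads of a 22 nt fragment occupying positions 20..41.
reference = "ACGT" * 15
reads = [(0, reference)] * 3 + [(20, reference[20:42])] * 50

vectors = expression_vectors(len(reference), reads)
coverage = total_coverage(vectors)
start, end, score = best_fragment(coverage)
print(start, end, score)  # prints: 20 42 50.0
```

On this toy input the full-length reads contribute a uniform background of 3, and the 22 nt window spanning positions 20 to 42 (end exclusive) is the only one whose interior lies entirely within the fragment while both flanks lie entirely outside it, so it receives the highest score. Prefix sums keep each window evaluation constant-time, which is one simple way such a scan can remain cheap across all 21 scales.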
REFERENCES
1. Veneziano,D., Nigita,G. and Ferro,A. (2015) Computational approaches for the analysis of ncRNA through deep sequencing techniques. Front. Bioeng. Biotechnol., 3.
2. Uchida,S. and Bolli,R. (2018) Short and Long Noncoding RNAs Regulate the Epigenetic Status of Cells. Antioxidants Redox Signal., 29, 832–845.
3. Wolfien,M., Brauer,D.L., Bagnacani,A. and Wolkenhauer,O. (2019) Workflow development for the functional characterization of ncRNAs. In Methods in Molecular Biology. Humana Press Inc., Vol. 1912, pp. 111–132.
4. Ulitsky,I. (2018) Interactions between short and long noncoding RNAs. FEBS Lett., 592, 2874–2883.
5. Nakahara,K. and Carthew,R.W. (2004) Expanding roles for miRNAs and siRNAs in cell regulation. Curr. Opin. Cell Biol., 16, 127–133.
6. Cheng,A.M., Byrom,M.W., Shelton,J. and Ford,L.P. (2005) Antisense inhibition of human miRNAs and indications for an involvement of miRNA in cell growth and apoptosis. Nucleic Acids Res., 33, 1290–1297.
7. Hwang,H.W. and Mendell,J.T. (2006) MicroRNAs in cell proliferation, cell death, and tumorigenesis. Br. J. Cancer, 94, 776–780.
8. Singh,S., Chitkara,D., Mehrazin,R., Behrman,S.W., Wake,R.W. and Mahato,R.I. (2012) Chemoresistance in prostate cancer cells is regulated by miRNAs and Hedgehog pathway. PLoS One, 7.
9. Visone,R. and Croce,C.M. (2009) MiRNAs and cancer. Am. J. Pathol., 174, 1131–1138.
10. Rother,S. and Meister,G. (2011) Small RNAs derived from longer non-coding RNAs. Biochimie, 93, 1905–1915.
11. Martens-Uzunova,E.S., Olvedy,M. and Jenster,G. (2013) Beyond microRNA--novel RNAs derived from small non-coding RNA and their implication in cancer. Cancer Lett., 340, 201–211.
12. Patterson,D.G., Roberts,J.T., King,V.M., Houserova,D., Barnhill,E.C., Crucello,A., Polska,C.J., Brantley,L.W., Kaufman,G.C., Nguyen,M., et al. (2017) Human snoRNA-93 is processed into a microRNA-like RNA that promotes breast cancer cell invasion. NPJ Breast Cancer, 3, 25.
13. Olvedy,M., Scaravilli,M., Hoogstrate,Y., Visakorpi,T., Jenster,G. and Martens-Uzunova,E.S. (2016) A comprehensive repertoire of tRNA-derived fragments in prostate cancer. Oncotarget, 7, 24766–24777.
14. Ender,C., Krek,A., Friedländer,M.R., Beitzinger,M., Weinmann,L., Chen,W., Pfeffer,S., Rajewsky,N. and Meister,G. (2008) A Human snoRNA with MicroRNA-Like Functions. Mol. Cell, 32, 519–528.
15. Martens-Uzunova,E.S., Olvedy,M. and Jenster,G. (2013) Beyond microRNA--novel RNAs derived from small non-coding RNA and their implication in cancer. Cancer Lett., 340, 201–211.
16. Hirose,Y., Ikeda,K.T., Noro,E., Hiraoka,K., Tomita,M. and Kanai,A. (2015) Precise mapping and dynamics of tRNA-derived fragments (tRFs) in the development of Triops cancriformis (tadpole shrimp). BMC Genet., 16.
17. Durdevic,Z. and Schaefer,M. (2013) TRNA modifications: Necessary for correct tRNA-derived fragments during the recovery from stress? BioEssays, 35, 323–327.
18. Wu,W., Choi,E.J., Lee,I., Lee,Y.S. and Bao,X. (2020) Non-coding RNAs and their role in respiratory syncytial virus (RSV) and human metapneumovirus (hMPV) infections. Viruses, 12.
19. Zhou,K., Diebel,K.W., Holy,J., Skildum,A., Odean,E., Hicks,D.A., Schotl,B., Abrahante,J.E., Spillman,M.A. and Bemis,L.T. (2017) A tRNA fragment, tRF5-Glu, regulates BCAR3 expression and proliferation in ovarian cancer cells. Oncotarget, 8, 95377–95391.
20. Yates,A., Akanni,W., Amode,M.R., Barrell,D., Billis,K., Carvalho-Silva,D., Cummins,C., Clapham,P., Fitzgerald,S., Gil,L., et al. (2016) Ensembl 2016. Nucleic Acids Res., 44, D710-6.
21. Huang,J., Gutierrez,F., Strachan,H.J., Dou,D., Huang,W., Smith,B., Blake,J.A., Eilbeck,K., Natale,D.A., Lin,Y., et al. (2016) OmniSearch: a semantic search system based on the Ontology for MIcroRNA Target (OMIT) for microRNA-target gene interaction data. J. Biomed. Semantics, 7, 25.
22. Camacho,C., Coulouris,G., Avagyan,V., Ma,N., Papadopoulos,J., Bealer,K. and Madden,T.L. (2009) BLAST+: architecture and applications. BMC Bioinformatics, 10, 421.
23. Leinonen,R., Sugawara,H. and Shumway,M. (2011) The sequence read archive. Nucleic Acids Res., 39.
24. Kalvari,I., Argasinska,J., Quinones-Olvera,N., Nawrocki,E.P., Rivas,E., Eddy,S.R., Bateman,A., Finn,R.D. and Petrov,A.I. (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res., 46, D335–D342.
25. Desgranges,E., Caldelari,I., Marzi,S. and Lalaouna,D. (2020) Navigation through the twists and turns of RNA sequencing technologies: Application to bacterial regulatory RNAs. Biochim. Biophys. Acta - Gene Regul. Mech., 1863.
26. Friedländer,M.R., Chen,W., Adamidi,C., Maaskola,J., Einspanier,R., Knespel,S. and Rajewsky,N. (2008) Discovering microRNAs from deep sequencing data using miRDeep. Nat. Biotechnol., 26, 407–415.
27. Humphreys,D.T. and Suter,C.M. (2013) MiRspring: A compact standalone research tool for analyzing miRNA-seq data. Nucleic Acids Res., 41.
28. Hackenberg,M., Rodríguez-Ezpeleta,N. and Aransay,A.M. (2011) MiRanalyzer: An update on the detection and analysis of microRNAs in high-throughput sequencing experiments. Nucleic Acids Res., 39.
29. Wu,X., Kim,T.K., Baxter,D., Scherler,K., Gordon,A., Fong,O., Etheridge,A., Galas,D.J. and Wang,K. (2017) SRNAnalyzer-A flexible and customizable small RNA sequencing data analysis pipeline. Nucleic Acids Res., 45, 12140–12151.
30. Rahman,R.U., Gautam,A., Bethune,J., Sattar,A., Fiosins,M., Magruder,D.S., Capece,V., Shomroni,O. and Bonn,S. (2018) Oasis 2: Improved online analysis of small RNA-seq data. BMC Bioinformatics, 19.
31. Kuksa,P.P., Amlie-Wolf,A., Katanić,Ž., Valladares,O., Wang,L.S. and Leung,Y.Y. (2018) SPAR: Small RNA-seq portal for analysis of sequencing experiments. Nucleic Acids Res., 46, W36–W42.
32. Hoogstrate,Y., Jenster,G. and Martens-Uzunova,E.S. (2015) FlaiMapper: computational annotation of small ncRNA-derived fragments using RNA-seq high-throughput data. Bioinformatics, 31, 665–673.
33. Shi,J., Ko,E.A., Sanders,K.M., Chen,Q. and Zhou,T. (2018) SPORTS1.0: A Tool for Annotating and Profiling Non-coding RNAs Optimized for rRNA- and tRNA-derived Small RNAs. Genomics, Proteomics Bioinforma., 16, 144–151.
34. Jeske,T., Huypens,P., Stirm,L., Höckele,S., Wurmser,C.M., Böhm,A., Weigert,C., Staiger,H., Klein,C., Beckers,J., et al. (2019) DEUS: An R package for accurate small RNA profiling based on differential expression of unique sequences. Bioinformatics, 35, 4834–4836.
35. Aparicio-Puerta,E., Lebrón,R., Rueda,A., Gómez-Martín,C., Giannoukakos,S., Jaspez,D., Medina,J.M., Zubkovic,A., Jurak,I., Fromm,B., et al. (2019) SRNAbench and sRNAtoolbox 2019: intuitive fast small RNA profiling and differential expression. Nucleic Acids Res., 47, W530–W535.
36. Liu,Q., Ding,C., Lang,X., Guo,G., Chen,J. and Su,X. (2019) Small noncoding RNA discovery and profiling with sRNAtools based on high-throughput sequencing. Brief. Bioinform., 10.1093/bib/bbz151.
37. Wan,C., Gao,J., Zhang,H., Jiang,X., Zang,Q., Ban,R., Zhang,Y. and Shi,Q. (2017) CPSS 2.0: A computational platform update for the analysis of small RNA sequencing data. Bioinformatics, 33, 3289–3291.
38. Liao,Y., Smyth,G.K. and Shi,W. (2014) FeatureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics, 30, 923–930.
39. Martens-Uzunova,E.S., Hoogstrate,Y., Kalsbeek,A., Pigmans,B., Vredenbregt-van den Berg,M., Dits,N., Nielsen,S.J., Baker,A., Visakorpi,T., Bangma,C., et al. (2015) C/D-box snoRNA-derived RNA production is associated with malignant transformation and metastatic progression in prostate cancer. Oncotarget, 6, 17430–44.
40. Derrien,T., Johnson,R., Bussotti,G., Tanzer,A., Djebali,S., Tilgner,H., Guernec,G., Martin,D., Merkel,A., Knowles,D.G., et al. (2012) The GENCODE v7 catalog of human long noncoding RNAs: Analysis of their gene structure, evolution, and expression. Genome Res., 22, 1775–1789.
41. Ulitsky,I. and Bartel,D.P. (2013) lincRNAs: Genomics, evolution, and mechanisms. Cell, 154, 26.
42. Lam,M.T.Y., Li,W., Rosenfeld,M.G. and Glass,C.K. (2014) Enhancer RNAs and regulated transcriptional programs. Trends Biochem. Sci., 39, 170–182.
43. Rinn,J.L. and Chang,H.Y. (2012) Genome regulation by long noncoding RNAs. Annu. Rev. Biochem., 81, 145–166.
44. Uszczynska-Ratajczak,B., Lagarde,J., Frankish,A., Guigó,R. and Johnson,R. (2018) Towards a complete map of the human long non-coding RNA transcriptome. Nat. Rev. Genet., 19, 535–548.
45. Mercer,T.R., Dinger,M.E. and Mattick,J.S. (2009) Long non-coding RNAs: Insights into functions. Nat. Rev. Genet., 10, 155–159.
46. Li,X., Wu,Z., Fu,X. and Han,W. (2014) LncRNAs: Insights into their function and mechanics in underlying disorders. Mutat. Res. - Rev. Mutat. Res., 762, 1–21.
47. Moran,V.A., Perera,R.J. and Khalil,A.M. (2012) Emerging functional and mechanistic paradigms of mammalian long non-coding RNAs. Nucleic Acids Res., 40, 6391–6400.
48. Wang,J., Liu,X., Wu,H., Ni,P., Gu,Z., Qiao,Y., Chen,N., Sun,F. and Fan,Q. (2010) CREB up-regulates long non-coding RNA, HULC expression through interaction with microRNA-372 in liver cancer. Nucleic Acids Res., 38, 5366–5383.
49. Chen,H., Du,G., Song,X. and Li,L. (2017) Non-coding Transcripts from Enhancers: New Insights into Enhancer Activity and Gene Expression Regulation. Genomics, Proteomics Bioinforma., 15, 201–207.
50. Malecová,B. and Morris,K.V. (2010) Transcriptional gene silencing through epigenetic changes mediated by non-coding RNAs. Curr. Opin. Mol. Ther., 12, 214–222.
51. Stein,C.S., Jadiya,P., Zhang,X., McLendon,J.M., Abouassaly,G.M., Witmer,N.H., Anderson,E.J., Elrod,J.W. and Boudreau,R.L. (2018) Mitoregulin: A lncRNA-Encoded Microprotein that Supports Mitochondrial Supercomplexes and Respiratory Efficiency. Cell Rep., 23, 3710-3720.e8.
52. Fishilevich,S., Nudel,R., Rappaport,N., Hadar,R., Plaschkes,I., Iny Stein,T., Rosen,N., Kohn,A., Twik,M., Safran,M., et al. (2017) GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford)., 2017.
53. Arun,G. and Spector,D.L. (2019) MALAT1 long non-coding RNA and breast cancer. RNA Biol., 16, 860–863.
54. Jordan,N.V., Bardia,A., Wittner,B.S., Benes,C., Ligorio,M., Zheng,Y., Yu,M., Sundaresan,T.K., Licausi,J.A., Desai,R., et al. (2016) HER2 expression identifies dynamic functional states within circulating breast cancer cells. Nature, 537, 102–106.
55. Huang,X.J., Xia,Y., He,G.F., Zheng,L.L., Cai,Y.P., Yin,Y. and Wu,Q. (2018) MALAT1 promotes angiogenesis of breast cancer. Oncol. Rep., 40, 2683–2689.
56. Ruiz-Orera,J., Messeguer,X., Subirana,J.A. and Alba,M.M. (2014) Long non-coding RNAs as a source of new peptides. Elife, 3, 3523.
57. Davis,C.A., Hitz,B.C., Sloan,C.A., Chan,E.T., Davidson,J.M., Gabdank,I., Hilton,J.A., Jain,K., Baymuradov,U.K., Narayanan,A.K., et al. (2018) The Encyclopedia of DNA elements (ENCODE): Data portal update. Nucleic Acids Res., 46, D794–D801.
58. Lizio,M., Abugessaisa,I., Noguchi,S., Kondo,A., Hasegawa,A., Hon,C.C., De Hoon,M., Severin,J., Oki,S., Hayashizaki,Y., et al. (2019) Update of the FANTOM web resource: Expansion to provide additional transcriptome atlases. Nucleic Acids Res., 47, D752–D758.
59. Gong,Y., Huang,H.T., Liang,Y., Trimarchi,T., Aifantis,I. and Tsirigos,A. (2017) lncRNA-screen: An interactive platform for computationally screening long non-coding RNAs in large genomics datasets. BMC Genomics, 18.
60. Yuan,C. and Sun,Y. (2013) RNA-CODE: A Noncoding RNA Classification Tool for Short Reads in NGS Data Lacking Reference Genomes. PLoS One, 8.
61. Sun,L., Liu,H., Zhang,L. and Meng,J. (2015) lncRScan-SVM: A tool for predicting long non-coding RNAs using support vector machine. PLoS One, 10.
62. Pyfrom,S.C., Luo,H. and Payton,J.E. (2019) PLAIDOH: A novel method for functional prediction of long non-coding RNAs identifies cancer-specific LncRNA activities. BMC Genomics, 20.
63. Jiang,Q., Ma,R., Wang,J., Wu,X., Jin,S., Peng,J., Tan,R., Zhang,T., Li,Y. and Wang,Y. (2015) LncRNA2Function: A comprehensive resource for functional investigation of human lncRNAs based on RNA-seq data. BMC Genomics, 16.
64. Wu,S.M., Liu,H., Huang,P.J., Chang,I.Y.F., Lee,C.C., Yang,C.Y., Tsai,W.S. and Tan,B.C.M. (2018) circlncRNAnet: An integrated web-based resource for mapping functional networks of long or circular forms of noncoding RNAs. Gigascience, 7, 1–10.
65. Sun,K., Chen,X., Jiang,P., Song,X., Wang,H. and Sun,H. (2013) iSeeRNA: Identification of long intergenic non-coding RNA transcripts from transcriptome sequencing data. BMC Genomics, 14.
66. Musacchia,F., Basu,S., Petrosino,G., Salvemini,M. and Sanges,R. (2015) Annocript: A flexible pipeline for the annotation of transcriptomes able to identify putative long noncoding RNAs. Bioinformatics, 31, 2199–2201.
67. Sun,Z., Nair,A., Chen,X., Prodduturi,N., Wang,J. and Kocher,J.P. (2017) UClncR: Ultrafast and comprehensive long non-coding RNA detection from RNA-seq. Sci. Rep., 7.
68. Sun,Y.-M. and Chen,Y.-Q. (2020) Principles and innovative technologies for decrypting noncoding RNAs: from discovery and functional prediction to clinical application. J. Hematol. Oncol., 13, 109.
69. Langmead,B. and Salzberg,S.L. (2012) Fast gapped-read alignment with Bowtie 2. Nat. Methods, 9, 357–359.
70. Li,H. and Durbin,R. (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics, 26, 589–595.
71. Bucher,P. and Hofmann,K. (1996) A sequence similarity search algorithm based on a probabilistic interpretation of an alignment scoring system. Proc. Int. Conf. Intell. Syst. Mol. Biol., 4, 44–51.
72. Phillips,A.J. (2006) Homology assessment and molecular sequence alignment. J. Biomed. Inform., 39, 18–33.
73. Lippert,R.A. (2005) Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J. Comput. Biol., 12, 407–415.
74. Kvam,V.M., Liu,P. and Yaqing,S. (2012) A comparison of statistical methods for detecting differentially expressed genes from RNA-seq data. Am. J. Bot., 99, 248–256.
75. Costa-Silva,J., Domingues,D. and Lopes,F.M. (2017) RNA-Seq differential expression analysis: An extended review and a software tool. PLoS One, 12.
76. Steeb,W.-H. (1998) Hilbert Spaces, Wavelets, Generalised Functions and Modern Quantum Mechanics. Springer Science & Business Media.
77. Debnath,L. and Mikusinski,P. (2005) Introduction to Hilbert Spaces with Applications. Academic Press.
78. Hoogstrate,Y., Jenster,G. and Martens-Uzunova,E.S. (2015) FlaiMapper: computational annotation of small ncRNA-derived fragments using RNA-Seq high-throughput data. Bioinformatics, 31, 665–673.