key: cord-0717863-qiaqzdw3 authors: Ergin, Selvi; Kherad, Nasim; Alagoz, Meryem title: RNA sequencing and its applications in cancer and rare diseases date: 2022-01-06 journal: Mol Biol Rep DOI: 10.1007/s11033-021-06963-0 sha: aaff1d50ae9acdb8997eb1789ba99fcd814a33b1 doc_id: 717863 cord_uid: qiaqzdw3
With the invention of RNA sequencing over a decade ago, the diagnosis and identification of gene-related diseases entered a new phase that enabled more accurate analysis of diseases that are difficult to approach. RNA sequencing has enabled in-depth study of transcriptomes in different species and provided a better understanding of rare diseases and of the taxonomic classification of various eukaryotic organisms. The development of single-cell, short-read, long-read and direct RNA sequencing using both blood and biopsy specimens, together with recent advances in computational analysis programs, has made RNA sequencing indispensable to medical professionals in identifying the origin and cause of genetic disorders. Altogether, these advantages have advanced treatment design, since RNA sequencing can detect genes that confer resistance to existing therapies and help medical professionals improve treatments towards higher effectiveness and fewer side effects. Therefore, it is essential for researchers and scientists to have deeper insight into all available RNA sequencing methods when designing therapies.
Molecular biology is founded on the principle that genes located in DNA are transcribed to RNA for protein synthesis; the discovery of the double-helix structure of DNA in 1953 showed the essence of life as a result of gene interaction [1, 2]. This machinery defines the organism's characteristics and maintains the biological functions of the cells and of the organism as a whole. Therefore, RNA analysis is essential for understanding genomic processes and the origin of diseases.
The RNAs of a cell, collectively known as the transcriptome, comprise coding and non-coding species and act as intermediaries between genes and proteins. A thorough study of the transcriptome is therefore necessary for understanding genomic function, identifying the molecular composition of cells, and understanding the cause and development of diseases [3]. Among RNA species, messenger RNA (mRNA) is the most valuable for further study, as it carries the genetic information from the organism's DNA [4]. However, analysis of protein-coding RNA requires a precise technique that can distinguish protein-coding RNA from non-coding RNAs (ncRNAs). The complexity of the genome arises from several factors. Coding genes comprise almost 2% of the whole human genome, and a majority of the coding genes undergo transcription [5]. Additionally, a single genomic locus is likely to give rise to different isoforms through different splicing patterns, possibly with various transcriptional start sites [6]. Moreover, unpredictable monoallelic (maternal or paternal allele) expression of genes adds an extra layer of complexity to transcriptomic analysis [7]. In vivo and in vitro analysis of homogeneous cell populations has shown heterogeneity among the cells due to intrinsic and extrinsic factors such as the microenvironment [8]. However, research shows that even cells in the same microenvironment can manifest different transcript levels due to factors such as the cell cycle [9].
Among the ncRNAs, ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs) act as functional elements in mRNA translation, small nuclear RNAs (snRNAs) in RNA splicing, small nucleolar RNAs (snoRNAs) in rRNA modification [10], microRNAs (miRNAs) and piwi-interacting RNAs (piRNAs) in post-transcriptional regulation of gene expression [11], and long non-coding RNAs (lncRNAs) in chromatin remodelling and in transcriptional and post-transcriptional regulation [12]. Designing analysis techniques that can accurately and efficiently profile the whole genome and distinguish coding from non-coding transcripts was a target for scientists for decades. Over the past few decades, researchers developed various methods for in-depth analysis of RNAs and a more accurate understanding of gene expression. Low-throughput methods such as quantitative polymerase chain reaction (qPCR) were introduced as powerful techniques for this purpose; however, they could not be applied to measuring multiple transcripts. Hybridization-based microarrays, introduced in 1995, provided a better solution for the study of gene expression [13, 14], but their limitations, such as cross-hybridization between highly similar sequences and inaccurate quantification of lowly and highly expressed genes [15, 16], led scientists to develop sequence-based transcriptomics technologies using complementary DNA (cDNA). The aim of studying the transcriptome is to catalogue all transcripts (coding and non-coding RNAs), determine splicing patterns and post-transcriptional changes, and quantify how the expression level of each transcript changes in response to different intrinsic and extrinsic factors [3].
Although techniques such as Sanger sequencing of cDNA using expressed sequence tags (ESTs) [17], serial analysis of gene expression (SAGE) and cap analysis of gene expression (CAGE) [18] have improved RNA analysis, their insensitivity in discovering novel genes and the high cost of Sanger sequencing make them inefficient [19]. Next-generation sequencing (NGS), a high-throughput approach, can perform sequencing faster, at lower cost and with higher accuracy. Additionally, it is useful for identifying undefined expressed gene sequences in a time-efficient manner [20]. Further development of long-read RNA sequencing, known as third-generation sequencing, makes it possible to generate full-length cDNA transcripts with a minimum number of false-positive splice sites and to capture a great diversity of transcript isoforms [21]. The introduction of RNA-seq, from bulk to single-cell RNA sequencing, has made it possible to process and map the transcriptome. Since its development more than a decade ago [22, 23], RNA-seq has revolutionized the interpretation of eukaryotic transcriptomes [24, 25] through analysis of differential gene expression (DGE) using NGS with a standard workflow: RNA extraction, followed by mRNA enrichment or ribosomal RNA depletion, cDNA synthesis, and preparation of an adaptor-ligated sequencing library. An advantage of the technique is its depth: each sample is typically sequenced to 10-30 million reads, usually on Illumina short-read instruments [26].
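As a rough illustration of what comparing gene expression between samples involves, the sketch below scales raw read counts to counts per million (CPM), so that samples sequenced to different depths are comparable, and then computes a log2 fold change per gene. The gene names, counts and pseudocount are hypothetical, and the function names are ours; this is not a statistical DGE test, which would also require replicates and a dedicated analysis package.

```python
import math


def counts_per_million(counts):
    """Scale the raw read counts of one sample to counts per million (CPM)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no mapped reads")
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """Return log2(B / A) per gene; the pseudocount avoids division by zero."""
    genes = set(cpm_a) | set(cpm_b)
    return {
        g: math.log2((cpm_b.get(g, 0.0) + pseudocount)
                     / (cpm_a.get(g, 0.0) + pseudocount))
        for g in genes
    }


# Hypothetical toy counts; real samples hold millions of reads over many genes.
control = {"GENE_A": 500, "GENE_B": 300, "GENE_C": 200}
treated = {"GENE_A": 500, "GENE_B": 1200, "GENE_C": 300}

cpm_control = counts_per_million(control)
cpm_treated = counts_per_million(treated)
lfc = log2_fold_change(cpm_control, cpm_treated)

for gene in sorted(lfc):
    print(f"{gene}: log2FC = {lfc[gene]:.2f}")
```

In this toy example GENE_A has the same raw count in both samples, yet its share of the library halves in the treated sample, which is why depth normalization precedes any comparison.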
In addition, the introduction of long-read RNA-seq, also known as third-generation sequencing, and direct RNA-seq (dRNA-seq) has made transcriptomics more thorough [27, 28] without requiring prior information on the RNA sequence. The method offers advantages over the previously discussed methods by providing a detailed understanding of the transcriptome through quantitative measurement of gene expression, splicing, and maternal or paternal allele expression; altogether, it helps to interpret the cause of diseases efficiently and at lower cost. After the laboratory workflow, the data are processed and analysed using various computational tools. Data processing can be performed for organisms both with and without a reference genome. For organisms with a reference genome, short RNA sequencing reads are mapped to that genome; for organisms without one, de novo transcriptome assembly is applied [35, 36]. This review presents past and current research on RNA-seq and its types, focusing on the advantages and disadvantages of the technique. Furthermore, it presents the use of the method in cancer as well as rare diseases. Additionally, it introduces the future possibilities of RNA-seq and its application for understanding disease origin and development in more detail. Finally, the paper gives a brief description of RNA-seq applications for different types of cancer, rare diseases, and COVID-19 (Coronavirus Disease 2019), which have challenged medical professionals to find the most effective diagnosis and treatment with the fewest side effects. We hope to shed light on using the technique to build more useful and accurate protocols that minimize error and improve therapies and, eventually, disease prognosis.
RNA sequencing techniques can be categorized, based on the library preparation method and the approach applied, into short-read sequencing, long-read cDNA sequencing, and long-read direct RNA sequencing. Although short-read and long-read cDNA techniques follow many of the same steps, they differ in the quantity of sample required at the beginning of library preparation and in the computational analysis at the end. While short-read cDNA sequencing accounts for almost all of the sequenced mRNA data in the Short Read Archive (SRA) [37], long-read cDNA sequencing has helped scientists to develop transcript data covering diverse isoforms [38]. The following presents current published knowledge on short-read cDNA sequencing, long-read cDNA sequencing, and direct RNA sequencing. Short-read cDNA sequencing has replaced microarrays for gene expression analysis, with lower cost, more straightforward application and higher-quality data across the transcriptome [39]. The most commonly used platforms in this category rely on reversible-terminator sequencing-by-synthesis [40, 41]. The technique is carried out on platforms such as Ion Torrent and Illumina and analyses RNA indirectly, through cDNA. The method includes RNA extraction, mRNA enrichment, mRNA fragmentation, cDNA synthesis, cDNA fragmentation, cDNA amplification, sequencing, and data analysis [42]. The mRNA fragment size used for library purification and preparation is 150 to 200 bp, and the prepared cDNA is therefore mainly between 200 and 400 bp [43]. A short-read sequencing library is prepared with an average of 20-30 million reads for each sample. After sequencing is complete, the library is filtered computationally to identify the reads that align with the targeted individual transcripts. Results from this method have been reported to agree both within and across platforms [44, 45].
Nevertheless, errors that can occur during sample preparation and computational analysis may cause false reports in the identification and quantification of the diverse isoforms expressed from a gene [46], especially for long transcripts such as those found in humans [47]. Therefore, it is understandable that short-read RNA sequencing is not fully efficient for a complete analysis of long transcripts [48]. In addition to the limitation on transcript length, reads that map to multiple locations cannot be assigned accurately. Long-read sequencing has lifted the size limitation by tagging full-length cDNA and by using unique molecular identifiers (UMIs) that are copied along with the cDNA prior to library preparation [49, 50]. As mentioned previously, short-read sequencing requires the assembly of short RNA fragment reads, which affects the accuracy of the mapping process, so that the whole sequence cannot be identified and analysed. Long-read sequencing, however, can identify large RNAs and process them at full length, making mapping possible for mammalian transcripts, which are typically 1-2 kb long and may surpass 100 kb [51] [52] [53]. The method is performed on a number of platforms developed in the past few years; the main ones are Single-Molecule Real-Time (SMRT) technology from Pacific Biosciences (PacBio sequencing) and protein nanopore sequencing technology from Oxford Nanopore Technologies (ONT). The standard protocol includes conversion of high-quality RNA to full-length cDNA by a template-switching reverse transcriptase, after which the cDNA is amplified by polymerase chain reaction (PCR) to prepare the SMRT library [54]. While the ONT platform follows the same protocol as PacBio [55], the reverse transcriptase was shown to affect library preparation and the length of the transcripts read on ONT [56].
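The role of the UMIs mentioned above can be sketched in a few lines: because each original molecule carries its own identifier before amplification, reads that share a gene and a UMI are treated as PCR copies of one molecule and counted once. The read tuples and gene names below are hypothetical, and this minimal sketch assumes error-free UMIs; practical pipelines also tolerate sequencing errors within the UMI itself.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count distinct UMIs per gene from an iterable of (gene, umi) pairs."""
    umis_by_gene = defaultdict(set)
    for gene, umi in reads:
        # Reads with the same gene and UMI are collapsed into one molecule.
        umis_by_gene[gene].add(umi)
    return {gene: len(umis) for gene, umis in umis_by_gene.items()}


# Hypothetical reads: GENE_A has two reads carrying the same UMI (PCR duplicates).
reads = [
    ("GENE_A", "AACGT"),
    ("GENE_A", "AACGT"),
    ("GENE_A", "TTGCA"),
    ("GENE_B", "AACGT"),
]

molecule_counts = count_molecules(reads)
print(molecule_counts)  # {'GENE_A': 2, 'GENE_B': 1}
```

Counting UMIs rather than reads keeps amplification bias out of the expression estimate, which is the motivation for attaching them before library preparation.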
Against these advantages, long-read cDNA sequencing requires a great amount of time to process large genomes [57], and further studies are therefore necessary to optimize the time. Unlike short-read and long-read cDNA sequencing, long-read direct RNA sequencing, also known as dRNA-seq (DRS), does not require cDNA generation and can therefore eliminate the errors that occur during cDNA amplification and avoid the RNA-RNA chimaeras produced during cDNA synthesis [58]. Although read length is not a limitation of the technique, fragmentation of the input RNA is still challenging [59, 60]. The technique is carried out on the nanopore sequencing technology developed by ONT [43, 61]. The process includes two ligation steps. The first is ligation of a duplex adaptor to the poly(A) tail of the RNA, followed by reverse transcription; the second is attachment of the sequencing adaptor, which carries the motor protein. Finally, the products go through library preparation [62]. The other advantage of DRS over the other two methods lies in its ability to identify RNA base modifications, and it can thus shed light on the epigenetics of the species [62, 63]. RNA sequencing has provided an effective approach for detecting different types of cancers and rare diseases and has thus shed light on developing more effective treatments. DRS has been applied to genomic studies of viral transcriptomes, and it uses cDNA to analyse and interpret viral RNA [64] [65] [66]. Previous studies applied the technique to investigate human poly(A) RNA and DNA-based viruses [67]. A recent study has shown full-length sequencing of the HCoV-229E virus, which belongs to the coronavirus family, whose genomes are among the largest known RNA genomes. In this study, the technique used defective interfering RNAs (DI-RNAs) for in vitro analysis of transcripts using full-length cDNA [68].
In another study, in patients who manifested resistance during therapy, RNA-seq identified potential therapeutic targets in human gemcitabine-resistant pancreatic cancer cells (PANC1) [69]. In addition to the RNA sequencing methods discussed above, in situ RNA sequencing was developed to perform RNA sequencing inside the cell, without cell lysis and RNA extraction [70]. A study on breast cancer applied the technique to analyse short RNA fragments of the ACTB gene and of HER2 (a growth-promoting protein abundant on the outside of breast cells) RNA in preserved cells and tissues, and helped to detect tissue heterogeneity at the molecular level [70]. Despite all new inventions and advances in medicine, cancer remains elusive and is considered one of the most life-threatening malignant diseases. With the development of RNA sequencing as one of the high-throughput methods of transcriptome analysis, interpreting diseases and their genetic causes at the molecular level has become conceivable. Single-cell RNA sequencing, known as scRNA-seq, has been used to analyse the heterogeneity of single malignant cells in order to explain the cause of cancers [69, 71, 72], such as pancreatic ductal adenocarcinoma [69]. RNA-seq can be used to assess tumour mutational burden (TMB), which is noteworthy as a possible immune checkpoint biomarker and helps in treatment and cancer prognosis [73]. By detecting mutations in the MET proto-oncogene and the isocitrate dehydrogenase 1 (IDH1) gene, RNA-seq has made it possible to design better therapies for lung adenocarcinoma and chondrosarcoma [74, 75]. The technique has therefore facilitated targeted therapy by detecting the causative gene or the mutation of target genes in various types of cancer, such as acute myeloid leukaemia (AML) [76, 77].
In head and neck cancer [78] and oligodendroglioma [79], single-cell RNA sequencing has helped to elucidate the difference between malignant and benign cells using data collected on copy number variations (CNV). Besides its applications in treatment design for cancers, RNA-seq can be used as a diagnostic tool in blood-based sarcoma diagnostics [80]. Although this review has covered a limited number of past and present studies on RNA-seq in cancer diagnosis and targeted therapy, it is abundantly clear that the tool is valuable in identifying the genetic and epigenetic causes of cancer, assisting in better therapy design by detecting resistance genes, and elucidating mutations in genes that serve as cancer biomarkers. The advantages of RNA-seq extend to a better understanding of rare diseases. Over 7000 rare Mendelian disorders have been identified so far. However, the genetic basis of more than half of all reported Mendelian diseases remains elusive, despite their being monogenic [81]. Furthermore, these diseases can show variable phenotypes even when the causal disease gene is identified, and even among patients such as siblings with the same genetic mutation [82, 83], which presents diagnostic and patient management challenges [84]. RNA-seq offers the ability to calculate allele-specific expression, which is likely to expose the existence of a heterozygous regulatory, splicing or nonsense variant or an epimutation, and so helps to identify candidate rare disease genes and variants [85] [86] [87] [88] [89] [90]. Table 1 introduces some of the rare diseases that have been investigated using RNA-seq.
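One simple way to picture the calculation of allele-specific expression is to count the reads that support each allele at a heterozygous site and ask whether the split departs from the 50:50 expected under balanced expression. The sketch below does this with an exact two-sided binomial test written from scratch; the read counts are hypothetical, the function names are ours, and real analyses additionally correct for reference-mapping bias and overdispersion, which this minimal version ignores.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Fraction of reads at a heterozygous site that carry the reference allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return ref_reads / total


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely than
    the observed count k out of n trials.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small tolerance so that ties are not lost to floating-point rounding.
    tail = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical heterozygous sites: (reference reads, alternate reads).
balanced_site = (50, 50)
skewed_site = (90, 10)

for ref, alt in (balanced_site, skewed_site):
    ratio = allelic_ratio(ref, alt)
    p_value = binomial_two_sided_p(ref, ref + alt)
    print(f"ref={ref} alt={alt} ratio={ratio:.2f} p={p_value:.3g}")
```

A strongly skewed ratio at a heterozygous site, as in the second toy example, is the kind of signal that can point to a regulatory, splicing or nonsense variant affecting only one allele.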
Advancement in RNA-seq has been one of the major revolutions in the study and interpretation of the transcriptome in the past few years. With ongoing innovation and development in bioinformatics, data analysis software and platform technologies, and with the cataloguing of full-length transcripts and library preparation for all organisms, from single-cell organisms such as yeast to mammals, many questions that have eluded scientists investigating various physiological and genetic abnormalities can be answered. Moreover, library preparation has kept the information accessible to those researching transcript-related questions. Furthermore, researchers can use this tool to compare tissues and cells in normal and abnormal conditions in order to track and reveal the causes of different diseases and to identify metabolic abnormalities or alterations that occur at the molecular and cellular levels. The current outbreak of COVID-19 and the emergence of variants within a short time have challenged researchers to find a better tool for interpreting the full-length RNA of SARS-CoV-2 in order to develop a more efficient treatment and a durable vaccine.
RNA sequencing, with its ability to read large transcripts, has provided insight into developing a platform for detailed analysis of SARS-CoV-2 RNA to reveal the cause of genetic variation and of resistance towards currently used treatments. Diagnostic tools are critical in cancer and rare diseases, and with ongoing improvement in RNA sequencing techniques and in existing diagnostic tools for some diseases, great advances are expected in developing standard diagnostic tools that make use of the disease biomarkers detected by RNA sequencing. Moreover, collecting transcriptome data from different organisms can improve the field of taxonomy by aligning the sequenced transcripts and measuring the level of similarity among organisms. It is therefore expected that unknown and undefined isoforms will be determined, eventually helping to uncover the function and full potential of unidentified genes, and that questions in molecular and cellular evolution and in the diversity of many pathogenic viruses will be answered. Although this review has covered a limited number of past and present studies on the applications and advantages of RNA-seq, we hope that readers will benefit from the collected information and that it sheds light on future applications of RNA sequencing for a better understanding of genetically diverse human diseases.
Data availability Not applicable.
Regulates the motor skills and is likely to be involved in peroxisome proliferator-activated receptor (PPAR)-dependent signalling [92]
Identification of variants in regulatory upstream regions of genes in monogenetic neuromuscular disorders | Congenital Muscular Dystrophy (CMD) [93, 94] | Heterozygous variant in GDP-Mannose Pyrophosphorylase B (GMPPB) gene [93] | Regulates protein, fructose, and mannose metabolism; impairment of the gene causes defective O-glycosylation of α-dystroglycan [93]
Diagnosis of Mendelian rare diseases by detecting splice-affecting variant | Collagen VI dystrophy [95, 96] | Intron inclusion in COL6A1 gene [95] | Encodes collagen VI that causes muscle weakness and deformities of joints [97]
Diagnosis of Mendelian rare diseases | Duchenne Muscular Dystrophy (DMD) [98] | Heterozygous variant in DMD gene [94, 95] | Encodes dystrophin protein that forms dystrophin-glycoprotein complex in extracellular matrix [99]

Central dogma of molecular biology
On protein synthesis
Reviving the transcriptome studies: an insight into the emergence of single-molecule transcriptome sequencing
Single-cell RNA-seq: advances and future challenges
Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs
Single-cell transcriptomics reveals bimodality in expression and splicing in immune cells
Single-cell RNA-seq reveals dynamic, random monoallelic gene expression in mammalian cells
Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels
Transcriptome wide noise controls lineage choice in mammalian progenitor cells
Non-coding RNA
Small non-coding RNAs in animal development
Long noncoding RNAs: functional surprises from the RNA world
Quantitative monitoring of gene expression patterns with a complementary DNA microarray
Hybridization interactions between probesets in short oligo microarrays lead to spurious correlations
In situ analysis of cross-hybridisation on microarrays and the inference of expression correlation
The beginning of the end for microarrays?
Identification of an active gene by using large-scale cDNA sequencing
Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage
A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the sequencing quality control consortium
Multi-platform assessment of transcriptome profiling using RNA-seq in the ABRF next-generation sequencing study
Landscape of transcription in human cells
GENCODE reference annotation for the human and mouse genomes
Microfluidic isoform sequencing shows widespread splicing coordination in the human transcriptome
Counting individual DNA molecules by the stochastic attachment of diverse labels
Quantitative single-cell RNA-seq with unique molecular identifiers
Transcriptome assembly from long-read RNA-seq alignments with StringTie2
Opportunities and challenges in long-read sequencing data analysis
Long-read sequencing of chicken transcripts and identification of new transcript isoforms
Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells
Benchmarking of the Oxford Nanopore MinION sequencing for quantitative and qualitative assessment of cDNA populations
Long-read sequencing uncovers a complex transcriptome topology in varicella zoster virus
Nanopore native RNA sequencing of a human poly(A) transcriptome
Retrieval of a million high-quality, full-length microbial 16S and 18S rRNA gene sequences without primer bias
A first look at the Oxford Nanopore MinION sequencer
A novel tool for predicting drug hypersensitivity
The Oxford Nanopore MinION: delivery of nanopore sequencing to the genomics community
Comprehensive comparison of Pacific Biosciences and Oxford Nanopore Technologies and their applications to transcriptome analysis
Decoding the epitranscriptional landscape from native RNA sequences
Multi-platform sequencing approach reveals a novel transcriptome profile in pseudorabies virus
Long-read isoform sequencing reveals a hidden complexity of the transcriptional landscape of herpes simplex virus type 1
Multi-platform analysis reveals a complex transcriptome architecture of a circovirus
Native RNA sequencing on nanopore arrays redefines the transcriptional complexity of a viral pathogen
Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis
Longitudinal single-cell RNA sequencing of patient-derived primary cells reveals drug-induced infidelity in stem cell hierarchy
Cancer transcriptome profiling at the juncture of clinical translation
Single-cell RNA sequencing in cancer: lessons learned and emerging challenges
Technical advances in single-cell RNA sequencing and applications in normal and malignant hematopoiesis
(2020) Tumor mutational burden is associated with poor outcomes in diffuse glioma
The transcriptional landscape and mutational profile of lung adenocarcinoma
Selective inhibition of mutant IDH1 by DS-1001b ameliorates aberrant histone modifications and impairs tumor activity in chondrosarcoma
Advances in targeted therapy for acute myeloid leukemia
Single-cell sequencing and its applications in head and neck cancer
Single-cell transcriptomic analysis of primary and metastatic tumor ecosystems in head and neck cancer
Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma
RNA-sequencing of tumor-educated platelets, a novel biomarker for blood-based sarcoma diagnostics
International cooperation to enable the diagnosis of all rare genetic diseases
Distally pronounced infantile spinal muscular atrophy with severe axonal and demyelinating neuropathy associated with the S230L mutation of SMN1
Exploring genetic modifiers of Gaucher disease: the next horizon
Novel missense mutations in PNPLA2 causing late onset and clinical heterogeneity of neutral lipid storage disease with myopathy in three siblings
Genetic diagnosis of Mendelian disorders via RNA sequencing
Compound inheritance of a low-frequency regulatory SNP and a rare null mutation in exon-junction complex subunit RBM8A causes TAR syndrome
Identification of rare de novo epigenetic variations in congenital disorders
Identification of rare-disease genes using blood transcriptome sequencing and large control cohorts
Swiprosin-1/EFhd2 controls B cell receptor signaling through the assembly of the B cell receptor, Syk, and phospholipase C gamma2 in membrane rafts
twins: swiprosin-1/EFhd2 and Swiprosin-2/EFhd1, two homologous EF-hand containing calcium binding adaptor proteins with distinct functions
MECR mutations cause childhood-onset dystonia and optic atrophy, a mitochondrial fatty acid synthesis disorder
Expanding the boundaries of RNA sequencing as a diagnostic tool for rare mendelian disease
Beeson D (2015) Mutations in GMPPB cause congenital myasthenic syndrome and bridge myasthenic disorders with dystroglycanopathies
Mutations in GDP-mannose pyrophosphorylase B cause congenital and limb-girdle muscular dystrophies associated with hypoglycosylation of α-dystroglycan
Improving genetic diagnosis in Mendelian disease with transcriptome sequencing
Collagen VI-Related Dystrophies
Dominant collagen VI mutations are a common cause of Ullrich congenital muscular dystrophy
Duchenne muscular dystrophy and dystrophin: pathogenesis and opportunities for treatment