key: cord-0027808-9y8pc0hu
authors: Grünberger, Felix; Ferreira-Cerca, Sébastien; Grohmann, Dina
title: Nanopore sequencing of RNA and cDNA molecules in Escherichia coli
date: 2022-03-03
journal: RNA
DOI: 10.1261/rna.078937.121
sha: ec097f6d869fe348a26c9ebfb8f98eca3b4241da
doc_id: 27808
cord_uid: 9y8pc0hu

High-throughput sequencing dramatically changed our view of transcriptome architectures and allowed for ground-breaking discoveries in RNA biology. Recently, sequencing of full-length transcripts based on the single-molecule sequencing platform from Oxford Nanopore Technologies (ONT) was introduced and is widely used to sequence eukaryotic and viral RNAs. However, experimental approaches implementing this technique for prokaryotic transcriptomes remain scarce. Here, we present an experimental and bioinformatic workflow for ONT RNA-seq in the bacterial model organism Escherichia coli, which can be applied to any microorganism. Our study highlights critical steps of library preparation and computational analysis and compares the results to gold standards in the field. Furthermore, we comprehensively evaluate the applicability and advantages of different ONT-based RNA sequencing protocols, including direct RNA, direct cDNA, and PCR-cDNA. We find that (PCR)-cDNA-seq offers improved yield and accuracy compared to direct RNA sequencing. Notably, (PCR)-cDNA-seq is suitable for quantitative measurements and can be readily used for simultaneous and accurate detection of transcript 5′ and 3′ boundaries, analysis of transcriptional units, and transcriptional heterogeneity. In summary, based on our comprehensive study, we show nanopore RNA-seq to be a ready-to-use tool allowing rapid, cost-effective, and accurate annotation of multiple transcriptomic features. Thereby nanopore RNA-seq holds the potential to become a valuable alternative method for RNA analysis in prokaryotes.

In the last decade, next-generation sequencing (NGS) technologies (Levy and Myers 2016) revolutionized the field of microbiology (Escobar-Zepeda et al. 2015) , which is not only reflected in the exponential increase in the number of fully sequenced microbial genomes but also in the detection of microbial diversity in many hitherto inaccessible habitats based on metagenomics. Using transcriptomics, important advances were also possible in the field of RNA biology (Wang et al. 2009; Hör et al. 2018 ) that shaped our understanding of the transcriptional landscape (Croucher and Thomson 2010; Nowrousian 2010 ) and RNA-mediated regulatory processes in prokaryotes (Saliba et al. 2017) . RNA sequencing (RNA-seq) technologies can be categorized according to their platform-dependent read lengths and the necessity of a reverse transcription and amplification step to generate cDNA (Stark et al. 2019) . Illumina sequencing yields highly accurate yet short sequencing reads (commonly 100-300 bp). Hence, sequence information is only available in a fragmented form, making full-length transcript-or isoform-detection a challenging task (Tilgner et al. 2015; Byrne et al. 2019) . Sequencing platforms developed by Pacific Bioscience (Pac-Bio) and Oxford Nanopore Technologies (ONT) solved this issue. Both sequencing methods are bona fide single-molecule sequencing techniques that allow the sequencing of long DNAs or RNAs (Eid et al. 2009; Mikheyev and Tin 2014) . However, the base detection differs significantly between the two methods. PacBio sequencers rely on fluorescence-based single-molecule detection that identifies bases based on the unique fluorescent signal of each nucleotide during DNA synthesis by a dedicated polymerase (Eid et al. 2009 ). In contrast, in an ONT sequencer, the DNA or RNA molecule is pushed through a membranebound biological pore with the aid of a motor protein attached to the pore protein called a nanopore. A change in current is caused by the translocation of the DNA or RNA strand through this nanopore, which serves as a readout signal for the sequencing process. Due to the length of the nanopore (version R9.4), a stretch of approximately five bases contributes to the current signal. Notably, only ONTbased sequencing offers the possibility to directly sequence native RNAs without the need for prior cDNA synthesis and PCR amplification (Soneson et al. 2019) . Direct RNA sequencing based on the PacBio platform has also been realized but requires a customized sequencing workflow using a reverse transcriptase in the sequencing hotspot instead of a standard DNA polymerase (Vilfan et al. 2013) . Direct RNA-seq holds the capacity to sequence full-length transcripts and has been demonstrated as a promising method to discriminate and identify RNA base modifications (e.g., methylations; Liu et al. 2019; Smith et al. 2019; Parker et al. 2020; Begik et al. 2021; Jenjaroenpun et al. 2021) . ONT sequencing is a bona fide single-molecule technique and hence offers the possibility to detect molecular heterogeneity in a transcriptome (Workman et al. 2019) . Recently, the technology was exploited to sequence viral RNA genomes (Keller et al. 2018; Boldogkő i et al. 2019; Viehweger et al. 2019; Wang et al. 2021) to gain insights into viral and eukaryotic transcriptomes Zhao et al. 2019; Sahlin et al. 2021) and to detect and quantify RNA isoforms in eukaryotes (Byrne et al. 2017; Workman et al. 2019; Parker et al. 2020; Dong et al. 2021; Seki et al. 2021) . Essentially, the requirements, but also the possibilities in eukaryotes and prokaryotes, are the same (Choi 2016) , with a poly(A) tail being an essential prerequisite, which is required to capture the RNAs. Using enzymatic polyadenylation of prokaryotic RNAs that in general lack poly(A) tails, the applicability of nanopore RNA-seq has already been demonstrated by metatranscriptomic sequencing of bacterial food pathogens (Yang et al. 2020 ) and by accurate estimation of gene expression levels in Klebsiella pneumoniae (Pitt et al. 2020) . Despite these initial studies, a comprehensive analysis of the applicability of nanopore RNA-seq for the analysis of prokaryotic transcriptomes is lacking.

In this study, we applied and compared all currently available ONT library preparation methods to analyze RNAs in the prokaryotic model organism Escherichia coli K-12. These include direct sequencing of native RNAs, direct sequencing of cDNAs, and sequencing of PCR-amplified cDNAs. The goal was to create a robust workflow for the simultaneous determination of multiple transcriptional features. To this end, we analyzed the reproducibility and comparability of transcript quantification, evaluated the accuracy of transcript boundary identification, and the potential of long-read ONT RNA-seq to capture the complexity of bacterial transcriptional units. Noteworthy, due to the single-molecule resolution of ONT sequencing, in-depth analysis of transcription units becomes possible. In addition, we point out practical and technical considerations of the different methods such as the effects of rRNA depletion on the sequencing depth, the possibility to enrich for full-length transcripts in the cDNA protocols, and the effects of read trimming.

Experimental design for comprehensively comparing nanopore sequencing of RNA and cDNA molecules in Escherichia coli Currently, three different protocols from ONT are available for the analysis of RNAs including (i) direct sequencing of native RNAs (SQK-RNA002, referred to as DRS in this study), (ii) direct sequencing of cDNAs (SQK-DCS109, referred to as cDNA in this study), and (iii) sequencing of PCR-amplified cDNAs (SQK-PCB109, referred to as PCR-cDNA in this study) ( Fig. 1A; Supplemental Fig. 1) . Since there is a crucial difference between sequencing RNA or DNA, we additionally use the combined term (PCR)-cDNA-seq, which refers to both cDNA and PCR-cDNA approaches. In short, all methods rely on polyadenylated RNAs as starting material since RNAs are either annealed to an oligo(dT) primer for (PCR)-cDNA approaches or ligated to a double-stranded oligo(dT) splint adaptor in the DRS approach. Although reverse transcription is optional for DRS, it is highly recommended by ONT and the community to resolve secondary structures in the RNA and to decrease the probability of pore blockage, which ultimately results in an increase in total throughput (Workman et al. 2019) . However, only the RNA strand carries the motor protein and is subsequently sequenced. The (PCR)-cDNA protocols take advantage of the template-switching ability of the reverse transcriptase, which adds a few nontemplated cytosines to the end of the cDNA (Matz et al. 1999) . This allows the enrichment of full-length sequenced transcripts during the analysis (Supplemental Fig. 1 ). After RNA digestion, the second strand is synthesized, followed by barcode ligation, PCR amplification in the PCR-cDNA protocol, attachment, or ligation of sequencing adaptors and sequencing.

We performed all three protocols using unfragmented total RNA prepared from the prokaryotic model organism E. coli K-12 strain MG1655 grown at 37°C in a rich medium. The aim was to discuss current limitations and best practices analyzing prokaryotic transcriptomes using nanopore sequencing of RNA and cDNA molecules and to compare the results to other full-length sequencing protocols and platforms. Two biological replicates for each library A D B C FIGURE 1. Overview of generated data sets for comprehensively comparing nanopore sequencing of RNA and cDNA molecules in Escherichia coli. (A) Five replicates of the prokaryotic model organism Escherichia coli strain K-12 MG1655 were sequenced using currently available RNA-seq protocols from Oxford Nanopore, including direct RNA sequencing (DRS), direct cDNA sequencing (cDNA), and sequencing of PCR-amplified cDNAs (PCR-cDNA). Different rRNA-depletion, additional treatment strategy (Terminator 5 ′ -Phosphate-Dependent Exonuclease, TEX), kit names used (RNA001, RNA002, DCS109, PCB109), and key steps of the library preparation are outlined in the graphic workflow summary. (B) Principle of nanopore sequencing: An ionic current drives the cDNA or the RNA strand of a RNA/cDNA hybrid through the membrane-embedded nanopore. The motor protein, attached during library preparation, unzips the double strands, and controls the translocation speed. Translocation of the strand alters the electric signal, which is used to determine the sequence. (C ) Basic workflow for analyzing nanopore reads including basecalling and demultiplexing using ONT-developed guppy, custom scripts to perform quality control of runs/reads, minimap2 (Li 2018) to align the reads to the reference genome and salmon (Patro et al. 2017 ) in alignment-based mode for gene quantification. (D) Genome browser view of the 30 longest untrimmed reads per selected library sorted by read start position in the genomic region of the rpsP-rimM-trmD-rplS operon. The longest read of each ONT protocol is highlighted in red. preparation method were sequenced on a MinION using R9.4 flow cells controlled by MinKNOW. The key steps of library preparation and sequencing are depicted in Figure  1A ,B and are briefly summarized in the following: after purification of high-quality RNAs using silica-membrane columns with a cut-off size of about 200 nt, RNAs were immediately polyadenylated to make them amenable for library preparation and to preserve the 3 ′ ends from further degradation during the next steps of the library preparation. Since full-length sequencing of RNAs and cDNAs is dependent on the quality of the source material, we only used RNAs with integrity values (RIN) greater than 9.5 (Schroeder et al. 2006) . Also, Bioanalyzer analysis was used to confirm efficient polyadenylation based on a shift in ribosomal RNA peaks and to check fragment size of PCR-amplified cDNAs (Supplemental Fig. 2A,B) . To increase the proportion of sequenced mRNAs, ribosomal RNAs, which usually make up the main part of the RNA pool, can be depleted. However, the input quantity requirements currently still make it challenging to use rRNA-depleted RNAs in a sensible and cost-efficient way, especially for DRS. The input amounts are currently listed to be 500 ng poly(A) + (DRS), 100 ng poly(A) + RNA (cDNA), and 1 ng (PCR-cDNA), respectively. Therefore, we used nondepleted RNA for DRS sequencing, a mix of depleted (40%) and nondepleted RNA (60%) for the cDNA protocol, and fully depleted RNA for the PCR-cDNA approach. Additionally, we tested the compatibility with other RNA treatments using the commonly applied digestion of 5 ′ -monophosphorylated nonprimary RNAs with a 5 ′ -Phosphate-dependent Terminator Exonuclease (TEX) as an example (Fig. 1A ). However, it should be noted that we deliberately chose reaction conditions not sufficient for the complete digestion of all nonprimary RNAs. The intention of this design was not to distinguish primary from processed transcripts but rather to minimize the rRNA content even further.

Overall run and raw read characteristics and analysis of mapped reads Sequencing throughput on a single FLO-MIN106 flow cell is dependent on the kit chemistry and currently listed by ONT to typically range between 1 and 4 Gb for DRS, more than 8 Gb for cDNA and about 10 Gb for the PCR-cDNA kits. Considering that a higher yield could be expected for the cDNA kits and the (partial) depletion of ribosomal RNAs, cDNA runs were multiplexed and aborted as soon as a sufficient number of reads (>0.5 Gb) was reached. All sequencing parameters and run statistics are listed in Supplemental Table 1 and shown in Supplemental Figures 3 and 4. The sequencing yield of unfiltered reads ranged between 0.09 and 2.21 million reads, or 0.08 Gb and 1.57 Gb, respectively (Supplemental Fig. 3 ). Read qualities, which are specified as mean Q-score values, were similarly distributed within the three library types, showing median values of 8.8 (DRS), 9.7 (cDNA), and 10.5 (PCR-cDNA) (Supplemental Fig. 4A ). As expected, the read length distributions of the individual samples were highly dependent on the effect of rRNA depletion (Supplemental Fig. 4B ). Although we could confirm the reports of previous studies that very short direct RNA reads are associated with bad quality (Soneson et al. 2019; Workman et al. 2019 ), we did not see a pronounced effect in other library types or for very long RNAs in our data sets (Supplemental Fig. 4C ).

We next aligned the unfiltered reads to the E. coli K-12 genome using minimap2 (Fig. 1C) . 71.4% (DRS), 64.7% (cDNA), and 48.9% (PCR-cDNA) of the reads mapped to the genome, which corresponds to 78.0% (DRS), 64.7%, and 47.2% of the bases, respectively (Supplemental Fig.  5 ). The moderate numbers arise from short reads with low quality, which dominate the class of unmapped reads and are particularly common for the direct RNA data sets but also occur in the (PCR)-cDNA approaches (Supplemental Fig. 6A -D). The lower total number of mapped reads in the PCR-cDNA samples is due to the preference for overamplification of short fragments in the PCR, which is more pronounced at higher cycle numbers. This suggests that successful sequencing can be estimated reasonably well already from the Bioanalyzer results of PCR-amplified cDNA (Supplemental Fig. 2B ). Based on the length distribution, which is similar to the unamplified cDNA and the proportion of mapped reads (62%), we concluded that 12 PCR cycles are sufficient to obtain high-quality sequencing data (Supplemental Figs. 5, 6) .

To allow a detailed analysis of the mapped reads, they were first classified into transcript features and classes using the annotation found in GenBank entry U00096.3 (Riley et al. 2006) . Most of the reads in the noncoding RNA class originate from the ssrA gene in our sample conditions, producing the transfer-messenger RNA (tmRNA), which explains the uniform length distribution. The tmRNA has tRNA-specific base modifications (Himeno et al. 2014 ) that lead to an altered current profile, which presumably explains the lower read quality in the DRS approach (Supplemental Fig. 6B ). After the RNA is transcribed into cDNA, the modifications are lost, and the quality of the sequenced reads increases significantly. As expected, the raw read length of rRNA-mapping reads was largely dependent on the predesigned depletion efficiency and subsequent TEX treatment (Supplemental Fig. 6A ). Indeed, the number of reads mapping to ribosomal RNAs is significantly reduced in TEX-treated samples compared to the nontreated counterparts (Supplemental Fig. 7) .

In the following, we will focus on mRNA-originating reads performing an in-depth analysis of transcriptomic features in E. coli. Reads mapping to mRNAs in fully rRNA-depleted libraries make up ∼33% of all mapped reads (PCR-cDNA samples with 12 PCR cycles), which corresponds to 42% of all mapped bases (Supplemental Fig. 7) . The aligned read length distribution of mRNA-mapping reads was similar between all library types with median values of 406 (DRS), 372 (cDNA), and 395 (PCR-cDNA) bases (Supplemental Fig. 8A ). Despite these relatively short median values, there is also a proportion of reads in all library types that are very long and cover large operon structures in one read, which is exemplarily shown for the rpsP-rimM-trmD-rplS operon in Figure 1D . This is particularly clear when looking at the mean aligned length of the 100 longest reads in each protocol, which are 4738 (DRS), 6567 (cDNA), and 6132 (PCR-cDNA) bases. At this point, it should be mentioned that the 100 shortest reads have mean lengths of 89 (DRS), 80 (cDNA), and 80 (PCR-cDNA) bases, which is caused by the mapping tool minimap2 used with standard parameters. As previously reported, the mapped read identity of direct RNA reads (88.1%) is substantially lower as compared to cDNA reads (96% cDNA, 94% PCR-cDNA) (Supplemental Fig. 8B ; Soneson et al. 2019 ). However, we noticed that the read identity improved when using the RNA002 chemistry instead of the meanwhile outdated RNA001 kit. Although template-switching and second-strand synthesis enriches explicitly for full-length transcripts in all cDNA protocols, no clear difference was detected in the aligned length distribution. The difference in the number of PCR cycles leads to significant differences in mean read lengths (15 cycles: 310 bases; 12 cycles: 526 bases), although there is no effect on read quality and identity.

Comparing raw read length with aligned read length, we noticed that many reads in the cDNA protocol are twice as large as their mapped counterpart, which is caused by reverse transcription artifacts (Perocchi et al. 2007; Tuiskunen et al. 2010 ) that generate 2D-like reads, containing both strands of a transcript (Supplemental Fig. 9A ). Interestingly, the reverse complement part of the read has much lower quality scores than the reverse transcribed RNA. This is confirmed by a correlation analysis between the raw read qualities and the mapped read identity (Supplemental Fig. 9B ). Direct RNA and PCR-cDNA reads with low quality led to a lower identity score. This is not observed in the direct cDNA data set since the distribution is dominated by the low-quality peak of the reverse complement. In most of the cases, only the good-quality first part of the 2D-like read maps to the genome and the aligned read identity is high. The second part of the read, however, is discarded (Supplemental Fig. 9C ,D). It should be noted that these 2D-like reads make up the majority of reads that have mapped to mRNAs in the cDNA libraries (Supplemental Fig. 9E ). Although much less common, the artifact also occurs in PCR-cDNA reads and, as expected, is not found in direct RNA reads (Supplemental Fig. 9E ).

Since the first strand is always sequenced, 2D-like reads are not expected to distort the quantification of reads. To test this and to determine the overall comparability and robustness of the data in absolute quantitative terms, we compared the count data based on untrimmed, uncorrected nanopore reads with published short-read (Illumina) and full-length long-read (SMRT-Cappable-seq protocol, Pac-Bio) cDNA sequencing data from E. coli sampled under similar conditions (Yan et al. 2018 ). Since we only consider reads that map to mRNAs for this purpose, we first looked at the sequencing depth of each data set to assess whether representative statements can be made. Sequencing depth was dependent on rRNA depletion, TEX treatment, and the total number of reads sequenced. Therefore, sequencing depths between 0.2-fold (DRS, RNA002, replicate 2) and 52-fold (PCR-cDNA, 12 cycles, replicate 1) reflect the design of the particular experiment and are mostly comparable to the selected SMRT-Cappable (replicate 1: 51-fold; replicate 2: sevenfold) and short-read Illumina (70-fold) data sets (Supplemental Fig. 10A ). Considering the sequencing strategies, (PCR)-cDNA nanopore sequencing offers a more straightforward way to produce comprehensive data sets to analyze mRNA features. Almost 90% of known genes were covered by at least one read in all (PCR)-cDNA libraries. In contrast, direct RNA libraries only covered 70% (RNA001, replicate 3), 44% (RNA002, replicate 1), and 13% (RNA002, replicate 2) of the genes (Supplemental Fig. 10B ). To evaluate how many reads are needed to cover at least 75% of all genes, we subsampled the reads of the representative rRNA-depleted PCR-cDNA sample (12 PCR cycles). We found that a sequencing depth of ∼10-fold is sufficient for this purpose, corresponding to 70,000 mRNA-mapping reads (Supplemental Fig. 10C ).

To evaluate whether nanopore RNA-seq data can be used for quantitative measurements, we looked into the correlation between replicates and when using different library types ( Fig Table 3 ). Despite different sequencing platforms, protocols for sample preparation, and sequencing depths, we observed a decent correlation between expression data from published short-read Illumina RNA-seq data and ONT data sets (Fig. 2B,C) . Nevertheless, we found that a higher number of PCR cycles resulted in particularly GC-rich genes being underrepresented, leading to an overall more insufficient correlation in the PCR-cDNA data sets (Supplemental Fig. 11A , B). However, since we observed a similar effect with the nonamplified direct cDNA sample, which overall showed the best correlation to the Illumina data, other biases cannot be ruled out completely. For example, the SMRT-Cap protocol includes stringent size-selection filtering for fragments bigger than 1 kb. Consequently, from a purely quantitative perspective, the SMRT-Cap data are not fully comparable to the nanopore data, but this may also be partly due to the sequencing depth. Considering these interfering factors, we have obtained a very good correlation between the cDNA replicates (0.97) and to the other library methods (DRS-Rep2 to cDNA-Rep2: 0.91; PCR-cDNA-Rep4 to cDNA-Rep2: 0.94).

Taken together, the ONT data are very consistent and allow a quantitative analysis of various transcriptomic features, which we will discuss in more detail below. However, the PCR bias is a critical point and researchers should carefully determine the number of PCR cycles required for their sample of choice.

To accurately quantify and identify the number of fulllength sequenced reads, we used Pychopper (github. com/nanoporetech/pychopper), a tool developed by ONT. This tool allows the detection and trimming of fulllength sequenced cDNA reads based on the presence of strand-switching primer (SSP) and anchored oligo(dT) VN primer (VNP). In addition, it orients the sequenced reads (Supplemental Fig. 12A ). As already evident from the length distribution, 2D-like reads make up a significant portion of the direct cDNA samples, which is confirmed by the low percentage of the Pychopper-detected full-length sequenced transcripts of ∼34% in contrast to over 80% of full-length sequenced reads in all PCR-cDNA samples using standard settings (Supplemental Fig. 12B ). However, a direct-cDNA specific setting in Pychopper, which can handle 2D-like reads better and allows to rescue many reads, almost doubled the number of full-length sequenced reads detected (59%). When comparing the aligned read lengths, we detected only a minimal difference between untrimmed and full-length-filtered reads, which is probably caused by random mapping of adaptor sequences (Supplemental Fig. 12C ). Despite Pychopper trimming, we observed that Adenine, caused by polyadenylation, and Guanine, caused by nontemplated addition of nucleotides by the template-switching RT, are overrepresented at the 3 ′ and 5 ′ ends of cDNA reads, respectively (Supplemental Fig. 13A) . To enable precise determination of transcript boundaries, we successfully trimmed off long poly(A) tails that are not expected to be found at the 3 ′ ends of bacterial transcripts, and remaining SSP adaptors from the 5 ′ ends (Supplemental Fig. 13A ,B).

Long-read ONT sequencing of RNA and cDNA molecules allows the simultaneous readout of 5 ′ and 3 ′ transcript boundaries (Fig. 3A) . Since full-length sequenced read starts and ends are expected to be enriched at functional relevant terminal positions and not randomly distributed, we first applied a peak calling algorithm on bedGraph files from terminal read positions that determines the positions of all local maxima. In the next step, comparable to the evaluation of SMRT-Cap or short-read data sets, we defined the highest accumulation of 5 ′ ends in a peak 300 bp upstream of an annotated gene as the primary 5 ′ transcript boundary (Fig. 3A) . Each additional peak in this region was designated as secondary and enriched intergenic peaks as internal. Because our samples do not only contain primary transcripts, we deliberately did not designate these ends as transcription start sites, although a considerable overlap is expected. We were able to define between 549 and 5019 5 ′ transcript ends in representative data sets, which varied depending on sequencing depth and trimming (Supplemental Fig. 14A ; Supplemental Table  4 ). As described in other studies, the majority of enriched 5 ′ ends are localized in internal regions. However, we could A B C FIGURE 3. Analysis of transcript boundaries detected using ONT RNA-seq methods. (A) Exemplary region in E. coli containing the noncoding RNA tff and five other genes. Coverage profiles of raw reads for the different library protocols have been normalized to 100, which refers to the highest coverage of each analyzed sample. 5 ′ read ends of raw reads (light blue) and 3 ′ read ends of trimmed reads (dark blue) are shown as line plots with the number of reads starting or ending at the positions shown on the right scale. Following the analysis pipeline depicted on the right, we identified 5 ′ and 3 ′ enriched positions. Primary 5 ′ (solid lines) and 3 ′ ends (dashed lines) derived from the analysis are shown for each data set in the coverage plot. (B) Accuracy of 5 ′ end detection using raw reads assessed by comparison of distances between primary ONT 5 ′ ends to differential RNA-seq primary transcription start sites (TSS) (Thomason et al. 2015) and SMRT-Cap TSS (Yan et al. 2018 ). (C) Accuracy of 3 ′ end detection using trimmed reads assessed by comparison of distances between primary ONT 3 ′ ends to short-read Term-seq primary transcription termination sites (TTS) (Dar and Sorek 2018) and SMRT-Cap TTS (Yan et al. 2018 ).

also identify up to 1248 primary sites. Unexpectedly, untrimmed reads had a higher agreement in 5 ′ ends at the single-nucleotide level to other comparable methods such as short-read differential RNA-seq and SMRT-Cap than trimmed reads ( Fig. 3B ; Supplemental Fig. 14B ; Thomason et al. 2015; Yan et al. 2018) . Ends determined by direct RNA sequencing are about 12 nt shorter, which is in line with previous observations (Soneson et al. 2019; Workman et al. 2019; Parker et al. 2020 ) and can be rationalized by a lack of control of the RNA translocation speed after the motor protein falls off the 5 ′ ends of the RNA (Fig. 3B ). PCR-cDNA and cDNA 5 ′ ends are very clearly defined and predominantly end at the same base. In contrast, DRS leads to fuzzy 5 ′ ends, presumably caused by a lower mapping accuracy. TEX treatment had neither a positive nor negative effect on 5 ′ end detection or the number of reads starting at the enriched 5 ′ ends. This may be due to the short treatment time and the digestion of the remaining ribosomal RNA leaving mRNA-mapping primary transcripts unaffected. Primary 5 ′ transcript ends highly correlated between all different library types provided that enough reads support the enriched position (Supplemental Fig. 15 ). The moderate correlation (0.67) to SMRT-Cap 5 ′ ends can mainly be attributed to the different library preparation approaches (Supplemental Fig. 16A ). In contrast to our data, only primary transcripts are specifically captured and subsequently small transcripts naturally occurring in E. coli are intentionally lost due to size selection. The correlation drastically improved, when considering the positions of secondary 5 ′ ends determined during ONT read analysis (Supplemental Fig. 16B ). We found that for some genes, the SMRT-Cap primary site coincides with the ONT secondary, but not primary site. Although no specific enrichment for primary transcripts was performed for most of the samples, the 5 ′ UTR distributions and the bacterial-typical nucleotide contents of upstream regions lead to the assumption that ONT sequencing is capable of accurately determining transcription start sites (Supplemental Fig. 17 ). Peak enrichment analysis and 3 ′ end annotations were performed as described for the 5 ′ ends (Fig. 3A) . Overall, the number of enriched 3 ′ ends found in the respective categories was slightly lower as compared to the 5 ′ ends (Supplemental Fig. 18A ; Supplemental Table 5 ). In contrast to the rather detrimental effect of trimming on the accuracy of 5 ′ end detection, trimming increased the number of 3 ′ ends that are identical to Term-seq (Supplemental Fig.  18B ; Dar and Sorek 2018) . However, it should be noted that in vitro polyadenylation and trimming can affect detection accuracy of 3 ′ ends that have a naturally occurring terminal polyA sequence. This is, for example, the case with the RNAse E processing site of the ssrA gene, where raw reads are artificially too long and contain additional adenines, which are later trimmed off during read processing (Supplemental Fig. 19 ; Lin-Chao et al. 1999) . Although 3 ′ end detection is highly reproducible and 3 ′ ends overall highly correlate with SMRT-Cap detected ends, ONT 3 ′ ends are fuzzier and tend to be up to 3 nt shorter ( Fig. 3C; Supplemental Fig. 20) . Since we cannot exclude that 3 ′ to 5 ′ exoribonucleases degrade RNAs after transcription, enriched sites may either represent genuine termination sites or enriched processed 3 ′ ends. Nevertheless, the 3 ′ UTR lengths of primary 3 ′ ends and the poly(T) termination motif, which is typical for intrinsic terminators, suggest that most detected primary 3 ′ ends are genuine transcription termination sites (Supplemental Fig. 21A,B) .

In contrast to DRS, the cDNA protocols provide access to full-length sequenced transcripts due to the templateswitching behavior of the RT (Supplemental Fig. 12 ). Accordingly, it is expected that the 5 ′ and 3 ′ ends are covered to the same extent and that the coverage distribution over a gene is flat overall, which should improve an accurate transcriptional unit analysis. However, previous studies have shown that both DRS and direct cDNA reads are often truncated at the 5 ′ end (Sessegolo et al. 2019; Soneson et al. 2019; Workman et al. 2019 ). The reasons for this observation are still not completely clear but could be related to the fact that RNAs are directly sequenced starting from the 3 ′ ends, to problems during template-switching, or sequencing-related issues like current spikes. To estimate the effect of the 3 ′ coverage bias, we looked at the gene body coverage profile between all samples and used previously introduced metrics, like the quartile coefficient of variation (QCoV) (Parker et al. 2020) , to quantify coverage drops along the transcripts (Fig. 4A,B) . For the DRS and cDNA samples, we can confirm that 5 ′ ends are less covered compared to the 3 ′ ends (Fig. 4A-E ). This has a particularly dramatic effect on the 5 ′ coverage of long transcriptional units, exemplarily shown for units ending at the hslU gene (Supplemental Fig. 22 ). Overamplification during PCR results in both ends being more enriched compared to the transcript center. In contrast, at 12 cycles, the reads are equally distributed across the gene body deviating on average <5% from the median coverage (Fig. 4C ). As expected, quality filtering and selection of full-length sequenced cDNA reads with both recognition adaptors results in an enrichment of the 5 ′ ends for all cDNA samples (Fig. 4D) . However, we see that transcripts longer than 2 kb are less well uniformly covered (Fig. 4E) . It should be noted that these do not occur very often in our selected data that rely on the previous annotation of 5 ′ and 3 ′ ends, which could influence the distribution.

The distribution of reads over the gene body confirmed that ONT sequencing can cover both ends of a transcript.

Nanopore RNA-seq in E. coli

Since read lengths are theoretically only limited by the transcript size, ONT sequencing has the potential to accurately define complex transcriptional unit structures by finding overlaps between the mapping coordinates of individual reads and the transcript positions (Fig. 5A) . Following the annotation approach from the SMRT-Cap protocol, the unique combination of genes within a transcriptional unit was defined as the transcriptional context of a gene. Transcriptional unit prediction was performed exemplarily for one each of the DRS (RNA001, replicate 1), cDNA (DCS109, replicate 2), and PCR-cDNA (PCB109, replicate 4) libraries (Supplemental Table 6 ). Thereby, 788 (DRS), 2264 (cDNA), and 2433 (PCR-cDNA) unique transcriptional units were defined, respectively (Fig. 5B) . Mainly limited by the sequencing depth, the vast majority of defined transcriptional units (PCR-cDNA: 90%; cDNA: 83%; RNA: 90%) overlapped between the different protocols (Fig.  5B) . Hence, rare transcriptional unit variants stretching over multiple genes are not detected at low sequencing depth and stringent detection filters, which is also reflected in the mean number of genes encoded in a transcriptional unit: 1.14 for the DRS, 1.18 for the cDNA, and 1.26 for the PCR-cDNA approach, respectively. This is in agreement with the observation that particularly long transcriptional units are underrepresented in our data set. Therefore, the overall agreement with SMRT-Cap (43%) and the Regu-lonDB database (50%) is only moderate, which is presumably additionally heavily influenced by the respective detection algorithms, library preparation, and sample conditions (Supplemental Fig. 23) . Nevertheless, the distribution of transcriptional contexts in the PCR-cDNA data set is in good agreement with the results from the SMRT-Cap analysis, showing that many genes are transcribed in more than one context (Fig. 5C ). Note that without prior enrichment or treatment, quantification of the individual transcriptional contexts should consider that prokaryotic transcripts are subject to various degradation and processing events. Therefore, it was not surprising that we captured a mix of 3 ′ or 5 ′ intact transcripts, which are often processed from the other end, as indicated by the ONT single-read tracks ( Fig. 5A ; Supplemental Figs. 24, 25) . Effects that arise from RNA processing could be analyzed in more detail when sequencing transcriptomes of exonuclease knockout strains or with protocols that specifically enrich for primary transcripts (compare Send-seq and SMRT-Cap protocol) (Yan et al. 2018; Ju et al. 2019 ). However, after the explicit enrichment of full-length sequenced transcripts and under the valid as-sumption that transcripts are not strongly degraded (compare RIN values) the extensive transcriptional heterogeneity is surprising. This cannot only be seen in Figure  5A , but also in other examples, such as the RegulonDB-annotated operon rpsP-rimM-trmD-rplS (Supplemental Fig.  22 ) or a section of the genome containing many ribosomal proteins (Supplemental Fig. 23 ). The annotation of transcriptional units fits very well with the prediction of primary 5 ′ and 3 ′ ends and shows that long-read ONT RNA-seq can more easily identify transcripts that arise from a shared promoter and have heterogeneous 3 ′ ends. As already shown in the SMRT-Cap data, the tff-rpsB-tsf unit, which is identical to the operon annotated in the RegulonDB, is terminated in a stepwise manner. However, an additional termination site can be detected directly after the putative small RNA tff, which is otherwise lost through size selection (Fig. 5A) .

Taking advantage of the 5 ′ to 3 ′ connectivity of the reads is one of the key advantages of single-molecule sequencing, which we used to perform transcriptional unit prediction. However, this feature can also be used to explore transcription, processing, and degradation patterns of individual transcripts. We exemplify this capacity using the well-described decay of the rpsO mRNA, encoding the ribosomal protein S15 (Supplemental Fig. 26a ; Régnier and Portier 1986; Régnier and Hajnsdorf 1991; Hajnsdorf and Régnier 1999) . Nanopore (PCR)-cDNA sequencing captures that the majority of the transcripts are derived from promoter P1 and end at the 3 ′ hairpin (PCR-cDNA: 44%; cDNA: 53%) protecting the primary transcript from degradation. Consequently, this represents the most abundant transcript (PCR-cDNA data shown in Supplemental Fig. 26b-d) . Additionally, frequent degradation events from the 3 ′ end after processing at M2 and minor populations (e.g., transcript cleavage at M3, transcription from a second upstream promoter or termination readthrough) can be observed. In summary, nanopore sequencing is capable of not only accurately detecting complex transcriptional unit structures but can also aid in quantification or in deciphering the unprecedented transcriptional heterogeneity, which may be improved by using specialized strains or conditions depending on the scientific question.

In this study, we performed a comprehensive comparison of all currently available kits from Oxford Nanopore for the analysis of RNAs, including direct sequencing of native RNA (RNA001, RNA002), direct cDNA (DCS109), and PCR-cDNA sequencing (PCB109) in the bacterial model organism Escherichia coli K-12. As a result, we demonstrate that multiple properties of the transcriptome can be examined simultaneously with high accuracy. This study therefore provides the first extensive analysis of ONT RNA-seq methods in prokaryotes. Furthermore, after screening important quality control metrics of the sequenced libraries, we show that nanopore RNA-seq is suitable for making quantitative measurements and correlates well with data of the most commonly used short-read Illumina RNA-seq data. Additionally, we provide a bioinformatics workflow that allows accurate determination of transcript boundaries and quantitative analysis of transcriptional units applicable to all prokaryotes.

However, at present, some disadvantages of nanopore RNA-seq should be considered that are summarized in Figure 6A . First, it must be ensured that the polyadenylation reaction in the organism of choice works equally effectively for all RNAs. Second, direct sequencing of RNAs requires a large amount of starting RNA material (>10 µg) to yield enough mRNA (500 ng) left after effective rRNA depletion. Since the depletion kits are usually not designed for these quantities, the additional reactions are another cost factor. Higher costs for DRS also originate from the slower sequencing speed, which negatively impacts throughput and the current lack of a barcoding option provided by ONT. Although there are already excellent option to build a custom set of DRS barcodes, this is not as straightforward to use as for (PCR)-cDNA libraries (Smith et al. 2020 ). Regarding 5 ′ end detection, it has been shown multiple times that about 12 bases are missing from the DRS 5 ′ ends. This observation can be explained by the motor protein falling off at the end of a transcript resulting in a loss of control to guide the RNA through the nanopore, which is not the case for the (PCR)-cDNA data (Soneson et al. 2019; Workman et al. 2019) . Another point of criticism that is repeatedly discussed is the comparatively low accuracy, especially for DRS, but also for (PCR)-cDNA data sets (Garalde et al. 2018; Soneson et al. 2019; Workman et al. 2019) . Although this is not a significant problem for most questions, it affected the base-accurate trimming of adaptor sequences and thus influenced the accuracy of the determination of the A B transcript ends. In particular, up to four more bases are trimmed off at the 3 ′ ends since the homo-poly(A) sequence is usually low in quality and can only be trimmed inaccurately. Determining the 3 ′ ends without trimming, which performs better at the 5 ′ ends, performed even worse since long-read nanopore mappers like minimap2 allow a higher number of errors (Li 2018) . In general, the choice of the mapping tool should be well considered as it greatly impacts the quality of the analysis. We applied the widely used and actively developed minimap2, which fails to align small RNAs (∼80 bases cutoff) (Li 2018) . While other mapping tools, like Magic-BLAST (Boratyn et al. 2019 ) or GraphMap2 (Sovićet al. 2016 can align short transcripts, it is usually at the expense of other aspects, and the method of choice dependent on the respective question. Despite or even because of these limitations, the nanopore community is very active and interested in providing solutions for the problems discussed. Indeed, there are already promising applications that will also further improve ONT RNA-seq in prokaryotes in the future, like the error-correction of (PCR)-cDNA reads using isONcorrect (Sahlin et al. 2021) or the improvement of 5 ′ end detection in DRS after 5 ′ -dependent adaptor ligation (Parker et al. 2020 ). Based on our results and considering the most cost-effective way to create and sequence libraries, we conclude that (PCR)-cDNA sequencing is the method of choice for most scientific questions, except for the analysis of RNA modifications (Begik et al. 2021) . As only 1 ng of rRNA-depleted RNA is sufficient to generate PCR-cDNA libraries, PCR-cDNA-seq is highly preferable for organisms or conditions where the amount of RNA isolated is a crucial criterion. Our data clearly show that the number of cycles in the PCR should be controlled with special care. Otherwise, small AT-rich transcripts are preferentially amplified and sequenced, which distorts the quantification and further analyses. However, if this is handled correctly and the number of cycles is as low as possible, in our case 12, the PCR-cDNA data are highly comparable to the direct cDNA results. In any case, reverse transcription is a critical point for all (PCR)-cDNA libraries. Nevertheless, the ONT-recommended Maxima H Minus Reverse Transcriptase (Thermo Fisher Scientific) performed quite well for our samples as documented in the gene body coverage data and the reproducibly good quantification. Another advantage of the enzyme used is that the reaction temperature can be increased to transcribe sequences with exceptionally high GC content or secondary structures.

In fact, there are already some sophisticated ways to profile full-length transcripts in E. coli, including the SMRT-Cappable-seq (Yan et al. 2018 ) and the SEnd-seq (Ju et al. 2019) protocols. Comparison of SMRT-Cap and ONT data show that both data sets are highly congruent, although the repeatedly discussed size selection in the PacBio libraries plays a critical role and is a disadvantage. Unfortunately, despite the introduction of these methods, they have not yet been used in the prokaryotic community for further studies, although the reasons for this may well be diverse. However, we can imagine that the low initial costs of purchasing a MinION and the excellent performance could encourage some laboratories to use nanopore RNA-seq in prokaryotes. The additional costs and IT infrastructure requirements are also limited, with basecalling of the data representing the highest computational effort for these analyses.

Taken together, a key advantage of ONT RNA-seq is that multiple features can be addressed simultaneously with high accuracy (Fig. 6B ). This versatility distinguishes the technique from the various RNA-seq technologies designed to tackle only one specific question or biochemical assays. Furthermore, since nanopore sequencing is a bona fide single-molecule method, molecular heterogeneity at the transcriptome level can be analyzed. Additionally, even minor RNA populations can be detected that are inevitably lost in ensemble sequencing approaches. However, we observed a complex transcription pattern with multiple possible RNA variants. Given that transcription and translation are coupled in E. coli, new questions about the translation efficiency and transcript stability of the transcript variants emerge (Proshkin et al. 2010; Wang et al. 2020; Webster et al. 2020; Irastortza-Olaziregi and Amster-Choder 2021) . Furthermore, high-quality long-read RNAseq data can be used to analyze degradation or processing patterns to gain new insights into mRNA decay in prokaryotes. With this study, we not only show the applicability of ONT RNA-seq in prokaryotes, but also provide representative long-read transcriptome data from E. coli and a robust bioinformatical workflow to the community that can be used to tackle various questions.

Escherichia coli K-12 MG1655 cells were grown in rich medium (10 g tryptone, 5 g yeast extract, 5 g NaCl per liter, pH 7.2) to an OD 600nm of 0.5-0.6. To stabilize RNAs, two volumes of RNAlater (Thermo Fisher Scientific) were immediately added to the cultures and stored at −20°C until cells were harvested by centrifugation at 4°C.

Total RNA of all samples except RNA001 was extracted using RNeasy Mini Kit (Qiagen) according to the manufacturer's instructions. RNA001 RNA was purified using the Monarch Total RNA Miniprep Kit (New England Biolabs). The integrity of total RNA from E. coli was assessed via a Bioanalyzer (Agilent) run using the RNA 6000 Pico Kit (Agilent), and only RNAs with RNA integrity numbers (RIN) above 9.5 were used for subsequent treatments and sequencing. In short, the RIN value, calculated on a scale from 0 to 10, has evolved as a standard to estimate integrity of RNA samples from the size distribution and is calculated by an algorithm that is based on the combination of different features, like 16S and 23S rRNA areas (Schroeder et al. 2006 ).

Next, RNAs were heat incubated at 70°C for 2 min and snap cooled on a prechilled freezer block before polyadenylating RNAs using the E. coli poly(A) polymerase (New England Biolabs). Briefly, 5 µg RNA, 20 units poly(A) polymerase, 5 µL reaction buffer and 1 mM ATP were incubated for 15 min at 37°C in a total reaction volume of 50 µL. Note that the identical reaction conditions were chosen here as described in the SMRT-Cap protocol that resulted in successful and efficient poly(A)-tailing (Yan et al. 2018) . To stop and clean up the reaction, poly(A)-tailed RNAs were purified following the RNeasy Micro clean-up protocol (Qiagen), which was used for all subsequent RNA clean-ups. The efficiency of poly(A)-tailing was evaluated via a Bioanalyzer run. Ribosomal RNA (rRNA) depletion was performed using the Pan-Prokaryote riboPOOL by siTOOLs, which effectively removes rRNAs from E. coli. For TEX-treated samples, partial digestion of RNAs that are not 5 ′ -triphosphorylated (e.g., tRNAs, rRNAs) was achieved by incubation of the RNA with a 5 ′ -Phosphate-dependent Terminator Exonuclease (TEX, Lucigen). Therefore, 10 µg of RNA used in the RNA001 sample, were incubated with 1 unit TEX, 2 µL TEX reaction buffer, and 0.5 µL RiboGuard RNase Inhibitor (Lucigen) in a total volume of 20 µL for 60 min at 30°C. Besides, 20 ng of rRNA-depleted samples subsequently used in the PCR-cDNA workflow (replicate 4 and 5), were only partially TEX-treated using the same enzyme and buffer concentrations but reducing the reaction time to 15 min. All reactions were terminated by adding EDTA and cleaned up following the RNeasy Micro clean-up protocol. Before library preparation, the extent of the remaining buffer and DNA contamination were tested by performing standard spectroscopic measurements (NanoDrop One) and using the Qubit 1× dsDNA HS assay kit (Thermo Fisher Scientific). Input RNAs were finally quantified using the Qubit RNA HS assay kit.

Libraries for nanopore sequencing were prepared from poly(A)tailed RNAs according to protocols provided by Oxford Nanopore (Oxford Nanopore Technologies) for direct sequencing of native RNAs (SQK-RNA001, SQK-RNA002), direct cDNA native barcoding (SQK-DCS109 with EXP-NBD104), and PCR-cDNA barcoding (SQK-PCB109) with the following minor modifications: Agencourt AMPure XP magnetic beads (Beckman Coulter) in combination with 1 µL of RiboGuard RNase Inhibitor (Lucigen) were used instead of the recommended Agencourt RNAclean XP beads to clean up samples. For reverse transcription, Maxima H Minus Reverse Transcriptase (Thermo Fisher Scientific) was used for all cDNA samples and for the RNA002 samples (SuperScript III Reverse Transcriptase from Thermo Fisher Scientific used for RNA001 sample). The amount of input RNA, barcoding strategy, number of PCR cycles, and extension times can be found in Supplemental Table 1 and are also summarized in part in Figure 1A .

Nanopore libraries were sequenced using either a MinION Mk1B connected to a laptop with the recommended specifications for nanopore sequencing or a Mk1C. All samples were sequenced on R9.4 flow cells and the recommended scripts in MinKNOW to generate fast5 files with live basecalling enabled.

In case of an observed drop in translocation speed and subsequent reduced read quality, the flow cells were refueled with flush buffer, as recommended by ONT. Flow cells were subsequently washed and reused for further runs, provided there were a sufficient number of active pores left. To avoid cross-contamination of reads, a different set of barcodes was used for the next run. Also, the starting voltage of reused flow cells was adjusted for the next run to account for the voltage drift during a sequencing run.

Basecalling, demultiplexing of raw reads, and quality control of raw reads All fast5 reads were re-basecalled using guppy (ont-guppy-for-mk1c v4.3.4) in high-accuracy mode (rna_r9.4.1_70bps_hac.cfg, dna_r9.4.1_450bps_hac .cfg) without quality filtering. While standard parameters were used for basecalling fast5s from cDNA sequencing, fast5 files from RNA sequencing were basecalled with RNA-specific parameters (-calib_detect, -reverse_sequence and -u_substitution). Next, basecalled fastq files from cDNA runs were demultiplexed in a separate step by the guppy suite command guppy_barcoder using default parameters and the respective barcoding kit. After that, relevant information from the guppy sequencing and barcode summary files were extracted to analyze the properties of raw reads (Supplemental Table 1 ). Please note that in Supplemental Table 2 , all figures created from numerical data are referenced and linked to the corresponding code in the Github repository (github.com/felixgrunberger/microbepore).

Files were mapped to the reference genome from Escherichia coli K-12 MG1655 (GenBank: U00096.3) (Riley et al. 2006) , using mini-map2 (release 2.18-r1015, github.com/lh3/minimap2) (Li 2018) . Output alignments in the SAM format were generated with -ax splice -k14 for nanopore 2D cDNA-seq and -ax splice, -uf, -k14 for direct RNA-seq with (i) -p set to 0.99, to return primary and secondary mappings, and (ii) with -MD turned on, to include the MD tag for calculating mapping identities. Alignment files were further converted to bam files, sorted, and indexed using SAMtools (Li et al. 2009 ). To evaluate the alignments, we first calculated the aligned read length by adding the number of M(atch) and I(nsertion) characters in the CIGAR string (Soneson et al. 2019) . Based on this, the mapping identity was defined as (1-NM/aligned_reads) × 100, where NM is the edit distance reported taken from minimap2. Read basecalling and mapping metrics can be found in Supplemental Table 1 . To analyze single reads in more detail with respect to the RNA type (mRNA, rRNA, other ncRNA, unspecified) they map to, bam files were first converted back to FASTQ using bedtools v2.29.2 (Quinlan and Hall 2010). Next, FASTQ files were remapped to a transcriptome file using minimap2 with the previously mentioned parameters to assign single-read names with feature IDs. To handle multimapping reads, only the mapping location with (i) the highest overall identity or if identical (ii) the position with most aligned bases was kept for every read ID.

A publicly available short-read Illumina data set (SRR1927169) obtained from RNA-seq data of E. coli K-12 grown under rich conditions was downloaded from Gene Expression Omnibus (GEO) GSE67218. Reads were first quality trimmed using Trimmomatic v0.39 (Bolger et al. 2014) (leading:20, trailing:20, slidingwindow:4:20, minlen:12) and mapped to the reference genome using bowtie2 (-N 0, -L 26) (Langmead and Salzberg 2012) .

SMRT-Cap data obtained from sequencing data from rich-medium samples (SRR7533626, SRR7533627) were downloaded from GEO GSE117273 (Yan et al. 2018) . PacBio reads were processed as described in the SMRT-Cap protocol using the pacbio_trim.py script downloaded from github.com/elitaone/SMRTcappable-seq. In short, reads were filtered and trimmed using the respective filter and poly functions. Next, reads were mapped to the E. coli K-12 genome using minimap2 with PacBio-specific (-ax map-pb) options (Li 2018) . Bam files from Illumina and SMRT-Cap sequencing were converted to FASTQ format and remapped to the gene file as described before.

To estimate gene abundances from ONT, short-read Illumina and SMRT-Cap libraries, Salmon (v.1.4.0) was applied in alignment-based mode (Patro et al. 2017) . Transcripts per million (TPM) were recalculated using the salmon-computed effective transcript length, after dropping reads mapping to rRNAs, that are variable between nondepleted and depleted RNA sets.

Full-length cDNA reads containing strand-switching primer (SSP) and anchored oligo(dT) VN primer (VNP) in the correct orientation were identified using Pychopper (v.2.5.0) with standard parameters using the default pHMM backend and autotuned cutoff parameters estimated from subsampled data (github.com/ nanoporetech/pychopper). After a first round, a second round of Pychopper was applied to the unclassified direct cDNA reads with DCS-specific read rescue enabled. Reads from rescued and full-length folders were merged and used for subsequent steps. To evaluate the influence of different trimming approaches on the accuracy of transcript boundary analysis, we applied additional 5 ′ and 3 ′ trimming steps using Cutadapt v3.2 (Martin 2011) . To this end, poly(A) sequences were removed from the 3 ′ ends (-a A{10}, -e 1, -j 0) and remaining SSP sequences were removed from the 5 ′ ends (-g TTTCTGTTGGTGCTGATATTGCTGGG, -e 1, -j 0) of direct RNA and full-length sequenced cDNA reads. Finally, trimmed reads were mapped using minimap2 as described before. Reads with more than 10 clipped bases on either side were removed from the alignments using samclip (v.0.4.0, github.com/tseemann/samclip).

To assess the impact of trimmings on gene body coverage, a coverage meta-analysis was performed. First, a transcript file was created for all genes with an ONT-annotated primary 5 ′ and 3 ′ end (see next section). Based on this, strand-specific coverage files were created from the bam files and coverage analysis performed using a custom R script. The genomic coordinates and the counted reads per position were first scaled to values between 0 and 100 and the mean coverage distribution per normalized position was calculated. To evaluate the coverage profiles and the decay at the 5 ′ or 3 ′ ends, we calculated the quartile co-efficient of variation (interquartile range/median) (Parker et al. 2020 ) and additionally compared the mean coverage in the first and last 10% of the positions to the median values.

The determination of enriched 5 ′ and 3 ′ ends was carried out in the same way, but independently of each other, and is briefly explained in the following: First, strand-specific read ends in bedgraph format were created from bam files using bedtools genomecov (-5 or -3 option, -bga) (Quinlan and Hall 2010) . Next, the previously published Termseq_peaks script (Adams et al. 2021 ) was used to call peaks for each sample individually without including replicates (github.com/NICHD-BSPC/termseq-peaks). This script is based on scipy.signal.find_peaks, which is running in the background of Termseq_peaks with lenient parameters (prominence = (None,None), width = (1,None), rel_height = 0.75). However, we deliberately used Termseq_peaks since its ability to include replicates by applying an Irreproducible Discovery Rate method, which can be applied to future studies. For end detection, only the leniently called peaks in the narrowPeak file were used after adding the number of counts for each position using bedtools intersect. Enriched positions were finally filtered and annotated based on the following criteria: (i) For each peak the position with the highest number of reads was selected. (ii) Positions within 20 bases were merged and only the position with the highest number of reads retained. (iii) Positions with less than three reads were filtered out. (iv) Positions were assigned based on their relative orientation to a gene and their respective peak height as primary (depending on 5 ′ or 3 ′ detection: highest peak within 300 bases upstream or downstream from a gene, respectively), secondary (each additional peak 300 bases up/downstream from a gene) and internal (each peak in the coding range).

Reproducibility and comparability of primary 5 ′ and 3 ′ ends were evaluated based on Pearson coefficients calculated from pairwise complete observations. Additionally, 5 ′ and 3 ′ untranslated regions (UTRs) were calculated based on the distance of the enriched primary site to the start or end of a coding region, respectively. The positions of primary sites called from direct RNA-seq data were corrected by 12 bases.

Tables containing each read as a single row were created from the bam files using the R package genomic alignments (Lawrence et al. 2013) . Reads that mapped to the opposite strand of an annotated mRNA or ncRNA or that mapped to widely separated genomic positions were discarded. Next, all range overlaps sharing more than 100 bases were defined between the read table and the genomic feature table using the findOverlaps function from the GenomicRanges package. This way, multiple features can be assigned to each individual read. If their genomic positions are adjacent, the combination of features covered by a coverage-dependent number of reads (10 reads for PCR-cDNA replicate 4) are considered as a transcriptional unit. To enable a quantitative assessment of the transcriptional units and the respective context, the number of reads is first determined for each feature individually and then compared with the number of reads in each detected unit. We compared the transcriptional units with the operon tables from the RegulonDB database (Santos-Zavaleta et al. 2019) and the SMRT-Cappable-seq study (Yan et al. 2018) .

In addition to the publicly available results from the SMRT-Cappable-seq study (Yan et al. 2018) , the short-read Illumina data for gene expression comparison and the RegulonDB (Santos-Zavaleta et al. 2019) mentioned above, we also compared ONT RNA-seq 5 ′ ends with the results of a differential RNA-seq study (Thomason et al. 2015) and 3 ′ ends with Termseq results (Dar and Sorek 2018) .

To facilitate easier access basecalled and demultiplexed FASTQ, mapped bam files from untrimmed reads and large read summary files are publicly available from zenodo.org/record/4879174#. YLSkjy221pQ. All scripts and codes used in this work are available on GitHub (github.com/felixgrunberger/microbepore). Additionally, a more detailed documentation can be found at felixgrunberger.github.io/microbepore. Sequencing files in original fast5 format are publicly available in the Sequence Read Archive SRA (RNA001: PRJNA632538; all other data sets: PRJNA731531).

Supplemental material is available for this article.

Meet the First Author(s) is a new editorial feature within RNA, in which the first author(s) of research-based papers in each issue have the opportunity to introduce themselves and their work to readers of RNA and the RNA research community. Felix Grünberger is the first author of this paper, "Nanopore sequencing of RNA and cDNA molecules in Escherichia coli." Felix is currently a postdoctoral fellow in Dina Grohmann's laboratory at the Institute of Microbiology and German Archaea Centre at the University of Regensburg, with a research focus on using sequencing-based techniques to learn more about general and regulatory features of archaeal transcription.

What are the major results described in your paper and how do they impact this branch of the field?

In our new manuscript, we evaluate how nanopore sequencing can be used to perform RNA-seq in prokaryotes. Therefore, we per-formed a comprehensive comparison of all currently available RNA-seq protocols from Oxford Nanopore, namely direct RNA sequencing, direct cDNA sequencing and PCR-cDNA sequencing in Escherichia coli. The main advantage of this sequencing technology is that it captures both ends of a transcript, which can be used to analyze transcriptional heterogeneity and processing patterns on the single-molecule level. Additionally, numerous other transcriptomics features, like start and termination sites, transcriptional units, and gene expression levels can be mapped simultaneously with high accuracy. While nanopore sequencing is quite popular in microbial genomics thanks to the super long reads, it has been hardly used for transcriptomics so far since the protocols from Oxford Nanopore are optimized for polyadenylated eukaryotic RNAs. By providing a detailed wet laboratory protocol and bioinformatical analysis, and by discussing critical considerations of the different applications, we hope to help the microbiological community get started using nanopore RNA-seq in Bacteria or Archaea.

What led you to study RNA or this aspect of RNA science? I did my PhD in Winfried Hausner's laboratory, working on gene regulatory networks and looking into regulatory aspects of archaeal transcription factors, using RNA mainly as a measure for gene expression. However, in the course of various projects and the longer I work with RNA, the more I realize that RNA is such a fascinating and versatile molecule with huge regulatory and medical impact and potential. In Regensburg, we are lucky to be part of the collaborative research group SFB960 that is investigating principles of RNP biogenesis and offers a fantastic opportunity to learn about different topics of RNA biology. For the nanopore project, for example, we collaborate with the group of Sébastien Continued Ferreira-Cerca, trying to use the technology to learn more about the insufficiently described rRNA maturation pathway in Archaea. Unfortunately, working with archaeal organisms, sooner or later, you realize that some methods cannot be applied or are not validated yet, which is a bit frustrating and delays many discoveries. This is precisely the motivation for this publication: providing the microbiological community with a comprehensive comparative study to be able to use the technology for their own research.

During the course of these experiments, were there any surprising results or particular difficulties that altered your thinking and subsequent focus?

At first, we were a bit surprised by how many different features you can look at with quite good accuracy compared to short-read methods. However, the most striking thing from a biological perspective was seeing so much transcriptional heterogeneity when we analyzed bacterial transcriptional units. Although we are aware of the speed and the multitude of all processes taking place in a cell, we did not expect this. While the publication did not change our research focus, we believe that nanopore RNA-seq offers an exciting opportunity to investigate this heterogeneity, and we will definitely use this tool in the future.

What are some of the landmark moments that provoked your interest in science or your development as a scientist? I cannot say that there was a single decisive moment when I knew that I wanted to become a scientist. It was and still is a continuous process in which I become even more aware with each new project of how much there is still to discover, which is kind of fascinating. Looking back, there were probably a few key experiences, including reading National Geographic, passionate teachers and professors, who conveyed this fascination.

Are there specific individuals or groups who have influenced your philosophy or approach to science?

When I started coding in the first year of my PhD and got more involved with bioinformatics, I tried to use as many different media and sources as possible to progress quickly. Among other things, I listened a lot to the data science podcast "Not so standard deviations" by Hilary Parker and Roger Peng. Although it had nothing to do with my own scientific questions, they managed to talk about topics like reproducible research, analysis correctness, experimental design and data science in academia and industry in an interesting and actually fun way. During my biology studies, I had not come into contact with any of the topics before, but now many of the principles are essential when I analyze data.

Regulatory roles of Escherichia coli 5 ′ UTR and ORF-internal RNAs detected by 3 ′ end mapping

Quantitative profiling of pseudouridylation dynamics in native RNAs with nanopore sequencing

Longread sequencing-a powerful tool in viral transcriptome research

Trimmomatic: a flexible trimmer for Illumina sequence data

Magic-BLAST, an accurate RNA-seq aligner for long and short reads

Nanopore long-read RNAseq reveals widespread transcriptional variation among the surface receptors of individual B cells

Realizing the potential of full-length transcriptome sequencing

On the study of microbial transcriptomes using second-and third-generation sequencing technologies

Studying bacterial transcriptomes using RNA-seq

High-resolution RNA 3 ′ -ends mapping of bacterial Rho-dependent transcripts

The long and the short of it: unlocking nanopore long-read RNA sequencing data with short-read differential expression analysis tools

Real-time DNA sequencing from single polymerase molecules

The road to metagenomics: from microbiology to DNA sequencing technologies and bioinformatics

Highly parallel direct RNA sequencing on an array of nanopores

E. coli rpsO mRNA decay: RNase E processing at the beginning of the coding sequence stimulates poly (A)-dependent degradation of the mRNA

tmRNA-mediated trans-translation as the major ribosome rescue system in a bacterial cell

Bacterial RNA biology on a genome scale

Coupled transcription-translation in prokaryotes: an old couple with new surprises

Decoding the epitranscriptional landscape from native RNA sequences

Full-length RNA profiling reveals pervasive bidirectional transcription terminators in bacteria

Direct RNA sequencing of the coding complete influenza A virus genome

Fast gapped-read alignment with Bowtie 2

Software for computing and annotating genomic ranges

Advancements in next-generation sequencing

Minimap2: pairwise alignment for nucleotide sequences

Durbin R; 1000 Genome Project Data Processing Subgroup

RNase E is required for the maturation of ssrA RNA and normal ssrA RNA peptide-tagging activity

Accurate detection of m6A RNA modifications in native RNA sequences

Cutadapt removes adapter sequences from highthroughput sequencing reads

Amplification of cDNA ends based on template-switching effect and step-out PCR

A first look at the Oxford Nanopore MinION sequencer

Next-generation sequencing techniques for eukaryotic microorganisms: sequencing-based solutions to biological problems

Nanopore direct RNA sequencing maps the complexity of Arabidopsis mRNA processing and m6A modification

Salmon provides fast and bias-aware quantification of transcript expression

Antisense artifacts in transcriptome microarray experiments are resolved by actinomycin D

Evaluating the genome and resistome of extensively drug-resistant Klebsiella pneumoniae using native DNA and RNA Nanopore sequencing

Cooperation between translating ribosomes and RNA polymerase in transcription elongation

BEDTools: a flexible suite of utilities for comparing genomic features

Decay of mRNA encoding ribosomal protein S15 of Escherichia coli is initiated by an RNase E-dependent endonucleolytic cleavage that removes the 3 ′ stabilizing stem and loop structure

Initiation, attenuation and RNase III processing of transcripts from the Escherichia coli operon encoding ribosomal protein S15 and polynucleotide phosphorylase

Escherichia coli K-12: a cooperatively developed annotation snapshot-2005

Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis

New RNA-seq approaches for the study of bacterial pathogens

RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge of gene regulation in E. coli K-12

The RIN: an RNA integrity number for assigning integrity values to RNA measurements

Transcript identification through long-read sequencing

Transcriptome profiling of mouse samples using nanopore sequencing of cDNA and RNA molecules

Reading canonical and modified nucleobases in 16S ribosomal RNA using nanopore native RNA sequencing

Molecular barcoding of native RNAs using nanopore sequencing and deep learning

A comprehensive examination of Nanopore native RNA sequencing for characterization of complex transcriptomes

Fast and sensitive mapping of nanopore sequencing reads with GraphMap

RNA sequencing: the teenage years

Global transcriptional start site mapping using differential RNA sequencing reveals novel antisense RNAs in Escherichia coli

Comprehensive transcriptome analysis using synthetic long-read sequencing reveals molecular co-association of distant splicing events

Multiple long-read sequencing survey of herpes simplex virus dynamic transcriptome

Self-priming of reverse transcriptase impairs strand-specific detection of dengue virus RNA

Direct RNA nanopore sequencing of full-length coronavirus genomes provides novel insights into structural variants and enables modification analysis

Analysis of RNA base modification and structural rearrangement by single-molecule real-time detection of reverse transcription

RNA-Seq: a revolutionary tool for transcriptomics

Structural basis of transcription-translation coupling

The SARS-CoV-2 subgenome landscape and its novel regulatory features

Structural basis of transcription-translation coupling and collision in bacteria

Nanopore native RNA sequencing of a human poly(A) transcriptome

SMRT-Cappable-seq reveals complex operon variants in bacteria

Direct metatranscriptome RNA-seq and multiplex RT-PCR amplicon sequencing on nanopore MinIONpromising strategies for multiplex identification of viable pathogens in food

Analysis of transcriptome and epitranscriptome in plants using PacBio Iso-seq and nanopore-based direct RNA sequencing

We thank all the members of the Ferreira-Cerca laboratory and of the Grohmann laboratory, especially Prof. Dr. Winfried Hausner and Martin Fenk for fruitful discussions. This work was supported by the Deutsche Forschungsgemeinschaft (SFB960 TPA7 to D.G., and SFB960 TPB13 to S.F.-C.); funding for open access charge: Deutsche Forschungsgemeinschaft.Received August 2, 2021; accepted November 29, 2021.