Analysis of next- and third-generation RNA-Seq data reveals the structures of alternative transcription units in bacterial genomes


1 

 
Analysis of next- and third-generation RNA-Seq data reveals the structures of 1 

alternative transcription units in bacterial genomes  2 

Qi Wang1, Zhaoqian Liu1,2, Bo Yan3, Wen-Chi Chou4, Laurence Ettwiller3, Qin Ma2,†, and Bingqiang 3 

Liu1,† 4 

1 School of Mathematics, Shandong University, Jinan 250200, China. 5 

2 Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, 6 

OH 43210, USA. 7 

3 New England Biolabs Inc., Ipswich, MA 01938, USA. 8 

4 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 9 

02142, USA. 10 

†Corresponding author. Email: bingqiang@sdu.edu.cn (B.L.); qin.ma@osumc.edu (Q.M.) 11 

ABSTRACT 12 

Alternative transcription units (ATUs) are dynamically encoded under different conditions or 13 

environmental stimuli in bacterial genomes, and genome-scale identification of ATUs is essential for 14 

studying the emergence of human diseases caused by bacterial organisms. However, it is unrealistic to 15 

identify all ATUs using experimental techniques, due to the complexity and dynamic nature of ATUs. 16 

Here we present the first-of-its-kind computational framework, named SeqATU, for genome-scale ATU 17 

prediction based on next-generation RNA-Seq data. The framework utilizes a convex quadratic 18 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


2 

 
programming model to seek an optimum expression combination of all of the to-be-identified ATUs. 19 

The predicted ATUs in E. coli reached a precision of 0.77/0.74 and a recall of 0.75/0.76 in the two RNA-20 

Sequencing datasets compared with the benchmarked ATUs from third-generation RNA-Seq data. We 21 

believe that the ATUs identified by SeqATU can provide fundamental knowledge to guide the 22 

reconstruction of transcriptional regulatory networks in bacterial genomes. 23 

INTRODUCTION 24 

An operon in bacterial genomes is defined as a group of consecutive genes regulated by a common 25 

promoter that all share the same terminator (1). Genes in the same operon generally encode proteins 26 

with relevant or similar biological functions; e.g., lacZ, lacY, and lacA in the lac operon encode proteins 27 

that help cells use lactose (1, 2). With decades of research on bacterial transcriptional regulation, the 28 

operon model has been found to have complex mechanisms that control expression (3-5). Multiple 29 

studies have shown that bacterial genes are dynamically transcribed under different triggering 30 

conditions, leading to shared genes among different mRNA transcripts (6-8). This dynamic architecture 31 

can be redefined by all of the alternative transcription units (a.k.a., ATUs) (3, 5), and more details can be 32 

found in fig. S1.  33 

ATU identification is of fundamental importance for understanding the transcriptional regulatory 34 

mechanisms of bacteria, and these dynamic structures have been demonstrated to be associated with 35 

human diseases (9-12). For example, Bhat et al. studied the alr-groEL1 operon, which is essential for the 36 

survival or virulence of M. tuberculosis (9, 11), the causative agent of tuberculosis (TB), and found that 37 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


3 

 
the regulation of the sub-operon is distinct from the main operon (alr-groEL1 operon) under stress, 38 

especially during heat shock, pH, and SDS stresses (9). Another example is Helicobacter pylori, a 39 

gastric pathogen that is the primary known risk factor for gastric cancer (12). Sharma et al. found an 40 

acid-induced sub-operon cag22-18 transcribed from the primary cag25-18 operon in the cag 41 

pathogenicity island of the H. pylori genome under acid stress (10). The mechanism of the complex ATU 42 

structure in these pathogenic bacteria can help us to study the emergence of human diseases caused by 43 

bacterial organisms. 44 

Several newly developed techniques have provided a comprehensive view of the E. coli 45 

transcriptome by identifying full-length primary transcripts (13-17). For example, SMRT-Cappable-seq 46 

(6) combines the isolation of the full-length bacterial primary transcriptome with PacBio SMRT (Single 47 

Molecule, Real-Time) sequencing (6), and simultaneous 5’ and 3’ end sequencing (SEnd-seq) (7) 48 

captures both transcription start sites (TSSs) and transcription termination sites (TTSs) via 49 

circularization of transcripts (17). Despite the great progress in experimental techniques, there are still 50 

some deficiencies. On the one hand, the read depth and error rate of the third-generation sequencing 51 

used in SMRT-Cappable-seq have an impact on ATU prediction compared with Illumina-based RNA-52 

Seq (7, 18). On the other hand, the time-consuming, laborious, and costly properties of these 53 

experimental techniques make them unrealistic to be generally applicable to ATU predictions in bacteria 54 

under specific conditions. Thus, novel and robust computational methods for ATU identification in 55 

bacterial genomes based on RNA-Seq are urgently needed. 56 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


4 

 
Fortunately, many computational studies have been carried out to predict ATUs in bacteria, which 57 

have provided some preliminary studies for ATU prediction. Several public databases, such as 58 

RegulonDB (19), DBTBS(20), MicrobesOnline (21), DOOR (22, 23), OperomeDB (24), DMINDA 2.0 59 

(25), and ProOpDB (26), provide various levels of operon information and small amounts of ATU 60 

information. However, these databases cannot provide genome-scale ATU information under specific 61 

conditions. Some computational studies, including Rockhopper (27), SeqTU (4, 28), BAC-62 

BROWSER(29), rSeqTU (5), and Operon-mapper (30), utilize machine learning and model integration 63 

methods based on genomic information and gene expression profiles to identify bacterial transcription 64 

architecture. However, these works still cannot solve the dynamic patterns and overlapping nature of 65 

ATUs. 66 

Here, we present SeqATU, a novel computational method for genome-scale ATU prediction by 67 

analyzing next- and third-generation RNA-Seq data (Fig. 1 and table S1). SeqATU utilizes a convex 68 

quadratic programming model (CQP) and aims to provide the optimum expression combination of all of 69 

the to-be-identified ATUs. Specifically, CQP minimizes the squared error between the predicted 70 

expression level of ATUs and the actual expression levels in genetic and intergenic regions. It is 71 

noteworthy that SeqATU also utilizes the information about the bias rate function in modeling non-72 

uniform read distribution as the linear constraints of CQP to profile the complexity of the ATU 73 

architecture. Overall, SeqATU provides a generalized framework for the inference of ATUs based on 74 

next-generation RNA-Seq data collected under multiple conditions and can be easily applied to any 75 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


5 

 
bacterial organism to identify the ATU architecture and construct a transcriptional regulatory network. 76 

Please place Fig. 1 here. 77 

MATERIALS AND METHODS 78 

Data collection 79 

The two Cappable RNA-Seq datasets used in this study, M9Enrich_Seq and RiEnrich_Seq, were 80 

obtained from E. coli grown under two different conditions: M9 minimal medium and Rich medium, 81 

respectively (6). The full-length primary transcripts were enriched as described in (6) with modifications 82 

to be adapted to Illumina sequencing. The capping and polyA tailing were performed as described in (6). 83 

The capped RNA was enriched using hydrophilic streptavidin magnetic beads (New England Biolabs) 84 

and eluted with Biotin using the same condition (6). Differently, the eluted RNA was enriched once 85 

more using streptavidin beads to further remove processed RNA (e.g., rRNA). Subsequently, the eluted 86 

RNA was used for library preparation using NEBNext Ultra II directional RNA library prep kit (E7760). 87 

Sequencing was performed on the Illumina Miseq system (paired-end, 100bp). All reads were mapped to 88 

the E. coli genome using Burrows-Wheeler Aligner (BWA) with the default parameters (31). Read 89 

alignment and other computational analyses were carried out using the E. coli genome NC_000913.3, 90 

and the corresponding gene annotations (GCF_000005845.2_ASM584v2_genomic.gff) were 91 

downloaded from NCBI. Two experimentally verified ATU datasets, SMRT_M9Enrich and 92 

SMRT_RiEnrich, were used as the benchmark data to evaluate the predicted ATUs, which were 93 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


6 

 
generated by SMRT-Cappable-seq under the same conditions as the Illumina datasets M9Enrich_Seq 94 

and RiEnrich_Seq, respectively (6). In addition, the ATUs defined by RegulonDB (19) and SEnd-seq (7) 95 

were also used as additional evaluation data in our study. 96 

Calculation of the expression values of genetic and intergenic regions 97 

After the RNA-Seq reads in M9Enrich_Seq and RiEnrich_Seq were mapped to the E. coli genome using 98 

BWA, we determined the number of reads �(�) covering each genomic position �. Suppose that �� 99 

and ���� are two consecutive genes on the same strand; we denote the expression value of ��  as ��  100 

and the expression value of the intergenic region between genes ��  and ���� as ��,���. Then, the 101 

calculation of �� and ��,��� is given by: 102 

�� =
∑ �(�)�∈��

|�� |
 (1) 

��,��� =
∑ �(�)�∈��,���

|��,���|
 (2) 

where � ∈ �� denotes that genomic position � is on the gene ��  and |�� | denotes the genomic 103 

length of ��. 104 

Modeling non-uniform read distribution along mRNA transcripts 105 

We introduced the bias rate function, which is similar to the bias curves in the work of Wu et al. (32), to 106 

address the non-uniform distribution of the RNA-Seq reads along mRNA transcripts (32-35). The bias 107 

function reflects the relative read distribution bias from the 3’ end to the 5’ end of an mRNA transcript. 108 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


7 

 
We assumed that the maximum read coverage of all the genomic positions of an mRNA transcript is the 109 

expression level without bias. It is noteworthy that a single gene mRNA transcript with no shared gene 110 

among different mRNA transcripts can serve as the ideal template for modeling non-uniform read 111 

distribution along mRNA transcripts. The specific steps of modeling non-uniform read distribution are 112 

detailed as follows: 113 

Step 1: Single Gene mRNA Transcript Selection. We selected single gene mRNA transcripts from the 114 

evaluation data and plotted their expression distributions. Specifically, 12 groups of single gene mRNA 115 

transcripts with lengths ranging from 300 to 1,500 bp were selected from the evaluation data (more 116 

details are given in method S1), and each group had ten randomly chosen mRNA transcripts. Apparent 117 

decline trends appeared in the single gene mRNA transcripts with long lengths, ranging from 1,100 to 118 

1,500 bp (fig. S2). The reason for this phenomenon may be that the incomplete transcription and 3’ end 119 

degradation or processing induce the enrichment of signal at 5’ end of the mRNA transcripts with long 120 

lengths (36, 37). Finally, we plotted the expression distribution of single gene mRNA transcripts with 121 

lengths ranging from 1,100 to 1,500 bp. 122 

Step 2: Acquiring the Bias Rate Function. We applied nonlinear regression to the expression 123 

distribution of the selected single gene mRNA transcripts and acquired the hypothetical function �(�). 124 

Specifically, the � axis and � axis of the expression distribution were converted to the distance from 125 

the 3’ end of an mRNA transcript and the bias rate of read distribution, respectively. To apply nonlinear 126 

regression to single gene mRNA transcripts with different lengths, normalization was also implemented 127 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


8 

 
on �. Here, � = (��, ��, … , ��) and � = (��, ��, … , ��) are defined by:  128 

�� =

⎩
⎨

⎧
�� − ������
���� − ��

× 10�,     �������

  
�� − ��

���� − ��
× 10�,            �������   

 (3) 

�� =

⎩
⎪
⎨

⎪
⎧

�(������)

����
,     �������

  
�(�� )

����
,            �������   

 (4) 

where � denotes the number of genomic positions on an mRNA transcript; � = (��, ��, … , ��) denotes 129 

the genomic positions on an mRNA transcript; ���� = ��; �(�� ) denotes the expression level of the 130 

genomic position �� , i.e., the number of reads covering the genomic position �� ; and ���� denotes the 131 

expression level without bias in an mRNA transcript, which is calculated as ��� {�(�� )}, 1 ≤ � ≤ �. 132 

We used the function nls in R to acquire the hypothetical function �(�). 133 

Step 3: Constructing Bias Rate Vectors. We constructed a genetic or intergenic region bias rate vector 134 

for each mRNA transcript by calculating the bias rate of all of its component genetic or intergenic 135 

regions. The bias rate of a genetic or an intergenic region is the average bias rate of all the genomic 136 

positions that it contains. Considering an mRNA transcript � and its component gene set 137 

{��, ��, … , ��} (the details of the gene labels are described in method S2), we denoted the genetic 138 

region bias rate vector as � = (��, ��, … , �� ), which was calculated using the formula: 139 

�� =

⎩
⎪
⎨

⎪
⎧ ∑ �(�� )

������

��������

������� − ������� + 1
,      �������

         
∑ �(�� )

��
����

��� − ��� + 1
,                  �������     

 (5) 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


9 

 
where � denotes the number of genomic positions on �; ��  denotes the bias rate of ��  for �; and 140 

�� = (��� , ��� , ��� , ��� , … , ��� , ��� ) is the range of the genomic positions of {��, ��, … , ��}, while the 141 

range of the genomic positions of ��  is [��� , ��� ], 1 ≤ � ≤ �. Similarly, the calculation of the intergenic 142 

region bias rate vector � = (��, ��, … , ����) is provided in method S3. 143 

Modification of maximal ATU clusters 144 

A maximal ATU cluster is defined as a maximal consecutive gene set such that each pair of its 145 

consecutive genes can be covered by at least one ATU. Similar to ATUs, maximal ATU clusters are also 146 

dynamically composed under different conditions or environmental stimuli in bacterial genomes (5, 38). 147 

Such a maximal ATU cluster can be used as an independent genomic region for ATU prediction, which 148 

alleviates the difficulty in computationally predicting ATUs at the genome scale. The output of our in-149 

house tool rSeqTU can serve as the maximal ATU cluster data, which lays a solid foundation for ATU 150 

prediction (5). We modified the maximal ATU clusters from rSeqTU: (i) two maximal ATU clusters with 151 

distances less than 40 bp were combined into one cluster and (ii) a maximal ATU cluster was split at the 152 

intergenic region where the opposite-strand genes were located. In addition, we selected the maximal 153 

ATU clusters with expression values over ten (see the details in method S4), according to the study of 154 

Etwiller et al. (13).  155 

The mathematical programming model for ATU prediction 156 

The predicted ATU expression profile should be consistent with the observed expression profiles of the 157 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


10 

 
genetic and intergenic regions. Therefore, the prediction of the ATU profiles can be modeled as an 158 

optimization problem, which seeks an optimum expression combination of all of the to-be-identified 159 

ATUs to minimize the gap between the predicted ATUs and the observed genetic and intergenic region 160 

expression profiles. Here, a convex quadratic programming model was built to solve this optimization 161 

problem. 162 

We denoted a maximal ATU cluster as �, assuming that it contains the consecutive genes 163 

{��, … , ��}, and the intergenic regions of these genes are {��,�, … , ����,�}. The size of � is defined as 164 

the number of its component genes �. Theoretically, there are 
�×(���)

�
 ATUs for �, and an ATU with 165 

consecutive genes {�� , ����, … , �� } is denoted as �
�,� ; the corresponding expression value is ��,�, 1 ≤166 

� ≤ � ≤ �.  167 

For the component gene �� of �, the gap between the gene expression value ��  and the sum of the 168 

expression level of the ATUs containing it is denoted as ��, which provides the first � equality 169 

constraints in our mathematical programming model, � = 1,2, … , �. Similarly, for the intergenic region 170 

��,��� of �, the gap between the intergenic region expression value ��,��� and the sum of the 171 

expression level of the ATUs containing it is denoted as ��, providing the last � − 1 equality 172 

constraints in our mathematical programming model, � = 1,2, … , � − 1.  173 

The goal of our mathematical programming model is to minimize the square of � =174 

(��, ��, … , ��, ��, … , ����), as the combination of �
�,� with a minimal value of ��� is corresponding to 175 

an optimum expression combination of all ATUs ��,� for �, 1 ≤ � ≤ � ≤ �. Additionally, to control the 176 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


11 

 
number of optimal solutions and reduce the false-positive errors, we added an �� regularization �||�||� 177 

to ��� with ��,� ≥ 0, which is a linear function. Because of the variant expression level of different 178 

maximal ATU clusters, we used the expression value of � as �. In total, the convex quadratic 179 

programming model with unknown variables (�, �) is shown as follows: 180 

��� ��� + �||�||�  

�. �. ∑ ∑ ��,� �
�,��

���
�
��� = �� + ��   � = 1,2, … , � 

 ∑ ∑ ��,����
�,��

�����
�
��� =  ��,��� + ��  � = 1,2, … , � − 1 

 � = ���,� �, ��,� ≥ 0  1 ≤ � ≤ � ≤ � 

 � = (��, ��, … , �� , ��, … , ����) (6) 

where � = (��,� ) is the genetic region bias rate vector for �, ��,� is the bias rate of gene ��  for ATU 181 

��,�, 1 ≤ � ≤ � ≤ �，� ≤ � ≤ �, � = (��,� ) is the intergenic region bias rate vector for �, and ��,� 182 

is the bias rate of the intergenic region ����,� for ATU �
�,�, 1 ≤ � < � ≤ �，� ≤ � ≤ � (see the 183 

details in method S5).  184 

Two evaluation methods for ATU prediction 185 

In the first evaluation method, precision and recall were defined based on perfect matching (Eqs. 7). 186 

Perfect matching of two ATUs means that all of their component genes are the same. Here, the true 187 

positives (��) are the number of predicted ATUs with the same component genes as an ATU in the 188 

evaluation data; the false positives (��) are the number of predicted ATUs that do not exist in the 189 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


12 

 
evaluation data; the false negatives (��) are the number of ATUs that appear in the evaluation data but 190 

not in the prediction results of SeqATU. 191 

��������� =
��

�� + ��
  

������ =
��

�� + ��
 (7) 

In the second evaluation method, precision and recall were defined based on relaxed matching, which 192 

is measured by the similarity of two ATUs. Assuming that an ATU � is in one of two datasets (the 193 

predicted ATU dataset and evaluated ATU dataset), the definition and calculation of the similarity of � 194 

are shown in the following three cases: 195 

Case 1: If � shares boundary genes at both ends of an ATU in the other dataset, i.e., all component 196 

genes of � are the same as one in the other dataset, then ����������(�) = 1. 197 

Case 2:  If � shares exactly one boundary gene of ATUs in the other dataset, then we denote �� as 198 

the ATUs in the other dataset that share the 5’-end gene with � and denoted �� as the ATUs in the 199 

other dataset that share the 3’-end gene with �, �� ∩ �� = ∅, one of �� and �� can be empty. Then, 200 

����������(�) =
1

2
�����∈��

�(��)

�(��)
+

1

2
�����∈��

�(��)

�(��)
 (8) 

where �(��) is the number of shared genes of � and �� and �(��) is the maximal size of � and ��.  201 

Case 3: If � shares no boundary genes at both ends of the ATUs in the other dataset, then 202 

����������(�) = 0. 203 

Finally, the precision and recall based on relaxed matching are calculated by the following formula: 204 

��������� =
∑ ����������(�)�∈��

��
  

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


13 

 
������ =
∑ ����������(�)�∈��

��
 (9) 

where �� is the set of predicted ATUs, �� is the number of predicted ATUs, �� is the set of evaluated 205 

ATUs, and �� is the number of evaluated ATUs. 206 

RESULTS 207 

A reliable bias rate function is acquired in modeling non-uniform read distribution along mRNA 208 

transcripts 209 

To ensure the reliability of the bias rate function in modeling non-uniform read distribution, we selected 210 

four single gene mRNA transcript datasets randomly from the two evaluation datasets 211 

(SMRT_M9Enrich and SMRT_RiEnrich), named M9Enrich_1, M9Enrich_2, RiEnrich_1, and 212 

RiEnrich_2. Four bias rate functions, which are exponential functions, were generated after conducting 213 

nonlinear regression on the mRNA transcripts across these four datasets (Fig. 2). We found that these 214 

bias rate functions were similar (�� > 0.998) when we evaluated the R-square statistic (for more 215 

details, see method S6 and table S2). The similarity of the four bias rate functions indicated that the 216 

selection of the single gene mRNA transcript datasets had little impact on modeling non-uniform read 217 

distribution along mRNA transcripts, implying the universal common non-uniform read distribution of 218 

different mRNA transcripts of E. coli. Specifically, we used the average of these four coefficients as the 219 

final coefficients of the exponential function, which was �(�) = ���� with � = 0.256 and � =220 

0.00128. 221 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


14 

 
Please place Fig. 2 here. 222 

ATUs predicted by SeqATU reach precision and recall over 0.64 223 

The performance evaluation was conducted by comparing the predicted ATUs with the ATUs in 224 

SMRT_M9Enrich and SMRT_RiEnrich, which were generated based on the third-generation sequencing 225 

and are not sensitive to transcripts with low expression levels. For a more accurate and fair evaluation, 226 

maximal ATU clusters after pre-selection were retained in the subsequent evaluations (more details 227 

about the pre-selection of maximal ATU clusters can be seen in method S7 and fig. S3). 228 

The precision and recall of the predicted ATUs were calculated for each maximal ATU cluster. By 229 

considering only perfect matching, the average precision and recall were 0.67 and 0.67 for 230 

M9Enirch_Seq and 0.64 and 0.68 for RiEnrich_Seq, respectively. When using relaxed matching, the 231 

average precision and recall increased to 0.77 and 0.75 for M9Enrich_Seq and 0.74 and 0.76 for 232 

RiEnrich_Seq, respectively. The statistics for precision and recall on maximal ATU clusters with 233 

different sizes, as shown in Fig. 3A and fig. S4A. These results showed that the average precision and 234 

recall were decreasing with the increasing size of maximal ATU clusters (other than several large size 235 

ones due to their small number of counts). The results also indicated that the evaluation results based on 236 

relaxed matching were significantly higher than those based on perfect matching across different sizes. 237 

This result implied that the incorrectly predicted ATUs by SeqATU based on perfect matching tended to 238 

have strong similarities with the ATUs in the evaluation data. In addition, we also found that more than a 239 

quarter of the incorrectly predicted ATUs (25%/29% for M9Enrich_Seq/RiEnrich_Seq) by SeqATU 240 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


15 

 
based on perfect matching matched with the transcription units in RegulonDB (19). 241 

The two evaluation datasets (SMRT_M9Enrich and SMRT_RiEnrich) were both from SMRT-242 

Cappable-seq, while one of the processing steps of the technique filtered RNA reads smaller than 1,000 243 

bp (6), which indicated that the ATUs in these two evaluation datasets were not comprehensive. To 244 

address this issue, we enriched the evaluation data by adding the ATUs defined by SEnd-seq (7), as 245 

SEnd-seq did not introduce any filtering based on RNA size. When we used the new evaluation data, the 246 

ATUs predicted by SeqATU improved by 15% (0.77) and 19% (0.76) in terms of the average precision 247 

based on perfect matching for M9Enrich_Seq and RiEnrich_Seq, respectively, and by 9% (0.84) and 248 

12% (0.83) based on relaxed matching. The statistics for precision across different sizes of the maximal 249 

ATU clusters are shown in Fig. 3B and fig. S4B, showing that the values of precision based on perfect 250 

matching were significantly improved across different sizes of maximal ATU clusters by using the 251 

evaluated ATUs from SMRT-Cappable-seq and SEnd-seq. This result suggested that the ATUs we 252 

predicted, which were not in SMRT_M9Enrich and SMRT_RiEnrich, may be due to the RNA length 253 

selection of SMRT-Cappable-seq. We enriched the evaluation data by adding the ATUs in RegulonDB 254 

(19) and also found the improvement of precision across different sizes of maximal ATU clusters for 255 

M9Enrich_Seq and RiEnrich_Seq (fig. S4C).  256 

Furthermore, to facilitate the understanding of the performance of SeqATU and to measure the 257 

influence of the maximal ATU clusters from rSeqTU on our ATU prediction method, SMRT maximal 258 

ATU clusters collected from SMRT_M9Enrich and SMRT_RiEnrich (for more details, see method S8) 259 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


16 

 
were applied for the CQP in two conditions (M9 minimal medium and Rich medium). We found that 260 

precision and recall increased to 0.73 and 0.77 for M9Enrich_Seq, respectively, and 0.69 and 0.80 for 261 

RiEnrich_Seq based on perfect matching (fig. S4D). Additionally, when using relaxed matching, 262 

precision and recall significantly increased to 0.82 and 0.84 for M9Enrich_Seq, respectively, and 0.79 263 

and 0.86 for RiEnrich_Seq (fig. S4D). The significantly improved results verified the ability of SeqATU 264 

to accurately predict ATU when giving more accurate maximal ATU clusters. In addition, we found that 265 

the number of predicted ATUs and the evaluated ATUs under the maximal ATU cluster with the same 266 

size were similar except for the maximal size (Fig. 3C), and they were far less than the theoretical 267 

number, which indicated that SeqATU can effectively exclude most of the incorrect ATUs.  268 

Please place Fig. 3 here. 269 

The bias rate constraints efficiently improve the ability of SeqATU to predict ATUs 270 

We tried to use SeqATU without bias rate constraints to predict the ATUs of E. coli and found that its 271 

performance significantly decreased compared with SeqATU (Fig. 4 and fig. S5). Specifically, the F-272 

score of SeqATU without bias rate constraints was 0.69/0.68 based on perfect matching for 273 

M9Enrich_Seq/RiEnrich_Seq, compared with 0.75/0.74 for SeqATU. When using relaxed matching, the 274 

F-score of SeqATU without bias rate constraints was 0.79/0.78 for M9Enrich_Seq/RiEnrich_Seq, 275 

compared with 0.83/0.83 for SeqATU. This result suggested that the bias rate constraints of SeqATU 276 

could capture useful information about the non-uniform distribution of the RNA-Seq reads along the 277 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


17 

 
mRNA transcripts (32-35) and then efficiently improve the ability of the model to predict complex 278 

ATUs. 279 

Please place Fig. 4 here. 280 

ATUs predicted by SeqATU display a dynamic composition and overlapping nature 281 

A total of 2,973 distinct ATUs were identified in M9 minimal medium, and 2,767 were identified in Rich 282 

medium. Among them, there were 1,423/1,550 distinct ATUs on the forward strand and 1,323/1,444 on 283 

the reverse strand for M9Enrich_Seq/RiEnrich_Seq. Each of the predicted ATUs was comprised of an 284 

average of 2.59 genes, with the largest ATU containing 28 genes across the two conditions. The 285 

distribution of the size of the predicted ATUs is shown in Fig. 5A, from which we can see that the 286 

majority of ATUs (more than 87%) contained fewer than five genes in M9 minimal medium and Rich 287 

medium. Approximately 41% of the genes in E. coli were contained in more than one ATU for 288 

M9Enrich_Seq, compared to 43% genes for RiEnrich_Seq, suggesting that the ATUs in a maximal ATU 289 

cluster generally overlapped with each other (Fig. 5B). In addition, there were 1,576 ATU maximal 290 

clusters for M9Enrich_Seq and 1,512 ATU maximal clusters for RiEnrich_Seq. SeqATU identified a 291 

total of 1,977 identical ATUs under the two conditions, whereas there were 1,786 distinct ATUs. Among 292 

the distinct ATUs across the two conditions, 394 ATUs were from the same maximal ATU clusters in the 293 

two maximal ATU cluster datasets, and the rest were from different maximal ATU clusters. The fact 294 

there were distinct ATUs under the two conditions suggests that ATUs are dynamically responsive to 295 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


18 

 
different conditions or environmental stimuli (for more real examples about the ATUs under different 296 

conditions, see fig. S6). 297 

The dynamic composition of predicted ATUs by SeqATU is of great significance to understand the 298 

interactions inside polymicrobial communities. For example, chronic airway infection by Pseudomonas 299 

aeruginosa considerably contributes to lung tissue destruction and impairment of pulmonary function in 300 

cystic-fibrosis (CF) patients (39). Marie et al. found that the presence of E. coli complemented the 301 

growth defect of a P. aeruginosa bioA-disrupted mutant that is unable to grow on rich medium, and can 302 

be beneficial to P. aeruginosa when biotin supply is limited (39). An ATU with a high expression level 303 

coded by the uvrB gene is identified by SeqATU in Rich medium, while it does not exist in M9 minimal 304 

medium (Fig. 6). We predicted the uvrB gene to be involved in the biotin metabolism pathway, as the 305 

bioB, bioF, bioC, and bioD genes contained in a same ATU with it have been known in the biotin 306 

metabolism KEGG pathway. Therefore, the observation by Marie et al. can be explained that the ATUs 307 

coded by the uvrB gene of E. coli can provide the biotin supply for P. aeruginosa under rich medium. 308 

This result showed that SeqATU could increase our understanding of interspecies competition and 309 

cooperation, which play an important role in shaping the composition and structure of polymicrobial 310 

bacterial populations. 311 

Please place Fig. 5 here. 312 

Please place Fig. 6 here. 313 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


19 

 
Predicted ATUs by SeqATU are verified by experimental TSSs and TTSs 314 

An experimental TSS dataset of E. coli from SEnd-seq (7) and a TF binding site dataset of E. coli from 315 

the experimental dataset of RegulonDB (19) were used to further verify the reliability of SeqATU and 316 

were named dataset 1 and dataset 2, respectively. There were 5,512 experimental TSSs in dataset 1 and 317 

3,220 experimental TF binding sites in dataset 2. We considered the 5’-end genes and no 5’-end genes of 318 

the predicted ATUs by SeqATU. A gene that is not the 5’-end gene of any predicted ATU is named a no 319 

5’-end gene. We identified 2,177/2,005 5’-end genes and 1,266/1,160 no 5’-end genes of the predicted 320 

ATUs for M9Enrich_Seq/RiEnich. A gene validated by experimental TSSs or TF binding sites means 321 

that it is the immediate downstream gene of an experimental TSS or TF binding site. As a result, the 322 

proportion of 5’-end genes of the predicted ATUs that were validated by experimental TSSs or TF 323 

binding sites was over 1.7 times greater than that of the no 5’-end genes (Table 1). Specifically, the 324 

proportion of 5’-end genes (29%/30% for M9Enrich_Seq/RiEnrich_Seq) validated by experimental TF 325 

binding sites was over three times greater than the no 5’-end genes (9.2%/9.0% for 326 

M9Enrich_Seq/RiEnrich_Seq). These results further verified the reliability of the ATUs predicted by 327 

SeqATU in terms of the TSS level. In addition, four other experimental TSS or promoter datasets from 328 

RegulonDB (19), dRNA-seq (14), and Cappable-seq (13) were also examined. The results are shown in 329 

table S3, and we also found a higher proportion of 5’-end genes of the predicted ATUs validated by 330 

experimental TSSs or promoters than that of no 5’-end genes. 331 

We also used two experimental TTS datasets of E. coli from SEnd-seq (7) and RegulonDB (19) to 332 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


20 

 
verify the reliability of predicted ATUs by SeqATU in terms of TTS level. These two experimental TTS 333 

datasets were named dataset 3 and dataset 4, respectively. There were 1,540 experimental TTSs in 334 

dataset 3 and 367 experimental TTSs in dataset 4. We considered the 3’-end genes and no 3’-end genes 335 

of the predicted ATUs by SeqATU. A gene that is not the 3’-end gene of any predicted ATU is named a 336 

no 3’-end gene. We identified 2,290/2,187 3’-end genes and 1,153/978 no 3’-end genes of the predicted 337 

ATUs for M9Enrich_Seq/RiEnrich_Seq. A gene validated by experimental TTSs means that it is the 338 

immediate upstream gene of an experimental TTS. As a result, the proportion of 3’-end genes of the 339 

predicted ATUs that were validated by experimental TTSs was over two times greater than that of no 3’-340 

end genes (Table 2). Specifically, the proportion of 3’-end genes (51%/53% for 341 

M9Enrich_Seq/RiEnrich_Seq) validated by experimental TTSs from SEnd-seq was over three times 342 

greater than that of no 3’-end genes (15%/14% for M9Enrich_Seq/RiEnrich_Seq). These results further 343 

verified the reliability of the ATUs predicted by SeqATU in terms of the TTS level. In addition, two 344 

other computationally predicted TTS datasets from the works by Nadiras et al. (40) and Kingsford et al. 345 

(41) were also examined. The results are shown in table S4, and we also found the proportion of 3’-end 346 

genes (63%/62% for M9Enrich_Seq/RiEnrich_Seq) validated by computationally predicted Rho-347 

independent TTSs was over two times greater than that of no 3’-end genes (29%/29% for 348 

M9Enrich_Seq/RiEnrich_Seq).  349 

Please place Table 1 here. 350 

Please place Table 2 here. 351 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


21 

 
The gene pairs frequently encoded in the same ATUs are more functionally related than those that 352 

can belong to two distinct ATUs 353 

Functional analysis was conducted by integrating GO terms from the Gene Ontology (GO) database 354 

(42). In detail, we measured the level of functional relatedness for two types of consecutive gene pairs, 355 

which is similar to the definition in the work by Mao et al. (38). Two types of consecutive gene pairs 356 

were (i) gene pairs each consisting of a 5’-end gene of an ATU and the gene in its immediate upstream 357 

on the same strand and (ii) all the other gene pairs inside an ATU (Fig. 7A). In addition, we used a 358 

scoring scheme to measure the GO-based functional similarity between a pair of genes by Wu et al. (43). 359 

This study developed a GO similarity score and showed that the larger the score, the more likely that 360 

two genes are functionally related. In brief, the GO similarity score of a gene pair ��  and ��  is 361 

denoted as ��� (�� , �� ): 362 

��� ���, �� � = �����∈�(��),   ��∈�(��)  �(�� , �� ) 363 

where �� and �� are the GO terms assigned to �� and �� , respectively; �(�� , �� ) is the maximal 364 

number of common terms between paths in the two GO graphs induced by the GO terms �� and ��. 365 

As a result, the mean GO similarity score was higher for type-ii gene pairs (5.97 versus 4.04 for 366 

M9Enrich_Seq and 5.86 versus 3.91 for RiEnrich_Seq) than for type-i gene pairs. A total of 574/524 367 

type-ii gene pairs had GO similarity scores greater than four (64%/63% of a total of 899/834), while 368 

only 461/404 type-i gene pairs had GO similarity scores greater than four (36%/34% of a total of 369 

1,274/1,179) for M9Enrich_Seq/RiEnrich_Seq. We also applied a c�-test (44) to determine whether the 370 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


22 

 
distribution of ��� ��� , �� � was different for the type-i gene pairs and type-ii gene pairs. The c
�-371 

statistics corresponded to a P-value less than 10��, which revealed that the distribution of ��� ��� , �� � 372 

for the type-ii gene pairs was significantly different from the type-i gene pairs. Fig. 7B shows the 373 

distribution of ��� ��� , �� � for the type-i gene pairs and the type-ii gene pairs. These results strongly 374 

indicated that the type-ii gene pairs had a higher degree of GO similarity than the type-i gene pairs, 375 

suggesting that the gene pairs frequently encoded in the same ATUs (type-ii gene pairs) are more 376 

functionally related than those that can belong to two distinct ATUs (type-i gene pairs).  377 

We also carried out a similar analysis of the two different gene pairs based on KEGG enrichment 378 

analysis (see more details in method S9) and found that the proportion of type-ii gene pairs (59%/57% 379 

for M9Enrich_Seq/RiEnrich_Seq), whose two genes were contained in the same KEGG pathway, was 380 

higher than the proportion of type-i gene pairs (32%/28% for M9Enrich_Seq/RiEnrich_Seq) (Fig. 7C). 381 

The distribution of the KEGG similarity scores of the two different types of gene pairs is shown in Fig. 382 

7D, suggesting that genes of type-ii gene pairs have a higher probability of participating in the same 383 

KEGG pathway than those of type-i gene pairs. 384 

Please place Fig. 7 here. 385 

DISCUSSION 386 

We developed SeqATU, the first computational method for genome-scale ATU prediction by analyzing 387 

next- and third-generation RNA-Seq data, using a CQP model. Linear constraints provided by the bias 388 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


23 

 
rate of read distribution were, for the first time, integrated into the CQP model. Positional bias refers to 389 

the non-uniform distribution of reads over different positions of a transcript (33, 35), which is handled 390 

by learning non-uniform read distributions from given RNA-Seq reads (32) or modeling the RNA 391 

degradation (45). The bias rate function we proposed can address the non-uniform read distribution 392 

along mRNA transcripts and also be desirable for standard next-generation RNA-Seq data that involves 393 

more degraded mRNAs, as the exponential function has been used to model the degradation of mRNA 394 

transcripts (45). As a result, a total of 2,973 distinct ATUs for M9Enrich_Seq and 2,767 distinct ATUs 395 

for RiEnrich_Seq were identified by SeqATU. The precision and recall reached 0.67/0.64 and 0.67/0.68, 396 

respectively, based on perfect matching and 0.77/0.74 and 0.75/0.76, respectively, based on relaxed 397 

matching for M9Enrich_Seq/RiEnrich_Seq. We further validated predicted ATUs using experimental 398 

transcription factor binding sites or transcription termination sites from RegulonDB and SEnd-Seq. In 399 

addition, the proportion of the 5’- or 3’-end genes of predicted ATUs that were validated by 400 

experimental transcription factor binding sites and transcription termination sites was over three times 401 

greater than that of no 5’- or 3’-end genes, demonstrating the high reliability of predicted ATUs. Gene 402 

pairs frequently encoded in the same ATUs were more functionally related than those that can belong to 403 

two distinct ATUs according to GO and KEGG enrichment analyses. These results demonstrated the 404 

reliability and accuracy of our predicted ATUs, implying the ability of SeqATU to reveal the 405 

transcriptional architecture of the bacterial genome.  406 

In fact, the ATU architecture of bacteria is much more complex than that determined with currently 407 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


24 

 
used experimental techniques. We investigated the 5’-end genes and no 5’-end genes of the experimental 408 

ATUs identified by SMRT-Cappable-seq (6) using a combination of experimental TSSs from 409 

RegulonDB (19), dRNA-seq (14), Cappable-seq (13), and SEnd-seq (7). As a result, we found that the 410 

proportion of 5’-end genes (99%) validated by experimental TSSs was not significantly different from 411 

that of no 5’-end genes (92%). The high percentage of no 5’-end genes validated by experimental TSSs 412 

implied that the ATUs identified by experimental techniques are only a small proportion of the 413 

comprehensive ATUs in bacterial organisms due to the dynamic mechanisms of ATUs. These results 414 

further verified the necessity of developing robust computational methods for ATU identification.  415 

SeqATU not only provides a powerful tool to understand the transcription mechanism of bacteria but 416 

also provides a fundamental tool to guide the reconstruction of a genome-scale transcriptional regulatory 417 

network. First, the ATU structure can help us to make new functional predictions, as genes in an ATU 418 

tend to have related functions. Second, ATUs can elucidate condition-specific uses of alternative sigma 419 

factors (8, 46). For example, the thrLABC operon is regulated by transcriptional attenuation. Totsuka et 420 

al. found that under the log phase growth condition, the thrLABC operon is the only transcript, while 421 

two transcripts are found under stationary phase growth condition, the thrLABC and thrBC. As validated 422 

experimentally, � � can regulate the additional promoter located in front of thrB under the stationary 423 

phase growth condition and then separately regulate thrBC, which elucidates the condition-specific uses 424 

of � � (8). Third, understanding the ATU structure is of great help to construct transcriptional and 425 

translation regulatory networks, such as for the construction of the σ-TUG (σ-factor-transcription unit 426 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


25 

 
gene) network (47). The transcription regulatory network consists of nodes (ATU and regulatory 427 

proteins) and links (interactions) (48), and the comprehensive ATU structure can provide a nearly 428 

complete set of nodes, which can improve the accuracy of regulatory prediction.  429 

Although SeqATU has obtained satisfactory predicted results, there are still several challenges 430 

regarding the computational prediction of ATUs. On the one hand, due to the influence of the 3’ 431 

untranslated region (UTR) and 5’ untranslated region (UTR) in the intergenic regions, the expression 432 

value of intergenic regions cannot be reproduced perfectly by the same calculation used for the 433 

expression value of genetic regions. Without accurate reproduction, it is difficult to obtain the best 434 

expression combination of ATUs by the programming model based on the expression value of genetic 435 

and intergenic regions. On the other hand, due to the lack of strand-specific RNA-Seq data, it is difficult 436 

to distinguish the expression level of intergenic regions between two consecutive genes on the same 437 

strand derived from ATUs containing these two genes or antisense RNAs (asRNAs) (6, 49). All of these 438 

challenges and the great significance of ATU prediction inspire and encourage us to discover more 439 

information to determine the ATU structure in bacteria. For example, we plan to add high confidence 440 

TSSs and TTSs information to our programming model in the future. Additionally, since the microbiome 441 

is increasingly recognized as a critical component in human diseases, such as inflammatory bowel 442 

disease (50), antibiotic-associated diarrhoea (51), neurological disorders (52), and cancer (53) (54),  443 

predicting new ATUs of uncultured species from metagenomic and metatranscriptomic data is of great 444 

significance in uncovering new regulatory pathway and metabolic products during the development of 445 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


26 

 
diseases (55). However, due to a majority of species with unknown genomes or genome annotations 446 

within a microbial community, ATU prediction on metagenomics and metatranscriptomics is still a 447 

challenging task, which encourage us to pay more attention on it. 448 

REFERENCES 449 

1. F. Jacob, D. Perrin, C. Sanchez, J. Monod, Operon: a group of genes with the expression 450 

coordinated by an operator. C R Hebd. Seances. Acad. Sci 250, 1727-1729 (1960). 451 

2. F. Jacob, J. Monod, Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 452 

318-356 (1961). 453 

3. Z. Liu, J. Feng, B. Yu, Q. Ma, B. Liu, The functional determinants in the organization of bacterial 454 

genomes. Brief. Bioinform., doi.org/10.1093/bib/bbaa1172 (2020). 455 

4. W.-C. Chou, Q. Ma, S. Yang, S. Cao, D. M. Klingeman, S. D. Brown, Y. Xu, Analysis of strand-456 

specific RNA-seq data using machine learning reveals the structures of transcription units in 457 

Clostridium thermocellum. Nucleic Acids Res. 43, e67-e67 (2015). 458 

5. S.-Y. Niu, B. Liu, Q. Ma, W.-C. Chou, rSeqTU—a machine-learning based R package for 459 

prediction of bacterial transcription units. Frontiers in genetics 10, 374 (2019). 460 

6. B. Yan, M. Boitano, T. A. Clark, L. Ettwiller, SMRT-Cappable-seq reveals complex operon 461 

variants in bacteria. Nat. Commun. 9, 3676 (2018). 462 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


27 

 
7. X. Ju, D. Li, S. Liu, Full-length RNA profiling reveals pervasive bidirectional transcription 463 

terminators in bacteria. Nature microbiology 4, 1907-1918 (2019). 464 

8. K. Totsuka, K. Totsuka, The Transcription Unit Architecture of the Escherichia Coli Genome. Nat. 465 

Biotechnol. 27, 1043-1049 (2009). 466 

9. A. H. Bhat, D. Pathak, A. Rao, The alr-groEL1 operon in Mycobacterium tuberculosis: an interplay 467 

of multiple regulatory elements. Scientific Reports 7, 43772 (2017). 468 

10. C. M. Sharma, S. Hoffmann, F. Darfeuille, J. Reignier, S. Findeiß, A. Sittka, S. Chabas, K. Reiche, 469 

J. Hackermüller, R. Reinhardt, The primary transcriptome of the major human pathogen 470 

Helicobacter pylori. Nature 464, 250-255 (2010). 471 

11. J. M. Durand, G. R. Bjork, Putrescine or a combination of methionine and arginine restores 472 

virulence gene expression in a tRNA modification-deficient mutant of Shigella flexneri: a possible 473 

role in adaptation of virulence. Mol. Microbiol. 47, 519-527 (2010). 474 

12. L. E. Wroblewski, R. M. Peek, K. T. Wilson, Helicobacter pylori and gastric cancer: factors that 475 

modulate disease risk. Clin. Microbiol. Rev. 23, 713-739 (2010). 476 

13. L. Ettwiller, J. Buswell, E. Yigit, I. Schildkraut, A novel enrichment strategy reveals unprecedented 477 

number of novel transcription start sites at single base resolution in a model prokaryote and the 478 

gut microbiome. BMC Genomics 17, 199-199 (2016). 479 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


28 

 
14. M. K. Thomason, T. Bischler, S. K. Eisenbart, K. U. Forstner, A. Zhang, A. Herbig, K. Nieselt, C. 480 

M. Sharma, G. Storz, Global transcriptional start site mapping using differential RNA sequencing 481 

reveals novel antisense RNAs in Escherichia coli. J. Bacteriol. 197, 18-28 (2015). 482 

15. T. Bischler, H. S. Tan, K. Nieselt, C. M. Sharma, Differential RNA-seq (dRNA-seq) for annotation 483 

of transcriptional start sites and small RNAs in Helicobacter pylori. Methods 86, 89-101 (2015). 484 

16. D. Dar, M. Shamir, J. Mellin, M. Koutero, N. Stern-Ginossar, P. Cossart, R. Sorek, Term-seq 485 

reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science 352, 6282 (2016). 486 

17. J. Clauwaert, G. Menschaert, W. Waegeman, An in-depth evaluation of annotated transcription 487 

start sites in E. coli using deep learning. bioRxiv, doi: https://doi.org/10.1101/2020.03.16.993501, 488 

4 November 2020, pre-print: not peer-reviewed. (2020). 489 

18. S. Goodwin, J. D. Mcpherson, W. R. Mccombie, Coming of age: ten years of next-generation 490 

sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016). 491 

19. A. Santos-Zavaleta, H. Salgado, S. Gama-Castro, M. Sánchez-Pérez, L. Gómez-Romero, D. 492 

Ledezma-Tejeida, J. S. García-Sotelo, K. Alquicira-Hernández, L. J. Muñiz-Rascado, P. Peña-493 

Loredo, RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge 494 

of gene regulation in E. coli K-12. Nucleic Acids Res. 47, D212-D220 (2018). 495 

20. N. Sierro, Y. Makita, M. J. L. De Hoon, K. Nakai, DBTBS: a database of transcriptional regulation 496 

in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 497 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


29 

 
36, 93-96 (2008). 498 

21. P. S. Dehal, M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, C. Dylan, G. D. Friedland, 499 

K. H. Huang, K. Keith, P. S. Novichkov, MicrobesOnline: an integrated portal for comparative and 500 

functional genomics. Nucleic Acids Res. 38, D396-D400 (2010). 501 

22. H. Cao, Q. Ma, X. Chen, Y. Xu, DOOR: a prokaryotic operon database for genome analyses and 502 

functional inference. Brief. Bioinform. 20, 1568-1577 (2019). 503 

23. X. Mao, Q. Ma, C. Zhou, X. Chen, H. Zhang, J. Yang, F. Mao, W. Lai, Y. Xu, DOOR 2.0: presenting 504 

operons and their functions through dynamic and integrated views. Nucleic Acids Res. 42, D654-505 

D659 (2013). 506 

24. K. Chetal, S. C. Janga, OperomeDB: A Database of Condition-Specific Transcription Units in 507 

Prokaryotic Genomes. Biomed Research International 2015, 1-10 (2015). 508 

25. J. Yang, X. Chen, A. Mcdermaid, Q. Ma, DMINDA 2.0: integrated and systematic views of 509 

regulatory DNA motif identification and analyses. Bioinformatics 33, 2586-2588 (2017). 510 

26. T. Blanca, C. Ricardo, C. E. Martinez-Guerrero, M. Enrique, ProOpDB: Prokaryotic Operon 511 

DataBase. Nucleic Acids Res. 40, D627-D631 (2012). 512 

27. R. McClure, D. Balasubramanian, Y. Sun, M. Bobrovskyy, P. Sumby, C. A. Genco, C. K. 513 

Vanderpool, B. Tjaden, Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res. 41, 514 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


30 

 
e140-e140 (2013). 515 

28. X. Chen, W. Chou, Q. Ma, Y. Xu, SeqTU: A Web Server for Identification of Bacterial 516 

Transcription Units. Scientific Reports 7, 43925 (2017). 517 

29. I. A. Garanina, G. Y. Fisunov, V. M. Govorun, BAC-BROWSER: The Tool for Visualization and 518 

Analysis of Prokaryotic Genomes. Frontiers in Microbiology 9, 2827 (2018). 519 

30. B. Taboada, K. Estrada, R. Ciria, E. Merino, Operon-mapper: a web server for precise operon 520 

identification in bacterial and archaeal genomes. Bioinformatics 34, 4118-4120 (2018). 521 

31. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. 522 

Bioinformatics 25, 1754-1760 (2009). 523 

32. Z. Wu, X. Wang, X. Zhang, Using non-uniform read distribution models to improve isoform 524 

expression inference in RNA-Seq. Bioinformatics 27, 502-508 (2011). 525 

33. A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, L. Pachter, Improving RNA-Seq expression 526 

estimates by correcting for fragment bias. Genome Biol. 12, 1-14 (2011). 527 

34. R. Bohnert, G. Rï¿ ½tsch, rQuant. web: a tool for RNA-Seq-based transcript quantitation. Nucleic 528 

Acids Res. 38, W348-W351 (2010). 529 

35. W. Li, T. Jiang, Transcriptome assembly and isoform expression level estimation from biased 530 

RNA-Seq reads. Bioinformatics 28, 2914-2921 (2012). 531 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


31 

 
36. B. Xiong, Y. Yang, F. R. Fineis, J.-P. Wang, DegNorm: normalization of generalized transcript 532 

degradation improves accuracy in RNA-seq analysis. Genome Biol. 20, 75 (2019). 533 

37. J. Chaitanya, Degradation of mRNA in Escherichia coli. IUBMB Life 54, 315-321 (2010). 534 

38. X. Mao, Q. Ma, B. Liu, X. Chen, H. Zhang, Y. Xu, Revisiting operons: an analysis of the landscape 535 

of transcriptional units in E. coli. BMC Bioinformatics 16, 356 (2015). 536 

39. B. Marie, K. H. Thilo, F. Thierry, T. Mikael, R. Adriana, V. D. Christian, Metabolic pathways of 537 

Pseudomonas aeruginosa involved in competition with respiratory bacterial pathogens. Frontiers 538 

in Microbiology 6, 321 (2015). 539 

40. C. Nadiras, E. Eveno, A. Schwartz, N. Figueroa-Bossi, M. Boudvillain, A multivariate prediction 540 

model for Rho-dependent termination of transcription. Nucleic Acids Res. 46, 8245-8260 (2018). 541 

41. C. L. Kingsford, K. Ayanbule, S. L. Salzberg, Rapid, accurate, computational discovery of Rho-542 

independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 543 

8, R22 (2007). 544 

42. M. Ashburner, S. Lewis, On Ontologies for Biologists: The Gene Ontology—Untangling the Web. 545 

Novartis Found. Symp. 247, 66-80; discussion 80-63, 84-90, 244-252 (2002). 546 

43. H. Wu, Z. Su, F. Mao, V. Olman, Y. Xu, Prediction of functional modules based on comparative 547 

genome analysis and Gene Ontology application. Nucleic Acids Res. 33, 2822-2837 (2005). 548 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


32 

 
44. S. A. Teukolsky, B. P. Flannery, W. Press, W. Vetterling, Numerical Recipes in C: The Art of 549 

Scientific Computing. Cambridge University Press, Cambridge (1992). 550 

45. L. Wan, X. Yan, T. Chen, F. Sun, Modeling RNA degradation for RNA-Seq with applications. 551 

Biostatistics 13, 734-747 (2012). 552 

46. C. Yanofsky, Attenuation in the control of expression of bacterial operons. Nature 289, 751 (1981). 553 

47. B. K. Cho, D. Kim, E. M. Knight, K. Zengler, B. O. Palsson, Genome-scale reconstruction of the 554 

sigma factor network in Escherichia coli : topology and functional states. BMC Biol. 12, 4-4 (2014). 555 

48. B.-K. Cho, P. Charusanti, M. J. Herrgård, Microbial regulatory and metabolic networks. Curr. Opin. 556 

Biotechnol. 18, 360-364 (2007). 557 

49. A. Toledo-Arana, O. Dussurget, G. Nikitas, N. Sesto, H. Guet-Revillet, D. Balestrino, E. Loh, J. 558 

Gripenland, T. Tiensuu, K. Vaitkevicius, The Listeria transcriptional landscape from saprophytism 559 

to virulence. Nature 459, 950-956 (2009). 560 

50. B. Yue, X. Luo, Z. Yu, S. Mani, Z. Wang, W. Dou, Inflammatory bowel disease: a potential result 561 

from the collusion between gut microbiota and mucosal immune system. Microorganisms 7, 440 562 

(2019). 563 

51. B. H. Mullish, H. R. Williams, Clostridium difficile infection and antibiotic-associated diarrhoea. 564 

Clin. Med. 18, 237 (2018). 565 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


33 

 
52. M. Maguire, G. Maguire, Gut dysbiosis, leaky gut, and intestinal epithelial proliferation in 566 

neurological disorders: towards the development of a new therapeutic using amino acids, 567 

prebiotics, probiotics, and postbiotics. Rev. Neurosci. 30, 179-201 (2019). 568 

53. S. Vivarelli, R. Salemi, S. Candido, L. Falzone, M. Santagati, S. Stefani, F. Torino, G. L. Banna, 569 

G. Tonini, M. Libra, Gut microbiota and cancer: from pathogenesis to therapy. Cancers 11, 38 570 

(2019). 571 

54. G. Cammarota, G. Ianiro, A. Ahern, C. Carbone, A. Temko, M. J. Claesson, A. Gasbarrini, G. 572 

Tortora, Gut microbiome, big data and machine learning to promote precision medicine for cancer. 573 

Nature Reviews Gastroenterology & Hepatology 17, 635-648 (2020). 574 

55. S. S. A. Zaidi, X. Zhang, Computational operon prediction in whole-genomes and metagenomes. 575 

Briefings in functional genomics 16, 181-193 (2017). 576 

ACKNOWLEDGEMENTS 577 

Funding: This work was supported by the National Nature Science Foundation of China (NSFC) 578 

[61772313 to B.L., 11931008 to B.L.]; Interdisciplinary Science Innovation Group Project of Shandong 579 

University (2019); and the Innovation Method Fund of China [2018IM020200 to B.L.]. The authors 580 

would like to thank Yang Li for his assistance in language polishing. Authors’ contributions: B.L., 581 

Q.M. and W.C. conceived the basic idea and designed the overall analyses. Q.W. carried out most of the 582 

computational analysis and data interpretation. All the authors wrote the manuscript. Competing 583 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


34 

 
interests: The authors declare that they have no competing interests. Data and materials availability: 584 

The raw data and source code of SeqATU and a detailed tutorial can be found at 585 

https://github.com/OSU-BMBL/SeqATU. 586 

FIGURES AND TABLES 587 

Table 1. Results of predicted ATUs verified by experimental TSSs or TF binding sites. Overview of 588 

the experimental TSS and TF binding site datasets (dataset 1 and dataset 2) and the proportion of 5’-end 589 

genes and no 5’-end genes of the predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which 590 

were validated by experimental TSSs or TF binding sites. 591 

 dataset 1 dataset 2 

Source Ju et al. (7) 
RegulonDB TF binding 

sites 

Technique SEnd-seq Collection 

TSSs/TF binding sites 5,512 3,220 

M9Enrich_Se

q 

5’-end genes 83% 29% 

no 5’-end genes 47% 9.2% 

RiEnrich_Seq 
5’-end genes 89% 30% 

no 5’-end genes 44% 9.0% 

 592 

 593 

Table 2. Results of predicted ATUs verified by experimental TTSs. Overview of the experimental 594 

TTS datasets (dataset 3 and dataset 4) and the proportion of 3’-end genes and no 3’-end genes of the 595 

predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which were validated by 596 

experimental TTSs. 597 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


35 

 
 dataset 3 dataset 4 

Source Ju et al. (7) RegulonDB TTSs 

Technique SEnd-seq Collection 

TTSs 1,540 3,67 

M9Enrich_Se

q 

3’-end genes 51% 11% 

no 3’-end genes 15% 5.2% 

RiEnrich_Seq 
3’-end genes 53% 11% 

no 3’-end genes 14% 4.8% 

 598 

 599 

 600 

Fig. 1. Schematic overview of SeqATU. The blue arrow and orange line denote gene and RNA-Seq 601 

read, respectively. The preprocessing stage requires RNA-Seq data in the FASTQ format, the reference 602 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


36 

 
genome sequence in the FASTA format, and gene annotations in the GFF format, generating linear 603 

constraints for the next convex quadratic programming (CQP) stage. There are two steps in the 604 

preprocessing stage: (i) calculating the expression value of the genetic region �� and intergenic region 605 

��,� and (ii) modelling non-uniform read distribution along mRNA transcripts; specifically, we acquired 606 

a bias rate function �(�) = �� �  using nonlinear regression and then constructed genetic or intergenic 607 

region bias rate vectors. The maximal ATU cluster data determined by rSeqTU and the linear constraints 608 

from preprocessing are both taken as inputs of CQP. CQP seeks the optimum expression combination of 609 

all of the to-be-identified ATUs to minimize the gap ���  between the predicted ATU expression profile 610 

and the genetic and intergenic region expression profile. Finally, the output of CQP is the predicted 611 

ATUs. 612 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


37 

 
 613 

Fig. 2. Results of modelling non-uniform read distribution along mRNA transcripts. The four bias 614 

rate functions (� = ����) by nonlinear regression had similar coefficients (� and �) across the four 615 

datasets M9Enrich_1, M9Enrich_2, RiEnrich_1 and RiEnrich_2. 616 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


38 

 
 617 

Fig. 3. Overall evaluation results of SeqATU. (A) Precision and recall based on perfect matching and 618 

relaxed matching for M9Enrich_Seq (left) and RiEnrich_Seq (right) using evaluated ATUs from SMRT-619 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


39 

 
Cappable-seq. (B) Average precision based on perfect matching for M9Enrich_Seq (left) and 620 

RiEnrich_Seq (right) using evaluated ATUs from SMRT-Cappable-seq (black) and evaluated ATUs from 621 

SMRT-Cappable-seq and SEnd-seq (red). The magnitude of the point denotes the number of maximal 622 

ATU clusters with same size. (C) Average number of ATUs across different sizes of SMRT maximal 623 

ATU clusters for M9Enrich_Seq (left) and RiEnrich_Seq (right). 624 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


40 

 
 625 

Fig. 4. Comparative analysis of the performance between SeqATU and SeqATU without the bias 626 

rate constrains for SMRT maximal ATU clusters. (A) Precision, recall and F-score based on perfect 627 

matching for M9Enrich_Seq and RiEnrich_Seq. (B) Precision, recall and F-score based on relaxed 628 

matching for M9Enrich_Seq and RiEnrich_Seq. 629 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


41 

 
 630 

Fig. 5. Comprehensive analysis of the predicted ATUs by SeqATU. (A) Number of ATUs across 631 

different sizes. The size of an ATU is the number of its component genes. (B) Distribution of the number 632 

of ATUs per gene. 633 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


42 

 
 634 

Fig. 6. Integrative Genomics Viewer (IGV) representation of the mapping and ATUs. Mapping and 635 

ATUs of M9Enrich_Seq (orange) and RiEnrich_Seq (blue) were shown for the maximal ATU cluster 636 

containing the bioB, bioF, bioC, bioD and uvrB genes. 637 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/


43 

 
 638 

Fig. 7. Interpretation and results of the functional relatedness of different gene pairs based on GO 639 

and KEGG enrichment analyses. (A) Illustration of two different gene pairs i and ii. (B) Functional 640 

relatedness results based on GO enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right). 641 

(C) The proportion of two different gene pairs whose genes are contained in the same KEGG pathway 642 

for M9Enrich_Seq (left) and RiEnrich_Seq (right). (D) The functional relatedness results based on 643 

KEGG enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right). 644 

.CC-BY-NC-ND 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.02.425006
http://creativecommons.org/licenses/by-nc-nd/4.0/