Analysis of next- and third-generation RNA-Seq data reveals the structures of alternative transcription units in bacterial genomes 1 Analysis of next- and third-generation RNA-Seq data reveals the structures of 1 alternative transcription units in bacterial genomes 2 Qi Wang1, Zhaoqian Liu1,2, Bo Yan3, Wen-Chi Chou4, Laurence Ettwiller3, Qin Ma2,†, and Bingqiang 3 Liu1,† 4 1 School of Mathematics, Shandong University, Jinan 250200, China. 5 2 Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, 6 OH 43210, USA. 7 3 New England Biolabs Inc., Ipswich, MA 01938, USA. 8 4 Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA 9 02142, USA. 10 †Corresponding author. Email: bingqiang@sdu.edu.cn (B.L.); qin.ma@osumc.edu (Q.M.) 11 ABSTRACT 12 Alternative transcription units (ATUs) are dynamically encoded under different conditions or 13 environmental stimuli in bacterial genomes, and genome-scale identification of ATUs is essential for 14 studying the emergence of human diseases caused by bacterial organisms. However, it is unrealistic to 15 identify all ATUs using experimental techniques, due to the complexity and dynamic nature of ATUs. 16 Here we present the first-of-its-kind computational framework, named SeqATU, for genome-scale ATU 17 prediction based on next-generation RNA-Seq data. The framework utilizes a convex quadratic 18 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 2 programming model to seek an optimum expression combination of all of the to-be-identified ATUs. 19 The predicted ATUs in E. coli reached a precision of 0.77/0.74 and a recall of 0.75/0.76 in the two RNA-20 Sequencing datasets compared with the benchmarked ATUs from third-generation RNA-Seq data. We 21 believe that the ATUs identified by SeqATU can provide fundamental knowledge to guide the 22 reconstruction of transcriptional regulatory networks in bacterial genomes. 23 INTRODUCTION 24 An operon in bacterial genomes is defined as a group of consecutive genes regulated by a common 25 promoter that all share the same terminator (1). Genes in the same operon generally encode proteins 26 with relevant or similar biological functions; e.g., lacZ, lacY, and lacA in the lac operon encode proteins 27 that help cells use lactose (1, 2). With decades of research on bacterial transcriptional regulation, the 28 operon model has been found to have complex mechanisms that control expression (3-5). Multiple 29 studies have shown that bacterial genes are dynamically transcribed under different triggering 30 conditions, leading to shared genes among different mRNA transcripts (6-8). This dynamic architecture 31 can be redefined by all of the alternative transcription units (a.k.a., ATUs) (3, 5), and more details can be 32 found in fig. S1. 33 ATU identification is of fundamental importance for understanding the transcriptional regulatory 34 mechanisms of bacteria, and these dynamic structures have been demonstrated to be associated with 35 human diseases (9-12). For example, Bhat et al. studied the alr-groEL1 operon, which is essential for the 36 survival or virulence of M. tuberculosis (9, 11), the causative agent of tuberculosis (TB), and found that 37 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 3 the regulation of the sub-operon is distinct from the main operon (alr-groEL1 operon) under stress, 38 especially during heat shock, pH, and SDS stresses (9). Another example is Helicobacter pylori, a 39 gastric pathogen that is the primary known risk factor for gastric cancer (12). Sharma et al. found an 40 acid-induced sub-operon cag22-18 transcribed from the primary cag25-18 operon in the cag 41 pathogenicity island of the H. pylori genome under acid stress (10). The mechanism of the complex ATU 42 structure in these pathogenic bacteria can help us to study the emergence of human diseases caused by 43 bacterial organisms. 44 Several newly developed techniques have provided a comprehensive view of the E. coli 45 transcriptome by identifying full-length primary transcripts (13-17). For example, SMRT-Cappable-seq 46 (6) combines the isolation of the full-length bacterial primary transcriptome with PacBio SMRT (Single 47 Molecule, Real-Time) sequencing (6), and simultaneous 5’ and 3’ end sequencing (SEnd-seq) (7) 48 captures both transcription start sites (TSSs) and transcription termination sites (TTSs) via 49 circularization of transcripts (17). Despite the great progress in experimental techniques, there are still 50 some deficiencies. On the one hand, the read depth and error rate of the third-generation sequencing 51 used in SMRT-Cappable-seq have an impact on ATU prediction compared with Illumina-based RNA-52 Seq (7, 18). On the other hand, the time-consuming, laborious, and costly properties of these 53 experimental techniques make them unrealistic to be generally applicable to ATU predictions in bacteria 54 under specific conditions. Thus, novel and robust computational methods for ATU identification in 55 bacterial genomes based on RNA-Seq are urgently needed. 56 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 4 Fortunately, many computational studies have been carried out to predict ATUs in bacteria, which 57 have provided some preliminary studies for ATU prediction. Several public databases, such as 58 RegulonDB (19), DBTBS(20), MicrobesOnline (21), DOOR (22, 23), OperomeDB (24), DMINDA 2.0 59 (25), and ProOpDB (26), provide various levels of operon information and small amounts of ATU 60 information. However, these databases cannot provide genome-scale ATU information under specific 61 conditions. Some computational studies, including Rockhopper (27), SeqTU (4, 28), BAC-62 BROWSER(29), rSeqTU (5), and Operon-mapper (30), utilize machine learning and model integration 63 methods based on genomic information and gene expression profiles to identify bacterial transcription 64 architecture. However, these works still cannot solve the dynamic patterns and overlapping nature of 65 ATUs. 66 Here, we present SeqATU, a novel computational method for genome-scale ATU prediction by 67 analyzing next- and third-generation RNA-Seq data (Fig. 1 and table S1). SeqATU utilizes a convex 68 quadratic programming model (CQP) and aims to provide the optimum expression combination of all of 69 the to-be-identified ATUs. Specifically, CQP minimizes the squared error between the predicted 70 expression level of ATUs and the actual expression levels in genetic and intergenic regions. It is 71 noteworthy that SeqATU also utilizes the information about the bias rate function in modeling non-72 uniform read distribution as the linear constraints of CQP to profile the complexity of the ATU 73 architecture. Overall, SeqATU provides a generalized framework for the inference of ATUs based on 74 next-generation RNA-Seq data collected under multiple conditions and can be easily applied to any 75 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 5 bacterial organism to identify the ATU architecture and construct a transcriptional regulatory network. 76 Please place Fig. 1 here. 77 MATERIALS AND METHODS 78 Data collection 79 The two Cappable RNA-Seq datasets used in this study, M9Enrich_Seq and RiEnrich_Seq, were 80 obtained from E. coli grown under two different conditions: M9 minimal medium and Rich medium, 81 respectively (6). The full-length primary transcripts were enriched as described in (6) with modifications 82 to be adapted to Illumina sequencing. The capping and polyA tailing were performed as described in (6). 83 The capped RNA was enriched using hydrophilic streptavidin magnetic beads (New England Biolabs) 84 and eluted with Biotin using the same condition (6). Differently, the eluted RNA was enriched once 85 more using streptavidin beads to further remove processed RNA (e.g., rRNA). Subsequently, the eluted 86 RNA was used for library preparation using NEBNext Ultra II directional RNA library prep kit (E7760). 87 Sequencing was performed on the Illumina Miseq system (paired-end, 100bp). All reads were mapped to 88 the E. coli genome using Burrows-Wheeler Aligner (BWA) with the default parameters (31). Read 89 alignment and other computational analyses were carried out using the E. coli genome NC_000913.3, 90 and the corresponding gene annotations (GCF_000005845.2_ASM584v2_genomic.gff) were 91 downloaded from NCBI. Two experimentally verified ATU datasets, SMRT_M9Enrich and 92 SMRT_RiEnrich, were used as the benchmark data to evaluate the predicted ATUs, which were 93 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 6 generated by SMRT-Cappable-seq under the same conditions as the Illumina datasets M9Enrich_Seq 94 and RiEnrich_Seq, respectively (6). In addition, the ATUs defined by RegulonDB (19) and SEnd-seq (7) 95 were also used as additional evaluation data in our study. 96 Calculation of the expression values of genetic and intergenic regions 97 After the RNA-Seq reads in M9Enrich_Seq and RiEnrich_Seq were mapped to the E. coli genome using 98 BWA, we determined the number of reads �(�) covering each genomic position �. Suppose that �� 99 and ���� are two consecutive genes on the same strand; we denote the expression value of �� as �� 100 and the expression value of the intergenic region between genes �� and ���� as ��,���. Then, the 101 calculation of �� and ��,��� is given by: 102 �� = ∑ �(�)�∈�� |�� | (1) ��,��� = ∑ �(�)�∈��,��� |��,���| (2) where � ∈ �� denotes that genomic position � is on the gene �� and |�� | denotes the genomic 103 length of ��. 104 Modeling non-uniform read distribution along mRNA transcripts 105 We introduced the bias rate function, which is similar to the bias curves in the work of Wu et al. (32), to 106 address the non-uniform distribution of the RNA-Seq reads along mRNA transcripts (32-35). The bias 107 function reflects the relative read distribution bias from the 3’ end to the 5’ end of an mRNA transcript. 108 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 7 We assumed that the maximum read coverage of all the genomic positions of an mRNA transcript is the 109 expression level without bias. It is noteworthy that a single gene mRNA transcript with no shared gene 110 among different mRNA transcripts can serve as the ideal template for modeling non-uniform read 111 distribution along mRNA transcripts. The specific steps of modeling non-uniform read distribution are 112 detailed as follows: 113 Step 1: Single Gene mRNA Transcript Selection. We selected single gene mRNA transcripts from the 114 evaluation data and plotted their expression distributions. Specifically, 12 groups of single gene mRNA 115 transcripts with lengths ranging from 300 to 1,500 bp were selected from the evaluation data (more 116 details are given in method S1), and each group had ten randomly chosen mRNA transcripts. Apparent 117 decline trends appeared in the single gene mRNA transcripts with long lengths, ranging from 1,100 to 118 1,500 bp (fig. S2). The reason for this phenomenon may be that the incomplete transcription and 3’ end 119 degradation or processing induce the enrichment of signal at 5’ end of the mRNA transcripts with long 120 lengths (36, 37). Finally, we plotted the expression distribution of single gene mRNA transcripts with 121 lengths ranging from 1,100 to 1,500 bp. 122 Step 2: Acquiring the Bias Rate Function. We applied nonlinear regression to the expression 123 distribution of the selected single gene mRNA transcripts and acquired the hypothetical function �(�). 124 Specifically, the � axis and � axis of the expression distribution were converted to the distance from 125 the 3’ end of an mRNA transcript and the bias rate of read distribution, respectively. To apply nonlinear 126 regression to single gene mRNA transcripts with different lengths, normalization was also implemented 127 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 8 on �. Here, � = (��, ��, … , ��) and � = (��, ��, … , ��) are defined by: 128 �� = ⎩ ⎨ ⎧ �� − ������ ���� − �� × 10�, ������� �� − �� ���� − �� × 10�, ������� (3) �� = ⎩ ⎪ ⎨ ⎪ ⎧ �(������) ���� , ������� �(�� ) ���� , ������� (4) where � denotes the number of genomic positions on an mRNA transcript; � = (��, ��, … , ��) denotes 129 the genomic positions on an mRNA transcript; ���� = ��; �(�� ) denotes the expression level of the 130 genomic position �� , i.e., the number of reads covering the genomic position �� ; and ���� denotes the 131 expression level without bias in an mRNA transcript, which is calculated as ��� {�(�� )}, 1 ≤ � ≤ �. 132 We used the function nls in R to acquire the hypothetical function �(�). 133 Step 3: Constructing Bias Rate Vectors. We constructed a genetic or intergenic region bias rate vector 134 for each mRNA transcript by calculating the bias rate of all of its component genetic or intergenic 135 regions. The bias rate of a genetic or an intergenic region is the average bias rate of all the genomic 136 positions that it contains. Considering an mRNA transcript � and its component gene set 137 {��, ��, … , ��} (the details of the gene labels are described in method S2), we denoted the genetic 138 region bias rate vector as � = (��, ��, … , �� ), which was calculated using the formula: 139 �� = ⎩ ⎪ ⎨ ⎪ ⎧ ∑ �(�� ) ������ �������� ������� − ������� + 1 , ������� ∑ �(�� ) �� ���� ��� − ��� + 1 , ������� (5) .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 9 where � denotes the number of genomic positions on �; �� denotes the bias rate of �� for �; and 140 �� = (��� , ��� , ��� , ��� , … , ��� , ��� ) is the range of the genomic positions of {��, ��, … , ��}, while the 141 range of the genomic positions of �� is [��� , ��� ], 1 ≤ � ≤ �. Similarly, the calculation of the intergenic 142 region bias rate vector � = (��, ��, … , ����) is provided in method S3. 143 Modification of maximal ATU clusters 144 A maximal ATU cluster is defined as a maximal consecutive gene set such that each pair of its 145 consecutive genes can be covered by at least one ATU. Similar to ATUs, maximal ATU clusters are also 146 dynamically composed under different conditions or environmental stimuli in bacterial genomes (5, 38). 147 Such a maximal ATU cluster can be used as an independent genomic region for ATU prediction, which 148 alleviates the difficulty in computationally predicting ATUs at the genome scale. The output of our in-149 house tool rSeqTU can serve as the maximal ATU cluster data, which lays a solid foundation for ATU 150 prediction (5). We modified the maximal ATU clusters from rSeqTU: (i) two maximal ATU clusters with 151 distances less than 40 bp were combined into one cluster and (ii) a maximal ATU cluster was split at the 152 intergenic region where the opposite-strand genes were located. In addition, we selected the maximal 153 ATU clusters with expression values over ten (see the details in method S4), according to the study of 154 Etwiller et al. (13). 155 The mathematical programming model for ATU prediction 156 The predicted ATU expression profile should be consistent with the observed expression profiles of the 157 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 10 genetic and intergenic regions. Therefore, the prediction of the ATU profiles can be modeled as an 158 optimization problem, which seeks an optimum expression combination of all of the to-be-identified 159 ATUs to minimize the gap between the predicted ATUs and the observed genetic and intergenic region 160 expression profiles. Here, a convex quadratic programming model was built to solve this optimization 161 problem. 162 We denoted a maximal ATU cluster as �, assuming that it contains the consecutive genes 163 {��, … , ��}, and the intergenic regions of these genes are {��,�, … , ����,�}. The size of � is defined as 164 the number of its component genes �. Theoretically, there are �×(���) � ATUs for �, and an ATU with 165 consecutive genes {�� , ����, … , �� } is denoted as � �,� ; the corresponding expression value is ��,�, 1 ≤166 � ≤ � ≤ �. 167 For the component gene �� of �, the gap between the gene expression value �� and the sum of the 168 expression level of the ATUs containing it is denoted as ��, which provides the first � equality 169 constraints in our mathematical programming model, � = 1,2, … , �. Similarly, for the intergenic region 170 ��,��� of �, the gap between the intergenic region expression value ��,��� and the sum of the 171 expression level of the ATUs containing it is denoted as ��, providing the last � − 1 equality 172 constraints in our mathematical programming model, � = 1,2, … , � − 1. 173 The goal of our mathematical programming model is to minimize the square of � =174 (��, ��, … , ��, ��, … , ����), as the combination of � �,� with a minimal value of ��� is corresponding to 175 an optimum expression combination of all ATUs ��,� for �, 1 ≤ � ≤ � ≤ �. Additionally, to control the 176 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 11 number of optimal solutions and reduce the false-positive errors, we added an �� regularization �||�||� 177 to ��� with ��,� ≥ 0, which is a linear function. Because of the variant expression level of different 178 maximal ATU clusters, we used the expression value of � as �. In total, the convex quadratic 179 programming model with unknown variables (�, �) is shown as follows: 180 ��� ��� + �||�||� �. �. ∑ ∑ ��,� � �,�� ��� � ��� = �� + �� � = 1,2, … , � ∑ ∑ ��,���� �,�� ����� � ��� = ��,��� + �� � = 1,2, … , � − 1 � = ���,� �, ��,� ≥ 0 1 ≤ � ≤ � ≤ � � = (��, ��, … , �� , ��, … , ����) (6) where � = (��,� ) is the genetic region bias rate vector for �, ��,� is the bias rate of gene �� for ATU 181 ��,�, 1 ≤ � ≤ � ≤ �,� ≤ � ≤ �, � = (��,� ) is the intergenic region bias rate vector for �, and ��,� 182 is the bias rate of the intergenic region ����,� for ATU � �,�, 1 ≤ � < � ≤ �,� ≤ � ≤ � (see the 183 details in method S5). 184 Two evaluation methods for ATU prediction 185 In the first evaluation method, precision and recall were defined based on perfect matching (Eqs. 7). 186 Perfect matching of two ATUs means that all of their component genes are the same. Here, the true 187 positives (��) are the number of predicted ATUs with the same component genes as an ATU in the 188 evaluation data; the false positives (��) are the number of predicted ATUs that do not exist in the 189 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 12 evaluation data; the false negatives (��) are the number of ATUs that appear in the evaluation data but 190 not in the prediction results of SeqATU. 191 ��������� = �� �� + �� ������ = �� �� + �� (7) In the second evaluation method, precision and recall were defined based on relaxed matching, which 192 is measured by the similarity of two ATUs. Assuming that an ATU � is in one of two datasets (the 193 predicted ATU dataset and evaluated ATU dataset), the definition and calculation of the similarity of � 194 are shown in the following three cases: 195 Case 1: If � shares boundary genes at both ends of an ATU in the other dataset, i.e., all component 196 genes of � are the same as one in the other dataset, then ����������(�) = 1. 197 Case 2: If � shares exactly one boundary gene of ATUs in the other dataset, then we denote �� as 198 the ATUs in the other dataset that share the 5’-end gene with � and denoted �� as the ATUs in the 199 other dataset that share the 3’-end gene with �, �� ∩ �� = ∅, one of �� and �� can be empty. Then, 200 ����������(�) = 1 2 �����∈�� �(��) �(��) + 1 2 �����∈�� �(��) �(��) (8) where �(��) is the number of shared genes of � and �� and �(��) is the maximal size of � and ��. 201 Case 3: If � shares no boundary genes at both ends of the ATUs in the other dataset, then 202 ����������(�) = 0. 203 Finally, the precision and recall based on relaxed matching are calculated by the following formula: 204 ��������� = ∑ ����������(�)�∈�� �� .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 13 ������ = ∑ ����������(�)�∈�� �� (9) where �� is the set of predicted ATUs, �� is the number of predicted ATUs, �� is the set of evaluated 205 ATUs, and �� is the number of evaluated ATUs. 206 RESULTS 207 A reliable bias rate function is acquired in modeling non-uniform read distribution along mRNA 208 transcripts 209 To ensure the reliability of the bias rate function in modeling non-uniform read distribution, we selected 210 four single gene mRNA transcript datasets randomly from the two evaluation datasets 211 (SMRT_M9Enrich and SMRT_RiEnrich), named M9Enrich_1, M9Enrich_2, RiEnrich_1, and 212 RiEnrich_2. Four bias rate functions, which are exponential functions, were generated after conducting 213 nonlinear regression on the mRNA transcripts across these four datasets (Fig. 2). We found that these 214 bias rate functions were similar (�� > 0.998) when we evaluated the R-square statistic (for more 215 details, see method S6 and table S2). The similarity of the four bias rate functions indicated that the 216 selection of the single gene mRNA transcript datasets had little impact on modeling non-uniform read 217 distribution along mRNA transcripts, implying the universal common non-uniform read distribution of 218 different mRNA transcripts of E. coli. Specifically, we used the average of these four coefficients as the 219 final coefficients of the exponential function, which was �(�) = ���� with � = 0.256 and � =220 0.00128. 221 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 14 Please place Fig. 2 here. 222 ATUs predicted by SeqATU reach precision and recall over 0.64 223 The performance evaluation was conducted by comparing the predicted ATUs with the ATUs in 224 SMRT_M9Enrich and SMRT_RiEnrich, which were generated based on the third-generation sequencing 225 and are not sensitive to transcripts with low expression levels. For a more accurate and fair evaluation, 226 maximal ATU clusters after pre-selection were retained in the subsequent evaluations (more details 227 about the pre-selection of maximal ATU clusters can be seen in method S7 and fig. S3). 228 The precision and recall of the predicted ATUs were calculated for each maximal ATU cluster. By 229 considering only perfect matching, the average precision and recall were 0.67 and 0.67 for 230 M9Enirch_Seq and 0.64 and 0.68 for RiEnrich_Seq, respectively. When using relaxed matching, the 231 average precision and recall increased to 0.77 and 0.75 for M9Enrich_Seq and 0.74 and 0.76 for 232 RiEnrich_Seq, respectively. The statistics for precision and recall on maximal ATU clusters with 233 different sizes, as shown in Fig. 3A and fig. S4A. These results showed that the average precision and 234 recall were decreasing with the increasing size of maximal ATU clusters (other than several large size 235 ones due to their small number of counts). The results also indicated that the evaluation results based on 236 relaxed matching were significantly higher than those based on perfect matching across different sizes. 237 This result implied that the incorrectly predicted ATUs by SeqATU based on perfect matching tended to 238 have strong similarities with the ATUs in the evaluation data. In addition, we also found that more than a 239 quarter of the incorrectly predicted ATUs (25%/29% for M9Enrich_Seq/RiEnrich_Seq) by SeqATU 240 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 15 based on perfect matching matched with the transcription units in RegulonDB (19). 241 The two evaluation datasets (SMRT_M9Enrich and SMRT_RiEnrich) were both from SMRT-242 Cappable-seq, while one of the processing steps of the technique filtered RNA reads smaller than 1,000 243 bp (6), which indicated that the ATUs in these two evaluation datasets were not comprehensive. To 244 address this issue, we enriched the evaluation data by adding the ATUs defined by SEnd-seq (7), as 245 SEnd-seq did not introduce any filtering based on RNA size. When we used the new evaluation data, the 246 ATUs predicted by SeqATU improved by 15% (0.77) and 19% (0.76) in terms of the average precision 247 based on perfect matching for M9Enrich_Seq and RiEnrich_Seq, respectively, and by 9% (0.84) and 248 12% (0.83) based on relaxed matching. The statistics for precision across different sizes of the maximal 249 ATU clusters are shown in Fig. 3B and fig. S4B, showing that the values of precision based on perfect 250 matching were significantly improved across different sizes of maximal ATU clusters by using the 251 evaluated ATUs from SMRT-Cappable-seq and SEnd-seq. This result suggested that the ATUs we 252 predicted, which were not in SMRT_M9Enrich and SMRT_RiEnrich, may be due to the RNA length 253 selection of SMRT-Cappable-seq. We enriched the evaluation data by adding the ATUs in RegulonDB 254 (19) and also found the improvement of precision across different sizes of maximal ATU clusters for 255 M9Enrich_Seq and RiEnrich_Seq (fig. S4C). 256 Furthermore, to facilitate the understanding of the performance of SeqATU and to measure the 257 influence of the maximal ATU clusters from rSeqTU on our ATU prediction method, SMRT maximal 258 ATU clusters collected from SMRT_M9Enrich and SMRT_RiEnrich (for more details, see method S8) 259 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 16 were applied for the CQP in two conditions (M9 minimal medium and Rich medium). We found that 260 precision and recall increased to 0.73 and 0.77 for M9Enrich_Seq, respectively, and 0.69 and 0.80 for 261 RiEnrich_Seq based on perfect matching (fig. S4D). Additionally, when using relaxed matching, 262 precision and recall significantly increased to 0.82 and 0.84 for M9Enrich_Seq, respectively, and 0.79 263 and 0.86 for RiEnrich_Seq (fig. S4D). The significantly improved results verified the ability of SeqATU 264 to accurately predict ATU when giving more accurate maximal ATU clusters. In addition, we found that 265 the number of predicted ATUs and the evaluated ATUs under the maximal ATU cluster with the same 266 size were similar except for the maximal size (Fig. 3C), and they were far less than the theoretical 267 number, which indicated that SeqATU can effectively exclude most of the incorrect ATUs. 268 Please place Fig. 3 here. 269 The bias rate constraints efficiently improve the ability of SeqATU to predict ATUs 270 We tried to use SeqATU without bias rate constraints to predict the ATUs of E. coli and found that its 271 performance significantly decreased compared with SeqATU (Fig. 4 and fig. S5). Specifically, the F-272 score of SeqATU without bias rate constraints was 0.69/0.68 based on perfect matching for 273 M9Enrich_Seq/RiEnrich_Seq, compared with 0.75/0.74 for SeqATU. When using relaxed matching, the 274 F-score of SeqATU without bias rate constraints was 0.79/0.78 for M9Enrich_Seq/RiEnrich_Seq, 275 compared with 0.83/0.83 for SeqATU. This result suggested that the bias rate constraints of SeqATU 276 could capture useful information about the non-uniform distribution of the RNA-Seq reads along the 277 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 17 mRNA transcripts (32-35) and then efficiently improve the ability of the model to predict complex 278 ATUs. 279 Please place Fig. 4 here. 280 ATUs predicted by SeqATU display a dynamic composition and overlapping nature 281 A total of 2,973 distinct ATUs were identified in M9 minimal medium, and 2,767 were identified in Rich 282 medium. Among them, there were 1,423/1,550 distinct ATUs on the forward strand and 1,323/1,444 on 283 the reverse strand for M9Enrich_Seq/RiEnrich_Seq. Each of the predicted ATUs was comprised of an 284 average of 2.59 genes, with the largest ATU containing 28 genes across the two conditions. The 285 distribution of the size of the predicted ATUs is shown in Fig. 5A, from which we can see that the 286 majority of ATUs (more than 87%) contained fewer than five genes in M9 minimal medium and Rich 287 medium. Approximately 41% of the genes in E. coli were contained in more than one ATU for 288 M9Enrich_Seq, compared to 43% genes for RiEnrich_Seq, suggesting that the ATUs in a maximal ATU 289 cluster generally overlapped with each other (Fig. 5B). In addition, there were 1,576 ATU maximal 290 clusters for M9Enrich_Seq and 1,512 ATU maximal clusters for RiEnrich_Seq. SeqATU identified a 291 total of 1,977 identical ATUs under the two conditions, whereas there were 1,786 distinct ATUs. Among 292 the distinct ATUs across the two conditions, 394 ATUs were from the same maximal ATU clusters in the 293 two maximal ATU cluster datasets, and the rest were from different maximal ATU clusters. The fact 294 there were distinct ATUs under the two conditions suggests that ATUs are dynamically responsive to 295 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 18 different conditions or environmental stimuli (for more real examples about the ATUs under different 296 conditions, see fig. S6). 297 The dynamic composition of predicted ATUs by SeqATU is of great significance to understand the 298 interactions inside polymicrobial communities. For example, chronic airway infection by Pseudomonas 299 aeruginosa considerably contributes to lung tissue destruction and impairment of pulmonary function in 300 cystic-fibrosis (CF) patients (39). Marie et al. found that the presence of E. coli complemented the 301 growth defect of a P. aeruginosa bioA-disrupted mutant that is unable to grow on rich medium, and can 302 be beneficial to P. aeruginosa when biotin supply is limited (39). An ATU with a high expression level 303 coded by the uvrB gene is identified by SeqATU in Rich medium, while it does not exist in M9 minimal 304 medium (Fig. 6). We predicted the uvrB gene to be involved in the biotin metabolism pathway, as the 305 bioB, bioF, bioC, and bioD genes contained in a same ATU with it have been known in the biotin 306 metabolism KEGG pathway. Therefore, the observation by Marie et al. can be explained that the ATUs 307 coded by the uvrB gene of E. coli can provide the biotin supply for P. aeruginosa under rich medium. 308 This result showed that SeqATU could increase our understanding of interspecies competition and 309 cooperation, which play an important role in shaping the composition and structure of polymicrobial 310 bacterial populations. 311 Please place Fig. 5 here. 312 Please place Fig. 6 here. 313 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 19 Predicted ATUs by SeqATU are verified by experimental TSSs and TTSs 314 An experimental TSS dataset of E. coli from SEnd-seq (7) and a TF binding site dataset of E. coli from 315 the experimental dataset of RegulonDB (19) were used to further verify the reliability of SeqATU and 316 were named dataset 1 and dataset 2, respectively. There were 5,512 experimental TSSs in dataset 1 and 317 3,220 experimental TF binding sites in dataset 2. We considered the 5’-end genes and no 5’-end genes of 318 the predicted ATUs by SeqATU. A gene that is not the 5’-end gene of any predicted ATU is named a no 319 5’-end gene. We identified 2,177/2,005 5’-end genes and 1,266/1,160 no 5’-end genes of the predicted 320 ATUs for M9Enrich_Seq/RiEnich. A gene validated by experimental TSSs or TF binding sites means 321 that it is the immediate downstream gene of an experimental TSS or TF binding site. As a result, the 322 proportion of 5’-end genes of the predicted ATUs that were validated by experimental TSSs or TF 323 binding sites was over 1.7 times greater than that of the no 5’-end genes (Table 1). Specifically, the 324 proportion of 5’-end genes (29%/30% for M9Enrich_Seq/RiEnrich_Seq) validated by experimental TF 325 binding sites was over three times greater than the no 5’-end genes (9.2%/9.0% for 326 M9Enrich_Seq/RiEnrich_Seq). These results further verified the reliability of the ATUs predicted by 327 SeqATU in terms of the TSS level. In addition, four other experimental TSS or promoter datasets from 328 RegulonDB (19), dRNA-seq (14), and Cappable-seq (13) were also examined. The results are shown in 329 table S3, and we also found a higher proportion of 5’-end genes of the predicted ATUs validated by 330 experimental TSSs or promoters than that of no 5’-end genes. 331 We also used two experimental TTS datasets of E. coli from SEnd-seq (7) and RegulonDB (19) to 332 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 20 verify the reliability of predicted ATUs by SeqATU in terms of TTS level. These two experimental TTS 333 datasets were named dataset 3 and dataset 4, respectively. There were 1,540 experimental TTSs in 334 dataset 3 and 367 experimental TTSs in dataset 4. We considered the 3’-end genes and no 3’-end genes 335 of the predicted ATUs by SeqATU. A gene that is not the 3’-end gene of any predicted ATU is named a 336 no 3’-end gene. We identified 2,290/2,187 3’-end genes and 1,153/978 no 3’-end genes of the predicted 337 ATUs for M9Enrich_Seq/RiEnrich_Seq. A gene validated by experimental TTSs means that it is the 338 immediate upstream gene of an experimental TTS. As a result, the proportion of 3’-end genes of the 339 predicted ATUs that were validated by experimental TTSs was over two times greater than that of no 3’-340 end genes (Table 2). Specifically, the proportion of 3’-end genes (51%/53% for 341 M9Enrich_Seq/RiEnrich_Seq) validated by experimental TTSs from SEnd-seq was over three times 342 greater than that of no 3’-end genes (15%/14% for M9Enrich_Seq/RiEnrich_Seq). These results further 343 verified the reliability of the ATUs predicted by SeqATU in terms of the TTS level. In addition, two 344 other computationally predicted TTS datasets from the works by Nadiras et al. (40) and Kingsford et al. 345 (41) were also examined. The results are shown in table S4, and we also found the proportion of 3’-end 346 genes (63%/62% for M9Enrich_Seq/RiEnrich_Seq) validated by computationally predicted Rho-347 independent TTSs was over two times greater than that of no 3’-end genes (29%/29% for 348 M9Enrich_Seq/RiEnrich_Seq). 349 Please place Table 1 here. 350 Please place Table 2 here. 351 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 21 The gene pairs frequently encoded in the same ATUs are more functionally related than those that 352 can belong to two distinct ATUs 353 Functional analysis was conducted by integrating GO terms from the Gene Ontology (GO) database 354 (42). In detail, we measured the level of functional relatedness for two types of consecutive gene pairs, 355 which is similar to the definition in the work by Mao et al. (38). Two types of consecutive gene pairs 356 were (i) gene pairs each consisting of a 5’-end gene of an ATU and the gene in its immediate upstream 357 on the same strand and (ii) all the other gene pairs inside an ATU (Fig. 7A). In addition, we used a 358 scoring scheme to measure the GO-based functional similarity between a pair of genes by Wu et al. (43). 359 This study developed a GO similarity score and showed that the larger the score, the more likely that 360 two genes are functionally related. In brief, the GO similarity score of a gene pair �� and �� is 361 denoted as ��� (�� , �� ): 362 ��� ���, �� � = �����∈�(��), ��∈�(��) �(�� , �� ) 363 where �� and �� are the GO terms assigned to �� and �� , respectively; �(�� , �� ) is the maximal 364 number of common terms between paths in the two GO graphs induced by the GO terms �� and ��. 365 As a result, the mean GO similarity score was higher for type-ii gene pairs (5.97 versus 4.04 for 366 M9Enrich_Seq and 5.86 versus 3.91 for RiEnrich_Seq) than for type-i gene pairs. A total of 574/524 367 type-ii gene pairs had GO similarity scores greater than four (64%/63% of a total of 899/834), while 368 only 461/404 type-i gene pairs had GO similarity scores greater than four (36%/34% of a total of 369 1,274/1,179) for M9Enrich_Seq/RiEnrich_Seq. We also applied a c�-test (44) to determine whether the 370 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 22 distribution of ��� ��� , �� � was different for the type-i gene pairs and type-ii gene pairs. The c �-371 statistics corresponded to a P-value less than 10��, which revealed that the distribution of ��� ��� , �� � 372 for the type-ii gene pairs was significantly different from the type-i gene pairs. Fig. 7B shows the 373 distribution of ��� ��� , �� � for the type-i gene pairs and the type-ii gene pairs. These results strongly 374 indicated that the type-ii gene pairs had a higher degree of GO similarity than the type-i gene pairs, 375 suggesting that the gene pairs frequently encoded in the same ATUs (type-ii gene pairs) are more 376 functionally related than those that can belong to two distinct ATUs (type-i gene pairs). 377 We also carried out a similar analysis of the two different gene pairs based on KEGG enrichment 378 analysis (see more details in method S9) and found that the proportion of type-ii gene pairs (59%/57% 379 for M9Enrich_Seq/RiEnrich_Seq), whose two genes were contained in the same KEGG pathway, was 380 higher than the proportion of type-i gene pairs (32%/28% for M9Enrich_Seq/RiEnrich_Seq) (Fig. 7C). 381 The distribution of the KEGG similarity scores of the two different types of gene pairs is shown in Fig. 382 7D, suggesting that genes of type-ii gene pairs have a higher probability of participating in the same 383 KEGG pathway than those of type-i gene pairs. 384 Please place Fig. 7 here. 385 DISCUSSION 386 We developed SeqATU, the first computational method for genome-scale ATU prediction by analyzing 387 next- and third-generation RNA-Seq data, using a CQP model. Linear constraints provided by the bias 388 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 23 rate of read distribution were, for the first time, integrated into the CQP model. Positional bias refers to 389 the non-uniform distribution of reads over different positions of a transcript (33, 35), which is handled 390 by learning non-uniform read distributions from given RNA-Seq reads (32) or modeling the RNA 391 degradation (45). The bias rate function we proposed can address the non-uniform read distribution 392 along mRNA transcripts and also be desirable for standard next-generation RNA-Seq data that involves 393 more degraded mRNAs, as the exponential function has been used to model the degradation of mRNA 394 transcripts (45). As a result, a total of 2,973 distinct ATUs for M9Enrich_Seq and 2,767 distinct ATUs 395 for RiEnrich_Seq were identified by SeqATU. The precision and recall reached 0.67/0.64 and 0.67/0.68, 396 respectively, based on perfect matching and 0.77/0.74 and 0.75/0.76, respectively, based on relaxed 397 matching for M9Enrich_Seq/RiEnrich_Seq. We further validated predicted ATUs using experimental 398 transcription factor binding sites or transcription termination sites from RegulonDB and SEnd-Seq. In 399 addition, the proportion of the 5’- or 3’-end genes of predicted ATUs that were validated by 400 experimental transcription factor binding sites and transcription termination sites was over three times 401 greater than that of no 5’- or 3’-end genes, demonstrating the high reliability of predicted ATUs. Gene 402 pairs frequently encoded in the same ATUs were more functionally related than those that can belong to 403 two distinct ATUs according to GO and KEGG enrichment analyses. These results demonstrated the 404 reliability and accuracy of our predicted ATUs, implying the ability of SeqATU to reveal the 405 transcriptional architecture of the bacterial genome. 406 In fact, the ATU architecture of bacteria is much more complex than that determined with currently 407 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 24 used experimental techniques. We investigated the 5’-end genes and no 5’-end genes of the experimental 408 ATUs identified by SMRT-Cappable-seq (6) using a combination of experimental TSSs from 409 RegulonDB (19), dRNA-seq (14), Cappable-seq (13), and SEnd-seq (7). As a result, we found that the 410 proportion of 5’-end genes (99%) validated by experimental TSSs was not significantly different from 411 that of no 5’-end genes (92%). The high percentage of no 5’-end genes validated by experimental TSSs 412 implied that the ATUs identified by experimental techniques are only a small proportion of the 413 comprehensive ATUs in bacterial organisms due to the dynamic mechanisms of ATUs. These results 414 further verified the necessity of developing robust computational methods for ATU identification. 415 SeqATU not only provides a powerful tool to understand the transcription mechanism of bacteria but 416 also provides a fundamental tool to guide the reconstruction of a genome-scale transcriptional regulatory 417 network. First, the ATU structure can help us to make new functional predictions, as genes in an ATU 418 tend to have related functions. Second, ATUs can elucidate condition-specific uses of alternative sigma 419 factors (8, 46). For example, the thrLABC operon is regulated by transcriptional attenuation. Totsuka et 420 al. found that under the log phase growth condition, the thrLABC operon is the only transcript, while 421 two transcripts are found under stationary phase growth condition, the thrLABC and thrBC. As validated 422 experimentally, � � can regulate the additional promoter located in front of thrB under the stationary 423 phase growth condition and then separately regulate thrBC, which elucidates the condition-specific uses 424 of � � (8). Third, understanding the ATU structure is of great help to construct transcriptional and 425 translation regulatory networks, such as for the construction of the σ-TUG (σ-factor-transcription unit 426 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 25 gene) network (47). The transcription regulatory network consists of nodes (ATU and regulatory 427 proteins) and links (interactions) (48), and the comprehensive ATU structure can provide a nearly 428 complete set of nodes, which can improve the accuracy of regulatory prediction. 429 Although SeqATU has obtained satisfactory predicted results, there are still several challenges 430 regarding the computational prediction of ATUs. On the one hand, due to the influence of the 3’ 431 untranslated region (UTR) and 5’ untranslated region (UTR) in the intergenic regions, the expression 432 value of intergenic regions cannot be reproduced perfectly by the same calculation used for the 433 expression value of genetic regions. Without accurate reproduction, it is difficult to obtain the best 434 expression combination of ATUs by the programming model based on the expression value of genetic 435 and intergenic regions. On the other hand, due to the lack of strand-specific RNA-Seq data, it is difficult 436 to distinguish the expression level of intergenic regions between two consecutive genes on the same 437 strand derived from ATUs containing these two genes or antisense RNAs (asRNAs) (6, 49). All of these 438 challenges and the great significance of ATU prediction inspire and encourage us to discover more 439 information to determine the ATU structure in bacteria. For example, we plan to add high confidence 440 TSSs and TTSs information to our programming model in the future. Additionally, since the microbiome 441 is increasingly recognized as a critical component in human diseases, such as inflammatory bowel 442 disease (50), antibiotic-associated diarrhoea (51), neurological disorders (52), and cancer (53) (54), 443 predicting new ATUs of uncultured species from metagenomic and metatranscriptomic data is of great 444 significance in uncovering new regulatory pathway and metabolic products during the development of 445 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 26 diseases (55). However, due to a majority of species with unknown genomes or genome annotations 446 within a microbial community, ATU prediction on metagenomics and metatranscriptomics is still a 447 challenging task, which encourage us to pay more attention on it. 448 REFERENCES 449 1. F. Jacob, D. Perrin, C. Sanchez, J. Monod, Operon: a group of genes with the expression 450 coordinated by an operator. C R Hebd. Seances. Acad. Sci 250, 1727-1729 (1960). 451 2. F. Jacob, J. Monod, Genetic regulatory mechanisms in the synthesis of proteins. J. Mol. Biol. 3, 452 318-356 (1961). 453 3. Z. Liu, J. Feng, B. Yu, Q. Ma, B. Liu, The functional determinants in the organization of bacterial 454 genomes. Brief. Bioinform., doi.org/10.1093/bib/bbaa1172 (2020). 455 4. W.-C. Chou, Q. Ma, S. Yang, S. Cao, D. M. Klingeman, S. D. Brown, Y. Xu, Analysis of strand-456 specific RNA-seq data using machine learning reveals the structures of transcription units in 457 Clostridium thermocellum. Nucleic Acids Res. 43, e67-e67 (2015). 458 5. S.-Y. Niu, B. Liu, Q. Ma, W.-C. Chou, rSeqTU—a machine-learning based R package for 459 prediction of bacterial transcription units. Frontiers in genetics 10, 374 (2019). 460 6. B. Yan, M. Boitano, T. A. Clark, L. Ettwiller, SMRT-Cappable-seq reveals complex operon 461 variants in bacteria. Nat. Commun. 9, 3676 (2018). 462 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 27 7. X. Ju, D. Li, S. Liu, Full-length RNA profiling reveals pervasive bidirectional transcription 463 terminators in bacteria. Nature microbiology 4, 1907-1918 (2019). 464 8. K. Totsuka, K. Totsuka, The Transcription Unit Architecture of the Escherichia Coli Genome. Nat. 465 Biotechnol. 27, 1043-1049 (2009). 466 9. A. H. Bhat, D. Pathak, A. Rao, The alr-groEL1 operon in Mycobacterium tuberculosis: an interplay 467 of multiple regulatory elements. Scientific Reports 7, 43772 (2017). 468 10. C. M. Sharma, S. Hoffmann, F. Darfeuille, J. Reignier, S. Findeiß, A. Sittka, S. Chabas, K. Reiche, 469 J. Hackermüller, R. Reinhardt, The primary transcriptome of the major human pathogen 470 Helicobacter pylori. Nature 464, 250-255 (2010). 471 11. J. M. Durand, G. R. Bjork, Putrescine or a combination of methionine and arginine restores 472 virulence gene expression in a tRNA modification-deficient mutant of Shigella flexneri: a possible 473 role in adaptation of virulence. Mol. Microbiol. 47, 519-527 (2010). 474 12. L. E. Wroblewski, R. M. Peek, K. T. Wilson, Helicobacter pylori and gastric cancer: factors that 475 modulate disease risk. Clin. Microbiol. Rev. 23, 713-739 (2010). 476 13. L. Ettwiller, J. Buswell, E. Yigit, I. Schildkraut, A novel enrichment strategy reveals unprecedented 477 number of novel transcription start sites at single base resolution in a model prokaryote and the 478 gut microbiome. BMC Genomics 17, 199-199 (2016). 479 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 28 14. M. K. Thomason, T. Bischler, S. K. Eisenbart, K. U. Forstner, A. Zhang, A. Herbig, K. Nieselt, C. 480 M. Sharma, G. Storz, Global transcriptional start site mapping using differential RNA sequencing 481 reveals novel antisense RNAs in Escherichia coli. J. Bacteriol. 197, 18-28 (2015). 482 15. T. Bischler, H. S. Tan, K. Nieselt, C. M. Sharma, Differential RNA-seq (dRNA-seq) for annotation 483 of transcriptional start sites and small RNAs in Helicobacter pylori. Methods 86, 89-101 (2015). 484 16. D. Dar, M. Shamir, J. Mellin, M. Koutero, N. Stern-Ginossar, P. Cossart, R. Sorek, Term-seq 485 reveals abundant ribo-regulation of antibiotics resistance in bacteria. Science 352, 6282 (2016). 486 17. J. Clauwaert, G. Menschaert, W. Waegeman, An in-depth evaluation of annotated transcription 487 start sites in E. coli using deep learning. bioRxiv, doi: https://doi.org/10.1101/2020.03.16.993501, 488 4 November 2020, pre-print: not peer-reviewed. (2020). 489 18. S. Goodwin, J. D. Mcpherson, W. R. Mccombie, Coming of age: ten years of next-generation 490 sequencing technologies. Nat. Rev. Genet. 17, 333-351 (2016). 491 19. A. Santos-Zavaleta, H. Salgado, S. Gama-Castro, M. Sánchez-Pérez, L. Gómez-Romero, D. 492 Ledezma-Tejeida, J. S. García-Sotelo, K. Alquicira-Hernández, L. J. Muñiz-Rascado, P. Peña-493 Loredo, RegulonDB v 10.5: tackling challenges to unify classic and high throughput knowledge 494 of gene regulation in E. coli K-12. Nucleic Acids Res. 47, D212-D220 (2018). 495 20. N. Sierro, Y. Makita, M. J. L. De Hoon, K. Nakai, DBTBS: a database of transcriptional regulation 496 in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res. 497 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 29 36, 93-96 (2008). 498 21. P. S. Dehal, M. P. Joachimiak, M. N. Price, J. T. Bates, J. K. Baumohl, C. Dylan, G. D. Friedland, 499 K. H. Huang, K. Keith, P. S. Novichkov, MicrobesOnline: an integrated portal for comparative and 500 functional genomics. Nucleic Acids Res. 38, D396-D400 (2010). 501 22. H. Cao, Q. Ma, X. Chen, Y. Xu, DOOR: a prokaryotic operon database for genome analyses and 502 functional inference. Brief. Bioinform. 20, 1568-1577 (2019). 503 23. X. Mao, Q. Ma, C. Zhou, X. Chen, H. Zhang, J. Yang, F. Mao, W. Lai, Y. Xu, DOOR 2.0: presenting 504 operons and their functions through dynamic and integrated views. Nucleic Acids Res. 42, D654-505 D659 (2013). 506 24. K. Chetal, S. C. Janga, OperomeDB: A Database of Condition-Specific Transcription Units in 507 Prokaryotic Genomes. Biomed Research International 2015, 1-10 (2015). 508 25. J. Yang, X. Chen, A. Mcdermaid, Q. Ma, DMINDA 2.0: integrated and systematic views of 509 regulatory DNA motif identification and analyses. Bioinformatics 33, 2586-2588 (2017). 510 26. T. Blanca, C. Ricardo, C. E. Martinez-Guerrero, M. Enrique, ProOpDB: Prokaryotic Operon 511 DataBase. Nucleic Acids Res. 40, D627-D631 (2012). 512 27. R. McClure, D. Balasubramanian, Y. Sun, M. Bobrovskyy, P. Sumby, C. A. Genco, C. K. 513 Vanderpool, B. Tjaden, Computational analysis of bacterial RNA-Seq data. Nucleic Acids Res. 41, 514 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 30 e140-e140 (2013). 515 28. X. Chen, W. Chou, Q. Ma, Y. Xu, SeqTU: A Web Server for Identification of Bacterial 516 Transcription Units. Scientific Reports 7, 43925 (2017). 517 29. I. A. Garanina, G. Y. Fisunov, V. M. Govorun, BAC-BROWSER: The Tool for Visualization and 518 Analysis of Prokaryotic Genomes. Frontiers in Microbiology 9, 2827 (2018). 519 30. B. Taboada, K. Estrada, R. Ciria, E. Merino, Operon-mapper: a web server for precise operon 520 identification in bacterial and archaeal genomes. Bioinformatics 34, 4118-4120 (2018). 521 31. H. Li, R. Durbin, Fast and accurate short read alignment with Burrows–Wheeler transform. 522 Bioinformatics 25, 1754-1760 (2009). 523 32. Z. Wu, X. Wang, X. Zhang, Using non-uniform read distribution models to improve isoform 524 expression inference in RNA-Seq. Bioinformatics 27, 502-508 (2011). 525 33. A. Roberts, C. Trapnell, J. Donaghey, J. L. Rinn, L. Pachter, Improving RNA-Seq expression 526 estimates by correcting for fragment bias. Genome Biol. 12, 1-14 (2011). 527 34. R. Bohnert, G. Rï¿ ½tsch, rQuant. web: a tool for RNA-Seq-based transcript quantitation. Nucleic 528 Acids Res. 38, W348-W351 (2010). 529 35. W. Li, T. Jiang, Transcriptome assembly and isoform expression level estimation from biased 530 RNA-Seq reads. Bioinformatics 28, 2914-2921 (2012). 531 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 31 36. B. Xiong, Y. Yang, F. R. Fineis, J.-P. Wang, DegNorm: normalization of generalized transcript 532 degradation improves accuracy in RNA-seq analysis. Genome Biol. 20, 75 (2019). 533 37. J. Chaitanya, Degradation of mRNA in Escherichia coli. IUBMB Life 54, 315-321 (2010). 534 38. X. Mao, Q. Ma, B. Liu, X. Chen, H. Zhang, Y. Xu, Revisiting operons: an analysis of the landscape 535 of transcriptional units in E. coli. BMC Bioinformatics 16, 356 (2015). 536 39. B. Marie, K. H. Thilo, F. Thierry, T. Mikael, R. Adriana, V. D. Christian, Metabolic pathways of 537 Pseudomonas aeruginosa involved in competition with respiratory bacterial pathogens. Frontiers 538 in Microbiology 6, 321 (2015). 539 40. C. Nadiras, E. Eveno, A. Schwartz, N. Figueroa-Bossi, M. Boudvillain, A multivariate prediction 540 model for Rho-dependent termination of transcription. Nucleic Acids Res. 46, 8245-8260 (2018). 541 41. C. L. Kingsford, K. Ayanbule, S. L. Salzberg, Rapid, accurate, computational discovery of Rho-542 independent transcription terminators illuminates their relationship to DNA uptake. Genome Biol. 543 8, R22 (2007). 544 42. M. Ashburner, S. Lewis, On Ontologies for Biologists: The Gene Ontology—Untangling the Web. 545 Novartis Found. Symp. 247, 66-80; discussion 80-63, 84-90, 244-252 (2002). 546 43. H. Wu, Z. Su, F. Mao, V. Olman, Y. Xu, Prediction of functional modules based on comparative 547 genome analysis and Gene Ontology application. Nucleic Acids Res. 33, 2822-2837 (2005). 548 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 32 44. S. A. Teukolsky, B. P. Flannery, W. Press, W. Vetterling, Numerical Recipes in C: The Art of 549 Scientific Computing. Cambridge University Press, Cambridge (1992). 550 45. L. Wan, X. Yan, T. Chen, F. Sun, Modeling RNA degradation for RNA-Seq with applications. 551 Biostatistics 13, 734-747 (2012). 552 46. C. Yanofsky, Attenuation in the control of expression of bacterial operons. Nature 289, 751 (1981). 553 47. B. K. Cho, D. Kim, E. M. Knight, K. Zengler, B. O. Palsson, Genome-scale reconstruction of the 554 sigma factor network in Escherichia coli : topology and functional states. BMC Biol. 12, 4-4 (2014). 555 48. B.-K. Cho, P. Charusanti, M. J. Herrgård, Microbial regulatory and metabolic networks. Curr. Opin. 556 Biotechnol. 18, 360-364 (2007). 557 49. A. Toledo-Arana, O. Dussurget, G. Nikitas, N. Sesto, H. Guet-Revillet, D. Balestrino, E. Loh, J. 558 Gripenland, T. Tiensuu, K. Vaitkevicius, The Listeria transcriptional landscape from saprophytism 559 to virulence. Nature 459, 950-956 (2009). 560 50. B. Yue, X. Luo, Z. Yu, S. Mani, Z. Wang, W. Dou, Inflammatory bowel disease: a potential result 561 from the collusion between gut microbiota and mucosal immune system. Microorganisms 7, 440 562 (2019). 563 51. B. H. Mullish, H. R. Williams, Clostridium difficile infection and antibiotic-associated diarrhoea. 564 Clin. Med. 18, 237 (2018). 565 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 33 52. M. Maguire, G. Maguire, Gut dysbiosis, leaky gut, and intestinal epithelial proliferation in 566 neurological disorders: towards the development of a new therapeutic using amino acids, 567 prebiotics, probiotics, and postbiotics. Rev. Neurosci. 30, 179-201 (2019). 568 53. S. Vivarelli, R. Salemi, S. Candido, L. Falzone, M. Santagati, S. Stefani, F. Torino, G. L. Banna, 569 G. Tonini, M. Libra, Gut microbiota and cancer: from pathogenesis to therapy. Cancers 11, 38 570 (2019). 571 54. G. Cammarota, G. Ianiro, A. Ahern, C. Carbone, A. Temko, M. J. Claesson, A. Gasbarrini, G. 572 Tortora, Gut microbiome, big data and machine learning to promote precision medicine for cancer. 573 Nature Reviews Gastroenterology & Hepatology 17, 635-648 (2020). 574 55. S. S. A. Zaidi, X. Zhang, Computational operon prediction in whole-genomes and metagenomes. 575 Briefings in functional genomics 16, 181-193 (2017). 576 ACKNOWLEDGEMENTS 577 Funding: This work was supported by the National Nature Science Foundation of China (NSFC) 578 [61772313 to B.L., 11931008 to B.L.]; Interdisciplinary Science Innovation Group Project of Shandong 579 University (2019); and the Innovation Method Fund of China [2018IM020200 to B.L.]. The authors 580 would like to thank Yang Li for his assistance in language polishing. Authors’ contributions: B.L., 581 Q.M. and W.C. conceived the basic idea and designed the overall analyses. Q.W. carried out most of the 582 computational analysis and data interpretation. All the authors wrote the manuscript. Competing 583 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 34 interests: The authors declare that they have no competing interests. Data and materials availability: 584 The raw data and source code of SeqATU and a detailed tutorial can be found at 585 https://github.com/OSU-BMBL/SeqATU. 586 FIGURES AND TABLES 587 Table 1. Results of predicted ATUs verified by experimental TSSs or TF binding sites. Overview of 588 the experimental TSS and TF binding site datasets (dataset 1 and dataset 2) and the proportion of 5’-end 589 genes and no 5’-end genes of the predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which 590 were validated by experimental TSSs or TF binding sites. 591 dataset 1 dataset 2 Source Ju et al. (7) RegulonDB TF binding sites Technique SEnd-seq Collection TSSs/TF binding sites 5,512 3,220 M9Enrich_Se q 5’-end genes 83% 29% no 5’-end genes 47% 9.2% RiEnrich_Seq 5’-end genes 89% 30% no 5’-end genes 44% 9.0% 592 593 Table 2. Results of predicted ATUs verified by experimental TTSs. Overview of the experimental 594 TTS datasets (dataset 3 and dataset 4) and the proportion of 3’-end genes and no 3’-end genes of the 595 predicted ATUs by SeqATU for M9Enrich_Seq and RiEnrich_Seq, which were validated by 596 experimental TTSs. 597 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 35 dataset 3 dataset 4 Source Ju et al. (7) RegulonDB TTSs Technique SEnd-seq Collection TTSs 1,540 3,67 M9Enrich_Se q 3’-end genes 51% 11% no 3’-end genes 15% 5.2% RiEnrich_Seq 3’-end genes 53% 11% no 3’-end genes 14% 4.8% 598 599 600 Fig. 1. Schematic overview of SeqATU. The blue arrow and orange line denote gene and RNA-Seq 601 read, respectively. The preprocessing stage requires RNA-Seq data in the FASTQ format, the reference 602 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 36 genome sequence in the FASTA format, and gene annotations in the GFF format, generating linear 603 constraints for the next convex quadratic programming (CQP) stage. There are two steps in the 604 preprocessing stage: (i) calculating the expression value of the genetic region �� and intergenic region 605 ��,� and (ii) modelling non-uniform read distribution along mRNA transcripts; specifically, we acquired 606 a bias rate function �(�) = �� � using nonlinear regression and then constructed genetic or intergenic 607 region bias rate vectors. The maximal ATU cluster data determined by rSeqTU and the linear constraints 608 from preprocessing are both taken as inputs of CQP. CQP seeks the optimum expression combination of 609 all of the to-be-identified ATUs to minimize the gap ��� between the predicted ATU expression profile 610 and the genetic and intergenic region expression profile. Finally, the output of CQP is the predicted 611 ATUs. 612 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 37 613 Fig. 2. Results of modelling non-uniform read distribution along mRNA transcripts. The four bias 614 rate functions (� = ����) by nonlinear regression had similar coefficients (� and �) across the four 615 datasets M9Enrich_1, M9Enrich_2, RiEnrich_1 and RiEnrich_2. 616 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 38 617 Fig. 3. Overall evaluation results of SeqATU. (A) Precision and recall based on perfect matching and 618 relaxed matching for M9Enrich_Seq (left) and RiEnrich_Seq (right) using evaluated ATUs from SMRT-619 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 39 Cappable-seq. (B) Average precision based on perfect matching for M9Enrich_Seq (left) and 620 RiEnrich_Seq (right) using evaluated ATUs from SMRT-Cappable-seq (black) and evaluated ATUs from 621 SMRT-Cappable-seq and SEnd-seq (red). The magnitude of the point denotes the number of maximal 622 ATU clusters with same size. (C) Average number of ATUs across different sizes of SMRT maximal 623 ATU clusters for M9Enrich_Seq (left) and RiEnrich_Seq (right). 624 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 40 625 Fig. 4. Comparative analysis of the performance between SeqATU and SeqATU without the bias 626 rate constrains for SMRT maximal ATU clusters. (A) Precision, recall and F-score based on perfect 627 matching for M9Enrich_Seq and RiEnrich_Seq. (B) Precision, recall and F-score based on relaxed 628 matching for M9Enrich_Seq and RiEnrich_Seq. 629 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 41 630 Fig. 5. Comprehensive analysis of the predicted ATUs by SeqATU. (A) Number of ATUs across 631 different sizes. The size of an ATU is the number of its component genes. (B) Distribution of the number 632 of ATUs per gene. 633 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 42 634 Fig. 6. Integrative Genomics Viewer (IGV) representation of the mapping and ATUs. Mapping and 635 ATUs of M9Enrich_Seq (orange) and RiEnrich_Seq (blue) were shown for the maximal ATU cluster 636 containing the bioB, bioF, bioC, bioD and uvrB genes. 637 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/ 43 638 Fig. 7. Interpretation and results of the functional relatedness of different gene pairs based on GO 639 and KEGG enrichment analyses. (A) Illustration of two different gene pairs i and ii. (B) Functional 640 relatedness results based on GO enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right). 641 (C) The proportion of two different gene pairs whose genes are contained in the same KEGG pathway 642 for M9Enrich_Seq (left) and RiEnrich_Seq (right). (D) The functional relatedness results based on 643 KEGG enrichment analysis for M9Enrich_Seq (left) and RiEnrich_Seq (right). 644 .CC-BY-NC-ND 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.02.425006doi: bioRxiv preprint https://doi.org/10.1101/2021.01.02.425006 http://creativecommons.org/licenses/by-nc-nd/4.0/