key: cord-0855906-v4r1tdku
authors: Kreitmeier, Michaela; Ardern, Zachary; Abele, Miriam; Ludwig, Christina; Scherer, Siegfried; Neuhaus, Klaus
title: Spotlight on alternative frame coding: Two long overlapping genes in Pseudomonas aeruginosa are translated and under purifying selection
date: 2022-02-01
journal: iScience
DOI: 10.1016/j.isci.2022.103844
sha: 2ad98d6f890687b84da2d42872fcebceed257a49
doc_id: 855906
cord_uid: v4r1tdku

The existence of overlapping genes (OLGs) with significant coding overlaps revolutionizes our understanding of genomic complexity. We report two exceptionally long (957 nt and 1536 nt), evolutionarily novel, translated antisense open reading frames (ORFs) embedded within annotated genes in the pathogenic Gram-negative bacterium Pseudomonas aeruginosa. Both OLG pairs show sequence features consistent with being genes and transcriptional signals in RNA sequencing. Translation of both OLGs was confirmed by ribosome profiling and mass spectrometry. Quantitative proteomics of samples taken during different phases of growth revealed regulation of protein abundances, implying biological functionality. Both OLGs are taxonomically restricted, and likely arose by overprinting within the genus. Evidence for purifying selection further supports functionality. The OLGs reported here, designated olg1 and olg2, are the longest yet proposed in prokaryotes and are among the best attested in terms of translation and evolutionary constraint. These results highlight a potentially large unexplored dimension of prokaryotic genomes.

The tri-nucleotide character of the genetic code enables six reading frames in a double-stranded nucleotide sequence. Protein-coding ORFs at the same locus but in different reading frames are referred to as overlapping genes (OLGs). Studies of coding overlaps of more than 90 nucleotides, i.e., nontrivial overlaps, have mainly been restricted to viruses, where the first OLG was found in 1976 (Barrell et al., 1976). An OLG pair with such a nontrivial overlap can be described in terms of an older, typically longer "mother gene" and a more recently evolved "daughter gene," by analogy to mother and daughter cells in reproduction.
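The six reading frames mentioned above can be enumerated with a few lines of code. The following is a minimal sketch, not part of the original study; the function names and the frame labels (`+1` to `+3`, `-1` to `-3`) are illustrative assumptions and do not necessarily match the frame-numbering convention used in this article.

```python
# Illustrative sketch: split a DNA sequence into codons for all six frames.
# Frame labels are hypothetical and may differ from the paper's convention.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq: str) -> dict:
    """Map a frame label to the list of complete codons read in that frame."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            codons = [
                strand_seq[i:i + 3]
                for i in range(offset, len(strand_seq) - 2, 3)
            ]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for label, codons in six_frames("ATGAAATAG").items():
        print(label, codons)
```

Two ORFs form an overlapping gene pair when they occupy the same locus in two different entries of such a frame map.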

In prokaryotes, automated genome-annotation algorithms such as Glimmer allow only one open reading frame (ORF) per locus, with the exception of short overlaps (Delcher et al., 2007). This systematically excludes overlapping ORFs from being annotated as genes (Warren et al., 2010). The "inferior" ORFs (e.g., shorter or with fewer hits in databases) within overlapping gene pairs have been called "shadow ORFs," as they are found in the shadow of the annotated coding ORF (Yooseph et al., 2007). Determining which ORF to annotate within such pairs has been described as the most difficult problem in prokaryotic gene annotation (Salzberg et al., 1998). Nevertheless, a few prokaryotic OLGs have been discovered, often serendipitously in the pursuit of other unannotated genes (Baek et al., 2017; Jensen et al., 2006; Weaver et al., 2019). For instance, in some Escherichia coli strains a few nontrivial overlaps have been detected and experimentally analyzed (Behrens et al., 2002; Fellner et al., 2014; Fellner et al., 2015; Hücker et al., 2018a; Hücker et al., 2018b; Vanderhaeghen et al., 2018; Zehentner et al., 2020a). Transcriptomic or translatomic evidence for OLGs also exists in genera such as Pseudomonas (Filiatrault et al., 2010) and Mycobacterium (Smith et al., 2019). Recently, evidence for OLGs has also been reported in archaea (Gelsinger et al., 2020) and in mammals (Khan et al., 2020; Loughran et al., 2020). These findings support the hypothesis that OLGs are ubiquitous. Various aspects of OLGs, including potential applications in synthetic biology, have been discussed in a recent review (Wright et al., 2021).

Verifying OLGs using mass spectrometry (MS) is difficult because most OLGs appear to be short and weakly expressed, and proteomics has limited abilities in detecting such proteins (Petruschke et al., 2020). RNA sequencing led to the discovery of many antisense transcripts, but whether many of these are translated [...]

[...] second ORF, later named olg2, had an RPKM RNASeq of about 22, as compared with approximately 20 for the overlapping (annotated) gene PA1383. These values were comparable to annotated protein-coding genes, which range between 0.4 and 13,563.5 (Figure 1A), indicating substantial transcription for the two novel ORFs and the overlapping annotated genes (Figures 2A and 2B, first track; Table S1). Full-length transcription of both OLGs was confirmed by RT-PCR (Figure S1).

Antisense transcription, reported in diverse bacteria including P. aeruginosa, often plays a role in gene regulation, indicated by a negative correlation between antisense and mRNA levels (Dornenburg et al., 2010; Eckweiler and Häussler, 2018). However, evidence exists that some antisense transcripts are translated (Ardern et al., 2020; Friedman et al., 2017; Stringer et al., 2021; Weaver et al., 2019). Our results (Figures 2A and 2B, second track) strongly support translation in both directions for the loci of PA1383 and tle3. The annotated genes showed RPKM RiboSeq values close to the median annotated gene RPKM RiboSeq of 25.7 (Figure 1B). The expression of olg1 was even higher (RPKM RiboSeq = 40.3), whereas olg2 showed a lower, but still unequivocal signal (RPKM RiboSeq = 14.2). In each case, expression was consistent across replicates, and coverage was excellent across the whole ORF (Figures 1C, 2A and 2B, second track). The sequence features reported earlier overall [...]

DeepRibo is a software tool combining RiboSeq signals and DNA sequence motifs by using neural networks (Clauwaert et al., 2019). DeepRibo predicts olg1 and olg2 to be translated with the same start codons as we predict from visual assessment in both replicates; it also confirms translation of the mother genes tle3 and PA1383. As a comparison, we analyzed the genes adjacent to tle3; these genes similarly had high coverage and RPKM values (Table S1), implying expression of the operon, but no antisense signals (Figure S2). Finally, ribosome coverage values (RCV) were calculated, i.e., RPKM RiboSeq over RPKM RNASeq (Neuhaus et al., 2016). This value allows a direct estimation of the "translatability" of an ORF. With a value of 1.13, olg1 and PA1383 were within the top 20% of all annotated genes. tle3 and olg2 showed lower RCVs (i.e., 0.79 and 0.64, respectively) but within the range of other annotated genes (Figure 1D).
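The ribosome coverage value is a simple ratio, and the arithmetic can be made explicit. The sketch below is illustrative only: the `rpkm` helper uses the standard definition (reads per kilobase per million mapped reads), which is assumed rather than taken from the methods of this study, and the function names are hypothetical. The olg2 inputs are the rounded values quoted in the text (RPKM RiboSeq = 14.2, RPKM RNASeq of about 22).

```python
# Illustrative sketch of RPKM and the ribosome coverage value (RCV).
# Function names are hypothetical; rpkm() follows the standard definition.


def rpkm(read_count: float, gene_length_nt: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)


def ribosome_coverage_value(rpkm_riboseq: float, rpkm_rnaseq: float) -> float:
    """RCV = RPKM(RiboSeq) / RPKM(RNASeq), a proxy for 'translatability'."""
    if rpkm_rnaseq <= 0:
        raise ValueError("RPKM RNASeq must be positive")
    return rpkm_riboseq / rpkm_rnaseq


# Rounded values quoted in the text for olg2.
rcv_olg2 = ribosome_coverage_value(14.2, 22.0)

if __name__ == "__main__":
    print(f"RCV(olg2) from rounded inputs: {rcv_olg2:.3f}")
```

With these rounded inputs the ratio comes out near 0.65, in line with the reported RCV of 0.64 for olg2; the small difference is expected because the RNASeq value is given only approximately.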

Mass-spectrometry-based proteomics was used to verify the expression of both OLGs as well as the respective mother gene proteins. For that, P. aeruginosa PAO1 was cultivated as described earlier. In total, 4,076 proteins could be detected, including Olg1, Olg2, Tle3, and PA1383 with 12, 5, 10, and 21 peptides, respectively (Table S2). These peptides covered the start, middle, and end regions in each of the four coding sequences (Figures 2A and 2B, last tracks). The measured mass spectrometric intensities of the four target proteins were in a medium to low range compared with all other detected proteins (Figure S3). To exclude false-positive peptide identifications, we validated the fragment ion spectra of all detected peptides from the four target proteins using the artificial intelligence algorithm Prosit (Gessulat et al., 2019). Prosit can predict a peptide's fragment ion spectrum based on its amino acid sequence. Except for two peptides, correlation scores (dot product) between experimental and predicted spectra were larger than 0.6 (Table S2), which supports correct identification of almost all putative peptides from Olg1, Olg2, Tle3, and PA1383.
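To illustrate what a dot-product correlation between an experimental and a predicted fragment spectrum means, the following sketch computes a normalized dot product over matched fragment-ion intensities. This is an assumption-laden illustration: the exact scoring used with Prosit in this study is not specified beyond "dot product," the function name is hypothetical, and the intensity values are invented for demonstration. Only the 0.6 threshold comes from the text.

```python
import math

# Illustrative sketch: normalized dot product between two intensity vectors
# holding matched fragment ions. Not the scoring code used in the study.


def normalized_dot_product(observed, predicted):
    """Cosine-style similarity in [0, 1] for non-negative intensity vectors."""
    if len(observed) != len(predicted):
        raise ValueError("spectra must list the same fragment ions")
    dot = sum(o * p for o, p in zip(observed, predicted))
    norm = math.sqrt(sum(o * o for o in observed)) * math.sqrt(
        sum(p * p for p in predicted)
    )
    return dot / norm if norm else 0.0


THRESHOLD = 0.6  # cut-off mentioned in the text

if __name__ == "__main__":
    # Invented example intensities for the same five fragment ions.
    experimental = [0.9, 0.1, 0.5, 0.0, 0.3]
    predicted = [1.0, 0.2, 0.4, 0.1, 0.2]
    score = normalized_dot_product(experimental, predicted)
    print(f"score = {score:.3f}, passes threshold: {score > THRESHOLD}")
```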

For peptide identification with highest confidence, as well as for accurate protein quantification, we next performed targeted proteomic measurements including isotopically labeled reference peptides. Based on our initial mass spectrometric data, we selected four to five peptides per target protein (Figures 2A and 2B, lower tracks, peptides indicated with an asterisk) and purchased those in synthetic and stable isotopically labeled form. Those heavy reference peptides were spiked into P. aeruginosa PAO1 samples taken from various growth time points (Figure 3A) and measured using the targeted proteomic method Parallel Reaction Monitoring (PRM). We successfully validated and quantified four peptides for Olg1, five peptides for PA1383, four peptides for Tle3, and one peptide of Olg2 (Figures 3B and S4A).

Growth-phase-dependent changes in protein abundances were observed for all four target proteins via PRM (Figures 3B and S4A). High levels of both OLGs were obtained during exponential growth (1 h, 2 h) and at the exponential-stationary transition (OD1). In contrast, protein abundance in late stationary phase (24 h) was significantly lower. Both OLGs exhibited protein kinetics deviating from their respective mother gene proteins Tle3 and PA1383, indicating an independent biological regulation. qPCR analysis confirmed similar kinetics for the mRNAs of olg1 and olg2 (Figure S4C).

Further indication of regulated OLG expression was provided by published RNASeq and RiboSeq data in two P. aeruginosa strains (PAO1 and ATCC33988; SRA accession number PRJNA379630) (Grady et al., 2017). Strains were cultivated in M9 broth with glycerol or n-alkanes. Reanalysis revealed transcriptional and translational signals for both OLGs in strain PAO1 (Figures 3C and S5) and olg1 in strain ATCC33988 (Table S1). Interestingly, RPKM RNASeq values were similar for both OLGs when comparing LB with M9+glycerol but differed in M9+alkane (Figure 3C). This might suggest a carbon-source-dependent regulation. Further, these data indicate regulated translation of olg1 (log_FC = 1.15) and olg2 (log_FC = −0.55) in PAO1 when comparing M9+alkane with M9+glycerol (false discovery rate, FDR ≤ 0.05). Thus, both OLGs are expressed more weakly in alkane media than LB (Figure 3C).

Temporal control of OLG expression and its dependence on growth media strongly imply functionality (Figures 3B, 3C, and S4). However, elucidation of the biological role of these overlapping-encoded proteins requires further experiments.

The novel olg1 of P. aeruginosa PAO1 (NC_002516.2) is located at the coordinates 291,556-292,512 (+) in frame −1 (i.e., directly antisense) with respect to its mother gene (Figure 4A). It has a minimum length of 957 nucleotides (nt), and the most probable start codon is ATG 291556 (see Data S1). olg1 completely overlaps with the annotated mother gene tle3 (PA0260), encoding the toxic type VI lipase effector 3 of the vgrG2b-tli3-tle3-tla3 operon. tle3 contains two structural domains, an N-terminal α/β hydrolase fold domain and a C-terminal domain of unknown function (DUF3274) (Berni et al., 2019). olg1 overlaps 39 nt with the α/β fold domain and at least 594 nt with DUF3274.

Figure 4. Schematic overview of the genomic structure of the tle3-olg1 and PA1383-olg2 loci

(A) olg1 completely overlaps antisense in frame −1 (subpanel a) relative to the annotated gene tle3, which is part of the vgrG2b-tli3-tle3-tla3 operon. Locations of the N-terminal α/β hydrolase fold and the C-terminal DUF3274 domain of tle3 are displayed in dark gray. olg1 shares structural features of a protein-coding gene, including −35 and −10 consensus elements, divided by a 14 bp spacer, of a putative σ70 promoter. A core SD sequence of AGG was identified according to Ma et al. (2002), interacting with the aSD sequence at the 3′ end of the 16S rRNA within the 30S ribosomal subunit. A putative terminator between 219 and 349 nt downstream of the stop codon was identified via RT-PCR using the primer pairs indicated (P4/5 + P6).

(B) olg2 overlaps nontrivially with the hypothetical gene PA1383 and trivially with galE encoding a UDP-glucose 4-epimerase. Structural features of both annotated genes are indicated. The mRNA of olg2 probably starts at a putative σ70 promoter 93 nt upstream of the start codon and terminates at the predicted terminator 218 to 247 nt downstream of the stop codon.

iScience 25, 103844, February 18, 2022

Novel olg2 (Figure 4B) is likewise encoded in frame −1, in the mother gene PA1383 at the coordinates 1501875-1503602 (−). With ATG 1503602 as putative start codon, olg2 spans 1728 nt and overlaps with two annotated genes. A region of 1536 nt is shared with PA1383, a hypothetical gene predicted to code for an N-terminal type I signal sequence for cytoplasmic export (Lewenza et al., 2005), which has been shown to be regulated by the transcriptional repressor NrdR (Crespo et al., 2015) and by both the repressors MvaT and MvaU (Lippa et al., 2021). In addition, olg2 overlaps in frame +2 by 34 nt with the galE gene (PA1384), which encodes a UDP-glucose 4-epimerase.
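The ORF lengths quoted for olg1 and olg2 follow directly from the inclusive genome coordinates given above. A short worked check, written as an illustrative sketch with a hypothetical helper name:

```python
# Worked check of the ORF lengths from the 1-based, inclusive coordinates
# quoted in the text. The helper name is hypothetical.


def orf_length(start: int, end: int) -> int:
    """Length in nt of a feature given 1-based inclusive coordinates."""
    return abs(end - start) + 1


olg1_len = orf_length(291_556, 292_512)       # olg1 on the (+) strand
olg2_len = orf_length(1_501_875, 1_503_602)   # olg2 on the (-) strand

if __name__ == "__main__":
    for name, length in (("olg1", olg1_len), ("olg2", olg2_len)):
        # Both lengths are multiples of three, as expected for an ORF.
        print(name, length, "nt =", length // 3, "codons (stop included)")
```

Both values agree with the text (957 nt and 1728 nt) and correspond to 319 and 576 codons, respectively, counting the stop codon.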

Both OLGs were discovered by screening RiboSeq data and further characterized using prediction tools, qPCR, transcriptome sequencing, mass spectrometry as well as phylogenetic analyses as described.

Putative σ70 promoter sequences were searched for 300 nt upstream of each OLG's start codon with the tool BPROM (Solovyev and Salamov, 2011). Linear-discriminant function values of 1.94 (olg1) and 1.37 (olg2) clearly exceed 0.2, the threshold distinguishing promoter and nonpromoter sequences. Transcription start sites were localized 37 nt and 94 nt upstream of the start codons, respectively (Figure 4). The observed distances fit the length of 5′ UTRs reported for P. aeruginosa PA14 (median: 47 nt) (Wurtzel et al., 2012) and Pseudomonas syringae pv. tomato str. DC3000 (mean: 78 nt) (Filiatrault et al., 2011). To investigate potential ρ-independent terminators, FindTerm (Solovyev and Salamov, 2011) was applied 300 nt downstream of both overlapping ORFs. A terminator was detected for olg2 218 to 247 nt downstream (Figure 4B). For olg1, a ρ-independent terminator was not predicted, but RT-PCR verified termination between 219 and 349 nt downstream of its stop codon (Figure 4A).

In P. aeruginosa, the core anti-Shine-Dalgarno (SD) sequence is CCUCC. It has a mean ΔG_SD of −6.5 kcal/mol and an optimal spacing of 7-9 nt to the start codon (Ma et al., 2002). An SD sequence with AGG and a ΔG_SD of −3.6 kcal/mol was detected 8 nt upstream of olg1's proposed start codon (Figure 4A). For the mother gene tle3, an SD sequence (−5.1 kcal/mol) was identified, but neither a σ70 promoter nor a ρ-independent terminator was found. This is consistent with the reported finding that tle3 is part of the vgrG2b operon (Berni et al., 2019). No SD sequence was detected for olg2. However, SD sequences are similarly absent from 30.8% of the annotated genes in P. aeruginosa (Ma et al., 2002). The upstream regions of both PA1383 and galE, the mother genes, harbored a σ70 promoter (transcription start sites 203 nt and 72 nt upstream, respectively) and an SD sequence (−6.1 kcal/mol and −4 kcal/mol, respectively).

Sequence features such as start codon, SD sequences, GC bias, and hexamer coding statistics are used by gene prediction tools, for example Prodigal (Hyatt et al., 2010). However, Prodigal's algorithm prohibits prediction of gene pairs with an overlap larger than 200 nt and annotates only the ORF with the highest score. For P. aeruginosa PAO1, Prodigal predicted 5,681 protein-coding genes including the two annotated genes tle3 and PA1383 with total scores of 132.60 and 225.92, respectively (Figure S6). When "hiding" both annotated genes by replacing all start codons by unidentified nucleotides ("N"), Prodigal classified olg1 and olg2 as protein-coding genes. Although their total scores of 4.63 and 23.63 were relatively low, some values are comparable to annotated genes, for instance the start-sequence region scores (Figure S6 and Table S3). Furthermore, olg2 showed a confidence score of 99.56, indicating a very high likelihood of being a real protein-coding gene. These overlapping ORFs are not annotated by Prodigal due to their long overlaps, but both nonetheless show features associated with protein coding.
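The "hiding" step can be pictured as a small sequence edit made before the gene finder is run. The sketch below is a simplified illustration under stated assumptions: it masks only the annotated start codon of one gene, whereas the text speaks of replacing "all start codons" without specifying the exact procedure. The function name and the toy sequences are hypothetical, and running Prodigal itself is not shown.

```python
# Illustrative sketch: mask the annotated start codon of a gene with "NNN"
# so that a gene finder can no longer call it. Simplified assumption: only
# the single annotated start codon is masked.


def mask_start_codon(genome: str, start: int, end: int, strand: str) -> str:
    """Return a copy of genome with the gene's start codon replaced by NNN.

    start/end are 1-based inclusive coordinates with start < end.
    On the '-' strand the start codon sits at the high-coordinate end.
    """
    if strand == "+":
        lo = start - 1          # 0-based index of the first start-codon base
    elif strand == "-":
        lo = end - 3            # last three bases of the feature
    else:
        raise ValueError("strand must be '+' or '-'")
    return genome[:lo] + "NNN" + genome[lo + 3:]


if __name__ == "__main__":
    toy_plus = "CCATGAAATAGCC"   # toy gene ATGAAATAG at 3..11 on '+'
    toy_minus = "CCCTATTTCATCC"  # same toy gene encoded on '-' at 3..11
    print(mask_start_codon(toy_plus, 3, 11, "+"))
    print(mask_start_codon(toy_minus, 3, 11, "-"))
```

Because the masked bases become "N" on both strands, any such edit also alters the antisense sequence at that position; for an embedded ORF the effect depends on where the mother gene's start codon lies relative to the overlap.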

Almost all detectable homologs of both olg1 and olg2 were found within Pseudomonas spp. according to BLAST searches (Table S4). Homologs outside Pseudomonas clustered within the genus in terms of sequence identity. Thus, we infer that these are best accounted for by relatively recent horizontal gene transfer from the genus Pseudomonas and focus on evolution within this clade.

Homologs of tle3 containing both the α/β hydrolase and DUF3274 domains were found in multiple phyla; however, within the order Pseudomonadales they were found only in the genus Pseudomonas, suggesting an ancient horizontal gene transfer event from another order. The mother gene of olg2, PA1383, in contrast, does not have highly similar homologs outside of Pseudomonas. The one exception was derived from a low-quality genome of Acinetobacter baumannii (noted in RefSeq to be of excess size), which was disregarded. A few distant homologs were found, with the two top hits in E. coli (matching 43% of the PAO1 sequence) and Salmonella enterica (match to 17%). This evolutionary distance again implied a horizontal gene transfer into or out of Pseudomonas with subsequent evolutionary divergence. None of the non-Pseudomonas homologs included the N-terminal signal peptide in PA1383, suggesting functional changes in Pseudomonas compared with other bacteria.

Both olg1 and olg2, in the length present in reference strain PAO1, were highly taxonomically restricted. The taxonomic distribution of both OLGs was limited approximately to the species P. aeruginosa, according to BLASTp hits along with searches of additional metagenome-assembled genomes (MAGs) (Figure 5A). The mother genes tle3 and PA1383 within the order Pseudomonadales resulted in approximately 900 and 300 unique sequences, respectively. For olg1 (in tle3), the intact ORF (i.e., without premature stop codons) was restricted to P. aeruginosa with one exception in Pseudomonas prosekii with a nonstart codon (GTA) at the locus of the start site in PAO1. An intact olg2 was restricted to a few Pseudomonas species, and only P. aeruginosa genomes shared the same stop codon. Pseudomonadales homologs in recent MAG collections (Almeida et al., 2020; Nayfach et al., 2019) supported the inferred taxonomic boundaries. Genomes with intact ORFs of olg1 or olg2 with a stop in the same position were assigned taxonomically to P. aeruginosa in the MAG data (Table S4). Subsequent analyses used the combined GenBank and MAG datasets.

Multiple independent lines of evidence indicated that both OLGs are under purifying (negative) selection, a strong indicator of functionality, particularly when combined with evidence of expression (Cooper and Gardner, 2020). Firstly, the ORFs for olg1 and olg2 were both significantly longer than expected, given the amino acids (AA) in the reference genes tle3 and PA1383, using the synonymous-mutation method from the tool "Frameshift" (Figure 5B). This method substitutes synonymous codons randomly and obtains an empirical cumulative distribution function of the resulting ORF lengths in each alternative reading frame. This resulted in a p-value of <10^−10 for both ORFs. olg1 and olg2 were also longer than expected given the overall codon usage (codon-permutation method of "Frameshift"), although not statistically significantly, with p = 0.163 and p = 0.0635, respectively (Table S5). The p-values included a correction for multiple tests, i.e., the number of observed ORFs in this alternate reading frame. The synonymous-mutation p-values are still significant after a conservative multiple-test adjustment of multiplying by the total number of genes, arguably appropriate given the OLGs' detection with a genome-wide scan. The nonsignificant results for the codon-permutation method are not surprising given that it implicitly depends on stop codons elsewhere in the alternate reading frame, of which there are few due to the length of the OLGs relative to their mother genes. In summary, from these results it can be concluded that the ORF lengths are unexpected, given the overall sequence composition of the mother genes, implying selection for long ORFs via negative selection on stop codons.
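The idea behind the synonymous-mutation test can be sketched in plain Python. This is not the "Frameshift" tool and makes several labeled assumptions: synonymous codons are drawn uniformly (the tool may weight them differently), only the directly antisense frame is examined, ORF length is measured as the longest stop-free run between in-frame stops (as in the Figure 5B legend), and the empirical p-value uses a simple add-one estimator. All function names are hypothetical.

```python
import random

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
SYNONYMS = {}
for _codon, _aa in CODON_TABLE.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def longest_stop_free_run(seq):
    """Longest run (in nt) of consecutive non-stop codons in frame 0."""
    best = run = 0
    for i in range(0, len(seq) - 2, 3):
        if CODON_TABLE[seq[i:i + 3]] == "*":
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best


def synonymous_shuffle(gene, rng):
    """Replace every codon by a uniformly chosen synonymous codon."""
    return "".join(
        rng.choice(SYNONYMS[CODON_TABLE[gene[i:i + 3]]])
        for i in range(0, len(gene), 3)
    )


def antisense_length_pvalue(gene, n_sim=1000, seed=0):
    """Observed antisense stop-free length and its empirical p-value."""
    if len(gene) % 3:
        raise ValueError("gene length must be a multiple of 3")
    rng = random.Random(seed)
    observed = longest_stop_free_run(reverse_complement(gene))
    hits = sum(
        longest_stop_free_run(reverse_complement(synonymous_shuffle(gene, rng)))
        >= observed
        for _ in range(n_sim)
    )
    return observed, (hits + 1) / (n_sim + 1)
```

Each shuffled sequence encodes the same mother protein, so any excess length of the real antisense ORF relative to the simulated distribution cannot be explained by the amino acid sequence of the mother gene alone. A Bonferroni-style adjustment as described in the text would multiply the resulting p-value by the number of genes scanned.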

Secondly, sequence evolution of each mother gene was modeled without selection for maintaining an overlapping ORF. The presence of stop codons in simulated sequences was then compared with natural sequences. This method was previously used to support an inference to selection on the OLG asp in HIV-1. When evolution was simulated using an empirical codon model in Pyvolve (Spielman and Wilke, 2015) along trees calculated from the mother genes (Figure 5A), stop codons evolved more frequently in the simulated OLG sequences compared with observed natural sequences. As such, fewer simulated than natural sequences had intact full-length ORFs (Figure 5C). Following the originators of the method, outgroup sequences without stop codons in the OLG region were chosen to root the tree (Figure S7).

Figure 5. Continued

(A) [...] with a blue box. Lower panel: homologs of olg2 overlapping loci, for PA1383. Clade containing genomes with the same stop codon as the reference genomes is highlighted with a red box (start codon is at a nonoverlapping locus outside the sequence shown). The reference genome (NC_002516.2) is underlined in the respective OLG color, and the outgroup used in the evolutionary simulations described below is underlined in gray.

(B) Distributions of lengths of antisense (−1 frame) ORFs obtained by permutation (green) or synonymous exchanges (orange) of "mother gene" codons, for genes tle3 and PA1383, compared with the lengths of the embedded olg1 (blue, left) and olg2 (red, right). ORF lengths are measured between in-frame stop codons rather than start to stop.

(C) Simulations of evolution of tle3 and PA1383 in the OLG clade rooted on an outgroup with an intact ORF, using an empirical codon model, show that accumulation of stop codons is common; simulated sequences tend to have fewer full-length intact ORFs in the OLG loci and reading frame than real sequences.

Codon-position-specific constraint supports purifying selection

Synonymous variation in the mother genes tle3 and PA1383 was reduced over a large part of the OLG region (Figure 6A and Table S6), according to results from "FRESCo". For tle3, a comparison of the rates of synonymous evolution in the OLG-containing genomes versus the rest of the alignment using a paired two-tailed t test in nonoverlapping adjacent windows of 50 codons over the whole OLG sequence showed increased constraint in the OLG region, with p = 0.086. Results for olg2 were similar for the last 350 codons of olg2 (p = 0.03) (Table S6), but there was no synonymous constraint toward the end of the mother gene PA1383 from approximately codon 500 onwards. Because the nonoverlapping start region of olg2 is not well conserved across P. aeruginosa (Table S4), we focused only on the part overlapping PA1383.
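The windowed comparison described above can be sketched as follows. This is an illustrative reading of the procedure, not the analysis code of the study: per-codon synonymous rates for the two genome sets are assumed to be available as plain lists (in practice they would come from "FRESCo"), the function names are hypothetical, and only the paired t statistic and its degrees of freedom are computed. Converting these to a two-tailed p-value requires a Student's t distribution, for which a statistics library would normally be used.

```python
import math
from statistics import mean, stdev

# Illustrative sketch: average per-codon rates in nonoverlapping windows and
# compare two genome sets with a paired t statistic.


def window_means(values, size=50):
    """Mean of each complete, nonoverlapping window of `size` codons."""
    return [
        mean(values[i:i + size])
        for i in range(0, len(values) - size + 1, size)
    ]


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired windows")
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

A typical call would pass `window_means(rates_olg_genomes)` and `window_means(rates_other_genomes)` (both hypothetical inputs) to `paired_t_statistic`; a negative t would then indicate lower synonymous rates, i.e., stronger constraint, in the OLG-containing genomes.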

A more precise reading-frame-specific measure of purifying selection against nonsynonymous variants in OLGs is given by the novel tool "OLGenie". Unlike standard measures of dN/dS, "OLGenie" calculates an OLG-appropriate measure (i.e., dNN/dNS for the OLG) by restricting the analysis to alternative frame sites where variants are nonsynonymous in the reference frame. Within the OLG-containing genome sets for both olg1 and olg2, synonymous variants were favored over variants causing AA changes (Figure 6B), although these tendencies were not statistically significant; for olg1, dNN/dNS = 0.33 and p = 0.11, and for olg2, dNN/dNS = 0.52 and p = 0.06. These values contrasted with the genomes without olg1 or olg2, with dNN/dNS = 1.02, p = 0.66 and dNN/dNS = 0.92, p = 0.20 for olg1 and olg2, respectively (Table S7). Both mother genes tle3 and PA1383 were observed to be under significant purifying selection in the sets of Pseudomonas genomes with and without these OLGs. The results from "FRESCo" and "OLGenie" are fully independent, as they depend on synonymous and nonsynonymous sites in the mother genes, respectively. As each measure independently shows a tendency toward constraint, together they provide good evidence of evolutionary constraint in the OLG sequences. Further, some individual pairwise sequence comparisons with "OLGenie" are statistically significant (Figure 6C and Table S7), and these comparisons are informative about the taxonomic extent of functional ORFs. Much of the strongest evidence of purifying selection is taxonomically close to the reference genome PAO1, which supports constraint in these OLGs. This was not guaranteed by the presence of an intact ORF in this clade, as any stop codons are excluded from the "OLGenie" analysis; as such, results are not affected by whether an ORF homolog has premature stops. The pattern of purifying selection on olg1 suggests that a functional ORF may have been found in the common ancestor of P. aeruginosa and Pseudomonas frederiksbergensis; the evidence for positive selection between members of this clade and genomes in the other main Pseudomonas branch also supports this hypothesis. In the case of olg2, evidence for purifying selection is taxonomically more widespread across Pseudomonas, fitting the wider distribution of intact olg2 ORFs (some with stop codons downstream of that in the reference strain PAO1).
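The site classes behind dNN/dNS can be made concrete with a small classifier. The sketch below is not "OLGenie": it only labels a single substitution as synonymous (S) or nonsynonymous (N) in the mother-gene frame and in an antisense frame, giving two-letter classes such as NN or NS. The function names and the `antisense_offset` parameter are hypothetical, the antisense frame is assumed to be read from the reverse complement of the same fragment, and a full estimator would additionally normalize variant counts by the number of sites of each class, which is omitted here.

```python
# Illustrative sketch: classify one substitution in the sense (mother) frame
# and in an antisense frame. First letter = sense frame, second = antisense.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def _effect(ref, alt, idx, offset):
    """'S' or 'N' for the codon containing idx; None if the codon is partial."""
    start = idx - ((idx - offset) % 3)
    if start < 0 or start + 3 > len(ref):
        return None
    same = CODON_TABLE[ref[start:start + 3]] == CODON_TABLE[alt[start:start + 3]]
    return "S" if same else "N"


def classify_substitution(seq, pos, new_base, antisense_offset=0):
    """Return e.g. 'NN', 'NS', 'SN' or 'SS' for a substitution at 0-based pos."""
    mutated = seq[:pos] + new_base + seq[pos + 1:]
    sense = _effect(seq, mutated, pos, 0)
    rc_pos = len(seq) - 1 - pos  # same site, seen from the opposite strand
    anti = _effect(
        reverse_complement(seq), reverse_complement(mutated), rc_pos,
        antisense_offset,
    )
    if sense is None or anti is None:
        return None
    return sense + anti


if __name__ == "__main__":
    for pos, base in ((3, "A"), (4, "G"), (5, "C")):
        print(pos, base, classify_substitution("ATGGCT", pos, base))
```

In this scheme, variants of class NN and NS both change the mother protein, so comparing how often each class is observed isolates selection acting on the overlapping frame.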

OLGs outside viruses (Chirico et al., 2010; Lebre and Gascuel, 2017) are typically categorically rejected, and those already annotated have sometimes been attributed to misannotations (Pallejà et al., 2008). In this study, we describe the detection and characterization of two exceptionally long OLGs, olg1 and olg2, in P. aeruginosa. Although additional signals for further OLGs have been detected in our data (not shown), we concentrated on the two strong candidates reported here, namely olg1 and olg2, in order to characterize these in sufficient and convincing depth. The discovery of novel genes, and especially of overlapping genes, is always coupled to the choice of (arbitrary) values that are accepted for verifying a novel gene or protein. These values are discussed for sequencing data, and sometimes unexpected signals are regarded as pervasive or background nonfunctional translation events (e.g., Smith et al., 2019). In mass spectrometry, a single peptide detection event is dismissed as a one-hit-wonder, although methods of cross-validation have been developed to mitigate this problem (Gupta et al., 2008). In any case, we propose that the long overlapping ORFs detected in this study encode functional protein products due to (1) the presence of sequence features necessary for gene expression, (2) successful transcription and translation as indicated by RNASeq and RiboSeq, (3) discovery of several translated peptides via mass spectrometry, (4) validation and confirmation of their regulated expression during growth of P. aeruginosa PAO1 using targeted proteomics and isotopically labeled reference peptides, (5) successful prediction of both ORFs on genomic and translational level by annotation programs, and (6) evidence of purifying selection on both gene candidates from multiple methods. Although these results provide strong evidence for the genuine protein-coding nature and functionality of both ORFs, they can only be designated as OLGs if their respective mother genes (tle3 and PA1383) are correctly annotated and are also genuinely protein coding. The gene tle3 has been confirmed to encode the antibacterial type VI lipase effector 3 (Berni et al., 2019; Russell et al., 2013). PA1383 is annotated as a hypothetical gene, but we show that homologs are widely distributed across bacteria. Further, it contains a signal peptide associated with export, and it is under purifying selection. For both mother genes, we show clear expression in our RNASeq, RiboSeq, and MS experiments. Further, it appears unlikely that the MS-detected peptides represent translation products without function considering the high bioenergetic cost of translation (Lynch and Marinov, 2015). Taken together, it is beyond reasonable doubt that both mother genes encode functional proteins and that the overlapping ORFs presented here are not just annotation errors from artifactual mother genes.

Figure 6. Continued

(B) OLGenie's measure of dN/dS across tle3 (left) and PA1383 (right) in the OLG genomes; sliding windows of 50 codons. A decrease in nonsynonymous changes in the OLG frame is observed in the OLG loci (blue and red boxes) when compared with the expected neutral evolution rate of 1 (black dotted lines) and the non-OLG genomes (white line).

(C) Pairwise comparisons of dNN/dNS (an OLG-appropriate measure of purifying selection calculated with OLGenie). Evidence for purifying selection is found in a wider taxonomic group than the specific ORFs studied here; for olg1, apparent purifying selection is limited to a subclade within Pseudomonas, whereas for olg2 it is found across the genus; for both ORFs, however, evidence is strongest in the vicinity of P. aeruginosa. Codon numbers are with respect to an alignment including gaps.

With minimum lengths of 957 and 1728 nt, olg1 and olg2 represent the longest known prokaryotic OLGs with extensive experimental evidence. The discovery of such long OLGs is extraordinary considering the short length of most observed OLGs. In E. coli, for instance, several RiboSeq studies [...]

Almost all proposed antisense OLGs lack a native proof of the encoded protein product, arguably calling their coding potential into question. Proteomic detection of the OLG cosA via MS, for instance, failed, presumably due to its low expression (Kim et al., 2009). In addition to low protein abundance, the generally small size of OLGs also hampers proteomic proof due to an insufficient amount, or complete absence, of mass spectrometry-detectable peptides (Petruschke et al., 2020). Nevertheless, protein evidence of antisense OLGs was provided in some proteomic studies (Venter et al., 2011) but mainly attributed to a high false-positive rate. Proteomic OLG evidence was found for other bacterial genera, including Helicobacter. In P. putida, 44 small antisense-encoded proteins were claimed based on MS data (Yang et al., 2016). For a different species, P. fluorescens, nine protein-coding antisense OLGs were found using MS (Kim et al., 2009). In the latter, eight of nine detected proteins were shorter than 200 AA, but one had a reported length of 530 AA. To our knowledge, the longest antisense OLG with proteomic evidence is a 1644 nt ORF (encoding 548 AA), located in frame −1 in Deinococcus radiodurans (Willems et al., 2020). However, up until now all prokaryotic OLGs identified via MS have lacked verification. Thus, olg1 and olg2 not only represent antisense OLGs of exceptional sizes across bacteria, archaea, and viruses but constitute the longest known OLGs with reliable proteomic evidence.

Both olg1 and olg2 are phylogenetically young genes under selection. For both olg1 and olg2, the OLG sequence is evolving considerably faster at the AA level than the mother gene protein sequence (approximately 2 and 12 times faster, respectively; Table S8). This appears to have resulted in a long ORF "opening up" in the recent history of Pseudomonas genomes for olg1 and perhaps somewhat earlier for olg2. At some point, they became subject to purifying selection, as shown by depletion of stop codons, nonsynonymous changes, and synonymous variants in the mother gene. The yet unaccounted-for evidence of translation upstream of olg1 shown here raises the possibility of multiple start sites, which have been recently observed for many bacterial proteins (Fijalkowska et al., 2020), including potentially for the OLG pop (Zehentner et al., 2020b). However, in our sliding window analyses of tle3 (Figures 6A and 6B), we found no evidence for selection on upstream sequences.

Bioinformatic analysis of OLGs is still in its infancy. For instance, for evolutionary simulation, it would be ideal to start with the actual ancestral sequence, but accurate ancestral-sequence reconstruction for OLGs remains unsolved. Thus, for the simulation method, rather than introducing new biases with imperfect reconstruction, we followed the approach of Cassan et al. (2016) of using a known leaf sequence in place of the root sequence. Further, choosing an outgroup with an intact ORF to root the tree implicitly assumes that the ancestor of the outgroup and OLG clade contained an intact OLG, and the results are sensitive to the choice of sequence on which the tree is rooted (Figure S7B). Here also, ancestral-sequence reconstruction would assist with realistic simulations. A further limitation of all existing methods is that they use only a subset of the sequence information, e.g., ''Frameshift'' only considers stop codons in one genome, ''FRESCo'' only considers synonymous sites in the mother gene, and ''OLGenie'' is restricted to the nonsynonymous mother gene sites. Future developments combining features should increase accuracy. Additional considerations such as masking out RNA secondary structures, using machine-learning methods to find subtle signatures of selection, or including sequences from metagenomic studies of different niches should improve our understanding of the evolution of OLGs and other taxonomically restricted genes.

Until recently it was thought that almost all modern genes arose through duplication and divergence from ancient genes (Ohno, 1970). Many taxonomically restricted genes, found only in one strain or relatively few closely related genomes, have recently been discovered. The origin of few of these ''orphan'' genes, however, has been explicated in molecular detail. Young OLGs have some important advantages in the study of gene evolution.
In particular, the genetic context is fixed due to the presence of the mother gene. This dramatically reduces the major problems associated with false homologs and failure to detect true homologs (Vakirlis et al., 2020; Weisman et al., 2020). The evolutionary processes involved in the initial expression and neo-functionalization of these ORFs deserve further attention. For instance, a shift in function in PA1383 appears to have involved substantial sequence change, including gain of a signal peptide. We hypothesize that during this process of positive selection on the mother gene many possible sequences were explored in the antisense −1 frame, facilitating the origin of the ORF encoding olg2.

Our results demonstrate that bacterial genomics, after decades of advances, still has fundamental secrets to reveal (Grainger, 2016; Kirchberger et al., 2020), potentially including many more long OLGs, which were until now hiding in the shadows of known, annotated genes. These elements have not been rigorously searched for before at a whole genome level, as appropriate detection methods are still in development, and if found they are often disregarded. The discovery of long OLGs opens new research opportunities for genomics, proteomics, and translatomics, as well as for the study of evolutionary novelty and bacterial gene function. These findings together shine a spotlight on the remarkable multilayer coding potential enabled by the redundancy in the standard genetic code.

Although we report two novel overlapping genes from P. aeruginosa, we omitted many other putative overlapping genes observed in our data, mainly because our limited resources did not allow us to characterize more of them in detail. For instance, so-called ''one-hit-wonders,'' i.e., proteins represented by only a single peptide, are widely discounted and so were not examined. Furthermore, we do not have data on the biological function of the two genes. Here, one would need, e.g., strand-specific knockouts, overexpression phenotypes, or other experiments classically used to elucidate protein function. Regarding their evolution, we currently do not understand well how such genes originate ''de novo'' through overprinting.

Mass spectrometry proteomics data have been deposited in the PRIDE repository (Perez-Riverol et al., 2019) and can be accessed using the dataset identifier PXD023992 (http://proteomecentral.proteomexchange.org/cgi/GetDataset?ID=PXD023992). All targeted proteomic raw data and Skyline analysis files have been deposited using the identifier OLG_PSE to Panorama Public (https://panoramaweb.org/OLG_PSE.url) (Sharma et al., 2018).

Detailed methods are provided in the online version of this paper.

We thank Romy Wecko, Verena Breitner, Lara Wanner, Hermine Kienberger, and Franziska Hackbarth for technical assistance and Christopher Huptas for bioinformatic support. We also thank Siddhanth Rao for assistance with scripts for the use of ''FRESCo'' and Chase Nelson and April Wei for helpful comments on the manuscript. The TUM University Library Publishing Fund helped cover publishing costs (grant to K.N.).

M.K. performed the sequencing experiments and their analysis and wrote the first draft of the manuscript. Z.A. conducted the evolutionary analyses and critically revised the manuscript. M.A. performed the mass spectrometry experiments and their analysis under the supervision of C.L. The study was conceived, designed, and coordinated by S.S. and K.N. All authors helped with writing and editing.

The authors declare no competing interests.

Further information and requests for resources should be directed to and will be fulfilled by the lead contact, Klaus Neuhaus (neuhaus@tum.de).

Reagents generated in this study are available from the lead contact with a completed Materials Transfer Agreement.

Data and code availability

This paper analyses data produced for this publication. The accession numbers for the datasets are listed in the key resources table.

Any original code reported is available via GitHub, as listed in the key resources table.

Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

P. aeruginosa PAO1 (DSM 19880) has been used in this work.

Lysogeny broth (10 g/L tryptone, 5 g/L yeast extract, 5 g/L NaCl) was inoculated 1:100 using an overnight culture of P. aeruginosa PAO1 (DSM 19880) and aerobically incubated (37 °C, 150 rpm). After 1 h, 2 h, 4 h, 6 h, 8 h, and 24 h and at OD600nm = 1, samples were harvested by centrifugation (10 min, 12,000 × g, 4 °C). For transcriptomes and translatomes, cellular processes were stalled at OD600nm = 1 by adding dry ice until the culture reached 4 °C. Next, cells were centrifuged (8,000 × g, 4 °C, 5 min) and resuspended in polysome-lysis-buffer (Woolstenhulme et al., 2015) (325 µL per 100 mL initial culture). Cells were lysed in a cell crusher with liquid nitrogen. After centrifugation as before, the supernatant was split, and the two portions were used to isolate total RNA and ribosomes for RIBOseq, respectively.

Total RNA was extracted from cells (qPCR) subjected to bead beating (0.1 mm zirconia beads) using a FastPrep (3 cycles, 6.5 m s−1, 45 s; with 5 min incubation on ice after each cycle). Cell lysates (see above) were incubated for 5 min each, first with cooled Trizol and then with 200 µL chloroform. After centrifugation (15 min, 12,000 × g, 4 °C), RNA was precipitated (500 µL 2-propanol, 1 µL glycogen, 30 min). RNA was pelleted (10 min, 12,000 × g, 4 °C) and washed twice with cold 70% ethanol. Air-dried RNA was dissolved in RNase-free water. Integrity was verified by agarose gel electrophoresis (1.5%, 100 V, 45 min; Carl Roth) and Bioanalyzer measurements (RNA 6000 Nano Assay, Agilent Technologies).

RNA samples were incubated with TURBO DNase (Thermo Fisher Scientific; 1 h, 60 °C) and 25 U SUPERase•In RNase Inhibitor (Thermo) to remove residual DNA. After inactivation (15 mM EDTA, 10 min, 65 °C), the RNA was precipitated overnight (−20 °C) using ethanol, 3 M sodium acetate, and glycogen (690, 27.6, and 1 µL, respectively). Precipitated RNA was pelleted, washed, dried, and dissolved as before. The absence of DNA was confirmed by PCR using Taq polymerase (NEB) with primers 1 & 2.

Oligonucleotides and synthetic peptides are listed in Table S9. For Olg1, Olg2, Tle3, and PA1383, eighteen optimal peptides in total were selected as isotopically labeled reference peptides (SpikeTidesL) and purchased from JPT Peptide Technologies. Either the C-terminal lysine (Lys8) or arginine residue (Arg10) was 13C- and 15N-labeled. Isotope-labelled peptides were not purified and, thus, concentrations represent only estimates.

cDNA synthesis for PCR

RNA (500 µg) was reverse-transcribed using SuperScript III Reverse Transcriptase (Thermo Fisher Scientific). Random nonamers (50 pmol; Sigma Aldrich) or 10 pmol of primer 3 were used for reverse transcription of gyrA (reference gene) or olg1, respectively, in the presence of 20 U SUPERase•In RNase Inhibitor. Samples without reverse transcriptase served as negative controls. For transcriptional termination sites of olg1, reverse transcription was performed with primer 4 or 5. Of the latter, 1 µL cDNA was used in a 30-cycle PCR using Taq DNA Polymerase (NEB) with primers 4 & 6 or 5 & 6. For olg2, RNA was reverse transcribed using primer 7. Subsequently, 1 µL cDNA was used in a 30-cycle PCR using Q5 DNA Polymerase (NEB) with primers 8 & 9 or 10 & 11. cDNA of olg1 was additionally used to test for alternative start sites by PCR using primers 8 & 15, 8 & 16, 8 & 17, and 8 & 18. Primer functionality was verified beforehand with genomic DNA (not shown).

Expression levels of olg1, olg2, and gyrA were quantified by qPCR. Each 20-µL reaction contained 10 µL SsoAdvanced Universal SYBR Green Supermix (Bio-Rad Laboratories), 500 nM forward and reverse primer, and 1 µL cDNA (or water for ''No Template Control''). For gyrA and olg1, primers 12 & 13 and 6 & 14 were used, respectively. Cycling was as follows: 95 °C for 30 s; 40 cycles of 95 °C for 15 s and 60 °C for 30 s. A melt curve analysis (65 to 95 °C with 0.5 °C increments) confirmed the correct product. Each reaction was conducted in three biological and technical replicates. Data were analysed using the ΔΔCt method (Livak and Schmittgen, 2001). Significance was evaluated with a two-tailed Welch two-sample t-test (p value ≤ 0.05).
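As a rough illustration of the ΔΔCt (2^-ΔΔCt) calculation of Livak and Schmittgen (2001), the following minimal sketch computes a fold change from replicate Ct values. The function name, the choice of a calibrator condition, and all Ct values are hypothetical, and the sketch assumes an amplification efficiency of 100% for both target and reference gene; it is not the analysis script used in the study.

```python
from statistics import mean


def fold_change_ddct(target_sample, reference_sample,
                     target_calibrator, reference_calibrator):
    """Relative expression by the 2^-ddCt method.

    Each argument is a list of replicate Ct values. The reference gene
    (gyrA in the text) normalizes for input amount; the calibrator is the
    condition against which the sample is compared (an assumption here).
    """
    # dCt: target Ct minus reference Ct within each condition
    dct_sample = mean(target_sample) - mean(reference_sample)
    dct_calibrator = mean(target_calibrator) - mean(reference_calibrator)
    # ddCt: difference of the two normalized values
    ddct = dct_sample - dct_calibrator
    # Assumes the product doubles every cycle (100% efficiency)
    return 2.0 ** (-ddct)


# Hypothetical Ct values, three replicates each
fold = fold_change_ddct(
    target_sample=[24.0, 24.5, 23.5],
    reference_sample=[18.0, 18.5, 17.5],
    target_calibrator=[26.0, 26.5, 25.5],
    reference_calibrator=[18.0, 18.0, 18.0],
)
print(fold)
```

A lower Ct in the sample than in the calibrator (after normalization to the reference gene) gives a fold change above 1.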

Transcriptome sequencing

rRNA was depleted from total RNA (of 200 µL cell-extract, DNase treated, as above) using the P. aeruginosa-specific riboPOOL kit (version v1-5, siTOOLs Biotech) followed by RNA precipitation and DNase digestion of the probes. One µg of depleted RNA was fragmented (Ultrasonicator system S220, Covaris; 175 W, 10% duty cycle, 200 cycles for 180 s), dephosphorylated (Antarctic phosphatase, NEB), and phosphorylated (T4 Polynucleotide Kinase, NEB). Fragments were purified after each step using the miRNeasy Mini Kit (Qiagen). Finally, the volume was reduced to 5 µL in a Speedvac concentrator (Eppendorf) and sequencing libraries were prepared using the TruSeq Small RNA Library Prep Kit (Illumina). cDNA concentration and length were measured using a Qubit (dsDNA HS Assay Kit, Thermo Fisher) and Bioanalyzer (High Sensitivity DNA Kit, Agilent). Libraries were diluted to 2 nM in 10 µL 10 mM Tris-HCl (pH 8.5) and sequenced on a HiSeq1500 (Illumina) using a v2 Rapid SR50 cartridge (Illumina) for two biological replicates.

A unified catalog of 204,938 reference genomes from the human gut microbiome

Basic local alignment search tool

Are antisense proteins in prokaryotes functional?

Codon usage in prokaryotes

ONE 8, e79033

Protein identification by mass spectrometry: issues to be considered

Overlapping genes in bacteriophage phiX174

Frameshifting preserves key physicochemical properties of proteins

How to manage Pseudomonas aeruginosa infections

Regulation of the overlapping pic/set locus in Shigella flexneri and enteroaggregative Escherichia coli

A type VI secretion system transkingdom effector is required for the delivery of a novel antibacterial toxin in Pseudomonas aeruginosa

Fast and sensitive protein alignment using DIAMOND

Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic

fastp: an ultra-fast all-in-one FASTQ preprocessor

Why genes overlap in viruses

DeepRibo: a neural network for precise gene annotation of prokaryotes by combining ribosome profiling signal and binding site patterns

Features of functional human genes

Andromeda: a peptide search engine integrated into the MaxQuant environment

Function of the Pseudomonas aeruginosa NrdR transcription factor: global transcriptomic analysis and its role on ribonucleotide reductase gene expression

The environmental occurrence of Pseudomonas aeruginosa

Identifying bacterial genes and endosymbiont DNA with Glimmer

An exploration of ambigrammatic sequences in narnaviruses

Novel overlapping coding sequences in Chlamydia trachomatis

The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell

Entrez direct: E-utilities on the UNIX command line

Origins of genes: ''big bang'' or continuous creation? Proc

Pseudomonas aeruginosa: a formidable and ever-present adversary

Evidence for a novel overlapping coding sequence in POLG initiated at a CUG start codon

Proteomic detection of non-annotated protein-coding genes in Pseudomonas fluorescens Pf0-1

The ingenuity of bacterial genomes

The bioenergetic costs of a gene

Skyline: an open source document editor for creating and analyzing targeted proteomics experiments

Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity

IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era

Predicting statistical properties of open reading frames in bacterial genomes

Evolution of overlapping genes

New insights from uncultivated genomes of the global human gut microbiome

Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic

OLGenie: estimating natural selection to predict functional overlapping genes

Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC)

Differentiation of ncRNAs from small mRNAs in Escherichia coli O157:H7 EDL933 (EHEC) by combined RNAseq and RIBOseq -ryhB encodes the regulatory RNA RyhB and a peptide

IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies

Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

Evolution by Gene Duplication (Allen & Unwin

Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?

The PRIDE database and related tools and resources in 2019: improving support for quantification data

Enrichment and identification of small proteins in a simplified human gut microbiome

The relations between the precodons of overlapping genes

Sigma factors in Pseudomonas aeruginosa

FastTree 2 - approximately maximum-likelihood trees for large alignments

BEDTools: a flexible suite of utilities for comparing genomic features

High-resolution TADs reveal DNA sequences underlying genome organization in flies

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data

Diverse type VI secretion phospholipases are functionally plastic antibacterial effectors

A method for the simultaneous estimation of selection intensities in overlapping genes

Microbial gene identification using interpolated Markov models

Degeneracy of the information contained in amino acid sequences: evidence from overlaid genes

A simple method to detect candidate overlapping genes in viruses using single genome sequences

Global quantification of mammalian gene expression control

FRESCo: finding regions of excess synonymous constraint in diverse viruses

Panorama public: a public repository for quantitative data sets processed in skyline

Overlapping protein-encoding genes in Pseudomonas fluorescens Pf0-1

Pervasive translation in Mycobacterium tuberculosis

Automatic annotation of microbial genomes and metagenomic sequences

Pyvolve: a flexible Python module for simulating sequences along phylogenies

Small proteins can no longer be ignored

Identification of novel translated small ORFs in Escherichia coli using complementary ribosome profiling approaches

PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments

The evolutionary origin of orphan genes

Two overlapping antiparallel genes encoding the iron regulator DmdR1 and the Adm proteins control siderophore and antibiotic biosynthesis in

The MaxQuant computational platform for mass spectrometry-based shotgun proteomics

Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes

The novel EHEC gene asa overlaps the TEGT transporter gene in antisense and is regulated by NaCl and growth phase

Proteogenomic analysis of bacteria and archaea: a 46 organism case study

Missing genes in the annotation of prokaryotic genomes

Identifying small proteins by ribosome profiling with stalled initiation complexes

A simple method for estimating the strength of natural selection on overlapping genes

Overview of nosocomial infections caused by gram-negative bacilli

Many, but not all, lineage-specific genes can be explained by homology detection failure

Codon usage in Pseudomonas aeruginosa

Lost and found: Re-searching and Rescoring proteomics data aids genome annotation and improves proteome coverage. mSystems

High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP

Overlapping genes in natural and engineered genomes

The single-nucleotide resolution transcriptome of Pseudomonas aeruginosa grown in body temperature

Identification and validation of novel small proteins in Pseudomonas putida

Do overlapping genes violate molecular biology and the theory of evolution?

The Sorcerer II global ocean sampling expedition: expanding the universe of protein families

Evidence for numerous embedded antisense overlapping genes in diverse E. coli strains

A novel pH-regulated, unusual 603 bp overlapping protein coding gene pop is encoded antisense to ompA in Escherichia coli O157:H7 (EHEC)

PROCAL: a set of 40 peptide standards for retention time indexing, column performance monitoring, and collision energy calibration

Mfold web server for nucleic acid folding and hybridization prediction

The reaction was stopped (6 mM EGTA, 50 U SUPERase•In, 10 min). Monosomes were isolated by sucrose density gradient centrifugation (104,000 × g, 4 °C, 3 h) followed by RNA isolation and DNase treatment as described. Ribosomal footprints were size selected using a 16% denaturing urea polyacrylamide gel (200 V, 1.5 h). After staining (SYBR Gold, Invitrogen), ribosomal footprints (19–27 nt) were excised. Gel pieces were crushed in gel breaker tubes (15,700 × g, 2 min). Gel debris was incubated overnight in extraction buffer (300 mM NaOAc pH 5.5, 1 mM EDTA, 0.1 U/µL SUPERase•In)

For offline high-pH reversed-phase (hpH RP) fractionation and for targeted proteomics, 75 µg and 20 µg of total protein, respectively, were reduced and alkylated

Water-diluted samples (1:1) were subjected to proteolysis with trypsin (enzyme to protein ratio 1:50, 30 °C, overnight, shaking at 400 rpm) and then stopped

Three discs of Empore C18 (3M) material were packed in 200-µL pipette tips. The resulting desalting columns were conditioned (100% acetonitrile, ACN) and equilibrated (40% ACN/0.1% formic acid, FA) followed by 2% ACN/0.1% FA. Peptides of the 75-µg protein digest were loaded, washed (2% ACN/0.1% FA) and eluted

mm column (Waters) at a flow rate of 200 µL/min. Buffer A was 25 mM ammonium bicarbonate (pH 8.0), buffer B was 80% ACN. Fractions were collected every minute into a 96-well plate. Peptides were separated by a linear gradient from 4% to 32% buffer B over 45 min, followed by a gradient from 32% to 85% buffer B over 6 min. Samples were collected in 30 s steps between minute 3 and 51. The solvent was evaporated and samples were redissolved in 2% ACN/0.1% FA. To increase sensitivity

High pH reversed-phase fractionation for targeted proteomics

C18-packed 200-µL tips (see above) were loaded with peptides from the 20-µg digest. A pH switch was performed using 25 mM ammonium formate (pH 10) and varying ACN concentrations for each of six fractions. ACN was added at concentrations of 0, 5, 10, 15, 25, and 50%, respectively. Fractions 1 and 5 and fractions 2 and 6 were combined. The solvent was each evaporated (1+5

LC-MS/MS measurements - full proteomes

After loading (10 min), peptides were transferred to an analytical column (ReproSil Gold C18-AQ, 3 µm, 450 mm × 75 µm, Dr. Maisch, self-packed) and separated using a 50-min linear gradient from 4% to 32% of solvent B (ACN/0.1% FA/5% dimethyl sulfoxide, DMSO) in solvent A (HPLC-grade water with 0.1% FA/5% DMSO) at a flow rate of 300 nL/min. Both solvents contained DMSO to boost MS intensity. The Fusion Lumos Tribrid mass spectrometer was operated in data-dependent acquisition (DDA) and positive ionization mode. MS1 spectra (360-1300 m/z) were recorded at a resolution of 60,000 using an automatic gain control (AGC) target value of 4 × 10^5 and a maximum injection time (MaxIT) of 50 ms. Up to 20 peptide precursors were selected for fragmentation in the full proteome analyses. Only precursors with charge states 2 to 6 were selected, and dynamic exclusion of 20 s was enabled. Peptide fragmentation was performed using higher energy collision induced dissociation (HCD) and a normalized collision energy (NCE) of 30%

MS2 spectra were acquired in the orbitrap with a resolution of 15,000 and an AGC target value of

2016) file downloaded for P. aeruginosa PAO1 (GCF_000006765.1_ASM676v1_protein.faa, 5,572 reviewed entries, 7 February 2020), supplemented with common contaminants (by MaxQuant) and Olg1 and Olg2 AA sequences. Trypsin/P was specified as proteolytic enzyme. Precursor tolerance was set to 4.5 ppm and fragment ion tolerance to 20 ppm. Results were adjusted to 1% FDR on peptide spectrum match level and protein level employing a target-decoy approach using reversed protein sequences. Minimal peptide length was defined as 7 AA; the ''match-between-run'' function was disabled. For full proteome analyses, carbamidomethylated cysteine was set as fixed and oxidation of methionine and N-terminal protein acetylation as variable modifications. Correlation scores (dot product) between experimental and predicted spectra were calculated via Skyline daily (64-bit

Targeted measurements using Parallel Reaction Monitoring (PRM) were performed with a 50-min linear gradient on a Dionex Ultimate 3000 RSLCnano system coupled to a Q-Exactive HF-X mass spectrometer (Thermo Fisher Scientific). The spectrometer was operated in PRM and positive ionization mode. MS1 spectra (360-1300 m/z) were recorded at a resolution of 60,000 using an AGC target value of 3 × 10^6 and a MaxIT of 100 ms

MaxIT of 118 ms and an isolation window of 0.7 m/z. For the PRM analysis of the growth phase samples, 18 OLG and mother gene peptides plus 12 retention time reference peptides (a subset of Procal peptides synthesized by JPT; Zolg et al., 2017) were targeted within a single PRM run and with a 5 min scheduled retention time window. The cycle time was ~2.1 s, which leads to ~10 data points per chromatographic peak

Isotope-labelled internal reference peptides were used for confident identification and quantification. Peptide selections were based on results of DDA measurements of the deep proteome at OD 600nm = 1. Peptides were selected based on intensity, location within the protein, Andromeda score, excluding modification, and charge state. All isotopically-labeled synthetic peptides were pooled and targeted proteomic measurements (PRM) showed confident detection of all 18 peptides

Peak integration, transition interferences and integration boundaries were reviewed manually, considering four to six transitions per peptide. To discriminate between true or false peptide detection, filtering according to correlation of fragment ion intensities between endogenous (light) and spike-in (heavy) peptides was applied (''Library Dot Product'' ≥ 0.8). Additionally, a good correlation of fragment ion intensities between light and heavy peptide (''Dot-ProductLightToHeavy'' of >0.9) and a mass accuracy of below ±20 ppm
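The detection criteria listed above can be sketched as a simple filter. This is only an illustration under stated assumptions: the class and function names are hypothetical, the values would in practice be exported from Skyline, and the sketch assumes that all three criteria must be met at the same time, which the text above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class PeptideEvidence:
    """Per-peptide values, e.g. as exported from a Skyline report (hypothetical layout)."""
    library_dot_product: float         # "Library Dot Product"
    dot_product_light_to_heavy: float  # light vs. heavy fragment ion correlation
    mass_error_ppm: float              # signed mass error in ppm


def is_confident_detection(ev, min_library_dp=0.8, min_light_heavy_dp=0.9,
                           max_abs_ppm=20.0):
    """Return True if a peptide passes all three thresholds named in the text.

    Assumption: the criteria are combined with a logical AND.
    """
    return (
        ev.library_dot_product >= min_library_dp
        and ev.dot_product_light_to_heavy > min_light_heavy_dp
        and abs(ev.mass_error_ppm) < max_abs_ppm
    )


# Hypothetical peptides
good = PeptideEvidence(0.85, 0.95, -5.0)
poor_correlation = PeptideEvidence(0.75, 0.95, 5.0)
poor_mass_accuracy = PeptideEvidence(0.90, 0.95, 25.0)

print(is_confident_detection(good))
print(is_confident_detection(poor_correlation))
print(is_confident_detection(poor_mass_accuracy))
```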

Putative σ70 promoters within a 300-nt region upstream of the start codon were predicted by BPROM (Solovyev and Salamov, 2011) with minimum LDF scores of 0

2002) within a region of 30 nt upstream of the start codon and a minimum free energy (ΔG_SD) threshold of −2.9 kcal/mol. To predict ρ-independent terminators, a 300-nt region downstream of the respective stop codon was analysed using FindTerm (Solovyev and Salamov, 2011) with a threshold of −3. Predicted terminator regions were read in non-overlapping sliding windows of 30 nt and folded with Mfold

Remaining reads were normalized to gene length and sequencing depth (RPKM: ''reads per kilobase per million mapped reads''). For each genomic nucleotide position, reads per million mapped reads (RPM) were calculated, averaged over biological replicates and visualized with pyGenomeTracks
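The two normalizations described above can be sketched as follows. The function names and all numbers are hypothetical; the sketch only illustrates the arithmetic of RPKM and per-position RPM averaged over replicates, and is not the pipeline used in the study.

```python
def rpkm(read_count, gene_length_nt, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million mapped reads)
    return read_count * 1e9 / (gene_length_nt * total_mapped_reads)


def rpm_track(per_position_counts, total_mapped_reads):
    """Scale per-nucleotide read counts to reads per million mapped reads."""
    return [count * 1e6 / total_mapped_reads for count in per_position_counts]


def average_tracks(tracks):
    """Position-wise mean over replicate tracks of equal length."""
    return [sum(values) / len(values) for values in zip(*tracks)]


# Hypothetical example: 957 reads on a 957-nt ORF, 10 million mapped reads
example_rpkm = rpkm(957, 957, 10_000_000)

# Two hypothetical biological replicates covering three genomic positions
replicate_1 = rpm_track([10, 20, 30], 10_000_000)
replicate_2 = rpm_track([30, 20, 10], 10_000_000)
mean_rpm = average_tracks([replicate_1, replicate_2])

print(example_rpkm, mean_rpm)
```

Dividing by gene length makes long and short ORFs comparable, while dividing by library size makes replicates and experiments comparable.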

Read counts were scaled to the smallest library size and differential expression analysis was performed using an exact test implemented in edgeR

To detect overlapping ORFs, all possible start codons within and in the upstream vicinity of the coding regions for tle3 and PA1383 were masked by N (any nucleotide). Further, protein-coding ORFs were predicted based on RiboSeq data using DeepRibo (Clauwaert et al., 2019) with default settings. BLAST searches (blast) (Altschul et al., 1990) against NCBI databases were used to
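To illustrate what an antisense overlapping reading frame is, the sketch below reverse-complements a sequence and reports the longest stop-free stretch in each of the three reading frames of the opposite strand. This is a generic illustration and not the DeepRibo-based procedure described above; the function names and the toy sequence are hypothetical, start codons are ignored, and the frame offsets 0 to 2 are arbitrary and do not correspond to the −1/−2/−3 frame naming used in the text.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_stop_free_stretch(seq, offset):
    """Longest run of codons without a stop codon, reading from `offset`.

    Returns 0-based, half-open coordinates (start, end) on `seq`.
    """
    best = (0, 0)
    start = offset
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3].upper() in STOP_CODONS:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = i + 3  # resume after the stop codon
    # Close the final stretch at the last complete codon
    end = offset + ((len(seq) - offset) // 3) * 3
    if end - start > best[1] - best[0]:
        best = (start, end)
    return best


def antisense_open_stretches(sense_seq):
    """Longest stop-free stretch per frame offset on the antisense strand."""
    antisense = reverse_complement(sense_seq)
    return {offset: longest_stop_free_stretch(antisense, offset)
            for offset in range(3)}


# Toy sense-strand sequence; coordinates refer to the reverse complement
toy = "GGGCTATTTCAT"
print(reverse_complement(toy))
print(antisense_open_stretches(toy))
```

A long stop-free stretch on the antisense strand of an annotated gene is only a candidate; the text relies on translation evidence and selection tests to support actual coding.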

Evolutionary and taxonomic analyses

Scripts for evolutionary and taxonomic analyses are available in the GitHub repository

Phylostratigraphy - taxonomic distribution

Homologs of tle3 and PA1383 (NC_002516.2) were detected using BLASTp in annotated proteins from genomes in Pseudomonadales, and from the Identical Protein Groups database using the Entrez Programming Utilities (Kans, 2021). Sequences from MAG collections were added using Diamond blastp (Buchfink et al., 2014) finding homologs within genomes annotated as being within Pseudomonadales. The com

Maximum likelihood trees of tle3 and PA1383 alignments were calculated using IQ-TREE

modified to run from the Unix command line, as well as to print the scores obtained for each method (codon permutation and synonymous codon mutation

A sequence from P. prosekii, the only intact homolog outside the OLG clade of P. aeruginosa, was chosen as outgroup for olg1. For olg2, multiple non-P. aeruginosa intact ORFs were available. A more distant outgroup was chosen as tests of purifying selection described below suggest more taxonomically widespread functionality. Omega values (approximately equivalent to dN/dS) of 0.5 for both genes were chosen based on alignments of the two mother genes

Codon-position constraint analyses

Constraints in synonymous sites of tle3 and PA1383 were assessed using 'FRESCo

Approximate maximum-likelihood nucleotide trees were calculated using FastTree 2 (Price et al., 2010) for the full sets of ''OLG'' and ''non-OLG'' genomes, and 'FRESCo' was run on codon alignments (described above) with a sliding window size

Analysis for each mother-gene codon alignment (created using PAL2NAL, described above) of OLG and non-OLG genomes was conducted with standard settings. Sliding window analyses of 50 codons were conducted using a minimum number of defined codons of 2. Pairwise whole-gene comparisons of olg1 and olg2 were conducted using standard settings

calculating the number of reads per kilobase gene per million reads sequenced. Ribosome coverage values (RCV) were calculated by dividing the RPKM of the translatome by the RPKM of the transcriptome for evaluating 'translatability'. For published data sets used, read counts were scaled to the smallest library size and differential expression analysis was performed using an exact test implemented in edgeR. In qPCR, data were analysed using the ΔΔCt method
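The ribosome coverage value defined above is a simple ratio, sketched below. The function name and the example values are hypothetical, and returning None for a gene without transcriptome signal is an assumption of this sketch rather than a rule stated in the text.

```python
def ribosome_coverage_value(rpkm_translatome, rpkm_transcriptome):
    """RCV = RPKM(translatome) / RPKM(transcriptome).

    Returns None when there is no transcriptome signal, because the ratio
    is undefined in that case (a choice made for this sketch).
    """
    if rpkm_transcriptome <= 0:
        return None
    return rpkm_translatome / rpkm_transcriptome


# Hypothetical values: RiboSeq RPKM of 50 and RNAseq RPKM of 100
rcv = ribosome_coverage_value(50.0, 100.0)
no_transcript = ribosome_coverage_value(50.0, 0.0)
print(rcv, no_transcript)
```

Normalizing footprint density by transcript abundance separates how strongly an mRNA is translated from how abundant it is.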

Results were adjusted to 1% FDR on peptide spectrum match level and protein level employing a target-decoy approach using reversed protein sequences. Correlation scores (dot product) between experimental and predicted spectra were calculated via Skyline daily (64-bit, v20.1.9.234) that supports Prosit spectra predictions. For data analysis, protein intensities and iBAQ values were calculated. Peptides for validation were selected based on intensity, location within the protein, Andromeda score, excluding modification, and charge state. PRM data was analysed using Skyline-daily