Shadow ORFs illuminated: long overlapping genes in Pseudomonas aeruginosa are translated and under purifying selection

Authors: Kreitmeier, Michaela; Ardern, Zachary; Abele, Miriam; Ludwig, Christina; Scherer, Siegfried; Neuhaus, Klaus
Date: 2021-02-10. Journal: bioRxiv. DOI: 10.1101/2021.02.09.430400

The existence of overlapping genes (OLGs) with significant coding overlaps revolutionises our understanding of genomic complexity. We report two exceptionally long (957 nt and 1536 nt), evolutionarily novel, translated antisense open reading frames (ORFs) embedded within annotated genes in the medically important Gram-negative bacterium Pseudomonas aeruginosa. Both OLG pairs show sequence features consistent with being genes and transcriptional signals in RNA sequencing data. Translation of both OLGs was confirmed by ribosome profiling and mass spectrometry. Quantitative proteomics of samples taken during different phases of growth revealed regulation of protein abundances, implying biological functionality. Both OLGs are taxonomically highly restricted, and likely arose by overprinting within the genus. Evidence for purifying selection further supports functionality. The OLGs reported here are the longest yet proposed in prokaryotes and are among the best attested in terms of translation and evolutionary constraint. These results highlight a potentially large unexplored dimension of prokaryotic genomes.

The tri-nucleotide character of the genetic code enables six reading frames in a double-stranded nucleotide sequence. Protein-coding ORFs at the same locus but in different reading frames are referred to as overlapping genes (OLGs). Studies of coding overlaps of more than 90 nucleotides, i.e. non-trivial overlaps, have mainly been restricted to viruses, where the first OLG was found in 1976 [1].
An OLG pair with such a non-trivial overlap can be described in terms of an older, typically longer "mother gene" and a more recently evolved "daughter gene", by analogy to mother and daughter cells in reproduction.

In prokaryotes, automated genome-annotation algorithms like Glimmer allow only one open reading frame (ORF) per locus, with the exception of short overlaps [2]. This systematically excludes overlapping ORFs from being annotated as genes [3]. The 'inferior' ORFs (e.g. shorter or with fewer hits in databases) within overlapping gene pairs have been called 'shadow ORFs' since they are found in the shadow of the annotated coding ORF [4]. Determining which ORF to annotate within such pairs has been described as the most difficult problem in prokaryotic gene annotation [5]. Nevertheless, a few prokaryotic OLGs have been discovered, often serendipitously in the pursuit of other unannotated genes [6-8]. For instance, in some Escherichia coli strains a few non-trivial overlaps have been detected and experimentally analysed [9-16]. Transcriptomic or translatomic evidence for OLGs also exists in genera such as Mycobacterium [17] or Pseudomonas [18]. Recently, antisense OLGs have also been reported in archaea [19] and in mammals [20,21]. These findings support the hypothesis that OLGs are […]

Verifying OLGs using mass spectrometry (MS) is difficult because most OLGs appear to be short and weakly expressed, and proteomics has limited abilities in detecting such proteins [22]. RNA sequencing led to the discovery of many antisense transcripts, but whether many of these […]

[…] RiboSeq data in two P. aeruginosa strains (PAO1 & ATCC33988) [69]. Strains were cultivated in M9 broth with glycerol or n-alkanes. Reanalysis revealed transcriptional and translational signals for both OLGs in strain PAO1 (Fig. 3c and Supplementary Fig. 7) and for olg1 in strain ATCC33988 (Supplementary Table 2).
Interestingly, RPKM_RNASeq values were similar for both OLGs when comparing LB with M9+glycerol, but differed in M9+alkane (Fig. 3c). This might suggest carbon-source-dependent regulation. Further, these data indicate regulated translation of olg1 (log_FC = 1.15) and olg2 (log_FC = -0.55) in PAO1 when comparing M9+alkane with M9+glycerol (false discovery rate, FDR ≤0.05). Thus, both OLGs are expressed more weakly in alkane media than LB (Fig. 3c).

[…] Table 4). Homologs outside Pseudomonas clustered within the genus in terms of sequence identity. Thus, we infer that these are best accounted for by horizontal gene transfer from the genus and focus on evolution within the genus Pseudomonas.

Homologs of tle3 containing both the α/β hydrolase and DUF3274 domains were found in multiple phyla; however, within the order Pseudomonadales they were found only in the genus Pseudomonas, suggesting horizontal gene transfer from another order. The mother gene of olg2, PA1383, in contrast, does not have highly similar homologs outside of Pseudomonas. The one exception was derived from a low-quality genome of Acinetobacter baumannii (noted in RefSeq to be of excess size), which was disregarded. A few distant homologs were found, with the two top hits in E. coli (matching 43% of the PAO1 sequence) and Salmonella enterica (matching 17%). This evolutionary distance again implied a horizontal gene transfer into or out of Pseudomonas with subsequent evolutionary divergence. None of the non-Pseudomonas homologs included the N-terminal signal peptide in PA1383, suggesting functional changes in Pseudomonas compared to other bacteria.

Both olg1 and olg2, in the length present in reference strain PAO1, were highly taxonomically restricted. The taxonomic distribution of both OLGs was limited approximately to the species P.
aeruginosa, according to BLASTp hits along with additional metagenome-assembled genomes (MAGs) (Fig. 4a). Searching for the mother genes tle3 and PA1383 within the order Pseudomonadales yielded approximately 900 and 300 unique sequences, respectively. For olg1 (in tle3), the intact ORF (i.e. without premature stop codons) was restricted to P. aeruginosa, with one exception in P. prosekii with a non-start codon (GTA) at the locus of […]

[…] olg1 and olg2 were both significantly longer than expected given the amino acids (AA) in the reference genes tle3 and PA1383, using the synonymous-mutation method from the tool 'Frameshift' [73] (Fig. 4b). This method substitutes synonymous codons randomly and obtains an empirical cumulative distribution function of the resulting ORF lengths in each alternative reading frame. This resulted in a P-value of <10^-10 for both ORFs. Olg1 and olg2 were also longer than expected given the overall codon usage (codon-permutation method of 'Frameshift'), although not statistically significantly, with p = 0.163 and p = 0.0635, respectively (Supplementary Table 5). The P-values included a correction for multiple tests, i.e. the number of observed ORFs in this alternate reading frame. The synonymous-mutation P-values are still significant after a conservative multiple-tests adjustment of multiplying by the total number of genes, arguably appropriate given the OLGs' detection with a genome-wide scan. The non-significant results for the codon-permutation method are not surprising given that it implicitly depends on stop codons elsewhere in the alternate reading frame, of which there are few due to the length of the OLGs relative to their mother genes. In summary, these results show that the ORF lengths were not simply a result of the overall sequence composition of the mother genes, implying selection for long ORFs via negative selection on stop codons.
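The synonymous-mutation idea described above can be illustrated with a minimal Python sketch. This is not the 'Frameshift' tool itself: the function names, the uniform choice among synonymous codons, the stop-to-stop definition of ORF length, and the (hits + 1) / (n + 1) empirical P-value are all assumptions made for illustration, and the real tool may differ in each of these choices.

```python
import random

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
# Map each amino acid (or stop, "*") to its list of synonymous codons.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(cds):
    """Translate an in-frame coding sequence (length divisible by 3)."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3))


def revcomp(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_stop_free_run(seq, frame):
    """Longest run of non-stop codons (in codons) in the given frame (0, 1, 2)."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if CODON_TABLE[seq[i:i + 3]] == "*":
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best


def shuffle_synonymous(cds, rng):
    """Replace every codon by a randomly chosen synonymous codon (uniform choice)."""
    return "".join(
        rng.choice(SYNONYMS[CODON_TABLE[cds[i:i + 3]]])
        for i in range(0, len(cds), 3)
    )


def orf_length_p_value(cds, frame, antisense=True, n_sim=1000, seed=0):
    """Empirical P-value that a synonymous recoding yields an ORF at least as long."""
    rng = random.Random(seed)

    def alt_len(s):
        return longest_stop_free_run(revcomp(s) if antisense else s, frame)

    observed = alt_len(cds)
    hits = sum(alt_len(shuffle_synonymous(cds, rng)) >= observed
               for _ in range(n_sim))
    return observed, (hits + 1) / (n_sim + 1)
```

Because every simulated sequence encodes the same mother protein, a small P-value would indicate that the alternative-frame ORF is longer than the amino acid sequence alone explains. Note that an empirical P-value from n simulations cannot fall below 1 / (n + 1), so very small values such as those reported above require a different estimation strategy than plain counting.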
Secondly, sequence evolution of each mother gene was modelled without selection for maintaining an overlapping ORF. The presence of stop codons in simulated sequences was then compared to natural sequences. This method was previously used to support an inference of selection on the OLG asp in HIV-1 [74]. When evolution was simulated using an empirical codon model in Pyvolve [75] along trees calculated from the mother genes (Fig. 4a), stop codons evolved more frequently in the simulated OLG sequences than in the observed natural sequences. As such, fewer simulated than natural sequences had intact full-length ORFs (Fig. 4c). Following the originators of the method [74], outgroup sequences without stop codons in the OLG region were chosen to root the tree.

Codon position-specific constraint supports purifying selection. Synonymous variation in the mother genes tle3 and PA1383 was reduced over a large part of the OLG region (Fig. 5a, Supplementary Table 6), according to results from 'FRESCo' [28]. For tle3, a comparison of the rates of synonymous evolution in the OLG-containing genomes versus the rest of the alignment, using a paired two-tailed t-test in non-overlapping adjacent windows of 50 codons over the whole OLG sequence, showed increased constraint in the OLG region, with p = 0.086. Results for olg2 were similar for the last 350 codons of olg2 (p = 0.03) (Supplementary Table 6), but there was no synonymous constraint towards the end of the mother gene PA1383 from approximately codon 500 onwards. Because the non-overlapping start region of olg2 is not well conserved across P. aeruginosa (Supplementary Table 4 […]

[…] Table 7). Both mother genes tle3 and PA1383 were observed to be under significant purifying selection in the sets of Pseudomonas genomes with and without these OLGs.
The results from 'FRESCo' and 'OLGenie' are fully independent, as they depend on synonymous and nonsynonymous sites in the mother genes, respectively. As each measure independently shows a tendency towards constraint, together they provide good evidence of evolutionary constraint in the OLG sequences. Further, some individual pairwise sequence comparisons with 'OLGenie' are statistically significant ( […]

Prokaryotic OLGs, outside viruses [76,77], are often categorically rejected, and those already annotated have been attributed to misannotations [78]. In this study, we describe the detection and characterization of two exceptionally long OLGs, olg1 and olg2, in P. aeruginosa. We propose that the detected overlapping ORFs encode functional protein products due to 1) the presence of structural features necessary for gene expression, 2) successful transcription and translation as indicated by RNASeq and RiboSeq, 3) discovery of several translated peptides via mass spectrometry, 4) validation and confirmation of their regulated expression during growth of P. aeruginosa PAO1 using targeted proteomics and isotopically labelled reference peptides, 5) successful prediction of both ORFs at the genomic and translational level by annotation programs, and 6) evidence of purifying selection on both gene candidates from multiple methods. While these results provide strong evidence for the genuine protein-coding nature and functionality of both ORFs, they can only be designated as OLGs if their respective mother genes (tle3 and PA1383) are correctly annotated and are also genuinely protein coding. The gene tle3 has been confirmed to encode the antibacterial type VI lipase effector 3 [54,79]. PA1383 is annotated as a hypothetical gene, but we show that homologs are widely distributed across bacteria. Further, it contains a signal peptide associated with export, and it is under purifying selection.
For both mother genes, we show clear expression in our RNASeq, RiboSeq and MS experiments. It appears unlikely that the MS-detected peptides represent translation products without function, considering the high bioenergetic cost of translation [80]. Taken together, it is beyond reasonable doubt that both mother genes encode functional proteins and that the overlapping ORFs presented here are not just annotation errors from artifactual mother genes.

[…] Narnaviridae [84]. This ORF was hypothesized to be protein-coding [85], but experimental evidence is lacking.

Almost all proposed antisense OLGs lack a native proof of the encoded protein product, arguably calling their coding potential into question. Proteomic detection of the OLG cosA via MS, for instance, failed, presumably due to its low expression [86]. In addition to low protein abundance, the generally small size of OLGs also hampers proteomic proof due to an insufficient amount, or complete absence, of mass spectrometry-detectable peptides [22]. […]

However, in our sliding window analyses of tle3 (Fig. 5bc), we found no evidence for selection on upstream sequences.

Bioinformatic analysis of OLGs is still in its infancy. For instance, for evolutionary simulation, it would be ideal to start with the actual ancestral sequence, but ancestral-sequence reconstruction for OLGs is as yet unsolved. Thus, for the simulation method, rather than introducing new biases with imperfect reconstruction, we instead followed the approach of […]

Cultivation and harvest. Lysogeny broth (10 g/L tryptone, 5 g/L yeast extract, 5 g/L NaCl) was inoculated 1:100 using an overnight culture of P. aeruginosa PAO1 (DSM19880) and aerobically incubated (37°C, 150 rpm). After 1h, 2h, 4h, 6h, 8h, and 24h and at OD600nm = 1, samples were harvested by centrifugation (10 min, 12,000×g, 4°C).
For transcriptomes and translatomes, cellular processes were stalled at OD600nm = 1 by adding dry ice to reach 4°C. Next, cells were centrifuged (8,000×g, 4°C, 5 min) and resuspended in polysome-lysis-buffer [94] (325 µL per 100 mL initial culture). Cells were lysed in a cell crusher with liquid nitrogen. After centrifugation as before, the supernatant was used for transcriptomes and translatomes. […]

[…] cellulose-acetate filter tubes, footprints were precipitated with ethanol and converted into a sequencing library as above for two biological replicates.

Cell lysis and protein digest for mass spectrometry. Cells were lysed in 100 µL absolute TFA (Sigma-Aldrich; 5 min, 55°C, shaking at 1,000 rpm) and neutralized with 900 µL 2 M Tris [97]. Protein concentration was determined using Bradford reagent (B6916, Sigma-Aldrich). For offline high pH reversed-phase (hpH RP) fractionation and for targeted proteomics, 75 µg and 20 µg of total protein, respectively, were reduced and alkylated (10 mM TCEP, 55 mM CAA; 5 min, 95°C). Water-diluted samples (1:1) were subjected to proteolysis with trypsin (enzyme to protein ratio 1:50, 30°C, overnight, shaking at 400 rpm), which was then stopped with 3% formic acid (FA). […]

[…] supplemented with common contaminants (by MaxQuant) and Olg1 and Olg2 AA sequences. Trypsin/P was specified as proteolytic enzyme. Precursor tolerance was set to 4.5 ppm and fragment ion tolerance to 20 ppm. Results were adjusted to 1% FDR on peptide spectrum match level and protein level employing a target-decoy approach using reversed protein sequences. Minimal peptide length was defined as 7 AA; the "match-between-run" function was disabled. For full proteome analyses, carbamidomethylated cysteine was set as fixed and oxidation of methionine and N-terminal protein acetylation as variable modifications.
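The target-decoy filtering mentioned above can be sketched generically in Python. This is a simplified illustration of the principle, not MaxQuant's actual procedure (which works with posterior error probabilities and applies a separate protein-level step). The input format, the function name `filter_at_fdr`, and the estimate FDR = decoys / targets are assumptions made for this example.

```python
def filter_at_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the lowest score cutoff with estimated FDR <= max_fdr.

    psms: list of (score, is_decoy) tuples; a higher score means a better match.
    Decoy hits come from searching reversed protein sequences.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs that pass the threshold
    for idx, (_score, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything accepted down to this rank.
        fdr = decoys / max(targets, 1)
        if fdr <= max_fdr:
            cutoff = idx
    # Report only target hits; decoys served solely to estimate the error rate.
    return [p for p in ranked[:cutoff] if not p[1]]
```

The loop keeps the largest accepted prefix rather than stopping at the first violation, so a single early decoy does not discard a long tail of otherwise well-supported matches.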
Correlation scores (dot product) between experimental and predicted spectra were calculated via Skyline-daily (64-bit, v20.1.9.234) [101], which supports Prosit [68] spectra predictions. For data analysis, protein intensities and iBAQ [102] values were calculated.

[…] Monitoring (PRM) were performed with a 50-min linear gradient on a Dionex Ultimate 3000 RSLCnano system coupled to a Q-Exactive HF-X mass spectrometer (Thermo Fisher Scientific). The spectrometer was operated in PRM and positive ionization mode. MS1 spectra (360-1300 m/z) were recorded at a resolution of 60,000 using an AGC target value of 3×10^6 and a MaxIT of 100 ms. Targeted MS2 spectra were acquired at 60,000 resolution with a fixed first mass of 100 m/z, after HCD with 26% NCE, and using an AGC target value of 1×10^6, a […]

[…] Table 10). Skyline-daily [101] was used to build an experimental spectral library from the generated PRM data.

Targeted mass spectrometric data analysis. PRM data were analysed using Skyline-daily [101]. Peak integration, transition interferences and integration boundaries were reviewed manually, considering four to six transitions per peptide. To discriminate between positive and negative peptide detection, filtering according to correlation of fragment ion intensities between the endogenous (light) and the spike-in (heavy) peptides was applied ("Library Dot Product" ≥0.8). Additionally, a correlation of fragment ion intensities between the light and heavy peptide ("DotProductLightToHeavy" of >0.9) and a mass accuracy of below ±20 ppm ("Average Mass Error PPM") were required. Total protein intensity was computed by summing all light peptide intensities detected positive in each sample (Supplementary Fig. 6b). Uniqueness of the peptides was assessed against the RefSeq database for P. aeruginosa PAO1.

Bioinformatic analyses.
Putative σ70 promoters within a 300-nt region upstream of the start codon were predicted by BPROM [58] with minimum LDF scores of 0.2. Shine-Dalgarno sequence identification was performed as described [61] within a region of 30 nt upstream of the start codon and a minimum free energy (ΔG_SD) threshold of −2.9 kcal/mol. To predict ρ-independent terminators, a 300-nt region downstream of the respective stop codon was analysed using FindTerm [58] with a threshold of −3. Predicted terminator regions were read in non-overlapping sliding windows of 30 nt and folded with Mfold [104], identifying stem loops.

FASTQ files were processed using a custom Perl script. FastQC [105] was used to assess raw read quality and adapter sequences were trimmed with fastp [106]. Trimmed reads were aligned to the reference (GCF_000006765.1_ASM676v1) with Bowtie2 v2.2.6 [107] using "--very-sensitive end-to-end" with a seed length of 17 nt. Reads mapping to rRNAs and tRNAs were filtered with SAMTools [108] and BEDTools [109]. Remaining reads were normalized to gene […]

Significant changes in translation were determined between the published RiboSeq datasets "M9+n-alkane" and "M9+glycerol" [69]. Read counts were scaled to the smallest library size and differential expression analysis was performed using an exact test implemented in edgeR [112].

Gene prediction was performed with Prodigal [62] using default parameters. In order to detect overlapping ORFs, all possible start codons within and in the upstream vicinity of the coding regions for tle3 and PA1383 were masked by N (any nucleotide […]

[…] were assessed using 'FRESCo' [28]. Approximate maximum-likelihood nucleotide trees were calculated using FastTree [123] for the full sets of "OLG" and "non-OLG" genomes, and 'FRESCo' was run on codon alignments (described above) with a sliding window size of 50 codons.
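The scaling of read counts to the smallest library size, described above for the RiboSeq comparison, can be sketched as follows. This is only an illustration under stated assumptions: library size is taken as the sum of the counts supplied per sample, the function name `scale_to_smallest_library` is hypothetical, and the example counts are invented. The authors' own script is not shown in the text.

```python
def scale_to_smallest_library(counts):
    """Scale per-gene read counts so every sample matches the smallest library.

    counts: dict mapping sample name -> dict mapping gene name -> raw read count.
    Assumes every sample has at least one mapped read.
    """
    # Library size is taken here as the total of the counts given for a sample.
    sizes = {sample: sum(genes.values()) for sample, genes in counts.items()}
    smallest = min(sizes.values())
    return {
        sample: {gene: n * smallest / sizes[sample] for gene, n in genes.items()}
        for sample, genes in counts.items()
    }


# Invented counts for illustration only; not data from the study.
example = {
    "M9+n-alkane": {"olg1": 10, "tle3": 90},
    "M9+glycerol": {"olg1": 40, "tle3": 160},
}
scaled = scale_to_smallest_library(example)
```

After scaling, every sample sums to the same total, so counts for a gene can be compared between conditions without the larger library dominating.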
Constraint on non-synonymous codon changes in the OLGs was assessed using 'OLGenie' [30]. Analysis for each mother-gene codon alignment (created using PAL2NAL, described above) of OLG and non-OLG genomes was conducted with standard settings. Sliding window analyses of 50 codons were conducted using a minimum number of defined codons of 2. Pairwise whole-gene comparisons of olg1 and olg2 were conducted with standard settings and a custom Bash script that produced a pairwise matrix.

[Table residue: codon positions (1, 2, 3) for reading frames +3, −1 and −2]

Overlapping genes in bacteriophage phiX174
Identifying bacterial genes and endosymbiont DNA with Glimmer
Missing genes in the annotation of prokaryotic genomes
The Sorcerer II Global Ocean Sampling expedition: expanding the universe of protein families
Microbial gene identification using interpolated Markov models
Novel overlapping coding sequences in Chlamydia trachomatis
Salmonella. G3 (Bethesda)
Regulation of the overlapping pic/set locus in
Phenotype of htgA (mbiA), a recently evolved orphan gene of completely overlapping in antisense to yaaW
Evidence for the recent origin of a bacterial protein-coding, overlapping orphan gene by evolutionary overprinting
The Novel Anaerobiosis-Responsive Overlapping Gene ano Is Overlapping Antisense to the Annotated Gene ECs2385 of Escherichia coli O157:H7 Sakai. Frontiers in
Frontiers in 703 microbiology 9 A novel short L-arginine responsive protein-coding gene (laoB) 705 antiparallel overlapping to a CadC-like transcriptional regulator in Escherichia coli 706 O157:H7 Sakai originated by overprinting The novel 709 EHEC gene asa overlaps the TEGT transporter gene in antisense and is regulated by 710 NaCl and growth phase A Novel pH-712 Regulated, Unusual 603 bp Overlapping Protein Coding Gene pop Is Encoded 713 Antisense to ompA in Escherichia coli O157:H7 (EHEC) Evidence for 716 Numerous Embedded Antisense Overlapping Genes in Diverse E. coli Strains. bioRxiv Pervasive Translation in Mycobacterium tuberculosis Transcriptome Analysis of Pseudomonas syringae Identifies New 721 Noncoding RNAs, and Antisense Activity Ribosome profiling in archaea reveals leaderless translation, 724 novel translational initiation sites, and ribosome pausing at single codon resolution. 725 Unusually efficient CUG initiation of an overlapping reading frame 727 in POLG mRNA yields novel protein POLGARF Evidence for a novel overlapping coding sequence in POLG initiated 730 at a CUG start codon Enrichment 732 and identification of small proteins in a simplified human gut microbiome Are Antisense Proteins in Prokaryotes 735 Functional? 
Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling
Structured RNA Contaminants in Bacterial Ribo-Seq
A method for the simultaneous estimation of selection intensities in overlapping genes
Mapping overlapping functional elements embedded within the protein-coding regions of RNA viruses
FRESCo: finding regions of excess synonymous constraint in diverse viruses
A simple method for estimating the strength of natural selection on overlapping genes
Estimating Natural Selection to Predict Functional Overlapping Genes
Retapamulin-assisted ribosome profiling reveals the alternative bacterial proteome
Predicting statistical properties of open reading frames in bacterial genomes
Bacteria-phage coevolution as a driver of ecological and evolutionary processes in microbial communities
Degeneracy of the information contained in amino acid sequences: evidence from overlaid genes
Evolution of overlapping genes
Do overlapping genes violate molecular biology and the theory of evolution?
The relations between the precodons of overlapping genes
Evolution by gene duplication
Evolution of living organisms: evidence for a new theory of transformation
Origins of genes: "big bang" or continuous creation?
Frameshifting preserves key physicochemical properties of proteins
Dynamically evolving novel overlapping gene as a factor in the SARS-CoV-2 pandemic
Neutral adaptation of the genetic code to double-strand coding
The evolutionary origin of orphan genes
Small proteins can no longer be ignored
Overlapping protein-encoding genes in Pseudomonas fluorescens Pf0-1
Use of in vivo expression technology to identify genes important in growth and survival of Pseudomonas fluorescens Pf0-1 in soil: discovery of expressed sequences with novel genetic organization
Identification and validation of novel small proteins in Pseudomonas putida
The environmental occurrence of Pseudomonas aeruginosa
Pseudomonas aeruginosa: a formidable and ever-present adversary
Overview of Nosocomial Infections Caused by Gram-Negative Bacilli
Has the era of untreatable infections arrived?
How to manage Pseudomonas aeruginosa infections
Effector Is Required for the Delivery of a Novel Antibacterial Toxin in Pseudomonas aeruginosa
Genome-wide identification of Pseudomonas aeruginosa exported proteins using a consensus computational strategy combined with a laboratory-based PhoA fusion screen
Function of the Pseudomonas aeruginosa NrdR Transcription Factor: Global Transcriptomic Analysis and Its Role on Ribonucleotide Reductase Gene Expression
H-NS-like proteins in Pseudomonas aeruginosa coordinately silence intragenic transcription
Automatic annotation of microbial genomes and metagenomic sequences. Metagenomics and its applications in agriculture, biomedicine and environmental studies
The single-nucleotide resolution transcriptome of Pseudomonas aeruginosa grown in body temperature
Genome-wide identification of transcriptional start sites in the plant pathogen Pseudomonas syringae pv. tomato str.
DC3000
Correlations between Shine-Dalgarno sequences and gene features such as predicted expression levels and operon structures
Prodigal: prokaryotic gene recognition and translation initiation site identification
Antisense transcription in Pseudomonas aeruginosa
Widespread antisense transcription in Escherichia coli
Common and phylogenetically widespread coding for peptides by bacterial small RNAs
DeepRibo: a neural network for precise gene annotation of prokaryotes by combining ribosome profiling signal and binding site patterns
Translatomics combined with transcriptomics and proteomics reveals novel functional, recently evolved orphan genes in Escherichia coli O157:H7 (EHEC)
Prosit: proteome-wide prediction of peptide tandem mass spectra by deep learning
A comprehensive multi-omics approach uncovers adaptations for growth and survival of Pseudomonas aeruginosa on n-alkanes
A unified catalog of 204,938 reference genomes from the human gut microbiome
New insights from uncultivated genomes of the global human gut microbiome
Features of Functional Human Genes. bioRxiv
A Simple Method to Detect Candidate Overlapping Genes in Viruses Using Single Genome Sequences
Concomitant emergence of the antisense protein gene of HIV-1 and of the pandemic
Pyvolve: a flexible Python module for simulating sequences along phylogenies
Why genes overlap in viruses
The combinatorics of overlapping genes
Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions?
Diverse type VI secretion phospholipases are functionally plastic antibacterial effectors
The bioenergetic costs of a gene
Retapamulin-Assisted Ribosome Profiling Reveals the Alternative
Two overlapping antiparallel genes encoding the iron regulator DmdR1 and the Adm proteins control siderophore and antibiotic biosynthesis in Streptomyces coelicolor A3(2)
Properties and abundance of overlapping genes in viruses
An exploration of ambigrammatic sequences in narnaviruses
A case for a negative-strand coding sequence in a group of positive-sense RNA viruses
Proteomic Detection of Non-Annotated Protein-Coding Genes in Pseudomonas fluorescens Pf0-1
Proteogenomic Analysis of Bacteria and Archaea: A 46 Organism Case Study
Lost and Found: Re-searching and Re-scoring Proteomics Data Aids Genome Annotation and Improves Proteome Coverage. mSystems 5
Bacterial riboproteogenomics: the era of N-terminal proteoform existence revealed
but not all, lineage-specific genes can be explained by homology detection failure
Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes
The unexpected complexity of bacterial genomes
The Ingenuity of Bacterial Genomes
High-precision analysis of translational pausing by ribosome profiling in bacteria lacking EFP
Analysis of Relative Gene Expression Data Using Real-Time Quantitative PCR and the 2−ΔΔCT Method
Discovery of numerous novel small genes in the intergenic regions of the Escherichia coli O157:H7 Sakai genome
Sample Preparation by Easy Extraction and Digestion (SPEED) - A Universal, Rapid, and Detergent-free Protocol for Proteomics Based on Acid Extraction
The MaxQuant computational platform for mass spectrometry-based shotgun proteomics
Andromeda: a peptide search engine integrated into the MaxQuant environment
Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation
Skyline: an open source document editor for creating and analyzing targeted proteomics experiments
Global quantification of mammalian gene expression control.
PROCAL: A Set of 40 Peptide Standards for Retention Time Indexing, Column Performance Monitoring, and Collision Energy Calibration
Mfold web server for nucleic acid folding and hybridization prediction
FastQC: a quality control tool for high throughput sequence data
fastp: an ultra-fast all-in-one FASTQ preprocessor
Fast gapped-read alignment with Bowtie 2
The Sequence Alignment/Map format and SAMtools
BEDTools: a flexible suite of utilities for comparing genomic features
High-resolution TADs reveal DNA sequences underlying genome organization in flies
Differentiation of ncRNAs from small mRNAs in Escherichia coli O157:H7 EDL933 (EHEC) by combined RNAseq and RIBOseq - ryhB encodes the regulatory RNA RyhB and a peptide
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
Basic local alignment search tool
Entrez Programming Utilities Help [Internet] (National Center for Biotechnology Information
Fast and sensitive protein alignment using DIAMOND
QuickProbs 2: towards rapid construction of high-quality alignments of large protein families
PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era
Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity
The Newick utilities: high-throughput phylogenetic tree processing in the UNIX shell
ETE 3: reconstruction, analysis, and visualization of phylogenomic data
FastTree 2 - approximately maximum-likelihood trees for large alignments
The PRIDE database and related tools and resources in 2019:
improving support for quantification data
Panorama Public: A Public Repository for Quantitative Data Sets Processed in Skyline

We thank Romy Wecko, Verena Breitner, Lara Wanner, Hermine Kienberger, and Franziska Hackbarth for technical assistance, and Christopher Huptas for bioinformatic support. We also thank Siddhanth Rao for assistance with scripts for the use of 'FRESCo', and Chase Nelson and April Wei for helpful comments on the manuscript.

The authors declare no competing interests.

Correspondence and requests concerning evolutionary analyses should be addressed to Z.A.; concerning biological experiments to K.N.

[…] found in a wider taxonomic group than the specific ORFs studied here; for olg1, apparent purifying selection is limited to a subclade within Pseudomonas, whereas for olg2 it is found across the genus; for both ORFs, however, evidence is strongest in the vicinity of P. aeruginosa. Codon numbers are with respect to an alignment including gaps.