Distribution and diversity of dimetal-carboxylate halogenases in cyanobacteria

Nadia Eusebio¹, Adriana Rego¹, Nathaniel R. Glasser², Raquel Castelo-Branco¹, Emily P. Balskus²* and Pedro N. Leão¹*

¹ Interdisciplinary Centre of Marine and Environmental Research (CIIMAR/CIMAR), University of Porto, Matosinhos, Portugal
² Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA

*Corresponding authors. E-mail: pleao@ciimar.up.pt, balskus@chemistry.harvard.edu

Keywords: halogenases, cyanobacteria, natural products, biocatalysis

Repositories: The draft genomes generated in this study are available in GenBank under BioProject SUB8150995.

bioRxiv preprint doi: https://doi.org/10.1101/2021.01.05.425448; this version posted January 5, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

Abstract

Halogenation is a recurring feature in natural products, especially those from marine organisms. The selectivity with which halogenating enzymes act on their substrates renders halogenases interesting targets for biocatalyst development. Recently, CylC, the first predicted dimetal-carboxylate halogenase to be characterized, was shown to regio- and stereoselectively install a chlorine atom onto an unactivated carbon center during cylindrocyclophane biosynthesis. Homologs of CylC are also found in other characterized cyanobacterial secondary metabolite biosynthetic gene clusters.
Due to its novelty in biological catalysis, selectivity and ability to perform C-H activation, this halogenase class is of considerable fundamental and applied interest. However, little is known regarding the diversity and distribution of these enzymes in bacteria. In this study, we used both genome mining and PCR-based screening to explore the genetic diversity and distribution of CylC homologs. While we found non-cyanobacterial homologs of these enzymes to be rare, we identified a large number of genes encoding CylC-like enzymes in publicly available cyanobacterial genomes and in our in-house culture collection of cyanobacteria. Genes encoding CylC homologs are widely distributed throughout the cyanobacterial tree of life, within biosynthetic gene clusters of distinct architectures. Their genomic contexts feature a variety of biosynthetic partners, including fatty-acid activation enzymes, type I or type III polyketide synthases, dialkylresorcinol-generating enzymes, monooxygenases or Rieske proteins. Our study also reveals that dimetal-carboxylate halogenases are among the most abundant types of halogenating enzymes in the phylum Cyanobacteria. This work will help to guide the search for new halogenating biocatalysts and natural product scaffolds.

Data statement: All supporting data and methods have been provided within the article or through a Supplementary Material file, which includes 14 supplementary figures and 4 supplementary tables.

Introduction

Nature is a rich source of new compounds that fuel innovation in the pharmaceutical and agriculture sectors [1]. The remarkable diversity of natural products (NPs) results from a similarly diverse pool of biosynthetic enzymes [2]. These enzymes are often highly selective and efficient, carrying out demanding reactions in aqueous media, and are therefore interesting starting points for the development of industrially relevant biocatalysts [2]. Faster and more accessible DNA sequencing technologies have enabled, in the past decade, a large number of genomics and metagenomics projects focused on the microbial world [3]. The resulting sequence data holds immense opportunities for the discovery of new microbial enzymes and their associated NPs [4].

Halogenation is a widely used and well-established reaction in synthetic and industrial chemistry [5], which can have significant consequences for the bioactivity, bioavailability and metabolic activity of a compound [5-7]. Halogenating biocatalysts are thus highly desirable for biotechnological purposes [6, 8]. The mechanistic aspects of biological halogenation can also inspire the development of organometallic catalysts [9]. Nature has evolved multiple strategies to incorporate halogen atoms into small molecules [6], as illustrated by the structural diversity of thousands of currently known halogenated NPs, which include drugs and agrochemicals [10, 11]. Until the early 1990s, haloperoxidases were the only known halogenating enzymes. Research on the biosynthesis of halogenated metabolites eventually revealed a more diverse range of halogenases with different mechanisms. Currently, biological halogenation is known to proceed by distinct electrophilic, nucleophilic or radical mechanisms [6].
Electrophilic halogenation is characteristic of the flavin-dependent halogenases and the heme- and vanadium-dependent haloperoxidases, which catalyze the installation of C-I, C-Br or C-Cl bonds onto electron-rich substrates. Two families of nucleophilic halogenases are known, the halide methyltransferases and the SAM halogenases. Both utilize S-adenosylmethionine (SAM) as an electrophilic co-factor or as a co-substrate and halide anions as nucleophiles. Notably, these are the only halogenases capable of generating C-F bonds. Finally, radical halogenation has only been described for nonheme iron/2-oxoglutarate (2OG)-dependent enzymes. This type of halogenation allows the selective insertion of a halogen into a non-activated, aliphatic C-H bond. A recent review by Agarwal et al. (2017) thoroughly covers the topic of enzymatic halogenation.

Cyanobacteria are a rich source of halogenases among bacteria, in particular of nonheme iron/2OG-dependent and flavin-dependent halogenases (Fig. 1). AmbO5 and WelO5 are cyanobacterial enzymes that belong to the nonheme iron/2OG-dependent halogenase family [12-14]. AmbO5 is an aliphatic halogenase capable of site-selectively modifying ambiguine, fischerindole and hapalindole alkaloids [12, 13]. Its close homolog WelO5 (79% sequence identity) performs analogous halogenations in hapalindole-type alkaloids and is involved in the biosynthesis of welwitindolinone [13, 15].
BarB1 and BarB2 are also nonheme iron/2OG-dependent halogenases; they catalyze the trichlorination of a methyl group of a leucine substrate attached to the peptidyl carrier protein BarA in the biosynthesis of barbamide [16-18]. Other halogenases from this enzyme family include JamE, CurA, and HctB. JamE and CurA catalyze halogenations in intermediate steps of the biosynthesis of jamaicamide and curacin A, respectively [19, 20], while HctB is a fatty acid halogenase responsible for chlorination in hectochlorin assembly [21]. ApdC and McnD are FAD-dependent halogenases responsible for the modification of cyanopeptolin-type peptides (also known as (3S)-amino-(6R)-hydroxy piperidone (Ahp)-cyclodepsipeptides). These enzymes halogenate, respectively, anabaenopeptilides in Anabaena and micropeptins in Microcystis strains [22-25]. AerJ is another example of a FAD-dependent halogenase, which acts during aeruginosin biosynthesis in Planktothrix and Microcystis strains [24].

Recent efforts to characterize the biosynthesis of structurally unusual cyanobacterial natural products have uncovered a distinct class of halogenating enzymes. Using a genome mining approach, Nakamura et al. (2012) discovered the cylindrocyclophane biosynthetic gene cluster (BGC) in the cyanobacterium Cylindrospermum licheniforme ATCC 29412 [26]. The paracyclophane natural products were found to be assembled from two chlorinated alkylresorcinol units [27]. The paracyclophane macrocycle is created by forming two C-C bonds in a Friedel-Crafts-like alkylation reaction catalyzed by the enzyme CylK [27] (Fig. 1). Therefore, although many cylindrocyclophanes are not halogenated, their biosynthesis involves a halogenated intermediate [26, 27], a process termed cryptic halogenation [28]. Nakamura et al.
(2017) showed that the CylC enzyme was responsible for regio- and stereoselectively installing a chlorine atom onto the fatty acid-derived sp3 carbon center of a biosynthetic intermediate that is subsequently elaborated to the key alkylresorcinol monomer (Fig. 1). To date, CylC is the only characterized dimetal-carboxylate halogenase (this classification is based on both biochemical evidence and similarity to other diiron-carboxylate proteins) [27]. Homologs of CylC have been found in the BGCs of the columbamides [29], bartolosides [30], microginin [27], puwainaphycins/minutissamides [31], and chlorosphaerolactylates [32], all of which produce halogenated metabolites. CylC-type enzymes bear low sequence homology to dimetal desaturases and N-oxygenases [27], functionalize C-H bonds in aliphatic moieties at either terminal or mid-chain positions, and are likely able to carry out gem-dichlorination (Kleigrewe 2015, Leão 2015). The reactivity displayed by CylC and its homologs is of interest for biocatalysis, in particular because this type of carbon center activation is often inaccessible to organic synthesis [15, 33]. An understanding of the molecular basis for the halogenation of different positions and for chain-length preference will also be of value for biocatalytic applications. Hence, accessing novel variants of CylC enzymes will facilitate the functional characterization of this class of halogenases, mechanistic studies, and biocatalyst development.
Here, we provide an in-depth analysis of the diversity, distribution and context of CylC homologs in microbial genomes. Using both publicly available genomes and our in-house culture collection of cyanobacteria (LEGEcc), we report that CylC enzymes are common in cyanobacterial genomes, found in numbers comparable to those of flavin-dependent or nonheme iron/2OG-dependent halogenases. We additionally show that CylC homologs are distributed throughout the cyanobacterial phylogeny and are, to a great extent, part of cryptic BGCs with diverse architectures, underlining the potential for NP discovery associated with this new halogenase class.

Figure 1. Selected examples of halogenation reactions catalyzed by different classes of microbial enzymes, with a focus on cyanobacterial halogenases. An asterisk denotes that the enzyme has been biochemically characterized. ACP: acyl carrier protein.
[Figure 1 panels (chemical structures not reproduced). a) Dimetal-carboxylate halogenases: CylC* (Cylindrospermum licheniforme ATCC 29412); BrtJ (Synechocystis salina LEGE 06099; unknown substrate; bartoloside I shown); ColD/ColE (Moorea bouillonii PNG); ClyC/ClyD (Sphaerospermopsis sp. LEGE 00249). b) Flavin-dependent halogenases: Bmp5* (Marinomonas mediterranea MMB-1); PrnA* (Pseudomonas fluorescens BL915); McnD (Microcystis cf. wesenbergii NIVA-CYA 172/5). c) Nonheme iron/2OG-dependent halogenases: CurA* (Moorea producens 3L); WelO5* (Hapalosiphon welwitschii UTEX B1830).]

Methods

Sequence similarity networks and Genomic Neighborhood Diagrams

Sequence similarity networks (SSNs) were generated using the EFI-EST server, with the “Sequence BLAST” option and CylC (AFV96137) as input [34], using negative log e-values of 2 and 40 (that is, e-value cutoffs of 1×10⁻² and 1×10⁻⁴⁰) for UniProt BLAST retrieval and SSN edge calculation, respectively. This SSN edge calculation cutoff was found to segregate the homologs into different SSN clusters; less stringent cutoff values resulted in a single SSN cluster. The 153 retrieved sequences and the query sequence were then used to generate the SSNs with an alignment score threshold of 42 and a minimum length of 90. The networks were visualized in Cytoscape (v3.8.0). The full SSN obtained in the previous step was used to generate Genomic Neighborhood Diagrams (GNDs) using the EFI-GNT tool [34]. A Neighborhood Size of 10 was used and the Lower Limit for Co-occurrence was 20%. The resulting GNDs were visualized in Cytoscape (Fig. 2).
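The effect of the edge cutoff described above (a stricter threshold splits the homologs into several clusters, a looser one merges them) can be illustrated with a small sketch. This is not the EFI-EST implementation: the node names and pairwise scores below are invented, and the clustering is a plain union-find over edges that pass a score threshold. Only the conversion from a negative log e-value setting to an e-value cutoff mirrors the settings quoted in the text.

```python
from collections import defaultdict


def evalue_cutoff(neg_log10: float) -> float:
    """Convert a negative-log10 e-value setting (e.g. 40) to an e-value cutoff."""
    return 10.0 ** (-neg_log10)


def ssn_clusters(nodes, scored_pairs, min_score):
    """Group nodes into clusters, linking pairs whose score is >= min_score."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the cluster representative (with path halving).
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in scored_pairs:
        if score >= min_score:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for n in nodes:
        clusters[find(n)].add(n)
    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: (-len(c), c))


# Hypothetical toy data: names and scores are invented for illustration only.
nodes = ["seqA", "seqB", "seqC", "seqD"]
pairs = [("seqA", "seqB", 60), ("seqB", "seqC", 41), ("seqC", "seqD", 45)]

strict = ssn_clusters(nodes, pairs, min_score=42)  # the seqB-seqC edge is dropped
loose = ssn_clusters(nodes, pairs, min_score=40)   # every edge is kept
```

With the stricter threshold the weakest edge is removed and the toy network separates into two clusters, which is the same qualitative behaviour reported for the CylC SSN.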
Cyanobacterial strains and growth conditions

Freshwater and marine cyanobacterial strains from the Blue Biotechnology and Ecotoxicology Culture Collection (LEGEcc) (CIIMAR, University of Porto) were grown in 50 mL of Z8 medium [35] or 50 mL of Z8 medium with 25‰ sea salts (Tropic Marine) and vitamin B12, with orbital shaking (~200 rpm) under a regimen of 16 h light (25 μmol photons m⁻² s⁻¹)/8 h dark at 25 °C.

Genomic DNA extraction

Fifty milliliters of each cyanobacterial culture were centrifuged at 7000 ×g for 10 min. The cell pellets were used for genomic DNA (gDNA) extraction using the PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific) or the NZY Plant/Fungi gDNA Isolation kit (NZYTech), according to the manufacturer’s instructions.

Primer design

Basic Local Alignment Search Tool (BLAST) searches using CylC [Cylindrospermum licheniforme UTEX B 2014] as query identified related genes (for tBLASTn: 31-93% amino acid identity). We discarded nucleotide hits with a length <210 or an e-value >1×10⁻¹⁰. The complete sequences (56 cylC homolog sequences, Table S1) were collected from NCBI and aligned using MUltiple Sequence Comparison by Log-Expectation (MUSCLE) [36]. Phylogenetic analysis of the hits was performed using FastTree with the GTR substitution model and a rate of 100. Streptomyces thioluteus aurF, encoding a distant dimetal-carboxylate protein [27], was used as an outgroup (AJ575648.1:4858-5868). We divided the phylogeny of cylC homologs into five groups with moderate similarity (Fig. S1).
The regions of higher similarity within each group were selected for degenerate primer design (Table 1).

Table 1. Degenerate primers. Amplicon size and Tm are listed once per primer pair.

Code  Sequence                 Expected amplicon size (bp)  Tm (°C)
AF    CAAAAAATHGCDCTYAAYC      788-986                      55
AR    TGDAADCCTTCRTGTTC
BF    CACAAAAAHTWGCTCTYAAYC    673-715                      57
BR    GTKGTRTGGWARGATTCATC
CF    AATCAWCTTTAYTGGGTRGC     506-509                      55
CR    AARAARTGAAARCTYTCRTC
DF    AATCAAACYAGYGCWGC        299                          51
DR    GTRAAATAYTGACAAGC
XF    ATCWRGAAACCARTSAAGA      449-591                      51
XR    CATCAAAAACTTTYYGTARRC

PCR conditions

PCRs to detect cylC homologs were conducted in a final volume of 20 µL, containing 6.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer (Promega), 2.0 µL of MgCl2, 1.0 µL of dNTPs, 2.0 µL of reverse primer and 2.0 µL of forward primer (each at 10 µM), 0.1 µL of GoTaq and 2.0 µL of cyanobacterial gDNA. PCR thermocycling conditions were: denaturation for 5 min at 95 °C; 35 cycles of denaturation for 1 min at 95 °C, primer annealing for 30 s at a group-specific temperature (55 °C for group A; 57 °C for group B; 55 °C for group C; 51 °C for group D; 51 °C for group X) and extension for 1 min at 72 °C; and final extension for 10 min at 72 °C.

When not already available, the 16S rRNA gene for a tested strain was amplified by PCR, using standard primers for amplification (CYA106F 5’ CGG ACG GGT GAG TAA CGC GTG A 3’ and CYA785R 5’ GAC TAC WGG GGT ATC TAA TCC 3’).
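The primers in Table 1, like CYA785R above, contain IUPAC ambiguity codes (for example Y, R, W, H, D), so each written primer stands for a pool of concrete oligonucleotides. The following sketch, which is an illustration and not part of the authors' workflow, counts and enumerates that pool using the standard IUPAC nucleotide code table; the function names are chosen here for illustration.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct sequences represented by a degenerate primer."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def expand(primer: str) -> list:
    """List every concrete sequence encoded by a degenerate primer."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


# Primer sequences copied from Table 1.
primers = {
    "AF": "CAAAAAATHGCDCTYAAYC",
    "DF": "AATCAAACYAGYGCWGC",
    "DR": "GTRAAATAYTGACAAGC",
}

for code, seq in primers.items():
    print(code, len(seq), "nt, degeneracy", degeneracy(seq))
```

Enumerating the pool with `expand` is only practical for modestly degenerate primers, because the pool size is the product of the options at every ambiguous position.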
The PCR reactions were conducted in a final volume of 20 µL, containing 6.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer, 2.0 µL of MgCl2, 1.0 µL of dNTPs, 2.0 µL of reverse primer and 2.0 µL of forward primer (each at 10 µM), 0.1 µL of GoTaq and 2.0 µL of cyanobacterial DNA. PCR thermocycling conditions were: denaturation for 5 min at 95 °C; 35 cycles of denaturation for 1 min at 95 °C, primer annealing for 30 s at 52 °C and extension for 1 min at 72 °C; and final extension for 10 min at 72 °C. Amplicon sizes were confirmed after separation in a 1.0% agarose gel.

Cloning and sequencing

The cylC homolog and 16S rRNA gene sequences were obtained either directly from the NCBI or through sequencing. To obtain high-quality sequences, TOPO PCR cloning (Invitrogen) was used. The TOPO cloning reaction was conducted in a final volume of 3 µL, containing 1 µL of fresh PCR product, 1 µL of salt solution, 0.5 µL of TOPO vector and 0.5 µL of water. The reaction was incubated for 20 min at room temperature. Three microliters of TOPO reaction were added to a tube containing chemically competent E. coli (Top10, Life Technologies) cells. After 30 min of incubation on ice, the cells were held for 30 s at 42 °C without shaking and then immediately transferred to ice. Then, 250 µL of room-temperature SOC medium were added to the mixture and the tube was shaken horizontally at 37 °C for 1 h (180 rpm). For each cloning reaction, 60 µL were spread onto LB ampicillin/X-gal plates and incubated overnight at 37 °C.

Two or three positive colonies from each reaction were tested by colony PCR. The PCR was conducted in a final volume of 20 µL, containing 10.9 µL of ultrapure water, 4.0 µL of 5× GoTaq Buffer, 2.0 µL of MgCl2, 1.0 µL of dNTPs, 1.0 µL of reverse (pUCR) and 1.0 µL of forward (pUCF) primers (each at 20 µM), 0.1 µL of GoTaq and the target colony.
PCR thermocycling conditions were: denaturation for 5 min at 95 °C; 35 cycles of denaturation for 1 min at 95 °C, primer annealing for 30 s at 50 °C and extension for 1 min at 72 °C; and final extension for 10 min at 72 °C. Amplicon sizes were confirmed after separation in a 1.0% agarose gel. Selected colonies were incubated overnight at 37 °C (180 rpm) in 5 mL of LB supplemented with 100 µg mL⁻¹ ampicillin. The plasmids containing the amplified PCR products were extracted (NZYMiniprep kits) and Sanger sequenced using pUC primers.

Cyanobacteria genome sequencing

Many of the LEGEcc strains are non-axenic. Therefore, before extraction of gDNA for genome sequencing, the amount of heterotrophic contaminant bacteria in the cyanobacterial cultures was evaluated by plating onto agar medium, either Z8 or Z8 with added 2.5% sea salts (Tropic Marine) and vitamin B12 (10 µg/L), depending on the original environment, supplemented with casamino acids (0.02% wt/vol) and glucose (0.2% wt/vol) [37]. The plates were incubated for 2-4 days at 25 °C in the dark and examined for bacterial growth. Cultures with minimal contamination were used for DNA extraction for genome sequencing. The DNA extraction methodology was selected based on the morphological features of each strain.
Total genomic DNA was isolated from a fresh or frozen pellet of 50 mL culture using a CTAB-chloroform/isoamyl alcohol-based protocol [38], the commercial PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific) or the NZY Plant/Fungi gDNA Isolation kit (NZYTech). The latter included a homogenization step (grinding cells using a mortar and pestle with liquid nitrogen) before extraction using the standard kit protocol. The quality of the gDNA was evaluated in a DS-11 FX Spectrophotometer (DeNovix) and by 1% agarose gel electrophoresis before genome sequencing, which was performed externally (Era7, Spain and MicrobesNG, UK) using 2 × 250 bp paired-end libraries and the Illumina platform (except for Synechocystis sp. LEGE 06099, whose genome was sequenced using the Ion Torrent PGM platform). A standard pipeline was carried out, including the identification of the closest reference genomes for read mapping using Kraken 2 [39] and BWA-MEM to check the quality of the reads [40], while de novo assembly was performed using SPAdes [41]. The genomic data obtained for each strain was treated as a metagenome. The resulting contigs were analyzed using the binning tool MaxBin 2.0 [42] and checked manually in order to retain only cyanobacterial contigs. The draft genomes were annotated using the NCBI Prokaryotic Genome Annotation Pipeline (PGAP) [43] and submitted to GenBank under the BioProject number SUB8150995. In the case of Hyella patelloides LEGE 07179 and Sphaerospermopsis sp. LEGE 00249, the assemblies had been previously deposited in NCBI under the BioSample numbers SAMEA4964519 and SAMN15758549, respectively.

Genomic context of CylC homologs

BLASTp searches using CylC [Cylindrospermum licheniforme UTEX B 2014] as query identified related CylC homologs within the publicly available cyanobacterial genomes and in the genomes of LEGEcc strains. We annotated the genomic context for each CylC homolog using antiSMASH v5.0 [44] and manual annotation through BLASTp of selected proteins. Some BGCs were not identified by antiSMASH and were manually annotated using BLASTp searches.

Phylogenetic analysis

Nucleotide sequences of cylC homologs, obtained from the NCBI and from genome sequencing in this study, were aligned using MUSCLE from within the Geneious R11.0 software package (Biomatters). The nucleotide sequence of the distantly related dimetal-carboxylate protein AurF [27] from Streptomyces thioluteus (AJ575648.1:4858-5868) was used as an outgroup. The alignments, trimmed to their core 788, 673, 506, 299 and 499 positions (for groups A, B, C, D and X, respectively), were used for phylogenetic analysis, which was performed using FastTree 2 (from within Geneious) with a GTR substitution model (selected with jModelTest [45]) and a rate of 100 (Fig. S2).

For the phylogenetic analysis based on the 16S rRNA gene (Fig. 3, Fig. S3), the corresponding nucleotide sequences were retrieved from the NCBI (from publicly available genomes until March 16, 2020) or from sequence data (amplicon or genome) obtained in this study. The sequences were aligned as detailed for cylC homologs and trimmed to the core shared positions (663).
A phylogenetic tree was inferred using RAxML-HPC2 on XSEDE (8.2.12), with maximum likelihood/rapid bootstrapping and 1000 bootstrap iterations, in the CIPRES platform [46].

The amino acid sequences of CylC homologs were aligned using MUSCLE from within the Geneious software package (Biomatters). The alignments were trimmed to their core 333 residues and used for phylogenetic analysis, which was performed using RAxML-HPC2 on XSEDE (8.2.12), with maximum likelihood/rapid bootstrapping and 1000 bootstrap iterations, in the CIPRES platform [46] (Fig. 4c).

CORASON analysis

CORASON, a bioinformatic tool that computes multi-locus phylogenies of BGCs within and across gene cluster families [47], was used to analyze cyanobacterial genomes collected from the NCBI and the LEGEcc genomes (Table S2). In total, 2059 cyanobacterial genomes recovered from NCBI and 56 additional LEGE genomes were used in the analysis. The amino acid sequences of CurA (AAT70096.1), WelO5 (AHI58816.1), McnD (CCI20780.1), Bmp5 (WP_008184789.1), PrnA (WP_044451271.1) and CylC (ARU81117.1) were used as queries and, for each enzyme, a reference genome was selected (Table S2). To increase the phylogenetic resolution, selected genomes were removed from the analysis of enzymes CylC, PrnA, CurA, McnD and Bmp5 (Table S2). Additionally, for the CylC analysis, a few BGCs were manually extracted and included in the analysis (Table S2), since they were not detected by CORASON.
Prevalence of halogenases in cyanobacterial genomes

Representative proteins of each class were used as queries in each search: CylC (ARU81117.1), BrtJ (AKV71855.1), “Mic” (WP_002752271.1; the halogenase in the putative microginin gene cluster), ColD (AKQ09581.1), ColE (AKQ09582.1), NocO (AKL71648.1) and NocN (AKL71647.1) for dimetal-carboxylate halogenases; PrnA (WP_044451271.1), Bmp5 (WP_008184789.1) and McnD (CCI20780.1) for flavin-dependent halogenases; and the halogenase domains from CurA (AAT70096.1) and the halogenases BarB1 (AAN32975.1), HctB (AAY42394.1), WelO5 (AHI58816.1) and AmbO5 (AKP23998.1) for nonheme iron-dependent halogenases. Non-redundant sequences obtained in these searches using a 1×10⁻²⁰ e-value cutoff, which corresponds to a percent identity between the query and target protein greater than 30%, were considered to share the same function as the query.

Results and Discussion

CylC-like halogenases are mostly found in cyanobacteria

To investigate the distribution of CylC homologs encoded in microbial genomes, we first searched the reference protein (RefSeq) and non-redundant protein sequences (nr) databases (NCBI) for homologs of CylC or BrtJ, using the Basic Local Alignment Search Tool, BLASTp (minimum 25% identity, E-value ≤ 9.9×10⁻²⁰ and minimum 50% coverage).
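The three BLASTp criteria just stated can be read as a simple per-hit filter followed by de-duplication. The sketch below is one way to express that, under stated assumptions: the hit records are invented, the `Hit` fields are hypothetical names rather than a real BLAST output schema, and treating each threshold as inclusive is an assumption (the text does not say whether boundary values pass).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    subject: str           # hypothetical subject identifier
    pct_identity: float    # percent identity of the alignment
    evalue: float          # BLAST expectation value
    query_coverage: float  # percent of the query covered by the alignment


def passes(hit, min_identity=25.0, max_evalue=9.9e-20, min_coverage=50.0):
    """Apply the identity, e-value and coverage thresholds (assumed inclusive)."""
    return (
        hit.pct_identity >= min_identity
        and hit.evalue <= max_evalue
        and hit.query_coverage >= min_coverage
    )


# Invented example hits, for illustration only.
hits = [
    Hit("hypothetical_1", 62.0, 1e-120, 98.0),
    Hit("hypothetical_2", 31.0, 5e-18, 90.0),   # e-value too high
    Hit("hypothetical_3", 24.0, 1e-25, 80.0),   # identity too low
    Hit("hypothetical_4", 40.0, 1e-30, 45.0),   # coverage too low
    Hit("hypothetical_1", 62.0, 1e-120, 98.0),  # duplicate subject
]

# Keep each passing subject once, mirroring the count of unique sequences.
unique_passing = sorted({h.subject for h in hits if passes(h)})
```

Counting unique subjects rather than raw hits matters when the same protein is retrieved by more than one query, as happens here with the CylC and BrtJ searches.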
A total of 128 and 246 unique homologous protein sequences were retrieved using the RefSeq and nr databases, respectively; in both cases, sequences were primarily from cyanobacteria (96 and 88%, respectively) (Fig. 2a). We then used the Enzyme Similarity Tool of the Enzyme Function Initiative (EFI-EST) [34] to evaluate the sequence landscape of dimetal-carboxylate halogenases. Using CylC as query, we obtained an SSN (sequence similarity network) composed of 154 sequences retrieved from the UniProt database [48] (Fig. 2b). The SSN featured two major clusters, one containing homologs from diverse cyanobacterial genera, the other composed of homologs from several cyanobacteria, with a few from proteobacteria (mostly deltaproteobacteria) and two from the cyanobacterial sister phylum Melainabacteria. A third SSN cluster was composed only of the previously reported BrtJ enzymes and, finally, a homolog from the cyanobacterial genus Hormoscilla remained unclustered. We were unable to recover any SSN that included clusters containing other characterized enzyme functions, which attests to the uniqueness of the dimetal-carboxylate halogenases in the current protein-sequence landscape.

Figure 2. Abundance of CylC homologs in bacteria. a) BLASTp using CylC (GenBank accession no: ARU81117) as query against different databases shows that these dimetal-carboxylate enzymes are found almost exclusively in cyanobacteria.
b) Sequence Similarity Network (SSN) of CylC depicting the similarity-based clustering of UniProt-derived protein sequences with homology (BLAST e-value cutoff 1×10⁻², edge e-value cutoff 1×10⁻⁴⁰) to CylC (GenBank accession no: ARU81117). In each node, the bacterial genus for the corresponding UniProt entry is shown (NA: not attributed).

CylC homologs are widely distributed throughout the phylum Cyanobacteria

With the intent of accessing a wide diversity of CylC homolog sequences, we used a degenerate-primer PCR strategy to discover additional homologs in cyanobacteria from the LEGEcc culture collection [49], because the phylum Cyanobacteria is diverse and still underrepresented in terms of genome data [50-55]. The LEGEcc culture collection maintains cultures isolated from diverse freshwater and marine environments, mostly in Portugal, and, for example, contains all known bartoloside-producing strains [30]. Primers were designed based on 54 nucleotide sequences retrieved from the NCBI that were selected to represent the phylogenetic diversity of CylC homologs (Fig. S1). Due to the lack of highly conserved nucleotide sequences among all homologs considered, we divided the nucleotide alignment into five groups and designed a degenerate primer pair for each. Upon screening 326 strains from LEGEcc using the five primer pairs, we retrieved 89 sequences encoding CylC homologs, confirmed through cloning and Sanger sequencing of the obtained amplicons.
We were unable to directly analyze the diversity of the entire set of LEGEcc-derived cylC amplicons due to low overlap between sequences obtained with different primers. As such, we performed a phylogenetic analysis of the diversity retrieved with each primer pair (Fig. S2), by aligning the PCR-derived sequences with a set of diverse cylC genes retrieved from the NCBI. For some strains, our PCR screen retrieved more than one homolog using different primer pairs (e.g. Nostoc sp. LEGE 12451 or Planktothrix mougeotii LEGE 07231). In general, and for each primer pair, the PCR screen retrieved mostly sequences that were closely related and associated with one or two phylogenetic clades. This can likely be explained by the geographical bias that might exist in the LEGEcc culture collection [49] and/or by primer design and PCR efficiency issues, which might have favored certain phylogenetic clades.

To access full-length sequences of the CylC homologs identified among LEGEcc strains, as well as their genomic context, we undertook a genome-sequencing effort informed by our PCR screen. We selected 21 strains for genome sequencing, which represent the diversity of CylC homologs observed in the different PCR screening groups. The resulting genome data were used to generate a local BLAST database and the homologs were located within the genomes. In some cases, additional homologs that were not detected in the PCR screen were identified.
Overall, 33 full-length genes encoding CylC homologs were retrieved from LEGEcc strains.

To explore the phylogenetic distribution of CylC homologs encoded in publicly available reference genomes and the herein sequenced LEGEcc genomes, we aligned the 16S rRNA genes from 648 strains with RefSeq genomes and the LEGEcc strains that were screened by PCR in this study. Using this dataset, we performed a phylogenetic analysis which indicated that CylC homologs are broadly distributed through five cyanobacterial orders: Nostocales, Oscillatoriales, Chroococcales, Synechococcales and Pleurocapsales (Fig. 3, Fig. S3). It is noteworthy that the cyanobacterial orders for which we did not find CylC homologs (Chroococcidiopsidales, Spirulinales, Gloeomargaritales and Gloeobacterales) are poorly represented in our dataset (Fig. 3, Fig. S3). However, our previous BLASTp search against the nr database did retrieve two close homologs in two Chroococcidiopsidales strains (genera Aliterella and Chroococcidiopsis) and a more distant homolog in a Gloeobacter strain (Gloeobacterales) (Table S3). Given the wide but punctuated presence of CylC homologs among the cyanobacterial diversity considered in this study, it is unclear how much of the current CylC homolog distribution reflects vertical inheritance or horizontal gene transfer events.

Figure 3.
RAxML cladogram of the 16S rRNA gene of LEGEcc strains (grey squares) and from cyanobacterial strains with NCBI-deposited reference genomes, screened in this study. Taxonomy is presented at the order level (colored rectangles). Strains whose genomes encode CylC homologs are denoted by black squares. Green squares indicate that at least one homolog was detected by PCR-screening and verified by retrieving the sequence of the corresponding amplicon by cloning followed by Sanger sequencing. Gloeobacter violaceus PCC 7421 served as an outgroup. A version of this cladogram including the bootstrap values for 1000 replications is provided as Supplementary Material.

[Figure 3 image: cladogram with strain names as tip labels. Legend: colored ranges denote the orders Nostocales, Oscillatoriales, Chroococcales, Synechococcales, Pleurocapsales, Chroococcidiopsidales, Spirulinales, Gloeomargaritales and Gloeobacterales; markers denote CylC homologs, CylC homologs identified by screen, and LEGEcc strains.]

Diversity of BGCs encoding CylC homologs

To characterize the biosynthetic diversity of BGCs encoding CylC homologs, which were found in 78 cyanobacterial genomes (21 from LEGEcc and 57 from RefSeq) from different orders, we first submitted these genome sequences for antiSMASH [44] analysis. 55 CylC-encoding BGCs were detected, which were classified as resorcinol, NRPS, PKS, or hybrid NRPS-PKS. Given the number of CylC homolog-encoding genes detected in these genomes (105), we considered that several BGCs might have not been identified with antiSMASH. Therefore, we performed manual annotation of the genomic contexts of the CylC homologs and were able to identify 20 additional BGCs.
Upon analysis of the entire set of CylC-encoding BGCs, we classified the BGCs into seven major categories, based on their overall architecture, which we designated as follows (listed in decreasing abundance): Rieske-containing (n = 36), type I PKS (chlorosphaerolactylate/columbamide/microginin/puwainaphycin-like, n = 29), type III PKS (n = 13), dialkylresorcinol (n = 8), PriA-containing (n = 5), nitronate monooxygenase-containing (n = 3) and cytochrome P450/sulfotransferase-containing (n = 1) (Fig. 4a, Figs. S4-S10). Three BGCs were excluded from our classification since they were only partially sequenced (Fig. S11). Examples of each of the cluster architectures are presented in Fig. 4a and schematic representations of each of the 98 classified BGCs are presented in Supplementary Figures S4-S10. It should be stressed that within several of these seven major categories, there is still considerable BGC architecture diversity, notably within the dialkylresorcinol, type I and type III PKS BGCs. Rieske-containing BGCs are not associated with any known NP and encode between two and four proteins with Rieske domains. Most contain a sterol desaturase family protein, feature a single CylC homolog and are chiefly found among Nostocales and Oscillatoriales (Fig. S4). PriA-containing BGCs encode, apart from the primosomal protein N' (PriA), a set of additional proteins (diguanylate cyclase/phosphodiesterase, aromatic ring-hydroxylating dioxygenase subunit alpha and a ferritin-like protein) and were only detected in Synechocystis spp. (Fig. S5). These are similar to the Rieske-containing BGCs; however, in strains harboring PriA-containing BGCs, the additional functionalities that are found in the Rieske-containing BGCs can be found dispersed throughout the genome (Table S4). In our dataset, a single sulfotransferase/P450-containing BGC was detected in Stanieria sp.
and was unrelated to the above-mentioned architectures (Fig. S6). Type I PKS BGCs are similar to those of the chlorosphaerolactylates, columbamides, microginins and puwainaphycins and typically feature a fatty acyl-AMP ligase (FAAL) and an acyl carrier protein upstream of one or two CylC homologs and a type I PKS downstream of the CylC homolog(s). These were found in Nostocales and Oscillatoriales strains (Fig. S7). Considering the known NP structures associated with these BGCs [29, 56, 57], we expect that the encoded metabolites feature fatty acids halogenated at terminal or mid-chain positions. BGCs of the dialkylresorcinol type, which contain DarA and DarB homologs (Bode 2013, Leão 2015), including several bartoloside-like clusters (found only in LEGEcc strains), were detected in Nostocales, Pleurocapsales and Chroococcales (Fig. S8). Type III PKS BGCs encoding CylC homologs, which include a variety of cyclophane BGCs, were detected in the Nostocales, Oscillatoriales and Pleurocapsales (Fig. S9). Finally, nitronate monooxygenase-containing BGCs, which are not associated with any known NP, were only found in Nostocales strains from the LEGEcc and also featured genes encoding PKSI, ferredoxin, ACP or glycosyl transferase (Fig. S10).

A less BGC-centric perspective of the genomic context of CylC homologs could be obtained through the Genome Neighborhood Tool of the EFI (EFI-GNT, [58]).
Using the previously generated SSN as input, we analyzed the resulting Genome Neighborhood Diagrams (Fig. 4b), which indicated that the three SSN clusters had entirely different genomic contexts (herein defined as the 10 genes upstream and the 10 genes downstream of the cylC homolog). In the SSN cluster that encompasses CylC and its closest homologs, these enzymes associate most often with PP-binding (ACP/PCP) and AMP-binding (such as FAAL) proteins. In the SSN cluster that includes both cyanobacterial and non-cyanobacterial CylC homologs, the genomic contexts most prominently feature Rieske/[2Fe-2S] cluster proteins as well as fatty acid hydroxylase family enzymes. The cyanobacterial homologs in this cluster are exclusively encoded in the Rieske- and PriA-containing BGCs. Homologs from this particular SSN cluster may not require a phosphopantetheine-tethered substrate, as no substrate activation or carrier proteins/domains were found in their genomic neighborhoods, or may act on central fatty acid metabolism intermediates. The BrtJ SSN cluster, composed only of the two reported BrtJ enzymes, shows entirely different surrounding genes, which correspond to the brt genes. Also noteworthy is the considerable number of proteins of unknown function found in the vicinity of dimetal-carboxylate halogenases, suggesting that uncharted biochemistry is associated with these enzymes.
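To make the definition of genomic context used above concrete (10 genes upstream and 10 genes downstream of the cylC homolog), the following is a minimal Python sketch, not the EFI-GNT implementation. It assumes each contig is available as an ordered list of genes carrying Pfam annotations; the gene records, the function names `neighborhood` and `pfam_prevalence`, and the toy contig are hypothetical and serve only for illustration. Strand orientation and circular replicons are ignored.

```python
from collections import Counter

WINDOW = 10  # genes on each side of the cylC homolog, as defined in the text


def neighborhood(genes, anchor_index, window=WINDOW):
    """Return up to `window` genes upstream and downstream of the anchor gene.

    The anchor itself is excluded; the window is truncated at contig ends.
    """
    start = max(0, anchor_index - window)
    upstream = genes[start:anchor_index]
    downstream = genes[anchor_index + 1:anchor_index + 1 + window]
    return upstream + downstream


def pfam_prevalence(neighborhoods):
    """Fraction of neighborhoods in which each Pfam family occurs at least once."""
    counts = Counter()
    for genes in neighborhoods:
        # Count each family once per neighborhood, however many genes carry it.
        families = {fam for gene in genes for fam in gene["pfam"]}
        counts.update(families)
    total = len(neighborhoods)
    return {fam: n / total for fam, n in counts.items()}


# Hypothetical toy contig: ordered genes with illustrative Pfam annotations.
contig = [
    {"id": "g1", "pfam": ["AMP-binding"]},
    {"id": "g2", "pfam": ["PP-binding"]},
    {"id": "g3", "pfam": []},  # stands in for the cylC homolog (anchor)
    {"id": "g4", "pfam": ["ACP_syn_III_C"]},
]
context = neighborhood(contig, anchor_index=2)
prevalence = pfam_prevalence([context])
```

In a sketch like this, the per-family prevalence computed over all neighborhoods of one SSN cluster would play the role of the node size in a Genome Neighborhood Diagram.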
Since SSN analysis generated only three clusters of CylC homologs, we next investigated the genetic relatedness among these enzymes and how it correlates to BGC architecture. We performed a phylogenetic analysis of the CylC homologs from the 98 classified and 3 unclassified BGCs (Fig. 4c). Our analysis indicated that homologs from PriA-containing and Rieske-containing BGCs formed a well-supported clade. Its sister clade contained homologs from the remaining BGCs. Within this larger clade, homologs associated with the type I PKS, dialkylresorcinol or type III PKS BGCs were found to be polyphyletic. In some cases, the same BGC contained distantly related CylC homologs (e.g. Hyella patelloides LEGE 07179, Anabaena cylindrica PCC 7122) (Fig. 4c). This analysis also revealed that several strains (Fig. 5c) encode two or three phylogenetically distant CylC homologs in different BGCs. Overall, our data show that CylC homologs have evolved to interact with different partner enzymes to generate chemical diversity, but that their phylogeny is, in some cases, not entirely consistent with BGC architecture. These observations suggest that functionally convergent associations between CylC homologs and other proteins have emerged multiple times during evolution. Examples include the CylC/CylK and BrtJ/BrtB associations, which use cryptic halogenation to achieve C-C and C-O bond formation, respectively [27, 59]. However, the role of the CylC homolog-mediated halogenation of fatty acyl moieties observed for other cyanobacterial metabolites is not currently understood.
Interestingly, while a number of CylC homologs, including those that are part of characterized BGCs, likely act on ACP-tethered fatty acyl substrates [27, 59], those from the PriA-, Rieske- and cytochrome P450/sulfotransferase categories do not have a neighboring carrier protein and therefore might not require a tethered substrate. This would be an important property for a CylC-like biocatalyst [15].

Figure 4. Diversity and genomic context of BGCs encoding CylC-like enzymes. a) Examples of the different BGC architectures found among the clusters encoding CylC homologs. b) Genome Neighborhood Diagram (GND) depicting the Pfam domains associated with each cluster from the initial SSN of CylC homologs.
The size of each node is proportional to the prevalence of the Pfam domain within the genomic context of the CylC homologs from each SSN cluster. c) RAxML cladogram (1000 replicates, shown are bootstrap values > 70%) of CylC homologs. The different colors represent a categorization based on common genes found within the associated biosynthetic gene clusters (see legend). Circles of the same color depict CylC homologs encoded by the same BGC. AurF (Streptomyces thioluteus HKI-22) was used as an outgroup.

[Figure 4 graphic not reproduced; the recoverable panel labels are listed below.]

Panel a, example BGCs (architecture, strain, product):
PriA-containing (Synechocystis sp. PCC 6803): unknown product
Rieske-containing (Calothrix brevissima NIES-22): unknown product
type III PKS (Cylindrospermum licheniforme UTEX B 2014): cylindrocyclophanes
dialkylresorcinol (Synechocystis salina LEGE 06099): bartolosides
type I PKS (chlorosphaerolactylates/columbamides/microginin/puwainaphycin-like) (Moorea bouillonii PNG05-198): columbamides
nitronate monooxygenase-containing (Nostoc sp. LEGE 12447): unknown product
sulfotransferase/P450-containing (Stanieria sp. NIES-3757): unknown product

Panel b, SSN clusters: cluster 1 (n = 73), cluster 2 (n = 67), cluster 3 (BrtJ) (n = 2).

Panel b, Pfam legend (Pfam: description):
ABC_tran: ABC transporter
ACP_syn_III_C: 3-Oxoacyl-[acyl-carrier-protein (ACP)] synthase III C terminal
AMP-binding: AMP-binding enzyme
Beta_helix: Right handed beta helix region
Biotin_lipoyl_2-HlyD_D23: Biotin-lipoyl like-Barrel-sandwich domain of CusB or HlyD membrane-fusion
DUF5122: Domain of unknown function (DUF5122) beta-propeller
DUF559: Protein of unknown function (DUF559)
DUF962: Protein of unknown function (DUF962)
FA_hydroxylase: Fatty acid hydroxylase superfamily
Fer2: 2Fe-2S iron-sulfur cluster binding domain
FtsX: FtsX-like permease family
GH3: GH3 auxin-responsive promoter
Glycos_transf_1: Glycosyl transferases group 1
Hexapep: Bacterial transferase hexapeptide (six repeats)
HlyD_D23: Barrel-sandwich domain of CusB or HlyD membrane-fusion
PP-binding: Phosphopantetheine attachment site
Rieske: Rieske [2Fe-2S] domain
SBBP: Beta-propeller repeat
TubC_N: TubC N-terminal docking domain
UDPGT: UDP-glucoronosyl and UDP-glucosyl transferase

CylC enzymes and other cyanobacterial halogenases
We sought to understand how CylC-type halogenases compare to other halogenating enzyme classes found in cyanobacteria in terms of prevalence and association with BGCs. To this end, we carried out a CORASON [47] analysis of publicly available cyanobacterial genomes (including non-reference genomes) and the genome data acquired herein from LEGEcc strains (a total of 2,115 cyanobacterial genomes).
We used different cyanobacterial halogenases as input, namely CylC, McnD, PrnA, Bmp5 and the 2OG-Fe(II) oxygenase domains from CurA and BarB1. CORASON attempts to retrieve genome context by exploring gene cluster diversity linked to enzyme phylogenies [47]. The CORASON analysis retrieved 117 (5.6%) dimetal-carboxylate halogenases, 61 (2.9%) nonheme iron-dependent halogenases and 226 (10.7%) flavin-dependent halogenases from the cyanobacterial genomes (Fig. 5a). Using the protein homologs detected in BGCs by CORASON, a sequence alignment was performed for each of the dimetal-carboxylate, nonheme iron/2OG-dependent and flavin-dependent halogenase classes. For nonheme iron/2OG-dependent halogenases, we excised the halogenase domain from multi-domain enzyme sequences. After removing repeated sequences and trimming the alignments to their core shared positions, maximum-likelihood phylogenetic trees were constructed for each halogenase class and BGCs were annotated manually (Figs. S12-S14). Flavin-dependent halogenases were commonly associated with cyanopeptolin, 2,4-dibromophenol and pyrrolnitrin BGCs and with orphan BGCs of distinct architectures (Fig. S12). Regarding nonheme iron/2OG-dependent halogenases, we identified barbamide, curacin, hectochlorin and terpene/indole [60] BGCs and several distinct orphan BGCs (Fig. S13). For dimetal-carboxylate halogenases, columbamide, microginin, chlorosphaerolactylate, bartoloside and cyclophane BGCs were identified (Fig. S14). However, while some of the CylC homolog-encoding orphan BGCs previously identified by antiSMASH and
manual searches were detected by CORASON, the Rieske- and the PriA-containing BGCs were not. Hence, several CylC homologs were not accounted for in this analysis. For the same reasons, the other two halogenase types could also be missing some of their members in the CORASON-derived datasets. To circumvent this limitation and obtain a more comprehensive picture of the abundance of the three types of halogenase in cyanobacterial genomes, we used BLASTp searches against available cyanobacterial genomes in the NCBI database (including non-reference genomes). Several representatives of each halogenase class were used as queries in each search (CylC, BrtJ, “Mic” (the halogenase in the putative microginin gene cluster), ColD, ColE, NocO and NocN for dimetal-carboxylate halogenases; PrnA, Bmp5 and McnD for flavin-dependent halogenases; the halogenase domain from CurA and the halogenases BarB1, HctB, WelO5 and AmbO5 for nonheme iron-dependent halogenases). Non-redundant sequences obtained from these searches using a 1 × 10^-20 e-value cutoff (corresponding to >30% sequence identity) were considered to share the same function as the query. It is worth mentioning that, for nonheme iron/2OG-dependent enzymes, a single amino acid difference can convert hydroxylation activity into halogenation [61], so it is possible that, at least for this class, the sequence space considered does not correspond exclusively to halogenation activity. Dimetal-carboxylate and flavin-dependent halogenase homologs were found to be the most abundant in cyanobacteria, each with roughly 0.2 homologs per genome, while nonheme iron/2OG-dependent halogenase homologs are less common (~0.05 per genome) (Fig. 5b).
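The filtering step described above (pooling BLASTp hits from several queries of one halogenase class, applying the 1 × 10^-20 e-value cutoff and keeping non-redundant sequences) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used in the study: it assumes hits are exported in BLAST+ tabular format (-outfmt 6, default 12 columns, subject ID in column 2 and e-value in column 11), treats the cutoff as inclusive, and approximates "non-redundant" by unique subject IDs. The function names, accession IDs and genome count are invented for the example.

```python
import csv
import io

EVALUE_CUTOFF = 1e-20  # cutoff stated in the text; treated here as inclusive


def significant_subjects(tabular_text, cutoff=EVALUE_CUTOFF):
    """Collect unique subject IDs whose e-value is at or below the cutoff.

    Assumes BLAST+ tabular output (-outfmt 6) with the default 12 columns:
    column 2 is the subject ID and column 11 is the e-value.
    """
    subjects = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) <= cutoff:
            subjects.add(row[1])  # a set removes hits shared between queries
    return subjects


def homologs_per_genome(subjects, n_genomes):
    """Average number of distinct homologs per searched genome."""
    return len(subjects) / n_genomes


# Hypothetical hits pooled from two queries of the same halogenase class.
hits = "\n".join([
    "CylC\tWP_000001.1\t45.2\t400\t210\t5\t1\t398\t3\t401\t3e-85\t270",
    "BrtJ\tWP_000001.1\t41.0\t390\t220\t6\t2\t390\t5\t395\t1e-60\t200",
    "CylC\tWP_000002.1\t28.0\t150\t100\t4\t10\t160\t20\t170\t2e-05\t45",
])
subjects = significant_subjects(hits)
rate = homologs_per_genome(subjects, n_genomes=5)
```

Pooling the hits before deduplication matters because the same protein is typically retrieved by several queries of a class; counting it once per class avoids inflating the homologs-per-genome figure.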
Overall, our analyses indicate that homologs of each of the three halogenase classes are associated with a large number of orphan BGCs and represent opportunities for NP discovery. Particularly noteworthy, CylC-like enzymes are clearly a major group of halogenases in cyanobacteria, despite having been the latest to be discovered [27].

Figure 5. Prevalence of cyanobacterial halogenases. Frequency of halogenases in Cyanobacteria from CORASON analysis (A) and NCBI BLASTp analysis (B). (A) Dimetal-carboxylate halogenases: CylC - NCBI reference genomes, n = 2054 and LEGEcc genomes, n = 41 CylC-containing BGCs and 56 genomes; Flavin-dependent halogenases: PrnA - NCBI reference genomes, n = 2051 and LEGEcc genomes, n = 56 genomes; Bmp5 - NCBI reference genomes, n = 2050 and LEGEcc genomes, n = 56 genomes; McnD - NCBI reference genomes, n = 2052 and LEGEcc genomes, n = 54 genomes; Nonheme iron/2OG-dependent halogenases: halogenase domain from CurA - NCBI reference genomes, n = 2052 and LEGEcc genomes, n = 56 genomes.
(B) Average of the total number of homologs for dimetal-carboxylate halogenases (CylC, BrtJ, “Mic”, ColD, ColE, NocO, NocN), flavin-dependent halogenases (tryptophan 7-halogenase PrnA, Bmp5 and McnD) and nonheme iron/2OG-dependent halogenases (BarB1, HctB, WelO5, AmbO5 and the halogenase domain from CurA).

[Figure 5 graphic not reproduced. Panel a) y-axis: % of halogenases (CORASON); panel b) y-axis: number of homologs (BLAST); x-axis categories in both panels: Dimetal, Non-heme iron, Flavin-dependent.]

Conclusion
The discovery of a new biosynthetic enzyme class brings with it tremendous possibilities for biochemistry and catalysis research, both fundamental and applied. The functional characterization of such enzymes can also be used as a handle to identify and deorphanize BGCs that encode their homologs. CylC typifies an unprecedented halogenase class, which is almost exclusively found in cyanobacteria. By searching for CylC homologs in both public databases and our in-house culture collection, we report here more than 100 new cyanobacterial CylC homologs. We found that dimetal-carboxylate halogenases are widely distributed throughout the phylum. The genomic neighborhoods of these halogenases are diverse and we identify a number of different BGC architectures associated with either one or two CylC homologs that can serve as starting points for the discovery of new NP scaffolds.
In addition, the diversity and biosynthetic contexts of these enzymes reported herein will serve as a roadmap to further explore their biocatalysis-relevant activities. Finally, bartoloside-like BGCs and a CylC-associated BGC architecture (nitronate monooxygenase-containing) were found only in the LEGEcc, reinforcing the importance of geographically focused strain isolation and maintenance efforts for the Cyanobacteria phylum.

Conflicts of Interest
The authors declare that there are no conflicts of interest.

Funding information
This work was funded by Fundação para a Ciência e a Tecnologia (FCT) through grant PTDC/BIA-BQM/29710/2017 to PNL and through strategic funding UID/Multi/04423/2013 and by the National Science Foundation (NSF) through grant CAREER-1454007 to EPB. AR and RCB are supported by doctoral grants from FCT (SFRH/BD/140567/2018 and SFRH/BD/136367/2018, respectively). This material is based upon work supported by an NSF Postdoctoral Research Fellowship in Biology (Grant No 1907240 to NRG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.

Acknowledgments
We thank Hitomi Nakamura, Samantha Cassell, Diana Sousa and João Reis for technical assistance during this study, and the Blue Biotechnology and Ecotoxicology Culture Collection (LEGEcc) for the genomic DNA used for the PCR screening.
References
1. Pham JV, Yilma MA, Feliz A, Majid MT, Maffetone N et al. A Review of the Microbial Production of Bioactive Natural Products and Biologics. Front Microbiol 2019;10:1404.
2. Noda-Garcia L, Tawfik DS. Enzyme evolution in natural products biosynthesis: target- or diversity-oriented? Curr Opin Chem Biol 2020;59:147-154.
3. Giani AM, Gallo GR, Gianfranceschi L, Formenti G. Long walk to genomics: History and current approaches to genome sequencing and assembly. Comput Struct Biotechnol J 2020;18:9-19.
4. Zhang MM, Qiao Y, Ang EL, Zhao H. Using natural products for drug discovery: the impact of the genomics era. Expert Opin Drug Discov 2017;12(5):475-487.
5. Gkotsi DS, Dhaliwal J, McLachlan MMW, Mulholand KR, Goss RJM. Halogenases: powerful tools for biocatalysis (mechanisms applications and scope). Curr Opin Chem Biol 2018;43:119-126.
6. Agarwal V, Miles ZD, Winter JM, Eustáquio AS, El Gamal AA et al. Enzymatic Halogenation and Dehalogenation Reactions: Pervasive and Mechanistically Diverse. Chem Rev 2017;117(8):5619-5674.
7. Weichold V, Milbredt D, van Pée K-H. Specific Enzymatic Halogenation: From the Discovery of Halogenated Enzymes to Their Applications In Vitro and In Vivo. Angew Chem Int Ed 2016;55(22):6374-6389.
8. Schnepel C, Sewald N. Enzymatic Halogenation: A Timely Strategy for Regioselective C−H Activation. Chem Eur J 2017;23(50):12064-12086.
9. Petrone DA, Ye J, Lautens M. Modern Transition-Metal-Catalyzed Carbon–Halogen Bond Formation. Chem Rev 2016;116(14):8003-8104.
10. Jeschke P. The unique role of halogen substituents in the design of modern agrochemicals. Pest Manag Sci 2010;66(1):10-27.
11. Xu Z, Yang Z, Liu Y, Lu Y, Chen K et al. Halogen Bond: Its Role beyond Drug–Target Binding Affinity for Drug Discovery and Development. J Chem Inf Model 2014;54(1):69-78.
12. Hillwig ML, Zhu Q, Ittiamornkul K, Liu X. Discovery of a Promiscuous Non-Heme Iron Halogenase in Ambiguine Alkaloid Biogenesis: Implication for an Evolvable Enzyme Family for Late-Stage Halogenation of Aliphatic Carbons in Small Molecules. Angew Chem Int Ed 2016;55(19):5780-5784.
13. Liu X. In Vitro Analysis of Cyanobacterial Nonheme Iron-Dependent Aliphatic Halogenases WelO5 and AmbO5. Methods Enzymol 2018;604:389-404.
14. Pratter SM, Ivkovic J, Birner-Gruenberger R, Breinbauer R, Zangger K et al. More than just a halogenase: modification of fatty acyl moieties by a trifunctional metal enzyme. Chembiochem 2014;15(4):567-574.
15. Hillwig ML, Liu X. A new family of iron-dependent halogenases acts on freestanding substrates. Nat Chem Biol 2014;10(11):921-923.
16. Chang Z, Flatt P, Gerwick WH, Nguyen VA, Willis CL et al. The barbamide biosynthetic gene cluster: a novel marine cyanobacterial system of mixed polyketide synthase (PKS)-non-ribosomal peptide synthetase (NRPS) origin involving an unusual trichloroleucyl starter unit. Gene 2002;296(1-2):235-247.
17. Flatt PM, O'Connell SJ, McPhail KL, Zeller G, Willis CL et al. Characterization of the Initial Enzymatic Steps of Barbamide Biosynthesis. J Nat Prod 2006;69(6):938-944.
18. Galonić DP, Vaillancourt FH, Walsh CT. Halogenation of unactivated carbon centers in natural product biosynthesis: trichlorination of leucine during barbamide biosynthesis. J Am Chem Soc 2006;128(12):3900-3901.
19. Chang Z, Sitachitta N, Rossi JV, Roberts MA, Flatt PM et al. Biosynthetic pathway and gene cluster analysis of curacin A, an antitubulin natural product from the tropical marine cyanobacterium Lyngbya majuscula. J Nat Prod 2004;67(8):1356-1367.
20. Edwards DJ, Marquez BL, Nogle LM, McPhail K, Goeger DE et al. Structure and Biosynthesis of the Jamaicamides, New Mixed Polyketide-Peptide Neurotoxins from the Marine Cyanobacterium Lyngbya majuscula. Chem Biol 2004;11(6):817-833.
21. Ramaswamy AV, Sorrels CM, Gerwick WH. Cloning and biochemical characterization of the hectochlorin biosynthetic gene cluster from the marine cyanobacterium Lyngbya majuscula. J Nat Prod 2007;70(12):1977-1986.
22. Kocher S, Resch S, Kessenbrock T, Schrapp L, Ehrmann M et al. From dolastatin 13 to cyanopeptolins, micropeptins, and lyngbyastatins: the chemical biology of Ahp-cyclodepsipeptides. Nat Prod Rep 2020;37(2):163-174.
23. Rouhiainen L, Paulin L, Suomalainen S, Hyytiainen H, Buikema W et al. Genes encoding synthetases of cyclic depsipeptides, anabaenopeptilides, in Anabaena strain 90. Mol Microbiol 2000;37(1):156-167.
24. Cadel-Six S, Dauga C, Castets AM, Rippka R, Bouchier C et al. Halogenase genes in nonribosomal peptide synthetase gene clusters of Microcystis (cyanobacteria): sporadic distribution and evolution. Mol Biol Evol 2008;25(9):2031-2041.
25. Nishizawa T, Ueda A, Nakano T, Nishizawa A, Miura T et al. Characterization of the locus of genes encoding enzymes producing heptadepsipeptide micropeptin in the unicellular cyanobacterium Microcystis. J Biochem 2011;149(4):475-485.
26. Nakamura H, Hamer HA, Sirasani G, Balskus EP. Cylindrocyclophane Biosynthesis Involves Functionalization of an Unactivated Carbon Center. J Am Chem Soc 2012;134(45):18518-18521.
27. Nakamura H, Schultz EE, Balskus EP. A new strategy for aromatic ring alkylation in cylindrocyclophane biosynthesis. Nat Chem Biol 2017;13(8):916-921.
28. Vaillancourt FH, Yeh E, Vosburg DA, O'Connor SE, Walsh CT. Cryptic chlorination by a non-haem iron enzyme during cyclopropyl amino acid biosynthesis. Nature 2005;436(7054):1191-1194.
29. Kleigrewe K, Almaliti J, Tian IY, Kinnel RB, Korobeynikov A et al. Combining Mass Spectrometric Metabolic Profiling with Genomic Analysis: A Powerful Approach for Discovering Natural Products from Cyanobacteria. J Nat Prod 2015;78(7):1671-1682.
30. Leão PN, Nakamura H, Costa M, Pereira AR, Martins R et al. Biosynthesis-assisted structural elucidation of the bartolosides, chlorinated aromatic glycolipids from cyanobacteria. Angew Chem Int Ed 2015;54(38):11063-11067.
31. Mareš J, Hájek J, Urajová P, Kust A, Jokela J et al. Alternative Biosynthetic Starter Units Enhance the Structural Diversity of Cyanobacterial Lipopeptides. Appl Environ Microbiol 2019;85(4):e02675-18.
32. Abt K, Castelo-Branco R, Leao PNC. Biosynthesis of Chlorinated Lactylates in Sphaerospermopsis sp. LEGE 00249. Chemrxiv 2020. Preprint. https://doi.org/10.26434/chemrxiv.12885476.v2
33. Latham J, Brandenburger E, Shepherd SA, Menon BRK, Micklefield J. Development of Halogenase Enzymes for Use in Synthesis. Chem Rev 2018;118(1):232-269.
34. Zallot R, Oberg N, Gerlt JA. The EFI Web Resource for Genomic Enzymology Tools: Leveraging Protein, Genome, and Metagenome Databases to Discover Novel Enzymes and Metabolic Pathways. Biochemistry 2019;58(41):4169-4182.
35. Kotai J. Instructions for preparation of modified nutrient solution Z8 for algae. Norwegian Institute for Water Res 1972;11:5.
36. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792-1797.
37. Rippka R, Waterbury JB, Stanier RY. Isolation and Purification of Cyanobacteria: Some General Principles. In: Starr MP, Stolp H, Trüper HG, Balows A, Schlegel HG (editors). The Prokaryotes: A Handbook on Habitats, Isolation, and Identification of Bacteria. Berlin, Heidelberg: Springer Berlin Heidelberg; 1981. pp. 212-220.
38. Singh SP, Rastogi RP, Häder D-P, Sinha RP. An improved method for genomic DNA extraction from cyanobacteria. World J Microbiol Biotechnol 2011;27(5):1225-1230.
39. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 2014;15(3):R46.
40. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 2009;25(14):1754-1760.
41. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 2012;19(5):455-477.
42. Wu YW, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 2016;32(4):605-607.
43. Tatusova T, DiCuccio M, Badretdin A, Chetvernin V, Nawrocki EP et al. NCBI prokaryotic genome annotation pipeline. Nucleic Acids Res 2016;44(14):6614-6624.
44. Blin K, Shaw S, Steinke K, Villebro R, Ziemert N et al. antiSMASH 5.0: updates to the secondary metabolite genome mining pipeline. Nucleic Acids Res 2019;47(W1):W81-W87.
45. Posada D. jModelTest: Phylogenetic Model Averaging. Mol Biol Evol 2008;25(7):1253-1256.
46. Miller MA, Pfeiffer W, Schwartz T, editors. Creating the CIPRES Science Gateway for inference of large phylogenetic trees. 2010 Gateway Computing Environments Workshop (GCE); 14 Nov. 2010.
47. Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, Kautsar SA, Tryon JH et al. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 2020;16(1):60-68.
48. The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res 2016;45(D1):D158-D169.
49. Ramos V, Morais J, Castelo-Branco R, Pinheiro Â, Martins J et al. Cyanobacterial diversity held in microbial biological resource centers as a biotechnological asset: the case study of the newly established LEGE culture collection. J Appl Phycol 2018;30(3):1437-1451.
50. Dittmann E, Gugger M, Sivonen K, Fewer DP. Natural Product Biosynthetic Diversity and Comparative Genomics of the Cyanobacteria. Trends Microbiol 2015;23(10):642-652.
51. D'Agostino PM, Woodhouse JN, Makower AK, Yeung AC, Ongley SE et al. Advances in genomics, transcriptomics and proteomics of toxin-producing cyanobacteria. Environ Microbiol Rep 2016;8(1):3-13.
52. Calteau A, Fewer DP, Latifi A, Coursin T, Laurent T et al. Phylum-wide comparative genomics unravel the diversity of secondary metabolism in Cyanobacteria. BMC Genomics 2014;15(1):977.
53. Baran R, Ivanova NN, Jose N, Garcia-Pichel F, Kyrpides NC et al. Functional genomics of novel secondary metabolites from diverse cyanobacteria using untargeted metabolomics. Mar Drugs 2013;11(10):3617-3631.
54. Alvarenga DO, Fiore MF, Varani AM. A Metagenomic Approach to Cyanobacterial Genomics. Front Microbiol 2017;8:809.
55. Beck C, Knoop H, Axmann IM, Steuer R. The diversity of cyanobacterial metabolism: genome analysis of multiple phototrophic microorganisms. BMC Genomics 2012;13(1):56.
56. Okino T, Matsuda H, Murakami M, Yamaguchi K. Microginin, an angiotensin-converting enzyme inhibitor from the blue-green alga Microcystis aeruginosa. Tetrahedron Lett 1993;34(3):501-504.
57. Voráčová K, Hájek J, Mareš J, Urajová P, Kuzma M et al. The cyanobacterial metabolite nocuolin A is a natural oxadiazine that triggers apoptosis in human cancer cells. PLOS ONE 2017;12(3):e0172850.
58. Zallot R, Oberg NO, Gerlt JA. ‘Democratized’ genomic enzymology web tools for functional assignment. Curr Opin Chem Biol 2018;47:77-85.
59. Reis JPA, Figueiredo SAC, Sousa ML, Leão PN. BrtB is an O-alkylating enzyme that generates fatty acid-bartoloside esters. Nat Commun 2020;11(1):1458.
60. Liu Y, Klet RC, Hupp JT, Farha O. Probing the correlations between the defects in metal-organic frameworks and their catalytic activity by an epoxide ring-opening reaction. Chem Commun (Camb) 2016;52(50):7806-7809.
61. Mitchell AJ, Dunham NP, Bergman JA, Wang B, Zhu Q et al. Structure-Guided Reprogramming of a Hydroxylase To Halogenate Its Small Molecule Substrate. Biochemistry 2017;56(3):441-444.