key: cord-0306748-gw5krgk3 authors: Tan, Yongjun; Wang, Cindy; Schneider, Theresa; Li, Huan; de Souza, Robson Francisco; Tang, Xueming; Hsieh, Tzung-Fu; Wang, Xu; Li, Xu; Zhang, Dapeng title: Comparative phylogenomic analysis reveals evolutionary genomic changes and novel toxin families in endophytic Liberibacter pathogens date: 2021-06-03 journal: bioRxiv DOI: 10.1101/2021.06.02.446850 sha: 2455df93a1a3d7132112675a7bf1540c89464fd1 doc_id: 306748 cord_uid: gw5krgk3 Liberibacter pathogens are the causative agents of several severe crop diseases worldwide, including citrus Huanglongbing and potato Zebra Chip. These bacteria are endophytic and non-culturable, which makes experimental approaches challenging and highlights the need for bioinformatic analysis in advancing our understanding about Liberibacter pathogenesis. Here, we performed an in-depth comparative phylogenomic analysis of the Liberibacter pathogens and their free-living, nonpathogenic, ancestral species, aiming to identify the major genomic changes and determinants associated with their evolutionary transitions in living habitats and pathogenicity. We found that prophage loci represent the most variable regions among Liberibacter genomes. Using gene neighborhood analysis and phylogenetic classification, we systematically recovered, annotated, and classified all prophage loci into four types, including one previously unrecognized group. We showed that these prophages originated through independent gene transfers at different evolutionary stages of Liberibacter and only the SC-type prophage was associated with the emergence of the pathogens. Using ortholog clustering, we vigorously identified two additional sets of genomic genes, which were either lost or gained in the ancestor of the pathogens. Consistent with the habitat change, the lost genes were enriched for biosynthesis of cellular building blocks. Importantly, among the gained genes, we uncovered several previously unrecognized toxins, including a novel class of polymorphic toxins, a YdjM phospholipase toxin, and a secreted EEP protein. Our results substantially extend the knowledge on the evolutionary events and potential determinants leading to the emergence of endophytic, pathogenic Liberibacter species and will facilitate the design of functional experiments and the development of new detection and blockage methods of these pathogens. Importance Liberibacter pathogens are associated with several severe crop diseases, including citrus Huanglongbing, the most destructive disease to the citrus industry. Currently, no effective cure or treatments are available, and no resistant citrus variety has been found. The fact that these obligate endophytic pathogens are not culturable has made it extremely challenging to experimentally uncover from the whole genome the genes/proteins important to Liberibacter pathogenesis. Further, earlier bioinformatics studies failed to identify the key genomic determinants, such as toxins and effector proteins, that underlie the pathogenicity of the bacteria. In this study, an in-depth comparative genomic analysis of Liberibacter pathogens together with their ancestral non-pathogenic species identified the prophage loci and several novel toxins that are evolutionarily associated with the emergence of the pathogens. These results shed new lights on the disease mechanism of Liberibacter pathogens and will facilitate the development of new detection and blockage methods targeting the toxins. Introduction prophage loci (as highlighted on each genome) (Fig. 2) . This is very interesting, as prophages 166 are known to play a critical role in bacterial pathogenesis 33 unclear. Thus, we conducted an in-depth gene-neighborhood analysis to systematically extract 174 the potential prophage loci, identify their boundaries, and annotate their components. As a result, 175 we retrieved 36 prophage loci from the 12 available Liberibacter genomes (Fig. 3 , 176 Supplementary Table 2 and Data 1). We annotated their gene components by clustering the 177 protein products and dissecting their domains. We found that these prophage loci display a 178 highly variable gene composition. However, based on the shared gene components, we were 179 able to classify them into four major types, including LC1, LC2, SC, and UT (Fig. 3) . Specifically, 180 the LC1 type is only found in non-pathogenic L. crescens whereas LC2 is present in all 181 Liberibacter species. They are similar in terms of gene composition; however, LC2 contains 182 several unique genes such as LC_TM, Peptidase_S74_2, LC_3, HTH_XRE, and DUF1376. The 183 SC type is found in all Liberibacter pathogens; it typically has a large genome size with big 184 variations among loci which are mainly caused by independent genome deletion and insertions. 185 It is noteworthy that several genomes contain multiple copies of the SC-type phages. These 186 include two copies in C.L. asiaticus gxpsy genome, two copies in C.L. solanacearum ZC1 187 genome, and three copies in C.L. europaeus NZ1 genome. By contrast, the UT type represents 188 a novel group of prophages that we recovered in all C.L. asiaticus strains. Its size is much 189 smaller, and according to its gene composition, it is close to the SC type. There is also a small 190 prophage locus ZC1 on the C.L. solanacearum genome, but it is more likely a fragment. 191 Phylogenetic analysis reveals independent gene transfers of phages to Liberibacter. We 192 next attempted to trace the evolutionary histories of these prophages to examine their 193 correlations with the development of pathogenicity. By comparing gene components of 194 prophage loci, we found that terminase, the key phage component involved in the phage DNA 195 packing process 38 , is the only gene family conserved across all identified prophage loci (Fig. 3) . 196 Therefore, we used the terminase as a marker to study the evolutionary origins of these 197 prophages. We conducted several BLASTP searches using the terminases from Liberibacter 198 species to collect homologs in the NCBI NR database. Using both Maximum Likelihood and 199 Bayesian Inference analyses (Fig. 4) , we found that the terminases of Liberibacter prophages 200 are not monophyletic; instead, terminases of different types, LC1, LC2, and SC (together with 201 UT), were nested in three separate clades. This indicates that the LC1, LC2, and SC prophages 202 have different evolutionary origins, and they have been transferred independently to Liberibacter 203 at different evolutionary time points (Fig. 4a) . In addition to the independent origins, the tree also 204 suggests that multiple copies of the SC phage in several Liberibacter pathogens were likely 205 generated from genome-specific duplications, such as those in C.L. solanacearum, C.L. 206 europaeus, and C.L. asiaticus (Fig. 4a) . However, the SC prophages also appear to have 207 undergone recombination between species and strains, given the facts: 1) their terminases did 208 not follow a typical pattern of vertical evolution, unlike the LC2 terminases (Fig. 4a) ; 2) in our 209 genome comparison analysis, the SC phages from different C.L. asiaticus strains display a 210 strikingly rapid divergence in certain regions, in contrast to other genomic regions (Fig. 4b) . 211 Based on the results from both gene neighborhood-based classification and 212 phylogenetic analysis, it is likely that LC2, LC1, and SC prophages were introduced at the base 213 of Liberibacter, the base of non-pathogenic L. crescens, and the common ancestor of the 214 Liberibacter pathogens, respectively. As for the UT phage, it likely originated from a duplication 215 event of the ancestral SC phage at the base of the C.L. asiaticus species (Fig. 4b) . According to 216 the phylogenetic histories of these prophages, we can infer that the origin of the SC type of 217 prophages was associated with the emergence of pathogenicity of the Liberibacter pathogens. 218 Ortholog clustering analysis reveals additional unique genes which were either lost or 219 gained in the ancestor of pathogens. In addition to prophages, we sought to identify other 220 genomic (non-prophage) genes which might be associated with the transition from the non-221 pathogenic ancestor to pathogenic descendants. These genes should be featured by having 222 undergone either gene-loss or gene-gain events in the common ancestor of all pathogens. 223 Therefore, we were specifically interested in identifying two groups of genes which display 224 unique phyletic patterns: first, the genes, that underwent gene loss, are present in the non-225 pathogenic L. crescens species, but not in pathogenic descendants; second, the genes, that 226 underwent gene gain, are present in all pathogens but not in non-pathogenic L. crescens 227 species (Fig. 5a) . 228 To identify these genes, we utilized three steps of analysis. First, we carried out an 229 ortholog analysis using the OrthoFinder program 39 to cluster the proteins (excluding the 230 prophage components) of all collected Liberibacter genomes into different orthologous groups 231 (orthogroups). Based on their phyletic profiles, we identified the orthogroups that are only 232 present in pathogens (but not in the non-pathogenic ancestor) or in the non-pathogenic ancestor 233 (but not in pathogens). We also utilized whole-genome BLASTP searches to specifically identify 234 the cases of gene gain or gene loss by comparing the Liberibacter proteins with other 235 sequences in the NCBI database, according to the pairwise sequence similarity scores. Finally, 236 to validate and confirm the above results, we conducted extensive phylogenetic analyses on all 237 identified orthogroups. From these analyses, we identified 323 orthogroups (about 335 genes in 238 L. crescens BT-1) that were lost in all pathogenic descendants and 35 orthogroups (about 35-37 239 genes in each pathogen) that were gained in the ancestor of all pathogens ( Fig. 5a and 240 Supplementary Table 3 and Table 4 ). Figs. 5b,c,d show the evolutionary histories of the three 241 orthogroups. The tree topology supports a common ancestry of the homologous genes at the 242 base of the Liberibacter pathogens and their origins were likely from bacteria other than the 243 non-pathogenic Liberibacter ancestor. 244 The gene-loss and gene-gain events might contribute to the establishment of endophytic 245 habitats of pathogens. To understand the functional significance of these two types of 246 orthogroups, we performed Gene Ontology and KEGG pathway analysis. This revealed that 247 many of the genes that were lost in the endophytic pathogens are involved in synthesis and 248 metabolism of several major types of cellular components, including amino acids, phosphate-249 containing compounds, nitrogen components, and nucleotide acids (Supplementary Figure 1, 2, 250 and Table 5 ). Strikingly, the whole biosynthetic pathways for Tryptophan and Histidine were 251 missing in the pathogens (Fig. 6 ). The majority of the biosynthetic genes are clustered together 252 as different operons on the ancestral L. cresens genome; however, a complete deletion of 253 several related operons and other genes highlights the dependency of these intracellular 254 pathogens on receiving these nutrients from their hosts. 255 For the genes that were gained in the pathogens, we found that they encode multiple 256 transporters, enzymes involved in DNA/RNA synthesis, regulation, or small molecule 257 metabolism, and several transmembrane proteins (Supplementary Table 4 ). Some of them 258 might contribute to the pathogen's ability to adapt and proliferate in the intracellular environment 259 of host cells. For example, the survival protein SurE, is a metal-dependent nucleotide 260 phosphatase that has been shown to be essential for bacterial pathogenesis and survival in the 261 stationary phase and in harsh conditions 40 . Further, the multiple transporters might play 262 important roles in exchanging chemical components between the pathogens and the plant host 263 . Thus, both the gene-loss and gene-gain events that happened in the ancestor of Liberibacter 264 pathogens seem to have defined the molecular foundation of their endophytic habitats. 265 Extensive sequence and structure analyses identify potential virulence factors, including 266 several polymorphic toxins. Since pathogenicity is an acquired trait for Liberibacter bacteria, 267 we hypothesize that such ability is attributed to the unique pathogenicity-related genes, that 268 were gained by the ancestor of these pathogens during evolution (or gene-gain events). 269 Therefore, the potential virulence factors, such as toxins and effectors, should be among the 270 gene-gain list that we identified (Supplementary Table 4 ). However, of the proteins on our list, 271 almost half of them are hypothetical proteins with no functional annotation. To accurately 272 uncover the function of these proteins, we conducted a series of sequence and structure 273 analyses to examine the sequence/structural features of these proteins, dissect their domain 274 components, establish the distant relationship between these domains with the known Pfam 275 domains, and further synthesize the function of the proteins by combining the domain 276 annotations. This systematic analysis has allowed us to identify three groups of proteins as 277 potential toxins/effectors (Supplementary Table 4) . 278 The first toxin group comprises two hypothetical proteins (e.g. WP_012778667.1 and 279 WP_012778668.1 in C.L. asiaticus strain gxpsy). By sequence analysis, we found that these 280 two proteins share a unique domain architecture containing two previously un-recognized We found that both the PD and Ntox52 domains display a typical association with other 296 toxin-specific domains (Fig. 7a) . Specifically, the PD domain is frequently coupled with several 297 long N-terminal regions and a number of C-terminal toxin domains including nucleases (REase-298 1, AHH, EndoU), peptidases (MPTase, Tox-PL4), deaminases, ADP-Ribosyltransferases (ART-299 HYD3, ART-VIP2), and Anthrax toxA (Fig. 7a) , that were identified in our earlier studies 42, 43, 45, 52 . 300 The Ntox52 domain, on the other hand, is always located at the C-terminus of toxin proteins, a 301 position that the toxin domain typically occupies, and coupled with other toxin-specific central 302 modules, such as RHS and PseudoCD2, in addition to the PD domain. These lines of evidence 303 strongly suggest that the two identified proteins represent novel polymorphic toxins where the 304 PD domain is a pre-toxin module and the Ntox52 is a toxin module. By secondary structure 305 prediction and conservation analysis, the Ntox52 domain is featured by a core structure 306 containing several alpha helices and four beta strands and by several conserved polar residues 307 (Supplementary Figure 4) . However, no distant homolog has been found in our profile-based 308 sequence searches. 309 Another potential toxin is the YdjM protein (Fig. 7b ). Our profile-profile comparison 310 revealed that it is related to several toxin domains including bacterial alpha-toxin (probability 95% 311 of profile-profile match), a membrane-disrupting phospholipase responsible for gas gangrene 312 and myonecrosis in C. perfringens infected tissues 53 , and the HET-C domain (probability 93% 313 of profile-profile match), that is also a toxin component that we identified in bacterial PTSs and 314 in the fungal nonallelic incompatibility system 43,54 (Fig. 7b) . Although they share a low sequence 315 identity, these domains display conservation in both the structural composition and the catalytic 316 core ( Fig. 7b and 7c ). The catalytic core of the known alpha toxin and Het-C domains involves 317 seven conserved residues, six of which are preserved in the YdjM domain. However, our 318 structural modeling suggested that the YdjM-specific His76 serves an equivalent role as His136 319 of the alpha toxin (Fig. 7c) . Further, by analyzing the domain architectures of many other 320 proteins containing the YdjM domain, we found: 1) several bacterial RHS-type toxins use YdjM 321 as their C-terminal toxin module (e.g. WP_143945134) (Fig. 7c) ; 2) the YdjM domain is 322 predominantly coupled with a novel beta-sandwich domain ( Fig. 7c; Supplementary Figure 5) , 323 like the alpha toxin, which has a beta-sandwich domain to facilitate toxin localization at the 324 membrane 53 . Thus, based on these lines of evidence, we proposed that YdjM is a novel 325 phospholipase toxin which might disrupt the membrane of host cells. 326 The third toxin group includes several proteins belonging to the EEP 327 to understand their functional contribution. To resolve these puzzles, we conducted a unique, 361 comprehensive operonic association analysis to recover all the prophage loci in the Liberibacter 362 genomes and systematically annotate the conserved phage components. According to both the 363 prophage composition and phylogeny of the conserved terminase, we were able to classify 364 these diverse prophage loci into four major types, namely LC1, LC2, SC, and UT. This result 365 has clarified several confusions about the presence of the prophage loci in the Liberibacter 366 species. For example, our phylogenetic analysis has linked the recently classified type 4 367 prophages from the Liberibacter pathogens 37 to the earlier LC2 prophage from the ancestral L. 368 crescens 30 . We also uncovered a previously un-detected prophage type, the UT phage, that is 369 present in all the C.L. asiaticus strains. Further, we were able to identify the LC2 and UT 370 prophages in the Japanese strain which was previously thought to contain no prophage loci 5 . 371 Thus, our detailed annotation and classification of prophages has provided a genomic resource 372 for studying the prophages in the Liberibacter bacteria. 373 Importantly, our analysis revealed that these different types of prophages originated 374 through multiple independent gene transfer events to Liberibacter at different evolutionary 375 stages (Fig. 4c) . However, it is only the SC-type prophage that was acquired by the ancestor of 376 the pathogens. Thus, this evolutionary association suggests that the acquisition of the SC 377 phage might contribute to the emergence of pathogenicity in these pathogens. Indeed, many 378 toxins of bacterial pathogens are known to be carried by prophage loci 33-35 . On average, a 379 typical SC-prophage locus contains about 35 genes. While the majority of these genes encode 380 structural components and enzymes involved in phage replication and transcription, others 381 remain uncharacterized. Therefore, a systematic analysis of the SC-prophage components will 382 be necessary to understand its role in the Liberibacter pathogens. To be noted, we observed an 383 unusual divergence between the SC prophage loci from different Liberibacter pathogens and 384 different strains of the same pathogen, distinct from other genomic regions and other types of 385 prophages. This rapid evolutionary pattern of the SC prophage suggests that some of its 386 components may have acted at the interface of host-pathogen interactions and gained the 387 upper hand during the evolutionary arms race 57,58 . 388 Our comparative genomic analysis also identified genomic (non-prophage) genes that 389 were lost or gained at the ancestor of the Liberibacter pathogens. To be noted, of the 323 390 orthogroups (335 genes) that were lost in all the pathogens, 56 genes were also identified as 391 The association of these genes with pathogenic species suggests that some of them may play a 444 role in pathogenesis. This hypothesis can be tested directly using targeted, functional 445 experiments. We envision that future research informed and enabled by our genomic analysis 446 holds great promises to elucidate the pathogenesis mechanisms of Liberibacter bacteria, 447 leading to long-sought solutions for curing HLB and other related diseases. 448 449 Genomes that were investigated in this study. A total of 12 genomes of Liberibacter species 451 were collected from the nucleotide database of the National Center of Biotechnology Information 452 (NCBI). Their genome annotations including the coding sequences (CDS) and RNA genes were 453 extracted from NCBI GenBank files. The encoded protein sequences were retrieved from the 454 NCBI protein database. The detailed information of the genomes used in this study can be 455 found in Supplementary Table 1 . Gamma distribution was used to model evolutionary rate differences among sites (four 496 categories). A bootstrap analysis with 100 repetitions was performed to assess the significance 497 of the phylogenetic grouping. 498 To reconstruct the evolutionary history of terminases, we first applied the ML analysis in 499 which a WAG model with a discrete gamma distribution (four categories) was used to model 500 rate heterogeneity among sites. A bootstrap analysis with 100 repetitions was performed to 501 assess the significance of the phylogenetic grouping. We also applied the BI analysis with a JTT 502 model and a discrete Gamma distribution (four categories). Markov chain Monte Carlo (MCMC) 503 duplicate runs of 10 million states each, sampling every 10,000 steps was computed. Logs of 504 MCMC runs were examined using Tracer 1.7.1 program 83 . Burn-ins were set to be 2% of 505 All trees with the highest log-likelihood from the ML analysis were visualized using the 507 and be preserved in all these species, whereas the genes that were gained or lost at the 528 ancestor of the Liberibacter pathogens will display a different presence in either pathogenic 529 Liberibacter species or non-pathogenic ancestor. Therefore, we extracted these orthogroups 530 using a custom python script based on the following criteria: 1. Orthogroups with gene loss: the 531 ones present in non-pathogenic L. crescens but not in any of the pathogenic Candidatus 532 Liberibacter species; 2. Orthogroups with gene gain: the ones that are found in at least 4 533 Candidatus Liberibacter species (a total of 5 species used in this study), but not in L. crescens 534 Importantly, by conducting case-by-case phylogenetic analyses, we realized that there asiaticus proteins as queries. The results showed a good correlation between sequence 546 similarity scores and the evolutionary distance. Thus, by this simple BLAST search, we were 547 able to identify 1) the proteins in the non-pathogenic ancestor that have no close homologs in 548 pathogenic Liberibacter species, and 2) the proteins in pathogenic species that have no close 549 homologs in the ancestor. Table 3 and Table 4 ). 556 The number of gene-loss and gene-gain genes might differ between species or strains due to 557 genome-specific duplications. 558 Protein function analysis. We used three levels of analyses to conduct the functional 559 annotation of the proteins that were identified in this study. First, we retrieved the annotation 560 information from the NCBI RefSeq database and Gene Ontology (GO) terms using the 561 Blast2GO program 89 . Their involvement in the molecular synthesis pathways was determined 562 by mapping the KEGG database 90 . Second, for those proteins that have no functional 563 annotation, we annotated them using the conserved domains which were identified using the 564 hmmscan program searching against Pfam 80,84 and our own curated domain profiles. Finally, 565 for proteins with potential uncharacterized domains, we conducted detailed case-by-case The phloem-limited bacterium of greening 702 disease of citrus is a member of the α subdivision of the Proteobacteria First report of a huanglongbing-like disease of citrus in São Paulo State Brazil and association of a new Liberibacter species First report of "Candidatus Liberibacter solanacearum" in tomato plants in Mexico A new 'Candidatus Liberibacter'species associated with diseases of 711 solanaceous crops Unique features of a Japanese 713 'Candidatus Liberibacter asiaticus' strain revealed by whole genome sequencing Citrus greening disease Bioterror: the green menace Emerging diseases of cultivated potato and their 720 impact on Latin America Psyllidae) with "zebra chip Diaphorina citri Kuway., a vector of the greening 726 disease of citrus in India Citrus psylla, a vector of the greening disease of sweet 728 orange Emerging concepts 730 in effector biology of plant-associated organisms SEC-translocon dependent extracytoplasmic 733 proteins of Candidatus Liberibacter asiaticus The authors declare no competing interests. Clark, K. et al. An effector from the Huanglongbing-associated pathogen targets citrus 735 proteases. Nature communications 9, 1-11 (2018