14989606 TITLE 1 Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships 2 3 AUTHORS 4 Andrzej Zielezinski1,*, Jakub Barylski2, Wojciech M. Karlowski1 5 6 AUTHOR AFFILIATIONS: 7 1 Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University 8 Poznan, Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland 9 2 Molecular Virology Research Unit, Faculty of Biology, Adam Mickiewicz University Poznan, 10 Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland 11 12 * Address correspondence to: 13 Andrzej Zielezinski: andrzejz@amu.edu.pl 14 15 ABSTRACT 16 Motivation: Similar regions in virus and host genomes provide strong evidence for phage-host 17 interaction, and BLAST is one of the leading tools to predict hosts from phage sequences. 18 However, BLAST-based host prediction has three limitations: (i) top-scoring prokaryotic 19 sequences do not always point to the actual host, (ii) mosaic phage genomes may produce matches 20 to many, typically related, bacteria, and (iii) phage and host sequences may diverge beyond the 21 point where their relationship can be detected by a BLAST alignment. 22 Results: We created an extension to BLAST, named Phirbo, that improves host prediction quality 23 beyond what is obtainable from standard BLAST searches. The tool harnesses information 24 concerning sequence similarity and bacteria relatedness to predict phage-host interactions. Phirbo 25 was evaluated on two benchmark sets of known phage-host pairs, and it improved precision and 26 recall by 25 percentage points, as well as the discriminatory power for the recognition of phage-27 host relationships by 10 percentage points (Area Under the Curve = 0.95). Phirbo also yielded a 28 mean host prediction accuracy of 60% and 70% at the genus and family levels, respectively, 29 representing a 5% improvement over BLAST. When using only a fraction of phage genome 30 sequences (3 kb), the prediction accuracy of Phirbo was 5-11% higher than BLAST at all 31 taxonomic levels. 32 Conclusion: Our results suggest that Phirbo is an effective, unsupervised tool for predicting 33 phage-host relationships. 34 Availability: Phirbo is available at https://github.com/aziele/phirbo. 35 36 KEYWORDS 37 phage-host prediction, phage, prokaryote, bacteria, virus, genome sequence 38 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint mailto:andrzejz@amu.edu.pl https://github.com/aziele/phirbo https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ INTRODUCTION 39 Prokaryotic viruses (phages) are the most abundant entities across all habitats and represent a vast 40 reservoir of genetic diversity [1]. Phages mediate horizontal gene transfer and constitute a major 41 selection pressure that shapes the evolution of bacteria [2]. Prokaryotic viruses also affect 42 biogeochemical cycles and ecosystem dynamics by controlling microbial growth rates and 43 releasing the contents of microbial cells into the environment [2,3]. Moreover, phages play a key 44 role in shaping the composition and function of the human microbiome in health and disease [4–45 6]. Recently, there has been renewed interest in phage therapy and phage-based biocontrol of 46 harmful bacteria [7,8] in medical treatment [9,10] and the food industry [11,12]. Hence, 47 characterizing phage–host interactions is critical to understanding the factors that govern phage 48 infection dynamics and their subsequent ecological consequences [13]. 49 50 The scope of phage-host interactions is poorly understood, although it has been hypothesized that 51 all prokaryotic organisms fall prey to viral attacks [1]. Methods for studying phage-host 52 interactions primarily rely on cultured virus-host systems; however, recent in silico approaches 53 suggest a much broader range of hosts may be susceptible to viral infections [14]. These methods 54 predict prokaryotic hosts based on sequence composition [15,16], direct sequence similarity 55 between phages and hosts [14], analysis of CRISPR spacers or tRNAs [13,17], as well as 56 supervised approaches that integrate several sequence-based methods [18,19]. 57 58 Despite significant progress in phage-host predictions, the classic BLAST [20] algorithm is 59 currently the most effective, unsupervised method for identifying phage-host interactions [14,15]. 60 Depending on the dataset, the tool finds the correct genus level host for 40-60% of phages [14,15]. 61 The task of finding a host for a given phage using BLAST is conceptualized as obtaining the host 62 sequence with the highest similarity to the query phage sequence. However, restricting host 63 predictions to the first top-scored prokaryotic sequence has three limitations. First, the true host 64 may not be the top-scoring match in the BLAST results. Second, selecting a prokaryotic host based 65 on the first sequence assumes that a phage infects a single host. Although phages are generally 66 host-specific, some may infect multiple host species [21,22]. Finally, many distantly-related 67 prokaryotic species may obtain a comparable BLAST score for a query phage due to spurious 68 alignments. These ambiguous host predictions require further manual curation of the taxonomic 69 or phylogenetic relationship between the top-scored prokaryotic species to select the true host(s). 70 71 We have addressed these issues by developing a simple extension to BLAST, named Phirbo, that 72 exploits the information contained in the full BLAST results, rather than its top-ranking matches. 73 Phirbo improved the accuracy of finding hosts, beyond what is found from the best BLAST match, 74 by relating phage and host sequences through intermediate, common reference sequences that are 75 potentially homologous to both phage and host queries. Subsequent quantification of the 76 overlapping signals allows for the reliable prediction of phage-host interactions without the need 77 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ for direct comparisons between the phage and host sequences and without any prior knowledge of 78 their phylogenetic or taxonomic context. 79 80 RESULTS 81 82 Phirbo algorithm overview 83 This algorithm is based on the assumption that the degree of similarity between phage and host 84 sequences is proportional to the overlap between ranked similarity matches of each sequence to 85 the same reference data set of prokaryotic sequences. Specifically, to compare a pair of phage (P) 86 and host (H) sequences, we first perform two independent BLAST searches against the reference 87 database of prokaryotic genomes (D)—one BLAST search for phage and the other for the host 88 query (Fig. 1a). The two lists of BLAST results (Fig. 1b), P → D and H → D, contain prokaryotic 89 genomes ordered by decreasing sequence similarity (i.e., bit-score). To avoid a taxonomic bias due 90 to multiple genomes of the same prokaryote species, we rank prokaryotic species according to 91 their first appearance in the BLAST list (Fig. 1c). In this way, both lists represent phage and host 92 profiles consisting of the ranks of top-score prokaryotic species. 93 94 The properties of these lists (Fig. 1c) closely resemble the outcome of an Internet search and can 95 be characterized by four features: (i) species listed at the top of each ranking are more important 96 (similar) to the query than those listed at the bottom; (ii) the lists may not be conjoint (some species 97 may appear in one ranking but not in the other); (iii) the ranking lists may vary in length (BLAST 98 may return few prokaryotic matches in response to virus sequences in contrast to thousands of 99 matches in cases of multiple-species prokaryotic families); (iv) two or more species from the 100 database may achieve the same BLAST score and, therefore, occupy the same position on the 101 ranking list (Fig. 1c). A recently introduced similarity measure used for comparing the rankings 102 of Web search engine results [23], the Rank-Biased Overlap (RBO), satisfies these four conditions. 103 The RBO algorithm starts by scoring the overlap between the sub-list containing the single top-104 ranked item of each list. It then proceeds by scoring the overlaps between sub-lists formed by the 105 incremental addition of items further down the original lists. Each consecutive iteration has less 106 impact on the final RBO score as it puts heavier weights on higher-ranking items by using 107 geometric progression, which weighs the contribution of overlaps at lower ranks (see ‘Methods’). 108 An overall RBO score falls between 0 and 1, where 0 signifies that the lists are disjoint (have no 109 items in common) and 1 means the lists are identical in content and order. Our results indicate that 110 the extent of the phage-host relationship can be estimated by the application of an RBO 111 measurement to the ranking lists generated from BLAST results (Fig. 1d). 112 113 Phirbo differentiates between interacting and non-interacting phage-host pairs 114 To assess the discriminatory power of Phirbo to recognize phage-host interactions, we used two 115 published reference data sets: Edwards et al. (2016) [14], which contains 2,699 complete bacterial 116 genomes and 820 phages with reported hosts, and Galiez et al. (2017) [16] that has 3,780 complete 117 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ prokaryotic genomes and 1,420 phage genomes. For each data set, we compared the distribution 118 of Phirbo scores between all known phage-host interaction pairs and the same number of randomly 119 selected non-interacting phage-prokaryote pairs (Fig. 2). The scores obtained by Phirbo in both 120 data sets separated the interacting from non-interacting phage-host pairs more than the BLAST 121 scores. The median Phirbo score across interacting phage-host pairs was nearly 1,500 times greater 122 than for non-interacting pairs, while the median BLAST score was three times higher for 123 interacting pairs than non-interacting pairs (Supplementary Table 1). Both methods, however, 124 differentiated between interacting and non-interacting phage-host pairs with higher accuracy than 125 WIsH — the state-of-the-art, alignment-free, host prediction tool [16]. 126 127 To further examine the discriminatory power of Phirbo across all possible phage-prokaryote pairs, 128 we used receiver operating characteristic (ROC) curves (Fig. 2a,b). The area under the ROC 129 (AUC), which measured the discriminative ability between interacting and non-interacting phage-130 host pairs, was higher for Phirbo (AUC = 0.95) in the Edwards et al. and Galiez et al. data sets 131 than for BLAST (AUC = 0.86) and WIsH (AUC = 0.78-0.79). An additional advantage of Phirbo 132 was its capacity to score phage-host pairs whose sequence similarity could not be established by a 133 direct BLAST comparison but, instead, through other, ‘intermediate’ prokaryotic sequences that 134 were detectably similar to both phage and host query sequences. For example, BLAST did not 135 provide scores for 20% of the interacting phage-host pairs in the Edwards et al. and Galiez et al. 136 data sets due to alignment score thresholds (Supplementary Table 2). Using the same BLAST 137 lists, Phirbo evaluated 99% of the interacting phage-hosts pairs. This high coverage indicated that 138 nearly every pair of phage-prokaryote sequences could be related by at least one common 139 prokaryotic sequence detectably similar to both the phage and host sequences. 140 141 Phirbo has the highest host prediction performance 142 To evaluate host prediction performance, we used precision-recall (PR) curves, which provide 143 more reliable information than ROC when benchmarking imbalanced data sets for which the non-144 interacting pairs vastly outnumber the interacting pairs [24,25]. Accordingly, we plotted PR curves 145 for Phirbo, BLAST, and WIsH predictions obtained from the Edwards et al. (Fig. 3a) and Galiez 146 et al. (Fig. 3b) data sets. Overall, Phirbo performed better at host prediction at the species level 147 than BLAST and WIsH, regardless of the data set. The area under the PR curve (AUPR), which 148 summarized overall performance, was higher in Phirbo by 25 percentage points (AUPR = 0.56-149 0.65) than in BLAST (AUPR = 0.33-0.41). Phirbo also reported the highest F1 score (an average 150 of precision and recall [see ‘Methods’]) in the Edwards et al. and Galiez et al. data sets (Fig. 3). 151 Specifically, the precision and recall of Phirbo were 59-65% and 57-64%, respectively, while 152 BLAST had precision and recall in the range of 28-43% (Fig. 3). Furthermore, Phirbo yielded 153 slightly higher specificity (99.7-99.8%) and accuracy (99.5-99.6%) than BLAST or WIsH. 154 155 Phirbo preserves BLAST top-ranked host predictions 156 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ We further evaluated the host prediction accuracy of Phirbo by selecting a top-scored prokaryotic 157 sequence for each phage [14–16,18]. Briefly, host prediction accuracy is calculated as the 158 percentage of phages whose predicted hosts have the same taxonomic affiliation as their respective 159 known hosts (if multiple top-scoring hosts are present, the prediction is scored as correct if the true 160 host is among the predicted hosts). Phirbo restored all hosts predicted by BLAST in the datasets 161 by Edwards et al. and Galiez et al., achieving the same prediction accuracy as BLAST across all 162 taxonomic levels (Table 1). Of note, BLAST found multiple different host species with equal 163 scores for 14 phage genomes. This was observed in phages infecting bacteria from the 164 Enterobacteriaceae family and the Rhodococcus and Bacillus genera. However, Phirbo assigned 165 the highest score to the correct host species (Supplementary Table 3). Additionally, it refined the 166 host prediction for the Cronobacter phage ENT39118 sequence, which BLAST assigned to the 167 Escherichia coli genome. Phirbo revealed Cronobacter sakazaki as the primary host species, as 168 the BLAST list of the Cronobacter phage is more similar in content and order to the BLAST list 169 of C. sakazaki (Phirbo score = 0.50) than E. coli (Phirbo score: 0.48) (Figure S1). 170 171 As Phirbo links phage to host through common sequences, the content of the sequence database 172 was the main factor defining host prediction quality. Since the similarity between viruses may 173 indicate a common host [18,26], we expanded the two BLAST databases of prokaryotic sequences 174 obtained from Edwards et al. and Galiez et al. by phage sequences (n = 820 and n = 1420, 175 respectively), and recalculated Phirbo scores between every phage-prokaryote pair. The phage-176 host linkage through homologous prokaryotic and phage sequences increased the host prediction 177 accuracy of Phirbo at all taxonomic levels, allowing correct identification of hosts at the genus 178 level for 56-63% of phages (Table 1). Specifically, Phirbo refined BLAST mis-predictions for 55 179 phage genomes and showed which sequences demonstrated low similarity to the sequences of their 180 host species. The direct BLAST alignments of these phage sequences, and the sequences of their 181 corresponding hosts, obtained significantly lower scores than alignments obtained by the other 182 known phage-host pairs (P = 1.9 × 10-45, Mann–Whitney U test). Notably, Phirbo also assigned 183 correct host species for 18 phages whose hosts were not reported in the BLAST results, mainly 184 Chlamydia species, Vibrio cholerae, and the opportunistic pathogen, Acinetobacter baumannii. 185 186 Phirbo is suitable for incomplete phage sequences 187 We tested the robustness of our host prediction algorithm to fragmentation of the phage sequence. 188 Following earlier studies [15,16,18], phage genomes from Edwards et al. and Galiez et al. data 189 sets were randomly subsampled to generate contigs of different lengths (20 kb, 10 kb, 5 kb, 3 kb, 190 and 1 kb) with 10 replicates. Host prediction accuracy was calculated as the mean percentage of 191 phages whose predicted hosts had the same taxonomic affiliation as their respective known hosts 192 (Fig. 4). Although Phirbo achieved equal host prediction accuracy with BLAST across all contig 193 lengths, it had substantially higher overall performance in terms of AUC and AUPR (Figure S2; 194 P < 10−5, Wilcoxon signed-rank test). Surprisingly, BLAST-based methods obtained higher host 195 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ prediction accuracy across all contig lengths compared to WIsH, a tool designed to predict the 196 hosts of short viral contigs (Fig. 4). 197 198 The host prediction accuracy of Phirbo was examined using the expanded BLAST database of 199 both prokaryotic and phage full-length sequences. To ensure fairness, for each tested phage contig 200 we removed its corresponding full-length sequence from the BLAST database and recalculated 201 Phirbo scores between the phage contig and every prokaryotic sequence. This approach 202 outperformed BLAST at every contig length across all taxonomic levels in both data sets (Fig. 4). 203 Generally, the host prediction accuracy of Phirbo improved by 5-11 percentage points compared 204 to the BLAST results. For example, when the contig length was 3 kb, the prediction accuracy of 205 Phirbo was 8-11% higher than BLAST at the family level, and 8-17% higher than WIsH (Fig. 4; 206 Supplementary Table 4). Phirbo also achieved the highest AUC and AUPR scores when 207 discriminating between interacting and non-interacting phage-host pairs (Figure S2). 208 209 Phirbo uses multiple protein and non-coding RNA signals for host prediction 210 We investigated the sequence information used by BLAST and Phirbo for host prediction. For 211 each phage that was correctly assigned to the host species by both tools (n = 485), we calculated 212 the fraction of the phage genome that was included in the segments aligned with prokaryotic 213 sequences (sequence coverage). This analysis revealed that our tool used three times more phage 214 sequence (median sequence coverage: 35%) than BLAST (12%) (Figure S3; P < 10-15, Wilcoxon 215 signed-rank test). This increased sequence coverage indicates that different genome regions of the 216 phages map to the genomes of prokaryotic species other than the host species. For 214 of the 485 217 phages, more than half of their genomes were aligned to genomes of their host species 218 (Supplementary Table 5). Such large regions of homology are likely prophages or phage debris 219 left by large-scale recombination events during phage replication. The observed high sequence 220 coverage points to the virus taxa, known for their temperate lifestyle and frequent recombination 221 with host genomes (i.e., Siphoviridae family as well as the Peduovirinae and Sepvirinae 222 subfamilies). 223 224 To further examine the properties of sequences that may be exchanged between a phage and its 225 host, we selected a population of phages with sequence coverage below 50% (n = 271). These 226 phages, which are less likely to represent complete prophages, belong to 16 viral families 227 (Supplementary Table 6). Next, we re-annotated the genomic sequences of the phages to find 228 putative protein and non-coding RNA (ncRNA) genes. Phage sequence regions used by Phirbo for 229 host predictions were significantly enriched (P < 10-5) in more than a hundred protein families of 230 known or probable function. In contrast, only half of the protein families were used in BLAST-231 based host predictions (Supplementary Table 7). The protein families used by Phirbo covered 232 most of the processes of the viral life cycle including DNA replication, cell lysis, recombination, 233 and packaging of the phage genome (Fig. 5). In contrast to BLAST, Phirbo also exploited the 234 information contained in phage ncRNAs while assigning phages to host genomes. The vast 235 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ majority of these ncRNAs (>90%) were tRNAs, which showed significant overrepresentation in 236 the phage sequence fragments used by Phirbo (P = 6 × 10-12) (Supplementary Table 8). The 237 remaining ncRNAs belonged to group I introns (3%), RNAs associated with genes associated with 238 twister and hammerhead ribozymes (1%), skipping-rope RNA motifs (1%), and 12 less abundant 239 RNA families. 240 241 Implementation and availability 242 Predicting hosts from phage sequences using BLAST is accomplished by querying phage 243 sequences against a database of candidate hosts. However, Phirbo also uses information about 244 sequence relatedness among prokaryotic genomes. Therefore, it requires ranked lists of prokaryote 245 species generated by BLAST for the phage and host genomes. The computational cost of querying 246 every host sequence against the database of all candidate hosts using BLAST may still be a limiting 247 factor. However, for mass host searches, the computational cost of all-versus-all host comparisons 248 becomes marginal, as it must be done only once. After the relatedness among host genomes is 249 established, the time required for Phirbo host predictions is negligibly higher than the time for 250 typical BLAST-based host predictions. For example, running Phirbo between ranked lists of host 251 species for 1,420 phages and 3,860 candidate hosts from Galiez et al. (resulting in ~5.5 million 252 phage-host comparisons) took 8 minutes on a 16-core 2.60GHz Intel Xeon. 253 254 As Phirbo operates on rankings, BLAST can be replaced by an alternative sequence similarity 255 search tool to reduce the time to estimate homologous relationships between host genomes. For 256 instance, Mash [27] computed host relationships in 5 minutes for the Edwards et al. and Galiez et 257 al. data sets (see ‘Methods’). The host prediction performance of Phirbo using BLAST-based 258 rankings for phages and Mash-based rankings for host genomes is high compared to the 259 performance of Phirbo predictions using BLAST rankings for both phage and host genomes 260 (Supplementary Table 9). 261 262 We envisage Phirbo as a natural extension to standard BLAST-based host predictions. The Phirbo 263 tool is written in Python and freely available at https://github.com/aziele/phirbo/. 264 265 DISCUSSION 266 The identification of similar sequence regions between host and phage genomes using BLAST has 267 been a baseline for the identification of putative virus-host connections in numerous metagenomic 268 projects [13,28,29]. However, a BLAST search requires regions with significant similarity 269 between the query phage and host [14–16]. Yet, many phage and host sequences lack sufficient 270 similarity and escape detection with standard BLAST searches. To tackle this issue, alignment-271 free tools have been developed to predict hosts from phage sequences [14–16,30]. The rationale 272 behind these tools is based on the observation that viruses tend to share similar patterns in codon 273 usage or short sequence fragments with their hosts [14–16]. As virus replication is dependent on 274 the translational machinery of its host, some phages adapt their codon usage to match the 275 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://github.com/aziele/phirbo/ https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ availability of tRNAs during viral replication in the host cell [31–33]. Similar oligonucleotide 276 frequency use may be driven by evolutionary pressure on the virus to avoid recognition by host 277 restriction enzymes and CRISPR/Cas defense systems [32,34]. Although state-of-the-art 278 alignment-free tools (i.e., WIsH [16] and VirusHostMatcher [15]) can rapidly assess sequence 279 similarity between any pair of phage and prokaryote sequences, they are less accurate for host 280 prediction than BLAST [14,15]. The relatively high accuracy of BLAST suggests that localized 281 similarities of genetic material may be a stronger indication of phage-host interactions than global 282 convergence of their genomic composition. This evidence comes in the form of protein-coding 283 DNA fragments and non-coding RNAs. The latter group is dominated by tRNA genes, which are 284 strongly over-represented in direct BLAST alignments between phages and their hosts, and are 285 even more prevalent among indirect connections used by Phirbo. This may be important, as 286 previous studies have shown that not all phage tRNA genes come directly from their hosts. Some 287 appear to be derived from genomes of other, often distantly related, bacteria and may be the result 288 of earlier evolutionary events [35]. For protein-coding genes, a more diverse picture emerges. 289 Proteins rich in phage-host BLAST alignments can be assigned into different functional categories 290 including phage virion components, replication-related proteins, regulatory factors, and proteins 291 involved in the metabolism of the host. The transfer of some over-represented families in phages 292 and/or prophages has been previously reported (e.g., lytic proteins, DNA replication and 293 recombination proteins, and enzymes involved in nucleotide and energy metabolisms [36]) and 294 some of these genes are connected with the phage-host range [37,38]. However, no clear pattern 295 emerges after analyzing the functions of the remaining, over-represented proteins. 296 297 In this study, we attempted to expand the information content of a single local alignment of phage 298 and host sequences by incorporating the results of multiple local alignments between a phage 299 sequence and different prokaryotic genomes. This approach may more closely resemble a manual 300 assignment of phage-host pairs, where an expert analyst not only considers a top-ranked matching 301 prokaryote in the BLAST results, but also uses the information contained in other, less significant, 302 matches and their sequence and taxonomic similarity. Through a taxonomically-aware 303 stratification scheme, this approach tracks the multilateral dynamics of horizontal gene transfer. 304 Therefore, we propose to relate phage and host sequences through multiple intermediate sequences 305 that are detectably similar to both the phage and host sequences. By linking phage and host 306 sequences through similar sequences, Phirbo achieved a more comprehensive list of phage-host 307 interactions than BLAST. Simultaneously, Phirbo was capable of assessing almost all phage-host 308 pairs, bringing the method closer to alignment-free tools, which compute scores between all 309 possible phage and host pairs. Thus, our approach can be directly applied to different phage and 310 prokaryote data sets without training or optimizing the underlying RBO algorithm. We 311 intentionally avoided machine learning components in Phirbo to ensure the general applicability 312 of the approach and avoid possible overfitting. 313 314 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ Our results show that expanding the information obtained from plain similarity comparisons by 315 incorporating taxonomically-grounded measurements of phage-host similarity leads to improved 316 accuracy of phage-host predictions. The Phirbo method provides the phage research community 317 with an easy-to-use tool for predicting the host genus and species of query phages, which is usable 318 when searching for phages with appropriate host specificity and for correlating phages and hosts 319 in ecological and metagenomic studies. 320 321 METHODS 322 323 Virus and prokaryotic host data sets 324 The data sets analyzed in this study were retrieved from two previously published phage-host 325 studies [14,16]. The first set (Edwards et al. 2016 [14]) contained 2,699 complete bacterial 326 genomes obtained from NCBI RefSeq and 820 RefSeq genomes of phages for which the host was 327 reported. The data set encompassed 16,757 known virus-host interaction pairs and 2,196,424 pairs 328 for which interaction was not reported (non-interacting phage-host pairs). The second data set 329 (Galiez et al. 2017 [16]) contained 3,780 complete prokaryotic genomes of the KEGG database 330 and 1420 phages for which host species were reported in the RefSeq Virus database. The data set 331 consisted of 26,024 interacting- and 5,341,576 non-interacting virus-host pairs. 332 333 Phirbo score 334 The interaction score for a given phage-host pair was calculated using the RBO metric. RBO [23] 335 is a measurement of rank similarity that compares two lists of different lengths (giving more 336 attention to high ranks on the lists). RBO ranges from 0 to 1, where a greater value indicates greater 337 similarity between lists. Equation 1 was used for the calculation of the RBO value between two 338 ranking lists, S and T. 339 340 𝑅𝐵𝑂(𝑆, 𝑇, 𝑝) = (1 − 𝑝) ∑ 𝑝𝑑−1 𝑛 𝑑=1 𝐴(𝑆, 𝑇, 𝑑) 341 342 where the parameter p (0 < p < 1) determines how steeply the weight declines (the smaller the p, 343 the more top results are weighted). When p = 0, only the top-ranked item is considered, and the 344 RBO score is either zero or one. In this study, we set p to 0.75, which assigned ~98% of the weight 345 to the first 10 hosts. A(S, T, d) is the value of overlap between the two ranking lists, S and T, up to 346 rank d, calculated by Eq. 2. n is the number of distinct ranks on the ranking list. 347 348 𝐴(𝑆, 𝑇, 𝑑) = |𝑆:𝑑 ∩ 𝑇:𝑑 | |𝑆:𝑑 ∪ 𝑇:𝑑 | 349 350 where S:d and T:d represents the elements present in the first d ranks of lists S and T, respectively. 351 352 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ Host prediction tools 353 The host prediction tools BLAST [20], WIsH [16], and Phirbo were run separately in the Edwards 354 et al. and Galiez et al. data sets. For each tool, sequence similarity scores were calculated across 355 all combinations of phage-host pairs. BLAST 2.7.1+ [39] was run with default parameters (task: 356 blastn, e-value threshold = 10) to query each phage sequence against a database of candidate host 357 genomes. For each BLAST alignment, the highest bit-score between every phage-host pair was 358 reported (for phage-host pairs that were absent in the BLAST results, a bit-score of 0 was 359 assigned). For RBO host prediction, an additional BLAST search was performed to establish 360 ranked lists of genetically similar host genomes. Specifically, a nucleotide BLAST was run with 361 default parameters to query each host sequence against a database of candidate host genomes. As 362 an alternative to BLAST, Mash 2.1 [27] was used with default parameters (k-mer size = 21, sketch 363 size = 1,000) to establish ranked lists for each host by comparing its sequence against the database 364 of candidate host genomes. RBO scores were calculated between all pairwise combinations of 365 phage and host ranking lists. WIsH 1.0 [16] was used with default parameters to calculate log-366 likelihood scores between all pairwise combinations of phage-host sequences. 367 368 Evaluation metrics 369 The metrics of host prediction performance were calculated using sklearn (i.e., AUC, AUPR, 370 recall, precision, specificity, and accuracy) [40]. Optimal score thresholds to calculate recall, 371 precision, specificity, and accuracy was computed as maximizing the F1 score, an accuracy metric, 372 which is the harmonic mean of precision and recall. Host prediction accuracy was evaluated 373 analogous to previous studies [14,16,18]. Specifically, for each query phage, the host with the 374 highest score to the query virus was selected as the predicted host. In cases where multiple hosts 375 were predicted, the prediction was scored as correct if the correct host was among the predictions. 376 The prediction accuracy was calculated at each taxonomic level as the percentage of viruses whose 377 predicted hosts shared a taxonomic affiliation with known hosts. 378 379 Phage genome annotation 380 To define phage genes potentially exchanged between phage and host genomes, we re-annotated 381 485 phage genomes that were correctly assigned to host species by both Phirbo and BLAST. The 382 genes were classified into predefined pVOGs (prokaryotic Virus Orthologous Groups) [41] and 383 RNA families [42]. Briefly, open reading frames (ORFs) in the analyzed 485 phage genomes were 384 identified using Transeq from EMBOSS [43]. The ORFs were then assigned to the respective 385 orthologue group by HMMsearch (e-value < 10-5) against the database of Hidden Markov Models 386 (HMMs) created for every of 9,518 pVOG alignments using HMMbuild of HMMER v3.3.1 [44]. 387 Non-coding RNAs (ncRNAs) were predicted in the phage genomes (e-value < 10-5) using Rfam 388 covariance models v14.3 [42] and the Infernal tool v1.1.3 [45]. We counted the number of times 389 each pVOG and Rfam term was present in phage sequences used by BLAST and Phirbo during 390 host prediction. To determine whether the observed level of pVOG/Rfam counts was significant 391 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ within the context of all the terms within the phage genome, we calculated the p-value using the 392 hypergeometric distribution implemented in Scipy [46]. 393 394 ACKNOWLEDGMENTS 395 We thank Bas Dutilh, Rob Edwards, Clovis Galiez, and Johannes Söding for providing us with the 396 benchmark data sets used in their studies. We likewise acknowledge William Webber for 397 assistance with modifying the RBO formula to account for tied ranks. The computations were 398 performed at the Poznan Supercomputing and Networking Center. 399 400 AUTHOR CONTRIBUTIONS 401 AZ conceived the project and designed the experiments. AZ and JB wrote Phirbo and tested its 402 performance. WMK provided the conceptual framework for sequence comparisons through 403 intermediate sequences and reviewed the software and manuscript. AZ and JB analyzed the results 404 and wrote the paper. All authors read and approved the final manuscript. 405 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ FIGURE LEGENDS 406 407 Figure 1. Calculation of the interaction score between phage and host sequences. a. The 408 BLAST search of phage and prokaryote sequences against a reference dataset result in b. two 409 BLAST lists containing prokaryote matches ordered by decreasing similarity (i.e., bit-score). c. 410 BLAST lists were converted into rankings of prokaryote species. The ranked lists differ in 411 content: Yersinia rohdei and Y. ruckeri are present in the first ranking list but absent in the 412 second list, while Shigella dysenteriae and Erwinia toletana are only present in the second list. 413 Two species, Y. rohdei and Y. ruckeri, from the first BLAST search have the same scores and are 414 consequently tied for the same rank. d. An interaction score was calculated between two ranking 415 lists using rank-biased overlap. 416 417 Figure 2. Discriminatory power of Phirbo, BLAST, and WIsH scores to differentiate 418 between interacting and non-interacting phage-host pairs. Phage-host pairs were obtained 419 from a. Edwards et al. and b. Galiez et al. data sets. Box plots show the distribution of scores for 420 all interacting phage-host pairs (n = 16,757 and n = 26,024 in Edwards et al. and Galiez et al., 421 respectively) and the same number of randomly selected, non-interacting phage-host pairs. The 422 horizontal line in each box displays the median; boxes display the first and third quartiles; 423 whiskers depict lowest and highest non-outlier scores (details of distributions including outliers 424 are provided in Supplementary Table 1). Receiver operating characteristic curves and the 425 corresponding area under the curve (AUC) display the classification accuracy of phage–host 426 predictions across all possible phage-host pairs. Dashed lines represent the levels of 427 discrimination expected by chance. 428 429 Figure 3. Host prediction performance of Phirbo, BLAST, and WIsH. The performance is 430 provided by Precision-Recall (PR) curves and statistical measures (i.e., F1 score, precision, 431 recall, specificity, and accuracy) separately for a. Edwards et al. and b. Galiez et al. data sets. 432 Dashed lines in the PR-curve plots represent the levels of discrimination expected by chance. 433 Score cut-offs for each tool were set to ensure the highest F1 score. 434 435 Figure 4. Host prediction accuracy over phage contig length. Prediction accuracy is provided 436 separately for a. Edwards et al. and b. Galiez et al. data sets. Each complete virus genome was 437 randomly subsampled 10 times for different sequence lengths (i.e., 20 kb, 10 kb, 5 kb, 3 kb, and 438 1 kb). Hosts were predicted on each subsampling replicate by selecting a prokaryotic sequence 439 with the highest similarity to the query viral sequence. Points indicate the average of the 440 resulting accuracies for all the viruses at a given subsampling length and host taxonomic level 441 (i.e., species, genus, and family). An extended version of this figure containing host prediction 442 accuracy values is provided in Supplementary Table 4. 443 444 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ Figure 5. Functional classification of phage coding sequences used by Phirbo for host 445 prediction. Protein families (pVOGs) were classified into 15 functions related to phage-cycle 446 (e.g., DNA replication, transcription). Numbers in the dark circles indicate the number of 447 different pVOGs related to a given function. An extended version of this figure containing the 448 list of pVOGs is provided in Supplementary Table 7. 449 450 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ TABLES 451 452 Table 1. Host prediction accuracies (%) for phage and host genomes from the data sets by 453 Edwards et al. [14] and Galiez et al. [16]. 454 Dataset Method Species Genus Family Order Class Phylum Edwards et al. (2016) WIsH 28 44 50 53 62 70 BLAST 43 59 71 78 87 96 Phirbo* 43 59 71 78 87 95 Phirbo (+phages)† 48 63 75 82 90 97 Galiez et al. (2017) WIsH 21 44 48 53 68 77 BLAST 31 53 62 68 88 95 Phirbo* 31 53 62 68 88 95 Phirbo (+phages)† 35 56 65 72 90 96 The highest accuracies among the methods for each taxonomic level are in bold. 455 * Interaction scores were calculated using rank-biased overlap (RBO) between BLAST lists containing prokaryotic 456 sequences. Specifically, the BLAST database contained 2,699 sequences of bacterial genomes in the Edwards et al. 457 data set, and 3,780 sequences of bacterial and archaeal genomes in the Galiez et al. data set. 458 † Interaction scores were calculated using RBO between BLAST lists containing both prokaryotic and phage 459 sequences. 460 461 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ SUPPLEMENTARY FIGURES 462 463 Supplementary Figure 1. Host predictions for Cronobacter phage ENT39118 (RefSeq 464 accession: NC_019934) using a. BLAST and b. Phirbo. Querying the Cronobacter phage 465 sequence with a BLAST search against the host database returned the genomic sequence of 466 Escherichia coli (NC_017641) as the best match (bit-score = 14,588), and Cronobacter sakazakii 467 (NC_009778) as the second-best match (bit-score = 14,020). Phirbo predicted Cronobacter 468 sakazakii as the top-score host for the Cronobacter phage due to the highest extent of overlap 469 between the top-ranking BLAST matches of each sequence (NC_019934 and NC_009778) of the 470 same database. For clarity, only the first ten BLAST matches are shown. 471 472 Supplementary Figure 2. Host prediction performance of Phirbo, BLAST and WIsH over 473 phage contig length in terms of a. Area under the curve (AUC) and b. Area under the precision-474 recall curve (AUPR). Bars indicate the AUC or AUPR averaged across 10 replicates at a given 475 subsampling length of phage sequence. 476 477 Supplementary Figure 3. Scatter plot of the phage sequence coverage used in host predictions 478 of Phirbo versus that of BLAST. Each dot represents a phage genome. 479 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ SUPPLEMENTARY TABLES 480 481 Supplementary Table 1. Distribution of Phirbo, BLAST and WIsH scores among interacting 482 and non-interacting phage-host pairs obtained from Edwards et al. and Galiez et al. data sets. 483 Score ranges were summarized separately for 16,757 interacting and non-interacting phage-host 484 pairs from Edwards et al., and 26,024 interacting and non-interacting phage-host pairs from 485 Galiez et al. 486 487 Supplementary Table 2. Number of phage-host pairs evaluated by Phirbo, BLAST, and WIsH 488 in Edwards et al. and Galiez et al. data sets. 489 490 Supplementary Table 3. Phages assigned by BLAST to multiple, equally-scored host species. 491 Phirbo differentiated between host species and provided the highest score to primary host 492 species. 493 494 Supplementary Table 4. Host prediction accuracy of Phirbo, BLAST, and WIsH over phage 495 contig length. 496 497 Supplementary Table 5. Phage sequence coverage of 485 phages correctly assigned by BLAST 498 and Phirbo to their host species. Sequence coverage was calculated for each phage as the sum of 499 the lengths of its non-overlapping high scoring pairs to the genome of the correct host species, 500 divided by the size of the query-phage genome. Prophages were assumed to have sequence 501 coverage greater than or equal to 50%. 502 503 Supplementary Table 6. Summary of taxonomic affiliations of 271 phages that had sequence 504 coverage < 50% with the host species genomes. 505 506 Supplementary Table 7. Protein families present in sequence regions of 271 phage genomes 507 that were used by BLAST and/or Phirbo in host prediction. The table provides information on 508 each protein family (prokaryotic Virus Orthologous Group (pVOG)) used by BLAST and 509 Phirbo, including: (i) pVOG description and functional assignment (manually curated), (ii) 510 pVOG count (number of times a given pVOG was present in the phage genome, as well as in 511 sequences used by BLAST or Phirbo), (iii) pVOG percentage (pVOG count divided by pVOG 512 count in the genome), and (iii) P-value of pVOG enrichment. 513 514 Supplementary Table 8. RNA families present in sequence regions of 271 phage genomes that 515 were used by BLAST and Phirbo in host prediction. The table provides information on each 516 Rfam family used by BLAST and Phirbo. 517 518 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ Supplementary Table 9. Comparison of Phirbo’s host prediction performance between BLAST-519 based and Mash-based rankings of prokaryotic species. 520 521 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ REFERENCES 522 523 1. Suttle CA. Marine viruses--major players in the global ecosystem. Nat Rev Microbiol. 524 2007;5: 801–812. 525 2. Breitbart M, Bonnain C, Malki K, Sawaya NA. Phage puppet masters of the marine 526 microbial realm. Nat Microbiol. 2018;3: 754–766. 527 3. Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and 528 potential biogeochemical impacts of globally abundant ocean viruses. Nature. 2016;537: 529 689–693. 530 4. Norman JM, Handley SA, Baldridge MT, Droit L, Liu CY, Keller BC, et al. Disease-531 specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015;160: 532 447–460. 533 5. Manrique P, Bolduc B, Walk ST, van der Oost J, de Vos WM, Young MJ. Healthy human 534 gut phageome. Proc Natl Acad Sci U S A. 2016;113: 10400–10405. 535 6. Meyer JR. Sticky bacteriophage protect animal cells. Proceedings of the National Academy 536 of Sciences of the United States of America. Proceedings of the National Academy of 537 Sciences; 2013. pp. 10475–10476. 538 7. Reardon S. Phage therapy gets revitalized. Nature. 2014;510: 15–16. 539 8. Salmond GPC, Fineran PC. A century of the phage: past, present and future. Nat Rev 540 Microbiol. 2015;13: 777–786. 541 9. Svoboda E. Bacteria-eating viruses could provide a route to stability in cystic fibrosis. 542 Nature. 2020;583: S8–S9. 543 10. Dedrick RM, Guerrero-Bustamante CA, Garlena RA, Russell DA, Ford K, Harris K, et al. 544 Engineered bacteriophages for treatment of a patient with a disseminated drug-resistant 545 Mycobacterium abscessus. Nat Med. 2019;25: 730–733. 546 11. Samson JE, Moineau S. Bacteriophages in food fermentations: new frontiers in a continuous 547 arms race. Annu Rev Food Sci Technol. 2013;4: 347–368. 548 12. Sulakvelidze A. Using lytic bacteriophages to eliminate or significantly reduce 549 contamination of food by foodborne bacterial pathogens. J Sci Food Agric. 2013;93: 3137–550 3146. 551 13. Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, 552 Mikhailova N, et al. Uncovering earth’s virome. Nature. 2016;536: 425–430. 553 14. Edwards RA, McNair K, Faust K, Raes J, Dutilh BE. Computational approaches to predict 554 bacteriophage–host relationships. FEMS Microbiol Rev. 2016;40: 258–272. 555 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ 15. Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free d_2^* oligonucleotide 556 frequency dissimilarity measure improves prediction of hosts from metagenomically-557 derived viral sequences. Nucleic Acids Res. 2017;45: 39–53. 558 16. Galiez C, Siebert M, Enault F, Vincent J, Söding J. WIsH: who is the host? Predicting 559 prokaryotic hosts from metagenomic phage contigs. Bioinformatics. 2017;33: 3113–3114. 560 17. Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in 561 natural microbial communities. Science. 2008;320: 1047–1050. 562 18. Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, et al. A network-based 563 integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform. 564 2020;2: lqaa044. 565 19. Zhang M, Yang L, Ren J, Ahlgren NA, Fuhrman JA, Sun F. Prediction of virus-host 566 infectious association by supervised learning methods. BMC Bioinformatics. 2017;18: 60. 567 20. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST 568 and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 569 1997;25: 3389–3402. 570 21. Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, et al. Ocean plankton. 571 Determinants of community structure in the global plankton interactome. Science. 572 2015;348: 1262073. 573 22. Flores CO, Meyer JR, Valverde S, Farr L, Weitz JS. Statistical structure of host-phage 574 interactions. Proc Natl Acad Sci U S A. 2011;108: E288-97. 575 23. Webber W, Moffat A, Zobel J. A similarity measure for indefinite rankings. ACM Trans Inf 576 Syst. 2010;28: 1–38. 577 24. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot 578 when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432. 579 25. Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. 580 Proceedings of the 23rd international conference on Machine learning - ICML ’06. New 581 York, New York, USA: ACM Press; 2006. doi:10.1145/1143844.1143874 582 26. Villarroel J, Kleinheinz KA, Jurtz VI, Zschach H, Lund O, Nielsen M, et al. HostPhinder: A 583 phage host prediction tool. Viruses. 2016;8. doi:10.3390/v8050116 584 27. Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast 585 genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17. 586 doi:10.1186/s13059-016-0997-x 587 28. Gao NL, Zhang C, Zhang Z, Hu S, Lercher MJ, Zhao X-M, et al. MVP: a microbe–phage 588 interaction database. Nucleic Acids Res. 2018;46: D700–D707. 589 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ 29. Paez-Espino D, Roux S, Chen I-MA, Palaniappan K, Ratner A, Chu K, et al. IMG/VR 590 v.2.0: an integrated data management and analysis system for cultivated and environmental 591 viral genomes. Nucleic Acids Res. 2019;47: D678–D686. 592 30. Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus-host interactions 593 resolved from publicly available microbial genomes. Elife. 2015;4. 594 doi:10.7554/eLife.08490 595 31. Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and 596 exchange. J Mol Evol. 1997;44: 383–397. 597 32. Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in 598 tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics. 599 2006;7: 8. 600 33. Carbone A. Codon bias is a major factor explaining phage evolution in translationally 601 biased hosts. J Mol Evol. 2008;66: 210–223. 602 34. Sharp PM, Rogers MS, McConnell DJ. Selection pressures on codon usage in the complete 603 genome of bacteriophage T7. J Mol Evol. 1984;21: 150–160. 604 35. Morgado S, Vicente AC. Global in-silico scenario of tRNA genes and their organization in 605 virus genomes. Viruses. 2019;11: 180. 606 36. Sousa JAM de, Pfeifer E, Touchon M, Rocha EPC. Genome diversification via genetic 607 exchanges between temperate and virulent bacteriophages. bioRxiv. bioRxiv; 2020. 608 doi:10.1101/2020.04.14.041137 609 37. Shapiro JW, Putonti C. Gene co-occurrence networks reflect bacteriophage ecology and 610 evolution. MBio. 2018;9. doi:10.1128/mbio.01870-17 611 38. Hernandes Coutinho F, Zaragosa-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh 612 BE, et al. RaFAH: A superior method for virus-host prediction. bioRxiv. bioRxiv; 2020. 613 doi:10.1101/2020.09.25.313155 614 39. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: 615 architecture and applications. BMC Bioinformatics. 2009;10: 421. 616 40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: 617 Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830. 618 41. Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups 619 (pVOGs): a resource for comparative genomics and protein family annotation. Nucleic 620 Acids Res. 2017;45: D491–D498. 621 42. Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, et al. 622 Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids 623 Res. 2020. doi:10.1093/nar/gkaa1047 624 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ 43. Rice P, Longden I, Bleasby A. EMBOSS: The European molecular biology open software 625 suite. Trends Genet. 2000;16: 276–277. 626 44. Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity 627 searching. Nucleic Acids Res. 2011;39: W29-37. 628 45. Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. 629 Bioinformatics. 2013;29: 2933–2935. 630 46. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 631 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17: 632 261–272. 633 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ BLAST Reference prokarote DNA database (D) match score E. coli K12 E. coli O157:H7 S. flexneri 2a S. boydii E. coli K12 E. coli O157:H7 E. coli M34 S. flexneri 2a S. boydii E. toletana S. dysenteriae Y. rohdei S. flexneri 2bRank species Compare rankings 2 540 2 210 1 530 1 290 810 948 390 902 110 836 880 434 420 407 230 385 970 328 660 183 230 match match rank E. coli S. boydii Y. rohdei, Y. ruckeri S. flexneri 1 2 3 4 S. flexneri E. coli S. dysenteriae E. toletana S. boydii 1 2 3 4 5 match rank AGTCGTGTACTGCGCGCCGCGCGCCAGGAC GGTTCGGCCAACGACTGGGTCCTTATCGAT CCAACGACGACGGCTCCAACGACGTTAGGC ACGTTACCGTTTAGGCGCGATGCGATGCGT Phage DNA sequence (P) a b c d score Host DNA sequence (H) Rank-Biased Overlap (RBO) = 0.76 Y. ruckeri 810 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ a S im ila ri ty s c o re interaction non-interaction 0 0.2 0.4 0.6 0.8 S im ila ri ty s c o re interaction non-interaction 0 1000 2000 3000 S im ila ri ty s c o re interaction non-interaction -1.5 -1.45 -1.4 -1.35 -1.3 -1.25 Phirbo BLAST WIsH 1 S im ila ri ty s c o re interaction non-interaction 0 0.2 0.4 0.6 0.8 S im ila ri ty s c o re interaction non-interaction -1.5 -1.45 -1.4 -1.35 -1.3 Phirbo WIsH 1 b 0 0 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 T ru e P o s it iv e R a te False Positive Rate AUC = 0.95 AUC = 0.86 AUC = 0.79 0 0 0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8 1 T ru e P o s it iv e R a te False Positive Rate 1 WIsHBLASTPhirbo AUC = 0.95 AUC = 0.86 AUC = 0.78 -1.25 4000 5000 S im ila ri ty s c o re interaction non-interaction 0 1000 2000 3000 BLAST 4000 5000 .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ a AUPR = 0.60 AUPR = 0.38 AUPR = 0.03 0 0 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 Recall WIsHBLASTPhirbo P re ci si o n b AUPR = 0.47 AUPR = 0.28 AUPR = 0.02 0 0 0.2 0.4 0.6 0.8 1 0.2 0.4 0.6 0.8 1 Recall P re ci si o n F1 score Recall Precision Specificity Accuracy 0.646 0.641 0.651 0.997 0.995 0.434 0.362 0.542 0.995 0.993 0.084 0.225 0.052 0.969 0.963 Phirbo BLAST WIsH F1 score Recall Precision Specificity Accuracy 0.568 0.550 0.589 0.998 0.996 0.348 0.279 0.462 0.998 0.995 0.045 0.210 0.025 0.961 0.957 WIsHBLASTPhirbo Score cut-off 0.40 731 -1.34 Score cut-off 0.40 919 -1.34 Phirbo BLAST WIsH .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ a 3 5 10 201 0 20 40 60 80 Species P re d ic ti o n a c c u ra c y ( % ) Sequence length (kb) Genus Family b Phirbo (+phages) BLAST / Phirbo WIsH 3 5 10 201 0% 20% 40% 60% 80% Sequence length (kb) 3 5 10 201 0% 20% 40% 60% 80% Sequence length (kb) 3 5 10 201 0 20 40 60 80 Species P re d ic ti o n a c c u ra c y ( % ) Sequence length (kb) Genus Family 3 5 10 201 0% 20% 40% 60% 80% Sequence length (kb) 3 5 10 201 0% 20% 40% 60% 80% Sequence length (kb) .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/ Capsid head Collar Tail Baseplate Fiber Spike Amino acid metabolism po l DNA replication Genome packaging Transcription Cell lysis Host defence systems Energy metabolism Nucleotide metabolism Bacterial chromosome Integration / recombination Other functions A T G C T 10 15 10 7 9 8 6 7 Antibiotic resistance 8 8 5 1 2 5 11 Full phage assembly .CC-BY-NC 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint https://doi.org/10.1101/2021.01.05.425417 http://creativecommons.org/licenses/by-nc/4.0/