14989606


TITLE 1 

Taxonomy-aware, sequence similarity ranking reliably predicts phage-host relationships 2 

 3 

AUTHORS 4 

Andrzej Zielezinski1,*, Jakub Barylski2, Wojciech M. Karlowski1 5 

 6 

AUTHOR AFFILIATIONS: 7 
1 Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University 8 

Poznan, Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland 9 
2 Molecular Virology Research Unit, Faculty of Biology, Adam Mickiewicz University Poznan, 10 

Uniwersytetu Poznanskiego 6, 61-614, Poznan, Poland 11 

 12 

* Address correspondence to: 13 

Andrzej Zielezinski: andrzejz@amu.edu.pl 14 

 15 

ABSTRACT 16 

Motivation: Similar regions in virus and host genomes provide strong evidence for phage-host 17 

interaction, and BLAST is one of the leading tools to predict hosts from phage sequences. 18 

However, BLAST-based host prediction has three limitations: (i) top-scoring prokaryotic 19 

sequences do not always point to the actual host, (ii) mosaic phage genomes may produce matches 20 

to many, typically related, bacteria, and (iii) phage and host sequences may diverge beyond the 21 

point where their relationship can be detected by a BLAST alignment. 22 

Results: We created an extension to BLAST, named Phirbo, that improves host prediction quality 23 

beyond what is obtainable from standard BLAST searches. The tool harnesses information 24 

concerning sequence similarity and bacteria relatedness to predict phage-host interactions. Phirbo 25 

was evaluated on two benchmark sets of known phage-host pairs, and it improved precision and 26 

recall by 25 percentage points, as well as the discriminatory power for the recognition of phage-27 

host relationships by 10 percentage points (Area Under the Curve = 0.95). Phirbo also yielded a 28 

mean host prediction accuracy of 60% and 70% at the genus and family levels, respectively, 29 

representing a 5% improvement over BLAST. When using only a fraction of phage genome 30 

sequences (3 kb), the prediction accuracy of Phirbo was 5-11% higher than BLAST at all 31 

taxonomic levels. 32 

Conclusion: Our results suggest that Phirbo is an effective, unsupervised tool for predicting 33 

phage-host relationships. 34 

Availability: Phirbo is available at https://github.com/aziele/phirbo. 35 

 36 

KEYWORDS 37 

phage-host prediction, phage, prokaryote, bacteria, virus, genome sequence  38 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

mailto:andrzejz@amu.edu.pl
https://github.com/aziele/phirbo
https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


INTRODUCTION 39 

Prokaryotic viruses (phages) are the most abundant entities across all habitats and represent a vast 40 

reservoir of genetic diversity [1]. Phages mediate horizontal gene transfer and constitute a major 41 

selection pressure that shapes the evolution of bacteria [2]. Prokaryotic viruses also affect 42 

biogeochemical cycles and ecosystem dynamics by controlling microbial growth rates and 43 

releasing the contents of microbial cells into the environment [2,3]. Moreover, phages play a key 44 

role in shaping the composition and function of the human microbiome in health and disease [4–45 

6]. Recently, there has been renewed interest in phage therapy and phage-based biocontrol of 46 

harmful bacteria [7,8] in medical treatment [9,10] and the food industry [11,12]. Hence, 47 

characterizing phage–host interactions is critical to understanding the factors that govern phage 48 

infection dynamics and their subsequent ecological consequences [13]. 49 

 50 

The scope of phage-host interactions is poorly understood, although it has been hypothesized that 51 

all prokaryotic organisms fall prey to viral attacks [1]. Methods for studying phage-host 52 

interactions primarily rely on cultured virus-host systems; however, recent in silico approaches 53 

suggest a much broader range of hosts may be susceptible to viral infections [14]. These methods 54 

predict prokaryotic hosts based on sequence composition [15,16], direct sequence similarity 55 

between phages and hosts [14], analysis of CRISPR spacers or tRNAs [13,17], as well as 56 

supervised approaches that integrate several sequence-based methods [18,19]. 57 

 58 

Despite significant progress in phage-host predictions, the classic BLAST [20] algorithm is 59 

currently the most effective, unsupervised method for identifying phage-host interactions [14,15]. 60 

Depending on the dataset, the tool finds the correct genus level host for 40-60% of phages [14,15]. 61 

The task of finding a host for a given phage using BLAST is conceptualized as obtaining the host 62 

sequence with the highest similarity to the query phage sequence. However, restricting host 63 

predictions to the first top-scored prokaryotic sequence has three limitations. First, the true host 64 

may not be the top-scoring match in the BLAST results. Second, selecting a prokaryotic host based 65 

on the first sequence assumes that a phage infects a single host. Although phages are generally 66 

host-specific, some may infect multiple host species [21,22]. Finally, many distantly-related 67 

prokaryotic species may obtain a comparable BLAST score for a query phage due to spurious 68 

alignments. These ambiguous host predictions require further manual curation of the taxonomic 69 

or phylogenetic relationship between the top-scored prokaryotic species to select the true host(s). 70 

 71 

We have addressed these issues by developing a simple extension to BLAST, named Phirbo, that 72 

exploits the information contained in the full BLAST results, rather than its top-ranking matches. 73 

Phirbo improved the accuracy of finding hosts, beyond what is found from the best BLAST match, 74 

by relating phage and host sequences through intermediate, common reference sequences that are 75 

potentially homologous to both phage and host queries. Subsequent quantification of the 76 

overlapping signals allows for the reliable prediction of phage-host interactions without the need 77 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


for direct comparisons between the phage and host sequences and without any prior knowledge of 78 

their phylogenetic or taxonomic context. 79 

 80 

RESULTS 81 

 82 

Phirbo algorithm overview 83 

This algorithm is based on the assumption that the degree of similarity between phage and host 84 

sequences is proportional to the overlap between ranked similarity matches of each sequence to 85 

the same reference data set of prokaryotic sequences. Specifically, to compare a pair of phage (P) 86 

and host (H) sequences, we first perform two independent BLAST searches against the reference 87 

database of prokaryotic genomes (D)—one BLAST search for phage and the other for the host 88 

query (Fig. 1a). The two lists of BLAST results (Fig. 1b), P → D and H → D, contain prokaryotic 89 

genomes ordered by decreasing sequence similarity (i.e., bit-score). To avoid a taxonomic bias due 90 

to multiple genomes of the same prokaryote species, we rank prokaryotic species according to 91 

their first appearance in the BLAST list (Fig. 1c). In this way, both lists represent phage and host 92 

profiles consisting of the ranks of top-score prokaryotic species. 93 

 94 

The properties of these lists (Fig. 1c) closely resemble the outcome of an Internet search and can 95 

be characterized by four features: (i) species listed at the top of each ranking are more important 96 

(similar) to the query than those listed at the bottom; (ii) the lists may not be conjoint (some species 97 

may appear in one ranking but not in the other); (iii) the ranking lists may vary in length (BLAST 98 

may return few prokaryotic matches in response to virus sequences in contrast to thousands of 99 

matches in cases of multiple-species prokaryotic families); (iv) two or more species from the 100 

database may achieve the same BLAST score and, therefore, occupy the same position on the 101 

ranking list (Fig. 1c). A recently introduced similarity measure used for comparing the rankings 102 

of Web search engine results [23], the Rank-Biased Overlap (RBO), satisfies these four conditions. 103 

The RBO algorithm starts by scoring the overlap between the sub-list containing the single top-104 

ranked item of each list. It then proceeds by scoring the overlaps between sub-lists formed by the 105 

incremental addition of items further down the original lists. Each consecutive iteration has less 106 

impact on the final RBO score as it puts heavier weights on higher-ranking items by using 107 

geometric progression, which weighs the contribution of overlaps at lower ranks (see ‘Methods’). 108 

An overall RBO score falls between 0 and 1, where 0 signifies that the lists are disjoint (have no 109 

items in common) and 1 means the lists are identical in content and order. Our results indicate that 110 

the extent of the phage-host relationship can be estimated by the application of an RBO 111 

measurement to the ranking lists generated from BLAST results (Fig. 1d). 112 

 113 

Phirbo differentiates between interacting and non-interacting phage-host pairs 114 

To assess the discriminatory power of Phirbo to recognize phage-host interactions, we used two 115 

published reference data sets: Edwards et al. (2016) [14], which contains 2,699 complete bacterial 116 

genomes and 820 phages with reported hosts, and Galiez et al. (2017) [16] that has 3,780 complete 117 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


prokaryotic genomes and 1,420 phage genomes. For each data set, we compared the distribution 118 

of Phirbo scores between all known phage-host interaction pairs and the same number of randomly 119 

selected non-interacting phage-prokaryote pairs (Fig. 2). The scores obtained by Phirbo in both 120 

data sets separated the interacting from non-interacting phage-host pairs more than the BLAST 121 

scores. The median Phirbo score across interacting phage-host pairs was nearly 1,500 times greater 122 

than for non-interacting pairs, while the median BLAST score was three times higher for 123 

interacting pairs than non-interacting pairs (Supplementary Table 1). Both methods, however, 124 

differentiated between interacting and non-interacting phage-host pairs with higher accuracy than 125 

WIsH — the state-of-the-art, alignment-free, host prediction tool [16]. 126 

 127 

To further examine the discriminatory power of Phirbo across all possible phage-prokaryote pairs, 128 

we used receiver operating characteristic (ROC) curves (Fig. 2a,b). The area under the ROC 129 

(AUC), which measured the discriminative ability between interacting and non-interacting phage-130 

host pairs, was higher for Phirbo (AUC = 0.95) in the Edwards et al. and Galiez et al. data sets 131 

than for BLAST (AUC = 0.86) and WIsH (AUC = 0.78-0.79). An additional advantage of Phirbo 132 

was its capacity to score phage-host pairs whose sequence similarity could not be established by a 133 

direct BLAST comparison but, instead, through other, ‘intermediate’ prokaryotic sequences that 134 

were detectably similar to both phage and host query sequences. For example, BLAST did not 135 

provide scores for 20% of the interacting phage-host pairs in the Edwards et al. and Galiez et al. 136 

data sets due to alignment score thresholds (Supplementary Table 2). Using the same BLAST 137 

lists, Phirbo evaluated 99% of the interacting phage-hosts pairs. This high coverage indicated that 138 

nearly every pair of phage-prokaryote sequences could be related by at least one common 139 

prokaryotic sequence detectably similar to both the phage and host sequences. 140 

 141 

Phirbo has the highest host prediction performance 142 

To evaluate host prediction performance, we used precision-recall (PR) curves, which provide 143 

more reliable information than ROC when benchmarking imbalanced data sets for which the non-144 

interacting pairs vastly outnumber the interacting pairs [24,25]. Accordingly, we plotted PR curves 145 

for Phirbo, BLAST, and WIsH predictions obtained from the Edwards et al. (Fig. 3a) and Galiez 146 

et al. (Fig. 3b) data sets. Overall, Phirbo performed better at host prediction at the species level 147 

than BLAST and WIsH, regardless of the data set. The area under the PR curve (AUPR), which 148 

summarized overall performance, was higher in Phirbo by 25 percentage points (AUPR = 0.56-149 

0.65) than in BLAST (AUPR = 0.33-0.41). Phirbo also reported the highest F1 score (an average 150 

of precision and recall [see ‘Methods’]) in the Edwards et al. and Galiez et al. data sets (Fig. 3). 151 

Specifically, the precision and recall of Phirbo were 59-65% and 57-64%, respectively, while 152 

BLAST had precision and recall in the range of 28-43% (Fig. 3). Furthermore, Phirbo yielded 153 

slightly higher specificity (99.7-99.8%) and accuracy (99.5-99.6%) than BLAST or WIsH. 154 

 155 

Phirbo preserves BLAST top-ranked host predictions  156 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


We further evaluated the host prediction accuracy of Phirbo by selecting a top-scored prokaryotic 157 

sequence for each phage [14–16,18]. Briefly, host prediction accuracy is calculated as the 158 

percentage of phages whose predicted hosts have the same taxonomic affiliation as their respective 159 

known hosts (if multiple top-scoring hosts are present, the prediction is scored as correct if the true 160 

host is among the predicted hosts). Phirbo restored all hosts predicted by BLAST in the datasets 161 

by Edwards et al. and Galiez et al., achieving the same prediction accuracy as BLAST across all 162 

taxonomic levels (Table 1). Of note, BLAST found multiple different host species with equal 163 

scores for 14 phage genomes. This was observed in phages infecting bacteria from the 164 

Enterobacteriaceae family and the Rhodococcus and Bacillus genera. However, Phirbo assigned 165 

the highest score to the correct host species (Supplementary Table 3). Additionally, it refined the 166 

host prediction for the Cronobacter phage ENT39118 sequence, which BLAST assigned to the 167 

Escherichia coli genome. Phirbo revealed Cronobacter sakazaki as the primary host species, as 168 

the BLAST list of the Cronobacter phage is more similar in content and order to the BLAST list 169 

of C. sakazaki (Phirbo score = 0.50) than E. coli (Phirbo score: 0.48) (Figure S1). 170 

 171 

As Phirbo links phage to host through common sequences, the content of the sequence database 172 

was the main factor defining host prediction quality. Since the similarity between viruses may 173 

indicate a common host [18,26], we expanded the two BLAST databases of prokaryotic sequences 174 

obtained from Edwards et al. and Galiez et al. by phage sequences (n = 820 and n = 1420, 175 

respectively), and recalculated Phirbo scores between every phage-prokaryote pair. The phage-176 

host linkage through homologous prokaryotic and phage sequences increased the host prediction 177 

accuracy of Phirbo at all taxonomic levels, allowing correct identification of hosts at the genus 178 

level for 56-63% of phages (Table 1). Specifically, Phirbo refined BLAST mis-predictions for 55 179 

phage genomes and showed which sequences demonstrated low similarity to the sequences of their 180 

host species. The direct BLAST alignments of these phage sequences, and the sequences of their 181 

corresponding hosts, obtained significantly lower scores than alignments obtained by the other 182 

known phage-host pairs (P = 1.9 × 10-45, Mann–Whitney U test). Notably, Phirbo also assigned 183 

correct host species for 18 phages whose hosts were not reported in the BLAST results, mainly 184 

Chlamydia species, Vibrio cholerae, and the opportunistic pathogen, Acinetobacter baumannii. 185 

 186 

Phirbo is suitable for incomplete phage sequences 187 

We tested the robustness of our host prediction algorithm to fragmentation of the phage sequence. 188 

Following earlier studies [15,16,18], phage genomes from Edwards et al. and Galiez et al. data 189 

sets were randomly subsampled to generate contigs of different lengths (20 kb, 10 kb, 5 kb, 3 kb, 190 

and 1 kb) with 10 replicates. Host prediction accuracy was calculated as the mean percentage of 191 

phages whose predicted hosts had the same taxonomic affiliation as their respective known hosts 192 

(Fig. 4). Although Phirbo achieved equal host prediction accuracy with BLAST across all contig 193 

lengths, it had substantially higher overall performance in terms of AUC and AUPR (Figure S2; 194 

P < 10−5, Wilcoxon signed-rank test). Surprisingly, BLAST-based methods obtained higher host 195 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


prediction accuracy across all contig lengths compared to WIsH, a tool designed to predict the 196 

hosts of short viral contigs (Fig. 4). 197 

 198 

The host prediction accuracy of Phirbo was examined using the expanded BLAST database of 199 

both prokaryotic and phage full-length sequences. To ensure fairness, for each tested phage contig 200 

we removed its corresponding full-length sequence from the BLAST database and recalculated 201 

Phirbo scores between the phage contig and every prokaryotic sequence. This approach 202 

outperformed BLAST at every contig length across all taxonomic levels in both data sets (Fig. 4). 203 

Generally, the host prediction accuracy of Phirbo improved by 5-11 percentage points compared 204 

to the BLAST results. For example, when the contig length was 3 kb, the prediction accuracy of 205 

Phirbo was 8-11% higher than BLAST at the family level, and 8-17% higher than WIsH (Fig. 4; 206 

Supplementary Table 4). Phirbo also achieved the highest AUC and AUPR scores when 207 

discriminating between interacting and non-interacting phage-host pairs (Figure S2). 208 

 209 

Phirbo uses multiple protein and non-coding RNA signals for host prediction 210 

We investigated the sequence information used by BLAST and Phirbo for host prediction. For 211 

each phage that was correctly assigned to the host species by both tools (n = 485), we calculated 212 

the fraction of the phage genome that was included in the segments aligned with prokaryotic 213 

sequences (sequence coverage). This analysis revealed that our tool used three times more phage 214 

sequence (median sequence coverage: 35%) than BLAST (12%) (Figure S3; P < 10-15, Wilcoxon 215 

signed-rank test). This increased sequence coverage indicates that different genome regions of the 216 

phages map to the genomes of prokaryotic species other than the host species. For 214 of the 485 217 

phages, more than half of their genomes were aligned to genomes of their host species 218 

(Supplementary Table 5). Such large regions of homology are likely prophages or phage debris 219 

left by large-scale recombination events during phage replication. The observed high sequence 220 

coverage points to the virus taxa, known for their temperate lifestyle and frequent recombination 221 

with host genomes (i.e., Siphoviridae family as well as the Peduovirinae and Sepvirinae 222 

subfamilies). 223 

 224 

To further examine the properties of sequences that may be exchanged between a phage and its 225 

host, we selected a population of phages with sequence coverage below 50% (n = 271). These 226 

phages, which are less likely to represent complete prophages, belong to 16 viral families 227 

(Supplementary Table 6). Next, we re-annotated the genomic sequences of the phages to find 228 

putative protein and non-coding RNA (ncRNA) genes. Phage sequence regions used by Phirbo for 229 

host predictions were significantly enriched (P < 10-5) in more than a hundred protein families of 230 

known or probable function. In contrast, only half of the protein families were used in BLAST-231 

based host predictions (Supplementary Table 7). The protein families used by Phirbo covered 232 

most of the processes of the viral life cycle including DNA replication, cell lysis, recombination, 233 

and packaging of the phage genome (Fig. 5). In contrast to BLAST, Phirbo also exploited the 234 

information contained in phage ncRNAs while assigning phages to host genomes. The vast 235 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


majority of these ncRNAs (>90%) were tRNAs, which showed significant overrepresentation in 236 

the phage sequence fragments used by Phirbo (P = 6 × 10-12) (Supplementary Table 8). The 237 

remaining ncRNAs belonged to group I introns (3%), RNAs associated with genes associated with 238 

twister and hammerhead ribozymes (1%), skipping-rope RNA motifs (1%), and 12 less abundant 239 

RNA families. 240 

 241 

Implementation and availability 242 

Predicting hosts from phage sequences using BLAST is accomplished by querying phage 243 

sequences against a database of candidate hosts. However, Phirbo also uses information about 244 

sequence relatedness among prokaryotic genomes. Therefore, it requires ranked lists of prokaryote 245 

species generated by BLAST for the phage and host genomes. The computational cost of querying 246 

every host sequence against the database of all candidate hosts using BLAST may still be a limiting 247 

factor. However, for mass host searches, the computational cost of all-versus-all host comparisons 248 

becomes marginal, as it must be done only once. After the relatedness among host genomes is 249 

established, the time required for Phirbo host predictions is negligibly higher than the time for 250 

typical BLAST-based host predictions. For example, running Phirbo between ranked lists of host 251 

species for 1,420 phages and 3,860 candidate hosts from Galiez et al. (resulting in ~5.5 million 252 

phage-host comparisons) took 8 minutes on a 16-core 2.60GHz Intel Xeon. 253 

 254 

As Phirbo operates on rankings, BLAST can be replaced by an alternative sequence similarity 255 

search tool to reduce the time to estimate homologous relationships between host genomes. For 256 

instance, Mash [27] computed host relationships in 5 minutes for the Edwards et al. and Galiez et 257 

al. data sets (see ‘Methods’). The host prediction performance of Phirbo using BLAST-based 258 

rankings for phages and Mash-based rankings for host genomes is high compared to the 259 

performance of Phirbo predictions using BLAST rankings for both phage and host genomes 260 

(Supplementary Table 9). 261 

 262 

We envisage Phirbo as a natural extension to standard BLAST-based host predictions. The Phirbo 263 

tool is written in Python and freely available at https://github.com/aziele/phirbo/. 264 

 265 

DISCUSSION 266 

The identification of similar sequence regions between host and phage genomes using BLAST has 267 

been a baseline for the identification of putative virus-host connections in numerous metagenomic 268 

projects [13,28,29]. However, a BLAST search requires regions with significant similarity 269 

between the query phage and host [14–16]. Yet, many phage and host sequences lack sufficient 270 

similarity and escape detection with standard BLAST searches. To tackle this issue, alignment-271 

free tools have been developed to predict hosts from phage sequences [14–16,30]. The rationale 272 

behind these tools is based on the observation that viruses tend to share similar patterns in codon 273 

usage or short sequence fragments with their hosts [14–16]. As virus replication is dependent on 274 

the translational machinery of its host, some phages adapt their codon usage to match the 275 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://github.com/aziele/phirbo/
https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


availability of tRNAs during viral replication in the host cell [31–33]. Similar oligonucleotide 276 

frequency use may be driven by evolutionary pressure on the virus to avoid recognition by host 277 

restriction enzymes and CRISPR/Cas defense systems [32,34]. Although state-of-the-art 278 

alignment-free tools (i.e., WIsH [16] and VirusHostMatcher [15]) can rapidly assess sequence 279 

similarity between any pair of phage and prokaryote sequences, they are less accurate for host 280 

prediction than BLAST [14,15]. The relatively high accuracy of BLAST suggests that localized 281 

similarities of genetic material may be a stronger indication of phage-host interactions than global 282 

convergence of their genomic composition. This evidence comes in the form of protein-coding 283 

DNA fragments and non-coding RNAs. The latter group is dominated by tRNA genes, which are 284 

strongly over-represented in direct BLAST alignments between phages and their hosts, and are 285 

even more prevalent among indirect connections used by Phirbo. This may be important, as 286 

previous studies have shown that not all phage tRNA genes come directly from their hosts. Some 287 

appear to be derived from genomes of other, often distantly related, bacteria and may be the result 288 

of earlier evolutionary events [35]. For protein-coding genes, a more diverse picture emerges. 289 

Proteins rich in phage-host BLAST alignments can be assigned into different functional categories 290 

including phage virion components, replication-related proteins, regulatory factors, and proteins 291 

involved in the metabolism of the host. The transfer of some over-represented families in phages 292 

and/or prophages has been previously reported (e.g., lytic proteins, DNA replication and 293 

recombination proteins, and enzymes involved in nucleotide and energy metabolisms [36]) and 294 

some of these genes are connected with the phage-host range [37,38]. However, no clear pattern 295 

emerges after analyzing the functions of the remaining, over-represented proteins. 296 

 297 

In this study, we attempted to expand the information content of a single local alignment of phage 298 

and host sequences by incorporating the results of multiple local alignments between a phage 299 

sequence and different prokaryotic genomes. This approach may more closely resemble a manual 300 

assignment of phage-host pairs, where an expert analyst not only considers a top-ranked matching 301 

prokaryote in the BLAST results, but also uses the information contained in other, less significant, 302 

matches and their sequence and taxonomic similarity. Through a taxonomically-aware 303 

stratification scheme, this approach tracks the multilateral dynamics of horizontal gene transfer. 304 

Therefore, we propose to relate phage and host sequences through multiple intermediate sequences 305 

that are detectably similar to both the phage and host sequences. By linking phage and host 306 

sequences through similar sequences, Phirbo achieved a more comprehensive list of phage-host 307 

interactions than BLAST. Simultaneously, Phirbo was capable of assessing almost all phage-host 308 

pairs, bringing the method closer to alignment-free tools, which compute scores between all 309 

possible phage and host pairs. Thus, our approach can be directly applied to different phage and 310 

prokaryote data sets without training or optimizing the underlying RBO algorithm. We 311 

intentionally avoided machine learning components in Phirbo to ensure the general applicability 312 

of the approach and avoid possible overfitting. 313 

 314 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


Our results show that expanding the information obtained from plain similarity comparisons by 315 

incorporating taxonomically-grounded measurements of phage-host similarity leads to improved 316 

accuracy of phage-host predictions. The Phirbo method provides the phage research community 317 

with an easy-to-use tool for predicting the host genus and species of query phages, which is usable 318 

when searching for phages with appropriate host specificity and for correlating phages and hosts 319 

in ecological and metagenomic studies. 320 

 321 

METHODS 322 

 323 

Virus and prokaryotic host data sets 324 

The data sets analyzed in this study were retrieved from two previously published phage-host 325 

studies [14,16]. The first set (Edwards et al. 2016 [14]) contained 2,699 complete bacterial 326 

genomes obtained from NCBI RefSeq and 820 RefSeq genomes of phages for which the host was 327 

reported. The data set encompassed 16,757 known virus-host interaction pairs and 2,196,424 pairs 328 

for which interaction was not reported (non-interacting phage-host pairs). The second data set 329 

(Galiez et al. 2017 [16]) contained 3,780 complete prokaryotic genomes of the KEGG database 330 

and 1420 phages for which host species were reported in the RefSeq Virus database. The data set 331 

consisted of 26,024 interacting- and 5,341,576 non-interacting virus-host pairs. 332 

 333 

Phirbo score 334 

The interaction score for a given phage-host pair was calculated using the RBO metric. RBO [23] 335 

is a measurement of rank similarity that compares two lists of different lengths (giving more 336 

attention to high ranks on the lists). RBO ranges from 0 to 1, where a greater value indicates greater 337 

similarity between lists. Equation 1 was used for the calculation of the RBO value between two 338 

ranking lists, S and T. 339 

 340 

𝑅𝐵𝑂(𝑆, 𝑇, 𝑝) = (1 − 𝑝)  ∑ 𝑝𝑑−1
𝑛

𝑑=1

𝐴(𝑆, 𝑇, 𝑑) 341 

 342 

where the parameter p (0 < p < 1) determines how steeply the weight declines (the smaller the p, 343 

the more top results are weighted). When p = 0, only the top-ranked item is considered, and the 344 

RBO score is either zero or one. In this study, we set p to 0.75, which assigned ~98% of the weight 345 

to the first 10 hosts. A(S, T, d) is the value of overlap between the two ranking lists, S and T, up to 346 

rank d, calculated by Eq. 2. n is the number of distinct ranks on the ranking list. 347 

 348 

𝐴(𝑆, 𝑇, 𝑑)  =  
|𝑆:𝑑  ∩  𝑇:𝑑 |

|𝑆:𝑑  ∪  𝑇:𝑑 |
 349 

 350 

where S:d and T:d represents the elements present in the first d ranks of lists S and T, respectively. 351 

 352 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


Host prediction tools 353 

The host prediction tools BLAST [20], WIsH [16], and Phirbo were run separately in the Edwards 354 

et al. and Galiez et al. data sets. For each tool, sequence similarity scores were calculated across 355 

all combinations of phage-host pairs. BLAST 2.7.1+ [39] was run with default parameters (task: 356 

blastn, e-value threshold = 10) to query each phage sequence against a database of candidate host 357 

genomes. For each BLAST alignment, the highest bit-score between every phage-host pair was 358 

reported (for phage-host pairs that were absent in the BLAST results, a bit-score of 0 was 359 

assigned). For RBO host prediction, an additional BLAST search was performed to establish 360 

ranked lists of genetically similar host genomes. Specifically, a nucleotide BLAST was run with 361 

default parameters to query each host sequence against a database of candidate host genomes. As 362 

an alternative to BLAST, Mash 2.1 [27] was used with default parameters (k-mer size = 21, sketch 363 

size = 1,000) to establish ranked lists for each host by comparing its sequence against the database 364 

of candidate host genomes. RBO scores were calculated between all pairwise combinations of 365 

phage and host ranking lists. WIsH 1.0 [16] was used with default parameters to calculate log-366 

likelihood scores between all pairwise combinations of phage-host sequences. 367 

 368 

Evaluation metrics 369 

The metrics of host prediction performance were calculated using sklearn (i.e., AUC, AUPR, 370 

recall, precision, specificity, and accuracy) [40]. Optimal score thresholds to calculate recall, 371 

precision, specificity, and accuracy was computed as maximizing the F1 score, an accuracy metric, 372 

which is the harmonic mean of precision and recall. Host prediction accuracy was evaluated 373 

analogous to previous studies [14,16,18]. Specifically, for each query phage, the host with the 374 

highest score to the query virus was selected as the predicted host. In cases where multiple hosts 375 

were predicted, the prediction was scored as correct if the correct host was among the predictions. 376 

The prediction accuracy was calculated at each taxonomic level as the percentage of viruses whose 377 

predicted hosts shared a taxonomic affiliation with known hosts. 378 

 379 

Phage genome annotation 380 

To define phage genes potentially exchanged between phage and host genomes, we re-annotated 381 

485 phage genomes that were correctly assigned to host species by both Phirbo and BLAST. The 382 

genes were classified into predefined pVOGs (prokaryotic Virus Orthologous Groups) [41] and 383 

RNA families [42]. Briefly, open reading frames (ORFs) in the analyzed 485 phage genomes were 384 

identified using Transeq from EMBOSS [43]. The ORFs were then assigned to the respective 385 

orthologue group by HMMsearch (e-value < 10-5) against the database of Hidden Markov Models 386 

(HMMs) created for every of 9,518 pVOG alignments using HMMbuild of HMMER v3.3.1 [44]. 387 

Non-coding RNAs (ncRNAs) were predicted in the phage genomes (e-value < 10-5) using Rfam 388 

covariance models v14.3 [42] and the Infernal tool v1.1.3 [45]. We counted the number of times 389 

each pVOG and Rfam term was present in phage sequences used by BLAST and Phirbo during 390 

host prediction. To determine whether the observed level of pVOG/Rfam counts was significant 391 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


within the context of all the terms within the phage genome, we calculated the p-value using the 392 

hypergeometric distribution implemented in Scipy [46]. 393 

 394 

ACKNOWLEDGMENTS 395 

We thank Bas Dutilh, Rob Edwards, Clovis Galiez, and Johannes Söding for providing us with the 396 

benchmark data sets used in their studies. We likewise acknowledge William Webber for 397 

assistance with modifying the RBO formula to account for tied ranks. The computations were 398 

performed at the Poznan Supercomputing and Networking Center. 399 

 400 

AUTHOR CONTRIBUTIONS 401 

AZ conceived the project and designed the experiments. AZ and JB wrote Phirbo and tested its 402 

performance. WMK provided the conceptual framework for sequence comparisons through 403 

intermediate sequences and reviewed the software and manuscript. AZ and JB analyzed the results 404 

and wrote the paper. All authors read and approved the final manuscript.  405 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


FIGURE LEGENDS 406 

 407 

Figure 1. Calculation of the interaction score between phage and host sequences. a. The 408 

BLAST search of phage and prokaryote sequences against a reference dataset result in b. two 409 

BLAST lists containing prokaryote matches ordered by decreasing similarity (i.e., bit-score). c. 410 

BLAST lists were converted into rankings of prokaryote species. The ranked lists differ in 411 

content: Yersinia rohdei and Y. ruckeri are present in the first ranking list but absent in the 412 

second list, while Shigella dysenteriae and Erwinia toletana are only present in the second list. 413 

Two species, Y. rohdei and Y. ruckeri, from the first BLAST search have the same scores and are 414 

consequently tied for the same rank. d. An interaction score was calculated between two ranking 415 

lists using rank-biased overlap. 416 

 417 

Figure 2. Discriminatory power of Phirbo, BLAST, and WIsH scores to differentiate 418 

between interacting and non-interacting phage-host pairs. Phage-host pairs were obtained 419 

from a. Edwards et al. and b. Galiez et al. data sets. Box plots show the distribution of scores for 420 

all interacting phage-host pairs (n = 16,757 and n = 26,024 in Edwards et al. and Galiez et al., 421 

respectively) and the same number of randomly selected, non-interacting phage-host pairs. The 422 

horizontal line in each box displays the median; boxes display the first and third quartiles; 423 

whiskers depict lowest and highest non-outlier scores (details of distributions including outliers 424 

are provided in Supplementary Table 1). Receiver operating characteristic curves and the 425 

corresponding area under the curve (AUC) display the classification accuracy of phage–host 426 

predictions across all possible phage-host pairs. Dashed lines represent the levels of 427 

discrimination expected by chance. 428 

 429 

Figure 3. Host prediction performance of Phirbo, BLAST, and WIsH. The performance is 430 

provided by Precision-Recall (PR) curves and statistical measures (i.e., F1 score, precision, 431 

recall, specificity, and accuracy) separately for a. Edwards et al. and b. Galiez et al. data sets. 432 

Dashed lines in the PR-curve plots represent the levels of discrimination expected by chance. 433 

Score cut-offs for each tool were set to ensure the highest F1 score. 434 

 435 

Figure 4. Host prediction accuracy over phage contig length. Prediction accuracy is provided 436 

separately for a. Edwards et al. and b. Galiez et al. data sets. Each complete virus genome was 437 

randomly subsampled 10 times for different sequence lengths (i.e., 20 kb, 10 kb, 5 kb, 3 kb, and 438 

1 kb). Hosts were predicted on each subsampling replicate by selecting a prokaryotic sequence 439 

with the highest similarity to the query viral sequence. Points indicate the average of the 440 

resulting accuracies for all the viruses at a given subsampling length and host taxonomic level 441 

(i.e., species, genus, and family). An extended version of this figure containing host prediction 442 

accuracy values is provided in Supplementary Table 4. 443 

 444 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


Figure 5. Functional classification of phage coding sequences used by Phirbo for host 445 

prediction. Protein families (pVOGs) were classified into 15 functions related to phage-cycle 446 

(e.g., DNA replication, transcription). Numbers in the dark circles indicate the number of 447 

different pVOGs related to a given function. An extended version of this figure containing the 448 

list of pVOGs is provided in Supplementary Table 7. 449 

  450 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


TABLES 451 

 452 

Table 1. Host prediction accuracies (%) for phage and host genomes from the data sets by 453 

Edwards et al. [14] and Galiez et al. [16]. 454 

Dataset Method Species Genus Family Order Class Phylum 

Edwards et al. (2016) WIsH 28 44 50 53 62 70 

 BLAST 43 59 71 78 87 96 

 Phirbo* 43 59 71 78 87 95 

 Phirbo (+phages)† 48 63 75 82 90 97 

Galiez et al. (2017) WIsH 21 44 48 53 68 77 

 BLAST 31 53 62 68 88 95 

 Phirbo* 31 53 62 68 88 95 

 Phirbo (+phages)† 35 56 65 72 90 96 

The highest accuracies among the methods for each taxonomic level are in bold. 455 

* Interaction scores were calculated using rank-biased overlap (RBO) between BLAST lists containing prokaryotic 456 

sequences. Specifically, the BLAST database contained 2,699 sequences of bacterial genomes in the Edwards et al. 457 

data set, and 3,780 sequences of bacterial and archaeal genomes in the Galiez et al. data set.  458 

† Interaction scores were calculated using RBO between BLAST lists containing both prokaryotic and phage 459 

sequences. 460 

  461 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


SUPPLEMENTARY FIGURES 462 

 463 

Supplementary Figure 1. Host predictions for Cronobacter phage ENT39118 (RefSeq 464 

accession: NC_019934) using a. BLAST and b. Phirbo. Querying the Cronobacter phage 465 

sequence with a BLAST search against the host database returned the genomic sequence of 466 

Escherichia coli (NC_017641) as the best match (bit-score = 14,588), and Cronobacter sakazakii 467 

(NC_009778) as the second-best match (bit-score = 14,020). Phirbo predicted Cronobacter 468 

sakazakii as the top-score host for the Cronobacter phage due to the highest extent of overlap 469 

between the top-ranking BLAST matches of each sequence (NC_019934 and NC_009778) of the 470 

same database. For clarity, only the first ten BLAST matches are shown. 471 

 472 

Supplementary Figure 2. Host prediction performance of Phirbo, BLAST and WIsH over 473 

phage contig length in terms of a. Area under the curve (AUC) and b. Area under the precision-474 

recall curve (AUPR). Bars indicate the AUC or AUPR averaged across 10 replicates at a given 475 

subsampling length of phage sequence. 476 

 477 

Supplementary Figure 3. Scatter plot of the phage sequence coverage used in host predictions 478 

of Phirbo versus that of BLAST. Each dot represents a phage genome.  479 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


SUPPLEMENTARY TABLES 480 

 481 

Supplementary Table 1. Distribution of Phirbo, BLAST and WIsH scores among interacting 482 

and non-interacting phage-host pairs obtained from Edwards et al. and Galiez et al. data sets. 483 

Score ranges were summarized separately for 16,757 interacting and non-interacting phage-host 484 

pairs from Edwards et al., and 26,024 interacting and non-interacting phage-host pairs from 485 

Galiez et al. 486 

 487 

Supplementary Table 2. Number of phage-host pairs evaluated by Phirbo, BLAST, and WIsH 488 

in Edwards et al. and Galiez et al. data sets.  489 

 490 

Supplementary Table 3. Phages assigned by BLAST to multiple, equally-scored host species. 491 

Phirbo differentiated between host species and provided the highest score to primary host 492 

species. 493 

 494 

Supplementary Table 4. Host prediction accuracy of Phirbo, BLAST, and WIsH over phage 495 

contig length. 496 

 497 

Supplementary Table 5. Phage sequence coverage of 485 phages correctly assigned by BLAST 498 

and Phirbo to their host species. Sequence coverage was calculated for each phage as the sum of 499 

the lengths of its non-overlapping high scoring pairs to the genome of the correct host species, 500 

divided by the size of the query-phage genome. Prophages were assumed to have sequence 501 

coverage greater than or equal to 50%. 502 

 503 

Supplementary Table 6. Summary of taxonomic affiliations of 271 phages that had sequence 504 

coverage < 50% with the host species genomes. 505 

 506 

Supplementary Table 7. Protein families present in sequence regions of 271 phage genomes 507 

that were used by BLAST and/or Phirbo in host prediction. The table provides information on 508 

each protein family (prokaryotic Virus Orthologous Group (pVOG)) used by BLAST and 509 

Phirbo, including: (i) pVOG description and functional assignment (manually curated), (ii) 510 

pVOG count (number of times a given pVOG was present in the phage genome, as well as in 511 

sequences used by BLAST or Phirbo), (iii) pVOG percentage (pVOG count divided by pVOG 512 

count in the genome), and (iii) P-value of pVOG enrichment. 513 

 514 

Supplementary Table 8. RNA families present in sequence regions of 271 phage genomes that 515 

were used by BLAST and Phirbo in host prediction. The table provides information on each 516 

Rfam family used by BLAST and Phirbo. 517 

 518 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


Supplementary Table 9. Comparison of Phirbo’s host prediction performance between BLAST-519 

based and Mash-based rankings of prokaryotic species. 520 

  521 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


REFERENCES 522 

 523 

1.  Suttle CA. Marine viruses--major players in the global ecosystem. Nat Rev Microbiol. 524 

2007;5: 801–812. 525 

2.  Breitbart M, Bonnain C, Malki K, Sawaya NA. Phage puppet masters of the marine 526 

microbial realm. Nat Microbiol. 2018;3: 754–766. 527 

3.  Roux S, Brum JR, Dutilh BE, Sunagawa S, Duhaime MB, Loy A, et al. Ecogenomics and 528 

potential biogeochemical impacts of globally abundant ocean viruses. Nature. 2016;537: 529 

689–693. 530 

4.  Norman JM, Handley SA, Baldridge MT, Droit L, Liu CY, Keller BC, et al. Disease-531 

specific alterations in the enteric virome in inflammatory bowel disease. Cell. 2015;160: 532 

447–460. 533 

5.  Manrique P, Bolduc B, Walk ST, van der Oost J, de Vos WM, Young MJ. Healthy human 534 

gut phageome. Proc Natl Acad Sci U S A. 2016;113: 10400–10405. 535 

6.  Meyer JR. Sticky bacteriophage protect animal cells. Proceedings of the National Academy 536 

of Sciences of the United States of America. Proceedings of the National Academy of 537 

Sciences; 2013. pp. 10475–10476. 538 

7.  Reardon S. Phage therapy gets revitalized. Nature. 2014;510: 15–16. 539 

8.  Salmond GPC, Fineran PC. A century of the phage: past, present and future. Nat Rev 540 

Microbiol. 2015;13: 777–786. 541 

9.  Svoboda E. Bacteria-eating viruses could provide a route to stability in cystic fibrosis. 542 

Nature. 2020;583: S8–S9. 543 

10.  Dedrick RM, Guerrero-Bustamante CA, Garlena RA, Russell DA, Ford K, Harris K, et al. 544 

Engineered bacteriophages for treatment of a patient with a disseminated drug-resistant 545 

Mycobacterium abscessus. Nat Med. 2019;25: 730–733. 546 

11.  Samson JE, Moineau S. Bacteriophages in food fermentations: new frontiers in a continuous 547 

arms race. Annu Rev Food Sci Technol. 2013;4: 347–368. 548 

12.  Sulakvelidze A. Using lytic bacteriophages to eliminate or significantly reduce 549 

contamination of food by foodborne bacterial pathogens. J Sci Food Agric. 2013;93: 3137–550 

3146. 551 

13.  Paez-Espino D, Eloe-Fadrosh EA, Pavlopoulos GA, Thomas AD, Huntemann M, 552 

Mikhailova N, et al. Uncovering earth’s virome. Nature. 2016;536: 425–430. 553 

14.  Edwards RA, McNair K, Faust K, Raes J, Dutilh BE. Computational approaches to predict 554 

bacteriophage–host relationships. FEMS Microbiol Rev. 2016;40: 258–272. 555 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


15.  Ahlgren NA, Ren J, Lu YY, Fuhrman JA, Sun F. Alignment-free d_2^* oligonucleotide 556 

frequency dissimilarity measure improves prediction of hosts from metagenomically-557 

derived viral sequences. Nucleic Acids Res. 2017;45: 39–53. 558 

16.  Galiez C, Siebert M, Enault F, Vincent J, Söding J. WIsH: who is the host? Predicting 559 

prokaryotic hosts from metagenomic phage contigs. Bioinformatics. 2017;33: 3113–3114. 560 

17.  Andersson AF, Banfield JF. Virus population dynamics and acquired virus resistance in 561 

natural microbial communities. Science. 2008;320: 1047–1050. 562 

18.  Wang W, Ren J, Tang K, Dart E, Ignacio-Espinoza JC, Fuhrman JA, et al. A network-based 563 

integrated framework for predicting virus-prokaryote interactions. NAR Genom Bioinform. 564 

2020;2: lqaa044. 565 

19.  Zhang M, Yang L, Ren J, Ahlgren NA, Fuhrman JA, Sun F. Prediction of virus-host 566 

infectious association by supervised learning methods. BMC Bioinformatics. 2017;18: 60. 567 

20.  Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST 568 

and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 569 

1997;25: 3389–3402. 570 

21.  Lima-Mendez G, Faust K, Henry N, Decelle J, Colin S, Carcillo F, et al. Ocean plankton. 571 

Determinants of community structure in the global plankton interactome. Science. 572 

2015;348: 1262073. 573 

22.  Flores CO, Meyer JR, Valverde S, Farr L, Weitz JS. Statistical structure of host-phage 574 

interactions. Proc Natl Acad Sci U S A. 2011;108: E288-97. 575 

23.  Webber W, Moffat A, Zobel J. A similarity measure for indefinite rankings. ACM Trans Inf 576 

Syst. 2010;28: 1–38. 577 

24.  Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot 578 

when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10: e0118432. 579 

25.  Davis J, Goadrich M. The relationship between Precision-Recall and ROC curves. 580 

Proceedings of the 23rd international conference on Machine learning - ICML ’06. New 581 

York, New York, USA: ACM Press; 2006. doi:10.1145/1143844.1143874 582 

26.  Villarroel J, Kleinheinz KA, Jurtz VI, Zschach H, Lund O, Nielsen M, et al. HostPhinder: A 583 

phage host prediction tool. Viruses. 2016;8. doi:10.3390/v8050116 584 

27.  Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, et al. Mash: fast 585 

genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17. 586 

doi:10.1186/s13059-016-0997-x 587 

28.  Gao NL, Zhang C, Zhang Z, Hu S, Lercher MJ, Zhao X-M, et al. MVP: a microbe–phage 588 

interaction database. Nucleic Acids Res. 2018;46: D700–D707. 589 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


29.  Paez-Espino D, Roux S, Chen I-MA, Palaniappan K, Ratner A, Chu K, et al. IMG/VR 590 

v.2.0: an integrated data management and analysis system for cultivated and environmental 591 

viral genomes. Nucleic Acids Res. 2019;47: D678–D686. 592 

30.  Roux S, Hallam SJ, Woyke T, Sullivan MB. Viral dark matter and virus-host interactions 593 

resolved from publicly available microbial genomes. Elife. 2015;4. 594 

doi:10.7554/eLife.08490 595 

31.  Lawrence JG, Ochman H. Amelioration of bacterial genomes: rates of change and 596 

exchange. J Mol Evol. 1997;44: 383–397. 597 

32.  Pride DT, Wassenaar TM, Ghose C, Blaser MJ. Evidence of host-virus co-evolution in 598 

tetranucleotide usage patterns of bacteriophages and eukaryotic viruses. BMC Genomics. 599 

2006;7: 8. 600 

33.  Carbone A. Codon bias is a major factor explaining phage evolution in translationally 601 

biased hosts. J Mol Evol. 2008;66: 210–223. 602 

34.  Sharp PM, Rogers MS, McConnell DJ. Selection pressures on codon usage in the complete 603 

genome of bacteriophage T7. J Mol Evol. 1984;21: 150–160. 604 

35.  Morgado S, Vicente AC. Global in-silico scenario of tRNA genes and their organization in 605 

virus genomes. Viruses. 2019;11: 180. 606 

36.  Sousa JAM de, Pfeifer E, Touchon M, Rocha EPC. Genome diversification via genetic 607 

exchanges between temperate and virulent bacteriophages. bioRxiv. bioRxiv; 2020. 608 

doi:10.1101/2020.04.14.041137 609 

37.  Shapiro JW, Putonti C. Gene co-occurrence networks reflect bacteriophage ecology and 610 

evolution. MBio. 2018;9. doi:10.1128/mbio.01870-17 611 

38.  Hernandes Coutinho F, Zaragosa-Solas A, López-Pérez M, Barylski J, Zielezinski A, Dutilh 612 

BE, et al. RaFAH: A superior method for virus-host prediction. bioRxiv. bioRxiv; 2020. 613 

doi:10.1101/2020.09.25.313155 614 

39.  Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: 615 

architecture and applications. BMC Bioinformatics. 2009;10: 421. 616 

40.  Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: 617 

Machine Learning in Python. J Mach Learn Res. 2011;12: 2825–2830. 618 

41.  Grazziotin AL, Koonin EV, Kristensen DM. Prokaryotic Virus Orthologous Groups 619 

(pVOGs): a resource for comparative genomics and protein family annotation. Nucleic 620 

Acids Res. 2017;45: D491–D498. 621 

42.  Kalvari I, Nawrocki EP, Ontiveros-Palacios N, Argasinska J, Lamkiewicz K, Marz M, et al. 622 

Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids 623 

Res. 2020. doi:10.1093/nar/gkaa1047 624 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


43.  Rice P, Longden I, Bleasby A. EMBOSS: The European molecular biology open software 625 

suite. Trends Genet. 2000;16: 276–277. 626 

44.  Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity 627 

searching. Nucleic Acids Res. 2011;39: W29-37. 628 

45.  Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. 629 

Bioinformatics. 2013;29: 2933–2935. 630 

46.  Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 631 

1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17: 632 

261–272. 633 

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


BLAST

Reference prokarote DNA database (D)

match score

E. coli K12

E. coli O157:H7

S. flexneri 2a

S. boydii

E. coli K12

E. coli O157:H7

E. coli M34

S. flexneri 2a

S. boydii

E. toletana

S. dysenteriae

Y. rohdei

S. flexneri 2bRank species

Compare rankings

2 540

2 210

1 530

1 290

810

948 390

902 110

836 880

434 420

407 230

385 970

328 660

183 230

match

match rank

E. coli

S. boydii

Y. rohdei, Y. ruckeri

S. flexneri

1

2

3

4

S. flexneri

E. coli

S. dysenteriae

E. toletana

S. boydii

1

2

3

4

5

match rank

AGTCGTGTACTGCGCGCCGCGCGCCAGGAC

GGTTCGGCCAACGACTGGGTCCTTATCGAT
CCAACGACGACGGCTCCAACGACGTTAGGC
ACGTTACCGTTTAGGCGCGATGCGATGCGT

Phage DNA sequence (P)

a

b

c

d

score

Host DNA sequence (H)

Rank-Biased Overlap (RBO) = 0.76

Y. ruckeri 810

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


a

S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

0

0.2

0.4

0.6

0.8

S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

0

1000

2000

3000

 S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

-1.5

-1.45

-1.4

-1.35

-1.3

-1.25
Phirbo BLAST WIsH

1

S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

0

0.2

0.4

0.6

0.8

 S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

-1.5

-1.45

-1.4

-1.35

-1.3

Phirbo WIsH
1

b

0
0

0.2

0.4

0.6

0.8

1

0.2 0.4 0.6 0.8 1

T
ru

e
 P

o
s
it
iv

e
 R

a
te

False Positive Rate

AUC = 0.95	

AUC = 0.86	

AUC = 0.79

0
0

0.2

0.4

0.6

0.8

0.2 0.4 0.6 0.8 1

T
ru

e
 P

o
s
it
iv

e
 R

a
te

False Positive Rate

1

WIsHBLASTPhirbo

AUC = 0.95	

AUC = 0.86	

AUC = 0.78

-1.25

4000

5000

S
im

ila
ri

ty
 s

c
o

re

interaction non-interaction

0

1000

2000

3000

BLAST

4000

5000

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


a

AUPR = 0.60	

AUPR = 0.38	

AUPR = 0.03

0

0

0.2

0.4

0.6

0.8

1

0.2 0.4 0.6 0.8 1

Recall

WIsHBLASTPhirbo

P
re
ci
si
o
n

b

AUPR = 0.47	

AUPR = 0.28	

AUPR = 0.02

0

0

0.2

0.4

0.6

0.8

1

0.2 0.4 0.6 0.8 1

Recall

P
re
ci
si
o
n

F1 score	

Recall	

Precision	

Specificity	

Accuracy	

0.646	

0.641	

0.651	

0.997	

0.995

0.434	

0.362	

0.542	

0.995	

0.993	

0.084	

0.225	

0.052	

0.969	

0.963	

Phirbo BLAST WIsH

F1 score	

Recall	

Precision	

Specificity	

Accuracy	

0.568	

0.550	

0.589	

0.998	

0.996

0.348	

0.279	

0.462	

0.998	

0.995	

0.045	

0.210	

0.025	

0.961	

0.957	

WIsHBLASTPhirbo

Score cut-off 0.40 731 -1.34 Score cut-off 0.40 919 -1.34

Phirbo BLAST WIsH

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


a

3 5 10 201
0

20

40

60

80
Species

P
re

d
ic

ti
o

n
 a

c
c
u

ra
c
y
 (

%
)

Sequence length (kb)

Genus Family

b

Phirbo (+phages) BLAST / Phirbo WIsH

3 5 10 201
0%

20%

40%

60%

80%

Sequence length (kb)

3 5 10 201
0%

20%

40%

60%

80%

Sequence length (kb)

3 5 10 201
0

20

40

60

80
Species

P
re

d
ic

ti
o

n
 a

c
c
u

ra
c
y
 (

%
)

Sequence length (kb)

Genus Family

3 5 10 201
0%

20%

40%

60%

80%

Sequence length (kb)

3 5 10 201
0%

20%

40%

60%

80%

Sequence length (kb)

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/


Capsid head

Collar

Tail

Baseplate	
Fiber	
Spike

Amino acid metabolism

	
po

l

DNA replication

Genome	
packaging

Transcription	

Cell lysis

Host defence systems

Energy metabolism

Nucleotide metabolism

Bacterial chromosome

Integration / recombination

Other functions

A T
G

C
T

10

15

10

7

9

8

6

7

Antibiotic resistance

8

8

5
1

2

5

11

Full phage assembly

.CC-BY-NC 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made 

The copyright holder for this preprintthis version posted January 6, 2021. ; https://doi.org/10.1101/2021.01.05.425417doi: bioRxiv preprint 

https://doi.org/10.1101/2021.01.05.425417
http://creativecommons.org/licenses/by-nc/4.0/