key: cord-1018813-icm4i0bc
authors: Mollentze, Nardus; Babayan, Simon A.; Streicker, Daniel G.
title: Identifying and prioritizing potential human-infecting viruses from their genome sequences
date: 2020-11-12
journal: bioRxiv
DOI: 10.1101/2020.11.12.379917
sha: f8c2bee92a6a2691c4316aac87535181e37c3b94
doc_id: 1018813
cord_uid: icm4i0bc

Rapid assessment of which animal viruses may be capable of infecting humans is currently intractable, but would allow their prioritization for further investigation and pandemic preparedness. We developed machine learning algorithms that identify candidate zoonoses using evolutionary signals of host range encoded in viral genomes. This reduces lists of hundreds of viruses with uncertain human infectivity to tractable numbers for prioritized research, generalizes to virus families excluded from model training, can distinguish high-risk viruses within families that contain a minority of zoonotic species, and could have identified the exceptional risk of SARS-CoV-2 prior to its emergence. Genome-based risk assessment allows identification of high-risk viruses immediately upon discovery, increasing both the feasibility and likelihood of downstream virological and ecological characterization and allowing for evidence-driven virus surveillance.

Introduction:
Most emerging infectious diseases of humans are caused by viruses that originate from other animal species. Identifying these zoonotic threats prior to emergence is a major challenge since only a small minority of the estimated 1.67 million animal viruses may infect humans (1, 2). Existing models of human infection risk rely on viral phenotypic information that is unknown for newly discovered viruses (e.g., diversity of species a virus can infect) or that varies insufficiently to discriminate risk at the species or strain level (e.g., replication in the cytoplasm), limiting their predictive value (3, 4). Since most viruses are now discovered using untargeted sequencing, often involving many simultaneous discoveries with limited phenotypic data, an ideal approach would quantify the relative risk of human infectivity directly from sequence data alone. By identifying high-risk viruses warranting further investigation, such predictions could alleviate the growing imbalance between the rapid pace of virus discovery and the lower-throughput field and laboratory research needed to comprehensively evaluate risk.

Current models can identify well-characterized human-infecting viruses from genomic sequences (5, 6).
However, by training algorithms on very closely related viruses (i.e., strains of the same species) and potentially omitting secondary characteristics of viral genomes linked to infection capability, such models are less likely to find signals of zoonotic status that generalize across viruses. Consequently, predictions may be highly sensitive to substantial biases in current knowledge of viral diversity (2, 7). Overcoming these biases requires discovering and exploiting signals of human infectivity that generalize across unrelated viruses. Empirical and theoretical lines of evidence suggest such signals might exist (8, 9). For example, the depletion of CpG dinucleotides in vertebrate-infecting RNA virus genomes may have arisen to evade zinc-finger antiviral protein (ZAP), an interferon-stimulated gene (ISG) that initiates the degradation of CpG-rich RNA molecules (10). While ZAP occurs widely among vertebrates, increasingly recognized lineage-specificity in vertebrate antiviral defences opens the possibility that analogous, undescribed nucleic-acid targeting defences might be human (or primate) specific (11). Independently, the frequencies of specific codons in virus genomes can resemble those of their reservoir hosts, possibly owing to increased efficiency and/or accuracy of mRNA translation (12). As such, genome compositional similarity to human-adapted viruses or to the human genome may preadapt viruses for human infection (9, 13). We aimed to develop machine learning algorithms which use features engineered from viral and human genome sequences to predict the probability that any animal-infecting virus will infect humans given biologically relevant exposure.

Results and discussion:
We collected a single representative genome sequence from 861 RNA and DNA virus species spanning 36 viral families that contain animal-infecting species (fig. S1).
We labelled each virus as being capable of infecting humans or not using published species-specific reports as ground truth, and trained models to classify viruses accordingly. Importantly, given diagnostic limitations and the likelihood that not all viruses capable of human infection have had opportunities to emerge, viruses not reported to infect humans may represent unrealized or undocumented zoonoses or genuinely non-zoonotic species. Identifying these potential zoonoses was an a priori goal of our analysis.

We first evaluated whether evolutionary proximity to human-infecting viruses predictably elevates zoonotic risk. Gradient boosted machine (GBM) classifiers trained on virus taxonomy or the frequency of human-infecting viruses among close relatives ("phylogenetic neighbourhood" (14)) outperformed chance (median area under the receiver-operating characteristic curve [AUCm] = 0.604 and 0.558, respectively), but were no better than simply ranking novel viruses by the proportion of human-infecting viruses in each family ("taxonomy-based heuristic", AUCm = 0.596, fig. 1A), indicating the inability of these relatedness-based models to distinguish risk at scales below the viral family level.

We next quantified the performance of GBMs trained on genome composition (i.e., codon usage biases, amino acid biases and dinucleotide biases), calculated either directly from viral genomes ("viral genomic features") or based on similarity to three alternative sets of human gene transcripts ("human similarity features"): interferon-stimulated genes (ISGs), housekeeping genes, and all other genes. We hypothesized that viruses might optimally resemble ISGs since both tend to be expressed concomitantly in virus-infected cells. However, we also included non- [...] (16)). While dendrograms using raw [...] (18), while all lyssaviruses are assumed to be zoonotic (19).
The remaining viruses classified as high priority were from families not currently considered [...] Sorex araneus coronavirus T14 - as being at least as, or more likely to be capable of infecting humans than SARS-CoV-2; these should be considered high priority for further research.

[...] phenotypic models of zoonotic risk. The performance of our models, while imperfect, means that many potential zoonoses can be identified immediately after virus discovery and genome sequencing. Large-scale application of these models enables retrospective ranking-based prioritization of hundreds of recognized viruses as well as prospective ranking in parallel with virus discovery, spanning all RNA and DNA genome types (table S1). Importantly, our models predict baseline zoonotic potential, which ultimately will be modulated by ecological opportunities for emergence. Further, the societal impact of emergence will depend on capacity for human-to-human spread and the severity of human disease, which likely require additional non-genomic data to anticipate. Nonetheless, for both novel and recognized viruses, substantially reduced lists of candidate zoonoses heighten the feasibility of further ecological and virological characterisation.

Data
Although our primary interest was in zoonotic transmission, we trained models to predict the ability to infect humans in general, reasoning that patterns found in viruses predominately maintained by human-to-human transmission may contain genomic signals which also apply to zoonotic viruses. Data on the ability to infect humans were obtained by merging the data of (4) and (15), which contain species-level records of reported human infections. These datasets were supplemented by new literature searches for 15 species (7). In all cases, only viruses detected in humans by either PCR or sequencing were considered to have proven ability to infect humans.
All viruses for which no such reports were found were considered to not infect humans, although we emphasise that many of these viruses are poorly characterised and could therefore be unrecognized or unreported zoonoses. We therefore expect our models to further improve as these and new viruses become better characterized.

[...] contend that including currently unrecognized viruses is unlikely to improve the predictions of our models because: (a) most will be non-human infecting (an already over-represented class) and hence provide little additional information, (b) those which do infect humans will not generally be known to do so due to a lack of historic testing, adding misleading signals, and (c) [...]

[...] a randomly chosen human-infecting virus would be ranked higher than a randomly chosen virus which has not been reported to infect humans. When a given feature set or combination of feature sets comprised < 125 features, all features were retained. This was the case for models trained using only taxonomy (7 features) or phylogenetic neighbourhoods (2 features). Final models were trained using reduced feature sets.

[...] (27): [...] is the length of the query sequence (in nucleotides), [...] is the total number of nucleotides in the training set (i.e., the size of the database searched), and [...] is the bitscore for this particular alignment in the original blast search.

Feature importance and clustering
To assess the variability in feature importance while accounting for all viruses, feature importance was assessed across all 1000 models trained for bagging above. In each iteration, the influence of features was assessed using SHAP values, an approximation of Shapley values which describe the change in the predicted log odds of infecting humans attributable to each genome feature (16).
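The log-odds interpretation of SHAP values described above can be illustrated with a minimal sketch. It relies on the additive property of SHAP explanations: a baseline log odds plus the per-feature contributions for one virus gives that virus's predicted log odds. This is not the authors' code; the function name `predicted_probability`, the feature names and all numbers are hypothetical values chosen for illustration.

```python
import math


def predicted_probability(base_log_odds, shap_values):
    """Combine a baseline and per-feature SHAP contributions (log-odds scale)
    into a predicted probability of infecting humans."""
    log_odds = base_log_odds + sum(shap_values.values())
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical baseline and contributions for a single virus (assumed values).
base = -1.0
contributions = {
    "dinucleotide_bias_feature": 0.75,   # pushes the prediction up
    "codon_usage_feature": 0.5,          # pushes the prediction up
    "amino_acid_bias_feature": -0.25,    # pushes the prediction down
}

probability = predicted_probability(base, contributions)
print(f"Predicted probability: {probability:.3f}")
```

A positive contribution raises the predicted log odds relative to the baseline and a negative one lowers it, which is why contributions can be compared across features on a common scale.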
The overall importance of each feature was calculated as the mean of absolute SHAP values across all viruses in the training set of a given iteration (28). Because features tended to be highly correlated, we also report importance values for [...]

These values were used to calculate the pairwise Euclidean distances between all virus species using version 2.1.0 of the cluster library in R (31). Viruses were then clustered using agglomerative hierarchical clustering, calculating distances between clusters as the mean distance between all points in the respective clusters (i.e. UPGMA clustering). To explore patterns common to viruses from each class, clustering was performed separately for known human-infecting and other viruses.

To compare this explanation-based clustering with virus taxonomy, we also constructed a dendrogram based on taxonomic assignments as recorded in version 2018b of the ICTV master species list, using all taxonomic levels from phylum to subgenus. Since some levels of the ICTV taxonomy are not used consistently across all viruses, missing taxonomic levels were interpolated to ensure accurate representation of the underlying taxonomy. For example, for viruses which are not classified in a scheme which includes subfamilies, the next level downstream (genus) was repeated, thereby treating each genus as belonging to a distinct subfamily. Categorical taxonomic assignments were used to calculate pairwise Gower distances between virus species (32), before performing agglomerative hierarchical clustering as described above. We also assessed the ability of underlying genome feature values to reconstruct virus taxonomy by performing hierarchical clustering on a Euclidean distance matrix calculated directly from all genomic features (i.e. unreferenced genome, ISG similarity, housekeeping gene similarity and remaining gene similarity feature sets).
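The average-linkage (UPGMA) step described above can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation (which used the cluster library in R): the function name `upgma`, the virus labels and the two-dimensional feature vectors are hypothetical, and a real analysis would use one vector of values per virus species.

```python
from itertools import combinations
from math import dist


def upgma(points):
    """Agglomerative clustering with average linkage (UPGMA).

    points: dict mapping a label to a tuple of numeric values.
    Returns the merges in order, as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in sorted(points)]
    merges = []

    def average_distance(a, b):
        # Mean Euclidean distance over all pairs of members of a and b.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(*pair))
        merges.append((a, b, average_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


# Hypothetical per-virus value vectors (assumed for illustration).
example = {"A": (0.0, 0.0), "B": (0.0, 1.0), "C": (5.0, 0.0), "D": (5.0, 1.5)}
merges = upgma(example)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

The order and heights of the merges define the dendrogram; this exhaustive pairwise search is only practical for small inputs, and an established library routine would normally be preferred for hundreds of species.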
The similarity between dendrograms was assessed using the gamma correlation index of (17), as implemented in version 1.12.0 of the dendextend library in R (33). A null distribution for this statistic was calculated by randomly shuffling the labels (i.e., virus species names) of both dendrograms 1000 times. To assess the taxonomic depth at which dendrograms were concordant, the Fowlkes-Mallows index was calculated at each possible cut-point in the dendrograms being compared (34), again using the dendextend library. As before, a null distribution was generated by randomly shuffling the labels of both dendrograms 1000 times.

Ranking novel viruses
To illustrate the use of our models in practice, the best performing set of models (i.e. those trained using the best 125 features selected from among all genome feature-based feature sets) was used to predict the probability that novel viruses are able to infect humans.

References
Pandemics: spend on surveillance, not prediction
Ability to replicate in the cytoplasm predicts zoonotic transmission of livestock viruses
Host and viral traits predict zoonotic spillover from mammals
Rapid identification of human-infecting viruses
Viral zoonotic risk is homogenous among taxonomic orders of mammalian and avian reservoir hosts
Species-Specific Host-Virus Interactions: Implications for Viral Host Range and Virulence
Gene Mimicry in Influenza and Other RNA Viruses
CG dinucleotide suppression enables antiviral defence targeting non-self RNA
Fundamental properties of the mammalian innate immune system revealed by multispecies comparison of type I interferon responses
The extent of codon usage bias in human RNA viruses and its evolutionary origin
Evolutionary Basis of Codon Usage and Nucleotide Composition Bias in Vertebrate DNA Viruses
Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes
Epidemiological characteristics of human-infective RNA viruses. Scientific Data
Advances in Neural Information Processing Systems
Stability of Two Hierarchical Grouping Techniques Case 1: Sensitivity to Data
Almendravirus: A proposed new genus of Rhabdoviruses isolated from mosquitoes in tropical
The spread and evolution of rabies virus: Conquering new frontiers
Taxonomic patterns in the zoonotic potential of mammalian viruses
Human housekeeping genes, revisited
EnvStats: an R package for environmental statistics
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Building Predictive Models in R Using the caret Package
Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration
The Statistics of Sequence Similarity Scores
From local explanations to global understanding with explainable AI for trees
Clustering by Passing Messages Between Data Points
APCluster: an R package for affinity propagation clustering
cluster: Cluster analysis basics and extensions
A General Coefficient of Similarity and Some of Its Properties
dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering
A Method for Comparing Two Hierarchical Clusterings

Acknowledgments: We thank Laura Bergner [...] Additional funding was provided by the Medical Research Council through program grants MC_UU_12014 MC_UU_12014/12.
Author contributions
Competing interests: The authors declare no competing interests.
Data and materials availability: Data and code