key: cord-1033781-l7zjdurx
authors: Wardeh, Maya; Baylis, Matthew; Blagrove, Marcus S.C.
title: Predicting mammalian hosts in which novel coronaviruses can be generated
date: 2020-06-19
journal: bioRxiv
DOI: 10.1101/2020.06.15.151845
sha: 708157cd8e642cbc3071b9af87f2e232437a4205
doc_id: 1033781
cord_uid: l7zjdurx

Novel pathogenic coronaviruses, including SARS-CoV and SARS-CoV-2, arise by homologous recombination in a host cell [1,2]. This process requires a single host to be infected with more than one type of coronavirus, which recombine to form novel strains of virus with unique combinations of genetic material. Identifying possible sources of novel coronaviruses requires identifying hosts (termed recombination hosts) of more than one coronavirus type, in which recombination might occur. However, the majority of coronavirus-host interactions remain unknown, and therefore the vast majority of recombination hosts for coronaviruses cannot be identified. Here we show that there are 11.5-fold more coronavirus-host associations, and over 30-fold more potential SARS-CoV-2 recombination hosts, than have been observed to date. We show there are over 40-fold more host species with four or more different subgenera of coronaviruses. This underestimation of both number and […] novel coronavirus generation in wild and domesticated animals. Our results list specific high-risk hosts in which our model predicts homologous recombination could occur; our model identifies both wild and domesticated mammals, including known important and understudied species. We recommend these species for coronavirus surveillance, as well as enforced separation in livestock markets and agriculture.

The generation and emergence of three novel respiratory coronaviruses from mammalian reservoirs into human populations in the last 20 years, including one which has achieved pandemic status, suggests that one of the most pressing current research questions is: In which reservoirs could the next novel coronaviruses be generated and emerge from in future? Armed with this knowledge, we may be able to reduce the chance of emergence into human populations or develop potential mitigations in advance, such as by the strict monitoring and enforced separation of the hosts identified here in live animal markets, farms, and other close-quarters environments.

Coronaviridae are a family of positive-sense RNA viruses which can cause an array of diseases. In humans, these range from mild cold-like illnesses to the lethal respiratory tract infections mentioned above. Seven coronaviruses are known to infect humans [3]: SARS-CoV, MERS-CoV and SARS-CoV-2 cause severe disease, while HKU1, NL63, OC43 and 229E tend towards milder symptoms in most patients [4].

Coronaviruses undergo frequent host-shifting events between non-human animal species, or between non-human animals and humans [5-7], a process that may involve changes to the cells or tissues that the viruses infect (virus tropism). Such shifts have resulted in new animal diseases (such as bovine coronavirus (BCoV) disease [8] and canine coronavirus (CCoV) disease [9]) and human diseases (such as OC43 [10] and 229E [11]). The aetiological agent of COVID-19, SARS-CoV-2, is thought to have originated in bats [12] and shifted to humans via an intermediate reservoir host, likely a species of pangolin [13].
Comparison of the genetic sequences of bat and human coronaviruses has revealed five potentially important genetic regions involved in host specificity and shifting, with the Spike receptor binding domain believed to be the most important [1,5]. Homologous recombination is a natural process which brings together new combinations of genetic material, and hence new viral strains, from two similar but non-identical parent strains of virus. This recombination occurs when different strains co-infect an individual animal, with sequences from each parent strain in the genetic make-up of progeny virus. Homologous recombination has previously been demonstrated in many important viruses such as human immunodeficiency virus [14], classical swine fever virus [15], and throughout the Coronaviridae [1,2], including homologous recombination in Spike being implicated in the generation of SARS-CoV-2 [2].

As well as instigating host-shifting, homologous recombination in other regions of the virus genome could also introduce novel phenotypes into coronavirus strains already infectious to humans. There are at least seven potential regions for homologous recombination in the replicase and Spike regions of the SARS-CoV genome alone, with possible recombination partner viruses from a range of other mammalian and human coronaviruses [16]. Recombination events between two compatible partner strains in a shared host could thus lead to future novel coronaviruses, either by enabling pre-existing mammalian strains to infect humans, or by adding new phenotypes arising from different alleles to pre-existing human-affecting strains.

The most fundamental requirement for homologous recombination to take place is the co-infection of a single host with multiple coronaviruses.
However, our understanding of which hosts are permissive to which coronaviruses, the pre-requisite to identifying which hosts are potential sites for this recombination (henceforth termed 'recombination hosts'), remains extremely limited. Here, we utilise a similarity-based analytical pipeline to address this significant knowledge gap. Our approach predicts associations between coronaviruses and their potential mammalian hosts by integrating three perspectives or points of view encompassing: 1) genomic features depicting different aspects of coronaviruses (e.g. secondary structure, codon usage bias) extracted from complete genomes (sequences = 3,271, virus strains = 411); 2) ecological, phylogenetic and geospatial traits of potential mammalian hosts (n = 876); and 3) characteristics of the network that describes the linkage of coronaviruses to their observed hosts, which expresses our current knowledge of sharing of coronaviruses between various hosts and host groups.

Topological features of ecological networks have been successfully utilised to enhance our understanding of pathogen sharing [17,18], disease emergence and spill-over events [19], and as a means to predict missing links in a variety of host-pathogen networks [20-22]. Here we capture this topology, and the relations between coronaviruses and hosts in our network, by means of node (coronaviruses and hosts) embeddings using DeepWalk [23], a deep learning method that has been successfully used to predict drug-target [24] and lncRNA-disease associations [25].

Our analytical pipeline transforms the above features into similarities (between viruses, and between hosts) and uses them to assign each virus-mammal association a score reflecting how likely it is to occur. Our framework then ensembles its constituent learners to produce testable predictions of mammalian hosts of multiple coronaviruses.
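To make the scoring step concrete, the sketch below shows one way a virus-virus similarity matrix and the observed virus-host network could be combined into an association score. It is a minimal sketch and not the authors' implementation: the ratio form of the score, the exclusion of the virus's own entry, and the names `confidence`, `similarity` and `adjacency` are assumptions made for illustration, and the toy values are invented.

```python
def confidence(similarity, adjacency, i, j):
    """Score the candidate association virus i -> mammal j.

    similarity[i][l]: similarity between viruses i and l (hypothetical values).
    adjacency[l][j]:  1 if virus l has been observed in mammal j, else 0.

    Assumed form: similarity of virus i to the viruses observed in mammal j,
    divided by its similarity to all other viruses. Virus i itself is left
    out so that an already observed link cannot score itself.
    """
    others = [l for l in range(len(similarity)) if l != i]
    total = sum(similarity[i][l] for l in others)
    if total == 0:
        return 0.0  # no similar viruses at all: nothing to infer from
    observed = sum(similarity[i][l] * adjacency[l][j] for l in others)
    return observed / total


# Toy data: three viruses, two mammals (all values invented).
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
adjacency = [
    [0, 0],  # virus 0: no observed hosts
    [1, 0],  # virus 1: observed in mammal 0
    [0, 1],  # virus 2: observed in mammal 1
]

score_m0 = confidence(similarity, adjacency, 0, 0)
score_m1 = confidence(similarity, adjacency, 0, 1)
print(score_m0, score_m1)
```

In this toy case virus 0 resembles virus 1 far more than virus 2, so the host of virus 1 receives the higher score. The same construction applies from the host side by swapping the roles of viruses and mammals.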
In this study we address the following three questions: 1) Which species may be unidentified mammalian reservoirs of coronaviruses? 2) What are the most probable mammalian host species in which coronavirus homologous recombination could occur? And 3) Which coronaviruses are most likely to co-infect a single host, and thus act as sources for future novel viruses?

Our model to predict unobserved associations between coronaviruses and their mammalian hosts indicated a total of 126 non-human mammalian species in which SARS-CoV-2 could be found (ensemble mean probability cut-off >0.5; when subtracting/adding SD from the mean, the number of predicted hosts is 85/169. For simplicity, we report SD hereafter as -/+ from predicted values at reported probability cut-offs, here: SD = -41/+43). The breakdown of these hosts by order was as follows (values in brackets represent SD from ensemble mean): Carnivora: 37 (0/0); Rodentia: 32 (-9/+3); Chiroptera: 25 (-19/+38); Artiodactyla: 20 (-8/+2); Eulipotyphla: 5 (-4/0); Primates: 4 (0/0); Lagomorpha: 2 (-1/0) and Pholidota: 1 (0/0). Figure 1 illustrates these predicted hosts, the probability of their association with SARS-CoV-2, as well as numbers of known and unobserved (predicted) coronaviruses that could be found in each potential reservoir of SARS-CoV-2 (see Supplementary results table SR1).

[Figure 1] Predicted hosts are grouped by order (inner circle). Middle circle presents probability of association between host and SARS-CoV-2 (>0.5 light grey to 1 dark grey). Yellow bars represent number of coronaviruses (species or strains) observed to be found in each host. Blue stacked bars represent other coronaviruses predicted to be found in each host by our model. Predicted coronaviruses per host are grouped by prediction probability into five categories (from inside to outside): >0.9, 0.9-0.8, 0.8-0.7, 0.7-0.6, and 0. […]

[Figure caption fragment] […] Gammacoronaviruses and unclassified coronaviruses. Yellow cells represent observed associations between the host and the coronavirus. Blue cells present predicted associations (predicted probability ranging from >0.5 (light blue) to 1 (dark blue)). White cells represent no association between host and virus (beneath cut-off probability of 0.5).

Given that coronaviruses frequently undergo homologous recombination when they co-infect a host, and that SARS-CoV-2 is highly infectious to humans, the most immediate threat to public health is recombination of other coronaviruses with SARS-CoV-2. Such recombination could readily produce […] health. Our prediction of these species' potential interaction with SARS-CoV-2 and considerable numbers of other coronaviruses, as well as the latter three species' close association to humans, identify them as high priority underestimated risks. In addition to these human-associated species, both the chimpanzee (Pan troglodytes) and African green monkey (Chlorocebus aethiops) have large numbers of predicted associations (51, 46 at 0.5; and 16, 11 at 0.9, respectively), and given their relatedness to humans and their importance in the emergence of viruses such as DENV [37] and HIV [38], also serve as other high priority species for surveillance.

The most prominent result in figure 1 is the common pig (Sus scrofa), which has the most predicted associations, in addition to SARS-CoV-2, of all included mammals (121 and 73 coronaviruses predicted at 0.5 and 0.9 cut-offs). The pig is a major known mammalian coronavirus host, harbouring both a large number (26) of observed coronaviruses and a wide diversity (figure 2, supplementary results table SR4).
Given the large number of predicted viral associations presented here, the pig's close association to humans, its known reservoir status for many other zoonotic viruses [39], and its involvement in genetic recombination of some of these viruses [40], the pig is predicted to be one of the foremost candidates for an important recombination host.

In addition to the more immediate threat of homologous recombination directly with SARS-CoV-2, we also present our predicted associations between all mammals and all coronaviruses (figures 2 and 3). These associations represent the longer-term potential for background viral evolution via homologous recombination in all species. These data also show that there is an 11.54-fold underestimation in the number of associations, with 421 observed associations and 4438 predicted at 0.5 cut-off, i.e. (421 + 4438)/421 ≈ 11.54 (5.72-fold increase with 1989 associations predicted at 0.9 cut-off). This is visually represented in figure 3, which shows the bipartite network of virus and host for observed associations (A) and predicted associations (B-D), with a marked increase in connectivity between our mammalian hosts and coronaviruses, even at the most stringent 0.95 confidence cut-off. This indicates that the potential for homologous recombination between coronaviruses is substantially underestimated using just observed data. Furthermore, our model shows that the associations between more diverse coronaviruses are also underestimated; for example, the number of host species with four or more different subgenera of coronaviruses increases by 41.57-fold from seven observed to 291 predicted (Table S4, […]

We acknowledge certain limitations in our methodology, primarily pertaining to current incomplete datasets in the rapidly developing but still understudied field.
1) The inclusion only of coronaviruses for which complete genomes could be found limited the number of coronaviruses (species or strain) for which we could compute meaningful similarities, and therefore predict potential hosts. The same applies to our mammalian species: we only included mammalian hosts for which phylogenetic, ecological, and geospatial data were available. As more data on sequenced coronaviruses or mammals become available in future, our model can be re-run to further improve predictions, and to validate predictions from earlier iterations. 2) Research effort, centering mainly on coronaviruses found in humans and their domesticated animals, can lead to overestimation of the potential of coronaviruses to recombine in frequently studied mammals, such as lab rodents, which were excluded from the results reported here (similar to previous work [18]), and, significantly, domesticated pigs and cats, which we have found to be important recombination host species of coronaviruses. This latter limitation is partially mitigated. Firstly, in our results, other 'overstudied' mammals, such as cows and sheep, were not highlighted by our model, which is consistent with them being considered less important hosts of coronaviruses, and certain under-studied bats were highlighted as major potential hosts; together, these indicate that research effort is not a substantial driver of our results. Secondly, methodologically, the effect of research effort has been mitigated by capturing similarities from our three points of view (virus, host and network) and multiple characteristics therein.

In this study we demonstrate that the potential for homologous recombination in mammalian hosts of coronaviruses is highly underestimated.
The ability of the large numbers of hosts presented here to be hosts of multiple coronaviruses, including SARS-CoV-2, demonstrates the capacity for homologous recombination and hence production of further novel coronaviruses. Our methods deployed a meta-ensemble of similarity learners from three complementary perspectives (viral, mammalian and network) to predict each potential coronavirus-mammal association.

The current consensus is that SARS-CoV-2 was generated by homologous recombination; it was originally derived from coronaviruses in bats [12] and then shifted to humans via an intermediate reservoir host, likely a species of pangolin [13]. Importantly, the lineage of SARS-CoV-2 was deduced only after the outbreak in humans. With the greater understanding of the extent of mammalian host reservoirs and the potential recombination hosts we identify here, a targeted surveillance program is now possible which would allow for this generation to be observed as it is happening and before a major outbreak. Such information could help inform mitigation strategies and provide a vital early warning system for future novel coronaviruses.

Viruses and mammalian data

Viral genomic data: Complete sequences of coronaviruses were downloaded from GenBank [43]. Sequences labelled with the terms "vaccine", "construct", "vector" or "recombinant" were removed from the analyses. In addition, we removed those associated with experimental infections where possible. This resulted in a total of 3,264 sequences for 411 coronavirus species or strains (i.e. viruses below species level on the NCBI taxonomy tree). Of these, 88 were sequences of coronavirus species, and 307 of strains (in 25 coronavirus species, with total number of species included = 92).
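The label-based exclusion described above can be sketched as a simple keyword filter. This is an illustrative sketch only: the four excluded terms come from the text, but the record structure, the `description` field, the accessions and the case-insensitive substring rule are hypothetical and may differ from how the GenBank metadata were actually screened.

```python
# Terms named in the text; sequences whose label contains any of them are dropped.
EXCLUDED_TERMS = ("vaccine", "construct", "vector", "recombinant")


def keep_sequence(description):
    """Return True if the label contains none of the excluded terms.

    Assumption: matching is a case-insensitive substring test, so a label
    such as "vectored" would also be excluded.
    """
    text = description.lower()
    return not any(term in text for term in EXCLUDED_TERMS)


def filter_records(records):
    """Keep only records whose description passes keep_sequence."""
    return [record for record in records if keep_sequence(record["description"])]


# Hypothetical records standing in for downloaded sequence metadata.
records = [
    {"accession": "A1", "description": "Bat coronavirus isolate X, complete genome"},
    {"accession": "A2", "description": "Recombinant coronavirus construct Y"},
    {"accession": "A3", "description": "Infectious bronchitis virus vaccine strain Z"},
]

kept = filter_records(records)
print([record["accession"] for record in kept])
```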
We processed meta-data accompanying all sequences (including partial sequences but excluding vaccination and experimental infections) of coronaviruses uploaded to GenBank to extract information on hosts (to species level) of these coronaviruses. We supplemented these data with species-level hosts of coronaviruses extracted from scientific publications via the Enhanced Infectious Diseases Database (EID2) [44]. This resulted in identification of 313 known terrestrial mammalian hosts of coronaviruses (regardless of whether a complete genome was available or not; N = 185 mammalian species for which an association with a coronavirus with a complete genome was identified). We expanded this set of potential hosts by including terrestrial mammalian species in genera containing at least one known host of coronavirus, and which are known to host one or more other virus species (excluding coronaviruses; information on whether the host is associated with a virus was obtained from EID2). This resulted in a total of 876 mammalian species being selected.

Quantification of viral and mammalian similarities

We computed three types of similarities between each pair of viral genomes, as summarised below.

Biases and codon usage: We calculated the proportion of each nucleotide of the total coding sequence length. We computed dinucleotide and codon biases [45] and codon pair bias, measured as the codon pair score (CPS) [45,46], in each of the above sequences. This enabled us to produce for each genome sequence (N = 3,271) the following feature vectors: nucleotide bias; dinucleotide bias; codon bias; and codon-pair bias.

Secondary structure: Following alignment of sequences (using the AlignSeqs function in the R package Decipher [47]), we predicted the secondary structure for each sequence using the PredictHEC function in the R package Decipher [47].
We obtained both states (final prediction) and probability of secondary structures for each sequence. We then computed, for each 1% of the genome length, both the coverage (number of times a structure was predicted) and mean probability of the structure (in the percent of the genome considered). This enabled us to generate six vectors (length = 100) for each genome, representing mean probability and coverage for each of three possible structures: Helix (H), Beta-Sheet (E), and Coil (C).

Genome dissimilarity (distance): We calculated pairwise dissimilarity (in effect a Hamming distance) between each pair of sequences in our set using the function DistanceMatrix in the R package Decipher [47]. We set this function to penalise gap-to-gap and gap-to-letter mismatches.

Similarity quantification: We transformed the feature (trait) vectors described above into similarity matrices between coronaviruses (species or strains). This was achieved by computing cosine similarity between these vectors in each category (e.g. codon pair usage, H coverage, E probability). Formally, for each genomic feature (N = 10) represented by a vector as described above, this similarity was calculated as follows:

$$\mathrm{sim}(s_m, s_n) = \frac{\sum_{i=1}^{d} x_{m,i}\, x_{n,i}}{\sqrt{\sum_{i=1}^{d} x_{m,i}^{2}}\;\sqrt{\sum_{i=1}^{d} x_{n,i}^{2}}}$$

where $s_m$ and $s_n$ are two genomic sequences represented by two feature vectors $x_m$ and $x_n$ from the genomic feature space (e.g. codon pair bias) of dimension $d$ (e.g. $d = 100$ for H-coverage).

We then calculated similarity between each pair of virus strains or species (in each category) as the mean of similarities between genomic sequences of the two virus strains or species (e.g. the mean nucleotide bias similarity between all sequences of SARS-CoV-2 and all sequences of MERS-CoV gave the final nucleotide bias similarity between SARS-CoV-2 and MERS-CoV). This enabled us to generate 11 genomic feature similarity matrices (the above 10 features represented by vectors, plus the genomic similarity matrix) between our input coronaviruses.
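The two steps above, cosine similarity between per-sequence feature vectors followed by averaging over all sequence pairs of two viruses, can be sketched as follows. This is a minimal illustration, not the authors' code (the paper's pipeline relies on R packages): the function names, the toy two-dimensional vectors and the choice to return 0 for an all-zero vector are assumptions.

```python
import math
from itertools import product


def cosine_similarity(x, y):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_x * norm_y)


def virus_similarity(vectors_a, vectors_b):
    """Mean cosine similarity over every pair of sequences from two viruses.

    vectors_a, vectors_b: one feature vector per genome sequence of the
    respective virus, all from the same feature category.
    """
    sims = [cosine_similarity(x, y) for x, y in product(vectors_a, vectors_b)]
    return sum(sims) / len(sims)


# Toy feature vectors (invented): virus A has two sequences, virus B has one.
virus_a = [[1.0, 0.0], [1.0, 1.0]]
virus_b = [[0.0, 1.0]]

similarity_ab = virus_similarity(virus_a, virus_b)
print(similarity_ab)
```

Repeating this for every pair of viruses in one feature category fills one similarity matrix; doing so for each of the ten vector-based categories yields the ten feature matrices referred to in the text.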
Similarity network fusion (SNF): We applied similarity network fusion (SNF) [48] to integrate the following similarities in order to reduce our viral genomic feature space: 1) nucleotide, dinucleotide, codon, and codon pair usage biases were combined into one similarity matrix, genome bias similarity; and 2) Helix (H), Beta-Sheet (E), and Coil (C) mean probability and coverage similarities (six in total) were combined into one similarity matrix, secondary structure similarity.

SNF applies an iterative nonlinear method that updates every similarity matrix according to the other matrices via a nearest-neighbour approach; it is scalable and robust to noise and data heterogeneity. The integrated matrix captures both shared and complementary information from multiple similarities.

Mammalian similarities: We calculated a comprehensive set of mammalian similarities. […]

Quantification of topological features: The network constructed above summarises our knowledge to date of associations between coronaviruses and their hosts, and its topology expresses patterns of sharing these viruses between various hosts and host groups. Our analytical pipeline captures this topology, and the relations between nodes in our network, by means of node (vertex) embeddings. This approach encodes each node (here either a coronavirus or a host) with its own vector representation in a continuous vector space, which in turn enables us to calculate similarities between two nodes based on this representation.

We adopted DeepWalk [23] to compute vectorised representations for our coronaviruses and hosts from the network connecting them.
DeepWalk uses truncated random walks to get latent topological information of the network and obtains the vector representation of its nodes (in our case coronaviruses, […]

Similarity learning meta-ensemble: a multi-perspective approach

Our analytical pipeline stacks 12 similarity learners into meta-ensembles and selects the best performing ensemble following internal and external validation processes. The constituent learners are based on different data and points of view and can be categorised as follows:

Coronaviruses, the viral point of view: we assembled three models derived from a) genome similarity, b) genome biases, and c) genome secondary structure. Each of these learners gave each coronavirus-mammalian association ($v_i \to m_j$) a score, termed confidence, based on how similar the coronavirus $v_i$ is to known coronaviruses of mammalian species $m_j$, compared to how similar $v_i$ is to all included coronaviruses [24,25]. In other words, if $v_i$ is more similar (e.g. based on genome secondary structure) to coronaviruses observed in host $m_j$ than it is to those coronaviruses not observed in $m_j$, then the association $v_i \to m_j$ is given a higher confidence score. Conversely, if $v_i$ is somewhat similar to coronaviruses observed in $m_j$, and also somewhat similar to viruses not known to infect this particular mammal, then the association $v_i \to m_j$ is given a medium confidence score. The association $v_i \to m_j$ is given a lower confidence score if $v_i$ is more similar to coronaviruses not known to infect $m_j$ than it is to coronaviruses observed in this host.

Ensemble construction: We combined the learners described above by stacking them into ensembles (meta-ensembles) using Stochastic Gradient Boosting (GBM). The purpose of this combination is to incorporate the three points of view described above, as well as varied aspects of the coronaviruses and their potential mammalian hosts, into a generalisable, robust model [50].
We selected GBM as our stacking algorithm due to its ability to handle non-linearity and high-order interactions between constituent learners [51]. We performed the training and optimisation (tuning) of these ensembles using the caret R package [52].

Ensemble sampling: Our GBM ensembles comprised 100 replicate models. Each model was trained with balanced random samples using 10-fold cross-validation. Final ensembles were generated by taking the mean predictions (probability) of the constituent models. Standard deviation (SD) from the mean probability was also generated and its values subtracted from/added to predictions.

Validation and performance estimation: We validated the performance of our analytical pipeline externally against 5 held-out test-sets. Each test-set was generated by splitting the set of observed associations between coronaviruses and their hosts into two random sets: a training-set comprising 85% of all known associations and a test-set comprising 15% of known associations. These test-sets were held out throughout the processes of generating similarity matrices, similarity learning, and assembling our learners, and were only used for the purposes of estimating performance metrics of our analytical pipeline. This resulted in 5 runs in which our ensemble learnt using only 85% of observed associations between our coronaviruses and their mammalian hosts. For each run we calculated three performance metrics based on the mean probability across each set of 100 replicate models of the GBM meta-ensembles: AUC, true skill statistic (TSS) and F-score.
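The replicate summarisation and two of the threshold-based metrics described above can be illustrated with the short sketch below. It is a hedged illustration rather than the authors' code (their ensembles were trained with the caret R package): the function names and toy probabilities are invented, the use of the sample standard deviation is an assumption, and the TSS definition (sensitivity + specificity - 1) and the F-score as the harmonic mean of precision and recall follow the definitions given in the paper.

```python
from statistics import mean, stdev


def summarise(replicates):
    """Map each host to (mean, SD) of its replicate probabilities.

    Assumption: SD is the sample standard deviation across replicates.
    """
    return {host: (mean(p), stdev(p)) for host, p in replicates.items()}


def count_predicted(summary, cutoff=0.5):
    """Count hosts above the cut-off at mean - SD, mean, and mean + SD."""
    lower = sum(1 for m, s in summary.values() if m - s > cutoff)
    at_mean = sum(1 for m, s in summary.values() if m > cutoff)
    upper = sum(1 for m, s in summary.values() if m + s > cutoff)
    return lower, at_mean, upper


def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1."""
    return tp / (tp + fn) + tn / (tn + fp) - 1


def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented probabilities from three replicate models for four hosts.
replicates = {
    "host_a": [0.90, 0.80, 0.85],
    "host_b": [0.70, 0.40, 0.55],
    "host_c": [0.30, 0.60, 0.45],
    "host_d": [0.10, 0.20, 0.15],
}

counts = count_predicted(summarise(replicates))
print(counts)  # hosts predicted at mean - SD, mean, mean + SD
print(true_skill_statistic(tp=40, fn=10, tn=80, fp=20))
print(f_score(tp=40, fp=20, fn=10))
```

Reporting the count at mean - SD and mean + SD alongside the count at the mean mirrors how the paper reports ranges such as 85/169 around 126 predicted hosts.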
Recombination, Reservoirs, and the Modular Spike: Mechanisms of Coronavirus Cross-Species Transmission
Cross-species transmission of the newly identified coronavirus 2019-nCoV
The proximal origin of SARS-CoV-2
Hosts and Sources of Endemic Human […]. Advances in Virus Research
Molecular Evolution of the SARS Coronavirus, during the Course of the SARS Epidemic in China. Science
Isolation and characterization of viruses related to the SARS coronavirus from animals in Southern China
Characterization of a novel coronavirus associated with severe acute respiratory syndrome. Science
Bovine-Like Coronaviruses Isolated from Four Species of Captive Wild Ruminants Are Homologous to Bovine Coronaviruses, Based on Complete Genomic Sequences
Molecular characterization of a canine respiratory coronavirus strain detected in Italy
Evolutionary History of the Closely Related Group 2 Coronaviruses Hemagglutinating Encephalomyelitis Virus, Bovine Coronavirus, and Human Coronavirus OC43
Distant relatives of severe acute respiratory syndrome coronavirus and close relatives of human coronavirus 229E in bats
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
Genetic recombination of human immunodeficiency virus
Vaccination influences the evolution of classical swine fever virus
Testing the hypothesis of a recombinant origin of the SARS-associated coronavirus
A comparison of bats and rodents as reservoirs of zoonotic viruses: are bats special?
Integration of shared-pathogen networks and machine learning reveals the key aspects of zoonoses and predicts mammalian reservoirs
Using network theory to identify the causes of disease outbreaks of unknown origin
A hierarchical bayesian model for predicting ecological interactions using scaled evolutionary relationships
Predicting cryptic links in host-parasite networks
Global estimates of mammalian viral diversity accounting for host sharing
Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Min
Deep mining heterogeneous networks of biomedical linked data to predict novel drug-target associations
Predicting lncRNA-disease associations using network topological similarity based on deep mining heterogeneous networks
Genetic evolution analysis of 2019 novel coronavirus and coronavirus from other species. Infection
Bats, civets and the emergence of SARS
Common Palm Civet Paradoxurus hermaphroditus in Javan and Balinese markets
COMPARISON OF CHEMICAL CHARACTERISTICS AND SENSORY VALUE BETWEEN LUWAK COFFEE AND ORIGINAL COFFEE FROM ARABICA (Cafeea arabica. L) AND ROBUSTA (Cafeea canephora. L) VARIETIES) Cafeea arabica. L) DAN ROBUSTA (Cafeea canephora
Severe Acute Respiratory Syndrome (SARS) Coronavirus ORF8 Protein Is Acquired from SARS-Related Coronavirus from Greater Horseshoe Bats through Recombination
Bats are natural reservoirs of SARS-like coronaviruses. Science
Genomic variance of the 2019-nCoV coronavirus
Evolutionary relationships between bat coronaviruses and their hosts
Susceptibility of ferrets, cats, dogs, and other domesticated animals to SARS-coronavirus 2
Extension of the known distribution of a novel clade C betacoronavirus in a wildlife host
Isolation and Characterization of a Novel Betacoronavirus Subgroup A Rabbit Coronavirus HKU14, from Domestic Rabbits
Chapter 1 The History and Evolution of Human Dengue Emergence
Chimpanzee reservoirs of pandemic and nonpandemic HIV-1. Science
A Review of the Current Status of Relevant Zoonotic Pathogens in Wild Swine (Sus scrofa) Populations: Changes Modulating the Risk of Transmission to Humans
The epidemiology and evolution of influenza viruses in pigs
Enteric coronavirus infection in a juvenile dromedary (Camelus dromedarius)
Middle East respiratory syndrome coronavirus: risk factors and determinants of primary, household, and nosocomial transmission. The Lancet Infectious Diseases 18
Database of host-pathogen and related species interactions, and their global distribution
Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science
Virus attenuation by genome-scale changes in codon pair bias
Harnessing local sequence context to improve protein multiple sequence alignment
Similarity network fusion for aggregating data types on a genomic scale
A General Coefficient of Similarity and Some of Its Properties
Predicting potential drug-drug interactions by integrating chemical, biological, phenotypic and network data
Global hotspots and correlates of emerging zoonotic diseases
Building Predictive Models in R Using the caret Package
AUC: a misleading measure of the performance of predictive distribution models
Selecting pseudo-absences for species distribution models: how, where and how many?
The ghost of nestedness in ecological networks
Stability of ecological communities and the architecture of mutualistic and trophic networks. Science
A consistent metric for nestedness analysis in ecological systems: Reconciling concept and measurement
The checkered history of checkerboard distributions

MW acknowledges support from BBSRC and MRC for the National Productivity Investment Fund (NPIF) fellowship (MR/R024898/1).
Establishment of the EID2 database was funded by a UK Research Council Grant (NE/G002827/1) to MB, as part of an ERANET Environmental Health award to MB […] Development Fund awards (BB/K003798/1, BB/N02320X/1) to MB, and the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Emerging and Zoonotic Infections at the University of Liverpool in partnership with Public Health England and Liverpool School of […]

Conceived and designed the study: MW and MSCB
Compiled the data […]
Analysed and interpreted the data: MW and MSCB
Established the EID2 database: MW and MB
Wrote the manuscript: MW, MB and MSCB

Formally, given an adjacency matrix $A$ of dimensions $|V| \times |M|$, where $|V|$ is the number of coronaviruses included in this study (for which a complete genome could be found) and $|M|$ is the number of included mammals, such that for each $v_i \in V$ and $m_j \in M$, $a_{ij} = 1$ if an association is known to exist between the virus and the mammal, and 0 otherwise. Then, for a similarity matrix corresponding to each of the similarity matrices calculated above, a learner from the viral point of view is defined as follows [24,25]: […]

[…] phylogenetically close to known hosts of $v_i$, and also phylogenetically distant to mammalian species not known to be associated with this coronavirus, then the phylogenetic similarity learner will assign $v_i \to m_j$ a higher confidence score. However, if $m_j$ does not overlap geographically with known hosts of $v_i$, then the geographical overlap learner will assign it a low (in effect 0) confidence score.

Formally, given the adjacency matrix $A$ defined above, and a similarity matrix corresponding to each of the similarity matrices summarised in table 1 and calculated in Supplementary Note 2, a learner from the mammalian point of view is defined as follows [24,25]: […]

Network, the network point of view: we integrated two learners based on network similarities, one for mammals and one for coronaviruses.
Formally, given the adjacency matrix $A$, our two learners from the network point of view are defined as follows [24]: […]

AUC is a threshold-independent measure of model predictive performance that is commonly used as a validation metric for host-pathogen predictive models [21,45]. Because use of AUC has been criticised for its insensitivity to absolute predicted probability and its inclusion of a priori untenable prediction [51,53], we also calculated the True Skill Statistic (TSS = sensitivity + specificity - 1) [54]. F-score captures the harmonic mean of precision and recall and is often used with uneven class distribution. Our approach is relaxed with respect to false positives (unobserved associations), hence the low F-score recorded […]

Changes in network structure: We quantified the diversity of the mammalian hosts of each CoV in our input by computing the mean phylogenetic distance between these hosts. Similarly, we captured the diversity of CoVs associated with each mammalian species by calculating the mean (Hamming) distance between the genomes of these CoVs. We termed these two metrics mammalian diversity per virus and viral diversity per mammal, respectively. We aggregated both metrics at the network level by means of a simple average. This enabled us to quantify changes in these diversity metrics, at the level of the network, with the addition of predicted links at three probability cut-offs: >0.5, >0.75 and ≥0.95.

All data used in our analyses will be made available via figshare. During the review process our data can be made available upon request.

All code used in our analyses will be made available via figshare. During the review process our code can be made available upon request.