key: cord-0789830-xvxv8doc
authors: Davis, Phillip; Russell, Joseph A.
title: A Genotype-to-Phenotype Modeling Framework to Predict Human Pathogenicity of Novel Coronaviruses
date: 2021-09-20
journal: bioRxiv
DOI: 10.1101/2021.09.18.460926
sha: 5b805856b984c149884b03e7329bbf1b1520347f
doc_id: 789830
cord_uid: xvxv8doc

Leveraging prior viral genome sequencing data to make predictions on whether an unknown, emergent virus harbors a ‘phenotype-of-concern’ has been a long-sought goal of genomic epidemiology. A predictive phenotype model built from nucleotide-level information alone has previously been considered un-tenable with respect to RNA viruses due to the ultra-high intra-sequence variance of their genomes, even within closely related clades. Building from our prior work developing a degenerate k-mer method to accommodate this high intra-sequence variation of RNA virus genomes for modeling frameworks, and leveraging a taxonomic ‘group-shuffle-split’ paradigm on complete coronavirus assemblies from prior to October 2018, we trained multiple regularized logistic regression classifiers at the nucleotide k-mer level capable of accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (SADS-CoV) genome sequences as non-human pathogens. LASSO feature selection identified several degenerate nucleotide predictor motifs with high model coefficients for the human pathogen class that were present across widely disparate classes of coronaviruses. However, these motifs differed in which genes they were present in, what specific codons were used to encode them, and what the translated amino acid motif was. This emphasizes the importance of a phenetic view of emerging pathogenic RNA viruses, as opposed to the canonical phylogenetic interpretations most-commonly used to track and manage viral zoonoses. Applying our model to more recent Orthocoronavirinae genomes deposited since October 2018 yields a novel contextual view of pathogen-potential across bat-related, canine-related, porcine-related, and rodent-related coronaviruses and critical adaptations which may have contributed to the emergence of the pandemic SARS-CoV-2 virus. Finally, we discuss the utility of these predictive models (and their associated predictor motifs) to novel biosurveillance protocols that substantially increase the ‘pound-for-pound’ information content of field-collected sequencing data and make a strong argument for the necessity of routine collection and sequencing of zoonotic viruses.

Leveraging prior viral genome sequencing data to make predictions on whether an unknown, 13 emergent virus harbors a 'phenotype-of-concern' has been a long-sought goal of genomic 14 epidemiology. A predictive phenotype model built from nucleotide-level information alone has 15

previously been considered un-tenable with respect to RNA viruses due to the ultra-high intra-16 sequence variance of their genomes, even within closely related clades. Building from our prior 17 work developing a degenerate k-mer method to accommodate this high intra-sequence variation 18 of RNA virus genomes for modeling frameworks, and leveraging a taxonomic 'group-shuffle-19

split' paradigm on complete coronavirus assemblies from prior to October 2018, we trained 20 multiple regularized logistic regression classifiers at the nucleotide k-mer level capable of 21 accurately predicting withheld SARS-CoV-2 genome sequences as human pathogens and 22

accurately predicting withheld Swine Acute Diarrhea Syndrome coronavirus (SADS-CoV) 23 genome sequences as non-human pathogens. LASSO feature selection identified several 24 degenerate nucleotide predictor motifs with high model coefficients for the human pathogen 25 class that were present across widely disparate classes of coronaviruses. However, these motifs 26 differed in which genes they were present in, what specific codons were used to encode them, 27 and what the translated amino acid motif was. This emphasizes the importance of a phenetic 28 view of emerging pathogenic RNA viruses, as opposed to the canonical phylogenetic 29

interpretations most-commonly used to track and manage viral zoonoses. Applying our model to 30 more recent Orthocoronavirinae genomes deposited since October 2018 yields a novel 31 contextual view of pathogen-potential across bat-related, canine-related, porcine-related, and 32 rodent-related coronaviruses and critical adaptations which may have contributed to the 33 emergence of the pandemic SARS-CoV-2 virus. Finally, we discuss the utility of these predictive 34 models (and their associated predictor motifs) to novel biosurveillance protocols that 35

substantially increase the 'pound-for-pound' information content of field-collected sequencing 36 data and make a strong argument for the necessity of routine collection and sequencing of 37 zoonotic viruses. 38

To date, the applicability of genomic sequencing data to zoonotic viral outbreaks and pandemics 40 has primarily served in post-outbreak genomic epidemiology roles. When a novel viral pathogen 41 emerges, genome sequence data is compared against prior data from other close relatives. From 42 these analyses, public health risk and resourcing (1,2), transmission chains (3), and other 43 response-related information (4) is inferred. Several studies have begun to address the utility of 44 viral genome sequencing data in a pre-outbreak, predictive methodology through development of 45 increasingly complex machine learning techniques that attempt to understand the emergence of 46 particular viral phenotypes (5-12). However, while these works provide important novel 47 biological characterization methods, their immediate applied utility for biosurveillance is limited 48 due to the complexity of interpreting their outputs. 49 50

The emergence of the SARS-CoV-2 virus, and the ensuing pandemic, has emphasized our 51 continued vulnerability to zoonotic pathogens. Despite several smaller scale outbreaks of 52 dangerous Betacoronaviruses (namely SARS and MERS), our preparedness and ability to 53 forecast these emergent pathogens have made little advancement. Traditionally, the approach to 54 understanding differences in viral phenotypes has involved problematic experimental evolution, 55

or gain-of-function research through recombinant genetics system (13, 14). 56 57

Our previous work developed a feature-agglomeration method adapted to "bag-of-words" style 58 feature extraction in RNA viruses (15). We used this method to fit a binary logistic regression 59 model for Orthocoronavirinae around a response variable of human pathogen vs non-human 60 pathogen. While this method focused on explanatory modeling by emphasizing numerical 61 stability and training-set accuracy as the model selection criteria, the original feature extraction 62 and model fitting implementation limited its predictive power and resulted in overfitting to the 63 training data. This dilemma of model extrapolation is an old problem in statistical analysis and 64 machine learning (16, 17) and is still salient in biological data science applications. This has led 65 to assertions that the goal of prediction for threat of viral emergence, directly from sequence 66 data, is infeasible based on currently available data and biological knowledge (18). 67 68 We provide a solution to these problems specifically in the case of Orthocoronavirinae, while 69 also demonstrating techniques that could be applied across the viral kingdom. We have 70 developed a protocol for feature extraction and cross-validation that is specific to the viral 71 genomics domain to produce actionable and predictive genotype-to-phenotype information for 72 global health and pandemic preparedness experts, directly from genomic data. 73

Data Labeling and Grouping 75 We adopted the same data labeling assumptions regarding human-pathogen class membership 76 that were stipulated in our previous work (15 We previously developed a feature extraction method (15), Vorpal, to reconcile the k-mer-based 93 sequence representations with the inter-example variance in RNA virus genomes. This method 94 worked by counting k-mers across the input sequences, removing k-mers that appear below a 95 frequency quantile threshold, and performing hierarchical clustering on the remaining k-mers. 96

Using hamming distance as the metric and producing flat clusters from the resulting k-mer tree at 97 different branch lengths, we can produce de-facto k-mer alignments that can be re-encoded using 98

International Union of Pure and Applied Chemistry (IUPAC) nucleic acid characters. This 99 functions as a dimensionality-reduction technique that represents the higher dimensional k-mer 100 space into a smaller vocabulary of degenerate motifs that retains information about observed 101 variance in the training data. We can construct feature spaces using this technique that are 102 influenced by three parameters: k size, k-mer frequency cutoff to proceed to clustering, and 103 degeneracy cutoff for flat clustering of the k-mer tree. A simple example to illustrate this concept 104 is depicted in Figure 1 . Expanding on this feature extraction technique, we employed several methods to transition this 115 approach from an explanatory paradigm to a predictive one. To accomplish this, we utilized two 116 key strategies to reduce possible sources of model variance. First, we used a cross-validation 117 technique to guide model selection that leverages the intrinsic modal organization of genomics 118 data imparted by phylogenetic relationships. This characterizes the problem of predictive 119 phenotype modeling as one where generalization of the model would mean maintaining accuracy 120

to a novel mode of the sample distribution, or in other words, a new species or clade of the viral 121

family. Therefore, we leverage taxonomic organization of the training data to implement a 122 group-shuffle-split (GSS) cross-validation approach (21). This simulates the problem of having 123 several species of each class represented in the training set and allows a search over model 124 parameters that maximize the ability to generalize to a withheld species in the validation set. In sequences was randomly super-sampled to 4000 instances using the stratified resampling method 178 described above. P-values for coefficients were not estimated, as predictive power to withheld 179 data is the preferred model evaluation criteria in this context. 180 181

Model selection was performed by first producing degenerate motifs across combinations of two 182 feature extraction parameters; k-mer size and degeneracy cutoff. Then, each of these feature sets 183 was used to fit models with a grid search cross-validation routine that searched over the L1 184 regularization parameter C using GSS as the cross validator, where C is the inverse of the L1 185 regularization term λ. Quantile cutoff for k-mer clustering was selected for each k-size based on 186 available system memory constraints (2TB) and are stipulated in Supplementary Table 1 

Following an exhaustive search over feature extraction parameters and the L1 regularization-209 term hyperparameter, several models were identified that correctly classified the test set at 100% 210 accuracy -specifically, the 15-mer models with 2.0 and 4.0 degeneracy cutoff, and the 17-mer 211 models with 2.0 and 4.0 degeneracy cutoff for k-mer clustering (Figure 3) .

Parameter search over L1 regularization terms was similar to our previous effort. Uniformly, 214 models were selected by the Brier score criterion (26) for the strongest regularization term 215 evaluated, which was .01. In Supplementary Figure 1 Another interesting pattern observed in the training set was a group of Bat SARS-like and 293

MERS-like viruses that were routinely classified as human pathogens -specifically, members of 294

Jinning mine group of viruses such as Rs4231 and Rs4874, as well as the MERS-likes NL13845 295 and NL140422 sampled from a cave in Guangdong (30,31). These class designations seem to be 296 supported by serological evidence of positivity to SARS-likes reported in the area surrounding 297 the Jinning cave from which these SARS-like viruses were sampled (30). Finally, human enteric 298 coronavirus 4408 was classified as a non-human pathogen in 35 of the 45 trained models, 299

including those that were 100% accurate on the test set. Complete tables of misclassified training 300 set accession numbers and class probabilities for each model replicate are available in 301

Supplementary data. The frequency of this misclassification is potentially explained by 4408's 302 status as a strictly child-associated coronavirus (32). Similarly, the novel Canine 303 alphacoronavirus isolated from a child in Malaysia (33) in 2018 shares a similar, negative 304 prediction as can been seen in Table 4 . The implications of this nuance in data labeling and the 305 characterization of the problem as a binary classification are examined in the Discussion. 306 307

Through examination of the model training results, it is possible to see the key determinants of 310 the success of our approach. First, the choice of model -regularized logistic regression -is 311 critical to the success of the models. The 17mer, 3.0 degeneracy models are examples where the 312 models failed to generalize to the test set, but had highest accuracy scores on the training data 313 (i.e., >99%). Controlling this tendency to overfit, especially where certain nuance or ambiguity 314 may exist regarding the virus phenotype that is not captured by the binary response variable, is 315 much more difficult to achieve outside of high bias model families like generalized linear 316 models. Second, the positionally-independent representation of the feature space provided by the 317

Vorpal feature extraction methodology allows for identification of genome thematics that emerge 318 as a result of convergent evolution. Finally, the degenerate characteristic of these motif 319

representations introduced by the k-mer clustering clearly contribute to success in extrapolation. 320

This is explained by observing several instances where the models did not successfully 321 generalize to the test set. In many models that were fit with lower degeneracy cutoff parameters, 322 test set probabilities for SARS-CoV-2 were 0.50 because none of the predictor motifs selected 323

during training mapped to SARS-CoV-2 (Figure 3) . Higher degeneracy feature spaces still 324 identified predictive motifs, and these motifs continued to be present in the test set. 325 326

To understand the underlying biological function of the predictor motifs, we examined their 327 genomic context. As an example, RATGTTRTTMDWCDA, shown in Table 2 Properties of the NTRNWRNTSNWSHTA motif that led to its association with human 342 pathogens are not obvious, but examining its patterns of occurrence provides potential hints. As 343 mentioned, this motif is most abundant in HKU1. However, in addition to this frequency, it also 344 occurs concurrently in the genome with another unique feature of HKU1 for which the functional 345 purpose is not understood -this motif tracks each instance of the Acid Tandem Repeats (ATRs) 346 that occur at varying copy number in the hypervariable region of NSP3 in different strains of 347 HKU1 (40). This motif also appears to be tracking the abundance of consecutive third-position-348 thymine codons. The preference of these codons is a well described phenomenon in 349

coronaviruses, but its functional provenance is not well understood and its enrichment 350 specifically in human coronaviruses has not been described (41, 42) .

The models also appear to describe a human-pathogen class definition that only includes viruses 353 that can readily transmit between adults. 

While this effort represents a specific procedure with respect to this feature extraction technique, 374

the theoretical framework is one that can be generally applied. The task for supervised learning 375 on biological sequence data is to transform to a feature subspace where the learner is 376

interpolating over the feature space as it pertains to the response variable, and is no longer 377 extrapolating. We believe these methodologies are applicable not just across the RNA virus 378 genome domain, but also across multiple feature spaces such as protein and RNA secondary 379 structure. We will explore this in future work. 380 381 382 383 Improving Biosurveillance Protocols 385 The implications of the models support a potential reimagining of biosurveillance efforts and 386 pandemic prevention. The ability to predict pathogenic phenotypes of viruses well ahead of 387 spillover, directly from sequence data, can enable more effective focusing of resource allocation 388

for ecological monitoring and prevention. The results described in this work are, to our 389 knowledge, the first demonstration of this capability. Determination of the biological function of 390 model predictors may yield a more detailed understanding of why certain organisms, such as 391

Camels and Civets, seem to act as keystone species for the spillover of certain viral families like 392

Orthocoronavirinae. This could produce a road map to understand the host genomic 393 determinants that condition these viral genomes for emergence from their natural reservoirs. 394 395

Leveraging predictive motifs in field-forward 'sequence-search' missions can enable genomic 396 epidemiologists to identify problematic viruses more quickly on site. Despite the criticality of 397 genome assembly and phylogenetic analyses during emerging outbreak scenarios, their 398 cumbersome and time-consuming nature limits the utility and feasibility of sequencing 399 operations in field-forward surveillance efforts and prevents investments in such infrastructure 400 and programs. Predictive motifs can be modeled directly in raw voltage disturbance signals from 401 nanopore platforms (48). Searching for predictive motifs from raw electrical signal obviates the 402 need for in-field basecalling, enabling more streamlined field-forward sequencing infrastructure. 403

Such infrastructure can alleviate sample bottlenecks at central reference laboratories and 404 establish a more efficient public health response network. 405 406

As the COVID-19 pandemic has made abundantly clear, the time is now for investments in these 407 types of next-generation biosurveillance ecosystems. Predictive feature-extraction genome 408 modeling frameworks, such as those 

Infection control in 421 the New Age of genomic epidemiology

Towards a genomics-informed, real-time, Global Pathogen 424 Surveillance System

Genomic epidemiology of SARS-COV-2 reveals multiple lineages and 427 early spread of SARS-COV-2 infections in Lombardy, Italy

Rapid identification and tracking of SARS-COV-2 430 variants of concern. The Lancet

Identifying and prioritizing potential 433 human-infecting viruses from their genome sequences. bioRxiv

Predicting host taxonomic information from viral 436 genomes: A comparison of feature representations

Predicting reservoir hosts and arthropod 439 vectors from evolutionary signatures in rna virus genomes

Machine learning reveals the critical interactions for SARS-COV-2 442 spike protein binding to ACE2

Genome-wide bioinformatic analyses predict key host and viral 445 factors in SARS-COV-2 pathogenesis

Translational adaptation of 448 human viruses to the tissues they infect

Machine learning using intrinsic genomic signatures for rapid 450 classification of novel pathogens: Covid-19 case study

Genomic determinants of pathogenicity in SARS-COV-2 and other 453 human coronaviruses

Experimental evolution to study virus emergence

Discovery of a rich gene pool of bat SARS-related coronaviruses provides 458 new insights into the origin of SARS coronavirus

Vorpal: A novel RNA virus feature-extraction algorithm demonstrated 461 through interpretable genotype-to-phenotype linear models

The hazards of extrapolation in regression analysis

Pandemics: Spend on surveillance, not 468 prediction

Porcine coronaviruses: Overview of the 470 state of the art

Bovine-like coronaviruses in domestic and wild ruminants

Scikit-learn: Machine Learning in Python

Variance reduction

A reference viral 478 database (RVDB) to enhance bioinformatics analysis of high-throughput sequencing for 479 novel virus detection

Virus variation resource -improved response to emergent viral 481 outbreaks

Fatal swine acute diarrhoea syndrome caused by an HKU2-related 483 coronavirus of Bat Origin

VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF 486 PROBABILITY

Isolation and characterization of a novel bat coronavirus closely related 489 to the direct progenitor of severe acute respiratory syndrome coronavirus

A novel bat coronavirus closely related to SARS-COV-2 contains natural 492 insertions at the S1/S2 cleavage site of the spike protein

Identification of novel Bat Coronaviruses sheds light on the evolutionary 495 origins of SARS-COV-2 and related viruses

Serological evidence of bat SARS-related coronavirus infection in 498 humans

Discovery of novel Bat Coronaviruses in South China that use the same 500 receptor as Middle East respiratory syndrome coronavirus

Cross-protection against a human enteric 503 coronavirus and a virulent bovine enteric coronavirus in Gnotobiotic Calves

Novel canine coronavirus isolated from a hospitalized pneumonia 506 patient, East Malaysia

Two adjacent mutations on the dimer interface of SARS coronavirus 3c-like 508 protease cause different conformational changes in crystal structure

Widespread position-specific conservation of synonymous rare 511 codons within coding sequences

Mass spectrometry approach and Elisa 514 reveal the effect of codon optimization on N-linked glycosylation of HIV-1 gp120

Comparative analysis of 22 coronavirus HKU1 genomes reveals a novel 517 genotype and evidence of natural recombination in Coronavirus HKU1

Multivariate analyses of codon usage of SARS-520 COV-2 and other betacoronaviruses

Characterization of codon usage pattern in SARS-COV-2

Human coronavirus infection among children with acute lower 524 respiratory tract infection in Thailand

Petabase-scale sequence alignment catalyses viral discovery

Emergence of porcine delta-coronavirus pathogenic infections 529 among children in Haiti through independent zoonoses and convergent evolution

Age-dependent risks of incidence and mortality of covid-19 in Hubei 532 Province and other parts of China

Ageing impairs the airway epithelium defence response to SARS-535 COV

In silico mutagenesis 537 of human ACE2 with S protein and translational efficiency explain SARS-COV-2 538 infectivity in different species

Cotranslational prolyl hydroxylation is 541 essential for flavivirus biogenesis

Monitoring cotranslational protein folding in mammalian cells at codon 544 resolution

Targeted nanopore 547 sequencing by real-time mapping of raw electrical signal with UNCALLED