Interpretable detection of novel human viruses from genome sequencing data

Jakub M. Bartoszewicz 1,2,3,4∗, Anja Seidel 1,2 and Bernhard Y. Renard 1,3,4∗

1Bioinformatics (MF1), Department of Methodology and Research Infrastructure, Robert Koch Institute, Berlin, Germany, 2Department of Mathematics and Computer Science, Free University of Berlin, Berlin, Germany, 3Data Analytics and Computational Statistics, Hasso Plattner Institute for Digital Engineering, Potsdam, Brandenburg, Germany and 4Digital Engineering Faculty, University of Potsdam, Potsdam, Brandenburg, Germany. 5Current address: Central Research Institute of Ambulatory Health Care, Berlin, Germany.

∗To whom correspondence should be addressed. Tel: +49 331 5509 4960; Email: jakub.bartoszewicz@hpi.de, bernhard.renard@hpi.de

ABSTRACT

Viruses evolve extremely quickly, so reliable methods for viral host prediction are necessary to safeguard biosecurity and biosafety alike. Novel human-infecting viruses are difficult to detect with standard bioinformatics workflows. Here, we predict whether a virus can infect humans directly from next-generation sequencing reads. We show that deep neural architectures significantly outperform both shallow machine learning and standard, homology-based algorithms, cutting the error rates in half and generalizing to taxonomic units distant from those presented during training. Further, we develop a suite of interpretability tools and show that it can also be applied to other models beyond the host prediction task. We propose a new approach for convolutional filter visualization to disentangle the information content of each nucleotide from its contribution to the final classification decision. Nucleotide-resolution maps of the learned associations between pathogen genomes and the infectious phenotype can be used to detect regions of interest in novel agents, for example the SARS-CoV-2 coronavirus, unknown before it caused the COVID-19 pandemic in 2020. All methods presented here are implemented as easy-to-install packages enabling analysis of NGS datasets without requiring any deep learning skills, but also allowing advanced users to easily train and explain new models for genomics.

INTRODUCTION

Background

Within a globally interconnected and densely populated world, pathogens can spread more easily than ever before. As the recent outbreaks of Ebola and Zika viruses have shown, the risks posed even by these previously known agents remain unpredictable and their expansion hard to control (1). What is more, it is almost certain that more unknown pathogen species and strains are yet to be discovered, given their constant, extremely fast-paced evolution and unexplored biodiversity, as well as increasing human exposure (2, 3). Some of those novel pathogens may cause epidemics (similar to the SARS and MERS coronavirus outbreaks in 2002 and 2012) or even pandemics (e.g. SARS-CoV-2 and the "swine flu" H1N1/09 strain). Many have more than one host or vector, which makes assessing and predicting the risks even more difficult. For example, Ebola has its natural reservoir most likely in fruit bats (4), but causes deadly epidemics in both humans and chimpanzees.
As genome sequencing is the state-of-the-art approach for the open-view detection of pathogens (5, 6), it is crucial to develop automated pipelines for characterizing the infectious potential of currently unidentifiable sequences. In practice, clinical samples are dominated by host reads and contaminants, with often less than a hundred reads of the pathogenic virus (7). Metagenomic assembly is challenging, especially in time-critical applications. This creates a need for read-based approaches complementing or substituting assembly where needed. Screening against potentially dangerous subsequences before their synthesis may also be used as a way of ensuring responsible research in synthetic biology. While potentially useful in some applications, engineering of viral genomes could also pose a biosecurity and biosafety threat. Two controversial studies modified the influenza A/H5N1 ("bird flu") virus to be airborne transmissible in mammals (8, 9). A possibility of modifying coronaviruses to enhance their virulence triggered calls for a moratorium on this kind of research (10). Synthesis of an infectious horsepox virus closely related to the smallpox-causing Variola virus (11) caused a public uproar and calls for intensified discussion on risk control in synthetic biology (12).

Current tools for host range prediction

Several computational, genome-based methods exist that allow prediction of the host range of a bacteriophage (a bacteria-infecting virus). A selection of composition-based and alignment-based approaches has been presented in an extensive review by Edwards et al. (13). Prediction of eukaryotic host tropism (including humans) based on known protein sequences was shown for the influenza A virus (14). Support-vector machines based on word2vec representations were shown to outperform homology searches with BLAST and HMMs in the same task, but lost their advantage when applied to nucleic acid sequences directly (15). Two recent studies employ k-mer based, k-NN classifiers (16) and deep learning (17) to predict host range for a small set of three well-studied species directly from viral sequences. While those approaches are limited to those particular species and do not scale to viral host-range prediction in general, the Host Taxon Predictor (HTP) (18) uses logistic regression and support vector machines to predict if a novel virus infects bacteria, plants, vertebrates or arthropods. Yet, the authors argue that it is not possible to use HTP in a read-based manner; it requires long sequences of at least 3,000 nucleotides.
This is incompatible with modern metagenomic next-generation sequencing (NGS) workflows, where the DNA reads obtained are at least 10-20 times shorter. Another study used gradient boosting machines to predict reservoir hosts and transmission via arthropod vectors for known human-infecting viruses (19). Zhang et al. (20) designed several classifiers explicitly predicting whether a new virus can potentially infect humans. Their best model, a k-NN classifier, uses k-mer frequencies as features representing the query sequence and can yield predictions for sequences as short as 500 base pairs (bp). It worked also with 150bp-long reads from real DNA sequencing runs, although in this case the reads originated also from the viruses present in the training set (and were therefore not "novel").

Deep Learning for DNA sequences

While DNA sequences mapped to a reference genome may be represented as images (21), a majority of studies uses a distributed orthographic representation, where each nucleotide {A,C,G,T} in a sequence is represented by a one-hot encoded vector of length 4. An "unknown" nucleotide (N) can be represented as an all-zero vector. Chaos game representation (CGR) and its extension, the frequency matrix CGR (FCGR), are promising alternatives able to encode an arbitrary sequence in an image-like format. FCGR has been used to encode genomic inputs for deep learning approaches, including full bacterial genomes (22) and coding sequences of HIV for the drug resistance prediction task (23). In this study, we use one-hot encoding with Ns as zeroes, which was previously shown to perform well for raw NGS reads (24) and abstract phenotype labels. CNNs and LSTMs have been successfully used for a variety of DNA-based prediction tasks. Early works focused mainly on regulation of gene expression in humans (25, 26, 27, 28, 29), which is still an area of active research (30, 31, 32). In the field of pathogen genomics, deep learning models trained directly on DNA sequences were developed to predict host ranges of three multi-host viral species (33) and to predict pathogenic potentials of novel bacteria (24). DeepVirFinder (34) and ViraMiner (35) can detect viral sequences in metagenomic samples, but they cannot predict the host and focus on previously known species. For a broader view on deep learning in genomics we refer to a recent review by Eraslan et al. (36).

Interpretability and explainability of deep learning models for genomics is crucial for their widespread adoption, as it is necessary for delivering trustworthy and actionable results. Convolutional filters can be visualized by forward-passing multiple sequences through the network and extracting the most-activating subsequences (25) to create a position weight matrix (PWM), which can be visualized as a sequence logo (37, 38). Direct optimization of input sequences is problematic, as it results in generating a dense matrix even though the input sequences are one-hot encoded (39, 40). This problem can be alleviated with Integrated Gradients (41, 42) or DeepLIFT, which propagates activation differences relative to a selected reference back to the input, reducing the computational overhead of obtaining accurate gradients (43). If the bias terms are zero and a reference of all-zeros is used, the method is analogous to Layer-wise Relevance Propagation (44). DeepLIFT is an additive feature attribution method, and may be used to approximate Shapley values if the input features are independent (45).
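As a concrete illustration of the orthographic input representation described above, and of the "null" (all-N) reference that reappears with DeepLIFT later in the Methods, a minimal numpy sketch (illustrative only, not the DeePaC implementation) could look as follows:

import numpy as np

# Illustrative helper: one-hot encode a read with channel order A, C, G, T.
# Any ambiguous base (N or other IUPAC codes) stays an all-zero vector, so an
# all-N read of the same length is simply a matrix of zeros.
NUC_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(read, length=250):
    x = np.zeros((length, 4), dtype=np.float32)
    for i, base in enumerate(read[:length].upper()):
        j = NUC_INDEX.get(base)
        if j is not None:          # Ns and other ambiguity codes remain all-zero
            x[i, j] = 1.0
    return x                       # shorter reads stay zero-padded at the end

x = one_hot("ACGTNACGT")
reference = np.zeros_like(x)       # all-N "null" sequence, usable as a reference input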
TF-MoDISco (46) uses DeepLIFT to discover consolidated, biologically meaningful DNA motifs (transcription factor binding sites).

Contributions

In this paper, we first improve the performance of read-based predictions of the viral host (human or non-human) from next-generation sequencing reads. We show that reverse-complement (RC) neural networks (24) significantly outperform both the previous state-of-the-art (20) and the traditional, alignment-based algorithm – BLAST (47, 48), which constitutes a gold standard in homology-based bioinformatics analyses. We show that defining the negative (non-human) class is non-trivial and compare different ways of constructing the training set. Strikingly, a model trained to distinguish between viruses infecting humans and viruses infecting other chordates (a phylum of animals including vertebrates) generalizes well to evolutionarily distant non-human hosts, including even bacteria. This suggests that the host-related signal is strong and the learned decision boundary separates human viruses from other DNA sequences surprisingly well. Next, we propose a new approach for convolutional filter visualization using partial Shapley values to differentiate between simple nucleotide information content and the contribution of each sequence position to the final classification score. To test the biological plausibility of our models, we generate genome-wide maps of "infectious potential" and nucleotide contributions. We show that those maps can be used to visualize and detect virulence-related regions of interest (e.g. genes) in novel genomes. As a proof of concept, we analyzed one of the viruses randomly assigned to the test set – the Taï Forest ebolavirus, which has a history of host-switching and can cause a serious disease. To show that the method can also be used for other biological problems, we investigated the networks trained by Bartoszewicz et al. (24) and their predictions on a genome of a pathogenic bacterium, Staphylococcus aureus. The authors used this particular species to assess the performance of their method on real sequencing data. Finally, we studied the SARS-CoV-2 coronavirus, which emerged in December 2019, causing the COVID-19 pandemic (49).

MATERIALS AND METHODS

Data collection and preprocessing

VHDB dataset

We accessed the Virus-Host Database (50) on July 31, 2019 and downloaded all the available data. We note that all the reference genomes from NCBI Viral Genomes are present in VHDB, as well as their curated annotations from RefSeq. Additional, manually curated records in VHDB extend the metadata available in NCBI. More non-reference genomes are available, but considering multiple genomes per virus would skew the classifiers' performance towards the more frequently resequenced ones. The original dataset contained 14,380 records comprising RefSeq IDs for viral sequences and associated metadata.
Some viruses are divided into discontiguous segments, which are represented as separate records in VHDB; in those cases the segments were treated as contigs of a single genome in the further analysis. We removed records with unspecified host information and those confusing the highly pathogenic Variola virus with a similarly named genus of fish. Following Zhang et al. (20), we filtered out viroids and satellites, which are classified as subviral agents and not bona fide viruses (51, 52). Note that even though they require helper viruses for replication, this step did not affect ubiquitous adeno-associated viruses and large virophages, which are well established within the viral taxonomy in the families Parvoviridae and Lavidaviridae, respectively. Human-infecting viruses were extracted by searching for records containing "Homo sapiens" in the "host name" field. Note that VHDB contains information about multiple possible hosts for a given virus where appropriate. Any virus infecting humans was assigned to the positive class, also if other, non- human hosts exist. In total, the dataset contained 9,496 viruses (grouped in 7503 species), including 1,309 human viruses (393 species). We considered both DNA and RNA viruses; RNA sequences were encoded in the DNA alphabet, as in RefSeq. Defining the negative class While defining a human-infecting class is relatively straightforward, the reference negative class may be conceptualized in a variety of ways. The broadest definition takes all non-human viruses into account, including bacteriophages (bacterial viruses). This is especially important, as most of known bacteriophages are DNA viruses, while many important human (and animal) viruses are RNA viruses. One could expect that the multitude of available bacteriophage genomes dominating the negative class could lower the prediction performance on viruses similar to those infecting humans. This offers an open-view approach covering a wider part of the sequence space, but may lead to misclassification of potentially dangerous mammalian or avian viruses. As they are often involved in clinically relevant host-switching events, a stricter approach must also be considered. In this case, the negative class comprises only viruses infecting Chordata (a group containing vertebrates and closely related taxa). Two intermediate approaches consider all eukaryotic viruses (including plant and fungi viruses), or only animal-infecting viruses. This amounts to four nested host sets: "All" (8,187 non-human viruses, 7110 species), "Eukaryota" (5,114 viruses, 4275 species), "Metazoa" (2,942 viruses, 2351 species) and "Chordata" (2,078 viruses, 1530 species). Auxiliary sets containing only non-eukaryotic viruses ("non-Eukaryota"), non-animal eukaryotic viruses ("non-Metazoa Eukaryota") etc. can be easily constructed by set subtraction. For the positive class, we randomly generated a training set containing 80% of the genomes, and validation and test sets with 10% of the genomes each. Importantly, the nested structure was kept also during the training-validation-test split: for example, the species assigned to the smallest test set ("Chordata") were also present in all the bigger test sets. The same applied to other taxonomic levels, as well as the training and validation sets wherever applicable. Read simulation We simulated 250bp long Illumina reads following a modification of a previously described protocol (24) and using the Mason read simulator (53). 
First, we only generated the reads from the genomes of human-infecting viruses. Then, the same steps were applied to each of the four negative class sets. Finally, we also generated a fifth set, "Stratified", containing an equal number of reads drawn from genomes of the following disjoint host classes: "Chordata" (25%), "non-Chordata Metazoa" (25%), "non-Metazoa Eukaryota" (25%) and "non-Eukaryota" (25%). In each of the evaluated settings, we used a total of 20 million (80%) reads for training, 2.5 million (10%) reads for validation and 2.5 million (10%) paired reads as the held-out test set. Read number per genome was proportional to genome length, keeping the coverage uniform on average. Viruses with longer genomes were therefore represented by more reads than shorter viruses. On the other hand, their sequence diversity was covered at a similar level. This length-balancing step was previously shown to work well for bacterial genomes of different lengths (24, 54). While the original datasets are heavily imbalanced, we generated the same number of negative and positive data points (reads) regardless of the negative class definition used. This protocol allowed us to test the impact of defining the negative class, while using exactly the same data as representatives of the positive class. We used three training and validation sets ("All", "Stratified", and "Chordata"), representing the fully open-view setting, a setting more balanced with regard to the host taxonomy, and a setting focused on cases most likely to be clinically relevant. In each setting, the validation set matched the composition of the training set. The evaluation was performed using all five test sets to gain a more detailed insight into the effects of negative class definition on the prediction performance.

Human blood virome dataset

Similarly to Zhang et al. (20), we used the human blood DNA virome dataset (55) to test the selected classifiers on real data. We obtained 14,242,329 reads of 150bp and searched all of VHDB using blastn (with default parameters) to obtain high-quality reference labels. If a read's best hit was a human-infecting virus, we assigned it to the positive class; the negative class was assigned if this was not the case. This procedure yielded 14,012,665 "positive" and 229,664 "negative" reads.

Virus-level and species-level predictions

In this study, we focus on predicting labels for reads originating from novel viruses. What constitutes a "novel" biological entity is an open question – a novel virus does not necessarily belong to a novel species (56). If a given viral isolate clusters with a known group of isolates, it is considered to be the same virus; if it does not, it may be assigned a distinct name and considered novel (56). This is separate from its putative taxonomic assignment.
Assigning a novel virus to a novel or a previously established species is performed pursuing a wider set of criteria, and the criteria for delineating distinct species differ between viral families (51, 52, 56, 57). In most cases, species are perceived as human constructs rather than biological entities and host range often is explicitly one of the defining features (56, 58), rendering reasoning based on cross-species homology searches inherently difficult. The most prominent example of this problem is the SARS- CoV-2 virus, which is a novel virus within a previously known species (Severe acute respiratory syndrome–related coronavirus). Other members of this species include the human-infecting SARS-CoV-1, but also multiple related bat SARSr-CoV viruses (e.g. SARSr-CoV RaTG13 or Bat SARS- like coronavirus WIV1). Importantly, SARS-CoV-2 is not a strain of SARS-CoV-1; those two viruses share a common ancestor (56). This echoes similar problems related to pathogenic potential prediction for novel bacterial pathogens. A novel bacterium may be defined as a novel strain or a novel species (24), and the classifiers must be trained according to the desired definition. As the 2020 pandemic has shown, different viruses of the same species can differ wildly in their infectious potential and the broader impact on human societies. Therefore, threat assessment must be performed for novel viruses, not only novel taxa; different related viruses are non-redundant. At the same time, redundancy below this level (i.e. multiple instances of the same virus) must be eliminated from the dataset to ensure reliability of the trained classifier. VHDB tackles this problem by collecting and annotating reference genomes – each virus in the database is a separate entity with its own ID in NCBI Taxonomy. This virus-level approach was previously used by Zhang et al. (20). We show that homology-based algorithms underperform in this setting already, suggesting that machine learning is indeed required to accurately predict labels for novel viruses even if other members of the same species are present in the training database. Nevertheless, a more difficult alternative – predictions for reads of viruses belonging to completely novel species – is a related and potentially equally important task. For bacterial datasets, species novelty can be modelled by selecting a single representative genome per species (24). As the SARS- CoV-2 example shows, this is often not possible for viruses. To assess our approach in this stricter setup, we re-divided the VHDB dataset into training, validation and test sets ensuring that all viruses of a given species were assigned to only one of those subsets. This effectively models a "novel species" scenario while also reflecting within-species phenotype diversity. We recreated the species-wide versions of the "All" and "Chordata" datasets by assigning 80%, 10% and 10% of the species to the training, validation and test datasets, respectively. We resimulated the reads as outlined above and compared the performance of the machine learning and homology-based approaches achieving the highest accuracy in the simpler "novel virus" setting (see Section Prediction performance). Training We used the DeePaC package (24) to investigate RC-CNN and RC-LSTM architectures, which guarantee identical predictions for both forward and reverse-complement orientations of any given nucleotide sequence, and have been previously shown to accurately predict bacterial pathogenicity. 
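The core reverse-complement idea can be sketched in a few lines. The following is a simplified, single-convolutional-layer illustration in Keras/TensorFlow, an assumption made for brevity here rather than the actual DeePaC implementation; the specific architecture we use is described next.

import tensorflow as tf
from tensorflow.keras import layers

def rc_cnn_sketch(read_len=250, n_filters=512, kernel_size=15):
    # One-hot input with channel order A, C, G, T; Ns are all-zero rows.
    inp = layers.Input(shape=(read_len, 4))
    # Reversing both the sequence axis and the channel axis turns the one-hot
    # matrix into its reverse complement (A<->T, C<->G).
    rc = layers.Lambda(lambda x: x[:, ::-1, ::-1])(inp)
    conv = layers.Conv1D(n_filters, kernel_size, activation="relu")  # shared weights
    pool = layers.GlobalAveragePooling1D()
    fwd, rev = pool(conv(inp)), pool(conv(rc))
    merged = layers.Add()([fwd, rev])                # sum strand representations
    hidden = layers.Dense(256, activation="relu")(merged)
    out = layers.Dense(1, activation="sigmoid")(hidden)
    return tf.keras.Model(inp, out)

model = rc_cnn_sketch()
# Because the two branches share weights and the merge is symmetric, the model
# returns identical scores for a read and its reverse complement.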
Here, we employ an RC-CNN with two convolutional layers with 512 filters of size 15 each, average pooling and 2 fully connected layers with 256 units each. The LSTM used has 384 units (Fig. S1). We use dropout regularization in both cases, together with aggressive input dropout at a rate of 0.2 or 0.25 (tuned for each model). Input dropout may be interpreted as a special case of noise injection, where a fraction of input nucleotides is turned to Ns. Representations of the forward and reverse-complement strands are summed before the fully connected layers. As the two mates in a read pair should originate from the same virus, predictions obtained for them can be averaged for a boost in performance. If a contig or genome is available, averaging predictions for the constituting reads yields a prediction for the whole sequence. We used Tesla P100 and Tesla V100 GPUs for training and an RTX 2080 Ti for visualizations.

We wanted the networks to yield accurate predictions for both 250bp (our data, modelling a sequencing run of an Illumina MiSeq device) and 150bp long reads (as in the Human Blood Virome dataset). As shorter reads are padded with zeros, we expected the CNNs trained using average pooling to misclassify many of them. Therefore, we prepared a modified version of the datasets, in which the last 100bp of each read were turned to zeros, mocking a shorter sequencing run while preserving the error model. Then, we retrained the CNN which had performed best on the original dataset. Since, in principle, the Human Blood Virome dataset should not contain viruses infecting non-human Chordata, a "Chordata"-trained classifier was not used in this setting.

Benchmarking

We compare our networks to the k-NN classifier proposed by Zhang et al. (20), the only other approach explicitly tested on raw NGS reads and detecting human viruses in a fully open-view setting (not focusing on a limited number of species). We use the real sequencing data that they used (55) for an unbiased comparison. We trained the classifier on the "All" dataset as described by the authors, i.e. using non-overlapping, 500bp-long contigs generated from the training genomes (retraining on simulated reads is computationally prohibitive). We also tested the performance of using BLAST to search against an indexed database of labeled genomes. We constructed the database from the "All" training set and used discontiguous megablast to achieve high inter-species sensitivity. For the NGS mappers (BWA-MEM (59) and Bowtie2 (60)), the indices were constructed analogously. Kraken (61) was previously shown to perform worse than both BLAST and machine learning when faced with read-based pathogenic potential prediction for novel bacterial species (54). Its major advantage – assigning reads to lowest common ancestor (LCA) nodes in ambiguous cases – turns into a problem in the infectivity prediction task, as transferring labels to LCAs is often impossible (54).
Therefore, we focus on alignment-based approaches as the most accurate alternative to machine learning in this context. Note that both alignment and k-NN can yield conflicting predictions for the individual mates in a read pair. What is more, BLAST and the mappers yield no prediction at all if no match is found. Therefore, similarly to Bartoszewicz et al. (24), we used the accept anything operator to integrate binary predictions for read pairs and genomes. At least one match is needed to predict a label, and conflicting predictions are treated as if no match was found at all. Missing predictions lower both true positive and true negative rates.

Filter visualization

Substring extraction

In order to visualize the learned convolutional filters, we downsample a matching test set to 125,000 reads and pass it through the network. This is modelled after the method presented by Alipanahi et al. (25). For each filter and each input sequence, the authors extracted a subsequence leading to the highest activation, and created sequence logos from the obtained sequence sets ("max-activation"). We used the DeepSHAP implementation (45) of DeepLIFT (43) to extract score-weighted subsequences with the highest contribution score ("max-contrib") or all score-weighted subsequences with non-zero contributions ("all-contrib"). Computing the latter was costly and did not yield better quality logos. We use an all-zero reference. As reads from real sequencing runs are usually not equally long, shorter reads must be padded with Ns; the "unknown" nucleotide is also called whenever there is not enough evidence to assign any other to the raw sequencing signal. Therefore, Ns are "null" nucleotides and are a natural candidate for the reference input. We do not consider alternative solutions based on GC content or dinucleotide shuffling, as the input reads originate from multiple different species, and the sequence composition may itself be a strong marker of both virus and host taxonomy (13). We also avoid the weight-normalization suggested for zero-references (43), as it implicitly models the expected GC content of all possible input sequences, and assumes no Ns present in the data. Finally, we calculate average filter contributions to obtain a crude ranking of feature importance with regard to both the positive and negative class.

Partial Shapley values

Building sequence logos involves calculating the information content (IC) of each nucleotide at each position in a prospective DNA motif. This can then be interpreted as a measure of evolutionary sequence conservation. However, high IC does not necessarily imply that a given nucleotide is relevant in terms of its contribution to the classifier's output. Some sub-motifs may be present in the sequences used to build the logo, even if they do not contribute to the final prediction (or even a given filter's activation). To test this hypothesis, we introduce partial Shapley values. Intuitively speaking, we capture the contributions of a nucleotide to the network's output, but only in the context of a given intermediate neuron of the convolutional layer. More precisely, for any given feature $x_i$, intermediate neuron $y_j$ and the output neuron $z$, we aim to measure how $x_i$ contributes to $z$ while regarding only the fraction of the total contribution of $x_i$ that influences how $y_j$ contributes to $z$.
Although similarly named concepts were mentioned before as intermediate computation steps in a different context (62, 63), we define and use partial Shapley values to visualize contribution flow through convolutional filters. This differs from recently introduced contribution weight matrices (32), where feature attributions are used as a representation of an identified transcription factor binding site irreducible to a given intermediate neuron.

Using the formalism of DeepLIFT's multipliers (43) and their reinterpretation in SHAP (45), we backpropagate the activation differences only along the paths "passing through" $y_j$. In Eq. 1, we define partial multipliers $\mu^{(y_j)}_{x_i z}$ and express them in terms of Shapley values $\phi$ and activation differences w.r.t. the expected activation values (reference activation). Calculating partial multipliers is equivalent to zeroing out the multipliers $m_{y_k z}$ for all $k \neq j$ before backpropagating $m_{y_j z}$ further.

\mu^{(y_j)}_{x_i z} = m_{x_i y_j}\, m_{y_j z} = \frac{\phi_i(y_j, x)\, \phi_j(z, y)}{(x_i - E[x_i])\,(y_j - E[y_j])} \qquad (1)

We define partial Shapley values $\varphi^{(y_j)}_i(z, x)$ analogously to how Shapley values can be approximated by a product of multipliers and input differences w.r.t. the reference (Eq. 2):

\varphi^{(y_j)}_i(z, x) = \mu^{(y_j)}_{x_i z}\,(x_i - E[x_i]) = \frac{\phi_i(y_j, x)\, \phi_j(z, y)}{y_j - E[y_j]} \qquad (2)

From the chain rule for multipliers (43), it follows that standard multipliers are a sum over all partial multipliers for a given layer $y$. Therefore, Shapley values as approximated by DeepLIFT are a sum of partial Shapley values for the layer $y$ (Eq. 3):

\phi_i(z, x) = m_{x_i z}\,(x_i - E[x_i]) = \sum_j \varphi^{(y_j)}_i(z, x) \qquad (3)

Once the contributions $\phi_j(z, y)$ of the first-layer convolutional filters are calculated, $\varphi^{(y_j)}_i(z, x)$ for the first convolutional layer of a network with one-hot encoded inputs and an all-zero reference can be efficiently calculated using weight matrices and filter activation differences (Eq. 4-5). First, in this case we do not traverse any non-linearities and can directly use the linear rule (43) to calculate the contributions of $x_i$ to $y_j$ as a product of the weight $w_i$ and the input $x_i$. Second, the input values may only be 0 or 1.

\phi_i(y_j, x) = w_i x_i = \begin{cases} w_i, & \text{if } x_i = 1 \\ 0, & \text{otherwise} \end{cases} \qquad (4)

\varphi^{(y_j)}_i(z, x) = \frac{w_i\, \phi_j(z, y)}{y_j - E[y_j]} \qquad (5)

The resulting partial contributions can be visualized alongside the IC of each nucleotide of a convolutional kernel. To this end, we design extended sequence logos, where each nucleotide is colored according to its contribution. Positive contributions are shown in red, negative contributions are blue, and near-zero contributions are gray. Therefore, no information is lost compared to standard sequence logos, but the relevance of individual nucleotides and the filter as a whole can be easily seen. Color saturation is limited by the reciprocal of a user-defined gain parameter, here set to $nm$, where $n$ equals the number of input features $x_i$ (sequence length) and $m$ equals the number of convolutional filters $y_j$ in a given layer.
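In the first-layer case, Eq. 4-5 reduce to simple array operations. The following numpy sketch for a single filter assumes that the filter's output contributions $\phi_j(z, y)$ have already been computed with DeepLIFT/SHAP for the layers above; it is illustrative pseudocode rather than the DeePaC implementation, and ignores the reverse-complement branch.

import numpy as np

def partial_contribs(x_onehot, W, b, phi_j_z, eps=1e-8):
    # Eq. 4-5 for one first-layer filter and an all-zero (all-N) reference.
    # x_onehot: (L, 4) one-hot read; W: (k, 4) filter weights; b: scalar bias;
    # phi_j_z: (L-k+1,) contributions of this filter's activations to the output z
    # (assumed precomputed). Returns per-position, per-nucleotide partial contributions.
    L, k = x_onehot.shape[0], W.shape[0]
    # ReLU activations of the filter at every valid position
    y = np.array([max(float(np.sum(W * x_onehot[p:p + k])) + b, 0.0)
                  for p in range(L - k + 1)])
    y_ref = max(b, 0.0)                 # reference activation on the all-N input
    out = np.zeros_like(x_onehot, dtype=float)
    for p in range(L - k + 1):
        dy = y[p] - y_ref
        if abs(dy) < eps:
            continue                    # no activation difference, nothing to distribute
        # Eq. 4 (linear rule): phi_i(y_j, x) = w_i * x_i; Eq. 5: rescale by phi_j(z, y) / dy
        out[p:p + k] += (W * x_onehot[p:p + k]) * (phi_j_z[p] / dy)
    return out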
Genome-wide phenotype analysis

We create genome-wide phenotype analysis (GWPA) plots to analyse which parts of a viral genome are associated with the infectious phenotype. We divide the genome into overlapping, 250bp long subsequences (pseudo-reads) without adding any sequencing noise. For the highest resolution, we use a stride of one nucleotide. For S. aureus, we used a stride of 125bp. We predict the infectious potential of each pseudo-read and average the obtained values at each position of the genome. Analogously, we calculate average contributions of each nucleotide to the final prediction of the convolutional network. Finally, we normalize raw infectious potentials into the [−0.5,0.5] interval for a more intuitive graphical representation. We visualize the resulting nucleotide-resolution maps with IGV (64). For protein structures, we average the scores codon-wise to obtain contribution scores per amino acid and visualize them with PyMOL (65). For well-annotated genomes, we compile a ranking of genes (or other genomic features) sorted by the average infectious potential within a given region. In addition, we scan the genome with the learned filters of the first convolutional layer to find genes enriched in subsequences yielding non-zero filter activations. We use Gene Ontology to connect the identified genes of interest with their molecular functions and the biological processes they are engaged in.

RESULTS

Negative class definition

Choosing which viruses should constitute the negative class is application dependent and influences the performance of the trained models. Table S1 summarizes the prediction accuracy for different combinations of the training and test set composition. The models trained only on human and Chordata-infecting viruses maintain similar, or even better, performance when evaluated on viruses infecting a much broader host range, including bacteria. This suggests that the learned decision boundary separates human viruses from all the others surprisingly well. We hypothesize that the human host signal must be relatively strong and contained within the Chordata host signal. A dropout rate of 0.2 resulted in the highest validation accuracy for CNNStr-150 and LSTMStr. A rate of 0.25 was selected for the other models. Adding more diversity to the negative class may still boost performance on more diverse test sets, as in the case of the CNN trained on the "All" dataset (CNNAll). This model performs a bit worse on viruses infecting hosts related to humans, but achieves higher accuracy than the "Chordata"-trained models and the best recall overall. Rebalancing the negative class using the "Stratified" dataset helps to achieve higher performance on animal viruses while maintaining high overall accuracy. The LSTMs are outperformed by the CNNs, but they can be used for shorter reads without retraining (see Sections Training and Prediction performance).

Prediction performance

We selected LSTMAll and CNNAll for further evaluation. We used a single consumer-grade RTX 2080 Ti GPU to measure inference speed. The CNN classifies 5000 reads/s and the LSTM 1855 reads/s. Analyzing ten million reads takes only 33 minutes using the faster model; linear speed-ups are possible if more GPUs are available. Therefore, the trained models achieve the high throughput necessary to analyze NGS datasets. Table 1 presents the results of a benchmark using the "All" test set. Low performance of the k-NN classifier (20) is caused by frequent conflicting predictions for each read in a read pair.
In a single-read setting it achieves 75.5% accuracy, while our best model achieves 87.8% (Table S2). Although BLAST achieves high precision, it yields no predictions for over 10% of the samples. CNNAll is the most sensitive and accurate. As expected, standard mapping approaches (BWA-MEM and Bowtie2) struggle with analysing novel pathogens – they are the most precise but the least sensitive. Our approach outperforms them by 15-30%.

Although we focus on the extreme case of read-based predictions, our method can also be used on assembled contigs and full genomes if they are available, as well as on read sets from pure, single-virus samples. We note that assembly itself does not yield any labels and a follow-up analysis (via alignment, machine learning or other approaches) is required to correctly classify metagenomic contigs in any case. We ran predictions on contigs without any size filtering with both k-NN and BLAST (Table 2). We present performance measures for both individual contigs and whole genome predictions based on contig-wise majority vote. We compare them to BLAST with read-wise majority vote (54) and to read-wise average predictions of our networks, analogous to the approach presented previously for bacteria (24). Our method outperforms BLAST by 1.2% and k-NN by 8.9%, even though they have access to the full biological context (full sequences of all contigs in a genome), while we simply average outputs for short reads originating from the contigs.

Table 1. Classification performance in the fully open-view setting (all virus hosts), read pairs. Acc. – accuracy, Prec. – precision, Rec. – recall, Spec. – specificity. Bowtie2, BWA-MEM and BLAST yield no predictions for over 35%, 19% and 10% of the samples, respectively. Best performance in bold.

                 Acc.   Prec.   Rec.   Spec.
CNNAll (ours)    89.9   93.9    85.4   94.4
LSTMAll (ours)   86.4   89.0    83.0   89.8
k-NN             57.1   57.8    52.1   62.0
Bowtie2          58.6   99.2    59.2   58.0
BWA-MEM          72.8   98.9    73.9   71.8
BLAST            80.6   98.4    79.1   82.2

We benchmarked our models against the human blood virome dataset used by Zhang et al. (20). Our models outperform their k-NN classifier. As the positive class massively outnumbers the negative class, all models achieve over 99% precision. CNNAll-150 performs best (Table 3). However, the positive class is dominated by viruses which are not necessarily novel. The CNN was more accurate on training data, so we expected it to detect those viruses easily.

Finally, we repeated the analysis in the "novel species" scenario. Classifying novel viral species when restricted to Chordata-infecting viruses is too challenging for practical purposes (Table S3). Read-wise predictions are not much better than random guesses for both BLAST and CNNs. The low precision of BLAST shows that it often recovers wrong labels even when it does find a match – sequence similarity is not a reliable predictor of the infectious potential in this setting. Even if a whole genome is available, overall accuracy is low.
This looks very different in the fully open-view scenario (Table 4). The CNN trained on the species-wise division of the "All" dataset (CNNSP-All) outperforms BLAST by a wide margin on both reads and genomes. Strikingly, CNNSP-All predictions based on a single read pair achieve higher accuracy than BLAST predictions using whole genomes, mainly due to their significantly higher recall. What is more, pooling predictions from all the reads originating from a given genome does not improve overall CNNSP-All accuracy any further. As CNNSP-All does not reliably outperform its Chordata-trained analog on the "Chordata" dataset (CNNSP-Cho, Table S3), we suspect that its relatively high accuracy on the "All" dataset is caused by its high sensitivity while maintaining good specificity on non-Chordata viruses.

Table 2. Classification performance, all hosts. Whole available genomes. Negative class is the majority class. BAcc. – balanced accuracy, Rec. – recall, Spec. – specificity. BLAST (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. k-NN (genome) and BLAST (genome) use contig-wise majority vote. k-NN (contigs) and BLAST (contigs) represent performance on individual contigs treated as separate entities. k-NN (reads) was not used, as high conflicting prediction rates made read-wise aggregation impracticable.

                  BAcc.   AUPR   Rec.   Spec.
CNNAll (ours)     91.7    91.2   89.3   94.2
LSTMAll (ours)    86.3    85.8   96.2   76.4
BLAST (reads)     90.3    N/A    85.5   95.1
k-NN (genome)     82.8    N/A    93.9   71.6
BLAST (genome)    90.5    N/A    86.3   94.6
k-NN (contigs)    83.0    N/A    94.3   71.6
BLAST (contigs)   88.4    N/A    87.1   89.7

Table 3. Classification performance on the human blood virome dataset. Positive class is the majority class. BAcc. – balanced accuracy, Rec. – recall, Spec. – specificity.

                   BAcc.   AUPR    Rec.   Spec.
CNNAll-150 (ours)  96.8    >99.9   97.3   96.2
LSTMAll (ours)     91.8    >99.9   88.2   95.5
k-NN               83.1    99.5    80.9   85.4

Table 4. Classification performance, novel species. Top: paired reads (see Table 1). BLAST yields predictions for only 64.3% of the pairs. Bottom: whole available genomes or contigs – negative class is the majority class (see Table 2). BAcc. – balanced accuracy (equal to accuracy for the balanced paired-read dataset), Rec. – recall, Spec. – specificity. BLAST (reads) and our networks use read-wise majority vote or output averaging to aggregate predictions over all reads from a genome. BLAST (genome) uses contig-wise majority vote. BLAST (contigs) represents performance on individual contigs treated as separate entities. Note that low precision is heavily affected by class imbalance.

                   BAcc.   Prec.   Rec.   Spec.
Paired reads
CNNSP-All (ours)   77.3    86.2    65.0   89.6
BLAST              47.1    94.1    17.8   76.4
Genomes and contigs
CNNSP-All (ours)   76.8    34.2    69.8   83.8
BLAST (reads)      61.8    46.8    30.2   93.5
BLAST (genome)     64.0    44.9    36.5   91.5
BLAST (contigs)    57.9    37.9    33.6   82.1

Filter visualization

Over 84% of all contributing first-layer filters in CNNAll have positive average contribution scores. We comment more on this fact in Section Nucleotide contribution logos. For CNNAll, the average information content of our motifs is strongly correlated nucleotide-wise with the IC of DeepBind-like logos (Spearman's ρ > 0.95, p < 10^-15 for all contributing filter pairs except one). The difference in average IC is negligible (0.04 bit higher for "max-contrib", Wilcoxon test, p < 10^-15). Therefore, our contribution logos represent analogous "motifs", while extracting additional, nucleotide-level interpretations. For exactly one filter, "max-contrib" and "max-activation" scores are not correlated. A deeper analysis reveals that this particular filter is activated by stretches
of 0s (Ns) – it is the only filter with a positive bias, and almost all of its weights are negative (with one near-zero positive). Therefore, an overwhelming majority of its maximum activations are in fact padding artifacts. On the other hand, regions of unambiguous nucleotide sequences result in high positive contributions, since they correspond to a lack of filter activation, where an activation is present for the all-N reference. In fact, for over 99.9% of the reads, positive contributions occur at every single position. We suspect that the filter works as an "ambiguity detector". Since Ns are modelled as all-zero vectors in the one-hot encoding scheme used here, the network represents "meaningful" (i.e. unambiguous) regions of the input as a missing activation of the filter. This is supported by the fact that the filter lacks any further preference for the specific non-zero nucleotide type. Since the sequence logos presented here ignore ambiguous (i.e. noninformative) nucleotides, their ICs for this filter are near-zero, preventing meaningful visualization. On the other hand, this ambiguity seems to play a role in the final classification decision, as contribution distributions are well-separated for both classes (Fig. S2). We speculate that this could be caused by lower quality of the non-pathogen reference genomes, but understanding how exactly this information is used would require further investigation, including feature interactions at all layers of the network. Importantly, only the contribution analysis reveals the relevance of the filter beyond simple activation and nucleotide overrepresentation. The choice of the reference input is crucial.

In Fig. 1 we present example filters, visualized as "max-contrib" sequence logos based on mean partial Shapley values for each nucleotide at each position. All nucleotides of the filters with the second-highest (Fig. 1a) and the lowest (Fig. 1b) score have relatively strong contributions in accordance with the filters' own contributions. However, we observe that some nucleotides consistently appear in the activating subsequences, but the sign of their contributions is opposite to the filter's (low-IC nucleotides of a different color, Fig. 1c). Those "counter-contributions" may arise if a nucleotide with a negative weight forms a frequent motif with others with positive weights strong enough to activate the filter. We comment on this fact in the Section Nucleotide contribution logos. Some filters seem to learn gapped motifs resembling a codon structure (Fig. 1c). We extracted this filter from the original DeePaC network predicting bacterial pathogenicity (24), where the counter-contributions are common, but we find similar filters in our networks as well (Fig. S3). We scanned a genome of S. aureus subsp.
aureus 21200 (RefSeq assembly accession: GCF_000221825.1) with this filter and discovered that the learned motif is indeed significantly enriched in coding sequences (Fisher exact test with Benjamini-Hochberg correction, q < 10^-15). It is also enriched in a number of specific genes. The one with the most hits (sraP, q < 10^-15) encodes a serine-rich adhesin involved in the pathogenesis of infective endocarditis and mediating binding to human platelets (66). The filter seems to detect serine and glycine repeats in this particular gene (Fig. S5), but a broader, cross-species, multi-gene analysis would be required to fully understand its activation patterns. An analogous analysis revealed that the second-highest contributing filter (Fig. 1a) is overall enriched in coding sequences in both Taï Forest ebolavirus (q < 10^-15, RefSeq accession: NC_014372) and the SARS-CoV-2 coronavirus (q = 5.6×10^-5, RefSeq accession: NC_045512.2). The top hits are the nucleocapsid (N) protein gene of SARS-CoV-2 and the VP35 ebolavirus gene encoding a polymerase cofactor suppressing innate immune signaling (q < 10^-15).

Genome-wide phenotype analysis

We created a GWPA plot for the Taï Forest ebolavirus genome. Most genes (6 out of 7) can be detected with visual inspection by finding peaks of elevated infectious potential score predicted by at least one of the models (Fig. 2a). Intergenic regions are characterized by lower mean scores. Noticeably, most nucleotide contributions are positive, and low non-negative contributions coincide with regions of negative predictions. Taken together with the surprisingly good generalization of Chordata-trained classifiers and a dominance of positive filters discussed above, this suggests that our networks work as positive class detectors, treating all other sequences as "negative" by default. Indeed, the reference sequence of all Ns is predicted to be "non-pathogenic" with a score of 0.

We ran a similar analysis of S. aureus using the built-in DeePaC models (24) and our interpretation workflow. While a viral genome usually contains only a handful of genes, by compiling a ranking of 870 annotated genes of the analyzed S. aureus strain we could test if the high-ranking regions are indeed associated with pathogenicity (Table S4). Indeed, out of three top-ranking genes with known biological names and Gene Ontology terms, sarR and sspB are directly engaged in virulence, while hupB regulates expression of virulence-involved genes in many pathogens (67). In contrast to the viral models, both negative and positive contributions are present (Fig. S6), and the model's output for the all-N reference is slightly above the decision threshold (0.58). Even though the network architectures of the viral and the bacterial models are the same, the latter learns a "two-sided" view of the data. We assume this must be a feature of the dataset itself.

Fig. 2b presents a GWPA plot for the whole genome of the SARS-CoV-2 coronavirus, successfully predicted to infect humans, even though the data was collected at least 5 months before its emergence. Interestingly, its mean infectious potential (0.57 as scored by CNNAll) is relatively close to the decision threshold, while its closest known relative, a bat-infecting SARSr-CoV RaTG13, is actually falsely classified as a human virus with a slightly lower mean infectious potential (0.55).
What is more, the gene encoding the spike protein, which plays a significant role in host entry (68), has a mean score slightly above the threshold for SARS-CoV-2 (0.52) and below the threshold for RaTG13 (0.49). As shown in the GWPA plots of both viruses (Fig. 2b and Fig. S4), regions that the network has learned to associate with the infectious phenotype are distributed non-uniformly and tend to cluster together. This suggests that the low-confidence mean prediction for those viruses is not a result of random guessing, but genuine ambiguity present in the data – and the misclassification of RaTG13 could be indicative of a general zoonotic potential of SARS-related coronaviruses. In Fig. 2b, we highlighted the score peaks aligning with the spike protein gene (S), as well as the E and N genes, which were scored the highest (apart from an unconfirmed ORF10 of just 38aa downstream of N) by the CNN and the LSTM, respectively. Correlation between the CNN and LSTM outputs is significant, but species-dependent and moderate (0.28 for Ebola, 0.48 for SARS-CoV-2), which suggests they capture complementary signals.

Fig. 2c shows the nucleotide-level contributions in a small peak within the receptor-binding domain (RBD) of the S protein, crucial for recognizing the host cell. The domain location was predicted with CD-search (69) using the default parameters. The maximum score of this peak is noticeably higher for SARS-CoV-2 (0.87) than for its analog in RaTG13 (0.67). Fig. 3 presents the RBD in the structural context of the whole S protein (PDB ID: 6VSB, (70)), as well as in complex with a SARS-neutralizing antibody CR3022 (PDB ID: 6W41, (71)). The high score peak roughly corresponds to one of the regions associated with reduced expression of the RBD (72), located in the core-RBD subdomain. It covers over 71% of the CR3022 epitope, as well as the neighbouring site of the N343 glycan. The latter is present in the epitope of another core-RBD targeting antibody, S309 (73). All the per-residue average contributions in the region are positive (Fig. S7), even in the regions of lower pathogenicity score, in accordance with the results presented in Fig. 2c.

Figure 1. Nucleotide contribution logos of example filters. 1a: Second-highest mean contribution score (CNNAll). Error bars correspond to Bayesian 95% confidence intervals. 1b: Lowest mean contribution score (CNNAll). 1c: Gaps resembling a codon structure, extracted from Bartoszewicz et al. (24). Consensus sequence: CAWCNNCNNCNNCNN. 1d-1f: Analogous logos created with the DeepBind-like "max-activation" approach. Our "max-contrib" logos visualize contributions of individual nucleotides, including counter-contributions.

DISCUSSION

Accurate predictions from short DNA reads

Compared to the previous state-of-the-art in viral host prediction directly from next-generation sequencing reads (20), our models drastically reduce the error rates. This holds also for novel viruses not present in the training set.
Generalization of virus-level Chordata models to other host groups is a sign of a strong, "human" signal. We suspect our classifiers detect the positive class treating all other regions of the sequence space as "negative" by default, exhibiting traits of a one-class classifier even without being explicitly trained to do so. We find further support for this hypothesis: the networks learn many more "positive" than "negative" filters, and regions of near-zero nucleotide contributions (including the null reference sample) result in negative predictions. As this effect does not occur for bacteria, we expect it to be task- and data-dependent. While we ignore the simulated quality information here, investigating the role of sequencing noise will be an interesting follow-up study.

Although the data setup is crucial in general, the modelling step is also important, as shown by our comparison to the baseline k-NN model. The RC-nets are relatively simple, but they are invariant to reverse-complementarity and perform better than random forests, naïve Bayes classifiers and standard NN architectures in another NGS task (24). In the paired read scenario, the previously described k-NN approach fails, and standard, alignment-based homology testing algorithms cannot find any matches in more than 10% of the cases, resulting in relatively low accuracy. On a real human virome sample, where a main source of negative class reads is most likely contamination (55), our method filters out non-human viruses with high specificity. In this scenario, the BLAST-derived ground-truth labels were mined using the complete database (as opposed to just a training set).

In all cases, our results are only as good as the training data used; high quality labels and sequences are needed to develop trustworthy models. Ideally, sources of error should be investigated with an in-depth analysis of a model's performance on multiple genomes covering a wide selection of taxonomic units. This is especially important as the method assumes no mechanistic link between an input sequence and the phenotype of interest, and the input sequence constitutes only a small fraction of the target genome without a wider biological context. Still, it is possible to predict a label even from those small, local fragments. A similar effect was also observed for image classification with CNNs (74). Virulence arises as a complex interplay between the host and the virus, so the predictions reflect only an estimated potential of the infectious phenotype. This mirrors the caveats of bacterial pathogenic potential prediction (24), including the considerations of balancing computational cost, reliability of error estimates, and the size and composition of the reference database. Even though deep learning outperforms the standard homology-based methods, it is still an open question whether it captures "functional" signals, or just a more flexible sequence similarity function. By the very nature of machine learning and sequence comparison in general, we expect similar viruses to yield similar predictions; in principle, this could be used to assess the risk of a host-switching event.
Figure 2. Taï Forest ebolavirus and SARS-CoV-2 coronavirus genomes. Top: score predicted by LSTMAll. Middle: score predicted by CNNAll. Heatmap: nucleotide contributions of CNNAll. Bottom, in blue: reference sequence. 2a: Taï Forest ebolavirus. Genes that can be detected by at least one model are highlighted in black. 2b: Whole genome and sequences encoding the spike protein (S), envelope protein (E) and nucleocapsid protein (N). 2c: Spike protein gene, a small peak (positions 22,595-22,669, dashed line in Fig. 2b) within the receptor-binding domain (predicted by CD-search, positions 22,517-23,185). Binding to the receptor is crucial for entry to the host cell.

Dual-use research and biosecurity

While we focused on the NGS-based prediction scenario, our models could in principle be used to screen DNA synthesis orders for potentially dangerous sequences in the context of cyberbiosecurity in synthetic biology. Since standard, homology-based approaches like BLAST are not enough to guarantee accurate screening at a reasonable cost (75, 76, 77), machine learning methods are a promising solution. This has been suggested before for the bacterial DeePaC models (24), and is applicable to the viral networks presented here as well. However, this line of research can raise questions about possible dual use. O'Brien and Nelson (78) suggested that while the intended purpose of pathogenicity potential prediction is to mitigate biosecurity threats, it could actually enable designing new pathogens to cause maximal harm. The importance of this concern is difficult to overstate, and it must be addressed. If an ML-guided, genome-wide phenotype optimization tool existed, it would indeed be a classical dual-use technology not unlike more established computer-aided design approaches for synthetic biology – potentially dangerous, but offering tremendous benefits (e.g. in agriculture, medicine or manufacturing) as well. However, the models presented here do not allow biologically sensible optimization of target sequences. For example, we find meaningless, low-complexity sequences of mononucleotide repeats corresponding to global maxima (infectious potential of 1.0). These artifacts highlight the fact that only some, largely undefined, regions of the theoretically possible sequence space are biologically relevant. What is more, we operate on short sequences constituting minuscule fractions of the whole genome with all its complexity.
Figure 3. Predicted infectious potentials plotted over the SARS-CoV-2 spike glycoprotein receptor-binding domain. 3a-3c: Top and side views of the spike protein. Three receptor-binding domains (RBDs) are colored in blue, white and red according to the predicted infectious potential of the corresponding genomic sequence. One of the domains is in the "up" conformation. Red regions corresponding to the peak in Fig. 2c are located in the core-RBD subdomain. 3d: RBD in complex with a SARS-neutralizing antibody CR3022 (green). The red region covers over 71% of the CR3022 epitope, but spans also to the neighbouring fragments, including the site of the N343 glycan (carbohydrate in red stick representation). This is a part of the epitope of another neutralizing antibody, S309. 3e: Cartoon representation of Fig. 3d. The red region is centered on two exposed α-helices surrounding the core β-sheet (lower score, white).

Although successful deep learning approaches for both protein (79, 80, 81) and regulatory sequence design (82, 83, 84, 85) do exist, moving from read-based classification to genome-wide phenotype optimization would require considerable research effort, if it is possible at all. This would entail capturing a wealth of biological context well beyond the capabilities of even the best classification models currently available.

Nucleotide contribution logos

Visualizing convolutional filters may help to identify more complex filter structures and disentangle the contributions of individual nucleotides from their "conservation" in contributing sequences. Counter-contributions suggest that the information content and the contribution of a nucleotide are not necessarily correlated. Visualizing learned motifs by aligning the activating sequences (25) would not fully describe how the filter reacts to the presented data. It seems that the assumption of nucleotide independence – which is crucial for treating DeepLIFT as a method of estimating Shapley values for input nucleotides (45) – does not hold in full. Indeed, k-mer distribution profiles are frequently used features for modelling DNA sequences (as shown also by the dimer-shuffling method of generating reference sequences proposed by Shrikumar et al. (43)). However, DeepLIFT's multiple successful applications in genomics indicate that the assumption probably holds approximately. We see information content and DeepLIFT's contribution values as two complementary channels that can be jointly visualized for better interpretability and explainability of CNNs in genomics.

Filter enrichment analysis enables even deeper insight into the inner workings of the networks. We generate activation data for hundreds to thousands of species, genes and filters. Yet, aggregation and interpretation of those results beyond case studies is non-trivial, and a promising avenue for further research.
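As a rough illustration of how the two channels could be derived from the same data, the sketch below computes per-position information content and mean per-nucleotide contributions for a set of subreads activating a single filter. It is a simplified stand-in for the package's actual logo workflow; the array names and the exact weighting scheme are assumptions.

```python
import numpy as np

def logo_channels(onehot_subreads, contribs, pseudocount=1e-3):
    """Per-position information content and mean nucleotide contributions.

    onehot_subreads: (n_subreads, 4, filter_length) one-hot subreads that
                     activate the filter of interest
    contribs:        (n_subreads, 4, filter_length) DeepLIFT contribution
                     scores for the same subreads
    """
    counts = onehot_subreads.sum(axis=0)                  # (4, L) nucleotide counts
    freqs = (counts + pseudocount) / (counts.sum(axis=0) + 4 * pseudocount)
    entropy = -(freqs * np.log2(freqs)).sum(axis=0)       # per-position entropy
    info_content = 2.0 - entropy                          # bits, uniform background

    # Mean contribution of each nucleotide at each position, keeping the sign so
    # that counter-contributions (negative values) remain visible in the logo.
    mean_contrib = contribs.mean(axis=0)                  # (4, L)
    return info_content, mean_contrib
```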
Genome-scale interpretability

Mapping predictions back to a target genome can be used both as a way of investigating a given model's performance and as a method of genome analysis. GWPA plots of well-annotated genomes highlight the sequences with erroneous and correct phenotype predictions at both the genome and gene level, and nucleotide-resolution contribution maps help track those regions down to individual amino acids. On the other hand, once a trusted model is developed, it can be used on newly emerging pathogens, such as the SARS-CoV-2 virus briefly analyzed in this work. Therefore, we see GWPA applications both in probing the behaviour of artificial neural networks in pathogen genomics and in finding regions of interest in weakly annotated genomes. What is more, the approach could easily be co-opted for genome-wide activation analyses of any arbitrary intermediate neuron. The methods presented here may also be applied to other biological problems, and extending them to other hosts and pathogen groups, multi-class classification or gene identification is possible. However, experimental work and traditional sequence analysis are required to truly understand the biology behind host adaptation and to distinguish true hits from false positives.

Conclusion

We presented a new approach for predicting the host of a novel virus based on a single DNA read or a read pair, cutting the error rates in half compared to the previous state-of-the-art. For convolutional filters, we jointly visualize nucleotide contributions and information content. Finally, we use GWPA plots to gain insights into the models' behaviour and to analyze the recently emerged SARS-CoV-2 virus. The approach presented here is implemented as a Python package (see Data availability) and a command line tool easily installable with Bioconda (86).

DATA AVAILABILITY

The datasets of simulated reads with associated metadata are hosted at https://doi.org/10.5281/zenodo.4312525. The tool can be installed with Bioconda (conda install deepacvir, requires setting up Bioconda), Docker (docker pull dacshpi/deepac) or pip (pip install deepacvir). Detailed installation instructions, a user guide and the main codebase (including the interpretability workflows presented here) are available at https://gitlab.com/dacs-hpi/DeePaC. Source code of the plugin shipping the trained models, config files describing the architectures used and the models themselves are available at https://gitlab.com/dacs-hpi/DeePaC-vir.

ACKNOWLEDGEMENTS

We gratefully acknowledge Yong-Zhen Zhang and the scientists at the Shanghai Public Health Clinical Center & School of Public Health, Fudan University, who shared the sequence of the SARS-CoV-2 virus ahead of publication. We thank Melania Nowicka (Max Planck Institute for Molecular Genetics) for inspiring discussions on efficient calculations of partial Shapley values, Vitor C. Piro (Hasso Plattner Institute) for discussions on traversing taxonomy graphs, Lothar H. Wieler (Robert Koch Institute) for useful comments on the first draft of the manuscript, and the anonymous reviewers for their suggestions and feedback.

FUNDING

This work was supported by the German Academic Scholarship Foundation (JMB), the BMBF Computational Life Sciences initiative (project DeePath, to BYR) and the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B).

REFERENCES

1. Calvignac-Spencer, S., Schulze, J. M., Zickmann, F., and Renard, B. Y. (2014) Clock rooting further demonstrates that Guinea 2014 EBOV is a member of the Zaïre lineage. PLoS Currents, 6.
2. Vouga, M. and Greub, G. (January, 2016) Emerging bacterial pathogens: the past and beyond. Clinical Microbiology and Infection, 22(1), 12–21.
3. Trappe, K., Marschall, T., and Renard, B. Y. (September, 2016) Detecting horizontal gene transfer by mapping sequencing reads across species boundaries. Bioinformatics, 32(17), i595–i604.
4. Leendertz, S. A. J., Gogarten, J. F., Düx, A., Calvignac-Spencer, S., and Leendertz, F. H. (March, 2016) Assessing the Evidence Supporting Fruit Bats as the Primary Reservoirs for Ebola Viruses. EcoHealth, 13(1), 18–25.
5. Lecuit, M. and Eloit, M. (2014) The diagnosis of infectious diseases by whole genome next generation sequencing: a new era is opening. Frontiers in Cellular and Infection Microbiology, 4, 25.
6. Calistri, A. and Palù, G. (2015) Editorial commentary: Unbiased next-generation sequencing and new pathogen discovery: undeniable advantages and still-existing drawbacks. Clinical Infectious Diseases: An Official Publication of the Infectious Diseases Society of America, 60(6), 889–891.
7. Andrusch, A., Dabrowski, P. W., Klenner, J., Tausch, S. H., Kohl, C., Osman, A. A., Renard, B. Y., and Nitsche, A. (2018) PAIPline: pathogen identification in metagenomic and clinical next generation sequencing samples. Bioinformatics, 34(17), i715–i721.
8. Herfst, S., Schrauwen, E. J. A., Linster, M., Chutinimitkul, S., Wit, E. d., Munster, V. J., Sorrell, E. M., Bestebroer, T. M., Burke, D. F., Smith, D. J., Rimmelzwaan, G. F., Osterhaus, A. D. M. E., and Fouchier, R. A. M. (June, 2012) Airborne Transmission of Influenza A/H5N1 Virus Between Ferrets. Science, 336(6088), 1534–1541.
9. Imai, M., Watanabe, T., Hatta, M., Das, S. C., Ozawa, M., Shinya, K., Zhong, G., Hanson, A., Katsura, H., Watanabe, S., Li, C., Kawakami, E., Yamada, S., Kiso, M., Suzuki, Y., Maher, E. A., Neumann, G., and Kawaoka, Y. (June, 2012) Experimental adaptation of an influenza H5 HA confers respiratory droplet transmission to a reassortant H5 HA/H1N1 virus in ferrets. Nature, 486(7403), 420–428.
10. Lipsitch, M. and Inglesby, T. V. (December, 2014) Moratorium on Research Intended To Create Novel Potential Pandemic Pathogens. mBio, 5(6).
11. Noyce, R. S., Lederman, S., and Evans, D. H. (January, 2018) Construction of an infectious horsepox virus vaccine from chemically synthesized DNA fragments. PLOS ONE, 13(1), e0188453.
12. Thiel, V. (2018) Synthetic viruses-Anything new? PLoS Pathogens, 14(10), e1007019.
13. Edwards, R. A., McNair, K., Faust, K., Raes, J., and Dutilh, B. E. (2016) Computational approaches to predict bacteriophage-host relationships. FEMS Microbiology Reviews, 40(2), 258–272.
14. Eng, C. L., Tong, J. C., and Tan, T. W. (2014) Predicting host tropism of influenza A virus proteins using random forest. BMC Medical Genomics, 7(3), S1.
15. Xu, B., Tan, Z., Li, K., Jiang, T., and Peng, Y. (July, 2017) Predicting the host of influenza viruses based on the word vector. PeerJ, 5, e3579.
16. Li, H. and Sun, F. (2018) Comparative studies of alignment, alignment-free and SVM based approaches for predicting the hosts of viruses based on viral sequences. Scientific Reports, 8(1), 10032.
17. Mock, F., Viehweger, A., Barth, E., and Marz, M. (August, 2020) VIDHOP, viral host prediction with Deep Learning. Bioinformatics, btaa705.
18. Gałan, W., Bąk, M., and Jakubowska, M. (2019) Host Taxon Predictor - A Tool for Predicting Taxon of the Host of a Newly Discovered Virus. Scientific Reports, 9(1), 3436.
19. Babayan, S. A., Orton, R. J., and Streicker, D. G. (November, 2018) Predicting reservoir hosts and arthropod vectors from evolutionary signatures in RNA virus genomes. Science, 362(6414), 577–580.
20. Zhang, Z., Cai, Z., Tan, Z., Lu, C., Jiang, T., Zhang, G., and Peng, Y. (2019) Rapid identification of human-infecting viruses. Transboundary and Emerging Diseases, 66(6), 2517–2522.
21. Poplin, R., Chang, P.-C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A., Newburger, D., Dijamco, J., Nguyen, N., Afshar, P. T., Gross, S. S., Dorfman, L., McLean, C. Y., and DePristo, M. A. (2018) A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983–987.
22. Rizzo, R., Fiannaca, A., La Rosa, M., and Urso, A. (June, 2016) Classification Experiments of DNA Sequences by Using a Deep Neural Network and Chaos Game Representation. In Proceedings of the 17th International Conference on Computer Systems and Technologies 2016 (CompSysTech '16), New York, NY, USA: Association for Computing Machinery, pp. 222–228.
23. Löchel, H. F., Eger, D., Sperlea, T., and Heider, D. (January, 2020) Deep learning on chaos game representation for proteins. Bioinformatics, 36(1), 272–279.
24. Bartoszewicz, J. M., Seidel, A., Rentzsch, R., and Renard, B. Y. (July, 2019) DeePaC: predicting pathogenic potential of novel DNA with reverse-complement neural networks. Bioinformatics, 36(1), 81–89.
25. Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. (2015) Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8), 831–838.
26. Zhou, J. and Troyanskaya, O. G. (2015) Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10), 931–934.
27. Zeng, H., Edwards, M. D., Liu, G., and Gifford, D. K. (2016) Convolutional neural network architectures for predicting DNA–protein binding. Bioinformatics, 32(12), i121–i127.
28. Quang, D. and Xie, X. (2016) DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research, 44(11), e107–e107.
29. Kelley, D. R., Snoek, J., and Rinn, J. L. (2016) Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7), 990–999.
30. Greenside, P., Shimko, T., Fordyce, P., and Kundaje, A. (2018) Discovering epistatic feature interactions from neural network models of regulatory DNA sequences. Bioinformatics, 34(17), i629–i637.
31. Nair, S., Kim, D. S., Perricone, J., and Kundaje, A. (July, 2019) Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics, 35(14), i108–i116.
32. Avsec, Ž., Weilert, M., Shrikumar, A., Alexandari, A., Krueger, S., Dalal, K., Fropf, R., McAnany, C., Gagneur, J., Kundaje, A., and Zeitlinger, J. (August, 2019) Deep learning at base-resolution reveals motif syntax of the cis-regulatory code. bioRxiv, p. 737981.
33. Mock, F., Viehweger, A., Barth, E., and Marz, M. (2019) Viral host prediction with Deep Learning. bioRxiv, p. 575571.
34. Ren, J., Song, K., Deng, C., Ahlgren, N. A., Fuhrman, J. A., Li, Y., Xie, X., and Sun, F. (June, 2018) Identifying viruses from metagenomic data by deep learning. arXiv:1806.07810 [q-bio].
35. Tampuu, A., Bzhalava, Z., Dillner, J., and Vicente, R. (September, 2019) ViraMiner: Deep learning on raw DNA sequences for identifying viral genomes in human samples. PLOS ONE, 14(9), e0222271.
36. Eraslan, G., Avsec, Ž., Gagneur, J., and Theis, F. J. (July, 2019) Deep learning: new computational modelling techniques for genomics. Nature Reviews Genetics, 20(7), 389–403.
37. Schneider, T. D. and Stephens, R. M. (October, 1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Research, 18(20), 6097–6100.
38. Crooks, G. E., Hon, G., Chandonia, J.-M., and Brenner, S. E. (June, 2004) WebLogo: a sequence logo generator. Genome Research, 14(6), 1188–1190.
39. Lanchantin, J., Singh, R., Lin, Z., and Qi, Y. (2016) Deep Motif: Visualizing Genomic Sequence Classifications. CoRR, abs/1605.01133.
40. Lanchantin, J., Singh, R., Wang, B., and Qi, Y. (2017) Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. Pacific Symposium on Biocomputing, 22, 254–265.
41. Sundararajan, M., Taly, A., and Yan, Q. (2016) Gradients of Counterfactuals. CoRR, abs/1611.02639.
42. Jha, A., Aicher, J. K., Singh, D., and Barash, Y. (2019) Improving interpretability of deep learning models: splicing codes as a case study. bioRxiv.
43. Shrikumar, A., Greenside, P., and Kundaje, A. (August, 2017) Learning Important Features Through Propagating Activation Differences. In Precup, D. and Teh, Y. W., (eds.), Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research, International Convention Centre, Sydney, Australia: PMLR, pp. 3145–3153.
44. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (July, 2015) On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation. PLOS ONE, 10(7), e0130140.
45. Lundberg, S. M. and Lee, S.-I. (2017) A Unified Approach to Interpreting Model Predictions. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., (eds.), Advances in Neural Information Processing Systems 30, pp. 4765–4774, Curran Associates, Inc.
46. Shrikumar, A., Tian, K., Shcherbina, A., Avsec, Ž., Banerjee, A., Sharmin, M., Nair, S., and Kundaje, A. (March, 2019) TF-MoDISco v0.4.2.2-alpha: Technical Note. arXiv:1811.00416 [cs, q-bio, stat].
47. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410.
48. Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T. L. (December, 2009) BLAST+: architecture and applications. BMC Bioinformatics, 10(1), 421.
49. Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Hu, Y., Song, Z.-G., Tao, Z.-W., Tian, J.-H., Pei, Y.-Y., Yuan, M.-L., Zhang, Y.-L., Dai, F.-H., Liu, Y., Wang, Q.-M., Zheng, J.-J., Xu, L., Holmes, E. C., and Zhang, Y.-Z. (January, 2020) Complete genome characterisation of a novel coronavirus associated with severe human respiratory disease in Wuhan, China. bioRxiv, p. 2020.01.24.919183.
50. Mihara, T., Nishimura, Y., Shimizu, Y., Nishiyama, H., Yoshikawa, G., Uehara, H., Hingamp, P., Goto, S., and Ogata, H. (2016) Linking Virus Genomes with Host Taxonomy. Viruses, 8(3), 66.
51. King, A. M. Q., Adams, M. J., Carstens, E. B., and Lefkowitz, E. J., (eds.) (2012) Virus Taxonomy: Ninth Report of the International Committee on Taxonomy of Viruses, Academic Press, London; Waltham.
52. Lefkowitz, E. J., Dempsey, D. M., Hendrickson, R. C., Orton, R. J., Siddell, S. G., and Smith, D. B. (January, 2018) Virus taxonomy: the database of the International Committee on Taxonomy of Viruses (ICTV). Nucleic Acids Research, 46(D1), D708–D717.
53. Holtgrewe, M. (2010) Mason – A Read Simulator for Second Generation Sequencing Data. Technical Report, FU Berlin.
54. Deneke, C., Rentzsch, R., and Renard, B. Y. (2017) PaPrBaG: A machine learning approach for the detection of novel pathogens from NGS data. Scientific Reports, 7, 39194.
55. Moustafa, A., Xie, C., Kirkness, E., Biggs, W., Wong, E., Turpaz, Y., Bloom, K., Delwart, E., Nelson, K. E., Venter, J. C., and Telenti, A. (March, 2017) The blood DNA virome in 8,000 humans. PLOS Pathogens, 13(3), e1006292.
56. Gorbalenya, A. E., Baker, S. C., Baric, R. S., de Groot, R. J., Drosten, C., Gulyaeva, A. A., Haagmans, B. L., Lauber, C., Leontovich, A. M., Neuman, B. W., Penzar, D., Perlman, S., Poon, L. L. M., Samborskiy, D. V., Sidorov, I. A., Sola, I., Ziebuhr, J., and Coronaviridae Study Group of the International Committee on Taxonomy of Viruses (April, 2020) The species Severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nature Microbiology, 5(4), 536–544.
57. Simmonds, P. and Aiewsakun, P. (August, 2018) Virus classification – where do you draw the line? Archives of Virology, 163(8), 2037–2046.
58. Van Regenmortel, M. H. V. (January, 2018) Chapter One - The Species Problem in Virology. In Kielian, M., Mettenleiter, T. C., and Roossinck, M. J., (eds.), Advances in Virus Research, Vol. 100, pp. 1–18, Academic Press.
59. Li, H. and Durbin, R. (2009) Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760.
60. Langmead, B. and Salzberg, S. L. (March, 2012) Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4), 357–359.
61. Wood, D. E. and Salzberg, S. L. (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46.
62. Nix, R. and Kantarciouglu, M. (July, 2012) Incentive Compatible Privacy-Preserving Distributed Classification. IEEE Transactions on Dependable and Secure Computing, 9(4), 451–462.
63. Matejczyk, S. and Michalak, T. (2015) Solving Influence Maximization Problem Using Methods from Cooperative Game Theory. Instytut Podstaw Informatyki PAN.
64. Thorvaldsdóttir, H., Robinson, J. T., and Mesirov, J. P. (March, 2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2), 178–192.
65. DeLano, W. L. and others (2002) PyMOL: An open-source molecular graphics tool. CCP4 Newsletter on Protein Crystallography, 40(1), 82–92.
66. Yang, Y.-H., Jiang, Y.-L., Zhang, J., Wang, L., Bai, X.-H., Zhang, S.-J., Ren, Y.-M., Li, N., Zhang, Y.-H., Zhang, Z., Gong, Q., Mei, Y., Xue, T., Zhang, J.-R., Chen, Y., and Zhou, C.-Z. (June, 2014) Structural Insights into SraP-Mediated Staphylococcus aureus Adhesion to Host Cells. PLOS Pathogens, 10(6), e1004169.
67. Stojkova, P., Spidlova, P., and Stulik, J. (2019) Nucleoid-Associated Protein HU: A Lilliputian in Gene Regulation of Bacterial Virulence. Frontiers in Cellular and Infection Microbiology, 9, 159.
68. Li, F. (2016) Structure, Function, and Evolution of Coronavirus Spike Proteins. Annual Review of Virology, 3(1), 237–261.
69. Marchler-Bauer, A., Bo, Y., Han, L., He, J., Lanczycki, C. J., Lu, S., Chitsaz, F., Derbyshire, M. K., Geer, R. C., Gonzales, N. R., Gwadz, M., Hurwitz, D. I., Lu, F., Marchler, G. H., Song, J. S., Thanki, N., Wang, Z., Yamashita, R. A., Zhang, D., Zheng, C., Geer, L. Y., and Bryant, S. H. (2017) CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Research, 45(D1), D200–D203.
70. Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C.-L., Abiona, O., Graham, B. S., and McLellan, J. S. (March, 2020) Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation. Science, 367(6483), 1260–1263.
71. Yuan, M., Wu, N. C., Zhu, X., Lee, C.-C. D., So, R. T. Y., Lv, H., Mok, C. K. P., and Wilson, I. A. (May, 2020) A highly conserved cryptic epitope in the receptor binding domains of SARS-CoV-2 and SARS-CoV. Science, 368(6491), 630–633.
72. Starr, T. N., Greaney, A. J., Hilton, S. K., Crawford, K. H., Navarro, M. J., Bowen, J. E., Tortorici, M. A., Walls, A. C., Veesler, D., and Bloom, J. D. (June, 2020) Deep mutational scanning of SARS-CoV-2 receptor binding domain reveals constraints on folding and ACE2 binding. bioRxiv, p. 2020.06.17.157982.
73. Pinto, D., Park, Y.-J., Beltramello, M., Walls, A. C., Tortorici, M. A., Bianchi, S., Jaconi, S., Culap, K., Zatta, F., De Marco, A., Peter, A., Guarino, B., Spreafico, R., Cameroni, E., Case, J. B., Chen, R. E., Havenar-Daughton, C., Snell, G., Telenti, A., Virgin, H. W., Lanzavecchia, A., Diamond, M. S., Fink, K., Veesler, D., and Corti, D. (May, 2020) Cross-neutralization of SARS-CoV-2 by a human monoclonal SARS-CoV antibody. Nature, pp. 1–10.
74. Brendel, W. and Bethge, M. (2019) Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. In International Conference on Learning Representations.
75. National Research Council (2010) Sequence-Based Classification of Select Agents: A Brighter Line, The National Academies Press.
76. National Academies of Sciences, Engineering, and Medicine (2018) Biodefense in the Age of Synthetic Biology, The National Academies Press.
77. Diggans, J. and Leproust, E. (2019) Next Steps for Access to Safe, Secure DNA Synthesis. Frontiers in Bioengineering and Biotechnology, 7.
78. O'Brien, J. T. and Nelson, C. (June, 2020) Assessing the Risks Posed by the Convergence of Artificial Intelligence and Biotechnology. Health Security, 18(3), 219–227.
79. Brookes, D., Park, H., and Listgarten, J. (May, 2019) Conditioning by adaptive sampling for robust design. In International Conference on Machine Learning, pp. 773–782.
80. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M., and Church, G. M. (December, 2019) Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12), 1315–1322.
81. Biswas, S., Khimulya, G., Alley, E. C., Esvelt, K. M., and Church, G. M. (January, 2020) Low-N protein engineering with data-efficient deep learning. bioRxiv, p. 2020.01.23.917682.
82. Gupta, A. and Zou, J. (February, 2019) Feedback GAN for DNA optimizes protein functions. Nature Machine Intelligence, 1(2), 105–111.
83. Gupta, A. and Kundaje, A. (July, 2019) Targeted optimization of regulatory DNA sequences with neural editing architectures. bioRxiv, p. 714402.
84. Linder, J., Bogard, N., Rosenberg, A. B., and Seelig, G. (December, 2019) Deep exploration networks for rapid engineering of functional DNA sequences. bioRxiv, p. 864363.
85. Schreiber, J., Lu, Y. Y., and Noble, W. S. (May, 2020) Ledidi: Designing genomic edits that induce functional activity. bioRxiv, p. 2020.05.21.109686.
86. Grüning, B., Dale, R., Sjödin, A., Chapman, B. A., Rowe, J., Tomkins-Tinch, C. H., Valieris, R., and Köster, J. (July, 2018) Bioconda: sustainable and comprehensive software distribution for the life sciences. Nature Methods, 15(7), 475–476.