key: cord-0685708-fj25riqr authors: Mateos, Pablo Acera; Balboa, Renzo F.; Easteal, Simon; Eyras, Eduardo; Patel, Hardip R. title: PACIFIC: A lightweight deep-learning classifier of SARS-CoV-2 and co-infecting RNA viruses date: 2020-07-24 journal: bioRxiv DOI: 10.1101/2020.07.24.219097 sha: d3b14ab548ce28d3dbeb66bb96ac14783c56fda2 doc_id: 685708 cord_uid: fj25riqr Viral co-infections occur in COVID-19 patients, potentially impacting disease progression and severity. However, there is currently no dedicated method to identify viral co-infections in patient RNA-seq data. We developed PACIFIC, a deep-learning algorithm that accurately detects SARS-CoV-2 and other common RNA respiratory viruses from RNA-seq data. Using in silico data, PACIFIC recovers the presence and relative concentrations of viruses with >99% precision and recall. PACIFIC accurately detects SARS-CoV-2 and other viral infections in 63 independent in vitro cell culture and patient datasets. PACIFIC is an end-to-end tool that enables the systematic monitoring of viral infections in the current global pandemic. Acute respiratory tract infections are the third largest global cause of death, infecting 545 49 million people and claiming 4 million lives every year (1-3). RNA viruses such as influenza, 50 parainfluenza virus, respiratory syncytial virus, metapneumovirus, 51 rhinovirus, and coronavirus are amongst the top pathogens causing respiratory infections and 52 disease (4,5). Novel respiratory diseases, including coronaviruses, cross species boundaries 53 repeatedly. Since December 2019, millions of people have been affected by COVID-19, an 54 infectious zoonotic disease caused by severe acute respiratory syndrome coronavirus 2 55 (SARS-CoV-2, NCBI Taxonomy ID: 2697049). Novel zoonotic coronaviruses also caused the 56 Rhinovirus and Coronaviridae) and the human transcriptome (Additional file 1: Table S1 ) 156 (Methods). In silico fragments from both strands were generated without errors to 157 accommodate paired-end sequenced reads and to retain the natural variation between genomes 158 in each class. We used 90% of the data for training and 10% to tune the hyperparameters and 159 network architecture. 160 The selection and grouping of virus classes were based on several considerations. First, we 161 wanted PACIFIC to accurately detect SARS-CoV-2 as an independent class and to discriminate 162 it from other coronaviruses. Second, we selected viruses that have been recently reported to 163 appear as co-infections with SARS-CoV-2 (11). Third, we restricted our selection to viruses 164 for which humans have been defined as one of the host species in NCBI Taxonomy database 165 (27). Fourth, as the majority of reads in a sample are expected to be derived from human RNAs, 166 we included an independent human class representing the human transcriptome to avoid the 167 misclassification of human reads as viral origin. 168 As input reads are divided into k-mers within the model, we investigated appropriate virus and 170 human k-mer properties. A k-mer length of 9 was previously reported to be the optimal k-mer 171 length for the phylogenetic separation of viral genomes (28). However, 9-mer profiles of 172 SARS-CoV-2 and the human transcriptome have not been previously explored. We computed 173 all-vs-all Jensen-Shannon divergence (JSD) scores using 9-mers to confirm that k=9 is the 174 effective k-length to distinguish between the six PACIFIC classes. JSD is a symmetric measure 175 of (dis)similarity that accounts for shared k-mer frequency distributions between a pair of 176 sequences (29) . JSD values range between 0 for identical sequences and 1 for two sequences 177 that do not share any k-mer. Overall, inter-class JSD values were higher compared to the intra-178 class JSD values for 9-mers, which confirm that 9-mers are effective at separating sequences belonging to different viral classes and human transcripts (Additional file 2: Figure S1 ). 180 Specifically, the average JSD between the SARS-CoV-2 class and the Coronaviridae class was 181 0.786, which was greater than 0.767 intra-class JSD for the Coronaviridae and 0.002 for the 182 SARS-CoV-2 class, thus indicating sufficient divergence for their separation into distinct 183 classes. Given these results, we decided to encode input sequences as 9-mers with a stride of 184 1. 185 186 PACIFIC testing shows high precision and recall for simulated data 187 Performance metrics (false positive rates (FPRs), false negative rates (FNRs), precision, recall, 188 and accuracy) were calculated using in silico generated reads that modelled sequencing-189 induced substitution and indel errors from sample mixtures with known class labels. We 190 generated 100 independent datasets of 150nt single end reads with Illumina HiSeq2500 errors 191 using ART (30). Each dataset contained ~700,000 reads and was comprised of approximately 192 100,000 reads from each of the 6 classes in the PACIFIC model, plus ~100,000 reads from 193 unrelated viral genomes (Methods). 194 First, we compared the performance metrics between predictions using only the forward strand 195 of a read (single prediction) and predictions using both the forward and reverse-complemented 196 strands (double prediction) (Figure 2 ). In single prediction, a read was assigned the class label 197 with the highest posterior probability for the forward sequence if the posterior probability for 198 a class was ≥ 0.95. For double prediction, the read was predicted to be of a given class if the 199 posterior probability of the predicted forward and reverse-complemented read was ≥0. 95 prediction relative to single prediction ( Figure 2) . Concomitantly, the average precision for all 207 viral classes increased and the average recall decreased by a small margin. Due to these 208 observations, double prediction was implemented as the standard classification approach in 209 Overall, PACIFIC achieved high precision (average ≥ 0.9995), recall (average ≥ 0.9966) and 211 accuracy (average ≥ 0.9995) for each of the virus classes (Table 1) . For the human class, the 212 average precision was lower at 0.50140 compared to the viral classes; this is attributed to the 213 large number of reads (>99%) from unrelated viral genomes being assigned to the human class. 214 As a result, sequences that do not belong to any of the viruses in the model are unlikely to be 215 mislabelled as one of the virus classes. negatives compared to double prediction (green) where predictions are made on the forward 220 strand of a sequence and its reverse complement. 221 Using the same in silico datasets, we then assessed the effect of mismatches on FPR and FNR. 228 All 100 controlled datasets contained ~22% of reads with substitutions and indel errors relative 229 to the reference genomes. FNRs were higher for mismatch-containing reads relative to exact 230 reads for all viral classes, increasing 5 to 64-fold ( Figure 3 ). In contrast, FPRs increased 0.98 231 to 2-fold for mismatch-containing reads for all classes. These results suggest that FPRs are 232 relatively less affected by the presence of mismatches compared to FNRs for viral classes. 233 Despite relative differences of FPR and FNR in mismatch-containing reads compared to exact 234 reads, PACIFIC achieved high precision (0.9994), high recall (0.9909) and high accuracy 235 (0.9982) for mismatch-containing reads for all five viral classes (Additional file 1: Table S2 ). 236 cut-off; rc_discarded -reads that were discordantly predicted using double prediction. 259 260 In the previous section, we showed that PACIFIC displayed low FPR in balanced datasets with 262 similar proportions of reads from each class. However, incorrect predictions about the presence 263 or absence of a virus in a sample could lead to misguided follow-ups and the unnecessary use 264 of valuable clinical resources. Therefore, we decided to establish the minimum percentage of 265 reads for each viral class required to confidently predict the presence of a virus in a sample. 266 In practice, RNA-seq data will be unbalanced with almost all reads originating from the human 267 transcriptome mixed with variable proportions of viral reads. To model this imbalance in class 268 proportions, we simulated 500 independent datasets (100 for each class), each containing 269 500,000 150nt long reads using ART (30) with Illumina HiSeq2500 error profiles. Each dataset 270 unrelated viral genomes. One of the five viral classes was intentionally excluded, and the 272 excluded class was considered to be the test class. All reads assigned to this test class were 273 counted as false positives. PACIFIC achieved similar average FPRs to the benchmarking 274 experiments using balanced datasets (Table 2) . 275 Assessment of the distribution of false positive rates for each viral class and skewness-kurtosis 276 plots indicated that the percentage of false positives observed in unbalanced datasets followed 277 a Beta distribution. Therefore, we used moment matching to estimate the shape parameters for 278 the quantile function of the Beta distribution and determined the numeric threshold above 279 which 99% of false positive samples were excluded. Using these thresholds, a sample would 280 be classified as positive for Coronaviridae if >0.0405% reads were labelled as Coronaviridae 281 by PACIFIC. Similarly, these limits were >0.000807% for Influenza, >0.000154% for 282 Metapneumovirus, >0.0418% for the Rhinovirus, and >0.000213% for the SARS-CoV-2 class 283 (Table 2) . 284 class. * These thresholds represent 0.99 quantile for FP% for that class. 288 Next, we assessed PACIFIC's performance in classifying viral reads in RNA-seq data derived 290 from human biological samples and compared its output with alignment-based (BWA-MEM) 291 and k-mer based (Kraken2) approaches. To reduce bias in the comparisons, we built the BWA-292 MEM index and Kraken2 database using the same virus genome assemblies and the human 293 transcriptome that were used for PACIFIC training (Additional file 1: Table S1 ). Additionally, 294 we used the same percentage thresholds determined in the previous section (Table 2) for all 295 three methods to assign the presence of a virus in a sample. All three methods were applied to 296 63 human RNA-seq datasets from independent research studies, with five of them known to 297 contain SARS-CoV-2 (31,32). Four RNA-seq datasets were derived from primary human lung 298 epithelium cells (NHBE) infected with SARS-CoV-2 in vitro (NCBI SRA accession: 299 SRX7990869) (31) and one dataset was from a patient bronchoalveolar lavage fluid sample 300 (NCBI SRA accession: SRR10971381) that was positive for SARS-CoV-2 (32). In addition, 301 we analysed RNA-seq datasets for 48 airway epithelial cell samples from the GALA II cohort 302 study, of which 22 were reported to contain respiratory infection viruses (33,34), and 10 were 303 reportedly devoid of infections by the viral classes studied here, as indicated by the sample 304 metadata and the corresponding publications (Additional file 1: Table S3) . 305 For the five samples that were positive for SARS-CoV-2, PACIFIC assigned 0.047%-0.048% 306 of all reads to the SARS-CoV-2 class for the in vitro infected cells and 0.19% of all reads in 307 the patient sample; all five samples were above the established detection threshold for that class 308 (>0.000213%) ( Figure 5) . Similarly, BWA-MEM and Kraken2 successfully identified SARS-309 CoV-2 reads above the detection threshold in these five samples (Figure 5a ). All three methods 310 accurately predicted the presence of SARS-CoV-2 and the absence of other virus classes in 311 these samples as per class specific detection thresholds. We subsequently tested the 48 samples from the GALA II cohort (33,34). Of these, 22 were 313 reported to contain between 4 and 164,870 reads from respiratory viruses (human rhinovirus, 314 respiratory syncytial virus, human metapneumovirus or human parainfluenza viruses I, II and 315 III). PACIFIC, BWA-MEM and Kraken2 identified 9 samples as positive for one of the five 316 virus classes considered, and 39 samples as negative for the same viral classes (Figure 5a ; 317 Additional file 2: Figure S2 , Additional file 1: Table S3 ). The discrepancy between these results 318 and the original study (34) could be partly explained by the exclusion of the Respiratory 319 Syncytial Virus class in our analyses. However, further verifications could not be performed 320 because sample labels provided in the manuscript were mismatched with the submitted 321 sequence data (see correction (35)). 322 reads for a class are above detection thresholds described in Table 2 In addition to the 53 samples tested above, we analysed 10 publicly available human RNA-seq 361 datasets without expected viral infections using all three methods. In particular, all ten samples 362 were registered in the NCBI SRA database on or before 17 October 2019 and therefore were 363 unlikely to contain SARS-CoV-2 (Additional file 1: Table S3 ). Of note, two samples 364 (SRR8927151 and SRR8928257) were published in February 2020 (Additional file 1: Table 365 S3) (38). However, the two NCBI BioProjects for these samples were registered on 18 April searches of these SARS-CoV-2 assigned reads showed Homo sapiens as the best hit (E-values of all 9-mers in these reads were poly-A or poly-T derived, suggesting low-complexity 379 sequences. In addition, SRR4895842 was also assigned as co-positive for the Rhinovirus class 380 in addition to SARS-CoV-2 by BWA-MEM (Figure 5b ). BLASTN searches of 406 reads 381 assigned to Rhinovirus within this sample revealed that 303 reads had best hits to Homo 382 sapiens, and 48 reads had their best hits to Pan paniscus (E-value ≤ 6.06e-10; bit-scores ≥ 383 76.8). These reads had 19% of their 9-mers derived from poly-A or poly-T sequences, 384 suggesting low-complexity sequences. 385 Overall, these results show that PACIFIC can accurately identify viral reads and the use of our 386 detection thresholds assisted in correctly establishing the presence or absence of viral classes 387 in RNA-seq data from biological samples with better accuracy than existing methods. which were based on medical images (39). This study concluded that these predictive models 396 were, in general, poorly described and contained multiple biases, likely resulting in unreliable 397 predictions when applied in practice. To overcome these potential limitations, we used multiple 398 diverse and independent simulated datasets reflecting realistic scenarios to validate the 399 performance of PACIFIC. Importantly, PACIFIC was successfully applied to 63 RNA-seq 400 datasets derived from infected cell cultures and patient samples for the detection of viral 401 infections, demonstrating that PACIFIC can be applied to human-derived RNA-seq datasets 402 and assist in clinical settings. 403 In 2013, the World Health Organisation launched the Battle against Respiratory Viruses 404 (BRaVe) initiative, which identified six research strategies to tackle and mitigate risks of death 405 due to respiratory tract infections. One of the proposed strategies was to "improve severe acute 406 respiratory infection diagnosis and diagnostic tests amongst others" (40). High-throughput 407 sequencing-based approaches can provide immense diagnostic potential and facilitate 408 molecular epidemiological studies, thereby contributing towards the BRaVe initiative's goals 409 (41,42). It is more important than ever to explore and determine the diagnostic potential of 410 A comprehensive study using multiplex RT-PCR and a sequencing-based metagenomic 412 approach revealed that RNA-seq has sufficient sensitivity and specificity to be applicable in the 413 clinic for respiratory viruses (42). However, the use of RNA-seq in diagnostic settings is often 414 complicated due to complex analytical workflows (34,42). A typical workflow for virus 415 detection in high-throughput sequencing data involves quality assessment and filtering of raw 416 data, removal of host sequences, de novo assembly of remaining reads, and lastly, the alignment 417 and annotation of the generated contigs (43). Implementation of these workflows require expert 418 knowledge of bioinformatics software and databases and often dedicated computing facilities. 419 PACIFIC overcomes these limitations by modelling the differences in k-mer content of 420 respiratory viruses and human sequences in a model that is efficient in compute and storage 421 requirements, easy to use, and therefore applicable in contexts with minimal resources. We downloaded 362 virus genomes from the NCBI assembly database corresponding to five 466 classes of single stranded RNA viruses (Table 4 , Additional file 1: Table S1 ). GenBank 467 assembly identifiers and assembly versions with other metadata are listed in Additional file 1: 468 Table S1 . Since our focus was to detect co-infections with SARS-CoV-2, we made a separate 469 class for SARS-CoV-2 containing 87 different assemblies (Table 4 ). The Coronaviridae class 470 contained 12 genomes of alpha, beta, gamma and unclassified coronaviruses. The Influenza 471 class contained assemblies of influenza A (H1N1, H2N2, H3N2, H5N1, H7N9, and H9N2 472 strains) and influenza B viruses. For the Rhinovirus class, assemblies of rhinovirus A 473 (including A1 strain), B, C (including C1, C2, and C10 strains), and unlabelled enterovirus were grouped together. There were five distinct assemblies for metapneumovirus which were 475 grouped into a single class. We included Human GENCODE (48) canonical transcript 476 sequences (downloaded from Ensembl v99 database (49)) as an additional class to distinguish 477 sequencing reads derived from the human transcriptome. We generated between 0.44 and 3.5 478 million 150nt-long fragments in silico for each class using a custom Perl script available at 479 https://github.com/pacific-2020/pacific (generatetestdata.pl, Table 4 ). These training 480 sequences were randomly sampled without any base substitutions and were derived from both 481 strands of the genome assemblies. 482 PACIFIC was implemented using the Keras API with a TensorFlow backend. Input reads were 488 converted into 9-mers with a stride of 1, forming a vocabulary size of 4 9 = 262,144 k-mers. 489 Each of these k-mers is assigned a number using the Tokenize API from Keras (50) from 1 490 to 262,144. The first index position of 0 is reserved to denote zero-padding for variable length 491 sequences. Tokens are fed into the first hidden layer of the neural network and 492 transformed into continuous vectors of length 100. After the embedding, a convolutional layer A pooling layer is used after the convolution, using max pooling with a kernel size of 3. A 495 bidirectional long-short term memory (BiLSTM) layer then follows, which uses two traditional 496 LSTMs; one starts 'reading' the input sequence from one of the two flanks, and the other from 497 the opposite end. The output of the two LSTMs is then combined and passed to the next 498 layer. Finally, PACIFIC has a fully connected layer using a softmax function 499 to calculate posterior probabilities for each of the six classes. To reduce overfitting, we used 500 20% dropout at each hidden layer. We generated 100 independent test datasets using the ART sequence simulation software 524 (version 2.5.8, (30)) with default error models for substitutions, insertions and deletions using 525 the Illumina® HiSeq 2500 sequencing platform. For each dataset, we set seeds starting from 526 2021 to 2120 using a random number generator for reproducibility. Synthetic data contained 527 150nt single end reads derived from seven classes; the five model virus classes, a human class, 528 and an "unrelated" class composed of 32,550 distinct virus genomes downloaded from the 529 NCBI Assembly database. We sampled ~100,000 reads per class using a class-specific fold-530 coverage parameter to generate ~700,000 reads per test data (Table 5 ). Approximately 22% of 531 reads contained mismatches, insertions or deletions relative to their respective reference 532 sequences, reflecting error profiles of the Illumina sequencing platform. This process was 533 automated using a custom script (generatebenchmarkdata.pl). PACIFIC was used to assign class labels to reads in the test data, and performance metrics were 544 calculated by comparing known and predicted labels for each read. A read was assigned a class 545 if the maximum posterior probability score for a class was ≥0.95. A true positive (TP) was 546 defined when the true label and the predicted label were the same for a read. A true negative 547 (TN) was defined when a read that did not belong to the true class was correctly predicted as a 548 class different from the true class. False positives (FP) were reads which were predicted to be 549 as the true class, although they originated from a different class. False negatives (FN) were all 550 reads belonging to the true class but were predicted as a different class. An example confusion 551 matrix for SARS-CoV-2 is described in Table 6 . Precision, recall, accuracy, false positive rate 552 and false negative rate were calculated using equations 3-7 below. 553 554 Table 6 . Confusion matrix using SARS-CoV-2 as an example of a positive class. and unrelated viral genomes in variable proportions. Reads were simulated using the ART 571 software (30) with Illumina® HiSeq2500 error profiles that were 150nt long and modelled 572 single end experiments. One of the five viral classes that was excluded was considered as the 573 test class. This process was automated using a custom script (generatefprdata.pl). Subsequently, PACIFIC was run in double prediction to assign classes to each read. To 575 calculate the percentage of false positives in each experiment, we counted the number of reads 576 predicted as the absent test class and divided by the total number of reads. 577 We downloaded 63 RNA-seq experiments from NCBI SRA database. Run accessions and other 579 metadata details are supplied in Additional file 1: Table S2 . All data were downloaded from 580 the NCBI database using the SRA Toolkit prefetch and fastq-dump commands and applying 581 the --gzip and --fasta options (52). For the GALA II cohort study with 48 RNA-seq datasets 582 and read lengths 18-390nt, we discarded reads <150nt long. We then used PACIFIC to assign 583 the presence/absence of each virus class in all 63 samples using the detection thresholds 584 established in the previous section. We compared PACIFIC's predictions with two alternative 585 methods for virus detection: an alignment-based approach using BWA-MEM (53) , and a k-586 mer based approach using Kraken2 (19), described below. 587 For BWA-MEM (53), all reads were mapped using default parameters to a combined reference 588 containing assembly sequences for the five viral classes and the human transcriptome used for 589 training PACIFIC (Table 4 ). Reads were assigned to a virus class based on the class 590 membership of the genome assembly as described in Table 4 and Additional file 1: Table S1 . 591 For Kraken2, we first downloaded the Kraken taxonomy database and built a k-mer database 592 using the same genomes used to train PACIFIC (Table 4 ). Kraken2 was then run using the -593 use-names flag, and output reads were parsed using species scientific names and reads were 594 assigned a class based on the class membership of the genome assembly (Additional file 1: 595 Table S1, Table 4 ). To fairly compare all three methods, we applied class detection thresholds 596 as determined for and used in PACIFIC (Table 2) Addressing 650 the public health burden of respiratory viruses: the Battle against Respiratory Viruses Prevalence and attributable health burden of chronic respiratory diseases, 1990-2017: 654 a systematic analysis for the Global Burden of Disease Study Global 657 epidemiology of non-influenza RNA respiratory viruses: data gaps and a growing need 658 for surveillance Origin and evolution of pathogenic coronaviruses Centers for Disease Control and Prevention (CDC). Revised U.S. surveillance case 666 definition for severe acute respiratory syndrome (SARS) and update on SARS cases United States and worldwide Viral and atypical bacterial detection in acute respiratory infection in children under 675 five years Testing for Upper Respiratory Pathogens in the Emergency Department: A 678 Randomized Controlled Trial. Open Forum Infect Dis Rates of Co-infection between CoV-2 and Other Respiratory Pathogens Clinical characteristics and 683 outcome of influenza virus infection among adults hospitalized with severe COVID-684 19: A retrospective cohort study from Wuhan Virus a Protective Factor of COVID-19? SSRN Electron J COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-689 based modeling Multiplex PCR: Optimization and 691 application in diagnostic virology A review of methods and databases for 694 metagenomic classification and assembly. Brief Bioinform Ultrafast metagenomic sequence classification using 697 exact alignments KrakenUniq: Confident and fast 699 metagenomics classification using unique k-mer counts Improved metagenomic analysis with Kraken 2 Machine Learning for detection 704 of viral sequences in human metagenomic datasets DeepMicrobes: taxonomic classification for 707 metagenomics with deep learning Deep learning on raw DNA 709 sequences for identifying viral genomes in human samples Comparison of Word Embeddings and Sentence 712 Encodings as Generalized Representations for Crisis Tweet Classification Tasks Comprehensive evaluation of deep learning 715 architectures for prediction of DNA/RNA sequence binding specificities Multi-scale orderless pooling of deep 718 convolutional activation features The Performance of LSTM and BiLSTM in 2019 IEEE International Conference on Big Data Viral phylogenomics using an 727 alignment-free method: A three-step approach to determine optimal length of k-mer IEEE 730 TRANSACTIONS ON INFORMATION THEORY ART: a next-generation sequencing read 732 simulator Imbalanced Host Response to SARS-CoV-2 Drives Development of COVID-19 A new coronavirus 737 associated with human respiratory disease in China Factors 740 associated with degree of atopy in Latino The Genes-environments and Admixture in Latino Asthmatics (GALA II) study Dual RNA-seq reveals viral infections in asthmatic children without respiratory illness 745 which are associated with changes in the airway transcriptome Correction: Dual RNA-seq reveals viral infections in asthmatic children without 749 respiratory illness which are associated with changes in the airway transcriptome 750 New complete 753 genome sequences of human rhinoviruses shed light on their phylogeny and genomic 754 features New 756 respiratory enterovirus and recombinant rhinoviruses among circulating 757 picornaviruses. Emerg Infect Dis Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Rep Prediction models for diagnosis and prognosis of covid-19 infection: Systematic 763 review and critical appraisal Integrating host response and unbiased microbe detection for lower respiratory tract 768 infection diagnosis in critically ill adults Unbiased 771 detection of respiratory viruses by use of RNA sequencing-based metagenomics: A 772 systematic comparison to a commercial PCR panel Proficiency testing of virus diagnostics based on bioinformatics analysis of simulated 776 in silico high-throughput sequencing data sets Sample Pooling as a Strategy to Detect Viral Mutation Rates A survey of transfer learning GENCODE reference annotation for the human and mouse genomes Project Title. GitHub repository. GitHub Adam: A method for stochastic optimization Conference on Learning Representations, ICLR 2015 -Conference Track Proceedings Aligning sequence reads, clone sequences and assembly contigs with BWA Basic local alignment search 801 tool Database indexing for production MegaBLAST searches Supplementary Information 807 Additional file 1. Supplementary Tables S1, S2, and S3 Supplementary Table S1: Summary table of genomes and assemblies used to train 809 PACIFIC To investigate the origin of reads for all reads in samples that were discordantly predicted for 599 the presence of a virus class by PACIFIC, BWA-MEM or Kraken2, we used the BLAST suite 600 (v2.10.1+) (54,55) to align reads to the NCBI nucleotide (nt) database, which includes 601 sequences from all domains of life. We took the best hit from the pairwise alignment for each 602 read, filtering for alignments with an E-value <1e-6. BLASTN was used with the following 603 parameters: -task 'megablast' -max_target_seqs