key: cord-0317433-zd9n8nj7 authors: Melnick, Marko; Gonzales, Patrick; LaRocca, Thomas J.; Dowell, Robin D.; Song, Yuping; Wuu, Joanne; Benatar, Michael; Oskarsson, Björn; Petrucelli, Leonard; Link, Christopher D.; Prudencio, Mercedes title: Application of a bioinformatic pipeline to RNA-seq data identifies novel virus-like sequence in human blood date: 2021-01-28 journal: bioRxiv DOI: 10.1101/2021.01.27.428546 sha: 1ae4650155364b66fbcf8f7436194cc595b87a54 doc_id: 317433 cord_uid: zd9n8nj7 Numerous reports have suggested that infectious agents could play a role in neurodegenerative diseases, but specific etiological agents have not been convincingly demonstrated. To search for candidate agents in an unbiased fashion, we have developed a bioinformatic pipeline that identifies microbial sequences in mammalian RNA-seq data, including sequences with no significant nucleotide similarity hits in GenBank. Effectiveness of the pipeline was tested using publicly available RNA-seq data. We then applied this pipeline to a novel RNA-seq dataset generated from a cohort of 120 samples from amyotrophic lateral sclerosis (ALS) patients and controls, and identified sequences corresponding to known bacteria and viruses, as well as novel virus-like sequences. The presence of these novel virus-like sequences, which were identified in subsets of both patients and controls, were confirmed by quantitative RT-PCR. We believe this pipeline will be a useful tool for the identification of potential etiological agents in the many RNA-seq data sets currently being generated. increased load of various pathogens in ALS samples compared to controls in multiple tissues 134 suggesting these pathogens are present and cannot be simply attributed to contamination 9,11,20,22,23 . 135 Ultimately, the presence of ALS dysbiosis is unresolved and remains an active area of investigation 136 with evidence for 34-38 and against 39 it. 137 The biological role that these alternative microbiotas play in ALS is also unclear. ALS 138 patients may have a compromised blood brain barrier (BBB) or blood spinal cord barrier (BSCB) 139 function 40, 41 . It has been reported that ALS patients also have elevated Gram negative 140 endotoxin/lipopolysaccharide (LPS) in the blood 42 . Patients with ALS also display activation of 141 the innate immune system along with changes in blood 43 our previous study which analyzed protein levels of poly(GP) in C9ORF72-associated ALS 148 (c9ALS) 57 . The search for pathogens using sequencing data from blood samples in ALS patients 149 has been conducted before 58-61 , but previous efforts have not looked for novel pathogens. 150 Next-generation sequencing (NGS) technologies have shown broad detection of pathogens 151 in a target-independent unbiased fashion 62-65 . However, designing a microbial detection 152 experiment is non-trivial considering the variety of methods 66 and algorithms 67 that can be applied. 153 Our primary goal when designing a new pipeline was to be conservative and unbiased with regards 154 to discovery of novel pathogens. Furthermore, we wanted our pipeline to allow for the 155 quantification of both novel and known pathogens. While other pipelines have used reads that do 156 not map to the host genome (unmapped reads) for microbial identification and quantification, these 157 pipelines cannot be used for discovery as they rely on existing databases of microbial genomes 68-158 71 . Thus, we opted for de-novo assembly of unmapped reads into contigs, followed by alignment 159 of unmapped reads back to these contigs for quantification. A similar pipeline known as IMSA 72 160 uses this strategy, but disregards contigs that might be identified by translated amino acid sequence 161 similarity using BLASTX (a set we call the "dark biome") as well as subsequent contigs with no 162 BLASTN or BLASTX hit (a set we call the "double dark biome"). 163 We validated our pipeline by using datasets with known bacterial or viral infections. We 164 also examined the differences in microbial identification between polyA and total RNA recovery 165 in multiple tissues, and investigated the effects of globin depletion of blood samples. We then used 166 our pipeline on a novel blood dataset (termed "Our Study") as well as on five other published ALS 167 datasets from blood or spinal cord samples, analyzed each dataset individually, and analyzed 168 across datasets for changes in microbiota. While we did not identify any microbes enriched in the 169 blood of ALS patients, we did identify and validate a novel virus-like sequence, demonstrating the 170 potential of the bioinformatic pipeline we have established. binding beads and finally eluted in elution buffer. The resulting RNA free of >95% globin mRNA 199 transcripts was then processed for next generation sequencing. Of note, to assess the efficiency of 200 the globin RNA depletion, 10% of the samples processed were selected randomly and amplified 201 using a Target-Amp Nano labeling kit (Epicentre). Samples were normalized to 100 ng input and 202 reverse transcribed. First strand cDNA was generated by incubating at 50°C for 30 min with first 203 strand premix and Superscript III. This was followed by second strand cDNA synthesis through 204 DNA polymerase by incubating at 65°C for 10 min and at 80°C for 3 min. In-vitro transcription 205 was then performed at 42°C for 4 hours followed by purification using RNeasy mini kit (Qiagen). 206 Due to the large number of samples, the globin depletion step was performed in two batches. 207 We provided guidelines on how samples would be divided among the batches and also for how 208 samples would be grouped in the sequencing runs in order to minimize technical variability. The 209 Jackson Laboratory personnel were blinded to the identity of the samples. 210 RNA-seq of total blood RNA (globin and ribosomal RNA depleted) was performed in an 211 Illumina HiSeq4000 with >70 million read pairs per sample. Raw reads were then sent back to us 212 for bioinformatics analyses. 213 214 Quantitative RT-PCR for blood RNA samples 215 A total of 500 ng of total blood RNA was used for reverse transcription polymerase chain 216 reaction (RT-PCR), using the High Capacity cDNA Transcription Kit with random primers 217 (Applied Biosystems). Quantitative real-time PCR (qRT-PCR) was performed using SYBR 218 GreenER qPCR SuperMix (Invitrogen). Samples were run in triplicate, and qRT-PCRs were run 219 on a QuantStudio 7 Flex Real-Time system (Applied Biosystems). 220 221 List of primers and their sequences: 222 RDRP forward 5'-GCTGTCAAATCGGTTTCCAAC-3'; 223 RDRP reverse 5'-CTGCCTTCGTCATCTTGGAG-3'; 224 GAPDH forward 5'-GTTCGACAGTCAGCCGCATC-3'; 225 GAPDH reverse 5'-GGAATTTGCCATGGGTGGA-3'. 226 227 Transcriptomics 228 See pipeline description in results for an overview of the pipeline; see bioinformatics 229 supplement File S1 for a more detailed description of the analysis pipeline, versions, and statistical 230 quantification. All data in this study was processed identically using the pipeline. 231 232 233 Statistical Analysis 234 To assess statistical differences between conditions, a two tailed Student's t-test is 235 calculated using normalized coverage for contigs or binned normalized coverage for 236 species/genus, etc. The number of contigs or genus/species is used to obtain an adjusted p-value 237 using scipy in Python. Cutoff for statistical significance is less than an adjusted p-value of 0 Raw reads were first checked for quality using FastQC then trimmed to remove both 257 adaptor contamination and low quality basecalls using Trimmomatic. Trimmed reads were then 258 mapped to the host genome using multiple alignment algorithms in series (STAR, Bowtie2) and 259 unmapped reads were retained for contig assembly. Filtering out host reads made downstream 260 assembly faster and required less memory. We assembled contigs from unmapped reads with the 261 SPAdes assembler (with "-rna" setting). This assembler was chosen for its recent use in studies of 262 microbial diversity 73 and proven robustness to biological and technical variation 74 . The species 263 each contig belongs to was identified with BLASTN using default settings, and the top hit for each 264 contig was retained (a set we call "regular biome"). Contigs with no BLASTN hits were then 265 filtered to remove highly repetitive regions (DUST) and retained if they had a greater than 60% 266 pairwise alignment (LAST) between contigs assembled from a single sample, group/condition, or 267 all samples. 268 We then identified contigs that lacked detectable nucleotide similarity to any GenBank 269 entry but showed similarity at the amino acid level using BLASTX ("dark biome"). Furthermore, 270 contigs with no BLASTN or BLASTX hits were labelled as "double dark biome" (also filtered by 271 DUST and LAST). Every contig in the "regular biome" and "dark biome" were then queried 272 against the Joint Genome Institute Server for additional taxonomic information. As Mystery Miner 273 is an agnostic tool, it cannot distinguish between true tissue or cell-associated microbes and 274 experimentally introduced contamination. 275 For quantification, we mapped the non-host reads using Bowtie2 to the contigs obtained 276 from SPAdes. Next, we mapped reads to contigs using samtools mpileup (default mapq score) to 277 calculate the amount of reads over each base pair in a contig. We then calculated coverage on the 278 contigs by summing all of the counts for each base pair in a contig and dividing by the length of 279 the contig. We then calculated normalized coverage by library size using the number of mapped 280 reads to the host genome. This gave us normalized coverage (NC) for a contig or binned 281 normalized coverage (BNC) for multiple contigs within a species/genus, etc. To assess statistical 282 differences between conditions, a Student's t-test was calculated through NC or BNC, using the 283 number of contigs or genus/species to obtain an adjusted p-value using scipy in Python. Reads were first checked with FastQC and trimmed using Trimmomatic (1. grey). Reads were then 291 aligned to the host genome using various aligners (2. blue). Non-host (unmapped) reads were 292 assembled into contigs with RNA SPAdes and regular biome contigs were identified with 293 BLASTN (3. yellow). Unidentified contigs were filtered for repetitive sequences with Dust, filter 294 by single, group or all with LAST, and dark biome contigs were identified with BLASTX. Double 295 dark biome unidentified BLASTX contigs were sent directly to quantification (4. purple). Dark 296 biome and regular biome contigs were assigned complete taxonomy using the JGI server and 297 filtered one last time to remove mammalian/host genome contigs (5. Green). Non-host reads were 298 then mapped to all contigs and normalized coverage was calculated for subsequent statistical 299 analysis. 300 301 302 To confirm that Mystery Miner is able to recover and quantify known bacterial infections 305 from sequencing data, we utilized an in vitro model of Chlamydia trachomatis infection 306 (Humphrys 2016) 75 . In this study, epithelial cell monolayers were infected with Chlamydia 307 trachomatis; and polyA RNA (6 samples) and total RNA (6 samples) were sequenced 1 hour and 308 24 hours post infection (hpi). Using the Mystery Miner pipeline, out of 5.32 X 10 6 reads from all 309 of the samples, 6.04 X 10 5 reads remained unmapped (~11.34%) after trimming and mapping to 310 the host genome (File S2). From these non-host reads, 3,257 contigs were assembled and 1,199 of 311 these contigs were identified as regular biome (File S3). An additional 27 contigs had no BLASTN 312 hit. Of these, we identified 2 dark biome (BLASTX identified) and no double dark biome (no 313 BLASTX or BLASTN hit) contigs (File S4 and File S5). 314 Using Mystery Miner we successfully identified, and found significantly elevated levels, 315 of Chlamydia trachomatis (BNC by species) in 24 hours post infection (hpi) samples compared to 316 1 hpi samples in both polyA (Padj = 0.004) and total RNA (Padj = 0.0005). In addition to 317 Chlamydia trachomatis, we identified 6 additional bacterial species and one viral species 318 (Alphapapillomavirus 7) in the samples (Figure 2A ), including significantly elevated levels of 319 Mycoplasma hyorhinis contigs in total RNA samples. No significant differences were observed in 320 the dark or double dark contigs (File S6). 321 To confirm that the pipeline can detect known viral infections, we ran Mystery Miner on a 322 total RNA dataset from an in vitro model of severe acute respiratory syndrome coronavirus 323 (SARS-CoV) 1 or 2 infection (Emanuel2020 76 ). In this study human epithelial Calu3 cells were 324 infected with SARS-CoV-1 or SARS-CoV-2 (4, 12, or 24 hours), mock (4 or 24 hours), or 325 untreated (4 hours). 326 Out of the 2.81 X 10 8 reads obtained from all of the samples, 8.23 X 10 7 reads remained 327 unmapped (~29%) after trimming and mapping to the host genome (File S2). From these non-host 328 reads, 42,816 contigs were assembled, of which 1,346 regular biome, 27 dark biome, and 7 double 329 dark biome contigs passed the filtering steps (File S2, File S3, File S4, File S5) 330 Mystery Miner successfully identified both SARS-CoV-2 and SARS-CoV-1 isolates and 331 found significantly elevated levels of each virus compared to controls ( Figure 2B ). Hereafter we 332 refer to SARS-CoV-1 or SARS-CoV-2 infected cells as COV1 or COV2 to avoid confusion with 333 recovered names of contigs. Consistent with the viruses being similar, we identified both SARS-334 CoV-2 and SARS-CoV-1 in both the COV1-24hr and COV2-24hr samples when compared to 335 mock-24hr. However, when we compared COV2-24hr to COV1-24hr, we were able to distinguish 336 SARS-CoV-1 isolates from SARS-CoV-2 in the appropriate samples (i.e., SARS-CoV-2 was 337 significantly elevated in COV2). Similar results were seen in the 12 hour samples but the 4 hour 338 samples did not have sufficient viral reads to detect either SARS-CoV virus (File S7). To simulate 339 a novel virus, we ran Mystery Miner on versions of the BLASTN and BLASTX databases obtained 340 before SARS-CoV-2 was discovered and were able to properly identify SARS-CoV-2 as a bat 341 related coronavirus 77 ( Figure S1) In order to compare effects of library enrichment or depletion, we compared recovered 369 pathogens in a dataset that has polyA enrichment or rRNA depleted total RNA from blood or 370 colonic tissue (VonSchack2018) 78 . When we compared polyA RNA vs total RNA and looked at 371 BNC by superkingdom of bacteria we found significantly more reads map to bacteria for total 372 RNA than polyA RNA (Padj = 0.0349), in blood but not in colon (Padj=0.11709) ( Figures S2 and 373 File S8). We found similar amounts of significant BNC by species for polyA RNA vs total RNA 374 in blood (29) and in colon (26). We then looked at significant BNC by genus and found double the 375 amount in blood (14) compared to colon (7), with only one significant genus (Actinomyces) found 376 in both comparisons. We did not find any significant differences in coverage when we looked at 377 the species, genus or superkingdom level for viruses (File S8). We conclude that library 378 enrichment with total RNA compared to polyA RNA increases bacterial recovery and diversity in 379 blood. 380 We next looked at a RNA-seq dataset from whole blood with globin depleted (GD) vs non-381 globin depleted (NGD) total RNA (Shin2014 79 ). With BNC by superkingdom, we found 382 significantly increased levels in globin depleted vs. not-depleted samples for both bacteria (Padj = 383 0.004) ( Figure S3 ) and viruses (Padj = 0.030) ( Figure S4 ). We also found significant differences 384 in BNC by species ( Figure S5 ) or genus ( Figure S6 ) primarily from E. coli with elevated levels in 385 globin-depleted blood RNA. We did not find any significant differences when we looked for 386 viruses at the species or genus level (File S9 In general, we found a modest positive correlation between library size and number of 412 bacterial contigs assembled, species detected (Figure 3) , and genera detected for each sample as 413 well as a homogenous mixture of samples with respect to disease status. No specific taxonomy or 414 contig sequence correlated with participant class within the dataset. Yet, by pooling bacterial read 415 counts across all of the samples, we found alpha proteo-bacteria, Actinobacteria, Firmicutes, and 416 Bacteroidetes as the most highly represented taxonomies, consistent with other blood biome 417 studies 80 ( Figure S6 ). Most of the bacterial genera (~65%) assigned to the dark biome contigs were 418 also found in the regular biome, however this was not the case in the viral sets, as only 5% (4/78) 419 of dark viral contigs were observed in the regular biome (File S10). This observation suggested 420 that our pipeline might be identifying novel viral sequences. 421 Within the dark biome contigs, we noted numerous contigs with a region of protein 422 sequence similarity to RNA-dependent RNA polymerase (RdRP) from multiple RNA viruses, 423 showing highest similarity to the velvet tobacco mottle virus (first row in heatmap of Figure 4 , 424 complete metadata shown in Figure S7 ). datasets for all studies used in this paper ( Table 2) . As we observed in Our Study, we first noted 481 that increased library size correlated with an increased number of bacterial contigs assembled, 482 species detected, and genera detected ( Figure 5 , and Figure S8 -10 show all datasets used in this 483 study). 484 We then looked at the total overlap of genus or species to determine if there are similarities 485 in recovered microbial or viral sequences between datasets. For genus in the regular bacteriome, 486 our dataset had the highest number of unique genus (185), followed by Ladd2017 (117), and 487 Gagliardi2018 (38). The highest number of overlapping bacterial genus was between our dataset 488 and Ladd2017 (133) followed by the intersection between our dataset, Ladd2017 and 489 Gagliardi2018 (61) and there was a modest overlap (24) for 9/10 datasets ( Figure 6 ). We observed 490 roughly the same trend in the regular bacterial biome at the species level and in the dark bacterial 491 biome (S Figure 11 , File S11). In contrast, the regular virome of each dataset was relatively unique 492 with very low amounts of overlap (<= 3) between datasets (species and genus shows a similar 493 pattern). Interestingly, the highest overlap for species in the dark virome was between our dataset 494 and Ladd2017 (13), one of which is similar to RDRP viruses, although the contigs in Ladd's data 495 were not similar to the velvet tobacco mottle virus in our dataset ( Figure S12 , File S12). 496 In addition to comparing datasets using taxonomy, we also compared contigs between 497 datasets for nucleotide similarity (> 70%) using LAST (File S1 for methods). We found that in 498 general, datasets in the regular biome with the largest amount of contigs have the most overlap. 499 Unsurprisingly, in the dark biome we observed less overlap by nucleotide similarity and that our 500 RDRP contig does not share nucleotide similarity with contigs from any dataset. In addition, we 501 also compared the nucleotide similarity of double dark biome contigs and found there is not a large 502 percentage of similar contigs between datasets (File S13). 503 504 505 Figure 6 . Upset plots of overlapping genus in the regular bacteriome between datasets. 506 Upset plots are Venn diagram-like plots. A set refers to a dataset used in this study and each set is 507 on a row with total amounts in a set as a blue bar plot on the left (ordered by set size). The black 508 histogram on top shows the counts that are in the intersection of sets (a single dot for one dataset 509 or connected dots for overlap of multiple datasets). Intersections less than 4 are removed for 510 visualization purposes. 511 512 513 514 Comparison of taxonomy by condition within ALS datasets 515 Finally, we looked for differences in ALS vs control samples for each dataset. In the 516 Gagliardi2014 dataset, when we compared ALS patients with the FUS mutation to controls, we 517 found 3 significant differences in BNC by species in the regular bacteriome (Neisseria sp., 518 Pseudomonas sp., Sphingomonas sp.) and one significant difference in BNC by genus in the dark 519 bacteriome (photobacterium). In ALS patients with mutations in SOD1 compared to controls, we 520 found two species significantly different in the regular bacteriome (Hydrogenophaga crassostreae, 521 Sphingomonas hengshuiensis) (Gagliardi FUS and SOD1 supplement). We did not find anything 522 significant in sporadic ALS, or in ALS patients with TARDBP mutations with regards to 523 genus/species (regular or dark biome or viruses) for Gagliardi2014. We found no significant 524 statistical differences between ALS and control samples for genus/species of viruses/bacteria in 525 the regular/dark biome for any of the remaining ALS datasets. 526 527 Meta analysis between datasets 528 529 Since our dataset and many others had no significant comparisons for ALS vs control 530 groups within each dataset, a meta-analysis between datasets using this criteria would be difficult. 531 As a second pass analysis we created a less stringent filtering method in order to compare the 532 presence of microbes for each group between datasets (ALS vs. ALS; or controls vs. controls) 533 (Figure 7) . We assigned a contig to a condition if ≥ 2 samples from that condition contain at least 534 90% of the summed normalized coverage (from all samples) to the contig. This filtering greatly 535 reduced the number of comparable genus/species for each dataset and, for example, reduced the 536 genus of the regular bacteriome in our dataset from 305 for all samples to 33 (SYM:8, C9S:6, 537 C9A:2, CTL:17) (File S14). 538 When we looked at ALS or control contigs in the regular bacteriome, the highest number 539 of unique genus or species was from Ladd2017, and in general there was a small amount of overlap 540 between datasets (≤1 for ALS or ≤ 8 for controls) (Figure 7 ). When we looked at genus in the dark 541 bacteriome we saw no overlap for ALS contigs and low overlap (≤ 1) between control conditions 542 (species was similar) (File S14). In the regular virome there was no overlap between datasets and 543 only our study (one contig from ALS) and Ladd2017 (three from ALS, five from controls) had 544 contigs that passed the filter (similar values for species). When we looked in the dark virome by 545 genus there was no overlap between datasets, and our dataset had only one genus (Sobemovirus 546 from controls) with the rest coming from Ladd2017 (18 from controls, 5 from ALS) (File S15). In 547 conclusion, despite our conservative and loose approaches, we did not find any convincing 548 evidence in ALS samples that the presence (or lack of presence) of an organism (or multiple 549 organisms) was different compared to control samples. Upset plots are Venn diagram-like plots. A set refers to a contig that was assigned to a condition 555 from a dataset. Each set is on a row with total amounts in a set as a blue bar plot on the left (ordered 556 by set size). The black histogram on top shows the counts that are in the intersection of sets (a 557 single dot for one dataset or connected dots for overlap of multiple datasets). A. ALS contigs in 558 the regular bacteriome. B. Control contigs from the regular bacteriome. 559 560 561 Discussion 562 563 We have created Mystery Miner to search for and quantify known and unknown microbes 564 in RNA-seq datasets as a tool for researchers to study dysbiosis and identify infectious agents. We 565 validated the pipeline by recovering and quantifying Chlamydia and SARS-CoV reads from RNA-566 seq datasets from intentionally infected cells. Interestingly, we also identified Mycoplasma reads 567 in the Chlamydia dataset, suggesting that Mystery Miner may also be able to identify unsuspected 568 bacterial infections. We also use published data to investigate the difference of polyA vs total RNA 569 recovery of bacterial species in multiple tissues. Perhaps surprisingly, we did not see a consistent 570 difference in the recovery of bacterial reads between the two types of RNA-seq libraries, 571 considering that bacterial transcripts are not expected to be polyadenylated. However, it is well-572 recognized that polyA selection is imperfect, and libraries constructed from polyA-selected RNA 573 routinely contain transcripts thought not to be polyadenylated (e.g., rRNA). We also found 574 increased recovery of bacterial species with globin RNA depletion in blood. This could be the 575 result of an effective increase in read depth for bacteria when not sequencing globin, or an increase 576 in contamination from the globin depletion step. We stress that our bioinformatic approach alone 577 cannot distinguish between contamination and the true existence of microbial sequences in human 578 tissue. 579 We then used Mystery Miner on Our Study dataset consisting of 8.64 X 10 9 reads. This 580 dataset was generated from whole blood total RNA that was depleted for both ribosomal and globin 581 transcripts. It encompasses samples from controls, participants with a C9ORF72 hexanucleotide 582 expansion (symptomatic and pre-symptomatic), and C9ORF72 negative ALS patients. We found 583 no statistical difference in microbial sequence read coverage between controls and any class of 584 ALS patients, examining either individual contigs or using various taxonomical binning 585 approaches. We also did not detect any batch effects or obvious age-or sex-biases in the recovery 586 of microbial reads ( Figure S7 ). Overall, we found no evidence of blood microbial sequences 587 contributing to either C9ORF72 negative ALS or symptomatic patients harboring the C9ORF72 588 hexanucleotide expansion. However, ALS is a CNS disease, so our findings in these blood samples 589 do not necessarily preclude a role for microbes in this disease. 590 591 A unique aspect of Mystery Miner is that it tracks non-human reads that do not have 592 significant BLASTN hits in GenBank. We were intrigued by the identification of a large contig 593 (>5kb) in the dark biome of our ALS dataset that showed protein sequence similarity to RNA-594 dependent RNA polymerases, the essential replicase of RNA viruses. To validate that this virus-595 like sequence was not an artifact of contig assembly or a contaminant introduced during library 596 construction or sequencing, we used RT-PCR of the original patient samples to demonstrate that 597 this sequence was present in positive samples identified through the RNA-seq analysis and not 598 detectable in negative samples. We cannot extrapolate from this specific example to determine 599 what fraction of the "dark" and "double dark" sequences represent true novel microbial sequences 600 present in human blood, but we note that analysis of human cell free blood DNA has reported the 601 identification of thousands of novel bacterial sequences 86 . We suggest that Mystery Miner is a 602 generally useful tool for the identification of novel microbial sequences in RNA-seq data. 603 To extend our analysis we processed publicly available blood and spinal cord ALS datasets 605 through our pipeline. As observed in our dataset, library size generally correlated with number of 606 bacterial contigs assembled and number of bacterial genera/species recovered. When the microbial 607 sequences we found in our dataset were compared to the other datasets we found similar 608 genera/species and, not surprisingly, larger datasets generally had greater overlap. One dataset 609 (Ladd2017) yielded comparable recovery of bacteria and viruses for the regular biome but a far 610 greater recovery bacteria and viruses in the dark biome compared to all the other datasets. This 611 study used laser capture microdissection (LCM) to isolate cervical spinal cord motor neurons and 612 had comparable read amounts per sample to other studies and was conducted in the same 613 laboratory as two other studies (Brohawn2016, Brohawn2019). We are unsure why this dataset 614 yielded a much larger dark biome compared to the other datasets. Potentially these differences are 615 a byproduct of using LCM to acquire samples. 616 617 We then analyzed several publicly available ALS datasets for statistically significant 618 differences between recovered microbial sequences in ALS and control samples. Only one dataset 619 (Gagliardi2018) had any significant microbial sequence differences between control and ALS 620 samples, specifically ALS patients with FUS or SOD1 mutations. However, the sample number 621 in this sub-study was small (N = 2-3), and in the case of the SOD1 patients the excess microbial 622 reads were in the control samples. In the absence of additional information (e.g., batch assignments 623 for the samples) it is difficult to conclude that these sequence/sample correlations are meaningful. 624 Finally, we compared identified microbial sequences in the control and ALS samples across the 625 datasets and did not identify any genera/species that were preferentially recovered in either sample 626 type. 627 628 Using our bioinformatic analysis pipeline Mystery Miner, we have not identified an 629 association between ALS pathology and sequences corresponding to known or unknown microbial 630 species. However, there are intrinsic limitations in using "repurposed" RNA-seq data to assay 631 tissue-associated microbial sequences, including the relatively small number of non-human reads 632 (<1% of total) upon which the analysis is based. This limited sequence signal could preclude 633 identification of rarer microbes. Perhaps more problematic is the possibility that contaminating 634 sequences obscure true tissue-associated microbial sequences. Any candidate microbes identified 635 using Mystery Miner that correlate with human phenotypes will necessarily require independent 636 validation. Despite these limitations, we believe Mystery Miner will be a useful tool for future 637 researchers investigating known and unknown microbes that could contribute to disease, as our 638 analyses have shown it to be sensitive to bacterial/viral agents in sequencing data. Upset plots are venn diagram-like plots. Each set is on a row with total amounts in a set as a blue 806 bar plot on the left. The black histogram on top shows the counts that are in the intersection of 807 sets (a single dot for one set or connected dots for multiple sets). The regular virome of each 808 dataset is relatively unique with very low amounts of overlap (<= 3) between datasets (species 809 and genus shows a similar pattern). Interestingly, the highest overlap for species in the dark 810 virome is between our dataset and Ladd2017 (13). 811 812 813 Figure S13 . Upset plots of Bacteria in the regular biome for genus/species in ALS and 814 Control contigs 815 Upset plots are venn diagram-like plots. Each set is on a row with total amounts in a set as a blue 816 bar plot on the left. The black histogram on top shows the counts that are in the intersection of 817 sets (a single dot for one set or connected dots for multiple sets). We assigned a contig to a 818 condition if >= 2 samples from that condition contain at least 90% of the summed normalized 819 coverage (from all samples) to the contig. In the genus and species from ALS samples there is a 820 low amount of overlap between datasets ( <= 1). When we look at control samples there is a 821 much higher overlap for both genus and species. Upset plots are venn diagram-like plots. Each set is on a row with total amounts in a set as a blue 831 bar plot on the left. The black histogram on top shows the counts that are in the intersection of 832 sets (a single dot for one set or connected dots for multiple sets Exploring the "multiple-hit hypothesis" of 850 neurodegenerative disease: Bacterial infection comes up to bat Infectious agents and 853 amyotrophic lateral sclerosis: another piece of the puzzle of motor neuron degeneration Prevalence of amyotrophic lateral sclerosis -United 856 The epidemiology of amyotrophic lateral sclerosis Risk factors for amyotrophic lateral sclerosis Smoking and amyotrophic lateral sclerosis: A mendelian randomization 864 study Relationship between smoking and 866 ALS: Mendelian randomisation interrogation of causality Virus detection by nucleic 869 acid hybridization: Examination of normal and ALS tissues for the presence of poliovirus Enteroviral Infection: The Forgotten Link to 876 Amyotrophic Lateral Sclerosis? Fungal infection in 879 neural tissue of patients with amyotrophic lateral sclerosis Amyotrophic Lateral Sclerosis-882 like Syndrome after Chikungunya. Cureus Herpes simplex virus 886 and Alzheimer's disease: A search for virus DNA by spot hybridisation The Infectious Etiology of Alzheimer's Disease. Curr 889 Neuropharmacol Virologic and immunologic considerations in Parkinson's disease Evidence for association between 894 hepatitis C virus and Parkinson's disease Gut microbiota: Implications in Parkinson's disease. Park 897 Relat Disord Role of pathogens in multiple sclerosis Searching for Bacteria in Neural Tissue From Amyotrophic 901 Lateral Sclerosis Detection of Mycoplasmas in Patients with 903 Evidence for fungal infection in cerebrospinal fluid and 906 brain tissue from patients with amyotrophic lateral sclerosis Corpora Amylacea of Brain Tissue from 909 Neurodegenerative Diseases Are Stained with Specific Antifungal Antibodies. Front 910 Neurosci Risk of sporadic amyotrophic lateral sclerosis 912 associated with seropositivity for herpesviruses and echovirus-7 Detection and cellular localization 915 of enterovirus RNA sequences in spinal cord of patients with ALS Cerebrospinal fluid detection of enterovirus 918 genome in ALS: A study of 242 patients and 354 controls Enteroviral infection: The forgotten link to 921 amyotrophic lateral sclerosis? Detection of enteroviral 924 sequences from frozen spinal cord samples of Japanese ALS patients ALS syndrome in patients with HIV-1 infection A comparative study of motor neuron disease in 929 HIV-infected and HIV-uninfected patients Identification of active loci of a human endogenous 932 retrovirus in neurons of patients with amyotrophic lateral sclerosis Human endogenous retrovirus-K contributes to motor 935 neuron disease Humoral immunity response to human endogenous 938 retroviruses K/W differentiates between amyotrophic lateral sclerosis and other 939 neurological diseases Evaluation of the microbial diversity in amyotrophic lateral 944 sclerosis using high-throughput sequencing Antibiotics use and risk of amyotrophic lateral sclerosis in 947 Target Intestinal Microbiota to Alleviate Disease 949 Progression in Amyotrophic Lateral Sclerosis The role of the microbiota-gut-brain axis 952 and antibiotics in ALS and neurodegenerative diseases. Microorganisms The fecal microbiome of ALS patients Decreased mRNA expression of tight 957 junction proteins in lumbar spinal cords of patients with ALS Circulating endotoxin and systemic immune 962 activation in sporadic amyotrophic lateral sclerosis (sALS) Immune system alterations in sporadic 965 amyotrophic lateral sclerosis patients suggest an ongoing neuroinflammatory process Correlation of 968 peripheral immunity with rapid Amyotrophic lateral sclerosis progression Innate and adaptive immunity in 971 amyotrophic lateral sclerosis: Evidence of complement activation Inflammation induces TDP-43 mislocalization and 974 aggregation 1H-NMR-Based metabolomic profiling of CSF in early 978 amyotrophic lateral sclerosis A CSF biomarker panel for identification 982 of patients with amyotrophic lateral sclerosis Evaluating the levels of CSF and serum factors in ALS Epigenetic differences between 987 monozygotic twins discordant for amyotrophic lateral sclerosis (ALS) provide clues to 988 disease pathogenesis Increase in DNA methylation in patients with 990 amyotrophic lateral sclerosis carriers of not fully penetrant SOD1 mutations ALS blood expression profiling 993 identifies new biomarkers, patient subgroups, and evidence for neutrophilia and hypoxia Small RNA sequencing of sporadic amyotrophic 996 lateral sclerosis cerebrospinal fluid reveals differentially expressed miRNAs related to 997 neural and glial activity Poly(GP) proteins are a useful 1002 pharmacodynamic marker for C9ORF72-associated amyotrophic lateral sclerosis Long non-coding and coding RNAs characterization 1005 in Peripheral Blood Mononuclear Cells and Spinal Cord from Amyotrophic Lateral 1006 Sclerosis patients RNA-Seq profiling in peripheral blood mononuclear 1008 cells of amyotrophic lateral sclerosis patients and controls Identification of molecular signatures 1011 and pathways common to blood cells and brain tissue of amyotrophic lateral sclerosis 1012 patients Application of next generation sequencing for the detection of human 1017 viral pathogens in clinical specimens RNA-Seq Analysis of Gene Expression, Viral Pathogen, 1020 and B-Cell/T-Cell Receptor Signatures in Complex Chronic Disease Resolving host-pathogen interactions by dual RNA-1023 seq The Sensitivity of Massively Parallel Sequencing 1025 for Detecting Candidate Infectious Agents Associated with Human Tissue Interrogating the microbiome: experimental and 1028 computational considerations in support of study reproducibility A review of 1031 bioinformatics tools for bio-prospecting from metagenomic sequence data. Front Genet ROP: dumpster diving in RNA-sequencing to find the 1034 source of 1 trillion reads across diverse adult human tissues Pipeline for 1037 Quantification of Microbiome in Human RNA-seq MetaMap: an atlas of metatranscriptomic reads 1040 in human disease-related RNA-seq data. Gigascience SEPATH: 1043 benchmarking the search for pathogens in human tissue whole genome sequence data 1044 leads to template pipelines A fast and robust 1047 protocol for metataxonomic analysis using RNAseq data A new genomic blueprint of the human gut 1050 microbiota Optimizing and evaluating the reconstruction 1052 of Metagenome-assembled microbial genomes Simultaneous transcriptional profiling of bacteria 1055 and their host cells Bulk and single-cell gene expression profiling of 1058 SARS-CoV-2 infected human cell lines identifies molecular targets for therapeutic 1059 intervention. bioRxiv Evaluation of two main RNA-seq 1064 approaches for gene quantification in clinical RNA sequencing: polyA+ selection versus 1065 rRNA depletion Variation in RNA-Seq Transcriptome Profiles of 1067 Peripheral Whole Blood from Healthy Individuals with and without Globin Depletion The healthy human blood microbiome: Fact 1070 or fiction? Front Cell Infect Microbiol Real-time PCR in virology Copy number loss of the interferon gene 1074 cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis RNAseq analyses identify tumor necrosis factor-1077 mediated inflammation as a major abnormality in ALS spinal cord Laser-captured spinal cord motorneurons from ALS 1080 subjects show increased gene expression in vacuolar ATPase networks RNA Sequencing Reveals Small and Variable 1083 Contributions of Infectious Agents to Transcriptomes of Postmortem Nervous Tissues 1084 Alzheimer's Disease and Parkinson's Disease 1085 Subjects, and Increased Expression of Genes From Disease-Activated Microglia. Front 1086 Neurosci Numerous uncharacterized and highly 1088 divergent microbes which colonize humans are revealed by circulating cell-free DNA