key: cord-0932433-9e3xakvr authors: Zhang, Wei; Xu, Xiaoguang; Fu, Ziyu; Chen, Jian; Chen, Saijuan; Tan, Yun title: PathogenTrack and Yeskit: tools for identifying intracellular pathogens from single-cell RNA-sequencing datasets as illustrated by application to COVID-19 date: 2022-02-22 journal: Front Med DOI: 10.1007/s11684-021-0915-9 sha: 59cb8272662012cfbec337ff7a4449229d2f0b06 doc_id: 932433 cord_uid: 9e3xakvr Pathogenic microbes can induce cellular dysfunction, immune response, and cause infectious disease and other diseases including cancers. However, the cellular distributions of pathogens and their impact on host cells remain rarely explored due to the limited methods. Taking advantage of single-cell RNA-sequencing (scRNA-seq) analysis, we can assess the transcriptomic features at the single-cell level. Still, the tools used to interpret pathogens (such as viruses, bacteria, and fungi) at the single-cell level remain to be explored. Here, we introduced PathogenTrack, a python-based computational pipeline that uses unmapped scRNA-seq data to identify intracellular pathogens at the single-cell level. In addition, we established an R package named Yeskit to import, integrate, analyze, and interpret pathogen abundance and transcriptomic features in host cells. Robustness of these tools has been tested on various real and simulated scRNA-seq datasets. PathogenTrack is competitive to the state-of-the-art tools such as Viral-Track, and the first tools for identifying bacteria at the single-cell level. Using the raw data of bronchoalveolar lavage fluid samples (BALF) from COVID-19 patients in the SRA database, we found the SARS-CoV-2 virus exists in multiple cell types including epithelial cells and macrophages. SARS-CoV-2-positive neutrophils showed increased expression of genes related to type I interferon pathway and antigen presenting module. Additionally, we observed the Haemophilus parahaemolyticus in some macrophage and epithelial cells, indicating a co-infection of the bacterium in some severe cases of COVID-19. The PathogenTrack pipeline and the Yeskit package are publicly available at GitHub. ELECTRONIC SUPPLEMENTARY MATERIAL: Supplementary material is available in the online version of this article at 10.1007/s11684-021-0915-9 and is accessible for authorized users. Microbes are the most ubiquitous life forms with little known in terms of their diversity. Due to a deeper understanding of their role in regulating host immune response and causing host pathogenesis, studies on microbes have become a hotspot in the biological exploration of human health and disease. In addition, pathogenic microorganisms show regulatory roles on the machineries of genetic information flow in host cells, such as affecting the transcriptional program. This raises the question of how cell-to-cell variability predicts or alters the host relationship with microbial organisms in a given disease context. In recent years, much insight has been gained into host responses to microbial infections, but tools used for interpreting the distribution of pathogens in a single cell and the impact of pathogens on host cell homeostasis are not available. Single-cell RNA-seq (scRNA-seq) has recently been engaged as the most useful tool for studying transcriptomic characteristics at the single-cell level [1] . It has been widely used for revealing the distribution of cells in the microenvironment of organs, tissues and tumors, tracking cell hierarchy, understanding tumor heterogeneity, and inferring intracellular communication and regulatory networks [2] [3] [4] [5] [6] [7] . Since the RNA of intracellular pathogens may also be captured when preparing libraries for transcriptome sequencing, the scRNA-seq data set can also be used to track pathogens at the single-cell level [8] . In fact, a recent study focusing on viral-host interaction revealed the SARS-CoV-2 sequencing reads in 3′ scRNAseq data [9] , highlighting the potential usage of scRNA-seq in identifying intracellular pathogens. Nevertheless, there has not been any computational tools that systematically explore the metagenomic features in host cells at the single-cell level. Here, we established a computational framework for identifying and exploring intracellular pathogens (bacteria or viruses) at the single-cell level. The method includes a Python package (PathogenTrack) for intracellular pathogen identification and an R package (Yeskit) for integration, clustering, differential gene analysis, functional annotation, and visualization of single-cell data. Our algorithm has been tested on various simulated and real scRNA-seq data sets and performed robustly. Taking the scRNA-seq data of two severe COVID-19 patients as an example, we used PathogenTrack to identify microbial infected cells at the single-cell level and used Yeskit to explore the biological functions that may be related to SARS-CoV-2 infection. The PathogenTrack workflow The first step of PathogenTrack is to extract cell barcodes (CBs) and unique molecular identifiers (UMIs) and add them to the header of the sequenced reads. Many existing tools for whitelisting CBs are available, such as cellranger [1] , alevin [10] , and UMI-tools [11] . Since cellranger and alevin are designed for scRNA-seq quantification and generate more reliable CBs, it is recommended to use cellranger or alevin to acquire valid CBs. The CBs and UMIs are extracted and added to the header of read2 with UMI-tools. Fastp [12] is an ultra-efficient tool for read quality control and is employed to remove low-quality or low-complexity reads. After quality control, reads not aligned to the host reference genome with STAR [13] are kept for taxonomy classification. Kraken2 [14] is a readlevel taxonomy classification tool with high precision and speed and is employed in PathogenTrack. Since strains from the same species may have shared genomic sequences, the k-mer based method like kraken2 would not accurately classify all reads at the species level [14] . To solve this problem, PathogenTrack corrects the taxonomy IDs with less read support to those supported by the most abundant reads under the same species level. Reads with assigned taxonomy IDs are deduplicated and then quantified with UMIs. Finally, a quantification matrix of pathogen species with UMI counts is generated. Yeskit is an R package designed for single-cell gene expression data importation, integration, clustering, differential analysis, functional analysis, and visualization. It consists of 17 functions, including data importation and integration (scRead, scIntegrate, and scOne), differential analysis (scDGE and scPathogenDGE), functional analysis (scGO, scPathogenGO, and scMSigdbScoring), visualization (scDimPlot, scDensityPlot, scPopulationPlot, scVizMeta, scPathogenRatioPlot, scVolcanoPlot, scGO-BarPlot, scGODotPlot, and scScoreDimPlot). Yeskit obeys the default data structure of Seurat [15] , and stores pathogen expression data and pathway enrichment scores in the obj@meta.data slot and stores differential gene analysis results and GO enrichment results in the obj@misc slot. The scRead function is designed for reading 10x Genomics single-cell count matrix and filtering out low quality cells. Besides, two features (reading pathogen count matrix and handling PDX model) were included to fulfill the corresponding circumstances. If the pathogen count matrix was specified, scRead will read and store the pathogen-by-cell count matrix into the obj@meta.data slot. If the input file was a scRNA-seq count matrix from xenografts samples (PDX model with human and mouse genomes), scRead can distinguish human cells and mouse cells by the threshold fraction (a minimum of 90% hostspecific reads by default) of reads to separate human and mouse cells. In the quality control step, soft thresholds of the number of expressed genes (99-th quantile), and the percentage of mitochondrial genes (99-th quantile) were used instead of hard thresholds. The scIntegrate function is a wrapper for the Seurat standard workflow. It can be used to merge two or more Seurat object together, normalize the data, select features, scale the data, perform linear dimensional reduction (PCA), cluster the cells, run nonlinear dimensional reduction (UMAP/tSNE), and return an integrated Seurat object. Since the heterogeneity among clinical samples is always very large, the robust and time-efficient batch effect removing method Harmony [16, 17] is included in the scIntegrate function. If the project has only one sample, scOne function can be used instead of scIntegrate to complete the Seurat standard workflow. The scDensityPlot function is used to visualize the cell density between various samples. The dark red area indicates that the cell density in this area is high, and the white area indicates that the cell density in this area is low. It is routine to calculate marker genes in each cluster or differentially expressed genes between two conditions. Seurat's FindMarkers function is very useful for finding markers (differentially expressed genes) for identity classes. The scDGE function is a wrapper of FindMarkers, which is used to detect differentially expressed genes in each cluster between two groups. The scPathogenDGE function is also a wrapper for FindMarkers, used to detect genes differentially expressed between pathogen-positive and pathogen-negative (or pathogen-bystander) cell groups. The scGO function is designed for annotating the Gene Ontology functions of each cluster with the topGO [18] package. For cell markers generated by FindAllMarkers from the Seurat package, only the function of upregulated genes can be annotated, while for results generated by scDGE, both the functions of up-or downregulated genes can be annotated. The scVizMeta function can be used to display any numerical column stored in the obj@meta.data slot in UMAP/tSNE/PCA embeddings. The scMSigdbScoring function is used to calculate the pathway scores from MSigDB [19] and store them in the obj@meta.data slot. It uses the AddModuleScore function of Seurat to calculate and stores gene module scores in the obj@meta.data slot. Simulation is a compelling benchmarking strategy since the ground truth is known when the data are generated, making it possible to evaluate the performance of various methods. We need a scRNA-seq data set that mimics host cells infected with pathogens. There are two main processes in the scRNA-seq read simulation stage: single-cell count matrix preparation and single-cell sequencing read generation. During the preparation of single-cell expression data, the host single-cell count matrix and the pathogen count matrix must be generated separately. To make the simulated host scRNA-seq data closer to the real data set, the publicly peripheral blood mononuclear cells data set (pbmc 4k) from 10x Genomics website [1] was used as the host single-cell count matrix. The pathogen single-cell count matrix was generated by Splatter [20] . Splatter is a powerful count-level simulation method that can generate scRNA-seq count data robustly. We obtained 20 clinically common pathogenic species with complete sequenced core genomes from the list of human infectious pathogens at Wikipedia website and retrieved their gene sequences from NCBI. Since prokaryotic genomes vary greatly in gene size and number of genes, to ensure that each species is fully simulated, we randomly selected up to 50 genes for each species. To simulate host cells infected with pathogens at different levels, Splatter's default parameters were used except the library size parameter (lib.loc) was set between 1 and 5, with an increment of 1. After the pathogen single-cell count matrix is generated, the host and pathogen single-cell count matrices were combined into one count matrix for the single-cell reads generation process. In the single-cell reads generation process, minnow [21] was chosen to simulate scRNA-seq reads. Since numerous simulation methods have been introduced for scRNA-seq data [21] [22] [23] [24] [25] , minnow is a powerful simulator that can currently be used for read-level simulation of single-cell experiments. It can mimic the single-cell sequencing process, such as randomly assigning UMIs to molecules, simulating PCR duplicates based on real reads distribution, imputing sequencing errors, and generating random start positions from transcripts. Therefore, we employed minnow to simulate single-cell reads guided by the single-cell count matrix. Since the start position of the reads simulated from each transcript obeys the truncated normal distribution N(μ,σ), to avoid repeated sampling reads from the same start position, we run minnow with the default parameters, except that the standard deviation σ was changed [26] . To systematically simulate host cells infected with pathogens under various conditions, technical features including UMI length, read length, PCR cycles, and reads coverage were considered. In more detail, we repeated simulations with two UMI lengths (10 and 12 bp, characteristic of the 10x Genomics Single Cell 3′ Version 2 and Version 3, respectively), three read lengths (from 50 to 150 bp, at 50 bp increments), three σs (the standard deviation of start position from 25 to 75 in increments of 25), three PCR cycles (from 4 to 6 in increments of 1), five incremental pathogen infection levels (Splatter's library size location parameter from 1 to 5 in increments of 1, to indicate the infection level of bacterial or viral reads) and three replicates per simulation. In total, this represents 810 simulations (2 UMI lengths  3 read lengths  3 σs  3 PCR cycles  5 infection levels  3 replicates). We have limited our assessment to smaller simulated data sets consisting of 100 cells by down sampling the PBMC using geosketch [27] to save computational resources. In the "time and memory evaluation" step, to save calculation time, we randomly sampled 18 data sets. The detailed simulation parameters are as follows: (1) UMI length was set to 10; (2) Read length was set to 100; (3) PCR cycle was set to 5; (4) Pathogen infection level was set to 3; (5) The σ was set to 50; (6) The number of cells was set to 100, 500, 1000, 2000, 3000, and 4000 each time, and each simulation was repeated three times. The simulated data sets were processed with Viral-Track and PathogenTrack. Viral-Track uses UMI-tools to detect valid barcodes, while PathogenTrack uses barcodes generated by alevin or cellranger. Alevin is an accurate and fast end-to-end tool for processing droplet-based scRNA-seq data from fastq to count matrix. It performs better in CB detection and UMI deduplication. To make a fair comparison between Viral-Track and PathogenTrack, we replaced the default barcode file of Viral-Track with the barcode file generated by alevin. The accuracy of cells infected by a particular pathogen is evaluated by converting the detection results into a binary matrix. Each column represents a specific pathogen, and each row represents a cell. Then we record whether each cell is classified as pathogen infection (1) or not (0). Since we know the actual cells infected by specific pathogens in the simulation data set, we can evaluate each pathogen species' sensitivity and specificity. For each pathogen detection method, we calculated the number of true positive (TP; pathogen-infected cells were correctly classified), false positive (FP; non-infected cells were classified as pathogen-infected cells), true negative (TN; non-infected cells were classified as non-infected cells) and false negative (FN; pathogen-infected cells were classified as non-infected cells). We then calculated the sensitivity of each method as TP/(TP + FN) and specificity as TN/(TN + FP). The sensitivity of each classified pathogen was calculated and recorded. To make a fair comparison of these two methods, we only used pathogens in the simulated data for the benchmark. To benchmark the methods' performance, we implemented these detection tools in multiple high-performance computing platforms. "/usr/bin/seff" was used to record the run time and the maximum memory consumption. Any relevant data are available from the authors upon reasonable request. The scRNA-seq data used in this manuscript are all publicly available, and they are summarized in Table S1 . The PBMC data sets are available at 10x Genomics's official website. The PathogenTrack pipeline and the Yeskit package are publicly available at GitHub websites. For simple installation, PathogenTrack has been deposited in the Bioconda channel and the PyPI repository. PathogenTrack: unsupervised characterization of the intracellular microbiome from scRNA-seq data PathogenTrack is an unsupervised computational pipeline that uses unmapped reads to characterize intracellular pathogens at the single-cell level (Fig. 1A) . PathogenTrack includes the following steps: (1) Pre-processing scRNAseq reads with single-cell quantification software (such as cellranger or alevin) to obtain the gene quantification matrix and CB file (Fig. 1B) . The CB file is taken as the whitelist file for UMI-tools to extract the CB and UMI from Read1 and is added to the header of Read2 (barcoded-read2). (2) Removing low quality or low complexity reads in barcoded-read2 using fastp. (3) Aligning the remaining reads, which passed the quality control, to the host reference genome (such as hg38) using STAR algorithm. The unmapped reads are reserved for further use. (4) Metagenomic classification of the unmapped reads using Kraken2 algorithm. The taxonomy identifiers (IDs) are appended to the header of the corresponding reads. (5) Reads assigned with taxonomy IDs are subject to deduplication, taxonomic correction, and quantification with UMIs. The output pathogen species-by-cell quantification matrix with UMI counts is then ready for downstream analysis (Fig. 1A) . Yeskit: an R-based package for interpreting the scRNA-seq data Next, we come up to a method named Yeskit (Yet another single-cell analysis toolkit) to integrate and interpret the host gene expression data and the intracellular pathogen quantification data at the single-cell level (Fig. 1C) . Yeskit is an R package designed for single-cell gene expression matrix importation, data integration, clustering, differential analysis, functional analysis, and visualization. Since Yeskit does not change the default data structure of Seurat, it can be easily integrated into most existing scRNA-seq analysis workflows. Yeskit can be used to read other information (such as gene mutation-by-cell matrix, pathogen count-by-cell matrix) and store them as additional data in the Seurat obj@meta.data slot. Moreover, it calculates MSigDB pathway enrichment scores and stores them in the obj@meta.data slot. In addition, it performs differential gene analysis between groups or between pathogeninfected (Pos) and bystander (Neg) cells in each cluster. Furthermore, it performs GO enrichment analysis and stores their results in the obj@misc slot. Besides, when there are many points in the vector diagram, editing becomes difficult. To deal with the challenge, most visualization functions in Yeskit have the option to rasterize the point layer and keep all axes, labels, and text in vector format. To evaluate the applicability of our workflow for detecting and decoding intracellular pathogens in human single-cell data, we took the BALF data sequenced by 10x Genomics technology from two severe COVID-19 patients (SRA accession number: SRP250732) as an example [28] . Gene expression data were obtained by cellranger, and pathogen quantification matrices were generated by PathogenTrack. We then used Yeskit to integrate the host and the pathogen quantification matrices and explored the lesions of biological functions related to SARS-CoV-2 infection (Fig. 2) . A total of 13 138 high-quality single cells were ultimately obtained. Four major cell lineages were identified: macrophages, neutrophils, lymphocytes, and epithelial cells (Fig. 2A) . The cell distribution can be visualized by density plot, and the distribution of cell populations per samples were shown in Fig. 2B and 2C . We visualized the distribution of SARS-CoV-2 positive cells in each sample. In both samples, the sequence of SARS-CoV-2 could be found in epithelial cells and immune cells (Fig. 2D and Fig. 2E ). Secondary bacterial infections were reported to cause serious complications associated with worse outcomes in COVID-19 patients. PathogenTrack is optimally designed to systematically profile the source of infection or coinfections in human clinical samples. Interestingly, a small fraction of cells from one of the COVID-19 patients (patient C145) revealed the presence of a co-infected bacterium, Hemophilus parahemolyticus. The bacterium was enriched in neutrophils and macrophages (Fig. 2F) . Hemophilus parahemolyticus has been reported to cause acute respiratory distress syndrome and septic shock [29] . Differential gene expression analysis between SARS-CoV-2-RNA-positive and negative neutrophil cells indicated that SARS-CoV-2 positive cells exhibited elevated expression of a diverse set of genes required for monocyte activation, such as G-CSF receptor (CSF3R), CD16 (FCGR3B), and interferon-induced transmembrane protein 2 (IFITM2) (Fig. 2G) . These genes were enriched in pathways such as "type I interferon signaling pathway," "negative regulation of viral genome replication," and "neutrophil degranulation" (Fig. 2H) . Additionally, we scored the potential enriched pathways of all cells. Interestingly, the type I interferon response gene module was enriched in the neutrophil cell cluster (Fig. 2I) . These results suggest that the SARS-CoV-2 activates the IFN response pathway in neutrophil cells. Altogether, our analysis depicted the distribution of SARS-CoV-2-infected cells and Hemophilus parahemolyticus-infected cells in BALF samples and revealed the activated IFN response by SARS-CoV-2. Next, we systematically evaluated various factors that may affect the detection performance of our method. Since there was no "gold standard" data to assess the accuracy of pathogen detection at the single-cell level, we evaluated our method on simulated data sets. We employed minnow to simulate 810 data sets of host cells infected with pathogens under various simulation parameters. 20 microbes, including 10 bacteria (the Gram-positive Clostridioides difficile, Clostridium perfringens, Corynebacterium diphtheriae, Listeria monocytogenes, and Staphylococcus aureus; the Gram-negative Chlamydia trachomatis, Helicobacter pylori, Legionella pneumophila, Salmonella enterica, and Vibrio cholerae) and 10 viruses (EBV, HIV, Human metapneumovirus (hMPV), Human papillomaviruses (HPV), Herpes simplex virus (HSV), Molluscum contagiosum virus (MCV), Middle East respiratory syndrome coronavirus (MERS), Rabies virus, SARS-CoV-2, and Varicella-zoster virus (VZV)), were involved in the stimulation. The technological features including UMI length, read length, PCR cycles, pathogen infection levels, and σ (the standard deviation of start positions) were considered. As shown in Fig. 3A-3C , UMI length, PCR cycles, and σ have almost no effect on the performance of our method. The performance increases with the augmentation of read length or pathogen infection level ( Fig. 3D and 3E) . Read length over 100 bp showed good performance. In addition, we analyzed the accuracy of our method. A good agreement between the number of pathogen-infected cells predicted by our method and the expected genuine was shown in Fig. 3F . Since Viral-Track [9] was the only single-cell intracellular virus detection tool, we compared the sensitivity of our algorithm with Viral-Track on the 810 simulated data sets. The performance statistics of Viral-Track are illustrated in Fig. S1 . As is depicted in Fig. S1A , PathogenTrack performs as good as Viral-Track in virus detection. More importantly, PathogenTrack has the ability in detecting bacteria with high sensitivities (Fig. S1A) . To reduce the computational time when evaluating the time and memory consumptions of both methods, we randomly sampled 100 to 4000 cells. Benchmarking showed a roughly linear relationship between the number of cells and the processing time required, and the maximum memory consumption of PathogenTrack is less than that of Viral-Track (Fig. S1B) . To compare the performance of Viral-Track and Pathogen-Track for detecting pathogenic reads in real human samples, we next compared the results on several real human scRNA-seq data sets (Table S1 ). These data sets consist of a variety of tissues and cell lines (blood, lung, intestine, stomach, and lymphoblastoid cell lines) and various well-known viruses and bacteria: influenza A (H1N1 and H3N2), human immunodeficiency virus 1 (HIV-1), severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), Epstein-Barr virus (EBV), and Helicobacter pylori (H. pylori). Since cellranger or alevin provides more reliable CBs than UMI-tools, we used the default parameters to run these two methods, except that the input whitelist file was changed to barcode file generated by cellranger (applicable to the 10x Genomics platform). Because the pathogenic microbes and their infected cells were not aware in real data sets, only the well-known pathogens in each data set were evaluated. We observed that the number of virus-infected cells predicted by PathogenTrack was close to that of Viral-Track's, but there were a few differences (Fig. 4) . In the in vitro influenza A data (Cal07), PathogenTrack obtained nearly equal virus-infected cells to Viral-Track in infected samples, but fewer or none in controls (Fig. 4A) . It may imply PathogenTrack has higher specificity in detecting virus-infected cells. Besides, the Viral-Track failed to detect any bacteria, such as H. pylori (Fig. 4A ). Next, we calculated the correlation of the UMI counts of the intracellular pathogens predicted by both methods for each sample. The results show that pathogen UMI counts are in good agreement between these two methods (Fig. 4B) . We further tested the applicability of PathogenTrack in tracing bacterial reads in human clinical samples. We ran PathogenTrack on 13 gastric antral mucosa biopsy scRNAseq data sets (SRA accession number: SRP215370). Helicobacter pylori was detected only in two samples (Fig. 4A, last two venn diagram) , which was consistent with the clinical information in the original paper [30] . Subsequently, we asked if the pathogen-infected cells only predicted by PathogenTrack are reliable. Since it seems impossible to conduct experiments at the single-cell level to verify pathogen infection events, we used bioinformatics tool to address this question. As Pathogen-Track provides pathogenic sequences that support each cell infection event, we randomly selected a few pathogenic reads and verified the predictions by performing BLAST searches on the Nucleotide database. The time and memory consumption of these two methods are illustrated in Fig. 4C . PathogenTrack consumed almost constant memory (~30 Gb) and reasonable time on all real data sets, while Viral-Track consumed up to 128 Gb of memory in Cal07_Infected data and ran out of memory in Perth09_Infected data (> 180 Gb RAM). One reason could be that PathogenTrack detects pathogens at read-level while Viral-Track consumes more memory when assembling a virus genome. The last few years have witnessed emerging of many computational methods on detecting microbe-derived sequences in human clinical samples. However, methods for thoroughly identifying and interpreting intracellular pathogens at the single-cell level are rarely documented. In this study, we propose a computational method for identifying (PathogenTrack) and exploring (Yeskit) intracellular pathogens (viruses and bacteria) at the single-cell level from unmapped scRNA-seq data. PathogenTrack performed robustly on simulated and clinical data sets; Yeskit provided valuable information about pathogen infection status and pathogen-induced pathways. The currently accessible human clinical single-cell transcriptome data provide us with an excellent opportunity to identify these pathogen species and explore their functions in disease progression. These pathogens may indicate disease states and shed new lights on drug targets. Using COVID-19 as an example, we showed the SARS-CoV-2 existed in neutrophils, lymphocytes, macrophages, and epithelial cells. In addition, SARS-CoV-2 positive cells exhibited distinct gene expression patterns compared to those bystander cells. Although the SARS-CoV-2 mainly enter the host cell via the ACE2 protein, which is primarily expressed in type II alveolar epithelial cells, the SARS-CoV-2 might also enter immune cells owing to the following reasons: (1) viral overload; (2) the SARS-CoV-2 might be swallowed via endocytosis, trogocytosis, and phagocytosis. SARS-CoV-2 positive neutrophils showed an enhanced expression of genes related to interferon response and antigen-presenting; (3) ACE2 was found to be expressed on the surface of pulmonary macrophages [31] , which might serve as an entry path for SARS-CoV-2 into these cells. As a result, there could be viral proliferation in the pulmonary macrophage, leading to a vicious cycle of severe pulmonary viral infection and/or cytokine release syndrome. Alternately, pulmonary macrophage may play a role in antigen processing/presentation to other immune cells [31, 32] . Analysis of the BALF data suggests the SARS-CoV-2 may trigger distinct transcriptional programs related to aberrant immune response in COVID-19. In addition to infectious disease, emerging studies have also unveiled the intracellular pathogens, including viruses and bacteria, in the progression of malignant diseases [33] [34] [35] . Generally, the pathogenic microbiome is present in the digestive tract and respiratory tract [36] [37] [38] [39] [40] . Studies have also found that certain microbiomes may exist in cells and contribute to the occurrence and development of malignant diseases [37, 41] . For instance, many pathogens have been reported to be key regulators in malignant diseases, such as EBV, HBV, and HPV. However, how these viruses engaged in tumorigenesis remains unclear. We have shown the PathogenTrack can detect these pathogens at single-cell levels. In the current study, SARS-CoV-2 was enrolled as an example to illustrate the capacity of the PathogenTrack and Yeskit. Future studies focusing on other pathogen-related disease, such as malignant disease, might be conducted to unveil the potential regulatory role of pathogen in disease progression at the single-cell level. We might establish an online database for pathogens at single-cell level in multiple pathogen-related disease, such as EBV-related lymphoma, HBV-related hepatocellular carcinoma, HPV-related cervical cancer, and bacteria in multiple cancers. Currently, our method might only be applicable to the 10x Genomics and microwell-based scRNA-seq data sets, which are the most commonly used method to conduct scRNA-seq, and we might expand the application of our tools in future. Massively parallel digital transcriptional profiling of single cells Comprehensive single-cell transcriptional profiling of a multicellular organism SCENIC: single-cell regulatory network inference and clustering Single-cell RNA sequencing to explore immune cell heterogeneity Landscape and dynamics of single immune cells in hepatocellular carcinoma Inference and analysis of cell -cell communication using CellChat Single-cell RNA sequencing technologies and bioinformatics pipelines Resolving host-pathogen interactions by dual RNA-seq Host-viral infection maps reveal signatures of severe COVID-19 patients Alevin efficiently estimates accurate gene abundances from dscRNA-seq data UMI-tools: modeling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy fastp: an ultra-fast all-in-one FASTQ preprocessor STAR: ultrafast universal RNAseq aligner Improved metagenomic analysis with Kraken 2 Comprehensive integration of single-cell data Fast, sensitive and accurate integration of single-cell data with Harmony A benchmark of batch-effect correction methods for single-cell RNA sequencing data Gene set enrichment analysis with topGO Molecular signatures database (MSigDB) 3.0 Splatter: simulation of single-cell RNA sequencing data Minnow: a principled framework for rapid simulation of dscRNA-seq data at the read level A statistical simulator scDesign for rational scRNAseq experimental design Simulating multiple faceted variability in single cell RNA sequencing SERGIO: a single-cell expression simulator guided by gene regulatory networks ESCO: single cell expression simulation incorporating gene co-expression Polyester: simulating RNA-seq datasets with differential transcript expression Geometric sketching compactly summarizes the single-cell transcriptomic landscape Single-cell landscape of bronchoalveolar immune cells in patients with COVID-19 Haemophilus parahaemolyticus septic shock after aspiration pneumonia Dissecting the single-cell transcriptome network underlying gastric premalignant lesions and early gastric cancer Alveolar macrophage dysfunction and cytokine storm in the pathogenesis of two severe COVID-19 patients Integrating longitudinal clinical laboratory tests with targeted proteomic and transcriptomic analyses reveal the landscape of host responses in COVID-19 The landscape of bacterial presence in tumor and adjacent normal tissue across 9 major cancer types using TCGA exome sequencing Microbiome analyses of blood and tissues suggest cancer diagnostic approach The human tumor microbiome is composed of tumor type-specific intracellular bacteria Probiotics and the gut microbiota in intestinal health and disease The human microbiome: at the interface of health and disease The resilience of the intestinal microbiota influences health and disease Probiotics and prebiotics in intestinal health and disease: from biology to the clinic Interaction between microbiota and immunity in health and disease The gut microbiota shapes intestinal immune responses during health and disease We thank the support from Prof. Gang Lv and the ASTRA computing platform in the National Research Center for Translational Medicine (Shanghai) and the Pi computing platform in the Center for High Performance Computing at Shanghai Jiao Tong University. Wei Zhang, Xiaoguang Xu, Ziyu Fu, Jian Chen, Saijuan Chen, and Yun Tan declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. This article does not contain any studies with human or animal subjects. Supplementary material is available in the online version of this article at https://doi.org/ 10.1007/s11684-021-0915-9 and is accessible for authorized users.