key: cord-0819917-c65im816 authors: Xu, Yi; Kang, Lu; Shen, Zijie; Li, Xufang; Wu, Weili; Ma, Wentai; Fang, Chunxiao; Yang, Fengxia; Jiang, Xuan; Gong, Sitang; Zhang, Li; Li, Mingkun title: Hybrid capture-based sequencing enables unbiased recovery of SAR-CoV-2 genomes from fecal samples and characterization of the dynamics of intra-host variants date: 2020-08-01 journal: bioRxiv DOI: 10.1101/2020.07.30.230102 sha: d644d7021fcdd209e6da9676d1e48456d27660a1 doc_id: 819917 cord_uid: c65im816 Background In response to the current COVID-19 pandemic, it is crucial to understand the origin, transmission, and evolution of SARS-CoV-2, which relies on close surveillance of genomic diversity in clinical samples. Although the mutation at the population level had been extensively investigated, how the mutations evolve at the individual level is largely unknown, partly due to the difficulty of obtaining unbiased genome coverage of SARS-CoV-2 directly from clinical samples. Methods Eighteen time series fecal samples were collected from nine COVID-19 patients during the convalescent phase. The nucleic acids of SARS-CoV-2 were enriched by the hybrid capture method with different rounds of hybridization. Results By examining the sequencing depth, genome coverage, and allele frequency change, we demonstrated the impeccable performance of the hybrid capture method in samples with Ct value < 34, as well as significant improvement comparing to direct metatranscriptomic sequencing in samples with lower viral loads. We identified 229 intra-host variants at 182 sites in 18 fecal samples. Among them, nineteen variants presented frequency changes > 0.3 within 1-5 days, reflecting highly dynamic intra-host viral populations. Meanwhile, we also found that the same mutation showed different frequency changes in different individuals, indicating a strong random drift. Moreover, the evolving of the viral genome demonstrated that the virus was still viable in the gastrointestinal tract during the convalescent period. Conclusions The hybrid capture method enables reliable analyses of inter- and intra-host variants of SARS-CoV-2 genome, which changed dramatically in the gastrointestinal tract; its clinical relevance warrants further investigation. The ongoing Coronavirus disease 2019 (COVID-19) pandemic has brought a severe threat to public health and the global economy. The causative pathogen, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is likely of zoonotic origin. Genomes of RNA viruses mutate fast and undergo rapid evolution, which could ultimately affect their virulence, infectivity, and transmissibility. Monitoring variations in SARS-CoV-2 genome at the population level is vital for tracing the outbreak origin, tracking transmission chains, and understanding viral evolution (1) (2) (3) . Meanwhile, the study of intra-host single nucleotide variation (iSNV), which represents the intermediate stage between the origin and the fixation of the mutation at the individual level, is essential for understanding how the virus evolves in the human body under the immune pressure (4) . These studies require complete, in-depth, and unbiased profiling of SARS-CoV-2 genomes from a large number of patients. However, there are still challenges in obtaining high-quality SARS-CoV-2 genome directly from clinical samples, especially for those with low viral loads. For samples with Ct > 30, the proportion of genome recovered was lower than 90% irrespective of the amplification and sequencing approach used (2) . Current strategies for targeted enrichment of SARS-CoV-2 genome include hybrid capture and multiplex PCR amplification; the former method was proposed to be more reliable in generating unbiased coverage across the genome and identifying minor alleles (5) . As the effectiveness of the hybrid capture method is influenced by multiple factors, such as probe design, rounds of hybridization, and viral load, the fidelity of hybrid capture warrants further evaluation in the context of iSNV investigation. Recent studies have revealed the gastrointestinal (GI) tract as an important actor in the pathogenesis and transmission of COVID-19. GI infection of SARS-CoV-2 has been confirmed by the isolation of viable virus from the fecal specimen (6, 7) , and diarrhea is observed in > 20% of hospitalized COVID-19 patients (8) . Moreover, the shedding of SARS-CoV-2 from the GI tract lasted much longer than that from the respiratory tract in children (9) , raising the possibility of fecal-oral transmission. However, neither the genomic diversity of the virus nor its longitudinal dynamics in the GI is known. To disentangle how SARS-CoV-2 evolves and adapts in the GI tract, 18 longitudinal fecal specimens from nine children with COVID-19 were collected. Nucleic acids of SARS-CoV-2 were enriched by a hybrid capture method. Seventeen near-complete genomes (> 99% genome covered) of the SARS-CoV-2 were obtained. Among them, the majority rule consensus sequences at four sites were changed in 1-5 days, and another 15 sites showed allele frequency changes greater than 0.3, indicating a rapid genomic change in the GI. Eighteen fecal samples were collected from nine hospitalized children with real-time RT-PCR (RT-qPCR) confirmed SARS-CoV-2 infection. Two fecal samples were collected from each child within a time interval of 1-5 days (T1 and T2) in the recovery stage. The patients were characterized by mild symptoms and long duration of SARS-CoV-2 shedding in their feces (see Supplementary Table 1 for demographic and clinical information). All samples were inactivated at 56 for 30 min, and stored at -80 before processing. This study was approved by the ethics review committee of Guangzhou Women and Children's Medical Center. Written informed consents were obtained from the legal guardians of all children. Each fecal sample (~200 mg) was suspended in 1 mL PBS, and centrifuged at 8,000 g Quality control of the sequencing reads including adapter trimming, low-quality reads removal, and short reads removal was done by fastp v0.20.0 (10) (-l 70, -x, -cut-tail, Eighteen fecal samples were collected from nine children with COVID-19, two samples were collected from each child within a time interval of 1-5 days (T1 and T2). Ages of the children ranged from three months to 13 years old. Their symptoms were either mild or moderate, and two of them had diarrhea. All samples were collected during the recovery stage (see Supplementary Table 1 19,726 and 33 times higher than those in Raw and E1 data, respectively ( Fig 1A) . Consequently, the hybrid capture method significantly increased SARS-CoV-2 Table 1 ). In addition, we detected 24 and 12 SARS-CoV-2 RPM in two NC samples with one-round hybrid capture, indicating that the signal of SARS-CoV-2 observed in the COVID-19 patients was much stronger than that in NC. An excess number of PCR cycles were implemented to ensure enough input for sequencing following the hybridization due to the loss of non-target nucleic acids. Accordingly, we noted that the proportion of duplicated reads increased from 10.9% in the Raw to 53.1% after the first hybridization and 95.5% after the second hybridization ( Fig 1D) Fig 1E, Fig S1A-C) . Of note, the depth distribution was only slightly influenced by the probe density and genomic GC content, as the correlation was relative low between the depth and probe density (E1, R 2 = 0.01, E2, By comparing the consensus sequences (major allele frequency > 0.7) of E1/E2 to those of Raw, we identified 41 inconsistent sites (with depth > 5) (Supplementary Fig 2A-C) . Moreover, the changes of AAFs from T1 to T2 was also significantly correlated between the Raw and E1 (R 2 = 0.20; p < 0.001, Fig 2D) . Notably, the frequency change calculated from the Raw data showed a larger variance than E1 (Fig 2D) , likely reflecting more significant stochastic fluctuations related to the lower sequencing depth. These results demonstrated that hybrid capture-based sequencing is ideal for genomic diversity analysis, and also enable an accurate longitudinal analysis of iSNVs. However, the absence of the mutation in the population dataset could also be attributed to the limited genomic diversity included in the current population dataset. The number of mutations was inversely correlated with mutant allele frequency (MAF), except for iSNVs with MAF 0.8-0.95 (Fig 3A) , which probably reflected the back mutation from the mutant allele to the ancestral allele. The number of iSNVs in different genes is proportional to the length of corresponding regions (Fig 3A) , suggesting no visible mutation hotspot on the SARS-CoV-2 genome. The number of iSNVs markedly varied among different individuals, ranging from 6 to 39 in samples with full genome covered by at least 50-fold (Fig 3B) . Twenty-one iSNVs were shared among multiple individuals, which is more than expected by chance (P<0.001, Exact Poisson test, Fig 3C) . Moreover, 4 of the shared iSNVs were located at mutation hotspots (3) . Considering that only 198 such mutation hotspots were identified, shared iSNVs were significantly enriched at positions with higher mutation rates (P<0.001, Fisher exact test). C8782T and T28144C are widespread variants that defined a subgroup of SARS-CoV-2 (17,18); they were both heterogeneous in multiple samples in our data, suggesting that either the position is prone to mutate or multiple strains were circulating in the population when the samples were collected. Mutations at most positions only involved transitions between two nucleotides, except that more than two nucleotides were observed at four positions, namely 14307, 14308, 11147, and 11148 ( Fig 3C, Supplementary Table 3 ). The dynamics of all iSNVs were then examined, and only 8.3%-34.5% of iSNVs were observed at both T1 and T2 ( Fig 3B) . Meanwhile, 19 iSNVs showed frequency changes greater than 0.30 within 1-5 days (Fig 3D) , 16 of which are non-synonymous mutations. Nine of these significant shifts occurred within a day and four within two days, implying intense selection pressure or strong genetic drift. Notably, the most dramatically shifted iSNV C21771T, which caused a non-synonymous mutation from Ser to Leu, was observed in two individuals but showed opposite tendencies, whereby the frequency of the mutant allele increased by 0.91 in one day in S8 while decreased by 0.71 in two days in S5. Although we favored a random genetic drift hypothesis, considering that the viral genome differed at more than 20 positions between S8 and S5, the possibility of distinct selection pressures under different genomic backgrounds cannot be ruled out. Besides, the number of rapidly-changing positions varied among different individuals, which may reflect different immune pressures (Fig 3D) . We also noted that some mutations in the same individual had similar frequencies and showed concurrent changes (S5 and S8), suggesting they were located on the same haplotype. We further investigated the correlation between virological features and clinical features. First, the load of SARS-CoV-2 was found to be positively correlated with the duration of detection of SARS-CoV-2 RNA in feces since T1/T2 (linear regression, R 2 = 0.39, p < 0.01, Fig 3E) . Second, the number of iSNV was in a negative correlation with SARS-CoV-2 load (R 2 = 0.90, p < 0.01, Fig 3F) , as well as the duration of detection of SARS-CoV-2 RNA (R 2 = 0.62, p = 0.01, Fig 3G) . The increased genomic diversity in samples with lower viral loads may reflect a stronger immune response to the virus, which enables faster elimination of the virus in the body and thus associated with better clinical outcomes. The limited viral load in clinical samples has long been an obstacle to both pathogen detection and genomic studies, especially considering that the abundance of viruses could decrease when the disease progresses. Here we have demonstrated that the hybrid capture method was capable of recovering unbiased SARS-CoV-2 genome from RT-qPCR-negative and metatranscriptomic sequencing-negative samples. Furthermore, it also enables reliable analysis of inter-and intra-host variants, making it a promising strategy for genomic studies on the SARS-CoV-2. The identification of iSNVs in fecal samples as well as in throat swab and bronchoalveolar lavage fluid, suggest ongoing evolution of SARS-CoV-2 in the human body (19) (20) (21) (22) . However, the possibility of infection of multiple strains cannot be entirely ruled out. With the time series samples, we have proved that novel mutations could occur within one day, and the frequency of the mutation changed notably fast in humans. Thus, iSNVs are more likely to represent spontaneous mutations rather than infection of multiple strains. A recent study proposed that SARS-CoV-2 in GI had a relatively higher genomic diversity compared to that in the respiratory tract (20) , suggesting that the mutation rate may vary among different tissues. Thus, the fast-evolving of the viral genome in GI observed in our study may not be applicable to other organs. All samples in this study were collected from pediatric patients in rehabilitation with no clinical symptoms. There had been more than two weeks since symptoms onset for most patients except Sample 5. Fecal shedding of SARS-CoV-2 had been observed in multiple studies and was proposed to be longer compared to respiratory samples (9, 23) . However, whether the RNA shedding from stools is infectious during the convalescent phase is unclear (24) . The evolving of the viral genome in feces observed in our study suggested that the virus was viable and actively replicated in the gastrointestinal tract. Thus we speculate that the fecal transmission of SARS-CoV-2 is possible, and further study is warranted to investigate whether the viral load in feces is sufficient to establish an infection. Although a general purifying selection on the viral genome had been shown in our data and also by previous studies, we did not observe any signature of selection on based on these two mutations should be with great caution. The mutation and their frequency change may also affect the response to drugs. Position 10194 locates in the region encoding 3-chymotrypsin-like cysteine protease (3CL pro ), which is essential for the replication of coronavirus, and this region is also predicted to be the target for a few drugs including Lopinavir (25, 26) . At the first time point in patient S5, there were an equivalent number of reads presenting A and T at this position (a mutation from A to T results in an amino acid change from Glu to Val). At the second time point, which was one day later, the mutant allele T almost disappeared. Although it is unclear whether the mutation could affect the binding of anti-viral drugs, the risk should not be overlooked. In summary, our study highlighted the need for extensive studies on the intra-host variant dynamics in different tissues, using large cohorts spanning a wide range of ages, disease severity, and geographic regions. In A-C, local polynomial regression line, R and p values of Spearman correlations are shown; Ct value that below the detection limit was replaced with 42 for better visualization. (D) Heatmap of Ct, duplicate reads rate, RPM, and genome coverage after de-duplication. The median number for each group is shown with a bar plot, and crosses indicate that the samples were not included in E2 data. BDL, below the detection limit. (E) Depth distribution along SARS-CoV-2 genome. The relative depth was calculated for every 500 bases, and normalized to the average depth of each sample or the probes; only samples with average depth > 1 are shown. Red curve indicates GC content along the reference genome. Phylogenetic network analysis of SARS-CoV-2 genomes Genomic Epidemiology of SARS-CoV-2 in Guangdong Province Emergence of genomic diversity and recurrent mutations in SARS-CoV-2 Multiple approaches for massively parallel sequencing of HCoV-19 (SARS-CoV-2) genomes directly from clinical samples Evidence for Gastrointestinal Infection of SARS-CoV-2 Infection of bat and human intestinal organoids by SARS-CoV-2 Enteric involvement in hospitalised patients with COVID-19 outside Wuhan Characteristics of pediatric SARS-CoV-2 infection and potential evidence for persistent fecal viral shedding fastp: an ultra-fast all-in-one FASTQ preprocessor Aligning sequence reads, clone sequences and assembly contigs with The Sequence Alignment/Map format and SAMtools VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing Improved metagenomic analysis with Kraken 2 KaKs_Calculator 2.0: A Toolkit Incorporating Gamma-Series Methods and Sliding Window Strategies The establishment of reference sequence for SARS-CoV-2 and variation analysis On the origin and continuing evolution of SARS-CoV-2 Characterization of SARS-CoV-2 viral diversity within and across hosts Intra-host Variation and Evolutionary Dynamics of SARS-CoV-2 Population in COVID-19 Patients Intra-host site-specific polymorphisms of SARS-CoV-2 is consistent across multiple samples and methodologies Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients Prolonged fecal shedding of SARS-CoV-2 in pediatric patients. A quantitative evidence synthesis Potential fecal transmission of SARS-CoV-2: Current evidence and implications for public health Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors An investigation into the identification of potential inhibitors of SARS-CoV-2 main protease using molecular docking study We thank Dr. Xue Yongbiao and colleagues from National Genomics Data Center for helpful discussion and computational resource support. We thank Dr. Tang The authors declare no competing interests.