key: cord-0751955-87d7gzgb
authors: Shen, Zijie; Xiao, Yan; Kang, Lu; Ma, Wentai; Shi, Leisheng; Zhang, Li; Zhou, Zhuo; Yang, Jing; Zhong, Jiaxin; Yang, Donghong; Guo, Li; Zhang, Guoliang; Li, Hongru; Xu, Yu; Chen, Mingwei; Gao, Zhancheng; Wang, Jianwei; Ren, Lili; Li, Mingkun
title: Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients
date: 2020-03-09
journal: Clin Infect Dis
DOI: 10.1093/cid/ciaa203
sha: 6da968668b2379ce81436dcede1332563e54d334
doc_id: 751955
cord_uid: 87d7gzgb

BACKGROUND: A novel coronavirus (SARS-CoV-2) has infected more than 75,000 individuals and spread to over 20 countries. It is still unclear how fast the virus evolves and how it interacts with other microorganisms in the lung.

METHODS: We conducted metatranscriptome sequencing of the bronchoalveolar lavage fluid of eight SARS-CoV-2 patients, 25 community-acquired pneumonia (CAP) patients, and 20 healthy controls.

RESULTS: The median number of intra-host variants was 1-4 in SARS-CoV-2 infected patients, with the number ranging between 0 and 51 across samples. The distribution of variants across genes was similar to that observed in the population data (110 sequences). However, very few intra-host variants were observed in the population as polymorphisms, implying either a bottleneck or purifying selection involved in the transmission of the virus, or a consequence of the limited diversity represented in the current polymorphism data. Although current evidence did not support the transmission of intra-host variants in a person-to-person spread, the risk should not be overlooked. The microbiota in SARS-CoV-2 infected patients was similar to that in CAP, either dominated by the pathogens or with elevated levels of oral and upper respiratory commensal bacteria.

CONCLUSION: SARS-CoV-2 evolves in vivo after infection, which may affect its virulence, infectivity, and transmissibility.
Although how the intra-host variants spread in the population is still elusive, it is necessary to strengthen the surveillance of viral evolution in the population and of the associated clinical changes.

Since the outbreak of a novel coronavirus (SARS-CoV-2) in Wuhan, China, the virus has spread to more than 20 countries, resulting in over 75,000 cases and more than 2,300 deaths (as of Feb 22, 2020) [1, 2]. The basic reproduction number was estimated to range from 2.2 to 3.5 at the early stage [3], making the virus a severe threat to public health. Recent studies have identified bats as the possible origin of SARS-CoV-2, and the virus likely uses the same cell surface receptor as SARS-CoV [4], namely ACE2. These studies have advanced our understanding of SARS-CoV-2. However, our knowledge of the novel virus is still limited.

The virus is under strong immunologic pressure in humans, and may thus accumulate mutations to outmaneuver the immune system [5]. These mutations could result in changes in viral virulence, infectivity, and transmissibility [6]. Therefore, it is imperative to investigate the pattern and frequency of the mutations that occur.

Aside from the pathogen, the microbiota in the lung is associated with disease susceptibility and severity [7]. Alterations of the lung microbiota could potentially modify the immune response against viral and secondary bacterial infection [8, 9]. Thus, understanding the microbiota, which comprises bacteria that could cause secondary infection or exert effects on the mucosal immune system, might help to predict the outcome and reduce complications.

In our study, we conducted metatranscriptome sequencing on bronchoalveolar lavage fluid (BALF) samples from 8 patients with Coronavirus disease 2019 (COVID-19, the disease caused by SARS-CoV-2). We found that the number of intra-host variants ranged from 0 to 51 with a median of 4, suggesting a high evolution rate of the virus.
By investigating a person-to-person spread event, we found no evidence for the transmission of intra-host variants. Meanwhile, we found no specific microbiota alteration in the BALF of COVID-19 patients compared with CAP patients with other suspected viral causes.

By metatranscriptome sequencing, more than 20 million reads were generated for each BALF sample of the COVID-19 patients (nCoV) as well as for a negative control (nuclease-free water, NC). For comparison, metatranscriptome sequencing data with a similar number of reads from 25 virus-like community-acquired pneumonia patients (CAP, determined by at least 100 viral reads and 10-fold higher than those in the NC), 20 healthy controls without any known pulmonary diseases (Healthy), and two extra NCs (two saline solutions passed through the bronchoscope) were used in this study. Demographic and clinical information was collected and is summarized in Supplementary Table 1.

After quality control, a median of 55,571 microbial reads was generated for each sample. nCoV had the highest proportion of microbial reads compared to CAP and Healthy (nCoV: median proportion of 7%, CAP: 0.8%, Healthy: 0.1%, p < 0.001, Figure 1A), and 49% of the microbial reads could be mapped to SARS-CoV-2, which was not different from the viral proportion in CAP (Figure 1B). Only SARS-CoV-2 was identified in nCoV, and no read was mapped to other species belonging to Betacoronavirus. Moreover, besides the detection of HCoV-OC43 in one Healthy and HCoV-NL63 in a CAP, no other samples showed any signal of Betacoronavirus, which supported the authenticity of the data and methods used in our analysis.

The sequencing depth of SARS-CoV-2 ranged from 18-fold in nCoV5 to 32,291-fold in nCoV1, with more than 80% of the genome covered by at least 50-fold in five samples (Figure 2A, Supplementary Table 2).
In total, 84 intra-host variants were identified with minor allele frequency (MAF) greater than 5%, and 25 of them had MAF greater than 20% (Supplementary Table 3, Figure 2B; nCoV5 was excluded from the analysis due to large gaps in its genome coverage). Notably, the number of variants was not associated with the sequencing depth (Supplementary Figure 1). The overall Ka/Ks ratio was significantly smaller than 1, and was similar for intra-host variants and the polymorphisms observed in the population data, suggesting purifying selection acting on both types of mutations (Table 1). The numbers of variants observed in each gene were proportional to gene length (cor = 0.950, p = 8E-06 for the intra-host variants; cor = 0.957, p = 4E-06 for the polymorphisms). Although only a small fraction of the variants was observed in multiple patients (2 out of 84, Figure 2C), some positions were either more prone to mutate or carried variants that were transmitting in the population, such as position 10779, where the mutant allele A was observed in all seven patients, with the frequency ranging from 15% to 100% (Figure 2D).

The number of intra-host variants per individual showed a large variation (0 to 51, median 4 for variants with MAF ≥ 5%; 0 to 19, median 1 for variants with MAF ≥ 20%), which could not be explained by batch effect, coverage variance, or contamination (Supplementary Figure 1; nCoV1-4 were in one batch, nCoV5-8 were in another batch; most mutations were not observed in the population data). We also noted that the number of variants was not related to the days after symptom onset or the age of the patients (Supplementary Figure 2). Collectively, we did not find any reason for the extremely high number of variants in nCoV6 (51 variants). A larger sample size is needed to investigate how frequent such outliers are, and whether they are associated with the level of host immune response or the viral replication rate.
We also noted similar outliers for other viruses [11]. Of note, the variants could originate either from mutations that occurred in vivo after infection or from multiple transmitted SARS-CoV-2 strains.

Among the eight COVID-19 patients, nCoV4 and nCoV7 were from the same household, with dates of symptom onset differing by five days; thus a transmission from nCoV4 to nCoV7 is highly suspected, especially considering that only nCoV4 had been to the Huanan seafood market in Wuhan, which is the starting point of the outbreak and suspected to be the source. First, the consensus sequence of the virus was the same for the two samples, and none of the four intra-host variants passing the selection criteria in nCoV4 was detected in nCoV7 (Table 2). We further expanded the investigation to all variants with MAF ≥ 2% and supported by at least 3 reads. By doing so, we detected seven variants (out of 25) shared between the two samples. However, the MAFs in both nCoV4 and nCoV7 were similar to those in other samples, suggesting that these positions were either error-prone or mutation-prone; hence they cannot support the transmission of these variants. Meanwhile, among all 84 intra-host variants, only three were found to be polymorphic in the population data (position 7866 G/T; 27493 C/T; 28253 C/T). This small overlap also suggests that intra-host variants were rarely transmitted to other individuals. However, we cannot rule out the possibility that the sequence diversity in the population is underestimated by the current database.

Metatranscriptome data also enabled us to profile the transcriptionally active microbiota in different types of pneumonia, which is associated with the immune response in the lung [12, 13]. In general, a significant difference in microbiota composition was observed among the nCoV, CAP, and Healthy groups (R² = 0.07, p = 0.001; Figure 3A). However, the clustering of some samples with the NC indicated a barren microbiota in those samples.
After removing the problematic samples and ambiguous components, we still found that nCoV and CAP were both different from the healthy controls (nCoV vs. Healthy: R² = 0.45, p = 0.001; CAP vs. Healthy: R² = 0.10, p = 0.002), implying that dysbiosis had occurred in their lung microbiota.

The microbiota could be classified into three different types (Figure 3B). In particular, the microbiota in cluster I was dominated by the possible pathogens, whereas the microorganisms in the other clusters were more diverse. By further inspecting the species belonging to each cluster (Supplementary Tables 4-5), we found that bacteria in Type III were mainly commensal species frequently observed in the oral and respiratory tract, whereas bacteria in Type II were mostly environmental organisms, so that contamination was highly suspected. Therefore, the microbiota was either pathogen-enriched (Type I), commensal-enriched (Type III), or undetermined due to low microbial load (Type II). The microbiota in six nCoV samples was pathogen-enriched, and that in the other two was commensal-enriched (Figure 3B). Moreover, the two nCoV samples (2, 6) with an excess number of intra-host SARS-CoV-2 variants both possessed the pathogen-enriched microbiota. The overwhelming proportion of the virus may be associated with a higher replication rate, and could also potentially stimulate an intense immune response against the virus, under which circumstances an excess number of intra-host mutations would be expected. However, as only eight nCoV patients were included in this analysis, and the absolute microbial load was unknown, more data are needed for further investigation.

RNA viruses have a high mutation rate due to the lack of proofreading activity of their polymerases. Consequently, RNA viruses are prone to evolve resistance to drugs and to escape from immune surveillance. The mutation rate of SARS-CoV-2 is still unclear.
However, considering that the median number of pairwise sequence differences was 4 (interquartile range: 3-6) for 110 sequences collected between Dec 24, 2019 and Feb 9, 2020, the mutation rate should be of the same order of magnitude as that of SARS-CoV (0.80-2.38 × 10^-3 nucleotide substitutions per site per year) [14]. The high mutation rate also results in a high level of intra-host variants in RNA viruses [11, 15]. The median number of intra-host variants in COVID-19 patients was 4 for variants with frequency ≥ 5%, and this incidence was not significantly different from that reported in a study on Ebola (655 variants with frequency ≥ 5% in 134 samples) (p > 0.05) [11], suggesting that the mutation rate of SARS-CoV-2 was also comparable to that of Ebola virus. An exoribonuclease (ExoN) has been proposed to provide proofreading activity in SARS-CoV [16, 17], and we noted that all three key motifs in the gene were identical between SARS-CoV and SARS-CoV-2 (Supplementary Figure 3). In addition, neither polymorphisms nor intra-host variants were detected in these motifs, suggesting that the gene is highly conserved, and thereby it could be a potential target for antiviral therapy.

Although we did not find any mutation hotspot genes in either polymorphisms or intra-host variants, the observation of shared intra-host variants among different individuals implied the possibility of adaptive evolution of the virus in patients, which could potentially affect the antigenicity, virulence, and infectivity of the virus [6]. It is worth noting that the SARS-CoV-2 genome in patients could be highly diverse, which was also observed in other viruses [11]. The high diversity could potentially increase the fitness of the viral population, making it hard to eliminate [15]. Further studies are needed to explore how this may influence the immune response towards the virus and whether there is selection acting on different strains in the human body or during transmission.
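The order-of-magnitude argument for the mutation rate made earlier in this discussion can be illustrated with a short calculation. This is only a sketch under stated assumptions, not the authors' method: it assumes that two sampled genomes diverged for roughly the length of the sampling window (Dec 24, 2019 to Feb 9, 2020), so that the expected pairwise difference is d = 2 × rate × L × T, and it assumes a genome length of 29,903 nt (the length of the reference genome NC_045512.2, which is not stated in the text). The function name `rough_substitution_rate` is hypothetical.

```python
from datetime import date


def rough_substitution_rate(median_pairwise_diff, genome_length, days_of_divergence):
    """Crude per-site, per-year rate obtained by solving d = 2 * rate * L * T for rate.

    The factor 2 reflects that both lineages accumulate substitutions
    independently after they split from their common ancestor.
    """
    years = days_of_divergence / 365.0
    return median_pairwise_diff / (2.0 * genome_length * years)


# Assumption: reference genome length (NC_045512.2); not given in the text.
GENOME_LENGTH = 29_903

# Assumption: divergence time approximated by the sampling window in the text.
sampling_days = (date(2020, 2, 9) - date(2019, 12, 24)).days

# Median pairwise difference of 4 is taken from the text.
rate = rough_substitution_rate(4, GENOME_LENGTH, sampling_days)
print(f"{rate:.2e} substitutions per site per year")
```

Under these assumptions the estimate comes out at roughly 5 × 10^-4 substitutions per site per year, which is within an order of magnitude of the SARS-CoV range quoted from [14]. Because the true divergence time is unknown, only the order of magnitude is meaningful here.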
In a single transmission event investigated in this study, we found no evidence for the transmission of multiple strains. However, it is unclear whether these intra-host variants occurred before or after the transmission, which would lead to different conclusions. Additionally, a bottleneck may be involved in the transmission, which could also result in a loss of diversity [18]. Nevertheless, the observation of a high mutation burden in some patients emphasizes the possibility of rapid evolution of this virus.

Recent studies have shown that the microbiota in the lung contributes to immunological homeostasis and potentially alters the susceptibility to viral infection. Meanwhile, the lung microbiota could also be regulated by invading viruses [9, 19]. However, apart from the microbial diversity being significantly lower in pneumonia than in healthy controls (Figure 3B), we did not identify any specific microbiota pattern shared among COVID-19 patients, nor among CAP patients. A possible reason for this could be the use of antibiotics in pneumonia patients. However, this was not true for all pneumonia samples, as a substantial proportion of bacteria was observed in some samples, including two COVID-19 patients. It is well known that secondary bacterial infection, a common complication of viral infection, especially with respiratory viruses, often results in a significant increase in morbidity [20]. Thus, the elevated level of bacteria in the BALF of some COVID-19 patients might increase the risk of secondary infection. In the clinical data, the secondary infection rate for COVID-19 was between 1% and 10% [2, 21]. However, the quantitative relationship between bacterial relative abundance/titer and infection is unclear.

Overall, our study has revealed the evolution of SARS-CoV-2 in the patient, a common feature shared by most RNA viruses.
How these variants influence the fitness of viruses and genetic diversity in the population awaits further investigation. Currently, only limited sequences are shared in public databases (Supplementary Table 6); hence there is an urgent need to accumulate more sequences to trace the evolution of the viral genome and associate the changes with clinical symptoms and outcomes.

Table 1.

For each patient, BALF samples were collected using a bronchoscope as part of normal clinical management. The volume of the BALF samples ranged between 5 ml and 30 ml, most of which was used for bacterial culture; the remainder was aliquoted and stored at -80 °C before processing.

The raw sequencing data reported in this paper have been deposited in the Genome Warehouse in the National Genomics Data Center [23], under project number PRJCA002202, which is publicly accessible at https://bigd.big.ac.cn/gsa. Meanwhile, the data have also been submitted to the NCBI Sequence Read Archive (SRA) database under project number PRJNA605907.

Quality control processes included adapter trimming, low quality reads removal, short Permanova test, samples and microorganisms were filtered for further analyses with the following criteria. Samples with fewer than 5000 microbial reads were discarded. Microorganisms satisfying the following criteria were considered in the microbiota analysis: 1) archaea, bacteria, fungi, or virus; 2) relative abundance ≥ 1% in the raw data and filtered data; 3) supported by at least 100 reads; 4) abundance more than 10-fold higher than that in the negative control; 5) no batch effect; 6) abundance not negatively correlated with bacterial titer; 7) not a known contaminant.

Clean reads were mapped to the reference genome of SARS-CoV-2 (GenBank: [34] and an in-house scripts.
All variants had to satisfy the following requirements: 1) sequencing depth ≥ 50; 2) minor allele frequency ≥ 5%; 3) minor allele frequency ≥ 2% on each strand; 4) minor allele count ≥ 5 on each strand; 5) the minor allele was supported by the inner part of the read (excluding 10 bp on each end); 6) both alleles could be identified in at least 3 reads that were specifically assigned to the genus Betacoronavirus.

For comparison with the polymorphisms in the population, we obtained 110 sequences from GISAID (www.gisaid.org) [35, 36]. The accession numbers and acknowledgments are included in Supplementary Table S6.

Pearson which some of our analysis is based, a full name list of all submitters was given in Table S6.

The authors declare no competing interests.

Heatmap of microbiota composition after the QC filter (filters are described in Methods). The CAP samples are labeled with virus names followed by numbers. COVID-19 patients are highlighted by black rectangles, and two co-occurring bacterial clusters are highlighted by red rectangles. The names of all viruses are labeled in blue, and contaminant genera reported by Salter and colleagues are labeled in red [38].
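The six requirements for intra-host variants listed in the Methods can be sketched as a single filter function. This is a minimal illustration, not the in-house scripts used in the study: the `SiteCounts` record and all of its field names are hypothetical, depth is assumed to be the sum of major- and minor-allele reads, and requirement 5 is interpreted as needing at least one minor-allele read whose base lies outside the 10 bp at either read end.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Hypothetical per-site read counts for the major and minor allele."""
    major_fwd: int            # major-allele reads on the forward strand
    major_rev: int            # major-allele reads on the reverse strand
    minor_fwd: int            # minor-allele reads on the forward strand
    minor_rev: int            # minor-allele reads on the reverse strand
    minor_inner_reads: int    # minor-allele reads with the base >10 bp from both read ends
    major_betacov_reads: int  # major-allele reads assigned to genus Betacoronavirus
    minor_betacov_reads: int  # minor-allele reads assigned to genus Betacoronavirus


def strand_maf(minor: int, major: int) -> float:
    """Minor allele frequency on one strand; 0.0 if the strand has no reads."""
    total = minor + major
    return minor / total if total else 0.0


def passes_filters(s: SiteCounts) -> bool:
    """Apply requirements 1-6 from the Methods to one candidate site."""
    depth = s.major_fwd + s.major_rev + s.minor_fwd + s.minor_rev
    if depth < 50:                                     # 1) sequencing depth >= 50
        return False
    if (s.minor_fwd + s.minor_rev) / depth < 0.05:     # 2) overall MAF >= 5%
        return False
    if min(strand_maf(s.minor_fwd, s.major_fwd),
           strand_maf(s.minor_rev, s.major_rev)) < 0.02:  # 3) MAF >= 2% on each strand
        return False
    if min(s.minor_fwd, s.minor_rev) < 5:              # 4) minor count >= 5 on each strand
        return False
    if s.minor_inner_reads < 1:                        # 5) inner-read support (assumed >= 1 read)
        return False
    # 6) both alleles seen in >= 3 Betacoronavirus-assigned reads
    return s.major_betacov_reads >= 3 and s.minor_betacov_reads >= 3


balanced = SiteCounts(400, 400, 50, 50, 80, 700, 90)
strand_biased = SiteCounts(400, 400, 100, 0, 80, 700, 90)
print(passes_filters(balanced), passes_filters(strand_biased))
```

In this example the first site passes all six checks, while the second is rejected because the minor allele is seen on only one strand, which is the kind of artifact that requirements 3 and 4 are designed to exclude.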
A Novel Coronavirus from Patients with Pneumonia in China
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Viral escape mechanisms--escapology taught by viruses
Evolution of virulence in emerging epidemics
The Lung Microbiome, Immunity, and the Pathogenesis of Chronic Lung Disease
Respiratory Viral Infection-Induced Microbiome Alterations and Secondary Bacterial Pneumonia
The respiratory tract microbiome and lung inflammation: a two-way street
SARS-CoV ORF1b-encoded nonstructural proteins 12-16: replicative enzymes as antiviral targets
Intra-host dynamics of Ebola virus during
Transcriptionally Active Lung Microbiome and Its Association with Bacterial Biomass and Host Inflammatory Status
Enrichment of the lung microbiome with oral taxa is associated with lung inflammation of a Th17 phenotype
Moderate mutation rate in the SARS coronavirus genome and its implications
Viral quasispecies evolution
Discovery of an RNA virus 3'->5' exoribonuclease that is critically involved in coronavirus RNA synthesis
Coronaviruses lacking exoribonuclease activity are susceptible to lethal mutagenesis: evidence for proofreading and potential therapeutics
Effects of Transmission Bottlenecks on the Diversity of Influenza A Virus
Association between the respiratory microbiome and susceptibility to influenza virus infection
Virus-induced secondary bacterial infection: a concise review
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Infectious Diseases Society of Thoracic Society consensus guidelines on the management of community-acquired pneumonia in adults
fastp: an ultra-fast all-in-one FASTQ preprocessor
Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments
The diploid genome sequence of an Asian individual
SortMeRNA: fast and accurate filtering of ribosomal 29
BLAST+: architecture and applications
MEGAN Community Edition - Interactive Exploration and Analysis of Large-Scale Microbiome Sequencing Data
Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM
The Sequence Alignment/Map format and SAMtools
VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing
disease and diplomacy: GISAID's innovative contribution to global health
Global initiative on sharing all influenza data - from vision to reality
KaKs_Calculator: calculating Ka and Ks through model selection and model averaging
Reagent and laboratory contamination can critically impact sequence-based microbiome analyses