key: cord-0813979-yb6if23t authors: wen, shaoqing; Zhang, Xiao title: A High-Coverage SARS-CoV-2 Genome Sequence Acquired by Target Capture Sequencing date: 2020-04-15 journal: nan DOI: 10.1101/2020.04.11.20061507 sha: f38233601aebd27eb4374f38ed67ba8488655829 doc_id: 813979 cord_uid: yb6if23t This manuscript is based on the method we developed urgently to deal with the research requirement in the conflict between achieving a complete genome sequence for the evolutionary history of SARS-CoV-2 study and the low viral RNA concentration. Here, in this manuscript, we developed a set of SARS-CoV-2 enrichment probes to increase the sensitivity of sequence-based virus detection and characterization via obtaining the comprehensive genome sequence. Following the CDC health and safety guidelines, we test the concept using the culturing supernatant contain SARS-CoV-2 particles, and its full-length sequence was used for further analysis. The fraction of SARS-CoV-2 endogenous DNA was 93.47% with Cluster Factor about 1.1, which demonstrate that the numbers of mapped reads to SARS-CoV-2 reference sequence significantly increased, compared to metagenomic sequencing technology, following SARS-CoV-2 probe enrichment. Moreover, based on the high-quality sequence, we discussed the heterozygosity and viral expression during replication of coronavirus, and its phylogenetic relationship with other selected high-quality samples from The Genome Variation Map (GVM) (on 2020/03/22). We believe this manuscript is valuable for all the researchers who are interested in using clinical warp samples to obtain the high coverage of SARS-CoV-2 genome sequence with a relatively low concentration of viral particles. This would allow the clinician to correlate the diagnostic data with molecular monitoring in viral evolutional, the most importantly, to track the functional mutation of SARS-CoV-2. health concern. Since a patient with pneumonia of unknown etiology were firstly reported in the city of Wuhan on 30th of Dec, 2019., Subsequently, epidemiological, clinical, radiological, laboratory and genomic findings of this virus gradually revealed by the Chinese and 00 experts around the globe [1] . At current stage, however, two topics emerged as things that needed to be addressed. Firstly, according to latest diagnostic criteria, reverse-transcriptase-polymerase-chain-reaction (RT-PCR) assays are recommended as the standard diagnosis of SARS-CoV-2-infection. Despite, present studies found that some patients had typical imaging findings, including ground-glass opacity, but negative RT-PCR testing [2] . The false-negative RT-PCR results may cause by many reasons, especially insufficient detection sensitivity in a low viral load scenario [2] . Secondly, although, by using metagenome sequencing, the complete genome sequence, taxonomic position, potential intermediate hosts, and evolutionary history of SARS-CoV-2 have been already reported [3] [4] [5] [6] , it is more important to monitor the virus mutation and its influence on disease severity and progression. Necessitating the full-length of SARS-CoV-2 genome, metagenome sequencing technology is the latest and most comprehensive approach but still costly. Moreover, in metagenome sequencing library, there are human (host) nucleic acid contamination cannot be eliminated. Inevitably, the carrier RNA introduced by commercial RNA extraction kits presented as contamination in library preparation stage, which indeed impairs the amount of viral sequence readout. In the context, we developed a set of SARS-CoV-2 enrichment probes, by utilizing hybridization capture technology, to increase the sensitivity of sequence-based virus detection and characterization. This method was first used to enrich sequence targets from the human genome [7] and then from vertebrate virome [8] . The enrichment probe set contains 502 ssDNA biotin-labelled probes at 2X tiling designed based on all available SARS-CoV-2 viral sequences, downloaded from the GISAID (Global Initiative on Sharing All Influenza Data; https://www.gisaid.org/) on 2/1/2020, and it can be used to enrich for SARS-CoV-2 sequences without prior knowledge of type or subtype. Additionally, the probes for human housekeeping genes (GAPDH, PCBP1, EIF3L, POLR2A, EIF3A, TGOLN2, TCEB3, CDK12, and BTBD7) were spiked in the probe set as internal controls for studying viral expression. The SARS-CoV-2 virus isolation and culturing was reported previously, which followed the CDC guidelines and good practice in laboratory health and safety requirement. Experiments were performed with approvals of W96-027B framework. Therefore, a SARS-CoV-2 virus strain (20SF014) was cultured in Vero-E6 cell, and RNA preparation performed using the QIAamp Viral RNA Mini Kit (Qiagen, Hilden, Germany). The cell culture media was collected and tested for viral RNA using the RT-PCR amplicons of partial ORF1ab with Ct value 36, and N gene regions with Ct value 33, respectively. We divided the total RNA sample into six samples (with following slightly different experimental conditions) ( Table 1) The cutadapt (version 2.7) and trimmonmatic (version 0.38) software was used for clipping adaptors and trimming low-quality reads. After removing adaptor, low-quality, and low-complexity reads, high-quality reads were first filtered against the human reference genome (hg 38) using Burrows-Wheeler Alignment (MEM). The remaining non-human reads were then realigned to the SARS-COV-2 reference (MN908947.3, https://www.ncbi.nlm.nih.gov/nuccore/MN908947) using bowtie2 (version 2.3.4.1) and filtered reads according to mapping quality (-q 30) by samtools (version 1.10). The variant was called by samtools and varscan (version 2.3.9, parameter: --strand-filter 0 --min-avg-qual 30 --min-reads2 15 --min-coverage 15). Finally, the sample consensus sequence was created by samtools and bcftools (version 1.9) according to the variants called above. The results illustrated in Table 6 , for library 1-6, 4,797,881-199,421,414 unique reads were obtained. Among them, 4,412,665-192,545,532 reads were mapped to SARS-CoV-2 reference sequence (MN908947.3). The fraction of SARS-CoV-2 endogenous DNA from these enrichment libraries were found to be between 90.07% and 96.58%, demonstrating that the numbers of mapped reads to SARS-CoV-2 reference sequence significantly increased, compared to metagenomic sequencing technology, following SARS-CoV-2 probe enrichment. The library complexity is evaluated by Cluster Factor, which is defined by "the number of raw reads divided . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint by the number of reads after removing duplicates". In our six enrichment libraries, the Cluster Factor is about 1.1. The closer to 1, the better the library construction is. Notably, when adding the PCR cycle numbers of library amplification from 15 to 17, the library quality gets better. Collectively, merging the data from six enrichment libraries, a total of 371,981,580 reads were obtained, among which 358,112,573 reads were mapped to SARS-CoV-2 reference. Using these unique SARS-CoV-2 fragments, we reconstructed six SARS-CoV-2 genomes (mean depth being 186,869× and minimum coverage 13,816×). Only the merged sequence (coverage 1,121,217×) was used for further analysis. There are five variants called from merged data, including one homozygous variant at SNP (T23569C), and four heterozygotic variants (three SNPs: C4534T, A5522T, C23525T, and one deletion: CT16779C). The phenomenon of heterozygosity had also been reported in previous studies [6, 10] , but we propose that this heterogeneity could be caused by the mutations that occur during viral cultivation or the infection by multi-strain of coronavirus. Table S1 ). Five clades could be identified and labelled, corresponding to the Full genome tree delivered by GISAID (see Figure 1 ). In Figure 2 , we found two peaks in genome sequencing depths, one covering the 5'UTR region (MN908947.3:1-256) and another covering the N region (MN908947.3:28274-29533), which may be associated with the high expression in these two regions during replication of coronavirus [11] [12] . For high sequencing depths in 5'UTR region, a reasonable explanation is that 5'UTRs before ORF1a is necessary for the discontinuous synthesis of sub genomic RNAs in the beta coronaviruses and contains the cis-acting sequences necessary for viral replication [11] . Clinically, N gene RT-PCR assay was found to be more sensitive than other genes in SARS-CoV-2 detection, which is consistent with our finding of high sequencing depths in N region. This can be explained as the structural composition of coronavirus, also the difference in expression regulation in the host cells regarding sub genomic mRNA [12] [13] [14] In the current study, we, based on the available SARS-CoV-2 virus sequences, designed a set of SARS-CoV-2 enrichment probes. We made six enrichment libraries from one cultured SARS-CoV-2 virus strain to test the enrichment effects and sequenced them on MGI-2000 platform. Overall, the SARS-CoV-2 enrichment probe set-up described in this study showed significance in SARS-CoV-2-specific enrichment. This high specific probe rate with RNA carrier-free approach can function as a useful tool for the SARS-CoV-2 research community, especially in detecting SARS-CoV-2 RNA in low amounts and monitoring the virus evolution . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint continuously. . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint sequence for SARS-CoV-2 and variation analysis. J Med Virol, 2020;1-8. Doc S1 A protocol for in-solution SARS-CoV-2 genome capture enrichment (PDF) The information of 597 high-quality SARS-CoV-2 sequences (especially 45 selected SNVs and their weight) for Network analysis in this study (XLS) . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint . CC-BY-NC-ND 4.0 International license It is made available under a author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the (which was not peer-reviewed) The copyright holder for this preprint . https://doi.org/10.1101/2020.04.11.20061507 doi: medRxiv preprint A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases A new coronavirus associated with human respiratory disease in China A pneumonia outbreak associated with a new coronavirus of probable bat origin Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding On the origin and continuing evolution of SARS-CoV-2 Enrichment of sequencing targets from the human genome by solution hybridization Virome Capture Sequencing Enables Sensitive Viral Diagnosis and Comprehensive Virome Analysis First Isolation and Identification of SARS-CoV-2 in Guangdong Province RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak The structure and functions of coronavirus genomic 3' and 5' ends Molecular Diagnosis of a Novel Coronavirus (2019-nCoV) Causing an Outbreak of Pneumonia Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients Zixiang Chen et al (2020) The establishment of reference We sincerely thank those who are on the front lines battling SARS-CoV-2 virus. We also thank the technical support provided by Guangzhou Koalson Bio-Technique Co. Ltd. Groups interested in testing this protocol can request guidance by emailing wenshaoqing@fudan.edu.cn, and a limited number of our SARS-CoV-2 enrichment probe set are available on request. The authors declare no conflict of interest.