key: cord-0974385-dfjzse77
authors: Wen, Shaoqing; Sun, Chang; Zheng, Huanying; Wang, Lingxiang; Zhang, Huan; Zou, Lirong; Liu, Zhe; Du, Panxin; Xu, Xuding; Liang, Lijun; Peng, Xiaofang; Zhang, Wei; Wu, Jie; Yang, Jiyuan; Lei, Bo; Zeng, Guangyi; Ke, Changwen; Chen, Fang; Zhang, Xiao
title: High‐Coverage SARS‐CoV‐2 Genome Sequences Acquired by Target Capture Sequencing
date: 2020-06-03
journal: J Med Virol
DOI: 10.1002/jmv.26116
sha: d7e5eafe6b229775086e62963f575fcf03e5c988
doc_id: 974385
cord_uid: dfjzse77

In this study, we designed a set of SARS‐CoV‐2 enrichment probes to increase the capacity for sequence‐based virus detection and obtain the comprehensive genome sequence at the same time. This universal SARS‐CoV‐2 enrichment probe set contains 502 120nt ssDNA biotin‐labeled probes designed based on all available SARS‐CoV‐2 viral sequences and it can be used to enrich for SARS‐CoV‐2 sequences without prior knowledge of type or subtype. Following the CDC health and safety guidelines, marked enrichment was demonstrated in a virus strain sample from a cell culture, three nasopharyngeal swab samples (cycle threshold [Ct] values: 32.36, 36.72, and 38.44) from patients diagnosed with COVID‐19 (positive control) and four throat swab samples from patients without COVID‐19 (negative controls), respectively. Moreover, based on these high‐quality sequences, we discuss the heterozygosity and viral expression during coronavirus replication, and its phylogenetic relationship with other selected high‐quality samples from The Genome Variation Map (GVM). Therefore, this universal SARS‐CoV‐2 enrichment probe system can capture and enrich SARS‐CoV‐2 viral sequences selectively and effectively in different samples, especially clinical swab samples with a relatively low concentration of viral particles.

This article is protected by copyright. All rights reserved.

The outbreak of the novel coronavirus (SARS-CoV-2) disease has become a global and ongoing health concern. 
Since a patient with pneumonia of unknown etiology was first reported in the city of Wuhan on Dec 30, 2019, the epidemiological, clinical, radiological, laboratory, and genomic features of this virus have gradually been characterized by Chinese and international experts [1]. At the current stage of research, however, two crucial topics must be addressed. First, according to the latest diagnostic criteria, reverse-transcriptase polymerase chain reaction (RT-PCR) assays are recommended as the standard method for diagnosing SARS-CoV-2 infection. However, recent studies found that some patients have typical imaging findings, including ground-glass opacity, but negative RT-PCR results [2]. False-negative RT-PCR results can be caused by many factors, especially insufficient detection sensitivity when the viral load is low [2]. Second, more work must be done to monitor viral mutations and their influence on disease severity and progression. This requires the full-length SARS-CoV-2 genome, for which metagenomic sequencing is the latest and most comprehensive approach [3-6], although it remains costly. Moreover, metagenomic sequencing libraries contain significant amounts of host (human) nucleic acid and of carrier RNA introduced by commercial RNA extraction kits, both of which reduce the yield of viral sequence reads. In this context, we developed a set of SARS-CoV-2 enrichment probes based on hybridization capture technology to increase the sensitivity of sequence-based virus detection and characterization. This method was first used to enrich sequence targets from the human genome [7] and later from the vertebrate virome [8]. 
The enrichment probe set contains 502 ssDNA biotin-labelled probes at 2X tiling, designed based on all available SARS-CoV-2 viral sequences downloaded from GISAID (Global Initiative on Sharing All Influenza Data; https://www.gisaid.org/) on 2020/02/01, and it can be used to enrich for SARS-CoV-2 sequences without prior knowledge of type or subtype. Additionally, probes for human housekeeping genes (GAPDH, PCBP1, EIF3L, POLR2A, EIF3A, TGOLN2, TCEB3, CDK12, and BTBD7) were spiked into the probe set as internal controls for studying viral expression. To evaluate sensitivity and specificity, we tested the enrichment probe set on a virus strain sample derived from cell culture, three nasopharyngeal swab samples collected from patients diagnosed with COVID-19 (positive controls), and four throat swab samples taken from patients without COVID-19 (negative controls). The blank control was RNase-free water. The isolation and culturing of the SARS-CoV-2 virus were reported previously [9] and followed the CDC guidelines and good practice in laboratory health and safety requirements. Experiments were performed with the approval of the W96-027B framework. RT-PCR tests were performed on all samples following a previously described method [10]. The RT-PCR test kits (Bioperfectus) were officially approved by China's NMPA (National Medical Products Administration). The Ct values for all samples are listed in Table 1. Notably, sample GDFS2020329 showed a weakly positive RT-PCR result, with a Ct value close to the cut-off value (40) for positivity. We divided the total RNA sample of the SARS-CoV-2 virus strain (20SF014) into six samples processed under slightly different experimental conditions (Table 1). The six virus strain samples, three positive samples, four negative samples, and one blank control were each reverse-transcribed into cDNA, followed by second-strand synthesis. 
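As a rough illustration of the 2X tiling mentioned earlier in this section, the following Python sketch tiles fixed-length probes across a reference so that consecutive probes overlap by half their length. The function name `tile_probes`, the half-length step, and the extra probe placed flush with the 3' end are assumptions made for illustration; the actual probe design procedure is not described in this paper.

```python
def tile_probes(reference, probe_len=120, tiling=2):
    """Return (0-based start, probe sequence) pairs tiled across `reference`.

    Hypothetical helper: with tiling=2 the step between probe starts is
    probe_len / 2, so interior bases are covered by two probes.
    """
    if len(reference) < probe_len:
        raise ValueError("reference is shorter than one probe")
    step = probe_len // tiling  # 120-nt probes at 2X tiling -> 60-nt step
    last_start = len(reference) - probe_len
    starts = list(range(0, last_start + 1, step))
    # Assumption: add one probe flush with the 3' end so the final bases
    # are not left uncovered when the length is not a multiple of the step.
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, reference[s:s + probe_len]) for s in starts]


# Toy example on a 1,000-nt artificial sequence (not real viral sequence).
toy_reference = "ACGT" * 250
toy_probes = tile_probes(toy_reference)
print(len(toy_probes), "probes; first starts at", toy_probes[0][0],
      "and last starts at", toy_probes[-1][0])
```

Under this simple scheme, a 29,903-nt reference such as MN908947.3 would give 498 probes of 120 nt, which is close to, but not the same as, the 502 probes reported; the real design presumably handled sequence variation among the available genomes differently.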
Using the synthesized double-stranded DNA, all DNA libraries were constructed through DNA fragmentation, end repair, adaptor ligation, and PCR amplification. Subsequently, library hybridization capture was performed using the SARS-CoV-2 enrichment probe set. The enriched libraries were qualified with an Agilent 2100 Bioanalyzer using the Agilent High Sensitivity DNA Kit, and equivalent amounts of the double-stranded DNA libraries were pooled and converted into a single-stranded circular DNA library through DNA denaturation and circularisation. DNA nanoballs were generated from the single-stranded circular DNA by rolling circle amplification, then qualified with an Invitrogen Qubit 2.0 Fluorometer (ThermoFisher, Foster City, CA, USA), loaded onto the flow cell, and sequenced with PE100 on the MGISEQ-2000 platform (MGI, Shenzhen, China). A detailed experimental protocol, in Chinese and English, is provided in Supplementary Doc S1. Cutadapt (version 2.7) and Trimmomatic (version 0.38) were used to clip adaptors and trim low-quality reads. After removal of adaptor, low-quality, and low-complexity reads, the high-quality reads were first filtered against the human reference genome (hg38) using the Burrows-Wheeler Aligner (BWA-MEM). The remaining non-human reads were then aligned to the SARS-CoV-2 reference (MN908947.3, https://www.ncbi.nlm.nih.gov/nuccore/MN908947) using bowtie2 (version 2.3.4.1), and the aligned reads were filtered by mapping quality (-q 30) with samtools (version 1.10). Variants were called with samtools and VarScan (version 2.3.9, parameters: --strand-filter 0 --min-avg-qual 30 --min-reads2 15 --min-coverage 15). Finally, the consensus sequence of each sample was created with samtools and bcftools (version 1.9) from the variants called above. The summary statistics for each enrichment library are given in Table 1. 
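To make the variant-calling thresholds above more concrete, the following Python sketch applies the same three numeric cut-offs (base quality 30, coverage 15, supporting reads 15) to the bases observed at a single position. It is a simplified model written for illustration, not VarScan's implementation; the function name `call_site` and the input layout (a list of base and Phred-quality pairs) are assumptions.

```python
from collections import Counter

MIN_BASE_QUAL = 30  # mirrors --min-avg-qual 30: lower-quality bases are not counted
MIN_COVERAGE = 15   # mirrors --min-coverage 15: minimum depth to attempt a call
MIN_READS2 = 15     # mirrors --min-reads2 15: minimum reads supporting the variant


def call_site(ref_base, observations):
    """Return (alt_base, alt_count, depth) for one position, or None.

    `observations` is an iterable of (base, phred_quality) pairs, one per
    read covering the position (hypothetical input format).
    """
    counts = Counter(base for base, qual in observations if qual >= MIN_BASE_QUAL)
    depth = sum(counts.values())
    if depth < MIN_COVERAGE:
        return None  # too few usable bases at this position
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None  # every usable base matches the reference
    alt_base, alt_count = max(alt_counts.items(), key=lambda item: item[1])
    if alt_count < MIN_READS2:
        return None  # variant allele has too little support
    return alt_base, alt_count, depth


# Toy pileup: 20 high-quality reads with C and 5 with the reference base T.
toy_pileup = [("C", 35)] * 20 + [("T", 35)] * 5
print(call_site("T", toy_pileup))
```

Strand bias is ignored here, which corresponds to the strand filter being switched off (`--strand-filter 0`) in the parameters quoted above.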
The fraction of SARS-CoV-2 endogenous DNA in the virus strain enrichment libraries was between 90.07% and 96.58%, demonstrating that the number of reads mapped to the SARS-CoV-2 reference sequence increased significantly compared with metagenomic sequencing. Library complexity was evaluated by the Cluster Factor, defined as the number of raw reads divided by the number of reads after removing duplicates. In all enrichment libraries, the Cluster Factor was less than 1.5, with 1 being the best value for library construction. Notably, increasing the number of PCR cycles for library amplification from 15 to 17 improved library quality. Moreover, by merging the data from the six virus strain enrichment libraries, we obtained a total of 371,981,580 unique reads, of which 358,112,573 mapped to the SARS-CoV-2 reference. Using these unique SARS-CoV-2 fragments from the virus strain sample, we reconstructed six SARS-CoV-2 genomes (mean depth 186,869× and minimum coverage 13,816×). Only the merged sequence (coverage 1,121,217×) was used for further analysis. For the three positive samples, we also reconstructed three SARS-CoV-2 genomes (mean depth 98.92×, 14.76×, and 2370.64×, respectively). Their Ct values are 36.72, 38.44, and 32.36, respectively. Finally, for the virus strain sample, five variants were called from the merged data: one homozygous SNP (T23569C) and four heterozygous variants (three SNPs: C4534T, A5522T, C23525T; and one deletion: CT16779C). Among the three positive samples, GDFS2020309 has two homozygous variants (C23525T and CT27791C) and one heterozygous variant (T23569C); GDFS2020336 has two homozygous variants (C635T and C29303T); and GDFS2020329 has no variant. Heterozygosity has been reported in previous studies [6, 11]; we propose that this heterogeneity could be caused by mutations that arise during viral replication or by infection with multiple strains of the coronavirus. 
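The two library statistics used above, the Cluster Factor and the fraction of unique reads mapped to the viral reference, are simple ratios. The Python sketch below computes them; the function names are hypothetical, the merged read counts are the ones quoted in the text, and the raw and deduplicated read counts in the Cluster Factor example are invented purely for illustration.

```python
def cluster_factor(raw_reads, deduplicated_reads):
    """Raw reads divided by reads remaining after duplicate removal.

    A value of 1 means no duplicates; larger values mean lower complexity.
    """
    if deduplicated_reads <= 0:
        raise ValueError("deduplicated_reads must be positive")
    return raw_reads / deduplicated_reads


def mapped_fraction(mapped_reads, unique_reads):
    """Fraction of unique reads that map to the viral reference."""
    if unique_reads <= 0:
        raise ValueError("unique_reads must be positive")
    return mapped_reads / unique_reads


# Merged virus strain data as reported in the text.
merged_unique_reads = 371_981_580
merged_mapped_reads = 358_112_573
merged_fraction = mapped_fraction(merged_mapped_reads, merged_unique_reads)
print(f"mapped fraction of merged data: {merged_fraction:.2%}")

# Invented example counts (not from the paper) for the Cluster Factor.
example_cf = cluster_factor(raw_reads=1_200_000, deduplicated_reads=1_000_000)
print(f"example Cluster Factor: {example_cf:.2f}")
```

For the merged data this gives a mapped fraction of roughly 96%, which falls within the 90.07% to 96.58% range reported for the individual libraries.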
We collected the variation information (gff3 files) of high-quality samples from The Genome Variation Map (GVM) (ftp://download.big.ac.cn/GVM/Coronavirus/gff3/) on 2020/03/22. According to the quality criteria for 2019-nCoV issued by the National Genomics Data Center (2019nCoVR, https://bigd.big.ac.cn/ncov) [12], we included 601 samples with 45 SNVs at the first and second levels (with MAF>0.01 and no dense variation regions, see https://bigd.big.ac.cn/ncov/variation/annotation) in the following analysis. The raw variation information in the gff3 files was recoded into binary format as the input file for Network analysis (Network version 5, www.fluxus-engineering.com) (Table S1). Five clades could be identified and labelled, corresponding to the full genome tree provided by GISAID (see Figure S1). Except for three main larger clades

In Figure 1A, we found two peaks in genome sequencing depth, one covering the 5'UTR region (MN908947.3:1-256) and another covering the N region (MN908947.3:28274-29533), which may be associated with high expression of these two regions during coronavirus replication [13] [14]. For the high sequencing depth in the 5'UTR region, a reasonable explanation is that the 5'UTR upstream of ORF1a is necessary for the discontinuous synthesis of subgenomic RNAs in betacoronaviruses and contains the cis-acting sequences necessary for viral replication [12]. Clinically, the N gene RT-PCR assay was found to be more sensitive than assays targeting other genes in SARS-CoV-2 detection, which is consistent with our finding of high sequencing depth in the N region. This may be explained by the structural composition of the coronavirus and by differences in the regulation of subgenomic mRNA expression in host cells [14] [15] [16]. In Figure 1B, however, no typical depth peaks were found in the 5'UTR and N regions in the positive samples. We suggest that a larger sample size is needed to evaluate this divergent expression pattern in the future. 
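The recoding of variant calls into a binary matrix for network analysis, described earlier in this section, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the per-sample variants are taken to be already parsed into sets of labels, the function name `binary_matrix` and the sample and variant names are hypothetical, and the exact input file format expected by Network is not reproduced here.

```python
def binary_matrix(sample_variants, min_maf=0.01):
    """Recode per-sample variant sets as 0/1 rows over the retained sites.

    `sample_variants` maps a sample name to the set of variant labels it
    carries. Sites whose minor allele frequency is not above `min_maf`
    are dropped, in the spirit of the MAF>0.01 criterion quoted above.
    """
    n_samples = len(sample_variants)
    all_sites = sorted(set().union(*sample_variants.values()))
    kept_sites = []
    for site in all_sites:
        carriers = sum(site in variants for variants in sample_variants.values())
        freq = carriers / n_samples
        if min(freq, 1 - freq) > min_maf:
            kept_sites.append(site)
    rows = {
        sample: [1 if site in variants else 0 for site in kept_sites]
        for sample, variants in sample_variants.items()
    }
    return kept_sites, rows


# Hypothetical toy input; sample names and variant labels are made up.
toy_samples = {
    "S1": {"C100T", "A200G", "G300A"},
    "S2": {"C100T", "G300A"},
    "S3": {"G300A"},
    "S4": {"C100T", "A200G", "G300A"},
}
sites, matrix = binary_matrix(toy_samples)
print(sites)
for sample, row in matrix.items():
    print(sample, row)
```

In the toy input, the site carried by every sample is dropped because it has no minor allele among these samples, so only informative sites remain as columns.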
In general, among our selected human housekeeping genes (GAPDH, PCBP1, EIF3L, POLR2A, EIF3A, TGOLN2, TCEB3, CDK12, and BTBD7), GAPDH exhibited a relatively high expression level, PCBP1, EIF3L, and POLR2A showed a moderate expression level, and the remaining genes had a relatively low expression level. This gene expression pattern was clearly seen in all positive and negative samples (see Figure 1C). Importantly, according to the Transcripts Per Million (TPM) statistics, the positive samples GDFS2020336 (Ct value: 32.36), GDFS2020309 (Ct value: 36.72), and GDFS2020329 (Ct value: 38.44) exhibited high, moderate, and low viral expression levels (red bar), respectively, nearly equivalent to those of GAPDH, PCBP1, and BTBD7 (Figure 1C). In the current study, we designed a set of SARS-CoV-2 enrichment probes based on the available SARS-CoV-2 virus sequences. To test the enrichment effect, we made six enrichment libraries from one cultured SARS-CoV-2 virus strain and seven enrichment libraries from three positive samples (including a weakly positive sample) and four negative samples, and sequenced them on the MGISEQ-2000 platform. Overall, the SARS-CoV-2 enrichment probe set described in this study showed significant, SARS-CoV-2-specific enrichment and should be a useful tool for the SARS-CoV-2 research community for detecting low amounts of SARS-CoV-2 RNA and for monitoring future mutations. 
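For the Transcripts Per Million (TPM) comparison above, the paper does not state how TPM was computed. The Python sketch below uses the standard definition (read counts normalized by feature length, then scaled so that all features sum to one million); the function name `tpm` and the read counts and lengths in the example are invented for illustration and are not values from this study.

```python
def tpm(read_counts, lengths):
    """Standard TPM: per-length read rates rescaled to sum to one million.

    `read_counts` and `lengths` map each feature (a gene, or the viral
    genome treated as one feature) to its mapped reads and length in nt.
    """
    rates = {name: read_counts[name] / lengths[name] for name in read_counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("no reads mapped to any feature")
    return {name: rate / total_rate * 1_000_000 for name, rate in rates.items()}


# Invented example: two features with equal read counts but different lengths.
example_tpm = tpm(
    read_counts={"feature_A": 100, "feature_B": 100},
    lengths={"feature_A": 1000, "feature_B": 2000},
)
for name, value in example_tpm.items():
    print(f"{name}: {value:.1f} TPM")
```

Because of the length normalization, the shorter feature receives the higher TPM even though both have the same number of reads, which is what makes the viral genome and housekeeping genes of different lengths comparable on one scale.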
1. A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
2. Correlation of Chest CT and RT-PCR Testing in Coronavirus Disease 2019 (COVID-19) in China: A Report of 1014 Cases
3. A new coronavirus associated with human respiratory disease in China
4. A pneumonia outbreak associated with a new coronavirus of probable bat origin
5. Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
6. On the origin and continuing evolution of SARS-CoV-2
7. Enrichment of sequencing targets from the human genome by solution hybridization
8. Virome Capture Sequencing Enables Sensitive Viral Diagnosis and Comprehensive Virome Analysis
9. First Isolation and Identification of SARS-CoV-2 in Guangdong Province
10. SARS-CoV-2 Viral Load in Upper Respiratory Specimens of Infected Patients
11. RNA based mNGS approach identifies a novel human coronavirus from two individual pneumonia cases in 2019 Wuhan outbreak
12. The 2019 novel coronavirus resource
13. The structure and functions of coronavirus genomic 3' and 5' ends
14. Molecular Diagnosis of a Novel Coronavirus (2019-nCoV) Causing an Outbreak of Pneumonia
15. Genomic diversity of SARS-CoV-2 in Coronavirus Disease 2019 patients
16. The establishment of reference sequence for SARS-CoV-2 and variation analysis

We sincerely thank those who are on the front lines battling SARS-CoV-2 virus. We also thank the technical support provided by Guangzhou Koalson Bio-Technique Co. Ltd. Groups interested in testing this protocol can request guidance by emailing wenshaoqing@fudan.edu.cn, and a limited number of our SARS-CoV-2 enrichment probe set are available on request. 
Wen SQ and Zhang X designed the study and prepared the manuscript; Zheng HY, Zhang H, Zou LR, Liu Z, Liang LJ, Peng XF, Zhang W, Wu J, Yang JY, Lei B, and Zeng GY performed most of the experiments; Wen SQ, Sun C, Wang LX, Du PX, and Xu XD analyzed the data; Ke CW, Chen F, and Zhang X contributed to critical revision of the manuscript. The corresponding authors were responsible for all aspects of the study and ensured that issues related to the accuracy or integrity of any part of the work were investigated and resolved. All authors reviewed and approved the final version of the manuscript. The authors declare no conflict of interest.