key: cord-0955959-41ogljo8
authors: Wang, Changtai; Liu, Zhongping; Chen, Zixiang; Huang, Xin; Xu, Mengyuan; He, Tengfei; Zhang, Zhenhua
title: The establishment of reference sequence for SARS‐CoV‐2 and variation analysis
date: 2020-03-20
journal: J Med Virol
DOI: 10.1002/jmv.25762
sha: 34420b3e302a8ba248743ff23e2913f4705adc1f
doc_id: 955959
cord_uid: 41ogljo8
Starting around December 2019, an epidemic of pneumonia, which was named COVID‐19 by the World Health Organization, broke out in Wuhan, China, and is spreading throughout the world. A new coronavirus, named severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) by the Coronavirus Study Group of the International Committee on Taxonomy of Viruses, was soon found to be the cause. At present, the sensitivity of clinical nucleic acid detection is limited, and it is still unclear whether this is related to genetic variation. In this study, we retrieved 95 full‐length genomic sequences of SARS‐CoV‐2 strains from the National Center for Biotechnology Information and GISAID databases, established the reference sequence by conducting multiple sequence alignment and phylogenetic analyses, and analyzed sequence variations along the SARS‐CoV‐2 genome. Homology among all viral strains was generally high: 99.99% (99.91%‐100%) at the nucleotide level and 99.99% (99.79%‐100%) at the amino acid level. Although overall variation in open‐reading frame (ORF) regions is low, 13 variation sites in the 1a, 1b, S, 3a, M, 8, and N regions were identified, among which positions nt28144 in ORF 8 and nt8782 in ORF 1a showed mutation rates of 30.53% (29/95) and 29.47% (28/95), respectively. These findings suggest that there may be selective mutations in SARS‐CoV‐2, and that it is necessary to avoid certain regions when designing primers and probes.
Establishment of the reference sequence for SARS‐CoV‐2 could benefit not only biological study of this virus but also diagnosis, clinical monitoring, and intervention of SARS‐CoV‐2 infection in the future.
4 Cases have also been reported from 101 countries or areas including Thailand, Japan, Korea, Australia, France, and the United States. 5 Family clustering of infection, 3000 cases of infection among healthcare personnel, and other findings together provide strong evidence for human-to-human transmission of SARS-CoV-2, with a basic reproduction number (R0) of 2-4. 6-8 SARS-CoV-2 is highly transmissible and can have a long incubation time before symptoms appear, including fever, coughing, shortness of breath, and diarrhea. SARS-CoV-2 infection can be symptom-free in some patients, but may cause failure of multiple organs, including the lungs, heart, and liver, in others. The mortality rate of SARS-CoV-2 infection is about 3%. 1, 9, 10 The Yongzhen Zhang team in China was the first group to determine the full-length genomic sequence of the SARS-CoV-2 virus. 11 The genome is arranged in the order of a 5′-untranslated region (UTR)-replicase complex (orf 1ab)-structural proteins MERSr-CoV and SARSr-CoV can be transmitted from human to human, and were highly pathogenic, resulting in high mortality. [12] [13] [14] So far, scientists from different countries have obtained and uploaded more than 100 full-length or partial genomic sequences for SARS-CoV-2. Some companies have developed rapid nucleic acid detection kits based on these sequences. However, clinical use has revealed significant differences in sensitivity and specificity among these kits. In addition, a standardized quantitative detection method is still lacking.
For these reasons, missed diagnoses and misdiagnoses are currently not uncommon. [15] [16] [17] To provide a template sequence for the proper design of polymerase chain reaction (PCR) primers and probes that minimize false-negative results, and to obtain reliable sequence information for molecular and immunological studies of SARS-CoV-2 and for vaccine development, we retrieved full-length sequences from different regions of the world from the National Center for Biotechnology Information (NCBI) and GISAID websites, established the reference sequence for SARS-CoV-2 by homology and phylogenetic tree analyses, analyzed mutations at different locations, and conducted preliminary bioinformatics analyses of the reference sequence.
The NCBI (http://www.ncbi.nlm.nih.gov/genbank/) and GISAID (https://www.gisaid.org/) databases were searched up to Feb 14, 2020 using the keywords "novel, coronavirus, complete, Wuhan" or "2019-nCoV." Sequences were included if they were full-length (25000 to 32000 bp) and verified to be human SARS-CoV-2 sequences. Repetitively submitted sequences and sequences with too many undetermined nucleotides were excluded from this study. Sequences were classified as stage 1 or stage 2 according to when they were retrieved: sequences obtained on Feb 6, 2020 were assigned to stage 1, and sequences retrieved from Feb 6 to 14 were assigned to stage 2 (Figure 1 and Table S1).
Homology analysis and sequence alignment were conducted for all stage 1 sequences using Primer 7.0 and MEGA (7.0.14). The reference sequence was constructed by selecting the most common nucleotide at each position. The reliability of the reference sequence was confirmed by comparing it with the stage 2 sequences. The ClustalW program of the MEGA software (7.0.14) was used to conduct multiple sequence alignment, and the phylogenetic tree was constructed using a maximum likelihood approach based on stage 1 sequences.
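The rule of selecting the most common nucleotide at each position can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which used Primer 7.0 and MEGA; the function name `consensus` and the toy sequences are hypothetical, the input is assumed to be already aligned to equal length, and gap characters are simply counted like any other symbol.

```python
from collections import Counter


def consensus(aligned):
    """Return the most common symbol in each column of an alignment.

    Assumption: `aligned` is a list of equal-length strings that have
    already been aligned (for example by ClustalW). Gaps ('-') are
    counted like any other symbol, which is a simplification.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")

    result = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        # most_common(1) yields [(symbol, count)]; on a tie, the symbol
        # seen first in the column wins.
        result.append(counts.most_common(1)[0][0])
    return "".join(result)


# Toy alignment (hypothetical data, not real viral sequences).
toy_alignment = ["ATGCC", "ATGTC", "ACGTC"]
print(consensus(toy_alignment))  # ATGTC
```

A majority vote of this kind has no principled way to break ties, so a column with two equally frequent nucleotides would need manual review in practice.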
Related coronaviral sequences were used as references: 229E (KY369908), NL63 (MK334046), SARSr-CoV (AY278488), and bat coronavirus (MN996532). Primer 7.0 was used to compare the reference nucleotide sequence with those of related human isolates and to analyze the variation at different locations. Sequence comparison and variation analysis were also conducted at the amino acid level.
FIGURE 1 Flow chart of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequence data collection
Sequences of PCR primers/probes from published articles were aligned with our reference sequence to analyze sequence variation. Whether these primer/probe sequences overlapped with variation sites was also examined.
3 | RESULTS
A total of 145 sequences were obtained from the databases. After examination, 50 sequences were excluded according to the predetermined criteria. As a result, 95 sequences, of which 63 were obtained in stage 1 and 32 in stage 2, were used for analyses (Figure 1). These sequences were reported from China, the United States, Australia, Thailand, the United Kingdom, Germany, France, Finland, Korea, Japan, and Singapore, among others (Table S1). The reference sequence was constructed by nucleotide sequence alignment (accession number: EPI_ISL_412026).
The phylogenetic tree was constructed using sequences from database search stage 1 and other coronaviruses. While different types of coronaviruses showed a scattered distribution, all SARS-CoV-2 strains clustered together tightly. Importantly, the reference sequence was located in the middle of the SARS-CoV-2 cluster, indicating that the constructed reference sequence is representative (Figure S1).
Sequence alignment showed that mutations at both the nucleotide and amino acid levels were relatively rare, but they did exist. Mutations that occurred in ≥3 strains were found in these locations (Table S2). In addition, six deletion mutations were found in five isolated strains.
These mutations resulted in four different truncations of the amino acid sequence (1, 3, 6, or 8 aa). Furthermore, two deletion mutations were found in the 5′ and 3′ noncoding regions, respectively (Table S3).
Sequence alignment revealed differences between some primer/probe sequences and the reference sequence (Table 3). In a recently published article, 7 the primers from ORF1b differ from the reference sequence at some sites. In another publication, 16
Although sequence variation among SARS-CoV-2 isolates was low, and sequence analysis showed a rather random distribution of
A potential shortcoming of this study is that, since all sequences used were retrieved from databases, their accuracy could not be verified. Although the number of sequences included in this study is still small, it covers most of the complete SARS-CoV-2 viral sequences that have been published worldwide, and these are widely distributed geographically, so they should be able to represent the characteristics of the virus.
In summary, this study analyzed the SARS-CoV-2 genomic sequences available so far from the NCBI and GISAID databases and established the reference sequence for this virus. The variations in individual coding regions at both the nucleotide and amino acid levels were further analyzed, which revealed part of the reason why the sensitivity of current nucleic acid detection methods is far from ideal. Establishment of the reference sequence for SARS-CoV-2 could benefit not only biological study of this virus but also diagnosis, clinical monitoring, and intervention of SARS-CoV-2 infection in the future.
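The check of whether a primer/probe overlaps a variation site, and the arithmetic behind the reported mutation rates, can be sketched in Python as follows. This is an illustrative sketch rather than the procedure used in the study: the helper names (`primer_span`, `sites_in_span`, `mutation_rate`), the toy reference, and the toy primer are all hypothetical, and only exact forward-strand matches are considered (a reverse primer would first need to be reverse-complemented).

```python
def primer_span(reference, primer):
    """Return the 1-based inclusive (start, end) of the first exact
    forward-strand match of `primer` in `reference`, or None."""
    idx = reference.upper().find(primer.upper())
    if idx == -1:
        return None
    return idx + 1, idx + len(primer)


def sites_in_span(span, variation_sites):
    """Return the 1-based variation sites that fall inside `span`."""
    start, end = span
    return sorted(pos for pos in variation_sites if start <= pos <= end)


def mutation_rate(mutated, total):
    """Percentage of strains carrying a mutation at one site."""
    return 100.0 * mutated / total


# Toy data (hypothetical, not real SARS-CoV-2 sequence).
toy_reference = "ACGTACGTTAGCCGATAGGCTA"
toy_primer = "TAGCCGAT"
toy_sites = {5, 12, 20}

span = primer_span(toy_reference, toy_primer)
print(span)                            # (9, 16)
print(sites_in_span(span, toy_sites))  # [12]

# Rates reported in the abstract: 29/95 and 28/95 strains.
print(round(mutation_rate(29, 95), 2))  # 30.53
print(round(mutation_rate(28, 95), 2))  # 29.47
```

A primer whose span contains a frequently mutated site, such as nt8782 or nt28144, would be a candidate for redesign under the reasoning given in the text.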
Clinical features of patients infected with 2019 novel coronavirus in Wuhan
Genomic characterization and epidemiology of 2019 novel coronavirus: implications of virus origins and receptor binding
A novel coronavirus from patients with pneumonia in China
National Health Commission of the People's Republic of China
The Novel Coronavirus Pneumonia Emergency Response Epidemiology Team. The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) in China
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
Early transmission dynamics in Wuhan, China, of novel coronavirus-infected pneumonia
Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Clinical characteristics of 2019 novel coronavirus infection in China. medRxiv
A new coronavirus associated with human respiratory disease in China
Epidemiology, genetic recombination, and pathogenesis of coronaviruses
Severe acute respiratory syndrome coronavirus as an agent of emerging and reemerging infection
Middle East respiratory syndrome coronavirus: another zoonotic betacoronavirus causing SARS-like disease
Recent advances in the detection of respiratory virus infection in humans
Molecular diagnosis of a novel coronavirus (2019-nCoV) causing an outbreak of pneumonia
Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
A pneumonia outbreak associated with a new coronavirus of probable bat origin
Isolation and characterization of 2019-nCoV-like coronavirus from Malayan pangolins
Identification of 2019-nCoV related coronaviruses in Malayan pangolins in southern China
Differences in CpG island distribution between subgenotypes of the hepatitis B virus genotype
In vitro and in vivo replication of a chemically synthesized consensus genome of hepatitis B virus genotype B
Genomic characterization of the 2019 novel human-pathogenic coronavirus isolated from a patient with atypical pneumonia after visiting Wuhan
A deep learning algorithm using CT images to screen for Corona Virus Disease (COVID-19). medRxiv
The authors declare that there are no conflicts of interest.
ZL, ZC, XH, TH, CW, and ZZ collected and analyzed data. ZZ, CW, and JL wrote the manuscript. JL participated in the coordination of the study and manuscript modification. ZZ conceived the project. All authors contributed to, read, and approved the manuscript.
http://orcid.org/0000-0002-8480-9004