key: cord-0837450-tocj84ct
authors: Yu, Jian; Sun, Shanshan; Tang, Qianqian; Wang, Chengzhuo; Yu, Liangchen; Ren, Lulu; Li, Jun; Zhang, Zhenhua
title: Establishing reference sequences for each clade of SARS‐CoV‐2 to provide a basis for virus variation and function research
date: 2021-12-01
journal: J Med Virol
DOI: 10.1002/jmv.27476
sha: 5d77066d90b2a7e860782cf324beeb51d239b2d1
doc_id: 837450
cord_uid: tocj84ct

Coronavirus disease 2019 (COVID‐19) is a severe respiratory disease caused by the highly infectious severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2). As the COVID‐19 pandemic continues, mutations of SARS‐CoV‐2 accumulate. These mutations may not only make the virus spread faster, but also render current vaccines less effective. In this study, we established a reference sequence for each clade defined using the GISAID typing method. Homology analysis of the reference sequences confirmed a low mutation rate for SARS‐CoV‐2, with the latest clade, GRY, having the lowest homology with other clades (99.89%–99.93%), and the homology between the other clades being greater than or equal to 99.95%. Variation analyses showed that the earliest genotypes S, V, and G had 2, 3, and 3 characterizing mutations in the genome, respectively. The G‐derived clades GR, GH, and GV had 5, 6, and 13 characterizing mutations in the genome, respectively. A total of 28 characterizing mutations existed in the genome of the latest clade, GRY. In addition, we found differences in the geographic distribution of the clades: G, GH, and GR are prevalent in the USA, while GV and GRY are common in the UK. Our work may facilitate the custom design of antiviral strategies depending on the molecular characteristics of SARS‐CoV‐2.

Mutations in structural proteins targeted by vaccines may impair vaccine efficacy; mutations in nonstructural proteins may result in antiviral-resistant strains.
[3] In the early stage of the COVID-19 outbreak, virus genotyping was controversial among some scientists, who thought that it was difficult to prove the relationship between virus mutation and function and recommended not over-interpreting genome mutations during the pandemic. [4] However, as the SARS-CoV-2 pandemic became prolonged, the continuous accumulation of genomic data and in-depth study of the pathogenic and immune characteristics of the virus led to the development of several methods to genotype and classify the virus. Commonly used typing methods currently include the Chinese typing method, the Pangolin typing method, the GISAID typing method, and the Nextstrain typing method. In this study, we established a reference sequence for each clade based on the GISAID typing method and further analyzed the characterizing mutations and frequent mutations of each clade. In addition, the evolution of the characterizing mutations of all genotypes was analyzed in key regions (UK, South Africa, USA, India, and Brazil).

Sequence analyses were performed by a previously reported method. [5] Homology analysis and sequence alignment were conducted for the downloaded sequences using Primer 7.0 and MEGA (7.0.14). The reference sequence was established by selecting the most common nucleotide at each position. The ClustalW program of the MEGA software (7.0.14) was used to conduct multiple sequence alignment, and the phylogenetic tree was constructed using a maximum likelihood approach based on the reference sequences.

The MEGA6.0 software was used to construct a phylogenetic tree with the established reference sequences (Figure S2). Here, the phylogenetic tree simply shows the magnitude of the differences between clades and does not represent the evolutionary relationships among them. We then analyzed the homology between genotypes (Table 1), and the results showed that the homology between genotypes was high.
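The consensus rule stated in the methods (the most common nucleotide at each position) and the pairwise homology comparison can be illustrated with a minimal Python sketch. This is not the authors' pipeline, which relied on MEGA and ClustalW; it assumes the sequences are already aligned to equal length, treats homology as simple percent identity over aligned columns, and uses toy sequences. The function names `consensus`, `percent_identity`, and `substitutions` are hypothetical and introduced only for illustration.

```python
from collections import Counter


def consensus(aligned):
    """Build a reference by taking the most common character in each column.

    Assumes pre-aligned sequences of equal length. Ties are resolved in
    favor of the character seen first in the column (Counter ordering).
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


def percent_identity(a, b):
    """Percentage of aligned positions at which two sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def substitutions(reference, query):
    """List (1-based position, reference base, query base) for each mismatch."""
    return [
        (i + 1, r, q)
        for i, (r, q) in enumerate(zip(reference, query))
        if r != q
    ]


# Toy aligned sequences standing in for two clades (not real genome data).
clade_a = ["ACGTAC", "ACGTAC", "ACTTAC"]
clade_b = ["ATGTAC", "ATGTAA", "ATGTAC"]

ref_a = consensus(clade_a)
ref_b = consensus(clade_b)
identity = percent_identity(ref_a, ref_b)
differences = substitutions(ref_a, ref_b)
```

Applied to real data, the same column-wise vote would run over the aligned genomes of one clade, and the mismatch list between two clade references corresponds to the kind of position-level differences reported as characterizing mutations.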
The latest clade, GRY, had the lowest homology with other clades (99.89%-99.93%), and the homology between the other clades was greater than or equal to 99.95%. By comparing to the established reference sequences, we identified characterizing mutations for each clade at the nucleotide and amino acid levels (Tables 2 and 3).

TABLE 2 (partial) Characterizing mutations at the nucleotide level, given as position and reference>mutant nucleotide. 5ʹUTR: 204 G>T; 241 C>T (five clades). ORF1a: 445 T>C; 913 C>T; 1059 C>T; 3037 C>T (five clades).

It is worth noting that the mutation rate of N_S202 was close to 100% after September 2020, so it can be considered a characterizing mutation of clade S; and the frequency of the high-frequency mutations of GH showed an upward trend over time, with the mutation frequency reaching about 50% for both by January 2021.

Finally, we analyzed the occurrence of characterizing mutations in key regions (Figure 3). Mutation NSP6_L37F (a V-specific mutation) was prevalent in the UK and USA in the early days and then appeared in India, but it is now close to disappearing; ORF3a_G251V is another V-specific mutation, but it was prevalent only in the early months in the UK and USA. These observations suggest that although both NSP6_L37F and ORF3a_G251V are characterizing mutations of clade V, they did not necessarily occur at the same time.

TABLE 3 Characterizing mutations at the amino acid level of SARS-CoV-2 based on reference sequences

in January was as high as 50%, but this was mostly due to the small number of early sequences uploaded from India (only 2 sequences in January). G, GH, and GR are currently prevalent mainly in the USA, and GV and GRY are currently common mainly in the UK (Figure S3).

March 2021. Clade GRY adds the S_N501Y mutation on top of the S_D614G mutation. A series of studies have shown that the N501Y mutation strengthens binding to the human receptor angiotensin-converting enzyme 2 (ACE2) and further enhances the ability of the virus to enter host cells. [14] [15] [16] Therefore, GRY has gradually become the main epidemic clade since 2021.
In general, results from this study will facilitate viral detection, functional analysis, vaccine design, epidemic investigation, and evaluation of drug efficacy, among others. In this study, we established a reference sequence for each clade classified using the GISAID typing method. By comparing with the established reference sequences, we found that the earliest geno-

1. A new coronavirus associated with human respiratory disease in China
2. An online coronavirus analysis platform from the National Genomics Data Center
3. SARS-CoV-2, the pandemic coronavirus: Molecular and structural insights
4. No evidence for distinct types in the evolution of SARS-CoV-2
5. The establishment of reference sequence for SARS-CoV-2 and variation analysis
6. Hepatitis B virus genotype A: design of reference sequences for sub-genotypes
7. Geographic and genomic distribution of SARS-CoV-2 mutations
8. Coronavirus pathogenesis and the emerging pathogen severe acute respiratory syndrome coronavirus
9. Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein
10. Structures of the SARS-CoV-2 nucleocapsid and their perspectives for drug design
11. Detection of 2019 novel coronavirus (2019-nCoV) by real-time RT-PCR
12. Similarities and differences in the conformational stability and reversibility of ORF8, an accessory protein of SARS-CoV-2, and its L84S variant
13. SARS-CoV-2 spike D614G change enhances replication and transmission
14. The new SARS-CoV-2 strain shows a stronger binding affinity to ACE2 due to N501Y mutant
15. N501Y mutation of spike protein in SARS-CoV-2 strengthens its binding to receptor ACE2
16. Higher infectivity of the SARS-CoV-2 new variants is associated with K417N/T, E484K, and N501Y mutants: an insight from structural data

This study was supported by the Anhui Provincial Natural Science Foundation of China (grant number: 1608085MH162).

The authors declare that there are no conflicts of interest.

Jian Yu and Shanshan Sun contributed equally to this study.
Jian Yu and Shanshan Sun analyzed the data and wrote the manuscript; Qianqian Tang, Chengzhuo Wang, and Liangchen Yu were responsible for collecting, collating, and checking the data; Lulu Ren prepared the figures; Zhenhua Zhang and Jun Li conceptualized and designed the study and critically revised the manuscript.

Data for this study can be accessed through http://www.gisaid.org. Information on the data is listed in Table S2.

http://orcid.org/0000-0002-8480-9004