key: cord-0728419-ps6f2c7q
authors: Li, Bai-sheng; Li, Zhen-cui; Hu, Yao; Liang, Li-jun; Zou, Li-rong; Guo, Qian-fang; Zheng, Zhong-hua; Yu, Jian-xiang; Song, Tie; Wu, Jie
title: Genomic Evolution and Variation of SARS-CoV-2 in the Early Phase of COVID-19 Pandemic in Guangdong Province, China
date: 2021-04-20
journal: Curr Med Sci
DOI: 10.1007/s11596-021-2340-3
sha: 6f3a8f4c19b07c38cf352948024aadf1b7bb8e3f
doc_id: 728419
cord_uid: ps6f2c7q

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), whose origin is unknown, spread rapidly to 222 countries, areas or territories. To investigate genomic evolution and variation in the early phase of the COVID-19 pandemic in Guangdong, whole-genome sequencing was performed on 60 SARS-CoV-2 specimens, followed by genomic, amino acid variation and Spike protein structure modeling analyses. Phylogenetic analysis suggested that the early variation in the SARS-CoV-2 genome was still intra-species, with no evolution toward other coronaviruses. Each genome carried one to seven single nucleotide variations (SNVs), and the SNVs were distributed across different regions of the genome. Molecular modeling of the Spike protein bound to its human receptor identified an amino acid salt bridge and a potential furin cleavage site in SARS-CoV-2. Our study clarified the characteristics of SARS-CoV-2 genomic evolution, variation and Spike protein structure among early local cases in Guangdong, providing a reference for developing prevention and control strategies and for tracing the source of new outbreaks.

Among the several coronaviruses that are pathogenic to humans, most are associated with mild clinical symptoms [1], with two notable exceptions: severe acute respiratory syndrome coronavirus (SARS-CoV) [2, 3] and Middle East respiratory syndrome coronavirus (MERS-CoV) [4], which together have caused more than 10 000 cases, with mortality rates of 10% for SARS-CoV and 37% for MERS-CoV [5, 6].
These facts suggest that coronavirus infection is a persistent threat to humans, especially infection by novel coronaviruses of animal origin. In December 2019, a series of cases with clinical manifestations of viral pneumonia of unknown cause emerged. Deep sequencing of lower respiratory tract samples indicated a novel coronavirus, which was named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [7]. Meanwhile, the World Health Organization (WHO) named the novel coronavirus pneumonia coronavirus disease 2019 (COVID-19). According to current WHO data, the numbers of infections and deaths have increased to 73 575 202 and 1 656 317, respectively (Dec. 2019 to Dec. 19, 2020). Guangdong Province has the largest number of reported cases in China after Hubei Province. The first confirmed COVID-19 cases in Guangdong appeared on January 14, 2020. Monitoring and analyzing genomic variation of the virus is important because it helps to predict the epidemic trend of the disease. Here we describe the genomic characterization of 60 SARS-CoV-2 genomes from patients in Guangdong, together with publicly available genomes, providing important information on the genomic variation of this new virus in the early phase of the COVID-19 pandemic in Guangdong.

We collected clinical samples, including bronchoalveolar lavage fluid (BALF), endotracheal aspirates, throat swabs and nasal swabs, from patients and performed metatranscriptomic sequencing. Total RNA was extracted from 200 μL sputum fluid with the human rRNA Depletion Kit (NEB #E6350). A metatranscriptomic library was constructed for single-end (75 bp) sequencing on an Illumina NextSeq 550Dx, and the sequencing data were analyzed with the rapid pathogen detection system (RPD-seq, Guangzhou Sagene Biotech Co., Ltd.). Sequence reads were assembled de novo, and the resulting whole genomes were screened for potential mutations.
We retrieved coronavirus sequences, including SARS, bat SARS-like and other available genomes, from the NCBI viral genome database (https://www.ncbi.nlm.nih.gov/) and GISAID (https://www.gisaid.org/). Multiple sequence alignment of all coronavirus genomes was performed with MUSCLE software [9]. Representative genomes from each coronavirus category were used to construct a phylogenetic tree with MEGA X software using the neighbor-joining method [10]. The reliability of the tree was evaluated with 1000 bootstrap replicates. The glycoprotein regions of SARS-CoV, Bat-SL-RaTG13 CoV and SARS-CoV-2 were aligned and visualized with Multalin software [11]. The identified amino acids were aligned against the whole viral genome database using BLASTp. The conservation of the amino acid motifs in clinical variants of the SARS-CoV-2 genome was presented by multiple sequence alignment in MEGA X. The three-dimensional structure of the SARS-CoV-2 envelope (spike, S) glycoprotein was generated with the SWISS-MODEL online server [12], and the structure was annotated and visualized with RasMol (www.openrasmol.org). The model of the SARS-CoV RBD in complex with its receptor (PDB code 6ACG) was used to assess whether the SARS-CoV-2 RBD could bind the potential human receptor, ACE2 [13].

The raw metatranscriptomic data of the 60 samples were obtained by next-generation sequencing (NGS). After host-derived reads were removed, the complete genome sequence of SARS-CoV-2 was successfully assembled (table 2). Large viral sequence fragments can be assembled when the sequenced genome has almost complete coverage and the average sequencing depth exceeds 10×. Correcting the assembled sequence against a reference sequence can improve assembly quality. The first 6 confirmed cases in Guangdong Province and the first 5 confirmed cases in Wuhan were selected to construct a phylogenetic tree with other betacoronaviruses (fig. 1).
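The assembly criterion mentioned above (almost complete coverage and an average sequencing depth above 10×) can be illustrated with a minimal sketch. The function names and the 0.99 breadth threshold are assumptions made here for illustration; the text only says "almost complete" and does not describe how coverage was computed in RPD-seq.

```python
from statistics import mean


def coverage_summary(depths, min_depth=1):
    """Summarize per-position read depth along a reference genome.

    depths: list of integer read depths, one per reference position.
    min_depth: depth at which a position counts as covered (assumed to be 1).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "breadth": covered / len(depths),  # fraction of positions covered
        "mean_depth": mean(depths),        # average sequencing depth
    }


def passes_assembly_criterion(depths, min_breadth=0.99, min_mean_depth=10.0):
    """Return True when coverage is nearly complete and mean depth exceeds 10x.

    min_breadth=0.99 is a hypothetical stand-in for "almost complete coverage".
    """
    summary = coverage_summary(depths)
    return (summary["breadth"] >= min_breadth
            and summary["mean_depth"] > min_mean_depth)


if __name__ == "__main__":
    well_covered = [20] * 1000              # uniform 20x depth
    half_covered = [20] * 500 + [0] * 500   # only half the genome covered
    print(passes_assembly_criterion(well_covered))
    print(passes_assembly_criterion(half_covered))
```

In practice the per-position depths would come from a read-alignment tool; the toy lists above only show how the two conditions combine.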
The 11 SARS-CoV-2 genomes were highly homologous and formed a single, clearly defined cluster. Evolutionary relationships within the cluster were not well resolved, and the cluster was clearly distinct from other coronavirus genomes. This suggests that the current variation in the SARS-CoV-2 genome is still intraspecies, with no evolution toward other coronaviruses. The four groups of family cluster cases each grouped together on the tree (fig. 2); only one SNP was found among the strains within group 1, and the other three groups showed no variation during transmission.

The 60 complete genomes were nearly identical across the whole genome, with sequence identity above 99.9%, indicating that the genome is stable during virus transmission. Notably, the sequence identity between virus genomes from family cluster cases was more than 99.99%. There were 179 nucleotide and 107 amino acid variations in the 60 genomes (fig. 2). The number of nucleotide variations per genome ranged from one to seven. There was no highly variable region, and the single nucleotide variations (SNVs) were distributed across different regions of the genome.

In the Spike receptor-binding domain (RBD) sequences, GD-Pangolin-CoV was the closest betacoronavirus to the three SARS-CoV-2 genomes, with identity and similarity as high as 97.22%. The bat coronavirus isolate (Bat-Cov-RaTG) was the next closest, with an overall similarity of 88.89% to the SARS-CoV-2 Spike-RBD sequences. The SARS-CoV Spike-RBD was considerably more distant, with a similarity of 72.63% (fig. 3). The structure of the SARS-CoV Spike trimer in complex with the receptor angiotensin-converting enzyme II (ACE2) was used as the modeling template because it was available and because the RBD sequences share relatively high identity and similarity [14-16] (fig. 3).
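The identity and SNV figures reported above can be related by simple arithmetic. The sketch below is an illustration only, not the authors' pipeline: the function names are hypothetical, the sequences are toy strings, and the genome length of 29 903 nt (the length of the Wuhan-Hu-1 reference genome) is an assumption that does not appear in the text.

```python
SKIPPED = {"-", "N"}  # alignment gaps and ambiguous bases are not compared


def snv_positions(ref, query):
    """Return 1-based alignment columns where two aligned sequences differ."""
    if len(ref) != len(query):
        raise ValueError("sequences must be aligned to the same length")
    positions = []
    for i, (a, b) in enumerate(zip(ref.upper(), query.upper()), start=1):
        if a in SKIPPED or b in SKIPPED:
            continue
        if a != b:
            positions.append(i)
    return positions


def percent_identity(ref, query):
    """Percent identity over the columns where both sequences have a base."""
    compared = sum(
        1 for a, b in zip(ref.upper(), query.upper())
        if a not in SKIPPED and b not in SKIPPED
    )
    if compared == 0:
        raise ValueError("no comparable columns")
    mismatches = len(snv_positions(ref, query))
    return 100.0 * (compared - mismatches) / compared


def identity_from_snv_count(n_snvs, genome_length):
    """Identity implied by a given number of SNVs in a genome of given length."""
    return 100.0 * (1 - n_snvs / genome_length)


if __name__ == "__main__":
    # Toy aligned sequences with one substitution at column 8.
    print(snv_positions("ATGCATGCAT", "ATGCATGAAT"))
    print(percent_identity("ATGCATGCAT", "ATGCATGAAT"))
    # Seven SNVs (the per-genome maximum reported) in an assumed 29 903 nt genome.
    print(round(identity_from_snv_count(7, 29903), 3))
```

Under these assumptions, even the most divergent genome (seven SNVs) stays above the 99.9% identity reported for the 60 genomes.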
Unlike in SARS-CoV, a salt bridge between Lys417 and Asp12 was found in the structural prediction model as a strong interaction between the SARS-CoV-2 Spike-RBD and the receptor ACE2 (fig. 4). Checking the primary protein sequences showed that Lys417 of the SARS-CoV-2 Spike-RBD is an amino acid replacement. In SARS-CoV, the corresponding position is occupied by the neutral amino acid valine, as shown in the sequence alignment and marked by a red rectangle (fig. 3A).

The Spike protein belongs to the class I viral fusion proteins, a group that includes the SARS Spike protein (S), HIV envelope glycoprotein (Env), influenza hemagglutinin (HA) and Ebolavirus glycoprotein (GP). To further elucidate host-virus interactions, we examined the ability of Spike to fuse with the host membrane by scanning for a possible S1/S2 cleavage site in the SARS-CoV-2 Spike protein. Compared with other betacoronavirus Spike sequences, SARS-CoV-2 contained a four-amino-acid PRRA insert that was absent from the others, including SARS-CoV, Bat-CoV and the most closely related GD-Pangolin-CoV (fig. 5). Together with the arginine immediately following it (R685), the PRRA insert creates a typical furin protease cleavage site, RRAR685, in SARS-CoV-2, at the presumed S1/S2 boundary.

Multiple sequence comparison and evolutionary analysis of the assembled sequences from the 60 specimens showed that all specimens had very few mutations and were highly consistent with the whole-genome sequence of SARS-CoV-2 from the early outbreak. These variations tended to be randomly dispersed, consistent with a lack of selective pressure. Analysis of the SNPs in the 60 specimens showed that the mutation frequency was low in high-depth regions (DP-30). A branch carrying the ORF8:L84S variation was commonly observed, suggesting that the sampled viruses were under little selective pressure and had a very slow mutation rate.
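The scan for an S1/S2 cleavage site described above can be sketched as a simple motif search. This is a simplified illustration and not the authors' method: it searches only for the minimal furin consensus R-X-X-R, which is a common approximation, and the function name, the fragment strings and the numbering of the first residue (T678) are assumptions made for this example.

```python
import re

# Minimal furin consensus R-X-X-R; the lookahead allows overlapping matches.
MINIMAL_FURIN_MOTIF = re.compile(r"(?=(R..R))")


def find_furin_like_sites(fragment, first_residue_number):
    """Return (motif, position of the final Arg) for each R-X-X-R match.

    fragment: amino acid sequence in one-letter code.
    first_residue_number: residue number of fragment[0] in the full protein.
    """
    sites = []
    for match in MINIMAL_FURIN_MOTIF.finditer(fragment.upper()):
        motif = match.group(1)
        last_arg_position = first_residue_number + match.start() + 3
        sites.append((motif, last_arg_position))
    return sites


if __name__ == "__main__":
    # Assumed SARS-CoV-2 Spike fragment around the PRRA insert, starting at T678.
    print(find_furin_like_sites("TNSPRRARSVAS", 678))
    # SARS-CoV SLLR667 from the text (S664 to R667) has no R-X-X-R motif.
    print(find_furin_like_sites("SLLR", 664))
```

With these inputs the search reports RRAR ending at position 685 for the SARS-CoV-2 fragment and nothing for SLLR667, in line with the contrast drawn in the text. Dedicated predictors weigh more sequence context than this pattern does, so a match here is only a candidate site.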
However, some strains with no apparent epidemiological association also shared 100% sequence identity, suggesting that some epidemiological associations may not be evident from the spatial and temporal distribution of cases.

By establishing a molecular model of the interaction between Spike and the human receptor, we found an amino acid salt bridge in SARS-CoV-2 that is absent in SARS-CoV. Furthermore, a potential furin cleavage site located at the S1/S2 boundary of the Spike protein could greatly enhance viral invasion and pathogenicity. The receptor-binding analysis may help identify important factors that make infection and invasion by SARS-CoV-2 much stronger than those of SARS-CoV in 2003. Beyond increasing invasion, this potential furin cleavage site in SARS-CoV-2 might also reflect a virus packaging mechanism that differs from that of SARS-CoV [17]. This is not the case for SARS-CoV, although invasion is still required. The SLLR667 site on the SARS-CoV spike glycoprotein enhances cell-cell fusion but does not affect virion entry, and it has not been considered a typical furin cleavage site [18-20]. In contrast, SARS-CoV-2 might use a packaging pathway similar to that of mouse hepatitis virus (MHV), human immunodeficiency virus (HIV) or Ebolavirus, since most other betacoronaviruses do not display a typical furin cleavage site at the S1/S2 boundary of their Spike proteins. With a different packaging mechanism,

In summary, we selected 60 SARS-CoV-2 whole-genome sequences from the early phase of the COVID-19 pandemic in Guangdong and systematically analyzed the characteristics of SARS-CoV-2 genomic evolution, amino acid variation and Spike protein structure.
This study provides a reference for developing prevention and control strategies and for tracing the source of new outbreaks.

[Figure residue: amino acid variation panel. Labeled variants: ORF8:L84S, N:S194L, S:H49Y, ORF8:G50*, ORF3a:G251V, ORF1a:T1542I (orf1ab).]

Genetic Recombination, and Pathogenesis of Coronaviruses
A novel coronavirus associated with severe acute respiratory syndrome
Newly discovered coronavirus as the primary cause of severe acute respiratory syndrome
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
Summary of probable SARS cases with onset of illness from 1
Middle East respiratory syndrome coronavirus (MERS-CoV)
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing
disease and diplomacy: GISAID's innovative contribution to global health
Molecular Evolutionary Genetics Analysis across Computing Platforms
Multiple sequence alignment with hierarchical clustering
SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information
Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2
A Novel Coronavirus from Patients with Pneumonia in China
Cryo-EM structure of the 2019-nCoV spike in the prefusion conformation
High-resolution global peptide-protein docking using fragments-based PIPER-FlexPepDock
Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2
Gene of the month: the 2019-nCoV/SARS-CoV-2 novel coronavirus spike protein
Furin cleavage of the SARS coronavirus spike glycoprotein enhances cell-cell fusion but does not affect virion entry
Activation of the SARS coronavirus spike protein via sequential proteolytic cleavage at two distinct sites

The authors declare that there is no conflict of interest with any financial organization or corporation or individual that can inappropriately influence this work.