key: cord-0856273-w70holyy
authors: Moradi, Jale; Moghoofei, Mohsen; Alvandi, Amir Houshang; Abiri, Ramin
title: Variation analysis of SARS-CoV-2 complete sequences from Iran
date: 2021-01-24
journal: bioRxiv
DOI: 10.1101/2021.01.23.427885
sha: 4d9eec586a3777d121780b2105162c559c9c7883
doc_id: 856273
cord_uid: w70holyy

SARS-CoV-2 is a newly emerged coronavirus that was first reported in China in late December 2019 and rapidly spread worldwide. As of 8 January, 1,261,903 total cases and 55,830 deaths had been reported from Iran. In this study, we investigated all the complete SARS-CoV-2 sequences publicly reported from Iran. Twenty-four sequences collected between March and September 2020 were analyzed to identify genome variations and phylogenetic relationships. Furthermore, we assessed the amino acid changes in the spike glycoprotein, an important viral factor associated with entry into host cells and a vaccine target. Most of the variations occurred in the ORF1ab, S, N, intergenic and ORF7 regions. The analysis of spike protein mutations demonstrated that the D614G mutation could be detected from May onward. Phylogenetic analysis showed that most of the viruses circulating in Iran belong to the B.4 lineage, although we found a limited number of variants belonging to the B.1 lineage and carrying the D614G mutation. Furthermore, we detected a variant characterized as the B.1.36 lineage with sixteen mutations in the spike protein region. This study shows the frequency of the viral populations in Iran as of September; there is therefore an urgent need for genomic surveillance to track viral lineage shifts in the country beyond September. These data would help to predict the future situation and to apply better strategies to control the pandemic.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a newly emerged single-stranded RNA virus that was first reported in Wuhan, Hubei Province, China, in late December 2019 (1). The disease spread rapidly to other countries. According to the WHO, 85,929,428 confirmed cases and 1,876,100 confirmed deaths had been reported by 8 January 2021. (3, 4). Some viral proteins are targeted by the immune system, so variations in these parts of the genome could affect virulence and transmissibility (5). Furthermore, mutations can compromise vaccine efficacy and the validity of diagnostic tests through changes in the targeted proteins and probe-binding sites (6, 7). At present, there are 290,997 complete SARS-CoV-2 genome sequences in GISAID, and the number of submitted sequences is increasing. The International Committee on Taxonomy of Viruses (ICTV) has no classification system for SARS-CoV-2 variants, although many scientists are trying to characterize the genetic diversity of the variants. Early in the pandemic, two clades named S and L were introduced, and further clades later evolved from L. To date, six clades circulating in the world have been described based on the sequences submitted to GISAID (4). In another nomenclature, eighty-one SARS-CoV-2 lineages were identified, all belonging to lineages A and B (8). A large number of SARS-CoV-2 genome sequences have been deposited in the GISAID public database, which makes it possible to characterize the geographic evolution pattern of the virus by phylogenetic analysis (9).

In total, 25 complete SARS-CoV-2 sequences related to Iran had been released by November 25, 2020. One of the retrieved sequences was removed because of its high N content (more than 5%). The remaining 24 sequences were aligned to the SARS-CoV-2 reference genome (NC_045512.2) using MAFFT (v7.455) (10). We applied SNP-sites to extract the genome variations from the multiple sequence alignment (MSA) file (11).
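The N-content filter and the extraction of variable sites described above can be sketched in a few lines of Python. This is a minimal illustration on toy sequences, not the SNP-sites implementation. The function names (`n_fraction`, `filter_by_n_content`, `variable_sites`) and the toy data are hypothetical; only the 5% threshold comes from the text. The sketch assumes the sequences are already aligned to the reference (for example by MAFFT), so that all strings have equal length.

```python
from typing import Dict, List, Tuple

UNAMBIGUOUS = set("ACGT")


def n_fraction(seq: str) -> float:
    """Return the fraction of 'N' characters in a sequence."""
    return seq.upper().count("N") / len(seq) if seq else 0.0


def filter_by_n_content(records: Dict[str, str], max_fraction: float = 0.05) -> Dict[str, str]:
    """Keep only sequences whose N content does not exceed max_fraction."""
    return {name: seq for name, seq in records.items() if n_fraction(seq) <= max_fraction}


def variable_sites(reference: str, aligned: Dict[str, str]) -> List[Tuple[int, str, Dict[str, str]]]:
    """List alignment columns where at least one sample differs from the reference.

    Only unambiguous bases (A, C, G, T) are compared; gaps and N are skipped.
    Positions are 1-based alignment columns.
    """
    sites = []
    for i, ref_base in enumerate(reference.upper()):
        if ref_base not in UNAMBIGUOUS:
            continue
        alts = {
            name: seq[i].upper()
            for name, seq in aligned.items()
            if seq[i].upper() in UNAMBIGUOUS and seq[i].upper() != ref_base
        }
        if alts:
            sites.append((i + 1, ref_base, alts))
    return sites


# Toy example (hypothetical data, 20 alignment columns).
reference = "ATGCATGCATATGCATGCAT"
samples = {
    "sample_1": "ATGCATGCATATGCATGCAT",  # identical to the reference
    "sample_2": "ATGTATGCATATGCATGCAA",  # two substitutions
    "sample_3": "NNGCATGCATATGCATGCAT",  # 10% N, removed by the filter
}
kept = filter_by_n_content(samples)
snps = variable_sites(reference, kept)
```

Here `kept` retains `sample_1` and `sample_2`, and `snps` lists the two columns where `sample_2` differs from the reference. On real data, the same filter would be applied to the downloaded FASTA records before alignment, and a dedicated tool would handle the variant extraction.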
Each variation was then mapped to the SARS-CoV-2 ORFs (2). Furthermore, we analyzed the amino acid changes in the spike protein. For phylogenetic analysis, the MSA file was visualized in the UGENE software for quality checking and trimming (12). The trimmed file was used to construct a maximum likelihood phylogenetic tree with RAxML-NG v0.9.0 (1000 bootstrap replicates) (13). The constructed tree was visualized using FigTree v1.4.4.

In total, 24 complete sequences were analyzed, of which 11 were reported from Tehran (capital of Iran), 1 from Semnan (east of Iran), 1 from Qom (center of Iran), 1 from Zahedan (south of Iran) and 9 from unknown sources (Supplementary Table 1). In total, 275 variation sites were detected, of which 191 were found in ORF1ab and 32, 13 and 10 were in the S, N and intergenic regions, respectively (Supplementary Table 2). Each of the other genome segments had fewer than 10 variation sites. Of the 419 variations observed in the 24 sequences, most were related to 21 variation sites (Table 1).

Seventeen sequences had amino acid changes in the spike glycoprotein (Table 2). These sequences were sampled throughout the pandemic. We analyzed the D614G mutation with or without the co-occurring mutations A475V, L452R, V483A, and F490L, which are associated with increased viral infectivity and transmissibility. We found that the sequences sampled from May onward carried the D614G mutation without any of these co-existing mutations. Furthermore, the sequence with 203 variation sites described earlier had 16 different spike mutations, including D614G.

Phylogenetic analysis
GISAID has introduced six phylogenetic groups of SARS-CoV-2 genomes: the S and L groups at the start of the pandemic, followed by the V and G groups, which evolved from group L. Groups G, GH and GR are the most recent derivatives.
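The spike screening reported in the results above (D614G with or without A475V, L452R, V483A, or F490L) amounts to a simple set check per sequence. The following is a minimal sketch of that check, not the authors' actual procedure. The function name `classify_spike_profile` and the isolate mutation lists are hypothetical; only the mutation names come from the text.

```python
from typing import Iterable

# Co-occurring substitutions named in the text.
CO_MUTATIONS = frozenset({"A475V", "L452R", "V483A", "F490L"})


def classify_spike_profile(mutations: Iterable[str]) -> str:
    """Classify a sequence by D614G status and the listed co-mutations."""
    found = set(mutations)
    if "D614G" not in found:
        return "no D614G"
    partners = sorted(found & CO_MUTATIONS)
    if partners:
        return "D614G with " + ", ".join(partners)
    return "D614G without listed co-mutations"


# Hypothetical spike mutation lists for three isolates.
profiles = {
    "isolate_a": ["D614G"],
    "isolate_b": ["D614G", "L452R"],
    "isolate_c": ["A475V"],
}
labels = {name: classify_spike_profile(muts) for name, muts in profiles.items()}
```

A sequence labelled "D614G without listed co-mutations" may still carry other spike substitutions; the check only covers the four co-mutations named in the text.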
In another nomenclature system, SARS-CoV-2 is classified into 81 viral lineages within groups A and B (8). In total, six lineages, named A.1 to A.6, are grouped in lineage A, among which lineage A.1 consists of two sub-lineages. Lineage B includes 16 lineages, of which B.1 is the most prevalent worldwide and comprises more than seventy sub-lineages. In the current study, we analyzed the frequency of the detected lineages (Fig. 4). B.4 was the predominant lineage, with 19 sequences classified in this lineage (Blue items,

The genome of SARS-CoV-2 includes seven major ORFs and 23 unannotated ORFs (2, 14). ORF1ab consists of two overlapping ORFs (ORF1a and ORF1b) that occupy two-thirds of the viral genome and encode a polyprotein that is cleaved into 16 nonstructural proteins. According to our results and many other reports, most of the variation sites arise in ORF1ab. The other major ORFs encode the four canonical structural proteins: spike (S), membrane (M), envelope (E) and nucleocapsid (N). Since the amino acid sequence of an ideal immunogen (in this case, the S protein) must be conserved, there is an urgent need to characterize the mutation rate in this part of the genome in all geographic regions (15, 16). Our results demonstrated that approximately 10% of the variation sites belong to the S protein, although these result in 34 amino acid changes. The nucleocapsid and intergenic regions are other mutation-prone sequences. In other studies, the spike and N proteins have the highest mutation rates after ORF1ab, which is consistent with our results (4). However, mutations in the intergenic sites were much more frequent in our data than in comparable studies. The number of variations was consistent among the isolates, although one sample had a high mutation rate. This strain was isolated in May 2020 and showed 203 nucleotide variations.
This variant is assigned to lineage B.1.36, and previous studies showed that this lineage can contain various spike mutations that result in reduced reactivity to the antibodies generated by vaccine candidates (17). We found that this isolate has 16 different mutations in the spike glycoprotein, including D614G. Spike mutations were seen in isolates from different sampling times, although D614G was found only in specimens collected from May onward. In a systematic review, 80 variants and 26 glycosylation mutants of the spike protein were surveyed in terms of infectivity and antibody reactivity. The results showed that most of the variants were susceptible to neutralizing antibodies, although D614G in combination with some variants, including A475V, L452R, V483A, and F490L, was not recognized by neutralizing antibodies and was more infectious (18) (19) (20). The spike D614G mutation was detected in 26% of the studied sequences in March 2020, and its frequency had increased to 74% by June 2020 (21). As mentioned earlier, we detected the D614G mutation only without the other infectivity-related mutations. Overall, our D614G screening indicated that this mutation appeared in Iran in May, when the frequency of this variant had already increased globally. Plante et al. demonstrated that this mutant virus replicates more efficiently in human lung tissue. Their report also showed that the D614G substitution changed S1/S2 cleavage and the shedding of the spike protein (22). Previous structural analyses of the S protein revealed that the receptor-binding domains (RBDs) of the D614G variant are changed and that its ability to bind the receptor is improved (23, 24). Another study showed that the D614G variant exhibits greater competitive fitness and more efficient infection in primary human lung epithelial cells.
In addition, transmission of this variant is significantly faster than that of the wild-type virus (25). The phylogenetic analysis demonstrated that most of the circulating viruses in Iran belong to lineage B.4, which is consistent with other reports (8, 26). B.1 is the most prevalent lineage in the world, although some reports from other countries show that other lineages are the more frequently circulating variants (27). Furthermore, like all other reports, we found that all the D614G mutations occur in sequences belonging to B.1 lineages (28).

In summary, in the current report we studied the genomic variations and phylogeny of the SARS-CoV-2 sequences from Iran. Our results showed that most of the variations are related to the ORF1ab that are

1. A Novel Coronavirus from Patients with Pneumonia in China
2. The coding capacity of SARS-CoV-2
3. Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders
4. Variant analysis of SARS-CoV-2 genomes. Bull World Health Organ
5. SARS-CoV-2 genomic variations associated with mortality rate of COVID-19
6. Analytical sensitivity and efficiency comparisons of SARS-CoV-2 RT-qPCR primer-probe sets
7. Effects of a major deletion in the SARS-CoV-2 genome on the severity of infection and the inflammatory response: an observational cohort study
8. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
9. Geographic and Genomic Distribution of SARS-CoV-2 Mutations
10. MAFFT-DASH: integrated protein sequence and structural alignment
11. SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments. Microb Genomics
12. Unipro UGENE: a unified bioinformatics toolkit
13. RAxML-NG: a fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference
14. Characteristics of SARS-CoV-2 and COVID-19
15. Characterization of spike glycoprotein of SARS-CoV-2 on virus entry and its immune cross-reactivity with SARS-CoV
16. Structural and functional properties of SARS-CoV-2 spike protein: potential antivirus drug development for COVID-19
17. Making Sense of Mutation: What D614G Means for the COVID-19 Pandemic Remains Unclear
18. Structural basis of receptor recognition by SARS-CoV-2
19. Structure, Function, and Antigenicity of the SARS-CoV-2 Spike Glycoprotein
20. The Impact of Mutations in SARS-CoV-2 Spike on Viral Infectivity and Antigenicity
21. Tracking Changes in SARS-CoV-2 Spike: Evidence that D614G Increases Infectivity of the COVID-19 Virus
22. Spike mutation D614G alters SARS-CoV-2 fitness
23. Structural and Functional Analysis of the D614G SARS-CoV-2 Spike Protein Variant
24. The SARS-CoV-2 Spike Variant D614G Favors an Open Conformational State. bioRxiv: the preprint server for biology
25. SARS-CoV-2 D614G variant exhibits efficient replication ex vivo and transmission in vivo
26. An emergent clade of SARS-CoV-2 linked to returned travellers from Iran
27. SARS-CoV-2 lineage B.6 was the major contributor to early pandemic transmission in Malaysia
28. Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on Transmissibility and Pathogenicity