key: cord-0832630-l4hueltv authors: Hasan, Md. Mahbub; Das, Rasel; Rasheduzzaman, Md.; Hussain, Md. Hamed; Muzahid, Nazmul Hasan; Salauddin, Asma; Rumi, Meheadi Hasan; Mahbubur Rashid, S.M.; AMAM, Zonaed Siddiki; Mannan, Adnan title: Global and Local Mutations in Bangladeshi SARS-CoV-2 Genomes date: 2021-03-15 journal: Virus Res DOI: 10.1016/j.virusres.2021.198390 sha: 042b80c3b7d1df0105cdebae55995d02df1156de doc_id: 832630 cord_uid: l4hueltv Coronavirus Disease 2019 (COVID-19) warrants comprehensive investigations of publicly available Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) genomes to gain new insight about their epidemiology, mutations, and pathogenesis. Nearly 0.4 million mutations are identified so far among the ∼60,000 SARS-CoV-2 genomic sequences. In this study, we compared a total of 371 SARS-CoV-2 published whole genomes reported from different parts of Bangladesh with 467 sequences reported globally to understand the origin of viruses, possible patterns of mutations, and availability of unique mutations. Phylogenetic analyses indicated that SARS-CoV-2 viruses might have transmitted through infected travelers from European countries, and the GR clade was found as predominant in Bangladesh. Our analyses revealed 4604 mutations at the RNA level including 2862 missense mutations, 1192 synonymous mutations, 25 insertions and deletions and 525 other types of mutation. In line with the global trend, D614 G mutation in spike glycoprotein was predominantly high (98%) in Bangladeshi isolates. Interestingly, we found the average number of mutations in ORF1ab, S, ORF3a, M, and N were significantly higher (p < 0.001) for sequences containing the G614 variant compared to those having D614. Previously reported frequent mutations, such as R203K, D614G, G204R, P4715L and I300F at protein levels were also prevalent in Bangladeshi isolates. Additionally, 34 unique amino acid changes were revealed and categorized as originating from different cities. These analyses may increase our understanding of variations in SARS-CoV-2 virus genomes, circulating in Bangladesh and elsewhere. These analyses may increase our understanding of variations in SARS-CoV-2 virus genomes, circulating in Bangladesh and elsewhere. Keywords: SARS-CoV-2; Whole-genome sequence; COVID-19; Mutation; Bangladesh Severe Acute Respiratory Syndrome-Coronavirus-2 (SARS-CoV-2) has been identified as the etiological agent of the disease called Coronavirus Disease-2019 . To elucidate the viral pathogenesis, modern genomic tools are highly crucial and have been employed by researchers around the world. An increasing number of whole-genome datasets of the SARS-CoV-2 virus are now being submitted in publicly accessible databases from different parts of the globe every day. It is high time to analyze the variations among those sequences which will help future strategic efforts for its preventive measures, such as vaccine design and development of potential novel therapeutics. SARS-CoV-2 consists of positive-sense singlestranded RNA with a genome size ranging from ~29.8 to 29.9 Kb (Khailany et al., 2020) . It contains a variable number (9-11) of open-reading frames (ORFs) and the first ORF covers almost two-thirds of the whole genome (Khailany et al., 2020; Kim et al., 2020) . The genome encodes 4 structural, 16 non-structural (NS), and 6 accessory proteins (Astuti, 2020; Chen et al., 2020; O'Meara et al., 2020; Yoshimoto, 2020) . According to a recent study on 48,635 SARS-CoV-2 genome sequences, a total of 353,341 mutations have been observed globally compared with that of the reference genome from Wuhan (Accession ID MN908947). Of them, sequences from India, Congo, Bangladesh, and Kazakhstan were reported to show significantly high numbers of mutations per sample J o u r n a l P r e -p r o o f compared with the global average (Mercatelli and Giorgi, 2020) . Out of these mutations, D614G mutation (causing aspartate to glycine at 614 in S protein ) is reported to be the most prevalent mutations in Europe, Oceania, South America, Africa, Middle East, and India (Gong et al., 2020; Raghav et al., 2020; Zhang et al., 2020) . Several laboratory experiment-based investigations have reported that the level of angiotensin-converting enzyme 2 (ACE2) expression was distinctly higher by the retroviruses pseudo-typed with G614 compared with that of D614 in cell culture experiments (Dumonteil and Herrera, 2020; Zhang et al., 2020) . Another reported mutation of ORF1ab is P4715L linked with D614G, which is linked with higher fatality rates in 28 countries and 17 states of the United States (Mercatelli and Giorgi, 2020; Toyoshima et al., 2020) . Three other mutations namely C14408T (ORF1b), C241T (5' UTR), C3037T (ORF1a) are reported to be common and coexisting in the same genome, while G11083T has been found mostly in Asian countries (Toyoshima et al., 2020) . In Bangladesh, until November, 2020, nearly 465,000 people have been infected and 6644 people have died due to COVID-19 (https://iedcr.gov.bd/). During this period, a total of 371 complete and high coverage SARS-CoV-2 whole-genome sequences were submitted in the GISAID database (https://www.gisaid.org). Analyses of these sequences are sparsely reported in the literature. As for instance, one study reported analyses of 60 SARS-CoV-2 genome sequences from Bangladesh and compared them with 6 Southeast Asian countries (Islam et al., 2020) . Another group analyzed only 14 isolates and found 42 mutations (Parvez et al., 2020) . A separate study identified only 9 variants where unique mutations (UMs) were found prevalent mostly in ORF1ab . Another study involved 64 SARS-CoV-2 whole-genome sequences and identified the presence of 180 mutations in the coding regions of the viruses, and mutations at nsp2 were the most prevalent (Ahmed et al., 2020) . Due to the small numbers of J o u r n a l P r e -p r o o f genome sequences analyzed, most of these findings were not conclusive and truly representative of the whole country. Further comprehensive analyses are therefore necessary to better understand the circulating virus in this highly populated country. The present study was based on SARS-CoV-2 genome sequences submitted in the global publicly accessible GISAID database (GISAID, 2020) . According to the GISAID database, clades are characterized by genomic specificity and classified by calculating genome distances of the variants through phylogenetic clusters analysis. Currently, all of the SARS-CoV-2 variants are classified into 7 clades-S, L, V, G, GH, GR, and GV. We used this clade-based classification in this study. We compared a total of 371 whole-genome sequences obtained from isolates collected from Bangladeshi COVID-19 patients with that of 467 global sequences that are from Asia, Africa, Europe, Australia and North American countries using time-resolved phylogenetic analysis. We studied the frequency of mutations in different ORFs of SARS-CoV-2 genomes, having the mutation of D614G. In addition, we explored the origin and distribution of SARS-CoV-2 in Bangladesh along with the circulating variants present in different parts of the country. This was achieved through identifying the pattern of mutations including the unique mutations (UMs). These pave the way to increase our understanding of the distribution of different variants of SARS-CoV-2 virus in different regions and associated mutation events. containing "N" and other ambiguous IUPAC codes (Korber et al., 2020a) . A total of 8723 complete genomic sequences were returned upon running the script as per the aforementioned logic. To select representative sequences from curated 8723 global sequences and make a comparison against sequences from Bangladesh, priorities were given to those countries that had a higher number of infections in each continent (https://www.worldometers.info/coronavirus/). We selected these sequences considering that each continent must be represented by at least one sequence from each GISAID clade. The number of sequences selected from a country was based on the total number of unique sequences retrieved. This resulted in a total of 839 unique representative sequences from 42 countries (see Supplementary File Table S1 ). From Bangladesh, 518 whole-genome RNA sequences of SARS-CoV-2 were uploaded to GISAID until November 30, 2020. Only high coverage complete sequences (n=371) were considered for analysis in this study. All of these 371 sequences were retrieved and aligned with the previously selected 467 representative sequences along with that of Wuhan-1 (Accession ID MN908947) as a reference sequence (Wu et al., 2020) . A list of accession numbers of all sequences, used in this study, is provided in Supplementary File Table S2 . To ensure comparability, the flanks of all the sequences were truncated to the consensus range from 56 to 29,797 (Forster et al., 2020) , with nucleotide position numbering according to the Wuhan-1 reference sequence, prior to alignment. Multiple Sequence Alignment (MSA) and phylogenetic tree construction were carried out using Molecular Evolutionary Genetic Analysis X (MEGA X) software (version 10.1) (Kumar et al., 2018) . The selected sequences were aligned using the J o u r n a l P r e -p r o o f MUSCLE software tool (Edgar, 2004) . Later, a NJ (Neighbor-Joining (NJ)) phylogenetic tree was constructed using the Tamura-Nei model (Saitou and Nei, 1987) . Tree topology was assessed using a fast bootstrapping function with 1000 replicates. Tree visualization and annotations were performed in the Interactive Tree of Life (iTOL) v5 (Letunic and Bork, 2019). The Genome Detective Coronavirus Typing Tool Version 1.13 was used for variant analyses at the RNA level of SARS-CoV-2 which is specially designed for this virus analysis (https://www.genomedetective.com/app/typingtool/cov/) (Cleemput et al., 2020) . For investigating the UM at protein level among 371 genomic sequences from Bangladesh, we used a CoV server hosted by the GISAID server (https://www.gisaid.org/epifluapplications/covsurver-mutations-app/). The server analyzed our dataset against all available genomic sequences of SARS-CoV-2 including the Wuhan reference sequence deposited on GISAID until November 30, 2020. The spatial map was created using layers downloaded from GeoDASH (The Bangladesh Geospatial Data Sharing Platform) website on ArcGIS Desktop (Esri Inc.,109 Redlands, California, United States) licensed to King's College London. Descriptive and inferential statistics were used to analyze different mutations and their correlation with different categorical variables. For correlation, we used one-way ANOVA To understand the SARS-CoV-2 viral transmission in Bangladesh, phylogenetic analysis was performed based on the selected 371 complete viral genomes from different regions of Bangladesh along with selected 467 globally submitted sequences from 42 countries of 6 continents as shown in Figure 1 . This represents an overall clade distribution of all global sequences along with sequences from Bangladeshi isolates. Analysis of Figure 1 depicts the GR clade, which is found predominant in Bangladesh. About 86% of the sequences are grouped to this clade followed by G and GH with ~6%, respectively. Similar clade distribution has been reported in SARS-CoV-2 isolates, originating from European countries (Hamed et al., 2020) . We also compared the sequence data among different regions of Bangladesh, looking at the districtwise distribution of clades as shown in Figure 2 . As can be seen in Figure 2 , the sequences from all districts largely clustered with that of GR clade except sequences from Chittagong. We also observed that the major clade distribution of SARS-CoV-2 isolates from Chittagong is consists of both GR and GH (Figure 2 ). The GH clade was predominantly found in Saudi Arabia J o u r n a l P r e -p r o o f (Mercatelli and Giorgi, 2020) , which is clustered together with Chittagong as shown in Figure 1 . Based on this evidence, it is highly likely that the introduction of GH clade in Bangladesh could be of Middle-Eastern origin. Our analyses revealed a total of 4604 mutations observed among 371 Bangladeshi SARS-CoV-2 genomic sequences that include 2862 missense, 1192 synonymous, 25 insertion/deletions and 525 other mutations (Ambiguous mutation and nucleotide mutation in UTR region) (See Table 1 ). Among the identified mutations, 28881G>A & 28882G>A (R203K; N protein), 23403A>G (D614G; S glycoprotein), 28883G>C (G204R; N protein), 14408C>T (P4715L; nsp12), and 1163A>T (I300F; nsp2) were the most frequently occurring common mutations found in Bangladesh with a frequency of 650, 365, 325, 304 and 243, respectively (Figure 3 ). Nucleotide mutations of 28881G>A and 28882G>A resulted in R203K due to codon degeneracy. Notably, no particular mutations occurred at any specific time rather they have been observed over the whole period of disease incidence. However, this explanation is predicted and needs to be investigated in detail in the future. The RNA-dependent RNA polymerase (i.e. nsp12) is essential for replication and transcription of the viral RNA genome. The observed mutation P4715L in nsp12 was also found in most of the US states (28 out of 31 states from where the sequences were deposited). The same mutation was prevalent in European countries like Spain, France etc. This alteration could affect the pathogenesis triggered by antibody escape variants with epitope loss (Banerjee et al., 2020; Gupta and Mandal, 2020) . In a separate study, it has been reported J o u r n a l P r e -p r o o f that the G614 type might have originated either in Europe or China (Korber et al., 2020b) . They also reported that the original Wuhan D614 form was also predominant in Asian samples. Meanwhile, the G614 form had clearly been established and started expanding in countries outside of China. We noticed that 98% of genomes from Bangladesh have D614G mutation, which is also dominant in the world. However, the average number of mutations per ORF is varied among genomes containing D614 and G614 that we have studied (n=838) as revealed by Table 2 . The average number of mutations per genome between D614 and G614 is 6.12 and 10.68, respectively. The average number of mutations in ORF1ab, S, ORF3a, M and N of genomes having mutated G614 (n=626) are significantly higher (p≤0.001) than those having the original D614 (n=212) in S glycoprotein (See Table 2 and Figure 1) . Interestingly, the average mutation number is declined in ORF8, having G614 mutation (p<0.001). This correlation indicates that the genomes containing D614G mutation also bear more mutations which is aligned with the various reports on the link between the transmission and pathogenesis of SARS-CoV-2 (Grubaugh et al., 2020; Korber et al., 2020a; Zhang et al., 2020) , and this mutation increases cell entry and transduction due to resistance to proteolytic cleavage (Daniloski et al., 2021; Ozono et al., 2021) . However, without further experimental evidence this is not conclusive. Meanwhile, R203K and G204R mutations in N protein were previously reported in Indian, Spanish, Italian, and French samples (Koyama et al., 2020; Maitra et al., 2020) . These mutations are located in the site of the SR-rich region which has been reported to be intrinsically disordered (Chang et al., 2014) . This region further incorporates a few phosphorylation sites (Surjit et al., 2005) , including the GSK3 (Glycogen synthase kinase 3) phosphorylation at Ser202 and a CDK (Cyclin-dependent kinase) phosphorylation site at Ser206 which are located close to the position of this mutation. The and sequence motifs J o u r n a l P r e -p r o o f are dependent on GSK3 and CDK phosphorylation motifs, respectively. Other variations 28881G>A and 28882G>A together convert polar to non-polar amino acid (R203K) and 28883G>C variation converts nonpolar to polar amino acid (G204R) (See Figure 3) . We observed 34 unique shifts from the different proteins of SARS-CoV-2 isolated from Bangladesh (Table 3) . Surprisingly, most of these UMs were found to be in patients in a specific area with a few exceptions. For example, some UMs as shown in Table 3 are only found in Barisal, Chandpur, Chittagong, Dhaka, Jessore, Moulvibazar, Mymensingh, and Rangpur districts. Interestingly, some UMs are concurrently present in multiple cities. These changes in amino acids might have occurred due to rapid mutation events and/or recombination with existing CoVs in the human body (Hassan et al., 2020) . The circulation of a high number of these UMs in different cities indicates the possible emergence of community transmission in the Bangladeshi population (Samyuktha and Kumar, 2020) . We observed 2 UMs in nsp1 (See Table 3 ), but the location of these amino acids was not in the KH domain (K164 and H165 of nsp1), which binds with the 40s subunit of ribosome (Thoms et al., 2020) . However, nsp1 acts as a primary virulence factor in SARS-CoV-2 infection, and mutation in this protein could affect the structure and functional properties, thereby altering its virulence properties. Similar to nsp1, 2 UMs were seen in nsp2, but their effects on host cells were merely reported in the literature. Since nsp2 interacts with the host proteins and disrupts the host cell survival signaling pathway (Bianchi et al., 2020) , any mutation in nsp2 may play a crucial role in SARS-CoV infections. In general, mutations of nsp3 are J o u r n a l P r e -p r o o f responsible for affecting the virus assembly and hence their replication. One would assume that this is due to the disruption of the replicase polyprotein processing into nsps. These nsps assemble with cellular membranes and facilitate virus replication. Among 34 amino acid mutations, 9 mutations were located in nsp-3. The UMs found in nsp3 might have some significant impacts in the viral pathogenesis. Firstly, the UMs (A889V and V843F) were found in the main domain of nsp3 that is important for processing endopeptidases from coronaviruses. Secondly, some UMs (e.g., G1691C, A602S and L373M) are found in the topological (cytoplasmic) domain. Thirdly, we observed one UM (L373M) in the ADP-ribose-1′phosphatase (ADRP) or (Macro) domain of nsp3. It has been shown that mutation of the ADRP domains does not diminish virus replication in mice, but reduces the production of the cytokine IL-6, which is an important proinflammatory molecule (Mielech et al., 2015) . However, we could not find any mutation in the active sites and zinc finger motif, attributing normal catalytic activity of nsp3. Overall, our opinion is that the PL-PRO domain is important for the development of antiviral drugs and the actual role of this enzyme is to cleave of several nsps from the polyprotein pp1a and also the biogenesis of the SARS-CoV replicase complex is yet to be explored. It can be postulated that the proteins like nsp3, nsp4 and nsp6 through their transmembrane domains, are involved in the replicative and transcription complex (Coppée et al., 2020) . In this study, we observed only two UMs in nsp4 and nsp6, respectively. Meanwhile, nsp5 encodes 3C-like proteinase that cleaves the C-terminus of the replicase polyprotein at 11 sites (Lokhande et al., 2020; Roe et al., 2021) . One UM that we found in nsp5 did not fall in its active sites (3304 and 3408 in ORF1ab) warrants further investigation with a large number of sequence datasets. The present global outbreak of COVID-19, caused by SARS-CoV-2, has already taken more than 2.5 million lives. Bangladeshi citizens are highly vulnerable to COVID-19 as evident by a number of circulating variants in different regions of this country. To combat this deadly disease, we need a better understanding of the pathobiology of this deadly virus. Hence, it is essential to minimize the translational gap between viral genomic information and its clinical consequences for developing effective therapeutic strategies. In this study, we have attempted to explore genomic variations of Bangladeshi SARS-CoV-2 viral isolates while comparing them with a large cohort of global isolates. Our analyses will facilitate the understanding of the origin, mutation patterns, and their possible effect on viral pathogenicity. We have addressed the importance of the variations in the viral genomes and their necessity for therapeutic interventions. The unique insights from this study will undoubtedly be supportive for a better understanding of the molecular mechanism of SARS-CoV-2 which would help to understand pathophysiology of the virus in highly populated countries like Bangladesh. In silico comparative genomics of SARS-CoV-2 to determine the source and diversity of the pathogen in Bangladesh Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response Mutational spectra of SARS-CoV-2 orf1ab polyprotein and signature mutations in the United States of America Mutational screening of the proteome of Sars-Cov-2 isolates: mutability of ORF3a, Nucleocapsid and Nsp2 proteins The SARS coronavirus nucleocapsid protein-forms and functions Emerging coronaviruses: genome structure, replication, and pathogenesis Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes SARS-CoV-2: virus mutations in specific European populations The Spike D614G mutation increases SARS-CoV-2 infection of multiple human cell types Polymorphism and selection pressure of SARS-CoV-2 vaccine and diagnostic antigens: implications for immune evasion and serologic diagnostic performance MUSCLE: multiple sequence alignment with high accuracy and high throughput Phylogenetic network analysis of SARS-CoV-2 genomes Clade and lineage nomenclature aids in genomic epidemiology studies of active hCoV-19 viruses SARS-CoV-2 genomic surveillance in Taiwan revealed novel ORF8-deletion mutant and clade possibly associated with infections in Middle East Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear Non-synonymous Mutations of SARS-Cov-2 Leads Epitope Loss and Segregates its Varaints Global dynamics of SARS-CoV-2 clades and their relation to COVID-19 epidemiology Molecular conservation and Differential mutation on ORF3a gene in Indian SARS-CoV2 genomes Emergence of European and North American mutant variants of SARS-CoV-2 in South-East Asia Genomic characterization of a novel SARS-CoV-2 The architecture of SARS-CoV-2 transcriptome Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus Variant analysis of SARS-CoV-2 genomes MEGA X: molecular evolutionary genetics analysis across computing platforms Interactive Tree Of Life (iTOL) v4: recent updates and new developments Molecular docking and simulation studies on SARS-CoV-2 Mpro reveals Mitoxantrone, Leucovorin, Birinapant, and Dynasore as potent drugs against COVID-19 Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility Geographic and Genomic Distribution of SARS-CoV-2 Murine coronavirus ubiquitin-like domain is important for papain-like protease stability and viral pathogenesis A SARS-CoV-2-Human Protein-Protein Interaction Map Reveals Drug Targets and Potential Drug-Repurposing SARS-CoV-2 D614G spike mutation increases entry efficiency with enhanced ACE2-binding affinity Genetic analysis of SARS-CoV-2 isolates collected from Bangladesh: insights into the origin, mutation spectrum, and possible pathomechanism Analysis of Indian SARS-CoV-2 Genomes Reveals Prevalence of D614G Mutation in Spike Protein Predicting an Increase in Interaction With TMPRSS2 and Virus Infectivity Targeting novel structural and functional features of coronavirus protease nsp5 (3CLpro, Mpro) in the age of COVID-19 The neighbor-joining method: a new method for reconstructing phylogenetic trees Emergence of RBD and D614G Mutations in Spike Protein: An Insight from Indian SARS-CoV-2 The severe acute respiratory syndrome coronavirus nucleocapsid protein is phosphorylated and localizes in the cytoplasm by 14-3-3-mediated translocation Structural basis for translational shutdown and immune evasion by the Nsp1 protein of SARS-CoV-2. bioRxiv