key: cord-0945155-0wtqii1c authors: Ghosh, Nimisha; Saha, Indrajit; Nandi, Suman; Sharma, Nikhil title: Characterisation of SARS-CoV-2 Clades based on Signature SNPs unveils Continuous Evolution date: 2021-09-20 journal: Methods DOI: 10.1016/j.ymeth.2021.09.005 sha: eaf685b511b84c402153a1f2879ef27cb4af6415 doc_id: 945155 cord_uid: 0wtqii1c Since the emergence of SARS-CoV-2 in Wuhan, China more than a year ago, it has spread across the world in a very short span of time. Although, different forms of vaccines are being rolled out for vaccination programs around the globe, the mutation of the virus is still a cause of concern among the research communities. Hence, it is important to study the constantly evolving virus and its strains in order to provide a much more stable form of cure. This fact motivated us to conduct this research where we have initially carried out multiple sequence alignment of 15359 and 3033 global dataset without Indian and the dataset of exclusive Indian SARS-CoV-2 genomes respectively, using MAFFT. Subsequently, phylogenetic analyses are performed using Nextstrain to identify virus clades. Consequently, the virus strains are found to be distributed among 5 major clades or clusters viz. 19A, 19B, 20A, 20B and 20C. Thereafter, mutation points as SNPs are identified in each clade. Henceforth, from each clade top 10 signature SNPs are identified based on their frequency i.e. number of occurrences in the virus genome. As a result, 50 such signature SNPs are individually identified for global dataset without Indian and dataset of exclusive Indian SARS-CoV-2 genomes respectively. Out of each 50 signature SNPs, 39 and 41 unique SNPs are identified among which 25 non-synonymous signature SNPs (out of 39) resulted in 30 amino acid changes in protein while 27 changes in amino acid are identified from 22 non-synonymous signature SNPs (out of 41). These 30 and 27 amino acid changes for the non-synonymous signature SNPs are visualised in their respective protein structure as well. Finally, in order to judge the characteristics of the identified clades, the non-synonymous signature SNPs are considered to evaluate the changes in proteins as biological functions with the sequences using PROVEAN and PolyPhen-2 while I-Mutant 2.0 is used to evaluate their structural stability. As a consequence, for global dataset without Indian sequences, G251V in ORF3a in clade 19A, F308Y and G196V in NSP4 and ORF3a in 19B are the unique amino acid changes which are responsible for defining each clade as they are all deleterious and unstable. Such changes which are common for both global dataset without Indian and dataset of exclusive Indian sequences are R203M in Nucleocapsid for 20B, T85I and Q57H in NSP2 and ORF3a respectively for 20C while for exclusive Indian sequences such unique changes are A97V in RdRp, G339S and G339C in NSP2 in 19A and Q57H in ORF3a in 20A. The first case of SARS-CoV-2 was registered in Wuhan China, 2019 and it quickly took over the normal functioning of human lives. In the initial phases, lock-down was implemented to limit the spread of infection. It is well known that virus mutations take place in the form of single nucleotide variants, deletions and large structural variants [1] mainly due to replication and some hotspot mutations having severe impact on the host. Many cities around the globe have gone through staggered phases of lockdown in order to avoid the spread of the different strains of SARS-CoV-2. Among these strains, B.1.1.7 (Alpha) is found to be highly transmissible [2] and causes more severe pathogenic infection in young people [3] . Another major variant B.1.351 (Beta) which has emerged from South Africa [4, 5] had also led to a sudden surge in the total number of cases. The efficacy of therapeutic monoclonal antibodies (mAbs) are known to be reduced against another variant P.1 (Gamma). It is estimated to be 2.6 times more transmissible [6] . Another variant B.1.617.2 (Delta) is known for the surge in cases in India during the 2nd wave of the pandemic. SARS-CoV-2 is a 29.9kb long single-stranded genomic RNA [7] [8] [9] [10] [11] [12] . It covers 11 coding regions where ORF1ab occupies majority of the genomic sequence while Spike (S), ORF3a, Envelope, Membrane, ORF6, ORF7a, ORF7b, ORF8, Nucleocapsid and ORF10 constitute the rest of the sequence [8, 9, 13] . Through various studies it is found that the South African B.1.351 variant strain consists of mutation in three prominent places in Spike (S) protein, they being K417N, E484K and N501Y [4] whereas the UK variant B.1.1.7 which was found to be part of the 20B clade contains multiple mutations with a combination of N501Y in Spike (S) protein [2] and the 69-70del which have been circulating within the community for months. Hence, it is incumbent that such frequent mutations be given special focus by the scientific community to trace and tackle the challenges posed by the mutations. Tang et al. [14] investigated the extent of molecular divergence between SARS-CoV-2 and other related coronaviruses by analysing 103 SARS-CoV-2 genomes and reported two major lineages, L and S. Several other mutations are also identified in the last few months which demands re-purposing of the current methods to deal with the virus. Wang et al. [15] have proposed a h-index mutation ratio criteria to evaluate the non-conserved and conserved proteins with the help of more than 15000 sequences. Consequently, nucleocapsid, spike and papain-like protease are found to be highly non-conserved while envelope, main protease, and endoribonuclease protein are relatively conservative. They have further identified the mutations on 40% of nucleotides in nucleocapsid, thereby indicating potential impacts on the ongoing development of various COVID-19 diagnosis and cure. Such similar analysis conducted by Yuan et al. [16] with 11183 sequences revealed 119 high frequency substitutions or Single Nucleotide Polymorphism (SNP) around the globe. Among the nucleotide changes in SNPs, C to T is the major one indicating adaptation and evolution of the virus in the human host which can pose new challenges. On the other hand, Chen et al. [17] focused on the * Corresponding author: indrajit@nitttrkol.ac.in binding of free energy changes between the angiotensin converting enzyme 2 (ACE2) receptor with the frequently changing Spike protein of SARS-CoV-2 considering algebraic topology-based machine learning model and found 3 sub-type of the virus with slightly high infectivity. 570 SARS-CoV-2 sequences were analysed and 10 distinct hotspot mutations points from China, India, USA, Europe which might affect the replication-relevant proteins were identified by Weber et al. [18] . Further, they found that these mutations can effect the secondary structure of the RNA molecule of SARS-CoV-2 and its repertoire which are essential for viral and cellular proteins. Moreover, Nagy et al. [19] computed the direct effect of mutations over clinical outcome with the help of Chi-square test, in which they found mutations in ORF8, NSP6, ORF3a, NSP4 and nucleocapsid regions are associated with mild effects while inferior outcomes were mapped in spike, RNA-dependent RNA polymerase, ORF3a, NSP3, ORF6 and nucleocapsid. Further, they concluded that mutations in ORF3a and NSP7 can lead to severe outcomes but with low prevalence. Cheng et al. [20] identified 5 major mutation points, C28144T, C14408T, A23403G, T8782C and C3037T in almost all strains for the month of April 2020. Their functional analysis show that these mutations lead to a decrease in protein stability and eventually a reduction in the virulence of SARS-CoV-2 but A23403G mutation increases the Spike-ACE2 interaction leading to an increase in its infectivity. Whole-genome analysis of 837 Indian SARS-CoV-2 genomes were carried out by Sarkar et al. [21] which revealed 33 different mutations, out of which 18 were unique to India. Based on their co-existing mutations, the Indian isolates were classified into 22 groups. Their study highlighted the evolution of divergent SARS-CoV-2 strains and also co-circulation of multiple strains in India. Motivated by the aforementioned studies, in this work we have analysed 18392 SARS-CoV-2 genomes for 71 countries where 15359 global dataset without Indian (Dataset A) and dataset of 3033 exclusive Indian (Dataset B) SARS-CoV-2 genomes are taken separately to identify clade specific signature SNPs. To achieve this, Multiple alignment using fast fourier transform (MAFFT) [22] is used for multiple sequence alignment (MSA) followed by Nextstrain [23] In this section, the collection of the dataset for SARS-CoV-2 genomes and the proposed pipeline are discussed. The collection of the dataset can be summarised as below: • For phylogenetic analyses, Dataset A with 15359 sequences and Dataset B with 3033 SARS-CoV-2 genomes are collected from Global Initiative on Sharing All Influenza Data (GISAID) 1 while the Reference Genome (NC 045512.2) 2 is collected from National Center for Biotechnology Information (NCBI). • The 18392 SARS-CoV-2 sequences for 71 countries are mostly distributed from December 2019 to December 2020. The average and maximum lengths of the sequences are 29820 and 29903 respectively. • Further, to map the changes of amino acid in proteins, PDB of the proteins are collected from Zhang Lab 3 and other reliable sources. • All these analysis are performed on High Performance Computing facility of NITTTR, Kolkata and the codes are written in MATLAB R2019b. The pipeline of the work is provided in Figure 1 techniques [22] . Thus, MAFFT is used in this work for multiple sequence alignment. On the other hand, Nextstrain is a collection of open-source tools which is useful for understanding the evolution and spread of pathogen, particularly during an outbreak. Using Nextstrain, proper and meaningful visualisation of a large number of virus sequences can be achieved. It consists of "auspice" which is a web-based visualisation program used to present and interact with phylogenomic and phylogeographic data. There are a substantial number of tools in Nextstrain which perform phylodynamic analysis [24] which ranges from subsampling, alignment, phylogenetic inference to temporal dating of ancestral nodes and discrete trait geographic reconstruction and inference of the most likely transmission events. The spread and evolution of virus genomes can be visualised at nextstrain.org using auspice. By taking the advantage of this tool, in this work the evolution and geographic distribution of SARS-CoV-2 genomes are visualised by creating the metadata in our High Performance Computing environment. Once the virus clades are identified using Nextstrain, clade specific aligned sequences are used to identify the mutation points as substitutions specifically SNPs in each clade. Following this, amino acid changes in the virus proteins corresponding to the SNPs are identified using the codon table. Thereafter, individually for Datasets A and B, top 10 signature SNPs are identified in each clade based on their frequencies or number of occurrences in the virus genome. Such frequencies can also be quantified by considering entropy values as well, the calculation of which is described in details in [25] . It is to be noted that the amino acid changes for the SNPs can be either synonymous or non-synonymous. Thereafter, the amino acid changes in the non-synonymous signature SNPs are considered to evaluate their functional characteristics. These amino acid changes are visualised in the respective protein structure as well. The amino acid changes for the non-synonymous signature SNPs in the different coding regions for both Datasets A and B are visualised graphically in Figure 1 (d). The experiments in this work are carried out according to the pipeline as provided in Figure 1 (a). In this regard, initially multiple sequence alignment of Dataset A with 15359 and Dataset B with 3033 SARS-CoV-2 genomic sequences are carried out using MAFFT followed by their phylogenetic analysis using Nextstrain. The results from the phylogenetic analyses are as follows: • As a result of phylogenetic analysis by Nextstrain, 5 clades are identified viz. 19A, 19B, 20A, 20B and 20C. Subsequently, mutation points as SNPs are identified in each clade for both Dataset A and Dataset B. Tables 1 and 2 respectively. For example, as reported in Table 1 • For India as given in Table 2 , the most dominant clade in Maharashtra is 20B while for Gujarat it is 20A. It can be further concluded from Table S2 that most of the variants in India belong to clades 20A and 20B. • These clade wise distributions of Datasets A and B are visualised in Figure 2 (e) and (f) respectively. • The clade wise evolution of all 18392 global including India (country wise) and separately 3033 Indian (state wise) SARS-CoV-2 genomes for each month is shown in the form of pie charts in supplementary Tables S1 and S2 respectively. • The month wise evolution of such genomes for each clade is reported in supplementary Tables S3 and S4 respectively. The corresponding colour representation for the five major clades and the months are shown in supplementary Figure S1 . Moreover, the entropy values for the nucleotide changes for Datasets A and B are shown respectively in Figure 2 (g)-(h). Furthermore, the coding regions of the SARS-CoV-2 genome are visualised in Figure 2 (i). Once the mutation points as SNPs are determined, top 10 signature SNPs are identified in each clade for both Datasets A and B, thus resulting in 50 signature SNPs for each category as reported in Table 3 Table 3 and their amino acid changes in protein are shown in Figure 3 . The corresponding clade-wise distribution is shown in Figure 4 . To depict the common signature SNPs in the five clades for both Datasets A and B, visualisation in the form of Venn diagram is shown in Figure 5 supplementary Table S5 . SARS-CoV-2 has resulted in a mass meltdown throughout the globe. Recently, the mutated variants of the virus are turning out to be another major concern for the researchers. Table 3 , the discarded change being E110* in ORF8 as the amino acid change leads to a stop codon. Indian genomes are C6310A, C6312A, C13730T, G22346A, C28311T, T28688C, C28854T and G28878A while C241T, C1059T, G1397A, C3037T, C8782T, G11083T, C14408T, A23403G, G25563T, T28144C, G28881A, G28882A, G28883C and G29742A are common in both Datasets A and B. Furthermore, for Dataset A, G26144T which corresponds to G251V in ORF3a in clade 19A is damaging and shows a decrease in stability while C13730T which corresponds to A97V in RdRp for 19A, C1059T corresponding to T85I in NSP2 for clade 20C and G25563T corresponding to Q57H in ORF3a for 20C are damaging and exhibits shrinking stability for both Datasets A and B. In this work, multiple sequence alignment of 18392 SARS-CoV-2 sequences is carried out using MAFFT fol- Moreover, a comparative study is also put forth to show the correctness of our work. We hope this work will better equip the researchers in their path of designing anti-viral therapeutics to mitigate COVID-19. As a future scope of research, consensus of SNPs can be considered by taking more than one multiple sequence alignment techniques. Also, investigation of the characteristics of these signature SNPs of SARS-CoV-2 on human hosts can be conducted with the help of virologists. The authors are working in these directions. The ethical approval or individual consent was not applicable. All the files which include dataset (raw and aligned sequences, metadata for Nextstrain and JSON files as outputs of Nextstrain), codes, supplementary PDF and videos of clade specific virus evolution and transmission in 71 countries are available at "http://www.nitttrkol.ac.in/indrajit/projects/COVID-Evolution-SignatureSNPs-18K/". Not applicable. The authors declare that they have no conflict of interest. Hotspots of human mutation Emergence of a new sars-cov-2 variant in the uk Effect of the new sars-cov-2 variant b. 1.1. 7 on children and young people, The Lancet Child & Adolescent Health South africa responds to new sars-cov-2 variant Structural and molecular perspectives of sars-cov-2 Novel SARS-CoV-2 variants: the pandemics within the pandemic The architecture of sars-cov-2 transcriptome A pneumonia outbreak associated with a new coronavirus of probable bat origin A sars-cov-2 protein interaction map reveals targets for drug repurposing Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies Genotyping coronavirus sars-cov-2: methods and implications Genome-wide mapping of SARS-CoV-2 RNA structures identifies therapeuticallyrelevant elements Whole genome analysis of more than 10 000 SARS-CoV-2 virus unveils global genetic diversity and target region of NSP6 On the origin and continuing evolution of SARS-CoV-2 Decoding sars-cov-2 transmission, evolution and ramification on covid-19 diagnosis, vaccine, and medicine Global SNP analysis of 11,183 SARS-CoV-2 strains reveals high genetic diversity Mutations strengthened sars-cov-2 infectivity Signal hotspot mutations in sars-cov-2 genomes evolve as the virus spreads and actively replicates in different parts of the world Different mutations in sars-cov-2 associate with severe and mild outcome Functional alterations caused by mutations reflect evolutionary trends of sars-cov-2 Comprehensive analysis of genomic diversity of SARS-CoV-2 in different geographic regions of India: an endeavour to classify Indian SARS-CoV-2 strains on the basis of co-existing mutations MAFFT: A novel method for rapid multiple sequence alignment based on fast fourier transform Nextstrain: real-time tracking of pathogen evolution Viral phylodynamics Genome-wide analysis of indian sars-cov-2 genomes for the identification of genetic mutation and snp, Infection Provean web server: a tool to predict the functional effect of amino acid substitutions and indels A method and server for predicting damaging missense mutations Casadio, I-mutant2.0: predicting stability changes upon mutation from the protein sequence or structure In silico analysis predicting effects of deleterious snps of human rassf5 gene on its structure and functions A Distinct Phylogenetic Cluster of Indian Severe Acute Respiratory Syndrome Coronavirus 2 Isolates SARS-CoV-2 hot-spot mutations are significantly enriched within inverted repeats and CpG island loci Genetics and genomics of sars-cov-2: A review of the literature with the special focus on genetic diversity and sars-cov-2 genome detection Clade gr and clade gh isolates of sars-cov-2 in asia show highest amount of snps, Infection Variant analysis of the first lebanese sars-cov-2 isolates Genetic cluster analysis of sars-cov-2 and the identification of those responsible for the major outbreaks in various countries Rapid Spread of Mutant Alleles in Worldwide SARS-CoV-2 Strains Revealed by Genome-Wide Single Nucleotide Polymorphism and Variation Analysis CoV-2 Genomes Venn Diagram of Unique Non-Synonymous Signature SNPs in 19B Clade for Global excluding India and Indian SARS-CoV-2 Genomes Venn Diagram of Unique Non-Synonymous Signature SNPs in 20A Clade for Global excluding India and Indian SARS-CoV-2 Genomes Venn Diagram of Unique Non-Synonymous Signature SNPs in 20B Clade for Global excluding India and Indian SARS-CoV-2 Genomes Venn Diagram of Unique Non-Synonymous Signature SNPs in 20C Clade for Global excluding India and Indian SARS-CoV-2 Genomes Nimisha Ghosh: Conceptualization; Methodology Software; Validation; Writing -original draft. Indrajit Saha: Conceptualization Data curation; Supervision; Funding acquisition; Formal analysis; Investigation Writing -review & editing. Nikhil Sharma: Conceptualization Formal analysis; Software; Validation; Visualization; Writing -review & editing. Suman Nandi: Conceptualization; Formal analysis; Software; Validation We thank all those who have contributed sequences to GISAID database.