key: cord-0723672-lt0uo7q3 authors: Saha, Indrajit; Ghosh, Nimisha; Maity, Debasree; Sharma, Nikhil; Sarkar, Jnanendra Prasad; Mitra, Kaushik title: Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP date: 2020-07-11 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2020.104457 sha: 564e676c076f52ba2508f3236339bb392b9bd0cb doc_id: 723672 cord_uid: lt0uo7q3

The wave of COVID-19 is a major threat to the human population. At present, the world is going through different phases of lockdown in order to stop this wave of the pandemic, and India is no exception, having started its lockdown on 23rd March 2020. In the current situation, apart from social distancing, only a vaccine can be a proper solution for the human population. Thus it is important for all nations to perform genome-wide analysis in order to identify the genetic variation in Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) so that a proper vaccine can be designed. This fact motivated us to analyze 566 publicly available complete or near-complete Indian SARS-CoV-2 genomes to find the mutation points as substitutions, deletions and insertions. In this regard, we have performed multiple sequence alignment in the presence of the reference sequence from NCBI. After the alignment, a consensus sequence is built to analyze each genome in order to identify the mutation points. As a consequence, we have found 933 substitutions, 2449 deletions and 2 insertions, in total 3384 unique mutation points, in 566 genomes across 29.9 K bp. Further, these have been classified into three groups: 100 clusters of mutations (mostly deletions), 1609 point mutations (substitutions, deletions and insertions) and 64 SNPs. These outcomes are visualized using BioCircos and bar plots, as well as by plotting the entropy value of each genomic location. Moreover, phylogenetic analysis has also been performed to see the evolution of the SARS-CoV-2 virus in India.
It also shows wide variation in the tree, which is indeed evident in the genomic analysis. Finally, these SNPs can be useful targets for virus classification and for designing and defining the effective dose of a vaccine for the heterogeneous population.

Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), the virus that causes COVID-19, originated in Wuhan, China (Zhu et al., 2020) and has wreaked havoc on human lives; the outbreak was declared a pandemic by the World Health Organisation on 11th March 2020. Among others, symptoms of SARS-CoV-2 infection include fever, cough and shortness of breath (Chen et al., 2020). In more severe cases, infection may lead to pneumonia (Zhou et al., 2020), kidney failure and eventual death. As of now, no vaccine or medicine has been developed, and the only protective measures being taken by different countries are lockdowns and social distancing. However, even these extreme measures have not been able to contain SARS-CoV-2. Every day, thousands of new cases come to light. According to the record of 27th June, more than 9.8 million people globally are affected by this deadly virus, with a total [...] reveals that SARS-CoV-2 is a single-stranded enveloped RNA virus with a genome length of 29.9 kilobases (Cui et al., 2019; Su et al., 2016; Weiss & Navas-Martin, 2005; Zhou et al., 2020). It has 11 coding regions as reported in NCBI that can encode ORF1ab polyproteins, spike (S) glycoprotein, envelope (E) protein, membrane (M) glycoprotein, nucleocapsid (N) protein and accessory proteins such as ORF3a, ORF6, ORF7a, ORF7b, ORF8 and ORF10. It has also been reported that several non-structural proteins (nsp) are encoded from the Open Reading Frame (ORF). The genomic orientation of the SARS-CoV-2 virus is shown in Fig. 1. The strain of this virus is novel, and the understanding of its genetic variability, in the form of mutations, across different nations is still very limited, especially in the coding region of the Open Reading Frame (ORF).
Generally, a mutation occurs when an error is incorporated in a viral genome (Fleischmann, 1996). It can also be considered a mechanism for coping with genomic damage. As a consequence, the resultant mutated strain may cause an outbreak in the human host, as is the case with SARS-CoV-2. A mutation can be of three types: base substitution, deletion and insertion. Moreover, if a substitution occurs in more than 1% of the population, it can be considered a Single Nucleotide Polymorphism (SNP). Such a polymorphism usually differs from a mutation in that it creates a variant in the population, while a mutation keeps the population the same (Pavlovic-Lazetic et al., 2004). On the other hand, RNA viruses have high mutation rates (Jenkins et al., 2002; Woo et al., 2009). Thus it is difficult to identify the proper strain of the virus. Subsequently, designing a vaccine and defining its dose is also a very challenging task (Paital et al., 2020). In this regard, Chothe et al. used SNPs on sequences of Bovine herpesvirus-1 (BoHV-1) (Chothe et al., 2018), which was affecting cattle and causing respiratory illness, to cluster them into three groups, with two different vaccine groups and one distinct cluster of field isolates. Based on this information, they developed an SNP-based PCR assay to show differentiation between [...]

To address the above facts, we have analyzed 566 publicly available complete or near-complete Indian SARS-CoV-2 genomes in order to find the mutation points as substitutions, deletions and insertions. For this purpose, Multiple Sequence Alignment (Wallace et al., 2005) is performed in the presence of the reference sequence from NCBI. Thereafter, a consensus sequence is built to analyze each genome and identify the mutation points. As a result, we have found 933 substitutions, 2449 deletions and 2 insertions, in total 3384 unique mutation points, in 566 genomes across 29.9 K bp.
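The comparison step summarized above can be illustrated with a minimal sketch. It assumes that each genome has already been aligned, so that it has the same length as the aligned reference and gaps are written as `-`; it also assumes that mutations are called against the aligned reference rather than the consensus. The function name `call_mutations` and the tuple layout are hypothetical and are not taken from the authors' software.

```python
from typing import List, Tuple

# (reference position, mutation type, reference base, sample base)
Mutation = Tuple[int, str, str, str]


def call_mutations(ref_aln: str, sample_aln: str) -> List[Mutation]:
    """Compare one aligned genome with the aligned reference, column by column.

    Assumption: both strings come from the same alignment, so they have
    equal length and use '-' for gaps.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")

    mutations: List[Mutation] = []
    ref_pos = 0  # 1-based coordinate on the ungapped reference
    for r, s in zip(ref_aln.upper(), sample_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == s:
            continue  # identical base, or a gap in both sequences
        if r == "-":
            # Base present in the sample only; reported at the preceding
            # reference position.
            mutations.append((ref_pos, "insertion", "-", s))
        elif s == "-":
            mutations.append((ref_pos, "deletion", r, "-"))
        elif r in "ACGT" and s in "ACGT":
            mutations.append((ref_pos, "substitution", r, s))
        # Ambiguous symbols such as N are ignored in this sketch.
    return mutations
```

For instance, `call_mutations("ACG-TAC", "ATGGT-C")` reports a substitution at reference position 2, an insertion after position 3 and a deletion at position 5.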
Further, these mutation points have been classified into three groups: (a) clusters of mutation points, if mutations appear more than two times in consecutive genomic positions; (b) point mutations (substitutions, deletions and insertions) that are not present in clusters; and (c) Single Nucleotide Polymorphisms (SNPs) that appear in more than 1% of the population of SARS-CoV-2 used in our study. Finally, 100 clusters of mutations (mostly deletions), 1609 point mutations (substitutions, deletions and insertions) and 64 SNPs have been identified; the SNPs are drawn from categories (a) and (b), as they appear in more than 1% of the population, i.e. 6 times in the Indian SARS-CoV-2 genomes. These outcomes are visualized using BioCircos and bar plots, as well as by plotting the entropy value of each genomic location. Moreover, phylogenetic analysis (Stuessy, 2009) has also been performed to see the evolution of the SARS-CoV-2 virus in India.

In this section, we discuss the source of the data, i.e. the genomic sequences of the virus, and the methods used in a systematic way to accomplish the task of finding mutation points as substitutions, deletions and insertions, as well as SNPs. The genomic sequences of the Indian SARS-CoV-2 virus were collected from the Global Initiative on Sharing All Influenza Data (GISAID) in FASTA format on 11th June 2020. The dataset contains 566 genomes, each with a sequence ID and a sequence in FASTA format. We have extracted the date from the sequence ID to show the number of sequences uploaded per month. This is shown in Fig. 2. Further, we have downloaded the Reference Sequence (NC_045512.2) from the National Center for Biotechnology Information (NCBI) to conduct the experiment with the 566 Indian SARS-CoV-2 genomes. This reference genome is also used to map the coding regions as collected from NCBI. These are reported in Table 1 and used when mentioning the mutation points in the results section. Please note that BioEdit and MEGA-X have been used for data visualization and editing. The pipeline of the workflow is shown in Fig. 3 (a).
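The grouping into clusters and SNPs described in this section can be sketched as follows. This is only an illustration under stated assumptions: a cluster is taken to be a run of at least three consecutive mutated genomic positions (one reading of "more than two times in consecutive genomic positions"), and an SNP is a substituted position found in at least ceil(1% of the genomes), which gives 6 for 566 genomes. The helper names `find_clusters` and `snp_positions` are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_clusters(positions: Iterable[int], min_run: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of at least min_run consecutive positions.

    Assumption: min_run = 3 encodes "more than two consecutive positions".
    """
    clusters: List[Tuple[int, int]] = []
    run: List[int] = []
    for pos in sorted(set(positions)):
        if run and pos == run[-1] + 1:
            run.append(pos)
        else:
            if len(run) >= min_run:
                clusters.append((run[0], run[-1]))
            run = [pos]
    if len(run) >= min_run:
        clusters.append((run[0], run[-1]))
    return clusters


def snp_positions(subs_per_genome: List[Set[int]], fraction: float = 0.01) -> List[int]:
    """Positions substituted in at least ceil(fraction * number of genomes) genomes."""
    threshold = math.ceil(fraction * len(subs_per_genome))
    counts = Counter(pos for subs in subs_per_genome for pos in subs)
    return sorted(pos for pos, n in counts.items() if n >= threshold)
```

Positions that are mutated but fall outside every range returned by `find_clusters` would then correspond to the point mutations of group (b).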
In order to find the mutations in the 566 Indian SARS-CoV-2 genomes, the multiple sequence alignment (MSA) technique called ClustalW (Thompson et al., 1994) is used in the presence of the reference sequence from NCBI. ClustalW uses the concept of a Neighbor-Joining tree, where the bootstrap size is taken as 1000. It is a widely used MSA technique for aligning any number of homologous nucleotide or protein sequences, as in our case. It uses a progressive alignment method in which the most similar sequences, with the best alignment score, are aligned first. After performing the alignment, a consensus sequence is built in order to extract the mutation points from each genome as substitutions, deletions and insertions. The detection scheme for identifying substitutions, deletions and insertions is shown in Fig. 3 [...] where $V_a^p$ represents the frequency of each residue $a$ occurring at position $p$, and 5 represents the number of possible residues for nucleic acid (in this case 4) plus gap. Further, to verify such genetic variation in the Indian SARS-CoV-2 genomes, phylogenetic analysis is performed so that the evolution can be seen. The phylogenetic analysis is conducted using the Maximum Likelihood technique (Stuessy, 2009), where Neighbor-Joining is used to construct the tree for the visualization of evolution.

The results of the experiment are discussed here. Our objective is to identify point mutations as substitutions, deletions and insertions, initially after performing the multiple sequence alignment. [...] in supplementary Table S3. Thereafter, the 933 substitutions are examined to identify SNPs, i.e. those present in at least 6 virus genomes, corresponding to 1% of the virus population. As a consequence, we have found 64 SNPs, of which 57 lie in 6 coding regions, while 7 are present in the 5′-UTR and 3′-UTR. However, in Table 3 [...] Table S4. Moreover, 6 SNPs in different coding regions are shown in Fig. 7 using BioEdit software.
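The per-position entropy used in this analysis can be illustrated with a short sketch. The exact formula is not recoverable from the text here, so the code below assumes the standard Shannon entropy with natural logarithm, computed over whatever symbols occur in an alignment column (typically A, C, G, T and the gap `-`). The function names `column_entropy` and `alignment_entropy` are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy (natural log) of one alignment column.

    Assumption: H = -sum(f * ln f) over the observed symbol frequencies f.
    """
    counts = Counter(column.upper())
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def alignment_entropy(aligned: List[str]) -> List[float]:
    """Entropy at every column of a list of equal-length aligned sequences."""
    return [column_entropy("".join(col)) for col in zip(*aligned)]
```

Under this assumed formula, a column in which 117 of 566 genomes carry T and the remainder carry C has an entropy of about 0.51, while a fully conserved column has an entropy of 0.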
It should be noted that for each list of mutations in Tables 2 and 3, we have provided the genomic coordinates, the number of occurrences of the mutation in the virus genomes (frequency of mutation), the change in nucleotide, the change in amino acid, the entropy, which measures the information contained in the nucleotide change at that genomic location, and the mapping to the coding region, so that each mutation point can be identified precisely. For example, in Table 3 the SNP at 18879 occurs in 117 virus sequences, where the change in nucleotide is C > T, the change in the corresponding amino acid is S > F and the value of entropy is 0.5216. A higher entropy value signifies that the change in nucleotide is more informative. It is important to mention that the overall results for the mutations, as substitutions, deletions, insertions, clusters and SNPs, are shown using BioCircos plots in Fig. 8, where each track shows the frequency of occurrence of mutations as a histogram using bar and dot plots. In general, it summarizes all the results visually. Moreover, the entropy computed at each genomic location, which gives the information on nucleotide change for the whole population of virus genomes, is shown in Fig. 9. This is prepared using BioEdit software. Finally, the phylogenetic analysis is shown in Fig. 10 for the 566 virus sequences and for bootstrap samples of 60 virus sequences, in order to visualize the variation in the trees clearly. It is evident from the trees that the Indian SARS-CoV-2 genomes have wide variation, which we have also noticed in the genomic analysis. These trees are generated using MEGA-X software. In addition, the aligned sequences are provided as supplementary material for further use.

In this paper we have analyzed 566 Indian SARS-CoV-2 genomes in order to find mutations as substitutions, deletions and insertions, as well as SNPs. Our analysis has identified 100 clusters of mutations (mostly deletions), 1609 point mutations (substitutions, deletions and insertions) and 64 SNPs.
Out of these 64 SNPs, 57 are present in the 6 coding regions. The purpose of finding SNPs is to identify the genomic locations that can be targeted to classify the virus strains in India. Apart from this, a major advantage is that, for a personalized vaccine, these SNPs could be used to define the dose of the vaccine after identifying the proper strain of the virus. Moreover, in future research, these SNPs can be used to model the proteins and to observe their conformational changes, so that potential drugs can be designed to target such proteins for Indian patients. We are currently working in this direction and also aim to help other researchers conduct their research using these SNPs.

The ethical approval or individual consent was not applicable.

The aligned 566 Indian SARS-CoV-2 genomes with reference and consensus sequences, the software to find mutations and the supplementary material are available at "http://www.nitttrkol.ac.in/indrajit/projects/COVID-Mutation-India/". Moreover, the Indian SARS-CoV-2 genomes used in this work are publicly available in the GISAID database.

Not applicable.

This work has been partially supported by CRG short term research grant on COVID-19

Table 3: Mutation as SNPs in more than 10% of population of Indian SARS-CoV-2 genomes

Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study
Whole-genome sequence analysis reveals unique SNP profiles to distinguish vaccine and wild-type strains of bovine herpesvirus-1 (BoHV-1)
Origin and evolution of pathogenic coronaviruses
Viral Genetics
Applying next-generation sequencing to unravel the mutational landscape in viral quasispecies
Inter nation social lockdown versus medical care against COVID-19, a mild environmental insight with special reference to India. Science of The Total Environment
Bioinformatics analysis of SARS coronavirus genome polymorphism
Plant Taxonomy: The Systematic Evaluation of Comparative Data
Epidemiology, genetic recombination, and pathogenesis of coronaviruses
Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Multiple sequence alignments
Coronavirus pathogenesis and the emerging pathogen severe acute respiratory syndrome coronavirus. Microbiology and Molecular Biology
Worldometer (2020). Coronavirus disease 2019 (COVID-19) cases in India
A pneumonia outbreak associated with a new coronavirus of probable bat origin

We thank all those who have contributed sequences to the GISAID database and the reviewers for their valuable comments to improve the article.