key: cord-0745459-fy02datc authors: Biswas, Nupur; Mallick, Priyanka; Maity, Sujay Krishna; Bhowmik, Debaleena; Mitra, Arpita Ghosh; Saha, Soumen; Roy, Aviral; Chakrabarti, Partha; Paul, Sandip; Chakrabarti, Saikat title: Genomic surveillance and phylodynamic analyses reveal emergence of novel mutation and co-mutation patterns within SARS-CoV2 variants prevalent in India date: 2021-03-25 journal: bioRxiv DOI: 10.1101/2021.03.25.436930 sha: 914683437a035d571ba57d03f44ada1b0179a78f doc_id: 745459 cord_uid: fy02datc Emergence of distinct viral clades has been observed in SARS-CoV2 variants across the world and India. Identification of the genomic diversity and the phylodynamic profiles of the prevalent strains of the country are critical to understand the evolution and spread of the variants. We performed whole-genome sequencing of 54 SARS-CoV2 strains collected from COVID-19 patients in Kolkata, West Bengal during August to October 2020. Phylogeographic and phylodynamic analyses were performed using these 54 and other sequences from India and abroad available in GISAID database. Spatio-temporal evolutionary dynamics of the pathogen across various regions and states of India over three different time periods in the year 2020 were analyzed. We estimated the clade dynamics of the Indian strains and compared the clade specific mutations and the co-mutation patterns across states and union territories of India over the time course. We observed that GR, GH and G (GISAID) or 20B and 20A (Nextstrain) clades were the prevalent clades in India during middle and later half of the year 2020. However, frequent mutations and co-mutations observed within the major clades across time periods do not show much overlap, indicating emergence of newer mutations in the viral population prevailing in the country. Further, we explored the possible association of specific mutations and co-mutations with the infection outcomes manifested within the Indian patients. as the active cases only started to appear in India in March 2020 only. In order to enrich the viral genome sequence data for the latter half of the year, we decided to sequence 54 SARS-CoV2 sequences collected from the state of West Bengal during August 2020 to December 2020 making it only the third Indian state apart from Telangana and Maharashtra to deposit more than fifty sequences in that time period. With these sequences we decided to analyze and understand the spatio-temporal evolutionary dynamics of the pathogen across various states and union territories (UT) of India in the year 2020. We estimated the clade dynamics of the Indian strains and compared the clade specific mutations, speculated their positive selection and calculated the comutations patterns across states and UTs of India. Further, we explored the possible association of specific mutations/co-occurred mutations with the infection outcome manifested within the patients. We found that for Indian sequences GR, GH and G (GISAID) or 20B and 20A (Nextstrain) (18) clades were primarily the major prevalent clades in middle and later half of the year 2020. However, frequent mutations observed within each of the major clades do not show much overlap, especially for the last half of the year 2020, indicating the emergence of new mutations in the viral population prevailing in the country. Interestingly, only 10% of the mutations within the GISAID clades across various Indian states are found to be common. Co-mutations or co-occurrence of mutations within a specific viral strain were investigated and frequent co-mutation patterns for different Indian states were identified. Finally, associations between specific mutation and comutation pattern with respect to patient status (deceased, symptomatic, asymptomatic, etc) have been explored. Fisher Scientific, Waltham, MA) and then purified by using AMPure XP beads. Purified cDNA was then amplified by each of the two ARTIC v3 primer pools which tile the SARS-CoV2 genome. The amplified product was further subjected to end-repair, and barcoded by Native Barcode Expansion kits (1-12) and purified by 0.4X of concentration of AmpureXP. All samples were then pooled together and ligated with sequencing adapters, purified samples were finally quantified using Qubit 4.0 Fluorometer Two different methods were implemented for the assembly of long and short reads as obtained from ONT and Illumina sequencing platforms, respectively. All reads from both the platforms were checked for their quality using FastQC (20) . In case of long reads, reference based assembly was performed using the first SARS-CoV2 strain identified from China, Wuhan (NCBI Accession Number NC-045512.2 (21) which is identical to the GISAID reference sequence EPI_ISL_402124 (16)) as reference, and Minimap2 (22) with ONT specific parameters. In case of short reads, they were first filtered using kneaddata (23) and then assembled by SPAdes (24) using default parameters. Pilon (25) was used for polishing and generation of final consensus sequences. The assembled SARS-CoV2 genome sequences were checked for frameshifts using Genome Detective online tool (26) and their depths were calculated using Mosdepth (27) . All the 54 genome sequences with their associated metadata were uploaded to GISAID database (Table S1 ). We have accessed the protein sequences of SARS-CoV2 virus collected from different continents from the EpiCoV database of GISAID (16) . The database was searched on 1 st January 2021 up to sample collection date 31 st December 2020 using the primary key-words 'hCoV-19' and 'Human'. Only complete and high coverage sequences were considered. Sequences with genomes >29,000 bp were considered complete. Sequences with <1% Ns (undefined bases) were considered as high coverage sequences. Sequences collected from India were analyzed separately. Sequences from different states of India were also accessed and analyzed separately. Additional metadata for the sequences which include location of sample collection, age and sex of the patients, clade, lineage, patient status were also downloaded. We found only 23 sequences collected in 2019, all from Wuhan, China from where the disease spread. To analyze the evolution of viral clades with time, we divided all sequences in three terms, depending on the date of collection of sample. 'Term1' includes sequences collected till March 2020. 'Term2' defines April 2020 to July 2020 and 'Term3' includes sequences collected from August 2020 to December 2020. Table 1 shows the number of sequences collected from different continents and India in different terms. For identifying the mutations, GISAID reference sequence EPI_ISL_402124, collected from human sample in Wuhan, China, in December 2019, was considered as reference (28) . To extract the unique representative sequences and exclude redundant sequences CD-HIT (29) server was used. The number of CD-HIT runs was kept as 1 with sequence identity cut-off 1.0 (100% identity). It provided clusters of sequences which are 100% identical. The cluster representative sequences along with the reference sequence were aligned using Kalign protein sequence alignment tool (30) . Python (version 3.4) codes were used for extracting mutations and further analysis. Mutation was considered as frequent when frequency was calculated with at least 50 sequences (N) and the frequency was ≥ 2.5% for N ≥ 200 and mutation count was at least 5 when N < 200. For mutational analysis within India, we have chosen states and UTs, which had at least 50 sequences in a given term. For Term2, we found 8 states and UTs. Among them, only 3 states had more than 50 sequences in Term3. Hence, temporal analysis was done for those 3 states only. We also analyzed how different mutations co-occurred in different states and terms. Network was constructed for each state and UT showing co-9 occurrence between frequent mutations of that state/UT. Cytoscape.js (31) was used to construct the network. To correlate disease severity with mutation and co-mutation pattern, available metadata was analyzed. Based on the available 'patient status', patients were classified broadly in four categories, 'deceased', 'symptomatic', 'mild', and 'asymptomatic'. Patient status was associated with frequent mutations, co-occurring mutations, and clades. Fisher's Exact test was performed using the following contingency 54 SARS-CoV2 samples taken from patients residing in Kolkata and West Bengal were sequenced using whole genome sequencing approach. The age of the patients ranges from 6 to 88 years with a maximum peak (24.07%) in the range of 51-60 years ( Figure S1A ) with an overall male to female ratio of 1.7 against the national ratio of 1.99. For comparison we have also plotted the age distribution of the Indian patients for which the extracted SARS-CoV2 sequences were deposited in GISAID database ( Figure S1A ). The quality of the sequencing data represented in the form of depth shows a mean depth of approximately 26035X and 455X for short reads and long reads respectively ( Figure S1B ) while the coverage for all the assembled genomes was above 99%. The GISAID clade distribution of the 54 sequences is slightly different from the national distribution with majority of O clade representation and same is true for Nextstrain clades with prevalence of 20A clade ( Figure S1C -D). In order to explore the evolutionary and mutational dynamics represented in terms of phylogenetic clades we compared the clade distribution of all the complete and high-quality SARS-CoV2 sequences (3277) deposited in GISAID from India until December 31, 2020. The sequences were categorized into three time points ('Term') based on the date of the collection/submission of the sequences. Figure 1 shows the distribution of different GISAID and Nextstrain clades in three time points. It is evident that GR clade is prevalent in Term2 and Term3 followed by GH and G clades (33) . Similarly, Nextstrain clades 20B and 20A are found to be more prevalent in the latter half of the year 2020. To compare the clade dynamics with respect to SARS-CoV2 sequences from elsewhere, we have performed the similar analysis with sequences deposited from North America, South America, Europe, Africa, Oceania and Asia (without Indian sequences) ( Figure S2 ). Interestingly, except North America and Europe, all the other continents show prevalence of GR clade in the second and third term of the year. North America shows a consistent prevalence of GH clade throughout the year where a massive increase in the numbers of GV clade sequences was observed in Europe during the latter part of the 2020. However, it is interesting to investigate whether sequences belonging to the major clades observed in India through various time 'Terms' harbor similar mutations or not. Hence we compared the non-clade defining and 'frequent' (frequency >=2.5% or >5 when N <= 200) mutations observed within the same clade across the time terms ( Figure S3 ). Interestingly, we observed that a significant fraction of the mutations turned out be unique in 'Term2' and 'Term3' indicating accumulation of novel mutations. Curious to see the previous observation, we wanted to investigate whether the mutational variability within clades is more specific to certain geographical location (Table 2 ). In 'Term3' only from three states (MH, TG and WB) more than fifty sequences were deposited and similar to observation from 'Term2', very few overlap of mutations were found to be common across these three states ( Figure S4 and Table 2 ). Combined impact of co-occurrence of mutations within a specific viral strain could be crucial to elicit infectivity and sustenance of viral load within the host. Hence, comutation patterns within SARS-CoV2 variants were investigated and specific/common co-mutations for different Indian states were identified irrespective of clade. Figure However, relatively higher (>=10%) association was observed for a G clade (Indian population) mutation, N(S194L), which was exclusively found in 16% symptomatic patients only ( Figure S6A ). In case of GH clade, N(S194L) is observed in more than Figure S6B ). In N. American GH clade specific frequent mutation NSP14(A320V) was found exclusively only in deceased (16%) whereas NSP5(L89F) and NS8(S24L) were observed in 14% and 32% symptomatic patients, respectively ( Figure S6C ). Impact of the single point mutation could be essential but may not be sufficient to elicit pathological response and or variation in disease phenotype. Hence, we thought of exploring the association of evolutionary selected multiple mutations within a viral strain with respect to the disease status of the patient from whom the strain was isolated. respectively) compared to other states and overall India (Figure 1 and 2 ). This indicates that the evolutionary dynamics of the WB strains are slower and therefore may be the usual fixation of major clades like 'GR' or 20B is delayed for some reason. Table 2 ). Further, we also examined the variation in co-occurrence of mutations (co-mutation) across Indian states and union territories compared over the three time terms. We observed a large number of non-clade defining mutations to co-occur within SARS-CoV2 strains from the states and union territories during the 'Term2' and 'Term3' which is consistent with earlier reports based on the sequences collected mostly in 'Term1' period (17) . Consequently, a larger number of unique combinations among these cooccurred mutations were appeared during the latter half of the year 2020. It is really interesting to observe that larger fractions of viral strains extracted from states like 2 0 Telangana, Maharashtra and Karnataka during 'Term2' harbored five or more comutations while major fractions from West Bengal and Gujarat possessed less than five co-mutations. Even in 'Term3' far larger fractions of sequences from Telangana and Maharashtra showed higher co-mutating residues compared to that of West Bengal. Karnataka compared to West Bengal and Gujarat during 'Term2' and 'Term3' (15, 40) . Hence, the higher number co-mutations in those two states could be associated with higher infectivity of the strains prevalent there. In West Bengal during 'Term3' only ~22% sequences harbored more than 5 co-mutations while almost 58% of the sequences were designated as 'O' clade known for harboring larger amount of other (34) mutations. Finally, we tried to associate the frequent mutations and co-mutation patterns along with COVID-19 patient status broadly categorized into deceased, symptomatic, mild, and asymptomatic groups, respectively. Although the number of patient status mapped data is lower (806 out of 3277 sequences) we thought it was worthwhile to investigate if some of the mutations and co-mutation patterns could be specifically associated with those four states of the COVID-19 disease. Apart from the usual Spike(D614G) and NSP12(P323L) mutations, specific association of mutations NS3(Q57H), N(S194L), and Spike(L54F) are observed with symptomatic and deceased status of Indian patients while NS9b(P10S), N(P13L), NSP12(A97V), and NSP3(T1198K), respectively are found to be relatively more in asymptomatic Indian patients. Similarly, presence and absence of specific mutation with respect to symptomatic or asymptomatic status were also examined for North American and European samples for which the disease status was 1 available at the GISAID database. Although we did not find any specific co-mutation pattern with severe (deceased or symptomatic) patients but few co-mutation patterns were observed to be significantly higher in asymptomatic and mild symptomatic patients Lower panel represents overlap for >=5 co-mutations per sequence. If the world fails to protect the economy, COVID-19 will damage health not just now but also in the future A critical analysis of the impacts of COVID-19 on the global economy and ecosystems and opportunities for circular economy strategies Circular economy Sustainability Sustainable development Supply chain resilience Climate change. Rrsources, Conserv Recycl COVID-19 Vaccine: A comprehensive status report Snapshot of COVID-19 Related Clinical Trials in India Nature and dimensions of the systemic hyper-inflammation and its attenuation by convalescent plasma in severe COVID-19 A Distinct Phylogenetic Cluster of Indian Severe Acute Respiratory Syndrome Coronavirus 2 Isolates Structural and Drug Screening Analysis of the Non-structural Proteins of Severe Acute Respiratory Syndrome Coronavirus 2 Virus Extracted From Indian Coronavirus Disease SARS-CoV-2 genomics: An Indian perspective on sequencing viral variants Analysis of Indian SARS-CoV-2 Genomes Reveals Prevalence of D614G Mutation in Spike Protein Predicting an Increase in Interaction With TMPRSS2 and Virus Infectivity Full-genome sequences of the first two SARS-CoV-2 viruses from India Clinical and whole genome characterization of SARS-CoV-2 in India Genome-wide analysis of Indian SARS-CoV-2 genomes for the identification of genetic mutation and SNP Mutations in SARS-CoV-2 viral RNA identified in Eastern India: Possible implications for the ongoing outbreak in India and impact on viral structure and host susceptibility Genomic epidemiology reveals multiple introductions and spread of SARS-CoV-2 in the Indian state of Karnataka GISAID Comprehensive analysis of genomic diversity of SARS-CoV-2 in different geographic regions of India: an endeavour to classify Indian SARS-CoV-2 strains on the basis of coexisting mutations FastQC: a quality control tool for high throughput sequence data Minimap2: pairwise alignment for nucleotide sequences. Birol I, editor KneadData -The Huttenhower Lab SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement Genome Detective: an automated system for virus identification from high-throughput sequencing data. Birol I, editor Mosdepth: quick coverage calculation for genomes and exomes A pneumonia outbreak associated with a new coronavirus of probable bat origin CD-HIT: Accelerated for clustering the nextgeneration sequencing data The EMBL-EBI search and sequence analysis tools APIs in 2019 Cytoscape: A software Environment for integrated models of biomolecular interaction networks Basic biostatistics for medical and biomedical practitioners Association of clade-G SARS-CoV-2 viruses and age with increased mortality rates across 57 countries and India Geographic and Genomic Distribution of SARS-CoV-2 A Genomic Perspective on the Origin and Emergence of SARS-CoV-2. Cell The Architecture of SARS-CoV-2 Transcriptome Intra-host variation and evolutionary dynamics of SARS-CoV-2 populations in COVID-19 patients Covid19india The authors acknowledge CSIR-Indian Institute of Chemical Biology for infrastructural support. SC acknowledges financial support from CSIR MLP-132 grant. NB The authors have declared no competing interests.