key: cord-0759858-x6hln4wk
authors: Banu, Sofia; Jolly, Bani; Mukherjee, Payel; Singh, Priya; Khan, Shagufta; Zaveri, Lamuk; Shambhavi, Sakshi; Gaur, Namami; Mishra, Rakesh K; Scaria, Vinod; Sowpati, Divya Tej
title: A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates
date: 2020-05-31
journal: bioRxiv
DOI: 10.1101/2020.05.31.126136
sha: 640016cd2c60e931d875fedb804da0662fc14da4
doc_id: 759858
cord_uid: x6hln4wk

From an isolated epidemic, COVID-19 has now emerged as a global pandemic. The availability of genomes in the public domain following the epidemic provides a unique opportunity to understand the evolution and spread of the SARS-CoV-2 virus across the globe. The availability of whole genomes from multiple states in India prompted us to analyse the phylogenetic clusters of genomes in India. We performed whole-genome sequencing for 64 genomes making a total of 361 genomes from India, followed by phylogenetic clustering, substitution analysis, and dating of the different phylogenetic clusters of viral genomes. We describe a distinct phylogenetic cluster (Clade I / A3i) of SARS-CoV-2 genomes from India, which encompasses 41% of all genomes sequenced and deposited in the public domain from multiple states in India. Globally 3.5% of genomes, which till date could not be mapped to any distinct known cluster fall in this newly defined clade. The cluster is characterized by a core set of shared genetic variants – C6312A (T2016K), C13730T (A88V/A97V), C23929T, and C28311T (P13L). Further, the cluster is also characterized by a nucleotide substitution rate of 1.4 × 10−3 variants per site per year, lower than the prevalent A2a cluster, and predominantly driven by variants in the E and N genes and relative sparing of the S gene. Epidemiological assessments suggest that the common ancestor emerged in the month of February 2020 and possibly resulted in an outbreak followed by countrywide spread, as evidenced by the low divergence of the genomes from across the country. To the best of our knowledge, this is the first comprehensive study characterizing the distinct and predominant cluster of SARS-CoV-2 in India.

Since the emergence of the outbreak in the Chinese city of Wuhan in late 2019, the novel Coronavirus disease has spread widely to become a global pandemic, with over 5,000,000 individuals infected worldwide and resulting in the death of over 300,000 individuals (as of 26 May 2020). 1 The causative virus, SARS-CoV-2, is a member of the genus betacoronavirus. During its transmission, the virus has differentiated into at least 10 clades globally and is continuously evolving. 2 This has implications in genetic epidemiology, surveillance, contact tracing and the development of long term strategies for mitigation of this disease. 3 The recent availability of whole-genome sequences of the SARS-CoV-2 from across the world deposited in public databases provides an unprecedented opportunity to understand the dynamics and evolution of the pathogen. The availability of genomic data in a public repository like GISAID 4 also provides wider access to the resources and enables researchers across the globe to address pertinent hypotheses. Likewise, this gave us a unique scope to understand the introduction, evolution, and spread of the virus in India and understand it in the context of global clades circulating across the world.

In this manuscript, we report the sequences of SARS-CoV-2 isolates predominantly sampled from the state of Telangana. Further, we systematically analysed the phylogenetic clusters of genomes from India and characterised a unique cluster of sequences (Clade I/A3i), which could not be classified into any of the previously annotated global clades. Isolates forming this cluster were predominant in a number of states and potentially characterized by a shared set of four genetic variants. The cluster potentially arose from a single outbreak followed by a rapid spread across the country. To the best of our knowledge, this is the first comprehensive report of the novel and predominant cluster of sequences from India and suggests its distribution beyond India in many countries in the Middle East, South Asia, and Oceania.

Sample collection and RNA purification Samples were collected and processed as per the guidelines of the Institutional Ethics Committee. Nasopharyngeal or oropharyngeal swabs collected in viral transport media were used for RNA isolation. Samples were anonymised by removing all patient identifiers except for gender, age, collection date, and symptoms where applicable. SARS-CoV-2 nucleic acids were isolated from 300 µl viral transport media using the QIAamp Viral RNA Mini kit (Qiagen) according to the manufacturer's protocol. Samples were eluted in 50 µl nuclease-free water and stored in -80°C until further use.

Shotgun RNA sequencing 100ng of RNA was taken as input for rRNA depletion using TruSeq RiboZero Gold kit (Illumina). Following rRNA depletion and fragmentation, RNA was converted to cDNA and processed for NGS library prep using TruSeq Stranded Total RNA library prep kit (Illumina). Library profiles were analyzed using Bioanalyzer and sequenced on Illumina NovaSeq 6000 at a depth of 30 million reads per sample.

A modified version of the nCoV-2019 sequencing protocol as described by Josh Quick, 2020 was used to amplify the viral genome. 5 Briefly, a combination of random hexamers, anchored oligo(dT)23, and the 98R primer from nCoV-2019/V3 primer set were used to synthesize the first strand according to the reaction conditions mentioned in the protocol. Next, a 3-step multiplex PCR was carried out to amplify the viral genome using nCoV-2019/V3 primer pools 1 and 2. The~400 bp amplicons thus obtained in two pools were combined, purified using Agencourt AMPure XP beads (Beckman Coulter) and eluted in 30 µl elution buffer (Qiagen). Around 100 ng of purified amplicons were used for library preparation. DNA libraries were prepared using either the QIAseq FX DNA Library Kit (Qiagen) or Nextera DNA Flex Library Prep Kit (Illumina) according to the manufacturer's protocol. Paired-End Sequencing (2x150bp) was performed on the NovaSeq 6000 (Illumina) with a targeted depth of 2 million reads per sample (~20,000X coverage).

Raw image data was basecalled using bcl2fastq v2.20 with --no-lane-splitting option enabled. QC of the FASTQ files was performed using FastQC v0.11.7, and adaptors/poor quality bases were trimmed using Trimmomatic. 6 , 7 Reads were aligned to the reference genome MN908947.3 using hisat2. 8 Consensus sequence from the bam file was derived using seqtk and bcftools. 9 Samtools depth command was used to calculate the coverage across the genome, with the options -a and -d 50000. 10 The sequences were deposited in GISAID. The complete list of accessions is detailed in Supplementary Data 1.

The datasets of Indian SARS-CoV-2 genomes deposited in GISAID (till 25 May 2020) were used for the analysis. The samples, sample annotations, originating, and submitting institutions are listed in Supplementary Data 2. Further 10 high-quality genomes from each of the 10 clades respectively as annotated by Nextstrain were retrieved from GISAID and used in the analysis.

A stringent quality criteria was employed for considering the genomes for evaluation of the nucleotide substitution rates, molecular clock, and phylogenetic clustering, as these would be sensitive to the quality of genomes. For this, the individual samples were systematically aligned to the Wuhan-Hu-1 reference genome (NC_045512) using EMBOSS needle. 11 Samples which had Ns > 1% and/or gaps > 1% of the sequence length and/or degenerate bases were removed from the analysis. Similarly, if a submitting laboratory had >90% samples annotated as low quality, all the data from the laboratory was removed from the analysis. Samples having an ambiguous date of collection were also removed from the analysis. After the initial high-quality analysis, all genomes originating from India irrespective of the quality criteria were considered for computation of the proportions of clades in each of the states.

A compendium of problematic genomic loci was compiled from a previous study and from the protocol specified by Nextstrain. 12, 13 These positions were masked in all the genomes. This compendium included 36 variants and is available as Supplementary Data 3.

Phylogenetic analysis of the samples was performed as detailed previously following the standard protocol for analysis of SARS-CoV-2 genomes provided by Nextstrain. 13 , 14 The sequences were clustered using Augur, the phylodynamic pipeline provided by Nextstrain. 15 The sequences were aligned against the WH1 reference genome using MAFFT and spurious variant positions were masked from the alignment. 16 The initial phylogenetic tree was constructed using IQTREE following the Augur tree implementation. 17 The raw tree was further processed with Augur to construct a TimeTree, annotate ancestral traits, infer mutations, and identify clades. The resulting tree was viewed using Nextstrain.

most recent common ancestor The substitution rate of nCoV-2 was estimated using BEAST v1.10.4 following a methodology described previously. 18, 19 The masked alignment file was used as input to estimate a strict molecular clock and a coalescent growth rate model (exponential growth). MCMC was run for 50 million steps and burn-in was adjusted to attain a suitable effective sample size (ESS).

Times to the most recent common ancestor (tMRCA) for the individual clusters were also computed by Bayesian coalescent analysis using BEAST. All runs were executed using the HKY+Γ substitution model with gamma-distributed rate variation (gamma categories=4), a strict molecular clock and a coalescent growth rate model. MCMC was run for steps as described above. The log output of each MCMC run was analysed in Tracer v1.7.1. 20 The priors used for each parameter in the analysis are given in Supplementary Data 4.

The variants were also evaluated for the functional consequences using the Sorting Intolerant from Tolerant (SIFT). 21 SIFT prediction database for SARS-CoV-2 was built using the genome annotations for the WH1 reference genome (NC_045512). A SIFT score of 0.0 to 0.05 was interpreted to have a deleterious effect while scores in the range 0.05 to 1.0 were predicted to be tolerated. The functional effects of protein variants identified in the clades were assessed using the PROVEAN web server, using the protein sequences of the Wuhan-Hu-1 genome as reference and a default threshold value of -2.5. 22 Additionally, PhyloP conservation scores and base-wise GERP Rejected Substitutions scores (RS scores) for the variants were also computed. 23, 24 Sites having positive PhyloP scores are predicted to be conserved, while negative scores are indicative of fast-evolving sites. Positive GERP scores were considered indicative of a site under evolutionary constraint, while negative scores indicate neutrally evolving sites. The variants were also checked for overlaps with immune epitope predictions as given on UCSC Genome Browser for SARS-CoV-2.

The samples sequenced encompass 64 genomes in total, majorly collected from the state of Telangana. The age of the patients ranged from 10-73 years, with >80% (54 out of 64) within the age bracket of 20-60 years (Fig S1A) . A total of 61 samples were sequenced using an amplicon-based approach with a target of~2 million paired-end reads per sample. We could achieve an average coverage of >1000x in all cases, with a uniform representation from all amplicons (Fig S1B top, S1C ). 3 samples were sequenced using a shotgun sequencing approach and had an average coverage of approximately 100x, however, the coverage across the genome was uniform (Fig S1B, bottom) . The samples and metadata for the isolates sequenced and deposited in the public domain are summarised in Supplementary Data 1

A total of 361 genome sequences of the SARS-CoV-2 were available for analysis as of 25th May 2020 from India including the genomes sequenced by our group. After stringent analysis of the sequence quality and removal of all sequences which did not meet the quality criteria, the dataset resulted in a total of 213 genomes submitted from 7 institutions (including 55 out of our 64 genomes). Supplementary Data 2 summarises the genomic data considered for the analysis, submitting institutions, quality criteria used for inclusion/exclusion, and the acknowledgements.

The genomes isolated from India were found to be classified under 5 clusters. 4 of these clusters are known clades identified by Nextstrain: A2a, A3, B, and B4. 13 The first and the major cluster encompassed 133 (62%) of genomes which fell into the A2a clade of the SARS-CoV-2 genome. The clade was represented by samples derived from multiple states across the country including Gujarat, West Bengal, Maharashtra, Tamil Nadu, and Telangana.

The second-largest cluster consisted of 62 genomes (29%). This cluster of sequences could not be classified into any of the 10 clade sequences defined by Nextstrain. 13 We further evaluated the nucleotide compositions which define each of the 10 clusters/clades worldwide and compared this to the nucleotide composition of the clade in question. Our analysis suggests that the nucleotide compositions which define the 10 clades were absent in the clade of question. This cluster was found to have diverged from the A1a and A3 clades and most, but not all of the sequences shared a variant (L3606F in ORF1a) with members of the A3 and A1a clades. We propose to call this the A3i clade in cognisance of this fact. To avoid potential conflict with the nomenclature followed by Nextstrain, we, therefore, define this cluster of sequences as Clade I/A3i, for the unique occurrence as a dominant cluster amongst SARS-CoV-2 genome sequences from India and also since this clade is largely formed by sequences from India.

The other clusters encompassed the A3, B4, and B clades with 13, 3, and 1 genomes each falling into the respective clusters. Figure 1 summarises the different phylogenetic clusters of the genomes from India. 

Systematic analysis of members of the cluster revealed that a set of unique genetic variants defined the core cluster. A discriminant analysis was performed for all variants in any genome defined by the cluster of sequences. A unique combination of four variants, C6312A, C13730T, C23929T, and C28311T was found to be shared by a majority of members of the cluster. A total of 60 genomes of the 62 genomes (96.8%) in the cluster shared the combination of variants. This unique combination of variants was shared by none of the other genomes which were assigned to any other clade. We further analysed the global datasets for identifying the genomes which displayed matches for 4/4 variants which defined the Clade I/A3i. Our prospective search retrieved a total of 21 high-quality genomes ( Supplementary Data 5 ). Of the retrieved genomes, the largest number of genomes originated from Singapore which had 5 genomes and constituted 8% of the high-quality genomes from Singapore. The other genomes originated from a number of countries including Australia, Brazil, Brunei, Canada, China, Gambia, Guam, Philippines, Saudi Arabia, Slovenia, and Taiwan. The members in the cluster, however, contributed to a much smaller proportion of the clades/clusters identified in the respective countries.

Of these, 4 were sampled from a date earlier than the earliest sample of this cluster from India and were from Australia, Brazil, Canada, and Saudi Arabia.

Mutation rates were calculated for the Indian sequences using BEAST, with the WH1 genome as the root. Our analysis suggests that the substitution rate is 1.64 × 10 -3 (95% HPD 1.41 × 10 -3 -1.88 × 10 -3 ) substitutions per site per year for the entire Indian SARS-CoV-2 genomes put together. This also confirms the estimates previously made. 25 The substitution rate was also computed for the individual clades. The gene-wise substitution rates were also similarly calculated for the major clusters. The analysis suggests that the I/A3i clade has a nucleotide substitution rate of 1.4 x 10 -3 variants per site per year compared to the estimate of 1.65 x 10 -3 variants per site per year for the prevalent A2a clade and 1.64 x 10 -3 variants per site per year for all the high quality genomes from India analysed.

The nucleotide substitution rate suggests that the evolution of the I/A3i clade is largely determined by changes in the structural proteins -Nucleocapsid (N) and Envelope (E) genes, compared to the A2a, the globally predominant clade, which is determined by changes in the Spike (S) and Membrane (M) genes (Table 1) . Table 1 Nucleotide Substitution rates of the different structural protein genes and genome-wide across the different clusters and clades in India. The estimates for B and B4 were not computed since the clades had only 3 and 1 genomes respectively from India.

The date of the most recent ancestor for the dataset of all Indian SARS-CoV-2 genomes, with WH1 genome sequence included, was computed using BEAST. The median tMRCA was found to be 11 December 2019 (95% HPD 26 November to 25 December), confirming the previous estimates of the origin of the epidemic in Wuhan city of China. 26 tMRCA for the I/A3i clade, as well as the A2a clade, which constituted the majority of samples, was also computed. The clade A2a, which is the predominant clade in India had a tMRCA of 2nd Jan 2020 (95% HPD Interval 13 Dec 2019 -22 Jan 2020), while the clade I/A3i had a tMRCA of 8 Feb 2020 (95% HPD Interval 17 Jan 2020 -25 Feb 2020).

Of the four variants which characterise the clade I/A3i, three of the variants caused an amino acid change (non-synonymous). Additionally, all variants which defined the other clades were compiled. The nonsynonymous variants were analysed for potential functional consequences using the Sorting Intolerant from Tolerant (SIFT) and Protein Variant Effect Analyser (PROVEAN). 21 , 22 SIFT prediction database for SARS-CoV-2 was built using the genome annotations for the WH1 reference genome (NC_045512). Additionally, PhyloP and GERP conservation scores were also computed for the variants. 23, 24 The variants and their respective predictions are summarised in Table 2 .

Majority of the variants which defined other clades were predicted to be neutral by PROVEAN, with the exception of G251V which defines the A1a clade. Three variants which define Clade I/A3i, C6312A, C13730T and C28311T resulted in amino acid changes with potentially deleterious functional consequences as predicted by SIFT and mapped to quite conserved genomic loci in the SARS-CoV-2 genome ( Table 2 ) . One of these variants, A97V in the RDRP protein (corresponds to A88V in ORF1b) is located in its NiRAN domain, which is suggested to be important in RNA binding and nucleotidylation activity. 27 Both SIFT and PROVEAN analyses (Table 2) suggest that the effect of this mutation is deleterious in nature, however, as both Alanine and Valine are hydrophobic amino acids, the exact effect of the mutation needs to be experimentally validated. Of notable significance is the P13L variant (C28311T) in the Nucleocapsid protein which is required for the viral entry into the cells. The variant maps to the IDR domain of the N protein and SIFT predicts the variant to be deleterious, though the PROVEAN analysis categorized it as a neutral mutation ( Table 2) .

Two of the variants C6312A in ORF1a and C13730T in ORF1b also mapped to immune epitope predictions (HLA-A0201 binding peptides) from NetMHC 4.0 available on UCSC Genome Browser and as listed on UCSC Genome Browser for SARS-CoV-2. 28 The potential consequences of the variants in the immune response could not be ascertained. 

The presence of a short tree of Clade I/A3i with divergence from a single point suggests a single point of introduction. 29 The single point of divergence also suggests that the origin and spread of the cluster were possibly from a single outbreak (Figure 3 ) . The clustering of samples in Feb 2020 suggests a rapid spread spanning multiple regions across the country. The first sequence from the cluster in India was GMC-KN443/2020 (Accession ID  EPI_ISL_431103 

Genomic evolution coupled with the appropriate tools like genome sequencing provides a unique opportunity to understand the spread and evolution of pathogens. 30, 31 The recent emergence of COVID-19 as a global pandemic and the emergence of the Open Data for SARS-CoV-2 genomes from across the globe facilitated by genomic databases like GenBank and GISAID has truly opened up new opportunities to understand the pathogen and its spread and evolution at an unprecedented rate. 4, 32 Whole-genome sequencing of SARS-CoV-2 has also been extensively used in understanding epidemics at a macro-level as well as at micro-levels, at hospitals. 33 In this report, we describe a distinct cluster of sequences from genomes of SARS-CoV-2 sequenced and deposited from multiple laboratories across India, which we classify as the I/A3i clade. This distinct cluster could not be classified into any of the 10 clade annotations as described by Nextstrain, and was characterized by a unique combination of four variants which was shared by over 92% of the isolates falling in the cluster. The cluster was predominantly found in genomes from India; though additional members could also be found from genomes deposited in other countries, they form a minor proportion of the genomes from the respective countries. The Indian genomes constituted~63% of the global genomes for this cluster.

In-depth analysis of the genome cluster suggests a comparable rate of nucleotide substitutions with other predominant clades, though a gene-wise estimate of substitution suggests a distinct mode of evolution, driven by the Nucleocapsid (N) and Envelope (E) genes, and sparing of the Spike (S) gene in contrast to predominant diversity in the Spike (S) and Membrane protein (M) genes in A2a clade, the globally predominant clade. 34 However, it has not escaped our attention that host genetic factors could modulate the evolution of the virus genome, and without large-scale host-genomic studies, the causal relationships cannot be conclusively established.

The cluster suggests a potential single introduction in February, followed by a country-wide spread, mostly affecting the South Indian states as evidenced by the tMRCA as well as the short cluster. Our analysis suggests that the Clade I/A3i was represented in almost all states from which genomes are available, barring a few. Members of the Clade I/A3i formed the predominant class of isolates from the states of Tamil Nadu, Telangana, Maharashtra, and Delhi and the second largest in membership in Bihar, Karnataka, Uttar Pradesh, West Bengal, Gujarat, and Madhya Pradesh.

Put together, the cluster of genomes (Clade I/A3i) forms a distinct cluster, predominantly found amongst Indian SARS-CoV-2 genomes, with limited representation outside the region.

To the best of our knowledge, this is the first comprehensive study characterizing the distinct and predominant cluster of SARS-CoV-2 in India. This report also exemplifies the fact that timely and open access to genomic data can provide unique insights into the genetic epidemiology of pathogens.

RKM, DTS and VS conceptualized and designed the study. SB and PM processed the sequencing data. SB, PM, and PS performed sequence data analysis and visualization. BJ performed the analysis for phylogenetic clustering, molecular characterization and quantification of relatedness. SK, LZ, SS and NG conducted the experiments. DTS, VS and BJ prepared the manuscript with inputs from RKM. All authors read and approved the manuscript.

Authors declare no conflict of interest.

An online and updated resource for SARS-CoV-2 genomes from India, their clade assignments and distribution across the country is available at http://clingen.igib.res.in/genepi/phylovis/

Coronavirus Disease (COVID-19) Situational Report 128

COVID-19: Epidemiology, Evolution, and Cross-Disciplinary Perspectives

Genotyping coronavirus SARS-CoV-2: methods and implications

Global initiative on sharing all influenza data -from vision to reality

nCoV-2019 sequencing protocol v2 v1 (protocols.io.bdp7i5rn)

Babraham Bioinformatics -FastQC A Quality Control tool for High Throughput Sequence Data

Trimmomatic: a flexible trimmer for Illumina sequence data

Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype

A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data

The Sequence Alignment/Map format and SAMtools

EMBOSS: the European Molecular Biology Open Software Suite

Issues with SARS-CoV-2 sequencing data

Computational Analysis and Phylogenetic clustering of SARS-nCov-2 genomes

Nextstrain: real-time tracking of pathogen evolution

Recent developments in the MAFFT multiple sequence alignment program

IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies

Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10

Phylodynamic estimation of incidence and prevalence of novel coronavirus (nCoV) infections through time

Posterior Summarization in

SIFT Missense Predictions for Genomes

PROVEAN web server: a tool to predict the functional effect of amino acid substitutions and indels

Detection of nonneutral substitution rates on mammalian phylogenies

Distribution and Intensity of Constraint in Mammalian Genomic Sequence

Temporal signal and the evolutionary rate of 2019 n-CoV using 47 genomes collected by

Structure of the RNA-dependent RNA polymerase from COVID-19 virus

NetMHCpan-4.0: Improved Peptide-MHC Class I Interaction Predictions Integrating Eluted Ligand and Peptide Binding Affinity Data

Tracking virus outbreaks in the twenty-first century

The Establishment of Reference Sequence for SARS-CoV-2 and Variation Analysis

On the origin and continuing evolution of SARS-CoV-2

Rapid implementation of real-time SARS-CoV-2 sequencing to investigate healthcare-associated COVID-19 infections

Emergence of Drift Variants That May Affect COVID-19 Vaccine Development and Antibody Treatment

Authors acknowledge the GISAID database and the contributors of genomic data, without which this analysis was not possible. The COVID-19 volunteer team of CCMB is thanked for help in initial sample processing. The full acknowledgements of data and contributors are available in Supplementary Data 2 . Authors acknowledge funding from the Council of Scientific and Industrial Research (CSIR India). BJ is a recipient of the Junior Research Fellowship from CSIR India. The funders had no role in the design of the experiments, preparation of the manuscript or decision to publish.

All sequences generated in this study have been submitted to GISAID. The accession names of the samples and associated metadata are outlined in Supplementary Data 1 .