key: cord-0885177-9slpoyz7
authors: Bhattacharjee, Bornali; Pandit, Bhaswati
title: Phylogenetic clustering of the Indian SARS-CoV-2 genomes reveals the presence of distinct clades of viral haplotypes among states
date: 2020-05-28
journal: bioRxiv
DOI: 10.1101/2020.05.28.122143
sha: b583d4661a54e42fb8f380124aaa835319b6f118
doc_id: 885177
cord_uid: 9slpoyz7

The first Indian cases of COVID-19 caused by SARS-CoV-2 were reported in February 29, 2020 with a history of travel from Wuhan, China, and so far more than 4500 deaths have been attributed to this pandemic. The objectives of this study were to characterize Indian SARS-CoV-2 genome-wide nucleotide variations, trace ancestries using phylogenetic networks and correlate the state-wise distribution of viral haplotypes with differences in mortality rates. A total of 305 whole genome sequences from 19 Indian states were downloaded from GISAID. Sequences were aligned using the ancestral Wuhan-Hu genome sequence (NC_045512.2) as the reference. A total of 633 variants resulting in 388 amino acid substitutions were identified. The allele frequency spectrum and nucleotide diversity (π) values revealed a higher proportion of low-frequency variants, and negative Tajima's D values across ORFs indicated population expansion. Network analysis highlighted the presence of two major clusters of viral haplotypes, namely, clade G with the S:D614G and RdRp:P323L variants, and a variant of clade L [Lv] having the RdRp:A97V variant. Clade G genomes were found to be evolving more rapidly into multiple sub-clusters, including clades GH and GR, and were also found in higher proportions in the three states with the highest mortality rates, namely, Gujarat, Madhya Pradesh and West Bengal.

genome (RNA) sequence of SARS-CoV-2 was published on the 5th of January, 2020 [3] and currently more than 30,000 SARS-CoV-2 sequences have been submitted from across the world to the Global Initiative on Sharing All Influenza Data (GISAID) [4].
It has also been identified on the basis of nucleotide variants that 8 major clades of viral haplotypes have spread across the globe, causing the pandemic [4]. However, the implications of the evolutionary genome-wide changes still remain elusive.

Sequencing of SARS-CoV-2 is imperative to understand the transmission routes, possible sources, and cross-species evolution and transmission to human hosts. On the basis of such sequence identity, it has been speculated that bats form a reservoir of such viruses (bat CoV genome, RaTG13) and are a probable species of origin [5]. Further, reports have also shown strong homology among viruses in metavirome data sets of SARS-CoV, which were generated from the lungs of deceased pangolins [6].

In India, the first three cases of COVID-19 with travel history from Wuhan, China were reported from the state of Kerala in February 2020. Since then the virus and the disease have spread to all 37 states and union territories, with 86110 active cases and 4531 deaths to date, and death rates seem to differ among states so far [7]. Attempts have been made to sequence the genomes of Indian clinical isolates to understand genome-wide variability and viral evolution, and over 300 sequences have been deposited in GISAID so far from many Indian states [8, 9]. However, there has been no study to delineate ancestries or to characterize the distribution patterns of viral haplotypes across states. Hence, in this study a total of 305 Indian SARS-CoV-2 genome sequences were used in an effort to understand the evolution of these viruses, trace the routes of infection and gauge the clustering patterns across states.

Phylogenetic analysis was carried out following the median-joining approach using Network

Given the number of variants identified, the diversities across all the ORFs were calculated.
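As a rough illustration of the diversity statistics used in this analysis, the sketch below computes the number of segregating sites, nucleotide diversity (π), Watterson's θ and Tajima's D from a set of aligned sequences, using the standard formulas for Tajima's test. It is not the authors' pipeline: the function name `diversity_stats` and the toy alignment are hypothetical, and the code assumes equal-length aligned sequences with no gaps or ambiguous bases.

```python
from collections import Counter
from math import sqrt


def diversity_stats(seqs):
    """Return (S, pi, theta_w, D) for aligned, equal-length sequences.

    pi and theta_w are totals over the alignment; divide by the alignment
    length to obtain per-site values. Assumes no gaps or ambiguous bases.
    """
    n = len(seqs)
    if n < 4:
        # The variance term of Tajima's D can be zero for fewer sequences.
        raise ValueError("need at least 4 sequences")
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            seg_sites += 1
            # Number of sequence pairs that differ at this site.
            differing = (n * n - sum(c * c for c in counts.values())) / 2
            pi += differing / n_pairs

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = seg_sites / a1
    if seg_sites == 0:
        return seg_sites, pi, theta_w, None  # D is undefined without variation

    # Variance constants of Tajima's test.
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    d = (pi - theta_w) / sqrt(e1 * seg_sites + e2 * seg_sites * (seg_sites - 1))
    return seg_sites, pi, theta_w, d


# Hypothetical toy alignment: one singleton variant among five sequences.
toy = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGTAC", "TCGTAC"]
S, pi, theta_w, D = diversity_stats(toy)
print(S, pi, theta_w, D)
```

In the toy alignment the only variant is carried by a single sequence, so π falls below θ and D is negative, which is the same signature of an excess of low-frequency variants that the text describes.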
ORF7a was found to have the lowest nucleotide diversity while the S ORF had the highest. Overall, the nucleotide diversity (π) values were low across ORFs in comparison to the θ (Watterson's estimator) values (Figure 1D), indicating a higher proportion of low-frequency variants, as shown in Figure 1A. The next objective was to determine if the patterns of diversity could be attributed to genetic drift or neutrality. Tajima's test for neutrality was applied and all the ORFs were found to have negative Tajima's D values (Figure 1D), indicative of non-neutral evolution.

was found to cluster among the clade G viral isolates and its evolving sub-clusters.

All the amino acid changes that were present at ≥1% frequency were also evaluated on the basis of conservation (Table 1). NSP3 amino acid changes were found to be at the most non-conserved sites; however, the changes were predicted to affect protein function. Further, there were three loci where the variants resulted in amino acid changes that were fixed in either

[Figure or table residue listing variant nucleotide positions: 884, 1059, 1281, 1397, 1707, 1820, 2632, 3176, 3742, 4809, 4866, 6081, 6310, 6312, 7392, 8022, 8653, 9438, 10478-10479, 11083, 12685, 13730, 14408, 14425, 16078, 16945, 16993, 20063, 21724, 21792, 21795, 22093, 23277, 23311, 23403, 25563, 25613, 26144, 26467, 27613, 28144, 28311, 28854, 28878, 28881-28883; label: Coding]

and specimens were collected from India on 27th January and 31st January 2020 respectively [9]. It was found that the EPI_ISL_413522 haplotype clustered with the Wuhan-Hu-1 haplotype and the second isolate had two nucleotide changes resulting in an amino acid change in the ORF8

These clade G viruses were also found in greater numbers in states where higher mortality rates were recorded.
The occurrence of higher numbers of mutations might be attributed to altered secondary structure and impaired RdRp proofreading due to the C14,408U (RdRp: P323L) variant, as has been speculated in earlier reports [21, 22]. Additionally, a report on clinical outcomes from Sheffield, England associated the G614 mutation with higher viral loads [23], which might be contributing to the higher mortality rates. However, these implications will have to be tested further with direct correlations using comprehensive clinical and genomic data from all the states.

Supplementary materials
S1

The authors acknowledge the submitters of coronavirus sequence data to the GISAID database, the database managers, developers and scientists associated with GISAID and Prof. Saumitra

References
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
A new coronavirus associated with human respiratory disease in China
A Novel Bat Coronavirus Closely Related to SARS-CoV-2 Contains Natural Insertions at the S1/S2 Cleavage Site of the Spike Protein. Current biology
Viral Metagenomics Revealed Sendai Virus and Coronavirus Infection of Malayan Pangolins (Manis javanica)
Mutations in SARS Cov2 viral RNA identified in Eastern India: Possible implication for the ongoing outbreak in India and impact on viral structure and host susceptibility
Full-genome sequences of the first two SARS-CoV-2 viruses from India. The Indian journal of medical
On the origin and continuing evolution of SARS-CoV-2. National Science Review 2020
MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research
Predicting Deleterious Amino Acid Substitutions
Statistical methods to test for nucleotide mutation hypothesis by DNA polymorphism. Genetics
Analysis across computing platforms
Median-joining networks for inferring intraspecific phylogenies
The novel Coronavirus enigma: Phylogeny and mutation analyses of SARS-CoV-2 viruses circulating in India during early 2020
Analyses of spike protein from first deposited sequences of SARS-CoV2 from West Bengal
Emergence of multiple variants of SARS-CoV-2 with signature structural changes
Phylogenetic network analysis of SARS-CoV-2 genomes
What happened after the initial global spread of pandemic human influenza virus A (H1N1)? A population genetics approach
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
Identification of novel mutations in RNA-dependent RNA polymerases of SARS-CoV-2 and their implications on its protein structure
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2