key: cord-0781857-xa314x4p authors: Kumar, Pramod; Pandey, Rajesh; Sharma, Pooja; Dhar, Mahesh S; Vivekanand, A; Uppili, Bharathram; Vashisht, Himanshu; Wadhwa, Saruchi; Tyagi, Nishu; Fatihi, Saman; Sharma, Uma; Singh, Priyanka; Lall, Hemlata; Datta, Meena; Gupta, Poonam; Saini, Nidhi; Tewari, Aarti; Nandi, Bibhash; Kumar, Dhirendra; Bag, Satyabrata; Deepanshi,; Rathore, Surabhi; Jatana, Nidhi; Jaiswal, Varun; Gogia, Hema; Madan, Preeti; Singh, Simrita; Singh, Prateek; Dash, Debasis; Bala, Manju; Kabra, Sandhya; Singh, Sujeet; Mukerji, Mitali; Thukral, Lipi; Faruq, Mohammed; Agrawal, Anurag; Rakshit, Partha title: Integrated genomic view of SARS-CoV-2 in India date: 2020-06-04 journal: bioRxiv DOI: 10.1101/2020.06.04.128751 sha: 0e90f98d20d4b697b0cfec6f0228b969a6938252 doc_id: 781857 cord_uid: xa314x4p India first detected SARS-CoV-2, causal agent of COVID-19 in late January-2020, imported from Wuhan, China. March-2020 onwards; importation of cases from rest of the countries followed by seeding of local transmission triggered further outbreaks in India. We used ARTIC protocol based tiling amplicon sequencing of SARS-CoV-2 (n=104) from different states of India using a combination of MinION and MinIT from Oxford Nanopore Technology to understand introduction and local transmission. The analyses revealed multiple introductions of SARS-CoV-2 from Europe and Asia following local transmission. The most prevalent genomes with patterns of variance (confined in a cluster) remain unclassified, here, proposed as A4-clade based on its divergence within A-cluster. The viral haplotypes may link their persistence to geo-climatic conditions and host response. Despite the effectiveness of non-therapeutic interventions in India, multipronged strategies including molecular surveillance based on real-time viral genomic data is of paramount importance for a timely management of the pandemic. The rapid worldwide spread of a novel coronavirus following its first appearance in China has pressed the global community to take measures to flatten its transmission (Chan et al., 2020; Zhu et al., 2020) . The virus was named as "SARS-CoV-2" and the disease it causes has been defined as "coronavirus disease 2019" (CSG, 2020) . WHO declared COVID-19 a Pandemic on 11 th of March, 2020 based on the speed and scale of transmission with almost 4.7 million cases reported across the globe, so far (WHO, 18th May 2020). The common signs of infection by SARS-CoV-2 include cough, fever, sore throat, respiratory symptoms inclusive of shortness/difficulties in breathing. More severe symptoms can include pneumonia, severe acute respiratory syndrome, kidney failure and even death with coalescence of factors (Zhu et al., 2020 , Youg et al., 2020 . Many COVID-19 cases have been reported to be asymptomatic and may serve as carrier of SARS-CoV-2 (Xu et al., 2020; He et al., 2020) . Whole Genome Sequences (WGS) of SARS-CoV-2 suggest bat-CoV to be its closest progenitor with 96% homology but, the RNA binding domain (RBD) of its spike protein with an efficient binding to ACE-2, the receptor for SARS-CoV-2 in human cell seems to have been derived from Pangolin-CoVs (Wan et al., 2020 , Andersen et al., 2020 . Next generation sequencing (NGS) aided understanding of evolution of SARS-CoV-2 genomes and its transmission patterns after it enters a new population is proving to be an important step towards formulating strategies for management of this pandemic (Chen and Li et al., 2020) . The first three cases in India were reported from the state of Kerala in late January and early February, with a travel history of Wuhan, China. India took drastic steps to contain the further spread of the virus including imposition of travel restrictions to-and-from the affected countries. There were no new cases of COVID-19 for almost a month. All three cases subsequently tested negative making India free of the disease at that point of time (Press Information Bureau, PIB, The fasta sequences were aligned using MAFFT considering the MN90894.3 version as the reference sequence. Phylogenetic tree has been constructed using the Neighbour joining algorithm as statistical method and Maximum Composite Likelihood as model in MEGA10 software. FIGTREE was used for the graphical visualisation of Phylogenetic analysis (http://tree.bio.ed.ac.uk/software/figtree). Pheatmap and complex heatmap packages from R were used to plot the heatmaps. The Haplotype Network analysis has been done using PopART [Leigh and Bryant, 2015] . The protein-based annotation was performed according to the reference genome of SARS-CoV-2 (NC_045512 in the NCBI database) to categorize the specific amino acid change. Computational protein structure models of SARS-CoV-2 was created to map high frequency mutations on Spike, Nucleocapsid and nsp3 and RdRp (nsp12) proteins. The detailed methodology for the protein based annotation and 3-dimensional protein models have been mentioned in the supplementary materials. The mean (standard deviation) age of the total 127 subjects was 41.4±17.5 years with age range 0.5-76 years and median of 39 years. The gender ratio of male: females in the age group <39 years was 35:28 while the remaining 46 subjects >40 years had the ratio of 58:6. The majority of the SAS-CoV-2 positive samples were obtained from New Delhi covering the national capital region of Delhi, India and few clusters identified as per surveillance team (covering various states of Delhi, Tamil Nadu, Maharashtra, Uttar Pradesh, Andhra Pradesh, West Bengal, Bihar, Orissa, Rajasthan, Haryana, Punjab, Assam and Union territory of Ladakh). The exposure to the COVID-19 was suggestive of travel history of subjects to Europe, West Asia and East Asia. Few subjects were from foreign countries i.e. Indonesia (n=14), Thailand (n=2) and Kyrgyz Republic (n=2). The identified localities of the subjects will further help in molecular surveillance of SARS-CoV-2 in respective geographical regions. The average amplicon coverage for the V3 ARTIC primers used in the study was more than 1000X coverage across the majority of the samples ( Figure 1A) . We observed that 73 out of 98 amplicons had more than 1000X in 90% of the samples. Conversely, there were only 25 amplicons which had less than 1000X coverage in 10% of the samples used in the study. We also looked into whether lower Ct values are a good indicator of genome coverage using a minimal set of virus mapping reads. We plotted genome coverage and average sequencing depth across Ct value of both the genes (E and RdRp). It was observed that higher Ct values (27 onwards) have increased possibility of lower genome coverage ( Figure 1B ) although some lower Ct value samples also had incomplete genome coverage. This may be due to the integrity of the RNA used for sequencing. It was observed that the average sequencing depth was higher (1000X-3000X) for samples with lower Ct values, whereas higher Ct value samples have relative lower average sequencing depth (100X-1000X). This would be important for informed decision making for reads required for ONT based sequencing. We sequenced a subset of samples on orthogonal platforms and sequencing methods (shotgun and amplicon) using ONT and Illumina platform. Significantly, we observed that the genetic variants were common between both the platforms ( Figure 1C ). A total of 104 samples passed the quality threshold for mapping full genome coverage threshold for SARS-CoV-2 genome (accession ID:) <0.05 N content with median coverage-~1500X. Twenty-three samples that did not qualify the threshold criteria were excluded from strain identification. However, few samples were retained for variant analysis wherever the sequencing coverage was sufficiently of high quality for variant calling. The phylogenetic analysis of 104 high quality sequences reveal all the strains to be grouped into two major clades, a sub-clade and other clades ( Cluster-2: In our study large numbers of strains (n=65) belong to this unclassified cluster (as per GISAID and Nextstrain). The strains in this cluster had the predominant variants, G11083T variant (NSP6) (n=65), C13730T in RdRp (n=65), C28311T (N-capsid); n=65, C6312A (NSP3 variant); n=64 (one sequence being called N), C23929T (S-protein), n=50 (other being low depth/N bases). The variant C6310A (NSP3); n=22 being observed as another frequent alteration ( Figure 2B) . We also observed few novel variants, G12685T (NSP8); n=5 and TC1706T (NSP2); n=4, T7621C (ORF1a/b); n=3, A21792T (S protein); n=3, G13920A (NSP12/RdRp); n=2, A16355G (NSP13/Helicase); n=2 and G18803T (NSP14/Exonuclease); n=2 in this cluster. The majority of the key cluster variants 11083, 13730, 28311, 6312, 23929 are also shared in sequences submitted from Singapore region and Brunei (GISAID) and also similar clade sequences were observed in India submitted by National Institute of Mental Health and Neuro-Sciences (NIMHANS) and Gujarat Biotechnology Research Centre (GBRC) cohort. From history, based on the geographical location of the subjects of this cluster, a significant number of natives of Indonesia (n=7) and two each from Thailand and Kyrgyzstan were part of this cluster from our study site. This probably suggests introduction of this particularly from East Asian countries into India. With over represented variants in cluster-2 for variants 11083/13730/28311/6312/23929, we defined this cluster with A4 clade. This has similarity with sequences submitted from Singapore, Brunei and other Indian sequences submitted. The haplotype network analysis suggests that these sequences are having a common origin from East Asia/South-East Asia ( Figure 2C and Figure 3 ). This A4 clade has multiple variants in important region of viral genome, RdRp (A97V), Ncapsid (P12L), NSP3 (T2016K), NSP6 (L37F) and NSP3 (S1197R) variants. In our cohort of samples, the majority of subjects were from Tamil Nadu, Delhi and Indonesia and others were from various other states (Figure 3 ). To provide quantitative insights into the mutant proteins, we characterized amino acid substitutions across the 104 viral genomes. Of the 53 point mutations identified, 29 were missense that resulted in amino acid substitutions. Figure 4 plots the occurrence of these mutations as a function of each viral protein. The frequency of amino acid variations were maximum in nsp6 (L37F) present in 68 genomes, followed by nsp12 (A97V) in 65, nsp3 (T1198K) in 62 and nucleocapsid (P13L) in 53 genomes. Interestingly, D614G mutation in spike protein which is considered as a prevalent global mutation [Korber et al., 2020; Chandrika et al., 2020] , was present in only 26 of the 104 sequenced genomes. Next, we correlated the occurrence of each mutation with the type of amino acid change ( Figure 5A ). Interestingly, ~45% of these mutations showed no change in the nature of the amino acids, suggesting that the viral proteins harbors subtle changes in the protein shape or function. Within high-occurring mutations, P13L, L37F, A97V also showed no major residue alterations. However, there are other loci that involve complete change in amino acid type once mutated such as interchange between polar, hydrophobic or charged. For instance, high frequency positions, including T1198K in nsp3 involve acquisition of a charged group. In addition, the key mutation in spike protein (D614G) also involves loss of the charged group. These mutations that lead to positively charged groups may cause more severe structural and functional effects. We also compared SARS-CoV-2 mutation sites with other six coronavirus sequences ( Figure 5B ). Most of the mutations were present mostly in variable locations. Out of 29 mutations, 10 are present on highly conserved residue locations. Interestingly, higher frequency mutations are at positions that evolve faster/are variable across the coronaviruses except A97L and L37F, which are present on conserved locations. To understand how these mutations are changing the local environment, we mapped higher frequency mutations onto SARS-CoV-2 protein structures, including nucleocapsid, nsp3, nsp12, and spike protein (Figure 6 ). In nucleocapsid, there are two structural domains, the N-and Cterminal connected via a long linker region that facilitates RNA packing [Kang et al., 2020] . We found that all the mutations, including P13L found in 53 genomes, are present outside these domains, specifically, linker regions (S194L, G204R, R203K), and the N-arm of the protein (P13L). The nsp12 is a highly conserved protein with multiple domains. The observed mutations are overlaying onto the interface (P323L) and NiRAN (A97L) region. The latter is critical as it contains a Zn+ binding site, however, little is known about the exact functional output. In contrast, the P323L mutation is present on protein interaction junctions where a hydrophobic cleft is known to bind to inhibitors ( Figure 6B) . This variant will result in loss of kink due to amino acid substitution from proline to leucine, i.e., 5-membered amino acid which resides in the buried area of the protein from to an acyclic amino acid. The nsp3 protein has a similar higher order domain arrangement, and the mutation is present on the NAB domain, which is a nucleic acid binding domain and also interacts with nsp12. This mutation may impact RNA synthesis machinery; however, little is known about its exact mechanism of action. Lastly, the D614G mutation in spike protein is an interesting substitution and has been reported with increased tally [Korber et al., 2020; Chnadrika et al., 2020] . Structurally, this mutation is located in the S1 subunit that also contains the RBD domain. Although present outside the functional region, the proximity of D614G around S1 cleavage site implicates an important change in the local environment. SARS-CoV-2 classified for their preponderance of distribution across geographical locations of India vis-a-vis states of India. Enrichment and depletion of specific clades/classes was observed in certain states of India. We explored the possibility of whether travel history based viral strain showed signs of adding new variants once it diversified in India. This is the first comprehensive genomic picture of the SARS-CoV-2 prevalent in the Indian population during the early phase of outbreaks. The understanding is important keeping in view the vast geographical expanse and population density of India. There were three major waves of viral entry in India associated with multiple outbreaks (Figure 7) . First wave includes importation of SARS-CoV-2 (A2a cluster) through travelers from Europe (Italy, UK, France etc) and the USA. Second wave of SARS-CoV-2 (A3 cluster) was linked with the Middle East (Iran and Iraq). Third wave comprises combined viral (haplotype redefined as A4) entries from Southeast Asia (Indonesia, Thailand and Malaysia) and Central Asia (Kyrgyzstan). The study taken together with other reported genomes (Potdar et al., 2020) revealed A4 cluster (previously unclassified) is the most prevalent in Indian population. Many novel mutations identified may be specific to Indian conditions but more genomic data is needed to strengthen the assumption to rule out sampling bias and other factors (Lu et al., 2020) . During the early phase of the epidemic in India, imported cases followed by local transmission were restricted to urban regions where effective IDSP network, diagnostic support, health care services and timely placed interventions (nationwide lockdown, establishment of containment zones and practicing social distance) interrupted community transfer. The mounting burden of epidemic in urban regions associated with job crunch forced migrant workers to return to their home mostly situated in rural areas which is an impending threat of the viral entries in rural regions (comprising major portion of the Indian population) and expected community transmission. To tackle domestic transmission at larger scale and higher risk of community transfer, preventive measures shall be strengthened in rural regions in addition to safe transportation arrangements for domestic migrants. The national lockdown may have led to evolution/gain of certain variants with potential role in adapting to Indian conditions. This may have given rise to a distinct lineage awaiting its inclusion in Nextstrain (an open-source project to harness the scientific and public health potential of pathogen genome data). A more detailed analysis of these genomes might provide information whether these variations need to be considered during design of diagnostic primers as the need for testing shoots up. It may allow for creation of cost effective panels to trace the movement of lineage specific strains across geographical regions more rapidly and effectively. Lots of efforts are ongoing to identify suitable vaccine candidates through docking studies. These observations are important to consider the variants especially that map to the Indian genomes during such prioritization studies since these strains would now form a major fraction of the genomes that are likely to become more prevalent in India after lockdown. Mapping of these variant genomes in conjunction with the clinical history in terms of recovery, hospitalization and co-morbidity might allow identification of variants that should be actionable and would also have relevance for prognosis. It is imperative that robust genomic data based on large sample size including rural populations with even distribution can bring out the real scenario once correlated with epidemiological data eventually helping in drafting of further management policies. All the authors read and approved the final manuscript. Authors declare no conflict of interests with the present study. Kang, S., Yang, M., Hong, Z., Zhang, L., Huang, Z., Chen, X., He, S., Zhou, Z., Zhou, Z., Chen, Q. et al. (2020) . shows the presence in 2 genomes. The last row depicts the conservation score of that particular mutation of SARS-CoV-2 with 6 other coronavirus. The conservation ranges between 1 to 9 with 9 depicting the highest score. I F I E R , D I S T A N C E = 3 0 1 0 8 7 1 0 1 1 2 2 9 7 2 9 3 ' U T R 3 ' U T R --Q H I 4 2 1 9 9 . 1 M O D I F I E R , D I S T A N C E = 5 5 0 8 4 5 7 1 0 2 6 4 2 8 4 5 ' U T R 5 ' U T R --0 M O D I F I E R 1 3 0 8 1 3 7 5 8 1 1 1 6 7 5 ' U T R 5 ' U T R --Q H D 4 3 4 1 5 . 1 M O D I F I E R , D I S T A N C E = 9 9 0 1 8 3 3 7 1 3 8 3 1 2 0 3 5 ' U T R 5 ' U T R --Q H D 4 3 4 1 5 . 1 M O D I F I E R , D I S T A N C E = 6 3 0 3 1 8 3 5 5 3 6 3 1 3 3 7 N S P 1 o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 2 4 R L O W 0 2 0 8 6 8 6 3 4 1 5 0 7 N S P 1 o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 8 1 -8 6 H G H V M V > H M O D E R A T E 8 7 1 1 0 0 0 1 1 1 5 0 9 N S P 1 o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 8 2 -8 5 G H V M > V M O D E R A T E 0 0 8 7 1 0 0 1 2 1 E n d o R N A s e # N / A --# N / A # N / A 0 0 0 8 6 8 2 4 0 1 2 0 2 5 5 E n d o R N A s e o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 6 6 6 4 -; Q H D 4 3 4 1 5 . 1 : p . 6 6 6 4 D > G M O D I F I E R ; M O D E R A T E 8 7 2 0 0 0 0 2 1 2 0 4 2 9 E n d o R N A s e o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 6 7 2 2 P > L M O D E R A T E 0 1 0 8 7 2 0 1 1 2 0 5 8 0 E n d o R N A s e o r f 1 a b --Q H D 4 3 4 1 5 . 1 : p . 6 7 7 2 V ; Q H D 4 3 4 1 5 . 1 : p . 6 7 7 2 -L O W ; M O D I F I E R 1 3 8 7 1 5 0 3 The proximal origin of SARS-CoV-2 A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission, a study of a family cluster Global Spread of SARS-CoV-2 Subtype with Spike Protein Mutation D614G is Shaped by Human Genomic Variations that Regulate Expression of TMPRSS2 and MX1 Genes SARS-CoV-2, virus dynamics and host response The species severe acute respiratory syndrome-related coronavirus, classifying 2019-nCoV and naming it SARS-CoV-2 NanoPack, Visualizing and processing long-read sequencing data Nextstrain, Real-Time tracking of pathogen evolution Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein