key: cord-0910364-t7cjwuyp authors: Gupta, A.; Sabarinathan, R.; Bala, P.; Donipadi, V.; Vashisht, D.; Katika, M. R.; Kandakatla, M.; Mitra, D.; Dalal, A.; Bashyam, M. D. title: Mutational landscape and dominant lineages in the SARS-CoV-2 infections in the state of Telangana, India date: 2020-08-26 journal: nan DOI: 10.1101/2020.08.24.20180810 sha: 4fbd75a6c38be3ba76d6386ad7f92b4867bedce3 doc_id: 910364 cord_uid: t7cjwuyp

The novel Severe Acute Respiratory Syndrome CoronaVirus 2 (SARS-CoV-2) causing COVID-19 has rapidly turned into a pandemic, infecting millions and causing ~0.7 million deaths across the globe. In addition to the mode of transmission and the evasion of the host immune system, the viral mutational landscape is an area of active research. Analysing this landscape is expected to impart knowledge on the emergence of different clades and subclades, on viral protein functions, on protein-protein and protein-RNA interactions during the replication/transcription cycle of the virus, and on the response to host immune checkpoints. In this study we have attempted to bring forth the viral genomic variants defining the major clade(s) identified from samples collected from the state of Telangana, India.

The outbreak of COVID-19 caused by the Severe Acute Respiratory Syndrome [...] Towards this end, we applied next-generation sequencing to determine the complete sequence of 210 SARS-CoV-2 RNA samples.

CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted August 26, 2020.
https://doi.org/10.1101/2020.08.24.20180810 doi: medRxiv preprint

The Centre for DNA Fingerprinting and Diagnostics (CDFD), Hyderabad, initiated [...] Since RdRp consistently provided more robust amplification than the E gene and is a SARS-CoV-2-specific gene (unlike the E gene, which is common to all respiratory coronaviruses), we considered the Ct values of the RdRp gene alone for analysis. Samples exhibiting an RdRp Ct value greater than 10 and less than 35 were chosen for sequencing.

Sequencing of SARS-CoV-2 RNA samples was performed using the protocol described earlier (Quick et al., 2020) with slight modifications. Briefly, RNA isolated from nasopharyngeal swabs was reverse transcribed using random primer mix (New England Biolabs, Massachusetts, United States) and SuperScript IV (Thermo Fisher Scientific, Massachusetts, United States). The resulting cDNA was subjected to a 3-step multiplex PCR using nCoV-2019/V3 primer pools 1, 2 and 3 (Eurofins, India). The ~400 bp amplicons thus obtained in the pools were combined, purified using Agencourt AMPure XP beads (Beckman Coulter, California, United States) and eluted in 45 µl elution buffer.

All raw FASTQ files from Illumina were checked for overall sequencing quality, presence of adapters and bad-quality reads using FastQC and fastp [4]. The adapter sequences were trimmed using Trim Galore, a wrapper script for the Cutadapt [5] tool. The filtered reads were aligned to the reference strain NC_045512.1, Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, using the bwa-mem [6] algorithm with default parameters. Mapping quality was assessed using samtools [7] and BAMStats.
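The Ct-based selection rule stated above (RdRp Ct greater than 10 and less than 35) can be illustrated with a minimal sketch. The sample identifiers, the Ct values and the function name `select_for_sequencing` are hypothetical and used only for illustration; how missing Ct values were handled is not stated in the text, so skipping them here is an assumption.

```python
# Minimal sketch of the Ct window used to choose samples for sequencing.
# Bounds are taken from the text; everything else is illustrative.
CT_MIN = 10.0  # exclusive lower bound on RdRp Ct
CT_MAX = 35.0  # exclusive upper bound on RdRp Ct


def select_for_sequencing(samples):
    """Return IDs of samples whose RdRp Ct lies strictly inside the window.

    `samples` is an iterable of (sample_id, rdrp_ct) pairs; a Ct of None
    (no amplification recorded) is skipped, which is an assumption.
    """
    return [
        sample_id
        for sample_id, ct in samples
        if ct is not None and CT_MIN < ct < CT_MAX
    ]


# Hypothetical toy input, not data from the study.
toy_samples = [("S1", 9.5), ("S2", 18.2), ("S3", 34.9), ("S4", 35.0), ("S5", None)]
selected = select_for_sequencing(toy_samples)
print(selected)
```

Both bounds are treated as strict inequalities because the text says "greater than 10 and less than 35".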
Post alignment, the reads were filtered, sorted and indexed using samtools, and any primer sequences were masked using iVar [8]. Subsequent mutation calling and generation of the consensus sequence were performed using samtools mpileup and iVar (https://github.com/connorlab/ncov2019-artic-nf/). The resulting VCF files were annotated using snpEff [9]. For processing the nanopore data, we followed the protocol suggested by the ARTIC pipeline (https://github.com/artic-network/fieldbioinformatics, https://github.com/connorlab/ncov2019-artic-nf/) for mutation calling as well as for assembling the reads to generate the consensus sequence. A schematic describing the entire workflow is shown in Supplementary Fig S1. Before analysing the obtained calls, we filtered out all problematic sites that are prone to errors from multiple sources, as recommended by De Maio et al. (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473).

The consensus FASTA files generated for both Illumina and Nanopore data were subjected to phylogenetic analysis using the Nextstrain pipeline with the recommended default criteria for filtering, multiple sequence alignment (MSA) and nucleotide substitution calculations. To briefly summarize the workflow of the pipeline, all consensus sequences with a length < 27,000 nucleotides or with Ns > 5% were filtered out. Three samples were removed from further analysis because their sequence data included Ns > 5%; thus all analyses were conducted on 207 individual patient viral genome sequences.
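The consensus filtering criteria described above can be sketched as follows. This is not the Nextstrain implementation; it is a minimal illustration that reads the two criteria as alternatives (a sequence is dropped if it is shorter than 27,000 nucleotides or if more than 5% of its bases are N). The function name `passes_qc` and the synthetic sequences are hypothetical.

```python
# Minimal sketch of a length / ambiguity filter for consensus genomes.
# Thresholds come from the text; the function itself is illustrative.

def passes_qc(sequence, min_length=27000, max_n_fraction=0.05):
    """Return True if a consensus sequence is long enough and not too ambiguous."""
    sequence = sequence.upper()
    if not sequence or len(sequence) < min_length:
        return False
    n_fraction = sequence.count("N") / len(sequence)
    return n_fraction <= max_n_fraction


# Synthetic toy sequences (29,903 is the length of the Wuhan-Hu-1 genome).
toy_consensus = {
    "full": "A" * 29903,
    "short": "A" * 26000,
    "many_n": "A" * 28000 + "N" * 1903,  # about 6.4% N
    "few_n": "A" * 29000 + "N" * 903,    # about 3.0% N
}
results = {name: passes_qc(seq) for name, seq in toy_consensus.items()}
print(results)
```

In the study, three of the 210 consensus sequences failed the N-content criterion, leaving 207 for analysis.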
A compendium of problematic sites, as used earlier (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473), was also provided to mask those sites prior to MSA by MAFFT [10]. Following MSA, the workflow constructed a time-resolved phylogenetic tree using the maximum-likelihood method IQ-TREE [11]. The resultant tree was pruned, and internal nodes and ancestral traits were inferred from the sample collection dates using TreeTime [12]. The final tree in Newick format was then customized for visualization using iTOL [13].

Our dataset consists of samples collected from late March to July 2020. Interestingly, samples collected from late May until July represented a higher proportion of asymptomatic cases than samples collected earlier (Figure 1a). A majority of our samples belonged to the age group between 15 and 62 years, with males (61%) dominating the profile distribution over females (39%) (Figure 1b). We also compared the distribution of cases with respect to Ct values, the latter being a proxy for viral load. Symptomatic cases appeared to be associated with higher Ct values (and thus lower viral load) than asymptomatic ones, which was unexpected (Figure 1c). There was a reduction in the Ct values of samples as we neared the end of June 2020, implying that more recent samples carried a higher viral load than earlier samples [...] and A23403G (S, D614G). Seven samples belonged to this 20A clade, directly linked to
the dominant viral lineage established in Europe (Belgium, Italy, [...]), of which one exhibited a distinct 20A/18877T profile. This division retained none of the mutations found in the initial samples belonging to the 19A and 19A/C13730T clusters (Supplementary figure S3). From the middle of April onwards, the profile was strongly dominated by the 20B clade, which also formed the second cluster within this division, with the characteristic mutation GGG28881/28882/28883>AAC (Figure 2 and Supplementary figure S4). Thus, of all the clades identified, our dataset was dominated by a single major clade, 20B.

From a time-resolved mutational map calculated for all samples (Supplementary figure S4), we observed that the more recent samples, collected from the end of June onwards, lacked many of the mutations found in ORF1a, namely C5700A (nsp3, A1812D) and C6573T (nsp3, S2103F), as well as C25528T (ORF3a, L46F). These samples also coincided with the most diverged samples along the phylogenetic time-tree, pointing to a newer divergence path of the virus among the later infections. We checked whether these samples belonged to any single cluster within the phylogenetic tree, but they were found to be interspersed among different branches (Figure 2, Supplementary figure S4).

Within the dominant 20B clade, a major proportion of samples from the age group between 15 and 50 years were found to be asymptomatic, while the symptomatic patients mainly belonged to the age group >50 years. As also discussed before, higher viral load, estimated from the Ct values of the RdRp gene, was found to be primarily associated with asymptomatic patients.
This behaviour can explain the higher transmission rates of the virus among the Indian population, in general, as opposed to per capita mortality rates, which will be further elaborated upon in the next section. A location-wise distribution indicating the neighbourhood origin of the samples in the phylogenetic tree has been shown in [...]

We performed Nextstrain analysis separately on our Illumina and nanopore datasets in order to rule out any bias in cluster formation due to the sequencing platform. The clusters obtained did not show any segregation by platform and were distributed across the clades (data not shown).

From the mutation analysis on the filtered, combined pool of 207 sequences from Illumina and Nanopore data, we obtained a total of 302 mutations across the SARS-CoV-2 genome (Supplementary data D1). From this set, 17 mutations were consistently found in >10% of samples (Figure 3a). For each of these 17 high-frequency mutations, the proportion of asymptomatic cases, as a fraction of the total number of cases, was higher than that of symptomatic cases (Figure 3b). [...] (Figure 3a). The A23403G (D614G) mutation in the Spike protein was identified in samples as early as the beginning of April 2020. Although it is a highly recurrent mutation in multiple demographics, no clear correlation has been established between the D614G mutation and severity of disease [14]. The D614G mutation has almost invariably been found to be associated with C241>T, C3037>T (a silent mutation) and a mutation in the RNA-dependent RNA polymerase gene, nsp12 C14408>T, as has also been reported earlier [15].
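The selection of high-frequency mutations described above (those present in more than 10% of samples) amounts to a per-mutation count across samples. The sketch below assumes mutation calls are already available as one set of mutation labels per sample; the function name `recurrent_mutations` and the toy call sets are hypothetical and do not reproduce the study's data.

```python
from collections import Counter


def recurrent_mutations(calls, min_fraction=0.10):
    """Return {mutation: fraction of samples} for mutations above the cutoff.

    `calls` maps a sample ID to the set of mutation labels called in it.
    A mutation is kept only if its fraction is strictly greater than
    `min_fraction`, matching the ">10% of samples" wording in the text.
    """
    n_samples = len(calls)
    if n_samples == 0:
        return {}
    counts = Counter(m for muts in calls.values() for m in set(muts))
    return {
        m: c / n_samples
        for m, c in counts.items()
        if c / n_samples > min_fraction
    }


# Toy example with 10 hypothetical samples (labels borrowed from the text,
# counts invented purely for illustration).
toy_calls = {f"sample{i}": set() for i in range(10)}
for i in range(8):
    toy_calls[f"sample{i}"].update({"A23403G", "C14408T"})
for i in range(2):
    toy_calls[f"sample{i}"].add("C9693T")
toy_calls["sample9"].add("C25528T")  # 1 of 10 = exactly 10%, so excluded

frequent = recurrent_mutations(toy_calls)
print(sorted(frequent))
```

The same per-sample sets can be intersected to check how often a group of mutations, such as the four discussed here, occurs together in one genome.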
The haplotype defined by the co-occurrence of these four mutations is the current dominant form circulating across the world. Apart from these four mutations, the nsp3 protein region within ORF1a also displayed a higher frequency of mutations, namely G4354A, A4372G, C5700A, C6027T and C6573T, the latter three being missense mutations. Of these, C5700A, along with a silent C313T mutation, has been reported to co-occur in samples collected from the western state of Maharashtra, India [16]. The nsp4 and nsp5 proteins each harboured one high-frequency missense mutation, namely C9693T and C10815T, respectively. Similarly, the ORF3a region possessed one high-frequency missense mutation, C25528T.

In the current study, we have presented a comprehensive map of the mutations identified from confirmed COVID-19 cases collected from the southern Indian state of Telangana. After a slow progress of the outbreak during the months of February-April, the state has been witnessing a constant upsurge in the number of infections and has been listed as one of the worst affected states in the country. Identifying the mutations from samples collected over a period of time provides a way to assess the genomic diversity that the virus might have acquired during infection and transmission. With these aspects in our purview, we have attempted to characterize the genomic epidemiology of the novel coronavirus and a comprehensive mutational landscape, using a dataset of 210 samples sequenced using both Illumina and Nanopore sequencing technologies. Most of our samples were collected from mid-March, with the months of June-July reporting the highest collections.
A significantly higher proportion of samples was associated with asymptomatic presentation, and these samples were also associated with lower Ct values. We also observed an upsurge in asymptomatic cases compared to symptomatic ones. A majority of our samples belonged to the 20B clade, which seemed to appear at the beginning of April. Although we did not see a clear correlation between age and viral load, samples collected from higher age groups frequently displayed symptomatic behaviour. More importantly, mutational analysis revealed the presence of unique mutations in the samples from Telangana, especially in nsp3, nsp4, nsp5 and ORF3a. nsp3 harbours a papain-like protease (PLP2) and nsp5 is a 3C-like protease (3CLpro); both are required for cleavage of the polyproteins pp1a and pp1ab to generate 16 non-structural proteins (nsp1-16) [17]. Nsp3 is the largest multidomain protein encoded by SARS-CoV-2, and binds to viral RNA and the nucleocapsid protein [18].
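The amino-acid labels attached to several nucleotide mutations in the text (for example A23403G as S D614G, or C25528T as ORF3a L46F) follow from the position of the nucleotide within its coding region. The sketch below shows that arithmetic only. The coding-region start coordinates are an assumption taken from the NC_045512.2 annotation of Wuhan-Hu-1 and are not given in the text; the names `CDS_START` and `codon_of` are hypothetical.

```python
# Assumed 1-based start coordinates of coding regions on Wuhan-Hu-1
# (NC_045512.2 annotation); these values are not stated in the text.
CDS_START = {"ORF1a": 266, "S": 21563, "ORF3a": 25393}


def codon_of(position, gene):
    """Map a 1-based genome position to (codon number, position within codon)."""
    offset = position - CDS_START[gene]
    if offset < 0:
        raise ValueError(f"position {position} lies upstream of {gene}")
    return offset // 3 + 1, offset % 3 + 1


# Mutations named in the text, with the gene each one falls in.
examples = {
    "C5700A": ("ORF1a", 5700),     # reported as nsp3 A1812D
    "C6573T": ("ORF1a", 6573),     # reported as nsp3 S2103F
    "A23403G": ("S", 23403),       # reported as S D614G
    "C25528T": ("ORF3a", 25528),   # reported as ORF3a L46F
}
codons = {name: codon_of(pos, gene) for name, (gene, pos) in examples.items()}
for name, (codon, within) in codons.items():
    print(f"{name}: codon {codon}, codon position {within}")
```

Under these assumed coordinates the computed codon numbers (1812, 2103, 614 and 46) agree with the residue numbers quoted in the text, where the nsp3 changes are numbered along the ORF1a polyprotein.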
Supplementary figure S1: Detailed workflow of the methodology and various stages.
Supplementary figure S4: Time resolved phylogenetic tree created using Nextstrain.

Supplementary figure S5: Phylogenetic tree with local distribution of samples.
1. The genetic sequence, origin, and diagnosis of SARS-CoV-2
2. The Architecture of SARS-CoV-2 Transcriptome
3. Identification of Coronavirus Isolated from a Patient in Korea with COVID-19
4. fastp: an ultra-fast all-in-one FASTQ preprocessor
5. Cutadapt removes adapter sequences from high-throughput sequencing reads
6. Fast and accurate short read alignment with Burrows-Wheeler transform
7. The Sequence Alignment/Map format and SAMtools
8. An amplicon-based sequencing framework for accurately
12. Maximum-likelihood phylodynamic analysis
13. Interactive Tree Of Life (iTOL) v4: recent updates and new developments
14. The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity
15. Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
16. Phylogenomic analysis of SARS-CoV-2 genomes from western India reveals unique linked mutations
17. COVID-2019: The role of the nsp2 and nsp3 in its pathogenesis
18. Nsp3 of coronaviruses: Structures and functions of a large multi-domain protein
19. The SARS-Coronavirus PLnc domain of nsp3 as a replication/transcription scaffolding protein
20. Severe acute respiratory syndrome Coronavirus ORF3a protein

We are thankful to Drs Rashna Bhandari and R Harinarayanan, CDFD, Hyderabad, for co-ordinating the establishment of the COVID-19 testing laboratory at CDFD.
All volunteers and 'COVID warriors' from CDFD, Hyderabad, are gratefully acknowledged for [...]