key: cord-0889766-fe2pb042
authors: Pattabiraman, Chitra; Habib, Farhat; P. K., Harsha; Rasheed, Risha; Prasad, Pramada; Reddy, Vijayalakshmi; Dinesh, Prameela; Damodar, Tina; Hosallimath, Kiran; George, Anson K.; Kiran Reddy, Nakka Vijay; John, Banerjee; Pattanaik, Amrita; Kumar, Narendra; Mani, Reeta S.; Venkataswamy, Manjunatha M.; Shahul Hameed, Shafeeq K.; Kumar B. G., Prakash; Desai, Anita; Vasanthapuram, Ravi
title: Genomic epidemiology reveals multiple introductions and spread of SARS-CoV-2 in the Indian state of Karnataka
date: 2020-12-17
journal: PLoS One
DOI: 10.1371/journal.pone.0243412
sha: 9fce2c4425ace602f4afe1bf643cf396f47536e3
doc_id: 889766
cord_uid: fe2pb042

Karnataka, a state in south India, reported its first case of Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) infection on March 8, 2020, more than a month after the first case was reported in India. We used a combination of contact tracing and genomic epidemiology to trace the spread of SARS-CoV-2 in the state up until May 21, 2020 (1578 cases). We obtained 91 genomes of SARS-CoV-2 which clustered into seven lineages (Pangolin lineages—A, B, B.1, B.1.80, B.1.1, B.4, and B.6). The lineages in Karnataka were known to be circulating in China, Southeast Asia, Iran, Europe and other parts of India and are likely to have been imported into the state both by international and domestic travel. Our sequences grouped into 17 contact clusters and 24 cases with no known contacts. We found 14 of the 17 contact clusters had a single lineage of the virus, consistent with multiple introductions and most (12/17) were contained within a single district, reflecting local spread. In most of the 17 clusters, the index case (12/17) and spreaders (11/17) were symptomatic. Of the 91 sequences, 47 belonged to the B.6 lineage, including eleven of 24 cases with no known contact, indicating ongoing transmission of this lineage in the state. Genomic epidemiology of SARS-CoV-2 in Karnataka suggests multiple introductions of the virus followed by local transmission in parallel with ongoing viral evolution. This is the first study from India combining genomic data with epidemiological information emphasizing the need for an integrated approach to outbreak response.

Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), a novel coronavirus that was first detected in individuals with acute pneumonia in China in late 2019, has now spread throughout the world [1] . The World Health Organization (WHO) on March 11 declared the a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 TruFactor-InMobi Group provided support in the form of salary for author FH, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. The specific roles of FH are articulated in the 'author contributions' section.

Competing interests: FH is an employee of TruFactor-InMobi group company, however, he was permitted to participate as a volunteer in this study and his employer neither had access to data, nor any say in the design of the study or the decision to publish. This does not alter our adherence to PLOS ONE policies on sharing data and materials. All other authors are employees of state (PD, PK) or central government. The employers had no role in the design of the study or the decision to publish. The authors declare that they do not have any other financial or nonfinancial relationships that could present a conflict of interest.

list of positive patients was provided by the Directorate of Health and Family Welfare Services, Karnataka and missing data was filled in from the ICMR portal. The line list contained detailed epidemiological workup of each sample including information on age, sex, clinical signs and symptoms, location, contacts, description of exposure type, sampling dates, date of hospitalization and follow up of hospitalized cases. A minimal anonymized data set is provided in S7 

Samples received for testing at NIMHANS were subjected to RT-PCR based on ICMR guidelines [19] . A total of 21,418 samples were tested in NIMHANS (April 5, 2020-May 20, 2020), 369 of these were positive and 101 were included for sequencing. The criteria for sequencing was RT-PCR positive samples with Ct values under 30. Samples with Ct value >30 were included when they were considered critical for representing a cluster or if they were from symptomatic individuals. We used a tiling primer based approach for whole genome sequencing described by the ARTIC Network using Primal Scheme [9, 20] . Briefly, we used the V3 primer set-these are 96 pairs with amplicons of about 400 basepairs (bp) spanning the whole genome except 31bp of the 5' and a part of the 3'UTR. PCR was performed by pooling adjacent/overlapping primers into different pools so as to prevent preferential amplification of short fragments between adjacent primer pairs. Primers were initially used at a concentration of 10μM as per the protocol, then modified to amplify regions that were missed by increasing primer concentrations to 50μM. For four regions additional primers were designed (S1 Appendix) and a separate reaction was set up before the pooling step in order to complete the genome. The resulting PCR amplicons were used for preparing libraries for Nanopore sequencing using the native barcoding (NBD 104/114) approach combined with the ligation sequencing kit (SQK-LSK109). Between 12-24 samples were barcoded and included in a single run. The resulting DNA was cleaned up and added to FLO-MIN-106 flow cell and sequenced on the MinION.

Sequences were basecalled and demultiplexed using guppy (v3.6), read lengths between 100-600bp were considered for further analysis. Primers were removed from the sequencing reads by trimming 25bp at the ends and additional trimming based on alignment using BBDuk (v38.37). Resulting reads were mapped to the RefSeq strain (NC_045512) using Minimap2 (v2.17) within Geneious Prime (Geneious Prime 2020.0.3). A consensus was created for regions with coverage >10x using the 0% majority rule and corrected. Consensus was aligned to the reference to ensure the correct reading frame and annotation was transferred from the reference. A schematic of the workflow is provided in S1 Appendix. We obtained 91 genome sequences of these 75 were full-length at 10X (>90% genome coverage) and 16 were partial (>60% genome coverage). We obtained an average of 0.2 million sequencing reads per sample with an idealized coverage of 2810x (S1 Table) . SARS-CoV-2 sequences were deposited into the GISAID database and GenBank (NCBI) Sequencing reads have been deposited in the Sequence Read Archive (SRA) under the BioProject PRJNA670824. Accession numbers are provided in S8 Table. Phylogenetic analysis, lineage assignment, detection of SNPs and amino acid replacements Consensus sequences from the 91 genomes from this study were aligned with the reference genome using MUSCLE (v 3.8.425) [21] . The multiple sequence alignment was used to infer the phylogeny using iqtree (v 1.6.12) [22] . A total of 88 DNA models were tested, and the GTR +F+I model was selected based on the Bayesian Information Criterion. Maximum likelihood tree was constructed as the consensus tree from 1000 bootstraps, including the reference sequence (NC_045512) and hCoV19/Wuhan/WH04/2020 as the outgroup. Nodes with bootstrap values >80% were used for interpretation. Time scaled phylogenies were constructed using TreeTime with the multiple sequence alignment described above and the date of sampling as dates. Pangolin lineage assignments were performed using the online tool (Pangolin v2.07 lineages version 2020-08-29) [23] . Lineage assignments from Pangolin were compared with the lineage assignments from the maximum likelihood tree, in case of a discrepancy, sequences from sub-lineages were collapsed to their parent lineage or reassigned based on lineage of sister clades (S6 Table) . Single nucleotide polymorphisms (SNPs) and amino acid replacements were detected using the CoV-Glue web application [24] . Both tools use sequences submitted to the GISAID database [6] .

The epidemiological data was extracted from the line list and a contact map was constructed using the state line list of positive cases. We identified primary and secondary contacts for a patient from the line list. We then built a graph where each node is a positive individual and is connected by an edge with their contacts who were positive. This gives us the contact map. The graphs were then filtered by size of clusters or clusters containing a node with a particular property to focus on clusters of interest. The graphs are visualized using the d3.js [25] library with attributes like clinical state, lineages, and geographical location indicated by colours of the nodes.

Karnataka recorded 1578 cases between March 5-May 21, 2020. Most of these cases were from six high burden districts, with Bangalore Urban (the district encompassing the city of Bengaluru) reporting 256 cases (Fig 1) . In total 369 (23.38%) positives were recorded at our centre, of which 101 samples were taken for sequencing (Fig 1, Table 1 ). The features of positive cases in the Karnataka, and those tested and sequenced at our centre are in Table 1 . Most of the positive individuals (1133/1578: 71.8%) in the state were below 40 years of age. More males than females tested positive (987/1578: 62.5%). Amongst the positive individuals in the state, 84.35% were asymptomatic (at sample collection). A total of 87.5% cases (1380/1578) had contact with a known COVID-19 case or travel history. Amongst 369 positive cases tested at our centre, 168 did not have a known contact as of May 21, 2020. We included 24 of these 168 for sequencing.

Overall 101 samples were sequenced, the Ct values of these samples ranged from 14.5.1-37.85, 91 genomes were obtained. For these 91 samples, the average number of reads was 212322 reads, the average depth of coverage was over 3000x (S1 Table) and 90.48 (average) of the sequenced reads mapped to the reference genome (S1 Table, (14), B.4(9) and B.6(47) (Fig 2A) . Six of the seven lineages were apparent by maximum likelihood based phylogeny with bootstrap supports of >80%. The seventh distinct clade has sequences from both lineage B and B.6 clustered together in this analysis ( Fig 2B) . A time scaled maximum likelihood phylogeny of the genomes with the reference sequence shows that the lineages branched out at different time points (Fig 2C) with defining mutations (S2 Fig, S2 Table) .

Overall 154 Single Nucleotide Polymorphisms (SNPs) were identified in the 91 genomes in comparison to the reference sequence (S2 Table) . Proportion and position of the SNPs are shown (Fig 2) . A total of 100 amino acid replacements were identified (S2 Fig, S3 Table) .

Amongst the 24 sequences from individuals with no known contact ( Table 2 , Fig 3) , four clustered into lineage A and reported travel to other parts of India. Of the remaining 20, six 

sequences that clustered into lineages B.1.80, B.1, and B.1.1 (two each) also had a history of travel within India. An additional sequence from an individual with no known contact was also assigned the lineage B.1. All five cases (of the 24) which were under investigation belonged to lineage B.6. In addition to this, B.6 was assigned to two individuals with history of travel within India, three sequences from symptomatic individuals and one with contact unknown. In addition, one sequence from a symptomatic individual with SARI and from an individual with no known contact clustered into the B lineage.

Analysis of contact information in the state line list, revealed that 822 of 1578 cases could be assigned into 104 contact clusters (S4 Table) . Of these 104 clusters, 38 clusters were tested at our centre and 17 of the 38 were included for sequencing. These 17 included 309 people and covered ten large clusters (>5 individuals) from the state (S4 Table) .

Of the 17 clusters (C1-C17), C3, C 5-9, C13-17 had sequences which were only assigned to lineage B.6. The cluster C4, had sequences assigned to B.1.80 and clusters C11-12 had sequences assigned to B.1. Three clusters had sequences assigned to more than one lineage. 

Introduction and spread of SARS-CoV-2 in the Indian state of Karnataka Table) .

Most of the cases in the state were asymptomatic (1331/1578) ( Table 1 , Fig 3B) . Analysis of information pertaining to the index cases (earliest detected individuals) from the 17 clusters revealed the following-12 of the 17 (70.5%) were symptomatic, of these seven presented with SARI, two with influenza like illness (ILI), three symptomatic individuals had history of interstate travel (S4 Table, Fig 3) Further, analysis of individuals with maximum number of contacts (spreader) within a cluster revealed that 11/17 were symptomatic (S4 Table, Fig 3) .

The location of the clusters was analysed using a contact graph (Fig 3C) . Most clusters (12/ 17) were limited to a single district excepting clusters C1, C2, C3, C5, and C11 which were spread across districts. Time course of the ten largest sequenced clusters showed that cluster C1, C9 and C10 had no cases since May 1, 2020. Clusters C7 and C4 had no linked cases since May 7, 2020. Ongoing transmission was noted for the other clusters (Table 2, Fig 4, S5 Table) .

SARS-CoV-2, the virus causing COVID-19, has now spread throughout the world. Despite restricting travel from affected countries early in the pandemic, India started reporting cases of COVID-19 by January 30, 2020 and sustained local transmission was observed in multiple states including Delhi, Maharashtra, and Gujarat [12] . SARS-CoV-2 was first detected in the South Indian state of Karnataka on March 8, 2020 and by May 21, 2020 it had spread to 28 out circulating lineages. Viruses from both lineages are now circulating in different countries of the world [23] .

In this study, 4 of the 91 sequences, belong to lineage A and were from individuals with travel history to other states within India. This lineage is defined by two SNPs T8782C and C28144T and has been reported from Saudi Arabia, Russia, Turkey, and India [23] . No onward transmission was reported from these four cases, however they indicate continued importation of SARS-CoV-2 into the state emphasizing the need for active surveillance of domestic travel.

Of the lineages in the state, B.1 (related to GISAID clade G, and Nextstrain clade A2a) and B.1.1 (related to GISAID clade GR and Nextstrain clade A2a) are European clades. Both lineages harbour the D614G mutation on the Spike protein. It has been suggested that viruses with this mutation are more infectious and the mutation was present at higher frequency in samples across the world [17, 26, 27] . Of these two lineages, B.1 was a major contributor to the Italian outbreak [23] .

The largest cluster in the state, C1 (comprising of 67 individuals) had four sequenced samples belonging to lineage B.1 and three sequenced samples belonging to lineage B.1.1 (Nextstrain clade A2a). Lineage B.1.1 is defined by three additional (to B.1) SNPs-G28881A, G28882A, G28883C [23] . The cluster C1 was initially restricted to Mysuru and subsequently spread to Vijayapura and Mandya (Fig 3, S5 Table) . Of the 67 cases in this cluster, 16 were symptomatic with the index case being a SARI patient (Fig 3) . No new cases could be linked to this cluster after May 1, 2020 up until the conclusion of this analysis (May 21, 2020), suggesting that it had been contained.

The index case of this cluster had no history of international travel but was an employee of a company which had a number of international visitors until mid-February 2020, including visitors from Europe. The presence of two lineages is indicative of multiple introductions of the virus in this cluster. Additionally, sequences from five individuals with no known contacts with a positive case were classified into B.1 and B.1.1 ( Table 2) . Of these four, two each of B.1 and B.1.1 had known history of travel to other states. Highlighting that these lineages are circulating within India and continue to be imported into the state. One other sequence from Mysuru district was also assigned to lineage B.1. However, this sequence clustered separately from all the others in the phylogeny and therefore may represent a separate introduction.

In our study, twelve sequenced samples from a large cluster, C4 (comprising of 35 individuals), belonged to a sub-clade of B.1, B.1.80 which has been sampled from India, Australia, and Luxembourg (Fig 3, S4 , S5 and S7 Tables). This cluster was restricted to Bengaluru city and no new cases were reported from it between 7-21 May (Fig 4) . The index case for this cluster was a patient with SARI. Two sequences from individuals with recent travel within India also belonged to this lineage. These data suggest that while the initial C4 cluster in the state appears to the be contained, this lineage continues to be imported via domestic travel.

Yet another large cluster (C2) consisting of 50 patients and largely restricted to a quarantine centre in the city of Bengaluru had two lineages B.4 and B.6. Of the sequenced samples from this cluster, nine were assigned to lineage B.4 (Nextstrain A3). B.4 is a clade which was first associated with travellers from Iran and has been sampled from UK, Australia, and India [23] . One sequence from this cluster was from lineage B.6. This cluster merits further analysis as lineage B.4 was unique to this cluster. The presence of two lineages in this cluster have two possible explanations. Either there were dual introductions of the virus (different lineages) into the cluster or that the two lineages are actually part of two clusters. The isolated B.6 case (Figs 3 and 4) and the cluster of B.4 cases (Figs 3 and 4) suggests that the two sets may be epidemiologically unlinked.

Lineages B/B.6 were assigned to the largest number of sequences (50/91) sequences in this study. Using a maximum likelihood based approach we were unable to completely separate B and B.6 as some branches had sequences from both clades (Fig 2) . B is the parent lineage for B.6 and it is possible that these three sequences lack the information (S1 Table) for complete classification.

Further, 13 of the 17 clusters in this study and 13 of the 24 cases with no known contact (54.2%) belonged to lineages B/B.6 or both. Lineage B is one of the two clades that were circulating in China in late 2019. Lineage B.6 has earlier been reported from Philippines, UK, North America, Australia, Singapore and has also been reported from other parts of India(16) (Pangolin). The defining mutations for these lineages are similar to that of the A3i clade which has been described as a distinct phylogenetic group in India [16] . Indeed, up to a third of the cases in multiple states across the country belong to the A3i clade [16] . These lineages were detected throughout the study period and across the state, including sequences from two domestic travellers (Fig 4) . In Bengaluru city, three clusters C7, C9 and C14 (S5 Table) , as well as three symptomatic, epidemiologically unlinked cases were assigned this lineage (Table 2) . Overall, our analysis suggests that the B/B.6 lineage is now established and sustained by local transmission in the state with continued importation from other parts of India.

One of the notable features in this study was the ability to assign virus lineages to cases with no known history of contact with a positive individual ( Table 2 ). This underscores the utility of genomic epidemiology in filling the gaps of identifying the source of infection.

Some studies had initially proposed a link between viral lineages, transmission, and disease phenotypes which have not been substantiated by experimental evidence [28] . The analysis of sequences obtained from symptomatic and asymptomatic (at the time of testing) individuals in this study did not reveal any association with a particular lineage. Symptomatic individuals were spread across lineages B.1, B.1.80, and B/B.6 along with asymptomatic individuals (Fig 2,  S2 Fig). Of the 17 clusters represented in the sequencing data, both the index case and the spreaders were more often symptomatic (Fig 3, S4 Table) . However, sequencing did not reveal any mutations that were specifically associated with clinical state.

Our study was undertaken during the early part of the pandemic during which Karnataka had recorded 1578 positive cases (Fig 4, S7 Table) . Within the study period, there was an early ban of international passenger travel. Karnataka has two major international airports (at Bengaluru and Mangaluru) and our data (based on ten of the thirty clusters of >5 individuals in the state) suggests that the tracing and containing of these cases was effective. A nationwide closure of air and rail routes followed. This, however, seems to have been incompletely effective in preventing importation of cases due to domestic travel. In spite of nationwide lockdowns (Fig 4) and quarantining of migrant workers, cases continued to be imported into Karnataka. These observations are consistent with wide-spread circulation of the virus in some states of the country. They emphasize the need for screening and quarantining of travellers as restrictions are relaxed, in addition to the follow up of ongoing transmission in the state (Fig 4) .

Our study had the following limitations-it is a single point analysis and some follow-up data is not available, for instance we do not know if individuals who were asymptomatic at testing later developed symptoms. Further, lineage assignments during an outbreak are dynamic and could change as more data is added and sequencing errors are accounted for. Notwithstanding these limitations, our analysis provides insights about introduction, spread, and establishment of SARS-CoV-2 in Karnataka. Further, we were able to capture both geographic diversity and obtain representation from ten large contact clusters in the state. This was made possible by linking epidemiological information to genomic data. Integrating such

A novel coronavirus from patients with pneumonia in China

WHO Situation Report

Richard Neher TB 1. Genomic analysis of COVID-19 spread

Richard Neher TB 1. Genomic analysis of COVID-19 spread

disease and diplomacy: GISAID's innovative contribution to global health. Glob Challenges

Phylodynamic Analysis | 176 genomes | 6

Spread of SARS-CoV-2 in the Icelandic population

Genomic Epidemiology of SARS-CoV-2 in Guangdong Province

Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States

Cryptic transmission of SARS-CoV-2 in Washington State. medRxiv

Laboratory surveillance for SARS-CoV-2 in India: Performance of testing and descriptive epidemiology of detected COVID-19

Full-genome sequences of the first two SARS-CoV-2 viruses from India

Genomic analysis of SARS-CoV-2 strains among Indians returning from Italy, Iran, and China, and Italian tourists in India

NextStrain: Real-time tracking of pathogen evolution

A distinct phylogenetic cluster of Indian SARS-CoV-2 isolates

Analysis of RNA sequences of 3636 SARS-CoV-2 collected from 55 countries reveals selective sweep of one virus type

Genomics of Indian SARS-CoV-2: Implications in genetic diversity, possible origin and spread of virus

nCoV-2019 sequencing protocol v2 V.2 [Internet

MUSCLE: multiple sequence alignment with high accuracy and high throughput

IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies

A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology

CoV-GLUE: A Web Application for Tracking SARS-CoV-2 Genomic Variation

D3 data-driven documents

Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2. bioRxiv

The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shedding and increases infectivity

No evidence for distinct types in the evolution of SARS-CoV-2. Virus Evol

This work would not have been possible without the support of the COVID-19 diagnostic team at the Department of Neurovirology, NIMHANS-Ashwini MA, Ruthu Nagraj, Mahesh, Stiben R, Suman Das, Raghavendra Setty TK, Srinivasa R, Tanmoy Nandi, Sourabh Suran, Priti Das, Pallavi SJ, Sathyapriya M, Arpita N Maladkar, Srilatha Marate, Kamala SJ, Gayathri Devi, Kavitha S, Sandhya Rani, Rashmi Kumari, Kumar V, Prasad R, Raja G, Shivakumar V, Jothi Kala C and the data entry team from the State.We would like to thank the District Surveillance Units and Virus Research and Diagnostic Laboratories (VRDLs) across the state for sample collection, transport and testing.We gratefully acknowledge the contributions of all the laboratories that have submitted their sequences to GISAID, in particular laboratories across India that have been involved in sequencing efforts.

Conceptualization: Chitra Pattabiraman, Shafeeq K. Shahul Hameed, Anita Desai, RaviVasanthapuram. 

an approach, in real time, into public health measures is essential for an effective outbreak response. Table. Position of the reference sequence (NC_045512) and an early sequence from Wuhan (hCoV19/Wuhan/WH04/2020) is also shown on the tree (A). Frequency of amino acid replacements across the genome in different lineages are shown. Gene boundaries are shaded in blue. Details of the changes are provided in S3 Table (B) . Root to tip regression analysis of sequences in this study is shown. Mutation rate is estimated at 8.764e-04 mutations/site/year with r^2 = 0.24 (C). Note: While the temporal signal in this data is weak, analysis of larger datasets by others is suggestive of a molecular clock. (PDF) S1