key: cord-1001048-3n5cnepq
authors: Tamim, Sana; Trovao, Nidia S.; Thielen, Peter; Mehoke, Tom; Merritt, Brian; Ikram, Aamer; Salman, Muhammad; Alam, Muhammad Masroor; Umair, Massab; Badar, Nazish; Khurshid, Adnan; Mehmood, Nayab
title: Genetic and evolutionary analysis of SARS-CoV-2 circulating in the region surrounding Islamabad, Pakistan
date: 2021-07-14
journal: Infect Genet Evol
DOI: 10.1016/j.meegid.2021.105003
sha: d3249da3d63ac2b260c8e9dc9e907159e243809d
doc_id: 1001048
cord_uid: 3n5cnepq
Genomic epidemiology of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) has provided global epidemiological insight into the COVID-19 pandemic since it began. Sequencing of the virus has been performed at scale, with many countries depositing data into open access repositories to enable in-depth global phylogenetic analysis. To contribute to these efforts, we established an Oxford Nanopore Technologies (ONT) sequencing capability at the National Institute of Health (NIH), Pakistan. This study highlights multiple SARS-CoV-2 lineages co-circulating during the peak of a second COVID-19 wave in Pakistan (Nov 2020-Feb 2021), with virus origins traced to the United States of America and Saudi Arabia. Ten SARS-CoV-2 positive samples were used for ONT library preparation. Sequence and phylogenetic analysis determined that the patients were infected with lineage B.1.1.250, originally identified in the United Kingdom and Bangladesh during March and April of 2020, and in circulation until the time of this study in Europe, USA and Australia. Lineage B.1.261 was originally identified in Saudi Arabia with widespread local dissemination in Pakistan. One sample clustered with the parental B.1 lineage and the other with lineage B.6 originally from Singapore.
In the future, monitoring the evolutionary dynamics of circulating lineages in Pakistan will enable improved tracing of viral spread and of changes in lineage expansion trajectories, persistence, and demographic dynamics, and will provide guidance for better implementation of control measures.
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) was first identified in Wuhan, China, in December 2019 and causes mild to severe respiratory symptoms. Severe disease is more frequent in patients with comorbidities or old age, and may require hospitalization (Huang, Wang et al. 2020; Wu, Li et al. 2020) or lead to death. On March 11th, 2020, the World Health Organization (WHO) declared coronavirus disease 2019 (COVID-19) a pandemic (World Health Organization 2020). The pandemic represents the third major human outbreak caused by viruses belonging to the Coronaviridae family, Betacoronavirus genus, and to date has caused more than 121 million confirmed cases and 2.7 million deaths worldwide (World Health Organization, 2020). The initial seeding of COVID-19 cases occurred in countries receiving travelers from China, and later cryptic transmission events resulted in local spread of the virus (Bedford, Greninger et al. 2020; Nabeshima, Takazono et al. 2021).
Pakistan's first case was reported on February 28th, 2020, and was imported by a traveler from Iran. Subsequently, sporadic cases were observed in patients with a history of travel from Iran and the Middle East. The first wave of viral transmission in Pakistan peaked between May and September 2020. Building the health sector's capacity for diagnosis and early detection for rapid outbreak response was a main priority during the first wave of the COVID-19 pandemic in Pakistan. As of July 17th, 2020, Pakistan had a total testing capacity of 71,780 tests per day, with 133 testing laboratories nationwide [4] (www.dw.com/en/pakistan-coronavirus-testing/a-54221822).
The containment response shifted from a complete lockdown to "smart lockdowns" in 20 major cities across Pakistan identified as coronavirus hotspots, and contact tracing proved to be only a temporary solution as an epidemiological strategy for tracking COVID-19 cases. In November 2020, the number of SARS-CoV-2 cases started surging again, and as of February 4th, 2021, more than 550,000 SARS-CoV-2 cases had been confirmed in Pakistan, with an average of 1,000 to 5,000 [...] (Tegally H 2020) and Brazil (P.1) (Japan. 2021), with a suspected increase in transmissibility (50-70%) (Rambaut A 2020), has alarmed national and international public health groups and emphasized the significance of genomic surveillance in addition to laboratory investigation.
In this study, we established the ARTIC network protocol for full genome sequencing of SARS-CoV-2 isolates in Pakistan using the Oxford Nanopore Technologies MinION platform. The sequences obtained from clinical specimens were analyzed for their genetic variation, evolutionary history and spatio-temporal dynamics.
Nasopharyngeal swab samples were collected from symptomatic patients from the major tertiary care hospitals in Islamabad as part of the laboratory-based COVID-19 surveillance during the month of December 2020 (Table 1). Samples were received at the Department of Virology, NIH, on the date of collection for diagnostic purposes. The use of human specimens was approved by the Institute's Research Committee, which, given the urgency of the pandemic, waived written consent requirements for viral genome sequencing on the condition that patients' clinical information remain anonymous. Viral RNA was isolated from nasopharyngeal swabs using a TANBead nucleic acid extractor (SLA-16/32, SLA-E132 Series) to conduct the lysis, washing, and elution steps.
RNA extracts of samples positive for SARS-CoV-2 (Ct value < 20) were reverse transcribed with SSIV VILO cDNA master mix and used as primary input for overlapping tiled PCR reactions (400-600 nt reads) spanning the viral genome using New England Biolabs Q5 High-Fidelity 2X Master Mix (M0492L) (primers provided in Supplementary Table T1). Amplicon pools were generated using the ARTIC Network amplicon sequencing protocol v2, with the v3 primer pools (Quick 2020) [...] (Hadfield, Megill et al. 2018). The consensus sequences of genomes from Pakistan were deposited in GenBank with the following accession numbers: MW535197, MW534548, MW542138, [...]
We analyzed the evolutionary and spatio-temporal dynamics of two samples (NIH-421800/2020 and NIH-417328/2020) that had sufficient coverage across the genome (approx. 70%). We used the Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) tool [COG-UK (cog-uk.io)] to capture the genetic diversity patterns of sequences NIH-421800/2020 and NIH-417328/2020. For phylogenetic analyses, full-length viral genome sequences belonging to Pango lineages [...] --reorder --anysymbol --nomemsave --adjustdirection --addfragments, and used the Wuhan-Hu-1 sequence (GenBank accession number: MN908947.3) as a reference. Sequences with fewer than 75% unambiguous bases were excluded, as were duplicate sequences, defined as those with identical nucleotide composition that were collected on the same date and in the same country. The resulting dataset was trimmed at the 5' and 3' ends, yielding a multiple sequence alignment of 29,782 nucleotides. This dataset was subjected to multiple iterations of phylogeny reconstruction with 1000 replicates of ultrafast bootstraps using IQ-TREE multicore software version 1.6.12 (Nguyen, Schmidt et al.
2015) with parameters -m GTR+G -bb 1000 -bnni -nt 50, and exclusion of outlier sequences whose genetic divergence and sampling date were incongruent using TempEst (Rambaut, Lam et al. 2016), resulting in datasets with 34 and 107 sequences for lineages B.1.261 and B.1.1.250, respectively (Supplementary Figures 1 and 2). Phylogenetic relationships were inferred for B.[...]
We estimated spatial diffusion dynamics among countries using a Bayesian discrete phylogeographic approach (Lemey, Rambaut et al. 2009). This approach conditions on the trait information recorded at the tips and models the transition history among those states as a continuous-time Markov chain (CTMC) process, allowing the inference of unobserved states at the ancestral nodes in each tree of the posterior distribution. We used a non-reversible CTMC model (Edwards, Suchard et al. 2011) and incorporated a Bayesian stochastic search variable selection to identify a sparse set of transition rates that adequately summarized the epidemiological connectivity (Lemey, Rambaut et al. 2009).
In order to develop a MinION sequencing protocol at our location, we selected ten SARS-CoV-2 samples with diagnostic real-time PCR cycle threshold (Ct) values less than 20 for targeted sequencing. Following preparation with the ARTIC network tiled amplicon approach, the final library of 15 µl at 14 ng/µl was loaded onto a MinION flowcell set to run for 72 hours. After 36 hours, the sequencing run was terminated due to reduced sequence data generation. Analysis through the ARTIC sequencing pipeline showed that, among the 10 samples, four had coverage of 25,049, 26,458, 16,673 and 18,169 with sequence depth of 940x, 787x, 82x and 111x, respectively (Table 1). Six of the ten samples did not produce sufficient sequencing output to generate consensus genomes.
[...] (T80I) and the N gene (P13S, A220V).
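The exclusion rules stated in the Methods (fewer than 75% unambiguous bases; duplicates with identical nucleotide composition, collection date, and country) could be expressed along the following lines. This is a minimal Python sketch, not the authors' pipeline: the `Record` type, the function names, and the choice to measure the unambiguous fraction against each sequence's own length (counting only A, C, G and T) are assumptions made for illustration.

```python
from typing import List, NamedTuple

UNAMBIGUOUS = set("ACGT")


class Record(NamedTuple):
    """Hypothetical container for one genome and its metadata."""
    name: str
    sequence: str
    date: str      # collection date, e.g. "2020-12-07"
    country: str


def unambiguous_fraction(sequence: str) -> float:
    """Fraction of positions that are A, C, G or T (case-insensitive).

    Assumption: N, IUPAC ambiguity codes and gap characters all count
    as ambiguous, and the denominator is the sequence's own length.
    """
    if not sequence:
        return 0.0
    hits = sum(base in UNAMBIGUOUS for base in sequence.upper())
    return hits / len(sequence)


def filter_records(records: List[Record], min_fraction: float = 0.75) -> List[Record]:
    """Drop records below the threshold, then drop duplicates.

    A duplicate is a record whose sequence, collection date and country
    all match a record that was already kept; the first one seen wins.
    """
    seen = set()
    kept = []
    for rec in records:
        if unambiguous_fraction(rec.sequence) < min_fraction:
            continue
        key = (rec.sequence.upper(), rec.date, rec.country)
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

Keeping the first record of each duplicate group is an arbitrary choice in this sketch; the text does not say which copy was retained.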
PANGOLIN analysis identified sample NIH-421800/2020 as lineage B.1.1.250, which had previously been observed in isolates from Bangladesh, the United States, and the United Kingdom during March-September 2020 (Figure 2). NIH-417328/2020 was identified as part of lineage B.1.261, which also circulated in Saudi Arabia and South Korea between March and June 2020 (Fig 3). Both lineages were assigned with high support (probability = 1). Both lineages were found to be evolving at similar rates ([...] Table T3).
Through implementation of Oxford Nanopore sequencing, we identified SARS-CoV-2 lineages in circulation in the Pakistani population. Abundant genomic data are available from Europe, the United States, and Southeast Asia, whereas very few sequences from Pakistan have been reported. NIH established this sequencing technique through collaboration with the Johns Hopkins University Applied Physics Laboratory (JHU/APL). Through this effort, we successfully established the ARTIC network sequencing protocol locally.
The two patients whose samples were used for phylogenetic analysis had no travel history outside of Pakistan and had traveled only locally to Islamabad, which has the largest tertiary care hospital. Both cases contracted COVID-19 in Islamabad. NIH-421800/2020, collected from a patient who was part of a family cluster, had the D614G mutation that may have accelerated the spread of B.1.1.250, one of the most contagious variants during the first wave (Leung 2020). The patient infected with NIH-417328 had co-morbidities (Table 1) and succumbed to the infection even though the virus lacked the D614G mutation. We speculate that these samples were from two separate local sporadic transmission events, since we reconstructed distinct phylogenetic origins and the viruses harbored unique mutations in their S and ORF1ab genes (i.e., singleton mutations).
The D614G mutation of the S gene identified in NIH-421800/2020 was first detected in February 2020 in Europe; viruses carrying it became predominant during the first wave of COVID-19 globally and replaced Wuhan-Hu-1 (Korber, Fischer et al. 2020). It has also been reported circulating in Karachi (Shakeel, Irfan et al. 2021).
Individual contact tracing is useful only during phases of the outbreak when transmission chains can be easily traced (Leo, Chen et al. 2003; Faye, Boëlle et al. 2015; Kim, Tandi et al. 2017). However, during widespread pandemics, when patient numbers overwhelm local contact tracing capacity, genomic surveillance may be a practical strategy to trace the viral spread, as has proven effective in previous viral outbreaks (Bahl, Nelson et al. 2011; Baillie, Galiano et al. 2012; Dudas, Carvalho et al. 2017; Grubaugh, Ladner et al. 2017). Viral lineage dynamics can fluctuate at the regional level during epidemics, and genomic surveillance can provide insights into local genetic diversity as well as identify previously undetected lineages and potential phenotypic determinants of transmission and pathogenicity. Transmission events or seeding of particular lineages in a population can be more easily traced through patients with travel history.
The phylogeographic analysis estimated the origins of the most recent common ancestor for NIH-421800/2020 and NIH-417328/2020 in Saudi Arabia and the United States, respectively, months earlier than their collection dates. This could suggest some degree of undetected cryptic transmission in Pakistan, since transmission probably did not occur directly between the two locations and most likely involved unsampled intermediary locations before being detected by genomic surveillance. We could not determine the entry points of these lineages into the country or account for their previous circulation in Pakistan due to the lack of within-country SARS-CoV-2 genetic data.
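The discrete phylogeographic model described in the Methods treats location changes along a branch as a continuous-time Markov chain, so the probability of starting in one location and ending in another over a branch of duration t is an entry of P(t) = exp(Qt), where Q is the matrix of transition rates. The following minimal Python sketch is not the authors' Bayesian analysis; it only illustrates this one ingredient. The three locations, all rate values, and the function names are hypothetical and chosen for illustration.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=30, squarings=10):
    """Approximate P(t) = exp(Q * t) for a CTMC rate matrix Q.

    Uses a truncated Taylor series on Q * t / 2**squarings, then squares
    the result `squarings` times (scaling and squaring).
    """
    n = len(q)
    scale = t / (2 ** squarings)
    scaled = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds (Q * scale)**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Hypothetical locations and rates (events per unit time), illustration only.
LOCATIONS = ["A", "B", "C"]
# Non-reversible: the rate from i to j may differ from the rate from j to i.
# Off-diagonal entries are rates; each diagonal entry makes its row sum to zero.
Q = [
    [-1.0, 1.0, 0.0],   # A -> B only; no direct A -> C movement
    [0.1, -1.1, 1.0],   # B -> A (rare) and B -> C
    [0.0, 0.0, 0.0],    # C is absorbing in this toy example
]

P = transition_matrix(Q, t=1.0)
print(f"P(A -> C over t=1) = {P[0][2]:.3f}")
```

In this toy matrix the direct rate from A to C is zero, yet the A-to-C entry of P(t) is positive because paths through B contribute. This mirrors the point made above: an inferred ancestral location does not by itself imply direct transmission between that location and the sampling location.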
Additional evolutionary relationships could be established with the inclusion of complete viral sequences from an increased number of samples, which would help determine the diversity and regional distribution with respect to host population. The actual genetic diversity of SARS-CoV-2 in the Pakistani population could be addressed with large sample numbers from across the country, and this study is limited by the overall number of sequences described. A large sample size will support the identification of the predominant lineages in circulation and avoid sampling bias. There is a need for pragmatic resource allocation toward high-throughput sequencing that systematically targets cases, instead of focusing only on a few or rare lineages, which would jeopardize the characterization of the true genetic landscape.
Genomic surveillance of COVID-19 patient samples allows the identification of polymorphisms such as deletions, synonymous mutations and missense mutations in circulating strains that may contribute to increased transmissibility or pathogenicity of viral lineages, creating SARS-CoV-2 variants of concern. Our analysis stresses the need to build capacity for real-time genomic surveillance and epidemiology in Pakistan and other low-resource nations to assist the public health response to the COVID-19 pandemic or future outbreaks. These measures would help identify predominant and rare lineages while tracking their community transmission, providing important information for public health decision makers.
The following are the supplementary data related to this article.
This study was approved by the Institutional Review Committee of National Institute of Health, Islamabad, Pakistan.
The authors have no conflict of interest to declare.
BEAGLE 3: Improved Performance, Scaling, and Usability for a High-Performance Computing Library for Statistical Phylogenetics
Temporally structured metapopulation dynamics and persistence of influenza A H3N2 virus in humans
Evolutionary dynamics of local pandemic H1N1/2009 influenza virus lineages revealed by whole-genome analysis
Cryptic transmission of SARS-CoV-2 in Washington state
Virus genomes reveal factors that spread and sustained the Ebola epidemic
Ancient hybridization and an Irish origin for the modern polar bear matriline
Chains of transmission and control of Ebola virus disease in Conakry, Guinea, in 2014: an observational study
Genomic epidemiology reveals multiple introductions of Zika virus into the United States
Nextstrain: real-time tracking of pathogen evolution
Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China
Brief report: new variant strain of SARS-CoV-2 identified in travelers from Brazil
MAFFT online service: multiple sequence alignment, interactive sequence choice and visualization
Middle East respiratory syndrome coronavirus (MERS-CoV) outbreak in South Korea, 2015: epidemiology, characteristics and public health implications
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
Bayesian phylogeography finds its roots
Severe acute respiratory syndrome-Singapore
Empirical Transmission Advantage Of The D614G Mutant Strain Of SARS-Cov-2
Evolution and genetic diversity of SARS-CoV-2 in Africa using whole genome sequences
COVID-19 cryptic transmission and genetic information blackouts: Need for effective surveillance policy to better understand disease burden
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
nCoV-2019 sequencing protocol v1 (protocols.io.bbmuik6w)
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen)
Preliminary genomic characterization of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations
Surveillance of genetic diversity and evolution in locally transmitted SARS-CoV-2 in Pakistan during the first wave of the COVID-19 pandemic
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10
Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa
World Health Organization Coronavirus disease (COVID-2019) situation report -126
Clinical Features of Maintenance Hemodialysis Patients with 2019 Novel Coronavirus-Infected Pneumonia in Wuhan, China
Maximum likelihood tree inferred for the whole genomes of lineage B.1.261. Bootstrap support is annotated as circles in the nodes (small: low support; large: high support). Node leading to NIH-417328/2020 has a bootstrap support of 53. The ancestral node of NIH-421800/2020 is annotated with the inferred location and probability.