key: cord-0918543-inoct61x authors: Page, Andrew J.; Mather, Alison E.; Le-Viet, Thanh; Meader, Emma J.; Alikhan, Nabil-Fareed; Kay, Gemma L.; de Oliveira Martins, Leonardo; Aydin, Alp; Baker, David J.; Trotter, Alexander J.; Rudder, Steven; Tedim, Ana P.; Kolyva, Anastasia; Stanley, Rachael; Yasir, Muhammad; Diaz, Maria; Potter, Will; Stuart, Claire; Meadows, Lizzie; Bell, Andrew; Gutierrez, Ana Victoria; Thomson, Nicholas M.; Adriaenssens, Evelien M.; Swingler, Tracey; Gilroy, Rachel A. J.; Griffith, Luke; Sethi, Dheeraj K.; Aggarwal, Dinesh; Brown, Colin S.; Davidson, Rose K.; Kingsley, Robert A.; Bedford, Luke; Coupland, Lindsay J.; Charles, Ian G.; Elumogo, Ngozi; Wain, John; Prakash, Reenesh; Webber, Mark A.; Smith, S. J. Louise; Chand, Meera; Dervisevic, Samir; O’Grady, Justin title: Large-scale sequencing of SARS-CoV-2 genomes from one region allows detailed epidemiology and enables local outbreak management date: 2021-06-29 journal: Microb Genom DOI: 10.1099/mgen.0.000589 sha: 8abe069cc1f76ec2145258a5f2191fd787037da8 doc_id: 918543 cord_uid: inoct61x The COVID-19 pandemic has spread rapidly throughout the world. In the UK, the initial peak was in April 2020; in the county of Norfolk (UK) and surrounding areas, which has a stable, low-density population, over 3200 cases were reported between March and August 2020. As part of the activities of the national COVID-19 Genomics Consortium (COG-UK) we undertook whole genome sequencing of the SARS-CoV-2 genomes present in positive clinical samples from the Norfolk region. These samples were collected by four major hospitals, multiple minor hospitals, care facilities and community organizations within Norfolk and surrounding areas. We combined clinical metadata with the sequencing data from regional SARS-CoV-2 genomes to understand the origins, genetic variation, transmission and expansion (spread) of the virus within the region and provide context nationally. 
Data were fed back into the national effort for pandemic management, whilst simultaneously being used to assist local outbreak analyses. Overall, 1565 positive samples (172 per 100 000 population) from 1376 cases were evaluated; for 140 cases between two and six samples were available providing longitudinal data. This represented 42.6 % of all positive samples identified by hospital testing in the region and encompassed those with clinical need, and health and care workers and their families. In total, 1035 cases had genome sequences of sufficient quality to provide phylogenetic lineages. These genomes belonged to 26 distinct global lineages, indicating that there were multiple separate introductions into the region. Furthermore, 100 genetically distinct UK lineages were detected demonstrating local evolution, at a rate of ~2 SNPs per month, and multiple co-occurring lineages as the pandemic progressed. Our analysis: identified a discrete sublineage associated with six care facilities; found no evidence of reinfection in longitudinal samples; ruled out a nosocomial outbreak; identified 16 lineages in key workers which were not in patients, indicating infection control measures were effective; and found the D614G spike protein mutation which is linked to increased transmissibility dominates the samples and rapidly confirmed relatedness of cases in an outbreak at a food processing facility. The large-scale genome sequencing of SARS-CoV-2-positive samples has provided valuable additional data for public health epidemiology in the Norfolk region, and will continue to help identify and untangle hidden transmission chains as the pandemic evolves. In December 2019, a new coronavirus-related disease was first reported in Wuhan, China [1] ; the causal agent was identified as the novel human coronavirus, SARS-CoV-2. Since then, SARS-CoV-2 has spread globally leading to 120 million confirmed infections and 2.7 million deaths (as of 17 March 2021) [2] . 
Two risk factors are associated with higher mortality: sex, as males are at higher risk than females; and age, as older age groups are at substantially higher risk [3] . Whole genome sequencing provides high-resolution data that enable investigation of pathogen evolution and population structure [4] . When combined with robust epidemiological data, it is possible to gain insights into SARS-CoV-2 origins [5] , transmission (both global [6] and local [7] ) and responses to control measures [8] . Since the start of the pandemic, sequencing efforts and data sharing have facilitated tracking of the pandemic [9] , identifying multiple independent virus introductions into different countries [6] . The ability to assign identifiers rapidly to groups of samples that are related is essential in public health, as demonstrated for influenza [10] . These identifiers can be formulated in different ways: from conserved sequences identified by multi-locus sequence typing [11] , by assigning SNP addresses [12] ; or, in the case of SARS-CoV-2, through the assignment of lineages [13] . The COVID-19 Genomics UK (COG-UK) consortium [14] is a UK-wide public health surveillance initiative comprising nearly 20 organizations from universities, research institutes and public health agencies that was created to generate and analyse large-scale SARS-CoV-2 sequencing datasets to understand virus evolution, transmission and spread in the UK. These data allow detailed insight into the course of the pandemic at the country, county and individual institution level. It was through large-scale analysis of SARS-CoV-2 genomes that evidence of a mutation (D614G) in the spike protein was revealed; it is likely that this mutation is responsible for increased transmissibility of the virus [15] . For the Norfolk region, we established a robust, rapid sequencing pipeline for SARS-CoV-2. 
Weekly sequencing data were fed back into the national effort for pandemic management, whilst simultaneously being used to assist local outbreak analyses. Here we describe the sequencing of genomes present in 1565 SARS-CoV-2 samples from 1376 cases, collected between March and August 2020. This represented 42.6 % of all cases in the local population and included those with a clinical need, and key workers (such as healthcare, care and police) and their families. For context, at the end of the study period (27 August 2020), only five countries (UK, Australia, Spain, India and the USA) out of 103 countries had sequenced more SARS-CoV-2 genomes than had been sequenced in Norfolk for this paper. We used these data to investigate the genetic and epidemiological characteristics of the COVID-19 pandemic in the stable, low-density population of Norfolk and surrounding areas, UK. Our objectives were to use these sequence data to understand the evolution and spread of the virus locally, adding context to the national and global data, and to evaluate the role of rapid whole-genome sequencing for outbreak analysis in this setting.
Genomic lineages of SARS-CoV-2 can be used to track progression of the pandemic on an international scale. We undertook whole genome sequencing of the SARS-CoV-2 genomes present in positive clinical samples from one region. We combined clinical metadata with sequencing data to understand the origins, genetic variation, transmission and expansion of the virus within the region and provide context nationally in the UK. In total, 42.6 % of all positive samples identified by hospital testing were sequenced. The large-scale genome sequencing of SARS-CoV-2-positive samples has provided valuable additional data for public health epidemiology in the Norfolk region, and will continue to help identify and untangle hidden transmission chains as the pandemic evolves.
Our analysis: identified a sublineage associated with six care facilities; found no evidence of reinfection in longitudinal samples; ruled out a nosocomial outbreak; identified 16 lineages in key workers which were not in patients, indicating infection control measures were effective; and found the D614G spike protein mutation which is linked to increased transmissibility dominates the samples and rapidly confirmed relatedness of cases in an outbreak at a food processing facility. This demonstrates the valuable role of large-scale genome sequencing of SARS-CoV-2 to inform surveillance and regional outbreak management. The clinical samples we used were initially collected passively for diagnostic testing with ethical approval from Public Health England (R and D ref. NR0195) and with sampling directed by government public health policy and local clinical need. Samples were taken at four large hospitals: Norfolk and Norwich University Hospital (NNUH) (1200 beds) in Norwich, Norfolk; The Queen Elizabeth Hospital (QEH) (500 beds) in Kings Lynn, Norfolk; The James Paget University Hospital (JPUH) (500 beds) in Great Yarmouth, Norfolk; and the Ipswich Hospital (550 beds) in Ipswich, Suffolk. Additional clinical samples that were included were collected at five smaller hospitals; by three community care organizations (representing dozens of care facilities and GP practices); and at drive-through testing facilities for healthcare workers, essential workers (such as police) and their families who live or work in Norfolk and the surrounding areas (Fig. S1 ). Samples from cases with suspected SARS-CoV-2 were processed using five different diagnostic platforms over three laboratories on the Norwich Research Park: the Cytology Department and Microbiology Department, NNUH, Norwich, UK and the Bob Champion Research and Education Building (BCRE), University of East Anglia, Norwich, UK. 
Samples were primarily nasal/oropharyngeal swabs, although nasopharyngeal aspirates, bronchoalveolar lavage and sputum samples were also collected. The Cytology Department processed samples using the Roche Cobas 8800 SARS-CoV-2 system (https:// tinyurl. com/ yy58t8sp). The Microbiology Department processed samples using the Hologic Panther Fusion System SARS-CoV-2 assay (https:// tinyurl. com/ yye3m25p) according to the manufacturer's instructions, the AusDiagnostics SARS-CoV-2, Influenza and RSV 8-well panel (https:// tinyurl. com/ yyeh5y2w) or the Altona Diagnostics RealStar SARS-CoV-2 RT-PCR Kit 1.0 (https:// altonadiagnostics. com/ en/ products/ reagents-140/ reagents/ realstar-real-time-pcr-reagents/ realstar-sars-cov-2-rt-pcrkit-ruo. html). RNA was extracted from swab samples in the Microbiology Department using either the QIAsymphony (Qiagen) or AusDiagnostics MT-Prep (AusDiagnostics) instruments according to the manufacturer's instructions before being processed through the AusDiagnostics assay. In the BCRE, RNA was extracted using the MagMAX Viral/ Pathogen II Nucleic Acid Isolation kit (Applied Biosystems) according to the manufacturer's instructions and the KingFisher Flex system (ThermoFisher). The presence of SARS-CoV-2 was determined on either the QuantStudio 5 (Applied Biosystems) or Lightcycler LC480II (Roche) with the 2019-nCoV CDC assay (https://www. fda. gov/ media/ 134922/ download). Viral transport medium from positive swabs (stored at 4 °C) was collected for samples run on the Roche Cobas and Hologic Panther Fusion systems; in all other cases excess RNA was collected (stored at 4 °C and collected within 4 days for samples tested by the AusDiagnostics assay, while all other RNA samples were initially frozen and thawed for collection).
Excess positive SARS-CoV-2 inactivated swab samples (200 µl viral transport medium from nose and throat swabs inactivated in 200 µl Zymo DNA/RNA shield and 800 µl Zymo viral DNA/RNA buffer) were collected from the Cytology and the Microbiology Department and SARS-CoV-2-positive RNA extracts (~20 µl) were collected from the Microbiology Department and the BCRE as part of the COG-UK Consortium project (PHE Research Ethics and Governance Group R and D ref. no NR0195), with full details in Tables S1-S3. For inactivated swab samples, RNA was extracted using the Quick DNA/RNA Viral Magbead kit from step 2 of the DNA/RNA purification protocol (Zymo -https:// tinyurl. com/ y2lqoneq). SARS-CoV-2-positive samples were transferred to the Quadram Institute Bioscience for sequencing. The lower cycle threshold (Ct) or take-off value produced by the SARS-CoV-2 assays in the Roche, AusDiagnostics, Altona Diagnostics and CDC assays were used to determine whether samples needed to be diluted for sequencing according to the ARTIC protocol [for AusDiagnostics results, 13 was added to the take-off value to generate an approximate Ct value -this is because 15 cycles of PCR are performed before a dilution step and a further 35 cycles of nested PCR (the take-off value is determined in the nested PCR)]. The SARS-CoV-2 assay in the Hologic Panther does not provide a take-off or Ct value but rather a combined fluorescence signal for both targets in relative fluorescence units (RLUs), and therefore all samples tested by the Hologic Panther were processed undiluted in the ARTIC protocol. cDNA and multiplex PCRs were prepared following the ARTIC nCoV-2019 sequencing protocol v2 [16] . Dilutions of RNA were prepared when necessary based on Ct values following ARTIC protocol guidelines. V3 CoV-2 primers (https:// github. 
com/ artic-network/ artic-ncov2019/ tree/ master/ primer_ schemes/ nCoV-2019/ V3) were used to perform the multiplex PCR for SARS-CoV-2 according to the ARTIC protocol [16] with minor changes. Due to variable Ct values, all RNA samples used in the two ARTIC multiplex PCRs were run for 35 cycles. Odd and even PCRs were pooled and cleaned using a 1× SPRI bead clean with KAPA Pure Beads (Roche Catalogue No. 07983298001), according to the manufacturer's instructions. PCR products were eluted in 30 µl of 10 mM Tris-HCl buffer, pH 7.5, and cDNA was quantified using the QuantiFluor ONE dsDNA System (Promega). Libraries were prepared for sequencing on the Illumina or Nanopore platform and sequenced as described previously [17]. Raw reads were demultiplexed using bcl2fastq (v2.20) (Illumina), allowing for zero mismatches in the dual barcodes, to produce FASTQ files. The reads were used to generate a consensus sequence for each sample using an open source pipeline adapted from https:// github. com/ connor-lab/ ncov2019-artic-nf (https:// github. com/ quadram-institute-bioscience/ ncov2019-artic-nf/ tree/ qib). Briefly, read adapters were trimmed using TrimGalore (https:// github. com/ FelixKrueger/ TrimGalore) and aligned to the Wuhan Hu-1 reference genome (accession MN908947.3) using BWA-MEM (v0.7.17) [18]; ARTIC amplicons were masked and a consensus was built using iVar (v1.2) with primary parameters 'ivar consensus -m 10 -q 20 -t 0.75' [19]. Samples were prepared and sequenced in 96-well plates with one cDNA-negative control per plate and one RNA extraction-negative control, where applicable. Contaminated samples were removed from analysis (Table S2). The COG-UK consortium defines a consensus sequence as passing COG-UK basic quality control (QC) if: >50 % of the genome is covered by confident calls or there is at least one contiguous sequence of more than 10 000 bases; and there is no evidence of contamination in the negative control.
A confident call is defined as having 10× depth of coverage. If the coverage falls below these thresholds, the bases are masked with the character N indicating the base at that position is unknown or not available. Low-quality variants are also masked with Ns. The QC threshold for inclusion in the public database GISAID (Global Initiative on Sharing All Influenza Data) is higher, requiring that >90 % of the genome is covered by confident calls and that there is no evidence of contamination. The COG-UK quality control criteria were used as the minimum requirements for lineage and phylogenetic analysis. Although we did not use a homoplasy-based test for batch effects as described previously [20, 21] , a visual inspection of the consensus sequences for common SNPs across lineage boundaries for each batch is undertaken using snipit (https:// github. com/ aineniamh/ snipit). Additionally civet reports (https:// github. com/ COG-UK/ civet) are generated for each batch, with genomes from other labs included for context and are manually inspected. These reports identify for each generated sequence the closest sample in the COG-UK and GISAID databases, where in many cases identical samples sequenced by other labs are present. This helps to monitor for lineage altering artefacts. All consensus sequences were deposited in GISAID [22] if they met its minimum QC threshold. All raw sequence data and metadata [23] were deposited in the European Nucleotide Archive (ENA) [24] . In both cases this happened soon after sequence generation, facilitated through COG-UK, and using MRC CLIMB [25] . Lineages [13] assigned to each consensus genome were determined using Pangolin (https:// github. com/ cov-lineages/ pangolin), which is run routinely by the Rambaut group over SARS-CoV-2 consensus sequences deposited on MRC CLIMB [25] . Global lineages are identifiers given to actively spreading lineages, are defined using a phylogenetic framework (https:// github. 
com/ COG-UK/ grapevine) and often represent distinct introductions into new territories or regions, taking the form B.1.2.3 (see Rambaut et al. [13] for full details). UK lineages represent the subsequent spread within the UK, taking the form UK1234 and providing an identifier for a cluster for a given phylogeny. Unlike global lineages, however, UK lineage identifiers are not consistent between phylogenies. In this paper, a sublineage is defined as a set of samples within a lineage which share a common history within this lineage. For example, if all samples within a lineage are represented on a phylogenetic tree, a sublineage would describe all samples derived from a single internal node. Only samples that passed COG-UK QC were considered for lineage assignment (>50 % of genome reconstructed), and only samples with more than 90 % of non-ambiguous aligned sites (i.e. not N) were included in the upstream phylogenetic pipeline. This upstream analysis at MRC CLIMB (https:// github. com/ COG-UK/ grapevine) generates both the phylogenetic tree, aligned sequences and lineages (global and local) for most data. When adding more samples to the phylogenetic tree (as in the case studies below), we use IQ-Tree2 [26] , first constraining the tree search to the upstream tree to create an initial tree, and then through unconstrained optimization using default parameters and the HKY model [27] with gamma heterogeneity [28] in both cases. Sequences are aligned with MAFFT using Wuhan Hu-1 (accession MN908947.3) as a reference genome. Phylogenetic trees were visualized using ggtree [29] . Epidemiological analyses of outbreaks presented in the results were instigated and overseen by clinicians within the NHS or by public health bodies. 
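The two completeness thresholds described above (COG-UK basic QC and the stricter GISAID criterion) can be illustrated with a minimal Python sketch. It is not the consortium's code: the function names, the category labels, the use of the 29 903-base length of MN908947.3 as the denominator, and the choice to treat only A/C/G/T as confident calls (so that N and IUPAC ambiguity codes count as missing) are all assumptions made for illustration.

```python
REFERENCE_LENGTH = 29_903  # assumed denominator: length of Wuhan Hu-1 (MN908947.3)


def longest_called_run(seq: str) -> int:
    """Length of the longest contiguous stretch of unambiguous bases."""
    longest = current = 0
    for base in seq.upper():
        if base in "ACGT":
            current += 1
            longest = max(longest, current)
        else:  # N or any ambiguity code breaks the run
            current = 0
    return longest


def qc_category(seq: str, contaminated: bool = False,
                genome_length: int = REFERENCE_LENGTH) -> str:
    """Classify a consensus sequence as 'fail', 'cog_uk_basic' or 'gisaid'.

    Hypothetical labels; thresholds follow the text: >90 % called bases for
    GISAID, and >50 % called bases or a contiguous run of more than
    10 000 bases for COG-UK basic QC, with no contamination in either case.
    """
    if contaminated:
        return "fail"
    called = sum(base in "ACGT" for base in seq.upper())
    fraction = called / genome_length
    if fraction > 0.90:
        return "gisaid"
    if fraction > 0.50 or longest_called_run(seq) > 10_000:
        return "cog_uk_basic"
    return "fail"
```

Under these assumptions, a consensus with 16 000 called bases followed by masked positions would be eligible for lineage assignment but not for the upstream phylogenetic pipeline, which requires more than 90 % non-ambiguous sites.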
Genome sequencing, bioinformatics analysis and genomic epidemiology were performed by Quadram Institute Bioscience with limited anonymized metadata as part of the COG-UK consortium; patient-identifiable data were retained by the hospitals or public health bodies. For each case, the Norwich Research Park Biorepository (part of NNUH) anonymously linked all instances where cases were sequenced longitudinally (140) and provided the information to Quadram Institute Bioscience for analysis. The UK lineages were extracted for each case with multiple samples using precomputed lineages from COG-UK. Consensus genomes that were not of sufficient quality to compute a lineage were excluded. Cases which had more than two high-quality samples were validated to ensure the lineages were the same. Where differences were identified, all consensus genomes for the case were extracted into a FASTA file, and the differences compared to the Wuhan Hu-1 reference were noted using SNP-sites [30] (version 2.3.3). An initial list of SARS-CoV-2 samples associated with a single care facility was provided by NNUH to Quadram Institute Bioscience. The UK lineages were identified for each sample using precomputed lineages from COG-UK. All other samples with the same UK lineage in the COG-UK dataset were identified and a phylogenetic tree was computed using IQ-Tree2 (version 2.0.6) [26] under an HKY+G model, as described above. All samples in a sublineage associated with Norfolk were identified. The mutations defining this sublineage were calculated using SNP-sites [30] (version 2.3.3), with the Wuhan Hu-1 reference as the base. A phylogenetic tree of the sublineage was calculated after first removing singleton mutations, most of which were C→T/U SNPs, which are markers of RNA degradation. Anonymized care facility sample metadata were added to the sublineage, the data were visualized in Phandango [31], and the relatedness of the samples and care facilities was visually confirmed.
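The SNP comparison against the reference and the removal of singleton mutations described above can be sketched in a few lines of Python. This is a toy illustration only, not how SNP-sites is implemented: the function names and data layout are hypothetical, sequences are assumed to be already aligned to the reference and of equal length, and any position that is not A, C, G or T in either sequence is ignored.

```python
from collections import Counter


def snps_vs_reference(ref: str, seq: str) -> dict:
    """Return {1-based position: (ref_base, alt_base)} for unambiguous differences."""
    if len(ref) != len(seq):
        raise ValueError("sequences must be aligned to the same length")
    snps = {}
    for pos, (r, s) in enumerate(zip(ref.upper(), seq.upper()), start=1):
        # Skip N and ambiguity codes so masked positions are not called as SNPs.
        if r in "ACGT" and s in "ACGT" and r != s:
            snps[pos] = (r, s)
    return snps


def drop_singletons(per_sample: dict) -> dict:
    """Remove SNPs (position + alternative base) seen in only one sample."""
    counts = Counter(
        (pos, alt)
        for snps in per_sample.values()
        for pos, (_, alt) in snps.items()
    )
    return {
        name: {pos: pair for pos, pair in snps.items()
               if counts[(pos, pair[1])] > 1}
        for name, snps in per_sample.items()
    }


# Toy example with an 8-base "reference"; sample names are made up.
reference = "ACGTACGT"
samples = {"s1": "ACGTGCGT", "s2": "ATGTGCGT", "s3": "ACGTACNT"}
per_sample = {name: snps_vs_reference(reference, seq)
              for name, seq in samples.items()}
shared = drop_singletons(per_sample)
```

In the toy data the A→G change at position 5 is shared by two samples and is kept, whereas the C→T change at position 2 occurs in a single sample and is dropped before any tree would be built.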
Hospital admission and discharge data for the residents were analysed solely by Public Health England co-authors and an anonymized summary was provided for this paper to maintain patient confidentiality. A list of SARS-CoV-2 samples associated with a hospital was provided by Ipswich Hospital to Quadram Institute Bioscience. The lineages were identified for each sample using precomputed lineages from COG-UK, and the frequency of each lineage was calculated for Ipswich Hospital. To provide context, lineage frequencies were calculated in the same way for SARS-CoV-2 samples collected by hospitals (NNUH, QEH and JPUH) and care organizations in the same region as Ipswich Hospital. SARS-CoV-2-positive samples were sent to Quadram Institute Bioscience. Samples were prepared, sequenced using an Oxford Nanopore Technologies MinION and bioinformatically analysed, all within 24 h of sample receipt. Consensus genomes were provided to Civet (https:// github. com/ COG-UK/ civet), which assigned lineages to each genome. SNPs defining the sublineage of the outbreak were manually identified from the Civet report. The global lineage into which the samples fell was analysed further: all samples from this lineage that were publicly accessible through GISAID were identified, and their countries of origin and collection dates were noted. A phylogenetic tree of all samples in this lineage from May onwards was created using IQ-Tree2 as described above. The SNPs defining the sublineages were calculated using SNP-sites (version 2.3.3).
The first reported case in the Norfolk region was on 6 March 2020 from a returning traveller; by 31 August 2020, there were 3225 cases identified by NNUH from Norfolk and surrounding areas from a total of 3751 SARS-CoV-2-positive clinical samples (some cases were sampled multiple times). Of these, 1565 samples (41.7 %) were sequenced and analysed, from 1376 cases (42.6 %).
This represents approximately 172 SARS-CoV-2-positive samples sequenced per 100 000 population. The sequenced cases were broken down by locality and age group (Table S4) . For cases sampled multiple times, the earliest collection date of a SARS-CoV-2-positive sample was used for sequence analysis. These samples were collected in the East of England, predominantly from cases providing an address in Norfolk (Fig. S1 ). The samples came from individuals in the community (20.7%, n=285), inpatients (40.6 %, n=559) and outpatients (0.3 %, n=4) at hospitals, and staff (key workers) (23.8%, n=328) and their families (4.7 %, n=65) (Fig. 1 ). Inpatients represented a mixture of patients newly admitted to the hospital (with or without COVID-19 symptoms), and existing patients with possible nosocomial SARS-CoV-2 infections. As testing was extended to more groups, so the regions from which samples were collected also changed (Fig. S2 ). The number of positive samples in the Norfolk region peaked at the end of April 2020; specifically, the number of SARS-CoV-2-positive samples peaked in the week of 20-26 April 2020, with 591 positive samples (Fig. 2) . The peak month was April with 1992 positive samples, followed by May with 1188 positive samples. These numbers include a small number of repeat samples. In July only 10 new positive cases were reported, before rising to 79 in August. More than 60 of the August cases were related to a food processing facility outbreak. Proportionally, the number of SARS-CoV-2-positive samples that were sequenced followed the same trend as the total number of positive samples, peaking in the week of 27 April to 3 May, with genomes from 320 cases being sequenced (Fig. 2 ). Although project sample collection for sequencing officially began on 8 April 2020, and no samples were sequenced from the period 27 March to 7 April 2020, 59 archived samples from March were available and were sequenced. 
Overall the number of genomes sequenced does reflect the number of positive cases in the region and we can confidently conclude that the peak period was April/May 2020, in this region. The number of positive cases sequenced was greatest in older individuals, with the largest number of samples (n=316, 36.5 %) being from cases aged 80-90 years (Fig. 3a) . Just nine samples originated from cases under 10 years of age. Females were also significantly over-represented in the dataset (Fig. 3b) , accounting for 57 % (n=741 out of 1286, P<0.001, one-proportion z-test) of cases. Virtually all samples made available for sequencing were sequenced, with no selection criteria based on patient data or quality cut-offs; while we cannot rule out differential submission rates of testing facilities representing variations in patient populations, we thus generally expect the sequenced positive cases to reflect patient characteristics in the underlying population of total positive cases. Samples received for sequencing varied substantially in viral load. For our samples the Ct correlated well with the percentage of bases missing in the reconstructed consensus genome (Fig. S3 ) with a substantial reduction in genome completeness for samples with a Ct above 32. Virtually all diagnostic positives were sequenced, irrespective of Ct value, to avoid underrepresenting patients with low viral loads. In a small number of instances [32] samples were not available for sequencing as there was insufficient material or the samples could not be found. We identified potential biases in the quality of the genomes sequenced from samples. The sex and age of the cases (for those genomes which had the relevant associated metadata) were evaluated against four QC categories: not sequenced, failed all QC, passed basic QC and passed high-quality QC (Fig. S4 ). 
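The one-proportion z-test reported above for the sex distribution (741 females out of 1286 cases with sex recorded) can be reproduced with a short Python sketch using only the standard library. The null proportion of 0.5 and the two-sided alternative are assumptions, as the text does not state them, and the function name is hypothetical.

```python
import math


def one_proportion_z_test(successes: int, n: int, p0: float = 0.5):
    """Return (z, two-sided p-value) using the normal approximation.

    p0 = 0.5 is an assumed null proportion (equal numbers of males and females).
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Two-sided tail probability of a standard normal: P(|Z| >= |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


z_stat, p_val = one_proportion_z_test(741, 1286)
print(f"z = {z_stat:.2f}, p = {p_val:.2g}")
```

With these inputs the statistic is a little above 5, which is consistent with the reported P<0.001.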
There were significantly more genomes from females in three QC categories (62, 63 and 55 % female in genomes failing all QC, passing basic QC and passing high-quality QC, respectively, one-proportion z-test, P<0.001), but the proportion was not significantly different in the not sequenced category (66 % female, P=0.052). With respect to an individual's age, the mean age of individuals contributing samples that failed all QC was significantly higher (mean age 70.0 years vs. 64.9, 65.7 and 58.8 years of age of individuals contributing genomes passing basic QC, passing high-quality QC and not sequenced, respectively, P<0.05, pairwise Kruskal-Wallis rank sum tests adjusted for multiple hypothesis testing). Completeness of consensus genomes is related to the Ct of the input samples (Figs S5 and S6) and three temperature-sensitive ARTIC PCR primer dropout areas were visible [33]. A further three performed poorly at higher Ct values, due to reduced amplification efficiency, variation in which is to be expected in a large amplicon pool. Overall, the ARTIC protocol was robust and sensitive. Above Ct 32 the completeness of the genomes recovered did begin to tail off and there was a substantial, largely random, drop off above Ct 35, i.e. there was no consistency in the primer pairs that performed well or poorly in the multiplex when there were <10 genome copies present. In total, 901 samples (65.4 %) passed the GISAID QC criteria (≥90 % genome completeness), 120 samples (8.7 %) passed the COG-UK QC criteria only (≥50 % genome completeness) and 357 samples (25.9 %) failed (<50 % genome completeness). As samples were collected over a 6-month period, the median number of SNPs per genome increased every month compared with the Wuhan Hu-1 reference (accession MN908947.3). When only considering high-quality consensus genomes (Fig. S7) it increased from six SNPs in March to 16 SNPs in August. The evolutionary rate was estimated to be ~2 SNPs per month.
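As a quick arithmetic check on the figures above, the following Python sketch recomputes the QC percentages from the reported counts and shows one simple way to arrive at a rate of about 2 SNPs per month: the change in median SNP count between March and August divided by the five intervening months. The text does not say how the rate was actually estimated, so this end-point calculation is an assumption, and the variable and function names are hypothetical.

```python
# Reported QC counts (GISAID pass, COG-UK pass only, failed).
qc_counts = {"gisaid": 901, "cog_uk_only": 120, "failed": 357}
total = sum(qc_counts.values())
qc_percent = {name: round(100 * count / total, 1)
              for name, count in qc_counts.items()}


def snps_per_month(start_month: int, start_snps: float,
                   end_month: int, end_snps: float) -> float:
    """Average change in median SNP count per calendar month (end-point slope)."""
    months = end_month - start_month
    if months <= 0:
        raise ValueError("end_month must be later than start_month")
    return (end_snps - start_snps) / months


# Median of six SNPs in March (month 3) and 16 SNPs in August (month 8).
rate = snps_per_month(3, 6, 8, 16)
print(qc_percent, rate)
```

The percentages are taken over the 1378 samples in the three categories combined, which is the denominator that reproduces the reported values.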
The maximum number of co-occurring global lineages in a given week was 13 for the period 27 April to 10 May 2020, approximately 5-6 weeks after the UK government instituted a lockdown (23 March 2020) (Fig. 4). This rapidly reduced as the number of samples dropped. When considering the number and proportion of co-occurring global lineages every week it is apparent that, during the peak (April/May), some global lineages became extinct and were replaced by new lineages, which rapidly increased in abundance (Fig. 5). The global lineage nomenclature system of Rambaut et al. [13] provides a flexible and consistent naming scheme for genomically detectable introductions of SARS-CoV-2 into new locations; there have been 1035 global lineages assigned in this scheme. A total of 26 of these global lineages were observed in our data, 20 of which were observed in more than one sample. All of the global lineages present in more than one sample were from lineage B. Norfolk samples were set in context as part of the COG-UK phylogenetic pipeline (7 September 2020) using a phylogenetic tree based on all publicly released genome sequences (Fig. 6). Building phylogenetic trees with incomplete genomes is challenging [20, 34] in light of the low diversity of SARS-CoV-2 during the study period, with the genomes in the 50-90 % completeness range having a higher potential for misalignment and phylogenetic misplacement. Overall, genomes from the Norfolk region represented a random sampling of co-occurring global lineages within the UK as a whole. When compared with the UK samples, some major global lineages, such as B.1, are under-represented in Norfolk and others, such as B.1.1, are over-represented (Table 1). Global lineages were further subdivided into UK lineages, to identify ongoing transmission and evolution within the UK. The numbers assigned to UK lineages are subject to change and must be recalculated for all genomes with each phylogenetic reanalysis.
Thus, the numbers reported here are for a single phylogenetic analysis. Stable cluster identification and nomenclature for SARS-CoV-2 is currently an open problem, and thus all analyses in this paper were relative to a single snapshot, with consistent algorithm/ database/software versions, phylogenetic analysis and sample sets. There were 100 UK lineages detected in the dataset, 49 of which were present in two or more cases. The number of co-occurring UK lineages peaked at 20 in the week of 27 April 2020, approximately 5 weeks after the UK nationwide lockdown began; thereafter the number dropped to a single lineage in July and August (Fig. S8) . The proportions of samples with particular lineages varied week to week (Fig. S9) , with the most common UK lineage being UK5, which was present in 324 cases; this is also the most commonly identified lineage in the UK (https:// microreact. org/ project/ cogconsortium-2020-09-02/ f5aa0bdd/). The next most common UK lineage was UK2913, which was present in 113 samples; this was a sublineage associated with care facilities in the region around Norwich city (detailed later). There is evidence that a mutation in the spike protein of SARS-CoV-2 (an amino acid change from D to G at position 614; D614G) increases infectivity of a pseudotype virus in vitro in cells; this is associated with an observed increase in viral loads in patients [35] . Overall, in the Norfolk dataset, 89.4 % (n=819) of samples had the D614G mutation while only 10.6 % (n=97) had the wild type (Fig. 7) . The relative proportion of the two genotypes differed over time. In March, 66.6 % (n=24) of samples contained the wild type and 33.3 % (n=12) contained the D614G mutation. In April the proportion of genomes that were wild type had reduced to 10.7 % (n=47) while those with the D614G mutation were dominant at 89.3 % (n=392). 
In May the proportion of genomes that were wild type had reduced further to 5.6 % (n=22) compared with 94.4 % (n=374) of genomes having the D614G mutation (Fig. 7).

In August 2020 Quadram Institute Bioscience, at the request of the Microbiology Department at the NNUH, evaluated the genomes present in multiple longitudinal samples taken from the same case over extended periods of time during infection. The aim was to determine whether they were infected by the same lineage or different lineages, the latter potentially indicating reinfection. Longitudinal samples were available for 140 cases (Table S3), each with between two and six samples, 88.5 % of whom were hospital inpatients at some point during their illness. The median time span of the sampling was 13 days, with a mean of 16.2 days. The longest time span was 71 days, with 22 cases having a time span greater than 28 days. The clinical outcomes were not available for analysis. Only samples with different collection dates from the same individual were considered. We limited cases to those with high-quality consensus genomes (passing GISAID QC) in two or more of the sample time points; this resulted in a series of longitudinal genome samples from 42 cases; each series had two to four samples. In every case the lineage remained the same between samples from the same individual, with the exception of the linked samples NORW-ED449 and NORW-ECD30; this was because NORW-ECD30 had nine IUPAC [36] symbols for 'partially' ambiguous bases, which are likely to be due to differences in viral load between the original samples (Ct 17 vs. Ct 27). These results provide no evidence of reinfection in any of the individuals for whom a series of positive samples had been taken.

In August 2020 Quadram Institute Bioscience and Ipswich Hospital used the genome data from a set of 31 samples from hospital patients to determine whether these samples represented a single nosocomial outbreak or whether they were unrelated and the result of community transmission.
The 31 positive samples were collected between 6 March and 28 August 2020; 80.6 % (n=25) were from patients over the age of 65 years. From these, 18 yielded genome sequences of sufficient quality to assign lineages. A total of six global lineages and eight UK lineages were observed in the samples, with the most commonly observed (n=5) being UK5, which is also the most commonly observed lineage within the UK. This number of co-occurring lineages indicated that there was not a single large nosocomial outbreak at this location.

In June 2020 the Microbiology Department at NNUH and Quadram Institute Bioscience evaluated an outbreak at a care facility in the Norwich region using SARS-CoV-2 sequenced genomes from the dataset in this paper. The analysis undertaken indicated probable intra-care facility transmission, corresponding to a discrete sublineage circulating in the care sector as opposed to the wider community. It revealed that 14 out of 15 genomes from cases had the same UK lineage, UK2913, over a sustained period of time (8 April to 1 June 2020). An analysis of this lineage in all COG-UK data (n=395) revealed that it represented a distinct sublineage in the Norwich region of Norfolk, defined by a single synonymous mutation (A→G) at position 24232 in the S gene; this mutation was not found in any other COG-UK lineages. Most of the cases with this sublineage were >80 years of age and concentrated in distinct areas in the Norwich region. Analysis also confirmed that the samples were predominantly collected from six care facilities. There were 89 cases sequenced in the Norwich region with this sublineage and 76 of these were known to be patients (n=64) of care facilities or healthcare workers in those facilities (n=9) and their families (n=3). Links could not be established for 13 cases who tested positive for this sublineage.
This sublineage had not been observed previously in community testing, and the last new positive patient with this sublineage was on 1 June 2020. As it has not been seen in 3 months, this sublineage is now regarded as extinct.

An analysis was undertaken to understand the role of hospital discharges in this sublineage. Of the residents in this sublineage, 12 had a hospital admission, of whom two were admitted twice, across three hospital trusts. Six had a community-acquired infection, testing positive within 7 days of admission, three were inconclusive due to missing data, one had a probable hospital-acquired infection and tested positive within 7 days of discharge, and two had a definite hospital-acquired infection (https://www.gov.uk/government/publications/wuhan-novel-coronavirus-infection-prevention-and-control/epidemiological-definitions-of-outbreaks-and-clusters-in-particular-settings). All residents with a hospital-acquired infection had a test prior to discharge, suggesting that the package of infection prevention and control (IPC) measures was being followed. In the time period covered by the study, patients required a test prior to discharge to a care facility; a positive test did not preclude them from returning to the care facility, but meant that enhanced IPC and isolation measures needed to be taken for a designated period of time. The fact that some of this cohort of patients tested positive in May with community-acquired infections, a number of weeks after the Department of Health and Social Care IPC measures for adult social care were announced, suggests that these measures may not have been sufficient.

On examination of all genome sequences obtained from a town with two of these care facilities, we found 70 samples were positive for SARS-CoV-2, with 52 of those yielding genome sequences of sufficient quality to assign a lineage. Thirty-seven samples (71 %) were associated with the care facilities and were the UK2913 lineage.
The remaining 15 samples in the town came from 13 different lineages, indicating that the number of co-occurring lineages within the care facilities did not reflect the number of co-occurring lineages within the wider locality.

Here, we have used intensive whole genome sequencing of SARS-CoV-2 samples in a single geographical area to investigate the evolution and transmission of the virus in this region. The average age of the Norfolk population is substantially higher than that of England as a whole; 24.5 % of residents are aged 65 years or older compared with 18.4 % for England as a whole (https://www.norfolkinsight.org.uk/population/, accessed 31 May 2020). The largest hospital in the region is the Norfolk and Norwich University Hospitals NHS Foundation Trust (1200 beds), serving a population of around one million patients from Norfolk and neighbouring counties, supported by a network of smaller hospitals. There was a lower incidence of SARS-CoV-2 in the Norfolk region compared with England as a whole, with the rate of cases testing positive for SARS-CoV-2 in Norfolk at 363.2 per 100 000 compared with 573.9 per 100 000 in England as a whole (accessed 14 September 2020, https://coronavirus.data.gov.uk/). This was also reflected in the number of deaths that occurred due to infection within 28 days of a positive diagnosis; for Norfolk this was 43.8 per 100 000 population compared with 65.7 per 100 000 for England as a whole, which is well below the national rate despite Norfolk having an older, more vulnerable population. The sequencing data represented a rate of 172 sequenced genomes per 100 000 in the Norfolk population, which corresponded to 113.8 cases for which high-quality genome sequences were available for evaluation per 100 000 in the population. Specifically, we evaluated high-quality genome sequence for 31.3 % of all cases that tested positive for SARS-CoV-2 in Norfolk.
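The coverage figure follows from the two per-100 000 rates quoted above. As a minimal sketch of that arithmetic in Python (the constant and function names are illustrative only):

```python
# Rates per 100 000 population, as quoted in the text.
HIGH_QUALITY_CASES_PER_100K = 113.8  # cases with a high-quality genome
POSITIVE_CASES_PER_100K = 363.2      # all cases testing positive in Norfolk


def percent_of_cases_sequenced(sequenced_rate, positive_rate):
    """Share of positive cases with a high-quality genome, in percent."""
    if positive_rate <= 0:
        raise ValueError("positive_rate must be greater than zero")
    return 100 * sequenced_rate / positive_rate


coverage = percent_of_cases_sequenced(
    HIGH_QUALITY_CASES_PER_100K, POSITIVE_CASES_PER_100K
)
print(f"{coverage:.1f} % of positive cases had a high-quality genome")
```

Because both rates share the same population denominator, the population size cancels and the ratio of the rates equals the ratio of the underlying case counts.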
However, as the samples primarily came from healthcare settings, they captured cases with the most severe symptoms rather than asymptomatic community cases. Only positive tests from the population with clinical need (primarily hospital and care facilities), and key workers and their families were available for sequencing. Community testing was done at large regional 'Lighthouse labs' capable of processing hundreds of thousands of samples per day. Most community testing required the case to have COVID-19 symptoms, with the exception of population-level surveillance, which made up a small proportion of positive cases. Of the sequenced community samples during the study period, just 25 were attributed to Norfolk; these lacked metadata and thus were not included in the analysis.

The highest number of positive samples in this region occurred at the end of April/beginning of May 2020, approximately 6 weeks after the UK instituted a nationwide lockdown (23 March 2020). Thereafter, the number of positive cases we sequenced dropped substantially as the impact of the lockdown and social distancing began to reduce transmission. With the exception of the food processing facility outbreak, by August 2020, only two new cases with positive samples were detected. Analysis of the demographic metadata associated with positive samples indicated cases were more likely to be older and female. This skewed distribution in relation to age and sex is likely to be due to the targeting of diagnostic testing at symptomatic cases during the peak of the pandemic; this approach was driven by global shortages in reagents and testing capacity. Thus, vulnerable elderly cases were more likely to be tested during the peak of the pandemic, and were more likely to be female as women have a longer life expectancy. Thereafter, when testing was opened up to key workers, those tested were predominantly female healthcare workers, as women make up 77 % of the NHS workforce [37].
Viral RNA loads in individuals, as measured by PCR, were correlated strongly with the percentage of the genome that could be reconstructed from the sequencing data. This could be due to individual variation in host factors, disease stage [38] or quality of the sample material [32]. Phylogenetically useful genomes were routinely obtained where the Ct was below 32 (more than ~100 viral copies), but there was a substantial tail-off from Ct 35. These results must be interpreted cautiously in terms of transmission potential, as it is not known whether an individual is infectious at low viral RNA loads; it is possible that the positive results from high Ct samples are due to detection of residual RNA from a past infection. However, this does demonstrate that sequencing can produce usable information from samples containing the wide range of viral RNA loads likely to be encountered during SARS-CoV-2 infections.

There is evidence that a mutation in the spike protein with an amino acid change of D to G at position 614 (D614G) increases the transmissibility of the virus, which is associated with an increased viral load in mutant-infected cases [15, 35]. This has been observed in the UK and globally [15], potentially indicating that a more transmissible strain is now in circulation. As seen in Fig. 7, the D614G mutation dominated the Norfolk samples.

The information provided by these sequences allowed an examination of the overall genetic variation within SARS-CoV-2 circulating in Norfolk and comparison with other regions. The number of co-occurring global lineages was similar to the range found within the UK as a whole [7], Europe [6] and beyond [39]. The notable exception was the lack of lineage A samples within the region, with only two being observed. This indicates that most of the lineages that entered the region did not come directly from China; rather, they are estimated to have predominantly come from Europe or within the UK.
In the region, 26 (23.2 %) of the 112 global lineages that have been defined to date [13] were observed. This variation shows that genomically distinct lineages have expanded worldwide, with different distributions taking hold in different settings (see Microreact [40], https://microreact.org/project/cogconsortium-2020-09-02/f5aa0bdd/). The B.1.11 samples in our dataset were specifically associated with care facilities in Norfolk. These data demonstrate a substantial number of co-occurring global lineages within one small region, indicating multiple concurrent introductions and their subsequent spread. This places a lower bound on the number of independent introductions to the region at 26, but it is likely to be substantially higher as not all COVID-19 infections were identified, tested and sequenced. As case numbers rose during the course of the pandemic, more lineages were identified, with a peak in the number of co-occurring lineages around 5-6 weeks after the UK instituted a national lockdown. Thereafter, the number of lineages dropped substantially, with many rapidly becoming extinct in the region, providing further evidence that lockdown measures break transmission.

By subdividing the global lineages into standardized UK lineages, a finer resolution of viral genomic relatedness was obtained, allowing more detailed comparisons. As we observed with global lineages, the UK lineages provided further evidence for substantial viral genomic variability circulating in the region. There were 100 UK lineages observed in the region out of a total of 1725 lineages reported for the UK as a whole (5.8 %). The dominant lineage in the UK (UK5) was also the dominant lineage in the region.
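The weekly counts of co-occurring lineages discussed above amount to grouping samples by collection week and counting the distinct lineages in each group. The following is a minimal sketch in Python, assuming hypothetical sample records (the dates and lineage labels are illustrative, not study data) and ISO calendar weeks, which may differ from the week boundaries used in the study.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (collection date, lineage) records; not the study data.
samples = [
    (date(2020, 4, 27), "B.1.1"),
    (date(2020, 4, 29), "B.1"),
    (date(2020, 4, 30), "B.1.1"),
    (date(2020, 5, 5), "B.1.1"),
]


def lineages_per_week(samples):
    """Count distinct lineages per (ISO year, ISO week)."""
    weeks = defaultdict(set)
    for collected, lineage in samples:
        iso = collected.isocalendar()
        # iso[0] is the ISO year and iso[1] the ISO week number.
        weeks[(iso[0], iso[1])].add(lineage)
    return {week: len(found) for week, found in sorted(weeks.items())}


for (year, week), n_lineages in lineages_per_week(samples).items():
    print(f"{year} week {week}: {n_lineages} co-occurring lineage(s)")
```

Using a set per week means that repeated samples of the same lineage are counted once, so the result reflects lineage diversity rather than sample volume.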
Interestingly, the second most commonly observed lineage in the UK (UK1535), with 2269 (5.7 %) samples and widespread circulation around the UK, was only observed 19 (1.8 %) times in Norfolk (18 in this dataset, one sequenced by the University of Cambridge and none in Pillar 2 community testing for Norfolk).

One of the most important applications of these data was in epidemiological investigations to identify outbreaks. This was particularly important during the peak in April and May, when the number of new infections was high and the genomic data provided the resolution required to distinguish between transmission clusters, which would not otherwise have been possible. Our genomic data were informative in the following cases:

(1) In a hospital setting, lineage information was used to differentiate nosocomial from community transmission. We sequenced 31 samples from Ipswich Hospital and found eight UK lineages; the most commonly observed (UK5) was also the most commonly observed in the UK. This demonstrated that a single large nosocomial outbreak had not occurred but that the patients in hospital had become infected in the community by circulating lineages.

(2) Our data unexpectedly uncovered a sustained outbreak in six care facilities within the region. These data indicated probable intra-care facility transmission, which is currently under further investigation. The outbreak was identified while looking at a common lineage within the region and noticing that the positive samples had mostly come from elderly people, suggesting a possible link to care facilities. Further investigation at NNUH identified six care facilities sharing a distinct sublineage primarily found only in these facilities. This sublineage was not detected in community testing (Pillar 2) at any point. Only two cases had a definite hospital-acquired infection, with one further probable hospital-acquired infection.
Examination of all genome sequences obtained from a town with two of these care facilities showed there were 13 different lineages circulating within the locality, but only a single lineage circulating in the care facilities.

Samples could be broken down by the areas from which the samples were collected, representing towns or cities and their surrounds (Figs S2 and S10). Whilst UK5 was the commonest UK lineage in most urban areas in the Norfolk region, as it was nationally in the UK, there were other UK lineages that showed sustained persistence and spread within discrete communities from small geographical areas. UK2913 was primarily observed in Norwich and to the south and east of the city (South Norfolk, Waveney, Broadland, Great Yarmouth), nearly exclusively from clients of care facilities and healthcare workers at those facilities. Another UK lineage, UK721, was observed in seven community care residents (aged 78-92 years) and two healthcare workers in south-west Norfolk. One lineage, UK173, was observed only in one suburb of Norwich; it dominated for 1 month (13 April to 19 May 2020), then went extinct and has not been observed in Norfolk since. This indicates that lineages introduced into small urban areas may expand but do not necessarily spread more widely. The UK6 lineage was primarily observed in the King's Lynn area, which accounted for 90 % (73/81) of UK6 samples in Norfolk. In contrast, Norwich, 70 km away, recorded only a single case of UK6. These patterns are repeated throughout the dataset.

The number of co-occurring global and UK lineages in circulation amongst inpatients (n=559, 22 global, 71 UK) was also reflected in key workers and their families (n=394, 19 global, 49 UK). Sixteen UK lineages were observed in key workers and their families but were not observed in patients or in community care, though it must be noted that 10 of these lineages were only observed once.
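The comparison above, in which lineages seen in key workers and their families are checked against those seen in patients, is in essence a set difference. The following is a minimal sketch in Python; the group labels and the assignment of lineages to groups are hypothetical and chosen only to illustrate the calculation.

```python
from collections import defaultdict

# Hypothetical (group, UK lineage) records; not the study data.
records = [
    ("inpatient", "UK5"),
    ("inpatient", "UK2913"),
    ("key_worker", "UK5"),
    ("key_worker", "UK244"),
    ("key_worker", "UK244"),
    ("key_worker", "UK606"),
]


def lineages_by_group(records):
    """Collect the set of lineages observed in each group."""
    groups = defaultdict(set)
    for group, lineage in records:
        groups[group].add(lineage)
    return dict(groups)


def exclusive_lineages(records, group, other_groups):
    """Lineages seen in `group` but in none of `other_groups`."""
    by_group = lineages_by_group(records)
    others = set().union(*(by_group.get(g, set()) for g in other_groups))
    return by_group.get(group, set()) - others


only_key_workers = exclusive_lineages(records, "key_worker", ["inpatient"])
print(sorted(only_key_workers))
```

A lineage appearing in this difference is consistent with infection acquired outside the clinical setting, but, as noted in the text, lineages observed only once carry little weight.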
Four UK lineages (UK244, UK606, UK1049, UK1162) were observed three times each, and only in key workers and their families. As these lineages were not observed in hospitals or care facilities, it is likely that these key workers became infected in the community. It also indicates that these key workers were unlikely to have passed the virus to patients, providing evidence that infection control measures were effective in these cases. All UK lineages observed in household members of key workers were also seen in the key workers. Definitive confirmation of transmission, and the direction of transmission, cannot be inferred from the genome sequences due to the low evolutionary rate of the virus, which we have observed in our dataset as approximately two changes per month.

Over time most lineages became dormant or extinct, with some, such as UK448, expanding rapidly and then disappearing just as quickly over a 3-week period. Tracking lineages over time is an open challenge, with current methods requiring continual reanalysis, which adds substantial complexity and a certain degree of uncertainty. Caution must be exercised when making comparisons across different geographical regions, as the UK accounts for 64 % (39 483 out of 61 740 as at 1 September 2020) of all publicly sequenced SARS-CoV-2 genomes. Out of 103 countries that have made SARS-CoV-2 genomes publicly available through GISAID, only the UK, Australia, Spain, India and the USA have sequenced more genomes than have been sequenced in Norfolk alone for this paper. The density of sequences when normalized for population size is 13 times greater in Norfolk than in the USA.

We provide an in-depth examination of the genomic epidemiology of SARS-CoV-2 within a single geographical region covering the whole of the first wave of the pandemic from March to August 2020.
We sequenced the genomes from 172 SARS-CoV-2-positive samples per 100 000 population (1035 cases), representing 42.6 % of all positive samples collected through the Microbiology Department at NNUH. From this, we identified 100 distinct lineages in the region, corresponding to multiple parallel introductions of the virus (n≥26). Dense sequencing of the virus provided actionable information for pandemic management, including: identifying a sublineage associated with care facilities, ruling out a large nosocomial outbreak in a hospital, showing no evidence of reinfection in longitudinal samples, and confirming an outbreak at a food processing facility while allowing for spillover into the community to be monitored. These achievements were only possible through the collaborative efforts of scientists (data and molecular), clinicians, data managers and epidemiologists. The large-scale genome sequencing of SARS-CoV-2-positive samples has provided valuable additional data for public health epidemiology in the Norfolk region, and will continue to help identify and untangle hidden transmission chains as the pandemic evolves. 
[1] Clinical features of patients infected with 2019 novel coronavirus in Wuhan
[2] An interactive web-based dashboard to track COVID-19 in real time
[3] Gender differences in patients with COVID-19: focus on severity and mortality
[4] Developing insights into the mechanisms of evolution of bacterial pathogens from whole-genome sequences
[5] Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
[6] Geographical and temporal distribution of SARS-CoV-2 clades in the WHO European Region
[7] Genomic epidemiology of SARS-CoV-2 spread in Scotland highlights the role of European travel in COVID-19 emergence
[8] Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands
[9] Global initiative on sharing all influenza data - from vision to reality
[10] Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic
[11] Comparison of classical multi-locus sequence typing software for next-generation sequencing data
[12] SnapperDB: a database solution for routine sequencing analysis of bacterial isolates
[13] A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology
[14] An integrated national scale SARS-CoV-2 genomic surveillance network
[15] Evaluating the effects of SARS-CoV-2 Spike mutation D614G on transmissibility and pathogenicity
[16] nCoV-2019 sequencing protocol v2
[17] CoronaHiT: high-throughput sequencing of SARS-CoV-2 genomes
[18] Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:1303.3997 [q-bio]
[19] An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
[20] Stability of SARS-CoV-2 phylogenies
[21] Issues with SARS-CoV-2 sequencing data
[22] Data, disease and diplomacy: GISAID's innovative contribution to global health
[23] The PHA4GE SARS-CoV-2 contextual data specification for open genomic epidemiology
[24] The international nucleotide sequence database collaboration
[25] CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community
[26] IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era
[27] Dating of the human-ape splitting by a molecular clock of mitochondrial DNA
[28] Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods
[29] Using ggtree to visualize data on tree-like structures
[30] SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments
[31] Phandango: an interactive viewer for bacterial population genomics
[32] Evaluation of transport media and specimen transport conditions for the detection of SARS-CoV-2 by use of real-time reverse transcription-PCR
[33] COVID-19 ARTIC v3 Illumina library construction and sequencing protocol
[34] Phylogenetic analysis of SARS-CoV-2 data is difficult
[35] Tracking changes in SARS-CoV-2 spike: evidence that D614G increases infectivity of the COVID-19 virus
[36] Abbreviations and symbols for nucleic acids, polynucleotides, and their constituents
[37] NHS. Equality and diversity NHS trusts and CCGs
[38] SARS-CoV-2 viral load in upper respiratory specimens of infected patients
[39] Emergence of genomic diversity and recurrent mutations in SARS-CoV-2
[40] Visualizing and sharing data for genomic epidemiology and phylogeography

The authors gratefully acknowledge the support of the Biotechnology and Biological Sciences Research Council (BBSRC); this research was funded by the BBSRC Institute Strategic Programme Microbes in the Food Chain BB/R012504/1 and its constituent projects BBS/E/F/000PR10348, BBS/E/F/000PR10349, BBS/E/F/000PR10351 and BBS/E/F/000PR10352. D.J.B., N.F.A., T.L.V. and A.J.P. were supported by the Quadram Institute Bioscience BBSRC-funded Core Capability Grant Roche Diagnostics. A.P.T.
was funded by Sara Borrell Research Grant CD018/0123 from ISCIII and co-financed by the European Development Regional Fund (A Way to Achieve Europe programme) and A.P.T.'s QIB internship was additionally funded by 'Ayuda de la SEIMC'. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Thanks to the COG-UK Consortium Study Group for their contributions. Thanks to Dr Judith Pell for her help and insightful comments on the manuscript. We gratefully acknowledge the submitters to GISAID; full details are listed in Supplementary Material 2.