key: cord-0844872-t7w72wvs
authors: O'Toole, A. N.; Hill, V.; Jackson, B.; Dewar, R.; Sahadeo, N.; Colquhoun, R.; Rooke, S.; McCrone, J. T.; McHugh, M.; Nicholls, S.; Poplawski, R.; The COVID-19 Genomics UK Consortium,; COVID-19 Impact Project,; Aanensen, D.; Holden, M.; Connor, T. R.; Loman, N.; Goodfellow, I. G.; Carrington, C.; Templeton, K.; Rambaut, A.
title: Genomics-informed outbreak investigations of SARS-CoV-2 using civet
date: 2021-12-14
journal: nan
DOI: 10.1101/2021.12.13.21267267
sha: 17d4cb759889380f121656036c84df29ed60bea2
doc_id: 844872
cord_uid: t7w72wvs

The scale of data produced during the SARS-CoV-2 pandemic has been unprecedented, with more than 5 million sequences shared publicly at the time of writing. This wealth of sequence data provides important context for interpreting local outbreaks. However, placing sequences of interest into national and international context is difficult given the size of the global dataset. Often outbreak investigations and genomic surveillance efforts require running similar analyses again and again on the latest dataset and producing reports. We developed civet (cluster investigation and virus epidemiology tool) to aid these routine analyses and facilitate virus outbreak investigation and surveillance. Civet can place sequences of interest in the local context of background diversity, resolving the query into different 'catchments' and presenting the phylogenetic results alongside metadata in an interactive, distributable report. Civet can be used on a fine scale for clinical outbreak investigation, for local surveillance and cluster discovery, and to routinely summarise the virus diversity circulating on a national level. Civet reports have helped researchers and public health bodies feed back genomic information in the appropriate context within a timeframe that is useful for public health.

The timely sharing of genomic data during the SARS-CoV-2 pandemic has enabled large-scale national and international surveillance efforts around the world. On a finer scale, pathogen genomics can supplement infection prevention and control efforts in clinical settings, as well as aid in outbreak investigations in community settings (Köser 2012, Quick 2014, Houldcroft 2018, Brown 2019). However, the intense SARS-CoV-2 sequencing effort has produced a genomic dataset orders of magnitude larger than any previous epidemic, with more than 5 million sequences shared publicly at time of writing. It is therefore challenging to effectively condense information into relevant summaries and provide meaningful context in a timeframe that allows the data to be of immediate use to those involved in local outbreak response.

Analysing or interpreting genomic information alone without relevant epidemiological information can be misleading and lead to incorrect conclusions due to the incomplete nature of the data. The relatively low mutation rate of SARS-CoV-2, frequent occurrence of convergent mutations (homoplasies), and prevalence of incomplete genome sequences make it critical to integrate epidemiological information alongside the genomic data to provide the most accurate picture and extract the most value from any given dataset. This includes temporal and spatial information, but may also include outbreak-specific data such as profession, ward, clinical metadata, or the background of viral lineages actively circulating in the community. Outbreak investigations often require bespoke reports that present information in a transparent and accessible manner. The data presented must be easily interpretable by health care providers and teams involved in infection control, the majority of whom are not accustomed to incorporating this type of data into their decision making processes.

The virus genomics community has developed a number of tools for analysing and visualising virus genomic data at the scale of this pandemic. HgPhyloPlace uses UShER to rapidly place sequences of interest into a global SARS-CoV-2 phylogeny (https://hgwdev.gi.ucsc.edu/cgi-bin/hgPhyloPlace; Turakhia et al 2021). Tree visualization tools such as Pando (pando.tools), cov2tree (cov2tree.org), Microreact (Argimón et al 2016) and Dendroscope (Huson et al 2007) can efficiently display phylogenies with a million sequences. However, even with these innovations, it is challenging to construct a phylogenetic tree of that size, given the particular challenges of SARS-CoV-2 data (De Maio et al 2020, Morel et al 2021). Nextstrain takes an alternative approach and downsamples the dataset heavily, leaving a manageable amount of data to display (Hadfield et al 2018). The advantage is a rapidly generated phylogeny; however, only a small subset of the full diversity is represented. Approaches to condense SARS-CoV-2 genomic information by Single Nucleotide Polymorphism (SNP) typing or lineage typing - 

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license. This version posted December 14, 2021.

To run civet, the user must minimally provide a sequence alignment and metadata file representing the background diversity of the pathogen of interest. Users on CLIMB-COVID have this data provided by the COG-UK Datapipe (https://github.com/COG-UK/datapipe), although a similarly centralised set up could be applied elsewhere. Civet can also generate the background alignment, metadata file and a SNP summary file from an unaligned fasta sequence, such as the bulk download sequence file available from GISAID (Figure 1). This short pipeline first filters genome sequences based on a minimum length and maximum ambiguity content (%N) cut-off. It then maps against a reference sequence (default is the canonical SARS-CoV-2 reference genome Genbank ID: NC_045512.2, but any reference genome can be supplied) using minimap2 v2.17 (Li 2018). The resulting SAM file is converted to fasta format with the 5' and 3' untranslated regions (UTRs) masked using gofasta (https://github.com/cov-ert/gofasta). We generate the background metadata file by parsing information from the sequence headers.

Figure 1: Background data generation pipeline (a-c) and how a civet query is defined (d-f). In order to contextualize the query sequences, civet requires a set of background data files, minimally an alignment and metadata file. a) These files can be generated from an unaligned multi-sequence file using the flag: --generate-civet-background-data. b) The genome sequences are put through a minimum length and maximum N-content filter before being mapped against a reference sequence. The alignment file is generated by trimming and padding against the reference, masking terminal ends with Ns. Information encoded in the sequence header is used to generate the metadata file. gofasta condenses the alignment to the set of derived nucleotide changes in each sequence with respect to the root of the pandemic, to provide an extra speed up for analysis within civet. c) The background files created can then be used as the background data for civet with --datadir or set as an environment variable.
d) The query is generated from the background data supplied by specifying a set of criteria to match against, for example all sequences from a particular location within a certain timeframe. The user can also provide a string of specific ids to match or an additional metadata file that specifies the query records and may contain extra metadata fields that only correspond to query sequences, for example patient IDs. e) An additional fasta file for sequences not present in the background data can be provided and civet will perform some quality control checks and align the sequences by mapping and padding against the reference (Default NC_045512.2). f) civet combines the set of query sequence records matched from the background data and from the input fasta file to generate the full query set, and then collapses identical sequences for efficiency. These get expanded out at the end of the analysis pipeline.
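The length and N-content filter applied before mapping can be illustrated with a short sketch. This is not civet's own code: the function names and the threshold values are hypothetical, chosen only to show the kind of check described above.

```python
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_sequences(
    fasta_text: str,
    min_length: int = 20000,      # hypothetical default, not taken from civet
    max_n_fraction: float = 0.5,  # hypothetical default, not taken from civet
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split records into those passing QC and those failing, with a reason."""
    kept: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for name, seq in read_fasta(fasta_text):
        if len(seq) == 0 or len(seq) < min_length:
            failed[name] = "too short"
        elif seq.upper().count("N") / len(seq) > max_n_fraction:
            failed[name] = "too many Ns"
        else:
            kept[name] = seq
    return kept, failed
```

Recording the reason for each failure, rather than silently dropping records, makes it easier to report back which query sequences were excluded and why.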

There are two main ways to define a query dataset, described in Figure 1d . First, a user can define a query from the background data based on metadata, for instance a collection date within a certain time frame, or sequences from a particular location. For example, to generate a report for sequences from June 2021 sampled in Edinburgh: civet --from-metadata date=2021-06-01:2021-07-01 location=Edinburgh. Alternatively, the user can supply a string of query identifiers directly to civet, or a comma-separated (CSV) file specifying the query sequences with some additional metadata not present in the background, like patient IDs. Optionally, a separate fasta file can be supplied to run an analysis on sequences not present in the background dataset. The sequences will go through configurable quality control filters for minimum sequence length and maximum N-content, and are then aligned by mapping and padding against the reference sequence as described for the background dataset creation (Figure 1e ). Query identifiers are matched with the alignment in the background data, and the set of fasta sequence records is compiled from queries in both alignment files. Identical sequences are collapsed to a single unique sequence (Figure 1f ). Collapsing identical sequences greatly improves analysis efficiency, particularly for outbreak investigations of epidemiologically linked sequences.
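Collapsing identical sequences and expanding them again at the end of the analysis can be sketched as follows. This is a minimal illustration under the assumption that sequences are compared as exact, case-insensitive strings; the function names are hypothetical and do not come from civet.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collapse_identical(
    seqs: Dict[str, str],
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Keep one representative per distinct sequence and remember its members."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq.upper()].append(name)
    unique: Dict[str, str] = {}
    members: Dict[str, List[str]] = {}
    for seq, names in groups.items():
        representative = names[0]  # first record seen stands in for the group
        unique[representative] = seq
        members[representative] = names
    return unique, members


def expand(
    results: Dict[str, str], members: Dict[str, List[str]]
) -> Dict[str, str]:
    """Copy each representative's result back to every collapsed record."""
    return {
        name: value
        for representative, value in results.items()
        for name in members[representative]
    }
```

For a hospital outbreak in which many genomes are identical, the downstream comparisons then run once per distinct sequence instead of once per sample.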

Once identical sequences have been collapsed, civet searches the background dataset

As illustrated in Figure 2a, SNPs can either be unique to the query sequence, unique to the target sequence, or present in the intersection of the two. SNPs present in the intersection represent shared ancestry whereas an excess of SNPs in either the query or target set can be interpreted to give directionality relative to a root sequence. These set comparisons (details in Supplementary Figure 1) allow the target sequences to be classified as either on a polytomy with (same), a direct ancestor of (up), a direct descendant of (down) or polyphyletic with (side) the query sequence (Figure 2). Each target is then ranked according to SNP distance from the query sequence (as illustrated in the schema in Figure 2b). The customisable SNP distance is used to define which target sequences fall within the catchment of a given query. All equally distant targets are included in the catchment.
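One plausible reading of these set comparisons can be written down in a few lines. The exact rules are given in Supplementary Figure 1 and implemented in gofasta, so the sketch below is an assumption rather than the tool's logic: it treats a target whose SNPs are a strict subset of the query's as an ancestor (up), a strict superset as a descendant (down), and takes the distance to be the number of SNPs not shared. The SNP labels in the usage example are illustrative only.

```python
from typing import Dict, List, Set, Tuple


def classify_target(query_snps: Set[str], target_snps: Set[str]) -> Tuple[str, int]:
    """Return (category, SNP distance) for one target relative to the query."""
    query_only = query_snps - target_snps   # SNPs unique to the query
    target_only = target_snps - query_snps  # SNPs unique to the target
    distance = len(query_only) + len(target_only)
    if not query_only and not target_only:
        return "same", distance   # identical SNP sets: polytomy with the query
    if not target_only:
        return "up", distance     # target lacks some query SNPs: ancestor-like
    if not query_only:
        return "down", distance   # target has extra SNPs: descendant-like
    return "side", distance       # both have private SNPs: polyphyletic


def rank_targets(
    query_snps: Set[str], targets: Dict[str, Set[str]]
) -> List[Tuple[str, str, int]]:
    """Rank all targets by SNP distance from the query (ties broken by name)."""
    ranked = [
        (name,) + classify_target(query_snps, snps)
        for name, snps in targets.items()
    ]
    return sorted(ranked, key=lambda record: (record[2], record[0]))


query = {"C241T", "A23403G", "G28881A"}
targets = {
    "t_same": {"C241T", "A23403G", "G28881A"},
    "t_up": {"C241T", "A23403G"},
    "t_down": {"C241T", "A23403G", "G28881A", "C14408T"},
    "t_side": {"C241T", "A23403G", "T445C"},
}
ranking = rank_targets(query, targets)
```

Because each comparison is a pair of set differences on short SNP lists rather than a full-genome alignment comparison, it remains cheap even against millions of background records.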

For a given query, if no targets fall within the SNP distance cut-off, the algorithm continues outwards in all directions and attempts to get at least one sequence per category (up, down or side). This results in a set of targets for each query, and any queries with overlapping targets have their catchments merged together (Figure 2c).
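The catchment selection and merging steps can be sketched as below. This is a simplified illustration, not civet's implementation: it uses a single distance threshold for all categories, the fallback takes the nearest target or targets in each category, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Ranked = List[Tuple[str, str, int]]  # (target name, category, SNP distance)


def select_catchment(ranked: Ranked, max_distance: int = 2) -> Set[str]:
    """Pick every target within max_distance, or fall back to the nearest ones."""
    within = {name for name, _, dist in ranked if dist <= max_distance}
    if within:
        return within
    # Fallback: move outwards and take the closest target(s) per category.
    selected: Set[str] = set()
    for category in ("up", "down", "side"):
        distances = [d for _, c, d in ranked if c == category]
        if distances:
            nearest = min(distances)
            selected |= {n for n, c, d in ranked if c == category and d == nearest}
    return selected


def merge_catchments(
    per_query: Dict[str, Set[str]],
) -> List[Tuple[Set[str], Set[str]]]:
    """Merge queries whose target sets overlap into shared catchments."""
    merged: List[Tuple[Set[str], Set[str]]] = []
    for query, targets in per_query.items():
        queries, pooled = {query}, set(targets)
        remaining = []
        for other_queries, other_targets in merged:
            if pooled & other_targets:
                queries |= other_queries   # absorb the overlapping catchment
                pooled |= other_targets
            else:
                remaining.append((other_queries, other_targets))
        merged = remaining + [(queries, pooled)]
    return merged
```

Existing catchments are kept mutually disjoint at every step, so one pass over them per new query is enough to catch chains of overlap (a query that bridges two previously separate catchments merges all three).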

At this point, there is no limit to the size of catchments and as the pandemic has been sampled so intensively in some areas, even relatively low SNP distances can lead to a large catchment. The user has the option to downsample the catchments prior to tree building and configure the maximum number of the background sequences to include in a given catchment tree (Figure 2d). Downsampling can be run in: random mode, which randomly samples from the full catchment; enrich mode, which allows the user to specify a metadata trait to enrich for and the factor by which to enrich over the other targets in the catchment; or normalise mode, which allows the user to sample evenly across a metadata trait, such as epiweek. The query sequences, background catchment sequences and an anonymised early lineage A outgroup sequence are then gathered for tree building. Each catchment tree along with the queries is then estimated using iqtree with the HKY

Figure 2: Schema of civet catchment and tree building pipeline. We show three query sequences, falling in two distinct catchments (pink and green). a) Each query sequence is compared against the set of SNPs for every record (target) in the background metadata. By evaluating the intersection and union of the two SNP sets, it is possible to assess directional SNP distance relative to the reference sequence (the early lineage A sequence with GISAID ID EPI_ISL_406801). b) For each query, all targets are ranked by distance from the query and classified as either up, down or side targets based on the set profile in panel a. c) Catchments are constructed by selecting all targets that fall within the specified SNP distance. Up, down and side distances can be configured separately (the default SNP distance of 2 SNPs for all categories is shown here). Civet then merges any catchments with overlapping targets. d) An outgroup reference sequence is added to each catchment and, if necessary, catchments are downsampled. e) Civet estimates a maximum likelihood tree for each catchment using iqtree. f) The reference sequence is pruned out and the tips of the tree are annotated with user-specified fields. g) Specific metadata annotations are added to each tip, which can be toggled within the report.
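The three downsampling modes (random, enrich and normalise) can be illustrated with a small sketch. This is an assumption about how such modes might be realised, not civet's code: enrich mode is implemented here with Efraimidis-Spirakis weighted sampling without replacement, normalise mode with a round-robin draw across trait values, and the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional


def downsample(
    catchment: Dict[str, Dict[str, str]],  # target name -> metadata fields
    max_size: int,
    mode: str = "random",
    trait: Optional[str] = None,
    enrich_value: Optional[str] = None,
    factor: float = 1.0,  # must be > 0; weight given to enriched targets
    seed: int = 0,
) -> List[str]:
    """Reduce a catchment to at most max_size background targets."""
    names = sorted(catchment)
    if len(names) <= max_size:
        return names
    rng = random.Random(seed)
    if mode == "random":
        return sorted(rng.sample(names, max_size))
    if mode == "enrich":
        # Targets carrying enrich_value are `factor` times more likely to be kept.
        weights = {
            n: factor if catchment[n].get(trait) == enrich_value else 1.0
            for n in names
        }
        keyed = sorted(
            names, key=lambda n: rng.random() ** (1.0 / weights[n]), reverse=True
        )
        return sorted(keyed[:max_size])
    if mode == "normalise":
        groups: Dict[str, List[str]] = defaultdict(list)
        for n in names:
            groups[catchment[n].get(trait, "")].append(n)
        for group_members in groups.values():
            rng.shuffle(group_members)
        chosen: List[str] = []
        while len(chosen) < max_size:
            # Take one target per trait value in turn until the cap is reached.
            for key in sorted(groups):
                if groups[key] and len(chosen) < max_size:
                    chosen.append(groups[key].pop())
        return sorted(chosen)
    raise ValueError(f"unknown downsampling mode: {mode}")
```

Only background targets are passed to such a function; the query sequences themselves are always retained for tree building.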

Civet generates a fully customisable report, summarising information about the queries of interest and the surrounding diversity. The report generated is an HTML file that can be viewed in a web browser, which allows it to be interactive in the same way as a web page. The components of the report include an interactive table summarising metadata of the query sequences, including any user supplied metadata; which catchment a query falls in; and the mutations of interest if specified. This table can be sorted, filtered and its columns can be dynamically configured, all within the distributable report. For each catchment, the civet report contains a table summarising the catchment content (prior to downsampling) and describes which lineages and countries are present in this local diversity neighbourhood (example shown in Figure 4b).

The civet report displays the catchment trees using the interactive tree visualisation library FigTree.js (https://github.com/rambaut/figtree.js). The trees can be expanded out along the vertical axis and tip nodes can be coloured by any field specified with --tree-annotations. Clades can be collapsed down by clicking on the parent branch and uncollapsed by clicking again. Each taxon in the tree is associated with additional metadata that can be displayed by selecting a tip (demonstrated in Figure 4d). Civet runs snipit, a python tool that finds the SNPs relative to a reference in a multiple sequence alignment and highlights these changes as a figure (https://github.com/aineniamh/snipit).
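Finding SNPs relative to a reference in an alignment amounts to a column-by-column comparison. The sketch below illustrates the idea only and is not snipit's implementation: it assumes 1-based alignment positions and ignores any site where either base is ambiguous or a gap.

```python
from typing import List, Tuple

UNAMBIGUOUS = set("ACGT")


def find_snps(reference: str, sequence: str) -> List[Tuple[int, str, str]]:
    """List (position, reference base, query base) for each unambiguous change."""
    if len(reference) != len(sequence):
        raise ValueError("sequences must come from the same alignment")
    snps = []
    pairs = zip(reference.upper(), sequence.upper())
    for position, (ref_base, base) in enumerate(pairs, start=1):
        if base != ref_base and base in UNAMBIGUOUS and ref_base in UNAMBIGUOUS:
            snps.append((position, ref_base, base))
    return snps
```

Skipping N and gap characters avoids reporting missing data as a nucleotide change, which matters for the incomplete genomes common in SARS-CoV-2 datasets.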

The report also contains a query timeline based on supplied temporal metadata, and interactive maps both for plotting the query sequence locations and for summarising the background diversity in the location of interest up to administrative level 2 for the UK and administrative level 1 for the rest of the world.

The user can generate multiple reports with one command to customise content for different intended audiences. Using the --report-content option, a report containing all the results shown in Figure 4 can be generated alongside a report intended for the Infection Prevention and Control (IPC) team, which may, for instance, contain just the summary tables and not the phylogenies. Full report configuration details can be found in the civet documentation at https://cov-lineages.org/resources/civet.html.

There have been a number of studies demonstrating the utility of in-hospital genomic

The case study presented in Figure 3 describes an outbreak investigation carried out in an Edinburgh hospital in 2020. An outbreak of SARS-CoV-2 was detected, with cases across three wards that included multiple staff and patients (Figure 3a). The earliest case detected was a patient in Ward B sampled on Day 0 (Figure 3b). In the following days, three more patients across Wards A and B tested positive for SARS-CoV-2, two of whom had recently travelled from Country X. Subsequently, three healthcare workers who had been working in Ward A and two healthcare workers who had been working across Wards B and C tested positive. A household contact of one of these healthcare workers tested positive the same day and finally a healthcare worker in Ward C tested positive. At the outset of the investigation, the outbreak was thought to have been caused by either an initial patient-to-staff transmission event with subsequent staff-to-staff transmission, or multiple patient-to-staff exposures.

infer directionality based on this information, however this phylogeny does show that the diversity in the hospital overlapped with that present in the community. Figure 4d shows the phylogenetic relationship of catchment 2, with the two patients with travel history from Country X and earliest staff member to contract lineage B.1.1 all sharing identical SARS-CoV-2 genome sequences. Figure 4e displays the snipit plots that summarise the nucleotide changes from reference among queries of interest, and the sample collection date for each query sequence is shown in the timeline plot in Figure 4f, coloured by ward. civet resolved the outbreak into two distinct catchment trees making it likely that there were multiple introductions into the hospital from the community, and the mixture of wards present in each catchment implies some between-ward transmission. This report highlights two areas of control for the IPC to focus future efforts. As the case was deemed at least two separate introduction events with clear transmission links, the outbreak investigation was subsequently closed by the IPC team.

summarised in an interactive table, with sortable columns that can be toggled on and off. b) Each catchment is summarised in full, regardless of downsampling. Number of queries and the countries and lineages within the catchment are indicated. c) The catchment phylogenies are displayed initially in compact form, but can be expanded vertically using the Expansion slider. By default tip nodes are coloured by whether a tip is a query taxa or not, but the dropdown menu allows the user to colour tip nodes by any trait specified in --tree-annotations. d) Tip nodes can be selected to show the metadata associated with that particular sequence and clades can be collapsed to a single node by selecting the parent branch. e-f) snipit graphs highlight nucleotide differences from the reference genome. g-h) A timeline summarises any query date information provided. Note: all metadata has been de-identified for data protection purposes.

Civet can also be used as part of routine local surveillance to summarise the diversity of viruses circulating in a local area or to flag and monitor clusters of interest. The N501Y mutation in the SARS-CoV-2 spike protein has been predicted to increase SARS-CoV-2 receptor binding domain ACE2 affinity (Starr et al 2020; https://jbloomlab.github.io/SARS-CoV-2-RBD_DMS/ last accessed 2021-08-10). As such, the presence of this mutation has been monitored as part of the genomic surveillance efforts in the UK and around the world. We present a hypothetical case of a civet report generated from a simple command used to search a background dataset from COG-UK from the 21st of October 2020 (Figure 5, full report available at https://cov-lineages.org/resources/civet/civet_case_study_2.html). The search defined queries as sequences from the UK with the spike N501Y mutation from the beginning of September 2020 to the latest data in the background set (2020-10-21). Figure 5a demonstrates the query summary table sorted by earliest samples. At the time in the UK, two concurrent geographically-distinct clusters existed (Figure 5c); one in Wales that became known as B.1.1.70 and one in south east England that became B.1.1.7. There were also two further, very small, clusters that contained S:N501Y between 1st September and 21st October 2020. At this snapshot in time, B.1.1.7 is clearly distinguishable but only has 13 sequences. By running civet routinely, the user can both discover and monitor clusters such as B.1.1.7 and B.1.1.70 as they progress, facilitating rapid public health interventions.

Figure 5: Sample of figures from a civet report demonstrating its use for community surveillance in the UK. As a hypothetical example, we used civet to search the COG-UK dataset from the 21st of October 2020 for SARS-CoV-2 sequences with the spike protein mutation N501Y in September and October 2020. At this point, 4 independent occurrences of this mutation were detected using civet. The earliest sequences can be seen in panel a. The two main clusters correspond to B.1.1.70, which was a lineage circulating in Wales, and B.1.1.7, which only had 13 sequences at this time point. Despite being small, the striking basal branch of B.1.1.7 is clearly visible in panel b. Running civet routinely enables early identification and tracking of clusters such as these. Panel c shows the query map of the samples identified with N501Y and the geographic separation of catchments 1, 2 and 4.
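A metadata search of this kind (a mutation flag, a country and a sampling window) can be sketched as a row filter over the background metadata. This is an illustration rather than civet's own parser: it assumes criteria written as field=value or field=start:end, treats the date range as inclusive at both ends, and uses hypothetical function names. The column names N501Y, country and sample_date follow the query command given in the Supplementary Materials.

```python
import csv
import io
from datetime import date
from typing import Dict, List


def matches(row: Dict[str, str], field: str, value: str) -> bool:
    """Check one criterion; a value containing ':' is read as a date range."""
    cell = row.get(field, "")
    if ":" in value:
        start_text, end_text = value.split(":", 1)
        start, end = date.fromisoformat(start_text), date.fromisoformat(end_text)
        try:
            sampled = date.fromisoformat(cell)
        except ValueError:
            return False  # missing or malformed dates never match a range
        return start <= sampled <= end
    return cell == value


def select_queries(metadata_csv: str, criteria: List[str]) -> List[Dict[str, str]]:
    """Return the metadata rows that satisfy every field=value criterion."""
    parsed = [criterion.split("=", 1) for criterion in criteria]
    rows = csv.DictReader(io.StringIO(metadata_csv))
    return [row for row in rows if all(matches(row, f, v) for f, v in parsed)]
```

Running the same criteria against each new daily background dataset yields a fresh query set, so newly sequenced members of a cluster are picked up without changing the command.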

Civet also has the flexibility to inform surveillance efforts at the national level. In Figure 6, we show a schema of a civet report summarising genomic surveillance efforts in Trinidad and Tobago during 2020, full report available at https://cov-lineages.org/resources/civet/civet_case_study_3.html. Figure 6a Tobago fits into the overall diversity of SARS-CoV-2 in 2020. Reports could be routinely generated on a weekly or monthly basis to provide information on the changing context of a country's epidemic compared to its neighbours.

Figure 6: Schema of a national level surveillance report generated using civet for Trinidad and Tobago. All SARS-CoV-2 genome sequences on GISAID from 2020 with <20% ambiguity content are summarised in the report (n=28). a. Available metadata for query sequences from Trinidad and Tobago. Most genomes have been assigned lineage B.1.111, although a smaller number of genomes are assigned other lineages B.1.1.33 and B.1.1. b. Catchment 1 phylogeny. Query sequences are placed in the context of background diversity beyond Trinidad and Tobago. c. Expanding the phylogeny and colouring tips by lineage shows this catchment includes query sequences from lineage B.1.111. d. Aggregate count of queries over time, coloured by lineage. e. Lineage diversity of Trinidad and Tobago and surrounding countries as generated using the background diversity map in civet.

Virus genome sequencing can help reveal transmission chains and clusters of interest to aid outbreak investigations and surveillance efforts, as exemplified by the case studies above. With civet, academic researchers and public health scientists can easily run complex and robust phylogenetic analyses with a single command, contextualising sequences of interest in the large background dataset and visualising them alongside temporal, spatial and other epidemiological metadata in an interactive, distributable report.

This frees users to place emphasis on interpreting the data and allows them to deliver information on a time-frame that is useful for public health responses.

Throughout the SARS-CoV-2 pandemic, civet has been primed for use investigating SARS-CoV-2 clinical outbreaks and running local surveillance on CLIMB-COVID (Nicholls et al 2021) as part of the COG-UK project. Each day on CLIMB-COVID, researchers from around the UK upload the latest SARS-CoV-2 genome sequences and accompanying metadata. The read data undergo rigorous quality control and a data-processing and phylogenetics pipeline compiles and analyzes the resulting genomes in combination with the global dataset from GISAID (https://github.com/COG-UK/datapipe). This makes the latest SARS-CoV-2 genome data available to civet users on a daily basis. COG-UK data protection stipulates that data cannot be removed from CLIMB-COVID and often outbreak investigations involve sensitive, protected metadata. With civet, researchers can run analysis on CLIMB-COVID, distribute the report and keep their metadata protected. Civet has been popular and widely used within the framework of COG-UK, by academic researchers and scientists in public health agencies, for investigating SARS-CoV-2 clinical outbreaks and running local surveillance. A similar centralised server infrastructure could be set up for a national surveillance response or more local "locked down" compute environments (Nicholls et al 2021) and civet could be easily implemented within this framework to aid outbreak investigations.

Civet can easily perform phylogenetic analysis on large datasets and provide reports for any countries with sequences to analyse. Default settings are configured for SARS-CoV-2, but civet is virus-agnostic and can be set up to run on other viruses of interest with an appropriate background dataset and reference sequence. Although civet is currently a command-line based tool, a clear extension to the software is to develop and provide a graphical user interface. This will enable users unfamiliar with the command line to run civet. We also plan to continue developing civet and adding extra features, including a country specific summary comparing counts of genomes sequenced over time with additional epidemiological data such as cases per country over time, which is already available on the Johns Hopkins University COVID-19 DataAPI (Dong et al 2020). This particular feature will help give appropriate context for countries with relatively low numbers of sequences, as it is important to take sequencing biases into account when inferring outbreak or transmission dynamics.

As the ability to rapidly sequence pathogens at scale has become less technically challenging, in part due to the availability of robust protocols such as those by the ARTIC Network (Quick 2017), the amount of data that can be generated from a small laboratory with limited infrastructure has significantly increased. Arguably the greatest challenges now lie in trying to best utilise this data in an effective way to inform the response efforts, which hinges entirely on the ability to efficiently contextualise the data and provide an output that is interpretable by those less versed in the interpretation of phylogenetic trees. In this way, civet can help alleviate the analytical bottleneck that exists as a major issue for many public health labs and can maximise the value of genomic data.

Project home page: https://github.com/artic-network/civet

Operating system(s): Unix based platforms, tested on Ubuntu and MacOSX

Programming language: Python, mako

All code is open-source and available on GitHub at github.com/artic-network/civet under a GNU General Public License v3.0.

Supplementary Materials

Figure S1. Set categories for gofasta "updown-top-ranking". Shaded regions in the Venn diagrams represent having at least one SNP in that category (either in Q, T or Q ∩ T).

civet --from-metadata N501Y=Y country=UK \
  sample_date=2020-09-01:2020-10-21 \
  --mutations S:N501Y \

COVID-19 Genomics UK (COG-UK) Consortium, 2021. The role of viral genomics in understanding COVID-19 outbreaks in long-term care facilities

Microreact: visualizing and sharing data for genomic epidemiology and phylogeography

Norovirus Transmission Dynamics in a Pediatric Hospital Using Full Genome Sequences

CLIMB (the Cloud Infrastructure for Microbial Bioinformatics): an online resource for the medical microbiology community

An integrated national scale SARS-CoV-2 genomic surveillance network

Issues with SARS-CoV-2 sequencing data

An interactive web-based dashboard to track COVID-19 in real time

SARS-CoV-2 lineage dynamics in England from

Data, disease and diplomacy: GISAID's innovative contribution to global health

The impact of real-time whole genome sequencing in controlling healthcare-associated SARS-CoV-2 outbreaks

Nextstrain: real-time tracking of pathogen evolution

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA

Dr Alisha Davies PhD 74 , Elen De Lacy MPH 74 , Fatima Downing 74

Dr Rebecca Williams BMBS 33 , Wendy Chatterton MSc 34 , Monika Pusok MSc 34 , William Everson MSc 37 , Anibolina Castigador IBMS HCPC 44 , Emily Macnaughton FRCPath 44 , Dr Kate El Bouzidi MRCP 45 , Dr Temi Lampejo FRCPath 45 , Dr Malur Sudhanva FRCPath 45

We thank the following for helpful suggestions, comments, beta-testing, feature requests