key: cord-0283862-iw4otetc
authors: Ferreira, Roux-Cil; Wong, Emmanuel; Gugan, Gopi; Wade, Kaitlyn; Liu, Molly; Baena, Laura Muñoz; Chato, Connor; Lu, Bonnie; Olabode, Abayomi S.; Poon, Art F. Y.
title: CoVizu: Rapid analysis and visualization of the global diversity of SARS-CoV-2 genomes
date: 2021-07-21
journal: bioRxiv
DOI: 10.1101/2021.07.20.453079
sha: 2994ad769f21f0968a5c3ec1614ab8744fe7264a
doc_id: 283862
cord_uid: iw4otetc

Phylogenetics has played a pivotal role in the genomic epidemiology of SARS-CoV-2, such as tracking the emergence and global spread of variants, and in scientific communication. However, the rapid accumulation of genomic data from around the world, with over two million genomes currently available in the GISAID database, is testing the limits of standard phylogenetic methods. Here, we describe a new approach to rapidly analyze and visualize large numbers of SARS-CoV-2 genomes. Using Python, genomes are filtered for problematic sites, incomplete coverage, and excessive divergence from a strict molecular clock. All differences from the reference genome, including indels, are extracted using minimap2 and compactly stored as a set of features for each genome. For each Pango lineage (https://cov-lineages.org), we collapse genomes with identical features into ‘variants’, draw 100 bootstrap samples of the feature set union to generate weights, and compute the symmetric differences between the weighted feature sets for every pair of variants. The resulting distance matrices are used to generate neighbor-joining trees in RapidNJ, which are converted into a majority-rule consensus tree for the lineage. Branches with support values below 50% or mean lengths below 0.5 differences are collapsed, and tip labels on affected branches are mapped to internal nodes as directly-sampled ancestral variants. Currently, we process about million genomes in approximately nine hours on 34 cores. The resulting trees are visualized using the JavaScript framework D3.js as ‘beadplots’, in which variants are represented by horizontal line segments, annotated with beads representing samples by collection date. Variants are linked by vertical edges to represent branches in the consensus tree. These visualizations are published at https://filogeneti.ca/CoVizu. All source code was released under an MIT license at https://github.com/PoonLab/covizu.
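
As an illustration of the distance step summarized above, the following Python sketch collapses genomes with identical feature sets into variants, resamples the union of features with replacement to obtain integer weights, and sums those weights over the symmetric difference of each pair of variants. It is not taken from the CoVizu source code. The feature encoding (tuples of type, position, and alternate state) and all function names are assumptions made for this example. The resulting matrices would still need to be passed to a neighbor-joining program such as RapidNJ.

```python
import random
from collections import Counter
from itertools import combinations


def collapse_variants(genomes):
    """Group genome labels by identical feature sets (hypothetical encoding)."""
    variants = {}
    for label, features in genomes.items():
        variants.setdefault(frozenset(features), []).append(label)
    return variants


def bootstrap_weights(union, rng):
    """Resample the feature union with replacement.

    The weight of a feature is the number of times it was drawn.
    """
    features = sorted(union)
    return Counter(rng.choices(features, k=len(features)))


def weighted_symdiff(a, b, weights):
    """Sum the weights of features found in exactly one of the two sets."""
    return sum(weights[f] for f in a ^ b)


def distance_matrices(variants, n_boot=100, seed=1):
    """Yield one pairwise distance dictionary per bootstrap replicate."""
    rng = random.Random(seed)
    keys = list(variants)
    union = frozenset().union(*keys)
    for _ in range(n_boot):
        w = bootstrap_weights(union, rng)
        yield {
            (i, j): weighted_symdiff(keys[i], keys[j], w)
            for i, j in combinations(range(len(keys)), 2)
        }
```

With unit weights, this distance reduces to the plain count of features that differ between two variants. The bootstrap replicates perturb that count so that support values can later be attached to branches.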

[Figure: number of genomes per week (y-axis) over time, from January 2020 onward (x-axis).] [...] points (indicated by vertical dashed lines) using the R package segmented [9]. An increasing linear trend relative to a log-transformed y-axis indicates an exponentially growing rate of genome submission.

[...] the time scale of transmission for SARS-CoV-2 tends to outpace its molecular clock, such that many new infections are genetically identical to their source populations. Paradoxically, we can become increasingly uncertain about the relationships among specific lineages as we collect greater amounts of data [10]. This uncertainty is exacerbated by sequencing error [11] and a substantial prevalence of missing data (incomplete genome sequences).

Even if it is computationally feasible to accurately infer the evolutionary relationships among millions of sampled infections, visualizing these results in a meaningful way is a significant challenge [...]

Sequence alignment and cleaning. An uncompressed data stream from the GISAID provisioned file is processed in Python to exclude any record whose genome sequence: (1) lacks a Pango lineage assignment; (2) was sampled from a non-human host; (3) is shorter than 29,000 nt; (4) lacks a complete sample collection date (e.g., year and month with no day); (5) [...]

[...] time, we assume a uniform rate over all possible genetic differences and ignore multiple hits. [...]

[...] are re-assigned to its parent. Thus, an internal node may be associated with multiple variants. We interpret a labeled internal node as an ancestral variant that has been directly observed as a genome sequence. The resulting tree is serialized into a JSON file comprising node and edge lists.

A node list is an associative array comprising lists of sample labels keyed by variant. An edge list comprises pairs of parent and child nodes (variants), branch lengths and bootstrap support values.
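
The node and edge lists described above could be written out along the following lines. This is a minimal sketch, not the CoVizu file format: the JSON keys ("nodes", "edges", "parent", "child", "length", "support") and the function name are assumptions made for this example.

```python
import json


def serialize_tree(nodes, edges):
    """Serialize a consensus tree as node and edge lists.

    nodes: dict mapping a variant label to its list of sample labels.
    edges: iterable of (parent, child, branch_length, support) tuples.
    All key names in the output are hypothetical.
    """
    edge_list = [
        {"parent": p, "child": c, "length": length, "support": support}
        for p, c, length, support in edges
    ]
    return json.dumps({"nodes": nodes, "edges": edge_list}, indent=2)
```

Because an internal node may carry sample labels of its own, keying the node list by variant lets directly-observed ancestral variants be stored in the same way as tips.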

The CoVizu front-end is implemented in JavaScript using the D3.js (https://d3js.org/) and jQuery [...]

Mouseover events on rectangular elements trigger a 'tool tip' dialog that provides lineage-level summary statistics, such as the number of samples and mean deviation from the clock model. In addition, this dialog displays a list of all mutations that were observed in at least 50% of samples. Following the colon-delimited notation used in https://cov-lineages.org, amino acid substitutions [...] give the accession number.
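
The list of mutations observed in at least 50% of samples can be derived from per-sample mutation sets with a simple frequency count. The sketch below is written in Python for consistency with the back-end, although the front-end itself is JavaScript. The function name, the input layout, and the mutation labels in the usage note are assumptions made for this example.

```python
from collections import Counter


def common_mutations(samples, min_freq=0.5):
    """Return mutations present in at least min_freq of the samples.

    samples: list of collections of mutation labels, one per sample.
    """
    if not samples:
        return []
    # Count each mutation at most once per sample.
    counts = Counter(m for s in samples for m in set(s))
    n = len(samples)
    return sorted(m for m, c in counts.items() if c / n >= min_freq)
```

For example, given four samples in which an illustrative label "aa:S:D614G" appears four times, "aa:S:N501Y" twice, and "aa:N:R203K" once, the function returns the first two labels only.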

Search interface. Since the front-end was designed to enable users to browse the relationships among millions of SARS-CoV-2 samples, we also needed to implement a search interface to enable users to quickly focus on samples matching specific parameters. The search interface comprises a text box for submitting a substring query, which can be matched against Pango lineage names, GISAID accession numbers, countries, and sample names; and date selection widgets for specifying a range of sample collection dates. If the substring query matches a regular expression that identifies it as a partial Pango lineage name or accession number, the browser populates a drop-down with suggested 'autocompletions' of the substring.
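
The autocompletion logic described above can be sketched as follows. The example is in Python rather than the JavaScript of the actual front-end, and both regular expressions are assumptions: GISAID accessions are taken to have the form EPI_ISL_ followed by digits, and a partial Pango lineage name is taken to be one to three letters followed by dot-separated numbers. Prefix matching and the function name are likewise choices made for this example.

```python
import re

# Hypothetical patterns for partial queries (not taken from CoVizu).
ACCESSION_PARTIAL = re.compile(r"^EPI_ISL_\d+$")
LINEAGE_PARTIAL = re.compile(r"^[A-Z]{1,3}(\.\d+)*\.?$")


def suggest(query, lineages, accessions, limit=10):
    """Return up to `limit` autocompletions for a partial query.

    Queries that look like neither an accession number nor a lineage
    name yield no suggestions.
    """
    q = query.strip().upper()
    if ACCESSION_PARTIAL.match(q):
        pool = accessions
    elif LINEAGE_PARTIAL.match(q):
        pool = lineages
    else:
        return []
    return sorted(c for c in pool if c.upper().startswith(q))[:limit]
```

Testing the accession pattern first avoids treating the letters of an accession prefix as a lineage name. Free-text queries such as country names fall through to the ordinary substring search.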

The submitted query is compared to metadata extracted from all samples, and the unique identifiers of bead and lineage elements that contain hits are stored. Next, the browser modifies the class attribute of all matching elements, which causes the window to update how these elements are drawn, i.e., with CSS highlighting. Caching the search results in this way streamlines the pro- [...]

[Figure: sample labels from South Africa (KRISP and NICD identifiers, e.g., South Africa/KRISP-K002579); caption not recovered.]

Real-time tentative assessment of tracking of pathogen evolution

TreeTime: Maximum-likelihood phylodynamic analysis. Virus Evolution

Segmented: an R package to fit regression models with broken-line relationships. R News