key: cord-0961298-f5fshm3w
authors: Dellicour, Simon; Durkin, Keith; Hong, Samuel L; Vanmechelen, Bert; Martí-Carreras, Joan; Gill, Mandev S; Meex, Cécile; Bontems, Sébastien; André, Emmanuel; Gilbert, Marius; Walker, Conor; De Maio, Nicola; Faria, Nuno R; Hadfield, James; Hayette, Marie-Pierre; Bours, Vincent; Wawina-Bokalanga, Tony; Artesi, Maria; Baele, Guy; Maes, Piet
title: A phylodynamic workflow to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages
date: 2020-11-03
journal: Mol Biol Evol
DOI: 10.1093/molbev/msaa284
sha: 6b35dc4d272d9d348c1511e308c6334d8490ad2d
doc_id: 961298
cord_uid: f5fshm3w

Since the start of the COVID-19 pandemic, an unprecedented number of genomic sequences of SARS-CoV-2 have been generated and shared with the scientific community. The unparalleled volume of available genetic data presents a unique opportunity to gain real-time insights into the virus transmission during the pandemic, but also a daunting computational hurdle if analysed with gold-standard phylogeographic approaches. To tackle this practical limitation, we here describe and apply a rapid analytical pipeline to analyse the spatio-temporal dispersal history and dynamics of SARS-CoV-2 lineages. As a proof of concept, we focus on the Belgian epidemic, which has had one of the highest spatial densities of available SARS-CoV-2 genomes. Our pipeline has the potential to be quickly applied to other countries or regions, with key benefits in complementing epidemiological analyses in assessing the impact of intervention measures or their progressive easement.

First reported in early December 2019 in the province of Hubei (China), COVID-19 (coronavirus disease 2019) is caused by a new coronavirus (severe acute respiratory syndrome coronavirus 2; SARS-CoV-2) that has since rapidly spread around the world 1,2 , causing an enormous public health and socioeconomic impact 3, 4 . Since the early days of the pandemic, there has been a necessary mobilisation of the scientific community to understand its epidemiology and provide a real-time response. To this end, research teams around the world have massively sequenced and publicly released dozens of thousands of viral genome sequences to study the origin of the virus 5, 6 , and to trace its spread at global, country or community-level scales [7] [8] [9] . In this context, a platform like Nexstrain 10 , already widely used and recognised by the academic community and beyond, has quickly become a reference to follow the dispersal history of SARS-CoV-2 lineages.

In the context of the COVID-19 pandemic, the volume of genomic data available presents a unique opportunity to gain valuable real-time insights into the dispersal dynamics of the virus. However, the number of available viral genomes is increasing every day, leading to substantial computational challenges. While Bayesian phylogeographic inference represents the gold standard for inferring the dispersal history of viral lineages 11 , these methods are computationally intensive and may fail to provide useful results in an acceptable amount of time for large data sets. To tackle this practical limitation, we here describe and apply an analytical pipeline that is a compromise between fast and rigorous analytical steps. In practice, we propose to take advantage of the rapid time-scaled phylogenetic tree inference process used by the online Nextstrain platform 10 . Specifically, we aim to use the resulting time-scaled tree as a fixed empirical tree along which we infer the ancestral locations with the discrete 12 and spatially-explicit 13 phylogeographic models implemented in the software package BEAST 1.10 14 .

In Belgium, there are two main laboratories (from the University of Leuven and the University of Liège) involved in sequencing SARS-CoV-2 genomes extracted from confirmed COVID-19 positive patients. To date, some genomes (n=58) have also been sequenced at the University of Ghent, but for which metadata about the geographic origin are unavailable. As of June 10, 2020, a total of 740 genomes have been sequenced by these research teams and deposited in the GISAID (Global Initiative on Sharing All Influenza Data 15 ) database. In the present study, we exploit this comprehensive data set to uncover the dispersal history and dynamics of SARS-CoV-2 viral lineages in Belgium. In particular, our objective is to investigate the evolution of the circulation dynamics through time and assess the impact of lockdown measures on spatial transmission. Specifically, we aim to use phylogeographic approaches to look at the Belgian epidemic at two different levels: (i) the importance of introduction events into the country, and (ii) viral lineages circulating at the nationwide level. Our analytical pipeline is detailed in Supplementary Information.

On June 10, 2020, we downloaded all Belgian SARS-CoV-2 sequences (n=740) available on GISAID, as well as all non-Belgian sequences used in Nextstrain to represent the overall dispersal history of the virus (n = 4,309 with 126 different countries of origin). We generated a time-scaled phylogenetic tree using a rapid maximum likelihood approach 16 and subsequently ran a preliminary discrete phylogeographic analysis along this tree to identify internal nodes and descending clades that likely correspond to distinct introductions into the Belgian territory (Figs. 1, S2). We inferred a minimum number of 331 introduction events (95% HPD interval = [315-344]). When compared to the number of sequences sampled in Belgium (n=740), this number illustrates the relative importance of external introductions in establishing transmission chains in the country. Introduction events resulted in distinct clades (or "clusters") linking varying numbers of sampled sequences (Fig. 1) . However, many clusters only consisted of one sampled sequence. According to the time-scaled phylogenetic tree and discrete phylogeographic reconstruction (Fig. S1 ), some of these introduction events could have occurred before the return of Belgian residents from carnival holidays (around March 1, 2020), which was considered to be the major entry point in time of transmission chains in Belgium.

To analyse the circulation dynamics of viral lineages within the country, we then performed spatiallyexplicit phylogeographic inference along the previously identified Belgian clades ( Fig. 2A ). Our reconstructions reveal the occurrence of long-distance dispersal events both before (Fig. 2B ) and during (Fig. 2C ) the lockdown. By placing phylogenetic branches in a geographical context, spatiallyexplicit phylogeographic inference allows for treating those branches as conditionally independent movement vectors 17 . Here, we looked at these movement vectors to assess how the dispersal dynamics of lineages was impacted by the national lockdown, of which the main measures were implemented on March 18, 2020.

Firstly, we investigated whether the national lockdown was associated with a change in lineage dispersal velocity. We estimated a higher dispersal velocity before the lockdown (5.4 km/day, 95% HPD dispersal velocity was globally higher at the early phase of the Belgian epidemic, which corresponds to the week following the returns from carnival holidays, it then seemed to drop just before the beginning of the lockdown before increasing again to reach a plateau. In the second half of April, our estimates indicate that the lineage dispersal velocity drops again. However, this result may be an artefact associated with the lower number of phylogenetic branches currently inferred during that period (Fig.   3 ).

Secondly, we further investigated the impact of the lockdown on the lineage dispersal events among provinces. Our analyses indicate that lineage dispersal events among provinces tended to decrease during the epidemic (Fig. 3) : such dispersal events were more frequent at the beginning of the epidemic and then progressively decreased until reaching a plateau at the beginning of the lockdown.

Again, the relatively limited number of phylogenetic branches currently inferred from mid-April does not convincingly allow for interpretation of the fluctuations of the proportion of lineage dispersal events among provinces during that period.

Finally, because our approach is dependent on the initial time-scaled tree, we also assess the robustness of our results to the selection of the maximum likelihood starting tree. Specifically, we reran the analytical pipeline on five additional time-scaled trees independently obtained using maximum likelihood inference. We then compared the inferred number of introduction events into the Belgian territory as well as the lineage dispersal velocity estimates obtained when running the pipeline on each starting tree (Table S1 ). Despite some relatively low differences in lineage dispersal velocity estimates, our results confirmed that the different estimates remain robust to the choice of the starting tree and that our pipeline appears not to be substantially impacted by the statistical uncertainty associated with the initial maximum likelihood inference step. Our spatially-explicit phylogeographic analyses uncover the spatiotemporal distribution of Belgian SARS-CoV-2 clusters, indicating a relatively low impact of the lockdown on both the dispersal velocity of viral lineages and on the frequency of long-distance dispersal events. While it has been demonstrated that the national lockdown had an overall impact on the virus transmission, i.e. reducing its effective reproduction number to a value below one 19 , our results highlight that the lockdown did not clearly decrease the velocity at which the viral lineages travelled or their ability to disperse over long distances within the country. This finding may be important to consider in the context of potential future lockdown measures, especially if more localised (e.g. at the province or city level). Indeed, locally reduced transmission rates will not automatically be associated with a notable decrease in the average velocity or distance travelled by lineage dispersal events, which could in turn limit the effectiveness of localised lockdown measures in containing local upsurge of the virus circulation.

Applying the present phylodynamic pipeline in a real-time perspective does not come without risk as new sequences can sometimes be associated with spurious nucleotide changes that could be due to sequencing or assembling errors. Directly starting from inference results kept up to date by a database like GISAID allows for fast analytical processing but also relies on newly deposited data that could sometimes carry errors. To remedy such potentially challenging situations, our proposed pipeline could be extended with a sequence data resource component that makes uses of expert knowledge regarding a particular virus. The GLUE 20 software package allows new sequences to be systematically checked for potential issues, and could hence be an efficient tool to safely work with frequently updated SARS-CoV-2 sequencing data. Such a "CoV-GLUE" resource is currently being developed (http://cov-glue.cvr.gla.ac.uk/#/home).

While we acknowledge that a fully integrated analysis (i.e. an analysis where the phylogenetic tree and ancestral locations are jointly inferred) would be preferable, fixing an empirical time-scaled tree represents a good compromise to rapidly gain insights into the dispersal history and dynamics of SARS-CoV-2 lineages. Indeed, the number of genomes available, as well as the number of different sampling locations to consider in the analysis, would lead to a joint analysis requiring weeks of run-time in a Bayesian phylogenetic software package like BEAST. To illustrate the computational demands of such an approach, we ran a classic Bayesian phylogenetic analysis on a smaller SARS-CoV-2 data set (2,795 genomic sequences) using BEAST 1.10 (data not shown). This analysis required over 150 hours to obtain enough samples from the posterior distribution, while using the latest GPU accelerated implementations 21 across 15 parallel runs. With a combined chain length of over 2.2x10 9 states, and an average runtime of 0.9 hours per million states, the significant computational demands make this approach impractical when speed is critical. On the other hand, here we use a maximum likelihood method implemented in the program TreeTime 16 to obtain a time-scaled phylogenetic tree in a short amount of time (~3 hours for the data set analysed here). Given the present urgent situation, we have deliberately assumed a time-scaled maximum likelihood phylogenetic tree as a fair estimate of the true time-scaled phylogenetic tree.

Our analytical workflow has the potential to be rapidly applied to study the dispersal history and dynamics of SARS-CoV-2 lineages in other restricted or even much broader study areas. We believe that spatially-explicit reconstruction can be a valuable tool for highlighting specific patterns related to the circulation of the virus or assessing the impact of intervention measures. While new viral genomes are sequenced and released daily, a limitation could paradoxically arise from the non-accessibility of associated metadata. Indeed, without sufficiently precise sampling location data for each genome, it's not feasible to perform a spatially-explicit phylogeographic reconstruction. In the same way that viral genomes are deposited in databases like GISAID, metadata should also be made available to enable comprehensive epidemiological investigations using a similar approach as we presented here. A cluster is here defined as a phylogene c clade likely corresponding to a dis nct introduc on into the study area (Belgium). We delineated these clusters by performing a simplis c discrete phylogeographic reconstruc on along the me-scaled phylogene c tree while only considering two poten al ancestral loca ons: "Belgium" and "non-Belgium". We iden fied a minimum number of 331 lineage introduc ons (95% HPD interval = [315-344]), which showcases the rela ve importance of external introduc ons considering the number of sequences currently sampled in Belgium (740). On the tree, lineages circula ng in Belgium are highlighted in green, and green nodes correspond to the most ancestral node of each Belgian cluster (see also Figure S1 for a non-circular visualisa on of the same tree). Besides the tree, we also report the distribu on of cluster sizes (number of sampled sequences in each cluster) as well as the number of sequences sampled through me. For each clade, we mapped the maximum clade credibility (MCC) tree and overall 80% highest posterior density (HPD) regions reflec ng the uncertainty related to the phylogeographic inference. MCC trees and 80% HPD regions are based on 1,000 trees subsampled from each post burn-in posterior distribu on. MCC tree nodes were coloured according to their me of occurrence, and 80% HPD regions were computed for successive me layers and then superimposed using the same colour scale reflec ng me. Con nuous phylogeographic reconstruc ons were only performed along Belgian clades linking at least three sampled sequences for which the geographic origin was known (see the detailed analy cal pipeline in Supplementary Informa on for further detail). Besides the phylogene c branches of MCC trees obtained by con nuous phylogeographic inference, we also mapped sampled sequences belonging to clades linking less than three geo-referenced sequences. Furthermore, when a clade only gathers two geo-referenced sequences, we highlighted the phylogene c link between these two sequences with a dashed curve connec ng them. Sub-na onal province borders are represented by white lines. (B) MCC tree branches occurring before the 18 th March 2020 (beginning of the lockdown). (C) MCC tree branches occurring a er the 18 th March 2020. See also Figure S2 for a zoomed version of the dispersal history of viral lineages in the Province of Liège, for which we have a par cularly dense sampling. Except the number of phylogene c branches occurring at each me slice, all es mates were smoothed using a 14-days sliding window. Dark grey surrounding polygons represent 95% credible intervals, and light grey surrounding polygons represent 95%credible intervals re-es mated a er subsampling 75% of branches in each of the 1,000 posterior trees. The credible interval based on the subsampling procedure is an indica on of the robustness of the es mate. In addi on, we also report the number of phylogene c branches occurring per tree at each me slice (blue curve). The number of branches available at each me slice is an addi onal, yet qualita ve, indica on of robustness of the es mate for a given me period.

A pneumonia outbreak associated with a new coronavirus of probable bat origin

A Novel Coronavirus from Patients with Pneumonia in China

If the world fails to protect the economy, COVID-19 will damage health not just now but also in the future

Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science

The proximal origin of SARS-CoV-2

Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding

An emergent clade of SARS-CoV-2 linked to returned travellers from Iran

Genomic epidemiology of SARS-CoV-2 in Guangdong Province

Screening of healthcare workers for SARS-CoV-2 highlights the role of asymptomatic carriage in COVID-19 transmission

Nextstrain: Real-time tracking of pathogen evolution

Recent advances in computational phylodynamics

Bayesian phylogeography finds its roots

Phylogeography takes a relaxed random walk in continuous space and time

Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evol

Global initiative on sharing all influenza data -from vision to reality

Maximum-likelihood phylodynamic analysis

Unifying the spatial epidemiology and molecular evolution of emerging epidemics

A genomic survey of SARS-CoV-2 reveals multiple introductions into northern California without a predominant lineage

COVID-19 report on a meta-population model for Belgium: A first status report

GLUE: A flexible software system for virus sequence data

Improved performance, scaling, and usability for a high-performance computing library for statistical phylogenetics

IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era

Some probabilistic and statistical problems in the analysis of DNA sequences

A space-time process model for the evolution of DNA sequences

Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen)

Posterior summarization in Bayesian phylogenetics using Tracer 1.7

Phylodynamic assessment of intervention strategies for the West African Ebola virus outbreak

Explaining the geographic spread of emerging epidemics: a framework for comparing viral phylogenies and environmental landscape data

SERAPHIM: studying environmental rasters and phylogenetically informed movements