key: cord-0791036-0e1wmy41 authors: Danesh, G.; Elie, B.; Michalakis, Y.; Sofonea, M. T.; Bal, A.; Behillil, S.; Destras, G.; Boutolleau, D.; Burrel, S.; Marcelin, A.-G.; Plantier, J.-C.; Thibault, V.; Simon-Loriere, E.; van der Werf, S.; Lina, B.; Josset, L.; Enouf, V.; Alizon, S. title: Early phylodynamics analysis of the COVID-19 epidemics in France date: 2020-06-04 journal: nan DOI: 10.1101/2020.06.03.20119925 sha: 0d80aef2fb71e4a35026b6118cc239e086ddb11d doc_id: 791036 cord_uid: 0e1wmy41 France was one of the first countries to be reached by the COVID-19 pandemics. Here, we analyse 196 SARS-Cov-2 genomes collected between Jan 24 and Mar 24 2020, and perform a phylodynamics analysis. In particular, we analyse the doubling time, reproduction number (Rt) and infection duration associated with the epidemic wave that was detected in incidence data starting from Feb 27. We show that a slowing down of the epidemic spread can be detected in Mar, which is consistent with the implementation of the national lock-down on Mar 17. The inferred distributions for the infection duration and Rt are in line with those estimated from contact tracing data. Overall, this analysis shows the potential to use sequence genomic data to inform public health decisions in an epidemic crisis context. On Jan 8 2020, the Chinese Center for Disease Control announced that an outbreak of atypical pneumonia was caused by a novel coronavirus. The genetic sequence of what is now known as SARS-Cov-2 was released on Jan 10 (Liu et al., 2020) . This was less than two weeks after the initial report of the outbreak by the Wuhan Health Commission, which took place on Dec 31 2019. Never has a 15 novel pathogen been sequenced so rapidly. The number of sequences in the databases grew rapidly thanks to an altruistic and international effort of virology departments all around the world gathered via the Global Initiative on Sharing All Influenza Data (GISAID, https://www.gisaid.org/). Early results allowed to better understand the origin of SARS-Cov-2 and identify a bat coronavirus (SARSr-CoV RaTG13) as its closest rel-20 ative with more than 96% homology, as well as some potentially adaptive mutations (ICTV, 2020, Andersen et al., 2020 , Xiao et al., 2020 . The available sequences were also analysed using the field of phylodynamics (Grenfell et al., 2004 , Volz et al., 2013 , Frost et al., 2015 , which aims at inferring epidemiological processes from sequence data with known sampling dates. Most of these analyses were shared through the website 25 virological.org. In particular, using 176 genomes from which he extracted 85 representative sequences (to avoid a potential cluster effect), Rambaut (2020) estimated the molecular clock to be approximately 8 · 10 −4 substitutions per position per year, with a 95% Highest Posterior Density (HPD) between 1.4·10 −4 and 1.3·10 −3 subst./pos./year, which yielded a date of origin of the outbreak mid-Nov 2019, with a 95% HPD spanning from Aug 27 to Dec 19. Further analysis with more recent 2 Results 2.1 Phylogeny and regional structure Figure 1 shows the regional structure of the French epidemics. Sequences corresponding to black leaves were ignored in the subsequent analyses because they do not belong to the main clade. Most of these originate from travelers isolated upon arrival in France, which explains their under-representation 65 in the ongoing epidemic wave. Focusing on the main clade, we see that all the leaves originate from a common branching event, which is approximately half-point of the phylogeny. The polytomy in this point can be due to a lack of phylogenetic signal. Another interpretation, could be independent introductions in France (up to 6 events). This could also be associated with a superspreading event, i.e. the infection of many people 70 by the same person. Addressing this issue will require more sequences from the early stages of the epidemic wave since currently the earliest sequence in this major clade is from Feb 21, 2020. Colors indicate the regional structure of the French epidemic. As expected, we see some regional clusters. We also see that sequences from the same region belong to different subclades of the major clade, which is consistent with multiple introductions or dispersal between regions. The general 75 structure of the French epidemics will be the focus of a future study (see also the work by (Gambaro et al., 2020) ). . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. This estimate was obtained assuming an exponential growth coalescent population model and a fix molecular clock (see Figure S4 for the BDSKY model). The slower the clock rate (in different colors), the further away in time the most recent common ancestor (MRCA). Vertical lines show the distribution medians. The most recent sample dates from Mar 24, 2020. In the following of the work, we focus on the main clade associated with the epidemic wave. We first report the estimation of the time to the most recent common ancestor (TMRCA) of the 186 80 sequences that belong to the epidemic wave. Although this is the ancestor of the vast majority of the French sequences grouped in the B.1 clade (also referred to as G or A2 clade), the associated infection may have taken place outside France because the epidemic wave may be due to multiple introduction events (although from infections caused by similar viruses given the clustering). Estimates of SARS-Cov-2 molecular clock should be treated with care given the limited amount 85 of phylogenetic signal , Duchene et al., 2020 . This is particularly true in our case since we are analysing a small subset of the data. In Appendix, we present the analysis of the temporal signal in the data using the TempEst software (Rambaut et al., 2016) and show that it strongly relies on early estimates that do not belong to the epidemic wave clade ( Figure S2 ). As shown in Figure 2 , the molecular clock value directly affected the time to the most recent 90 common ancestor for the coalescent model. This was also true for the BDSKY model, where the prior shape for the recovery rate had little impact ( Figure S3 ). For both models, sampling of the 122 sequences amongst the 186 has a much smaller impact ( Figure S5 ). Table 1 shows the dates for models with different evolution rates and different population models (exponential coalescent or BDSKY). Note that smaller dataset may not include the most recent 95 samples. For most of our datasets and models, the origin for the clade corresponding to the sequences from 5 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint the French epidemic wave is dated between mid-Jan and early Feb. This large interval is due to the scarcity of "old" sequences (the first one collected in this clade dates from Feb 21) and on the fact that this clade averages the epidemics in several regions of France, which could have been seeded by 100 independent introductions from outside France. The date provided by the slowest molecular clock (Fix4.4-DT) seems at odds with the data as we will see below. To evaluate the effect of a potential sampling bias, we also estimate the time to the MRCA for 10 different sets of 122 sequences ( Figure S5 ). We found similar median values for 9 of these 10 random datasets. Notice that the value of their parent dataset (France186), was slightly larger. For the 105 BDSKY model, the effect was even less pronounced ( Figure S4 ). Overall, these dates (except for the slowest molecular clock) are consistent with those obtained by Rambaut (2020) Using a coalescent model with exponential growth and serial sampling (Drummond et al., 2002) , we can estimate the doubling time, which corresponds to the number of days for the epidemic wave to double in size. This parameter is key to calculate the basic reproduction number R 0 (Wallinga and 115 Lipsitch, 2007). In Figure 3 , we show this doubling time for four of our datasets that cover the whole (France122a), the first three quarters (France81), the first half (France61-1) and the second half (France61-2) of the time period. Since the first dataset includes more recent sequences than the second, which itself 6 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. includes more recent sequences than the third, our hypothesis is that we can detect variations in Again, we explore the effect of sampling by estimating the doubling time on 10 different sets of 122 sequences and find a limited effect on the median value ( Figure S7 ). Notice that the value of the parent dataset (France186), was slightly larger. We also studied the effect of the molecular clock on the doubling time ( Figure S6 ). As expected, 130 the slower the molecular clock, the higher the doubling time. However, for our realistic molecular clocks, the effect is limited: the median is 3.4 days assuming a high value for the molecular clock and 3.7 days for our default (medium) value. The low value of the molecular clock led to a high median doubling time of 5.6 days. This is at odds with the incidence data in France, which indicates an exponential growth rate of 0.23 days −1 which corresponds to a doubling time of 3 days, suggesting 135 that our default molecular clock is more realistic. In comparison, phylodynamic inferences made from data from China with 86 genomes (Rambaut, 2020) found a median doubling time of about 7 days with a confidence interval between 4.7 and 16.3 days). One reason for the slower growth rate of the epidemic compared to ours is that we have focused on one rapidly expanding clade of the epidemic and neglected the smaller clades. Another possibility 140 could be related to the timing of the sampling (early or late in the infection). . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. The birth-death skyline (BDSKY) model (Stadler et al., 2013) allows us to estimate the duration of contagiousness and the reproduction number of the epidemic (i.e. the number of secondary infections caused by an infected host). The exponential growth coalescent model described above cannot distin-145 guish between these two quantities. However, the BDSKY model requires more parameter values to be estimated. From this rate, we can obtain the duration of contagiousness, knowing that it is also necessary to account for the sampling rate because patients whose infections are sequenced can be assumed not to transmit the infection after this detection. The sampling rate after Feb 21 (it is set to 0 before that date) 150 is estimated at 0.093 days −1 with a (wide) 95% confidence interval between 0.006 and 0.627 days −1 . If we analyse this in days, the median value of the distribution yields 10.8 days and is consistent with the fact that in the French epidemics most of the screening for SARS-Cov-2 is done on severe cases upon hospital admission. The distribution of infectious periods is obtained by taking the inverse of the sum of the sampling 155 rate and the contagiousness end rate. The median of this distribution is 5.12 days and 95% of its values are between 2.89 and 7.05 days (Figure 4 ). In Supplementary Figure S8 , we show that the estimate for the infectious period duration is sensitive to the shape of the prior assumed for the recovery rate. Indeed, if we use a, less informative, uniform prior, then the median sampling rate estimate is larger and the median infectious period 160 estimated is shorter. 8 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . With the BDSKY model, we can estimate the temporal reproductive number, noted R(t), since the onset of the epidemic wave. Here, given the limited temporal signal, we only divided the time into 3 intervals to estimate three reproduction numbers: R 1 before Feb 19, R 2 between Feb 19 and Mar 7, 165 and R 3 between Mar 7 and Mar 24. These results are very consistent with those obtained for the doubling time, even if the time periods are different. For the period before Feb 19, the estimate is the least accurate with values of R 1 with a median of 1.05 but at 95% Highest Posterior Density (HPD) between 0.13 and 3.22. The lack of information can be seen in Figure 5 as the posterior distribution (gray area) is very similar to the prior 170 (dashed curve). This is consistent with the fact that the oldest sequence dates from Feb 21, while the tree root is estimated at the beginning of Feb. Over the second time period (in orange), the distribution shape is similar to that of the prior but the median is very different and rapid growth is detected with a median value of R 2 of 2.56 (95% HPD between 1.66 and 4.74). Finally, the most recent period after Mar 7 is the most accurate and detects a slowing down of the epidemic with a R 3 of 1.38 (95% HPD 175 between 1.13 and 2.03) In Appendix, we show that these estimates for R t are robust to the prior used for the recovery rate ( Figure S9 ). They are also robust to the sampling of 122 of the 186 sequences ( Figure S10 ). . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint of infections and to estimate the value of epidemiological parameters of interest (Volz et al., 2013 , Frost et al., 2015 . We performed this analysis based on the 196 sequences sampled in France and available on Apr 4, 2020. We focus in particular on the largest clade regrouping 186 of the most recent sequences and likely corresponding to the epidemic wave that peaked in France early Apr 2020. Before summarizing the results, we prefer to point out several limitations of our analysis. First, 185 the French clade we analysed is in fact an international clade: although most French sequences appear to be grouping into two main subclades within this clade, it is possible that the variations in epidemic growth that we detect are more due to European than French control policies. Second, some French regions (e.g. Auvergne-Rhône-Alpes) are more represented than others (e.g. Occitanie is absent), which could bias the analysis at the national level. Finally, the molecular clock had to be set in this Finally, the BDSKY model also provides us with an estimate of the duration of the infectious period, which is essential in the calculation of R 0 (Wallinga and Lipsitch, 2007) . The result we obtain, with a 95% Highest Posterior Distribution between 3 and 7 days and a median of 5.2, is highly relevant biologically and comparable to results obtained using contact tracing data. For instance, By increasing the number of SARS-CoV-2 genomic sequences from the French epidemic (and the number of people working on the subject), in particular sequences collected at the beginning of the epidemic, it would be possible to better estimate the date at which the epidemic wave took 10 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint off in France, improve the estimate for the infection generation time and the reproduction number, 210 better understand the spread between the different French regions, and estimate the number of virus introductions into the country. Finally, it is important to set these results into their context. As acknowledge in the introduction, the French state only acknowledged the magnitude of the COVID-19 epidemics on the last days of Feb 2020 and these genomes were mostly collected between Feb 21 and Mar 24. Most of this analysis 215 was published on Apr 6. At this time, the epidemic peak was barely noticeable in the incidence data. Furthermore, the serial interval, which is used to estimate the generation time of the infection and classically measured from contact tracing data, is still unknown in France. These results illustrate the contribution phylodynamics can make to public health during a crisis. One sequence was removed due to low quality. The list of the sequences used is shown in Supplementary Table S1 . We screened the dataset with RDP4 (Martin et al., 2015), which did not detect any recombination events. We first performed a maximum likelihood inference of the phylogeny using SMS (Lefort et al., 2017) and PhyML (Guindon and Gascuel, 2003) . The mutation model inferred by SMS was GTR. We then used Beast 1.8.3 to perform inference using a bayesian approach. More specifically, we assumed an exponential coalescent for the population model (Drum-235 mond et al., 2002) . We used the default settings for the model, which correspond to a gamma distribution for the growth rate prior Γ(0.001, 1000) and an inverse prior for the population size 1/x (see 11 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https: //doi.org/10.1101 //doi.org/10. /2020 Supplementary Methods S2). We also used Beast 2.3 (Bouckaert et al., 2014) to estimate key parameters using the birth-death skyline (BDSKY) model (Stadler et al., 2013 , Kühnert et al., 2014 . One of these parameters is the 240 temporal reproduction number (R t ) and we here assume three periods in the epidemics (which means we estimate 3 values R 1 , R 2 and R 3 ). Another parameter is the rate at which the infectiousness ends in absence of treatment. The final key parameter is the sampling rate, the inverse of which corresponds to the average number of days until an infected person is sampled. The ratio between the sampling rate and the sum of the sampling and end of infection rates indicates the fraction of infections that 245 are actually sampled. By sampled, we mean that the patient is identified and the virus population causing the infection is sequenced. Note that we assume sampled hosts are not infectious anymore. We considered multiple priors for the rate of end of the infectious period by setting a lognormal prior LogNorm(90, 0.5) and a uniform prior Unif(5, 350). We assumed a beta prior β (1.0, 1.0) for the sampling rate (see Supplementary Methods S2). As previous models (Stadler et al., 2013) , we set the 250 sampling rate to 0 before the first infected host is sampled (here on Feb 21, 2020). For both analyses in Beast, we assumed a GTR mutation model, following the results of SMS. We also assumed a uniform prior U(0, 1) for the nucleotide frequencies and a lognormal prior for parameter κ, LogNorm(1, 1.25). Regarding the molecular clock, earlier studies have reported a limited amount of phylogenetic 255 signal in the first sequences from the COVID-19 pandemics. Given that we here focus on a subset of these sequences, we chose to fix the value of the strict molecular clock to 8.8 · 10 −4 substitutions/position/year, following the analysis by Rambaut (2020). In Appendix, we study the influence of this value on the results by setting it to a lower (4.4 · 10 −4 subst./pos./year) or a higher (13.2 · 10 −4 subst./pos./year) value. Finally, we also estimate this parameter assuming a strict molecular clock. The most recent estimates suggest that the intermediate and high value are the most realistic ones (Duchene et al., 2020) . We analysed subsets of the whole data set. Our largest subset, excluded 10 sequences that did not belong to the French epidemic wave clade and therefore contained 186 sequences. Figure S1 shows 265 the sampling date and French region of origin for each sequence. Some sampling dates are over-represented in the dataset, which could bias the estimation of di-12 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint vergence times (Seo et al., 2002 , Stadler et al., 2012 . To correct for this, we sampled 6 sequences for each of the days where more than 6 sequences were available. This was done 10 times to generate 10 datasets with 122 sequences (France122a to France122h). To investigate temporal effects using the coalescent model, we created three other subsets of the France122a dataset: "France61-1" contains the 61 sequences sampled first (i.e. from Feb 21 to Mar 12), "France61-2" contains the 61 sequences sampled more recently (i.e. from Mar 12 to Mar 24), and "France81" contains the 81 sequences sampled first (i.e. from Feb 21 to Mar 17). With the exponential coalescent model (denoted DT for "doubling time"), we analysed all subsets 275 of data (Fra61-1, Fra61-2, Fra81 , and all the 10 Fra122 datasets), whereas for the BDSKY model we show the main dataset (Fra186) and analyse the 10 subsets with 122 leaves in Appendix. We thank the two Centre Nationaux de Référence (Nord and Sud) in France for generating the sequence data and sharing it via GISAID. Gonché Danesh is supported by a doctoral grant from the Fondation pour la Recherche Médicale (FRM grant number ECO20170637560). We thank the ETE modelling group for discussion. We also thank the patients, nurses, doctors and all the French laboratories who made this work possible by generating and sharing the virus genome sequences. We acknowledge the IRD itrop high-performance computer (South Green Platform) at IRD mont-285 pellier for providing resources that have contributed to the results presented in this work (more details on bioinfo.ird.fr). Preliminary versions of this report were posted on Apr 9 (in French on http://covid-ete. ouvaton.org) and on Apr 21 (in English on http://virological.org/). . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint from Malayan pangolins. Nature pp. 1-7. Publisher: Nature Publishing Group. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint S2 BEAST priors 375 MCMC chains were run for 5 · 10 8 iterations. The first 10% runs were discarded as a burn in and convergence was assessed using Effective Sample Size (ESS). All parameters had ESS greater than 200. Original XML files cannot be shared due to the GISAID agreement. 17 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. 18 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. Figure S2 -Root-to-tip correlation. We analyse a phylogeny based on all 196 French sequences (i.e. not only that from the epidemic wave). The four earliest cases in Jan and early Feb were all isolated and belong to another clade than the rest of the sequences. The figure was obtained using TempEst (Rambaut et al., 2016) . Here we assume BDSKY model. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. Figure S6 -Effect of the molecular clock on the doubling time. Note that in the "Strict" model we infer the value using a strict molecular clock but convergence is limited. Here, an exponential growth coalescent model is assumed. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted June 4, 2020. 23 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. Figure S9 -Effect of the prior shape and of the molecular clock on the temporal reproduction number estimate. The molecular clock value has a stronger effect than the prior shape except for R 3 , where the effect is limited. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. Here we assume BDSKY model. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. Here we assume BDSKY model. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted June 4, 2020. . https://doi.org/10.1101/2020.06.03.20119925 doi: medRxiv preprint The proximal origin of SARS-CoV-2 BEAST 2: A Software Platform for Bayesian Evolutionary Analysis Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data Bayesian phylogenetics with BEAUti and the BEAST 1.7 Temporal signal and the phylodynamic threshold of SARS-CoV-2. bioRxiv p Quantifying SARS-CoV-2 transmission suggests epidemic control with dig-305 ital contact tracing Eight challenges in phylodynamic inference Introductions and early spread of SARS-CoV-2 in France Unifying the epidemiological and evolutionary dynamics of pathogens A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood Phylodynamic analyses based on 93 genomes A viral sampling design for testing 350 the molecular clock and for estimating evolutionary rates and divergence times Birth-death skyline plot reveals temporal changes of epidemic spread in HIV and hepatitis C virus (HCV) Estimating the Basic Reproductive Number from Viral Sequence Data Viral phylodynamics How generation intervals shape the relationship between growth rates and reproductive numbers Isolation of SARS-CoV-2-related coronavirus