key: cord-0694873-88m8l0mo authors: Murall, C. L.; Fournier, E.; Galvez, J. H.; N'Guessan, A.; Reiling, S. J.; Quirion, P.-O.; Naderi, S.; Roy, A.-M.; Chen, S.-H.; Stretenowich, P.; Bourgey, M.; Bujold, D.; Gregoire, R.; Lepage, P.; St-Cyr, J.; Willet, P.; Dion, R.; Charest, H.; Lathrop, G. M.; Roger, M.; Bourque, G.; Ragoussis, J.; Shapiro, B. J.; Moreira, S. title: A small number of early introductions seeded widespread transmission of SARS-CoV-2 in Quebec, Canada date: 2021-03-26 journal: nan DOI: 10.1101/2021.03.20.21253835 sha: b780fd5310cff16666d0ab99c5fb9f0b238be81b doc_id: 694873 cord_uid: 88m8l0mo Using genomic epidemiology, we investigated the arrival of SARS-CoV-2 to Quebec, the Canadian province most impacted by COVID-19, with >280,000 positive cases and >10,000 deaths in a population of 8.5 million as of March 1st, 2021. We report 2,921 high-quality SARS-CoV-2 genomes in the context of >12,000 publicly available genomes sampled globally over the first pandemic wave (up to June 1st, 2020). By combining phylogenetic and phylodynamic analyses with epidemiological data, we quantify the number of introduction events into Quebec, identify their origins, and characterize the spatio-temporal spread of the virus. Conservatively, we estimated at least 500 independent introduction events, the majority of which happened from spring break until two weeks after the Canadian border closed for non-essential travel. Subsequent mass repatriations did not generate large transmission lineages (>50 cases), likely due to mandatory quarantine measures in place at the time. Consistent with common spring break and 'snowbird' destinations, most of the introductions were inferred to have originated from Europe via the Americas. Fewer than 100 viral introductions arrived during spring break, of which 5-10 led to the largest transmission lineages of the first wave (accounting for 36-58% of all sequenced infections). These successful viral transmission lineages dispersed widely across the province, consistent with founder effects and superspreading dynamics. Transmission lineage size was greatly reduced after March 11th, when a quarantine order for returning travelers was enacted. While this suggests the effectiveness of early public health measures, the biggest transmission lineages had already been ignited prior to this order. Combined, our results reinforce how, in the absence of tight travel restrictions or quarantine measures, fewer than 100 viral introductions in a week can ensure the establishment of extended transmission chains. Using genomic epidemiology, we investigated the arrival of SARS-CoV-2 to Québec, the 28 Canadian province most impacted by COVID-19, with >280,000 positive cases and >10,000 29 deaths in a population of 8.5 million as of March 1 st , 2021. We report 2,921 high-quality SARS-30 CoV-2 genomes in the context of >12,000 publicly available genomes sampled globally over the 31 first pandemic wave (up to June 1 st , 2020). By combining phylogenetic and phylodynamic 32 analyses with epidemiological data, we quantify the number of introduction events into Québec, 33 identify their origins, and characterize the spatio-temporal spread of the virus. Conservatively, we 34 estimated at least 500 independent introduction events, the majority of which happened from 35 spring break until two weeks after the Canadian border closed for non-essential travel. 36 Subsequent mass repatriations did not generate large transmission lineages (>50 cases), likely 37 due to mandatory quarantine measures in place at the time. Consistent with common spring break 38 and 'snowbird' destinations, most of the introductions were inferred to have originated from 39 Europe via the Americas. Fewer than 100 viral introductions arrived during spring break, of which 40 5-10 led to the largest transmission lineages of the first wave (accounting for 36-58% of all 41 sequenced infections). These successful viral transmission lineages dispersed widely across the 42 province, consistent with founder effects and superspreading dynamics. Transmission lineage 43 size was greatly reduced after March 11 th , when a quarantine order for returning travelers was 44 enacted. While this suggests the effectiveness of early public health measures, the biggest 45 transmission lineages had already been ignited prior to this order. Combined, our results reinforce 46 how, in the absence of tight travel restrictions or quarantine measures, fewer than 100 viral 47 introductions in a week can ensure the establishment of extended transmission chains. 48 Introduction continued to enter the country after repatriation calls from the government. On March 20 th , Québec 85 reached the threshold of 100 cases per day and by March 28 th random road checks were set up 86 to discourage movement between regions within Québec and between neighbouring provinces 87 (i.e., movement between Gatineau, Québec, bordering Ottawa, Ontario, was restricted). In April 88 2020, the virus spread significantly in long-term care facilities overwhelming many of them, thus 89 requiring redeployment of health care workers and by April 20 th the Canadian Armed Forces sent 90 personnel to the Montréal region to help. Having flattened the epidemic curve and with cases 91 declining, public health measures began easing mid-May (Fig.1A) . A year later, as of March 1 st , 92 2021, Québec had suffered the highest death toll in Canada (over 10,000 dead) and among the 93 highest death rates in the world (~92 deaths per 100,000). 94 In April 2020, we assembled the Coronavirus Sequencing in Québec (CoVSeQ) consortium of 96 academic and government scientists (https://covseq.ca/) to sequence SARS-CoV-2 genomes in 97 Québec. The CoVSeQ consortium is part of the Canadian COVID Genomic Network 98 (CanCOGeN), a pan-Canadian cross-agency network for large-scale SARS-CoV-2 and human 99 host sequencing (https://www.genomecanada.ca/en/cancogen). To better understand the early 100 introductions and spread of SARS-CoV-2 in Québec during the first wave, we sequenced and 101 analyzed 2,921 high-quality consensus genome sequences obtained between mid-February and 102 June 1st, 2020. We studied how these Québec sequences were related to 12,801 genomes 103 sampled from elsewhere in Canada and internationally. We inferred geographical origins of 104 introduction events by comparing travel history data with phylogenetic inference, and estimated 105 their likely arrival dates and subsequent spread. We conservatively estimated >500 independent 106 introduction events, mainly involving viral lineages of European origin, of which the most 107 successful arrived during the spring break period. Similarly to Massachusetts (Lemieux et al. 108 2021), viral transmission was overdispersed and suggestive of superspreading, with most 109 lineages going extinct and only 5-10 introduction events giving rise to 50 or more cases. Sampling and sequencing the first wave in Québec 116 Our province-wide sequencing effort covers the first pandemic wave up to June 1 st , 2020, with a 117 focus on the earliest confirmed cases up to April 1 st (Fig. 1A) . Consensus sequences of SARS-118 CoV-2 viral genomes were obtained by targeted amplification from clinical nasopharyngeal swabs 119 specimens followed by sequencing on Nanopore (n=180), Illumina (n=2630) or MGI (n=111) 120 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint platforms. Only sequences passing quality criteria (less than 5% undetermined bases, "Ns") were 121 considered for further phylogenetic analyses (Methods ; Table S1 ). With these genome 122 sequences we covered 5.7% of the total number of reported cases (45,641 laboratory confirmed 123 cases and 5,849 suspected cases) up to and including June 1 st . To capture early introduction 124 events, our sequencing effort was highest (covering 27% of cases) before April 1 st , two weeks 125 after the Canadian border closed and most repatriation of Canadian citizens from abroad had 126 occurred. Until early April, the mean age of sequenced cases was approximately 50 years old, 127 then jumped to ~75 years old, likely reflecting that the virus had entered long-term care facilities 128 (Fig. 1B) . By April 1 st , over 500 long-term care facilities had reported at least one case of COVID-129 19, and the virus spread steadily through these primarily elderly populations during the month of 130 April (Shingler 2020 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) We calculated Tajima's D as a simple non-parametric metric of viral effective population size (Kim, 165 Omori, and Ito 2017) and found strongly negative values of D early in the epidemic, consistent 166 with exponential growth in mid-March to early April, followed by decelerating growth as public 167 health measures likely reduced viral transmission (Fig. 1C) . The decline of Tajima's D from March 168 5 to 19th coincides with, or slightly precedes the increase in the epidemic curve starting on March 169 19th, suggesting its utility as an early indicator of population expansion. For example, viral lineage 170 B.1, which originated in Italy and spread throughout Europe, showed evidence of rapid growth in 171 Québec (median D ~ -2.5 in mid-March) followed by a decline in late April and May. This is 172 consistent with our observation that B.1 became very common in Québec by April before being 173 replaced by other B.1 variants (notably B.1.147 and B.1.350) by the end of May (Fig. S1) . 174 Of the 2,921 Québec consensus sequences analyzed here, 328 were from COVID-19 cases that 176 had reported recent travel history. Note that a lack of travel history could indicate a true lack of 177 travel, or a lack of available data. Travelers reported returning from the Caribbean and Latin 178 America (n = 105, 32%, mainly from Mexico, n = 31, 9.5% and the Dominican Republic, n = 30, 179 9%), Europe (n = 104, 32% with the most from France, n = 39, 12% and Spain, n = 20, 6%), and 180 the USA (n = 77, 24%) ( Fig. 2A) . There was very little reported travel from Asia (n = 4, 1.2%) and 181 none from China. Only 153-162 phylogenetically inferred introduction events (25-28%; range 182 across parsimony or ML methods) were associated with travel history. These events were broadly 183 concordant with travel history, with some exceptions: notably, Latin America and Europe were 184 approximately equally popular destinations based on travel history, but phylogenetic analysis 185 identified Europe as the more likely origin of introductions into Québec ( Fig. 2A,B) . This is 186 consistent with European viral lineages arriving in Québec, perhaps via the Americas -but before 187 accumulating lineage-defining mutations in the Americas. The early introductions of viral lineages 188 A and B.4, common in the early outbreaks in China and Iran respectively, appear not to have 189 been successful in Québec and were not observed by mid-May (Fig. S1 ). The first confirmed case of COVID-19 in Québec was detected on February 25 th , but phylogenetic 208 analysis has the potential to infer earlier introduction events. We inferred 13-16 potential 209 introduction events before February 25 th , based on their time to the most recent common ancestor 210 (TMRCA), of which only 4-5 had reported travel history. The global phylogeny during the first two 211 months of 2020 is undersampled, due to sequencing efforts only beginning to ramp up at that 212 time. This, combined with relatively slow accumulation of mutations by SARS-CoV-2, resulted in 213 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint many large polytomies (unresolved branchings), making precise inference of introductions 214 challenging. Indeed, ten of these 14-16 early introductions are in polytomies and 53-63% are of 215 unclear origin; thus their true TMRCA is questionable (Fig. 2C) . The earliest reliably dated 216 introduction has a TMRCA of February 15 th , from Europe (clustering with sequences from 217 Switzerland), followed by an introduction from the UK with a TRMCA of February 21 st . We do not 218 reliably detect any introductions arriving in January or early February, which is consistent with a 219 study of samples from patients with flu-like symptoms between November 2019 to early March 220 that did not find any SARS-CoV-2, suggesting that introductions before late February are unlikely 221 (as reported in Le Devoir, September 5 th , 2020; "L'épidémie a Commencé Autour de La Relâche 222 Scolaire," 2020) and appear not to have given rise to sustained transmission. 223 224 To test the hypothesis that spring break travel was a major source of viral introductions into 226 Québec, we investigated the Québec transmission lineages with a TMRCA between Feb. 23 rd 227 and March 10 th and defined them as having been likely introduced during spring break. During 228 this period, there were 70-96 introduction events (only 12-16% of the total), of which 26-36 had 229 recorded travel history (~37%). This is a conservative estimate since people may have traveled 230 just before or after those precise dates. Also, given the SARS-CoV-2 generation time and 231 incubation period, some introductions during spring break are expected to appear in the counted 232 cases past March 16 th . The majority of introductions, then, happened after spring break, with 78-233 82% of TMRCA dates after March 10 th (Fig. 2C) . The USA is a common travel destination for 234 Québecers, where many (known as 'snowbirds') have winter homes. The bulk of the USA travel-235 related cases were detected after the border closed on March 16 th , and thus were likely part of 236 the repatriation effort. However, the phylogenetically inferred introductions from the USA suggest 237 that these were not as successful as the introductions that happened in early March (the only 238 transmission lineages with >20 viral genomes of US origin arrived before March 15 th , Fig. 2C) . 239 The majority of the 41 introductions from other Canadian provinces were not reported in travel 240 history records (38-39 introductions, 93-95%), which is consistent with inter-provincial travel 241 having been common until being discouraged in late March. 242 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Of the 555-614 independent inferred introduction events, the majority were unsuccessful: 52-63% 253 were singletons (lineages with only one observed sequence) and 84-93% gave rise to small 254 transmission lineages of less than ten sampled genomes. In contrast, only 5-10 introductions (0.9-255 1.6% of the total; range of estimates from parsimony and ML) were successful enough to cause 256 more than 50 cases in Québec (Fig. 3) . The ten introductions inferred by ML events gave rise to 257 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Larger transmission lineages tended to be sampled across more regions in Québec (Fig. 4B) , 267 indicating that the success of these lineages was not due to local outbreaks but rather to wide 268 geographic spread across the province. The viral lineages that spread the most throughout 269 Québec (i.e. being found in more than ten health regions) were B.1, B.1.5, and B.1.147 (Fig. S2) . June 1 st (Fig. S1 and Fig. S2 ). In contrast, B.1.147 was introduced half as many times (19 275 introductions) but these events tended to have occurred earlier in spring break (Fig. S1) . B.1, each of which was introduced multiple times (Fig. 4C, Fig. S3, and Fig. S4) . which is also consistent with strongly negative Tajima's D values (Fig. 1C) . Together, these 289 results suggest rapid population growth of the most successful lineages. 290 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The ten largest transmission lineages likely arrived during spring break (all TMRCA 95% HPD 301 intervals overlap with spring break) and were still detectable in late May (Fig. 5) . The median 302 effective reproductive numbers (Re, estimated by phylodynamic analysis) for the two largest 303 transmission lineages were estimated in the range of 2-3, consistent with exponential growth (Fig. 304 5) . The two transmission lineages of ~60 cases caused by introductions of B.1 and B.1.3 had 305 higher Re values, potentially due to their rapid spread in long-term care facilities. The B.1 306 transmission lineage that spread mostly in a care facility in the city of Laval (McKenzie 2020), is 307 a particularly striking example, where the median age jumped to 83 years old (IQR: 71 to 89 308 years) after a likely introduction by a person in their forties (Fig. S6) . This outbreak was brought 309 under control, and then no sequences were detected past early May ( Fig. 5; Fig. S6 ). The self-310 isolation mandates for arriving travelers (Québec's orders on March 11 th and federal mandatory 311 quarantine orders on March 25 th ) appear to have been effective, such that the mean transmission 312 lineage size before March 11 th was 22 cases (+/-85 SD) and only 3 cases (+/-5.1 SD) after March 313 11 th . After the federal quarantine orders, 70% of introductions were singletons and only four gave 314 rise to 10 or more sampled genomes. The TMRCA of the last introduction event was April 16 th , 315 2020. 316 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. Finally, we considered the extent to which the success of an introduction event could be explained 327 by founder effects and adaptive mutations. To investigate the role of specific mutations, we 328 defined nine lineage-specific single nucleotide variants (alleles) present in all members of each 329 viral lineage, and tested their associations with transmission lineage size. We found that mutation 330 D614G in the Spike protein (genome position A23403G) was present in all ten of the most 331 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint successful introduction events into Québec (Fig. S7) and generally dominated our sampled 332 sequences (Fig. 6A) . Independent introductions of viral lineages with the derived G allele gave 333 rise to a mean transmission lineage size of 6.6 cases, compared to 3.4 for the ancestral D allele; 334 however, this difference is not statistically significant (Fig. S7) . In contrast, derived 335 nonsynonymous mutations in three consecutive nucleotide sites (28881-3) spanning two codons 336 in the nucleocapsid (N) protein were significantly associated with smaller transmission lineage 337 size (Fig. S7) and were less represented in our sequences (Fig. 6A) . 338 339 If founder effects also played a role in determining successful transmission, we would expect the 340 earliest introduction events to give rise to larger transmission lineages. Consistent with founder 341 effects, we observed a significant negative correlation between inferred arrival time and 342 transmission lineage size. This negative correlation was strongest before quarantine measures 343 began on March 11th, suggesting that that founder effects occurred independently and could be 344 decoupled from the effects of quarantine measures (Fig. 6B) . The negative correlation was 345 detectable but weak after March 11th, suggesting a strong effect of quarantine measures. We 346 also note that most of the early, successful introduction events had no reported travel history, 347 highlighting the importance of phylogenetic analysis in identifying them. Therefore, while we 348 cannot exclude a role of specific mutations affecting transmission, lineage success in Québec's 349 first pandemic wave can most parsimoniously be explained by a combination of founder effects 350 and effective public health measures. 351 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. We thank all the authors, developers, and contributors to the GISAID database for making their 419 SARS-Cov-2 sequences publicly available. We are grateful to the molecular biology team of the 420 public health laboratory of Québec following the procedure from the manufacturer. Plate 9 was sequenced using both Nanopore and 468 Illumina technologies, as well as by applying the Cleanplex assay by Paragon Genomics, followed 469 by MGI sequencing. For samples that were sequenced more than once, the data with the higher 470 coverage was used to generate consensus sequences and subsequent phylogenetic analysis. 471 For the Cleanplex assay, we used the Cleanplex for MGI SARS-CoV-2 research panel by 472 Paragon Genomics. The assay utilizes 343 primer pairs tiled across the viral genome as 473 described (Li et al. 2020 ). The manufacturer's protocol (UG4002-1) was used, with the 474 modification of increasing the multiplex PCR cycle number to 16 in order to improve the 475 sequencing of samples with qPCR cycle threshold (Ct) values of >29, followed by sequencing on 476 an MGI DNBSEQ G400 instrument. 477 478 Basecalling and consensus sequence generation 479 All samples were aligned to the reference genome of the Severe Acute Respiratory Syndrome 480 Coronavirus-2 isolate Wuhan-Hu-1 (GenBank Accession MN908947.3). Aligned reads were then 481 used to produce a consensus sequence using pipelines based on the Artic Network novel 482 coronavirus bioinformatics protocol. A brief description of the pipeline, including software 483 packages and important parameters, is provided for each sequencing platform below. 484 485 Datasets produced using the Nextera Flex Illumina protocol were first filtered to remove any host 486 reads. To do so, reads were aligned to a hybrid reference including SARS-CoV-2 (MN908947.3) 487 and GRCh38 using bwa-mem (v0.7.17). Any reads mapping to a region of the human reference 488 with a mapping quality of zero or more were removed from the dataset. After filtering out host 489 reads, the remaining reads were trimmed using cutadapt (v2.10), then aligned to the SARS-CoV-490 2 reference (MN908947.3) using bwa-mem (v0.7.17). After alignment, reads were filtered using 491 sambamba (v0.7.0) to remove paired reads with an insert size outside of the 60-300bp range, as 492 well as any unmapped reads, secondary alignments and reads that did not match the FR/RF 493 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint orientation. iVar (v1.3) was used to trim any remaining primers. Samtools (v1.9) was used to 494 produce a pileup which was then used as input by iVar (v1.3) to create a consensus sequence 495 for regions with a minimum of 10x depth, using reads with a Q score >20 and a minimum allele 496 frequency of 0.75. A full description of the process can be found here: 497 https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/Illumina_overview.html 498 499 Datasets produced using the CleanPlex MGI protocol were processed using the same pipeline 500 as Illumina Nextera Flex samples, except that Artic Network primers and amplicon data was 501 changed to the corresponding CleanPlex information. A full description of the process can be 502 found here: 503 https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/MGI_overview.html 504 505 Raw data produced using Nanopore sequencing was basecalled using guppy (v3.4.4) with a High-506 Accuracy Model (dna_r9.4.1_450bps_hac). Reads were de-multiplexed using guppy barcodes 507 (v3.4.4), requiring barcodes on both ends. Reads were filtered by size to remove anything outside 508 of the 400-700bp range using the ARTIC Network 'guppyplex' tool. Reads were aligned with 509 minimap2 (v2.17), then filtered to remove incorrect primer pairs and randomly downsample high-510 depth regions to keep 800x depth per strand using the ARTIC network framework. Nanopolish 511 (v0.13.1) was used to call variants in regions with a minimum depth of 16x and a flank of 10bp. 512 After masking regions with coverage below 20x, the called variants were used to generate a 513 consensus sequence using bcftools (v1.9) consensus. A full description of the process can be 514 found here: 515 https://c3g.github.io/covseq_McGill/SARS_CoV2_Sequencing/ONT_overview.html 516 517 For samples sequenced with two or more technologies, all datasets were processed separately 518 using the methods described above. The resulting consensus sequences were compared to keep 519 only the most complete consensus for downstream analyses, as determined based on the number 520 of missing bases (Ns). The consensus sequences were deposited in GISAID under accession 521 number listed in Table S1 . 522 523 Phylogenetic analysis: 524 Raw and time-scaled phylogenomic trees were built using the NextStrain pipeline 525 (https://github.com/nextstrain, version 1.16.2) installed in a conda environment 526 (https://github.com/conda/conda, version 4.8.3) (Hadfield et al. 2018 ). This pipeline uses the 527 Augur toolkit (https://github.com/nextstrain/augur, version 7.0.2) to filter, align/mask genomic 528 sequences, build trees (divergence and time-scaled) and produce an output file processed by the 529 Auspice web interface (https://github.com/nextstrain/auspice, version 2.16.0) to explore 530 phylodynamic and phylogenomic data. Augur removed all sequences shorter than 27,500 bp and 531 sampled after June 1 st , 2020. The Augur/align module was then called to execute the multiple 532 sequence alignment with MAFFT (https://github.com/GSLBiotech/mafft, version v7.463) using 533 Wuhan-1 (Genbank accession MN908947) as a reference genome. The final alignment was 534 masked at the beginning (first 100 sites) and end (last 50 sites), and at positions 18529, 29849, 535 29851 and 29853 (sites of known low sequencing quality and homoplasies). We then used Augur 536 to select sequences from GISAID that were most similar to our 2,921 Québec sequences. These 537 global context sequences were then grouped by country/month in order to keep a maximum of 538 100 sequences and 5 identical sequences per country-month combination. 539 540 We used IQ-TREE (http://www.iqtree.org/, version 1.6.12) to construct a phylogenetic tree of 541 Québec only sequences and another tree of Québec and global context sequences, with the GTR 542 substitution model. Branch lengths, sampling dates and ancestral states (geographic regions, 543 nucleotides and amino acids sequences) at internal nodes were inferred with the Augur/refine 544 and Augur/traits modules by calling TreeTime (https://github.com/neherlab/treetime, version 545 0.7.5) (using the same default parameters as those chosen in public builds; 546 https://github.com/nextstrain/ncov) (Sagulenko, Puller, and Neher 2018) . Finally, the 547 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint Augur/export module exports a single compiled results file required for data visualization in 548 Auspice. All Nexstrain analyses were executed on a 64-bit CentOS server version 7.4.1708 using 549 40 CPUs. 550 551 Clade assignment was done during the Nextrain build. As input, the Augur/clades module uses 552 the phylogenetic tree, the observed and inferred nucleotide sequences at each node and a clade 553 configuration file. In this clade file, every single clade value is associated with a specific 554 combination of position/nucleotide variant. As an alternative clade assignment scheme, we also 555 used the Phylogenetic Assignment of Named Global Outbreak LINeages (Rambaut et sequences were discarded. We calculated D as described (Tajima 1989 ): 609 610 where the ' denotes the expected sampling variance of ( ! − " ). ! is the nucleotide diversity, 611 calculated based on the average number of pairwise differences among consensus sequences: 612 where n is the genome length, N is the number of consensus sequences, ( , 2) is the choose() 613 function which calculates the number of pairs of consensus sequences in a set of size N and 614 Nb_reads_pwdiff is the number of pairwise nucleotide differences. Because pairwise differences 615 are maximized when there are intermediate-frequency mutations, ! is more sensitive to 616 intermediate-frequency mutations. " is another estimator of the nucleotide diversity which is 617 calculated based on the number segregating sites and is sensitive to low-frequency mutations: 618 where S in the number of segregating sites, $ is a normalizing factor for the sample size of 619 consensus sequences (N). 620 621 We also estimated dN/dS, the ratio of non-synonymous (dN) and synonymous substitutions rates 622 (dS), by comparing consensus sequences to the reference genome (Genbank MN908947.3) 623 allowing us to infer changes in selective pressures at the protein level. 624 625 where Nbnsub is the number of non-synonymous substitutions, Nbnss is the number of 626 nonsynonymous sites, Nbssub is the number of synonymous substitutions, and Nbss is the number 627 of synonymous sites. We only considered consensus sequences with more than 1 synonymous 628 mutation to be able to attribute finite values to dN/dS. These analyses were coded in R. 629 630 . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The resulting heatmap was ordered by hierarchical clustering. (B) Associations between lineage-defining mutations and transmission lineage size. Each of the nine lineage-defining mutations was tested for association with transmission lineage size by comparing the transmission lineages size with the ancestral allele (blue) to the derived allele (red) (Mann-Whitney test, P-values reported after false-discovery rate correction for multiple hypothesis testing). The lineage-defining mutations are labeled in the following format: Ancestral nucleotide allele, Genome position, Derived nucleotide allele; Ancestral Amino Acid allele, Position In Protein, Derived Amino Acid allele; ORF; Gene. . CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 26, 2021. ; https://doi.org/10.1101/2021.03.20.21253835 doi: medRxiv preprint Cryptic Transmission of SARS-CoV-2 in Washington State BEAST 2.5: An Advanced Software 638 Platform for Bayesian Evolutionary Analysis The McDonald-Kreitman Test and Slightly Deleterious 642 Mutations Estimated Transmissibility and Impact of SARS-646 CoV-2 Lineage B.1.1.7 in England Données COVID-19 Au Québec Temporal Signal and the Phylodynamic Threshold of SARS-CoV-2 Potentially Adaptive SARS-CoV-2 Mutations Discovered with Novel 657 Genomic Epidemiology of Early Introductions of SARS-CoV-2 into the Canadian Province of 661 2020 The Role of Case Importation in Explaining Differences in Early 666 SARS-CoV-2 Transmission Dynamics in Canada-A Mathematical Modeling Study of Surveillance 667 Introductions and Early Spread of SARS-CoV-2 672 in the New York City Area Making Sense of 675 Mutation: What D614G Means for the COVID-19 Pandemic Remains Unclear Nextstrain: Real-Time Tracking of 680 Pathogen Evolution Inferring Epidemiological Dynamics of 684 Infectious Diseases Using Tajima's D Statistic on Nucleotide Sequences of Pathogens Genomic Epidemiology of the Early Stages of the SARS-CoV-2 Outbreak in Russia Real-Time PCR-Based SARS-CoV-2 Detection in Canadian 694 Laboratories Phylogenetic Analysis of SARS-CoV-2 in 699 Boston Highlights the Impact of Superspreading Events L'épidémie a Commencé Autour de La Relâche Scolaire Highly Sensitive and Full-Genome Interrogation of SARS-CoV-2 Using Multiplexed 708 PCR Enrichment Followed by next-Generation Sequencing Ligne Du Temps COVID-19 Au Québec Insights from SARS-CoV-2 Quebec Man Files Suit after 69 Residents Die of COVID-19 at Care 718 Global News Establishment and Lineage Dynamics of the SARS-CoV-2 Epidemic in the UK A Dynamic Nomenclature Proposal for SARS-727 CoV-2 Lineages to Assist Genomic Epidemiology Exploring the 731 Temporal Structure of Heterochronous Sequences Using TempEst (formerly Path-O-Gen) Comparisons of dN/dS Are Time Dependent for Closely Related Bacterial Genomes COVID-19 in Quebec: A Timeline of Key Dates and Events Phylodynamic Analysis COVID-19 in Quebec: Staying Away from Seniors' Homes a Matter of 746 'Life or Death,' Legault Says Genomic Epidemiology Reveals Multiple 751 Introductions of SARS-CoV-2 from Mainland Europe into Scotland Statistical Method for Testing the Neutral Mutation Hypothesis by DNA 755 Evaluating the Effects of SARS-CoV-2 Spike Mutation D614G on 759 Transmissibility and Pathogenicity The Emergence 764 of SARS-CoV-2 in Europe and North America SARS-CoV-2 Spike D614G Change Enhances 769