key: cord-0726944-r2ivqg5j
authors: Rockett, Rebecca J; Arnott, Alicia; Lam, Connie; Sadsad, Rosemarie; Timms, Verlaine; Gray, Karen-Ann; Eden, John-Sebastian; Chang, Sheryl; Gall, Mailie; Draper, Jenny; Sim, Eby; Bachmann, Nathan L; Carter, Ian; Basile, Kerri; Byun, Roy; O’Sullivan, Matthew V; Chen, Sharon C-A; Maddocks, Susan; Sorrell, Tania C.; Dwyer, Dominic E; Holmes, Edward C; Kok, Jen; Prokopenko, Mikhail; Sintchenko, Vitali
title: Revealing COVID-19 Transmission by SARS-CoV-2 Genome Sequencing and Agent Based Modelling
date: 2020-04-24
journal: bioRxiv
DOI: 10.1101/2020.04.19.048751
sha: e3c03a0cf150672d2cfdf960a3dbde9d5782b6e2
doc_id: 726944
cord_uid: r2ivqg5j

Community transmission of the new coronavirus SARS-CoV-2 is a major public health concern that remains difficult to assess. We present a genomic survey of SARS-CoV-2 during the first 10 weeks of COVID-19 activity in New South Wales, Australia. Transmission events were monitored prospectively during the critical period of implementation of national control measures. SARS-CoV-2 genomes were sequenced from 209 patients diagnosed with COVID-19 infection between January and March 2020. Only a quarter of cases appeared to be locally acquired, and genomic-based estimates of local transmission rates were concordant with predictions from a computational agent-based model. This convergent assessment indicates that genome sequencing provides key information to inform public health action and has improved our understanding of the evolution of COVID-19 from outbreak to epidemic.

In January 2020, a novel betacoronavirus (Coronaviridae), named Severe Acute Respiratory Syndrome coronavirus-2 (SARS-CoV-2), was identified as the etiologic agent of a cluster of pneumonia cases occurring in Wuhan City, Hubei Province, China, which were first reported in late December 2019 [1,2]. The disease arising from SARS-CoV-2 infection, Coronavirus disease 2019 (COVID-19), subsequently spread rapidly worldwide.
The World Health Organization (WHO) declared COVID-19 a pandemic on March 11th 2020, when 118,000 cases had been reported from 110 countries. As of 18th April 2020, the number of global cases had surpassed 2,000,000, following multiple independent importations of infection worldwide by visitors and returned travellers, making the control of this disease of prime global public health importance [3,4]. Major outbreaks have been documented in South Korea, Iran, the USA and Europe [2,3]. At the time of writing, person-to-person transmission had been documented primarily through household contacts [5], with up to 85% of human-to-human transmission occurring in family or household clusters [6,7]. The rapid growth in the number of COVID-19 cases, with associated morbidity and mortality, has overburdened healthcare facilities and the workforce. However, our understanding of the natural history and mechanisms of disease spread remains limited. These events, combined with estimations from epidemic models, have led to unprecedented measures of disease control being instituted by national governments, with profound costs to citizens and economies.

Epidemic models of COVID-19 have suggested that virus transmission can be significantly disrupted by rapid detection and quarantine of infectious cases and their close contacts [8]. However, validation of COVID-19 modelling predictions is becoming increasingly important, since many models are built using incomplete and inconsistent data and therefore produce divergent outcomes [9,10], which affects confidence in public health policy directions. Here, we use the combination of near real-time SARS-CoV-2 genomic and public health surveillance data to verify inferences from computational models.

Genomic epidemiology has become a high-resolution tool for public health surveillance and disease control [11-13], and the COVID-19 pandemic has triggered unrivalled efforts for the real-time genome sequencing of SARS-CoV-2.
Indeed, thousands of SARS-CoV-2 genomes have already been sequenced and made publicly available on GISAID (the Global Initiative on Sharing All Influenza Data) [14]. Importantly, the ongoing analysis of this global data set suggests no significant differences or links between SARS-CoV-2 genome sequence variability and virus transmissibility or disease severity [15]. However, even during these early stages of the global pandemic, genomic surveillance has been used to differentiate currently circulating strains into distinct, geographically based lineages and to reveal multiple SARS-CoV-2 importations into geographical regions of China and the USA [16,17].

Australia, as an island country between the Pacific and Indian oceans with strong traffic of people to and from COVID-19 hotspots in Asia, Europe and North America, has experienced unique challenges and opportunities in responding to the pandemic. The first laboratory-confirmed COVID-19 patients were diagnosed in Melbourne and Sydney on the 25th and 26th of January 2020, respectively. Since then, and as of the end of this study period on March 29th 2020, 4159 cases had been confirmed in Australia, with 1981 cases (47.6%) occurring in New South Wales (NSW), the most populous state of Australia (24.5/100,000 population) [18]. The Australian Government introduced progressive epidemic mitigation measures on 23rd March 2020 to limit social interactions, reduce virus diffusion and prevent community-based transmissions. This strategy has been supported by widely available testing for SARS-CoV-2 in NSW, with 1541 tests performed per 100,000 residents [19].

In this study, we examine the value of near real-time genome sequencing of SARS-CoV-2 in understanding local transmission pathways during the containment stage of the COVID-19 epidemic, and compare findings from the genomic surveillance of SARS-CoV-2 with predictions of a computational agent-based model.
This comparison was performed to assess the impact of potential sampling bias in genomic surveillance as well as to validate model-based inferences using experimental data. The synergistic use of high-resolution genomic surveillance and computational agent-based modelling not only improves our understanding of SARS-CoV-2 transmission chains in the community and the evolution of this novel virus, but is also essential for helping mitigate community-based transmissions.

Sampling COVID-19 cases in the first phase of the Australian epidemic. Between January 26th and March 28th, 1,617 cases of COVID-19 were diagnosed and reported to the NSW Ministry of Health. All patients resided in metropolitan Sydney. Prior to February 29th, only four cases of COVID-19 were detected in NSW, all of which were imported. The first locally acquired case in NSW was reported on March 3rd, following which a sharp spike in both imported and locally acquired cases occurred during the week commencing March 15th (Fig. 1). Between March 1st and 21st, the weekly proportion of imported cases was between 5 and 20%. During the same period, cases epidemiologically defined as 'unknown origin/under investigation' increased from none during the week beginning March 1st to between 31 and 35% from March 8th to 21st (Fig. 1c).

Rapid high-throughput SARS-CoV-2 sequencing directly from clinical samples. Of the 1,617 COVID-19 cases reported during the study period, complete viral genomes were obtained from 209 (13%) (Fig. 1a). Following an initial delay of 21 days between the date of collection and the first sequencing run for the first three samples received, the median time between clinical sample collection and sequencing was five days (range: 1-21 days; Table S1). No significant changes in the amplicon primer sites were detected during the study.

[...] up to two SNP differences between genomes during the outbreak. The duration of these three institutional outbreaks was between six and 17 days.
We therefore chose to define clusters as sequences that differ by no more than 2 SNPs from the index case of each outbreak. In this study 27 clusters were identified, of which eight (29.6%) consisted of two cases and 11 (40.7%) consisted of five or more cases. All clusters consisting of five or more cases were associated with different institutions with no overlapping epidemiological connections. The largest cluster contained 35 cases linked by COVID-19 exposure in a single institution (Fig. 2b). With a single exception, all clusters remained active during the study period (Fig. 1d). The phylodynamics of the epidemic in NSW was also investigated (Fig. 3a); however, genomic clusters were sampled for a limited period (maximum 19 days) and displayed weak temporal structure (R² = 0.171). The tracking of cluster evolution over time will become increasingly important to identify active clusters over a longer sampling period. The low genetic diversity of SARS-CoV-2 in the early phase of this epidemic means that both genomic and epidemiological data are needed to clearly define SARS-CoV-2 outbreak clusters.

Twenty-two (10.5%) of the 209 cases included in this study were epidemiologically classified as 'locally acquired - contact not identified'. Of these 22 cases, 15 (68%) were identified by genomic surveillance as belonging to nine genomic clusters containing cases with known epidemiological links (Fig. 2b). The remaining seven cases were found to be genomic singletons, not clustering with genomes included in this analysis.

In the agent-based model (ABM), the spread of the COVID-19 pandemic in Australia was initiated by overseas passenger arrivals, with some infections probabilistically generated in proportion to the average daily number of incoming passengers at airports, and binomially distributed within a 50 km radius of each airport. Fig. 4a presents a network formed by community transmission chains produced by an ABM run simulating the period corresponding to the time interval between week 6 and week 10 of the study, that is, the period preceding the introduction of major lockdown strategies. A typical distribution of chain lengths is shown in Fig. 4 [...] These fractions are also found to be in strong concordance with their counterparts defined through the genomic cluster analysis (25.8% for all local transmissions, with 17.1-18.3% during the final week) (Fig. 4c). As expected, only a proportion of inferred transmission chains were detected by genome surveillance based on identified COVID-19 cases (Fig. 4b).

This is the first report of the convergent application of SARS-CoV-2 genomic surveillance and agent-based modelling to investigate the local transmission of COVID-19. Particular strengths of this study were the integration of high-resolution genomic data with local epidemiological data and inferences made by agent-based modelling, providing context and confirmation for the genomic results and clustering. Our prospective SARS-CoV-2 genome sequencing has been instrumental not only in defining local transmission events and clusters, but also in enabling 68% of the cases for which no epidemiological links had been identified to be assigned to known epidemiological clusters, thereby allowing more efficient public health follow-up. The fine-scale resolution provided by the genomic analyses presented in this study will become increasingly important for the containment of local outbreaks by enabling identification of control measures. Only a quarter of cases appeared to be locally acquired, and genomics-derived local transmission rates were concordant with predictions from the computational agent-based model. This convergent assessment improves our understanding of the evolution of COVID-19 from outbreak to epidemic.
Integrated analysis of outputs from SARS-CoV-2 genomic surveillance and computational models can refine our understanding of the evolution of the COVID-19 epidemic and will be equally relevant for the assessment of other emerging pathogens that public health will face in the future. Moving forward, in order to contain SARS-CoV-2 in a relatively low-burden setting such as Australia, application of this high-resolution genomic analysis will be crucial to track, trace and place cases in context to ensure targeted and informed public health action.

[...] were an extension of Stage 1 and included advising the public to stay at home unless going to work or education, shopping for essential supplies, undertaking personal exercise or attending medical appointments or compassionate visits [28].

We undertook SARS-CoV-2 whole-genome sequencing (WGS) using an existing amplicon-based Illumina sequencing approach [29,30]. Briefly, RT-PCR positive samples were reverse transcribed using SuperScript IV VILO MasterMix (ThermoFisher Scientific). The viral cDNA was used as input for multiple overlapping PCR reactions (~2.5 kb each) that spanned the viral genome using Platinum SuperFi MasterMix. Amplicons were then pooled equally, purified and quantified before Nextera XT library preparation and multiplex sequencing on an Illumina iSeq or MiniSeq (150 cycle flow cell) [31]. All consensus SARS-CoV-2 genomes identified in the study have been uploaded to GISAID (Supplementary Table S1).

The raw sequence data were subjected to an in-house quality control procedure prior to further analysis. Demultiplexed reads were quality trimmed using Trimmomatic (sliding window of 4, minimum read quality score of 20, leading/trailing quality of 5) [32]. The taxonomic identification of the sample was verified using Centrifuge version 1.0.4, based on a database compiled from human, prokaryotic and viral sequences (including SARS-CoV-2 RefSeq sequences available prior to March 2020) [33].
All samples had >99% assignment to the genus 'betacoronavirus'. Reference mapping and consensus calling were performed using iVar version 1.2 [29]. Briefly, reads were mapped to the reference SARS-CoV-2 genome (NCBI GenBank accession MN908947) using BWA-MEM version 0.7.17, with unmapped reads discarded. iVar trim was used to soft-clip reads containing primer sequences, and reads shorter than 20 bases after trimming were discarded. A consensus sequence was called for positions where depth was >10 and quality was >20, with a minimum frequency threshold of 0.1. The 5' and 3' UTR regions were masked from the consensus due to the poor quality of these regions. QUAST version 5.0.2 was used to evaluate the consensus sequence quality, in addition to manual inspection in Geneious Prime (2020.0.5) [34].

SARS-CoV-2 genomes from NSW were compared with one another as well as with complete, or near complete, global genomes available at GISAID (www.gisaid.org; accessed 28th March 2020; see Supplementary Table S2 for the complete list of international genomes used in this study). The quality of GISAID genomes was evaluated using QUAST, with sequences retained only if they were >28,000 bp in length and contained <0.05% missing bases (n=1,985 reference genomes). The GISAID and NSW genomes were aligned with MAFFT v7.402 (FFT-NS-2, progressive method) [35]. Genomes were trimmed to remove 5' and 3' untranslated regions. Phylogenetic analysis was performed using the maximum likelihood approach (IQ-TREE v1.6.7; substitution model: GTR+F+R2) with 1,000 bootstrap replicates [36]. SARS-CoV-2 genomic lineages were inferred using Phylogenetic Assignment of Named Global Outbreak LINeages (PANGOLIN) (https://github.com/hCoV-2019/pangolin). Total SNP numbers between the index SARS-CoV-2 genome from NSW (GISAID Accession: NSW01/EPI-ISL-407893) and each genome in the study were calculated using SNP-sites (excluding ambiguities) [37]. Temporal structure and distribution of genomic clusters in NSW were visualised using TreeTime [38].
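The genome quality filter, the SNP counting "excluding ambiguities" and the 2-SNP cluster threshold described in the text can be illustrated with a minimal Python sketch. This is not the study's pipeline, which used QUAST and SNP-sites. The function names, the treatment of `N` as the only missing-base symbol, and the restriction of comparisons to positions where both genomes carry A, C, G or T are assumptions made for illustration.

```python
from typing import Dict, List

# Assumption: only A/C/G/T count as unambiguous; N, gaps and IUPAC codes are skipped.
UNAMBIGUOUS = set("ACGT")


def passes_quality_filter(seq: str,
                          min_length: int = 28_000,
                          max_missing_fraction: float = 0.0005) -> bool:
    """Keep a genome only if it is longer than 28,000 bp with under 0.05% missing bases.

    Assumption: missing bases are those written as 'N'.
    """
    seq = seq.upper()
    if len(seq) <= min_length:
        return False
    return seq.count("N") / len(seq) < max_missing_fraction


def snp_distance(a: str, b: str) -> int:
    """Count differing positions between two aligned sequences, ignoring ambiguities."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    return sum(
        1
        for x, y in zip(a.upper(), b.upper())
        if x in UNAMBIGUOUS and y in UNAMBIGUOUS and x != y
    )


def cluster_members(index_seq: str,
                    aligned: Dict[str, str],
                    max_snps: int = 2) -> List[str]:
    """Return names of genomes within max_snps of the outbreak index genome."""
    return sorted(
        name for name, seq in aligned.items()
        if snp_distance(index_seq, seq) <= max_snps
    )


if __name__ == "__main__":
    # Toy 10-base alignment; real genomes are roughly 30 kb.
    index = "ACGTACGTAC"
    toy = {
        "case_a": "ACGTACGTAC",  # identical to the index
        "case_b": "ACGTTCGTAN",  # one SNP; the N is ignored
        "case_c": "TCGAACGTTC",  # three SNPs, outside the threshold
    }
    print(cluster_members(index, toy))
```

In this toy alignment, `case_a` and `case_b` fall within the 2-SNP threshold and `case_c` does not. Because SARS-CoV-2 diversity was low in this period, such a threshold only suggests cluster membership; the text pairs it with epidemiological data.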
Phylogenetic trees were constructed using the R package ggtree [39].

The original/raw SARS-CoV-2 genome sequencing data will be available in the National Center for Biotechnology Information GenBank by the time of publication. Consensus genome sequences for national and international genomes are available from GISAID (the Global Initiative on Sharing All Influenza Data; www.gisaid.org) (see Supplementary Tables S1 and S2). The ABM data sources have been detailed elsewhere [40-42]. There are no unique pipelines or source code developed for this project.

Cluster number and number of cases (in brackets) are indicated next to bars (only clusters containing five or more cases are shown). Colour gradient reflects the proportion of overseas acquired cases within the cluster, with darkest red representing 100% overseas acquired cases and yellow representing zero overseas acquired cases in the cluster.

Phylogenetic relationships between SARS-CoV-2 genomes recovered from patients in NSW. The inner ring represents the allocation of clusters in NSW (only clusters of five or more genomes are presented). The outer ring shows the classification of cases as locally or overseas acquired based on genomic and epidemiological data. Bootstrap data for lineage inference are presented in Supplementary Table S1.

Table S1. Bioinformatic metrics of consensus SARS-CoV-2 genome sequences from NSW that have been uploaded to GISAID

Table S2. International and Australian SARS-CoV-2 genomes from GISAID used in this study

A new coronavirus associated with human respiratory disease in China
World Health Organisation Coronavirus Situation Report -8th COVID-19) Situation Report 67
Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2)
First known person-to-person transmission of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in the USA
WHO Report of the WHO-China Joint Mission on Coronavirus Diseases 2019 (COVID-19)
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: A study of a family cluster
How will country-based mitigation measures influence the course of the COVID-19 epidemic?
Estimates of the severity of coronavirus disease 2019: A model-based analysis
Fundamental principles of epidemic spread highlight the immediate need for large-scale serological surveys to assess the stage of the SARS-CoV-2 epidemic
Tracking virus outbreaks in the twenty-first century
Towards a genomics-informed, real-time, global pathogen surveillance system
Global initiative on sharing all influenza data - from vision to reality
COVID-19: Towards controlling of a pandemic
A genomic survey of SARS-CoV-2 reveals multiple introductions into Northern California without a predominant lineage
Genomic epidemiology of SARS-CoV-2 in Guangdong Province
Coronavirus (COVID-19) current situation and case numbers
Spread of SARS-CoV-2 in the Icelandic population
The role of pathogen genomics in assessing disease transmission
Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
Nextstrain: Real-time tracking of pathogen evolution
Threats to timely sharing of pathogen sequencing data
Genomic diversity of SARS-CoV-2 in coronavirus disease 2019 patients
An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar
Evolution of human respiratory syncytial virus (RSV) over multiple seasons in New South Wales
An emergent clade of SARS-CoV-2 linked to returned travellers from Iran
Trimmomatic: a flexible trimmer for Illumina sequence data
Centrifuge: rapid and sensitive classification of metagenomic sequences
QUAST: quality assessment tool for genome assemblies
MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform
A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood
SNP-sites: rapid efficient extraction of SNPs from multi-FASTA alignments
Maximum-likelihood phylodynamic analysis
ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data
Modelling transmission and control of the COVID-19 pandemic in Australia
Investigating spatiotemporal dynamics and synchrony of influenza epidemics in Australia: an agent-based modelling approach
Urbanization affects peak timing, prevalence, and bimodality of influenza pandemics in Australia: results of a census-calibrated model

Agent-based modelling by SLC and MP. Study coordination by VS