key: cord-0887516-n7lg0jqe authors: Giandhari, Jennifer; Pillay, Sureshnee; Wilkinson, Eduan; Tegally, Houriiyah; Sinayskiy, Ilya; Schuld, Maria; Lourenco, Jose; Chimukangara, Benjamin; Lessells, Richard; Moosa, Yunus; Gazy, Inbal; Fish, Maryam; Singh, Lavanya; Khanyile, Khulekani Sedwell; Fonseca, Vagner; Giovanetti, Marta; Alcantara, Luiz Carlos Junior; Petruccione, Francesco; de Oliveira, Tulio title: Early transmission of SARS-CoV-2 in South Africa: An epidemiological and phylogenetic report date: 2020-11-12 journal: Int J Infect Dis DOI: 10.1016/j.ijid.2020.11.128 sha: 9e56e2516be4cefe3ec216d9ba04ccb93351cb07 doc_id: 887516 cord_uid: n7lg0jqe OBJECTIVES: To investigate the introduction and understand the early transmission dynamics of SARS-CoV-2 in South Africa, we formed the Network for Genomic Surveillance in South Africa (NGS-SA). DESIGN: Here, we present the first results of this effort, which is a molecular epidemiological study of the first twenty-one SARS-CoV-2 whole genomes sampled in the first port of entry, KwaZulu-Natal (KZN), during the first month of the epidemic. By combining this with calculations of the effective reproduction number (R), we aim to shed light on the patterns of infections in South Africa. RESULTS: Two of the largest provinces, Gauteng and KwaZulu-Natal, had a slow growth rate in the number of detected cases, while in Western Cape and Eastern Cape the epidemic is spreading fast. Our estimates of transmission potential suggest a decrease towards R = 1 since the first cases and deaths but a subsequent estimated R average of 1.39 between 6 and 18 May 2020. We also demonstrate that early transmission in KZN was associated with multiple international introductions and dominated by lineages B1 and B, and provide evidence for locally acquired infections in a hospital in Durban within the first month of the epidemic.
CONCLUSION: The COVID-19 pandemic in South Africa was very heterogeneous in its spatial dimension, with many distinct introductions of SARS-CoV2 in KZN and evidence of nosocomial transmission, which inflated early mortality in KZN. The pandemic at the local level is still developing and the objective of NGS-SA is to clarify the dynamics of the epidemic in South Africa and devise the most effective measures as the outbreak evolves.

The novel coronavirus disease 2019 was detected in China in late December 2019.
On 30 January 2020, it was declared a Public Health Emergency of International Concern by the World Health Organization (WHO) (Sohrabi et al., 2020). By 15th of May 2020, there were 4,621,410 COVID-19 cases and 308,542 related deaths worldwide (Worldometer, 2020), involving almost every country in the world. Within five months, the virus had spread to Europe, America and eventually to Africa. The first case in sub-Saharan Africa was reported in Nigeria on 28th of February 2020 (Adepoju, 2020), and at the time of writing, the pandemic has spread to almost all countries on the African continent. South Africa has had the highest number of COVID-19 cases in Africa to date, with a total of 13,524 people infected and 247 deaths (as at 15th of May) (COVID-19 WEEKLY EPIDEMIOLOGY BRIEF PROVINCES AT A GLANCE, n.d.). The first confirmed case of COVID-19 in South Africa was reported on 5th of March 2020. Decisive early action was taken by the government: a national state of disaster was declared on 15th of March 2020, and a nationwide lockdown was enforced on 27th of March 2020 to avoid the first wave overwhelming the health system. While initially only people who had travelled to at-risk countries and their contacts received PCR tests for severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2), the recommendation was later broadened to include all people with an acute respiratory illness. Furthermore, a program of community-based screening and testing was rolled out across the country (NICD, 2020). Testing increased rapidly and by the middle of May 2020, over 600,000 tests had been carried out in South Africa (approximately 10,000 per million population) (Roser M et al., 2020). As the global pandemic has expanded, whole genome sequencing (WGS) and genomic epidemiology (Grubaugh et al., 2019) have been consistently used to investigate COVID-19 transmission and outbreaks (Deng et al., 2020; Eden et al., 2020; Gonzalez-Reiche et al., 2020; Grubaugh, 2020; Leung et al., 2020; Lu et al., 2020; Munnink et al., 2020).
In response to the COVID-19 pandemic, the South African Network for Genomics Surveillance of COVID (NGS-SA) was formed (Msomi et al., 2020), which is a network of five large government laboratories and five public universities funded by the Department of Science and Innovation and the South African Medical Research Council. In this paper, our consortium focuses on a detailed analysis of the epidemic in South Africa and preliminary genomic analysis of some of the first introductions of SARS-CoV-2 in KwaZulu-Natal (KZN). We show that although the South African epidemic started in KZN, which had the first cases and deaths, other provinces in the country, namely the Western Cape (WC), Gauteng (GP) and the Eastern Cape (EC), have overtaken KZN in the number of confirmed cases. We also show evidence of many distinct introductions of SARS-CoV-2 in KZN and early evidence suggesting nosocomial transmission.

We used publicly released data up to 11 May 2020 from the National Department of Health (NDoH) and the National Institute for Communicable Diseases (NICD) in South Africa, which are collected in the repository of the Data Science for Social Impact Research Group at the University of Pretoria (Marivate et al., 2020), as well as global data on confirmed cases from the Johns Hopkins Coronavirus Resource Centre (Dong et al., 2020). The NDoH releases daily updates on the number of new confirmed cases, with a breakdown by province. In the early stages of the epidemic, individual-level information on sex, age and travel history was released, but detailed reporting was discontinued on 23rd of March. In addition, the NICD releases daily updates on the number of reverse-transcriptase polymerase chain reaction (RT-PCR) tests performed across all public and private sector laboratories, as well as the number of cases testing positive for severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2).
We also extracted information from government press releases and speech transcripts to chart a timeline of the government response to the epidemic. To understand the epidemic trajectory, we plotted the cumulative number of confirmed cases by province since the report of the hundredth case in the country. The effective reproduction number (R) was estimated by taking into account the observed epidemic growth rate $r$ and two theoretical relationships (i, ii) of R with $r$ previously described in the literature. (i) We used the relationship $R = (1 + r/b)^{a}$ as described in Imperial College London's COVID-19 report 13 (Flaxman et al., 2020), where $a = m^{2}/s^{2}$ and $b = m/s^{2}$, with $m$ the serial interval distribution (SID) mean and $s$ the SID standard deviation. The SID distribution used is the one estimated by Nishiura and colleagues (Nishiura et al., 2020), with $m = 4.7$ and $s = 2.9$. We term this approach the Flaxman et al. approach (Flaxman et al., 2020). (ii) We used the relationship $R = (1 + r/\sigma)(1 + r/\delta)$, with $1/\sigma$ the infectious period and $1/\delta$ the incubation period, as described by Wallinga and Lipsitch (Wallinga and Lipsitch, 2007), which is based on an SEIR modelling framework and expects both periods to be exponentially distributed. We used exponential distributions with mean 5.1 days for the incubation period (Kucharski et al., 2020; Linton et al., 2020) and 4 days for the infectious period (Kucharski et al., 2020; Linton et al., 2020). We term this approach the Wallinga et al. approach. To obtain the epidemic growth rate $r$, we used maximum likelihood estimation in R (function optim), fitting the exponential growth model $A_0 e^{rt}$ to the reported time series of cases and deaths (independently), where $t$ is time, $A_0$ is the number of reports at $t = 0$, and $r$ is the growth rate. We used daily reported deaths and cases. Data on deaths covered 27th of March to 11th of May, and data on cases covered 5th of March to 12th of May.
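The two growth-rate relationships and the exponential fit described above can be illustrated with a minimal Python sketch. This is not the authors' R code: the function names are hypothetical, and the fit assumes a Poisson likelihood for the daily counts, which the text does not specify. Under that assumption the maximum likelihood estimate of $r$ solves a one-dimensional score equation, so a simple bisection stands in for a general-purpose optimizer such as optim.

```python
import math


def r_flaxman(r, mean_si=4.7, sd_si=2.9):
    """R = (1 + r/b)^a for a gamma-distributed serial interval."""
    a = mean_si ** 2 / sd_si ** 2  # a = m^2 / s^2
    b = mean_si / sd_si ** 2       # b = m / s^2
    return (1.0 + r / b) ** a


def r_wallinga(r, incubation_days=5.1, infectious_days=4.0):
    """R = (1 + r/sigma)(1 + r/delta) for exponentially distributed periods."""
    sigma = 1.0 / infectious_days  # 1/sigma is the infectious period
    delta = 1.0 / incubation_days  # 1/delta is the incubation period
    return (1.0 + r / sigma) * (1.0 + r / delta)


def fit_growth_rate(counts, lo=-1.0, hi=1.0, tol=1e-10):
    """Fit counts[t] ~ Poisson(A0 * exp(r * t)) for t = 0, 1, 2, ...

    ASSUMPTION: Poisson likelihood (not stated in the paper).
    The search for r is restricted to [lo, hi] per day.
    Returns the pair (A0, r).
    """
    total = sum(counts)
    if len(counts) < 2 or total <= 0:
        raise ValueError("need at least two time points and a positive total")

    # Mean reporting time observed in the data.
    observed_mean_t = sum(t * y for t, y in enumerate(counts)) / total

    def model_mean_t(rate):
        # Mean reporting time implied by exponential growth at this rate.
        weights = [math.exp(rate * t) for t in range(len(counts))]
        return sum(t * w for t, w in enumerate(weights)) / sum(weights)

    # The Poisson score equation reduces to model_mean_t(r) == observed_mean_t,
    # and model_mean_t is increasing in r, so bisection finds the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if model_mean_t(mid) < observed_mean_t:
            lo = mid
        else:
            hi = mid
    rate = 0.5 * (lo + hi)
    a0 = total / sum(math.exp(rate * t) for t in range(len(counts)))
    return a0, rate
```

Given a daily series `counts`, `a0, r = fit_growth_rate(counts)` followed by `r_wallinga(r)` or `r_flaxman(r)` reproduces the two-step logic of the text. Both relationships return R = 1 when r = 0, which is a quick sanity check on any implementation.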
This approach is similar to that implemented by Xavier et al. (Xavier et al., 2020).

Tiling Polymerase Chain Reaction
cDNA synthesis was performed on the RNA using random primers followed by gene-specific multiplex PCR using the ARTIC protocol (Quick, 2020). Briefly, extracted RNA was converted to cDNA using the Protoscript II First Strand cDNA synthesis Kit (New England Biolabs, Hitchin, UK) and random hexamer primers. SARS-CoV-2 whole genome amplification by multiplex PCR was carried out using primers designed on Primal Scheme.

Raw reads coming from both Nanopore and Illumina sequencing were assembled using Genome Detective 1.126 (https://www.genomedetective.com/) and the Coronavirus Typing Tool (Cleemput et al., 2020; Vilsker et al., 2019). The initial assembly obtained from Genome Detective was polished by aligning mapped reads to the references and filtering out low-quality mutations using the bcftools 1.7-2 mpileup method. All mutations were confirmed visually with bam files using Geneious software (Biomatters Ltd, New Zealand). All of the sequences were deposited in GISAID (https://www.gisaid.org/) (Shu and McCauley, 2017).

We downloaded all sequences and associated metadata from the GISAID sequence database (https://www.gisaid.org/) (Shu and McCauley, 2017) as of 1st of May 2020 (n=15,793). Due to the low variability of SARS-CoV-2, we wished to include only high-quality sequences in our downstream analyses. To this end, we filtered out sequences that were <25 kbp in length as well as sequences with a high proportion of ambiguous sites (>5%). Additionally, we removed sequences that lacked geographic and/or sampling date information. The resulting 10,959 sequences were analyzed along with 20 sequences that were generated by the laboratory at the (Rambaut et al., 2020), which suggest this to be the first lineage (hence A). Lineage A genomes are characterized by two unique mutations (8782C>T and 28144T>C), relative to lineage B.
Lineage B, on the other hand, shares no common mutations since this lineage contains the global SARS-CoV-2 genome reference (Wuhan-Hu-1). From these lineages, sub-lineages (e.g. A.1, A.2, A.3 and so forth) are then designated, each defined by an additional set of unique mutations. For example, for sub-lineage A.1, these mutations would be 11747C>T, 1785A>G and 18060C>T. Sub-lineages can further diversify into sub-sublineages (e.g. A.1.1). Please refer to the schema provided in Supplementary Figure 5 for more information.

Phylogenetic analysis
The 10,959 GISAID reference genomes and 20 KRISP sequences were aligned in MAFFT v7.313 (FFT-NS-2), followed by manual inspection and editing in the Geneious Prime software suite (Biomatters Ltd, New Zealand). We constructed a maximum likelihood (ML) tree topology in IQ-TREE (GTR+G+I, no support) (Nguyen et al., 2015; Tavaré and Miura, 1986). Due to the large size of the alignment and the low variability, we opted not to infer support for splits in this tree topology. In any tree topology of SARS-CoV-2, the majority of splits will be poorly supported, with only the splits separating the major lineages having good support. The resulting ML tree topology was transformed into a time-scaled phylogeny using TreeTime (Sagulenko et al., 2018) with a clock rate of $8 \times 10^{-4}$ and rooted along the branch of Wuhan-WH04 (GISAID: hCoV19/Wuhan/WH04/2020) and Wuhan-Hu-1 (Genbank: MN908947). The resulting phylogeny was viewed and annotated in FigTree and ggtree. Based on this large phylogeny of SARS-CoV-2, we randomly down-sampled the GISAID reference sequences that passed initial sequence quality checks to ~10% of the original size. All African sequences in the GISAID subset, the 20 genotypes that were generated in this study, as well as a select few external references (e.g. Wuhan-Hu-1) were included.
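The sequence quality filters and the random down-sampling described above can be sketched in Python as follows. This is an illustrative sketch only, not the pipeline used in the study: the function names and the record layout are hypothetical, and "ambiguous site" is assumed here to mean any character other than A, C, G or T (including N and gap characters).

```python
import random


def passes_quality(sequence, sample_date, location,
                   min_length=25_000, max_ambiguous=0.05):
    """Apply the three filters from the text to a single genome.

    ASSUMPTION: a site is ambiguous if it is not one of A, C, G, T.
    """
    # Drop records lacking a sampling date or geographic information.
    if not sample_date or not location:
        return False
    # Drop sequences shorter than 25 kbp.
    if len(sequence) < min_length:
        return False
    # Drop sequences with more than 5% ambiguous sites.
    ambiguous = sum(1 for base in sequence.upper() if base not in "ACGT")
    return ambiguous / len(sequence) <= max_ambiguous


def downsample(records, fraction=0.10, seed=0):
    """Randomly keep about `fraction` of the records, without replacement.

    The seed is fixed only so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    keep = round(len(records) * fraction)
    return rng.sample(list(records), keep)
```

In a workflow like the one described, the filter would be applied to every downloaded genome first, the down-sampling would be applied to the genomes that pass, and sequences that must always be retained (for example the African sequences and the study genomes) would be added back afterwards.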
The resulting dataset of 1848 sequences was used in a custom build on the NextStrain analysis platform. In short, sequences were aligned in MAFFT v7.313 and visualized and manually edited in Geneious software (Biomatters Ltd, New Zealand) as previously described. ML tree topologies were inferred from each alignment in IQ-TREE v1.6.9 (GTR+G+I, with transfer support values) (Nguyen et al., 2015; Tavaré and Miura, 1986). Resulting tree topologies were analyzed in the TempEst software suite for temporal clock signal (Supplementary Figure S4). Coalescent molecular clock analyses were performed in BEAST v1.8. In short, analyses were run under a strict molecular clock assumption at a constant evolutionary rate of $8 \times 10^{-4}$ nucleotide substitutions per site per year and an exponential growth coalescent tree prior. The Markov chains were run in duplicate for a total length of 100 million steps, sampling every 10,000 iterations. Runs were assessed in Tracer for good convergence (ESS>200), and trees were summarized in TreeAnnotator after discarding 10% of each run as burn-in.

Data Availability
SARS-CoV-2 genome sequences generated in this study have been deposited in the GISAID database (https://www.gisaid.org/), under the following accession IDs: EPI_ISL_421572, EPI_ISL_421573, EPI_ISL_421574, EPI_ISL_421575, EPI_ISL_421576, EPI_ISL_436684, EPI_ISL_436685, EPI_ISL_436686, EPI_ISL_436687. In addition, raw short and long reads have been submitted to the Short Read Archive (SRA) and can be accessed under BioProject Accession: PRJNA636748 (https://www.ncbi.nlm.nih.gov/bioproject/PRJNA636748).

The first confirmed case of COVID-19 in South Africa was reported on 5th of March 2020 in KZN, in a South African citizen returning home from a skiing holiday in Italy. A steady increase in the number of confirmed cases in South Africa (all imported cases) followed over the next week, with the first suspected case of local transmission reported on 13th of March 2020 in Durban, KwaZulu-Natal.
The early cases were predominantly located in the three provinces with the main urban populations and international travel hubs, namely GP (main cities Pretoria and Johannesburg), the WC (Cape Town) and KZN (Durban). In these three provinces, the doubling time for confirmed cases was approximately three days prior to the lockdown (Figure 1). However, since the lockdown on 27th of March 2020, the epidemic seems to be growing at different rates in South Africa. The South African epidemic has been very heterogeneous. For example, the first cases and deaths happened in KZN and GP. This was more pronounced in KZN, as a large nosocomial outbreak in a private hospital in Durban caused KZN to lead the country in number of deaths until the WC overtook it on 21 April 2020. In addition, GP, home of the largest metropolitan area, Johannesburg, had an unusual epidemic, as the majority of initial cases were in middle-aged and wealthy individuals who had travelled overseas for holidays. This translated into a very small number of deaths over time, and infections were concentrated in the wealthy suburb of Sandton in Johannesburg. However, the epidemic expanded the fastest in the Western Cape (WC) province, especially in Cape Town, which is the capital and most populated city of the WC. At the time of writing this report, this province has over 60% of all of the cases and deaths in South Africa (Figure 2). There is mounting evidence that the Western Cape is seeding the growing epidemic in the Eastern Cape, as the funerals for some of the deaths in the Western Cape are taking place in the Eastern Cape. This dynamic and heterogeneous epidemic complicates the estimation of the effective reproduction number (R) over time and space. For example, estimates of R0 for South Africa derived from deaths, which are normally one of the gold-standard data sources for this estimation, were stable at 1.12 (1.0-1.2) in May 2020 (Supplementary figure 1).
KZN, the first province affected by COVID-19, initially had the highest death rate but, in the last period analyzed, had only 3 deaths. We have therefore attempted to estimate R from two data sources: aggregated reported cases and deaths at the country level (see Methods). Similar to what has been observed in other regions of the world, our estimates of R for South Africa suggest a decreasing transmission potential towards R=1 since the first cases and deaths were reported, independently of the data source used. For the last period analyzed, between 6 and 18 May, using the Wallinga et al. approach (Wallinga and Lipsitch, 2007), we find that R was still 1.39 (95% CI 1.04-2.15), suggesting potential for sustained transmission in the near future.

In order to determine the route of introduction of SARS-CoV-2 in KZN, we assessed 27 of the first confirmed cases in the province. Samples obtained from nasopharyngeal swabs represented fourteen females and ten males between the ages of 23 and 74 years. We managed to produce 20 near-whole genome sequences (>90% coverage) from these samples, and six partial genomes (Supplementary Table S2, Table S3). To this dataset, we added an extra genome from the NICD, which was sampled in KZN (a close contact of the first reported case) on 7th of March 2020. The 21 KZN whole genomes (20 KRISP and one NICD) were assigned to SARS-CoV-2 sub-lineages according to the nomenclature proposed and lineage classification obtained from >5000 genomes analyzed by Rambaut et al. (Rambaut et al., 2020). Given uncertainties pertaining to the low diversity of this virus (Moreno et al., 2020)

The spread of SARS-CoV-2 across the globe has given rise to one of the largest evolving pandemics in modern times. South Africa currently has the highest number of infections in Africa.
South Africa seems to be moving to the next stage of the COVID-19 pandemic, with increasing community transmission even during the stringent lockdown and the epidemic growing at different rates in different regions of the country. At the time of writing this report, Cape Town, the main city in the WC, has the fastest increase of new infections and deaths in South Africa. Recent data indicate that over 62% of the new infections and deaths are happening in this province, although only 17% of the South African population lives in this region. The fast spread of COVID-19 in the WC is not fully explained by higher testing rates: this province has performed between 20% and 22% of the tests in South Africa, but its positivity rate has been around 9%, whereas in the other provinces the positivity rate is around 1-2%. Our estimates of transmission potential for South Africa suggest a decreasing transmission potential towards R=1 since the first cases and deaths were reported, similar to what has been observed in other regions of the world. For the last period analyzed, between 6 and 18 May, when using the Wallinga et al. estimation approach applied to the time series of reported cases, we estimate that R was on average 1.39 (95% CI 1.04-2.15). Overall, these results suggest an epidemic still in expansion at that time, in spite of a very early lockdown. Sequencing of viral isolates from early COVID-19 cases in KZN, which is the province of South Africa with the first infections and early deaths, provided useful insights into the origins and transmission of SARS-CoV-2. From the first twenty-one genomes analyzed, we found thirteen independent introductions in KZN. These introductions were related to lineages B, B.1 and B.2, which have spread widely in Europe and North America. We also found a cluster of cases in health care workers in Durban, highlighting the potential importance of nosocomial transmission in this pandemic, and potentially two other transmission pairs.
The production of genomes from the WC will be crucial to understand the drivers of transmission during the lockdown period, and particularly whether health care facilities, prisons, workplaces and other institutions are acting as amplifiers of transmission. This is one of the main activities that our consortium, NGS-SA, is currently working on. Genomic analysis of SARS-CoV-2 in Africa has proved challenging on many fronts. First, sequencing of high-quality SARS-CoV-2 genomes is not a straightforward task. For example, a survey of thousands of sequences deposited in public databases has revealed a number of putative sequencing issues that appear to be the result of contamination, recurrent sequencing errors or hypermutability (Virological, 2020). These might arise from laboratory-specific techniques of sample preparation, sequencing technology or consensus calling. Furthermore, the low diversity of this virus and the small number of mutations that define lineages have prompted caution in the interpretation of early phylogenetic analysis worldwide (Lu et al., 2020). Apparent local transmission clusters can often in fact be the result of multiple introductions from regions that are under-sampled because of non-uniform sequencing efforts (Grubaugh et al., 2019; Kraemer et al., 2019). To mitigate this, we confirmed phylogenetic results by manual inspection of mutations relative to the SARS-CoV-2 reference (Supplementary Table S4). Second, the pandemic is still evolving, and the grouping of SARS-CoV-2 into lineages and subclades is likely to be dynamic at this stage and is influenced by the proportionally larger number of sequences produced in the northern hemisphere (Rambaut et al., 2020). Third, the travel histories of apparent community transmission cases need to be thoroughly investigated in order to elucidate the true dynamics of transmission in a particular area.
In our case, a subsequent investigation into the samples comprising the monophyletic cluster revealed the association with a large hospital outbreak of SARS-CoV-2 infections in Durban, KZN (Lessells et al., manuscript in preparation, 2020). This paper has some important limitations. The first is related to the estimation of R from a limited number of deaths in an epidemic that is highly heterogeneous in both time and space, for which we were able to estimate R only at the aggregated country level. The second is a lack of well-established genomics laboratories that can sequence the virus in Africa. This is amplified by the difficulty of acquiring reagents that are in high demand, coupled with the disruption of air freight. It is therefore a high priority for our consortium, NGS-SA, to evaluate and share protocols among national laboratories in South Africa that can generate high-quality sequences, and to equip our laboratories with the protocols and bioinformatics pipelines needed to properly investigate virus introductions and to validate variant calls with a detailed and reliable bioinformatics system. NGS-SA is also working with the Africa Centres for Disease Control and Prevention (Africa CDC) and the World Health Organization (WHO) to strengthen genomics surveillance on the African continent.

In this paper, we provide an early analysis of the COVID-19 pandemic in South Africa, showing very heterogeneous epidemics in the different provinces. We also estimated SARS-CoV-2 genetic diversity in KZN using the first twenty-one genomes from some of the first cases in the country. We find that KZN had many distinct introductions of SARS-CoV-2, but also had early evidence of nosocomial transmission. The pandemic at the local level is still developing and the objective of NGS-SA is to clarify the dynamics of the epidemic in South Africa and devise the most effective measures as the outbreak evolves.

The authors have no conflict of interest to declare.
The geographic region of the other sequences is marked with coloured circles.

Nigeria responds to COVID-19; first case detected in sub-Saharan Africa
The proximal origin of SARS-CoV-2
Genome Detective Coronavirus Typing Tool for rapid identification and characterization of novel coronavirus genomes
COVID-19 WEEKLY EPIDEMIOLOGY BRIEF PROVINCES AT A GLANCE
A Genomic Survey of SARS-CoV-2 Reveals Multiple Introductions into Northern California without a Predominant Lineage
An interactive web-based dashboard to track COVID-19 in real time
An emergent clade of SARS-CoV-2 linked to returned travellers from Iran
Estimating the number of infections and the impact of non-pharmaceutical interventions on COVID-19 in 11 Imperial College COVID-19 response team
Introductions and early spread of SARS-CoV-2 in the New York City area
Coast-to-coast spread of SARS-CoV-2 during the early epidemic in the United States
Tracking virus outbreaks in the twenty-first century
Issues with SARS-CoV-2 sequencing data - Novel 2019 coronavirus / nCoV-2019 Genomic Epidemiology - Virological
Nextstrain: real-time tracking of pathogen evolution
Reconstruction and prediction of viral disease epidemics
Early dynamics of transmission and control of COVID-19: a mathematical modelling study
Renewing Felsenstein's phylogenetic bootstrap in the era of big data
A territory-wide study of early COVID-19 outbreak in Hong Kong community: A clinical, epidemiological and phylogenomic investigation
Evolutionary history, potential intermediate animal host, and cross-species analyses of SARS-CoV-2
Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: a statistical analysis of publicly available case data
CoV-2 in Guangdong Province
Coronavirus disease (COVID-19) case data - South Africa
Coronavirus Pandemic (COVID Publ Online OurWorldInDataOrg
Limited SARS-CoV-2 diversity within hosts and following passage in cell culture
A genomics network established to respond rapidly to public health threats in South Africa
Rapid SARS-CoV-2 whole genome sequencing for informed public health decision making in the Netherlands
IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies
Serial interval of novel coronavirus (COVID-19) infections
Forked from Ebola virus sequencing protocol 2020
A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology
TreeTime: Maximum-likelihood phylodynamic analysis
Global initiative on sharing all influenza data - from vision to reality
World Health Organization declares global emergency: A review of the
Some Mathematical Questions in Biology: DNA Sequence Analysis Lectures on Mathematics in the Life Sciences
Genome Detective: an automated system for virus identification from high-throughput sequencing data
How generation intervals shape the relationship between growth rates and reproductive numbers
The ongoing COVID-19 epidemic in Minas Gerais, Brazil: insights from epidemiological data

We wish to extend our thanks to all laboratory personnel who have worked hard to genotype SARS-CoV-2 samples and who have generously made the data public via the GISAID database. Without this free data-sharing environment, this research would not have been possible. A full list of acknowledgments to contributing laboratories can be found in Supplementary Table S8.