key: cord-0771744-x6mqff34 authors: Motayo, Babatunde Olarenwaju; Oluwasemowo, Olukunle Oluwapamilerin; Olusola, Babatunde Adebiyi; Akinduti, Paul Akiniyi; Arege, Olamide T; Obafemi, Yemisi Dorcas; Faneye, Adedayo Omotayo; Isibor, Patrick Omoregbe; Aworunse, Oluwadurotimi Samuel; Oranusi, Solomon Uche title: Evolution and genetic diversity of SARSCoV-2 in Africa using whole genome sequences date: 2020-11-28 journal: Int J Infect Dis DOI: 10.1016/j.ijid.2020.11.190 sha: b4e5e7878447ffbcca2f8189911c05f04d082806 doc_id: 771744 cord_uid: x6mqff34
BACKGROUND: The ongoing SARSCoV-2 pandemic reached Africa on 14th February 2020 and has spread rapidly across the continent, causing a severe public health crisis and mortality. We investigated the genetic diversity and evolution of this virus during the early outbreak months, from 14th February to 24th April 2020, using whole genome sequences. METHODS: We performed recombination analysis against closely related CoV, Bayesian time-scaled phylogeny and investigated spike protein amino acid mutations. RESULTS: Recombination signals were observed between the AfrSARSCoV-2 sequences and reference sequences within the RdRP and S genes. The evolutionary rate of the AfrSARSCoV-2 was 4.133 × 10^-4 substitutions/site/year (highest posterior density, HPD: 4.132 × 10^-4 to 4.134 × 10^-4). The time to most recent common ancestor (TMRCA) of the African strains was December 7th 2019 (95% HPD November 12th 2019 to December 29th 2019). The AfrSARSCoV-2 sequences diversified into two lineages, A and B, with B being more diverse and containing multiple sub-lineages, as confirmed by both the maximum clade credibility (MCC) tree and the PANGOLIN software. There was a high prevalence of the D614-G spike protein amino acid mutation, 59/69 (82.61%), among the African strains.
CONCLUSION: This study has revealed a rapidly diversifying viral population with the G614 spike protein variant dominating. We advocate upscaling NGS sequencing platforms across Africa to enhance surveillance and aid control efforts against SARSCoV-2 in Africa.

Towards the end of December 2019, Chinese authorities, through the World Health Organization office in China, reported a new pathogen responsible for a series of pneumonia-associated infections in Wuhan, Hubei province (WHO 2020). The pathogen was later identified as a novel coronavirus closely related to the severe acute respiratory syndrome (SARS) virus, with a possible bat origin (Zhou et al, 2020). The World Health Organization named the disease COVID-19 (Chan et al, 2020), and later declared it a pandemic on 11th March 2020, prompting concerted efforts towards prevention and control worldwide (WHO 2020). On February 11th 2020 the International Committee on Taxonomy of Viruses (ICTV) adopted the name SARS-CoV-2 following the report of its Coronavirus Study Group (CSG, 2020). The virus has been placed in the subgenus Sarbecovirus, genus Betacoronavirus, subfamily Coronavirinae, family Coronaviridae (de Groot et al, 2013; Gorbalenya et al, 2020). Coronaviruses are enveloped viruses containing a single-stranded positive-sense RNA genome of between 26 kb and 32 kb (Masters and Perlman 2013). They are responsible for a host of human and animal infections. The Betacoronaviruses contain the most medically important species of human coronaviruses, such as HuCoV-OC43 and HuCoV-HKU1. The severe acute respiratory syndrome coronavirus (SARSCoV) and the Middle East respiratory syndrome coronavirus (MERS-CoV) are also members of this group, and have been reported as high-consequence pathogens with zoonotic potential that cause large-scale epidemics (Lau et al, 2005; Zaki et al, 2012).
Genomic and structural analyses have revealed that SARSCoV-2 encodes four structural proteins, the spike (S), membrane (M), envelope (E) and nucleocapsid (N) proteins, as well as several non-structural proteins (Chen et al, 2020; Lu et al, 2020). The spike protein is the major antigenic protein responsible for initiating infection, via attachment of its receptor binding domain (RBD) to the SARSCoV/SARSCoV-2 receptor, angiotensin converting enzyme 2 (ACE2) (Donelli et al, 2004; Monteil et al, 2020) [...] (Zhou et al, 2020) and pangolins. Phylogenetic analysis has shown that the virus has diversified over the course of the pandemic into two major lineages, A and B, with several sub-lineages (Rambaut et al 2020). The majority of these reports were generated using genome sequences of SARSCoV-2 from America, Europe and Asia (Rambaut et al, 2020). There has been a paucity of data on the genetic evolution of SARSCoV-2 sequences from Africa, despite the increasing number of genome sequence submissions from Africa to the Global Initiative on Sharing All Influenza Data (GISAID) database. There were 97 whole genome sequences available in the GISAID database as at 24th April 2020. The majority of the published information on SARSCoV-2, particularly in Sub-Saharan Africa, has been on the socio-economic impact of the virus in the region (Akinduti et al, 2020; Olasehinde et al, 2020; Oleribe et al, 2020). This gap in knowledge prompted the conceptualization of this study. Describing the genetic diversity and evolutionary dynamics of SARSCoV-2 will facilitate real-time surveillance and the monitoring of antigenic diversity and virus transmission patterns. This study was therefore designed to determine the genetic diversity and evolutionary history of SARSCoV-2 genome sequences isolated in Africa.

Full genome sequences with high coverage were downloaded from the Global Initiative on Sharing All Influenza Data (GISAID) database.
As of 24th April 2020 there were 97 full genome sequences from Africa available in the GISAID database. A total of 69 high coverage genomes were selected, defined as sequences with < 1% unidentified nucleotides, < 0.05% mutations not found in another isolate and no indel mutations not verified by the submitter. This selection was done using an option in the database that automatically filters only high coverage genomes. Another 151 high coverage full genome sequences were also downloaded from three continents: America (USA), Asia (China and South Korea) and Europe (England, Italy and Germany). Three different datasets were then generated from these sequences. The first dataset consisted of high coverage full genome sequences from Africa, along with the SARSCoV-2 reference genome sequence from Wuhan, China, bat and pangolin SARS-related reference sequences and the SARSCoV reference sequence (n = 76). This dataset was used for the evolutionary and Bayesian phylogenetic analysis of the African SARSCoV-2 genome sequences. The second dataset consisted of complete genome sequences from Africa, America, Asia and Europe (n = 220), and was used for the generation of Bayesian phylogenetic data as well as lineage determination. The third dataset consisted of complete spike protein (S) gene sequences from Africa, along with bat and pangolin SARS-related reference S gene sequences (n = 69). This dataset was used exclusively for spike protein amino acid motif analysis and visualization, to determine significant mutations of the African SARSCoV-2 spike protein sequences. In addition to the sequence data retrieved from GISAID, clinico-demographic information was retrieved from 33 of the sequence submissions from Africa made between February 14th and April 24th 2020. These were the only submissions that had additional demographic information relating to the infected patients.
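The first selection criterion described above (< 1% unidentified nucleotides) can be illustrated with a minimal sketch. In the study the selection was done by the database's own filter, so this is not that implementation; the function names, the toy sequences and the choice to count every symbol other than A, C, G or T as unidentified are assumptions made here for illustration, and the other two criteria are not modelled.

```python
from typing import Dict

# Assumption: any symbol other than A, C, G, T counts as "unidentified".
UNAMBIGUOUS = set("ACGT")


def unidentified_fraction(seq: str) -> float:
    """Fraction of non-gap positions that are not A, C, G or T."""
    bases = [b for b in seq.upper() if b not in "- \n\r"]
    if not bases:
        return 1.0  # an empty record is treated as fully unidentified
    unknown = sum(1 for b in bases if b not in UNAMBIGUOUS)
    return unknown / len(bases)


def filter_high_coverage(genomes: Dict[str, str],
                         max_unknown: float = 0.01) -> Dict[str, str]:
    """Keep genomes whose unidentified fraction is below max_unknown."""
    return {name: seq for name, seq in genomes.items()
            if unidentified_fraction(seq) < max_unknown}


# Toy records (hypothetical, 200 bases each): toy_B carries 4 N's, i.e. 2%.
toy_genomes = {
    "toy_A": "ACGT" * 50,
    "toy_B": "ACGT" * 49 + "NNNN",
}
kept = filter_high_coverage(toy_genomes)
print(sorted(kept))
```

With real data the dictionary would be filled from a FASTA file; only the threshold of 1% is taken from the text.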
Table 1 shows a summary of the demographic distribution by country of the 33 patients.

Whole genome sequences downloaded from the GISAID database were aligned using [...]. We analyzed potential recombination events using the Recombination Detection Program (RDP) software (Martin et al, 2015). The analysis was conducted on whole genome sequences of identified lineages among the 69 African isolates, using the RDP, bootscan, GENECONV, Chimaera, SiScan, 3Seq and maximum chi-square methods. A putative recombination event was accepted only if at least three of these methods gave a positive recombination signal (Liu et al, 2010).

Temporal clock signal was analyzed among the aligned sequences using TempEst version 1.5 (Rambaut et al, 2016). The relationship between root-to-tip divergence and sampling dates supported the use of molecular clock analysis in this study. Phylogenetic trees were generated by Bayesian inference through Markov chain Monte Carlo (MCMC), implemented in BEAST version 1.10.4 (Suchard et al, 2016). We partitioned the coding genes into first+second and third codon positions and applied a separate Hasegawa-Kishino-Yano (HKY+G) substitution model with gamma-distributed rate heterogeneity among sites to each partition (Hasegawa et al, 1985). The relaxed clock with a Gaussian Markov Random Field (GMRF) Skyride coalescent prior was selected for the final analysis, after running different models and comparing them using Bayes factors, with marginal likelihoods estimated using the path sampling and stepping stone methods implemented in BEAST version 1.10.4 (Suchard et al, 2016). The MCMC chain was run for one hundred million steps with a 10% burn-in. Results were then visualized with Tracer version 1.8.

Complete S protein gene sequences of AfrSARSCoV-2 were aligned along with RaTG13 BtCoV and Pangolin SARSrCoV sequences using MAFFT (Katoh et al, 2015). The alignment was then edited and visualized using BioEdit software (Hall, 1999).
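The spike motif screen described above amounts to locating one reference residue in a gapped alignment and tallying what each sequence carries in that column. The sketch below illustrates the idea only: the study used MAFFT and BioEdit rather than custom code, and the function names, the toy alignment and the use of position 4 in place of spike position 614 are all hypothetical.

```python
from collections import Counter
from typing import Dict


def column_for_reference_position(aligned_ref: str, position: int) -> int:
    """Return the 0-based alignment column that holds the given 1-based
    residue of the reference sequence, skipping gap characters."""
    seen = 0
    for col, residue in enumerate(aligned_ref):
        if residue != "-":
            seen += 1
            if seen == position:
                return col
    raise ValueError("reference is shorter than the requested position")


def tally_residues(alignment: Dict[str, str], ref_name: str,
                   position: int) -> Counter:
    """Count the residues found at a reference position in every
    non-reference sequence of an amino acid alignment."""
    col = column_for_reference_position(alignment[ref_name], position)
    return Counter(seq[col] for name, seq in alignment.items()
                   if name != ref_name)


# Toy amino acid alignment (hypothetical); the reference has one gap, so
# its 4th residue (D) sits in alignment column 4 rather than column 3.
toy_alignment = {
    "ref": "MK-TDLV",
    "s1":  "MKATGLV",
    "s2":  "MK-TDLV",
    "s3":  "MK-TGLV",
}
counts = tally_residues(toy_alignment, "ref", 4)
print(counts)
```

For the real screen the reference would be the SARSCoV-2 spike protein and the position 614, with the count of G over the number of sequences giving the prevalence of the variant.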
The current global SARSCoV-2 pandemic, otherwise known as COVID-19, began on the African continent with a European returnee in Egypt on February 14th 2020 (WHO 2020). It has since spread to virtually all the countries within the African region. This study was based on sequences generated during the early phase of the pandemic in Africa, between February 2020 and April 2020. Sixty-nine high coverage full genome sequences from six African countries [...] (Figure 1b). This result is consistent with a previous report from Saudi Arabia which investigated recombination between SARSCoV-2 and closely related viruses such as SARSCoV and MERS (Nour et al, 2020).

The evolutionary rate of the AfrSARSCoV-2 isolates during the period under study was 4.133 × 10^-4 substitutions/site/year (highest posterior density interval, HPD: 4.132 × 10^-4 to 4.134 × 10^-4). This is slightly higher than the rate of 3.345 × 10^-4 reported earlier for early outbreak strains from China (Li et al, 2020); it is, however, lower than the global SARSCoV-2 evolutionary rate of 8.0 × 10^-4 reported by Nextstrain (www.nextstrain.org/ncov/global). There seems to be a gradual increase [...] to December 17th 2019). Our TMRCA was more recent than that of a similar study, which reported a TMRCA of 14th October 2019 among global isolates including Chinese isolates (Li et al, 2020), but earlier than that of another recent study investigating the evolutionary dynamics of the ongoing SARSCoV-2 epidemic in Brazil, which reported a TMRCA of 10th February 2020 (Xavier et al, 2020). These differences in time of origin between studies may be due to differences in the number of sequences analyzed and in the Bayesian models employed, although the majority of reports utilize coalescent relaxed clock models (Li et al, 2020). The epidemic history of the ongoing outbreak was investigated using the Bayesian Skyline Plot (BSP).
The BSP showed a steady increase in viral population as the outbreak progressed over the study period (Figure 3b). This observation is expected, as the viral population should increase as the infection spreads. A major limitation was the rather small number of sequences analyzed and the very short study duration; therefore our results may not reflect the exact viral population dynamics of the outbreak in Africa.

The AfrSARSCoV-2 sequences were analyzed for the D614-G mutation within the S1 subunit of the spike protein, which has been reported to contribute to increased transmissibility of SARSCoV-2 (Korber et al, 2020). Figure 4 shows a representative amino acid alignment of selected AfrSARSCoV-2 sequences along with reference sequences of BtCoV RaTG13 and PCoV. Our results revealed a high prevalence of the D614-G mutation among AfrSARSCoV-2, with 59/69 (82.61%). The mutation was recorded in isolates from all African countries analyzed in this study (supplementary figure 2). Prior to this report the D614-G spike mutation was found predominantly in Europe, accompanied by a high number of cases and a significant mortality rate (Pachetti et al, 2020; Korber et al, 2020b). The introduction of this strain into Africa is quite worrisome, considering the population densities of most African cities and the poor state of the public health infrastructure available to support medical intervention in symptomatic SARSCoV-2 cases. Although more evidence is still required to determine the extent of the effect of the D614-G mutation on the virulence properties of the virus, current evidence from in vitro studies seems to support the hypothesis of increased transmissibility of this variant of the virus (Korber et al, 2020; Hu et al, 2020).

In conclusion, we have reported the genetic diversity and evolutionary history of SARSCoV-2 isolated in Africa during the early outbreak period.
Our findings have identified diverse sublineages of SARSCoV-2 currently circulating among Africans. We identified a relatively high prevalence of the D614-G spike protein variant of the virus, which is capable of rapid transmission, in all countries sampled. Major limitations of this study were the lack of sufficient patient information from the originating samples, which would have helped in further linking epidemiologic data to the sequence data, and the relatively low number of sequence submissions from Africa available in the GISAID database at the time of this study compared with those of other regions such as Europe and Asia. We advocate the upscaling of next generation sequencing (NGS) capacity for whole genome sequencing of SARSCoV-2 samples across the African continent to support surveillance and control efforts in Africa.

We are grateful to all the authors and the originating and submitting laboratories from the Global Initiative on Sharing All Influenza Data (GISAID) EpiCoV database (http://www.gisaid.org) for making the sequences available for use in our study. We also acknowledge the management of Covenant University, Ota, for their support to publish this work.

Sero-epidemiological impact of SARSCoV-2 on the socio-demographic status of African populace
The species severe acute respiratory syndrome-related virus: classifying 2019-nCoV and naming it SARSCoV-2
A familial cluster of pneumonia associated with the 2019 novel coronavirus indicating person-to-person transmission: a study of a family cluster
Commentary: Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group
Epidemiological and genetic analysis of severe acute respiratory syndrome
Covid-19 situation world wide as at 21st May
BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. 1999
Dating of the human-ape splitting by a molecular clock of mitochondrial DNA
The D614G mutation of SARS-CoV-2 spike protein enhances viral infectivity and decreases neutralization sensitivity to individual convalescent sera
Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus
Spike mutation pipeline reveals the emergence of a more transmissible form of SARS-CoV-2
Identifying SARS-CoV-2 related coronaviruses in Malayan pangolins
Severe acute respiratory syndrome coronavirus-like virus in Chinese horseshoe bats
Evolutionary history, potential intermediate animal host, and cross species analysis of SARSCoV2
Codon usage bias and recombination events for neuraminidase and hemagglutinin genes in Chinese isolates of influenza A virus subtype H9N2. Archives of Virology
Genomic characterization and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding
RDP4: Detection and analysis of recombination patterns in virus genomes. Virus Evolution
Chapter 28, Coronaviridae
Inhibition of SARS-CoV-2 infections in engineered human tissues using clinical-grade soluble human ACE2
Insights into evolution and recombination of pandemic SARSCoV-2 using Saudi Arabian sequences. 2020. bioRxiv preprint
COVID-19 Pandemic: Perception, Practices and Preparedness in Nigeria
COVID-19 experience: Taking the right steps at the right time to prevent avoidable Morbidity and Mortality in Nigeria and other nations of the World
Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant
Exploring the temporal structure of heterochronous sequences using TempEst (formerly Path-O-Gen). Virus Evolution
A dynamic nomenclature proposal for SARS-CoV-2 to assist genomic epidemiology
Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution
Novel Coronavirus (2019-nCoV) Situation Report -1, 21
WHO Africa/Second case of nCoV confirmed in Africa
Coronavirus disease 2019 (COVID-19)
The ongoing COVID-19 epidemic in Minas Gerais, Brazil: insights from epidemiological data and SARS-CoV-2 whole genome sequencing
Maximum Likelihood Phylogenetic Estimation from DNA Sequences with Variable Rates over Sites: Approximate Methods
Isolation of a novel coronavirus from a man with pneumonia in Saudi Arabia
Probable Pangolin Origin of SARS-CoV-2 Associated with the COVID-19 Outbreak
A pneumonia outbreak associated with a new coronavirus of probable bat origin

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. The authors declare that there are no conflicts of interest in regard to the publication of this study. The authors did not receive any form of funding to conduct this research. The study did not require ethical approval, so none was sought.