key: cord-0791701-y78w33it authors: Sassi, Mouna Ben; Ferjani, Sana; Mkada, Imen; Arbi, Marwa; Safer, Mouna; Elmoussi, Awatef; Abid, Salma; Souiai, Oussema; Gharbi, Alya; Tejouri, Asma; Gaies, Emna; Eljabri, Hanene; Ayed, Samia; Hechaichi, Aicha; Daghfous, Riadh; Gouider, Riadh; Khelil, Jalila Ben; Kharrat, Maher; Kacem, Imen; Alya, Nissaf Ben; Benkahla, Alia; Trabelsi, Sameh; Boubaker, Ilhem Boutiba-Ben title: Phylogenetic and amino acid signature analysis of the SARS-CoV-2s lineages circulating in Tunisia date: 2022-05-10 journal: Infect Genet Evol DOI: 10.1016/j.meegid.2022.105300 sha: 1521a73c677b1a4fd7e93d9eea386d7c98b41143 doc_id: 791701 cord_uid: y78w33it Since the beginning of the Coronavirus disease-2019 pandemic, there has been a growing interest in exploring SARS-CoV-2 genetic variation to understand the origin and spread of the pandemic, improve diagnostic methods and develop appropriate vaccines. The objective of this study was to identify the SARS-CoV-2 lineages circulating in Tunisia and to explore their amino acid signature in order to follow their genome dynamics. Whole genome sequencing and genetic analyses were performed on fifty-eight SARS-CoV-2 samples collected over one year, between March 2020 and March 2021, at the National Influenza Center using three sampling strategies. Multiple lineage introductions were noted during the initial phase of the pandemic, including B.4, B.1.1, B.1.428.2, B.1.540 and B.1.1.189. Subsequently, lineages B.1.160 (24.2%) and B.1.177 (22.4%) were dominant throughout the year. The Alpha variant (B.1.1.7 lineage) was identified in February 2021 and was first observed in the center of the country. In addition, a clear diversity of lineages was observed in the North of the country. A total of 335 mutations, including 10 deletions, were found. The SARS-CoV-2 proteins ORF1ab, Spike, ORF3a, and Nucleocapsid were observed as mutation hotspots with a mutation frequency exceeding 20%.
The 2 most frequent mutations, D614G in the S protein and P314L in Nsp12, appeared simultaneously and are often associated with increased viral infectivity. Interestingly, deletions in coding regions causing amino acid deletions and frameshifts were identified in the NSP3, NSP6, S, E, ORF7a, ORF8 and N proteins. These findings contribute to defining the COVID-19 outbreak in Tunisia. Despite the country's limited resources, surveillance of SARS-CoV-2 genomic variation should be continued to control the occurrence of new variants.

Coronavirus disease-2019, caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a growing public health concern. In some people, it produces an asymptomatic or mildly symptomatic disease that does not require particular medical care. However, in specific groups of patients, particularly the elderly and those with chronic diseases, the infection progresses to severe respiratory distress requiring hospitalization in intensive care units (Thielen et al., 2021). The SARS-CoV-2 genome, like that of other RNA viruses, shows a high mutation rate. Initially, the virus emerged from an animal reservoir in the city of Wuhan, China. Then, human-to-human transmission with rapid worldwide spread was established (Chunyang et al., 2020). Over one year of the COVID-19 pandemic, new SARS-CoV-2 genome mutations were constantly emerging and more than 4,000 variants have been reported (Bian et al., 2021). During the initial stage of the pandemic, due to the lack of specific treatments that prevent or block viral replication, massive prevention strategies were applied by most countries. A wide difference in case fatality rates was observed, probably due to differences in demographic composition and in the type of measures taken by different countries to limit viral spread (Rader et al., 2021).
Subsequently, following widespread vaccination, SARS-CoV-2 infections and deaths declined and social and economic conditions improved relatively. Globally, 196,553,009 confirmed cases of COVID-19 with 4,200,412 deaths had been reported as of 30 July 2021 (WHO, 2021a). In Tunisia, the first case of COVID-19 was identified on 3 March 2020. Preventive strategies were quickly put in place, in particular lockdown and enhanced contact tracing around all positive cases (Chakroun et al., 2020). As of 25 May 2020, the cumulative number of confirmed cases of COVID-19 was 1051, corresponding to a cumulative incidence of 8.87/100,000 inhabitants and an average daily incidence of around 13 cases (Abid et al., 2020). In view of the critical socio-economic situation, the Tunisian authorities allowed restrictions to be eased. Accordingly, the virus continued to spread at alarming rates, with 595,532 positive cases and 20,067 deaths recorded as of 30 July 2021 (WHO, 2021a). The low vaccine administration rate (11%), coupled with the emergence and wide circulation of the different Variants of Concern (VOC), certainly played an important role in the evolution of the epidemiological situation in Tunisia (WHO, 2021b). The aim of this study was to identify the different SARS-CoV-2 introduction events in Tunisia, and to explore the mutation profile in order to follow genome dynamics, through a collection of SARS-CoV-2 strains (n=58) from the National Influenza Center over one year of the COVID-19 pandemic (March 2020 to March 2021).

The PRFCOVID-GP3 project titled "SARS-CoV-2 genome Sequencing and study of host-pathogen molecular interactions in Tunisia: epidemiological, clinical and therapeutic impact" was approved by the medical ethics committee of Razi Hospital of Tunis. All procedures involving human participants were in accordance with the ethical standards of the Medical Ethics Committee of Razi Hospital and with the 1964 Helsinki declaration.
In total, 90 SARS-CoV-2 strains resulting from three sampling strategies during one year of the COVID-19 pandemic were considered for this study. Initially, a stratified random sampling (n=32) was performed from February 5 to July 17, 2020 (first phase of the epidemic, well-controlled) according to the following criteria: super-spreading events (n=1), extreme evolution (n=3), death (n=3), local infection (n=12) and imported infection (n=13) [France (n=4), Italy (n=1), England (n=1), Egypt (n=1), Switzerland (n=1), Turkey (n=4) and Spain (n=1)]. Then, from July 18 to December 23, 2020, a simple random sampling was applied to the list of samples with positive RT-PCR (n=32). Finally, from December 24, 2020 to March 20, 2021, the sequencing indications followed the national SARS-CoV-2 sequencing strategy, which aimed to identify and monitor the emergence of VOCs in Tunisia (n=26). Thirty-two samples were excluded due to a high Ct value (n=10), amplification failure during sequence processing (n=17) or poor genomic coverage (<60%, n=5). Thus, 58 samples were included in this study (Supplementary Data Table S1). None of the included samples were collected in May and June 2020, due to the absence of COVID-19 cases in Tunisia.

All nasopharyngeal samples included in this study were collected at the National Influenza Center (NIC), also designated as the National Reference Lab for SARS-CoV-2 and other Respiratory Viruses and hosted at the Microbiology lab of Charles Nicolle Hospital of Tunis. RNA extraction was performed using the Chemagic™ automated instrument and the viral RNA 300 Kit H96 (Perkin Elmer, Hamburg, Germany) according to the manufacturer's instructions. Manual extraction using the Qiagen Viral RNA Mini Kit (QIAGEN, California, USA) was also used, depending on the availability of reagents. SARS-CoV-2 was detected by the Hong Kong RT-PCR assay using AgPath-ID™ One-Step RT-PCR Reagents and an ABI 7500 instrument (WHO, Laboratory and diagnosis, 2020).
This is a qualitative real-time RT-PCR TaqMan method. According to this assay, a positive COVID-19 result was determined when both targets, N and ORF1b-nsp14, reached a defined threshold below 0.2 and a Ct value below 40. All samples included in this study had a Ct value < 30.

Whole genome sequencing (WGS) of the SARS-CoV-2 strains was performed. FASTQ sequencing data files were input to the DRAGEN RNA Pathogen Detection pipeline® and the IDbyDNA Explify Respiratory Virus Oligos Panel Platform® for analysis and viral detection. These platforms were accessed in BaseSpace Sequence Hub.

Clades were assigned to the SARS-CoV-2 genome sequences (n=58). There were two clades identified in 2019 (19A and 19B) and 9 more in 2020 (20A to 20I). Subclades within a major clade were designated by specific nucleotide mutations. GISAID (https://www.gisaid.org/) uses specific combinations of genetic markers. Currently eight clades are defined, from the early split of S and L, to the further evolution of L into V and G, later of G into GH, GR and GV, and more recently of GR into GRY. The lineages assigned by the PANGOLIN nomenclature system (updated on August 29, 2021) were used to discuss viral diversity throughout this manuscript.

A phylogenetic tree was built from the 58 full-length Tunisian sequences and the reference NC_045512 sequence using the approximate maximum-likelihood (ML) method of MEGA X software (Sneath et al., 1973) based on 1000 bootstrap replicates. The 58 SARS-CoV-2 genome sequences were aligned using the Clustal W program (Larkin et al., 2007) implemented in MEGA X software (Kumar et al., 2018). Multiple alignments were manually edited by trimming the 5' and 3' untranslated regions and removing gaps and low-quality sequences, and were then visualized using MEGA X. In addition, Open Reading Frames (ORFs) were predicted and annotated following the annotation of the SARS-CoV-2 reference genome generated from the Wuhan-Hu-1 sequence (accession number: NC_045512).
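As an illustration of how an aligned genome can be compared with the annotated reference, the following minimal Python sketch enumerates substitutions and deletions between two rows of a pairwise alignment. It is not the software workflow used in this study; the function name `list_variants`, the convention of skipping ambiguous `N` bases and the toy sequences are assumptions made for illustration only.

```python
from typing import List, Tuple


def list_variants(ref_aln: str, qry_aln: str) -> List[Tuple[int, str, str]]:
    """Return (reference_position, ref_base, query_base) for each difference.

    Both strings are rows of the same alignment, so they must have equal
    length; '-' marks a gap. Positions are 1-based on the ungapped reference.
    An insertion in the query is reported at the preceding reference position.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    ref_pos = 0
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r != "-":
            ref_pos += 1  # advance only on real reference bases
        if r == q or q == "N":
            continue  # identical column, or ambiguous base (assumed skipped)
        variants.append((ref_pos, r, q))
    return variants


# Toy example (hypothetical sequences, not study data):
# one substitution at position 3 and one deleted base at position 5.
print(list_variants("ATGCGT", "ATAC-T"))  # [(3, 'G', 'A'), (5, 'G', '-')]
```

Working on reference coordinates rather than alignment columns keeps the reported positions comparable across genomes, which is what allows variants to be plotted by their position on NC_045512.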
Each genome was compared to the reference NC_045512, and genomic variants were then identified using Geneious software (Kearse M et al., 2012). Frequencies of identified variants were calculated and plotted according to their position on NC_045512 using GraphPad Prism v8 (GraphPad Software, Inc., San Diego, California, USA). Mutations with frequencies above 20% were considered as hotspots.

Figure S1). The 19A subclade gathered the Wuhan reference sequence with sequences "6736" and "7899" collected in the first pandemic period (March and April 2020). These sequences showed an identity score of 97.97% when compared to NC_045512. These 2 sequences differed from the reference sequence at 8 locations affecting the non-structural proteins NSP2 (n=2), NSP4 (n=2) and NSP6 (n=1), and the structural N-Protein (n=3). Among these, 5 caused changes in the protein sequences of ORF1a (V378I, G3072C, and L3606F) and N (M1I and S188P) ( was also characterized by the emergence of mutation sets affecting ORF1ab (n=12), S-Protein (n=13), ORF3a (n=2), ORF8 (n=5) and N-Protein (n=2). The 6 nucleotide deletions in ORF8 were responsible for 2 amino acid deletions, "D119" and "F120". The second major node, C2, revealed that the Tunisian SARS-CoV-2 sequences were different from the reference sequence and were split into 5 clades, 20B, 20I, 20C, 20A and 20E (Figure 2), all sharing the spike mutation D614G and the NSP12-RdRp mutation P314L (Table 1, Supplementary Data Tables S3-S4). 20A and 20E represented the 2 main subclades (in purple and green, Figure 2), bringing together sequences of SARS-CoV-2 viruses isolated at different times and locations in Tunisia. Two mutations, T223I and H1101Y/L5F in ORF3a and S-Protein, respectively, were shared between these subclades. Moreover, in February 2021, the clade 20I (Alpha, V1) emerged (red cluster). This clade shares mutations in the N-Protein (R203K and G204R) with clade 20B.
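The hotspot criterion stated in the methods (a mutation frequency above 20% across the analysed genomes) can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `mutation_frequencies` and `hotspots`, the `PROTEIN:CHANGE` labels and the five toy genomes are hypothetical and do not reproduce the study data or the plotting done for the paper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def mutation_frequencies(per_genome: Iterable[Set[str]]) -> Dict[str, float]:
    """Fraction of genomes carrying each mutation (each genome counted once)."""
    genomes = list(per_genome)
    if not genomes:
        return {}
    counts = Counter(m for genome in genomes for m in genome)
    return {m: c / len(genomes) for m, c in counts.items()}


def hotspots(freqs: Dict[str, float], threshold: float = 0.20) -> List[str]:
    """Mutations whose frequency is strictly above the threshold, sorted by name."""
    return sorted(m for m, f in freqs.items() if f > threshold)


# Toy input (hypothetical): one set of mutation labels per genome.
toy_genomes = [
    {"S:D614G", "NSP12:P314L"},
    {"S:D614G", "NSP12:P314L"},
    {"S:D614G"},
    {"ORF3a:Q57H"},
    set(),
]
freqs = mutation_frequencies(toy_genomes)
print(hotspots(freqs))  # ['NSP12:P314L', 'S:D614G']
```

In this toy set `ORF3a:Q57H` occurs in exactly 1 of 5 genomes (20%) and is therefore not reported, because the criterion is read here as strictly above 20%; whether the boundary value should count is an assumption of the sketch.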
A unique mutation in the clades that constitute their genetic signature was also observed ( The multiple sequence alignment of the fifty-eight Tunisian sequences according to the (Table 1, Supplementary Data Table S4). Among all amino acid changes, 62 were found in structural proteins, of which 38 were observed in the spike (S) glycoprotein, 22 in the nucleocapsid (N) and only 2 in the membrane protein (M). No changes were seen in the envelope protein (E). Ninety-six additional amino acid changes were identified in non-structural proteins (NSPs 1-16 in ORF1ab), and 33 in accessory protein genes such as ORF3a (n=13), ORF7a (n=4), ORF8 (n=13) and ORF9b (n=3) (Table 1, Supplementary Data Tables S3-S4). Deletions were observed at 10 sites and were identified in the genomic sequences of NSP3, NSP6, protein S, protein E, ORF7a, ORF8 and protein N. Seven genomic deletions consequently caused the deletion of amino acids and six others caused frameshifts (Table 2). Interestingly, among the 191 non-synonymous mutations, 17 were found to be hotspots with a mutation frequency of more than 20% in SARS-CoV-2 genomes derived from Tunisian patients. Eight of them were found in ORF1ab, four in the spike glycoprotein (S), four in the nucleocapsid and one in the accessory protein ORF3a. In addition, the mutations P314L in the ORF1b non-structural protein RNA-dependent RNA polymerase (RdRp), D614G in Spike and Q57H in ORF3a had frequencies exceeding 30% in Tunisian sequences (Figure 3).

Faced with the unusual SARS-CoV-2 pandemic, our country has increased its genomic capacity to track the rapid evolution of the virus. A national strategy that includes federated multidisciplinary research projects (PRF) was implemented during the early pandemic stage.
The role of the National Observatory of New and Emerging Diseases was to define sampling strategies during the different waves of the pandemic and to capture strains with particular priority for sequencing. The first strain introduced in Tunisia was completely sequenced in the framework of this national project. The implemented national strategy also allowed the detection of different VOCs as soon as they were introduced in the country corresponded to the 19B clade, which appeared in early 2020 and was expected to disappear over time (Carmen Lia Murall et al., 2021), but whose frequency increased worldwide, probably due to convergent mutations affecting the S protein, including D614G (Volz et al., 2021). In our sequences we did not observe any D614G mutation in this clade, but rather a combination of L18F, N501Y, L452R, H655Y, D796Y and G1219V mutations in the S protein. Four of these (L18F, N501Y, L452R and H655Y) were previously described in France (Fourati et al., 2021) and were suggested to improve the interaction of the S protein with the ACE2 viral receptor (Weisblum et al., 2020) and to lead to increased resistance to neutralizing antibodies (Choi et al., 2020; Baum et al., 2020).

As seen, the appearance of some mutations and/or variants can be a major event in the evolution of the epidemic and the spread of genetically polymorphic variants. Continuing to sequence new variants as soon as they appear could therefore be useful in monitoring and managing the epidemic, as well as in treatment and vaccine development.

Genome sequences generated in this study were deposited in GISAID.

The authors declare that there is no competing interest.
Prediction of SARS-CoV-2 epitopes across 9360 HLA class I alleles
Structural basis for SARS-CoV-2 envelope protein recognition of human cell junction protein PALS1
First case of imported and confirmed COVID-19 in Tunisia
Persistence and Evolution of SARS-CoV-2 in an Immunocompromised Host
Estimated transmissibility and impact of SARS-CoV-2 lineage B.1.1.7 in England
Variant Derived from Clade 19B, France. Emerging infectious diseases
Global phylodynamic analysis of avian paramyxovirus-1 provides evidence of inter-host transmission and intercontinental spatial diffusion
Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv: the preprint server for health sciences
Emerging infectious diseases
Mutation Patterns of Human SARS-CoV-2 and Bat RaTG13 Coronavirus Genomes Are Strongly Biased Towards C>U Transitions, Indicating Rapid Evolution in Their Hosts
Recurrent emergence of SARS-CoV-2 spike deletion H69/V70 and its role in the Alpha variant B.1.1.7
A small number of early introductions seeded widespread transmission of SARS-CoV-2 in Québec
National Observatory of New and Emerging Diseases, Ministry of Health, Tunisia. COVID-19 en Tunisie. Point de situation à la date du 28 mars 2020
Dutch-Covid-19 response team (2020). Rapid SARS-CoV-2 whole-genome sequencing and analysis for informed public health decision-making in the Netherlands
Escape from neutralizing antibodies by SARS-CoV-2 spike protein variants. eLife, 9
Effects of SARS-CoV-2 mutations on protein structures and intraviral protein-protein interactions
Mining of epitopes on spike protein of SARS-CoV-2 from COVID-19 patients
Mask-wearing and control of SARS-CoV-2 transmission in the USA: a cross-sectional study. The Lancet. Digital health, 3(3), e148-e157. https://doi.org/10.1016/S2589-7500(20)
Rambaut, A., Holmes, E. C., O'Toole, Á., Hill, V., McCrone, J. T., Ruis, C., du Plessis, L., & Pybus, O. G. (2020).
A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology. Nature microbiology, 5
M: membrane glycoprotein; N: nucleocapsid phosphoprotein; ORF: open reading frame; S: spike glycoprotein