key: cord-0324379-xybi4qgc authors: Rono, Evans K. title: Covid-19 genomic analysis reveals clusters of emerging sublineages within the delta variant date: 2021-10-12 journal: bioRxiv DOI: 10.1101/2021.10.08.463334 sha: 18899380f4c76790780877dc021373fea709fe6e doc_id: 324379 cord_uid: xybi4qgc The emerging SARS-CoV-2 variants may potentially have enhanced transmissibility and virulence of the virus, and impacts on performance of diagnostic tools and efficacy of vaccines. Genomic surveillance provides an opportunity to detect and characterize new mutations early enough for effective deployment of control strategies. Here, genomic data from Germany and United Kingdom were examined for genetic diversity by assessing gene mutations and inferring phylogeny. Delta variant sublineages were grouped into seven distinct clusters of spike mutations located in N-terminal domain of S1 region (T95I, D138H, *D142G, Y145H and A222V) and S2 region (T719I and *N950D). The most predominant cluster was T95I mutation, with the highest frequencies (71.1% - 83.9%) in Wales, England and Scotland, and the least frequencies (8.9% - 12.1%) in Germany. Two mutations, *D142G and *N950D here described as *reverse mutations and T719I mutation, were largely unique to Germany. In a month, frequencies of D142G had increased from 55.6% to 67.8 % in Germany. Additionally, a cluster of D142G+T719I/T mutation went up from 27.7% to 34.1%, while a T95I+ D142G+N950D/N cluster rose from 19.2% to 26.2%. Although, two distinct clusters of T95I+D138H (2.6% - 3.8%) and T95I+Y145H+A222V (2.5% - 8.5%) mutations were present in all the countries, they were most predominant in Wales and Scotland respectively. Results suggest divergent evolutionary trajectories between the clusters of D142G mutation and those of T95I mutation. These findings provide insights into underlying dynamics of evolution of the delta variant. Future studies may evaluate the epidemiological and biological implications of these sublineages. and T719I mutation, were largely unique to Germany. In a month, frequencies of D142G had 23 increased from 55.6% to 67.8 % in Germany. Additionally, a cluster of D142G+T719I/T mutation 24 went up from 27.7% to 34.1%, while a T95I+ D142G+N950D/N cluster rose from 19.2% to 25 26.2%. Although, two distinct clusters of T95I+D138H (2.6% -3.8%) and T95I+Y145H+A222V 26 (2.5% -8.5%) mutations were present in all the countries, they were most predominant in Wales 27 and Scotland respectively. Results suggest divergent evolutionary trajectories between the clusters 28 of D142G mutation and those of T95I mutation. These findings provide insights into underlying 29 dynamics of evolution of the delta variant. Future studies may evaluate the epidemiological and 30 biological implications of these sublineages. 31 Germany had also increased from 27.7% and 34.1%. Single delta_D142G mutation, in overall, 159 increased in frequency from 55.7% (Fig. 2a) to 67.8% (Fig. S3a) . In both submissions, England 160 had the highest number of sequences, while sample size from Northern Ireland in the second 161 submission suffered significantly from the lowest (N = 76) representation of sequences (Fig. S3b) , 162 which was a drastic drop from 1085 sequences in the first submission (Fig. 2a) . 163 These results, considering good sample sizes of genome sequences analyzed in this study, and the 164 observed wide spread of these mutations, may suggest that natural selection and not chance events 165 drives the emergence of these mutations (Lauring & Hodcroft, 2021) . Mutations: T95I, D138H, 166 D142G, Y145H and A222V are clustered in the N-terminal domain (NTD) in S1 region (Fig. 167 S3b). The T719I position is located in the S2 region just before the fusion peptide, while N950D is 168 transmission (Hou et al., 2020) . In this context, the T95I mutation being the most predominant 196 mutation with a wide geographical spread, may suggest that it may confer more transmissive 197 ability or fitness and/or adaptiveness to the virus (Liu et al., 2020). Evidently, the reducing 198 frequencies of the parental delta variant observed in this survey, may be a pointer that the parental 199 delta may soon be phased out by its emerging descendants, especially by T95I and/or *D142G 200 mutations. These T95I and D142G mutations appear to evolve independently as seen by their 201 clustering with D138H, Y145H+A222V, D142G+N950D/N and D142G+T719I/T. Notably, 202 mutations in N gene and non-structural genes; orf1ab, orf3a, orf7a and orf8 genes (Table S1) , 203 revealed evident signatures of polymorphic differences, which may have some consequences in 204 viral packaging and replication. Whether these substitutions are associated with roles of 205 L452R/T478K and/or D614G/P681R mutations remains unknown. In addition, outstanding 206 questions on adaptive benefits of the new mutations, and the implications they have on 207 transmissibility, antigenicity, or virulence of the virus remain to be understood. 208 In summary, the unique splitting of delta variant into distinct clusters of emerging delta sub-209 lineages may be hypothesized to mean that the parental delta variant, may be evolving into new 210 genetic variants. A speculation, that future research needs to test on the basis of their phenotype 211 differences in transmissibility and/or epidemiology in real SARS-CoV-2 public health infections. (Table S1 and S2) 230 are included at end of the manuscript. 231 Under terms and conditions of use, genome sequences used in this study cannot be circulated here 233 or elsewhere. Supplementary data files: Data 1 ( Fig. 1 and Fig. S1 ); Data 2 to data 12 (Table S1); 234 and Data 13 (Table S2 , Fig. 2, Fig. 3, Fig. S2 and Fig. S3 ) are provided. Any other additional data 235 and methods are available from the author upon request. 236 The author did not receive any funding towards this work. 238 The author has no conflict of interest to disclose. 240 The author conceived the study, designed and validated the methods, downloaded and analyzed 242 the data, prepared and submitted the manuscript. 243 Many thanks to all the researchers and the five Nations: England, Germany, Northern Ireland, 245 Scotland and Wales, for investing in SARS-CoV-2 genome sequencing and openly sharing their 246 genomic data via the GISAID platform. Great appreciation to the NCBI and the GISAID for the 247 access of the SARS-CoV-2 genomic data. 248 9 250 The total number of sequences were n = 93992, sieved from N = 169315 by removing non-DNA characters from the spike sequences. a). Variant-calling using the marker mutations specific to each variant of concern (VOC) in table 1. Wuhan reference sequence was included as a wild type sequence. The most dominant sequence was the delta variant. By using all the delta markers in Table1, sequences grouped under 'other', did not fall into any of the groups of the variants. b). Visualization of amino acid positions of the delta variant from sequences called using the deletions at 156 and 157 fixed markers for the delta variant. The variability indicates the number of different amino acid molecules competing for each position. Positions are numbered relative to the Wuhan reference sequence. Each plotted data point represents the total number of sequences sharing the most dominant amino acid in each position. The labeling threshold was placed at <99% of the total number of sequences. Positions 95, 138, 142, 145, and 222 were revealed to be accumulating mutations. Next, a method for variant calling was also simplified. To do this, the workflow for variant calling 58 was done by numbering positions of the spike mutation markers that define individual variants 59 relative to self, instead of the reference spike sequence (Table 1) (Arendt, 2020) , and v1.4.0 writexl R packages. Plots were visualized using v3.3.5 ggplot2 R 87 Sa <-AAStringSet(substr (Sa, start=1, stop=1195) ). dfSa <-data.frame(stri_extract_all_regex(dfSa$Sa, '.{1,1}')) dfSc <-data.frame(stri_extract_all_regex(dfSc$Sc, '.{1,3}')) dfSa$B.1.617.2_Delta <-ifelse(dfSa$X19=="R" &dfSa$X142=="D" & dfSa$X156=="G&dfSa$X157=="V"&dfSa$X415=="K"&dfSa$X450=="R"& dfSa$X476=="K" & dfSa$X482=="E"& dfSa$X612=="G"& dfSa$X679=="R"& dfSa$X948=="N" ,"yes", "no") package (H, 2016) and v0.4.13 circlize R package (Gu et al., 2014) . Plots were further refined in 88 v1.1.1 Inkscape (Project, 2021) . Table S1 ). The sequences were 95 transformed to data frame and split to positions 1 to 1271 of individual amino acids. Note that 96 these sequences were not aligned relative to the Wuhan reference genome, and as such they were 97 two-amino acid shorter in lengths than the reference genome. To observed genetic diversity in the remaining genes: Orf1ab, orf3a, E, M, orf6, orf7a, orf7b, orf8, 150 N and orf10, similar analyses that were done on the spike gene were extended to all these genes. 151 Note that for the orf1ab gene, the reading was corrected so that part orf1a of the gene joins with 152 the second part orf1b to give the correct reading frame for the entire orf1ab gene. Therefore, 153 reading frame of orf1ab gene in all the orf1ab sequences was corrected by running the following 154 code: 155 156 6. Grouping of single mutations 157 Sequences were grouped into groups with different single non-synonymous mutations using the 158 genetic markers for the delta variant in Table 1 as well as new emerging single amino acid 159 mutations. By looking at the patterns of these new single mutations at the spike protein sequences, 160 and to some extend the rest of the genes, further groups consisting of clusters of one or more 161 combinations of mutations were created, and confirmed by phylogenetic analysis. 162 To select representative genomes from 93647 sequences (GISAID submissions from 2021.07.23 164 to 2021.08.30) for use to infer phylogeny, complete genome sequences were trimmed. The 165 sequences were trimmed to exclude the sequence portions before the first gene (orf1ab) and the 166 sequence portions after the last gene (orf10) using the previously described pattern matching 167 method above. Sequences were cleaned to ensure quality of sequence coverage on the entire 168 genome. Frequency of identical genomes (haplotypes) were counted from the trimmed sequences, 169 and those sequences which appeared once were discarded to remain with 27993 genomes. 170 Frequencies of haplotypes in each mutation category were counted and arranged from the most 171 BrowseSeq(all_Sa, highlight = 1) all_orf1ab <-DNAStringSet(str_replace (all_orf1ab, "AAACGGGTT", "AAACCGGGTT")) dominant to the least dominant haplotypes. From these grouped haplotypes, representative 172 sequence for each of the first 10 haplotypes per mutation category (except for delta_ N950D, 173 which had only 4 representatives), were used to infer phylogeny (Table S2) . 174 Phylogeny was inferred using IQTREE maximum likelihood (Minh et Sample sizes are as listed in Fig. 3b. b T19R T19R T19R T19R T19R T19R G142D G142D D142G G142D G142D G142D E156-E156-E156-E156-E156-E156-F157-F157-F157-F157-F157-F157-R158G R158G R158G R158G R158G R158G L452R L452R L452R L452R L452R L452R T478K T478K T478K T478K T478K T478K P4715L P4715L P4715L P4715L P4715L P4715L P5401L P5401L P5401L P5401L P5401L P5401L G5063S G5063S G5063S G5063S G5063S G5063S A1306S A1306S A1306S A1306A/S A1306S A1306S P2046L P2046L P2046L P1640L P2046L P2046L P2287S P2287S P2287S P2046P/L P2287S P2287S V2930L V2930L V2930L P2287P/S A2529V A2529V T3255I T3255I T3255I A2529A/V V2930L V2930L T3646A T3646A T3646A V2930V/L T3255I T3255I A6319V A6319V A6319V A3209V T3646A T3646A E87E/D A2529V E87E/D V3718A A6319V A6319V K261K/N V665V/I K261K/N T3750I M5900M/I I695I/ --L16L/P T38T/I -T38T/I -Bright green colour highlights fixed mutations. Turquoise colour highlights new substitutions undergoing fixation. Grey colour shows emerging reverse-mutations. Yellow colour represents mutations that have significantly increased in frequencies. The rest substitutions (not highlighted) are candidates for future genomic surveillance. Cut off for highlighting was placed at >1% prevalence, that is, more than 1 sequence in every 100 sequences sharing same mutation per site. '/' for example, in T22T/I, means that the site has two amino acids, but in this case, there are more 'T's than 'I's. *These three positions can also be captured in these two ways: Either as deletions at F157-and R158-, and substitution at E156G or deletions at E156-and R158-, and substitution at F157G. In all three cases, the final markers that define an unaligned delta variant sequence at these positions are 156G and 157V, and therefore they have no effect on variant calling. Read, Write, Format Excel Using DECIPHER v2.0 to Analyze Big Biological Sequence Data in R stringi: Fast and portable character string processing in R Investigation of SARS-CoV-2 variants of concern: technical briefings circlize implements and enhances 213 circular visualization in R Biostrings: Efficient manipulation of biological 216 strings ggplot2: Elegant Graphics for Data Analysis dplyr: A Grammar of Data 220 Structure of the SARS-CoV-2 spike receptor-binding domain bound to the 223 ACE2 receptor IQ-TREE 2: New Models and Efficient Methods for Phylogenetic 226 Inference in the Genomic Era ShortRead: a bioconductor package for input, quality assessment and exploration of high-230 throughput sequence data MF 2: Genome Sequencing and Genomic Epidemiology SARS-CoV-2 spike 239 P681R mutation enhances and accelerates viral fusion R: A language and environment for statistical computing. R Foundation for 244 Reshaping Data with the reshape Package tidyr: Tidy Messy Data A new coronavirus associated with human respiratory 255 disease in China Using ggtree to Visualize Data on Tree-Like Structures