key: cord-0271311-kwf3yt92
authors: Yang, X.-J.
title: SARS-COV-2 Î´ variant drives the pandemic in India and Europe via two subvariants
date: 2021-10-20
journal: nan
DOI: 10.1101/2021.10.16.21265096
sha: 9dc9610b51702c8a925dd046a5dbf28fea9a4501
doc_id: 271311
cord_uid: kwf3yt92

SARS-COV-2 evolution generates different variants and drives the pandemic. As the current main driver, delta variant bears little resemblance to the other three variants of concern, raising the question what features future variants of concern may possess. To address this important question, I compared different variant genomes and specifically analyzed delta genomes in the GISAID database for potential clues. The analysis revealed that delta genomes identified in India by April 2021 form four different groups (referred to as delta1, delta2, delta3 and delta4) with signature spike, nucleocapsid and NSP3 substitutions defining each group. Since May 2021, delta1 has gradually overtaken all other subvariants and become the dominant pandemic driver, whereas delta2 has played a less prominent role and the remaining two (delta3 and delta4) are insignificant. This group composition and variant transition are also apparent across Europe. In the United Kingdom, delta1 has quickly become predominant and is the sole pandemic driver underlying the current wave of COVID-19 cases. Alarmingly, delta1 subvariant has evolved further in the country and yielded a sublineage encoding spike V36F, A222V and V1264L. These substitutions may make the sublineage more virulent than delta1 itself. In the rest of Europe, delta1 is also the main pandemic driver, but delta2 still plays a role. In many European countries, there is a delta1 sublineage encoding spike T29A, T250I and Q613H. This sublineage originated from Morocco and has been a key pandemic driver there. Therefore, delta variant drives the pandemic in India and across Europe mainly through delta1 and delta2, with the former acquiring additional substitutions and yielding sublineages with the potential to drive the pandemic further. These results suggest a continuously branching model by which delta variant evolves and generates more virulent subvariants.

Since the initial cases were identified in December 2019, coronavirus disease 2019 has caused the global pandemic in a way totally unexpected to almost everyone in the world.

In a short 22 months, this pandemic has led to almost 238 million confirmed cases around the world, accounting to almost 3.0% of the entire population (according to the Our World in Data website, https://ourworldindata.org/; accessed on October 10, 2021). As there are many exposed but unreported cases, the real case count is much larger than this. As for mortality, almost 4.9 million people have died of this disease, accounting for over 1 out of 2,000 people.

Moreover, many of the deceased were adult caregivers, thereby leaving behind thousands of orphans around the world [1, 2] . For example, 1.5 million children across 21 countries lost a care-giving parent or grandparent due to COVID-19 from March 2020 to April 2021 [1] . Thus, this pandemic has caused lots of suffering and tragic loss of life. Moreover, it has crippled the economy and hindered its recovery around the world. To make the matter even much worse, this pandemic is still actively ongoing, with new waves of cases in many countries, including Israel, the United Kingdom, the United States of America, Canada, Singapore, Malaysia and Indonesia (Our World in Data). Notably, many of them, including Israel, the United Kingdom, U.S.A., Canada and Singapore are among the most vaccinated countries in the world. Thus, this pandemic remains a major and urgent public health crisis unprecedented in the recent human history.

The disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), with its first isolates sequenced and reported at the beginning of January 2020 [3] . The genomes sequences indicate that the virus is homologous to SARS-COV-1, an enveloped, positive-sense and single-stranded RNA virus that caused the severe acute respiratory syndrome (SARS) epidemic mainly in Asia from 2003 to 2004 was declared a pandemic by the WHO at the beginning of March 2020 [3] . Since December 2019, there are multiple waves, with the newer ones much worse than the first one. From genomic surveillance, it has become gradually clear that virus evolution plays a key role in driving waves of cases in different countries and continents around the world. Thus, it is an important task to map out how the virus mutates and evolves. This also sheds light on the evolutionary trajectories of the virus, which will be helpful for preventing pandemics in the future.

For genomic surveillance, data sharing and annotation are critical. In this regard, the global initiative on sharing avian influenza data (GISAID) has played a critical role since the beginning of the pandemic and as of October 13, 2021, >4.25 million SARS-COV-2 genomes have been deposited into the database [4] . From genomic surveillance, it is clear that the virus has evolved dynamically and yielded many variants, with some being the main drivers of the pandemic. Among them are four dominant variants, a (B.1.1.7) [5] , b (B.1.351) [6] , g (P.1) [7] and d (B. 1.617.2) [8], initially identified in the United Kingdom, South Africa, Brazil and India/Japan, respectively. The WHO has designated these strains as variants of concerns. One astonishing common theme is that they emerged from complete obscurity and then rapidly rose to become the pandemic drivers, raising the question about candidates as the potential future variants of concern. This question would greatly help us get well ahead of SARS-COV-2, (instead of 'tail-gating' it) and take 'pre-emptive' strikes against such variants. To address this important question, I have closely tracked how different variants evolve. Here, I describe that d variant drives the pandemic in India and Europe mainly through d1 and d2, with the former acquiring additional virulent substitutions and yielding multiple sublineages with the potential to drive the pandemic further. This study complements two companion reports on how d1 and d2 subvariants drive the pandemic in the USA [9], Indonesia, Singapore and Malaysia [10] .

SARS-COV-2 evolution drives the pandemic around the world As shown in Fig. 1A , there have been six waves of COVID-19 cases in the world since December 2019. The initial identification of a [5] , b [6] and g (P. 1) [7] variants at the end of 2020 and the beginning of 2021 signaled that SARS-COV-2 evolution is the major driver of the pandemic. Accordingly, the pandemic can be divided into pre-VOC (variant of concern) and VOC phases. In the first few months of 2021, these three VOCs were the main pandemic driver but just as they caught the world by surprise, d variant suddenly emerged from complete obscurity [8] , rapidly overtook the three other VOCs and multiple variants of interests (such as µ, k and Ida, Fig. 1A ) [9] , and gradually became the major pandemic driver. This raises an important question about what features such a pandemic driver possesses. Answers to this question will help identify potential new VOCs. The initial and obvious guess was that d variant possess key spike L452R, T478K and P681R substitutions (Table 1) . These substitutions are individually present in many other variants, so one possibility is that their combination is deadly. Even if so, one question is whether this combination is sufficient. Related to this, despite its possession of L452R and P682R, k variant has not spread from India and colonized other countries as d variant has done. One main difference is that T478K of d variant is replaced with E484K in k variant. Structural and molecular analysis [11] [12] [13] [14] [15] indicated that this difference is insufficient to explain the huge difference between these two variants in terms of their ability to become pandemic drivers. This raises the question what else makes d variant become a dominant pandemic driver so rapidly. To answer this question, it is necessary to understand how this variant has evolved. As it originated in India, I first analyzed its evolution there. d variant rapidly emerged as the major pandemic driver in India . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 As shown in Fig. 1B , there have been two waves of COVID-19 cases in India, with the first one starting in the summer of 2020 and ending a few months later. To understand the genomic basis of the epidemiological wave, I analyzed different variant genomes identified in India and deposited into the GISAID database [4] . As shown in Fig. 1C , a and k variants were the major drivers in March 2021, but d variant rapidly overtook all variants and became the major driver only one month later. Since May 2021, it has been present over 80% COVID-19 cases (Fig.   1C ). Its rapid rise raises important questions about its evolution and on its molecular and genomic characteristics.

Related to its evolution, one would wonder whether it came from the first wave of cases in 2020. To understand the genomic basis of that wave, I performed mutation profiling of 820 SARS-COV-2 genomes in COVID-19 cases identified in India during September 2020. For this, I analyzed the genomes by utilizing Coronapp, an efficient web-based mutation annotation application [16, 17] . The analysis indicated that on the average, there are 15-16 mutations per genome (Fig. 1D ). In terms of spike substitutions, only D614G is present in all genomes, while P681H, Q677H and N440K are only encoded in 6.6%, 5.9% and 2% genomes, respectively Table 1 ). d variant displays signature spike substitutions such as L452R, T478K

and P681R, as well as three signature nucleocapsid substitutions, including R203M (Table 1) .

These spike and nucleocapsid signature substitutions are absent in a [5] , b [6] and g [7] variants.

Strikingly, among the 820 genomes, there is only one genome possessing these four signature substitutions (Table S1 ). If the sample collection date for this genome is correct, it may be a d variant precursor. Together, these observations suggest that d variant did not evolve from the majority of cases identified during the first wave, but rather originated from some rare evolutionary events. Conceptually, this is reminiscent of the way by which a variant emerged in the United Kingdom in September 2020 [5] . Emergence of a and d variants as two major pandemic drivers from complete obscurity also raises an important issue about how to detect them in the early stages of their evolution. Due to the low abundance at such stages (often lower than 1%), this is challenging for genomic surveillance unless all COVID-19 cases are sequenced. d variant has evolved and yielded different subvariants in India Then I asked how d variant had evolved by March 2020 when it was still obscure. To address this question, I utilized Coronapp [16, 17] to analyze the mutation profile of 985 d-genomes identified in India by March 2021. This analysis revealed different subgroups of d-genomes (Fig. 2 ). In addition to what is presented in Fig. 2 , Coronapp generated a list of mutations that each genome carries. I manually inspected this list (Table S2) to identify the subgroups. Initially, spike substitutions such as E156del, D950N, G142D and A222V ( Fig. 2A & Table 1 ) were used to identify such subgroups, but the outcome was not clear cut, perhaps due to loss of G142D, . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 E156del and/or D950N in some genomes during evolution. Then I noticed that nucleocapsid substitutions G215C and R385K (Fig. 2B) were mutually exclusive. Moreover, they were associated with two distinct sets of NSP3 substitutions: 1) G215C with A488S, P1228L and P1469S of NSP3; and 2) R385K with P822L of NSP3 ( Fig. 2B-C) . With this guide, I noticed that spike K77T and A222V are mutually exclusive, but both are associated with P822L. Thus, it appeared that the NSP3 substitutions are good markers to separate two large groups, with one group defined by nucleocapsid G215C along with three substitutions of NSP3 (A488S, P1228L and P1469S). The other group is defined by P822L of NSP3. This group can be further divided into three subgroups defined by spike K77T, A222V and nucleocapsid R385K (Table 1 ). This classification also appears to be true when I inspected mutation reports generated by CoVsurver, enabled by GISAID [4] . Thus, d variant had yielded four distinct groups of subvariants in India by March 2021.

To substantiate this, I carried out phylogenetic analysis of 291 d-genomes identified in India by March 21, 2021. This analysis uncovered 5 major clusters (Fig. 3A) . To assign the clusters, CoVsurver was used to generate mutation reports for some representative genomes of each cluster. Manual inspection of the reports indicated that four clusters correspond to the four groups mentioned above, so these four groups are referred to as d1, d2, d3 and d4, with nucleocapsid G215C, spike A222V, nucleocapsid R385K and spike K77T as signature substitutions, respectively. While d1 is associated with A488S, P1228L and P1469S of NSP3, d2, d3 and d4 encode P822L of NSP3. The fifth cluster from phylogenetic analysis is related to d3 but does not carry nucleocapsid R385K, so this cluster is referred to as pre-d3. Therefore, d variant had generated 4-5 subvariants in India by March 2021.

This raises the question how these subvariants have grown since the beginning of 2021.

As shown in Fig. 3B , the number of genomes encoding d1-d4 subvariants was very low by January 2021. After that, the number has changed dynamically. Among the four subvariants, d4 is minor. In April 2021, d1, d2 and d3 genome numbers were comparable, but d1 became dominant afterwards. In August 2021, it corresponded to >70% genomes sequenced (Fig. 3B ).

Thus, d1 rapidly became a major pandemic driver in India. Among the remaining three subvariants, d2 played a less important role than d1 in August and September 2021 (Fig. 3B ).

For example, in August 2021, the number of d2 genomes was ~20% of that for d1 (Fig. 3B ).

By comparison, d3 and d4 became almost insignificant in August and September 2021 (Fig.  3B ). These observations also suggest that the distinct substitutions in the four subvariants make them players of different importance in driving the pandemic in India. Fig is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 supporting that d1 and d2 subvariants were responsible for ~80% and ~20% COVID-19 cases, respectively ( Fig. 3D -E). Because d1 subvariant possesses two extra NSP3 substitutions than d2, the evolutionary speed or mutation rate of d1 subvariant is slightly faster than d2. This raises the question what makes d1 a much more prominent pandemic driver than d2. This also highlights the importance of the d1 substitutions not shared by d2 (Table 1) . Compared to d1, d2 possesses an extra spike substitution, A222V ( Fig. S3 & Table 1 ). Thus, it is not spike substitutions but rather extra nucleocapsid and NSP3 substitutions that make d1 much more powerful in driving the pandemic than d2 subvariant. This analysis revealed that similar to what was observed with d-genomes from India, d genomes identified around the world by March 09, 202 also formed 5 different clusters. They can be assigned using the same designation system: d1, d2, pre-d3, d3 and d4 ( Fig. 4A & Table  1 ). However, different from what was observed in India ( Fig. 3A) , the d1 cluster is the largest, with many of its genomes identified in Europe ( Fig. 5B , d1 variant gradually became the major driver, but the other three d subvariants played much less important roles. In March and April 2021, d2 subvariant was still present in 21.5% and 11.1% in d1 genomes (Fig. 5B ), but the percentage rapidly declined to 1.0% in September 2021. Notably, this decline is much more dramatic than what was observed in India (Fig. 3B) , raising the intriguing possibility that vaccination might have contributed to this rapid decline of d2 cases in the U.K. If so, this subvariant may be more sensitive to vaccination than d1 subvariant.

Thus, it is possible that vaccination has helped make d1 subvariant almost the sole pandemic driver in the country.

An important question is how d variant has evolved in the U.K. To map the evolutionary trajectory, I utilized Coronapp [16, 17] to analyze the mutation profile of 2,267 dgenomes . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 identified in the country during September 2021. As expected, the predominant majority encodes d1 subvariant ( Fig. 5C-D) . The average mutation load is 42-43 per genome (Fig. 5E ). This is 1-2 mutations more than d1 subvariant genomes identified in India around the same time (Fig. 3D ). Related to this difference, about 60% of the d1 genomes from the U.K. encode A1711V of NSP3 as an extra substitution (Fig. 5D ). Among this subgroup, there is a sublineage encoding two extra spike substitutions, Y145H and A222V (Fig. 5C ). This is alarming as A222V is a signature substitution of d2 subvariant (Table 1) . Moreover, A222V is a signature substitution of the SARS-COV-2 GV clade, a major pandemic driver in Europe during the summer of 2020. Thus, A222V may make a variant more virulent. For convenience, this A222V-encoding d1 sublineage is referred to as d1V, where the letter V denotes V222. This sublineage carries at least three extra substitutions: spike Y145H, A222V and NSP3 A1711V (Table 1 ). This sublineage was first identified in May 2021 and since then, its genome number has increased rapidly, reaching almost 6% in the U.K. in September 2021 (Fig. 5F ). The dramatic increase also supports that A222V makes d1V more virulent than d1 variant itself.

While analyzing V1264L-encoding genomes in another study [10] , I found this substitution in a subset of d1V genomes. As shown in Fig. S4A , V1264L is present in 20% d1V genomes. Moreover, a subset of these genomes encodes spike S13T (Fig. S4A ). S13 is in the signal peptide and its replacement by threonine may affect spike protein processing. While the average mutation load is 44 mutations per d1V genome, this increases to 48 in the V1264Lencoding d1V genomes [10]. V1264L is present in a d1 sublineage that drives the pandemic in Indonesia, Singapore and Malaysia, and confers an acidic di-leucine motif in the cytoplasmic tail of spike protein for its cytoplasmic traffic and processing [10] . Thus, this V1264L-encoding d1V sublineage (referred to as d1V1) may be more virulent than d1V itself.

Alarmingly, spike V36F is present in 10% d1V genomes (Fig. S4A ). This substitution is mutually exclusive with V1264L, suggestive of a second d1V sublineage. This second sublineage encodes V36F as an extra substitution and is thus referred to as d1V2. The first d1V2 genome was identified in the U.K. on July 23, 2021 (Table 1 ). There are now 991 such genomes in the GISAID database. Coronapp analysis of these genomes revealed that almost all of them encodes an extra nucleocapid substitution, S202N (Fig. S4B ). This is adjacent to R203M, a signature substitution shared by k and d variants. In agreement with the extra V36F and S202N substitution, the average mutation load is 44 mutations per d1V2 genome (Fig.  S4C ). V36 is adjacent to A222 (Fig. S4D ). Because both are part of an extensive hydrophobic interaction network (Fig. S4D ), V36F may synergize with A222V to improve this network, suggesting that d1V2 may be even more virulent than d1V itself. Therefore, d1 subvariant has not only driven the pandemic in the U.K. since April 2021 but also yielded different sublineages with potential to drive this pandemic even further.

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. Mutation profiling of 934 d1 genomes identified in Spain after September 05, 2021 revealed a sublineage encoding spike T29A, T250A and Q613H (Fig. 6E ). This sublineage is referred to as d1T, where the letter T refers to T29 and T250, two spike residues to be replaced. A subset of the genomes also encode spike T299A and form an offspring strain, referred to as d1T1 (denoting a derivative of d1T). d1T is also present in genomes from Italy, France and Germany even though the genome number is very low in Italy (Fig. S4 ). According to GISAID, the first d1T genome sequence was identified in Japan on June 11, 2021 (Table 1 ), but this case has travel history to Morocoo. The second case was reported from Morocco on June 20, 2021 and additional cases with travel history to the country were identified in Japan, Singapore and South Korea. Among 62 d genomes reported from Morocco (GISAID, accessed on October 03, 2021), 28 (45.2%) correspond to d1T. Thus, d1T sublineage is a key driver behind the latest wave of COVID-19 cases (from July to September 2021; https://ourworldindata.org/) in the country.

What occurred in Morocco suggests that d1T is more virulent than d1 itself. To investigate how d1T has fared in Europe, I analyzed the number of d1T genomes in different countries there. As shown in Fig is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101/2021.10.16.21265096 doi: medRxiv preprint

Having identified different d subvariants and their sublineages (Table 1) , I next considered mechanistic impact of the associated new substitutions. Many of them are located at the Nterminal domain of spike protein (Fig. 8A) [18]. T29, K77, Y145 and T250 are within a super antigenic site in the N-terminal domain (Fig. 8A-B) [19]. Among them, T250 is located within an unstructured loop (Fig. 8B) . Thus, T29A, K77T, Y145H and T250I may confer immune evasion. Both V36 and A222 are part of an extensive hydrophobic interaction network (Fig.   S4D ). In addition to L54, L560 and I285, there are 5 aromatic residues (F562, Y38, F275, F85 and F200). The side chain of V36 is 3.4 Å away from that of A222, supporting interaction between these two hydrophobic residues. V36F and A222V should facilitate the hydrophobic interaction. As shown in Fig. 8D , the methyl group of the T299 side chain is 3.7 Å away from that of T315, indicating hydrophobic interaction between the side chains. The methyl group of T299 is 6.6 Å away from that of T599, so T299I may facilitate hydrophobic interaction of I299 with the methyl groups of T513 and T599.

Q613 is adjacent to D614 (Fig. 8A ). D614G is a key substitution that defines the entire course of the pandemic [20] [21] [22] . This substitution alters conformational state of the spike protein [12], so Q613H may synergize with D614G. V1264 is located at the C-terminal cytoplasmic tail of spike protein (Fig. 8A ) and V1264L confers an acidic di-leucine motif in the cytoplasmic tail of spike protein for its cytoplasmic traffic and processing [10] . Alarmingly, V1264L is a signature substitution of a d1 sublineage that drives the pandemic in Indonesia, Singapore and Malaysia [10].

As shown in Fig. 8D , nucleocapsid protein is composed of an N-terminal tail, an Nterminal RNA-binding domain, a middle SR (serine/arginine-rich) loop, a C-terminal RNAbinding domain and a C-terminal tail. The N-terminal tail is a known immune epitope, so Q9L in a subset of d1 variant may confer immune evasion. D63G is a signature substitution of d1 variant. D63 is located within the N-terminal domain (Fig. 8D ) and directly involved in RNA binding. D63G should affect this binding [23], so it is puzzling why d1 variant has acquired this substitution. It is possible maximal RNA binding is not optimal for viral fitness.

The SR loop is highly phosphorylated [24] and may regulate the nucleocapsid protein level and function. In addition to d variant, this loop is altered in the three other variants of concern. R203K and G204R are present in a and b variants, whereas T205I is associated with g variant [5] [6] [7] . Similarly, R203M, a signature substitution of d variant, alters the SR loop (Fig.   8D ). Moreover, d1V2 possesses S202N (Fig. S4D) , recurrent in other SARS-COV-2 variants. S202 is phosphorylated [24], so S202N blocks phosphorylation. Thus, S202N may synergize with R203M in altering phosphorylation of the SR loop. In addition to this new substitution, . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101/2021.10.16.21265096 doi: medRxiv preprint d1V2 has acquired spike V36F and A222V, both of which are alarming (Fig. S4C) . Thus, it is important to watch out this new d1 sublineage.

Nucleocapsid protein also possesses sequence elements for subcellular trafficking.

Residues 219-235 are leucine-rich and may serve as a nuclear export signal [25] . This segment is part of a long a-helix and G215 is at the N-terminal end of this helix (Fig. 8D ). Because glycine is often a helix breaker, G215C in d1 subvariant (Table 1) should improve the structure of this helix. Moreover, Q229H from d1V1 sublineage (Table 1) is also located in this helix and may thus synergize with G215C in improving the stability of this helix. In addition to its nuclear export signal, nucleocapsid protein possesses a bipartite nuclear localization signal spanning residues 369-389. Related to this, the substitutions D377Y and R385K are within this region (Fig. 8D) . Thus, alteration of nucleocapsid trafficking in a common strategy that d1 adopts to improve its viral fitness.

In addition to spike and nucleocapsid proteins, NSP3 is a frequent target for alteration in different d subvariants ( Dr. Yang Zhang's lab, University of Michigan, USA). P1228 and A1711 are on the potential antigenic surface of the protein, so their alterations may confer immune evasion. Therefore, in addition to modifying spike protein, d variant has taken diverse strategies to finetune functions of nucleocapsid and NSP3 for improving its viral fitness.

Shown in Fig. 8E is a continuously branching model illustrating evolutionary relationship among d subvariants described herein. According to this model, d variant yields different subvariants, which in turn evolve and generate additional lineages continuously. Host environment (including natural immunity, vaccination and experimental drug treatment) then select a specific set of such lineages, which then spread and drive the pandemic. Such a model can also explain evolution of SARS-COV-2 in general. This model also suggests that there are simple rules this virus adopts to improve its fitness during evolution. Identification and elucidation of such rules will help us predict its evolutionary trajectory, which will in turn give us a winning chance to get ahead of the virus (instead of tailgating it) in this evolutionary game.

This may be the most effective way to control and end this pandemic. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. study about this sublineage in Indonesia, Singapore, Malaysia and East Timor [10] . The continuously branching model also appears true in many other countries, including Canada, Mexico, Israel, Singapore, Australia, New Zealand, Japan, South Korea, Vietnam and China, although the subvariant composition varies from country to country. As it is in Europe , d1 is now predominant in Singapore [10], Israel, Australia, New Zealand and Japan, but d2 is more dominant than d1 subvariant in South Korea, Vietnam and some Canadian provinces (unpublished observations). Thus, it is necessary to systematically track d subvariants and their evolutionary trajectory in different countries around the world.

As summarized in Fig. 8F , d1 and d2 subvariants are much more virulent than a variant, which explains why d variant has overtaken all other variants and become almost the sole pandemic driver around the world. Alarmingly, d1 sublineages such as d1L, d1V1, d1V2, d1T1 and d1T2 (Fig. 8E) appear to be even more virulent than d1 itself. Therefore, we will need to track SARS-COV-2 evolution very closely and map its evolutionary trajectory in a precise and timely fashion, which will help us adopt the best public health measures and develop the new generation of vaccines. Dynamic evolution of d variant (Fig. 8E -F) also reiterates the notion that to end this pandemic swiftly, it will be important to take preventive measures (such as immunity maximization through full vaccination and subsequent boosters) to block further evolution of d subvariants.

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint 

The author declares no competing interests.

SARS-COV-2 genome sequences were downloaded the GISAID database (https://www.gisaid.org/) on the dates specified in the figure legends. CoVsurver (https://www.gisaid.org/epiflu-applications/covsurver-mutations-app/) was used to analyze mutations in representative genomes. Fasta files containing specific groups of genomes were downloaded from the GISAID database. During downloading, empty spaces in the fasta headers were replaced by underscores because such spaces make the files incompatible for subsequent sequence alignment and phylogenetic analysis. To shorten the Fasta headers, the Find/Replace_All function of TextEdit (version 1.16) was used to delete "Cov19/" and "/2021" at the beginning and middle of the headers, respectively. In addition, "/" and "|" symbols in the headers are incompatible for sequence alignment and phylogenetic analysis, so they were also replaced with underscores via the Find/Replace_All function of TextEdit. The cleaned Fasta files were used for mutation profiling via Coronapp (http://giorgilab.unibo.it/coronannotator/), a web-based mutation annotation application [16, 17] .

Some cleaned Fasta files were also uploaded to SnapGene (version 5.3.2) for multisequence analysis via the MAFFT tool. For this, only high-coverage genomes with complete sample collection date information were used. After the alignment, the 5' and 3' were manually trimmed to make >95% genomes possess the same length. The sequence alignment was exported out as Fasta files. Each Fasta header in the files contains "(0 bp)" along with some adjacent empty spaces, which are incompatible for subsequent phylogenetic analysis and is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint and R385K (Table 1) were used as markers for d1 or d3 genomes, respectively. Spike substitutions A222V and K77T were used as markers for d2 or d4 genomes, respectively. In Europe, there are many d1V genomes that also encode spike A222V, so the NSP3 substitution P822L was used together with spike A222V to identify d2 genomes for analysis or downloading. There are several limitations with these markers. The first limitation is that some genomes do not have the right substitutions due to poor sequence quality. Another limitation is that some genomes may have incorrect date information on sample collection due to typos and other errors (for example, a genome collected in 2021 was submitted as one from 2020). A third limitation is that a few d1 or d2 genomes also encode K77T or R395K, so the number of d3 or d3 may be overestimated in some cases. However, these limitations should not affect the overall conclusions.

The PyMol molecular graphics system (version 2.4.2, https://pymol.org/2/) from Schrödinger, Inc. was used for downloading structure files from the PDB database for further analysis and image export. The images were cropped via Adobe Photoshop and further presentation using

Illustrator.

Pandemic and vaccination data . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10. 1101 the GISAID database on September 14, 2021 for phylogenetic analysis. Only high-coverage genomes with complete sample collection date information were used. The package RAxML-NG was used to generate 20 maximum likelihood trees and the bestTree for presentation via Figtree as in Fig. 3D . The strain names and GISAID accession numbers of the genomes are detailed in Figure S2. (B-C) Monthly distribution of d1 subvariants in the world and Europe.

The analysis was done on the GISAID website on September 30, 2021. Quebec, Canada from June to September 2021. One subset of d1T genomes (referred to as d1T1) also encode T299I and another subset carries the nucleocapsid substitution Q229H (referred to as d1T2, only identified in Denmark). The first d1T genome sequence was first submitted to the GISAID database on June 11, 2021 in Japan, but the case has travel history to Morocoo (Table 1) . d1T sublineage is a key driver behind the latest wave of COVID-19 cases in the country. (B-C) Mutation profile of 1,267 d1T genomes identified in different countries around the world. The genomes were downloaded from the GISAID database on October 01, . CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. RBD, receptor-binding domain; S1/S2, boundary of S1 and S2 domains after furin cleavage between residues R685 and S686; FP, fusion peptide; FPR, fusion peptide C-terminal proximal region; HR1 and HR2, heptad-repeat regions 1 and 2, respectively; S2', cleavage site between residues R815 and S816 in S2 domain; TM, transmembrane motif; CT, C-terminal cytoplasmic tail. The domain organization was adapted from a published study [18] . is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 Fig. 2A ).

** For sublineages, only extra substitutions are shown.

*** The sample collection dates for some early d genomes listed in the GISAID database appear to have typos and errors, so for d variant and its four subvariants (d1-d4), the first genomes identified in India are listed here.

. CC-BY-NC-ND 4.0 International license It is made available under a perpetuity.

is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 is that d2 but not d1 genomes encode A222V. Another difference is that ~40% d1 genomes from India encode T95I, but this percentage is very low among d2 genomes. The percentage of T95I-encoding d1 genomes varies from country to country. is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint is the author/funder, who has granted medRxiv a license to display the preprint in is the author/funder, who has granted medRxiv a license to display the preprint in (which was not certified by peer review) preprint

The copyright holder for this this version posted October 20, 2021. 

Global minimum estimates of children affected by COVID-19-associated orphanhood and deaths of caregivers: a modelling study

COVID-19-Associated Orphanhood and Caregiver Death in the United States

Characteristics of SARS-CoV-2 and COVID-19

Data, disease and diplomacy: GISAID's innovative contribution to global health

Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England