key: cord-0970467-uu7bipci
authors: Piantham, C.; Ito, K.
title: Estimating the increased transmissibility of the B.1.1.7 strain over previously circulating strains in England using fractions of GISAID sequences and the distribution of serial intervals
date: 2021-03-17
journal: nan
DOI: 10.1101/2021.03.17.21253775
sha: 57cf09285c76fdcaeb1bc20c92029b10ced5fb1a
doc_id: 970467
cord_uid: uu7bipci

The B.1.1.7 strain, a variant strain of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is thought to have higher transmissibility than previously circulating strains in England. The fraction of the B.1.1.7 strain among SARS-CoV-2 viruses in England has grown rapidly. In this paper, we propose a method to estimate the selective advantage of a mutant strain over previously circulating strains using the time course of the fraction of B.1.1.7 strains. Based on Wallinga-Teunis's method for estimating instantaneous reproduction numbers, our method allows the reproduction number to change during the target period of analysis. Our approach is also based on Maynard Smith's model of allele frequencies in adaptive evolution, which assumes that the selective advantage of a mutant strain over previously circulating strains is constant over time. Applying this method to the sequence data in England using serial intervals of COVID-19, we found that the transmissibility of the B.1.1.7 strain is 40% (95% confidence interval (CI): 40% to 41%) higher than that of previously circulating strains in England. The date of emergence of B.1.1.7 strains in England was estimated to be September 20, 2020 (95% CI: September 11 to September 20, 2020). The result indicates that control measures against the B.1.1.7 strain need to be strengthened by 40% relative to those against previously circulating strains. To get the same control effect, contact rates between individuals need to be restricted to 0.71 of the contact rates achieved by the control measures taken for previously circulating strains.

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, has been rapidly evolving since its emergence in 2019. In December 2020, Public Health England detected a new cluster of SARS-CoV-2 viruses phylogenetically distinct from the other strains circulating in the United Kingdom (Chand et al., 2020). These viruses belong to the lineage B.1.1.7 according to the PANGO nomenclature (Andrew.), and the World Health Organization (WHO) designated them as Variant of Concern, year 2020, month 12, variant 01 (VOC 202012/01) (World Health Organization, 2020).

The B.1.1.7 strain was first detected in England in September 2020, and the number of infections with this strain increased in October and November 2020 (Chand et al., 2020). By February 2021, the B.1.1.7 strain accounted for 95% of the SARS-CoV-2 viruses circulating in England (Davies et al., 2021).

Several studies have compared the transmissibility of the B.1.1.7 strain to that of previously circulating strains. Davies et al. estimated that the reproduction number of the B.1.1.7 strain is 43-90% (with a 95% credible interval of 38-130%) higher than that of preexisting strains using data from England (Davies et al., 2021). However, different models resulted in different ranges of estimates of the multiplicative increase in the reproduction number ($R$). Grabowski et al. estimated an 83-118% increase with a confidence interval of 71-140% compared to previously circulating strains in England (Grabowski et al., 2021). Volz et al. estimated a 40-75% increase in $R$ using data from England (Volz et al., 2021). Their methods use linear regression of the log odds ratio between the B.1.1.7 strain and previously circulating strains and estimate the increase in transmissibility under the assumption that $R$ is constant over time during the target period of analysis. Washington et al. estimated a 35-45% increase using data from the United States of America with Volz's method (Washington et al., 2021). Chen et al. estimated a 49-65% increase in the reproduction number using data from Switzerland, also under the constant assumption (Chen et al., 2021). Because of the high transmissibility of the B.1.1.7 strain, strong control measures such as lockdowns were taken when the strain was introduced (Davies et al., 2021). Thus, the constant assumption is questionable when analyzing the increase in the reproduction number of B.1.1.7 compared to that of previously circulating strains.

In this paper, we propose a method to estimate the selective advantage of a mutant strain over previously circulating strains. Based on Wallinga-Teunis's method for estimating instantaneous reproduction numbers (Wallinga & Teunis, 2004), our method allows the reproduction number to change during the target period of analysis. Our approach is also based on Maynard Smith's model of allele frequencies in adaptive evolution, which assumes that the selective advantage of a mutant strain over previously circulating strains is constant over time (Maynard Smith & Haigh, 1974). Applying the developed method to the sequence data in England using the serial interval distribution of COVID-19 estimated by Nishiura et al. (Nishiura et al., 2020), we estimate the increase in the instantaneous reproduction number of B.1.1.7 strains compared to that of previously circulating strains. Based on the estimate, we discuss its implications for control measures against COVID-19.

Nucleotide sequences of SARS-CoV-2 viruses were downloaded from the GISAID EpiCoV database (Shu & McCauley, 2017) on March 1, 2021. Nucleotide sequences determined from viruses detected in England were selected and aligned to the reference amino acid sequence of the S protein of SARS-CoV-2 (YP_009724390) using DIAMOND (Buchfink et al., 2015). The aligned nucleotide sequences were translated into amino acid sequences, which were then aligned with the reference amino acid sequence of the S protein using MAFFT (Katoh et al., 2002). Amino acid sequences having either an ambiguous amino acid or more than ten gaps were excluded from the rest of the analyses. Table 1 shows the amino acids on the S protein characterizing the B.1.1.7 strain, retrieved from the PANGO database (Andrew.).

Table 1. Amino acids on the S protein characterizing the B.1.1.7 strain.

Position    Amino acid
...         Aspartic acid (D)
681         Histidine (H)
716         Isoleucine (I)
982         Alanine (A)
1118        Histidine (H)

We divided amino acid sequences into three groups based on the amino acids shown in Table 1. The first group consists of sequences having all of the B.1.1.7-defining amino acid substitutions in Table 1. We call a virus in this group a "B.1.1.7 strain". The second group consists of sequences having none of the B.1.1.7-defining substitutions. We call a virus in this group a "non-B.1.1.7 strain". The third group consists of sequences having at least one, but not all, of the B.1.1.7-defining amino acid substitutions. We call a virus in this group a "B.1.1.7-like strain". Table 2 shows the number of sequences categorized into each group. We used the numbers of B.1.1.7 strains and non-B.1.1.7 strains for the rest of the analyses. B.1.1.7-like strains were excluded from the analyses, since we do not know whether they had the same transmissibility as B.1.1.7 strains. Figure 1 shows the daily numbers of GISAID sequences of B.

The copyright holder for this preprint (which was not certified by peer review), this version posted March 17, 2021, is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
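The three-way grouping of sequences described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `classify_strain` and the dictionary representation of a sequence are hypothetical, and the example uses only the four substitutions whose positions are legible in Table 1 (681, 716, 982, 1118), so it is not the complete defining set.

```python
def classify_strain(residues, defining):
    """Assign a sequence to one of the three groups.

    residues: dict mapping S-protein position -> observed amino acid
    defining: dict mapping S-protein position -> B.1.1.7-defining amino acid
    (both representations are assumptions made for this sketch)
    """
    matches = sum(1 for pos, aa in defining.items() if residues.get(pos) == aa)
    if matches == len(defining):
        return "B.1.1.7"          # all defining substitutions present
    if matches == 0:
        return "non-B.1.1.7"      # none of the defining substitutions present
    return "B.1.1.7-like"         # incomplete set; excluded from the analysis


# Partial defining set: only the rows of Table 1 with a legible position.
PARTIAL_DEFINING = {681: "H", 716: "I", 982: "A", 1118: "H"}

full = {681: "H", 716: "I", 982: "A", 1118: "H"}
none = {681: "P", 716: "T", 982: "S", 1118: "D"}   # reference residues
partial = {681: "H", 716: "T", 982: "S", 1118: "D"}

groups = [classify_strain(r, PARTIAL_DEFINING) for r in (full, none, partial)]
```

Only sequences labelled "B.1.1.7" or "non-B.1.1.7" would be counted in the later likelihood; the "B.1.1.7-like" group is dropped.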

The method we propose in this paper uses discrete distributions of serial intervals. Function $g(j)$ gives the probability that the onset day of a secondary infection is $j$ days after the onset day of its primary case. We obtained the values of $g(j)$ by discretizing the lognormal distribution of serial intervals of COVID-19 estimated by Nishiura et al. (Nishiura et al., 2020). Thus, the probability mass function of serial intervals is given by

where $f(t)$ is the probability density function of a lognormal distribution with a mean of 4.7 days and a standard deviation of 2.9 days.
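A discretized serial interval distribution of this kind can be sketched in a few lines. The mean (4.7 days) and standard deviation (2.9 days) are from the text; the binning rule used here, $g(j) = F(j + 0.5) - F(j - 0.5)$ for $j \ge 1$ with $g(0) = 0$, the truncation at 30 days, and the renormalization are assumptions of this sketch and may differ from the discretization the authors used.

```python
import math


def lognormal_params(mean, sd):
    """Convert the mean and SD of a lognormal variable to log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2.0
    return mu, math.sqrt(sigma2)


def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of a lognormal distribution."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


def serial_interval_pmf(mean=4.7, sd=2.9, max_lag=30):
    """Return [g(0), ..., g(max_lag)] as a list that sums to 1.

    ASSUMPTION: day j collects the density on (j - 0.5, j + 0.5],
    g(0) is set to 0, and the truncated list is renormalized.
    """
    mu, sigma = lognormal_params(mean, sd)
    g = [0.0]
    for j in range(1, max_lag + 1):
        g.append(lognormal_cdf(j + 0.5, mu, sigma) - lognormal_cdf(j - 0.5, mu, sigma))
    total = sum(g)
    return [p / total for p in g]


g = serial_interval_pmf()
mean_lag = sum(j * p for j, p in enumerate(g))  # close to the continuous mean
```

With this binning the discrete mean stays close to 4.7 days; a rule such as $F(j) - F(j-1)$ would shift it by roughly half a day, which is one reason the choice of discretization should be stated explicitly.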

Consider a large population of viruses consisting of strains of two genotypes, $A$ and $a$, whose fractions in the viral population at calendar date $t$ are $q_A(t)$ and $q_a(t)$, respectively. Suppose also that genotype $A$ is a mutant of $a$ that emerges at time $t_0$.


We assume that a virus of genotype $A$ generates $1 + s$ times as many secondary transmissions as a virus of genotype $a$. Then, $s$ can be considered as the coefficient of selective advantage in adaptive evolution. As described by Maynard Smith and Haigh (1974), the fraction of viruses of allele $A$ after $n$ transmissions, $q_n$, satisfies

$$q_n = \frac{(1 + s)\, q_{n-1}}{1 + s\, q_{n-1}} \qquad (1)$$
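As a quick illustration of the per-generation relation, the standard haploid selection update $q_n = (1+s)\,q_{n-1} / (1 + s\,q_{n-1})$ can be iterated directly. The function names below are hypothetical; the values $s = 0.40$ and an initial fraction of 0.003 are the estimates reported later in the paper, used here only as example inputs.

```python
def next_fraction(q_prev, s):
    """One generation of selection: the mutant leaves (1 + s) times as many offspring."""
    return (1.0 + s) * q_prev / (1.0 + s * q_prev)


def fractions_over_generations(q0, s, n_generations):
    """Return [q_0, q_1, ..., q_n] under a constant selective advantage s."""
    qs = [q0]
    for _ in range(n_generations):
        qs.append(next_fraction(qs[-1], s))
    return qs


def odds(q):
    """Odds q / (1 - q) of the mutant fraction."""
    return q / (1.0 - q)


qs = fractions_over_generations(q0=0.003, s=0.40, n_generations=30)
```

A useful check is that the odds of the mutant are multiplied by exactly $1 + s$ each generation, so `odds(qs[n])` equals `odds(qs[0]) * (1 + s) ** n`; this is why the log odds grow linearly in the number of generations.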

Let $g(j)$ be the discrete distribution of serial intervals defined in the previous subsection. Since the fraction of allele $A$ in the previous generation of an infection at calendar time $t$ is $q_A(t - j)$ with probability $g(j)$, where $j \ge 0$ is the serial interval in days, the value of $q_A(t)$ can be represented as follows.

(2)

See Appendix for the relationship between Equation (2) and the instantaneous reproduction numbers proposed by Wallinga and Teunis (Wallinga & Teunis, 2004). Assuming $g(0) = 0$, we can approximate the formula.

(3)
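The sketch below shows one possible reading of the time-course model: the per-generation update is applied to each lagged fraction $q_A(t-j)$ and the results are weighted by $g(j)$, with $g(0) = 0$ so that the recursion is explicit. This form is an assumption, since the displayed equations are not legible in this copy; the function name, the day indexing, and the toy serial interval distribution are all hypothetical.

```python
def mutant_fraction_trajectory(s, q0, t0, g, n_days):
    """Return [q_A(0), ..., q_A(n_days - 1)] on an integer day axis.

    s  : selective advantage (constant over time)
    q0 : initial fraction of the mutant at its emergence day t0
    g  : serial interval pmf as a list, g[j] for lag j, with g[0] == 0
    ASSUMED form: q_A(t) = sum_j g(j) (1+s) q_A(t-j) / (1 + s q_A(t-j)),
    with q_A(t) = 0 for t < t0 and q_A(t0) = q0.
    """
    q = [0.0] * n_days
    for t in range(n_days):
        if t < t0:
            q[t] = 0.0
        elif t == t0:
            q[t] = q0
        else:
            total = 0.0
            for j in range(1, len(g)):
                if t - j < 0:
                    break
                prev = q[t - j]
                total += g[j] * (1.0 + s) * prev / (1.0 + s * prev)
            q[t] = total
    return q


# Toy serial interval pmf (hypothetical values, sums to 1, g[0] = 0).
toy_g = [0.0, 0.1, 0.2, 0.3, 0.2, 0.1, 0.1]
traj = mutant_fraction_trajectory(s=0.40, q0=0.003, t0=10, g=toy_g, n_days=130)
```

Because the mutant fraction is zero before $t_0$, the first days after emergence see only part of the serial interval distribution, so under this reading the trajectory dips below $q_0$ before sustained growth takes over.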

Let $N(t)$ be the number of sequences of either genotype $A$ or $a$ observed at calendar date $t$. Let $d_1, \ldots, d_n$ be calendar dates such that $N(d_i) > 0$ for $1 \le i \le n$. Suppose that we have $N_A(d_i)$ samples of genotype $A$ at calendar date $d_i$. Since genotype $A$ emerged at time $t_0$, $q_A(d_i) = 0$ and $q_a(d_i) = 1$ for $d_i < t_0$. Let $q_0$ be the initial frequency of genotype $A$, i.e., $q_0 = q_A(t_0)$. Then the following equation gives the likelihood function of $s$, $t_0$, and $q_0$ for observing $N_A(d_i)$ samples of viruses of genotype $A$ at calendar date $d_i$.

for $1 \le i \le n$. The likelihood function of $s$, $t_0$, and $q_0$ for observing $N_A(d_1), \ldots, N_A(d_n)$ sequences of genotype $A$ at calendar dates $d_1, \ldots, d_n$ is given by the following formula.
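A likelihood of this kind can be sketched under the assumption that the daily count of genotype $A$ sequences is binomial with success probability $q_A(d_i)$ and that days are independent, so the joint log likelihood is a sum over dates. The binomial form, the function name, and the clamping of probabilities away from 0 and 1 (a numerical convenience) are assumptions of this sketch, not statements about the authors' equations.

```python
import math


def binomial_log_likelihood(q_pred, n_mutant, n_total, eps=1e-12):
    """Sum of binomial log probabilities over observation dates.

    q_pred   : model fractions q_A(d_i) for each date
    n_mutant : observed counts N_A(d_i) of genotype A
    n_total  : observed counts N(d_i) of genotypes A and a together
    """
    ll = 0.0
    for q, k, n in zip(q_pred, n_mutant, n_total):
        if n == 0:
            continue  # dates without sequences carry no information
        q = min(max(q, eps), 1.0 - eps)  # avoid log(0); numerical convenience only
        log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        ll += log_choose + k * math.log(q) + (n - k) * math.log(1.0 - q)
    return ll


# Two toy dates: 1 of 1 and 1 of 2 sequences are genotype A, model fraction 0.5.
example_ll = binomial_log_likelihood([0.5, 0.5], [1, 1], [1, 2])
```

In a full fit, `q_pred` would be produced by the trajectory model for a candidate $(s, t_0, q_0)$, and the log likelihood would be maximized over those three parameters.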


Parameter estimation from sequence data

The B.1.1.7 strain was first detected in England on September 20, 2020. We assume that $t_0$ is this day or some day before it. Observations on September 1, 2020 and later were used for parameter estimation. B.1.1.7 strains, viruses having the complete set of B.1.1.7-defining substitutions on the S protein, were considered as genotype $A$. Non-B.1.1.7 strains, viruses having none of the B.1.1.7-defining substitutions, were considered as genotype $a$. B.1.1.7-like strains, viruses having an incomplete set of B.1.1.7-defining substitutions on the S protein, were excluded from the analysis. Parameters $s$, $t_0$, and $q_0$ were estimated by maximizing the log likelihood defined in Equation (5). The 95% confidence intervals of the parameters were estimated by profile likelihood (Pawitan, 2013). The likelihood function was optimized using the nloptr package in R (Johnson; Rowan, 1990).
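The profile-likelihood interval can be illustrated with a small grid-based sketch. It replaces the nloptr optimizer named in the text with an exhaustive grid search, purely for transparency, and it uses a toy quadratic log likelihood whose peak and curvature are invented for illustration; none of the numbers below are results from the paper. The cut-off of 1.92 log-likelihood units is half the 95% quantile of a chi-square distribution with one degree of freedom.

```python
def profile_likelihood_ci(loglik, focus_grid, nuisance_grid, drop=1.92):
    """Grid-based profile likelihood for one parameter of interest.

    For each value of the focus parameter, the log likelihood is maximized
    over the nuisance parameter; the interval keeps focus values whose
    profile is within `drop` of the overall maximum (assumes one peak).
    """
    profile = [max(loglik(s, q) for q in nuisance_grid) for s in focus_grid]
    best = max(profile)
    inside = [s for s, ll in zip(focus_grid, profile) if ll >= best - drop]
    s_hat = focus_grid[profile.index(best)]
    return s_hat, min(inside), max(inside)


def toy_loglik(s, q0):
    """Hypothetical log likelihood peaking at s = 0.25, q0 = 0.01 (not real data)."""
    return -((s - 0.25) ** 2) / (2 * 0.02 ** 2) - ((q0 - 0.01) ** 2) / (2 * 0.005 ** 2)


s_grid = [0.10 + 0.001 * i for i in range(301)]   # 0.100, 0.101, ..., 0.400
q_grid = [0.001 * i for i in range(1, 31)]        # 0.001, ..., 0.030
s_hat, ci_low, ci_high = profile_likelihood_ci(toy_loglik, s_grid, q_grid)
```

For the three-parameter model, the same idea applies with $t_0$ and $q_0$ both treated as nuisance parameters when profiling $s$, and likewise for the other two parameters.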

The selective advantage of B.1.1.7 strains over non-B.1.1.7 strains, $s$, was estimated to be 0.40 with a 95% confidence interval from 0.40 to 0.41 (Table 2). The date of emergence of B.1.1.7 strains in England, $t_0$, was estimated to be September 20, 2020 with a 95% confidence interval from September 11, 2020 to September 20, 2020. The initial fraction of B.1.1.7 among non-B.1.1.7 and B.1.1.7 strains at the emergence in England, $q_0$, was estimated to be 0.0030 with a 95% confidence interval from 0.0014 to 0.0031.

In this paper, the selective advantage of the B.1.1.7 strain in England over non-B.1.1.7 strains was estimated to be 0.40 with a 95% CI from 0.40 to 0.41. The date of emergence of B.1.1.7 strains in England was estimated to be September 20, 2020 with a 95% CI from September 11, 2020 to September 20, 2020. The initial fraction of B.1.1.7 among all sequences except B.1.1.7-like strains at the time of emergence in England was estimated to be 0.0030 with a 95% CI from 0.0014 to 0.0031.

Our estimation method is based on the principle that the expected fraction of a mutant strain among all strains can be determined from the fractions in the previous generation using the serial interval distribution of infections. The method is related to Wallinga-Teunis's method for estimating the instantaneous reproduction number, and thus it allows the reproduction numbers of strains to change during the target period of analysis. Instead, our method assumes that the selective advantage of a mutant strain over previously circulating strains is constant over time, which is based on Maynard Smith's model of allele frequencies in adaptive evolution.

The estimated selective advantage of 0.40 indicates that the instantaneous reproduction number of the B.1.1.7 strain is 40% higher than that of previously circulating strains in England. This means that the control measures for the B.1.1.7 strain need to be strengthened by 40% compared to those for the previously circulating strains. To get the same control effect, contact rates between individuals need to be restricted to below 1/1.40 = 0.71 of the contact rates achieved under the control measures for the previously circulating strains.
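The arithmetic behind the 0.71 figure can be checked directly. The sketch assumes, as the text does, that the reproduction number scales linearly with the contact rate, so offsetting a factor of $1 + s$ requires multiplying contact rates by $1 / (1 + s)$; the function name is hypothetical.

```python
def required_contact_ratio(s):
    """Contact-rate multiplier that offsets a (1 + s)-fold higher reproduction number."""
    return 1.0 / (1.0 + s)


# Point estimate and upper end of the reported 95% CI for s.
ratio_point = round(required_contact_ratio(0.40), 3)
ratio_upper_s = round(required_contact_ratio(0.41), 3)
```

Here `ratio_point` is 0.714 and `ratio_upper_s` is 0.709, so the reported interval for $s$ corresponds to a required contact ratio of roughly 0.71 in both cases.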

Our method relies on the serial interval distribution, and thus the results may change depending on the serial interval distribution used in the analysis. In this paper, we used the serial interval distribution estimated by Nishiura et al. (Nishiura et al., 2020). The distribution is a lognormal distribution with a mean of 4.7 days and a standard deviation of 2.9 days. Several groups have estimated the serial intervals of SARS-CoV-2 using different datasets, and some variation is observed among the estimated values (Rai et al., 2021). Volz et al. (Volz et al., 2021) assumed a fixed serial interval of 6.5 days based on results by Bi et al. (Bi et al., 2020). However, Ali et al. have reported that the serial interval estimated using data from China before January 22, 2020 was longer than estimates after January 22, 2020 (Ali et al., 2020). The serial interval estimated by Bi et al. includes data from before January 22, 2020, and it is possible that the estimated serial interval does not reflect the current situation. This is the reason why we did not use the serial interval estimated by Bi et al.

As of March 17, 2021, the B.1.1.7 strain has been observed in 93 countries (A.). Estimating the selective advantage of the B.1.1.7 strain over previously circulating strains in other countries remains for future work. Variant strains that originated in Brazil and South Africa also show higher transmissibility than previously circulating strains (World Health Organization, 2021). There is an urgent need to estimate the selective advantage of these strains over previously circulating strains.

Let $I(t)$ be the number of infections by viruses of either genotype $A$ or $a$ at calendar time $t$ and $g(j)$ be the probability mass function of serial intervals. Suppose that the instantaneous reproduction numbers of genotypes $A$ and $a$ at calendar time $t$ are $R_A(t)$ and $R_a(t)$. The following equations give the discrete version of Wallinga-Teunis's instantaneous reproduction numbers of infections by genotypes $A$ and $a$ at time $t$.
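Since the displayed equations are not legible in this copy, the following is a hedged sketch of how a relation between genotype fractions and instantaneous reproduction numbers can be derived. It assumes the common discrete form $R(t) = I(t) / \sum_j g(j)\, I(t-j)$ applied to each genotype separately, $R_A(t) = (1+s)\,R_a(t)$, and $q_a(t) = 1 - q_A(t)$; it is not necessarily the authors' derivation.

```latex
% Assumed notation: I(t) total infections, g(j) serial interval pmf with sum_j g(j) = 1,
% q_A(t) and q_a(t) = 1 - q_A(t) genotype fractions, R_A(t) = (1 + s) R_a(t).
\begin{align}
R_A(t) &= \frac{q_A(t)\, I(t)}{\sum_{j} g(j)\, q_A(t-j)\, I(t-j)}, \qquad
R_a(t) = \frac{q_a(t)\, I(t)}{\sum_{j} g(j)\, q_a(t-j)\, I(t-j)} \\
\frac{q_A(t)}{q_a(t)} &= (1+s)\,
  \frac{\sum_{j} g(j)\, q_A(t-j)\, I(t-j)}{\sum_{j} g(j)\, q_a(t-j)\, I(t-j)} \\
q_A(t) &\approx \frac{(1+s) \sum_{j} g(j)\, q_A(t-j)}{1 + s \sum_{j} g(j)\, q_A(t-j)}
  \qquad \text{if } I(t-j) \text{ is roughly constant over the support of } g
\end{align}
```

The second line follows from dividing the two expressions in the first line and substituting $R_A(t) = (1+s)\,R_a(t)$. In the last line the denominator is $(1+s)\sum_j g(j)\,q_A(t-j) + \sum_j g(j)\,(1 - q_A(t-j)) = 1 + s \sum_j g(j)\,q_A(t-j)$. This aggregated form applies the selection update to the $g$-weighted average of past fractions; applying the update to each lag separately and then averaging gives a value that is no larger (the update is concave in $q$) and coincides with it when $q_A$ varies little over one serial interval.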


Epidemiology and transmission of COVID-19 in 391 cases and 1286 of their close contacts in Shenzhen, China: a retrospective cohort study

Fast and sensitive protein alignment using DIAMOND

Investigation of novel SARS-COV-2 variant. Variant of Concern 202012/01 (PHE gateway number: GW-1824)

Quantification of the spread of SARS-CoV-2 variant B

SARS-CoV-2 Variant of Concern 202012/01 Has about Twofold Replicative Advantage and Acquires Concerning Mutations

The NLopt nonlinear-optimization package

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

The hitch-hiking effect of a favourable gene

Serial interval of novel coronavirus (COVID-19) infections

All Likelihood: Statistical Modelling and Inference Using Likelihood

Estimates of serial interval for COVID-19: A systematic review and meta-analysis

A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic epidemiology

Preliminary genomic characterisation of an emergent SARS-CoV-2 lineage in the UK defined by a novel set of spike mutations

Functional Stability Analysis of Numerical Algorithms

GISAID: Global initiative on sharing all influenza data -from vision to reality

Transmission of SARS-CoV-2 Lineage B.1.1.7 in England: Insights from linking epidemiological and genetic data. medRxiv

Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures

Genomic epidemiology identifies emergence and rapid

SARS-CoV-2 Variants. Disease Outbreak News

We gratefully acknowledge the laboratories responsible for obtaining the specimens and the laboratories where genetic sequence data were generated and shared via the GISAID Initiative, on which this research is based. This work was supported by CREST (grant number JPMJCR1413) from Japan Science and Technology Agency (http://www.jst.go.jp/), and the World-leading Innovative and Smart Education (WISE) Program (1801) from the Ministry of Education, Culture, Sports, Science, and Technology, Japan (http://www.mext.go.jp/). The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.