key: cord-0272037-7d60wn98
authors: Wohl, S.; Lee, E. C.; DiPrete, B. L.; Lessler, J.
title: Sample Size Calculations for Variant Surveillance in the Presence of Biological and Systematic Biases
date: 2022-01-01
journal: nan
DOI: 10.1101/2021.12.30.21268453
sha: b9dc8836f97af0cc77fb7cf7240e74bfd615a370
doc_id: 272037
cord_uid: 7d60wn98

As demonstrated during the SARS-CoV-2 pandemic, detecting and tracking the emergence and spread of pathogen variants is an important component of monitoring infectious disease outbreaks. Pathogen genome sequencing has emerged as the primary tool for variant characterization, so it is important to consider the number of sequences needed when designing surveillance programs or studies, both to ensure accurate conclusions and to optimize use of limited resources. However, current approaches to calculating sample size for variant monitoring often do not account for the biological and logistical processes that can bias which infections are detected and which samples are ultimately selected for sequencing. In this manuscript, we introduce a framework that models the full process from infection detection to variant characterization and demonstrate how to use this framework to calculate appropriate sample sizes for sequencing-based surveillance studies. We consider both cross-sectional and continuous sampling, and we have implemented our method in a publicly available tool that allows users to estimate necessary sample sizes given a specific aim (e.g., variant detection or measuring variant prevalence) and sampling method. Our framework is designed to be easy to use, while also flexible enough to be adapted to other pathogens and surveillance scenarios.

When designing a sequencing-based study or surveillance system, the first step is to identify the overall goal of the study, as different aims require different sample sizes in order to obtain reliable results. Whether variant detection or measuring prevalence is the primary goal [12], an important consideration is whether samples will be collected in a single cross-sectional snapshot or through ongoing surveillance (Fig 1). This decision will influence both how the sample size is defined (i.e., overall study size versus average daily or weekly sampling rate) and what targets must be specified to calculate the appropriate sample size. For instance, in a cross-sectional study a possible target could be the probability of detecting a variant at a particular prevalence, while with ongoing surveillance it might be the waiting time to detect a recently introduced variant that is growing in prevalence. The requirements for specific goals are enumerated in Figure 1. If the study design is fixed (i.e., a set number of samples have already been collected or sequenced), the same principles can be applied to evaluate the questions that can be answered and confidence in the results.

Fig 1. Identifying the key goals of the study (green shaded region; variant detection or measuring variant prevalence) and sampling method (blue shaded region; cross-sectional or periodic) are necessary to determine the required targets and parameters (red shaded region) that must be specified in order to calculate the appropriate sample size.

For detection- and prevalence-based questions, we can use existing sampling theory as the basis of our approach. Specifically, the sample size needed to detect novel variants at some probability can be calculated with a simple application of the binomial distribution. The probability of detecting at least one case belonging to a variant of interest ($V_1$) given the prevalence of this variant in the population ($P_{V_1}$) is equivalent to one minus the probability of not detecting it at all. Therefore, the sample size ($n$) needed to detect at least one case of a VOC at a pre-determined probability ($p$) has been shown to be [10]:

$$n = \frac{\ln(1 - p)}{\ln(1 - P_{V_1})} \tag{1}$$
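As a minimal sketch of the binomial argument above (set $1 - (1 - P_{V_1})^n \ge p$ and solve for $n$), the following Python function returns the smallest whole number of sequences that meets the target. The function and argument names are illustrative and are not taken from the authors' spreadsheet.

```python
import math


def detection_sample_size(prevalence: float, prob_detect: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - prevalence)**n >= prob_detect."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if not 0.0 < prob_detect < 1.0:
        raise ValueError("prob_detect must be strictly between 0 and 1")
    # Solve (1 - prevalence)**n <= 1 - prob_detect for n, then round up.
    return math.ceil(math.log(1.0 - prob_detect) / math.log(1.0 - prevalence))


if __name__ == "__main__":
    # Required sequences shrink quickly as the variant becomes more common.
    for prev in (0.01, 0.02, 0.05):
        print(f"prevalence={prev:.2f} -> n={detection_sample_size(prev)}")
```

Rounding up is a deliberate choice here: a fractional result means the next whole sequence is needed to reach the target probability.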

Similarly, existing sample size theory can be used to estimate the prevalence of known VOCs in a population. Specifically, sample size calculations for estimating proportions can be used to determine the number of sequences that should be generated to estimate VOC prevalence within a desired confidence interval [13]:

$$n = \frac{Z^2 \, P_{V_1} (1 - P_{V_1})}{d^2} \tag{2}$$

Where $n$ is the number of samples needed, $Z$ is the Z-statistic for the 95% confidence level, $P_{V_1}$ is the expected prevalence of the VOC in the population, and $d$ is the desired absolute precision (tolerance for error in the prevalence estimate). This methodology has previously been used to calculate the number of SARS-CoV-2 samples needed to detect variants at different frequency levels [8], and it assumes that the sample size ($n$) is small compared to the total infected population.
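The proportion-based calculation can be sketched in Python as follows. This assumes the standard normal-approximation formula $n = Z^2 P (1 - P) / d^2$ with $d$ given as an absolute precision; the function name and signature are hypothetical, and the Z-statistic is taken from the standard library rather than hard-coded.

```python
import math
from statistics import NormalDist


def prevalence_sample_size(prevalence: float, abs_precision: float,
                           confidence: float = 0.95) -> int:
    """Sequences needed to estimate a prevalence to within +/- abs_precision."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if abs_precision <= 0.0:
        raise ValueError("abs_precision must be positive")
    # Two-sided Z-statistic, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    n = z ** 2 * prevalence * (1.0 - prevalence) / abs_precision ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # A 10% prevalence measured to within 25% of its value (d = 0.025).
    print(prevalence_sample_size(0.10, 0.25 * 0.10))
```

Because $d$ enters as a square, halving the tolerated error roughly quadruples the number of sequences required.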

These approaches to sample size calculations are subject to limitations. For one, both equations assume that the pool of samples available for sequencing is a representative random sample of the total infected population. However, the biology and epidemiology of SARS-CoV-2 VOCs, such as severity of disease, may affect which samples are collected and sequenced (Fig 2, Fig S1). The sequences used for analysis may therefore not be directly reflective of the underlying distribution of viral sequences, and this bias may be detrimental or useful depending on the goals of surveillance.

Here, we characterize the mechanistic process from infection to case detection to variant identification using a simple modeling framework that captures how these processes differ between variants. We first explore how VOC attributes could bias detection of SARS-CoV-2 cases, and then determine how this bias may affect sample size calculations. We then extend the framework and focus on genomic surveillance as a continuous process, with sampling occurring periodically over time. Our framework has been implemented in a customizable, publicly available spreadsheet (https://github.com/HopkinsIDD/VOCsamplesize).

[Figure caption fragment: [...]; blue = infections caused by other variants of the same pathogen.]

As discussed above, we aimed to characterize the factors that could affect the collection of SARS-CoV-2 samples and their selection for downstream processes such as sequencing. To do this, we developed a model that tracks how biological differences between variants, as well as logistical challenges in case and variant detection, may affect estimated variant frequency, given a true underlying frequency in a population. This model distinguishes all infections (N), from those detected positive (D), from those that produce a genome sequence (G) that can be used to identify the underlying variant.

We conceive of the model in two phases: 1) infection detection, which describes the joint biological and testing mechanisms that lead infections (N) to be detected as COVID-positive (D) by a surveillance system (Fig 3, top row), and 2) infection characterization, which describes the selection of samples for genomic sequencing from high quality detected infections (H) and identification of specific variants from the resulting high-quality sequences (Fig 3, bottom row). In this context, "high-quality detected infections" refers to pathogen-positive samples of high enough quality (e.g., by a metric such as cycle threshold value) that they will be selected for sequencing, while "high-quality sequences" refers to pathogen sequences that are complete enough to characterize the infection-causing variant.

Parameters are defined in Table 1.

Model states are separated by transition parameters that model how biological differences between variants can affect factors such as testing rates, testing sensitivity, and sample quality (Table 1). [...] variant $i$ at any given step, which is often more interesting than the total number.

Using this model, the number of high quality detected infections attributable to a specific variant is as follows:

By calculating this quantity for each variant of interest and the remainder of the population, we can determine the prevalence of each VOC in the pool of high quality samples available for sequencing. When the detection parameters do not vary between variants, the VOC prevalence in detected high quality infections mirrors its prevalence in the greater population. Variation in these parameters between pathogen variants can be summarized in a single parameter, which we term the coefficient of detection:

The value of this coefficient for each variant will determine the bias already present in $H$, the population from which we ultimately draw our sample.
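To make the role of the coefficient of detection concrete, here is a hedged Python sketch. It assumes that the coefficient acts as a per-variant probability that an infection ends up in the pool of high quality detected infections, formed as a product of a testing probability, a testing sensitivity, and a sample-quality probability. That product form, the function names, and the three-variant numbers below are illustrative assumptions, not the authors' exact parameterization.

```python
def coefficient_of_detection(testing_prob: float, sensitivity: float,
                             quality_prob: float) -> float:
    """Assumed form: probability that an infection reaches the pool H."""
    return testing_prob * sensitivity * quality_prob


def observed_prevalences(true_prevalences, coefficients):
    """Variant proportions in H, given true proportions and coefficients."""
    if len(true_prevalences) != len(coefficients):
        raise ValueError("one coefficient is required per variant")
    # Each variant contributes infections in proportion to P_Vi * C_Vi.
    weighted = [p * c for p, c in zip(true_prevalences, coefficients)]
    total = sum(weighted)
    return [w / total for w in weighted]


if __name__ == "__main__":
    # Hypothetical three-variant system; all values are made up.
    true_prev = [0.05, 0.15, 0.80]
    coeffs = [
        coefficient_of_detection(0.5, 0.975, 0.8),
        coefficient_of_detection(0.5, 0.95, 0.7),
        coefficient_of_detection(0.5, 0.95, 0.6),
    ]
    for p, p_obs in zip(true_prev, observed_prevalences(true_prev, coeffs)):
        print(f"true={p:.3f} observed={p_obs:.3f}")
```

Only the relative sizes of the coefficients matter: multiplying every coefficient by the same constant leaves the observed proportions unchanged.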


We explored the effects of biological and logistical differences between variants, summarized in the coefficient of detection (Equation 4), on the variant proportions observed in $H$, the pool of high quality detected infections from which to sample (Equation 3). We can calculate the multiplicative bias in our estimate of a particular variant as a function of the underlying prevalences and coefficients of detection for all variants in the system:

$$\frac{P^{*}_{V_1}}{P_{V_1}} = \frac{C_{V_1}}{\sum_{i=1}^{k} P_{V_i} C_{V_i}} \tag{5}$$

Where $P^{*}_{V_1}$ is the prevalence of $V_1$ observed in $H$ and $k$ is the total number of variants in the population ($V_1, \ldots, V_k$).

Unsurprisingly, a larger differential between the coefficient of detection for $V_1$ and the coefficients of detection for other variants in the system leads to more bias in the observed frequency. Additionally, the observed prevalence of $V_1$ in $H$ is more biased when $P_{V_1}$ is smaller (Fig 4A).
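The dependence of the bias on prevalence can be checked numerically. The sketch below assumes a two-variant system and expresses the multiplicative bias (observed prevalence divided by true prevalence) in terms of the ratio of coefficients of detection, here called `c_ratio` $= C_{V_1} / C_{V_2}$; the name is illustrative.

```python
def multiplicative_bias(prevalence: float, c_ratio: float) -> float:
    """Observed / true prevalence of V1 in a two-variant system.

    c_ratio is C_V1 / C_V2. Dividing numerator and denominator of
    C_V1 / (P * C_V1 + (1 - P) * C_V2) by C_V2 gives the expression below.
    """
    return c_ratio / (prevalence * c_ratio + (1.0 - prevalence))


if __name__ == "__main__":
    for c_ratio in (0.5, 1.0, 1.5, 2.0):
        row = ", ".join(
            f"P={p:.2f}: {multiplicative_bias(p, c_ratio):.3f}"
            for p in (0.01, 0.10, 0.50)
        )
        print(f"c_ratio={c_ratio:.1f} -> {row}")
```

With a ratio of 1 the bias is 1 at every prevalence, and for any other ratio the bias moves toward 1 as the variant becomes more common.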

Fig 4 (caption, partially recovered). (B) Number of sequences required to detect at least one infection caused by $V_1$ with 95% probability, for different $P_{V_1}$ prevalence values and coefficient of detection ratios. (C) Number of sequences required to determine the prevalence of variants with a frequency of at least $P_{V_1}$ in the population, with 95% confidence and 25% precision. The prevalence calculated with these sequences will reflect the observed (biased) value, and will need to be corrected using Equation 6. All panels assume a two-variant system, where $V_1$ is the variant of interest and $V_2$ is the rest of the pathogen population. In (B) and (C), note that the [...]

We can also calculate a correction factor $q$ such that:

$$P_{V_1} = \frac{q \cdot \mathrm{odds}^{*}_{V_1}}{1 + q \cdot \mathrm{odds}^{*}_{V_1}} \tag{6}$$

Where $\mathrm{odds}^{*}_{V_1}$ is the observed odds of $V_1$ (the odds of its prevalence in $H$). This equation allows for a direct conversion between the observed variant frequencies in the sampling pool ($P^{*}_{V_1}$) and the true frequency of $V_1$ in the infected population ($P_{V_1}$).
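A small Python sketch of this conversion follows. It assumes a two-variant system in which the observed odds equal the true odds multiplied by the ratio of coefficients of detection, so that the correction factor is $q = C_{V_2} / C_{V_1}$; the function names and the `c_ratio` argument ($C_{V_1} / C_{V_2}$) are illustrative.

```python
def observed_from_true(true_prev: float, c_ratio: float) -> float:
    """Prevalence of V1 in the sampling pool, given its true prevalence."""
    weighted = true_prev * c_ratio
    return weighted / (weighted + (1.0 - true_prev))


def true_from_observed(observed_prev: float, c_ratio: float) -> float:
    """Undo the detection bias using the odds-based correction factor q."""
    q = 1.0 / c_ratio  # assumed two-variant form: q = C_V2 / C_V1
    observed_odds = observed_prev / (1.0 - observed_prev)
    true_odds = q * observed_odds
    return true_odds / (1.0 + true_odds)


if __name__ == "__main__":
    c_ratio = (0.975 * 0.8) / (0.95 * 0.6)
    p_obs = observed_from_true(0.02, c_ratio)
    print(f"observed: {p_obs:.4f}")
    print(f"corrected: {true_from_observed(p_obs, c_ratio):.4f}")
```

Working on the odds scale keeps the correction a single multiplication, which is why the same factor applies at any observed frequency.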

In the following sections, we provide examples of how to calculate the appropriate sample size for surveillance given potential biases in observed variant frequencies. We also discuss how this bias (or, in some cases, enrichment) may make it easier to detect or measure the prevalence of certain variants.

Detecting the introduction of new variants into specific populations is a common goal during a pathogen outbreak. This requires identification of variants while they are still at low frequency in the population. For example, we may be interested in determining the minimum sample size needed to have a 95% chance of detecting a variant at 2% frequency in a specific population. If this variant is biologically and epidemiologically identical to the rest of the population, its frequency in the sampling pool ($H$) will reflect its frequency in the overall population. In this case, we can apply binomial sampling theory (Equation 1) to calculate the number of sequences needed (see Fig S2 for validation of binomial sampling process):

$$n = \frac{\ln(1 - 0.95)}{\ln(1 - 0.02)} \approx 148.3 \;\Rightarrow\; n = 149$$

Most variants of interest, however, are not biologically and epidemiologically identical to the rest of the pathogen population. For example, variants can emerge that are more transmissible, such as the SARS-CoV-2 Delta variant [1, 14, 15]. This increased transmissibility can be for a variety of reasons, such as higher viral loads in infected patients or more efficient entry into host cells, all of which require adjustment to these calculations. Here, we assume that a VOC is more transmissible specifically because it causes higher viral titers in infected patients, and that this increased titer increases the testing sensitivity of the variant ($\phi_{V_1} = 0.975$) as compared to the rest of the pathogen population ($\phi_{V_2} = 0.95$). We also assume that detected infections caused by this VOC contain more virus and therefore have an increased probability ($\gamma_{V_1} = 0.8$) of meeting quality thresholds (e.g., Ct value cutoffs) than other positive samples ($\gamma_{V_2} = 0.6$). We assume that all other biological and surveillance parameters are the same between the VOC and the rest of the pathogen population.

Using these parameters, we can calculate the ratio of coefficients of detection and use this to calculate the VOC frequency we expect to see in our sample. Rearranging Equation 5 (parameters shared by both variants cancel), we see that:

$$P^{*}_{V_1} = \frac{P_{V_1} C_{V_1}}{P_{V_1} C_{V_1} + (1 - P_{V_1}) C_{V_2}} = \frac{0.02 \times (0.975 \times 0.8)}{0.02 \times (0.975 \times 0.8) + 0.98 \times (0.95 \times 0.6)} \approx 0.027$$

We then apply sampling theory as above, using the observed variant frequency (2.7%) as $P_{V_1}$. Because the variant is enriched in our population of detected infections, we find that only 110 sequences are needed to be 95% confident in detection of this variant (Fig 4B; see Fig S3A for sequence requirements for 50% confidence). Since not every sequenced sample produces a usable sequence, even after selecting for high quality samples (Fig 3), we assume a sequencing success rate of 80% for all variants ($\omega = 0.8$), which means 138 samples should be selected for sequencing in order to obtain 110 complete genomes. The same procedure could be performed to determine the sample size needed to detect a more severe variant (or any variant that has some effect on pathogen detection), provided the ratio of coefficients of detection can be estimated.
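The worked example above (2% true prevalence, testing sensitivities of 0.975 versus 0.95, sample-quality probabilities of 0.8 versus 0.6, and an 80% sequencing success rate) can be reproduced with a short Python script. This is a sketch under the assumption that only these two parameters differ between variants, so the ratio of coefficients of detection reduces to their product ratio; variable names are illustrative.

```python
import math

P_TRUE = 0.02            # true prevalence of the variant of interest
PHI = (0.975, 0.95)      # testing sensitivity: variant, rest of population
GAMMA = (0.8, 0.6)       # probability of passing the quality filter
OMEGA = 0.8              # sequencing success rate (same for all variants)
PROB = 0.95              # desired probability of detection


def detection_sample_size(prevalence: float, prob: float) -> int:
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - prevalence))


# Ratio of coefficients of detection (shared parameters cancel).
c_ratio = (PHI[0] * GAMMA[0]) / (PHI[1] * GAMMA[1])
# Frequency of the variant in the pool of high quality detected infections.
p_obs = P_TRUE * c_ratio / (P_TRUE * c_ratio + (1.0 - P_TRUE))

n_unbiased = detection_sample_size(P_TRUE, PROB)           # 149
n_rounded = detection_sample_size(round(p_obs, 3), PROB)   # 110, using 2.7%
n_unrounded = detection_sample_size(p_obs, PROB)           # 109
to_sequence = math.ceil(n_rounded / OMEGA)                 # 138

if __name__ == "__main__":
    print(f"c_ratio={c_ratio:.3f}, observed prevalence={p_obs:.4f}")
    print(n_unbiased, n_rounded, n_unrounded, to_sequence)
```

The figure of 110 corresponds to the observed frequency rounded to 2.7%; carrying the unrounded value (about 2.717%) through the same formula gives 109, so the result is sensitive to rounding at this scale.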

After a variant is first detected, sequencing is often used to monitor its frequency in the population, and to note any increases in variant frequency that may suggest epidemiological or biological trends. Therefore, we assume that we are interested in calculating the minimum sample size needed to correctly (within 25% of the true value) determine the prevalence of a variant at >10% frequency in the population with 95% confidence. If this variant is biologically and epidemiologically identical to the rest of the population, its frequency in the sampling pool ($H$) will reflect its frequency in the population. In this case, we can apply existing theory (Equation 2) to calculate the number of sequences needed:

$$n = \frac{1.96^2 \times 0.10 \times (1 - 0.10)}{(0.25 \times 0.10)^2} \approx 553.2 \;\Rightarrow\; n = 554$$

In this example, we calculate the sample size with the smallest prevalence (10%) we are interested in accurately measuring, since this requires the largest sample size. We do not apply any sort of finite population size correction [8], though this could decrease the sample size needed for prevalence estimation. We then apply sampling theory as above, using the observed variant frequency. We find that the enrichment of VOC samples among [...]

[...] resources are constrained or not enough samples can be collected, the confidence level for both monitoring and detection can be calculated from the number of samples available.
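When the number of available sequences is fixed, the same two formulas can be inverted to report the confidence achieved. The following Python sketch does this for both aims; it assumes the binomial detection model and the normal-approximation prevalence formula, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def detection_probability(n_sequences: int, prevalence: float) -> float:
    """Probability of seeing at least one variant sequence among n."""
    return 1.0 - (1.0 - prevalence) ** n_sequences


def prevalence_confidence(n_sequences: int, prevalence: float,
                          abs_precision: float) -> float:
    """Confidence that the estimate falls within +/- abs_precision."""
    # Invert n = Z^2 P (1 - P) / d^2 for Z, then map Z to a two-sided level.
    z = abs_precision * math.sqrt(n_sequences / (prevalence * (1.0 - prevalence)))
    return 2.0 * NormalDist().cdf(z) - 1.0


if __name__ == "__main__":
    for n in (50, 100, 150):
        print(n, round(detection_probability(n, 0.02), 3))
    for n in (100, 300, 554):
        print(n, round(prevalence_confidence(n, 0.10, 0.025), 3))
```

Reporting the achieved confidence alongside a result makes it easier to judge whether a failure to detect a variant is informative or simply reflects too few sequences.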

When later using these sequences to estimate VOC prevalence, it is important to note that the variant prevalence estimated from the sequence (or other variant characterization) data will be the observed variant prevalence, even if the required number of sequences are available. In other words, because the sampling pool itself is biased, the proportion of sequences that are characterized as $V_1$ is equal to $P^{*}_{V_1}$ (and not the true underlying prevalence, $P_{V_1}$).

During an infectious disease outbreak, variant detection and monitoring are ongoing processes. Focusing on sampling strategies over time will allow us to answer more realistic questions, including: what sample size is required to ensure detection of a variant in a pre-specified amount of time, or before the variant reaches a particular prevalence in the population?

To answer these questions, we assume that the same number of sequences are sampled at each time step. Even if samples are not sequenced at each time step (e.g., every day), we assume that the same number of samples from each day are included in a (e.g., weekly) sequencing batch, effectively increasing sequencing frequency to daily with some delay in final results. Given these assumptions, we can again use binomial sampling theory (Equation 1) to calculate the probability of detecting a VOC on or before time step $t$. The resulting equation takes the form of a survival function, as follows:

$$p(d \le t) = 1 - \prod_{i=1}^{t} \left(1 - P_{V_1}(i)\right)^{n}$$

Where $p(d \le t)$ is the probability of detection on or before time $t$, $n$ is the sample size per unit time, and $P_{V_1}(i)$ is the prevalence of the variant of interest in the population at time $i$. After rearranging this equation to solve for the per-timestep sample size and approximating the product with a continuous function (see Appendix for full derivation), we obtain:

$$n = \frac{-\ln\left(1 - p(d \le t)\right)}{G(t) - G(0)} \tag{7}$$

Where $G(t)$ is the cumulative density of the function used to model variant growth over time, at time $t$. In other words, we can easily estimate the necessary sample size per time unit, provided we can approximate how the variant prevalence is changing over time.
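The relationship between the survival-function form and its continuous approximation can be sketched in Python. The code below assumes logistic growth, $P(t) = 1 / (1 + b e^{-rt})$ with $b = (1 - P_0)/P_0$, whose antiderivative is $G(t) = \ln(b + e^{rt}) / r$, and it assumes time steps are indexed $1, \ldots, t$. Function names and the example inputs are illustrative.

```python
import math


def logistic_prevalence(t: float, p0: float, rate: float) -> float:
    """Variant prevalence at time t under logistic growth from p0."""
    b = (1.0 - p0) / p0
    return 1.0 / (1.0 + b * math.exp(-rate * t))


def logistic_cumulative(t: float, p0: float, rate: float) -> float:
    """Antiderivative of logistic_prevalence with respect to t."""
    b = (1.0 - p0) / p0
    return math.log(b + math.exp(rate * t)) / rate


def prob_detect_by(t: int, n_per_step: int, p0: float, rate: float) -> float:
    """Exact product form: 1 - prod over steps 1..t of (1 - P_i)**n."""
    log_miss = sum(
        n_per_step * math.log(1.0 - logistic_prevalence(i, p0, rate))
        for i in range(1, t + 1)
    )
    return 1.0 - math.exp(log_miss)


def per_step_sample_size(t: int, prob: float, p0: float, rate: float) -> int:
    """Continuous approximation: n = -ln(1 - prob) / (G(t) - G(0))."""
    area = logistic_cumulative(t, p0, rate) - logistic_cumulative(0, p0, rate)
    return math.ceil(-math.log(1.0 - prob) / area)


if __name__ == "__main__":
    n = per_step_sample_size(36, 0.95, 3 / 10_000, 0.1)
    print(n, round(prob_detect_by(36, n, 3 / 10_000, 0.1), 3))
```

Comparing the approximate sample size against the exact product is a useful sanity check, since the approximation replaces a sum over discrete steps with an integral.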

In the example below, we assume that variant prevalence follows a logistic growth curve. Logistic growth is often ascribed to variants with a fitness advantage, such as the Alpha SARS-CoV-2 variant [16] , though in this section we will assume that the variant of interest does not affect any of the parameters that go into calculating the coefficient of detection (we will relax this assumption in the following section). We assume there was a single introduction of this variant into a population of 10,000 individuals and that the growth rate is approximately 0.1 per day [17] . Now, we can use Equation 7 to calculate the per-day sample size needed to ensure detection (with 95% probability) of this Alpha-like variant within 14 days of its initial emergence:

In other words, generating $158 \times 7 = 1{,}106$ sequences per week (assuming sequences are well-distributed throughout the week) ensures a 95% probability of detection of this variant within 14 days of initial introduction. It is important to note that, given a single introduction and a growth rate of 0.1 per day, the variant will only be at a frequency of 0.04% by day 14. It may be more realistic to assume multiple introductions of a highly transmissible variant, or to compute the sample size needed to detect the variant before it reaches a particular prevalence in the population. If we instead assume 3 introductions into a population of 10,000 and the same growth rate of 0.1 per day, we can calculate that the prevalence will surpass 1% on day 36. We can again apply Equation 7 to calculate the number of sequences needed per day to ensure detection by the time the VOC prevalence surpasses 1%:

Generating 29 sequences per day, or just over 200 sequences per week, may be a much more manageable number. That said, it is important to consider the sequencing success rate (e.g., $\omega = 0.8$) when calculating the number of samples that should be selected for sequencing. To generate 203 high quality sequences per week, 254 samples will need to be selected for sequencing.
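The second scenario (3 introductions into a population of 10,000, growth rate of 0.1 per day, detection by the day prevalence first exceeds 1%) can be traced step by step in Python. This sketch assumes logistic growth and the continuous approximation for the per-day sample size; the constants mirror the text, while the helper names are illustrative.

```python
import math

P0 = 3 / 10_000   # three introductions into a population of 10,000
RATE = 0.1        # logistic growth rate per day
PROB = 0.95       # desired probability of detection
OMEGA = 0.8       # sequencing success rate


def prevalence(t: float) -> float:
    b = (1.0 - P0) / P0
    return 1.0 / (1.0 + b * math.exp(-RATE * t))


def cumulative(t: float) -> float:
    """Antiderivative of prevalence(t) for logistic growth."""
    b = (1.0 - P0) / P0
    return math.log(b + math.exp(RATE * t)) / RATE


def first_day_above(threshold: float) -> int:
    day = 0
    while prevalence(day) <= threshold:
        day += 1
    return day


target_day = first_day_above(0.01)                        # 36
per_day = math.ceil(
    -math.log(1.0 - PROB) / (cumulative(target_day) - cumulative(0))
)                                                         # 29
per_week = 7 * per_day                                    # 203
selected_per_week = math.ceil(per_week / OMEGA)           # 254

if __name__ == "__main__":
    print(target_day, per_day, per_week, selected_per_week)
```

Dividing by the sequencing success rate at the end, rather than per day, matches the weekly batching described in the text and avoids rounding up seven times.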

As discussed above, VOC prevalence may be enriched in the sampling pool, meaning that fewer sequences may be needed for confident detection of the variant. Using Equation 5, we can calculate the observed variant frequency at each time step given a growth rate and starting variant prevalence (e.g., 3 introductions into a population of 10,000) for a two-variant system as follows:

Where [...] represents the relative coefficients of detection between the general pathogen population ($V_2$) and the [...] over time, at time $t$, and we again assume an initial prevalence of 3 in 10,000 and a growth rate of 0.1 per day (see Appendix for full derivation). As expected, the enrichment of the VOC in the sampling pool decreases the number of sequences needed for detection (Fig 5). We can also use Figure 5 (and Fig S4) to evaluate the marginal costs and benefits of changing the number of samples selected for sequencing, which may allow for the design of surveillance systems that take maximum advantage of available resources.
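A hedged Python sketch of the enriched case follows. It assumes a two-variant system with logistic growth, in which applying the detection bias at each time step yields another logistic curve with $b$ replaced by $b / c$, where $c = C_{V_1} / C_{V_2}$. This substitution is derived here for illustration and may be written differently in the authors' appendix; the function name and `c_ratio` argument are hypothetical.

```python
import math


def per_step_sample_size(t: int, prob: float, p0: float, rate: float,
                         c_ratio: float = 1.0) -> int:
    """Per-step sequences needed to detect an enriched variant by time t.

    Observed prevalence is P*(t) = 1 / (1 + (b / c_ratio) * exp(-rate * t)),
    so its antiderivative is ln(b / c_ratio + exp(rate * t)) / rate.
    """
    b_obs = (1.0 - p0) / p0 / c_ratio
    area = (math.log(b_obs + math.exp(rate * t)) - math.log(b_obs + 1.0)) / rate
    return math.ceil(-math.log(1.0 - prob) / area)


if __name__ == "__main__":
    ratio = (0.975 * 0.8) / (0.95 * 0.6)
    print(per_step_sample_size(36, 0.95, 3 / 10_000, 0.1))
    print(per_step_sample_size(36, 0.95, 3 / 10_000, 0.1, c_ratio=ratio))
```

Under these assumptions, the same detection target is met with fewer sequences per day when the variant is enriched in the sampling pool, and with more when it is under-detected (`c_ratio` below 1).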

We have implemented the method described above in a publicly available spreadsheet (https://github.com/HopkinsIDD/VOCsamplesize). This spreadsheet can be used to calculate the required sample size in each of the three scenarios described above: cross-sectional sampling for variant detection, cross-sectional sampling for measuring variant prevalence, and periodic sampling for variant detection. The equations implemented in this spreadsheet can be used both backwards and forwards: a user can input epidemiological and biological parameters and use them to determine the sample size needed to achieve the primary aim (detection or measuring prevalence) given a desired confidence level, or they can input a desired sample size and use it to calculate confidence in the results.


Designing pathogen surveillance systems must begin with identifying the primary purpose or key questions to be answered with the system. For example, surveillance strategies will be different when the goal is early detection of a newly introduced VOC versus when the goal is measuring the prevalence of an existing variant [12]. In either case, there are a myriad of factors that influence which infections are ultimately sequenced. Here we present a framework for thinking about these factors, and we show that their effects can be summarized in a single number, the coefficient of detection. This coefficient characterizes how biological and logistical factors can lead to VOC enrichment in a sample (leading to earlier detection) while also biasing measurement of the true underlying VOC prevalence. Depending on the purpose of surveillance, it will be important to account for these effects in sample size calculations and subsequent reporting of results. The work presented here aims to provide an accessible set of methods for doing so, and a general approach that can be extended to other settings and study designs. In addition to providing evidence-based guidance for sampling design, our framework can be applied retrospectively to evaluate the accuracy of a detection or variant prevalence result based on the number of samples sequenced.

A perceived barrier to using the approach outlined here may be lack of knowledge of the exact parameters that are summarized by the coefficient of detection. However, it is not necessary to know individual parameter values when using our framework as long as we can approximate their ratio, since all calculations rely solely on the ratio of coefficients of detection. Likewise, parameters that have the same value across variants need not be specified at all.

Although decreasing the number of parameters that need to be specified makes the framework easier to use, there is still value in breaking down the process of surveillance into its component parts. For instance, it may be difficult to estimate a single pathogen testing rate in settings without consistent testing of asymptomatic individuals (e.g., hospitals or settings with limited testing capacity). But if the asymptomatic and symptomatic testing rates are separated into two parameters, we can assume the asymptomatic testing rate is negligible (or at least similar between variants) and focus on the symptomatic testing rate, which may be easier to quantify.

In considering the full process from infection detection to variant characterization, we have aimed to make our framework flexible enough to handle situations not explicitly discussed above. Although most of the examples presented in this manuscript focus on a two-variant model, the framework is set up to allow for exploration of multiple variants simultaneously (see Appendix). Further, while we focus on detecting variants undergoing logistic growth, the sample size needed to detect a variant can be calculated for any growth function as long as the functional form and underlying parameters are known. Additionally, while we have tried to identify the key processes that affect pathogen detection, the coefficient of detection could be modified to incorporate other parameters that differ between variants, or to include factors that may affect which sequences produce complete genomes (i.e., factors that affect the variant characterization process shown in the bottom part of Fig 3). For example, we could allow the sequencing success rate ($\omega$) to differ between variants (e.g., due to differences in primer binding when using PCR-based methods for variant characterization or amplification prior to sequencing), despite the use of an initial sample quality filter ($\gamma$). Finally, the framework could be extended to any pathogen for which there is some method (e.g., variant-specific PCR assays) to differentiate pathogen lineages with potentially different epidemiological or biological processes.

While this manuscript covers sample size calculations for variant detection given both cross-sectional and periodic sampling approaches, additional work is needed to determine appropriate sample sizes for measuring VOC prevalence with periodic sampling. This would enable interpretation of small changes in variant prevalence over time that could serve as an early indication of changing epidemiological dynamics in the population. Additionally, when multiple variants are present in the population, accurate prevalence estimation of one VOC necessarily constrains the potential prevalence values of another VOC; expanding the framework to co-estimate prevalences for multiple VOCs may make it possible to leverage this interdependence and further reduce the sample sizes required for accurate monitoring.

When designing surveillance systems based on this framework, it is important to remember that infection and sampling processes can be heterogeneous in ways not captured by our model. Future work could consider the effects of spatial heterogeneity on transmission and sampling, the impact of time-varying model parameters in the continuous surveillance context, or could extend the framework to metapopulations. Given the current framework and assumption of homogeneity, samples selected for sequencing should be selected as randomly as possible, or selected in a way that maximizes the geographic and temporal distribution of sequences.


Whole genome sequencing has revealed the importance of characterizing and monitoring specific pathogen variants during an ongoing epidemic. As this technology becomes more accessible and central to our understanding of established and emerging pathogens, it is important that we improve the rigor with which we design studies that use this data. Sophisticated modeling approaches have been invaluable in improving how we collect and interpret pathogen genomic information, but most are neither nimble nor accessible enough to be widely used during a crisis.

Similarly, ad-hoc approaches or classical study designs may not lead to the optimal allocation of resources. Here we have attempted to lay out a framework that is widely accessible yet still accounts for many of the factors that uniquely impact pathogen genomic studies and surveillance programs. As the SARS-CoV-2 pandemic continues and new infectious threats arise, we hope this approach will help better guide the collection of data that has proven critical to the pandemic response and serve as a starting point for further methodological innovation.

We thank Edyth Parker for her insightful comments on the manuscript. Funding was provided by the Bill and Melinda Gates Foundation INV-025321 (S.W.) and OPP1195157 (S.W. and J.L.).

Fig S2 (caption, partially recovered). [...] which to sample. In each simulation, $\omega = 0.8$ and infections that progress between model states are selected stochastically using a binomial process. Blue lines = binomial distribution (i.e., sampling with replacement, an approximation of the simulated sampling process) given stated sampling fraction (from $H$) and variant prevalence; green dotted lines = hypergeometric distribution (i.e., sampling without replacement, the exact sampling process) given stated sampling fraction and variant prevalence; red vertical lines = marker of input variant frequency. The binomial distribution approximates the simulated process well, except when nearly all detected infections are sampled (as expected) or the variant prevalence is very low.


. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted January 1, 2022. ; https://doi.org/10.1101/2021.12.30.21268453 doi: medRxiv preprint

Given the coefficient of detection ($C_{V_i}$) for each variant in the pathogen population, we can calculate the actual prevalence of each variant ($P_{V_i}$) from what we observed in the pool of high-quality detected infections ($H$). In the sections below, we use a property of odds ratios to calculate a correction factor $q$ that allows for this conversion.

In a two-variant system (i.e., a system with one variant of interest, $V_1$, that is compared to the rest of the population, $V_2$), the odds of $V_1$ are:

Similarly, the observed odds (the odds of $V_1$ in $H$) are:

We define a bias factor, q, such that:

If we solve the above equation for q, we obtain:

Because we know that:

, we can use the correction factor q to easily calculate the true proportion of any variant i in a population from its observed proportion in the sample H:
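As a minimal sketch of this conversion, the snippet below assumes that each variant appears in $H$ in proportion to its true prevalence times its coefficient of detection, and that $q$ is defined as the ratio of observed to true odds, so that $q = C_{V_1}/C_{V_2}$. Both this convention and the function names are illustrative assumptions rather than definitions taken from the text.

```python
def observed_prevalence(p_true, c1, c2):
    """Observed proportion of V1 in H, assuming detection proportional to C * P."""
    return c1 * p_true / (c1 * p_true + c2 * (1.0 - p_true))


def true_prevalence(p_obs, c1, c2):
    """Invert the bias by dividing the observed odds by q = c1 / c2 (assumed convention)."""
    q = c1 / c2
    odds_obs = p_obs / (1.0 - p_obs)   # undefined when p_obs == 1
    odds_true = odds_obs / q
    return odds_true / (1.0 + odds_true)


# Hypothetical inputs: V1 is 1.5 times as likely to be detected as V2.
p_obs = observed_prevalence(0.10, c1=1.5, c2=1.0)
p_back = true_prevalence(p_obs, c1=1.5, c2=1.0)
```

Under these assumptions, a variant that is over-detected (here $C_{V_1} > C_{V_2}$) is over-represented in $H$, and the round trip recovers the input prevalence.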

In a 3-variant system, the odds of $V_1$ are as follows:

Using this, we can calculate $q_{V_1,123}$, the correction factor between the true and observed odds when $V_1$ is the variant of interest in a 3-variant system with $V_1$, $V_2$, and $V_3$:


At this point, we recognize that

is equivalent to $P_{V_2}$ if variants 2 and 3 are the only variants in that system. Assuming a 2-variant system with variants 2 and 3 only, we can use the results of the previous section to write $P_{V_2}$ as a function of the observed odds in this system:

where $\text{odds}^{*}_{V_2,23}$ is the observed odds of $V_2$ in this 2-variant system with $V_2$ and $V_3$ (i.e.,

) and $q_{V_2,23}$ is the correction factor in this system (which we know to be equal to

). Therefore, we can continue our calculation of $q_{V_1,123}$ by substituting these values as follows:

We can extend the conclusion above to derive a formula for the correction factor q in a system with n variants:

The exact value of $q$, and therefore the true value of $P_{V_1}$, can be calculated recursively given only the coefficients of detection and observed proportions of the variants in the population.
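Assuming that observed proportions are proportional to $C_{V_i} P_{V_i}$, the outcome of the recursion can be cross-checked with a direct normalization: divide each observed proportion by its coefficient of detection and renormalize. The sketch below uses this shortcut rather than the recursive formula itself, and again treats $q$ as observed odds divided by true odds; the function names and example values are hypothetical.

```python
def true_prevalences(p_obs, c):
    """Recover true proportions, assuming observed ones are proportional to C_i * P_i."""
    if len(p_obs) != len(c):
        raise ValueError("p_obs and c must have the same length")
    weights = [p / ci for p, ci in zip(p_obs, c)]
    total = sum(weights)
    return [w / total for w in weights]


def correction_factor(p_obs, c, i=0):
    """q for variant i: observed odds divided by true odds (assumed convention)."""
    p_true = true_prevalences(p_obs, c)
    odds_obs = p_obs[i] / (1.0 - p_obs[i])
    odds_true = p_true[i] / (1.0 - p_true[i])
    return odds_obs / odds_true


# Hypothetical three-variant example: simulate the bias, then undo it.
c = [1.5, 1.0, 0.5]
p_true_in = [0.2, 0.5, 0.3]
raw = [ci * pi for ci, pi in zip(c, p_true_in)]
p_obs = [x / sum(raw) for x in raw]

p_recovered = true_prevalences(p_obs, c)
q1 = correction_factor(p_obs, c, i=0)
```

Only the observed proportions and the coefficients of detection are needed as inputs, which matches the requirement stated above.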


The probability of detecting a variant (i.e., generating one or more high-quality sequences indicating a patient was infected by this variant) on or before time $t$ is equal to one minus the probability of not detecting it at any time between $t_0$ and $t$. In other words, regardless of the time unit used, this probability can be written as:

$$\Pr(\text{detection on or before time } t) = 1 - \prod_{x=t_0}^{t} \Pr(\text{no detection at time } x)$$

Assuming a binomial sampling process, the probability of detection at time $x$ is equal to one minus the probability of not detecting the variant:

$$\Pr(\text{detection at time } x) = 1 - (1 - P_x)^n$$

where $n$ is the sample size and $P_x$ is the prevalence of the variant at this time step. Therefore, we can write the probability of detecting the variant on or before time $t$ as:

$$\Pr(\text{detection on or before time } t) = 1 - \prod_{x=t_0}^{t} (1 - P_x)^n$$

where $n$ is the per-time step sample size and $P_x$ is the prevalence of the variant at time $x$. This assumes that the same number of samples is selected at every time step, and that the prevalence of the variant at each time step is known. We can rewrite this equation to solve for the per-time step sample size:

$$n = \frac{\ln\left(1 - \Pr(\text{detection on or before time } t)\right)}{\ln\left(\prod_{x=t_0}^{t} (1 - P_x)\right)}$$
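The cumulative detection probability described here, one minus the product over time steps of $(1 - P_x)^n$, can be evaluated directly when the prevalence at each time step is known. The following is a minimal sketch; the function names and the example prevalence trajectory are hypothetical.

```python
import math


def prob_detect_by_t(prevalences, n):
    """1 - prod_x (1 - P_x)^n for a fixed per-time-step sample size n."""
    log_miss = sum(n * math.log1p(-p) for p in prevalences)
    return 1.0 - math.exp(log_miss)


def samples_per_step(prevalences, target_prob):
    """Smallest integer n whose cumulative detection probability reaches target_prob."""
    log_miss_one = sum(math.log1p(-p) for p in prevalences)  # ln prod_x (1 - P_x)
    return math.ceil(math.log(1.0 - target_prob) / log_miss_one)


# Hypothetical prevalence that doubles at each of five time steps.
prev = [0.001, 0.002, 0.004, 0.008, 0.016]
n = samples_per_step(prev, target_prob=0.95)
achieved = prob_detect_by_t(prev, n)
```

Working on the log scale avoids multiplying many numbers close to one, and rounding up gives the smallest whole number of sequences per time step that meets the target.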

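We can approximate the value of the product with a continuous function using the Volterra product integral. For a scalar function $f$ and real values of $a$ and $b$:

$$\prod_{a}^{b} \left(1 + f(x)\,dx\right) = \exp\left(\int_{a}^{b} f(x)\,dx\right)$$

Let $f(x)\,dx = -P_x\,dx$. This allows us to write the product of $1 - P_x$ as:

$$\prod_{x=t_0}^{t} (1 - P_x) \approx \exp\left(-\int_{t_0}^{t} P_x\,dx\right)$$

If we plug this into our per-time step sample size calculation we get:

$$n = \frac{-\ln\left(1 - \Pr(\text{detection on or before time } t)\right)}{G(t) - G(t_0)}$$

where $G(t)$ is the integral of $g(t)$. In other words, $G(t)$ is the cumulative density function of the growth function $g$ (with $g(x) = P_x$) used to model the change in the variant frequency over time.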
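To illustrate how the continuous approximation is used, the sketch below computes the per-time step sample size as $n \approx -\ln(1 - \text{target}) / (G(t) - G(t_0))$ for any antiderivative $G$ supplied as a function. The function name and the linear-growth example are hypothetical, chosen only because the antiderivative is easy to check by hand.

```python
import math


def samples_per_step_continuous(G, t0, t, target_prob):
    """n ~ -ln(1 - target_prob) / (G(t) - G(t0)), from the product-integral approximation."""
    cumulative_prevalence = G(t) - G(t0)
    return -math.log(1.0 - target_prob) / cumulative_prevalence


# Hypothetical example: prevalence grows linearly, g(x) = 0.001 * x,
# so one antiderivative is G(x) = 0.0005 * x**2.
n_approx = samples_per_step_continuous(
    lambda x: 0.0005 * x**2, t0=0, t=30, target_prob=0.95
)
```

Because the approximation rests on $\ln(1 - P_x) \approx -P_x$, it is closest to the discrete calculation when the per-step prevalence is small.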


B.2 Logistic growth of variant frequency

Growth in the prevalence of variants of interest (i.e., variants with some fitness advantage) is often modeled by logistic growth functions. In other words:

$$g(t) = \frac{1}{1 + a e^{-rt}}$$

where $r$ is the per-time step growth rate and $a = \frac{1}{P_{t_0}} - 1$, with $P_{t_0}$ the prevalence at the initial time $t_0$. The cumulative density of this probability distribution can be computed as follows:

$$G(t) = \int \frac{1}{1 + a e^{-rt}}\,dt = \frac{1}{r} \ln\left|a + e^{rt}\right| + C$$
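Combining the logistic antiderivative $G(t) = \frac{1}{r}\ln|a + e^{rt}|$ with the continuous sample size expression gives a closed-form calculation. This is a minimal sketch that assumes $t_0 = 0$ and $a = 1/P_0 - 1$ for an initial prevalence $P_0$; the function names and parameter values are hypothetical.

```python
import math


def logistic_prevalence(t, p0, r):
    """g(t) = 1 / (1 + a * exp(-r t)) with a = 1/p0 - 1, so that g(0) = p0."""
    a = 1.0 / p0 - 1.0
    return 1.0 / (1.0 + a * math.exp(-r * t))


def logistic_cumulative(t, p0, r):
    """Antiderivative G(t) = ln(a + exp(r t)) / r, constant of integration omitted."""
    a = 1.0 / p0 - 1.0
    return math.log(a + math.exp(r * t)) / r


def samples_per_step_logistic(t, p0, r, target_prob):
    """Per-time-step sample size to detect by time t, starting from t0 = 0."""
    area = logistic_cumulative(t, p0, r) - logistic_cumulative(0.0, p0, r)
    return -math.log(1.0 - target_prob) / area


# Hypothetical scenario: initial prevalence 0.01%, growth rate 0.1 per time step.
n_logistic = samples_per_step_logistic(t=30, p0=0.0001, r=0.1, target_prob=0.95)
```

The constant of integration cancels in the difference $G(t) - G(0)$, so it can be left out of the code.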

In a two-variant system, the observed prevalence of a particular variant of interest in a sample of high-quality detected infections ($H$) is a function of the true prevalence and the ratio between coefficients of detection:

Assuming $P_{V_1}$ can be computed at any given time step using a logistic model, we can calculate the observed frequency distribution as follows:

Because in this case the observed frequency function takes the same form as the actual frequency function, we can easily calculate $G^{*}(t)$, the cumulative density of the observed variant frequency:

$$G^{*}(t) = \frac{1}{r} \ln\left|b + e^{rt}\right| + C$$
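The observed-frequency version can be sketched in the same way. The snippet assumes a two-variant bias model in which variants are detected in proportion to $C_{V_i} P_{V_i}$; under that assumption the observed curve is logistic with $b = a\,C_{V_2}/C_{V_1}$. This expression for $b$ is derived from the stated assumption rather than quoted from the text, and all names and values are hypothetical.

```python
import math


def observed_logistic_prevalence(t, p0, r, c_ratio):
    """Observed prevalence of V1 at time t, where c_ratio = C_V1 / C_V2 (assumed model)."""
    b = (1.0 / p0 - 1.0) / c_ratio
    return 1.0 / (1.0 + b * math.exp(-r * t))


def observed_cumulative(t, p0, r, c_ratio):
    """G*(t) = ln|b + exp(r t)| / r, constant of integration omitted."""
    b = (1.0 / p0 - 1.0) / c_ratio
    return math.log(abs(b + math.exp(r * t))) / r


def samples_per_step_observed(t, p0, r, c_ratio, target_prob):
    """Per-time-step sample size using the observed (biased) frequency, with t0 = 0."""
    area = observed_cumulative(t, p0, r, c_ratio) - observed_cumulative(0.0, p0, r, c_ratio)
    return -math.log(1.0 - target_prob) / area


# Hypothetical comparison: V1 detected 1.5 times as readily as V2, versus no bias.
n_biased = samples_per_step_observed(30, 0.0001, 0.1, c_ratio=1.5, target_prob=0.95)
n_unbiased = samples_per_step_observed(30, 0.0001, 0.1, c_ratio=1.0, target_prob=0.95)
```

In this sketch, a variant that is detected more readily than the rest of the population accumulates observed prevalence faster, so fewer sequences per time step are needed to reach the same detection probability.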


Tracking SARS-CoV-2 variants

Investigation of novel SARS-CoV-2 variant: Variant of Concern 202012/01

Emergence and rapid spread of a new severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) lineage with multiple spike mutations in South Africa. bioRxiv. medRxiv

Genomics and epidemiology of the P.1 SARS-CoV-2 lineage in Manaus

SARS-CoV-2 Variant of Concern

Population impact of SARS-CoV-2 variants with enhanced transmissibility and/or partial immune escape

Sample size calculation for phylogenetic case linkage

European Centre for Disease Prevention and Control. Sequencing of SARS-CoV-2: first update

Genomic surveillance at scale is required to detect newly emerging strains at an early timepoint. bioRxiv. medRxiv

Sample Size Calculator Detecting COVID-19 Variants. In: Variant Detection Calculator

Global disparities in SARS-CoV-2 genomic surveillance. medRxiv

European Centre for Disease Prevention and Control. Guidance for representative and targeted genomic SARS-CoV-2 monitoring

Biostatistics: A Foundation for Analysis in the Health Sciences

The reproductive number of the Delta variant of SARS-CoV-2 is far higher compared to the ancestral SARS-CoV-2 virus

Early epidemiological signatures of novel SARS-CoV-2 variants: establishment of B.1.617.2 in England. bioRxiv. medRxiv

Assessing transmissibility of SARS-CoV-2 lineage B.1.1.7 in England