key: cord-320953-1st77mvh authors: Overton, ChristopherE.; Stage, HelenaB.; Ahmad, Shazaad; Curran-Sebastian, Jacob; Dark, Paul; Das, Rajenki; Fearon, Elizabeth; Felton, Timothy; Fyles, Martyn; Gent, Nick; Hall, Ian; House, Thomas; Lewkowicz, Hugo; Pang, Xiaoxi; Pellis, Lorenzo; Sawko, Robert; Ustianowski, Andrew; Vekaria, Bindu; Webb, Luke title: Using statistics and mathematical modelling to understand infectious disease outbreaks: COVID-19 as an example date: 2020-07-04 journal: Infect Dis Model DOI: 10.1016/j.idm.2020.06.008 sha: doc_id: 320953 cord_uid: 1st77mvh During an infectious disease outbreak, biases in the data and complexities of the underlying dynamics pose significant challenges in mathematically modelling the outbreak and designing policy. Motivated by the ongoing response to COVID-19, we provide a toolkit of statistical and mathematical models beyond the simple SIR-type differential equation models for analysing the early stages of an outbreak and assessing interventions. In particular, we focus on parameter estimation in the presence of known biases in the data, and the effect of non-pharmaceutical interventions in enclosed subpopulations, such as households and care homes. We illustrate these methods by applying them to the COVID-19 pandemic. Mathematical epidemiology is a well-developed field. Since the pioneering work of Ross in malaria modelling [67] and Kermack and McKendrick's general epidemic models [40] , there has been gathering interest in using mathematical tools to investigate infectious diseases. The allure is clear, since mathematical models can provide powerful insight into how these complex systems behave, which in turn can enable these problems to be better controlled/prevented. Not only is the power of the mathematical tools increasing, but the availability of data on infectious diseases, whether this be a rapid release of data during an outbreak or detailed collection of data for endemic pathogens, is increasing. Rapid interpretation of epidemiological data is critical for the development of effective containment, suppression and mitigation interventions, but there are many difficulties to interpreting case data in real-time. These include interpreting symptom progression and fatality ratios with delay distributions and right-censoring, exacerbated by exponential growth in cases leading to the majority of case data being on recently infected individuals; lack of clarity and consistency in denominators; inconsistency of case definitions over time and the eventual impact of interventions and changes to behaviour on transmission dynamics. Mathematical and statistical techniques can help overcome some of these challenges to interpretation, aiding in the development of intervention strategies and management of care. Examining key epidemiological quantities alongside each other in a transmission model can provide quantitative insights into the outbreak, testing the potential impact of intervention strategies and predicting the risk posed to the human (or animal) host population and healthcare preparedness. Mathematical modelling has been used as part of the planning process during outbreak response by governments worldwide for many recent outbreaks. For example the UK Department of Health has a long established committee Scientific Pandemic Influenza group on Modelling, or SPI-M to advise on new and emerging respiratory infections [22] . One of the largest instances of such an outbreak in recent history was the 2009 H1N1 pandemic. The World Health Organisation developed a network of modelling groups and public health experts to work on exploring various characteristic of the outbreak [12, 78] . These ranged from characterising the dynamics of the outbreak to investigating the effectiveness of different intervention strategies. This integration of mathematics into policy design indicates the important insights that modelling and statistics can provide. This paper is a collection of work-streams addressing various technical questions faced by the group as part of the ongoing response to COVID-19, and as such is written to be reflective of the experience we have gone and are currently going through. Therefore, to aid the reader each section includes results and a short discussion. Many of the questions and techniques presented here can be further developed as the availability of data and research interests evolves, but are compiled into this manuscript as an overview of methodology and scientific approaches beyond the standard SIR textbook model that benefit the ongoing efforts in tackling this and other outbreaks. First documented in December 2019, an outbreak of community-acquired pneumonia began in Wuhan, Hubei Province, China. In January, this outbreak was attributed to a novel coronavirus, SARS-CoV-2. The initial spread of the pathogen in Wuhan was fast, and after a period of case-finding and contact tracing, China moved to implement a 'shutdown' of Wuhan on January 23, and other cities in China the following days, to try to suppress the growth of the epidemic. These measures may have succeeded at slowing down the rate at which cases have been seeded elsewhere, but in many countries initial importation of cases and transmission has not been contained. Countries around the world are now seeing outbreaks that are overwhelming, or have the potential to overwhelm, healthcare systems and cause a high number of deaths even in high-income countries [64] . While the majority of documented symptomatic cases are mild, characterised in many reports by a persistent cough and fever, a significant proportion of these individuals go on to develop pneumonia, with some then developing acute respiratory failure and a small proportion of overall cases becoming fatal. Severity of symptoms has been observed to increase with age and with the presence of underlying health conditions such as diabetes [23] and cardiac conditions, with some evidence that severity of symptoms might depend on gender and ethnicity [31, 65, 82, 83, 85] . SARS-CoV-2 has a fast doubling time (the time it takes for the number of cases in the region to double, estimated at approximately 3 days [63] ) and, potentially, a very large 0 R (the average number of infections caused by each infected individual, with estimates ranging from 1.4 to 6.47 [48, 50, 51, 59] ). It is possible that there is a significant degree of asymptomatic and/or pre-symptomatic transmission [46, 52, 56] , though without robust serosurveys, this is difficult to quantify with certainty. These characteristics result in the pathogen being able to spread widely, rapidly and undetected, presenting a significant risk to public health. Typically, the aim of an intervention strategy would be to push and keep the reproduction number t R , defined as the average number of cases generated by a typical infective at time t, below 1. At this point each infected individual subsequently infects, on average, less than one individual, such that the number of cases should decline. The basic reproduction number, 0 R , represents the initial value of t R , before any intervention is put in place and the population can be assumed to be fully susceptible. High 0 R , fast growth, and possible pre-or asymptomatic infection make the design of potential interventions, and the modelling that would inform them, particularly challenging. Large values of 0 R mean a substantial amount of transmission needs to be halted; fast growth causes the number of cases in the absence of interventions to rise rapidly, so that the time scale of interventions to reduce 0 R must also be fast in order to effect substantive early changes on a population level; finally, the resulting interventions must encompass possible pre-and asymptomatic cases, a challenging prospect when in many instances these individuals are indistinguishable from healthy individuals. Consequently, we must consider the possibility of interventions that are massively disruptive to society and may have to be sustained for a long period of time in order to cause the number of infections to decline towards zero [25] . If infections remain, and the susceptible proportion of the population remains above the herd immunity threshold, these interventions must be upheld to prevent a second wave of the epidemic. There is not yet conclusive evidence as to the degree and duration of immunity conferred by infection with SARS-CoV-2 nor the feasibility of a vaccine, the timeline for which is unlikely to be any time shorter than 18 months away at the time of writing [35] . Therefore, short term extreme interventions are not as effective as they might be in other circumstances, since after their removal there remains a long period of time in which cases can rise again. The longer these significantly suppressive and disruptive interventions are in effect, the more severe the effect on the economy, and broader societal health and well-being. Furthermore, adherence to interventions will likely vary with their duration and severity. We are further challenged by the lack of transferable intuition. Early work looked at intuition gained from SARS and MERS outbreaks, also caused by coronaviruses. Some parameters do appear to be similar to these pathogens, such as the average length of the incubation period [44, 79, 80] . However, there are also clear differences, with both SARS and MERS being more fatal, but seemingly less efficient at spreading since they did not seed major global pandemics. Another complication is the spread of the infection during the Chinese Spring Festival, a time period during which movement, social, and contact patterns vary significantly. This presents significant challenges as experience and intuition from other studies regarding population mixing and spatial patterns must either be modified or are invalid. Furthermore, the pandemic has received a proportionately larger level of public attention than e.g. the 2009 H1N1 pandemic [19, 68] , largely boosted by social media. This greater level of public awareness, and the successive, staggered interventions placed to prevent disease spread are responsible for significant variations in behaviour [16, 27] and adherence to public guidance both in China and abroad. The structure of this paper follows two main themes. In Section 2, we discuss various biases that are present in outbreak data and techniques for estimating epidemiological parameters. Accounting for biases and producing robust parameter estimates is important throughout the duration of an epidemic, both for increasing our understanding of the underlying dynamics, and for feeding into models. Firstly, we discuss a bias-corrected method for estimating the incubation period, which can also be applied to serial intervals, onset-to-death time, and other delay distributions. We then present a method for estimating the true growth rate of the epidemic, accounting for the bias encountered since infected individuals may be exported from the region. Our next method is a tool for estimating the expected size of the next generation of infectives based on the rate of observed cases. This tool provides insight into the size of small outbreaks, which can inform decision making when trying to prevent a major outbreak taking off. In Section 3, we propose a variety of mathematical models looking at disease impact and intervention strategies, with particular focus on non-pharmaceutical interventions due to the current lack of widely deployable, targeted pharmaceutical treatments. These models focus on enclosed populations, since this is the level at which most interventions are implemented. Since the disease is particularly fatal in the elderly and other at-risk groups, we develop a care home model to investigate how the pathogen may spread through care homes. We also develop household models to investigate the impact of different intervention/control strategies. These models can inform policy design for mitigating or controlling epidemic spread. Finally, in the context of relaxing strong social distancing policies, we investigate the extinction probability of the pathogen. We first consider the extinction probability after lifting restrictions. We then develop a household-based contact tracing model, with which we investigate the extinction probability under weaker isolation policies paired with contact tracing, thus shedding light on possible combinations of interventions that allow us to feasibly manage the infection while minimising the social impact of control policies. 2 . Biases and estimation during outbreaks 2.1 . Potential biases in the outbreak data Techniques are constantly developing that enable higher volumes of more accurate data to be collected real-time during an epidemic. These data present a large opportunity for analysis to gain insight into the pathogen and the dynamics of the outbreak. However, although the quality of the data is constantly increasing, there are still many biases present. Some of these are due to the data collection methods, and in an ideal world we would be able to eliminate them, and some are simply due to the nature of the outbreak, and will be present regardless of data collection methods. During an outbreak, many parameters depend on delay distributions (the length of time between two events), such as the time from infection to symptom onset (the incubation period). If an individual can be followed indefinitely, it is easy to determine the length of these events. However, in reality only events that occur before a given date are observed. Therefore, the data is subject to censoring and truncation issues. In the incubation period, for example, censoring comes into play since, if we have observed an infection but the individual has not yet developed symptoms, we only have a lower bound on how long it will take them to develop symptoms. To account for this, we can instead condition on observing symptom onset before the cut-off date. However, this leads to a truncation issue, since individuals who were infected close to the cut-off date will only be observed if they have a short incubation period, which leads to an overexpression of short delays. The number of cases tends to grow exponentially during the early stages of an outbreak, causing the force of infection and the number of reported cases to increase with time. This further complicates the truncation issue since not only are recent cases truncated but they also account for the majority of cases. The growing force of infection also needs to be accounted for, since if the potential time of infection is interval-censored rather than observed directly, the probability that the case was infected in each day of that interval is not constant. In theory, both of these biases are relatively straightforward to account for. In practice however, there are other biases in the data. One of the major biases is the reporting rate. Although the total number of cases may be reasonably described as growing exponentially with a constant rate in the early stages of an outbreak, high-resolution data may exhibit more complex behaviour. This can be due to a variety of reasons, such as the workload becoming overwhelming, the availability of individual-level data decreasing, the laboratories or offices slowing down activity over the weekend, the case definition changing, the testing capabilities increasing, and so on. Another uncertainty arises since generally only the date of each event is recorded rather than the time. This presents a large window of uncertainty in the length of the delay, since the time of each event can vary up to 24 hours, and for a delay distribution, which depend on two events, it could vary by up to 48 hours. Travel rate is another bias present in the data. For example, this changes the density of observed cases in a region, which can change the apparent growth rate. Intervention strategies present a further bias because this can change the growth rate of the epidemic and the reporting rate. Additionally, estimates of certain parameters may vary depending on the interventions that are implemented, so these need to be considered carefully. To model the incubation period, we require information regarding when an individual was infected and when they expressed symptoms. Observing exact time of infection is unlikely, but it can be possible to find potential exposure windows. We consider three different data sets. The first two consist of individuals who travelled from Wuhan before expressing symptoms. We can assume these individuals were infected in Wuhan, since at the time of this data, the force of infection was significantly higher in Wuhan than elsewhere. The length of time spent in Wuhan therefore provides a window during which each individual became infected, and for many of these individuals we also have the date of symptom onset. In the early stages, the growth rate in reported cases was constant, and dependent on the epidemic growth rate in Wuhan and the rate at which people left Wuhan. By using travel to estimate the true number of cases, we estimate the exponential growth rate in Wuhan as 0.25 r = (see Section 2.3). Therefore, the force of infection on day i, ( ) g i , is proportional to 0.25 e i . After 23 January when significant travel bans were introduced, the rate at which individuals left Wuhan diminished significantly, causing the reporting rate for our sample dataset to suddenly drop. This occurs since cases are only included if we have a fixed window of time spent in Wuhan prior to developing symptoms. Therefore, if the data is truncated after 23 January, the reporting rate must be appropriately adjusted. This is illustrated in Figure 1a . The difference between these two datasets is the truncation date, with the first truncated at 20 January and the second at 9 February. The third dataset contains cases that were infected through a discrete infection event, such as spending time with a known infected case. In this ''non-Wuhan" dataset, the reporting rate is constant and the force of infection can be assumed constant over each exposure window. The source we use for these three data sets is a publicly available line-list [74] . Incubation periods, and many other delay distributions, are generally observed to have right skewed distributions. We therefore choose to use a Gamma distribution, though other distributions can also be applied using the proposed methods, such as Weibull and Log-normal. To fit the data, we use maximum likelihood estimation. To adjust for the biases we use a ''forwards'' approach [55, 69, 73, 75] , where we condition on the time of the first event, time of exposure, and find the distribution looking forward to the second event, time of symptom onset. For a data point { } , , infection occurs between i a and i b , and i y is the symptom onset date, the likelihood function is given by where ( ) g ⋅ is the density function of the infection date and ( ) f θ ⋅ is the density function of the incubation period parameterised by θ. From this, the likelihood function for our dataset X is given by This approach is independent of the reporting rate bias, since the reporting rate depends on the date an individual leaves Wuhan ( i b ), which is conditioned against (see Appendix 5) . We use the mean and standard deviation to characterise the MLE. Since the tail of the incubation period is important when designing quarantine strategies, we then calculate the probability that the incubation period is longer than 14 days and find the minimum day by which 99% of cases will have expressed symptoms (excluding true asymptomatic cases). We also investigate the reporting date uncertainty mentioned in Section 2.1 by considering the different extremes that the data could represent. This is achieved through adding or subtracting a day to all recorded data. Methods accounting for truncation and growth biases in epidemic data have been discussed widely in the literature [36, 55, 72, 76] , however there are fewer applications to outbreaks [24] . In the context of COVID-19, estimates have considered growing force of infection, for example [44] , and some approaches have considered truncation, for example [47] . However, these attempts do not adjust for the reporting rate in the data or use the correct force of infection, causing the incubation period to be overestimated. Although the method presented here is independent of the reporting rate, other approaches for estimating the incubation period are not. Here we demonstrate the importance of truncation (Table 1) . We use the data truncated at 20 January, which has exposure windows between 1 December and 19 January. This data set is chosen since it is most sensitive to truncation due to the exponentially growing force of infection and high reporting rate. Without accounting for truncation, the length of the incubation period is significantly underestimated, which could have a large impact on the success of intervention strategies. To demonstrate the effectiveness of the bias correction method, we compare three different data sets ( Table 2) . The similar distributions predicted across these datasets suggests a robust method. Figure 1b compares the full distributions for these three estimates. Here we investigate the effect that uncertainty in the reporting date can have on the results, using the data truncated at 9 February (Table 3) . The standard interval is the recorded data, wide intervals are obtained by removing a day from the exposure window lower bound and adding a day to the upper bound, and the narrow interval vice versa. The uncertainty in the reporting date can impact the estimated incubation period, showing that it is important to consider this risk when designing interventions. When constructing intervention strategies for an epidemic, the incubation period is an important parameter. For example, consider the quarantine strategy deployed in many countries during the early stages of the epidemic, aimed at preventing cases being imported from Wuhan. This strategy quarantined individuals upon their return from Wuhan for 14 days. For such a strategy to be effective, we require most incubation periods to be less than 14 days, so that the majority of infected people would develop symptoms before quarantine ended, enabling them to be further isolated. In this analysis, we show that in the worst-case scenario we would expect 1 in 62 cases to slip through this quarantine, with the best fit predicting 1 in 101 cases. Therefore, the 14 day quarantine period would capture the majority of cases. Throughout the epidemic, this seems to have been reasonably successful and prevented early seeding of cases in many countries. However, potentially due to complicated travel patterns or asymptomatic transmission, cases have slipped through detection and not been quarantined, which unfortunately has led to the situation observed today. In addition to the incubation period, there are many other delay distributions that must be estimated while an epidemic is growing, which can be estimated using the same technique. These include the generation time, the time between two infection events in a transmission chain; the serial interval, the time between symptom onset of an infector to their infectee; and the onset-to-death delay, the time from symptom onset to death. Transportation modelling plays a crucial role in the early stages of an outbreak; an infected individual may travel outside of the region in which they were originally infected and seed further infections across geographical scales which are impossible to contain. Furthermore, as the rate of travelling increases, the number of observed cases within the known ''origin" region decreases, and if exportation is not taken into account this results in an underestimation of the number of cases. These underestimates can be improved by looking at the total number of cases across all known affected regions, but doing so introduces further complications. For example, if an individual has less severe symptoms they may not seek medical assistance, thereby not being recorded as a case at their destination. This underestimation of cases can have significant effects if the traveller is able to infect more people. A new transmission chain can thus be started which remains undetected for some time due to a lacking known connection to the ''origin" region. In the ''origin" region an individual with mild symptoms may still be tested for an infection due to a higher level of alertness in the local health care system. However, this level of active case-finding may not be present elsewhere, or may not have been allocated a comparable level of resources. Further complications to this model arise from the incubation period of individuals wherein detection is unlikely, and the variations in movement and mixing between people when preventative measures are put in place. We consider a metapopulation model seeded with an infection in one of the regions, O, and investigate how exportation from this region combined with variability in case-finding can alter estimates for the doubling time and the expected portion of the population we expect to identify. This accordingly bounds the proportion of the infected population one would be able to target for personal intervention (e.g. quarantine or treatment). Note that the proportion of identified cases need not necessarily correlate with the proportion of the infected population who exhibit symptoms. Let us assume that movement from O begins at time where ω is the mean case-finding ability across all destinations. In the presence of real-time transition probabilities ij p of moving between two regions, these estimates can be further elaborated. We assume that detection occurs immediately following the end of the incubation period, i.e. . Similarly, we assume a gamma-distributed incubation period with shape and scale parameters k and θ, respectively. We can parametrise this distribution using the ''non-Wuhan" estimate of the incubation period in Table 2 , which yields a gamma-distribution with mean 4.84 and standard deviation 3. 22 . In contrast to other values in the table, this estimate is obtained from discrete infection events, e.g. contact with a known infected case, and therefore has a constant reporting rate, and a constant force of infection over each exposure window. Therefore, this estimate of the incubation period does not rely on the exponential growth rate unlike other estimates from Section 2. when accounting for ρ. This difference may seem small, but it reduces the doubling time by approximately 12 hours. The expected value of r grows linearly with the exportation rate, which has also been observed with real-time travel models [42] . Further models have also been developed which consider travel and exportation of cases in greater detail [20, 30] . The relationship between the observed cases in our origin and destinations can be used to determine the case-finding ability, though it should be noted that ω likely varies with time as burdens are increased on public services and the number of cases grow. Early estimates using data from [1] indicate at most an 80% case-finding ability, suggesting thousands of undetected cases exported to other regions of China, a sufficient quantity to sustain further transmission post-exportation independently of the number of asymptomatic cases present. The intention of these estimates is not to provide specific values for the doubling time of the spread of COVID-19 in China (as the estimates above use historic travel data and are limited by the availability of data), but to bring attention to the unusual circumstances surrounding changes in contact patterns, and mobility during the Chinese Spring Festival, the largest human migration on Earth [38] . Failing to account for the significant level of dispersion or exportation of cases during these circumstances will significantly skew our estimates. In a scenario where a single individual exposes a group to infection, it can be unclear how many people have been infected since they do not immediately develop symptoms. However, knowing the true prevalence in the population is essential to determine the most effective interventions to put in place, and to estimate future burdens on public services. Using the probability density function of the incubation period, we consider the efficacy of using the time it takes for people to present with symptoms as a predictor for the size of the infected group. This analysis is an effective ready reckoner at early stages of a novel infection, or in close contact environments, and is useful for predicting generation size when a complete data set is not yet available. In this analysis we focus on a scenario where infection time is known. In reality, we may only know an exposure window. For short exposure windows this method can still be valid, but for longer exposure windows it will need extending to account for this added uncertainty. We assume that the number of individuals who have been exposed to potential infection is known, in which case the number of people who are infected can be assumed to be binomially distributed with an unknown probability P that each individual has been infected. To determine the distribution of infected individuals, we use the available information regarding the number of individuals who have expressed symptoms. This yields two cases. In the first case, we assume that the true number of symptomatic individuals are observed. In the second case, we take the number of observed symptomatic individuals as a lower bound on the true value. We wish to determine the probability that the first generation has 0 e individuals, 0 0 E e = , given that i τ symptomatic individuals are observed on day τ, I i τ τ = . This is given by (see Appendix 6) This gives a distribution of the generation-size based on the number of observed symptomatic individuals by time τ. We can extend it to investigate a scenario where no symptomatic individuals have been observed by time τ by using a value of 0 for I τ : This can be used to illustrate worst and best case scenarios given τ time has passed without symptomatic individuals. Additionally, if we consider the probability that 0 E = 0, we can find the value of τ where we can have a 95% confidence that there will not be a second generation: This analysis considers the case when the number of observed symptomatic individuals to date is the true number. In practice however, we do not generally observe every symptomatic individual, so the number of observations is only a lower bound on the true number. To address this, rather than considering I τ as the total number of people who have developed symptoms by time τ, we can define I τ as the minimum number of people who have developed symptoms by time τ. We assume that the probability that I τ is equal to i τ for a given value of i τ is uniform at 1 1 i τ + . We can then use the same methods as above to infer a distribution for P. Details are provided in Appendix 7. e , in an infection event in which 20 people were exposed. A, B and C show the density when the number of observed symptomatics is taken to be the true number of symptomatics, D, E and F consider the case where the observed symptomatics is a lower bound on the true symptomatics. A and D consider the case when zero symptomatics are observed after 5 days, B and E when 5 are observed after 5 days, and C and F when 5 are observed after 10 days. The incubation period for the disease has been modelled as a gamma distribution with a mean of 4.84 and standard deviation of 2.79 ( Table 2) . As we can see from Figure 2 , this method can be used to predict the number of infected individuals in the original exposed group. However, we have also demonstrated the importance of caution when interpreting this data. If there is uncertainty surrounding the presentation of symptomatic patients, using I τ as a lower bound is a robust method to ensure the size of the generation is not underestimated. 3 . Modelling intervention strategies 3 When designing intervention strategies, we need to consider how adherence may alter their effectiveness. This is important, since highly effective interventions may not be adhered to if they present great individual cost to a population. In this case, a theoretically less effective intervention may perform better, if it has sufficient reduction in individual-level cost. In this section, we illustrate the potential impacts of adherence on the effectiveness of interventions using a toy model. Consider a standard SIR model, and denote by ( ) S t and ( ) R t , respectively, the susceptible and recovered/immune fractions of the population at time t. We can write S in terms of R such that If an intervention is put in place that reduces (with full adherence) 0 1 R < then the outbreak will be controlled. Indeed, let us assume that 0 R is reduced to zero by the intervention: for example, assume that social distancing is perfect and the number of contacts of a fully-adherent individual is zero. If only 50% of people adhere to the intervention then the average number of contacts is effectively reduced by a half and logically † 0 0 / 2 1.5 R R = = (the † representing quantities post intervention) and ( ) † 0.58 R ∞ = in this case. However, this assumes that adherence is an independent random process at each contact. This suggests that for each contact an individual would ordinarily make, they ''toss a coin" to decide whether to isolate or not. In reality, individuals are more likely to show polarity, where some individuals reduce all their contacts and follow the measures and a proportion of individuals choose to not adhere to the intervention. If there was distinct polarity in the population such that 50% adhered perfectly and 50% ignored policy, then a toy model can be created with two infectious groups, A I and B I , that behave differently. In this case where a dot over a variable represents its time derivative. Such an epidemic model, where the two groups have the same susceptibility but different infectivity, has the same final size as an epidemic in a single-type model with the same 0 R (e.g. see [5] ). However, they have different durations as can be seen in Figure 3 , where This shows that the assumptions about the nature of adherence predict the same growth rate and final size, but that the more polarised adherence has faster early growth and therefore an earlier peak. More complicated model structures could be constructed by incorporating adherence with intervention by susceptible states, which would lead to core group dynamics (see for example [37] ). This issue of independent versus polarised adherence is related to the idea of all-or-nothing versus leaky vaccination [29, 49] , where you either vaccinate a fraction of the population with 100% efficacy or vaccinate 100% of the population with reduced efficacy [34] . Note however that vaccination reduces your susceptibility (whether only or also), rather than only your infectivity as in the model discussed above, and variation in susceptibility does reduce the final size, with imperfect coverage with a perfect vaccine (all-or-nothing) leading to a lower final size than full coverage with a leaky vaccine (all individuals having the same mean susceptibility). The ongoing COVID-19 outbreak is known to have higher mortality rates amongst the elderly, the immunocompromised and those with respiratory and health complications [31, 82, 83, 84, 85] . In this section, we model the introduction of an infectious disease into care homes, in order to obtain estimates of the final size of the epidemic in the vulnerable population as well as predictions for the number of hospitalisations and fatalities. Modelling of care homes in the UK is conducted against the backdrop of a wider epidemic in the general population, which we here assume to be following SEIR dynamics with a basic reproduction number 0 R that might be different from the within-care home reproduction number C R . Care homes are assumed to be closed populations, with the infection entering each of them independently with a certain probability. Infection is seeded only once, and within-care home outbreaks then evolve independently from, and do not contribute to, other care home outbreaks and the epidemic in the background population. To keep track of hospitalisations, we model the withincare home infection dynamics using a compartmental model that, in addition to SEIR model, has compartments for mildly symptomatic prodromal cases (P), who show no symptoms but are capable of transmitting the virus, those who recover from the disease after mild symptoms that did not require hospitalisation (M), those who have severe symptoms and are admitted to hospital (H), those who recover after hospitalisation (R), and those that die (D). This is illustrated in Figure 4 . The infection pressure up to time t for a median sized care home is the integral from 0 to t of the force-of-infection (FOI) applied to the care home coming from all infectious sources, multiplied by a probability p. This probability represents the probability of the infection being introduced to a median-sized care home. For other care homes, we allow this probability to be proportional to its size, under the assumption that larger care homes employ more staff and are therefore at higher risk of introduction. When the infection pressure becomes higher than an individual care home's resilience threshold, that care home begins its own deterministic infection dynamics with a single initial infected case. The equations describing the background epidemic and the within-care home epidemic are given in Appendix 8. Figure 4 : Compartmental model for disease dynamics within a care-home. We extend a deterministic SEIR model to include compartments for prodromal (infectious) cases (P), mildly symptomatic cases that recover without requiring hospitalisation (M), cases that do require hospitalisation and are removed from the care home (H), cases that die in hospital (D) and cases that recover in hospital (R). We run this model on data for the entire care home population in the UK, so that there are approximately 15,000 care homes with a total population of approximately 450,000 residents [21] . Care home sizes range from 1 to 215, with a mean size of 29.4 . In this model we only consider the vulnerable population within care homes. We assume 0 1.5 R = in the background epidemic, a relatively low value that somehow accounts for a certain degree of control, and an 3 C R = to allow relatively explosive epidemics in care homes due to potentially more frail individuals, difficulty in isolation and staff inadvertently passing the infection from one case to the next. The other parameters in the baseline scenario are reported in Table 4 . Apart from the reproductive number, the background epidemic uses the same parameters as the care home epidemic. However, in the background model there are only rates from E to I and I to R, which are taken to be the rates from E to P and I to M, respectively. There are a variety of assumptions underpinning this model. Firstly, the background epidemic ignores structure and assumes homogeneous mixing. This is likely to make the peak more pronounced, so presents a worst case scenario for the demand on hospital beds. The assumption of 0 R for the background epidemic only affects the shape and duration of the background epidemic, since the tuneable parameter p controls the risk of introduction to the care home. That is, if 0 R is small, a large p still presents a high force of infection into the care homes. Therefore, for a fixed p, we expect that changing 0 R does not affect the total number of deaths, but it changes the peak hospitalisation incidence because a faster and more explosive background epidemic makes epidemics in care homes more synchronised. In fact, when testing the impact of a longer, flatter background epidemic, for example obtained by simulating three slightly desynchronised background SEIR epidemics, results have lower peaks and a much more variable timing (not shown). Assuming each care home is independent might not be realistic, since it is likely that staff are shared between multiple homes, in which case they can act as vectors of transmission between homes. However, in the model current outbreaks are already quite synchronised (the within care home outbreaks occur at similar times), so the effect of this assumption is likely to be minimal. The final major assumption is that the epidemic within the care homes is deterministic. This removes the probability of random extinction and random delays, and should obviously be relaxed with a stochastic model, given half of care homes have size smaller than 25. However, the extinction probability is very low with 3 C R = , so this stochastic effect is unlikely to have a large impact. Random delays, instead, may change the shape and timing of the epidemic, which could potentially reduce the peak burden. Therefore, this model represents a worst case scenario. In the absence of cure or vaccine for COVID-19, governments worldwide must rely on nonpharmaceutical interventions (NPIs) to control the outbreak [32] . A natural such intervention is to ask individuals who express symptoms similar to COVID-19 to isolate themselves, but variants to such individual isolation might include policies sometimes referred to as household isolation, household quarantine and mixed isolation. In this section, we investigate how such strategies affect the spread of the epidemic when bearing in mind that adherence to each intervention may differ. Individual isolation relies on individuals staying in isolation when they express symptoms, thereby stopping transmission. However, there is potential asymptomatic or prodromal transmission before they go into isolation. Additionally, isolation strategies generally ask infected individuals to remain at home, which presents an infection risk to the other members of their household, who may go on to spread the infection. The term 'household isolation' refers to a policy where, upon first detection of symptoms within a household, all individuals within the household go into isolation for a fixed duration of time. This strategy reduces the risk that other household members, if they are infected within the household, transmit in the community when pre-symptomatic (and hence before they self-isolate themselves) or if asymptomatic but still infectious. A blanket policy invoking a fixed duration of household isolation might cover the full epidemic in a small household. However, a larger household might present multiple generations of infection, potentially extending the within-household outbreak beyond the fixed duration of the household isolation policy. To address this issue, 'household quarantine' is another potential strategy. Upon detection of symptoms, the entire household is isolated until a fixed duration of time after the last symptomatic case within the household expresses symptoms. This ensures that there are no symptomatic cases evading intervention but applies quite drastic measures to the household. A fourth strategy, that reduces the cost relative to household quarantine, is mixed isolation. Here, upon detection of symptoms the entire household is isolated for a fixed length of time. Any subsequent cases within the household then undergo individual isolation as described above. This reduces the risk of cases not being isolated whilst allowing recovered individuals to return to work. There is however still some remaining risk that infected individuals may not yet express symptoms after the end of the isolation period, but this risk can be controlled through the duration of each isolation. Although there is now a rich theoretical literature on households models [7, 8, 9] , the mainstream methodological tools in this research area present important limitations that make them not directly applicable to studying these control policies. First, exact theoretical or asymptotic results in these models are mostly restricted to time-integrated quantities, i.e. those quantities that do not depend on the detailed temporal shape at which the infectivity is spread by an individual: these are 0 R (or any other reproduction number [10, 60] , e.g. the household reproduction number * R ), the probability of a large epidemic, and the epidemic final size [4] . For this reason, the vast majority of the literature relies on the standard stochastic SIR model [4] , despite its unrealistic infectivity profile. Even if more recent work has expanded beyond time integrated quantities, for example considering the real-time growth rate [10, 62] , if the interest is on tracking the dynamics of infection spread, a model based on full temporal representation of between-and within-household dynamics [33] appears necessary. A second limitation of standard household models is the key assumption of constant parameter values. This appears essential for any form of analytical progress. However, in the context of the interventions discussed above, a reduction in transmission between households, as well as a potential increase in within the household, require parameters to change over time. To overcome these limitations, we consider two approaches. The first approach fully captures both within and between-household dynamics with a master-equation formalism, i.e. by relying on a Markovian within-household dynamics and keeping track of the expected number of households in each possible state of their internal dynamics. The second approach has a greater emphasis on withinhousehold dynamics, and is fundamentally an independent-households, individual-based, stochastic simulation. The more limited mathematical tractability is the price to pay for an increased flexibility, as the within-household Markov assumption is relaxed and exact distributions for delays between events, typically informed by the data, can be explicitly inputted. Although both approaches can account for increased within-household transmission as isolation and quarantine are imposed, we only consider this for the second method here. This aspect allows us to study the increased risk of infection a vulnerable individual in the household would experience following the implementation of a control policy. To model the households in the UK, we construct a realistic distribution of household sizes (which is given in the supplied code). We take this demographic data from the 2001 Census [58] . More recent information, though less specific on large household sizes, shows that sizes of smaller households are largely unchanged over time [57] . 3 . 3 In this section, we investigate the above intervention strategies under the assumption that a fraction of households adhere 100% with an intervention and the remaining households ignore the intervention. To model the interventions, we implement a dynamical household model that explicitly represents the small sizes of households. The dynamics of the outbreak are simulated using an SEPIR model. This model assumes that there are five possible states in which an individual can be. These are, susceptible, latent, mildly symptomatic prodrome, symptomatic infectious and removed. Individuals are infectious during the mildly symptomatic prodrome state and the symptomatic infectious state. Following [17] , we assume that within-household transmission scales with the inverse of the household size to a specified power η. Such a model can be used to investigate how the pathogen spreads through and between households. The methodology involved is the use of self-consistent differential equations, first written down by Ball [6] . More recent developments, including numerical methods for these equations, include [13, 33, 41, 66] . Important features of this approach include allowing for a small, finite size of each household in which random effects are important and each pair can only participate in one infection event. Here Λ represents infections imported from outside the population of households, and the other terms represent between-household transmissions. In our code, we assume Λ is a step function. Results are largely insensitive to the precise choice of Λ , but compared to, for example, random seeding of infections in households, starting the whole population susceptible and exposing to a small amount of external infection for a fixed time period has less room for the precise initial condition chosen to influence results, and is more realistic for the situation observed in countries apart from China. We take a 'global' intervention as part of the baseline, in particular, we can model phenomena such a school closures that hold during a set of times T as We call ε the global reduction. We will generally drop this t-indexing for simplicity, and will also consider only a household isolation strategy (though the other strategies can be considered similarly, with an example of how other strategies could be captured in this model framework given in Appendix 9). Instead of isolating for a fixed duration, we assume that a fraction W α of households isolates when there is at least one symptomatic case in the household, and isolating households leave isolation when no symptomatic cases remain. We make this assumption since it may potentially capture the behaviour of real households, who are more likely to remain isolated based on presence of symptoms rather than for a fixed duration. In the non-Markovian household model in Section 3. 3 Using the methods in [13, 66] , it is possible to fit household models of this kind to the overall growth rate, r, which we take to correspond to a doubling time of three days. Natural history parameters can then be set directly based on reasonable estimates: e p r → to the inverse of the latent period; p i r → to the inverse of the prodromal period; i r ∅ → to the inverse of the symptomatic period. Shaw [71] analyses various household datasets for respiratory pathogens and estimates values for η close to 1, so this is taken to be 0. 8 . The remaining degrees of freedom are relative infectiousness of the prodrome (taken as a third) and the probability of transmitting within a pair, which we can take as a typical value given by Shaw [71] . For the numerical results in Figures 6 and 7 Using the given parameter values for our baseline scenario (Table 5) , we consider a combination of household isolation (which follows all-or-nothing adherence) with global reduction in transmission (which follows leaky adherence) for three weeks and show the results in Figure 6 . The distribution of infectious individuals varies with household size, which is shown in Figure 7 for different durations of global intervention. Applying household isolation at 65% adherence ( 0.65 W α = ) manages to reduce the spread of infection, but appears insufficient in this model and with baseline parameters for controlling the outbreak in the long-term, unless other intervention strategies that reduce the global transmission (increasing ε) are adopted at the same time. Alternatively, different levels of adherence can be considered to determine if and when control may be achieved purely through household-based interventions. For the model proposed in the next section, we look into the effectiveness of increasing adherence. Parameters are in the main text. The model described above has the advantage of being able to track the dynamics within the household as well as the overall epidemic in the population in a relatively efficient manner. We now discuss a different framework that loses part of the capability in keeping track of the overall epidemic, but offers further flexibility both in the impact of policies on the within-household dynamics and in the distributions between events in the infectious life of an individual. We use this model to investigate the relative effectiveness of the different control policies. We also consider allowing recovered individuals to leave the household, even in the context of household isolation or household quarantine. This has no impact on the transmission dynamics, but reduces the individuals' life disruption and potential economic cost of any policy implemented. This model assumes that there is no reintroduction within households so each household can only be isolated or quarantined once. The assumption that only one household member is infected from outside is approximately satisfied if we assume homogeneous mixing between households and a large number of households, which are all fully susceptible at the start of the epidemic. However, the reality of heterogeneous mixing makes reintroduction a likely possibility even early on in the epidemic. This model, therefore, lacks an explicit description of the social network structure beyond the household. For simplicity, we assume that within households all individuals are identical in terms of their disease dynamics, although the method might be extended to allow for different age/risk groups with different disease dynamics. We assume that the level of within-household transmission in a household of size n scales proportionally to ( ) 1/ 1 n − , though we acknowledge that true transmission is slightly more complex [18] . We consider independent households of size Each individual is given an indicator function of whether they are symptomatic or not (individuals show symptoms independently of each other with probability s p ) and a resilience threshold. This last quantity is drawn from an exponential distribution with mean 1, and represents the overall infection pressure this individual is able to withstand before they get infected. The infection pressure up to time τ is the integral from 0 to τ of the force-of-infection (FOI) applied to this individual coming from all infectious sources. At the beginning of the within-household epidemic, a single initial case is assumed. Time is discretised with a predefined time step d 0.1 t = days. At any time step, the current infectivity of all infectives in that time step is summed over, keeping track differently of the infectivity spread outside and inside the household. An overall measure of the accumulated infectivity within the household is updated at each time step and when this crosses the resilience threshold of a susceptible individual, they acquire the infection. We assume an individual spends half of their time outside and half inside the household. When selfisolation starts, the assumed adherence i a represents the fraction of the time spent outside that is shifted from outside to within the household. Therefore, for perfect adherence, from the moment symptoms occur, the individual stops transmitting outside but their infectivity within the household grows by 100%. We also explore variations in this compensatory behaviour, so that the time of an individual is split in a more flexible proportion than 1:1. The same argument applies to other control policies, with adherence levels h a for household isolation and q a for household quarantine. When multiple control policies are in place at the same time, their effect is assumed to be multiplicative: if an individual has symptoms and the household isolates, the outside transmission rate from that individual is reduced from baseline by a multiplicative factor ( ) ( ) (1) and the area under this curve is known in the literature as the household reproduction number, and is typically denoted by * R . If enough transmission is prevented, so that * 1 R < , the epidemic is controlled. The basic reproduction number 0 R and * R share the same threshold at one, so they are simultaneously larger, equal or smaller than unity. However, in a growing epidemic 0 * R R < [10, 60] . The real-time growth rate r is related to * R by the Lotka-Euler equation asymptomatic cases, which are assumed to be half as infectious as symptomatic ones. The total number of cases under isolation (possibly compounded, e.g. both household and individual isolation) is shown in red (right axis). All random numbers involved in the realisation of the stochastic epidemic are drawn at the start, before the impact of each control policy is implemented. Row 1 shows no isolation and individual and isolation. Rows 2, 3 and 4 show, respectively, household isolation, mixed isolation and household quarantine. The difference between the columns is that the basic policy on the left is ''upgraded'' to the more cost-effective version on the right that allows recovered individuals to leave the house as they cannot transmit outside anymore. When no control is implemented, the primary case (individual A, infected at time 0) infects another individual (B) around time 11 . After a long latent period (i.e. incubation minus prodromal), B becomes infectious and infects a further individual (C). The last individual (D) escapes infection. When different intervention strategies are in place, within-household infectivity is increased. This can result in individual C becoming infected earlier in the outbreak and individual D no longer escaping infection, both due to the increased force of infection. In this simulation, the dynamics for individual B do not change since they are infected before A becomes symptomatic. Individual D is infected earliest under mixed isolation, because within-household transmission is higher than household isolation alone, due to increased adherence from individual isolation also being in place. Adherence levels to household quarantine are lower than those of household isolation, due to the higher demand of full quarantine, thus leading to less enhanced within-household transmission. We assume that adherence to individual isolation is 90%, household isolation is 80% and household quarantine is 60%. The more severe the intervention, the better it captures the infectious periods of infected individuals within the household. The lower x-axis gives the adherence to household quarantine, and the upper x-axis adherence to household isolation. We assume that household isolation is less demanding, and therefore adherence is assumed to be ''twice as high'', meaning it is at the midpoint between that of household quarantine and 1 (e.g. 0.6 for an x value of 0.2, 0.9 for an x value of 0.8, etc.). The black dash-dotted line in (a) gives the amount needed to control the spread by achieving * 1 R = . Notice how: the effect of individual isolation is independent of adherence to household quarantine (dotted lines); the effect of household isolation is independent of adherence to individual isolation (overlapping dash-dotted lines); mixed isolation is always superior to household isolation; household quarantine is only optimal at really high levels of adherence (for these baseline parameters, generally, beyond the level needed to achieve control), but quickly becomes suboptimal to mixed isolation as adherence is reduced. When a sufficiently large reduction in * ); similarly, in all households, individual isolation would total exactly 7 days if all cases were symptomatic and all individuals in the household were ultimately infected. In (d), we assume that adherence is 100% to each intervention. Under the baseline parameter values (Table 6) , control can in principle be achieved via certain interventions, but only for high levels of adherence, which might be difficult to enforce for a prolonged length of time (Figure 9a ). More importantly, the model's conclusions are highly sensitive to variations in parameter choices, which are uncertain. Parameters that present problems here are the delay from symptom onset to isolation (with control failing for 1 day detection delay unless adherence is essentially perfect), proportion of asymptomatic infections (any chance of control lost at 50%) and the strength of asymptomatic transmission. The short delay before symptomatic individuals isolate may be unrealistic unless the susceptible population is very well-informed about symptoms that call for isolation, and so likely does not apply in very early stages of an outbreak. Overall, in the face of the many uncertainties, household-based interventions triggered purely by symptoms appear useful to slow the spread but need to be complemented by other policies. Comparing the different strategies (Figure 9b , household quarantine can be optimal (as one might expect), but this requires high adherence levels. As adherence drops, this strategy becomes suboptimal to mixed isolation. Mixed isolation is significantly better than household isolation on its own and requires little extra social cost, so should not cause adherence to drop (relative to household isolation adherence levels). The difference between the two strategies comes down to the transmission slipping through after the 14 day household isolation. The cheapest strategy, when considering working age adults, is individual isolation (Figure 9d ), but the effect is limited compared to the other models and cannot achieve control in the baseline scenario even with 100% adherence. Overall, the mixed isolation strategy appears to be most cost-effective. However, this is dependent on the assumption that adherence is better for 14 day isolation rather than a very long quarantine. It can be observed that household-based interventions are more effective than individual isolations, demonstrating the importance of these strategies in designing intervention policy. Figure 8 shows how the different isolation strategies contain the infectious periods of individuals within the household and also indicates the number of individuals being isolated within the household. To study the impact such an increased within-household transmission has on the chance that a vulnerable individual is infected in the household, we randomly choose one non-primary case in the household as the vulnerable one and count how many of the e n epidemics result in this individual being infected under the different control policies (Figure 9c ). Under these interventions, the risk of a vulnerable individual getting infected within-household, conditional on the infection entering it in the first place, is in the range 5 15% − . Since this model relies on the Sellke construction, we calculate the infection pressure that accumulates (within a household) during the outbreak. In relation to Figure 8 , we report in Figure 10 the infection pressure that accumulates for the different control policies, showing the different impact each intervention can have on the within household dynamics. Figure 10 : Accumulated infection pressure in the simulation presented in Figure 8 for different control policies. Horizontal dotted lines represent individuals' resilience thresholds. As time progresses, the accumulated infection pressures (coloured lines) increase and when they cross the resilience thresholds, the corresponding individual acquires infection. Notice that: in the absence of control, one individual escapes infection; with household isolation only, the infection pressure reaches a relatively low endpoint because of the last symptomatic individual slipping through and not transmitting much in the household; with mixed isolation, infection pressure is higher due to combined adherence; and with household quarantine, the infection pressure builds up more slowly at the beginning due to lower adherence. We assume that adherence to individual isolation is 90%, household isolation is 80% and household quarantine is 60%. Social distancing, isolation and lockdowns act to mitigate the spread of an infectious disease and reduce the number of cases. However, such interventions, particularly widespread lockdowns, cannot be maintained indefinitely and must be lifted at some point. For the disease to be controlled, these interventions can be implemented until pharmaceutical interventions are developed, such as a vaccine, or until the case numbers are low enough that the disease may go extinct. Here, we consider the situation where interventions are lifted just before extinction, when the number of cases has reached a low but non-zero initial value 0 n : at this point, the number of cases might rebound or might go extinct by random chance despite an 0 1 R > . We use a time-inhomogeneous birth-death chain model [39] to , Q t s satisfies the differential equation subject to the initial condition ( ) 0, . Q s s = Solving for Q and setting 0 s = gives the probability that, at time t, the number of cases has reached zero and the disease has become extinct. We denote this probability by ( ) q t [3] , which is given by The corresponding generating function, ( ) , R t s , for this random variable satisfies Again, solving for ( ) We simulate data based on one initial case 0 1 n = , though this may easily be extended to any number of initial cases. We run simulations both with and without immigration, choosing W is the initial (constant) rate of importation of cases before any controls on immigration are put into effect. We set 0 5 W = imported cases per day. With these choices of parameters, the resulting extinction probabilities are given in Figure 11 . Note that we are assuming the immigration rate is decreasing to 0, so if the infection is controlled internally for long enough, an overall ultimate extinction is possible in this model. For these parameter choices, the final probability of extinction, defined as n q t . These probabilities suggest that, without widespread immunity, stochastic extinction might be aided by social distancing but is heavily compromised by immigration. Border controls, therefore, if of limited use when transmission is self-sustaining, become key when the number of cases is low. Note that we have assumed an importation function ( ) t η that goes to 0 for large t, in line with a pandemic that goes extinct in other geographical regions. However, the presence of an animal reservoir might lead to an importation function that is non-zero over longer time scales, thus effectively making ultimate extinction impossible unless the effective reproduction number is kept below one by a systematic and permanent intervention (e.g. technology-based change in behaviour) or herd immunity. Contact tracing is a complementary control policy to isolation or quarantine. When a case is discovered, attempts are made to identify and isolate individuals who may have been infected. In doing so, some of the secondary cases will be discovered and isolated early in their infection, decreasing their effective infectious period. If contact tracing is successful, it can greatly reduce the effective reproduction number of the infection, and in combination with other interventions may drive an epidemic extinct, as was seen in the case of SARS [81] . Contact tracing in itself presents numerous challenges, which are exacerbated by its success relying not only on the effectiveness of the tracing process but also the underlying transmission characteristics. For COVID-19, some of these challenges include mild symptoms which cause infections not to be reported, pre-symptomatic transmission which occurs before a case is reported, and short generation times [28] which can cause the epidemic to outrun contact tracing. Additionally, contact tracing is only feasible for smaller case numbers, because each case generates multiple contacts to follow up, so the tracing workload expands dramatically, and an increasing number of chains remain unobserved. This makes it a viable strategy in the early days of an outbreak, or, if containment has failed, following a period of severe interventions, such as a lockdown. Combining contact tracing with isolation is being considered by many countries as part of a test, trace and isolate strategy to be implemented once lockdowns or comparable measures are lifted, provided these lockdowns succeed at driving case numbers sufficiently low. In this section, we develop a householdlevel contact tracing model for an emerging outbreak, since we do not wish to make assumptions about immunity or depletion of susceptibles. These assumptions can be added to the model as the availability of data into immunity improves. We are interested in the likelihood that the contact tracing process is overwhelmed by large case numbers and the likelihood that, combined with isolation, it can drive the disease to extinction. The early days of an outbreak can be modelled using a branching process, where generations of infections produce infectious offspring. Contact tracing processes can be incorporated as a superinfection along the tree generated by the branching process [11] . When a node is 'superinfected' by the contact tracing process, it is isolated. We model the infection spreading through a fully susceptible population of individuals, segmented into households of different sizes according to the 2019 ONS survey [57] , and progress through discrete time steps of 1 day. As such, our branching process is at the household level, coupled with localised within-household epidemics. This allows us to model contact tracing strategies that isolate whole households, which may contain several undetected infections. It also enables a wider range of contact tracing strategies to be modelled, each with different intervention scope and costs. Each day, individuals (or nodes) make contacts to a random set of individuals; divided into local contacts to members of the same household, and global contacts to members of other households. The number of individuals contacted in a day is distributed using an overdispersed negative binomial distribution and parameterised using estimates from the POLYMOD social contact survey [54] , stratified by household size. Since the probability that a contact causes infection cannot be directly observed, we use improper hazard rates that give rise to the 5 day COVID-19 generation time [26] and 0 3 R = . For contact tracing to begin, an infection must be diagnosed, which we assume occurs 70% of the time among infected individuals due to flaws in reporting or very mild symptoms in those infected. We assume a Gamma distributed incubation period with mean 4.84 (Table 2 ) and a Geometric reporting delay from symptom onset with mean 4.8 days [42] . Intuition suggests that if 0 3 R = then tracing two thirds of contacts will control the epidemic. However, in practice transmission may occur before tracing, so this will not reduce the number of infectious contacts by two thirds. To demonstrate this, we assume that contact tracing successfully traces two thirds of contacts. Trained professionals have to trace all reported contacts from the last 14 days, so we assume that the contact tracing delay follows a Geometric distribution with a mean of 2 days. Individuals are considered recovered 21 days after infection, as the chances that they are still transmitting then are negligible. Though our general framework can be modified extensively, we assume the following contact tracing strategy. When an individual reports infection, their household is immediately isolated. Contact tracing attempts are then made for all households connected to one of the individuals in this household, whether symptomatic or not. When a connected household is identified (after the contact tracing delay), all individuals within the household are immediately placed under observation. If any of the individuals in the observed households develop symptoms, then the household becomes isolated and the contact tracing process continues to connected households. When a household is isolated, we assume all individuals are isolated with 100% adherence, and cannot transmit the virus within or outside the household. The assumption that isolation prevents local infections is unrealistic, but does not change the overall behaviour of the process as there are no more global infections. This strategy imposes high individual-level cost, since by isolating all individuals within a household, it isolates individuals who have not had direct contact with an infected individual. In practice, such a strategy may have poor adherence. Figure 13a shows an example contact tracing network. 3 . 5 When choosing contact tracing strategies, a balance must be struck between the effectiveness of a strategy and the resources that it requires. Some strategies are only feasible when there are few infections, since the resources required can grow rapidly depending on the dynamics of the outbreak and the contact tracing process. To define the capacity of the contact tracing process, we consider the ability of a public health agency to observe the condition of those asked to self-isolate, due to their recent exposure to an infected individual. The health agency must remain in contact for the duration of the 14 day self-isolation period, so that if any individual under isolation develops symptoms and then tests positive, the contact tracing process can be initiated on this node. We will define the capacity of the contact tracing process to be the number of people that can be placed under observation and assume two possible capacities: 800 and 8000. We assume that when a node is contact traced, they are asked to report their global contacts for the last 14 days. All global contacts are assumed to be to a new person since we are in the early stages of an outbreak. Parameters are given in Table 8 . Hitting probability (8000) 76 .8% We carried out 6507 simulations of the contact tracing process for 150 days. Contact tracing capacity was reached in 5000 simulations, and in 180 the epidemic neither went extinct nor was the 8000 capacity reached. In the remaining simulations, the epidemic went extinct. Figure 12a and Table 8 show that increasing the contact tracing capacity tenfold less than doubles the time until that capacity is reached. However, it does increase the odds of driving the epidemic to extinction without hitting the capacity by about 10% (Table 8) . Different contact tracing strategies will strain different aspects of the health agency. A strategy that generates large amounts of work is only feasible if there are few active infections. The optimal strategy will need to compromise and may need to change depending on the number of active infections, which cannot be directly observed. Figure 12 : Capacity hitting times for the contact tracing model. When there is a small number of cases in a single country, it may be possible to drive the pathogen to extinction. This small case number could correspond to the start of an outbreak or removing of severe interventions. We consider the latter case, but conservatively assume a fully susceptible population. We assume that social distancing is enforced on day 0 and reduces global contacts by 70%. Full parameters are given in Table 8 . Since we are interested in extinction, we will no longer consider the contact tracing capacity. Under these baseline parameter assumptions and 10000 simulations, the combined force of this contact tracing strategy and isolation is enough to drive the epidemic extinct ( Figure 13b ), but measures will need to be in place for months in some cases. If the infection is ever re-imported, then the process would begin again, since herd immunity is not achieved. Note that the minimum extinction time is 21 days due to this being the time after which an infected individual is labelled recovered. Additionally, this model only considers extinction under the assumption that no cases are imported. In Section 3.4, we have shown that importation of cases significantly reduces the extinction probability. This suggests that extinction may no longer be guaranteed, and the time to extinction will be significantly increased. This analysis has focused on a single contact tracing strategy using indicative parameters for COVID-19. The proposed model can be extended to more strategies and region specific parameters to inform the design of control policies. Also, as is shown in Table 8 , contact tracing capacity is likely to be reached, which may prevent extinction from being achieved. This complication is compounded by the issues of loss of immunity or the presence of an animal reservoir discussed in Section 3.4. Figure 13 : Example of the contact tracing process (a) and the extinction times distribution (b) for the contact tracing model. 4. Discussion In this manuscript we have presented a range of mathematical tools to tackle infectious disease outbreaks. In particular, these tools address various technical questions posed by the authors to support the ongoing public health response to COVID- 19 . This toolkit considers both estimation efforts for key parameters, and investigative efforts (often numerical simulations) in gauging the effectiveness of various intervention or control measures. Joint consideration of estimation and simulation efforts is critical. Parameter estimates are obtained using a certain set of assumptions regarding the data, and investigations or simulations utilising these estimates should ensure that their underlying assumptions are consistent. These challenges in model construction and applicability of statistical methods are compounded by the limitations of the data with which decisions must be made. Some of the biases present in the data can be addressed with an improved data collection methodology -often challenging in the context of a fast-moving outbreak -but many are also inherent to the nature of early outbreak data [15, 45] . The consequent lack of intuitive insight from this data underscores the need for careful parametric estimates, especially considering the large variability in predicted outcomes resulting from small differences in parameters. Even with robust estimates for some parameters, many other parameters are challenging to estimate using the available data. Therefore, models need to address this variability and uncertainty in order to inform public health policy. We have presented methods to address biases arising from a growing force of infection, changes in the reporting rate, truncated data samples and a varying travel rate. We use these methods to account for these biases when estimating delay distributions, such as the incubation period, and the growth rate/doubling time. These biases can have significant impact when estimating key parameters: the mean incubation period estimates for COVID-19 range from 3.48 days without correcting for truncation to 4.84 days with the correction, and the doubling time in Hubei province decreases from 3.15 days without correcting for travel to 2.77 days. These differences can significantly alter our understanding of the outbreak, and could have a large impact on policy and public health. For instance, underestimating the incubation period may lead to quarantine strategies failing to identify infected individuals if the quarantine length is too short. Overestimating the doubling time (or underestimating the growth rate) will underestimate the risk posed to the host population -both in terms of final size of the epidemic and the rate at which it spreads, which can have significant public health impacts as discussed in [63] . It is important to note that the above-mentioned biases, and consequent impact of implementing the methods correcting for their presence, may vary across different settings. As an example, the potential underestimation of the COVID-19 growth rate is exacerbated by an overlap in early outbreaks with a period of significant travel and movement in China, and would be less detrimental if first observed in other populations such as Italy. Also, for the incubation period, we have shown two different types of data; one from Wuhan and one from discrete infection events. In the Wuhan data set, truncation and force of infection biases are very important, whereas in the other data set, there is no force of infection bias since the infection events are observed. When an outbreak occurs in an enclosed group, such as a large gathering, we may wish to know how many individuals are likely to be infected. We developed a statistical method to estimate the first generation size based on the number of symptomatic individuals, taking care to account for the uncertainty in this quantity. This ready reckoner can inform testing of large groups to help control the disease spread, but does not apply to later generations or the possible interventions enacted on the population. Building on these enclosed population scenarios, we have developed a set of models that investigate public control measures or interventions on enclosed populations, such as households and care homes. These structured descriptions improve the population risk profiles relative to assumptions of homogeneous mixing. A complementary aspect to a structured population when modelling interventions is adherence. Motivated by vaccination modelling, we consider leaky adherence, where every household chooses to adhere or not whenever an event occurs, and all-or-nothing adherence, where some households adhere every time and some never adhere. We observed that in a homogeneous population, although the two types of adherence predict the same growth rate and final size, the timing of the peak and the early growth can be faster under all-or-nothing adherence. This insight, combined with lessons from the vaccination literature, suggests that efforts should focus on ensuring complete adherence in individuals or households with some level of pre-existing adherence, rather than pushing non-adherent individuals or households to change behaviour. Dedicated modelling of disease spread in care homes is essential due to the documented history of comorbidity of their residents during pandemics [31, 82, 83, 84, 85] . We do so by regarding care homes as closed populations that are subjected to a force of infection from an external epidemic. We develop a tool for analysing the risk posed to this population by determining the peak size of the epidemic within the care homes and the number of deaths. Applying this model to COVID-19, we find that by ''cocooning" the care homes, i.e. shielding them to reduce the chance of introduction from the external outbreak, we can significantly reduce the size of the peak and therefore reduce the number of deaths. However, assessing the necessary level of shielding requires accurate characterisation of the external force of infection, and underestimating this may invalidate shielding efforts. A limitation to the proposed model is the deterministic within care home epidemic. However, since the average size of care homes is relatively large and we assume a high 0 R within care homes, the deterministic assumption is unlikely to significantly alter the conclusions. When modelling households however, we are concerned with much smaller population sizes. Therefore, it is important to consider stochastic effects within each household, combined with between-household dynamics. We consider two different household models: one which contains features of both within-and between-household transmission, where small-scale transmission can be linked to the epidemic on a population level, and another which facilitates more detail in the withinhousehold transmission and delay distributions, but with reduced correspondence to the populationwide transmission. With the first model, a 65% adherence to household isolation appears insufficient to control the epidemic without severe global reductions in transmission. Coupled with a short term global reduction, the epidemic can be controlled, but upon lifting the global intervention, which could take the form of a lockdown, household isolation is insufficient to maintain control. For the second model, we look into changing the strength of adherence, and the impact this can have on achieving control. Indicative but reasonable parameter values suggest that the COVID-19 outbreak can potentially be controlled using household isolation strategies, provided the level of adherence is sufficiently high. However, such a high level of adherence may be difficult to maintain in the long-term and this modicum of control is anyway highly sensitive to the chosen parameters. We further investigated the efficacy of various isolation or quarantine measures. A policy of individual isolation struggles to curtail the epidemic for any adherence. Instead, mixed isolation, whereby first the whole household isolates and any individual infected during isolation goes on to self isolate after household isolation is lifted, appears to be the most cost-effective strategy. Countries have put into place strict social distancing and lockdown intervention to suppress or regain control of epidemics that threaten to overwhelm the health system and cause massive mortality, but they cannot be sustained in the long term without growing social and economic costs. We have shown however, that the probability of the epidemic becoming extinct once these policies are lifted, even when very few cases remain, is very small. We therefore consider a contact tracing intervention as a potential strategy for managing the COVID-19 outbreak, once severe lockdown interventions are lifted. We developed a household-level contact tracing model to explore the feasibility of combining these strategies to control the epidemic. Firstly, we noted that by using knowledge of household structure, we can reduce the burden on the contact tracing process by isolating household and removing them from the contact tracing process once an infected member has been identified. Secondly, we investigated how contact tracing combined with household isolation may drive the disease to extinction, finding that aggressive contact tracing coupled with household isolation can drive the epidemic to extinction under the indicative parameters assumed, when starting with a single infection. However, the time until extinction can be impractically long, which risks the contact tracing capacity being overwhelmed, suggesting such a strategy may be infeasible in practice. A less aggressive strategy could be implemented that would be less likely to overwhelm local health agencies. Whilst it may not lead to extinction, this can still be beneficial at mitigating and controlling the spread of an outbreak as part of a test, trace and isolate strategy. There are many complexities when modelling an outbreak of a novel infectious disease. To address some of these, we have described a variety of techniques to serve as part of a generally applicable toolkit. However, our proposed models, and many other models, are subject to important limitations which must be considered prior to their application. Key among these is the lack of heterogeneous population mixing, such as through age-stratification [61] and different risk-groups [77] , and spatiotemporal variations [43] , all of which influence modelling estimates and predictions. Nevertheless, the relative simplicity of the presented models allows for the development of qualitative intuition regarding the efficacy of various intervention methods, whilst providing tractable theoretical frameworks which can be further developed and better inform policy-makers. 5 . Independence of the incubation period likelihood function and the reporting rate To estimate the incubation period distribution, we need to find the distribution that maximises the probability of observing the sampled data. However, the sampled data does not directly record incubation period, and instead contains infection exposure window and the symptom onset date. Additionally, the sample does not contain all individuals, and therefore there is a reporting rate that must be incorporated into the likelihood function. If the reporting rate is constant, it can be ignored. However, in the data coming out the Wuhan, the reporting rate varies significantly, since individuals are no longer exported from Wuhan after travel restrictions. Since the reporting rate depends on individuals leaving Wuhan, the main factor effecting the probability that a case is included in the data set is the date an individual leaves Wuhan. Therefore, the reporting rate depends on the days an individual spends in Wuhan and is independent of their symptom onset date. We need the likelihood function that an individual was infected between days a and b, had symptom onset on day y, and was included in the data set. We will condition against the infection window, a I b < < , the case being included in the data set, x D ∈ , and symptom onset occurring before the truncation date, Y T < , and determine the probability that for such an individual we observe the given symptom onset date, Y y = . where f θ is the probability density function of the incubation period distribution. Therefore, the likelihood function ( | , ) P Y y a I b Y T = < < ≤ is independent of the reporting rate for the data coming out of Wuhan in the early days of the outbreak. 6 . Generation-size derivation To solve this, we need to determine the distribution of the infection probability P given the number of observed symptomatics. Assuming that P is uniformly distributed, we have F represents the hypergeometric function [2] . Substituting this into Equation (9) gives ( ) ( 7 . Estimating the generation size using a lower bound on the number of symptomatic individuals In the analysis in Section 2.4, it is assumed that every person who has developed symptoms by time τ is known to the observer. However, depending on the disease, symptoms can be subjective. One person may not notice something another person may visit hospital for. Additionally, one person may not want to come forward with symptoms if they are worried about the repercussions of coming forward (for example being isolated against their own will). To address this, rather than considering I τ as the total number of people who have developed symptoms by time τ, we can consider I τ as the number of people who have presented with symptoms by time τ. We do no know the true value of I τ , but we know that it cannot be below I τ . We assume that the probability of I τ being i τ for a given value of i τ is uniform at 1 1 i τ + , dI E I dt ρ γ = − (12) . dR I dt γ = (13) This provides a force of infection that is used to model seeding within each care home via the Sellke construction, with the details provided in the main text. Once infection has been seeded within a care home, the epidemic progresses using the following ordinary differential equations Handbook of mathematical functions with formulas, graphs, and mathematical tables Pre-existence and emergence of drug resistance in a generalized model of intra-host viral dynamics Stochastic epidemics in dynamic populations: quasi-stationarity and extinction The final size of an epidemic and its relation to the basic reproduction number Stochastic and deterministic models for SIS epidemics among a population partitioned into households Seven challenges for metapopulation models of epidemics, including households models Household epidemic models with varying infection response Epidemics with two levels of mixing Reproduction numbers for epidemic models with households and other social structures II: Comparisons and implications for vaccination Stochastic epidemic models featuring contact tracing with delays Coordinating the real-time use of global influenza activity data for better public health planning Characterising pandemic severity and transmissibility from data collected during first few hundred studies The final size of a serious epidemic Estimation in emerging epidemics: Biases and remedies Models overestimate Ebola cases A Bayesian MCMC approach to study transmission of influenza: application to household longitudinal data Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak. Science Care Quality Commission. CQC care directory -with ratings Department of Health and Social Care. SPI-M modelling summary for pandemic influenza Are patients with hypertension and diabetes mellitus at increased risk for COVID-19 infection? SARS incubation and quarantine times: when is an exposed individual known to be disease free? Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing The spread of awareness and its impact on epidemic outbreaks Estimating the generation interval for COVID-19 based on symptom onset data. medRxiv Reproductive numbers, epidemic spread and control in a community of households Estimated effectiveness of symptom and risk screening to prevent the spread of COVID-19. eLife Clinical characteristics of coronavirus disease 2019 in China Oxford COVID-19 Government Response Tracker (OxCGRT Deterministic epidemic models with explicit household structure Epidemic prediction and control in clustered populations Coronavirus: GSK and Sanofi join forces to create vaccine Regression models for right truncated data with applications to AIDS incubation times and reporting lags Modeling infectious diseases in humans and animals China braces for world's biggest travel rush around Spring Festival On the generalized ''birth-and-death" process Containing papers of a mathematical and physical character Scabies in residential care homes: Modelling, inference and interventions for well-connected population sub-units The effect of human mobility and control measures on the covid-19 epidemic in china Spatial and temporal dynamics of superspreading events in the 2014-2015 West Africa Ebola epidemic The incubation period of 2019-nCoV from publicly reported confirmed cases: estimation and application The Epidemiologic Toolbox: Identifying, Honing, and Using the Right Tools for the Job Substantial undocumented infection facilitates the rapid dissemination of novel coronavirus (SARS-CoV2) Incubation period and other epidemiological characteristics of 2019 novel coronavirus infections with right truncation: A statistical analysis of publicly available case data The reproductive number of COVID-19 is higher compared to SARS coronavirus China coronavirus: what do we know so far? Early in the epidemic: impact of preprints on global discourse about COVID-19 transmissibility. The Lancet Global Health Estimating the asymptomatic proportion of coronavirus disease 2019 (COVID-19) cases on board the Diamond Princess cruise ship The largest human migration on the planet: What happens in China when 1.4bn go on holiday at the same time Social contacts and mixing patterns relevant to the spread of infectious diseases Time variations in the generation time of an infectious disease: implications for sampling to appropriately quantify transmission potential World Health Organisation. Coronavirus disease 2019 (COVID-19). Situation Report -69 Reproduction numbers for epidemic models with households and other social structures. I. definition and calculation of 0 Systematic selection between age and household structure for models aimed at emerging epidemic predictions Epidemic growth rate and household reproduction number in communities of households, schools and workplaces Challenges in control of Covid-19: short doubling time and long delay to effect of interventions. arXiv COVID-19 and italy: what next? The Lancet COVID-19: Disproportionate impact on ethnic minority healthcare workers will be explored by government Calculation of disease dynamics in a population of households The Prevention of Malaria Public perceptions, anxiety, and behaviour change in relation to the swine flu outbreak: cross sectional telephone survey Some model based considerations on observing generation times for communicable diseases On the asymptotic distribution of the size of a stochastic epidemic SIR epidemics in a population of households Modeling left-truncated and right-censored survival data with longitudinal covariates Empirical estimation of a distribution function with truncated and doubly intervalcensored data and its application to AIDS studies Early epidemiological analysis of the coronavirus disease 2019 outbreak based on crowdsourced data: a population-level observational study. The Lancet Digital Health A note on generation times in epidemic models Evaluating factors associated with STD infection in a study with interval-censored event times and an unknown proportion of participants not at risk for disease Reorganization of nurse scheduling reduces the risk of healthcare associated infections. medRxiv Epidemic and intervention modelling: a scientific rationale for policy decisions? Lessons from the 2009 influenza pandemic Investigation of a nosocomial outbreak of severe acute respiratory syndrome (SARS) in Toronto, Canada Comparison of incubation period distribution of human infections with MERS-CoV in South Korea and Saudi Arabia Can we contain the COVID-19 outbreak with the same measures as for SARS? Estimating clinical severity of COVID-19 from the transmission dynamics in Wuhan, China Characteristics of and important lessons from the coronavirus disease 2019 (COVID-19) outbreak in China: summary of a report of 72 314 cases from the Chinese Center for Disease Control and Prevention Prevalence of comorbidities in the novel Wuhan coronavirus (COVID-19) infection: a systematic review and meta-analysis Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan Author contributions CO and HS compiled the manuscript. All authors were involved in the research and revising of the manuscript. The authors declare no competing interests. All data and code used in this analysis is provided in the Github repository https://github.com/thomasallanh ouse/covid19-stochastics and https://github.com/thomasallanhouse/covid19-growth, with the exception of UK specific data. This data is provided by Public Health England under a data sharing agreement and we are unable to share this data. where c is a normalising constant such that We add further compartments to the within care home model since we are interested in the different pathways that these individuals may take. For the background epidemic this is not important, since all we need is a force of infection provided by individuals in the infectious class. 9 . Population and household transmission: Individual isolation and household quarantine Individual isolation does not intimately involve the household and so we assume that a fraction I α of symptomatic cases self-isolates and ceases transmission outside the household, meaning that we take the baseline but with To capture the essential features of household quarantine, we need to add states to the dynamical variables. Let We now suppose that a fraction S In this model, we assume that the duration of isolation after the absence of symptoms is exponentially distributed with mean equal to the fixed isolation period, given by 1/ σ . This assumption aids the modelling and is justifiable because, in reality, although a household may choose to isolate, they may not strictly follow the fixed period. It is likely that within-household isolation will increase transmission to other members of the household. We do not incorporate this property into the model, but this limitation is worth bearing in mind when drawing conclusions about the effectiveness of different strategies.