key: cord-0707726-m7n5bn9e
authors: Wood, S. N.; Wit, E. C.
title: Was R<1 before the English lockdowns? On modelling mechanistic detail, causality and inference about Covid-19
date: 2021-02-06
journal: nan
DOI: 10.1101/2021.02.03.21251112
sha: e47ef0c17bf4175f4902f4845dbefd513b6ca24e
doc_id: 707726
cord_uid: m7n5bn9e

Detail is a double edged sword in epidemiological modelling. The inclusion of mechanistic detail in models of highly complex systems has the potential to increase realism, but it also increases the number of modelling assumptions, which become harder to check as their possible interactions multiply. Knock et al (2020) fit an age structured SEIR model with added health service compartments to data on deaths, hospitalization and test results from Covid-19 in seven English regions for the period March to December 2020. The simplest version of the model has 684 states per region. One main conclusion is that only full lockdowns brought the pathogen reproduction number, R, below one, with R >> 1 in all regions on the eve of March 2020 lockdown. We critically evaluate the Knock et al. epidemiological model, and the semi-causal conclusions made using it, based on an independent reimplementation of the model designed to allow relaxation of some of its strong assumptions. In particular, Knock et al. model the effect on transmission of both non-pharmaceutical interventions and weather using a piecewise linear function, b(t), with 12 breakpoints at selected government announcement or intervention dates. We replace this representation by a smoothing spline with time varying smoothness, thereby allowing the form of b(t) to be substantially more data driven. We conclude that there is no sound basis for using the Knock et al. model and their analysis to make counterfactual statements about the number of deaths that would have occurred with different lockdown timings. However, if fits of this epidemiological model structure are viewed as a reasonable basis for inference about the time course of incidence and R, then without very strong modelling assumptions, the pathogen reproduction number was probably below one, and incidence in substantial decline, some days before either of the first two English national lockdowns. Of course this does not imply that lockdowns had no effect, but it does suggest that other non-pharmaceutical interventions (NPIs) were much more effective than Knock et al. imply.

In principle the inclusion of known mechanisms into models used for statistical inference should improve inference by reducing the bias caused by model misspecification. But there is a catch. What happens if the mechanisms are themselves described only in an approximate manner by ad hoc sub-models? It is then possible for the assumptions built into the sub-models to introduce substantial misspecification bias. The real world consequences of such bias could be substantial if the model is used to determine major public policies. This paper examines and re-implements the model of Knock et al. (2020) to investigate the robustness of the inferences made with it. We show that key results are entirely dependent on strong but incidental assumptions introduced in the model formulation, and that relaxation of those assumptions effectively reverses the conclusions. This may matter in assessing the effectiveness of lockdowns. Lockdowns substantially modify the evolutionary landscape for the pathogen in ways that seem unlikely to promote milder disease. They also cause substantial economic hardship. In England economic hardship is associated with very substantial loss of life as reviewed in detail in Marmot et al. (2020) . Knock et al. (2020) is the 41st report of the COVID-19 response team from Imperial College London, whose reports have played a profound role in the shaping of UK government policy on COVID-19. A major message from Knock et al. is that the pathogen reproductive number R was only reduced below one by full lockdowns in England in March and November (see figure 1), with incidence apparently increasing until the eve of the March lockdown. We show that this result does not survive relaxation of some strong modelling assumptions. Knock et al. also present 'counterfactual' simulations from the calibrated model from which they draw conclusions about the deaths that could have been avoided by an earlier first lockdown. We show that these simulations can not be viewed as 'counterfactuals' in the usual inferential sense. Hence the avoidable death figures are simple model extrapolations.

The Knock et al. model is an age-structured SEIR model with age-structured hospital compartments. The population is divided into 5-year age classes with a final 80+ class and two unstructured classes for care home residents and staff. There are 36 states in each of 19 classes (see figure 2) . The model was specified as a set of ODEs and converted to a discrete time stochastic model for fitting by the τleap method (Gillespie, 2001) . The model was fitted to daily data on hospital deaths, care home deaths, hospital admissions, general ward occupancy, ICU occupancy, antibody test results and PCR test results from surveys, which Knock et al. supply as supplementary material. Knock et al also attempt to fit data on test results from the health system. However their model does not attempt to deal with the nonrandom, opportunistic nature of the sampling in this data stream, despite the continual changes in test capacity, criteria for testing, and operation of the contact tracing system over the course of the data. We therefore believe that there is substantial danger of these data simply undermining the analysis and they should not be included in data to be fitted 1 . Data were available for seven English regions, which were fitted separately. The model has 26 free parameters. Knock et al. based model inference (fitting) on particle filtering methods, with full fit to all regions reported to take over 100 CPU days. This computational cost makes model checking difficult, particularly if the stronger model assumptions are relaxed, which involves allowing substantially more free parameters plus hyper-parameters. Additionally Knock et al. specify massive overdispersion in all but the test data streams. Decreasing this over-dispersion to levels consistent with the data would likely increase particle depletions problems in filtering, leading to yet longer computing times. Given these issues, plus the fact that the data sampling interval and total data duration are fairly short relative to the model's dynamic timescales, as well as the relatively large numbers of individuals involved, and the fact that we are fitting to aggregated data, it seems reasonable to fit the original ODE based model directly, rather than filtering its stochastic representation. The neglect of stochasticity in the state equations seems likely to be a minor issue here, relative to the other approximations made in the model. In any case, any results dependent on stochasticity would then require a much stronger justification for the stochastic formulation than that it was produced by discretisation of an underlying ODE model. Furthermore, a generic strength and weakness of the particle filtering methods used by Knock et al. is that they necessarily filter the state variables as well as model parameters. This is advantageous for state forecasting, but can be more problematic for inferential tasks. For an ill-specified dynamic model the filter is often forced to repeatedly select state transitions that are improbable under the model, in order to be sufficiently close to the data. This can result in the filtered states being in an extreme tail of the posterior predictive distribution of the model: that is, of the distribution implied by simulating unfiltered states from the model given the posterior distribution of parameters. Hence model adequacy needs to be checked by comparison of the data with simulations from the posterior predictive distribution. Knock et al. do not report such checks, showing only the outputs of filtering. This is problematic when reality is then contrasted to 'counterfactual' simulations, necessarily from the posterior predictive distribution. The simple ODE approach used here does not filter. Instead the states are determined entirely by the model equations and the parameter values. This approach is unforgiving of model misspecification: adequacy is directly assessable from the model fit. It also reduces fit time by four orders of magnitude.

2 Evaluation of original Knock et al. age-structured SEIR model Figure 2 is a schematic showing the compartments in each 5-year age or care home class. The exposed, but pre-symptomatic, E stage is modelled by two sequential compartments. It is assumed that no infections are caused by this class. Symptomatic and asymptomatic stages I C and I A follow and cause infections, both are single compartment. The duration of the I C stage is set from data on time from onset of symptoms to hospital admission. The absence of pre-symptomatic infection will lead to longer generation times than are reported in the literature (e.g. Flaxman et al., 2020; Anderson et al., 2020, table 1) , elevating the R estimates required to achieve observed growth rates. Care home residents are not hospitalised, and the G i D class shown actually only receives patients for the care home resident class. Knock et al. (2020) , but with the notational modifications used here, stages represented as two sequential compartments indicated with notched boxes, and the location of the extra stage P that we insert to relax the generation time assumptions shown by the grey arrow and 'P'. To obtain the rate of flow from one compartment to another, follow the path joining them in the direction of the arrow, multiplying the source state variable by the rate parameters labelling the segments of the path. Rates with a superscript i vary with age class. The relative rates in different classes was obtained from a separate analysis reported in Knock Model compartments for PCR and antibody test positivity are fed by the infection rate and the progression rate from the E state, respectively. The infection rate is driven by an age-structured mixing model with contact matrix, C based on the POLYMOD survey data (Mossong et al., 2017) . Most elements of C are multiplied by a function b(t) modelling the impact of NPIs and weather on contact rates. In Knock et al. b(t) is piecewise linear with 12 breakpoints (and 12 free parameters) at policy change points. A major aim here is to relax the very strong assumptions built in to such a restrictive model. Care home contact rates are separately parameterized.

Hospitalized patients follow an ICU or general ward route. There are separate compartments for those eventually recovering or dying on the general ward. The ICU route has a pre-ICU compartment, from which patients enter compartments for those dying in ICU, entering ICU but dying later on the general ward, or entering ICU and recovering on the general ward. All compartments are duplicated for confirmed Covid (starred) and not yet confirmed (not starred), with a parameter, γ U , controlling the rate of testing based transfer from unconfirmed to confirmed. It is assumed that, from the start, 25% of patients arrive at hospital with confirmed covid. This is improbable given initial testing capacity.

The model captures many features in impressive detail, but several aspects are not modelled:

4. Region-to-region transmission at the epidemic start is not represented, compromising early model fit and R estimates, as imported cases are modelled as local.

The assumption of no pre-symptomatic infectivity is inconsistent with empirical estimates of the serial interval and generation time, reviewed in Anderson et al. (2020) , for example.

6. Within hospital transmission is not modelled, although hospital-acquired infections has been reported to account for 1 in 4 cases at times in both waves 2 , compromising the hospital data fit.

7. No interaction between NPIs and age is allowed, which is unlikely given the risk-by-age profiles.

8. Differential transmission rates between symptomatics and asymptomatics are not modelled.

9. The reported differences in disease progression between men and women are not modelled.

10. Changes in testing rates with capacity changes are not modelled.

For concreteness we describe the core of the SEIR model, giving the equations for other compartments in supplementary A. Denoting the time derivative of a variable x byẋ, then for the ith class,

λ i (t) is the force of infection defined below, and is the only interesting interaction between age classes. p c is the proportion of the infected showing symptoms, and the γ parameters determine between compartment flow rates, given in Knock et al. (2020) . I(·) is an indicator function and φ t 0 ,σt is an N (y 0 , σ 2 t ) p.d.f. where t 0 is a free parameter. This initialization differs slightly from Knock et al. who put 10 individuals in the age 15-20 asymptomatics at t 0 . It is unclear why this is sensible, although it may slightly delay the first wave model care home epidemic. Susceptibles, S i , are initialized from regional demography supplied in the Knock et al. supplementary material. Care home sizes are supplied in the sircovid package by the carehomes_parameters() function (Baguelin et al., 2021) . R for this system can be computed according to Diekmann et al. (1990) . See supplementary A.3. Our fitting also requires the derivatives of the model states with respect to the parameters: the sensitivities. These follow directly from the model specification.

Generically each term in the model equation involving a state gets replaced by that state's derivative w.r.t the parameter of interest, and to this are added any terms relating to direct dependence on the parameter of interest. For example, if γ C was a free parameter thenİ i

2 https://www.hsj.co.uk/patient-safety/covid-infections-caught-in-hospital-rise-by-a-third-in-one-week/7029211.article 5 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ;

Writing I for the vector of infectious individuals in each class, then the model for the force of infection in each class is λ = MI where

, m chw and m chw are free parameters. b(t) is a parameterized function of time controlling the variation of infection causing contact over time. C is a symmetric matrix of contact rates and c chw a vector (derived from it for carehome workers). I j is the sum of asymptomatic (I j A ) and symptomatic infectious (I j C ) in class j. Supplementary A.1 has the force of infection expressed so that sensitivities follow by inspection.

C is based on the POLYMOD survey (Mossong et al., 2017) accessed through the socialmixr R package (Funk, 2020) . This had 1011 UK participants, who each recorded their contacts on one day. There were 7 participants in the 75-80 age group and none over 80. Supplementary A.2 gives details.

The likelihood is constructed from binomial components for the PCR and antibody test data (see supplementary A.9), and negative binomial components for the hospital death, care home death, hospital admissions, general ward occupancy and ICU occupancy data. For the negative binomial components Knock et al. (2020) set κ = µ 2 /(σ 2 − µ) equal to 2 in all cases without justification offered. This is a huge level of overdispersion, heavily down-weighting the data relative to the priors. For example, for an expected death rate of 200 it raises the standard deviation from 14, for a Poisson deviate, to 140. It is hard to understand this choice, unless it was made to avoid particle depletion problems in filtering. Hospital deaths, for example, show no evidence of over-dispersion relative to Poisson. Still more problematic is the assumption that observed daily bed occupancy is given by a negative binomial deviate with expectation given by the model, with these deviates independent between days. We are at a loss to understand what mechanism could give rise to such a model. A reasonable model might have daily arrivals and discharges as independent random variables with means given by the model, but occupancy obviously integrates these arrival and discharge rates over days, leading to strong dependence between days. The stochastic version of the model might model some of this dependence, but leaves even less justification for additional independent negative binomial variability.

In this section we present a modification of the Knock et al. model in order to deal with some of the deficiencies identified above. They consist of a number of corrections and minor modifications and, more fundamentally, relaxing the strong modelling assumptions made by Knock et al.

Rates. The γ parameters controlling rates of progression between model compartments are either taken from the literature, or are estimated from CHESS 3 data that are not available for checking. There are at least two identifiable problems with the durations used in Knock et al. (2020) . Firstly they set the mean duration of the E stage to 4.6 days citing Lauer et al. (2020) . That paper actually reports a mean of 5.5 days, with 4.6 days lying just above the lower 95% confidence limit for the median. Here we used the 3 COVID-19 Hospitalisations in England Surveillance System 6 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101/2021.02.03.21251112 doi: medRxiv preprint mean of 5.8 days from the meta-analysis of McAloon et al. (2020) , which includes Lauer et al. as one of the studies. In fact the most statistically careful analysis we found (Deng et al., 2020) gives an estimated mean incubation period of 9.1 days (n=1211), and generation time of 5-6 days. Secondly Knock et al. assume that the mean time from symptoms to hospitalization is 4 days based on Docherty et al. (2020) , but that paper gives 4 days as the median. An exponential distribution is used for time from symptoms to hospitalization (a model which the figures reported in Docherty et al. does seem to support), so the median is log 2 of the mean. Based on the male and female medians of 5 and 4 days reported in Docherty et al., we therefore used a mean time to hospitalization of 6.5 days.

Another issue is the assumption that 25% of patients were arriving at hospital with a test confirming their status from the start of the epidemic, despite the initial lack of infra-structure for this to happen.

The assumption actually appears to make rather little difference to inference, but leaving it in place is clearly not quite right. Have no data to set up an evidence based alternative, we implemented the ad hoc modification of allowing the pre-tested rate, p * to increase linearly from 0 to 0.25 between days 90 and 200, staying at 0.25 thereafter. While not ideal, this is arguably a less wrong assumption.

Priors. The priors used were not exactly those in Knock et al., rather priors were set to be vague on a working parameter scale. Any limits on parameter were set by the prior intervals reported in Knock et al.. Parameters were optimized on a working scale -either untransformed, log transformed or scaled logit transformed. Gaussian priors on the working scale were also applied, but except for t 0 these were vague, and their only purpose was to allow ready detection of any parameters that were not identifiable. See supplementary A.8 for details.

While our basic conclusions are in fact unchanged if we use the Knock et al. likelihood for the hospital occupancy data, we can see no valid justification for this part of the model formulation, and therefore replaced it with a defensible likelihood, based on the daily change in occupancy. In particular we model the ward (or ICU) arrivals and departures as independent overdispersed Poisson deviates, the difference in which gives the daily change in occupancy. A difficulty with applying this model directly is that hospital arrivals and discharges tend to have weekly pattern. This pattern shows up strongly in the ACFs and PACFs of occupancy first differences for some regions, especially east of England, but is absent from the model. We therefore base the likelihood on weekly changes. Since the changes in occupancy carry no information on the level of occupancy, we also add the sum of daily bed occupancies as a final datum to be fitted, treating this as close to Poisson (by setting κ to a very high constant).

For the total daily hospital admissions data and the care home deaths data we retain the negative binomial model, with the respective κ parameters free to be estimated. Some overdispersion to deal with likely model mismatches in these components seems pragmatic. For the hospital deaths we set κ = 2000, which gives a likelihood very close to Poisson. There seems no legitimate reason to expect overdispersion here if the model is at all fit for purpose.

The largest change made here is to relax the strong assumption that b(t), which represents the effects of NPIs and the weather, is a piecewise linear function with slope changes only at 12 selected NPI change points. Here, b(t) is instead represented semi-parametrically by a logistic transform (see section A.8) of an adaptive smoothing spline, with 80 coefficients and 5 smoothing parameters, in which the degree of smoothness is allowed to vary smoothly with t. See section 5.3.5 of Wood (2017) for details. This relaxation substantially increases the role of data, relative to model assumptions, in inferences about the 7 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; size and shape of b(t). Of course it does nothing to remove the confounding of weather and NPI effects, but does avoid the implication that the weather changes in response to government announcements.

We also relaxed the assumption that all the γ parameters are fixed and known. Firstly, the reference used to justify the choice of γ G , controlling the rate of progression of fatal disease in care homes, Bernabeu-Wittel et al. (2020) , appears to contain no information on this parameter, so we allowed it to be a free parameter, which slightly reduces care home death mistiming. Secondly, the model also has difficulty matching the general ward and ICU occupancy data, tending to over-estimate both in the Midlands and two northern regions. To reduce this problem it seemed reasonable to relax the assumption that all the rate parameters controlling progression through the health system were fixed and known. In particular we relaxed the parameters for which there seemed likely to be most scope for some latitude in clinical judgement, perhaps driven by local circumstances, to make substantial differences. In particular we relaxed the assumptions on the rates related to movement of recovering patients through the system. That is γ IC Wr , γ Wr and γ Hr were treated as free parameters.

A final rigidity in the model structure is that there is assumed to be no infection before individuals could at least potentially become symptomatic on leaving the E stage. At the same time the mean duration of the symptomatic infective stage is set equal to the mean time from symptom onset to hospitalisation. This makes for a very long generation time, much longer than the 5-7 days reported in the literature for the serial interval or generation time (see table 1 of Anderson et al., 2020 , for a review). One consequence of this is that R estimates need to be higher than those usually quoted to meet the initial rate of increase in the disease (Knock et al., 2020 , actually limits R in a way that avoids estimates being too high). To relax this link between clinical disease progression rates and the generation interval, we introduced an extra compartment between I c and hospitalization (see the grey 'P' on figure 2).

where P replaces I c in all flows into hospital compartments and the R state. By appropriate choice of γ ph , this state allows us to shorten the E state and I c state, hence reducing the generation time, without changing the literature based mean time from infection to hospitalisation. In particular, we shortened the E state to have an average of 3 days to infectivity, and the I C state to be 4 days, yielding a generation time of 6.2 days (accounting for the duration of I A , which was unchanged).

The sensitivities of the model states with respect to the parameters were obtained for all 703 model state variables, yielding a system of 65379 sensitivity ODEs. Model and sensitivities were solved by fourth order Runge-Kutta integration (see e.g. Press et al., 2007) with a one day time step (having confirmed that halving the step made negligible difference to the evaluated likelihood). Hence the log likelihood and its derivatives w.r.t. the free parameters could be readily evaluated. Due to sparsity and cache efficiency, the sensitivity system less than doubles computing time for the model. Computing the likelihood and derivatives and R series for the full model takes less than a second on a single core of a low specification laptop -it is considerably faster for the original Knock et al. model with fewer free parameters.

Given the log likelihood and derivatives, the penalized log likelihood and derivatives are also readily evaluated, so the posterior modes of the free parameters can be obtained by quasi-Newton optimization. The smoothness of b(t) was controlled by a Gaussian smoothing prior, with 5 free smoothing parameters, which were obtained by the approximate marginal likelihood optimization method of Wood and Fasiolo (2017). Uncertainty was assessed using the large sample approximate posterior covariance matrix of the parameters, and the delta method. See supplementary A.10 8 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101/2021.02.03.21251112 doi: medRxiv preprint Figure 3 shows the fit of the model with the various assumption relaxations applied. The model fits imperfectly, but given the ambitious nature of the fitting task, it seems reasonable to view it as useful in the 'all models are wrong, but some are useful' sense. Figure 4 shows the corresponding inferences about R and incidence. All regions have peak incidence prior to the first lockdown with total incidence for England in decline well before lockdown. The regional incidence picture is more mixed at the second lockdown, although the total is again falling well before lockdown. Furthermore all regions have R 1 by either lockdown, with average R < 1 some days before either lockdown. Several regions relatively distant from London have the inferred R initially increasing. This is probably an artefact caused by the independent initialisation of each region, which cannot capture the initial region-to-region spread. As in Knock et al. (2020) the plotted uncertainties are over-optimistic as they do not account for the uncertainty in most of the rate constants.

Although it could also be weather driven, the systematic pattern of R continuing to fall after the first lockdown and then increasing again well before restrictions were lifted, is to be expected. R is the average number of new infections per existing infection. Immediately after lockdown most infections are in the locked down population, with a low R, and only a minority are in the key worker population with higher R, so the average is low. After the locked down populations runs out of household members to infect, the proportion of infections among key workers must increase, due to their higher R. So the average R must increase too. The piecewise linear model of Knock et al. can not accommodate this effect.

Fits with the generation time relaxation (not shown), but not correcting the mean/median confusion in the time to hospitalization, shift the incidence peak and R = 1 point a little later, but still before lockdown. The equivalent results without relaxation of the long generation time assumption gave similar incidence patterns, but with higher initial R, as expected. The average R for England dropped below one at about lockdown: before or after can not be determined with statistical confidence. Not correcting the occupancy data likelihood again gives results very similar to figures 3 and 4.

If fits of this model to data are viewed as a reasonable basis for inference about the timing of incidence and R levels, then the implication is that R < 1 probably occurred some time before both the first two English lockdowns, and that incidence was already in sharp decline before either. The contrary result of Knock et al. relies on a very restrictive model for b(t), a much longer generation time than is reported in the literature, and the inappropriate use of a exponential median as an exponential mean. Knock et al. (2020) make three major claims in their paper. Whereas the first is of a descriptive nature, namely that the two English Covid-19 lockdowns in March and November 2020 coincide with a major drop in the reproduction rate of Covid-19 in the UK, the other two are of a so-called "counterfactual" nature: (i) if England had not gone into lockdown, then there would have not been an associated drop in reproduction rate and (ii) if England had gone down into lockdown earlier (or later) then a lot of lives would have been saved (or lost, respectively).

The key challenge is that a counterfactual cannot be directly observed and must be approximated with reference to a comparison group. There are various accepted approaches to determining an appropriate comparison group for counterfactual analysis, ideally using a prospective design. When this is not available, such as in this case, a retrospective approach is necessary. But there are stringent conditions on a retrospective design in order for it to have counterfactual validity, such as avoiding confounding, contamination, and impact heterogeneity. Confounding occurs where certain factors, for example the various social distancing measures in place prior to the lockdowns, are correlated with exposure to the 9 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101/2021.02.03.21251112 doi: medRxiv preprint ) , to the death, hospital and testing data, with one region per row. In the leftmost 'deaths' column, grey points are hospital deaths and red points care home deaths. In the second 'occupancy' column, grey is general ward occupancy and red ICU occupancy. Note some substantial discrepancies in the two northern regions.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101/2021.02.03.21251112 doi: medRxiv preprint intervention and, independent of exposure, are causally related to the outcome of interest. Confounding factors are therefore alternate explanations for an observed, but possibly spurious, relationship between intervention and the outcome; in this case between lockdown and the reduction in R. The pre-lockdown social distancing measures are also an example of contamination, which may also invalidate any counterfactual statements. Contamination occurs when members of treatment group (i.e. the actual population) and/or comparison groups (i.e. the counterfactual populations) have access to another intervention which also affects the outcome of interest. Additionally, there is the issue of impact heterogeneity: the impact of the lockdown will be very different in the locked down subset of the population, compared to key workers, who are less restricted. Finally, Knock et al. explicitly state that b(t) is modelling both the effects of NPIs and the weather. There is therefore no basis on which the model can identify the effect of lockdown independent of the weather, enabling the counterfactual manipulation of one while appropriately controlling the other. But such control is absolutely fundamental to causal reasoning with counterfactuals.

We conclude that the model and inference of Knock et al. do not form any sort of reasonable basis for making counterfactual statements about how many people would have died if lockdown had occurred at a different time.

Our results on the timing of R < 1 and peak incidence obviously do not imply that the lockdowns had no effect. The point is rather that the additional effect, on top of the cumulative effects of other behavioural changes pre-dating lockdown, seems likely to have been greatly overstated. In our view, determining definitively what caused R to drop below one is not possible. In March especially, policy and behavioural changes were so rapid (with major government changes on March 4th, 13th, 16th, 20th and 24th) that there would simply have been insufficient time to determine what had worked, even if adequate data had been gathered to answer this question. In fact, there was no surveillance testing at that point. However, it seems to us very difficult to make the case that only full lockdowns brought R below one, whether region-by-region or in aggregate for England.

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101 https://doi.org/10. /2021 

This appendix gives full details of the model structure and computations not covered in the main paper, including the model compartments for care homes, hospitals and testing, as well as the output variables predicting the data, the specification of the likelihood terms, and further details of the inferential methods.

Here is the force of infection term for for classes 0 to 16, written out in a form where the sensitivities follow easily.

For care home workers, class 17,

and for care home residents, class 18,

The sensitivities require the derivatives of these terms w.r.t. the parameters θ (i.e. , m chw , m chw and the parameters of b(t) -sensitivities to further parameters are obviously zero).

C is based on the POLYMOD survey (Mossong et al., 2017) accessed through R package socialmixr (Funk, 2020) , which had 1011 participants in the UK, who each recorded their contacts on one day. There were 7 participants in the 75-80 age group and none over 80. Knock et al. (2020) are vague about exactly how C ij is obtained, but given the statement that it is symmetric there is really only one sane option. Let A ij denote the average number of contacts of someone in age class i with someone in age class j. Because population sizes may differ between age classes, A ij = A ji in general. Let S 0 i denote the population (initially susceptible) in class i. The total number of contacts between members of class i and class j is T ij = A ij S 0 i . Obviously for total contacts T ij = T ji . In reality the polymod estimates of T ij and T ji will usually be different because the study population is not closed and there are likely to be some recording errors. It therefore makes sense to replace both by their average.

The number of infections generated in class i by class j is proportional to the total contacts between S i and I j . That is the total number of contacts between the classes, multiplied by the proportion that are between S i and I j ,

by definition of C ij .The contacts for care home workers with the general population are set to the mean of the C ij for the ages 20-65. In practice there is no data on 80+ with 80+ contacts. The preceding age class has 10% of contacts within age class, so this proportion could be assumed to apply to the 80+ also, and was the assumption made here. Knock et al. (2020) do not document what they did. The sircovid package accompanying Knock et al. (2020) has a carehomes_parameters() function. Among other things this returns a 14 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021. ; matrix m, which is exactly the C matrix defined above, up until age 70, but thereafter something undocumented appears to have been done and it is not clear if this was the C used or not. In sensitivity testing it made negligible difference whether we used the sircovid matrix or the version just described.

A.3 Computing R Knock et al. (2020) do not describe exactly what was done to compute R, but section 4.1 of Diekmann et al. (1990) provides what is required. For the current model we conceptually divide the population into those who will show symptoms and those who will not, so that we have E A and E C (exposed eventually asymptomatic and exposed eventually symptomatic). This division does not change the dynamics. Let S denote the vector of susceptibles in each class. We define the matrix

R is the dominant eigenvalue of K. Checking by simulation, the epidemic does indeed decline when R < 1 and increase when R > 1. Note that except when R = 1, anything that lengthens the assumed generation time requires a corresponding increase(decrease) in R in order to achieve the same epidemic growth(shrinkage) rate in time. The generation times assumed by Knock et al. (2020) are fairly long, and get longer after the rate parameter corrections described in the main paper.

To average R from different regions we need to weight by the total number of infectives in each region, since conceptually R is the mean number of new infections caused by each infective.

The first additional compartments relate to care home deaths. The equations are given for each age class, but actually only apply to class i = 18, care home residents.

Because this only applies in a single class there is only one p G D parameter to estimate. ψ i H is given in the Knock et al.supplementary material as psi_hosp_symp.

Now consider the ICU flows (i = 0 . . . 17). Define

t 0 is set to 1 April and t 1 to 1 June. The reference given to justify this term is a report on a clinical trial that showed modest improvements, but did not finish until July (RECOVERY Collaborative Group, 2020). It is unclear how improvements developed during the trial could have wide-spread consequences before its completion.

In what follows starred states are for cases where Covid-19 has been confirmed by test. The ICU compartments are governed by the following ODEs -here written out in a form such that the sensitivity equations follow by inspection. ICU i pre is the compartment preceding admission to ICU, ICU i Wr is the compartment for patients in the ICU who will eventually recover, back on the ward, ICU i W D is for ICU patients who will eventually die back on the general ward, ICU D compartments are for patients who 15 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

The copyright holder for this preprint this version posted February 6, 2021.

Finally there are compartments used to determine the proportions testing positive in randomized testing. They have no free parameters (S subscript for antibodies, P for PCR).

The negative binomial log likelihood is used for admissions, general ward and ICU occupancy and deaths.

l(µ, κ) = κ log κ κ + µ + y log µ µ + κ + log Γ(κ + y) − log Γ(κ) − log y! and so ∂l ∂µ = y µ − y + κ µ + κ .

As discussed in the main paper, the likelihood based on independent negative binomial deviates is not justifiable for occupancy, and an alternative based on changes in occupancy, which can be modelled as independent, is more appropriate. If we model the ward (or ICU) arrivals as Poisson, and the departures as Poisson, then this daily change will follow a Skellam distribution, but that allows for no overdispersion. The difference between negative binomials (with common κ parameter) is a skewed generalized discrete Laplace distribution, but is computationally awkward. We therefore model ward and ICU arrivals and departures using overdispersed versions of the normal approximation to the Poisson, N (µ 1 , kµ 1 ) and N (µ 2 , kµ 2 ), with difference N (µ 2 − µ 1 , k(µ 1 + µ 2 )). Let σ 0 = µ 1 + µ 2 and α = (y − µ 1 + µ 2 ), where y is now the observed change in occupancy, then l d (µ 1 , µ 2 ) = − α 2 2kσ 0 − log(kσ 0 )/2 − log(2π)/2 and so ∂l d

A difficulty with applying this model directly is that hospital arrivals and discharges tend to have weekly pattern. This pattern shows up strongly in the ACFs and PACFs of occupancy first differences for some regions, especially East of England, but is absent from the model. We therefore base the likelihood on weekly changes. Since the changes in occupancy carry no information on the level of occupancy, we also add the sum of daily bed occupancies as a final datum to be fitted, treating this as close to Poisson (by setting κ to a very high constant). For the total daily hospital admissions data and the care home deaths data we retain the negative binomial model, with the respective κ parameters free to be estimated. Some overdispersion to deal with likely model mismatches in these components seems pragmatic. For the hospital deaths we set κ = 2000, which gives a likelihood very close to Poisson. There seems no legitimate reason to expect overdispersion here if the model is at all fit for purpose.

Given smoothing parameters, λ, and overdispersion parameters we find the posterior parameter modeŝ

l is the log likelihood and the S j are fixed positive semi-definite matrices imposing the spline smoothing penalty. µ θ j and ν θ j are prior means and precisions. The precisions may be zero, and are for all spline coefficients. Quasi-Newton optimization was used to findθ (see e.g. Nocedal and Wright, 2006) . Let H λ denote the Hessian of the negative of the penalized log likelihood given in (48), evaluated atθ (it is obtained by finite differencing the computationally exact gradients). Then we have the large sample approximation to the posterior θ ∼ N (θ, H −1 λ ).

19 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10. 1101 /2021 The smoothing parameters were obtained using the approximate marginal likelihood maximization method of Wood and Fasiolo (2017) , which alternates optimization of (48), given smoothing parameters, with updates of all smoothing parameters, givenθ

At each step of this iteration the overdispersion parameters were obtained by maximum likelihood estimation, given the current model predictions. Uncertainties for incidence and R trajectories were obtained from (49) by the delta method (applied on the log scale).

. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

The copyright holder for this preprint this version posted February 6, 2021. ; https://doi.org/10.1101/2021.02.03.21251112 doi: medRxiv preprint

Reproduction number (R) and growth rate (r) of the COVID-19 epidemic in the UK: methods of estimation, data sources, causes of heterogeneity, and use as a guide in policy formulation

sircovid: SIR Model for COVID-19

Death risk stratification in elderly patients with covid-19. a comparative cohort study in nursing homes outbreaks. Archives of gerontology and geriatrics 91

Estimation of incubation period and generation time based on observed length-biased epidemic cohort with censoring for covid-19 outbreak in china

On the definition and the computation of the basic reproduction ratio R0 in models for infectious diseases in heterogeneous populations

Features of 20 133 UK patients in hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study

Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe

socialmixr: Social Mixing Matrices for Infectious Disease Modelling

Approximate accelerated stochastic simulation of chemically reacting systems

The 2020 SARS-CoV-2 epidemic in England: key epidemiological drivers and impact of interventions

The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application

Health Equity in England: The Marmot Review 10 Years On. The Health Foundation

die in the ICU. Parameters µ IC , µ D , p max H , p max IC , p max IC D and p max W d are free, and we additionally treated γ IC Wr as free.The next 6 compartments are the step down from ICU to regular ward compartments. They originally involved no free parameters, but we treated γ Wr as free. Subscripts r and D refer to recovery or death.Six further compartments are for the hospital general ward, and again depend on free parameters, including new one p max H d . We also treated γ Hr as free. Again subscripts r and D refer to recovery or death.The recovered compartment is governed bẏ(this is corrected from Knock et al., 2020) .(corrected from Knock et al., 2020) .

The outputs required from all this are as follows, again written in a form in which the sensitivities are immediate. Firstly the rate of new Covid cases in hospital, which is the sum of the flows into the H r * , H D * and ICU pre * states plus all the rate γ U flows.Note that Knock et al. (2020) modify some of the γ U plumbing between the ODE statement of the model and its stochastic discretisation, but this does not change the flow totals. The hospital regular bed occupancy is given by (correcting report notation mutation)But a correctly specified likelihood actually requires regular bed arrivalsand departuresbut again for a defensible likelihood we need arrivalsand departuresThe death rate in hospital isThe care home death rate is given byThe testing data require(summation over ages 15 to 65) and(summation from age 5 upwards and care home workers, but not residents). In the Knock et al. there is one further stream for the Pillar 2 PCR data, but the model seems so crude that this stream can really only undermine inference and is better omitted.A.8 Priors .Parameters were optimized on a working scale, so that the dynamic model parameters were θ j = h k (θ j ) where θ j was unconstrained and k selects among 3 alternatives. h 1 was the identity, h 2 the exponential and h 3 = a 1 + (a 2 − a 1 )e θ j /(1 + e θ j ), which constrains the θ j to the interval (a 1 , a 2 ). The limits were generally set from the prior intervals given in Knock et al. (2020) . Gaussian priors on the working scale were also applied, but except for t 0 these were vague, and their only purpose was to allow ready detection of any parameters that were not identifiable. h 2 was used for γ G , γ IC Wr , γ Wr and γ Hr . The working scale mean vector for these was µ p = (−1, −2.75, −1.8, −2.37) with σ p = 2 in all cases. The means were set from the Knock et al. (2020) estimates. The h 3 applied to the adaptive spline to yield b(t) had a 1 = 0 and a 2 = 0.1.

The likelihood is constructed from negative binomial and binomial components. The binomial log likelihood is used for the serology and REACT PCR data.l b (p) = y log p + (n − y) log(1 − p) + log n! − log y! − log(n − y)! so ∂l b ∂p = y p − n − y 1 − p .