key: cord-0982111-6vqf2n5j
authors: Brauner, J. M.; Sharma, M.; Mindermann, S.; Stephenson, A. B.; Gavenciak, T.; Johnston, D.; Salvatier, J.; Leech, G.; Besiroglu, T.; Altman, G.; Ge, H.; Mikulik, V.; Hartwick, M.; Teh, Y. W.; Chindelevitch, L.; Gal, Y.; Kulveit, J.
title: The effectiveness and perceived burden of nonpharmaceutical interventions against COVID-19 transmission: a modelling study with 41 countries
date: 2020-05-30
journal: nan
DOI: 10.1101/2020.05.28.20116129
sha: 5f9fb5bb4f8e92e72f67b874ae61ab6c2ddd4417
doc_id: 982111
cord_uid: 6vqf2n5j

Background: Existing analyses of nonpharmaceutical interventions (NPIs) against COVID-19 transmission have focussed on the joint effectiveness of large-scale NPIs. With increasing data, we can move beyond estimating aggregate effects, to understanding the effects of individual interventions. In addition to effectiveness, policy decisions ought to reflect the burden different NPIs put on the population.

Methods: To our knowledge, this is the largest data-driven study of NPI effectiveness to date. We collected chronological data on 9 NPIs in 41 countries between January and April 2020, using extensive fact-checking to ensure high data quality. We infer NPI effectiveness with a novel semi-mechanistic Bayesian hierarchical model, modelling both confirmed cases and deaths to increase the signal from which NPI effects can be inferred. Finally, we study the burden imposed by different NPIs with an online survey of preferences using the MaxDiff method.

Results: Six NPIs had a >97.5% posterior probability of being effective: closing schools (mean reduction in R: 58%; 95% credible interval: 50% - 64%), limiting gatherings to 10 people or less (24%; 6% - 39%), closing nonessential businesses (23%; 5% - 38%), closing high-risk businesses (19%; 1% - 34%), testing patients with respiratory symptoms (18%; 8% - 26%), and stay-at-home orders (17%; 5% - 28%). These results show low sensitivity to 12 forms of varying the model and the data. The model makes sensible forecasts for countries and periods not seen during training. We combine the effectiveness and preference results to estimate effectiveness-to-burden ratios.

Conclusions: Our results suggest a surprisingly large role for schools in COVID-19 transmission, a contribution to the ongoing debate about the relevance of asymptomatic carriers in disease spreading. We identify additional interventions with good effectiveness-burden tradeoffs, namely symptomatic testing, closing high-risk businesses, and limiting gathering size. Closing most nonessential businesses and issuing stay-at-home orders impose a high burden while having a limited additional effect.

The governments of the world have mobilized vast resources to fight the COVID-19 pandemic. A wide range 1 of non-pharmaceutical interventions (NPIs) has been deployed, among them drastic measures like the closure of all businesses and national lockdowns. Recent analyses show that these large-scale NPIs appear to be jointly effective at reducing the virus' effective reproduction number. 2, 3 As time progresses, more data becomes available from different countries that have implemented different NPIs (Figure 2 ). We can thus move beyond estimating the aggregate effect of a bundle of NPIs, and understand the effects of individual interventions.

But selecting the right policy depends on more than the estimates of effectiveness. Drastic NPIs, such as society-wide social distancing, cause widespread disruption to many aspects of social life, including the quality of life, economic prospects, 4 and potentially mental health 5 of the entire population. When selecting policies, it is thus important to consider the burden they impose.

The aim of this paper is to estimate the effectiveness of various NPIs at reducing the spread of COVID-19 and the burden they put on the population.

To disentangle the effects of individual NPIs, we need to leverage data from multiple regions with diverse bundles of NPIs. With some exceptions (Flaxman et al. 2, Chen and Qiu 6, and Banholzer et al. 7), previous data-driven studies focus on single NPIs and/or single regions (Table 1). In contrast, we evaluate the impact of 9 NPIs on the growth of the epidemic in 34 European and 7 non-European countries. To our knowledge, this is the largest data-driven model of NPI effects on COVID-19 transmission to date. Additionally, the focus of previous work has largely been on costly NPIs (Table 1). In line with our aim of identifying effective interventions with little burden, we additionally analyse the effects of several less disruptive NPIs (Table 2).

Before collecting data, we experimented with two public datasets on NPIs, finding that they contained some incorrect dates and were not complete enough for our modelling. a By focussing on a smaller set of countries and NPIs than is present in these datasets, we were able to implement strong quality controls in our data collection. We make this high-quality dataset public, as well as the Epidemic Forecasting Global NPI database, a much larger but less rigorously verified dataset.

a We evaluated the following datasets:
• Oxford COVID-19 Government Response Tracker (OxCGRT) 8
• #COVID19 Government Measures Dataset 9
Note that these datasets are under continuous development. Many of the mistakes we found will already have been corrected. Also, we know from our own experience that data collection can be very challenging. We have the fullest respect for the work of the people behind these datasets. In this paper, we focus on a much more limited set of countries and NPIs than is present in these datasets, allowing us to ensure higher data quality in this subset. Given our experience with public datasets and our data collection, we encourage fellow COVID-19 researchers to independently verify the quality of public data they use, if feasible.

To estimate NPI effectiveness, we design a novel semi-mechanistic Bayesian hierarchical model with a time-delayed effect for each NPI. A key assumption of our model is that the effect of each NPI on the reproduction number is stable across different countries and over time. This assumption is present in all closely related works. Our model can be seen as an extension to that of Flaxman et al. 2 , using both confirmed cases and deaths as observations to increase the amount of signal from which NPI effects can be inferred.

Constructing an NPI model is a perilous task since its conclusions can be sensitive to the assumptions and data. Therefore, it is crucial to validate it. However, such validation is often incomplete or absent from previous work. We perform what is, to our knowledge, by far the most extensive validation of any NPI model for COVID-19 to date: evaluating predictions for countries and time periods not seen during training (Figures 4 & 5), evaluating different models that use different observations (deaths and confirmed cases; Figure 6), testing the robustness to unobserved NPIs (Figure E.10), and analyzing sensitivity to many perturbations (Appendix E). Nonetheless, our model comes with important limitations and uncertainties, which we discuss in Appendix I.

Finally, to study how burdensome people perceive different NPIs to be, we collected preference data using a best-worst scaling 10 discrete choice online survey instrument. As community surveys are often successfully used in public health settings to estimate the preferences over various treatments and interventions, 11 we believe this data can provide valuable input when evaluating NPIs. While there are many other ways to estimate the costs of NPIs, for example by modelling economic impacts, these estimates are often dominated by long-term effects. For example, a large part of the economic impact of closing schools could consist in the loss of human capital. 12 These long-term effects are currently hard to predict and are co-determined by economic policy responses and many other effects beyond the scope of this study.

• High quality data on the largest number of countries and NPIs studied to date, including several less costly NPIs
• A novel combined model utilising both confirmed cases and deaths
• Extensive model validation
• Estimation of population preferences over NPIs and analysis of effectiveness-burden tradeoffs

We collected a large database from 67 countries, which we call the Epidemic Forecasting Global NPI (EFGNPI) database. The database contains more than 1700 events, tagged with 194 keywords, which are distilled into 24 classes of NPIs. Details of the EFGNPI database are given in Appendix B.

As described in the introduction, we found that public datasets on NPIs contained frequent incorrect entries. We expect the same to be true for the full EFGNPI database. For the smaller set of NPIs and countries used in this study, we implemented further steps to ensure data quality (see below). The data used in this study, including sources, can be found in Appendix C. 

Mask wearing. One or both of:
• a country has implemented a policy of requiring mask usage among the general public, sometimes limited to certain domains like a duty to wear masks in public transportation and supermarkets
• survey reports indicate that over 60% of people were wearing masks in public.

Symptomatic testing. Testing is available to anyone showing COVID-19 symptoms (as defined by the country). In a few countries, testing is even available to people without symptoms.

Gatherings limited to 1000 people or less. A country has set a size limit on gatherings. The size limit is at most 1000 people (often less) and gatherings above the maximum size are disallowed. For example, a ban on gatherings of 500 people or more would be classified as "gatherings limited to 1000 or less" but a ban on gatherings of 2000 people or more would not.

Gatherings limited to 100 people or less. A country has set a size limit on gatherings. The size limit is at most 100 people (often less) and gatherings above the maximum size are disallowed.

Gatherings limited to 10 people or less. A country has set a size limit on gatherings. The size limit is at most 10 people (often less) and gatherings above the maximum size are disallowed.

Some businesses suspended. A country has specified a few kinds of customer-facing businesses that are considered "high risk" and need to suspend operations (blacklist). Common examples are restaurants, bars, nightclubs, and gyms. By default, businesses are not suspended.

Most businesses suspended. A country has suspended the operations of many customer-facing businesses. By default, customer-facing businesses are suspended unless they are designated as essential (whitelist).

School closure. A country has closed many or all schools. (Note that this was accompanied by closing universities in more than 75% of cases in our data.)

Stay-at-home order (with exemptions). An order for the general public to stay at home has been issued. This is mandatory, not just a recommendation. Exemptions are usually granted for certain purposes (such as shopping, exercise, or going to work), or, more rarely, for certain times of the day. In practice, a stay-at-home order was often accompanied by other NPIs such as business closures. However, a stay-at-home order does not in principle entail these other NPIs, but only the (additional) order to generally stay at home except for exemptions.

d Feature taken from the Oxford COVID-19 Government Response Tracker 8

We analyse 41 countries c (see Figure 2) and 9 NPIs (Table 2). We only recorded NPIs that were implemented in most of a country. The window of analysis spans the period from 22nd January to 25th April 2020 d, inclusive. Data on confirmed COVID-19 cases and deaths were taken from the Johns Hopkins CSSE COVID-19 Dataset 25,26.

Gathering bans, school closure, business closure, stay-at-home order

For each NPI and each country, one to three contractors independently collected data on the start date of the NPI, including sources. Each country was then extensively researched by one of the authors, using media articles, government sources, and Wikipedia articles. The researcher finalised the data based on their research, the data in the EFGNPI dataset, the data provided by the contractors, and, if available, data from the Oxford COVID-19 Government Response Tracker. 8

To estimate the local prevalence of mask-wearing, we conducted surveys of n=908 participants from most of the countries studied. Respondents were asked about the number of people they had seen wearing masks (details in Appendix D). We also used Wikipedia and the masks4all dataset 27 to ascertain when countries mandated mask-wearing in (some) public places. In all countries in which the government mandated mask-wearing, our survey results indicate that more than 60% of people started wearing masks around the time when the mandate was implemented.

The Oxford COVID-19 Government Response Tracker 8 has complete data on testing policies implemented in different countries. To check its accuracy, we compared the data with the number of tests per confirmed case 28 and found that activation of the testing feature was correlated with a substantial increase in the number of tests per confirmed case. We did not do further verification. As of version 5.0 of the dataset, our "symptomatic testing" feature corresponds to the following feature in the OxCGRT dataset: ID H2, levels 2-3. 

We construct a semi-mechanistic Bayesian hierarchical model, similar to Flaxman et al. 2 The main difference is that we model both confirmed cases and deaths, allowing us to leverage significantly more data. Furthermore, we do not assume a specific infection fatality rate since we do not aim to infer the total number of COVID-19 infections. The end of this section details further adaptations which allow us to make minimal assumptions about testing, reporting, and the infection fatality rate (IFR). Please see Appendix G for further details.

c The countries were selected by a case threshold (at the time of modelling), the availability of reliable data on NPIs, and how trustworthy we estimated the reporting of deaths from this country to be. Some particular countries were excluded for specific reasons. For example, we excluded South Korea because the country made heavy use of contact tracing which we don't model (because data on contact tracing is very hard to get).

d 22nd January to 17th April for confirmed cases

We describe the model in Figure 1 from bottom to top. The growth of the epidemic is determined by the time-and-country-specific reproduction number R_{t,c}. It depends on: a) the basic reproduction number R_{0,c} without any NPIs active, and b) the active NPIs. We place a prior (and hyperprior) distribution over R_{0,c}, reflecting the wide disagreement of regional estimates of R_0. 29 We parameterize the effectiveness of NPI i, assumed to be similar across countries and time, with α_i. The effect of each NPI on R_{t,c} is assumed to be multiplicative (and therefore independent) as follows:

where φ_{i,c,t} = 1 means NPI i is active in country c on day t (φ_{i,c,t} = 0 otherwise). In section 3, we discuss this interaction between NPIs. There is a symmetric prior (and hyperprior) over α_i, allowing for both positive and negative effects.
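As a minimal sketch of this multiplicative structure, assume that each active NPI scales R by a factor exp(-α_i). The exponential parameterisation and the function names below are illustrative assumptions, not taken from the model code.

```python
import math


def reproduction_number(r0, alphas, active):
    """R_{t,c} under the ASSUMED form R_{0,c} * prod_i exp(-alpha_i * phi_i).

    r0     : basic reproduction number with no NPIs active
    alphas : effectiveness parameter alpha_i for each NPI
    active : 0/1 indicators phi_i (1 if NPI i is active on that day)
    """
    exponent = sum(a * phi for a, phi in zip(alphas, active))
    return r0 * math.exp(-exponent)


def percent_reduction(alpha):
    """Percentage reduction in R implied by a single NPI with parameter alpha."""
    return 100.0 * (1.0 - math.exp(-alpha))
```

Under this assumed form, a negative α_i corresponds to an increase in R, which is why a symmetric prior allows for both positive and negative effects.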

Growth rates. N_{t,c} denotes the number of new infections at time t and country c. In the early phase of an epidemic, N_{t,c} grows exponentially with a daily e growth rate g_{t,c}. During exponential growth, there is a one-to-one correspondence between g_{t,c} and R_{t,c}: 30

where M(·) is the moment-generating function of the distribution of the serial interval (the time between successive cases in a chain of transmission). We assume that the serial interval distribution is given by a Gamma(5.18, 0.96) f distribution. 31 Using this relation, we can write g_{t,c} as a function of R_{t,c}, g_{t,c}(R_{t,c}) (see Appendix G).
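The correspondence between R and the growth rate can be illustrated for a Gamma-distributed serial interval. The sketch below assumes the Wallinga & Lipsitch relation R = 1 / M(-r) for the exponential growth rate r, and converts to a daily growth rate as g = exp(r) - 1. That conversion and the function names are assumptions made for illustration.

```python
import math

# Shape and rate of the Gamma serial interval as quoted in the text.
SHAPE, RATE = 5.18, 0.96


def daily_growth_rate(r_number, shape=SHAPE, rate=RATE):
    """Daily growth rate g implied by reproduction number R.

    For a Gamma(shape, rate) distribution, M(s) = (rate / (rate - s)) ** shape,
    so R = 1 / M(-r) = (1 + r / rate) ** shape and r = rate * (R ** (1 / shape) - 1).
    ASSUMPTION: the daily growth rate is g = exp(r) - 1.
    """
    r_exp = rate * (r_number ** (1.0 / shape) - 1.0)
    return math.exp(r_exp) - 1.0


def reproduction_number_from_growth(g, shape=SHAPE, rate=RATE):
    """Inverse mapping: reproduction number R implied by daily growth rate g."""
    r_exp = math.log1p(g)
    return (1.0 + r_exp / rate) ** shape
```

The mapping is monotonic, so R = 1 corresponds to zero growth and larger R always implies faster growth.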

Infection model. Rather than modelling the total number of new infections N_{t,c}, we model new infections that either will be subsequently a) confirmed positive, N^{(C)}_{t,c}, or b) lead to a reported death, N^{(D)}_{t,c}. They are backwards-inferred from the observation models for cases and deaths, shown further below. We assume that both grow at the same expected rate g_{t,c}:

where ε^{(·)}_{τ,c} ∼ N(0, σ_N = 0.2) are separate, independent noise terms. We seed our model with unobserved initial values, N^{(C)}_{0,c} and N^{(D)}_{0,c}, which have uninformative priors. g

e Many epidemiological models define growth rates as the exponent r in an exponential growth function. Here, we use daily growth rates instead for ease of exposition. These choices are mathematically equivalent. Note that we adapted equation (2.9) in Wallinga & Lipsitch 30 to account for our choice.

f The two parameters are the shape and rate. The mean is 5.1 days.

g Since we treat new infections as a continuous number, its initial value can (and often should) be between 0 and 1.
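A minimal simulation of the infection model could look as follows. It assumes the recursion N_{t,c} = N_{0,c} ∏_{τ≤t} (1 + g_{τ,c}) exp(ε_{τ,c}) and treats σ_N = 0.2 as a standard deviation. Both choices, and the function name, are assumptions for illustration.

```python
import numpy as np


def simulate_infections(n0, growth_rates, sigma=0.2, rng=None):
    """Simulate new infections under noisy exponential growth.

    n0           : unobserved initial value (may lie between 0 and 1)
    growth_rates : daily growth rates g_t, one per day
    sigma        : ASSUMED standard deviation of the Gaussian noise terms
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(growth_rates, dtype=float)
    noise = rng.normal(0.0, sigma, size=g.shape)
    # Work in log space: each day multiplies by (1 + g_t) * exp(noise_t).
    log_n = np.log(n0) + np.cumsum(np.log1p(g) + noise)
    return np.exp(log_n)
```

Because the noise enters multiplicatively and accumulates, a shock on one day shifts all later values by the same factor.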

. CC-BY-NC 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Observation model for confirmed cases. The mean predicted number of new confirmed cases is a discrete convolution

where P_C(delay) is the distribution of the delay from infection to confirmation. This delay is the sum of two independent gamma-distributed delays: the incubation period and the delay from onset of symptoms to confirmation. We use previously published and consistent empirical distributions from China and Italy, 32-35 which sum up to a mean delay of 10.35 days. Finally, the observed cases C_{t,c} follow a negative binomial noise distribution whose mean is this predicted number of new confirmed cases, with an inferred dispersion parameter, following Flaxman et al. 2

Observation model for deaths. The mean predicted number of new deaths is a discrete convolution

where P_D(delay) is the distribution of the delay from infection to death. It is also the sum of two independent gamma-distributed delays: the aforementioned incubation period and the delay from onset of symptoms to death, 32,36 which sum up to a mean delay of 18.8 days.

Finally, the observed deaths D_{t,c} also follow a negative binomial distribution whose mean is this predicted number of new deaths, with an inferred dispersion parameter.
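Both observation models share the same structure: convolve new infections with a delay distribution, then add negative binomial noise. The sketch below illustrates this with a hypothetical delay distribution passed in by the caller (not the published one). The negative binomial is parameterised by mean and dispersion such that the variance is mean + mean²/dispersion, which is an assumption about the parameterisation.

```python
import numpy as np


def expected_observations(new_infections, delay_pmf):
    """Discrete convolution of daily new infections with a delay distribution.

    delay_pmf[d] is the probability that an infection is observed d days later.
    The result is truncated to the length of the infection series.
    """
    n = np.asarray(new_infections, dtype=float)
    p = np.asarray(delay_pmf, dtype=float)
    return np.convolve(n, p)[: len(n)]


def sample_negative_binomial(mean, dispersion, rng):
    """Draw counts with the given mean; variance is mean + mean**2 / dispersion.

    Intended for strictly positive means.
    """
    mean = np.asarray(mean, dtype=float)
    # NumPy uses (n, p) with mean n * (1 - p) / p, so p = dispersion / (dispersion + mean).
    p = dispersion / (dispersion + mean)
    return rng.negative_binomial(dispersion, p)
```

A single burst of infections on day 0 reproduces the delay distribution itself, scaled by the size of the burst, which is a quick way to check the convolution.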

Single and combined models. To construct models which only use either confirmed cases or deaths as observations, we remove the variables corresponding to the disregarded observations.

Testing, reporting, and infection fatality rates. Scaling all values of a time series by a constant does not change its growth rates. The model is therefore invariant to the scale of the observations and consequently to country-level differences in the IFR and the ascertainment rate (the proportion of the infected cases who are subsequently reported positive). For example, assume countries A and B differ only in their ascertainment rates. Then, our model will infer a difference in N^{(C)}_{t,c} (Eq. (4)) but not in the growth rates g_{t,c} across A and B (Eq. (2)-(3)). Accordingly, the inferred NPI effectiveness will be identical. h In reality, a country's ascertainment rate (and IFR) can also change over time. In principle, it is possible to distinguish changes in the ascertainment rate from the effects of NPIs: decreasing the ascertainment rate decreases future cases C_{t,c} by a constant factor whereas the introduction of an NPI decreases them by a factor that grows exponentially over time. i The noise terms, exp ε^{(C)}_{τ,c} (Eq. (2)), mimic changes in the ascertainment rate (noise at time τ affects all future cases) and allow for gradual, multiplicative changes in the ascertainment rate.

h This is only approximately true. The negative binomial output distribution has a coefficient of variation diminishing with its mean, i.e., smaller observations are relatively more noisy and carry less weight. Furthermore, whilst the prior over N^{(C)}_{0,c} could break scale invariance, the uninformative prior results in a negligible effect.
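The scale-invariance argument can be checked numerically. The sketch below uses made-up numbers (20% daily growth, ascertainment rates of 0.5, 0.25 and 0.1) purely for illustration.

```python
import numpy as np


def daily_growth(series):
    """Daily growth rates g_t = x_{t+1} / x_t - 1 of a positive time series."""
    s = np.asarray(series, dtype=float)
    return s[1:] / s[:-1] - 1.0


# Hypothetical epidemic: new infections grow by 20% per day for 10 days.
infections = 100.0 * 1.2 ** np.arange(10)

# Countries A and B differ only in a constant ascertainment rate.
growth_a = daily_growth(infections * 0.5)
growth_b = daily_growth(infections * 0.1)

# A one-off halving of the ascertainment rate on day 5 shifts all later
# cases by a constant factor, so it perturbs the growth rate on one day only.
ascertainment = np.where(np.arange(10) < 5, 0.5, 0.25)
growth_step = daily_growth(infections * ascertainment)
```

Here `growth_a` and `growth_b` coincide, while `growth_step` departs from 0.2 only at the day of the change. An NPI, in contrast, would lower the growth rate on every subsequent day.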

We infer the unobserved variables in our model using Hamiltonian Monte-Carlo 37,38 (HMC), a standard MCMC sampling algorithm.

We collected preference data to study the direct impact of NPIs on people's lives. We used a best-worst scaling discrete choice survey instrument, specifically MaxDiff, 10 and surveyed N = 474 US residents recruited on Amazon's Mechanical Turk platform. The platform typically yields participants with greater demographic diversity than typical internet samples. 39 Note that this survey was entirely separate from the survey used for studying mask wearing described above.

Each respondent was given a short description of all studied NPIs (Appendix H) and then presented with 12 MaxDiff questions with 6 options, where each option consisted of a type of NPI and a duration (1 week, 2 weeks, 1 month, 3 months, 6 months, 1 year). Participants were asked to select the two options that they perceived as overall least and most burdensome (example question in Appendix H).

Before analysis, 140 responses with inconsistent answers were discarded; we considered answers erroneous when a respondent preferred a longer duration of an intervention over a shorter duration of the same intervention (often this happened for participants who responded quickly). To extract utility scores, we used the analytical estimation for the multinomial logit model, 40 as implemented in the bwsTools package 41 in R.

To analyse how the effectiveness of NPIs compares to their social impact, we can use the utility scores derived from the survey responses. However, utility scores are on an interval scale, because the survey only asks for relative comparisons between options. 42 While respondents presumably dislike all choices, we cannot say that e.g. a stay-at-home order is three times worse than school closure.

To estimate the effectiveness-burden-ratio, we need to estimate a measure for the intervention burden on a ratio scale, which we call "perceived intervention costs". These can be derived from the utility scores with additional assumptions, which are well justified by the empirical data (Figure 7, details in Appendix H).

i However, our model may struggle when the ascertainment rate also changes exponentially over time. This could happen when a country reaches its testing capacity. See Appendix I.

With these, the effectiveness-burden-ratio EBR_i of intervention i can be defined as: j

where m_i is the multiplicative factor on R (e.g. for a 20% reduction in R, m_i = 0.8), and c_i is the cost of intervention i. To determine the error of EBR_i, we used error propagation: 43

is the variance.

Conducting the online surveys on intervention burden was approved by the Medical Sciences Interdivisional Research Ethics Committee at the University of Oxford (Ethics Approval Reference: R69410/RE001).

The funding source did not influence any aspect of study design, execution, or reporting.

We aim to estimate the effectiveness of individual NPIs. If all countries implemented the same set of NPIs, on the same day, the individual effect of each NPI would be unidentifiable. However, many countries implemented different sets of NPIs, at different times, in different orders (Figure 2).

j This particular functional form is chosen because it is a simple expression that satisfies three desirable properties:

i) repeated application of an intervention x times that has effectiveness factor m and a constant unit cost c has equal effectiveness-burden-ratio each time it is applied. Formally: for any c ∈ R + and any m ∈ (0, 1),

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC 4.0 International license. This version posted May 30, 2020. https://doi.org/10.1101/2020.05.28.20116129

The model fits the observations well in 3 randomly selected countries (Figure 3, left). The fits for all other countries can be found in Appendix F. Plotting posterior values of the noise terms ε^{(C)}_t and ε^{(D)}_t shows periods where infections grew faster or slower than predicted based on the active NPIs, illustrating where the model might account for unobserved interventions or changes in reporting.

An important way to validate a Bayesian model is by checking its predictions on held-out data. 44 Our model makes sensible, calibrated forecasts over long periods in countries whose data was not used to infer the effectiveness of NPIs (Figure 4, see Appendix F for other countries).

We additionally validate our model's predictions by holding out the last 20 days of both new cases and deaths for all countries. These are challenging predictions; the longest attempted period we found in related work was 3 days. 2 The accurate forecasts in Figure 5 provide strong empirical evidence that our estimates of R are plausible.

The estimates of NPI effectiveness are our main result. To interpret them correctly, we need to keep in mind that our model assumes no interaction between different NPIs. In our model, each NPI reduces R by a multiplicative factor, independent of the context, i.e. the presence of other NPIs. This independence assumption is present in all multi-NPI studies we are aware of and seems reasonable for many NPIs. For instance, the effectiveness of closing businesses is likely to be similar whether or not schools are closed. However, in some situations, the effectiveness of an NPI might depend on its context. For example, if a stay-at-home order is in place, a larger fraction of the remaining transmission might occur in private spaces, and wearing masks in public spaces might be less effective.

Given this discussion, the effectiveness estimates should not be interpreted as the average effectiveness across all possible contexts, but rather as the (additional) effectiveness averaged across the contexts in which the NPI was present in our data. This result, which is equally important for the interpretation of other related studies, is derived for a simplified model in Appendix G.3. Figure 6 (bottom left) visualises the contexts of each NPI in our data, aiding interpretation.

Figure 6 shows the estimates of NPI effectiveness. Reassuringly, our three models have similar results. This suggests that results are not biased by factors that are specific to the deaths or cases model, such as changes in the ascertainment rate, reporting, and model-specific time delays.


Figure 6: Top: Posterior reduction in R for each NPI. The plot shows 50% and 95% credible intervals. A negative 1% reduction refers to a 1% increase in R. The following NPIs are hierarchical: gathering bans and business closures. For example, the result for Most Businesses Suspended shows the cumulative effect of two NPIs with separate parameters and symbols: suspending some (high-risk) businesses, and suspending most remaining (non-high-risk, but non-essential) businesses. The exact numbers are given in Appendix A. Bottom Left: The conditional activation matrix shows the situations encountered in our data. Cell values indicate the frequency that NPI x is active given that NPI y was active. E.g., schools were closed whenever a stay-home-order had been issued (bottom row, second column from the right), but not vice versa. Bottom Right: Total number of days each NPI was active across countries.
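A conditional activation matrix of the kind described in the caption can be computed from binary activation indicators. The sketch below is an illustration with a hypothetical data layout (one row per NPI, one column per country-day), not the code used to produce the figure.

```python
import numpy as np


def conditional_activation(active):
    """Return a matrix whose entry [x, y] is the fraction of country-days
    with NPI x active among the country-days with NPI y active.

    active : 0/1 array of shape (n_npis, n_country_days)
    Entries are NaN for an NPI y that is never active.
    """
    a = np.asarray(active, dtype=float)
    joint = a @ a.T                      # joint[x, y]: days with both x and y active
    days_y = a.sum(axis=1)[None, :]      # days with y active, broadcast over rows
    out = np.full_like(joint, np.nan)
    np.divide(joint, days_y, out=out, where=days_y > 0)
    return out
```

The matrix is not symmetric in general: if schools are closed on every day with a stay-at-home order but also on other days, the entry for schools given stay-at-home is 1 while the reverse entry is below 1.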

There are six NPIs for which the 95% credible interval does not include 'no effect' in our main model (combined deaths + confirmed cases): testing symptomatic patients, limiting the size of gatherings to 10 people (or less), suspending some or most businesses, closing schools, and a stay-at-home order. These results had a high degree of robustness in our sensitivity analysis (next section) and in other model variants we tried, whose results are not shown. There is no need to adjust for multiple-hypothesis-testing since our model is hierarchical. 45 We confirmed the quality of the MCMC inference with the Gelman-Rubin convergence statistic 46 (Appendix F).

We ran a wide range of sensitivity experiments on our combined model. Appendix E shows posterior effectiveness plots for the many conditions we tested. Table 3 summarizes sensitivity results qualitatively. We conservatively diagnosed 'moderate' sensitivity when, for every NPI, all 95% credible intervals, but not all 50% intervals, overlap. 'Low' means all 50% intervals from all experiment runs overlap.
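The diagnosis rule can be written as a small helper. The sketch below reads "overlap" as the intervals from all runs sharing at least one common point, which for intervals on a line is equivalent to every pair overlapping. The label "high" for the remaining case and the data layout are assumptions for illustration.

```python
def intervals_overlap(intervals):
    """True if all (low, high) intervals share at least one common point."""
    lows, highs = zip(*intervals)
    return max(lows) <= min(highs)


def classify_sensitivity(ci50_by_npi, ci95_by_npi):
    """Classify sensitivity across experiment runs.

    Each argument maps an NPI name to a list of (low, high) credible
    intervals, one per experiment run.
    """
    if all(intervals_overlap(v) for v in ci50_by_npi.values()):
        return "low"
    if all(intervals_overlap(v) for v in ci95_by_npi.values()):
        return "moderate"
    return "high"  # ASSUMED label for the case not named in the text
```

The rule is conservative because a single NPI with non-overlapping 50% intervals is enough to move an experiment from 'low' to 'moderate'.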

Results were generally stable, not affecting our conclusions.

Robustness to unobserved effects. The model assumes that there are no unobserved factors changing R (i.e. unobserved confounders such as spontaneous social distancing). But this is not necessarily true in practice. We test robustness to unobserved factors by computing NPI effectiveness whilst removing the observation of each NPI in turn. The sensitivity is low, supporting the claim that the model successfully accounts for unobserved factors.

Furthermore, we investigated robustness to unobserved confounding factors by including mobility data 47 as an 'NPI' that serves as a proxy for behaviour changes. We find that the mobility data explains the effect of business closures and stay-home-orders, which is expected as the effect of these NPIs is mediated through retail and recreation mobility. The inferred effectiveness of other NPIs is unchanged.

We do not report sensitivity to:

• The prior over the initial outbreak size N_{0,c} (because it is already extremely wide, having a negligible effect)
• The observation noise parameter (because it is inferred)
• Alternative models of infection and NPI interaction


We surveyed 474 US residents recruited on Amazon's Mechanical Turk platform about their preferences regarding various NPIs using a best-worst scaling survey. 140 responses were filtered out for internally inconsistent answers, and 334 were used for subsequent analysis (demographics in Appendix H). The NPI Symptomatic testing was not included in the preference elicitation because the mere option to get tested for COVID-19 when having symptoms does not impose any burden on people.

The ranking of the NPIs is largely independent of the duration (Figure 7). The duration-dependence of preferences is largest for mask wearing, which is more preferable if required only briefly, and the most stringent interventions, stay-at-home orders and the closure of most non-essential businesses, which are perceived as particularly bad if implemented for unrealistically long durations.

With some further assumptions (see Section 2.4), we can convert the utility scores to a ratio-scaled measure of intervention burden and calculate an effectiveness-burden ratio for every NPI (Figure 9).

We find evidence for the effectiveness of several NPIs. Combining effectiveness estimates with results from preference surveys, we can draw interesting conclusions:

• Closing high-risk businesses, such as bars and restaurants, appears only slightly less effective than closing most non-essential businesses, while imposing a substantially smaller burden.
• There is no obvious best choice for gathering size restrictions: though stricter limits are more effective, they are more burdensome, giving a similar effectiveness-burden ratio.

• Some NPIs, like mask wearing, have unclear evidence of effectiveness but are perceived as only slightly burdensome and might thus have high effectiveness-burden ratios. This warrants further investigation into their effectiveness.

We now discuss some of the main or more surprising results in detail.

Testing. With no direct negative effects on the population and a demonstrable effect on transmission, testing of patients with respiratory symptoms looks very promising from an effectiveness-burden perspective (see footnote k). Of course, the main negative effect of testing is the cost of purchasing and conducting tests. However, a recent economic analysis concluded that even testing asymptomatic people is vastly more cost-effective than indiscriminate measures. 48

Stay-at-home-orders. We estimate a comparatively small effect for stay-at-home orders.

The 'stay-at-home order (with exemptions)' NPI (Table 2) should be interpreted literally: a mandatory order to generally stay at home, except for exemptions. When countries introduced stay-at-home orders, they nearly always also banned gatherings and closed nonessential businesses if they had not done so already (Figure 6). Accounting for the effect of these NPIs, it is not surprising that the additional effect of ordering citizens to stay at home is small-to-moderate. Accordingly, it may be acceptable to lift burdensome stay-at-home orders, provided other NPIs stay active. Our result agrees with Banholzer et al. 7 (they call this NPI 'lockdown'), and we have not seen contradictory results in related work. In particular, the 'lockdown' NPI in Flaxman et al. 2 includes several other NPIs. Chen & Qui 6 found a significant effect, but without defining 'lockdown'.

Mask wearing. Mask wearing was often introduced towards the end of our analysis period (Figure 2), meaning that it is, by far, the NPI with the least data (Figure 6). We conclude that we have insufficient data to make claims about the effectiveness of mask wearing, and indeed, inferred effectiveness is moderately sensitive to left-out countries (Figure E.16). Additionally, mask wearing might have a reduced effect in the context of the particular countries we studied. People started wearing masks when interactions in public spaces were already limited by other NPIs. When relatively more transmission occurs in private spaces, wearing masks in public is expected to be less effective. This might explain the difference from Chen & Qui, 6 who found a small significant effect of mask-wearing based on data from two countries (China and South Korea), as mask-wearing was common in South Korea before other NPIs were implemented.

Footnote k: Note that we did not directly measure the burden of testing because this is not possible in the framework of our preference analysis (Section 3.6).

School closures. All our models find a very large effect for school closures. This result is surprising, even when accounting for the fact that school closure was highly correlated with university closure. However, the large effect was remarkably robust across our sensitivity analysis, many other model variants implemented during our model checking process, 49 and across a long process of collecting data for additional countries and NPIs. By inspecting the observations and the inferred infections, it is easy to see why the effect is so large: school closures are consistently followed by a clear reduction in growth (after the appropriate delay).

It is possible that our model confuses the effect of closing schools and unobserved behaviour changes. However, our sensitivity analysis showed that results are fairly robust to unobserved NPIs, suggesting they are robust to unobserved factors. Furthermore, we directly modelled unobserved factors by introducing mobility data 'NPIs' as a proxy for them. Again, the effect of school closures was unchanged. While these techniques closely mirror well-established sensitivity checks for unobserved causal effects, 50,51 they, too, rely on assumptions.

A further concern is that school closures have a delayed effect on deaths and confirmed cases, since children are less likely to die or show symptoms than adults. However, the result is not sensitive to the mean delay we assume (Appendix E).

Additionally, since the closure of schools was often the first major NPI introduced (Figure 2), it may have caused public concern to increase, causing behaviour changes. We do not distinguish this indirect signalling effect from the direct effect (for any NPI). Conversely, reopening schools could also have a signalling effect.

Previous evidence relevant to school closures is mixed. Flaxman et al. 2 and Banholzer et al. 7 did not find a significant non-zero effect with their data (Banholzer et al. focused on primary schools). Limited data suggests that children are equally susceptible to infection but have a lower observed case rate than adults. 52-54 Whether this is due to school closures remains unknown. There is insufficient data about transmission from children. However, viral shedding appears to be comparable across age groups. 55, 56 Little is known about the attack rate in schools (since they are closed); the best-documented case found that 38.3% to 59.3% were infected in one French high school. 57 As our results suggest a large role of schools (and universities) in Covid-19 transmission, this topic deserves further study.

Our study is not without assumptions and limitations, which are discussed in greater detail in Appendix I. To highlight some important points: NPI effectiveness may vary across countries and time; we cannot quantify the influence of unobserved factors on our results; regional differences within countries complicate the analysis. Therefore, a high degree of uncertainty remains in our findings. Our results should not be seen as the final answer on NPI effectiveness and burdens, but rather as a contribution to a diverse body of evidence, next to other retrospective studies, experimental trials and clinical experience.

The effectiveness estimates are computed with the combined cases + deaths model. For the disutility scores, lower disutility implies higher utility and a stronger preference. Utilities are on an interval scale: the absolute values and the zero point have no particular meaning, and only differences between utilities carry meaning. We can, e.g., say that the preference for Some businesses closed over Stay-at-home order was equally strong as the preference for Gatherings limited to 100 people or less over Some businesses closed (ca. 0.3 a.u., arbitrary units).

Up-to-date information on the Epidemic Forecasting Global NPI (EFGNPI) database can be found at http://epidemicforecasting.org/containment.

The full database (DB) is a daily representation of the response of each of 97 countries. It aims at collecting as broad a range of NPIs as possible. However, data on minor NPIs is often hard to find. As a result, the absence of an entry does not necessarily mean that this NPI was not implemented by a country.

A smaller dataset, the EFGNPI Features dataset (FD), is derived from the full DB. The FD aggregates many tags in the main database to produce a dataset that is easier to use in machine learning applications. The tags are also used to determine a stringency score for each feature. (Please note that details of how the FD is produced from the main database may change slightly over time.)

* The database contains data on 97 countries, but only 67 of these are complete at time of writing.

The underlying data was gathered by a team of volunteers. The database integrates many sources. Wikipedia entries were taken as a starting point for the set of NPIs implemented by each country. These were then refined by reference to national centres for disease control.

The full database is recorded as a dataset of tags. We began without a predefined list of attributes to record, so collection proceeded with a dynamic set of keyword tags as data on national responses was collected. After the data had been collected, a method for aggregating tags was created. The resulting database includes a 'Source' field for most rows.

Please note that the EFGNPI database, in contrast to the data used in this study (Appendix C), has not been subject to extensive fact-checking.

See Table B.6. It is important that researchers select the dataset appropriate for their use case. We think that a particular strength of the EFGNPI database is that it tracks a vast array of NPIs, but possibly at the cost of completeness. For the features that it contains, the Oxford COVID-19 Government Response Tracker dataset seems likely to have the highest quality, given the large team behind it. However, as we have stated in various sections of this paper, given our experience with several public datasets and our own data collection, we encourage fellow Covid-19 researchers to independently verify the quality of public data they use, if feasible.

We include the data used in our study, including sources.

Note that the following features are hierarchical, admitting several levels of stringency: limiting the size of gatherings (<1000 people, to <100 people, to <10 people), and business closure (some high-risk businesses, to most nonessential businesses). This is reflected in the data: for example, if a country banned all gatherings of 5 or more people, this was recorded as "gatherings limited to 10 or less", "gatherings limited to 100 or less", and "gatherings limited to 1000 or less".
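The cumulative encoding of the gathering-size features can be sketched as follows. This is only an illustration under stated assumptions: the function name, the threshold tuple and the treatment of "no limit" as `None` are hypothetical and are not taken from the study's data pipeline.

```python
# Hypothetical thresholds for the hierarchical gathering-size features.
GATHERING_THRESHOLDS = (10, 100, 1000)


def encode_gathering_limit(max_allowed_size):
    """Return the gathering features for one country-day.

    max_allowed_size: largest gathering still permitted, or None if there
    is no limit (assumption). A limit activates its own level and every
    weaker (larger) level, which makes the features cumulative.
    """
    return {
        f"gatherings limited to {threshold} or less": (
            max_allowed_size is not None and max_allowed_size <= threshold
        )
        for threshold in GATHERING_THRESHOLDS
    }
```

With this encoding, a ban on gatherings of 5 or more people switches on all three features, while a limit of 50 people switches on only the 100 and 1000 levels.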

Human-readable. All NPIs with sources (except "symptomatic testing", which was taken from the Oxford COVID-19 Government Response Tracker 8): LINK

Volunteers and Amazon Mechanical Turk (AMT) workers were asked to fill out an online survey between 25th March and 7th April 2020. The first-round volunteers were recruited via Facebook posts and private emails, with a request to both complete the survey and share it with their contacts, especially overseas contacts. Owing to a lack of geographical coverage in the first round, a second round, surveying users of country-specific forums on Reddit, was conducted and completed on 28th April.

The survey features three sets of questions, regarding:
1. the requirements or recommendations to wear masks in the participant's home country (this question was added in the second round);
2. the percentage of mask wearers they saw in public at weekly intervals between the end of February and the beginning of April;
3. the number of people in indoor public areas, as a percentage of the usual number of people seen in these areas, at weekly intervals between the end of February and the beginning of April.

Both strategies (private word of mouth and public internet sampling) are likely to yield non-representative samples owing to self-selection. This could yield poor results if mask usage varies a lot within countries, for instance in large countries such as India and the United States. However, we found a good deal of consistency in responses within countries on specified days. The average standard deviation of "percentage of population wearing masks" within country-days was 18.6, while the same measure, between countries but within days, was 28.6. Given this, we expect the inclusion of countries with even a single response to give a better indication of mask-wearing behaviour in that country than assuming such countries to have average levels.

Appendix D.1

We computed a binary feature of mask-wearing, attributed to the middle day of each week in the survey, by thresholding the average survey response for that week at 60%.
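A minimal sketch of this thresholding step is given below. The input layout (a mapping from the middle day of each week to the list of reported percentages) and the choice to count an average of exactly 60% as mask-wearing are assumptions made for illustration.

```python
from statistics import mean

MASK_THRESHOLD = 60.0  # percent, as stated in the text


def weekly_mask_feature(responses_by_week):
    """Map {middle_day_of_week: [reported % of mask wearers, ...]} to {day: bool}.

    Weeks without any responses are skipped rather than guessed.
    """
    return {
        day: mean(values) >= MASK_THRESHOLD
        for day, values in responses_by_week.items()
        if values
    }
```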

To create the mask-wearing feature used in our modelling, we combined the data from the surveys with data on government orders requiring the wearing of masks in public places in the following way:

• We only considered survey results for countries with at least 5 responses.
• If there was either a government order or a mask-wearing start date according to the survey results (but not both), we accepted that date.
• If there was both a government order and a mask-wearing start date according to the survey results, we accepted whichever was earlier. The exception was cases where the start date according to the surveys was less than 3 days before the government order. In these cases we accepted the date of the government order (because the temporal resolution of the survey results was +/− 3.5 days).
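The combination rules above can be sketched as a small function. The function and argument names are hypothetical, and missing dates are represented by `None`; this is an illustration of the stated rules, not the code used to build the dataset.

```python
from datetime import date

MIN_RESPONSES = 5      # survey results are used only with at least 5 responses
RESOLUTION_DAYS = 3    # survey dates closer than this to the order are ignored


def mask_start_date(order_date, survey_date, n_responses):
    """Combine a government-order date and a survey-based start date."""
    # Rule 1: discard survey results from countries with too few responses.
    if n_responses < MIN_RESPONSES:
        survey_date = None
    # Rule 2: if only one of the two dates exists, accept it.
    if order_date is None or survey_date is None:
        return order_date if order_date is not None else survey_date
    # Rule 3: accept the earlier date, unless the survey date precedes
    # the order by less than 3 days, in which case keep the order date.
    if survey_date < order_date:
        gap_days = (order_date - survey_date).days
        return order_date if gap_days < RESOLUTION_DAYS else survey_date
    return order_date
```

For example, a survey start date nine days before the order is accepted, whereas one only two days before the order is replaced by the order date.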

Mask data in detail (sheet "combined"): LINK

If we assume that, for country-days with over 15 responses, the true number of people wearing masks is given by the mean of the survey responses, we can estimate the misclassification rate for different numbers of responses by randomly sampling responses for that country-day and comparing them with the sample mean excluding the selected responses. Table D.8 reports the average from 100 iterations of this procedure.
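One way to read this resampling procedure is sketched below. It assumes that "misclassification" means the binary mask-wearing label (average at or above 60%) of the sampled responses disagrees with the label of the remaining responses; the function name, the seed and this reading are assumptions, not a description of the authors' code.

```python
import random
from statistics import mean


def misclassification_rate(responses, k, threshold=60.0, iterations=100, seed=0):
    """Estimate how often k sampled responses give a different binary label
    than the mean of the remaining responses (treated as the reference)."""
    if not 0 < k < len(responses):
        raise ValueError("k must leave at least one response for the reference mean")
    rng = random.Random(seed)
    errors = 0
    for _ in range(iterations):
        chosen = set(rng.sample(range(len(responses)), k))
        sample = [r for i, r in enumerate(responses) if i in chosen]
        rest = [r for i, r in enumerate(responses) if i not in chosen]
        # Count a misclassification when the two binary labels disagree.
        if (mean(sample) >= threshold) != (mean(rest) >= threshold):
            errors += 1
    return errors / iterations
```

Running it for several values of `k` on each country-day with more than 15 responses, and averaging, would give one row per `k` in a table of this kind.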

We replicate the posterior of the effectiveness of NPIs, showing its sensitivity to variations of the assumptions and the data. Recall that we show cumulative effects for two sets of NPIs: gatherings and business closures. This means that, e.g., a high sensitivity for closing some businesses will show up a second time as a high sensitivity for closing most businesses. This overstates the number of individual parameters $\alpha_i$ which are sensitive.

t,c on new infections (combined model). We vary the parameter $\sigma_N$ of a lognormal noise distribution. Deaths and cases have independent noise terms, with the same standard deviation. Note that a larger noise scale implies that the rates of ascertainment (testing) and fatality are allowed to change more rapidly. Predictably, results are less confident given more noise.

Figure E.17: Sensitivity to including mobility data as an additional 'NPI'. Mobility data mostly explains the effect of business closures and stay-home-orders, which is expected as the effect of these NPIs is mediated through retail, recreation, and workplace mobility (though the latter had no effect on its own; not shown). The negative effect of business closures has no clear interpretation, as adding mobility data breaks our assumption about NPI interaction. We did not experiment with other mobility categories.

We perform the following data preprocessing:

• Our data for confirmed cases and deaths is given by the Johns Hopkins Center for Systems Science and Engineering. 25, 26 We smooth this data by averaging the number of cases and deaths in a five-day period around every day, assuming the data is symmetric at the boundaries.
• We mask new cases before a country has reached 100 confirmed cases. This accounts for cases being imported from other countries and rapid changes in testing regime when the case count is small.
• To avoid bias from imported deaths, we mask new deaths before a country has reached 10 deaths.
• Days where there are zero cases or deaths do not provide information about the relative change in the size of the epidemic. Therefore, they are masked.
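The preprocessing steps above can be sketched in a few lines. Several details are assumptions made for illustration: "symmetric at the boundaries" is read as mirroring two values at each end, the running total is computed from the series passed in, the day on which the threshold is reached is kept, and masked days are represented by `None`. The function names are hypothetical.

```python
def smooth_five_day(series):
    """Centred five-day moving average; the series is mirrored at both ends."""
    values = list(series)
    if len(values) < 2:
        raise ValueError("need at least two daily observations")
    # Mirror two values at each boundary (one reading of 'symmetric').
    padded = values[1::-1] + values + values[:-3:-1]
    return [sum(padded[i:i + 5]) / 5 for i in range(len(values))]


def mask_early_and_zero_days(daily_new, min_cumulative):
    """Replace masked observations by None.

    A day is masked while the running total is below min_cumulative
    (100 for cases, 10 for deaths in the text) or when its count is zero.
    """
    masked, running_total = [], 0
    for value in daily_new:
        running_total += value
        observed = running_total >= min_cumulative and value > 0
        masked.append(value if observed else None)
    return masked
```

A typical call would be `mask_early_and_zero_days(new_cases, 100)` for cases and `mask_early_and_zero_days(new_deaths, 10)` for deaths; whether masking is applied before or after smoothing is not stated here, so the sketch leaves that choice to the caller.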

τ,c ) , with noise (G.17)

$N^{(C)}_{t,c}$ represents the number of daily new infections at time $t$ in country $c$ who will eventually test positive ($N^{(D)}_{t,c}$ is defined similarly, but for infections who will eventually pass away).
• Observation Model: We use discrete convolutions to produce the expected number of new cases and deaths on a given day.
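To illustrate what a discrete convolution does in an observation model of this kind, the sketch below spreads each day's new infections over later days according to a delay distribution. The function name and the delay probabilities in the usage note are placeholders for illustration, not the delay distributions used in the study.

```python
def expected_observations(new_infections, delay_pmf):
    """Discrete convolution of daily new infections with a delay distribution.

    expected[t] = sum over tau of new_infections[t - tau] * delay_pmf[tau],
    where delay_pmf[tau] is the probability that an infection is observed
    (confirmed or dies) tau days after it occurred.
    """
    expected = []
    for t in range(len(new_infections)):
        total = 0.0
        for tau, weight in enumerate(delay_pmf):
            if t - tau >= 0:
                total += new_infections[t - tau] * weight
        expected.append(total)
    return expected
```

For instance, with a made-up delay distribution `[0.0, 0.5, 0.5]`, 100 infections on day 0 produce 50 expected observations on each of days 1 and 2.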

$\alpha$ is the dispersion parameter of the distribution. Caution: larger values of $\alpha$ correspond to a smaller variance, and less dispersion. With our parameterisation, the variance of the Negative Binomial distribution is $\mu + \mu^2/\alpha$, so that smaller observations are relatively more noisy.
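The two statements about the dispersion parameter can be checked with a short calculation based on the variance formula $\mu + \mu^2/\alpha$. The helper names below are hypothetical; "relatively more noisy" is read here as a larger coefficient of variation (standard deviation divided by the mean).

```python
def negative_binomial_variance(mu, alpha):
    """Variance of the Negative Binomial with mean mu and dispersion alpha."""
    return mu + mu ** 2 / alpha


def relative_noise(mu, alpha):
    """Coefficient of variation: standard deviation divided by the mean."""
    return negative_binomial_variance(mu, alpha) ** 0.5 / mu
```

With `mu = 10`, the variance is 30 for `alpha = 5` and 12 for `alpha = 50`, so a larger dispersion parameter gives less dispersion; and for fixed `alpha`, `relative_noise` is larger for `mu = 10` than for `mu = 1000`.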

We have previously noted that the effectiveness of each NPI, $\alpha_i$, may depend on the presence of other NPIs. For example, masks may be less effective when a stay-at-home order has been issued because more of the remaining transmission occurs in private spaces. We claimed that, in such a situation, we can roughly interpret the inferred effect $\alpha_i$ of NPI $i$ as the average additional effect it had in the contexts (i.e., the sets of simultaneously active NPIs) in which it was active. The average is over days and countries in which it was active.

Here, we formalize this claim for the maximum likelihood estimator (MLE) of $\alpha_i$ with a simplified model in which we know the true values of $R_{c,t}$ (perhaps from another model). In reality, these values are not known but rather estimated by our model. Although we are performing Bayesian inference, the posterior density will be high where the likelihood is high, and thus this interpretation is still insightful. The maximum of our posterior (the MAP) will be close to the maximum of the likelihood (the MLE) since the influence of our prior distribution on $\alpha_i$ is, empirically, small.

Simplified Model. We have NPI activations $\phi_{i,c,t}$, where $\phi_{i,c,t} = 1$ represents NPI $i$ being active in country $c$ on day $t$. Assume that the true values of $R_{c,t}$ and $R_{0,c}$ have been provided to us. Our simplified model is: Taking derivatives with respect to $\alpha_i$ yields: (G.28)

where $N_i$ is the number of days that NPI $i$ was active. Rearranging gives the desired result:

$$\alpha_i^{\mathrm{MLE}} = \frac{1}{N_i} \sum_{(c,t)\,:\,\phi_{i,c,t}=1} \left( \text{Predicted } \log R \text{ based on other NPIs} - \log R_{c,t} \right). \tag{G.30}$$

$\alpha_i^{\mathrm{MLE}}$ is the average additional effect that NPI $i$ had over the simultaneously active NPIs, where the average is taken over the days where NPI $i$ was active.
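The interpretation of this estimator as an average over contexts can be checked numerically. The sketch below assumes a multiplicative form, $\log R_{c,t} = \log R_{0,c} - \sum_j \alpha_j \phi_{j,c,t}$ plus noise, for the prediction from the other NPIs; this form, the function name and the data layout are assumptions made for illustration.

```python
import math


def alpha_mle(i, alphas, days):
    """Average of (predicted log R from the other NPIs - observed log R)
    over the days on which NPI i was active.

    days: list of (r0, r, phi), where phi is a list of 0/1 activations.
    alphas: effects of all NPIs; the entry for NPI i itself is ignored.
    """
    residuals = []
    for r0, r, phi in days:
        if phi[i] != 1:
            continue  # only days where NPI i was active contribute
        predicted_log_r = math.log(r0) - sum(
            a * p for j, (a, p) in enumerate(zip(alphas, phi)) if j != i
        )
        residuals.append(predicted_log_r - math.log(r))
    if not residuals:
        raise ValueError("NPI i was never active")
    return sum(residuals) / len(residuals)
```

If NPI 0 reduces $\log R$ by 0.4 when active alone and by only 0.2 when NPI 1 (effect 0.5) is also active, the estimate over one day of each context is 0.3, the average additional effect across the two contexts.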

We assume that the effect of NPIs on growth rates is similar across countries and time. However, the exact implementation of, and adherence to, each NPI is likely to vary. Our uncertainty estimates in Figure 6 account for these problems only to a strictly limited degree. Additionally, different countries have different cultural norms and age profiles, affecting the degree to which a particular intervention is effective. For example, a country where a higher proportion of the population is in education will likely observe a larger effect from a government order to close schools and universities.

Unobserved changes in behaviour. Our method assumes that changes in the reproduction number are caused by the observed NPIs rather than unobserved factors such as spontaneous behaviour changes. We test the sensitivity of our results to unobserved interventions by hiding observed NPIs and by including mobility data. Our conclusions were stable (see Figure E.13), but removing our most effective NPI, school closure, increased the inferred effectiveness for gathering bans and business closures.

Testing, reporting, and the IFR. Our model can account for differences in testing (and IFR/reporting) between countries and over time (as discussed in Section 2). However, we have not used additional data on testing to validate whether it does so reliably. Our model may struggle to account for changes in the testing regime, for instance when a country reaches its testing capacity so that the ascertainment rate declines exponentially. An exponential decline would have the same effect on observations as an unobserved NPI. Consequently, we cannot quantify its effect on our results (though the sensitivity analyses look promising).

Interaction between NPIs. As discussed in Section 3, our model only reports the average additional effect each NPI had in the contexts where it was active in our data (derivation in Appendix G). Figure 6 shows these contexts, aiding interpretation. The effectiveness of an NPI can only be extrapolated to other contexts if its effect does not depend on the context.

Growth rates. The functional form of the relationship between the daily growth rate in the number of infections g and the reproductive number R holds exactly when the epidemic is in its exponential growth phase, but becomes less accurate as the number of susceptible people in a population decreases and/or control measures are implemented.

Signalling effect of NPIs. As we explained in Section 4 for school closures, we do not distinguish between the direct effect of an NPI and its indirect effect as it signals the gravity of the situation to the public. Conversely, lifting interventions may also have a signalling effect.

We work under the standard assumption of a well-mixed population (Anderson & May 58). This could affect results in various ways. For example, suppose country A tests an older demographic than country B, and we are considering the effect of an NPI that mostly affects the older demographic (for example, isolating the elderly). Then the NPI will appear to have a greater effect on confirmed cases in country A, breaking the assumption that effects are stable across countries.

We estimate the burden that different NPIs put on people's lives. Of course, implementation of NPIs has many other costs (and benefits) than just the encumbrance on daily life. Many long-term costs of NPIs will also be co-determined by the economic policy response they engender, their impacts on global supply chains, their structural damage to networks of business contacts, and many other similar effects. Estimating these long-term impacts might be prohibitively difficult and is out of scope for this study. Nevertheless, these factors should be considered for policy decisions to the degree possible.


Our preference data is a sample of US residents only, specifically those working on the Amazon Mechanical Turk platform. This may limit the international applicability of our cost-effectiveness estimates. Even though recruitment on Amazon Mechanical Turk usually results in greater demographic diversity than typical internet samples, 39 there will still be selection bias. It is also important to note that, for ethical reasons, the sample does not include participants under 18 years of age, which is a major limitation when estimating the perceived costs of closing schools.

Finally, using the mean population preference for policy decisions may be problematic in itself. For example, the closure of schools will likely strongly affect the parents of school children but pose little burden on the majority of people who are not parents of school children. The mean burden of closing schools may then be only moderate, but policy decisions must also take fairness and inequality into account.


World Health Organization. Non-pharmaceutical public health measures for mitigating the risk and impact of epidemic and pandemic influenza

Estimating the number of infections and the impact of nonpharmaceutical interventions on COVID-19 in 11 European countries

Code for modelling estimated deaths and cases for COVID-19 from Report 13 published by MRC Centre for Global Infectious Disease Analysis

The Macroeconomics of Epidemics

Multidisciplinary research priorities for the COVID-19 pandemic: a call for action for mental health science. The Lancet Psychiatry

Scenario analysis of non-pharmaceutical interventions on global COVID-19 transmissions

Impact of non-pharmaceutical interventions on documented cases of COVID-19. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

Oxford COVID-19 Government Response Tracker. Blavatnik School of Government

Government Measures Dataset

Best-worst scaling: A model for the largest difference judgments. University of Alberta: Working Paper

Valuing citizen and patient preferences in health: recent developments in three types of best-worst scaling

Economic Activity and the Spread of Viral Diseases: Evidence from High Frequency Data. Institute of Labor Economics (IZA); 2015. 9326

Worldwide Effectiveness of Various Non-Pharmaceutical Intervention Control Strategies on the Global COVID-19 Pandemic: A Linearised Control Model. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

Social distancing to slow the U.S. COVID-19 epidemic: an interrupted time-series analysis. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

The effect of human mobility and control measures on the COVID-19 epidemic in China

Early dynamics of transmission and control of COVID-19: a mathematical modelling study. The Lancet Infectious Diseases

Neural Network aided quarantine control model estimation of global Covid-19 spread

Effective containment explains subexponential growth in recent confirmed COVID-19 cases in China

Villas-Boas M, Villas-Boas V. Are We #StayingHome to Flatten the Curve? UC Berkeley: Department of Agricultural and Resource Economics

Quantifying the impact of physical distance measures on the transmission of COVID-19 in the UK

How effective has been the Spanish lockdown to battle COVID-19? A spatial analysis of the coronavirus propagation across provinces. FEDEA; 2020. 2020-03

A Spatiotemporal Epidemic Model to Quantify the Effects of Contact Tracing, Testing, and Containment

Spread and dynamics of the COVID-19 epidemic in Italy: Effects of emergency containment measures

The effect of inter-city travel restrictions on geographical spread of COVID-19: Evidence from Wuhan, China. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

An interactive web-based dashboard to track COVID-19 in real time. The Lancet Infectious Diseases

What Countries Require Masks in Public or Recommend Masks?

tests-per-confirmed-case-vs-total-confirmed-cases-of-covid-19-per-million-peo ple?

Basic Reproduction Rate and Case Fatality Rate of COVID-19: Application of Meta-analysis. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

How generation intervals shape the relationship between growth rates and reproductive numbers

Evolving epidemiology and transmission dynamics of coronavirus disease 2019 outside Hubei province, China: a descriptive and modelling study. The Lancet Infectious Diseases

Incubation Period and Other Epidemiological Characteristics of 2019 Novel Coronavirus Infections with Right Truncation: A Statistical Analysis of Publicly Available Case Data. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus-Infected Pneumonia

The early phase of the COVID-19 outbreak in

Epidemiology and Transmission of COVID-19 in Shenzhen China: Analysis of 391 cases and 1,286 of their close contacts. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

Estimates of the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious Diseases

The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo

Amazon's Mechanical Turk. Perspectives on Psychological Science

Best-Worst Scaling in analytical closed-form solution

Tools for Case 1 Best-Worst Scaling (MaxDiff) Designs

Sawtooth Software, Inc. Proceedings of the Sawtooth Software Conference

An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books

Model checking and improvement

Why We (Usually) Don't Have to Worry About Multiple Comparisons

Inference from Iterative Simulation Using Multiple Sequences

See how your community is moving around differently due to COVID-19

Optimal COVID-19 quarantine and testing policies

Philosophy and the practice of Bayesian statistics

Assessing Sensitivity to an Unobserved Binary Covariate in an Observational Study with Binary Outcome

Observational Studies

SARS-CoV-2 (COVID-19): What do we know about children? A systematic review

Coronavirus Infections in Children Including COVID-19. The Pediatric Infectious Disease Journal

An analysis of SARS-CoV-2 viral load by patient age

Shedding of infectious SARS-CoV-2 in symptomatic neonates, children and adolescents. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

Cluster of COVID-19 in northern France: A retrospective closed cohort study. COVID-19 SARS-CoV-2 preprints from medRxiv and bioRxiv

A framework for discussing the population biology of infectious diseases

Variables are indexed by intervention i, country c, and day t. All prior distributions are independent.
2. NPI Effectiveness:
3. Infection Initial Counts.

• Hyperparameters
1. Infection Noise Scale, σ_N = 0.1 (selected by cross-validation).
2. Serial Interval Parameters. The serial interval is assumed to have a Gamma distribution with α = 1.87 and β = 0.28. 31
3. Delay Distributions. The time from infection to confirmation is assumed to be the sum of the incubation period and the time taken from symptom onset to laboratory confirmation. Therefore, the time taken from infection to confirmation, T(C), is:

The time from infection to death is assumed to be the sum of the incubation period and the time taken from symptom onset to death. Therefore, the time taken from infection to death, T(D), is: [32] [33] [34] [35]

where α is known as the dispersion parameter. Caution: larger values of α correspond to a smaller variance, and less dispersion. With our parameterisation, the variance of the Negative Binomial distribution is µ + µ²/α. For computational efficiency, we discretise this distribution using Monte Carlo sampling. We therefore form discrete arrays, π_C[i] and π_D[i], where the value of π_C[i] corresponds to the probability of the delay being i days. We truncate π_C to a maximum delay of 31 days and π_D to a maximum delay of 63 days.

where α and β are the parameters of the serial interval distribution. This is the exact conversion under exponential growth, following eq. (2.9) in Wallinga & Lipsitch. 30 (Note that we use daily growth rates.)
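The Monte Carlo discretisation of the delay distributions described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's code: the component distributions and their parameters in `sample_infection_to_confirmation` are hypothetical placeholders (the actual delay distributions are not reproduced here), continuous delays are floored to whole days, and the truncated array is renormalised to sum to one. All function names are hypothetical.

```python
import math
import random


def discretise_delay(sample_delay, max_delay, n_samples=50_000, seed=0):
    """Estimate P(delay = i days) for i = 0..max_delay by Monte Carlo sampling.

    sample_delay: function taking a random.Random and returning one
                  non-negative delay in days (may be fractional).
    Samples beyond max_delay are discarded (truncation) and the
    remaining mass is renormalised.
    """
    rng = random.Random(seed)
    counts = [0] * (max_delay + 1)
    for _ in range(n_samples):
        day = math.floor(sample_delay(rng))
        if 0 <= day <= max_delay:
            counts[day] += 1
    total = sum(counts)
    return [c / total for c in counts]


def sample_infection_to_confirmation(rng):
    """Hypothetical placeholder: incubation period plus onset-to-confirmation.

    The Gamma parameters below are illustrative only and are NOT the
    values used in the study.
    """
    incubation = rng.gammavariate(5.0, 1.0)        # shape, scale (placeholder)
    onset_to_confirm = rng.gammavariate(2.0, 2.5)  # shape, scale (placeholder)
    return incubation + onset_to_confirm


if __name__ == "__main__":
    pi_C = discretise_delay(sample_infection_to_confirmation, max_delay=31)
    print(len(pi_C), round(sum(pi_C), 6))
```

The same helper could be called with `max_delay=63` and an infection-to-death sampler to obtain the second array; whether the original implementation floors or rounds delays, or renormalises after truncation, is not stated in the text.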

All schools closed: All levels of schools are closed.

This survey focuses on how socially and personally burdensome people perceive various COVID-19 mitigation measures to be.

In order to understand how to best react to the COVID-19 pandemic, we need to find out how different mitigation measures compare to each other. In this survey, we are only interested in how mitigation measures affect people's personal lives, but not in how effective different measures are at reducing the spread of COVID-19 nor what their effects are on the economy as a whole.

As such, we only ask about how different measures affect your life, not about how they affect the course of the pandemic.

Which of these mitigation measures would you find least burdensome, and which most burdensome?

The following shows a selection of mitigation measures that may occur as part of the response against Covid-19. Note that the measures differ in type and duration of deployment. Consider how burdensome would the measures be if they had the same effect on the reduction of COVID spreading.

Response columns: Best (Least burdensome) / Worst (Most burdensome)
Stay-at-home order for 2 weeks
All non-essential businesses closed for 1 week
All schools closed for 3 months
All schools closed for 1 year
Wearing masks for 2 weeks
Special precautions in clinics and hospitals for 1 week
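To make the structure of such a best-worst question concrete, the sketch below computes a simple counting score per item: the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of times it was shown. This counting summary is a common first-pass analysis of MaxDiff data and is shown only for illustration; it is not claimed to be the estimator used in this study. The response format and the toy responses are hypothetical.

```python
from collections import Counter


def best_worst_scores(responses):
    """Return (best - worst) / shown for every item seen in the responses.

    responses: list of dicts with keys
        "shown": list of items displayed in the question,
        "best":  item picked as best (least burdensome),
        "worst": item picked as worst (most burdensome).
    Scores lie in [-1, 1]; higher means perceived as less burdensome.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for resp in responses:
        shown.update(resp["shown"])
        best[resp["best"]] += 1
        worst[resp["worst"]] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


if __name__ == "__main__":
    # Hypothetical toy responses, not survey data.
    items = ["Wearing masks for 2 weeks",
             "Stay-at-home order for 2 weeks",
             "All schools closed for 1 year"]
    toy = [
        {"shown": items, "best": items[0], "worst": items[2]},
        {"shown": items, "best": items[0], "worst": items[1]},
    ]
    for item, score in best_worst_scores(toy).items():
        print(f"{score:+.2f}  {item}")
```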

Let u(i, d) be the average population utility score for a pair of intervention i and duration d. We now make two additional assumptions, which are well justified by the empirical data (Figure 7):

We only record NPIs implemented nationally. For example, several regions in Germany implemented stay-at-home orders even though this was not ordered nationally. Regional orders do not appear in our data. Additionally, while we included more NPIs than previous work (Table 1), there are many NPIs for which we were not able to collect enough high-quality data for our modelling, such as public cleaning or changes to public transportation.