key: cord-0957174-98sllajg
title: Problems with Evidence Assessment in COVID-19 Health Policy Impact Evaluation (PEACHPIE): A systematic review of evidence strength
authors: Haber, Noah A.; Clarke-Deelder, Emma; Feller, Avi; Smith, Emily R.; Salomon, Joshua; MacCormack-Gelles, Benjamin; Stone, Elizabeth M.; Bolster-Foucault, Clara; Daw, Jamie R.; Hatfield, Laura A.; Fry, Carrie E.; Boyer, Christopher B.; Ben-Michael, Eli; Joyce, Caroline M.; Linas, Beth S.; Schmid, Ian; Au, Eric H.; Wieten, Sarah E.; Jarrett, Brooke A.; Axfors, Cathrine; Nguyen, Van Thu; Griffin, Beth Ann; Bilinski, Alyssa; Stuart, Elizabeth A.
date: 2021-02-08
journal: medRxiv
DOI: 10.1101/2021.01.21.21250243
sha: 00edc4f0c99f163070805045ac9454b4dbf8fb38
doc_id: 957174
cord_uid: 98sllajg

INTRODUCTION: Assessing the impact of COVID-19 policy is critical for informing future policies. However, there are concerns about the overall strength of COVID-19 impact evaluation studies, given the challenging circumstances for evaluation and concerns about the publication environment. This study systematically reviewed the strength of evidence in the published COVID-19 policy impact evaluation literature.

METHODS: We included studies that were primarily designed to estimate the quantitative impact of one or more implemented COVID-19 policies on direct SARS-CoV-2 and COVID-19 outcomes. After searching PubMed for peer-reviewed articles published on November 26, 2020 or earlier and screening, all studies were reviewed by three reviewers, first independently and then to consensus. The review tool was based on previously developed and released review guidance for COVID-19 policy impact evaluation, assessing what impact evaluation method was used, graphical display of outcomes data, functional form for the outcomes, timing between policy and impact, concurrent changes to the outcomes, and an overall rating.

RESULTS: After 102 articles were identified as potentially meeting inclusion criteria, we identified 36 published articles that evaluated the quantitative impact of COVID-19 policies on direct COVID-19 outcomes. The majority (n=23/36) of studies in our sample examined the impact of stay-at-home requirements. Nine studies were set aside because the study design was considered inappropriate for COVID-19 policy impact evaluation (n=8 pre/post; n=1 cross-sectional), and 27 articles were given a full consensus assessment. 20/27 met criteria for graphical display of data, 5/27 for functional form, 19/27 for timing between policy implementation and impact, and only 3/27 for concurrent changes to the outcomes. Only 1/27 studies passed all of the above checks, and 4/27 were rated as overall appropriate. Including the nine studies set aside, reviewers found that only four of the 36 identified published and peer-reviewed health policy impact evaluation studies passed a set of key design checks for identifying the causal impact of policies on COVID-19 outcomes.

DISCUSSION: The reviewed literature directly evaluating the impact of COVID-19 policies largely failed to meet key design criteria for useful inference. This failure was largely driven by the circumstances under which the policies were passed, which make it difficult to attribute changes in COVID-19 outcomes to particular policies. More reliable evidence review is needed to both identify and produce policy-actionable evidence, alongside the recognition that actionable evidence is often unlikely to be feasible.
Policy decisions to mitigate the impact of COVID-19 on morbidity and mortality are among the most important decisions policymakers have had to make since January 2020. Decisions regarding which policies are enacted depend in part on the evidence base for those policies, including understanding of what impact past policies had on COVID-19 outcomes. 1, 2 Unfortunately, there are substantial concerns that much of the existing literature may be methodologically flawed, which could render its conclusions unreliable for informing policy. The combination of circumstances that make strong impact evaluation difficult, the importance of the topic, and concerns about the publication environment may lead to the proliferation of low-strength studies.

High-quality causal evidence requires a combination of rigorous methods, clear reporting, appropriate caveats, and the appropriate circumstances for the methods used. 2-5 Rigorous evidence is difficult to generate in the best of circumstances, and the circumstances for evaluating the effects of non-pharmaceutical intervention (NPI) policies on COVID-19 are particularly challenging. 3 The global pandemic has yielded a combination of a large number of concurrent policy and non-policy changes, complex infectious disease dynamics, and unclear timing between policy implementation and impact; all of this makes isolating the causal impact of any particular policy or policies exceedingly difficult. 4

The scientific literature on COVID-19 is exceptionally large and fast growing: scientists published more than 100,000 papers related to COVID-19 in 2020. 5 There is some general concern that the volume and speed 6, 7 at which this work has been produced may result in a literature that is overall low quality and unreliable. 8-12 Given the importance of the topic, it is critical that decision-makers are able to understand what is known and knowable 3, 13 from observational data in COVID-19 policy, as well as what is unknown and/or unknowable.

Motivated by concerns about the methodological strength of COVID-19 policy evaluations, we set out to review the literature using a set of methodological design checks tailored to common policy impact evaluation methods. Our primary objective was to evaluate each paper for methodological strength and reporting, based on pre-existing review guidance developed for this purpose. 14 As a secondary objective, we also studied our own process: examining the consistency, ease of use, and clarity of this review guidance.

Overview
This systematic review of the strength of evidence took place in three phases: search, screening, and full review. The protocol for this study was pre-registered on OSF.io 15 based on PRISMA guidelines. 16 Deviations from the original protocol consisted largely of language clarifications and error corrections for both the inclusion criteria and review tool, an increase in the number of reviewers per fully reviewed article from two to three, and simplification of the statistical methods used to assess the data. Notably, this protocol differs in many ways from more traditional systematic review protocols: instead of summarizing the evidence on a particular topic, it is a systematic review of methodological strength of evidence.
Eligibility criteria
The following eligibility criteria were used to determine the papers to include:
• The primary topic of the article must be the evaluation of one or more individual COVID-19 policies on direct COVID-19 outcomes.
  ○ The primary exposure(s) must be a policy, defined as a government-issued order at any government level to address a directly COVID-19-related outcome (e.g., mask requirements, travel restrictions, etc.).
  ○ COVID-19 outcomes may include cases detected, mortality, number of tests taken, test positivity rates, Rt, etc.
  ○ This may NOT include indirect impacts of COVID-19 on things such as income, childcare, trust in science, etc.
• The primary outcome being examined must be a COVID-19-specific outcome, as above.
• The study must be designed as an impact evaluation study from primary data (i.e., not primarily a predictive or simulation model or meta-analysis).
• The study must be peer reviewed and published in a peer-reviewed journal indexed by PubMed.
• The study must have the title and abstract available via PubMed at the time of the study start date (November 26).
• The study must be written in English.

These eligibility criteria were designed to identify the literature primarily concerning the quantitative impact of one or more implemented COVID-19 policies on COVID-19 outcomes. Studies in which impact evaluation was secondary to another analysis (such as a hypothetical projection model) were eliminated because they were less relevant to our objectives and/or might not contain sufficient information for evaluation. Categories for types of policies were taken from the Oxford COVID-19 Government Response Tracker. 17

Reviewer recruitment, training, and communication
Reviewers were recruited through personal contacts and postings on online media. All reviewers had experience in systematic review, quantitative causal inference, epidemiology, econometrics, public health, methods evaluation, or policy review. All reviewers participated in two meetings in which the procedures and the review tool were demonstrated. Screening reviewers participated in an additional meeting specific to the screening process. Throughout the main review process, reviewers communicated with the administrators and each other through Slack for any additional clarifications, questions, corrections, and procedures. The main administrator (NH), who was also a reviewer, was available to answer general questions and make clarifications, but did not answer questions specific to any given article.

Search
The search combined four Boolean-based components: a) COVID-19 research, 17 b) regional government units (e.g., country, state, county, and specific country, state, or province names, etc.), c) policy or policies, and d) impact or effect. The full search terms are available in Appendix 2. The search was limited to published articles in peer-reviewed journals, largely to identify literature that was high quality, relevant, prominent, and most applicable to the review guidance. PubMed was chosen as the exclusive indexing source due to the prevalence and prominence of policy impact studies in the health and medical field. Preprints were excluded to limit the volume of studies to be screened and to ensure each had met the standards for publication through peer review. The search was conducted on November 26, 2020.

Screening
Eight reviewers screened the title and abstract of each article for the inclusion criteria. Two reviewers were randomly selected to screen each article for acceptance/rejection.
In the case of a dispute, a third randomly selected reviewer decided on acceptance/rejection. Training consisted of a one-hour instruction meeting, a review of the first 50 items on each reviewer's list of assigned articles, and a brief asynchronous online discussion before conducting the full review.

Full article review
The full article review consisted of two sub-phases: an independent primary review phase and a group consensus phase. Each article was randomly assigned to three of the 23 reviewers in our review pool. Each reviewer independently reviewed each article on their list, first for whether the study met the eligibility criteria, and then responding to methods identification and guided strength-of-evidence questions using the review tool, as described below. Reviewers were able to recuse themselves for any reason, in which case another reviewer was randomly selected. Once all three reviewers had reviewed a given article, all articles that were not unanimously determined to not meet the inclusion criteria underwent a consensus process. During the consensus round, the three reviewers were given all three primary reviews for reference and were tasked with generating a consensus opinion among the group. One randomly selected reviewer was tasked to act as the arbitrator. If consensus could not be reached, a fourth randomly selected reviewer was brought into the discussion to help resolve disputes.

Review tool
The review tool and data collection process was an operationalized and lightly adapted version of the COVID-19 health policy impact evaluation review guidance, written by the lead authors of this study. All reviewers were instructed to read and refer to this guidance document to guide their assessments. Additional explanation and rationale for all parts of this review tool is available in Haber et al., 2020. 14 The review tool consisted of two main parts: methods design categorization and full review.

The review tool and guidance categorize policy causal inference designs based on the structure of their assumed counterfactual, assessed by identifying the data structure and comparison(s) being made. There are two main items for this determination: the number of pre-period time points (if any) used to assess pre-policy outcome trends, and whether or not policy regions were compared with non-policy regions. These, and other supporting questions, broadly allowed categorization of methods into cross-sectional, pre/post, interrupted time-series (ITS), difference-in-differences (DiD), comparative interrupted time-series (CITS), (randomized) trials, or other, as sketched in the example below. Given that most papers include several analyses, reviewers were asked to focus exclusively on the impact evaluation analysis that was used as the primary support for the main conclusion of the article.

Studies categorized as cross-sectional, pre/post, randomized controlled trial, or other designs were set aside with no further review for the purposes of this research. Cross-sectional and pre/post designs were considered inappropriate for policy causal inference for COVID-19, due largely to their inability to account for a large number of potential issues, including confounding, epidemic trends, and selection biases. Randomized controlled trials were assumed to broadly meet key design checks. Studies categorized as "other" received no further review, as the review guidance would be unable to assess them. Additional justification and explanation for this decision is available in the review guidance.
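As a point of reference for how those two items jointly imply a design category, the following is a minimal illustrative sketch in R. It is not the authors' actual instrument: the function name and the simplified two-input logic are ours, and the real categorization also relied on supporting questions.

```r
# Hypothetical simplification of the design categorization logic:
# the number of pre-policy time points and the presence of a
# non-policy comparison group jointly imply the design category.
classify_design <- function(n_pre_periods, has_comparison_group) {
  if (n_pre_periods == 0) {
    # Outcomes observed only post-policy: comparisons can only be
    # made across regions at a single point in time.
    if (has_comparison_group) "cross-sectional" else "other"
  } else if (n_pre_periods == 1) {
    # A single pre-period point allows before/after comparisons,
    # but no assessment of pre-policy outcome trends.
    if (has_comparison_group) "difference-in-differences (DiD)" else "pre/post"
  } else {
    # Multiple pre-period points allow modeling pre-policy trends.
    if (has_comparison_group) "comparative interrupted time-series (CITS)"
    else "interrupted time-series (ITS)"
  }
}

classify_design(n_pre_periods = 10, has_comparison_group = FALSE)
#> "interrupted time-series (ITS)"
```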
For the methods receiving full review (ITS, DiD, and CITS), reviewers were asked to identify potential issues and give a category-specific rating. The specific study designs triggered sub-questions and/or slightly altered the language of the questions being asked, but all three of the methods design categories shared these four key questions:
• Graphical presentation: "Does the analysis provide graphical representation of the outcome over time?"
• Functional form: "Is the functional form of the counterfactual (e.g., linear) well-justified and appropriate?"
• Timing of policy impact: "Is the date or time threshold set to the appropriate date or time (e.g., is there lag between the intervention and outcome)?"
• Concurrent changes: "Is this policy the only uncontrolled or unadjusted-for way in which the outcome could have changed during the measurement period [differently for policy and non-policy regions]?"

For each of the four key questions, reviewers were given the option to select "No," "Mostly no," "Mostly yes," and "Yes," with justification text requested for all answers other than "Yes." Each question had additional prompts as guidance, with much more detail provided in the full guidance document. Graphical representation is included here primarily as a key way to assess the plausibility and justification of key model assumptions, rather than being necessary for validity by itself. Finally, reviewers were asked a summary question:
• Overall: "Do you believe that the design is appropriate for identifying the policy impact(s) of interest?"

Reviewers were asked to treat the scale of this question as independent of and not relative to any other papers, such that any one substantial issue with the study design could render it a "No" or "Mostly no." Reviewers were asked to follow the guidance and their previous answers, allowing for their own weighting of how important each issue was to the final result. A study could be excellent on all dimensions except for one, and that one dimension could render it inappropriate for causal inference. As such, in addition to the overall rating question, we also generated a "weakest link" metric for overall assessment, representing the lowest rating among the four key questions (graphical representation, functional form, timing of policy impact, and concurrent changes); a sketch of this metric appears after the statistical analysis description below.

A "mostly yes" or "yes" is considered a passing rating, indicating that the study was not found to be inappropriate on the specific dimension of interest. A "yes" rating does not necessarily indicate that the study is strongly designed, conducted, or useful; it only means that it passes a series of key design checks for policy impact evaluation and should be considered for further evaluation. The papers may contain any number of other issues that were not reviewed (e.g., statistical issues, inappropriate comparisons, generalizability, etc.). As such, this should only be considered an initial assessment of plausibility that the study is well designed, rather than confirmation that it is appropriate and applicable. The full review tool is available in the supplementary materials.

Statistical analysis
Statistics provided are nearly exclusively counts and percentages of the final dataset. Analyses and graphics were performed in R. 18 Inter-rater reliability was assessed using Krippendorff's alpha 19 via the irr package. 20 Relative risks were estimated using the epitools package. 21
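To make the "weakest link" metric and the reliability statistic concrete, the following is a minimal sketch in R, assuming the four-level scale is coded 1-4; the object names and example ratings are illustrative, not taken from the study's code or data.

```r
library(irr)  # provides kripp.alpha()

# Code the four-level ordinal scale used throughout the review.
scale4 <- c("no" = 1, "mostly no" = 2, "mostly yes" = 3, "yes" = 4)

# "Weakest link": the lowest rating among the four key questions.
weakest_link <- function(graphical, functional_form, timing, concurrent) {
  ratings <- scale4[c(graphical, functional_form, timing, concurrent)]
  names(scale4)[min(ratings)]
}
weakest_link("yes", "mostly yes", "yes", "no")
#> "no"

# Krippendorff's alpha for three raters on a hypothetical set of five
# articles: one row per rater, one column per article, ordinal metric.
ratings <- rbind(rater1 = c(1, 3, 4, 2, 2),
                 rater2 = c(2, 3, 4, 1, 2),
                 rater3 = c(1, 4, 4, 2, 3))
kripp.alpha(ratings, method = "ordinal")
```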
Citation counts for accepted articles were obtained through Google Scholar 22 on January 11, 2021. Journal impact factors were obtained from the 2019 Journal Citation Reports. 23

Data and code
Data, code, the review tool, and the review guidance are stored and available here: https://osf.io/9xmke/files/. The dataset includes full results from the search and screening, and all review tool responses from reviewers during the full review phase.

Search and screening
After search and screening of titles and abstracts, 102 articles were identified as likely or potentially meeting our inclusion criteria. Of those 102 articles, 36 met inclusion after independent review and deliberation in the consensus process. The most common reasons for rejection at this stage were that the study did not measure the quantitative direct impact of specific policies and/or that such an impact was not the main purpose of the study. Many of these studies implied in the abstract or introduction that they measured policy impact, but instead measured correlations with secondary outcomes (e.g., the effect of movement reductions, which are influenced by policy) and/or performed cursory policy impact evaluation secondary to projection modelling efforts.

Publication information from our sample is shown in Figure 2. The articles in our sample were generally published in journals with high impact factors (median impact factor: 3.6) and have already been cited in the academic literature (median citation count: 5, as of January 11, 2021). The most commonly evaluated policy type was stay-at-home requirements (64%, n=23/36). Reviewers noted that many articles referenced "lockdowns" but did not define the specific policies to which this referred. Reviewers most commonly selected interrupted time-series (39%, n=14/36) as the methods design, followed by difference-in-differences (25%, n=9/36) and pre/post (22%, n=8/36). There were no randomized controlled trials of COVID-19 health policies identified (0%, n=0/36), nor were any studies identified that reviewers could not categorize based on the review guidance (0%, n=0/36).

The identified articles and selected review results are summarized in Table 1.

Table 1: Summary of articles reviewed and reviewer ratings for key and overall questions
Figure 3: Summary of overall and key design question ratings
This chart shows the final overall ratings (left) and the key design question ratings for the consensus review of the 36 included studies, answering the degree to which the articles met the given key design question criteria. The key design question ratings were not asked for the nine included articles whose selected methods were assumed by the guidance to be non-appropriate. The question prompt in the figure is shortened for clarity; the full prompt for each key question is available in the Methods section.

Graphical representation of the outcome over time was relatively well rated in our sample, with 74% (n=20/27) of studies given a "mostly yes" or "yes" rating for appropriateness. Reasons cited for non-"yes" ratings included a lack of graphical representation of the data, alternative scales used, and not showing the dates of policy implementation.

Functional form presented a major issue in these studies, with only 19% (n=5/27) receiving a "mostly yes" or "yes" rating, 78% (n=21/27) receiving a "no" rating, and 4% (n=1/27) rated "unclear." There were two common themes in this category: studies generally using scales that were broadly considered inappropriate for infectious disease outcomes (e.g., linear counts), and/or studies lacking stated justification for the scale used. Reviewers also noted disconnects between clear curvature in the outcomes in the graphical representations and the analysis models and outcome scales used (e.g., linear). In one case, reviewers could not identify the functional form actually used in the analysis.

Reviewers broadly found that these studies dealt with timing of policy impact (e.g., lags between policy implementation and expected impact) relatively well, with 70% (n=19/27) rated "yes" or "mostly yes." Reasons for non-"yes" responses included not adjusting for lags and a lack of justification for the specific lags used.

Concurrent changes were found to be a major issue in these studies, with only 11% (n=3/27) of studies receiving passing ratings ("yes" or "mostly yes") with regard to uncontrolled concurrent changes to the outcomes. Reviewers nearly ubiquitously noted that the articles failed to account for the impact of other policies that could have affected COVID-19 outcomes concurrently with the policies of interest. Other issues cited were largely related to non-policy-induced behavioral and societal changes.

When reviewers were asked if sensitivity analyses had been performed on key assumptions and parameters, about half (56%, n=15/27) answered "mostly yes" or "yes." The most common reason for non-"yes" ratings was that, while sensitivity analyses were performed, they did not address the most substantial assumptions and issues.

Overall, reviewers rated only four studies (11%, n=4/36) as being plausibly appropriate ("mostly yes" or "yes") for identifying the impact of specific policies on COVID-19 outcomes, as shown in Figure 3. 25% (n=9/36) were automatically categorized as inappropriate due to being either cross-sectional or pre/post in design, 33% (n=12/36) were given a "no" rating for appropriateness, 31% "mostly no" (n=11/36), 8% "mostly yes" (n=3/36), and 3% "yes" (n=1/36).
The most common reason cited for non-"yes" overall ratings was failure to account for concurrent changes (particularly policy and societal changes).

Figure 4: Comparison of independent reviews, weakest link, and direct consensus review
This chart shows the final overall ratings by three different possible metrics. The first column contains all of the independent review ratings for the 27 studies that were eventually included in our sample, noting that reviewers who either selected them as not meeting inclusion criteria or selected a method that did not receive the full review did not contribute. The middle column contains the final consensus reviews among the 27 articles that received full review. The last column contains the weakest link rating, as described in the Methods section. The question prompt in the figure is shortened for clarity; the full prompt for each key question is available in the Methods section.

As shown in Figure 4, the consensus overall proportion passing ("mostly yes" or "yes") was a quarter of what it was in the initial independent reviews: 45% (n=34/75) of independent review ratings were "yes" or "mostly yes," as compared to 11% (n=4/36) in the consensus round (RR 0.25, 95% CI 0.09 to 0.64; reproduced in the sketch below). The issues identified and discussed in combination during consensus discussions, as well as additional clarity on the review process, resulted in reduced overall confidence in the findings. Increased clarity on the review guidance with experience and time may also have reduced these ratings further.

The large majority of studies had at least one "no" or "unclear" rating in one of the four categories (74%, n=20/27); only one study had "mostly yes" as its lowest rating, and no studies were rated "yes" in all four categories. Thus only one study passed the design criteria in all four key question categories, as shown in the "weakest link" column in Figure 4.

During independent review, all three reviewers independently came to the same conclusion on the main methods design category for 33% (n=12/36) of articles, two of the three reviewers agreed for 44% (n=16/36) of articles, and none of the reviewers agreed in 22% (n=8/36) of cases. One major contributor to these discrepancies was the 31% (n=11/36) of cases in which one or more reviewers marked the study as not meeting eligibility criteria; in 64% (n=7/11) of these, the other two reviewers agreed on the methods design category.

Inter-rater reliability of the primary independent reviews was relatively low across the board for the key questions. For the overall scores, Krippendorff's alpha was only 0.16, due to widely varying opinions between raters. The four key categorical questions had slightly better inter-rater reliability than the overall question, with Krippendorff's alphas of 0.59 for graphical representation, 0.34 for functional form, 0.44 for timing of policy impact, and 0.15 for concurrent changes, respectively. The consensus rating for overall strength was equal to the lowest rating among the independent reviews in 78% (n=21/27) of cases, and one rating level higher than the lowest in the remaining 22% (n=6/27). This strongly suggests that the multi-reviewer discussion and consensus process better identifies issues than independent review alone.
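As a check on the ratio reported above, the relative risk of a passing rating in consensus versus independent review can be reproduced from the published counts. The following is a minimal sketch using the epitools package named in the Methods section; the Wald interval is our assumption about the method used.

```r
library(epitools)  # provides riskratio()

# Passing ("yes"/"mostly yes") overall ratings: 34/75 independent
# reviews vs. 4/36 consensus reviews. Rows are review rounds (first
# row is the reference group); columns are not-passing vs. passing.
counts <- matrix(c(75 - 34, 34,
                   36 - 4,  4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(round = c("independent", "consensus"),
                                 rating = c("not passing", "passing")))
riskratio(counts, method = "wald")$measure
#> consensus vs. independent: RR ~0.25, 95% CI ~0.09 to 0.64
```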
Differences in initial opinions between reviewers may be attributable to any number of factors, including true differences in opinion, misunderstandings of or learning about the review tool and process, and expected reliance on the consensus process. Notably, there were two cases in which reviewers requested an additional fourth reviewer to help resolve standing issues on which they felt unable to come to consensus.

The most consistent point of feedback from reviewers was the value of having a three-reviewer team with whom to discuss and deliberate, rather than two as initially planned. This was reported to help catch a larger number of issues and to clarify both the papers and the interpretation of the review tool questions. Reviewers also expressed that one of the most difficult parts of this process was assessing the inclusion criteria, some of the implications of which are discussed below.

Discussion
This systematic review of evidence strength found that only four (or only one, by a stricter standard) of the 36 identified published and peer-reviewed health policy impact evaluation studies passed a set of key checks for identifying the causal impact of policies on COVID-19 outcomes. Because this systematic review examined a limited set of key study design features and did not address more detailed aspects of study design, statistical issues, generalizability, and any number of other issues, this result may be considered an upper bound on the overall strength of evidence within this sample. Two major problems are nearly ubiquitous throughout this literature: failure to isolate the impact of the policy(s) of interest from other changes that were occurring contemporaneously, and failure to appropriately address the functional form of infectious disease outcomes in a population setting. Similar to other areas of the COVID-19 literature, 24 we found that the current literature directly evaluating the impact of COVID-19 policies largely fails to meet key design criteria for useful inference.

The framework for the review tool is based on the requirements and assumptions built into policy evaluation methods. Quasi-experimental methods rely critically on the scenarios in which the data are generated. These assumptions, and the circumstances in which they are plausible, are well documented and understood, 2, 3, 14, 25-27 including one paper discussing the application of difference-in-differences methods specifically for COVID-19 health policy, released in May 2020. 3 While "no uncontrolled concurrent changes" is a difficult bar to clear, that bar is fundamental to inference using these methods. The circumstances of isolating the impact of policies in COVID-19 (including large numbers of policies, infectious disease dynamics, and massive changes to social behaviors) make those already difficult fundamental assumptions broadly much less likely to be met. Some of the studies in our sample were nearly the best feasible studies that could be done given the circumstances, but the best that can be done often yields little useful inference. The relative paucity of strong studies does not in any way imply a lack of impact of those policies; only that we lack the circumstances to have evaluated their effects.
Because the studies estimating the harms of policies share the same fundamental circumstances, the evidence on COVID-19 policy harms is likely to be of similarly poor strength. Identifying the effects of many of these policies, particularly for the spring of 2020, is likely to be unknown and perhaps unknowable. However, there remain additional opportunities with more favorable circumstances, such as measuring the overall impact of NPIs as bundles rather than as individual policies. Similarly, studies estimating the impact of re-opening policies or policy cancellation are likely to have fewer concurrent changes to address.

The review process itself demonstrates how guided and targeted peer review can efficiently evaluate studies in ways that the traditional peer review systems do not. The studies in our sample had passed the full peer review process, were published in largely high-profile journals, and are highly cited, but contained substantial flaws that rendered their inferential utility questionable. The relatively small number of studies included, as compared to the size of the literature concerning itself with COVID-19 policy, may suggest that there was relative restraint from journal editors and reviewers in publishing these types of studies. The large number of models but relatively small number of primary evaluation analyses is consistent with other areas of COVID-19 research. 29 At minimum, the flaws and limitations in their inference could have been communicated at the time of publication, when they are needed most. In other cases, it is plausible that many of these studies would not have been published had a more thorough or better targeted methodological review been performed.

This systematic review of evidence strength has limitations. The tool itself was limited to a very narrow, albeit critical, set of items. Low ratings in our study should not be interpreted as marking overall poor studies, as the studies may make other contributions to the literature that we did not evaluate. While the guidance provided a well-structured framework and our reviewer pool was well qualified, strength-of-evidence review is inherently subjective, and it is plausible and likely that other sets of reviewers would come to different conclusions.

Most importantly, this review does not cover all policy inference in the scientific literature. First, one large literature from which there may be COVID-19 policy evaluation otherwise meeting our inclusion criteria is pre-prints, and many pre-prints would likely fare well in our review process. Higher-strength papers often require more time for review and publication, and many high-quality papers may be in the publication pipeline now. Second, this review excluded studies in which a quantitative impact evaluation was a secondary part of the study (e.g., to estimate parameters for microsimulation or disease modeling). Not only are these assessments not the primary purpose of those studies, they also typically lack the detail requisite for a critical assessment of the study design and methods used. Third, the review does not include policy inference studies that do not measure the impact of a specific policy.
For instance, there are studies that estimate the impact of reduced mobility on COVID-19 outcomes but do not attribute the reduced mobility to any specific policy change. Finally, a considerable number of studies that present analyses of COVID-19 outcomes to inform policy are excluded because they do not present a quantitative estimate of specific policies' treatment effects.

While COVID-19 policy is one of the most important problems of our time, the circumstances under which those policies were enacted severely hamper our ability to study and understand their effects. Claimed conclusions are only as valuable as the methods by which they are produced. Replicable, rigorous, intense, and methodologically guided review is needed to both communicate our limitations and make more useful inference. Weak, unreliable, and overconfident evidence leads to poor decisions and undermines trust in science. 12, 30 In the case of COVID-19 health policy, a frank appraisal of the strength of the studies on which policies are based is needed, alongside the understanding that we often must make decisions when strong evidence is not feasible. 31

Works cited (excluding reviewed articles)
Making Decisions in a COVID-19 World
Defining High-Value Information for COVID-19 Decision-Making. Health Policy
Using Difference-in-Differences to Identify Causal Effects of COVID-19 Policies
Which interventions work best in a pandemic?
How a torrent of COVID science changed research publishing - in seven charts
Pandemic publishing poses a new COVID-19 challenge
Rapid publications risk the integrity of science in the era of COVID-19
An alarming retraction rate for scientific publications on Coronavirus Disease 2019 (COVID-19)
An "alarming" and "exceptionally high" rate of COVID-19 retractions?
Scientific quality of COVID-19 and SARS CoV-2 publications in the highest impact medical journals during the early phase of the pandemic: A case control study
A systematic bias assessment of top-cited full-length original clinical investigations related to COVID-19
Waste in covid-19 research
A How-to Guide for Conducting Retrospective Analyses: Example COVID-19 Study. Open Science Framework
Policy evaluation in COVID-19: A guide to common design issues
Systematic review of COVID-19 policy evaluation methods and design
Variation in Government Responses to COVID-19
R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing
Content Analysis: An Introduction to Its Methodology
Various Coefficients of Interrater Reliability and Agreement
Prediction models for diagnosis and prognosis of covid-19: systematic review and critical appraisal
Mostly Harmless Econometrics: An Empiricist's Companion
Quasi-experimental study designs series-paper 7: assessing the assumptions
Evaluating the impact of healthcare interventions using routine data
Measures implemented in the school setting to contain the COVID-19 pandemic: a rapid scoping review
COVID-19-related medical research: a meta-research and critical appraisal
Too much information, too little evidence: is waste in research fuelling the covid-19 infodemic?
Will COVID-19 be evidence-based medicine's nemesis?

Reviewed study citations
The Efficacy of Lockdown Against COVID-19: A Cross-Country Panel Analysis. Appl Health Econ Health Policy
Empirical assessment of government policies and flattening of the COVID 19 curve
Association Between Statewide School Closure and COVID-19 Incidence and Mortality in the US
U.S. county level analysis to determine if social distancing slowed the spread of COVID-19. Revista Panamericana de Salud Pública
All things equal? Heterogeneity in policy effectiveness against COVID-19 spread in Chile
COVID-19: The impact of social distancing policies, cross-country analysis. EconDisCliCha
The effect of state-level stay-at-home orders on COVID-19 infection rates
Examining the effect of social distancing on the compound growth rate of COVID-19 at the county level (United States) using statistical analyses and a random forest machine learning model
Strong Social Distancing Measures In The United States Reduced The COVID-19 Growth Rate: Study evaluates the impact of social distancing measures on the growth rate of confirmed COVID-19 cases across the United States
COVID-19 spreading in Rio de Janeiro, Brazil: Do the policies of social isolation really work? Chaos
When do shelter-in-place orders fight COVID-19 best?
Were urban cowboys enough to control COVID-19? Local shelter-in-place orders and coronavirus case growth
Extensive Testing May Reduce COVID-19 Mortality: A Lesson From Northern Italy
SARS-CoV-2 infection in London, England: changes to community point prevalence around lockdown time
Trends in COVID-19 Incidence After Implementation of Mitigation Measures - Arizona
The effect of large-scale anti-contagion policies on the COVID-19 pandemic
Analysis of the impact of lockdown on the reproduction number of the SARS-Cov-2 in Spain
Physical distancing interventions and incidence of coronavirus disease 2019: natural experiment in 149 countries
The Effects of Border Shutdowns on the Spread of COVID-19
Effects of policies and containment measures on control of COVID-19 epidemic in Chongqing
Revealing regional disparities in the transmission potential of SARS-CoV-2 from interventions in Southeast Asia
Comparison of Estimated Rates of Coronavirus Disease 2019 (COVID-19) in Border Counties in Iowa Without a Stay-at-Home Order and Border Counties in Illinois With a Stay-at-Home Order
Community Use Of Face Masks And COVID-19: Evidence From A Natural Experiment Of State Mandates In The US: Study examines impact on COVID-19 growth rates associated with state government mandates requiring face mask use in public
Shelter-In-Place Orders Reduced COVID-19 Mortality And Reduced The Rate Of Growth In Hospitalizations: Study examines effects of shelter-in-place orders on daily growth rates of COVID-19 deaths and hospitalizations using event study models
Association of State Stay-at-Home Orders and State-Level African American Population With COVID-19 Case Rates
COVID-19 effective reproduction number dropped during Spain's nationwide dropdown, then spiked at lower-incidence regions
The effect of lockdown on the COVID-19 epidemic in Brazil: evidence from an interrupted time series design
Public health interventions slowed but did not halt the spread of COVID-19 in India
Effect of mitigation measures on the spreading of COVID-19 in hard-hit states in the U.S.
Coronavirus Disease 2019 (COVID-19) Transmission in the United States Before Versus After Relaxation of Statewide Social Distancing Measures
Social distancing merely stabilized COVID-19 in the United States
Fangcang shelter hospitals are a One Health approach for responding to the COVID-19 outbreak in Wuhan
Impact of National Containment Measures on Decelerating the Increase in Daily New Cases of COVID-19 in 54 Countries and 4 Epicenters of the Pandemic: Comparative Observational Study
Associations of Stay-at-Home Order and Face-Masking Recommendation with Trends in Daily New Cases and Deaths of Laboratory-Confirmed COVID-19 in the United States. Exploratory Research and Hypothesis in Medicine
Lessons Learnt from China: National Multidisciplinary Healthcare Assistance
Identifying airborne transmission as the dominant route for the spread of COVID-19

Acknowledgments
We would like to thank Dr. Steven Goodman and Dr. John Ioannidis for their support during the development of this study, and Dr. Lars Hemkins and Dr. Mario Malicki for helpful comments in the protocol development. The authors have no financial or social conflicts of interest to declare. The full, original pre-registered protocol is available here: https://osf.io/7nbk6

Inclusion criteria
Minor language edits were made to the inclusion criteria to improve clarity and fix grammatical and typographical errors.
These edits largely centered on clarifying that a study must estimate the quantitative impact of policies that had already been enacted; the word "quantitative" was not explicitly stated in the original version.

The original protocol specified that each article would receive two independent reviewers. This was increased to three reviewers per article once it became clear both that the number of articles accepted for full review was lower than expected and that there would be substantial differences in opinion between reviewers.

Two changes were made to the statistical methods. First, the original protocol specified that 95% confidence intervals would be calculated. However, after further discussion and review, we determined that sampling-based confidence intervals were not appropriate: our results are not indicative of, nor intended to be representative of, any super- or target-population, and as such sampling-based error is not an appropriate metric for the conclusions of this study. Second, the original protocol specified Kappa-based inter-rater reliability statistics. However, using three reviewers rather than the originally registered two meant that most Kappa statistics would not be appropriate for our review process. Given the three-rater, four-level ordinal scale used, we opted instead to use Krippendorff's alpha.

A number of changes were made to the review tool during the course of the review process. While the original protocol included logic to allow pre/post designs for review in some of the key questions, this was removed for consistency with the guidance document. The remaining changes to the review tool were error corrections and clarifications (e.g., correcting the text for the concurrent changes sections in difference-in-differences so that it stated "uncontrolled" concurrent changes, and distinguishing the DiD/CITS requirements from the ITS requirements to emphasize differential concurrent changes).

Note: The search filters for COVID-19 and SARS-CoV-2 were the exact search terms used for the National Library of Medicine one-click search option at the time of the protocol development and when the search took place. This reflects that some of the early literature referred to Wuhan specifically (both as a geographic reference for where SARS-CoV-2 was initially found and, unfortunately, in early naming of the virus/disease) before official naming conventions became ubiquitous in the literature. In order to comprehensively capture the literature and use searching best practices, we used the most standard and recommended terms.