key: cord-0934146-90td5n9j
authors: Paez, Antonio
title: Reproducibility of Research During COVID‐19: Examining the Case of Population Density and the Basic Reproductive Rate from the Perspective of Spatial Analysis
date: 2021-11-18
journal: Geogr Anal
DOI: 10.1111/gean.12307
sha: 2223ada01a0b81ee5ad280a241ef61054ce24558
doc_id: 934146
cord_uid: 90td5n9j

The emergence of the novel SARS-CoV-2 coronavirus and the global COVID-19 pandemic in 2019 led to explosive growth in scientific research. Alas, much of the research in the literature lacks the conditions needed to be reproducible, and recent publications on the association between population density and the basic reproductive number of SARS-CoV-2 are no exception. Relatively few papers share code and data sufficiently, which hinders not only verification but also additional experimentation. In this article, an example of reproducible research shows the potential of spatial analysis for epidemiology research during COVID-19. Transparency and openness mean that independent researchers can, with only modest efforts, verify findings and use different approaches as appropriate. Given the high stakes of the situation, it is essential that scientific findings, on which good policy depends, are as robust as possible; as the empirical example shows, reproducibility is one of the keys to ensuring this.

The emergence of the novel SARS-CoV-2 coronavirus in 2019, and the global pandemic that followed in its wake, led to an explosive growth of research around the globe. According to Fraser et al. (2021), over 125,000 COVID-19-related papers were released in the first 10 months after the first confirmed case of the disease. Of these, more than 30,000 were shared on pre-print servers, the use of which also exploded in the past year (Kwon 2020; Vlasschaert, Topf, and Hiremath 2020; Añazco et al. 2021).

Given the ruinous human and economic cost of the pandemic, there has been a natural tension in the scientific community between the need to publish research results quickly and the imperative to maintain consistently high-quality standards in scientific reporting; indeed, a call for maintaining the standards in published research termed the deluge of COVID-19 publications a "carnage of substandard research" (Bramstedt 2020). Part of the challenge of maintaining quality standards in published research is that, despite an abundance of recommendations and guidelines (e.g., Ince et al. 2012; Ioannidis et al. 2014; Broggini et al. 2017; Brunsdon and Comber 2020), in practice reproducibility has remained a lofty and somewhat aspirational goal. As reported in the literature, only a woefully small proportion of published research was actually reproducible before the pandemic (Iqbal et al. 2016; Stodden, Seiler, and Ma 2018), and the situation does not appear to have changed substantially since (Gustot 2020; Sumner et al. 2020).

The push for open software and data (Páez 2021; Arribas-Bel et al. 2021; Bivand 2020), along with more strenuous efforts toward open, reproducible research, is simply a continuation of long-standing scientific practices of independent verification. Despite the (at times disproportionate) attention that high profile scandals in science tend to elicit in the media, science as a collective endeavor is remarkable for being a self-correcting enterprise, one with built-in mechanisms and incentives to weed out erroneous ideas. Over the long term, facts tend to prevail in science. At stake are the shorter-term impacts that research may have in other spheres of economic and social life. The case of economists Reinhart and Rogoff comes to mind: by the time the inaccuracies and errors in their research were uncovered (see Herndon et al. 2014), their claims about debt and economic growth had already been seized upon by policy-makers on both sides of the Atlantic to justify austerity policies in the aftermath of the Great Recession of 2007-2009. As later research has demonstrated, those policies cast a long shadow, and their aftereffects continued to be felt for years (Basu, Carney, and Kenworthy 2017).

In the context of COVID-19, a topic that has grabbed the imagination of numerous thinkers has been the prospect of life in cities after the pandemic (e.g., Florida et al. 2020); as a result, the implications of the pandemic for urban planning, design, and management are the topic of ongoing research (e.g., Sharifi and Khavarian-Garmsir 2020). The fact that the worst of the pandemic was initially felt in dense population centers such as Wuhan, Milan, Madrid, and New York unleashed a torrent of research into the associations between density and the spread of the pandemic. The answers to some important questions hang on the results of these research efforts. For example, are lower density regions safer from the pandemic? Are de-densification policies warranted, even if just in the short term? In the longer term, will the risks of life in high density regions presage a flight from cities? And what are the implications of the pandemic for future urban planning and practice? Over the past year, numerous papers have sought to throw light on the underlying issue of density and the pandemic; nonetheless the results, as will be detailed next, remain mixed. Furthermore, to complicate matters, precious few of these studies appear to be sufficiently open to support independent verification.

The objective of this article is to illustrate the importance of reproducibility in research in the context of the flood of COVID-19 papers. For this, I focus on a recent study by Sy, White, and Nichols (2021) that examined the correlation between the basic reproductive number of COVID-19, $R_0$, and population density. The basic reproductive number is a summary measure of contact rates, probability of transmission of a pathogen, and duration of infectiousness. In rough terms, it measures how many new infections each infection begets. The paper of Sy, White, and Nichols (2021) was selected for being, in the literature examined, almost alone in supporting reproducible research. Accordingly, I wish to be clear that my objective in singling out their work for discussion is not to malign their efforts, but rather to demonstrate how open and reproducible research efforts can greatly help to accelerate discovery. More concretely, open data and open code mean that an independent researcher can, with only modest efforts, not only verify the findings reported, but also examine the same data from a perspective which may not have been available to the original researchers due to differences in disciplinary perspectives, methodological traditions, and/or training, among other possible factors. The example, which shows consequential changes in the conclusions reached by different analyses, should serve as a call to researchers to redouble their efforts to increase transparency and reproducibility in their research. In this spirit, the present article also aims to show how data can be packaged in well-documented, shareable units, and code can be embedded into self-contained documents suitable for review and independent verification. The source for this article is an R Markdown document which, along with the data package, is available in a public repository.

The concern with population density and the spread of the virus during the COVID-19 pandemic was fueled, at least in part, by dramatic scenes seen in real-time around the world from large urban centers such as Wuhan, Milan, Madrid, and New York. In theory, there are good reasons to believe that higher density could have a positive association with the transmission of a contagious virus. It has long been known that the potential for interpersonal contact is greater in regions with higher density (see for example the research on urban fields and time-geography, including Moore 1970; Moore and Brown 1970; Farber and Páez 2011). Mathematically, models of exposure and contagion indicate that higher densities can catalyze the transmission of contagious diseases (Li, Richmond, and Roehner 2018; Rocklöv and Sjödin 2020). The idea is intuitive and likely at the root of messages, by some figures in positions of authority, that regions with sparse population densities faced lower risks from the pandemic. As Rocklöv and Sjödin (2020) note, however, mathematical models of contagion are valid at small-to-medium spaces (and presumably, smaller time intervals too, such as time spent in restaurants, concert halls, cruises), and the results do not necessarily transfer to larger spatial units and longer time periods. There are solid reasons for this: while in a restaurant, one can hardly avoid being in proximity to other customers. On the other hand, a person can choose to (or be forced to as a matter of policy) not go to a restaurant in the first place. Nonetheless, the idea that high density correlates with high transmission is so seemingly sensible that it is often taken for granted even at the scale of large spaces (e.g., Cruz et al. 2020; Micallef et al. 2020). In such conditions, however, there exists the possibility of behavioral adaptations, which are difficult to capture in the mechanistic framework of differential equations (or can be missing in agent-based models, e.g., Gomez et al. 2021); these adaptations, in fact, can be a key aspect of disease transmission.

A plausible behavioral adaptation during a pandemic, especially one broadcast as widely and intensely as COVID-19, is risk compensation. Risk compensation is a process whereby people adjust their behavior in response to their perception of risk (Noland 1995; Richens, Imrie, and Copas 2000; Phillips et al. 2011 ). In the case of COVID-19, Chauhan et al. (2021) have found that perception of risks in the United States varies between rural, suburban, and urban residents, with rural residents in general expressing less concern about the virus. It is possible that people who listened to the message of leaders saying that they were safe from the virus because of low density may not have taken adequate precautions. Conversely, people in dense places who could more directly observe the impact of the pandemic may have become overly cautious. Both Paez et al. (2020) and Hamidi, Ewing, and Sabouri (2020b) posit this mechanism (i.e., greater compliance with social distancing in denser regions) to explain the results of their analyses. The evidence available does indeed show that there were important changes in behavior with respect to mobility during the pandemic (Jamal and Paez 2020; Molloy et al. 2020; Harris and Branion-Calles 2021) ; furthermore, shelter in place orders may have had greater buy-in from the public in higher density regions (Feyman et al. 2020; Hamidi and Zandiatashbar 2021) , and the associated behavior may have persisted beyond the duration of official social-distancing policies (Praharaj et al. 2020 ). In addition, there is evidence that changes in mobility correlated with the trajectory of the pandemic (Paez 2020; Noland 2021) . Given the potential for behavioral adaptation, the question of density becomes more nuanced: it is not just a matter of proximity, but also of human behavior, which is better studied using population-level data and models.

When it comes to population density and the spread of COVID-19, the international literature to date remains inconclusive.

On the one hand, there are studies that report positive associations between population density and various COVID-19-related outcomes. Bhadra, Mukherjee, and Sarkar (2021), for example, reported a moderate positive correlation between the spread of COVID-19 and population density at the district level in India; however, their analysis was bivariate and did not control for other variables, such as income. Similarly, Kadi and Khelfaoui (2020) found a positive and significant correlation between number of cases and population density in cities in Algeria in a series of simple regression models (i.e., without other controls). A question in these relatively simple analyses is whether density is not a proxy for other factors. Other studies have included controls, such as Pequeno et al. (2020), a team that reported a positive association between density and cumulative counts of confirmed COVID-19 cases in state capitals in Brazil after controlling for covariates, including income, transport connectivity, and economic status. In a similar vein, Fielding-Miller, Sundaram, and Brouwer (2020) reported a positive relationship between the absolute number of COVID-19 deaths and population density (rate) in rural counties in the United States. Roy and Ghosh (2020) used a battery of machine learning techniques to find discriminatory factors, and a positive and significant association between COVID-19 infection and death rates in U.S. states. Wong and Li (2020) also found a positive and significant association between population density and number of confirmed COVID-19 cases in U.S. counties, using both univariate and multivariate regressions with spatial effects. More recently, Sy, White, and Nichols (2021) reported that the basic reproductive number of COVID-19 in U.S. counties tended to increase with population density, but at a decreasing rate at higher densities.

On the other hand, a number of studies report non-significant or negative associations between population density and COVID-19 outcomes. This includes the research of Sun et al. (2020) who did not find evidence of significant correlation between population density and confirmed number of cases per day in conditions of lockdown in China. This finding echoes the results of Paez et al. (2020) , who in their study of provinces in Spain reported nonsignificant associations between population density and infection rates in the early days of the first wave of COVID-19, and negative significant associations in the later part of the first lockdown. Similarly, Skórka et al. (2020) found zero or negative associations between population density and infection numbers/deaths by country. Fielding-Miller, Sundaram, and Brouwer (2020) contrast their finding about rural counties with a negative relationship between COVID-19 deaths and population density in urban counties in the United States. For their part, in their investigation of doubling time, White and Hébert-Dufresne (2020) identified a negative and significant correlation between population density and doubling time in U.S. states. Likewise, Khavarian-Garmsir, Sharifi, and Moradpour (2021) found a small negative (and significant) association between population density and COVID-19 morbidity in districts in Tehran. Finally, two of the most complete studies in the United States, by Hamidi, Ewing, and Sabouri (2020a, b) , used an extensive set of controls to find negative and significant correlations between density and COVID-19 cases and fatalities at the level of counties in the United States.

As can be seen, these studies are implemented at different scales in different regions of the world. They also use a range of techniques, from correlation analysis, to multivariate regression, spatial regressions, and machine learning techniques. This is natural and to be expected: individual researchers have only limited time and expertise. This is why reproducibility is important. To pick an example (which will be further elaborated in later sections of this article), the study of Sy, White, and Nichols (2021) , hereafter referred to as SWN, would immediately grab the attention of a researcher with expertise in spatial analysis.

SWN investigated the basic reproductive number of COVID-19 in U.S. counties, and its association with population density, median household income, and prevalence of private mobility. For their multivariate analysis, SWN used mixed linear models. This is an appropriate modeling choice: $R_0$ is an interval-ratio variable that is suitably modeled using linear regression; further, as SWN note, there is a likelihood that the process is not independent "among counties within each state, potentially due to variable resource allocation and differing health systems across states" (p. 3). A mixed linear model accounts for this by introducing random components; in the case of SWN, these are random intercepts at the state level. SWN estimated various models with different combinations of variables, including median household income and prevalence of travel by private transportation. These controls help to account for potential variations in behavior: people in more affluent counties may have greater opportunities to work from home, and use of private transportation reduces contact with strangers. Moreover, they also conducted various sensitivity analyses. After these efforts, SWN concluded that there is a positive association between the basic reproductive number and population density at the level of counties in the United States.
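As a rough sketch of the model class just described, a linear mixed model with state-level random intercepts can be written as follows. The notation is assumed here for illustration and is not taken from SWN; the normality of the random terms is the usual textbook assumption rather than something stated in the text.

```latex
R_{0,ij} = \beta_0 + u_j + \boldsymbol{\beta}^{\top} \mathbf{x}_{ij} + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)
```

Here $i$ indexes counties and $j$ indexes states, $\mathbf{x}_{ij}$ collects the county-level covariates (for example, a transformation of population density, median household income, and prevalence of private transportation), and $u_j$ is the deviation of state $j$ from the common intercept $\beta_0$. Counties in the same state share the same $u_j$, which is how the model allows for dependence within states.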

One salient aspect of the analysis in SWN is that the basic reproductive number can only be calculated reliably with a minimum number of cases, and a large number of counties did not meet that threshold. As researchers do, SWN made modeling decisions, in this case basing their analysis only on counties with valid observations. A modeler with expertise in spatial analysis would likely ask some of the following questions on reading SWN's paper: how were missing counties treated? What are the implications of the spatial sampling framework used in the analysis? Is it possible to spatially interpolate the missing observations? Was there spatial residual autocorrelation in the models, or was the use of mixed models sufficient to capture spatial dependencies? These questions are relevant and their implications important. Fortunately, SWN are an example of a reasonably open, reproducible research product: their paper is accompanied by (most of) the data and (most of) the code used in the analysis. This means that an independent researcher can, with only a moderate investment of time and effort, reproduce the results in the paper, as well as ask additional questions.

Alas, reproducibility is not necessarily the norm in the relevant literature.

There are various reasons why a project can fail to be reproducible. In some cases, there might be legitimate reasons to withhold the data, perhaps due to confidentiality and privacy reasons (e.g., Lee et al. 2020). But in many other cases the data are publicly available, which in fact has commonly been the case with population-level COVID-19 information. Typically the provenance of the data is documented, but in numerous studies the data themselves are not shared (Cruz et al. 2020; Feng et al. 2020; Fielding-Miller, Sundaram, and Brouwer 2020; Hamidi, Ewing, and Sabouri 2020a, b; Souris and Gonzalez 2020; Amadu et al. 2021; Bhadra, Mukherjee, and Sarkar 2021; Inbaraj, George, and Chandrasingh 2021). As any researcher can attest, collecting, organizing, and preparing data for a project can take a substantial amount of time. Pointing to the sources of data, even when these sources are public, is a small step toward reproducibility, but only a very small one. The prospect of having to recreate a data set from raw sources is probably sufficient to dissuade all but the most dedicated (or stubborn) researcher from independent verification. This is true even if part of the data are shared (e.g., Wong and Li 2020). In other cases, data are shared, but the processes followed in the preparation of the data are not fully documented (Ahmad et al. 2020; Skórka et al. 2020). These processes matter, as shown by the errors in the spreadsheets of Reinhart and Rogoff (see Herndon et al. 2014 for the discovery of these errors), as well as by the data of biologist Jonathan Pruitt that led to an "avalanche" of paper retractions (see Viglione 2020). Another situation is when papers share well-documented data, but fail to provide the code used in the analysis (Pequeno et al. 2020; Noury et al. 2021; Wang et al. 2021). Making code available only "on demand" (e.g., Brandtner et al. 2021) is an unnecessary barrier when most journals offer the facility to share supplemental materials online. Then there are those papers that more closely comply with reproducibility standards, and share well-documented processes and data, as well as the code used in any analyses reported (Feyman et al. 2020; Paez et al. 2020; White and Hébert-Dufresne 2020; Stephens, Chernyavskiy, and Bruns 2021; Sy, White, and Nichols 2021). Even in this case, the pressure to publish "new findings" instead of replication studies can act as a deterrent. This may be particularly true for younger researchers.

In the following sections, the analysis of SWN is reproduced, some relevant questions from the perspective of an independent researcher with expertise in spatial analysis are asked, and the data are reanalyzed.

SWN examined the association between the basic reproductive number of COVID-19 and population density. The basic reproductive number $R_0$ is a summary measure of contact rates, probability of transmission of a pathogen, and duration of infectiousness. In rough terms, $R_0$ measures how many new infections each infection begets. Infectious disease outbreaks generally tend to die out when $R_0 < 1$, and to grow when $R_0 > 1$. Reliable calculation of $R_0$ requires a minimum number of cases to be able to assume that there is community transmission of the pathogen. Accordingly, SWN based their analysis only on counties that had at least 25 cases at the end of the exponential growth phase (see Fig. 1). Their final sample included 1,151 counties in the United States, including in Alaska, Hawaii, Puerto Rico, and island territories. SWN used COVID-19 data collected by the New York Times and made available (with versioning) in a GitHub repository. For each county, SWN assumed that the exponential growth period began one week prior to the second daily increase in cases, and assumed that the period of exponential growth lasted approximately 18 days. Table 1 reproduces the first three models of SWN (the fourth model did not have any significant variables; see Table 1 in SWN). It is possible to verify that the results match, with only the minor (and irrelevant) exception of the magnitude of the coefficient for travel by private transportation, which is due to a difference in the input (here the variable is changed to 1% units, instead of the 10% units used by SWN). The mixed linear model gives random intercepts (i.e., the intercept is a random variable), and the standard deviation is reported in the fifth row of Table 1. It is useful to map the random intercepts: as seen in Figure 2, other things being equal, counties in Texas tend to have somewhat lower values of $R_0$ (i.e., a negative random intercept), whereas counties in South Dakota tend to have higher values of $R_0$. The key of the analysis, after

The preceding section shows that thanks to the availability of code and data, it is possible to verify the results reported by SWN. As noted earlier, though, an independent researcher might have wondered about the implications of the spatial sampling procedure used by SWN. The decision to use a sample of counties with reliable basic reproductive numbers, although apparently sensible, results in a non-random spatial sampling scheme. Turning our attention back to Fig. 1, there is a distinct impression that many counties without reliable values of $R_0$ are in more rural, less dense parts of the United States. This impression is reinforced when the boundaries of urban areas with population greater than 50,000 are overlaid on the counties with valid values of $R_0$ (see Fig. 3). The fact that $R_0$ could not be accurately computed in many counties without large urban areas does not mean that there was no transmission of the virus: it simply means that we do not know with sufficient precision to what extent that was the case. The low number of cases may be related to low population and/or low population density. This is intriguing, to say the least: by excluding cases based on the ability to calculate $R_0$ we are potentially selecting the sample in a non-random way.
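To make the concern concrete, the following is a minimal simulation sketch, not a reanalysis of SWN's data. Everything in it is synthetic and assumed for illustration: a single covariate standing in for density, a true slope of 1, and an error correlation of 0.8 between the selection process and the outcome. It compares the least-squares slope estimated from all units with the slope estimated only from the units that happen to be sampled.

```python
import math
import random


def ols_slope(xs, ys):
    """Slope of a simple least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, true_slope=1.0, rho=0.8, seed=42):
    """Simulate an outcome that is observed only for a non-random subset.

    All values are synthetic: x stands in for a covariate such as density,
    and rho is the correlation between the selection and outcome errors.
    """
    rng = random.Random(seed)
    full_x, full_y, sel_x, sel_y = [], [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e_sel = rng.gauss(0.0, 1.0)
        # Outcome error constructed to be correlated with the selection error.
        e_out = rho * e_sel + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        y = 1.0 + true_slope * x + e_out
        full_x.append(x)
        full_y.append(y)
        # A unit enters the sample only when its latent selection index is positive.
        if x + e_sel > 0.0:
            sel_x.append(x)
            sel_y.append(y)
    return ols_slope(full_x, full_y), ols_slope(sel_x, sel_y), len(sel_x)


slope_full, slope_selected, n_selected = simulate()
print(f"slope, all units:      {slope_full:.3f}")
print(f"slope, selected units: {slope_selected:.3f} (n = {n_selected})")
```

In this particular setup the slope recovered from the selected units is noticeably smaller than the true value, while the slope from all units is close to it. The direction and size of the distortion depend on the assumed error correlation and selection rule, so the sketch only illustrates that the problem can arise, not how it plays out in any real data set.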

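A problematic issue with non-random sample selection is that parameter estimates can become unreliable, and numerous techniques have been developed to address this. A model useful for sample selection problems is Heckman's selection model (see Maddala 1983). The selection model is in fact a system of two equations, as follows:

$$y_i^{S*} = \boldsymbol{\beta}^{S\top} \mathbf{x}_i^{S} + \varepsilon_i^{S}$$

$$y_i^{O*} = \boldsymbol{\beta}^{O\top} \mathbf{x}_i^{O} + \varepsilon_i^{O}$$

where $y_i^{S*}$ is a latent variable for the sample selection process and $y_i^{O*}$ is the latent outcome. Vectors $\mathbf{x}_i^{S}$ and $\mathbf{x}_i^{O}$ are explanatory variables (with the possibility that $\mathbf{x}_i^{S} = \mathbf{x}_i^{O}$), and $\boldsymbol{\beta}^{S}$ and $\boldsymbol{\beta}^{O}$ are the corresponding coefficient vectors. Both equations include random terms (i.e., $\varepsilon_i^{S}$ and $\varepsilon_i^{O}$). The first equation is designed to model the probability of sampling, and the second equation the outcome of interest (say $R_0$). The random terms are jointly distributed and correlated with parameter $\rho$.

What the analyst observes is the following:

$$y_i^{S} = \begin{cases} 0 & \text{if } y_i^{S*} < 0 \\ 1 & \text{otherwise} \end{cases}$$

and:

$$y_i^{O} = \begin{cases} 0 & \text{if } y_i^{S} = 0 \\ y_i^{O*} & \text{otherwise} \end{cases}$$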

Figure 3. Urban areas with population >50,000 (Alaska, Hawaii, Puerto Rico, and territories not shown).

In other words, the outcome of interest is observed only for certain cases ($y_i^{S} = 1$, i.e., for sampled observations). The probability of sampling depends on $\mathbf{x}_i^{S}$. For the cases observed, the outcome $y_i^{O}$ depends on $\mathbf{x}_i^{O}$. A sample selection model is estimated using the same selection of variables as SWN Model 3. This is Selection Model 1 in Table 2. The first thing to notice about this model is that the sample selection process and the outcome are correlated ($\rho \neq 0$ at the 5% significance level). The selection equation indicates that the probability of a county being in the sample increases with population density (but at a decreasing rate due to the log-transformation), with the prevalence of travel by private modes, and with median household income in the county. This is in line with the impression made by Fig. 3 that counties with reliable values of $R_0$ tended to be those with larger urban centers. Once the selection probabilities are accounted for in the model, several things happen with the outcomes model. First, the coefficient for population density is still positive, but the magnitude changes: in effect, it appears that the effect of density is more pronounced than what SWN Model 3 indicated. The coefficient for percent of private transportation changes sign. And the coefficient for median household income is now significant.
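Why a nonzero correlation between the two processes matters can be seen from a standard textbook result for this class of models. As a sketch, assume (beyond what is stated above) that the two random terms are bivariate normal, with unit variance in the selection equation and standard deviation $\sigma$ in the outcome equation, and that both equations are linear with coefficient vectors $\boldsymbol{\beta}^{S}$ and $\boldsymbol{\beta}^{O}$; this notation is introduced here for illustration. The expected outcome among sampled units is then:

```latex
E\left[ y_i^{O} \mid y_i^{S} = 1 \right]
  = \boldsymbol{\beta}^{O\top} \mathbf{x}_i^{O}
  + \rho \, \sigma \, \lambda\!\left( \boldsymbol{\beta}^{S\top} \mathbf{x}_i^{S} \right),
\qquad
\lambda(z) = \frac{\phi(z)}{\Phi(z)}
```

where $\phi$ and $\Phi$ are the standard normal density and distribution functions and $\rho$ is the correlation between the random terms. When $\rho = 0$ the second term vanishes and a regression on the sampled units alone is unaffected by selection. Otherwise the term behaves like an omitted variable that depends on the covariates driving selection, which is one way to understand why the outcome coefficients can shift once selection is modeled.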

The second model in Table 2 (Selection Model 2) changes the way the variables are entered into the model. The log-transformation of density in SWN and Selection Model 1 assumes that the association between density and $R_0$ is monotonically increasing (if the sign of the coefficient is positive) or decreasing (if the sign of the coefficient is negative). There are some indications that the relationship may actually not be monotonic. For example, Paez et al. (2020) found a positive (if non-significant) relationship between density and incidence of COVID-19 in the provinces of Spain at the beginning of the pandemic. This changed to a negative (and significant) relationship during the lockdown. In the case of the United States, Fielding-Miller, Sundaram, and Brouwer (2020) found that the association between COVID-19 deaths and population density was positive in rural counties, but negative in urban counties. A variable transformation that allows for non-monotonic changes in the relationship is the square of the density.

As seen in the table, Selection Model 2 replaces the log-transformation of population density with a quadratic expansion. The results of this analysis indicate that with this variable transformation, the selection and outcome processes are still correlated ($\rho \neq 0$ at the 5% significance level). But a few other interesting things emerge. On examination of the outcomes model, the quadratic expansion has a positive coefficient for the first order term, but a negative coefficient for the second order term. This indicates that $R_0$ initially tends to increase as density grows, but only up to a point, after which the negative second term (which grows more rapidly due to the square) becomes increasingly dominant. Secondly, the sign of the coefficient for travel by private transportation becomes negative again. This, of course, makes more sense than the positive sign of Selection Model 1: if people tend to travel in private transportation, the potential for contact should be lower instead of higher. And finally, median household income is no longer significant, similar to SWN Model 3.
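The contrast between the two transformations can be made explicit with a short derivation. Writing $d$ for density and using generic coefficients ($\gamma$ for the log specification, $\beta_1$ and $\beta_2$ for the quadratic; these symbols are assumed for illustration and are not the notation of the tables), the marginal effects are:

```latex
\frac{\partial}{\partial d}\left( \gamma \log d \right) = \frac{\gamma}{d},
\qquad
\frac{\partial}{\partial d}\left( \beta_1 d + \beta_2 d^2 \right) = \beta_1 + 2 \beta_2 d
```

For $d > 0$ the first expression always has the sign of $\gamma$, so the log specification cannot change direction. With $\beta_1 > 0$ and $\beta_2 < 0$, the second expression is positive below the turning point $d^{*} = -\beta_1 / (2\beta_2)$ and negative above it, and the total contribution $\beta_1 d + \beta_2 d^2$ returns to zero at $d = -\beta_1/\beta_2 = 2d^{*}$.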

The results of the selection models, in particular Selection Model 2, make us reassess the original conclusion that density has a positive association with the basic reproductive number of COVID-19. A spatial analyst might still wonder about spatial residual autocorrelation. A challenge here is that spatial models tend to be technically more demanding, and although spatial models for qualitative variables exist, a spatial implementation of the sample selection model does not appear to exist. It might be argued that a reproducible research project can also allow a researcher to be more adventurous with their modeling decisions: since data and code are shared, other researchers can promptly and with relative ease poke the methods and see if they appear to be sound.

In the present case, it appears that an application of spatial filtering (see Getis and Griffith 2002; Griffith 2004; Paez 2019) can help. Spatial filtering provides an elegant solution to regression problems that may have difficulties handling the spatial structures of spatial statistical and econometric models (Griffith 2000) . A key issue in the present example is the fact that there are numerous missing observations, which prevents the calculation of autocorrelation statistics, let alone the estimation of models with spatial components.

The following is an unorthodox, but potentially effective use of filters in a sample selection model:

1. Estimate a sample selection model and retrieve the residuals of the outcome. This will be a vector with missing values for locations that were not sampled.
2. Fit a spatial filter to the residuals. This is done by regressing the estimated residuals of the observed data on the corresponding values of the Moran eigenvectors.
3. The resulting filter will correlate highly with the known residuals, and will provide information in non-sampled locations that is consistent with the spatial pattern of the known residuals.
4. Test the filter for spatial autocorrelation:
   4.1 If significant spatial autocorrelation is detected, this would be indicative of residual spatial pattern. Introduce the filter as a covariate in the outcome model of the sample selection model and return to step 1.
   4.2 If no significant spatial autocorrelation is detected, this would be indicative of random residual pattern. Stop.

This procedure is implemented using a stopping criterion whereby the search for the filter only stops when the P-value of Moran's Coefficient of the filter fitted to the residuals is greater than 0.25, which was chosen as a sufficiently conservative value for testing for autocorrelation. The correlation of the known residuals with the corresponding elements of the filter is consistently high (the correlation coefficient typically is greater than 0.9). The results of implementing this procedure appear in Table 3 as Selection Model 3. The results are consistent with Selection Model 2, with two intriguing differences: 1) the variance of Selection Model 3 is smaller; and 2) the sample and outcome processes are no longer correlated (the confidence interval of $\rho$ includes zero). It appears that by capturing the spatial pattern of the residuals, which is likely strongly determined by the non-random sampling framework, the outcome model is not only substantially more precise, but also appears to be independent from the selection process.
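The autocorrelation test in step 4 can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a toy 4 x 4 grid with rook contiguity and binary weights instead of county boundaries, a made-up west-to-east gradient in place of a fitted filter, and a permutation-based pseudo p-value, which is one common way to test Moran's coefficient and not necessarily the one used for the results reported here. It does not compute Moran eigenvectors or fit the filter itself.

```python
import random


def rook_neighbors(n_rows, n_cols):
    """Neighbor lists for a regular grid under rook contiguity (binary weights)."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            cell = r * n_cols + c
            neighbors[cell] = [
                rr * n_cols + cc
                for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= rr < n_rows and 0 <= cc < n_cols
            ]
    return neighbors


def morans_i(values, neighbors):
    """Moran's coefficient: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(len(js) for js in neighbors.values())  # sum of all binary weights
    cross = sum(z[i] * z[j] for i, js in neighbors.items() for j in js)
    return (n / s0) * cross / sum(zi ** 2 for zi in z)


def permutation_p_value(values, neighbors, n_perm=999, seed=1):
    """One-sided pseudo p-value for positive autocorrelation via random relabeling."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (n_perm + 1)


grid = rook_neighbors(4, 4)
# A west-to-east gradient standing in for a spatially patterned filter.
gradient = [float(c) for r in range(4) for c in range(4)]
i_gradient = morans_i(gradient, grid)
p_gradient = permutation_p_value(gradient, grid)
# Stopping rule described in the text: stop once the P-value exceeds 0.25.
stop_search = p_gradient > 0.25
print(f"Moran's I = {i_gradient:.3f}, pseudo p = {p_gradient:.3f}, stop = {stop_search}")
```

With a strongly patterned surface such as the gradient, the pseudo p-value falls well below 0.25, so under the stopping rule the search would continue: the filter would be added to the outcome model and the procedure repeated from step 1.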

Clearly, the various models display some intriguing differences; but how relevant are said differences from a more substantive standpoint? Fig. 4 shows the relationship between density and R₀ implied by SWN Model 3, Selection Model 2, and Selection Model 3. The left panel of the figure shows the non-linear but monotonic relationship implied by SWN Model 1. The conclusion is that at higher densities, R₀ is always higher. The two panels on the right, in contrast, show that Selection Model 2 and Selection Model 3 agree that R₀ tends to increase as density grows. This continues until a density of approximately 2.9 (1,000 people per sq.km). At densities higher than that, the relationship between density and R₀ begins to weaken, and it becomes negative at densities higher than approximately 5.7 (1,000 people per sq.km).

To put this into context, other things being equal, the effect of density in a county like Charlottesville in Virginia (density ~1,639 people per sq.km) is roughly the same as that in a county like Philadelphia (density ~4,127 people per sq.km). In contrast, the effect of density on R₀ in a county like Arlington in Virginia (density ~3,093 people per sq.km) is stronger than in either of the previous two examples. Lastly, the density of counties like San Francisco in California, or Queens and Bronx in NY, which are among the densest in the United States, contributes even less to R₀ than that of even the most rural counties in the country.
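The county comparisons above follow from the shape of the curve, which a small numerical sketch can make concrete. The sketch assumes, purely for illustration, that the density effect is quadratic, f(d) = b1·d + b2·d², with its maximum at the reported density of approximately 2.9; the coefficients are hypothetical and are not the estimates of Table 3. Under that assumption the curve is symmetric around its peak, so densities equally far below and above 2.9 have about the same effect, and the effect returns to zero at twice the peak density.

```python
import numpy as np

# Assumption: quadratic density effect, f(d) = b1 * d + b2 * d**2, with d in
# 1,000 people per sq.km. Coefficients are illustrative, NOT estimated values.
peak = 2.9               # density at which the effect is largest (from the text)
b2 = -1.0                # hypothetical curvature (negative: inverted U)
b1 = -2.0 * b2 * peak    # chosen so that the maximum of f is at d = peak


def density_effect(d):
    """Contribution of density to the outcome under the assumed quadratic."""
    d = np.asarray(d, dtype=float)
    return b1 * d + b2 * d**2


# Densities quoted in the text, in 1,000 people per sq.km.
charlottesville, arlington, philadelphia = 1.639, 3.093, 4.127

effects = {
    "Charlottesville": float(density_effect(charlottesville)),
    "Arlington": float(density_effect(arlington)),
    "Philadelphia": float(density_effect(philadelphia)),
}

# Density at which the effect falls back to zero: twice the peak density.
zero_crossing = -b1 / b2

for county, effect in effects.items():
    print(f"{county}: {effect:.2f}")
print(f"Effect turns negative beyond d = {zero_crossing:.1f}")
```

With these illustrative coefficients, Charlottesville and Philadelphia sit on opposite sides of the peak at nearly the same height, Arlington sits close to the peak, and any density beyond the zero crossing contributes less than a density of zero would.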

It is worth recalling at this point Cressie's dictum about modeling: "[w]hat is one person's mean structure could be another person's correlation structure" (Cressie 1989, p. 201). There are almost always multiple ways to approach a modeling situation, as vividly illustrated by a recent paper that reports the results of a crowdsourced modeling experiment (Schweinsberg et al. 2021).

In the present case, I would argue that spatial sampling is an important aspect of the modeling process. Importantly, by adopting high reproducibility standards, SWN made a valuable contribution to the collective enterprise of seeking knowledge. Their effort, and subsequent efforts to validate and expand on their work, can help to bring clarity to ongoing conversations about the relevance of density to the spread of COVID-19.

In particular, it is noteworthy that a sample selection model with a different variable transformation does not lend support to the thesis that higher density is always associated with a greater risk of spread of the virus (in the words of Wong and Li (2020), "'Density is destiny' is probably an overstatement"). At the same time, the results presented here also stand in contrast to the findings of Hamidi et al., who found that higher density was either not significantly associated with the rate of the virus in a cross-sectional study (Hamidi, Ewing, and Sabouri 2020b), or was negatively associated with it in a longitudinal setting (Hamidi, Ewing, and Sabouri 2020a). In this sense, the conclusion that density does not aggravate the pandemic may have been somewhat premature; instead, reanalysis of the data of SWN suggests that Fielding-Miller, Sundaram, and Brouwer (2020) might be onto something with respect to the difference between rural and urban counties. More generally, there is no doubt that in population-level studies density is indicative of proximity, but it is also potentially a proxy for adaptive behavior. And it is possible that the determining factor during COVID-19, at least in the United States, has been variations in perceptions of the risks associated with contagion (Chauhan et al. 2021), and subsequent compensations in behavior in more and less dense regions.

The tension between the need to publish research potentially useful in dealing with a global pandemic, and a potential "carnage of substandard research" (Bramstedt 2020), highlights the importance of efforts to maintain the quality of scientific outputs during COVID-19. An important part of quality control is the ability of independent researchers to verify and examine the results of materials published in the literature. As previous research illustrates, reproducibility in scientific research remains an important but elusive goal (e.g., Iqbal et al. 2016; Stodden, Seiler, and Ma 2018; Gustot 2020; Sumner et al. 2020). This idea is reinforced by the review conducted for this paper in the context of research about population density and the spread of COVID-19.

Taking one recent example from the literature (Sy, White, and Nichols 2021), the present article illustrates the importance of good reproducibility practices. Sharing data and code can catalyze research by allowing independent verification of findings, as well as additional research. After verifying the results of SWN, experiments with sample selection models and variations in the definition of model inputs lead to an important reappraisal of the conclusion that high density is associated with greater spread of the virus. Instead, the possibility of a non-monotonic relationship between population density and contagion is raised. I do not claim that the analysis presented here is the last word on the topic of density and the spread of COVID-19, and there is always the possibility that someone else will be better equipped to analyze these data with greater competence. By opening up the analysis, documenting the way data were pre-processed, and sharing analysis-ready data, my hope is that others will be able to discover the limitations of my own analysis and improve on it, as appropriate.

More generally, my hope is that the research of Sy, White, and Nichols (2021), the present article, and similar reproducible publications will continue to encourage others to adopt higher reproducibility standards in their research.

The analysis reported in this paper was conducted in the R statistical computing language (R Core Team 2021). The source document is an R Markdown document (Xie, Allaire, and Grolemund 2018; Xie, Dervieux, and Riederer 2020) processed using knitr (Xie 2014, 2015). The following packages were used in the analysis, and I wish to acknowledge their creators for their generous efforts: adespatial (Dray et al. 2021), censReg (Henningsen 2020), dplyr

1. Nobel laureate in Economics Paul Krugman noted that "Reinhart-Rogoff may have had more immediate influence on public debate than any previous paper in the history of economics": https://www.nybooks.com/articles/2013/06/06/how-case-austerity-has-crumbled/?pagination=false
2. https://github.com/paezha/Reproductive-Rate-and-Density-US-Reanalyzed
3. Governor Kristi Noem of South Dakota, for example, claimed that sparse population density allowed her state to face the pandemic down without the need for strict policy interventions: https://www.inforum.com/lifestyle/health/5025620-South-Dakota-is-not-New-York-City-Noem-defends-lack-of-statewide-COVID-19-restrictions
4. The present article was desk rejected by three journals that had previously published research on population density and the spread of COVID-19; in one case, the paper was too opinionated for the journal; in the other two cases, the paper was not a "good fit" despite dealing with a nearly identical issue as papers previously published in said journals.
5. https://github.com/nytimes/covid-19-data

Association of Poor Housing Conditions with COVID-19 Incidence and Mortality Across US Counties

Assessing Sub-Regional-Specific Strengths of Healthcare Systems Associated with COVID-19 Prevalence, Deaths and Recoveries in Africa

Publication Rate and Citation Counts for Preprints Released During the COVID-19 Pandemic: The Good, the Bad and the Ugly

Open Data Products-A Framework for Creating Valuable Analysis Ready Data

Ten Years After the Financial Crisis: The Long Reach of Austerity and its Global Impacts on Health

Matrix: Sparse and Dense Matrix Classes and Methods

Impact of Population Density on Covid-19 Infected and Mortality Rate in India

spData: Datasets for Spatial Analysis

Progress in the R Ecosystem for Representing and Handling Spatial Data

Applied Spatial Data Analysis with R

Comparing Implementations of Global and Local Indicators of Spatial Association

The Carnage of Substandard Research During the COVID-19 Pandemic: A Call for Quality

Creatures of the State? Metropolitan Counties Compensated for State Inaction in Initial U.S. Response to COVID-19 Pandemic

Reproducible Research: Geophysics Papers of the Future-Introduction

Opening Practice: Supporting Reproducibility and Critical Spatial Data Science

COVID-19 Related Attitudes and Risk Perceptions Across Urban, Rural, and Suburban Areas in the United States

Computing Generalized Method of Moments and Generalized Empirical Likelihood with R

Geostatistics

Exploring the Young Demographic Profile of COVID-19 Cases in Hong Kong: Evidence from Migration and Travel History Data

adespatial: Multivariate Multiscale Spatial Analysis

Running to Stay in Place: The Time-Use Implications of Automobile Oriented Land-Use and Travel

Spatiotemporal Spread Pattern of the COVID-19 Cases in China

Effectiveness of COVID-19 Shelter-in-Place Orders Varied by State

Social Determinants of COVID-19 Mortality at the County Level

How Life in Our Cities Will Look After the Coronavirus Pandemic

The Evolving Role of Preprints in the Dissemination of COVID-19 Research and Their Impact on the Science Communication Landscape

Computation of Multivariate Normal and T Probabilities

Comparative Spatial Filtering in Regression Analysis

INFEKTA-An Agent-Based Model for Transmission of Infectious Diseases: The COVID-19 Case in Bogotá, Colombia

A Linear Regression Solution to the Spatial Autocorrelation Problem

A Spatial Filtering Specification for the Autologistic Model

Quality and Reproducibility During the COVID-19 Pandemic

Longitudinal Analyses of the Relationship Between Development Density and the COVID-19 Morbidity and Mortality Rates: Early Evidence from 1,165 Metropolitan Counties in the United States

Does Density Aggravate the COVID-19 Pandemic?

Compact Development and Adherence to Stay-at-Home Order During the COVID-19 Pandemic: A Longitudinal Investigation in the United States

Changes in Commute Mode Attributed to COVID-19 Risk in Canadian National Survey Data

censReg: Censored Regression (Tobit) Models

maxLik: A Package for Maximum Likelihood Estimation in R

miscTools: Miscellaneous Tools and Utilities

purrr: Functional Programming Tools

Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff

Seroprevalence of COVID-19 Infection in a Rural District of South India: A Population-Based Seroepidemiological Study

The Case for Open Computer Programs

Increasing Value and Reducing Waste in Research Design, Conduct, and Analysis

Reproducible Research Practices and Transparency Across the Biomedical Literature

Changes in Trip-Making Frequency by Mode During COVID-19

Population Density, a Factor in the Spread of COVID-19 in Algeria: Statistic Study

Are High-Density Districts More Vulnerable to the COVID-19 Pandemic?

In-Depth Examination of Spatiotemporal Figures in Open Reproducible Research

Computational Reproducibility in Geoscientific Papers: Insights from a Series of Studies with Geoscientists and a Reproduction Study

How Swamped Preprint Servers are Blocking Bad Coronavirus Research

Human Mobility Trends During the Early Stage of the COVID-19 Pandemic in the United States

Effect of Population Density on Epidemics

Limited-Dependent and Qualitative Variables in Econometrics

tmvtnorm: Truncated Multivariate Normal and Student T Distribution

A National Cross-Sectional Study

Tracing the Sars-CoV-2 Impact: The First Month in Switzerland

Some Spatial Properties of Urban Contact Fields

Urban Acquaintance Fields: An Evaluation of a Spatial Model

tibble: Simple Data Frames

Perceived Risk and Modal Choice-Risk Compensation in Transportation System

Mobility and the Effective Reproduction Rate of COVID-19

How Does COVID-19 Affect Electoral Participation? Evidence from the French Municipal Elections

Using Spatial Filters and Exploratory Data Analysis to Enhance Regression Models of Spatial Data

Using Google Community Mobility Reports to Investigate the Incidence of COVID-19 in the United States

Open Spatial Sciences: An Introduction

A Spatio-Temporal Analysis of the Environmental Correlates of COVID-19 Incidence in Spain

Simple Features for R: Standardized Support for Spatial Vector Data

Classes and Methods for Spatial Data in R

patchwork: The Composer of Plots

scico: Colour Palettes Based on the Scientific Colour-Maps

Air Transportation, Population Density and Temperature Predict the Spread of COVID-19 in Brazil

Risk Compensation and Bicycle Helmets

nlme: Linear and Nonlinear Mixed Effects Models

R: A Language and Environment for Statistical Computing

Condoms and Seat Belts: The Parallels and the Lessons

High Population Densities Catalyse the Spread of COVID-19

Factors Affecting COVID-19 Infected and Death Rates Inform Lockdown-Related Policymaking

Same Data, Different Conclusions: Radical Dispersion in Empirical Results When Independent Analysts Operationalize and Test the Same Hypothesis

The COVID-19 Pandemic: Impacts on Cities and Major Lessons for Urban Planning, Design, and Management

The Macroecology of the COVID-19 Pandemic in the Anthropocene

COVID-19: Spatial Analysis of Hospital Case-Fatality Rate in France

Impact of Altitude on COVID-19 Infection and Death in the United States: A Modeling and Observational Study

An Empirical Analysis of Journal Policy Effectiveness for Computational Reproducibility

Reproducibility and Reporting Practices in COVID-19 Preprint Manuscripts

Impacts of Geographic Factors and Population Density on the COVID-19 Spreading Under the Lockdown Policies of China

Across United States Counties

Sample Selection Models in R: Package sampleSelection

'Avalanche' of Spider-Paper Retractions Shakes Behavioural-Ecology Community

Proliferation of Papers and Preprints During the Coronavirus Disease 2019 Pandemic: Progress or Problems with Peer Review?

tidycensus: Load US Census Boundary and Attribute Data as Tidyverse and Sf-Ready Data Frames

Epidemic Situation Using Multisource Spatio-Temporal Big Data

State-Level Variation of Initial COVID-19 Dynamics in the United States

ggplot2: Elegant Graphics for Data Analysis

forcats: Tools for Working with Categorical Variables (Factors)

tidyr: Tidy Messy Data

Welcome to the Tidyverse

dplyr: A Grammar of Data Manipulation

spatialprobit: Spatial Probit Models

Spatialprobit: Spatial Probit Models

Spreading of COVID-19: Density Matters

knitr: A Comprehensive Tool for Reproducible Research in R

Dynamic Documents with R and knitr

R Markdown: The Definitive Guide

R Markdown Cookbook

Econometric Computing with HC and HAC Covariance Matrix Estimators

Object-Oriented Computation of Sandwich Estimators

Various Versatile Variances: An Object-Oriented Implementation of Clustered Covariances in R

Construct Complex Table with 'kable' and Pipe Syntax