key: cord-0914702-saqa0z2v authors: Haring, R. S.; Trende, S.; Ramirez, C. title: Use of Environmental Variables to Predict SARS-CoV-2 Spread in the U.S. date: 2021-05-21 journal: nan DOI: 10.1101/2021.05.19.21257350 sha: 641381e7213fe2db3b8be3a5646bf6a87dc927fe doc_id: 914702 cord_uid: saqa0z2v

Background: The COVID-19 pandemic has challenged even the most robust public health systems worldwide, leaving state and local health departments, hospitals, and physicians with little guidance on planning and resource allocation. Efforts at predicting the virus spread have largely failed to capture the nuances presented by national and local geographic, environmental, and sociological variables.

Objective: Using county-level data from the United States, we sought to measure the extent to which these demographic, geographic, and environmental variables correlate with the spread of COVID-19.

Methods: Using demographic data from the US Census Bureau's American Community Survey, weather station data from the National Oceanic and Atmospheric Administration (NOAA), and COVID-19 case data from the Center for Systems Science and Engineering at Johns Hopkins University and the New York State Department of Health, we employed Bayesian hierarchical modeling with zero-inflated negative binomial regression to calculate correlations between these variables, COVID-19 case count, and rate of viral spread. Key predictors were identified and measured during two periods of two weeks each: March and June of 2020. The resultant model was then employed to predict case counts and spread rate for early July 2020.

Results: While demographic and environmental factors explain viral spread well, our findings challenge earlier conclusions about how these factors relate to viral progress. Using these factors alone, we were able to predict spread to within 1% in all but 8 counties (99.9%), and within 0.1% in all but 51 counties (98.4%).
The model was subsequently able to predict early July viral spread to within 0.5% in 98% of counties. Contrary to earlier findings, temperature had a variable effect: as Spring temperatures warmed, cases decreased, but Summer heat increased cases, likely reflecting movement of populations from indoors to outdoors and back in. States varied little in their case rate relative to the model, and much of the variation could be linked to known superspreader events.

Conclusion: While environmental and demographic variables can help predict COVID-19 spread rates, some relationships are variable in ways earlier research failed to identify.

COVID-19 has infected over 81 million people and killed over 1,700,000 [1]. Due to the rapid spread of the novel virus, governments and physicians have had to manage the clinical, societal, and economic ramifications with imperfect data. While certain aspects of the virus are well understood, little is known about the macro-environmental factors contributing to viral spread.

Early in the pandemic, researchers believed that warm weather could slow the spread of the virus [2]. This was buttressed by a few early studies suggesting that warm weather and high humidity slowed spread. Wang et al. examined the spread of COVID-19 in 100 Chinese cities and 1,005 U.S. counties, and found that temperature and relative humidity had a significant, negative effect on viral transmission [3]. They concluded that viral spread would likely slow in the summertime, although temperature alone would not be enough to reduce the R0 below one. Other work reached similar conclusions [4, 5].

In the US, outbreaks were, in fact, limited in northern states for much of the summer. States like Arizona, Florida, and Texas, however, saw spikes of the virus, seemingly confounding earlier expectations [6]. In addition, countries with hot summer climates such as Brazil, Mexico, and Peru saw the disease spread widely [7].
This summer wave has attracted less analysis than the first wave, despite the tension it creates with theories of viral spread in the earlier literature. To help better understand the relationship between COVID spread and environmental and demographic factors, we examine COVID-19 spread in the US during what we see as two distinct phases: the initial outbreak of the virus (mid-March through early April), and the second (mid-June through early July).

The two time periods of interest were selected based on a combination of factors. We chose the first time period, measured with what we call the "Spring Model," because it represented a time when COVID-19 had spread widely enough to create a reasonable amount of variance in the data, and because infections that presented during this time period were likely transmitted before or shortly after U.S. President Donald Trump declared a state of emergency [8] and as states were implementing a variety of non-pharmaceutical interventions [9, 10]. In other words, we selected a time period when viral spread in all states was accelerating. We also modeled spread in late June with our "Summer Model." We selected this time period because most states had begun to emerge from lockdowns by this point, reducing between-state variation [11]. As a sensitivity and robustness analysis, we used the Summer model to try to predict viral spread completely out-of-sample, during the month of July.

The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-ND 4.0 International license.

We took daily case counts from the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [12]. CSSE data do not break New York City down by borough, so we took these data from the New York State Department of Health COVID-19 Tracker [13].
We obtained county-level population estimates from the US Census Bureau's American Community Survey (ACS) for 2019 [14], and calculated population density by dividing these estimates by the land area of each county as reported by the US Census Bureau [15]. Data collected from cruise ships and US territories were excluded.

Humidity data were derived from weather station data from the National Oceanic and Atmospheric Administration (NOAA) [16]. We identified daily temperature and dew point data for each weather station for the months of interest and calculated the mean humidity for each month using the Magnus formula [17]. We then calculated the distance from each county centroid to each weather station, and data from the nearest weather station were assigned accordingly. County-level temperature data were not available on a day-to-day basis, and including such data in the model would have been difficult given the lag between infection and diagnosis [18]. Instead, we use the mean county temperature for the months of interest, under the assumption that post-transmission lag in presentation would be relatively normally distributed, and that cases transmitted in early- and mid-month would present around mid- and late-month, respectively [19].

We extracted latitude and longitude for the centroids of each county from CSSE data, and included these as covariates. Latitude was included to test hypotheses about the effects of sunlight on viral transmission. Longitude was included to control for the observation that the early outbreaks of the virus tended to occur on the U.S. coasts, and then spread inward [20]. To capture and quantify this phenomenon, we calculated the absolute distance of the longitude for the centroid of each county from the central meridian of the 48 contiguous states (-98.583 degrees).
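The covariate construction described above (relative humidity from temperature and dew point, nearest-station assignment, and distance from the central meridian) might be sketched as follows. This is only an illustration under stated assumptions: the Magnus coefficients are those of the Alduchov and Eskridge form, which the text cites but does not list; the great-circle (haversine) distance is an assumed choice, since the text does not say how centroid-to-station distance was computed; and all function names and dictionary keys are hypothetical.

```python
import math

# Magnus coefficients (Alduchov-Eskridge form). ASSUMPTION: the paper cites
# the "improved Magnus form" but does not state the values it used.
MAGNUS_A = 17.625
MAGNUS_B = 243.04    # degrees Celsius
MAGNUS_C = 6.1094    # hPa

CENTRAL_MERIDIAN = -98.583  # degrees, value given in the text


def saturation_vapor_pressure(temp_c):
    """Saturation vapor pressure (hPa) at temp_c in degrees Celsius."""
    return MAGNUS_C * math.exp(MAGNUS_A * temp_c / (MAGNUS_B + temp_c))


def relative_humidity(temp_c, dew_point_c):
    """Relative humidity (%) from air temperature and dew point."""
    return (100.0 * saturation_vapor_pressure(dew_point_c)
            / saturation_vapor_pressure(temp_c))


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (km); an assumed distance metric."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_station(county, stations):
    """Return the station closest to the county centroid."""
    return min(
        stations,
        key=lambda s: haversine_km(county["lat"], county["lon"],
                                   s["lat"], s["lon"]),
    )


def longitude_offset(lon):
    """Absolute distance (degrees) from the central meridian."""
    return abs(lon - CENTRAL_MERIDIAN)


# Hypothetical example values, not data from the study.
county = {"lat": 40.7, "lon": -74.0}
stations = [
    {"id": "A", "lat": 40.8, "lon": -73.9, "temp_c": 20.0, "dew_c": 10.0},
    {"id": "B", "lat": 34.0, "lon": -118.2, "temp_c": 25.0, "dew_c": 12.0},
]
station = nearest_station(county, stations)
county_rh = relative_humidity(station["temp_c"], station["dew_c"])
county_offset = longitude_offset(county["lon"])
```

Taking the ratio of saturation vapor pressures means the leading constant cancels, so only the two exponent coefficients affect the resulting humidity value.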
Our primary outcome was the change in cases in all 3,108 United States counties located within the contiguous 48 states from March 16, 2020 to April 1, 2020 (Spring model), and from June 16, 2020 to July 1, 2020 (Summer model). A handful of counties showed a decrease in cumulative cases over these periods; these were dropped, as they almost certainly reflect reporting errors or estimate revisions. Because we utilize a difference-in-differences estimator for our response, we were sparing in our control variables. Because New York City was hit hard early in the pandemic, we separated Metropolitan New York City into two variables: one including only the five-county New York City area, and another excluding this area but including the rest of the area making up the Metropolitan New York City Statistical Area. Finally, we controlled for relative humidity, following the approach of Wang et al. [3].

Because data are publicly available and no personally identifying characteristics are available in the data set, this study was exempt from Institutional Review Board approval.

We estimated factors driving the rate of viral spread using a Bayesian hierarchical model fit with the R-INLA package [21], which uses integrated nested Laplace approximations to fit Bayesian hierarchical models quickly [22]. Changes in case counts are necessarily non-negative (since cumulative infection numbers cannot decrease over time), so we utilized a generalized linear model with a logarithmic link [23]. To test for possible non-linearities in the data, population density was also entered in squared and cubed form.
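To illustrate how the cubic density terms act under a logarithmic link, the sketch below computes the multiplicative change in expected spread implied by the density terms alone, holding every other covariate fixed. It plugs in the Summer-period point estimates reported in the Results. The function names are hypothetical, the density units are not stated in the text, and the intercept and remaining coefficients are deliberately left out because they are not reported here, so this is a reading aid rather than a reproduction of the fitted model.

```python
import math

# Summer-period point estimates for the density terms, as reported in the
# Results. Under a log link they act on the log of the expected count.
B_DENSITY = 2.44e-04
B_DENSITY_SQ = -1.63e-08
B_DENSITY_CU = 1.03e-13


def density_terms(density):
    """Return the linear, squared, and cubed density covariates."""
    return density, density ** 2, density ** 3


def density_multiplier(density):
    """Multiplier on expected spread relative to a density of zero.

    ASSUMPTION: all other covariates are held fixed, and density is in the
    same (unstated) units as the fitted model.
    """
    d, d2, d3 = density_terms(density)
    linear_predictor = B_DENSITY * d + B_DENSITY_SQ * d2 + B_DENSITY_CU * d3
    return math.exp(linear_predictor)  # inverse of the log link


# Hypothetical density values for illustration only.
for d in (0, 100, 1000, 5000):
    print(d, round(density_multiplier(d), 3))
```

Because the link is logarithmic, each coefficient shifts the log of the expected count, which is why the linear term alone produces an exponential rather than additive effect on spread.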
Zeroes dominate the distribution of case spread, accounting for about one-third of the data, possibly due to limited testing or undetected cases (Figure 1). To account for this, we employed a zero-inflated model, which estimates an additional parameter to account for the possibility that a zero represents a "false negative." To better conceptualize this approach, one might imagine that the number of fish caught by people fishing in a park follows a Poisson distribution. If one estimates a simple Poisson model by counting the number of fish caught by everyone exiting the park, one may find an excessive number of zeroes because some people exiting won't have fished at all. The zero number of fish caught, in this sense, is false. In other words, there is actually a two-step process involved: a process for determining whether someone opts to fish, and a process for estimating the number of fish caught by those who do fish. The zero-inflated model therefore first estimates the probability that a given zero is false. It then estimates the remaining terms using a different distribution. Due to overdispersion, we used a (zero-inflated) negative binomial regression model. All calculations were performed using the R Statistical Package, version 3.6.2 [24].

Population density showed an exponential positive effect on spread (4.85e-04, 95% CI: 3.927e-04, 5.842e-04). Rates of viral spread not accounted for in our model, which may include local and regional factors such as shutdown orders and the availability of testing in a given area, are illustrated in Fig. 2. As suggested by the high accuracy of prediction at the state level, overall fit for the Spring model was quite good.
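The two-step logic of the zero-inflated negative binomial model described in the Methods can be written out as a probability mass function. The sketch below assumes the standard mixture form, in which a zero arises either from the "false zero" process with probability `pi_zero` or from the negative binomial count process. The text does not state which zero-inflation variant was fit, and the names `nb_pmf`, `zinb_pmf`, `mu`, `size`, and `pi_zero` are hypothetical.

```python
import math


def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0.

    Variance is mu + mu**2 / size, so smaller size means more overdispersion.
    """
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated negative binomial pmf (standard mixture form, assumed).

    pi_zero is the probability that an observation is a structural
    ("false") zero; otherwise the count follows the negative binomial.
    """
    count_part = (1.0 - pi_zero) * nb_pmf(k, mu, size)
    if k == 0:
        return pi_zero + count_part
    return count_part


# Hypothetical parameter values for illustration only.
p_zero_plain = nb_pmf(0, mu=5.0, size=2.0)
p_zero_inflated = zinb_pmf(0, mu=5.0, size=2.0, pi_zero=0.3)
```

In the fishing analogy, `pi_zero` plays the role of the share of park visitors who never fished, and the negative binomial part describes the catch of those who did.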
Examining the residuals, half of all counties were predicted within a single case, 83% of counties were predicted within 10 cases, and 97.2% of counties were predicted within 100 cases. Of course, there is a substantial difference between predicting within 100 cases in a county of 1,000 people, and predicting within 100 cases in a county of 1,000,000. To account for this, we divided the residuals by county population and examined the rate of spread directly. The Spring model predicted the rate of spread in counties well: within a percentage point in all but 8 counties (99.9%) and within a tenth of a point in all but 51 counties (98.4%).

Population density during the Summer period remained strongly positively correlated with the spread of the virus (2.44e-04, 95% CI: 1.69e-04, 3.23e-04), and the coefficients for density-squared (-1.63e-08, 95% CI: -1.63e-08, -5.87e-09) and density-cubed (1.03e-13, 95% CI: 4.92e-14, 1.67e-13) are similar to those we observed during the earlier period. This continuity from March to June is striking and gives us increased confidence that population density is a significant factor in the spread of COVID-19.
Relative humidity (-0.012, 95% CI: -0.016, -0.008) during the Summer period was negatively signed after controlling for other variables, which is consistent with early published reports [3]. The coefficient for temperature, however, shifted substantially, becoming strongly positively associated with viral spread (0.019, 95% CI: 0.005, 0.032).

Estimates of viral spread unaccounted for in our Summer model are similar to those found in the Spring model. The scale of the outbreak had grown significantly by Summer (or testing availability had improved enough to better reveal the true scale of the outbreak), so residuals are larger in absolute terms. Substantial changes in states' observed-over-expected rates of spread are relatively rare. Colorado and New York moved from having greater than expected rates in the Spring to lower than expected in the Summer, while North Carolina and Oregon moved in the opposite direction.

In the Summer model, a near-majority (48%) of counties were predicted within 10 cases, 80.5% were predicted within 50 cases, and 88% were predicted within 100 cases. When residuals are scaled by population, the rate of infection spread is estimated within a half-percentage point in 98.2% of counties, and within two-tenths of a percentage point in 92.1% of counties.

For model testing, we used the Summer model to attempt to predict viral spread for the first half of July, from July 2 to July 15. We recalculated the relative humidity data for that time period, and used the average county temperature for July. Applied to this future period, the Summer (June) model predicted 52.6% of counties within 15 cases, 75.8% of counties within 50 cases, and 84.9% of counties within 100 cases.
Put in terms of rate, it predicted 98% of counties within a half point and 87.2% of counties within two-tenths of a percentage point.

Population density was the strongest predictor of viral spread in both Spring and Summer of 2020. Consistent with the findings of earlier studies, temperature was negatively correlated with viral spread in the Spring. In the Summer, however, temperature was positively correlated with viral spread. Given that viral spread is more common in indoor spaces, this shift may reflect the differing influence of extreme temperatures on behavior [25]. In very cold weather, such as we see in northern states in March, people choose to congregate indoors, while in very hot weather, such as we see in southern states in June, people retreat to heavily air-conditioned spaces where close contact can facilitate viral spread; the effect of these patterns on behaviors affecting infectious and non-infectious morbidity and mortality has been documented extensively in the literature [26, 27, 28, 29, 30, 31]. Viewed in this light, the temperature variable is not causal, but rather drives behaviors that are causal.

We see some evidence of this in the residuals as well. Outliers tended to occur in counties where the popular press reported so-called "super-spreader" incidents involving indoor activities. In the Spring model, … [32], and a large funeral marked by close physical contact [33]. As with the first model, the Summer model suggests a well-predicted epidemic with stochastic disease "outbreaks."
As in the Spring model, many (though not all) of these counties have documented causes behind their particularly large outbreaks, which are in turn largely consistent with evidence supporting spread in poorly ventilated indoor spaces.

These models incorporating geographic, demographic, and environmental covariates were surprisingly accurate in their prediction of county-level rates of COVID-19 spread during two periods of 2020, despite not incorporating any variables reflecting state and regional non-pharmaceutical interventions such as "safer at home" orders, mask mandates, or business shutdowns. We would, however, caution strongly against inferring that these interventions had no effect on spread. First, the state-level random effects and inclusion of a zero-inflation term act, to some extent, as controls for statewide policies and testing availability. About half of the states evince significant effects, suggesting some meaningful variation among states which may be related to the availability of tests and particular policy interventions. Also, very little data exist on the extent to which individuals or populations conform to these orders, or the extent to which variations in rates of conformity might impact regional observations in COVID-related morbidity and mortality.

This model, in part, explains disparities in the rate of spread within the US based on environmental factors. As expected, density is related to viral spread. Temperature is more complex but consistent with the hypothesis that weather driving people indoors may facilitate spread. We believe these results may help guide policy and future research in two ways. First, these results seem to be consistent with arguments that indoor activity is a main vector for spread. As people are less willing to hold family gatherings or to dine outdoors due to temperature extremes, the virus may transmit more rapidly.
Second, we are forced to observe that viral spread is well explained by variables reflecting geographic, demographic, and meteorologic factors alone, without reference to popular interventions aimed at reducing transmission.

Our findings are important because they assist policymakers in predicting where future viral outbreaks are likely to occur. Although the implementation of effective vaccines seems likely to slow the spread of the virus, outbreaks seem likely to continue to occur in the near future until the virus can be contained. We also believe this serves as a warning. Winter is coming, as they say: conditions that drive people indoors can bring factors associated with increased spread, perhaps giving rise to a second wave in the late fall and winter. Future research is warranted to assess the additional impact of targeted interventions as well as super-spreading events and behavioral variation.

An interactive web-based dashboard to track COVID-19 in real time
Will Warm Weather Slow Spread of Novel Coronavirus?
High Temperature and High Humidity Reduce the Transmission of COVID-19
Spread of SARS-CoV-2 Coronavirus likely to be constrained by climate
Potential impact of seasonal forcing on a SARS-CoV-2 pandemic
Hospitalizations up in Arizona, Texas, Florida as US nears 3 million COVID-19 cases, CIDRAP, Library Catalog: www.cidrap.umn
Virus Gains Steam Across Latin America
Proclamation on Declaring a National Emergency Concerning the Novel Coronavirus Disease (COVID-19) Outbreak, Library Catalog: www.whitehouse
Coronavirus is shutting down American life as states try to battle outbreak
Where states reopened and cases spiked after the U.S. shutdown, Library Catalog: www.washingtonpost
An interactive web-based dashboard to track COVID-19 in real time
NYSDOHCOVID-19Tracker-Map?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n
Topologically Integrated Geographic Encoding and Referencing [TIGER] database
Land-based datasets and products, Tech. rep., National Oceanic and Atmospheric Administration
Improved Magnus Form Approximation of Saturation Vapor Pressure
Coronavirus Disease 2019 Case Surveillance - United States
Coast-to-Coast Spread of SARS-CoV-2 during the Early Epidemic in the United States
Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations
Generalized Linear Models
A Language and Environment for Statistical Computing
Effects of ventilation on the indoor spread of COVID-19
The effect of maximum daily temperature on outdoor violence
Interaction of temperature, illuminance and apparent time on sedentary work fatigue
Associations between daily ambient temperature and sedentary time among children 4-6 years old in Mexico City
Linking outdoor air temperature and SARS-CoV-2 transmission in the US using a two parameter transmission model, medRxiv
Absolute humidity, temperature, and influenza mortality: 30 years of county-level evidence from the United States
How an Austrian ski resort helped coronavirus spread across Europe, Library Catalog: edition.cnn
CDC: Mardi Gras quickened spread of coronavirus in Louisiana; canceling was never recommended, Library Catalog: www.nola
Days After a Funeral in a Georgia Town, Coronavirus 'Hit Like a Bomb'