key: cord-0897341-8a1bdqoh authors: Olmo, Jose; Sanso‐Navarro, Marcos title: Modeling the spread of COVID‐19 in New York City date: 2021-06-28 journal: Pap Reg Sci DOI: 10.1111/pirs.12615 sha: 4c1e8fa9df7919e682a0bf1aa795a982f148b1b2 doc_id: 897341 cord_uid: 8a1bdqoh This paper proposes an ensemble predictor for the weekly increase in the number of confirmed COVID‐19 cases in the city of New York at zip code level. Within a Bayesian model averaging framework, the baseline is a Poisson regression for count data. The set of covariates includes autoregressive terms, spatial effects, and demographic and socioeconomic variables. Our results for the second wave of the coronavirus pandemic show that these regressors are more significant to predict the number of new confirmed cases as the pandemic unfolds. Both pointwise and interval forecasts exhibit strong predictive ability in‐sample and out‐of‐sample. confirmed cases of COVID-19 have been reported to the World Health Organization (WHO), including 2,517,964 deaths. The coronavirus pandemic has spread worldwide affecting, to a greater or lesser extent, most countries. It has hit hardest big cities such as Delhi, London, Madrid, Melbourne, New York, Paris or Rio de Janeiro. These cities are characterized not only by having a large number of inhabitants but also by being highly densely populated, owning a complex network of public transport, and accommodating large concentrations of workers in the services sector: education, healthcare, entertainment, retail, and finance. Individuals also suffer long commuting times to work, staggering income inequalities, and heterogenous living standards across districts, see Cheshire et al. (2014) and Nijman and Wei (2020) . Urban agglomerations are also known to be very heterogeneous with respect to the composition of the population, including individuals from different races and cultural backgrounds (Shertzer & Walsh, 2019; Wei et al. 2018 ). One of the cities that has been hit hardest by the coronavirus pandemic is New York (NYC). The impact of in this city during the first wave that took place in the spring of 2020 has been widely studied in the literature, see Almagro et al. (2021) , Almagro and Orane-Hutchinson (2020) , Borjas (2020) , Glaeser et al. (2021) , and Schmitt-Grohé et al. (2020) , among others. These important contributions focus on different aspects of the effects and transmission of the disease for the population of NYC, and generally consider data at zip code level. More specifically, Almagro and Orane-Hutchinson (2020) explore different channels to explain the disparities in COVID-19 incidence across NYC neighborhoods. These authors estimate several linear regression models to assess the statistical relevance of variables reflecting neighborhood characteristics and occupations, finding that the latter are important for explaining observed incidence patterns. Their results show that those occupations with a higher degree of human interaction are more likely to be exposed to the virus. A second contribution of Almagro and Orane-Hutchinson (2020) is to suggest a selection on testing, whereby those residents in worse conditions are more likely to get tested, with such selection decreasing over time as tests become more widely available. Borjas (2020) merges information on the number of tests and the number of infections at the zip code level with demographic and socioeconomic information from the decennial census and the American Community Survey. This author finds that people residing in poor or immigrant neighborhoods were less likely to be tested; but the likelihood that a test was positive was higher in these areas, as well as in those with larger households or predominantly black populations. The dependent variable in this study is the rate of infection in the population, which depends on both the frequency of tests and on the fraction of positive tests among those tested. One important contribution of Borjas (2020) is to show that the non-randomness in testing across NYC neighborhoods partly invalidates standard statistical inferences between the rate of infection and the socioeconomic characteristics. Schmitt-Grohé et al. (2020) investigate access to COVID-19 testing across incomes using zip code level data on the number of tests, test results, and income per capita. These authors find that the distribution of tests across income levels is significantly more egalitarian than the distribution of income itself. Glaeser et al. (2020) analyze the efficiency of mobility restrictions in limiting COVID-19 spread. Using zip code data for five U.S. cities, including New York, these authors find that total cases per capita decrease by 19% for every ten percentage point fall in mobility. Using panel data for NYC with week and zip code fixed effects, the decline increases to 30%. This paper aims to extend the studies described above in several directions. First, we go beyond the spring of 2020, and analyze data on the number of COVID-19 infections in NYC at zip code level from 1 September 2020 to 2 February 2021. Second, and most importantly, we focus on the prediction of the number of weekly new confirmed cases of the disease. Third, we include temporal and spatial effects in our set of explanatory variables. Temporal effects are captured by an autoregressive lag of the response variable and the lagged incidence rate, and spatial effects are accommodated by including their averages in contiguous neighborhoods. Fourth, we apply Bayesian model averaging (BMA) techniques using a generalized linear regression for count data as benchmark. The implementation of this methodology allows us to derive the posterior distribution of the parameters associated to the covariates. Thus, we shed light on the sensitivity of the increase in confirmed cases to demographic and socioeconomic factors, as well as autoregressive and spatial terms, under a potentially large number of specifications of the regression model. Fifth, we provide pointwise and interval forecasts for each week of the evaluation period and across NYC neighborhoods. By doing so, we model the uncertainty about our pointwise predictions. Finally, by fitting our empirical framework to cross-sectional weekly data, we are able to accommodate the presence of a time-varying intensity rate of the disease that is reflected in changes in the slope parameters. These dynamics can be associated to policy-induced changes related to social distancing and the effect of the vaccine, among a few others, see Fern andez-Villaverde and Jones (2020). The motivation of this paper is twofold. On the one hand, we acknowledge that standard linear models are not suitable for predicting the number of confirmed cases of the COVID-19 disease as it is a count variable. To correct for this feature, we propose econometric frameworks that are more suitable for count data. In particular, our benchmark is a Poisson regression model with intensity parameter given by a linear function of a large set of demographic and socioeconomic variables, as well as temporal and spatial effects. On the other hand, we are not aware of econometric models for prediction of the COVID-19 disease beyond Li and Linton (2021) , Liu et al. (2021) and Fern andez-Villaverde and Jones (2020). Li and Linton (2021) fit a time series model based on a quadratic trend specification to country level data. Liu et al. (2021) use a panel data model to generate density forecasts for daily COVID-19 infections for a sample of countries/regions. In this setup, the growth rate of active infections can be represented by autoregressive fluctuations around a downward sloping deterministic trend function with a break. Although Liu et al. (2021) also adopt a Bayesian approach, their methodology and objectives are very different from those in the present paper. Importantly, our model accommodates a large set of covariates, whereas these authors build on a stochastic version of the susceptible-infected-recovered (SIR) model of Kermack and McKendrick (1927 , 1932 , 1933 that incorporates dynamics in the model parameters and a structural break. Another recent important contribution focused on prediction that extends the SIR formulation is Fern andez-Villaverde and Jones (2020). These authors apply a standard epidemiological model to estimate and simulate the expansion of the coronavirus disease with data on deaths from several cities and countries around the world. Two main conclusions can be drawn from our empirical analysis. First, the proposed set of covariates are more significant to predict the number of new COVID-19 confirmed cases in NYC as the pandemic unfolds. Second, the factors with predictive power are different from those in the first wave. In contrast to income, household size, and occupational variables, we find that the persistence of the incidence rate, spatial effects, population size, and age-related variables gain traction in this period compared to demographic and socioeconomic covariates. That is, the dynamic component of the proposed Poisson regression model is more relevant for prediction purposes than the static component given by the exogenous regressors reflecting neighborhood characteristics. The predictive performance of these models is assessed in-sample and out-of-sample. Both Poisson regression and BMA frameworks report low mean square prediction errors from one-period-ahead pointwise forecasts. Interestingly, the baseline Poisson regression model reports very good fit in-sample and out-of-sample, and its performance is comparable to the Bayesian ensemble predictor. In order to capture the uncertainty associated to the pointwise predictions, we also construct interval forecasts. The reliability and accuracy of these predictions is evaluated by computing empirical coverage probabilities at 90%. The information content of the intervals varies across periods but, in general, they are significantly narrower than during the first wave of the coronavirus pandemic, see Olmo and Sanso-Navarro (2020). The autoregressive factors and the spatial effects seem to carry most of the predictive content compared to the spring of 2020. To illustrate further the reliability of our predictions, we use choropleth maps showing that our framework is able to capture the dynamics in the evolution of the pandemic and to predict the heterogeneity of new confirmed COVID-19 cases across neighborhoods on a weekly basis. The present paper is also related to the recent epidemiological studies by Cordes and Castro (2020) and DiMaggio et al. (2020) . The first authors, in a descriptive analysis using zip code level data for NYC, identify areas with low access to testing and high case burden. Cordes and Castro (2020) analyze testing rates, positivity rates, and the proportion of positive tests, and relate them to socioeconomic variables. They find that clusters with less testing and low proportion of positive tests are characterized by higher income, education, and a dominant presence of white population. On the contrary, clusters with higher testing rates and shares of positive tests were disproportionately black and without health insurance. Simple correlation measures show inverse associations of the proportion of positive tests with white race, education, and income, and positive links with black race, Hispanic ethnicity, and poverty. DiMaggio et al. (2020) also claim that there is increasing evidence that, in addition to individual clinical factors, demographic, socioeconomic and racial characteristics of COVID-19 infections play an important role. These authors analyze positive testing results counts within NYC neighborhoods with Bayesian hierarchical Poisson spatial models using integrated nested Laplace approximations. This study shows that spatial clustering accounts for approximately 32% of the variation in the data, with hot spots in all five boroughs. However, the strongest univariate association with positive testing rates is a clinical factor given by the proportion of residents with chronic obstructive pulmonary disease. 1 The rest of the paper is structured as follows. Section 2 discusses the dataset used for the study and describes the set of covariates. Section 3 introduces the forecast models, briefly reviewing Poisson regressions and BMA techniques for prediction. Section 4 shows the application of these methods to data from NYC at the zip code level. Suitable choices of variables with power to explain the cross-sectional increase in the number of confirmed COVID-19 cases are discussed, and a forecast evaluation exercise comparing the performance of several competing models is presented. Section 5 concludes. Tables and figures are collected in an appendix. A separate online appendix presents the results of the full analysis for all the weeks over the evaluation period. The NYC Department of Health and Mental Hygiene (DOH) compiles, since 31 March 2020, the cumulative count of confirmed COVID-19 cases at the zip code level. 2 Although this information is available on a daily basis, and in line with most of the related studies discussed in the introductory section, the main variable of interest is the number of new confirmed COVID-19 cases reported on a given week. Figure This study exploits the information compiled during the second wave of the coronavirus pandemic in NYC, considering that it began in September 2020. Table 1 shows descriptive statistics of weekly new confirmed COVID-19 cases in modified zip code tabulation areas (ZCTAs) on a 3-week basis. In what follows, and for the sake of clarity, we are reporting the results using this frequency. The complete tables with the full set of results are shown in the online appendix. The figures reported in Table 1 reflect a steady increase in the cross-sectional mean of new cases ranging from 11 in the second week of September 2020 to roughly 189 cases in the second week of January 2021. The dispersion in the number of cases has also increased during the period analyzed. The information of the demographic and socioeconomic characteristics of the neighborhoods have been extracted from the American Community Survey (ACS). This data correspond to 5-year estimates for the period from 2014 to 2018. Table 2 describes the variables that have been used in our study. A first group of regressors reflects the importance of demographic factors in explaining the cross-sectional variation of the increase in confirmed COVID-19 cases at the zip code level, see Borjas (2020) and Schmitt-Grohé et al. (2020) . 1 Another related literature is concerned with the actual number of infections, suggesting that confirmed COVID-19 cases may be only the tip of the iceberg. Hortacscu et al. (2021) and Manski and Molinari (2021) have developed approaches to estimating (or at least bounding) the number of unreported cases. Although we do not pursue this strategy in the present study, it is worth noting that our prediction models might be applied in this setting under appropriate inference methods to obtain insights into the extent of the actual rate of infections in a given geographical area. 2 https://github.com/nychealth/coronavirus-data/ These variables are total population, its percentage of males, and the average household size. The shares of Blacks and African Americans, and of Hispanics or Latinos are intended to capture the racial composition of the population. The age structure is measured using the median age, the percentage of population 64 years and over, and the schooling enrollment rate. Economic conditions have been proxied by income per capita, the employment rate, the percentage of households with self-employment income, and the share of full-time workers. As pointed out by Harris (2020) and Hamidi and Hamidi (2021) , public transport may have contributed to the spread of the coronavirus pandemic in NYC. We try to capture this influence by including as regressors the percentage of workers 16 years and over, the use of public transportation, excluding taxicab, and average travel time to work. Almagro and Orane-Hutchinson (2020) find that occupations are important in explaining the differential incidence of COVID-19 across NYC neighborhoods. For this reason, we have also considered the share of workers in sectors highly exposed to the virus as explanatory variables: retail trade; transportation, warehousing, and utilities; educational services, health care and social assistance; arts, entertainment, and recreation; and accommodation and food services. Living conditions have been reflected using the percentage of housing units that are occupied, the median number of rooms, the percentage of households with an Internet subscription, and the share of population lacking health insurance coverage. The persistence in confirmed COVID-19 cases is captured by two different variables: the number of new weekly confirmed cases and the incidence rate (ratio of confirmed cases per 100,000 inhabitants), both being lagged one period. These regressors are aimed at controlling for self-reinforcing effects of the pandemic above and beyond the explanatory power of our set of demographic and socioeconomic variables. Spatial, neighboring, effects are also introduced into the model by including two additional regressors constructed as the weighted average of lagged new confirmed cases and of lagged incidence rates across contiguous locations. These spatial lags have been calculated using a row-standardized contiguity spatial weights matrix, 3 constructed using the geographic information for modified ZCTAs provided by the U.S. Census Bureau. This section outlines the model used to describe the weekly evolution of the coronavirus pandemic in NYC neighborhoods. Standard regression models estimated using ordinary least squares (OLS) are not suitable in this context because they assume that the response variable is continuous. To apply an OLS estimation framework, the recent extant literature on the topic considers the incidence rate as the dependent variable, see Almagro and Orane-Hutchinson (2020) and Borjas (2020) for linear regression and a group logit model, respectively. As an alternative, we directly model the number of weekly new confirmed cases. Therefore, as our response variable is count data, we consider as benchmark a generalized linear model (GLM) given by a Poisson regression. In order to accommodate a large set of potential covariates, we embed this specification within a BMA approach. It is characterized by an ensemble of predictors obtained from a battery of different Poisson regression models including different combinations of the covariates. This ensemble approach is known to reduce the forecast error and, more importantly, account for model uncertainty. In doing so, we are able to obtain posterior inclusion probabilities (PIPs) for each regressor, as well as posterior distributions for both the model parameters and the response variable. The evolution of the number of new COVID-19 cases in NYC at the zip code level is modeled as a Poisson regression, see McCullagh and Nelder (1989) . This estimation framework allows us to use a count variable as response variable and model, directly, the increase in the number of confirmed cases per period. This variable is assumed to follow a Poisson distribution PðλÞ, with λ denoting the intensity parameter. This function is convenient for modeling the expected number of arrivals of a random event for a fixed period of time, and is closely related to the exponential distribution that would model the interarrival times. In this setting, we assume that the number of new confirmed COVID-19 cases is the number of arrivals to the neighborhood with confirmed cases from a given population, and λ is the average number of arrivals. Our predictions will be based on the estimates of this coefficient. The Poisson regression model allows the intensity parameter to evolve over time and to depend on a set of m covariates. Mathematically, the number of confirmed cases per period is a Poisson random variable y it , with where N is the number of neighborhoods, and t ¼ 1, …, T, with T the number of periods. The expected value of this random variable is λ it , which we consider to depend on an intercept α t and three types of regressors. The first one is an autoregressive dynamic component given by the lagged values of the number of new confirmed cases (y it À 1 ) and of the incidence rate (z it À 1 ). The second type of covariate contains a linear combination of p exogenous demographic and socioeconomic variables fx ij g p j¼1 . These regressors are assumed to be constant over time because neighborhood characteristics are persistent variables that do not significantly change in the short run. As explained in the previous section, this information corresponds to the observational period 2014-2018. The third type of covariate tries to capture spatial effects, and is defined using a contiguous adjacency matrix W that assigns ones to those zip codes that are contiguous and zeros, otherwise. For consistency with the spatial econometrics literature, we row-standardize the matrix by normalizing the rows to sum to one. Therefore, the spatial lag of a given variable is its average value in contiguous neighborhoods. The model is as follows: with γ t and ρ t are the autoregressive parameters associated to the dynamic component; θ jt are the parameters corresponding to each regressor j ¼ 1, …, p;ρ t andγ t are the parameters capturing the spatial effects; w i is a row vector of the row-standardized contiguity spatial weights matrix W; z tÀ1 ¼ ðz 1,tÀ1 , …,z N,tÀ1 Þ 0 , and y tÀ1 ¼ ðy 1,tÀ1 , …, y N,tÀ1 Þ 0 . The econometric specification of the spatial effects in model (1) is reminiscent of spatial autoregressive (SAR) model, see Anselin (2003) ; however, in contrast to this model, we avoid the presence of endogeneity by considering the spatial terms to be lagged one period. This specification is plausible in our dynamic context and convenient for predictive purposes. Model averaging techniquesavailable both in frequentist and Bayesian contextsconsist of estimating all candidate models and then computing a weighted average of their estimates, taking into account the implicit uncertainty conditional on a given model and across different models. This approach was originally developed for linear regression models, see Raftery (1995) and Raftery et al. (1997) , and subsequently extended to more general frame- works. An early contribution to BMA with GLMs is Raftery (1996) , who proposes to use approximations for the Bayes factors, based on the Laplace method for integrals. This author also suggests a way to elicit reasonable, but data-dependent, proper priors. In general, BMA assigns a prior probability to a set of models, and to a set of parameters associated to each model. These prior distributions are updated once we incorporate the information obtained from the data to obtain the posterior distributions. Therefore, BMA is able to deal simultaneously with model selection, estimation, and inference. There are K ¼ 2 m models comprised by all possible combinations of the m covariates in the Poisson regression (1). Each model is denoted as M k , with k ¼ 1,…, K, and depends on a vector of parameters β k with a conditional posterior probability given by: where fðyjβ k , M k Þ and gðβ k jM k Þ denote, respectively, the likelihood function of the observations, and the prior density function of the parameters, conditional on model M k . The literature on BMA models has studied extensively the suitability of prior density functions. Reviews of the most popular options, both in linear and GLM frameworks, can be found in Forte et al. (2018) , Li and Clyde (2018) , and Steel (2020). For a given prior model probability P(M k ), its posterior probability can be calculated applying Bayes' rule: with f(y j M k ) and f(y) the conditional and marginal likelihood functions, respectively. Leamer (1978) assumes that the true parameters β associated to the regression variables are a function of β k , from a specific model M k . This author proposes to obtain the posterior density function of β, conditional on the data fy i g N i¼1 and model candidates fM k g K k¼1 , using the law of total probability: When m is moderate to large, posterior probabilities of individual models can be very small and their interpretation loses appeal. In such situations, inclusion probabilities are very useful. Formally, the PIP for variable x j , with j ¼ 1, …, m, is: where d j ¼ 1 if variable j is included in the model, zero otherwise. Using these models we can obtain the predictive distribution of the quantity of interest. For the response variable y, this distribution, denoted asŷ, is computed as: where the quantity in square brackets is the predictive distribution given M k , obtained using the posterior density function defined in (2). This function is integrated over the space of parameters β k B k , for each k ¼ 1,…, K. In practice, the BMA approach can be simplified using Monte-Carlo simulation methods or algorithms to rule out models with low probability of generating the observations such as the Occam's window, see Madigan and Raftery (1994) and Raftery et al. (1997) , among others. This section presents the findings of our empirical study for weekly data on the number of new confirmed COVID-19 cases in NYC. We divide the analysis into two exercises. First, we discuss model selection and focus on the relevance of the set of covariates presented in Section 2 to explain the differential evolution of the coronavirus pandemic across neighborhoods. Second, we assess the predictive power of these variables both in-sample and outof-sample. Table 3 reports the estimation results of the Poisson regression model (1) for selected weeks in our sample period separately. 4 The variables that exhibit statistical significance across most periods are the dynamic terms given by increase in the number of confirmed cases and the incidence rate in the previous week. Both variables display a positive coefficient and gain statistical significance as the number of cases accelerates through the wave. The spatial terms are also significant in most periods, highlighting the influence of neighboring ZCTAs in the evolution of the pandemic. As can be observed in Table A1 of the online appendix, this effect is particularly important in the last two weeks of the sample period analyzed when the pandemic gains traction. This finding also suggests that the presence of positive spillovers in the spread of COVID-19 across neighborhoods helps to predict the evolution of the disease in NYC. 4 As noted before, we report the results on a 3-week basis for the sake of clarity and due to space constraints. The full set of results for the period from 1 September 2020 to 2 February 2021 are included in the online appendix. Note: The number of observations is 177. Standard errors in parentheses. *** p < 0.01, ** p < 0.05, * p < 0.10. Population size is strongly significant and displays a positive relationship with the response variable. This result is not surprising because total population can be considered to proxy for other fundamental factors reflecting living standards, population density, and deprivation levels. In contrast to the findings reported for the first wave (Almagro et al. 2021; Olmo & Sanso-Navarro, 2020) , household size is negatively related to the number of new confirmed cases, but the magnitude of the parameter decreases over time. It is worth noting that contagion at the household level may be captured by the median number of rooms. The percentage of male population is hardly significant during our sample period. Interestingly, and in clear contrast to the results obtained for the spring of 2020 (Choi, 2020) , the racial composition of the neighborhoods does not have a prominent role in explaining differences in the number of new confirmed cases across zip codes during the second wave. Age-related variables have a strong predictive power in this model specification. The median age of the population is negatively associated to the number of weekly COVID-19 cases. Nonetheless, those neighborhoods with a large proportion of people over 64 years old tend to exhibit a higher number of new confirmed cases. Whereas age may be a proxy for other socioeconomic variables such as income, the share of the elderly reflects the strong influence of the coronavirus pandemic on this age group. During the first wave, income and the number of cases had a negative relationship that was highly significant, see Olmo and Sanso-Navarro (2020). However, income per capita is hardly significant in our sample period, suggesting that the number of infections is not very related to the income of the neighborhood. Similarly, and in contrast to existing findings for the spring of 2020, occupational variables lose most of their significance. The main exceptions are the self-employment rate and some variables reflecting the sectoral composition of employment. In particular, those measuring the importance of the transport, education and health, and entertainment sectors. In all these three cases the effects are quite significant, reflecting that those neighborhoods with a large fraction of individuals working on these sectors are more affected by the disease. Access to Intern et also has an effect during the last weeks of the study that, as expected, is negative. Other variables such as average commuting time, and having a full-time job lose any statistical significance during this second wave of the coronavirus pandemic. The estimation results of the Poisson regressions displayed in Table 3 may vary if only a subset of the covariates is included. For this reason, and in order to control for model uncertainty, the same analysis has been carried out adopting a BMA approach. 5 That is to say, we do not impose any specification a priori and let data speak. The sampling method combines a random walk Metropolis-Hastings algorithm with a random swap between included and excluded variables, see Raftery et al. (1997) and Clyde et al. (2011) , with one million iterations. We have specified 6 a uniform prior over the model space, and a benchmark priorwhich displays a good performance for both estimation and prediction purposesfor model-specific parameters (Li & Clyde, 2018) . In addition to the implementation of BMA, we use the posterior means under model selection. This corresponds to a decision rule that combines estimation and selection. In particular, we consider the highest probability (HPM) and the best prediction (BPM) models. From a Bayesian decision theory perspective, the latter is the model that is closest to BMA predictions under squared error loss. Liang et al. (2008) discuss these alternative Bayesian methods in some detail. Table 4 reports the PIP (5) for each covariate in selected periods. This analysis complements the preceding one on the statistical significance of the regressors based on t-tests. In this case, the relevance of the covariates in the ensemble predictor relies on the probability that they are included in the model specification. The discussion of the results can be separated by the type of covariate. The inclusion probabilities associated to the temporal lag of the response variable are very low in most periods, suggesting that this term does not contain much predictive power. In contrast, the lagged incidence rate receives a PIP close to one in most periods. The variables that capture the spatial effects are very significant, what corroborates their importance for prediction purposes. The racial composition and age-related variables are other important factors that are picked up by the ensemble predictor. On the contrary, variables related to labor market conditions are hardly relevant in this setting. The predictive performance of the variables reflecting the sectoral composition of employment is mixed. During the first weeks of our sample period these regressors receive very low PIPs, however, during the last weeks, most of these variables achieve very high values of these probabilities. This is particularly the case of the percentages of workers in the transport, education and health, and entertainment sectors. Two main conclusions can be drawn from this analysis. On the one hand, the inclusion probabilities increase over time as the pandemic unfolds, implying that the covariates are more significant to predict the number of new confirmed cases at the end of the sample period. On the other hand, the factors with predictive power are different from those in the first wave when the self-employment rate and income per capita were important covariates. During the second wave of the coronavirus pandemic these variables lose their predictive power. Instead, we find that the persistence of the incidence rate, captured by its first lag, the spatial effects, population size, and age-related variables F I G U R E 2 Bayesian model averaging: Posterior density functions for the coefficients of selected regressors, 12 Jan -19 Jan gain traction compared to socioeconomic variables. That is, the dynamic component of the Poisson regression model is more relevant for prediction purposes than the static component given by the exogenous regressors. To gain a better insight into the nature of the relationship between the predictors and the response variable with the BMA model, we plot in Figure 2 the posterior density functions of the model parameters (4) associated to a selection of regressors, according to their high PIPs during the sample period analyzed, for the second week of January 2021, when the peak of the second wave was reached. The density functions, from left to right and then going down, correspond to the lagged incidence rate, its average over contiguous neighborhoods, total population, the racial composition (black and hispanic), median age, the share of the elderly, and the percentages of workers in the transport and entertainment sectors. The plots are very informative about the sign of the relationship between these variables and the increase in the number of confirmed COVID-19 cases. The density functions also show the presence of uncertainty about the parameters associated to each regressor. The probability that the covariates are not correlated to the response variable, characterized by β ¼ 0, is represented by the height of a vertical line. These density functions reinforce the previous findings: the incidence rate, the spatial effects of contiguous neighborhoods, and population size are strongly significant predictors in the BMA specification. In contrast, the other variables are far less relevant. This subsection carries out an exercise to assess the predictive performance of the models implemented above. We The loss function that is used to evaluate the in-sample predictions of the models is the root mean square error, , with y it the realized number of new confirmed cases in each neighborhood i ¼ 1,…, N for a given time period t ¼ 1,…, T; andŷ it the associated predictions obtained from the different methods. Similarly, the performance of the out-of-sample predictions is assessed with the root mean square prediction error, obtained as , with y i, t + 1 the realized number of weekly new cases in each neighborhood i ¼ 1, …, N at period t + 1. The in-sample predictions are obtained with data from period t À 1 and tested with the data in t; whereas the out-of-sample predictions are obtained combining previous estimated parameters with data from period t and tested with data from period t + 1. For completeness, we also report the relative counterparts of the RMSE and RMSPE statistics, obtained from a loss function defined for the prediction error over the realized observation. For the in-sample exercise we have RMSE r , and for the out-of-sample predictions we obtain RMSPE r There is a well-established literature indicating the predictive advantages of ensemble predictions with respect to simple specifications of the data. Madigan and Raftery (1994) state that BMA methods predict at least as well as any single model in terms of logarithmic probability score. (Min & Zellner, 1993) show that the expected squared error loss of point forecasts is always minimized by BMA, provided the model space includes the one that generated the data. These important results highlight the superiority of BMA predictions in linear settings. Nonetheless, our problem concerns the predictions of a GLM for count data. In this setting it is not clear that the findings of the above authors naturally carry forward to ensembles of Poisson regression models. Therefore, the empirical forecasting evaluation exercise that we implement in this subsection is of independent value. periods even outperform, those of the other three ensemble predictors. This is the case for all periods except the week starting on 12 January 2021, when the number of infections reaches a peak. These empirical findings are robust across evaluation periods, and for the in-sample and out-of-sample exercises. The second main conclusion derived from the results reported in Table 5 is about the accuracy of the predictions. The magnitude of the loss functions for the absolute RMSE and RMSPE is smaller than that of the response variable. As it can be observed in Table 1 , the average number of new infections across neighborhoods during the first week of our sample period is 11.254. The in-sample predictions for that week yield a RMSE ranging between 4.781 and 4.889, suggesting a very good in-sample fit. The predictive performance is not as good for the corresponding out-of-sample exercise at the bottom panel of probability. This quantity is obtained as the fraction of the number of observations y it , for a given t, that lie inside the interval forecasts. More formally, the empirical coverage probability for the in-sample prediction analysis is defined as N À1 X N i¼1 1 jy it Àŷ it j ≤ z 1Àα=2 RMSE t Þ À , with 1(Á) an indicator function that takes a value of one if the argument is true, zero otherwise. The out-of-sample empirical coverage probability is 1 jy i,tþ1 Àŷ itþ1 j ≤ z 1Àα=2 RMSPE tþ1jt Þ À . Table 6 reports the empirical coverage probabilities obtained for predictive intervals constructed with a 90% confidence. These figures show empirical coverages that are close to the nominal ones across methods and periods. The accuracy of the intervals slightly varies between the in-sample and out-of-sample exercises, and also across periods, but not much across methods. As expected, the empirical coverage probability of the predictive intervals insample is closer to the nominal one than that for those out-of-sample. Overall, the analysis in Table 6 To illustrate further the predictive ability of our proposed framework we show the evolution of the pandemic graphically. With this aim, Figure replicates the cross-sectional differences across neighborhoods. In contrast, the predictive framework works quite well at the end of the sample period, when the model captures both the momentum in the number of cases and the cross-sectional variability. We present a model based on Poisson point processes to study the evolution of the coronavirus pandemic in the city of New York. Our study makes use of information on the number of confirmed COVID-19 cases at the zip code level from September 2020 to February 2021. The aim of the study has been twofold. First, we are interested in determining the demographic, socioeconomic, and spatial factors that have ability to explain the differences in the number of new confirmed cases across neighborhoods. Second, we have assessed the in-sample and out-of-sample predictive performance of alternative models. To do this, we consider a benchmark given by a Poisson random variable, modeling the intensity rate of the increase in weekly confirmed cases, that is extended in several dimensions by (i) adding a large set of regressors containing autoregressive components, exogenous covariates, and spatial effects; and (ii) controlling for model uncertainty by averaging the predictions of all possible combinations of regressors using a Bayesian approach. Our results show that the proposed predictive framework displays a good performance both in-sample and outof-sample during the second wave of the coronavirus pandemic in NYC. Importantly, the main predictors are the autoregressive component of the model given by the first lag of the incidence rate, population size, age-related variables, and spatial effects capturing spillovers between neighborhoods. The shares of employment in the transport, education and health, and entertainment sectors also exhibit predictive ability. These results contrast with those findings for the first wave reported in the related literature that highlight the role of occupational variables, household size, and income. The associated prediction intervals obtained from estimates of the mean square error and mean square prediction error display very good coverage probabilities, allowing us to construct interval forecasts. The econometric methodology proposed in this paper, and the insights obtained from the empirical study, can be applied as useful tools for policy evaluation to assess the effectiveness of specific policies such as social distancing measures, the use of face masks, and other non-pharmaceutical interventions implemented at neighbourhood or regional level. Our cross-sectional analysis and prediction exercise would allow policy makers with information on these measures to gauge their impact across regions and over time. Racial disparities in frontline workers and housing crowding during COVID-19: Evidence from geolocation data. Opportunity & Inclusive Growth Institute Working Paper No JUE Insight: The determinants of the differential exposure to COVID-19 in New York city and their evolution over time Spatial econometrics, A companion to theoretical econometrics Demographic determinants of testing incidence and Covid-19 infections in New York City neighborhoods Urban economics and urban policy: Challenging conventional policy wisdom Racial impact on infections and deaths due to COVID-19 in BAS: Bayesian variable selection and model averaging using Bayesian adaptive sampling Bayesian adaptive sampling for variable selection and model averaging Spatial analysis of COVID-19 clusters and contextual factors in New York City Blacks/African American communities are at highest risk of COVID-19: Spatial modeling of New York City ZIP code-level testing results Estimating and simulating a SIRD model of COVID-19 for many countries, states, and cities Methods and tools for Bayesian variable selection and model averaging in normal linear regression JUE Insight: How much does COVID-19 increase with mobility? Evidence from New York and four other U.S. cities Subway ridership, crowding, or population density: Determinants of COVID-19 infection rates in New York City The subways seeded the massive coronavirus epidemic in New York city Estimating the fraction of unreported infections in epidemics with a known epicenter: An application to COVID-19 Contributions to the mathematical theory of epidemics -I Contributions to the mathematical theory of epidemics -II. The problem of endemicity Contributions to the mathematical theory of epidemics -III. Further studies of the problem of endemicity Regression selection strategies and revealed priors When will the Covid-19 pandemic peak Mixtures of g-priors in generalized linear models Mixtures of g priors for Bayesian variable selection Panel forecasts of country-level Covid-19 infections Model selection and accounting for model uncertainty in graphical models using Occam's window Estimating the COVID-19 infection rate: Anatomy of an inference problem Generalized linear models (Monographs on Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates Urban inequalities in the 21st century economy Bayesian model selection in social research Approximate Bayes factors and accounting for model uncertainty in generalised linear models Bayesian model averaging for linear regression models Covid-19: Testing inequality in New York City Racial sorting and the emergence of segregation in American cities Model averaging and its use in economics R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing Neighborhood, race and educational inequality The authors acknowledge financial support from Gobierno de Arag on (ADETRE Research Group, Grant S39-20R).Jose Olmo acknowledges financial support from Fundaci on Agencia Aragonesa para la Investigaci on y el Desarrollo (ARAID). https://orcid.org/0000-0002-0437-7812Marcos Sanso-Navarro https://orcid.org/0000-0002-5926-2166 Additional supporting information may be found online in the Supporting Information section at the end of this article.How to cite this article: Olmo, J., & Sanso-Navarro, M. (2021). Modeling the spread of COVID-19 in New York City. Papers in Regional Science, 1-21. https://doi.org/10.1111/pirs.12615