key: cord-0720885-qfrn3v3s authors: Bloise, Francesco; Tancioni, Massimiliano title: Predicting the spread of COVID-19 in Italy using machine learning: Do socio-economic factors matter? date: 2021-01-21 journal: Structural change and economic dynamics DOI: 10.1016/j.strueco.2021.01.001 sha: 01a67e999d21e3b2f213c5a30eacc651f11d01b9 doc_id: 720885 cord_uid: qfrn3v3s We exploit the provincial variability of COVID-19 cases registered in Italy to select the territorial predictors of the pandemic. Absent an established theoretical diffusion model, we apply machine learning to isolate, among 77 potential predictors, those that minimize the out-of-sample prediction error. We first estimate the model considering cumulative cases registered before the containment measures displayed their effects (i.e. at the peak of the epidemic in March 2020), then cases registered between the peak date and when containment measures were relaxed in early June. In the first estimate, the results highlight the dominance of factors related to the intensity and interactions of economic activities. In the second, the relevance of these variables is highly reduced, suggesting mitigation of the pandemic following the lockdown of the economy. Finally, by considering cases at onset of the “second wave”, we confirm that the territorial distribution of the epidemic is associated with economic factors. The COVID-19 pandemic has opened up new challenges for understanding the factors associated with the spread of contagious diseases and the role played by social, economic, and environmental conditions. In this study we investigate the case of Italy-the first European country to experience a large number of registered cases in early 2020-to address three main questions. First, we use a methodology (based on a machine learning algorithm) for selecting the relevant predictors of registered COVID-19 cases from a large set of official, provincial-level data in Italy. 
These data include economic activity indicators within a conditioning set that considers a high number of potential triggers addressed in the current literature. Second, by repeating the analysis on cumulative cases observed at different points in time, we evaluate whether some selected predictors lose their relevance following the time-specific containment measures implemented by the Italian government, which might signal that these measures have been effective in mitigating the pandemic's further spread. Third, we set up a simulation strategy to verify the external validity of the model by testing its ability to predict the diffusion of COVID-19 in "unseen" areas. Before going into the details of the analysis, we provide some background on the pandemic in Italy. On January 30, 2020, the first two cases of COVID-19 were detected, and the respective patients were hospitalized in Rome, Italy. Only about 20 days later, on February 20, 2020, the first COVID-19 outbreak was identified in Codogno, a municipality belonging to the province of Lodi in the Lombardy region. Italy subsequently became the first European country to be severely hit by the COVID-19 pandemic. Specifically, as of March 8, 7,375 COVID-19 cases were registered in Italy, of which nearly 57% were in Lombardy. On March 21, when the highest number of diagnosed new positive cases was registered (6,557), the cumulative number of cases increased to 53,578 (of which 47.6% were in Lombardy), exhibiting a growth rate close to 14%. Several increasingly restrictive containment measures have been adopted by the Italian government since the detection of the first outbreak, leading to a nearly complete lockdown of the country's economic activities on March 21, 2020. This strong containment measure (stage one) was in effect until May 3, 2020. From this date until June 3 (stage two), the containment was partially relaxed.
At the end of stage two on June 3, the number of registered cases was 233,836, with a 0.1% growth rate. The country then entered a "third stage" in the management of the COVID-19 spread, mostly based on the adoption of prescriptions for personal protection and social distancing. One of the main characteristics of the COVID-19 pandemic in Italy is its highly heterogeneous territorial distribution. As of June 3, 2020, Lombardy's official disease count accounted for 38.2% of all Italian cases, a number that was 128% higher than that expected for a homogeneous territorial distribution (i.e. a 16.7% regional share in the national population). Within Lombardy, official disease counts also indicated high heterogeneity in incidences across provinces. On June 3, 2020, COVID-19 counts per 100,000 inhabitants ranged from 408 cases, registered in the province of Varese, to 1,802 cases in the province of Cremona. A high territorial heterogeneity of the spread was also found in other Italian regions. In the central region of Lazio, the incidences ranged between a minimum of 94 cases per 100,000 inhabitants in the province of Latina and a maximum of 248 cases, registered in the province of Rieti. In Sicily, official disease counts ranged from a minimum of 30 cases per 100,000 inhabitants in the province of Ragusa to a maximum of 258 cases in the province of Enna. A number of explanations have been suggested in the ongoing debate for the uneven geographical spread of the pandemic. A natural conjecture is that such heterogeneity in disease counts reflects differences in the territorial distribution of its triggers.
Demographics ( Dowd et al., 2020 ) , health care system characteristics ( Black et al., 2020 ; Gan et al., 2020 ; Brindle and Gawande, 2020 ) , enrollment in education systems ( Chang et al., 2020; Li et al., 2020; Viner et al., 2020; Zhang et al., 2020 ) , transport and mobility specificities ( Do et al., 2020; Li et al., 2020; Zheng et al., 2020 ) , climate factors ( Bashir et al., 2020; Sajadi et al., 2020; Wang et al., 2020 ) , pollution ( Yongjian et al., 2020 ; Wu et al., 2020 ) , social attitudes, and family ties ( Bayer and Kuhn, 2020; Belloc et al., 2020; Borgonovi and Andrieu, 2020 ) are the general categories wherein the majority of speculations on potential disease triggers being proposed can be grouped. Only few and highly focused studies address the role of economic factors in the spread of the pandemic ( Barbieri et al., 2020 ; Dingel and Neiman, 2020 ; Qiu et al., 2020 ; Fogli and Veldkamp, 2020 ) , whereas much more attention is being devoted to the opposite causal nexus, that is, the investigation of the effects of the pandemic on economic activities and conditions ( Atkenson, 2020 ; Baker et al., 2020 ; Baqaee and Fahri, 2020 ; Bronka et al., 2020 , Decerf et al., 2020 , Dosi et al., 2020 Fernandez, 2020 ; Gregory et al., 2020 ; Guan et al., 2020 ; Ludvigson et al., 2020 ; Palomino et al., 2020 ) . Each of the many explanations has its own rationale. However, in the absence of randomized testing on individuals or of an established diffusion model for the infectious disease, their objective relevance for the prediction of the uneven spread of the epidemic remains questionable. Empirical analyses that focus on specific triggers in non-experimental environments are unavoidably at risk of strong under-specification biases, unless they rely on strategies able to consider all these factors jointly. 
Detecting the relevant correlates of the geographical heterogeneity of the epidemic from aggregate data is a difficult task, since potential predictors are many and highly correlated. In this setting, analyses that strictly focus on specific aspects often lead to biased estimates due to the omission of important controls. By increasing the set of correlates to reduce the bias, the standard errors of the estimates obtained with typical penalty functions (such as OLS) tend to inflate, implying that the statistical relevance of the conditioning sets shrinks toward zero, and the model's out-of-sample predictive performances deteriorate. In other words, standard estimation tools are bounded to an underspecification/overfitting trade-off. A viable strategy to handle the bias-variance trade-off and minimize the mean squared prediction error is to leave the predictors' set unrestricted (thus taking an agnostic perspective) and use statistical learning to select a parsimonious model. This basically implies the introduction of weights in the OLS estimator in the form of penalties that are able to select a compact model structure with "optimal" out-of-sample predictive properties. Given our empirical setting, we use the elastic net learning algorithm ( Zou and Hastie, 2005 ; Hastie et al., 2009 ) , which provides an estimator belonging to this general strategy. The elastic net estimator combines the properties of ridge regression, which basically mitigates the multicollinearity problem through regularization, and of the least absolute shrinkage and selection operator (LASSO), which further increases the predictive ability of models. This method has many advantages. First, it allows joint consideration of a very large number of correlated predictors in a unified empirical framework and reduction of the risk of overfitting by selecting only those variables that provide the highest predictive performance. 
Second, relative to other machine learning algorithms, the elastic net ensures good performances, even when the number of tested predictors is high (in principle even larger than the number of observations) relative to the sample size ( Zou and Hastie, 2005 ) . Third, since the elastic net can be conceptualized as a generalization of OLS, it allows a standard interpretation of results. With this strategy, we exploit an extended information set including COVID-19 cases observed at the provincial level and the many potential triggers addressed in the literature. Specifically, we consider a set of 77 province-specific candidate predictors that can be grouped in the following conceptual sub-sets of indicators: i) economic activity and intensity, ii) climate and pollution, iii) sociodemographic, iv) geographical and territorial distance, v) healthcare-system-related indicators, vi) public and private mobility, and vii) educational-system-related indicators. As a first robustness check, we include regional dummies to check whether our results are driven by region-specific characteristics, an occurrence that would switch the set of predictors selected by the elastic net estimator. Robustness is also tested with respect to the role of the control identifying the distance from the first outbreak. To the best of our knowledge, this is the first study in which a very large set of the potential triggers of the geographical diffusion of COVID-19 addressed in the literature are jointly considered and analyzed in a unified empirical framework. The results point to a substantial improvement of the model's predictive properties with elastic net estimates. The gain is striking, as compared with OLS, and evident with respect to ridge regression and LASSO. 
The model estimated with the elastic net using the registered cumulative cases on March 21, 2020 identifies few relevant predictors of the geographical distribution of the epidemic among the 77 explanatory variables being considered. Importantly, we find that five out of 10 triggers belong to the economic sub-set. In order of relative importance, productivity (value-added per employee), the intensity of firms' international relationships, the general employment rate, and the share of labor employed in manufacturing show a positive correlation with the prevalence of COVID-19 cases, whereas the share of labor in agriculture is selected as a favorable trigger (negative correlation). The highest positive correlation is obtained by the measure of close Euclidean distance from the first outbreak (i.e. within 50 km from the province of Lodi). Health characteristics are shown to affect the geographical spread through a positive correlation with the mortality rate for infectious diseases. Climatic and pollution factors selected as critical triggers include the number of frost days in a year and the average concentration of PM10. Family ties, proxied by the average family size, are shown to have a weak but negative correlation with COVID-19 cases. Re-running the elastic net estimates on June 3, 2020 case data and controlling for cumulative cases registered on March 21, 2020 yields results that cancel three out of the five economic triggers, namely, value-added per employee, the share of employment in manufacturing, and the general employment rate. We interpret this result as evidence of the effectiveness of the strong containment measures adopted by the government. This evidence, which confirms the results provided by Flaxman et al. (2020) for a set of 11 European countries, is reinforced by the deletion of the close distance identifier, which was selected as the most important trigger of the pandemic in the pre-containment sample.
In the post-containment sample, the registered cumulative cases on March 21, 2020 (which can be conceived as a measure of the attack rate of the disease in the second stage) are found to be the most important trigger of the subsequent spread. Baseline results are robust to the introduction of regional dummies. From a simulation exercise based on estimates obtained over bootstrapped samples, the models selected with the elastic net algorithm are shown to largely outperform those estimated with OLS in terms of out-of-sample properties. Unlike OLS, predictions obtained with the models estimated with the elastic net are shown to be stable across simulations and to replicate the real pattern correctly, irrespective of the randomly generated sample being considered. Furthermore, models estimated on five different training samples (in which 20% of provinces are iteratively excluded from the estimation set) are shown to maintain their predictive power also in "unseen" areas, namely those excluded from the training sample. This is a signal of the external validity of the model being selected by the elastic net algorithm, at least for the Italian case. Finally, using information on cumulative COVID-19 cases recorded between September and October 2020, we show that cross-province economic differences are confirmed as key factors to predict the spread of the epidemic in Italy also during the "second wave". The paper proceeds as follows. Section 2 briefly describes the evolving literature. Section 3 discusses the data used in the analysis and provides some stylized descriptive evidence. Section 4 describes our estimation strategy and discusses its main advantages. Section 5 presents and discusses the main results, provides information about their robustness, shows the prediction improvement of the model selected by the elastic net algorithm as compared with OLS, and evaluates its external validity. Section 6 provides the conclusion.
A number of contributions from authors belonging to different disciplines have recently tried to identify the reasons behind the territorial heterogeneity of COVID-19 spreads observed at the global level. Different views are emerging from separate studies focused on specific research questions. Such heterogeneity of views reflects the existence of both an objective puzzle and an investigation difficulty. On the one hand, in the case of Italy, the considerably higher rate of infection in Lombardy and in other northern provinces cannot be explained only by the fact that at the beginning of the epidemic in February (i.e. before the first important Italian outbreak in the province of Lodi was discovered and public authorities implemented stringent containment measures) most people got infected in the northern provinces. Evidence from registered cases clearly points out that for any given distance from the first outbreak, there were provinces in the north with a high rate of infection and others that displayed much lower prevalence. On the other hand, a common feature of recent contributions performed with standard tools is that the emerging correlations between the diffusion of COVID-19 and its triggers come from investigations that lack a comprehensive handling of the potential predictors that can reasonably be conjectured. To contextualize our analysis, here we briefly mention only a few of them, approximately covering the domain of the explanations to which interest has been directed. Among possible explanations of the territorial spread of the epidemic, interest has been first focused on demographic factors. From this perspective, large streams of studies have addressed the role of the age structure, sex, and intergenerational interactions for the observed territorial heterogeneity in the diffusion and fatality rate of COVID-19 (Dowd et al., 2020). Other works tried to evaluate the possible association between COVID-19 cases and climatic factors.
Investigations were mostly focused on the role of air temperature and humidity, obtaining mixed results ( Bashir et al., 2020; Zhu and Xie, 2020; Sajadi et al., 2020; Wang et al., 2020 ) . Moreover, the infection risk for health care workers in hospitals and their role for the outer transmission of the disease have been the focus of other studies. Black et al. (2020) , by noting that nearly 45% of secondary cases could be infected by index cases in a pre-symptomatic stage, and that in fully tested realities, asymptomatic cases ranged from 51% to 88%, made a strong case for the strategic role of mass health care workers testing to prevent propagation within and out of hospitals. From a similar perspective, Gan et al. (2020) addressed the case of the Singapore health system, and Brindle and Gawande (2020) studied the specifics of managing the pandemic risk in surgical systems. All these contributions implicitly conjecture that health care system characteristics and policies are critical for the spread of the pandemic. This possibility, still missing objective evidence, is the subject of harsh debate and juridical investigations in Italy. Some evidence on influenza outbreaks attributed an important role in the transmission of infectious diseases to school and education system arrangements ( Jackson et al., 2016 ; Bin et al., 2018 ) . This stimulated interest for investigations on the role of class attendance and participation in the education system as a trigger of the COVID-19 pandemic. In an early review study focusing on schools, Viner et al., 2020 showed that the evidence on school closures for the containment of the COVID-19 pandemic is weak or mixed. Unless reinforced with other stringent social distancing measures, social contacts outside schools are not less risky than child activities in schools ( Chang et al., 2020 ) . However, Zhang et al. 
(2020), in a model-based analysis focused on China, showed that school closure, by delaying the epidemic spread, can significantly reduce the peak incidence. Li et al. (2020), in a cross-country study on the time-varying effectiveness of a set of containment measures on the COVID-19 replication rate, showed that school closure decreases transmission by 15% after four weeks, while school reopening could increase transmission by 24%. Interest has also been directed at evaluating the role of environmental characteristics, with a specific focus on air pollution. Yongjian et al. (2020), in a strictly focused investigation, found a positive correlation between short-term exposure to air pollution and the number of confirmed COVID-19 cases in China. Their investigation, however, does not take into account other factors that can be simultaneously correlated to the spread of the pandemic and air pollution. Wu et al. (2020) suggested that air pollution might be correlated to higher mortality rates in the United States, even after controlling for some confounding factors. Other studies focused on the correlation between the infection rates and social/family habits. Intergenerational family ties and cohabitation, known to be very high in Italy as compared with other high-income countries (Reher, 1998; Di Giulio and Rosina, 2007; Santarelli and Cottone, 2009), have been evaluated as a possible trigger of the epidemic. From this perspective, Bayer and Kuhn (2020) found a positive correlation between a measure of family vertical integration and the COVID-19 fatality rate, using cross-country data recorded at an early stage of the "first wave" of the pandemic. Belloc et al. (2020) argued that this result might be driven by country-specific factors simultaneously correlated to both intergenerational family ties and the spread of COVID-19.
Such a potential selection bias is obviously high in analyses in which the variability across structurally and institutionally different countries or regions is exploited. In fact, Belloc et al. (2020) showed that the correlation between the COVID-19 fatality rate and a measure of vertical social integration (i.e. the share of adults aged 18-34 living with their parents) turns negative when the sample variability in the diffusion of the epidemic is referred to the 20 Italian regions, where the southern display higher family ties and lower case fatality rates. Under a similar perspective, Borgonovi and Andrieu (2020) evaluated the role of social capital (an index comprising both social norms and networks, obtained from an assortment of measured human attitudes, activities, and behaviors) in the response of U.S. county communities to COVID-19-related containment policies, measured in terms of changes in mobility patterns. They found that the social capital index is negatively correlated with mobility during the COVID-19 outbreak and thus, has a lowered risk of contagion. With regard to the economic literature, efforts are mostly focused on the potential effects of the pandemic on the economy. Different aspects have been addressed: global economic performances ( Atkenson, 2020 ; Baqaee and Fahri, 2020 ; Fernandes, 2020 ; Gregory et al., 2020 ; Ludvigson et al., 2020 ) , global supply chains ( Guan et al., 2020 ) , economic uncertainty ( Baker et al., 2020 ) , and the distribution of income and poverty ( Bonacini et al., 2020 ; Bronka et al., 2020 ; Decerf et al., 2020 ; Dosi et al., 2020 ; Palomino et al., 2020 ) . There are fewer studies that analyze economic factors as potential triggers of the pandemic. To cite some of them, Qiu et al. (2020) showed that the transmission rate of the infection increases with per capita GDP. Their result suggests that economic factors should be further investigated as important predictors of the COVID-19 diffusion. 
Fogli and Veldkamp (2020) suggested that areas that are more productive are socially and economically connected with each other and with the rest of the world and thus, are more vulnerable to spreads of infectious diseases. Ascani et al. (2020) find an association between the geographical spread of COVID-19 in Italy during the "first wave" and the structure of local economies. However, using the OLS estimator, they are forced to consider a limited number of explanatory variables to avoid multicollinearity issues. From a microeconomic perspective, Barbieri et al. (2020) evaluated the extent to which the probability of being infected varies across different categories of workers. Dingel and Neiman (2020) showed that the probability of infection is related to the possibility of working from home. These studies suggest that the link between a pandemic crisis and economic activity should be addressed considering two directions of causality. On the one hand, areas that are more productive and interconnected are more likely to be affected by high infection rates due to their higher degree of social and economic networks. On the other hand, highly infected areas contribute more to global supply chains and value-added, such that global economic growth may be strongly reduced in the occurrence of a pandemic crisis, irrespective of the containment measures adopted to reduce the infection's transmission rate ( Guan et al., 2020 ) . This brief and unavoidably incomplete review of the related literature provides a sketch of the contributions to which our work is related. It also helps in forming an idea about the difficulties arising from analyses focused on few specific predictors of the COVID-19 pandemic in a non-experimental environment. 
In the following sections, we propose an analysis that is able to circumvent these problems, with the specific goal of selecting, among a large set of candidate economic and non-economic triggers, those that have the highest predictive power for the heterogeneous diffusion of the COVID-19 pandemic. Our investigation analyzes the geographical diffusion of the COVID-19 pandemic in Italy and identifies its predictors in a unified empirical framework. We collect data on registered COVID-19 cases and on a large set of potential predictors observed at the provincial level from different official data sources. First, we take information on COVID-19 cases provided by the Italian Civil Protection Department (ICPD) on a daily basis since the first cases were identified in the municipality of Codogno at the end of February. Although we are aware that information on registered cases is likely affected by a high degree of measurement error (the number of infected people has been largely underestimated due to the low number of swabs and tests carried out at the beginning of the epidemic), data provided by the ICPD are so far generally recognized as the sole official and controlled source available in Italy. We refer to the number of cumulative COVID-19 cases registered by the ICPD on two different dates: March 21, 2020 and June 3, 2020. The former date is selected to focus on the heterogeneity in the geographical distribution of COVID-19 cases observed before the containment measures implemented by the Italian government had taken effect. In considering cumulative cases as of June 3, 2020, we focus on the geographical distribution of COVID-19 cases registered between March 21, 2020 and June 3, 2020, that is, on infection events that occurred when the strongest containment measures were in place. By exploiting this difference in policy implementation, we evaluate whether the tested predictors of the spread of the epidemic vary due to the implementation of the measures.
Fig. 1 shows the geographical diffusion of infected people per 100,000 inhabitants registered on March 21, 2020 (left-hand map) and on June 3, 2020 (right-hand map) by grouping the 107 provinces in deciles of the national distribution of COVID-19 cases. The two maps clearly show that most of Lombardy's provinces have been severely hit by the epidemic and are in the top decile of the national distribution, with 219 to 761 (819 to 1,802) cases per 100,000 inhabitants as of March 21, 2020 (June 3, 2020). As of March 21, 2020, aside from some areas in northern Italy, the two bordering provinces of Rimini and Pesaro-Urbino are the only other provinces along the eastern coast of Italy that belong to the top decile of the distribution. On June 3, 2020, they fall to the 8th and 9th deciles, respectively. Most of the provinces in the central or southern part of Italy show a lower degree of infection, with 2 to 7 (28 to 51) cases per 100,000 inhabitants on March 21, 2020 (June 3, 2020). However, differences can be detected across provinces in each region or area. For instance, the number of infected people per 100,000 inhabitants is clearly higher in northern Sardinia (province of Sassari) than in the rest of the island, whereas the province of Enna in central Sicily, on June 3, 2020, shows a much higher degree of infection compared with other Sicilian provinces. To provide a comprehensive analysis of the many potential predictors of COVID-19 diffusion across Italian provinces, we use 77 explanatory variables, all observed at the provincial level. Data come from different official sources: the national statistical office (ISTAT), the Ministry of Economy and Finance, the Ministry of Economic Development, and the Ministry of Health.
These data can be grouped into seven conceptual sub-sets of indicators: i) economic activity and intensity (19 variables), ii) climate and pollution (9 variables), iii) socio-demographic (9 variables), iv) geographical and territorial distance (8 variables), v) health-care-system-related (13 variables), vi) public and private mobility (12 variables), and vii) educational-system-related indicators (7 variables). In particular, the set of economic predictors includes labor market characteristics (the employment and unemployment rates, the percentage of employment in agriculture, industrial districts, manufacturing, and services, and the percentage of self-employed workers); economic and distribution characteristics (the value-added per employee, i.e. productivity, the value-added per employee in agriculture, manufacturing, and services, and the poverty rate); and firm characteristics (the firm size, firm density, the share of employment in industrial districts, the intensity of firms' export relationships, the share of unloaded goods in provincial harbors, the density of livestock units, and the density of firms producing animal-derived products). The full set of indicators for each conceptual sub-set is listed and described in detail in the Appendix (Tables A.1 to A.7). In line with many recent studies, we can exemplify the problems that potentially emerge by analyzing simple correlations between registered log cases per 100,000 residents in a province and specific factors that so far have been identified as possible predictors of the geographical spread of COVID-19 infections. For instance, Fig.
2 shows that COVID-19 cases are positively correlated with the average number of frost days in a year (Panel A), with the average concentration of PM10 (particles with diameter ≤ 10 μm), which can serve as a proxy for air pollution in the province (Panel B), and with the mortality rate for infectious diseases (Panel D), while they are negatively correlated with the percentage of families with at least five members (Panel C). The latter result basically replicates that of Belloc et al. (2020), which was obtained at the regional level. Notably, such analyses cannot accurately predict the distribution of cases across areas. Single correlations, even if statistically relevant, may be driven by many other confounding factors that should be considered in a comprehensive predictive model. This risk is particularly high when, as in this case, there is no sound theoretical support for variable selection, from which a structural model of the unequal diffusion of COVID-19 across different territories can be derived. Moreover, although using few explanatory variables minimizes the risk of overfitting, we can hardly obtain a good predictive performance of the COVID-19 spread by focusing on single potential predictors. This is why in the following sections we base our model specification on a machine learning algorithm capable of selecting the pandemic's relevant triggers from a joint consideration of many potential explanatory variables. We perform our estimation using the elastic net machine learning algorithm originally proposed by Zou and Hastie (2005). The elastic net algorithm combines the ridge and LASSO regularizations to further increase the predictive ability of a model. The role of these penalties is to select a parsimonious model (and/or shrink the size of its coefficients) from a very high number of explanatory variables (possibly exceeding the number of observations), even when the conditioning set displays near or exact collinearity.
The model selection implies conditioning the penalties to an optimal target, that is, the maximization of the model's out-of-sample predictive abilities (or the minimization of the out-of-sample mean squared error). In practice, the penalty parameters are obtained from repeated rounds of model validation, known as cross-validation, in which the estimates obtained on estimation sets are generalized to predict unseen data (i.e. predictive sets). More specifically, the penalties are obtained by maximizing the model's ability to predict data that are not used in the estimates. Since the elastic net can be conceptualized as a generalization of OLS (i.e. a methodology belonging to the family of regularized least squares), we provide the details of the estimation method in relation to some key properties of the OLS estimator. The OLS estimator is very often exploited to predict a given number of observations of an outcome variable using a vector of predictors. Usually, the outcome variable of interest is predicted by estimating the parameters that make the in-sample sum of squared residuals as small as possible. However, there are two fundamental aspects of an estimator to be considered in the evaluation of its predictive performance: the bias and the variance. The former quantifies the error that is introduced by approximating an unknown data generating process. Specifically, if we assume N random samples drawn from the data generating process, we could obtain a range of predictions, one for each randomly drawn sample. The bias is thus a measure of the distance between the expected value of the prediction and the unknown function that captures the true relationship between the outcome variable and the predictors. The variance is the variability of a model prediction around its expected value.
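The cross-validation step described in this section can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' implementation: the number of folds (k = 5), the contiguous fold assignment, the helper names (`k_fold_indices`, `cv_mse`, `select_penalty`) and the toy one-predictor ridge estimator (`fit_ridge_1d`, `predict_1d`) are all hypothetical choices made for brevity.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_mse(fit, predict, X, y, penalty, k=5):
    """Out-of-sample MSE for one penalty value, pooled over all held-out points."""
    squared_error = 0.0
    for held_out in k_fold_indices(len(y), k):
        excluded = set(held_out)
        train = [i for i in range(len(y)) if i not in excluded]
        # Estimate on the estimation set only, then predict the unseen fold.
        model = fit([X[i] for i in train], [y[i] for i in train], penalty)
        squared_error += sum((y[i] - predict(model, X[i])) ** 2 for i in held_out)
    return squared_error / len(y)


def select_penalty(fit, predict, X, y, grid, k=5):
    """Return the penalty in `grid` with the lowest cross-validated MSE."""
    scores = {penalty: cv_mse(fit, predict, X, y, penalty, k) for penalty in grid}
    return min(scores, key=scores.get), scores


# Hypothetical toy estimator: one-predictor ridge regression through the origin.
def fit_ridge_1d(x, y, penalty):
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def predict_1d(slope, x):
    return slope * x
```

In applied work the observations would normally be shuffled before being assigned to folds, and `fit`/`predict` would wrap the estimator of interest; the selection logic stays the same.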
According to the definition of bias and variance, the prediction performance of a model can be evaluated by looking at its mean squared error (MSE), which is the expected error in predicting a given outcome variable:

$$\mathrm{MSE} = E\left[\left(y_p - \hat{f}(x_p)\right)^2\right] \qquad (1)$$

In our study, $y_p$ denotes registered log cases at a given date per 100,000 inhabitants observed at the provincial level, and $\hat{f}(x_p)$ denotes predicted log cases per 100,000 inhabitants, which is a function of the vector of provincial predictors included in the model. The statistical learning literature (Hastie et al., 2009) shows that the MSE can be decomposed as follows:

$$E\left[\left(y_p - \hat{f}(x_p)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_p)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_p)\right)\right]^2 + \mathrm{Var}(\varepsilon) \qquad (2)$$

where the first term is the variance of the model; the second term is the squared bias; and the last is the noise term, which cannot be reduced. To attain the best prediction, we should minimize the MSE by reducing both the first and the second components of Equation (2). In finite samples, the well-known trade-off between variance and bias requires a balance between the first two components of Equation (2), such that the lowest attainable MSE is conditional on the specific set of predictors at our disposal. Although OLS is the best linear unbiased estimator, it produces very poor predictions in the following two cases:

i) When there is a high degree of correlation between the predictors included in the vector $x_p$.
ii) When too many predictors need to be included in the model with respect to the number of observations. In worst-case scenarios, typical of investigations that are missing the support of an established theoretical model, the number of predictors might exceed the number of observations, such that it is not even possible to estimate the parameters of interest using OLS.

In both cases, the variance of the prediction can be extremely high, such that the predictive performance of the OLS estimator is very low, even though the bias component is minimized.
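To make the variance-bias trade-off concrete, the following toy Monte Carlo sketch (not part of the original analysis) compares an unbiased estimator of a mean with a deliberately shrunken one. The function name `simulate`, the shrinkage factors, and all parameter values are illustrative assumptions. The sketch only checks that the reducible part of the error equals variance plus squared bias, and that accepting some bias can lower the total.

```python
import random
import statistics


def simulate(shrink, mu=1.0, sigma=3.0, n=5, reps=20000, seed=0):
    """Return (variance, squared bias, reducible error) of shrink * sample mean.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    preds = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        preds.append(shrink * statistics.fmean(sample))
    mean_pred = statistics.fmean(preds)
    variance = statistics.fmean((p - mean_pred) ** 2 for p in preds)
    bias_sq = (mean_pred - mu) ** 2
    # Expected squared distance from the true value, excluding irreducible noise.
    reducible = statistics.fmean((p - mu) ** 2 for p in preds)
    return variance, bias_sq, reducible


unbiased = simulate(shrink=1.0)  # plain sample mean: no bias, high variance
shrunk = simulate(shrink=0.7)    # shrunken mean: some bias, lower variance

for label, (var, bias_sq, reducible) in [("unbiased", unbiased), ("shrunk", shrunk)]:
    print(f"{label}: variance={var:.3f} bias^2={bias_sq:.3f} reducible={reducible:.3f}")
```

Adding the noise variance (here `sigma ** 2`) to the reducible error would give the full MSE of predicting a new observation; the noise term is the same for both estimators, so it does not affect the comparison.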
Therefore, in very complex models, allowing for a small degree of bias is essential to obtain a strong reduction in variance and improve the prediction performance of the model. In our estimation problem, we need to identify the main predictors of the geographical spread of the epidemic considering a large set of potential predictors addressed by the literature, whose coefficients are to be estimated with a small sample size (i.e. the 107 Italian provinces). This is a typical case in which the prediction performance of the OLS estimator is very low, given several issues related to multicollinearity and overfitting. For this reason, we handle the variance-bias trade-off by using the elastic net regularization algorithm originally proposed by Zou and Hastie (2005):

$$\hat{\beta} = \arg\min_{\beta} \left\{ \sum_{p} \left(y_p - x_p'\beta\right)^2 + \lambda \sum_{j} \left[ \alpha |\beta_j| + (1-\alpha)\beta_j^2 \right] \right\} \qquad (3)$$

The elastic net algorithm combines the penalties of the ridge regression and LASSO and mitigates some of the known drawbacks that affect LASSO, which has been shown to saturate when the number of predictors is very high with respect to the sample size, or when there is a high degree of correlation among predictors. In Equation (3), the λ parameter controls the relevance of the penalty term $(\alpha|\beta_j| + (1-\alpha)\beta_j^2)$, which shrinks the coefficients toward zero to reduce overfitting. When the parameter α = 0, the elastic net algorithm collapses to the ridge regression and no predictor is excluded from the model. When α = 1, the elastic net algorithm is equivalent to the LASSO, which can set some of the coefficients exactly to zero. When both λ and α are greater than zero, the algorithm can set some coefficients exactly to zero and shrink others to minimize the prediction error. On the contrary, when λ equals zero the penalty vanishes whatever the value of α, and the elastic net algorithm collapses to the OLS case, such that all predictors are exploited to predict the outcome variable without any shrinkage.
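The roles of λ and α described above can be illustrated with a minimal pure-Python coordinate-descent sketch. It is not the authors' implementation: the function names, the simulated data, and the scaling of the objective (a 1/(2n) factor on the squared error and a 1/2 factor on the ridge part, as in some common software conventions) are assumptions made for the example.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become exactly 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for (assumed scaling, no intercept):
    (1/(2n)) * sum (y - X b)^2 + lam * (alpha * |b|_1 + (1 - alpha)/2 * |b|_2^2).
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    resid = list(y)  # residuals when all coefficients are zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(k)]
    for _ in range(n_iter):
        for j in range(k):
            # Covariance between predictor j and the partial residual (excluding j).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1 - alpha))
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Hypothetical data: only the first two of six predictors matter.
rng = random.Random(0)
n, k = 60, 6
X = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 0.5) for row in X]

ols_like = elastic_net(X, y, lam=0.0, alpha=0.0)    # no penalty
ridge_like = elastic_net(X, y, lam=0.5, alpha=0.0)  # shrinkage, no exclusion
lasso_like = elastic_net(X, y, lam=0.5, alpha=1.0)  # shrinkage and exclusion

for name, b in [("lam=0", ols_like), ("alpha=0", ridge_like), ("alpha=1", lasso_like)]:
    print(name, [round(v, 3) for v in b])
```

In this sketch the α = 1 fit sets irrelevant coefficients exactly to zero, while the α = 0 fit only shrinks them, which mirrors the verbal description of the two limiting cases.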
Therefore, it is possible to get very different predictive models and estimated coefficients for each combination of values of λ and α. Among all possible specifications, we select λ and α by using k-fold cross-validation to minimize the out-of-sample MSE, and we evaluate the external predictive performance of the model by testing its ability to predict new data that were not used for its estimation (James et al., 2013). K-fold cross-validation is a re-sampling procedure that randomly splits the sample into K subsets (folds). Each of the K folds is then used in turn as the test set, while the remaining K-1 folds are used to estimate the model coefficients. Following Mullainathan and Spiess (2017), we calibrate our algorithm and evaluate its out-of-sample performance in different steps: i) we randomly divide our data into a training sample (80% of the observations) and a hold-out sample (the remaining 20% of the data); ii) in the 80% training sample we use 5-fold cross-validation to select a specific λ-α pair among a set of different possible combinations of λ and α. The selected λ and α are the ones that minimize the average MSE computed across the five folds (footnote 5); iii) we run the algorithm in the training sample using the λ-α combination selected through k-fold cross-validation; iv) we compute the out-of-sample prediction error in the hold-out sample to test the model's capability to predict "unseen" data.

In this section we present the results of our analysis by using, as dependent variable in our regressions, the log of COVID-19 cumulative cases per 100,000 residents in the province, measured at different dates. In our baseline analysis, we refer to positive cases registered until March 21, 2020 to consider the geographical spread of cases across Italian provinces in the first stage of the epidemic (Table 1).
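The calibration steps i) to iv) described earlier in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the estimator is a one-predictor shrinkage regression used as a stand-in for the elastic net, so the grid runs over λ only rather than over λ-α pairs, and the data are simulated. The sample sizes (107 observations, 21 of them held out) follow the figures reported in the paper.

```python
import random


def fit_shrunk_slope(rows, lam):
    """Stand-in estimator (assumption): one-predictor least squares with ridge-type shrinkage."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, y in rows)
    return sxy / (sxx + len(rows) * lam)


def mse(rows, slope):
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)


def k_fold_indices(n, k, rng):
    """Randomly partition the indices 0..n-1 into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def calibrate(train, grid, k=5, seed=0):
    """Step ii): pick the penalty with the lowest average validation MSE across k folds."""
    folds = k_fold_indices(len(train), k, random.Random(seed))
    best_lam, best_err = None, float("inf")
    for lam in grid:
        errs = []
        for fold in folds:
            held = set(fold)
            est = [train[i] for i in range(len(train)) if i not in held]
            val = [train[i] for i in fold]
            errs.append(mse(val, fit_shrunk_slope(est, lam)))
        avg = sum(errs) / k
        if avg < best_err:
            best_lam, best_err = lam, avg
    return best_lam, best_err


# Simulated data (hypothetical): 107 observations, one predictor.
rng = random.Random(42)
data = []
for _ in range(107):
    x = rng.gauss(0, 1)
    data.append((x, 0.5 * x + rng.gauss(0, 1)))

# Step i): random 80/20 split into training and hold-out samples.
rng.shuffle(data)
n_hold = round(0.2 * len(data))
holdout, train = data[:n_hold], data[n_hold:]

grid = [0.0, 0.01, 0.1, 1.0, 10.0]
best_lam, cv_mse = calibrate(train, grid)         # step ii)
final_slope = fit_shrunk_slope(train, best_lam)   # step iii)
holdout_mse = mse(holdout, final_slope)           # step iv)
print(best_lam, round(cv_mse, 3), round(holdout_mse, 3))
```

The hold-out sample is never touched during calibration, so `holdout_mse` measures the error on data that played no role in either choosing the penalty or estimating the coefficient.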
Then, we focus on cumulative cases registered between March 22, 2020 and June 3, 2020 to detect whether, and to what extent, predictors change when containment measures are in force (Table 4).

Footnote 5: Usually a 10- or 5-fold cross-validation is used to select the values of λ and α that minimize the prediction error computed across the test sets.

To compare the magnitude of the estimated coefficients and identify which predictors are more relevant for our analysis, we standardize all explanatory variables so that we can interpret each estimated coefficient multiplied by 100 as the percentage increase of cases per 100,000 inhabitants for a one standard deviation increase in predictors. Table 1 shows our baseline results obtained with the elastic net calibrated using a 5-fold cross-validation (footnote 6). In this case, among all 77 explanatory variables considered in the analysis, the algorithm selects only 10 predictors, presented in descending order of importance. The coefficients of the other 65 explanatory variables are penalized to zero, denoting that they are not relevant predictors of the epidemic, and are not shown in Table 1. On the contrary, although the coefficients of the 10 variables selected by the algorithm are reduced in size to minimize the risk of overfitting, they are not set to zero. The results suggest that, given the province of the initial outbreak, all provinces within 50 km are more likely to have a high infection rate (49.6% more cases per 100,000 people) about 1 month after the beginning of the epidemic. Nevertheless, none of the other distance dummies is selected as a potential predictor, suggesting that for distances above 50 km, there are other diffusion factors not related to geographical distance. Among the other selected variables, economic factors are shown to be the main predictors of the epidemic spread.
In particular, for a one standard deviation increase, the number of registered cases per 100,000 residents increases with value-added per capita (by 24.4%), intensity of firms' export relationships (18.8%), overall employment rate (7.1%), and the percentage of employment in manufacturing (4.8%), and decreases with the percentage of employment in agriculture (-15.0%). Our results suggest that provinces that are more productive are more likely to be severely hit by the epidemic. Additionally, more intensive international relationships, a higher employment rate, and a large share of employees in the manufacturing industry are triggers of the initial COVID-19 geographical spread. It should be noted that manufacturing is the most strongly integrated sector in the global economy, as it is involved in global value chains and produces goods that make up the majority of exports in OECD countries (De Backer et al., 2015). Outside economic triggers and the Euclidean distance, the most relevant explanatory variable identified as a predictor of the COVID-19 spread is the average number of days with temperature below 3 °C (+16.1% for a one standard deviation increase). The rate of positive cases per 100,000 people is also higher where mortality for infectious diseases and PM10 concentration are higher (+9.3% and +7.2% for a one standard deviation increase, respectively). These results suggest that provinces with high COVID-19 registered cases are also those where the transmission rate of infections is generally higher and, as suggested by previous works (Wu et al., 2020), where air pollution is higher. Finally, the other variable that has been selected by the algorithm (with a lower estimated coefficient) is the average size of the family (-2.8% for a one standard deviation increase). The latter result could indicate that stricter family ties reduce the exposure of family members to the spread of infections through external social and professional networks.
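A short note on reading the standardized coefficients reported above: since the outcome is in logs, multiplying a coefficient by 100 is a first-order approximation of the percentage change, while the exact change is 100 × (exp(β) − 1). The sketch below compares the two for three coefficient values implied by the percentages quoted in the text (0.496, 0.244, and -0.028); the helper names are hypothetical.

```python
import math


def approx_pct(beta):
    """Percentage change read directly as 100 * beta (first-order approximation)."""
    return 100.0 * beta


def exact_pct(beta):
    """Exact percentage change implied by a coefficient in a log-outcome model."""
    return 100.0 * (math.exp(beta) - 1.0)


# Coefficient values implied by the percentages quoted in the text.
examples = [
    ("Within 50 km of the first outbreak", 0.496),
    ("Value-added per capita", 0.244),
    ("Average family size", -0.028),
]

for name, beta in examples:
    print(f"{name}: approx {approx_pct(beta):+.1f}%, exact {exact_pct(beta):+.1f}%")
```

The two readings nearly coincide for small coefficients and diverge as the coefficient grows, so the ranking of predictors is unaffected while the largest effects are somewhat understated by the approximation.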
(footnote 7)

In the OLS case (last three columns of Table A.8 in the Appendix), all 77 coefficients, aside from the omitted categories, are estimated. However, given the large number of predictors with respect to the sample size and the high degree of collinearity among explanatory variables, all coefficients are imprecisely estimated and most of them are not statistically significant. Moreover, the out-of-sample MSE, which is the measure of the out-of-sample prediction error of the model, is considerably higher than the in-sample MSE (5.763 vs. 0.045, respectively). This result suggests that, even if OLS performs very well within the specific sample we are using, it performs very poorly in predicting external "unseen" observations. Therefore, given the high degree of overfitting and multicollinearity, we can neither identify the main predictors of the geographical spread of COVID-19 across Italian provinces nor obtain a good out-of-sample predictive performance of our model using the standard OLS estimator.

One common problem that arises in machine learning is the lack of a measure of dispersion of the estimated coefficients to evaluate their precision. This limitation cannot be easily overcome with standard methods since the theoretical distribution of the estimator is unknown. Following Hastie et al. (2015), we circumvent this drawback in post-selection inference by using a bootstrap re-sampling method to approximate the data-specific distribution of the estimated coefficients. Based on this bootstrapped distribution, we evaluate how often each of the 77 coefficients is estimated to be different from zero. Specifically, using the elastic net algorithm properly calibrated through 5-fold cross-validation, we take 200 bootstrap replications of the 80% training sample to calculate how many times a given predictor exhibits a non-zero coefficient. Fig.
A.2 in the Appendix shows that, although all potential predictors are selected at least once across the 200 bootstrapped samples, six relevant predictors selected by elastic net (i.e. value-added per employee, average frost days in a year, the dummy that identifies provinces located within 50 km from Lodi, mortality from infectious diseases, the percentage of employment in agriculture, and PM10) have a non-zero coefficient in more than 95% of the 200 replications, while the probability of selecting other predictors is generally lower.

Footnote: [...] possible association between the outcome variable and the unselected predictors. The exclusion of some predictors simply indicates that the provincial variability of those variables is not informative for our specific research objective.

As a first robustness check, we estimate a different model that also includes regional dummies to verify whether our results are driven by region-specific characteristics that modify the set of predictors selected by the elastic net algorithm. Since disease testing policies are managed at the regional level, these controls also capture potential differences in the testing ability and in the degree of measurement error in the registered infections (footnote 8). The results presented in Table 2, which are based on March 21, 2020 data, clearly show that both the selection of the main predictors of COVID-19 spread and all estimated coefficients are robust to the inclusion of the regional dummies and comparable to the baseline model. Moreover, none of the regional dummies is selected by the elastic net algorithm.

Footnote 8: We are aware that in the classical textbook case, left-hand side measurement errors do not lead to any bias (since they are assumed to be random components). However, in our case, left-hand side measurement errors are region-specific due to the Italian institutional framework.

[Table caption: Elastic net regression of log cumulative cases per 100,000 inhabitants on March 21, 2020: sensitivity to the inclusion of regional dummies.]

In a second robustness check we evaluate whether the Euclidean distance control is key for the characterization of the other predictors of the spread. By excluding this explanatory variable, we basically take the perspective of a pandemic event whose triggers are unconditional with respect to the geographical detection of the first registered outbreak. Table 3 shows that baseline results are confirmed even in the absence of the distance controls. No additional variable other than the number of foggy days in a year is selected by the elastic net algorithm, and the size of the coefficients of the single predictors is basically unaffected.

Table 4 presents the results of three alternative models. In the first column we summarize the results of the baseline model estimated by using the log cases of COVID-19 per 100,000 residents registered until March 21, 2020. In the second column, we evaluate how predictors change by considering cases registered between March 22, 2020 and June 3, 2020. Finally, in the last column we re-estimate the second model by including the log cases registered until March 21, 2020 in the conditioning set. This additional control is useful for eliminating the effect of the first stage of the epidemic on the transmission of infections that occurred after March 21 and, thus, on the selection of the relevant predictors of cases that occurred between March 22, 2020 and June 3, 2020. The results shown in Table 4 suggest that the containment measures and the lockdown of economic activities might have reduced the transmission rate of the epidemic mostly by reducing the relative importance of economic factors.
Specifically, moving from Model 1 to Model 3, all economic predictors, except for the percentage of workers in agriculture and the intensity of export relationships (which show a coefficient closer to zero in the updated estimate), are no longer included among the relevant predictors of the epidemic. The same result holds for the dummy that identifies provinces within 50 km from the province of the initial outbreak and for previously selected climate factors (i.e. the number of frost days). On the contrary, in the second stage of the epidemic some additional predictors unrelated to economic factors (i.e. percentage of families with 5 or more members, average family size, municipality density, mortality from pneumonia, mortality rate and mean altitude of the province) become relevant (footnote 9).

Table 4 Elastic net regression of log cumulative cases per 100,000 inhabitants: pre and post containment measures.

In this section we evaluate the predictive performance of our model by adopting different methodological perspectives. We first compare the predictive performance of elastic net with that obtainable with LASSO, ridge regression, and OLS by training each estimator in the 80% training sample and testing the corresponding out-of-sample performance in the 20% hold-out sample. Table 5 shows that elastic net outperforms LASSO, ridge regression, and OLS, providing the lowest out-of-sample MSE (even if OLS, as expected, exhibits the best in-sample predictive performance). It is relevant to note that the out-of-sample prediction error is also very imprecisely estimated using OLS. Specifically, the standard error of the MSE calculated using 200 bootstrapped replications of the hold-out sample is more than 14 times higher in the OLS case than in the case of elastic net.

We then provide graphical comparisons of the predictive out-of-sample performance of the elastic net algorithm and of the OLS estimator using two alternative strategies.
With the first, we provide an additional graphical intuition of the extent to which the elastic net (and other regularization methods) is able to reduce the variance component of the MSE. Specifically, we generate three bootstrapped realizations of the training sample to calibrate and estimate our model. Then, we exploit the three sets of estimated coefficients to predict COVID-19 cases per 100,000 residents registered as of March 21, 2020 in the hold-out sample. Using the OLS estimator, although the in-sample MSE is considerably low (see Table 5), the predicted COVID-19 cases in the hold-out sample vary substantially across the three bootstrapped realizations of the training sample (Fig. A.3 in the Appendix). Moreover, in the OLS case there are some provinces that are predicted to be in the upper decile of the geographical COVID-19 distribution that instead belong to the lowest deciles of the actual distribution. Finally, we find that the OLS estimator highly overestimates COVID-19 cases per 100,000 inhabitants in the top decile and predicts many provinces to have zero COVID-19 cases. Using elastic net and the same three bootstrapped realizations of the training sample to calibrate and estimate the coefficients, we find that the predictive performance in the hold-out sample is very stable and generally close to the actual number of registered cases (Fig. A.4 in the Appendix). Moreover, even if there are some specific provinces for which the algorithm makes an error in predicting the actual decile of the COVID-19 distribution, there are no cases in which a province in the upper deciles is predicted to be in the lower deciles or vice versa. The elastic net algorithm, besides accurately predicting the geographical pattern of the spread of the epidemic, is able to predict the minimum and maximum values of each decile with minimal errors.
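The instability of OLS predictions across bootstrapped training samples can be reproduced in a small simulated setting. The sketch below is only illustrative and rests on several assumptions: it uses two nearly collinear predictors, a ridge penalty as a stand-in for the elastic net, 200 bootstrap draws instead of three, and a single hypothetical hold-out point. All names and values are invented for the example.

```python
import random
import statistics


def fit(rows, lam):
    """Two-predictor least squares without intercept; lam > 0 adds a ridge penalty."""
    n = len(rows)
    s11 = sum(a * a for a, b, y in rows) / n + lam
    s22 = sum(b * b for a, b, y in rows) / n + lam
    s12 = sum(a * b for a, b, y in rows) / n
    s1y = sum(a * y for a, b, y in rows) / n
    s2y = sum(b * y for a, b, y in rows) / n
    det = s11 * s22 - s12 * s12
    # Closed-form solution of the 2x2 (penalized) normal equations.
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


def predict(coefs, point):
    return coefs[0] * point[0] + coefs[1] * point[1]


# Simulated training sample (hypothetical): x2 is almost a copy of x1.
rng = random.Random(1)
train = []
for _ in range(40):
    x1 = rng.gauss(0, 1)
    x2 = x1 + 0.05 * rng.gauss(0, 1)
    train.append((x1, x2, x1 + x2 + rng.gauss(0, 1)))

new_point = (1.0, 0.8)  # hypothetical hold-out observation


def bootstrap_predictions(lam, n_boot=200, seed=2):
    """Refit on bootstrapped training samples and predict the same hold-out point."""
    rng_b = random.Random(seed)
    preds = []
    for _ in range(n_boot):
        sample = [rng_b.choice(train) for _ in train]
        preds.append(predict(fit(sample, lam), new_point))
    return preds


spread_ols = statistics.pstdev(bootstrap_predictions(lam=0.0))
spread_ridge = statistics.pstdev(bootstrap_predictions(lam=0.1))
print(round(spread_ols, 3), round(spread_ridge, 3))
```

Both estimators see exactly the same bootstrapped samples because the same seed is used, so the difference in spread reflects the penalty alone rather than sampling luck.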
The second evaluation methodology of the predictive performance of elastic net and OLS relies on randomly dividing our data into five equally sized hold-out samples. These provincial subgroups are iteratively exploited to predict COVID-19 cases per 100,000 inhabitants, using each corresponding 80% training sample to estimate the coefficients (footnote 10). This strategy helps to further illustrate the extent to which the elastic net algorithm, as compared to the standard OLS estimator, is able to dramatically improve prediction in "unseen" areas. Thus, it could be conceived as a tool capable of predicting the possible geographical diffusion of the epidemic at a given point in time. The geographical distribution of registered cases is shown to be accurately predicted by the elastic net algorithm, whereas the OLS estimator performs poorly, failing to identify the true decile for many provinces (Fig. 3). Specifically, OLS largely over-estimates cases in provinces in the top decile of the distribution, showing an unsatisfactory maximum value of 9,000 cases per 100,000. Moreover, OLS predicts some provinces in the lowest decile to have zero registered cases when, as of March 21, 2020, there were no provinces registering zero infected people per 100,000 inhabitants (compare Fig. 1 and Fig. 3).

Footnote 10: Even in this case we train the elastic net algorithm in each of the five training samples using the λ and α selected by 5-fold cross-validation and compute the out-of-sample MSE in the corresponding hold-out sample.

Source: Authors' elaborations. Notes: In the elastic net case the outcome variable in each 20% hold-out group is predicted after calibrating and training the algorithm using the remaining 80% of the data. In the OLS case the outcome variable in each 20% hold-out group of provinces is predicted after estimating the model in the remaining 80% of provinces.

As of October 2020, Europe has been severely hit by a "second wave" of the COVID-19 epidemic.
The strong containment measures adopted in Italy during the "first wave" prevented the spread of the contagion in the central and southern provinces. This is why, even though in the baseline analysis we control for regional dummies and the Euclidean distance from the first outbreak, some of the results obtained might still be related to provincial characteristics that are specific to the location of the first COVID-19 outbreak. However, as containment measures were relaxed in June, intense tourism movements during summer caused a reshuffling of the provincial distribution of the COVID-19 contagion. This is why the "second wave" of the COVID-19 epidemic in Italy can no longer be related to the geographical location of any initial outbreak. For this reason, we update our analysis by considering cumulative provincial cases per 100,000 inhabitants recorded between September 1 and October 30, 2020 and the same conditioning set exploited in the baseline analysis. This allows us to verify if, and to what extent, the relevant predictors selected on data from March 21, 2020 are confirmed once the spread of the COVID-19 epidemic is unrelated to any initial outbreak and to an infection coming from abroad.

The results summarized in Table 6 show that, once again, many relevant economic variables are selected by the elastic net algorithm. Specifically, richer and more productive provinces are more likely to experience higher infection rates, while rural areas and provinces characterized by high unemployment rates are generally less affected by the COVID-19 epidemic. It is noteworthy that, once COVID-19 outbreaks are spread throughout the country, the variables capturing the degree of international economic connections (i.e. the share of workers in manufacturing and the intensity of export relationships), which are selected as relevant during the "first wave" of the epidemic in March, are no longer selected by elastic net from the prediction set. Additional triggers of the epidemic (i.e. the share of population that lives close to a train station, population density, and the number of people that actually live in the province as a share of the residents) emerge in the "second wave". We assume that these additional triggers remained hidden in our baseline analysis given that the containment measures adopted in early March prevented the spread of the epidemic in some of the most populated Italian cities. As a further result, the number of hours of continuity health care services (Guardia medica) per capita, a measure of the degree of proximity of the Italian health system, is selected as a relevant predictor negatively correlated with the number of infected people per 100,000 inhabitants. Thus, the response to the spread of COVID-19 is probably associated with the capacity of the health system to provide proper territorial medical care. Finally, the geographical spread of COVID-19 cases per 100,000 inhabitants is confirmed to be strongly associated with climate and geographic factors such as the average number of sunny days in a year, the annual number of rainy days, and the mean altitude of the province. This result suggests that the COVID-19 epidemic might be subject to seasonality.

Table 6 Elastic net regression of log cumulative cases per 100,000 inhabitants between September 1, 2020 and October 30, 2020.
Percentage of the population that lives close to a train station: 0.049
Mean altitude of the province: 0.046
People that actually live in the province as a percentage of residents: 0.
Notes: Constant terms and unselected predictors are not shown. The α-λ combination has been selected in the 80% training sample using 5-fold cross-validation. The out-of-sample predictive performance is tested in the remaining 20% of observations.
The analysis we propose here is motivated by the observation of a highly heterogeneous spread of COVID-19 across Italian geographical areas. Such heterogeneity seems to be also characteristic of the pandemic experience of other countries preceding and following the Italian case. The intensity and specificity of the uneven distribution of the spread in Italy falls far beyond that which is conceivable by taking into account only distance and geographical interrelating factors. This signals that other triggers, outside those that characterize a standard transmission mechanics from index to secondary cases are at work. A number of explanations have been proposed in the increasingly rich current literature. Studies are addressing the predictive ability of variables belonging to quite different conceptual clusters. We noted that each potential trigger proposed by the literature has its own rationale, as each one basically captures a different aspect of the human relationships emerging in a connected territorial, economic, and social environment. However, these studies are inherently focused on a specific aspect of the story, thus lacking a central requirement of analyses oriented at maximizing the out-of-sample predictive abilities of a model (i.e. its instrumental validity). Since a central feature of our analysis is that it considers economic and non-economic factors in a unified empirical framework, we adopted an empirical strategy, based on statistical learning, which is able to select, among a large set of potential triggers, those that maximize the out-of-sample predictive power of the selected model. From this perspective, we showed that the elastic net estimator clearly outperforms OLS and other alternative regularization methods such as the ridge regression and LASSO. The results point to a very contained subset of relevant triggers among the 77 included in the analysis. Within this subset, economic factors are shown to play an important role. 
This result signals a possible trade-off between economic intensity and epidemic risks. Since areas that are more productive are socially and economically connected domestically and internationally, they are more exposed to the spread of new infectious diseases. The link between a pandemic crisis and economic activity should thus be considered in both directions: economically developed areas are more likely to be affected by high infection rates due to their intense economic networks. However, since highly infected areas contribute more to value-added, global economic growth may be strongly reduced on the occurrence of a pandemic crisis, irrespective of the containment measures adopted to reduce the infection's transmission rate. Repetition of the analysis on the sub-sample of cumulative cases emerging after a time window in which containment measures were in force highlighted the cancellation of some of these economic triggers. This result is likely to signal the transmission channels of the containment measures adopted by the Italian government. Finally, using provincial data on the early stages of the "second wave" of the epidemic, which cannot be linked to a specific initial outbreak, we confirm that highly productive provinces are more likely to be affected by infectious diseases. The opposite holds for poor areas characterized by higher levels of employment in agriculture. From the simulations of models estimated over bootstrapped samples, we showed that the elastic net algorithm is able to maintain stability and correctness of predictions of the actual spread of the pandemic by minimizing the variance component of the out-of-sample mean squared error. A further interesting result emerged from simulations obtained with models estimated on reduced samples. We showed that these models are able to provide reliable predictions of the pandemic spread also in unseen areas, that is, those territories not considered in the retained sub-samples.
This result highlights the external validity of the model selected by the elastic net. We wish to test whether this property continues to hold in other countries in future research.

We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

| Variable | Description | Source |
| --- | --- | --- |
| Altitude of the province capital | Altitude of the provincial capital measured at City Hall | ISTAT |
| Distance from the first outbreak: less than or equal to 50 km | Dummy for a provincial capital which is 50 km or less away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 51 and 100 km | Dummy for a provincial capital which is between 51 km and 100 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 101 and 300 km | Dummy for a provincial capital which is between 101 km and 300 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: between 301 and 500 km | Dummy for a provincial capital which is between 301 km and 500 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Distance from the first outbreak: more than 500 km | Dummy for a provincial capital which is more than 500 km away from the province of the first outbreak (Lodi) | Own elaboration using latitude, longitude and the curvature constant |
| Municipality density | Number of municipalities per km² | ISTAT (2020) |
| Mean altitude of the province | Mean altitude of the provincial territory | ISTAT |

Source: Authors' elaborations.
Notes: The red map shows the geographical distribution of cases per 10 0,0 0 0 residents on March 21, 2020 in the 21 provinces randomly selected in the hold-out-sample. The blue maps present three different out-of-sample predictions in the 20% hold-out-sample obtained after training the OLS estimator in three alternative bootstrapped realizations of the 80% training sample. The blue maps present three different out-of-sample predictions in the 20% hold-out-sample obtained after calibrating and training the elastic net algorithm in three alternative bootstrapped realizations of the 80% training sample. The geography of COVID-19 and the structure of local economies: the case of Italy What will be the economic impact of COVID-19 in the US? Rough estimates of disease scenarios Covid-induced economic uncertainty Supply and Demand in Disaggregated Keynesian Economies with an Application to the Covid-19 Crisis Italian workers at risk during the Covid-19 epidemic Correlation between climate indicators and COVID-19 pandemic Intergenerational ties and case fatality rates: A cross country analysis Cross-country correlation analysis for research on COVID-19 School closure during novel influenza: a systematic review COVID-19: the case for health-care worker screening to prevent hospital transmission Working from home and income inequality: risks of a 'new normal' with COVID-19 Bowling together by bowling alone: social capital and COVID-19 Managing COVID-19 in Surgical Systems The Covid-19 crisis response helps the poor: the distributional and budgetary consequences of the UK lock-down Modelling transmission and control of the COVID-19 pandemic in Australia Manufacturing or Services-That is (not) the Question': The Role of Manufacturing and Services in OECD Economies Lives and Livelihoods: Estimates of the Global Mortality and Poverty Effects of the Covid-19 Pandemic. 
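The hold-out exercise described in the figure notes (a random 20% hold-out sample, with OLS and the elastic net each trained on bootstrapped realizations of the remaining 80%) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: the data are synthetic, the penalty values `lam` and `alpha` are fixed hypothetical numbers rather than calibrated by cross-validation, predictors are assumed to be roughly standardized, and the unpenalized fit is obtained by running the same coordinate-descent routine with a zero penalty, which only approximates the OLS solution iteratively.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator arising from the L1 part of the penalty."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def fit_elastic_net(X, y, lam, alpha, n_iter=100):
    """Coordinate descent for (1/2n)*RSS + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2).
    With lam=0 this iterates toward the least-squares (OLS) fit."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            zj = sum(c * c for c in col) / n
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            denom = zj + lam * (1.0 - alpha)
            new = soft_threshold(rho, lam * alpha) / denom if denom > 0 else 0.0
            delta = new - beta[j]
            if delta != 0.0:
                resid = [r - c * delta for c, r in zip(col, resid)]
            beta[j] = new
        shift = sum(resid) / n  # re-centre the residuals through the intercept
        intercept += shift
        resid = [r - shift for r in resid]
    return intercept, beta

def predict(model, X):
    intercept, beta = model
    return [intercept + sum(b * x for b, x in zip(beta, row)) for row in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def holdout_comparison(X, y, n_boot=3, holdout_share=0.2, lam=0.1, alpha=0.5, seed=0):
    """Hold-out MSE of the unpenalized fit and the elastic net for each bootstrap draw."""
    rng = random.Random(seed)
    idx = list(range(len(y)))
    rng.shuffle(idx)
    n_hold = round(holdout_share * len(y))
    hold, train = idx[:n_hold], idx[n_hold:]
    X_hold, y_hold = [X[i] for i in hold], [y[i] for i in hold]
    results = []
    for _ in range(n_boot):
        boot = [rng.choice(train) for _ in train]  # resample training rows with replacement
        Xb, yb = [X[i] for i in boot], [y[i] for i in boot]
        ols = fit_elastic_net(Xb, yb, lam=0.0, alpha=0.0)
        enet = fit_elastic_net(Xb, yb, lam=lam, alpha=alpha)
        results.append((mse(y_hold, predict(ols, X_hold)), mse(y_hold, predict(enet, X_hold))))
    return results

# Synthetic stand-in for the provincial data: 107 units, 20 predictors, 2 truly relevant.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(20)] for _ in range(107)]
y = [2.0 * row[0] - 1.0 * row[1] + data_rng.gauss(0, 1) for row in X]

for ols_mse, enet_mse in holdout_comparison(X, y):
    print(f"hold-out MSE  unpenalized: {ols_mse:.3f}  elastic net: {enet_mse:.3f}")
```

With 107 units, a 20% share rounds to 21 hold-out observations, matching the number of provinces mentioned in the notes. The relative performance of the two estimators on the synthetic data depends on the chosen penalty values and says nothing about the paper's empirical results.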