key: cord-1027269-i1bi5rqu
authors: Roberto, Cazzolla Gatti; Alena, Velichevskaya; Andrea, Tateo; Nicola, Amoroso; Alfonso, Monaco
title: Machine learning reveals that prolonged exposure to air pollution is associated with SARS-CoV-2 mortality and infectivity in Italy
date: 2020-08-21
journal: Environ Pollut
DOI: 10.1016/j.envpol.2020.115471
sha: e711a1b461db5c4c0a5990afaade2cd8cf60dc20
doc_id: 1027269
cord_uid: i1bi5rqu

Air pollution can increase the risk of respiratory diseases, enhancing the susceptibility to viral and bacterial infections. Some studies suggest that small airborne particles facilitate the spread of viruses, including the new coronavirus, in addition to direct person-to-person contagion. However, the effects of exposure to particulate matter and other contaminants on SARS-CoV-2 have been poorly explored. Here we examined the possible reasons why the new coronavirus affected Italian regional and provincial populations differently. With the help of artificial intelligence, we studied the importance of air pollution for mortality and positivity rates of the SARS-CoV-2 outbreak in Italy. We discovered that among several environmental, health, and socio-economic factors, air pollution and fine particulate matter (PM2.5), as its main component, were the most important predictors of SARS-CoV-2 effects. We also found that the emissions from industries, farms, and road traffic, in order of importance, might be responsible for more than 70% of the deaths associated with SARS-CoV-2 nationwide. Given the major contribution played by air pollution (much more important than other health and socio-economic factors, as we discovered), we projected that, with an increase of 5-10% in air pollution, similar future pathogens may inflate the epidemic toll of Italy by 21-32% additional cases, of which 19-28% more positives and 4-14% more deaths.
Our findings, demonstrating that the level of fine particulate matter (PM2.5) is the most important factor in predicting SARS-CoV-2 effects, which would worsen even with a slight decrease in air quality, highlight that the imperative of productivity before health and environmental protection is, indeed, a short-term/small-minded resolution.

The year 2020 started with Chinese authorities alerting the WHO that several cases of unusual pneumonia had appeared in the area of Wuhan (Huang et al., 2020; Wang et al., 2020). In a few weeks, despite travel restrictions, Italy became the first Western country with a serious outbreak of SARS-CoV-2 (Chinazzi et al., 2020). After six months and millions of people infected worldwide, Italy still ranks first among European Union countries in death toll (Dong et al., 2020). During the initial pandemic wave, Italy immediately adopted social distancing measures, but it appears clear that the casualties due to the spread of this new coronavirus have not affected each Italian region equally (Livingston and Bucher, 2020).

Although it is well known that the dynamics of epidemics can be shaped and explained by a combination of several health and socio-economic factors (Onder et al., 2020), the role played by environmental causes is still poorly explored. Nonetheless, there is increasing

We collected the last 5-year (2015-2019) Air Quality Index (AQI) from data provided by http://moniqa.dii.unipi.it/#iqa, which allowed us to synthetically represent the state of air quality at provincial and regional scales, considering at the same time the data of several atmospheric pollutants. This index is a unitless indicator of immediate reading. The calculation of the index is performed by dividing the measurement of the pollutant, by its

We indicated as "Hospital beds" the available beds in public hospitals, per regional population, in 2017.
Data were collected from the Italian Ministry of Health (2017).

Data on the total regional number of "Incinerators" in 2017 were collected from ISPRA (Rapporto Rifiuti Urbani, 2017), and data on the regional number of "Airports" were collected from ENAC (Aeroporti certificati, 2019).

Data for "Traffic" were collected from ACI (Annuario statistico, 2019) as the average regional quantity of fuel (gasoline and regular in tonnes) sold annually from 2005 to 2018.

All data were standardized to the total regional population size.

All the processing and statistical analyses were performed in R version 3.6.1 (R Core Team, 2020).

Following a procedure widely used in the literature, to carry out a statistical comparison between the different resident populations while neutralizing the effects of their different age structures and population sizes, the data on positivity at the regional scale (for which age classes were available) were standardized with the direct method (Curtin and Klein, 1995), while the data on mortality at the regional scale (for which age classes were not available) were standardized with the indirect method (Wilcox and Russell, 1986). The standardization procedure removes the effect of any age differences between two populations while keeping the real differences in disease frequency. Both direct and indirect standardizations involve the calculation of numbers of expected events (i.e. deaths and the number of positive people in our study), which are compared to the number of observed events. Standardized ratios were calculated as the ratio between the observed number of deaths/infected in the study population (regional or provincial) and the number of deaths/infected that would be expected, based on the age- and sex-specific rates in the standard population (Italy) and the population size of the study population by the same age/sex groups.
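As a minimal sketch of the observed-to-expected ratio described above, the following Python snippet computes a standardized ratio from stratum-specific rates of a standard population. The function names, the age classes, and every number are hypothetical placeholders chosen for illustration; they are not data from this study, whose analyses were performed in R.

```python
def expected_events(standard_rates, study_population):
    """Events expected if each stratum of the study population
    experienced the rate of the same stratum in the standard population."""
    return sum(standard_rates[stratum] * study_population[stratum]
               for stratum in study_population)


def standardized_ratio(observed, standard_rates, study_population):
    """Ratio of observed events to expected events (e.g. an SMR)."""
    return observed / expected_events(standard_rates, study_population)


# Hypothetical stratum-specific death rates in the standard population.
standard_rates = {"0-49": 0.0001, "50-69": 0.001, "70+": 0.01}
# Hypothetical population of one region, by the same age classes.
region_population = {"0-49": 500_000, "50-69": 300_000, "70+": 200_000}
# Hypothetical number of deaths observed in that region.
observed_deaths = 2_820

expected = expected_events(standard_rates, region_population)
smr = standardized_ratio(observed_deaths, standard_rates, region_population)
print(f"expected deaths: {expected:.0f}, standardized ratio: {smr:.2f}")
```

A ratio above 1 indicates more events than the standard rates would predict for a population of that size and structure, and a ratio below 1 indicates fewer.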
In the direct method of standardization, age-adjusted rates are derived by applying the category-specific mortality rates of each population to a single standard population. This produced age-standardized positivity ratios (SPR) that Italian regions would have if they had the same age distribution as the standard population (Italy).

The indirect method is based on the ratio between the deaths observed in a territory and those expected in the same territory. The expected deaths were calculated by applying the corresponding specific mortality ratios of the population assumed as standard (the national one in this analysis) to the average annual population by age and sex classes of each territorial unit (regions, in this analysis). The Standardized Mortality Ratio (SMR), therefore, expresses the relationship between the deaths observed in a specific territory and the deaths expected if the same territory had the annual age-specific mortality of the population used as standard.

Random Forest generates an unbiased estimate of the generalization error. Furthermore, the randomization procedure at the base of Random Forest significantly reduces the problem of overfitting (Breiman, 1996).

For our analysis, we adopted a default configuration for Random Forest, with n = 500 trees and m = f/3, with f being the number of features. As previously mentioned, one of the main advantages of Random Forest is the possibility to internally assess the importance of each feature for the model accuracy. We evaluated the feature importance using the mean

where indicates a single tree grown by randomly selecting m variables.

The mean-squared generalization error for each predictor tree is:

(2)

with Y representing the expected numerical outcome.
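The numbered equations referred to in this passage did not survive in this copy of the text. For orientation only, the standard formulation found in the random forest regression literature (e.g. Segal, 2004) can be sketched as below. The symbols h, θ_k and \bar{h} are assumptions borrowed from that literature and may differ from the authors' own notation; n is the number of trees, as in the configuration above.

```latex
% Forest predictor: average of n trees, each tree h(X; \theta_k) grown
% with its own random selection of m candidate variables (assumed notation)
\bar{h}(X) = \frac{1}{n} \sum_{k=1}^{n} h(X; \theta_k)

% Mean-squared generalization error of a single tree predictor h(X),
% with Y the expected numerical outcome
PE_{\text{tree}} = E_{X,Y}\,\bigl(Y - h(X)\bigr)^{2}
```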
Accurate Random Forest regression requires a low correlation between the residuals of differing regressor trees and a minimization of the prediction error function for the individual trees, defined in (2) (Segal, 2004).

In the present work, we addressed the problem of predicting the rates of mortality and positivity in 20 Italian regions based on heterogeneous characterizations. In particular, we built three Random Forest regression models for each indicator (SMR and SPR): the first used the 6 environmental, health, and socio-economic variables as input features; the second was implemented with information on sources of air pollution; finally, in the third we modelled our regressor through the concentrations of 7 main pollutants. We verified the goodness of our models using the root-mean-square error (RMSE) and the coefficient of determination (R²). Random Forest was also employed to project the increase in the SARS-CoV-like estimated mortality (SMR) and positivity (SPR) with a 5-10% higher AQI (i.e. air pollution worsening).

A linear regression analysis was performed and Pearson's correlation coefficient was calculated to study the relationship between the total number of recorded cases at the provincial level (adjusted by the provincial population size) and the provincial AQI. The trend line's equation was then employed to project the increase in the SARS-CoV-like estimated cases with a 5-10% higher AQI (i.e. worsening of air pollution). We also compared the positivity rate of each province with the value estimated with a simple linear model where this rate was expressed as a function of the AQI. The plotted residuals are the differences between the actual and the expected values.
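The provincial analysis described above (least-squares trend line, Pearson's r, residuals, and a projection under a higher AQI) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' R code: the helper names `fit_line` and `projected_increase` are hypothetical, and the five data points are invented placeholders rather than provincial data from this study.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x, plus Pearson's r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Hypothetical provincial AQI values and population-adjusted case counts.
aqi = [40.0, 50.0, 60.0, 70.0, 80.0]
cases = [1.0, 2.5, 2.0, 4.0, 3.5]

slope, intercept, r = fit_line(aqi, cases)

# Residual = actual value minus the value expected from the trend line.
residuals = [c - (intercept + slope * a) for a, c in zip(aqi, cases)]


def projected_increase(reference_aqi, fraction):
    """Relative change in predicted cases when AQI rises by `fraction`."""
    baseline = intercept + slope * reference_aqi
    worsened = intercept + slope * reference_aqi * (1.0 + fraction)
    return (worsened - baseline) / baseline


for fraction in (0.05, 0.10):
    change = projected_increase(60.0, fraction)
    print(f"AQI +{fraction:.0%}: predicted cases change by {change:.1%}")
```

Provinces with large positive residuals have more cases than the trend line predicts from their AQI, and provinces with large negative residuals have fewer.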
As a secondary analysis to assess the robustness of our results obtained with Random Forest, we applied Canonical Correlation Analysis (CCA) (Hotelling, 1992) to find the best correlation between the two following sets of data: the first composed of the 6 environmental, health, and socio-economic variables (Overweight, Smokers, Hospital beds, Income, AQI and Swabs) and the second of the SPR and SMR. Our purpose was to evaluate which of the 6 variables provided the greatest contribution (weight) to the correlation. Given two independent datasets X and Y, CCA identifies the pairs of linear combinations, one from each dataset, most closely related to each other. In other words, CCA detects the best correlation that can be obtained between two independent datasets. Assuming X and Y are two independent datasets composed of N cases with n and m features respectively, the CCA technique finds, for each case, k pairs of canonical variables, k being the minimum of n and m. Each pair has as its first element a linear combination of the features of dataset X and as its second element a linear combination of the features of dataset Y. The k pairs are ordered in such a way that the first pair has a greater correlation than the second, the second has a greater correlation than the third, and so on. In addition to the scores, the CCA algorithm returns the canonical factors (the coefficients to calculate each canonical variable) and the canonical correlations, i.e. the k correlations relating to each pair of canonical variables.

Since one of the two datasets in our study contains only two variables (SPR and SMR), CCA returns only two pairs of canonical variables; the first is to be preferred, as it yields a stronger correlation between the linear combination of SPR and SMR and the linear combination of the six environmental variables.
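In symbols, and as a sketch in standard CCA notation rather than the authors' own, the first canonical pair solves the maximization below. The weight vectors a and b and the covariance blocks Σ are notation assumed here for illustration; with n = 6 predictors and m = 2 outcomes, k = min(n, m) = 2.

```latex
\rho_1 = \max_{a \in \mathbb{R}^{n},\; b \in \mathbb{R}^{m}}
         \operatorname{corr}\!\left(a^{\top} X,\; b^{\top} Y\right)
       = \max_{a,\,b}
         \frac{a^{\top} \Sigma_{XY}\, b}
              {\sqrt{a^{\top} \Sigma_{XX}\, a}\;\sqrt{b^{\top} \Sigma_{YY}\, b}},
\qquad
\rho_1 \ge \rho_2 \ge \dots \ge \rho_k, \quad k = \min(n, m)
```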
In addition, to avoid the problems of overfitting related to the small number of available cases in the model (20 Italian regions), we considered all possible pairs obtained by combining the six variables two by two. We applied CCA to each of these pairs to estimate the best correlation with the SMR and SPR dataset. This analysis confirms what had already been found by applying CCA to the entire dataset. In fact, Fig. 2 shows that only the pairs including AQI have a correlation coefficient above 70% with respect to the SPR and SMR combination.

3. Results

First, we analysed how the last 5-year exposure to air pollution (hereafter Air Quality Index, AQI; see Appendix A for Supplementary Methods), the number of smokers, the level of obesity and overweight, the mean annual income per family, the number of public hospital beds, and the number of SARS-CoV-2 tests performed relate to the rates of mortality (SMR; see Methods) and positivity (SPR) in 20 Italian regions. SMR and SPR are standardized rates that represent the percentage increase or decrease in mortality of a study cohort (regional populations, in this study) compared to the general population (Italy, in this study). We adjusted all rates to the national norm to remove the differences in size and age distribution of the regional populations, which may represent confounding factors (Naing, 2020). Then, we modelled how SMR and SPR vary with these factors with a popular supervised learning algorithm from Artificial Intelligence (A.I.), Random Forests (see

showing a significant and positive correlation with SMR (R² = 0.54; Fig. 1D). This Machine Learning evidence was also confirmed by a Canonical Correlation Analysis (CCA), which showed that SMR and SPR always have higher correlations with the pairwise combinations of the six variables that include AQI (Fig. 2).
Because the impact of the AQI on SARS-CoV-2 infection showed a strong geographical pattern at a regional level, we zoomed our analysis in to the smaller provincial scale to better understand the effect of prolonged exposure to air pollution (Fig. S3). From a regression between the total recorded cases in 99 provinces (adjusted by their population) and the AQI in each of them (Fig. 5A), we confirmed a positive linear trend (slope = 0.52, r = 0.44). Interesting additional elements, however, emerged. Six provinces proved to be evident outliers (Fig. 5B): 5 of them (Cremona, Lodi, Piacenza, Bergamo, and Brescia) had more cases than predicted by the AQI in our model, and one province (Siracusa) had fewer cases than expected from its highest level of exposure to air pollution.

Given the major contribution played by air pollution (much more important than other health and socio-economic factors, as we discovered from the regional analysis), we estimated the expected effects of a decrease in air quality on new potential SARS-CoV-like epidemics. Alarmingly, we estimated that with an overall increase between 5% and 10% of air pollution at the national scale, there would be 21-32% additional cases, of which 19-28% more positives and 4-14% more deaths, to add to the future epidemic tolls of Italy.

Intensive and extensive agriculture and farms, in fact, generate fumes from nitrogen-rich fertilizers and animal waste, which mainly contain ammonia and combine in the air with combustion emissions to form small solid particles. Road transport is a major source of air pollution and is widely confirmed to harm human health and the environment (Samet, 2007). Vehicles emit a range of pollutants including NOx, particulate matter, and O3 precursors. The
The 453 use of biomass such as firewood and pellets for domestic heating is recognized as a heavy threat for air quality because its burning emits black smoke from smokestacks full of PM, 455 sulphur oxides (SO x ), NO 2 , dioxins, and furans (Boman et al., 2003) . that road traffic ranks only third in terms of importance as a source of air pollution shows that 492 it is a contributing, but not the most relevant, compartment affecting the mortality to viral-493 induced respiratory infections, although it emerged as the most important variable related to 494 SARS-CoV-2 positivity, which could mean that this source produces other air pollutants that, 495 affecting the respiratory system, increased the susceptibility to the new coronavirus. 496 We recognize that at a smaller scale, besides air quality, additional and strictly local factors 497 may have played a role in increasing the number of total SARS-CoV-2 cases in some 498 provinces of Lombardy, which would explain the outliers we found in our regression at the 499 provincial level. Although correlation does not imply causation, our findings would 500 recommend significant interventions to reduce the air pollution deriving from industrial and 501 farming production activities, home heating and transports to hopefully weaken the impact of 502 future epidemics. 503 In contrast, according to our model, a province in Sicily could become a new outbreak 504 location if measures to limit this new coronavirus and pathogens with similar respiratory 505 effects are not well managed. 506 As far as we know, this is the first study addressing this problem within a quantitative 507 framework which allows the evaluation of the statistical association between pollution and 508 SARS-CoV-2 effects. Nonetheless, we identified some limitations of our study to improve 509 future research on the same topic. 
First of all, we recognize that at a smaller scale, besides air quality, additional and strictly local factors may have played a role in increasing the number of total SARS-CoV-2 cases in some provinces of Lombardy and North Italy; accordingly, the correlation between pollution and epidemic effects could reasonably include the contribution of other co-factors. This study is far from being an exhaustive and comprehensive examination of all possible factors which may contribute to aggravated viral infections in the human lower respiratory tract; further studies could address this aspect. Lockdown measures and travel restrictions preventing the virus diffusion could have played a relevant role and partially account for regional differences in epidemic mortality and severity. Another important aspect to consider is exposure time: in this study we considered a pollutant exposure of up to 5 years, but it could be interesting to investigate if and to what extent longer exposure has a relevant effect on epidemic severity.

Nonetheless, the use of artificial intelligence to understand the importance of multiple factors represents an advancement over classical statistical analysis, and the machine learning process appreciably increased the accuracy of our model.

Our results showed a strong correlation between SARS-CoV-2 mortality and positivity and the prolonged exposure to air pollution. We designed and evaluated a multivariate model to investigate how different factors would affect pandemic diffusion and severity. In particular, we observed that in our model air quality plays the most relevant role. Interestingly, other factors inherent to socio-economic conditions and lifestyles showed much less importance.
The model is accurate and paves the way for the prediction of future outcomes: a further worsening of air quality might lead to even more dramatic consequences in future pandemics. We also showed stronger evidence that prolonged exposure, particularly to small particles, is strongly related to enhanced SARS-CoV-2 mortality. Together with this evidence, it might also be that pollutant stagnation, resulting from a combination of specific climatic conditions, local anthropogenic emissions and regional topography, promotes a longer permanence of viral particles in the air, thus favouring an indirect diffusion in addition to the direct one from individual to individual. Further investigations can shed more light on these dynamics and the role of small particles in the diffusion of pathogens. However, we already have quite clear proof that prolonged exposure to air pollution in Italy, mainly from highly industrialized and intensively farmed areas and congested roads that produce fine particulate matter, enhances the risks associated with epidemics. This must be taken into account in country-based environmental policies on a global scale because it is highly probable that the relation between air pollution and SARS-CoV-like mortality we discovered in Italy is a more common pattern than previously thought and may worsen the effects of pathogens' spread in other parts of the world.

Overall, our finding that air quality is the most important factor connected with mortality and positivity of SARS-CoV-like pathogens, which would worsen even with a slight increase of pollutants, makes clear that the imperative of productivity before health and environmental protection is, indeed, a short-term/small-minded resolution.
Valuation of air pollution externalities: comparative assessment of economic damage and emission reduction under COVID-19 lockdown
Analysis of the air pollution climate at a background site in the Po valley
Adverse health effects from ambient air pollution in relation to residential wood combustion in modern society
Bagging predictors
Random forests
Machine Learning Benchmarks and Random Forest Regression
Air pollution and infection in respiratory illness
The effect of travel restrictions on the spread of the 2019 novel coronavirus (COVID-19) outbreak
Through the magnifying glass: provincial aspects of industrial growth in post-Unification Italy
Air pollution and respiratory viral infection
Can atmospheric pollution be considered a co-factor in extremely high level of SARS-CoV-2 lethality in Northern Italy?
Direct standardization (age-adjusted death rates). Hyattsville, MD: US Department of Health and Human Services Disease Control and Prevention
Navel V. 2020. COVID-19 as a factor influencing air pollution
An interactive web-based dashboard to track COVID-19
Sources of air pollution
Relations between two sets of variates
Clinical features of patients infected with 2019 novel coronavirus in
Annuario Statistico del Servizio Sanitario Nazionale
in Beijing temporal pattern and its association with influenza
Coronavirus disease 2019 (COVID-19) in Italy
Covid-19 pandemic and environmental pollution: a blessing in disguise?
Easy way to learn standardization: direct and indirect methods
Respiratory syncytial virus bronchiolitis, weather conditions and air pollution in an Italian urban area: An observational study
Assessing nitrogen dioxide (NO2) levels as a contributing factor to the coronavirus (COVID-19) fatality rate. Science of The Total Environment
Mortality and morbidity among people living close to incinerators: a cohort study based on dispersion modeling for exposure assessment
Traffic, air pollution, and health
Airports, air pollution, and contemporaneous health
SARS-CoV-2 spread in Northern Italy: what about the pollution role
Winter air pollution and infant bronchiolitis in Paris
Effect of restricted emissions during COVID-19 on air quality in India
A novel coronavirus outbreak of global health concern
A comprehensive emission inventory of multiple air pollutants from iron and steel industry in China: Temporal trends and spatial variation characteristics
Birthweight and perinatal mortality: III. Towards a new method of analysis
Significant impact of coal combustion on VOCs emissions in winter in a North China rural site
Mitigating ammonia emission from agriculture reduces PM2.5 pollution in the Hai River Basin in China

Funding: The authors received no specific funding for this work.

Author contributions: RCG, AM and NA conceived the study. RCG and NA performed the literature search. RCG, AM and AV collected data. AM, NA, RCG and AT analyzed the data.

Competing interests: Authors declare no competing interests.

All processed data is available in the main text or the supplementary materials. Original data are freely available from the references mentioned in the main text or the supplementary materials. Code and materials used in the analysis will be made available to any researcher for purposes of reproducing or extending the analysis.

The agreement between the SPR actual values and the Random Forests predictions (R² = 0.93), which follows a positive linear trend