key: cord-0016364-yb5hdke0 authors: Wang, Bocheng title: A novel causality-centrality-based method for the analysis of the impacts of air pollutants on PM(2.5) concentrations in China date: 2021-03-26 journal: Sci Rep DOI: 10.1038/s41598-021-86304-0 sha: 43706bf5f284eadd1d96f0c1a5f371bca3cab73c doc_id: 16364 cord_uid: yb5hdke0

In this paper, we analyzed the spatial and temporal causality and graph-based centrality relationship between air pollutants and PM(2.5) concentrations in China from 2013 to 2017. NO(2), SO(2), CO and O(3) were considered the main components of pollution that affected the health of people; thus, various joint regression models were built to reveal the causal direction from these individual pollutants to PM(2.5) concentrations. In this causal centrality analysis, Beijing was the most important area in the Jing-Jin-Ji region because of its developed economy and large population. Pollutants in Beijing and peripheral cities were studied. The results showed that NO(2) pollutants play a vital role in the PM(2.5) concentrations in Beijing and its surrounding areas. An obvious causality direction and betweenness centrality were observed in the northern cities compared with others, demonstrating the fact that the more developed cities were most seriously polluted. Superior performance with causal centrality characteristics in the recognition of PM(2.5) concentrations has been achieved.

(2021) 11:6960 | https://doi.org/10.1038/s41598-021-86304-0 | www.nature.com/scientificreports/

polycentricity on PM 2.5 concentrations using spatial econometric models based on a three-year panel of data for urban cities in China and used the spatial centralization index and spatial concentration index together to quantify polycentricity. Zhou et al.
14 collected high-resolution PM 2.5 data by mobile monitoring along different roads in Guangzhou, China, and explored the spatial-temporal heterogeneity of the relationship between the built environment and on-road PM 2.5 during the morning and evening rush hours, calculating the betweenness centrality index to measure the pollution impact. Despite all these studies, no research has extended the analysis of meteorology or air pollutants to topological centrality, especially on causality-based adjacency matrices. The causal direction is an important factor in differentiating the mutual functionality of each pollutant in the air.

Recognition of air quality by model training is a future trend in the domain of atmospheric artificial intelligence. Deep learning can be used to achieve accurate prediction with specialized knowledge. Wang et al. 15 collected eight meteorological factors from the 100 most developed cities in China and trained an ensemble boosted tree model with 90.2% accuracy. Huang et al. 16 developed a deep neural network model that integrated the convolutional neural network (CNN) and long short-term memory (LSTM) architectures and collected historical data such as cumulated hours of rain, cumulated wind speed and PM 2.5 concentrations. The feasibility and practicality of the trained model were verified to improve the ability to estimate air pollution, especially in smart cities. In these studies, meteorological or pollutant factors were passed directly through machine learning models, and the intrinsic relationship among these factors was ignored during training. The spatial-temporal characteristics need to be studied more widely and over a larger extent.
In this paper, we studied the air pollutants NO 2, SO 2, carbon monoxide (CO) and O 3 by means of time series from a large number of air monitoring data in the Jing-Jin-Ji region in China and focused on the causal influence of the accumulative process of each pollution component on air PM 2.5. By establishing four joint regression models, we quantitatively analyzed the degree to which each air pollutant influences PM 2.5 to better clarify the formation of haze, and trained a multilayer perceptron model to achieve improved performance compared with other methods.

Figure 1 illustrates the new causality (NC) impacts from the four pollutants on PM 2.5 concentrations. For the inner-city impact, as shown in Fig. 1A, NO 2 has an obvious causal effect on the PM 2.5 concentrations in Beijing and Tianjin, followed by those in Chengde and Tangshan. SO 2 also has a significant causal effect on the PM 2.5 concentrations in Langfang and Cangzhou. In Fig. 1B, the causality of pollutants from peripheral cities around Beijing to the Beijing PM 2.5 concentrations is considered, and NO 2 in Zhangjiakou and Chengde has the greatest influence, followed by CO in Langfang. SO 2 in all the cities bordering Beijing, such as Langfang and Zhangjiakou, has certain impacts on the PM 2.5 concentrations in Beijing. Neither O 3 from the inner city itself nor that from the peripheral cities has a causal impact on the PM 2.5 concentrations, as shown in green. Detailed information on Fig. 1 is listed in Table 1 and Table 2. The column order refers to lagging days in the NC model.

The causality-centrality results are drawn in Fig. 2. The upper row shows the betweenness centrality under the four pollutants in the Jing-Jin-Ji region, and the bottom row shows the clustering coefficient mapping results. A large betweenness centrality is present in the northern cities, especially those adjoining Beijing, such as Chengde (CO and O 3), Langfang (SO 2) and Zhangjiakou (NO 2).
The discriminative ability of the clustering coefficients in Fig. 2B is weaker than that of the betweenness centrality. Although the coefficient values are close to each other, it can still be inferred that pollutants around the Beijing area play an important role in the PM 2.5 concentrations in the Jing-Jin-Ji region.

Figure 3 shows the causal direction among the Jing-Jin-Ji cities under the four pollutants. In Fig. 3A, the causal impacts for CO among each city are modeled by NC. The causalities in Shijiazhuang, Langfang, Baoding Fig. 3D, SO 2 in Shijiazhuang, Tianjin, Hengshui, Cangzhou, Zhangjiakou, Baoding and Handan has a direct causal impact on that in other cities, and Beijing becomes an input-oriented SO 2 polluted city.

Table 3 lists the recognition results with causal centrality measures used in the multilayer perceptron (MLP) model. By constructing a three-class confusion matrix, weather was categorized into 'Fine', 'Bad', and 'Polluted' according to the air quality index, and the corresponding evaluation indicators, including accuracy, precision, sensitivity, and F1 score, were computed with different training parameters. The model was tested with [50, 100, 200] epochs. To accelerate the training process, the batch size was enlarged to 32 when the epoch number was 200.

In this study, the causal centrality characteristics are analyzed for the relationship between the air pollutants and PM 2.5 concentrations of the Jing-Jin-Ji region in China. The NC-based adjacency matrices with causal direction weighting information reveal the basal functionality for the formation of PM 2.5 under air pollutants. Different

Previous studies [26] [27] [28] have widely carried out research on air quality recognition mainly based on meteorological or pollutant characteristics. The centrality measured from the NC method shows superior performance in distinguishing different degrees of air pollution.
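The evaluation indicators named above (accuracy, precision, sensitivity and F1 score) can be derived from a three-class confusion matrix as in the following minimal sketch. The counts in `confusion` are made-up illustrative numbers, not results from Table 3, and the one-vs-rest definition of per-class precision and sensitivity is an assumption, since the text does not state how the indicators were aggregated across classes.

```python
# Hypothetical example: rows are true classes, columns are predicted classes.
LABELS = ["Fine", "Bad", "Polluted"]
confusion = [
    [8, 1, 1],   # true 'Fine'
    [2, 6, 2],   # true 'Bad'
    [0, 1, 9],   # true 'Polluted'
]


def evaluate(matrix):
    """Return overall accuracy and per-class (precision, sensitivity, F1)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[k][k] for k in range(n)) / total
    per_class = []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        sensitivity = tp / actual_k if actual_k else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        per_class.append((precision, sensitivity, f1))
    return accuracy, per_class


accuracy, per_class = evaluate(confusion)
print(f"accuracy = {accuracy:.3f}")
for label, (p, s, f1) in zip(LABELS, per_class):
    print(f"{label}: precision = {p:.3f}, sensitivity = {s:.3f}, F1 = {f1:.3f}")
```

Sensitivity here is the same quantity as recall; a single summary value per indicator could be obtained by averaging the per-class values, which is again an assumption about the reporting convention.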
The method proposed in this study can be considered efficient and practical for training the deep learning model. As shown in Table 3, the number of epochs tested ranged from 50 to 200. The best testing results were generally obtained with the parameter set (epoch = 150, batch = 16). When the epoch number reached 200, nearly all critical classification indicators declined, which suggests that the model was overfitting. For all the models tested in Table 3, NO 2 shows the most effective classification capability, which is consistent with the results above showing that it has the greatest impact on the PM 2.5 concentrations in Beijing and its surrounding areas.

There are some limitations in this study. First, only air pollutants are under consideration. However, air quality is affected by many factors in addition to air pollutants or meteorological factors. These factors should also be considered in the joint regression models. Second, only data from restricted areas in China were collected and analyzed. Air pollution is a complex phenomenon with mutual regional influences, and a vast spatial scale should be covered for the analysis of PM 2.5 formation.

Materials. Data

New causality. New causality theory is derived from Granger causality (GC) theory, which was proposed by Granger. GC theory was first applied in economics and has recently been widely used in neuroscience, global climate change and other scientific domains [29] [30] [31]. A brief introduction is given here. Considering a set of time series, GC exhibits the causal relationship between variables based on past values. In the form of a linear regression model, two time series are assumed to be jointly stationary. The autoregressive representations (Eq. 1) and their joint representations (Eq. 2) are described below, where $i$ and $j$ are integers ranging from 1 to the lagging order $m$ of time series $X$, $a_j$ is the coefficient of $X$, and $t$ represents time.
The noise terms, $\epsilon_i$ and $\eta_i$, are uncorrelated over time and have zero means. The covariance between $\eta_1$ and $\eta_2$ is defined by $\sigma_{\eta_1 \eta_2} = \mathrm{cov}(\eta_1, \eta_2)$. If the past values of variable $X_2$ make the estimation of $X_1$ more accurate, the noise variance $\sigma^2_{\eta_1}$ should be less than $\sigma^2_{\epsilon_1}$. In this case, $X_2$ is said to have a causal influence on $X_1$. However, if $\sigma^2_{\eta_1} = \sigma^2_{\epsilon_1}$, $X_2$ has no causal impact on $X_1$. The GC value from $X_2$ to $X_1$ is therefore defined in Eq. (3).

$$X_{1,t} = \sum_{j=1}^{m} a_{1,j} X_{1,t-j} + \epsilon_{1,t}, \qquad X_{2,t} = \sum_{j=1}^{m} a_{2,j} X_{2,t-j} + \epsilon_{2,t} \tag{1}$$

$$X_{1,t} = \sum_{j=1}^{m} a_{11,j} X_{1,t-j} + \sum_{j=1}^{m} a_{12,j} X_{2,t-j} + \eta_{1,t}, \qquad X_{2,t} = \sum_{j=1}^{m} a_{21,j} X_{1,t-j} + \sum_{j=1}^{m} a_{22,j} X_{2,t-j} + \eta_{2,t} \tag{2}$$

$$F_{X_2 \to X_1} = \ln \frac{\sigma^2_{\epsilon_1}}{\sigma^2_{\eta_1}} \tag{3}$$

There is no causal influence from $X_2$ to $X_1$ when $F_{X_2 \to X_1} = 0$, and if $F_{X_2 \to X_1} > 0$, $X_2$ is said to exhibit GC on $X_1$. For long-term empirical research, the vector of past values in $X_1$ or $X_2$ will be too large to build a regressive model. A general approach for determining the lagged order is the Akaike information criterion (AIC). Many algorithms can be adopted to estimate the coefficients in the joint representations. In this paper, the least squares method is used to solve the equations.

However, the value of Granger causality has been suggested to be inaccurate in some cases. It overlooks the influence of other variances in the multivariable regression model and considers only the noise terms. In 2011, Hu et al. 32 pointed out the limitations and shortcomings of GC and provided plenty of examples in which GC cannot accurately reflect the true causal relationship between variables. The NC method was proposed to avoid these limitations and has been successfully applied to reveal the evident causal relationship between time series. In practice, the defined NC direction is most effective in explaining phenomena observed in nature and human activities, such as the processing of EEG signals, the increase in global temperature caused by the greenhouse effect, and the fluctuation of the stock market in the economy. In Eq.
(2), past values of $X_{1,t-j}$ and $X_{2,t-j}$ occupy a large portion among the three contributors to $X_{1,t}$ or $X_{2,t}$. Based on this, a more appropriate form of causality for multivariate interactions is defined in Eq. (4). Here, $i$ and $k$ are any unequal integers. $D$ represents the causal direction from variable $X_i$ to $X_k$. $m$ is the lagging order in $X_i$ and $X_k$. $N$ is the total length of the observed time series. $n$ is the number of variables. $h$ ranges from 1 to $n$. $t$ ranges from $m$ to $N$. $j$ ranges from 1 to $m$. $\eta_{k,t}$ is the noise term for $X_k$ at time point $t$. In this paper, the causality relationship between pollutants and PM 2.5 concentrations is tested, and the following model (Eq. 5) is built to describe the influence of each component contributing to haze, which appears frequently in the Jing-Jin-Ji region. Each of the four pollutants is represented by Pollutant.

Graph-based centrality analysis. Graph-based centrality analysis is a widely used method for analyzing topological relationships among variables. In this study, each city in the Jing-Jin-Ji region is considered a graph node, the NC value between any two cities is regarded as the weighted edge, and an 11 × 11 square adjacency matrix is generated. Topological centrality measures, including the betweenness and clustering coefficient, are computed based on this matrix. Unlike a matrix based on correlation coefficients, a causality-based matrix captures the causal direction between two factors. Thus, we build four pollutant models, which correspond to four NC adjacency matrices, to analyze the causal importance from pollutants to PM 2.5 concentrations. The betweenness centrality is given in Eq. (6), and the clustering coefficient is defined in Eq. (7), where $\rho_{hj}$ is the number of shortest paths between cities $h$ and $j$, and $\rho^{(i)}_{hj}$ is the number of shortest paths between cities $h$ and $j$ that pass through city $i$. $N$ is the city set in the Jing-Jin-Ji region, and $n$ is the number of cities in $N$.
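As an illustration of the betweenness measure just described, the sketch below counts shortest paths on a small directed, weighted graph and reports, for each node i, the normalized sum of ρ(i)hj / ρhj over all ordered pairs (h, j) that exclude i. Several choices here are assumptions rather than details confirmed by the text: edge length is taken as the reciprocal of the NC weight, the normalization is 1/((n − 1)(n − 2)), and the 4 × 4 matrix `nc` is a hypothetical example, not the 11 × 11 matrix of the study.

```python
import heapq
import math

TOL = 1e-12  # tolerance for comparing floating-point path lengths


def shortest_paths(length, src):
    """Dijkstra from src: return (distances, number of shortest paths)."""
    n = len(length)
    dist = [math.inf] * n
    count = [0] * n
    dist[src], count[src] = 0.0, 1
    done = [False] * n
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if done[u]:
            continue
        done[u] = True
        for v in range(n):
            if v == u or done[v] or math.isinf(length[u][v]):
                continue
            nd = d + length[u][v]
            if nd < dist[v] - TOL:          # strictly shorter path found
                dist[v], count[v] = nd, count[u]
                heapq.heappush(heap, (nd, v))
            elif abs(nd - dist[v]) <= TOL:  # another path of equal length
                count[v] += count[u]
    return dist, count


def betweenness(weights):
    """Normalized betweenness for a directed graph; weights[u][v] is u -> v."""
    n = len(weights)
    # Assumption: a stronger causal weight means a shorter edge.
    length = [[1.0 / w if w > 0 else math.inf for w in row] for row in weights]
    paths = [shortest_paths(length, s) for s in range(n)]
    scores = [0.0] * n
    for i in range(n):
        dist_i, count_i = paths[i]
        for h in range(n):
            dist_h, count_h = paths[h]
            for j in range(n):
                if len({h, i, j}) < 3 or count_h[j] == 0:
                    continue  # need three distinct nodes and a reachable j
                if abs(dist_h[i] + dist_i[j] - dist_h[j]) <= TOL:
                    scores[i] += count_h[i] * count_i[j] / count_h[j]
    return [s / ((n - 1) * (n - 2)) for s in scores]


# Hypothetical NC weights among four nodes; node 1 relays node 0 to nodes 2 and 3.
nc = [
    [0.0, 0.5, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
centrality = betweenness(nc)
print(centrality)
```

In this toy graph the weak direct edge from node 0 to node 2 is longer than the two-step route through node 1, so only node 1 receives a nonzero score.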
$a_{ij}$ is defined as the connection weight between cities $i$ and $j$. Betweenness centrality measures the number of shortest paths that pass through a given city in a communication graph. We use this measure to characterize the importance of each city in the process of pollutant spread. The clustering coefficient can be used to measure the degree of topological clustering of pollutants around cities.

To verify the effectiveness of the causality-centrality-based method proposed in this study, we use the calculated causality-centrality measures in MLP to determine whether these properties would bring superior classification results to the PM 2.5 concentration prediction. MLP is a deep learning model used for classification. It mainly consists of three parts: the input layer (independent variables), the hidden layer (interconnected neural network units) and the output layer (dependent variable). The purpose of MLP is to obtain a prediction model with strong generalization ability by training on labeled input data. An MLP model with a 1024 × 1024 hidden layer is trained with these causality and centrality modalities. Instead of batch normalization, the layer normalization strategy is adopted for standardization with a range of [0, 1]. Principal component analysis is used for dimension reduction, and L 1 embedded feature selection is implemented to avoid sparsification and overfitting. Equation (8) shows the L 1 penalty term added to Eq. (5).
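To make the two causality measures of the New causality section concrete, the following sketch fits the autoregressive and joint representations by least squares and computes both values for one direction. It assumes the standard GC definition (logarithm of the ratio of residual variances) and the NC definition of Hu et al. 32 as the share of the squared contribution of one variable's past values in the sum of all squared contributions plus the squared noise. The helper names, the lag order and the simulated series are hypothetical and are not taken from the study.

```python
import math
import random


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    aug = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(rows, y):
    """Ordinary least squares through the normal equations."""
    k = len(rows[0])
    gram = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    rhs = [sum(r[p] * v for r, v in zip(rows, y)) for p in range(k)]
    return solve(gram, rhs)


def dot(coefs, row):
    return sum(c * x for c, x in zip(coefs, row))


def sum_sq(values):
    return sum(v * v for v in values)


def gc_and_nc(target, driver, m):
    """Return (GC, NC) for the direction driver -> target with lag order m."""
    times = range(m, len(target))
    own = [[target[t - j] for j in range(1, m + 1)] for t in times]
    other = [[driver[t - j] for j in range(1, m + 1)] for t in times]
    y = [target[t] for t in times]

    # Autoregressive representation: target explained by its own past only.
    a_auto = least_squares(own, y)
    eps = [v - dot(a_auto, o) for v, o in zip(y, own)]

    # Joint representation: target explained by both pasts.
    coef = least_squares([o + d for o, d in zip(own, other)], y)
    own_part = [dot(coef[:m], o) for o in own]
    other_part = [dot(coef[m:], d) for d in other]
    eta = [v - p - q for v, p, q in zip(y, own_part, other_part)]

    gc = math.log(sum_sq(eps) / sum_sq(eta))  # assumed GC definition
    nc = sum_sq(other_part) / (sum_sq(own_part) + sum_sq(other_part) + sum_sq(eta))
    return gc, nc


# Simulated data: x2 drives x1 with a one-step delay, but not the reverse.
random.seed(0)
x1, x2 = [0.0], [0.0]
for t in range(1, 500):
    x2.append(0.5 * x2[t - 1] + random.gauss(0, 1))
    x1.append(0.3 * x1[t - 1] + 0.8 * x2[t - 1] + random.gauss(0, 1))

gc_fwd, nc_fwd = gc_and_nc(x1, x2, m=2)  # x2 -> x1
gc_rev, nc_rev = gc_and_nc(x2, x1, m=2)  # x1 -> x2
print(f"x2 -> x1: GC = {gc_fwd:.3f}, NC = {nc_fwd:.3f}")
print(f"x1 -> x2: GC = {gc_rev:.3f}, NC = {nc_rev:.3f}")
```

Both measures should be clearly larger in the simulated driving direction. Unlike GC, the NC value is bounded between 0 and 1, which is convenient when it is used as an edge weight in an adjacency matrix.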
1. The influence of increased population density in China on air pollution
2. Spatio-temporal patterns of air pollution in China from 2015 to 2018 and implications for health risks
3. A comprehensive analysis of the spatio-temporal variation of urban air pollution in China during
4. A review of low-level air pollution and adverse effects on human health: implications for epidemiological studies and public policy
5. Associations between ambient air pollution and hospitalizations for acute exacerbation of chronic obstructive pulmonary disease in Jinhua
6. Population susceptibility differences and effects of air pollution on cardiovascular mortality: epidemiological evidence from a time-series study
7. Maritime transport in the French economy and its impact on air pollution: an input-output analysis
8. Regional and global emissions of air pollutants: recent trends and future scenarios
9. Particulate matters emitted from maize straw burning for winter heating in rural areas in Guanzhong Plain, China: current emission and future reduction
10. Formation process of the widespread extreme haze pollution over northern China in January 2013: implications for regional air quality and climate
11. Severe haze in northern China: a synergy of anthropogenic emissions and atmospheric processes
12. Long term causality analyses of industrial pollutants and meteorological factors on PM2.5 concentrations in Zhejiang Province
13. Mono-and polycentric urban spatial structure and PM2.5 concentrations: regarding the dependence on population density
14. Spatial-temporal heterogeneity of air pollution: the relationship between built environment and on-road PM2.5 at micro scale
15. Applying machine-learning methods based on causality analysis to determine air quality in China
16. A deep CNN-LSTM model for particulate matter (PM2.5) forecasting in smart cities
17. Methane emissions from natural gas vehicles in China
18. Dynamics of urban sprawl and sustainable development in China
19. Research on the relationship between energy consumption and air quality in the Yangtze River Delta of China: an empirical analysis based on 20 sample cities
20. PM2.5 and O3 pollution during 2015-2019 over 367 Chinese cities: spatiotemporal variations, meteorological and topographical impacts
21. Spatiotemporal variation of heat and cold waves and their potential relation with the large-scale atmospheric circulation across Inner Mongolia
22. Quantitative assessment of industrial VOC emissions in China: historical trend, spatial distribution, uncertainties, and projection
23. Anthropogenic atmospheric emissions of antimony and its spatial distribution characteristics in China
24. Nitrate debuts as a dominant contributor to particulate pollution in Beijing: roles of enhanced atmospheric oxidizing capacity and decreased sulfur dioxide emission
25. The impact of the 'air pollution prevention and control action plan' on PM2.5 concentrations in Jing-Jin-Ji region during 2012-2020
26. Importance of meteorology in air pollution events during the city lockdown for COVID-19 in Hubei Province c
27. Meteorology-driven variability of air pollution (PM1) revealed with explainable machine learning
28. PM2.5 diminution and haze events over Delhi during the COVID-19 lockdown period: an interplay between the baseline pollution and meteorology
29. Climate change impacts on cereal crops production in Pakistan
30. Empirical analysis of climate change factors affecting cereal yield: evidence from Turkey
31. Cereal production in the presence of climate change in China
32. Causality analysis of neural connectivity: critical examination of existing methods and advances of new methods

LGF18A010001). B.W. designed the experiments and wrote the whole manuscript text. The author declares no competing interests.
Correspondence and requests for materials should be addressed to B.W.

Reprints and permissions information is available at www.nature.com/reprints.

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.