key: cord-0070416-wbvvggoq authors: Jabeur, Sami Ben; Ballouk, Houssein; Arfi, Wissal Ben; Khalfaoui, Rabeh title: Machine Learning-Based Modeling of the Environmental Degradation, Institutional Quality, and Economic Growth date: 2021-11-24 journal: Environ Model Assess (Dordr) DOI: 10.1007/s10666-021-09807-0 sha: 192caea3492785d24f571a3be4b5ddb780ae22e0 doc_id: 70416 cord_uid: wbvvggoq This study was aimed at investigating the determinants of environmental sustainability in 86 countries from 2007 to 2018. The natural gradient boosting (NGBoost) algorithm was implemented along with five machine learning models to forecast the trends of CO(2) emissions. In addition, the SHapley Additive exPlanation (SHAP) technique was used to interpret the findings and analyze the contribution of the individual factors. The empirical results indicated that the predictions obtained using NGBoost were more accurate than those obtained using other models. The SHAP value exhibited a positive correlation among the amount of CO(2) emissions, economic growth, and opportunity entrepreneurship. A negative correlation was observed among the governance, personnel freedom, education, and pollution. According to independent analyses by the National Aeronautics and Space Administration (NASA) and the National Oceanic and Atmospheric Administration (NOAA), 2019 was the second warmest year on record and the end of the warmest decade (2010-2019) since modern recordkeeping began in 1880 [1] . The last 5 years have been the warmest among the last 140 years [2] . Climate change is influencing all the countries worldwide. In particular, climate change can disrupt national economies and influence the lives of citizens. Moreover, with the changing weather conditions, sea levels are increasing, and weather events are becoming increasingly extreme. The main challenges among the 17 sustainable development goals (SDGs) were to address climate change and adopt environmental protection measures. In general, SDGs are a call to action for all nations with low, middle, and high incomes to promote prosperity while protecting the planet. As countries move toward rebuilding their economies after COVID-19, it is expected that the stimulus packages can shape the economy of the twenty-first century in a clean, green, healthy, secure, and more resilient manner. In fact, the recent crisis is an opportunity to implement deep and systemic changes to create a more sustainable economy that is beneficial to both people and the planet. To address the climate emergencies, post-pandemic stimulus plans must trigger long-term systemic changes that can alter the trajectory of CO 2 levels in the atmosphere. In recent years, governments worldwide have spent considerable time and effort in developing plans to chart a safer and more sustainable future for the citizens. Incorporating these plans at present, as part of recovery planning, can help the world recover better from the current crisis. This aspect is a strong motivation to review the energy policy of developed countries from the perspective of climate change. Governments have increasingly sought to strengthen environmental regulations or laws and generate environmental awareness. In this regard, this study is aimed at examining the impact of both these policy instruments on CO 2 emissions. Although the relationships among energy consumption, CO 2 emissions, and economic growth have been widely debated, the results have been inconclusive. In recent decades, the field of energy policy has undergone considerable development owing to the use of increasingly sophisticated research methods [3, 4] . Nevertheless, this study demonstrates the need for more rigors in quantitative energy and environment research, driven by recent developments. In particular, empirical methodologies in energy and CO 2 emission research involve notable limitations in terms of the validity of the estimated coefficients and their elasticities because the applied tests are not based on an appropriate quantitative framework. For example, in many analyses, diagnostic statistics and specification tests that are necessary to obtain unbiased and consistent regression results are often not performed or considered suitably. The key contributions of this study are as follows. First, seven factors affecting CO 2 emissions are examined comprehensively. The factors are divided into three categories: economic environment, legislative environment, and environmental awareness. To the best of our knowledge, this is the first study in which the proposition of Akerlof [5] regarding the impact of democratic participation practices on CO 2 emissions is evaluated. Two indicators are considered to measure democratic participation: personal freedom and fair elections and political participation in governance. Second, this study examines alternative tools that can overcome the limitations of the existing quantitative approaches. Reliable alternative tools based on advanced machine learning models, namely, natural gradient boosting (NGBoost) techniques, XGBoost, LightGBM, neural networks, and random forest (RF) algorithms are considered. Notably, in the existing work, the gradient boosting framework has not been used to examine the relationship among CO 2 emissions, economic growth, energy use, and institutional conditions. Third, the importance of the input variables on environmental degradation is evaluated, as none of the existing empirical studies has employed the SHapley Additive exPlanation (SHAP) value and game theories to integrate interpretable machine learning in analyzing the complex nonlinear behavior of CO 2 emissions. The remaining paper is structured as follows. Section 2 discusses the existing literature. Section 3 presents the methodology and considered data. Section 4 describes the experimental results and discusses the main findings. Section 5 presents the concluding remarks and describes the scope for future work. Researchers have explained the high intensity of carbon emissions considering the increase in energy use owing to the rapid economic growth [6] . Among carbon emission studies, the peak theory of pollutant emissions has been widely accepted in many countries [7, 8] . CO 2 emission peaks have been tested using the environmental Kuznets curve (EKC) [9, 10] . In particular, an inverted U-shaped relationship was considered to exist between the gross domestic product (GDP) and pollutant emissions, as the level of environmental damage first increases to a peak with an increasing GDP and later declines [11] . This observation is aligned with the peak of energy use, which corresponds to a regulatory trend in the long term. In the industrialization phase, energy use initially rises to a peak value and later declines. The turning point is caused by the structural changes from the higher energy consumption of heavy industries to the low consumption of light industries. Moreover, the product structure shifts from general value-added to high value-added products and from material production to knowledge production [12] . Many empirical studies have approved this U-shaped relationship. Ben Cheikh et al. applied the nonlinear panel smooth transition regression to examine the Middle East and North African (MENA) region and reported on the inverted U-shaped pattern for the impact of energy on CO 2 ; moreover, the authors reported that an increase in the GDP significantly affected the pollutant emissions, thereby increasing the energy consumption of heavy industries [13] . Ehigiamusoe et al. examined the influence of the energy use in the CO 2 emissions-income relationship for 64 middle-income countries and quantified the effect of the GDP on the CO 2 emissions at different levels of energy use [14] . The authors applied multiplicative interaction models and noted that energy consumption affected the relationship between CO 2 emissions and income. Thus, we hypothesize the possible effects suggested from the theory. Hypothesis (H) 1: CO 2 emissions follow a Kuznets curve which increases with economic development and then decreases. The relationship between entrepreneurship and environmental quality has attracted considerable attention from policymakers in the search for solutions to attain environmental sustainability. According to Omri, a U-shaped relationship exists between entrepreneurship and pollution, albeit only in high-income countries [15] . Dhahri and Omri [16] concluded that entrepreneurship contributes positively to the economic and social aspects of sustainable development and negatively to the environmental dimension. Moreover, several authors [17, 18] investigated the effect of the entrepreneurship opportunity on the quality of the environment and indicated a negative relationship between the two variables. Hence, we assume that the level of entrepreneurship opportunity is associated positively with CO 2 emissions. H2: The level of entrepreneurial opportunity is associated with an increase in CO 2 emissions. In this regard, to achieve low-carbon objectives, although the overall trend of the CO 2 Kuznets curve cannot be adjusted, factors to adjust the shape of the EKC can be identified [19] . When executing a low-carbon environmental policy, the radius of the U-shaped curve is expected to be smaller, and a horizontal path can be generated in which the inflection point occurs earlier [20] . A key factor to flatten the curve pertains to environmental law and regulations. Considering the data of China, Yin et al. [21] concluded that the EKC under liberal environment regulations gradually tended to the horizontal orientation to fall below the per capita CO 2 emissions and required considerable time to reach the inflection point. In contrast, the EKCs under stringent environment regulations increased rapidly, and after reaching the inflection point, the pollutant emissions reduced rapidly [21] . Moreover, Pang et al. reported that the government policies on environmental regulations could enhance energy savings and emission reductions [22] . In this regard, effective governance is essential to boost the impact of environmental regulations. According to Samimi et al., environmental governance is necessary to protect environmental quality and ensure sustainable use of resources [23] . Different measures of governance exert different effects on the CO 2 emissions [24] , and the dimensions of governance include the institutional quality, quality of service, corruption index, accountability civil rights, and political liberties [25, 26] . The daily actions of people form the natural and social environment in which they reside according to their needs and their consumption of resources (water, food, and energy), the waste they produce and the way in which they treat it, and the government policies they approve. Numerous studies have shown that the direct influence on the actions of individuals is derived from information and awareness [26] [27] [28] . Furthermore, the measures to mitigate climate change cannot be implemented without public encouragement and engagement. Environmental education is a key element in establishing the ecological quality of a nation [29] . It has been reported that an increase in environmental awareness can serve as a tool to formulate energy policies [30] . Akerlof explains the relations between knowledge, attitudes, and behavior. He suggests that in some cases, public awareness as an end goal is a highly necessary political and programmatic goal for governments, effectively an inevitable goal for liberal democracies. Akerlof recommended three conditional elements to effectively exploit environmental awareness to solve problems related to the environment. First, a policy must be established to facilitate behavioral adjustment among individuals. Second, to achieve satisfactory results, decision-making institutions must involve democratic participation practices. Finally, there must be a direct focus for community-level changes in education, morals, and values [5] . Our final hypothesis captures these necessary factors of environmental awareness: H3a: The level of education is associated with a decrease in CO 2 emissions. H3b: The level of governance is associated with a decrease in CO 2 emissions. H3c: The level of personal freedom is associated with a decrease in CO 2 emissions. Linear regression (LR) represents the gold standard technique in the machine learning domain. The method of ordinary least squares (OLS) is commonly used to estimate the parameters of the intercept and slope regression. The model can be expressed as follows: where Y is the output variable, X i is the explanatory variable, and i is the parameter estimated through OLS regression. The limitation of LR is that it only clarifies the interaction between the mean input variables and output variables. Nevertheless, the functional form of the link between the target variable Y and predictor X can be obtained in a flexible manner through advanced machine learning models. In the neural network (NN) regression technique, a forwardstructured artificial neural network is used to map a collection of input vectors to the output vector collection. The NN consists of several layers of nodes, with each layer linked to the subsequent layer [31] . In addition to the input nodes, each node is a neuron with a nonlinear activation function that can be expressed as follows: where Y j is the predicted value of the jth neuron, X i is the input variable of the ith neuron, and W ij is the weight between the ith and jth neurons in the following layer. In the literature, many activation functions have been used, such as tangent hyperbolic, sigmoid, sinusoid, logistic, and wavelet. They are represented by the following equations: Tangent hyperbolic: The sigmoid function, also called the logistic function: Morlet wavelet function: where b1 and b2 are actual variables that will be adjusted as weights between neurons during training. We used the logistic function in this paper since it better reflects the asymmetric behavior of the economic cycle. Furthermore, this function highlights that the expansion and recession phases have different dynamics. Usually, backpropagation algorithms are used to run multi-layer perceptron NNs, which represent the most commonly deployed NN. In the basic framework, gradient search techniques are used to minimize the mean square error between the real output value of the network and the predicted output value [32] . In recent years, RFs have attracted considerable attention in the social science domain owing to their high efficiency, ease of execution, and low computational costs. This ensemble learning methodology was developed by Breiman, centered on creating multiple regression trees [33] . The bootstrap samples obtained from each tree were used to train the whole training set [34] . The output can be obtained as follows: where h(x) is a set of kth learner random tree and X is the vector of the input variables. A new training set is generated by replacing the original data for each constructed regression tree. To enhance the predictive capacity of the model, the model hyperparameters must be tuned using the validation dataset. LightGBM, proposed by Ke et al. [35] , is a robust gradient boosting framework, which is less computationally expensive and has a higher training speed and lower memory consumption compared to those of the gradient boosting decision tree algorithms. The LightGBM algorithm is based on two innovative strategies, i.e., gradient-based one-side sampling and unique function bundling. The final output model can be established as follows: where f t (X i ) is the learned function of the tth decision tree and M is the number of trees. Details regarding the calculation procedures of LightGBM can be found elsewhere [35] . The predictive performance is considerably affected by the hyperparameters, and thus, before using LightGBM, the number of hyperparameters must be determined. Typically, gradient boosting approaches can obtain the most precise predictions over standardized or tabular input data. Recently, an innovative approach named NGBoost was developed [36] , which could be used to calculate the statistical uncertainty through the gradient boosting technique by using probabilistic forecasts (including actual valued outputs). In general, NGBoost requires considerably less expertise to operate than the other relevant approaches and performs well on standard metrics. NGBoost achieves notably satisfactory results on small data sets. The predicted output can be expressed as follows [36] : where θ is obtained by a mixture of additive M base learner outputs, g (m) is the set of base learners, ρ (m) is the scaling factor per stage, and is the common learning rate. Although the performance of the NGBoost is comparable to that of the existing probabilistic regression approaches, it exhibits several other advantages. Specifically, NGBoost is modular, scalable, and easy to use. The algorithm can be scaled to handle large variables or observations with reasonable complexity, in contrast to classical boosting algorithms. XGBoost [37] is a profiling technique widely used by the scientific community to solve classification and regression problems. The algorithm combines strong classifiers to obtain a strong tree model using a serial training process [38] . Overfitting is prevented by introducing a regularization term. New trees are constantly generated and trained to fit the last iteration. The predicted value can be defined as follows: where K denotes the number of trees and f k X i is the newly generated tree model of the input X i for the kth tree. When designing the XGBoost with enhanced predictive performance, suitable hyperparameters must be calculated. Specifically, an appropriate set of values for the hyperparameters essential to the establishment of the model should be defined and tested. In the present study, grid search with cross-validation was employed to identify the optimal parameters. In the grid search technique, the hyperparameters are tuned to evaluate the optimum values for a given model. This aspect is important as the output of the entire model depends on the defined hyperparameters. In general, machine learning algorithms involve several hyperparameters to be tuned. Adequate hyperparameter regulation can help avoid model overfitting [39] . In this study, a grid search with 10 cross-validations (GridSearchCV) was performed to tune the RF, NN, Light-GBM, NGBoost, and XGBoost. Appendix Table 6 lists the optimal parameters considered in this study. Usually, interpreting the results of machine learning models is very difficult owing to their complex "black box" architecture. The ability to correctly interpret the output of a prediction model is extremely valuable. It creates the appropriate user trust, provides information on improving a model, and helps to understand the modeled process. In response, several tools and methods have recently been suggested for interpreting the predictions of complex models, but it is often difficult to know how these methods relate and when one method is preferable to another. To solve this problem, Lundberg and Lee [40] presented a unified framework for the interpretation of predictions, SHAP. The SHAP value is a better tool than others, such as the local interpretable modelagnostic explanations (LIME), the partial dependence plot (PDP), or ELI5 algorithms, based on three characteristics. First, regarding the "overall interpretability," the SHAP values indicate how much each predictor contributes, positively or negatively, to the target variable. Second, regarding the "local interpretability," each observation obtains its own set of SHAP values. Third, SHAP values can be calculated for any tree model, whereas other methods use linear regression or logistic regression models as alternate models. To understand and evaluate the drivers to estimate the CO 2 emissions and to compare the importance of individual features, we calculated the SHAP values [41] . In general, the SHAP values test each combination of predictors to assess the effect of each predictor. Based on the game theory and conditional assumptions, the SHAP values are likely to be consistent across various trees [42] . To calculate the SHAP values, the output of a tree centered on a subset of functions, S, is defined as g x (S) = [E(g(x)], and the SHAP values are expressed as where M is the number of input variables. To improve the interpretability of gradient boosting decision tree algorithms, a new explanation tool was proposed [41] , which directly measured the local feature interaction effects. In particular, TreeExplainer, an explanation technique for trees allows the optimal local explanations to be computed considering the desirable properties according to the game theory. In this study, the SHAP explanation of the optimal model was examined through Python packages, to assess the importance of each factor. Panel data of 86 countries for the period 2007-2018 were considered. The world development indicators (WDI, 2020) were considered to extract the CO 2 emissions (metric tons per capita), gross capital formation (% to the total GDP), energy use (kg as oil equivalent per capita), and GDP per capita (constant 2010 U.S). For the institutional variables pertaining to governance, education, personal freedom, and business environment, the Legatum Institute database (2019) was considered. The definitions of the considered variables are presented in Table 1 . The causal link among the CO 2 , energy use, economic growth, and institutional conditions can be expressed as Figure 1 shows the intensity of the relationships among the variables. The correlation matrix indicates that the pairs EU-CO 2 , GDP-GOU, BE-GOU, and GOU-PF are strongly correlated. Table 2 summarizes the statistics for the considered variables. The EU and GDP exhibit the largest standard deviation and mean for all the countries. The higher standard deviation also implies that the EU and GDP series is more volatile than the other variables. Moreover, the skewness coefficients for the GCF, EDU, and PF are negative and those for the CO 2 , EU, GDP, BE, and GOU time series are positive. Furthermore, the kurtosis coefficients for all the time series are positive, except for those of BE, GOU, and PF. None of the values for the kurtosis is equal to three, suggesting that the fat-tailed distribution occurs. The test for normality is indicated by the Shapiro-Wilk statistic, and a non-normal distribution is suggested. The initial stage of the analysis is to investigate the properties of the data. To experimentally evaluate whether the macroeconomic variables are stationary, five nonlinear unit root tests are applied. We used (FFFFF) [43] , (CEO) [44] , (OSHa) [45] , (OY) [46] , and (OEHa) [47] tests. The source of nonlinearity of the employed unit root tests has been identified as a time-dependent nonlinearity, as well as a Tables 3 and 4 . Based on the FFFFF and CEO tests, we found that all variables are stationary at significance level and first difference. Moreover, the results tabulated in Table 3 suggested that all variables have a stationary process and nonlinear trend around the deterministic component at various significance levels. According to Omay et al. [48] , these nonlinearities may emerge simultaneously in the data generating process to identify these hybrid structures (Table 4) . OSHa unit root test provides evidence of stationarity for seven stationary variables, except GCF. Moreover, OEHa (2018) test states the same results, whereas OY (2014) reveals stationarity for all variables. Our results are consistent with Omay et al. [48] , who documented that future movements in macroeconomic variables may be predicted based on previous behavior, because time series data on the state level follow meanreversing processes. To conclude, all the variables that will be included in machine learning models were found to be stationary. We designed artificial intelligence models to predict CO 2 emissions. Six machine learning approaches, namely, the LR, RF, NN, LightGBM, NGBoost, and XGBoost, were applied. The training and test data were divided into a ratio of 70:30; 70% of the data were used for model training, and 30% were used to examine the efficiency and accuracy of the proposed model. Four metrics were considered to evaluate the performance of the forecasting models: mean absolute error (MAE), root mean square error (RMSE), mean absolute percentage error (MAPE), and coefficient of determination (R 2 ). As the predictor variables included several macroeconomic variables with varying sizes and domains, data normalization was performed to access a homogeneous dataset. Specifically, we normalized the initial data series for all the variables (X i ). Table 5 lists the R 2 , MAE, RMSE, and MAPE values for the testing data for all the models. NGBoost exhibits the best predictive performance in terms of all the metrics, with considerably higher accuracy than that of the XGBoost, RF, LightGBM, NN, and LR. In addition, NGBoost exhibits the largest R 2 of 0.907 and smaller values of the MAE, RMSE, and MAPE compared to those of the five techniques. LR exhibits the worst performance. The NGBoost model considerably enhances the precision of the forecasting method. Moreover, because NGBoost is a complex nonlinear model, it provides the relative value of the individual features used to construct the model. Using the measured SHAP function value, we compared the global importance of all the characteristics. The SHAP analysis was performed for different communities by using the trained NGBoost model. In particular, the SHAP values define the significance of a feature by evaluating the forecasted value with and without considering the feature. Figure 2 presents the SHAP global feature importance plot for the optimal machine learning model. Seven variables notably influence the CO 2 emissions. Moreover, the highest contribution to pollution corresponds to the GDP, followed by energy use, education, personal freedom, governance, business environment, and gross capital formation. GDP appears to be the most important variable. A review of the relevant studies indicates that economic development (GDP per capita) is likely the most significant factor [40] [41] [42] [43] . Furthermore, energy use appears as the second most important factor. These findings are in line with those reported previously [44] [45] [46] [47] . However, energy consumption and economic growth negatively influence the environment. Policymakers need to be very careful when choosing between taking care of the environment and stimulating economic growth. However, it is difficult to formulate economic and environmental policies while promoting economic growth and protecting the environment from degradation. This is why environmental awareness is a key factor in moderating the relationship between economic growth and environmental degradation. In fact, environmental awareness promotes environmentally friendly behavior without negatively affecting economic growth. In Fig. 3 , each point on the overview plot corresponds to the SHAP value for a characteristic of the feature. The location on the x-axis corresponds to the SHAP value, and the color indicates the intensity, ranging from low (blue) to high (red). Figure 3 indicates that strong positive relationships exist between the GDP and CO 2 emissions. In addition, higher values of the GDP result in higher values of the SHAP, corresponding to a higher probability of an increase in the CO 2 emissions. This observation is consistent with the previous findings [51] , which indicated that the economic growth positively and significantly influences the CO 2 emission levels based on unbalanced panel data for 128 nations for the period 1990-2014. Energy use is the second most important variable; higher values of this factor are expected to aggravate environmental degradation. This finding is in line with those reported previously [52] , which indicated that the CO 2 emissions increase owing to the increase in energy consumption. Education is the third most important variable, and it can be noted that higher values of education correspond to lower values of CO 2 emissions. This evidence is consistent with H3a that education enhances environmental awareness by understanding the harmful consequences of emissions and fosters environmentally friendly behavior. The findings are in line with the prior work of Stef and Ben Jabeur [53] . Moreover, H3c is confirmed; high values of personal freedom decrease the volume of CO 2 emissions as stated by Brulle (2010, p. 91): "It is well known that political mobilization campaigns are more effective and legitimate if they engage citizens in a sustained dialog rather than treating them as a mass opinion to be manipulated. The importance of public participation in developing decisions that include concern about the natural environment has been stressed by numerous authors." Fig. 2 indicates that effective governance can moderate the amount of carbon emissions. These findings are in line with those reported previously [54] and those reported by Stef and Ben Jabeur, indicating that effective governance can mitigate environmental degradation and enhance environmental quality. Figure 3 shows the impact of the business environment on environmental degradation. In particular, although entrepreneurial opportunity increases the profits, the ecosystem is destroyed as the environmental quality is considered to be a luxury at this point. These findings confirm our second hypothesis (H2) that the level of entrepreneurship opportunity affects the quality of the environment. Our results are in line with those reported previously [15] . A slight positive effect can be noted in terms of the gross capital formation, although high levels of the gross capital formation may lead to considerable degradation of the ecosystem. The SHAP values can be used to obtain the partial dependence plot, which shows the marginal effect of one or two variables on the predicted CO 2 emissions. Figure 4 shows the SHAP dependence plots of the NGBoost model, and the findings validate the aforementioned observations regarding the factors and outcome. Specifically, Fig. 4a shows that a nonlinear relationship does not exist between the GDP and CO 2 , and initially, CO 2 volumes increase with increasing values of the GDP and then it decreases. The findings appear to confirm H1, the EKC hypothesis, according to which, the economic growth and environmental degradation exhibit an inverted U-shape. This result is in line with several existing reports pertaining to the EKC [49, 50] . Moreover, the dependence plot indicated the presence of a positive relationship between energy use and CO 2 emission volumes. Using the SHAP interaction values, the effects of a feature on a single sample can be decomposed into the significant influence and interactions with other variables. Lundberg et al. proposed a modern method of interpretation in which the local factor interaction effects were calculated [41] . To perform the local interpretation, we selected two highly polluted countries, namely, Pakistan and Bahrain, and two highly clean countries, namely, Switzerland and France. The prediction of the output using the NGBoost model is illustrated in Fig. 5 . The red and blue variables indicate that the outcome is higher and lower than the base value, respectively. Figure 5a shows the contribution of each feature to the model prediction for Pakistan in 2016. The base value is equal to 5.57, and the model prediction is 3.18. On the right side of the forecast value, the diagram reflects certain characteristics in blue that drive the prediction to lower values, and on the left side, in red, one can observe the characteristics that drive the prediction to higher values. It can be noted that the GDP (1159) and EU (199) are associated with the decrease in CO 2 emissions. In contrast, the GOU, EDU, and PF are not notably associated with the increase in carbon dioxide emissions. In other words, emerging countries do not focus considerably on institutional conditions and climate change policies. In the case of Bahrain, as shown in Fig. 5b , the high values of EDU and EU cause the CO 2 emissions to decrease, and the GOU and GDP values cause the predicted output to increase. In other words, Arab countries involve less democratic aspects and personal freedom. Therefore, Arab countries as well as emerging authorities which have less democratic aspects and less personal freedom are recommended to target awareness raising as a main element of their communication campaigns and education programs in the pursuit of pro-environmental and practical decisions by the public. Figure 5c shows the local interpretation for Switzerland. The EDU, PF, GOU, BE, and EU are associated with lower CO 2 emissions. In other words, developed countries with a high degree of freedom, more educated people, and strong environmental laws tend to be more respected [53] . Nevertheless, the GDP contributes to environmental degradation. In the case of France, as shown in Fig. 5d , the EDU, PF, BE, and GCF are associated with lower values of CO 2 emissions. In contrast, the GDP is associated with higher CO 2 emissions. The causal relationships between the CO 2 emissions, economic growth, energy use, and institutional conditions were examined through machine learning models, which have attracted considerable interest in academic and policy debates. To interpret the complex and black-box models, such as those pertaining to the feature importance and local effects, the SHAP value was employed. Environmental pollution is a multi-faceted challenge; every stage of environmental pollution is related to economic growth. Our empirical results show an EKC relationship between output growth and environmental degradation, which can be explained in a multidimensional way. The EKC literature emphasizes that there is no single policy that can reduce pollution levels and increase economic growth at the same time. Therefore, our study provides evidence that environmental awareness is a crucial process for reducing CO 2 emissions without negatively affecting economic growth. This observation is a strong reason for its consideration in public policies. Environmental awareness must be considered in public education and participatory decision-making processes for governments to maintain democracy. The methods by which environmental awareness is most likely to be achieved are not through campaigns using small messages; instead, they are more likely to be group processes within and between universities, schools, cities, communities, and nations [5] . Governments will still need to limit certain behaviors that damage the environment and encourage others. When implementing soft policy processes, environmental awareness should be the main component, but for optimal results several other behavioral components and interventions as well as regulations, taxes, and other financial incentives, must be implemented. In addition, further improvement in governance in emerging countries can help further reduce CO 2 emissions, as better governance implies better access to knowledge and democratic independence, thereby improving the awareness of citizens on environmental protection and fostering public interest and support for environmental legislation. From a practical viewpoint, using advanced and interpretable machine learning models, the level of CO 2 emissions for large panel data can be predicted, thereby enabling the interpretation of the effect of each feature. Owing to this interpretability, machine learning is an effective tool for organizations and policymakers who require a deep understanding of the data and to help energy-consuming countries reduce excessive energy consumption to improve the environmental quality. Nevertheless, this study involves two main limitations. Future work can be aimed at extending the proposed research framework by integrating more institutional conditions factors such as food security, health, and poverty. Second, more advanced machine learning and interpretable frameworks must be incorporated. As future directions, we suggest the following. (i) For a predictive accuracy purpose, hybrid predictive approaches can be developed, for instance, by combining the complex network and traditional artificial neural network (ANN) for carbon emissions forecasting. (ii) Other types of machine learning techniques can be used, such as support vector machine (SVM) and deep learning models: convolutional neural network (CNN) and long short-term memory (LSTM). (iii) Future research studies may also focus on the key sectors of different economies, such as energy, industrial, and services. Code Availability Not applicable. Competing Interests Not applicable. NASA, NOAA Analyses Reveal NASA Goddard Institute for Space Studies Modelling and forecasting CO2 emissions in the BRICS Growing green? Forecasting CO2 emissions with environmental Kuznets curves and logistic growth models When Should Environmental Awareness Be a Policy Goal? Modelling the CO2 emissions and economic growth in Croatia: Is there any environmental Kuznets curve? Energy The impacts of globalization, financial development, government expenditures, and institutional quality on CO2 emissions in the presence of environmental Kuznets curve Investigating the environmental Kuznets curve hypothesis in Kenya: A multivariate analysis Does natural gas consumption mitigate CO2 emissions: Testing the environmental Kuznets curve hypothesis for 14 Asia-Pacific countries. Renewable and Sustainable Energy Reviews To examine environmental pollution by economic growth and their impact in an environmental Kuznets curve (EKC) among developed and developing countries Economic growth and the environment The nature of CO2 emission Kuznets curve On the nonlinear relationship between energy use and CO2 emissions within an EKC framework: Evidence from panel smooth transition regression in the MENA region The moderating role of energy consumption in the carbon emissions-income nexus in middle-income countries Entrepreneurship, sectoral outputs and environmental improvement: International evidence Entrepreneurship contribution to the three pillars of sustainable development: What does the evidence really say? World Development Opportunity-based entrepreneurship and environmental quality of sustainable development: A resource and institutional perspective How can entrepreneurship and educational capital lead to environmental sustainability? Structural Change and Economic Dynamics Environmental Kuznets curve: the prediction and the analysis of influencing factors of the CO2 of China Making economic growth more sustainable The effects of environmental regulation and technical progress on CO2 Kuznets curve: An evidence from China Pollute first, control later? Exploring the economic threshold of effective environmental regulation in China's context Governance and environmental degradation in MENA region Carbon dioxide emissions and governance: A nonparametric analysis for the G-20 Governance, institutions and the environmentincome relationship: A cross-country study A review of intervention studies aimed at household energy conservation Sustainable consumption. Handbook of Sustainable Development Fostering sustainable behavior: An introduction to community-based social marketing (Third Edition) How renewable energy consumption contribute to environmental quality? The role of education in OECD countries Measuring the effect of procrastination and environmental awareness on households' energy-saving behaviours: An empirical approach Revisiting oilstock nexus during COVID-19 pandemic: Some preliminary results Research on calculation method of free flow discharge based on artificial neural network and regression analysis. Flow Measurement and Instrumentation Random Forests Random forest regression for improved mapping of solar irradiance at high latitudes LightGBM: A highly efficient gradient boosting decision tree NGBoost: Natural gradient boosting for probabilistic prediction XGBoost: A Scalable Tree Boosting System Developing window behavior models for residential buildings using XGBoost algorithm Enhanced credit card fraud detection based on SVM-recursive feature elimination and hyperparameters optimization A unified approach to interpreting model predictions From local explanations to global understanding with explainable AI for trees Xgboost application on bridge management systems for proactive damage estimation Fractional frequency flexible Fourier form to approximate smooth breaks in unit root testing Re-examining the real interest rate parity hypothesis (RIPH) using panel unit root tests with asymmetry and cross-section dependence Testing PPP hypothesis under temporary structural breaks and asymmetric dynamic adjustments Nonlinearity and smooth breaks in unit root testing Structural break, nonlinearity and asymmetry: A re-examination of PPP proposition Testing the hysteresis effect in the US state-level unemployment series CO2 emissions, energy consumption and economic growth nexus in MENA countries: Evidence from simultaneous equations models The energy consumption and economic growth nexus in top ten energy-consuming countries: Fresh evidence from using the quantile-on-quantile approach CO2 emissions, economic and population growth, and renewable energy: Empirical evidence across regions Energy consumption, CO2 emissions and economic growth in developed Climate change legislations and environmental degradation Foreign investment and air pollution: Do good governance and technological innovation matter?