key: cord-0729733-wfdtmc9i
authors: Ojha, Narendra; Girach, Imran; Sharma, Kiran; Sharma, Amit; Singh, Narendra; Gunthe, Sachin S.
title: Exploring the potential of machine learning for simulations of urban ozone variability
date: 2021-11-18
journal: Sci Rep
DOI: 10.1038/s41598-021-01824-z
sha: 29670a858884f497de3238395ee80e15a9b33879
doc_id: 729733
cord_uid: wfdtmc9i

Machine learning (ML) has emerged as a powerful technique in Earth system science; nevertheless, its potential to model complex atmospheric chemistry remains largely unexplored. Here, we applied ML to simulate the variability in urban ozone (O3) over the Doon valley of the Himalaya. The ML model, trained with past variations in O3 and meteorological conditions, successfully reproduced the independent O3 data (r² ~ 0.7). Model performance is found to be similar when the variations in major precursors (CO and NOx) were included in the model instead of the meteorology. Further, the inclusion of both precursors and meteorology improved the performance significantly (r² = 0.86), and the model could also capture the outliers, which are crucial for air quality assessments. We suggest that in the absence of high-resolution measurements, ML modeling has profound implications for unraveling the feedback between pollution and meteorology in the fragile Himalayan ecosystem.

The chemical processes in the urban atmospheres of the Himalayan foothills have strong potential to impact the regional air quality, agriculture, and therefore the economy [1-3]. In addition, the build-up of climate-forcing pollution in the Himalayan region can have irreversible effects on the hydrological cycle and global climate [1, 2, 4-7]. The atmospheric dynamics above the Himalaya also form the crossroad of so-called "Atmospheric Brown Clouds" to the Tibetan Plateau [8].
The recent increase in extreme weather events triggering calamities also indicates an intensifying interplay between the increasing pollution and meteorology over the fragile ecosystem of the Himalaya [9-12]. The enhanced concentrations of ozone (O3) and other climate-forcing pollutants in the Himalayan foothills are attributed to unprecedented growth in population and urbanization [13-16]. Intense forest fires, diverse natural factors, and the topography also play vital roles in the build-up of trace gases and aerosols here [5, 11, 15, 17-19]. The Himalayan atmosphere is particularly influenced by the most densely populated region of the world, the Indo-Gangetic Plain (IGP) [20, 21]. The IGP is a global hotspot of elevated O3 and aerosol loading due to strong anthropogenic emissions and intense crop-residue burning under prevailing favorable meteorological conditions [22-27]. The emissions and photochemistry in the IGP affect the Himalayan atmosphere, in particular through the mountain meteorology and boundary layer dynamics [20, 21, 28]. A potential climate warming combined with a future increase in emissions can further intensify the atmospheric chemistry over this part of the world [29-31].

Considering this scenario, measurement and modeling studies have been conducted to assess the effects of diverse emissions, photochemistry, and dynamics on atmospheric composition over the Himalayan region [8, 15, 17, 32, 33]. The concentrations of O3 and precursors were found to be enhanced during the pre-monsoon (spring) and post-monsoon (autumn) seasons due to regional pollution supplemented with biomass burning, intense solar radiation, and less precipitation [15, 20, 21, 34, 35]. Long-term measurements of atmospheric composition and meteorological parameters, however, remain lacking over the Himalayan foothills in India, which are experiencing severe air pollution and extreme weather events.
Studies to fill this gap are of paramount significance since the chemistry-climate models also have greater biases in reproducing the already sparse measurements over the Himalayan region [20, 34, 36, 37]. The stronger biases are suggested to be mainly due to the limitation of models in resolving the highly complex topography of the Himalaya and foothills [5, 19, 20, 37, 38]. The uncertainties in the emission inventories and in the parameterizations of physical and chemical processes also increase the biases in the models [19, 37, 39-41]. Besides higher biases, the conventional models also need intensive computing resources, which poses a further limitation in conducting high-resolution simulations.

In the current era, artificial intelligence (AI) and machine learning (ML) have emerged as powerful alternative tools for modeling in various fields including Earth system science [42-45]. Recent studies utilized AI/ML modeling in the analyses of extreme weather events and the prediction of oceanic phenomena as well as atmospheric composition [46-48]. These studies have shown that ML models trained with data from observations or physical models can produce reliable simulations without intensive high-end computing. Nevertheless, the applications of AI/ML to simulate complex atmospheric chemistry remain limited. Considering the scientific and societal implications, the lack of measurements, and the limitations of conventional models over the Himalayan region, the objectives of this study are as follows:
(1) To explore the potential of ML modeling for simulating urban O3 variability.
(2) To study the effects of meteorological and chemical variables on model performance.
(3) To assess the effect of the data fraction used in the training on model performance.
The study region, datasets, and modeling are described in the "Methodology" section. Model simulations and results are presented in the "Model simulations and results" section, followed by the "Discussion" section.
pandemic. Further details of these O3 measurements are presented in the earlier study [15]. Auxiliary datasets used in training the ML model include the meteorological and chemical reanalyses from the ECMWF (European Centre for Medium-Range Weather Forecasts). The meteorological parameters (temperature, humidity, horizontal winds, and boundary layer height (BLH)) are included from ERA-Interim [50]. The chemical species, O3 and key precursors (CO, NO, NO2), have been included from the CAMS (Copernicus Atmosphere Monitoring Service) reanalysis [51]. ERA-Interim and CAMS products have been analyzed for diverse studies, including over the Indian region [15, 52-54]. The CAMS data has been shown to reproduce the day-to-day variability in the noontime O3 over the study region [15].

Machine learning model. This study utilizes the XGBoost (Extreme Gradient Boosting) algorithm [55] to simulate the O3 variations. Considering the dependence of O3 on meteorological parameters and precursor gases, this modeling falls under supervised learning. In the gradient boosting algorithm, a prediction model is developed in the form of an ensemble of weak prediction systems, i.e., decision trees. The model is built in a stage-wise manner and generalizations are made by allowing optimization of an arbitrary differentiable loss function (e.g., squared error). Further details of XGBoost can be found elsewhere [55]. The method adopted to build and evaluate the model is shown as a flow chart in Fig. 1. Hyperparameters have been varied iteratively following the trial-and-error method to achieve better prediction. The parameters were fine-tuned using the grid search function (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). The values of the hyperparameters set in the model are given in the supplementary material (Table S1). Other hyperparameters were kept at their default values (https://xgboost.readthedocs.io/en/latest/parameter.html). To avoid overfitting, the iterations are aborted once they cease to improve the fit parameters further, i.e., when there is no reduction in RMSE (root mean square error) over 100 iterations. The model performance in simulating O3 variations has been evaluated by estimating the correlation (r²), the slope of the linear fit, and the RMSE. A series of simulations have been performed under this study, as summarized in Table 1. These simulations and the evaluation of model performance are discussed in the following subsections.

Simulation utilizing in-situ O3 measurements. In the first simulation, ML_obs_O3_met_prec, the ML model has been trained using the observational data of O3 and reanalysis data of meteorological parameters (met) and precursors (prec). Analysis is focussed on the variations in noontime (11:30 h local time) O3. The data of April 2018 to April 2019 (number of days N = 222) has been used for training the ML model, which is 50% of the total available data. The model simulation is evaluated against the remaining independent observations for the April-December 2019 period (N = 223 days). Figure 2 shows the correlation between the ML model simulation and in-situ measurements of noontime O3 over the Doon valley for the April-December 2019 period. The ML model is found to successfully reproduce the temporal variability in the noontime O3 with an r² value of 0.75 (p < 0.01) and an RMSE value of 10 ppbv. The estimated bias in the ML model result is seen to be significantly lower as compared to the bias values reported for global and regional atmospheric models over this region [15, 34, 37]. The result suggests that in the absence of high-resolution measurements, the ML modeling can be combined with reanalysis and limited Indian Himalayan region and the temporal coverage is also very limited. In view of this, we include the long-term CAMS data to assess the potential and performance of ML modeling more deeply.
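The evaluation procedure described above (a chronological 50% split, a stopping rule based on RMSE, and the three skill metrics) can be illustrated with a minimal, library-free sketch. The function names below are hypothetical and are not taken from the study; the slope is assumed here to come from a least-squares fit of simulated on observed values, which the text does not specify, and the stopping rule only mimics the stated criterion rather than reproducing XGBoost internals.

```python
import math


def chronological_split(records, train_fraction=0.5):
    """Split a time-ordered sequence into training and evaluation parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must lie strictly between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def early_stopping_round(rmse_history, patience=100):
    """Return the best round once `patience` rounds pass without a lower RMSE.

    Returns None if the stopping criterion is never met within the history.
    """
    best_rmse = float("inf")
    best_round = None
    for i, rmse in enumerate(rmse_history):
        if rmse < best_rmse:
            best_rmse, best_round = rmse, i
        elif best_round is not None and i - best_round >= patience:
            return best_round
    return None


def evaluate(observed, simulated):
    """Compute r2, slope of the linear fit, and RMSE for paired series."""
    n = len(observed)
    if n != len(simulated) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((s - mean_s) ** 2 for s in simulated)
    sxy = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("both series must vary")
    return {
        "r2": sxy ** 2 / (sxx * syy),  # squared Pearson correlation
        "slope": sxy / sxx,            # assumed: simulated regressed on observed
        "rmse": math.sqrt(
            sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n
        ),
    }
```

With 445 daily records and a training fraction of 0.5, this split yields 222 training and 223 evaluation days, consistent with the sample sizes quoted above.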
With the availability of long-term data, here we train the ML model with noontime (11:30 local time) CAMS O3 and reanalysis meteorology for 2003-2015 (70% of the total data). This makes a significant fraction (30%) of the total data, during the 2015-2019 period, available for the evaluation. The simplest simulation is ML_cams_O3, in which the model is trained only with the O3 time series without including any additional parameter. This simulation is found to predict the independent O3 variations with an r² value of 0.47 and an RMSE of 11.6 ppbv (Fig. 3). This result is a manifestation of a periodicity in the O3 data embedded by the seasonal cycle in India. The relative effect of including variations in the meteorological parameters versus major precursors (CO, NO, NO2) has been evaluated by performing additional simulations (Table 1, Fig. 3). The model trained with O3 and meteorology (ML_cams_O3_met) reproduces independent O3 variations with an r² value of 0.71 and a slope value of 0.65. Another simulation (ML_cams_O3_prec), in which the ML model is trained with O3 and precursors but not with the meteorology, shows similar or slightly improved performance (r² = 0.74, slope = 0.79, p < 0.01). The inter-comparison of these two simulations suggests that reasonable predictions of urban O3 variability can be made with ML models trained with either the meteorological or the precursor dataset. This is important as this region lacks comprehensive datasets, especially of the precursors, and in such cases the meteorological datasets can be used to predict O3. Further, to explore the potential of the ML approach, we performed another simulation, ML_cams_O3_met_prec, in which both meteorology and precursors have been included in the model. This led to a significant improvement in the model performance, with an r² value as high as 0.86 and a slope value of 0.91. For this simulation, the RMSE value also drops drastically to 6 ppbv and the mean bias is also smaller (~3 ppbv).
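As an illustration only, the four CAMS-based simulations discussed above can be written down as combinations of two predictor groups, together with the r² values quoted in the text. The column names (for example `wind_u`, `wind_v`, `blh`) are hypothetical labels chosen for this sketch, and treating the horizontal winds as two components is an assumption; the simulation names and r² values are those reported above.

```python
# Hypothetical column labels for the two predictor groups named in the text.
PREDICTOR_GROUPS = {
    "met": ["temperature", "humidity", "wind_u", "wind_v", "blh"],
    "prec": ["co", "no", "no2"],
}

# Predictor groups added on top of the O3 time series in each simulation.
SIMULATIONS = {
    "ML_cams_O3": [],
    "ML_cams_O3_met": ["met"],
    "ML_cams_O3_prec": ["prec"],
    "ML_cams_O3_met_prec": ["met", "prec"],
}

# r2 values against independent CAMS O3, as quoted in the text.
REPORTED_R2 = {
    "ML_cams_O3": 0.47,
    "ML_cams_O3_met": 0.71,
    "ML_cams_O3_prec": 0.74,
    "ML_cams_O3_met_prec": 0.86,
}


def predictors_for(simulation):
    """List the additional predictor columns used by a simulation."""
    return [
        column
        for group in SIMULATIONS[simulation]
        for column in PREDICTOR_GROUPS[group]
    ]


def r2_gain_over_baseline(simulation, baseline="ML_cams_O3"):
    """Difference in r2 relative to the O3-only simulation."""
    return round(REPORTED_R2[simulation] - REPORTED_R2[baseline], 2)


if __name__ == "__main__":
    for name in SIMULATIONS:
        print(name, predictors_for(name), r2_gain_over_baseline(name))
```

Laid out this way, the gain in r² over the O3-only simulation is 0.24 with meteorology, 0.27 with precursors, and 0.39 with both.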
An important finding is that when the potentials of both meteorological and chemical datasets are combined, the model's ability to predict outliers improves drastically, which is of major significance in air quality assessments. A comparison of r² values among all these simulations (numbered 2-5 in Table 1) suggests that ~47% of O3 variations can be explained (r² = 0.47 in ML_cams_O3) by the periodicities embedded in the data, originating from the seasonal cycle. As precursors and meteorology act in tandem, the higher r² values (~0.7) in simulations trained with either meteorology or precursors suggest that this additional ~25% of O3 variability can be attributed to the changes in meteorology or precursor levels. Meteorology plus major precursors could explain ~86% (r² = 0.86 in ML_cams_O3_met_prec) of the variations in the urban O3. The remaining variability could be due to diverse unaccounted factors such as deposition, vertical transport, and volatile organic compounds. The analysis suggests that ML simulations can provide deep insights into the relative importance of the physical and chemical processes affecting the air quality.

The performance of the different simulations has been compiled in the form of a Taylor diagram (Fig. 3). The figure includes statistics such as r, normalized RMSE, and normalized standard deviation (SD), where normalization is done with respect to the SD in the reference (CAMS). The relative performance of different simulations is assessed by comparing how close a simulation is to the reference point (CAMS). For an ideal agreement, an ML simulation should coincide with the reference point (r = 1, normalized SD = 1, and normalized RMSE = 0). It is evident that the ML simulation exploiting the potentials of both meteorology and precursors (ML_cams_O3_met_prec) performed the best.
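The three Taylor-diagram statistics named above can be computed as in the following sketch. It assumes the conventional Taylor-diagram definition, in which the RMSE is centered (the mean of each series is removed before differencing) and population standard deviations are used; the text does not state which variant was applied, and the function name is hypothetical.

```python
import math


def taylor_statistics(reference, simulated):
    """Return r, normalized SD, and normalized centered RMSE.

    Normalization is by the standard deviation of the reference series.
    """
    n = len(reference)
    if n != len(simulated) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_ref = sum(reference) / n
    mean_sim = sum(simulated) / n
    dev_ref = [x - mean_ref for x in reference]
    dev_sim = [y - mean_sim for y in simulated]
    sd_ref = math.sqrt(sum(d * d for d in dev_ref) / n)
    sd_sim = math.sqrt(sum(d * d for d in dev_sim) / n)
    if sd_ref == 0.0 or sd_sim == 0.0:
        raise ValueError("both series must vary")
    covariance = sum(a * b for a, b in zip(dev_ref, dev_sim)) / n
    centered_rmse = math.sqrt(
        sum((b - a) ** 2 for a, b in zip(dev_ref, dev_sim)) / n
    )
    return {
        "r": covariance / (sd_ref * sd_sim),
        "normalized_sd": sd_sim / sd_ref,
        "normalized_rmse": centered_rmse / sd_ref,
    }
```

Under these definitions the three quantities are tied together by the law of cosines, normalized_rmse² = 1 + normalized_sd² − 2 · normalized_sd · r, which is what allows them to be shown on a single two-dimensional diagram. A constant offset between the series does not change any of them, so the mean bias has to be reported separately.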
Besides the stronger r value, a normalized SD value close to 1 suggests that the simulation produces a similar extent of variability as in the CAMS data. On the other hand, the ML simulations using either meteorology or precursors had similar performance. Also, ML_cams_O3_prec produced more variability, likely due to non-linearities in chemistry, as compared with the simulation using meteorological variations (ML_cams_O3_met).

Effect of training data length. We further investigate the sensitivity of model performance to the fraction of available data being used for the training. In this regard, a series of simulations have been performed using the best performing model set-up (ML_cams_O3_met_prec) with 20-95% of the data used for model training. Figure 4 shows the variations in r² and RMSE values due to variation in the training data fraction. The analysis shows that the model performance is highly sensitive to the length of the data being used in its training. The r² value is found to increase significantly from about 0.6 to 0.87, and the RMSE shows a reduction from ~11 to 6 ppbv, with an increase in the training data fraction. The analysis suggests that longer time-dependent datasets are highly desirable for optimizing the performance of ML models in predicting air quality variations. This underlines that long-term in situ measurements and validated chemistry-climate simulations can help in further exploiting the potential offered by the ML approach.

Our study unravels the strong potential of ML modeling for computationally inexpensive simulations of urban O3 variability in the Himalayan foothills region. The periodicity in O3 and meteorological parameters due to the systematic seasonal cycle of India tends to allow the ML model to reproduce the data fairly well. In the absence of high-resolution measurements, ML simulations can be used to assess the impacts of O3 on health and agriculture in this region.
Additionally, the series of simulations conducted here would serve as a reference for further applications of AI/ML-based modeling to complement conventional Earth system models. We note, however, that the environment here is urban and the O3 variations are greatly governed by the regional photochemistry. The scenario could be very different for cleaner remote regions where O3 variability is dominated by transport from upwind polluted regions or from higher altitudes. In this regard, we recommend establishing baseline stations to continuously monitor the atmospheric composition as well as the meteorology to exploit the full potential of ML modeling. Model performance is already promising with the inclusion of only meteorology; nevertheless, the inclusion of precursors enhances the model's ability to capture outliers, which are critical in air quality assessments. Future studies may extend the scope to additional climate-forcing pollutants and unravel the feedback between pollution and meteorology causing calamities in the fragile ecosystem of the Himalaya, which is experiencing strong anthropogenic pressure.
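The sensitivity test described under "Effect of training data length" can be organized as a simple sweep over chronological training fractions. The sketch below is not the authors' code: the `fit_predict` callable is a hypothetical placeholder for any model (the study used XGBoost with meteorology and precursors as predictors), the mean-value baseline exists only so the example runs without external libraries, and the particular fractions in the usage example are assumptions within the 20-95% range stated in the text.

```python
import math


def rmse(observed, simulated):
    """Root mean square error between two equally long series."""
    n = len(observed)
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(observed, simulated)) / n)


def sweep_training_fraction(targets, fractions, fit_predict):
    """Evaluate a model for several chronological training fractions.

    `fit_predict(train, horizon)` must return `horizon` predictions for
    the period that follows the training data.
    """
    results = []
    for fraction in fractions:
        cut = int(len(targets) * fraction)
        train, test = targets[:cut], targets[cut:]
        if not train or not test:
            raise ValueError(f"fraction {fraction} leaves an empty split")
        predictions = fit_predict(train, len(test))
        results.append((fraction, rmse(test, predictions)))
    return results


def mean_baseline(train, horizon):
    """Stand-in model: predict the training mean for every future step."""
    mean_value = sum(train) / len(train)
    return [mean_value] * horizon


if __name__ == "__main__":
    # Synthetic series with an annual cycle, for illustration only.
    series = [40.0 + 15.0 * math.sin(2.0 * math.pi * day / 365.0)
              for day in range(2000)]
    for fraction, error in sweep_training_fraction(
            series, [0.20, 0.50, 0.80, 0.95], mean_baseline):
        print(f"train fraction {fraction:.2f}: RMSE {error:.2f}")
```

Keeping the split chronological, rather than shuffling the samples, means the evaluation period always follows the training period, as in the simulations above.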
1. The contribution of outdoor air pollution sources to premature mortality on a global scale
2. Revisiting the crop yield loss in India attributable to ozone
3. Premature mortality in India due to PM2.5 and ozone exposure
4. Atmospheric brown clouds: Impacts on South Asian climate and hydrological cycle
5. Climate Change in the Himalayas
6. Ganges Valley Aerosol Experiment: Science and Operations Plan
7. Dust dominates high-altitude snow darkening and melt over high-mountain Asia
8. Atmospheric brown clouds reach the Tibetan Plateau by crossing the Himalayas
9. Observed evidence for steep rise in the extreme flow of western Himalayan rivers
10. Aerosol-induced high precipitation events near the Himalayan foothills
11. Air pollution in the Hindu Kush Himalaya
12. On rising temperature trends at Dehradun in Doon valley of Uttarakhand
13. The Himalayas must be protected
14. Black carbon and biomass burning associated high pollution episodes observed at Doon valley in the foothills of the Himalayas
15. Surface ozone in the Doon Valley of the Himalayan foothills during spring
16. Evaluation of ambient air quality in Dehradun city during 2011-2014
17. Influences of the springtime northern Indian biomass burning over the central Himalayas
18. Boundary layer evolution over the central Himalayas from radio wind profiler and model simulations
19. Effects of spatial resolution on WRF v3.8.1 simulated meteorology over the central Himalaya
20. Variabilities in ozone at a semi-urban site in the Indo-Gangetic Plain region: Association with the meteorology and regional processes
21. First simultaneous measurements of ozone, CO, and NOy at a high-altitude regional representative site in the central Himalayas
22. Season-wise analyses of VOCs, hydroxyl radicals and ozone formation chemistry over north-west India reveal isoprene and acetaldehyde as the most potent ozone precursors throughout the year
23. On the widespread enhancement in fine particulate matter across the Indo-Gangetic Plain towards winter
24. In situ Ozone Production is highly sensitive to Volatile Organic Compounds in the Indian Megacity of Delhi
25. Avoiding high ozone pollution in Delhi
26. Global distribution of tropospheric ozone from satellite measurements using the empirically corrected tropospheric ozone residual technique: Identification of the regional aspects of air pollution
27. Source apportionment of volatile organic compounds in the northwest Indo-Gangetic Plain using a positive matrix factorization model
28. Time variability of surface-layer characteristics over a mountain ridge in the central Himalayas during the spring season
29. The South Asian monsoon - Pollution pump and purifier
30. The influence of temperature on ozone production under varying NOx conditions - A modelling study
31. How will air quality change in South Asia by
32. Pre-monsoon air quality over Lumbini, a world heritage site along the Himalayan foothills
33. Overview of VOC emissions and chemistry from PTR-TOF-MS measurements during the SusKat-ABC campaign: High acetaldehyde, isoprene and isocyanic acid in wintertime air of the Kathmandu Valley
34. Variations in surface ozone at Nainital: A high-altitude site in the central Himalayas
35. Variations in surface ozone and carbon monoxide in the Kathmandu Valley and surrounding broader regions during SusKat-ABC field campaign: Role of local and regional sources
36. WRF-Chem simulated surface ozone over south Asia during the pre-monsoon: Effects of emission inventories and chemical mechanisms
37. Simulations over South Asia using the Weather Research and Forecasting model with Chemistry (WRF-Chem): Chemistry evaluation and initial results
38. Downscaled climate change projections for the Hindu Kush Himalayan region using CORDEX South Asia regional climate models
39. What controls the seasonal cycle of black carbon aerosols in India
40. Variations in O3, CO, and CH4 over the Bay of Bengal during the summer monsoon season: Shipborne measurements and model simulations
41. Distribution of volatile organic compounds over Indian subcontinent during winter: WRF-chem simulation versus observations
42. A machine learning-based global atmospheric forecast model
43. AQ-Bench: A benchmark dataset for machine learning on global air quality metrics
44. A novel framework for spatio-temporal prediction of environmental data using deep learning
45. Fast domain-aware neural network emulation of a planetary boundary layer parameterization in a numerical weather forecast model
46. Using machine learning to analyze physical causes of climate change: A case study of U.S. midwest extreme precipitation
47. A machine learning based prediction system for the Indian Ocean Dipole
48. A novel CMAQ-CNN hybrid model to forecast hourly surface-ozone concentrations 14 days in advance
49. Direct assessment of international consistency of standards for ground-level ozone: Strategy and implementation toward metrological traceability network in Asia
50. The ERA-Interim reanalysis: Configuration and performance of the data assimilation system
51. The CAMS reanalysis of atmospheric composition
52. O3 and CO in the South Asian outflow over the Bay of Bengal: Impact of monsoonal dynamics and chemistry
53. On the understanding of surface ozone variability, its precursors and their associations with atmospheric conditions over the Delhi region
54. Impact of increasing carbon dioxide on dinitrogen and carbon fixation rates under oligotrophic conditions and simulated upwelling
55. XGBoost: A scalable tree boosting system

We gratefully acknowledge the open-source XGBoost library developed by the contributors of the DMLC/XGBoost community (https://github.com/dmlc/xgboost/blob/master/CONTRIBUTORS.md). The ERA-Interim meteorological data from ECMWF (European Centre for Medium-Range Weather Forecasts; https://apps.ecmwf.int/datasets/data/interim-full-daily/levtype=sfc/) and the CAMS data from https://ads.atmosphere.copernicus.eu/cdsapp#!/dataset/cams-global-reanalysis-eac4?tab=form used in the study are gratefully acknowledged.
We also acknowledge the Atmospheric Trace gases-Chemistry, Transport and Modelling (AT-CTM) project under the Geosphere Biosphere Programme of the Indian Space Research Organisation (ISRO-GBP). We are thankful to D. Pallamraju, S. Suresh Babu, Radhika Ramachandran and Anil Bhardwaj for valuable support during the study. The constructive comments and suggestions from the two anonymous reviewers and the handling editor are greatly appreciated.

N.O. and I.G. conceived the idea and performed the modeling. K.S. and I.G. conducted the ozone measurements. N.O., I.G., and A.S. performed the analyses with inputs from N.S. and S.S.G. N.O. wrote the manuscript with contributions from all the co-authors.

The authors declare no competing interests.

The online version contains supplementary material available at https://doi.org/10.1038/s41598-021-01824-z. Correspondence and requests for materials should be addressed to N.O. or I.G. Reprints and permissions information is available at www.nature.com/reprints. Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.