title: Modeling Atmospheric Data and Identifying Dynamics: Temporal Data-Driven Modeling of Air Pollutants
authors: Rubio-Herrero, Javier; Marrero, Carlos Ortiz; Fan, Wai-Tong Louis
date: 2020-10-13

Atmospheric modeling has recently experienced a surge with the advent of deep learning. Most of these models, however, predict concentrations of pollutants following a data-driven approach in which the physical laws that govern their behaviors and relationships remain hidden. With the aid of real-world air quality data collected hourly at different stations throughout Madrid, we present an empirical approach using data-driven techniques with the following goals: (1) find parsimonious systems of ordinary differential equations via sparse identification of nonlinear dynamics (SINDy) that model the concentration of pollutants and their changes over time; (2) assess the performance and limitations of our models using stability analysis; (3) reconstruct the time series of chemical pollutants not measured in certain stations using delay coordinate embeddings. Our results show that Akaike's Information Criterion can work well in conjunction with best subset regression to find an equilibrium between sparsity and goodness of fit. We also find that, due to the complexity of the chemical system under study, identifying the dynamics of this system over longer periods of time requires higher levels of data filtering and smoothing. Stability analysis for the reconstructed ordinary differential equations (ODEs) reveals that more than half of the physically relevant critical points are saddle points, suggesting that the system is unstable even under the idealized assumption that all environmental conditions are constant over time.

Chemists and environmental scientists refer to nitrogen oxides as the group of compounds that contain nitrogen and oxygen. The most important of those gases, nitric oxide (NO) and nitrogen dioxide (NO$_2$), are of special interest because they are byproducts of many human activities. In 2013, 76.4% of the tropospheric NO$_x$ had an anthropogenic source. Of that total, 75.5% originated from fossil fuel combustion and industrial processes (IPCC, 2013). The relevance of these two gases in how ozone (O$_3$) is distributed on Earth makes the control of their emissions of paramount importance from a policy, environment, and health perspective. The chemical reactions that lead to the creation of ozone, and the connection of this phenomenon with urban pollution, have been studied at length (Crutzen, 1970, 1979; Seinfeld and Pandis, 2016). Ozone is critical in the stratosphere, as it protects the Earth from the harmful effects of UV radiation by absorbing it. However, the creation of ozone in the troposphere poses a serious pollution problem, as it is the main component of smog and thus becomes part of the air we breathe (Blaszczak, 1999). In addition, its creation is deeply affected by climate change, as the chemical reactions that produce it are very sensitive to temperature and lead to higher concentrations at higher temperatures (Aw and Kleeman, 2003).

In order to fight the effects of pollution, many governments have implemented policies at different levels, namely national, regional, and local. For example, parts of California have seen a significant decrease in pollution over a period of 20 years.
More recently, the Chinese government imposed strict measures to fight the spread of COVID-19 (Zhu et al., 2020), which resulted in a near-halt of transportation and industrial activities in most parts of the country. A collateral effect of these strict measures was a sharp decrease in the emissions of NO$_2$. At the council level, some major capitals are introducing their own control policies. In an attempt at pedestrianization, many councils are moving towards transit models that hamper the use of cars in densely populated urban areas. In 2018, Spain's capital, Madrid, designated parts of its downtown as a low-emission zone, known as Madrid Central (MC) (Ayuntamiento de Madrid, a). The city council limited the access of certain vehicles, although the transit of residents' vehicles remained permitted. Recent research shows that these limitations have decreased the concentration of NO$_2$ in the downtown area (Lebrusán and Toutouh, 2019).

Predicting the concentration of pollutants in the atmosphere is a well-studied topic (Cooper et al., 1997; Daly and Zannetti, 2007). In Li et al. (2016), the authors rightly indicate that there are two types of modeling efforts when it comes to forecasting the concentration of pollutants in the atmosphere. On the one hand, researchers can tackle this problem with a deterministic approach, in which the physics of the model comes into play in the form of diffusion equations, the pollutants' chemical characteristics, or fluid dynamics. These models usually require the solution of nonlinear mathematical relationships, typically expressed in the form of partial differential equations (Ogura and Phillips, 1962; Wilhelmson and Ogura, 1972; Lanser and Verwer, 1999). Other physics-based models compute air parcels and track (or backtrack) the dispersion of atmospheric pollutants (Stein et al., 2015). On the other hand, purely data-driven, statistical approaches bypass the physics that underlie the complicated behavior of these pollutants. The range of tools employed in these approaches varies considerably. For example, classic statistical models were employed by Robeson and Steyn (1990), who used univariate deterministic/stochastic models, ARIMA models, and bivariate models to forecast maximum ozone concentrations. With this same goal, Prybutok et al. (2000) proposed a regression model, a Box-Jenkins model, and a fully connected neural network, and concluded that the neural network performed best. Simulation models have also been an alternative in this context: an example is CALIOPE (Baldasano et al., 2011), a Spanish air quality forecasting system for the temporal and spatial prediction of gas-phase species and particulate matter (i.e., NO$_2$, SO$_2$, O$_3$, and PM$_{10}$). Finally, the advent of deep learning brought new opportunities for more accurate forecasting. Recurrent neural networks (RNNs) in general, and long short-term memory networks (LSTMs) in particular, have been explored extensively as a means to explain the evolution of unknown variables over time; a recent example is the work by Feng et al. (2019). More recently, other black-box models were used to forecast concentrations of pollutants or air quality indexes (Abirami and Chitra, 2021; Liu et al., 2020). LSTMs were used by Pardo and Malpica (2017) to predict the levels of NO and NO$_2$ with lags of 8, 16, and 24 hours, producing results superior to those reported by Baldasano et al. (2011). These different approaches within data-driven solutions present advantages and disadvantages.
Neural networks require many training examples to attain accurate results and do not extrapolate well outside the regime in which they were trained (Bilbrey et al., 2020). Most importantly, these black-box models produce uninterpretable relationships between the variables. Conversely, more traditional statistical procedures force the selection of a model or a family of models prior to calibrating and estimating coefficients, thus reducing modeling flexibility. However, they provide clear, closed-form mathematical expressions that relate dependent and independent variables.

To address some of the challenges outlined above, some researchers have started to look at ways to integrate data-driven efforts with domain modeling (i.e., modeling that aims at tuning the parameters that set the relationships between variables and that also enforces the laws that govern the systems under consideration). Much of this work is beginning to coalesce under the banner of Scientific Machine Learning (Baker et al., 2019), and promising approaches are continually being developed and improved (Raissi et al., 2019; Chen et al., 2018; Rackauckas et al., 2020). Inspired by these recent successes in the field, the goal of this article is to outline an empirical data-driven, but domain-aware, framework to model atmospheric pollutants. Such a framework bridges the gap between the two classic perspectives (deterministic and data-driven) and provides methods that can leverage data as well as produce relationships between atmospheric chemical species that capture the physics of the system being modeled. In this paper, we apply these techniques to find a system of ordinary differential equations (ODEs) from real-world measurements of NO$_2$ and O$_3$. As previously discussed, the use of these techniques has clear applications in policy-making environments at the national, regional, and local levels, where accurate quantitative tools are of vital importance for assessing atmospheric contamination. In turn, these tools can help policy makers determine the benefits and impact of their pollution control techniques (Popp, 2006).

The structure of this paper is as follows:

1. In Section 3 we propose an alternative optimization approach for sparse identification of nonlinear dynamics (SINDy). We apply this approach to real time-series data collected at various air quality stations located in Madrid in order to find systems of ODEs that capture the dynamics of ozone and nitrogen dioxide at those geographical spots for a given time frame. We discuss the implications of noisy datasets in this context and how they impact the performance of SINDy.

2. In Section 4, we analyze some basic mathematical properties of the ODEs reconstructed from the data in Section 3 and offer some insight regarding the global behavior of the dynamics of the concentrations of NO$_2$ and O$_3$. For each air quality station that measures both chemical species, we classify the critical points of the system of ODEs according to a stability analysis. The goal of this analysis is to provide us with a way to interpret the performance and limitations of our fitted equations.

3. In Section 5 we reconstruct the time series of the concentrations of O$_3$ at those stations where only NO$_2$ readings are available. To this end, we rely on Takens' embedding theorem to perform the reconstruction.
The goal of this reconstruction is to provide the foundation of a method that allows us to identify the dynamics of a chemical species in a location where readings of this species are not available.

The application at hand sits squarely within the dichotomy between flexibility and tractability posed by black-box models and regression models. Many dynamical systems can be explained with low-order mathematical expressions. As a matter of fact, the dynamics of many physical and chemical systems are modeled as systems of ODEs. In particular, in the field of atmospheric chemistry, many chemical reactions and interactions that take place in the Earth's atmosphere can be modeled this way under certain conditions.

Chemical reactions are governed by their rate equations. The kinetics of a chemical reaction show how the concentrations of the reactants and the products vary during the reaction. Rate equations are easy to obtain for elementary reactions in closed systems (i.e., systems where only those reactions occur and there is no flux of other molecules entering or exiting). For instance, mass-action kinetics suggest that the reaction

$$aA + bB \rightarrow cC + dD$$

consumes reactant A at a rate, in $\mathrm{mol}/(\mathrm{m}^3 \cdot \mathrm{s})$, given by the differential equation

$$-\frac{1}{a}\frac{d[A]}{dt} = k\,[A]^a [B]^b,$$

where k is the rate constant of the reaction (Érdi and Tóth, 1989). In turn, the exponents of the concentrations, a and b, called the partial orders of the reaction, are in this case the stoichiometric coefficients of the chemical reaction. Their sum is referred to as the overall order of the reaction. Eventually, in a closed system, the reactants are exhausted.

In open systems with influx and efflux of several chemical species, like the troposphere, many chemical reactions occur simultaneously. Consequently, molecules are constantly created and destroyed as a result of those reactions. This phenomenon yields a steady state in the concentrations that stems from a dynamic equilibrium (Denbigh et al., 1948). Also, in open systems the reactions occur in multiple steps, and the partial orders of the rate equations usually do not match the stoichiometric coefficients. Moreover, since a reactant can take part in several reactions at the same time (i.e., it can be consumed or produced as a result of other reactions), the time series of its concentration is not necessarily given by the rate equation of a single reaction. As an example, consider the Leighton cycle (Leighton, 1961), which explains the formation of ozone from nitrogen oxides in the troposphere in unpolluted conditions:

$$\mathrm{NO}_2 + h\nu \;\xrightarrow{J_t}\; \mathrm{NO} + \mathrm{O}(^3P), \qquad (1)$$
$$\mathrm{O}(^3P) + \mathrm{O}_2 + M \;\rightarrow\; \mathrm{O}_3 + M, \qquad (2)$$
$$\mathrm{NO} + \mathrm{O}_3 \;\xrightarrow{k_3}\; \mathrm{NO}_2 + \mathrm{O}_2, \qquad (3)$$

where $h\nu$ represents energy from solar radiation (the product of Planck's constant, h, and the frequency of the wave of solar radiation, $\nu$), $\mathrm{O}(^3P)$ denotes an oxygen atom in its fundamental state, and M denotes a third body that carries away excess energy. In turn, $k_3$ is the rate constant of the third reaction, and the reaction rate $J_t$ represents the actinic flux, which varies over time and depends largely on the incidence of photons, thus presenting different values according to other factors such as cloud cover or season. The reactions (1)-(3) present the following kinetics (Marsili-Libelli, 1996), with the short-lived $\mathrm{O}(^3P)$ assumed to react instantaneously through (2):

$$\frac{d[\mathrm{NO}_2]}{dt} = k_3 [\mathrm{NO}][\mathrm{O}_3] - J_t [\mathrm{NO}_2], \qquad (4)$$
$$\frac{d[\mathrm{O}_3]}{dt} = J_t [\mathrm{NO}_2] - k_3 [\mathrm{NO}][\mathrm{O}_3]. \qquad (5)$$
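To make the null-cycle kinetics concrete, the sketch below integrates equations (4)-(5), tracking [NO] explicitly, and checks that the concentrations settle to the steady state they imply (the pseudo-steady state discussed next). This is a minimal illustration, not the authors' code; the rate constant and actinic flux values are hypothetical placeholders.

```python
# Minimal sketch: integrate the null-cycle kinetics (4)-(5).
# k3 and Jt are hypothetical placeholder values, not fitted or measured ones.
import numpy as np
from scipy.integrate import solve_ivp

k3 = 0.4   # hypothetical rate constant of reaction (3)
Jt = 0.25  # hypothetical (constant) actinic flux driving reaction (1)

def leighton(t, y):
    no2, no, o3 = y
    photolysis = Jt * no2      # reaction (1); O(3P) consumed instantly by (2)
    titration = k3 * no * o3   # reaction (3)
    return [titration - photolysis,   # d[NO2]/dt, equation (4)
            photolysis - titration,   # d[NO]/dt
            photolysis - titration]   # d[O3]/dt, equation (5)

sol = solve_ivp(leighton, (0.0, 100.0), [1.0, 0.2, 0.5], rtol=1e-8)
no2, no, o3 = sol.y[:, -1]
# At the pseudo-steady state, [O3] should approach Jt*[NO2]/(k3*[NO]):
print(o3, Jt * no2 / (k3 * no))
```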
In unpolluted conditions, the ozone present in the troposphere is due to transport from the stratosphere and to photochemical production. Its destruction is likewise due to photochemical reactions and to deposition on the Earth's surface. These processes happen at a rate that keeps the level of ozone approximately constant in this layer of the atmosphere and, in these circumstances, the above represents a null cycle in which there is no net production or destruction of these chemical species. Therefore, their kinetics reach a pseudo-steady state that can be expressed as

$$[\mathrm{O}_3] = \frac{J_t\,[\mathrm{NO}_2]}{k_3\,[\mathrm{NO}]}.$$

This relationship explains why during daylight hours, when the actinic flux is large, there is an increase in the concentrations of NO and O$_3$ at the expense of the destruction of NO$_2$ (Council et al., 1992). It also explains why this trend is reversed during the night hours.

The kinetics in polluted conditions turn out to be much more complex. The only known source for the creation of O$_3$ is the photolysis of NO$_2$ (see Equation (1)). Thus, the ozone build-up that appears in those conditions is due to an excess of NO$_2$ produced by man-made pollution. Indeed, pollution is responsible for the presence of free radicals that initiate a series of chain reactions that lead to the creation of more ozone. Following the insight provided by Marsili-Libelli (1996) that this complex system can be simplified with a system of low-order ODEs that fold the elusive information about free radicals into the kinetic rates, we explore the possibility of modeling the dynamics of chemical species in polluted environments similarly to their kinetics in unpolluted cases (equations (4) and (5)). We do this with a data-driven method that implicitly incorporates the complexity introduced by the side reactions of the free radicals into the kinetic rates. Hence, our research question is how to develop data-driven methods that are able to capture the dynamics of a complex, open atmospheric system in closed mathematical form by identifying the values of the coefficients of the rate equations corresponding to some of the reactions that occur.

In our empirical work, we focus on the dynamics of two pollutants, namely nitrogen dioxide and ozone, whose time-varying concentrations are interdependent. As mentioned in Section 2, there have been several attempts to predict the concentration of ozone in the atmosphere. Our approach differs from these in that we propose a regression approach that can capture the dynamics of the atmosphere in a way that interrelates different chemical species and their concentrations over time, thus offering a closed form of the rate equations that govern the chemical reactions occurring during the selected time frame. In addition, in Sections 4 and 5 we will use those governing equations to offer analytical insight into the dynamics of these species and to reconstruct the time series of ozone at those stations that do not measure it.

Our goal is to find a system of ODEs that describes the chemical dynamics in the troposphere. That is, a system of the form

$$\dot y(t) = F(y(t)), \qquad (6)$$

where $y(t) = (y_1(t), y_2(t), \dots, y_p(t))^T \in \mathbb{R}^p$ is a vector containing the time response of the concentrations of the p chemical species under study, $\dot y(t) = (\dot y_1(t), \dot y_2(t), \dots, \dot y_p(t))^T$ is the vector of derivatives, and $F(y(t))$ is a vector field acting on $y(t)$. In Brunton et al. (2016), the authors outlined a methodology to estimate the functional form of F given samples from y and $\dot y$ by solving a series of least-squares problems,

$$\min_{\beta_i}\; \frac{1}{2}\,\big\| \dot{\tilde y}_i - \tilde F \beta_i \big\|_2^2, \qquad i = 1, 2, \dots, p, \qquad (7)$$

where $\dot{\tilde y}_i = (\dot y_i(t_1), \dots, \dot y_i(t_m)) \in \mathbb{R}^m$ is a vector of observed derivatives and $\beta_i \in \mathbb{R}^{n+1}$ is the vector of regression coefficients. The matrix $\tilde F \in \mathbb{R}^{m \times (n+1)}$ contains the evaluations of the candidate nonlinear basis functions (plus the intercept term) over a time horizon of m samples, with $m \gg n$.
For example, if we have a system of two chemical species (p = 2) and we assume that the entries of $F(y(t))$ are composed of second-order polynomials in $y_1(t)$ and $y_2(t)$ (i.e., n = 5), then

$$\tilde F = \begin{pmatrix}
1 & y_1(t_1) & y_2(t_1) & y_1^2(t_1) & y_2^2(t_1) & y_1(t_1)\,y_2(t_1) \\
1 & y_1(t_2) & y_2(t_2) & y_1^2(t_2) & y_2^2(t_2) & y_1(t_2)\,y_2(t_2) \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
1 & y_1(t_m) & y_2(t_m) & y_1^2(t_m) & y_2^2(t_m) & y_1(t_m)\,y_2(t_m)
\end{pmatrix}.$$

Notice that the solution to the minimization problem (7) yields an approximate solution to F in (6). Also note that when data for $\dot y_i$ are not available, we can approximate these vectors via numerical differentiation using the data vector $\tilde y_i = (y_i(t_1), y_i(t_2), \dots, y_i(t_m))$, as suggested by Kaiser et al. (2018). In particular, we use backward differences, $\dot y_i(t_j) \approx (\tilde y_i(t_j) - \tilde y_i(t_{j-1}))/(t_j - t_{j-1})$. With our data collected on an hourly basis (see Subsection 3.4), this numerical differentiation reduces to calculating $\dot y_i(t_j) = \tilde y_i(t_j) - \tilde y_i(t_{j-1})$. As we will also mention in that subsection, these computations take place after a filtering and splining process aimed at reducing data noise. Other approaches to curbing the effect of noisy data on numerical differentiation are also possible (Chartrand, 2011) and would be an interesting area to explore in the future.
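As a concrete illustration of the regression setup in (7), the following sketch assembles the candidate library $\tilde F$ for p = 2 with second-order monomials, builds backward-difference derivatives, and solves the two least-squares problems. Variable names and the synthetic signals are ours, not the paper's code.

```python
# Minimal sketch: library matrix of problem (7) plus backward differences.
import numpy as np

def library(y1, y2):
    """Columns: 1, y1, y2, y1^2, y2^2, y1*y2 (n = 5 plus the intercept)."""
    return np.column_stack([np.ones_like(y1), y1, y2, y1**2, y2**2, y1 * y2])

rng = np.random.default_rng(0)
t = np.arange(200.0)                                    # hourly stamps, dt = 1
y1 = np.sin(0.1 * t) + 0.01 * rng.standard_normal(200)  # toy stand-in for [NO2]
y2 = np.cos(0.1 * t) + 0.01 * rng.standard_normal(200)  # toy stand-in for [O3]

# Backward differences: ydot_i(t_j) = y_i(t_j) - y_i(t_{j-1}) since dt = 1.
y1dot, y2dot = np.diff(y1), np.diff(y2)
F = library(y1[1:], y2[1:])

beta1, *_ = np.linalg.lstsq(F, y1dot, rcond=None)  # coefficients for species 1
beta2, *_ = np.linalg.lstsq(F, y2dot, rcond=None)  # coefficients for species 2
print(np.round(beta1, 3), np.round(beta2, 3))
```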
On most occasions, not all the terms in the chosen functional form are needed, as it is very likely that some are not relevant for explaining the dependent variable in question. Therefore, a frequent subproblem within regression is that of finding the subset of terms that provides the best representation of the dependent variable. This search for a more parsimonious or sparse mathematical expression is the core idea explored in Brunton et al. (2016). We analyzed the advantages and drawbacks of LASSO regression (Tibshirani, 1996) and best subset regression (Hill et al., 2006, Chapter 19) and concluded that, for our application, the latter was a better choice for obtaining accurate representations of the dynamics of NO$_2$ and O$_3$ at different geographical points of Madrid. A description of both methods is given in Appendix A.

The literature on system identification is vast, and its applications have been studied for many years (Ljung and Glad, 1994; Ljung, 1999). However, the application of LASSO regression to sparse identification of nonlinear dynamics (SINDy) is more recent and, although some researchers proposed its use in this context shortly after LASSO was developed (Kukreja et al., 2006), it became prominent in the literature after Brunton et al. (2016). Since then, there have been multiple efforts to find sparse representations of physical and biological systems as well as population dynamics (Mangan et al., 2016; Kaiser et al., 2018). As far as chemical systems are concerned, Bhadriraju et al. (2019) modeled the dynamics of a continuous stirred tank reactor with an adaptive sparse identification method that involved sparse model identification, re-estimation of regression coefficients, and stepwise regression. The same authors also tackled this problem with SINDy in conjunction with a neural network controller that determined when the outputted equations needed to be re-evaluated (Bhadriraju et al., 2020). Hoffmann et al. (2019) extended SINDy with ansatz functions in what they called "reactive SINDy", in order to eliminate the spurious reactions that are typically captured in the reaction networks behind biological processes. The applicability of SINDy and symbolic regression (SymReg) for predicting the dynamics of a distillation column within the context of a manufacturing process was put to the test by Subramanian et al. (2021). The authors found that SINDy performed better than SymReg and was able to identify terms related to Fick's law and Henry's law. They concluded that not all the dynamics present in that system could be captured by a single method and suggested the parallel use of different machine learning algorithms to capture the system's complexity entirely. A very recent and promising attempt to reduce the complexity inherent in solving the systems of ODEs that stem from large chemical networks was presented by Grassi et al. (2021). In this case, however, the authors did not resort to a method like SINDy, but rather proposed a combination of encoders and decoders that produced a transparent and interpretable latent state with many fewer variables. The problem then was the correct definition of the topology of the compressed chemical network, whose result eventually had to be mapped back to the original set of variables. In Narasingam and Kwon (2018), the authors successfully demonstrated how to construct reduced-order models using SINDy to approximate the dynamics of a nonlinear hydraulic fracturing process.

The tractability considerations discussed in Appendix A suggested that LASSO regression was a sensible option for identifying the dynamics of chemical species in the troposphere. However, the aforementioned works found sparse identifications of dynamical systems from data that had been generated synthetically. During the course of our research, it was our experience that LASSO does not perform well in the context of SINDy when dealing with real-world data from multiple geographical locations, even after the data preprocessing stage. We attribute this to the considerable amount of noise present in the data and to the open nature of the system we are trying to model. While LASSO could certainly find sparse systems of ODEs, numerically solving the resulting systems of ODEs denoted by (7) was very problematic because of numerical issues caused by singularities.

Assuming that, as mentioned in Subsection 3.1, the dynamics of the atmosphere can be explained with low-order polynomials, and based on equations (4) and (5) for unpolluted environments, we conjectured that second-order polynomials would suffice to identify the dynamics of this system under polluted conditions. When dealing with two chemical species, NO$_2$ and O$_3$ (i.e., p = 2), such polynomials have at most 5 variables, and there are $2^5 = 32$ possible regressions for each equation in (7). In these circumstances, brute-force enumeration of all the possible regressions is computationally affordable, and we therefore opted for a best subset regression approach in which, for each chemical species i, the vector of optimal regression coefficients $\hat\beta_i$ was selected according to the Akaike information criterion (AIC) (Akaike, 1998). We thus solved the problem

$$\min_{\beta_i}\; m \log\!\left( \frac{1}{m}\,\big\| \dot{\tilde y}_i - \tilde F \beta_i \big\|_2^2 \right) + 2\,\|\beta_i\|_0, \qquad (8)$$

where the $\ell_0$-norm $\|\beta_i\|_0$ denotes the number of non-zero elements of the vector $\beta_i$ (excluding the intercept). Selecting the AIC was motivated by the need for an equilibrium in the bias-variance tradeoff. This criterion seeks that equilibrium by taking into account both the sum of squared errors (hence tackling underfitting) and the number of variables in the regression (hence tackling overfitting).
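The best subset search in problem (8) can be sketched as follows: enumerate all $2^5 = 32$ subsets of the candidate variables, fit each by least squares, and keep the subset with the lowest AIC. The exact AIC bookkeeping (e.g., how the intercept is counted) is our reading of the text, not the authors' MATLAB implementation.

```python
# Minimal sketch of problem (8): brute-force best subset selection with AIC.
import itertools
import numpy as np

def best_subset_aic(F, ydot):
    """F: m x 6 library with column 0 the intercept; ydot: length-m vector."""
    m, n1 = F.shape
    best = (np.inf, None, None)
    for k in range(n1):                        # subset sizes 0..5
        for cols in itertools.combinations(range(1, n1), k):
            idx = (0,) + cols                  # the intercept is always kept
            beta, *_ = np.linalg.lstsq(F[:, idx], ydot, rcond=None)
            sse = np.sum((ydot - F[:, idx] @ beta) ** 2)
            aic = m * np.log(sse / m) + 2 * k  # penalty term 2*||beta||_0
            if aic < best[0]:
                best = (aic, idx, beta)
    return best

# Toy usage: the true model uses only the intercept and columns 1 and 3.
rng = np.random.default_rng(1)
F = np.column_stack([np.ones(50), rng.standard_normal((50, 5))])
ydot = 1.0 + 2.0 * F[:, 1] - 0.5 * F[:, 3] + 0.05 * rng.standard_normal(50)
aic, idx, beta = best_subset_aic(F, ydot)
print(aic, idx, np.round(beta, 3))
```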
The AIC thus gave us a common score for comparing all the possible models and allowed us to implement an algorithm that could bypass the numerical issues encountered when using the ODE solvers (see Appendix B). Our algorithm was programmed in MATLAB R2020a and was built on the structure of the code developed in Brunton et al. (2016). This code was adapted to solve the systems of ODEs with the output of the regressions obtained from the best subset method. MATLAB has different ODE solvers that can be used in various environments. In our particular case, we found that many of the systems of ODEs produced were stiff, and thus we used MATLAB's stiff solver ode15s. This improved the integrability of the systems of ODEs, although numerical problems still arose on some occasions. These problems stemmed from singularities or regions where the derivative changed very rapidly. To address this, we introduced in our algorithm an upper bound on the values of the derivative, such that the regressions that eventually led to those ill-posed solutions were discarded. In the absence of numerical problems, our algorithm returns the optimal solution to Problem (8) for each i = 1, 2, …, p and provides representations of (7) that minimize the AIC. In the presence of numerical problems, it finds representations of (7) that minimize the AIC while also being numerically tractable.

The data used in this study were collected from Madrid's City Council Open Data website (Ayuntamiento de Madrid, b). We selected this data set because of the vast and detailed amount of information that it provides: it contains hourly readings, between 2001 and 2018, of various pollutants at 24 different stations located across downtown Madrid as well as its outskirts (see Figure 1). Only 14 of these stations measure both NO$_2$ and O$_3$; in Section 5, we will discuss how we used our results at these stations to reconstruct the time series of O$_3$ readings at the other 10 stations.

As mentioned in the previous section, the collected raw data came from sensors, which are naturally noisy (Ayuntamiento de Madrid, b). In the presence of such noise, it is very complicated to find systems of ODEs that can be solved numerically without issues, so some data preprocessing was needed. The following summarizes the operations performed on the set of raw data:

• Data normalization: The orders of magnitude of the concentrations of the different chemical species in the atmosphere may differ greatly. For this reason, the time series of all molecules were standardized (i.e., from each data point of the time series we subtracted the average concentration during the time frame considered and divided by the standard deviation). In our experience, this led to fewer numerical errors when we integrated the systems of ODEs represented by equation (7). We will denote our original M data samples of normalized concentrations of the p chemical species by $\tilde w_i$, i = 1, 2, …, p.

• Data filtering: The excess of data noise made it difficult to extract trends in the concentrations of pollutants over time. In order to address this issue, we applied a Gaussian-weighted moving average filter to our data. This filter, as implemented in MATLAB, uses a heuristically determined window size that is attenuated according to a smoothing parameter $\alpha \in [0, 1]$. Values of $\alpha$ close to 0 reduce the window size (i.e., reduce the smoothing), whereas values of $\alpha$ close to 1 increase the window size (i.e., increase the smoothing). In very noisy and long time series, high values of $\alpha$ might be needed to extract meaningful patterns. However, this might come at the cost of excessive damping, and the resulting time series might not be a good representation of the underlying data set. This parameter will be critical in our experiments, as we will discuss in Subsection 3.5.
• Data splining: Raw data were collected hourly, but the integration of a continuous-time system of ODEs requires a finer discretization of the time domain. In consequence, and after attempting finer and coarser discretizations, we proceeded with the creation of 100 points between each original pair of data points, following a modified Akima interpolation (MAI) (Akima, 1970). MAI helps reduce the excessive undulations that may occur with regular cubic splines.

After filtering and smoothing, the resulting vectors of normalized concentrations will be our $\tilde y_i \in \mathbb{R}^m$, i = 1, 2, …, p. Note that, by means of the splining operation, these vectors have notably more time samples than the original normalized vectors $\tilde w_i$, i = 1, 2, …, p.
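A sketch of this preprocessing chain is shown below, using SciPy stand-ins for the MATLAB routines named above (gaussian_filter1d for the Gaussian-weighted moving average and Akima1DInterpolator for the Akima spline). The mapping from $\alpha$ to the filter width is an illustrative assumption of ours, since the text leaves MATLAB's window heuristic unspecified.

```python
# Minimal sketch: standardize, smooth, and spline one hourly time series.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.interpolate import Akima1DInterpolator

def preprocess(w_raw, alpha, points_between=100):
    w = (w_raw - w_raw.mean()) / w_raw.std()   # standardize (w_tilde)
    sigma = 1.0 + 20.0 * alpha                 # hypothetical alpha -> width map
    w_smooth = gaussian_filter1d(w, sigma=sigma)
    t = np.arange(len(w))                      # hourly time stamps
    t_fine = np.linspace(t[0], t[-1], (len(t) - 1) * points_between + 1)
    return t_fine, Akima1DInterpolator(t, w_smooth)(t_fine)   # y_tilde

hours = np.arange(72)
rng = np.random.default_rng(2)
raw = 50 + 10 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 3, hours.size)
t_fine, y_tilde = preprocess(raw, alpha=0.5)
print(y_tilde.shape)   # (7101,): 100 points between each hourly pair
```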
The goal of our best subset regression approach is not only to find the best system of differential equations in the sense of the AIC, but also to ensure that the solutions of those systems fit the raw data well. The regression coefficients in (7) are found by solving Problem (8) with the vectors $\tilde y_i$, i = 1, 2, …, p. This means that the suitability or goodness of the fitted regressions is measured against data that have been previously manipulated. An excellent fit of an overly manipulated time series will probably not be very useful in practical terms. However, a good fit of noisy raw data seems difficult to obtain, especially under the conjecture that the dynamics of this system can be modeled with a second-order polynomial. In this context, the severity of the data filtering phase is paramount, and it is sensible to develop a framework in which the hyperparameter $\alpha$ is tuned adequately.

For a given time window $[t_0, t_f]$ that contains M readings of NO$_2$ and O$_3$ at an air quality station s, let us define the root mean square error

$$\mathrm{RMSE}^{\alpha}_{i,s} = \sqrt{\frac{1}{M}\,\big\| (\tilde w_i)_s - \hat y_{i,s}(\alpha) \big\|_2^2}. \qquad (9)$$

In equation (9), the vector $(\tilde w_i)_s \in \mathbb{R}^M$ contains the original normalized M readings of chemical species i at station s. The vector $\hat y_{i,s}(\alpha) \in \mathbb{R}^M$ contains the evaluations at those M points obtained by numerically solving the system of ODEs (7) after solving Problem (8). This way, each vector $\hat y_{i,s}(\alpha)$ is compared to the original normalized observations. In order to find the smoothing parameter that performs best, a suitable approach is to solve the following optimization problem:

$$\min_{\alpha \in (0,1)}\; \max_{i = 1, \dots, p}\; \overline{\mathrm{RMSE}}^{\alpha}_{i}, \qquad (10)$$

where $\overline{\mathrm{RMSE}}^{\alpha}_{i}$ is the average of $\mathrm{RMSE}^{\alpha}_{i,s}$ over all the air quality stations. Therefore, Problem (10) aims to calibrate the smoothing parameter $\alpha$ such that it minimizes the maximum forecasting error incurred, on average, by any chemical species. It is important to note that the optimal value of $\alpha$ will be sensitive to the data selected for the analysis (i.e., it will be sensitive to the location, the number of time periods, etc.).

The solution to this problem was tackled experimentally by considering v different values of $\alpha$ such that $0 < \alpha_1 < \alpha_2 < \dots < \alpha_v < 1$. Figure 3 illustrates our procedure. For a given choice of $\alpha$ and for each station s, we solved problems BS(i), i = 1, 2, …, p, and found a system of p (in our case p = 2) differential equations for each air quality station that reads both NO$_2$ and O$_3$. Once each system of ODEs was integrated numerically, we extracted the resulting vector $\hat y_{i,s}(\alpha)$ of M points, and a value of $\mathrm{RMSE}^{\alpha}_{i,s}$ was obtained via (9). We then repeated these operations for all stations and averaged those RMSEs to obtain p different values of $\overline{\mathrm{RMSE}}^{\alpha}_{i}$, whence the maximum was retrieved. After iterating over the v different values of $\alpha$ considered, we selected the minimum of those maximum average RMSEs as the solution to Problem (10).

In our experiments we filtered our time series with v = 21 different values of $\alpha$, from 0.05 to 0.95 in intervals of 0.05, plus 0.01 and 0.99. We did this for the 14 stations that read both NO$_2$ and O$_3$, which were selected as our chemical species (p = 2). Our results are consistent with the notion that larger time windows require a higher level of smoothing to dampen the effect of noisy data points. Consequently, it seems clear that the optimal value of $\alpha$ in Problem (10) is nondecreasing as we enlarge our time window. The optimal value of the objective function in Problem (10) is nondecreasing as well (i.e., the minimum of the maximum average RMSE does not decrease as we use larger data samples). This is shown in Figure 4, where we compare our results for a time window centered at noon on April 1st, 2018, with increasing window sizes between 3 hours and 19 hours.

As already discussed, the usefulness of the regression results depends on how closely they reproduce the original data. For this reason, we can have excellent fits in the sense of the AIC that are not very useful for fitting the actual data, because the original data have been damped excessively. As an example, consider Figure 5. Subfigures 5a and 5b show our regression results for two very disparate values of $\alpha$ (0.1 and 0.9). In Subfigure 5a, the filtered and splined data are clearly wavier than in Subfigure 5b, a consequence of a softer denoising effort. These wavier curves are very close to the original noisy data points and produce derivatives that are more difficult to fit using second-order polynomials, thus yielding results that do not adjust very well to the solid lines. In Subfigure 5b, the original time series has been modified much more significantly, and the derivatives behave in a way that is more amenable to fitting a second-order polynomial. After integrating the resulting system of ODEs, this results in dashed lines that effectively overlap the filtered and splined lines. However, our true measure of accuracy is given by the errors with respect to the original data, that is, the distance between the "×" and the "•" markers.
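The calibration just described can be sketched as a simple grid search over $\alpha$. In the sketch below, fit_and_integrate is a placeholder for the per-station pipeline described above (filter with the given $\alpha$, solve problem (8), integrate the ODEs, and sample at the M original time stamps); we do not reproduce that pipeline here.

```python
# Minimal sketch of the min-max grid search behind Problem (10).
import numpy as np

def calibrate_alpha(stations, fit_and_integrate, alphas):
    """stations: list of (w1, w2) normalized readings; returns (alpha, value)."""
    best_alpha, best_val = None, np.inf
    for alpha in alphas:
        avg_rmse = np.zeros(2)                     # one entry per species (p = 2)
        for w1, w2 in stations:
            yhat1, yhat2 = fit_and_integrate(w1, w2, alpha)
            avg_rmse[0] += np.sqrt(np.mean((w1 - yhat1) ** 2))  # RMSE (9)
            avg_rmse[1] += np.sqrt(np.mean((w2 - yhat2) ** 2))
        worst = (avg_rmse / len(stations)).max()   # max over species of avg RMSE
        if worst < best_val:                       # min over alpha
            best_alpha, best_val = alpha, worst
    return best_alpha, best_val

# The v = 21 grid used in the experiments:
# 0.05..0.95 in steps of 0.05, plus 0.01 and 0.99.
alphas = np.concatenate([[0.01], np.arange(0.05, 1.0, 0.05), [0.99]])
```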
Another important result stems from the degree of sparsity found in the fitted regressions. The assumption that the time derivatives of the concentrations of chemical species in the troposphere can be modeled with second-order polynomials is, per se, an assumption that targets a specific family of functionals. With two chemical species like NO$_2$ and O$_3$, this left us with at most 6 terms (5 variables and the intercept). From our results, it seemed that shorter time windows (and therefore most likely less noisy data sets) might produce sparser sets of ODEs under the AIC, ceteris paribus. This phenomenon is shown in Figure 6. However, on most occasions the solutions to Problem (8) were full models or models without a large degree of sparsity. Therefore, while our intention was to use SINDy to further reduce the number of terms in these regressions, our results show that real-world data may not behave as well as generated data, which hindered our ability to obtain sparser representations. In regard to the AIC, this means that the penalty imposed by the term $2\|\beta_i\|_0$ is not sufficiently important to discard full models, even though in many instances we obtained very parsimonious models that performed almost as well as more complex regressions. In view of this, we also tried other criteria for selecting the best models, such as the well-known adjusted $R^2$ and the Bayesian information criterion, whose penalty term is $\log(m)\,(\|\beta_i\|_0 + 1)$. We did not appreciate significant changes in the way these different criteria ranked the regressions.

In this section, we study some basic mathematical properties of the ODEs obtained from the data, which offer further analytical insight into the dynamics of $[\mathrm{NO}_2]$ and $[\mathrm{O}_3]$. The purpose of this analysis is to highlight some tools that allow us to better interpret the ODE models obtained in the previous section. The general form of the equations we are fitting is

$$\frac{d[\mathrm{NO}_2]}{dt} = \beta_{1,0} + \beta_{1,1}[\mathrm{NO}_2] + \beta_{1,2}[\mathrm{O}_3] + \beta_{1,3}[\mathrm{NO}_2]^2 + \beta_{1,4}[\mathrm{NO}_2][\mathrm{O}_3] + \beta_{1,5}[\mathrm{O}_3]^2, \qquad (11)$$
$$\frac{d[\mathrm{O}_3]}{dt} = \beta_{2,0} + \beta_{2,1}[\mathrm{NO}_2] + \beta_{2,2}[\mathrm{O}_3] + \beta_{2,3}[\mathrm{NO}_2]^2 + \beta_{2,4}[\mathrm{NO}_2][\mathrm{O}_3] + \beta_{2,5}[\mathrm{O}_3]^2, \qquad (12)$$

where the coefficients $\{\beta_{i,j}\}$ are obtained by solving the optimization Problem (8). Such a planar quadratic system can exhibit diverse behaviors that are well studied in the mathematical literature. For example, $[\mathrm{NO}_2]$ can blow up in finite time if $\beta_{1,0} > 0$, $\beta_{1,3} > 0$, and all other coefficients are zero. Phase plane analysis, which characterizes topological properties of the two-dimensional trajectories of the system, reveals that this system can, in general, describe more than 700 different classes of phase portraits (Reyn, 2007).

From the coefficients obtained in the time window considered, $\beta_{1,5}^2 + \beta_{2,5}^2$ and $\beta_{1,3}^2 + \beta_{2,3}^2$ are both nonzero for all 14 stations. Furthermore, Theorems 2.1 and 2.2 of Reyn (2007) assert that the sum of the multiplicities (called the finite multiplicity $m_f$ in Reyn (2007)) of the critical points of (11)-(12) is 4 for all 14 stations. Below we summarize some results about the real critical points for all 14 stations. The system (11)-(12) can be written in the compact form

$$\dot y_1 = P(y_1, y_2), \qquad \dot y_2 = Q(y_1, y_2),$$

where $(y_1, y_2) = ([\mathrm{NO}_2], [\mathrm{O}_3])$ and P and Q are quadratic polynomials. If $P(y_1^*, y_2^*) = Q(y_1^*, y_2^*) = 0$, then $(y_1^*, y_2^*)$ is called a critical point. The local stability of a critical point $(y_1^*, y_2^*)$ is characterized by the eigenvalues of $J(y_1^*, y_2^*)$, where

$$J(y_1, y_2) = \begin{pmatrix}
\beta_{1,1} + 2\beta_{1,3}\,y_1 + \beta_{1,4}\,y_2 & \beta_{1,2} + 2\beta_{1,5}\,y_2 + \beta_{1,4}\,y_1 \\
\beta_{2,1} + 2\beta_{2,3}\,y_1 + \beta_{2,4}\,y_2 & \beta_{2,2} + 2\beta_{2,5}\,y_2 + \beta_{2,4}\,y_1
\end{pmatrix}$$

is the Jacobian matrix (Reyn, 2007, Section 2.3.1; Murray, 2007, Appendix A). For example, suppose the eigenvalues $\{\lambda_1, \lambda_2\}$ are real and distinct (without loss of generality $\lambda_1 > \lambda_2$). Then the critical point is a stable node if $0 > \lambda_1 > \lambda_2$, an unstable node if $\lambda_1 > \lambda_2 > 0$, and a saddle point if $\lambda_1 > 0 > \lambda_2$.

As mentioned in Section 3.4, the system (11)-(12) was fit to standardized data, obtained by subtracting the average concentration and dividing by the standard deviation. This normalization, being a linear transformation

$$w_i = \frac{y_i - \mu_i}{\sigma_i}, \qquad i = 1, 2, \qquad (13)$$

where the $\mu$'s are the means and the $\sigma$'s are the standard deviations, does not change the stability of a critical point. Furthermore, $(w_1^*, w_2^*)$ is a critical point of the normalized system (11)-(12) if and only if $(\mu_1 + w_1^* \sigma_1,\, \mu_2 + w_2^* \sigma_2)$ is a critical point of the original (non-standardized) system. We can apply this analysis to the ODEs obtained for the 14 stations.
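A sketch of this classification follows, with illustrative (not fitted) coefficients $\beta_{i,j}$: locate critical points of (11)-(12) by multi-start root finding and classify each by the eigenvalues of the Jacobian.

```python
# Minimal sketch of the stability analysis of the planar quadratic system.
import numpy as np
from scipy.optimize import fsolve

b1 = np.array([0.1, -0.3, 0.2, -0.05, 0.04, 0.02])  # hypothetical beta_{1,j}
b2 = np.array([0.2, 0.1, -0.4, 0.03, -0.02, 0.05])  # hypothetical beta_{2,j}

def rhs(y, b):
    y1, y2 = y
    return b[0] + b[1]*y1 + b[2]*y2 + b[3]*y1**2 + b[4]*y1*y2 + b[5]*y2**2

def jacobian(y):
    y1, y2 = y
    return np.array([[b1[1] + 2*b1[3]*y1 + b1[4]*y2, b1[2] + 2*b1[5]*y2 + b1[4]*y1],
                     [b2[1] + 2*b2[3]*y1 + b2[4]*y2, b2[2] + 2*b2[5]*y2 + b2[4]*y1]])

def classify(y_star):
    lam = np.linalg.eigvals(jacobian(y_star))
    if np.all(np.abs(lam.imag) < 1e-12):       # real eigenvalues
        l1, l2 = sorted(lam.real, reverse=True)
        if l1 > 0 > l2:
            return "saddle point"
        return "unstable node" if l2 > 0 else "stable node"
    return "unstable spiral" if lam.real[0] > 0 else "stable spiral"

for guess in [(-3, -3), (-3, 3), (3, -3), (3, 3), (0, 0)]:
    y_star, info, ok, msg = fsolve(lambda y: [rhs(y, b1), rhs(y, b2)],
                                   guess, full_output=True)
    if ok == 1:   # converged (duplicate roots across starts are possible)
        print(np.round(y_star, 3), classify(y_star))
```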
In summary, 9 stations have 4 real critical points, 4 stations have a pair of complex critical points and two real critical points, and exactly one station has 4 complex critical points. Hence, there are 44 real critical points and 12 complex critical points. Among the 44 real critical points, 36 have positive coordinates and deserve special attention, since they have a physical interpretation as chemical concentrations. One critical point has negative coordinates, and the remaining 7 have one positive coordinate and one negative coordinate (Arturo Soria Station is an example). Stability analysis was performed for the 36 real critical points with positive coordinates, revealing that more than half (20 out of 36) are saddle points (see Table 2 for a summary).

Table 2: Classification of the 36 physically relevant critical points.

  Type             Eigenvalues {λ₁, λ₂}              Count
  Stable node      0 > λ₁ > λ₂                       5
  Unstable node    λ₁ > λ₂ > 0                       4
  Saddle point     λ₁ > 0 > λ₂                       20
  Stable spiral    {a + bi, a − bi} with a < 0       5
  Unstable spiral  {a + bi, a − bi} with a > 0       2

Our method, when applied to longer time windows, would give a planar quadratic system with time-varying coefficients $\{\beta_{i,j}(t)\}$. Due to their simple form and rich structure, such closed-form equations are promising tools for various purposes, including the study of long-time stochastic behavior (Budhiraja and Fan, 2017; Nguyen et al., 2020) and the enhancement of spatial-temporal models such as network models (Huang et al., 2010) and partial differential equations (Ogura and Phillips, 1962; Wilhelmson and Ogura, 1972; Lanser and Verwer, 1999).

From our dataset, we were only able to fit 14 systems of equations from the 24 stations available, because not all stations captured data for all the pollutants. As previously mentioned in Section 3.4, only 14 of the 24 stations contained data for both NO$_2$ and O$_3$, but all 24 stations capture measurements of NO$_2$. Given our ability to fit the stations outlined in Section 3, we hypothesize that we can use Takens' delay embedding theorem (Takens, 1981) to recover the O$_3$ measurements from our data.

In 1981, Floris Takens showed that global features of a trajectory in a dynamical system can be recovered using a single coordinate of the original data. In practice, the map

$$\hat y(t) = \big( y_1(t),\; y_1(t + \tau),\; \dots,\; y_1(t + (d-1)\tau) \big) \qquad (14)$$

yields an embedding that recovers the global properties of the original trajectory, for some lag $\tau$ and embedding dimension d. Here $y_1$ denotes the first coordinate of the original data vector y. If we assume that the data collected can be linked by an ODE obtained with SINDy, as demonstrated in Section 3, we should be able to recover partial information about our missing O$_3$ measurements. In practice, we need to estimate both $\tau$ and d, but in our case we can restrict ourselves to d = 2, since we are considering NO$_2$ and O$_3$. To estimate $\tau$, we use the method of minimizing the average mutual information between the data and the first lag coordinate (i.e., the second coordinate of equation (14)) (Fraser and Swinney, 1986). Define the average mutual information of a time series $\tilde y = (y(t_1), y(t_2), \dots, y(t_m))$ with lag $\tau$ as

$$I(\tau) = \sum_{i,j} q_{ij}(\tau) \log_2 \frac{q_{ij}(\tau)}{q_i\, q_j}, \qquad (15)$$

where $q_i$ is the probability that $y(t_l)$ falls in bin i of the histogram constructed from the samples of $\tilde y$, and $q_{ij}(\tau)$ is the probability that $y(t_l)$ falls in bin i and $y(t_l + \tau)$ falls in bin j (Wallot and Mønster, 2018). As mentioned previously, the embedding $\hat y(t)$ will only recover some qualitative features of the trajectory, but for our problem we can exploit the fact that we only have two dimensions to recover some information about the geometry of the trajectory (e.g., the asymptotic behavior).
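Before turning to the correction step, here is a sketch of the lag selection and the two-dimensional delay map (14): a histogram-based estimate of the average mutual information (15), minimized over $\tau$. The bin count, and the use of the global rather than the first local minimum, are simplifications of ours.

```python
# Minimal sketch: AMI-based lag selection and delay embedding.
import numpy as np

def average_mutual_information(y, tau, bins=16):
    a, b = y[:-tau], y[tau:]
    q_ij, _, _ = np.histogram2d(a, b, bins=bins)
    q_ij /= q_ij.sum()
    q_i, q_j = q_ij.sum(axis=1), q_ij.sum(axis=0)   # marginals
    mask = q_ij > 0
    return np.sum(q_ij[mask] * np.log2(q_ij[mask] / np.outer(q_i, q_j)[mask]))

def delay_embed(y, tau, d=2):
    """Rows are (y(t), y(t + tau), ..., y(t + (d-1)*tau)), as in (14)."""
    m = len(y) - (d - 1) * tau
    return np.column_stack([y[k * tau : k * tau + m] for k in range(d)])

rng = np.random.default_rng(3)
y = np.sin(0.05 * np.arange(2000)) + 0.05 * rng.standard_normal(2000)
amis = [average_mutual_information(y, tau) for tau in range(1, 60)]
tau_star = 1 + int(np.argmin(amis))
yhat = delay_embed(y, tau_star, d=2)   # the embedding y_hat(t) of (14)
```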
Consider the following optimization problem:

$$\min_{A \in O(2)}\; \big\| P_{y_1}\!\big(A\,\hat y\big) - \tilde y_1 \big\|_2^2, \qquad (16)$$

where O(2) is the orthogonal group of all 2 × 2 matrices and $P_{y_1}$ is the projection onto the first coordinate (our data coordinate). Exploring the local minima of this loss landscape yields matrices A (excluding the trivial solution A = I) that can be used to transform the output obtained from the reconstruction by applying $A\,\hat y(t)$. Figure 8 shows the results we obtained when performing Takens' reconstruction and its correction for the Villaverde Station, using NO$_2$ measurements to reconstruct O$_3$ samples (see Appendix C for more details). It is worth mentioning that the ODE that SINDy recovered from the data shown in Figure 8 is near an attractor, in this case a stable point, and this is a sufficient condition for the reconstruction procedure to apply (Sauer et al., 1991). This technique can help engineers qualitatively recover pollutant measurements that were originally not captured, without the need to upgrade equipment. Further exploration is needed in order to properly reconstruct the geometry of the trajectories.
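Because 2 × 2 rotations and reflections are each parametrized by a single angle (see Appendix C), the search over O(2) in (16) can be sketched as a direct scan. The projection-matching loss below is our reading of the optimization problem, not the authors' implementation.

```python
# Minimal sketch: scan O(2) by angle and rank candidate corrections.
import numpy as np

def o2_matrices(theta):
    c, s = np.cos(theta), np.sin(theta)
    return (np.array([[c, -s], [s, c]]),   # rotation by theta
            np.array([[c, s], [s, -c]]))   # reflection

def scan_o2(yhat, y1_obs, n_angles=720):
    """yhat: m x 2 embedded trajectory; y1_obs: the m observed data points."""
    losses = []
    for theta in np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False):
        for A in o2_matrices(theta):
            resid = (yhat @ A.T)[:, 0] - y1_obs   # project A*yhat on coord 1
            losses.append((np.sum(resid ** 2), theta, A))
    # theta = 0 with the rotation recovers the trivial solution A = I;
    # the nontrivial local minima supply the candidate corrections A*yhat.
    return sorted(losses, key=lambda item: item[0])

y1 = np.sin(0.1 * np.arange(500))
yhat = np.column_stack([y1[:-10], y1[10:]])   # toy embedding with tau = 10
ranked = scan_o2(yhat, y1[:-10])
print(ranked[0][0], ranked[0][1])             # best loss and its angle
```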
We have validated and outlined a series of data-driven tools for dealing with real-world atmospheric time series data and showed how SINDy provides a framework for extracting ODE models from real data collected across multiple stations distributed throughout the city of Madrid. We can break down our conclusions at three different levels:

1. Descriptive analysis: We found that using LASSO for extracting parsimonious ODEs in the context of SINDy can present numerical issues when solving these equations. We discussed how best subset regression, along with the Akaike information criterion, offered a more stable way to consistently fit multiple stations, and how it can be combined with an optimization framework that selects the optimal level of noise dampening so as to produce the best possible fit with respect to the original data. We find that, as we aim at identifying the dynamics of our system over longer periods of time, this dampening has to be more intense, and a good combination of sparsity and goodness of fit is more difficult to attain.

2. Stability analysis: We found that more than half of the physically relevant critical points are saddle points, suggesting that the system is unstable even under idealized environmental assumptions. However, there are few stable critical points in the model, pointing to a discrepancy with observed data on longer (> 24 hours) time scales. This discrepancy suggests that future refinements of the model can involve time-inhomogeneous coefficients that capture environmental fluctuations.

3. Reconstruction of trajectories: We discussed a reconstruction technique, based on Takens' embedding theorem, that allows for the recovery of missing pollutant concentration data using other correlated concentration measurements. We showed how we can qualitatively reconstruct O$_3$ measurements using only NO$_2$ measurements. We suggest the use of matrix transformations as a way to correct infeasible solutions obtained from Takens' reconstruction. This technique can be useful at stations that are not equipped to measure all atmospheric pollutants.

In short, our methodology provides researchers with the ability to construct interpretable, data-driven surrogate models from noisy chemical data sets. More importantly, it provides a more complete picture of the behavior of real-world atmospheric chemical species (NO$_2$ and O$_3$ in our case, although our methodology can be extended to any others). We hope that the results obtained from the wide adoption of these tools will allow pertinent authorities and policy makers to make more informed decisions when designing future environmental policies.

Acknowledgments. The authors want to thank Michael S. Hughes and Paul Bruillard for helpful discussions on Takens' reconstruction.

Appendix A. Methods considered

1. Best subset regression (Hill et al., 2006, Chapter 19): The very notion of sparse regression suggests the selection of a subset of $1 \le c \le n$ terms. This subproblem, called the best subset problem, can be cast generally as

$$\min_{\beta_i}\; f(\beta_i) \quad \text{s.t.} \quad \|\beta_i\|_0 \le c, \qquad (A.1)$$

where the $\ell_0$-norm $\|\beta_i\|_0 = \sum_{j=1}^{n} 1\{\beta_{ij} \neq 0\}$ counts the number of nonzero elements of the vector $\beta_i$ (except for the intercept). Therefore, Problem (A.1) aims at finding a sparse representation of $\dot y_i(t)$ that has at most c terms and that minimizes $f(\beta_i)$ (very often the sum of squared errors, i.e., $f(\beta_i) = \frac{1}{2}\|\dot{\tilde y}_i - \tilde F\beta_i\|_2^2$, also known as least-squares regression). However, its only constraint is combinatorial in nature and makes this optimization problem NP-hard (Natarajan, 1995). Thus, despite very promising recent efforts with mixed-integer optimization reformulations (Bertsimas et al., 2016), researchers and practitioners alike usually resort to different alternatives to attain sparse regressions.

2. LASSO regression (Tibshirani, 1996): One such option lies in a convex quadratic alternative to Problem (A.1) known as LASSO (Least Absolute Shrinkage and Selection Operator) regression:

$$\min_{\beta_i}\; f(\beta_i) \quad \text{s.t.} \quad \|\beta_i\|_1 \le \phi, \qquad (A.2)$$

where $\|\beta_i\|_1 = \sum_{j=1}^{n} |\beta_{ij}|$ is the $\ell_1$-norm of the vector $\beta_i$ (except the intercept). If we denote by $\beta_i^*$ the values of the regression coefficients of the full (i.e., the unconstrained) regression, then any value of $\phi$ such that $\|\beta_i^*\|_1 > \phi$ will produce a shrinkage. Geometrical considerations in this model make this shrinkage such that some coefficients become identically zero as we decrease the upper bound on the $\ell_1$-norm (see (Tibshirani, 1996) for more details). This optimization model is frequently expressed as an equivalent unconstrained problem with a regularization parameter $\lambda$:

$$\min_{\beta_i}\; f(\beta_i) + \lambda\,\|\beta_i\|_1. \qquad (A.3)$$

Since the problems LR(i), i = 1, 2, …, p, are quadratic and convex, they can be efficiently solved by some well-known optimization methods for finding the optimal solutions of a convex quadratic function over a polyhedron (see (Nocedal and Wright, 2006) for a reference on usual optimization methods for this kind of problem). It is worth mentioning that many researchers have historically seen Problem (A.2) as a heuristic for solving Problem (A.1), which is widely regarded as the formulation that yields the most desirable sparse solution with a subset of c variables. However, as noted in (Hastie et al., 2017), in noisy settings both problems offer different bias-variance tradeoffs, and neither dominates the other in all settings.

Appendix B. Algorithm for SINDy with AIC for one station

Algorithm 1: SINDy with AIC for one station.
Result: the best regression in the sense of the AIC that did not present singularities.
Let p be the number of chemical species; n the number of variables in the full regression model; a threshold for the maximum value allowed for $\dot y_i(t)$; $l_i = 1$ a counter denoting which ranked model for chemical species i should be selected; and l = 0 a counter for the total number of models discarded.
Find the $F = 2^n$ possible subsets of variables with size less than or equal to n.
for i ← 1 to p do
  for j ← 1 to $2^n$ do
    Solve the system (7) …

Appendix C.

In order to visualize the loss landscape used for the correction shown in Figure 8, we use the fact that 2 × 2 rotations and reflections can be parametrized by one angle. In higher dimensions, the optimization loss landscape becomes difficult to visualize, and the problem becomes ill-posed as we increase the dimensionality and decrease the access to data. Nevertheless, there are packages such as Pymanopt (Townsend et al., 2016) that allow users to solve this optimization problem over sets of n × n matrices, and this approach could shed some light on potential reconstructions for higher-dimensional problems.

References

Regional air quality forecasting using spatiotemporal deep learning.
Information theory and an extension of the maximum likelihood principle.
A new method of interpolation and smooth curve fitting based on local procedures.
Evaluating the first-order effect of intraannual temperature variability on urban air pollution.
Madrid Central – zona de bajas emisiones.
Portal de datos abiertos del Ayuntamiento de Madrid.
Workshop report on basic research needs for scientific machine learning: Core technologies for artificial intelligence.
An annual assessment of air quality with the CALIOPE modeling system over Spain.
Best subset selection via a modern optimization lens. The Annals of Statistics.
Operable adaptive sparse identification of systems: Application to chemical processes.
Machine learning-based adaptive model identification of systems: Application to a chemical process.
Tracking the chemical evolution of iodine species using recurrent neural networks.
Nitrogen oxides (NO$_x$): Why and how they are controlled.
Discovering governing equations from data by sparse identification of nonlinear dynamical systems.
Uniform in time interacting particle approximations for nonlinear equations of Patlak-Keller-Segel type.
Numerical differentiation of noisy, nonsmooth data. International Scholarly Research Notices.
Neural ordinary differential equations.
Survey of mathematical programming models in air pollution management.
Rethinking the ozone problem in urban and regional air pollution.
The influence of nitrogen oxides on the atmospheric ozone content.
The role of NO and NO$_2$ in the chemistry of the troposphere and stratosphere. Annual Review of Earth and Planetary Sciences.
Air pollution modeling – an overview. Ambient Air Pollution.
The kinetics of open reaction systems.
Mathematical models of chemical reactions: theory and applications of deterministic and stochastic models.
Recurrent neural network and random forest for analysis and accurate forecast of atmospheric pollutants: a case study in Hangzhou.
Atmospheric chemistry: fundamentals and experimental techniques.
Independent coordinates for strange attractors from mutual information.
Reducing the complexity of chemical networks via interpretable autoencoders.
Extended comparisons of best subset selection, forward stepwise selection, and the lasso.
Statistics: methods and applications: a comprehensive reference for science, industry, and data mining.
Reactive SINDy: Discovering governing reactions from concentration data.
Bayesian source detection and parameter estimation of a plume model based on sensor network measurements.
Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change.
Sparse identification of nonlinear dynamics for model predictive control in the low-data limit.
A least absolute shrinkage and selection operator (LASSO) for nonlinear system identification.
Analysis of operator splitting for advection-diffusion-reaction problems from air pollution modelling.
Assessing the environmental impact of car restrictions policies: Madrid Central case.
Photochemistry of air pollution.
Deep learning architecture for air quality predictions.
A study on extending the use of air quality monitor data via deep learning techniques.
System identification. Wiley Encyclopedia of Electrical and Electronics Engineering.
Modeling of dynamic systems.
Inferring biological networks by sparse identification of nonlinear dynamics.
Simplified kinetics of tropospheric ozone.
Mathematical biology: I. An introduction.
Data-driven identification of interpretable reduced-order models using sparse regression.
Sparse approximate solutions to linear systems.
Stochastic variability of tropical cyclone intensity at the maximum potential intensity equilibrium.
Numerical optimization.
Scale analysis of deep and shallow convection in the atmosphere.
Air quality forecasting in Madrid using long short-term memory networks.
International innovation and diffusion of air pollution control technologies: the effects of NO$_x$ and SO$_2$ regulation in the US, Japan, and Germany.
Comparison of neural network models with ARIMA and regression models for prediction of Houston's daily maximum ozone concentrations.
Universal differential equations for scientific machine learning.
Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations.
Phase portraits of planar quadratic systems.
Evaluation and comparison of statistical forecast models for daily maximum ozone concentrations.
Atmospheric chemistry and physics: from air pollution to climate change.
NOAA's HYSPLIT atmospheric transport and dispersion modeling system.
White-box machine learning approaches to identify governing equations for overall dynamics of manufacturing systems: A case study on distillation column.
Detecting strange attractors in turbulence.
Regression shrinkage and selection via the lasso.
Pymanopt: A Python toolbox for optimization on manifolds using automatic differentiation.
Calculation of average mutual information (AMI) and false-nearest neighbors (FNN) for the estimation of embedding parameters of multidimensional time series in MATLAB.
The pressure perturbation and the numerical modeling of a cloud.
Multi-step ahead forecasting of regional air quality using spatial-temporal deep neural networks: A case study of Huaihai economic zone.