key: cord-0045947-bzi38q5k
authors: Ménard, Richard; Cossette, Jean-François; Deshaies-Jacques, Martin
title: On the Complementary Role of Data Assimilation and Machine Learning: An Example Derived from Air Quality Analysis
date: 2020-05-25
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50433-5_17
sha: 9b39d1e7ea4f0c5bc8835b1eb46c990cf2359c19
doc_id: 45947
cord_uid: bzi38q5k

We present a new formulation of the error covariances that derives from ensembles of model simulations, which captures terrain-dependent error correlations without the prohibitive cost of ensemble Kalman filtering. Error variances are obtained from innovation variances empirically related to concentrations using a large data set. We use a k-fold cross-validation approach to estimate the remaining parameters. We note that by minimizing the cross-validation cost function, we obtain the parameters of an optimal Kalman gain. Combined with an innovation variance that is consistent with the sum of observation and background error variances in observation space, this yields a scheme that estimates the true error statistics, thus minimizing the true analysis error. Overall, the new formulation and estimation of the error statistics outperforms the older optimum interpolation scheme that used isotropic covariances with optimized covariance parameters. Yet the analysis scheme is computationally comparable to optimum interpolation and can be used in real-time operational applications. These new error statistics come as data-driven models, where we use validation techniques that are common in machine learning. We argue that the error statistics could benefit from a machine learning approach, while the air quality model and the analysis scheme derive from physics and statistics.

Data assimilation was originally developed from a need to specify the initial conditions of numerical weather prediction models [1], which otherwise would have little predictive skill due to the unstable dynamics of the atmosphere. Imperfect knowledge of, and incorrect assumptions about (e.g. no model error covariance), the (input) error statistics used for data assimilation are not as critical for numerical weather prediction as for other applications, owing to the presence of the unstable subspace in the forecast error. Indeed, it has been argued that by confining the forecast error corrections to the unstable and neutral subspace, four-dimensional variational data assimilation can perform better than without this confinement [2, 3]. However, not all application areas of data assimilation have an unstable subspace. Atmospheric chemistry models, for example, are known to be "slaved" by the meteorology [4, 5], and it is known that precise knowledge of the chemical observation and model error covariances has a rather strong impact on the performance of (chemical) data assimilation [6, 7]. The estimation of correct error statistics is therefore important for a truly optimal chemical data assimilation system. This has been recognized from the very beginning of Kalman filtering data assimilation with neutral or dissipative models (e.g. [8, 9]). With dissipative models, verification of the forecast as a measure of analysis accuracy has limited value. However, the true analysis error can be evaluated using cross-validation [10, 11]. In cross-validation we produce analyses with a subset of observations and verify the analysis against the remaining observations. There is no need to conduct a forecast.
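As a minimal sketch of this cross-validation procedure (our illustration, not code from the paper), the following Python function scores an analysis against held-out stations. Here `analysis_fn` is a hypothetical stand-in for any analysis scheme that maps the active observations to values at the passive stations; its signature is our assumption.

```python
import numpy as np

def crossval_variance(obs, analysis_fn, k=3):
    """Estimate var(O_c - A): withhold each of k station subsets in turn,
    build the analysis from the remaining (active) stations, and verify it
    at the withheld (passive) stations.  No forecast is needed."""
    n = len(obs)
    folds = np.arange(n) % k  # every k-th station in an ordered station list
    residuals = []
    for i in range(k):
        passive = folds == i
        active = ~passive
        # hypothetical call: analysis from active obs, returned at passive sites
        a_at_passive = analysis_fn(obs[active], np.where(active)[0],
                                   np.where(passive)[0])
        residuals.append(obs[passive] - a_at_passive)
    return np.var(np.concatenate(residuals))
```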
Since the optimality of analyses (measured against independent observations) depends on having the correct (input) error statistics, we may view the estimation of the true error statistics as an inverse problem, with the quality of the analyses (measured by cross-validation) as the target. But as with most inverse problems, the minimization of the verification error alone is not sufficient to determine the correct input error statistics [12]. Additional criteria are needed, such as the matching of the innovation covariance with the sum of the background error covariance in observation space and the observation error covariance, called the innovation covariance consistency [13]. This is where even elementary concepts of machine learning can become useful. Atmospheric models are based on the laws of physics, and in our case on chemical laws as well; they embody our prior knowledge of the system. The innovations, or observation-minus-model residuals, on the other hand, contain information that is unexplained by the model, and this is where machine learning can play a complementary role to data assimilation. Complementary roles of machine learning and data assimilation have also been developed in a form of Kalman filtering (known as the Parametric Kalman filter) that requires a closure on the form and parameters of the error covariances [14].

At Environment and Climate Change Canada (ECCC) we have operated a surface analysis of atmospheric pollutants since 2002 [15]; it provides a more complete and accurate representation of surface air quality and has been used in several health impact studies (e.g. [16, 17]). Although the optimization of (isotropic) covariance model parameters improves the analysis [18], little is known about more realistic and accurate covariance models suitable for chemical data assimilation [19, 20]. Since air quality models are computationally demanding (compared with numerical weather prediction models), the use of (ensemble) Kalman filtering to obtain non-homogeneous, non-isotropic error covariances is more in the realm of research than of operations. Recently at ECCC we have been developing a simple approach to generate non-homogeneous, non-isotropic error correlations near the surface that does not involve rerunning the air quality model, but simply uses pre-existing model simulations over a period of a few months. Note that the chemical simulations are driven (slaved) by meteorological analyses. The error correlations obtained in this way are not flow-dependent, but they are non-homogeneous, non-isotropic and terrain-dependent, and this accounts for a large fraction of the error correlation signal near the surface. We present some examples of these error correlations in Sect. 2.1. As with ensemble Kalman filtering, localization of these ensemble-derived correlations is needed; it is obtained by minimizing the analysis error evaluated by cross-validation. This is presented in Sect. 2.3. For the error variances we use essentially an innovation-driven (data-driven) representation as a function of concentration, which is simple yet appropriate for these fields. This is discussed in Sect. 2.2. We show that this new analysis scheme is superior to the currently operational implementation, which uses optimum interpolation with homogeneous isotropic correlation models [21]. We note that the method we are using is similar to (yet much simpler than) some of the methods used in machine learning.
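To make the innovation covariance consistency criterion introduced above concrete, here is a small Python sketch (ours, not from the paper) that compares the sample innovation covariance with HBHᵀ + R; the input matrices are assumed to be supplied already mapped to observation space.

```python
import numpy as np

def innovation_consistency(innovations, HBHt, R):
    """Compare the sample innovation covariance E[(O-B)(O-B)^T] with
    HBH^T + R.  innovations: (n_times, n_obs) array of O-B residuals;
    HBHt, R: (n_obs, n_obs) input error covariances in observation space.
    Returns the ratio of total variances; a value near 1 indicates
    innovation covariance consistency of the input error statistics."""
    d = innovations - innovations.mean(axis=0)
    sample_cov = d.T @ d / (d.shape[0] - 1)
    return np.trace(sample_cov) / np.trace(HBHt + R)
```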
We argue that a data-driven approach to modeling the error covariances, together with more sophisticated machine learning algorithms [22], could further improve these representations and help truly optimize an assimilation system.

In our current operational version of the analysis of surface pollutants, we use homogeneous isotropic correlation models [21] with tuning of the covariance parameters [15, 18] to optimize the system. In the new version, we obtain anisotropic error correlations from an ensemble of pre-existing air quality simulations (i.e. forecasts with no chemical data assimilation). The ensemble is in fact climatological: the air quality simulations are collected over a period of two months. For each hour of the day, we thus have an ensemble of about 60 realizations over that period, which captures non-homogeneous and non-isotropic correlations. As in the ensemble Kalman filter, localization is needed to avoid spurious correlations at large distances that would infiltrate the analysis and significantly reduce its optimality. The idea behind using an ensemble of pre-existing model forecasts is that the resulting error correlations capture stationary, terrain-dependent structures, such as those induced by the proximity of water surfaces, mountain ranges, valleys, chemical sources, and so on, since these features are always present. This method does not capture the day-to-day variability (or flow dependence) of the error correlations, as an ensemble Kalman filter would, but it may capture the effect of prevailing winds, for example. In the example presented below, we computed the spatial correlations for each observation site using a time series of the Canadian operational air quality model (GEM-MACH) output of PM2.5 at 21 UTC over a two-month period (July-August 2016), and applied a compact support correlation function (the 5th-order piecewise rational function of Gaspari and Cohn [23]) for localization. Examples of spatial correlations of PM2.5 with respect to an observation site in Toronto (Canada) and one in Los Angeles (USA) are presented in Fig. 1. We note that the correlation around Toronto (upper panel) is more or less oval, while strong anisotropic structures extending along the coast, influenced by the presence of water and terrain, appear in the correlation about Los Angeles (lower panel). Correlations at Winnipeg, in the Canadian prairies over flat terrain, are nearly circular or isotropic (result not shown). Likewise, the spatial correlation with respect to Surrey (in the neighborhood of Vancouver, Canada) shows a minimum over the Coast Mountains and over the Rockies, with a maximum in between over the central interior plateau (result not shown). In general, the spatial correlation structures reveal the presence of mountain ranges, valleys, and extended water surfaces. These correlation structures were obtained after localization using a Schur product with a compact support correlation function. The length-scale of the compact correlation model is estimated by cross-validation (see Sect. 2.4).
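A minimal Python sketch of this localization step, assuming pairwise distances in km are already available. The function below is the standard Gaspari-Cohn 5th-order piecewise rational form; the relation between the support parameter `c_km` and the length-scale estimated in Sect. 2.4 is our assumption, not a detail given in the paper.

```python
import numpy as np

def gaspari_cohn(d, c):
    """Gaspari-Cohn 5th-order piecewise rational compact-support
    correlation function; equals 1 at d = 0 and 0 beyond d = 2c."""
    z = np.abs(np.asarray(d, dtype=float)) / c
    f = np.zeros_like(z)
    m1 = z <= 1.0
    m2 = (z > 1.0) & (z < 2.0)
    z1, z2 = z[m1], z[m2]
    f[m1] = -0.25*z1**5 + 0.5*z1**4 + 0.625*z1**3 - (5/3)*z1**2 + 1.0
    f[m2] = ((1/12)*z2**5 - 0.5*z2**4 + 0.625*z2**3
             + (5/3)*z2**2 - 5.0*z2 + 4.0 - (2/3)/z2)
    return f

def localized_correlation(X, dist_km, c_km=150.0):
    """Climatological correlation from an ensemble of pre-existing model
    fields, localized by a Schur (element-wise) product.
    X: (n_days, n_points) model values at a fixed hour (e.g. PM2.5, 21 UTC);
    dist_km: (n_points, n_points) pairwise distances."""
    C_raw = np.corrcoef(X, rowvar=False)  # ~60 realizations per hour
    return C_raw * gaspari_cohn(dist_km, c_km)
```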
The innovation (i.e. the observation-minus-model residual) variance has been plotted against the mean concentration for each station at a given local time (21 UTC), using a two-month time series to generate the statistics for each station. The variance of the innovation gives the sum of the observation and model (i.e. background) error variances, but its partition into observation and model errors is as yet unknown. Furthermore, the observation error is not simply the instrument error: it should also include errors due to the mismatch between the scale sampled by a single observation and that of the model grid box, subgrid-scale variability that may be captured by the observation but not by the model, missing modeled processes, and so on, which we collectively call the representativeness error. The observation error is thus not well known and needs to be estimated. This is done in Sect. 2.4 using cross-validation, assuming that the observation error variance is simply a fraction of the innovation variance. The behavior of the innovation variance as a function of concentration is different for PM2.5 than for NO2. We note that for NO2 the innovation variance goes to zero as the concentration goes to zero, which is not the case for PM2.5. The PM2.5 relationship is nearly linear, while that of NO2 (especially at night; result not shown) has a quadratic shape at low concentrations and saturates at higher concentrations. The fit of the innovation variance as a function of concentration (as in Fig. 2) is important for two reasons. First, as will become relevant in the following subsection, we need the sum of the observation and model (background) error variances to match the innovation variance, a property known as the innovation variance consistency. Second, after obtaining the portion of the innovation variance due to the model (background) error variance, we can apply the relationship between background error variance and concentration to determine the background error variance at each model surface grid point (not only at the observation locations).

Under the assumption of uncorrelated observation and background errors, two necessary and sufficient conditions to estimate the true observation and background error covariances have been established [13]. The first is that the Kalman gain in observation space (i.e. $\mathbf{HK}$, where $\mathbf{H}$ is the observation operator and $\mathbf{K}$ the Kalman gain) is optimal, in the sense that the true analysis error (in observation space) is minimum. The second is that the innovation covariance matches the sum of the background error covariance in observation space and the observation error covariance, i.e. $\mathbf{HBH}^{T} + \mathbf{R}$, where $\mathbf{B}$ is the background error covariance and $\mathbf{R}$ the observation error covariance. The fit of the innovation variance presented in Fig. 2 ensures by construction that the sum of error variances $\sigma_o^2 + \sigma_b^2$ is consistent with the innovation variance (though not the full innovation covariance), thus at least partly fulfilling the second condition above. The first condition, on the optimality of the Kalman gain, is obtained by using cross-validation [10, 11].
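A minimal sketch of the two steps just described (our illustration; the paper does not give fitting code): fit the per-station innovation variances against mean concentration, then split the fitted variance into observation and background parts using the variance ratio estimated by cross-validation in Sect. 2.4. The linear fit is a stand-in suited to PM2.5; a quadratic, saturating form would be chosen for NO2.

```python
import numpy as np

def fit_innovation_variance(mean_conc, innov_var, deg=1):
    """Polynomial least-squares fit of per-station innovation variance
    vs mean concentration (deg=1 is roughly adequate for PM2.5)."""
    coeffs = np.polyfit(mean_conc, innov_var, deg)
    return np.poly1d(coeffs)  # callable: concentration -> var(O - B)

def partition_variance(innov_var, ratio):
    """Split var(O-B) into (sigma_o^2, sigma_b^2) given
    ratio = sigma_o^2 / sigma_b^2, so that sigma_o^2 + sigma_b^2
    equals the innovation variance by construction."""
    sigma_b2 = innov_var / (1.0 + ratio)
    return innov_var - sigma_b2, sigma_b2

# Hypothetical usage:
#   v = fit_innovation_variance(conc_stations, innov_var_stations)
#   sigma_o2, sigma_b2 = partition_variance(v(conc_grid), ratio=1.5)
```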
As a way to illustrate how cross-validation yields the optimal gain, we use a geometric interpretation in which random variables $y_1, y_2$ are represented in a Hilbert space whose inner product is defined as the covariance between the two random variables, $\langle y_1, y_2 \rangle = E[y_1 y_2]$, where $E$ is the mathematical expectation. In this framework, uncorrelated random variables are represented as perpendicular "vectors", and the error variance is the squared norm of the corresponding "vector" (see [11] and references therein). Figure 3 illustrates cross-validation geometrically. Let the active observation $y_o$ be represented by the point O in Fig. 3, the prior or background $y_b$ by B, the analysis $y_a$ by A, and the independent (or passive) observation $y_c$ by $O_c$. The origin T corresponds to the truth. Note that although the illustration is made for a (scalar) random variable, the same principle holds component-wise for random vectors. We assume that the background error is uncorrelated with the observation errors, and that observation errors are horizontally uncorrelated in space. Then the background error and the active and passive observation errors are mutually uncorrelated, so the three axes, $e_o$ for the active observation error, $e_b$ for the background error, and $e_{o_c}$ for the passive observation error, are orthogonal. The plane defined by the axes $e_o$ and $e_b$ is the space where the analysis takes place, and is called the analysis plane. Since we define the analysis to be linear and unbiased, the analysis is a linear combination of the observation and the background; thus, in this illustration the analysis A lies on the line (B, O). The thick lines in Fig. 3 represent the norms of the associated errors; for example, the thick line along the $e_o$ axis depicts the (active) observation error standard deviation $\sigma_o$, and similarly for the other axes and random variables. Since the active observation error is uncorrelated with the background error, the triangle OTB is a right triangle, and by Pythagoras' theorem we have $\mathrm{var}(O - B) = \sigma_o^2 + \sigma_b^2$. This is the usual statement that the innovation variance is the sum of the background and observation error variances. The analysis is optimal when the analysis error $\| e_a \|^2 = \sigma_a^2$ is minimum, that is, when the point A is closest to T; in this case the line $(T, \hat{A})$ is perpendicular to the line (O, B). Since passive observations are never collocated with active observations, their errors are uncorrelated with the active observation and background errors, and thus $e_{o_c}$ is perpendicular to the analysis plane. So when the point A is closest to T, it is also closest to $O_c$. Since the squared distance $(O_c, A)$ corresponds to the variance of $(O_c - A)$ evaluated in cross-validation, we conclude that when $\mathrm{var}(O_c - A)_{CV}$ is minimum, the analysis is optimal (i.e. has minimum error), $A = \hat{A}$ [11].

Fig. 3. Geometric representation (i.e. Hilbert space) of a scalar analysis with cross-validation (reproduced from [24], Figure 37.2b).
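This geometric argument can be checked numerically. The following self-contained Python sketch (ours, with arbitrary error standard deviations) sweeps the gain of a scalar analysis $A = B + k(O - B)$ and confirms that the gain minimizing $\mathrm{var}(O_c - A)$ coincides with the gain minimizing the true analysis error, $\sigma_b^2 / (\sigma_b^2 + \sigma_o^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500_000
sigma_o, sigma_b, sigma_oc = 1.0, 1.2, 0.9   # assumed error std devs
e_o = rng.normal(0.0, sigma_o, N)    # active observation error
e_b = rng.normal(0.0, sigma_b, N)    # background error
e_oc = rng.normal(0.0, sigma_oc, N)  # passive observation error

gains = np.linspace(0.0, 1.0, 201)
# A - T = e_b + k*(e_o - e_b), so O_c - A = e_oc - (A - T)
cv_var = [np.var(e_oc - (e_b + k * (e_o - e_b))) for k in gains]
k_cv = gains[np.argmin(cv_var)]
k_opt = sigma_b**2 / (sigma_b**2 + sigma_o**2)  # minimizes true analysis error
print(f"gain from cross-validation: {k_cv:.3f}, optimal gain: {k_opt:.3f}")
```

Up to sampling noise and grid resolution, the two gains agree, which is the scalar version of the statement that minimizing $\mathrm{var}(O_c - A)_{CV}$ yields the optimal Kalman gain.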
An evaluation by cross-validation is performed to assess the quality of the analysis. The k-fold cross-validation methodology consists of separating the observation data set into k equal-size, spatially randomly distributed subsets, and using each subset in turn to evaluate an analysis computed with the observations of the remaining k−1 subsets. Here we use k = 3 [16, 17]. The separation into spatially randomly distributed subsets is achieved by selecting every third station in an ordered station-ID list. An illustration of this selection method for the PM2.5 surface monitoring stations is depicted in Fig. 4 below. Another method, which makes this selection rigorous, is the application of Hilbert curves [25].

Let $O_j(n, t)$ be the concentration measured at station $j$ at the local hour $t$ on day $n$. First, we calculate the verification statistics for each station at a given local time using an ensemble of days $\{n = 1, \ldots, N_{\mathrm{days}}\}$. The average at station $j$ at local time $t$ is simply

$\bar{O}_j(t) = \frac{1}{N_{\mathrm{days}}} \sum_{n=1}^{N_{\mathrm{days}}} O_j(n, t).$

The verification statistics are often defined over a region. For example, the mean concentration variance over an ensemble $\{O_i\}$ of stations, for a total number of $N_s$ stations, is defined as

$\mathrm{var}(O_i)(t) = \frac{1}{N_s} \sum_{j \in O_i} \frac{1}{N_{\mathrm{days}}} \sum_{n=1}^{N_{\mathrm{days}}} \left( O_j(n, t) - \bar{O}_j(t) \right)^2.$

In general, the ensemble $\{O_i\}$ (or simply denoted $O_i$) can be:
• an ensemble over all stations in the whole domain,
• an ensemble over a region (or subdomain),
• an ensemble over all passive stations (i.e. stations not used in the analysis),
or any variant or combination thereof. In this document there is only a single domain, consisting of the continental USA and Canada. For cross-validation, the analyses are evaluated against passive observations, i.e. observations not used to construct the analyses. We recall that passive observation sites are never collocated with the active stations (the stations used to construct the analyses). Specifically, the ensemble of observations (for each local time) is split into three disjoint subsets $O_1, O_2, O_3$, and we denote by $A^{[1]} = A^{[1]}(O_2, O_3)$ the cross-validation analysis that uses $O_2, O_3$ and excludes the subset $O_1$. The interpolated analysis at a passive station $j \in O_1$ is denoted by $A^{[1]}_j$. The cross-validation variance statistics are then given by the average over the three passive subsets,

$\mathrm{var}(O - A)_{CV} = \frac{1}{3} \sum_{i=1}^{3} \mathrm{var}\left(O_i - A^{[i]}\right),$

where the statistics of each passive subset $i$ are calculated by an average over all passive stations in the given subset; specifically,

$\mathrm{var}\left(O_i - A^{[i]}\right) = \frac{1}{N_s} \sum_{j \in O_i} \frac{1}{N_{\mathrm{days}}} \sum_{n=1}^{N_{\mathrm{days}}} \left( O_j(n) - A^{[i]}_j(n) \right)^2.$

Note that a cross-validation statistic is evaluated for each local time; we have omitted the variable $t$ to keep the notation simple.

In our context, where innovation variance consistency is enforced through the innovation variance fitting (see Sect. 2.2), we are left with only two parameters to estimate: 1) the ratio of observation to background error variance, $\sigma_o^2 / \sigma_b^2$, and 2) the compact support correlation length-scale $L_s$. These parameters are estimated by minimizing $\mathrm{var}(O - A)_{CV}$, which in the end yields a nearly optimal analysis. In Fig. 5 we plot $\mathrm{var}(O - A)_{CV}$ for different values of $\sigma_o^2 / \sigma_b^2$ and $L_s$. We find a single minimum of the fit of the analysis to the independent (or passive) observations, for an error variance ratio of 1.5 and a compact support correlation length of 300 km, used to localize the raw model correlations.
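A sketch of this two-parameter estimation as a simple grid search (our construction, not the authors' code). The objective below is a toy analytic stand-in for $\mathrm{var}(O - A)_{CV}$, with its minimum placed at the paper's reported values (ratio 1.5, 300 km) purely so the sketch runs end-to-end; in practice `cv_objective` would be replaced by a full cross-validated analysis such as `crossval_variance` above.

```python
import numpy as np

def cv_objective(ratio, Ls, true_ratio=1.5, best_Ls=300.0, innov_var=1.0):
    """Toy stand-in for var(O - A)_CV.  The gain implied by the candidate
    ratio is k = 1/(1 + ratio); the quadratic factor in Ls mimics the
    penalty of a mis-specified localization scale (illustrative only)."""
    k = 1.0 / (1.0 + ratio)
    sb2 = innov_var / (1.0 + true_ratio)   # true background error variance
    so2 = innov_var - sb2                  # true observation error variance
    analysis_err = (1.0 - k)**2 * sb2 + k**2 * so2
    return (so2 + analysis_err) * (1.0 + ((Ls - best_Ls) / best_Ls)**2)

ratios = np.linspace(0.5, 3.0, 26)        # sigma_o^2 / sigma_b^2
lengths = np.linspace(100.0, 600.0, 26)   # compact-support scale (km)
scores = np.array([[cv_objective(r, L) for L in lengths] for r in ratios])
i, j = np.unravel_index(np.argmin(scores), scores.shape)
print(f"minimum at ratio = {ratios[i]:.2f}, Ls = {lengths[j]:.0f} km")
```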
Using these optimal parameter values, with the modeling based on the innovation variance as a function of concentration and with model output statistics used to construct the spatial correlation structures, we evaluate the new analysis (Av2). We then compare it against the old analysis scheme (Av1), which uses homogeneous isotropic correlation functions and $\chi^2$-optimized error statistics [15]. The result, evaluated by cross-validation, is presented in Fig. 6 below. We observe a sensitivity to the error statistics used to generate the analysis, with superior analyses when the modeling is based on our methodology using the data rather than on specified isotropic models. It is by letting the data itself (model output and observations) provide the modeling elements of the observation and background error covariances that we arrive at an improved analysis. We thus argue that data-driven modeling of the observation and background error covariances plays a complementary role to data assimilation, resulting in a nearly optimal system.

References

1. Atmospheric Data Analysis
2. Four-dimensional variational assimilation in the unstable subspace and the optimal subspace dimension
3. Asymptotic forecast uncertainty and the unstable subspace in presence of additive model error
4. Constituent Assimilation
5. Coupled stratospheric chemistry-meteorology data assimilation. Part I: Physical background and coupled modeling aspects
6. Technical Note: Spectral representation of spatial correlations in variational assimilation with grid point models and application to the Belgian Assimilation System for Chemical Observations (BASCOE)
7. Coupled stratospheric chemistry-meteorology data assimilation. Part II: Weak and strong coupling
8. The lagged innovation covariance: a performance diagnostic for atmospheric data assimilation
9. The effect of serially correlated observation and model error on atmospheric data assimilation
10. Evaluation of analysis by cross-validation. Part I: Using verification metrics
11. Evaluation of analysis by cross-validation. Part II: Diagnostic and optimization of analysis error covariance
12. A posteriori verification of analysis and assimilation algorithms
13. Error covariance estimation methods based on analysis residuals: theoretical foundation and convergence properties derived from simplified observation networks
14. PDE-NetGen 1.0: from symbolic PDE representations of physical processes to trainable neural network representations
15. Multi-year objective analysis of warm season ground-level ozone and PM2.5 over North-America using real-time observations and Canadian operational air quality models
16. Ambient PM2.5, O3, and NO2 exposures and association with mortality over 16 years of follow-up in the Canadian Census Health and Environment Cohort (CanCHEC)
17. Early life exposure to air pollution and incidence of childhood asthma, allergic rhinitis and eczema
18. A comparison of correlation-length estimation methods for the objective analysis of surface pollutants at Environment and Climate Change Canada
19. Autoregressive models of background errors for chemical data assimilation
20. Construction of nondiagonal background error covariance matrices for global chemical data assimilation
21. The chemistry-forecast system at the Meteorological Service of Canada
22. Gaussian Processes for Machine Learning
23. Construction of correlation functions in two and three dimensions
24. Evaluation of air quality maps using cross-validation: Metrics, diagnostics and optimization
25. Applications of Hilbert curves to the selection of subsets of spatially inhomogeneous observational data for cross-validation and to the construction of super-observations