Optimally Robust Kalman Filtering at Work: AO-, IO-, and Simultaneously IO- and AO-Robust Filters
Peter Ruckdeschel
2010-04-22

Abstract. We take up optimality results for robust Kalman filtering from Ruckdeschel [2001, 2010], where robustness is understood in a distributional sense, i.e., we enlarge the distributional assumptions made in the ideal model by suitable neighborhoods, allowing for outliers which in our context may be system-endogenous/propagating or -exogenous/non-propagating, inducing the somewhat conflicting goals of tracking and attenuation. Correspondingly, the cited references provide optimally robust procedures to deal with each type of outlier separately, but in the case of IO robustness do not say much about the implementation. We discuss this in more detail in this paper. Most importantly, we define a hybrid filter combining AO- and IO-optimal ones, which is able to treat both types of outliers simultaneously, albeit with a certain delay. We check our filters on a reference state space model and compare the results with those obtained by the ACM filter (Masreliez and Martin [1977], Martin [1979]) and by non-parametric, repeated-median based filters (Fried et al. [2006, 2007]).

Robustness is an "old" problem in Kalman filtering, with first (non-verified) hits on a quick search for "robust Kalman filter" on scholar.google.com as early as 1962, i.e., even before the seminal Huber (1964) paper, which is often referred to as the birthday of Robust Statistics. The amount of literature on this topic is huge, and we do not attempt to give a comprehensive account here. Instead, we refer the reader to the excellent surveys given in Ershov and Lipster (1978), Kassam and Poor (1985), Stockinger and Dutter (1987), Schick and Mitter (1994), Künsch (2001), and to some extent Ruckdeschel (2001, Sect. 1.5). The mere notion of robustness in the context of filtering is non-standard; most frequently it is used to describe stability of the procedure w.r.t. certain variations of the "input parameters", and the choice of the "input parameters" to look at varies from notion to notion. In this paper we are concerned with (distributional) minimax robustness, i.e., we allow for deviations from the ideal distributional model assumptions, defining suitable neighborhoods about this ideal model. For these neighborhoods, we consider procedures minimizing the maximal predictive inaccuracy on these neighborhoods, measured in terms of mean squared error (MSE). The minimax filters derived in Ruckdeschel (2010) come as closed-form saddle points consisting of an optimally robust procedure and a corresponding least favorable outlier situation. Their optimality holds in a surprisingly general setup of state space models, which is not limited to a Euclidean or time-discrete framework. The important condition is that MSE makes sense for the range of the states, so these results cover general Hidden Markov Models for an arbitrary observation space, dynamic (generalized) linear models as discussed in West et al. (1985) and West and Harrison (1989), as well as continuous-time settings such as those used in applications of Mathematical Finance; compare, e.g., Nielsen et al. (2000) and Singer (2002) for respective model formulations. For the application, we need to linearize the corresponding transition and observation functions suitably, e.g.,
to give the (continuous-discrete) Extended Kalman Filter (EKF). We even cover applications such as optimal portfolio selection, with corresponding controls U_t (i.e., buying and selling operations) and, in principle, indirectly observed random fields. In this paper, though, to clarify ideas we limit ourselves to the linear, Euclidean, time-discrete state space model (SSM), and in the ideal model we assume normality. More details on these models can be found in many textbooks, cf. e.g. Anderson and Moore (1979), Harvey (1991), Hamilton (1993), and Durbin and Koopman (2001).

Let us start with some definitions and assumptions. Our SSM consists of an unobservable p-dimensional state X_t evolving according to a possibly time-inhomogeneous vector autoregressive model of order 1 (VAR(1)) with innovations v_t and transition matrices F_t, i.e.,

$X_t = F_t X_{t-1} + v_t, \qquad t = 1, 2, \ldots$   (1)

We only observe a q-dimensional linear transformation Y_t of X_t involving an additional observation error ε_t,

$Y_t = Z_t X_t + \varepsilon_t, \qquad t = 1, 2, \ldots$   (2)

In the ideal model we work in a Gaussian context, that is, we assume

$v_t \sim \mathcal{N}_p(0, Q_t), \qquad \varepsilon_t \sim \mathcal{N}_q(0, V_t), \qquad X_0 \sim \mathcal{N}_p(a_0, Q_0),$   (3)

$\{X_0, v_s, \varepsilon_t,\ s, t \in \mathbb{N}\}$ stochastically independent.   (4)

For this paper, we assume the hyper-parameters F_t, Z_t, Q_t, V_t, a_0 to be known; a small simulation sketch of this ideal model is given below. As announced, for robustness considerations these ideal model assumptions are extended by allowing (small) deviations, most prominently generated by outliers. In our notation, the suffix "id" indicates the ideal setting, "di" the distorting (contaminating) situation, and "re" the realistic, contaminated situation.

AO's and IO's. We follow the terminology of Fox (1972), who distinguishes innovation outliers (or IO's) and additive outliers (or AO's). Historically, AO's and IO's denote gross errors affecting the observation errors and the innovations, respectively. For consistency with the literature, we stick to this distinction, but rather use these terms in a wider sense: IO's stand for general endogenous outliers entering the state equation, hence with propagating distortion, also covering level shifts or linear trends which would not be included in the original definition. Similarly, wide-sense AO's denote general exogenous outliers which do not propagate, like substitutive outliers or SO's as defined in equations (18) and (19) below.

Different and competing goals induced by AO's and IO's. Due to their different nature, as a rule, a different reaction is required in the presence of IO's and AO's. As AO's are exogenous, we would like to ignore them as far as possible, damping their effect, while when there are IO's, something has happened in the system, so the usual goal will be to detect these structural changes as fast as possible. A situation where both AO's and IO's may occur is more difficult: we are faced with an identification problem, as we cannot distinguish IO from AO type immediately after a suspicious observation; hence a simultaneous treatment is only possible with a certain delay, see section 4. Another task, not pursued further in this paper, consists in recovering the situation without structural changes in an off-line setting. An example is the spectral analysis of inter-individual heart frequency spectra, which requires a cleaning from both (wide-sense) IO's and AO's; after this cleaning, the powerful instruments of spectral analysis become available; cf. Spangl (2008).

Filter Problem. The most important problem in the SSM formulation is the reconstruction of the unobservable states X_t by means of the observations Y_t.
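Before turning to the reconstruction problem, here is a minimal R sketch simulating the ideal Gaussian SSM (1)-(4). The function name and the way the hyper-parameters are passed are illustrative only and are not taken from the robKalman package.

```r
## Minimal sketch: simulate the ideal (Gaussian) linear SSM of equations (1)-(4).
## All names and the way hyper-parameters are passed are illustrative only.
simulateSSM <- function(TT, F, Z, Q, V, a0, Q0) {
  p <- nrow(Q); q <- nrow(V)
  X <- matrix(NA_real_, p, TT); Y <- matrix(NA_real_, q, TT)
  x <- a0 + t(chol(Q0)) %*% rnorm(p)             # X_0 ~ N_p(a_0, Q_0)
  for (t in 1:TT) {
    x      <- F %*% x + t(chol(Q)) %*% rnorm(p)  # state equation (1)
    Y[, t] <- Z %*% x + t(chol(V)) %*% rnorm(q)  # observation equation (2)
    X[, t] <- x
  }
  list(X = X, Y = Y)
}
```

In the one-dimensional running example of the simulation section, p = q = 1, so all matrices reduce to scalars.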
For abbreviation, let us denote the observations up to time s by Y_{1:s} = (Y_1, ..., Y_s), and write X_{t|s} for a reconstruction of X_t based on Y_{1:s}. Using MSE risk, the optimal reconstruction is the solution to

$\mathrm{E}\,\big|X_t - X_{t|s}\big|^2 = \min\nolimits_{X_{t|s}}, \qquad X_{t|s} \text{ measurable w.r.t. } Y_{1:s}.$   (6)

We focus on filtering (s = t) in this paper, while s < t makes for a prediction and s > t for a smoothing problem.

Kalman-Filter. The general solution to (6), the corresponding conditional expectation E[X_t | Y_{1:s}], is in general rather expensive to compute. Hence, as in the Gauss-Markov setting, restriction to linear filters is a common way out. In this context, Kalman (1960) introduced a recursive scheme to compute this optimal linear filter, reproduced here for later reference:

Initialization: $X_{0|0} = a_0, \quad \Sigma_{0|0} = Q_0.$   (7)

Prediction: $X_{t|t-1} = F_t X_{t-1|t-1}, \quad \Sigma_{t|t-1} = F_t \Sigma_{t-1|t-1} F_t^\tau + Q_t.$   (8)

Correction: $X_{t|t} = X_{t|t-1} + M^0_t \Delta Y_t, \quad \Delta Y_t = Y_t - Z_t X_{t|t-1}, \quad M^0_t = \Sigma_{t|t-1} Z_t^\tau \Delta_t^{-1}, \quad \Delta_t = Z_t \Sigma_{t|t-1} Z_t^\tau + V_t, \quad \Sigma_{t|t} = (\mathbb{I}_p - M^0_t Z_t)\,\Sigma_{t|t-1},$   (9)

where Σ_{t|t} = Cov(X_t − X_{t|t}), Σ_{t|t−1} = Cov(X_t − X_{t|t−1}), and M^0_t is the so-called Kalman gain. The Kalman filter has a clear-cut structure with an initialization, a prediction, and a correction step. Evaluation and interpretation are easy, as all steps are linear. The strict recursivity / Markovian structure of the state equation allows one to concentrate all information from the past that is useful for the future in X_{t|t−1}. This linearity, however, is also the reason for its non-robustness, as observations y enter the correction step unboundedly. A good robustification has to be bounded in the observations while otherwise preserving the advantages of the Kalman filter as far as possible. The procedures we discuss in this paper are based on robustifying recursive Least Squares: the rLS filter.

rLS. Let us begin with (wide-sense) AO's. With only AO's, there is no need for a robustification of the initialization and prediction steps, as no (new) observations enter there. For the correction step, we use the following robustification, compare Ruckdeschel (2000): instead of M^0 ΔY, we use a Huberization of this correction,

$X_{t|t} = X_{t|t-1} + H_b(M^0_t \Delta Y_t), \qquad H_b(z) = z\,\min\{1,\, b/|z|\},$   (10)

for some suitably chosen clipping height b (a schematic sketch of this correction step is given below). While this is a bounded substitute for the correction step in the classical Kalman filter, it still remains reasonably simple, is non-iterative, and hence is especially useful for online purposes. However, it should be noted that, departing from the Kalman filter and at the same time insisting on strict recursivity, we possibly exclude "better" non-recursive procedures. These procedures, on the other hand, would be much more expensive to compute.

Remark 3.1. |·| in the expression H_b(M^0 ΔY) denotes the Euclidean norm; instead, however, one could also use other norms, like Mahalanobis-type ones. The choice of a quadratic-form-type norm for H_b does not affect the optimality statements of Theorem 3.2 below, provided the same norm is used in the MSE.

Choice of the clipping height b. For the choice of b, we have two proposals. Both are based on the simplifying assumption that E_id[ΔX|ΔY] is linear (here and in the sequel, ΔX = X_t − X_{t|t−1} denotes the prediction error and ΔY the corresponding innovation, suppressing the time index), which in fact turns out to be only approximately correct. The first one chooses b = b(δ) according to an Anscombe criterion,

$\mathrm{E}_{\rm id}\,\big|\Delta X - H_b(M^0 \Delta Y)\big|^2 \overset{!}{=} (1+\delta)\,\mathrm{E}_{\rm id}\,\big|\Delta X - M^0 \Delta Y\big|^2,$   (11)

where δ may be interpreted as an "insurance premium" to be paid in terms of efficiency. The second criterion uses the radius r ∈ [0, 1] of the neighborhood U_SO(r) (defined in (20)) and determines b = b(r) such that

$\mathrm{E}_{\rm id}\,\big(|M^0 \Delta Y| - b\big)_+ \overset{!}{=} \tfrac{r}{1-r}\, b.$   (12)

This produces the minimax-MSE procedure for U_SO(r) according to Theorem 3.2 below. If r is unknown, which is almost always the case, we translate an idea worked out in Rieder et al. (2008) into our setting: assume we know that r ∈ [r_l, r_u], 0 ≤ r_l < r_u ≤ 1. Then we define a least favorable radius r_0 such that the procedure rLS(b(r_0)) with clipping height b = b(r_0) minimizes the maximal inefficiency, w.r.t. the procedure knowing the respective radius, among all procedures rLS(b(r)); i.e., each rLS with some clipping height b(r) ≠ b(r_0) has a larger inefficiency for some r ∈ [r_l, r_u].
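To make the preceding steps concrete, here is a minimal R sketch of one correction step, classical as in (9) and Huberized as in (10), for a given clipping height b. It is a toy illustration with placeholder names, not the implementation of the robKalman package.

```r
## Minimal sketch of one correction step: classical Kalman vs. rLS.AO (eq. (10)).
## Placeholder names; this is not the robKalman implementation.
huberize <- function(z, b) {                        # H_b(z) = z * min(1, b/|z|)
  nz <- sqrt(sum(z^2))
  if (nz > b) z * (b / nz) else z
}
correctionStep <- function(xpred, Spred, y, Z, V, b = Inf) {
  Delta <- Z %*% Spred %*% t(Z) + V                 # Cov(Delta Y_t)
  M0    <- Spred %*% t(Z) %*% solve(Delta)          # classical Kalman gain M^0_t
  dY    <- y - Z %*% xpred                          # innovation Delta Y_t
  corr  <- huberize(M0 %*% dY, b)                   # b = Inf gives the classical step
  list(xfilt = xpred + corr,
       Sfilt = (diag(nrow(Spred)) - M0 %*% Z) %*% Spred)
}
```

Setting b = Inf reproduces the classical Kalman correction, which is convenient for checking an implementation against (9).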
Radius r_0 can be computed quite effectively by a bisection method; for the details, compare Ruckdeschel (2010, Lemma 3.1). From another point of view, for given b, one may read criteria (11) and (12) the other way round: they give the efficiency loss in the ideal model, respectively the size of the SO neighborhood, for which the corresponding procedure is MSE-minimax.

(One-step) optimality of the rLS. The rLS filter is optimally robust in some sense. To see this, in a first step we boil down our SSM to the following simplified model (instead of this simplification, we could even use a more general "Bayesian" model, compare Ruckdeschel (2010, section 3.2)): we have an unobservable but interesting state X ∼ P^X(dx), where for technical reasons we assume that in the ideal model E|X|² < ∞. Instead of X we rather observe the sum of X and a stochastically independent error ε,

$Y^{\rm id} = X + \varepsilon.$   (16)

We assume that, in the ideal model, the conditional distribution of Y given X allows for densities w.r.t. some measure μ, i.e.,

$P^{Y^{\rm id}}(dy \,|\, X = x) = p_\varepsilon(y - x)\,\mu(dy).$   (17)

As (wide-sense) AO model, we consider an SO outlier model already used by Birmiwal and Shen (1993) and Birmiwal and Papantoni-Kazakos (1994),

$Y^{\rm re} = (1 - U)\,Y^{\rm id} + U\,Y^{\rm di},$   (18)

$U \sim {\rm Bin}(1, r),$   (19)

for an SO-contamination radius 0 ≤ r ≤ 1 specifying the size of the corresponding neighborhood, i.e., the probability for an SO. U is assumed independent of (X, Y^id) and of (X, Y^di), where L(Y^di) is arbitrary, unknown and uncontrollable (a.u.u.). The corresponding neighborhood is defined as

$\mathcal{U}_{\rm SO}(r) = \big\{\,\mathcal{L}(X, Y^{\rm re}) \,:\, Y^{\rm re} \text{ according to (18) and (19) with radius } s \le r \,\big\}.$   (20)

With this setting we may formulate two typical robust optimization problems:

Minimax-SO problem. Minimize the maximal MSE on an SO-neighborhood, i.e., find a measurable reconstruction f_0 for X s.t.

$\max_{\mathcal{U}_{\rm SO}(r)} \mathrm{E}_{\rm re}\,\big|X - f(Y^{\rm re})\big|^2 = \min\nolimits_f!$   (21)

Lemma-5-SO problem. In the spirit of Hampel (1968, Lemma 5), minimize the MSE in the ideal model, but subject to a bound on the bias to be fulfilled on the whole neighborhood, i.e., find a measurable reconstruction f_0 for X s.t.

$\mathrm{E}_{\rm id}\,\big|X - f(Y^{\rm id})\big|^2 = \min\nolimits_f! \quad \text{s.t.} \quad \sup_{\mathcal{U}_{\rm SO}(r)} \big|\mathrm{E}_{\rm re}\,[X - f(Y^{\rm re})]\big| \le b.$   (22)

The solution to both problems can be summarized as follows.

Theorem 3.2. (1) Problem (21) admits a saddle point (f_0, P^{Y^di}_0) with

$f_0(y) := \mathrm{E}\,X + D(y)\,w_r(D(y)), \qquad w_r(z) = \min\{1,\, \rho/|z|\},$   (23)

where ρ > 0 ensures that the least favorable contaminating distribution P^{Y^di}_0 integrates to 1, and D(y) = E_id[X|Y = y] − E X.
(2) f_0 from (23) is also the solution to Problem (22) for b = ρ/r.
(3) If E_id[X|Y = y] is linear in y, f_0 is a Huberized linear correction; in the SSM formulation, M^0 is just the classical Kalman gain and f_0 the (one-step) rLS.

The proof of this theorem is given in Ruckdeschel (2010, Thm. 3.2). Model (16) already covers our normal SSM: we only have to identify X in model (17) with ΔX_t and replace p_ε(y − x) μ(dy) with N(Z_t ΔX_t, V_t)(dy). If ΔX_t itself is normal, part (3) applies and the rLS is SO-optimal.
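As a minimal numerical illustration of (23), consider the scalar Gaussian case of model (16), where E_id[X|Y = y] is linear, so that D(y) = M^0 y (with E X = 0). The following R sketch is illustrative only; the calibration of ρ to a given radius r (cf. (12)) is left out and ρ is simply passed as an argument.

```r
## Toy illustration of the SO-optimal reconstruction f_0 from (23) in the scalar
## Gaussian case Y = X + eps, X ~ N(0, s2x), eps ~ N(0, s2e); here E X = 0, so
## D(y) = M0 * y with M0 the classical gain. rho is passed in, not calibrated.
f0 <- function(y, s2x = 1, s2e = 1, rho = 1) {
  M0 <- s2x / (s2x + s2e)            # E_id[X | Y = y] = M0 * y
  D  <- M0 * y                       # D(y) = E_id[X|Y=y] - E X
  wr <- pmin(1, rho / abs(D))        # w_r(z) = min{1, rho/|z|}
  D * wr                             # f_0(y) = E X + D(y) * w_r(D(y))
}
f0(c(-10, -1, 0.5, 10))              # large |y| are clipped, small |y| left untouched
```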
Remark 3.3. (a) The ACM filter by Masreliez and Martin (1977), which is a competitor in this study, by analogy applies Huber (1964)'s minimax variance result to the "random location parameter X" setting of (16). They come up with redescenders as filter f. Hence the ACM filter is not so much vulnerable in the extreme tails but rather where the corresponding ψ-function takes its maximum in absolute value. Care has to be taken, as such "inliers" producing the least favorable situation for the ACM filter are much harder to detect by naïve data inspection, in particular in higher dimensions. (b) For exact SO-optimality of the rLS filter, linearity of the ideal conditional expectation is crucial. However, one can show (Ruckdeschel, 2010, Prop. 3.6) that E_id[ΔX|ΔY] is linear iff ΔX is normal; but, having used the rLS filter in the ΔX-past, normality cannot hold, compare Ruckdeschel (2010, Prop. 3.4).

Back in the ΔX model for t > 1: eSO-neighborhoods. As noted in the last remark, the rLS fails to be SO-optimal for t > 1. Nevertheless, the rLS performs quite well both in simulations and on real data. One explanation for this is to consider an extension of the original SO-neighborhoods, the extended SO or eSO-model (compare Ruckdeschel (2010, Sect. 3.4)), where we allow X to be suitably corrupted as well. In fact, the pair from Theorem 3.2 remains a saddle point in the corresponding minimax problem on the eSO-neighborhood of the same radius, cf. Ruckdeschel (2010, Thm. 3.10). As a consequence (compare Ruckdeschel (2010, Prop. 3.11)), in the Gaussian setup, instead of focussing on the (SO-) saddle-point solution on a U(r)-neighborhood around L(ΔX) stemming from an rLS-past, we use the following coupling-type idea: we assume that for each time t there is a (fictive) random variable ΔX^N ∼ N_p(0, Σ) such that ΔX^rLS_t stemming from an rLS-past can be considered a contaminating X^di in the corresponding eSO-neighborhood around ΔX^N with radius r. From this perspective, the rLS is exactly minimax for each time t.

As noted, in the presence of IO's we want to follow an IO outlier as fast as possible. The Kalman filter in this situation does not behave as badly as in the AO situation, but it still tends to be too inert. To improve upon this, let us go back to (16), which reveals a useful symmetry of X and ε: apparently, in model (16), ε = Y^id − X, so any estimate of ε induces an estimate of X and vice versa. Hence we follow Y more closely if we damp the estimation of ε, for which we use the rLS filter (a schematic one-dimensional sketch of this idea is given at the end of this section). We should note that, in doing so, we rely on "clean", i.e., ideally distributed, errors ε. With the obvious replacements, Theorem 3.2 translates word by word into a corresponding minimax theorem for IO's, compare Ruckdeschel (2010, Thm. 4.1). In analogy to the definition of the rLS in equation (10), we set up an IO-robust version of the rLS as follows: we retain the initialization and prediction steps of the classical Kalman filter and, assuming Z_t invertible for the moment, replace the correction step by one in which the classical estimate of the observation error ε_t is Huberized and the filter value is recovered from Y_t and this damped error estimate via the inverse of Z_t; the same arguments for the choice of the norm and the clipping height apply as for the AO-robust version of the rLS. To better distinguish IO- and AO-robust filters, let us call the IO-robust version rLS.IO and (for distinction) the AO-robust filter rLS.AO in the sequel.

Invertibility problem. Back in the (linear, discrete-time, Euclidean) SSM, the approach just described faces the problem that in general the matrix Z_t will not be invertible, so we cannot reconstruct X injectively from Y and ε. Under a certain full-rank condition, this problem can be solved by passing to corresponding rLS-type smoothers. The assumption we need is a version of complete constructibility, cf. Anderson and Moore (1979, Appendix), adapted to the time-inhomogeneous case: denoting the product F_{t+p} F_{t+p−1} · ... · F_t by F_{t+p:t}, we assume that for each t the corresponding constructibility matrix built from the Z_{t+k} and the products F_{t+p−1:t}, ..., F_{t:t} has full rank p. Details will be given in a subsequent paper.

Remark 3.4. (a) It is worth noting that our IO-robust version is still a filter, hence does not use information from observations made after the state that is to be reconstructed; rLS.IO is strictly recursive and non-iterative, hence well suited for online applications. (b) An alias for rLS.IO could be "hysteric filter", as it hysterically follows any changes in the Y's.
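The following R sketch illustrates the "damp the ε-estimate" idea in the scalar version of the simplified model (16); it shows the mechanism only and is not the exact rLS.IO correction step of the SSM case. Names and default values are illustrative.

```r
## Schematic sketch of the IO-robust idea in the scalar model Y = X + eps:
## Huberize the estimate of eps, so that the estimate of X follows Y more closely.
## Illustrative only; not the rLS.IO correction step of the SSM case.
huberize   <- function(z, b) z * min(1, b / abs(z))
ioEstimate <- function(y, s2x = 1, s2e = 1, b = 1) {
  M0     <- s2x / (s2x + s2e)   # classical gain; E_id[X|Y=y] = M0 * y (E X = 0)
  epsHat <- (1 - M0) * y        # classical estimate of eps given Y = y
  y - huberize(epsHat, b)       # damped eps-estimate => X-estimate follows y for large |y|
}
ioEstimate(10)                  # returns 9, i.e. close to y itself
```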
This section is less backed by theoretical results than the ones on the pure AO or pure IO situation; rather, it proposes a heuristic to achieve both types of outlier robustness. As already mentioned, a simultaneous treatment of AO's and IO's is only possible with a certain delay. With this delay, we can base the decision of whether there was an AO or an IO on the size of subsequent |ΔY_t|'s: if there was an AO, this should result in only one "large" |ΔY_t| in a row, whereas in case of an IO, due to propagation, there should be a whole sequence of large |ΔY_t|'s. So a hybrid filter (called rLS.IOAO for simplicity) could be designed as follows: for a given delay window width w, we run rLS.AO and rLS.IO in parallel (but only store the last w values of rLS.IO). By default we return the rLS.AO values. Whenever there is a run of w "large" |ΔY^rLS.AO_t|'s, we replace the last w filter values by the corresponding rLS.IO values and use these to continue with the rLS.AO (a schematic sketch of this decision rule is given at the end of this section). Apparently, rLS.IOAO gets into trouble in windows where we have both IO's and AO's; here some modelling of the type of structural change (local constant, or better: local linear) as in Fried et al. (2007) should be helpful; we do not treat this in this paper, though.

Operationalization. In the ideal (Gaussian) model, the ΔY_t's should be independent, so a reasonable decision on whether a sequence of |ΔY^rLS.AO_t|'s is "large" could be based on corresponding quantiles of |ΔY^rLS.AO_t| in the ideal model, or, somewhat easier, of the quadratic form ΔY_t^τ Δ_t^{-1} ΔY_t, which, assuming L(ΔY_t) to be approximately normal, is approximately χ²_q-distributed. Relaxing this condition a little, we already switch to rLS.IO when a high percentage h of the w values |ΔY_t| exceeds this given quantile. This leaves us to determine several tuning parameters: the window width w (this should really be chosen according to the application; we have obtained good results in our examples with w = 5), the clipping heights for rLS.IO and rLS.AO (proposal: according to (versions of) (11) or (12)), the percentage h (default: 80% of the last w instances of |ΔY^rLS.AO_t|), and the corresponding quantile (default: 99%), assuming that ΔY_t ∼ N_q(0, Δ_t).

Remark 4.1. (a) Note that although the decision whether we issue the rLS.IO or the rLS.AO values is made w observations after the state which is to be reconstructed, we still only use filters; hence the information of Y_{t+j}, j = 1, ..., w − 1, is not used to improve the reconstruction so far, as this would involve corresponding (yet-to-be-robustified) smoothers. Once the corresponding work on robust smoothing is done, we could surely use this additional information. (b) In a time-invariant linear SSM (i.e., with hyper-parameters F, Z, Q, and V constant in time), Δ_t will in general converge in t exponentially fast, also if one uses bounded rLS steps, so these tuning parameters only have to be determined for a small number of time instances t, compare Ruckdeschel (2001, chap. 7). In fact, setting them time-invariant right from the beginning will often do a reasonable job already.
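A minimal R sketch of the switching rule just described: it checks whether a high fraction h of the last w standardized innovations exceeds the chosen χ²_q quantile. All names are placeholders, the bookkeeping of the parallel rLS.AO and rLS.IO runs is left out, and this is not the robKalman implementation.

```r
## Schematic sketch of the rLS.IOAO decision rule: switch to the stored rLS.IO
## values when a high fraction h of the last w innovations is "large".
## Placeholder names; not the robKalman implementation.
switchToIO <- function(dY, Delta, w = 5, h = 0.8, level = 0.99) {
  # dY: list of the last w innovation vectors, Delta: list of their covariances
  q    <- length(dY[[1]])
  crit <- qchisq(level, df = q)                       # default: 99% quantile of chi^2_q
  stat <- sapply(seq_len(w), function(i)
            as.numeric(t(dY[[i]]) %*% solve(Delta[[i]]) %*% dY[[i]]))
  mean(stat > crit) >= h                              # TRUE: replace last w values by rLS.IO
}
```

With w = 5, h = 0.8, and level = 0.99 this reproduces the default tuning parameters proposed above.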
Our running example will be a one-dimensional steady-state model with time-constant hyper-parameters as specified in (28). We consider the performance of the classical Kalman filter, rLS.AO, rLS.IO, and rLS.IOAO in this model, both in the ideal situation and under AO's and IO's. More specifically, we have generated deterministic AO's in observations 10, 15, 23, and IO's in observations 20-25 (a local linear trend) and 37-42 (a level shift). As competitors, we include the ACM filter by Martin (1979), implemented by B. Spangl in the R package robKalman, and the variant hybr_PRMH of robfilter, cf. Fried and Schettlinger (2008) concerning its implementation and Fried et al. (2006) regarding its definition, which is a non-parametric filter fitting local levels and linear trends based on repeated medians. The results are plotted in Figures 1-4, where in the plots we confine ourselves to the rLS variants, which already makes for five curves to be plotted in one panel.

In the ideal situation, all filters perform well, with slight advantages for the classical Kalman filter (which has the smallest theoretical MSE), closely followed (and in the prediction case even slightly beaten) by the rLS.IO. In the IO situation, the "hysteric" rLS.IO filter performs best, beating the classical Kalman filter; both rLS.IOAO and hybr_PRMH perform reasonably well, while the AO-robust filters ACM and rLS.AO are not able to track the IO at all (as they can only perform bounded correction steps) and hence, like a hanging slope, only slowly recover the changed situation. In the AO situation, we have the complementary picture: here the ACM filter performs best (see also Remark 5.1(a)), but rLS.AO performs only slightly weaker; rLS.IOAO is a little worse, followed, with a certain gap but still reasonably good, by hybr_PRMH, while both the classical Kalman filter and rLS.IO (the latter even worse) perform drastically badly. Finally, in the mixed IO and AO situation, hybr_PRMH is by far the best solution, followed with a certain gap by rLS.IOAO, while all other filters perform unacceptably badly.

By construction, rLS.IOAO assumes that at every time instance there can only be either an AO or an IO; otherwise, the corresponding MSE becomes unbounded on every neighborhood U(r) for r > 0. Hence the AO in observation 23, which falls into the IO window 20-25, confuses rLS.IOAO completely: it faithfully follows the AO. The repeated-median-based hybr_PRMH does not have this problem, as the median stays stable even under (almost) arbitrary substitutive outliers, hence it is able to keep the local linear trend. Omitting observation 23 results in a much better performance of rLS.IOAO, which then even beats hybr_PRMH, cf. Table 2. Averaging over time in one realization of the SSM, we get the "ergodic" empirical MSEs as displayed in Tables 1 and 2.

[Figure 4: the SSM (28) with both IO's as in Figure 2 and AO's as in Figure 3; the panel below (note the different y-scale) is as in Figure 1.]

Remark 5.1. (a) We can explain why the ACM filter beats the rLS in the AO situation by the fact that the contamination in this study clearly covers the worst-case behavior of the rLS but not that of the ACM filter, compare Remark 3.3(e), and it also fails to do so for hybr_PRMH. (b) rLS.IOAO really has its advantages in higher dimensions, where median-based filters are much harder to define and become computationally expensive. One might even think of combining rLS.IOAO and hybr_PRMH in these settings: first let rLS.IOAO do a preliminary, fast, and dimension-independent cleaning, and then let hybr_PRMH polish this result coordinate-wise. (c) It is still an open question whether we can improve on the rLS.IOAO behavior using ideas from the background of hybr_PRMH, i.e., explicitly modelling local levels and linear trends; in such a setting Z_t is not invertible, but the model is completely constructible, so passing to smoothers might help.

Remark 5.2. A careful reader might object that we have not included any real-world data application. We have refrained from doing so because most data sets we have analyzed come with unknown hyper-parameters.
The setting of this paper assumes knowledge of these hyper-parameters, so we would have to estimate them somehow, possibly by the EM algorithm of Shumway and Stoffer (1982) or by more recent refinements. But since the robustness properties of combining a non-robust M-step (i.e., maximum likelihood) with a robust E-step (achieved using the procedures of this paper) are not evident, and a corresponding robustification of the M-step would have been out of scope, we have rather confined ourselves to simulations.

6. Implementation: R-package robKalman

rLS.AO was originally implemented in XploRe, compare Ruckdeschel (2000). In an ongoing project with Bernhard Spangl, BOKU, Vienna, and Irina Ursachi (ITWM), we are about to implement all the rLS filters in R, see R Development Core Team (2010), more specifically in an R package robKalman, the development of which is hosted in the R-Forge project https://r-forge.r-project.org/projects/robkalman/, see also R-Forge Administration and Development Team (2008). At this address you will also find a preliminary version available for download.

7. Conclusion

In the extremely flexible class of dynamic models given by SSMs, we apply (distributional) robust optimality results for filtering. We could show that, contrary to common prejudice, a simultaneous treatment of (wide-sense) IO's and AO's is possible in SSMs, albeit with a minor delay. The filters that we propose are model-based (in contrast to the non-parametric hybr_PRMH), which means that we need a higher degree of model specification, in that we possibly have to estimate the hyper-parameters, but which could also help to be more precise in the ideal model (in particular in higher dimensions). Our filters are non-iterative and recursive, hence fast, and valid in higher dimensions. They are available in R in development versions and will hopefully be on CRAN soon.

References

Optimal Filtering. Information and System Sciences Series.
Optimal robust filtering.
Outlier resistant prediction for stationary processes.
Time Series Analysis by State Space Methods.
Robust Kalman filter in discrete time.
Outliers in time series.
Repeated Median and Hybrid Filters.
Weighted Repeated Median Smoothing and Filtering.
R-package robfilter: Robust Time Series Filters.
Contributions to the theory of robust estimation. Dissertation.
State Space Models.
Forecasting.
Robust estimation of a location parameter.
Minimax tests and the Neyman-Pearson lemma for capacities.
A new approach to linear filtering and prediction problems.
Robust techniques for signal processing: A survey.
State space models and Hidden Markov Models.
Approximate conditional-mean type smoothers and interpolators. In: Smoothing Techniques for Curve Estimation.
Robust Bayesian estimation for the linear model and robustifying the Kalman filter.
R: A language and environment for statistical computing. R Foundation for Statistical Computing.
The cost of not knowing the radius.
Robust recursive estimation in the presence of heavy-tailed observation noise.
An approach to time series smoothing and forecasting using the EM algorithm.
Parameter Estimation of Nonlinear Stochastic Differential Equations: Simulated Maximum Likelihood vs.
On Robust Spectral Density Estimation.
Robust time series analysis: A survey. Kybernetika, 23, Supplement.
Bayesian forecasting and dynamic models.
Dynamic generalized linear models and Bayesian forecasting.

Acknowledgments. The author would like to thank Gerald Kroisandt (ITWM) for the stimulating discussion which led to the definition of rLS.IO.
Many thanks go to Nataliya Horbenko for proof-reading the manuscript.