title: A cluster identification framework illustrated by a filtering model for earthquake occurrences
author: Wu, Zhengxiao
date: 2009-06-11
DOI: 10.3150/08-bej159

Abstract: A general dynamical cluster identification framework including both modeling and computation is developed. The earthquake declustering problem is studied to demonstrate how this framework applies. A stochastic model is proposed for earthquake occurrences that regards the sequence of occurrences as composed of two parts: earthquake clusters and single earthquakes. We suggest that earthquake clusters contain a "mother quake" and her "offspring." Applying filtering techniques, we use the solution of the filtering equations as the criterion for declustering. A procedure for calculating maximum likelihood estimates (MLEs) and the most likely cluster sequence is also presented.

1. Introduction

Suppose one observes a series of events X_1, X_2, ..., X_n occurring at times τ_1, τ_2, ..., τ_n. Each event is either "normal" or "abnormal," and the objective is to identify the "abnormal" events. One application of this problem is in epidemiology. For instance, patients with Severe Acute Respiratory Syndrome (SARS) have symptoms similar to those of common flu patients. However, since SARS is much more infectious than the common flu, SARS patients often appear in clusters. Such statistical evidence enables us to identify SARS patients by mathematical tools, providing a supplementary method to costly medical tests. Another application is collusion set detection. In a stock market, a group of traders forms a collusion set if they trade heavily among themselves in order to manipulate the stock price. It is of interest to catch this kind of malpractice as early as possible. Considering each trade record as an event, it is intuitive that the malicious trading events tend to cluster.
Assuming that a distance measuring the dissimilarity between any two records is available, Palshikar and Apte tackle the problem via graph clustering in [12]. They ignore the time stamps on the trade records, reducing the point process to a graph, so the temporal information is lost in their method. These examples motivate our filtering model. We model the observations as a mixture of two independent marked point processes representing the "normal" and the "abnormal" events, respectively. Each new "abnormal" event changes the intensity of the "abnormal" point process; typically, it increases the intensity of additional "abnormal" events in its neighborhood. Our goal is to compute, in real time, the conditional probability that each observed event is abnormal. Employing filtering techniques, we derive versions of the Zakai and Kushner-Stratonovich equations. Under a Markov condition, a sequential algorithm is presented that calculates the exact conditional probabilities of interest. Unfortunately, the data sets for the two examples above are not available, so we present our methodology in the context of the "earthquake declustering problem." Even though there is no agreement on the underlying mechanism of earthquake occurrence in the current seismology literature, we want to emphasize that this example serves mainly as an illustration: our framework addresses general modeling and computation, and it could be adapted to data sets in various areas. It is well known that earthquakes often occur in clusters. The largest quake in a cluster is called the main shock, those before it are called foreshocks and those after it are called aftershocks. The aftershocks in an earthquake swarm are relatively easy to predict. However, many earthquakes also strike without any foreshocks or aftershocks.
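As a toy illustration of the kind of process we have in mind, the following minimal simulation sketch mixes a "normal" Poisson background with an "abnormal" cluster process whose intensity depends on whether a cluster is active. All parameter values (gamma, eps, lam, p) are hypothetical illustration values, not estimates from the paper:

```python
import random

def simulate_mixture(gamma=0.5, eps=0.05, lam=1.0, p=0.3, t_max=100.0, seed=0):
    """Simulate a 1-D toy version of the mixture model.

    Background ("normal") events arrive as a Poisson process with rate gamma.
    When no cluster is active, a new cluster is initiated at rate eps; once
    active, the cluster produces offspring at rate lam, and each offspring
    kills the cluster with probability p.
    """
    rng = random.Random(seed)
    t, active = 0.0, False
    events = []  # list of (time, label), label in {"normal", "abnormal"}
    while True:
        total = gamma + (lam if active else eps)
        t += rng.expovariate(total)     # waiting time to the next event
        if t > t_max:
            break
        if rng.random() < gamma / total:
            events.append((t, "normal"))
        else:
            events.append((t, "abnormal"))
            if active and rng.random() < p:
                active = False          # cluster dies out
            else:
                active = True           # cluster initiated or continues
    return events
```

The declustering task is then to recover the hidden labels from the event times (and marks) alone.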
As the authors stated in [8]: "To forecast the location of the large earthquakes, it is necessary to analyze the background seismicity, for which removal of temporal cluster members is considered to be of central importance." In this article, we propose a space-time point process model stemming from [14]. The observed earthquakes are considered as a mixture of earthquake swarms (a swarm contains at least two quakes) and single quakes. This can be viewed as a special case of the "cluster processes" of [2], Section 6.3: a cluster process is composed of clusters containing only a single point and clusters with multiple points; in our model, we treat the single-point events (single quakes) as the "noise" and the multiple-point events (earthquake swarms) as the "signal." The conditional probability that a quake belongs to a cluster then becomes a natural criterion for declustering, and filtering theory can be applied. We assume that at most one cluster is active at a time; this assumption can be relaxed at the cost of increased computational complexity. In the literature, inference for partially observed stochastic processes is often carried out using Markov chain Monte Carlo (MCMC) methods (see, e.g., [6]). A particle algorithm is also proposed in [14]. Such approximation methods are more flexible, but they are time-consuming and the approximation error is usually difficult to estimate. This paper instead finds analytic solutions for some cases.
The paper is organized as follows: Section 2 describes the generic model and the filtering equations; Section 3 presents the computational procedure for the conditional expectation of interest under the "mother quake" assumption (the first quake in a cluster triggers all the other quakes in that cluster); Section 4 illustrates the numerical results for earthquakes in central and western Japan; Section 5 summarizes the conclusions and describes future work; Appendix A gives the algorithm that calculates the maximum likelihood estimators of the parameters and the most likely cluster path; finally, the proofs are contained in Appendix B.

2. The generic model

Suppose the observed information about a quake is represented by a mark in a space E. For example, E could be R^3, recording the earthquake's magnitude and the epicenter's longitude and latitude. We model the observations as a marked point process O with marks in E. O is the mixture of two independent point processes N and C, which stand for the single quakes and the earthquake clusters, respectively. Hence, letting O(A, t) denote the number of quakes with marks in A (A a subset of E) observed up to time t, we can write O(A, t) = N(A, t) + C(A, t). We assume that N is a Poisson process with intensity γ relative to a reference measure ν; hence the single-quake model is just a Poisson random measure on E × [0, ∞) with mean measure ν_0(du × ds) = γ(u, s)ν(du) ds. We model clusters as randomly initiated and assume that they eventually die out; we also assume that there is at most one active cluster at a time, as mentioned in Section 1. Let D be the process indicating whether a cluster is active. The process C adds a mark u at time s with non-negative predictable intensity λ(u, s, D_{s−}, η_{s−}), where η is the configuration of the marks and occurrence times of all the previous cluster quakes.
More precisely, if cluster quake c_i occurs at time t_i, then η_t = Σ_{i: t_i ≤ t} δ_{(c_i, t_i)}, where δ_{(c_i, t_i)} is the Dirac measure concentrated on the point (c_i, t_i). Therefore, η is a counting measure on E × [0, ∞). When D = 0, there is no active cluster and the intensity λ(u, s, 0, η_{s−}) gives the rate at which a new cluster is initiated by an event with mark u at time s. Once initiated, the cluster grows with intensity λ(u, s, 1, η_{s−}) until it dies out. Under very mild conditions on the intensities (see [5]), the point processes can be written as solutions of stochastic differential equations. In particular, we can write

N(A, t) = ∫_{A × [0,∞) × [0,t]} 1_{[0, γ(u,s)]}(v) ξ_1(du × dv × ds),
C(A, t) = ∫_{A × [0,∞) × [0,t]} 1_{[0, λ(u, s, D_{s−}, η_{s−})]}(v) ξ_2(du × dv × ds),

where ξ_1 and ξ_2 are independent copies of a Poisson random measure on E × [0, ∞) × [0, ∞) with mean measure ν × ℓ × ℓ, ℓ denoting Lebesgue measure. In this article, we define D as follows: D is equal to 1 once a cluster is initiated; D has probability p of dying out (i.e., D = 0) whenever a new observation is added to the cluster; and this dying out is independent of all previous history. Thus, the evolution of an arbitrary function f(D_t, η_t) can be written in terms of independent Bernoulli random variables {I_k, k = 1, 2, ...} with parameter p that are also independent of N and C; this follows by writing the right-hand side as a finite sum in which most terms cancel. In applications, f(D_t, η_t) encodes the information about D_t and η_t that we wish to infer. Statistical inferences can be drawn if we are able to compute the conditional expectation of f given the observations O. The rest of the paper mostly deals with how to carry out this computation for arbitrary f. It is worth noting that our whole problem is essentially discrete and finite, so the measurability of functions is (and should be) of minor concern: D_t is either 0 or 1, and η_t can take only finitely many values (2^n if there are n observations), so the domain of f is finite and all functions on it are measurable.
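Because the state space is finite, the conditional probabilities can in principle be computed by direct enumeration over all cluster/noise labelings. The sketch below does this for a toy one-dimensional version of the model under the mother quake assumption of Section 3 (marks ignored, constant cluster intensity lam, initiation intensity eps, background rate gamma, kill probability p; all values hypothetical). It makes the exponential cost in n visible:

```python
from math import exp

def brute_force_posterior(times, t_end, gamma=0.5, eps=0.05, lam=1.0, p=0.3):
    """Exact posterior probability that each event is a cluster quake,
    obtained by enumerating every labeling together with the Bernoulli
    "kill" outcomes.  The recursion tree has exponentially many leaves."""
    n = len(times)
    post = [0.0] * n
    total = 0.0

    def rec(i, prev, active, w, labels):
        nonlocal total
        if i == n:                                  # survive to the horizon
            rate_c = lam if active else eps
            w *= exp(-(gamma + rate_c) * (t_end - prev))
            total += w
            for j, lab in enumerate(labels):
                if lab:
                    post[j] += w
            return
        t = times[i]
        rate_c = lam if active else eps
        surv = exp(-(gamma + rate_c) * (t - prev))  # no event in (prev, t)
        # event i is background noise
        rec(i + 1, t, active, w * surv * gamma, labels + [0])
        if active:
            # offspring: cluster survives (prob 1-p) or dies (prob p)
            rec(i + 1, t, 1, w * surv * lam * (1 - p), labels + [1])
            rec(i + 1, t, 0, w * surv * lam * p, labels + [1])
        else:
            # mother quake initiates a new cluster
            rec(i + 1, t, 1, w * surv * eps, labels + [1])

    rec(0, 0.0, 0, 1.0, [])
    return [x / total for x in post]
```

Sections 3 and Appendix A replace this brute force by an "orthogonal decomposition" that brings the cost down to polynomial order.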
We derive the filtering equation for the conditional distribution of η given observations of O, using a reference measure approach. If (Ω, F, Q) is a probability space and P is a second probability measure on F given by dP = L dQ, then for any sub-σ-algebra D ⊂ F and L^1-random variable Z,

E_P[Z | D] = E_Q[LZ | D] / E_Q[L | D].

We are going to use a reference probability measure Q under which the observations have a relatively simple structure. In the following lemma, N and C are independent Poisson random measures under Q.

Lemma 2.2. On (Ω, F, Q), let N and C be independent Poisson random measures with mean measures ν_0(du × ds) = γ(u, s)ν(du) ds and ν_1(du × ds) = λ_Q(u, s)ν(du) ds, respectively; let D be a cadlag process independent of N. Assume that all processes are compatible and that L is an {F_t}-martingale. Let P satisfy dP|_{F_t} = L(t) dQ|_{F_t}. Then P is a probability measure and, under P, for every A,

C(A, t) − ∫_{A × [0,t]} λ(u, s, D_{s−}, η_{s−}) ν(du) ds

is a local martingale, and N is independent of C and is a Poisson random measure with mean measure ν_0. Thus, under P, both N and C have the intensities described in Section 2.1. Hence Lemma 2.2 gives the form of the Radon-Nikodym derivative (or the likelihood) of P with respect to Q. Our further computation can then be justified by the uniqueness of the martingale problem (see [9] or [4], Chapter 4). The assumption that L is an {F_t}-martingale is very mild.

Remark 2.3. The following condition is sufficient to ensure that (2.2) is a well-posed equation and that L is a martingale: the process D has finitely many jumps in bounded time intervals. We can then record the history of the process D by the counting measure h_t = Σ_{i: t_i ≤ t} δ_{(D_{t_i}, t_i)}, where the sum is over the times t_i at which D jumps; h_t thus represents a path that takes the value D_{t_i} on the time interval [t_i, t_{i+1}).
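The conditional Bayes formula E_P[Z | D] = E_Q[LZ | D] / E_Q[L | D] can be checked directly on a finite probability space. In the sketch below, the sample space, the density L and the variable Z are all hypothetical illustration values (chosen so that E_Q[L] = 1); exact rational arithmetic makes the two sides comparable by equality:

```python
from fractions import Fraction as F

# Omega has four outcomes; D is the sigma-algebra generated by the first
# coordinate, so conditioning on D means restricting to a block {w: w[0] = d}.
omega = [(0, 0), (0, 1), (1, 0), (1, 1)]
q = {w: F(1, 4) for w in omega}                 # reference measure Q (uniform)
L = {(0, 0): F(1, 2), (0, 1): F(3, 2),          # density dP/dQ, E_Q[L] = 1
     (1, 0): F(1, 1), (1, 1): F(1, 1)}
Z = {w: w[1] for w in omega}                    # the random variable Z

def cond_exp_P(d):
    """E_P[Z | first coordinate = d], computed directly and via Bayes."""
    block = [w for w in omega if w[0] == d]
    p = {w: L[w] * q[w] for w in block}         # P restricted to the block
    direct = sum(p[w] * Z[w] for w in block) / sum(p.values())
    qd = sum(q[w] for w in block)
    bayes = (sum(q[w] * L[w] * Z[w] for w in block) / qd) / \
            (sum(q[w] * L[w] for w in block) / qd)
    return direct, bayes
```

Both computations agree on every block, which is exactly the content of the formula.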
As in (2.1), let f be an arbitrary function of the two counting measures (h_s, η_s), and define φ, π and f_new by (2.3)-(2.5). Since h_t contains all the information on D_t, we can write D_t = D_t(h_t). Further, abusing notation slightly, we write λ(u, s, h_{s−}, η_{s−}) for λ(u, s, D_{s−}(h_{s−}), η_{s−}). We need this expression to simplify the notation in the following theorem and in the application in Section 4. Let α denote the indicator that specifies whether or not a cluster is currently active.

Theorem 2.4. For an arbitrary function f on (h_s, η_s), let φ, π and f_new be defined as in equations (2.3)-(2.5). Then φ and π satisfy the associated Zakai and Kushner-Stratonovich stochastic integral equations, respectively.

In Section 4, we will take f to be indicator functions of the status of η, so that π(f, t) gives the conditional probability that an observation is in the cluster. Unlike the infinite-dimensional nonlinear filtering problem, whose solution can only be approximated, our problem admits a natural finite decomposition of the function space, since the function domain is finite: it consists of all possible combinations of each observed event being in a quake swarm or not. The exact solution can therefore be computed in principle, but in general the computational load grows exponentially with the number of observations, which is not feasible for online updating. In this section and also in Appendix A, we assume that an active cluster was triggered by its first quake (the mother quake) and that, when no cluster is active, a new cluster is initiated randomly with intensity ε. To be precise, suppose one observes u_i at time τ_i. Let y_i = (u_i, τ_i) and let Y(t) denote the set of observations by time t. We assume

λ(u, t, D_t, η_t) = 1_{D_t=1} Σ_i λ(u, t, y_i) θ_0(y_i) + 1_{D_t=0} ε(u, t),   (3.1)

where θ_0(y_i) = 1_{y_i is the mother quake in the currently active cluster}, and θ_0(y_i) is defined to be 0 if there is no active cluster at that time. We suppose that the functional form of λ(u, t, y_i) is known. For the application in Section 4, λ(u, t, y_i) is a Gaussian kernel (4.1) that does not depend on t.
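A Gaussian kernel of the kind used in Section 4 can be sketched as follows. The amplitude a and bandwidth sigma are hypothetical placeholders (the paper's (4.1) fits such parameters by maximum likelihood), and marks are reduced to (longitude, latitude) pairs:

```python
from math import exp, pi

def gaussian_kernel_intensity(u, mother, a=1.0, sigma=0.1):
    """Cluster intensity lambda(u, y_i) proportional to a bivariate normal
    density centred at the mother quake's epicentre.

    u and mother are (longitude, latitude) pairs; a scales the overall
    offspring rate and sigma controls the spatial spread of the cluster.
    """
    dx = u[0] - mother[0]
    dy = u[1] - mother[1]
    return a / (2 * pi * sigma ** 2) * exp(-(dx * dx + dy * dy) / (2 * sigma ** 2))
```

The intensity is largest at the mother quake's epicentre and decays rapidly with distance, which is what concentrates the offspring in the mother's neighborhood.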
Note that at most one θ_0(y_i) (i = 1, 2, ..., k) is non-zero at any moment t. The simple fact that θ_0(y_i)θ_0(y_j) = 0 for i ≠ j makes finding an analytic solution possible (see the proofs of the following theorems). Formally, we can think of the intensity λ as a vector with component λ(u, t, y_i) in each "orthogonal" direction θ_0(y_i), i = 1, 2, ..., k. With the help of this kind of "orthogonal decomposition" of the function space, the problem can be reduced to polynomial complexity. For simplicity, we also assume that no cluster is active at time 0. Define a(y, t) = ∫_E λ(u, t, y) ν(du) for y ∈ E and ε(t) = ∫_E ε(u, t) ν(du). The following two theorems give the algorithm to compute π(f, t), where f is an arbitrary function of D_s and η_s. In Theorem 3.3, we solve for π(θ_0(y_i)α, t) as the first step of our algorithm; the computation of π(f, t) for more general f is completed in Theorem 3.2, where the solution π(θ_0(y_i)α, t) is needed in (3.4). Combining Theorems 3.3 and 3.2, we can compute π(f, t) for an arbitrary f in real time. We use the same data set as in [8]: the earthquakes in the period 1926-1995 in the rectangular area 34°-39°N and 131°-140°E with magnitudes greater than 4.0 and depths less than 100 km. We take ν to be the uniform measure on the rectangular region, γ(u) = γ and λ(u, D_t, η_t) = 1_{D_t=1} Σ_{i=1}^k λ(u, y_i)θ_0(y_i) + 1_{D_t=0} ε, where λ(u, y_i) is proportional to a bivariate normal density, as in (4.1). We compute π(θ(y_i)(·, ·), T) for all observed y_i according to Theorem 3.2, where T is the last moment of the year 1995. The results are compared with [8], where the authors declustered the observations by computing aftershock probabilities under an ETAS model. In Figure 1, plot (a) is the histogram of the aftershock probabilities presented in [8]; we denote their aftershock probabilities by p_1.
Plot (b) shows the distribution of the conditional probabilities p_2 in the mother quake model. We are pleased to see that our stochastic model gives relatively deterministic answers: around 95% of the quakes have a probability of being in a cluster that is either smaller than 0.1 or greater than 0.9, as can be seen in plot (b). Although both results have a bimodal shape, the one in [8] disagrees with our model for many individual quakes, as can be seen from Figure 2, whose histogram presents the difference between the two probabilities. The data set appears to support our model better. We plot the earthquake clusters in each setting by removing the quakes with a low probability of being in a cluster. The time-space plots in Figure 3 contain 1500 quakes; the vertical axis represents time (in days). It is quite clear that the plots from our model exhibit a stronger cluster pattern. The three-dimensional plots are available from http://www.stat.nus.edu.sg/~stawz/ and can be rotated and viewed from different perspectives. We can also compute π(D_{τ_i}(·, ·), T) to examine the status of the cluster at different times. Under the mother quake assumption, Figure 4(a) gives the conditional probability that an earthquake cluster is active. The answer is again quite distinct: Figure 4(b) shows that most of these conditional probabilities are close to either 0 or 1. Assumption (3.1) is just one example. Another earthquake model, the "domino" model, is given in [14]; there, the last quake in the cluster is assumed to trigger the next one. Comparing likelihoods, the mother quake model turns out to fit our data set better. Roughly speaking, as long as the conditional intensity λ modeling the cluster depends only on a "small" portion of the history (in (3.1), it depends only on the last mother quake), we can adopt an "orthogonal decomposition" and obtain an algorithm for the analytic solution.
Thus the assumption that at most one cluster is active at a time is not essential to our method; this simplifying assumption merely keeps the presentation from becoming unwieldy. We can work out a similar decomposition if we assume that at most, say, three clusters are active at a time. Our filter separates the data set into the cluster quakes and the single quakes. Further data analysis in [14] shows geophysical differences: in particular, the magnitudes of the cluster quakes are significantly different from those of the single quakes, and the mother quakes are significantly bigger than the offspring quakes. Note that we did not incorporate the magnitudes of the earthquakes into the model; this surprising finding therefore further supports our model. The application to seismology is only a special case of the filtering approach to abnormal cluster identification proposed in [14]. Other possible applications include epidemiology, intrusion detection in network security, criminology and quality control.

Appendix A

Theorem A.1. Let L_N and L_C be defined by (A.1) and (A.2) and assume that they are {F_t}-martingales. Let L = L_N L_C; L is then also an {F_t}-martingale. Let P satisfy dP|_{F_t} = L(t) dQ|_{F_t}. Then P is a probability measure and, under P, for every A,

C(A, t) − ∫_{A × [0,t]} λ(u, s, D_s, η_s) ν(du) ds

is a local martingale, and N is independent of C and is a Poisson random measure with mean measure γ(u, s)ν(du) ds.

Remark A.2. The L derived from this theorem is the likelihood of our observation, which is the mixture of the two processes; it is needed for the estimation of the parameters. In Lemma 2.2, by contrast, the simplified version (2.2) is sufficient for proving Theorem 2.4, since that theorem only concerns f(D_t, η_t), which does not involve the process N. Applying the L derived here, we could prove a more general form of Theorem 2.4 and obtain filtering equations for f(N_t, D_t, η_t); we omit this because the notation becomes unwieldy and we do not use it in our application. Our goal is to compute E_Q[L | F^O], the likelihood in our model.
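Before the exact recursions are derived, the likelihood computation can be illustrated in a toy one-dimensional version of the mother quake model, where the cluster intensity is a constant lam, so the forward state reduces to D ∈ {0, 1}; in the full model the state also records the index of the mother quake, which is what gives the O(n^2) cost mentioned below. All parameter values are hypothetical:

```python
from math import exp

def forward_likelihood(times, t_end, gamma=0.5, eps=0.05, lam=1.0, p=0.3):
    """Forward (sum-product) recursion for a toy 1-D mother-quake model.

    Between events, the no-event factor is exp(-(gamma + rate) * dt); each
    event is either background (rate gamma) or a cluster point (rate eps to
    start a cluster, rate lam for an offspring, which kills the cluster with
    probability p).  Returns the likelihood of the observed event times.
    """
    rate = {0: eps, 1: lam}        # cluster-part intensity given D
    l = {0: 1.0, 1: 0.0}           # no active cluster at time 0
    prev = 0.0
    for t in times:
        new = {0: 0.0, 1: 0.0}
        for d in (0, 1):
            w = l[d] * exp(-(gamma + rate[d]) * (t - prev))
            new[d] += w * gamma                # background event, D unchanged
            if d == 0:
                new[1] += w * eps              # mother quake starts a cluster
            else:
                new[1] += w * lam * (1 - p)    # offspring, cluster survives
                new[0] += w * lam * p          # offspring kills the cluster
        l, prev = new, t
    # survive with no event from the last observation to the horizon t_end
    return sum(l[d] * exp(-(gamma + rate[d]) * (t_end - prev)) for d in (0, 1))
```

Summing over the hidden state after each observation is what collapses the exponentially many label sequences into a linear-in-n pass here (quadratic in the full model, where the mother's index must be tracked).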
We can first solve (A.1) and (A.2), where ρ is the indicator of whether an observation is a cluster point. Under the reference measure Q, ρ_1, ρ_2, ... are i.i.d. Bernoulli(1/2), and the likelihood is proportional to a product of the event intensities times factors of the form exp{−∫∫ (λ(u, s, D_{s−}, η_{s−}) + γ(u, s)) ν(du) ds}. We use "∝" since we ignore a constant, which has no impact on likelihood inference. Recall that γ_j = γ(y_j, τ_j), λ_{j,i} = λ(y_j, τ_j, y_i) and ε_j = ε(y_j, τ_j). Let τ_0 = 0 and suppose one observes u_i at time τ_i, with y_i = (u_i, τ_i). We recall the special form of λ(u, t, D_t, η_t), in which θ_{0i}(y_j) = 1_{y_j is the latest mother quake in the cluster at time τ_i−} and λ_{i,j} is defined as in Theorem 3.3. Let E_i be the index of the latest mother quake at time τ_i−. Assume that no cluster is active at time 0, hence D_0 = 0. Therefore, the first factor of the likelihood involves ∫ [ε(u, s) + γ(u, s)] ν(du) ds. Summing over the two terms corresponding to ρ_1 = 0 and ρ_1 = 1, respectively, we again obtain a factor involving ∫ [ε(u, s) + γ(u, s)] ν(du) ds. It is not practical to sum over all the terms by brute force, since the number of terms increases exponentially with the number of observations. However, we can reduce the complexity to O(n^2) by computing the exponential factors recursively. Recall that when a single quake is observed at time τ_i, it has no impact on the cluster quakes: E_{i+1} = E_i and D_{τ_i} = D_{τ_i−}. When a cluster quake is observed at time τ_i, there are two possible scenarios. In the first, D_{τ_i−} = 0, so the observation at time τ_i is a new mother quake: D_{τ_i} = 1 and E_{i+1} = i. In the second, D_{τ_i−} = 1, so the observation is an offspring quake, and hence E_{i+1} = E_i; this observation kills the cluster with probability p. In other words, we introduce a Bernoulli(p) random variable I_i, independent of everything else: if I_i = 1, the cluster dies out (D_{τ_i} = 0); otherwise, D_{τ_i} = 1. For 0 < i < j, the corresponding exponential factors are of the form exp{−∫ (λ(u, s, y_i) + γ(u, s)) ν(du) ds} and exp{−∫ (ε(u, s) + γ(u, s)) ν(du) ds}. The second equality utilizes the fact that the switch I_j is triggered only when a cluster point is observed.
The last equality uses the fact that I_j is an independent Bernoulli variable with parameter p. Since ρ_j is an independent Bernoulli variable with parameter 1/2, we may ignore the constant factor 1/2 in the algorithm. This yields our forward algorithm. Let l_j(d, i) denote the unnormalized weight of the state (d, i) (cluster status d, latest mother quake index i) after the j-th observation; the recursion updates l_j(d, i) from l_{j−1}(d, i) through factors of the form exp{−∫_{(τ_{j−1}, τ_j]} ∫_E λ(u, s, y_i) ν(du) ds} l_{j−1}(1, i). From the discussion above, we can find the likelihood for any specific set of parameters, so finding the MLE is a standard optimization problem. It turns out that a non-derivative method works better in the example of this paper; in particular, we use the Nelder-Mead simplex method to search for the MLE (see [11]). Asymptotic confidence intervals for the parameters can be constructed. Observe that we essentially have a hidden Markov model (HMM), so the theorems on asymptotic normality of the MLE for a general HMM should apply (see [1, 3, 10]). Consequently, there are corresponding likelihood-ratio tests for the HMM, as established in [7], and the asymptotic confidence intervals can then be constructed by inverting the test statistics (see [14]). The procedure to find the most likely cluster sequence starts from the calculation of l⋆_j(d, i), using the recursions (A.7) and (A.8) while always keeping a record of the "winning sequence" in each maximum-finding operation. Finally, the last state (d, i)⋆ is found as the maximizer of the final weights and, starting from this state, the sequence is recovered by backtracking. As before, normalization is necessary in each step of the recursion to prevent the weights from degenerating to 0 or infinity.

Appendix B: Proofs

is a local martingale, and hence is as well.

Proof of Theorem 2.4. To simplify the notation, we use f(0, η_{s−} + δ_{(u,s)}) to denote f(· + δ_{(0,s)}, η_{s−} + δ_{(u,s)}) and f(1, η_{s−} + δ_{(u,s)}) to denote f(· + δ_{(1,s)}, η_{s−} + δ_{(u,s)}).
Noting this and applying Itô's formula, the jump part of the filtering equation is

[π(f_new λ(u, s, ·, ·), s−) − π(λ(u, s, ·, ·), s−) π(f, s−)] / π(λ(u, s, ·, ·) + γ(u, s), s−),

and the drift part is −[π(f(·, ·) λ(u, s, ·, ·), s) − π(f, s) π(λ(u, s, ·, ·), s)] ν(du) ds.

Proof of Theorem 3.3. We apply Theorem 2.4. For τ_k ≤ t < τ_{k+1},

π(θ_0(y)α, t) = π(θ_0(y)α, τ_k) − ∫_{E×[τ_k,t]} π(θ_0(y)α λ(u, s, h, η), s) ν(du) ds + ∫_{E×[τ_k,t]} π(θ_0(y)α, s) π(λ(u, s, h, η), s) ν(du) ds.

We use the fact that α = Σ_{x∈O(s)} θ_0(x)α to get the second equality. Thus, for i = 1, 2, ..., k,

(d/dt) π(θ_0(y_i)α, t) = ( Σ_j [a(y_j, t) − ε(t)] π(θ_0(y_j)α, t) − [a(y_i, t) − ε(t)] ) π(θ_0(y_i)α, t),   (B.1)

and (3.2) is the unique solution of this system of ordinary differential equations. At time τ_{k+1},

π(θ_0(y)α, τ_{k+1}) = π(θ_0(y)α, τ_{k+1}−) + π([1 − α(·, ·)][θ_0(y)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} + π(α(·, ·) q [(θ_0(y)α)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} − π(λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) π(θ_0(y)α, τ_{k+1}−) d_{k+1} = π(θ_0(y)α, τ_{k+1}−) γ_{k+1} d_{k+1} + π([1 − α(·, ·)][θ_0(y)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} + π(α(·, ·) q [(θ_0(y)α)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1}.

For i < k + 1, π(θ_0(y_i)α, τ_{k+1}) = π(θ_0(y_i)α, τ_{k+1}−)(γ_{k+1} + qλ_{k+1,i}) d_{k+1}. For i = k + 1, π(θ_0(y_{k+1})α, τ_{k+1}) = [Σ_{j=1}^k (qλ_{k+1,j} − ε_{k+1}) π(θ_0(y_j)α, τ_{k+1}−) + ε_{k+1}] d_{k+1}.

Proof of Theorem 3.2. For τ_k ≤ t < τ_{k+1},

π(θ_0(y)αf, t) = π(θ_0(y)αf, τ_k) − ∫_{E×[τ_k,t]} π(θ_0(y)αf λ(u, s, h, η), s) ν(du) ds + ∫_{E×[τ_k,t]} π(θ_0(y)αf, s) π(λ(u, s, h, η), s) ν(du) ds,

which gives π(θ_0(y)αf, s) = π(θ_0(y)α, s) π(θ_0(y)αf, τ_k) / π(θ_0(y)α, τ_k), and (3.3) follows easily.
π(θ_0(y)αf, τ_{k+1}) = π(θ_0(y)αf, τ_{k+1}−) − π(λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) π(θ_0(y)αf, τ_{k+1}−) d_{k+1} + π([1 − α(·, ·)][(θ_0(y)f)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} + π(α(·, ·) q [(θ_0(y)f)(1, · + δ_{(u,s)})] λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} = [1 − δ_{y_{k+1}}(y)] π(θ_0(y)αf, τ_{k+1}−) γ_{k+1} d_{k+1} + [1 − δ_{y_{k+1}}(y)] q Σ_{j=1}^k λ_{k+1,j} π(f(1, · + δ_{y_{k+1}}) θ_0(y_j)α, τ_{k+1}−) d_{k+1} + δ_{y_{k+1}}(y) π(f(1, · + δ_{y_{k+1}})(1 − α), τ_{k+1}−) ε_{k+1} d_{k+1},

π(f, τ_{k+1}−) = π(f, τ_k) − ⋯ + π(f, t) Σ_{x∈Y(τ_k)} π(θ_0(x)α, t)[a(x, t) − ε(t)],

π(f, τ_{k+1}) = π(f, τ_{k+1}−) − π(λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) π(αf, τ_{k+1}−) d_{k+1} + π(f_new λ(y_{k+1}, τ_{k+1}, ·, ·), τ_{k+1}−) d_{k+1} = π(f, τ_{k+1}−) γ_{k+1} d_{k+1} + π(f(1, · + δ_{y_{k+1}})(1 − α) ε_{k+1}, τ_{k+1}−) d_{k+1} + Σ_{j=1}^k λ_{k+1,j} π([p f(0, · + δ_{y_{k+1}}) + q f(1, · + δ_{y_{k+1}})] θ_0(y_j)α, τ_{k+1}−) d_{k+1}.

Proof of Theorem A.1. Apply Lemma 2.2 and note the independence of C and N.

Proof of Lemma 2.2. By Theorem III.20 of Protter [13], M is a local martingale.

Acknowledgements

This material is based upon work supported, in whole or in part, by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant number DAAD19-01-1-0502, and by NSF Grant DMS 05-03983. I am deeply grateful to my adviser Tom Kurtz for very helpful discussions and comments; he taught me the filtering idea and directed my work on its applications. We thank Jiancang Zhuang, who kindly sent us the data set and his numerical results, which were plotted in [8]. We also thank Feng-Chang Lin for pointing us to Zhuang's paper.

References

[1] Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models.
[2] An Introduction to the Theory of Point Processes.
[3] Asymptotics of the maximum likelihood estimator for general hidden Markov models.
[4] Markov Processes: Characterization and Convergence.
[5] Spatial birth and death processes as solutions of stochastic equations.
[6] Markov Chain Monte Carlo in Practice.
[7] Likelihood-ratio tests for hidden Markov models.
[8] Stochastic declustering of space-time earthquake occurrences.
[9] Limit Theorems for Stochastic Processes.
[10] Asymptotic normality of the maximum-likelihood estimator in state space models.
[11] Convergence properties of the Nelder-Mead simplex method in low dimensions.
[12] Collusion set detection using graph clustering.
[13] Stochastic Integration and Differential Equations.
[14] A filtering approach to abnormal cluster identification.