title: Sparse Markov Models for High-dimensional Inference authors: Ost, Guilherme; Takahashi, Daniel date: 2022-02-16 Abstract. Finite order Markov models are theoretically well-studied models for dependent discrete data. Despite their generality, applications in empirical work when the order is large are rare. Practitioners avoid using higher order Markov models because (1) the number of parameters grows exponentially with the order and (2) the interpretation is often difficult. Mixture of transition distribution (MTD) models were introduced to overcome both limitations. MTD models represent higher order Markov models as a convex mixture of single step Markov chains, reducing the number of parameters and increasing interpretability. Nevertheless, in practice, estimation of MTD models with large orders is still limited because of the curse of dimensionality and high algorithmic complexity. Here, we prove that if only a few lags are relevant we can consistently and efficiently recover the lags and estimate the transition probabilities of high-dimensional MTD models. The key innovation is a recursive procedure for the selection of the relevant lags of the model. Our results are based on (1) a new structural result for the MTD and (2) an improved martingale concentration inequality. We illustrate our method using simulations. 1. Introduction. From the daily number of COVID-19 cases to the activity of neurons in the brain, discrete time series are ubiquitous in our lives. A natural way to model these time series is by describing how present events depend on past events, i.e., by characterizing the transition probabilities. Therefore, finite-order Markov chains - models specified by transition probabilities that depend only on a limited portion of the past - are an obvious choice to model time series with discrete values.
The length of the portion of the relevant past defines the order of the Markov chain. At first glance, estimating the transition probabilities of a Markov chain from the data is straightforward. Given a sample X 1:n := (X 1 , X 2 , . . . , X n ) of a stationary d-th order Markov chain on a discrete alphabet A, the empirical transition probabilities are computed, for all pasts x −d:−1 and symbols a ∈ A, as p̂ n (a|x −d:−1 ) = N n (x −d:−1 , a) / Σ b∈A N n (x −d:−1 , b), where N n (x −d:−1 , a) denotes the number of occurrences of the past x −d:−1 followed by the symbol a in the sample X 1:n . Nevertheless, some difficulties become apparent. First, for a Markov chain of order d, we have to estimate |A| d (|A| − 1) transition probabilities (parameters), so the number of parameters to estimate from the data increases exponentially in d. One solution to avoid this exponential increase is to consider more parsimonious classes of models. One such popular class of models is the variable length Markov chains (VLMC), in which P(X t = a|X t−d:t−1 = x −d:−1 ) = P(X t = a|X t−ℓ:t−1 = x −ℓ:−1 ), where ℓ is a function of x −d:−1 (Rissanen, 1983; Bühlmann and Wyner, 1999; Galves et al., 2012). The relevant portion x −ℓ:−1 of the past x −d:−1 is called a context. The key feature of VLMC is that all transition probabilities with the same context have the same values. Therefore, denoting by τ the set of all contexts, the number of transition probabilities that need to be estimated reduces to |τ |(|A| − 1). Another class of models that is even more parsimonious is the Minimal Markov Models, also known as Sparse Markov Chains (SMC) (García et al., 2011; Jääskinen et al., 2014). In SMC, we say that the pasts x −d:−1 and y −d:−1 are related if for all a ∈ A, P(X t = a|X t−d:t−1 = x −d:−1 ) = P(X t = a|X t−d:t−1 = y −d:−1 ). This relation generates the equivalence classes C 1 , . . . , C K that partition A {−d,...,−1} . Now, the number of transition probabilities that need to be estimated is K(|A| − 1).
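As a minimal illustration of how these empirical transition probabilities are computed, the sketch below tabulates the counts N n (x −d:−1 , a) over a sample and normalizes them; the function name and data layout are ours, not the paper's.

```python
from collections import Counter

def empirical_transitions(sample, d):
    """Tabulate N_n(past, a) for every length-d past seen in the sample
    and normalize to obtain the empirical transition probabilities."""
    counts = Counter()   # (past, symbol) -> number of occurrences
    totals = Counter()   # past -> number of occurrences
    for t in range(d, len(sample)):
        past = tuple(sample[t - d:t])
        counts[(past, sample[t])] += 1
        totals[past] += 1
    return {(past, a): c / totals[past] for (past, a), c in counts.items()}
```

For example, on the alternating sample 0, 1, 0, 1, . . . with d = 1 the estimate of p(1|0) is 1. Note that the number of possible pasts is |A|^d, which is exactly the exponential blow-up discussed above.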
Both VLMC and SMC have the advantage of better balancing the bias and variance tradeoff; nevertheless, in both of them we still need to estimate the transition probabilities using p̂ n (a|x −d:−1 ), which generates a second difficulty. For the estimator p̂ n (a|x −d:−1 ) to have any meaning, we have to observe the past x −d:−1 in the sample X 1:n at least once. By ergodicity, the number of times that we will observe the past x −d:−1 is roughly nP(X 1:d = x −d:−1 ). It is straightforward to show that, if the transition probabilities are bounded away from zero, there exists a constant c > 0 such that P(X 1:d = x −d:−1 ) < e −cd . Therefore, in general, it is hopeless to have a reasonable estimator p̂ n (a|x −d:−1 ) if d > (1 + ε) log n/c. This imposes a fundamental limit on the size of the past that can be included in the description of the time series. Notwithstanding, Markov chains with small orders (i.e., d = O(log n)) are not always consistent with the known workings of natural phenomena, where the transition probabilities might depend on the remote past. For example, in predicting whether today will be a warm or cold day, we might need to know remote past events such as the corresponding weather a year ago (Király, Bartos and Jánosi, 2006; Yuan, Fu and Liu, 2013). Physiological phenomena in humans with cycles of different lengths might result from dependence on events that happen at vastly different temporal scales (Gilden, Thornton and Mallon, 1995; Chen, Ding and Kelso, 1997; Buzsaki and Draguhn, 2004). Importantly, not all portions of the past are necessarily relevant. These observations motivate us to explore sparser representations of the dependence on past events. The mixture of transition distribution model (MTD) is a subclass of finite order Markov chains that can be used to obtain such a sparse representation.
Similar to VLMC and SMC, MTD was initially introduced to overcome the problem of the exponential increase in the number of transition probabilities for Markov chains (Raftery, 1985; Berchtold and Raftery, 2002). MTD represents higher order Markov models as a convex mixture of single-step Markov chains, where each single-step Markov chain depends on a single time point in the past. If an MTD model is a mixture of only a few single-step Markov chains, we naturally obtain a class of sparse Markov chains that depend only on a small portion of the past events. Nevertheless, available methods to consistently estimate the transition probabilities of MTD models still need to consider all the past events up to the MTD order (Berchtold and Raftery, 2002), which might include irrelevant portions of the past. In practice, this fact still restricts the MTD order to d = O(log n). In this work, we show a simple method that consistently recovers only the relevant part of the past even when the order d of the MTD is proportional to the sample size n (i.e., d = O(n)), provided the size of the relevant past is O(log n). As a consequence, we prove that we can consistently estimate the transition probabilities of high-dimensional MTD models under a sparsity constraint. Our estimator is computationally efficient, consisting of a forward stepwise procedure that finds the candidates for relevant past portions and a cutting procedure to eliminate the irrelevant portions. The theoretical guarantees of our estimator are based on a novel structural result for MTD models and an improved martingale concentration inequality. Both results might be of independent interest. Moreover, when the alphabet is binary, we show that the estimator can be further improved. We prove that in several cases our estimator is minimax rate optimal.
Finally, let us mention that new Bayesian approaches for higher order VLMC and MTD selection were recently introduced in (Kontoyiannis et al., 2020; Heiner and Kottas, 2021), where a posteriori most likely model estimation is considered. These works provide interesting alternative approaches for modeling higher order Markovian dependence. However, it is difficult to compare their results with ours, as consistency in high dimension is not the target of these methods. We organize the paper as follows. In Section 2, we introduce the main notations, definitions, and assumptions that we will use throughout the paper. In Section 3, we introduce the algorithms to select the relevant part of the past. In Section 3.4, we provide an estimate of the rate of convergence of the estimator for the transition probabilities. In Section 3.5, we show that our estimator achieves the optimal minimax rate. In Section 4, we use simulations to study the proposed estimators. All the proofs are postponed to the end of the article. 2. Notation, model definition and preliminary remarks. 2.1. General notation. We denote by Z = {. . . , −1, 0, 1, . . .} and Z + = {1, 2, . . .} the set of integers and positive integers respectively. For s, t ∈ Z with s ≤ t, we write s, t to denote the discrete interval Z ∩ [s, t]. Throughout the article, A denotes a finite subset of R, called the alphabet. The elements of A will be denoted by the first letters of the alphabet: a, b and c. Hereafter, we denote A ∞ = max a∈A |a| and Diam(A) = max a,b∈A |a − b|. For each S ⊂ Z, the set A S denotes the set of all A-valued strings x S = (x j ) j∈S indexed by the set S. To alleviate the notation, if S = s, t for s, t ∈ Z with s ≤ t, we write x s:t instead of x s,t . For any non-empty subsets U ⊂ S ⊆ Z and any string x S ∈ A S , we denote by x (S\U ) ∈ A (S\U ) the string obtained from x S by removing the substring x U ∈ A U . For all t ∈ Z and S ⊂ Z, we will in some cases write t + S to denote the set {t + s : s ∈ S}.
The set of all finite A-valued strings is denoted by A ⋆ . For all x ∈ A ⋆ , we denote by S x ⊂ Z the set indexing the string x, i.e., such that x ∈ A Sx . Given two probability measures µ and ν on A, we denote by d T V (µ, ν) the total variation distance between µ and ν, defined as d T V (µ, ν) = (1/2) Σ a∈A |µ(a) − ν(a)|. The dimension L ∈ Z + will be implicit in most cases. For two probability distributions P and Q on A 1,k , where P is absolutely continuous with respect to Q, we denote by KL(P ||Q) the Kullback-Leibler divergence between P and Q, given by KL(P ||Q) = Σ x 1:k ∈A 1,k P (x 1:k ) log P (x 1:k )/Q(x 1:k ). 2.2. Markov models. Let X = (X t ) t∈Z be a discrete time stochastic chain, defined on a suitable probability space (Ω, F, P), taking values in an alphabet A. For d ∈ Z + , we say that X is a Markov chain of order d if for all k ∈ Z + with k > d, t ∈ Z and x t−k:t ∈ A t−k,t , (1) P(X t = x t |X t−k:t−1 = x t−k:t−1 ) = P(X t = x t |X t−d:t−1 = x t−d:t−1 ). We say that a Markov chain is stationary if X s:t and X s+h:t+h have the same distribution for all t, s, h ∈ Z. Throughout the article, the distribution of a stationary Markov chain will be denoted by P. For a finite S ⊂ Z and x S ∈ A S , we write P(x S ) to denote P(X S = x S ). The support of a stationary Markov model is the set supp(P) = {x ∈ A ⋆ : P(x Sx ) > 0}. For stationary Markov chains, the conditional probabilities in (1) do not depend on the time index t. Therefore, for a stationary Markov chain of order d, for any a ∈ A, x S ∈ supp(P) with S ⊆ −d, −1 and t ∈ Z, we denote p(a|x S ) = P (X t = a|X t+S = x S ). Notice that p(·|x S ) is a probability measure on A, for each fixed past x S ∈ supp(P). The set {p(·|x −d:−1 ) : x −d:−1 ∈ supp(P)} is called the family of transition probabilities of the chain. In this article, we consider only stationary Markov chains. For a Markov chain of order d, the oscillation δ j for j ∈ −d, −1 is defined as δ j = max{d T V (p(·|x −d:−1 ), p(·|y −d:−1 )) : x −d:−1 , y −d:−1 ∈ supp(P) with x k = y k for all k ∈ −d, −1 \ {j}}. The oscillation is useful to measure the influence of the j-th past value on the values of the transition probabilities. 2.3. Mixture transition distribution (MTD) models.
A MTD model of order d ∈ Z + is a Markov chain of order d for which the associated family of transition probabilities {p(·|x −d:−1 ) : x −d:−1 ∈ supp(P)} admits the following representation: (2) p(a|x −d:−1 ) = λ 0 p 0 (a) + Σ j∈ −d,−1 λ j p j (a|x j ), for all a ∈ A, with λ 0 , . . . , λ −d ∈ [0, 1] satisfying Σ 0 j=−d λ j = 1, and p 0 (·) and p j (·|b), j ∈ −d, −1 and b ∈ A, being probability measures on A. Following (Berchtold and Raftery, 2002), we call the index j ∈ −d, 0 of the weight λ j in (2) the j-th lag of the model. The representation in (2) has the following probabilistic interpretation. To sample a symbol from p(·|x −d:−1 ), we first choose a lag in −d, 0 at random, where λ j is the probability of choosing lag j. Once the lag has been chosen, say lag j, we then sample a symbol from the probability measure p j (·|x j ), which depends on the past x −d:−1 only through the symbol x j . Notice that a symbol is sampled independently of the past x −d:−1 whenever the lag 0 is chosen. For later use, let us define the conditional average at lag j as m j (b) = Σ a∈A a p j (a|b), for each j ∈ −d, −1 and b ∈ A. For MTD models of order d, the oscillation δ j of the lag j ∈ −d, −1 can be written as (4) δ j = λ j max b,c∈A d T V (p j (·|b), p j (·|c)). Notice that in this case δ j = 0 if and only if either λ j = 0 or d T V (p j (·|b), p j (·|c)) = 0 for all b, c ∈ A. In the sequel, we say that the lag j is relevant if δ j > 0, and irrelevant otherwise. We will denote by Λ the set of all relevant lags, i.e., Λ = {j ∈ −d, −1 : δ j > 0}. The set Λ captures the dependence structure of the MTD model. The size |Λ| of the set Λ represents the degree of sparsity of the MTD model. The smaller the value of |Λ|, the sparser the MTD model. The following quantities will appear in many of our results: (6) δ min := min j∈Λ δ j and (8) p min := min{p(a|x Λ ) : a ∈ A, x Λ ∈ A Λ }. 2.4. Statistical lag selection. Suppose that we are given a sample X 1:n of a MTD model of known order d < n whose set of relevant lags Λ is unknown. The goal of statistical lag selection is to estimate the set Λ from the sample X 1:n .
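The sampling interpretation of (2) can be sketched directly: pick a lag with probabilities (λ j ), then draw from the corresponding single-step kernel. The sketch below assumes distributions are given as plain dictionaries; all names are illustrative, not from the paper.

```python
import random

def sample_mtd_step(past, lags, weights, p0, p_lag, rng=random):
    """One MTD step: choose a lag j with probability weights[j]; if j == 0,
    draw from the lag-independent measure p0, otherwise draw from the
    single-step kernel p_lag[j] conditioned on the past symbol past[j]."""
    j = rng.choices(lags, weights=weights)[0]
    dist = p0 if j == 0 else p_lag[j][past[j]]  # negative j indexes the past
    symbols, probs = zip(*dist.items())
    return rng.choices(symbols, weights=probs)[0]
```

Note that when lag 0 is drawn the output does not consult `past` at all, matching the remark above that the symbol is then sampled independently of the past.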
Our particular interest is in the high-dimensional setting in which the parameters d = d n and |Λ| = |Λ n | scale as a function of the sample size n. Let us write Λ̂ n to indicate an estimator of the set of relevant lags Λ computed from the sample X 1:n . We say that the estimator Λ̂ n is consistent if P(Λ̂ n = Λ) → 1 as n → ∞. With respect to statistical lag selection, our goal is to exhibit sufficient conditions for each proposed estimator guaranteeing its consistency. 2.5. Empirical transition probabilities. Let n, m and d be positive integers such that n − m > d. We denote, for each a ∈ A and x S ∈ A S with S ⊆ −d, −1 non-empty, N m,n (x S , a) = Σ n t=m+1 1{X t+S = x S , X t = a}. The random variable N m,n (x S , a) indicates the number of occurrences of the string x S "followed" by the symbol a in the last n − m symbols X m+1:n of the sample X 1:n . We also define N̄ m,n (x S ) = Σ a∈A N m,n (x S , a). With this notation, the empirical transition probabilities computed from the last n − m symbols X m+1:n of the sample X 1:n are defined as p̂ m,n (a|x S ) = N m,n (x S , a)/N̄ m,n (x S ), whenever N̄ m,n (x S ) > 0. When the counts are taken over the whole sample X 1:n , we denote by N n (x S , a) and N̄ n (x S ) the corresponding counting random variables, and by p̂ n (a|x S ) the corresponding empirical transition probabilities. The estimators for the set of relevant lags that we propose in the next sections rely on these empirical transition probabilities. If Λ̂ m denotes an estimator for the set of relevant lags Λ computed from X 1:m , we expect that, under some assumptions (guaranteeing in particular the consistency of Λ̂ m ), the empirical transition probability p̂ m,n (a|x Λ̂ m ) converges (in probability) to p(a|x Λ ) as min{n, m} → ∞, for any x −d:−1 ∈ supp(P). Understanding this convergence for the transition probabilities of high order Markov chains is crucial in our analysis. 2.6. Assumptions. We collect here the main assumptions used in the article. ASSUMPTION 1. The MTD model has full support, that is, supp(P) = A ⋆ .
In other words, Assumption 1 means that P(X S = x S ) = P(x S ) > 0 for any string x S ∈ A S with S ⊂ Z finite. Notice that this assumption implies, in particular, that p(a|x Λ ) > 0 for all a ∈ A and x Λ ∈ A Λ , where p(·|x Λ ) are the transition probabilities of the MTD generating the data. ASSUMPTION 2. The quantity ∆ := 1 − Σ j∈Λ δ j > 0, where δ j is given by (4). We have that λ 0 > 0 is a sufficient condition for Assumption 2 to hold. To check this, notice that Σ j∈Λ δ j = Σ j∈Λ λ j max b,c∈A d T V (p j (·|b), p j (·|c)) ≤ Σ j∈Λ λ j ≤ 1 − λ 0 , where we have used that d T V (p j (·|b), p j (·|c)) ≤ 1 for all b, c ∈ A and j ∈ Λ. Hence, it follows that ∆ > 0 whenever λ 0 > 0. Assumptions 1 and 2 are used to obtain concentration inequalities for the counting random variables N m,n (x S , a) and N̄ m,n (x S ) appearing in the definition of the empirical transition probabilities p̂ m,n (a|x). The next assumption is as follows. ASSUMPTION 3. For every j ∈ Λ, there exist b, c ∈ A such that |m j (b) − m j (c)| > 0, where m j is the conditional average defined in Section 2.3. When the alphabet is binary, this is always true by the definition of the set Λ. As we will see in Section 3, this condition is crucial to prove a structural result about MTD models, presented in Proposition 3.1. In what follows, P xS (X j ∈ ·|X k = b) denotes the conditional distribution of X j given X S = x S and X k = b. We use the convention that, for S = ∅, these conditional probabilities correspond to the unconditional ones. The next two assumptions are conditions of weak dependence that will only be necessary to obtain a computationally very efficient algorithm. ASSUMPTION 4 (Inward weak dependence condition). There exists Γ 1 ∈ (0, 1] such that the following condition holds: for all S ⊆ −d, −1 , k ∈ Λ \ S and b, c ∈ A with b ≠ c satisfying |m k (b) − m k (c)| > 0, ASSUMPTION 5 (Outward weak dependence condition). The alphabet is binary, i.e., A = {0, 1}. Moreover, there exists Γ 2 ∈ (0, 1] such that the following condition holds: for all S ⊆ −d, −1 such that S ⊂ Λ and k ∉ Λ, 3. Statistical lag selection. In this section, we address the problem of statistical lag selection for MTD models.
We will first introduce a statistical procedure called the PCP estimator, which is general and works well if there is a known small set S such that Λ ⊆ S. When such a set S is not available, we will have to consider an alternative procedure called the FSC estimator, which will be introduced later. 3.1. Estimator based on pairwise comparisons. Throughout this section we suppose that there is a known set S ⊆ −d, −1 such that Λ ⊆ S. Note that this is always satisfied in the worst-case scenario in which the set S is the whole set −d, −1 . In some cases, however, we may have prior knowledge of the set Λ and we can use this information to restrict our analysis to the lags in a known set S of size (possibly much) smaller than d. The estimator discussed in this section is based on pairwise comparisons of empirical transition probabilities corresponding to compatible pasts. For this reason, we call it the PCP estimator. The estimator is based on the following observation. For any j ∈ S, we say that the pasts x S , y S ∈ A S are (S \ {j})-compatible if y S\{j} = x S\{j} . We have that if j ∈ Λ, then there exists a pair of (S \ {j})-compatible pasts x S , y S ∈ A S such that the total variation distance between p(·|x S ) and p(·|y S ) is strictly positive. On the other hand, if j ∈ S \ Λ, then the total variation distance between p(·|x S ) and p(·|y S ) is 0 for all (S \ {j})-compatible pasts x S , y S ∈ A S . These remarks suggest estimating Λ by the subset of all lags j ∈ S for which the total variation distance between p̂ n (·|x S ) and p̂ n (·|y S ) is larger than a suitable positive threshold for some pair of (S \ {j})-compatible pasts x S and y S . A uniform threshold over all possible realizations usually gives suboptimal results, either underestimating or overestimating Λ for some configurations. The threshold we use here is adapted to each realization of the MTD, relying on improved martingale concentration inequalities that are of independent interest (see Appendix B).
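The pairwise-comparison idea can be sketched as follows: a lag j ∈ S is kept whenever two pasts that agree everywhere except at position j yield empirical transition probabilities far apart in total variation. A minimal sketch, with a single fixed threshold standing in for the data-dependent threshold t n (x S , y S ) used by the actual PCP estimator:

```python
from itertools import product

def tv_distance(p, q, alphabet):
    """Total variation distance between two distributions given as dicts."""
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in alphabet)

def pcp_sketch(phat, S, alphabet, threshold):
    """Keep lag j iff some pair of (S \\ {j})-compatible pasts has empirical
    transition probabilities at TV distance >= threshold.  `phat` maps each
    past (a tuple indexed as S) to a dict of symbol probabilities."""
    selected = set()
    for idx, j in enumerate(S):
        for x in product(alphabet, repeat=len(S)):
            for b in alphabet:
                if b == x[idx]:
                    continue
                y = x[:idx] + (b,) + x[idx + 1:]   # differs from x only at lag j
                if x in phat and y in phat and \
                        tv_distance(phat[x], phat[y], alphabet) >= threshold:
                    selected.add(j)
    return selected
```

In the paper the threshold is random and adapted to each pair of pasts; the constant used here is only for illustration.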
Fix ε > 0, α > 0 and µ ∈ (0, 3) such that µ > ψ(µ) := e µ − µ − 1. For each x S , y S ∈ A S , consider the random threshold t n (x S , y S ) defined as, where s n (x S ) is given by With this notation, the PCP estimator Λ̂ 1,n is defined as follows. A lag j ∈ S belongs to Λ̂ 1,n if and only if there exists a (S \ {j})-compatible pair of pasts x S , y S ∈ A S such that (13) d T V (p̂ n (·|x S ), p̂ n (·|y S )) ≥ t n (x S , y S ). In the sequel, the set S ⊆ −d, −1 such that Λ ⊆ S and the constants ε > 0, α > 0 and µ ∈ (0, 3) such that µ > ψ(µ) are called the parameters of the PCP estimator Λ̂ 1,n . Hereafter, for each j ∈ S and any b, c ∈ A, let Finally, consider the following quantity With these definitions, we have the following result. THEOREM 3.1. Let X 1:n be a sample of a MTD model with set of relevant lags Λ, where n > d. If Λ̂ 1,n is the PCP estimator defined in (13) with parameters µ ∈ (0, 3) such that µ > ψ(µ), α > 0, ε > 0 and Λ ⊆ S ⊆ −d, −1 , we have that 2. For each j ∈ Λ, we have that where γ n,j and δ j are defined in (14) and (4) respectively. 3. Furthermore, if Assumptions 1 and 2 hold, then there exists a constant C = C(ε, µ) > 0 such that for n satisfying where δ min and P S are defined respectively in (6) and (15), we have that The proof of Theorem 3.1 is given in Appendix A.1.1. REMARK 1. (a) The sum over j ∈ S \ Λ of the upper bound provided by Item 1 of Theorem 3.1 controls the probability that the PCP estimator Λ̂ 1,n overestimates the set of relevant lags Λ. The sum over j ∈ Λ of the upper bound given in Item 2 of Theorem 3.1 is an upper bound for the probability that the PCP estimator underestimates the subset of relevant lags j ∈ Λ whose oscillation δ j is larger than or equal to the "noise level" γ n,j . Note that the sum of these upper bounds corresponds to the first term appearing on the right-hand side of (17).
(b) The second term on the right-hand side of (17) is an upper bound for the probability that there exists some relevant lag j ∈ Λ whose oscillation δ j is strictly smaller than the "noise level" γ n,j . REMARK 2. By Assumption 1, we have that P S ≥ p min /|A| |S|−1 , where p min is defined in (8). As a consequence, it follows from (16) that if the sample size n is such that then inequality (17) holds with the second exponential term replaced by Combining Theorem 3.1 and Remark 2, one can deduce the following result. COROLLARY 3.1. For each n, consider a MTD model with set of relevant lags Λ n and transition probabilities p n (a|x Λn ) such that p min,n ≥ p ⋆ min and ∆ n ≥ ∆ ⋆ min for some positive constants p ⋆ min and ∆ ⋆ min . Let d n = βn for some β ∈ (0, 1) and suppose that Λ n ⊆ S n ⊆ −d n , −1 with |S n | ≤ ((1 − γ)/2) log |A| (n) for some γ ∈ (0, 1). Let X 1:n be a sample from the MTD specified by Λ n and p n (a|x Λn ), and denote by Λ̂ 1,n the PCP estimator defined in (13) computed from this sample with parameters µ n = µ ∈ (0, 3) such that µ > ψ(µ), ε n = ε > 0, α n = (1 + η) log(n) with η > 0 and S n . Under these assumptions, there exists a constant C = C(µ, ε, β, p ⋆ min , ∆ ⋆ min , γ, η) > 0 such that if The proof of Corollary 3.1 is given in Appendix A.1.2. REMARK 3. (a) Under the assumptions of Corollary 3.1, if additionally we have |Λ n | ≤ L for all values of n for some positive integer L, then one can choose a suitable sequence γ n → 1 as n → ∞ to obtain that P(Λ̂ 1,n ≠ Λ n ) vanishes as n → ∞, as long as where the constant C here is larger than the one given in (20). (b) Observe that in Corollary 3.1, the set of relevant lags can be either finite or grow very slowly with respect to the sample size n. On the other hand, no assumption on the orders d n of the underlying sequence of MTD models is made. In particular, we could consider MTD models with very large orders, for example d n = βn with β ∈ (0, 1).
As Corollary 3.1 indicates, in the setting d n = βn, the major drawback of the PCP estimator Λ̂ 1,n is that it requires prior knowledge of Λ n in the form of a set S n growing slowly enough and such that Λ n ⊆ S n . The main goal of the next two sections is to propose alternative estimators of Λ n to deal with this issue. 3.2. Forward Stepwise and Cut estimator. In this section we introduce a second estimator of the set of relevant lags Λ, called the Forward Stepwise and Cut (FSC) estimator. This estimator is based on a structural result about MTD models presented in Proposition 3.1 below. Before presenting this structural result, we need to introduce some notation. In what follows, for each lag k ∈ −d, −1 , subset S ⊆ −d, −1 \ {k}, configuration x S ∈ A S and symbols b, c ∈ A, let us denote Recall that P xS (X 0 ∈ ·|X k = b) and P xS (X k = b) denote, respectively, the conditional distribution of X 0 given X S = x S and X k = b, and the conditional probability of X k = b given X S = x S , with the convention that these conditional probabilities for S = ∅ correspond to the unconditional ones. Let us also denote, for each lag k ∈ −d, −1 and subset S ⊆ −d, −1 \ {k}, The quantity ν k,S (x S ) measures the influence of X k on X 0 , conditionally on the variables X S = x S . The average conditional influence of X k on X 0 is measured through the quantity ν̄ k,S . In the sequel, we write Cov xS (X 0 , X k ) to denote the conditional covariance between the random variables X 0 and X k given that X S = x S . Here, we also use the convention that the conditional covariance for S = ∅ corresponds to the unconditional one. With this notation, we can prove the following structural result about MTD models. PROPOSITION 3.1. For any lag k ∈ −d, −1 and subset S ⊆ −d, −1 \ {k}, Moreover, if Assumptions 1 and 3 hold, then there exists a constant κ > 0 such that the following property holds: for any where δ̄ min , p min and Γ 1 are defined respectively in (6), (8) and (9).
The proof of Proposition 3.1 is given in Appendix A.2. Putting together these facts, we deduce that the set of relevant lags can be written as Λ = arg min{|S| : f (S) = 0}. This observation motivates the FSC estimator defined below. In what follows, we split the data X 1:n into two pieces. The first part is composed of the first m symbols X 1:m , where 1 ≤ m < n, whereas the second part is composed of the last n − m symbols X m+1:n . In the sequel, we write ν̄ m,k,S to denote the empirical estimate of ν̄ k,S computed from X 1:m . The formal definition of ν̄ m,k,S involves extra notation and is postponed to Appendix A. The FSC estimator is built in two steps. The first step is called Forward Stepwise (FS) and the second one is called CUT. In the FS step, we start with S = ∅ and iteratively add to the set S a lag j ∈ arg max k∈S c ν̄ m,k,S , as long as |S| < ℓ, where 0 ≤ ℓ ≤ d is a parameter of the estimator. We denote by Ŝ m the set obtained at the end of the FS step, with the convention that Ŝ m = ∅ if the parameter ℓ = 0. As we will see, if ℓ is properly chosen, the candidate set Ŝ m will contain the set of relevant lags Λ with high probability. It may, of course, include irrelevant lags j (those with δ j = 0). In the CUT step, for each j ∈ Ŝ m , we remove j from Ŝ m unless the criterion in (13) holds for some pair of (Ŝ m \ {j})-compatible pasts, where the threshold is given by (12) replacing N̄ n (·) and p̂ n (·|·) by N̄ m,n (·) and p̂ m,n (·|·) respectively. The FSC estimator is defined as the set Λ̂ 2,n of all lags not removed in the CUT step. The pseudo-code of the algorithm to compute the FSC estimator is given in Algorithm 1.
Algorithm 1: FSC(X 1 , . . . , X n )
FS step:
1. Ŝ m = ∅;
2. While |Ŝ m | < ℓ:
3. Compute j * = arg max j∈Ŝ c m ν̄ m,j,Ŝ m and include j * in Ŝ m ;
CUT step:
6. For each j ∈ Ŝ m , remove j from Ŝ m unless (13) holds for some pair of (Ŝ m \ {j})-compatible pasts, with the empirical quantities computed from X m+1:n .
REMARK 5. (a) It is worth mentioning the following alternative algorithm (henceforth called Algorithm 2) to estimate the set of relevant lags Λ. Like Algorithm 1, Algorithm 2 has two steps as well. In the first step, we start with S = ∅ and iteratively add a lag
j ∈ arg max k∈S c ν̄ n,k,S as long as ν̄ n,j,S > τ , where τ is a parameter of the algorithm and ν̄ n,j,S is the empirical estimate of ν̄ j,S computed from the entire data X 1:n . Let Ŝ n denote the set obtained at the end of this step. Next, in the second step, for each j ∈ Ŝ n , we remove j from Ŝ n unless ν̄ n,j,Ŝ n \{j} ≥ τ . The output of Algorithm 2 is the set of all lags in Ŝ n which were not removed in the second step. Algorithm 2 can be seen as a version, adapted to our setting, of the LearnNbhd algorithm proposed in Bresler (2015) to estimate the interaction graph underlying an Ising model from i.i.d. samples of the model. (b) As opposed to Algorithm 2, notice that the data X 1:n is split into two parts in Algorithm 1. The first m symbols X 1:m of the sample are used in the FS step, whereas the last n − m symbols X m+1:n are only used in the CUT step. Despite requiring the data to be split into two parts, one nice feature of Algorithm 1 is that even if a large ℓ is chosen, the CUT step will remove the non-relevant lags, whereas in Algorithm 2 we have to calibrate τ carefully to recover the relevant lags. In what follows, for any ξ > 0 and 0 ≤ ℓ ≤ d, let us define the following event, In the next result we show that whenever the event G m (ℓ, ξ) holds with properly chosen parameters ξ and ℓ, the candidate set Ŝ m constructed in the FS step with parameter ℓ contains Λ. THEOREM 3.2. Suppose Assumptions 1 and 3 hold and let κ be the lower bound provided by Proposition 3.1. Let ℓ * and ξ * be defined as in (30). Let Ŝ m denote the candidate set constructed in the FS step of Algorithm 1 with parameter ℓ * . On the event G m (ℓ * , ξ * ), we have that Λ ⊆ Ŝ m . The proof of Theorem 3.2 is given in Appendix A.2.2. Theorem 3.2 ensures that the candidate set Ŝ m contains the set of relevant lags Λ whenever the event G m (ℓ * , ξ * ) holds.
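The FS step of Algorithm 1 is a plain greedy loop; a minimal sketch, with a user-supplied `score` function standing in for the empirical influence estimate ν̄ m,k,S (all names are ours):

```python
def forward_stepwise(candidate_lags, ell, score):
    """Greedy FS step: starting from S = [], repeatedly add the lag
    maximizing score(k, S) over lags k outside S, until |S| = ell."""
    S = []
    while len(S) < ell:
        remaining = [k for k in candidate_lags if k not in S]
        if not remaining:
            break
        S.append(max(remaining, key=lambda k: score(k, tuple(S))))
    return S
```

The CUT step would then prune S by the pairwise-comparison criterion; Algorithm 2 would instead stop this loop as soon as the best score drops below the threshold τ.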
In this case, we can think of the CUT step as the PCP estimator discussed in the previous section applied to the last n − m observations X m+1:n of the data, with S = Ŝ m . The main difference is that Ŝ m is a random set, depending on the first m observations X 1:m of the data. In the sequel, let us denote where P S is defined in (15). In the next result we estimate the error probability of the FSC estimator. THEOREM 3.3. Suppose Assumptions 1, 2, and 3 hold. Let ∆ > 0 be the quantity defined in Assumption 2. Denote by Λ̂ 2,n the FSC estimator constructed by Algorithm 1 with parameter ℓ * , as defined in (30). Suppose also that m > d ≥ 2ℓ * . Then there exists a constant C = C(ε, µ) > 0 such that if where δ min and P Sd,ℓ * are defined in (6) and (31), then we have that, where ξ * is defined in (30). The proof of Theorem 3.3 is given in Appendix A.2.3. REMARK 6. (a) Let us give some intuition about the three terms appearing on the right-hand side of (33). The first one is an upper bound for P(G c m (ℓ * , ξ * )). The other two are related to the terms appearing in (17). Indeed, by recalling that |Ŝ m | = ℓ * , one immediately sees that the third term of (33) corresponds to the first term of (17) with Ŝ m and n − m in the place of S and n respectively. Besides, the second term of (33) is similar (modulo a factor which depends on d and ℓ * ) to the second term of (17). This extra factor reflects the fact that we do not know a priori a set S containing the set of relevant lags Λ. (b) Similar to Remark 2, one can also show that P Sd,ℓ ≥ p min /|A| ℓ−1 , where p min is defined in (8). Hence, we can deduce from (32) that when the sample size n satisfies then inequality (33) holds with the second exponential term replaced by The next result is a corollary of Theorem 3.3. COROLLARY 3.2.
For each n, consider a MTD model with set of relevant lags Λ n and transition probabilities p n (a|x Λn ) satisfying p min,n ≥ p ⋆ min and ∆ n ≥ ∆ ⋆ min for some positive constants p ⋆ min and ∆ ⋆ min , and such that Assumption 3 holds. Let m n = n/2 and d n = m n β with β ∈ (0, 1). Let X 1:n be a sample from the MTD specified by Λ n and p n (a|x Λn ), and denote by Λ̂ 2,n the FSC estimator constructed by Algorithm 1 with parameters m n , µ n = µ ∈ (0, 3) such that µ > ψ(µ), ε n = ε > 0, α n = (1 + η) log(n) with η > 0 and ℓ *,n as defined in (30). Assume that ℓ *,n ≤ ((1 − γ)/2) log |A| (n) for some γ ∈ (0, 1). Then there exists a constant C = C(β, γ, η, p ⋆ min , ∆ ⋆ min , ε, µ) > 0 such that P(Λ̂ 2,n ≠ Λ n ) → 0 as n → ∞, whenever 3.3. Improving the efficiency for the binary case. In this section, we show that when the alphabet is binary, i.e., A = {0, 1}, we can further improve the FSC algorithm if we consider Assumptions 4 and 5. Observe that when the alphabet is binary, Assumption 3 holds automatically (see Section 2.6). Moreover, we have that where δ min and p min are defined in (6) and (8) respectively. In particular, if Γ 1 > Γ 2 and Ŝ m denotes the candidate set constructed at the end of the FS step of Algorithm 1 with parameter ℓ ≥ |Λ|, then Λ ⊆ Ŝ m whenever the event G m (ℓ, ξ) holds, where The proof of Theorem 3.4 is given in Appendix A.3.1. REMARK 8. Notice that if the size of Λ is known, then Theorem 3.4 implies Ŝ m = Λ on the event G m (|Λ|, ξ) with ξ satisfying (37). In particular, in this case, we neither need to execute the CUT step nor to split the data into two pieces. In the same spirit as the previous corollaries, we can show the following result. COROLLARY 3.3.
For each n, consider a MTD model with set of relevant lags Λ n and transition probabilities p n (a|x Λn ) satisfying Assumptions 4 and 5 with Γ 1,n = Γ 1 > Γ 2 = Γ 2,n and such that |Λ n | ≤ L for some integer L, p min,n ≥ p ⋆ min and ∆ n ≥ ∆ ⋆ min for some positive constants p ⋆ min and ∆ ⋆ min . Let X 1:n be a sample from the MTD specified by Λ n and p n (a|x Λn ), and denote Λ̂ 2,n the FSC estimator with parameters m n = n/2, µ n = µ ∈ (0, 3) such that µ > ψ(µ), ε n = ε > 0, α n = (1 + η) log(n) with η > 0 and ℓ = L. Suppose that d n = βn with β ∈ (0, 1). Then there exists a constant C = C(β, L, ∆ ⋆ min , p ⋆ min , Γ 1 , Γ 2 , η, µ, ε) > 0 such that P(Λ̂ 2,n ≠ Λ n ) → 0 as n → ∞, as long as The proof of Corollary 3.3 is given in Appendix A.3.2. 3.4. Post-selection transition probability estimation. Once the set of relevant lags has been estimated by applying the FSC estimator to the sample X 1:n , we reuse the entire sample to compute the estimator p̂ n (a|x Λ̂2,n ) of the transition probability p(a|x Λ ). In the next result, we provide an estimate of the rate of convergence of p̂ n (a|x Λ̂2,n ) towards p(a|x Λ ), simultaneously for all pasts x −d:−1 ∈ A −d,−1 . THEOREM 3.5. Under the assumptions and notation of Theorem 3.3, where V̂ n (a, x Λ̂2,n ) is given by The proof of Theorem 3.5 is given in Appendix A.4. 3.5. A remark on the minimax rate for the lag selection. We take A = {0, 1} and consider the set {p (j) (·|·) : j ∈ −d, −1 } of transition probabilities of the following form: where δ := λ|p(1|1) − p(1|0)| > 0. For each j ∈ −d, −1 , we denote P (j) the probability measure under which (X t ) t∈Z is a stationary MTD model having transition probability p (j) (·|·). For each t ≥ 1, we denote P (j) 1:t the law of X 1:t under P (j) . We denote M T D d,δ the class of transition probabilities {p(a|x Λ ) : a ∈ A, x Λ ∈ A Λ } of a MTD model of order d whose corresponding δ min ≥ δ. For a given p ∈ M T D d,δ , we denote P p the probability distribution under which (X t ) t∈Z is a stationary MTD model of order d with transition probabilities given by p.
With this notation, we have the following result. PROPOSITION 3.2. Let n > d. Then the following inequality holds: for j, k ∈ −d, −1 , In particular, if β ∈ (0, 1), d = nβ, and where the infimum is over all lag estimators Λ̂ n based on a sample of size n. The proof of (43) follows immediately from Fano's inequality and the upper bound (41). Combining (38) and (42), we deduce that the condition on the minimal oscillation required for the consistency of the FSC estimator in Corollary 3.3 is sharp. The proof of Proposition 3.2 is given in Appendix A.5. 4. Simulations. Here, we investigated the performance of the proposed methods by simulations. We used the following MTD model on the alphabet A = {0, 1}. We considered different choices of order d and relevant lags i, j ∈ −d, −1 (see Table 1 ). Let p 0 (1) = p 0 (0) = 0.5 and λ 0 = 0.2. Also, let p i (0|0) = 1 − p i (0|1) = p j (0|0) = 1 − p j (0|1) = 0.7 and λ i = λ j = 0.4. For all x −d:−1 ∈ {0, 1} −d,−1 and a ∈ A, the transition probability of the model was given by p(a|x −d:−1 ) = λ 0 p 0 (a) + λ i p i (a|x i ) + λ j p j (a|x j ). We simulated the above model using sample sizes n ∈ {2 8 , 2 9 , 2 10 , 2 11 , 2 12 , 2 13 }. For each choice of i, j, d, and n we simulated 100 realizations. For each realization, we estimated the transition probability p(0|0 d ). We used different estimators for the comparisons. For transition probability estimation after FSC, we used X 1:n/2 for the FS step and X n/2+1:n for the CUT step, obtaining the estimated relevant lag set Λ̂ n . Then we used X 1:n to calculate p̂ n (0|0 Λ̂n ). For transition probability estimation after PCP, we used X 1:n to calculate Λ̂ n for the PCP relevant lag estimator with initial set S = −d, −1 . Then we used X 1:n to calculate p̂ n (0|0 Λ̂n ). We also compared the performance of the transition probability estimator p̂ n (0|0 −d:−1 ), for which we did not select the relevant lags (Naive estimator).
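The simulation setup above can be sketched in a few lines; this is a minimal illustration assuming NumPy, with the two lag components taken symmetric (p i (0|0) = p j (0|0) = 0.7, p i (0|1) = p j (0|1) = 0.3), which is how we read the intended setup. The function name and signature are ours, not the paper's code.

```python
import numpy as np

def simulate_mtd(n, d, i, j, rng=None):
    """Simulate the binary MTD chain of the simulation section:
    p(a | x_{-d:-1}) = lam0*p0(a) + lam*p_i(a|x_i) + lam*p_j(a|x_j),
    with lam0 = 0.2, lam = 0.4, p0 = (0.5, 0.5), and
    p_lag(0|0) = 0.7, p_lag(0|1) = 0.3 for both relevant lags.
    The lags i, j are negative offsets in {-d, ..., -1}."""
    rng = np.random.default_rng(rng)
    lam0, lam = 0.2, 0.4
    p0_zero = 0.5                       # p0(0), the memoryless component
    p_lag_zero = np.array([0.7, 0.3])   # p_lag(0 | lagged symbol)
    x = rng.integers(0, 2, size=d)      # arbitrary initial past
    out = np.empty(n, dtype=int)
    for t in range(n):
        past = np.concatenate([x, out[:t]])[-d:]   # last d symbols
        prob0 = (lam0 * p0_zero
                 + lam * p_lag_zero[past[i]]       # contribution of lag i
                 + lam * p_lag_zero[past[j]])      # contribution of lag j
        out[t] = int(rng.random() >= prob0)        # emit 0 with prob prob0
    return out
```

By the 0/1 symmetry of the chosen parameters the stationary marginal puts mass 1/2 on each symbol, which gives a quick sanity check on a simulated path.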
In our simulations, when d was larger than 5, for both the PCP and Naive estimators we did not obtain meaningful results because N̄ n (0 d ) = 0 with high probability. Therefore, we compared the PCP and Naive estimators only for d = 5. In this case, FSC showed similar performance to the PCP estimator and was in general better than the Naive estimator. When d > 5, e.g. d = n/8, FSC still exhibited good performance (Table 1) . PROOF OF THEOREM 3.1. Since the set S ⊆ −d, −1 containing the set Λ is fixed, we will write x instead of x S to alleviate the notation. We start proving Item 1. Proof of Item 1. For each x ∈ A S , let us define the event . By using first the union bound and then by applying Proposition B.3, we deduce that for each x ∈ A S , which, together with (44), implies that (45) Note that if j ∉ Λ, then by the definition of the set Λ it follows that p(a|x) = p(a|y) for all x, y ∈ A S which are (S \{j})-compatible. Hence, by applying first the triangle inequality and then using that t n (x, y) = s n (x) + s n (y), we deduce that the event {d T V (p̂ n (·|x), p̂ n (·|y)) ≥ t n (x, y)} is contained in the event {d T V (p̂ n (·|x), p(·|x)) ≥ s n (x)} ∪ {d T V (p̂ n (·|y), p(·|y)) ≥ s n (y)}, so that where in the second inequality we have used (45). , we obtain from the above inequality that, concluding the proof of Item 1. Proof of Item 2. Let j ∈ Λ, recall that δ j = λ j max b,c∈A d T V (p j (·|b), p j (·|c)) and consider the event E = {δ j ≥ γ n,j }. Take b ⋆ , c ⋆ ∈ A such that d T V (p j (·|b ⋆ ), p j (·|c ⋆ )) = max b,c∈A d T V (p j (·|b), p j (·|c)), and observe that with this choice, where we have used also that γ n,j = 2t n,j . Now, take a pair (x ⋆ , y ⋆ ) ∈ C j (b ⋆ , c ⋆ ) attaining the minimum in (14): From the definition of t n,j , it follows then that t n,j ≥ t n,j (b ⋆ , c ⋆ ) = t n (x ⋆ , y ⋆ ).
Therefore, we conclude that E ⊆ {d T V (p(·|x ⋆ ), p(·|y ⋆ )) ≥ 2t n (x ⋆ , y ⋆ )}, so that by the triangle inequality, we obtain that on E, From (45), it follows then that P(j ∉ Λ̂ 1,n , γ n,j ≤ δ j ) = P({j ∉ Λ̂ 1,n } ∩ E) ≤ 8|A| log(µ(n − d)/α + 2) log(1 + ε) e −α , concluding the proof of Item 2. Proof of Item 3. Observe that by combining Items 1 and 2 together with the union bound, we deduce that Hence, to conclude the proof of Item 3, it suffices to show that for all j ∈ Λ, whenever the sample size n satisfies (16). By the union bound, we have that for any (x, y) ∈ C j (b, c). By using again the union bound, we can deduce that for each x ∈ A S , and also that (1 + ε) 2(µ − ψ(µ)) > δ j /16 . By applying Proposition A.2 with u 1 = P(x) − (4|A|α)/(3δ j (n − d)) and u 2 = P(x) − (16|A|α (1 + ε)/2(µ − ψ(µ)))/(δ j (n − d)), one can show that as long as (n − d) > 16|A|α δj P(x) (1+ε) 2(µ−ψ(µ)) . , so that by Proposition A.2 with u 3 = P(x) − (128α(1 + ε)µ|A|)/(δ 2 j (µ − ψ(µ))), it follows that ). Therefore, we have shown that for any x ∈ A S , as long as . we can deduce from (48) and (49) that . Since P(x * ,j ) ∧ P(y * ,j ) ≥ P S for all j ∈ Λ * , we can take to deduce that (46) is indeed satisfied whenever concluding the proof. A.1.2. Proof of Corollary 3.1. PROOF OF COROLLARY 3.1. Notice that Assumptions 1 and 2 are satisfied for all values of n, since p min,n ≥ p ⋆ min and ∆ n ≥ ∆ ⋆ min for positive constants p ⋆ min and ∆ ⋆ min . Hence, the result follows immediately from Theorem 3.1-Item 3 and Remark 1-Item (c). A.2.1. Proof of Proposition 3.1. In this section we prove Proposition 3.1. To that end, we need some auxiliary results. The first auxiliary result is the following. Recall that we write Cov xS (X 0 , m k (X k )) to denote the conditional covariance between the random variables X 0 and m k (X k ) given that X S = x S , where m k is defined in (3). LEMMA A.1. For each S ⊆ −d, −1 , k ∈ S c and x S ∈ A S , the following identity holds: REMARK 9.
In (50), we use the convention that j∈∅ λ j Cov xS (m j (X j ), m k (X k )) = 0. PROOF OF LEMMA A.1. Observe that if Λ ⊆ S, then both sides of (50) are 0, so that the result holds immediately in this case. Now suppose that Λ ⊄ S. In this case, to shorten the notation, let us write We want to compute We first compute E xS (X 0 ). To that end, write and observe that for each a ∈ A, where in the second equality we have used the definition of the transition probabilities (2). Hence, we have that where m 0 = a∈A ap 0 (a). As a consequence of the above equality, it follows that We now compute E xS (X 0 m k (X k )). We consider only the case k ∈ Λ, the other is treated similarly. In this case, we first write and then we proceed similarly as above to deduce that for each a, b ∈ A, where in the second equality we have used the definition of the transition probabilities (2). Combining (52) and (53), we deduce that Putting together the identities (51) and (54), we then conclude that and the result follows. The next auxiliary result is the following. LEMMA A.2. Suppose Assumptions 1 and 3 hold. Then there exists a constant κ ′ > 0 such that the following property holds: for any S ⊆ −d, −1 such that Λ ⊄ S, we have j∈Λ\S k∈Λ\S PROOF OF LEMMA A.2. It suffices to show that for S ⊆ −d, −1 such that Λ ⊄ S, we have j∈Λ\S k∈Λ\S λ j λ k E(Cov xS (m j (X j ), m k (X k ))) > 0. Suppose that this is not the case. Then, This implies that P-almost surely, for some function f : A S → R. Now take any configuration x S ∈ A S and consider the event A = {X S = x S }. From the above identity, it follows that P-a.s., Finally, take any configuration x Λ\S ∈ A Λ\S such that j∈Λ\S λ j m j (x j ) = f (x S ), and let B = {X Λ\S = x Λ\S }. Such a configuration must exist by Assumption 3. As a consequence, we have that P-a.s., By Assumption 1, we have P(A ∩ B) = P(x Λ ) > 0 so that the identity above would imply that which is a contradiction. Therefore, we must have and the result follows.
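Lemmas A.1 and A.2 turn on the expected conditional covariance E(Cov XS (X 0 , m k (X k ))), which stays bounded away from zero while relevant lags are missing from S; in the binary case m k is affine in X k , so this reduces to the conditional covariance between X 0 and X k . A minimal plug-in sketch of this quantity (grouping by the observed context x S and averaging absolute within-group covariances) is given below; the function name and grouping scheme are illustrative, not the paper's estimator.

```python
import numpy as np
from collections import defaultdict

def cond_cov_stat(x, S, k, d):
    """Empirical analogue of E(|Cov_{X_S}(X_0, X_k)|) for a binary sample x.
    S: tuple of negative lags; k: a negative lag not in S; d: the order.
    Groups the times t >= d by the observed context (x[t+s], s in S) and
    averages the absolute within-group covariance between x[t] and x[t+k],
    weighted by the empirical frequency of each context."""
    groups = defaultdict(list)
    for t in range(d, len(x)):
        ctx = tuple(x[t + s] for s in S)
        groups[ctx].append((x[t], x[t + k]))
    n_tot = len(x) - d
    stat = 0.0
    for pairs in groups.values():
        a = np.array(pairs, dtype=float)
        if len(a) > 1:
            cov = np.cov(a[:, 0], a[:, 1], bias=True)[0, 1]
            stat += (len(a) / n_tot) * abs(cov)
    return stat
```

On a sticky order-1 binary chain this statistic is visibly larger at the true lag −1 than at a distant lag, which is the separation the FS step exploits.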
We also need the following result. LEMMA A.3. For each S ⊆ −d, −1 , k ∈ S c and x S ∈ A S , the following identity holds: where m k and ∥m k ∥ Lip are defined in (3) and (6) respectively. By taking c = min(A), we have that |b − c| = (b − c) for any b ∈ A and we deduce from the above inequality that A similar argument shows that Cov xS (X 0 , m k (X k )) ≥ −∥m k ∥ Lip |Cov xS (X 0 , X k )|, concluding the proof. Our last auxiliary result is the following. LEMMA A.4. For each S ⊆ −d, −1 such that Λ ⊄ S, x S ∈ A S and j, k ∈ Λ \ S, the following identity holds: where m j is defined in (3). Now, for any a, b ∈ A, one can check that Hence, we deduce from the above equalities that Interchanging the role of the symbols b and c in the equality above, we obtain that The result follows by combining the last two equalities above. We now prove Proposition 3.1. PROOF OF PROPOSITION 3.1. We first prove inequality (26). Let us denote D k,S (a, b, c, x S ) = P xS (X 0 = a|X k = b) − P xS (X 0 = a|X k = c), for each a, b, c ∈ A, x S ∈ A S and k ∉ S. With this notation, one can check that for any x S ∈ A S and k ∉ S, we have that Now, observe that the triangle inequality and the equality proving inequality (26). We now prove (27). This is done as follows. In the sequel, we shall write λ1 Λ\S to denote the vector λ = (λ j ) j∈Λ restricted to the coordinates in Λ \ S: λ1 Λ\S = (λ j ) j∈Λ\S . With this notation, it follows from Lemma A.1 and Lemma A.2 that for any S ⊆ −d, −1 such that By the triangle inequality, it then follows that k∈Λ\S λ k E (|Cov XS (X 0 , m k (X k ))|) ≥ κ ′ . By observing that 1 − λ 0 = k∈Λ λ k ≥ ∥λ1 Λ\S ∥ 1 , we conclude from the above inequality that and the result follows from Lemma A.3. Therefore, it remains to prove (28). To that end, we first use Lemma A.1, Lemma A.3 and Lemma A.4 to obtain that k∈Λ\S λ k ∥m k ∥ Lip |Cov xS (X 0 , X k )| ≥ 1 2 k∈Λ\S j∈Λ\S b∈A c∈A Next, we observe that Assumption 4 implies that where we have also used that P xS (X k = b) ≥ p min .
Finally, note that |m Lip to obtain that Then, by using the Cauchy-Schwarz inequality, we deduce that The result follows by combining the last two inequalities. A.2.2. Proof of Theorem 3.2. Before starting the proof of Theorem 3.2, we recall some definitions from Information Theory. In what follows, for S ⊆ −d, −1 and j ∈ −d, −1 , we write I(X 0 ; X j |X S ) to denote the conditional mutual information between X 0 and X j given X S , defined as where I(X 0 ; X j |X S = x S ) := I j (x S ) denotes the conditional mutual information between X 0 and X j given X S = x S , defined as . We use the convention that when S = ∅, the conditional probability P xS is the unconditional probability P. Hence, in this case, the conditional mutual information between X 0 and X j is the mutual information between these random variables, denoted I(X 0 ; X j ) := I j . The entropy H(X 0 ) of X 0 is defined as (61) H(X 0 ) = − a∈A P(X 0 = a) log(P(X 0 = a)). To prove Theorem 3.2 we proceed similarly to Bresler (2015). During the proof we will need the following lemma. LEMMA A.5. Suppose that the event G m (ξ, ℓ) defined in (29) holds and let S ⊆ −d, −1 with |S| ≤ ℓ. If ν̂ m,k,S ≥ τ with k ∈ S c , then I(X 0 ; X k |X S ) ≥ 2(τ − ξ) 2 . PROOF. Definition (59) together with Jensen's inequality implies that for any j ∈ S c , By Pinsker's inequality, it then follows that = ν j,S (x S ), where in the second equality we have used that for any a, b ∈ A, and also that P xS (X 0 = a) = c∈A P xS (X j = c)P xS (X 0 = a|X j = c). As a consequence, we deduce that Now, on the event G m (ξ, ℓ), we have that ν k,S ≥ ν̂ m,k,S − ξ so that where in the rightmost inequality we have used that ν̂ m,k,S ≥ τ . Hence, I(X 0 ; X k |X S ) ≥ 2(τ − ξ) 2 , and the result follows. We now prove Theorem 3.2. PROOF. Suppose the event G m (ξ * , ℓ * ) holds and let Ŝ m be the set obtained at the end of the FS step of Algorithm 1 with parameter ℓ * , where the parameters ξ * and ℓ * are defined as in (30).
In the sequel, let S 0 = ∅ and S k = S k−1 ∪ {j k }, where j k ∈ arg max j∈S c k−1 ν̂ m,j,S k−1 for 1 ≤ k ≤ d, and observe that by construction Ŝ m = S ℓ * . We want to show that Λ ⊆ Ŝ m . We argue by contradiction. Suppose that Λ is not contained in Ŝ m . In this case, it follows that Λ ⊄ S k for all 1 ≤ k ≤ ℓ * , and Proposition 3.1 implies that for all 1 ≤ k ≤ ℓ * , where the equality holds by the choice of ξ * . Since the event G m (ξ * , ℓ * ) holds and |S k | ≤ |S ℓ * | = ℓ * , it follows from the above inequality that for all 1 ≤ k ≤ ℓ * + 1. By Lemma A.5, we then deduce that I(X 0 ; X j k |X S k−1 ) ≥ 8ξ 2 * for all 1 ≤ k ≤ ℓ * + 1. Now, notice that where we have used Gibbs' inequality in the first passage, the fact that the entropy is always larger than the mutual information in the second passage and the chain rule in the last passage. The proof of these facts can be found, for instance, in (Cover and Thomas, 2006). By the choice of ℓ * = log 2 (|A|)/8ξ 2 * , we have that ℓ * + 1 > log 2 (|A|)/8ξ 2 * so that it follows from (62) that log 2 (|A|) ≥ (ℓ * + 1)8ξ 2 * > log 2 (|A|), a contradiction. Thus, we must have Λ ⊆ Ŝ m and the result follows. A.2.3. Proof of Theorem 3.3. To prove Theorem 3.3 we shall need the following result. PROPOSITION A.1. Suppose Assumptions 1 and 2 hold, and let ∆ > 0 be the quantity defined in Assumption 2. Then, for any ξ > 0 and m > d ≥ 2ℓ ≥ 0, During the proof of Proposition A.1 we will make use of the following proposition. For any function f : A 1,m → R, define for each 1 ≤ j ≤ m, PROPOSITION A.2 (Theorem 3.4 of Chazottes, Gallo and Takahashi (2020)). Suppose Assumption 2 holds, that is, ∆ > 0. 1. For any u > 0 and f : A 1,m → R, 2. For m > d, any g : A S × A → R with S ⊆ −d, −1 and u > 0, where E S [g] = xS∈A S a∈A P(X S = x S , X 0 = a)g(x S , a) and X t+S = (X t+j ) j∈S . Before starting the proof of Proposition A.1, we need to introduce some additional notation.
For each x ∈ A S with S ⊆ −d, −1 , we write In what follows, we write xa V ∪{0} , with a ∈ A and V ⊆ −d, −1 , to denote the configuration ((xa) j ) j∈V ∪{0} , defined as When V = S ∪ {k} and x k = b ∈ A, we shall write xba S∪{k,0} instead of xa V ∪{0} . With this notation, the empirical version ν̂ k,S of ν k,S is defined as follows: where for x S ∈ A S , we define Hereafter, we omit the dependence on S and on m, whenever there is no risk of confusion. We now prove Proposition A.1. PROOF OF PROPOSITION A.1. Claim 1. Let S ⊆ −d, −1 with |S| ≤ ℓ < d/2 and take j ∈ S c . Then, |ν̂ j,S − ν j,S | ≤ 3 x∈A S a∈A b∈A |P̄(X S = x, X j = b, X 0 = a) − P(X S = x, X j = b, X 0 = a)|. Proof of Claim 1. By applying the triangle inequality twice, one can check that Now observe that for fixed x ∈ A S and a, b, c ∈ A, and similarly, By using these identities in (69) and then by applying the triangle inequality, one can deduce that By adding and subtracting the term P x (X j = c)P(X S = x, X j = b, X 0 = a) in the right-hand side of the above inequality and using again the triangle inequality, it follows that By adding and subtracting the term P(X S = x)P(X S = x, X j = c), we can then check that From (71) and (72), one deduces that the proof of Claim 1 follows from (73). Claim 2. For any u > 0, Proof of Claim 2. It follows from the union bound and Proposition A.2. We will now conclude the proof. Let S k = {S ⊆ −d, −1 : |S| = k} and observe that by the union bound P(G c m (ξ, ℓ)) ≤ ℓ k=0 S∈Sk j∈S c P(|ν̂ j,S − ν j,S | > ξ). Combining Claims 1 and 2, it follows that and the result follows. We now prove Theorem 3.3. PROOF OF THEOREM 3.3. First, observe that by Theorem 3.2, (74) P(Λ̂ 2,n ≠ Λ) ≤ P(G c m (ξ * , ℓ * )) + P(Λ ⊆ Ŝ m , Λ̂ 2,n ≠ Λ) Next, notice that the second term on the right hand side of (74) can be written as P(Λ ⊆ Ŝ m , Λ̂ 2,n ≠ Λ) = S⊆ −d,−1 :Λ⊆S,|S|≤ℓ * P(Ŝ m = S, Λ̂ 2,n ≠ Λ).
Now for any S ⊆ −d, −1 such that Λ ⊆ S, |S| ≤ ℓ * , it follows from the union bound that P(Ŝ m = S, Λ̂ 2,n ≠ Λ) ≤ j∈Λ P(Ŝ m = S, j ∉ Λ̂ 2,n ) + j∈S\Λ P(Ŝ m = S, j ∈ Λ̂ 2,n ). By proceeding similarly as in the proof of Item 1 of Theorem 3.1, one can deduce that for any j ∈ S \ Λ, we then deduce that S⊆ −d,−1 :Λ⊆S,|S|≤ℓ * j∈S\Λ Following the steps of the proof of Item 2 of Theorem 3.1, we can also show that where γ S m,n,j is defined as in (14) with t m,n,j = max b,c∈A:b≠c min (xS,yS)∈Cj (b,c) t m,n (x S , y S ) in the place of t n,j . Hence, it remains to estimate S⊆ −d,−1 :Λ⊆S,|S|≤ℓ * j∈Λ P(Ŝ m = S, j ∉ Λ̂ 2,n , δ j < γ S m,n,j ). By proceeding similarly to the proof of Item 3 of Theorem 3.1, one can show that for each S ⊆ −d, −1 such that |S| ≤ ℓ * , for all j ∈ Λ as long as n satisfies n > m + d + n min . By using this upper bound and by recalling that S⊆ −d,−1 :Λ⊆S,|S|≤ℓ * ≤ ℓ * k=0 d k ≤ (ℓ * + 1) d ℓ * (since 2ℓ * ≤ d), we deduce that S⊆ −d,−1 :Λ⊆S,|S|≤ℓ * j∈Λ P(Ŝ m = S, j ∉ Λ̂ 2,n , δ j < γ S m,n,j ) ≤ for all j ∈ Λ whenever n > m + d + n min , implying the result. PROOF OF COROLLARY 3.2. Assumptions 1 and 2 are satisfied for all values of n, since p min,n ≥ p ⋆ min and ∆ n ≥ ∆ ⋆ min for positive constants p ⋆ min and ∆ ⋆ min for all n. Since the sequence of MTD models also satisfies Assumption 3, the result follows from Theorem 3.3 and Remark 6-Item (b). A.3.1. Proof of Theorem 3.4. PROOF OF THEOREM 3.4. Notice that we can write for each j ∈ −d, −1 , m j (X j ) = (p j (1|1) − p j (1|0))X j + p j (1|0), so that equality (50) can be rewritten for any j ∈ −d, −1 satisfying (p j (1|1) − p j (1|0)) ≠ 0, S ⊆ −d, −1 \ {j} and x S ∈ {0, 1} S , as where ∆ ℓ = λ ℓ (p ℓ (1|1) − p ℓ (1|0)) for ℓ ∈ Λ.
Recalling that ν j,S = 2E(|Cov XS (X 0 , X j )|) in the binary case, we can deduce that for any S ⊆ −d, −1 , x S ∈ {0, 1} S and j ∈ −d, −1 \ S, As a consequence, it follows that for S ⊂ Λ and j ∈ Λ \ S, where in the second inequality we have used that |Cov xS (X ℓ , X j )| = Var xS (X j )|P xS (X ℓ = 1|X j = 1) − P xS (X ℓ = 1|X j = 0)|. By Assumption 4, we then deduce that (75) ν j,S ≥ 2Γ 1 |∆ j |E(Var XS (X j )). Now, take S ⊂ Λ and let j S ∈ arg min ℓ∈Λ\S |∆ ℓ |E(Var XS (X ℓ )). For any j ∈ (Λ) c , use the triangle inequality, inequality (75) and Assumption 5 to deduce that Using that δ j = |∆ j | and combining inequalities (75) and (76), it follows then that where we have used also that max j∈Λ\S ν j,S ≥ ν jS,S . Using that E(Var XS (X jS )) ≥ (p ⋆ ) 2 and |∆ jS | ≥ min j∈Λ |∆ j | we obtain that concluding the first half of the proof. To show the second assertion of the theorem, take S ⊂ Λ, let j * S ∈ arg max j∈Λ\S ν j,S and note that on G n (ℓ, ξ), max j∈Λ\S ν̂ n,j,S ≥ ν̂ n,j * S ,S ≥ ν j * S ,S − ξ = max j∈Λ\S ν j,S − ξ. Similarly, one can show that on G n (ℓ, ξ), max j∈(Λ) c ν̂ n,j,S ≤ max j∈(Λ) c ν j,S + ξ. As a consequence, it follows that max j∈Λ\S ν̂ n,j,S − max j∈(Λ) c ν̂ n,j,S ≥ max j∈Λ\S ν j,S − max j∈(Λ) c ν j,S − 2ξ, whenever G n (ℓ, ξ) holds. By taking ξ as in (37), we have that max j∈Λ\S ν̂ n,j,S − max j∈(Λ) c ν̂ n,j,S > 0, implying that arg max j∈S c ν̂ n,j,S ∈ Λ for all S ⊂ Λ, and the result follows. A.3.2. Proof of Corollary 3.3. PROOF OF COROLLARY 3.3. By Theorem 3.4, we have that P(Λ̂ 2,n ≠ Λ) ≤ P(G c m (ξ, L)) + P(Λ ⊆ Ŝ m , Λ̂ 2,n ≠ Λ), for any ξ < (Γ 1 − Γ 2 )p ⋆ min min j∈Λ δ j . By Proposition A.1, we have that By taking ξ = (Γ 1 − Γ 2 )p ⋆ min min j∈Λ δ j , one can check that if min j∈Λ δ j ≥ C 1 log(n) n , for some constant C 1 = C 1 (β, L, ∆ ⋆ min , p ⋆ min , Γ 1 , Γ 2 ), then P(G c m (ξ, L)) → 0 as n → ∞. By proceeding exactly as in the proof of Theorem 3.3 and using Remark 6-Item (b), we can show that as long as where C = C(µ, ε).
Therefore, using that ∆ ≥ ∆ ⋆ min , p min ≥ p ⋆ min , d = nβ, α = (1 + η) log(n), we also see that if δ 2 min ≥ C 2 log(n) n for some C 2 = C 2 (µ, ε, η, ∆ ⋆ min , p ⋆ min , β, L) then P(Λ ⊆ Ŝ m , Λ̂ 2,n ≠ Λ) → 0 as n → ∞. By taking C = C 1 ∨ C 2 , we deduce that P(Λ̂ 2,n ≠ Λ) → 0 as n → ∞ as long as δ 2 min ≥ C log(n) n , and the result follows. A.4. Proofs of Section 3.4. PROOF OF THEOREM 3.5. By the union bound, we have that (1 + ǫ)V̂ n (a, x Λ̂2,n ) N n (x Λ̂2,n ) Now, Proposition B.3 implies that for any a ∈ A and x Λ ∈ A Λ , P |p̂ n (a|x Λ ) − p(a|x Λ )| ≥ 2α(1 + ǫ)V̂ n (a, x Λ ) N n (x Λ ) + α 3N n (x Λ ) ≤ 4 log(µ(n − d)/α + 2) log(1 + ε) e −α P(N n (x Λ ) > 0), so that the result follows. A.5. Proof of Section 3.5. where KL(p (j) (·|x −d:−1 )||p (k) (·|x −d:−1 )) denotes the Kullback-Leibler divergence between p (j) (·|x −d:−1 ) and p (k) (·|x −d:−1 ). Now note that for each fixed x −d:−1 ∈ A −d,−1 , we can use the definition of the transition probabilities p (j) (·|·) together with Lemma 6 of Csiszár and Talata (2005) to deduce that KL(p (j) (·|x −d:−1 )||p (k) (·|x −d:−1 )) ≤ λ 2 |p(1|1) − p(1|0)| 2 (p min ) −1 1 {xj≠xk} . Since p min ≥ (1 − λ)/2 and δ = λ|p(1|1) − p(1|0)|, it follows from the above inequality that E (j) (KL(p (j) (·|X −d:−1 )||p (k) (·|X −d:−1 ))) ≤ 2δ 2 1 − λ . By using similar arguments, one can also show that KL(P Therefore, it follows that and the result follows. {⟨M⟩ k > 0 and ⟨M⟩ j = 0 for all j < k}, as in (Oliveira, 2015), we can deduce that which implies not only that for any λ > 0, Now, we use that for λ ∈ (0, 3) it holds that ψ(λ) ≤ λ 2 (1 − λ/3) −1 /2. Hence, from the above inequalities we deduce that for any λ ∈ (0, 3), P(M t ≥ λ/(2(1 − λ/3)) ⟨M⟩ t + α/λ) ≤ exp(−α)P(⟨M⟩ t > 0), and also that As a direct consequence of Proposition B.2, we derive the following result. COROLLARY B.1. Let X 1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let Λ̂ m be an estimator of Λ computed from X 1:m , where n > m.
For each x ∈ A −d,−1 , a ∈ A and S ⊆ −d, −1 , letp m,n (a|x S ) be the empirical transition probability defined in (7) computed from X m+1:n . Then for any S ⊆ −d, −1 such that Λ ⊆ S, ε > 0, α > 0 and n ≥ m + d + 1, we have (83) P Λ m = S, |p m,n (a|x S ) − p(a|x Λ )| ≥ 2α(1 + ε)p(a|x Λ )(1 − p(a|x Λ )) N m,n (x S ) + α 3N m,n (x S ) ≤ 2 log(n − m − d + 1) log(1 + ε) e −α P N m,n (x S ) > 0,Λ m = S . In particular, PROOF. Summing in both sides of (83) over S ⊆ −d, −1 such that Λ ⊆ S, we obtain inequality (84). Thus, it remains to prove (83). To that end, take ϕ(X 1:m , X (t−d):(t−1) ) = 1{Λ m = S}1{X t+j = x j , j ∈ S} and notice that in this case H ϕ • M a n = 1{Λ m = S}N m,n (x S )p(a|x Λ )(1 − p(a|x Λ )). So, if either p(a|x Λ ) = 0 or p(a|x Λ ) = 1, then we have necessarily H ϕ • M a n = 0 for all n ≥ m + d + 1, which implies that almost surely for all n ≥ m + d + 1, H ϕ • M a n = 1{Λ m = S}(N m,n (x S , a) −N m,n (x S )p(a|x Λ )) = 0. By noticing that H ϕ • M a n = 1{Λ m = S}N m,n (x S )(p m,n (a|x S ) − p(a|x Λ )), it follows that, on the event {Λ m = S,N m,n (x S ) > 0}, we must havep m,n (a|x S ) = p(a|x Λ ) almost surely and so the left-hand side of (83) is 0 and the result holds trivially. Let us now suppose 0 < p(a|x Λ ) < 1. In this case, we apply Proposition B.2 with w = p(a|x Λ )(1 − p(a|x Λ )), v = (n − m − d)w and b = 1 to deduce that, (85) P Λ m = S, N * n,m (x S )(p m,n (a|x S ) − p(a|x Λ )) ≥ 2α(1 + ε)p(a|x Λ )(1 − p(a|x Λ ))N m,n (x S ) + α 3 ,N m,n (x S ) > 0 ≤ log(n − m − d + 1) log(1 + ε) e −α P(1{Λ m = S}N m,n (x S )p(a|x Λ )(1 − p(a|x Λ )) > 0). To conclude the proof, observe that in this case REMARK 11. Let us briefly comment on the results of Corollary B.1. Suppose x ∈ A −d,−1 and a ∈ A are such that 0 < p(a|x Λ ) < 1 and also thatΛ m is a consistent estimator of Λ. 
By the CLT for aperiodic and irreducible Markov Chains it follows that N m,n (x Λ )(p m,n (a|x Λ ) − p(a|x Λ ))1{Λ m = Λ,N m,n (x Λ ) > 0} converges in distribution (as min{m, n} → ∞) to a centered Gaussian random variable with variance p(a|x Λ )(1 − p(a|x Λ )). This implies that for sufficiently large n, 1 N m,n (x S ) . Then for any S ⊆ −d, −1 such that Λ ⊆ S and n ≥ m + d + 1, we have In particular, ≤ 4 log(µ(n − m − d)/α + 2) log(1 + ε) e −α P Λ ⊆Λ n ,N m,n (xΛ m ) > 0 . PROOF. Summing in both sides of (87) over S ⊆ −d, −1 such that Λ ⊆ S, we obtain inequality (88). Hence, it remains to show (87). Arguing as in Corollary B.1 we need to consider only the case 0 < p(a|x) < 1. By applying Theorem B.2 with H = H ±ϕ where ϕ is as in the proof of Corollary B.1, v = µ µ−ψ(µ) (n − m − d) + α µ−ψ(µ) and w = α µ−ψ(µ) , we obtain that P Λ m = S,N m,n (x S )|p m,n (a|x S ) − p(a|x Λ )| ≥ 2(1 + ǫ)αṼ m,n (a, x, S) + α 3 ≤ 4 log(µ(n − m − d)/α + 2) log(1 + ε) e −α P(1{Λ m = S}N m,n (x S )p(a|x Λ )(1−p(a|x Λ )) > 0), whereṼ m,n (a, x, S) =N m,n (x S )V m,n (a, x, S). By using that when 0 < p(a|x) < 1, {1{Λ m = S}N m,n (x S )p(a|x Λ )(1 − p(a|x Λ )) > 0} = {Λ m = S,N m,n (x S ) > 0}, and the fact thatṼ m,n (a, x, S) =N m,n (x S )V m,n (a, x, S), we deduce (87) from the above inequality. 
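The variance-adaptive deviation bound behind Theorem 3.5 and Proposition B.3 can be sketched numerically: the radius is √(2α(1 + ε)V̂/N̄) + α/(3N̄), with the variance proxy V̂ = µ/(µ − ψ(µ)) p̂ + α/((µ − ψ(µ)) N̄) read off from the statements above. The helper names and default parameter values below are ours, and the exact constants should be checked against the displayed bounds.

```python
import math

def psi(mu):
    """psi(mu) = exp(mu) - mu - 1, as in Propositions B.1 and B.2."""
    return math.exp(mu) - mu - 1

def conf_radius(p_hat, n_counts, alpha, eps=0.1, mu=1.0):
    """Sketch of the empirical-Bernstein-style radius for |p_hat - p|:
    sqrt(2*alpha*(1+eps)*V_hat / N) + alpha/(3*N), where
    V_hat = mu/(mu - psi(mu)) * p_hat + alpha/((mu - psi(mu)) * N).
    Requires mu in (0, 3) with mu > psi(mu), as in the statements."""
    assert 0 < mu < 3 and mu > psi(mu)
    v_hat = mu / (mu - psi(mu)) * p_hat + alpha / ((mu - psi(mu)) * n_counts)
    return (math.sqrt(2 * alpha * (1 + eps) * v_hat / n_counts)
            + alpha / (3 * n_counts))
```

With the choice α = (1 + η) log n used throughout, the radius shrinks at the familiar √(log n / n) rate as the count N̄ grows.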
References. The Mixture Transition Distribution Model for High-Order Markov Chains and Non-Gaussian Time Series; Efficiently Learning Ising Models on Arbitrary Graphs; Variable length Markov chains; Neuronal oscillations in cortical networks; Optimal Gaussian concentration bounds for stochastic chains of unbounded memory; Long memory processes (1/f α type) in human coordination; Series in Telecommunications and Signal Processing; Context tree selection and linguistic rhythm retrieval from written texts; Minimal Markov models; 1/f noise in human cognition; Lasso and probabilistic inequalities for multivariate point processes; Estimation and selection for high-order Markov chains with Bayesian mixture transition distribution models; Sparse Markov chains for sequence data; Correlation properties of daily temperature anomalies over land; Bayesian Context Trees: Modelling and exact inference for discrete time series; Stochastic Processes With Random Contexts: A Characterization and Adaptive Estimators for the Transition Probabilities; A Model for High-Order Markov Chains; Concentration of Measure Inequalities in Information Theory, Communications, and Coding: Second Edition; A universal data compression system; Long-term memory in climate variability: A new look based on fractional integral techniques. In the sequel, N denotes the set of non-negative integers {0, 1, . . .}. Let (Ω, F, P) be a probability space. We assume that this probability space is rich enough so that the following stochastic processes may be defined on it. In what follows, let (X t ) t∈Z be a Markov chain of order d ∈ Z + , taking values on a finite alphabet A, with family of transition probabilities {p(·|x −d:−1 ) : x −d:−1 ∈ supp(P)}. Denote F t = σ(X −d:t ) for t ∈ N. For each a ∈ A, consider the stochastic process M a = (M a t ) t∈N defined as, M a t = 1{X t = a} − p(a|X (t−d):(t−1) ), t ∈ N.
Let H = (H t ) t∈N be a stochastic process taking values on a finite alphabet B ⊂ R, satisfying H 0 = 0 and H t ∈ F t−1 for all t ∈ Z + , and consider H • M a = (H • M a t ) t∈N defined as,Furthermore, for any λ > 0 and b > 0 such that B ∞ ≤ b, the stochastic processis a supermartingale w.r.t. F starting from 1.PROOF. For each t ∈ Z + , we have that H t ∈ F t−1 and also that E [1{X t = a}|F t−1 ] = p(a|X (t−d):(t−1) ). These two facts imply that for any t ∈ Z + ,The predictable quadratic variation of H • M a is defined as(t−1) )). Using again that H t ∈ F t−1 and also that E [1{X t = a}|F t−1 ] = p(a|X (t−d):(t−1) ), one then deduces that for any t ∈ Z + , (Raginsky and Sason, 2014) .We will use Lemma B.1 to prove the following concentration inequality.REMARK 10. This is basically Lemma 5 of (Oliveira, 2015) (see the Economical Freedman's inequality provided in Inequality (41)) applied to the square integrable martingale H • M a . The only difference is the factor 2 in front of the linear term αb 3 which is not present here. Notice that for t ∈ Z + , the concentration inequality above can be rewritten in the following form: PROOF. For t = 0 the result holds trivially. Now, suppose t ∈ Z + . By considering the setit suffices to prove the case b = 1. To shorten the notation, we denote M t = H • M a t in the sequel. By the Markov property, we have that for any λ > 0,where ψ(λ) = e λ − λ − 1 and we have used that if M t = 0 almost surely, then M t = M 0 = 0 almost surely. By using the fact that (exp (λM t − ψ(λ) M t )) t∈N is a supermartingale (Lemma B.1 with b = 1) together with the decompositionBy using a peeling argument as in (Hansen, Reynaud-Bouret and Rivoirard, 2015) , we deduce from the above result the following.PROOF. It suffices to prove the case b = 1. The general case follows from this one by first replacing B by B/b = {c/b : c ∈ B} and then rearranging the terms properly. Let us denote. Notice that v K ≥ v, by the definition of K. 
To shorten the notation, we denote H • M a t = M t in what follows. Starting from (81), one can deduce that for any 0 ≤ k < K and λ ∈ (0, 3), we have which, in turn, implies that Minimizing w.r.t. λ ∈ (0, 3) as in Proposition B.1, it then follows that Summing over k, the result follows (recall that v ≤ v K by the choice of K). Hereafter, let m ∈ N and consider a function ϕ : Here we use the convention that ϕ is a function defined only on A −d,−1 when m = 0. Given such a function ϕ, let us denote H ϕ = (H ϕ t ) t≥0 the stochastic process defined as H ϕ 0 = . . . = H ϕ m+d = 0 and H ϕ t = ϕ(X 1:m , X (t−d):(t−1) ) for t ≥ d + m + 1. Clearly, H ϕ t ∈ F t−1 for all t ∈ Z + . From (79), one can check that the predictable quadratic variation ⟨H ϕ • M a ⟩ of the martingale H ϕ • M a is given by ⟨H ϕ • M a ⟩ 0 = . . . = ⟨H ϕ • M a ⟩ m+d = 0 and for t ≥ m + d + 1, Let us compare this heuristic argument with Corollary B.1 applied to S = Λ. In this case, in Inequality (83), the variance term 2α(1+ε)p(a|x Λ )(1−p(a|x Λ )) can be made arbitrarily close to the optimal value 2αp(a|x Λ )(1−p(a|x Λ )), at the cost of the factor 1/log(1+ε). Both the linear term α/(3N̄ m,n (x Λ )) and the log(n − m − d + 1) factor are the price to pay to achieve the result which holds for every n ≥ m + d + 1 and reflect the fact that p̂ m,n (a|x Λ ) − p(a|x Λ ) is Gaussian only asymptotically. In particular, Corollary B.1 improves the Economical Freedman's Inequality (as stated in Oliveira (2015), Lemma 5, Inequality (42)) when restricted to the martingale H ϕ • M a . In the sequel, let us denote P a = (P a t ) t∈N , for each a ∈ A, the stochastic process defined as P a t = p(a|X (t−d):(t−1) ) for each t ∈ N. With this notation, notice that is such that ⟨H • M a ⟩ t ≤ H 2 • P a t for all t ∈ N. In particular, Proposition B.1 holds if we replace ⟨H • M a ⟩ t by H 2 • P a t . A closer inspection of the proof of Proposition B.2 reveals that this proposition also holds with H 2 • P a t in the place of ⟨H • M a ⟩ t .
In the next theorem, we show that we can replace H 2 • P a t by a linear transformation of its empirical version, which is crucial for our analysis. THEOREM B.2. Let H • M a = (H • M a t ) t∈N be the stochastic process defined in (78). Suppose that ∥B∥ ∞ ≤ b for some b > 0. For any fixed µ ∈ (0, 3) satisfying µ > ψ(µ) = exp(µ) − µ − 1 and α > 0, define for t ∈ N, where P̂ a s = 1{X s = a} for all s ∈ N. Then, for any fixed ǫ > 0 and v > w > 0, we have for any t ∈ N, PROOF. We prove only the case b = 1. Note that −H 2 • M a t = H 2 • P a t − H 2 • P̂ a t . Also recall that ⟨H • M a ⟩ t ≤ H 2 • P a t . We now proceed to the proof. We first use Inequality (80) for the martingale −H 2 • M a together with the fact that ⟨H 2 • M a ⟩ t ≤ H 4 • P a t ≤ H 2 • P a t (the last inequality holds because b = 1) to deduce that for any µ > 0, which implies that for any µ ∈ (0, 3) satisfying µ > ψ(µ), it holds Hence, combining this inequality with (81), we conclude that for any λ ∈ (0, 3) and any µ ∈ (0, 3) satisfying µ > ψ(µ), where we also used in the last inequality the fact that ⟨H • M a ⟩ t ≤ H 2 • P a t . To conclude the proof, we need to follow the same steps as Proposition B.1 and then use the peeling argument as in the proof of Proposition B.2 with H 2 • P̂ a t in the place of ⟨H • M a ⟩ t . As a consequence of Theorem B.2, we obtain the following result. PROPOSITION B.3. Let X 1:n be a sample from a MTD model of order d with set of relevant lags Λ. Let Λ̂ m be an estimator of Λ computed from X 1:m where n > m. For any x ∈ A −d,−1 , a ∈ A and S ⊆ −d, −1 , let p̂ m,n (a|x S ) be the empirical transition probability defined in (7) computed from X m+1:n , and consider for α > 0 and µ ∈ (0, 3) satisfying µ > ψ(µ) = exp(µ) − µ − 1, V m,n (a, x, S) = µ/(µ − ψ(µ)) p̂ m,n (a|x S ) + α/(µ − ψ(µ))