title: Bayes posterior convergence for loss functions via almost additive Thermodynamic Formalism authors: Lopes, Artur O.; Lopes, Silvia R. C.; Varandas, Paulo date: 2020-12-10 Statistical inference can be seen as information processing involving input information and output information that updates belief about some unknown parameters. We consider the Bayesian framework for making inferences about dynamical systems from ergodic observations, where the Bayesian procedure is based on the Gibbs posterior inference, a decision process generalization of standard Bayesian inference in which the likelihood is replaced by the exponential of a loss function. In the case of direct observation and almost-additive loss functions, we prove an exponential convergence of the a posteriori measures to a limit measure. Our estimates on the Bayes posterior convergence for direct observation are related to, and extend, those in a recent paper by K. McGoff, S. Mukherjee and A. Nobel. Our approach makes use of non-additive thermodynamic formalism and large deviation properties instead of joinings. 1.1. Bayesian inference. Statistical inference aims to update beliefs about uncertain parameters as more information becomes available. Bayesian inference, one of the most successful methods used in decision theory, builds on Bayes' theorem

Prob(H | E) = Prob(E | H) Prob(H) / Prob(E),    (1)

which expresses the conditional probability of the hypothesis H given the event E in terms of the probability that the event or evidence E occurs given the hypothesis H. In the previous expression, the posterior probability Prob(H | E) is inferred as an outcome of the prior probability Prob(H) on the hypothesis, the model evidence Prob(E) and the likelihood Prob(E | H).
Bayes' theorem has been widely used as an inductive learning model to transform prior and sample information into posterior information and, consequently, in decision theory. One should not confuse Prob(E | H) with Prob(H | E). Let us provide a simple example. Suppose one is tested for covid-19, and the test turns out to be positive. If the test is 99% accurate, the latter means that Prob(Positive test | Covid-19) = 0.99. However, the most relevant information is Prob(Covid-19 | Positive test), namely the probability of having covid-19 once one is tested positive, which is related to the former conditional probability by (1). If the proportion Prob(Covid-19) of infected persons in the total population is 0.001, it is possible to compute the normalizing term Prob(Positive test) and to conclude that Prob(Covid-19 | Positive test) = 0.5, which provides different and rather relevant information (see e.g. [13] for all computations in a similar example). The conclusion is that both the prior and the data contain important information, and so neither should be neglected. The process of drawing conclusions from available information is called inference. However, in many physical phenomena the available information is often insufficient to reach certainty through logical reasoning. In these cases, one may use different approaches for doing inductive inference, the most common methods being those involving probability theory and entropic inference (cf. [13]). The frequentist interpretation advocates that the probability of a random event is given by the relative number of occurrences of the event in a sufficiently large number of identical and independent trials. An alternative approach is given by the Bayesian interpretation, which became more popular in recent decades and sustains that a probability reflects the degree of belief of an agent in the truth of a proposition.
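The arithmetic behind this example is elementary and can be checked directly. The following sketch assumes, as the text does, a sensitivity of 0.99; the false-positive rate of the test is not specified above, so the value 0.001 used below is a hypothetical choice which reproduces a posterior close to 0.5.

```python
def posterior_prob(prior, sensitivity, false_positive_rate):
    """Bayes' theorem (1): Prob(H | E) = Prob(E | H) Prob(H) / Prob(E)."""
    # Normalizing term Prob(Positive test), by the law of total probability.
    p_evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_evidence

# Prior Prob(Covid-19) = 0.001 and sensitivity 0.99 as in the text;
# the false-positive rate 0.001 is an assumption made for illustration.
p = posterior_prob(prior=0.001, sensitivity=0.99, false_positive_rate=0.001)
# p is approximately 0.5: a positive result from a very accurate test
# still leaves substantial uncertainty when the prior is small.
```

The point of the computation is that the prior enters multiplicatively, so a small base rate can offset even a very accurate likelihood.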
Citing [13], "the crucial aspect of Bayesian probability measures is that different agents may have different degrees of belief in the truth of the very same proposition, a fact that is described by referring to Bayesian probability measures as being subjective". In the framework of parametric Bayesian statistics, one is interested in updating beliefs, or the degree of confidence, on the space of parameters Θ, which plays the role of the variable H in the expression (1) above. In rough terms, the formula (1) expresses that the belief on a certain set of parameters is updated from the original belief, after an event E, by how likely such an event is for all parameterized models. This supports the idea that while frequentists say the data are random and the parameters are fixed, Bayesians say the data are fixed and the parameters are random. The basic idea in classical Bayesian inference is the updating of a prior belief distribution to a posterior belief distribution when the parameter of interest is connected to observations via the likelihood function. In [7], Bissiri et al. propose a general framework for Bayesian inference, arguing that a valid update of a prior belief distribution to the posterior one can be made for parameters which are connected to observations through a loss function which accumulates information as time passes, rather than through the likelihood function. In their framework, the classical inference process corresponds to the special case where the loss function is expressed as the negative log-likelihood function. In this more general framework, the choice of loss function determines the way the analyzed data contribute to the mechanism of updating the belief distribution on the space of parameters, and such a choice is often subjective and depends on the kind of feature one desires to highlight from the data.
Moreover, the purpose is that the successive updated belief distributions, called posterior distributions, either converge or concentrate around the unknown targeted parameters. We refer the reader to [1, 21, 22, 49, 51, 54] for more information on the classical Bayesian inference formalism. Bayesian inference in the context of observations arising from dynamical systems faces some natural challenges. The first one is that the process of taking time series (via Birkhoff's theorem) lacks independence: if T : (Y, ν) → (Y, ν) is a measure preserving map and φ : Y → R is an observable, then the sequence of random variables (φ ∘ T^n)_{n≥1} is identically distributed but the random variables are not even pairwise independent. The second one concerns the choice of the loss function used to update beliefs on the space of parameters. From the Physics and Dynamical Systems viewpoints it is natural that loss functions should value some of the geometric or chaotic properties of the dynamical system, identified either in terms of Lyapunov exponents, joint spectral radius of matrix cocycles, entropy, or estimates on the Carathéodory, box-counting or Hausdorff dimension of repellers and attractors, with applications in wavelets and multifractal analysis, just to mention a few. These concepts, central in mathematical physics (see e.g. [4, 5, 6, 10, 8, 20, 29, 27, 32, 33] and references therein), appear naturally as limits of Birkhoff averages of potentials, or of sub-additive or almost-additive potentials (or several other versions of non-additivity, to be defined in Subsection 3.2). As a first example, if T is a C^1-smooth volume preserving and ergodic diffeomorphism on a surface then its largest Lyapunov exponent is given by the random product of SL(2, R)-matrices as

λ+(T, Leb) = lim_{n→∞} (1/n) log ‖A(T^{n−1}(y)) ⋯ A(T(y)) A(y)‖

for Lebesgue almost every y ∈ Y, where A = DT : Y → TY is the derivative cocycle.
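This limit can be approximated numerically. The sketch below is a toy illustration rather than the derivative cocycle of an actual diffeomorphism: it estimates the growth rate (1/n) log ‖A_n ⋯ A_1‖ for random products of two SL(2, R) matrices, renormalizing at each step to avoid overflow. Since matrix norms are submultiplicative, the sequence of log-norms is sub-additive.

```python
import math
import random

def mat_mul(A, B):
    """Product of 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def frob(A):
    """Frobenius norm; submultiplicative, so log-norms are sub-additive."""
    return math.sqrt(sum(A[i][j] ** 2 for i in range(2) for j in range(2)))

# Two SL(2, R) generators, chosen i.i.d. with equal probability.
GENS = ([[1.0, 1.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 1.0]])

def lyapunov_estimate(n, seed=0):
    """Finite-n approximation of the largest Lyapunov exponent."""
    rng = random.Random(seed)
    P = [[1.0, 0.0], [0.0, 1.0]]
    log_norm = 0.0
    for _ in range(n):
        P = mat_mul(rng.choice(GENS), P)
        s = frob(P)
        log_norm += math.log(s)      # exact telescoping of log ||A_n ... A_1||
        P = [[P[i][j] / s for j in range(2)] for i in range(2)]  # renormalize
    return log_norm / n
```

By Furstenberg's theorem the exponent of such a random product is strictly positive, and `lyapunov_estimate(n)` stabilizes around a positive value as n grows.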
In general, the sequence of observables Φ = (ϕ_n)_{n≥1} defined by ϕ_n(y) = log ‖A(T^{n−1}(y)) ⋯ A(T(y)) A(y)‖ is sub-additive and, in the special case that the linear cocycle has an invariant cone-field, this sequence is actually almost-additive (cf. [28]). A second example comes from the Shannon-McMillan-Breiman theorem: for an ergodic probability measure µ on the shift space,

h(µ) = lim_{n→∞} −(1/n) log µ(C_n(x))    (2)

for µ-almost every x, where C_n(x) ⊂ Ω denotes the n-cylinder set containing the sequence x = (x_1, x_2, x_3, . . .). The sequence of observables Φ = (ϕ_n)_{n≥1} defined by ϕ_n(x) = − log µ(C_n(x)), which is non-additive in general, is additive and almost-additive in the relevant classes of Bernoulli and Gibbs measures, respectively (see Lemma 3.3). Finally, it is worth mentioning that sub-additive and almost-additive sequences appear naturally in applications to several other areas of knowledge, for instance in the study of factorial languages by Thue, Morse and Hedlund in the beginning of the twentieth century (see [52] and references therein). In this article, inspired by the relevant physical quantities arising from non-additive sequences of potentials, we will establish a bridge between the non-additive thermodynamic formalism of dynamical systems and Gibbs posterior inference in statistics (to be defined in Subsection 1.2 below), two areas of research connected with statistical physics. We refer the interested reader to the introduction of [47] for a careful and wonderful exposition of the link between Bayesian inference and thermodynamic formalism, and a list of cornerstone contributions. We will mostly be interested in the parametric formulation of Bayesian inference, as described below. Let σ : Ω → Ω be a subshift of finite type. This will serve as the underlying dynamical system, with respect to which samplings are obtained along its finite orbits {y, σ(y), . . . , σ^{n−1}(y)}, y ∈ Ω.
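For a Bernoulli measure, −log µ(C_n(x)) is a Birkhoff sum, so the Shannon-McMillan-Breiman limit reduces to the law of large numbers. The following sketch checks numerically that −(1/n) log µ(C_n(x)) approaches the entropy h(µ) for a µ-typical sample (the weight p = 0.3 is an illustrative choice).

```python
import math
import random

def smb_rate(p, n, seed=1):
    """Return -(1/n) log mu(C_n(x)) for a sample x from Bernoulli(p, 1-p)."""
    rng = random.Random(seed)
    x = [0 if rng.random() < p else 1 for _ in range(n)]
    # For a Bernoulli measure, mu(C_n(x)) is a product over the symbols of x,
    # so -log mu(C_n(x)) is an additive (Birkhoff-sum) sequence.
    log_mu = sum(math.log(p) if s == 0 else math.log(1.0 - p) for s in x)
    return -log_mu / n

def bernoulli_entropy(p):
    """h(mu) = -(p log p + (1 - p) log(1 - p))."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
```

For a Gibbs (non-Bernoulli) measure the same limit holds, but the sequence is only almost-additive rather than additive.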
We take a family of Gibbs probability measures {µ_θ}_{θ∈Θ} as the models in the inference procedure because of their relevance and ubiquity in the thermodynamic formalism of dynamical systems; they are also of crucial importance in several other fields, such as the study of the randomness of time-series, decision theory, quantum information and information gain, just to mention a few (cf. [2, 13, 30, 36, 46]). In our context, Gibbs measures appear as fixed points of the dual of certain transfer operators. Let us be more precise. For any Lipschitz continuous potential A : Ω → R, the Ruelle-Perron-Frobenius transfer operator associated to A is defined by

(L_A f)(x) = Σ_{σ(y)=x} e^{A(y)} f(y).

The potential A is called normalized if L_A(1) = 1, and in this case it is natural to write A = log J, and we call J the Lipschitz continuous Jacobian. A Gibbs measure µ is any σ-invariant probability measure obtained as a fixed point of the dual operator L*_{log J} acting on the space of probability measures on Ω, for some Lipschitz continuous and normalized Jacobian J. In this way, it is natural to parametrize Gibbs probabilities by the space of normalized Lipschitz continuous Jacobians J, hence this space can be viewed as an infinite-dimensional analytic Riemannian manifold [35, 45, 46]. Invariant Gibbs measures are equilibrium states, namely they satisfy a variational relation (cf. Subsection 1.3 for more details). Given a prior probability measure Π_0 on the space Θ of parameters and the sampling according to a Gibbs measure µ_{θ_0}, the posterior probability (i.e. updated belief distribution) is determined using the loss functions ℓ_n : Θ × Ω × Ω → R, where ℓ_n(θ, x, y) encodes the information on the parameter θ accumulated along the sampling {y, σ(y), . . . , σ^{n−1}(y)} and influenced by the measurements along the orbit {x, σ(x), . . . , σ^{n−1}(x)}. The Shannon-McMillan-Breiman formula (2) suggests the use of loss functions to collect the information of the measure on cylinder sets in Ω (cf.
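A minimal finite-dimensional sketch of this operator: on the full 2-shift, for a potential A depending only on the first two symbols, L_A preserves the functions of the first symbol and is represented there by the 2×2 matrix Q with Q[i][j] = e^A on the cylinder [i, j]. Normalization L_A(1) = 1 amounts to unit column sums, and the marginal of the Gibbs measure is the fixed point of the dual, i.e. the Perron eigenvector of Q. The numerical entries below are illustrative.

```python
# Q[i][j] = exp(A on the cylinder [i, j]); (L_A f)(j) = sum_i Q[i][j] f(i)
# on functions of the first symbol. Entries are illustrative.
Q = [[0.7, 0.4],
     [0.3, 0.6]]   # unit column sums  <=>  L_A(1) = 1, i.e. A is normalized

def transfer(f):
    """Finite-dimensional restriction of the Ruelle-Perron-Frobenius operator."""
    return [sum(Q[i][j] * f[i] for i in range(2)) for j in range(2)]

def dual_fixed_point(iterations=200):
    """Fixed point of the dual operator on (marginals of) probability
    measures, computed by power iteration: pi -> Q pi, renormalized."""
    pi = [0.5, 0.5]
    for _ in range(iterations):
        pi = [sum(Q[i][j] * pi[j] for j in range(2)) for i in range(2)]
        s = sum(pi)
        pi = [v / s for v in pi]
    return pi
```

Here `transfer([1.0, 1.0])` returns [1.0, 1.0] up to rounding, confirming normalization, and `dual_fixed_point()` converges to the stationary marginal (4/7, 3/7) for these particular entries.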
expressions (4), (5) and (9) below). The relative entropy, also called Kullback-Leibler divergence and defined by (27), compares the measurements of cylinders according to two different Gibbs measures. This notion is of paramount importance in Physics and will be used to interconnect log-likelihood inference with the direct observation analysis of Gibbs probability measures. Our main results guarantee posterior consistency for certain classes of loss functions determined by almost-additive sequences of potentials: the posterior distributions asymptotically concentrate around the unknown targeted parameter θ_0, often with exponential speed (we refer the reader to Theorems A, B and C for the precise statements). The main ingredient to obtain quantitative estimates on the convergence to the parameter θ_0 is the use of large deviations for non-additive sequences of potentials [57]. Our results are strongly inspired by, and should be compared with, those of McGoff, Mukherjee and Nobel [47], where the authors established posterior consistency of (hidden) Gibbs processes on mixing subshifts of finite type using properties of Gibbs measures. For that purpose, they consider a more general framework, where the dynamical system T : Y → Y on a Polish space does not necessarily coincide with the subshift of finite type σ : Ω → Ω. In particular, the sampling is determined by a T-invariant and ergodic probability measure ν, which could be unrelated to the Gibbs measures {µ_θ}_{θ∈Θ} for the shift. If the loss functions are additive, i.e.
ℓ_n = Σ_{j=0}^{n−1} ℓ(θ, σ^j(x), T^j(y)) for some function ℓ : Θ × Ω × Y → R satisfying a mild regularity condition, then the main results in [47] ensure that it is possible to formulate the problem as a limiting variational problem and to identify the parameters, obtained as minimizing parameters of a lower semicontinuous function V : Θ → R, for which posterior consistency holds: if Θ_min = argmin_{θ∈Θ} V(θ) then the posterior distributions Π_n(· | y), defined by (6), satisfy lim_{n→∞} Π_n(U | y) = 1 for each open neighborhood U of Θ_min and for ν-almost every y ∈ Y (cf. [47, Theorem 2]). The proof of this result requires the use of joinings (or couplings) of the model system and the observed system, and results on fibered entropy. Our framework corresponds to the special case of direct observation, in which the dynamical system T coincides with the subshift of finite type σ and the target parameter is a single θ_0 ∈ Θ, with the subtler difference that our assumptions ensure that µ_θ ≠ µ_θ′ for every distinct θ, θ′ ∈ Θ. Our results complement the ones in [47] in the sense that the information can be collected by more general loss functions ℓ_n. Furthermore, the more direct use of large deviation techniques allows us to prove an exponential speed of convergence in the posterior consistency (cf. Theorem A), which was not known even in the context of direct observation (cf. [47, Theorem 2 and Remark 8]). Summarizing, the three main novelties are the extension to non-additive loss functions, the exponential rate of convergence, and a proof which is not based on joinings and fiber entropy. It is also worth noticing that, more recently, Su and Mukherjee [55] also used a large deviations approach for posterior consistency, using Varadhan's large deviation principle for stochastic processes. A different point of view of the Bayesian a priori and a posteriori formalism will appear in [26], where results on thermodynamic formalism for plans are used (see [42, 43]).
In [36] the author considered log-likelihood estimators in classical thermodynamic formalism, where the inference concerns Hölder potentials rather than probabilities. To finalize, one should mention that there is an increasing interest in exploring the strong connection between Statistical Inference and Physics in general. There are several such connections, including a Bayesian approach to the dynamics of the classical ideal gas [58, Section 31.3] and prior sensitivity in the Bayesian model selection context for some galaxy data sets [11]. In the monograph [13], the author clarifies the conceptual foundations of Physics by deriving the fundamental laws of statistical mechanics and of quantum mechanics as examples of inductive inference, and also advocates that, since models may need to change as time evolves, all areas of Physics may possibly be modeled using inductive inference. 1.2. Gibbs posterior inference. According to the Gibbs posterior paradigm [7, 37], beliefs should be updated according to the Gibbs posterior distribution. Let us recall the formulation of this posterior measure following [47]. Observed system. Assume that Y is a complete and separable metric space and that T : Y → Y is a Borel measurable map endowed with a T-invariant, ergodic probability measure ν. This dynamical system represents the observed system and will be used to update information for the model. This is the analogue of the data in the context of Statistics. The updated belief, given by the a posteriori measure, is obtained by feeding data obtained from the observed system into a model by means of a loss function. Model families. Consider a transitive subshift of finite type σ : Ω → Ω, where σ denotes the right-shift map, acting on a compact invariant set Ω ⊂ {1, 2, ..., q}^N determined by a transition matrix M_Ω ∈ M_{q×q}({0, 1}). The map σ presents different statistical behaviors (e.g.
measured in terms of different convergences of Cesàro averages of continuous observables) according to each of its equilibrium states associated to Lipschitz continuous observables, each of which satisfies a Gibbs property (see e.g. Remark 1 in [48, Section 2] or [41]). Consider a compact metric space Θ and a family of σ-invariant probability measures G = {µ_θ : θ ∈ Θ} so that: (i) for every θ ∈ Θ the probability measure µ_θ is a Gibbs measure associated to a Lipschitz continuous potential f_θ : Ω → R, that is, there exist K_θ > 1 and P_θ ∈ R so that

K_θ^{−1} ≤ µ_θ(C_n(x)) / exp(S_n f_θ(x) − n P_θ) ≤ K_θ   for every n ≥ 1 and x ∈ Ω,    (3)

where S_n f_θ = Σ_{j=0}^{n−1} f_θ ∘ σ^j and C_n(x) ⊂ Ω denotes the n-cylinder set in the shift space Ω containing the sequence x = (x_1, x_2, x_3, . . .); and (ii) the family Θ ∋ θ → f_θ is continuous (in the Lipschitz norm). We assume Gibbs measures to be normalized, hence probability measures. It is well known that the previous conditions ensure the continuity of the pressure function Θ ∋ θ → P_θ and of the map Θ ∋ θ → µ_θ (in the weak* topology) [48]. In particular, one can take a uniform constant K > 0 in (3). The problem to be considered here involves the formulation and analysis of an iterative procedure (based on sampling and updated information) on the family G of models. Loss functions and Gibbs posterior distributions. Consider the product space Θ × Ω endowed with the metric d defined as d((θ, x), (θ′, x′)) = max{d_Θ(θ, θ′), d_Ω(x, x′)}. A fully supported probability measure Π_0 on Θ describes the a priori uncertainty on the Gibbs measure. Given such an a priori probability measure Π_0 on the space of parameters Θ and a sample of size n (determined by the observed system T), we will get the a posteriori probability measure Π_n on the space of parameters Θ, taking into account the updated information from the data. More precisely, given Π_0 and a family (µ_θ)_{θ∈Θ}, consider the probability measure P_0 on the product space Θ × Ω given by

P_0(E) = ∫_Θ ∫_Ω 1_E(θ, x) dµ_θ(x) dΠ_0(θ)

for all Borel sets E ⊂ Θ × Ω.
In other words, P_0 has the a priori measure Π_0 as marginal on Θ and admits a disintegration on the partition by vertical fibers where the fibered measures are exactly the Gibbs measures (µ_θ)_{θ∈Θ}. There is no action of the dynamics T on this product space. Indeed, the a posteriori measures are defined using loss functions. For each n ∈ N consider a continuous loss function

ℓ_n : Θ × Ω × Y → R,    (4)

consider the probability measure P_n on Θ × Ω given by

P_n(E | y) = (1/Z_n(y)) ∫_Θ ∫_Ω 1_E(θ, x) e^{−ℓ_n(θ, x, y)} dµ_θ(x) dΠ_0(θ)

for all Borel sets E ⊂ Θ × Ω, and set

Z_n(y) = ∫_Θ ∫_Ω e^{−ℓ_n(θ, x, y)} dµ_θ(x) dΠ_0(θ),    (5)

where x = (x_1, x_2, ..., x_n, . . .) ∈ Ω and y ∈ Y. In the special case that Y = Ω, that −ℓ_n : Θ × Ω × Ω → R coincides with an n-Birkhoff sum of a fixed observable ψ with respect to T, and Π_0 is a Dirac measure, the expression (5) resembles the partition function in statistical mechanics, whose exponential asymptotic growth coincides with the topological pressure of T with respect to ψ. Given y ∈ Y and n ≥ 1, the a posteriori Borel probability measure Π_n(· | y) on the parameter space Θ (at time n and determined by the sample of y) is defined by

Π_n(B | y) = (1/Z_n(y)) ∫_B ∫_Ω e^{−ℓ_n(θ, x, y)} dµ_θ(x) dΠ_0(θ)    (6)

for every measurable B ⊂ Θ, and appears as the marginal on Θ of the probability measure P_n(· | y) given above. The general question is to describe the set of probability measures Π_n(· | y) on the parameter space Θ, namely whether their marginals converge, and to formulate the locus of convergence in terms of some variational principle or as points of maximization of a certain function (see e.g. [47, Theorem 2] for a context where the loss functions are chosen so that such measures are supported on the minimization locus of a certain rate function). The main problem we are interested in is to understand whether a sampling process according to a fixed probability measure can help to identify it from a recursive process involving Bayesian inference. Assume that Y = Ω, that T = σ is the shift and that one is interested in a specific probability measure µ_{θ_0} ∈ G, where θ_0 ∈ Θ. If ν = µ_{θ_0} then the sampling {y, T(y), T^2(y), . . .
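When Θ is a finite set, the update (6) is a one-line computation: the posterior reweights the prior by the integrated factor ∫ e^{−ℓ_n} dµ_θ. A sketch with hypothetical loss values:

```python
import math

def gibbs_posterior(prior, integrated_loss):
    """Posterior on a finite parameter set, cf. (6):
    Pi_n(theta) is proportional to exp(-L(theta)) * Pi_0(theta),
    where L(theta) stands for a loss already integrated against mu_theta."""
    weights = {t: math.exp(-integrated_loss[t]) * prior[t] for t in prior}
    z = sum(weights.values())   # normalizing term, cf. Z_n in (5)
    return {t: w / z for t, w in weights.items()}

# Hypothetical example: uniform prior on three parameters; 'b' has the
# smallest accumulated loss and therefore gains the most posterior mass.
prior = {'a': 1 / 3, 'b': 1 / 3, 'c': 1 / 3}
losses = {'a': 5.0, 'b': 1.0, 'c': 3.0}
post = gibbs_posterior(prior, losses)
```

The exponential weighting is why, when losses grow linearly in n with different rates, the posterior concentrates exponentially fast on the minimizing parameters.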
T^{n−1}(y)} is distributed according to this probability measure. From the Birkhoff time series it is possible to successively update the initial a priori probability measure Π_0 in order to get a sequence of probability measures Π_n(· | y) on Θ (the a posteriori probability measure at time n) as described. We ask the following:
• Does the limit lim_{n→∞} Π_n exist?
• If the previous question has an affirmative answer: is it the Dirac measure δ_{θ_0} on θ_0 ∈ Θ? Is it possible to estimate the speed of convergence to the limiting measure?
In this paper we answer the previous questions for loss functions that do not necessarily arise from Birkhoff averaging but retain some almost-additivity property. For that reason our approach will make use of results from non-additive thermodynamic formalism, hence it differs from the one considered in [47]. We refer the reader to [16] for a related work which does not involve Bayesian statistics. This paper is organized as follows. In the rest of this first section we formulate the precise setting we are interested in and state the main results. In Section 2 we present several examples and applications of our results. Section 3 is devoted to some preliminaries on relative entropy, large deviations and non-additive thermodynamic formalism. Finally, the proofs of the main results are given in Section 4. Let σ : Ω → Ω be a subshift of finite type endowed with the metric d_Ω(x, y) = 2^{−n(x,y)}, where n(x, y) = inf{n ≥ 1 : x_n ≠ y_n}, and denote by M_σ(Ω) the space of σ-invariant probability measures. The space M_σ(Ω) is metrizable and we consider the usual topology on it (compatible with weak* convergence). Let D_Ω be a metric on M_σ(Ω) compatible with the weak* topology. The set G ⊂ M_σ(Ω) of Gibbs measures for Lipschitz continuous potentials is dense in M_σ(Ω) (see for instance [39]). Given a Lipschitz continuous potential A : Ω → R we denote by µ_A the associated Gibbs measure.
We say that the Lipschitz continuous potential A : Ω → R is normalized if L_A(1) = 1, where L_A is the usual Ruelle-Perron-Frobenius transfer operator (cf. [48, Chapter 2]). We will always assume that potentials are normalized and write J = e^A > 0 (or alternatively A = log J) for the Jacobian of the associated probability measure. Moreover, it is a classical result in thermodynamic formalism (see e.g. [48]) that the following variational principle holds for any Lipschitz and normalized potential log J:

sup_{µ ∈ M_σ(Ω)} { h(µ) + ∫ log J dµ } = 0,

with the supremum attained at the Gibbs measure µ_{log J}. A particularly relevant context is given by the space of stationary Markov probability measures on shift spaces (cf. Example 2.1). One should emphasize that, replacing the metric on Ω, it is possible to deal instead with the space of Hölder continuous potentials (cf. [48, Chapter 1]). In the direct observation context, the sampling in the Bayesian inference is determined by T = σ and a fixed T-invariant Gibbs measure ν on Ω associated to a normalized potential log J. The sampling will describe the interaction (expressed in terms of the loss functions) over certain families of potentials (and Gibbs measures) which are parameterized on a compact set, where the sampling will occur. More precisely, consider a compact set of parameters Θ ⊂ R^k endowed with the metric d_Θ given by d_Θ(θ_1, θ_2) = ‖θ_1 − θ_2‖, ∀θ_1, θ_2 ∈ Θ, and denote by f : Θ → G ⊂ M_σ(Ω) a continuous function of potentials parameterized over Θ such that: (1) f is a homeomorphism onto its image; (2) for each θ the potential f(θ) is normalized (we use the notation f(θ) = log J_θ). These assumptions guarantee that for each θ ∈ Θ there exists a unique invariant Gibbs measure µ_θ with respect to the associated normalized potential f(θ), and that these vary continuously in the weak* topology.
Moreover, as the parameter space Θ is compact and f : Θ → G is a continuous function (expressed in the form f(θ) = log J_θ, where f is a continuous function of θ ∈ Θ and J_θ > 0), we deduce that the quotient J_{θ_1}(x)/J_{θ_2}(x) is uniformly bounded for every x ∈ Ω and all θ_1, θ_2 ∈ Θ. Remark 1.1. At this moment we are not requiring the probability measure ν of the observed system Y = Ω to belong to the family of probability measures (µ_θ)_{θ∈Θ}. We refer the reader to Example 2.5 for an application in the special case that ν = µ_{θ_0} for some θ_0 ∈ Θ. The statistics is described by an a priori Bayes probability measure Π_0 on the space of parameters Θ satisfying Hypothesis (A): Π_0(dz_1, dz_2, ..., dz_k) = Π_0(dθ) has a fixed continuous, strictly positive density and is fully supported on the compact set Θ. In many examples the a priori measure appears as the Lebesgue or an equidistributed measure on the parameter space. We refer the reader to Section 2 for examples. The previous full support assumption not only expresses the uncertainty on the choice of the parameters, but also ensures that all parameters in Θ will be taken into account in the inference, independently of the initial belief (distribution of Π_0). In this case of direct observations of Gibbs measures, let θ_0 ∈ Θ be fixed. The probability measure µ_{θ_0} will play the role of the measure ν (on the observed system Y) considered abstractly in the previous subsection. We will consider the loss functions ℓ_n : Θ × Ω × Ω → R, n ≥ 1, given by

ℓ_n(θ, x, y) = − log 1_{C_n(y)}(x),    (9)

where 1_{C_n(y)} denotes the indicator function of the n-cylinder set centered at y and defined by C_n(y) = {(x_j)_{j≥1} : x_j = y_j, ∀ 1 ≤ j ≤ n}. Such a choice of loss functions ensures that

∫_Ω e^{−ℓ_n(θ, x, y)} dµ_θ(x) = µ_θ(C_n(y))

for each y ∈ Y. Therefore, using equalities (25) and (27) (see Subsection 3.1 below), Jensen's inequality and the monotone convergence theorem, one obtains, for µ_{θ_0}-almost every y, an expression for the lim sup of the exponential rates of the measures µ_θ(C_n(y)), in terms of the entropy of µ_{θ_0} and its relative entropy with respect to µ_θ.
In this context of direct observation we are interested in estimating the family of a posteriori measures

Π_n(E | y) = ∫_E µ_θ(C_n(y)) dΠ_0(θ) / ∫_Θ µ_θ(C_n(y)) dΠ_0(θ)    (11)

on Borel sets E ⊂ Θ which do not contain θ_0, where y ∈ Ω is a point chosen according to µ_{θ_0}. An equivalent form of (11) which may be useful is obtained by dividing the integrands in both numerator and denominator by µ_{θ_0}(C_n(y)). Actually, given such a set E ⊂ Θ, one can ask whether the limit lim_{n→∞} Π_n(E | y) exists for µ_{θ_0}-almost every y. The following result gives an affirmative answer to this question. Theorem A. In the previous context, lim_{n→∞} Π_n(· | y) = δ_{θ_0} for µ_{θ_0}-almost every y ∈ Ω. Moreover, the convergence is exponentially fast: for every δ > 0 there exists a constant c_δ > 0 so that the ball B_δ of radius δ around θ_0 satisfies |Π_n(B_δ | y) − 1| ≤ e^{−c_δ n} for every large n ≥ 1. The previous result guarantees that the parameter θ_0, or equivalently the sampling measure µ_{θ_0}, is identified as the limit of the Bayesian inference process determined by the loss function (9). This result arises as a consequence of the quantitative estimates in Theorem 4.1, given in the proofs section below. The direct observation of Gibbs measures was also considered in [47, Section 2.1], although with a different approach. For a parameterized family of loss functions of the form β · ℓ_n(θ, x, y), the zero temperature limit (ground states) is also analyzed in Section 3.7 of [47]. This is a topic which can be associated to ergodic optimization. Our results are related in some sense to the so-called Maximum Likelihood Identification described in [15, 14, 17, 18, 16]. The previous context fits in the wider scope of non-additive thermodynamic formalism, using almost-additive sequences of continuous functions (see Subsection 3.2 for the definition). Indeed, the loss functions (ℓ_n)_{n≥1} described in (9) form an almost-additive family (cf. Definition 3.2 and Lemma 3.3).
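The content of Theorem A can be illustrated numerically in the simplest direct-observation setting: Bernoulli measures (which are Gibbs measures), a finite grid of parameters, and the posterior (11). The sketch below samples from µ_{θ_0} with θ_0 = 0.6 and measures the posterior mass of a small ball around θ_0; the grid, sample size and radius are illustrative choices.

```python
import math
import random

def posterior_mass_near(theta0, n, delta=0.051, seed=2):
    """Direct observation for Bernoulli measures: sample y from mu_theta0
    and return the posterior mass (11) of the delta-ball around theta0.
    (delta is taken slightly above a grid step to dodge float edge cases.)"""
    rng = random.Random(seed)
    thetas = [i / 20 for i in range(1, 20)]      # parameter grid in (0, 1)
    prior = 1.0 / len(thetas)                    # uniform a priori measure
    y = [1 if rng.random() < theta0 else 0 for _ in range(n)]
    ones = sum(y)
    # log mu_theta(C_n(y)) is a Birkhoff sum for Bernoulli measures.
    log_cyl = {t: ones * math.log(t) + (n - ones) * math.log(1 - t)
               for t in thetas}
    m = max(log_cyl.values())                    # log-space normalization
    w = {t: math.exp(v - m) * prior for t, v in log_cyl.items()}
    z = sum(w.values())
    return sum(w[t] / z for t in thetas if abs(t - theta0) <= delta)
```

For θ_0 = 0.6 on this grid, `posterior_mass_near(0.6, 500)` is already close to 1, and increasing n pushes the mass toward 1 at an exponential rate, in line with Theorem A.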
Furthermore, we will consider loss functions ℓ_n : Θ × Ω × Y → R which form an almost-additive sequence of continuous functions, and for which one can write

ℓ_n(θ, x, y) = − ϕ_n(θ, x, y),    (14)

where ϕ_n : Θ × Ω × Y → R_+ are continuous observables satisfying: (A1) for ν-almost every y ∈ Y the limit

Γ_y(θ) := lim_{n→∞} (1/n) log ∫_Ω e^{ϕ_n(θ, x, y)} dµ_θ(x)

exists for every θ ∈ Θ; (A2) for ν-almost every y ∈ Y the map Θ ∋ θ → Γ_y(θ) is upper semicontinuous. Given y ∈ Y and loss functions ℓ_n satisfying (A1)-(A2), the a posteriori measures are defined as in (6). Remark 1.2. The expression appearing in assumption (A1), which resembles the logarithm of the moment generating function for i.i.d. random variables, is in special cases referred to as the free energy function. Consider the special case where T = σ is the shift, ν is an equilibrium state with respect to a Lipschitz continuous potential ψ and ϕ_n(θ, x, y) = ϕ_{n,1}(θ, x) + ϕ_{n,2}(θ, y), where ϕ_{n,1}(θ, x) = Σ_{j=0}^{n−1} φ_θ ∘ σ^j(x), φ_θ : Ω → R is Lipschitz continuous and (ϕ_{n,2}(θ, ·))_{n≥1} is sub-additive. Then, using the fact that the pressure function defined over the space of Lipschitz continuous observables is Gateaux differentiable, together with the sub-additive ergodic theorem, one obtains that the limit in (A1) exists for ν-almost every y ∈ Ω and is independent of y. We refer the reader to Subsection 3.2 for the concept of topological pressure and further information. The following result guarantees that the previous Bayesian inference procedure accumulates on the set of probability measures on the parameter space Θ that maximize the free energy function Γ_y. By assumption (A2) the set argmax Γ_y := {θ_0 ∈ Θ : Γ_y(θ) ≤ Γ_y(θ_0), ∀θ ∈ Θ} is non-empty. Then we prove the following: Theorem B. Assume ℓ_n is a loss function of the form (14) satisfying assumptions (A1)-(A2). There exists a full ν-measure subset Y′ ⊂ Y so that, for any δ > 0 and y ∈ Y′, lim_{n→∞} Π_n(U_δ | y) = 1, where U_δ is the open δ-neighborhood of the maximality locus of Γ_y. In particular, if y ∈ Y′ is such that Θ ∋ θ → Γ_y(θ) has a unique point of maximum θ_0^y ∈ Θ then lim_{n→∞} Π_n(· | y) = δ_{θ_0^y}.
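The mechanism behind Theorem B is a Laplace-type argument: if the integrated factor in (6) grows like e^{n Γ_y(θ)}, the posterior reweights the prior by e^{n Γ_y(θ)} and concentrates on argmax Γ_y as n → ∞. A toy sketch on a finite grid, with a hypothetical free energy having a unique maximum:

```python
import math

def posterior_from_free_energy(gamma, prior, n):
    """Pi_n(theta) proportional to exp(n * Gamma(theta)) * Pi_0(theta)."""
    m = max(gamma.values())                      # log-space normalization
    w = {t: math.exp(n * (gamma[t] - m)) * prior[t] for t in prior}
    z = sum(w.values())
    return {t: v / z for t, v in w.items()}

# Hypothetical free energy with unique maximum at theta = 0.3.
thetas = [i / 10 for i in range(11)]
gamma = {t: -(t - 0.3) ** 2 for t in thetas}
prior = {t: 1 / len(thetas) for t in thetas}

# Posterior mass at the maximizer grows toward 1 as n increases.
mass = [posterior_from_free_energy(gamma, prior, n)[0.3] for n in (10, 100, 1000)]
```

With a unique maximizer the limit is a Dirac measure, as in the last claim of Theorem B; if the maximum were attained on a larger set, the mass would spread over the whole maximality locus.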
Finally, inspired by the log-likelihood estimators in the context of Bayesian statistics, it is also natural to consider loss functions ℓ_n : Θ × X × Y → R of the form

ℓ_n(θ, x, y) = − log ϕ_n(θ, x, y),    (17)

associated to an almost additive sequence Φ = (ϕ_n)_{n≥1} of continuous observables ϕ_n : Θ × X × Y → R_+ satisfying (H1): for each θ ∈ Θ and x ∈ X there exists a constant K_{θ,x} > 0 so that the corresponding uniform comparison holds for every y ∈ Y. In this context, the loss functions induce the a posteriori measures as in (6). Therefore, even though the loss functions are not almost-additive, due to the logarithmic term, we have the following result for these non-additive loss functions: Theorem C. Assume that the loss function of the form (17) satisfies assumptions (H1)-(H2) above. There exists a non-negative function ψ* : Θ → R_+ (depending on Ψ_θ = (ψ_n(θ, ·))_{n≥1}) so that for ν-almost every y ∈ Y the a posteriori measures (Π_n(· | y))_{n≥1} converge, as n → ∞, to the probability measure Π* given by dΠ*(θ) = ψ*(θ) dΠ_0(θ) / ∫_Θ ψ* dΠ_0. Moreover, if T = σ is a subshift of finite type, ν ∈ M_σ(Ω) is a Gibbs measure with respect to a Lipschitz continuous potential and inf_{θ∈Θ} ψ*(θ) > 0, then for each g ∈ C(Θ, R) there exists c > 0 so that the exponential estimate (19) holds, where F(η, Ψ_θ) := lim_{n→∞} (1/n) ∫ ψ_n(θ, ·) dη. If, additionally, the map Θ ∋ θ → F(η, Ψ_θ) is continuous for each η ∈ M_σ(Ω) then the right hand-side in (19) is strictly negative. The previous theorem ensures that, in the context of loss functions of the form (17) satisfying properties (H1) and (H2) above, the a posteriori measures converge exponentially fast to a probability measure on the parameter space which is typically fully supported. We refer the reader to Example 2.2 for more details in the special case where the loss function depends exclusively on one parameter. Remark 1.3.
For completeness, let us mention that the results of Kifer [38] suggest that level-2 large deviations estimates (i.e., the rate of convergence of Π_n(· | y) to Π* on the space of probability measures on Θ) are likely to hold under the assumption that the limit lim_{n→∞} (1/n) log ∫ e^{ϕ_n} dν exists for all almost-additive sequences Φ = (ϕ_n)_{n≥1} of continuous observables and defines a non-additive free energy function which is related to the non-additive topological pressure. This goes beyond the scope of our interest here. In what follows we give some examples which illustrate the intuition and utility of Bayesian inference and also the meaning of the a priori measures. Example 2.1. Consider the full shift on two symbols and a family of row-stochastic matrices M_{(a,b)} parameterized by (a, b) ∈ (0, 1) × (0, 1), with invariant probability vector (π_1, π_2). In this case the associated normalized Jacobian J_{(a,b)}(w) has constant value on cylinders of size two. More precisely, for w in the cylinder [i, j] ⊂ Ω we get J = π_i P_{i,j} / π_j, where (π_1, π_2) is the initial invariant probability vector. For each value (a, b) denote by µ_{(a,b)} the stationary Markov probability measure associated to the stochastic matrix M_{(a,b)}. In this case we get that h(µ_{(a,b)}) + ∫ log J_{(a,b)} dµ_{(a,b)} = 0 and L*_{log J_{(a,b)}}(µ_{(a,b)}) = µ_{(a,b)} (see [41, 53]). We refer the reader to [23, 24, 25, 56] for applications of the maximum likelihood estimator in this context of Markov probability measures. One possibility would be to take the probability measure Π_0 on the space Θ as the Lebesgue probability measure on (0, 1) × (0, 1). Different choices of loss functions would lead to different solutions for the claim of Theorem B. The first examples below are very simple and illustrate some trivial contexts. Whenever the parameter space Θ (or Y) is a singleton, the Bayesian inference is trivial, hence it carries no information. The first example we shall consider is when the loss function depends exclusively on a single variable.
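The identity h(µ) + ∫ log J dµ = 0 can be verified numerically for stationary Markov measures. The sketch below uses one sample row-stochastic matrix (the concrete entries, and hence the implicit values of (a, b), are illustrative) and checks the identity for the corresponding stationary measure.

```python
import math

# A sample 2x2 row-stochastic matrix; entries are illustrative.
P = [[0.8, 0.2],
     [0.3, 0.7]]

# Stationary probability vector (pi P = pi), solved directly for two states.
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1.0 - pi0]

# Entropy of the stationary Markov measure: h = -sum_ij pi_i P_ij log P_ij.
h = -sum(pi[i] * P[i][j] * math.log(P[i][j])
         for i in range(2) for j in range(2))

# Normalized Jacobian on the 2-cylinder [i, j]: J = pi_i P_ij / pi_j.
# In the integral the pi-terms telescope and cancel, leaving exactly -h.
int_log_J = sum(pi[i] * P[i][j] * math.log(pi[i] * P[i][j] / pi[j])
                for i in range(2) for j in range(2))
```

Here h + int_log_J vanishes up to rounding; the same cancellation works for any irreducible stochastic matrix, since the marginal terms Σ π_i log π_i and Σ π_j log π_j coincide by stationarity.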
Nevertheless, as the loss functions are non-additive, these results could not be handled with the previous literature on the subject. Example 2.2. Assume that Θ ⊂ R^d is a compact set, Y = Ω and T = σ : Ω → Ω is a subshift of finite type. In the case that the loss functions ℓ_n : Θ × Ω × Y → R are generated by an almost-additive sequence of continuous observables Φ = (ϕ_n)_{n≥1} through ℓ_n(θ, x, y) = − log ϕ_n(y), which is independent of θ and x, the loss function gives no information on the parameter space. For that reason it is natural that the a posteriori measures satisfy Π_n(· | y) = Π_0 for every sampling y, T(y), . . . , T^{n−1}(y) ∈ Y. Now, assuming alternatively that the loss function is given by ℓ_n(θ, x, y) = − log ϕ_n(θ), which is independent of both x and y, a simple computation shows that Π_n(E | y) = ∫_E ϕ_n(θ) dΠ_0(θ) / ∫_Θ ϕ_n(θ) dΠ_0(θ). In this case the loss function neglects the observed dynamical system T, hence the a posteriori measures are independent of the sampling. Yet, as the family Φ is almost-additive, it is easy to check that there exists C > 0 so that {ϕ_n + C}_{n≥1} is sub-additive. In particular, a simple application of Fekete's lemma (cf. Lemma 3.2) ensures that the limit lim_{n→∞} ϕ_n(θ)/n does exist and coincides with ϕ_*(θ) := inf_{n≥1} ϕ_n(θ)/n, for every θ ∈ Θ. In consequence, lim_{n→∞} Π_n(E | y) = ∫_E ϕ_*(θ) dΠ_0(θ) / ∫_Θ ϕ_*(θ) dΠ_0(θ), independently of the sampling y. In particular the limit measure Π is fully supported on Θ if and only if ϕ_*(θ) > 0 for every θ ∈ Θ. Finally, for each n ≥ 1 and each almost-additive sequence of continuous observables Φ = (ϕ_n)_{n≥1} on X, consider the corresponding loss function. In this case a simple computation shows that one obtains a posteriori measures in which the sequence ψ_n(θ) = ∫_Ω ϕ_n(x) dµ_θ(x) is almost additive. Indeed, the σ-invariance of µ_θ and the almost-additivity condition ϕ_n(x) + ϕ_m(σ^n(x)) − C ≤ ϕ_{m+n}(x) ≤ ϕ_n(x) + ϕ_m(σ^n(x)) + C ensure that ψ_n(θ) + ψ_m(θ) − C ≤ ψ_{m+n}(θ) ≤ ψ_n(θ) + ψ_m(θ) + C for every m, n ≥ 1 and θ ∈ Θ.
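The application of Fekete's lemma in Example 2.2 can be observed numerically: for any sub-additive real sequence, a_n/n converges to inf_n a_n/n. The sequence below is a hypothetical choice, sub-additive because log(n+m+1) ≤ log((n+1)(m+1)):

```python
import math

c = 0.5  # the eventual limit of a_n / n

def a(n):
    # Hypothetical sub-additive sequence: a_n = c*n + log(n + 1)
    return c * n + math.log(n + 1)

# sanity check of sub-additivity a_{n+m} <= a_n + a_m on a small grid
assert all(a(n + m) <= a(n) + a(m) + 1e-12
           for n in range(1, 40) for m in range(1, 40))

# a_n/n decreases towards its infimum, which equals the limit c
ratios = [a(n) / n for n in range(1, 20001)]
```

Here min(ratios) and the last ratio agree with c up to the slowly vanishing log(n+1)/n term.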
Hence, even though the feed of information is given through the x-variable, the a posteriori measures are of the form (20), and their convergence is described by Lemma 3.2. In particular, this example shows that the situation is much simpler to describe when the loss functions depend exclusively on a single variable. In the following two simple examples we make explicit computations on the limit of the posterior distributions, which show that assumption (A) on the space of parameters and the a priori distribution cannot be removed. In particular, these will show that the posterior distributions Π_n(· | y) may converge, but not to a Dirac measure on the parameter θ_0 corresponding to the measure with respect to which the sampling occurs. Example 2.3. If the potential φ is constant on cylinders of size one with φ|_{[0]} = c < 0, then it is not hard to deduce (see e.g. [9]) that φ|_{[1]} = log(1 − e^c) and the unique equilibrium state for σ with respect to φ is the probability measure B(e^c, 1 − e^c). Assume that µ_{−1} = B(1/3, 2/3) and µ_1 = B(2/3, 1/3), which are the unique equilibrium states for the potentials φ_{−1} and φ_1, respectively. Take Π_0 = (1/2) δ_{−1} + (1/2) δ_1 and ν = B(1/2, 1/2), and notice that ν does not belong to the family (µ_θ)_θ. In the context of direct observation we are interested in describing the a posteriori measures Π_n(E | y) = ∫_E µ_θ(C_n(y)) dΠ_0(θ) / ∫_Θ µ_θ(C_n(y)) dΠ_0(θ) for ν-almost every y. The Bernoulli property of µ_{±1} then implies that, for ν-a.e. y, µ_1(C_n(y)) / µ_{−1}(C_n(y)) → 1 as n → ∞ and, consequently, the sequence of probability measures Π_n(· | y) on {−1, 1} is convergent, with lim_{n→∞} Π_n({±1} | y) = lim_{n→∞} µ_{±1}(C_n(y)) / [µ_{−1}(C_n(y)) + µ_1(C_n(y))] = 1/2. In other words, lim_{n→∞} Π_n(· | y) = (1/2) δ_{−1} + (1/2) δ_1 = Π_0. This convergence reflects the fact that ∫ φ_{−1} dν = ∫ φ_1 dν. Finally, it is not hard to check that for any a priori measure Π_0 = α δ_{−1} + (1 − α) δ_1, with 0 < α < 1, it still holds that lim_{n→∞} Π_n(· | y) = Π_0. Example 2.4.
In the context of Example 2.3, assume that the sampling is done with respect to a non-symmetric Bernoulli measure ν̃ = B(α, 1 − α) for some 0 < α < 1/2. The ergodic theorem guarantees that, for ν̃-a.e. y, (1/n) log µ_1(C_n(y)) → log(2^α/3) and (1/n) log µ_{−1}(C_n(y)) → log(2^{1−α}/3) as n → ∞ and, consequently, µ_1(C_n(y)) / µ_{−1}(C_n(y)) → 0 as n → ∞. Altogether we get lim_{n→∞} Π_n({1} | y) = lim_{n→∞} µ_1(C_n(y)) / [µ_{−1}(C_n(y)) + µ_1(C_n(y))] = lim_{n→∞} [µ_1(C_n(y))/µ_{−1}(C_n(y))] / [1 + µ_1(C_n(y))/µ_{−1}(C_n(y))] = 0 and lim_{n→∞} Π_n({−1} | y) = 1. In other words, lim_{n→∞} Π_n(· | y) = δ_{−1} for ν̃-almost every y, which reflects the fact that ∫ φ_1 dν̃ < ∫ φ_{−1} dν̃. Example 2.5. For each θ ∈ [0, 1] let µ_θ be the unique Gibbs measure associated to the Lipschitz continuous potential f_θ (see also Section 6 in [30] for a related work). Assume further that the observed probability measure associated to the sampling is ν = µ_{θ_0} for some θ_0 ∈ [0, 1]. The probability measure Π_0 describes our ignorance of the exact value θ_0 among all possible choices θ ∈ [0, 1]. For each n ∈ N consider a continuous loss function ℓ_n : Θ × Ω × Y → R of cross-entropy type; similar expressions are often referred to as cross-entropy loss functions. By compactness of the parameter space Θ we conclude that the third and fourth expressions above are uniformly bounded, hence (ℓ_n)_{n≥1} forms an almost-additive family in the y-variable and fits in the context of Theorem B. In particular we conclude that the a posteriori measures Π_n(· | y) converge to the Dirac measure δ_{θ_0} as n tends to infinity, for µ_{θ_0}-almost every y. Alternatively, consider the continuous loss function ℓ_n : Θ × Ω × Y → R given by a squared-error expression ℓ_n((a, b), x, y). The minimization of −ℓ_n corresponds, in rough terms, to what is known in statistics as the minimization of the mean squared error on the set of parameters.
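Returning to Example 2.4 above, its conclusion can be checked by direct simulation: sampling from B(α, 1 − α) with α < 1/2, the posterior mass of {1} collapses exponentially. The sample size and random seed below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, n = 0.25, 4000
# sample from the non-symmetric measure B(alpha, 1-alpha):
# symbol 0 occurs with probability alpha
y = (rng.random(n) >= alpha).astype(int)

def log_cylinder(y, p0):
    """log mu(C_n(y)) for the Bernoulli measure giving weight p0 to symbol 0."""
    k0 = int(np.sum(y == 0))
    return k0 * np.log(p0) + (len(y) - k0) * np.log(1.0 - p0)

log_m1 = log_cylinder(y, 2/3)    # mu_1 = B(2/3, 1/3)
log_mm1 = log_cylinder(y, 1/3)   # mu_{-1} = B(1/3, 2/3)

# Pi_n({1} | y) = mu_1(C_n)/(mu_1(C_n) + mu_{-1}(C_n)), computed in
# log-space to avoid underflow of the cylinder probabilities
post_1 = np.exp(log_m1 - np.logaddexp(log_m1, log_mm1))
```

With α = 0.25 the log-likelihood ratio drifts linearly to −∞, so post_1 is numerically indistinguishable from zero, matching lim Π_n(· | y) = δ_{−1}.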
As the previous loss function is also almost-additive in the y-variable, Theorem B ensures that the corresponding a posteriori measures Π_n(· | y) converge exponentially fast to the Dirac measure at the sampling parameter θ_0 as n tends to infinity, µ_{θ_0}-almost everywhere (we refer the reader to [47], where the methods developed there can provide an alternative argument leading to the same conclusion). Example 2.6. Given n ≥ 1 and (x_1, . . . , x_n) ∈ {1, 2}^n, take the matrix product A^{(n)}_θ(x) = A_θ(σ^{n−1}(x)) · · · A_θ(σ(x)) A_θ(x). The limit λ_{θ,i} := lim_{n→∞} (1/n) log ‖A^{(n)}_θ(x) v_i‖ is the largest Lyapunov exponent along the orbit of x; it is well defined for ν-almost every x and does not depend on the vector v_i ∈ E^i_{θ,x} \ {0} (i = ±) (cf. Subsection 3.2.3 for more details). Somewhat dual to the context of the joint spectral radius [8], the problem here is the selection of a certain Gibbs measure from the information on the norm of the products of matrices along orbits of typical points. More precisely, take the loss function ℓ_n(θ, x, y) = − log ‖A^{(n)}_θ(x)‖ and notice that, for ν-almost every y ∈ Y and every θ ∈ Θ, ∫_Ω e^{ϕ_{m+n}(θ,x,y)} dµ_θ(x) ≤ ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) · ∫_Ω e^{ϕ_m(θ,x,y)} dµ_θ(x) for every m, n ≥ 1, where we used that µ_θ is a σ-invariant Bernoulli measure. In particular, Fekete's lemma implies that the limit Γ_y(θ) := lim_{n→∞} (1/n) log ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) exists and does not depend on y. As the right-hand side is the infimum of continuous functions of the parameter θ, the limit function Θ ∋ θ → Γ_y(θ) is upper semicontinuous. We remark that θ_0 = (0, 0) is the unique parameter for which the Lyapunov exponent is the largest possible (see Lemma 3.5). Hence, as assumptions (A1) and (A2) are satisfied, Theorem B implies that the a posteriori measures converge to the Dirac measure δ_{(0,0)}. In particular, one has posterior consistency in the problem of determining the measure with largest Lyapunov exponent.
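The selection mechanism of Example 2.6 rests on the value of the largest Lyapunov exponent at θ_0 = (0, 0), where the cocycle reduces to a single hyperbolic matrix. As a sanity check, the subadditive limit (1/n) log ‖A^{(n)}‖ can be computed for a constant cocycle; taking A = [[2, 1], [1, 1]] ∈ SL(2, R), whose leading eigenvalue is (3+√5)/2, is our assumption here, chosen to match the value quoted in the proof of Lemma 3.5:

```python
import numpy as np

# Hypothetical unperturbed hyperbolic matrix in SL(2, R)
A = np.array([[2.0, 1.0],
              [1.0, 1.0]])

log_norm, prod, n = 0.0, np.eye(2), 200
for _ in range(n):
    prod = A @ prod
    s = np.linalg.norm(prod, 2)   # spectral norm; rescale to avoid overflow
    log_norm += np.log(s)
    prod /= s

lyap = log_norm / n                         # (1/n) log ||A^{(n)}||
expected = np.log((3 + np.sqrt(5)) / 2)     # log of the leading eigenvalue
```

Since A is symmetric, its spectral norm equals its leading eigenvalue, so the subadditive average agrees with log((3+√5)/2) up to floating-point error; for genuinely non-commuting products the convergence would only be asymptotic.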
Alternatively, taking the loss function ℓ_n(θ, x, y) = −ϕ_n(θ, x, y) = − log ‖A^{(n)}_θ(y)‖, note that the a posteriori measures are given by Π_n(E | y) = ∫_E ‖A^{(n)}_θ(y)‖ dΠ_0(θ) / ∫_Θ ‖A^{(n)}_θ(y)‖ dΠ_0(θ) and that, by the Oseledets theorem and the sub-additive ergodic theorem, the limit λ(θ) := lim_{n→∞} (1/n) ∫ log ‖A^{(n)}_θ(·)‖ dν is upper semicontinuous, because it is the infimum of continuous maps. In particular, Theorem B implies once more that, for ν-almost every y ∈ Y, the a posteriori measures converge to δ_{(0,0)}. Example 2.7. In the context of Example 2.6, noticing that all matrices are in SL(2, R), it makes sense to consider alternatively the loss function ℓ_n(θ, x, y) = − log ϕ_n(θ, x, y) = − log log ‖A^{(n)}_θ(y)‖, and to observe that ϕ_n(θ, x, y) is almost-additive, meaning it satisfies (H1)-(H2) with a constant K uniform in θ. The loss functions induce the a posteriori measures (24). A simple computation involving Fekete's lemma guarantees that, for each θ ∈ Θ, the annealed Lyapunov exponent λ(θ) does exist. Theorem C implies that the a posteriori measures (24) converge and lim_{n→∞} Π_n(E | y) = ∫_E λ(θ) dΠ_0(θ) / ∫_Θ λ(θ) dΠ_0(θ) for every measurable subset E ⊂ Θ. In particular the limit measure is absolutely continuous with respect to the a priori measure Π_0, with density given by the normalized Lyapunov exponent function. Moreover, the continuous dependence of the Lyapunov exponents with respect to the parameter θ implies the exponential large deviations estimates in Theorem C. 3.1. Relative entropy. Let us recall some relevant concepts of entropy in the context of shifts. Given x = (x_1, x_2, ..., x_k, ...) ∈ Ω and n ≥ 1, recall that C_n(x) = {y ∈ Ω | y_j = x_j, j = 1, 2, ..., n} is the n-cylinder in Ω that contains the point x. The concept of relative entropy will play a key role in the analysis. Let φ : Ω → R be a Lipschitz continuous potential and let µ_φ be its unique Gibbs measure, which is thus ergodic. Following [16, Section 3], given an ergodic probability measure µ ∈ M_σ(Ω) the limit h(µ | µ_φ) := lim_{n→∞} (1/n) log (µ(C_n(x)) / µ_φ(C_n(x))) exists and is non-negative for µ-almost every x = (x_1, x_2, ..., x_n, ...)
∈ Ω, and it is called the relative entropy of µ with respect to µ_φ. Notice that any two distinct ergodic probability measures are mutually singular, hence no Radon-Nikodym derivative is well defined. In (25), a sequence of nested cylinder sets is used as an alternative way to compute the relative entropy when Radon-Nikodym derivatives are not well defined (see [16] for more details). The relative entropy is also known as the Kullback-Leibler divergence. For proofs of general results on the topic in the context of shifts we refer the reader to [16] and [44], which deal with finite and compact alphabets, respectively. We refer the reader to [34] for an application of the Kullback-Leibler divergence in statistics. Remark 3.1. In the special case that (µ_θ)_{θ∈Θ} is a parameterized family of Gibbs measures associated to normalized potentials, then for µ_θ-almost every x = (x_1, x_2, ..., x_n, ...) ∈ Ω we have (1/n) log (µ_θ(C_n(x)) / µ_{θ_0}(C_n(x))) → h(µ_θ | µ_{θ_0}) as n → ∞, whenever f_θ and f_{θ_0} are not cohomologous. Furthermore, as the pressure function is zero in this context, the relative entropy h(µ_{θ_0} | µ_θ) can be written as in (26). Expression (26) allows one to obtain uniform estimates on the relative entropy of nearby invariant measures. More precisely: Lemma 3.1. Let φ : Ω → R be a Lipschitz continuous potential and let µ_φ be its unique Gibbs measure. Then, for any small ε > 0 there exists δ > 0 such that h(µ) + ∫ φ dµ < P_top(σ, φ) − ε for every invariant probability measure µ with D_Ω(µ, µ_φ) > δ. Proof. Fix ε > 0. By continuity of the map µ → ∫ φ dµ, upper semicontinuity of the entropy map µ → h(µ) and uniqueness of the equilibrium state, there exists δ > 0 so that any invariant probability measure µ with D_Ω(µ, µ_φ) > δ satisfies h(µ) + ∫ φ dµ < P_top(σ, φ) − ε. This, together with (26), proves the lemma. 3.2. Non-additive thermodynamic formalism. As mentioned before, we are mostly interested in non-additive loss functions which retain some almost additivity. Let us recall some of the basic notions associated to the non-additive thermodynamic formalism.
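Before turning to non-additive sequences, the cylinder-set definition of relative entropy above can be tested empirically in the simplest setting of two Bernoulli measures, where the closed form is h(B(p, 1−p) | B(q, 1−q)) = p log(p/q) + (1−p) log((1−p)/(1−q)); the parameters p, q and the sample size below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, n = 0.6, 0.3, 200_000
# a mu-typical point for mu = B(p, 1-p): symbol 1 occurs with probability p
y = (rng.random(n) < p).astype(int)

def log_cylinder(y, p1):
    """log mu(C_n(y)) for the Bernoulli measure giving weight p1 to symbol 1."""
    k1 = int(y.sum())
    return k1 * np.log(p1) + (len(y) - k1) * np.log(1.0 - p1)

# cylinder estimate of h(mu | mu_phi) along the sampled orbit
est = (log_cylinder(y, p) - log_cylinder(y, q)) / n
# closed form Kullback-Leibler rate for Bernoulli measures
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
```

By the ergodic theorem the cylinder estimate converges to the closed form at rate O(n^{-1/2}), which the comparison below reflects.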
There are several notions of non-additive sequences which appear naturally in the description of thermodynamic objects. Let us recall some of them. Definition 3.2. A sequence Ψ := {ψ_n}_{n≥1} of continuous functions ψ_n : Ω → R is called: (1) almost additive if there exists C > 0 such that ψ_n + ψ_m ∘ σ^n − C ≤ ψ_{m+n} ≤ ψ_n + ψ_m ∘ σ^n + C for all m, n ≥ 1; (2) asymptotically additive if for any ξ > 0 there is a continuous function ψ^ξ so that lim sup_{n→∞} (1/n) ‖ψ_n − S_n ψ^ξ‖ < ξ; (3) sub-additive if ψ_{m+n} ≤ ψ_m + ψ_n ∘ σ^m for all m, n ≥ 1. The convergence in the case of constant functions, i.e. of sub-additive sequences of real numbers, is given by the following well-known lemma. Lemma 3.2 (Fekete's lemma). Let (a_n)_{n≥1} be a sequence of real numbers so that a_{n+m} ≤ a_n + a_m for every n, m ≥ 1. Then the sequence (a_n/n)_{n≥1} converges to inf_{n≥1} a_n/n. In order to recall the variational principle and equilibrium states for sequences of observables we need an almost sure convergence result. Given an invariant probability measure ρ ∈ M(Ω), Kingman's sub-additive ergodic theorem ensures that for any almost additive or sub-additive sequence Ψ := {ψ_n}_{n≥1} of continuous functions the limit lim_{n→∞} ψ_n(x)/n exists for ρ-almost every x. Definition 3.3. We denote by P_top(σ, Φ) the pressure of the almost additive family Φ, associated to the family (ϕ_n), where P_top(σ, Φ) = sup { h(µ) + F(µ, Φ) : µ ∈ M_σ(Ω) } and F(µ, Φ) := lim_{n→∞} (1/n) ∫ ϕ_n dµ. A probability measure µ = µ_Φ ∈ M_σ(Ω) is called a Gibbs measure for the almost additive family Φ if it attains the supremum. The previous topological pressure for non-additive sequences can also be defined, in the spirit of information theory, as the maximal topological complexity of the dynamics with respect to such sequences of observables (cf. [3]). The unique Gibbs measure associated to the family Φ = (ϕ_n)_{n≥1}, ϕ_n = Σ_{j=0}^{n−1} log J_{θ_0} ∘ σ^j, n ∈ N, is µ_{θ_0}. Moreover, in this case P_top(σ, Φ) = 0. For the family Φ := {ϕ_n} the claim is within the domain of classical thermodynamic formalism, as described before by expression (7). Remark 3.4.
In [19], the author proved that any sequence Ψ of almost additive or asymptotically additive potentials is equivalent to a standard additive potential: there exists a continuous potential ϕ with the same topological pressure, equilibrium states, variational principle, weak Gibbs measures, level sets (and irregular set) for the Lyapunov exponent, and large deviations properties. Yet, it is still unknown whether any such sequence of Lipschitz continuous potentials has a Lipschitz continuous additive representative. 3.2.2. Almost-additive potentials related to entropy. The next proposition says that Gibbs measures determine in a natural way some sequences of almost additive potentials. Lemma 3.3. Given θ ∈ Θ, the family ψ^θ_{n,1}(y) := log µ_θ(C_n(y)), n ∈ N, is almost additive. Proof. Recall that all potentials f_θ are normalized, thus each µ_θ satisfies the Gibbs property (3) with P_θ = 0. Thus, for θ ∈ Θ there exists K_θ > 0 such that for all m, n ≥ 1 and x ∈ Ω, µ_θ(C_{m+n}(x)) ≤ K³_θ µ_θ(C_n(x)) µ_θ(σ^n(C_{m+n}(x))) = K³_θ µ_θ(C_n(x)) µ_θ(C_m(σ^n(x))). Similarly, µ_θ(C_{m+n}(x)) ≥ K^{−3}_θ µ_θ(C_n(x)) µ_θ(C_m(σ^n(x))) for all m, n ≥ 1. Therefore, the family ψ^θ_{n,1}(y) = log µ_θ(C_n(y)) satisfies ψ^θ_{n,1} + ψ^θ_{m,1} ∘ σ^n − 3 log K_θ ≤ ψ^θ_{m+n,1} ≤ ψ^θ_{n,1} + ψ^θ_{m,1} ∘ σ^n + 3 log K_θ for all m, n ≥ 1, hence it is almost-additive. Note that the natural family log ∫ µ_θ(C_n(y)) dΠ_0(θ), n ∈ N, which seems at first useful, may not be almost additive, as one first evaluates fluctuations on the different ways the measures see cylinders and only afterwards takes the logarithm. We consider alternatively the sequence of potentials given below. Lemma 3.4. For any fixed y ∈ Y and any Borel set E ⊂ Θ, the family ψ_n(y) = ψ^E_n(y) = − ∫ 1_E(θ) log µ_θ(C_n(y)) dΠ_0(θ), n ∈ N, is almost additive. In particular, for each θ_0 ∈ Θ and E ⊂ Θ, the family Ψ^E := {Ψ^E_n}_n is almost additive. Proof. The first assertion is a direct consequence of the previous lemma and the linearity of the integral.
For the second one, just notice that Ψ^E_n is the sum of two almost-additive sequences, hence almost additive. 3.2.3. Almost-additive potentials related to Lyapunov exponents. Let σ : Ω → Ω be a subshift of finite type and for each θ = (θ_1, θ_2) ∈ [−ε, ε]² consider a locally constant linear cocycle A_θ : Ω → SL(2, R). To each n ≥ 1 and (x_1, . . . , x_n) ∈ {1, 2}^n one associates the product matrix A^{(n)}_θ(x) = A_θ(σ^{n−1}(x)) · · · A_θ(σ(x)) A_θ(x). If ε > 0 is chosen small, the previous family of matrices preserves a constant cone field in R², hence has a dominated splitting. Furthermore, if µ ∈ M_σ(Ω) is ergodic, the Oseledets theorem ensures that for µ-almost every x ∈ Ω there exists a cocycle-invariant splitting. Actually, the Oseledets theorem also ensures that the largest Lyapunov exponent can be obtained by means of sub-additive sequences, as λ⁺(A_θ, µ) = lim_{n→∞} (1/n) log ‖A^{(n)}_θ(x)‖ for µ-almost every x. Since all matrices preserve a cone field, for each θ ∈ [−ε, ε]² the sequence (log ‖A^{(n)}_θ(x)‖)_{n≥1} is known to be almost-additive in the x-variable (cf. [28]). Most surprisingly, in this simple context the largest annealed Lyapunov exponent varies analytically with the parameter θ (cf. [50]). We will need the following localization result. Lemma 3.5. λ⁺(A_θ, ν) ≤ log((3+√5)/2) for every θ ∈ [−ε, ε]², with equality if and only if θ = (0, 0). Proof. First observe that, as all matrices are obtained by a rotation of the original hyperbolic matrix, we have that log ‖A_θ‖ = log((3+√5)/2) for all θ ∈ [−ε, ε]². Second, it is clear from the definition that λ⁺(A_{(0,0)}, ν) is the logarithm of the largest eigenvalue of the unperturbed hyperbolic matrix, hence it equals log((3+√5)/2). Finally, Furstenberg [31] proved that λ⁺(A_θ, ν) = ∫ log (‖A_θ(x) v‖ / ‖v‖) d(ν × P)(x, v), where S¹ stands for the projective space of R² and P is a ν-stationary measure, meaning that ν × P is invariant by the projectivization of the cocycle. Altogether this guarantees that equality would force P = δ_{v⁺}, where v⁺ is the leading eigenvector of A_{(0,0)}, which cannot occur for θ ≠ (0, 0) because ν × δ_{v⁺} is not invariant by the projectivized cocycle. This proves the lemma. Large deviations: speed of convergence.
Large deviations estimates are commonly used in decision theory (see e.g. [12, 30, 56]). In the context of dynamical systems, the exponential rates of convergence in large deviations are defined in terms of rate functions, often described by thermodynamic quantities such as pressure and entropy. In the case of level-1 large deviation estimates these can be defined as follows. Given a family Ψ^E := {ψ^E_n}, where ψ^E_n : Ω → R, E is a Borel set of parameters, n ∈ N and −∞ ≤ c < d ≤ ∞, we define R̄_ν(Ψ^E, [c, d]) = lim sup_{n→∞} (1/n) log ν({ y ∈ Ω : (1/n) ψ^E_n(y) ∈ [c, d] }) and R_ν(Ψ^E, (c, d)) = lim inf_{n→∞} (1/n) log ν({ y ∈ Ω : (1/n) ψ^E_n(y) ∈ (c, d) }). Since the subshift dynamics satisfies the transitive specification property (also referred to as the gluing orbit property), [57, Theorem B] ensures the following large deviations principle for the subshift and either asymptotically additive or certain sequences of sub-additive potentials satisfying: ii. inf_{n≥1} ψ_n(x)/n > −∞ for all x ∈ Ω; and iii. the sequence {ψ_n/n} is equicontinuous. Given c ∈ R, it holds that: While in the previous theorem both invariant measures and sequences of observables may be generated by non-additive sequences of potentials (we refer the reader e.g. to [3] for the construction of equilibrium states associated to almost-additive sequences of potentials), we will be mostly concerned with Gibbs measures generated by a single Lipschitz continuous potential. In the special case of the almost-additive sequences considered in Subsection 3.2.2, the previous theorem reads as follows: Corollary 3.5. Let Φ = {ϕ_n} be defined by ϕ_n = Σ_{j=0}^{n−1} log J_{θ_0} ∘ σ^j, n ∈ N, and let µ_{θ_0} denote the corresponding Gibbs measure. For a given Borel set E ⊂ Θ, take Ψ^E := {ψ^E_n}, where ψ^E_n, n ∈ N, was defined in Lemma 3.4. Then, given ∞ ≥ d > c ≥ −∞, we have: As the entropy function of the subshift is upper semicontinuous, any sequence of invariant measures whose free energies associated to a continuous potential tend to the topological pressure accumulates on the space of equilibrium states.
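The level-1 quantities R̄_ν and R_ν just defined can be computed in closed form in the simplest i.i.d. situation. The sketch below is a fair-coin toy model outside the almost-additive machinery of Corollary 3.5, included only for illustration: it compares the exact exponential decay rate of the event {S_n/n ≥ c} with the entropy-type rate function I(c) = c log(2c) + (1−c) log(2(1−c)):

```python
import math

n, c = 500, 0.7
k_min = math.ceil(n * c)

# exact binomial tail nu({ y : S_n(y)/n >= c }) for the fair coin
tail = sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2**n

rate_est = -math.log(tail) / n   # empirical decay rate at this n
rate = c * math.log(2 * c) + (1 - c) * math.log(2 * (1 - c))  # I(c)
```

Chernoff's bound gives tail ≤ exp(−n·I(c)) for c > 1/2, so the finite-n rate estimate lies above I(c) and approaches it with a O(log n / n) correction.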
Thus, in the special case that there exists a unique equilibrium state, any such sequence converges to the equilibrium state. Altogether the previous argument gives the following: Lemma 3.7. Consider the sequence of functions Φ = {ϕ_n}_{n≥1} where ϕ_n(y) = Σ_{j=0}^{n−1} log J_{θ_0}(σ^j(y)) and log J_{θ_0} is Lipschitz continuous, and let µ_Φ denote the corresponding Gibbs measure. If U is an open neighborhood of the Gibbs measure µ_Φ then there exists α_1 > 0 such that the corresponding deviation sets have measure decaying exponentially fast with rate α_1. We are particularly interested in the δ-neighborhood of the parameter θ_0 ∈ Θ defined by (31). The next result establishes large deviations estimates for the relative entropy associated to Gibbs measures close to µ_{θ_0}. More precisely: Proposition 3.8. Let Ψ^E be defined by (30). For any δ > 0 there exists d_δ > 0 satisfying the stated estimate; moreover, for every small δ > 0 there exists α_1 > 0 so that the corresponding exponential bound holds. Proof. Remember that, given η ∈ M_σ(Ω) and E ⊂ Θ, F(η, Ψ^E) = lim_{n→∞} (1/n) ∫ ψ^E_n dη. Taking η = µ_{θ_0} and E = Θ, we get the corresponding identity from (25), (27) and Lemma 3.1. Similarly, one obtains F(µ_{θ_0}, Ψ^E) = −h(µ_{θ_0}) Π_0(E) − ∫_E ∫ log J_θ dµ_{θ_0} dΠ_0(θ) for any E ⊂ Θ. Using that h(µ_{θ_0} | µ_θ) > 0 for all θ ≠ θ_0 and that Π_0 is fully supported on Θ, Lemma 3.1 ensures the corresponding uniform bound for every small δ. In consequence, for every small δ there exists d_δ > 0 so that the desired inequality holds. Now, on the one hand, by continuity of η → F(η, Ψ^E) and, on the other hand, from Theorem 3.5, the lim sup estimate follows. Remark 3.9. From Hypothesis A the value d_δ > 0 can be taken small if δ > 0 is small. Corollary 3.10. Given δ > 0 small, let B_δ ⊂ Θ be the δ-open neighborhood of θ_0 defined in (31) and let d_δ > 0 be given by Proposition 3.8. The following holds: (34) holds for µ_{θ_0}-almost every point y and, moreover, (35) holds for µ_{θ_0}-almost every point y. Proof. For each n ≥ 1 consider the set A_n of points for which the inequality in (34) fails. By Proposition 3.8, we get that Σ_n µ_{θ_0}(A_n) < ∞. It follows from the Borel-Cantelli lemma that for µ_{θ_0}-almost every point y ∈ Ω there exists N such that y ∉ A_n for all n > N.
Equivalently, −(1/n) ∫_{B_δ} log (µ_θ(C_n(y)) / µ_{θ_0}(C_n(y))) dΠ_0(θ) < d_δ for all n > N, which proves (34). Therefore, from the Jensen inequality, we get for µ_{θ_0}-almost every y ∈ Ω and every large n ≥ 1 a lower bound for ∫_{B_δ} µ_θ(C_n(y)) dΠ_0(θ) in terms of µ_{θ_0}(C_n(y)). Moreover, as lim_{n→∞} −(1/n) log µ_{θ_0}(C_n(y)) = h(µ_{θ_0}) for µ_{θ_0}-almost every y, it follows from the previous inequalities that the lim inf estimate holds for µ_{θ_0}-almost every y, which proves (35), as desired. Remark 3.11. The previous corollary ensures that for any ζ > 0 and µ_{θ_0}-a.e. y ∈ Ω, ∫_{B_δ} µ_θ(C_n(y)) dΠ_0(θ) ≥ e^{−[d_δ + Π_0(B_δ) h(µ_{θ_0}) + ζ] n} for every large n ≥ 1. Moreover, Remark 3.9 guarantees that d_δ > 0 can be chosen small provided that δ is small. In particular, the absolute continuity assumption on the a priori measure Π_0 (Hypothesis A) implies that Π_0(B_δ) h(µ_{θ_0}) + d_δ can be taken arbitrarily small, provided that δ is small. Lemma 3.12. For small δ > 0 and µ_{θ_0}-almost every y ∈ Ω the lim sup estimate holds; moreover, sup_{θ∈Θ\B_δ} ∫ log J_θ dµ_{θ_0} → −h(µ_{θ_0}) as δ → 0. Proof. Recalling the Gibbs property (3) for µ_θ, the continuous dependence of the constants K_θ and the compactness of Θ, we conclude that there exist uniform constants c_1, c_2 > 0 bounding the corresponding ratios. Furthermore, as the potentials are assumed to be normalized, P_θ = 0 for every θ ∈ Θ. Therefore, there exist C_1 > 0 and C_2 > 0 such that the corresponding two-sided bounds hold for all y ∈ Ω, θ ∈ Θ and n ≥ 1. Then the constants do not affect lim sup_{n→∞} (1/n) log (µ_θ(C_n(y)) / µ_{θ_0}(C_n(y))). In consequence, using the ergodic theorem and that h(µ_{θ_0}) = −∫ log J_{θ_0} dµ_{θ_0}, one gets lim sup_{n→∞} (1/n) log (µ_θ(C_n(y)) / µ_{θ_0}(C_n(y))) ≤ h(µ_{θ_0}) + ∫ log J_θ dµ_{θ_0}. Fix ζ > 0 arbitrary and small. The previous expression ensures that, for µ_{θ_0}-a.e. y ∈ Ω, µ_θ(C_n(y)) / µ_{θ_0}(C_n(y)) ≤ e^{n (h(µ_{θ_0}) + ∫ log J_θ dµ_{θ_0} + ζ)} for every large n ≥ 1. Given a small δ > 0, by uniqueness of the equilibrium state for log J_θ, we have that ρ_δ := sup_{θ∈Θ\B_δ} [h(µ_{θ_0}) + ∫ log J_θ dµ_{θ_0}] < 0 and that ρ_δ tends to zero as δ → 0.
Then, for µ_{θ_0}-almost every point y, ∫_{Θ\B_δ} (µ_θ(C_n(y)) / µ_{θ_0}(C_n(y))) dΠ_0(θ) ≤ e^{n(ρ_δ + ζ)} for every large n ≥ 1. As ζ > 0 was chosen arbitrary, we conclude that, for µ_{θ_0}-almost every point y, lim sup_{n→∞} (1/n) log ∫_{Θ\B_δ} (µ_θ(C_n(y)) / µ_{θ_0}(C_n(y))) dΠ_0(θ) ≤ ρ_δ. The second inequality of the lemma is nothing more than expression (10). We proceed to show that the a posteriori measures in Theorem A do converge for µ_{θ_0}-typical points y. In order to prove that Π_n(· | y) → δ_{θ_0} (in the weak* topology) it is sufficient to prove that, for every δ > 0, one has Π_n(Θ \ B_δ | y) → 0 as n → ∞. This is the content of the following theorem. Theorem 4.1. Let Π_n(· | y) be the a posteriori measures defined by (11) and let B_δ be the δ-neighborhood of θ_0 defined by (31). Then, for every small δ > 0 and µ_{θ_0}-a.e. y, Π_n(Θ \ B_δ | y) → 0 exponentially fast as n → ∞. Proof. Fix δ > 0 small. We claim that Π_n(Θ \ B_δ | y) tends to zero exponentially fast as n → ∞. We have to estimate lim sup_{n→∞} (1/n) log ∫_{Θ\B_δ} µ_θ(C_n(y)) dΠ_0(θ) and − lim sup_{n→∞} (1/n) log ∫_{B_δ} µ_θ(C_n(y)) dΠ_0(θ). From (35), for µ_{θ_0}-almost every point y the first lim sup is bounded above, where d_δ can be taken small if δ > 0 is small. Fix 0 < ζ < h(µ_{θ_0})/2. Therefore, from Remark 3.11 we get that for µ_{θ_0}-almost every point y, ∫_{B_δ} µ_θ(C_n(y)) dΠ_0(θ) ≥ e^{−[d_δ + Π_0(B_δ) h(µ_{θ_0}) − ζ] n} for every large n ≥ 1. (41) Observe that the map δ → sup_{θ∈Θ\B_δ} ∫ log J_θ dµ_{θ_0} is monotone increasing and recall that sup_{θ∈Θ\B_δ} ∫ log J_θ dµ_{θ_0} → −h(µ_{θ_0}) as δ → 0. On the other hand, −h(µ_{θ_0}) Π_0(B_δ) − d_δ tends to zero as δ → 0 (cf. Remark 3.11). Thus, the choice (42) of δ can be made for every small δ > 0, and we just have to show that Π_n(Θ \ B_δ | y) → 0 as n → ∞. Now, equations (37) and (41) and the choice of δ in (42) ensure that, for µ_{θ_0}-almost every y ∈ Ω, the quotient ∫_{B_δ} µ_θ(C_n(y)) dΠ_0(θ) / ∫_{Θ\B_δ} µ_θ(C_n(y)) dΠ_0(θ) tends to infinity as n → ∞.
Finally, the previous expression also ensures that |Π_n(B_δ | y) − 1| = ∫_{Θ\B_δ} µ_θ(C_n(y)) dΠ_0(θ) / ∫_Θ µ_θ(C_n(y)) dΠ_0(θ) ≤ e^{n [sup_{θ∈Θ\B_δ} ∫ log J_θ dµ_{θ_0} + h(µ_{θ_0}) Π_0(B_δ) + d_δ + ζ]} decreases exponentially fast, with an exponential rate that can be taken uniform for all small δ > 0. This finishes the proof of the theorem. Proof of Theorem B. By assumption, there exists a full ν-measure subset Y′ ⊂ Y so that the limit Γ_y(θ) := lim_{n→∞} (1/n) log ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) exists for every y ∈ Y′. Given an arbitrary y ∈ Y′ we proceed to estimate the asymptotic behavior of the a posteriori measures Π_n(· | y) given by (23). Given δ > 0, by upper semicontinuity of Γ_y(·), the function Γ_y attains a maximum value α_y := max_{θ∈Θ} Γ_y(θ), and there exists d_δ > 0 (which may be chosen to converge to zero as δ → 0) so that B^y_δ := {θ ∈ Θ : Γ_y(θ) > α_y − d_δ} is a non-empty open subset. There are two cases to consider. On the one hand, if Γ_y(·) ≡ α_y is constant then B^y_δ = Θ and we conclude that Π_n(B^y_δ | y) = 1 for all n ≥ 1 and the convergence in (16) is trivially satisfied. On the other hand, if Γ_y is not constant then, as Π_0 is fully supported and absolutely continuous, ∫_Θ Γ_y(θ) dΠ_0(θ) < α_y. Actually, this allows one to estimate the double integral ∫_{Θ\B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) without making use of the features of the set B^y_δ. More precisely, using the Jensen inequality and taking the lim sup under the integral sign, one obtains the corresponding upper bound. As the ϕ_n are assumed non-negative, we conclude that Γ_y(·) is a non-negative function. In consequence, if 0 < ζ < (1/2) [α_y − ∫_Θ Γ_y(θ) dΠ_0(θ)] then ∫_{Θ\B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) ≤ e^{(α_y − ζ)n} for every large n ≥ 1. Now, in order to estimate the measures Π_n(· | y) on the nested family (B^y_δ)_{δ>0}, we observe that ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) ≥ e^{(α_y − d_δ)n} for all θ ∈ B^y_δ, thus ∫_{B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) ≥ e^{(α_y − d_δ)n} Π_0(B^y_δ) for every large n ≥ 1.
In particular, if δ > 0 is small so that 0 < d_δ < ζ, putting together the last expression, inequality (43) and the fact that 0 < Π_0(B^y_δ) < 1, one concludes that Π_n(Θ \ B^y_δ | y) = ∫_{Θ\B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) / ∫_Θ ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) ≤ ∫_{Θ\B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) / ∫_{B^y_δ} ∫_Ω e^{ϕ_n(θ,x,y)} dµ_θ(x) dΠ_0(θ) ≤ (1/Π_0(B^y_δ)) e^{−(ζ−d_δ)n} tends exponentially fast to zero, as claimed. Hence, any accumulation point of (Π_n(· | y))_{n≥1} (in the weak* topology) is supported on the compact set argmax Γ_y, which proves the first statement in the theorem. As the second assertion is immediate from the first one, this concludes the proof of the theorem. Proof of Theorem C. Consider the family of loss functions ℓ_n : Θ × X × Y → R defined by (17), associated to an almost additive sequence Φ = (ϕ_n)_{n≥1} of continuous and non-negative observables ϕ_n : Θ × X × Y → R_+ satisfying assumptions (H1)-(H2): (H1) for each θ ∈ Θ and x ∈ X there exists a constant K_{θ,x} > 0 so that, for every y ∈ Y, ϕ_n(θ,x,y) + ϕ_m(θ,x,T^n(y)) − K_{θ,x} ≤ ϕ_{m+n}(θ,x,y) ≤ ϕ_n(θ,x,y) + ϕ_m(θ,x,T^n(y)) + K_{θ,x}; (H2) ∫ K_{θ,x} dµ_θ(x) < ∞ for every θ ∈ Θ. The a posteriori measures are Π_n(E | y) = ∫_E ψ_n(θ, y) dΠ_0(θ) / ∫_Θ ψ_n(θ, y) dΠ_0(θ), where the sequence ψ_n(θ, y) = ∫_Ω ϕ_n(θ, x, y) dµ_θ(x) is almost additive in the y-variable. Indeed, this family satisfies ψ_n(θ, y) + ψ_m(θ, T^n(y)) − ∫ K_{θ,x} dµ_θ(x) ≤ ψ_{m+n}(θ, y) ≤ ψ_n(θ, y) + ψ_m(θ, T^n(y)) + ∫ K_{θ,x} dµ_θ(x) for every m, n ≥ 1, every θ ∈ Θ and y ∈ Y. Now, for each fixed θ ∈ Θ, we note that the sequence of observables (ψ_n(θ, ·) + ∫ K_{θ,x} dµ_θ(x))_{n≥1} is sub-additive. Hence, Kingman's sub-additive ergodic theorem ensures that the limit lim_{n→∞} ψ_n(θ, y)/n exists and is ν-almost everywhere constant, equal to the non-negative function ψ_*(θ) := inf_{n≥1} (1/n) ∫ ψ_n(θ, y) dν(y). The function ψ_* is measurable and integrable, because it satisfies 0 ≤ ψ_* ≤ ψ_1.
Thus, taking the limit under the integral sign and noticing that the denominator is a normalizing term, we conclude that lim_{n→∞} Π_n(E | y) = ∫_E ψ_*(θ) dΠ_0(θ) / ∫_Θ ψ_*(θ) dΠ_0(θ) for every measurable subset E ⊂ Θ. This proves the first statement of the theorem. We proceed to prove the level-1 large deviations estimates on the convergence of the a posteriori measures Π_n(· | y) to Π_*, whenever T is a subshift of finite type and ν is a Gibbs measure associated to a Lipschitz continuous potential ϕ. We will make use of the following instrumental lemma, whose proof is left as a simple exercise to the reader. Lemma 4.2. Given arbitrary functions A, B : Ω → R_+ and constants a, b, δ > 0 and 0 < ξ < b, the following decomposition holds: Let us return to the proof of the large deviations estimates. Given g ∈ C(Θ, R), it is not hard to check using (44) and (45) that ∫ g dΠ_n(· | y) = ∫ g(θ) (ψ_n(θ,y)/n) dΠ_0(θ) / ∫_Θ (ψ_n(θ,y)/n) dΠ_0(θ) and ∫ g dΠ_* = ∫ g(θ) ψ_*(θ) dΠ_0(θ) / ∫_Θ ψ_*(θ) dΠ_0(θ). Fix δ > 0. In order to provide an upper bound for lim sup_{n→∞} (1/n) log ν({ y ∈ Ω : |∫ g dΠ_n(· | y) − ∫ g dΠ_*| > δ }) we will decompose the set { |∫ g dΠ_n(· | y) − ∫ g dΠ_*| > δ } as in Lemma 4.2. For that purpose, fix 0 < ξ < min_{θ∈Θ} ψ_*(θ). For each fixed θ ∈ Θ the family Ψ_θ := (ψ_n(θ, ·))_n is almost-additive. Hence Theorem 3.6 implies that lim sup_{n→∞} (1/n) log ν({ y ∈ Ω : |ψ_n(θ, y)/n − ψ_*(θ)| ≥ ξ }) satisfies the corresponding thermodynamic upper bound. Analogously, lim sup_{n→∞} (1/n) log ν({ y ∈ Ω : (1 / [∫_Θ ψ_*(θ) dΠ_0(θ) − ξ]) |∫ g(θ) (ψ_n(θ,y)/n) dΠ_0(θ) − ∫_Θ g(θ) ψ_*(θ) dΠ_0(θ)| ≥ δ/2 }) ≤ lim sup_{n→∞} (1/n) log ν({ y ∈ Ω : |∫ (ψ_n(θ,y)/n) dΠ_0(θ) − ∫_Θ ψ_*(θ) dΠ_0(θ)| ≥ [∫_Θ ψ_*(θ) dΠ_0(θ) − ξ] δ / (2 ‖g‖_∞) }) ≤ sup_{θ∈Θ} sup { −P(σ, ϕ) + h_η(σ) + ∫ ϕ dη : η ∈ P²_{θ,ξ,δ} }, where η ∈ P²_{θ,ξ,δ} ⊂ M_σ(Ω) if and only if |F(η, Ψ_θ) − ψ_*(θ)| ≥ [∫_Θ ψ_*(θ) dΠ_0(θ) − ξ] δ / (2 ‖g‖_∞). The third term in the decomposition of Lemma 4.2 is estimated exactly as (47), and we obtain the analogous bound with P³_{θ,ξ,δ} in place of P²_{θ,ξ,δ}, where η ∈ P³_{θ,ξ,δ} ⊂ M_σ(Ω) if and only if |F(η, Ψ_θ) − ψ_*(θ)| ≥ [∫_Θ ψ_*(θ) dΠ_0(θ) − ξ]² δ / (2 ∫_Θ ψ_*(θ) dΠ_0(θ)).
Altogether, if 0 < δ < 1 and ξ = δ · min{inf θ∈Θ ψ * (θ), Θ ψ * (θ) dΠ 0 (θ)} > 0, estimates (47)- (49) imply that there exists c > 0 so that lim sup n→∞ 1 n log ν y ∈ Ω : g dΠ n (· | y) − g dΠ * δ Finally, it remains to guarantee that the right hand-side above is strictly negative. Notice that as F(ν, Ψ θ ) = ψ * (θ), the uniqueness of the equilibrium state (which is an invariant Gibbs measure) for the potential ϕ and the continuity of the map η → F(η, Ψ θ ) imply that the set B θ (δ) := {η ∈ M σ (Ω) : |F(η, Ψ θ ) − ψ * (θ)| cδ} is compact and disjoint from {ν}, hence d Mσ(Ω) ν, B θ (δ) > 0, for each θ ∈ Θ. Hence, under the additional assumption that both maps θ → F(η, Ψ θ ) = inf n≥1 Statistical Theory, A Concise Introduction Nonequilibrium thermodynamics and information theory: basic concepts and relaxing dynamics Nonadditive thermodynamic formalism: equilibrium and Gibbs measures Thermodynamic formalism and applications to dimension theory, Birkhäuser Multifractal analysis of asymptotically additive sequences Invariant measure for quantum trajectories A general framework for updating belief distributions A formula with some applications to the theory of Lyapunov exponents Equilibrium states and the ergodic theory of Anosov diffeomorphisms Lyapunov exponents for Quantum Channels: an entropy formula and generic properties Recursive pathways to marginal likelihood estimation with prior-sensitivity analysis Large Deviation Techniques in Decision, Simulation and Estimation Entropic Inference and the Foundations of Physics How Gibbs distributions may naturally arise from synaptic adaptation mechanisms. 
Acknowledgments. The authors are indebted to the anonymous referees for the careful reading of the manuscript and many suggestions that helped to improve the presentation of the paper. AOL and SRCL were partially supported by a CNPq grant. PV was partially supported by CMUP (UID/MAT/00144/2019), which is funded by FCT with national (MCTES) and European structural funds through the programs FEDER, under the partnership agreement PT2020, and by Fundação para a Ciência e a Tecnologia (FCT), Portugal, through the grant CEECIND/03721/2017 of the Stimulus of Scientific Employment, Individual Support 2017 Call.