Markov decision processes with observation costs
Christoph Reisinger, Jonathan Tam
2022-01-19

We present a framework for a controlled Markov chain where the state of the chain is only revealed at chosen observation times, each of which incurs a cost. Optimal strategies therefore involve the choice of observation times as well as the subsequent control values. We show that the corresponding value function satisfies a dynamic programming principle, which leads to a system of quasi-variational inequalities (QVIs). Next, we give an extension where the model parameters are not known a priori but are inferred from the costly observations by Bayesian updates. We then prove a comparison principle for a larger class of QVIs, which implies uniqueness of solutions to our proposed problem. We utilise penalty methods to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications which illustrate our framework.

In this article, we consider a controlled Markov chain problem, where a cost is incurred to reveal the state of the chain at a given moment in time. We assume that any changes to the control can only occur at these observation times. Hence, in addition to searching for the optimal action, the user also seeks the optimal intervals between successive observations of the state. A continuous stream of information is often an assumption taken for granted in both fully and partially observable control problems. However, in many practical applications, due to technical and labour difficulties, it can be expensive or impractical to obtain such measurements. Examples include the virological state of patients [11, 20, 21], environmental measurements on river sediments [25] and the biological growth dynamics of organisms [27]. In consumer spending, it is desirable to purchase a good product at a low search cost [14]. Such searches are sequential by nature, so the assumption of continuous observations does not apply.

To the best of our knowledge, the earliest works on observation controls are [3, 4]. There, a Brownian motion is to be stopped upon the exit from a given set, but each inspection of the Brownian motion comes with a cost. It was shown that the problem reduces to a free boundary problem and that the corresponding value function satisfies a quasi-variational inequality (QVI), with further analysis demonstrating the well-posedness of the problem. More recently, Dyrssen and Ekström incorporated observation costs in hypothesis testing for the drift of a diffusion, and characterised the value function as the unique fixed point of an associated operator [12]. Other works concerning observation controls are motivated by specific applications: Winkelmann et al. explored optimal diagnosis and treatment scheduling for HIV-1 patients, based on the trade-off between treatment cost and productivity loss across different countries [11, 20, 21]; Yoshioka et al. focused on a variety of environmental management problems, including the modelling of replenishing sediment storage in rivers, monitoring algae population dynamics and the biological growth of fishery resources [24, 25, 26, 27].
The phenomenon of sporadically observing the state process is also modelled in mathematical finance under the term rational inattention, where portfolio adjustments are assumed to occur infrequently with a utility cost proportional to the user's assets [1, 2, 10].

Our goal in this article is to systematically develop a framework for the observation control problems described above, utilising first principles to establish dynamic programming for the value function, and exploring suitable numerical approximations to the solutions of the resulting systems of QVIs. To avoid intuition being obscured by technical complications, we focus on discrete-time Markov chains, which often serve as valid approximations for diffusions in continuous time [16]. In our model, we consider an underlying controlled Markov chain X = {X_n}_n on a probability space (Ω, F, P), together with a corresponding controlled observation filtration. By passing to the conditional distribution µ_n of the state given the observed information, the estimation and control problem can be separated. This is a common procedure in solving partially observable control models [8, 17]. In general, the conditional distribution is characterised by the Zakai or Kushner-Stratonovich equations using standard filtering techniques [5, 13]. We will show in Section 2 that in our case, each realisation of µ_n only depends on the last observation time k, the state of the chain x and the control of the chain i. This gives rise to a higher-dimensional value function than in the standard case due to the presence of k and i. More specifically, in the finite horizon case, the reward functional that we aim to optimise depends on the tuple (m, (k, x, i), α), where m represents the current time, and α is a double sequence representing the observation times and the control of the chain.

Some analysis of the value function as set up above can be found in [3, 4], but their applications were limited to optimal stopping problems concerning Brownian motion. Other existing works on the observation control model assume stationarity in the problem. This leads to a reward functional that only depends on the triplet (x, i, α) [12, 21, 24, 25, 26, 27]. Whilst this formulation gives an overall lower-dimensional problem, the setup assumes that the user is in possession of accurate and updated information on the state process at initialisation. This excludes solving for scenarios with unintended large gaps between observations, which can lead to qualitatively different optimal actions. For example, lockdowns imposed due to the COVID-19 pandemic have led to delayed consultations for patients, who are more likely to appear in a worsened state at the time of diagnosis. We adopt the former, more general formulation to include such incidences in our model. The qualitative behaviour of the value function under these scenarios is demonstrated in our numerical experiments in Section 5.

Having set up the initial framework and established dynamic programming in Section 2, we obtain a system of discrete quasi-variational inequalities (QVIs). In Section 4, we investigate the well-posedness of such systems. We can rewrite the QVIs in the more abstract form (1.3), in which M : R^{N×L×d} → R^{N×L} is given by (Mu)^n_l = (A_n ū^q)_l − c, where q ∈ {1, . . . , N} is some fixed index, c is a constant, each A_n ∈ R^{L×L} is a strictly substochastic matrix, and F_i : R^{N×L} → R^{N×L} satisfies a monotonicity property specified in Section 4. The system of QVIs (1.3) shares a similar structure with that of a monotone interconnected obstacle system, which typically features in optimal switching problems [18].
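To make the structure of the coupling operator concrete, the following is a minimal NumPy sketch of the intervention operator M described above. It assumes that ū^q denotes the componentwise maximum of the candidate solution over the controls at the fixed time index q; the array shapes, names and numbers are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def intervention_operator(u, A, c, q=0):
    """Evaluate (Mu)^n_l = (A_n @ u_bar^q)_l - c for every time index n.

    u : array of shape (d, N, L)   -- candidate values u^n_{i,l} for each control i
    A : array of shape (N, L, L)   -- strictly substochastic matrices A_n
    c : float                      -- observation cost
    q : int                        -- fixed time index at which the coupling acts
    """
    u_bar_q = u.max(axis=0)[q]          # componentwise max over controls at index q, shape (L,)
    return A @ u_bar_q - c               # shape (N, L)

# toy example with d = 2 controls, N = 3 time indices, L = 4 states
rng = np.random.default_rng(0)
d, N, L = 2, 3, 4
u = rng.random((d, N, L))
P = rng.random((N, L, L))
A = 0.95 * P / P.sum(axis=2, keepdims=True)   # discounting makes each A_n strictly substochastic
print(intervention_operator(u, A, c=0.1))
```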
The well-posedness of interconnected obstacle systems is treated in [19], which establishes existence via penalisation [22, 23]. The use of penalty methods over traditional policy iteration is due to the fact that invertibility of the relevant matrices is not guaranteed in the presence of the obstacle operator. The extra matrix A_n appearing in the obstacle operator M represents the (discounted) transition matrix over the unobserved period of time. We give a modified proof of the comparison principle appearing in [19], taking into account the effect of the matrix A_n on the coupling in (1.3). The same modifications apply to the penalty approximations, which yields the well-posedness of the observation control problem. We also note that although the system of Bellman-type equations arising from the lower-dimensional formulation can be solved via a fixed-point method by bounding the total number of observations [12], this is not applicable to our model due to the presence of the extra time-lag variable k in the value function.

Lastly, as a further novel extension, we incorporate parameter uncertainty into the model dynamics and consider sequential updates to the control in Section 3. In the literature, the adoption of Bayesian principles with observation control has been explored in the context of reinforcement learning [6]: an algorithm for solving the optimal policy and estimator in parallel was proposed, but an equivalent Bellman-type equation was not established. Along similar ideas on the trade-off between cost and information, Cohen et al. consider a cost constraint that arises from the tracking error relative to the optimal Bayesian estimator, and relate the optimal band of inaction to a two-sided hypothesis test [9]. Here, we consider the case where the transition probability of the Markov chain X depends on an unknown parameter θ. In general, the standard approach in Bayesian control problems is to reformulate the problem as a partially observable optimal control problem [15], and then proceed to the separation of estimation and control as before. We apply the same principles to the observation control problem and derive a dynamic programming equation involving the 'prior' and 'posterior' distributions of θ. In particular, the reward functional is now augmented to depend on the tuple (m, (k, x, i), π, α), where π is a measure over the parameter space. Whilst the resulting equations are often infinite-dimensional in nature, the problem can be reduced back to the finite-dimensional case if one considers conjugate distributions for π. We demonstrate the solution of a model problem on a random walk involving Beta conjugate priors in Section 5 and investigate its corresponding numerical properties.

The main contributions of our article are as follows:
- We expand on the probabilistic approach to the observation control problem, and extend the results first seen in [3, 4]. We subsequently derive, via dynamic programming, a system of discrete QVIs that the value function satisfies.
- Following the approach of [19], we prove a comparison principle for the system of discrete QVIs and establish an approximation to the solutions via penalisation and semismooth Newton methods [7]. This also provides well-posedness for the observation control problem.
- We provide a novel Bayesian parametric setup of the observation control problem. This incorporates a distribution over the unknown parameters in the value function, which is dynamically updated after each observation.
- We demonstrate the numerical performance of our model with three numerical experiments, one of which is a time-discretised version of the HIV-treatment problem appearing in [21]. We show that our model extends the original one by solving for scenarios with large observation gaps, and that the optimal control coincides in scenarios that are covered by both models.

The rest of this paper is organised as follows. Section 2 sets out the framework for the Markov chain observation control model and establishes the corresponding set of discrete QVIs. A model problem with an explicit solution is also provided to illustrate the setup. The Bayesian extension is outlined in Section 3. In Section 4 we prove a comparison principle for a class of discrete QVIs which subsumes the QVIs obtained in Section 2. The penalty method is also introduced as an approximation for the QVI. Finally, we provide numerical experiments for our observation control model in Section 5.

In this section, we establish the framework for the observation control problem for Markov chains. We consider both the finite and the infinite horizon problem. We introduce the notion of observation filtrations in the spirit of [12], as well as the corresponding class of observation controls. We then formulate the control problem and derive the relevant equations for the value function with the use of the dynamic programming principle. We conclude the section with a model problem that yields a closed-form solution.

Let X = X(α) be a controlled Markov chain on a probability space (Ω, F, P), taking values in a state space S. We consider the discrete-time case and write X = (X_n)_{n∈N}. The control α is in general an adapted process with values in a finite control set I := {1, . . . , d}, and determines the distribution of the Markov chain via its transition kernel P = P(i) = P_i = (p_{xy}(i))_{x,y∈S}, where i ∈ I. We will use the notation p^{(n)}_{xy}(i) to denote the entries of P_i^n, i.e. the n-step transition probabilities. We also assume that the transition kernel is time-homogeneous and known a priori.

In the observation control model, access to the realisation of each X_n comes with a cost. One would therefore expect that observations will generally not occur at every time point, so that the information available to us is restricted. It is then natural to consider a filtration that is no finer than the natural filtration generated by X. Hence, we define the following notion of an observation filtration, which is based on that appearing in [12].

Proposition 2.1. Let X be a Markov chain on (Ω, F, P). Let τ = (τ_k)_{k=1}^∞ be a strictly increasing sequence of random times (in the sense that τ_k(ω) < τ_{k+1}(ω) for every ω ∈ Ω). Denote also by τ_0 the initial time of the chain. Define, for each n ≥ 0, the σ-algebra F^{X,τ}_n generated by the observation data available up to time n, that is, by the observation times τ_k ≤ n and the corresponding observed states. Then the sequence (F^{X,τ}_n)_n forms a filtration.

Proof. See appendix.

By construction, each τ_k is a stopping time with respect to F^{X,τ}. In the context of the observation control model, we would like to determine observation times based only upon information amassed from prior observations up to the current time. This motivates the following definition.

Definition 2.2. We say that τ is an observation sequence for X if each τ_k is F^{X,τ}_{τ_{k−1}}-measurable. In this case F^{X,τ} is called an X-observation filtration.

For the rest of this article, we will only consider τ which are observation sequences. Such sequences are also predictable.
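As a brief illustration of the controlled transition kernel introduced above, the following sketch computes the n-step transition probabilities p^{(n)}_{xy}(i) by matrix powers; this also anticipates the characterisation of the conditional distribution µ_n given later in this section. The two control matrices are made up for illustration and are not taken from the paper.

```python
import numpy as np

def n_step_probabilities(P_i, n):
    """n-step transition probabilities p^{(n)}_{xy}(i), i.e. the entries of P_i^n."""
    return np.linalg.matrix_power(P_i, n)

# two controls on a three-state chain; the matrices are made up for illustration
P = {1: np.array([[0.7, 0.3, 0.0],
                  [0.2, 0.5, 0.3],
                  [0.0, 0.4, 0.6]]),
     2: np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])}

# conditional distribution of the state four steps after last observing state 1 under control i = 2
mu = n_step_probabilities(P[2], 4)[1]
print(mu, mu.sum())   # a probability vector over the three states
```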
Proposition 2.3. If τ is an observation sequence for X, then each τ_k is a predictable stopping time with respect to F^{X,τ}, i.e. for each k, {τ_k = n} ∈ F^{X,τ}_{n−1} for all n.

Proof. See appendix.

Since the influx of information is decided exogenously, the observation sequence should form part of the control. We define the set of admissible controls as follows.

Definition 2.4. Let τ be an observation sequence for X. For each τ_k, let ι_k be an I-valued, F^{X,τ}_{τ_k}-measurable random variable. An admissible control is a double sequence of the form α = (τ_n, ι_n)_n. The set of admissible controls is denoted by A. Define also the corresponding piecewise constant process I = (I_n)_n of the form I_n = ι_k for τ_k ≤ n < τ_{k+1}.

Note that double sequence controls are often seen in optimal switching problems (see, for example, [18, Chapter 5.3]), but here the process is in addition adapted to the observation filtration. The following proposition justifies the intuitive idea that the observation filtration indeed contains no more information than the natural filtration.

Proposition 2.5. Let F^X be the natural filtration generated by X. If the sequence τ is predictable (with respect to F^{X,τ}), then F^{X,τ} is coarser than F^X, i.e., F^{X,τ}_n ⊆ F^X_n for every n.

Proof. See appendix.

At any given time n ≥ 1, we will often consider the most recent observation that occurred before time n. We employ the following notation when considering quantities related to such observations.

Definition 2.6. Let τ be an observation sequence for X. For any n ≥ 0, define the random variable τ̃_n as the most recent observation time up to and including time n, i.e. τ̃_n = max{τ_k : τ_k ≤ n}. Similarly, define also ι̃_n = ι_{τ̃_n} = I_n and X̃_n = X_{τ̃_n}.

Thus one can view the tilde notation (τ̃_n, ι̃_n, X̃_n) as representing 'the most recent data available'. We can see that the random variable τ̃_n is F^{X,τ}_n-measurable, since for m ≤ n the event {τ̃_n = m} can be written as a countable union of events in F^{X,τ}_n. Although τ̃_n is not an F^{X,τ}-stopping time, so that F^{X,τ}_{τ̃_n} is not well-defined, we can still utilise the Markov property of the underlying chain X to obtain the relation in Lemma 2.7: conditioning on the observation filtration is equivalent to conditioning on the most recent observation data. Its proof follows from the decomposition of the conditional expectation across the countable events that generate F^{X,τ}_m and the Markov property of X.

To formulate the observation control problem, we have to define the conditional distribution of X_n given its past observations.

Definition 2.8. Let P(S) be the set of probability measures over S. Define the P(S)-valued (controlled) process (µ_n)_n as the conditional distribution of X_n given its past observation history up to and including time n.

We can characterise the random measures (µ_n)_n as follows. As S is countable, it suffices to consider the singleton sets {y} for any y ∈ S. By Lemma 2.7, µ_n({y}) = P(X_n = y | τ̃_n, ι̃_n, X̃_n), which is determined by the (n − τ̃_n)-step transition probabilities of the chain under the control ι̃_n.

Note that in general the constant c in (4.2) can also depend on i, n and l (this is indeed the case for (2.30), see Remark 4.7), but we shall only consider a constant c for the proofs in this section, for ease of notation. The fixed index q can be arbitrary; the idea is that the operator M only couples the solution u at a cross-section of values along the time domain. In the following, we adapt the argument in [19] to prove a comparison principle for the QVI (4.1). The main ingredient of the proof is to bound the coupling terms A_n(ū^q − v̄^q) by their maximum. If we extended the problem to an infinite domain, it would not be obvious that this maximum is achieved. However, recall from Remark 2.15 that additional spatial boundary conditions have to be imposed for a closed system. This reduces the system of QVIs back into the form of (4.1).
The comparison principle is shown in the proposition below. In its proof, let M := max_{i,l,n} (u^n_{i,l} − v^n_{i,l}) =: u^m_{j,k} − v^m_{j,k} denote the maximal difference between the candidate sub- and supersolutions, and let γ < 1 be the maximum of the row sums of A_m. Combining the two inequalities obtained from the sub- and supersolution properties implies the existence of some j_u, j_v ∈ I and l* ∈ {1, . . . , L} at which the maximality of M would be contradicted if the obstacle were active. Hence, we must have F_j(u_j)^m_k ≤ 0; but then, since v is a supersolution, the monotonicity property yields M ≤ 0.

Now we present a penalty approximation to the QVI (4.1). Consider the following penalised problem: for a penalty parameter ρ ≥ 0, find u^ρ = (u^ρ_i)_{i∈I} such that

F_i(u^ρ_i) − ρ π(Mu^ρ − u^ρ_i) = 0 for all i ∈ I,   (4.10)

where the penalisation function π : R → R is continuous, non-decreasing with π|_{(−∞,0]} = 0 and π|_{(0,∞)} > 0, and is applied elementwise. We show that for each fixed ρ, (4.10) satisfies a comparison principle. This implies uniqueness for Problem 4.3. The argument follows similarly to the approach in [19] and Proposition 4.2.

Proposition 4.4. For any penalty parameter ρ ≥ 0 and any c ≥ 0, if u^ρ = (u^ρ_i)_{i∈I} (resp. v^ρ = (v^ρ_i)_{i∈I}) satisfies

F_i(u^ρ_i) − ρ π(Mu^ρ − u^ρ_i) ≤ 0   (resp. ≥ 0),   (4.11)

then u^ρ ≤ v^ρ.

Proof. As in the previous proposition, let M := max_{i,l,n} (u^{ρ,n}_{i,l} − v^{ρ,n}_{i,l}) =: u^{ρ,m}_{j,k} − v^{ρ,m}_{j,k}. Bounding the coupling terms and rearranging, it then follows from the sub-/supersolution properties of u^ρ and v^ρ that M ≤ 0.

Theorem 4.6. For any fixed c ≥ 0, the solution to (4.10) converges monotonically from below to a function u ∈ R^{d×N×L} as ρ → ∞. Moreover, u solves the discrete QVI (4.1) if c > 0.

Remark 4.7. The QVI (2.30), derived from the infinite horizon problem of the observation control model, is an instance of the QVI (4.1). Specifically, we have L = |S|, with the operators F_i and M identified with the corresponding terms of (2.30). Moreover, it is straightforward to see that F_i satisfies the monotonicity condition (4.3). For uniqueness of solutions to hold, additional boundary conditions have to be imposed to close the system. This is demonstrated in the numerical experiments section below.

In this section, we apply our observation cost framework to three numerical experiments. Sections 5.1 and 5.2 analyse two infinite horizon problems. The roadmap for both is as follows: we first set up the discrete QVIs arising from the problem, which are then approximated by the penalised problem. As in [19], we employ the penalty function π(x) = x^+. The solution of the penalised problem is in turn approximated iteratively with semismooth Newton methods [23]. Formally speaking, starting with an initialisation v^{(0)} to the penalised problem, we obtain the next iterate v^{(k+1)} by solving the linearised equation L^ρ(v^{(k)})(v^{(k+1)} − v^{(k)}) = −G^ρ(v^{(k)}), where G^ρ denotes the residual of the penalised equation and L^ρ denotes the generalised derivative of the function G^ρ. For each example, we examine the numerical performance of the penalty method and Newton iterations, as well as the effects of the observation cost on the qualitative behaviour of the solutions. Section 5.3 considers the Bayesian formulation of the observation control problem over a finite horizon. The solutions are obtained through backwards recursion from the terminal conditions. We examine the impact that the extra parameter uncertainty has on the optimal trajectories.

Consider an integer-valued random walk (X_n)_{n∈N} with two regimes, representing a positive and a negative drift respectively, so that the control space is I = {+1, −1}. The probability of each step is parametrised by θ: for any x ∈ S = Z, under the control i = +1 the walk moves from x to x + 1 with probability θ and to x − 1 with probability 1 − θ, while under i = −1 these probabilities are interchanged. We also adopt the reward function f(y) = 1/(|y| + 1). The mass of this reward function f is concentrated around the origin, so naturally the optimal regime is one that reverts the process back towards the origin. For this example, we consider the infinite horizon problem.
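Before specialising to the random walk, the following sketch illustrates the penalised equation and the semismooth Newton step described above in a simplified setting. It solves F(v) − ρ π(ψ − v) = 0 with π(x) = x^+, where a fixed obstacle vector ψ stands in for the coupled term Mu of the QVI and F is a linear map v ↦ Bv − b. This is a simplified stand-in rather than the authors' MATLAB implementation; the matrix B, vector b and obstacle ψ are made up for the example.

```python
import numpy as np

def semismooth_newton(B, b, psi, rho, tol=1e-8, max_iter=50):
    """Solve the penalised equation  B v - b - rho * max(psi - v, 0) = 0
    with a semismooth Newton iteration (penalty pi(x) = x^+ applied elementwise).
    The fixed obstacle psi stands in for the coupled term Mu of the paper."""
    v = np.linalg.solve(B, b)                          # rho = 0 problem as initial guess
    for it in range(max_iter):
        active = (psi - v) > 0                         # where the penalty is switched on
        G = B @ v - b - rho * np.maximum(psi - v, 0)   # residual of the penalised equation
        J = B + rho * np.diag(active.astype(float))    # generalised derivative of G
        dv = np.linalg.solve(J, -G)
        v = v + dv
        if np.linalg.norm(dv) <= tol * max(np.linalg.norm(v), 1.0):
            return v, it + 1
    return v, max_iter

# small diagonally dominant system with a constant obstacle
B = np.array([[1.2, -0.1, 0.0],
              [-0.1, 1.2, -0.1],
              [0.0, -0.1, 1.2]])
b = np.array([0.1, 0.0, 0.1])
psi = np.array([0.5, 0.5, 0.5])
v, iters = semismooth_newton(B, b, psi, rho=1e4)
print(v, iters)
```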
Recall that the discrete QVI (2.30) holds for all m ≥ 0, x ∈ S and i ∈ I. Note that there exists a path from x to y over m units of time if and only if m ≥ |y − x| and m ≡ y − x (mod 2). If S^x_m denotes the set of states that can be reached from x after m units of time, then for y ∈ S^x_m the transition probabilities are given by

p^{(m)}_{xy}(+1) = C(m, r) θ^r (1 − θ)^{m−r},   p^{(m)}_{xy}(−1) = C(m, r) (1 − θ)^r θ^{m−r},

where C(m, r) is the binomial coefficient and r = (m + y − x)/2. Hence the QVI can be written out in full with these transition probabilities substituted in. In general, r depends on x, y and m, but we suppress the subscripts for ease of notation.

To close the system and ensure a unique solution, we enforce the following time and spatial boundary conditions. We impose a reflecting boundary at x = ±L, where L is suitably large; in particular, the QVI (5.5) for the states x = ±L uses the reflected transition probabilities (5.9) instead. For the time boundary, we enforce an observation at some large N > 0. The terminal condition (5.10) then applies for −L < x < L, with r′ = (N + y − x)/2. The analogous equations hold for the spatial boundary x = ±L, but with the transition probabilities (5.9). These terminal conditions can be interpreted as the largest possible interval between two observations.

We now proceed to solve the penalised problem for the system (5.8), with boundary conditions (5.9) and (5.10), through the use of semismooth Newton methods. To initialise the iteration, we solve the uncoupled system (5.11), with the spatial boundary transition probabilities (5.9) and the time boundary condition (5.12). The system (5.11) corresponds to the penalised equation with ρ = 0. The uncoupled time boundary condition is equivalent to enforcing an observation but with no switching (i.e., assuming that v̄ = v_i in each equation for v_i). The iteration terminates once a relative tolerance threshold of 10^{-8} is reached.

We investigate the numerical performance of the described methods for the case θ = 0.75, γ = 0.99, L = 50 and N = 500, across different cost parameters c_obs. Computations are performed using MATLAB R2019b. The numerical results are shown in Table 5.1. Row (a) shows that the number of Newton iterations required to reach the tolerance threshold is independent of the size of the penalty parameter ρ. Fewer iterations are required for more extreme values of c_obs, but the overall number of iterations remains low across different observation costs. Row (b) shows the increments ‖v^ρ − v^{2ρ}‖_∞. The values clearly demonstrate first-order convergence of the penalisation error with respect to the penalty parameter ρ, which is in line with the theoretical results presented in [19, Theorems 3.9 and 4.2].

We now discuss the qualitative behaviour of the solution. It is clear that if the chain is observed to be at a positive state, then the control should be switched to i = −1 for a negative drift, and vice versa. Table 5.2 lists the optimal observation time gap for selected states across different observation costs c_obs. As the problem is symmetric by construction, it is sufficient to examine only the behaviour for the positive states. In general, the optimal observation time increases as c_obs increases. A longer unobserved period of time then leads to a lower average reward. This is illustrated in Figure 5.3, where the function n ↦ v^n_{−1,30} is plotted for various values of c_obs. In the absence of an observation cost, i.e. for c_obs = 0, the optimal observation time equals the magnitude of the last observed state, as there is no need to observe until it is possible for the walk to cross the origin again.
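The reachability condition and the binomial form of the m-step transition probabilities used in this example can be encoded directly; the following is a small sketch, under the convention assumed here that θ is the up-step probability when i = +1 (and the down-step probability when i = −1).

```python
from math import comb

def m_step_prob(x, y, m, theta, i):
    """m-step transition probability of the two-regime walk from x to y.
    Under control i = +1 the up-step probability is assumed to be theta,
    and under i = -1 it is 1 - theta."""
    if m < abs(y - x) or (m - (y - x)) % 2 != 0:
        return 0.0                          # y is not reachable from x in m steps
    r = (m + y - x) // 2                    # number of up-steps along any such path
    p_up = theta if i == +1 else 1.0 - theta
    return comb(m, r) * p_up**r * (1.0 - p_up)**(m - r)

# sanity check: the probabilities over all reachable states sum to one
m, x, theta = 6, 0, 0.75
total = sum(m_step_prob(x, y, m, theta, +1) for y in range(x - m, x + m + 1))
print(total)   # 1.0 up to rounding
```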
In this subsection, we implement our observation control framework in an extension of an HIV-treatment scheduling problem that appeared in [21]. As alluded to in the introduction, the stationarity of the reward function in the original model (see (5.13) below) implicitly assumes that the observer is given the state of the underlying process at initialisation. However, in practice many scenarios of interest do not satisfy this assumption. Our formulation extends the above by allowing, in addition, that the user can approach the problem at initialisation with outdated or sub-optimal information. We demonstrate that such initial conditions can lead to different qualitative behaviours in the value function through time. We also examine the numerical performance of the penalty method when applied to the system of QVIs for this larger system, compared to that in Section 5.1.

Let us briefly describe the original problem in [21]. A continuous-time Markov chain is used to model the virus levels of HIV-positive patients over time. With two types of treatment available, the control space is I = {0, 1, 2} (where 0 represents no treatment given). Four virus strains are considered: WT denotes the wild type (susceptible to both treatments), R1 and R2 denote the strains resistant to Treatment 1 and Treatment 2 respectively, and HR denotes the strain that is highly resistant to both. The level of each strain is represented by the states 'none' (0), 'low' (l), 'medium' (m) and 'high' (h). Therefore, the state space for the Markov chain is S = {0, l, m, h}^4 ∪ {*}, where the asterisk represents patient death. Note in particular that * is an absorbing state. The goal in the original model is then to minimise a cost functional J : S × I → R of the form

J = E[ Σ_j ( ∫_{τ_j}^{τ_{j+1}} e^{−γs} c(X_s, ι(X_{τ_j})) ds + e^{−γτ_{j+1}} c_obs ) ],   (5.13)

where the cost function c : S × I → R is a linear combination of the productivity loss resulting from each patient's condition and their received treatment.

To adapt the model above to our framework, we first discretise the Markov chain, choosing each step to represent one day. We then take the model parameters from the original paper [21, Section 3], which provides the transition rate matrices {Q_i}_{i∈I} and the cost function c(x, i). The transition matrices {P_i}_{i∈I} are then given by P_i = e^{Q_i} (as the time unit in [21] is one day). For illustration purposes, a sparsity plot of the transition matrix P_0 is shown in Figure 5.4. As our framework takes the form of a maximisation problem, we choose f = −c for the reward function. We can now formulate our problem in terms of the QVI (5.14).

We now follow the same procedure as in Section 5.1 to obtain a numerical solution. Note that for this problem the spatial domain is finite, and we also have a natural spatial boundary arising from the absorbing death state *: for all m ≥ 0 and i ∈ I, the value at * is determined by a constant a representing the average GDP loss due to patient death [20, 21]. A time boundary is once again enforced at some large time N > 0, which can be interpreted as a mandatory observation at time N.

We now solve the associated penalised problem with semismooth Newton methods. As in Section 5.1, we choose the initial guess to be the solution to the penalised problem with ρ = 0, with uncoupled time boundary conditions. The iterations terminate once a relative tolerance threshold of 10^{-8} is reached. The numerical experiments are performed in MATLAB R2019b.
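The discretisation step P_i = e^{Q_i} can be carried out with a matrix exponential; below is a brief sketch using SciPy. The 3×3 generator shown is a made-up toy (the actual rate matrices Q_i come from [21, Section 3] and are not reproduced here); the check that the rows of P sum to one is a useful sanity test after exponentiating a generator.

```python
import numpy as np
from scipy.linalg import expm

def daily_transition_matrix(Q):
    """Discretise a continuous-time rate matrix Q into a one-day transition
    matrix P = exp(Q), as done for the treatment-scheduling example."""
    P = expm(Q)
    assert np.allclose(P.sum(axis=1), 1.0)      # rows of P remain probability vectors
    return P

# toy 3-state generator (rows sum to zero); the true Q_i come from [21, Section 3]
Q = np.array([[-0.10, 0.08, 0.02],
              [0.05, -0.15, 0.10],
              [0.00, 0.00, 0.00]])              # last state absorbing, like the death state *
P = daily_transition_matrix(Q)
print(P)
```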
Table 5.6 shows the numerical solution for different values of N and c_obs across different penalty parameters ρ. Row (a) shows that, much like in the random walk experiment of Section 5.1, the number of iterations remains constant with respect to ρ. However, the number of Newton iterations required to reach the target threshold is much higher, at approximately 20 iterations. Figure 5.5 illustrates the gap between the initial guess and the final solution. We see that the disagreement occurs only on one side of the free boundary. This is due to the nature of our initial guess, where solving the penalised problem for ρ = 0 is equivalent to restricting the system to only one region (in this case, the region of no observations). Despite the large gap between the two curves, we see that the iterate obtained after one step is significantly closer to the true solution, which indicates that much of the convergence is achieved in the first few Newton iterations. Row (b) in Table 5.6 shows the increments ‖v^ρ − v^{2ρ}‖_∞. Reassuringly, for this more complicated system, we still see a clear first-order convergence of the penalisation error with respect to the penalty parameter ρ. Even for small values of ρ, the successive increments are within O(1) (in comparison to the magnitude of the solution, which is of O(10^6)). This shows that the penalty approximation is very effective for small penalty parameters, and that it works well when extended to the class of QVIs that we introduced in Section 4.

We now analyse the behaviour of the value function when plotted as a function of time. The top-left graph of Figure 5.7 depicts an instance where the patient is in a stable condition. Here the observation region is [15, N]. There are limited benefits to frequently paying a high observation cost when it is unlikely that the patient's condition will deteriorate over a short period of time. On the other hand, if the patient has a high chance of mortality, i.e. a high probability of reaching the absorbing state *, then it is optimal to observe as soon as possible. The top-right graph illustrates the case when the last observed state contains a low amount of the R2 strain but no treatment is given. The observation region here is [0, 53]. The disparity between the optimal actions can be attributed to the negative reward associated with the absorbing state. In the latter case, the parameters of the transition matrix give a mortality rate of approximately 3% after 53 days, making an observation unlikely to improve subsequent rewards, if any. The optimal control allows for a more effective resource allocation by focusing on higher-quality samples during data collection. The solutions of the value function under these scenarios were not available in the original model.

To examine the behaviour around the decision boundaries, we plot the central finite difference terms (v^{n+1}_{i,x} − v^{n−1}_{i,x})/(2∆n) in the bottom row of Figure 5.7, underneath their respective graphs of the value function. If we consider the plots as a discretisation of a continuous value function, we see that there is a much larger variation within the observation region. Critically, there is non-smoothness across the boundary in the bottom-left graph. This suggests that the solution in continuous time is C^2 in time within each decision region, but only C^1 across the boundary. This is in line with theoretical results on the regularity of viscosity solutions in optimal stopping and switching problems [18, Chapter 5].
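The regularity check described above amounts to taking central differences of the value function along the time-lag axis and looking for kinks at the free boundary; the snippet below illustrates this on a made-up piecewise profile (the arrays are synthetic, not the solution of the HIV model).

```python
import numpy as np

def central_difference_in_time(v, dn=1.0):
    """Central differences (v^{n+1} - v^{n-1}) / (2 * dn) along the time-lag axis,
    used to inspect the regularity of the value function across the free boundary.
    v is an array indexed by the time lag n; the endpoints are left out."""
    return (v[2:] - v[:-2]) / (2.0 * dn)

# toy profile: smooth inside each region, with a kink at n = 20
n = np.arange(0, 41)
v = np.where(n < 20, -0.01 * (n - 20) ** 2, -0.05 * (n - 20)).astype(float)
dv = central_difference_in_time(v)
print(dv[17:22])   # the jump in slope shows up around the kink
```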
In this subsection, we consider a random walk with drift, as set up in Section 5.1, but with the additional assumption that the true value of the drift parameter θ is unknown to the user. To avoid complications with boundary conditions and infinite parameter domains, we shall only consider the finite horizon problem. Using the notation of Section 3, for a fixed value of θ, the n-step transition probabilities are given by the binomial expressions of Section 5.1, with r = (n + y − x)/2. As remarked at the end of Section 3, we choose the prior from a family of Beta distributions to obtain conjugacy in the parameter process. This reduces the problem back to a finite-dimensional one: the marginal likelihood of the observed displacement is the probability mass function of the Beta-binomial distribution, where B(a, b) denotes the Beta function. The posterior distribution π_n = π′_0 is then

π′_0 = Beta(a + r, b + n − r)   if ι_0 = +1,
π′_0 = Beta(a + n − r, b + r)   if ι_0 = −1.   (5.20)

Since the parameter process can now be characterised by the parameters of the Beta distribution, if π ∼ Beta(a, b) then we write v(m; (k, x, i); (a, b)) := v(m, ν^{k,x,i,π}_m). As both parameters can in general take any non-negative value, it is not feasible to obtain a solution of the value function for all values of (a, b). However, for a given prior, the values of the Beta distribution parameters required for the calculation are limited via the relation (5.20). We can then recursively calculate the value function using (3.15) from the time horizon N. Recalling that the reward function f(y) = 1/(|y| + 1) is independent of the control α, we obtain the set of terminal conditions in terms of S^x_k, the (finite) set of states that can be reached from x after k units of time, as defined in Section 5.1.

For our experiment, we choose three different initial parameter pairs for the prior π_0, as well as varying the observation cost. Here we assume that the true value is θ = 0.3 and set a time horizon of N = 50. We evaluate the performance of each prior under the optimal strategy for the same path realisations. The trajectories are sampled as follows. A realisation of the path is generated via a sequence of uniform random variables U_n ∼ U[0, 1] to represent the walk at each time step. If ι_n = +1, we take an upward step if U_n ≤ θ, and a downward step otherwise; the opposite applies to ι_n = −1, with switching only allowed at the optimal observation times. An illustration of a particular optimal sequence of actions is depicted in Figure 5.8.

We now examine the effects of parameter uncertainty on the value function, recorded in Table 5.9. The inaccuracy of the prior π_0 increases further down the table. When comparing the values of row (a) column-wise, i.e. for a fixed observation cost, there is a small increase in the number of observations as the prior moves away from the true value. The effect of the choice of prior is seen more clearly in row (b), where the average profit sees a more substantial decrease for π_0 ∼ Beta(5, 2), whose mass is concentrated towards [0.5, 1]. The misalignment of the prior and the ground truth is also reflected in row (c), where the credible intervals are generally widest in the bottom row of the table, reflecting a greater uncertainty over the parameter value.
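The conjugate update (5.20) is simple to implement; the following sketch computes the posterior parameters from an observed displacement, under the same convention as before that θ is the up-step probability when the control is +1. The numbers in the example are arbitrary.

```python
def beta_posterior(a, b, n, x, y, iota):
    """Conjugate Beta update for the unknown drift parameter theta after observing
    a displacement of y - x over n steps under control iota, following (5.20).
    r = (n + y - x) / 2 counts the up-steps; under iota = +1 an up-step has
    probability theta, under iota = -1 it has probability 1 - theta."""
    r = (n + y - x) // 2
    if iota == +1:
        return a + r, b + (n - r)
    return a + (n - r), b + r

# prior Beta(2, 2); the walk is observed to move from 0 to -4 in 10 steps under iota = -1
a_post, b_post = beta_posterior(2, 2, n=10, x=0, y=-4, iota=-1)
print(a_post, b_post)                    # posterior Beta parameters
print(a_post / (a_post + b_post))        # posterior mean estimate of theta
```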
Proof of Proposition 2.3. Since τ is an observation sequence, {τ_k = n} ∈ F^{X,τ}_{τ_{k−1}} for any n. As the sequence τ is strictly increasing, we have the lower bound τ_k ≥ k, so without loss of generality assume k ≤ n. Now {τ_k = n} ⊆ {τ_j ≤ n − 1} for all 0 ≤ j ≤ k − 1. Hence {τ_k = n} ∈ F^{X,τ}_{n−1}, so that τ_k is predictable as required.

Proof of Proposition 2.5. We prove the claim by induction. Assume without loss of generality that τ_0 = 0. Since τ_0 is deterministic, it holds trivially that F^{X,τ}_0 = F^X_0. Now for any n ≥ 1, the strictly increasing nature of τ yields F^{X,τ}_n ⊆ F^X_n through a chain of inclusions, where the second inclusion follows from the predictability of τ, and the third from the induction assumption.

References
Optimal inattention to the stock market
Optimal inattention to the stock market with information costs and transactions costs
Optimal inspections in a stochastic control problem with costly observations
Optimal inspections in a stochastic control problem with costly observations
Fundamentals of Stochastic Filtering. Stochastic Modelling and Applied Probability
Active measure reinforcement learning for observation cost minimization
Some convergence results for Howard's algorithm
Controlled diffusion processes
Switching cost models as hypothesis tests
Transactions costs and portfolio choice in a discrete-continuous-time setting
Optimal treatment strategies in the context of 'treatment for prevention' against HIV-1 in resource-poor settings
Sequential testing of a Wiener process with costly observations
Optimal control for partially observed diffusions
Search for information on multiple products
Stochastic Systems: Estimation, Identification, and Adaptive Control
Numerical Methods for Stochastic Control Problems in Continuous Time
On some recent aspects of stochastic control and their applications
Continuous-Time Stochastic Control and Optimization with Financial Applications
A penalty scheme for monotone systems with interconnected obstacles: Convergence and error estimates
Markov Decision Processes with Information Costs
Markov control processes with rare state observation: Theory and application to treatment scheduling in HIV-1
A penalty method for the numerical solution of Hamilton-Jacobi-Bellman (HJB) equations in finance
Penalty methods for the solution of discrete HJB equations - continuous control and obstacle problems
Analysis and computation of an optimality equation arising in an impulse control problem with discrete and costly observations
A hybrid stochastic river environmental restoration modeling with discrete and costly observations

Jonathan Tam has been supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1).