Markov decision processes with observation costs
Christoph Reisinger, Jonathan Tam
2022-01-19

We present a framework for a controlled Markov chain where the state of the chain is only revealed at chosen observation times, each of which incurs a cost. Optimal strategies therefore involve the choice of observation times as well as the subsequent control values. We show that the corresponding value function satisfies a dynamic programming principle, which leads to a system of quasi-variational inequalities (QVIs). Next, we give an extension where the model parameters are not known a priori but are inferred from the costly observations by Bayesian updates. We then prove a comparison principle for a larger class of QVIs, which implies uniqueness of solutions to our proposed problem. We utilise penalty methods to obtain arbitrarily accurate solutions. Finally, we perform numerical experiments on three applications which illustrate our framework.

In this article, we consider a controlled Markov chain problem, where a cost is incurred to reveal the state of the chain at a given moment in time. We assume that any changes to the control can only occur at these observation times. Hence, in addition to searching for the optimal action, the user also seeks the optimal intervals between successive observations of the state. A continuous stream of information is often an assumption taken for granted in both fully and partially observable control problems. However, in many practical applications, due to technical and labour difficulties, it can be expensive or impractical to obtain such measurements. Examples include the virological state of patients [11, 20, 21], environmental measurements on river sediments [25] and the biological growth dynamics of organisms [27]. In consumer spending, it is desirable to purchase a good product at a low search cost [14]. Such searches are sequential by nature, so the assumption of continuous observations does not apply.

To the best of our knowledge, the earliest works on observation controls are [3, 4]. There, a Brownian motion is to be stopped upon the exit from a given set, but each inspection of the Brownian motion comes with a cost. It was shown that the problem reduces to a free boundary problem and that the corresponding value function satisfies a quasi-variational inequality (QVI), with further analysis demonstrating the well-posedness of the problem. More recently, Dyrssen and Ekström incorporated observation costs in hypothesis testing for the drift of a diffusion, and characterised the value function as the unique fixed point of an associated operator [12]. Other works concerning observation controls are motivated by specific applications: Winkelmann et al. explored optimal diagnosis and treatment scheduling for HIV-1 patients, based on the trade-off between treatment cost and productivity loss across different countries [11, 20, 21]; Yoshioka et al. focused on a variety of environmental management problems, including the modelling of replenishing sediment storage in rivers, monitoring algae population dynamics and the biological growth of fishery resources [24, 25, 26, 27].
The phenomenon of sporadically observing the state process is also modelled in mathematical finance under the term rational inattention, where portfolio adjustments are assumed to occur infrequently with a utility cost proportional to the user's assets [1, 2, 10].

Our goal in this article is to systematically develop a framework for the observation control problems described above, utilising first principles to establish dynamic programming for the value function, and exploring suitable numerical approximations to the solutions of the resulting systems of QVIs. To avoid intuition being obscured by technical complications, we focus on discrete-time Markov chains, which often serve as valid approximations for diffusions in continuous time [16]. In our model, we consider an underlying controlled Markov chain X = {X_n}_n on a probability space (Ω, F, P), together with a corresponding controlled observation filtration. By passing to the conditional distribution µ_n of the state given the observed information, the estimation and control problem can be separated. This is a common procedure in solving partially observable control models [8, 17]. In general, the conditional distribution is characterised by the Zakai or Kushner-Stratonovich equations using standard filtering techniques [5, 13]. We will show in Section 2 that in our case, each realisation of µ_n only depends on the last observation time k, the state of the chain x and the control of the chain i. This gives rise to a higher-dimensional value function than in the standard case due to the presence of k and i. More specifically, in the finite horizon case, the reward functional that we aim to optimise depends on the tuple (m, (k, x, i), α), where m represents the current time, and α is a double sequence representing the observation times and the control of the chain.

Some analysis of the value function as set up above can be found in [3, 4], but their applications were limited to optimal stopping problems concerning Brownian motion. Other existing works on the observation control model assume stationarity in the problem. This leads to a reward functional that only depends on the triplet (x, i, α) [12, 21, 24, 25, 26, 27]. Whilst this formulation gives an overall lower-dimensional problem, the setup assumes that the user is in possession of accurate and updated information on the state process at initialisation. This excludes solving for scenarios with unintended large gaps between observations, which can lead to qualitatively different optimal actions. For example, lockdowns imposed due to the COVID-19 pandemic have led to delayed consultations for patients, who are more likely to appear in a worsened state at the time of diagnosis. We adopt the former, more general formulation to include such incidences in our model. The qualitative behaviour of the value function under these scenarios is demonstrated in our numerical experiments in Section 5.

Having set up the initial framework and established dynamic programming in Section 2, we obtain a system of discrete quasi-variational inequalities (QVIs). In Section 4, we investigate the well-posedness of such systems. We can rewrite the QVIs in the more abstract form (1.3), in which M : R^{N×L×d} → R^{N×L} is given by (Mu)^n_l = (A_n ū^q)_l − c, where q ∈ {1, . . . , N} is some fixed index, c is a constant, each A_n ∈ R^{L×L} is a strictly substochastic matrix, and F_i : R^{N×L} → R^{N×L} satisfies a monotonicity property specified in Section 4. The system of QVIs (1.3) shares a similar structure with that of a monotone interconnected obstacle system, which typically features in optimal switching problems [18].
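To make the structure of the coupling operator concrete, the following is a minimal NumPy sketch of the intervention operator M described above. It assumes that ū^q denotes the componentwise maximum of the candidate solution over the controls at the fixed time index q; the array shapes, names and numbers are illustrative rather than taken from the authors' implementation.

```python
import numpy as np

def intervention_operator(u, A, c, q=0):
    """Evaluate (Mu)^n_l = (A_n @ u_bar^q)_l - c for every time index n.

    u : array of shape (d, N, L)   -- candidate values u^n_{i,l} for each control i
    A : array of shape (N, L, L)   -- strictly substochastic matrices A_n
    c : float                      -- observation cost
    q : int                        -- fixed time index at which the coupling acts
    """
    u_bar_q = u.max(axis=0)[q]          # componentwise max over controls at index q, shape (L,)
    return A @ u_bar_q - c               # shape (N, L)

# toy example with d = 2 controls, N = 3 time indices, L = 4 states
rng = np.random.default_rng(0)
d, N, L = 2, 3, 4
u = rng.random((d, N, L))
P = rng.random((N, L, L))
A = 0.95 * P / P.sum(axis=2, keepdims=True)   # discounting makes each A_n strictly substochastic
print(intervention_operator(u, A, c=0.1))
```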
The well-posedness of interconnected obstacle systems is treated in [19], which establishes existence via penalisation [22, 23]. The use of penalty methods over traditional policy iteration is due to the fact that invertibility of the relevant matrices is not guaranteed in the presence of the obstacle operator. The extra matrix A_n appearing in the obstacle operator M represents the (discounted) transition matrix over the unobserved period of time. We give a modified proof of the comparison principle appearing in [19], taking into account the effect of the matrix A_n on the coupling in (1.3). The same modifications apply to the penalty approximations, which yields the well-posedness of the observation control problem. We also note that although the system of Bellman-type equations arising from the lower-dimensional formulation can be solved via a fixed-point method by bounding the total number of observations [12], this is not applicable to our model due to the presence of the extra time-lag variable k in the value function.

Lastly, as a further novel extension, we incorporate parameter uncertainty into the model dynamics and consider sequential updates to the control in Section 3. In the literature, the adoption of Bayesian principles with observation control has been explored in the context of reinforcement learning [6]: an algorithm for solving the optimal policy and estimator in parallel was proposed, but an equivalent Bellman-type equation was not established. Along similar ideas on the trade-off between cost and information, Cohen et al. consider a cost constraint that arises from the tracking error relative to the optimal Bayesian estimator, and relate the optimal band of inaction to a two-sided hypothesis test [9]. Here, we consider the case where the transition probability of the Markov chain X depends on an unknown parameter θ. In general, the standard approach in Bayesian control problems is to reformulate the problem as a partially observable optimal control problem [15], and then proceed to the separation of estimation and control as before. We apply the same principles to the observation control problem and derive a dynamic programming equation involving the 'prior' and 'posterior' distributions of θ. In particular, the reward functional is now augmented to depend on the tuple (m, (k, x, i), π, α), where π is a measure over the parameter space. Whilst the resulting equations are often infinite-dimensional in nature, the problem can be reduced back to the finite-dimensional case if one considers conjugate distributions for π. We demonstrate the solution of a model problem on a random walk involving Beta conjugate priors in Section 5 and investigate its corresponding numerical properties.

The main contributions of our article are as follows:
- We expand on the probabilistic approach to the observation control problem, and extend the results first seen in [3, 4]. We subsequently derive, via dynamic programming, a system of discrete QVIs that the value function satisfies.
- Following the approach of [19], we prove a comparison principle for the system of discrete QVIs and establish an approximation to the solutions via penalisation and semismooth Newton methods [7]. This also provides well-posedness for the observation control problem.
- We provide a novel Bayesian parametric setup of the observation control problem. This incorporates a distribution over the unknown parameters in the value function, which is dynamically updated after each observation.
- We demonstrate the numerical performance of our model with three numerical experiments, one of which is a time-discretised version of the HIV-treatment problem appearing in [21]. We show that our model extends the original one by solving for scenarios with large observation gaps, and that the optimal control coincides in scenarios that are covered by both models.

The rest of this paper is organised as follows. Section 2 sets out the framework for the Markov chain observation control model and establishes the corresponding set of discrete QVIs. A model problem with an explicit solution is also provided to illustrate the setup. The Bayesian extension is outlined in Section 3. In Section 4 we prove a comparison principle for a class of discrete QVIs which subsumes the QVIs obtained in Section 2. The penalty method is also introduced as an approximation for the QVI. Finally, we provide numerical experiments for our observation control model in Section 5.

In this section, we establish the framework for the observation control problem for Markov chains. We consider both the finite and the infinite horizon problem. We introduce the notion of observation filtrations in the spirit of [12], as well as the corresponding class of observation controls. We then formulate the control problem and derive the relevant equations for the value function with the use of the dynamic programming principle. We conclude the section with a model problem that yields a closed-form solution.

Let X = X(α) be a controlled Markov chain on a probability space (Ω, F, P), taking values in a state space S. We consider the discrete-time case and write X = (X_n)_{n∈N}. The control α is in general an adapted process with values in a finite control set I := {1, . . . , d}, and determines the distribution of the Markov chain via its transition kernel P = P(i) = P_i = (p_{xy}(i))_{x,y∈S}, where i ∈ I. We will use the notation p^{(n)}_{xy}(i) to denote the entries of P_i^n, i.e. the n-step transition probabilities. We also assume that the transition kernel is time-homogeneous and known a priori.

In the observation control model, access to the realisation of each X_n comes with a cost. One would therefore expect that observations will generally not occur at every time point, so that the information available to us is restricted. It is then natural to consider a filtration that is no finer than the natural filtration generated by X. Hence, we define the following notion of an observation filtration, which is based on that appearing in [12].

Proposition 2.1. Let X be a Markov chain on (Ω, F, P). Let τ = (τ_k)_{k=1}^∞ be a strictly increasing sequence of random times (in the sense that τ_k(ω) < τ_{k+1}(ω) for every ω ∈ Ω). Denote also by τ_0 the initial time of the chain. Define, for each n ≥ 0, the σ-algebra F^{X,τ}_n generated by the observation data available up to time n, that is, by the observation times τ_k ≤ n and the corresponding observed states. Then the sequence (F^{X,τ}_n)_n forms a filtration.

Proof. See appendix.

By construction, each τ_k is a stopping time with respect to F^{X,τ}. In the context of the observation control model, we would like to determine observation times based only upon information amassed from prior observations up to the current time. This motivates the following definition.

Definition 2.2. We say that τ is an observation sequence for X if each τ_k is F^{X,τ}_{τ_{k−1}}-measurable. In this case F^{X,τ} is called an X-observation filtration.

For the rest of this article, we will only consider τ which are observation sequences. Such sequences are also predictable.
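As a brief illustration of the controlled transition kernel introduced above, the following sketch computes the n-step transition probabilities p^{(n)}_{xy}(i) by matrix powers; this also anticipates the characterisation of the conditional distribution µ_n given later in this section. The two control matrices are made up for illustration and are not taken from the paper.

```python
import numpy as np

def n_step_probabilities(P_i, n):
    """n-step transition probabilities p^{(n)}_{xy}(i), i.e. the entries of P_i^n."""
    return np.linalg.matrix_power(P_i, n)

# two controls on a three-state chain; the matrices are made up for illustration
P = {1: np.array([[0.7, 0.3, 0.0],
                  [0.2, 0.5, 0.3],
                  [0.0, 0.4, 0.6]]),
     2: np.array([[0.5, 0.5, 0.0],
                  [0.1, 0.8, 0.1],
                  [0.0, 0.2, 0.8]])}

# conditional distribution of the state four steps after last observing state 1 under control i = 2
mu = n_step_probabilities(P[2], 4)[1]
print(mu, mu.sum())   # a probability vector over the three states
```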
Proposition 2.3. If τ is an observation sequence for X, then each τ_k is a predictable stopping time with respect to F^{X,τ}, i.e. for each k, {τ_k = n} ∈ F^{X,τ}_{n−1} for all n.

Proof. See appendix.

Since the influx of information is decided exogenously, the observation sequence should form part of the control. We define the set of admissible controls as follows.

Definition 2.4. Let τ be an observation sequence for X. For each τ_k, let ι_k be an I-valued, F^{X,τ}_{τ_k}-measurable random variable. An admissible control is a double sequence of the form α = (τ_n, ι_n)_n. The set of admissible controls is denoted by A. Define also the corresponding piecewise constant process I = (I_n)_n of the form I_n = ι_k for τ_k ≤ n < τ_{k+1}.

Note that double sequence controls are often seen in optimal switching problems (see, for example, [18, Chapter 5.3]), but here the process is in addition adapted to the observation filtration. The following proposition justifies the intuitive idea that the observation filtration indeed contains no more information than the natural filtration.

Proposition 2.5. Let F^X be the natural filtration generated by X. If the sequence τ is predictable (with respect to F^{X,τ}), then F^{X,τ} is coarser than F^X, i.e., F^{X,τ}_n ⊆ F^X_n for every n.

Proof. See appendix.

At any given time n ≥ 1, we will often consider the most recent observation that occurred before time n. We employ the following notation when considering quantities related to such observations.

Definition 2.6. Let τ be an observation sequence for X. For any n ≥ 0, define the random variable τ̃_n as the most recent observation time up to and including time n, i.e. τ̃_n = max{τ_k : τ_k ≤ n}. Similarly, define also ι̃_n = ι_{τ̃_n} = I_n and X̃_n = X_{τ̃_n}.

Thus one can view the tilde notation (τ̃_n, ι̃_n, X̃_n) as representing 'the most recent data available'. We can see that the random variable τ̃_n is F^{X,τ}_n-measurable, since for m ≤ n the event {τ̃_n = m} can be written as a countable union of events in F^{X,τ}_n. Although τ̃_n is not an F^{X,τ}-stopping time, so that F^{X,τ}_{τ̃_n} is not well-defined, we can still utilise the Markov property of the underlying chain X to obtain the relation in Lemma 2.7: conditioning on the observation filtration is equivalent to conditioning on the most recent observation data. Its proof follows from the decomposition of the conditional expectation across the countable events that generate F^{X,τ}_m and the Markov property of X.

To formulate the observation control problem, we have to define the conditional distribution of X_n given its past observations.

Definition 2.8. Let P(S) be the set of probability measures over S. Define the P(S)-valued (controlled) process (µ_n)_n as the conditional distribution of X_n given its past observation history up to and including time n.

We can characterise the random measures (µ_n)_n as follows. As S is countable, it suffices to consider the singleton sets {y} for any y ∈ S. By Lemma 2.7, µ_n({y}) = P(X_n = y | τ̃_n, ι̃_n, X̃_n), which is determined by the (n − τ̃_n)-step transition probabilities of the chain under the control ι̃_n.

Note that in general the constant c in (4.2) can also depend on i, n and l (this is indeed the case for (2.30), see Remark 4.7), but we shall only consider a constant c for the proofs in this section, for ease of notation. The fixed index q can be arbitrary; the idea is that the operator M only couples the solution u at a cross-section of values along the time domain. In the following, we adapt the argument in [19] to prove a comparison principle for the QVI (4.1). The main ingredient of the proof is to bound the coupling terms A_n(ū^q − v̄^q) by their maximum. If we extended the problem to an infinite domain, it would not be obvious that this maximum is achieved. However, recall from Remark 2.15 that additional spatial boundary conditions have to be imposed for a closed system. This reduces the system of QVIs back into the form of (4.1).
The comparison principle is shown in the proposition below. In its proof, let M := max_{i,l,n} (u^n_{i,l} − v^n_{i,l}) =: u^m_{j,k} − v^m_{j,k} denote the maximal difference between the candidate sub- and supersolutions, and let γ < 1 be the maximum of the row sums of A_m. Combining the two inequalities obtained from the sub- and supersolution properties implies the existence of some j_u, j_v ∈ I and l* ∈ {1, . . . , L} at which the maximality of M would be contradicted if the obstacle were active. Hence, we must have F_j(u_j)^m_k ≤ 0; but then, since v is a supersolution, the monotonicity property yields M ≤ 0.

Now we present a penalty approximation to the QVI (4.1). Consider the following penalised problem: for a penalty parameter ρ ≥ 0, find u^ρ = (u^ρ_i)_{i∈I} such that

F_i(u^ρ_i) − ρ π(Mu^ρ − u^ρ_i) = 0 for all i ∈ I,   (4.10)

where the penalisation function π : R → R is continuous, non-decreasing with π|_{(−∞,0]} = 0 and π|_{(0,∞)} > 0, and is applied elementwise. We show that for each fixed ρ, (4.10) satisfies a comparison principle. This implies uniqueness for Problem 4.3. The argument follows similarly to the approach in [19] and Proposition 4.2.

Proposition 4.4. For any penalty parameter ρ ≥ 0 and any c ≥ 0, if u^ρ = (u^ρ_i)_{i∈I} (resp. v^ρ = (v^ρ_i)_{i∈I}) satisfies

F_i(u^ρ_i) − ρ π(Mu^ρ − u^ρ_i) ≤ 0   (resp. ≥ 0),   (4.11)

then u^ρ ≤ v^ρ.

Proof. As in the previous proposition, let M := max_{i,l,n} (u^{ρ,n}_{i,l} − v^{ρ,n}_{i,l}) =: u^{ρ,m}_{j,k} − v^{ρ,m}_{j,k}. Bounding the coupling terms and rearranging, it then follows from the sub-/supersolution properties of u^ρ and v^ρ that M ≤ 0.

Theorem 4.6. For any fixed c ≥ 0, the solution to (4.10) converges monotonically from below to a function u ∈ R^{d×N×L} as ρ → ∞. Moreover, u solves the discrete QVI (4.1) if c > 0.

Remark 4.7. The QVI (2.30), derived from the infinite horizon problem of the observation control model, is an instance of the QVI (4.1). Specifically, we have L = |S|, with the operators F_i and M identified with the corresponding terms of (2.30). Moreover, it is straightforward to see that F_i satisfies the monotonicity condition (4.3). For uniqueness of solutions to hold, additional boundary conditions have to be imposed to close the system. This is demonstrated in the numerical experiments section below.

In this section, we apply our observation cost framework to three numerical experiments. Sections 5.1 and 5.2 analyse two infinite horizon problems. The roadmap for both is as follows: we first set up the discrete QVIs arising from the problem, which are then approximated by the penalised problem. As in [19], we employ the penalty function π(x) = x^+. The solution of the penalised problem is in turn approximated iteratively with semismooth Newton methods [23]. Formally speaking, starting with an initialisation v^{(0)} to the penalised problem, we obtain the next iterate v^{(k+1)} by solving the linearised equation L^ρ(v^{(k)})(v^{(k+1)} − v^{(k)}) = −G^ρ(v^{(k)}), where G^ρ denotes the residual of the penalised equation and L^ρ denotes the generalised derivative of the function G^ρ. For each example, we examine the numerical performance of the penalty method and Newton iterations, as well as the effects of the observation cost on the qualitative behaviour of the solutions. Section 5.3 considers the Bayesian formulation of the observation control problem over a finite horizon. The solutions are obtained through backwards recursion from the terminal conditions. We examine the impact that the extra parameter uncertainty has on the optimal trajectories.

Consider an integer-valued random walk (X_n)_{n∈N} with two regimes, representing a positive and a negative drift respectively, so that the control space is I = {+1, −1}. The probability of each step is parametrised by θ: for any x ∈ S = Z, under the control i = +1 the walk moves from x to x + 1 with probability θ and to x − 1 with probability 1 − θ, while under i = −1 these probabilities are interchanged. We also adopt the reward function f(y) = 1/(|y| + 1). The mass of this reward function f is concentrated around the origin, so naturally the optimal regime is one that reverts the process back towards the origin. For this example, we consider the infinite horizon problem.
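Before specialising to the random walk, the following sketch illustrates the penalised equation and the semismooth Newton step described above in a simplified setting. It solves F(v) − ρ π(ψ − v) = 0 with π(x) = x^+, where a fixed obstacle vector ψ stands in for the coupled term Mu of the QVI and F is a linear map v ↦ Bv − b. This is a simplified stand-in rather than the authors' MATLAB implementation; the matrix B, vector b and obstacle ψ are made up for the example.

```python
import numpy as np

def semismooth_newton(B, b, psi, rho, tol=1e-8, max_iter=50):
    """Solve the penalised equation  B v - b - rho * max(psi - v, 0) = 0
    with a semismooth Newton iteration (penalty pi(x) = x^+ applied elementwise).
    The fixed obstacle psi stands in for the coupled term Mu of the paper."""
    v = np.linalg.solve(B, b)                          # rho = 0 problem as initial guess
    for it in range(max_iter):
        active = (psi - v) > 0                         # where the penalty is switched on
        G = B @ v - b - rho * np.maximum(psi - v, 0)   # residual of the penalised equation
        J = B + rho * np.diag(active.astype(float))    # generalised derivative of G
        dv = np.linalg.solve(J, -G)
        v = v + dv
        if np.linalg.norm(dv) <= tol * max(np.linalg.norm(v), 1.0):
            return v, it + 1
    return v, max_iter

# small diagonally dominant system with a constant obstacle
B = np.array([[1.2, -0.1, 0.0],
              [-0.1, 1.2, -0.1],
              [0.0, -0.1, 1.2]])
b = np.array([0.1, 0.0, 0.1])
psi = np.array([0.5, 0.5, 0.5])
v, iters = semismooth_newton(B, b, psi, rho=1e4)
print(v, iters)
```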
Recall that the discrete QVI (2.30) holds for all m ≥ 0, x ∈ S and i ∈ I. Note that there exists a path from x to y over m units of time if and only if m ≥ |y − x| and m ≡ y − x (mod 2). If S^x_m denotes the set of states that can be reached from x after m units of time, then for y ∈ S^x_m the transition probabilities are given by

p^{(m)}_{xy}(+1) = C(m, r) θ^r (1 − θ)^{m−r},   p^{(m)}_{xy}(−1) = C(m, r) (1 − θ)^r θ^{m−r},

where C(m, r) is the binomial coefficient and r = (m + y − x)/2. Hence the QVI can be written out in full with these transition probabilities substituted in. In general, r depends on x, y and m, but we suppress the subscripts for ease of notation.

To close the system and ensure a unique solution, we enforce the following time and spatial boundary conditions. We impose a reflecting boundary at x = ±L, where L is suitably large; in particular, the QVI (5.5) for the states x = ±L uses the reflected transition probabilities (5.9) instead. For the time boundary, we enforce an observation at some large N > 0. The terminal condition (5.10) then applies for −L < x < L, with r′ = (N + y − x)/2. The analogous equations hold for the spatial boundary x = ±L, but with the transition probabilities (5.9). These terminal conditions can be interpreted as the largest possible interval between two observations.

We now proceed to solve the penalised problem for the system (5.8), with boundary conditions (5.9) and (5.10), through the use of semismooth Newton methods. To initialise the iteration, we solve the uncoupled system (5.11), with the spatial boundary transition probabilities (5.9) and the time boundary condition (5.12). The system (5.11) corresponds to the penalised equation with ρ = 0. The uncoupled time boundary condition is equivalent to enforcing an observation but with no switching (i.e., assuming that v̄ = v_i in each equation for v_i). The iteration terminates once a relative tolerance threshold of 10^{-8} is reached.

We investigate the numerical performance of the described methods for the case θ = 0.75, γ = 0.99, L = 50 and N = 500, across different cost parameters c_obs. Computations are performed using MATLAB R2019b. The numerical results are shown in Table 5.1. Row (a) shows that the number of Newton iterations required to reach the tolerance threshold is independent of the size of the penalty parameter ρ. Fewer iterations are required for more extreme values of c_obs, but the overall number of iterations remains low across different observation costs. Row (b) shows the increments ‖v^ρ − v^{2ρ}‖_∞. The values clearly demonstrate first-order convergence of the penalisation error with respect to the penalty parameter ρ, which is in line with the theoretical results presented in [19, Theorems 3.9 and 4.2].

We now discuss the qualitative behaviour of the solution. It is clear that if the chain is observed to be at a positive state, then the control should be switched to i = −1 for a negative drift, and vice versa. Table 5.2 lists the optimal observation time gap for selected states across different observation costs c_obs. As the problem is symmetric by construction, it is sufficient to examine only the behaviour for the positive states. In general, the optimal observation time increases as c_obs increases. A longer unobserved period of time then leads to a lower average reward. This is illustrated in Figure 5.3, where the function n ↦ v^n_{−1,30} is plotted for various values of c_obs. In the absence of an observation cost, i.e. for c_obs = 0, the optimal observation time equals the magnitude of the last observed state, as there is no need to observe until it is possible for the walk to cross the origin again.
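The reachability condition and the binomial form of the m-step transition probabilities used in this example can be encoded directly; the following is a small sketch, under the convention assumed here that θ is the up-step probability when i = +1 (and the down-step probability when i = −1).

```python
from math import comb

def m_step_prob(x, y, m, theta, i):
    """m-step transition probability of the two-regime walk from x to y.
    Under control i = +1 the up-step probability is assumed to be theta,
    and under i = -1 it is 1 - theta."""
    if m < abs(y - x) or (m - (y - x)) % 2 != 0:
        return 0.0                          # y is not reachable from x in m steps
    r = (m + y - x) // 2                    # number of up-steps along any such path
    p_up = theta if i == +1 else 1.0 - theta
    return comb(m, r) * p_up**r * (1.0 - p_up)**(m - r)

# sanity check: the probabilities over all reachable states sum to one
m, x, theta = 6, 0, 0.75
total = sum(m_step_prob(x, y, m, theta, +1) for y in range(x - m, x + m + 1))
print(total)   # 1.0 up to rounding
```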
In this subsection, we implement our observation control framework in an extension of an HIV-treatment scheduling problem that appeared in [21]. As alluded to in the introduction, the stationarity of the reward function in the original model (see (5.13) below) implicitly assumes that the observer is given the state of the underlying process at initialisation. However, in practice many scenarios of interest do not satisfy this assumption. Our formulation extends the above by allowing, in addition, that the user can approach the problem at initialisation with outdated or sub-optimal information. We demonstrate that such initial conditions can lead to different qualitative behaviours in the value function through time. We also examine the numerical performance of the penalty method when applied to the system of QVIs for this larger system, compared to that in Section 5.1.

Let us briefly describe the original problem in [21]. A continuous-time Markov chain is used to model the virus levels of HIV-positive patients over time. With two types of treatment available, the control space is I = {0, 1, 2} (where 0 represents no treatment given). Four virus strains are considered: WT denotes the wild type (susceptible to both treatments), R1 and R2 denote the strains resistant to Treatment 1 and Treatment 2 respectively, and HR denotes the strain that is highly resistant to both. The level of each strain is represented by the states 'none' (0), 'low' (l), 'medium' (m) and 'high' (h). Therefore, the state space for the Markov chain is S = {0, l, m, h}^4 ∪ {*}, where the asterisk represents patient death. Note in particular that * is an absorbing state. The goal in the original model is then to minimise a cost functional J : S × I → R of the form

J = E[ Σ_j ( ∫_{τ_j}^{τ_{j+1}} e^{−γs} c(X_s, ι(X_{τ_j})) ds + e^{−γτ_{j+1}} c_obs ) ],   (5.13)

where the cost function c : S × I → R is a linear combination of the productivity loss resulting from each patient's condition and their received treatment.

To adapt the model above to our framework, we first discretise the Markov chain, choosing each step to represent one day. We then take the model parameters from the original paper [21, Section 3], which provides the transition rate matrices {Q_i}_{i∈I} and the cost function c(x, i). The transition matrices {P_i}_{i∈I} are then given by P_i = e^{Q_i} (as the time unit in [21] is one day). For illustration purposes, a sparsity plot of the transition matrix P_0 is shown in Figure 5.4. As our framework takes the form of a maximisation problem, we choose f = −c for the reward function. We can now formulate our problem in terms of the QVI (5.14).

We now follow the same procedure as in Section 5.1 to obtain a numerical solution. Note that for this problem the spatial domain is finite, and we also have a natural spatial boundary arising from the absorbing death state *: for all m ≥ 0 and i ∈ I, the value at * is determined by a constant a representing the average GDP loss due to patient death [20, 21]. A time boundary is once again enforced at some large time N > 0, which can be interpreted as a mandatory observation at time N.

We now solve the associated penalised problem with semismooth Newton methods. As in Section 5.1, we choose the initial guess to be the solution to the penalised problem with ρ = 0, with uncoupled time boundary conditions. The iterations terminate once a relative tolerance threshold of 10^{-8} is reached. The numerical experiments are performed in MATLAB R2019b.
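The discretisation step P_i = e^{Q_i} can be carried out with a matrix exponential; below is a brief sketch using SciPy. The 3×3 generator shown is a made-up toy (the actual rate matrices Q_i come from [21, Section 3] and are not reproduced here); the check that the rows of P sum to one is a useful sanity test after exponentiating a generator.

```python
import numpy as np
from scipy.linalg import expm

def daily_transition_matrix(Q):
    """Discretise a continuous-time rate matrix Q into a one-day transition
    matrix P = exp(Q), as done for the treatment-scheduling example."""
    P = expm(Q)
    assert np.allclose(P.sum(axis=1), 1.0)      # rows of P remain probability vectors
    return P

# toy 3-state generator (rows sum to zero); the true Q_i come from [21, Section 3]
Q = np.array([[-0.10, 0.08, 0.02],
              [0.05, -0.15, 0.10],
              [0.00, 0.00, 0.00]])              # last state absorbing, like the death state *
P = daily_transition_matrix(Q)
print(P)
```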
Table 5.6 shows the numerical solution for different values of N and c_obs across different penalty parameters ρ. Row (a) shows that, much like in the random walk experiment of Section 5.1, the number of iterations remains constant with respect to ρ. However, the number of Newton iterations required to reach the target threshold is much higher, at approximately 20 iterations. Figure 5.5 illustrates the gap between the initial guess and the final solution. We see that the disagreement occurs only on one side of the free boundary. This is due to the nature of our initial guess, where solving the penalised problem for ρ = 0 is equivalent to restricting the system to only one region (in this case, the region of no observations). Despite the large gap between the two curves, we see that the iterate obtained after one step is significantly closer to the true solution, which indicates that much of the convergence is achieved in the first few Newton iterations. Row (b) in Table 5.6 shows the increments ‖v^ρ − v^{2ρ}‖_∞. Reassuringly, for this more complicated system, we still see a clear first-order convergence of the penalisation error with respect to the penalty parameter ρ. Even for small values of ρ, the successive increments are within O(1) (in comparison to the magnitude of the solution, which is of O(10^6)). This shows that the penalty approximation is very effective for small penalty parameters, and that it works well when extended to the class of QVIs that we introduced in Section 4.

We now analyse the behaviour of the value function when plotted as a function of time. The top-left graph of Figure 5.7 depicts an instance where the patient is in a stable condition. Here the observation region is [15, N]. There are limited benefits to frequently paying a high observation cost when it is unlikely that the patient's condition will deteriorate over a short period of time. On the other hand, if the patient has a high chance of mortality, i.e. a high probability of reaching the absorbing state *, then it is optimal to observe as soon as possible. The top-right graph illustrates the case when the last observed state contains a low amount of the R2 strain but no treatment is given. The observation region here is [0, 53]. The disparity between the optimal actions can be attributed to the negative reward associated with the absorbing state. In the latter case, the parameters of the transition matrix give a mortality rate of approximately 3% after 53 days, making an observation unlikely to improve subsequent rewards, if any. The optimal control allows for a more effective resource allocation by focusing on higher-quality samples during data collection. The solutions of the value function under these scenarios were not available in the original model.

To examine the behaviour around the decision boundaries, we plot the central finite difference terms (v^{n+1}_{i,x} − v^{n−1}_{i,x})/(2∆n) in the bottom row of Figure 5.7, underneath their respective graphs of the value function. If we consider the plots as a discretisation of a continuous value function, we see that there is a much larger variation within the observation region. Critically, there is non-smoothness across the boundary in the bottom-left graph. This suggests that the solution in continuous time is C^2 in time within each decision region, but only C^1 across the boundary. This is in line with theoretical results on the regularity of viscosity solutions in optimal stopping and switching problems [18, Chapter 5].
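The regularity check described above amounts to taking central differences of the value function along the time-lag axis and looking for kinks at the free boundary; the snippet below illustrates this on a made-up piecewise profile (the arrays are synthetic, not the solution of the HIV model).

```python
import numpy as np

def central_difference_in_time(v, dn=1.0):
    """Central differences (v^{n+1} - v^{n-1}) / (2 * dn) along the time-lag axis,
    used to inspect the regularity of the value function across the free boundary.
    v is an array indexed by the time lag n; the endpoints are left out."""
    return (v[2:] - v[:-2]) / (2.0 * dn)

# toy profile: smooth inside each region, with a kink at n = 20
n = np.arange(0, 41)
v = np.where(n < 20, -0.01 * (n - 20) ** 2, -0.05 * (n - 20)).astype(float)
dv = central_difference_in_time(v)
print(dv[17:22])   # the jump in slope shows up around the kink
```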
In this subsection, we consider a random walk with drift, as set up in Section 5.1, but with the additional assumption that the true value of the drift parameter θ is unknown to the user. To avoid complications with boundary conditions and infinite parameter domains, we shall only consider the finite horizon problem. Using the notation of Section 3, for a fixed value of θ, the n-step transition probabilities are given by the binomial expressions of Section 5.1, with r = (n + y − x)/2. As remarked at the end of Section 3, we choose the prior from a family of Beta distributions to obtain conjugacy in the parameter process. This reduces the problem back to a finite-dimensional one: the marginal likelihood of the observed displacement is the probability mass function of the Beta-binomial distribution, where B(a, b) denotes the Beta function. The posterior distribution π_n = π′_0 is then

π′_0 = Beta(a + r, b + n − r)   if ι_0 = +1,
π′_0 = Beta(a + n − r, b + r)   if ι_0 = −1.   (5.20)

Since the parameter process can now be characterised by the parameters of the Beta distribution, if π ∼ Beta(a, b) then we write v(m; (k, x, i); (a, b)) := v(m, ν^{k,x,i,π}_m). As both parameters can in general take any non-negative value, it is not feasible to obtain a solution of the value function for all values of (a, b). However, for a given prior, the values of the Beta distribution parameters required for the calculation are limited via the relation (5.20). We can then recursively calculate the value function using (3.15) from the time horizon N. Recalling that the reward function f(y) = 1/(|y| + 1) is independent of the control α, we obtain the set of terminal conditions in terms of S^x_k, the (finite) set of states that can be reached from x after k units of time, as defined in Section 5.1.

For our experiment, we choose three different initial parameter pairs for the prior π_0, as well as varying the observation cost. Here we assume that the true value is θ = 0.3 and set a time horizon of N = 50. We evaluate the performance of each prior under the optimal strategy for the same path realisations. The trajectories are sampled as follows. A realisation of the path is generated via a sequence of uniform random variables U_n ∼ U[0, 1] to represent the walk at each time step. If ι_n = +1, we take an upward step if U_n ≤ θ, and a downward step otherwise; the opposite applies to ι_n = −1, with switching only allowed at the optimal observation times. An illustration of a particular optimal sequence of actions is depicted in Figure 5.8.

We now examine the effects of parameter uncertainty on the value function, recorded in Table 5.9. The inaccuracy of the prior π_0 increases further down the table. When comparing the values of row (a) column-wise, i.e. for a fixed observation cost, there is a small increase in the number of observations as the prior moves away from the true value. The effect of the choice of prior is seen more clearly in row (b), where the average profit sees a more substantial decrease for π_0 ∼ Beta(5, 2), whose mass is concentrated towards [0.5, 1]. The misalignment of the prior and the ground truth is also reflected in row (c), where the credible intervals are generally widest in the bottom row of the table, reflecting a greater uncertainty over the parameter value.
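The conjugate update (5.20) is simple to implement; the following sketch computes the posterior parameters from an observed displacement, under the same convention as before that θ is the up-step probability when the control is +1. The numbers in the example are arbitrary.

```python
def beta_posterior(a, b, n, x, y, iota):
    """Conjugate Beta update for the unknown drift parameter theta after observing
    a displacement of y - x over n steps under control iota, following (5.20).
    r = (n + y - x) / 2 counts the up-steps; under iota = +1 an up-step has
    probability theta, under iota = -1 it has probability 1 - theta."""
    r = (n + y - x) // 2
    if iota == +1:
        return a + r, b + (n - r)
    return a + (n - r), b + r

# prior Beta(2, 2); the walk is observed to move from 0 to -4 in 10 steps under iota = -1
a_post, b_post = beta_posterior(2, 2, n=10, x=0, y=-4, iota=-1)
print(a_post, b_post)                    # posterior Beta parameters
print(a_post / (a_post + b_post))        # posterior mean estimate of theta
```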
Proof of Proposition 2.3. Since τ is an observation sequence, {τ_k = n} ∈ F^{X,τ}_{τ_{k−1}} for any n. As the sequence τ is strictly increasing, we have the lower bound τ_k ≥ k, so without loss of generality assume k ≤ n. Now {τ_k = n} ⊆ {τ_j ≤ n − 1} for all 0 ≤ j ≤ k − 1. Hence {τ_k = n} ∈ F^{X,τ}_{n−1}, so that τ_k is predictable as required.

Proof of Proposition 2.5. We prove the claim by induction. Assume without loss of generality that τ_0 = 0. Since τ_0 is deterministic, it holds trivially that F^{X,τ}_0 = F^X_0. Now for any n ≥ 1, the strictly increasing nature of τ yields F^{X,τ}_n ⊆ F^X_n through a chain of inclusions, where the second inclusion follows from the predictability of τ, and the third from the induction assumption.

References
Optimal inattention to the stock market
Optimal inattention to the stock market with information costs and transactions costs
Optimal inspections in a stochastic control problem with costly observations
Optimal inspections in a stochastic control problem with costly observations
Fundamentals of Stochastic Filtering. Stochastic Modelling and Applied Probability
Active measure reinforcement learning for observation cost minimization
Some convergence results for Howard's algorithm
Controlled diffusion processes
Switching cost models as hypothesis tests
Transactions costs and portfolio choice in a discrete-continuous-time setting
Optimal treatment strategies in the context of 'treatment for prevention' against HIV-1 in resource-poor settings
Sequential testing of a Wiener process with costly observations
Optimal control for partially observed diffusions
Search for information on multiple products
Stochastic Systems: Estimation, Identification, and Adaptive Control
Numerical Methods for Stochastic Control Problems in Continuous Time
On some recent aspects of stochastic control and their applications
Continuous-Time Stochastic Control and Optimization with Financial Applications
A penalty scheme for monotone systems with interconnected obstacles: Convergence and error estimates
Markov Decision Processes with Information Costs
Markov control processes with rare state observation: Theory and application to treatment scheduling in HIV-1
A penalty method for the numerical solution of Hamilton-Jacobi-Bellman (HJB) equations in finance
Penalty methods for the solution of discrete HJB equations - continuous control and obstacle problems
Analysis and computation of an optimality equation arising in an impulse control problem with discrete and costly observations
A hybrid stochastic river environmental restoration modeling with discrete and costly observations

Jonathan Tam has been supported by the EPSRC Centre for Doctoral Training in Mathematics of Random Systems: Analysis, Modelling and Simulation (EP/S023925/1).