key: cord-0484472-3g828frk authors: Wang, Bo; Xie, Wei; Martagan, Tugce; Akcay, Alp; Ravenstein, Bram van title: Optimizing Biomanufacturing Harvesting Decisions under Limited Historical Data date: 2021-01-11 journal: nan DOI: nan sha: bed30f6d781079dcf874e231ac63844774149f61 doc_id: 484472 cord_uid: 3g828frk In biopharmaceutical manufacturing, fermentation processes play a critical role on productivity and profit. A fermentation process uses living cells with complex biological mechanisms, and this leads to high variability in the process outputs. By building on the biological mechanisms of protein and impurity growth, we introduce a stochastic model to characterize the accumulation of the protein and impurity levels in the fermentation process. However, a common challenge in industry is the availability of only very limited amount of data especially in the development and early stage of production. This adds an additional layer of uncertainty, referred to as model risk, due to the difficulty of estimating the model parameters with limited data. In this paper, we study the harvesting decision for a fermentation process under model risk. In particular, we adopt a Bayesian approach to update the unknown parameters of the growth-rate distributions, and use the resulting posterior distributions to characterize the impact of model risk on fermentation output variability. The harvesting problem is formulated as a Markov decision process model with knowledge states that summarize the posterior distributions and hence incorporate the model risk in decision-making. The resulting model is solved by using a reinforcement learning algorithm based on Bayesian sparse sampling. We provide analytical results on the structure of the optimal policy and its objective function, and explicitly study the impact of model risk on harvesting decisions. Our case studies at MSD Animal Health demonstrate that the proposed model and solution approach improve the harvesting decisions in real life by achieving substantially higher average output from a fermentation batch along with lower batch-to-batch variability. The biomanufacturing industry has developed several innovative treatments for cancer, adult blindness, and COVID-19 among many other diseases. Despite its increasing success, biomanufacturing is a challenging production environment. Different from classical pharmaceutical manufacturing, biomanufacturing methods use living organisms (e.g., bacteria, viruses or mammalian cells) during the production processes. These living organisms are custom-engineered to produce highly complex active ingredients for biopharmaceutical drugs. However, the use of living organisms also introduces several operational challenges related to high uncertainty and batch-to-batch variability in the production outcomes. The drug substance manufacturing can be broadly categorized into two main steps: fermentation and purification operations. During the fermentation process, the living organisms grow and produce the desired active ingredients. Specific characteristics of active ingredients (e.g., monoclonal antibodies, biomass, proteins, antigens, etc.) could vary across different drugs. In the remainder of this paper, we refer to the resulting target active ingredient as protein. After fermentation, the batch continues with a series of purification operations to comply with stringent regulatory requirements on safety and quality. Our main focus in this study is the fermentation process. 
The fermentation process is typically carried out inside a stainless steel vessel called bioreactor. Bioreactors are equipped with advanced sensors to achieve a highly controlled environment via monitoring of critical process parameters (e.g., cell growth rate, protein accumulation, impurity accumulation, etc.). Figure 1 uses industry data to illustrate the main dynamics of a batch fermentation process. As the fermentation continues, we observe from Figure 1 (a) that the amount of protein produced during fermentation increases exponentially over time. Hence, this specific phase of fermentation is known as the exponential growth phase. However, the exponential growth phase continues only for a finite period of time (e.g., several hours or days depending on the application) because of the inherent limitations of biological processes (e.g., limitations in media, cell viability and growth). After the exponential growth phase, the fermentation enters a stationary phase in which the protein production stops and the batch needs to be harvested. In addition, we observe from Figure 1 (b) that unwanted impurities accumulate inside the bioreactor along with the desired proteins. The specific nature of impurities varies across applications but impurities often represent unwanted byproducts, such as ammonia, dead cells, etc. These impurities are subsequently filtered and eliminated through a series of purification operations. The simultaneous growth of desired proteins and unwanted impurities, as shown in Figure 1 , is often known as the purity-yield trade-off in fermentation processes. From a practical perspective, the purity-yield trade-off presents a critical challenge in fermentation harvesting (stopping) decisions. Waiting too long to harvest the fermentation, in anticipation of a higher amount of protein production, may result in a higher amount of impurities. In turn, this could increase the cost (difficulty) of subsequent purification operations. This purity-yield trade-off motivates our main research question: (1) What is an optimal harvesting policy (i.e., when should we stop the fermentation) to maximize the expected profit obtained from a batch? In addition to the purity-yield trade-off, process "uncertainty" imposes another critical challenge on harvesting decisions. In particular, two types of process uncertainty are commonly encountered in biomanufacturing practice: (i) inherent stochasticity and (ii) model risk. In our problem setting, inherent stochasticity represents the uncertainty in the amounts of protein and impurity produced throughout fermentation, and is often caused by the inherent complexity of biological systems. Because living organisms are used during fermentation, the rate at which proteins and impurities accumulate is random (although fermentation is carried out under identical conditions). Therefore, the inherent stochasticity of biological processes motives our second research question: (2) How can we develop an analytical model to learn the inherent stochasticity of fermentation processes and incorporate it into optimal harvesting decisions? Most often, the inherent stochasticity can not be controlled but can be predicted through historical process data. However, building a reliable prediction model is challenging in practice because most biomanufacturing applications involve only a limited amount of historical process data. We refer to the resulting uncertainty in the prediction model itself as the model risk. 
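To make these dynamics concrete, the following short Python sketch simulates a few hypothetical batches in which protein and impurity grow exponentially with randomly drawn per-period growth rates; all parameter values are illustrative assumptions, not the parameters behind Figure 1.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def simulate_batch(p0=1.5, i0=2.0, horizon=8,
                   mu_p=0.35, sd_p=0.05, mu_i=0.30, sd_i=0.08):
    """Simulate one fermentation batch with exponential growth and
    randomly varying per-period growth rates (inherent stochasticity)."""
    protein, impurity = [p0], [i0]
    for _ in range(horizon):
        protein.append(protein[-1] * np.exp(rng.normal(mu_p, sd_p)))
        impurity.append(impurity[-1] * np.exp(rng.normal(mu_i, sd_i)))
    return np.array(protein), np.array(impurity)

# Identical operating conditions, yet the trajectories differ batch to batch.
for batch in range(3):
    p, i = simulate_batch()
    print(f"batch {batch}: final protein = {p[-1]:.1f}, final impurity = {i[-1]:.1f}")
```

Even this toy example shows the purity-yield trade-off: waiting longer raises both the final protein and the final impurity, and the batch-to-batch spread of both quantities grows with the length of the run.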
The problem of decision-making under limited data (i.e., under model risk) is a critical concern for both research and development (R&D) projects and industry-scale applications. In biopharmaceutical R&D projects, each protein is unique such that the scientists re-engineer and manufacture it for the first time. This implies that harvesting decisions are typically made under limited R&D data. In industry-scale applications, the problem of limited data becomes relevant every time a change occurs in equipment or raw materials. For example, the supplier of raw materials (i.e., medium or seed cells) might change their formulations, the management might purchase a new bioreactor, etc. Such changes have a substantial impact on the output of fermentation which often makes the historical process data obsolete or unreliable. Thus, it is of practical importance to have a harvesting strategy that accounts for the model risk. It is possible that ignoring the model risk might lead to sub-optimal decisions. From a practical perspective, decision-making under limited data leads to a natural "learning-by-doing" mechanism where the biomanufacturer sequentially collects real-world data (by implementing a policy) under model risk. This motivates our third and fourth research questions: (3) Given a limited amount of historical data, how can we develop a learning mechanism to simultaneously account for inherent stochasticity and model risk while making harvesting decisions? (4) How does the structural characteristics of optimal harvesting policies change based on the system's physical state (i.e., protein and impurity amounts present in the batch) as well as the knowledge states (i.e., current belief of the biomanufacturer on model risk)? To address the aforementioned research questions, we build a solution approach based on a reinforcement learning model using the theory of Bayesian statistics and Markov decision processes. A key aspect of our work is that we build analytical models that combine the knowledge from life sciences and operations research to support biomanufacturing decisions under limited historical data and inherent stochasticity of biological systems. In life sciences research, there are well-known mechanistic models to predict the evolution of fermentation [1] . However, existing models do not mathematically capture both aspects of limited process data and inherent stochasticity of fermentation. Our study is a first attempt to apply operations research concepts to fermentation decisions under limited data, and combines the knowledge from life sciences and stochastic modeling to derive guidelines that improve industry practices. In addition, we characterize the structural properties of optimal harvesting decisions and show that the state space can be partitioned into two distinct decision-zones: harvest zone and continue zone. These decision-zones are derived based on both the physical and knowledge states of the fermentation process, and they provide policy guidelines that are easy to implement in practice. By exploiting the unique features of our problem setting, we also generate analytical expressions to characterize the decision zones under limited data. Such analytical characterization of the decision zones enables a rigorous assessment of the impact of limited data on operating decisions, and provides managerial insights on the value of collecting additional data. 
At a given knowledge state, we also analytically characterize the contributions of the inherent stochasticity and the model risk to the overall uncertainty faced by the decision maker. To the best of our knowledge, these two sources of uncertainty have been a critical challenge for practitioners but their impact on optimal costs and harvesting decisions have not been fully understood. In summary, our model formalizes our understanding on fermentation decisions under limited historical data, and our results inform practitioners on the impact of adopting a learning-by-doing framework on current practice. This research is an outcome of a multi-year collaboration with MSD Animal Health in Boxmeer, the Netherlands. 1 The facility in Boxmeer is a leading biomanufacturing hub of Europe that conducts both biopharmaceutical R&D and large-scale production. Since September 2019, the developed framework has been used in daily operations to support harvesting decisions. The implementation has resulted in an average of 3X additional revenue per year. The outcome of this research has also been shared with a wider audience through industry working group sessions and social media [2, 3] . Due to its scalability and potential impact, the research outcomes have been internationally recognized in several platforms [4, 5, 6, 7] . The remainder of the paper is organized as follows. We review the most related literature in Section 2. We develop a stochastic optimization model in Section 3, and analyze its structural characteristics in Section 4. We present a solution procedure in Section 5 and a case study in Section 6. We discuss the implementation in Section 7 and the concluding remarks in Section 8. guide the search for optimal control strategies. In this context, most studies develop deterministic or stochastic models to predict and control fermentation. Deterministic models typically build kinetic process models (i.e., differential equations) on cell growth and product formation [1, 9, 10] . These kinetic models are also integrated with optimization models. For example, [11] constructed a dynamic flux balance model for a fermentation process. Then, they developed a closed-loop control for feed rate and dissolved oxygen concentration profiles to maximize the yield production. Data-driven stochastic optimization is relatively understudied to predict and control fermentation processes. Existing studies typically focus on the inherent stochasticity of fermentation. For example, [12] used approximate dynamic programming to maximize yield and minimize process time in fed-batch fermentation. [13] adopted a Markov chain Monte Carlo approach to optimize the kinetics of a fermentation process. More recently, [14] and [15] developed a Markov decision processes (MDP) model to optimize fermentation operating decisions. However, their optimization models are built based on a sufficiently large historical data, and hence are not equipped to capture the impact of model risk (limited historical data) on biomanufacturing decisions. When decisions are made under limited data, ignoring the model risk would lead to a bias in decision-making. To the best of our knowledge, this paper is the first to simultaneously capture the inherent stochasticity of biological systems and model risk to optimize harvesting decisions in biomanufacturing systems. Reinforcement learning approaches have been recently developed for bioprocess control. 
For example, [16] and [17] developed model-free deep-Q-network based reinforcement learning approaches to maintain the cells at target populations by controlling the feeding profiles and to maximize the yield by controlling the flow rate. [18] constructed a model-based reinforcement learning for biomanufacturing control, with predictive distribution of system response. After that, they proposed a simulation assisted policy gradient algorithm that can efficiently reuse the previous process outputs to facilitate the learning and the search for the optimal policy. Our paper is different from [18] in the following ways: (1) We study the harvesting problem in fermentation, while [18] focus on a chromatography problem in purification operations, (2) we explicitly incorporate the posterior distributions of unknown fermentation-process parameters as knowledge states of the MDP model; and (3) their focus is developing a solution algorithm, while we focus on analytical properties of the optimal harvesting policy and practical implementation in real life. In reinforcement learning literature, the model-based Bayesian Reinforcement learning, or the Bayes-Adaptive Markov decision process (BAMDP), explicitly maintain a posterior over the model parameters as knowledge states to select actions, which can automatically balance both exploration and exploitation. A comprehensive review of Bayesian reinforcement learning methodologies can be found in [19] . Notice an equivalent representation is considering the parameters as unobservable states of the system, which can be formulated as a partially observable Markov decision process (POMDP) [20] . Since solving the original model-based Bayesian reinforcement learning is notoriously complex due to potentially huge state space, various approximation algorithms have been developed, including offline value approximation and computing the policy a priori through POMDP planning [21] ; online near-myopic value and tree search approximation that focus on realized knowledge states in planning [22, 23, 24] ; and exploration bonus based methods where an agent acts according to an optimistic model of MDP under uncertainty [25, 26, 27] . Motivated by those studies, in this paper, we develop a model-based Bayesian reinforcement learning approach, which can account for model risk in guiding the fermentation harvesting decisions. In addition, we provide structural and sensitivity analyses to study how the model risk impacts the value function and optimal policies. Section 3.1 introduces a stochastic model to represent the protein and impurity accumulation in the fermentation process. Section 3.2 presents a Bayesian approach to capture the uncertainty in the unknown parameters of this model, and describes how this uncertainty can be updated with new protein and impurity observations during the fermentation process. Finally, Section 3.3 presents a Markov Decision Process (MDP) model formulation to optimize the harvesting decision in the fermentation process. The accumulation of protein and impurity amount in the exponential-growth phase of a fermentation process is commonly modeled with the so-called cell-growth kinetics mechanism [1] . The cell-growth kinetics mechanism is often represented as an ordinary differential equation. To be specific, the protein amount at time t, denoted by p t , is given by the functional form dp t dt = µp t , where µ is referred to as the specific growth rate of protein. 
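For completeness, separating variables in this differential equation gives the closed form used in the next step:

\[
\frac{dp_t}{dt} = \mu p_t
\;\Longrightarrow\;
\int_{p_0}^{p_t} \frac{dp}{p} = \int_{0}^{t} \mu \, ds
\;\Longrightarrow\;
\ln\frac{p_t}{p_0} = \mu t
\;\Longrightarrow\;
p_t = p_0 e^{\mu t}.
\]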
Therefore, it is common to assume that the protein amount at time t follows the functional form p_t = p_0 e^{µt}, where p_0 is the starting amount of protein (seed). Similarly, the impurity amount at time t follows the functional form i_t = i_0 e^{ψt}, where ψ is the specific growth rate of impurity and i_0 is the starting amount of impurity. In our model, we consider that measurements are performed at discrete time points T = {t : 0, 1, . . . , T}, with T denoting the time point at which the fermentation process is terminated. The time between two measurements is fixed and can be a day or longer, depending on the process characteristics. This leads to the recursive representation p_{t+1} = p_t e^µ for the protein amount and i_{t+1} = i_t e^ψ for the impurity amount for t ∈ {0, 1, . . . , T − 1}. Because living biological systems are used during fermentation, the specific growth rates of protein and impurity are uncertain. To address this, we model the specific growth rates as random variables. Specifically, we let Φ_t and Ψ_t denote the independent random variables that, respectively, represent the specific growth rate of protein and the specific growth rate of impurity in the time interval from time t to t + 1. We assume that the random variables Φ_t, t ∈ {0, 1, . . . , T − 1}, are independent and normally distributed, which is a common assumption in existing biopharmaceutical manufacturing studies; see, for example, [28]. We let µ_c^(p) and σ_c^(p)2 denote the true mean and variance of the protein growth rate, and µ_c^(i) and σ_c^(i)2 denote the true mean and variance of the impurity growth rate. The modeling of growth rates as normally distributed random variables leads to the recursive representations of the protein and impurity amounts p_{t+1} = p_t e^{Φ_t} with Φ_t ∼ N(µ_c^(p), σ_c^(p)2), and i_{t+1} = i_t e^{Ψ_t} with Ψ_t ∼ N(µ_c^(i), σ_c^(i)2), (1) for t ∈ {0, 1, . . . , T − 1}, accounting for the inherent stochasticity in the accumulation of protein and impurity in a fermentation process. In (1), we use the notation ∼ to mean "distributed as" and N(a, b) to represent a normal distribution with mean a and variance b. The exponential growth phase eventually reaches the stationary phase at time point T̄, and the fermentation needs to be terminated at this point if it has not been terminated beforehand. Consistent with the literature (e.g., [1]), we assume that the value of T̄ is fixed and known at the beginning of the process. The value of T at which the fermentation process is terminated (i.e., T ≤ T̄ because the fermentation can be terminated before hitting the stationary phase) can be considered as a random variable that depends on the policy being used to terminate the process (see Section 3.3 for details). The true parameters of the underlying stochastic model for the protein and impurity growth rates, denoted by θ_c = {µ_c^(p), σ_c^(p)2, µ_c^(i), σ_c^(i)2}, are unknown and often need to be estimated from a very limited amount of real-world data (especially for new products that are not yet in production). In this work, we adopt a Bayesian approach and model the mean and variance of the protein and impurity growth rates as random variables, denoted by θ = {µ^(p), σ^(p)2, µ^(i), σ^(i)2}. In the remainder of this section, we describe how we specify the prior distribution for θ, obtain its posterior distribution by updating the prior with new data on protein and impurity accumulation, and characterize the predictive distributions for the protein and impurity growth rates. Specification of prior distribution. For the protein growth rate, we build the joint prior distribution of (µ^(p), σ^(p)2) in the following way. First, the marginal distribution of σ^(p)2 is chosen as an inverse-gamma distribution; then, conditional on σ^(p)2, the mean µ^(p) is assigned a normal prior. The parameters of these two distributions (indexed by 0, e.g., λ_0^(p)) serve as the prior parameters.
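As a concrete sketch of this prior-to-posterior machinery, the snippet below uses a generic (m, κ, a, b) parameterization of the normal-inverse-gamma prior (which may differ from the notation used in the paper), updates it with a few observed protein growth rates φ_t = ln(p_{t+1}/p_t), and samples from the resulting posterior predictive distribution; the numerical values are illustrative only.

```python
import numpy as np
from scipy import stats

def nig_update(m, kappa, a, b, x):
    """One conjugate update of a normal-inverse-gamma belief (m, kappa, a, b)
    over (mu, sigma^2) with a single normally distributed observation x."""
    m_new = (kappa * m + x) / (kappa + 1.0)
    b_new = b + kappa * (x - m) ** 2 / (2.0 * (kappa + 1.0))
    return m_new, kappa + 1.0, a + 0.5, b_new

def predictive_growth_rate(m, kappa, a, b):
    """Posterior predictive distribution of the next growth-rate observation:
    a generalized t-distribution (heavier tails than the normal likelihood)."""
    scale = np.sqrt(b * (kappa + 1.0) / (a * kappa))
    return stats.t(df=2.0 * a, loc=m, scale=scale)

# Hypothetical prior and a few observed protein growth rates phi_t = ln(p_{t+1}/p_t).
belief = (0.30, 1.0, 2.0, 0.02)          # (m, kappa, a, b), illustrative values only
for phi in [0.28, 0.35, 0.31]:
    belief = nig_update(*belief, phi)

pred = predictive_growth_rate(*belief)
print("predictive mean:", pred.mean(), "predictive std:", pred.std())
print("sampled next-period growth rates:", pred.rvs(size=3, random_state=0))
```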
It then follows that the joint prior distribution of (µ (p) , σ (p)2 ) has a normal-inverse-gamma distribution [29] , i.e., For the impurity growth rate, we obtain the joint prior distribution of (µ (i) , σ (i)2 ) in a similar way by using the prior parameters α It is well known that normal-inverse-gamma distribution is conjugate when combined with normally distributed observations [29, 30] . This enables us to efficiently update the prior distributions with the arrival of new data to obtain the posterior distributions. Characterization of the posterior distribution. Suppose that (α (p) distributed protein growth rate Φ t . Then, the posterior distribution of (µ (p) , σ (p)2 ) follows a normal-inverse-gamma distribution, i.e., with updated parameters We emphasize that the distribution of (µ (p) , σ (p)2 ) is updated by using protein data, while the distribution of (µ (i) , σ (i)2 ) is updated by using impurity data. Therefore, the posterior distribution of (µ (i) , σ (i)2 ) is independent of the posterior distribution of (µ (p) , σ (p)2 ). Given our belief (α t ) about the distribution of impurity growth rate at time point t, since the observation ψ t = ln(i t+1 /i t ) is a random realization of the normally distributed impurity growth rate Ψ t , the posterior distribution of (µ (i) , σ (i)2 ) is also a normal-inverse-gamma; i.e., with the updated parameters For further details on the Bayesian update procedure, we refer the reader to [29] . Posterior predictive distribution of the growth rates. Given the Bayesian model described above, the density of the protein growth rate at time t, conditional on the historical protein data (i.e., summarized by the belief parameters (α t ) is the joint posterior density of the normal-inverse-gamma distributed (µ (p) , σ (p)2 ). In (8) , the integral marginalizes out the variables µ (p) and σ (p)2 , leading to the predictive density of the future observation of the protein growth rate given the current belief parameters on the underlying protein growth model. We let Φ t denote the predictive protein growth-rate random variable at time point t, and it has the density given in (8) , accounting for both the inherent stochasticity in the protein accumulation and model risk. It can be shown that the random variable Φ t follows a generalized t-distribution [31] , i.e., b follows a standard t-distribution with v degrees of freedom. We refer to the term β (p) (9) as the predictive variance of the protein growth rate, and denote it with σ The same result holds for the predictive impurity growth-rate Ψ t at time point t: where we refer to the term β (i) (10) as the predictive variance of the impurity growth rate and denote it with σ (i)2 t . In the remainder of the paper, we use f t (·) to denote the posterior predictive density functions of the random variables Φ t and Ψ t , respectively. It is of practical importance to optimize when to harvest the fermentation process under limited historical data. We will formulate this problem as a Markov Decision Processes (MDP) model with Bayesian updates on the parameters of the protein and impurity growth-rate distributions. Decision Epochs. We consider a finite-horizon discrete-time model with decision epochs T = {t : 0, 1, . . . , T }, representing the time points at which the protein and impurity amounts are measured (see Section 3.1). 
The value of T (i.e., the time point at which the fermentation process is terminated) can be at mostT (i.e., the time point at which the fermentation enters the stationary phase), because the fermentation process must certainly be terminated in the stationary phase. Physical States. At decision epoch t, the physical state S t is specified by the current protein amount p t ∈ R + and the current impurity amount i t ∈ R + in the fermentation process, i.e., S t = (p t , i t ), where R + is the set of positive real numbers. Action Space. At a decision epoch before reaching the stationary phase, we can either continue the fermentation process one more time period (denoted by action C) or terminate the fermentation process by harvesting it (denoted by action H). The harvest action is the only possible action if: (1) the current protein amount is greater than the harvesting limit P ; i.e., the protein growth beyondP always triggers harvesting; (2) there is a batch failure, caused by the impurity level exceeding the threshold levelĪ, i.e., the maximum impurity value predefined in accordance with regularity standards on batch quality; or (3) the fermentation reaches the stationary phase, meaning that the current decision epoch is the time pointT . That is, the action space in a decision epoch depends on the physical state and whether the stationary phase is reached. The action space can be formalized as {H} if p t >P or i t >Ī or t =T . On the other hand, the action space is given by {C, H} if p t ≤P , i t ≤Ī, and t Ī), then the failure penalty r f is charged as the cost of losing the batch due to the failure. On the other hand, if there is no failure at the harvesting moment (i.e., if i t ≤Ī), the harvest reward is collected as an immediate reward. In (12) , c 0 > 0 represents the lump-sum reward collected per fermentation batch, while c 1 > 0 and c 2 > 0 represent the marginal reward collected per unit of protein and the marginal cost encountered per unit of impurity, respectively. Notice that the protein amount up to the harvesting limitP is valuable, and no reward is collected from the excess protein beyond the harvesting limit. To summarize, given the physical states (p t , i t ) and the action a t , the reward R(p t , i t ; a t ) at decision epoch t can be written as, Policy: Let π denote a nonstationary policy {π t (·); t = 0, 1, . . . ,T }, which is a mapping from any hyper state H t to an action a t , i.e., a t = π t (H t ). Given the policy π, the expected total discounted reward is where γ ∈ [0, 1] is the discount factor. Notice that (14) represents the expected total discounted reward under the policy π from time point 0 until the termination of the fermentation process. The stopping time T in (14) is the decision epoch at which the harvest action is taken; i.e., if π t (H t ) = H, then T = t. Our objective is to find the optimal policy π * that maximizes the expected total discounted reward, i.e., π = arg max π ρ(π). Value Function: The value function V t (H t ) is defined as the expected total discounted reward starting from the decision epoch t with hyper state H t under the optimal policy π * , i.e., , represents the maximum expected total discounted reward starting from decision epoch t with physical state (p t , i t ) and knowledge state I t , and it can be recursively written as for t = 0, 1 . . . ,T − 1. 
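In this notation, the immediate harvest reward and the value-function recursion can be written in the following form; this is a sketch consistent with the definitions above, not necessarily identical to the paper's displays:

\[
R(p_t, i_t; H) =
\begin{cases}
r_h(p_t, i_t) = c_0 + c_1 \min\{p_t, \bar{P}\} - c_2\, i_t, & \text{if } i_t \le \bar{I},\\
-\,r_f, & \text{if } i_t > \bar{I},
\end{cases}
\]

and, for t = 0, 1, \ldots, \bar{T} - 1,

\[
V_t(p_t, i_t, I_t) = \max\Big\{ R(p_t, i_t; H),\;
R(p_t, i_t; C) + \gamma\, \mathbb{E}\big[ V_{t+1}\big(p_t e^{\tilde{\Phi}_t},\, i_t e^{\tilde{\Psi}_t},\, I_{t+1}\big) \,\big|\, I_t \big] \Big\},
\]

where the expectation is taken with respect to the posterior predictive growth rates \(\tilde{\Phi}_t\) and \(\tilde{\Psi}_t\) and the induced update of the knowledge state \(I_t \to I_{t+1}\), \(R(p_t, i_t; C)\) denotes the one-period reward of continuing (e.g., \(-c_u\) in the case study of Section 6), and the maximization reduces to the harvest term whenever \(p_t > \bar{P}\), \(i_t > \bar{I}\), or \(t = \bar{T}\).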
At the decision epochT (i.e., if the stationary phase is reached in the fermentation process), the value function is equal to because the only feasible action is to harvest if the time pointT is reached, and either the harvesting reward or the failure cost is charged depending on the impurity amount at decision epochT . Recall that the hyper state transits to the absorbing stopping state ∆ after the harvest action, and the value function is equal to zero for a process already at the stopping state, i.e., V t (H t ) = 0 if H t = ∆ at any t. So, the transition to the stopping state ∆ is omitted in (15) and (16). Section 4.1 presents a characterization of the variability in the posterior predictive distribution of the growth rates. Section 4.2 provides some analytical properties of the optimal policy under model risk. Section 4.3 focuses on the perfect-information case with no model risk, and establishes the optimal policy. Section 4.4 uses this result to construct a harvesting policy that can be considered as an efficient approximation to the optimal policy under model risk. All the proofs are provided in the Appendix. Recall that the uncertainty in the protein and impurity growth rates comes from two sources: the intrinsic stochasticity of the fermentation and the model risk. Conditional on the historical data collected until the decision epoch t, the predictive protein growth rate Φ t and the predictive impurity growth rate Ψ t are the random variables that the decision maker uses to model the growth rates, and these random variables account for both sources of uncertainty (see Section 3.2). The objective of this section is to quantify the contribution of each source of uncertainty to the variance of the random variables Φ t and Ψ t , denoted with σ (p)2 t and σ (i)2 t , respectively. ψ 1 ) , . . . , (φ Jt , ψ Jt )} denote the historical data on past realizations of the growth rates available at the t-th decision epoch of the fermentation process. It is possible that the data size J t can be greater than t as the historical data D t may also include the growth-rate realizations from the previous fermentation processes. Recall from Section 3.2 that the knowledge states can be recursively written as function of the historical data D t ; see equations (5) and (7). By applying the commonly used improper prior that assumes the initial belief states α 0 are all equal to 0, the knowledge states can be obtained as The posterior predictive variances are then given by Next, by considering the randomness in the historical data D t , Proposition 1(i) establishes the expectation and variance of σ (p)2 t (i.e., similar to the characterization of the expectation and variance for the sample variance of a set of realizations from a specific population). For a particular realization of the historical data set D t , Proposition 1(ii) characterizes the variance σ (p)2 t as the sum of two closed-form terms that represent the variability in the growth rate due to the intrinsic stochasticity of the fermentation process and the model risk, respectively. and Var σ (ii) Conditional on the historical data D t , the predictive variance σ (p)2 t for the protein growth rate can be decomposed into two componentsσ , representing the variability of the protein growth rate due to intrinsic stochasticity and the model risk, respectively; i.e., σ We notice from Proposition 1(i) that the bias E σ will be greater than the true variance σ . 
Furthermore, as the amount of historical data J t increases, Var σ (p)2 t converges to zero, and the predictive variance σ (p)2 t will converge to σ (p)2 c , which represents the protein growth-rate variance under perfect information. This property will be used later in Theorem 5 to show the convergence of a proposed harvesting policy to the optimal policy under perfect information. For a particular realization of the historical data set D t , Proposition 1(ii) is useful in practice as it allows making a judgement on how the overall uncertainty in the protein growth rate is affected from the inherent stochasticity of the fermentation process and from the model risk. Notice that the ratioσ increases by collecting more historical data, i.e., see (5), we note that the ratioσ decreases by collecting more historical data. This intuitively shows that the model risk becomes smaller (relative to the intrinsic stochasticity of the process) as the size of the historical data increases. Notice that the results in Proposition 1 also apply for the impurity growth model. To be specific, the same results and discussion hold for the predictive variance σ (i)2 t of the impurity growth rate, as the functional form and the underlying modeling assumptions are the same as in the protein growth model. For brevity, we do not repeat those results in the paper. Properties with respect to physical state. We first analyze the structural properties related to physical state (p t , i t ) at a given knowledge state I t . We start our analysis by first showing the monotonicity of the value function, and then present sufficient conditions for the existence of a control limit policy with respect to the impurity level. Theorem 1. Given the knowledge state I t , the value function V t (p t , i t , I t ) is a non-increasing function of the impurity level i t and a non-decreasing function of the protein level p t . Based on the monotonicity properties presented in Theorem 1, we can derive sufficient conditions for the existence of a control limit policy as follows. Theorem 2. At any decision epoch t with a given protein level p t and knowledge state, there exists a threshold i t such that the optimal decision is to harvest if and only if the impurity level i t ≥ i t given that the following condition holds t (ψ t )dψ t is the probability that the process failure doesn't occur in the time period that starts with the impurity level i t at decision epoch t. Theorem 2 presents the existence of the optimal threshold i * t with respect to the impurity level. It indicates that given the same protein and knowledge states, if we harvest at a certain impurity level, we will also harvest at any higher level of impurity. That is, the physical-state space can be split into a harvest zone and a continue zone, indicating the optimal action at a particular physical state. We note that the condition (18) in Theorem 2 is more likely to be satisfied as the reward parameters c 1 and c 2 decrease and the failure penalty r f increases. Properties with respect to knowledge state. We next analyze the monotonicity properties of the value function V t (S t , I t ) with respect to the variables in the knowledge state I t at a given physical state S t . Furthermore, we know from (9) and (10) that the predictive growth rates Φ t and Ψ t follow a t-distribution, which is symmetric on both side of its mean. 
However, the exponential terms e Φt and e Ψt in the growth model (1) will be impacted more by the right-tail values of the growth rates than the left tail due to the exponential transformation. Therefore, a higher variance for the predictive growth rate implies a higher protein and impurity level in the next time period. Intuitively, since the value function is non-increasing in impurity level and non-decreasing in protein level, the value function is also non-increasing in the predictive variance of the impurity growth rate (i.e., Var[ Ψ t ], denoted by σ (i)2 t ) and non-decreasing in the predictive variance of the protein growth rate (i.e., Var[ Φ t ], denoted by σ (p)2 t ). This leads to the monotonicity results in Theorem 3(ii)-(iii) because we already know Proposition 1(ii) that σ (p)2 t and σ (i)2 t have monotonicity in terms of the knowledge-state variables; i.e., σ Finally, we notice that the monotonicity results with respect to λ (p) t and λ (i) t were analytically intractable when the predictive growth rates follow the t-distribution. However, we were able to establish it in Theorem 3(iv) based on normal approximation of the t-distribution. It is well known that with reasonably large degrees of freedom (i.e., λ (p) t and λ (i) t taking values close to 30), the predictive distribution can be accurately approximated by a normal distribution. In this section, we consider that the true parameters of the underlying stochastic model are known, referred to as the perfect-information case. That is, the parameters µ for the impurity growth rate are known by the decision maker. The objective of this section to provide an analytical characterization of the optimal action under the perfect-information case at decision epoch t, starting with protein amount p t and impurity amount i t . The intuition is first to characterize the probability of certain events in the evolution of the protein and impurity amounts and then to compare the total reward associated with each possible harvesting moment. We start our analysis in Lemma 1 by characterizing certain properties of the impurity amount at a specific time step in the future (under the policy that always takes the continue action). Our analysis in this subsection assumes that the probability of a negative growth rate is negligible (i.e., the realizations of the growth rate random variables Φ t and Ψ t are always non-negative), which is often the case in practice with standard deviation of the normally distributed growth rates expected to be much smaller than their mean values. Lemma 1. Suppose that a fermentation process which has impurity amount i t at decision epoch t does not fail by the decision epoch t + k − 1 for k ≥ 1. (i) The probability that a failure is observed between decision epochs t + k − 1 and t + k is given by where i [t:t+k−1] = (i t , i t+1 , . . . , i t+k−1 ) and Φ(·) is the c.d.f. of a standard normal random variable. (ii) Suppose that the process does not fail by decision epoch t + k. The conditional expectation of the impurity amount at decision epoch t + k is given by Remark 1. The expression for Pr i t+k >Ī i [t:t+k−1] ≤Ī characterized in (19) can be computed numerically by using sample average approximation, c ) truncated at zero with n = 1, 2, . . . , N . It is important to note that a counterpart of Lemma 1, which focuses on impurity accumulation process, can also be formulated for the protein accumulation process. 
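The sample-average approximation mentioned in Remark 1 can be sketched in a few lines of Python; the growth-rate parameters and thresholds below are hypothetical placeholders, and negative growth-rate draws are handled by resampling, consistent with the truncation at zero.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def truncated_normal(mu, sd, size):
    """Draw growth rates from N(mu, sd^2) truncated at zero (negative growth
    rates are assumed negligible, as in the analysis of this subsection)."""
    draws = rng.normal(mu, sd, size=size)
    while np.any(draws < 0):                      # simple resampling scheme
        neg = draws < 0
        draws[neg] = rng.normal(mu, sd, size=neg.sum())
    return draws

def failure_prob_saa(i_t, k, mu_i, sd_i, I_bar, n_samples=100_000):
    """Sample-average approximation of
    Pr(i_{t+k} > I_bar | i_{t:t+k-1} <= I_bar) for a process currently at i_t."""
    impurity = np.full(n_samples, i_t, dtype=float)
    survived = np.ones(n_samples, dtype=bool)     # no failure through t+k-1
    for _ in range(k - 1):
        impurity *= np.exp(truncated_normal(mu_i, sd_i, n_samples))
        survived &= impurity <= I_bar
    impurity_next = impurity * np.exp(truncated_normal(mu_i, sd_i, n_samples))
    fail_now = survived & (impurity_next > I_bar)
    return fail_now.sum() / max(survived.sum(), 1)

# Illustrative (hypothetical) values in the spirit of the Section 6 case study.
print(failure_prob_saa(i_t=20.0, k=3, mu_i=0.30, sd_i=0.08, I_bar=50.0))
```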
In particular, if the terms µ_c^(i) and σ_c^(i) in (19) and (20) are replaced with µ_c^(p) and σ_c^(p), and the impurity amounts are replaced with the protein amounts, we obtain the analogous expressions for the protein accumulation process, namely Pr(p_{t+k} ≥ P̄ | p_{[t:t+k−1]} ≤ P̄) and the conditional expectation of the protein amount at decision epoch t + k. Building on these expressions, Theorem 4 characterizes the optimal harvesting decision under perfect information by comparing the reward of harvesting immediately with the expected reward associated with each possible future harvesting moment. Recall from Section 3.3 that when p_t > P̄ or i_t > Ī, only the harvest action is possible. That is why Theorem 4 only focuses on the physical states with p_t ≤ P̄ and i_t ≤ Ī. We note that Theorem 4 leads to an analytical characterization of a region of the physical-state space, referred to as the harvest zone, consisting of the physical states (p_t, i_t) at which the optimal action is to harvest. In this section, the true parameters of the underlying stochastic model are not known. That is, in contrast to the case with perfect information, there is model risk. The objective of this section is to investigate how the model risk affects the harvesting decisions. For this purpose, we introduce a policy that serves as an approximation to the optimal policy under model risk, and we compare it with the optimal policy under perfect information. Recall that Theorem 4 compares the reward of harvesting immediately with the expected reward of harvesting at all possible future decision epochs (to decide whether it is optimal to harvest at the current decision epoch) under perfect information. In the absence of perfect information, the predictive distributions of the growth-rate random variables (i.e., the distributions characterized in (9) and (10)) can be used by approximating them with a normal distribution. Then, the harvesting decision can be made at decision epoch t by following the same decision rule established in Theorem 4 after replacing the unknown true parameters with the corresponding parameters of the predictive distributions (e.g., replacing µ_c^(p) and σ_c^(p) with the predictive mean and standard deviation of the protein growth rate). Since this decision rule does not update the model risk in future decision epochs (i.e., the predictive distributions at decision epoch t, obtained from the data collected until decision epoch t, are used in the remainder of the fermentation process), we refer to this approximation policy as the harvesting policy under model risk without forward-looking learning. Figure 2 shows that the optimal actions under perfect information (and hence the actions of the harvesting policy under model risk without forward-looking learning) can be represented with a decision boundary; i.e., it is always better to take the harvest action at the physical states above the decision boundary and to take the continue action at the physical states below it. To be specific, Figure 2 considers a scaling factor k that links each parameter of the predictive growth-rate distribution to its true counterpart, and plots the decision boundaries for some relevant values of k. For example, k = 1 represents the case where each predictive-distribution parameter reduces to its true counterpart (e.g., α_t^(p), the mean of the predictive distribution, becomes equal to µ_c^(p), the true mean of the protein growth rate), and thus the resulting decision boundary becomes equivalent to the decision boundary of the optimal policy under perfect information. An immediate observation from Figure 2 is that the decision boundary is first an increasing and then a decreasing function of the protein amount. Notice that this result is consistent with Theorem 2, which states the existence of a critical threshold with respect to the impurity amount. Figure 2 (top, left) shows that as the predictive mean of the protein growth rate α_t^(p) increases, the harvest boundary moves up, indicating that it becomes less beneficial to harvest immediately when the potential future gain in protein amount is high.
Also, the difference on the left part (with small amount of protein p t ) is larger than on the right (with large amount of protein p t ), since as protein approaches the harvesting limit, an increase in α (p) t becomes less influential on the harvesting decision. Figure 2 (top, right) shows that as the predictive mean of impurity α (i) t increases, the harvest boundary moves down. This is intuitive because a higher α (i) t implies a stronger belief on the impurity growth rate (and hence a higher probability of a fermentation failure), so a smaller amount of current impurity level is sufficient to trigger an harvest action. According to Figure 2 (bottom, left), as the parameter σ (p) t increases, the decision boundary shifts to the left. That is, with higher uncertainty of protein growth, we tend to harvest with smaller amount of protein especially when it is close to the maximum protein level. On the other hand, as the parameter σ (i) t increases, Figure 2 (bottom, right) shows that the decision boundary moves down. A larger σ (i) t implies higher uncertainty in the impurity growth rate, and we tend to harvest with smaller amount of impurity to avoid a batch failure. Theorem 5. As the size of the historical data D t approaches infinity (i.e., as J t → ∞), the harvesting policy under model risk without forward-looking learning becomes equivalent to the optimal policy under perfect information. The proof of Theorem 5 follows from the convergence of the parameters of the predictive growth-rate distributions to the corresponding parameters of the true growth-rate distributions. Intuitively, this represents the situation with sufficient amount of historical data such that the underlying fermentation process is already learned accurately and the use of future data is no longer necessary. In this section, we provide a Bayesian reinforcement-learning algorithm to solve for the optimal policy that minimizes the objective function in (14) . Different from the approximate policy in Section 4.4, the optimal policy takes the model risk into account by incorporating the effect of learning from future data on harvesting decisions (in a forward-looking manner). We emphasize that the harvest action at physical states (p t , i t ) ends the fermentation with a deterministic (known) reward. Therefore, it is only needed to estimate the total reward associated with the continue action and following the optimal policy thereafter, denoted by the corresponding Q-function for t = 0, 1, . . . ,T − 1. On the other hand, the Q-function associated with the harvest action is denoted with for t = 0, 1, . . . ,T . Recall that harvesting is the only feasible action for i t >Ī, p t > P , or t =T , and the harvest action takes the state of the system to a cost-free absorbing state. For a fermentation process that has not yet been harvested at decision epoch t, the value function is given by V t (p t , i t , I t ) = max at Q t (p t , i t , I t ; a t ). Theoretically, the optimal policy that maps any possible hyper state to an action can be obtained through backward dynamic programming (this is also referred to as offline planning). However, solving this dynamic program is notoriously difficult and also not necessary in practice, given that the optimal policy is only needed starting from a specific physical state (which evolves by visiting certain states more likely than others) in real-life execution of the fermentation process. 
Therefore, we adopt a solution approach that executes the policy in an online manner: we focus on estimating the Q-function in (23) at the particular current state (p_t, i_t, H_t), and decide to continue or harvest the fermentation process by comparing it with the harvesting reward in (24). After the selected action is executed in real life, the next decision epoch starts with a new hyper state, at which the entire procedure is repeated. We refer to this procedure as Reinforcement Learning under Model Risk (RL with MR) and provide an outline in Algorithm 1. In Step (2) of Algorithm 1, the Q-function needs to be estimated. For this purpose, we apply the sparse sampling idea of [32], which generates future trajectories (i.e., state-action pairs over time) to model the uncertainty in the hyper-state evolution and chooses the optimal action by constructing a look-ahead tree with these trajectories. Similar to [33] and [22], we generate the future states by using the Bayesian updated posterior predictive distributions. To be specific, given the current hyper state H_t = (S_t, I_t) with S_t = (p_t, i_t), the samples of the next hyper state can be generated through the transition probability Pr(S_{t+1}, I_{t+1} | S_t, I_t, a_t) = Pr(S_{t+1} | S_t, I_t, a_t) Pr(I_{t+1} | S_{t+1}, S_t, I_t, a_t) in two steps. First, we generate K scenarios of the physical state (p_{t+1}, i_{t+1}) by sampling the protein and impurity growth rates from their posterior predictive distributions; second, given each sampled pair of growth rates, the knowledge state is updated by using (5) and (7). Figure 3 illustrates the main idea of Bayesian sparse sampling with the corresponding look-ahead tree. Bayesian sparse sampling considers a look-ahead tree starting with the root at the current hyper state H_t, followed by two potential actions: harvest or continue fermentation. The Q-function value of the harvest decision is known and given by (24). For the continue action, the expected value E[max_{a_{t+1}} Q_{t+1}(p_{t+1}, i_{t+1}, I_{t+1}; a_{t+1})] in (23) is estimated by using sample average approximation, which grows the tree into K child nodes, each representing a sample of the next hyper state H_{t+1}^(k) for k = 1, 2, . . . , K. The same procedure is repeated for each node until all the leaf nodes reach the end of the fermentation, i.e., until the harvest decision is made. The Q-function (and value function) estimates are then computed by rolling back from the leaf nodes up to the root. The procedure is summarized in Algorithm 2. It is known that the online sparse sampling algorithm (which is implemented by Algorithms 1 and 2 for our problem) is guaranteed to compute a near-optimal action at any state of the MDP model [32]. The theoretical results on the quality of the near-optimal action depend on the number of samples K at each node and the depth of the look-ahead tree. In our implementation of the sparse-sampling algorithm, we do not set a specific value for the depth of the look-ahead tree, but instead continue branching until all the leaves represent a harvest action (i.e., step (A3) in Algorithm 2 ends up with step (B1) for each k). Notice that the tree grows until at most decision epoch T̄, at which the fermentation process must be terminated (if not done yet). In our implementation, the number of samples K is the only parameter that controls the computation time and the quality of the solution; a compact sketch of this look-ahead procedure is given below.
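The following Python snippet outlines the procedure in a simplified, self-contained form, in the spirit of Algorithms 1 and 2. The normal-inverse-gamma parameterization and the belief values are illustrative assumptions rather than the paper's exact implementation, while the reward and threshold constants loosely follow the case study in Section 6.

```python
import numpy as np
from scipy import stats

# Illustrative constants (loosely following the Section 6 case study).
P_BAR, I_BAR, T_BAR = 30.0, 50.0, 8
C0, C1, C2, CU, RF, GAMMA, K = 0.0, 10.0, 1.0, 2.0, 880.0, 1.0, 5

def nig_update(m, kappa, a, b, x):
    """Conjugate normal-inverse-gamma update with one growth-rate observation."""
    return ((kappa * m + x) / (kappa + 1), kappa + 1,
            a + 0.5, b + kappa * (x - m) ** 2 / (2 * (kappa + 1)))

def sample_growth(belief, rng):
    """Sample one growth rate from the posterior predictive t-distribution."""
    m, kappa, a, b = belief
    scale = np.sqrt(b * (kappa + 1) / (a * kappa))
    return stats.t.rvs(df=2 * a, loc=m, scale=scale, random_state=rng)

def harvest_reward(p, i):
    return C0 + C1 * min(p, P_BAR) - C2 * i if i <= I_BAR else -RF

def q_continue(t, p, i, belief_p, belief_i, rng):
    """Estimate the continue-action Q-value by Bayesian sparse sampling:
    branch into K sampled successor hyper states and recurse."""
    total = 0.0
    for _ in range(K):
        phi, psi = sample_growth(belief_p, rng), sample_growth(belief_i, rng)
        p1, i1 = p * np.exp(phi), i * np.exp(psi)
        b_p1, b_i1 = nig_update(*belief_p, phi), nig_update(*belief_i, psi)
        total += value(t + 1, p1, i1, b_p1, b_i1, rng)
    return -CU + GAMMA * total / K

def value(t, p, i, belief_p, belief_i, rng):
    if t == T_BAR or p > P_BAR or i > I_BAR:      # harvest is the only action
        return harvest_reward(p, i)
    return max(harvest_reward(p, i),
               q_continue(t, p, i, belief_p, belief_i, rng))

def act(t, p, i, belief_p, belief_i, rng=np.random.default_rng(0)):
    """Online decision at the current hyper state: harvest or continue."""
    if t == T_BAR or p > P_BAR or i > I_BAR:
        return "harvest"
    qc = q_continue(t, p, i, belief_p, belief_i, rng)
    return "harvest" if harvest_reward(p, i) >= qc else "continue"

# Example call with hypothetical beliefs (m, kappa, a, b) for protein and impurity.
print(act(t=5, p=18.0, i=30.0,
          belief_p=(0.35, 4, 2.5, 0.03), belief_i=(0.30, 4, 2.5, 0.05)))
```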
In literature, it is typical to set the value of K not too large considering the branches grow exponentially; e.g., [32] and [34] set K equal to 5 in their implementation of Bayesian sparse sampling. In our numerical study, we set K = 10, as it is observed that the results do not change by increasing the value of K further. We present a numerical study motivated by MSD Animal Health, Boxmeer, the Netherlands. We use this case study to (1) analyze the potential impact of model risk on optimal policies and the value function, and (2) evaluate the potential room for improvement in practice. First, we present an overview of the problem setting in Section 6.1, and evaluate the impact of model risk in Section 6.2. We consider practically-relevant problem settings to evaluate the potential room for improvement in Section 6.3. Problem setting. Our case study is constructed based on process data of 21 historical batches. We establish the main problem parameters as follows. The starting protein and impurity for each batch are p 0 = 1.5, i 0 = 2.0 grams. In this case study, the harvesting limit isP = 30 grams, the batch failure impurity threshold isĪ = 50 grams, and the maximum time for a batch to reach the stationary phase isT = 8 hours. To represent common industry practice, we considered the following cost and reward structures: harvest reward r h (p, i) = 10p − i, the one-step operation cost c u = 2, and the batch failure penalty r f = 880 with no discounting. These cost structures are identified based on input received from our industry partners. To protect confidentiality, we disguised MSD's original data and used representative values in this case study. In practice, different types of seed cells may exhibit different levels of inherent stochasticity (e.g., bacterial proteins may be more stable than those of RNA viruses). Therefore, the numerical analysis considers two levels of inherent stochasticity to capture a wider range of practically-relevant settings: (1) low inherent stochasticity with the protein and impurity growth parameters µ Analysis overview. We use the case study to generate insights for practitioners and to provide an understanding of the problem configurations that have high potential for improvement. For this purpose, we consider four practically-relevant strategies: 4 • Perfect information MDP (PI-MDP). To establish a benchmark, we consider the setting where the decisionmaker has perfect information on fermentation dynamics (i.e., no model risk). This setting represents the best possible performance that can be achieved. In this setting, the MDP model is solved as described in Section 4.3 by considering that the true parameters µ • Reinforcement learning ignoring model risk (RL ignoring MR). This strategy represents the case when the decision-maker considers the inherent stochasticity of fermentation processes but ignores the model risk. To estimate state transitions, the decision-maker derives the point estimates from limited historical data (by using maximum likelihood estimation), and uses these point estimates as the true parameters µ c of the MDP model. Then, the MDP model is solved to guide harvesting decisions. • Reinforcement learning with model risk (RL with MR). The decision-maker considers both the inherent stochasticity of fermentation processes and the model risk caused by limited historical data. Optimal harvesting policy is obtained as described in Section 5. • Current practice (CP). 
In common practice, fermentation is typically harvested based on a "fix-threshold" approach, i.e., harvest when the impurity amount i t exceeds 60% of the maximum amount permittedĪ (i.e., harvest when i t ≥ 0.6Ī) or when the protein amount p t reaches the limitP or when the fermentation process transitions to the stationary phase. These fixed-thresholds represent simple rule-of-thumbs and are typically prespecified based on experience. Performance metrics. We adopt the following approach to evaluate the performance of each strategy. For any policy π obtained by these four strategies, we evaluate the policy π through the expected total reward achieved for the underlying true process, as defined in Section 3.1, where the expectation is taken with respect to the stochastic trajectory τ c ≡ (S 0 , a 0 , S 1 , a 1 , . . . , S T −1 , a T −1 , S T ) with stopping time T , following the underlying physical state transition Pr(S t+1 |S t , a t ; θ θ θ c ) and decision policy a t = π t (S t ). Based on the given policy π, we can generate trajectory realizations τ c n ≡ (S Tn ) with n = 1, 2, . . . , N from the underlying model, and the expected total reward can be estimated by, where T n denotes the stopping time in the trajectory realization n. Further, to assess the process stability, we measure the batch-to-batch variations under a given policy through the standard deviation (SD) of the total reward, which is estimated by sample SD from N replications of simulation experiments, i.e., For each strategy, we run N = 100 replications to compare the mean and the standard deviation of the expected total reward. We now investigate the impact of model risk on policies and the value function. For this purpose, our analysis considers three different sizes of historical data, J 0 = 3, 10, 20. We use the strategy PI-MDP as a benchmark to represent the case with perfect information. First, we report the optimal policies and the value function under PI-MDP. Figure 4 shows the optimal harvesting decisions (left panel) and the Q-function (right panel) over the physical states of the state space, i.e., p t ∈ [1.5, 30] and i t ∈ [2.0, 50]. When the inherent stochasticity is high, we observe from Figure 4 that optimal policies tend to harvest earlier (i.e., harvest at smaller protein and impurity levels) in order to proactively hedge against the higher prediction risk. Next, we consider different sizes of historical data (J 0 = 3, 10, 20) to study the impact of model risk on optimal policies and the value function 5 . Figures 5 and 6 illustrate the harvesting decision boundaries (left panels) and the 95% confidence bands of the estimated Q-function for continue decision (right panels). We consider the cases when the inherent stochasticity is low ( Figure 5 ) and high ( Figure 6 ) under different size of historical data. Figures 5 and 6 show a similar trend on how the optimal harvesting threshold moves as the number of historical data J 0 increases from 3 to 20. More specifically, as the number of historical data increases, we see that the 95% confidence band of the harvest boundary shrinks and becomes closer to the one under the optimal harvest policy with perfect information. By comparing Figures 5 and 6 , we also see that we need more data to achieve similar accuracy for the harvest decision boundary when the inherent stochasticity is high. 
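The performance metrics ρ̂_c(π) and SD_c(π) described above can be sketched in a few lines; the snippet below evaluates the fixed-threshold current-practice rule by simulation, with purely illustrative growth parameters standing in for the underlying true process.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Illustrative case-study-like constants (Section 6.1).
P0, I0, P_BAR, I_BAR, T_BAR, CU, RF = 1.5, 2.0, 30.0, 50.0, 8, 2.0, 880.0

def cp_policy(t, p, i):
    """Current practice: harvest at 60% of the impurity limit, at the protein
    limit, or when the stationary phase is reached."""
    return "harvest" if (i >= 0.6 * I_BAR or p >= P_BAR or t == T_BAR) else "continue"

def run_batch(policy, mu_p=0.35, sd_p=0.05, mu_i=0.30, sd_i=0.08):
    """Simulate one batch under a given policy and return its total reward."""
    p, i, total = P0, I0, 0.0
    for t in range(T_BAR + 1):
        if policy(t, p, i) == "harvest":
            return total + (10.0 * min(p, P_BAR) - i if i <= I_BAR else -RF)
        total -= CU                                   # one-step operating cost
        p *= np.exp(rng.normal(mu_p, sd_p))
        i *= np.exp(rng.normal(mu_i, sd_i))
    return total

rewards = np.array([run_batch(cp_policy) for _ in range(100)])   # N = 100 replications
print(f"estimated expected total reward: {rewards.mean():.1f}")
print(f"estimated batch-to-batch SD:     {rewards.std(ddof=1):.1f}")
```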
For practitioners, these observations imply that the impact of model risk on optimal policy becomes more pronounced when the inherent stochasticity of the fermentation is high. We now consider various problem settings to evaluate the performance of the four strategies (CP, PI-MDP, RL ignoring MR, and RL with MR). In particular, we focus on the performance metrics, such as, the expected total reward ρ c (π), and the standard deviation SD c (π) under different sizes of historical data (i.e., J 0 = 3, 10, 20) and harvest policy π, as described in Section 6.1. To provide a benchmark, Table 1 reports the performance of strategies CP and PI-MDP. Recall that harvesting policies π under both CP and PI-MDP are independent of data size J 0 . In this specific setting, we observe from Table 1 that CP does not perform well (compared to PI-MDP) when the inherent stochasticity is high. The performance of reinforcement learning strategies (RL ignoring MR, and RL with MR) are reported in Table 2 . We observe from Tables 1 and 2 that the strategy RL-with-MR provides substantial benefits (in terms of the performance metrics ρ c and SD c ) compared to all other strategies in all scenarios considered. In addition, we observe from Table 2 that the average reward of both strategies RL-ignoring-MR and RL-with-MR increases while variability decreases as the number of historical data increases. For practitioners, these results emphasize the business value (and the potential impact) of accounting for the model risk in harvesting decisions. In addition, we observe that the strategy RL-with-MR results in a lower standard deviation SD c compared to current practice CP, even when the amount of historical data is small (J 0 = 3). This observation is interesting because the objective of the optimization model is to maximize the expected total reward (not to minimize variability). Nevertheless, we obtained a similar result from the real-world implementation at MSD (i.e., the implementation has led to a 20% reduction in the standard deviation of the annual production yield compared to current practice, as discussed in Section 7.2). Lastly, we provide a numerical analysis on reward and cost parameters. In particular, we assess the sensitivity with respect to (i) the unit reward obtained per protein amount, i.e., r h (p, i) = 5p−i, r h (p, i) = 10p−i and r h (p, i) = 15p−i; Table 3 . For each different revenue and cost configuration, Table 3 presents the mean ρ c (π) and standard deviation SD c (π) of total reward realized by 100 batches. For brevity, we report the results for the case when inherent stochasticity is low. 6 Each column corresponds to one of the four strategies defined earlier. We considered the size of historical data J 0 = 10 for both strategies RL-ignoring-MR and RL-with-MR. For each strategy, we observe from Table 3 that the average and standard deviation of the total reward increase as the unit reward of protein increases; and the average reward decreases but the standard deviation increases as the failure cost increases. This trend matches with our intuition. 
For the cases with lower protein rewards and/or lower failure costs, we observe from Table 3 that the performance of the proposed RL-with-MR strategy (under $J_0 = 10$) is closer to that of the perfect information setting.

We elaborate on the implementation process in Section 7.1, and discuss the impact on MSD's daily operations in Section 7.2.

The team. The research project was conducted in close collaboration between a university team (Eindhoven University of Technology in the Netherlands and Northeastern University in the United States) and a team of practitioners from MSD in the Netherlands. The university team brought in expertise in Operations Research (OR), whereas the MSD team provided expertise on aspects related to the life sciences. A multi-disciplinary team of researchers from MSD contributed to the project, including bioreactor operators, experts in life sciences, chemical and biological engineers, process improvement engineers, and middle/upper management. The university team consisted of three faculty members, one Ph.D. student, and several Master's students in Industrial Engineering.

Project management. The project is the outcome of four years of close collaboration between the university and MSD teams. We planned and monitored the project's progress through regular (online) meetings and brainstorming sessions with the university and MSD teams. In general, these meetings were conducted on a weekly or bi-weekly basis; the frequency was adjusted based on the project's specific needs. In addition, every three to four months, we conducted an update presentation and a brainstorming session with a larger team of end-users from MSD (the extended team consisted of 15-20 people from various functions in the company). In this way, we encouraged active engagement with potential end-users, which led to a smooth and successful implementation of the developed optimization model.

The decision-support tool. To facilitate the real-world implementation process, the team developed a user-friendly decision-support tool for daily use. This decision-support tool reports a look-up table of optimal decisions (harvest or continue) as a function of the system's state. The user interface of the tool was developed in MS Excel and is linked to an optimization module developed in Python. The implementation process was facilitated by Master's students who were actively engaged in the development of the decision-support tool. Several training and test runs were performed prior to the real-world implementation, and the user interface was modified and enhanced based on the feedback of the end-users. The student assistants also prepared short documentation and a video to support the daily use of the tool (for internal use only). The end-users at MSD found the decision-support tool user-friendly and were very satisfied with the project outcomes.

Data collection. We analyzed several different products during the project. Our data consisted of two main data sets for each batch: (i) fermentation information (e.g., initial amounts of protein and impurity, and information on growth rates), and (ii) physicochemical parameters monitored during fermentation (e.g., temperature, pH, and oxygen levels).
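To make the structure of these per-batch data sets concrete, the sketch below defines a simple record type for one batch; the field names, types, and example values are illustrative assumptions rather than MSD's actual data schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class BatchRecord:
    """Illustrative per-batch data record (field names are assumptions)."""
    batch_id: str
    initial_protein: float                 # protein amount at the start of fermentation
    initial_impurity: float                # impurity amount at the start of fermentation
    final_protein: float                   # protein amount at harvest
    final_impurity: float                  # impurity amount at harvest
    growth_phase_duration: int             # duration of the exponential growth phase (periods)
    protein_growth_rates: List[float] = field(default_factory=list)   # realized log-growth rates over time
    impurity_growth_rates: List[float] = field(default_factory=list)
    # Physicochemical traces monitored during fermentation, e.g. temperature, pH, dissolved oxygen
    sensor_traces: Dict[str, List[float]] = field(default_factory=dict)

# Example usage with made-up numbers:
example = BatchRecord(
    batch_id="batch-001",
    initial_protein=1.5, initial_impurity=2.0,
    final_protein=24.3, final_impurity=28.7,
    growth_phase_duration=8,
    protein_growth_rates=[0.31, 0.28, 0.33, 0.29, 0.30, 0.27, 0.32, 0.30],
    impurity_growth_rates=[0.26, 0.24, 0.27, 0.25, 0.23, 0.26, 0.24, 0.25],
)
```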
Using standard bioreactor models, the physicochemical parameters were used to determine $\bar{T}$, the duration of the exponential growth phase for a batch [1]. To summarize, we had the following data for each batch: the starting amounts of protein and impurity, the final amounts of protein and impurity (at the time the process is harvested), the duration of the exponential growth phase, and the realized protein and impurity growth rates over time. Because the main scope of our project is decision-making under limited historical data, we focused on products that were produced only a few times (over a time frame of one to two years). These products exhibit inherent stochasticity and model risk. Some of the products in our scope had as few as five data sets (i.e., fermentation data on five batches) to build prior distributions, whereas most products had data on around 10 batches at the beginning of the decision-making process. One product had no data to build the prior distribution because of a change in raw materials. In that particular case, we relied on the domain knowledge and experience of the bioreactor operators to build a prior distribution for the decision-support tool.

The developed optimization framework has been used in daily operations at MSD since 2019. The project has been implemented at multiple production lines in Boxmeer and directly affected business metrics: in total, the project resulted in a roughly 3X higher production yield on average, without expanding the existing production capacity. In this setting, we define the "production yield" as the total amount of end-products produced (i.e., the total amount of protein produced after eliminating impurities). We quantified the savings in terms of the production yield (and not cost reduction) because we found this approach more intuitive and easier to quantify with existing data. More specifically, we had data on (monthly) production yields before and after the implementation; these data were stored in software used for production planning and performance evaluation. Therefore, to quantify the impact, we compared the actual production yield over a time frame of one year prior to the implementation with that over a time frame of one year after the implementation. We believe that both the implementation of a learning-by-doing framework and the optimization of harvesting decisions contributed to the 3X improvement in the production yield. In particular, the learning-by-doing framework encouraged making the best use of the limited historical production data and helped the bioreactor operators adjust harvesting decisions in real time. Therefore, the systematic use of a data-driven approach encouraged a proactive decision-making process. Prior to the optimization framework, operating (harvesting) decisions were determined based on experience and domain knowledge. As an additional benefit, we observed a 20% reduction in the standard deviation of the annual production yield. We believe that this reduction was achieved as a consequence of better standardization of operating decisions through an OR-based decision-making framework. In addition, the project provided several non-quantified benefits:
• Flexibility. The optimization tool helped to increase the production yield without additional resources, thereby freeing up production capacity and enabling higher flexibility.
• Creation of new jobs. MSD created four new positions in the Boxmeer facility for industrial engineers and OR scientists.
The company now recognizes OR as an important skill for improving biomanufacturing efficiency.
• Improved understanding of the production challenges and trade-offs. The project enabled an improved understanding of the fermentation dynamics under limited historical data. The optimization tool helped to connect the complex dynamics of fermentation processes with business metrics, such as production yields and uncertainties.
• Environmental impact. Because we produced higher yields without expanding the production capacity, the project helped to reduce CO2 emissions.
• The use of OR in daily operations. Prior to the optimization tool, harvesting decisions were made based on prior knowledge and the bioreactor operators' experience. The concept of OR-based decision-making under uncertainty and model risk was new, but the scientists embraced it. The project encouraged the best use of the limited historical process data via a learning-by-doing framework.
• Improved availability of drugs. The animal health industry is struggling to meet worldwide demand. The facility in Boxmeer is a leading biomanufacturing site serving markets all over the world. Therefore, increasing the production yield at this facility will have a societal impact in terms of improved availability of these drugs.

Limited historical process data (model risk) is a common industry challenge in biomanufacturing. Yet the implications of model risk for optimal costs and harvesting decisions have not been fully understood. Our work provides one of the first attempts at modeling and optimizing fermentation systems under model risk (caused by limited historical data) and inherent stochasticity (caused by the uncertain nature of biological systems). We developed an MDP model to guide fermentation harvesting decisions under a learning-by-doing framework. In particular, we used a Bayesian approach in which the decision-maker sequentially collects real-world data on fermentation dynamics and updates the beliefs on state transitions. As a salient feature, the MDP model combines knowledge from the life sciences and operations research, and is equipped to capture the complex dynamics of fermentation processes under limited data. We studied the structural properties of the value function and optimal policies, and showed that optimal policies have a zone-based structure. In addition, we analytically characterized the impact of model risk on biomanufacturing harvesting decisions. To illustrate the use of the optimization model, we presented a case study from MSD Animal Health in the Netherlands. The implementation at MSD had a substantial impact, with a roughly 3X improvement in the annual production yield and a 20% reduction in the standard deviation of the annual production yield. The developed optimization framework is generic and addresses common industry challenges; therefore, it can be readily implemented at other production lines and facilities. For example, MSD is currently in the process of implementing this decision-making framework on other production lines in the Boxmeer facility. The long-term vision is to encourage the worldwide use of such optimization models at other facilities. As future work, we also aim to explore potential applications in other industries. For example, the food industry encounters similar challenges related to fermentation and may benefit from the developed optimization model.
In addition, it would be interesting to expand the scope of this paper to incorporate production planning and scheduling aspects.

Endnotes.
1. The company is known as "Merck" in the United States and Canada, and as "MSD" elsewhere.
2. The independence and normality assumptions are validated based on industry data from products that are already in production.
3. As mentioned at the end of Section 4.2, with a reasonably large data size, the posterior predictive distribution can be approximated by a normal distribution, i.e., $\Phi_t$ is approximately normal with mean $\alpha_t^{(p)}$.
4. We defined these four strategies based on the feedback we received from our industry partners. Our main objective is to understand the benefits of adopting the proposed "RL with MR" approach instead of other practically relevant strategies.
5. Numerical results for the strategy "RL with MR" are generated by the Bayesian sparse sampling described in Section 5. The analysis starts with a non-informative prior. For each case (i.e., $J_0 = 3, 10, 20$), we performed 100 macro-replications to assess the performance of the proposed reinforcement learning framework. In each macro-replication, we generated $J_0$ process data points $\mathcal{D}_0$ from the underlying true distribution of the protein and impurity growth rates, and then updated the knowledge states. With the updated knowledge state, we estimated the Q-function for the continue decision over the physical state space $(p_t, i_t) \in [1.5, 30] \times [2.0, 50]$ at time $t$, based on the Bayesian sparse sampling in Algorithm 2. We obtained the mean decision boundary and the corresponding 95% confidence band from the results of the 100 macro-replications.
6. We obtained similar managerial insights for the case when the inherent stochasticity is high.

The terminal value at time $\bar{T}$ is non-decreasing in $\alpha_{\bar{T}}^{(p)}$. By using the Bellman equation, we can complete the proof through dynamic programming (backward induction). Assume that $V_{t+1}(p_{t+1}, i_{t+1}, I_{t+1})$ is non-decreasing in $\alpha_{t+1}^{(p)}$; the difference in $\alpha_t^{(p)}$ propagates to $\alpha_{t+1}^{(p)}$ through the knowledge-state update, i.e., see (5). Then, using the lower bound $-r_f$ on $E[V_{t+1}(p_t e^{\phi_t}, i_t e^{\psi_t}, I_{t+1}) \mid \alpha_t^{(p)}]$ and the variable replacement $\phi_t' = \phi_t + \delta$, we obtain inequalities (32) and (33). The inequality (32) holds because the value function is non-decreasing in $p_t$ (since $p_t e^{\phi_t + \delta} \geq p_t e^{\phi_t}$) and by the induction assumption (since $\alpha_{t+1}^{(p)+} \geq \alpha_{t+1}^{(p)}$). Furthermore, (33) holds by the variable replacement $\phi_t' = \phi_t + \delta$. Thus, we have $V_t(p_t, i_t, I_t(\alpha_t^{(p)})) \leq V_t(p_t, i_t, I_t(\alpha_t^{(p)+}))$. On the other hand, consider $\alpha_t^{(i)+} \geq \alpha_t^{(i)}$ with the variable replacement $\psi_t' = \psi_t + \delta$. Similarly, the inequalities hold because the value function is non-increasing in $i_t$ and non-increasing in $\alpha_{t+1}^{(i)}$ by the induction assumption, together with the variable replacement $\psi_t' = \psi_t + \delta$. Thus, we have $V_t(p_t, i_t, I_t(\alpha_t^{(i)})) \geq V_t(p_t, i_t, I_t(\alpha_t^{(i)+}))$.

Here we consider the "non-decreasing in $\beta_t^{(p)}$" part. For any $(p, i)$, the first equation holds by the variable replacement $\phi_t' = \phi_t - \alpha_t^{(p)}$, so that $p' = p e^{\alpha_t^{(p)}}$, and the inequality (36) holds by the induction assumption. Let $\phi_t' = \eta \phi_t$; it is easy to see that $\kappa^+ > \kappa > 0$. Therefore, both inequalities hold because the value function $V_{t+1}(\cdot)$ is non-decreasing in the protein level $p_{t+1}$ and in the posterior mean of the protein growth rate $\alpha_{t+1}^{(p)}$, and the final steps hold by the variable replacement $\phi_t' = \eta \cdot \phi_t$. Thus, combining the two parts, we have $E[V_{t+1}(p_t e^{\phi_t}, i_t e^{\psi_t}, I_{t+1}) \mid \beta_t^{(p)}] \leq E[V_{t+1}(p_t e^{\phi_t}, i_t e^{\psi_t}, I_{t+1}) \mid \beta_t^{(p)+}]$. Then, $V_t(p_t, i_t, I_t(\beta_t^{(p)})) \leq V_t(p_t, i_t, I_t(\beta_t^{(p)+}))$.
On the other hand, a symmetric argument applies to $E[V_{t+1}(p_t e^{\phi_t}, i_t e^{\psi_t}, I_{t+1}(\alpha_t))]$. Alternatively, instead of harvesting immediately, we could wait to harvest until $m$ periods ahead ($1 \leq m \leq \bar{T} - t$); the expected additional total reward of this batch can then be calculated as in the proof of Lemma 2 below.

References.
Computers and bioprocess development
Operations research improves biomanufacturing efficiency
Innovative applications of operations research: Reducing costs and lead times in biomanufacturing
Operational research improves biomanufacturing efficiency
Analytical models for biopharmaceutical operations and supply chain management: a survey of research literature
Practical Fermentation Technology
A more generalized kinetic model for binary substrates fermentations
Nonlinear model predictive control of fed-batch fermentations using dynamic flux balance models
Optimal control of fed-batch bioreactor using simulation based approximate dynamic programming
Modeling kinetics of a large-scale fed-batch CHO cell culture by Markov chain Monte Carlo method
Optimal condition-based harvesting policies for biomanufacturing operations with failure risks
Managing trade-offs in protein manufacturing: How much to waste? Manufacturing & Service Operations Management
Deep reinforcement learning for the control of microbial co-cultures in bioreactors
Reinforcement learning based optimization of process chromatography for continuous processing of biopharmaceuticals
Green simulation assisted reinforcement learning with model risk for biomanufacturing learning and control
Bayesian reinforcement learning: A survey
Optimal learning: Computational procedures for Bayes-adaptive Markov decision processes
An analytic solution to discrete Bayesian reinforcement learning
Bayesian reinforcement learning in continuous POMDPs with application to robot navigation
(More) efficient reinforcement learning via posterior sampling
Optimistic planning for belief-augmented Markov decision processes
Near-Bayesian exploration in polynomial time
Approaching Bayes-optimality using Monte-Carlo tree search
A Bayesian sampling approach to exploration in reinforcement learning
Retrospective optimization of time-dependent fermentation control strategies using time-independent historical data
Bayesian Data Analysis
Optimal Learning
Conjugate Bayesian analysis of the Gaussian distribution
A sparse sampling algorithm for near-optimal planning in large Markov decision processes
Bayesian sparse sampling for on-line reward optimization
Model-based Bayesian reinforcement learning in large structured domains

Awards. The research project was one of the three finalists of the 2020 MSCA2020.HR Awards, a competition across all disciplines to acknowledge the most influential projects under the Horizon 2020 program of the European Commission. The research project was also awarded the second prize in the INFORMS Innovative Applications in Analytics Award.

Acknowledgments. This research was funded by the Dutch Science Foundation (NWO-VENI Scheme). We would like to thank the Master's students Thijs Diessen and Len Hermsen for their assistance during the project. We also thank Oscar Repping from MSD Animal Health for his continuous support during the collaboration.

A Proof of Proposition 1
(i) Notice that $\phi_j$, $j = 1, \ldots, J_t$, are i.i.d. samples from the underlying true protein growth-rate distribution $\mathcal{N}(\mu^{(p)}, \sigma^{(p)2})$. According to the properties of the Gaussian sample mean and sample variance, we have $\bar{\phi} \sim \mathcal{N}\big(\mu^{(p)}, \sigma^{(p)2}/J_t\big)$ and $\sum_{j=1}^{J_t} (\phi_j - \bar{\phi})^2 / \sigma^{(p)2} \sim \chi^2(J_t - 1)$, and they are independent of each other. Thus, the sample variance can be written as a scaled $\chi^2(J_t - 1)$ random variable.
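For reference, the following standard chi-square moments (with $k$ degrees of freedom, applied here with $k = J_t - 1$) are the facts invoked in the next step of this proof:

```latex
% Standard moments of the chi-square distribution with k degrees of freedom:
\[
  \mathbb{E}\!\left[\chi^2(k)\right] = k,
  \qquad
  \operatorname{Var}\!\left[\chi^2(k)\right] = 2k .
\]
% Applied to the scaled sum of squared deviations of the observed growth rates:
\[
  \mathbb{E}\!\left[\frac{\sum_{j=1}^{J_t}(\phi_j - \bar{\phi})^2}{\sigma^{(p)2}}\right] = J_t - 1,
  \qquad
  \operatorname{Var}\!\left[\frac{\sum_{j=1}^{J_t}(\phi_j - \bar{\phi})^2}{\sigma^{(p)2}}\right] = 2\,(J_t - 1).
\]
```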
Based on the mean and variance of the chi-square distribution, the result in part (i) follows.
(ii) Recall that the predictive growth rate $\Phi_t$ is a compound random variable, i.e., $\Phi_t \sim \mathcal{N}(\mu^{(p)}, \sigma^{(p)2})$, where $(\mu^{(p)}, \sigma^{(p)2})$ follows the posterior distribution indexed by the knowledge state at decision epoch $t$. For brevity, let $\theta^{(p)}$ denote the random variables $(\mu^{(p)}, \sigma^{(p)2})$. The variance of the compound random variable $\Phi_t$ can be decomposed by the law of total variance, which gives the equations in (25) and (26); we note that the remaining term in the decomposition is a constant. The first term on the right-hand side of (27) corresponds to $\mathrm{E}_{\theta^{(p)}}\big[\sigma^{(p)2}\big]$, measuring the expected variability of the protein growth rate due to the intrinsic stochasticity of the fermentation. The second term corresponds to $\mathrm{E}_{\theta^{(p)}}\big[\mathrm{Var}(\mu^{(p)} \mid \sigma^{(p)2})\big]$ and represents the expected variability of the protein growth rate due to model risk.

Notice that at time $\bar{T}$, the terminal value is non-increasing in $i_{\bar{T}}$ and non-decreasing in $p_{\bar{T}}$. Based on the value function given in Section 3.3, we can prove Theorem 1 through backward induction. As the induction hypothesis, assume that $V_{t+1}(p_{t+1}, i_{t+1}, I_{t+1})$ is non-increasing in $i_{t+1}$ and non-decreasing in $p_{t+1}$ for a given $t \in \{0, 1, \ldots, \bar{T} - 1\}$. Recall from (15) the value function at decision epoch $t$. For $i_t^+ \geq i_t$, the corresponding inequality holds because $V_{t+1}(p_t e^{\phi_t}, i_t e^{\psi_t}, I_{t+1}) \geq -r_f$, $i_t e^{\psi_t} \leq i_t^+ e^{\psi_t}$, and the induction assumption; therefore, $V_t(p_t, i_t, I_t)$ is also non-increasing in $i_t$. On the other hand, for $p_t^+ \geq p_t$, the corresponding inequality holds because $p_t e^{\phi_t} \leq p_t^+ e^{\phi_t}$ and the induction assumption; therefore, $V_t(p_t, i_t, I_t)$ is also non-decreasing in $p_t$.

According to Theorem 2, if $\pi^*(p_t, i_t^-, I_t) = H$, then $\pi^*(p_t, i_t^+, I_t) = H$. Note that $\pi^*(p_t, i_t^+, I_t) = H$ always holds for $i_t^+ > \bar{I}$, so we focus on the case with $i_t^+ \leq \bar{I}$. Assume, as the contradiction hypothesis, that $\pi^*(p_t, i_t^-, I_t) = H$ but $\pi^*(p_t, i_t^+, I_t) = C$. Then the resulting inequality contradicts the condition, so that $\pi^*(p_t, i_t^+, I_t) = H$. Notice that inequality (30) holds by the definition of the value function and the non-decreasing behavior of the value function in $p_t$.

Next, consider the terminal value at time $\bar{T}$, which is non-decreasing in $\beta_{\bar{T}}^{(p)}$. By using Bellman's equations, we can complete the proof through backward induction. The difference in $\beta_t^{(p)}$ affects the expectation through the predictive distribution, and it also affects the knowledge hyperparameters $\alpha_{t+1}^{(p)}$ and $\beta_{t+1}^{(p)}$ through the knowledge-state update, i.e., see (5). Following a similar procedure, the corresponding inequality holds for any $(p, i)$.

Here we consider the "non-increasing in $\nu_t^{(p)}$" part; the proof of the "non-decreasing in $\nu_t^{(i)}$" part can be shown similarly. Given the hyperparameters, the terminal value at time $\bar{T}$ is non-increasing in $\nu_{\bar{T}}^{(p)}$. By using Bellman's equations, we can prove the statement through backward induction. The difference in $\nu_t^{(p)}$ affects the expectation through the predictive distribution, and also the knowledge hyperparameters at $t+1$ through the knowledge-state update, i.e., see (5). For any $(p, i)$, the inequality (37) holds by the induction assumption. Writing the transformed growth rate in the form $\cdot + \kappa \phi_t$, it is easy to see that $\kappa > \kappa^- > 0$. Therefore, both inequalities hold because the value function $V_{t+1}(\cdot)$ is non-decreasing in the protein level $p_{t+1}$ and in the posterior mean of the protein growth rate $\alpha_{t+1}^{(p)}$, and the final steps hold by the variable replacement $\phi_t' = \eta \cdot \phi_t$. Thus, combining the two parts, the claimed inequality follows.
On the other hand, here we prove the "non-increasing in $\lambda_t^{(p)}$" part; the proof of the "non-decreasing in $\lambda_t^{(i)}$" part can be shown similarly. Given the hyperparameters, the terminal value at time $\bar{T}$ is non-decreasing in $\lambda_{\bar{T}}^{(p)2}$. By using Bellman's equations, we can prove the statement through backward induction. Assume that $V_{t+1}(p_{t+1}, i_{t+1}, I_{t+1})$ is non-decreasing in $\lambda_{t+1}^{(p)}$. The difference in $\lambda_t^{(p)}$ affects the expectation through the predictive distribution and through the knowledge-state update at $t+1$; notice that the approximate Gaussian predictive density is used here. Then, based on the predictive distribution, it can be shown for any $(p, i)$ that, writing the transformed growth rate in the form $\cdot + \kappa \phi_t$, we have $\kappa > \kappa^- > 0$. Therefore, both inequalities hold because the value function $V_{t+1}(\cdot)$ is non-decreasing in the protein level $p_{t+1}$ as well as in the posterior mean of the protein growth rate $\alpha_{t+1}^{(p)}$ and the scale parameter $\beta_{t+1}^{(p)}$, and the final steps hold by the variable replacement $\phi_t' = \eta \cdot \phi_t$. Thus, combining the two parts, we have $V_t(p_t, i_t, I_t(\lambda_t^{(p)})) \leq V_t(p_t, i_t, I_t(\lambda_t^{(p)+}))$. On the other hand, following a similar procedure, it holds for any $(p, i)$ that, with $\eta \leq 1$ and $\psi_t' = \eta \psi_t$, the corresponding inequality for $V_t(p_t, i_t, I_t(\lambda_t^{(i)}))$ follows.

Proof of Lemma 1. (i) It can be shown that the approximation in (40) holds because it was assumed that $\Pr(\Phi < 0)$ and $\Pr(\Psi < 0)$ are negligible; the result then follows. (ii) The conditional expectation can be written as in (42), which holds because $\Pr(\Psi < 0)$ was assumed negligible, and the functional form in (43) follows from the conditional expectation of a random variable with a log-normal distribution.

F Proof of Lemma 2
At any decision epoch $t$ with physical state $(p_t, i_t)$, if we harvest, the additional total reward is given by (44), where $i_{[t:t+m]} = (i_t, i_{t+1}, \ldots, i_{t+m})$ and $p_{[t:t+m]} = (p_t, p_{t+1}, \ldots, p_{t+m})$. The first term on the right-hand side of (44) accounts for the total cost in scenarios with a failure occurring during time $t$ to $t+m$; the second term accounts for the reward in scenarios ending before $t+m$ because the protein $p_{t+k}$ exceeds the boundary $\bar{P}$ and we harvest at $t+k$ for $k \in \{1, \ldots, m\}$; and the last term accounts for the reward collected if the scenario reaches decision epoch $t+m$ and we harvest. Plugging in the results of Lemma 1 and Lemma 2, we can calculate $r_m$ as a function of the physical states. Therefore, given the current physical state $(p_t, i_t)$, the optimal decision is to harvest if and only if $r_0(p_t, i_t) \geq r_m(p_t, i_t)$ for all $m \in \{1, 2, \ldots, \bar{T} - t\}$; otherwise, the optimal decision is to continue the fermentation for this batch.

Let $\hat{h}(p_t, i_t; m)$ denote the counterpart of the function $h(p_t, i_t; m)$ (which was characterized under perfect information) for the harvesting policy under model risk without forward looking; recall that $\hat{h}(p_t, i_t; m)$ is obtained by replacing the unknown parameters (such as $\mu^{(p)}$ for the protein growth rate) with their estimates. Our objective is to show that the function $\hat{h}(p_t, i_t; m)$ converges to its true counterpart $h(p_t, i_t; m)$ as the length of the historical data increases. It is known that $\phi_j$, $j = 1, 2, \ldots, J_t$, are i.i.d. samples from the underlying true protein growth-rate distribution $\mathcal{N}(\mu_c^{(p)}, \sigma_c^{(p)2})$. According to the weak law of large numbers (WLLN), the sample mean converges to the true mean, i.e., $\alpha_t^{(p)} \xrightarrow{p} \mu_c^{(p)}$ as $J_t \to \infty$. In addition, it is known from Proposition 1(i) that the sample variance is a scaled $\chi^2(J_t - 1)$ random variable; as $J_t \to \infty$, its expectation converges to $\sigma_c^{(p)2}$. Similar results also apply for the impurity growth rate, so that $\alpha_t^{(i)} \xrightarrow{p} \mu_c^{(i)}$ as well.
Thus, through the continuous mapping theorem, we have for any $(p_t, i_t) \in [p_0, \bar{P}] \times [i_0, \bar{I}]$ and $m = 1, 2, \ldots, \bar{T} - t$ that $\hat{h}(p_t, i_t; m) \xrightarrow{p} h(p_t, i_t; m)$. In other words, the decision boundary under model risk converges to the decision boundary under perfect information.
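As a quick numerical illustration of this consistency argument, the sketch below (with assumed true parameter values, not case-study data) shows the plug-in estimates of the protein growth-rate mean and standard deviation stabilizing as the number of historical batches grows, which is the mechanism driving the convergence of the decision boundary.

```python
import numpy as np

# Illustrative consistency check: as J_t grows, the sample mean and sample SD
# of the observed log-growth rates approach the true parameters, so plug-in
# quantities such as h_hat(p, i; m) approach their perfect-information counterparts.
rng = np.random.default_rng(42)
mu_true, sigma_true = 0.30, 0.10   # assumed true protein log-growth rate distribution N(mu, sigma^2)
for J in (3, 10, 20, 100, 1000):
    phi = rng.normal(mu_true, sigma_true, size=J)   # i.i.d. historical growth-rate observations
    print(f"J_t = {J:5d}: mean = {phi.mean():.4f}, SD = {phi.std(ddof=1):.4f}")
```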