Green Simulation Assisted Policy Gradient to Accelerate Stochastic Process Control
Hua Zheng, Wei Xie, M. Ben Feng
October 17, 2021

This study is motivated by critical challenges in biopharmaceutical manufacturing, including high complexity, high uncertainty, and very limited process data. Each experiment run is often very expensive. To support optimal and robust process control, we propose a general green simulation assisted policy gradient (GS-PG) framework for both online and offline learning settings. To address key limitations of state-of-the-art reinforcement learning (RL), such as sample inefficiency and low reliability, we create a mixture likelihood ratio based policy gradient estimation that can leverage the information from historical experiments conducted under different inputs, including process model coefficients and decision policy parameters. Then, to accelerate the learning of an optimal and robust policy, we further propose a variance reduction based sample selection method that allows GS-PG to intelligently select and reuse the most relevant historical trajectories. The selection rule automatically updates the samples to be reused during the learning of process mechanisms and the search for the optimal policy. Our theoretical and empirical studies demonstrate that the proposed framework can perform better than the state-of-the-art policy gradient approach and accelerate optimal and robust process control for complex stochastic systems under high uncertainty. The biopharmaceutical industry has developed various innovative biodrugs and vaccines for many diseases, e.g., cancer and COVID-19, which have offered great benefits to our society. But current manufacturing systems are limited by key challenges, including high complexity, high variability, and limited process data. Human error alone accounts for 80% of out-of-specification variations (Cintron 2015). To facilitate end-to-end manufacturing process automation and improve productivity and stability, we consider a stochastic decision process (SDP) model, e.g., a hybrid model representing the risk- and science-based understanding of underlying processing mechanisms and inherent stochasticity, and introduce a green simulation assisted policy gradient (GS-PG) framework which can selectively reuse the most relevant historical trajectories to accelerate the learning of an optimal and robust control policy. Though this framework is motivated by the challenges and needs of biopharmaceutical manufacturing, it is applicable to reinforcement learning (RL) problems and complex stochastic system control in general. Biotherapeutics are manufactured with living organisms, e.g., cells, bacteria, and yeast, whose biological processes are complex. The output trajectories are affected by interactions of many factors such as raw materials, feeding strategy, and critical process parameters. As new biotherapeutics (e.g., cell/gene therapies) become more and more "personalized," biomanufacturing requires more advanced manufacturing protocols, which results in very limited historical data. Thus, it has become an emerging trend in the biopharmaceutical industry to utilize digital twins, i.e., simulation models representing the understanding of underlying bioprocessing mechanisms, to guide process control (Hong et al. 2018).
The simulation model parameters, such as cell growth and antigen protein production rates, are estimated from very limited real-world data. In this study, we use a Bayesian approach, i.e., a posterior distribution of model parameters, to quantify the model uncertainty. With the support of the simulation model, we consider RL to guide the optimal and robust control of a complex SDP whose statistical dynamics depend on the selection of model parameters and decision policy. For such complex stochastic processes, each real or simulation experiment is very expensive. It is important to fully utilize the information from historical experiments when solving RL problems to facilitate real-time decision making. Stochastic policy gradient ascent is a common algorithm for solving RL problems. In its classical implementation, only the new simulations generated in the current iteration are used to estimate the policy gradient, which can induce high estimation uncertainty, especially under high stochastic and model uncertainties. Motivated by Dong et al. (2018), we first develop mixture likelihood ratio (LR) metamodeling of the SDP with inputs including both process model coefficients and decision policy parameters. Each historical experiment trajectory can be used to provide an unbiased prediction of the mean response across the input space. However, the reuse of historical trajectories whose associated SDP distributions are far from the target distribution can lead to inflated variance in the policy gradient estimation. To avoid this issue, we propose a mixture likelihood ratio (MLR) based green simulation approach that can selectively reuse historical trajectories to reduce the estimation variance of the policy gradient and speed up the search for the optimal policy. As the posterior of model parameters is dynamically updated and the policy optimization proceeds, this approach can automatically select and reuse the most "relevant" historical trajectories generated under various process model parameters and policies. Thus, the proposed GS-PG is sample efficient and robust to model uncertainty. Since the selection cost can increase as the number of historical trajectories increases, we further develop a computationally efficient optimization solution algorithm. When we reuse the historical trajectories, we only need to compute the likelihood ratios (LR) for newly generated trajectories and new SDP input parameters, which can dramatically reduce the computational cost and support real-time decision making. Here, we summarize the key contributions and benefits of our approach as follows. • We develop a general GS-PG framework for both online and offline RL settings. For the offline case, a simulation model, as an approximation of the real system, is used to guide the decision making. Given very limited real-world data, Bayesian RL accounting for model uncertainty is used to guide optimal and robust decision making. • Since each experiment for complex stochastic systems can be very expensive, the proposed GS-PG and MLR based approach can automatically select and reuse the most relevant historical trajectories and improve the estimation of the policy gradient. • Comprehensive theoretical and empirical studies demonstrate that the proposed GS-PG framework can efficiently utilize historical trajectories and accelerate optimization convergence. The organization of this paper is as follows. We review the most related literature in Section 2 and present the general policy gradient based RL in Section 3.
We propose the MLR based policy gradient estimation for both online and offline reinforcement learning settings in Section 4. Then, we develop the green simulation assisted policy gradient optimization in Section 5. Given a tight budget, since each experiment can be very expensive, our approach can automatically select the most relevant historical trajectories to improve the policy gradient estimation accuracy. We provide a computationally efficient algorithm to speed up the search for the optimal policy, and further prove the asymptotic performance in terms of optimal convergence and historical trajectory reuse. We conclude this paper with a comprehensive empirical study of the proposed framework in Section 6. The goal of RL is to learn an optimal policy for interacting with a complex system or environment to achieve the best reward (Sutton and Barto 2018). Policy gradient is a popular approach to solve RL problems by modeling and optimizing a parametric policy using the stochastic gradient descent/ascent method (Sutton et al. 1999). Many new methods have been proposed in recent years to improve its performance in terms of sample efficiency, scalability, and rate of convergence. For example, Soft Actor-Critic (SAC) incorporates the entropy measure of the policy into the reward objective to encourage exploration (Haarnoja et al. 2018). To improve training stability, trust region policy optimization (TRPO) enforces a KL divergence constraint on the size of the policy update at each iteration to avoid too much policy parameter change in one update (Schulman et al. 2015a). Importance sampling (IS) has been widely applied to RL. Traditionally, it was often used for off-policy value evaluation and policy correction through the IS weight between the target policy and the behaviour/proposal policy (Jiang and Li 2016, Precup 2000, Degris et al. 2012). However, IS estimates can suffer from large variance, which motivates the development of new variance reduction techniques (Mahmood and Sutton 2015, Wu et al. 2018, Lyu et al. 2020). Espeholt et al. (2018) present an IS weighted actor-critic architecture that adopts "retrace", a method truncating the IS weight with an upper bound to resolve the high variance issue. However, these methodologies often focus on variance reduction and do not provide additional theoretical guarantees on improving the convergence rate. In the simulation literature, green simulation is a new paradigm of IS to support experiment design, which reuses outputs from past experiments to create a metamodel predicting the system mean response (Feng and Staum 2017, Dong et al. 2018). Experience replay (ER) (Lin 1992, Mnih et al. 2015, Wang et al. 2016) and its extensions, such as prioritized experience replay (Schaul et al. 2016), are often used in policy gradient and RL algorithms to reduce data correlation and improve sample efficiency. Wang et al. (2017) proposed the actor-critic with experience replay (ACER), which applies the experience replay method to an off-policy actor-critic model. This approach uses the IS weight to account for the discrepancy between the proposal and target policy distributions and truncates the importance weight to avoid inflated variance. Motivated by existing studies of experience replay, green simulation, and policy gradient methods, in this paper we provide a new sample reuse scheme capturing the strengths of experience replay and multiple importance sampling based variance reduction.
The applicability of green simulation to RL was investigated in our previous study (Zheng et al. 2020). It empirically shows that the MLR based policy gradient estimator could provide promising convergence performance. Building on Zheng et al. (2020), the key contributions of this paper include: (1) developing a GS-PG framework for both online and offline RL settings that works for general policy gradient methods; (2) creating a novel experience replay selection approach and introducing a computationally efficient optimization solution algorithm; (3) theoretically and empirically showing that the proposed GS-PG can improve the policy gradient estimation and speed up the convergence to the optimum for RL problems; and (4) studying the asymptotic and finite sample behaviors of historical trajectory reuse. We consider both online and offline RL settings for controlling complex stochastic systems in Section 3.1. In the online setting, the RL agent learns the optimal policy by directly and repeatedly interacting with the real system. However, in many situations, such repeated direct interactions are unbearably expensive. For example, each batch of bio-drugs can be valued at more than one million dollars. In the offline setting (see, for example, Levine et al. 2020), the RL agent instead interacts with a digital twin simulation model of the real system to answer "what-if" questions and guide the search for the optimal policy. At the same time, batch-based data can be dynamically and periodically collected from the real system. The simulation model, as a representation of the real system, has an approximation error (called model uncertainty) quantified by a posterior distribution, which is updated as new real-world data are collected. Thus, we consider Bayesian RL to guide simulation experiments to find the optimal and robust policy accounting for model uncertainty. Though running experiments on the digital twin is less costly than interacting with the real system, the digital twin is often a complex simulation model and thus computationally burdensome to run. For both online and offline RL settings, each real or simulation system experiment is considered to be expensive. Given a tight budget and under high uncertainty, we need to fully utilize the information from historical experiments, which can be collected in previous periods and optimization iterations, to accelerate the search for the optimum. In this paper, the policy gradient is used to solve for the optimal policy. The proposed green simulation approach is applicable to general policy gradient approaches as reviewed in Section 3.2. We formulate the reinforcement learning problem of interest as a Markov decision process (MDP) specified by (S, A, H, r, P, s_1), where S, A, and H are the state space, the action space, and the planning horizon, respectively. The planning horizon H can be scenario-dependent. For example, the optimal harvesting or stopping decision for a biomanufacturing fermentation process depends on the production process state s_t, such as the amount of the working cells' metabolic wastes and drug substance. At each time t = 1, 2, . . . , H within the planning horizon, the agent observes the state s_t ∈ S, takes an action a_t ∈ A, and receives a reward r_t(s_t, a_t) ∈ R. In this study we consider a stochastic policy π_θ : S → A, defined as a mapping from the state space to the action space and parameterized by θ ∈ R^d, i.e., a_t ∼ π_θ(s_t). It can degenerate to a deterministic policy.
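For illustration, the interaction loop just described can be sketched as follows; the env and policy interfaces (reset, step, sample) are assumptions made for exposition and are not part of the paper.

```python
def rollout(env, policy, horizon, rng):
    """Collect one trajectory tau = (s_1, a_1, r_1, ..., s_H, a_H, r_H).

    `env` and `policy` are hypothetical interfaces used for illustration only:
    env.reset() returns s_1, env.step(a) returns (s_next, r, done), and
    policy.sample(s, rng) draws a_t ~ pi_theta(. | s_t).
    """
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(horizon):
        a = policy.sample(s, rng)          # a_t ~ pi_theta(. | s_t)
        s_next, r, done = env.step(a)      # s_{t+1} ~ p(. | s_t, a_t; omega), r_t(s_t, a_t)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:                           # scenario-dependent horizon
            break
    return states, actions, rewards
```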
The probability model P of stochastic process is parameterized by ω ω ω specifying the state transition probabilities p(s s s t+1 |s s s t , a a a t ; ω ω ω) for all t, as well as the probability p(s s s 1 ; ω ω ω) for the initial state s s s 1 , such as raw material variation. In the classical stochastic policy gradient ascent algorithm, in each iteration k, the policy gradient ∇µ(θ θ θ k ) is estimated by ∇µ(θ θ θ k ). For simplicity, suppose n i.i.d. trajectories τ τ τ are generated for every policy and model parameter pair, i.e., (θ θ θ, ω ω ω), during the algorithm. In the online setting, at each k-th iteration, we generate n new trajectories from the real system, i.e., τ τ τ (k,j) ∼ D (τ τ τ ; θ θ θ k , ω ω ω c ) for j = 1, 2, . . . , n, and store in the set D k ← D k−1 ∪ τ τ τ (k,j) : j = 1, 2, . . . , n . In the offline setting, from the latest SDP model update θ θ θ k , ω ω ω k ) for j = 1, 2, . . . , n and b = 1, 2, . . . , B, and store in the set X k ← X k−1 ∪ τ τ τ (k b ,j) : b = 1, 2, . . . , B and j = 1, 2, . . . , n . Note that the first index in the superscript of τ τ τ , i.e., k for the online setting and k b for the offline setting, uniquely specifies the policy and transition model parameter pair. The classical policy gradient (PG) estimator in the k-th iteration is given by for online setting, where the scenario-based policy gradient estimate at the policy parameters θ θ θ k is Ψ t ∇ ln π θ θ θ k (a a a t |s s s t ) . It can be challenging to precisely estimate the policy gradient (4), especially when the model uncertainty (e.g., in ω ω ω) and the process stochasticity (e.g., in τ τ τ ) are high. We observe that the classical PG estimator (5) only uses trajectories generated in the k-th iteration; all trajectories generated in previous iterations are discarded. This is a wasteful use of experiment outputs, as the trajectories simulated in previous iterations likely have relevant information for estimating the policy gradient in the current iteration. Inspired by the green simulation design paradigm (Feng and Staum 2017) , we propose reusing the historical trajectories via the likelihood ratio method to improve the estimation accuracy for policy gradient and to accelerate the convergence of stochastic policy gradient ascent algorithm. In this section we derive likelihood ratio based policy gradient estimators. At the current k-th iteration, let p k (τ τ τ ) denote the target distribution. Let F k denote the set of all historical SDP models that have been visited until the beginning of the k-th iteration. Let U k be a reuse set with U k ⊆ F k including model candidates whose trajectories are selected and reused for estimating ∇µ(θ θ θ k ). Denote its cardinality as |U k |. For discussions in this section we assume U k is given. We will present how to determine U k in Section 5. We propose to reuse the historical trajectories associated with U k to estimate the gradient ∇µ(θ θ θ k ). Let D i (τ τ τ ) denote the i-th process model or proposal distribution in U k , which was used to generate state-action trajectories τ τ τ (i,j) i.i.d ∼ D i (τ τ τ ) with j = 1, 2, . . . , n. For the online RL setting, since we directly interact with the real system by applying the policy specified by parameters θ θ θ i to get the output trajectories, we can replace D i with θ θ θ i as model input specification. For the offline RL setting, since we conduct experiments on the simulation model, we can replace D i with the required inputs (θ θ θ i , ω ω ω i ). 
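As a reference point for the likelihood ratio based estimators developed next, here is a minimal sketch of the scenario-based gradient g_k(τ) and the classical PG estimator (5). It assumes that Ψ_t is the discounted reward-to-go (one common choice) and that the policy object exposes a hypothetical grad_log_prob method returning the d-dimensional score vector; neither is prescribed by the paper.

```python
import numpy as np

def scenario_gradient(policy, traj, gamma=1.0):
    """Scenario-based policy gradient g_k(tau) = sum_t Psi_t * grad log pi_theta(a_t|s_t).

    Psi_t is taken here to be the discounted reward-to-go (an assumption);
    policy.grad_log_prob(s, a) is a hypothetical method returning the
    d-dimensional score vector, and policy.dim its dimension.
    """
    states, actions, rewards = traj
    H = len(rewards)
    g = np.zeros(policy.dim)
    psi = np.zeros(H)
    reward_to_go = 0.0
    for t in reversed(range(H)):
        reward_to_go = rewards[t] + gamma * reward_to_go
        psi[t] = reward_to_go
    for t in range(H):
        g += psi[t] * policy.grad_log_prob(states[t], actions[t])
    return g

def pg_estimator(policy, new_trajectories):
    """Classical PG estimator (5): average the scenario gradients over the n
    trajectories generated in the current iteration only."""
    return np.mean([scenario_gradient(policy, tau) for tau in new_trajectories], axis=0)
```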
Provided that p k (τ τ τ )g(τ τ τ ) = 0 whenever D i (τ τ τ ) = 0, then the well-known importance sampling (IS) identity Di(τ τ τ ) is likelihood ratio (LR). Therefore, the policy gradient ∇µ(θ θ θ k ) can be estimated by the following individual likelihood estimator (ILR) using the n trajectories generated by D i (τ τ τ ), The likelihood ratio f i,k (τ τ τ ) weights the trajectories appropriately so that all the ILR estimators (7) for D i ∈ U k are unbiased. One way to reuse all the trajectories associated with the generative models included in U k is to average the ILR estimators for all D i ∈ U k , which we call the average ILR policy gradient estimator, Though the ILR estimators (7) and (8) are unbiased, their variances could be large or even infinite, as the likelihood ratio f i,k (τ τ τ ) can be large or unbounded. This drawback is studied by, for example, Veach and Guibas (1995), Feng and Staum (2017) , and Dong et al. (2018) . These studies propose another LR based estimator that alleviates this drawback and has much more stable estimation in different applications. Instead of viewing the trajectories for each component D i (τ τ τ ) in isolation and form ILR estimators, it is argued in the aforementioned studies that we can view all the trajectories associated with U k collectively as stratified samples from the mixture proposal distribution k (τ τ τ ) = 1 . This view leads to the mixture likelihood ratio (MLR) policy gradient estimator, The mixture likelihood ratio f k (τ τ τ ) reaches its max value when D i (τ τ τ ) = 0 for SDP models visited in previous periods and iterations. Since we always include the trajectories generated in the k-th iteration, this mixture likelihood ratio is bounded, i.e., f k (τ τ τ ) ≤ |U k |, which can control the policy gradient estimation variance inflation issue. In this way, the mixture likelihood ratio puts higher weight on the samples that are more likely to be generated by target distribution p k (τ τ τ ) without assigning extremely large weights on the others. Lemma 1. Conditional on F k , the proposed MLR policy gradient estimators for online and offline RL settings are unbiased. For the offline setting, it means, Based on the studies, e.g., Veach and Guibas (1995) and Feng and Staum (2017) , we show that the MLR estimator is unbiased in Lemma 1; see proof in Appendix Section C. The trace of the covariance matrix Tr(Var[ ∇µ(θ θ θ)]) is the total variance of the estimator ∇µ(θ θ θ), which is often considered in many gradient estimation variance reduction studies (Greensmith et al. 2004) . This motivates us to use it as the metric to select historical trajectories and associated SDP models that could reduce the variance of MLR policy gradient estimators in Section 5. For any (d × d) matrix A, the trace used in the paper are defined as Tr(A) = d i=1 A i,i and we consider L 2 norm for any d-dimensional vector Then, motivated by Martino et al. (2015) Theorem A.2 and Feng and Staum (2017) Proposition 2.5, we can show that conditional on F k , the trace of the covariance of the MLR estimator is less than that of the average ILR estimator in Proposition 1; see the proof in the Appendix (Section D). Proposition 1. Conditional on F k , the total variance of the MLR policy gradient estimator (9) is smaller and equal to that of the average ILR policy gradient estimator (8), Consider solving the RL optimization (2) in the online setting using policy gradient ascent. 
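Before specializing to the online and offline cases, the MLR estimator (9) itself can be sketched as follows; the log-density and gradient callables are assumed interfaces rather than the paper's implementation.

```python
import numpy as np

def mlr_gradient(target_logpdf, proposal_logpdfs, trajectories, grad_fn):
    """MLR policy gradient estimator (9): all reused trajectories are treated as
    stratified samples from the mixture (1/|U_k|) * sum_i D_i(tau).

    target_logpdf(tau)   -> log p_k(tau)
    proposal_logpdfs     -> dict {i: log D_i(.)} for the models in the reuse set U_k
    trajectories         -> dict {i: [tau_{i,1}, ..., tau_{i,n}]}
    grad_fn(tau)         -> scenario-based gradient g_k(tau)
    """
    m = len(proposal_logpdfs)                      # |U_k|
    total, count = 0.0, 0
    for i, taus in trajectories.items():
        for tau in taus:
            # mixture likelihood ratio f_k(tau); bounded by |U_k| when the
            # current iteration's model is included in U_k
            log_mix = np.logaddexp.reduce(
                [lp(tau) for lp in proposal_logpdfs.values()]) - np.log(m)
            f = np.exp(target_logpdf(tau) - log_mix)
            total = total + f * grad_fn(tau)
            count += 1
    return total / count
```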
In the k-th iteration, the target distribution is p k (τ τ τ ) = D (τ τ τ ; θ θ θ k , ω ω ω c ) = p(s s s 1 ; ω ω ω c ) H t=1 π θ θ θ k (a a a t |s s s t )p(s s s t+1 |s s s t , a a a t ; ω ω ω c ). To improve the policy gradient estimation, we can reuse the selected historical trajectories generated by following the policies π π π θ θ θi with D i (τ τ τ ) = D(τ τ τ ; θ θ θ i , ω ω ω c ) ∈ U k . Applying (8) and cancelling p(s s s 1 ; ω ω ω c ) H−1 t=1 p(s s s t+1 |s s s t , a a a t ; ω ω ω c ) in the likelihood ratio f i,k (τ τ τ ) = D(τ τ τ ;θ θ θ k ,ω ω ω c ) D(τ τ τ ;θ θ θi,ω ω ω c ) , the (average) ILR policy gradient estimators in this case are given by where the historical trajectories τ τ τ (i,j) i.i.d ∼ D (τ τ τ ; θ θ θ i , ω ω ω c ) specify the states s s s . Similarly, applying (9), the MLR policy gradient estimator in this case is given by . (11) Consider solving the RL optimization (2) in the offline setting using policy gradient ascent. In the k-th iteration, the target SDP distribution p k (τ τ τ ) is specified by θ θ θ k and the posterior samples { ω ω ω b , b = 1, 2, . . . , B} accounting for model uncertainty, i.e., π θ θ θ k (a a a t |s s s t )p(s s s t+1 |s s s t , a a a t ; ω ω ω b ). Applying (8), the (average) ILR policy gradient estimators of ∇µ(θ θ θ k ) are given by where τ τ τ (i,j) ∼D (τ τ τ ; θ θ θ i , ω ω ω i ) with j = 1, 2, . . . , n are the historical trajectories generated by the SDP models in the reuse set, i.e., D i (τ τ τ ) = D(τ τ τ ; θ θ θ i , ω ω ω i ) ∈ U k . Applying (9), the MLR policy gradient estimator for offline RL setting is given by Before one can use the LR based estimators developed in Section 4, there is one key remaining question: How to select the SDP models, among all historical candidate models visited up to the k-th iteration, to be included in the reuse set U k ? Instead of reusing all historical models, selecting only the most relevant SDP models to reuse can reduce computational cost, stabilize likelihood ratio calculations, and reduce the gradient estimation variance. In short, a well-designed selection rule improves the efficiency of policy gradient estimation and accelerates the resulting policy gradient ascent algorithm. Since the MLR policy gradient estimator is unbiased, we propose a selection criteria to construct U k in Sections 5.1 that can reduce the variance in policy gradient estimation. To support the selection, we provide the variance estimation for PG, ILR, and MLR policy gradient estimators in Section 5.2. Our selection criteria assess all historical trajectories and associated SDP models to determine U k , so the likelihood ratio calculation can be computationally expensive as the reuse set increases. We present an efficient computational scheme in Section 5.4 that avoids repetitive likelihood calculations by storing and reusing them. Based on the selection criteria, we propose the GS-PG algorithms for online and offline RL settings in Section 5.3. The convergence analysis in Section 5.5 shows that these algorithms can accelerate the optimization process compared to existing algorithms in the literature. In Section 5.6, we show that the size of the reuse set |U k | increases to infinity almost surely (a.s.) as the number of iterations increases k → ∞. The empirical study in Section 6 shows consistent performances with these analytical conclusions. 
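The cancellation used in (11) can be made concrete with a short sketch: because every reused online trajectory was generated on the same real system ω_c, the initial-state and transition densities drop out of the ratio, so only the per-step policy likelihoods need to be stored and compared. Storing the per-policy sums of log π terms is an assumed bookkeeping convention, not a prescription from the paper.

```python
import numpy as np

def online_likelihood_ratio(sum_logpi, reuse_policies, target_policy):
    """Mixture likelihood ratio for the online MLR estimator (11).

    sum_logpi[i] = sum_t log pi_{theta_i}(a_t | s_t) evaluated on one given
    trajectory, assumed to be stored when the trajectory is collected or
    screened; the transition terms under omega_c cancel and are never needed.
    """
    log_target = sum_logpi[target_policy]
    log_mix = np.logaddexp.reduce(
        [sum_logpi[i] for i in reuse_policies]) - np.log(len(reuse_policies))
    return np.exp(log_target - log_mix)
```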
In this section, we propose selection criteria for determining the reuse set U k to reduce the policy gradient estimation variance when reusing the historical trajectories therein. Here, we use the offline RL setting to illustrate the insight on selecting the most relevant historical trajectories. At each k-th iteration, we generate B posterior samples, ω ω ω b ∼ p( ω ω ω|D) with b = 1, 2, . . . , B, quantifying model uncertainty and use them to estimate the objective in (2). Thus, the target mixture distribution becomes To improve the policy gradient estimation using the MLR estimator, we want to select U k such that the corresponding mixture proposal distribution, k = 1 |U k | (θ θ θi,ω ω ωi)∈U k D(τ τ τ ; θ θ θ i , ω ω ω i ), is close to p k ; see Figure 1 (a) for an one dimensional visualization. Even though the target SDP mixture distribution is high dimensional for complex systems under high uncertainty, the proposed approach, with selected relevant historical trajectories, tends to put more weight on trajectories that are most likely sampled from the target distribution. Such selection produces stable likelihood ratios and thus reduces the estimation variance of policy gradient. Motivated by this insight, we propose selection criteria based on the comparison of variances of classical PG estimator (5) and the ILR estimator (7) to determine the reuse set U k for both online and offline RL settings. Recall that, in the k-th iteration, the target (mixture) distribution is given by Online RL Setting: For the online RL setting, if we reuse n historical trajectories from θ θ θ i ∈ F k to estimate the policy gradient ∇µ(θ θ θ k ), we have the ILR estimator, Compared with the classical PG estimator, ∇µ , the different variances between ∇µ ILR i,k and ∇µ P G k is induced by some distance between the proposal and target distributions, D (τ τ τ ; θ θ θ i , ω ω ω c ) and D (τ τ τ ; θ θ θ k , ω ω ω c ), respectively, which is specified by θ θ θ i and θ θ θ k . If θ θ θ i = θ θ θ k , the likelihood ratio in (15) equals to 1 so ∇µ k , therefore they have the same variance. The proposed selection criterion (16) in Theorem 1 puts a threshold constant c on the variance inflation of ∇µ ILR i,k compared to ∇µ P G k . This criterion tends to select those proposal distributions D (τ τ τ ; θ θ θ i , ω ω ω c ) with θ θ θ i ∈ F k that are close to the target distribution D (τ τ τ ; θ θ θ k , ω ω ω c ). To ensure that new trajectories generated in the current k-th iteration are always included in the set U k , we consider the threshold constant c > 1 in this paper. The proofs of Theorems 1 and 2 can be found in the Appendix (Section E). Theorem 1 (Online Selection Rule). At the k-th iteration for the online RL setting where the target distribution is D (τ τ τ ; θ θ θ k , ω ω ω c ), the reuse set U k includes the SDP models, i.e., D (τ τ τ ; θ θ θ i , ω ω ω c ) with θ θ θ i ∈ F k , whose ILR policy gradient estimator's total variance is no greater than c times the total variance of the classical PG estimator for some constant c > 1. Mathematically, Then, based on such reuse set U k , the total variance of the MLR policy gradient estimator (11) is no greater than c |U k | times the total variance of the classical PG estimator, Moreover, Offline RL Setting: For the offline RL setting, the proposed selection criterion in Theorem 2 can select most relevant SDP candidate models visited before, i.e., Theorem 2 (Offline Selection Rule). 
At the k-th iteration for the offline RL setting where the target distribution is 1 B B b=1 D (τ τ τ ; θ θ θ k , ω ω ω b ) with ω ω ω b ∼ p( ω ω ω|D) for b = 1, 2, . . . , B, the reuse set U k includes the SDP models, i.e., D (τ τ τ ; θ θ θ i , ω ω ω i ) with (θ θ θ i , ω ω ω i ) ∈ F k , whose ILR policy gradient estimator's total variance is no greater than c times the total variance of the classical PG estimator for some constant c > 1. Mathematically, Then, based on such reuse set U k , the total variance of the MLR policy gradient estimator (13) is no greater than c |U k | times the total variance of the classical PG estimator, Moreover, Theorems 1 and 2 indicate that the proposed selection criteria in (16) and (19) guide an appropriate reuse of historical trajectories and lead to variance reduction of the MLR policy gradient estimation, compared to the classical PG estimator, by a factor of c/|U k |; see (17) and (20). To use the selection criteria in (16) and (19) to determine the reuse set U k , we need to estimate the conditional variances of the PG estimator (5), the ILR estimator (7), and the MLR estimator (9). In the following, we show covariance and total variance estimators in offline RL setting. We can obtain all corresponding estimators in online setting by simply replacing ω ω ω b and ω ω ω i with ω ω ω c and set B = 1 in equations (22), (23) and (24). The conditional variance of the classical PG estimator (5) is estimated by The trace is Tr Var ∇µ For offline RL setting, the conditional variance estimate of the ILR gradient estimator (7) and its trace are For the MLR gradient estimator (9), its conditional variance estimate and trace are Remark: Given F k , the MLR policy gradient estimator ∇µ M LR k is a stratified sampling from the proposal distribution, At the end of this section, we show that the variance estimators of PG, ILR, and MLR policy gradient estimators listed in (22)-(24) are conditionally unbiased in Proposition 2; see the proof in the Appendix (Section F). Proposition 2. Given F k , the policy gradient variance estimators Var ∇µ and Var ∇µ M LR k F k in (24) are unbiased. In this section, we present the proposed GS-PG optimization algorithms for online and offline settings. For the online setting in Algorithm 1, at each k-th iteration, we generate n trajectories by conducting experiments on the real system with the policy specified by parameters θ θ θ k and update the data set D k in Step 1. Then we select the historical trajectories that satisfy the selection rule (16) and use associated SDP distributions to create the reuse set U k in Step 2. The likelihood storage in Step 2(a) will help us to reduce the computation cost; see more details in Section 5.4. Then we get the MLR policy gradient estimate by applying (11), and update and store the policy parameters in Steps 3-4. After that, we repeat the procedure until reaching to the budget limit specified by K iterations. For the offline setting in Algorithm 2, in Step 1 of each k-th iteration, we simulate n new trajectories at (θ θ θ k , ω ω ω (b) k ) with b = 1, 2, . . . , B, and then update the historical simulation trajectory set X k . After it, we generate B posterior samples, ω ω ω b ∼ p( ω ω ω|D) with b = 1, 2, . . . , B, to quantify model uncertainty and create the target distribution p k (τ τ τ ) = 1 B B b=1 D (τ τ τ ; θ θ θ k , ω ω ω b ). Then we create the reuse set U k based on the selection rule (19) in Step 2. 
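As an illustration of how Step 2 could be carried out, the following sketch estimates the trace variances of Section 5.2 from per-trajectory gradient terms and then applies the selection rules (16)/(19). The data layout is an assumption for exposition, and the offline estimators in the paper additionally average over the B posterior samples.

```python
import numpy as np

def trace_var_of_mean(grad_samples):
    """Estimate the total variance Tr(Var[.]) of an averaged gradient estimator
    from its i.i.d. per-trajectory terms: the trace of the sample covariance
    divided by the number of terms.  For the PG estimator the terms are
    g_k(tau); for an ILR estimator they are f_{i,k}(tau) * g_k(tau)."""
    G = np.asarray(grad_samples, dtype=float)      # shape (n, d)
    n = G.shape[0]
    centered = G - G.mean(axis=0, keepdims=True)
    return (centered ** 2).sum() / ((n - 1) * n)

def select_reuse_set(candidates, ilr_terms, pg_terms, c=4.0, current=None):
    """Selection rules (16)/(19): keep model i in U_k when the ILR estimator's
    total variance is at most c times that of the classical PG estimator.
    ilr_terms[i] and pg_terms hold the per-trajectory gradient terms above;
    with c > 1 the current iteration's model is always retained."""
    pg_tv = trace_var_of_mean(pg_terms)
    U_k = {i for i in candidates if trace_var_of_mean(ilr_terms[i]) <= c * pg_tv}
    if current is not None:
        U_k.add(current)
    return U_k
```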
The likelihood storage in Step 2(a) will help us to reduce a large proportion of computation cost in reuse set selection and gradient Input: the selection threshold constant c; the maximum number of iterations K; the number of replications per iteration n; the set of historical trajectories from the real system D0; the set of policy parameters visited so far F0; the set of stored likelihoods L0. Initialize policy parameter θ θ θ1 store it in F1 = F0 ∪ {θ θ θ1}; for k = 1, 2, . . . , K do 1. Collect n trajectories from real system using policy π π π θ θ θ k (a a at|s s st); Update the sets D k ; Initialize U k = ∅, screen all historical trajectories and associated SDP models in F k , and construct the reuse set U k ; for θ θ θi ∈ F k do (a) Compute and store the new likelihoods: where Hm with m = 1, 2, . . . , k represents the set of process models visited in the m-th iteration; see the definition in (26). F k by using the estimator (23) and Tr Var ∇µ P G k θ θ θ k by using the estimator (22). Then construct U k by applying (16); if Tr Var ∇µ 3. Reuse the historical trajectories associated with U k and stored likelihoods L k to calculate the policy gradient ∇µ M LR k by applying (11); 4. Update the policy parameters: θ θ θ k+1 ← θ θ θ k + η k ∇µ M LR k and store it to the set F k = F k ∪ {θ θ θ k }. end estimation; see Section 5.4. Then we calculate the MLR policy gradient estimate ∇µ M LR k in Step 3. At the end of iteration k, we update the SDP model parameters by setting θ θ θ k+1 ← θ θ θ k + η k ∇µ M LR k and ω ω ω Step 4. Repeat the procedure until reaching to the budget limit specified by K iterations. As the iteration k increases, the number of historical trajectories increases. It can be computationally expensive to repeatedly calculate all likelihood ratios required for historical trajectories selection and policy gradient estimation. Thus, we develop a computational efficient likelihoods reuse approach that can save and reuse the previous calculated likelihoods. The analytical study in Proposition 3 and the empirical study in Section 6.2.3 illustrates that this approach can dramatically reduce the computational complexity. Since the online RL setting can be considered as a special case of the offline setting, we use the offline RL setting to illustrate the proposed approach and conclusion. At each k-th iteration of policy update θ θ θ k+1 ← θ θ θ k + η k ∇µ M LR k , there are two main steps to calculate the MLR policy gradient estimate ∇µ M LR k , including: a) selecting the reuse set U k by applying the criteria in (19); and b) calculating the MLR policy gradient estimate by applying (13). Both steps involve the calculation of likelihoods and scenario-based policy gradient estimates. Since we need to calculate the ratios of new generated likelihoods with each individual likelihoods visited before, i.e., (θ θ θ i , ω ω ω i ) ∈ F k , the main computational load comes from determining U k . The calculation of each ILR policy gradient estimator includes: (I) the computation of likelihoods D(τ τ τ (i,j) ; θ θ θ i , ω ω ω i ) and D(τ τ τ (i,j) ; θ θ θ k , ω ω ω b ); and (II) the scenario-based policy gradient estimate g k τ τ τ (i,j) for each historical trajectory τ τ τ (i,j) ∼ D (τ τ τ ; θ θ θ i , ω ω ω i ) at new candidate policy parameters θ θ θ k . After selecting the reuse set U k , the calculation of the MLR policy gradient estimator in (13) includes an increasing number of repetitive calculation of likelihoods as the iteration k and the size of reuse set |U k | increase. 
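To avoid this repetition, the stored-likelihood scheme of Section 5.4 can be sketched as a small cache keyed by trajectory and model identifiers; the decomposition into log q(τ; θ) + log h(τ; ω) follows the trajectory-density factorization discussed below, and the two callables are assumed interfaces.

```python
class LikelihoodCache:
    """Stores log D(tau; theta, omega) = log q(tau; theta) + log h(tau; omega)
    keyed by (trajectory id, model id), so that at iteration k only the pairs
    involving new trajectories or new model parameters are computed and all
    previously computed likelihoods are looked up (the storage scheme behind
    Proposition 3).  `log_q` and `log_h` are assumed callables for the policy
    and state-transition parts of the trajectory density, respectively.
    """
    def __init__(self, log_q, log_h):
        self.log_q = log_q
        self.log_h = log_h
        self._store = {}

    def logpdf(self, traj_id, traj, model_id, theta, omega):
        key = (traj_id, model_id)
        if key not in self._store:              # "white cell": compute once
            self._store[key] = self.log_q(traj, theta) + self.log_h(traj, omega)
        return self._store[key]                 # "grey cell": reuse
```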
Input: the selection threshold constant c; the maximum number of iterations K; real-world data D; the set of SDP model parameters visited so far F0; the set of stored likelihoods L0; the set of historical simulation trajectories X0. Initialize model parameters θ θ θ1, generate B posterior samples ω ω ω for k = 1, 2, . . . , K do 1. Conduct simulation runs and collect n trajectories from each D τ τ τ ; θ θ θ k , ω ω ω Update the set X k and then generate new posterior transition models; (14). 2. Set U k = ∅, screen all historical simulation trajectories and associated SDP model parameters in F k , and construct the reuse set U k ; for (θ θ θi, ω ω ωi) ∈ F k do (a) Compute and store the new likelihoods: where Hm with m = 1, 2, . . . , k represents the set of process models visited in the m-th iteration; see the definition in (25). F k by using the estimator (23) and Tr Var ∇µ P G k θ θ θ k by using the estimator (22). Then construct U k by using (19); if Tr Var ∇µ Reuse the simulation trajectories associated with the SDP models included in U k and stored likelihoods L k to calculate the policy gradient ∇µ M LR k by using (13). Notice that the trajectory distribution can be decomposed to the product of policy functions and the product of state transition functions, i.e., p(s s s t+1 |s s s t , a a a t ; ω ω ω) where q(τ τ τ ; θ θ θ) ≡ H t=1 π θ θ θ (a a a t |s s s t ) and h(τ τ τ ; ω ω ω) ≡ p(s s s 1 ; ω ω ω) H t=1 p(s s s t+1 |s s s t , a a a t ; ω ω ω). We represent the set of trajectories generated in the m-th iteration by At the current iteration k, we denote the likelihood set of historical trajectories T m at the new target distribution p k - where ( * ) comes from Step 4 of Algorithm 2 through ω ω ω Then, the stored likelihoods : ξ ξ ξ ∈ T m , b = 1, 2, . . . , B will be further reused for computing proposal likelihoods in next iteration k + 1. In online setting, the set of likelihoods is simplified to Moreover, we define gradient calculations using the trajectories included in T m by G k (T m ) = {g k (ξ ξ ξ) : ξ ξ ξ ∈ T m }. To reduce the computational cost, we can store and reuse the previously calculated likelihoods; see the grey cells illustrated in Figure 1 (b). In this way, we only need to compute those likelihoods associated with new trajectories generated at the current k iteration; see the white cells. Since the proposed MLR policy gradient approach with reusing likelihoods only calculates the white cells and reuses the grey ones, we can show that its computational complexity is O(k + |U k | 2 ) linearly growing with the iteration k; see Proposition 3. For the brute force approach without likelihoods reuse, the complexity is O(k 2 + |U k | 2 ) quadratically growing with the iteration k or |F k | in general. The proof of Proposition 3 can be found in the Appendix (Section G). Proposition 3. Suppose the calculations of likelihood D(τ τ τ ; θ θ θ, ω ω ω) and scenario-based gradient g k (τ τ τ ) take a constant complexity O(1). Then, at each k-th iteration, the computational complexity of implementing the selection rule in (19) and calculating the MLR gradient estimator in (13) is: (1) O(k + |U k | 2 ) with proposed storing and reusing past likelihoods; and (2) O(k 2 + |U k | 2 ) without reusing. In this section, we study the optimization convergence of proposed GS-PG approach and show that it has the superior convergence rate compared to the classical PG approach in (5) without reusing historical trajectories. Here, we assume the following conditions hold. 
A.1 (Bounded variance) There exists a constant σ > 0 that bounds the mean square error of classical PG estimator: The system mean response surface µ(·) is L-smooth. It means that µ(·) is differentiable and its gradient is L-Lipschitz, i.e., ∇µ(θ θ θ 1 ) − ∇µ(θ θ θ 2 ) ≤ L θ θ θ 1 − θ θ θ 2 for any policy parameters θ θ θ 1 , θ θ θ 2 ∈ R d . A.3 (ILR variance) There exists a constant ζ > 0 that bounds the conditional variance of ILR policy gradient estimator Tr Var ∇µ A.1 and A.2 are standard assumptions that are often used in stochastic approximation analysis. Assumption A.1 was originally introduced by Robbins and Monro (1951) and it is defined based on the bound of the L 2 norm inducing a metric d(X, Y ) = X − Y 2 . It states that the classical PG estimator ∇µ P G (θ θ θ) is unbiased and has bounded variance. Assumption A.2 is another essential condition in ensuring the convergence of many gradient decent based algorithms (Niu et al. 2011 , Reddi et al. 2016 , Nemirovski et al. 2009 , Li and Orabona 2019 . It implies that the gradient cannot change abruptly: |µ(θ θ θ 2 ) − µ(θ θ θ 1 ) − ∇µ(θ θ θ 1 ), θ θ θ 2 − θ θ θ 1 | ≤ L θ θ θ 2 − θ θ θ 1 2 (Nesterov (2003), Lemma 1.2.3). Assumption A.3 is a commonly used assumption for importance sampling in statistics (Koopman et al. 2009 , Geweke 1989 ) and green simulation literature (Feng and Staum 2017, Assumption (A6) ). It guarantees that the total variance of ILR estimator in the selection rule is finite. We first prove that the proposed GS-PG algorithm can provide a faster local convergence for a general non-convex objective in Theorem 3. Following the literature of nonconvex optimization (Zhang et al. 2019, Zhou and Cong 2018), we choose to state the convergence in terms of the average over all K iterations. This theorem presents the effect of learning rate η k on the average gradient norm, as well as the balancing effect from the size of the reuse set |U k | and the constant threshold c used for historical trajectory selection. As |U k | increases, the convergence rate tends to increase; see (27). Increasing c can tend to induce a larger reuse set. The proof of Theorem 3 can be found in the Appendix (Section H). Theorem 3. Suppose Assumptions A.1, A.2 and A.3 hold. Let η k denote the learning rate used in the k-th iteration and µ µ(θ θ θ ) denote the global optimum. We can have where σ and L are the bounds defined in Assumptions A.1 and A.2, and c is the selection constant defined in Theorems 1 and 2 for online and offline RL settings. Theorem 3 guarantees the local convergence of the proposed GS-PG algorithm for general non-convex optimization problems. By choosing the learning rate η k , the convergence rate is between O( 1 Theorem 4 sheds light on how the size of the reuse set |U k | and the learning rate η k impact on the convergence rate. If |U k | grows in O(k 1/p ), we can always choose a diminishing learning rate satisfying max{ 1 k 1−1/q , 1 k 1/p+1/q } ≤ 1 k 1/2 , or equivalently set q ≥ 2 and 2 q = 1 − 1 p , such that the GS-PG algorithm converges faster than O( 1 √ k ). Ideally if all historical trajectories satisfy the selection criteria, that is |U k | = k, its optimal rate of convergence becomes O ln k k with a constant learning rate. The proof of Theorem 4 can be found in the Appendix (Section I). Theorem 4. Suppose Assumptions A.1, A.2 and A.3 hold. 
If the size of reused sample set |U k | ∼ O(k 1/p ) and the learning rate η k = η1 k 1/q with p ≥ 1, q > 1, the proposed GS-PG algorithm has the bounded convergence rate O(max{ 1 k 1−1/q , 1 k 1/p+1/q }). The proposed GS-PG can guarantee the convergence even with a fixed learning rate. By fixing η k = η at the right hand side of (27), we have since lim k→∞ |U k | = ∞ almost surely which will be theoretically proved in Theorem 6. As the size of the reuse set grows to infinity almost surely with iteration k, i.e., P (lim k→∞ |U k | = ∞) = 1, the variance term 4cLσ 2 K K k=1 η k |U k | in the upper bound (27) will decrease faster than that of classic gradient descent methods. To show this asymptotic property of the reuse set U k , we first introduce three additional assumptions. A.4 (Strong concavity) The objective µ(θ θ θ) is locally λ-strongly concave or equivalently −µ(θ θ θ) is locally λ-strongly convex with a constant λ > 0, i.e., −µ(y y y) x ≤ r} denotes a neighborhood around the local maximizer x x x with radius r > 0. A.5 (Continuity) The likelihood D(τ τ τ ; θ θ θ, ω ω ω) and scenario-based policy gradient g(τ τ τ ) is continuous almost everywhere (a.e.) in the space of (τ τ τ , θ θ θ, ω ω ω). A.6 (Regularity condition) There exists a constant M > 0 such that the L 2 norm of the scenario-based policy gradient is bounded, i.e., g(τ τ τ ) ≤ M . Assumption A.4 is a weaker condition than the locally strong convexity assumption that is often used in many nonconvex stochastic optimization studies; for example Goebel and Rockafellar (2008) and Jain and Kar (2017). It rules out edge case, such as plateaus surrounding saddle points, and guarantees the convergence to a local optimum. Assumptions A.5 and A.6 are regularity conditions that are widely used in stochastic gradient descent methods enforcing the interchangeability of expectation and differentiation. With these additional assumptions, we can establish the almost sure convergence for the proposed GS-PG algorithm and use it for the asymptotic analysis of the reuse set U k . Here, we firstly present a classical convergence result (Robbins and Siegmund 1971). Lemma 2 (Robbins-Siegmund). Consider a filtration C k . Let (V k ) k≥1 , (U k ) k≥1 and (Z k ) n≥1 be three nonnegative (C k ) k≥1 -adapted processes and a sequence of nonnegative numbers (α k ) k≥1 such that k Z k < ∞ almost surely and Lemma 2 has been used by Sebbouh et al. (2021) to study the almost sure convergence of stochastic gradient descent. Following the similar idea, we can show that the policy parameters converge almost surely to some local optimum in Theorem 5. Then, given any feasible τ τ τ and ω ω ω, by applying continuous mapping theorem, we can have almost sure convergence of likelihood in Proposition 4. The proofs of Theorem 5, Proposition 4, and Theorem 6 can be found in the Appendix (Sections J -L). Theorem 5. Assume A.1, A.2, A.3, A.4 and A.6 hold. Let η k denote the learning rate in the k-th iteration and θ θ θ denote a local maximizer. We have θ θ θ k a.s. Finally, we present the main asymptotic property of the reuse set U k in Theorem 6 showing that its size continuously grows and converges almost surely to infinity under Assumptions A.1-A.6. The empirical study in Section 6.1 justifies this conclusion by showing the growing size of the reuse set with the increase of iteration k. Theorem 6. Suppose Assumptions A.1-A.6 hold. The size of the reuse set |U k | increases to infinity with probability 1 as the iteration k → ∞, i.e., P (lim k→∞ |U k | = ∞) = 1. 
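Before turning to the experiments, a rough order check of Theorem 4 may be helpful. This is only a sketch, assuming the upper bound in (27) consists of a bias term of order 1/Σ_k η_k together with the variance term (4cLσ²/K) Σ_k η_k/|U_k| quoted above.

```latex
% Order check (sketch) with |U_k| ~ k^{1/p} and eta_k = eta_1 / k^{1/q}, q > 1:
\[
  \sum_{k=1}^{K}\eta_k \;\asymp\; K^{1-1/q}
  \;\;\Longrightarrow\;\;
  \text{bias term} \;=\; O\!\left(K^{-(1-1/q)}\right),
\]
\[
  \frac{1}{K}\sum_{k=1}^{K}\frac{\eta_k}{|U_k|}
  \;\asymp\; \frac{1}{K}\sum_{k=1}^{K} k^{-(1/p+1/q)}
  \;=\; O\!\left(K^{-(1/p+1/q)}\right)
  \quad\text{when } 1/p+1/q<1,
\]
% so the averaged gradient norm decays at rate
% O(max{ K^{-(1-1/q)}, K^{-(1/p+1/q)} }), matching Theorem 4.
```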
In this section, we use two test examples to empirically assess the finite sample performance of the proposed GS-PG and compare it with the classical PG estimator in (5) for both online and offline RL settings in terms of: a) convergence rate; b) convergence behaviors with a fixed learning rate η k ; and c) policy gradient estimation variance reduction. We also study the historical trajectory reusing pattern of proposed GS-PG, as well as the sensitivity of its performance to the selection of c value and the number of replications n. In Section 6.1, we evaluate the performance under the online setting by using a classical RL control problem named cartpole. Then, in Section 6.2, we consider both online and offline settings using a real-world biomanufacturing problem called fed-batch fermentation for cell culture and production; see the detailed description in Zheng et al. (2021) . In the experiments, we set the number of replications n = 4 and the number of macro-replications to be 30. For the offline setting, we set the number of posterior samples B = 4. For the convergence analysis using a fixed learning rate η, we update policy parameters with θ θ θ k+1 ← θ θ θ k − η ∇µ M LR k . In all other cases, we set the initial stepsize η 0 = 0.005 and use "Adam" optimizer (Kingma and Ba 2015a) to update it; see Algorithm 3 in Appendix M. To assess the algorithm performance, in each -th macro-replication, we record the total reward obtained in the j-th replication at k-th iteration, denoted by R (k,j) , which is H t=1 γ t r for the offline setting. We set the discount factor γ = 0.99 for Cartpole example and γ = 1 for fermentation example. At each k-th iteration, based on the results obtained from 30 macro-replications, we represent the estimation uncertainty of outputs, denoted by x k (i.e., the expected total reward, the size of reuse set |U k |, and the trace of policy gradient estimation variance matrix Tr(Var[ ∇µ(θ θ θ k )])), by using the 95% confidence bands, (µ m k − z · SE k , µ m k + z · SE k ) , with z = 1.96, sample mean µ m k = 1 30 30 =1 x k, , and sample standard error (SE), i.e., SE k = Specifically, in the convergence plots, we record the averaged total reward x k, = 1 n n j=1 R (k,j) ; in the figure of reusing patterns (i.e., Figure a) , we set x k, = |U k, |, the size of reuse set obtained in the -th macro-replication; and lastly we set x k, to be the natural logarithm of the total variance or the trace of variance matrix of MLR gradient estimator at the k-th iteration and -th macro-replication for Figure b. Cartpole is a pendulum with a center of gravity above its pivot point; see the problem description in Appendix A and Barto et al. (1983) . The goal is to keep the cartpole balanced by applying appropriate force − → F t , such as left (0) and right (1), to a pivot point and prevent the pendulum from falling over. Each episode or simulation run terminates if (1) the pole angle is more than 12 degree from the vertical axis, or (2) the cart position is more than 2.4 unit from the centre, or (3) the episode length is greater than 200. The agent receives a reward of 1 for every step taken including the termination step. The problem is considered "solved" if the average reward is greater than 195 for over 100 consecutive iterations. To control the binary action − → F t , we choose a softmax policy (28). The simulation model is provided in the standard Open AI library (Brockman et al. 2016) , and we refer interested readers to Barto et al. (1983) for details of Cartpole simulation. 
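To make the softmax policy (28) concrete, a minimal sketch of the score network with the architecture described in Appendix A is given below (a 4-dimensional input layer, two 32-dimensional hidden layers, a 2-dimensional output layer, and uniform initialization in [−1, 1], giving 1282 parameters). The ReLU hidden activation is an assumption, as the paper does not state the activation function.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """Softmax policy (28) for the binary Cartpole action: a score network
    mu_theta(s, .) followed by a softmax over the two actions.  Architecture
    follows Appendix A (4 -> 32 -> 32 -> 2, 1282 parameters); the ReLU hidden
    activation is an assumption."""

    def __init__(self):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(4, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )
        for p in self.parameters():          # uniform initialization in [-1, 1]
            nn.init.uniform_(p, -1.0, 1.0)

    def forward(self, state):                # state: tensor of shape (4,)
        return torch.softmax(self.score(state), dim=-1)   # pi_theta(. | s)

    def sample(self, state):
        probs = self.forward(state)
        return torch.distributions.Categorical(probs=probs).sample().item()
```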
We first study the number of iterations required by GS-PG and PG based policy gradient approaches to solve Cartpole problem (i.e., the average reward is greater than 195 over 100 consecutive iterations); see the results in Figure a. In specific, GS-PG takes 178.83 (SE: 15.61) iterations to solve this online RL optimization problem, whereas the classical PG method takes 489.34 (SE: 75.95) iterations, almost three times as many as GS-PG. Overall, the proposed GS-PG requires less and more consistent number of iterations to converge. In addition,we study the sensitivity of convergence performance to the choice of constant c used in determining the reuse set U k . Figure b shows the convergence of GS-PG algorithm and Table 1 F k ≤ c |U k | Tr Var ∇µ P G k θ θ θ k . At the beginning, the reuse set U k tends to be small and noisy so that both of the factor c |U k | and the sample variance of MLR estimator are relatively large. As the iteration k increases, the size |U k | increases (see Figure a) . Thus, the value of Tr Var ∇µ M LR k F k gradually decreases, while Tr Var ∇µ P G k θ θ θ k , estimated based on a fixed number of new replications generated in the k-th iteration, keeps relatively constant. Theorem 3 implies that the GS-PG can have asymptotic optimization convergence even with a fixed learning rate η k if the size of the reuse set |U k | increases to infinity as iteration k grows. Here, we empirically investigate this property and the convergence behaviors of the classical PG and proposed GS-PG based policy gradient optimization with the fixed learning rate η = 0.005, 0.01. We fix the total number of iterations K = 250 and set the maximum episode length to 100 (it means that the maximum reward is 100). The mean and 95% confidence interval (CI) of average reward are shown in Figure 4 . The results indicate that the classical PG (red line) diverges quickly after few iterations while the average reward of GS-PG method continues to increase with iterations. Thus, the convergence behavior of proposed GS-PG using a fixed stepsize is consistent with the conclusion in Theorem 3. As the fixed learning rate or stepsize does not satisfy the Robbins-Monro conditions, the divergence of the PG is within expectations; see Robbins and Monro (1951) . In this section, we use a biomanufacturing fermentation process to study the finite-sample performance of proposed GS-PG and PG. The fermentation process plays a critical role impacting productivity and downstream purification cost. Specifically, we consider a fed-batch fermentation of Yarrowia lipolytica yeast for citrate production; see Zheng et al. (2021) . Our objective is to find an optimal feed policy to maximize the expected manufacturing profit. We consider a 140-hour fermentation process and the RL agent needs to decide the feed rate of substrate every 4 hours to maximize the expected profit. This process contains H = 36 time steps (proceeding in increments of four hours). The MDP is formulated with a five-dimensional state describing the bioprocess status, an one-dimensional continuous action, and the reward function (29); see details in Appendix B. The fermentation process is simulated by a stochastic differential equations based kinetic model. We take it as the underlying SDP for the "real" system. We set the total budget in terms of the number of experiment runs as N = 120, i.e., each experiment has n = 4 replications and there are K = 30 iterations. 
In online setting, RL agent directly interacts with this real system; in offline setting, we first "collect" real-world historical trajectory data and then construct a dynamic Bayesian network (DBN) based process hybrid model as simulation model or digital twin. We quantify the estimation uncertainty for the simulation model parameters using the posterior distribution p( ω ω ω|D). We generate B posterior samples using Gibbs sampling. We refer the interested readers to Zheng et al. (2021) and Xie et al. (2021) for detailed simulator, DBN models, and Bayesian network model inference. We evaluate the performance of GS-PG (with c = 4) and classical PG approaches in terms of expected total reward and productivity (i.e., final titer) received after K = 30 iterations. We record the means and SEs of these outcomes in Table 2 , and plot the convergence curves of means and CIs obtained by the PG and GS-PG approaches in Figure 5 . The results indicate that GS-PG significantly outperforms PG. In specific, compared to the classical PG approach, GS-PG increases the expected reward from 58.45 to 87.05 and the final titer from 81.92 g/L to 93.53 g/L for online setting. Similar improvements are also observed in offline setting: GS-PG outperforms PG by 28.5 unit of expected reward and improves final product concentration from 84.92 g/L to 97.09 g/L. The convergence results in Figure 5 illustrate that GS-PG converges consistently faster than PG in both online and offline RL settings. The relative larger variance or wider CI in Figure 5 is mainly caused by the large range of improvement in different macro-replications as shown in Figure a , where GS-PG has converged to optimal reward (median reward >110) in the majority of macro-replications. Given the optimal solutions obtained from 30 macro-replications, Figure 6 depicts the median with the interquartile ranges (IQR) of results from both PG and GS-PG under different numbers of replications, n = 4, 10, 16. The upper and lower bars represent the minimum and maximum expected reward obtained from 30 macro-replications. GS-PG provides higher expected reward than PG. As the number of replications n per iteration increases, both GS-PG and PG have more accurate estimation of policy gradient. Thus, their optimization performance becomes better. In addition, as n increases, the improvement in the accuracy of policy gradient variance estimation, i.e., Var ∇µ ILR i,k |F k and Var ∇µ P G k |θ θ θ k , involved in the selection criteria (16) and (19) can lead to more reliable selection of most relevant historical trajectories. Thus, the proposed GS-PG approach demonstrates better and more stable performance if more simulation runs are allocated in each iteration. Here, we present the empirical results studying the sensitivity of the performance of proposed GS-PG to the reuse set selection criterion parameter c and the number of replications n allocated per iteration. First, we study the performance of GS-PG with different values of constant c. Table 3 shows that the efficacy of the GS-PG algorithm is robust to the selection of c in both online and offline settings. Next, we study the sensitivity of the performance of GS-PG to the number of replications n allocated per iteration given a fixed budget N = 120. Here, we set c = 4. The results obtained from 30 macro-replications are summarized in Table 4 . 
In both online and offline settings, we observe that the expected reward is highest when n = 8 and the choice of larger or smaller n values tends to decrease the expected reward. This implies a trade-off in the budget allocation for the optimal policy search. Given a fixed and tight experiment budget, when n is large, we tend to have more reliable selection of historical trajectories but less number of iterations to optimize the RL policy; while when n is small, we tend to have more learning iterations but less reliable reusing selection and policy gradient estimation. In Section 5.4, we present a likelihood reusing approach that can avoid repeatedly calculating all likelihoods needed for determining the reuse set U k and estimating the policy gradient. We provide the asymptotic analysis in Proposition 3 to explain how GS-PG computation cost scales with iterations. To further investigate how much computational cost the proposed likelihood reusing approach can save, we compare the computation cost with/without reusing the likelihoods calculations. The results as shown in Table 5 show that the computation cost of the brute force approach can not scale well as the number of iterations increases. When k > 30, it takes more than 5 minutes overload per each experiment run. Thus, the proposed likelihood reusing can greatly reduce the computational cost. Lastly, we compare the average overhead of computational time of GS-PG and PG per simulation run. The computational time (in minute) of GS-PG is 0.28 ± 0.01 (online) and 0.97 ± 0.03 (offline); that of PG is 0.06 ± 0.003 (online) and 0.15 ± 0.003 (offline). As we discussed in Section 5.4, the computational complexity of the proposed GS-PG approach increases with the number of iterations. However, for real-world complex stochastic systems, such as integrated biopharmaceutical manufacturing processes, each experiment run can be very expensive and the total budget is often very tight. Thus, the overhead cost of the proposed GS-PG framework is ignorable. It becomes an emerging trend to use digital twins, representing the risk-and science-based understanding of underlying mechanisms, to guide the process control for complex stochastic systems under high uncertainty, such as integrated biopharmaceutical manufacturing systems. In this paper, we introduce a new green simulation assisted policy gradient (GS-PG) framework for both online and offline reinforcement learning settings. It can reduce the policy gradient estimation variance through selectively reusing historical trajectories and automatically allocating more weight to those trajectories that are more likely generated by the target stochastic decision process model. Both theoretical and empirical studies show that such variance reduction in gradient estimation can substantially increase the convergence rate of policy gradient methods and the proposed GS-PG can outperform the state-of-art approaches especially under the situations with a tight budget. Cartpole (as shown in Figure 7) is a pendulum with a center of gravity above its pivot point; see more detailed description in Barto et al. (1983) . The goal is to keep the cartpole balanced by applying appropriate forces to a pivot point under the center of mass and prevent the pendulum from falling over. The state vector s s s t for this system is a four dimensional vector having components x t ,ẋ t , α t ,α t representing position of cart, velocity of cart, angle of pole, rotation rate of pole respectively at any time t. 
The action $a_t$ applies a force $\vec{F}_t$ to the cart: left (0) or right (1). Each episode or simulation run terminates if (1) the pole angle exceeds 12 degrees from the vertical axis, (2) the cart position is more than 2.4 units from the center, or (3) the episode length exceeds 200. The agent receives a reward of 1 for each step taken, including the termination step. The cartpole system is simulated on a digital computer using a system of differential equations that includes all of the nonlinearities and reactive forces of the physical system. In this paper, we use the simulation model provided in the standard OpenAI Gym library (Brockman et al. 2016) and refer interested readers to Barto et al. (1983) for details of the cartpole simulation.

For the binary action, we use a softmax policy that converts a score function, denoted by $\mu_{\boldsymbol{\theta}}(\boldsymbol{s}, a)$, into a distribution of probabilities,
$$\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s}) = \frac{\exp\left(\mu_{\boldsymbol{\theta}}(\boldsymbol{s}, a)\right)}{\sum_{a'=0}^{1} \exp\left(\mu_{\boldsymbol{\theta}}(\boldsymbol{s}, a')\right)}, \qquad (28)$$
where $\mu_{\boldsymbol{\theta}}: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is an approximation function. When $\mu_{\boldsymbol{\theta}}$ is linear, the policy $\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s})$ becomes logistic regression. To improve the approximation, in this paper we parameterize $\mu_{\boldsymbol{\theta}}$ by a multi-layer neural network. It contains a 4-dimensional input layer, two 32-dimensional hidden layers, and a 2-dimensional output layer, followed by the softmax function (28); see Section 11 in Hastie et al. (2001) for more discussion of multi-layer neural networks. In total, the policy parameter vector has $d = 1282$ dimensions. All parameters are initialized from a uniform distribution on $[-1, 1]$.

In the empirical study, we use a biomanufacturing example, a fed-batch fermentation, to study the finite-sample performance of the proposed GS-PG and PG. The fermentation process plays a critical role in biomanufacturing: it directly impacts productivity and influences the downstream purification cost. Specifically, we consider a fed-batch fermentation of Yarrowia lipolytica yeast for citrate (or citric acid) production; see details in Zheng et al. (2021). We need to find an optimal feed policy for the bioreactor to maximize productivity. We consider a 140-hour fermentation process, and the RL agent needs to decide the feed rate every 4 hours to optimize the expected profit. This means the process contains $H = 36$ time steps (proceeding in increments of four hours, up to the harvest time of 140 hours) with 35 decisions to be made. We have a five-dimensional continuous state variable $\boldsymbol{s} = \{X_f, C, S, N, V\}$, where $X_f$ represents the lipid-free cell mass; $C$ measures citrate, the actual product to be harvested at the end of the bioprocess, generated by the cells' metabolism; $S$ and $N$ are the amounts of substrate (i.e., oil) and nitrogen, both used for cell growth and maintenance; and $V$ is the working volume in the bioreactor. We also have one scalar continuous action $F_S$, which represents the feed rate, i.e., the amount of new substrate given to the cells per unit of time. The acceptance range of the feed rate is between 0 g/L and 0.02 g/L. The fermentation trajectory $\boldsymbol{\tau}$ is evaluated in terms of revenue (based on how much bioproduct is harvested at the end of the production process) as well as cost. The latter includes the fixed cost of operating the bioreactor, running the sensors, and maintaining the facilities, as well as variable manufacturing costs related to the fed oil. As such, the reward function is defined by
$$r(\boldsymbol{s}_t, a_t) = \begin{cases} -15 + 1.29\, C_t, & \text{if } t = H \text{ (fixed cost and revenue)}, \\ -534.52\, a_t, & \text{if } 0 \le t < H \text{ (variable manufacturing cost)}. \end{cases}$$
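As an illustration of this reward specification, the following minimal Python sketch evaluates it for a given state, action, and time step; the state tuple layout (citrate concentration $C_t$ stored at index 1) and the function signature are illustrative assumptions rather than the paper's implementation.

```python
H = 36  # number of time steps over the 140-hour fermentation (4-hour increments)

def reward(state, action, t):
    """state = (X_f, C, S, N, V); action = feed rate F_S in [0, 0.02]."""
    citrate = state[1]
    if t == H:
        return -15.0 + 1.29 * citrate   # fixed operating cost plus harvest revenue
    return -534.52 * action             # variable manufacturing cost of fed oil
```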
As the feed rate is a continuous action within a regulation-defined acceptance range, we choose a truncated Gaussian policy with state-independent variance. For continuous control, the Gaussian policy is a common choice (Papini et al. 2020); this class of policies subsumes deterministic policies with independent Gaussian noise added for exploration (Tangkaratt et al. 2018). In addition, our feed-rate continuous control task has a bounded action set, which motivates truncating the Gaussian policy; similar approaches can be found in other applications, such as Fujita and Maeda (2018) and Duan et al. (2016). Given state $\boldsymbol{s}$, the probability of an action $a \in [u, v]$ is given by
$$\pi_{\boldsymbol{\theta}}(a|\boldsymbol{s}) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{a-\mu}{\sigma}\right)}{\Phi\!\left(\frac{v-\mu}{\sigma}\right) - \Phi\!\left(\frac{u-\mu}{\sigma}\right)}, \qquad (30)$$
where $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function and cumulative distribution function of the standard normal distribution, and $u = 0$ and $v = 0.02$ represent the lowest and highest feed rates, respectively. The mean $\mu = \mu_{\boldsymbol{\theta}}(\boldsymbol{s})$ is given by a function approximator parameterized by $\boldsymbol{\theta}$, and the standard deviation is set to $\sigma = 0.1$. Here we choose a multi-layer neural network to parameterize the mean function $\mu_{\boldsymbol{\theta}}(\boldsymbol{s})$ for both PG and GS-PG. It contains a 6-dimensional input layer, two 32-dimensional hidden layers, and a 1-dimensional output layer, followed by the truncated normal function (30). Therefore, taking intercept terms into consideration, the policy parameter vector has 1313 dimensions in total. All parameters are initialized uniformly between $-0.1$ and $0.1$.

Lemma 1. Conditional on $\mathcal{F}_k$, the proposed MLR policy gradient estimators (11) and (13) for the online and offline RL settings are unbiased; for the offline setting, this means $\mathbb{E}\big[\nabla\mu^{MLR}_k \,\big|\, \mathcal{F}_k\big] = \nabla\mu(\boldsymbol{\theta}_k)$.

Proof. We show that the MLR policy gradient estimator in (13) is unbiased. Following a similar derivation (with the underlying model $\boldsymbol{\omega}^c$), we can also show that the MLR policy gradient estimator for the online RL setting in (11) is unbiased.

Denote $\boldsymbol{\theta} = (\theta^{(1)}, \ldots, \theta^{(d)}) \in \mathbb{R}^d$. Recall that $g_k(\boldsymbol{\tau}) = \sum_{t=1}^{H} \Psi_t \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}_k}(\boldsymbol{a}_t|\boldsymbol{s}_t)$ denotes the scenario-based policy gradient at $\boldsymbol{\theta}_k$; it is a $(d \times 1)$-dimensional vector,
$$g_k(\boldsymbol{\tau}) = (g_1, \ldots, g_d)^\top = \left(\sum_{t=1}^{H} \Psi_t \frac{\partial}{\partial \theta^{(1)}} \log \pi_{\boldsymbol{\theta}_k}(\boldsymbol{a}_t|\boldsymbol{s}_t), \ \ldots, \ \sum_{t=1}^{H} \Psi_t \frac{\partial}{\partial \theta^{(d)}} \log \pi_{\boldsymbol{\theta}_k}(\boldsymbol{a}_t|\boldsymbol{s}_t)\right)^\top.$$
For the offline RL setting, both the ILR and MLR policy gradient estimators are unbiased (Lemma 1).

Proposition 1. Conditional on $\mathcal{F}_k$, the total variance of the MLR policy gradient estimator is smaller than or equal to that of the average ILR policy gradient estimator.

Proof. Let $\boldsymbol{\tau}^{(i,j)}$, with $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in U_k$ and $j = 1, 2, \ldots, n$, be i.i.d. samples from the SDP mixture distribution $\frac{1}{|U_k|} \sum_{(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in U_k} D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$, and obtain the corresponding policy gradient estimator $\nabla\mu^{Mix}_k$, whose conditional mean is $\mathbb{E}_{\boldsymbol{\omega} \sim p(\boldsymbol{\omega}|\mathcal{D})} \mathbb{E}_{\boldsymbol{\tau} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})}[g_k(\boldsymbol{\tau})]$ and thus equals $\nabla\mu(\boldsymbol{\theta}_k)$ by Lemma 1. By plugging in (12), we obtain the stated comparison, where Step ($\ast$) holds because of the conditional independence of the samples $\boldsymbol{\omega}_b \sim p(\boldsymbol{\omega}|\mathcal{D})$ and $\boldsymbol{\tau}^{(i,j)} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$ with $i \in U_k$, $j = 1, 2, \ldots, n$, and $b = 1, 2, \ldots, B$, and the final bound holds due to the well-known inequality between the arithmetic and harmonic means. Next, we observe that $\nabla\mu^{MLR}_k$ is a stratified-sampling version of $\nabla\mu^{Mix}_k$, with an equal number $n$ of samples allocated to each SDP model included in $U_k$. This means equally weighted strata: the sampling likelihood of the $i$-th stratum is $D(\cdot; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$.
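Before turning to the selection-rule theorems, here is a minimal Python sketch of the mixture-likelihood-ratio reweighting idea behind the offline MLR estimator: each reused trajectory is weighted by the ratio of the posterior-averaged target likelihood to the mixture of the selected proposal likelihoods. The callables and the data layout are illustrative placeholders, and the indexing is simplified relative to estimator (13).

```python
import numpy as np

def mlr_policy_gradient(reuse_set, target_models, traj_likelihood, score_grad):
    """Mixture-likelihood-ratio policy gradient (a sketch of the idea behind (13)).

    reuse_set      : list of dicts, each with proposal parameters "theta", "omega"
                     and its stored "trajectories" (illustrative names).
    target_models  : list of (theta_k, omega_b) pairs defining the target mixture.
    traj_likelihood(tau, theta, omega) -> float, likelihood D(tau; theta, omega).
    score_grad(tau, theta_k) -> np.ndarray, scenario-based gradient g_k(tau).
    """
    grads = []
    for entry in reuse_set:
        for tau in entry["trajectories"]:
            # Mixture proposal likelihood over the selected historical SDP models.
            mix = np.mean([traj_likelihood(tau, e["theta"], e["omega"])
                           for e in reuse_set])
            # Target likelihood averaged over the posterior samples (offline setting).
            target = np.mean([traj_likelihood(tau, th, om)
                              for th, om in target_models])
            grads.append((target / mix) * score_grad(tau, target_models[0][0]))
    return np.mean(grads, axis=0)
```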
Theorem 1. At the $k$-th iteration for the online RL setting, where the target distribution is $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)$, the reuse set $U_k$ includes the SDP models $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}^c)$ with $\boldsymbol{\theta}_i \in \mathcal{F}_k$ whose ILR policy gradient estimator's total variance is no greater than $c$ times the total variance of the classical PG estimator, for some constant $c > 1$. Mathematically,
$$U_k = \left\{ (\boldsymbol{\theta}_i, \boldsymbol{\omega}^c) : \boldsymbol{\theta}_i \in \mathcal{F}_k, \ \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{ILR}_{i,k} \,\middle|\, \mathcal{F}_k\right]\right) \le c \, \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{PG}_k \,\middle|\, \boldsymbol{\theta}_k\right]\right) \right\}.$$
Then, based on such a reuse set $U_k$, the total variance of the MLR policy gradient estimator (11) is no greater than $\frac{c}{|U_k|}$ times the total variance of the classical PG estimator,
$$\mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{MLR}_k \,\middle|\, \mathcal{F}_k\right]\right) \le \frac{c}{|U_k|} \, \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{PG}_k \,\middle|\, \boldsymbol{\theta}_k\right]\right).$$

Proof. Consider the target distribution $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)$ instead of the mixture $\frac{1}{B} \sum_{b=1}^{B} D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b)$ with $\boldsymbol{\omega}_b \sim p(\boldsymbol{\omega}|\mathcal{D})$ in the policy gradient estimator. The proof then follows the same steps as that of Theorem 2.

Theorem 2 [Offline Selection Rule]. At the $k$-th iteration for the offline RL setting, where the target distribution is $\frac{1}{B} \sum_{b=1}^{B} D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b)$ with $\boldsymbol{\omega}_b \sim p(\boldsymbol{\omega}|\mathcal{D})$, the reuse set $U_k$ includes the SDP models $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$ with $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k$ whose ILR policy gradient estimator's total variance is no greater than $c$ times the total variance of the classical PG estimator, for some constant $c > 1$. Mathematically,
$$U_k = \left\{ (\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k : \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{ILR}_{i,k} \,\middle|\, \mathcal{F}_k\right]\right) \le c \, \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{PG}_k \,\middle|\, \boldsymbol{\theta}_k\right]\right) \right\}.$$
Then, based on such a reuse set $U_k$, the total variance of the MLR policy gradient estimator is no greater than $\frac{c}{|U_k|}$ times the total variance of the classical PG estimator,
$$\mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{MLR}_k \,\middle|\, \mathcal{F}_k\right]\right) \le \frac{c}{|U_k|} \, \mathrm{Tr}\left(\mathrm{Var}\left[\nabla\mu^{PG}_k \,\middle|\, \boldsymbol{\theta}_k\right]\right).$$

Proof. Conditional on $\mathcal{F}_k$, the trajectories $\boldsymbol{\tau}^{(i,j)}$ with $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in U_k$ and $j = 1, 2, \ldots, n$, and the posterior samples $\boldsymbol{\omega}_b$ for $b = 1, 2, \ldots, B$, are all independent. Thus, we have the chain of bounds (33)-(37), where inequality (33) ..., where (36) ..., and where the equality holds due to the fact that .... Then, with (35)-(37), we can show the claimed bound. This completes the proof.

The sample covariance estimator is an unbiased estimator of the covariance $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$, where $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and the independent samples $\{X_i : i = 1, \ldots, n\}$ have the same mean $\mathbb{E}[X_i] = \mathbb{E}[X]$ for all $i = 1, 2, \ldots, n$.

Proposition 2. Given $\mathcal{F}_k$, the policy gradient variance estimators, including $\mathrm{Var}\big[\nabla\mu^{PG}_k \,\big|\, \boldsymbol{\theta}_k\big]$ in (22) and $\mathrm{Var}\big[\nabla\mu^{ILR}_{i,k} \,\big|\, \mathcal{F}_k\big]$ in (23), are unbiased.

Proposition 3. Suppose the calculation of the likelihood $D(\boldsymbol{\tau}; \boldsymbol{\theta}, \boldsymbol{\omega})$ and of the scenario-based gradient $g_k(\boldsymbol{\tau})$ each take constant complexity $O(1)$. Then, at each $k$-th iteration, the computational complexity of implementing the selection rule in (19) and calculating the MLR gradient estimator in (13) is: (1) $O(k + |U_k|^2)$ with the proposed storing and reusing of past likelihoods; and (2) $O(k^2 + |U_k|^2)$ without reusing.

Proof. In this proposition, we perform the complexity analysis for each iteration $k$. To simplify the analysis and notation, we assume that each evaluation of the likelihood of the probabilistic policy $q(\cdot; \boldsymbol{\theta})$, of the transition model $h(\cdot; \boldsymbol{\omega})$, and of the scenario-based gradient $g(\cdot|\boldsymbol{\theta})$ has constant computational complexity, since these depend only on constants such as the planning horizon and the dimensions of the states, actions, and policy parameters, which do not scale with the number of iterations. First of all, whether or not likelihoods are reused, we always take $O(nB)$ operations to calculate the gradients $G_k(\mathcal{T}_m)$ for the trajectories visited in the $m$-th iteration, i.e., $G_k(\mathcal{T}_m) = \{g_k(\boldsymbol{\xi}); \boldsymbol{\xi} \in \mathcal{T}_m\}$ with the trajectory set $\mathcal{T}_m \equiv \big\{\boldsymbol{\tau}^{(m_b,j)} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_m, \boldsymbol{\omega}_b) \text{ with } \boldsymbol{\omega}_b \sim p(\boldsymbol{\omega}|\mathcal{D}) \text{ for } b = 1, 2, \ldots, B \text{ and } j = 1, 2, \ldots, n\big\}$.
Thus, it requires $O(nBk)$ operations to calculate all gradients $\bigcup_{m=1}^{k} G_k(\mathcal{T}_m)$, since we need to compute the scenario-based gradients of all trajectories generated so far at the new candidate policy parameters $\boldsymbol{\theta}_k$ to support the selection of the reuse set $U_k$; see Part (B) in Figure 1(b).

Case 1 with Likelihood Save and Reuse: Since the selection rule requires enumerating all past SDP models in $\mathcal{F}_k$, the computational complexity for assessing the selection criterion in (19) includes: (1) $O(nBk)$ (gradient calculations); (2) $O(nB^2 k)$ (new likelihoods); (3) $O(nB^2 k)$ (historical trajectory selection); and (4) $O(nB|U_k|^2)$ (MLR gradient). Therefore, combining the selection and MLR gradient estimation steps, the total computational complexity for Case 1 is $O(k + |U_k|^2)$, as $n$ and $B$ are both constants.

(1) New Likelihoods: In this case, we have stored all the likelihoods associated with the SDP proposal distributions visited and the trajectories collected in the previous iterations; see the grey cells in Part (A) of Figure 1(b). It takes $O(nB^2)$ to calculate the likelihoods of the newly generated trajectories $\mathcal{T}_k$ at the historical policy models $\boldsymbol{\theta}_m$ and transition models $\boldsymbol{\omega}_b$, $b = 1, \ldots, B$. It also takes $O(nB^2)$ to calculate the likelihoods of the historical trajectories $\mathcal{T}_m$ sampled in the $m$-th iteration at the current SDP models. Since the selection rule requires enumerating all iterations $m = 1, 2, \ldots, k$, the total computational complexity of generating all likelihoods becomes $O(nB^2 k)$.

(2) Historical Trajectory Selection: As we have stored all the likelihoods calculated so far (see the grey cells in Part (A) of Figure 1(b)) and have also computed all gradients and likelihoods for the new process model and trajectories (see Part (1) above), it only takes $O(nB)$ additions/subtractions and $O(nB)$ multiplications to estimate $\mathrm{Var}\big[\nabla\mu^{ILR}_{i,k} \,\big|\, \mathcal{F}_k\big]$ for each $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k$ by using (23), and $\mathrm{Var}\big[\nabla\mu^{PG}_k \,\big|\, \boldsymbol{\theta}_k\big]$ by using (22). Since the selection rule requires enumerating all past SDP models in $\mathcal{F}_k$, the total complexity of the historical trajectory selection becomes $O(nB|\mathcal{F}_k|) = O(nB^2 k)$.

(3) MLR Policy Gradient Calculation: After the reuse set $U_k$ is selected, we compute the MLR gradient estimator (13). For each $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in U_k$, $j = 1, 2, \ldots, n$, and $b = 1, 2, \ldots, B$, it takes $O(|U_k|)$ to estimate the mixture $\frac{1}{|U_k|} \sum_{(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in U_k} D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$ by reusing the likelihoods $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$. Then, by summing up all cases in (40), the complexity of computing $\nabla\mu^{MLR}_k$ is $O(nB|U_k|^2)$.

Case 2 without Likelihood Save and Reuse: Suppose that we do not store the historical likelihoods calculated in previous iterations. Then, at each iteration, we need to compute all likelihoods and gradients associated with the trajectories generated in the past and current iterations, i.e., $\mathcal{T} = \cup_{m=1}^{k} \mathcal{T}_m = \{\boldsymbol{\tau}^{(i,j)}; i = 1, 2, \ldots, kB \text{ and } j = 1, 2, \ldots, n\}$, as well as all proposal distributions $\{D(\boldsymbol{\xi}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i); (\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k, \boldsymbol{\xi} \in \mathcal{T}\}$ and target distributions $\{D(\boldsymbol{\xi}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b); \boldsymbol{\xi} \in \mathcal{T}\}$. The computational complexity is $O(k^2)$; see all cells in Figure 1(b). Then, based on these likelihoods and gradients, the remaining cost for assessing the selection criterion (19) and computing the MLR gradient estimator (13) is the same as in Case 1, with complexity $O(k + |U_k|^2)$. Therefore, the total computational cost for Case 2 is $O(k^2 + k + |U_k|^2) = O(k^2 + |U_k|^2)$.
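A minimal sketch of the save-and-reuse scheme analyzed in Case 1 is given below: likelihoods $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$ are stored once and looked up afterwards, so at iteration $k$ only the pairs involving newly generated trajectories or the new SDP models miss the cache, which is what drops the per-iteration cost from $O(k^2 + |U_k|^2)$ to $O(k + |U_k|^2)$. The class and method names are illustrative, not the paper's implementation.

```python
class LikelihoodCache:
    """Store likelihoods D(tau; theta, omega) keyed by (trajectory id, model id),
    so each (trajectory, SDP model) pair is evaluated only once across iterations."""

    def __init__(self, likelihood_fn):
        self.likelihood_fn = likelihood_fn   # callable: (trajectory, theta, omega) -> float
        self._cache = {}

    def get(self, traj_id, trajectory, model_id, theta, omega):
        key = (traj_id, model_id)
        if key not in self._cache:           # only new (trajectory, model) pairs are computed
            self._cache[key] = self.likelihood_fn(trajectory, theta, omega)
        return self._cache[key]
```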
Lemma 4. Suppose that Assumption A.1 holds. For any feasible policy parameters $\boldsymbol{\theta} \in \mathbb{R}^d$, we have $\mathbb{E}\big[\|\nabla\mu^{PG}\|^2\big] \le 2\,\mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta})\|^2\big] + 2\sigma^2$.

Proof. We proceed by using Assumption A.1, where (41) follows from the Cauchy-Schwarz inequality and (42) is obtained by applying Assumption A.1. This completes the proof.

Theorem 3. Suppose Assumptions A.1, A.2, and A.3 hold. Let $\eta_k$ denote the learning rate used in the $k$-th iteration and $\mu^\star = \mu(\boldsymbol{\theta}^\star)$ denote the global optimum. We have
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta}_k)\|^2\big] \le \frac{1}{K}\left( \frac{4}{\eta_K}|\mu^\star| - \frac{2}{\eta_1}\mu(\boldsymbol{\theta}_1) + 4cL\sigma^2 \sum_{k=1}^{K} \frac{\eta_k}{|U_k|} \right),$$
where $\sigma$ and $L$ are the bounds defined in Assumptions A.1 and A.2, and $c$ is the selection constant defined in Theorems 1 and 2 for the online and offline RL settings.

Proof. Let $\mu^\star = \mu(\boldsymbol{\theta}^\star)$ represent the global optimal mean response. Assumption A.2 implies the $L$-smoothness property $|\mu(\boldsymbol{\theta}_k) - \mu(\boldsymbol{\theta}_{k+1}) - \langle \nabla\mu(\boldsymbol{\theta}_k), \boldsymbol{\theta}_k - \boldsymbol{\theta}_{k+1} \rangle| \le L\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k\|^2$ (Nesterov (2003), Lemma 1.2.3). It gives
$$\mu(\boldsymbol{\theta}_k) - \mu(\boldsymbol{\theta}_{k+1}) \le \langle \nabla\mu(\boldsymbol{\theta}_k), \boldsymbol{\theta}_k - \boldsymbol{\theta}_{k+1} \rangle + L\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k\|^2. \qquad (43)$$
We then apply the policy parameter update implemented in the proposed GS-PG algorithm, $\boldsymbol{\theta}_{k+1} = \boldsymbol{\theta}_k + \eta_k \nabla\mu^{MLR}_k(\boldsymbol{\theta}_k)$. Taking the expectation of $\langle \nabla\mu(\boldsymbol{\theta}_k), \eta_k \nabla\mu^{MLR}_k(\boldsymbol{\theta}_k) \rangle$ conditional on $U_k$ yields (44), where Step ($\ast$) follows because the MLR estimator $\nabla\mu^{MLR}_k(\boldsymbol{\theta}_k)$ is conditionally unbiased (Lemma 1). Taking the expectation of $\|\boldsymbol{\theta}_{k+1} - \boldsymbol{\theta}_k\|^2$ conditional on $\mathcal{F}_k$ and then applying Theorem 1 for the online setting and Theorem 2 for the offline setting yields (45). Then, by taking the unconditional expectation of both sides of (43) and plugging in (44) and (45), we obtain (46); rearranging both sides of (46) gives (47). By applying Lemma 4, i.e., $\mathbb{E}\big[\|\nabla\mu^{PG}_k\|^2\big] \le 2\,\mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta}_k)\|^2\big] + 2\sigma^2$, we can rewrite (47) as (48). By moving $\frac{2cL\eta_k^2}{|U_k|}\,\mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta}_k)\|^2\big]$ to the left-hand side of equation (48), we obtain (49). Consider $\eta_k$ small enough that $1 - L\eta_k\big(\frac{2c}{|U_k|} + 1\big) \ge \frac{1}{2}$, or equivalently $\eta_k \le \frac{|U_k|}{4cL + 2L|U_k|}$. If the learning rate is nonincreasing, this condition can be simplified to a requirement on the initial learning rate, $\eta_1 \le \frac{1}{4cL + 2L}$. Dividing both sides of (49) by $\eta_k$ then gives (50). Finally, by applying (50) and $\mu(\boldsymbol{\theta}_k) \le \mu^\star$ and summing over $k = 1, 2, \ldots, K$, we obtain the stated bound.

I Proof of Theorem 4

Lemma 5. For a non-negative integer $n$, we have the inequality $\sum_{i=1}^{n-1} i^{p} < \frac{n^{p+1}}{p+1}$.

Proof. For any number $r \ge 1$ and $x \ge -1$, Bernoulli's inequality gives $1 + rx \le (1+x)^r$. Let $x = \frac{1}{i-1}$ and $r = p+1$; this yields a per-term bound, and summing it from $1$ to $n$ and dividing by $p+1$ yields the claim.

Theorem 4. Suppose Assumptions A.1, A.2, and A.3 hold. If the size of the reuse set satisfies $|U_k| \sim O(k^{1/p})$ and the learning rate is $\eta_k = \frac{\eta_1}{k^{1/q}}$ with $p \ge 1$ and $q > 1$, then the proposed GS-PG algorithm has the bounded convergence rate $O\big(\max\big\{\frac{1}{k^{1-1/q}}, \frac{1}{k^{1/p+1/q}}\big\}\big)$.

Proof. Without loss of generality, we can write $|U_k| = c_u k^{1/p}$ with a constant $c_u > 0$. By applying Lemma 5, we get
$$\sum_{i=1}^{k} \frac{\eta_i}{|U_i|} = \frac{\eta_1}{c_u} \sum_{i=1}^{k} i^{-1/p - 1/q} < \frac{\eta_1}{c_u(1 - 1/p - 1/q)} (k+1)^{1 - 1/p - 1/q} \sim O\big(k^{1 - 1/p - 1/q}\big).$$
Then, by applying Theorem 3, we have
$$\frac{1}{K} \sum_{k=1}^{K} \mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta}_k)\|^2\big] \le \frac{\frac{4}{\eta_K}|\mu^\star| - \frac{2}{\eta_1}\mu(\boldsymbol{\theta}_1) + 4cL\sigma^2 \sum_{k=1}^{K} \frac{\eta_k}{|U_k|}}{K} < \frac{\frac{4}{\eta_1} K^{1/q} |\mu^\star| - \frac{2}{\eta_1}\mu(\boldsymbol{\theta}_1) + \frac{4cL\sigma^2 \eta_1}{c_u(1 - 1/p - 1/q)}(K+1)^{1 - 1/p - 1/q}}{K} \sim O\left(\max\left\{\frac{1}{K^{1 - 1/q}}, \frac{1}{K^{1/p + 1/q}}\right\}\right).$$

J Proof of Theorem 5

Theorem 5. Assume A.1, A.2, A.3, A.4, and A.6 hold. Let $\eta_k$ denote the learning rate in the $k$-th iteration and $\boldsymbol{\theta}^\star$ denote a local maximizer. We have $\boldsymbol{\theta}_k \xrightarrow{a.s.} \boldsymbol{\theta}^\star$.

Proof. Step 1: Convergence in mean. The local strong convexity of $-\mu(\boldsymbol{\theta})$ implies $\|\nabla\mu(\boldsymbol{y}) - \nabla\mu(\boldsymbol{x})\| \ge \lambda \|\boldsymbol{x} - \boldsymbol{y}\|$ with a constant $\lambda > 0$ for any $\boldsymbol{x}, \boldsymbol{y} \in B_r(\boldsymbol{\theta}^\star)$.
By Assumption A.4, when $k$ is large enough, $\boldsymbol{\theta}_k$ lies in a neighborhood of the local maximizer $\boldsymbol{\theta}^\star$, i.e., $\boldsymbol{\theta}_k \in B_r(\boldsymbol{\theta}^\star)$, so that $\|\nabla\mu(\boldsymbol{\theta}_k) - \nabla\mu(\boldsymbol{\theta}^\star)\| \ge \lambda \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|$. We then have
$$\lambda \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\| \le \|\nabla\mu(\boldsymbol{\theta}_k) - \nabla\mu(\boldsymbol{\theta}^\star)\| \overset{(\dagger)}{\le} \|\nabla\mu(\boldsymbol{\theta}_k)\| + \|\nabla\mu(\boldsymbol{\theta}^\star)\| \overset{(\ddagger)}{\le} \|\nabla\mu(\boldsymbol{\theta}_k)\|,$$
where $(\dagger)$ holds by applying Minkowski's inequality and $(\ddagger)$ follows because $\boldsymbol{\theta}^\star$ is a local maximizer and $\nabla\mu(\boldsymbol{\theta}^\star) = 0$. Then, by taking the expectation of both sides, we have $\lambda\,\mathbb{E}\big[\|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|^2\big] \le \mathbb{E}\big[\|\nabla\mu(\boldsymbol{\theta}_k)\|^2\big]$. Summing up to infinity gives (52). Therefore, by applying (52), we have $\mathbb{E}[\|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|] \to 0$ as $k \to \infty$.

Step 2: Convergence almost surely. Following (45), let $V_k = \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|^2$, $U_k = 2\eta_k \big(\mu(\boldsymbol{\theta}^\star) - \mu(\boldsymbol{\theta}_k)\big)$, and $Z_k = \eta_k^2\big[\big(1 + \frac{2c}{|U_k|}\big)\sigma^2 + \frac{c}{|U_k|}M^2\big]$. Notice that $\sum_k Z_k < \infty$ if $\sum_{k=1}^{\infty} \eta_k^2 < \infty$ (the Robbins-Monro condition). Using Lemma 2, we have that $\{\|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|^2\}_{k \ge 1}$ converges almost surely to some random variable $V_\infty$. Since the convergent sequence $\{\|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|^2\}_{k \ge 1}$ is bounded almost surely, the bounded convergence theorem gives $\mathbb{E}[V_\infty] = \mathbb{E}[\lim_{k \to \infty} \|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|] = \lim_{k \to \infty} \mathbb{E}[\|\boldsymbol{\theta}_k - \boldsymbol{\theta}^\star\|] = 0$, which implies $V_\infty = 0$ almost surely. It further implies $\boldsymbol{\theta}_k \xrightarrow{a.s.} \boldsymbol{\theta}^\star$.

Proposition 4. Assume A.1-A.6 hold. Let $\boldsymbol{\theta}^\star$ denote a local maximizer. Given any $\boldsymbol{\omega} \sim p(\boldsymbol{\omega}|\mathcal{D})$, we have $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}) \xrightarrow{a.s.} D(\boldsymbol{\tau}; \boldsymbol{\theta}^\star, \boldsymbol{\omega})$ and $\frac{1}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})} \xrightarrow{a.s.} \frac{1}{D(\boldsymbol{\tau}; \boldsymbol{\theta}^\star, \boldsymbol{\omega})}$ as $k \to \infty$.

Proof. By applying Assumption A.5, Theorem 5, and the continuous mapping theorem, the results immediately follow, i.e., $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}) \xrightarrow{a.s.} D(\boldsymbol{\tau}; \boldsymbol{\theta}^\star, \boldsymbol{\omega})$ and $\frac{1}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})} \xrightarrow{a.s.} \frac{1}{D(\boldsymbol{\tau}; \boldsymbol{\theta}^\star, \boldsymbol{\omega})}$.

L Proof of Theorem 6

Theorem 6. Suppose Assumptions A.1-A.6 hold. The size of the reuse set $|U_k|$ increases to infinity with probability 1 as the iteration $k \to \infty$, i.e., $\mathbb{P}(\lim_{k \to \infty} |U_k| = \infty) = 1$.

Proof. Online RL Setting: We first prove $|U_k| \xrightarrow{a.s.} \infty$ as $k \to \infty$ for the online setting. Let $f(\boldsymbol{x}) = \|\boldsymbol{x}\|^2$ and define the Jensen's gap as $e_k = \mathbb{E}_{\boldsymbol{\tau} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)}\big[\|g_k(\boldsymbol{\tau})\|^2 \,\big|\, \mathcal{F}_k\big] - \big\|\mathbb{E}_{\boldsymbol{\tau} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)}[g_k(\boldsymbol{\tau}) \,|\, \mathcal{F}_k]\big\|^2$. Define the scenario-based policy gradient at the true optimum $\boldsymbol{\theta}^\star$ by $g(\boldsymbol{\tau}|\boldsymbol{\theta}^\star) \equiv \sum_{t=1}^{H} \Psi_t \nabla_{\boldsymbol{\theta}} \log\big(\pi_{\boldsymbol{\theta}^\star}(\boldsymbol{a}_t|\boldsymbol{s}_t)\big)$. By applying $\boldsymbol{\theta}_k \xrightarrow{a.s.} \boldsymbol{\theta}^\star$ (Theorem 5), the continuity assumption A.5, and the continuous mapping theorem, at any fixed argument $\boldsymbol{\tau}$ we have $g_k(\boldsymbol{\tau}) \xrightarrow{a.s.} g(\boldsymbol{\tau}|\boldsymbol{\theta}^\star)$ as $k \to \infty$. Under some regularity conditions, i.e., the likelihood function is dominated almost everywhere by some integrable function $\psi(\cdot)$ in the sense that $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c) \le \psi(\boldsymbol{\tau})$, and applying Assumptions A.5 and A.6, we have
$$e_\infty = \lim_{k \to \infty} e_k \overset{(\star)}{=} \lim_{k \to \infty} \int \|g_k(\boldsymbol{\tau})\|^2 D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)\, d\boldsymbol{\tau} - \Big\| \lim_{k \to \infty} \int g_k(\boldsymbol{\tau}) D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)\, d\boldsymbol{\tau} \Big\|^2 = \mathbb{E}\big[\|g(\boldsymbol{\tau}|\boldsymbol{\theta}^\star)\|^2\big] - \big\|\mathbb{E}[g(\boldsymbol{\tau}|\boldsymbol{\theta}^\star)]\big\|^2 > 0,$$
where $(\star)$ follows by applying the dominated convergence theorem. Jensen's inequality becomes an equality only when $f(\cdot)$ is affine and/or $g(\boldsymbol{\tau}|\boldsymbol{\theta}^\star)$ is constant; neither of these conditions holds, which implies that the Jensen's gap is strictly greater than zero, i.e., $e_\infty > 0$. This further implies $\liminf_{k \to \infty} e_k > 0$; otherwise there would be a subsequence $\{e_{k_i}\}_{i=1}^{\infty}$ converging to 0, which contradicts the fact that $e_\infty = \lim_{k \to \infty} e_k > 0$. Let $e = \liminf_{k \to \infty} e_k$. Define the difference $\Delta_{i,k} \equiv \mathrm{Tr}\big(\mathrm{Var}\big[\nabla\mu^{ILR}_{i,k} \,\big|\, \mathcal{F}_k\big]\big) - c\,\mathrm{Tr}\big(\mathrm{Var}\big[\nabla\mu^{PG}_k \,\big|\, \boldsymbol{\theta}_k\big]\big)$. In the following, we write $\mathbb{E}_k[\cdot] = \mathbb{E}[\cdot|\boldsymbol{\theta}_k]$ to represent the conditional expectation under the $k$-th SDP model.
We start with the difference for the reuse-set selection,
$$\Delta_{i,k} = \mathrm{Tr}\,\mathrm{Var}\Bigg[\frac{1}{n}\sum_{j=1}^{n} \frac{D(\boldsymbol{\tau}^{(i,j)}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)}{D(\boldsymbol{\tau}^{(i,j)}; \boldsymbol{\theta}_i, \boldsymbol{\omega}^c)}\, g_k\big(\boldsymbol{\tau}^{(i,j)}\big) \,\Bigg|\, \mathcal{F}_k\Bigg] - c\,\mathrm{Tr}\,\mathrm{Var}\Bigg[\frac{1}{n}\sum_{j=1}^{n} g_k\big(\boldsymbol{\tau}^{(k,j)}\big) \,\Bigg|\, \boldsymbol{\theta}_k\Bigg] = \frac{1}{n}\mathrm{Tr}\int \frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)^2}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}^c)^2}\, g_k(\boldsymbol{\tau}) g_k(\boldsymbol{\tau})^\top D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}^c)\, d\boldsymbol{\tau} - \frac{c}{n}\mathrm{Tr}\int g_k(\boldsymbol{\tau}) g_k(\boldsymbol{\tau})^\top D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)\, d\boldsymbol{\tau} - \frac{1}{n}\mathrm{Tr}\,\mathbb{E}_i\bigg[\frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c)}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}^c)} \cdots \bigg] \cdots,$$
where (58) holds because $\nabla\mu(\boldsymbol{\theta}_k) = \mathbb{E}[g_k(\boldsymbol{\tau})]$. Then, by letting $\epsilon = \frac{e(c-1)}{M^2} > 0$, (59) becomes bounded as in (60) and (61), where (60) holds because of Assumption A.6 and (61) follows since $e_k$ is greater than $e$ infinitely often (i.o.), as $e = \liminf e_k$. By Proposition 4, we have $D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}^c) \xrightarrow{a.s.} D(\boldsymbol{\tau}; \boldsymbol{\theta}^\star, \boldsymbol{\omega}^c)$ for the proposal SDP models indexed by $i \in \{\ldots, k\}$. It implies that, given $\epsilon > 0$, at least half of the past iterations will be reused almost surely when $k$ is large enough.

Offline Case: Similar to the online case, we can define the Jensen's gap for each posterior sample $\boldsymbol{\omega}_b$ by $e^b_k = \mathbb{E}_{\boldsymbol{\tau} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b)}\big[\|g_k(\boldsymbol{\tau})\|^2 \,\big|\, \mathcal{F}_k\big] - \big\|\mathbb{E}_{\boldsymbol{\tau} \sim D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b)}[g_k(\boldsymbol{\tau}) \,|\, \mathcal{F}_k]\big\|^2$ and its infimum limit $e^b = \liminf_{k \to \infty} e^b_k$. In the following, we write $\mathbb{E}_k[\cdot] = \mathbb{E}[\cdot|\boldsymbol{\theta}_k, \boldsymbol{\omega}_k]$ to represent the conditional expectation under the $k$-th SDP model. Again, we start with the difference for the reuse-set selection and, by letting $\epsilon = \frac{e^b(c-1)}{M^2} > 0$, we obtain a chain of bounds in which each term involves $\frac{1}{n}\,\mathbf{1}\{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \ge m_i\}\,\frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega}_b)}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)}$, where (63) holds because of Assumption A.6 and (66) follows because $e^b_k$ is greater than $e^b$ infinitely often (i.o.), as $e^b = \liminf e^b_k$. (65) holds because, for each $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k$, we can find a constant $m_i > 0$, depending on $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$, small enough such that the integral over the region $\{\boldsymbol{\tau}: D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i) < m_i\}$ is less than or equal to the constant $\frac{c-1}{2} e^b_k$ (which only depends on the iteration $k$ and the posterior sample $\boldsymbol{\omega}_b$). Moreover, for any $k > K_{\boldsymbol{\omega}}$ we have $\frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})} \le \sqrt{1 + \epsilon}$ (a.e.) for all $i \ge k/2$. In addition, for any parameters $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k$, there is a nonzero measurable neighborhood $B(\boldsymbol{\omega}_i)$ around $\boldsymbol{\omega}_i$, with probability $\mathbb{P}(\boldsymbol{\omega} \in B(\boldsymbol{\omega}_i)|\mathcal{D}) = \delta_i > 0$, such that for $\boldsymbol{\omega} \in B(\boldsymbol{\omega}_i)$ and $\boldsymbol{\tau} \in \{\boldsymbol{\tau}: D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \ge m_i\}$ we have $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})^2 - D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)^2 \le m_i^2 \epsilon$. By dividing both sides by $D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)^2$, it follows that
$$\frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})^2}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)^2} - 1 \le \frac{m_i^2 \epsilon}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)^2} \le \epsilon \ \Longrightarrow\ \frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)} \le \sqrt{1 + \epsilon} \quad \text{(a.e.)}.$$
As such, for any $k > K_{\boldsymbol{\omega}}$, $\boldsymbol{\omega} \in B(\boldsymbol{\omega}_i)$, and $\boldsymbol{\tau} \in \{\boldsymbol{\tau}: D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \ge m_i\}$, we have
$$\frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})} \cdot \frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega})}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)} \le 1 + \epsilon \ \Longleftrightarrow\ \frac{D(\boldsymbol{\tau}; \boldsymbol{\theta}_k, \boldsymbol{\omega})}{D(\boldsymbol{\tau}; \boldsymbol{\theta}_i, \boldsymbol{\omega}_i)} \le 1 + \epsilon \quad \text{(a.e.)},$$
with probability $\mathbb{P}(\boldsymbol{\omega} \in B(\boldsymbol{\omega}_i)|\mathcal{D}) = \delta_i > 0$. Let $E_{i,k} = \{\Delta_{i,k} \le 0\}$ denote the event of selecting the historical stochastic decision process $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i) \in \mathcal{F}_k$ in iteration $k$. Then, when $k$ is large enough, we have $\Delta_{i,k} \le 0$ with probability at least $\mathbb{P}\big(\cap_{b=1}^{B}\{\boldsymbol{\omega}_b \in B(\boldsymbol{\omega}_i)\}\big|\mathcal{D}\big) = (\delta_i)^B > 0$. Since $\sum_{i}^{\infty} (\delta_i)^B = \infty$, by the second Borel-Cantelli lemma (Durrett (2019), Theorem 4.3.4) we have $\mathbb{P}(\limsup_i E_{i,k}) = 1$; equivalently, there are infinitely many stochastic decision processes $(\boldsymbol{\theta}_i, \boldsymbol{\omega}_i)$ with $\Delta_{i,k} \le 0$ as $k \to \infty$. It follows that $\lim_{k \to \infty} |U_k| = \infty$ almost surely.
Acknowledgments

We gratefully acknowledge Prof. Dongming Xie from the University of Massachusetts Lowell for his support in providing the biomanufacturing fermentation process experiment data and the PDE-based mechanistic model.

References

Human Factors Analysis and Classification System Interrater Reliability for Biopharmaceutical Manufacturing Investigations.
Challenges and opportunities in biopharmaceutical manufacturing control.
Unbiased metamodeling via likelihood ratios.
Reinforcement Learning: An Introduction.
Policy gradient methods for reinforcement learning with function approximation.
Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor.
Trust region policy optimization.
Doubly robust off-policy value evaluation for reinforcement learning.
Eligibility traces for off-policy policy evaluation.
Ashique Rupam Mahmood and Richard S. Sutton. Off-policy learning based on weighted importance sampling with linear computational complexity.
Variance reduction for policy gradient with action-dependent factorized baselines.
Variance-reduced off-policy memory-efficient policy search.
Scalable distributed deep-RL with importance weighted actor-learner architectures.
Safe and efficient off-policy reinforcement learning.
Green simulation: Reusing the output of repeated experiments.
Self-improving reactive agents based on reinforcement learning, planning and teaching.
Human-level control through deep reinforcement learning.
Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay.
Green simulation assisted reinforcement learning with model risk for biomanufacturing learning and control.
Interpretable biomanufacturing process risk and sensitivity analyses for quality-by-design and stability control.
The Elements of Statistical Learning.
Balancing learning speed and stability in policy gradient via adaptive exploration.
Guide actor-critic for continuous control.
Clipped action policy gradient.
Benchmarking deep reinforcement learning for continuous control.
Probability and Statistics. Pearson, 2012.
Rick Durrett. Probability: Theory and Examples.
Adam: A method for stochastic optimization.

Algorithm 3 reviews the Adam optimizer (Kingma and Ba 2015b) for policy gradient learning rate selection. Input: initial learning rate $\eta_1$ and initial parameter vector $\boldsymbol{\theta}_1$. Set $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. Let $\odot$ denote the elementwise product of matrices; all operations on vectors are element-wise. With $\beta_1^k$ and $\beta_2^k$ we denote $\beta_1$ and $\beta_2$ raised to the power $k$. Initialize: $m_1 \leftarrow 0$ (1st moment vector) and $v_1 \leftarrow 0$ (2nd moment vector). for $k = 1, 2, \ldots, K$ do
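As a concrete reference for the update that Algorithm 3 describes, the following Python sketch applies the textbook Adam rule (Kingma and Ba 2015b), with the hyperparameters listed above, to policy gradient ascent; the default step size and the `grad_estimator` callable are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def adam_policy_gradient_ascent(theta, grad_estimator, eta1=0.01, K=100,
                                beta1=0.9, beta2=0.999, eps=1e-8):
    """Standard Adam update applied to policy gradient ascent (a sketch)."""
    m = np.zeros_like(theta)   # first moment estimate
    v = np.zeros_like(theta)   # second moment estimate
    for k in range(1, K + 1):
        g = grad_estimator(theta)              # e.g., the MLR policy gradient estimate
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * (g * g)  # elementwise square of the gradient
        m_hat = m / (1 - beta1 ** k)           # bias-corrected first moment
        v_hat = v / (1 - beta2 ** k)           # bias-corrected second moment
        theta = theta + eta1 * m_hat / (np.sqrt(v_hat) + eps)  # ascent step
    return theta
```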