key: cord-027386-23exaaik authors: Rao, Vishwas; Maulik, Romit; Constantinescu, Emil; Anitescu, Mihai title: A Machine-Learning-Based Importance Sampling Method to Compute Rare Event Probabilities date: 2020-05-25 journal: Computational Science - ICCS 2020 DOI: 10.1007/978-3-030-50433-5_14 sha: doc_id: 27386 cord_uid: 23exaaik We develop a novel computational method for evaluating the extreme excursion probabilities arising from random initialization of nonlinear dynamical systems. The method uses excursion probability theory to formulate a sequence of Bayesian inverse problems that, when solved, yields the biasing distribution. Solving multiple Bayesian inverse problems can be expensive; more so in higher dimensions. To alleviate the computational cost, we build machine-learning-based surrogates to solve the Bayesian inverse problems that give rise to the biasing distribution. This biasing distribution can then be used in an importance sampling procedure to estimate the extreme excursion probabilities. Characterizing high-impact rare and extreme events such as hurricanes, tornadoes, and cascading power failures are of great social and economic importance. Many of these natural phenomena and engineering systems can be modeled by using dynamical systems. The models representing these complex phenomena are approximate and have many sources of uncertainties. For example, the exact initial and boundary conditions or the external forcings that are necessary to fully define the underlying model might be unknown. Other parameters that are set based on experimental data may also be uncertain or only partially known. A probabilistic framework is generally used to formulate the problem of quantifying various uncertainties in these complex systems. By definition, the outcomes of interest that correspond to high-impact rare and extreme events reside in the tails of the probability distribution of the associated event space. Fully characterizing the tails requires resolving high-dimensional integrals over irregular domains. The most commonly used method to determine the probability of rare and extreme events is Monte Carlo simulation (MCS). Computing rareevent probabilities via MCS involves generating several samples of the random variable and calculating the fraction of the samples that produce the outcome of interest. For small probabilities, however, this process is expensive. For example, consider an event whose probability is around 10 −3 and the underlying numerical model for the calculation requires ten minutes per simulation. With MCS, estimating the probability of such an event to an accuracy of 10% will require two years of serial computation. Hence, alternative methods are needed that are computationally efficient. Important examples of extreme events are rogue waves in the ocean [14] , hurricanes, tornadoes [29] , and power outages [3] . The motivation for this work comes from the rising concern surrounding transient security in the presence of uncertain initial conditions identified by North American Electric Reliability Corporation in connection with its long-term reliability assessment [32] . The problem can be mathematically formulated as a dynamical system with uncertain initial conditions. In this paper, the aim is to compute the extreme excursion probability: the probability that the transient due to a sudden malfunction exceeds preset safety limits. Typically, the target safe limit exceedance probabilities are in the range 10 −4 -10 −5 . We note that the same formulation is applicable in other applications such as data assimilation, which is used extensively for medium-to long-term weather forecasting. For example, one can potentially use the formulation in this paper to determine the likelihood of temperature levels at a location exceeding certain thresholds or the likelihood of precipitation levels exceeding safe levels in a certain area. In [25] , we presented an algorithm that uses ideas from excursion probability theory to evaluate the probability of extreme events [1] . In particular we used Rice's formula [27] , which was developed to estimate the average number of upcrossings for a generic stochastic process. Rice's formula is given by where the left-hand side denotes the expected number of upcrossings of level u, y is the derivative of the stochastic process in a mean squared sense, and ϕ t (u, y) represents the joint probability distribution of the process and its derivative. In this paper, we build on our recent algorithm [25] , which we used to construct an importance biasing distribution (IBD) to accelerate the computation of extreme event probabilities. A key step in the algorithm presented in [25] involves solving multiple Bayesian inverse problems, which can be expensive in high dimensions. Here, we propose to use machine-learning-based surrogates to obtain the inverse maps and hence alleviate the computational costs. The mathematical setup used in this paper consists of a nonlinear dynamical system that is excited by a Gaussian initial state and that results in a non-Gaussian stochastic process. We are interested in estimating the probability of the stochastic process exceeding a preset threshold. Moreover, we wish to estimate the probabilities when the underlying event of the process exceeding the threshold is a rare event. The rare events typically lie in the tails of the underlying event distribution. To characterize the tail of the resulting stochastic process, we use ideas from theory excursion probabilities [1] . Specifically, we use Rice's formula (1) to estimate the expected number of upcrossings of a stochastic process. For a description of the mathematically rigorous settings used for the rare event problem, we refer interested readers to [25, §2] and references therein. Evaluating ϕ t (u, y), the joint probability distribution of the stochastic process and its derivative, is central to evaluating the integral in Rice's formula. However, ϕ t (u, y) is analytically computable only for Gaussian processes. Since our setup results in a non-Gaussian stochastic process, we linearize the nonlinear dynamical system variation around the trajectories starting at the mean of the initial state. We thus obtain a Gaussian approximation to the system trajectory distribution. In [25] , we solve a sequence of Bayesian inverse problems to determine a biasing distribution to accelerate the convergence of the probability estimates. For high-dimensional problems, however, solving multiple Bayesian inverse problems can be expensive. In this work, we propose to replace multiple solutions to Bayesian inverse problems with machine-learning-based surrogates to alleviate the computational burden. The rest of the paper is organized as follows. In Sect. 2 we review the existing literature for estimating rare event probabilities. In Sect. 3 we reformulate the problem of determining the IBD as a Bayesian inference problem, and in Sect. 4 we develop a machine-learning-based surrogate to approximate the solution to the Bayesian inference problem. In Sect. 5 we demonstrate this methodology on a simple nonlinear dynamical system excited by a Gaussian distribution. In Sect. 6 we present our conclusions and potential future research directions. Most of the existing methods to compute the probabilities of rare events use MCS directly or indirectly. The MCS approach was developed by Metropolis and his collaborators to solve problems in mathematical physics [22] . Since then, it has been used in a variety of applications [21, 28] . When evaluating rare event probabilities, the MCS method basically counts the fraction of the random samples that cause the rare event. For a small probablilty P of the underlying event, the number of samples required to obtain an accuracy of 1 is O( −2 P −1 ). Hence MCS becomes impractical for estimating rare event probabilities. A popular sampling technique that is employed to compute rare event probablities is importance sampling (IS). IS is a variance reduction technique developed in the 1950s [17] to estimate the quantity of interest by constructing estimators that have smaller variance than MCS. In MCS, simulations from most of the samples do not result in the rare event and hence do not play a part in probability calculations. IS, instead, uses problem-specific information to construct an IBD; computing the rare event probability using the IBD requires fewer samples. Based on this idea, several techniques for constructing IBDs have been developed [8] . For a more detailed treatment of IS, we direct interested readers to [2, 13] . One of the major challenges involved with importance sampling is the construction of an IBD that results in a low-variance estimator. We note that the approach may sometimes be inefficient for high-dimensional problems [19] . A more detailed description of MCS and IS in the context of rare events can be found in [25, §2] and references therein. Other methods use the notion of conditional probability over a sequence of nested subsets of the probability space of interest. For example, one can start with the entire probability space and progressively shrink to the region that corresponds to the rare event. Furthermore, one can use the notion of conditional probability to factorize the event of interest as a product of conditional events. Subset simulation (SS) [4] and splitting methods [16] are ideas that use this idea. Several modifications and improvements have been proposed to both SS [6, 9, 10, 18, 33] and splitting methods [5, 7] . Evaluating the conditional probabilities forms a major portion of the computational load. Compute the conditional probabilities for different nested subsets concurrently is nontrivial. Large deviation theory (LDT) is an efficient approach for estimating rare events in cases when the event space of interest is dominated by few elements such as rogue waves of a certain height. LDT also has been used to estimate the probabilities of extreme events in dynamical systems with random components [11, 12] . A sequential sampling strategy has been used to compute extreme event statistics [23, 24] . Most of the work in this section is a review of our approach described in [25, §3] . Here we reformulate the problem of constructing an IBD as a sequence of Bayesian inverse problems. Consider the following dynamical system, where c is a canonical basis vector and x 0 , the initial state of the system, is uncertain and has a probability distribution p. The problem of interest is to estimate the probability that c x(t) exceeds the level u for t ∈ [0, T ]. That is, we seek to estimate the following excursion probability, where x(t, x 0 ) represents the solution of the dynamical system (2) for a given initial condition x 0 . We note that where μ is the respective measure transformation subject to (2) and Ω(u) ⊂ Ω represents the excursion set Hence, estimating Ω(u) will help us in estimating the excursion probability P T (u). In general, however, estimating the excursion set Ω(u) analytically is difficult. Rice's formula, (1) gives us insights about the excursion set and can be used to construct an approximation to the excursion set. Recall that in Rice's formula (1), ϕ t (u, y) represents the joint probability density of c x and its derivative c x for an excursion level u. The right-hand side of (1) can be interpreted as the summation of all times and slopes at which an excursion occurs. One can sample from yϕ t (u, y) to obtain a slope-time pair (y i , t i ) at which the sample paths of the stochastic process cause an excursion. Now consider the map G : R d×1 → R 2 that evaluates the vector c x(t) c x (t) based on the dynamical system (2), given an initial state x 0 and a time t. By definition of the excursion set Ω(u), there exists an element x i ∈ Ω(u) that satisfies the following relationship, where ε > 0. We can use this insight to construct an approximation of Ω(u) by constructing the preimages of multiple slope-time pairs. Observe that the problem of finding the preimage of a sample (y i , t i ) is ill-posed since there could be multiple x i 's that map to u + ε i y i at t i via operator G. We define the set and an approximation Ω(u) to Ω(u) can be written as Note that the approximation (8) improves as we increase N . For a discussion on the choice of ε i , we refer interested readers to [25, §3.3] . The underlying computational framework to approximate Ω(u) consists of the following stages: -Draw samples from unnormalized yϕ t (u, y) -Find the preimages of these samples to approximate Ω(u). We use MCMC to draw samples from unnormalized yϕ t (u, y) . We note that irrespective of the size of the dynamical system, yϕ t (u, y) represents an unnormalized density in two dimensions; hence, using MCMC is an effective means , draw samples from it. Drawing samples from yϕ t (u, y) requires evaluating it repeatedly, and in the following section we discuss the means to do so. We note that yϕ t (u, y) can be evaluated analytically only for special cases. Specifically, when ϕ t (u, y) is a Gaussian process, then the joint density function yϕ t (u, y) is analytically computable. Consider the dynamical system described by (2) . When p is Gaussian and f is linear, we have Assuming A is invertible, x(t) can be written as where I represents an identity matrix of the appropriate size. Given that x 0 is normally distributed, it follows that x(t) is a Gaussian process: where and Φ := exp(A(t − t 0 ))Σ (exp(A(t − t 0 ))) . We now can evaluate yϕ t (u, y) for arbitrary values of u i , y i , and t i as where Υ := c Φc c ΦA c c AΦ c c AΦA c and | Υ | denotes the determinant of Υ . Note that the right-hand side in (13) is dependent on t i via Υ . When f is nonlinear, yϕ t (u, y) cannot be computed analytically-a key ingredient for our computational procedure. We approximate the nonlinear dynamics by linearizing f around the mean of the initial distribution. Assuming that the initial state of the system is normally distributed as described by Eq. (9), linearizing around the mean of the initial state gives where F represents the Jacobian of f at t = 0 and x = x 0 ; this reduces the nonlinear dynamical system to a form that is similar to Eq. (9). Thus, we can now use Eqs. (11) , (12) , and (13) to approximate yϕ t (u, y) for nonlinear f . In [25] we formulated the problem of determining preimages (7) as a Bayesian inverse problem. However, solving multiple Bayesian inverse problems can be expensive. Hence we approximated our IBD by using the solutions of a small number of Bayesian inverse problems. In this section we build a simple datadriven surrogate for approximating the preimages X i described in Eq. (7) . Using the surrogate, we can approximate the preimages of several y i 's obtained by sampling from yϕ t (u, y). The surrogate developed here approximates the inverse of the map defined in Eq. (6) . To that end, we wish to approximate the map where the input space corresponds to (u + ε i , y) | ti and the output lives in the domain of the state space (Ω here). This is equivalent to augmenting t i as an additional input variable and building a surrogate that maps from R 3 → R d . We utilize a fully connected deep neural network to approximate this map. A one-layered neural network can be expressed as where F is a differentiable activation function that imparts nonlinearity to this transformation; L is the input dimension of an incoming signal; M is the number of hidden-layer neurons (in machine learning terminology); c m ∈ R M ×L are the weights of this map; m ∈ R M are the biases; and ξ j ∈ R J is the nonlinear output of this map, which may be matched to targets available from data or "fed-forward" into future maps. Note that ξ j is the postactivation value of each neuron in a hidden layer of J neurons. In practice, multiple compositions of this map may be used to obtain nonlinear function approximators, called deep neural networks, that are very expressive. For nonlinear activation, we utilize for all its activation functions. In addition, we concatenate three such maps as shown in Eq. 16 to ultimately obtain an approximation for G −1 . Two such submaps have J fixed at 256, and a final transformation utilizes J = 3. We note that the function F for the final transformation is the identity, as is common in machine learning algorithms. A schematic of this network architecture is shown in Fig. 1 . The trainable parameters (c m and m for each transformation) are optimized with the use of backpropagation [30] , an adjoint calculation technique that obtains gradients of the loss function with respect to these parameters. A stochastic gradient optimization technique, ADAM, is used to update these parameters [20] with a learning rate of 0.001. Our loss function is given by the L 2 -distance between the prediction of the network and the targets (i.e., the mean-squared error). Our network also incorporates a regularization strategy, called dropout [31] , that randomly switches off certain units ξ j (here we utilize a dropout probability of 0.1) in the forward propagation of the map (i.e., from d → 2). Through this approach, memorization of data is avoided, while allowing for effective exploration of a complex nonconvex loss surface. Our map is trained for 500 epochs with a batch size of 256; in other words, a weight update is performed after a loss is computed for 256 samples. Each epoch is completed when the losses from the entire data set are used for gradient update. During the network training, we set aside a random subset of the data for validation. Losses calculated from this data set are used only to monitor the learning of the framework for unseen data. These are plotted in Fig. 2 , where one can see that both training and validation losses are reduced to an equal magnitude. Figure 3 also shows scatter plots for this validation data set where a good agreement between the true and predicted quantities can be seen. We may now use this map for approximating the IBD. The following procedure is used to construct the IBD. 1. Obtain different realizations of the initial conditions of the dynamical system by sampling from the initial PDF p. 2. Use G to obtain the forward maps of these realizations. 3. Use the forward maps and the corresponding random realizations of the initial conditions to train the inverse map G −1 . 4. We now apply this trained inverse map on samples generated from yϕ t (u, y) to obtain the approximate preimages of samples y i . 5. Use a Gaussian approximation of these inverse maps is used as an IBD. Assume that this Gaussian approximation has PDF p IBD . 6. Sample from the IBD, and use importance sampling to estimate the probabilities. We can now estimate P T (u) using the IBD as follows: where x 1 0 , . . . , x M 0 are sampled from the biasing distribution p IBD and I( x i 0 ) represents the indicator function given by Also, ψ( x i 0 ) represents the importance weights. The importance weight for an arbitrary x i 0 is given by We demonstrate the application of the procedure described in Sect. 3 and Sect. 4 for nonlinear dynamical systems excited by a Gaussian distribution. We use the Lotka-Volterra equations as a test problem. These equations, also known as the predator-prey equations, are a pair of first-order nonlinear differential equations and are used to describe the dynamics of biological systems in which two species interact, one as a predator and the other as a prey. The populations change through time according to the following pair of equations, where x 1 is the number of prey, x 2 is the number of predators, and dx 1 dt and dx 2 dt represent the instantaneous growth rates of the two populations. We assume that the initial state of the system at time t = 0 is a random variable that is normally distributed: and we are interested in estimating the probability of the event P (c x ≥ u), 10] , and u = 17. The first step of our solution procedure involves sampling from yϕ t (u, y) to generate observations y i . We linearize the dynamical system about the mean of the distribution of x 0 (Eq. (14)) and express ϕ t (u, y) as a function of t and y as described by Eq. (12) . We compute yϕ t (u, y) as shown in Eq. (13) . We use the delayed rejection adaptive Metropolis (DRAM) Markov chain Monte Carlo (MCMC) method to generate samples from yϕ t (u, y). (For more details about DRAM, see [15] .) To minimize the effect of the initial guess on the posterior inference, we use a burn-in of 1,000 samples. Figure 4 shows the contours of yϕ t (u, y) and samples drawn from it by using DRAM MCMC. In [25] we then solved the Bayesian inverse problem by using both MCMC and Laplace approximation at MAP to construct a distribution that approximately maps to likelihood constructed around y i . Here, we replace the solution to the Bayesian inverse problem with a machine-learned inverse map described in Sect. 4. Multiple samples generated from yϕ t (u, y) can be used to construct the IBD, as described in Sect. 4.1, and the IBD can be used to estimate P T (u), as explained in Sect. 4.2. Figure 5 compares the results between the conventional MCS and Machine Learning based Importance Sampling (ML-based IS) methods. Note that we use an MCS estimate with 10 Million samples as a proxy for the true probabilities. The "true" probability is 3.28 × 10 −5 . ML-based IS gives fairly good estimate even with small number of model evaluations. When the training dataset size is large enough, the improvements are dramatic. Notice that for a true probability of the order of 10 −5 we obtain an estimate that has a relative error of less than 1%. Notice that our method gives the same (or better) accuracy as the MCS with hundred times lesser computational cost. The convergence with just 5000 training samples is acceptable and these results improve dramatically for 10000 and 20000 training samples. We believe the results could be even better when we use a Gaussian mixture to represent the IBD instead of a simple Gaussian approximation. 5 . Comparison between conventional MCS and ML-based IS. We observe even with a small amount of training data, we obtain fairly accurate estimates; and as we increase the training data, the accuracy improves dramatically. In Fig. 5 , we havent included the costs associated with generating training data, training costs, and cost for approximating the inverse map as these costs are almost negligible when compared to the overall costs. Note that generating 20000 samples is approximately equivalent to 400 model evaluations (this is because a single model evaluation can be used to generate the slope and state at 50 different times and each of them can be used as a training sample). The training of the ML framework, for this problem, required very little compute time. Each training was executed on an 8th-generation Intel Core-I7 machine with Python 3.6.8 and Tensorflow 1.14 and took less than 180 s for training 20000 samples (this is less than 50 model evaluations). Inference (for 20000 prediction points) costs were less than 2 s, on average. In this work we developed a ML-based IS to estimate rare event probabilities and we demostrated the algorithm on the prey-predator system. The method developed here builds on the approach in [25] and replaces the expensive Bayesian inference with a Machine learning based surrogate. This approach yields fairly accurate estimate of the probabilities and for a given accuracy requires atleast three orders of magnitude lesser computational effort than the traditional MCS. In future, we aim to test this algorithm for larger problems and also use an active learning based approach to pick the training samples. Scaling this algorithm to high dimensions (say O(1000)) could be challenging and to address it, we will use state-of-the-art techniques developed by machine learning and deep learning community in the future. The geometry of random fields. SIAM Stochastic Simulation: Algorithms and Analysis Power system blackouts-literature review Estimation of small failure probabilities in high dimensions by subset simulation Rare-event simulation Bayesian subset simulation Efficient monte carlo simulation via the generalized splitting method Introduction to Rare Event Simulation Hybrid subset simulation method for reliability estimation of dynamical systems subject to stochastic excitation Reliability estimation for dynamical systems subject to stochastic excitation using subset simulation with splitting Rogue waves and large deviations in deep sea Extreme event quantification in dynamical systems with random components Exploring Monte Carlo methods Oceanic rogue waves Dram: efficient adaptive MCMC Estimation of particle transmission by random sampling Methods of reducing sample size in monte carlo computations A two-stage subset simulation-based approach for calculating the reliability of inelastic structural systems subjected to gaussian random excitations Geometric insight into the challenges of solving high-dimensional reliability problems Adam: a method for stochastic optimization Monte Carlo Strategies in Scientific Computing The monte carlo method A sequential sampling strategy for extreme event statistics in nonlinear dynamical systems Sequential sampling strategy for extreme event statistics in nonlinear dynamical systems Efficient computation of extreme excursion probabilities for dynamical systems Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning Mathematical analysis of random noise Monte Carlo Statistical Methods. Springer Texts in Statistics US Department of Commerece, National Ocanic and Atmospheric Administration, National Environmental Satellite Data and Information Service Learning representations by backpropagating errors Dropout: a simple way to prevent neural networks from overfitting The North American Electricity Reliability Corporation Bayesian post-processor and other enhancements of subset simulation for estimating failure probabilities in high dimensions