title: Rare Event Simulation for Non-Markovian Repairable Fault Trees
authors: Budde, Carlos E.; Biagi, Marco; Monti, Raúl E.; D’Argenio, Pedro R.; Stoelinga, Mariëlle
date: 2020-03-13
journal: Tools and Algorithms for the Construction and Analysis of Systems
DOI: 10.1007/978-3-030-45190-5_26

Abstract. Dynamic fault trees (DFT) are widely adopted in industry to assess the dependability of safety-critical equipment. Since many systems are too large to be studied numerically, DFT dependability is often analysed using Monte Carlo simulation. A bottleneck here is that many simulation samples are required in the case of rare events, e.g. in highly reliable systems where components seldom fail. Rare event simulation (RES) provides techniques to reduce the number of samples in the case of rare events. We present a RES technique based on importance splitting to study failures in highly reliable DFTs. Whereas RES usually requires meta-information from an expert, our method is fully automatic: by cleverly exploiting the fault tree structure we extract the so-called importance function. We handle DFTs with Markovian and non-Markovian failure and repair distributions, for which no numerical methods exist, and show the efficiency of our approach on several case studies.

1 Introduction

Reliability engineering is an important field that provides methods and tools to assess and mitigate the risks related to complex systems. Fault tree analysis (fta) is a prominent technique here. Its application encompasses a large number of industrial domains, ranging from automotive and aerospace system engineering to energy and telecommunication systems and protocols. Fault trees are analysed to quantify certain performance indicators. Two common metrics are system reliability, the probability that there are no system failures during a given mission time, and system availability, the average percentage of time that a system is operational.

In this paper we consider repairable dynamic fault trees. Dynamic fault trees (dfts [17,43]) are a common and widely applied variant of fts, catering for common dependability patterns such as spare management and causal dependencies. Repairs [6] are not only crucial in fault-tolerant and resilient systems, they are also an important cost driver. Hence, repairable fault trees allow one to compare different repair strategies with respect to various dependability metrics.

Fault tree analysis. The reliability/availability of a fault tree can be computed via numerical methods, such as probabilistic model checking. This involves exhaustive explorations of state-based models such as interactive Markov chains [40]. Since the number of states (i.e. system configurations) is exponential in the number of tree elements, analysing large trees remains a challenge today [26,1]. Moreover, numerical methods are usually restricted to exponential failure rates and combinations thereof, like Erlang and acyclic phase-type distributions [40]. Alternatively, fault trees can be analysed using (standard) Monte Carlo simulation (smc [22,40,38], aka statistical model checking). Here, a large number of simulated system runs (samples) is produced. Reliability and availability are then statistically estimated from the resulting sample set. Such sampling does not involve storing the full state space so, although the result provided can only be correct with a certain probability, smc is much more memory efficient than numerical techniques.
Furthermore, smc is not restricted to exponential probability distributions. However, a known bottleneck of smc is rare events: when the event of interest has a low probability (which is typically the case in highly reliable systems), millions of samples may be required to observe it. Producing these samples can take an unacceptably long simulation time. To alleviate this problem, the field of rare event simulation (res) provides techniques that reduce the number of samples [35]. The two leading techniques are importance sampling and importance splitting. Importance sampling tweaks the probabilities in a model, then computes the metric of interest for the changed system, and finally adjusts the analysis results to the original model [23,33]. Unfortunately, it has specific requirements on the stochastic model: in particular, it is generally limited to Markov models. Importance splitting, deployed in this paper, does not have this limitation.

Importance splitting relies on rare events that arise as a sequence of less rare intermediate events [28,2]. We exploit this fact by generating more (partial) samples on paths where such intermediate events are observed. As a simple example, consider a biased coin whose probability of heads is p = 1/80. Suppose we flip it eight times in a row, and say we are interested in observing at least three heads. If heads comes up at the first flip (H), then we are on a promising path. We can then clone (split) the current path H, generating e.g. 7 copies of it, each clone evolving independently from the second flip onwards. Say one clone observes three heads: the copied H plus two more. Then this observation of the rare event (three heads) is counted as 1/7 rather than as 1 observation, to account for the splitting where the clone was spawned. Now, if a clone observes a new head (HH), this is even more promising than H, so the splitting can be repeated. If we make 5 copies of the HH clone, then observing three heads in any of these copies counts as 1/35 = 1/7 · 1/5. Alternatively, observing tails as second flip (HT) is less promising than heads. One could then decide not to split such a path.

This example highlights a key ingredient of importance splitting: the importance function, which indicates for each state how promising it is w.r.t. the event of interest. This function, together with other parameters such as thresholds [19], is used to choose e.g. the number of clones spawned when visiting a state. An importance function for our example could be the number of heads seen thus far. Another one could be that number multiplied by the number of coin flips yet to come. The goal is to give higher importance to states from which observing the rare event is more likely. The efficiency of an importance splitting implementation increases as the importance function better reflects this property.
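To make the accounting in the coin example concrete, here is a minimal Python sketch (our own illustration, not part of the paper's tool chain) that splits a path into 7 clones on the first head and 5 more on the second, weighting each observation of three heads by the inverse of the splitting factors:

```python
import random

P_HEADS = 1 / 80         # biased coin of the example
N_FLIPS = 8              # flips per path
SPLIT = {1: 7, 2: 5}     # clones spawned on seeing the 1st and 2nd head

def explore(flip, heads, weight, acc):
    """Continue one (partial) path from flip number `flip`."""
    if heads >= 3:            # rare event: count it with the path weight,
        acc[0] += weight      # e.g. 1/35 = 1/7 * 1/5 after two splits
        return
    if flip == N_FLIPS:
        return                # path ended without the rare event
    if random.random() < P_HEADS:
        heads += 1
        if heads in SPLIT:    # promising state: split the path
            n = SPLIT[heads]
            for _ in range(n):
                explore(flip + 1, heads, weight / n, acc)
            return
    explore(flip + 1, heads, weight, acc)

def estimate(n_paths=500_000, seed=42):
    random.seed(seed)
    acc = [0.0]
    for _ in range(n_paths):
        explore(0, 0, 1.0, acc)
    return acc[0] / n_paths

print(estimate())  # ~1.04e-4, the binomial P(at least 3 heads)
```

The weighted count is an unbiased estimator of the same quantity that crude Monte Carlo targets, but far more of the sampled paths reach the rare event.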
Rare event simulation has been successfully applied in several domains [34,45,49,4,5,46]. However, a key bottleneck is that it critically relies on expert knowledge. In particular for importance splitting, finding a good importance function is a well-known, highly non-trivial task [35,25].

Our contribution: rare event simulation for fault trees. This paper presents an importance splitting method to analyse rfts. In particular, we automatically derive an importance function by exploiting the description of a system as a fault tree. This is crucial, since the importance function is normally given manually, in an ad hoc fashion, by a domain or res expert. We use a variety of res algorithms based on our importance function to estimate system unreliability and unavailability. Our approach can converge to precise estimations in increasingly reliable systems. This method has four advantages over earlier analysis methods for rfts, which we overview in the related work in Sec. 6, namely: (1) we are able to estimate both the system reliability and availability; (2) we can handle arbitrary failure and repair distributions; (3) we can handle rare events; and (4) we can do it in a fully automatic fashion. Technically, we build local importance functions for the (automata semantics of the) nodes of the tree. We then aggregate these local functions into an importance function for the full tree. Aggregation uses structural induction on the layered description of the tree. Using our importance function, we implement importance splitting methods to run res analyses.

We implemented our theory in a full-stack tool chain. With it, we computed confidence intervals for the unreliability and unavailability of several case studies. Our case studies are rfts whose failure and repair times are governed by arbitrary continuous probability density functions (pdfs). Each case study was analysed for a fixed runtime budget and in increasingly resilient configurations. In all cases our approach could estimate the narrowest intervals for the most resilient configurations.

Paper outline. Background on fault trees and res is provided in Secs. 2 and 3. We detail our theory to implement res for rfts in Sec. 4. Using a tool chain, we performed an extensive experimental evaluation that we present in Sec. 5. We overview related work in Sec. 6 and conclude in Sec. 7.

2 Repairable fault trees

A fault tree Δ is a directed acyclic graph that models how component failures propagate and eventually cause the full system to fail. We consider repairable fault trees (rfts), where failures and repairs are governed by arbitrary probability distributions. The leaves of the tree are basic events (BEs) and spare basic events (SBEs); gates include AND, OR, and VOT_k, which fail when all, at least one, or at least k out of m of their children fail, respectively. The latter is called the voting or k out of m gate. Note that VOT_1 is equivalent to an OR gate, and VOT_m is equivalent to an AND. The priority-and gate (PAND) is an AND gate that only fails if its children fail from left to right (or simultaneously). PANDs express failures that can only happen in a particular order, e.g. a short circuit in a pump can only occur after a leakage. SPARE gates have one primary child and one or more spare children: spares replace the primary when it fails. The FDEP gate has an input trigger and several dependent events: all dependent events become unavailable when the trigger fails. FDEPs can model for instance network elements that become unavailable if their connecting bus fails. An RBOX determines which basic element is repaired next according to a given policy. Thus all its inputs are BEs or SBEs. Unlike gates, an RBOX has no output, since it does not propagate failures.

Example. The tree in Fig. 2 models a railway-signal system, which fails if its high voltage and relay cabinets fail [21,39]. Thus, the top event is an AND gate with children HVcab (a BE) and Rcab. The latter is a SPARE gate with primary P and spare S. All BEs are managed by one RBOX with repair priority HVcab > P > S.
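As a sanity check of the failure-propagation logic just described, the following Python fragment (a simplified sketch of ours: it abstracts time, the RBOX, and spare claiming by other gates) evaluates the top event of the railway-signal tree of Fig. 2:

```python
from dataclasses import dataclass

@dataclass
class RailwayState:
    hvcab_failed: bool   # basic element HVcab
    p_failed: bool       # primary of the SPARE gate Rcab
    s_failed: bool       # spare basic element S
    s_available: bool    # S might be dormant-failed or claimed elsewhere

def rcab_failed(s: RailwayState) -> bool:
    # SPARE gate: fails when the primary is down and no operational
    # spare is available to replace it.
    return s.p_failed and (s.s_failed or not s.s_available)

def top_event(s: RailwayState) -> bool:
    # Top gate is an AND of HVcab and Rcab.
    return s.hvcab_failed and rcab_failed(s)

# All components operational: system up.
assert not top_event(RailwayState(False, False, False, True))
# HVcab and P failed, but S takes over: system still up.
assert not top_event(RailwayState(True, True, False, True))
# HVcab failed and the whole relay cabinet down: top event fails.
assert top_event(RailwayState(True, True, True, True))
```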
Following [32] we give semantics to rfts as Input/Output Stochastic Automata (iosa), so that we can handle arbitrary probability distributions. Each state in the iosa represents a system configuration, indicating which components are operational and which have failed. Transitions among states describe how the configuration changes when failures or repairs occur. More precisely, a state in the iosa is a tuple x = (x_1, …, x_n) ∈ S, where S is the state space and x_v denotes the state of node v in Δ. The possible values for x_v depend on the type of v. The output z_v ∈ {0,1} of node v indicates whether it is operational (z_v = 0) or failed (z_v = 1), and is calculated as follows:
- BEs (white circles in Fig. 1) have a binary state: x_v ∈ {0,1} and z_v = x_v.
- SBEs (Fig. 1e) have two additional states, which distinguish dormant from active operation: e.g. x_v = 2 denotes a dormant, operational SBE.
- ANDs have a binary state. Since the AND gate v fails iff all children fail, x_v = min_{w ∈ chil(v)} z_w. An AND gate outputs its internal state: z_v = x_v.
- A SPARE gate outputs z_v = 1 if its primary is failed and no spare is available; else z_v = 0.
- An FDEP gate has no output. All inputs are BEs and the leftmost is the trigger. We consider non-destructive FDEPs [7]: if the trigger fails, the output of all other BEs is set to 1, without affecting their internal state. Since this can be modelled by a suitable combination of OR gates [32], we omit the details.

For example, the rft from Fig. 2 starts with all elements operational, so the initial state is x_0 = (0, 0, 2, 0, 0), where x_S = 2 since the spare S starts dormant. If then P fails, x_P and z_P are set to 1 (failed) and S becomes x_S = 0 (active and operational spare), so the state changes to x_1 = (0, 1, 0, 0, 0). The traces of the iosa are given by sequences x_0 x_1 ⋯ x_n ∈ S*, where a change from x_j to x_{j+1} corresponds to transitions triggered in the iosa.

Nondeterminism. Dynamic fault trees may exhibit nondeterministic behaviour as a consequence of underspecified failure behaviour [15,27]. This can happen e.g. when two SPAREs have a single shared SBE: if all elements are failed and the SBE is repaired first, the failure behaviour depends on which SPARE gets the SBE. Monte Carlo simulation, however, requires fully stochastic models and cannot cope with nondeterminism. To overcome this problem we deploy the theory from [16,32]. If a fault tree adheres to some mild syntactic conditions, then its iosa semantics is weakly deterministic, meaning that all resolutions of the nondeterministic choices lead to the same probability value. In particular, we require that (1) each BE is connected to at most one SPARE gate, and (2) BEs and SBEs connected to SPAREs are not connected to FDEPs. In addition to this, some semantic decisions have been fixed: e.g. the semantics of PAND is fully specified, and policies must be provided for RBOXes and spare assignments.

Dependability metrics. An important use of fault trees is to compute relevant dependability metrics. Let X_t denote the random variable that represents the state of the top event at time t [14]. Two popular metrics are:
- system reliability: the probability of observing no top event failure before some mission time T > 0, viz. REL_T = Prob(∀ t ∈ [0,T]. X_t = 0);
- system availability: the proportion of time that the system remains operational in the long run, viz. AVA = lim_{t→∞} Prob(X_t = 0).
System unreliability and unavailability are the complements of these metrics, that is, UNREL_T = 1 − REL_T and UNAVA = 1 − AVA.

3 Monte Carlo simulation and rare events

Standard Monte Carlo simulation (smc). Monte Carlo simulation takes random samples from stochastic models to estimate a (dependability) metric of interest. For instance, to estimate the unreliability of a tree we sample N independent traces from its iosa semantics. An unbiased statistical estimator for p = UNREL_T is the proportion of traces observing a top level event: p̂ = (1/N) Σ_{j=1}^{N} X_j, where X_j = 1 if the j-th trace exhibits a top level failure before time T and X_j = 0 otherwise.
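As an illustration of the estimator p̂, the sketch below (our own toy example, not from the paper) applies smc to a model small enough to be solved by hand: a single non-repairable component with a Weibull failure time, whose unreliability has the closed form 1 − exp(−(T/α)^β):

```python
import random

def trace_fails_before(T: float) -> bool:
    """Sample one trace of a toy model: a single basic element whose
    failure time is Weibull(scale=200, shape=2); the top event is
    simply the failure of that element before time T."""
    return random.weibullvariate(200.0, 2.0) < T

def smc_unrel(T: float, n: int = 100_000, seed: int = 7) -> float:
    random.seed(seed)
    hits = sum(trace_fails_before(T) for _ in range(n))
    return hits / n          # estimator: (1/N) * sum_j X_j

# Closed form: 1 - exp(-(20/200)**2) ~ 0.00995
print(smc_unrel(T=20.0))
```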
The statistical error of p̂ is typically quantified with two numbers δ and ε s.t. the interval [p̂ − ε, p̂ + ε] contains p with probability δ: a confidence interval (ci) with confidence δ and precision 2ε. Such procedures scale linearly with the number of tree nodes and cater for a wide range of pdfs, even non-Markovian distributions. However, they encounter a bottleneck when estimating rare events: if p ≈ 0, very few traces observe X_j = 1. Therefore, the variance of estimators like p̂ becomes huge, and cis become very broad, easily degenerating to the trivial interval [0,1]. Increasing the number of traces alleviates this problem, but even standard ci settings, where ε is relative to p, require sampling an unacceptable number of traces [35]. Rare event simulation techniques solve this specific problem.

Rare Event Simulation (res). res techniques [35] increase the amount of traces that observe the rare event, e.g. a top level event in an rft. Two prominent classes of res techniques are importance sampling, which adjusts the pdf of failures and repairs, and importance splitting (isplit [30]), which samples more (partial) traces from states that are closer to the rare event. We focus on isplit due to its flexibility with respect to the probability distributions. isplit can be efficiently deployed as long as the rare event γ can be described as a nested sequence of less-rare events γ = γ_M ⊆ γ_{M−1} ⊆ ⋯ ⊆ γ_0. This decomposition allows isplit to study the conditional probabilities p_k = Prob(γ_{k+1} | γ_k), whose product is p = Prob(γ_M). Moreover, isplit requires all conditional probabilities p_k to be much greater than p, so that estimating each p_k can be done efficiently with smc.

The key idea behind isplit is to define the events γ_k via a so-called importance function I: S → ℕ that assigns an importance to each state s ∈ S. The higher the importance of a state, the closer it is to the rare event γ_M. Event γ_k collects all states with importance at least ℓ_k, for a certain increasing sequence of threshold levels ℓ_0 < ℓ_1 < ⋯ < ℓ_M. To exploit the importance function I in the simulation procedure, isplit samples more (partial) traces from states with higher importance. Two well-known methods are deployed and compared in this paper: Fixed Effort and restart.

Fixed Effort (fe [19]) samples a predefined amount of traces in each region between two thresholds. Thus, starting at γ_0, it first estimates the proportion of traces that reach γ_1, i.e. p_0 = Prob(γ_1 | γ_0). Next, from the states that reached γ_1, new traces are generated to estimate p_1, and so on until p_{M−1}. Fixed Effort thus requires that (i) each trace has a clearly defined "end", so that the estimation of each p_k finishes with probability 1, and (ii) all rare events reside in the uppermost region.

Example. Fig. 3a shows Fixed Effort estimating the probability of visiting a set of goal states before a set of stop states. States have importance values between 0 and 13, and thresholds ℓ_1 = 4 and ℓ_2 = 10 partition the state space into regions S_0, S_1, S_2. The effort is 5 simulations per region, for all regions: we call this algorithm fe_5. In region S_0, 2 simulations made it from the initial state to threshold ℓ_1, i.e. they reached some state with importance 4 before visiting a stop state. In S_1, starting from these two states, 3 simulations reached ℓ_2. Finally, 2 out of 5 simulations visited goal states in S_2. Thus, the estimated rare event probability of this run of fe_5 is 2/5 · 3/5 · 2/5 = 0.096.
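The following Python sketch mirrors the fe_5 example on a model chosen for checkability (our own toy, not the paper's): a biased random walk that moves up with probability 0.4, where importance equals the current state, the rare event is reaching 13 before falling back to 0, and the thresholds are 4 and 10:

```python
import random

P_UP = 0.4                  # biased walk: climbing is rare
GOAL = 13                   # importance of the rare (goal) states
BORDERS = [4, 10, GOAL]     # upper border of regions S0, S1, S2

def walk(state, hi):
    """Run until importance hi is reached (success) or the walk ends
    at 0 (failure); return the final state on success, else None."""
    while 0 < state < hi:
        state += 1 if random.random() < P_UP else -1
    return state if state == hi else None

def fixed_effort(effort=1000, seed=7):
    random.seed(seed)
    starts, estimate = [1], 1.0
    for hi in BORDERS:
        hits = [s for i in range(effort)
                if (s := walk(starts[i % len(starts)], hi)) is not None]
        if not hits:
            return 0.0
        estimate *= len(hits) / effort   # conditional probability p_k
        starts = hits                    # entry states of next region
    return estimate

# Exact value for this walk: (1.5 - 1)/(1.5**13 - 1) ~ 2.58e-3
print(fixed_effort())
```

Each region multiplies the running product by its estimated conditional probability, exactly as in the 2/5 · 3/5 · 2/5 computation above.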
restart (rst [48,47]) is another res algorithm, which starts one trace in γ_0 and monitors the importance of the states visited. If the trace up-crosses threshold ℓ_1, the first state visited in S_1 is saved and the trace is cloned, aka split (see Fig. 3b). This mechanism rewards traces that get closer to the rare event. Each clone then evolves independently, and if one up-crosses threshold ℓ_2 the splitting mechanism is repeated. Instead, if a state with importance below ℓ_1 is visited, the trace is truncated (marked in Fig. 3b). This penalises traces that move away from the rare event. To avoid truncating all traces, the one that spawned the clones in region S_k can go below importance ℓ_k. To deploy an unbiased estimator for p, restart measures how much splitting was required to visit a rare state [47]. In particular, restart does not need the rare event to be defined as γ_M [44], and it was devised for steady-state analysis [48] (e.g. to estimate UNAVA), although it can also be used for transient studies, as depicted in Fig. 3b [45].
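A matching restart sketch on the same toy random walk (again our own simplification, not FIG's implementation: it splits by a fixed factor on every up-crossing, and weighs every rare-event hit by the inverse of the total splitting):

```python
import random

P_UP, GOAL = 0.4, 13
THRESHOLDS, SPLIT = [4, 10], 4           # split in 4 at each up-crossing
HIT_WEIGHT = SPLIT ** -len(THRESHOLDS)   # every rare hit counts 1/16

def level(state):
    return sum(1 for t in THRESHOLDS if t <= state)

def run(state, floor):
    """One restart trace: truncated on falling below its birth
    threshold `floor` (0 for the main trace, which only ends there).
    Splits on every up-crossing; returns the weighted rare-event count."""
    total = 0.0
    while True:
        new = state + (1 if random.random() < P_UP else -1)
        if new == GOAL:
            return total + HIT_WEIGHT
        if new <= floor:
            return total
        if level(new) > level(state):    # up-crossing: spawn clones,
            for _ in range(SPLIT - 1):   # which die below `new`
                total += run(new, new - 1)
        state = new

def restart(n=40_000, seed=7):
    random.seed(seed)
    return sum(run(1, 0) for _ in range(n)) / n

print(restart())   # exact value again ~2.58e-3
```

Note how the spawning trace survives below a threshold and splits again on re-crossing it, which is what compensates for the truncated clones in restart's accounting.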
4 Importance splitting for repairable fault trees

The effectiveness of isplit crucially relies on the choice of the importance function I as well as of the threshold levels ℓ_k [30]. Traditionally, these are given by domain and/or res experts, requiring a lot of domain knowledge. This section presents a technique to obtain I and the ℓ_k automatically for an rft.

4.1 Compositional importance functions

By the core idea behind importance splitting, states that are more likely to lead to the rare event should have a higher importance. To achieve this, the key lies in defining an importance function I and thresholds ℓ_k that are sensitive to both the state space S and the transition probabilities of the system. For us, S ⊆ ℕ^n are all possible states of a repairable fault tree. Its top event fails when certain nodes fail in a certain order, and remain failed before certain repairs occur. To exploit this for isplit, the structure of the tree must be embedded into I.

The strong dependence of the importance function I on the structure of the tree is easy to see in the following example. Take the rft from Fig. 2 and let its current state x be s.t. P is failed while HVcab and S are operational. If the next event is a repair of P, then the new state x′ (where all basic elements are operational) is farther from a failure of the top event. Hence, a good importance function should satisfy I(x) > I(x′). Oppositely, if the next event had been a failure of S, leading to state x″, then one would want I(x) < I(x″). The key observation is that these inequalities depend on the structure of Δ as well as on the failures/repairs of basic elements.

In view of the above, any attempt to define an importance function for an arbitrary fault tree must put its gate structure in the forefront. In Table 1 we introduce a compositional heuristic for this, which defines local importance functions distinguished per node type. The importance function associated to node v is I_v: ℕ^n → ℕ. We define the global importance function of the tree (I_Δ, or simply I) as the local importance function of the top event node of Δ. Thus, I_v is defined in Table 1 via structural induction on the fault tree. It is defined so that it assigns to a failed node v its highest importance value. Functions with this property deploy the most efficient isplit implementations [30], and some res algorithms (e.g. Fixed Effort) require this property [19].

In the following we explain our definition of I_v. If v is a failed BE or SBE, then its importance is 1; else it is 0. This matches the output of the node, thus I_v(x) = z_v. Intuitively, this reflects how failures of basic elements are positively correlated to top event failures. The importance of AND, OR, and VOT_k gates depends exclusively on their input. The importance of an AND is the sum of the importances of its children, scaled by a normalisation factor. This reflects that AND gates fail when all their children fail, and each failure of a child brings an AND closer to its own failure, hence increasing its importance. Instead, since OR gates fail as soon as a single child fails, their importance is the maximum importance among their children. The importance of a VOT_k gate is the sum of the k (out of m) children with the highest importance values.

Omitting normalisation may yield an undesirable importance function. To understand why, consider a binary AND gate v with children l and r, and define I_naive_v(x) = I_l(x) + I_r(x). Suppose that I_l takes its highest value in max I_l = 2 while I_r does so in max I_r = 6, and assume that states x and x′ are s.t. I_l(x) = 1, I_r(x) = 0, I_l(x′) = 0, I_r(x′) = 3. This means that in both states one child of v is "good-as-new" and the other is "half-failed", and hence the system is equally close to failure in both cases. Hence we expect I_v(x) = I_v(x′), but instead I_naive_v(x) = 1 ≠ 3 = I_naive_v(x′). The solution is to sum the normalised importance functions I_l(x)/max I_l and I_r(x)/max I_r, which can be interpreted as the "percentage of failure" of the children of v. To make these numbers integers we scale them by lcm_v, the least common multiple of their max importance values. In our case lcm_v = 6, and hence I_v(x) = I_v(x′) = 3. Similar problems arise with all gates, hence normalisation is applied in general.

SPARE gates with m children (including the primary) behave similarly to AND gates: every failed child brings the gate closer to failure, as reflected in the left operand of the max in Table 1. However, SPAREs fail when their primaries fail and no SBEs are available, e.g. because they are possibly being used by another SPARE. This means that the gate could fail in spite of some children being operational. To account for this we exploit the gate output: multiplying z_v by m we give the gate its maximum value when it fails, even when this happens due to unavailable but operational SBEs.

For a PAND gate v we have to look carefully at the states. If the left child l has failed, then the right child r contributes positively to the failure of the PAND, and hence to the importance function of the node v. If instead the right child has failed first, then the PAND gate will not fail, and hence we let r contribute negatively to the importance function of v. Thus, we multiply I_r(x)/max I_r (the normalised importance function of the right child) by −1 in the latter case (i.e. when x_v ∉ {1, 4}). Instead, the left child always contributes positively. Finally, the max operation is two-fold: on the one hand, z_v · 2 ensures that the importance value remains at its maximum while failing (PANDs remain failed even after the left child is repaired); on the other, it ensures that the smallest possible value is 0 while operational (since importance values cannot be negative).

Our compositional importance function is based on the distribution of operational/failed basic elements in the fault tree, and on their failure order. This follows the core idea of importance splitting: the more failed BEs/SBEs (in the right order), the closer a tree is to its top event failure. However, isplit is about running more simulations from states with a higher probability of leading to rare states. This is only partially reflected by whether basic element b is failed. Probabilities lie also in the distributions F_b, R_b, D_b. These distributions govern the transitions among states x ∈ S, and can be exploited for importance splitting.
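Table 1 itself is not reproduced in this extraction, but the textual description above suffices to sketch the compositional scheme for the static gates (our own rendering in Python, covering only BE, AND and OR; VOT, SPARE and PAND are analogous but need the extra state information discussed above):

```python
from math import lcm   # Python >= 3.9

# A node is ('BE', z) with z = 1 if failed, or (gate, [children]).

def max_imp(node):
    kind, arg = node
    if kind == 'BE':
        return 1
    m = lcm(*(max_imp(c) for c in arg))
    return m * len(arg) if kind == 'AND' else m   # 'OR': one child suffices

def imp(node):
    kind, arg = node
    if kind == 'BE':
        return arg                      # I_v = z_v for basic elements
    m = lcm(*(max_imp(c) for c in arg))
    norm = [m * imp(c) // max_imp(c) for c in arg]  # scaled "% of failure"
    return sum(norm) if kind == 'AND' else max(norm)

# The normalisation example from the text: a binary AND whose children
# have maximal importances 2 and 6; one failed BE under the left child
# yields importance 3, as computed in the running example.
left  = ('AND', [('BE', 1), ('BE', 0)])                   # I = 1, max 2
right = ('AND', [('AND', [('BE', 0), ('BE', 0)]),
                 ('BE', 0), ('BE', 0)])                   # I = 0, max 6
top = ('AND', [left, right])
print(imp(top), max_imp(top))   # -> 3 12
```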
4.2 Selection of thresholds

To exploit these distributions, we use the two-phased approach of [11,12], which in a first (static) phase computes an importance function, and in a second (dynamic) phase selects the thresholds from the resulting importance values. In our current work, the first phase runs breadth-first search in the iosa module of each tree node. This computes node-local importance functions, which are aggregated into a tree-global I_Δ using our compositional function from Table 1.

The second phase involves running "pilot simulations" on the importance-labelled states of the tree. Running simulations exercises the fail/repair distributions of BEs/SBEs, imprinting this information in the thresholds ℓ_k. Several algorithms can perform this selection of thresholds. They operate sequentially, starting from the initial state, a fully operational tree, which has importance i_0 = 0. For instance, Expected Success [10] runs N finite-life simulations. If K < N/2 simulations reach the next smallest importance i_1 > i_0, then the first threshold will be ℓ_1 = i_1. Next, N simulations start from states with importance i_1, to determine whether the next importance i_2 should be chosen as threshold ℓ_2, and so on.

Expected Success also computes the effort per splitting region S_k = {x ∈ S | ℓ_{k+1} > I(x) ⩾ ℓ_k}. For Fixed Effort, "effort" is the base number of simulations to run in region S_k. For restart, it is the number of clones spawned when threshold ℓ_{k+1} is up-crossed. In general, if K out of N pilot simulations make it from ℓ_{k−1} to ℓ_k, then the k-th effort is N/K. This is chosen so that, during res estimations, one simulation makes it from threshold ℓ_{k−1} to ℓ_k on average. Thus, using the method from [11,12] based on our importance function I_Δ, we compute (automatically) the thresholds and their efforts for tree Δ. This is all the meta-information required to apply importance splitting res [19,18,11].

Implementation. Fig. 4 outlines the tool chain implemented to deploy the theory described above. The input model is an rft, described in the Galileo textual format [42,41] extended with repairs and arbitrary pdfs. This rft file is given as input to a Java converter that produces three outputs: the iosa semantics of the tree, the property queries for its reliability or availability, and our compositional importance function. In this way, we implemented automatic importance splitting for fta. In [9] we provide more details about our tool chain and its capabilities.
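A loose sketch of the threshold/effort selection described above, on the random-walk toy used earlier (our own simplification: the real Expected Success algorithm of [10] differs in several details, e.g. in how pilot runs are seeded from states of the previous threshold):

```python
import random

P_UP = 0.4   # the biased random walk from the earlier sketches

def pilot(start, target, n):
    """Count how many of n pilot runs from `start` reach importance
    `target` before the walk dies at 0."""
    wins = 0
    for _ in range(n):
        s = start
        while 0 < s < target:
            s += 1 if random.random() < P_UP else -1
        wins += s == target
    return wins

def select_thresholds(goal, n=1000, seed=7):
    """Walk up the importance values, turning i into a threshold when
    fewer than half of the pilot runs reach it, with effort N/K."""
    random.seed(seed)
    thresholds, efforts, current = [], [], 1
    for i in range(2, goal):
        k = pilot(current, i, n)
        if k == 0:
            break                         # i is practically unreachable
        if k < n / 2:                     # rare enough: make i a threshold
            thresholds.append(i)
            efforts.append(round(n / k))  # ~one survivor per run on average
            current = i
    return thresholds, efforts

print(select_thresholds(13))  # e.g. thresholds [2, 4, 6, ...], efforts ~2-3
```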
5 Experimental evaluation

Using our tool chain, we computed the unreliability and unavailability of 26 highly resilient repairable non-Markovian dfts. These trees come from seven literature case studies, enriched with RBOX elements and non-Markovian pdfs. We estimated their UNREL_{10³} or UNAVA in increasingly resilient configurations. To estimate these values we used various simulation algorithms: standard Monte Carlo (smc); Fixed Effort [19] with different numbers of runs performed in each region S_k (fe_n for n = 8, 12, 16, 24, 32); restart [47] with thresholds selected via a Sequential Monte Carlo algorithm [12] and different global splitting values (rst_n for n = 2, 3, 5, 8, 11); and restart with thresholds selected via Expected Success [10], which computes splitting values independently for each threshold (rst_es). fe_n, rst_n, and rst_es used the automatic isplit framework based on our importance function, as described in Sec. 4.2.

An instance y is a combination of an algorithm algo, an rft, and a dependability metric. An rft is identified by a case study (CS) and a parameter (p), where larger parameters of the rft CS_p indicate smaller dependability values p_{CS_p}. Running algo for a fixed simulation time, instance y estimates the value p_y = p_{CS_p}. The resulting ci p̂_y has a certain width in [0,1] (we fix the confidence coefficient δ = 0.95). The performance of algo can be measured by that width: the smaller the width, the more efficient the algorithm that achieved it. The simulation time fixed for an rft may not suffice to observe rare events, e.g. for smc. In such cases the FIG tool reports a "null estimate" p̂_y = [0,0]. Moreover, the simulation of random events depends on the rng (and its seed) used by FIG, so different runs may yield different results p̂_y. Therefore, for each y we repeated the computation of p̂_y n = 10 times, to assess the performance of algo in y by: (i) how many times it yielded not-null estimates, indicated by a bold number at the base of the bar corresponding to y (e.g. the 8 in Fig. 5b); (ii) the average width of the not-null estimates, indicated by the height of the bar; and (iii) the standard deviation of those widths, indicated by whiskers on top of the bar. We performed n = 10 repetitions to ensure statistical significance: a 95% ci for a plotted bar is narrower than the whiskers and, in the hardest configuration of every CS, the whiskers of smc bars never overlap with those of the best res algorithm.

Case studies. Our seven parametric case studies are: the synthetic models DSPARE_n and VOT_m, with n ∈ {3,4,5} SBEs for the first, m ∈ {2,3,4} shared BEs for the second, and one RBOX each; FTPP_s [17], where we study one triad with s ∈ {4,5,6} shared SBEs, using one RBOX for the processors and another for the network elements; HECS_o [43], with 2 memory interfaces, 4 RBOXes (one per subsystem), o ∈ {1,…,5} shared spare processors, and 2o parallel buses; and RWC_u for u ∈ {4,…,7} [22,21,39], which combines subsystems RC_v, with one RBOX and v ∈ {3,…,6} SPAREs, and HVC_w, with another RBOX and w ∈ {2,…,4} shared SBEs. In total these are 26 rfts, with pdfs that include exponential, Erlang, uniform, Rayleigh, Weibull, normal, and log-normal distributions. In an extended version of this work [9] we provide all details of our case studies.

Hardware. Experiments ran on two types of nodes of a SLURM cluster running Linux x64 (Ubuntu, kernel 3.13.0-168): korenvliet nodes have Intel® Xeon® E5-2630 v3 CPUs @ 2.40 GHz and 64 GB of DDR4 RAM @ 1600 MHz; caserta has Intel® Xeon® E7-8890 v4 CPUs @ 2.20 GHz and 2 TB of DDR4 RAM @ 1866 MHz.

5.1 Unavailability

Using smc and restart we computed UNAVA for VOT_{2,3,4}, HECS_{1,…,5}, RC_{3,…,6}, and RWC_{1,…,4}. fe was not used, since it requires regeneration theory for steady-state analysis [19], which is not always feasible with non-Markovian models. The mean widths of the cis achieved per instance are shown in Fig. 5. For example, for VOT_2 (Fig. 5a), 10 independent computations with smc ran in caserta for 5 min, and all converged to not-null cis (10 of 10). The mean width of these cis was 1.40×10⁻⁴ and their standard deviation 7.96×10⁻⁶. For VOT_3, all smc computations yielded not-null cis (after 30 min), with an average precision of 9.62×10⁻⁶ and standard deviation 1.52×10⁻⁶. For VOT_4, all smc simulations yielded null cis after 3 hours of simulation (0 of 10). Instead, rst_2 converged to 10, 10, and 5 not-null cis resp.
for VOT_{2,3,4}, with mean widths (and standard deviations) of 1.24×10⁻⁴ (1.19×10⁻⁵), 5.09×10⁻⁶ (1.48×10⁻⁶), and 1.79×10⁻⁷ (3.19×10⁻⁸). Thus, for the VOT case study, rst_2 was consistently more efficient than smc, and the efficiency gap increased as UNAVA became rarer. This trend repeats in all experiments: as expected, the rarer the metric, the wider the cis computed within the time limit, until at some point it becomes very hard to converge to not-null cis at all (especially for smc).

For the least resilient configuration of each case study, smc can be competitive or even more efficient than some isplit variants. For instance, for VOT_1 and HECS_1 in Figs. 5a and 5b, all computations converged to not-null cis for all algorithms, but smc exhibits less variable ci widths, viz. smaller whiskers. This is reasonable: truncating and splitting traces in restart adds (i) simulation overhead that may not pay off when estimating not-so-rare events and, on top of it, (ii) correlations between cloned traces that share a common history, increasing the variability among independent runs. On the other hand, and as expected, smc loses this competitiveness in all case studies as failures become rarer, here when UNAVA ≲ 1.0×10⁻⁵.

5.2 Unreliability

Fig. 6 shows the results of the unreliability experiments. The overall trend is similar to the previous unavailability cases. Here, however, it was possible to use Fixed Effort, since every simulation has a clearly defined end at time T = 10³. It is interesting thus to compare the efficiency of restart vs. fe: we note for example that some variants of fe performed considerably better than any other approach in the most resilient configurations of FTPP and HECS. It is nevertheless difficult to draw general conclusions from Figs. 6a to 6e, since some variants that performed best in one case study, e.g. fe_16 in HECS, did worse in others, e.g. FTPP, where the best algorithms were fe_{8,12}. Furthermore, fe, which is always better than rst for HECS when UNREL_{10³} < 10⁻³, did not perform very well in HVC, where the algorithms that achieved the narrowest and most not-null cis were rst_5 and rst_11 (fe_8 for HECS_5 escapes this trend, as found by analysing the execution logs). Such cases notwithstanding, fe is a solid competitor of restart in our benchmark.

Another relevant point of study is the optimal effort e for rst_e or fe_e, which shows no clear trend in our experiments. Here, e is a "global effort" used by these algorithms, equal for all regions S_k. e also alters the way in which the threshold-selection algorithm Sequential Monte Carlo (seq [12]) selects the ℓ_k. The lack of guidelines to select a value of e that works well across different systems was raised in [8]. This motivated the development of Expected Success (es [10]), which selects efforts individually per S_k (or ℓ_k). Thus, in rst_es, a trace up-crossing threshold ℓ_k is split according to the individual effort e_k selected by es. In the benchmark of [10], which consists mostly of queueing systems, es was shown to be superior to seq. However, the experimental outcomes on dfts in this work are different: for UNAVA, rst_es yielded mildly good results for HECS and RC; for the other case studies, and for all UNREL_{10³} experiments, rst_es always yielded null cis. It was found that the effort selected for most thresholds ℓ_k was either too small, so that splitting e_k times was not enough for the rst_es trace to reach ℓ_{k+1}, or too large, so that there was a splitting/truncation overhead. This point is further addressed in the conclusions.
Beyond comparisons among specific algorithms, be these for res or for selecting thresholds, it seems clear that our approach to fta via isplit delivers the expected results. For each parameterised case study CS_p, we could find a value of the parameter p where the level of resilience is such that smc is less efficient than our automatically-constructed isplit framework. This is particularly significant for big dfts like HECS and RWC, whose complex structure could be exploited by our importance function.

6 Related work

Most work on dft analysis assumes discrete [43,3] or exponentially distributed [15,29] component failures. Furthermore, component repair is seldom studied in conjunction with dynamic gates [6,3,40,29,31]. In this work we address repairable dfts, whose failure and repair times can follow arbitrary pdfs. More in detail, rfts were first formally introduced as stochastic Petri nets in [6,13]. Our work stands on [32], which reviews [13] in the context of stochastic automata with arbitrary pdfs. In particular, we also address non-Markovian continuous distributions: in Sec. 5 we experimented with exponential, Erlang, uniform, Rayleigh, Weibull, normal, and log-normal pdfs. Furthermore, and for the first time, we consider the application of [13,32] to study rare events.

Much effort in res has been dedicated to studying highly reliable systems, deploying either importance splitting or sampling. Typically, importance sampling can be used when the system takes a particular shape. For instance, a common assumption is that all failure (and repair) times are exponentially distributed with parameters λ^i, for some λ ∈ ℝ and i ∈ ℕ_{>0}. In these cases, a favourable change of measure can be computed analytically [20,23,33,34,49,39]. In contrast, when the fail/repair times follow less-structured distributions, importance splitting is more easily applicable. As long as a full system failure can be broken down into several smaller component failures, an importance splitting method can be devised. Of course, its efficiency relies heavily on the choice of the importance function. This choice is typically made ad hoc for the model under study [44,30,46]. In that sense, [24,25,11,12] are among the first to attempt a heuristic derivation of all parameters required to implement splitting, based on formal specifications of the model and the property query (the dependability metric). Here we extended [11,12,8], using the structure of the fault tree to define composition operands. With these operands we aggregate the automatically-computed local importance functions of the tree nodes. This aggregation results in an importance function for the whole model.

7 Conclusions

We have presented a theory to deploy automatic importance splitting (isplit) for fault tree analysis of repairable dynamic fault trees (rfts). This rare event simulation approach supports arbitrary probability distributions of component failure and repair. The core of our theory is an importance function I_Δ defined structurally on the tree. From this function we implemented isplit algorithms, and used them to estimate the unreliability and unavailability of highly-resilient rfts. Departing from classical approaches, which define importance functions ad hoc using expert knowledge, our theory computes all metadata required for res from the model and metric specifications.
Nonetheless, we have shown that, for a fixed simulation time budget and in the most resilient rfts, diverse isplit algorithms can be automatically implemented from I_Δ and always converge to narrower confidence intervals than standard Monte Carlo simulation.

There are several paths open for future development. First and foremost, we are looking into new ways to define the importance function, e.g. to cover more general categories of fts such as fault maintenance trees [37]. It would also be interesting to look into possible correlations between specific res algorithms and tree structures that yield the most efficient estimations for a particular metric. Moreover, we have defined I_Δ based on the tree structure alone. It would be interesting to further include stochastic information in this phase, and not only afterwards during the threshold-selection phase. Regarding thresholds, the relatively bad performance of the Expected Success algorithm leaves room for improvement. In general, we believe that enhancing its statistical properties should alleviate the behaviour mentioned in Sec. 5.2. Moreover, techniques to increase trace independence during splitting (e.g. resampling) could further improve the performance of the isplit algorithms. Finally, we are investigating enhancements to iosa and our tool chain, to exploit the ratio between the fail and dormancy pdfs of SBEs in warm SPARE gates.

Open Access. This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Acknowledgments. The authors thank José and Manuel Villén-Altamirano for fruitful discussions that helped to better understand the application scope of our approach.

References

[1] Assessment of maintenance policies for smart buildings: Application of formal methods to fault maintenance trees
[2] Statistical techniques for simulation models
[3] Non deterministic repairable fault trees for computing optimal repair strategy
[4] Rare event simulation for queues
[5] Rare event estimation for a large-scale stochastic hybrid system with air traffic application
[6] Parametric fault trees with dynamic gates and repair boxes
[7] Architectural dependability evaluation with Arcade
[8] Automation of Importance Splitting Techniques for Rare Event Simulation
[9] Rare event simulation for non-Markovian repairable fault trees
[10] Better automated importance splitting for transient rare events
[11] Rare event simulation with fully automated importance splitting
[12] Compositional construction of importance functions in fully automated importance splitting
[13] Repairable fault tree for the automatic evaluation of repair policies
[14] Formal semantics of models for computational engineering: a case study on dynamic fault trees
[15] Dynamic fault tree analysis using input/output interactive Markov chains
[16] Input/Output Stochastic Automata with Urgency: Confluence and weak determinism
[17] Fault trees and sequence dependencies
[18] On the importance function in splitting simulation
[19] The splitting method in rare event simulation
[20] A unified framework for simulating Markovian models of highly dependable systems
[21] DFTCalc: Reliability centered maintenance via fault tree analysis (tool paper)
[22] Smart railroad maintenance engineering with stochastic model checking
[23] Fast simulation of rare events in queueing and reliability models
[24] Importance splitting for statistical model checking rare properties
[25] Distributed verification of rare properties using importance splitting observers
[26] Fault trees on a diet
[27] Uncovering dynamic fault trees
[28] Estimation of particle transmission by random sampling
[29] Boosting fault tree analysis by formal methods
[30] Splitting techniques
[31] Smart maintenance via dynamic fault tree analysis: A case study on Singapore MRT system
[32] Stochastic Automata for Fault Tolerant Concurrent Systems
[33] Techniques for fast simulation of models of highly dependable systems
[34] Importance sampling simulations of Markovian reliability systems using cross-entropy
[35] Introduction to rare event simulation. In: Rare Event Simulation Using Monte Carlo Methods [36]
[36] Rare Event Simulation Using Monte Carlo Methods
[37] Maintenance analysis and optimization via statistical model checking
[38] Reliability-centered maintenance of the electrically insulated railway joint via fault tree analysis: A practical experience report
[39] Rare event simulation for dynamic fault trees
[40] Fault tree analysis: A survey of the state-of-the-art in modeling, analysis and tools
[41] Galileo user's manual & design overview
[42] The Galileo fault tree analysis tool
[43] Fault tree handbook with aerospace applications. NASA Office of Safety and Mission Assurance
[44] RESTART method for the case where rare events can occur in retrials from any threshold
[45] Importance functions for RESTART simulation of highly-dependable systems
[46] RESTART vs splitting: A comparative study
[47] Enhancement of the accelerated simulation method RESTART by considering multiple thresholds
[48] RESTART: a method for accelerating rare event simulations
[49] Dependability estimation for non-Markov consecutive-k-out-of-n: F repairable systems by fast simulation