key: cord-0048640-ms65cexw authors: Cordy, Maxime; Papadakis, Mike; Legay, Axel title: Statistical Model Checking for Variability-Intensive Systems date: 2020-03-13 journal: Fundamental Approaches to Software Engineering DOI: 10.1007/978-3-030-45234-6_15 sha: 43a6fbf7cdd12fd71f61867065cce62e23c3ccf4 doc_id: 48640 cord_uid: ms65cexw We propose a new Statistical Model Checking (SMC) method to discover bugs in variability-intensive systems (VIS). The state space of such systems is exponential in the number of variants, which makes the verification problem harder than for classical systems. To reduce verification time, we sample executions from a featured transition system – a model that jointly represents the state spaces of all variants. The combination of this compact representation and the inherent efficiency of SMC allows us to find bugs much faster (up to 16 times according to our experiments) than other methods. As with any simulation-based approach, however, there is a risk of Type-1 error. We provide a lower bound and an upper bound for the number of simulations to perform to achieve the desired level of confidence. Our empirical study, involving 59 properties over three case studies, reveals that our method manages to discover all variants violating 41 of the properties. This indicates that SMC can act as a low-cost, high-reward method for verifying VIS. We consider the problem of bug detection in Variability-Intensive Systems (VIS). This category encompasses any system that can be derived into multiple variants (differing, e.g., in provided functionalities), including software product lines [12] and configurable systems [32]. Compared to traditional ("single") systems, the complexity of bug detection in VIS is increased: bugs can appear only in some variants, which requires analysing the peculiarities of each variant. Among the many techniques developed for bug detection, one finds testing and model checking.
Testing [6] executes particular test inputs on the system and checks whether they trigger a bug. Although testing remains widely used in industry, the rise of concurrency and inherent system complexity has made system-level test case generation a hard problem. Also, testing is often limited to bounded reachability properties and cannot assess liveness properties. Model checking [2] is a formal verification technique which checks that all behaviours of the system satisfy specified requirements. These behaviours are typically modelled as an automaton, in which each node represents a state of the system (e.g. a valuation of the variables of a program and a location in this program's execution flow) and each transition between two states expresses that the program can move from one state to the other by executing a single action (e.g. executing the next program statement). Requirements are often expressed in temporal logics, e.g. Linear Temporal Logic (LTL) [31]. Such logics capture both safety and liveness properties of system behaviours. As an example, consider the LTL formula □(command_sleep ⇒ ♦system_sleep). command_sleep and system_sleep are logic atoms and represent, respectively, a state where the sleep command is input and a state where the system enters sleep mode. The symbols □ and ♦ mean always and eventually, respectively. Thus, the whole formula expresses that "it is always the case that when the sleep command is input, the system eventually enters sleep mode". Contrary to testing, model checking is exhaustive: if a bug exists then the checking algorithm outputs a counterexample, i.e. an execution trace of the system that violates the verified property. Exhaustiveness makes model checking an appealing solution to obtain strong guarantees that the system works as intended. It can also nicely complement testing (whose main advantage remains that it applies directly to the running system), e.g.
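To make the semantics concrete, here is a minimal sketch that evaluates a formula of the shape □(a ⇒ ♦b) on a lasso-shaped execution, i.e. a finite prefix followed by an infinitely repeated cycle. The trace representation (lists of sets of atoms) and the plain-string atom names are our own illustration, not part of the paper:

```python
# Sketch: evaluating G(command_sleep -> F system_sleep) on a lasso
# (finite prefix + cycle repeated forever). Illustrative representation.

def holds_always_eventually(prefix, cycle, trigger, goal):
    """Check G(trigger -> F goal) on the lasso prefix + cycle^omega."""
    trace = prefix + cycle
    # Positions in the cycle recur infinitely often, so a goal anywhere
    # in the cycle satisfies F goal from any position.
    cycle_has_goal = any(goal in state for state in cycle)
    for i, state in enumerate(trace):
        if trigger in state:
            if not (any(goal in s for s in trace[i:]) or cycle_has_goal):
                return False
    return True

# The sleep command is eventually followed by sleep mode inside the cycle.
ok = holds_always_eventually(
    prefix=[{"command_sleep"}],
    cycle=[set(), {"system_sleep"}],
    trigger="command_sleep",
    goal="system_sleep",
)

# A lasso whose cycle never reaches sleep mode violates the formula.
bad = holds_always_eventually(
    prefix=[{"command_sleep"}],
    cycle=[set()],
    trigger="command_sleep",
    goal="system_sleep",
)
```

This also previews why lassos suffice for checking LTL on finite-state systems: every infinite execution of a finite automaton eventually enters a cycle.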
by reasoning over liveness properties or by serving as an oracle in test generation processes [1]. Those benefits, however, come at the cost of scalability issues, the most prominent being the state explosion problem. This term refers to the phenomenon where the state space to visit is so huge that an exhaustive search is intractable. As an illustration, let us remark that the LTL model-checking problem is PSPACE-complete [37]. Model checking complexity is further exacerbated when it comes to VIS. Indeed, in this case, the model-checking problem requires verifying whether all the variants satisfy the requirements [11]. This means that, if the VIS comprises n variation points (n features in a software product line or n Boolean options in a configurable system), the number of different variants to represent and to check can reach 2^n. This exponential factor adds to the inherent complexity of model checking. Thus, checking each variant (or models thereof) separately – an approach known as enumerative or product-based [34] – is often intractable. To alleviate this, variability-aware models and companion algorithms were proposed to represent and check efficiently the behaviour of all variants at once. For instance, Featured Transition Systems (FTS) [11] are transition systems where transitions are labelled with (a symbolic encoding of) the set of variants able to exercise this transition. The structure of an FTS, if well constructed, allows one to capture in a compact manner commonalities between states and transitions of several variants. Exploiting that information, family-based algorithms check only once the executions that several variants share and explore the state space of an individual variant only when it differs from all the others. In spite of positive improvements over the enumerative approach, state-space explosion remains a major challenge.
In this work, we propose an alternative technique for state-space exploration and bug detection in VIS. We use Statistical Model Checking (SMC) [26] as a trade-off between testing and model checking to verify properties (expressed in full LTL) on FTS. The core idea of SMC is to conduct some simulations (i.e. sample executions) of the system (or its model) and verify whether these executions satisfy the property to check. The results are then used together with statistical tests to decide whether the system satisfies the property with some degree of confidence. Of course, in contrast with an exhaustive approach, a simulation-based solution does not guarantee a result with 100% confidence. Still, it is possible to bound the probability of making an error. Simulation-based methods are known to be far less memory- and time-consuming than exhaustive ones, and are sometimes the only viable option. Over the past years, SMC has been used to, e.g., assess the absence of errors in various areas from aeronautics to systems biology; measure average cost and energy consumption for complex applications such as nanosatellites; and detect rare bugs in concurrent systems [10, 21, 25]. Given an LTL formula and an FTS, our family-based SMC method samples executions from all variants at the same time. Doing so, it avoids sampling twice (or more) executions that exist in multiple variants. Merging the individual state spaces biases the results, though, as it changes the probability distribution of the executions. This makes the problem different from previous methods intended for single systems (e.g. [20]) and obliges us to revisit the fundamentals of SMC in the light of VIS. In particular, we want to characterize the number of execution samples required to bound the probability of Type-1 error by a desired degree of confidence. We provide a lower bound and an upper bound for this number by reducing its computation to particular instances of the coupon collector's problem [4].
We implemented our method within ProVeLines [17], a model checker for VIS. We provide empirical evidence, based on 3 case studies totalling 59 properties to check, that family-based SMC is a viable approach to verify VIS. Our study shows that our method manages to find all buggy variants for 41 properties and does so up to 16 times faster than state-of-the-art model-checking algorithms for VIS [11]. Moreover, our approach can achieve a median bug detection rate 3 times higher than classical SMC applied to each variant individually. The hardest cases arise when the state space of some variant is substantially smaller than the others', which reduces the probability of finding a bug in those variants. In model checking, the behaviour of the system is often represented as a transition system (S, ∆, AP, L) where S is a set of states, ∆ ⊆ S × S is the transition relation, AP is a set of atomic propositions and L : S → 2^AP labels any state with the atomic propositions that the system satisfies when in that state. LTL is a temporal logic that allows specifying desired properties over all future executions of a given system. Given a set AP of atomic propositions, an LTL formula φ is formed according to the following grammar: φ ::= ⊤ | a | φ1 ∧ φ2 | ¬φ1 | ○φ1 | φ1 U φ2, where φ1 and φ2 are LTL formulae, a ∈ AP, ○ is the next operator and U is the until operator. We also define ♦φ ("eventually φ") and □φ ("always φ") as shortcuts for ⊤ U φ and ¬♦¬φ, respectively. Vardi and Wolper have presented an automata-based approach for checking that a system – modelled as a transition system ts – satisfies an LTL formula φ [37]. Their approach consists of, first, transforming φ into a Büchi automaton B¬φ whose language is exactly the set of executions that violate φ, that is, those that visit infinitely often a so-called accepting state. Such an execution σ takes the form of a lasso, i.e. σ = q0 . . .
qn with qj = qn for some j. We name accepting any such lasso whose cycle contains an accepting state, i.e. where qi is accepting for some i : j ≤ i ≤ n. The second step is to compute the synchronous product of ts and B¬φ, which results in another Büchi automaton Bts⊗¬φ. Any accepting lasso in Bts⊗¬φ represents an execution of the system that violates φ. Thus, Vardi and Wolper's algorithm comes down to checking the absence of such accepting lassos in the whole state space of Bts⊗¬φ. The size of this state space is O(|ts| × 2^|φ|) and the problem is PSPACE-complete. Originally, SMC was used to compute the probability to satisfy a bounded LTL property for stochastic systems [39]. The idea was to monitor the properties on bounded executions represented by Bernoulli variables and then use Monte Carlo techniques to estimate the resulting probability. SMC also applies to non-stochastic systems by assuming an implicit uniform probability distribution on each state's successors. Grosu and Smolka [20] lean on this and propose an SMC method to address the full LTL model-checking problem. Their sampling algorithm walks randomly through the state space of Bts⊗¬φ until it finds a lasso. They repeat the process M times and conclude that the system satisfies the property if and only if none of the M lassos is accepting. They also show that, given a confidence ratio δ and assuming that the probability p of sampling an accepting lasso exceeds an error margin ε, setting M = ln(δ)/ln(1−ε) bounds the probability of a Type-1 error (rejecting the hypothesis that the system violates the property while it actually violates it) by δ. Thus, M can serve as a minimal number of samples to perform. Our work extends theirs in order to support VIS instead of single systems. Other work on applying SMC to full LTL can be found in [18, 38].
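Grosu and Smolka's bound M = ln(δ)/ln(1−ε) is mechanical to evaluate; a minimal sketch, with δ and ε chosen as arbitrary example values:

```python
import math

def min_samples(delta, epsilon):
    """M = ln(delta) / ln(1 - epsilon): smallest number of lassos to
    sample so that the Type-1 error is bounded by delta, assuming the
    probability of sampling an accepting lasso is at least epsilon."""
    return math.ceil(math.log(delta) / math.log(1.0 - epsilon))

# With a 5% confidence ratio and a 1% error margin:
M = min_samples(delta=0.05, epsilon=0.01)  # 299 samples
```

Tightening the confidence ratio (smaller δ) or the error margin (smaller ε) increases M, which is why the bound is usually rounded up generously in practice.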
Applying classical model checking to VIS requires iterating over all variants, constructing their corresponding automata Bts⊗¬φ and searching for accepting lassos in each of these. This enumerative method (also named product-based [34]) fails to exploit the fact that variants have behaviour in common. As an alternative, researchers came up with models able to capture the behaviour of multiple variants and distinguish between the unique and common behaviours of those variants [3, 8, 11]. Among such models, we focus on featured transition systems [11] as those can link an execution to the variants able to execute it more directly than the alternative formalisms. In a nutshell, FTS extend standard transition systems by labelling each transition with a symbolic encoding of the set of variants able to exercise this transition. Then, the set of variants that can produce an execution π is the intersection of all sets of variants associated with the transitions in π. To check which variants violate a given LTL formula φ, one can adapt the procedure of Vardi and Wolper and build the synchronous product of the featured transition system with B¬φ [11]. This product is similar to the Büchi automaton obtained in the single-system case, except that its transitions are also labelled with a set of variants. Then, the buggy variants are those that are able to execute the accepting lassos of this automaton. This generalized automaton is the fundamental formalism we work on in this paper. Definition 1. A featured Büchi automaton (FBA) is a tuple fba = (Q, ∆, Q0, A, Θ, γ) where Q is a set of states, ∆ ⊆ Q × Q is the transition relation, Q0 ⊆ Q is a set of initial states, A ⊆ Q is the set of accepting states, Θ is the whole set of variants, and γ : ∆ → 2^Θ associates each transition with the set of variants that can execute it. The Büchi automaton corresponding to one particular variant v is derived by removing the transitions not executable by v. That is, we remove all transitions (q, q′) ∈ ∆ such that v ∉ γ(q, q′). The resulting automaton is named the projection of the FBA onto v.
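A featured Büchi automaton can be represented directly as a transition relation labelled with variant sets, and the projection onto a variant is then a simple filter. The toy two-variant automaton below is our own illustration, not Figure 1 of the paper:

```python
# Sketch: an FBA's gamma function as a dict from transitions to the set
# of variants able to execute them; projection keeps only the
# transitions the chosen variant can take.

def project(gamma, variant):
    """Projection of the FBA onto `variant`: keep (q, q') only if
    `variant` belongs to gamma[(q, q')]."""
    return {t: vs for t, vs in gamma.items() if variant in vs}

gamma = {
    ("s0", "s1"): {"v1", "v2"},  # behaviour shared by both variants
    ("s1", "s0"): {"v1", "v2"},
    ("s0", "s2"): {"v1"},        # behaviour unique to v1
    ("s2", "s0"): {"v1"},
}

proj_v2 = project(gamma, "v2")   # v2 keeps only the s0 <-> s1 loop
```

The same filtering, applied on the fly during a random walk, is what lets family-based sampling restrict itself to behaviours that at least one variant can actually execute.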
For example, one obtains the projection of the FBA in Figure 1 onto v2 by removing the transition from State 3 to State 7 and those between States 7 and 8. Recent work has applied SMC in the context of VIS. In [36], the authors proposed an algebraic language to describe (quantitative) behavioural variability in a dynamic manner. While their work shares some similarities with ours, there are fundamental differences. First, we seek to guarantee the absence of bugs in all variants of the family (applying family-based concepts), while they focus on dynamic feature interactions (on a product-by-product basis). The second difference is that they consider quantitative bounded properties, while we support the entire LTL verification problem by extending the multi-lasso concept of [20, 28]. Another related, yet different, area is the sampling of VIS variants (e.g. [27, 30]). Such work considers the problem of sampling variants uniformly in order to study their characteristics (e.g. performance [22] and other quality requirements [15]) and infer those of the other variants. Recently, Thüm et al. [35] surveyed different strategies for the performance analysis of VIS, including the sampling of variants and family-based test generation, which is based on the same idea of executing test cases common to multiple variants. Contrary to us, such works do not consider temporal/behavioural properties and most of them perform the sampling based on a static representation of the variant space (i.e. a feature model [23]). An interesting direction for future work is to combine our family-based SMC with sampling techniques to check only representative variants of the family. The purpose of SMC is to reduce the verification effort (when visiting the state space of the system model) by sampling a given number of executions (i.e. lassos). This gain in efficiency, however, comes at the risk of Type-1 errors.
Indeed, while the discovery of a counterexample leads with certainty to the conclusion that the variants able to execute it violate the property φ, the fact that the sampling did not find a counterexample for some variant v does not entail a 100% guarantee that v satisfies φ. The more lassos we sample, the more confident we can get that the variants without counterexamples satisfy φ. Thus, designing a family-based SMC method involves answering three questions: (1) how to sample executions; (2) how to choose a suitable number of executions; (3) what is the associated probability of Type-1 error. One can sample a lasso in an FBA by randomly walking through its state space, starting from a randomly-chosen initial state and ending as soon as a cycle is found. A particular restriction is that this lasso should be executable by at least one variant; otherwise, we would sample a behaviour that does not actually exist. The set of variants able to execute a given lasso are those that can execute all its transitions, i.e. the intersection of all γ(q, q′) met along the transitions of this lasso. More generally, we define the lasso sample space of an FBA as follows. Definition 2. Let fba = (Q, ∆, Q0, A, Θ, γ) be a featured Büchi automaton. The lasso sample space L of fba is the set of executions σ = q0 . . . qn such that q0 ∈ Q0, (qi, qi+1) ∈ ∆ for all 0 ≤ i < n, the states q0 . . . qn−1 are pairwise distinct, qn = qj for some 0 ≤ j < n, and the intersection of the sets γ(qi, qi+1) is non-empty. Moreover, σ is said to be an accepting lasso if qa ∈ A for some j ≤ a ≤ n. Algorithm 1 (input: fba = (Q, ∆, Q0, A, Θ, γ); output: (σ, Θσ, accept), where σ is a lasso of fba, Θσ is the set of variants able to execute σ, and accept is true iff σ is accepting) formalizes the sampling of lassos in a deadlock-free FBA. After randomly picking an initial state (Line 1), we walk through the state space by randomly choosing, at each iteration, a successor state among those available (Lines 7–18). Throughout the search, we maintain the set of variants Θσ that can execute σ so far (Line 16).
Then, we use this set as a filter when selecting successor states, so as to make sure that σ remains executable by at least one variant. At Line 13, Succσ is the set of successors q′ of q (the last state of σ) that can be reached by at least one variant in Θσ. We stop the search as soon as we reach a state that was previously visited (Line 7). If this state was visited before the last accepting state, it means that the sampled lasso is accepting (Line 19). A motivated criticism [28] of the use of random walks to sample lassos is that shorter lassos receive a higher probability of being sampled. To counterbalance this, we implemented a heuristic named multi-lasso [20]. It consists of ignoring backward transitions that do not lead to an accepting lasso if there are still forward transitions to explore. This is achieved by modifying Line 13 such that backward transitions leading to a non-accepting lasso are not considered in the successor set. Assuming a uniform selection of outgoing transitions from each state, one can compute the probability that a random walk samples any given lasso from the sample space. Definition 3. The probability P(σ) of a lasso σ = q0 . . . qn is inductively defined as P(q0) = 1/|Q0| and P(q0 . . . qi+1) = P(q0 . . . qi) · 1/|Succσ(qi)|. In the absence of deadlock, (L, P(L), P) defines a probability space. Probability spaces on infinite executions are by no means a trivial construction (see e.g. [9]). Nevertheless, the proof of this proposition is similar to its counterpart in Büchi automata [20] and is therefore omitted. It derives from the observation that the lasso sample space is composed of non-subsuming finite prefixes of all infinite paths of the automaton. Let us consider an example. In the FBA from Figure 1, there are two non-accepting lassos (l1 = (1, 2, 1) and l2 = (1, 3, 7, 8, 7)) and two accepting lassos (l3 = (1, 3, 4, 5, 3) and l4 = (1, 3, 6, 5, 3)). Both variants can execute lasso l3, while only v1 can execute l2 and only v2 can execute l1 and l4.
The probability of sampling l1 is 1/2, whereas P[l2] = P[l3] = P[l4] = 1/6. Thus, the probability of sampling a counterexample executable by v2 is 1/3, whereas it is only 1/6 for v1. Next, we characterize the relationship between this probability space and any individual variant v. Let Lv be the set of lassos executable by v. Since Lv ⊆ L, the probability pv of sampling such a lasso is the sum of P(σ) over all σ ∈ Lv. Note that pv can be different from the probability p̂v of sampling an accepting lasso from the automaton modelling the behaviour of v only (i.e. the projection of the FBA onto v). This is because, in the FBA, the probability of selecting an outgoing transition from a given state is assigned uniformly, regardless of the number of variants able to execute that transition. This balance-breaking effect grows as the variants differ in their numbers of unique executions. Let σ = q0 . . . qn be a lasso in Lv. Then Pv(σ), the probability of sampling σ from the projection onto v, is defined like P(σ) but with the successor sets restricted to the transitions executable by v. In the example of Figure 1, the probability of sampling the accepting lasso l3 from the projection onto v1 is 1/2, against 1/6 from the whole FBA. This implies that it is more likely to sample an accepting lasso executable by v1 from its projection in one trial than it is from the whole FBA in two trials. This illustrates the case where merging the state spaces of the variants can have a negative impact on the capability to find bugs specific to one variant. Thus, sampling lassos from the FBA allows finding one counterexample executable by multiple products but it introduces a bias. Overall, it tends to decrease the probability of sampling lassos from variants that have a smaller state space. This can impact the results and parameter choices of SMC, like the number of samples required to get confident results and the associated Type-1 error. Remember that addressing the model checking problem for VIS requires finding a counterexample for every buggy variant v. Thus, one must sample a number M of lassos such that one gets an accepting lasso for each such buggy variant with a confidence ratio δ.
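The example above can be reproduced mechanically. The sketch below encodes an FBA with the lasso structure described for Figure 1 (the state numbering, γ labels and accepting state are reconstructed from the text, so treat them as our assumption) and samples lassos with a random walk in the spirit of Algorithm 1, tracking Θσ and the probability of each sampled lasso:

```python
import random
from fractions import Fraction

# FBA reconstructed from the textual description of Figure 1:
# lassos l1=(1,2,1) [v2], l2=(1,3,7,8,7) [v1], l3=(1,3,4,5,3) [both],
# l4=(1,3,6,5,3) [v2]; state 5 is assumed accepting.
GAMMA = {
    (1, 2): {"v2"}, (2, 1): {"v2"},
    (1, 3): {"v1", "v2"}, (3, 4): {"v1", "v2"},
    (4, 5): {"v1", "v2"}, (5, 3): {"v1", "v2"},
    (3, 6): {"v2"}, (6, 5): {"v2"},
    (3, 7): {"v1"}, (7, 8): {"v1"}, (8, 7): {"v1"},
}
ACCEPTING = {5}

def sample_lasso(rng):
    """Random walk in the spirit of Algorithm 1: pick successors
    uniformly among those executable by a variant in Theta_sigma,
    stop on the first revisited state."""
    sigma, variants, prob = [1], {"v1", "v2"}, Fraction(1)
    while True:
        q = sigma[-1]
        succs = [q2 for (q1, q2), vs in GAMMA.items()
                 if q1 == q and vs & variants]
        q2 = rng.choice(succs)
        prob /= len(succs)
        variants &= GAMMA[(q, q2)]
        if q2 in sigma:                         # lasso closed
            cycle = sigma[sigma.index(q2):] + [q2]
            accepting = any(s in ACCEPTING for s in cycle)
            return tuple(sigma + [q2]), variants, accepting, prob

rng = random.Random(42)
seen = {}
for _ in range(6000):
    sigma, variants, accepting, prob = sample_lasso(rng)
    seen[sigma] = (frozenset(variants), accepting, prob)
```

Running it recovers the four lassos with P(l1) = 1/2 and P(l2) = P(l3) = P(l4) = 1/6, matching the probabilities above.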
Let fba be a featured Büchi automaton, v be a variant and pv = Σσ∈Lωv P(σ), where Lωv is the set of accepting lassos executable by v. Let Zv denote a Bernoulli random variable such that Zv = 1 with probability pv and Zv = 0 with probability qv = 1 − pv. Now, let Xv denote the geometric random variable with parameter pv that encodes the number of independent samples required until Zv = 1. For a set of variants V = {v1 . . . v|V|}, we have that Xv1 . . . Xv|V| are not independent, since one may sample a lasso executable by more than one variant. We define X = max i=1..|V| Xvi. We aim to find a number of samples M such that P[X ≤ M] ≥ 1 − δ for a confidence ratio δ. This is analogous to the coupon collector's problem [4], which asks how many boxes are needed to collect one instance of every coupon placed randomly in the boxes. It differs from the standard formulation in that the probabilities of occurrence of the coupons are neither independent nor uniform, and a single box can contain 0 to |V| coupons. Even for simpler instances of the coupon problem, computing P[X ≤ M] analytically is known to be hard [33]. Thus, existing solutions rather characterise a lower bound and an upper bound. We follow this approach as well. To compute a lower bound for the number of samples to draw, we transform the family-based SMC problem into a simpler form (in terms of verification effort). We divide our developments into two parts. First, we show that assigning equal probabilities pvi to every variant vi (obtained by averaging the original probability values) reduces the number M of required samples. As a second step, we show that assuming that all variants share all their executions also reduces M. Doing so, we reduce the family-based SMC problem to its single-system counterpart, which allows us to obtain the desired lower bound. Intuitively, the value of X depends mainly on the variants whose accepting lassos are rarer.
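The coupon collector analogy can be illustrated with a small Monte Carlo experiment estimating P[X ≤ M] when one "box" (sampled lasso) may cover several variants at once. The per-variant probabilities below are illustrative and, unlike in an FBA, the variant hits are drawn independently for simplicity:

```python
import random

def estimate_p_collect(per_variant_prob, M, runs, rng):
    """Estimate P[X <= M]: probability that, within M samples, every
    variant has been 'collected' (a counterexample for it was sampled).
    A single sample may cover several variants at once, mirroring
    lassos shared by multiple variants."""
    hits = 0
    for _ in range(runs):
        missing = set(per_variant_prob)
        for _ in range(M):
            for v, p in per_variant_prob.items():
                if v in missing and rng.random() < p:
                    missing.discard(v)
            if not missing:
                hits += 1
                break
    return hits / runs

rng = random.Random(0)
# Two common variants and one rare one: the rare variant dominates X.
probs = {"v1": 0.2, "v2": 0.2, "v3": 0.02}
p_100 = estimate_p_collect(probs, M=100, runs=2000, rng=rng)
p_10 = estimate_p_collect(probs, M=10, runs=2000, rng=rng)
```

Even in this toy setting, the estimate is driven almost entirely by the rare variant v3, which is exactly the intuition behind bounding X via the rarest accepting lassos.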
By averaging the probability of sampling accepting lassos, we raise the likelihood of getting those rarer lassos and thus reduce the number of samples required to get an accepting lasso for all variants. Let pavg denote the average of the pvi and let Xeven be the counterpart of X where every pvi is replaced by pavg; Lemma 4 states that this averaging can only ease the collection, i.e. P[Xeven ≤ N] ≥ P[X ≤ N] for any N. Shioda [33] proves a similar result for the coupon collector problem. He does so by showing that the vector peven majorizes p = (pv1 . . . pv|V|) and that the ccdf of X is a Schur-concave function of the sampling probabilities. Even though our case is more general than the non-uniform coupon collector's problem, the result of Lemma 4 still holds. Indeed, we observe that the theoretical proof of [33] (a) does not assume the independence of the random variables Zvi; (b) still applies to the dependent case; and (c) supports the case where the sum of the probability values pvi is less than one. Maximized commonalities. Next, let Xall be the particular case of Xeven where all accepting lassos are executable by all variants and are sampled with probability pavg. Thus, the number of samples to find an accepting lasso for every variant reduces to the number of samples required to find any accepting lasso. Moreover, let us note that Xall is a geometric random variable with parameter pavg. This reduces our problem to sampling an accepting lasso in a classical Büchi automaton and allows us to reuse the results of Grosu and Smolka [20]. Lemma 6. For a confidence ratio δ and an error margin εavg such that pavg ≥ εavg, it holds that M ≥ N, where M = ln(δ)/ln(1−εavg) and N = ln(δ)/ln(1−pavg). Theorem 7. Assuming that pavg ≥ εavg for a given error margin εavg, a lower bound for the number of samples required to find an accepting lasso for each buggy variant is M = ln(δ)/ln(1−εavg), with a Type-1 error bounded by δ. We follow a similar two-step process to characterise an upper bound for M. In the first step, we replace the probabilities pvi of every variant by their minimum. In the second step, we alter the model so that the variants have no common behaviour.
Then we show that, given a desired degree of confidence, the obtained model requires a higher number of samples than the original one. Minimum probability. Let pmin = min v=1..|V| pv and let Xmin be the counterpart of X where all probabilities pvi have been replaced by pmin. The ccdf of X being a decreasing function of the sampling probabilities, we have that P[Xmin ≤ N] ≤ P[X ≤ N] for any N. No common counterexamples. Let {(Xindep)vi} be a set of independent geometric random variables with parameter pmin and let Xindep = max (Xindep)vi. Xindep actually encodes the number of samples required to get a counterexample for all buggy variants when those have no common counterexamples. We have that P[Xindep ≤ N] ≤ P[Xmin ≤ N], since the number of samples to perform cannot be reduced by sampling a counterexample executable by multiple variants. Now, let us note that Xindep is an instance of the uniform coupon problem with |V| coupons to collect. A lower bound for P[Xindep ≤ M] is known to be 1 − |V| · (1 − pmin)^M [33]. Assuming pmin greater than some error margin εmin, requiring 1 − |V| · (1 − εmin)^M ≥ 1 − δ yields the upper bound M = ln(δ/|V|)/ln(1−εmin), with a Type-1 error bounded by δ. One can regard SMC as a means of speeding up verification while risking missing counterexamples. Our first question studies this trade-off and analyses the empirical Type-1 error rate. More precisely, we compute the detection rate of our family-based SMC method, expressed as the number of buggy variants that it detects over the total number of buggy variants. We compute the detection rate for different numbers M of samples lying between the lower and upper bounds as characterised in Section 3. To get the ground truth (i.e. the true set of all buggy variants), we execute the exhaustive LTL model-checking algorithm for FTS designed by Classen et al. [11]. For the lower bound, we assume that the average probability of sampling an accepting lasso for any variant is higher than εavg = 0.01. Setting a confidence ratio δ = 0.05 yields M = ln(0.05)/ln(0.99) ≈ 298. We round up and set M = 300 as our lower bound.
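Both bounds are mechanical to evaluate. The sketch below reproduces the lower-bound computation and evaluates the upper-bound formula ln(δ/|V|)/ln(1−εmin); the value εmin = 0.001 is our own illustrative assumption, chosen only to show the formula's behaviour:

```python
import math

def lower_bound(delta, eps_avg):
    """Lower bound: M = ln(delta)/ln(1 - eps_avg), valid when
    p_avg >= eps_avg (all variants share all executions)."""
    return math.log(delta) / math.log(1.0 - eps_avg)

def upper_bound(delta, eps_min, n_variants):
    """Upper bound: M = ln(delta/|V|)/ln(1 - eps_min), valid when
    p_min >= eps_min and no counterexample is shared (worst case)."""
    return math.log(delta / n_variants) / math.log(1.0 - eps_min)

lo = lower_bound(0.05, 0.01)          # ~298.1, rounded up to 300 in the paper
hi = upper_bound(0.05, 0.001, 256)    # ~8536.7 for this illustrative eps_min
```

The gap between the two bounds reflects how much variants sharing counterexamples can reduce the sampling effort.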
For the upper bound, we assume that the minimum probability of sampling a counterexample in a buggy variant is higher than some εmin; instantiating M = ln(δ/|V|)/ln(1−εmin) with δ = 0.05 and |V| = 256 (256 being the maximum number of variants in our case studies) yields M = 18,478. For convenience, we round it up to 19,200 = 300 · 2^6. In the end, we successively set M to 300, 600, . . . , 19,200 and observe the detection rates. Next, we investigate a complementary scenario where the engineer has a limited budget of samples to check. We study the smallest budget required by SMC to detect all buggy variants (in the cases where it can indeed detect all of them) and the computation resources incurred compared to an exhaustive search of the state space. Thus, our second question is: RQ2 How efficient is SMC with a minimal sample budget compared to an exhaustive search? Finally, we compare family-based SMC with the alternative of sampling in each variant's state space separately. We name this alternative method enumerative SMC. Hence, our last research question is: RQ3 How does family-based SMC compare with enumerative SMC? As before, we compare the two techniques w.r.t. detection rate. We set M to the same values as in RQ1. In enumerative SMC, this means that each variant receives a budget of M/|V| samples, where M is the number of samples used in family-based SMC and V is the set of variants. Implementation. We implemented our SMC algorithms (family-based and enumerative) in a prototype tool. The tool takes as input an FTS, an LTL formula and a sample budget. Then it performs SMC until all samples are checked or until all variants are found to violate the formula. To compare with the exhaustive search we use ProVeLines [17], a state-of-the-art model checker for VIS. Dataset. We consider three systems that were used in past research to evaluate VIS model-checking algorithms [11, 14, 16]. Table 1 summarizes the characteristics of our case studies and their related properties.
The first system is a minepump system [11, 24] with 128 variants. The underlying FTS comprises 250,561 states, while the state spaces of all variants taken individually total 889,124 states. The second model is an elevator model inspired by Plath and Ryan [29]. It is composed of eight configuration options, which can be combined into 256 different variants, and its FTS has 58,945,690 states to explore. The third and last is a case study inspired by the CCSDS File Delivery Protocol (CFDP) [13], a real-world configurable spacecraft communication protocol [5]. The FTS modelling the protocol consists of 1,732,536 states to explore and 56 variants (individually totalling 2,890,399 states). We discarded the properties that are satisfied by all variants, namely Minepump #17, #33, #40; Elevator #13; CFDP #5. Indeed, these properties are not relevant for RQ1 and RQ3 since SMC is trivially correct in such cases. As for RQ2, any small sample budget would return correct results while being more efficient than the exhaustive search. This leaves us with 59 properties. Infrastructure and repetitions. We ran our experiments on a MacBook Pro 2018 with a 2.9 GHz Core-i7 processor and macOS 10.14.5. To account for random variations in the sampling, we executed 100 runs of each experiment and computed the average detection rates for each property. [Table 1 excerpt: Minepump properties #32–#41 – LTL formulae over the atoms pumpOn, methane, highWater and lowWater, several of them guarded by fairness assumptions over readCommand, readAlarm and readLevel; Elevator: 58,945,690 FTS states, 256 valid variants.]
This illustrates that our assumption regarding pmin was inappropriate for those properties: counterexamples are rarer than we imagined. The elevator study yields even better results: at M = 600, SMC detects all buggy variants for 10/18 properties; this number becomes 14/18 at M = 2,400 and 17/18 at M = 9,600. As for the remaining property, SMC with M = 19,200 detects 50% of the variants on average and we observe that this percentage consistently increases as we increase M. The results for CFDP are mixed: while the median percentage goes beyond 80% as soon as M = 1,200, it tends to saturate when increasing the number of samples. The 0.25 percentile still increases but also seems to reach an asymptotic behaviour in the trials with the highest M. A detailed look at the results reveals that for M ≥ 1,200, SMC cannot identify all buggy variants for only two properties: #3 (9 buggy variants) and #4 (4 buggy variants). At M = 19,200, SMC detects 5.43 and 3.14 buggy variants on average for those two properties, respectively. Further doubling M raises these numbers to 6.36 and 3.26. This indicates that the non-detected variants have few counterexamples, which are rare due to the smaller state space of those variants. The computation resources required by SMC to find such rare counterexamples with high confidence exceed those of model-checking the undetected variants thoroughly. An alternative would be to direct SMC towards rare executions, leaning on techniques such as [10, 21]. SMC can detect all buggy variants for 41 properties out of 59. For the remaining properties, however, SMC was unable to find the rare counterexamples of some buggy variants. This calls for new dedicated heuristics to sample those rare executions. Next, we check how much execution time SMC can spare compared to the exhaustive search. Results are shown in Table 2.
Overall, we see that SMC holds the potential to greatly accelerate the discovery of all buggy variants, achieving a total speedup of 526%, 1891% and 356% for Minepump, Elevator and CFDP, respectively. For more than half of the properties, the smallest number of samples we tried (i.e. 300) was sufficient for a thorough detection. Those properties are actually violated by all variants. The fact that SMC requires such a small number of samples means that the same bug lies in all the variants (as opposed to each variant violating the property in its own way). On the contrary, Minepump property #31 is also violated by all variants but requires a much higher sample number, which illustrates the presence of variant-specific bugs. Interestingly, the benefits of SMC are higher in the Elevator case (the largest of the three models), achieving speedups of up to 16,575%. A likely explanation is that the execution paths of the Elevator model share many similarities, which means that a single bug can lead to multiple failed executions. By sampling randomly, SMC avoids exploring thoroughly a part of the state space that contains no bug and, instead, increases the likelihood of moving to interesting (likely-buggy) parts. A striking example is property #16 (satisfied by half of the variants), where SMC reduces the verification time from 3 minutes to 4 seconds. When SMC can detect all buggy variants, it does so more efficiently than the exhaustive search for 33/41 properties, achieving speedups of multiple orders of magnitude. Figure 3 shows the detection rate achieved by the enumerative SMC for the three case studies and different numbers of samples, while the results of the family-based SMC were shown in Figure 2. In the Minepump and Elevator cases, enumerative SMC achieves a lower detection rate than family-based SMC.
In both cases, a Student t-test with α = 0.05 rejects, with statistical significance, the hypothesis that the two SMC methods yield no difference in error rate. One can observe, for instance, that, with 600 samples, enumerative SMC achieves a median detection rate of 31.13%, while family-based SMC achieves 99.86%. This tends to validate our hypothesis that family-based SMC is more effective as the variants share more executions. Indeed, on average, one state of the Minepump is shared by 3.55 variants. In the case of CFDP, however, enumerative SMC performs systematically better (up to 13.95% more). Still, the difference in median detection rate tends to disappear as more executions are sampled. Nevertheless, CFDP illustrates the main drawback of family-based SMC: it can overlook counterexamples in variants with fewer behaviours. In such cases, enumerative SMC might complement family-based SMC by sampling from the state space of specific variants. Family-based SMC can detect significantly more buggy variants than enumerative SMC, especially when few lassos are sampled. Yet, enumerative SMC remains useful for variants that have a tiny state space compared to the others and can, thus, complement the family-based method. We proposed a new simulation-based approach for finding bugs in VIS. It applies statistical model checking to FTS, an extension of transition systems designed to concisely model multiple VIS variants. Given an LTL formula, our method results in either collecting counterexamples for multiple variants at once or proving the absence of bugs. The algorithm always converges, up to some confidence error which we quantify on the FTS structure by relying on results for the coupon collector problem. After implementing the approach within a state-of-the-art tool, we empirically study its benefits and drawbacks. It turns out that a small number of samples is often sufficient to detect all buggy variants, outperforming an exhaustive search by an order of magnitude.
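The coupon-collector results mentioned above can be illustrated with a back-of-the-envelope sketch. Assuming k equally likely distinct lassos (a simplification the paper's FTS-specific bounds refine), the expected number of uniform samples needed to observe all of them is k·H_k, and a standard tail bound turns this into a sample budget for a given confidence:

```python
import math

def harmonic(k):
    """k-th harmonic number H_k = 1 + 1/2 + ... + 1/k."""
    return sum(1.0 / i for i in range(1, k + 1))

def expected_samples_to_cover(k):
    """Coupon collector: expected number of uniform draws needed to observe
    all k distinct lassos at least once, E[T] = k * H_k."""
    return k * harmonic(k)

def sample_budget(k, delta):
    """Budget from the standard tail bound P(T > k*ln(k) + c*k) <= exp(-c):
    taking c = ln(1/delta) covers all k lassos with probability >= 1 - delta."""
    return math.ceil(k * math.log(k) + k * math.log(1.0 / delta))
```

For example, covering k = 10 lassos takes about 29.3 samples in expectation; in the actual method, lassos are not equally likely, which is precisely why rare counterexamples in small-state-space variants resist detection.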
On the downside, we were unable to find counterexamples for some faulty variants and properties. This calls for future research exploiting techniques to guide the simulation towards rare bugs/events [7, 10, 21] or towards uncovered variants, relying, e.g., on distance-based sampling [22] or lightweight scheduling sampling [19]. Nevertheless, the positive outcome of our study is to show that SMC can act as a low-cost-high-reward alternative to exhaustive verification, which can provide thorough results in a majority of cases.

References

[1] Using model checking to generate tests from specifications.
[2] Principles of Model Checking.
[3] Modelling and analysing variability in product families: Model checking of modal transition systems with variability constraints.
[4] The coupon-collector problem revisited: a survey of engineering problems and computational methods.
[5] Tag and prune: A pragmatic approach to software product line implementation.
[6] Model-Based Testing of Reactive Systems: Advanced Lectures (outcome of a research seminar held in Schloss Dagstuhl).
[7] Rare event simulation with fully automated importance splitting.
[8] Multi-valued symbolic model-checking.
[9] A testing scenario for probabilistic processes.
[10] Using cross-entropy for satisfiability.
[11] Featured transition systems: Foundations for verifying variability-intensive systems and their application to LTL model checking.
[12] Software Product Lines: Practices and Patterns. SEI Series in Software Engineering.
[13] Consultative Committee for Space Data Systems (CCSDS): CCSDS File Delivery Protocol (CFDP). Blue Book.
[14] Counterexample guided abstraction refinement of product-line behavioural models.
[15] Towards sampling and simulation-based analysis of featured weighted automata.
[16] Beyond Boolean product-line model checking: Dealing with feature attributes and multi-features.
[17] ProVeLines: A product line of verifiers for software product lines.
[18] Faster statistical model checking for unbounded temporal properties.
[19] Lightweight statistical model checking in nondeterministic continuous time.
[20] Tools and Algorithms for the Construction and Analysis of Systems.
[21] Importance splitting for statistical model checking rare properties.
[22] Distance-based sampling of software configuration spaces.
[23] Feature-oriented domain analysis (FODA) feasibility study.
[24] Conic: An integrated approach to distributed computer control systems. Computers and Digital Techniques.
[25] Leveraging Applications of Formal Methods, Verification and Validation.
[26] Statistical model checking: An overview.
[27] t-wise coverage by uniform sampling.
[28] Uniform Monte-Carlo model checking.
[29] Feature integration using a feature construct.
[30] Uniform sampling of SAT solutions for configurable systems: Are we there yet?
[31] The temporal logic of programs.
[32] Product configuration frameworks: a survey.
[33] Some upper and lower bounds on the coupon collector problem.
[34] A classification and survey of analysis strategies for software product lines.
[35] Performance analysis strategies for software variants and versions.
[36] QFLan: A tool for the quantitative analysis of highly reconfigurable systems.
[37] An automata-theoretic approach to automatic program verification.
[38] Statistical verification of probabilistic properties with unbounded until.
[39] Probabilistic verification of discrete event systems using acceptance sampling.