When To Test? Troubleshooting with Postponed System Test

Thorsten J. Ottosen and Finn V. Jensen
Technical Report 10-001
Department of Computer Science, Aalborg University, Denmark
October 19, 2010

Abstract. Troubleshooting is about computing a cost-efficient way to repair a faulty device. In this paper we investigate the simplest action-based troubleshooting scenario with a single extension: the overall system test is not required after each action, but may be postponed until several actions have been performed. As shown by Kadane and Simon, the simple troubleshooting scenario is solvable in polynomial time. However, we conjecture that the new troubleshooting scenario is NP-hard and therefore describe a Θ(n³) time heuristic. The new heuristic guarantees optimality for a class of models, and for other classes of models we benchmark it against the optimal solution and against simpler greedy heuristics. The benchmarks are performed on artificial models as well as a real-world troubleshooting model, and they suggest that the heuristics are close to optimal.

Introduction

Imagine that you have a device which has been running well up to now, but suddenly it is malfunctioning. A set of faults F describes the possible causes of the problem. To fix the problem you have a set A of actions, which may fix the problem by repairing one or more faults. An action α has a positive cost Cα and two possible outcomes, namely "α = yes" (the problem was fixed) and "α = no" (the action failed to fix the problem). Informally, our task is to determine the best sequence of actions to perform until the problem is fixed or no more actions remain to be carried out.

For an overview of the state of the art of decision theoretic troubleshooting we refer the reader to (Jensen et al., 2001), (Langseth and Jensen, 2001), (Skaanning and Vomlel, 2001), (Langseth and Jensen, 2003), (Koca and Bilgiç, 2004a), (Koca and Bilgiç, 2004b), and (Warnquist et al., 2008). Basically all non-trivial troubleshooting domains are NP-hard (Vomlelová, 2003).
The scenario investigated in this article is similar to troubleshooting with conditional costs, but the proof of NP-hardness does not carry through in this simple scenario. Nevertheless we conjecture that our scenario is NP-hard. In traditional troubleshooting it has been assumed that each action is followed by a test that determines whether the entire system D is working. Furthermore, it has generally been assumed that the cost of this test, C_D, is so low that it need not even be modelled. We show that this assumption is incorrect in general, and we develop terminology and algorithms for this new scenario.

In this paper we shall examine troubleshooting problems where the following model parameters are given:

1. An overall problem D that we need to solve (e.g. "my printer does not work").
2. A set of faults F = {f₁, ..., f_m} describing the possible causes of the problem (e.g. "toner cartridge empty").
3. For each fault f ∈ F, a probability P(f) describing how likely it is that f is present when troubleshooting begins.
4. A set of actions A = {α₁, ..., α_n} that can potentially solve D (e.g. "change the toner cartridge").
5. For each action α ∈ A, a cost Cα describing the resources required to perform the action.
6. For each action α ∈ A, a success probability P(α = yes | f) describing the likelihood of solving D by performing the action when f is present.
7. A cost of testing whether the problem D remains, C_D > 0.

Furthermore, we make the following simplifying assumptions about the model:

Assumption 1. The single fault assumption: initially the problem D is known to exist, and it is due to the presence of a single fault from F.

Assumption 2. The action result assumption: repeating a failed action will not fix the problem.

Assumption 3. The carefulness assumption: by performing an action or testing the system, we never introduce new faults.

Assumption 4. The independent actions assumption: each action repairs a distinct set of faults that may be causing the problem D.

We uphold these assumptions throughout this paper.

Remark. Assumption 1 implies $\sum_{f \in F} P(f) = 1$, and by Assumption 4 the repair probability of an action can be computed as
$$P(\alpha = \mathrm{yes}) = \sum_{f \in F} P(\alpha = \mathrm{yes} \mid f) \cdot P(f).$$
However, since actions may be imperfect ($P(\alpha = \mathrm{yes} \mid f) < 1$), Assumption 4 does not imply $\sum_{\alpha \in A} P(\alpha = \mathrm{yes}) = 1$. This equation can also fail because not all faults need be addressed by an action.

Before we can define our optimization problem formally, we need some notation and terminology. We write ε^i to denote the evidence that the first i actions have failed (ε^0 ⊆ ε^i), and by assumption P(ε^0) = 1 because the device is faulty. We call P(α = yes) the repair probability of α and abbreviate it Pα. To make it clear when the cost of an action includes the cost of testing the system, we write C_{A+D} for C_A + C_D; thus C_A never includes the cost of testing the system. For each V ⊆ A we can form a compound action A with the properties
$$P_A = \sum_{\alpha \in V} P_\alpha, \qquad C_A = \sum_{\alpha \in V} C_\alpha, \qquad |A| = |V|.$$
To make the distinction clear, we call the actions α ∈ A atomic actions. A partition of A into k subsets {V₁, ..., V_k} thus induces a new model with independent actions {A₁, ..., A_k}, and we say that {A₁, ..., A_k} is a partition of A whenever the corresponding sets of atomic actions form one. From now on we shall not distinguish between a set of atomic actions and the corresponding compound action; that is, we use A to mean a set of atomic actions and write A = {α₁, ..., α_{|A|}}.
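The quantities above are straightforward to compute. The following Python sketch is our own illustration (the data layout and function names are assumptions, not part of the model definition); it derives the repair probabilities as in the remark and the aggregate statistics of a compound action.

    # Minimal sketch: repair probabilities and compound-action statistics.

    def repair_probabilities(p_yes_given_f, p_fault):
        """P(alpha = yes) = sum_f P(alpha = yes | f) * P(f) (Assumptions 1 and 4).
        p_yes_given_f[a][f] holds P(alpha_a = yes | f); p_fault[f] holds P(f)."""
        return [sum(row[f] * p_fault[f] for f in range(len(p_fault)))
                for row in p_yes_given_f]

    def compound_stats(V, P, C, C_D):
        """For a set V of atomic action indices: P_A, C_A, and C_{A+D}."""
        P_A = sum(P[i] for i in V)
        C_A = sum(C[i] for i in V)
        return P_A, C_A, C_A + C_D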
A troubleshooting sequence is a sequence of compound actions s = 〈A₁, ..., A_ℓ〉 prescribing the process of repeatedly performing the next action until the problem is fixed or the last action has been performed, and such that {A₁, ..., A_ℓ} is a partition of A. When each compound action contains exactly one atomic action, we say that the sequence is atomic. A subsequence of s, s[k,m] = 〈A_{k+1}, ..., A_m〉 with k ≤ m, is called a partial troubleshooting sequence if k ≥ 1 or m < ℓ. If k = 0, we may simply write s[m].

Definition 1. Let s = 〈A₁, ..., A_ℓ〉 be a troubleshooting sequence. Then the expected cost of repair (ECR) of s is the mean cost until an action succeeds or all actions have been performed:
$$\mathrm{ECR}(s) = \sum_{i=1}^{\ell} C_{A_i+D} \cdot P(\varepsilon^{i-1}).$$

Formally, the problem is to find a troubleshooting sequence with minimal ECR over all possible sequences. We call such a sequence optimal. An important definition is that of efficiency:

Definition 2. The efficiency of an action A is
$$\mathrm{ef}(A) = \frac{P_A}{C_{A+D}}.$$

Its importance stems from the following results.

Theorem 1 (Kadane and Simon (1977)). Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence, and let C_D = 0. Then s is optimal if and only if
$$\mathrm{ef}(\alpha_i) \ge \mathrm{ef}(\alpha_{i+1}) \quad \text{for } i \in \{1, \ldots, n-1\}.$$

Theorem 2 (Jensen et al. (2001)). Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence. Then a necessary condition for s to be optimal, even when C_D > 0, is that
$$\mathrm{ef}(\alpha_i) \ge \mathrm{ef}(\alpha_{i+1}) \quad \text{for } i \in \{1, \ldots, n-1\}.$$

Corollary 1. Theorem 2 holds for arbitrary troubleshooting sequences.

Proof. Because a troubleshooting sequence partitions A, the compound actions can be seen as virtual actions which are also independent, thus satisfying Assumption 4. The other assumptions cannot be invalidated.

We may also compute the ECR of any atomic troubleshooting sequence:

Theorem 3 (Jensen et al. (2001)). Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence. Then the ECR of s may be computed as
$$\mathrm{ECR}(s) = \sum_{i=1}^{n} C_{\alpha_i+D} \cdot \Big(1 - \sum_{j=1}^{i-1} P_{\alpha_j}\Big), \quad \text{where } 1 - \sum_{j=1}^{i-1} P_{\alpha_j} = P(\varepsilon^{i-1}).$$

Corollary 2. Theorem 3 holds for arbitrary troubleshooting sequences.

Proof. Same argument as for Corollary 1.

Thus, due to Assumptions 1-4, we may completely ignore F, P(f), and P(α = yes | f) once the repair probabilities have been computed. Therefore we only use Pα in the rest of this paper.

The following example shows that Theorem 1 does not extend to arbitrary troubleshooting sequences when C_D > 0.

Example 1. Consider a model with four atomic actions α₁, α₂, α₃ and α₄ and with C_D = 1. The other properties of the model are as follows:

          Pα     Cα    Cα+D   ef(α)
    α1    0.24    1      2    0.12
    α2    0.42    3      4    0.105
    α3    0.2     1      2    0.1
    α4    0.14   19     20    0.007

We have

    ECR(〈α1, α2, α3, α4〉) = 2 + 0.76 · 4 + 0.34 · 2 + 0.14 · 20 = 8.52
    ECR(〈{α1, α2}, α3, α4〉) = 5 + 0.34 · 2 + 0.14 · 20 = 8.48

thus proving that it is more efficient to form the compound action A = {α₁, α₂}.
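Corollary 2 makes the ECR of any compound-action sequence easy to compute once the Pα are known. The sketch below is our own illustration of that formula, not pseudocode from the paper; it reproduces the numbers of Example 1.

    def ecr(seq, P, C, C_D):
        """ECR of a compound-action sequence (Definition 1 / Corollary 2).
        `seq` is a list of compound actions, each a list of atomic action
        indices; P and C hold the repair probabilities and costs."""
        total, p_evidence = 0.0, 1.0             # p_evidence = P(eps^{i-1})
        for A in seq:
            C_A = sum(C[i] for i in A)
            total += (C_A + C_D) * p_evidence
            p_evidence -= sum(P[i] for i in A)   # single-fault assumption
        return total

    # Example 1 with C_D = 1:
    P = [0.24, 0.42, 0.20, 0.14]
    C = [1.0, 3.0, 1.0, 19.0]
    print(ecr([[0], [1], [2], [3]], P, C, 1.0))  # 8.52
    print(ecr([[0, 1], [2], [3]], P, C, 1.0))    # 8.48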
To be able to test heuristic algorithms, we need an algorithm that is guaranteed to find an optimal value. An exhaustive search can always do this, albeit in exponential or super-exponential time. Algorithm 1 shows an exhaustive search with pruning (line 12, applying Theorem 2) and memoization (lines 6-9).

Algorithm 1 Exhaustive Search With Compound Actions
 1: function GenerateAllSequences(A, &S)
 2:   if A = ∅ then return {〈〉}
 3:   Let all = ∅
 4:   for i = 1 to |A| do
 5:     for all (|A| choose i) ways to construct the first compound action A do
 6:       if A ∉ S then
 7:         Compute P_A and C_A for A from scratch
 8:         Set S = S ∪ {A}
 9:       end if
10:       Let after = GenerateAllSequences(A \ A, S)
11:       for s = 〈A₁, ..., A_ℓ〉 ∈ after do
12:         if ef(A₁) ≤ ef(A) then
13:           Let s′ = 〈A, A₁, ..., A_ℓ〉
14:           Set all = all ∪ {s′}
15:         end if
16:       end for
17:     end for
18:   end for
19:   return all
20: end function
21: function OptimalCompoundActionSequence(A)
22:   Let S = ∅
23:   Let all = GenerateAllSequences(A, S)
24:   return arg min_{s ∈ all} ECR(s)
25: end function

Let us briefly discuss the complexity of this algorithm. The algorithm recursively generates all possible sequences of compound actions. The number of such sequences is n! · 2^{n−1}, because for each of the n! permutations of the n atomic actions we can form 2^{n−1} different partitions. Thus the running time and memory requirement of the algorithm is O(n! · 2^n). Note, however, that this is not a tight upper bound, since a troubleshooting sequence s = 〈A₁, ..., A_ℓ〉 can be constructed by $\prod_{i=1}^{\ell} |A_i|!$ different orderings. Later we show that the complexity of an exhaustive search can be reduced to Θ(n³ · n!).

Two Simple Greedy Heuristics

In this section we shall develop two simple greedy algorithms. Furthermore, we give examples showing that neither algorithm is guaranteed to find an optimal troubleshooting sequence.

For the first algorithm we consider how the ECR of a troubleshooting sequence can be improved by merging two neighbouring actions into one compound action. We have the following result.

Theorem 4. Let the two troubleshooting sequences s and s′ be given as
s = 〈..., A_x, A_{x+1}, ...〉 and s′ = 〈..., A, ...〉
with A = A_x ∪ A_{x+1} and everything else the same. Then s′ has a strictly lower ECR than s if and only if
$$C_D > C_{A_{x+1}} \cdot \frac{P_{A_x}}{1 - \sum_{j=1}^{x} P_{A_j}}.$$

Proof. The sequence with the new compound action is strictly better exactly when ECR(s′) − ECR(s) < 0. We observe that most of the terms cancel each other out, and we are left with
$$C_{A+D} \cdot P(\varepsilon^{x-1}) - \big[ C_{A_x+D} \cdot P(\varepsilon^{x-1}) + C_{A_{x+1}+D} \cdot P(\varepsilon^{x}) \big] < 0$$
$$\Updownarrow$$
$$C_{A+D} < C_{A_x} + C_D + (C_{A_{x+1}} + C_D) \cdot \frac{P(\varepsilon^{x})}{P(\varepsilon^{x-1})}$$
$$\Updownarrow$$
$$C_{A_{x+1}} \cdot \Big( 1 - \frac{P(\varepsilon^{x})}{P(\varepsilon^{x-1})} \Big) < C_D \cdot \frac{P(\varepsilon^{x})}{P(\varepsilon^{x-1})}$$
$$\Updownarrow$$
$$C_D > C_{A_{x+1}} \cdot \Big( \frac{P(\varepsilon^{x-1})}{P(\varepsilon^{x})} - 1 \Big)$$
$$\Updownarrow$$
$$C_D > C_{A_{x+1}} \cdot \frac{P_{A_x}}{1 - \sum_{j=1}^{x} P_{A_j}}$$
where the last step uses $P(\varepsilon^{x-1}) - P(\varepsilon^{x}) = P_{A_x}$ and $P(\varepsilon^{x}) = 1 - \sum_{j=1}^{x} P_{A_j}$.

Theorem 4 immediately suggests a procedure where we greedily merge adjacent actions until the ECR can no longer be improved: we start with an atomic troubleshooting sequence and form compound actions by repeated application of Theorem 4. Justified by Theorem 1, there are two obvious choices of initial sorting: by descending efficiency or by descending P/C-value. We have summarized this procedure in Algorithm 2. (The special function 1st arg(·) returns the first index for which the boolean argument is true, or the upper bound if no such index exists.)

Algorithm 2 Computing sequence with compound actions greedily
 1: function CompoundAction(〈α₁, ..., α_n〉, x)
 2:   if x = n then
 3:     return {α_n}
 4:   end if
 5:   Let m = 1st arg_{i=x}^{n−1} ( C_D ≤ C_{α_{i+1}} · P_{α_i} / (1 − Σ_{j=1}^{i} P_{α_j}) )
 6:   return {α_x, ..., α_m}
 7: end function
 8: function GreedyCompoundActionSequence(A)
 9:   Let s′ = 〈α₁, ..., α_n〉 such that ef(α_i) ≥ ef(α_{i+1}) or P_{α_i}/C_{α_i} ≥ P_{α_{i+1}}/C_{α_{i+1}}
10:   Let s = 〈〉
11:   Let i = 1
12:   while i ≤ n do
13:     Let A = CompoundAction(s′, i)
14:     Set s = s + A
15:     Set i = i + |A|
16:   end while
17:   return s
18: end function

As specified, the algorithm runs in Θ(n²) time. However, if the sums in line 5 of CompoundAction(·) are precomputed, the total cost of all calls to CompoundAction(·) can be made linear. The running time of the algorithm is then dominated by the initial sorting of the actions, leading to a Θ(n · lg n) worst-case running time.
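A direct implementation of the merging rule is short. The sketch below is our own illustration of Algorithm 2 (not the authors' code); it multiplies the Theorem 4 test through by 1 − Σ_j P_{α_j} to avoid division.

    def greedy_merge(order, P, C, C_D):
        """Sketch of Algorithm 2: walk along an atomic sequence `order`
        (indices pre-sorted by efficiency or by P/C) and grow a compound
        action while Theorem 4 says merging the next action lowers the ECR."""
        seq, x, n = [], 0, len(order)
        cum = 0.0                        # sum of P over actions placed so far
        while x < n:
            A, i = [order[x]], x
            cum += P[order[x]]
            # merge alpha_{i+1} while C_D > C_{alpha_{i+1}} * P_{alpha_i} / (1 - cum)
            while i + 1 < n and C_D * (1.0 - cum) > C[order[i + 1]] * P[order[i]]:
                i += 1
                A.append(order[i])
                cum += P[order[i]]
            seq.append(A)
            x = i + 1
        return seq

    # For the model of Example 2 below (C_D = 10) every merge test succeeds:
    print(greedy_merge([0, 1, 2], [0.61, 0.21, 0.18], [1.0, 5.0, 3.0], 10.0))
    # -> [[0, 1, 2]], i.e. one single compound action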
Is Algorithm 2 guaranteed to find an optimal troubleshooting sequence? The example below shows it is not.

Example 2. We consider the following model with C_D = 10:

          Pα     Cα    Cα+D    ef(α)   Pα/Cα
    α1    0.61    1     11     0.055   0.61
    α2    0.21    5     15     0.014   0.042
    α3    0.18    3     13     0.013   0.06

Using the ef(·) order, Algorithm 2 first computes
$$10 < 5 \cdot \frac{0.61}{1 - 0.61} \approx 7.8.$$
Since this is false, it continues with the second test
$$10 < 3 \cdot \frac{0.21}{1 - 0.61 - 0.21} = 3.5,$$
which is also false, so the algorithm suggests the sequence consisting of the single compound action {α₁, α₂, α₃}. If we repeat the calculations using the P/C order, we get the same result. The ECR of this sequence is easily seen to be 19, but
$$\mathrm{ECR}(\langle \alpha_1, \{\alpha_2, \alpha_3\} \rangle) = 11 + 0.39 \cdot 18 = 18.02,$$
so the algorithm is not optimal.

Let us now consider another greedy heuristic. The algorithm is based on the idea that we should repeatedly form the most efficient compound action until all atomic actions have been assigned to a compound action. To do this, we need a simple way of computing the most efficient compound action from a set of atomic actions. We first need a lemma.

Lemma 1. Let a, b, c, and d be strictly positive numbers. Then
$$\frac{a+b}{c+d} \ge \frac{a}{c} \quad \text{if and only if} \quad \frac{b}{d} \ge \frac{a}{c}.$$

Proof.
$$\frac{a+b}{c+d} \ge \frac{a}{c} \;\Leftrightarrow\; a \cdot c + b \cdot c \ge a \cdot c + a \cdot d \;\Leftrightarrow\; \frac{b}{d} \ge \frac{a}{c}.$$

Theorem 5. Let V ⊆ A be given. Let A be a compound action consisting of actions from V with ef(A) maximal and with |A| minimal among those compound actions with maximal efficiency. Then A can be found by the following procedure: define ef(A) = 0 when A = ∅; then repeatedly add an action α ∈ V with the highest Pα/Cα value to A until ef(A) no longer increases.

Proof. Let A consist of actions from V such that ef(A) is maximal and |A| is minimal. With $S_P = \sum_{\alpha \in A \setminus \{\beta\}} P_\alpha$ and $S_C = C_D + \sum_{\alpha \in A \setminus \{\beta\}} C_\alpha$ for an arbitrarily chosen β ∈ A, we have
$$\mathrm{ef}(A) = \frac{\sum_{\alpha \in A} P_\alpha}{C_D + \sum_{\alpha \in A} C_\alpha} = \frac{S_P + P_\beta}{S_C + C_\beta} = \frac{P}{C}.$$
Let furthermore α ∈ V \ A. We shall prove
$$\frac{P_\beta}{C_\beta} \ge \frac{P}{C} \ge \frac{P_\alpha}{C_\alpha},$$
which implies the theorem. We first prove $P_\beta/C_\beta \ge P/C$ (i). Because ef(A) is maximal and A has minimal size, Lemma 1 gives
$$\frac{S_P + P_\beta}{S_C + C_\beta} > \frac{S_P}{S_C} \;\Leftrightarrow\; \frac{P_\beta}{C_\beta} > \frac{S_P}{S_C} \quad \text{(ii)}.$$
Now assume that (i) is false; then by Lemma 1
$$\frac{S_P + P_\beta}{S_C + C_\beta} \ge \frac{P_\beta}{C_\beta} \;\Leftrightarrow\; \frac{S_P}{S_C} \ge \frac{P_\beta}{C_\beta},$$
which contradicts (ii). We then consider the second inequality. Since ef(A) is maximal, Lemma 1 implies
$$\frac{P}{C} \ge \frac{P + P_\alpha}{C + C_\alpha} \;\Leftrightarrow\; \frac{P}{C} \ge \frac{P_\alpha}{C_\alpha},$$
which completes the proof.

Remark. In (Langseth and Jensen, 2001) a similar heuristic procedure is derived, albeit for a slightly different problem.

Algorithm 3 Greedy algorithm with max-efficient actions
 1: function GreedyCompoundActionSequence2(A)
 2:   Let s′ = 〈α₁, ..., α_n〉 such that P_{α_i}/C_{α_i} ≥ P_{α_{i+1}}/C_{α_{i+1}}
 3:   Let s = 〈〉
 4:   Repeatedly add the most efficient compound action (Theorem 5) to s
 5:   return s
 6: end function

Algorithm 3 shows a greedy algorithm based on Theorem 5. Like Algorithm 2 it can precompute the sums used in the efficiencies, giving a running time of Θ(n · lg n).
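The construction of Theorem 5 and the resulting Algorithm 3 can be sketched as follows (again our own illustration under the stated assumptions, not the authors' implementation).

    def most_efficient_compound(V, P, C, C_D):
        """Theorem 5: grow a compound action from the atomic actions in V
        (a list of indices) by repeatedly adding the action with the highest
        P/C ratio while the efficiency ef(A) strictly increases."""
        rest = sorted(V, key=lambda i: P[i] / C[i], reverse=True)
        A, P_A, C_A = [], 0.0, C_D       # C_A accumulates C_{A+D}
        best_ef = 0.0                    # ef(empty set) := 0
        for i in rest:
            new_ef = (P_A + P[i]) / (C_A + C[i])
            if new_ef <= best_ef:        # efficiency no longer increases
                break
            A.append(i)
            P_A, C_A, best_ef = P_A + P[i], C_A + C[i], new_ef
        return A

    def greedy_sequence(V, P, C, C_D):
        """Sketch of Algorithm 3: repeatedly peel off the most efficient
        compound action until every atomic action has been assigned."""
        seq, rest = [], list(V)
        while rest:
            A = most_efficient_compound(rest, P, C, C_D)
            seq.append(A)
            rest = [i for i in rest if i not in A]
        return seq

For the model of Example 3 below, greedy_sequence([0, 1, 2], [0.15, 0.35, 0.5], [1, 2, 3], 1.0) returns the single compound action [[1, 2, 0]], matching the computation in the text.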
The following example shows that Algorithm 3 is not guaranteed to find an optimal troubleshooting sequence.

Example 3. Consider the following model with C_D = 1:

          Pα     Cα    Cα+D    Pα/Cα
    α1    0.15    1      2     0.150
    α2    0.35    2      3     0.175
    α3    0.50    3      4     0.167

The P/C ordering considered is 〈α₂, α₃, α₁〉. Algorithm 3 computes the following values to determine the first compound action:

    ef({α2}) = 0.35/3 ≈ 0.117
    ef({α2, α3}) = (0.35 + 0.5)/6 ≈ 0.1417
    ef({α2, α3, α1}) = 1/7 ≈ 0.1429

Thus the suggested sequence is one single compound action with ECR 7. However,
$$\mathrm{ECR}(\langle \{\alpha_2, \alpha_3\}, \alpha_1 \rangle) = 6 + 0.15 \cdot 2 = 6.3,$$
which shows the algorithm is not optimal.

Having established the heuristic nature of both greedy algorithms, we would like to know whether they at least find a local optimum. We use the following definition of a local optimum.

Definition 3. Let s = 〈α₁, ..., α_n〉 be a sequence of atomic actions, and let s_c = 〈B₁, ..., B_m〉 be a compound action sequence. Then s_c is said to be consistent with s if
$$\forall k < l: \alpha_i \in B_k \text{ and } \alpha_j \in B_l \implies i < j.$$

In other words, s_c can be seen as an ordered partition of s. If an atomic action sequence s is given, we call the set of all compound action sequences consistent with s C(s). A local optimum is then a compound action sequence s_c ∈ C(s) with minimal ECR, and we say that the (ordered) partition induced by s_c is optimal. From Examples 2 and 3 it follows that neither Algorithm 2 nor Algorithm 3 guarantees an optimal partition.

An Algorithm For Optimal Partitioning

We have now seen how the greedy algorithms fail to provide optimal sequences even under the restriction that the solution must be consistent with a single seed sequence s of atomic actions. Since there are 2^{n−1} ways to partition the actions of s, a brute-force approach is quite impractical even for models of moderate size. By exploiting Theorem 6 below we can instead construct an O(n³) dynamic programming algorithm.

Lemma 2. Let k ≥ 0 and let s_c = 〈A₁, ..., A_ℓ〉 be a partial troubleshooting sequence that is performed after the first k atomic actions have been performed (and have failed). Then the contribution of s_c to the total ECR is given by
$$\mathrm{ECR}_k(s_c) = \sum_{i=1}^{\ell} P(\varepsilon^{k+i-1}) \cdot C_{A_i+D}, \quad \text{where} \quad P(\varepsilon^{k+i-1}) = 1 - \sum_{j=1}^{k} P_{\alpha_j} - \sum_{j=1}^{i-1} P_{A_j}.$$

Proof. Follows directly from Corollary 2.

The lemma ensures that the following definition is sound.

Definition 4. Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence. For any partial sequence s[k,m] with k ≤ m, the ECR of an optimal compound action sequence consistent with s[k,m] is defined by
$$\mathrm{ECR}^*(s[k,m]) = \begin{cases} \min_{s_c \in C(s[k,m])} \mathrm{ECR}_k(s_c) & \text{if } k < m, \\ 0 & \text{if } k = m. \end{cases}$$

Theorem 6. Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence. Then for k < m, ECR*(s[k,m]) can be calculated as
$$\mathrm{ECR}^*(s[k,m]) = \min_{i=k+1}^{m} \Big( \mathrm{ECR}_k(\langle \{\alpha_{k+1}, \ldots, \alpha_i\} \rangle) + \mathrm{ECR}^*(s[i,m]) \Big).$$

Proof. We use induction on m − k. Basis step: m − k = 1. Since there is only one compound action sequence consistent with 〈α_m〉, we get
$$\mathrm{ECR}^*(s[k,m]) = \mathrm{ECR}_k(\langle \{\alpha_m\} \rangle) + 0,$$
so the theorem holds in this case. Induction step: assume the theorem holds for m − k ≤ x, and let m − k = x + 1. Let
$$s^* = \arg\min_{s_c \in C(s[k,m])} \mathrm{ECR}_k(s_c),$$
for which we must prove
$$\mathrm{ECR}_k(s^*) = \min_{i=k+1}^{m} \Big( \mathrm{ECR}_k(\langle \{\alpha_{k+1}, \ldots, \alpha_i\} \rangle) + \mathrm{ECR}^*(s[i,m]) \Big).$$
By Lemma 2 it is correct to split the ECR into two terms corresponding to an ordered partition of s^*. Furthermore, by the induction hypothesis we can compute ECR*(s[i,m]) for any i since m − i ≤ x. Now, s^* must start with a compound action consisting of y ∈ {1, 2, ..., x+1} actions. These choices of the first compound action correspond exactly to the expressions that we minimize over, and the result follows.
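Before turning to the table-based formulation below (Algorithm 4), here is a direct memoized sketch of the Theorem 6 recurrence. It is our own illustration; note that when several partitions tie on ECR, it may break the tie differently from the tables shown in Example 4.

    from functools import lru_cache

    def optimal_partition(order, P, C, C_D):
        """Theorem 6 recurrence: optimal ECR and the ordered partition
        realising it, for a fixed atomic seed sequence `order` (indices)."""
        n = len(order)
        prefix_p = [0.0]                 # prefix_p[k] = sum of P over first k actions
        for i in order:
            prefix_p.append(prefix_p[-1] + P[i])

        @lru_cache(maxsize=None)
        def ecr_star(k, m):
            """Best (ECR, end index of the first compound action) for s[k, m]."""
            if k == m:
                return 0.0, m
            best, best_i = float("inf"), m
            for i in range(k + 1, m + 1):     # first compound = actions k+1..i
                cost = sum(C[order[j]] for j in range(k, i)) + C_D
                total = (1.0 - prefix_p[k]) * cost + ecr_star(i, m)[0]
                if total < best:
                    best, best_i = total, i
            return best, best_i

        # recover the partition from the memoized split points
        seq, k = [], 0
        while k < n:
            i = ecr_star(k, n)[1]
            seq.append([order[j] for j in range(k, i)])
            k = i
        return ecr_star(0, n)[0], seq

    # Example 4 (ef order <alpha3, alpha2, alpha1>, C_D = 2): ECR* = 7.45.
    # print(optimal_partition([2, 1, 0], [0.15, 0.35, 0.5], [1.0, 2.0, 3.0], 2.0))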
Theorem 6 shows that the problem of finding an optimal partition of a seed sequence s requires optimal partitions of smaller partial sequences (the optimal-substructure property) and that these smaller partial sequences are needed in the computation of several bigger sequences (the overlapping-subproblems property). These two properties enable a dynamic programming algorithm, that is, an algorithm that computes the solution to each subproblem only once by storing the results in a table. Algorithm 4 uses this approach to find the optimal partition of a seed sequence s; it runs in Θ(n³) time and uses Θ(n²) space, and its correctness follows from the above theorem. Algorithm 5 combines an initial sorting with Algorithm 4, and Algorithm 4 also enables a more efficient exhaustive search for the best troubleshooting sequence: Algorithm 6 runs in Θ(n³ · n!) time.

Algorithm 4 Optimal Partition into Compound Actions
 1: procedure FillCell(row, col, s, C_D, &ecrMatrix, &prevMatrix)   (requires row > col)
 2:   Let N = row − col
 3:   Let A = {α_col, ..., α_{col+N−1}} from s
 4:   Let ecr = P(ε^{col−1}) · C_{A+D}
 5:   Set ecrMatrix[row, col] = ecr
 6:   Set prevMatrix[row, col] = col − 1
 7:   for prev = col + 1 to row − 1 do
 8:     Let ecr′ = ecrMatrix[prev, col] + ecrMatrix[row, prev]
 9:     if ecr′ < ecrMatrix[row, col] then
10:       Set ecrMatrix[row, col] = ecr′
11:       Set prevMatrix[row, col] = prevMatrix[row, prev]
12:     end if
13:   end for
14: end procedure
15: function OptimalPartition(s = 〈α₁, ..., α_n〉, C_D)
16:   Let N = |s| + 1
17:   Let ecrMatrix = {0}
18:   Let prevMatrix = {0}
19:   for row = 2 to N do
20:     for col = row − 1 down to 1 do
21:       FillCell(row, col, s, C_D, ecrMatrix, prevMatrix)
22:     end for
23:   end for
24:   Let s* = 〈〉
25:   Let prev = N
26:   repeat
27:     Let i = prevMatrix[prev, 1]
28:     Let A = {α_{i+1}, ..., α_{prev−1}}
29:     Set s* = 〈A〉 + s*
30:     Set prev = i + 1
31:   until prev ≤ 1
32:   Assert(ECR(s*) = ecrMatrix[N, 1])
33:   return s*
34: end function

Algorithm 5 Compound Action Sequence with Optimal Partition
 1: function CompoundActionSequence(A)
 2:   Let s = 〈α₁, ..., α_n〉 such that ef(α_i) ≥ ef(α_{i+1}) or P_{α_i}/C_{α_i} ≥ P_{α_{i+1}}/C_{α_{i+1}}
 3:   Let s = OptimalPartition(s, C_D)
 4:   return s
 5: end function

Algorithm 6 Optimized Exhaustive Search With Compound Actions
 1: function OptimalCompoundActionSequence2(A)
 2:   Let bestEcr = ∞
 3:   Let bestSeq = 〈〉
 4:   for all n! permutations s of A do
 5:     Let s′ = OptimalPartition(s, C_D)
 6:     if ECR(s′) < bestEcr then
 7:       Set bestEcr = ECR(s′)
 8:       Set bestSeq = s′
 9:     end if
10:   end for
11:   return bestSeq
12: end function

Unfortunately, Algorithm 5 is not guaranteed to find an optimal troubleshooting sequence, as the following example shows.

Example 4. Consider the model from Example 3, but take C_D = 2:

          Pα     Cα    Cα+D    ef(α)   Pα/Cα
    α1    0.15    1      3     0.05    0.150
    α2    0.35    2      4     0.0875  0.175
    α3    0.50    3      5     0.1     0.167

For the ef(·) ordering 〈α₃, α₂, α₁〉 the algorithm constructs the following tables, here visualized as one table of pairs (ECR*, i), where the split that minimizes the (partial) ECR lies between positions i and i + 1:

    (0, 0)     (0, 0)    (0, 0)    (0, 0)
    (5, 0)     (0, 0)    (0, 0)    (0, 0)
    (7, 0)     (2, 1)    (0, 0)    (0, 0)
    (7.45, 2)  (2.45, 2) (0.45, 2) (0, 0)

Looking at the bottom-left entry, we see that the last compound action is {α₁}, and from the entry above it that the first compound action is {α₃, α₂}; the sequence 〈{α₃, α₂}, α₁〉 has ECR 7.45. Even though the initial P/C ordering is different, the algorithm returns the same troubleshooting sequence. However,
$$\mathrm{ECR}(\langle \{\alpha_1, \alpha_3\}, \alpha_2 \rangle) = 6 + 0.35 \cdot 4 = 7.4,$$
which shows that the algorithm does not find an optimal sequence.
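For completeness, the optimized exhaustive search (Algorithm 6) is equally short given a partition routine; the sketch below assumes the optimal_partition function from the earlier sketch.

    from itertools import permutations

    def optimal_sequence(P, C, C_D):
        """Sketch of Algorithm 6: try all n! atomic orderings and partition
        each one optimally; exponential, so only for small benchmark models."""
        return min((optimal_partition(list(perm), P, C, C_D)
                    for perm in permutations(range(len(P)))),
                   key=lambda pair: pair[0])

    # Example 4: ECR 7.4 (one optimal partition is <{alpha1, alpha3}, alpha2>).
    # print(optimal_sequence([0.15, 0.35, 0.5], [1.0, 2.0, 3.0], 2.0))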
A Heuristic for The General Problem

We have seen that by changing the ordering of the atomic actions, we could improve the ECR compared to keeping the ordering fixed. We shall therefore consider the orthogonal task of optimizing the ordering given that the partitioning is fixed.

Definition 5. Let s_c = 〈A₁, ..., A_ℓ〉 and s_d = 〈B₁, ..., B_ℓ〉 be two compound action sequences of the same length. Then s_c and s_d are partition equivalent if |A_i| = |B_i| for all i ∈ {1, 2, ..., ℓ}.

Lemma 3. Let s_c = 〈..., A_x, ..., A_y, ...〉 and s_d = 〈..., B_x, ..., B_y, ...〉 be two partition equivalent compound action sequences, and let the only difference between s_c and s_d be that two actions α ∈ A_x and β ∈ A_y have been swapped such that β ∈ B_x and α ∈ B_y. Then
$$\mathrm{ECR}(s_c) - \mathrm{ECR}(s_d) = (C_\alpha - C_\beta) \cdot \Big[ P(\beta) - P(\alpha) + \sum_{i=x}^{y-1} P(A_i) \Big] + [P(\beta) - P(\alpha)] \cdot \sum_{i=x+1}^{y} C_{A_i+D}.$$

Proof. Let s_c and s_d be given as above. We then have
$$\mathrm{ECR}(s_c) = \sum_{i=1}^{\ell} C_{A_i+D} \cdot \Big(1 - \sum_{j=1}^{i-1} P(A_j)\Big) = \sum_{i=1}^{\ell} C_{A_i+D} \cdot a_i,$$
$$\mathrm{ECR}(s_d) = \sum_{i=1}^{\ell} C_{B_i+D} \cdot \Big(1 - \sum_{j=1}^{i-1} P(B_j)\Big) = \sum_{i=1}^{\ell} C_{B_i+D} \cdot b_i.$$
The difference between these two numbers can be expressed as
$$\mathrm{ECR}(s_c) - \mathrm{ECR}(s_d) = \Delta_X + \Delta_I + \Delta_Y \quad \text{(i)}$$
where
• Δ_X is the difference caused by the change to C_{A_x},
• Δ_I is the difference caused by a_i changing into b_i for all compound actions strictly between A_x and A_y, and
• Δ_Y is the difference caused by the changes to C_{A_y} and a_y.
For i ∈ {1, ..., x−1, y+1, ..., ℓ} the terms cancel out. We have
$$\Delta_X = a_x \cdot (C_{A_x+D} - C_{B_x+D}) = a_x \cdot (C_\alpha - C_\beta).$$
Furthermore, since $b_i = a_i + P(\alpha) - P(\beta)$ for all $i \in \{x+1, \ldots, y\}$:
$$\Delta_I = \sum_{i=x+1}^{y-1} (a_i - b_i) \cdot C_{A_i+D} = [P(\beta) - P(\alpha)] \cdot \sum_{i=x+1}^{y-1} C_{A_i+D}.$$
Finally we have
$$\Delta_Y = a_y \cdot C_{A_y+D} - b_y \cdot C_{B_y+D} = a_y \cdot C_{A_y+D} - b_y \cdot (C_{A_y+D} - C_\beta + C_\alpha)$$
$$= (a_y - b_y) \cdot C_{A_y+D} - b_y \cdot (C_\alpha - C_\beta) = [P(\beta) - P(\alpha)] \cdot C_{A_y+D} - b_y \cdot (C_\alpha - C_\beta).$$
Now observe that
$$b_y = a_y - [P(\beta) - P(\alpha)] = \Big( a_x - \sum_{i=x}^{y-1} P(A_i) \Big) - [P(\beta) - P(\alpha)],$$
which implies that
$$\Delta_Y = [P(\beta) - P(\alpha)] \cdot C_{A_y+D} - \Big[ a_x - P(\beta) + P(\alpha) - \sum_{i=x}^{y-1} P(A_i) \Big] \cdot (C_\alpha - C_\beta).$$
Inserting these three results into Equation (i) yields
$$\mathrm{ECR}(s_c) - \mathrm{ECR}(s_d) = (C_\alpha - C_\beta) \cdot \Big[ P(\beta) - P(\alpha) + \sum_{i=x}^{y-1} P(A_i) \Big] + [P(\beta) - P(\alpha)] \cdot \sum_{i=x+1}^{y} C_{A_i+D}$$
as required.

Theorem 7. Let s_c, α and β be given as in Lemma 3. Then the ECR of s_c can be improved by swapping α and β if and only if
$$(C_\alpha - C_\beta) \cdot \Big[ P(\beta) - P(\alpha) + \sum_{i=x}^{y-1} P(A_i) \Big] > [P(\alpha) - P(\beta)] \cdot \sum_{i=x+1}^{y} C_{A_i+D}.$$

Proof. We should swap the two actions if ECR(s_c) − ECR(s_d) > 0. By Lemma 3 this is equivalent to
$$(C_\alpha - C_\beta) \cdot \Big[ P(\beta) - P(\alpha) + \sum_{i=x}^{y-1} P(A_i) \Big] + [P(\beta) - P(\alpha)] \cdot \sum_{i=x+1}^{y} C_{A_i+D} > 0,$$
and moving the second term to the right-hand side gives the inequality above.

It turns out that in certain cases we can avoid the full analysis of Theorem 7.

Proposition 1. Let s_c, α and β be given as in Lemma 3. If
P(β) > P(α) and C_α ≥ C_β, or P(β) = P(α) and C_α > C_β,
then ECR(s_c) can be improved by swapping α and β.

Proof. If P(β) > P(α) and C_α ≥ C_β, then P(α) − P(β) < 0 while C_α − C_β ≥ 0. In Theorem 7 the two other factors are strictly positive, so the left-hand side becomes non-negative while the right-hand side becomes negative, making the inequality true. If P(β) = P(α) and C_α > C_β, then the left-hand side is strictly positive while the right-hand side is zero, which also makes the inequality true.
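The Theorem 7 test translates directly into a few lines of code. The sketch below is our own illustration (0-based positions, not the authors' implementation); it is reused by the swap pass of Algorithm 7 in the next section.

    def should_swap(sc, P, C, C_D, x, a, y, b):
        """Theorem 7 test: does swapping atomic action `a` in compound
        action sc[x] with atomic action `b` in sc[y], x < y, strictly lower
        the ECR of sc (a list of lists of atomic action indices)?"""
        p_between = sum(P[i] for A in sc[x:y] for i in A)   # sum_{i=x}^{y-1} P(A_i)
        costs_after = sum(sum(C[i] for i in A) + C_D        # sum_{i=x+1}^{y} C_{A_i+D}
                          for A in sc[x + 1:y + 1])
        lhs = (C[a] - C[b]) * (P[b] - P[a] + p_between)
        return lhs > (P[a] - P[b]) * costs_after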
Based on the above results, we can easily search for a better partition equivalent troubleshooting sequence; the procedure is given in Algorithm 7. Arguably, after the algorithm has terminated, further swaps might still improve the ECR. However, we have been unable to bound the number of such swaps, and we therefore prefer an algorithm that terminates deterministically. Furthermore, when we ran the procedure until no more swaps could be made on our test models, it did not improve the ECR compared to Algorithm 7.

Algorithm 7 Optimizing a fixed partition by swapping atomic actions
 1: procedure OptimizePartition(&s_c = 〈A₁, ..., A_ℓ〉)
 2:   for x = 1 to ℓ do
 3:     for α ∈ A_x do
 4:       for y = x + 1 to ℓ do
 5:         for β ∈ A_y do
 6:           if (C_α − C_β) · [P(β) − P(α) + Σ_{i=x}^{y−1} P(A_i)] > [P(α) − P(β)] · Σ_{i=x+1}^{y} C_{A_i+D} then
 7:             Swap(α, β)
 8:           end if
 9:         end for
10:       end for
11:     end for
12:   end for
13: end procedure

It should be clear that the algorithm runs in Θ(n³) time, where n is the number of atomic actions. The reason it cannot run in Θ(n²) time is that checking Theorem 7 takes linear time on average, albeit only constant time when Proposition 1 applies. Because any swap invalidates the involved sums, they cannot be precomputed as in the greedy approaches we saw earlier.

We can now formulate the improved heuristic described in Algorithm 8.

Algorithm 8 Compound Action Sequence with Optimized Partition
 1: function CompoundActionSequence(A)
 2:   Let s = 〈α₁, ..., α_n〉 such that ef(α_i) ≥ ef(α_{i+1}) or P_{α_i}/C_{α_i} ≥ P_{α_{i+1}}/C_{α_{i+1}}
 3:   Let s = OptimalPartition(s, C_D)
 4:   OptimizePartition(s)
 5:   return s
 6: end function

We have, by means of a computer, constructed examples showing that this algorithm, too, is not guaranteed to find optimal troubleshooting sequences. However, for certain models we can prove that optimality is ensured. Consider a model where the actions can be ordered into a sequence s = 〈α₁, ..., α_n〉 such that $P_{\alpha_i} \ge P_{\alpha_{i+1}}$ and $C_{\alpha_i} \le C_{\alpha_{i+1}}$ for i ∈ {1, 2, ..., n−1}; we call such models non-reorderable models.

Theorem 8. Let s = 〈α₁, ..., α_n〉 be an atomic troubleshooting sequence in a non-reorderable model such that ef(α_i) ≥ ef(α_{i+1}). Then there exists an optimal troubleshooting sequence s_c consistent with s.

Proof. Let s_d = 〈B₁, ..., B_ℓ〉 and s_c = 〈A₁, ..., A_ℓ〉 be partition equivalent compound troubleshooting sequences where s_d is optimal and s_c is consistent with s. Assume that s_d is not consistent with s. Since s_c ≠ s_d, we can find the index x of the first pair of compound actions A_x and B_x where the two sequences differ. Then there is at least one action β ∈ A_x with β ∉ B_x and at least one action α ∈ B_x with α ∉ A_x; furthermore, α must be found in some A_y with y > x. Since s_c is consistent with s and since the model is non-reorderable,
$$P_\beta \ge P_\alpha \quad \text{and} \quad C_\beta \le C_\alpha \quad \text{(i)}.$$
We now consider two cases. Case 1: at least one of the inequalities in (i) is strict. If we apply Theorem 7 to s_d, we find that we can improve the ECR of s_d by swapping α and β. This contradicts the optimality of s_d, so in this case any optimal sequence must be consistent with s. Case 2: we have equality in both inequalities of (i). In that case we can safely swap α and β in s_d without altering the ECR. If we now have s_d = s_c, we are done; if not, we repeat the procedure until s_d = s_c. Since each swap puts at least one atomic action into its rightful compound action, the procedure terminates after a finite number of steps, thus proving that there exists an optimal sequence consistent with s.

Corollary 3. Algorithm 5 and Algorithm 8 find an optimal troubleshooting sequence in non-reorderable models.

Proof. Both algorithms call OptimalPartition(·) after sorting the actions by descending efficiency.
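Putting the pieces together, the full heuristic can be sketched as follows, assuming the should_swap and optimal_partition functions from the earlier sketches (again our own illustration; for brevity it uses only the ef(·) ordering, whereas the paper also evaluates the P/C ordering).

    def optimize_partition(sc, P, C, C_D):
        """Sketch of Algorithm 7: one deterministic pass over all pairs of
        atomic actions in different compound actions, swapping in place
        whenever Theorem 7 says the ECR strictly improves."""
        for x in range(len(sc)):
            for ai in range(len(sc[x])):
                for y in range(x + 1, len(sc)):
                    for bi in range(len(sc[y])):
                        a, b = sc[x][ai], sc[y][bi]
                        if should_swap(sc, P, C, C_D, x, a, y, b):
                            sc[x][ai], sc[y][bi] = b, a
        return sc

    def heuristic_sequence(P, C, C_D):
        """Sketch of Algorithm 8: sort by efficiency, partition the seed
        sequence optimally (Theorem 6), then run the swap pass."""
        order = sorted(range(len(P)), key=lambda i: P[i] / (C[i] + C_D),
                       reverse=True)
        _, sc = optimal_partition(order, P, C, C_D)
        return optimize_partition(sc, P, C, C_D)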
Results

In this section we investigate the performance of the heuristic algorithms. In particular, we investigate how the following two model parameters affect the precision:

1. The closeness of the initial efficiencies of the actions.
2. The closeness of the costs of the actions.

Due to the difficulty of finding an optimal solution, we could not investigate models with more than 8 actions. In the models below the repair probabilities have not been normalized, to keep the specification exact (they are of course normalized at runtime), and the stated efficiency is the efficiency (with normalized probabilities) rounded to three decimals when C_D = 0.

Model 1: close efficiencies + close costs

           α1     α2     α3     α4     α5     α6     α7     α8
  Pα      0.21   0.17   0.19   0.155  0.185  0.139  0.17   0.135
  Cα      1.8    1.4    1.7    1.3    1.6    1.2    1.5    1.1
  ef(α)   0.086  0.090  0.083  0.088  0.085  0.085  0.084  0.091

Model 2: close efficiencies + non-close costs

           α1     α2     α3     α4     α5     α6     α7     α8
  Pα      0.36   0.205  0.32   0.155  0.28   0.11   0.25   0.05
  Cα      8      4      7      3      6      2      5      1
  ef(α)   0.026  0.030  0.026  0.030  0.027  0.031  0.029  0.029

Model 3: non-close efficiencies + close costs

           α1     α2     α3     α4     α5     α6     α7     α8
  Pα      0.4    0.105  0.52   0.055  0.78   0.13   0.6    0.02
  Cα      1.8    1.4    1.7    1.3    1.6    1.2    1.5    1.1
  ef(α)   0.085  0.029  0.117  0.016  0.187  0.042  0.153  0.007

Model 4: non-close efficiencies + non-close costs

           α1     α2     α3     α4     α5     α6     α7     α8
  Pα      0.16   0.15   0.05   0.165  0.165  0.159  0.07   0.165
  Cα      8      4      7      3      6      2      5      1
  ef(α)   0.018  0.035  0.007  0.051  0.025  0.073  0.013  0.152

For each of the four initial models we ran all the algorithms for increasing values of C_D. The following observation shows when it no longer makes sense to increase C_D.

Proposition 2. Once C_D is so high that the optimal sequence consists of one compound action, increasing C_D can never change the optimal sequence.

The increment in C_D was set to one permille of the largest cost of any action in the model. Starting from C_D = 0, we kept sampling until the optimal sequence consisted of only one compound action. The tables below summarize the results for the four models, where each column describes the results of a given algorithm. The first four rows summarize the percentage deviation from the optimal ECR, and the last row shows the percentage of cases where the algorithm was optimal. The number of increments to C_D is given in parentheses after the model name. The first algorithm column is the simple ef(α_i) ≥ ef(α_{i+1}) sorting. For Algorithms 2, 5, and 8 we give two numbers: the first corresponds to an initial ef(·) ordering and the second to an initial P/C ordering.

Model 1: close efficiencies + close costs (5828)

           ef(αi) ord.  Alg. 3   Alg. 2        Alg. 5        Alg. 8
  min          0          0      0/0           0/0           0/0
  max        128.26     45.56    4.28/2.83     1.47/0.68     1.08/0.63
  mean        73.14     10.05    1.66/0.66     0.77/0.07     0.15/0.02
  median      79.37      5.97    1.52/0.37     0.79/0        0.11/0
  % opt.       1.48      0.05    1.49/26.56    1.49/62.06    39.88/84.80

Model 2: close efficiencies + non-close costs (4201)

           ef(αi) ord.  Alg. 3   Alg. 2        Alg. 5        Alg. 8
  min          0          0      0/0           0/0           0/0
  max         97.16     38.90    6.17/2.34     5.05/0.48     4.17/0.48
  mean        54.71      9.01    2.75/0.33     1.87/0.06     1.09/0.03
  median      58.87      5.71    3.15/0.07     1.77/0.01     1.17/0
  % opt.       0.33      0.05    8.40/32.85    9.50/46.04    18.64/63.91

Model 3: non-close efficiencies + close costs (79145)

           ef(αi) ord.  Alg. 3   Alg. 2        Alg. 5        Alg. 8
  min          0          0      0/0           0/0           0/0
  max        149.50      3.93    2.80/2.80     0/0           0/0
  mean       124.27      0.08    0.16/0.16     0/0           0/0
  median     137.14      0       0/0           0/0           0/0
  % opt.       0.45     76.27    73.89/73.89   100/100       100/100

Model 4: non-close efficiencies + non-close costs (18085)

           ef(αi) ord.  Alg. 3   Alg. 2        Alg. 5        Alg. 8
  min          0          0      0/0           0/0           0/0
  max        219.13      4.87    2.62/2.62     0/0           0/0
  mean       162.80      0.35    0.24/0.24     0/0           0/0
  median     180.95      0       0/0           0/0           0/0
  % opt.       0.25     53.08    76.08/76.08   100/100       100/100
We can summarize the results as follows:

1. Algorithm 8 is by far the best heuristic, and Algorithm 3 is the worst of the new heuristics. Algorithm 2 performs surprisingly well, almost as well as Algorithm 5, and would thus be the preferred choice for a system under real-time constraints.
2. Models with non-close efficiencies are much easier to solve than models with close efficiencies. Models with non-close costs seem to induce larger deviations from the optimal ECR than models with close costs.

We consider none of these findings surprising.

We have also tested the algorithms on a real-world model with 32 actions. The results are summarized in the table below, where the statistics are computed as the percentage deviation from the overall best algorithm (Algorithm 8 with P/C ordering). The −0 entries indicate that in a small percentage of the cases the corresponding algorithm was better than Algorithm 8. Also notice that the apparent conflict between "median" and "% best" in the second column is due to rounding of the median.

Model 5: 32 actions (33145)

           ef(αi) ord.  Alg. 3   Alg. 2         Alg. 5        Alg. 8
  min          0          0      0/0            −0/0          −0
  max        667.30      3.79    25.30/19.05    10.68/0.05    6.07
  mean       555.34      0.14    5.94/3.14      3.25/0        1.51
  median     598.70      0       3.73/1.32      3.59/0        1.30
  % best       0.01     48.15    0.01/0         0.01/99.43    0.01

The conclusion here is the same, with one noticeable difference: Algorithm 3 is now the third best algorithm, beaten only by Algorithms 5 and 8 with P/C ordering, and far better than the other heuristics. The fact that Algorithms 5 and 8 lie very close to each other also suggests that we are quite close to the optimal value.

Conclusion And Further Research

We have extended a classical polynomial troubleshooting scenario with a postponed system test. Although this is a somewhat simple extension, we conjecture that the new scenario is NP-hard. We have identified a class of non-reorderable models for which the best of the described heuristics are guaranteed to be optimal. For reorderable models, we found that the suggested heuristics provide quite close approximations to an optimal solution, even for heuristics with an O(n · lg n) complexity. Apart from proving or disproving NP-hardness, future research includes finding guarantees on the performance of the heuristics as well as guarantees about local optimality when optimizing a given partition.

Acknowledgements

We would like to thank Sven Skyum for the discussions that led to Theorem 5. We are also grateful to Dezide ApS for providing the real-world model used in the experiments. Finally, we would like to thank our colleagues in the Machine Intelligence group at Aalborg University for providing such a fruitful research environment.

References

F. V. Jensen, U. Kjærulff, B. Kristiansen, C. Skaanning, J. Vomlel, and M. Vomlelová. The SACSO methodology for troubleshooting complex systems. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 15:321–333, 2001.

J. Kadane and H. Simon. Optimal strategies for a class of constrained sequential problems. The Annals of Statistics, 5:237–255, 1977.

E. Koca and T. Bilgiç. Troubleshooting with questions and conditional costs. In Proceedings of the 13th Turkish Symposium on Artificial Intelligence and Artificial Neural Networks, pages 271–280, 2004a.

E. Koca and T. Bilgiç. A troubleshooting approach with dependent actions. In R. L. de Mántaras and L. Saitta, editors, ECAI 2004: 16th European Conference on Artificial Intelligence, pages 1043–1044. IOS Press, 2004b. ISBN 1-58603-452-9.
H. Langseth and F. V. Jensen. Decision theoretic troubleshooting of coherent systems. Reliability Engineering & System Safety, pages 49–62, 2003.

H. Langseth and F. V. Jensen. Heuristics for two extensions of basic troubleshooting. In Proceedings of the Seventh Scandinavian Conference on Artificial Intelligence, pages 80–89. IOS Press, 2001. ISBN 1-58603-161-9.

C. Skaanning and J. Vomlel. Troubleshooting with simultaneous models. In Symbolic and Quantitative Approaches to Reasoning with Uncertainty: 6th European Conference (ECSQARU 2001), 2001.

M. Vomlelová. Complexity of decision-theoretic troubleshooting. International Journal of Intelligent Systems, 18(2):267–277, 2003.

H. Warnquist, M. Nyberg, and P. Säby. Troubleshooting when action costs are dependent with application to a truck engine. In Proceedings of the Tenth Scandinavian Conference on Artificial Intelligence, pages 68–75, Amsterdam, The Netherlands, 2008. IOS Press. ISBN 978-1-58603-867-0.