Risk-based Test Case Prioritization Using a Fuzzy Expert System

Charitha Hettiarachchi, North Dakota State U., charitha.hettiarachc@ndsu.edu
Hyunsook Do, U. of North Texas, hyunsook.do@unt.edu
Byoungju Choi, Ewha Womans U., bjchoi@ewha.ac.kr

August 25, 2015

© 2015. This manuscript version is made available under the Elsevier user license http://www.elsevier.com/open-access/userlicense/1.0/

Context: The use of system requirements and their risks enables software testers to identify more important test cases that can reveal the faults associated with system components.

Objective: The goal of this research is to make the requirements risk estimation process more systematic and precise by reducing subjectivity using a fuzzy expert system. Further, we provide empirical results showing that the proposed approach can improve the effectiveness of test case prioritization.

Method: In this research, we used requirements modification status, complexity, security, and size of the software requirements as risk indicators and employed a fuzzy expert system to estimate the requirements risks. Further, we employed a semi-automated process to gather the required data for our approach and to make the risk estimation process less subjective.

Results: The results of our study indicate that the tests prioritized with our new approach can detect faults early and that the approach is effective at finding more faults earlier in the high-risk system components compared with the control techniques.

Conclusion: We proposed an enhanced risk-based test case prioritization approach that estimates requirements risks systematically with a fuzzy expert system. With the proposed approach, testers can detect more faults earlier than with other control techniques. Further, the proposed semi-automated, systematic approach can easily be applied to industrial applications and can help improve regression testing effectiveness.

1 Introduction

Software products change over time due to feature updates or changes in user demand. When software functionalities change, software engineers need to retest the software to ensure that the changes did not affect software quality. Regression testing is one of the important maintenance activities, but it requires a great deal of time and effort. Software companies often face time and budget pressures, so expensive and time-consuming regression testing can be a major burden for them. To overcome these schedule- and cost-related concerns with regression testing, many researchers have proposed various cost-effective regression testing techniques [20, 33, 35]; in particular, test case prioritization techniques have been actively studied because they provide appealing benefits, such as flexibility for testers who need to adjust their testing efforts to a limited time and budget [10, 13, 42, 44]. While the majority of test case prioritization approaches utilize source code information, some researchers have investigated using other software artifacts, such as system requirements and design documents, produced during early development phases [7, 25, 41]. For instance, Krishnamoorthi and Mary [25] proposed a system-level test case prioritization approach using the information obtained from the requirements specification, such as requirements completeness and implementation complexity. Srikanth et al.
[41] also introduced a system-level test case prioritization technique that analyzes and evaluates the requirements in terms of requirement volatility, complexity, customer priority, and fault proneness.

In addition to utilizing requirements information for test case prioritization, some researchers used risk information that can help identify more important test cases that are likely to detect defects associated with the system's risks (e.g., safety or security risks) [43, 48]. The results of previous research work empirically showed that the effectiveness of test case prioritization could be improved by using requirements risks. However, these risk-based test case prioritization techniques did not consider the direct relationship between requirements risks and test cases [48] or used only one type of risk, such as fault information obtained from preceding versions [43]. Further, these studies evaluated the approaches by measuring how fast the reordered test cases detected faults; because the approaches utilized risk information to prioritize tests, they should be evaluated by measuring whether the detected faults are, indeed, from the locations where risks reside in the product.

To address these limitations, our previous work [21] proposed a new requirements risk-based test case prioritization approach that considers the direct relationship between requirements risks and test cases. We also introduced a new evaluation method to measure how effective test case prioritization approaches are at detecting defects in risky components of software systems.

While our previous research was shown to be promising, the approach required human experts' involvement during the risk estimation process. Human involvement in the risk estimation process is important, but it makes the estimation process subjective and imprecise. To avoid the possible imprecision introduced by human judgment, a more systematic approach should be considered. Fuzzy expert systems have often been utilized to address such problems because they provide a mechanism to simulate the judgment and reasoning of experts in a particular field. To date, many researchers have used fuzzy expert systems in different application areas to help with complex decision-making problems, such as the diagnosis of disease [4] and risk estimation in aviation [18]. These studies have shown that fuzzy expert systems can systematically represent human expertise in a particular domain and deal with the imprecision- and subjectivity-related issues of decision making while making the decision-making process more effective.

In this research, we propose a systematic risk estimation approach using a fuzzy expert system to address the limitations of our previous approach. We also reduced the number of risk items used for the risk estimation and simplified the prioritization approach so that we can perform test case prioritization with less effort. Further, from the results of our previous requirements risk-based approach, we learned that incorporating code information with the requirements could improve the rate of fault detection. Therefore, in this study, we used code information, in addition to the information obtained from requirements specifications written in natural language, to extract requirements risks with respect to a few risk indicators. Because we use code information, the proposed approach is applied during the testing phase, after coding is done.
To evaluate our approach, we used one open source application and one industrial application developed in Java. The results of our study indicate that the systematic, risk-based test case prioritization approach can find faults earlier than other test case prioritization techniques, including our previous requirements risk-based approach. Moreover, the new approach is also better at finding more faults earlier in high-risk components than the other techniques.

The rest of the paper is organized as follows. Section 2 describes the fuzzy expert system used in this research and the related work. Section 3 describes our new prioritization technique in detail. Section 4 describes our experiment, including the research questions. Section 5 presents the results and analysis. Section 6 discusses our results and their implications. Section 7 presents the conclusions and discusses future work.

2 Fuzzy Expert Systems and Related Work

In this section, we provide background information on fuzzy expert systems and the existing work on test case prioritization, mainly focusing on techniques that use requirements, risks, and fuzzy expert systems, which are most closely related to our work.

2.1 Fuzzy Expert Systems

In this research, we use a fuzzy expert system to derive requirements modification status (RMS) and potential security threats (PST) values. The fuzzy expert system used in this work simulates a human expert's reasoning to derive the RMS and PST values for each requirement, in a similar way that a human expert would estimate these values from the same input values. Existing empirical studies [19, 46] indicate that fuzzy expert systems can improve the effectiveness of the decision-making process in many different application areas, including regression testing. Moreover, fuzzy expert systems can handle ambiguity, which in turn produces output values much closer to realistic values.

To provide a better understanding of the process of acquiring the RMS and PST values, we summarize the mechanism of the fuzzy expert system used with our approach. A fuzzy expert system is composed of fuzzy membership functions and rules. It contains four main parts: fuzzification, inference, composition, and defuzzification. Figure 1 shows the typical architecture of a fuzzy expert system. The fuzzification process transforms the crisp input into a fuzzy input set. The inference process uses the fuzzy input set to determine the fuzzy output set using the rules formulated in the knowledge base and the membership functions. The composition process aggregates all output fuzzy sets into a single fuzzy set. Finally, the defuzzification process calculates a crisp output from the fuzzy set produced by the composition process.

The knowledge base shown in Figure 1 contains the selected fuzzy rule set. In a fuzzy expert system, fuzzy rules play a vital role because they are formulated based upon the experts' knowledge about the domain of interest. A fuzzy rule's antecedent defines the fuzzy region of the input space, and its consequent defines the fuzzy region of the output space. Fuzzy rules can support not only multiple input variables but also multiple output variables. The following equation shows an example of a fuzzy rule:

if x is A and y is B then z is C

where x and y are input variables, and z is the output variable.
A is a membership function defined on variable x; B is a membership function defined on variable y; and C is a membership function defined on output variable z.

Figure 1: Architecture of a fuzzy expert system (crisp inputs are fuzzified, the fuzzy inference engine performs inference and composition using the rules in the knowledge base, and the resulting output fuzzy set is defuzzified into crisp outputs).

Fuzzy Set Theory. To understand the fuzzification process, some knowledge of fuzzy set theory is necessary. Fuzzy set theory was first introduced by Zadeh [49] in 1965, and it defines fuzzy sets as an extension of conventional sets. In a conventional set, an element either belongs or does not belong to the set. Equation 1 defines a conventional set, where µ_A(x) shows the membership of an element x in a conventional set A.

\mu_A(x) = \begin{cases} 1, & \text{if } x \in A \\ 0, & \text{if } x \notin A \end{cases}    (1)

Unlike conventional sets, fuzzy sets allow the membership of elements to be partial. Because fuzzy sets permit partial membership, the degree of membership is determined by membership functions defined over the unit interval [0, 1]. Hence, the imprecision of input data is handled by obtaining a degree of membership for each membership function defined in the fuzzy expert system. A fuzzy set is defined as shown in Equation 2.

A = \{ (x, \mu_A(x)) \mid x \in X,\ \mu_A(x) : X \rightarrow [0, 1] \}    (2)

where A is the fuzzy set, µ_A is the membership function, and X is the universe of discourse.

Fuzzification. In the fuzzification step, the input variables' values are used to determine the degree to which these values fit into each membership function used by the fuzzy rules. Several types of membership functions are available, such as triangular, trapezoidal, and Gaussian. In this research, we utilize the triangular membership function, which is widely used in fuzzy expert system based research and applications. Various empirical studies [39, 46] indicate that the triangular membership function is easy and simple to apply compared with other membership functions. As an initial investigation of the use of a fuzzy expert system, we chose the triangular membership function due to its application simplicity. The triangular membership function is specified in Equation 3.

\mu_A(x) = \begin{cases} 0, & x < a \\ (x - a)/(b - a), & a \le x \le b \\ (c - x)/(c - b), & b \le x \le c \\ 0, & x > c \end{cases}    (3)

where A is the fuzzy set, µ_A is the membership function, X is the universe of discourse, a is the lower limit, b is the modal value, and c is the upper limit. Table 1 shows the membership functions used for the input variables in our experiment.

Table 1: Input variable membership functions
Linguistic Value | Triangular Fuzzy Numbers (a, b, c)
Low    | (0, 0, 5)
Medium | (0, 5, 10)
High   | (5, 10, 10)

Inference. In the inference step, the fuzzified inputs (i.e., the degrees of the appropriate membership functions for the input variable values) are applied to each rule antecedent to determine the degree of truth for each rule. The degree of truth is applied to the consequent of each rule, so each output variable obtains an appropriate membership function. The output variable membership functions defined in our experiment are shown in Table 2.

Table 2: Output variable membership functions
Linguistic Value | Triangular Fuzzy Numbers (a, b, c)
Low    | (0, 0, 5)
Medium | (0, 5, 10)
High   | (5, 10, 10)

Our fuzzy expert system is comprised of nine rules that represent experts' knowledge and experiences regarding the risks associated with software requirements. A sample set of rules used by the fuzzy inference process to estimate RMS is shown in Table 3.

Table 3: Fuzzy rules for RMS
R1. If RML is Low and PRV is Low then RMS is Low
R2. If RML is Medium and PRV is Low then RMS is Low
R4. If RML is Medium and PRV is Medium then RMS is Medium
R9. If RML is Medium and PRV is High then RMS is High
With the first rule (R1), the membership functions of the two input variables, RML and PRV, are Low, and the membership function for the output variable, RMS, is also Low.

Composition. The composition step combines all membership functions obtained from each rule for each output variable and forms a single membership function for each output variable.

Defuzzification. The last step of a fuzzy expert system is defuzzification, which is an optional process. The defuzzification process produces a crisp output for each output variable. In our experiment, we need exact, crisp output to quantitatively measure the RMS and PST for each requirement. Therefore, in this step, the membership function resulting from the previous step is defuzzified into a single number. There are several methods to perform the defuzzification. In our fuzzy expert system, we follow the Mamdani [30] type fuzzy inference process. Therefore, we use the center of gravity (COG) method, which is considered more accurate and is widely used with Mamdani-type fuzzy expert systems. Equation 4 shows how to compute the COG that represents the crisp output for a particular output variable.

y^* = \frac{\int_a^b \mu_A(y)\, y\, dy}{\int_a^b \mu_A(y)\, dy}    (4)

where y^* is the crisp output, µ_A(y) is the aggregated membership function A on the interval [a, b], and y is the output variable.

To illustrate the aforementioned processes, we provide a simple example. Suppose we estimate the requirements modification status for a requirement using two inputs: a requirements modification level value of 10 and a potential requirements volatility value of 8. The fuzzification process fuzzifies these values to produce the degree of membership for each input membership function. The fuzzy rules defined in the expert system use the fuzzified input, and the inference process produces the degrees of membership for the output variable (requirements modification status). All membership functions obtained from each rule are combined, and the resulting membership function is defuzzified in the defuzzification process to obtain the final crisp value of 8.4 as the requirements modification status value for this requirement.
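To make this mechanism concrete, the following sketch implements a minimal Mamdani-style inference for RMS, using the triangular membership functions of Tables 1 and 2 and centre-of-gravity defuzzification (Equation 4). Only a sample of the nine rules is given in Table 3, so the full rule base below is an assumption made for illustration; for RML = 10 and PRV = 8 it yields a crisp RMS of roughly 8.1, in the same high range as the 8.4 reported above (the exact value depends on the actual rule set).

```python
import numpy as np

def tri(x, a, b, c):
    # Triangular membership (Equation 3); a == b or b == c gives a shouldered triangle.
    if x < a or x > c:
        return 0.0
    if x == b:
        return 1.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

MF = {"Low": (0, 0, 5), "Medium": (0, 5, 10), "High": (5, 10, 10)}  # Tables 1 and 2

# Assumed 3x3 rule base for RMS; only R1, R2, R4, and R9 are listed in Table 3.
RULES = [("Low", "Low", "Low"), ("Low", "Medium", "Low"), ("Low", "High", "Medium"),
         ("Medium", "Low", "Low"), ("Medium", "Medium", "Medium"), ("Medium", "High", "High"),
         ("High", "Low", "Medium"), ("High", "Medium", "High"), ("High", "High", "High")]

def rms(rml, prv, resolution=1001):
    ys = np.linspace(0, 10, resolution)
    aggregated = np.zeros_like(ys)
    for a_rml, a_prv, consequent in RULES:
        # Fuzzification + inference: rule strength is the min of the antecedent memberships.
        strength = min(tri(rml, *MF[a_rml]), tri(prv, *MF[a_prv]))
        clipped = np.minimum(strength, [tri(y, *MF[consequent]) for y in ys])
        aggregated = np.maximum(aggregated, clipped)   # composition: max over all rules
    # Defuzzification: discretised centre of gravity (Equation 4).
    return float((aggregated * ys).sum() / aggregated.sum()) if aggregated.any() else 0.0

print(round(rms(10, 8), 2))   # ~8.1 under this assumed rule base
```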
2.2 Related Work

Test case prioritization provides a way to schedule test cases so that testers can run more important or critical test cases early. Various prioritization techniques have been proposed [47], and some of them have been used by several software organizations [28, 42]. The majority of prioritization techniques have used information obtained from software source code [10, 36, 40]. For instance, one technique, total statement coverage prioritization, reorders the test cases by the total number of statements they cover. One variation of this technique, additional statement coverage prioritization, reorders the test cases by the number of new statements they cover. Other types of code information for aiding prioritization include code change history, code modification, and fault proneness of code [31, 40]. Beyond code-based information, other software artifact types, such as software requirements and design information, have also been utilized. For example, Srikanth et al. [41] proposed a test case prioritization approach using several requirements-related factors, such as requirements complexity and requirements volatility, for the early detection of severe faults. Krishnamoorthi and Mary [25] also proposed a model to prioritize test cases using the requirements specification to improve the rate of severe fault detection. Arafeen and Do [7] proposed an approach that clusters requirements based on similarities obtained through a text-mining technique and that prioritizes test cases using the requirements-tests relationship. These studies reported that using requirements information improved the effectiveness of prioritization.

In addition to requirements and design information, other researchers have used software risk information to prioritize test cases in order to run test cases that exercise code areas with potential risks as early as possible [43, 48]. Many risk-based testing techniques have adopted Amland's [6] risk model, which estimates risk exposure as the product of the probability of faults in software components and the impact (e.g., cost or damage) of the corresponding fault if it occurs in the operational environment. In our approach, we also used the risk exposure of both requirements and risk items to prioritize tests. Stallbaum et al. [43] proposed a technique, RiteDAP (risk-based test case derivation and prioritization), that can automatically generate test case scenarios from activity diagrams and can prioritize test cases using the risks associated with fault information. In the RiteDAP approach, to quantify the risk, the probability of failure for each action is estimated from the usage frequency of each action, whereas the damage (impact) caused by that particular failure is estimated through its financial losses. Yoon et al. [48] used the relationship among requirements risk exposure, risk items, and test cases to determine the order of test cases. Another paper [27] proposed a value-based software engineering framework to improve the software testing process; the proposed multi-objective feature prioritization strategy prioritizes new features by considering business importance, quality risks, testing costs, and market pressure. Further, Felderer and Schieferdecker [17] presented a framework that organizes and categorizes risk-based testing to aid the adoption of appropriate risk-based approaches according to the circumstances. Erdogan et al. [14] conducted a systematic literature review on the combined use of risk analysis and testing. This survey identified, classified, and discussed the existing approaches in terms of several factors, such as the main goals and the maturity levels of the approaches. For example, the survey discusses a model-based security testing approach proposed by Zech [51] that uses risk analysis for cloud computing environments; in Zech's approach, misuse cases are used in a model-driven approach for test code generation. These existing papers on risk-based testing demonstrate that the use of risks in software systems can help find critical functional defects that may cause severe security or safety related issues.

Another research area that is relevant to our work is fuzzy expert systems. Fuzzy expert systems have been used in areas that require expert knowledge to make decisions while minimizing several issues, such as uncertainty and subjectivity, in the decision-making process.
In general, fuzzy expert systems are applied to various domains, such as diagnosing diseases in the medical field [4, 23], risk assessment in aviation [18], risk assessment in construction projects [9], and selecting superior stocks on the stock exchange [15]. For instance, Adeli and Neshat [4] proposed a fuzzy expert system to diagnose heart disease. Recently, fuzzy expert systems have been used in software engineering areas such as software development effort prediction [5], software cost estimation [24], and risk analysis for e-commerce development [32]. For instance, Ahmed et al. [5] developed a fuzzy expert system to obtain accurate software cost and schedule estimates by managing the uncertainties and imprecision that exist in the early stages of software development. More recently, some researchers have applied fuzzy expert systems to regression testing. Schwartz and Do [39] used a fuzzy expert system to determine the most cost-effective regression testing technique for different testing environments by addressing the limitations of existing adaptive regression testing strategies. Xu et al. [46] applied a fuzzy expert system to deal with the inaccuracy and subjectivity present during the test case selection process of regression testing. In this work, we used a fuzzy expert system with requirements and their risk information to improve the risk estimation process, thus improving the effectiveness of test case prioritization.

3 Proposed Approach

In this section, we describe the proposed approach. Our new method consists of four main steps:

1. Estimate risks by correlating with requirements
2. Calculate the risk exposure for the requirements
3. Calculate the risk exposure for risk items
4. Prioritize requirements and test cases

Figure 2 gives an overview of the proposed approach. The main steps are shown in light blue boxes, and the inputs and outputs for each step are shown in the ovals. The first three steps are used to calculate the requirements priorities. In the last step, test cases are prioritized using the results produced by the first three steps. The following subsections describe each step in detail using an example that we excerpted from our experimental data, which are fully described in Sections 4 and 5.

Figure 2: Overview of the risk-based approach (requirements and risk-indicator values feed the requirements risk assessment, in which RML, PRV, CIA, and PAA pass through the fuzzy expert system to produce RMS and PST; the requirements risks are then correlated with risk items to compute the weighted risk exposure, and requirements-test traceability yields the prioritized test cases).

3.1 Estimate Risks by Correlating with the Requirements

In order to perform risk assessment for the requirements, we identify four risk indicators that have been used by previous requirements- and risk-based regression testing research [6, 21, 25, 41]: requirements complexity (RC), requirements size (RS), requirements modification status (RMS), and potential security threats (PST). These previous studies indicate that these risk indicators can be effective in finding defects in software systems, so we focus on these four risk indicators in this study, but we will consider using other risk indicators, such as usage rate [16], as we evaluate our approach in the future. While obtaining the first two risk indicators is straightforward, the last two risk indicators can be subjective, so we utilize a fuzzy expert system to reduce the subjectivity and possible errors introduced by human judgment. To calculate the first two risk indicators (RC and RS), we used both source code and requirements information, and for the others (RMS and PST), we used requirements information. We explain each of these indicators in detail.

Requirements Complexity (RC). Requirements that need complex functionalities during implementation tend to introduce more faults.
A case study conducted by Amland [6] showed that requirements that need complex functionalities at the coding phase tend to introduce a higher number of defects. In addition, the study indicated that functions with a higher number of faults have a higher McCabe complexity. Therefore, in this research, we used McCabe complexity to measure the requirements complexity (RC). We measured the McCabe complexity values of the software functionalities (source code) using the Eclipse IDE. The complexity of the requirements was determined by examining the relationships between requirements and functionalities. For the specific applications that we used in our experiment, the complexity values ranged from 1 to 12, and we normalized these values into a range from 0 to 10. A value of 0 indicated the lowest complexity, whereas 10 indicated the highest complexity for the requirements. The eighth column of Table 4 shows example complexity values used in our experiment.

Requirements Size (RS). To measure the requirements size (RS), the size of the functions associated with the requirements is used, measured in lines of code (LOC). Functions with higher LOC counts tend to contain more defects; Amland's case study [6] shows that the size of the functions can affect the number of faults in a system. In the experiment we performed, the requirements size values range from 16 to 508, and these values are normalized into a range from 0 to 10, where a value of 0 indicates the lowest size and 10 indicates the highest size for the requirements. The last column of Table 4 shows example RS values.

Requirements Modification Status (RMS). To estimate the RMS, two aspects of requirements modifications are considered: requirements modification level (RML) and potential requirements volatility (PRV). These two variables are the inputs for the fuzzy expert system that we developed for this research, and the fuzzy expert system produces the RMS from these two input variables. RMS reflects the overall modification status of each requirement.

RML represents the degree of a requirement's modification, obtained by comparing the requirement with the same requirement in the previous version. However, manual comparison of requirements can be a subjective and time-consuming process. Therefore, to reduce the amount of time and subjectivity, we developed a program that uses cosine similarity [26] to measure similarities between requirements. The program compares two given requirements and produces the requirements modification level (RML) values. This semi-automated approach helps eliminate some mistakes that may occur in a manual requirements-comparison process. Here, requirements modification includes both changes to existing requirements and the addition of new requirements. The RML values are normalized into a range from 0 to 10, where 10 indicates the highest requirements modification level and 0 indicates no modification. New requirements for a subsequent version are automatically assigned the value of 10 because functionalities associated with new requirements have a high possibility of introducing new faults to the system.
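As an illustration of the comparison step, the sketch below computes the cosine similarity between two versions of a requirement and maps it to an RML value. The mapping of dissimilarity onto the 0-10 scale is one plausible choice rather than the exact normalization used by our tool, and the example sentences are hypothetical.

```python
import math
import re
from collections import Counter

def cosine_similarity(text_a, text_b):
    # Cosine similarity between term-frequency vectors of two requirement texts [26].
    tokenize = lambda text: re.findall(r"[a-z0-9]+", text.lower())
    va, vb = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def rml(old_req, new_req):
    # Requirements modification level on a 0-10 scale (one plausible mapping):
    # a brand-new requirement gets 10, an unchanged requirement gets 0.
    if old_req is None:
        return 10.0
    return round((1.0 - cosine_similarity(old_req, new_req)) * 10, 1)

old = "The HCP creates a patient as a new user of the system."
new = "The healthcare personnel create a patient as a new user of the system and notify the patient."
print(rml(old, new))   # small value: the requirement changed only slightly
```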
After finishing this process, we examine the final RML values to ensure that the RML values obtained from the automated process reflect the actual modification levels of the requirements.

The requirements' PRV values are used to quantify the possibility of requirements changes in later versions of the system. PRV values are measured through experts' knowledge of and experience with requirements engineering. Several requirements characteristics, such as functional instability, possible interface changes, and other factors that may influence requirements modifications, are taken into account to estimate the PRV values. For example, consider a healthcare application. For such an application, requirements that calculate an insurance premium would change whenever the insurance policy changes. Thus, these requirements are assigned a higher PRV value compared with the requirements that handle patients' information. The second and third columns of Table 4 show example RML and PRV values used in our experiment, and the fourth column shows the RMS values provided by the fuzzy expert system, which is explained in Section 2.1.

Potential Security Threats (PST). Today, software security is a major concern for software applications (in particular, for web applications) due to the rapid growth of malicious activities, such as SQL injection, eavesdropping, etc., against software applications. Security flaws in a software application lead to severe consequences unless the security-related issues are identified and properly handled as early as possible. Therefore, in this approach, potential security threats (PST) is used as an indicator of the security-related risks that reside in the requirements. Again, the fuzzy expert system is employed, with a different rule set, to estimate the PST values. The two input variables contain sets of software security objectives that are identified in the software security field [1, 34]. The first input variable covers the major security objectives, such as confidentiality, integrity, and availability (CIA), and the second variable covers secondary security objectives, such as privacy, authentication, and accountability (PAA). To estimate these input variable values for each requirement, we developed a term-extraction tool to find the number of security keywords in a particular requirement that are related to the input variables. For example, suppose we identified 50 security keywords associated with the first input variable (i.e., CIA) and a particular requirement contains 5 of the 50 keywords; then the CIA variable value of the requirement is 0.1 (5/50). The input variable values obtained from the tool are then normalized into a range from 0 to 10. A value of 0 indicates no association with CIA, whereas 10 indicates the highest association. The same technique is used to obtain the PAA values for each requirement. The fifth and sixth columns of Table 4 show the CIA and PAA values, respectively, and the seventh column shows the PST values provided by the fuzzy expert system.

Table 4: Risk indicators and fuzzy input-output values (RML/PRV and CIA/PAA are fuzzy inputs; RMS and PST are the corresponding fuzzy outputs)
Requirement | RML | PRV | RMS | CIA | PAA | PST | Complexity | Size
UC1S1  | 0  | 2 | 4.6 | 7 | 6 | 8.1 | 1.9 | 5.8
UC2S1  | 0  | 3 | 4.6 | 7 | 9 | 8.3 | 1.9 | 4.3
UC8S1  | 5  | 6 | 5.0 | 5 | 7 | 6.0 | 1.9 | 2.8
UC10S1 | 6  | 5 | 5.0 | 6 | 8 | 8.3 | 1.0 | 1.6
UC11S1 | 2  | 5 | 5.0 | 5 | 7 | 4.5 | 2.8 | 9.4
UC23S3 | 3  | 5 | 5.0 | 6 | 6 | 5.7 | 1.9 | 10.0
UC26S3 | 0  | 3 | 4.6 | 4 | 5 | 3.2 | 2.8 | 4.9
...    | ...| ...| ...| ...| ...| ...| ... | ...
UC34S6 | 10 | 8 | 8.4 | 5 | 6 | 5.2 | 1.0 | 1.7
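The sketch below illustrates the kind of keyword counting performed by our term-extraction tool. The keyword lists and the direct scaling of the ratio to 0-10 are simplified assumptions; the actual lists are drawn from the security literature [1, 34], and the tool normalizes the raw ratios across requirements.

```python
import re

# Hypothetical keyword lists; the real lists come from software-security references [1, 34].
CIA_KEYWORDS = {"confidential", "confidentiality", "integrity", "availability",
                "encrypt", "encryption", "access", "secure"}
PAA_KEYWORDS = {"privacy", "private", "authenticate", "authentication",
                "accountability", "audit", "log", "consent"}

def keyword_score(requirement_text, keywords):
    # Fraction of the keyword list found in the requirement, scaled here to 0-10.
    words = set(re.findall(r"[a-z]+", requirement_text.lower()))
    return round(len(words & keywords) / len(keywords) * 10, 1)

req = "The system shall encrypt patient records and log every access for audit purposes."
cia = keyword_score(req, CIA_KEYWORDS)   # fuzzy input 1 for the PST rules
paa = keyword_score(req, PAA_KEYWORDS)   # fuzzy input 2 for the PST rules
print(cia, paa)
```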
3.2 Calculate the Risk Exposure for the Requirements

Risk assessment for software requirements is performed by considering the probability of risk occurrence (risk likelihood) for the risk indicators and the degree of possible damage (risk impact) of each indicator. We consider the four risk indicators explained in the previous step, and the weight for each indicator is determined using the analytic hierarchy process (AHP), which supports pairwise comparisons [37]. The comparison result values are normalized into a range from 1 to 5, and we obtain the final weight values for each indicator. In Table 5, columns two to five show the comparison values of the risk indicators, the sixth column shows the total, and the last column shows the priority vector (PV) values calculated for each indicator. Then, the PV values obtained from each expert are averaged. (In this example, two experts performed the pairwise comparison.) The first column of Table 6 shows the risk indicators; the second and third columns show the PV values obtained from each expert; and the fourth column shows the averaged PV values. The last column shows the normalized weight values for the risk indicators.

Table 5: Risk indicator comparison
First Expert's Comparison
Indicator | RMS | Complexity | PST | Size | Total | Priority Vector
RMS        | 0.37 | 0.37 | 0.38 | 0.35 | 0.37 | 0.37
Complexity | 0.37 | 0.37 | 0.38 | 0.30 | 0.37 | 0.36
PST        | 0.18 | 0.18 | 0.19 | 0.30 | 0.18 | 0.21
Size       | 0.05 | 0.06 | 0.03 | 0.05 | 0.05 | 0.04
Second Expert's Comparison
Indicator | RMS | Complexity | PST | Size | Total | Priority Vector
RMS        | 0.38 | 0.32 | 0.48 | 0.38 | 1.56 | 0.39
Complexity | 0.38 | 0.32 | 0.24 | 0.29 | 1.22 | 0.31
PST        | 0.19 | 0.32 | 0.24 | 0.29 | 1.03 | 0.26
Size       | 0.05 | 0.05 | 0.04 | 0.05 | 0.19 | 0.05

Table 6: Risk indicators and weights
Indicator | PV1 | PV2 | Average | Weight
Requirements Modification Status | 0.37 | 0.39 | 0.38 | 5
Requirements Complexity          | 0.36 | 0.31 | 0.34 | 4
Potential Security Threats       | 0.21 | 0.26 | 0.24 | 3
Size                             | 0.04 | 0.05 | 0.05 | 1

Equation 5 is used to calculate the risk exposure of a requirement (RE(Req)), following a common risk-assessment practice used by several researchers (i.e., the multiplication of risk likelihood and risk impact). Each risk indicator value corresponds to the risk likelihood of a requirement, and the weight of the indicator corresponds to the risk impact.

RE(Req_j) = \sum_{i=1}^{n} (W_i \times RI_{ji})    (5)

where n is the number of indicators, W_i is the weight of indicator i, and RI_{ji} is the risk value of the requirement Req_j in terms of indicator i.

The following example shows the risk calculation for the UC1S1 requirement used in our experiment ("The healthcare personnel create a patient as a new user of the system."). The last column of Table 7 shows the risks for a sample set of requirements.

RE(UC1S1) = (5*4.6) + (4*1.9) + (3*8.1) + (1*5.8) = 60.8

The risk values of the requirements, RE(Req), range from 0 to 130, where 0 indicates the lowest risk and 130 indicates the highest risk. In this example, the risk of UC1S1 is 60.8, implying an average risk level.

Table 7: Risk indicator values and the risk exposure for requirements (indicator weights: RMS 5, Complexity 4, PST 3, Size 1)
Requirement | RMS | Complexity | PST | Size | RE(Req)
UC1S1  | 4.6 | 1.9 | 8.1 | 5.8  | 60.8
UC2S1  | 4.6 | 1.9 | 8.3 | 4.3  | 59.8
UC8S1  | 5.0 | 1.9 | 6.0 | 2.8  | 53.5
UC10S1 | 5.0 | 1.0 | 8.3 | 1.6  | 55.6
UC11S1 | 5.0 | 2.8 | 4.5 | 9.4  | 59.2
UC23S3 | 5.0 | 1.9 | 5.7 | 10.0 | 59.8
UC26S3 | 4.6 | 2.8 | 3.2 | 4.9  | 48.9
...    | ... | ... | ... | ...  | ...
UC34S6 | 8.4 | 1.0 | 5.2 | 1.7  | 63.1
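The priority-vector computation can be reproduced with a few lines of linear algebra. The pairwise-comparison matrix below is a hypothetical one chosen to resemble the first expert's judgments; normalizing each column and averaging across rows gives priority-vector values close to those in Table 5, and rescaling the averaged PVs of Table 6 to the 1-5 range is one plausible way to obtain the weights used in Equation 5.

```python
import numpy as np

# Hypothetical Saaty-style pairwise comparisons for RMS, Complexity, PST, and Size.
C = np.array([[1.0, 1.0, 2.0, 7.0],
              [1.0, 1.0, 2.0, 6.0],
              [0.5, 0.5, 1.0, 6.0],
              [1/7, 1/6, 1/6, 1.0]])

# AHP approximation: normalize each column to sum to 1, then average across each row.
pv = (C / C.sum(axis=0)).mean(axis=1)
print(np.round(pv, 2))            # approx. [0.37, 0.36, 0.22, 0.05], cf. Table 5

# Rescale the averaged PVs from Table 6 to the 1-5 weight range (one plausible mapping).
avg_pv = np.array([0.38, 0.34, 0.24, 0.05])
weights = np.round(avg_pv / avg_pv.max() * 5).astype(int)
print(weights)                    # [5 4 3 1], the weights used in Equation 5
```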
3.3 Calculate the Risk Exposure of Risk Items

In the previous subsection, the process for requirements risk exposure estimation was explained. These requirements risk exposure (RE(Req)) values indicate how risky each requirement is from the system requirements' point of view. Although the RE(Req) value of a requirement reflects the risk residing in the requirement, it is not sufficient to obtain information about the association between the requirement and the potential defect types of a software system. When a particular requirement is associated with multiple defect types, that requirement has a high tendency to expose more defects. Thus, the association information between requirements and potential defect types helps identify the requirements with a higher defect density, which eventually cause common software failures. Therefore, high priorities can be assigned to the test cases that are correlated with such requirements with a higher defect density.

To prioritize the test cases using the association information between requirements and potential defect types, we adopted a process defined by Yoon et al. [48]. We improved this process by introducing weights for the risk items (RiIM). The weight values for the risk items were obtained from our previous requirements risk-based research [21], and for this research, we further calibrated the weight values by using the risk exposure values for risk items in Yoon et al.'s approach [48]. The risk items denote potential defect types for a software system. The process of calculating the risk exposure for risk items involves the following activities:

(1) Identify risk items: The risk items employed in this research were derived from studies that used different applications and standards [8, 22, 45]. The identified risk items are shown in Table 8. The input problem is one example of a risk item, because input data can cause several problems for an operating software system. For instance, input data that are not validated before execution, or input data that are beyond the valid boundary, result in operational failures for the system or a system crash.

(2) Calculate risk exposure values for the risk items (RE(RiIM)): The risk level of a particular risk item is quantified using risk exposure values. To find the risk exposure values for risk items, first, we need to find the probability of fault occurrence by considering the association between risk items and requirements. When we consider a set of requirements, some of them may be associated with a particular risk item. The probability of fault occurrence depends on the number of associated requirements: when a particular risk item is associated with multiple requirements, its probability of fault occurrence increases. Second, we need to estimate the impact of the risk items. For this purpose, we consider the risk exposure of the requirements (RE(Req)). When a particular risk item is associated with requirements that have higher risk exposure, the risk impact of that particular item increases.

In this research, we ignored the Startup/ShutDown risk item, which was used in our previous risk-based approach [21], because the Startup/ShutDown risk item was associated with only a small number of requirements and did not make a significant impact on the test case prioritization. Additionally, in Yoon et al.'s [48] approach, the Startup/ShutDown risk item indicated a relatively low exposure value. Eliminating this risk item reduced the work of the risk exposure calculation subprocess for risk items by approximately 17%.
Table 8: Software product-risk items
     | Risk Item      | Abbreviation
i    | Input Problem  | IP
ii   | Output Problem | OP
iii  | Calculation    | Calc
iv   | Interactions   | Inac
v    | Error Handling | ErHa

Table 9 shows the matrix used to calculate the risk exposure (RE(RiIM)) values for risk items and the weighted risk exposure (W-RE) values for requirements. The matrix lists the requirements, the risk exposure for the requirements, and a set of risk items. Each risk item (RiIM_x) has a severity value (SV_x) that indicates how risky the item is. The severity values are defined in Table 10. If a risk item is associated with a certain requirement, the corresponding C_yx value is 1; otherwise, the value is 0. Again, multiple associations between risk items and requirements can exist. The last row of the matrix shows the RE values for the risk items, which are calculated using Equation 6; the last column shows the W-RE values, which are calculated by combining the RE(RiIM) values with the severity values of the risk items for each requirement. The final outcome of this step, the W-RE values, is used to prioritize requirements. Equations 6 and 7 are used to calculate the risk exposure values for risk items and the weighted risk exposure values for requirements.

Table 9: Risk exposures and weighted risk exposure matrix
Requirement | RE(Req)   | RiIM_1 (SV_1) | ... | RiIM_x (SV_x) | ... | RiIM_n (SV_n) | W-RE
Req_1       | RE(Req_1) | C_11          | ... | C_1x          | ... | C_1n          | W-RE(Req_1)
...         | ...       | ...           | ... | ...           | ... | ...           | ...
Req_y       | RE(Req_y) | C_y1          | ... | C_yx          | ... | C_yn          | W-RE(Req_y)
Req_m       | RE(Req_m) | C_m1          | ... | C_mx          | ... | C_mn          | W-RE(Req_m)
            |           | RE(RiIM_1)    | ... | RE(RiIM_x)    | ... | RE(RiIM_n)    |

RE(RiIM_x) = \sum_{y=1}^{m} (RE(Req_y) \times C_{yx})    (6)

where m is the number of requirements, RE(Req_y) is the risk exposure of Req_y, and C_{yx} is 1 when requirement Req_y is associated with risk item RiIM_x and 0 otherwise.

W-RE(Req_y) = \sum_{x=1}^{n} (RE(RiIM_x) \times C_{yx} \times SV_x)    (7)

where n is the number of risk items for the system, RE(RiIM_x) is the risk exposure value of risk item RiIM_x, C_{yx} indicates the correlation between requirement Req_y and risk item RiIM_x, and SV_x is the severity value of risk item RiIM_x.

Table 10: Severity of risk items
Severity Value | Description
1 | Least critical risk item
2 | Slightly critical risk item
3 | Moderately critical risk item
4 | Very critical risk item
5 | Most critical risk item

Table 11: Example of risk exposures and weighted risk exposures of iTrust (risk-item severity values: OP 4, ErHa 5, Inac 5, IP 2, Calc 3)
Requirement | TC ID | RE(Req) | OP | ErHa | Inac | IP | Calc | W-RE
UC1S1  | TC2   | 60.8 | 1 | 1 | 0 | 1 | 0 | 61384.6
UC2S1  | TC5   | 59.8 | 1 | 1 | 1 | 1 | 0 | 88440.5
UC2S1  | TC6   | 57.2 | 1 | 1 | 1 | 1 | 0 | 88440.5
UC8S1  | TC13  | 53.5 | 1 | 1 | 0 | 1 | 1 | 68855.7
UC10S1 | TC19  | 55.6 | 1 | 1 | 1 | 1 | 1 | 95911.6
UC11S1 | TC21  | 59.2 | 1 | 1 | 1 | 0 | 0 | 79497.2
UC23S3 | TC31  | 59.8 | 1 | 1 | 1 | 1 | 1 | 95911.6
UC26S3 | TC33  | 48.9 | 1 | 1 | 1 | 0 | 0 | 79497.2
...    | ...   | ...  | ...| ...| ... | ...| ...| ...
UC34S6 | TC141 | 63.1 | 1 | 1 | 1 | 1 | 0 | 88440.5
Risk Exposure (RE(RiIM)) | | | 6146.0 | 5571.5 | 5411.2 | 4472.0 | 2490.0 |

Table 11 shows a sample data set collected from our experiment. In this example, we can see that the output problem risk item is associated with all requirements. Thus, the risk exposure value of the output problem risk item can be calculated using Equation 6 as follows:

RE(OP) = (60.8*1) + (59.8*1) + (53.5*1) + (55.6*1) + . . . + (63.1*1) = 6146.0

Because the output problem has a high RE value compared with the other risk items, it implies that the output problem is a high-risk area for this product. After calculating the RE values for the risk items, we calculate the W-RE values for each requirement by utilizing Equation 7. For instance, we obtain the W-RE value for UC1S1 as follows.
W-RE(UC1S1) = (6146.0*1*4) + (5571.5*1*5) + (5411.2*0*5) + (4472.0*1*2) + (2490.0*0*3) = 61384.6

After calculating the W-RE values for each requirement, we prioritize the requirements by their W-RE values (in descending order). Table 12 shows a portion of our data. From the table, we can see that each requirement has one or more corresponding test cases (the last subsection explains how to map these two) and that the requirements appear in descending order of their W-RE values. When multiple requirements have the same W-RE value (e.g., UC33S1 and UC26S2 both have the value 7250.3 in the table), we use the RE(Req) values as the second factor to prioritize the requirements. The next subsection describes, in detail, the prioritization of the test cases.

Table 12: Example of a prioritized test suite: iTrust
Requirement | TC-ID | W-RE   | RE(Req)
UC33S1 | TC117 | 7250.3 | 76.5
UC26S2 | TC94  | 7250.3 | 70.3
UC21   | TC30  | 7090.6 | 80.2
UC21   | TC80  | 7090.6 | 70.1
...    | ...   | ...    | ...
UC12   | TC26  | 4735.4 | 80.0
UC30S3 | TC100 | 4236.6 | 67.2
UC30S1 | TC99  | 4236.6 | 56.3
UC3S3  | TC43  | 2947.6 | 33.3
UC24   | TC91  | 2860.2 | 42.5
UC25   | TC92  | 2860.2 | 20.8
...    | ...   | ...    | ...
UC2S2  | TC77  | 1787.7 | 66.5

3.4 Prioritize the Requirements and Test Cases

In this final step, we prioritize requirements using the W-RE values and then by the risk exposure of the requirements (RE(Req)). To prioritize test cases, we need to know the relationship between the test cases and the requirements. In this research, we use the traceability matrices created by the object programs' (i.e., iTrust and Capstone) developers. Unavailability of traceability matrices could require more effort to adopt our proposed approach. However, the documents created in the early software development stages, such as the business requirements document and the functional specification document, are often used when creating test cases, which can help construct traceability information between requirements and test cases. Further, some industrial tools, such as Qmetry [2] and Microsoft Testing Manager [3], can aid in establishing traceability.

For most software applications, including our experimental programs, a single requirement is tested with multiple test cases. As a result, test cases associated with the same requirement have the same W-RE value. However, the different functions used to implement a particular requirement have different complexity levels and LOC counts and hence produce different RE(Req) values. Therefore, after mapping test cases to their corresponding requirements and functions, the test cases with the same W-RE value can be re-prioritized using the RE(Req) values. If test cases still have the same W-RE and RE(Req) values, then we randomly order those test cases.
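The end-to-end computation of steps 2-4 can be summarized in a short script. The sketch below applies Equations 5-7 and the final sort to a hypothetical two-requirement excerpt, so the absolute W-RE numbers will not match Table 11 (which aggregates over all requirements); the indicator values, correlations, weights, and severities are taken from Tables 4, 6, 10, and 11.

```python
# Risk-indicator weights (Table 6) and risk-item severity values (Tables 10 and 11).
WEIGHTS = {"RMS": 5, "RC": 4, "PST": 3, "RS": 1}
SEVERITY = {"OP": 4, "ErHa": 5, "Inac": 5, "IP": 2, "Calc": 3}

def re_req(ind):
    # Equation 5: weighted sum of the risk-indicator values of one requirement.
    return sum(WEIGHTS[k] * ind[k] for k in WEIGHTS)

def re_risk_items(reqs, corr):
    # Equation 6: risk exposure of each risk item over all requirements associated with it.
    return {item: sum(re_req(reqs[r]) * corr[r].get(item, 0) for r in reqs) for item in SEVERITY}

def w_re(req, reqs, corr):
    # Equation 7: weighted risk exposure of one requirement.
    items = re_risk_items(reqs, corr)
    return sum(items[i] * corr[req].get(i, 0) * SEVERITY[i] for i in SEVERITY)

# Two-requirement excerpt (indicator values from Table 4, correlations from Table 11).
reqs = {"UC1S1": {"RMS": 4.6, "RC": 1.9, "PST": 8.1, "RS": 5.8},
        "UC2S1": {"RMS": 4.6, "RC": 1.9, "PST": 8.3, "RS": 4.3}}
corr = {"UC1S1": {"OP": 1, "ErHa": 1, "IP": 1},
        "UC2S1": {"OP": 1, "ErHa": 1, "Inac": 1, "IP": 1}}
tests = [("TC2", "UC1S1"), ("TC5", "UC2S1")]

# Step 4: order test cases by W-RE, then by RE(Req), both descending.
ranked = sorted(tests, key=lambda t: (w_re(t[1], reqs, corr), re_req(reqs[t[1]])), reverse=True)
print(ranked)   # TC5 first: UC2S1 correlates with more risk items
```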
4 Empirical Study

To investigate the effectiveness of our new risk-based test case prioritization approach, we designed and performed a controlled experiment. The following subsections describe the research questions, the objects of analysis, the independent variables, the dependent variables and measures, the experimental setup and procedure, and the threats to validity.

4.1 Research Questions

In this study, we investigate the following research questions:

RQ1: Can systematic risk-based test case prioritization improve the rate of fault detection for test suites?
RQ2: Can systematic risk-based test case prioritization find more faults in the risky components early?

4.2 Objects of Analysis

In order to evaluate the new approach, we utilized two applications: one open source application and one industrial application developed as a graduate-student project. iTrust is the open source application used for this experiment. The iTrust program is a patient-centric electronic health record system that was developed by the Realsearch Research Group at North Carolina State University. In this experiment, four versions of the iTrust application (versions 0, 1, 2, and 3) were used. We considered the functional requirements of each version, and the test cases were used to test the system functionalities associated with the system requirements. For iTrust, all the test cases used in the experiment were developed by the iTrust system developers. The industrial application, Capstone, was developed by computer science graduate students at North Dakota State University in collaboration with a local software company. Capstone is an online application that is used to automate the company's examination procedure.

The information about the system components for all versions of the two applications used in this experiment is shown in Table 13. For each system, version 0 (the base version) is not listed in the table because regression testing starts with the second version. However, the information obtained from version 0 is utilized to obtain the mutants for version 1.

Table 13: Experiment objects and associated data
Object   | Version | Requirements | Test Cases | Size (KLOC) | Mutation Faults | Mutation Groups
iTrust   | V1 | 91  | 122 | 24.42 | 54  | 13
iTrust   | V2 | 105 | 142 | 25.93 | 71  | 12
iTrust   | V3 | 108 | 157 | 26.70 | 75  | 12
Capstone | V1 | 21  | 42  | 6.82  | 118 | 23

4.3 Variables and Measures

Independent Variable. The test case prioritization technique is the independent variable manipulated in this study. We consider seven control techniques, including a requirements risk-based technique, and one heuristic prioritization technique, as follows:

• Control Techniques
  – Original (Torig): The object program provides the testing scripts. Torig executes test cases in the order in which they appear in the original testing script.
  – Total statement coverage (Tsc): This technique prioritizes test cases based on the total number of statements exercised by the test cases.
  – Code metric (Tcm): This technique uses a code metric that we defined in our previous study [7]. The code metric is calculated using three types of information obtained from the source code, Lines of Code (LOC), Nested Block Depth (NBD), and McCabe Cyclomatic Complexity (MCC), which are considered good predictors for finding error-prone modules [38, 50]; the metric is Tcm = NBD/Max(NBD) + MCC/Max(MCC) + LOC/Max(LOC).
  – Requirements-based clustering: We consider another set of techniques proposed by Arafeen and Do [7] as control techniques that follow the requirements-based clustering approach. Because not all requirements are equally important to clients, requirements clustering is used to prioritize the requirements based on their importance to the client so that the tester can pay more attention to the test cases associated with high-priority requirements. With this previous approach, test cases are clustered based on the requirements-test case association. Within clusters, test cases are prioritized using code metric information or kept in the original test case order; thus, all techniques fall into two broad categories based on the original test case order and the code metric test case order. In this research, we considered only the code metric based category, which produced relatively better results than the original order category. The clusters are then ordered in three ways: original cluster order, random cluster order, and prioritized cluster order.
Clusters are prioritized using both requirements commitment (i.e., the priority of requirements to be implemented by the developers) and code modification information. After ordering the test cases and the clusters, test cases are selected from the clusters in a round-robin fashion for the prioritization. The three requirements-based clustering techniques are as follows:
    ∗ Tcl-cm-orig (Tcco): This technique uses the code metric based test case order within clusters, and the clusters are ordered according to the original order of the clusters.
    ∗ Tcl-cm-rand (Tccr): This technique uses the code metric based test case order within clusters, and the clusters are ordered randomly.
    ∗ Tcl-cm-prior (Tccp): This technique uses the code metric based test case order within clusters, and the clusters are ordered according to requirements commitment and code modification information.
  The requirements cluster-based approach [7] used five different cluster sizes for the iTrust application. In this study, we considered only cluster sizes 10 and 20, which produced moderate and the best results, respectively. In the case of Capstone, the only cluster size used in the requirements cluster-based approach [7] was cluster size 6.
  – Requirements risk-based (Trrb): This technique prioritizes test cases based on the risks residing in the requirements and the association between a system's requirements and potential defect types.
• Heuristic (Tfrrb): The proposed technique uses a fuzzy expert system to estimate requirements risks and prioritizes the test cases as described in Section 3.

Dependent Variables and Measures. We considered two dependent variables:

• Average Percentage of Fault Detection (APFD): The APFD [12, 29] value represents the average percentage of faults detected during the execution of a particular test suite. APFD values range from 0 to 100. A prioritization technique is considered better when its APFD value is closer to 100, and the technique that obtains the highest APFD value is considered the best prioritization technique.

• Percentage of Total Risk Severity Weight (PTRSW): PTRSW is used to measure the effectiveness of test suites in terms of finding more faults in the risky components of a particular system as early as possible. PTRSW values range from 0% to 100%. Equation 8 shows how to calculate the PTRSW value. When a test suite can detect all faults in the risky components of a system at a particular test case execution rate, the PTRSW becomes 100% at that execution rate. A test case prioritization technique can be considered effective at finding more faults in the risky components if the test suite produced by that technique achieves a higher PTRSW value at a lower test case execution rate. For example, if the PTRSW value of a test suite produced by prioritization technique A is 100% when half of the test cases are executed (i.e., a 50% test execution rate), whereas the prioritized test suite for technique B produces a 70% PTRSW value for the same 50% test execution rate, then technique A is considered the more effective technique for finding faults in the risky components.

PTRSW = (STRSW / GTRSW) * 100%    (8)

The sub-total of the total risk severity weight (STRSW) and the grand total of the total risk severity weight (GTRSW) are explained in detail in the next subsection.
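For completeness, the following sketch shows how the two measures can be computed for a prioritized test order. The APFD formula is the standard one from the prioritization literature [12, 29]; the fault-detection mapping and TRSW values below are hypothetical, and every fault is assumed to be detected by at least one test in the suite.

```python
def apfd(order, detected_by):
    # APFD = 1 - (sum of first-detection positions) / (n * m) + 1 / (2n), as a percentage.
    n, m = len(order), len(detected_by)
    first = [min(order.index(t) + 1 for t in tests if t in order) for tests in detected_by.values()]
    return (1 - sum(first) / (n * m) + 1 / (2 * n)) * 100

def ptrsw(order, trsw, execution_rate):
    # Equation 8: share of the total risk severity weight covered after running
    # the first `execution_rate` fraction of the prioritized suite.
    k = int(len(order) * execution_rate)
    gtrsw = sum(trsw.get(t, 0) for t in order)
    strsw = sum(trsw.get(t, 0) for t in order[:k])
    return 100.0 * strsw / gtrsw if gtrsw else 0.0

# Hypothetical three-test, two-fault example.
order = ["TC19", "TC2", "TC5"]
print(apfd(order, {"M1": ["TC2"], "M2": ["TC19", "TC5"]}))   # ~66.7
print(ptrsw(order, {"TC19": 10, "TC2": 4, "TC5": 6}, 0.5))   # 50.0
```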
4.4 Experimental Setup and Procedure

In this study, for each requirement, we estimated the risk exposure using the fuzzy expert system and then estimated the weighted risk exposure (W-RE) values for the requirements as described in Section 3. We prioritized the requirements by their W-RE values and then by their risk exposure (RE(Req)) values. To obtain the prioritized test cases, we used the requirements-test mapping information provided by the object programs' developers.

To empirically evaluate the proposed approach, we also need fault data. Because fault data were not available for these applications, we used a set of mutation faults created for each object program during previous research [7]. The second-to-last column in Table 13 lists the total number of mutation faults for each program version. In actual testing scenarios, however, programs do not typically contain this many faults. Thus, to reflect more realistic situations, our previous study [11] introduced mutant groups, which were formed by randomly selecting mutants from the pools of mutants created for each version, ranging from 1 to 10 mutants per group. For the iTrust application, 13, 12, and 12 mutation groups were created for versions 1, 2, and 3, respectively, whereas 23 mutation groups were created for Capstone version 1.

To perform the risk estimation process, we followed the steps described in Section 3. As a human expert, one graduate student with several years of software industry experience performed the risk estimation process. To obtain the requirements modification level (RML); confidentiality, integrity, and availability (CIA) values; and privacy, authentication, and accountability (PAA) values, the student followed the process explained in Section 3. When the final RML, CIA, and PAA values were obtained, those values were reviewed by the student to check for inaccuracies. On average, only 3% of the requirements needed slight adjustments to their RML values, and less than 2% of the requirements required minor adjustments to their CIA and PAA values. The potential requirements volatility (PRV) values for the requirements were estimated based on the expert's knowledge and experience. The weight values used for the risk indicators were determined through the priority vector (PV) values of the analytic hierarchy process (AHP), which was performed by two experts (graduate students). We averaged and normalized the priority vector values obtained from each expert for every indicator to obtain the final weight values for the risk indicators. Test cases were prioritized using W-RE and RE(Req), and test cases that had the same W-RE and RE(Req) values were randomly ordered. For all the applications and versions used in this experiment, on average, only 3% of the test cases required random ordering.

To calculate the PTRSW values for each test case prioritization technique, we need to know the classes and methods modified by the mutation faults. Therefore, we used a mutation analysis tool, ByteME (Bytecode Mutation Engine, a tool from the software-artifact infrastructure repository (SIR) [12]), to locate these altered classes/methods and developed the faults-classes/methods trace files for each version of the object programs. Further, we used the requirements-classes/methods trace files to identify the association relationship between requirements and classes/methods.
We estimated the risks of the classes/methods affected by mutation faults, the risk severity weight (RSW), using the information provided by ByteME and the requirements risks. The RSW values estimated for the classes/methods were assigned to the associated mutation faults. The RSW values ranged from 0 to 10: an RSW value of 0 means that a mutation fault caused no risk for a class/method, whereas 10 indicates the highest risk caused by a mutation fault on a critical system component. The faults-tests traceability matrix is used to identify the relationship between mutation faults and test cases. One mutation fault can be detected by one or more test cases; in that case, the same RSW value is assigned to all test cases associated with the same mutation fault. On the other hand, one test case can detect several mutation faults, and all RSW values for the detected mutation faults are added together to obtain the total risk severity weight (TRSW) for that test case. The sum of the TRSW values of all test cases produces the grand total risk severity weight (GTRSW) for the test suite.

When developing software applications, software companies often face time and budget constraints, and typically, the companies cut back on testing activities to ensure a timely release of their product. When testing activities are cut short, faults slip through testing. If testing techniques can detect riskier defects earlier, then the companies could reduce potential severe consequences. To examine this situation, we consider four different test execution rates, as shown in Table 14. For example, the STRSW values for the original test order of iTrust version 2 are 4 and 19 for test execution rates of 12.5% and 25%, respectively.

Table 14 shows sample RSW, TRSW, STRSW, GTRSW, and PTRSW values for the original test case order of iTrust version 2. The first column shows the test cases; columns 2, 3, and 4 show a sample set of RSW values for the test cases; column 5 shows the TRSW and STRSW values; and the last column shows the PTRSW values for the different test execution rates. For example, the STRSW for the 50% test execution rate is 225, and the PTRSW is 46.68%. When we compare these data with the results of the other techniques shown in Table 18, these values are relatively low, meaning that the original order of test cases is unable to detect faults in the risky components effectively.

Table 14: Example test cases, PTRSW, and associated data for the iTrust original test order, version 2
Test case (Orig v2) | RSW of mutations M1 ... M71 | TRSW | PTRSW
TC7   | 3, 0, 1 | 4   |
...   | ...     | ... |
TC18  | 0, 0, 0 | 0   |
STRSW after 12.5% test case execution |  | 4   | 0.83%
...   | ...     | ... |
TC20  | 0, 1, 8 | 10  |
STRSW after 25% test case execution   |  | 19  | 3.94%
...   | ...     | ... |
TC53  | 0, 2, 2 | 5   |
STRSW after 50% test case execution   |  | 225 | 46.68%
...   | ...     | ... |
TC106 | 0, 9, 8 | 42  |
STRSW after 75% test case execution   |  | 276 | 57.26%
...   | ...     | ... |
TC136 | 0, 8, 8 | 24  |
...   | ...     | ... |
Grand total after executing the entire test suite (GTRSW) |  | 482 |

After gathering all the required data, we obtained the prioritized test suite by following the proposed approach. We calculated the APFD and PTRSW values for the prioritized test suite.

4.5 Threats to Validity

This section describes the internal, external, and construct threats to the validity of our study and the approaches we used to limit their effects.

Internal Validity: In this study, we estimated the risks residing in the functional requirements in terms of product risks and did not consider other types of software risks, such as project and process risks. This threat can be minimized by adding more appropriate risk indicators to the requirements risk estimation process. We estimated the complexity and the size of a particular requirement using the McCabe Cyclomatic Complexity (MCC) and Lines of Code (LOC) of the class/method utilized to implement that particular requirement. There are other alternatives available to measure
4.5 Threats to Validity

This section describes the internal, external, and construct threats to the validity of our study and the approaches we used to limit their effects.

Internal Validity: In this study, we estimated the risks residing in the functional requirements in terms of product risks and did not consider other types of software risks, such as project and process risks. This threat can be minimized by adding more appropriate risk indicators to the requirements risk estimation process. We estimated the complexity and the size of a particular requirement using the McCabe Cyclomatic Complexity (MCC) and Lines of Code (LOC) of the class/method used to implement that particular requirement. There are other alternatives available to measure the complexity and the size, and the results could be affected by the choice of a different alternative. However, much previous research has shown that MCC and LOC are good indicators for measuring complexity and size with ease. Moreover, the crisp output obtained from the fuzzy expert system may vary due to several factors, such as the number of input/output membership functions, the types of waveforms, and the defuzzification methods. The fuzzy expert system used for this research is based on three triangular membership functions for both the input and output variables because many previous fuzzy expert systems have used a similar configuration successfully. We utilized the center of gravity (COG) method for defuzzification because it is widely used and is also considered accurate. However, further studies can be performed to investigate the effectiveness of the proposed approach with different configurations.

External Validity: The object programs we used in this research are a small industrial application (Capstone) and a mid-size open source application (iTrust). Therefore, our findings cannot be interpreted in the context of large applications. This limitation can be addressed by applying our approach to large open source and industrial applications.

Construct Validity: Estimating the potential requirements volatility (PRV) for requirements and determining the correlation between requirements and risk items needed human involvement. Because human judgment is based on several factors, such as human experts' knowledge and experience, the results could vary from person to person. We formulated and validated the rules of our fuzzy expert system based on our experience and knowledge, but a fuzzy expert system with fewer or more rules could be developed and could potentially change the results.

5 Data and Analysis

In this section, we present the results of our study and the data analyses for each research question. We discuss further implications of the data and results in Section 6.

5.1 The effectiveness of risk-based prioritization with a fuzzy expert system for improving the fault detection rate of test cases (RQ1)

In our first research question (RQ1), we consider whether the risk-based approach that incorporates a fuzzy expert system can help improve the effectiveness of test case prioritization. To answer this question, we compared the techniques based on the results shown in Tables 15, 16, and 17. Table 15 shows the APFD values for our heuristic technique (Tfrrb); columns 2 to 4 show the APFD values for iTrust versions 1, 2, and 3, respectively, and the last column shows the APFD value for the Capstone application. The first column of Table 16 shows the control techniques; CS10 and CS20 indicate the two cluster sizes. All APFD values shown in Tables 16 and 17 are averaged over the 13, 12, and 12 data points (mutant groups) of iTrust versions 1, 2, and 3, respectively, and the 23 data points for Capstone version 1.
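The improvement rates reported in Tables 16 and 17 are consistent with the relative APFD gain of the heuristic over each control technique, i.e., assuming this definition,

\[ \mathrm{Improvement} = \frac{\mathrm{APFD}_{T_{frrb}} - \mathrm{APFD}_{control}}{\mathrm{APFD}_{control}} \times 100\%. \]

For example, for iTrust version 1 and Torig, (63.00 - 43.70) / 43.70 x 100 is approximately 44.16%, which matches the value in Table 16.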
Tables 16 and 17 also show the improvement rates for our proposed approach over the control techniques. The results in Table 16 indicate that our approach (Tfrrb) outperformed the original order (Torig), statement coverage (Tsc), code metric (Tcm), and our previous requirements risk-based approach (Trrb) across all versions; the improvement rates ranged from 5% to 175.83%. However, when we compared our approach to the cluster-based approaches, the trend varied across versions. For version 1, the heuristic outperformed the control techniques with a cluster size of 10 (the improvement rates ranged from 59.45% to 63.81%), but it was not better than the techniques with a cluster size of 20. In the case of version 2, the results were reversed: the heuristic performed slightly better than the control techniques with a cluster size of 20, and it performed slightly worse than the controls with a cluster size of 10. In the case of version 3, the heuristic outperformed all cluster-based control techniques.

The results for Capstone shown in Table 17 indicate that the proposed approach outperformed all control techniques except for the Tccp technique; the heuristic and Tccp produce the same results. For this application, the improvement rates range from 0.00% to 55.26%.

To show our results visually, we present them in boxplots. Figure 3 presents the boxplots that show the APFD values for the control techniques and the heuristic for all iTrust and Capstone versions. For iTrust version 1, each boxplot has 13 data points, and for versions 2 and 3, each has 12 data points. In the case of Capstone, each boxplot has 23 data points.

Table 15: Heuristic APFD: iTrust and Capstone

             | iTrust V1 | iTrust V2 | iTrust V3 | Capstone V1
APFD (Tfrrb) | 63.00     | 65.78     | 79.44     | 67.86

Table 16: APFD comparison and improvement over controls: iTrust

Control Technique | iTrust - V1 APFD | Improvement Over Control (%) | iTrust - V2 APFD | Improvement Over Control (%) | iTrust - V3 APFD | Improvement Over Control (%)
Torig             | 43.70 | 44.16  | 47.80 | 37.62 | 28.80 | 175.83
Tsc               | 53.02 | 18.82  | 56.75 | 15.91 | 69.26 | 14.70
Tcm               | 45.80 | 37.55  | 48.80 | 34.81 | 44.84 | 77.18
Trrb              | 60.00 | 5.00   | 60.60 | 8.55  | 56.00 | 41.86
Tcco-CS10         | 38.46 | 63.81  | 68.26 | -3.63 | 68.48 | 16.00
Tccr-CS10         | 38.88 | 62.04  | 68.30 | -3.69 | 64.54 | 23.09
Tccp-CS10         | 39.51 | 59.45  | 57.78 | 13.85 | 75.31 | 5.48
Tcco-CS20         | 65.58 | -3.93  | 62.67 | 4.96  | 66.69 | 19.12
Tccr-CS20         | 66.22 | -4.86  | 63.65 | 3.35  | 67.68 | 17.38
Tccp-CS20         | 75.45 | -16.50 | 65.33 | 0.69  | 72.41 | 9.71

Each subfigure for iTrust contains boxplots for ten prioritization techniques; the first nine boxplots present data for the control techniques, and the last one presents the heuristic technique. The subfigure for Capstone contains boxplots for eight prioritization techniques; the first seven boxplots present data for the control techniques, and the last one presents the heuristic technique. When we examine the boxplots for iTrust, the results for the requirements risk-based approaches (Trrb and Tfrrb) in version 1 show a wider distribution of data points than in the other two versions. The results for the other techniques show similar data-distribution patterns across all versions except for a couple of cases; for version 3, the control techniques that used clustering with a cluster size of 10 show a wider distribution than the other techniques. Overall, the heuristic shows better results than the controls across all versions of iTrust except for a few cases (the clustering-based approaches for version 2).
In the case of Capstone, all techniques show a similar data-distribution pattern, and the heuristic produces the best median value (indicated with a line in the box) compared to the control techniques, and the best average value (indicated with a diamond) except for one case (Tcm).

Figure 3: APFD boxplots for all controls and the heuristic: iTrust and Capstone (panels: iTrust version 1, iTrust version 2, iTrust version 3, and Capstone version 1)

Table 17: APFD comparison and improvement over controls: Capstone

Control Technique | Capstone - V1 APFD | Improvement Over Control (%)
Torig             | 43.70 | 55.26
Tsc               | 55.19 | 22.96
Tcm               | 67.80 | 0.07
Trrb              | 62.88 | 7.90
Tcco              | 61.19 | 10.89
Tccr              | 61.47 | 10.37
Tccp              | 67.86 | 0.00

5.2 The effectiveness of risk-based prioritization with a fuzzy expert system to find faults in risky components (RQ2)

For the first research question, we considered whether the risk-based approach that incorporates a fuzzy expert system can find faults earlier. The results are encouraging, but we do not know whether the early detected faults are, indeed, the faults that reside in the risky components. Thus, our second research question (RQ2) considers whether the risk-based approach with a fuzzy expert system can be more effective at finding faults associated with risky components early compared to the control techniques. To evaluate RQ2, we consider five control techniques: Torig, Tsc, Tcm, Trrb, and the clustering technique that produced the best results for RQ1 (i.e., Tccp-CS20 for iTrust and Tccp for Capstone). To address this research question, we measured the percentage of the total risk severity weight (PTRSW) values, described in Section 4.3, which show how fast a technique can detect faults in the risky components.

Tables 18 and 19 show the PTRSW values for all versions of iTrust and Capstone, respectively. The first column of Tables 18 and 19 shows the version number, the second column categorizes the control techniques and the heuristic, and the third column shows the prioritization techniques. The subsequent columns show the PTRSW values when the test execution process is cut short at the execution rate specified for each column. For example, the sixth column shows the PTRSW values when 50% of the test cases are executed, meaning that, for a 50% cutting ratio, we simulate the effects of halting the testing process halfway through.

The results showed that our approach can detect more faults in the risky components earlier than the controls except for a few cases (Trrb in version 1 of iTrust). For version 1, the heuristic (Tfrrb) and the requirements risk-based approach (Trrb) showed very similar results, but for versions 2 and 3, the trend changed. For version 2, the heuristic produced relatively high fault detection rates when the test execution rates were low, and for version 3, the differences between these two techniques were more pronounced. For instance, at a 25% execution rate, the heuristic produced 70.64%, but Trrb produced only 18.94%. In the case of Capstone (Table 19), our approach outperformed all control techniques. These results indicated that using requirements risks with a fuzzy expert system during prioritization was effective in locating faults that reveal risks early. Further, even when companies need to cut their testing process short due to their product release schedule, they can still identify and fix more of the important faults within the limited time and budget than they otherwise could.
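To make the truncated-execution simulation concrete, the following is a minimal sketch of how the STRSW and PTRSW values can be computed from the per-test TRSW values of a prioritized test order. The function name and the choice to round the cut-off index are illustrative assumptions, not details taken from the paper.

    def ptrsw_at_cutoffs(trsw_in_order, rates=(0.125, 0.25, 0.50, 0.75)):
        """Given per-test TRSW values in prioritized execution order, return
        (STRSW, PTRSW) for each simulated cut-off of the testing process."""
        gtrsw = sum(trsw_in_order)                   # grand total risk severity weight
        results = {}
        for rate in rates:
            n_executed = round(len(trsw_in_order) * rate)
            strsw = sum(trsw_in_order[:n_executed])  # sub-total up to the cut-off
            results[rate] = (strsw, 100.0 * strsw / gtrsw)
        return results

Applied to the full TRSW list behind Table 14, such a computation would yield, for example, STRSW = 225 and PTRSW of about 46.68% at the 50% execution rate for the original iTrust version 2 order.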
Table 18: Percentage of total risk severity weight (PTRSW) for different test execution levels: iTrust

Version | Category  | Technique | Execution Rate 12.5% | 25%   | 50%   | 75%
V1      | Controls  | Torig     | 0.00  | 0.00  | 26.49 | 70.90
        |           | Tsc       | 12.31 | 26.07 | 70.51 | 76.49
        |           | Tcm       | 1.87  | 5.22  | 29.85 | 52.24
        |           | Tccp-CS20 | 7.46  | 7.46  | 30.97 | 71.27
        |           | Trrb      | 21.64 | 30.22 | 67.91 | 82.09
        | Heuristic | Tfrrb     | 21.64 | 26.12 | 70.52 | 79.48
V2      | Controls  | Torig     | 0.83  | 3.94  | 46.68 | 57.26
        |           | Tsc       | 42.53 | 43.36 | 58.51 | 95.13
        |           | Tcm       | 11.00 | 26.97 | 67.63 | 74.90
        |           | Tccp-CS20 | 0.83  | 17.22 | 67.43 | 74.48
        |           | Trrb      | 10.37 | 26.35 | 68.46 | 95.44
        | Heuristic | Tfrrb     | 42.74 | 43.78 | 73.24 | 95.44
V3      | Controls  | Torig     | 0.00  | 0.85  | 25.32 | 29.36
        |           | Tsc       | 46.98 | 46.98 | 47.44 | 99.07
        |           | Tcm       | 0.43  | 10.43 | 19.36 | 65.32
        |           | Tccp-CS20 | 45.53 | 45.53 | 70.85 | 90.00
        |           | Trrb      | 0.00  | 18.94 | 66.38 | 86.38
        | Heuristic | Tfrrb     | 49.79 | 70.64 | 97.66 | 99.57

Table 19: Percentage of total risk severity weight (PTRSW) for different test execution levels: Capstone

Version | Category  | Technique | Execution Rate 12.5% | 25%   | 50%   | 75%
V1      | Controls  | Torig     | 11.51 | 16.43 | 54.26 | 75.30
        |           | Tsc       | 24.15 | 37.36 | 54.68 | 77.45
        |           | Tcm       | 24.33 | 39.35 | 66.04 | 83.67
        |           | Tccp      | 11.20 | 23.55 | 52.59 | 79.75
        |           | Trrb      | 22.87 | 37.10 | 58.03 | 84.67
        | Heuristic | Tfrrb     | 24.49 | 40.08 | 66.77 | 85.45

Figure 4: PTRSW comparison graphs for all versions of iTrust and Capstone

Figure 4 presents the results graphically. The first three subgraphs show the results for iTrust, and the last graph shows the PTRSW comparison for Capstone. As we observed from Tables 18 and 19, our approach outperforms all control techniques for version 1 except for one technique (Trrb), and for versions 2 and 3, our approach outperforms the controls at all test execution rates. The trend for version 3 differs from the other versions: the results of the techniques for version 3 vary widely across test execution rates, and in particular, the clustering-based approach performs better than all other control techniques. In the case of Capstone, the heuristic outperforms all control techniques across all test execution rates. Similar to version 1 of iTrust, the heuristic and the requirements risk-based approach (Trrb) produce very similar results.

6 Discussion and Implications

In this section, we present more insight into the findings of our research and their possible implications. Through the results we obtained with our study, we draw the following observations. First, the proposed systematic, risk-based approach is capable of outperforming the original test order, the code-metric-based approach, and our previous requirements risk-based approach for all versions of the iTrust application. Cluster-based approaches with a cluster size of 20 outperform our proposed approach for version 1, while two cluster-based approaches with a cluster size of 10 outperform our approach for version 2. However, our proposed approach outperforms all cluster-based approaches for version 3. The source code of version 3 was significantly affected by the requirements modifications for that version. In our previous requirements risk-based approach [21], we did not obtain better results for version 3 compared to the cluster-based approaches because we only used one code-related risk factor, lines of code (LOC), to estimate requirements risk.
However, in this research, we use both LOC and MCC, and we speculate that using more source code data to extract risk information contributed to the better results obtained for version 3, which underwent major code modifications. In the case of Capstone, our approach outperforms all control techniques while producing the same result as the Tccp approach.

Overall, the proposed approach produces very effective results across all versions of the iTrust and Capstone programs in spite of the imprecise, inconsistent, and complex conditions that may occur when extracting risk information from software requirements. Using a fuzzy expert system contributes to handling such circumstances successfully and facilitates more realistic risk estimation. Further, we emulate expert thinking in the risk estimation procedure by using our fuzzy expert system, which minimizes subjectivity in the risk estimation procedure. Additionally, we employ a semi-automated process to assess the risk factors for our risk estimation procedure. Hence, having a systematic risk estimation procedure is the most likely reason for obtaining better results with all versions of every object program.

Second, the results indicated that our proposed approach has the ability to detect more faults early in the risky components of software applications. In particular, for the third version of iTrust, our proposed approach detected a significantly higher number of faults in risky components at low test execution rates compared to the first and second versions. Again, the requirements modifications in version 3 and their effect on the source code are a possible reason for this outcome. In the case of Capstone, our new approach produced the best results compared to the control techniques. In the proposed approach, requirements were primarily prioritized by considering their direct relationship with critical risk items, and then the requirements were further prioritized using requirements risks, which were estimated through a less subjective, fuzzy-expert-system-based approach. Because the test cases are prioritized using the association between requirements and test cases, the top test cases in the prioritized test suite were able to detect faults in the high-risk components early. Therefore, in this research, considering the direct relationship between critical risk items and test cases was the major reason for the early detection of more faults in high-risk components.

The results of this research have important implications for the software industry. Modern software systems are very complex and vulnerable to malicious attacks, and frequent changes are inevitable to maintain the systems' quality. Software systems that are intended to be available online (web-based) and systems that are considered mission critical are even more complex, undergo more frequent changes, and raise much greater security-related concerns. The proposed approach pays particular attention to these factors in terms of fault detection. Hence, by adopting our proposed approach, software development companies can detect faults in their modern applications within a short time frame. Furthermore, if a company has to shorten its regression testing process due to any constraints (time, budget, etc.), it can still detect more faults in its applications, including faults in high-risk components, with the limited resources.
In particular, the early detection of more defects in critical systems is very important because such defects can eventually lead to severe failures, such as life-threatening conditions or huge financial losses. Therefore, using our proposed approach, companies can develop their software systems with more confidence and cost-effectiveness while meeting their tight production deadlines.

7 Conclusions and Future Work

In this paper, we presented a systematic risk estimation approach using a fuzzy expert system, which can minimize the subjectivity, imprecision, and inconsistency issues that the requirements risk estimation process confronts. We empirically evaluated the new approach using two Java applications with multiple versions. The results of this study demonstrated that our new systematic, risk-based approach can detect faults earlier and is even better at finding faults in the risky components earlier than the control techniques. With the proposed approach, software companies can manage their testing and release schedules better by providing early feedback to testers and developers so that the development team can fix problems as soon as possible.

While we addressed the limitations of our previous approach, as discussed in Section 4.5, there are still some limitations that need to be overcome. For example, determining the relationships between requirements and risk items is done by human experts, but this process can be improved by reducing human involvement through semi-automated approaches (e.g., semantic analysis of natural language). Another limitation involves the choice of membership functions and the defuzzification method. We considered three triangular membership functions for both the input and output variables and used the centroid defuzzification method to defuzzify the output variables, but the choice of these functions and methods could affect our results. Therefore, we plan to conduct more studies that consider different waveform types (e.g., trapezoidal, Gaussian), different numbers of membership functions, and different defuzzification methods to see how these variations affect the requirements risk estimation and the final results. Further, in this work, we only used four risk indicators, but we plan to use other risk indicators, such as usage rate. We also plan to investigate the use of other fuzzy technologies, such as the fuzzy clustering approach, to improve risk-based regression testing approaches, and we plan to evaluate these approaches considering their efficiency and effectiveness.

Acknowledgments

This work was supported, in part, by NSF CAREER Award CCF-1149389 to the University of North Texas. This work was also funded by the MSIP (Ministry of Science, ICT, and Future Planning), Korea, under the ITRC (Information Technology Research Center) support program (NIPA-2014-H0301-14-1023) supervised by NIPA (National IT Industry Promotion Agency).

References

[1] Minimum Security Requirements for Federal Information and Information Systems. http://csrc.nist.gov/publications/fips/fips200/FIPS-200-final-march.pdf, 2006.
[2] QMetry Test Management. http://www.qmetry.com, 2014.
[3] Quick Start Guide for Manual Testing using Microsoft Test Manager. https://msdn.microsoft.com/en-us/library/vstudio/dd380763, 2014.
[4] A. Adeli and M. Neshat. A fuzzy expert system for heart disease diagnosis. International MultiConference of Engineers and Computer Scientists (IMECS), 1:1–7, 2010.
[5] M.A. Ahmed, M.O. Saliu, and J. AlGhamdi.
Adaptive fuzzy logic-based framework for software development effort prediction. Information and Software Technology, 47(1):31–48, 2005.
[6] S. Amland. Risk-based testing: Risk analysis fundamentals and metrics for software testing including a financial application case study. Journal of Systems and Software, 53(3):287–295, 2000.
[7] M.J. Arafeen and H. Do. Test case prioritization using requirements-based clustering. In International Conference on Software Testing, Verification and Validation (ICST), March 2013.
[8] J. Bach. Risk and requirements-based testing. IEEE Computer, 32(6):113–114, 1999.
[9] V. Carr and J.H.M. Tah. A fuzzy approach to construction project risk assessment and analysis: construction project risk management system. Advances in Engineering Software, 32(10-11):847–857, 2001.
[10] H. Do, S. Mirarab, L. Tahvildari, and G. Rothermel. The effects of time constraints on test case prioritization: A series of controlled experiments. IEEE TSE, 26(5), September 2010.
[11] H. Do and G. Rothermel. An empirical study of regression testing techniques incorporating context and lifecycle factors and improved cost-benefit models. In Proceedings of the ACM SIGSOFT Symposium on Foundations of Software Engineering, November 2006.
[12] H. Do and G. Rothermel. On the use of mutation faults in empirical assessments of test case prioritization techniques. IEEE Transactions on Software Engineering, 32(9):733–752, September 2006.
[13] S. Elbaum, A.G. Malishevsky, and G. Rothermel. Test case prioritization: A family of empirical studies. IEEE TSE, 28(2):159–182, February 2002.
[14] G. Erdogan, Y. Li, R. Runde, F. Seehusen, and K. Stølen. Approaches for the combined use of risk analysis and testing: a systematic literature review. International Journal on Software Tools for Technology Transfer, 16(5):627–642, 2014.
[15] M. Fasanghari and G.A. Montazer. Design and implementation of fuzzy expert system for Tehran Stock Exchange portfolio recommendation. Expert Systems with Applications, 37(9):6138–6147, 2010.
[16] M. Felderer and R. Ramler. Integrating risk-based testing in industrial test processes. Software Quality Journal, 22(3):543–575, 2014.
[17] M. Felderer and I. Schieferdecker. A taxonomy of risk-based testing. International Journal on Software Tools for Technology Transfer, 16(5):559–568, 2014.
[18] M. Hadjimichael. A fuzzy expert system for aviation risk assessment. Expert Systems with Applications, 36:6512–6519, 2009.
[19] V. Hajipour, A. Kazemi, and S.M. Mousavi. A fuzzy expert system to increase accuracy and precision in measurement system analysis. Measurement, 46(8):2770–2780, 2013.
[20] M.J. Harrold, D. Rosenblum, G. Rothermel, and E. Weyuker. Empirical studies of a prediction model for regression test selection. IEEE TSE, 27(3):248–263, March 2001.
[21] C.S. Hettiarachchi, H. Do, and B. Choi. Effective regression testing using requirements and risks. In Eighth International Conference on Software Security and Reliability, pages 157–166, June 2014.
[22] IEEE. IEEE Guide to Classification for Software Anomalies. Std 1044.1-1995. Institute of Electrical and Electronics Engineers, Inc., 1996.
[23] M.A. Kadhim, M.A. Alam, and H. Kaur. Design and implementation of fuzzy expert system for back pain diagnosis. International Journal of Innovative Technology and Creative Engineering, 1:16–22, 2011.
[24] M. Kazemifard, A. Zaeri, N. Ghasem-Aghaee, M.A. Nematbakhsh, and F. Mardukhi. Fuzzy emotional COCOMO II software cost estimation (FECSCE) using multi-agent systems. Applied Soft Computing, 11(2):2260–2270, 2011.
[25] R. Krishnamoorthi and S.A. Sahaaya Arul Mary. Factor oriented requirement coverage based system test case prioritization of new and regression test cases. Information and Software Technology, 51(4):799–808, 2009.
[26] G. Kumaran and J. Allan. Text classification and named entities for new event detection. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 297–304, July 2004.
[27] Q. Li, Y. Yang, M. Li, Q. Wang, B.W. Boehm, and C. Hu. Improving software testing process: feature prioritization to make winners of success-critical stakeholders. Journal of Software: Evolution and Process, 24(7):783–801, 2012.
[28] M.J. Harrold and A. Orso. Retesting software during development and maintenance. In ICSM: Frontiers of Software Maintenance, pages 88–108, September 2008.
[29] A. Malishevsky, G. Rothermel, and S. Elbaum. Modeling the cost-benefits tradeoffs for regression testing techniques. In ICSM, pages 204–213, October 2002.
[30] E.H. Mamdani and S. Assilian. An experiment in linguistic synthesis with a fuzzy logic controller. International Journal of Man-Machine Studies, 7(1):1–13, 1975.
[31] S. Mirarab and L. Tahvildari. A prioritization approach for software test cases based on Bayesian networks. In FASE, pages 276–290, March 2007.
[32] E.W.T. Ngai and F.K.T. Wat. Fuzzy decision support system for risk analysis in e-commerce development. Decision Support Systems, 40(2):235–255, 2005.
[33] J. Offutt, J. Pan, and J.M. Voas. Procedures for reducing the size of coverage-based test sets. In Proc. Int'l. Conf. Testing Comp. Softw., pages 111–123, June 1995.
[34] M. Riaz, J. King, J. Slankas, and L. Williams. Hidden in plain sight: Automatically identifying security requirements from natural language artifacts. In 22nd IEEE International Requirements Engineering Conference (RE), pages 183–192, August 2014.
[35] G. Rothermel and M.J. Harrold. Analyzing regression test selection techniques. IEEE TSE, 22(8):529–551, August 1996.
[36] G. Rothermel, R. Untch, C. Chu, and M.J. Harrold. Prioritizing test cases for regression testing. IEEE TSE, 27(10):929–948, October 2001.
[37] T.L. Saaty. The Analytic Hierarchy Process. McGraw-Hill, 1980.
[38] N.F. Schneidewind and H.M. Hoffman. An experiment in software error data collection and analysis. IEEE TSE, 5(3):276–286, May 1979.
[39] A. Schwartz and H. Do. A fuzzy expert system for cost-effective regression testing strategies. In 29th IEEE International Conference on Software Maintenance (ICSM), pages 1–10, September 2013.
[40] M. Sherriff, M. Lake, and L. Williams. Prioritization of regression tests using singular value decomposition with empirical change records. In ISSRE, pages 81–90, November 2007.
[41] H. Srikanth, L. Williams, and J. Osborne. System test case prioritization of new and regression test cases. In ESE, pages 64–73, August 2005.
[42] A. Srivastava and J. Thiagarajan. Effectively prioritizing tests in development environment. In Proceedings of the International Symposium on Software Testing and Analysis, pages 97–106, July 2002.
[43] H. Stallbaum, A. Metzger, and K. Pohl. An automated technique for risk-based test case generation and prioritization. In Proceedings of the 3rd International Workshop on Automation of Software Test, pages 67–70, May 2008.
[44] A. Walcott, M.L. Soffa, G.M. Kapfhammer, and R.S. Roos. Time-aware test suite prioritization. In Proceedings of the International Symposium on Software Testing and Analysis, pages 1–12, July 2006.
[45] D.R. Wallace and D.R. Kuhn. Failure modes in medical device software: An analysis of 15 years of recall data. Reliability, Quality and Safety Engineering, 8(4):301–311, 2001.
[46] Z. Xu, K. Gao, and T.M. Khoshgoftaar. Application of fuzzy expert system in test case selection for system regression test. In IEEE International Conference on Information Reuse and Integration, pages 120–125, August 2005.
[47] S. Yoo and M. Harman. Regression testing minimization, selection and prioritisation: A survey. JSTVR, pages 67–120, March 2010.
[48] M. Yoon, E. Lee, M. Song, and B. Choi. A test case prioritization through correlation of requirement and risk. Journal of Software Engineering and Applications, 5(10):823–835, 2012.
[49] L.A. Zadeh. Fuzzy sets. Information and Control, 8(3):338–353, 1965.
[50] W.M. Zage and D.M. Zage. Evaluating design metrics on large-scale software. IEEE TSE, 10:75–81, 1993.
[51] P. Zech. Risk-based security testing in cloud computing environments. In Fourth IEEE International Conference on Software Testing, Verification and Validation, pages 411–414, March 2011.