authors: Ronchieri, Elisabetta; Canaparo, Marco; Belgiovine, Mauro
title: Software Defect Prediction on Unlabelled Datasets: A Comparative Study
date: 2020-08-19
journal: Computational Science and Its Applications - ICCSA 2020
DOI: 10.1007/978-3-030-58802-1_25

Abstract. Background: Defect prediction on unlabelled datasets is a challenging and widespread problem in software engineering. Machine learning is of great value in this context because it provides techniques - called unsupervised - that are applicable to unlabelled datasets. Objective: This study aims at comparing various approaches employed over the years on unlabelled datasets to predict the defective modules, i.e. the ones which need more attention in the testing phase. Our comparison is based on the measurement of performance metrics and on the real defectiveness information derived from software archives. Our work leverages a new dataset obtained by extracting and preprocessing metrics from a C++ software project. Method: Our empirical study takes advantage of CLAMI and of its improvement CLAMI+, which we have applied to high energy physics software datasets. Furthermore, we have used clustering techniques, such as the K-means algorithm, to find potentially critical modules. Results: Our experimental analysis has been carried out on 1 open source project with 34 software releases. We have applied 17 ML techniques to the labelled datasets obtained by following the CLAMI and CLAMI+ approaches. The two approaches have been evaluated by using different performance metrics; our results show that CLAMI+ performs better than CLAMI. The predictive average accuracy metric is around 95% for the 4 ML techniques (out of 17) that show a Kappa statistic greater than 0.80. We applied K-means on the same dataset and obtained 2 clusters labelled according to the output of CLAMI and CLAMI+. Conclusion: Based on the results of the different statistical tests, we conclude that no significant performance differences have been found among the selected classification techniques.

A large number of software defect prediction models have been proposed over time to enable developers to determine the most critical software modules, i.e. the ones to which they should pay more attention and track carefully [1, 2]. Consequently, a prompt recognition of these modules allows developers to allocate their limited reviewing and testing resources effectively and may suggest architectural improvements. In the last decade, researchers have striven to employ machine learning (ML) techniques in many software engineering (SE) problems, such as defect prediction [3]. In many approaches described in the existing literature, prediction models learn from historical labelled data within a software project in a supervised way [4]: the model is trained on labelled data (i.e. where defectiveness information is available) and performs prediction on test data. Although supervised approaches are the vast majority, in practice software projects usually lack the classification labels needed to train a supervised defect prediction model, because collecting them is considered a time- and effort-consuming activity [5]. Due to the difficulty of collecting label information before building prediction models, unsupervised methods have begun to draw the attention of researchers [6]. Finding the best way to classify unlabelled data is a challenging task.
The current approaches use various experimental configurations in terms of software metrics (e.g., belonging to the size and complexity categories) and performance metrics (e.g., f-measure and accuracy). Li et al. [7] categorize them into clustering and non-clustering techniques; some of them are summarized in Table 1.

Table 1. Clustering and non-clustering techniques for defect prediction on unlabelled datasets.
Clustering     | k-Partition         | Based on the distance between data points              | [8-11]
Clustering     | Density-based       | Exploits core objects with dense neighborhoods         | [12]
Clustering     | Fuzzy               | A data point can belong to two or more clusters        | [13, 14]
Clustering     | Spectral clustering | Based on the eigenvalues of the similarity matrix      | [15, 16]
Non-clustering | Expert              | An expert determines the labels directly               | [9]
Non-clustering | Threshold           | Uses metric thresholds to classify instances directly  | [8, 14]

Once the instances of the dataset have been clustered, the labelling phase is a necessary step in software defect prediction techniques: the expert-based approach labels a cluster by exploiting the decision of an expert [9]; the threshold-based approach decides the cluster label by making use of particular metric threshold values [8, 14]; the top-half approach labels the top half of the clusters as defective [17]; the distribution approach is based on the number of nodes of a cluster, where the smaller clusters are considered defect-prone and the larger ones non-defect-prone [18]; the majority vote approach uses the three most similar clustering centers [14]; the cross-project approach exploits metrics from other projects [15]. These labelling techniques have some limitations, such as relying on experts' information and on suitable metric thresholds, and some assumptions, such as dealing with datasets that include the same features in cross-project defect prediction models. Furthermore, it has been shown that their results are difficult to compare [19]. Previous studies on unlabelled datasets have employed labelled datasets, such as the NASA dataset and its clean version [20] and the PROMISE dataset [21], which include different metrics. Therefore, to allow researchers to steer their choices when dealing with unlabelled datasets, we have decided to provide a comparative analysis of approaches that are able to overcome the previous limitations by using a new unlabelled dataset built for C++ software. Our comparative analysis exploits CLAMI [22], which is made up of 4 phases: Clustering, LAbeling, Metrics selection and Instance selection. Its key idea is to label unlabelled datasets by exploiting the metrics' magnitude (e.g. median, mean, quantiles), that is, a metrics' cutoff threshold. In addition to CLAMI, we have explored CLAMI+ [23], an improvement of CLAMI characterized by a different procedure in the metrics' selection phase (see Sect. 3). All the approaches have been applied to a new dataset obtained by extracting and preprocessing metrics from software archives. Unlike the majority of previously analysed datasets, which are related to C or Java code, ours is the result of a metrics extraction from a software written in C++ [24]. Finally, we have used the K-means algorithm as another technique to find potentially critical modules. In this study, we consider a high energy physics (HEP) software project, called Geant4 [25], characterized by a long development history. It represents a valid candidate for a systematic comparison of approaches applied to unlabelled datasets.
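Before detailing the data, here is a concrete illustration of the clustering-plus-threshold labelling idea summarized in Table 1: the following Python sketch groups modules with a k-partition algorithm and then marks a cluster as defect-prone when most of its mean metric values exceed the overall mean. The helper name `label_clusters_by_threshold` and the majority-of-metrics rule are our own simplifications, not taken from the cited works.

```python
import pandas as pd
from sklearn.cluster import KMeans

def label_clusters_by_threshold(metrics: pd.DataFrame, n_clusters: int = 2,
                                random_state: int = 0) -> pd.Series:
    """Cluster modules on their metric vectors, then label each cluster buggy ('B')
    or clean ('C') by comparing its mean metric values with the overall mean
    (a simple threshold-based cluster labelling heuristic)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=random_state).fit_predict(metrics)
    overall_mean = metrics.mean()
    cluster_label = {}
    for c in range(n_clusters):
        cluster_mean = metrics[clusters == c].mean()
        # A cluster is defect-prone when most of its metrics exceed the overall mean.
        cluster_label[c] = 'B' if (cluster_mean > overall_mean).mean() > 0.5 else 'C'
    return pd.Series([cluster_label[c] for c in clusters],
                     index=metrics.index, name='label')
```

On a real release dataset, `metrics` would hold one row per module and one column per software metric, and the returned series would serve as the pseudo-labels that the labelling approaches above try to produce.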
For this software we have measured its software characteristics [26] (by using the Imagix 4D framework [27]) to determine its quality evolution over releases and to build a complete set of data on top of which to assess the applicability of software metrics. Considering the amount of data available - i.e., 34 different datasets, each covering different software modules (such as class, namespace, function, file and variable) - this analysis may help to use these approaches and to concentrate effort only on the most critical software modules of a software project. Our comparison relies on performance metrics, such as accuracy, f-measure and the Kappa statistic, and on difference tests. Furthermore, we have manually extracted information about the defectiveness of software modules from their software archives in order to conduct a comparison with real data. In the remaining parts of this work, we first introduce the terminology in Sect. 2. Section 3 introduces the methodology of this comparison, including the research questions, the data used, the frameworks adopted for the analysis, the unsupervised approaches and the ML techniques used. Section 4 presents the experimental results. Section 5 discusses the results. At last, the work is concluded with future work in Sect. 6.

Within this section, we introduce a set of definitions we use in the paper, the workflow of unsupervised approaches, the software metrics we have collected and considered in the software datasets, the performance metrics we have used for the discussion and a set of statistical tests we have used to detect differences amongst the various ML techniques.

Definitions - A software project contains different releases of the software. Each software release is coded by a unique identifier that is composed of one or more numbers. Semantic versioning [28] uses three digits that respectively identify major, minor and patch. Developers increment the major number when they introduce breaking changes, the minor number when they add functionality in a backward compatible way, and the patch number when they make all other non-breaking changes. A module identifies an element of the software, such as a class, a file or a function. An unlabelled dataset is a set of software metrics for the various modules of a given software release; the modules are not labelled with respect to release defects and are marked with the key term '?'. A labelled dataset is a set of software metrics for the various modules of a given software release; the modules are labelled with defective and non-defective key terms, such as buggy (B) and clean (C) respectively, according to the unsupervised approaches.

General Workflow - Figure 1 shows the general workflow of software defect prediction on unlabelled datasets. The unlabelled dataset contains, for the various modules of a given software release, a set of software metrics. Each release-specific dataset is split into training and test datasets, used respectively to train the model and to predict defects: the training dataset starts unlabelled and ends labelled after a set of preprocessing operations related to the adopted unsupervised approaches; the test dataset remains unlabelled. The training dataset contains modules labelled as defective (i.e. buggy) or non-defective (i.e. clean) according to the unsupervised approaches adopted. Furthermore, each module carries the values of some software metrics. Modules with labels and software metrics are used to train ML classifiers and to predict new modules as buggy or clean.
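To make the versioning and labelling conventions above concrete, here is a small Python sketch: a hypothetical `parse_semver` helper splits a release identifier into its major, minor and patch numbers, and a toy unlabelled dataset marks every module with the '?' key term. The module names and metric values are purely illustrative.

```python
from typing import NamedTuple
import pandas as pd

class Version(NamedTuple):
    major: int
    minor: int
    patch: int

def parse_semver(release: str) -> Version:
    """Split a 'major.minor.patch' identifier, e.g. '10.4.2', into its three numbers."""
    major, minor, patch = (int(part) for part in release.split('.'))
    return Version(major, minor, patch)

# An unlabelled release dataset: one row per module, one column per software
# metric, plus a label column filled with the '?' key term.
unlabelled = pd.DataFrame(
    {'loc': [120, 431], 'wmc': [7, 25], 'label': ['?', '?']},
    index=['ModuleA', 'ModuleB'])

print(parse_semver('10.4.2'), unlabelled, sep='\n')
```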
Software Metrics - Software metrics provide information on various aspects of software through quantitative measurements [29]. These aspects may include structure, coupling and maintainability. Metrics can refer to a file, a class, a function or any key software element that characterizes a software. There are different families of software metrics; in this work we consider the size, complexity and object orientation categories. The size metrics help to quantify the software size and complexity and suggest the quality of the implementation. Their aim is to identify files, classes or methods that can be hard to understand, reuse, test and modify. For instance, class size metrics can reflect the effort required to build, understand and maintain a class. These metrics go from the comment ratio and the number of lines of source code up to the number of statements. The complexity metrics help to detect complex code that is difficult to test and maintain. Metrics used to measure this software characteristic are: size metrics (already described), McCabe metrics [30] and Halstead metrics [31]. There are various object orientation metrics suites. The most studied one is the Chidamber and Kemerer (CK) metrics suite [32], which aims at measuring design complexity in relation to its impact on external quality attributes, such as maintainability and reusability. The CK metrics are: number of children and depth of inheritance tree, related to inheritance and abstraction; weighted methods per class and lack of cohesion in methods, pertaining to the internal class structure; coupling between objects and response for a class, related to relationships among classes.

Performance Metrics - We detail the performance metrics [33] used to assess the considered ML techniques. The metrics are based on the confusion matrix, which describes the complete performance of the model by considering the following cases: a defective instance predicted as defective is a true positive (tp); a defective instance predicted as non-defective is a false negative (fn); a non-defective instance predicted as defective is a false positive (fp); a non-defective instance predicted as non-defective is a true negative (tn). In this work, we consider the recall metric, the precision metric, the accuracy metric, the f-measure metric, the AUC metric and the Kappa statistic. The recall metric $\frac{tp}{tp+fn}$, also called Probability of Detection (PD), measures how many of the existing defects are found. The precision metric $\frac{tp}{tp+fp}$ measures how many of the found results are actually defects. The accuracy metric $\frac{tp+tn}{tp+fn+tn+fp}$ measures the percentage of correct predictions. The f-measure metric $2\cdot\frac{recall \cdot precision}{recall + precision}$ is the harmonic mean of recall and precision. The Probability of False Alarm (PF) $\frac{fp}{fp+tn}$ is defined as the ratio of false positives to all non-defective modules; ideally, when used as a performance metric, it is desired to be equal to zero. The Kappa statistic $\frac{accuracy - randaccuracy}{1 - randaccuracy}$ [34], with $randaccuracy = \frac{(tn+fp)\cdot(tn+fn)+(fn+tp)\cdot(fp+tp)}{(tp+fn+tn+fp)^2}$, compares the observed accuracy with the expected accuracy and its value lies in [0, 1]. If the Kappa statistic lies in [0.81, 0.99], the value indicates an almost perfect agreement. The Receiver Operating Characteristics (ROC) curve, in the field of software prediction, is defined as a tradeoff between the prediction ability to correctly detect defective modules (PD or recall) and non-defective modules (PF) [35].
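Before turning to the graphical view of the ROC curve, the formulas above translate directly into code. A minimal sketch (a hypothetical helper, not tied to any of the frameworks used in the study) that computes them from raw confusion-matrix counts:

```python
def prediction_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the performance metrics described above from confusion-matrix counts."""
    total = tp + fp + tn + fn
    recall = tp / (tp + fn)                 # probability of detection (PD)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / total
    f_measure = 2 * recall * precision / (recall + precision)
    pf = fp / (fp + tn)                     # probability of false alarm
    rand_accuracy = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / total ** 2
    kappa = (accuracy - rand_accuracy) / (1 - rand_accuracy)
    return {'recall': recall, 'precision': precision, 'accuracy': accuracy,
            'f_measure': f_measure, 'pf': pf, 'kappa': kappa}

# Example with illustrative counts: 80 tp, 5 fp, 90 tn, 10 fn.
print(prediction_metrics(tp=80, fp=5, tn=90, fn=10))
```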
From a graphical point of view, PD is represented on the Y-axis and PF on the X-axis, while the predictor's threshold is varied. The performance of a predictor is better the closer the curve is to the coordinates PD = 1 and PF = 0. The Area Under the Curve (AUC) metric is the area under the ROC curve and is commonly used to compare the performance of predictors: it measures a model's discriminatory power [23, 36]. Compared to other metrics, AUC has two advantages: it is independent of the definition of a metric threshold and it is robust towards imbalanced datasets. With other performance metrics, a classifier must rely on the determination of a threshold whose value is usually set as the average of all metric values, and this choice may affect the final results of the classifier. Precision and recall are examples of metrics that are highly influenced by imbalanced datasets, making it difficult to conduct a fair comparison of models.

Performance Differences Tests - We statistically compare the performances achieved by the experimented prediction models, detecting groups of differences between the ML techniques we have considered in this study. These groups are called blocks in statistics and are usually linked to the problems met in the experimental study [37]. For example, in a comparative analysis over multiple datasets, each block is related to the results computed on a specific dataset. When dealing with multiple comparison tests, each block is composed of various results, each linked to an (algorithm, dataset) pair. The Friedman test [38] is the most common procedure for testing the differences among two or more related samples, which here represent the performance metrics of the techniques measured across the same datasets. The objective of this test is to determine whether we may conclude from a sample of results that there is a difference among treatment effects [37]. It is composed of two steps: the first step ranks the techniques for each problem separately, where the best performing technique is given rank 1; the second step computes a test statistic from the resulting ranks. Iman and Davenport [39] showed that the Friedman test has a conservative behaviour and proposed a different statistic. Post hoc procedures are employed to determine which classifiers perform better than a proposed one. They usually follow the application of the Friedman test, since they take into account the rankings generated by the Friedman procedure. According to [40], the Friedman test with the Iman and Davenport extension is the most popular omnibus test and is a good choice for comparing more than five techniques. Regarding post hoc tests, comparing all the techniques with one proposed technique requires the simple but least powerful Bonferroni test; on the contrary, when it is necessary to conduct pairwise comparisons, Hommel [41] and Rom are the two most powerful procedures. The Nemenyi post-hoc test [42] is used when every classifier is compared to one another. Two classifiers achieve significantly different performance if the corresponding average ranks differ by at least the critical difference $CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$, where k is the number of ML techniques, N is the number of datasets and $q_\alpha$ is derived from the studentized range statistic divided by $\sqrt{2}$.
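A sketch of this statistical comparison in Python, assuming a matrix of performance scores with one row per dataset (block) and one column per technique; the score values and the hard-coded q_0.05 entry for four techniques (taken from Demšar's published table) are illustrative assumptions, not results from the study.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Illustrative AUC scores: rows are datasets (blocks), columns are ML techniques.
scores = np.array([[0.91, 0.93, 0.90, 0.94],
                   [0.88, 0.92, 0.89, 0.93],
                   [0.90, 0.95, 0.91, 0.96],
                   [0.87, 0.94, 0.90, 0.95]])

# Friedman omnibus test over the related samples (one argument per technique).
stat, p_value = friedmanchisquare(*scores.T)

# Rank the techniques within each block (rank 1 = best, i.e. highest score).
ranks = np.apply_along_axis(rankdata, 1, -scores)
avg_ranks = ranks.mean(axis=0)

# Nemenyi critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
n_datasets, k = scores.shape
q_alpha = 2.569          # q_0.05 for k = 4 techniques, from Demsar's table
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n_datasets))

print(f"Friedman p = {p_value:.3f}, average ranks = {avg_ranks}, CD = {cd:.3f}")
```

Two techniques whose average ranks differ by more than CD would then be considered significantly different, which is what the critical difference diagrams discussed later visualize.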
The Scott-Knott Effect Size Difference (ESD) test [43] is a mean comparison approach based on a hierarchical clustering analysis that partitions the set of treatment means into statistically distinct groups. It corrects for the non-normal distribution of an input dataset and merges any two statistically distinct groups that have a negligible effect size into one group.

We describe the methodology that we followed to perform our comparison. Research Questions - We formulate the research questions (RQ) that we answer through this comparison. RQ1: which unsupervised approach performs best in terms of our performance metrics? RQ2: which supervised technique performs best in terms of our performance metrics in the two unsupervised approaches? RQ3: what is the impact of using different cutoff thresholds (i.e. quantiles) on the comparison results? The reason for RQ1 is that we intend to test the known unsupervised approaches employed both in previous literature and in our work to find which one performs better than the others; to do so, we compute their results in terms of performance metrics. The analysis of RQ3 provides insights into how the dataset distribution affects the overall performance of our defect prediction experiments. On the one hand, larger datasets entail more generalization and better performance; on the other hand, less training data may have less variance, which can also lead to an improvement in performance.

Data - Our comparison assesses the CLAMI and CLAMI+ approaches on a multi-release dataset. It is related to the widespread scientific Geant4 software because of its peculiarity: the first 0.0 release was delivered on the 15th of December 1998, and the software is still under development by an international collaboration that has produced 10 major releases and at least 34 minor releases with various patches. The development time frame sampled in the study spans 21 years, covering almost a quarter of a century of development history. Figure 2 shows the Geant4 software releases considered for this study over the years: for each release number, the + marks the year when the release was delivered and the x marks the year when the related patch was delivered. At the moment, the dataset consists of 34 software releases delivered in almost 20 years of development. There are 66 software metrics for 482 C++ classes, measured for the various releases by using the Imagix 4D tool [27]. The resulting dataset is therefore composed of 34 x 482 class instances, i.e. the same 482 classes repeated across the releases. The defect labels are not included in the dataset: we have extracted them by using the release notes that accompany the corresponding software releases.

Framework - In order to conduct this study, we exploit various available frameworks running in Java, Python and R. We have assessed these frameworks according to how much effort they required to be used: Weka [44] is the easiest to use, while Theano [45] requires more expertise.

Unsupervised Approaches - We apply three unsupervised approaches to the examined dataset. The CLAMI approach conducts the prediction on an unlabelled dataset by learning from itself. It creates a labelled training sample by using the magnitude of metric values (i.e. the metrics that are greater than a specific cutoff threshold) and builds the model by supervised learning to predict unlabelled modules.
This approach consists of clustering, labelling, metrics selection and instance selection phases: clustering groups the modules; labelling estimates the label of the groups; metrics selection and instance selection choose the more informative features (i.e., software metrics) and produce the training set. In the labelling phase, CLAMI measures the violation count of a module (i.e., the number of metric values greater than a certain threshold). The CLAMI+ approach extends CLAMI; the main difference concerns the clustering and labelling phases. CLAMI+ uses the violation degree (i.e., it transforms the difference between the metric value and the threshold into a probabilistic value) to replace the boolean representation of CLAMI, so that how much an instance violates a metric is taken into account. The resulting training set is more informative than the one of CLAMI. K-means is a clustering algorithm [46] based on iterative relocation that partitions the dataset into K clusters using the standard Euclidean distance. There are various variants of the algorithm; we have used the ones implemented in scikit-learn, which are based on either Lloyd's or Elkan's algorithm [47, 48]. Since they are simple and flexible, these algorithms are very popular in statistical analysis [49]. The defect prediction is performed within-project, where the training and test sets are from the same project.

ML Techniques - Aiming at comparing several classification algorithms, we have considered a total of seventeen classifiers, each of which can be categorized according to its statistical approach. We have identified the following categories: statistical classifiers that construct a Bayes optimal classifier; nearest-neighbor methods that classify a module by considering the k most similar samples, where the definition of similarity differs among algorithms; neural network methods that use a network structure to define a concatenation of weighting, aggregation and thresholding functions in order to determine a software module's probability of being fault prone (fp); support vector machine classifiers that aim at optimizing a linear decision function to discriminate between fp and non-fp modules; tree-based methods that recursively partition the training data; ensemble methods that combine several base classifiers, built independently and put together to obtain a final class prediction. Table 2 provides, for each category, the considered classifiers. A detailed description is available in general textbooks [50].

Evaluation Strategy and Remarks - The focus of our comparison is not an advancement or improvement of existing approaches. We apply them to an unlabelled dataset related to the Geant4 software, a cornerstone of Monte Carlo simulation, in order to provide an alternative way to control and monitor its development process. The resulting datasets have been checked against the existing documentation in the release notes of each new software release. We have been able to trace modules that were changed for bug fixes, minor fixes, warnings and major fixes, finding a correspondence between the documentation and the labelling activity performed by the considered unsupervised approaches.

In this section we provide an analysis of the obtained results. To conduct a reasonable comparison among the various releases, we have considered the modules that were present in all the releases. This filtering operation has led to a considerable reduction of our training set.
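Before presenting the results, the labelling step described above can be sketched in Python on a pandas DataFrame: a module is labelled buggy when it violates (exceeds the chosen cutoff quantile of) at least half of the metrics, mirroring CLAMI's violation count, while a second helper keeps the size of each violation in the spirit of CLAMI+'s violation degree. Metric selection, instance selection and the exact CLAMI/CLAMI+ procedures are intentionally omitted, so this is a simplified illustration rather than the reference implementation.

```python
import pandas as pd
from sklearn.cluster import KMeans

def clami_like_labels(metrics: pd.DataFrame, quantile: float = 0.5) -> pd.Series:
    """Label modules 'B' (buggy) or 'C' (clean) from the number of metrics whose
    value exceeds the chosen cutoff quantile (a CLAMI-style violation count)."""
    cutoffs = metrics.quantile(quantile)           # one cutoff threshold per metric
    violations = (metrics > cutoffs).sum(axis=1)   # violated metrics per module
    buggy = violations >= metrics.shape[1] / 2
    return buggy.map({True: 'B', False: 'C'}).rename('label')

def violation_degree(metrics: pd.DataFrame, quantile: float = 0.5) -> pd.DataFrame:
    """CLAMI+-style idea: keep how much each metric exceeds its cutoff (here a
    simple scaled positive difference) instead of a boolean violation flag."""
    cutoffs = metrics.quantile(quantile)
    return (metrics - cutoffs).clip(lower=0) / metrics.std().replace(0, 1)

# Tiny illustrative dataset: two modules, two metrics.
demo = pd.DataFrame({'loc': [120, 431], 'wmc': [7, 25]}, index=['ModuleA', 'ModuleB'])
print(clami_like_labels(demo))

# Unsupervised alternative: two K-means clusters on the raw metric vectors, to be
# labelled afterwards (e.g. by checking the clusters against the release notes).
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(demo))
```

On the study's data, `metrics` would contain the 66 Imagix 4D measurements for the 482 classes of one release, and the resulting labels would feed the supervised classifiers listed in Table 2.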
In more detail, while the modules common to all the releases were 482, releases 4, 11 and 1004, for example, contain respectively 891, 915 and 2544 modules. In the following, we present the results relative to the various steps of the considered approaches.

Selected Metrics - Table 3 shows the level, the category (i.e. Complexity, Size, Object Orientation), the complete name and the description of all metrics selected either in the CLAMI or in the CLAMI+ approach. The file-level metrics have been computed both on header files (i.e. the .h extension) and on source files (i.e. the .cpp extension). Halstead's metrics have been calculated both at file and at class level. Table 4 indicates the metrics selected both in the CLAMI and CLAMI+ approaches for the different quantiles (cutoff thresholds) among the various releases. Quantile numbers from 1 to 9 correspond to 10%, ..., 90% respectively. Quantiles 1 and 2 have focused more on file-level metrics, while quantiles from 3 to 9 have selected object orientation metrics. The metrics CLhpd, CLhpv, CLhme, ckrfc and ckwmc are the ones shared by most quantiles from 5 to 9, and they belong to the object orientation or complexity category. Table 5 displays the total number of metrics selected in the CLAMI and CLAMI+ approaches for the different quantiles among the various releases. They go from a minimum of 18 in quantiles 4, 5 and 6 to a maximum of 52 in quantile 9.

Labelled Modules - The percentage of buggy modules over the total modules across the releases has been measured for each quantile. The quantiles are related to the different cutoff values chosen for the metrics. For example, for release 701 and quantile 5 (which corresponds to the median value of the metrics), CLAMI computes about 60% of buggy modules. The difference between CLAMI and CLAMI+ in the percentage of buggy modules over the various releases has also been measured for each quantile. As an example, in release 511 and quantile 5, the value of the difference is about 10 percentage points: CLAMI predicts about 50% of buggy modules and, consequently, CLAMI+ predicts about 40%. It is worth noticing that our manual work has covered just 380 modules, which has reduced the size of our training set. K-means has provided us with two clusters made up of different classes. We checked the classes of these clusters against the documentation of defectiveness. Afterwards, we collected how CLAMI and CLAMI+ have labelled each class. Both of them have labelled 179 modules as defective in cluster 1, while in cluster 2 CLAMI has labelled 9617 modules as defective and CLAMI+ 9747.

ML Techniques - In our study, we have carried out a comparative analysis of all the ML techniques using the average rank calculated on the various performance metrics, such as accuracy, precision, recall, f-measure and AUC. In order to avoid overfitting we have exploited 10-fold cross-validation [51-53]; furthermore, we have excluded all the prediction models whose Kappa statistic scored less than 0.80, in order to keep only models with a good agreement. Table 6 presents the different values of average accuracy computed for the 4 remaining ML techniques, namely J48, LMT, AdaBoost and Bagging, obtained by using ten-fold cross-validation. The best accuracy values are achieved by all the techniques for the first quantile; the CLAMI+ approach reaches high scores also in the ninth quantile. The technique that performs best regardless of the two approaches and quantiles is AdaBoost, which scores 95.3508 in terms of average accuracy.
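The evaluation setup just described (ten-fold cross-validation plus the Kappa > 0.80 filter) can be sketched as follows. The study used Weka implementations of seventeen techniques; the scikit-learn classifiers below are rough stand-ins (there is, for instance, no direct LMT equivalent), so this is an assumed, simplified setup rather than the exact pipeline.

```python
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier        # stand-in for J48
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.metrics import make_scorer, cohen_kappa_score

def evaluate_classifiers(X: pd.DataFrame, y: pd.Series, kappa_cut: float = 0.80):
    """Run ten-fold cross-validation and keep classifiers whose mean Kappa > cutoff."""
    classifiers = {'J48-like tree': DecisionTreeClassifier(random_state=0),
                   'AdaBoost': AdaBoostClassifier(random_state=0),
                   'Bagging': BaggingClassifier(random_state=0)}
    kappa = make_scorer(cohen_kappa_score)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    selected = {}
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, scoring=kappa, cv=cv)
        if scores.mean() > kappa_cut:
            selected[name] = scores.mean()
    return selected
```

Averaging the accuracy, precision, recall, f-measure and AUC of the retained models over the quantiles then yields tables analogous to Tables 6-10.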
Regardless of the specific techniques, CLAMI+ performs better than CLAMI in every quantile except the second, third and fourth. Table 7 displays the different values of average buggy precision computed for the 4 ML techniques, i.e. J48, LMT, AdaBoost and Bagging. The technique that performs best in terms of precision, regardless of the quantiles and approaches, is LMT, which scores 0.9385 in terms of average precision. Except for quantiles one, two and nine, CLAMI performs better than CLAMI+ regardless of the specific technique. Table 8 illustrates the different values of average buggy recall computed for the 4 ML techniques, i.e. J48, LMT, AdaBoost and Bagging. The technique that performs best in terms of recall, regardless of the quantiles and approaches, is AdaBoost, which scores 0.9328 in terms of average recall. Except for quantiles three and five, CLAMI performs better than CLAMI+ regardless of the specific technique. Table 9 shows the different values of average buggy f-measure computed for the 4 ML techniques, i.e. J48, LMT, AdaBoost and Bagging. The technique that performs best in terms of f-measure, regardless of the quantiles and approaches, is LMT, which scores 0.9343. Except for quantile one, CLAMI performs better than CLAMI+ in terms of f-measure regardless of the specific technique. Table 10 displays the different values of average buggy AUC computed for the 4 ML techniques, i.e. J48, LMT, AdaBoost and Bagging. The technique that performs best in terms of AUC, regardless of the quantiles and approaches, is AdaBoost, which scores 0.9830. CLAMI+ performs better than CLAMI in quantiles 1, 2, 5 and 7 in terms of AUC regardless of the specific technique.

Considering the CLAMI approach, Fig. 4 shows the results obtained with Nemenyi's test for the AUC metric, according to which there are no significant differences between methods that are grouped together by a horizontal line: the top line represents the axis on which the average ranks of the methods (see Table 10) are plotted; the highest (best) ranks are to the right; the critical difference is also shown above the graph. Still for CLAMI, Fig. 3 shows the results obtained with Friedman's post-hoc test with Bergmann and Hommel's correction for the AUC metric, according to which there are no significant differences between methods whose nodes are connected: each node reports the method name with its average rank (see Table 10); the methods with the highest (best) ranks are not connected. Compared with the critical difference plot, this graph shows more differences between methods. We have produced the same plots for the other performance metrics. According to the Scott-Knott ESD test, we have obtained a non-negligible effect size for the AUC metric both in the CLAMI and CLAMI+ approaches, as shown in Fig. 4.

In this section we present the answers to our research questions. In regard to RQ1 and RQ2, we have employed different performance metrics and statistical tests. In detail, we have used accuracy, precision, recall, f-measure and the area under the receiver operating characteristics curve (AUC). Accuracy measures the percentage of correct predictions among all the predictions; precision measures the rate of correctly predicted defective instances among all the instances predicted as defective; recall measures the rate of correctly predicted defective instances among all the actual defective instances.
The f-measure represents the harmonic mean of precision and recall, and AUC measures the area under the ROC curve, which is plotted using recall and the false positive rate while changing the software metrics' cutoff thresholds. All the above metrics have been widely used in previous literature in the field of software defect prediction [6, 22, 23, 54, 55]. For the statistical tests, we have conducted the Iman and Davenport (ID) omnibus test, the Nemenyi test, the Friedman post hoc test with Bergmann and Hommel's correction and the Scott-Knott ESD test. We have followed the guidelines of Demšar [42] for the visual representation of the Nemenyi and Friedman tests.

RQ1: which unsupervised approach performs best in terms of our performance metrics? Two main unsupervised approaches have been involved in the comparison: CLAMI and CLAMI+. They entail varying their metrics' cutoff thresholds and, consequently, the creation of 9 quantiles to conduct the labelling operation, followed by the application of supervised ML techniques; more in detail, we have considered all the ML techniques whose Kappa statistic reached a value greater than 0.80. To find the best approach we have computed the average of all the values of one performance metric over the techniques and then over the quantiles, giving the same weight to each of them. In regard to accuracy, CLAMI+ scores 95.6928 and does better than CLAMI, which scores 94.7845. CLAMI achieves a higher result than CLAMI+ in precision (0.9389 vs 0.9253), recall (0.9335 vs 0.9276), f-measure (0.9360 vs 0.9258) and AUC (0.9647 vs 0.9636).

RQ2: which supervised technique performs best in terms of our performance metrics in the two unsupervised approaches? Table 11 introduces the average performance metric values per technique over approaches and quantiles. AdaBoost achieves the best score in accuracy (95.3508), recall (0.9329) and AUC (0.9830), whilst LMT performs best in precision (0.9385) and f-measure (0.9343). Concerning the CLAMI and CLAMI+ approaches and the AUC performance metric, J48 has achieved the best average rank.

RQ3: what is the impact of using different cutoff thresholds (i.e. quantiles) on the comparison results? Quantile values affect the resulting performance metrics; as a consequence, metrics' thresholds have to be taken into consideration, since the data distribution is not Gaussian.

In this paper, we have conducted a comparative analysis of different unsupervised approaches: CLAMI, CLAMI+ and K-means. Once a labelled dataset has been obtained by exploiting these approaches, we have carried out a second comparison amongst various supervised ML techniques. Starting from an unlabelled software dataset, CLAMI and CLAMI+ allow users to label each instance by exploiting the magnitude of its metrics. We have applied different supervised ML techniques on the resulting datasets and detected the best ones by using statistical comparisons, such as the Iman and Davenport test, the Nemenyi test, the Friedman post-hoc test with Bergmann and Hommel's correction and the Scott-Knott ESD test. Among the 17 ML techniques used, we have selected the ones which scored more than 0.80 in the Kappa statistic performance metric: J48, AdaBoost, LMT and Bagging. The computed predictive average accuracy metric is around 95% for all these ML techniques. Furthermore, our study has shown that, for CLAMI and CLAMI+, the J48 method performs best according to the average AUC performance metric.
In regard to CLAMI+, J48 is followed by the Bagging method for the average AUC, average accuracy, average precision and average f-measure metrics, and by the LMT method for the average recall metric. Based on the results of our empirical analysis, we conclude that no significant performance differences have been found among the selected classification techniques. Future work based on this study could involve all the datasets for each release of Geant4 (currently only 482 modules per release are considered) and the employment of other unsupervised clustering techniques such as Fuzzy C-Means and Fuzzy SOMs.

References
- Software defect prediction using cost-sensitive neural network
- Data mining static code attributes to learn defect predictors
- Metrics for software reliability: a systematic mapping study
- Cross project change prediction using open source projects
- Unsupervised learning for expert-based software quality estimation
- Defect prediction on unlabeled datasets by using unsupervised clustering
- A systematic review of unsupervised learning techniques for software defect prediction
- Clustering and metrics thresholds based software fault prediction of unlabeled program modules
- Analyzing software measurement data with clustering techniques
- Software fault prediction using quad tree-based k-means clustering algorithm
- Benchmarking machine learning techniques for software defect detection
- Software defect prediction using heterogeneous ensemble classification based on segmented patterns
- Software metrics data clustering for quality prediction
- Increasing the accuracy of software fault prediction using majority ranking fuzzy clustering
- Cross-project defect prediction using a connectivity-based unsupervised classifier
- A novel method for software defect prediction in the context of big data
- Self-learning change-prone class prediction
- Software fault prediction model using clustering algorithms determining the number of clusters automatically
- A comparative study to benchmark cross-project defect prediction approaches
- Data quality: some comments on the NASA software defect datasets
- Balancing privacy and utility in cross-company defect prediction
- CLAMI: defect prediction on unlabeled datasets (T)
- Automated change-prone class prediction on unlabeled dataset using unsupervised method
- Revisiting the impact of classification techniques on the performance of defect prediction models
- GEANT4 - a simulation toolkit
- Assessing software quality in high energy and nuclear physics: the Geant4 and ROOT case studies and beyond
- Semantic Versioning 2.0
- A critique of software defect prediction models
- A complexity measure
- Elements of Software Science
- Metrics suite for object oriented design
- Comments on data mining static code attributes to learn defect prediction
- The measurement of observer agreement for categorical data
- Multiple-classifiers in software quality engineering: combining predictors to improve software fault prediction ability
- Automating change-level self-admitted technical debt determination
- Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power
- A comparison of alternative tests of significance for the problem of M rankings
- Approximations of the critical region of the Friedman statistic
- scmamp: statistical comparison of multiple algorithms in multiple problems
- Improvements of general multiple test procedures for redundant systems of hypotheses
- Statistical comparisons of classifiers over multiple data sets
- An empirical comparison of model validation techniques for defect prediction models
- Analysis of data mining based software defect prediction techniques
- Deep learning for smart manufacturing: methods and applications
- Web-scale K-means clustering
- Least squares quantization in PCM
- Using the triangle inequality to accelerate k-means
- A clustering algorithm for software fault prediction
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Unsupervised machine learning for networking: techniques, applications and research challenges
- A few useful things to know about machine learning
- Dropout: a simple way to prevent neural networks from overfitting
- An empirical study of just-in-time defect prediction using cross-project models
- Dictionary learning based software defect prediction

The authors thank the Imagix Corporation for providing an extended free license of Imagix 4D to perform this work.