title: Revisiting Process versus Product Metrics: a Large Scale Analysis
authors: Majumder, Suvodeep; Mody, Pranav; Menzies, Tim
date: 2020-08-21

Numerous methods can build predictive models from software data. However, what methods and conclusions should we endorse as we move from analytics in-the-small (dealing with a handful of projects) to analytics in-the-large (dealing with hundreds of projects)? To answer this question, we recheck prior small-scale results (about process versus product metrics for defect prediction and the granularity of metrics) using 722,471 commits from 700 Github projects. We find that some analytics in-the-small conclusions still hold when scaling up to analytics in-the-large. For example, like prior work, we see that process metrics are better predictors for defects than product metrics (the best process/product-based learners respectively achieve recalls of 98%/44% and AUCs of 95%/54%, median values). That said, we warn that it is unwise to trust metric importance results from analytics in-the-small studies, since those change dramatically when moving to analytics in-the-large. Also, when reasoning in-the-large about hundreds of projects, it is better to use predictions from multiple models (since single-model predictions can become confused and exhibit high variance).

There exist many automated software engineering techniques for building predictive models from software project data [23]. Such models are cost-effective methods for guiding developers on where to quickly find bugs [48, 62]. Given that there are so many techniques, the question naturally arises: which one should we use? Software analytics is growing more complex and more ambitious with time. A decade ago, a standard study in this field dealt with just 20 projects or less. Now we can access data on hundreds to thousands of projects. How does this change software analytics? What methods and conclusions should we endorse as we move from analytics in-the-small (which analyzes a small number of projects individually to report their findings) to analytics in-the-large (which analyzes hundreds of projects individually to report findings that hold across all or most of the projects analyzed)? Rechecking results and findings that were true for analytics in-the-small is therefore of utmost importance when we have hundreds to thousands of projects. Such analytics in-the-large results will help the software engineering community to understand and adopt appropriate methods, beliefs, and conclusions. As part of this study, we revisited the Rahman et al. ICSE 2013 study "How, and why, process metrics are better" [69] and the Kamei et al. ICSM 2010 study "Revisiting common bug prediction findings using effort-aware models" [34]. Both papers were analytics in-the-small studies that used 12 and 3 projects, respectively, to see if defect predictors worked best if they used: -Product metrics, showing what was built; e.g., see Table 1. -Or process metrics, showing how code is changed; e.g., see Table 2. These papers are worth revisiting since they are widely cited and they address an important issue. Herbsleb argues convincingly that how groups organize themselves can be highly beneficial/detrimental to the process of writing code [28]. Hence, process factors can be highly informative about what parts of a codebase are buggy.
In support of the Herbsleb hypothesis, prior studies have shown that, for defect prediction, process metrics significantly outperform product metrics [10, 41, 69]. Also, if we wish to learn general principles for software engineering that hold across multiple projects, it is better to use process metrics since: -Process metrics are much simpler to collect and can be applied uniformly to software written in different languages. -Product metrics, on the other hand, can be much harder to collect. For example, some static code analysis requires expensive licenses, which need updating every time a new version of a language is released [71]. Also, the collected values for these metrics may not translate between projects since their ranges can be highly project-specific. Lastly, product metrics tend to be far more verbose and hence time-consuming to collect. For example, for the 722,471 commits studied in this paper, the product data required 500 days of CPU time to collect (using five 16-core machines for 7 days). Our process metrics, on the other hand, were an order of magnitude faster to collect.

Since product versus process metrics is such an important issue, we revisited the Rahman et al. and Kamei et al. studies. To check their conclusions, we ran an analytics in-the-large study that looked at 722,471 commits from 700 Github projects. All in all, this paper explores eight hypotheses using two widely used validation criteria. One is release-based (where, given R releases of the software, we train on data from releases 1 to R − 3, then test on releases R − 2, R − 1, and R) and the other is cross-validation-based (where the data is randomly divided into N stratified bins; each bin, in turn, becomes the test set and a model is trained on the remaining bins). After comparing the conclusions seen in the prior analytics in-the-small studies to our analytics in-the-large results, we find two cases where we disagree and six where we agree. So what is the value of a paper with 75% agreement with prior work? We assert that this paper makes several important contributions: -Firstly, in the two cases where we disagree, we very strongly disagree: -We find that it is not the case that any learner is appropriate for analytics in-the-large. Our results suggest that any learner that generates a single model may get confused by all the intricacies of data from multiple projects. On the other hand, ensemble learners (that make their conclusions by polling across many models) know how to generate good predictions from an extensive sample. -Also, in terms of what recommendations we would make to improve software quality, we find that the conclusions achieved via analytics in-the-large are very different from those achieved via analytics in-the-small. Later in this paper, we compare those two sets of conclusions. We will show that changes to software projects that make sense from analytics in-the-small (after looking at any five projects) can be wildly misleading since, once we get to analytics in-the-large, a very different set of attributes is most effective. -Secondly, in the cases where our conclusions are the same as prior work, we have successfully completed a valuable step in the scientific process, i.e., the reproduction of prior results. Current ACM guidelines distinguish replication and reproduction as follows: the former uses artifacts from the prior study while the latter does not. Our work is a reproduction since we use ideas from the Rahman et al. and Kamei et al. studies, but none of their code or data.
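For concreteness, the two validation criteria mentioned above can be sketched as follows. This is a minimal illustration of the idea, assuming a pandas-style table with an ordinal "release" column; the function and column names are ours, not the study's scripts.

```python
# Sketch of the two validation schemes: release-based and stratified cross-validation.
# Data layout (one row per file instance, ordinal "release" column) is hypothetical.
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def release_based_split(df, release_col="release"):
    """Train on releases 1..R-3; return one test set per release R-2, R-1, R."""
    releases = sorted(df[release_col].unique())
    train = df[df[release_col].isin(releases[:-3])]
    tests = {r: df[df[release_col] == r] for r in releases[-3:]}
    return train, tests

def stratified_bins(y, n_bins=5):
    """The cross-validation scheme: N stratified bins, each used once as the test set."""
    skf = StratifiedKFold(n_splits=n_bins, shuffle=True, random_state=0)
    return list(skf.split(np.zeros((len(y), 1)), y))
```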
We would encourage more researchers to conduct and report more reproduction studies. Specifically, this paper asks eight research questions.

RQ 1: For predicting defects, do methods that work in-the-small also work in-the-large? In a result that agrees with Rahman et al., we find that how we build code is more indicative of what bugs are introduced than what we build (i.e., process metrics make the best defect predictions).

RQ 2: Measured in terms of prediction variability, do methods that work well in-the-small also work at-scale? Rahman et al. said that it does not matter what learner is used to build prediction models. We reach the exact opposite conclusion. For analytics in-the-large, the more data we process, the more variance in that data. Hence, conclusions that rely on a single model get confused and exhibit significant variance in their predictions. To mitigate this problem, it is important to use learners that make conclusions by averaging over multiple models (i.e., ensemble Random Forests are far better for analytics than the Naive Bayes, Logistic Regression, or Support Vector Machines used in prior work).

RQ 3: Measured in terms of granularity, does the granularity that works well in-the-small also work at-scale? Kamei et al. said in their study that although file-level prediction is better than package-level prediction when measured using Popt20, the difference is very small, and we agree with this result. However, when measured via other evaluation criteria, the difference is significant. Thus, for analytics in-the-large, it is evident that the granularity of the metrics matters and that file-level prediction shows significantly better results than package-level prediction.

RQ 4: Measured in terms of stability, are process metrics more/less stable than code metrics when measured at-scale? When measured in terms of the stability of performance across the last 3 releases (using all previous releases to train the model), our results agree with Rahman et al. on all traditional evaluation criteria (i.e., recall, pf, precision). We find that performance across the last 3 releases does not differ significantly on any evaluation criterion except the effort-aware criterion Popt20.

RQ 5: Measured in terms of stasis, are process metrics more/less static than code metrics when measured at-scale? In this result, we agree with Rahman et al.: product metrics are significantly more correlated across changes than process metrics. We measure this correlation in both release-based and JIT-based settings. Although process metrics have a significantly lower correlation than product metrics in both settings, the difference is smaller in the JIT-based setting. Also, when lifting process metrics from file-level to package-level, as explored by Kamei et al., we see a significant increase in the correlation of the process metrics. This can explain the drop in performance of package-level prediction.

RQ 6: Measured in terms of stagnation, do models built from different sets of metrics stagnate across releases when measured at-scale? Rahman et al. warn that, when reasoning over multiple releases, models can stagnate, i.e., fixate on old conclusions and miss new ones.
For example, if a defect occurs in the same file in release one and release two, and another defect appears in a new file in the second release, the model will flag the file that was defective in the first release but will miss the defect in the new file. Here we measure the stagnation property of the models built using each set of metrics. Our results agree with Rahman et al.: we see a significantly higher correlation between the predicted probability and the learned probability for product metrics than for process metrics. This signifies that models built using product metrics tend to be stagnant.

RQ 7: Do stagnant models (based on stagnant metrics) tend to predict recurringly defective entities? In these results, we evaluate whether models built with product and process metrics tend to predict recurrent defects. Our results concur with Rahman et al.: models built with product metrics tend to predict recurrent defects, while models built with process data do not suffer from this effect.

RQ 8: Measured in terms of metric importance, are metrics that seem important in-the-small also important when reasoning in-the-large? Numerous prior analytics in-the-small publications offer conclusions on the relative importance of different metrics. For example, [33], [22], [52], [38], and [18] offer such conclusions after an analysis of 1, 1, 3, 6, and 26 software projects, respectively. Their conclusions are far more specific than process-vs-product; rather, these prior studies call out particular metrics as being most important for prediction. Based on our analysis, we must now call into question any prior analytics in-the-small conclusion that asserts that specific metrics are more important than others (for defect prediction). We find that the relative importance of different metrics found via analytics in-the-small is not stable. Specifically, when we move to analytics in-the-large, we find very different rankings for metric importance.

The rest of this paper is structured as follows. Some background and related work are discussed in Section 2. Our experimental methods are described in Section 3: data collection in Section 3.1, the learners used in this study in Section 3.2, the experimental setup in Section 3.4, and the evaluation criteria in Section 3.5. The results and answers to the research questions are presented in Section 4, followed by threats to validity in Section 5. Finally, the conclusion is provided in Section 6. Note that all the scripts and data used in this analysis are available online at https://github.com/Suvodeep90/Revisit_process_product.

This section shows that software defect prediction is a (very) widely explored area with many application areas. Specifically, in 2020, software defect prediction is now a "subroutine" that enables much other research. A defect in software is a failure or an error represented by incorrect, unexpected, or unintended behavior of a system caused by an action taken by a developer. As today's software proliferates both in size and number, software testing for capturing those defects plays a more and more crucial role. During software development, the testing process often has resource limitations. For example, the effort associated with coordinating human effort across a large codebase can grow exponentially with the scale of the project [21]. It is common to match the quality assurance (QA) effort to the perceived criticality and bugginess of the code in order to manage resources efficiently.
Since every decision is associated with a human and resource cost to the developer team, it is impractical and inefficient to distribute equal effort to every component in a software system [12]. Creating defect prediction models from either product metrics (like those from Table 1) or process metrics (like those from Table 2) is an efficient way to examine incoming changes and focus on specific modules or files based on the suggestions of a defect predictor. Recent results show that software defect predictors are also competitive with widely-used automatic methods. Rahman et al. [72] compared (a) the static code analysis tools FindBugs, Jlint, and PMD with (b) defect predictors (which they called "statistical defect prediction") built using logistic regression. No significant differences in cost-effectiveness were observed. Given this equivalence, it is significant to note that defect prediction can be quickly adapted to new languages by building lightweight parsers to extract product metrics, or by mining the git history for common change information to build process metrics. The same is not true for static code analyzers: these need extensive modification before they can be used with new languages. Because of this ease of use and its applicability to many programming languages, defect prediction has been extended in many ways, including: 1. Application of defect prediction methods to locate code with security vulnerabilities [80]. 2. Understanding the factors that lead to a greater likelihood of defects, such as defect-prone software components, using code metrics (e.g., the ratio of comments to code, cyclomatic complexity) [46, 47] or process metrics (e.g., recent activity). 3. Predicting the location of defects so that appropriate resources may be allocated (e.g., [9]). 4. Using predictors to proactively fix defects [5]. 5. Studying defect prediction not only at the release level [15] but also at the change level, or just-in-time [75]. 6. Exploring "transfer learning", where predictors from one project are applied to another [39, 58]. 7. Assessing different learning methods for building predictors [23]. This has led to the development of hyper-parameter optimization and better data harvesting tools [1, 2]. Defect prediction models are built using various machine learning classification methods such as Random Forest, Support Vector Machine, Naive Bayes, and Logistic Regression [24, 27, 30, 31, 39, 50, 59, 63, 78, 79, 83, 86, 92, 103, 104, 106]. All these methods input project metrics and output a model that can make predictions. Fenton et al. [20] say that a "metric" is an attempt to measure some internal or external characteristic and can broadly be classified as product (specification, design, code-related) or process (constructing specification, detailed design related). The metrics are computed either by parsing the code (modules, files, classes, or methods) to extract product (code) metrics, or by inspecting the change history (parsing the revision history of files) to extract process (change) metrics. In September 2020, we conducted the following literature review to understand the current thinking on process and product metrics. Starting with Rahman et al. [70] and Kamei et al. [34], we used Google Scholar to trace citations forward and backward, looking for papers that offered experiments on process or product metrics for defect prediction, or that suggested why certain process or product metrics are better for defect prediction. This gave us a list of 76 papers.
Following the advice of Mathew et al. [44], we examined: -Highly cited papers, i.e., those with at least ten citations per year. -Papers from senior SE venues, i.e., those listed at "Google Scholar Metrics Software Systems". Next, using our domain expertise, we augmented that list with papers we considered important or highly influential that focus on the benefits of using process and/or product metrics but were not captured by the above two criteria. This leads to the 45 papers listed in Table 3. Within this set of papers, we observe that studies on product metrics are more common than studies on process metrics (and very few papers experimentally compare both product and process metrics: see Figure 1). The product metrics community [35, 48, 50, 78, 79, 83, 91, 92, 97, 109, 110] argues that many kinds of metrics indicate which code modules are buggy: -For example, for lines of code, it is usually argued that large files can be hard to comprehend and change (and thus are more likely to have bugs); -For another example, for design complexity, it is often argued that the more complex the design of code, the harder it is to change and improve that code (and thus it is more likely to have bugs). On the other hand, the process metrics community [11, 17, 29, 43, 55, 60, 65, 68, 73, 74, 85, 93, 102] argues that defect-proneness is better indicated by (a) developer experience; (b) how many developers worked on a certain file (it is argued that a file worked on by many developers is much more susceptible to defects); and (c) how long it has been since the last change (it is argued that a file which is changed frequently may be an indicator of bugs).

(Table 3: Number of data sets explored in recent papers at prominent venues that experiment with process and/or product metrics.)

The rest of this section lists prominent results from the Figure 1 survey. From the product metrics community, Zimmermann et al. [110], in their study of the Eclipse project using file- and package-level data, showed that complexity-based product metrics are much better at predicting defective files. Zhang et al. [105], in their experiments, showed that lines-of-code-related metrics are good predictors of software defects on the NASA datasets. In another study using product metrics, Zhou et al. [108] analyzed a combination of ten object-oriented software metrics related to complexity and concluded that size metrics were a much better indicator of defects. A similar study by Zhou and Leung [107], using NASA datasets, evaluated the importance of individual metrics and indicated that while the CBO, WMC, RFC, and LCOM metrics are useful for fault prediction, DIT is not. Menzies et al. [48], in their study of static code metrics for defect prediction, found product metrics to be very effective at finding defects. Basili et al. [8], in their work, showed that object-oriented CK metrics appear to be useful in predicting class fault-proneness, which was later confirmed by Subramanyam and Krishnan [82]. Nagappan et al. [56], in their study, reached a similar conclusion to Menzies et al. [48], but concluded, "However, there is no single set of complexity metrics that could act as a universally best defect predictor". In other studies related to process metrics, Nagappan et al. [57] emphasized the importance of change bursts as a predictor for software defects on the Windows Vista dataset. They achieved precision and recall values of 90% in that study, and achieved a precision of 74.4% and a recall of 88.0% in another study on the Windows Server 2003 dataset.
(Fig. 1: Number of papers exploring the benefits of process and product metrics for defect prediction. The papers in the intersection [7, 25, 26, 34, 53, 70] explore and compare both process and product metrics. Note that, prior to this EMSE paper, the prior work that looked at process and product metrics explored analytics in-the-small.)

In another study, Matsumoto et al. [45] investigated the effect of developer-related metrics on defect prediction. They showed improved performance using these metrics and showed that a module revised by more developers tends to contain more faults. Similarly, Schröte et al. [42], in their study, showed a high correlation between the number of developers for a file and the number of defects in that file. As to the six papers that compare process versus product methods: -Four of these papers argue that process metrics are best. Rahman et al. [70] found that process metrics perform much better than product metrics in both within-project and cross-project defect prediction settings. Their study also showed that product metrics do not evolve much over time and are much more static. Hence, they say, product metrics are not good predictors for defects. Similar conclusions (about the superiority of process metrics) are offered by Moser et al. [53], Giger et al. [25], and Graves et al. [26]. -Only one paper argues that process and product metrics perform similarly. Arisholm et al. [7] found one project where both process and product metrics perform similarly. -Only one paper argues that the combination of process and product metrics is better at predicting defects. Kamei et al. [34] found that in 5 out of 9 versions of 3 projects the combination of process and product metrics performs better than just using process metrics, and that in 9 out of 9 cases it is better than just using product metrics. Of these papers, Moser et al. [53], Arisholm et al. [7], Kamei et al. [34], Rahman et al. [70], Graves et al. [26], and Giger et al. [25] based their conclusions on 1, 1, 3, 12, 15, and 21 projects, respectively. That is to say, these are all analytics in-the-small studies. The rest of this paper checks their conclusions using analytics in-the-large.

This section describes our methods for comparatively evaluating process versus product metrics using analytics in-the-large. To collect data, we searched Github for Java projects from different software development domains. Although Github stores millions of projects, many of these are trivially small, not maintained, or not software development projects. To filter projects, we used the standard Github "sanity checks" recommended in the literature [4, 32, 54]: -Collaboration: refers to the number of pull requests. This is indicative of how many peripheral developers work on the project. We required all projects to have at least one pull request. This shows that the repository is part of a distributed development model, where others have forked or branched the repository to make independent changes and then submitted those changes to the main repository to be merged into the main branch. We also removed any project where all pull requests were submitted by the same developer, which we validated by checking the unique ids of the pull request submitters. -Commits: The project must contain more than 20 commits, as recommended in the literature. Commits in a Github repository represent the amount of activity in the project.
More than 75% of the projects found on Github have fewer than 20 commits; thus 20 is a reasonable cutoff for this filtering criterion. -Defects: The project must have at least 10 instances with defects on Java files. This is because the SMOTE algorithm that we use for balancing the datasets requires at least 10 examples of the minority class. -Forked Project: The project must not be a fork of another repository. This removes potential duplicates and removes from the study any project that is not the project's main branch. We used the Github API to check for the "Forked" flag, and we removed any project where that flag was set. We started with 8023 Github projects from various domains, collected using the Github search API. After applying the sanity checks mentioned above, we selected 700 projects. The Data Statistics section of Table 4 shows the median and IQR of each of the filtering criteria for the selected projects.

For this research, we collected file-level process metrics and file-level product metrics to answer our research questions (RQ1, RQ2, RQ4-RQ8), as suggested by Rahman et al. [70]. We also followed the aggregation process used by Kamei et al. [34] to calculate package-level metrics, lifting the file-level metrics to the package level, in order to investigate and answer RQ3. This data was extracted once and stored as pickle files in the following four steps: 1. We collected 21 process metrics (following the definitions either from Commit Guru or from the definitions shared by Rahman et al.) for each file in each commit by extracting the commit history of the project and then analyzing each commit for our metrics. We used a modified version of the Commit Guru [76] code for this purpose where, instead of aggregating file-specific metric values for a commit, we store metric values for each file. We create objects for each new file we encounter and keep track of the details we need for the calculations (i.e., the developers who worked on the file, the LOCs added, modified, and deleted by each developer, etc.). We also keep track of files modified together, to calculate co-commit-based metrics. After collecting the 21 metrics mentioned in Table 4 for each project, the data is stored as a pickle file to be used for prediction. 2. Secondly, we use the Commit Guru [76] code to identify bug-inducing and bug-fixing commits. This process identifies bug-fixing commits using a keyword-based search. Using these commits, Commit Guru's SZZ algorithm [76, 94] finds the commits that were responsible for introducing those changes and marks them as bug-inducing. This process is performed on all commits throughout the life cycle of the project. Note that for each file labeled as buggy (bug-inducing), there will be another instance of the same file which is non-buggy (bug-fixing). If a file has been fixed multiple times throughout the project history, it will have multiple instances in the dataset. 3. Thirdly, we used the Github tag API to collect the release information for each of the projects. We use the release number and release date supplied by the API to group commits into releases, thus dividing each project into multiple releases for each of the metrics. Note that here we take the release numbers from the tags provided by the contributors of the repository, not by Github. We apply regular expressions to match the release number to either an "X.X.X.X" or "X.X.X" format; for a tag to be considered a release, it needs to differ in the section before the third dot.
4. Finally, we used the Understand tool from Scitools to extract the 45 product metrics used in this study. Understand has a command-line interface to analyze project code and generate metrics from it. We use the data collected in the first two steps to generate a list of commits and their corresponding files, along with class labels for defective and non-defective files. Next, we download the project code from Github and use the git commit information to move the git head to the corresponding commit, so that the code matches that commit. Understand uses this snapshot of the code to analyze the metrics for each file and stores the data in temporary storage. We do this for all commits throughout the project history. For every analyzed commit, we only consider and keep the files that were changed as part of that commit. Here we also add the class labels to the metrics: to mark the files that were defective, we use commit ids along with file names to add the labels. After this last step is done, the 45 product metrics collected for each project are stored in a separate file to answer the research questions of this study. Note that steps one and two required 2 days of computation (on a single 16-core machine), while step four required 7 days (on 5 machines with 16 cores each). The data collected in this way is summarized in Table 4.

In this section, we briefly explain the four classification methods we have used for this study. We selected them based on a prominent paper by Ghotra et al. [24]. Also, all these learners are widely used in the software engineering community. For all the following models, we use the implementation from Scikit-Learn. We applied Differential Evolution (DE) as a hyper-parameter optimizer [84] to tune the models discussed here. However, as shown below, the performance of the Random Forest model with default parameters was so promising that we applied hyper-parameter optimization only to the other three models, not to Random Forest.

Support Vector Machine: this is a discriminative classifier which tries to create a hyper-plane between classes by projecting the data to a higher dimension using kernel tricks [13, 49, 77, 88]. The model learns the separating hyper-plane from the training data and classifies test data based on which side of the hyper-plane an example resides.

Naive Bayes: this is a probabilistic model, widely used in the software engineering community [50, 78, 79, 83, 92], that finds patterns in the training dataset and builds predictive models. This learner assumes that the variables used for prediction are uncorrelated and identically distributed. The classifier uses Bayes' rule: when predicting for test data, the model uses the distributions learned from the training data to calculate the probability of the test example belonging to each class and reports the class with the maximum probability.

Logistic Regression: this is a statistical predictive method similar to linear regression, but one that uses a logistic function to make predictions. Given two classes Y = (0 or 1) and a metric vector X = (x1, x2, ..., xn), the learner first learns the coefficients for each metric that best match the training data. When predicting for test examples, it uses the metric vector of the test example and the coefficients learned from the training data to make a prediction via the logistic function. Logistic regression is widely used in defect prediction [24, 27, 59, 63, 103].
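Since the paper states that all learners come from Scikit-Learn, the snippet below shows one plausible way to instantiate them (including the Random Forest described next). The settings shown are library defaults or illustrative values, not the DE-tuned configurations reported in the study.

```python
# Illustrative instantiation of the four learners via scikit-learn.
# The settings shown are defaults/examples, not the DE-tuned values.
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

learners = {
    "SVM": SVC(kernel="rbf", probability=True),        # hyper-plane via a kernel trick
    "NaiveBayes": GaussianNB(),                         # probabilistic, independence assumption
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(),           # ensemble of trees, default parameters
}
```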
Random Forest: this is an ensemble learning method consisting of multiple classification decision trees built on random subsets of metrics and bootstrapped samples selected from the training data. Test examples are classified by each decision tree in the Random Forest, and the final classification is decided by majority vote. Random Forest is widely used in the software engineering domain [30, 31, 39, 86, 92, 104, 106] and has proven to be effective in defect prediction. Later in this paper, the following distinction will become very significant: of the four learners we apply, Random Forests make their conclusion via a majority vote across multiple models, while all the other learners build and apply a single model.

In this section, we explain the hyper-parameter optimizer used in this study to fine-tune an ML model's parameters. Each ML model has several parameters which decide how it learns to discriminate between desirable and undesirable outcomes, and these parameters can greatly affect the performance of the model. In this study, we used Differential Evolution (DE) as the hyper-parameter optimizer, as it has been widely used in the software engineering and machine learning communities [1, 61, 87, 95]. DE is a stochastic population-based optimization algorithm [81]. DE starts with a frontier of randomly generated candidate solutions. For example, when exploring tuning, each member of the frontier would be a different possible set of control settings for (say) a Support Vector Machine. After initializing this frontier, a new candidate solution is generated by extrapolating by some factor f between other items on the frontier. Such extrapolations are performed for all attributes at probability cf. If the candidate is better than one item of the frontier, then the candidate replaces that frontier item. The search then repeats for the remaining frontier items. For the definition of "better", this study uses the F1-score; i.e., "better" means maximizing the model's F1 score. This process is repeated for lives number of repeated traversals of the frontier. For full details of DE, see Figure 2. As per Storn's advice [81], we chose DE's control settings; for the Support Vector Machine, the tuned parameters include gamma (the kernel coefficient), kernel (the kernel type to be used in the algorithm), and coef0 (the independent term in the kernel function).

Figure 3 illustrates our experimental rig. For each of our 700 selected Java projects, we first use the project's revision history to collect the file-level change metrics, along with class labels (defective and non-defective commits). Then, using information from the process metrics, we use Understand's command-line interface to collect and filter the product metrics. Next, we join the two metric sets to create a combined metric set for each project. Using the evaluation strategies mentioned above, the data is divided into train, validation, and test sets. The data is then filtered depending on the metrics we are interested in (i.e., process, product, or combined) and pre-processed (i.e., data normalization, filtering/imputing missing values, etc.). After pre-processing and metric filtering are completed, the training data is processed using the SMOTE algorithm to handle data imbalance. As described by Chawla et al. [14], SMOTE is useful for re-sampling training data such that a learner can find rare target classes. For more details on SMOTE, see [3, 14].
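A minimal sketch of this pipeline step, assuming the imbalanced-learn implementation of SMOTE and synthetic stand-in data; note that the oversampling is fit on the training split only.

```python
# Sketch of the Figure 3 pipeline step: balance the *training* data with SMOTE,
# then train a default Random Forest. Synthetic stand-in data; assumes the
# imbalanced-learn package for SMOTE.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_train, y_train = make_classification(n_samples=300, weights=[0.9], random_state=0)
X_test, y_test = make_classification(n_samples=100, weights=[0.9], random_state=1)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)  # oversample the defective class
model = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
pred = model.predict(X_test)                   # test data is never touched by SMOTE
prob = model.predict_proba(X_test)[:, 1]       # defect-proneness scores (used again in RQ6)
```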
Note one technical detail: when applying SMOTE, it is important that it is not applied to the validation or test data, since data mining models need to be tested on the kinds of data they might actually see in practice. Finally, we select one of the four learners and apply it to the training set to build a model. If hyper-parameter optimization is to be performed, then the model is tuned using the validation data. Finally, the model is tested using the test data.

(Fig. 3: Framework for this analysis.)

As to how we generate our train/test sets, we report results from two methods: 1. release-based, and 2. cross-validation. Both methods are defined below. We use both since (a) other software analytics papers use cross-validation, while (b) release-based is the evaluation procedure of Rahman et al. As we shall see, these two methods offer very similar results, so debates about the merits of one approach versus the other are something of a moot point. But by reporting results from both methods, it is more likely that other researchers will be able to compare their results against ours. In a cross-validation study, we select all the files collected using the process described in Section 3.1. This includes the files that were labeled as buggy and non-buggy (which can include multiple copies of the same file if it was committed multiple times) throughout the project history. The data for each project is shuffled randomly M times. Each time, the data is divided into N stratified bins. Each bin, in turn, becomes the test set, and the remaining data is further divided into training and validation sets. For this study, we used M = N = 5. An alternative to cross-validation is a release-based approach such as the one used by Rahman et al. Here, given R releases of the software, we divide all the data into R parts. We then train on data from releases 1 to R − 3 and test on releases R − 2, R − 1, and R. This temporal approach has the advantage that future data never appears in the training data.

In this section, we introduce the six evaluation measures used in this study to evaluate the performance of the machine learning models. Based on the results of the defect predictor, humans read the code in the order of what the learner says is most defective. During that process, they find true negative, false negative, false positive, and true positive (labeled TN, FN, FP, TP, respectively) reports from the learner. Recall: this is the proportion of inspected defective changes among all the actual defective changes; i.e., TP/(TP+FN). Recall is used in many previous studies [36, 89, 98-101]. When recall is maximal, we are finding all the target class items. Hence we say that larger recalls are better. Precision: this is the proportion of inspected defective changes among all the inspected changes; i.e., TP/(TP+FP). When precision is maximal, all the reported defective modules are actually buggy (so users waste no time looking at results that do not matter to them). Hence we say that larger precisions are better. Pf: this is the proportion of changes that are suggested as defective but are not actually defective, divided by everything that is not actually defective; i.e., FP/(FP+TN). A high pf suggests developers will be inspecting code that is not buggy. Hence we say that smaller false alarm rates are better. Popt20: a good defect predictor lets programmers find the most bugs after reading the least amount of code [6]. Popt20 models that criterion.
First, we divide the test data into (a) the instances that are predicted to be defective and (b) those that are not. Second, we sort sets (a) and (b) by LOC. Third, we order the test data as sorted (a) followed by sorted (b). Within that order, we then report the percentage of actual bugs found by inspecting the first 20% of the code (measured in terms of LOC). We say that larger Popt20 values are better. IFA: Parnin and Orso [64] warn that developers will ignore the suggestions of static code analysis tools if those tools offer too many false alarms before reporting something of interest. Other researchers echo that concern [37, 64, 96]. IFA counts the number of initial false alarms encountered before we find the first defect. We say that smaller IFA values are better. AUC ROC: this is the area under the receiver operating characteristic curve. The curve plots the true positive rate against the false positive rate and is created by varying the threshold for defects between 0 and 1. This creates a curve between (0,0) and (1,1), where a model making random guesses yields a value of 0.5 (the straight line connecting (0,0) and (1,1)). A model with better performance yields a higher value, with a more convex curve in the upper left part. Hence we say that larger AUC values are better.

When comparing the results of different models in this study, we used a statistical significance test and an effect size test: -A significance test is useful for detecting if two populations differ merely by random noise. -Effect sizes are useful for checking that two populations differ by more than just a trivial amount. For the significance test, we use the Scott-Knott procedure recommended at TSE'13 [51] and ICSE'15 [24]. This technique recursively bi-clusters a sorted set of numbers. If any two clusters are statistically indistinguishable, Scott-Knott reports them both as belonging to the same "rank". To generate these ranks, Scott-Knott first looks for a break in the sequence that maximizes the expected value of the difference in the means before and after the break. More specifically, it splits a list l of size ls into sub-lists m and n of sizes ms and ns (where l = m ∪ n) at the break that maximizes

E(Δ) = (ms/ls) * (mean(m) − mean(l))^2 + (ns/ls) * (mean(n) − mean(l))^2.

Scott-Knott then applies some statistical hypothesis test H to check if m and n are significantly different. If so, Scott-Knott recurses on each division. For this study, our hypothesis test H was a conjunction of the A12 effect size test (endorsed by [5]) and non-parametric bootstrap sampling [19]; i.e., our Scott-Knott divided the data if both bootstrapping and the effect size test agreed that the division was statistically significant (90% confidence) and not a "small" effect (A12 ≥ 0.6).
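To make the effort-aware criteria concrete, here is a small sketch of Popt20 and IFA implemented directly from the descriptions above; the function names, the ascending-LOC ordering, and the assumption that items arrive in inspection order are ours, not code from the study.

```python
import numpy as np

def popt20(pred, actual, loc):
    """% of actual defects found in the first 20% of LOC, inspecting
    predicted-defective files first (each group ordered by LOC, smallest first)."""
    pred, actual, loc = map(np.asarray, (pred, actual, loc))
    defect_idx = np.where(pred == 1)[0][np.argsort(loc[pred == 1])]
    clean_idx = np.where(pred == 0)[0][np.argsort(loc[pred == 0])]
    order = np.concatenate([defect_idx, clean_idx])
    inspected = order[np.cumsum(loc[order]) <= 0.2 * loc.sum()]
    return actual[inspected].sum() / max(actual.sum(), 1)

def ifa(pred, actual):
    """Initial false alarms: warnings inspected before the first real defect."""
    alarms = 0
    for p, a in zip(pred, actual):        # assumes inspection order
        if p == 1 and a == 1:
            return alarms
        if p == 1 and a == 0:
            alarms += 1
    return alarms
```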
RQ 1: For predicting defects, do methods that work in-the-small also work in-the-large?

To answer this question, we use Figure 4, Figure 5, Figure 6, and Figure 7 to compare the Recall, Pf, AUC, Popt20, Precision, and IFA of four different learners using process, product, and combined metrics. In those figures, the metric sets are marked as P (process metrics), C (product metrics), and P+C (combined). Figure 4, Figure 5, and Figure 6 present the cross-validation results, while Figure 7 presents the release-based results. For this research question, the key thing to watch in these figures is the vertical colored box plots. The box plots were generated using the results from all 700 Github projects, where each data point for a project is (a) the median result from 5-fold cross-validation repeated 5 times for Figure 4, Figure 5, and Figure 6, and (b) the median result from the 3 test releases for Figure 7. The horizontal lines running across the middle of the boxes show the median performance of a learner across the 700 Github projects. As we said above in Section 3.5, the best learners are those that maximize recall, precision, AUC, and Popt20 while minimizing IFA and false alarms. Reading the median lines in the box plots, we say that, compared to the Rahman et al. analytics in-the-small study, this analytics in-the-large study finds some things that are the same and some that are different. Like Rahman et al., these results show clear evidence of the superiority of process metrics: except for Popt20 (where there is no significant difference across process, product, and process+product metrics), across all learners the median results from process metrics are clearly always better. That is to say, returning to our introduction, this study strongly endorses the Herbsleb hypothesis that how we build software is a major determiner of how many bugs we inject into that software. As to where we differ from the prior analytics in-the-small study: Random Forest with process metrics is statistically significantly better (achieving a different statistical rank in the Scott-Knott test) than any other learner on every performance measure other than Popt20 and IFA. For Popt20 and IFA, all learners achieve the same statistical ranking from the Scott-Knott test. With these results we need to keep in mind that Logistic Regression, Naive Bayes, and Support Vector Machine were tuned using hyper-parameter optimization, while the results for Random Forest were obtained using default parameters. The hyper-parameter-tuned Logistic Regression and Support Vector Machine models were thus much costlier to build (256 hours for the hyper-parameter-tuned Support Vector Machine vs 10 hours for the default Random Forest). So, unlike the Rahman et al. analytics in-the-small study, we would argue that it is very important which learner is used for analytics in-the-large. Certain learners in widespread use, such as Naive Bayes, Logistic Regression, and Support Vector Machines, may not be the best choice for reasoning from hundreds of software projects. Rather, we would recommend the use of Random Forests. We also performed a small experiment to see if certain metrics only capture certain defects. We analyzed the defects that are only captured by process metrics versus the defects that are only captured by product metrics. Looking into our results, we see that: -Process metrics capture nearly all the defects; evidence: see the very high recall scores for Random Forest with process metrics in Figure 4. -As to product metrics, they tend to miss many defects; observe how, for all learners in Figure 4, the product metrics recalls are much lower than the process metrics recalls. For example, in the case of Random Forests, we found that the product metrics missed 48% of the defects found by process metrics. On the other hand, there are indeed a small number of defects captured by product metrics and not by process metrics, but this case is definitely in the minority (less than 1% in all our studies).
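A minimal sketch of the kind of overlap analysis just described (which true defects are caught only by the process-metrics model, only by the product-metrics model, or by both); all data here is placeholder.

```python
# Overlap of defects caught by the process-metrics model vs the product-metrics
# model. The prediction/label arrays are placeholders for illustration.
actual       = [1, 1, 0, 1, 0, 1]
pred_process = [1, 1, 0, 1, 0, 0]
pred_product = [1, 0, 0, 0, 1, 1]

def caught(pred, actual):
    """Indices of actual defects that the model flags as defective."""
    return {i for i, (p, a) in enumerate(zip(pred, actual)) if p == 1 and a == 1}

process_only = caught(pred_process, actual) - caught(pred_product, actual)
product_only = caught(pred_product, actual) - caught(pred_process, actual)
both = caught(pred_process, actual) & caught(pred_product, actual)
print(len(process_only), len(product_only), len(both))
```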
Hence we say that process metrics are superior at finding nearly all types of defects in a software system, while product metrics are not able to do so. Before going on, we comment on certain other aspects of these results: -We see no evidence of any added value in combining process and product metrics. If we compare the (P+C) results to the (P) results, there is no case in Figure 4, Figure 5, or Figure 6 where the process+product (P+C) metrics do better than just using the process (P) metrics. -Similar to Kamei et al., in the case of effort-aware evaluation criteria, process metrics are superior to product metrics, as can be seen in Figure 6. Note that, in that figure, many of our learners using process metrics have near-zero IFA scores. This is to say that, using process metrics, programmers will not be bombarded with numerous false alarms. But unlike Kamei et al., we do not see any significant benefit when assessing performance via Popt20, the other effort-aware evaluation criterion used by Kamei et al. and this study. -Figure 7 shows the Random Forest results using release-based test sets. As stated in Section 3.5 above, there is very little difference between the release-based results and the cross-validation results of Figure 4, Figure 5, and Figure 6. Specifically, in both our cross-validation and release-based results, (a) process metrics do best; (b) there is no observed benefit in adding in product metrics; and, when using process metrics, random forests have (c) very high precision, recall, and AUC, (d) low false alarms, and (e) very low IFA.

RQ 2: Measured in terms of prediction variability, do methods that work well in-the-small also work at-scale?

To answer this research question, we assess our learners not by their median performance but by their variability. Rahman et al. commented that many different learners might be used for defect prediction since, for the most part, they often give the same results. While that certainly holds for their analytics in-the-small case study, the situation is very different when reasoning at-scale about 700 projects. Looking at the process metrics results in Figure 4, Figure 5, and Figure 6, we see that: 1. The performance of Random Forests is statistically significantly better on all performance measures other than Popt20 and IFA. 2. The box plots for Random Forests are much smaller than those for the other learners in the case of precision, recall, and AUC. That is, the variance in predictive performance is much smaller for Random Forest than for anything else in this study. 3. These results for Random Forests were obtained without hyper-parameter optimization, while the other learners were tuned with hyper-parameter optimization. This makes model building for Random Forest orders of magnitude faster. The size of both these effects is quite marked: Random Forest is usually better (in the median) than Logistic Regression, and the Random Forest variance is smaller than that of the other learners. Why is Random Forest doing so well? We conjecture that, when reasoning about 700 projects, there are many spurious effects. Since Random Forests make their conclusions by reasoning across multiple models, this kind of learner can avoid being confused. Hence, we recommend ensemble methods like Random Forest for analytics in-the-large.
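The RQ2 comparison judges learners by the spread of their per-project scores rather than by medians alone; a sketch with a hypothetical table layout (one row per project, one column per learner, placeholder numbers):

```python
# Judge learners by the spread (IQR) of per-project recall, not just the median.
# Layout and numbers are placeholders.
import pandas as pd

per_project_recall = pd.DataFrame({
    "RandomForest":       [0.97, 0.98, 0.95, 0.99, 0.96],
    "LogisticRegression": [0.90, 0.55, 0.75, 0.95, 0.60],
})
summary = pd.DataFrame({
    "median": per_project_recall.median(),
    "IQR": per_project_recall.quantile(0.75) - per_project_recall.quantile(0.25),
})
print(summary)   # a smaller IQR means more stable predictions across projects
```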
RQ 3: Measured in terms of granularity, does the granularity that works well in-the-small also work at-scale?

In this research question, we evaluate whether the granularity of the metrics matters when predicting defects at scale. This is one of the research questions asked in the study by Kamei et al. Here we measure whether package-level prediction identifies defective packages better than file-level prediction does. There are multiple strategies for creating package-level metrics, such as lifting file-level metrics to the package level, collecting metrics designed for the package level, and lifting file-level prediction results to the package level, as explored by Kamei et al. in their study. We explore the first strategy, i.e., lifting the file-level metrics to the package level. We selected this strategy because Kamei et al. showed in their paper that metrics designed for the package level do not produce good results, that lifting either the files or the results yields similar performance, and because it has been explored by many other researchers. To build a defect predictor using package-level data, we use the process metrics collected for our tasks. For each commit/release, if there are multiple files from the same package, we aggregate them to the package level by taking the median values, as sketched below. Figure 8 shows the difference in performance between the file-level prediction results and the package-level prediction results. It is evident from the results that file-level prediction shows statistically significant improvements over package-level prediction, with the exception of Popt20. This result agrees with Kamei et al., and we conclude that the granularity of the metric set does matter and that file-level prediction has superior performance to package-level prediction.
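A minimal sketch of that lift-up, assuming a pandas table of file-level process metrics with hypothetical column names; the rule for labeling a package as defective is our illustrative assumption, not the paper's stated rule.

```python
# Lift file-level process metrics to package level by taking per-package medians.
# Column names are placeholders; labeling a package defective when any of its
# files is defective is an illustrative assumption.
import pandas as pd

files = pd.DataFrame({
    "package":   ["org.a", "org.a", "org.b"],
    "ndev":      [3, 5, 2],
    "la":        [120, 40, 10],
    "defective": [1, 0, 1],
})
packages = files.groupby("package").agg(
    ndev=("ndev", "median"),
    la=("la", "median"),
    defective=("defective", "max"),
).reset_index()
print(packages)
```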
RQ 4: Measured in terms of stability, are process metrics more/less stable than code metrics when measured at-scale?

To answer this research question, we first tag each commit with a release using the release information from Github. Using this release information, we divide the data into train and test sets, using the last 3 releases as test releases (one by one) and the older releases as training data. If a model built using either process or product data differed significantly across the last 3 releases, that would imply the model built using that set of metrics must be rebuilt for each subsequent release; this, in turn, would create instability. To verify the stability of the models built using each metric set, we build the models using the training data and then check each of the 3 subsequent releases in terms of the evaluation criteria used in this study. We compare both process and product metrics across all 6 criteria mentioned in Section 3.5. Figure 9 shows the performance of the models. The first row of the figure represents the process metrics, while the second row represents the product metrics. Each column represents the evaluation criterion being measured and, inside each plot, each box plot represents one of the last 3 releases. We applied the Scott-Knott statistical test to the results to check, for each evaluation criterion, whether any of the releases is statistically significantly different from the others.

(Fig. 9: Stability of the models across the last 3 releases built using process (P) and product (C) metrics. Each plot shows one of the six performance criteria used in this study for the last 3 releases. The first row shows the results for the process metrics, denoted as Metric Type = P, and the second row shows the results for the product metrics, denoted as Metric Type = C.)

The results show no significant difference between the 3 releases for any evaluation criterion (for either metric type) except Popt20. Popt20 is an effort-aware criterion, as explained in Section 3.5, and we see that for both process-based and product-based models Popt20 does significantly better on the third release. This may be because the third release has more small predicted-defective files than the other two releases; if that is the case, then given how Popt20 is calculated, it can explain the increase in the Popt20 score. That being said, the results show that none of the models built using process or product metrics degrades over time, which reduces the instability of the models. We can also say that, since performance does not degrade over time and since we have already seen that process metrics perform much better than product metrics, it is wiser to use process metrics when predicting defects.

RQ 5: Measured in terms of stasis, are process metrics more/less static than code metrics when measured at-scale?

In this research question, we look for the reason behind the difference in performance between models built using process and product data. Most models learn to differentiate between two classes by learning patterns in the training data and then identifying similar patterns in the test data to predict defects. Throughout the life cycle of a project, different parts of the project are updated and changed as part of regular enhancements. This results in the introduction of bugs, and thus bug fixes for those defective changes. The metrics that we use to create the defect prediction models should be able to reflect those changes, so that the model is able to identify the difference between defective and non-defective changes. This means that if either process or product metrics can capture such differences, then the metric values for a file between release R and R + 1 will not be highly correlated, and models built with such metrics will be better able to differentiate defective from non-defective changes. To measure the stasis of the metrics, we computed the Spearman correlation for every file between two consecutive releases (to check release-based prediction) and between two consecutive commits in which the file was changed (to check JIT-based prediction). Here the metrics for each file at a release are calculated from the last time the file was changed before the release. Thus, to compare release R and R + 1 for a file, we select the commits in which the file was last changed in release R and in release R + 1 and compute the Spearman correlation between them. Figure 10 shows the Spearman correlation values for every file between two consecutive releases/commits for all the projects explored, as a violin plot for each type of metric. A wide and short violin plot indicates that the majority of the values are concentrated near a certain value; in contrast, a thin and long violin plot indicates values spread over a wider range. Figure 10 shows the correlation scores for process and product metrics in both release-based and JIT-based settings. The process and product metrics in the release-based setting are denoted by P_R and C_R respectively, while in the JIT-based setting they are denoted by P_J and C_J; PP_J denotes the package-level process metrics in the JIT-based setting.

(Fig. 10: Spearman correlation of every file between two consecutive checkpoints. The x-axis labels P_R and C_R represent the process and product metrics when the correlation is calculated at the release level, while P_J, C_J, and PP_J represent the process, product, and package-level process metrics calculated in the JIT-based setting.)
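A small sketch of the stasis measurement just described: the Spearman correlation between one file's metric vector at two consecutive checkpoints (the metric values below are placeholders).

```python
# Stasis check: Spearman correlation between one file's metric vector at
# release R and at release R+1 (or at two consecutive commits for JIT).
from scipy.stats import spearmanr

metrics_at_R  = [12, 3, 0.40, 7, 150]   # placeholder metric values for one file at release R
metrics_at_R1 = [14, 3, 0.45, 9, 180]   # the same metrics for the same file at release R+1

rho, _ = spearmanr(metrics_at_R, metrics_at_R1)
print(rho)   # values near 1 mean the metrics barely change, i.e., they are more "static"
```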
From Figure 10 we can see that the product metrics form a wide and short violin plot and are very highly correlated, while the process metrics form a thin and long violin plot, ranging between 0.2 and 1 in the release-based setting and between 0.5 and 1 in the JIT-based setting. If we compare the correlations between the release-based and JIT-based metric sets, we see that the correlation values for the process metrics increase in the JIT-based setting. The reason for this increase is that, for JIT-based metrics, we compare between commits, where the amount of change in a file is smaller than the change measured between two releases (each release contains multiple commits). Similarly, when the process metrics are lifted from file-level to package-level, the correlation increases. So why do process metrics outperform product metrics? We think the stasis property of the metric set is one of the main reasons: product metrics seem to be more static, changing very little over time and between defective and non-defective files. When models are created with such static metric sets, it is hard for the model to learn a pattern and differentiate between defective and non-defective changes. Process metrics, in contrast, change over time and are much less correlated between changes, making them a potentially better metric set for creating defect prediction models.

RQ 6: Measured in terms of stagnation, do models built from different sets of metrics stagnate across releases when measured at-scale?

In this research question, we measure the stagnation property of the models built using process and product metrics. As suggested by Rahman et al., we use the Spearman rank correlation between the learned probability from the training set and the predicted probability from the test set. To obtain the learned and predicted probabilities, we use the defect-proneness scores from the learner (Random Forest in this research question) across all pairs of training and test releases. For each pair of training and test releases, if a file has been committed multiple times during a release, we consider the file instance that was changed last. A high correlation between the learned and predicted probabilities indicates that the model is probably learning to predict the same set of files as defective: it is finding the same probabilities in the test set as in the training set and thus is not able to properly differentiate between defective and non-defective files. Figure 11 shows a box plot of the Spearman rank correlation between the learned and predicted probabilities for models built using process and product metrics on the 700 projects used in this study. We can see that a model built using product data has a significantly higher correlation than a model built using process data. Although these values, for both process and product metrics, are slightly lower than what Rahman et al. reported in their study, the results signify that models built with product metrics are significantly more stagnant than models built using process metrics.
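One plausible reading of the RQ6 computation, sketched with hypothetical names and a hypothetical data layout (one row per file, taking the last change in each release); the paper's own scripts may differ in detail.

```python
# RQ6-style stagnation check (as we read it): correlate the model's
# defect-proneness scores on the training release with its scores on the
# test release, aligned by file name. Assumes one row per file.
import pandas as pd
from scipy.stats import spearmanr

def stagnation_score(model, train_df, test_df, feature_cols):
    learned = pd.Series(model.predict_proba(train_df[feature_cols])[:, 1],
                        index=train_df["file"])
    predicted = pd.Series(model.predict_proba(test_df[feature_cols])[:, 1],
                          index=test_df["file"])
    common = learned.index.intersection(predicted.index)   # files present in both releases
    rho, _ = spearmanr(learned[common], predicted[common])
    return rho   # higher rho = the model keeps ranking the same files as defect-prone
```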
RQ 7: Do stagnant models (based on stagnant metrics) tend to predict recurrently defective entities?

Here we try to verify the stagnation property of the metrics seen in the previous research question. If a model is stagnant, it will predict the same files as defective regardless of whether those files actually contain defects. To evaluate whether the models built on process and product data predict the same files as defective, we follow the approach suggested by Rahman et al. Figure 12 shows the recall and pf of the models built using process and product metrics on all three types of test sets (the recurrent set, the train-only set, and the test-only set). We can see from the figure that models built using either process or product metrics can identify recurrently defective files in the recurrent set. However, there is a significant difference between process and product metrics, with process metrics doing much better at recognizing recurrently defective files. In the train-only test set, we see a very high pf (median ≈ 0.8) for the model built using product data, while the model built using process data has a low pf (median ≈ 0.0). This is a clear indication that the model built using product metrics is stagnant and identifies the same set of files as defective regardless of whether they are actually defective. The test-only set shows a very low recall for the model built using product data and a high recall for the model built using process data, indicating that the model built using product data is unable to identify new defects. These results bolster the claim that process metrics are better at identifying defects than product metrics.

RQ 8: Measured in terms of metric importance, are metrics that seem important in-the-small also important when reasoning in-the-large?

To answer this question, we test whether what is learned from studying a few projects is the same as what might be learned from studying all 700 projects. That is, we compare the rankings given to process metrics using all the projects (analytics in-the-large) to the rankings that might have been learned from an analytics in-the-small study looking at 5 projects (where those projects were selected at random). Figure 13 shows the importance of the metrics in the combined (process + product) data set. This importance reflects which metrics matter while building and making predictions with Random Forest. The importance returned by Random Forest is calculated using a method implemented in Scikit-Learn: specifically, how much each metric decreases the weighted impurity in a tree. This impurity reduction is then averaged across the forest and the metrics are ranked. In Figure 13 the metric importance increases from left to right. That is, in terms of defect prediction, the most important metric is the average number of developers in co-committed files (avg nadev) and the least important metric is the number of directories (nd). In that figure, the process metrics are marked with two blue asterisks (**). Note that nearly all of them appear at the top. That is, in a result consistent with Rahman et al., process metrics are far more important than product metrics.

Fig. 13 (x-axis, importance increasing left to right): CountLine, **lt, **la, **ld, **sctr, **sexp, **exp, **avg_nddev, **ddev, **ndev, **own, **age, **nuc, **adev, **minor, **ncomm, **nadev, **rexp, **avg_ncomm, **avg_nadev.

Figure 14 compares the process-metric rankings learned from analytics in-the-large (i.e., from 700 projects) with those from a simulation of an in-the-small study that looks at five projects selected at random.
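The ranking comparison behind Figures 13 and 14 can be sketched as below. This is only a sketch under the assumption that each project's data is available as a feature matrix plus defect labels and that the in-the-small sample is pooled before training (function and variable names are illustrative, not our exact scripts).

```python
# Minimal sketch of the metric-importance comparison (illustrative only):
# rank metrics by Random Forest's impurity-based importance in-the-large
# (all projects pooled) and in-the-small (5 random projects), then compare.
import random
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier

def importance_ranking(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank metrics by mean decrease in impurity (rank 1 = most important)."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.rank(ascending=False)

def compare_rankings(projects: dict, sample_size: int = 5) -> float:
    """projects maps project name -> (X, y); returns rank agreement (rho)."""
    all_X = pd.concat([X for X, _ in projects.values()])
    all_y = pd.concat([y for _, y in projects.values()])
    large = importance_ranking(all_X, all_y)          # analytics in-the-large

    sampled = random.sample(list(projects), sample_size)
    small_X = pd.concat([projects[p][0] for p in sampled])
    small_y = pd.concat([projects[p][1] for p in sampled])
    small = importance_ranking(small_X, small_y)      # analytics in-the-small

    # Perfect agreement would give rho = 1, i.e., the 45-degree line of Figure 14.
    rho, _ = spearmanr(large, small)
    return rho
```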
In the figure, the X-axis ranks metrics via analytics in-the-large (using Random Forests applied to 700 projects), and the Y-axis ranks process metrics via analytics in-the-small (using Random Forests applied to 5 randomly selected projects). For both the X- and Y-axis rankings, the metrics were sorted by the metric importance returned by the Random Forest classifier. In an ideal scenario, where the ranks are the same, Figure 14 would show a straight line at a 45-degree angle running through the origin. To say the least, this is not what is observed here.

Fig. 14: X-axis ranks metrics via analytics in-the-large (using Random Forests applied to 700 projects). Y-axis ranks process metrics via analytics in-the-small (using Random Forests applied to a random sample of 5 projects).

We would summarize Figure 14 as follows: the importance given to metrics by a few analytics in-the-small studies is very different from the importance learned via analytics in-the-large.

As with any large-scale empirical study, biases can affect the final results. Therefore, any conclusions drawn from this work must be considered with the following issues in mind:

(a) Evaluation Bias: In all research questions in this study, we have shown the performance of models built with process, product, and process+product metrics and compared them using statistical tests to conclude which is the better and more generalizable predictor for defects. While those results hold, that conclusion is scoped by the evaluation criteria we used in this paper. It is possible that other measurements (e.g., the G-score, a harmonic mean of recall and false-alarm, reported in [90]) would reveal differences between these different kinds of projects. This is a matter that needs to be explored in future research.

(b) Construct Validity: At various places in this report, we made engineering decisions about, e.g., the choice of machine learning models and the selection of metric vectors for each project. While those decisions were made using advice from the literature, we acknowledge that other constructs might lead to different conclusions.

(c) External Validity: For this study, we collected data from 700 Github Java projects. The product metrics for each project were collected using a commercial tool called "Understand", and the process metrics were collected using our own code on top of the Commit Guru repository. It is possible that calculating the metrics or labeling files as defective vs. non-defective with other tools or methods would lead to different outcomes. That said, "Understand" is a commercial tool with detailed documentation about its metric calculations, and we have shared our scripts and processes for converting the metrics to a usable format and described our approach to labeling defects.

(d) Sampling Bias: Our conclusions are based on the 700 projects collected from Github. It is possible that a different set of initial projects would have led to different conclusions. That said, this sample is very large, so we have some confidence that it represents an interesting range of projects.

(e) Selection Bias: Our comparison between process, product, and process+product metrics is based on the metrics used in prior work (Rahman et al. [70], Kamei et al. [34]). It is certainly true that other metrics might be more important than those explored here. For future work, we strongly recommend exploring a wider range of metrics, such as those suggested by other researchers [40, 66, 67].
Much prior work in software analytics has focused on in-the-small studies that used a few dozen projects or fewer. Here we checked what happens when we take specific conclusions generated from analytics in-the-small and review them using analytics in-the-large. While some conclusions remain the same (e.g., process metrics generate better defect predictors than product metrics), other conclusions change (e.g., learning methods like logistic regression that work well in-the-small perform comparatively much worse when applied in-the-large).

We find here that issues that may not seem critical in-the-small become significant problems in-the-large. For example:
- Recalling Figure 14, what seems to be an important metric in-the-small can prove to be very unimportant when we start reasoning in-the-large.
- Further, when reasoning in-the-large, variability in predictions becomes a concern.

Thus, when researchers or industry practitioners attempt to:
- generate guidelines or best practices to train new researchers or developers;
- create tools for quality measurement, guide developers to follow best practices, or help developers or researchers in other ways;
- study data to find general defect-related trends/properties of open-source projects;
then it is better to use findings from in-the-large analysis. The reason is that if the lessons learned change from project to project, it will be very hard to generate guidelines or create tools that are stable enough for an organization. This is an issue since:
- If the guidelines or tools are not stable, then developers or researchers will lose trust in those tools.
- Also, when trying to find general trends in software projects, trends found from an in-the-small study might change when the selected projects change; those would not be general trends but project-specific trends.
- We also found that certain systems issues that seem unimportant in-the-small become critical when scaling up to in-the-large: product metrics are an order of magnitude harder to manage. We described one case above where the systems requirements needed for product metrics meant that we very nearly did not deliver this research in a timely manner.

Based on this experience, we say:
- Industrial practitioners should make use of in-the-large findings, or re-validate in-the-small findings with in-the-large analysis, before applying them at the organizational level either to create guidelines or to build tools.
- Analysts performing analytics in-the-large should use process metrics and ensemble methods like random forests, since these can better handle the kind of large-scale spurious signals seen while reasoning over hundreds of projects.
- SE researchers must now:
  - revisit many of the conclusions previously obtained via analytics in-the-small to check whether those findings still hold under in-the-large analysis;
  - perform in-the-large analysis when trying to find general trends in software projects in their research.

More generally, what does this work say about the need for, and benefits of, quantitative versus qualitative methods in defect-related research in-the-large? Quantitative studies can scale to a very large number of projects (as shown by this study), while qualitative studies can find specific, nuanced features of a small set of projects (for evidence, see Figure 14).
However, it would be wrong to use this study to say (e.g.) "stop qualitative studies" since, in our experience, more can be achieved by combining the two approaches than by focusing on just qualitative or just quantitative methods. For example, previously, with Chen and Stolee [16], we argued for a marriage of qualitative and quantitative methods to effectively reduce the effort associated with the partial replication and enhancement of qualitative studies. In the case study of that paper [16], a qualitative study explored factors influencing the fate of GitHub pull requests using extensive qualitative analysis of 20 pull requests. Guided by their findings, we mapped some of their qualitative insights onto quantitative questions. To determine how well their findings generalize, we collected much more data (ten times as many additional pull requests from hundreds of GitHub projects). This combined approach resulted in a new predictor for whether code would be merged. That predictor was far more accurate than one built from the original study's qualitative factors (F1 = 90% vs 68%), illustrating the value of a mixed-methods approach and replication to improve prior results. We conjecture that this case study is representative of an underlying methodology for scaling and extending primary qualitative studies that require expert opinion. Hence, one future direction for this research could be to encourage more studies that replicate parts of primary qualitative studies using quantitative methods (since these scale to a large number of projects). Further, we should not stop there: the insights gained from this combined qualitative/quantitative approach could be used to design insightful subsequent studies.

This work was partially funded by NSF Grant #1908762.

References

Is better data better than better data miners?: on the benefits of tuning smote for defect prediction
What is wrong with topic modeling? and how to fix it using search-based software engineering
"Better data" is better than "better data miners" (benefits of tuning SMOTE for defect prediction)
We don't need another hero? The impact of "heroes" on software development
A practical guide for using statistical tests to assess randomized algorithms in software engineering
Predicting fault-prone components in a java legacy system
A systematic and comprehensive investigation of methods to build and evaluate fault prediction models
A validation of object-oriented design metrics as quality indicators. Software Engineering
Putting it all together: Using socio-technical networks to predict failures
Does distributed development affect software quality? An empirical case study of windows vista
Don't touch my code! Examining the effects of ownership on software quality
Developing interpretable models with optimized set reduction for identifying high-risk software components
An improved twin support vector machine based on multi-objective cuckoo search for software defect prediction
Smote: synthetic minority over-sampling technique
Applications of psychological science for actionable analytics
Replication can improve prior results: A github study of pull request acceptance
Empirical analysis of change metrics for software fault prediction
An extensive comparison of bug prediction approaches
An introduction to the bootstrap
Software metrics: roadmap
Tuning for software analytics: Is it really necessary? Information and Software Technology
Choosing software metrics for defect prediction: an investigation on feature selection techniques. Software: Practice and Experience
Revisiting the impact of classification techniques on the performance of defect prediction models
Revisiting the impact of classification techniques on the performance of defect prediction models
Method-level bug prediction
Predicting fault incidence using software change history
An investigation on the feasibility of cross-project defect prediction
Socio-technical coordination (keynote)
Supervised vs unsupervised models: A holistic look at effort-aware just-in-time defect prediction
Software defect prediction using feature selection and random forest algorithm
Improved random forest algorithm for software defect prediction through data mining techniques
The promises and perils of mining github
Revisiting common bug prediction findings using effort-aware models
Revisiting common bug prediction findings using effort-aware models
The effects of over and under sampling on fault-prone module detection
A large-scale empirical study of just-in-time quality assurance
Practitioners' expectations on automated fault localization
The impact of context metrics on just-in-time defect prediction
Bellwethers: A baseline method for transfer learning
Progress on approaches to software defect prediction
Learning better inspection optimization policies
Is external code quality correlated with programming experience or feelgood factor
Which process metrics can significantly improve defect prediction models? An empirical study
Trends in topics at SE conferences
An analysis of developer metrics for fault prediction
Data mining static code attributes to learn defect predictors
Defect prediction from static code features: Current results, limitations, new approaches
Data mining static code attributes to learn defect predictors
500+ times faster than deep learning: (a case study exploring faster methods for text mining stackoverflow)
Implications of ceiling effects in defect predictors
Ranking and clustering software cost estimation models through a multiple comparisons algorithm
A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction
A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction
Curating github for engineered software projects
Using software dependencies and churn metrics to predict field failures: An empirical case study
Mining metrics to predict component failures
Change bursts as defect predictors
Heterogeneous defect prediction
Transfer defect learning
Clever: combining code metrics with clone detection for just-in-time fault prevention and resolution in large industrial projects
A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification
Where the bugs are
Domain adaptation via transfer component analysis
Are automated debugging techniques actually helping programmers?
Fine-grained just-in-time defect prediction
On the performance of method-level bug prediction: A negative result
Software fault prediction metrics: A systematic literature review. Information and Software Technology
Ownership, experience and defects: a fine-grained study of authorship
How, and why, process metrics are better
How, and why, process metrics are better
Comparing static bug finders and statistical prediction
Comparing static bug finders and statistical prediction
Sample size vs. bias in defect prediction
Bugcache for inspections: hit or miss?
Commit guru: Analytics and risk prediction of software commits
Commit guru: analytics and risk prediction of software commits
Value-cognitive boosting with a support vector machine for cross-project defect prediction
An empirical study of the classification performance of learners on imbalanced and noisy software quality data
Predicting faults in high assurance software
Can traditional fault prediction models be used for vulnerability prediction?
Differential evolution: a simple and efficient heuristic for global optimization over continuous spaces
Empirical analysis of CK metrics for object-oriented design complexity: Implications for software defects
Using coding-based ensemble learning to improve software defect prediction
The impact of automated parameter optimization on defect prediction models
The impact of mislabelling on the performance and interpretation of defect prediction models
Automated parameter optimization of classification techniques for defect prediction models
The impact of automated parameter optimization on defect prediction models
A comparison on multi-class classification methods based on least squares twin support vector machine. Knowledge-Based Systems
While tuning is good, no tuner is best
Better data labelling with emblem (and how that impacts defect prediction)
On the relative value of cross-company and within-company data for defect prediction
Using class imbalance learning for software defect prediction
Do too many cooks spoil the broth? Using the number of developers to enhance defect prediction models
Szz revisited: verifying when changes induce fixes
Hyperparameter optimization for effort estimation
"Automated debugging considered harmful" considered harmful: A user study revisiting the usefulness of spectra-based fault localization techniques with professionals using real bugs from large systems
Massively compositional model for cross-project defect prediction
Collective personalized change classification with multiobjective search
Tlel: A two-layer ensemble learning approach for just-in-time defect prediction
Deep learning for just-in-time defect prediction
Effort-aware just-in-time defect prediction: simple unsupervised models could be better than supervised models
Learning to rank relevant files for bug reports using domain knowledge
Data transformation in cross-project defect prediction
Cross-project defect prediction using a connectivity-based unsupervised classifier
An investigation of the relationships between lines of code and defects
Predicting defective software components from code complexity measures
Empirical analysis of object-oriented design metrics for predicting high and low severity faults
On the ability of complexity metrics to predict fault-prone classes in object-oriented systems
Cross-project defect prediction: a large scale experiment on data vs. domain vs. process
Predicting defects for eclipse