Submitted 15 August 2019
Accepted 10 November 2019
Published 16 December 2019
Corresponding author: Philipp Leitner, philipp.leitner@chalmers.se
Academic editor: Arie van Deursen
DOI 10.7717/peerj-cs.245
Copyright 2019 Guo and Leitner
Distributed under Creative Commons CC-BY 4.0
OPEN ACCESS

Studying the impact of CI on pull request delivery time in open source projects—a conceptual replication

Yunfang Guo and Philipp Leitner
Software Engineering Division, Chalmers | University of Gothenburg, Gothenburg, Sweden

ABSTRACT
Nowadays, continuous integration (CI) is indispensable in the software development process. A central promise of adopting CI is that new features or bug fixes can be delivered more quickly. A recent repository mining study by Bernardo, da Costa & Kulesza (2018) found that only about half of the investigated open source projects actually deliver pull requests (PRs) faster after adopting CI, with small effect sizes. However, there are some concerns regarding the methodology used by Bernardo et al., which may potentially limit the trustworthiness of this finding. In particular, they do not explicitly control for normal changes in pull request delivery time during a project's lifetime (independently of CI introduction). Hence, in our work, we conduct a conceptual replication of this study. In a first step, we replicate their study results using the same subjects and methodology. In a second step, we address the same core research question using an adapted methodology. We use a different statistical method (regression discontinuity design, RDD) that is more robust towards the confounding factor of projects potentially getting faster at delivering PRs over time naturally, and we introduce a control group of comparable projects that never applied CI. Finally, we also evaluate the generalizability of the original findings on a set of new open source projects sampled using the same methodology. We find that the results of the study by Bernardo et al. largely hold in our replication. Using RDD, we do not find robust evidence of projects getting faster at delivering PRs without CI, and we similarly do not see a speed-up in our control group that never introduced CI. Further, results obtained from a newly mined set of projects are comparable to the original findings. In conclusion, we consider the replication successful.

Subjects: Software Engineering
Keywords: Continuous integration, Mining software repositories, Replication, Pull-request based development

INTRODUCTION
Continuous Integration (CI) is by now a popular practice in the software community (Duvall, Matyas & Glover, 2007). CI helps developers integrate changes frequently in a collaborative manner. As a distributed and cooperative practice, CI is commonly used in both commercial and open source software (OSS) development. Considerable previous research has investigated the impact of CI on OSS projects. Vasilescu et al. (2015) found that core developers are able to discover more bugs using CI. Ståhl & Bosch (2014) claim that integrators tend to release more frequently after adopting CI. Finally, a recent study
by Bernardo, da Costa & Kulesza (2018) empirically analyzed whether CI improves the time-to-delivery of merged Pull Requests (PRs) that are submitted to GitHub projects. Interestingly, this study revealed that only 51.3% of the analyzed OSS projects actually deliver merged PRs more quickly after adopting CI. The authors present an increase in PR submission numbers after adopting CI as a possible reason for this relatively counter-intuitive result. Further, the authors used regression analysis to identify two factors (merge workload and queue rank) that are the main predictors of PR delivery time.

However, we observe that the study by Bernardo et al. exhibits some important limitations. Firstly, their methodology consists of comparing various PR-related metrics before and after CI adoption without controlling for confounding factors, most importantly that PR delivery time may increase or decrease naturally over the lifetime of a project. For example, it is conceivable that projects may just naturally get better at merging PRs over time, independently of whether they adopt CI or not. Secondly, they do not make use of a control group of projects that never adopted CI in the first place. In our opinion, this limits the trustworthiness of the results of Bernardo et al. Hence, in our work, we present a conceptual replication (Shull et al., 2008) of this study. We replicate their work and investigate the same research questions with a slightly different methodology, and by incorporating additional study objects. Concretely, we investigate the following research questions.

RQ1: Exact Replication. Can the original study results be reproduced?
As a baseline, we reproduce the original results of the study, using the same methodology and the data provided by the authors. We are able to achieve very similar results, with minor differences (between 1.1 and 5.5 percentage points difference to the originally published results).

RQ2: Conceptual Replication. To extend the original study methodology, and to address the concerns we have with the experimental methodology as initially proposed, we investigate two different aspects:

RQ2.1: Can similar results to the original study be found when controlling for changes in PR delivery time over the lifetime of a project?
To answer this question, we apply Regression Discontinuity Design (RDD) (Thistlethwaite & Campbell, 1960), a statistical method that allows us to evaluate whether there is a trend in PR delivery times over time, and whether this trend changes significantly when CI is introduced. We find no clear evidence of such trends in the data, alleviating our concerns in this regard. However, we observe that PR delivery times depend strongly on when in the release cycle a PR is merged. PRs that are merged close to the next release are released much more quickly than PRs that come in shortly after a release. This indicates that, ultimately, CI introduction may have less impact on PR delivery times than how often a project releases.

RQ2.2: Are there other factors besides merge workload and queue rank that strongly impact the PR delivery time?
Based on the results of RQ2.1, we hypothesize that one important factor impacting PR delivery time that is not directly captured in the original study is when in the release cycle a new PR is submitted. We incorporate this additional variable into the regression model, and evaluate whether it is a better predictor than the variables in the original study. We find that this "come-in time" indeed is the best predictor of PR delivery time for a majority of projects.

RQ3: Generalizability. Finally, to evaluate the generalizability of the results, we apply our adapted methodology to two new data sets: a new data set of study subjects collected using the same methodology as in the original study, and a control group of projects that have similar characteristics but have, to the best of our knowledge, never applied CI.

RQ3.1: Can similar results be found when applying the same methodology to different projects that have also adopted CI?
We find that results for a new set of study subjects vary by up to 14 percentage points. However, the high-level conclusions drawn by the original study still hold for our replication using new data. Hence, we consider the original findings to be largely confirmed using additional data, with the caveat that the individual differences between projects may be very high.

RQ3.2: Can similar results be found when applying the same methodology to different projects that have never adopted CI?
Finally, we collect a control group of comparable projects that have never adopted CI. We observe results that vary between 10 and 16 percentage points from what has been observed based on the original data, i.e., the results of applying the same methods to a control group are only mildly more different than applying the same methods to a new test group (RQ3.1). However, we observe that projects in the control group do not increase the number of PRs they are able to handle per release over time. This is different from both test groups, where we observe a statistically significant increase in submitted, merged, and released PRs per release after CI adoption.

In summary, we consider the replication successful. Our concern regarding trends in the data has largely been alleviated, and an analysis of a control group has led to, at least subtly, different results. However, our results also indicate that PR delivery times seem to depend more strongly on when in the release cycle a PR comes in than on whether or not a CI system is present. This is consistent with the original study, which also reported that the presence of CI only impacts delivery time metrics with small effect sizes. Our study sheds some more light on why this is the case. Finally, we conclude that the delivery time of PRs is not strongly impacted by whether a project adopts CI, but projects that do are able to handle more PRs per release than projects that do not.

The present article is based on work conducted by the first author over five months in early 2018 as part of her master's thesis project at Chalmers University of Technology, under the supervision of the second author (Guo, 2019). The results presented here are a summary of this work, and more details can be found in the thesis report.

BACKGROUND
We now present important background on CI and the pull request based development model.
Further, we summarize the main results of Bernardo, da Costa & Kulesza (2018), which we attempt to replicate in our study.

CI and the pull request based development model
CI is a practice which originated from Agile software development and Extreme Programming. Its core tenet is the merging of all developer working copies to a shared mainline several times a day. Each integration is then verified by an automated build, which allows errors to be detected and located as early as possible (see also https://www.thoughtworks.com/continuous-integration). CI promises manifold benefits, such as quickening the delivery of new functionality (Laukkanen, Paasivaara & Arvonen, 2015) and reducing problems of code integration in a collaborative environment (Vasilescu et al., 2014), hence guaranteeing the stability of the code in the mainline. Consequently, CI has found widespread practitioner adoption (Hilton et al., 2016), making it a relevant subject of academic study.

Tightly linked to CI (and to the GitHub open source development platform, https://github.com) is the idea of pull request based development (see Fig. 1 for a schematic overview). In this model, the main repository is not shared with external developers. Instead, prospective contributors fork the central repository and clone it to a local repository. The contributor makes changes to the local repository, and commits their changes there. These local changes are then submitted to the main repository by opening a PR in the central repository. A CI system, such as Travis-CI (https://travis-ci.com), then automatically merges the PR into a test branch and runs the tests to check if the PR breaks the build. Finally, one or more rounds of code review (Bacchelli & Bird, 2013; McIntosh et al., 2014) are conducted and the integrator decides whether to approve the PR, after which it is merged and closed.

Does using CI lead to faster pull request delivery?
Note that a CI system is not strictly required for the pull request based development model to be followed. Bernardo, da Costa & Kulesza (2018) have studied whether using a CI system, which, as described, automates much of the testing that integrators otherwise would have to do manually, leads to shorter PR delivery times. They collected 162,653 PRs and 7,440 releases of 87 OSS projects using the GitHub API, and addressed the following three research questions:

RQ1: Are merged pull requests released more quickly using CI?
RQ2: Does the increased development activity after adopting CI increase the delivery time of PRs?
RQ3: What factors impact the delivery time after adopting CI?

By applying non-parametric tests to the merge and delivery time of PRs, the authors drew the conclusion for RQ1 that only half of the projects deliver PRs faster after adopting CI, and that 71.3% of the studied projects merge PRs faster before using CI. In RQ2, they found that there is a considerable increase in the PR submission, merge, and delivery rate, concluding that this may be the reason why projects do not deliver merged PRs faster after adopting CI. They also found that the number of releases per year does not change significantly after CI adoption. In RQ3, they built linear regression models for each project and used the Wald χ² maximum likelihood test to evaluate the explanatory power of a number of different factors.
They found that the two variables with the highest explanatory power were both related to the volume of PRs that have to be merged, namely the merge workload (how many PRs are waiting to be merged?) and the queue rank (is the PR at the beginning or the end of the merge queue?).

Figure 1: An overview of the pull request based development model.

RELATED WORK
We now discuss previous work in related fields and how the research questions in our study fill gaps in the field.

CI adoption and pull requests
Previous researchers have investigated the impact of adopting CI on projects from multiple angles. Most papers agree that the introduction of CI is beneficial to projects. Manglaviti et al. (2017) examined the human resources that are associated with developing and maintaining CI systems. They analyzed 1,279 GitHub repositories that adopt Travis-CI using quantitative methods. The authors found that for projects with growing contributor bases, adopting CI becomes increasingly beneficial and sustainable as the projects age.

Further, there is a strong expectation that CI should improve the productivity of projects. Miller (2008) analyzed the impact of CI by summarizing their experience with CI in a distributed team environment at Microsoft in 2007. They collected various CI-related data in their daily work, and report that teams moving to a CI-driven process can expect at least a 40% reduction in check-in overhead compared to a check-in process that maintains the same level of code base and product quality. Ståhl & Bosch (2014) argued based on survey results that build and test automation saves programmers' time for more creative work, and should thus increase productivity. Stolberg (2009) argued that CI practices speed up the delivery of software by decreasing integration times. However, not all previous studies agree that adopting CI improves productivity. For instance, Parsons, Ryu & Lal (2007) found no clear benefits of CI on either productivity or quality.

Related research has shown that the PR based development model is popular in OSS projects. For instance, Vasilescu et al. (2014) collected 223 GitHub projects and found that for 39 of 45 projects (87%), builds corresponding to PRs are much more likely to succeed than builds corresponding to direct pushes. Gousios, Pinzger & van Deursen (2014) found that 14% of repositories on GitHub are using PRs. They selected 291 projects from the GHTorrent corpus, and concluded that the PR model offers fast turnaround, increased opportunities for community engagement, and decreased time to incorporate contributions.

CI impact on pull request success and release frequency
Our study focuses on whether CI has an impact on PR delivery time. Bernardo, da Costa & Kulesza (2018) have conducted an extensive mining study on this subject (as discussed in more detail in 'Does using CI lead to faster pull request delivery?'). Our present work is a conceptual replication of their paper. Hilton et al. (2016) have previously analyzed 34,544 open source projects from GitHub and surveyed 442 developers.
The authors found that CI helps projects release twice as often, and that when using CI, PRs are accepted 1.6 hours sooner in the median. Vasilescu et al. (2014) studied the usage of Travis-CI in a sample of 223 GitHub projects written in Ruby, Python, and Java. They found that the majority of projects (92.3%) are configured to use Travis-CI, but less than half actually use it. In follow-up research, they investigated the productivity and quality of 246 GitHub projects that use CI (Vasilescu et al., 2015). They found that projects that use CI successfully process, accept, and merge more PRs. This increased productivity does not appear to be gained at the expense of quality. Finally, Yu et al. (2015) collected 103,284 PRs from 40 different GitHub projects. They investigated which factors affect PR evaluation latency in GitHub by applying a linear regression model and quantitative analysis. They found that the size of a PR and the availability of the CI pipeline are strong predictors of PR delivery time. In later work, the same authors used a linear regression model to analyze which factors affect the process of the pull request based development model in the context of CI (Yu et al., 2016). They found that the likelihood of rejection of a PR increases by 89.6% when the PR breaks the build. The results also show that the more succinct a PR is, the greater the probability that it is reviewed and merged quickly.

Replication studies
The need for conducting more replications of published research is by now rather widely accepted in the software engineering community, as documented through efforts such as the ROSE Festival (held, for instance, at ICSE (https://2019.icse-conferences.org/track/icse-2019-ROSE-Festival) and FSE (https://github.com/researchart/rose3-fse19) in 2019). In general, replication is necessary to increase the trust in any individual piece of research: the results of any one study alone cannot be extrapolated to all environments, as there are typically many uncontrollable sources of variation between different environments (Shull et al., 2002). Successful replication increases the validity and reliability of the outcomes observed in an experiment (Juristo & Gómez, 2012). Shull et al. (2008) distinguish two types of replication studies. In exact replications, the original experimental design is followed as exactly as possible, while a conceptual replication attempts to answer the same research questions using an adapted methodology. We argue that conceptual replications are even more important than exact ones, as they allow us to control for deficiencies in research design, whereas exact replications mostly validate experiment execution. However, not all researchers share this excitement about replication studies. Shepperd (2018) argued that, due to wide prediction intervals, most replications end up successful anyway. Further, according to Basili, Shull & Lanubile (1999), replication studies in software engineering are particularly difficult to conduct, as experiments in this field usually involve a large number of context variables.
Consequently, a systematic mapping study of replications in the software engineering field (Shull et al., 2008) concluded that the absolute number of replications is still small, in particular considering the breadth of topics in software engineering. Their study retrieved more than 16,000 articles, from which they selected 96 articles reporting only 133 replications.

Our work is a contribution towards increasing the trustworthiness of research on the impact of CI on PR delivery times. Our replication design combines exact with conceptual replication: we decided to not deviate far from the original design of Bernardo, da Costa & Kulesza (2018) and to largely follow their style of presentation, while at the same time addressing the methodological concerns we had with their original work.

METHOD
The goal of the present study is to replicate and extend the results from the earlier work presented in 'Does using CI lead to faster pull request delivery?'. We now discuss our scientific methodology and the data that has been used. Fig. 2 provides a schematic overview. For RQ1, RQ2.1, and RQ2.2, the data set from the original study is re-used. For RQ3.1 and RQ3.2, two new data sets are collected from GitHub. For RQ1, the original statistical methods are re-used. For RQ2.1, an alternative analysis approach (RDD) is employed. For RQ2.2, the same method is extended with an additional analysis variable (the point in time in the release cycle when a PR is submitted, "come-in time"). For RQ3.1, all analyses are applied to the new data sets. For RQ3.2, we only apply non-parametric tests, as our findings do not warrant applying the rest of the analyses to this data set. All data as well as the necessary analysis scripts are publicly available on GitHub: https://github.com/radialine/Do-Open-Source-Projects-Deliver-Pull-Requests-Faster-Using-CI.

Figure 2: Overview of study methodology and used data. Shaded elements are re-used from Bernardo, da Costa & Kulesza (2018).

Study subjects and data collection
As depicted in Fig. 2, our study relies on three different sets of study objects: the original data provided by the authors, a set of new projects collected using the same methodology (new data), and a control group consisting of projects collected using the same methodology, but which, crucially, have to the best of our knowledge never adopted CI (control data). Basic information about the three data sets is contained in Table 1. The collection procedure is further described below.

Original data
We re-use the data that Bernardo, da Costa & Kulesza (2018) have made available online at https://prdeliverydelay.github.io/#datasets. However, for a subset of our analysis, we need additional information not contained in the original data (e.g., the exact point in time when a PR was merged). This information was collected directly through the GitHub API.
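As an illustration of this kind of collection, the sketch below lists merged PRs and their merge timestamps through the GitHub REST API. It is not the actual collection script used in the study; the repository name and access token are placeholders, and pagination and rate limiting are kept minimal.

```python
# Minimal sketch: list merged PRs of a repository and their merge timestamps
# via the GitHub REST API. Repository name and token are placeholders.
import requests

def fetch_merged_prs(repo, token):
    """Yield (PR number, merged_at) for every merged PR of `repo`, e.g. 'boto/boto'."""
    url = f"https://api.github.com/repos/{repo}/pulls"
    headers = {"Authorization": f"token {token}"}
    page = 1
    while True:
        resp = requests.get(url, headers=headers,
                            params={"state": "closed", "per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            break
        for pr in batch:
            if pr.get("merged_at"):          # closed-but-unmerged PRs have merged_at == null
                yield pr["number"], pr["merged_at"]
        page += 1

# Example usage (requires a valid personal access token):
# for number, merged_at in fetch_merged_prs("boto/boto", "<token>"):
#     print(number, merged_at)
```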
New data
For collecting a new data set, we largely follow the process originally used by Bernardo, da Costa & Kulesza (2018), which in turn was inspired by Vasilescu et al. (2015). We identify the 800 most highly-starred GitHub projects written in each of Java, Python, PHP, Ruby, and JavaScript. This leads to a total of 2,763 unique projects (projects that appear in multiple of these lists are counted only once). We discard all projects that are not using Travis-CI, as well as all projects that were already contained in the original data set. We further exclude all projects that have fewer than 100 merged PRs before or after CI adoption. That is, we only consider projects that have had reasonable development activity both before and after adopting CI. Finally, we also discard toy projects, tutorials, and other similar projects that are not intended to be deployed to production. This leaves us with 54 projects, for which we then collect PR and release data using git and the GitHub API.

Control data
We use the same process as for the new data to collect a control group, with the key difference that we discard all projects for which we can tell that they are, or have been, using any CI system, leading to 28 projects. Note that this data set is smaller, as, given the prevalence of CI, it is difficult to find high-profile projects with similar characteristics to the projects in the other two data sets which never adopted CI in their lifetime.

Analysis methods
As shown in Fig. 2, we use three different statistical methods in our study. We replicate two of the methods used in the original study, and introduce a third, new, method (regression discontinuity design, RDD).

Table 1: Basic data set statistics.

  Data set        # of projects   Total # of PRs
  Original data   87              162,653
  New data        54              84,487
  Control data    28              47,519

Methods re-used from the original study
In line with the original work, we use non-parametric hypothesis testing (Mann-Whitney-Wilcoxon, MWW) to test whether there is a statistically significant difference in pull request delivery time before and after CI introduction. MWW is used as data normality could not be assumed. MWW is used in conjunction with Cliff's delta to measure effect sizes, using the standard threshold values as defined by Romano et al. (2006). Additionally, we use a multiple regression model fitted with ordinary least squares to identify which factors best explain a dependent variable (delivery delay, in our case). We use the Wald χ² maximum likelihood test to evaluate the explanatory power of each independent variable.

RDD
Due to our concern that the original study did not properly control for changes in PR submissions and PR delivery time that are independent of CI (due to, for instance, project growth or other project-lifetime-related factors), we extend the original work with an additional statistical method, RDD, as inspired by the work of Zhao et al. (2017). RDD is a fairly old idea, first proposed by Thistlethwaite & Campbell (1960), which is seeing a renaissance in recent years (Imbens & Lemieux, 2008). It is a quasi-experimental pretest-posttest design that elicits the causal effects of an intervention by defining a cutoff point at which the intervention is applied (CI introduction, in our case). The assumption of RDD is that, absent the intervention, the observed trend would have continued without change.
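As a minimal illustration of this idea, the sketch below fits one of the model variants described below (a linear model with different slopes on each side of the cutoff) on synthetic data, not on the study's data set; the indicator term models a jump at the cutoff, the interaction term a change in slope.

```python
# Sketch of an RDD-style segmented fit on synthetic data: merge delay (md, in days)
# over project lifetime in weeks, with the CI-adoption cutoff centred at week 0.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
week = np.arange(-52, 52)
# Synthetic merge delays with a small drop at the cutoff, plus noise.
md = 5 + 0.01 * week - 1.0 * (week >= 0) + rng.normal(0, 1, week.size)
df = pd.DataFrame({"week": week, "md": md, "after": (week >= 0).astype(int)})

# 'after' allows an intercept jump (the discontinuity); 'week:after' allows the slope to change.
fit = smf.ols("md ~ week + after + week:after", data=df).fit()
print(fit.params)  # a large, significant 'after' coefficient would indicate a discontinuity at CI adoption
```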
We would conclude that CI had a significant impact if there is an obvious discontinuity around the cutoff point (the point in time when the intervention has been applied).

A core question when applying RDD is which model(s) to use for fitting the data before and after the intervention. In this study, four models of RDD are used, as sketched in Fig. 3. The linear model with common slope assumes that the data before and after the intervention can be fit using the same linear regression model (shifted by a constant), while the linear model with different slopes only assumes that both sides can be fit by some linear regression. The non-linear model assumes that at least one side requires fitting using a non-linear regression. Finally, local linear regression performs exactly that, using the Imbens-Kalyanaraman optimal bandwidth calculation.

Figure 3: RDD estimation models.

RESULTS
We now discuss the results for each research question. Given that this is a replication study, a particular emphasis will be put on comparing our results to Bernardo, da Costa & Kulesza (2018) and evaluating to what extent the results therein still hold.

RQ1 – exact replication
As a first step in the study, we conducted an exact replication of Bernardo, da Costa & Kulesza (2018), based on the data that the authors provide. This was deemed necessary as a first step of validation, but also to acquire the necessary in-depth knowledge about the original study's design choices. RQ1 of the original study investigated the impact of adopting CI on the delivery time of PRs. They analyzed three metrics: delivery delay (DD, days between when a PR got merged and when it was released), merge delay (MD, days between when a PR was submitted and when it was merged), and pull request lifetime (PL, days between when a PR was submitted and when it was released). A visual overview of these metrics and what they mean in the PR lifecycle is presented in Fig. 4.

Figure 4: Graphical overview of the evaluated metrics DD (delivery delay), MD (merge delay), and PL (pull request lifetime) over the PR lifecycle (PR submitted, PR merged, PR released). Adapted from Bernardo, da Costa & Kulesza (2018).

Table 2: Results of the exact replication of RQ1 in Bernardo, da Costa & Kulesza (2018).

                     Faster with CI     Stat. different
                     [% of projects]    [% of projects]
  DD   Original      51.4%              82.8%
       Replication   47.9%              83.9%
       Difference     3.5                1.1
  MD   Original      27%                72.4%
       Replication   29.4%              78.2%
       Difference     2.4                5.8
  PL   Original      48.4%              71.3%
       Replication   52.4%              72.4%
       Difference     4                  1.1

After carefully studying the original paper and limited follow-up discussion with the authors through private communications, we are able to reproduce their results. Table 2 contrasts the original results with the results of our exact replication. We report the percentage of projects for which each of these metrics improved after introducing CI (i.e., handling PRs became faster) and the percentage of projects for which there is a statistically significant difference (in any direction).
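To illustrate the per-project comparison behind Table 2, the following is a simplified sketch (on synthetic data, not an excerpt from the analysis scripts) of how the MWW test and Cliff's delta can be computed for one project and one metric. The actual analysis applies this comparison to the DD, MD, and PL distributions of each project separately.

```python
# Sketch of the per-project before/after comparison: MWW test plus Cliff's delta
# on synthetic delivery delays (in days).
import numpy as np
from scipy.stats import mannwhitneyu

def cliffs_delta(a, b):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs (naive O(n*m) implementation)."""
    a, b = np.asarray(a), np.asarray(b)
    greater = sum(int((x > b).sum()) for x in a)
    less = sum(int((x < b).sum()) for x in a)
    return (greater - less) / (len(a) * len(b))

before_ci = np.random.default_rng(1).exponential(scale=30, size=200)  # delivery delays before CI
after_ci = np.random.default_rng(2).exponential(scale=25, size=250)   # delivery delays after CI

stat, p = mannwhitneyu(before_ci, after_ci, alternative="two-sided")
delta = cliffs_delta(before_ci, after_ci)
# Thresholds by Romano et al. (2006): |delta| < 0.147 negligible, < 0.33 small,
# < 0.474 medium, otherwise large.
print(f"MWW p-value: {p:.4f}, Cliff's delta: {delta:.3f}")
```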
Cliff's delta effect sizes for the latter metric vary between 0.2 and 0.3 (i.e., a small effect size), except for the changes in pull request lifetime, where we observe medium or even large effect sizes for a majority of projects. It is interesting to note that even though we used the same methods on the same data, we were not able to achieve entirely identical results (differences between 1.1 and 5.8 percentage points). We speculate that the observed differences may be due to undocumented data cleaning procedures or updates to the publicly available data set. However, given that the main findings of the study remain unchanged, we nonetheless consider the replication successful.

RQ2 in the original study then tried to find the reason for this phenomenon. The authors compare the number of submitted, merged, and released PRs before and after CI adoption. We again replicate this analysis, leading to the results depicted in Fig. 5. For this analysis step, our results are virtually identical to what has been presented in Bernardo, da Costa & Kulesza (2018). We observe that after CI was adopted, the number of submitted, merged, and released PRs per release increases statistically significantly with medium effect sizes. Interestingly, the release frequency does not change statistically significantly after adopting CI.

Box 1. Summary and Lessons Learned.
We were able to conduct an exact replication of the original paper, with minor differences in the results (between 1.1 and 5.5 percentage points). All main results of the original study are confirmed. This analysis indeed supports that only about half the projects deliver PRs faster (with a small effect size) after introducing CI, and that less than a third of projects improve how fast they merge PRs (again with a small effect size). While projects do not seem to release more frequently, they can handle more PRs per release after CI adoption.

Figure 5: Comparison of merged and released PRs per release before ("NO-CI") and after ("CI") introduction of CI.

RQ2 – conceptual replication
We now discuss the two additional analysis steps we have introduced in our study in comparison to the original work.

Application of RDD (RQ2.1)
In the first step of our conceptual replication, we use RDD to analyze whether there are gradual changes in PR delivery time over the lifetime of projects, independently of CI introduction. However, an initial visual inspection of both DD and PL reveals that these metrics follow a clear pattern that is independent of CI introduction. Figures 6 and 7 depict this for two example projects (mantl/mantl and mozilla-b2g/gaia). Virtually all 87 projects in the original data set follow a similar pattern, indicating that these metrics are to a large degree dominated by when in the release cycle a PR comes in: PRs merged shortly after a release need to wait for the next release to roll around, while PRs merged shortly before a release are released much more quickly. It is unlikely that the introduction of CI has much direct impact on this. It should be noted that this is true even for PL, which represents the entire delivery time of a PR (i.e., the time it takes maintainers to merge a PR plus the time the PR then waits to get released). Hence, it seems unlikely that the introduction of CI can impact this end-to-end delivery time of a PR by much.
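To make this release-cycle effect concrete, the following sketch (with invented release and merge dates) computes, for each merged PR, the days since the previous release and the days until the next release; the latter is exactly the delivery delay DD, and the former resurfaces as the "come-in time" factor in RQ2.2.

```python
# Sketch: how a PR's position in the release cycle determines its delivery delay.
# Release and merge dates are invented for illustration.
import bisect
from datetime import datetime

releases = [datetime(2017, m, 1) for m in (1, 4, 7, 10)]   # quarterly releases
merge_dates = [datetime(2017, 3, 20), datetime(2017, 4, 5), datetime(2017, 6, 28)]

for merged in merge_dates:
    i = bisect.bisect_right(releases, merged)          # index of the next release after the merge
    since_release = (merged - releases[i - 1]).days    # days since the previous release
    delivery_delay = (releases[i] - merged).days       # DD: days until the next release
    print(f"merged {merged:%Y-%m-%d}: {since_release:3d} days after a release, "
          f"released in {delivery_delay:3d} days")
```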
This also explains why we, similar to the original study, observe primarily differences with small effect sizes in RQ1. Ultimately, the end-to-end delivery time is presumably much more dependent on how frequently a project releases than on whether a CI system is used, and we have established in RQ1 that the release frequency is not impacted by CI adoption.

Figure 6: Visual inspection of metric delivery delay (DD) for two example projects. (A) mantl/mantl, (B) mozilla-b2g/gaia. The x-axis represents project lifetime in weeks, with point 0 being the RDD cutoff point (i.e., the time when CI has been adopted in the project).

Figure 7: Visual inspection of metric pull request lifetime (PL) for two example projects. (A) mantl/mantl, (B) mozilla-b2g/gaia. The x-axis represents project lifetime in weeks, with point 0 being the RDD cutoff point (i.e., the time when CI has been adopted in the project).

However, no such pattern exists for the third metric, MD. Hence, we attempt to apply all four RDD models described in 'Analysis methods'. The data of each project is divided into two buckets separated by the cutoff point (when CI was adopted), and one model for each bucket is fit. Fig. 8 shows the fitted models for project boto/boto. In the first three models, the red and blue lines fit the data after and before the intervention, respectively. It is evident that neither of the two linear models (Figs. 8A and 8B) provides a sufficient fit to accurately represent the data for boto/boto. Indeed, the linear and non-linear models never achieve an R2 value higher than 0.35 for any of the 87 projects. The local linear regression model depicted in Fig. 8D provides a better, albeit still very noisy, fit to the data. Hence, we conclude that there is no, or at least no particularly relevant, "natural trend" of MD getting faster or slower over time in any of the projects. Consequently, we consider our original concern with the work of Bernardo, da Costa & Kulesza (2018) (that projects may just naturally get faster or slower over time) to be unsupported.

Figure 8: Four RDD models fit for project boto/boto. (A) Linear model (common slope). (B) Linear model (different slopes). (C) Non-linear model. (D) Local linear regression model. The x-axis represents project lifetime in weeks, with point 0 being the RDD cutoff point (i.e., the time when CI has been adopted in the project).

Evaluation of "come-in time" as predictor of PR delivery time (RQ2.2)
In an attempt to explain what exactly impacts the end-to-end lifetime of a PR (PL), the original study built a multiple regression model based on 13 different variables (related to characteristics of the project, the PR submitter, and the PR itself). They found that three metrics (merge workload, queue rank, and, to a lesser degree, the contributor) had significant explanatory power with regards to PL. Before CI adoption, the merge workload has the highest explanatory power, which changes to the queue rank after adoption.
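We re-use this regression setup in our own analysis. As a rough sketch of how such a model can be fitted and its variables ranked, the snippet below fits an OLS model and reports a Wald test per variable; the column names mirror factors from Table 3, but the data and coefficients are synthetic and purely illustrative.

```python
# Sketch of the regression analysis: OLS model of PR lifetime on candidate factors,
# with per-variable Wald tests. All data is synthetic and purely illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "merge_workload": rng.integers(0, 50, n).astype(float),
    "queue_rank": rng.integers(1, 100, n).astype(float),
    "contributor_experience": rng.integers(0, 30, n).astype(float),
    "merge_time": rng.exponential(5, n),
})
df["pr_lifetime"] = (0.5 * df["queue_rank"] + 0.2 * df["merge_workload"]
                     + rng.normal(0, 5, n))

formula = "pr_lifetime ~ merge_workload + queue_rank + contributor_experience + merge_time"
fit = smf.ols(formula, data=df).fit()
print("R^2:", round(fit.rsquared, 2))        # models with R^2 <= 0.5 would be discarded

for term in ["merge_workload", "queue_rank", "contributor_experience", "merge_time"]:
    wald = fit.wald_test(f"{term} = 0", use_f=False)   # chi-square form of the Wald test
    print(term, "chi2 =", np.round(np.squeeze(wald.statistic), 1))
```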
Based on our previous findings, we speculate that in fact the most important predictor of end-to-end PR delivery time may be when in the release cycle a PR has been merged. We refer to this new factor as "come-in time", and provide a schematic overview of its definition in Fig. 9.

Figure 9: Definition of the factor "come-in time".

We re-use the original methodological setup (regression analysis using ordinary least squares), but use the variables sketched in Table 3. We remove all variables which had an explanatory power close to 0 in the original study, leaving us with six potential factors of influence ("merge workload", "queue rank", "contributor experience", "contributor integration", "number of activities", and "merge time"). We add the new variable "come-in time" to this set.

Table 3: Description of all variables used in the regression model. The first six variables are re-used from Bernardo, da Costa & Kulesza (2018); the last variable has been newly introduced in our study.

Variables from the original study:
  Number of Activities: An activity is an action on a PR conducted by a GitHub user, e.g., labeled, assigned, etc. It is assumed that a large number of activities may lead to longer delivery times.
  Merge Time: The time between when a PR was created and when it was merged by an integrator (MD).
  Contributor Experience: The number of released PRs that were created by the same author. We speculate that contributions by an experienced contributor may be evaluated less critically, and hence may be delivered faster.
  Contributor Integration: The mean delivery time in days of the past PRs submitted by this contributor. If past PRs were released quickly, then the next PR submitted by the same person may also be released rapidly.
  Merge Workload: The number of PRs waiting to be merged at the time when the PR was submitted. We speculate that, as the time and energy of integrators is limited, the workload of an integrator may have an impact on delivery times.
  Queue Rank: The order of the merged PRs in a release cycle. A merged PR might be released faster or slower depending on its position in the merge queue.

New variable:
  Come-in Time: The time in days between when a PR got merged and the time of the last release (see also Fig. 9). This new variable is motivated by our previous findings.

Figure 10: Bar chart plotting the variables with the highest explanatory power for each project, before ("NO-CI") and after ("CI") CI adoption. Only projects for which a regression model with R2 > 0.5 could be trained are considered.

From these variables, we build two regression models for each project (before and after CI adoption), and evaluate the R2 metric for each model. R2 represents how much of the variability in the data can be explained using the model. Following Bernardo, da Costa & Kulesza (2018), we only accept models with R2 > 0.5 as sufficiently accurate. Prior to CI adoption, the models for 17 of 87 projects (19.8%) have R2 values higher than 0.5 (median of these is 0.58).
After CI adoption, we achieve only nine valid models (10.5%), with a median R2 value of 0.57. This is in line with our previous findings, and indicates that PR delivery time is in general rather unpredictable, and unlikely to depend on any single factor.

Figure 10 depicts for how many projects (among those for which a model with R2 > 0.5 could be found) each variable is the one with the highest explanatory power, as measured through the Wald χ² maximum likelihood test. Our newly proposed variable "come-in time" indeed outperforms all variables from the original study. This further supports that the factor most important to the end-to-end delivery time of a PR is whether it has been merged close in time to the next release. It is also noticeable that all variables related to the nature of the PR or the contributor are less relevant than process- and project-oriented metrics, such as when a PR comes in, which position in the merge queue it has, or how large the merge workload currently is.

It needs to be noted that there is a high correlation between the new metric "come-in time" and "queue rank", one of the metrics in the original study, in a subset of the projects. Namely, in 19 of 87 projects (22%) the correlation between these metrics is larger than 0.7 prior to introducing CI, and in 21 of 87 projects (24%) after CI introduction. For the remaining projects, there is a correlation between these metrics, but it is less pronounced.

Box 2. Summary and Lessons Learned.
Applying RDD to the original data set primarily revealed that two of the three analyzed metrics (DD and PL) follow very clear patterns, namely that they depend to a large degree on the time until the next release. Consequently, when in the release cycle a PR is merged is the best predictor of delivery time (PL). The merge delay MD does not follow such a pattern. We did not observe in any project that MD would trend up- or downwards independently of CI adoption, alleviating our original concern with the study of Bernardo, da Costa & Kulesza (2018). However, our experiments also confirm the result from the original study that the delivery time is generally difficult to predict, as indicated by the low R2 values of the regression models of most projects.

RQ3 – generalizability
So far, we have applied all analyses to the data set also used in the original study. We now turn towards evaluating whether the previous findings are specific to the data used.

Analysing new data (RQ3.1)
In a first step, we evaluate the generalizability of our findings by collecting 54 new projects (which have also adopted CI), and conducting the same analyses as presented in 'RQ1 – exact replication' and 'RQ2 – conceptual replication'. We first again evaluate how many projects improved DD, MD, and PL, and use an MWW test to evaluate statistical significance. The results of this analysis are provided in Table 4, which also provides our own results from RQ1 as a point of comparison. We observe that the results are not fundamentally different, although we observe 10 to 14 percentage points difference in selected results (particularly related to the delivery delay DD). Effect sizes are small, as also observed for the original data.
A replication of our analysis of submitted, accepted, and released PRs confirms our finding that projects statistically significantly increase their development activity after adopting CI (with medium effect size), but we can again not find a statistically significant change in the number of releases. Finally, the re-execution of RDD (RQ2.1) on the new data yields similarly comparable results. A deeper discussion of this aspect is omitted here for reasons of brevity, but can be found in Guo (2019).

An interesting result is found when fitting regression models, as discussed for RQ2.2, to the new data. For the 54 projects, only 2 models (3.7%) trained on data after CI adoption and 5 models (9.3%) trained on data before CI adoption achieve an R2 value higher than 0.5. It remains unclear why the regression approach works even less well on the new data than on the original data. However, given that R2 values were generally low even for the original data, this result may ultimately just stress that predicting delivery times is difficult at the best of times.

Table 4: Results of a re-analysis using a new data set.

                       Faster with CI     Stat. different
                       [% of projects]    [% of projects]
  DD   Original data   47.9%              83.9%
       New data        58%                98%
       Difference      10.1               14.1
  MD   Original data   29.4%              78.2%
       New data        19.5%              80.4%
       Difference       9.9                2.2
  PL   Original data   52.4%              72.4%
       New data        46.5%              79.6%
       Difference       5.9                7.2

Table 5: Results of a re-analysis using a control group of projects which never introduced CI.

                       Faster with CI     Stat. different
                       [% of projects]    [% of projects]
  DD   Original data   47.9%              83.9%
       Control data    59.2%              96.4%
       Difference      11.3               12.5
  MD   Original data   29.4%              78.2%
       Control data    28%                89.3%
       Difference       1.4               11.1
  PL   Original data   52.4%              72.4%
       Control data    40%                89.3%
       Difference      12.4               16.9

Analysing a control group (RQ3.2)
So far, we have experimented only with projects that actually adopted CI at some point in the project's lifetime. We now turn towards analysing our control group of comparable projects which, as far as we can observe, have never adopted CI. One challenge in this context is what point in the project's history to use as the cutoff for analysis. From analysing the 87 projects in the original data set, we learn that these projects, on average, introduce CI after 38.2% of the lifetime of the project in days (median 38%, variance 8.5). Hence, we decide to introduce a "mock-ci-timepoint" for the projects in the control group that corresponds to 38% of their lifetime (a sketch of this derivation is shown below). Intuitively, this is the point in time when these projects would have, on average, adopted CI (if they ever did). A comparison of the results achieved for this control group with the results achieved for the original data set is provided in Table 5. Note that in this case "Faster with CI" for the control group should be interpreted as "faster after the mock-ci-timepoint" of 38%.

The results of this analysis indicate that we do observe (slightly) larger differences between the original test group and the control group than what we have observed for the two different test groups in RQ3.1 (cf. Table 4). This supports the conclusion that the introduction of CI has some modest impact on these numbers. However, when analyzing the number of submitted, merged, and released PRs, we observe that there is no difference between before and after the (mocked) CI introduction. This is visualized in Fig. 11.
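As an illustration of how the mock-ci-timepoint mentioned above can be derived for a control-group project, consider the following sketch; the first and last activity dates are placeholders, not data from the study.

```python
# Sketch: deriving the "mock-ci-timepoint" for a control-group project as the date
# 38% into its lifetime. The first/last activity dates are placeholders.
from datetime import datetime

first_activity = datetime(2012, 5, 1)
last_activity = datetime(2018, 3, 1)
mock_ci_timepoint = first_activity + 0.38 * (last_activity - first_activity)
print(mock_ci_timepoint.date())
# PRs merged before this date form the "NO-CI" group, those after it the mock "CI" group.
```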
Statistical testing does not reveal any differences before and after the mocked CI introduction for any metric.

Figure 11: Comparison of merged and released PRs per release before and after CI introduction for a control group of projects that never introduced CI. (A) #Submitted PRs/release; (B) #Merged PRs/release; (C) #Delivered PRs/release; (D) #Releases/year.

Hence, we support the argument by Bernardo, da Costa & Kulesza (2018) that the introduction of CI seems to have a (minor) impact on the PR delivery time of projects. However, projects in both test groups manage to handle considerably more PRs per release after CI adoption, while we have not observed any statistically significant increase in the control group. Hence, we conclude that projects do not so much speed up handling individual PRs, but rather manage to handle considerably more PRs per release after adopting CI.

Box 3. Summary and Lessons Learned.
Applying our analyses to new data sets allowed us to evaluate to what extent the effects observed so far are due to specifics of the data collected by Bernardo, da Costa & Kulesza (2018). When analyzing a new data set collected using the same methodology, we have observed results that are in broad strokes similar to the original findings, although we have observed differences of up to 14 percentage points for individual metrics. When analyzing a control group of projects that never adopted CI, we found results not unlike the results of the new test group, indicating that the small effect sizes we observed in RQ1 may be independent of CI introduction. However, we have observed that both test groups handle more PRs after CI adoption with medium effect size, while we have not observed a statistically significant increase for the control group. This leads us to believe that projects may not actually handle individual PRs (much) faster after CI adoption, but they are able to handle considerably more PRs per release.

THREATS TO VALIDITY
This section addresses potential threats to the validity of our replication and overall results.

Construct validity
Construct validity describes threats related to the design of the study. To a large degree, we have chosen the same data collection procedures, statistical methods, and analysis techniques that were already present in the original study. This was done by design, so as to keep our replication easily comparable to the original study. However, this also means that any limitations inherent in the original study design are still present in our replication (with the exception of those that we explicitly chose to address as part of our conceptual replication). For the construction of our control group, there are two related threats. (1) Even though we carefully attempted to determine whether a candidate project for the control group does indeed not use any CI system, it is not always feasible to determine this from an outsider's point of view (e.g., a company-backed OSS project may use a CI system within the company, which is not mentioned on the GitHub page).
(2) Even though we attempted to keep the control group as similar in characteristics to the original study objects as possible, the mere fact that these projects have chosen not to adopt CI may already hint at deeper differences in mindsets, processes, and project goals than what is visible from GitHub metrics alone. These differences may also account for some of the different results we have observed. Further, our control group is considerably smaller than the original data set (28 versus 87 projects).

External validity
External validity concerns to what extent the findings of the study still hold in more general circumstances. Part of our replication was specifically to investigate a data set of 54 new projects which adopted CI, and 28 projects which are not using CI. However, we used the same data collection procedure and sampling methods to select these projects. Hence, our replication does not aim to, and cannot, answer the question whether the observed results are specific to OSS, to high-profile projects, or to projects written in the Java, Python, PHP, Ruby, or JavaScript programming languages. Further, it should be noted that we only consider projects that make use of Travis-CI. Hence, it remains an open question to what extent our results also generalize to projects using other CI systems, such as GitLab (https://about.gitlab.com) or Jenkins (https://jenkins.io).

Internal validity
Internal validity questions to what extent the study is able to draw correct conclusions, and does not fall prey to, for instance, confounding factors. One of the key motivations of our replication was to evaluate whether normal changes in projects over the lifetime of the project may be responsible for the effects observed in the original study. This concern was alleviated in our replication. However, other confounding factors may still remain relevant. Particularly concerning in this regard is that our evaluation of a control group of projects that never applied CI has shown results that, ultimately, were not fundamentally different from what we observed for a new data set of CI-using projects. Hence, we see the need for more work to fully establish the effects of adopting CI in OSS projects.

CONCLUSIONS
In this work, we replicated an original study by Bernardo, da Costa & Kulesza (2018) that attempted to answer the question whether OSS projects deliver PRs faster after adopting CI. Our replication was motivated by limitations in the original study design, which did not account for changes in PR delivery time independent of CI introduction. We conducted an exact replication of the original work, analyzed the original data using a different statistical procedure (RDD), and extended the original multiple regression model with a new variable ("come-in time"). Further, we analyzed two new data sets: a new set of study subjects that adopted CI and a control group of projects that did not. We were able to replicate the original findings. Our analysis using RDD has not shown any evidence of growth of PR delivery times independent of CI introduction, and our analysis of the control group data has revealed that projects which never adopted CI do not see the same increase in submitted, merged, and released PRs as seen for CI-using projects.
However, our study also confirms that the impact of CI on the delivery time of an individual PR is only minor. This is in line with the original study, which also reported primarily small statistical effect sizes. We further find that, before as well as after CI adoption, the best predictor of PR delivery times is when in the release cycle a PR is merged. This indicates that, ultimately, projects need to increase the number of releases to speed up PR delivery times rather than adopt CI. However, the number of releases appears to be largely independent of whether or not a project adopts CI.

ACKNOWLEDGEMENTS
This work has been conducted as a master's project while the first author was a student at Chalmers University of Technology.

ADDITIONAL INFORMATION AND DECLARATIONS

Funding
This work has received financial support from the Swedish Research Council VR under grant number 2018-04127 (Developer-Targeted Performance Engineering for Immersed Release and Software Engineers). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors:
Swedish Research Council VR under grant number 2018-04127 (Developer-Targeted Performance Engineering for Immersed Release and Software Engineers).

Competing Interests
Philipp Leitner is an Academic Editor for PeerJ Computer Science.

Author Contributions
• Yunfang Guo conceived and designed the experiments, performed the experiments, analyzed the data, prepared figures and/or tables, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.
• Philipp Leitner conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Data Availability
The following information was supplied regarding data availability:
All data is available on GitHub: https://github.com/radialine/Do-Open-Source-Projects-Deliver-Pull-Requests-Faster-Using-CI.

REFERENCES
Bacchelli A, Bird C. 2013. Expectations, outcomes, and challenges of modern code review. In: Proceedings of the 2013 international conference on software engineering (ICSE '13). Piscataway: IEEE Press.
Basili VR, Shull F, Lanubile F. 1999. Building knowledge through families of experiments. IEEE Transactions on Software Engineering 25(4):456–473 DOI 10.1109/32.799939.
Bernardo JAH, da Costa DA, Kulesza U. 2018. Studying the impact of adopting continuous integration on the delivery time of pull requests. In: Proceedings of the 15th international conference on mining software repositories (MSR '18). New York: ACM, 131–141 DOI 10.1145/3196398.3196421.
Duvall PM, Matyas S, Glover A. 2007. Continuous integration: improving software quality and reducing risk. London: Pearson Education.
Gousios G, Pinzger M, van Deursen A. 2014. An exploratory study of the pull-based software development model. In: Proceedings of the 36th international conference on software engineering (ICSE 2014). 345–355.
Guo Y. 2019. The impact of adopting continuous integration on the delivery time of pull requests—a partial replication and extension. Master's thesis. Gothenburg, Sweden: Department of Computer Science and Engineering, Chalmers | University of Gothenburg.
Hilton M, Tunnell T, Huang K, Marinov D, Dig D. 2016. Usage, costs, and benefits of continuous integration in open-source projects.
In: Proceedings of the 31st IEEE/ACM international conference on automated software engineering. Piscataway: IEEE, 426–437.
Imbens GW, Lemieux T. 2008. Regression discontinuity designs: a guide to practice. Journal of Econometrics 142(2):615–635 DOI 10.1016/j.jeconom.2007.05.001.
Juristo N, Gómez OS. 2012. Replication of software engineering experiments. In: Empirical software engineering and verification: International Summer Schools, LASER 2008–2010. Berlin/Heidelberg: Springer, 60–88.
Laukkanen E, Paasivaara M, Arvonen T. 2015. Stakeholder perceptions of the adoption of continuous integration–a case study. In: 2015 agile conference. Piscataway: IEEE, 11–20.
Manglaviti M, Coronado-Montoya E, Gallaba K, McIntosh S. 2017. An empirical study of the personnel overhead of continuous integration. In: Proceedings of the 14th international conference on mining software repositories (MSR '17). 471–474.
McIntosh S, Kamei Y, Adams B, Hassan AE. 2014. The impact of code review coverage and code review participation on software quality: a case study of the Qt, VTK, and ITK projects. In: Proceedings of the 11th working conference on mining software repositories (MSR 2014). New York: ACM, 192–201 DOI 10.1145/2597073.2597076.
Miller A. 2008. A hundred days of continuous integration. In: Agile 2008 conference.
Parsons D, Ryu H, Lal R. 2007. The impact of methods and techniques on outcomes from agile software development projects. In: Organizational dynamics of technology-based innovation: diversifying the research agenda. 235–249.
Romano J, Kromrey J, Coraggio J, Skowronek J. 2006. Appropriate statistics for ordinal level data: should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys? In: Annual meeting of the Florida Association of Institutional Research.
Shepperd M. 2018. Replication studies considered harmful. In: Proceedings of the 40th international conference on software engineering: new ideas and emerging results (ICSE-NIER '18). New York: ACM, 73–76 DOI 10.1145/3183399.3183423.
Shull F, Basili V, Carver J, Maldonado JC, Travassos GH, Mendonça M, Fabbri S. 2002. Replicating software engineering experiments: addressing the tacit knowledge problem. In: Proceedings of the international symposium on empirical software engineering. 7–16.
Shull FJ, Carver JC, Vegas S, Juristo N. 2008. The role of replications in empirical software engineering. Empirical Software Engineering 13(2):211–218 DOI 10.1007/s10664-008-9060-1.
Stolberg S. 2009. Enabling agile testing through continuous integration. In: 2009 agile conference.
Ståhl D, Bosch J. 2014. Modeling continuous integration practice differences in industry software development. Journal of Systems and Software 87:48–59.
Thistlethwaite DL, Campbell DT. 1960. Regression-discontinuity analysis: an alternative to the ex post facto experiment. Journal of Educational Psychology 51(6):309–317 DOI 10.1037/h0044319.
Vasilescu B, Schuylenburg SV, Wulms J, Serebrenik A, van den Brand MG. 2014. Continuous integration in a social-coding world: empirical evidence from GitHub.
In: 2014 IEEE international conference on software maintenance and evolution (ICSME). Piscataway: IEEE.
Vasilescu B, Yu Y, Wang H, Devanbu P, Filkov V. 2015. Quality and productivity outcomes relating to continuous integration in GitHub. In: Proceedings of the 2015 10th joint meeting on foundations of software engineering (ESEC/FSE 2015).
Yu Y, Wang H, Filkov V, Devanbu P, Vasilescu B. 2015. Wait for it: determinants of pull request evaluation latency on GitHub. In: 2015 IEEE/ACM 12th working conference on mining software repositories. Piscataway: IEEE.
Yu Y, Yin G, Wang T, Yang C, Wang H. 2016. Determinants of pull-based development in the context of continuous integration. Science China Information Sciences 59:080104 DOI 10.1007/s11432-016-5595-8.
Zhao Y, Serebrenik A, Zhou Y, Filkov V, Vasilescu B. 2017. The impact of continuous integration on other software development practices: a large-scale empirical study. In: Proceedings of the 32nd IEEE/ACM international conference on automated software engineering. Piscataway: IEEE, 60–71.