Trust Enhancement Issues in Program Repair
Yannic Noller, Ridwan Shariffdeen, Xiang Gao, Abhik Roychoudhury
2021-08-30

Automated program repair is an emerging technology that seeks to automatically rectify bugs and vulnerabilities using learning, search, and semantic analysis. Trust in automatically generated patches is necessary for achieving greater adoption of program repair. Towards this goal, we survey more than 100 software practitioners to understand the artifacts and setups needed to enhance trust in automatically generated patches. Based on the feedback from the survey on developer preferences, we quantitatively evaluate existing test-suite based program repair tools. We find that they cannot produce high-quality patches within a top-10 ranking and an acceptable time period of 1 hour. The developer feedback from our qualitative study and the observations from our quantitative examination of existing repair tools point to actionable insights to drive program repair research. Specifically, we note that producing repairs within an acceptable time-bound depends heavily on leveraging an abstract representation of a rich enough search space. Moreover, while additional developer inputs are valuable for generating or ranking patches, developers do not seem to be interested in significant human-in-the-loop interaction.

Automated program repair technologies [14] are receiving increased attention. In recent times, program repair has found its way into the automated fixing of mobile apps in the SapFix project at Facebook [28], into automated repair bots as evidenced by the Repairnator project [44], and has found a certain acceptance in companies such as Bloomberg [17].
While all of these are promising, large-scale adoption of program repair, where it is well integrated into our programming environments, is considerably out of reach as of now. In this article, we reflect on the impediments towards the usage of program repair by developers. The adoption of program repair faces many challenges, such as scalability, applicability, and developer acceptance. A lot of the research on program repair has focused on scalability to large programs and to large search spaces [12, 26, 28, 31]. Similarly, there have been various works on generating multi-line fixes [13, 31], or on transplanting patches from one version to another [41], to cover various use cases or scenarios of program repair. Surprisingly, there is very little literature or systematic study from either academia or industry on developer trust in program repair.

* Joint first authors
† Alternate email: gaoxiang9430@gmail.com

In particular, what changes do we need to bring into the program repair process so that it becomes viable to have conversations on its wide-scale adoption? Part of the trust gulf comes from a lack of specifications: since the intended behavior of the program is not formally documented, it is hard to trust that the automatically generated patches meet this intended behavior. Overall, we seek to examine whether the developers' reluctance to use program repair may partially stem from a reluctance to rely on automatically generated code. This can have profound implications because of recent developments in AI-based pair programming, which holds out promise for significant parts of coding in the future to be accomplished via automated code generation.

In this article, we specifically study the issues involved in enhancing developer trust in automatically generated patches. Towards this goal, we first settle on the research questions related to developer trust in automatically generated patches.
These questions are divided into two categories: (a) expectations of developers from automatic repair technologies, and (b) understanding the possible shortfall of existing program repair technologies with respect to developer expectations. To understand the developer expectations from program repair, we outline the following research questions.

RQ1 To what extent are developers interested in applying automated program repair (henceforth called APR), and how do they envision using it?
RQ2 Can software developers provide additional inputs that would lead to higher trust in generated patches? If yes, what kind of inputs can they provide?
RQ3 What evidence from APR will increase developer trust in the patches produced?

For a comprehensive assessment of the research questions, we engage in both qualitative and quantitative studies. Our assessment of the questions primarily comes in three parts. To understand the developer expectations from program repair, we conduct a detailed survey (with 35 questions) among more than 100 professional software practitioners. Most of our survey respondents are developers, with a few coming from more senior roles such as architects. The survey results amount to both quantitative and qualitative inputs on developer expectations, since we curate and analyze respondents' comments on topics such as the expected evidence for patch correctness provided by automated repair techniques. Based on the survey findings, we note that developers are largely open-minded about trying out a small number of patches (no more than 10) from automated repair techniques, as long as these patches are produced within a reasonable time, say less than 1 hour. Furthermore, the developers are open to receiving specifications from the program repair method (amounting to evidence of patch correctness). They are also open-minded about providing additional specifications to drive program repair.
The most common specifications the developers are ready to give and receive are tests. Based on the comments received from survey participants, we then conduct a quantitative comparison of certain well-known program repair tools on the widely used ManyBugs benchmarks [20]. To understand the possible deficiency of existing program repair techniques with respect to the developer expectations found in the survey, we formulate the following research questions.

RQ4 Can existing APR techniques pinpoint high-quality patches among the top-ranked (e.g., top-10) patches within a tolerable time limit (e.g., 0.5/1/2 hours)?
RQ5 What is the impact of additional inputs (say, fix locations and additional passing test cases) on the efficacy of APR?

We note that many of the existing papers on program repair use liberal timeout periods to generate repairs, whereas in our experiments the timeout is strictly maintained at no more than one hour. We also restrict ourselves to observing the first few patches, and we examine the impact of fix localization by either providing or withholding the developer fix location. Based on a quantitative comparison of the well-known repair tools Angelix [31], CPR [40], GenProg [21], Prophet [26], and Fix2Fit [12], we conclude that the search space representation plays a significant role in deriving plausible/correct patches within an acceptable time period. In other words, an abstract representation of the search space (aided by constraints that are managed efficiently, or by program equivalence relations) is at least as critical as a smart search algorithm for navigating the patch space. We discuss how the tools can be improved to meet developer expectations, either by achieving compilation-free repair or by navigating/suggesting abstract patches with the help of simple constraints (such as interval constraints). Last but not least, we note that program repair can be seen as automated code generation at a micro-scale.
By studying the trust issues in automated repair, we can also obtain an initial understanding of trust enhancement in automatically generated code.

The goal of APR is to correct buggy programs so that they satisfy given specifications. In this section, we review these specifications and discuss how they can impact patch quality.

Test Suites as Specification. APR techniques such as GenProg [21] and Prophet [26] treat test suites as correctness specifications. The test suite usually includes a set of passing tests and at least one failing test. The repair goal is to modify the buggy program so that it passes all tests in the given test suite. Although test suites are widely available, they are usually incomplete specifications that capture only part of the intended program behavior. Hence, an automatically generated patch may overfit the tests, meaning that the patched program may still fail on program inputs outside the given tests. For instance, the following is a buggy implementation that copies n characters from source array src to destination array dest, and returns the number of copied characters. A buffer overflow happens at line 6 when the size of src or dest is less than n. By taking the following three tests (one of them can trigger this bug) as specification, a produced patch (++index