Trust Enhancement Issues in Program Repair
Yannic Noller, Ridwan Shariffdeen, Xiang Gao, Abhik Roychoudhury
2021-08-30

Automated program repair is an emerging technology that seeks to automatically rectify bugs and vulnerabilities using learning, search, and semantic analysis. Trust in automatically generated patches is necessary for achieving greater adoption of program repair. Towards this goal, we survey more than 100 software practitioners to understand the artifacts and setups needed to enhance trust in automatically generated patches. Based on the feedback from the survey on developer preferences, we quantitatively evaluate existing test-suite based program repair tools. We find that they cannot produce high-quality patches within a top-10 ranking and an acceptable time period of 1 hour. The developer feedback from our qualitative study and the observations from our quantitative examination of existing repair tools point to actionable insights to drive program repair research. Specifically, we note that producing repairs within an acceptable time-bound depends heavily on leveraging an abstract representation of a rich enough search space. Moreover, while additional developer inputs are valuable for generating or ranking patches, developers do not seem to be interested in significant human-in-the-loop interaction.

Automated program repair technologies [14] are receiving increased attention. In recent times, program repair has found its way into the automated fixing of mobile apps in the SapFix project at Facebook [28], into automated repair bots as evidenced by the Repairnator project [44], and has found a certain acceptance in companies such as Bloomberg [17].
While all of these are promising, large-scale adoption of program repair, where it is well integrated into our programming environments, is considerably out of reach as of now. In this article, we reflect on the impediments towards the usage of program repair by developers. The adoption of program repair faces many challenges, such as scalability, applicability, and developer acceptance. A lot of the research on program repair has focused on scalability to large programs and to large search spaces [12, 26, 28, 31]. Similarly, there have been various works on generating multi-line fixes [13, 31], or on transplanting patches from one version to another [41], to cover various use cases or scenarios of program repair. Surprisingly, there is very little literature or systematic study from either academia or industry on developer trust in program repair.

* Joint first authors
† Alternate email: gaoxiang9430@gmail.com

In particular, what changes do we need to bring into the program repair process so that it becomes viable to have conversations on its wide-scale adoption? Part of the trust gulf comes from a lack of specifications: since the intended behavior of the program is not formally documented, it is hard to trust that the automatically generated patches meet this intended behavior. Overall, we seek to examine whether the developers' reluctance to use program repair may partially stem from a reluctance to rely on automatically generated code. This can have profound implications because of recent developments in AI-based pair programming, which holds out promise for significant parts of coding in the future to be accomplished via automated code generation.

In this article, we specifically study the issues involved in enhancing developer trust in automatically generated patches. Towards this goal, we first settle on the research questions related to developer trust in automatically generated patches.
These questions are divided into two categories: (a) expectations of developers from automatic repair technologies, and (b) understanding the possible shortfall of existing program repair technologies with respect to developer expectations. To understand the developer expectations from program repair, we outline the following research questions.

RQ1 To what extent are developers interested in applying automated program repair (henceforth called APR), and how do they envision using it?
RQ2 Can software developers provide additional inputs that would lead to higher trust in generated patches? If yes, what kind of inputs can they provide?
RQ3 What evidence from APR will increase developer trust in the patches produced?

For a comprehensive assessment of the research questions, we engage in both qualitative and quantitative studies. Our assessment of the questions primarily comes in three parts. To understand the developer expectations from program repair, we conduct a detailed survey (with 35 questions) among more than 100 professional software practitioners. Most of our survey respondents are developers, with a few coming from more senior roles such as architects. The survey results amount to both quantitative and qualitative inputs on developer expectations, since we curate and analyze respondents' comments on topics such as the expected evidence for patch correctness provided by automated repair techniques. Based on the survey findings, we note that developers are largely open-minded about trying out a small number of patches (no more than 10) from automated repair techniques, as long as these patches are produced within a reasonable time, say less than 1 hour. Furthermore, the developers are open to receiving specifications from the program repair method (amounting to evidence of patch correctness). They are also open-minded about providing additional specifications to drive program repair.
The most common specifications the developers are ready to give and receive are tests. Based on the comments received from survey participants, we then conduct a quantitative comparison of certain well-known program repair tools on the widely used ManyBugs benchmarks [20]. To understand the possible deficiency of existing program repair techniques with respect to the developer expectations found in the survey, we formulate the following research questions.

RQ4 Can existing APR techniques pinpoint high-quality patches among the top-ranked (e.g., top-10) patches within a tolerable time limit (e.g., 0.5/1/2 hours)?
RQ5 What is the impact of additional inputs (say, fix locations and additional passing test cases) on the efficacy of APR?

We note that many of the existing papers on program repair use liberal timeout periods to generate repairs, whereas in our experiments the timeout is strictly maintained at no more than one hour. We also restrict ourselves to observing the first few patches, and we examine the impact of fix localization by either providing or withholding the developer fix location. Based on a quantitative comparison of the well-known repair tools Angelix [31], CPR [40], GenProg [21], Prophet [26], and Fix2Fit [12], we conclude that the search space representation plays a significant role in deriving plausible/correct patches within an acceptable time period. In other words, an abstract representation of the search space (aided by constraints that are managed efficiently, or by program equivalence relations) is at least as critical as a smart search algorithm for navigating the patch space. We discuss how the tools can be improved to meet developer expectations, either by achieving compilation-free repair or by navigating/suggesting abstract patches with the help of simple constraints (such as interval constraints). Last but not least, we note that program repair can be seen as automated code generation at a micro-scale.
By studying the trust issues in automated repair, we can also obtain an initial understanding of trust enhancement in automatically generated code.

The goal of APR is to correct buggy programs so that they satisfy given specifications. In this section, we review these specifications and discuss how they can impact patch quality.

Test Suites as Specification. APR techniques such as GenProg [21] and Prophet [26] treat test suites as correctness specifications. The test suite usually includes a set of passing tests and at least one failing test. The repair goal is to modify the buggy program so that it passes all tests in the given test suite. Although test suites are widely available, they are usually incomplete specifications that capture only part of the intended program behavior. Hence, an automatically generated patch may overfit the tests, meaning that the patched program may still fail on program inputs outside the given tests. For instance, the following is a buggy implementation that copies n characters from source array src to destination array dest, and returns the number of copied characters. A buffer overflow happens at line 6 when the size of src or dest is less than n. By taking the following three tests (one of them can trigger this bug) as specification, a produced patch (++index