key: cord-0512933-zr3nchyk authors: Lu, Yunlong; Meng, Na; Li, Wenxin title: FAPR: Fast and Accurate Program Repair for Introductory Programming Courses date: 2021-07-14 journal: nan DOI: nan sha: ebfcbe0a8b238d5a52286fdfaab7be0170dbc91b doc_id: 512933 cord_uid: zr3nchyk In introductory programming courses, it is challenging for instructors to provide debugging feedback on students' incorrect programs. Some recent tools automatically offer program repair feedback by identifying any differences between incorrect and correct programs, but suffer from issues related to scalability, accuracy, and cross-language portability. This paper presents FAPR -- our novel approach that suggests repairs based on program differences in a fast and accurate manner. FAPR is different from current tools in three aspects. First, it encodes syntactic information into token sequences to enable high-speed comparison between incorrect and correct programs. Second, to accurately extract program differences, FAPR adopts a novel matching algorithm that maximizes token-level matches and minimizes statement-level differences. Third, FAPR relies on testing instead of static/dynamic analysis to validate and refine candidate repairs, so it eliminates the language dependency or high runtime overhead incurred by complex program analysis. We implemented FAPR to suggest repairs for both C and C++ programs; our experience shows the great cross-language portability of FAPR. More importantly, we empirically compared FAPR with a state-of-the-art tool Clara. FAPR suggested repairs for over 95.5% of incorrect solutions. We sampled 250 repairs among FAPR's suggestions, and found 89.6% of the samples to be minimal and correct. FAPR outperformed Clara by suggesting repairs for more cases, creating smaller repairs, producing higher-quality fixes, and causing lower runtime overheads. Our results imply that FAPR can potentially help instructors or TAs to effectively locate bugs in incorrect code, and to provide debugging hints/guidelines based on those generated repairs. In introductory programming courses, students may commit various mistakes when they learn to program. It is very challenging or even impossible for instructors and teaching assistants (TAs) to provide one-on-one programming instructions, and to help all students debug their incorrect code. This challenge has become even more severe and intractable during the COVID-19 pandemic, as Massive Online Open Courses (MOOC) see exponential growth in enrollments and thousands of students submit coding solutions online [1] . While lots of students desperately need teachers' suggestions on code debugging, instructors and TAs usually have limited time availability and programming capabilities. Both teachers and students need tools to automatically provide debugging feedback or program repair suggestions. Some recent tools were built to suggest program repairs by comparing correct and incorrect solutions to the same coding problem [2] , [3] , [4] , [5] , [6] . For instance, Clara [5] suggests edits by (1) clustering correct solutions based on the equivalence of program execution status, and (2) comparing any given incorrect code with the representative programs of each cluster to find a minimum repair. However, the high runtime overhead of Clara's dynamic analysis can make the tool pretty slow, when lots of students submit incorrect code simultaneously to seek for debugging feedback. 
The heavy reliance on sophisticated dynamic analysis also compromises Clara's applicability to code written in different programming languages. Sk p [3] trains a sequence-to-sequence neural network using token sequences of correct programs, and leverages the trained model to suggest repairs for given incorrect code. However, the suggested repairs are usually incorrect, achieving 13%-49% accuracy. To better help teachers provide debugging feedback on students' solutions to introductory programming problems, we need to tackle two challenges: 1. Lightweight program comparison methods (e.g., tokenlevel code comparison) are usually fast but may suggest inaccurate repairs, while heavyweight alternatives (e.g., comparing execution traces) are slow but may suggest more accurate repairs. We need to better balance speed with accuracy. 2. As students may submit programs written in distinct programming languages, we need an approach that is generally applicable to different languaged programs. Thus, the less language-specific static or dynamic analysis is in use, the more portable a new approach can become. We designed and implemented FAPR (short for "Fast and Accurate Program Repair") to overcome both challenges. We envision FAPR to be used by teachers or automatic systems. When teachers look at FAPR's repair for any incorrect program, they can quickly offer debugging hints/guidelines to students. When FAPR is integrated into feedback generation systems, those systems can rely on the generated repairs to produce various kinds of feedback (e.g., bug location, highlevel natural-language description on fixes, or partial edits for fixes). As shown in Fig. 1 , FAPR has four phases. Given an incorrect program p i , FAPR first extracts the longest common subsequence (LCS) between the tokens of p i and all correct Fig. 1 : The overview of FAPR solutions, in order to quickly locate up to 100 programs P that are most similar to p i as candidate references. The code differences captured by above-mentioned naïve token sequence comparison cannot be directly used to repair code for two reasons. First, some token differences (e.g., distinct variable usage) are irrelevant to software bugs and should not be suggested as repairs. Second, when multiple alternative LCSs exist between two programs, only the LCSs that minimize the number of edited statements are more desirable. Thus, in Phase II, for each selected candidate reference p j ∈ P , FAPR revises its token sequence based on the depth information of each token (i.e., the number of AST nodes on the path from root to leaf). Based on such program encoding, in Phase III, FAPR uses a matching algorithm to quickly select the best LCS that minimizes statement-level differences. FAPR also uses the selected LCS to normalizes p j to p j by replacing its distinct variable usage with that of p i . In Phase IV, FAPR explores subsets of the code differences between p j and p i to tentatively repair p i , and uses testing to validate each repair. We did two experiments to evaluate FAPR. In one experiment, we applied FAPR to the 20,897 C programs and 25,000 C++ programs collected from an online judge system-OpenJudge [7] . The collected data corresponds to five programming problems used in the course "Introduction to Computing". Among the 6,676 given incorrect C programs, FAPR suggested repairs for 99% of programs and the repaired programs pass all tests. 
In total, FAPR spent only 306 seconds to preprocess the 14,221 correct C solutions; additionally, it spent 2.8 seconds on average to suggest each repair. We observed similar phenomena when applying FAPR to C++ programs. In the other experiment, we randomly sampled 250 C repairs suggested by FAPR, and recruited 5 students (assuming them to be TAs) to separately examine 50 samples. This manual evaluation shows that FAPR's repairs have high quality: 98.4% of them (246/250) are correct fixes, while 89.6% of them (224/250) are minimum and correct. To empirically compare FAPR with prior work, we also applied Clara [5] to the same C program dataset and repeated the above-mentioned two experiments. Our evaluation shows that FAPR outperformed Clara by proposing more fixes with higher quality. More importantly, FAPR has much lower runtime overheads. Clara spent in total 62 hours clustering correct C solutions before repairing any code; it spent 31.2 seconds to suggest each repair. In summary, our research contributions are as below: • We developed a novel approach FAPR to suggest program repairs in a fast and accurate manner. Different from prior work, FAPR does not use any complex program analysis framework and is highly portable. Our experience with C/C++ programs also evidences FAPR's portability. • FAPR encodes tree-level information into token sequences, and applies a dynamic programming (DP) algorithm to compare alternative LCSs. Our novel program encoding and DP algorithm enable FAPR to balance speed with accuracy. • We conducted a comprehensive evaluation to (1) assess FAPR's applicability to C and C++ programs, (2) evaluate FAPR's efficiency and the quality of its repairs, and (3) compare FAPR with Clara. Our evaluation shows that FAPR's repairs are usually correct, minimal, and useful. This section overviews our approach with a running example drawn from the introductory CS course "Introduction of Computing". There is a coding exercise problem in the course: Array Reversal: Write a C/C++ program to reverse any given array. For example, given the array (1 4 5 6 8), the program outputs (8 6 5 4 1). Suppose that a student creates an incorrect program solution p i by enumerating array indexes in a wrong value range, as highlighted by yellow in Fig. 2a . Namely, in the second forloop, "int i=a; i>0; i−−" should be "int i=a-1; i>=0; i−−". When this student sends p i to an instructor Alex for debugging help, Alex can apply FAPR to p i to quickly learn about the program repair before guiding that student to debug code. In particular, FAPR consists of four phases. Phase I. Code Comparison: FAPR compares p i with existing correct solutions to retrieve programs P most similar to p i . Specifically for any two programs under comparison, FAPR tokenizes the source code to extract an LCS l. Although l is not directly used to compute repairs, FAPR adopts it to quickly filter out solutions dissimilar to p i and to roughly rank the remaining ones based on similarities. We regard P as candidate references usable to fix p i . Fig. 2b shows a candidate reference selected for the incorrect solution. Phase II. Program Encoding: To facilitate more accurate program comparison between p i and P , FAPR encodes tokens' relative location information on ASTs into token sequences. We refer to the resulting encoding enhanced token sequences. Phase III. 
Code Normalization: Between the enhanced token sequences of two given programs p_i and p_j ∈ P, FAPR uses an efficient algorithm to explore all alternative LCSs and select the best one with minimum statement-level differences. FAPR also infers identifier mappings between those programs to normalize the code representation of p_j. For our example, p_j is converted to its normalized version (see Fig. 2c).
Fig. 2: Demonstration of an incorrect program p_i, a correct program p_j, and its normalized version.
Phase IV: The code differences between the normalized reference and p_i are regarded as candidate repairing edits R_j. FAPR then uses testing to iteratively explore and validate subsets of the candidate edits. This phase returns the minimized repair, a sequence of customized code edits, to Alex.
This section explains the individual phases in more detail. FAPR first tokenizes code in two steps. In Step 1, it adopts the off-the-shelf tool srcML [8] to parse code, create Abstract Syntax Trees (ASTs), and store each AST in an XML file. In Step 2, FAPR performs a depth-first traversal of the XML tree (i.e., the AST) and collects all leaf nodes (excluding certain node types) to build the token sequence. Notice that we intentionally derive token sequences from XML files instead of using a scanner because, in later phases, FAPR needs to revisit ASTs for code normalization and repair generation. To eliminate the extra runtime overhead of (1) invoking a scanner and (2) repetitively parsing code, FAPR parses each program only once and saves the ASTs to XML files. When comparing two token sequences (e.g., s_i and s_j), FAPR extracts an LCS by (1) matching variable and function names based on token type (i.e., identifier), and (2) matching other tokens based on content (i.e., lexemes [9]). The extracted LCS serves two purposes. First, it is used to compute a similarity score as below:
Similarity(s_i, s_j) = length(LCS) / Average(length(s_i), length(s_j)).
Second, it will facilitate later comparison between program encodings (see Section III-B). For our running example, the first for-loop headers of p_i and p_j have common tokens extracted as below (see underlined text):
for ( int i = 0 ; i < a ; i ++ )
for ( int i = 0 ; i < n ; i ++ )
Literally, FAPR considers all token pairs at corresponding locations to match. This is because the non-identifier tokens have matching content (e.g., (for, for)), and the identifiers have the same token type although a and n are different. FAPR matches type names (i.e., int) by content instead of token type, because divergent type usage indicates bugs and/or fixes.
Fig. 4: The cost comparison between alternative LCSs. For each alternative match, the arrows (depth changes) falling inside unmatched fragments are counted: alt1 has cost 2, while alt2 and alt3 have cost 3.
In particular, srcML is a multi-language parser that converts source code of C, C++, C#, and Java to XML files. As FAPR analyzes and compares code based on XML representations, it obtains the multi-language processing capability almost for free. To expand FAPR's multi-language support, users only need to use different frontends offered by srcML to create XML files, or to extend srcML with new language parsers. In an AST, the depth of the root node is 0, all children of the root have depth=1, and grandchildren have depth=2.
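As a concrete illustration of this Phase I comparison, the sketch below computes a token-level LCS and the similarity score defined above. It is only an illustration under a simplified token model that we made up here (a coarse token type plus its text); FAPR itself derives tokens from srcML's XML output and compares each incorrect program against thousands of correct solutions.

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Simplified token model (illustrative only).
struct Token {
    std::string type;  // e.g., "identifier", "keyword", "operator", "literal"
    std::string text;  // the lexeme
};

// Phase I matching rule: identifiers match by token type,
// every other token must match by content.
static bool matches(const Token& a, const Token& b) {
    if (a.type == "identifier" && b.type == "identifier") return true;
    return a.type == b.type && a.text == b.text;
}

// Classic O(|si| * |sj|) dynamic program for the LCS length.
static std::size_t lcsLength(const std::vector<Token>& si,
                             const std::vector<Token>& sj) {
    std::vector<std::vector<std::size_t>> dp(
        si.size() + 1, std::vector<std::size_t>(sj.size() + 1, 0));
    for (std::size_t x = 1; x <= si.size(); ++x)
        for (std::size_t y = 1; y <= sj.size(); ++y)
            dp[x][y] = matches(si[x - 1], sj[y - 1])
                           ? dp[x - 1][y - 1] + 1
                           : std::max(dp[x - 1][y], dp[x][y - 1]);
    return dp[si.size()][sj.size()];
}

// Similarity(si, sj) = length(LCS) / Average(length(si), length(sj)).
static double similarity(const std::vector<Token>& si,
                         const std::vector<Token>& sj) {
    double avg = (si.size() + sj.size()) / 2.0;
    return avg == 0.0 ? 0.0 : lcsLength(si, sj) / avg;
}

int main() {
    // The two for-loop headers from the running example.
    std::vector<Token> a = {{"keyword", "for"}, {"operator", "("},
                            {"keyword", "int"}, {"identifier", "i"},
                            {"operator", "="},  {"literal", "0"},
                            {"operator", ";"},  {"identifier", "i"},
                            {"operator", "<"},  {"identifier", "a"},
                            {"operator", ";"},  {"identifier", "i"},
                            {"operator", "++"}, {"operator", ")"}};
    std::vector<Token> b = a;
    b[9].text = "n";  // p_j uses n where p_i uses a; both are identifiers.
    std::cout << "similarity = " << similarity(a, b) << "\n";  // prints 1
    return 0;
}

With this matching rule, the two loop headers are fully matched even though p_i uses a where p_j uses n, which is exactly why FAPR treats them as corresponding code.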
To create a program encoding (i.e., enhanced token sequence) for each candidate reference p j ∈ P or p i , FAPR first orders AST nodes based on depth-first traversal, counts the depth difference between any two adjacent leaf nodes (i.e., tokens), and injects related numbers of arrows (i.e., ↑ or ↓) into the original token sequence to reflect depth changes. As shown in Fig. 4 , circled numbers 1 -11 indicate traversal ordering. Between Nodes 3 and 5 , because the depth is incremented from 2 to 3, a down arrow ↓ is injected between tokens # and include. Notice that 6 and 8 have the same depth 2. As they belong to distinct #include directives (i.e., program statements), FAPR injects an extra pair of arrows "↑↓" between the tokens and #, to indicate statement-level boundaries and penalize any potential token matches spanning across statements. Given programs p i and p j ∈ P , there can be multiple LCSs with the same length. Selecting the best one among all alternatives is crucially important, because it decides how well FAPR matches code and derives differences. For our example in Section II, there are three distinct ways to match #include-directives between p i and p j . As shown in Fig. 3 , the three tokens of p i can either match the second directive of p j or match parts of p j 's both directives. Within these three alternatives, the first one is most reasonable because the matching part does not span across statements. In this section, we first introduce our novel matching algorithm that identifies the optimal LCS to minimize statement-level differences. Next, we describe the program transformation that leverages optimal LCS to normalize each candidate reference p j to p j . 1) The Design of Matching Algorithm: Our algorithm starts with the LCS mentioned in Section III-A, enumerates all samelength LCSs, relies on the program encodings mentioned in Section III-B to compute cost for each alternative, and picks the one with minimal cost. Here, the cost of an LCS alternative counts the number of arrows (i.e., depth changes) included by all consecutive fragments of unmatched tokens. For instance, as shown in Fig. 4 , suppose that we want to find the best LCS in s to match a given three-token list tl: "# include ". Sequence s corresponds to two AST subtrees, with each token being a leaf node; s has three alternatives to match tl: alt1, alt2, and alt3. In alt1, the first three tokens compose an unmatched consecutive fragment, and there are two arrows inside this fragment (as highlighted by yellow). Thus, the cost is 2. Similarly, in alt2 (or alt3), because the unmatched fragment covers three arrows, the cost is 3. As alt1 has the lowest cost, it is chosen as the optima. Intuitively, the cost function is applied to program encodings. It measures distances between unmatched tokens on ASTs and reflects potential repair effort. Namely, the farther away unmatched tokens are from each other, the more likely that they scatter on different subtrees, the more statements may need to be revised for program repair, and the higher cost there can be. Because finding the optimal LCS with minimum cost is a combinatorial optimization problem, we invented a DP algorithm to realize the algorithm design. 2) The Implementation of Matching Algorithm: Our algorithm has three steps (see Algorithm 1). Given a token sequence under comparison (i.e., s i or s j ) and the naïvely extracted LCS a lcs by Phase I (see Section III-A), FAPR first initializes two tables: cost and prev. 
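The roles of the two tables are described in the paragraph that follows. As a rough and purely illustrative sketch (not FAPR's actual code), the fragment below keeps a single cost table plus a matched/unmatched flag instead of the exact cost/prev pair, matches tokens by text only, and uses made-up depths and statement ids in place of real srcML output; it returns the minimum number of arrows falling inside unmatched fragments, over all ways of embedding the Phase I LCS into a token sequence.

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Illustrative token: its text, its AST depth, and the id of the statement
// (AST subtree) it belongs to; FAPR derives such information from srcML XML.
struct Tok {
    std::string text;
    int depth;
    int stmt;
};

// "Arrows" between two adjacent tokens: the depth change, plus an extra
// pair of arrows whenever a statement boundary is crossed.
static int arrows(const Tok& a, const Tok& b) {
    return std::abs(a.depth - b.depth) + (a.stmt != b.stmt ? 2 : 0);
}

// Minimum cost over all embeddings of lcs (as a subsequence) into s, where
// the cost counts arrows lying between consecutive unmatched tokens.
static int minEmbeddingCost(const std::vector<Tok>& s,
                            const std::vector<std::string>& lcs) {
    const int INF = 1 << 29;
    const int n = (int)s.size(), m = (int)lcs.size();
    // best[j][k]: j lcs tokens matched so far; k = 1 iff the most recently
    // processed token of s was matched.
    std::vector<std::vector<int>> best(m + 1, std::vector<int>(2, INF));
    best[0][0] = 0;
    for (int i = 0; i < n; ++i) {
        std::vector<std::vector<int>> next(m + 1, std::vector<int>(2, INF));
        for (int j = 0; j <= m; ++j)
            for (int k = 0; k < 2; ++k) {
                if (best[j][k] >= INF) continue;
                // Leave s[i] unmatched; pay arrows if the previous token
                // was also unmatched (the gap keeps growing).
                int gap = (i > 0 && k == 0) ? arrows(s[i - 1], s[i]) : 0;
                next[j][0] = std::min(next[j][0], best[j][k] + gap);
                // Or match s[i] to the next lcs token.
                if (j < m && s[i].text == lcs[j])
                    next[j + 1][1] = std::min(next[j + 1][1], best[j][k]);
            }
        best.swap(next);
    }
    return std::min(best[m][0], best[m][1]);
}

int main() {
    // Two #include directives; tokens of the first have statement id 0 and
    // tokens of the second have id 1 (the depths here are made up).
    std::vector<Tok> s = {{"#", 2, 0}, {"include", 3, 0}, {"<stdio.h>", 3, 0},
                          {"#", 2, 1}, {"include", 3, 1}, {"<string.h>", 3, 1}};
    std::vector<std::string> lcs = {"#", "include", "<string.h>"};
    std::cout << minEmbeddingCost(s, lcs) << "\n";  // prints 1 for this toy input
    return 0;
}

With these made-up depths, the cheapest embedding matches all three LCS tokens inside the second directive, mirroring the preference illustrated by alt1 in Fig. 4: gaps that stay within one statement cost fewer arrows than gaps spanning statements.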
The cost table records the minimum matching cost between any two subsequences; the prev table traces the selected tokens in s that lead to optimal solutions to subproblems. Next, FAPR enumerates each token pair between s and a lcs to iteratively compute the minimum cost of each subproblem, and to update prev accordingly. When comparing s i and s j for the optimal LCS, FAPR applies Algorithm 1 twice: once to (a lcs, s i ) and once to (a lcs, s j ). It records the best match as a set of token pairs M = {m 1 , m 2 , . . . , m n }, where each pair m i (i ∈ [1, n]) has a token from s i and a token from s j . The time complexity of our algorithm is at most O(len(s i ) * len(s j )). This new algorithm is much faster than the fastest tree matching algorithm (by Zhang and Shasha [10] ), whose complexity is O(len(s i ) 2 * len(s j ) 2 ). Such complexity contrast motivated us to compare programs based on token sequences instead of AST structures. Input: s, a lcs / * s is si or sj , and a_lcs is the naïvely extracted LCS by Phase I (see Section III-A) * / Output: opt lcs / * The optimal LCS from s * / / * 1. Create two tables to separately save the matching cost and trace matched tokens / * 2. Enumerate each token pair between s and a_lcs to compute the cost iteratively to the minimum cost * / / * 3. Traverse prev to reveal the optimal LCS * / opt lcs = getT okenLoc(prev); 3) Program Transformation: FAPR transforms each selected correct program p j ∈ P in order to (1) normalize the usage of variable/function names and (2) prepare candidate references for repair generation. Specifically, among the abovementioned matching pairs M , FAPR extracts all identifier mappings IM (IM ⊂ M ) and counts the occurrences of each unique match. When an identifier id from one program is mapped to multiple identifiers {mid 1 , mid 2 , . . . , mid r } in the other program, FAPR computes the confidence value for each alternative pair as below: If an alternative mid u has the confidence value no smaller than 0.6 and occurs at least 3 times, FAPR keeps mid u and removes mappings conflicting with it. Otherwise, FAPR removes all alternatives from IM and does not consider id to be matched. Finally, based on the refined mappings IM , FAPR converts p j to p j by consistently replacing mapped identifiers. We chose the thresholds (0.6 and 3) based on two intuitions. First, if p i and p j are syntactically similar, most mappings in IM should be consistent with each other; we can rely on the majority of mappings to align code and transform p j . Thus, 0.6 is used to quantify the majority. Second, if p i and p j are syntactically different, their identifier mappings can conflict with each other and each mapping may occur only once or twice. To avoid any mismatches caused by such infrequent and unreliable token mappings, we used 3 to limit the minimum occurrence count for each selected token mapping. Our thresholds can impact FAPR's applicability and repair quality. Although we did not explore other values, our evaluation (see Section V) shows that the used values worked well and enabled FAPR to often suggest best repairs. This section first explains the extraction of candidate repairing edits, and then discusses repair minimization. 1) Extracting Candidate Repairing Edits: FAPR executes each normalized program with tests. If any normalized program p j passes all tests, FAPR uses p j to generate candidate repairing edits R j . 
Otherwise, if p j fails any test while p j does not, it is possible that the identifier mappings FAPR used to transform code are not fully accurate; FAPR discards such normalized programs. When there are multiple sets of candidate edits available, FAPR picks the one with fewest edits and we denote it as R s . Specifically for any code p j that passes tests, FAPR extracts p j 's code differences with p i based on the optimal LCS mentioned in Phase III. Intuitively, the extracted differences reflect all edits required to transform p i into p j . Each edit may insert, delete, or update a sequence of tokens at a code location. One technical challenge here is: when FAPR extracts LCS between program token sequences, it sometimes produces trivial mappings between the code snippets that should not have been matched. Look at the following two snippets: [ y ] > 0 Although the snippets are dissimilar to each other, FAPR still manages to extract the LCS as "x y 0". When naïvely relying on these trivial token mappings to generate edits, FAPR can create three edits that are less intuitive to users: (i) insert "a [" at location 0; (ii) update "== 0 ||" to "] [" at location 2; (iii) update "==" to "] >" at location 6. To improve the usefulness and readability of suggested repairing edits, FAPR replaces the small edits co-applied to the same AST subtree with one big edit. With more details, FAPR first maps all edit locations to the corresponding leaf nodes on p i 's AST. It then traverses the AST in a post-order way. If any subtree st has more than 50% of leaf nodes edited, then FAPR replaces all related edits with one edit that updates the whole token sequence of st. For the example mentioned earlier in this section, the new edit used to replace all trivial edits (i.e., small and unintuitive edits) is Update "x == 0 || y == 0" to "a [ x ] [ y ] > 0" With such AST-based refinement, FAPR can reduce the number of edits in some candidate repairs and remove meaningless edits. For our running example, FAPR infers five edits: 1. Insert "# include " at location 0. 2. Delete "long" at location 13. 3. Delete "long" at location 21. 4. Update "a" to "a -1" at location 56. 5. Update ">" to ">=" at location 59. 2) Repair Minimization: FAPR computes the minimal repair by removing unneeded edits from R s . The major challenge here is how to efficiently search for the minimal repair based on testing. Although a naïve enumeration of all subsets of edits guarantees a minimum repair, it is almost inapplicable in practice. This is because given a repair with n edits, the total number of subsets to explore is (2 n − 2). This exponential search space is intolerable, especially when n is large (i.e., n > 10) and every round of test-based validation is costly. To overcome the challenge, we invented a heuristic-based algorithm. Given a repair R s containing n edits, our algorithm explores at most (3n − 1) subsets in the following manner: (1) Single-edit repairs: FAPR applies each edit independently to p i and gets n distinct versions. If any version passes all tests, the corresponding edit is the minimum repair. (2) Leave-one-out (LOO) repairs: In each trial, FAPR removes one edit from R s and applies the remaining (n − 1) edits to p i . If any version passes all tests, the removed edit is an unneeded change and should be skipped. After these n trials, FAPR removes all unnecessary changes from R s . 
(3) Leave-two-out (LTO) repairs: Similar to LOO, FAPR removes two consecutive edits in each trial and applies the remaining edits in R s to p i . In this way, FAPR filters out certain changes whose co-application does not affect the validation outcome. Given the five edits mentioned in Section III-D1 (see Table I ), FAPR first applies each edit individually to p i . Because no modified version passes all tests, there is no single-edit fix to p i . Next, FAPR applies four edits at a time. As edits {2, 3, 4, 5} produce a correct program, the 1 st edit is unneeded. Similarly, we can remove the 2 nd and 3 rd edits. After LOO, R s = {4, 5}. There is no need to try LTO because only two edits are left. This algorithm was designed based on our manual analysis of randomly sampled solutions. We observed that (1) many incorrect solutions are similar to correct ones but only different in one or two places, and (2) each suggested edit is usually independent of others or at most related to another one edit. By limiting subset exploration to a small number of trials, our algorithm ensures FAPR to always suggest repairs in a timely manner. In the worst case, even if an incorrect program does not match our observation, FAPR can still suggest correct repairs although the repair sizes may be not optimal. Our later evaluation (see Section V) shows that FAPR repaired many programs and had low runtime overhead; the suggested repairs were usually considered minimal by users. IV. IMPLEMENTATION SPECIFICS FAPR uses srcML [8]-a lightweight, highly scalable, robust, multi-language parsing tool to parse code and represent ASTs with XML files. FAPR compares code based on the XML format, modifies code based on token sequences, and validates code modification via testing; thus, it has very weak dependence on language-specific features or program analysis frameworks. As srcML supports multiple programming languages, we implemented FAPR to process both C and C++ program by using different srcML frontends. Based on our experience, FAPR's implementation does not change with the language it processes, which indicates the tool's good portability across languages. Additionally, we used multithreading to speed up code comparison for Phase I. V. EVALUATION We conducted two experiments with FAPR and Clara to explore four research questions: • RQ1: How effectively does FAPR suggest repairs for incorrect solutions? • RQ2: How efficiently does FAPR generate repairs? • RQ3: What is the quality of repairs generated by FAPR? • RQ4: How does FAPR compare with Clara in terms of the factors mentioned in RQ1-RQ3? Our first experiment quantitatively measures the effectiveness and efficiency of FAPR and Clara to answer RQ1, RQ2, and part of RQ4. The second experiment qualitatively measures the relevance, usefulness, size, and readability of repairs output by both tools to answer RQ3 and part of RQ4. All experiments were performed on a server with 32 Xeon E5-2667 3.2GHz processors. We set the process pool size to 8. This section introduces our datasets (Section V-A) and evaluation metrics (Section V-B), presents the results by FAPR (Section V-C) and by Clara (Section V-D), and explains the comparison between FAPR and Clara (Section V-E). We have two datasets: one dataset (D1) used to evaluate the effectiveness and efficiency of repair generation, and the other one (D2) used to evaluate the quality of generated repairs. 
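Before turning to the datasets, the sketch below illustrates the repair-minimization heuristic of Section III-D2 (single-edit trials, then leave-one-out, then leave-two-out over consecutive edits). It is one possible reading of that description, not FAPR's implementation: the Subset and Oracle types and the toy oracle in main() are placeholders we invented, whereas FAPR validates a subset by applying the corresponding token edits to p_i and running the predefined test cases.

#include <cstddef>
#include <functional>
#include <iostream>
#include <vector>

// A repair is a set of indices into the candidate edit list R_s.
using Subset = std::vector<std::size_t>;
// The oracle stands in for "apply this subset of edits to p_i, compile the
// result, and run all predefined tests".
using Oracle = std::function<bool(const Subset&)>;

// Heuristic minimization: at most (3n - 1) test-validated trials for n edits.
static Subset minimizeRepair(std::size_t n, const Oracle& passesAllTests) {
    // (1) Single-edit repairs.
    for (std::size_t i = 0; i < n; ++i)
        if (passesAllTests({i})) return {i};

    // (2) Leave-one-out: drop every edit whose removal still passes.
    std::vector<bool> keep(n, true);
    for (std::size_t i = 0; i < n; ++i) {
        Subset trial;
        for (std::size_t j = 0; j < n; ++j)
            if (j != i) trial.push_back(j);
        if (passesAllTests(trial)) keep[i] = false;
    }
    Subset reduced;
    for (std::size_t i = 0; i < n; ++i)
        if (keep[i]) reduced.push_back(i);

    // (3) Leave-two-out over consecutive remaining edits.
    if (reduced.size() > 2) {
        for (std::size_t i = 0; i + 1 < reduced.size(); ++i) {
            Subset trial;
            for (std::size_t j = 0; j < reduced.size(); ++j)
                if (j != i && j != i + 1) trial.push_back(reduced[j]);
            if (passesAllTests(trial)) { reduced = trial; break; }
        }
    }
    return reduced;
}

int main() {
    // Toy oracle mimicking the running example's five candidate edits,
    // where only the 4th and 5th edits (indices 3 and 4) are really needed.
    Oracle oracle = [](const Subset& s) {
        bool has3 = false, has4 = false;
        for (std::size_t e : s) { has3 |= (e == 3); has4 |= (e == 4); }
        return has3 && has4;
    };
    for (std::size_t e : minimizeRepair(5, oracle)) std::cout << e << " ";
    std::cout << "\n";  // prints "3 4"
    return 0;
}

With this toy oracle, no single edit passes, leave-one-out discards the first three edits, and the procedure stops with the two edits that fix the loop bounds, just as in the worked example above.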
1) Dataset D1: FAPR handles C and C++ programs, so we collected the C/C++ program data from an online judge system, OpenJudge [7]. OpenJudge is a platform where administrators post programming problems and users submit their solutions for evaluation. OpenJudge compiles each code submission, and runs the compiled version with predefined test cases. OpenJudge outputs "accepted (AC)" for any program passing all tests, or "wrong answer (WA)" for incorrect code that fails any test. Every user can submit multiple solutions for a problem. For each code submission, OpenJudge records the program, its judgment (WA or AC), the related problem Id, and the author Id. We chose five problems (see Appendix A in the supplementary document) from OpenJudge, because they were used in the course "Introduction to Computing". We downloaded all code submissions from OpenJudge. Since the program corpus is too large to use, we took two strategies when building D1 based on the corpus. Strategy 1: As a user may submit incorrect solutions for a problem before submitting a correct one, those incorrect solutions can be very similar to the correct one written by the same person. We filtered out a user's correct solution(s) if the user also submitted incorrect code. In this way, we ensure that (1) there is no data bias towards the correct/incorrect solutions written by the same person and (2) our experiment simulates FAPR's real application scenario. Strategy 2: For every problem and each language, we randomly selected 3,000 correct programs and 2,000 incorrect programs to include in D1. When OpenJudge has insufficient program data for a particular problem or language, we simply included all available data after applying Strategy 1. As shown in Table II, the adopted languages and problem Ids divide the whole dataset into 10 distinct program sets. The code size varies from 1 to 609 lines of code (LOC). Each program set includes 2,221-3,000 correct solutions and 847-2,000 incorrect ones. There are 5-10 test cases predefined for each problem.
Table II: The 10 program sets in D1. For each problem Id we list the number of predefined tests and, per language, the LOC statistics and the numbers of correct and incorrect solutions.
1678 | 10 tests | C: LOC 1-65 (18), 3,000 correct, 847 incorrect | C++: LOC 3-416 (18), 3,000 correct, 2,000 incorrect
1689 | 10 tests | C: LOC 1-119 (15), 3,000 correct, 1,236 incorrect | C++: LOC 2-448 (15), 3,000 correct, 2,000 incorrect
1703 | 10 tests | C: LOC 1-609 (17), 2,221 correct, 1,431 incorrect | C++: LOC 2-217 (18), 3,000 correct, 2,000 incorrect
1716 | 10 tests | C: LOC 1-177 (14), 3,000 correct, 1,908 incorrect | C++: LOC 3-164 (15), 3,000 correct, 2,000 incorrect
1720 | 5 tests | C: LOC 1-57 (11), 3,000 correct, 1,254 incorrect | C++: LOC 2-414 (11), 3,000 correct, 2,000 incorrect
2) Dataset D2: We created D2 based on the outputs by FAPR and Clara. Specifically, among the incorrect solutions repaired by both tools, we randomly sampled 50 solutions in each C program set, because Clara repairs C and Python code. Then we included the 250 sampled programs (i.e., 50 * 5) together with the tool-generated repairs in D2. For RQ1, we defined two metrics: coverage and accuracy. Coverage (C) measures, among all given incorrect programs, for how many of them a tool can suggest repairs. Accuracy (A) measures how many of the suggested repairs enable the repaired programs to pass all predefined tests. For RQ2, we measured the tools' runtime overheads. For each tool, there are two kinds of cost. The first cost is preprocessing time, which measures how much time a tool spends on initialization and preparation. It corresponds to the time spent by FAPR to create XML files based on AST parsing, and that spent by Clara to cluster correct solutions based on dynamic program analysis. The second cost is average repair time, which measures the average amount of time a tool spends on each repair suggestion. It is the time spent by FAPR to derive and test repairs, and that spent by Clara to generate a repair. For RQ3, we defined two metrics to measure repair quality based on humans' manual inspection: relevance and usefulness.
Repair Relevance (RR) measures how relevant a repair is to the expected repair in instructors' mind. An instructor or TA grades a repair on a 1-4 scale as below: 4: The repair is correct and fixes the bug(s), without any extra change. 3: The repair is correct and fixes the bug(s), but includes certain unnecessary change(s). 2: The repair is correct but loosely related to the original algorithm design. Namely, it applies so many edits that the original code is radically changed. 1: The repair is incorrect. RR reflects the correctness of a suggested repair. The higher RR is, the better. RR is closely related to the accuracy metric A mentioned above; however, it measures the repair quality based on teachers' perception instead of testing results. Repair Usefulness (RU) measures the usefulness of a repair suggestion perceived by teachers when they are given the repair together with the related incorrect solution. An instructor or TA grades a repair with a five-level Likert scale: 5: The repair suggestion is very useful. The repair suggestion is somewhat useful. The repair suggestion is neither useful nor useless. The repair suggestion is less useful. 1: The repair suggestion is least useful. RU varies within [1, 5] . The higher score, the better. For RQ4, in addition to the above-mentioned metrics, we designed and used two survey questions to query people's opinions on the repair comparison between FAPR and Clara. Size Comparison (SC). Given an incorrect solution and two repairs separately generated by FAPR and Clara, participants are asked to choose among the following three options: A. The correct repairs suggested by FAPR are smaller. B. The correct repairs suggested by Clara are smaller. C. The two types of repairs have almost equal sizes. Readability Comparison (RC). Given an incorrect solution and repairs suggested by tools, participants are asked to choose among the following three options: A. FAPR's repair is easier to accept and understand. B. Clara's repair is easier to accept and understand. C. The two repairs are almost equally easy/difficult to accept and understand. In the user study, participants did not know which tool is ours. In the first experiment, we applied FAPR to D1. In the second experiment, we recruited five CS students to separately inspect the sampled programs and related FAPR's repairs in D2, assuming them to play the TA role. These students had 2-8 years of C experience; each of them independently assessed the repair quality for samples from one C program set. Table III , FAPR achieved very high coverage. It suggested repairs for 95.9%-99.9% of incorrect C programs; it obtained 95.5%-99.8% coverage when being applied to 5 C++ program sets. We were curious why FAPR was unable to suggest repairs for some programs, so we manually inspected code and found two reasons. First, some programs contain special characters (e.g., Chinese characters) that prevents srcML from parsing code. Second, some incorrect solutions are dissimilar to all correct ones and the normalized programs all failed test(s), being unusable to repair code. In the future, we will overcome the first limitation by preprocessing code to remove special characters, and mitigate the second one by enlarging the corpus of correct code. 2) Accuracy: In all program sets, FAPR achieved 100% accuracy. It means that the suggested repairs by FAPR always pass all tests. This is expected, because we designed FAPR to validate each generated repair via testing. 
If FAPR cannot produce any repair to fix a given program, the program is not considered in the accuracy evaluation. 3) Runtime Overhead: As shown in Table III, in each program set, the preprocessing time of converting all correct code to XML files is 48.7-66.8 seconds, while the average repair time is 1.4-8.1 seconds. Our evaluation implies that when teachers rely on FAPR to locate bugs and suggest repairs, FAPR can always respond quickly.
Finding 1: Our quantitative experiment with FAPR shows that it generated repairs with very high coverage (≥ 95.5%), 100% accuracy, and low runtime overheads (≤ 8.1 seconds per repair).
Table IV: Numbers of sampled repairs per repair-relevance (RR) grade, for FAPR (left) and Clara (right).
Problem Id | FAPR: 4 / 3 / 2 / 1 | Clara: 4 / 3 / 2 / 1
1678 | 43 / 3 / 1 / 3 | 24 / 11 / 0 / 15
1689 | 45 / 4 / 1 / 0 | 25 / 16 / 3 / 6
1703 | 43 / 4 / 3 / 0 | 9 / 17 / 0 / 24
1716 | 50 / 0 / 0 / 0 | 11 / 8 / 0 / 31
1720 | 43 / 4 / 2 / 1 | 27 / 12 / 7 / 4
Total | 224 / 15 / 7 / 4 | 96 / 64 / 10 / 80
4) Repair Relevance: According to the left part of Table IV, among the 250 samples, participants considered 224 repairs to be correct and to include no unnecessary change. They identified another 15 repairs to be correct but to include unnecessary edits. Another seven repairs were graded "2"; they were considered correct but loosely related to the original program. Four repairs were graded "1" and considered incorrect, which was unexpected to us because FAPR only suggested repairs after the repaired programs passed all tests. We inspected the reported incorrect repairs and found the predefined test cases to have limited branch coverage. Consequently, some adopted "correct" programs pass all test cases but actually contain bugs, which makes the generated repairs incorrect. 5) Repair Usefulness: According to the left part of Table V, participants believed 247 out of 250 repair samples to be very useful or somewhat useful. Only two repairs were graded "neither useful nor useless" and one repair was considered "useless". We further inspected the three less helpful repairs to understand why participants disliked them. The one considered useless is an incorrect repair FAPR suggested due to limited test coverage; the other two significantly changed the algorithm design. FAPR suggested such algorithm-modifying repairs because the adopted correct code implements different algorithms from the original incorrect code. Participants prefer repairs that fix bugs instead of replacing whole algorithms.
Finding 2: Our qualitative experiment with FAPR's repairs shows that participants considered (1) 89.6% of repairs to be correct and smallest and (2) 98.8% of repairs to be helpful. The average RR score is 3.8 out of 4, and the average RU score is 4.9 out of 5.
In the first experiment, we applied Clara to the 5 C program sets in D1 because Clara does not repair C++ code. In the second experiment, similar to what we did in Section V-C, we also asked the five participants to manually inspect Clara's repairs for the sampled incorrect solutions in D2. Namely, given an incorrect solution, a participant was expected to inspect the alternative repairs suggested by both tools without knowing which tool is ours. With this experiment setting, we ensured that the alternative repairs for the same incorrect solution were always examined by the same person and assessed based on the same subjective criteria. Section V-C reports participants' assessment of FAPR's repairs, while this section reports participants' assessment of Clara's repairs.
1) The Clara Approach: To automate repair generation, Clara first clusters correct solutions based on their semantic equivalence (e.g., equivalent variable states during program execution), and then compares each incorrect solution with the representative 1-2 solutions from each cluster to produce repairs. Specifically, to capture program semantics, Clara parses source code to create an internal representation (IR), and adopts an interpreter to execute each IR statement and record intermediate execution status (i.e., variable values). The recorded program states are later used to decide whether two programs implement the same semantics (i.e., by having equivalent variable values). Clara extracts IR differences for repairs. For example, given for(int i=a;i>0;i--) updated to for(int i=a-1;i>=0;i--), Clara expresses the edit as:
1. Change i := a to i := -(a, 1)
2. Change $cond := >(a, 0) to $cond := >=(a, 0)
2) Coverage: Table VI shows the evaluation results for Clara's coverage, accuracy, and runtime overhead. According to the table, Clara achieved 57.1%-92.0% coverage when applied to the 5 C program sets. To understand why Clara was unable to repair all programs, we manually analyzed its implementation. We found that Clara has a self-defined C parser for C-to-IR conversion, and a self-defined interpreter to execute IR instructions and produce traces for dynamic analysis. Both parts only handle general cases, and cannot process all syntactic structures or library function calls. 3) Accuracy: As mentioned in Clara's approach overview, the tool suggests repairs based on IR instead of C code; it does not transform IR back to C code to reflect the code modification, nor does it test code to automatically validate repairs. Thus, we could not measure the tool's accuracy via automated testing. 4) Runtime Overhead: According to Table VI, Clara incurred much higher runtime overheads than FAPR: it spent 62 hours in total clustering the correct C solutions before repairing any code, and 31.2 seconds to suggest each repair.
Table VII: Participants' size comparison (SC, left) and readability comparison (RC, right) between the repairs of FAPR and Clara. Option A favors FAPR, option B favors Clara, and option C means the two repairs are roughly equal.
Problem Id | SC: A / B / C | RC: A / B / C
1678 | 25 / 6 / 19 | 18 / 6 / 26
1689 | 18 / 5 / 27 | 16 / 2 / 32
1703 | 19 / 0 / 31 | 50 / 0 / 0
1716 | 38 / 0 / 12 | 41 / 0 / 9
1720 | 5 / 1 / 44 | 45 / 2 / 3
Total | 105 / 12 / 133 | 170 / 10 / 70
Finding 3: Our quantitative experiment with Clara shows that it generated repairs with good coverage (57.1%-92.0%), but incurred relatively high runtime overheads for clustering and repair suggestion.
5) Repair Relevance: According to the right part of Table IV, participants identified 96 repairs to be correct and minimized, and recognized another 64 repairs to be correct but containing redundant changes. They also considered 10 repairs to be correct but weakly related to the original algorithm design, and 80 repairs to be incorrect. 6) Repair Usefulness: As shown by the right part of Table V, participants believed 166 out of 250 repairs to be useful and 60 repairs to be useless. They were neutral about the remaining 24 repairs. The participants did not like 84 (i.e., 60+24) of the repairs because these repairs are either incorrect or totally replace the algorithm design. One possible reason for Clara's many unsatisfactory repairs is its semantics-based program comparison. By comparing and clustering programs based on equivalent execution status, Clara managed to align semantically similar but syntactically dissimilar code with each other. Although such flexible alignment can locate more candidate references to fix p_i, the generated repairs sometimes implement irrelevant algorithms and become less useful to participants.
Finding 4: Our qualitative experiment with Clara's repair samples shows that participants found (1) 38.4% of repairs to be correct and smallest and (2) 66.4% of repairs to be helpful.
The average RR value is 2.7 out of 4, and the average RU value is 3.7 out of 5. Observing the results shown in Sections V-C and V-D, we learnt that FAPR outperformed Clara by achieving higher coverage (i.e., fixing more code) with lower runtime overheads, and producing repairs with higher RR and RU scores. Table VII shows participants' manual comparison between the repairs of both tools. According to the table, when participants compared repair sizes, they considered 105 repairs by FAPR to be smaller than Clara's repairs; however, only 12 of Clara's repairs were considered smaller. For 133 incorrect solutions, the repairs generated by both tools seem to have equal sizes. In terms of readability, participants considered 170 of FAPR's repairs to be more readable; only 10 of Clara's repairs were considered to have better readability. Three reasons can explain FAPR's higher efficiency and better repair quality. First, FAPR compares code based on static token-level matches instead of semantic reasoning via dynamic analysis; its comparison method is simpler and thus faster. Second, participants preferred bug fixes over algorithm replacements; semantics-based code comparison does not ensure syntactic similarity and thus Clara's repairs are less accurate. Third, FAPR suggests repairs in C, which enables automated validation via testing and facilitates participants to comprehend repairs. Additionally, FAPR has better portability than Clara because it does not use complex program analysis. To apply FAPR to code written in a new language, developers only need to feed FAPR the XML files of program ASTs. The comparison between tools shows that FAPR worked better than Clara. FAPR suggested correct repairs for more programs. In many scenarios, users considered FAPR's repairs to have higher quality (i.e., smaller and more readable). To reveal the similarity or differences between programs, people can compare either code strings, token sequences, ASTs, control/data flows, program dependencies, or execution traces. Different comparison methods achieve distinct tradeoffs among speed, flexibility, and cross-language portability. For instance, string comparison is usually fast and applicable to different languaged code, but does not tolerate any difference (e.g., formatting) when matching code. On the other extreme, execution comparison is usually slow and only applicable to languages with good dynamic analysis support; however, it can flexibly match programs that are syntactically different but semantically equivalent. In MOOC, lots of correct code is available and students' incorrect solutions in different languages wait for debugging feedback; we believe (1) speed and cross-language portability to be more important than flexibility and (2) syntax-driven approaches to outperform semantics-driven ones. Instead of trying all edit subsets, FAPR minimizes repairs by enumerating single-edit repairs, leave-one-out repairs, and leave-two-out repairs; it also relies on testing outcomes to skip the exploration of some unpromising subsets. We designed this algorithm based on our manual analysis of sample programs (see Section III-D2). In the future, we will explore more heuristics that help FAPR reduce repairs more effectively without introducing too much overhead. The related work of our research includes automatic patch generation and automated feedback generation. 
Automatic Patch Generation: Given a buggy program that fails some test(s) in a test suite, automatic program repair (APR) generates candidate patches and checks patch correctness via compilation and testing [11] , [12] , [13] , [14] , [15] , [16] , [17] , [2] , [18] . For example, GenProg [17] and RSRepair [14] create patches by randomly replicating, mutating, or deleting code from the existing program. As random search often has a huge search space, these tools have high runtime overheads but unsatisfactory capabilities. They can Data-driven program repair tools extract edit patterns from frequently applied bug fixes, and use those patterns to fix similar buggy code [19] , [20] , [21] , [22] . For instance, Getafix [21] mines software repositories for code changes that fix a specific kind of bugs (e.g., NullPointerException) and generalizes edit patterns from those changes. Given a previously unseen buggy program, Getafix finds suitable fix patterns based on the edit context, ranks all candidate fixes, and suggests the top fixes. Tufano et al. [22] applies Neural Machine Translation (NMT) to bug-fixing commits, to train a model that predicts repaired code given buggy programs. These tools are similar to FAPR, but they do not guarantee that the suggested repairs pass all tests or always have correct syntax and semantics. Several tools use symbolic execution to reason about the semantics of both incorrect and correct programs, in order to repair code [23] , [24] . For instance, Angelix [23] first uses statistical fault localization to identify n most suspicious expressions in a buggy program. It then relies on concolic execution to extract value constraints on expressions that should be satisfied by a repaired program, and adopts program synthesis to generate repairs accordingly. However, due to the scalability and applicability issues of symbolic execution, these approaches cannot create complex patches that edit large chunks of code. In comparison, FAPR can create complex patches, as demonstrated by Fig. 5 . Please refer to Appendix B in our supplementary document for more examples. Knowledge-Based Feedback Generation: Tools were proposed to generate feedback based on domain knowledge or predefined rules [25] , [7] , [26] , [27] , [28] , [29] , [30] , [31] , [2] . Specifically, given program submissions, various online judge systems [25] , [7] , [26] and automatic grading systems [27] , [28] , [30] check for code correctness via compilation and testing. If a program compiles successfully and passes all predefined test cases, these systems consider the code to be correct; otherwise, they provide high-level succinct error description like "Time Limit Exceeded (TLE)" or "Memory Limit Exceeded (MLE)". Such vague and abstract program feedback may be not helpful to students. When using AutoGrader [2] to repair students' incorrect code, instructors have to learn a domain-specific language EML (short for "Error Model Language"), and use EML to describe possible coding errors and related fixes. Given a buggy program, AutoGrader matches the program with all error models, and enumerates related fixing patterns to correct the code. However, it is tedious and error-prone for instructors to learn EML and to prescribe error models, while the specified models can only fix a limited number of bugs. FAPR is different from prior work in two aspects. First, it does not require instructors to prescribe error models and related fixing patterns. 
Second, it generates repair based on existing correct solutions and predefined test cases. Data-Based Feedback Generation: Some tools were proposed to generate feedback based on the comparison between correct and incorrect solutions [32] , [33] , [34] , [35] , [36] , [4] , [3] , [37] , [38] , [39] , [40] . For instance, Pex4Fun [33] and CodeHunt [34] use dynamic symbolic execution to generate test cases that reveal behavioral discrepancies between student solutions and the teacher's hidden program specification. Similar to Clara, Laura [32] and Apex [36] also model and compare program semantics to repair code. However, Laura statically analyzes code to represent semantics with graphs, which capture the value constraints and evaluation ordering between variables. Apex collects symbolic and concrete execution traces to reason about program semantics, and uses Z3 to reveal semantic equivalence. Due to the usage of symbolic execution, dynamic analysis, and/or graph-based semantic reasoning, these tools cannot achieve high efficiency. FAPR complements these tools. Pu et al. [3] built sk p-that trains generative neural networks with correct programs, and uses the trained model to suggest fixes for incorrect programs. Evaluation shows that sk p could not suggest fixes with high accuracy. DrRepair [41] and DeepFix [42] also adopt machine learning methods to repair programs; however, they address compilation errors and cannot fix programs that fail tests. SARFGEN [37] is most relevant to our work. Given an incorrect program, it searches for syntactically similar correct code based on program embeddings, aligns code based on control-flow structures and tree edit distances, and generates repairs based on code differences. However, SARFGEN uses dynamic analysis to minimize repairs, which may jeopardize its cross-language portability and repair quality. As the tool is unavailable, we did not empirically compare FAPR with SARFGEN. Albeit similar to some prior work that compares incorrect programs against correct ones to create repairs, FAPR is novel in three aspects. First, FAPR retrieves correct code P most similar to the given incorrect code p i based on tokenlevel comparison. Such efficient comparison ensures the (1) syntactic similarity between P and p i , and (2) usability of P to generate minimal repairs that do not considerably modify algorithm design. Second, FAPR encodes syntactic structures into token sequences; such encodings help FAPR overcome any inaccuracy caused by pure token-level comparison and improve the quality of generated repairs. Third, FAPR minimizes repairs by selectively testing subsets of generated edits; this testing-based approach eliminates FAPR's dependency on complex program analysis and ensures its cross-language portability. Thanks to the novelties mentioned above, FAPR complements existing work and can suggest repairs in a fast and accurate way. Our comprehensive evaluation with 45,897 C/C++ programs evidences FAPR's great efficiency, accuracy, and crosslanguage portability. FAPR requires almost zero developer effort to get ported from C to C++ code. In the future, we will deploy FAPR to online judge systems. Repair for Introductory Programming Exercises A. Problem 1678 a) Description: Given k integers (1 < k < 100), each of which is no less than 1 and no more than 10, you are asked to write a program to count the occurrences of 1, 5 and 10 respectively. b) Input: There are two lines. The first line contains an integer k. 
The second line contains k integers split by a single space. c) Output: Three lines containing the number of occurrences of 1, 5 and 10 respectively. d) Sample Input: Given the age of each student in integers, find the average age of all the students in the class, rounded to two decimal places. b) Input: The first line contains an integer n (1 < n < 100), which is the number of students. Each of the next n lines contains an integer ranging from 15 to 25, which is the age of a student. c) Output: One line with a floating-point number rounded to two decimal places, which is the average age required. Here we demonstrate five programs and the generated repairs from FAPR, one for each problem. The repair relevance (RR) and repair usefulness (RU) of each repair from user study are given as well as the explanation of these repairs. These programs are selected to show how repairs of different RR look like. Figure 1 uses the modulo operator to check the divisibility of each integer for 1, 5 and 10, instead of equality, probably because of a misunderstanding of the original problem. b) Explanation for repair: The most natural way to fix this incorrect program is to change the conditions of three ifstatement from divisibility check to equality check. However, since the problem ensures that each number from input is no less than 1 and no more than 10, the last divisibility check can remain unchanged because 10 is the only number to be divisible by 10 in the given range. So the correct and smallest repair should only change the first two conditions. The repair generated by FAPR for this solution gets a RR score of 3, because it is correct and fixes the bug, but includes unnecessary changes that move the definition of variable h and use a different variable n for the number of input. It get a RU score of 5 i.e. is considered very useful by the volunteer, probably because the volunteer thinks that the error in original solution is fixed exactly, though some minor changes are also included. B. Example from Problem 1689 a) Explanation for error: In Problem 1689, the program should reverse a given array. However, the incorrect solution in Figure 2 implements the algorithm of insertion sorting and sorts the given array in ascending order, probably because the student misunderstood the problem when he/she sees the (1 4 5 6 8) to be in ascending order, which is a coincidence. b) Explanation for repair: To fix this incorrect program, FAPR replaces a large chunk of code from the original solution. Namely, it replaces the part of insertion sorting with a for-loop that correctly reverses the array. It is worth mention that the generated for-loop uses the same variables in the original solution. Instead of i and j as common sense, it uses i and min as array index to avoid unnecessary code change because the original solution uses them, as an evidence of the strong adaptation of FAPR. The RR score of this repair is only 2, because the generated repair significantly changes the original algorithm design. However, such low repair relevance is unavoidable for any correct repair because the algorithm design of original solution i.e. insertion sorting is wrong. The RU score of this repair is also marked as 5 i.e. very useful by our volunteer, because the generated repair correctly fixes the algorithm used. of each number by 7, but a number is related to 7 can also have 7 in one of its digits according to the problem, which is not considered in this program. Third, it forgets to print the result to standard output. 
b) Explanation for repair: The repair generated by FAPR consists of three edits, each of which exactly fixes one mistake of the original solution. Note that n is smaller than 100 in the problem description, so checking each digit only requires checking the ones place and the tens place of a number, instead of a more complex loop-based solution. The RR score of this repair is 4 because it correctly fixes the bug without introducing unnecessary code changes. The RU score is 5 (i.e., very useful) according to our volunteer.
Fig. 4: An incorrect program from problem 1716 and the generated repair, with RR = 4 and RU = 5.
Incorrect code:
#include <stdio.h>
int main(void) {
    int n, i;
    double age, total;
    printf("Enter n:");
    scanf("%d", &n);
    total = 0;
    for (i = 1; i <= n; i++) {
        printf("Enter age#%d:", i);
        scanf("%lf", &age);
        total = total + age;
    }
    printf("Age average=%.2f ", total / n);
    return 0;
}
Repaired code:
#include <stdio.h>
int main(void) {
    int n, i;
    double age, total;
    scanf("%d", &n);
    total = 0;
    for (i = 1; i <= n; i++) {
        scanf("%lf", &age);
        total = total + age;
    }
    printf("%.2lf", total / n);
    return 0;
}
a) Explanation for error: In Problem 1716, the program should read a list of numbers and calculate their mean. The incorrect program in Figure 4 makes a common beginner mistake: it outputs extra prompts (as the textbook does), which is not needed and fails the tests in an online judge. b) Explanation for repair: The repair generated by FAPR fixes the mistake exactly, by removing the extra prompts before the input and output statements of the original program. The RR score of this repair is 4 because it correctly fixes the bug without introducing unnecessary code changes. The RU score is 5 (i.e., very useful) according to our volunteer.
Fig. 5: An incorrect program from problem 1720 and the generated repair, with RR = 1 and RU = 4.
Incorrect code:
#include <stdio.h>
int main() {
    int a;
    scanf("%d", &a);
    if (a==2&&a==4&&a==6&&a==7) printf("YES");
    else printf("NO");
    return 0;
}
Code with the incorrect generated repair:
#include <stdio.h>
int main() {
    int a;
    scanf("%d", &a);
    if (a==2&&a==4||a==6||a==7) printf("YES");
    else printf("NO");
    return 0;
}
a) Explanation for error: In Problem 1720, the program should make a simple judgment, and output "YES" if the input is 1, 3, or 5, and "NO" if it is 2, 4, 6, or 7. The incorrect solution in Figure 5 mistakenly concatenates multiple conditions with the and operator (&&) instead of the or operator (||), causing the program to output "NO" whatever the input is. b) Explanation for repair: The repair generated by FAPR only modifies two of the three and operators (&&), and its RR score is marked as 1 (i.e., the repair is incorrect) by our volunteer. We further analyzed the reason and found something interesting. The test cases of this problem only cover five situations, i.e., the input is 1, 3, 5, 6, or 7. FAPR first generates a repair consisting of three edits, but drops the first one when minimizing the repair set, because the program still passes all test cases without that edit. This example shows a common limitation of testing-based APR techniques. However, from another point of view, it also shows the effectiveness of our tool at generating a minimal repair that passes the tests. The RU score of this repair is marked as 4 (i.e., somewhat useful) by our volunteer, because it points out the exact mistakes of the original program and is inspiring, though it does not fully fix them.
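To make the coverage argument above concrete, the small check below compares the condition kept by the minimized repair against the condition of the full three-edit repair (all three && changed to ||, as the explanation describes). This is our own illustration, not part of FAPR: on the five covered inputs the two conditions select the same branch, so the minimized repair passes every predefined test, while on the uncovered inputs 2 and 4 they diverge, which is exactly why the volunteer judged the repair incorrect.

#include <iostream>
#include <vector>

// Condition after the full three-edit repair for Problem 1720.
static bool fullRepair(int a) { return a == 2 || a == 4 || a == 6 || a == 7; }

// Condition after FAPR's minimization, which keeps the first && because
// dropping that edit still passes every predefined test.
static bool minimizedRepair(int a) { return (a == 2 && a == 4) || a == 6 || a == 7; }

int main() {
    std::cout << std::boolalpha;
    // The five inputs covered by the predefined test cases.
    for (int a : std::vector<int>{1, 3, 5, 6, 7})
        std::cout << "covered input " << a << ": same branch = "
                  << (fullRepair(a) == minimizedRepair(a)) << "\n";  // all true
    // The two uncovered inputs expose the difference.
    for (int a : std::vector<int>{2, 4})
        std::cout << "uncovered input " << a << ": same branch = "
                  << (fullRepair(a) == minimizedRepair(a)) << "\n";  // both false
    return 0;
}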
REFERENCES
Remember the MOOCs? After Near-Death, They're Booming
Automated feedback generation for introductory programming assignments
sk_p: a neural program corrector for MOOCs
Semi-supervised verified feedback generation
Automated clustering and program repair for introductory programming assignments
SemCluster: Clustering of imperative programming assignments based on quantitative semantic features
OpenJudge
srcML
Compilers: Principles, Techniques, and Tools
Simple fast algorithms for the editing distance between trees and related problems
Automatically patching errors in deployed software
Automatically finding patches using genetic programming
Automatic patch generation by learning correct code
The strength of random search on automated program repair
Automatic repair of buggy if conditions and missing preconditions with SMT
Automatic patch generation learned from human-written patches
GenProg: A generic method for automatic software repair
SemFix: Program repair via semantic analysis
Systematic editing: Generating program transformations from an example
LASE: Locating and applying systematic edits by learning from examples
Getafix: Learning to fix bugs automatically
An empirical study on learning bug-fixing patches in the wild via neural machine translation
Angelix: Scalable multiline program patch synthesis via symbolic analysis
Semantic program repair using a reference implementation
LeetCode
Developing an automated program checkers
A software infrastructure to support introductory computer science courses
Grading student programs using ASSYST
A secure on-line submission system
Improving student performance by evaluating how well students test their own programs
Laura, a system to debug student programs
Teaching and learning programming and software engineering via interactive gaming
Code Hunt: Searching for secret code for fun
An automated framework for recommending program elements to novices
Apex: Automatic programming assignment error explanation
Search, align, and repair: Data-driven feedback generation for introductory programming exercises
Automatic and scalable detection of logical errors in functional programming assignments
Refactoring based program repair applied to programming assignments
Fast test suite-driven model-based fault localization with application to pinpointing defects in student programs
Graph-based, self-supervised program repair from diagnostic feedback
DeepFix: Fixing common C language errors by deep learning