Building a State-of-the-Art Grammatical Error Correction System

Alla Rozovskaya, Center for Computational Learning Systems, Columbia University, New York, NY 10115, alla@ccls.columbia.edu
Dan Roth, Department of Computer Science, University of Illinois, Urbana, IL 61801, danr@illinois.edu

Transactions of the Association for Computational Linguistics, 2 (2014) 419–434. Action Editor: Alexander Koller. Submitted 10/2013; Revised 6/2014; Published 10/2014. © 2014 Association for Computational Linguistics.

Abstract

This paper identifies and examines the key principles underlying building a state-of-the-art grammatical error correction system. We do this by analyzing the Illinois system that placed first among seventeen teams in the recent CoNLL-2013 shared task on grammatical error correction. The system focuses on five different types of errors common among non-native English writers. We describe four design principles that are relevant for correcting all of these errors, analyze the system along these dimensions, and show how each of these dimensions contributes to the performance.

1 Introduction

The field of text correction has seen an increased interest in the past several years, with a focus on correcting grammatical errors made by English as a Second Language (ESL) learners. Three competitions devoted to error correction for non-native writers took place recently: HOO-2011 (Dale and Kilgarriff, 2011), HOO-2012 (Dale et al., 2012), and the CoNLL-2013 shared task (Ng et al., 2013). The most recent and most prominent among these, the CoNLL-2013 shared task, covers several common ESL errors, including article and preposition usage mistakes, mistakes in noun number, and various verb errors, as illustrated in Fig. 1.[1]

Nowadays *phone/phones *has/have many functionalities, *included/including *∅/a camera and *∅/a Wi-Fi receiver.
Figure 1: Examples of representative ESL errors.

[1] The CoNLL-2014 shared task that was completed at the time of writing this paper was an extension of the CoNLL-2013 competition (Ng et al., 2014) but addressed all types of errors. The Illinois-Columbia submission, a slightly extended version of the Illinois CoNLL-2013 system, ranked at the top. For a description of the Illinois-Columbia submission, we refer the reader to Rozovskaya et al. (2014a).

Seventeen teams that participated in the task developed a wide array of approaches that include discriminative classifiers, language models, statistical machine-translation systems, and rule-based modules. Many of the systems also made use of linguistic resources such as additional annotated learner corpora, and defined high-level features that take into account syntactic and semantic knowledge.

Even though the systems incorporated similar resources, the scores varied widely. The top system, from the University of Illinois, obtained an F1 score of 31.20[2], while the second team scored 25.01 and the median result was 8.48 points.[3] These results suggest that there is not enough understanding of what works best and what elements are essential for building a state-of-the-art error correction system.

[2] The state-of-the-art performance of the Illinois system discussed here is with respect to individual components for different errors. Improvements in Rozovskaya and Roth (2013) over the Illinois system that are due to joint learning and inference are orthogonal, and the analysis in this paper still applies there.
[3] F1 might not be the ideal metric for this task but this was the one chosen in the evaluation. See more in Sec. 6.

In this paper, we identify key principles for building a robust grammatical error correction system and show their importance in the context of the shared task. We do this by analyzing the Illinois system and evaluating it along several dimensions: choice of learning algorithm; choice of training data (native or annotated learner data); model adaptation to the mistakes made by the writers; and the use of linguistic knowledge. For each dimension, several implementations are compared, including, when possible, approaches chosen by other teams. We also validate the obtained results on another learner corpus. Overall, this paper makes two contributions: (1) we explain the success of the Illinois system, and (2) we provide an understanding and qualitative analysis of different dimensions that are essential for success in this task, with the goal of aiding future research on it. Given that the Illinois system has been the top system in four competitive evaluations over the last few years (HOO and CoNLL), we believe that the analysis we propose will be useful for researchers in this area.

In the next section, we present the CoNLL-2013 competition. Sec. 3 gives an overview of the approaches adopted by the top five teams. Sec. 4 describes the Illinois system. In Sec. 5, the analysis of the Illinois system is presented. Sec. 6 offers a brief discussion, and Sec. 7 concludes the paper.

2 Task Description

The CoNLL-2013 shared task focuses on five common mistakes made by ESL writers: article/determiner, preposition, noun number, verb agreement, and verb form. The training data of the shared task is the NUCLE corpus (Dahlmeier et al., 2013), which contains essays written by learners of English (we also refer to it as learner data or shared task training data). The test data consists of 50 essays by students from the same linguistic background. The training and the test data contain 1.2M and 29K words, respectively.

Table 1 shows the number of errors by type and the error rates. Determiner errors are the most common and account for 42.1% of all errors in training. Note that the test data contains a much larger proportion of annotated mistakes; e.g. determiner errors occur four times more often in the test data than in the training data (only 2.4% of noun phrases in the training data have determiner errors, versus 10% in the test data). The differences might be attributed to differences in annotation standards, annotators, or writers, as the test data was annotated at a later time.

Error        Train          Test
Art.         6658 (2.4%)    690 (10.0%)
Prep.        2404 (2.0%)    311 (10.7%)
Noun         3779 (1.6%)    396 (6.0%)
Verb agr.    1527 (2.0%)    124 (5.2%)
Verb form    1453 (0.8%)    122 (2.5%)
Table 1: Statistics on annotated errors in the CoNLL-2013 shared task data. Percentage denotes the error rates, i.e. the number of erroneous instances with respect to the total number of relevant instances in the data.

The shared task provided two sets of test annotations: the original annotated data and a set with additional revisions that also includes alternative annotations proposed by participants. Clearly, having alternative answers is the right approach as there are typically multiple ways to correct an error. However, because the alternatives are based on the error analysis of the participating systems, the revised set may be biased (Ng et al., 2013). Consequently, we report results on the original set.
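To make the scoring setup concrete, the toy sketch below computes precision, recall, and F1 over sets of correction edits. This is our own illustration, not the official shared-task scorer, which handles edit spans and alignments between system and gold corrections more carefully; the edit representation and the example edits are invented.

```python
# Illustrative only: a simplified precision/recall/F1 over correction edits.
# An edit is (sentence_id, start_token, end_token, replacement).

def prf1(gold_edits, system_edits):
    gold, system = set(gold_edits), set(system_edits)
    tp = len(gold & system)          # corrections the system got exactly right
    fp = len(system - gold)          # spurious or wrong corrections
    fn = len(gold - system)          # gold corrections the system missed
    precision = tp / (tp + fp) if system else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

if __name__ == "__main__":
    gold = [(0, 1, 2, "phones"), (0, 2, 3, "have"), (0, 5, 5, "a")]
    system = [(0, 1, 2, "phones"), (0, 4, 5, "included")]
    print(prf1(gold, system))        # (0.5, 0.333..., 0.4)
```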
3 Model Dimensions Table 2 summarizes approaches and methodologies of the top five systems. The prevailing approach consists in building a statistical model either on learner data or on a much larger corpus of native En- glish data. For native data, several teams make use of the Web 1T 5-gram corpus (henceforth Web1T, (Brants and Franz, 2006)). NARA employs a statis- tical machine translation model for two error types; two systems have rule-based components for se- lected errors. Based on the analysis of the Illinois system, we identify the following, inter-dependent, dimensions that will be examined in this work: 1. Learning algorithm: Most of the teams, includ- ing Illinois, built statistical models. We show that the choice of the learning algorithm is very impor- tant and affects the performance of the system. 2. Adaptation to learner errors: Previous stud- ies, e.g. (Rozovskaya and Roth, 2011) showed that adaptation, i.e. developing models that utilize knowledge about error patterns of the non-native writers, is extremely important. We summarize adaptation techniques proposed earlier and examine their impact on the performance of the system. 3. Linguistic knowledge: It is essential to use some linguistic knowledge when developing error correc- tion modules, e.g., to identify which type of verb 420 System Error Approach Illinois (Rozovskaya et al., 2013) Art. AP model on NUCLE with word, POS, shallow parse features Prep. NB model trained on Web1T and adapted to learner errors Noun/Agr./Form NB model trained on Web1T NTHU (Kao et al., 2013) All Count model with backoff trained on Web1T HIT (Xiang et al., 2013) Art./Prep./Noun ME on NUCLE with word, POS, dependency features Agr./Form Rule-based NARA (Yoshimoto et al., 2013) Art./Prep. SMT model trained on learner data from Lang-8 corpus Noun ME model on NUCLE with word, POS and dependency features Agr./Form Treelet LM on Gigaword and Penn TreeBank corpora UMC (Xing et al., 2013) Art./Prep. Two LMs – on NUCLE and Web1T corpus – with voting Noun Rules and ME model on NUCLE + LM trained on Web1T Agr./Form ME model on NUCLE (agr.) and rules (form) Table 2: Top systems in the CoNLL-2013 shared task. The second column indicates the error type; the third column describes the approach adopted by the system. ME stands for Maximum Entropy; LM stands for language model; SMT stands for Statistical Machine Translation; AP stands for Averaged Perceptron; NB stands for Naı̈ve Bayes. Classifier Art. Prep. Noun Agr. Form Train 254K 103K 240K 75K 175K Test 6K 2.5K 2.6K 2.4K 4.8K Table 3: Number of candidate words by classifier type. error occurs in a given context, before the appropri- ate correction module is employed. We describe and evaluate the contribution of these elements. 4. Training data: We discuss the advantages of training on learner data or native English data in the context of the shared task and in broader context. 4 The Illinois System The Illinois system consists of five machine-learning models, each specializing in correcting one of the er- rors described above. The words that are selected as input to a classifier are called candidates (Table 3). In the preposition system, for example, candidates are determined by surface forms. In other systems, determining the candidates might be more involved. All modules take as input the corpus documents pre-processed with a part-of-speech tagger4 (Even- Zohar and Roth, 2001) and shallow parser5 (Pun- yakanok and Roth, 2001). 
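As a rough illustration of this architecture (our own sketch, not the authors' code), each error-specific module can be viewed as a pair of a candidate generator and a classifier over a confusion set; the function and variable names below are ours, and pre-processing (POS tags, shallow parse) is omitted.

```python
def run_modules(tokens, modules):
    """modules: {name: (generate_candidates, classify)}.
    A candidate is (token_index, confusion_set); a proposed correction is
    (token_index, new_word)."""
    corrections = {}
    for name, (generate, classify) in modules.items():
        for idx, confusion_set in generate(tokens):
            predicted = classify(tokens, idx, confusion_set)
            if predicted != tokens[idx]:
                corrections.setdefault(name, []).append((idx, predicted))
    return corrections

# Toy noun-number module: every alphabetic token is a candidate, and the
# classifier simply prefers the plural form for the word right after "many".
noun_module = (
    lambda toks: [(i, [t, t + "s"]) for i, t in enumerate(toks) if t.isalpha()],
    lambda toks, i, cs: cs[1] if i > 0 and toks[i - 1] == "many" else cs[0],
)
print(run_modules("it has many phone".split(), {"noun": noun_module}))
# {'noun': [(3, 'phones')]}
```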
In the Illinois submission, some modules are trained on native data, others on learner data. The modules trained on learner data make use of a discriminative algorithm, while native-trained modules make use of the Naïve Bayes (NB) algorithm. The Illinois system has an option for a post-processing step where corrections that always result in a false positive in training are ignored, but this option is not used here.

[4] http://cogcomp.cs.illinois.edu/page/software_view/POS
[5] http://cogcomp.cs.illinois.edu/page/software_view/Chunker

4.1 Determiner Errors

The majority of determiner errors involve articles, although some errors also involve pronouns. The Illinois system addresses only article errors. Candidates include articles ("a", "an", "the")[6] and omissions, by considering noun-phrase-initial contexts where an article is likely to be omitted. The confusion set for articles is thus {a, the, ∅}. The article classifier is the same as the one in the HOO shared tasks (Rozovskaya et al., 2012; Rozovskaya et al., 2011), where it demonstrated superior performance. It is a discriminative model that makes use of the Averaged Perceptron algorithm (AP, (Freund and Schapire, 1996)) implemented with LBJava (Rizzolo and Roth, 2010) and is trained on learner data with rich features and adaptation to learner errors. See Sec. 5.2 and Sec. 5.3.

[6] The variants "a" and "an" are collapsed to one class.

4.2 Preposition Errors

Similar to determiners, we distinguish three types of preposition mistakes: choosing an incorrect preposition, using a superfluous preposition, and omitting a preposition. In contrast to determiners, for learners of many first language backgrounds, most of the preposition errors are replacements, i.e., where the author correctly recognized the need for a preposition, but chose the wrong one (Leacock et al., 2010). However, learner errors depend on the first language; in NUCLE, spurious prepositions occur more frequently: 29% versus 18% of all preposition mistakes in other learner corpora (Rozovskaya and Roth, 2010a; Yannakoudakis et al., 2011).

The Illinois preposition classifier is a NB model trained on Web1T that uses word n-gram features in the 4-word window around the preposition. The 4-word window refers to the four words before and the four words after the preposition, e.g. "problem as the search of alternative resources to the" for the preposition "of". Features consist of word n-grams of various lengths spanning the target preposition. For example, "the search of" is a 3-gram feature. The model is adapted to likely preposition confusions using the priors method (see Sec. 5.2). The Illinois model targets replacement errors of the 12 most common English prepositions. Here we augment it to identify spurious prepositions. The confusion set for prepositions is as follows: {in, of, on, for, to, at, about, with, from, by, into, during, ∅}.
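The following sketch shows the kind of word n-gram features described above: n-grams of various lengths that span the target preposition, drawn from a window of four words on each side. It is a simplified illustration under our own naming scheme, not the exact Illinois feature set.

```python
def ngram_features(tokens, i, window=4, max_n=3):
    """Word n-grams (length 2..max_n) that span the token at position i."""
    padded = ["<S>"] * window + list(tokens) + ["</S>"] * window
    j = i + window                              # target position after padding
    feats = []
    for n in range(2, max_n + 1):
        for start in range(j - n + 1, j + 1):   # every n-gram covering position j
            feats.append("%d:%d=%s" % (start - j, start - j + n - 1,
                                       "_".join(padded[start:start + n])))
    return feats

# Features around "of" in "... the search of alternative resources ...";
# the 3-gram "the_search_of" corresponds to the example given in the text.
tokens = "problem as the search of alternative resources to the".split()
for f in ngram_features(tokens, 4):
    print(f)
```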
4.3 Agreement and Form Errors

The Illinois system implements two verb modules – agreement and form – that consist of the following components: (1) candidate identification; (2) determining the relevant module for each candidate based on verb finiteness; (3) correction modules for each error type. The confusion set for verbs depends on the target word and includes its morphological variants (Table 4). For irregular verbs, the past participle form is included, while the past tense form is not (i.e. "given" is included but "gave" is not), since tense errors are not part of the task. To generate morphological variants, the system makes use of a morphological analyzer verbMorph; it assumes (1) a list of valid verb lemmas (compiled using a POS-tagged version of the NYT section of the Gigaword corpus) and (2) a list of irregular English verbs.[7]

[7] The tool and more detail about it can be found at http://cogcomp.cs.illinois.edu/page/publication_view/743

"Hence, the environmental factors also *contributes/contribute to various difficulties, *giving/given problems in nuclear technology."

Error    Confusion set
Agr.     {INF=contribute, S=contributes}
Form     {INF=give, ED=given, ING=giving, S=gives}
Table 4: Confusion sets for agreement and form. For irregular verbs, the second candidate in the confusion set for Verb form is the past participle.

Candidate Identification stage selects the set of words that are presented as input to the classifier. This is a crucial step: errors missed at this stage will not be detected by the later stages. See Sec. 5.3.

Verb Finiteness is used in the Illinois system to separately process verbs that fulfill different grammatical functions and thus are marked for different grammatical properties. See Sec. 5.3.

Correction Modules The agreement module is a binary classifier. The form module is a 4-class system. Both classifiers are trained on the Web1T corpus.

4.4 Noun Errors

Noun number errors involve confusing singular and plural noun forms (e.g. "phone" instead of "phones" in Fig. 1) and are the second most common error type in the NUCLE corpus after determiner mistakes (Table 1). The Illinois noun module is trained on the Web1T corpus using NB. Similar to verbs, candidate identification is an important step in the noun classifier. See Sec. 5.3.

5 System Analysis

In this section, we evaluate the Illinois system along the four dimensions identified in Sec. 3, compare its components to alternative configurations implemented by other teams, and present additional experiments that further analyze each dimension. While a direct comparison with other systems is not always possible due to other differences between the systems, we believe that these results are still useful. Table 5 lists systems used for comparison. It is important to note that the dimensions are not independent. For instance, there is a correlation between algorithm choice and training data.

Dimension                     Systems used in the comparison
Learn. alg. (Sec. 5.1)        NTHU, UMC
Adaptation (Sec. 5.2)         Error inflation: HIT
Ling. knowledge (Sec. 5.3)    Cand. identification: NTHU, HIT; Verb finiteness: NTHU
Train. data (Sec. 5.4)        HIT, NARA
Table 5: System comparisons. Column 1 indicates the dimension, and column 2 lists systems whose approaches provide a relevant point of comparison.

Results are reported on the test data using F1 computed with the CoNLL scorer (Dahlmeier and Ng, 2012). Error-specific results are generated based on the output of individual modules. Note that these are not directly comparable to error-specific results in the CoNLL overview paper: the latter are approximate as the organizers did not have the error type information for corrections in the output. The complete system includes the union of corrections made by each of these modules, where the corrections are applied in order.
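To illustrate how the union of the module outputs might be applied (a simplification of ours, not the actual implementation), the sketch below applies per-module corrections in a fixed order and keeps the first correction proposed for any overlapping candidate; insertions and deletions are not modeled.

```python
def apply_corrections(tokens, corrections_per_module):
    """corrections_per_module: list of {token_index: new_word} dicts,
    one per module, in the order in which corrections are applied."""
    out = list(tokens)
    claimed = set()
    for module_corrections in corrections_per_module:
        for idx, new_word in module_corrections.items():
            if idx in claimed:      # overlapping candidate: keep the earlier correction
                continue
            out[idx] = new_word
            claimed.add(idx)
    return out

tokens = "Nowadays phone has many functionalities".split()
# Noun module first, then the agreement module.
print(" ".join(apply_corrections(tokens, [{1: "phones"}, {2: "have"}])))
# Nowadays phones have many functionalities
```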
Ordering overlapping candidates8 might potentially affect the final output, when mod- ules correctly identify an error but propose differ- ent corrections, but this does not happen in practice. Modules that are part of the Illinois submission are marked with an asterisk in all tables. To demonstrate that our findings are not spe- cific to CoNLL, we also show results on the FCE dataset. It is produced by learners from seventeen first language backgrounds and contains 500,000 words from the Cambridge Learner Corpus (CLC) (Yannakoudakis et al., 2011). We split the corpus into two equal parts – training and test. The statis- tics are shown in Appendix Tables A.16 and A.17. 5.1 Dim. 1: Learning Algorithm Rozovskaya and Roth (2011, Sec. 3) discuss the re- lations between the amount of training data, learn- ing algorithms, and the resulting performance. They show that on training sets of similar sizes, discrimi- native classifiers outperform other machine learning methods on this task. Following these results, the Illinois article module that is trained on the NUCLE corpus uses the discriminative approach AP. Most of the other teams that train on the NUCLE corpus also use a discriminative method. However, when a very large native training set such as the Web1T corpus is available, it is often ad- vantageous to use it. The Web1T corpus is a collec- tion of n-gram counts of length one to five over a cor- pus of 1012 words. Since the corpus does not come with complete sentences, it is not straightforward to make use of a discriminative classifier because of the limited window provided around each example: training a discriminative model would limit the sur- 8Overlapping candidates are included in more than one module: if “work” is tagged as NN, it is included in the noun module, but also in the form module (as a valid verb lemma). rounding context features to a 2-word window. Be- cause we wish to make use of the context features that extend beyond the 2-word window, it is only possible to use count-based methods, such as NB or LM. Several teams make use of the Web1T corpus: UMC uses a count-based LM for article, preposition, and noun number errors; NTHU addresses all errors with a count-based model with backoff, which is es- sentially a variation of a language model with back- off. The Illinois system employs the Web1T corpus for all errors, except articles, using NB. Training Naı̈ve Bayes for Deletions and Inser- tions The reason for not using the Web1T corpus for article errors is that training NB on Web1T for deletions and insertions presents a problem, and the majority of article errors are of this type. Recall that Web1T contains only n-gram counts, which makes it difficult to estimate the prior count for the ∅ candi- date. (With access to complete sentences, the prior of ∅ is estimated by counting the total number of ∅ candidates; e.g., in case of articles, the number of NPs with ∅ article is computed.) We solve this prob- lem by treating the article and the word following it as one target. For instance, to estimate prior counts for the article candidates in front of the word “cam- era” in “including camera”, we obtain counts for “camera”, “a camera”, “the camera”. In the case of the ∅ candidate, the word “camera” acts as the tar- get. Thus, the confusion set for the article classifier is modified as follows: instead of the three articles (as shown in Sec. 4.1), each member of the confu- sion set is a concatenation of the article and the word that follows it, e.g. 
{a camera, the camera, cam- era}. The counts for contextual features are obtained similarly, e.g. a feature that includes a preceding word would correspond to the count of “including x”, where x can take any value from the confusion set. The above solution allows us to train NB for ar- ticle errors and to extend the preposition classifier to handle extraneous preposition errors (Table 6). Rozovskaya and Roth (2011) study several algo- rithms trained on the Web1T corpus and observe that, when evaluated with the same context win- dow size, NB performs better than other count-based methods. In order to show the impact of the algo- rithm choice, in Table 6, we compare LM and NB models. Both models use word n-grams spanning the target word in the 4-word window. We train LMs 423 Error Model F1 CoNLL FCE Art. LM 21.11 24.15 NB 32.45 30.78 Prep. LM 12.09 30.01 NB 14.04 29.40 Noun LM 40.72 32.41 NB* 42.60 34.40 Agr. LM 20.65 33.53 NB* 26.46 36.42 Form LM 13.40 08.46 NB* 14.50 12.16 Table 6: Comparison of learning models. Web1T corpus. Modules that are part of the Illinois submission are marked with an asterisk. Source Candidates ED INF ING S ED 0.99675 0.00192 0.00103 0.00030 INF 0.00177 0.99630 0.00168 0.00025 ING 0.00124 0.00447 0.99407 0.00022 S 0.00054 0.00544 0.00132 0.99269 Table 7: Priors confusion matrix used for adapting NB. Each entry shows Prob(candidate|source), where source corre- sponds to the verb form chosen by the author. with SRILM (Stolcke, 2002) using Jelinek-Mercer linear interpolation as a smoothing method (Chen and Goodman, 1996). On the CoNLL test data, NB outperforms LM on all errors; on the FCE corpus, NB is superior on all errors, except preposition er- rors, where LM outperforms NB only very slightly. We attribute this to the fact that the preposition prob- lem has more labels; when there is a big confusion set, more features have default smooth weights, so there is no advantage to running NB. We found that with fewer classes (6 rather than 12 prepositions), NB outperforms LM. It is also possible that when we have a lot of labels, the theoretical difference be- tween the algorithms disappears. Note that NB can be improved via adaptation (next section) and then it outperforms the LM also for preposition errors. 5.2 Dim. 2: Adaptation to Learner Errors In the previous section, the models were trained on native data. These models have no notion of the er- ror patterns of the learners. Here we discuss model adaptation to learner errors, i.e. developing models that utilize the knowledge about the types of mis- takes learners make. Adaptation is based on the fact that learners make mistakes in a systematic manner, e.g. errors are influenced by the writer’s first lan- guage (Gass and Selinker, 1992; Ionin et al., 2008). There are different ways to adapt a model that de- pend on the type of training data (learner or native) and the algorithm choice. The key application of adaptation is for models trained on native English data, because the learned models do not know any- thing about the errors learners make. With adapta- tion, models trained on native data can use the au- thor’s word (the source word) as a feature and thus propose a correction based on what the author orig- inally wrote. This is crucial, as the source word is an important piece of information (Rozovskaya and Roth, 2010b). Below, several adaptation techniques are summarized and evaluated. 
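Before moving to the adaptation techniques themselves, the sketch below recaps the count-based NB scoring just described, including the concatenated-target treatment of the omitted article. The count table and smoothing constant are toy stand-ins for Web1T statistics, and the scoring is deliberately reduced to a single contextual feature; the real model uses many n-gram features and a different smoothing scheme.

```python
import math

# Toy n-gram counts standing in for Web1T (the real counts come from a
# corpus of 10^12 words).
COUNTS = {
    "camera": 1000, "a camera": 400, "the camera": 300,
    "including camera": 2, "including a camera": 60, "including the camera": 15,
}

def nb_score(candidate, preceding_word, alpha=0.5):
    """log prior(candidate) + log P(preceding-word feature | candidate)."""
    prior = COUNTS.get(candidate, 0) + alpha
    feature = COUNTS.get(preceding_word + " " + candidate, 0) + alpha
    return math.log(prior) + math.log(feature / prior)

# Confusion set for the article slot before "camera": each candidate is the
# article concatenated with the following word, so the empty article
# ("camera" on its own) also receives a well-defined prior count.
confusion_set = ["a camera", "the camera", "camera"]
print(max(confusion_set, key=lambda c: nb_score(c, "including")))   # a camera
```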
The Illinois system makes use of adaptation in the article model via the inflation method and adapts its NB preposition classifier trained on Web1T with the priors method.

Adapting NB The priors method (Rozovskaya and Roth, 2011, Sec. 4) is an adaptation technique for a NB model trained on native English data; it is based on changing the distribution of priors over the correction candidates. The candidate prior is a special parameter in NB; when NB is trained on native data, candidate priors correspond to the relative frequencies of the candidates in the native corpus and do not provide any information on the real distribution of mistakes and the dependence of the correction on the word used by the author.

In the priors method, candidate priors are changed using an error confusion matrix based on learner data that specifies how likely each confusion pair is. Table 7 shows the confusion matrix for verb form errors, computed on the NUCLE data. Adapted priors are dependent on the author's original verb form: let s be a form of the verb appearing in the source text, and c a correction candidate. Then the adapted prior of c given s is:

prior(c|s) = C(s, c) / C(s),

where C(s) denotes the number of times s appeared in the learner data, and C(s, c) denotes the number of times c was the correct form when s was used by a writer. The adapted priors differ by the source: the probability of candidate INF when the source form is S is more than twice that when the source form is ED; the probability that S is the correct form is very high, which reflects the low error rates.

Error    Model          F1 (CoNLL Train)    F1 (CoNLL Test)    F1 (FCE)
Art.     NB             18.28               32.45              30.78
         NB-adapted     19.18               34.49              31.76
Prep.    NB             09.03               14.04              29.40
         NB-adapted*    10.94               12.14              32.22
Noun     NB*            23.06               42.60              34.40
         NB-adapted     22.89               42.31              32.38
Agr.     NB*            16.72               26.46              36.42
         NB-adapted     17.62               23.46              38.57
Form     NB*            11.93               14.50              12.16
         NB-adapted     14.63               18.35              16.67
Table 8: Adapting NB with the priors method. All models are trained on the Web1T corpus. Modules that are part of the Illinois submission are marked with an asterisk.

Table 8 compares NB and NB-adapted models. Because of the dichotomy in the error rates in CoNLL training and test data, we also show experiments using 5-fold cross-validation on the training data. Adaptation always helps on the CoNLL training data and the FCE data (except noun errors), but on the test data it only helps on article and verb form errors. This is due to discrepancies in the error rates, as adaptation exploits the property that learner errors are systematic. Indeed, when priors are estimated on the test data (in 5-fold cross-validation), the performance improves, e.g. the preposition module attains an F1 of 18.05 instead of 12.14.

Concerning the lack of improvement on noun number errors, we hypothesize that these errors differ from the other mistakes in that the appropriate form strongly depends on the surface form of the noun, which would, in turn, suggest that the dependency of the label on the grammatical form of the source that the adaptation is trying to discover is weak. Indeed, the prior distribution of the {singular, plural} label space does not change much when the source feature is taken into account. The unadapted priors for "singular" and "plural" are 0.75 and 0.25, respectively. Similarly, the adapted priors (singular|plural) and (plural|singular) are 0.034 and 0.016, respectively.
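A minimal sketch of the priors method as described above: the adapted prior prior(c|s) = C(s, c)/C(s) is estimated from annotated learner data and keyed on the writer's original word. The annotated pairs below are invented for illustration; the real confusion matrix (Table 7) is computed on NUCLE.

```python
from collections import defaultdict

def adapted_priors(annotated_pairs):
    """annotated_pairs: (source_form, correct_form) pairs from learner data.
    Returns prior(c|s) = C(s, c) / C(s) for every observed (s, c)."""
    count_s = defaultdict(int)
    count_sc = defaultdict(int)
    for s, c in annotated_pairs:
        count_s[s] += 1
        count_sc[(s, c)] += 1
    return {(s, c): n / count_s[s] for (s, c), n in count_sc.items()}

# 97 of 100 occurrences of "contributes" were correct; 3 should have been "contribute".
pairs = [("contributes", "contributes")] * 97 + [("contributes", "contribute")] * 3
priors = adapted_priors(pairs)
print(priors[("contributes", "contribute")])    # 0.03
print(priors[("contributes", "contributes")])   # 0.97
```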
In other words, the unadapted prior probabil- ity for “plural” is three times lower than for “singu- lar”, which does not change much with adaptation. This is different for other errors. For instance, in case of verb agreement, the unadapted prior for “plu- ral” is 0.617, more than three times than the “sin- gular” prior of 0.20. With adaptation, these priors become almost the same (0.016 and 0.012). Adapting AP The AP is a discriminative learning algorithm and does not use priors on the set of can- didates. In order to reflect our estimate of the error distribution, the AP algorithm is adapted differently, by introducing into the native data artificial errors, in a rate that reflects the errors made by the ESL writers (Rozovskaya and Roth, 2010b). The idea is to simulate learner errors in training, through arti- ficial mistakes (also produced using an error confu- sion matrix).9 The original method was proposed for models trained on native data. This technique can be further enhanced using the error inflation method (Rozovskaya et al., 2012, Sec. 6) applied to models trained on native or learner data. The Illinois system uses error inflation in its ar- ticle classifier. Because this classifier is trained on learner data, the source article can be used as a fea- ture. However, since learner errors are sparse, the source feature encourages the model to abstain from flagging a mistake, which results in low recall. The error inflation technique addresses this problem by boosting the proportion of errors in the training data. It does this by generating additional artificial errors using the error distribution from the training set. Table 9 shows the results of adapting the AP clas- sifier using error inflation. (We omit noun results, since the noun AP model performs better without the source feature, which is similar to the noun NB model, as discussed above.) The inflation method improves recall and, consequently, F1. It should be noted that although inflation also decreases preci- sion it is still helpful. In fact, because of the low error rates, performance on the CoNLL dataset with natural errors is very poor, often resulting in F1 be- ing equal to 0 due to no errors being detected. Inflation vs. Sampling To demonstrate the impact of error inflation, we compare it against sampling, an approach used by other teams – e.g. HIT – that improves recall by removing correct examples in training. The HIT article model is similar to the 9The idea of using artificial errors goes back to Izumi et al. (2003) and was also used in Foster and Andersen (2009). The approach discussed here refers to the adaptation method in Ro- zovskaya and Roth (2010b) that generates artificial errors using the distribution of naturally-occurring errors. 425 Error Model F1 CoNLL FCE Art. AP (natural errors) 07.06 27.65 AP (infl. const. 0.9)* 24.61 30.96 Prep. AP (natural errors) 0.0 14.69 AP (infl. const. 0.7) 07.37 34.77 Agr. AP (natural errors) 0.0 08.05 AP (infl. const. 0.8) 17.06 31.03 Form AP (natural errors) 0.0 01.56 AP (infl. const. 0.9) 10.53 09.43 Table 9: Adapting AP using error inflation. Models are trained on learner data with word n-gram features and the source feature. Inflation constant shows how many correct instances remain (e.g. 0.9 indicates that 90% of correct examples are un- changed, while 10% are converted to mistakes.) Modules that are part of the Illinois submission are marked with an asterisk. Infl. 
constant F1 Sampling Inflation 0.90 23.22 24.61 0.85 27.75 29.29 0.80 30.04 33.47 0.70 33.02 35.52 0.60 32.78 35.03 Table 10: Comparison of the inflation and sampling meth- ods on article errors (CoNLL). The proportion of errors in training in each row is identical. Illinois model but scored three points below. Ta- ble 10 shows that sampling falls behind the inflation method, since it considerably reduces the training size to achieve similar error rates. The proportion of errors in training in each row is identical: sampling achieves the error rates by removing correct exam- ples, whereas the inflation method converts some positive examples to artificial mistakes. Inflation constant shows how many correct instances remain; smaller inflation values correspond to more erro- neous instances in training; the sampling approach, correspondingly, removes more positive examples. To summarize, we have demonstrated the impact of error inflation by comparing it to a similar method used by another team; we have also shown that fur- ther improvements can be obtained by adapting NB to learner errors using the priors method, when train- ing and test data exhibit similar error patterns. 5.3 Dim. 3: Linguistic Knowledge The use of linguistic knowledge is important in sev- eral components of the error correction system: fea- ture engineering, candidate identification, and spe- Error Features F1 CoNLL FCE Art. n-gram 24.61 30.96 n-gram+POS+chunk* 33.50 35.66 Agr. n-gram 17.06 31.03 n-gram+POS 24.14 35.29 n-gram+POS+syntax 27.93 41.23 Table 11: Feature evaluation. Models are trained on learner data, use the source word and error inflation. Modules that are part of the Illinois submission are marked with an asterisk. cial techniques for correcting verb errors. Features It is known from many NLP tasks that feature engineering is important, and this is the case here. Note that this is relevant only when training on learner data, as models trained on Web1T can make use of n-gram features only but for the NUCLE cor- pus we have several layers of linguistic annotation.10 We found that for article and agreement errors, using deeper linguistic knowledge is especially beneficial. The article features in the Illinois module, in addi- tion to the surface form of the context, encode POS and shallow parse properties. These features are pre- sented in Rozovskaya et al. (2013, Table 3) and Ap- pendix Table A.19. The Illinois agreement module is trained on Web1T but further analysis reveals that it is better to train on learner data with rich features. The word n-gram and POS agreement features are the same as those in the article module. Syntactic features encode properties of the subject of the verb and are presented in Rozovskaya et al. (2014b, Table 7) and Appendix Table A.18; these are based on the syntactic parser (Klein and Manning, 2003) and the dependency converter (Marneffe et al., 2006). Table 11 shows that adding rich features is help- ful. Notably, adding deeper syntactic knowledge to the agreement module is useful, although parse features are likely to contain more noise.11 Foster (2007) and Lee and Seneff (2008) observe a degrade in performance on syntactic parsers due to grammat- ical noise that also includes agreement errors. For articles, we chose to add syntactic knowledge from shallow parse as it is likely to be sufficient for arti- cles and more accurate than full-parse features. 
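The sketch below illustrates the flavor of the richer feature sets compared in Table 11, mixing surface words, POS tags, and shallow-parse (NP) information for an article slot. The POS tags and NP chunk are assumed to come from the pre-processing tools; the feature names are ours and only loosely follow Appendix Table A.19.

```python
def article_features(tokens, pos, np_chunk, i):
    """Features for the article slot at position i.
    np_chunk: (head_word, head_pos, np_words) of the following noun phrase."""
    head, head_pos, np_words = np_chunk
    before = lambda seq: seq[i - 1] if i > 0 else "<S>"
    after = lambda seq: seq[i + 1] if i + 1 < len(seq) else "</S>"
    feats = {
        "wB": before(tokens), "wA": after(tokens),   # surface words around the slot
        "pB": before(pos), "pA": after(pos),         # POS tags around the slot
        "headWord": head, "headPOS": head_pos,
        "npWords": "_".join(np_words),
        "headNumber": "Pl" if head_pos == "NNS" else "Sing",
    }
    feats["pBpA"] = feats["pB"] + "&" + feats["pA"]  # a conjunction feature
    return feats

tokens = ["including", "a", "Wi-Fi", "receiver"]
pos = ["VBG", "DT", "NN", "NN"]
print(article_features(tokens, pos, ("receiver", "NN", ["Wi-Fi", "receiver"]), 1))
```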
Candidate Identification for errors on open-class 10Feature engineering will also be relevant when training on a native corpus that has linguistic annotation. 11Parse features have also been found useful in preposition error correction (Tetreault et al., 2010). 426 words is rarely discussed but is a crucial step: it is not possible to identify the relevant candidates us- ing a closed list of words, and the procedure needs to rely on pre-processing tools, whose performance on learner data is suboptimal.12 Rozovskaya et al. (2014b, Sec. 5.1) describe and evaluate several can- didate selection methods for verbs. The Illinois sys- tem implements their best method that addresses pre-processing errors, by selecting words tagged as verbs as well as words tagged as NN, whose lemma is on the list of valid verb lemmas (Sec. 4.3). Following descriptions provided by several teams, we evaluate several candidate selection methods for nouns. The first method includes words tagged as NN or NNS that head an NP. NTHU and HIT use this method; NTHU obtained the second best noun score, after the Illinois system; its model is also trained on Web1T. The second method includes all words tagged as NN and NNS and is used in several other systems, e.g. SZEG, (Berend et al., 2013). The above procedures suffer from pre-processing errors. The Illinois method addresses this problem by adding words that end in common noun suffixes, e.g. “ment”, “ments”, and “ist”. The percentage of noun errors selected as candidates by each method and the impact of each method on the performance are shown in Table 12. The Illinois method has the best result on both datasets; on CoNLL, it improves F1 score by 2 points and recovers 43% of the candi- dates that are missed by the first approach. On FCE, the second method is able to recover more erroneous candidates, but it does not perform as well as the last method, possibly, due to the number of noisy candidates it generates. To conclude, pre-processing mistakes should be taken into consideration, when correcting errors, especially on open-class words. Using Verb Finiteness to Correct Verb Errors As shown in Table 4, the surface realizations that cor- respond to the agreement candidates are a subset of the possible surface realizations of the form classi- fier. One natural approach, thus, is to train one clas- sifier to predict the correct surface form of the verb. However, the same surface realization may corre- spond to multiple grammatical properties. This ob- 12Candidate selection is also difficult for closed-class errors in the case of omissions, e.g. articles, but article errors have been studied rather extensively, e.g. (Han et al., 2006), and we have no room to elaborate on it here. Candidate Error recall (%) F1 ident. method CoNLL FCE CoNLL FCE NP heads 87.72 92.32 40.47 34.16 All nouns 89.50 95.29 41.08 33.16 Nouns+heuristics* 92.84 94.86 42.60 34.40 Table 12: Nouns: effect of candidate identification methods on the correction performance. Models are trained using NB. Error recall denotes the percentage of nouns containing number errors that are selected as candidates. Modules that are part of the Illinois submission are marked with an asterisk. Training method F1 CoNLL FCE One classifier 16.43 21.14 Finiteness-based training (I) 18.59 27.72 Finiteness-based training (II) 21.08 29.98 Table 13: Improvement due to separate training for verb errors. Models are trained using the AP algorithm. 
servation motivates the approach that corrects agree- ment and form errors separately (Rozovskaya et al., 2014b). It uses the linguistic notion of verb finite- ness (Radford, 1988) that distinguishes between fi- nite and non-finite verbs, each of which fulfill differ- ent grammatical functions and thus are marked for different grammatical properties. Verb finiteness is used to direct each verb to the appropriate classifier. The candidates for the agree- ment module are verbs that take agreement markers: the finite surface forms of the be-verbs (“is”, “are”, “was”, and “were”), auxiliaries “have” and “has”, and finite verbs tagged as VB and VBZ that have ex- plicit subjects (identified with the parser). The form candidates are non-finite verbs and some of the verbs whose finiteness is ambiguous. Table 13 compares the two approaches: when all verbs are handled together; and when verbs are pro- cessed separately. All of the classifiers use surface form and POS features of the words in the 4-word window around the verb. Several subsets of these features were tried; the single classifier uses the best combination, which is the same word and POS fea- tures shown in Appendix Table A.19. Finiteness- based classifier (I) uses the same features for agree- ment and form as the single classifier. When training separately, we can also explore whether different errors benefit from different fea- tures; finiteness-based classifier (II) optimizes fea- tures for each classifier. The differences in the fea- ture sets are minor and consist of removing several 427 unigram word and POS features of tokens that do not appear immediately next to the verb. Recall from the discussion on features that the agreement module can be further improved by adding syntactic knowl- edge. In the next section, it is shown that an even better approach is to train on learner data for agree- ment mistakes and on native data for form errors. The results in Table 13 are for AP models but sim- ilar improvements due to separate training are ob- served for NB models trained on Web1T. Note that the NTHU system also corrects all verb errors us- ing a model trained on Web1T but handles all these errors together; its verb module scored 8 F1 points below the Illinois one. While there are other differ- ences between the two systems, the results suggest that part of the improvement within the Illinois sys- tem is indeed due to handling the two errors sepa- rately. 5.4 Dim. 4: Training Data NUCLE is a large corpus produced by learners of the same language background as the test data. Because of its large size, training on this corpus is a natural choice. Indeed, many teams follow this approach. On the other hand, an important issue in the CoNLL task is the difference between the training and test sets, which has impact on the selection of the train- ing set – the large Web1T has more coverage and allows for better generalization. We show that for some errors it is especially advantageous to train on a larger corpus of native data. It should be noted that while we refer to the Web1T corpus as “native”, it certainly contains data from language learners; we assume that the noise can be neglected. Table 14 compares models trained on native and learner data in their best configurations based on the training data. Overall, we find that Web1T is clearly preferable for noun errors. We attribute this to the observation that noun number usage strongly depends on the surface form of the noun, and not just the contextual cues and syntactic structure. 
For example, certain nouns in English tend to be used exclusively in singular or plural form. Thus, con- siderably more data compared to other error types is required to learn model parameters. On article and preposition errors, native-trained models perform slightly better on CoNLL, while learner-trained models are better on FCE. We con- Error Train. Learning Features F1 data algorithm CoNLL FCE Art. Native NB-adapt. n-gram 34.49 31.76 Learner AP-infl.* +POS+chunk 33.50 35.66 Prep. Native LM; NB-adapt. n-gram 12.09 32.22 Learner AP-infl. n-gram 10.26 33.93 Noun Native NB* n-gram 42.60 32.38 Learner AP-infl. +POS 19.22 17.28 Agr. Native NB-adapt. n-gram 23.46 38.57 Learner AP-infl. +POS+syntax 27.93 41.23 Form Native NB-adapt. n-gram 18.35 16.67 Learner AP-infl. +POS 12.32 12.02 Table 14: Choice of training data: learner vs. native (Web1T). For prepositions, LM is chosen for CoNLL, and NB- adapted for FCE. Modules that are part of the Illinois submis- sion are marked with an asterisk. jecture that the FCE training set is more similar to the respective test data and thus provides an advan- tage over training on native data. On verb agreement errors, native-trained models perform better than those trained on learner data, when the same n-gram features are used. However, when we add POS and syntactic knowledge, train- ing on learner data is advantageous. Finally, for verb form errors, there is an advantage when training on a lot of native data, although the difference is not as substantial as for noun errors. This suggests that unlike agreement mistakes that are better addressed using syntax, form errors, similarly to nouns, benefit from training on a lot of data with n-gram features. To summarize, choice of the training data is an important consideration for building a robust sys- tem. Researchers compared native- and learner- trained models for prepositions (Han et al., 2010; Cahill et al., 2013), while the analysis in this work addresses five error types – showing that errors be- have differently – and evaluates on two corpora.13 6 Discussion In Table 15, we show the results of the system, where the best modules are selected based on the performance on the training data. We also show the Illinois modules (without post-processing). The fol- lowing changes are made with respect to the Illinois submission: the preposition system is based on an LM and enhanced to handle spurious preposition er- rors (thus the Illinois result of 7.10 shown here is 13For studies that directly combine native and learner data in training, see Gamon (2010) and Dahlmeier and Ng (2011). 428 Error Illinois submission This work Model F1 Model F1 Art. AP-infl. 33.50 AP-infl. 33.50 Prep. NB-adapt. 07.10 LM 12.09 Noun NB 42.60 NB 42.60 Agr. NB 26.14 AP-infl. 27.93 Form NB 14.50 NB-adapt. 18.35 All 31.43 31.75 Table 15: Results on CoNLL of the Illinois system (with- out post-processing) and this work. NB and LM models are trained on Web1T; AP models are trained on NUCLE. Modules different from the Illinois submission are in bold. different from the 12.14 in Table 8); the agreement classifier is trained on the learner data using AP with rich features and error inflation; the form classifier is adapted to learner mistakes, whereas the Illinois submission trains NB without adaptation. The key improvements are observed with respect to least fre- quent errors, so the overall improvement is small. Importantly, the Illinois system already takes into account the four dimensions analyzed in this paper. 
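Pulling the choices in Tables 14 and 15 together, the configuration arrived at in this analysis can be summarized as a simple per-error lookup; the representation below is ours, and the training-data, algorithm, and feature labels follow the tables.

```python
# Per-error configuration of the final system in Table 15 ("this work"):
# error type -> (training data, algorithm, features).
BEST_MODULES = {
    "article":     ("learner (NUCLE)", "AP + error inflation", "n-gram + POS + chunk"),
    "preposition": ("native (Web1T)",  "LM",                   "n-gram"),
    "noun":        ("native (Web1T)",  "NB",                   "n-gram"),
    "agreement":   ("learner (NUCLE)", "AP + error inflation", "n-gram + POS + syntax"),
    "verb form":   ("native (Web1T)",  "NB + adapted priors",  "n-gram"),
}

for error, (data, algorithm, features) in BEST_MODULES.items():
    print(f"{error:12s} | {data:16s} | {algorithm:22s} | {features}")
```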
In CoNLL-2013, systems were compared using F1. Practical systems, however, should be tuned for good precision to guarantee that the overall qual- ity of the text does not go down. Clearly, optimiz- ing for F1 does not ensure that the system improves the quality of the text (see Appendix B). A differ- ent evaluation metric based on the accuracy of the data is proposed in Rozovskaya and Roth (2010b). For further discussion of evaluation metrics, see also Wagner (2012) and Chodorow et al. (2012). It is also worth noting that the obtained results underestimate the performance because the agree- ment on what constitutes a mistake can be quite low (Madnani et al., 2011), so providing alternative cor- rections is important. The revised annotations ad- dress this problem. The Illinois system improves its F1 from 31.20 to 42.14 on revised annotations. However, these numbers are still an underestimation because the analysis typically eliminates precision errors but not recall errors. This is not specific to CoNLL: an error analysis of the false positives in CLC that includes the FCE showed an increase in precision from 33% to 85% and 33% to 75% for preposition and article errors (Gamon, 2010). An error analysis of the training data also al- lows us to determine prominent groups of system errors and identify areas for potential improvement, which we outline below. Cascading NLP errors: In the example below, the Illinois system incorrectly changes “need” to “needs” as it considers “victim” to be the subject of that verb: “Also, not only the kid- nappers and the victim needs to be tracked down, but also jailbreakers.” Errors in interacting linguis- tic structures: The Illinois system considers every word independently and thus cannot handle interact- ing phenomena. In the example below, the article and the noun number classifiers propose corrections that result in an ungrammatical structure “such a sit- uations”: “In such situation, individuals will lose their basic privacy.” This problem is addressed via global models (Rozovskaya and Roth, 2013) and re- sults in an improvement over the Illinois system. Er- rors due to limited context: The Illinois system does not consider context beyond sentence level. In the example below, the system incorrectly proposes to delete “the” but the wider context indicates that the definite article is more appropriate here: “We have to admit that how to prevent the abuse and how to use it reasonably depend on a sound legal system, and it means surveillance has its own restriction.” 7 Conclusion We identified key design principles in developing a state-of-the-art error correction system. We did this through analysis of the top system in the CoNLL- 2013 shared task along several dimensions. The key dimensions that we identified and analyzed con- cern the choice of a learning algorithm, adaptation to learner mistakes, linguistic knowledge, and the choice of the training data. We showed that the de- cisions in each case depend both on the type of a mistake and the specific setting, e.g. how much an- notated learner data is available. Furthermore, we provided points of comparison with other systems along these four dimensions. Acknowledgments We thank Peter Chew and the anonymous reviewers for the feedback. Most of this work was done while the first author was at the University of Illinois. This material is based on re- search sponsored by DARPA under agreement number FA8750- 13-2-0008 and by the Army Research Laboratory (ARL) under agreement W911NF-09-2-0053. 
Any opinions, findings, con- clusions or recommendations are those of the authors and do not necessarily reflect the view of the agencies. 429 References G. Berend, V. Vincze, S. Zarrieß, and R. Farkas. 2013. Lfg-based features for noun number and article gram- matical errors. In Proceedings of CoNLL: Shared Task. T. Brants and A. Franz. 2006. Web 1T 5-gram Version 1. Linguistic Data Consortium. A. Cahill, N. Madnani, J. Tetreault, and D. Napolitano. 2013. Robust systems for preposition error correction using wikipedia revisions. In Proceedings of NAACL. S. Chen and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Pro- ceedings of ACL. M. Chodorow, M. Dickinson, R. Israel, and J. Tetreault. 2012. Problems in evaluating grammatical error de- tection systems. In Proceedings of COLING. D. Dahlmeier and H. T. Ng. 2011. Grammatical error correction with alternating structure optimization. In Proceedings of ACL. D. Dahlmeier and H.T Ng. 2012. A beam-search decoder for grammatical error correction. In Proceedings of EMNLP-CoNLL. D. Dahlmeier, H.T. Ng, and S.M. Wu. 2013. Build- ing a large annotated corpus of learner English: The NUS corpus of learner English. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Build- ing Educational Applications. R. Dale and A. Kilgarriff. 2011. Helping Our Own: The HOO 2011 pilot shared task. In Proceedings of the 13th European Workshop on Natural Language Gen- eration. R. Dale, I. Anisimoff, and G. Narroway. 2012. A re- port on the preposition and determiner error correction shared task. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. Y. Even-Zohar and D. Roth. 2001. A sequential model for multi class classification. In Proceedings of EMNLP. J. Foster and Ø. Andersen. 2009. Generrate: Generating errors for use in grammatical error detection. In Pro- ceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Applications. J. Foster. 2007. Treebanks gone bad: Generating a tree- bank of ungrammatical english. In Proceedings of the IJCAI Workshop on Analytics for Noisy Unstructures Data. Y. Freund and R. E. Schapire. 1996. Experiments with a new boosting algorithm. In Proceedings of the 13th International Conference on Machine Learning. M. Gamon. 2010. Using mostly native data to correct errors in learners’ writing. In Proceedings of NAACL. S. Gass and L. Selinker. 1992. Language transfer in language learning. John Benjamins. N. Han, M. Chodorow, and C. Leacock. 2006. Detecting errors in English article usage by non-native speakers. Journal of Natural Language Engineering, 12(2):115– 129. N. Han, J. Tetreault, S. Lee, and J. Ha. 2010. Us- ing an error-annotated learner corpus to develop and ESL/EFL error correction system. In Proceedings of LREC. T. Ionin, M.L. Zubizarreta, and S. Bautista. 2008. Sources of linguistic knowledge in the second lan- guage acquisition of English articles. Lingua, 118:554–576. E. Izumi, K. Uchimoto, T. Saiga, T. Supnithi, and H. Isa- hara. 2003. Automatic error detection in the Japanese learners’ English spoken data. In Proceedings of ACL. T.-H. Kao, Y.-W. Chang, H.-W. Chiu, T-.H. Yen, J. Bois- son, J.-C. Wu, and J.S. Chang. 2013. CoNLL-2013 shared task: Grammatical error correction NTHU sys- tem description. In Proceedings of CoNLL: Shared Task. D. Klein and C. D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Proceedings of NIPS. C. Leacock, M. Chodorow, M. 
Gamon, and J. Tetreault. 2010. Automated Grammatical Error Detection for Language Learners. Morgan and Claypool Publish- ers. J. Lee and S. Seneff. 2008. Correcting misuse of verb forms. In Proceedings of ACL. N. Madnani, M. Chodorow, J. Tetreault, and A. Ro- zovskaya. 2011. They can help: Using crowdsourcing to improve the evaluation of grammatical error detec- tion systems. In Proceedings of ACL. M. Marneffe, B. MacCartney, and Ch. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of LREC. H. T. Ng, S. M. Wu, Y. Wu, Ch. Hadiwinoto, and J. Tetreault. 2013. The CoNLL-2013 shared task on grammatical error correction. In Proceedings of CoNLL: Shared Task. H. T. Ng, S. M. Wu, T. Briscoe, C. Hadiwinoto, R. H. Su- santo, and C. Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of CoNLL: Shared Task. V. Punyakanok and D. Roth. 2001. The use of classifiers in sequential inference. In Proceedings of NIPS. A. Radford. 1988. Transformational Grammar. Cam- bridge University Press. N. Rizzolo and D. Roth. 2010. Learning Based Java for Rapid Development of NLP Systems. In Proceedings of LREC. 430 A. Rozovskaya and D. Roth. 2010a. Annotating ESL errors: Challenges and rewards. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Build- ing Educational Applications. A. Rozovskaya and D. Roth. 2010b. Training paradigms for correcting errors in grammar and usage. In Pro- ceedings of NAACL. A. Rozovskaya and D. Roth. 2011. Algorithm selec- tion and model adaptation for ESL correction tasks. In Proceedings of ACL. A. Rozovskaya and D. Roth. 2013. Joint learning and in- ference for grammatical error correction. In Proceed- ings of EMNLP. A. Rozovskaya, M. Sammons, J. Gioja, and D. Roth. 2011. University of Illinois system in HOO text cor- rection shared task. In Proceedings of the European Workshop on Natural Language Generation (ENLG). A. Rozovskaya, M. Sammons, and D. Roth. 2012. The UI system in the HOO 2012 shared task on error cor- rection. In Proceedings of the NAACL Workshop on Innovative Use of NLP for Building Educational Ap- plications. A. Rozovskaya, K.-W. Chang, M. Sammons, and D. Roth. 2013. The University of Illinois system in the CoNLL-2013 shared task. In Proceedings of CoNLL Shared Task. A. Rozovskaya, K.-W. Chang, M. Sammons, D. Roth, and N. Habash. 2014a. The University of Illinois and Columbia system in the CoNLL-2014 shared task. In Proceedings of CoNLL Shared Task. A. Rozovskaya, D. Roth, and V. Srikumar. 2014b. Cor- recting grammatical verb errors. In Proceedings of EACL. A. Stolcke. 2002. Srilm-an extensible language model- ing toolkit. In Proceedings of International Confer- ence on Spoken Language Processing. J. Tetreault, J. Foster, and M. Chodorow. 2010. Using parse features for preposition selection and error de- tection. In Proceedings of ACL. J. Wagner. 2012. Detecting Grammatical Errors with Treebank-Induced, Probabilistic Parsers. Ph.D. the- sis. Y. Xiang, B. Yuan, Y. Zhang, X. Wang, W. Zheng, and C. Wei. 2013. A hybrid model for grammatical error correction. In Proceedings of CoNLL: Shared Task. J. Xing, L. Wang, D.F. Wong, L.S. Chao, and X. Zeng. 2013. UM-Checker: A hybrid system for English grammatical error correction. In Proceedings of CoNLL: Shared Task. H. Yannakoudakis, T. Briscoe, and B. Medlock. 2011. A new dataset and method for automatically grading ESOL texts. In Proceedings of ACL. I. Yoshimoto, T. Kose, K. Mitsuzawa, K. Sakaguchi, T. Mizumoto, Y. Hayashibe, M. 
Komachi, and Y. Matsumoto. 2013. NAIST at 2013 CoNLL grammatical error correction shared task. In Proceedings of CoNLL: Shared Task.

Appendix A Features and Additional Information about the Data

Classifier    Art.    Prep.    Noun    Agr.    Form
Train         43K     20K      39K     22K     37K
Test          43K     20K      39K     22K     37K
Table A.16: Number of candidate words by classifier type in training and test data (FCE).

Error        Train          Test
Art.         2336 (5.4%)    2290 (5.3%)
Prep.        1263 (6.4%)    1205 (6.1%)
Noun         858 (2.2%)     805 (2.0%)
Verb agr.    319 (1.5%)     330 (1.4%)
Verb form    104 (0.3%)     127 (0.3%)
Table A.17: Statistics on annotated errors in the FCE corpus. Percentage denotes the error rates, i.e. the number of erroneous instances with respect to the total number of relevant instances in the data.

Features                  Description
(1) subjHead, subjPOS     the surface form and the POS tag of the subject head
(2) subjDet               determiner of the subject NP
(3) subjDistance          distance between the verb and the subject head
(4) subjNumber            Sing – singular pronouns and nouns; Pl – plural pronouns and nouns
(5) subjPerson            3rdSing – "she", "he", "it", singular nouns; Not3rdSing – "we", "you", "they", plural nouns; 1stSing – "I"
(6) conjunctions          (1)&(3); (4)&(5)
Table A.18: Verb agreement features that use syntactic knowledge.

Appendix B Evaluation Metrics

Here, we discuss the CoNLL-2013 shared task evaluation metric and provide a little bit more detail on the performance of the Illinois modules in this context. As shown in Table 1 in Sec. 2, over 90% of words (about 98% in training) are used correctly. The low error rates are the key reason the error correction task is so difficult: it is quite challenging for a system to improve over a writer that already performs at the level of over 90%. Indeed, very few NLP tasks already have systems that perform at that level. The error sparsity makes it very challenging to identify mistakes accurately. In fact, the highest precision of 46.45%, as calculated by the shared task evaluation metric, is achieved by the Illinois system. However, once the precision drops below 50%, the system introduces more mistakes than it identifies.

We can look at individual modules and see whether for any type of mistake the system improves the quality of the text. Fig. 2 shows Precision/Recall curves for the system in Table 15.

Figure 2: Precision/Recall curves by error type (precision vs. recall for the article, preposition, noun, agreement, and form modules).

It is interesting to note that performance varies widely by error type. The easiest are noun and article usage errors: for nouns, we can do pretty well at the recall point of 20% (with the corresponding precision of over 60%); for articles, the precision is around 50% at the recall value of 20%. For agreement errors, we can get a precision of 55% with a very high threshold (identifying only 5% of mistakes). Finally, on two mistakes – preposition and verb form – the system never achieves a precision over 50%.
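Curves like those in Figure 2 are typically obtained by varying a confidence threshold, as suggested by the "very high threshold" mentioned for agreement errors. A toy version of such a decision rule is sketched below; the scores and margin values are invented.

```python
def propose_correction(scores, source_word, margin):
    """scores: {candidate: model score}. Propose the top candidate only if it
    beats the writer's original word by at least `margin`; otherwise abstain."""
    best = max(scores, key=scores.get)
    source_score = scores.get(source_word, float("-inf"))
    if best != source_word and scores[best] - source_score >= margin:
        return best
    return source_word

scores = {"phone": -2.1, "phones": -1.3}   # toy log-scores for one candidate slot
for margin in (0.5, 1.0):                  # higher margin: higher precision, lower recall
    print(margin, "->", propose_correction(scores, "phone", margin))
# 0.5 -> phones
# 1.0 -> phone
```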
Feature type    Feature group    Features
Word n-gram     –                wB, w2B, w3B, wA, w2A, w3A, wBwA, w2BwB, wAw2A, w3Bw2BwB, w2BwBwA, wBwAw2A, wAw2Aw3A, w4Bw3Bw2BwB, w3Bw2BwBwA, w2BwBwAw2A, wBwAw2Aw3A, wAw2Aw3Aw4A
POS             –                pB, p2B, p3B, pA, p2A, p3A, pBpA, p2BpB, pAp2A, pBwB, pAwA, p2Bw2B, p2Aw2A, p2BpBpA, pBpAp2A, pAp2Ap3A
Chunk           NP1              headWord, npWords, NC, adj&headWord, adjTag&headWord, adj&NC, adjTag&NC, npTags&headWord, npTags&NC
                NP2              headWord&headPOS, headNumber
                wordsAfterNP     headWord&wordAfterNP, npWords&wordAfterNP, headWord&2wordsAfterNP, npWords&2wordsAfterNP, headWord&3wordsAfterNP, npWords&3wordsAfterNP
                wordBeforeNP     wB&fi ∀i ∈ NP1
                Verb             verb, verb&fi ∀i ∈ NP1
                Preposition      prep&fi ∀i ∈ NP1
Table A.19: Features used in the article error correction system. wB and wA denote the word immediately before and after the target, respectively; and pB and pA denote the POS tag before and after the target. headWord denotes the head of the NP complement. NC stands for noun compound and is active if the second to last word in the NP is tagged as a noun. Verb features are active if the NP is the direct object of a verb. Preposition features are active if the NP is immediately preceded by a preposition. The Adj feature is active if the first word (or the second word preceded by an adverb) in the NP is an adjective. NpWords and npTags denote all words (POS tags) in the NP.