key: cord-0181375-hq70jn3y
authors: Kabra, Anubha; Bhatia, Mehar; Kumar, Yaman; Li, Junyi Jessy; Shah, Rajiv Ratn
title: Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems
date: 2020-07-14
sha: 8801c29de6c299289f47488a30350f325379774f
doc_id: 181375
cord_uid: hq70jn3y

Automatic scoring engines have been used to score approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the AI-based testing literature on these "intelligent" models is highly lacking. Most papers proposing new models rely only on quadratic weighted kappa (QWK) based agreement with human raters to show model efficacy. However, this effectively ignores the highly multi-feature nature of essay scoring. Essay scoring depends on features like coherence, grammar, relevance, sufficiency, and vocabulary. To date, there has been no study testing Automated Essay Scoring (AES) systems holistically on all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme and associated metrics for AES systems to test their natural language understanding capabilities and overall robustness. We evaluate the current state-of-the-art AES models using the proposed scheme and report the results on five recent models. These models range from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models. On the other hand, irrelevant content, on average, increases the scores, showing that the model evaluation strategy and rubrics should be reconsidered. We also ask 200 human raters to score both an original and an adversarial response to see whether humans can detect differences between the two and whether they agree with the scores assigned by the automatic scorers.

We know that writing is a social practice. Testing of written prose is a long-established practice to teach students how to engage with readers meaningfully. It involves choosing a stance on a continuum, responding, interacting, and sharing meaning with others. Automated Essay Scoring (AES), by proposing to automate the above process, poses as an important socio-technical system in the education paradigm [52]. AES uses computer programs to automatically characterize the performance of examinees on standardized tests involving writing prose. ETS, the largest company working in the language testing domain, says that AES systems depend on a balance between "current societal expectations and the cutting edge of technological advances" [14]. The business motivation for using such systems is quite clear: they help in realizing cost savings at scale. A human teacher is able to save hundreds of man-hours per year on account of savings in testing and evaluation [6]. Additionally, for low-resource countries and rural areas with abysmal teacher-student ratios, such systems become a necessity [11, 47]. In the last decade, owing to advances in artificial intelligence, the usage of such systems has increased severalfold. They are now increasingly used in making high-stakes decisions such as college admissions, visa approvals, and job screening and pre-screening tests.
In the last five years, they have further made their way into the middle and high school classrooms of states like Utah [36] and Ohio [34]. While earlier each score generated by the AI systems was verified by an expert human rater, they are now scoring a majority of essays independently without any intervention by human experts [34]. At the same time, there has been a multitude of papers in premier machine learning conferences reporting novel models and state-of-the-art results on automatic essay scoring datasets [23]. The Pearson-correlation-based agreement scores reported by these studies have risen from 0.23 to 0.8 over time [23]. Most of these papers report Pearson-correlation or kappa-based agreement scores to measure the performance of their models. However, as shown by multiple previous studies, despite achieving human-level agreement scores [24] or even 'surpassing' them [45], the models are easily fooled [35, 37-39]. This reduces the trustworthiness of AI-based automated scoring systems in the eyes of both language-testing researchers [35, 37, 43, 57] and the general public [15, 18, 34, 46]. Due to its wide applicability, several research studies in the linguistics community have tried to characterize the performance of essay scoring models and attribute it to features like number of words [37], style [43], vocabulary [39], coherence [12], etc. However, the results from these studies are often conflicting: while one indicates that essay scoring models have a substantial correlation with the number of words [37], another attributes performance to style [43]. Moreover, there is no standard way of testing automatic essay scoring systems apart from measuring agreement scores on a subset of the dataset (typically chosen to be 10% of the dataset size) [57]. This leads to non-thorough testing and hence model development. It is noteworthy that in the last five years, very few publications have performed any evaluations beyond agreement scores. Most of those that report any other feature do so mostly on coherence evaluation [21, 53, 58]. This is an inadequate evaluation technique, since essay scoring is a highly feature-rich task which depends on a variety of features like vocabulary, factuality, coherence, grammar, relevance, sufficiency, argument quality, persuasion, etc. [59]. Despite the importance and the magnitude of the problem, there have been few efforts from the language testing community to develop a unified testing framework. Ding et al. [12] collaborated with ETS to show that AES models are adversarially perturbable. However, the inputs are limited to just random incoherent response generation. This neither mimics a test-taker's capability to fool an AES system nor tests all the features important for scoring. There have also been some manual studies where experts and non-experts were invited to test out some models [40]. These studies, despite the good motivation and human grounding efforts, cannot be scaled or even made consistent across all the models. To the best of our knowledge, there has been no work which systematically analyzes AES models on all the different aspects important for scoring or proposes an evaluation suite. Such a validity suite is important from the following perspectives: 1) it provides a uniform benchmark to compare different models beyond metrics such as accuracy or QWK.
These metrics neither provide any insights into the construct validity of AES models nor do they expose where a model fails. The most commonly reported metric, QWK, is computed from an observed matrix $O$, where $O_{ij}$ measures the number of students who received a score $i$ from the human grader and $j$ from the model. The weight matrix is defined as $W_{ij} = (i-j)^2/(N-1)^2$ and assigns a penalty to each pair of predicted and actual scores. With $E$ denoting the expected matrix (the outer product of the two score histograms, normalized to the same total as $O$), QWK is given by
$$\mathrm{QWK} = 1 - \frac{\sum_{i,j} W_{ij} O_{ij}}{\sum_{i,j} W_{ij} E_{ij}}.$$
QWK denotes machine-human agreement. It is then compared with the human-human agreement score to compare different models. The other metric commonly used in the literature is Pearson Correlation (PC). Given $n$ as the number of pairs of scores, $\sum xy$ as the sum of the products of paired scores, $\sum x$ and $\sum y$ being the sums of the x and y scores respectively, and $\sum x^2$, $\sum y^2$ referring to the sums of the squares of the x and y scores, it is defined as:
$$PC = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\big[n\sum x^2 - (\sum x)^2\big]\big[n\sum y^2 - (\sum y)^2\big]}}.$$
We argue that for deep learning based systems, tracking merely QWK (or PC) as an evaluation metric is suboptimal for several reasons: 1) while subsequent research papers show an iterative improvement in QWK, most of them fail to evaluate how their work generalizes across all the different dimensions of scoring, including coherence, cohesion, vocabulary, and even surface metrics like average sentence length, word difficulty, etc.; 2) QWK as a metric captures only the overall and broad agreement with human scores, whereas scoring as a science includes knowledge from many domains of NLP such as fact-checking, discourse and coherence, coreference resolution, grammar, content coverage, etc. [59]. QWK, instead of making the scoring comprehensive, abstracts out all the details associated with scoring as a task; 3) it does not indicate the direction of a machine learning model's failure: oversensitivity or overstability. We quantitatively illustrate the gravity of all these aspects by performing statistical and manual evaluations, described in Section 2.3. We demonstrate in the later parts of our paper that heavily modifying responses (by as much as 25%) does not break the scoring systems, and the models maintain their high confidence and scores while evaluating the adversarial responses. Our results show that no published model is robust to these examples. They largely maintain the scores of the unmodified original response even after all the adversarial modifications. This indicates that the models are largely overstable and unable to distinguish ill-formed examples from well-formed ones. While, on average, humans reduce their score by approximately 3-4 points (on a normalized 1-10 scale), the models are highly overstable and either increase the score by 1 point for some tests or reduce it for others by only 0-2 points (§ 3.3). We propose that instead of tracking just QWK for evaluating a model, the field should track a combination of QWK and adversarial evaluation of the models. Cognitive studies have characterized AES models as information-integration models trying to learn category-learning tasks [59]. The descriptor of such a category can be, "Score the essay at level 3 if it consists of a clear aim reasoned by structured claims and supported by appropriate evidence with rebuttals of all the major counter-arguments." [59]. Following this, many research studies have established features which must be present in AES models [7, 24, 49, 59]. A few examples of such features are: factuality, grammatical correctness, organization, coherence, lexical sophistication, etc. In this work, we propose a black-box adversarial evaluation of AES systems based on these features.
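As a concrete reference for the two agreement metrics discussed above, the following is a minimal sketch of QWK and PC over integer score vectors. It is not tied to any particular AES implementation, and the toy scores at the end are purely illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratic weighted kappa between two integer score vectors.

    O[i, j] counts essays scored i by the human and j by the machine,
    W[i, j] = (i - j)^2 / (N - 1)^2, and E is the outer product of the
    two score histograms, scaled to the same total as O.
    """
    n = max_score - min_score + 1
    human = np.asarray(human) - min_score
    machine = np.asarray(machine) - min_score

    O = np.zeros((n, n))
    for h, m in zip(human, machine):
        O[h, m] += 1

    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    W = (i - j) ** 2 / (n - 1) ** 2

    hist_h = O.sum(axis=1)
    hist_m = O.sum(axis=0)
    E = np.outer(hist_h, hist_m) / O.sum()   # chance-level agreement matrix

    return 1.0 - (W * O).sum() / (W * E).sum()

# Toy usage on a 1-6 scale (values are illustrative only).
human_scores   = [4, 3, 5, 2, 4, 6]
machine_scores = [4, 3, 4, 3, 5, 6]
print("QWK:", quadratic_weighted_kappa(human_scores, machine_scores, 1, 6))
print("PC :", pearsonr(human_scores, machine_scores)[0])
```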
We show the evaluation of five recent models on the popular Automated Student Assessment Prize (ASAP) dataset for essay scoring [1]. Our evaluation scheme consists of evaluating AES systems on essays derived from the original responses but modified heavily enough to change their original meaning. These tests are mostly designed to check for the overstability of the different models. An overview of the adversarial scheme is given in Table 3. We perform the following operations for generating test responses: Addition (adding lines to the original text), Deletion (deleting lines from the original text), Modification (modifying parts of the original text) and Generation (generating a completely new text). These cover all the fundamental methods that can be used to change a given sequence into another [27]. Under these four operations, we include many operation subtypes, such as adding related and unrelated content, modifying the grammar of the response, taking only the first part of the response, etc. These operations and sub-operations quantify a model's performance on each feature important for scoring. Therefore, the main contributions of our work are summarized as follows:
• We propose a model-agnostic evaluation suite that alters examples given in a dataset to test a given AES model. This evaluation suite can be used to test various systems including automatic scoring [13, 53], attribute scoring [28], coherence evaluation [21], argument mining [32], topic detection [60], and measuring argument persuasiveness [22]. Essay scoring datasets like the ASAP-AES dataset have been used in all these settings, and hence our evaluation suite can also be used in all of them.
• We evaluate five recent state-of-the-art AES models on all eight prompts of the widely-cited ASAP-AES [1] dataset and report their test performance on various metrics for a thorough understanding of their weaknesses.
• We propose a comprehensive 3-way automatic evaluation for aiding model-makers, involving the parameters of length, position and type of adversarial tests. We also validate the adversarial examples with a human study to show that the scores awarded by AES models are indeed disconnected from rubrics.
• Finally, we open-source the code, test samples and model weights for easy reproducibility and future benchmarking.
We would also like to note that we present our argument not as a criticism of anyone, but as an effort to refocus the research directions of the field. Since the automated systems that we develop as a community have such high stakes, like deciding the jobs and admissions of test-takers, the research should reflect the same rigor. We sincerely hope to inspire higher quality reporting of results in the automated scoring community that tracks not just the performance but also the validity of the models.
In this section, we define the problem statement and the dataset used for experimentation. We provide details about the state-of-the-art AES models we experimented with and the adversarial evaluation metrics. We also elaborate on all the adversarial test cases used for testing these models. Similar to various research studies [13, 51, 53, 63], we use the widely cited ASAP-AES [1] dataset to evaluate automatic essay scoring systems. The relevant statistics for this dataset are listed in Table 1. The questions covered by the dataset are from many different areas, such as the sciences and English literature.
The responses were written by high school students and were subsequently double-scored. The evaluation framework built for assessing AES systems is broadly based on the linguistic features considered essential for scoring, like grammar, coherence, etc. [5, 59]. We evaluate the recent state-of-the-art deep learning models of Liu et al. [25], Taghipour and Ng [51], Tay et al. [53] and Zhao et al. [63], as well as the feature-based model EASE [13], and show the adversarial-evaluation results. Brief descriptions are given as follows:
• EASE [13]: An open-source feature-based model maintained by EdX. This model includes features such as tags, prompt-word overlap, n-gram based features, etc. Originally, it ranked third among the 154 participating teams in the ASAP-AES competition.
• BERT-based two-stage model (Liu et al. [25]): They consider two types of adversarial evaluation: well-written permuted paragraphs and prompt-irrelevant essays. For these, they develop a two-stage learning framework where they calculate semantic, coherence and prompt-relevance scores and concatenate them with engineered features. The paper uses BERT [10] to extract sentence embeddings.
2.3.1 General Framework. From Figure 1, we can see that given a prompt p, a response r, a bounded size criterion C1, a position criterion C2, and optionally a model M, an adversarial testing module converts response r into response r' based on a specific set of rules and the criteria C1 and C2. For benchmarking a model M, we use the scores of r and r' to calculate the statistics listed in Table 2. Since the score ranges and the number of samples vary across the prompts, we report the corresponding values in percentages (percentage of total samples and percentage of the score range). From our human evaluation survey (Section 3.3) and the corresponding Table 6, we see a significant difference between human scores and the scores generated by various AES systems. We ask our human annotators to score the adversarial response r', given the score for the original response r. We also ask the annotators to give supporting reasons for their responses. From our survey, for each adversary, we summarize the following: (1) According to all human annotators, the score of an adversarial response r' was always less than the score of the original response r. In other words, from the humans' point of view, no adversary increased the quality of the response. (2) Second, all human annotators were able to detect and differentiate r from r'. We conducted t-tests between the scores given by AES engines and by human annotators on the adversarially perturbed responses to confirm this notion. 94% of all the t-tests rejected the null hypothesis (p < 0.05), highlighting the statistical significance. Notably, these findings differ from what is "commonly" reported in the adversarial literature, where the adversarial response is formed such that a human is not able to detect any difference between the original and modified responses, but a model (due to its adversarial weakness) detects a difference and thus changes its output [62]. For example, in computer vision, a few pixels are modified to make a model mispredict a bus as an ostrich [50], and in NLP, paraphrasing by changing a few words is done to churn out racial and hateful slurs from a generative deep learning model [55]. Here, our survey observations show that humans can detect the difference between the original and final response.
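For concreteness, the kind of per-test summary statistics and the engine-versus-human t-test described above can be sketched as follows. This is a minimal sketch: the function and field names are illustrative rather than the exact notation of Table 2, and it assumes the original and adversarial scores are available as parallel arrays.

```python
import numpy as np
from scipy.stats import ttest_ind

def overstability_stats(orig_scores, adv_scores, score_min, score_max):
    """Summarize how a model reacts to adversarial perturbation.

    Differences are normalized by the prompt's score range so that
    prompts with different rubrics are comparable (reported in %).
    """
    orig = np.asarray(orig_scores, dtype=float)
    adv = np.asarray(adv_scores, dtype=float)
    diff = (adv - orig) / (score_max - score_min) * 100.0

    up = diff[diff > 0]      # responses scored higher after perturbation
    down = diff[diff < 0]    # responses scored lower after perturbation
    return {
        "pct_scored_higher": 100.0 * len(up) / len(diff),
        "avg_rise_pct": up.mean() if len(up) else 0.0,
        "avg_drop_pct": -down.mean() if len(down) else 0.0,
    }

def engine_vs_human_ttest(engine_adv_scores, human_adv_scores):
    """Two-sample t-test comparing engine and human scores on the
    same set of adversarially perturbed responses."""
    t, p = ttest_ind(engine_adv_scores, human_adv_scores, equal_var=False)
    return t, p

# Illustrative numbers only.
stats = overstability_stats([6, 5, 7, 8], [7, 5, 8, 6], score_min=1, score_max=10)
t, p = engine_vs_human_ttest([7, 5, 8, 6], [3, 4, 5, 2])
print(stats, "reject H0:", p < 0.05)
```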
We call the inability (or under-performance) of models in differentiating between adversarial and natural samples their overstability.

Table 3. Overview of the adversarial test cases (Add category):
• AddWikiRelated: Addition of Wikipedia lines related to the essay question in a response.
• AddWikiUnrelated: Addition of Wikipedia lines unrelated to the essay question in a response.
• RepeatSent: Repetition of some lines of the response within the response.
• AddSong: Addition of song lyrics into the response.
• AddSpeech: Addition of excerpts of speeches of popular leaders into the response.

Next, we discuss the various strategies of adversarial perturbation. An overview of all the perturbations is given in Table 3. We categorize the adversarial tests by the major operation they perform on a sample. Therefore, we divide the tests into four categories: Add (operations which change a sample mainly by adding to it), Delete (operations which change a sample mainly by deleting from it), Modify (operations which change a sample mainly by modifying its structure) and Generate (operations which test the robustness of a model by giving it completely machine-generated, non-meaningful samples).
2.3.2 Add Adversaries. Add adversaries change the original response by adding new content to it. Adding unrelated or repetitive content negatively impacts the content-specific and topic-development features of an essay, which are considered necessary for essay evaluation [59]. To test the content knowledge of scoring models, we designed various types of Add tests, explained hereafter. All the test cases follow the amount and position of addition given by the parameters C1 and C2, respectively, as explained in Section 2.3.1. A few examples are shown in Figure 2.
• AddWikiRelated: With this test case, we add prompt-related information to each sample response. We used a key-phrase extraction technique over each prompt/question in the dataset to choose prompt-related articles from Wikipedia. After selecting articles, we randomly selected sentences from each extracted article and appended them to the responses.
• AddWikiUnrelated: We form this test case to disturb the topic relevance of the responses. This test tries to mimic students' behavior when they lengthen their responses by adding irrelevant information. For this, we add prompt-irrelevant information to each sample response by selecting Wikipedia articles that do not match the response's prompt. The score given by an AES model should be negatively affected by this kind of perturbation. The first example in Figure 2 depicts this test case. (In Figure 2, red, green and blue indicate that the adversarial response was scored higher than, lower than, or equal to the original response, respectively.)
• RepeatSent: Students intentionally tend to repeat sentences or specific keywords in their responses in order to make them longer, yet not out of context, and to fashion cohesive paragraphs [20, 26, 61]. This highlights the test-taker's limited knowledge of the subject and also clutters the writing. To design responses for this test, we divided each response into three equally sized chunks and randomly selected sentences from each of them to form a repetition block, which was added back to the response. An AES model should score such responses negatively. The second example in Figure 2 depicts this test case.
• AddSong: Poetic license gives freedom to ignore or modify normal English rules. However, creative content like songs has a very different language structure from the written prose expected in tests.
Therefore, this can be used for negative testing of a system. Additionally, it has been observed that students use this strategy in their exams in an attempt to fool the system [29]. With this motivation, we form this test by perturbing samples to include songs. We used FiveThirtyEight [16], Neisse [31], PromptCloud [41], RakanNimer [42] and Bansal [3] to extract 58,000 English song lyrics spanning a long time period and a range of genres like rock, jazz, classical, etc. An AES system should score responses with added song lyrics negatively, since the lyrics do not relate to the prompt and are a misfit to the context of the answer.
• AddSpeech: A formal style of writing or speech is conventionally characterized by long and complex sentences, a scholarly vocabulary, correct grammatical rules and a consistently serious tone [33]. In the speeches of leaders, popular terms might be used to refer to certain contextual social phenomena. They may also include references to literary works or allusions to classical and historical figures. Generally, this style of writing is seen as sophisticated and hence better. However, when sentences of this type are added without context or relevance, they serve only to confuse the reader without adding any new meaning. We collected eight public speeches of popular leaders such as Barack Obama, Hillary Clinton, Queen Elizabeth II, etc. These speeches were sourced from public archives and government websites.
• AddRC: It is commonly observed that students tend to repeat parts of a question in their answer to make their answers lengthier and related to the question asked [20, 26, 61]. Therefore, to test the over-reliance of AES models on the keywords present in the question asked or the reading comprehension given, we randomly pick sentences from the corresponding reading comprehension passages and add them to the responses.
• AddTruth: Facts and quotations provide conclusive evidence and a voice of authority for the arguments addressed in an essay [54], which makes it common for test-takers to use them. The motive behind this test case is to measure the relevance of responses [59] and to check for factuality knowledge in current AES systems. This attack focuses on injecting factual yet unrelated text, often done by students to increase the word count of their responses. For this test case, we acquired a list of well-known facts from [56] and injected them into the original text.
• AddLies: Test-takers may use false facts or quotations to embellish their essays and provide strong argumentative evidence for the reasoning written in their response. This underscores the importance of fact-checking while scoring these essays and forms the motive behind this test case: to check whether these systems are able to flag such disinformation. We collected various false statements, manually verified them to be false and did not include those which we felt were subjective in nature. We also note that AddLies, being false statements, should impact the scoring more negatively than AddTruth.
2.3.3 Delete Adversaries. Delete adversaries change the original response by removing content from it. The various types of Delete tests are explained hereafter.
• DelStart: Beginnings generally serve the purpose of introducing the flow of an essay. They state the main point of the overall argument and give context to what will come in the next paragraphs. This helps in outlining a response. Hence, the beginning is crucial for maintaining the discourse of an essay and its central features like organization and development [59].
Although organization may not be severely impacted by deleting the introductory lines, the essay's development will crumble. In this test case, we remove the introductory lines from each response, which renders the development senseless and should hence negatively impact the scores. The first example in Figure 3 depicts this test case.
• DelEnd: Similar to the above test, we deleted the last, conclusive sentences from the essay. The conclusion of any response is also an integral part of an essay. It allows the writer to have the final say on the arguments raised, synthesize their thoughts, demonstrate the importance of their ideas, and propel the reader to a new view of the subject. The conclusion is the point where the final argument is stated based on the evidence provided in the body of the essay. Deleting the conclusion should therefore decrease the score of the overall essay.
• DelRand: To disrupt the organization of an essay, we removed sentences randomly from the response. AES systems should lower the scores for these essays.
2.3.4 Modify Adversaries. Modify adversaries largely retain the originality of a response while changing its syntax heavily. Here, we mainly change the grammar, fluency, organization and lexical sophistication of a sample. The various types of Modify tests are explained hereafter. Some examples of these tests are shown in Figure 4.
• ModGrammar: Several studies underline the importance of grammar in scoring [2, 7]. TOEFL iBT mentions grammar usage in the category 'language use' for the TOEFL test [9]. We formed two test cases to simulate common grammatical errors committed by students. The first one focuses on evaluating the basic grammar knowledge of AES models and the second one assesses the effect of the colloquial and informal language commonly found in essays, as demonstrated in Table 4. For changing the subject-verb-object (SVO) order, we parse the responses using the spaCy library to extract grammatical dependencies. An abbreviation dictionary is used for randomly replacing words with their corresponding informal colloquial forms. The first example in Figure 4 depicts this test case.

Table 4. Examples of the type ModGrammar
Original: Anita is going to the park for a walk.
Step 1: Anita to the park is going for a walk. / Anita is going to an park for the walk.
Step 2 (Subject-Verb Agreement Errors): Anita go to an park for the walk.
Step 3 (Conventional Errors): anita go 2 an park 4 the walk

• ModLexicon: Diversity and sophistication of vocabulary is an essential feature for scoring essays [8, 24]. It is commonly observed that test-takers using sophisticated vocabulary are often scored higher than their counterparts using simpler, more straightforward vocabulary [38]. However, the change or inclusion of even a single word in a sentence changes its meaning. Therefore, in this test case, we evaluate AES systems' vocabulary-dependence by improperly replacing a random word (excluding stopwords) in each sentence with a synonym using WordNet synsets [30]. Later, in Section 3.3, we observe that a human would view such an example as a change in vocabulary but with improper usage of the changed words. An example of this type of perturbation is, "Tom was a happy man. He lived a simple life.", which gets changed to "Tom was a grinning man. He lived a bare life."
• ShuffleSent: Important aspects of essay scoring are coherence and organization, which measure the extent to which a response demonstrates a unified structure and direction of the narrative [4, 8, 17, 44, 53].
To evaluate the dependence of AES scoring on coherence, we randomly shuffle the sentences of a response. This ensures the response's readability and coherence are affected negatively [58]. It affects the transitions between the lines so that the different ideas appear disconnected to a reader, and it changes the meaning substantially. (In Figure 4, red, green and blue indicate that the adversarial response was scored higher than, lower than, or equal to the original response, respectively.)
• BabelGen: We generate entirely false and gibberish adversarial samples using Les Perelman's B.S. Essay Language Generator (BABEL) [38]. BABEL requires a user to enter three keywords, based on which it generates an incoherent, meaningless sample containing a concoction of obscure words and the keywords pasted together. In 2014, Perelman showed that ETS' e-rater, which is used to grade Graduate Record Exam (GRE) essays, consistently scored such generated essays 5-6 on a 1-6 point scale [39, 48]. (GRE is a widely popular exam accepted as the standard admission requirement for a majority of graduate schools. It is also used for pre-job screening by a number of companies. Educational Testing Services (ETS) owns and operates the GRE exam.) This motivated us to try out the same approach on the current state-of-the-art deep learning approaches. We came up with a list of keywords based on the AES questions. For generating a response, we chose three keywords as input to BABEL, which then produced a Generate-type adversarial sample. Figure 5 depicts an example of this test case.
In this section, we demonstrate our results by performing adversarial perturbations on 2600 original responses and provide a detailed analysis based on our general framework for adversarial evaluation (refer Section 2.3.1). We divide this section into two parts. First, we present the effects of the different hyper-parameters, namely the position, length and amount of change. Second, we present the results of the different test categories (as defined in Table 3), namely the Modify, Add, Delete and Generate based adversaries.
In this section, we evaluate the effect of the various parameters defined in Section 2. Amount of perturbation (C1). We vary C1 to observe how the models score such responses. Figure 6 shows the average difference of scores (averaged over all the different tests). For all models, going from 5% to 25% perturbation leads to an increase of 34% and 47%, respectively, in the average rise and the average drop in scores. We observe that the scoring trend changes considerably while going from 15% to 20% perturbation; otherwise it remains consistent. Increasing C1 beyond 25% does not add more value to the results. It is clear that, irrespective of the increase in C1, all models except EASE and SKIPFLOW show hardly any increase (about 5%) in the percentage of responses scored higher than the original. Hence these models score a similar number of responses higher than the original, but with greater intensity as the amount of perturbation increases. We infer that these models are overstable with respect to the number of adversarial responses they score higher or lower, due to the consistent value of this percentage with an increase in C1. It is unexpected to see that EASE scores an average of 82% of the adversarial responses higher than the original. This value is the lowest for LSTM-MoT, averaging only 7%.
Position of perturbation (C2). We analyze the addition of content at specific positions, namely Start, Mid and End, under the following conditions (a brief code sketch of this parameterization follows the list):
• bounded: retaining the length of the response after the addition of content.
• unbounded: no restrictions on the length of the response despite the addition of content.
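The following is a minimal sketch of how such a position- and amount-controlled addition could be implemented. The sentence splitting is naive, the bounded handling is a simplification, and the function and variable names are illustrative assumptions rather than the toolkit's actual API.

```python
import random

def perturb_add(response, filler_sentences, c1=0.25, c2="start", bounded=False, seed=0):
    """Add roughly c1 * (number of sentences) filler sentences at position c2.

    c2 is one of "start", "mid", "end". In the bounded case the perturbed
    response is trimmed back to the original sentence count, so only the
    content (not the length) changes.
    """
    rng = random.Random(seed)
    sents = [s.strip() + "." for s in response.split(".") if s.strip()]  # naive split
    n_add = max(1, int(round(c1 * len(sents))))
    added = rng.sample(filler_sentences, min(n_add, len(filler_sentences)))

    insert_at = {"start": 0, "mid": len(sents) // 2, "end": len(sents)}[c2]
    perturbed = sents[:insert_at] + added + sents[insert_at:]

    if bounded:
        # Keep the original sentence count by trimming from the far side
        # of the insertion point (a simplification of the bounded condition).
        perturbed = perturbed[:len(sents)] if c2 != "end" else perturbed[-len(sents):]
    return " ".join(perturbed)

# Illustrative usage with made-up filler lines.
essay = ("Computers help students learn. They give instant feedback. "
         "Teachers can focus on harder problems.")
filler = ["The Great Wall of China is thousands of kilometres long.",
          "Jazz emerged in New Orleans in the early twentieth century."]
print(perturb_add(essay, filler, c1=0.5, c2="mid", bounded=True))
```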
Across all models, as demonstrated in Figure 7, we see an equivalent variation of the score-increase and score-decrease metrics for both the bounded and unbounded cases, irrespective of the position and length of the perturbation. However, we see that the percentage of responses scored higher increases for the unbounded cases when compared to the bounded cases, for all test cases across all models. For the START position, we see an increase of 12% on average in this percentage from the bounded to the unbounded situation. This means that models are sensitive to an increase in the number of words, scoring a larger number of responses higher but with similar intensities. We observe a similar trend for the END position but with a lower increase of 7% on average. We can say that scores are proportional to the length of the response for the START and END positions. However, addition at the MID position does not influence the scores differently based on the length of the essay. This resonates with [37], which states that word count is the most important predictor of an essay's score. From Figure 7, we also notice that the intensity of change in scores is small (less than 5%) for the models EASE and SKIPFLOW. Both are overstable with respect to the position criterion C2, as they are not able to distinguish ill-formed responses from well-formed ones. Even when all prompts are taken into consideration, the deviations for these models are 8% and 9% respectively, reinforcing the previous insight. However, the model EASE has the highest number of positively affected adversaries (80.3% of responses). This is unexpected, as perturbation should decrease the scores of the responses. Amongst the other models, we observe that Memory-Nets is the worst performing model, as it scores 49.2% (approximately one half) of the responses higher, with an average rise of 23.5%, and the other half lower with the same intensity. This shows the model does not know in which direction to move when scores change. On the contrary, LSTM-MoT is the best performing model considering all adversarial evaluation metrics.
3.2 Results of the different types of Adversaries. 3.2.1 Add Adversaries. In this section, we explain the results over all the tests for Add adversaries (refer Section 2.3.2). We show all adversarial evaluation metrics (refer Table 2 and Figure 8) for all the models (refer Section 2.2). We observe that two out of five models, EASE and SKIPFLOW, show the smallest average rise (12% and 8.7%, respectively) and drop (6.2% and 6.9%, respectively) in scores after adversarial modifications. We infer that these models are overstable, as the intensity of change in scores is small for both models and they score the adversarially changed responses similarly to the unmodified original responses. The model Memory-Nets unexpectedly scores about 50% of the adversarial responses higher than the original, with a high average rise of about 30%. In contrast, LSTM-MoT scores the adversarial responses in a highly negative fashion. The percentage of responses scored higher is only 8.6%, implying that 91.4% of the modified responses are scored lower than their respective original response. Moreover, the average drop is 27% while the average rise is only 6.1% over all the prompts. This shows that this model can observe perturbations in the responses and score them relatively lower. Hence, amongst all the models, these two show relatively better performance. It is interesting to observe that, out of all the test cases, AddLies has around 50% of responses scored higher than the original for the models EASE and Memory-Nets. These models are not able to penalize the deliberately included false facts in the response.
However, we observe that AddTruth (as shown in Figure 8) is scored comparatively higher by all models; on a relative note, false statements impact scores negatively, even if only marginally. We believe this is because most models use contextual word embeddings as inputs. Mostly, we notice a tendency for lengthier responses to be scored higher, despite being factually wrong and containing unrelated content. Ideally, we expected that adding irrelevant lines from songs, speeches, and Wikipedia articles would make the models score the responses lower than the addition of relevant content. However, these test cases were scored no differently than the rest, suggesting that additions of relevant and irrelevant lines were both scored in a similar manner. This means that the models do not check for the relevance and sufficiency content features of an essay, which should play an important role in scoring [19].
3.2.2 Delete Adversaries. This section describes the results over all tests for Delete adversaries (refer Section 2.3.3). We demonstrate all adversarial evaluation metrics (refer Table 2 and Figure 9) over all the models (refer Section 2.2). We find that the models EASE and SKIPFLOW show the smallest average rise and drop values after adversarial modifications. This means that both models hardly fluctuate from the original scores of the unmodified responses, again indicating overstability. For the model Memory-Nets, we see high values of both the average rise and drop, together with a high percentage of responses scored higher, 54% on average. In other words, the model scores half of the responses higher, with an average soar of 30%, and the other half lower, with a dip of 22%. This means that the model is responsive to Delete adversaries, but in no particular direction. The model LSTM-MoT scores adversarial responses mostly in a negative fashion, with the highest average drop of 26.4% compared to a low average rise of 5.9%. Additionally, we calculate the percentage of responses scored higher to be 8.5%, which means that 91.5% of samples have been scored lower. We draw the inference that this model can detect the presence of adversarial perturbations in the responses. Moreover, we mark a similar trend for the model BERT, although with higher intensities of average deviation in the scores. We summarize that LSTM-MoT is the best performing model. Looking into test-case-based results, we see that DelRand has a higher percentage of responses scored higher and a higher average rise (increases of 2% and 3%) compared to the DelStart and DelEnd tests. This implies that adversarial responses in which lines were randomly deleted were scored higher more often than those in which the introduction or conclusion was removed. This is surprising, as deletion of random lines from a response leads to a loss of organization and response structure. Deletion at the end results in more than 50% of the responses being scored higher than the original, on average, for three out of five models. The responses in this test case were missing any concluding remarks; hence, the capability of the models to check for a proper conclusion is poor.
3.2.3 Modify Adversaries. This section explains the results over all tests for Modify based adversaries (refer Section 2.3.4). We depict the adversarial evaluation metrics (mentioned in Table 2 and Figure 10) for all models (refer Section 2.2). We observe that the models Memory-Nets, EASE and SKIPFLOW score more than 50% of the total number of responses higher than the original responses.
Modify test cases such as ModGrammar and ShuffleSent significantly affect the discourse of the response in a negative manner and also make it unorganized and unstructured. Hence, ideally, these responses should not have been scored positively. This shows that these models are not able to capture the discourse- and organization-based relevance of the responses. On the other hand, we observe that the model LSTM-MoT scores the adversarial responses in a highly negative fashion. These responses are generally scored lower (89% of modified responses are scored lower than their respective original response) and with high intensity, as shown by an average drop of 23%. Over all eight prompts, we see much greater values of the average drop compared to an average rise of only 6%. This shows that this model has the ability to act robustly in the presence of adversaries. Among all Modify adversaries (refer Section 2.3.4), we observe that the ModGrammar test received consistently low scores across all the models. This can be verified as the percentage of responses scored higher is significantly lower for all the models except EASE; overall, responses scored higher than the original constitute only 36% of all the adversarial responses. This shows that most models can identify grammatically incorrect sentences and score them lower. The intensity with which grammatically incorrect adversarial responses are scored negatively is also higher than that of ModLexicon and ModShuffle. However, for the model EASE the trend is the opposite with respect to this metric: an average of 83% of the incorrect-grammar adversarial responses are scored higher in this case. This shows that EASE has problems recognizing grammatical errors in the responses; moreover, it scores these adversarial responses higher than the original. Again, LSTM-MoT has correctly scored most of the ModGrammar and ModShuffle responses lower than ModLexicon (Figure 10), which is how we expect all models to treat these test cases. Another category of test case is BabelGen, where we generate incoherent and meaningless responses. Ideally, these should have been scored a zero, but as demonstrated in Table 5, we notice that almost all the models score these generated essays at at least 60% of the prompt scoring range. This strongly suggests that the models were looking for obscure keywords with complex sentence formation. We can also infer that the relevance of the responses with respect to the question is missing: since the responses are generated using keywords, they contain sentences related to those keywords but fail to answer the question asked.
3.3 Human Evaluation Survey. We conducted a social survey with 200 participants to understand and compare how humans score our tests compared to the automatic essay scoring systems. Figure 11 shows a few screenshots from our survey website. To create our survey forms, we chose test cases based on the following conditions: 1) where the original score was less than the adversarial score, 2) where the percentage of responses scored higher than the original was greater than 10%, and 3) where a t-test rejects the hypothesis that the adversarial and original scores come from the same distribution. The motivation behind setting these conditions was that we wanted to choose those test cases where the model should have been most confident in scoring the adversarial response as negative and unfavorable. Once annotated by humans, we compare the differences between the human and system scores. We observe that the AES systems lack the ability to adequately penalize the scores, either marking the perturbations as better than the original or not detecting any significant difference. Both are wrong presumptions by the model. Table 6 depicts the results of our human annotations.
We divide the annotators into two groups. For the first group, we show the original response and its corresponding score, and then ask the annotators to score the adversarial response accordingly. For the second group, we ask them to score both the original and adversarial responses. If any annotator felt that the two responses' scores should not be the same, we asked them to list supporting reasons. For uniformity in responses, we derive a set of scoring rubrics, mentioned in our dataset, and ask them to choose the most suitable keywords. As observed from Table 6, the percentage of people who scored adversarial responses lower than the original responses is significantly higher for all selected test cases. The main reasons given for scoring adversarial responses lower are relevance, organization, readability, etc. It can be observed that the lowering in score was on average about 30%. (Table 6 reports, per test case, the percentage drop in score, the percentage of people scoring lower, the percentage scoring higher, and the common reasons for lowering the score.)
Finally, we performed an experiment of training on the adversarial samples generated by our framework to see if the models can pick up some inherent "pattern" of the adversarial samples (a minimal sketch of this setup is given after the concluding paragraph). Since there is a multitude of adversarial test-case categories, we narrowed down to a subcategory of five test cases from those shown for the human annotations. They were selected such that, on average, these test cases had the maximum deviation between human-annotated scores and machine scores. The training data consisted of an equal number of original samples and adversarial samples. The target scores of the adversarial samples were set as the original score minus the mean difference between the original and human-annotated scores. For example, according to the human annotation study, for the ModGrammar case the mean difference was 2 points below the original score, so all such samples were scored as the original score minus 2 points in the simulated training data. The simulated training data was then appended to the original data and shuffled. Testing was conducted with the respective adversarial test case as well as the others. The results are shown in Figure 13. It is evident that adversarial training improves the scores marginally for all four metrics, as shown by the solid lines lying above the dotted lines; however, the improvement remains only marginal. The average deviation metrics increase with adversarial training and are highest for the respective test case. The percentage of responses scored higher, in contrast, is reduced by adversarial training for the respective test case, compared to non-adversarial testing.
Through our experiments, we conclude that current AES systems, built mainly with feature extraction techniques and deep neural network based algorithms, fail to recognize the presence of common-sense adversaries in student essays and responses. As these common adversaries are popular among students for 'bluffing' during examinations, it is vital for automated scoring system developers to think beyond the accuracies of their systems and pay attention to overall robustness, so that these systems are not vulnerable to any form of adversarial attack. The future scope of this work includes designing more efficient AES systems using the metrics proposed, combining the metrics to provide a more holistic criterion for analysis, and improving the evaluation suite with emphasis on the type of exam and the level of education of the students.
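As referenced above, the following is a minimal sketch of how the simulated training targets for the adversarial-training experiment could be constructed. The data layout and function name are hypothetical assumptions; only the target-adjustment rule (original score minus the mean human-observed drop) comes from the description above.

```python
import random

def build_adversarial_training_set(originals, adversarials, mean_human_drop, score_min=0):
    """Build simulated training data for one adversarial test case.

    originals / adversarials: lists of (text, original_score) pairs;
    mean_human_drop: mean difference between original and human-annotated
    adversarial scores for this test case (e.g. 2 points for ModGrammar
    in the human study).
    """
    data = []
    for text, score in originals:
        data.append((text, score))
    for text, score in adversarials:
        # Target = original score minus the human-observed mean drop,
        # floored at the minimum of the scoring scale.
        data.append((text, max(score_min, score - mean_human_drop)))
    random.shuffle(data)
    return data

# Hypothetical usage for a ModGrammar-style test case.
train = build_adversarial_training_set(
    originals=[("Original essay text ...", 8)],
    adversarials=[("Grammar-perturbed essay text ...", 8)],
    mean_human_drop=2,
)
```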
References
The Hewlett Foundation: Automated Essay Scoring. Develop an automated scoring algorithm for student-written essays.
Automated essay scoring with e-rater v. 2.0.
Modeling local coherence: An entity-based approach.
Automated scoring with validity in mind.
How artificial intelligence will impact K-12 teachers.
Automated essay evaluation: The Criterion online writing service. AI Magazine.
End-to-end neural network based automated speech scoring.
Validation of automated scores of TOEFL iBT tasks against non-test indicators of writing ability.
BERT: Pre-training of deep bidirectional transformers for language understanding.
The Open University Of China Awarded UNESCO Prize For Its Use Of AI To Empower Rural Learners.
Aoife Cahill, and Torsten Zesch. 2020. Don't take "nswvtnvakgxpm" for an answer: The surprising vulnerability of automatic content scoring systems to adversarial input.
EASE (Enhanced AI Scoring Engine) is a library that allows for machine learning based classification of textual content. This is useful for tasks such as scoring student essays.
Automated Scoring. What it is and why it's a big deal.
Flawed Algorithms Are Grading Millions of Students' Essays.
FiveThirtyEight Hip Hop Candidate Lyrics Dataset.
Implementation and applications of the Intelligent Essay Assessor. Handbook of automated essay evaluation.
Automated essay scoring remains an empty dream.
Algorithms on Strings, Trees and Sequences.
Managing what we can measure: Quantifying the susceptibility of automated scoring systems to gaming behavior.
Centering-based Neural Coherence Modeling with Hierarchical Discourse Segments.
Learning to Give Feedback: Modeling Attributes Affecting Argument Persuasiveness in Student Essays.
Automated Essay Scoring: A Survey of the State of the Art.
Get IT Scored Using AutoSAS: An Automated System for Scoring Short Answers.
Automated essay scoring based on two-stage learning.
Detection of gaming in automated scoring of essays with the IEA.
Managing the data base environment.
ASAP++: Enriching the ASAP Automated Essay Grading Dataset with Essay Attribute Scores.
What?! Students Write Song Lyrics And Abuses In Exam Answer Sheet.
WordNet: a lexical database for English.
Song lyrics from 6 musical genres.
Comparing Automatic and Human Evaluation of Local Explanations for Text Classification.
Minimum essentials of English. Barron's Educational Series.
Computers are now grading essays on Ohio's state tests.
The Engine Driving Automated Essay Scoring.
When "the state of the art
Basic Automatic B.S. Essay Language Generator (BABEL).
Basic Automatic B.S. Essay Language Generator (BABEL) by Les Perelman.
Stumping E-Rater: Challenging the validity of automated essay scoring.
Taylor Swift Song Lyrics from all the albums.
Billboard 1964-2015 Songs + Lyrics.
Why can't it mark this one?: A qualitative analysis of student writing rejected by an automated essay scoring system.
The IntelliMetric automated essay scoring engine: a review and an application to Chinese essay scoring.
Contrasting state-of-the-art automated scoring of essays: Analysis. In Annual National Council on Measurement in Education meeting.
More states opting to 'robo-grade' student essays by computer.
Average number of students per teacher in India from
Is MIT researcher being censored by Educational Testing Service.
C-rater: Automatic content scoring for short constructed responses.
Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks.
A neural approach to automated essay scoring.
Automated writing evaluation in an EFL setting: Lessons from China.
SkipFlow: Incorporating neural coherence features for end-to-end automatic text scoring.
How to Embed Quotes in your Essay Like a Boss.
Universal adversarial triggers for attacking and analyzing NLP.
2020. 1000 Random & Interesting Facts About Literally Everything.
Trustworthy Automated Essay Scoring without Explicit Construct Validity.
A cross-domain transferable neural coherence model.
Handbook of automated scoring: Theory into practice.
A topic detection method based on KeyGraph and community partition.
Atypical Inputs in educational applications.
Adversarial attacks on deep-learning models in natural language processing: A survey.
A memory-augmented neural model for automated grading.