key: cord-0046856-olv7h0mg
authors: Uto, Masaki; Okano, Masashi
title: Robust Neural Automated Essay Scoring Using Item Response Theory
date: 2020-06-09
journal: Artificial Intelligence in Education
DOI: 10.1007/978-3-030-52237-7_44
sha: e2bf5097dbf9cf771dc62734718ed72f771787c3
doc_id: 46856
cord_uid: olv7h0mg

Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to human grading. Conventional AES methods typically rely on manually tuned features, which are laborious to develop effectively. To obviate the need for feature engineering, many deep neural network (DNN)-based AES models have been proposed and have achieved state-of-the-art accuracy. DNN-AES models require training on a large dataset of graded essays. However, the assigned grades in such datasets are known to be strongly biased by rater characteristics when grading is conducted by assigning only a few raters from a rater set to each essay. The performance of DNN models drops rapidly when such biased data are used for model training. In the fields of educational and psychological measurement, item response theory (IRT) models that can estimate essay scores while considering the effects of rater characteristics have recently been proposed. This study therefore proposes a new DNN-AES framework that integrates IRT models to deal with rater bias within training data. To our knowledge, this is the first attempt to address rater bias effects in training data, a crucial but overlooked problem.

In various assessment fields, essay-writing tests have attracted much attention as a way to measure practical and higher-order abilities such as logical thinking, critical reasoning, and creative thinking [1, 4, 13, 18, 33, 35]. In essay-writing tests, examinees write essays about a given topic, and human raters grade those essays based on a scoring rubric. However, grading can be an expensive and time-consuming process when there are many examinees [13, 16]. In addition, human grading is not always sufficiently accurate even when a rubric is used, because assigned scores depend strongly on rater characteristics such as strictness and inconsistency [9, 11, 15, 26, 31, 43].

Automated essay scoring (AES), which utilizes natural language processing (NLP) and machine learning techniques to grade essays automatically, is one approach toward resolving this problem. Many AES methods have been developed over the past decades, and they can generally be classified into feature-engineering and automatic-feature-extraction approaches [13, 16]. The feature-engineering approach predicts scores using manually tuned features such as essay length and the number of spelling errors (e.g., [3, 5, 22, 28]). Advantages of this approach include interpretability and explainability. However, it generally requires extensive feature redesign to achieve high prediction accuracy.

To obviate the need for feature engineering, automatic feature extraction based on deep neural networks (DNNs) has recently attracted attention. Many DNN-AES models have been proposed in the last few years (e.g., [2, 6, 10, 14, 23, 24, 27, 37, 47]) and have achieved state-of-the-art accuracy. This approach requires a large dataset of essays graded by human raters as training data. Essay-grading tasks are generally shared among many raters, with only a few raters assigned to each essay to lower the assessment burden. However, the assigned scores are known to be strongly biased due to the effects of rater characteristics [8, 15, 26, 31, 34, 39, 40].
The performance of DNN models drops rapidly when biased data are used for model training, because the resulting model reflects the bias effects [3, 12, 17]. This problem has generally been overlooked or ignored, but it is a significant issue affecting all AES methods based on supervised machine learning, including DNNs, because cost concerns make it generally difficult to remove rater bias in practical testing situations.

In the fields of educational and psychological measurement, statistical models for estimating essay scores while considering rater characteristic effects have recently been proposed. Specifically, they are formulated as item response theory (IRT) models that incorporate parameters representing rater characteristics [9, 29, 30, 38, 42-45]. Such models have been applied to various performance tests, including essay writing. Previous studies have reported that they can provide reliable scores by removing the adverse effects of rater bias (e.g., [38, 39, 41, 42, 44]).

This study therefore proposes a new DNN-AES framework that integrates IRT models to deal with rater bias in training data. Specifically, we propose a two-stage architecture that stacks an IRT model over a conventional DNN-AES model. In our framework, the IRT model is first applied to the raw rating data to estimate reliable scores from which the effects of rater bias have been removed. Then, the DNN-AES model is trained using the IRT-based scores. Since the IRT-based scores are theoretically free from rater bias, the DNN-AES model will not reflect bias effects. Our framework is simple and easily applied to various conventional AES models. Moreover, it is highly suited to educational contexts and to low- and medium-stakes tests, because preparing high-quality training data in such situations is generally difficult. To our knowledge, this study is the first attempt at mitigating rater bias effects in DNN-AES models.

We assume the training dataset consists of essays written by $J$ examinees and essay scores assigned by $R$ raters. Let $e_j$ be the essay written by examinee $j \in \mathcal{J} = \{1, \cdots, J\}$, and let $U_{jr}$ denote the categorical score $k \in \mathcal{K} = \{1, \cdots, K\}$ assigned by rater $r \in \mathcal{R} = \{1, \cdots, R\}$ to $e_j$. The score data can then be defined as $U = \{U_{jr} \in \mathcal{K} \cup \{-1\} \mid j \in \mathcal{J}, r \in \mathcal{R}\}$, where $U_{jr} = -1$ denotes missing data. Missing data occur because, in practice, only a few raters in $\mathcal{R}$ grade each essay $e_j$ to reduce the assessment workload. Furthermore, letting $\mathcal{V} = \{1, \cdots, V\}$ be the vocabulary list for the essay collection $E = \{e_j \mid j \in \mathcal{J}\}$, essay $e_j \in E$ is definable as a word sequence $e_j = \{w_{jt} \in \mathcal{V} \mid t = 1, \cdots, N_j\}$, where $w_{jt}$ is a one-hot representation of the $t$-th word in $e_j$ and $N_j$ is the number of words in $e_j$. This study aims to train DNN-AES models using these training data.
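To make the notation concrete, the following is a minimal sketch, assuming NumPy, of how such a training dataset could be laid out in code; the variable names and toy values are illustrative and are not taken from the paper's data.

```python
import numpy as np

J, R, K = 4, 3, 5  # examinees, raters, score categories (toy sizes)

# U[j, r] = score in {1, ..., K} that rater r assigned to essay j,
# or -1 if rater r did not grade essay j (missing by design to
# reduce the assessment workload).
U = np.array([
    [3, -1,  4],
    [-1, 2,  2],
    [5,  4, -1],
    [1, -1,  1],
])

# Each essay e_j is stored as a sequence of vocabulary indices w_{jt};
# a one-hot representation can be recovered by indexing an identity matrix.
essays = {
    0: [12, 7, 256, 3],
    1: [5, 98, 42],
    2: [7, 7, 310, 64, 2],
    3: [88, 19],
}
```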
This section briefly introduces the DNN-AES models used in this study. Although many models have been proposed in the last few years, we apply the most popular model, which combines a convolutional neural network (CNN) with long short-term memory (LSTM) [2], and an advanced model based on bidirectional encoder representations from transformers (BERT) [7].

A CNN-LSTM-based model [2] proposed in 2016 was the first DNN-AES model. Figure 1(a) shows the model architecture. This model calculates a score for a given essay, defined as a sequence of one-hot word vectors, through the following multi-layered neural networks (a minimal code sketch follows the layer descriptions).

Lookup table (word-embedding) layer: This layer transforms each word in a given essay into a $D$-dimensional word-embedding representation, in which words with the same meaning have similar representations. Specifically, letting $\mathbf{A}$ be a $D \times V$-dimensional embedding matrix, the embedding representation corresponding to $w_{jt} \in e_j$ is calculated as the dot product $\mathbf{A} \cdot w_{jt}$.

Convolution layer: This layer uses a CNN to extract $n$-gram-level features from the sequence of word-embedding vectors. These features capture local textual dependencies among $n$-gram words. Zero padding is applied so that the output of this layer preserves the word-sequence length. This layer is optional and is often omitted in current studies.

Recurrent layer: This layer is an LSTM network that outputs a vector at each timestep to capture long-distance dependencies among the words. A single-layer unidirectional LSTM is generally used, but bidirectional or multi-layered LSTMs are also common.

Pooling layer: This layer aggregates the recurrent-layer outputs $h_{j1}, \cdots, h_{jN_j}$ into a fixed-length vector. Mean-over-time (MoT) pooling, which calculates the average vector $M_j = \frac{1}{N_j}\sum_{t=1}^{N_j} h_{jt}$, is generally used because it tends to provide stable accuracy. Other frequently used pooling methods include last pooling, which uses the last output of the recurrent layer $h_{jN_j}$, and pooling with an attention mechanism.

Linear layer with sigmoid activation: This layer projects the pooling-layer output to a scalar value in the range [0, 1] using the sigmoid function, $\hat{y}_j = \sigma(\mathbf{W} M_j + b)$, where $\mathbf{W}$ is a weight matrix and $b$ is a bias. Model training is conducted by normalizing the gold-standard scores to [0, 1]; the predicted scores are rescaled to the original score range in the prediction phase.
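As an illustration of the architecture just described, below is a minimal Keras-style sketch of a CNN-LSTM AES model with MoT pooling. It is a sketch, not the paper's implementation: the hyperparameter values (vocabulary size, sequence length, embedding and LSTM dimensions, filter settings) are assumptions for illustration only.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 4000   # V: vocabulary size (assumed)
MAX_LEN = 500       # maximum essay length in words (assumed)
EMB_DIM = 50        # D: word-embedding dimension (assumed)

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
# Lookup table layer: map word indices to D-dimensional embeddings.
x = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
# Convolution layer (optional): n-gram features; "same" padding keeps the length.
x = layers.Conv1D(filters=100, kernel_size=5, padding="same", activation="relu")(x)
# Recurrent layer: unidirectional LSTM returning one hidden vector per timestep.
x = layers.LSTM(300, return_sequences=True)(x)
# Pooling layer: mean over time (MoT) of the LSTM outputs.
# (Padding positions are averaged in here for simplicity of the sketch.)
x = layers.GlobalAveragePooling1D()(x)
# Linear layer with sigmoid activation: score in [0, 1].
outputs = layers.Dense(1, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="rmsprop", loss="mse")  # MSE against normalized gold scores
```

Omitting the Conv1D line gives the LSTM-only variant examined later in the experiments, and taking the final LSTM state instead of the average corresponds to last pooling.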
BERT, a pretrained language model released by the Google AI Language team, has achieved state-of-the-art results in various NLP tasks [7]. BERT has been applied to AES [32] and to automated short-answer grading (SAG) [19, 21, 36] since 2019, and it provides good accuracy. BERT is defined as a multi-layer bidirectional transformer network [46]. Transformers are a neural network architecture designed to handle ordered sequences of data using an attention mechanism. Specifically, transformers consist of multiple layers (called transformer blocks), each containing a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. See Ref. [46] for details of this architecture.

BERT is trained in pretraining and fine-tuning steps. Pretraining is conducted on huge amounts of unlabeled text data over two tasks, masked language modeling and next-sentence prediction: the former predicts the identities of words that have been masked out of the input text, and the latter predicts whether two given sentences are adjacent. Using BERT for a target NLP task, including AES, requires fine-tuning (retraining), which is conducted on a task-specific supervised dataset after initializing the model parameters to their pretrained values. When using BERT for AES, input essays require preprocessing, namely adding a special token ("CLS") to the beginning of each input. The BERT output corresponding to this token is used as the aggregate sequence representation [7]. We can thus score an essay by feeding this representation to a linear layer with sigmoid activation, as illustrated in Fig. 1(b).

Training of CNN-LSTM-based AES models and fine-tuning of BERT-based AES models are conducted using large datasets of essays graded by human raters. For model training, the mean squared error (MSE) between the predicted and gold-standard scores is used as the loss function. Specifically, letting $y_j$ be the gold-standard score for essay $e_j$ and $\hat{y}_j$ be the predicted score, the MSE loss function is defined as $\frac{1}{J}\sum_{j=1}^{J}(y_j - \hat{y}_j)^2$.

The gold-standard score $y_j$ is a score for essay $e_j$ assigned by a human rater in the rater set $\mathcal{R}$. When multiple raters grade each essay, the gold-standard score must be determined by selecting one score or by calculating an average or total score. In any case, such scores depend strongly on rater characteristics, as discussed in Sect. 1. The accuracy of a DNN model drops when such biased data are used for model training, because the trained model inherits the bias effects [3, 12, 17].

In educational and psychological measurement research, item response theory (IRT) models that can estimate essay scores while considering the effects of rater characteristics have recently been proposed [9, 29, 30, 38, 42-44]. The main goal of this study is to train AES models using IRT-based unbiased scores. The next section introduces the IRT models.

IRT [20] is a test theory based on mathematical models. IRT represents the probability of an examinee's response to a test item as a function of latent examinee ability and item characteristics such as difficulty and discrimination. IRT is widely used in educational testing because it offers many benefits. For example, IRT can estimate examinee ability while considering the effects of item characteristics. Also, the abilities of examinees responding to different test items can be measured on the same scale, and missing response data can be handled easily.

Traditional IRT models are applicable to two-way data (examinees × test items) consisting of examinees' scores on test items. For example, the generalized partial credit model (GPCM) [25], a representative polytomous IRT model, defines the probability that examinee $j$ receives score $k$ for test item $i$ as

$$P_{ijk} = \frac{\exp\sum_{m=1}^{k}\left[\alpha_i(\theta_j - \beta_i - d_{im})\right]}{\sum_{l=1}^{K}\exp\sum_{m=1}^{l}\left[\alpha_i(\theta_j - \beta_i - d_{im})\right]}, \qquad (1)$$

where $\theta_j$ is the latent ability of examinee $j$, $\alpha_i$ is a discrimination parameter for item $i$, $\beta_i$ is a difficulty parameter for item $i$, and $d_{ik}$ is a step-difficulty parameter denoting the difficulty of the transition between scores $k-1$ and $k$ within the item. Here, $d_{i1} = 0$ and $\sum_{k=2}^{K} d_{ik} = 0$ are assumed for model identification.

However, the conventional GPCM ignores rater factors, so it is not applicable to rating data given by multiple raters as assumed in this study. Extended models that incorporate parameters representing rater characteristics have been proposed to resolve this difficulty [29, 30, 38, 42-45]. This study introduces a state-of-the-art model [44, 45] that is most robust for a large variety of raters. This model defines the probability that rater $r$ assigns score $k$ to examinee $j$'s essay for a test item (e.g., an essay task) $i$ as

$$P_{ijrk} = \frac{\exp\sum_{m=1}^{k}\left[\alpha_i\alpha_r(\theta_j - \beta_i - \beta_r - d_{rm})\right]}{\sum_{l=1}^{K}\exp\sum_{m=1}^{l}\left[\alpha_i\alpha_r(\theta_j - \beta_i - \beta_r - d_{rm})\right]}, \qquad (2)$$

where $\alpha_r$ is the consistency of rater $r$, $\beta_r$ is the strictness of rater $r$, and $d_{rk}$ is the severity of rater $r$ within category $k$. For model identification, we assume constraints analogous to those above for the rater step parameters ($d_{r1} = 0$ and $\sum_{k=2}^{K} d_{rk} = 0$), together with scale constraints on the item and rater parameters.

This study applies this IRT model to the rating data $U$ in the training data. Note that DNN-AES models are trained for each essay task; therefore, the rating data $U$ are defined as two-way data (examinees × raters). When the number of tasks in the model is fixed to one, the above model identification constraints make $\alpha_i$ and $\beta_i$ ignorable, so Eq. (2) becomes

$$P_{jrk} = \frac{\exp\sum_{m=1}^{k}\left[\alpha_r(\theta_j - \beta_r - d_{rm})\right]}{\sum_{l=1}^{K}\exp\sum_{m=1}^{l}\left[\alpha_r(\theta_j - \beta_r - d_{rm})\right]}. \qquad (3)$$

This equation is consistent with the conventional GPCM when the item parameters are regarded as rater parameters. Note that $\theta_j$ in Eq. (3) represents not only the ability of examinee $j$ but also the latent unbiased score of essay $e_j$, because only one essay is associated with each examinee. This model thus provides essay scores with the effects of rater bias removed.
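To make Eq. (3) concrete, the following is a minimal sketch, assuming NumPy, of how the category probabilities $P_{jrk}$ could be evaluated for one essay and one rater. The function name, array conventions, and example parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def rater_gpcm_probs(theta, alpha_r, beta_r, d_r):
    """Category probabilities P_{jrk} of Eq. (3) for one essay and one rater.

    theta   : latent (unbiased) essay score theta_j
    alpha_r : consistency of rater r
    beta_r  : strictness of rater r
    d_r     : length-K array of step parameters d_{r1}, ..., d_{rK}, with d_r[0] = 0
    """
    # Cumulative numerators: exp(sum_{m=1..k} alpha_r * (theta - beta_r - d_rm))
    log_num = np.cumsum(alpha_r * (theta - beta_r - d_r))
    log_num -= log_num.max()          # subtract the max for numerical stability
    num = np.exp(log_num)
    return num / num.sum()            # normalize over the K categories

# Example: a lenient, fairly consistent rater (toy parameter values).
p = rater_gpcm_probs(theta=0.5, alpha_r=1.2, beta_r=-0.3,
                     d_r=np.array([0.0, -1.0, -0.2, 0.4, 0.8]))
```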
We propose a DNN-AES framework that uses the IRT-based unbiased scores $\theta = \{\theta_j \mid j \in \mathcal{J}\}$ to deal with rater bias in the training data. Figure 2 shows the architecture of the proposed method. As the figure shows, the proposed method is defined by stacking an IRT model over a conventional DNN-AES model. Training of our models proceeds in two steps:

1. Estimate the IRT scores $\theta$ from the rating data $U$.
2. Train the AES model using the IRT scores $\theta$ as the gold-standard scores. Specifically, the MSE loss function for training is defined as $\frac{1}{J}\sum_{j=1}^{J}(\theta_j - \hat{\theta}_j)^2$, where $\hat{\theta}_j$ represents the AES model's predicted score for essay $e_j$.

Since the scores $\theta$ are estimated while considering rater bias effects, the trained model will not reflect those effects. Note that the gold-standard scores must be rescaled to the range [0, 1] for training because sigmoid activation is used in the output layer. In IRT, 99.7% of $\theta_j$ values fall within the range $[-3, 3]$ because a standard normal distribution is generally assumed. We therefore apply a linear transformation from $[-3, 3]$ to $[0, 1]$ after rounding scores lower than $-3$ up to $-3$ and scores higher than $3$ down to $3$. Note that the increase in training time of the proposed method over a conventional method is only the time required for IRT parameter estimation.

In the testing phase, the score for a new essay $e_j$ is predicted in two steps (a short code sketch follows the steps):

1. Predict the IRT score $\hat{\theta}_j$ with the trained AES model, and rescale it to the range $[-3, 3]$.
2. Calculate the expected score $\hat{U}_j$, which corresponds to an unbiased original-scale score for $e_j$ [39], as

$$\hat{U}_j = \frac{1}{R}\sum_{r=1}^{R}\sum_{k=1}^{K} k\, P_{jrk}, \qquad (4)$$

where $P_{jrk}$ is evaluated at $\hat{\theta}_j$ using the rater parameters estimated from the training data.
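The prediction-phase post-processing can be sketched as follows, under the assumption that Eq. (4) is the rater-averaged expected score as reconstructed above. The sketch assumes NumPy, reuses the `rater_gpcm_probs` function from the earlier sketch, and uses illustrative names and values throughout.

```python
import numpy as np

def rescale_to_theta(y_hat):
    """Map an AES output in [0, 1] back to the IRT scale [-3, 3]."""
    return 6.0 * float(np.clip(y_hat, 0.0, 1.0)) - 3.0

def expected_original_score(theta_hat, alphas, betas, d):
    """Expected original-scale score of Eq. (4), averaged over all R raters.

    alphas, betas : length-R arrays of rater consistency / strictness
    d             : (R, K) array of rater step parameters, with d[:, 0] = 0
    """
    K = d.shape[1]
    categories = np.arange(1, K + 1)
    per_rater = [
        np.dot(categories, rater_gpcm_probs(theta_hat, a_r, b_r, d_r))
        for a_r, b_r, d_r in zip(alphas, betas, d)
    ]
    return float(np.mean(per_rater))

# Example: post-process one AES prediction (toy setup with R = 2, K = 5).
theta_hat = rescale_to_theta(0.72)
u_hat = expected_original_score(
    theta_hat,
    alphas=np.array([1.2, 0.8]),
    betas=np.array([-0.3, 0.5]),
    d=np.array([[0.0, -1.0, -0.2, 0.4, 0.8],
                [0.0, -0.5,  0.0, 0.3, 0.2]]),
)
```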
This section describes an evaluation of the effectiveness of the proposed method through experiments on actual data.

These experiments used the Automated Student Assessment Prize (ASAP) dataset, which is widely used as benchmark data in AES studies. This dataset consists of essays on eight topics, originally written by students in grades 7 to 10. There are 12,978 essays, averaging 1,622 essays per topic. However, this dataset cannot be directly used to evaluate the proposed method because, although its essays were graded by multiple raters, it contains no rater identifiers. We therefore employed other raters and asked them to grade essays in the ASAP dataset. We used the essay data for the fifth ASAP topic because the number of essays for that topic is relatively large (n = 1,805). We recruited 38 native English speakers as raters through Amazon Mechanical Turk and assigned four raters to each essay. Each rater graded around 195 essays. The assessment rubric used the same five rating categories as ASAP. The average Pearson's correlation between the collected rating scores and the original ASAP scores was 0.675.

To confirm differences in rater characteristics, we plotted the average score and standard deviation (SD) for each rater, as shown in Fig. 3. In that figure, each point represents a rater, and the horizontal and vertical axes show the average and SD values, respectively. In addition, Table 1 shows the appearance rates of the five rating categories for 10 representative raters. The figure and table show extreme differences in grading characteristics among the raters, suggesting that consideration of rater bias is required.

This subsection shows that the proposed method provides more robust scores than conventional AES models do, even when the rater grading each essay in the training data changes. The experimental procedures, which are similar to those used in previous studies examining IRT scoring robustness [39-42], were as follows:

1. We estimated the IRT parameters by the Markov chain Monte Carlo (MCMC) algorithm [30, 42] using all rating data.
2. We created a dataset consisting of (essay, score) pairs by randomly selecting, for each essay, one score from among the scores assigned by multiple raters. We repeated this data generation 10 times. Hereafter, the $m$-th generated dataset is denoted $U_m$.
3. From each dataset $U_m$, we estimated the IRT scores $\theta$ (denoted $\theta_m$) given the rater parameters obtained in Step 1, and then created a dataset comprising the essays and the corresponding $\theta_m$ values.
4. Using each such dataset, we conducted five-fold cross-validation to train the AES models and to obtain predicted scores $\hat{\theta}_m$ for all essays.
5. We calculated agreement metrics between the expected scores calculated by Eq. (4) given $\hat{\theta}_m$ and those calculated given $\hat{\theta}_{m'}$ for all unique pairs $m, m' \in \{1, \cdots, 10\}$ ($\binom{10}{2} = 45$ pairs in total). As agreement metrics, we used Cohen's kappa, weighted kappa, root mean squared error (RMSE), and Pearson's correlation coefficient (a sketch of these metric computations follows this list).
6. We calculated the average metric values over the 45 pairs.

High kappa and correlation values and low RMSE values obtained from this experiment indicate that score predictions are more robust across different rater assignments.
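For reference, the agreement metrics in Step 5 could be computed with standard libraries as in the following sketch. It assumes scikit-learn and SciPy, quadratic weights for the weighted kappa, and that the continuous expected scores are rounded to the original integer categories beforehand; the paper does not specify these details, so they are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement_metrics(scores_a, scores_b):
    """Agreement between two integer score vectors for the same essays."""
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    return {
        "kappa": cohen_kappa_score(scores_a, scores_b),
        "weighted_kappa": cohen_kappa_score(scores_a, scores_b, weights="quadratic"),
        "rmse": float(np.sqrt(np.mean((scores_a - scores_b) ** 2))),
        "pearson": float(pearsonr(scores_a, scores_b)[0]),
    }
```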
We conducted a similar experiment using conventional DNN-AES models without the IRT model. Specifically, using each dataset $U_m$, we predicted essay scores with a DNN-AES model through the same five-fold cross-validation procedure as in Step 4. We then calculated the four agreement metrics among the predicted scores obtained from the different datasets $U_m$ and averaged them.

These experiments were conducted with several DNN-AES models. Specifically, we examined CNN-LSTM models using MoT pooling or last pooling, the corresponding models without a CNN layer, and the BERT model. These models were implemented in Python with the Keras library. For the BERT model, we used the base-sized pretrained model. The hyperparameters and dropout settings were determined following Refs. [2, 7, 46].

Table 2 shows the results, which indicate that the proposed method substantially improves the agreement metrics compared with the conventional models in all cases. The results indicate that the proposed method provides stable scores when the rater allocation for each essay in the training data is changed, demonstrating that it is highly robust against rater bias. Note that the values in Table 2 are not comparable with the results of previous AES studies, because our experiment and previous experiments evaluate different aspects of AES performance.

In addition, as in previous AES studies, we evaluated the score ($\theta$) prediction accuracy of the proposed method through five-fold cross-validation. We measured accuracy using the mean absolute error (MAE), RMSE, the correlation coefficient, and the coefficient of determination ($R^2$), because $\theta$ is a continuous variable. Table 3 shows the results, which indicate that the CNN-LSTM and LSTM models with MoT pooling achieved higher performance than those with last pooling. The table also shows that the CNN did not effectively improve accuracy. These tendencies are consistent with a previous study [2]. In addition, BERT provided the highest accuracy, which is also consistent with current NLP studies.

Tables 2 and 3 show that the score-prediction robustness in Table 2 tends to increase with score-prediction accuracy. This might be because the scores of low-performance DNN-AES models are strongly biased not only by rater characteristics but also by prediction errors arising from the model itself. As the accuracy of a DNN-AES model increases, rater bias accounts for a larger share of the overall error, suggesting that the impact of the proposed method also increases.

We showed that DNN-AES model performance strongly depends on the characteristics of the raters who grade the essays in the training data. To resolve this problem, we proposed a new DNN-AES framework that integrates IRT models. Specifically, we formulated our method as a two-stage architecture that stacks an IRT model over a conventional DNN-AES model. Through experiments using an actual dataset, we demonstrated that the proposed method provides more robust essay scores than conventional DNN-AES models do. The proposed method is simple but powerful, and it is easily applicable to any AES model. As described in the Introduction, our method is also highly suited to situations where high-quality training data are hard to prepare, including educational contexts.

In future studies, we expect to evaluate the effectiveness of the proposed method using various datasets. Although this study mainly focused on robustness against rater bias, the proposed method might also improve prediction accuracy for each rater's score; this accuracy should be evaluated in future work. Our method is defined as a two-stage procedure that separately trains the IRT model and the DNN-AES model. However, conducting end-to-end optimization would further improve the performance. This extension is another topic for future study.

References
[1] Beyond translation: adapting a performance-task-based assessment of critical thinking ability for use in Rwanda
[2] Automatic text scoring using neural networks
[3] Automated essay scoring in the presence of biased ratings
[4] Rater rating-level bias and accuracy in performance appraisals: the impact of rater personality, performance management competence, and rater accountability
[5] ReaderBench learns Dutch: building a comprehensive automated essay scoring system for Dutch language
[6] Augmenting textual qualitative features in deep convolution recurrent neural network for automatic essay scoring
[7] BERT: pre-training of deep bidirectional transformers for language understanding
[8] Examining rater effects in TestDaF writing and speaking performance assessments: a many-facet Rasch analysis
[9] Introduction to Many-Facet Rasch Measurement: Analyzing and Evaluating Rater-Mediated Assessments
[10] Neural automated essay scoring and coherence modeling for adversarially crafted input
[11] Exploring the psychometric properties of the mind-map scoring rubric
[12] O2U-Net: a simple noisy label detection approach for deep neural networks
[13] Automated language essay scoring systems: a literature review
[14] TDNN: a two-stage deep neural network for prompt-independent automated essay scoring
[15] Judging behaviour and rater errors: an application of the many-facet Rasch model
[16] Automated essay scoring: a survey of the state of the art
[17] Coupled-view deep classifier learning from multiple noisy annotators
[18] Assessing critical thinking in higher education: current state and directions for next-generation assessment
[19] Automatic short answer grading via multiway attention networks
[20] Applications of Item Response Theory to Practical Testing Problems
[21] Multiple data augmentation strategies for improving performance on automatic short answer scoring
[22] Automated Essay Scoring: A Cross-disciplinary Perspective
[23] A neural local coherence model for text quality assessment
[24] Unsupervised learning of discourse-aware text representation for essay scoring
[25] A generalized partial credit model
[26] Detecting and measuring rater effects using many-facet Rasch measurement: part I
[27] Automated essay scoring with discourse-aware neural models
[28] Argument mining for improving the automated scoring of persuasive essays
[29] The hierarchical rater model for rated test items and its application to large-scale educational assessment data
[30] Applications and extensions of MCMC in IRT: multiple item types, missing data, and rated responses
[31] Investigating central tendency in competency assessment of design electronic circuit: analysis using many facet Rasch measurement (MFRM)
[32] Language models and automated essay scoring. arXiv, cs.CL
[33] Making student thinking visible through a concept map in computer-based assessment of critical thinking
[34] Rating the ratings: assessing the psychometric quality of rating data
[35] Assessment techniques and students' higher-order thinking skills
[36] Improving short answer grading using transformer-based pre-training
[37] A neural approach to automated essay scoring
[38] Item response theory for peer assessment
[39] Rater-effect IRT model integrating supervised LDA for accurate measurement of essay writing ability
[40] Group optimization to maximize peer assessment accuracy using item response theory
[41] Group optimization to maximize peer assessment accuracy using item response theory and integer programming
[42] Item response theory for peer assessment
[43] Empirical comparison of item response theory models with rater's parameters
[44] Item response theory without restriction of equal interval scale for rater's score
[45] A generalized many-facet Rasch model and its Bayesian estimation using Hamiltonian Monte Carlo
[46] Attention is all you need
[47] Automatic essay scoring incorporating rating schema via reinforcement learning

Acknowledgment. This work was supported by JSPS KAKENHI 17H04726 and 17K20024.