key: cord-0046976-va82gnps
title: Exploring Automatic Short Answer Grading as a Tool to Assist in Human Rating
authors: Condor, Aubrey
date: 2020-06-10
journal: Artificial Intelligence in Education
DOI: 10.1007/978-3-030-52240-7_14
sha: b9fcec933d83c12ed091294bce2b5025c9eaf63d
doc_id: 46976
cord_uid: va82gnps

This project proposes using BERT (Bidirectional Encoder Representations from Transformers) as a tool to assist educators with automated short answer grading (ASAG), as opposed to replacing human judgement in high-stakes scenarios. Many educators are hesitant to give authority to an automated system, especially in assessment tasks such as grading constructed response items. However, evaluating free-response text is costly in time and labor for one rater, let alone multiple raters. In addition, some degree of inconsistency exists within and between raters for assessing a given task. Recent advances in Natural Language Processing have led to improvements in technologies that rely on artificial intelligence and human language. New, state-of-the-art models such as BERT, an open-source, pre-trained language model, have decreased the amount of training data needed for specific tasks and, in turn, have reduced the amount of human annotation necessary for producing a high-quality classification model. After training BERT on expert ratings of constructed responses, we use the resulting automated grades to calculate Cohen's Kappa as a measure of inter-rater reliability between the automated system and the human rater. For practical application, when the inter-rater reliability metric is unsatisfactory, we suggest that the human rater(s) use the automated model to call attention to ratings where a second opinion might be needed to confirm the rater's correctness and consistency of judgement.

Although it has been shown that incorporating constructed response items in educational assessments is beneficial for student learning [2], the burden of time spent grading constructed response activities, as opposed to multiple choice questions, can deter educators from using them. In addition, the quality of human ratings of student responses can vary in consistency and reliability [15]. Using an automated system for grading free text could help to alleviate this time burden as well as produce more consistent ratings. However, from the educator's perspective, completely removing human judgement from assessment tasks is neither responsible nor realistic. Natural Language Understanding (NLU) models are not yet able to discern all the nuances of language as well as a human, and in high-stakes grading situations, incorrect ratings can have dire consequences for students. Recent automated short answer grading (ASAG) research using state-of-the-art language models trained on large quantities of data is able to predict human ratings correctly less than 85% of the time.

Notable recent work includes Crossley et al., who used Latent Semantic Analysis (LSA) to assess student summarizations [3]. Mieskes et al. combined several different automated graders to create a superior ensemble grader [8]. Qi et al. created a hierarchical word-sentence model using CNN and BLSTM architectures [9]. Sung et al. examined the effectiveness of pre-training BERT as a function of the size of the training data, the number of epochs, and generalizability across domains [12]. In a separate study, Sung et al. pre-trained BERT on relevant domain texts to enhance the existing model for ASAG [11].
Dhamecha et al. introduced an iterative data collection and grading approach for analyzing student answers [5]. Finally, Hu et al. incorporated a technique called Recognizing Textual Entailment to investigate whether a given passage and question support the predicted answer [14]. We propose using the smaller bert-base configuration of BERT to simplify the training process, and we show that with a relatively small amount of training data (fewer than 70 student answers per question), we can achieve high enough inter-rater reliability to assist a human grader in constructed response rating tasks.

We used the DT-Grade dataset, which consists of short constructed answers from tutorial dialogues between students and DeepTutor, an Intelligent Tutoring System created at the University of Memphis Institute for Intelligent Systems [10]. About 1100 student responses of 100 words or fewer, answering conceptual questions on Newtonian physics, were randomly selected from 40 junior-level college students. The data include 34 distinct questions along with the relevant question context information. Initial ratings were completed by experts, and each answer was annotated for correctness as one of four categories: correct, correct-but-incomplete, contradictory, and incorrect [1] (Fig. 1).

We removed all records from the dataset where the number of answers per question was less than 20. This left 28 distinct conceptual physics questions, with the number of student responses per question ranging from 20 to 69; the filtered dataset contained 994 records. We collapsed the four rating categories into two so that we would have a binary response variable, starting with the simplest version of the model: correct responses were considered correct, and all others (correct-but-incomplete, contradictory, and incorrect) were considered incorrect. The question context text was concatenated with the question text and the student's answer text before creating the input vector embeddings, and the concatenated input texts were tokenized with the bert-base-uncased tokenizer [16] (Fig. 2). Training, validation, and test sets were created such that 70% of responses were randomly allocated to the training set, 15% to the validation set, and 15% to the test set.

The language model we used, BERT (Bidirectional Encoder Representations from Transformers), was introduced in [4] as a revolutionary language representation model. It was the first to successfully learn by pre-training bidirectionally on unlabeled text. Consequently, the model can be fine-tuned for many different tasks, such as ASAG, by adding only one additional layer to the existing deep neural network. We use the bert-base version of BERT through a Python package called fast-bert [13], which enables quick and simplified fine-tuning of the bert-base model for the assessment task at hand. A simple grid search was used to tune the parameters and hyper-parameters of the model to achieve high validation accuracy. The best results were observed using a batch size of 8 (on a single GPU), a maximum sequence length of 512, 8 training epochs, and a learning rate of 6e-5; the LAMB optimizer was used for training. Our particular ASAG task is essentially one of binary text classification: each response is classified as either correct or incorrect, and the model returns the predicted rating for each input vector.
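To make the data preparation concrete, here is a minimal sketch (not the authors' code) of the preprocessing described above: filtering out questions with fewer than 20 answers, collapsing the four DT-Grade labels into a binary correct/incorrect label, concatenating context, question, and answer text, and splitting 70/15/15 into training, validation, and test sets. The file name and the column names (context, question, answer, rating) are illustrative assumptions.

```python
# Preprocessing sketch for the DT-Grade data (assumed file and column names).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dt_grade.csv")  # assumed columns: question, context, answer, rating

# Keep only questions with at least 20 student answers.
df = df.groupby("question").filter(lambda g: len(g) >= 20)

# Collapse the four rating categories into a binary label:
# "correct" stays correct; all other categories become incorrect.
df["label"] = (df["rating"] == "correct").map({True: "correct", False: "incorrect"})

# Concatenate question context, question text, and student answer into one input string.
df["text"] = df["context"] + " " + df["question"] + " " + df["answer"]

# 70% train, 15% validation, 15% test (random split).
train_df, rest_df = train_test_split(df, test_size=0.30, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.50, random_state=42)

for name, split in [("train", train_df), ("val", val_df), ("test", test_df)]:
    split[["text", "label"]].to_csv(f"{name}.csv", index=False)
```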
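Similarly, the fine-tuning and evaluation step could look roughly like the sketch below, assuming the fast-bert interface documented for the package (BertDataBunch, BertLearner) and the hyperparameters reported above. The directory layout, CSV and label files, the prediction post-processing, and the scikit-learn Cohen's Kappa call are assumptions for illustration, not the authors' actual pipeline.

```python
# Hedged sketch: fine-tune bert-base with fast-bert, then measure model-human
# agreement with Cohen's Kappa on the held-out test set.
import logging
import pandas as pd
import torch
from fast_bert.data_cls import BertDataBunch
from fast_bert.learner_cls import BertLearner
from fast_bert.metrics import accuracy
from sklearn.metrics import cohen_kappa_score

DATA_PATH = "."   # assumed: directory holding train.csv / val.csv
LABEL_PATH = "."  # assumed: directory holding labels.csv (lines: correct, incorrect)

databunch = BertDataBunch(
    DATA_PATH, LABEL_PATH,
    tokenizer="bert-base-uncased",   # bert-base-uncased tokenizer [16]
    train_file="train.csv", val_file="val.csv", label_file="labels.csv",
    text_col="text", label_col="label",
    batch_size_per_gpu=8,            # batch size of 8 on a single GPU
    max_seq_length=512,              # maximum sequence length of 512
    multi_gpu=False, multi_label=False, model_type="bert",
)

learner = BertLearner.from_pretrained_model(
    databunch,
    pretrained_path="bert-base-uncased",
    metrics=[{"name": "accuracy", "function": accuracy}],
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    logger=logging.getLogger(),
    output_dir="output",
    multi_gpu=False, multi_label=False, is_fp16=False,
)

# 8 training epochs, learning rate 6e-5, LAMB optimizer.
learner.fit(epochs=8, lr=6e-5, validate=True, optimizer_type="lamb")

# Agreement between the fine-tuned model and the human rater on the test set
# (assumes predict_batch returns, per text, a list of (label, score) pairs).
test_df = pd.read_csv("test.csv")
preds = learner.predict_batch(list(test_df["text"]))
model_labels = [max(p, key=lambda pair: pair[1])[0] for p in preds]
print("Cohen's Kappa:", cohen_kappa_score(test_df["label"].tolist(), model_labels))
```

Under a setup of this kind, the Cohen's Kappa computed on the test set would correspond to the agreement statistic reported in the results below.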
We used Cohen's Kappa as our metric for inter-rater reliability; it captures the extent to which raters agree on rating assignments beyond what is expected by chance [7]. Cohen's Kappa is calculated as

    κ = (p_0 - p_e) / (1 - p_e),

where p_0 is the relative observed agreement among raters and p_e is the hypothetical probability of chance agreement. A κ value of 0 represents agreement equivalent to random chance, and a value of 1 represents perfect agreement between raters.

The best model achieved a testing accuracy of 0.760 and a Cohen's Kappa statistic of 0.684, reflecting the extent to which the BERT model agrees with the human rater beyond random chance. Given the small amount of training data per question, we believe these results provide evidence that transfer learning models such as BERT can remove a significant amount of human rating work, as well as help achieve more consistent human ratings.

We must consider whether an instructor would find the described system practical and, correspondingly, whether the resulting Kappa statistic is good enough for real-world use. One perspective is that the human rater can apply context-specific judgement about the extent to which they would like to examine the highlighted cases of disagreement. For example, if the assessment is used for low-stakes, formative purposes, it might not be practical for an educator to investigate rating mismatches in depth. However, if the questions will be reused in future assessments, or the scoring feeds into a pass-fail decision for a student, a detailed look into discrepancies may be appropriate.

For the field of education to embrace applicable research in Artificial Intelligence, researchers must consider the practicality and usefulness of new technologies from the educator's perspective. Such technologies should act as a support for teachers, not as independent, decision-making entities. This project represents a work in progress that continues to investigate how we can leverage artificial intelligence in service of human decision making.

References

1. Evaluation dataset (DT-Grade) and word weighting approach towards constructed short answers assessment in tutorial dialogue context
2. Eliciting self-explanations improves understanding
3. Automated Summarization Evaluation (ASE) using natural language processing tools
4. BERT: pre-training of deep bidirectional transformers for language understanding
5. Balancing human efforts and performance of student response analyzer in dialog-based tutors
6. Artificial intelligence in education: promises and implications for teaching and learning
7. Interrater agreement measures: comments on Kappa_n, Cohen's Kappa, Scott's π, and Aickin's α
8. Work smart - reducing effort in short-answer grading
9. Attention-based hybrid model for automatic short answer scoring
10. DeepTutor: an effective, online intelligent tutoring system that promotes deep learning
11. Pre-training BERT on domain resources for short answer grading
12. Improving short answer grading using transformer-based pre-training
13. Fast-BERT
14. Read + Verify: machine reading comprehension with unanswerable questions
15. A systematic review of methods for evaluating rating quality in language assessment
16. HuggingFace's Transformers: state-of-the-art natural language processing