Essay-BR: a Brazilian Corpus of Essays

Jeziel C. Marinho, Rafael T. Anchieta, Raimundo S. Moura

2021-05-19

Abstract. Automatic Essay Scoring (AES) is defined as the computer technology that evaluates and scores written essays, aiming to provide computational models to grade essays either automatically or with minimal human involvement. While there are several AES studies in a variety of languages, few of them focus on the Portuguese language. The main reason is the lack of a corpus with manually graded essays. In order to bridge this gap, we create a large corpus of essays written by Brazilian high school students on an online platform. All of the essays are argumentative and were scored across five competencies by experts. Moreover, we conducted an experiment on the created corpus and show the challenges posed by the Portuguese language. Our corpus is publicly available at https://github.com/rafaelanchieta/essay.

1 Introduction

The Automated Essay Scoring (AES) area began with Page [15] and the Project Essay Grader. Shermis and Barrera [17] define AES as the computer technology that evaluates and scores written essays, i.e., it aims to provide computational models to grade essays either automatically or with minimal human involvement [15]. AES is one of the most important educational applications of Natural Language Processing (NLP) [12, 4]. It encompasses other fields, such as Cognitive Psychology, Educational Measurement, Linguistics, and Writing Research [18], which study methods to assist teachers in automatic assessments, providing a cheaper, faster, and more deterministic alternative to human grading. Due to all these benefits, AES has been widely studied in various languages, for example, English, Chinese, Danish, Japanese, Norwegian, and Swedish, among others [4]. To grade an essay, these studies supported the development of regression-based methods, such as [3, 22], classification-based methods, such as [10, 14], and neural network-based methods, such as [21]. Moreover, AES systems have also been successfully used in schools and large-scale exams [23]. According to Dikli [8], examples of such systems are: Intelligent Essay Assessor™, Criterion℠, IntelliMetric™, E-rater®, and MY Access!®.

Despite the importance of the AES area, most of the resources and methods are only available for the English language [12]. There are very few AES studies for the Brazilian Portuguese language. The main reason for that is the lack of a public corpus with manually graded essays. Hence, it is important to put some effort into creating resources that will be useful for the development of alternative methods in this field. In this paper, aiming to fill this gap, we create a large corpus, named Essay-BR, with essays written by Brazilian high school students on an online platform. These essays are of the argumentative type and were graded by experts across five different competencies, which are summed to obtain the total score of an essay. The competencies follow the evaluation criteria of the ENEM (Exame Nacional do Ensino Médio, National High School Exam), which is the main Brazilian high school exam and serves as an admission test for most universities in Brazil.
In addition to the corpus, we carried out an experiment, implementing two approaches to automatically score essays, demonstrating the challenges posed by the corpus, and providing baseline results. As this is the first publicly available corpus of essays for the Portuguese language, we believe that it will foster AES studies for that language, resulting in the development of alternative methods to grade essays.

The remainder of this paper is organized as follows. Sect. 2 describes the main related work for the Portuguese language. In Sect. 3, we present the ENEM exam. Sect. 4 details our corpus, its construction, and an analysis of the training, development, and testing datasets. In Sect. 5, we describe our experiments. Finally, Sect. 6 concludes the paper, indicating future work.

2 Related Work

As mentioned before, there is no publicly available corpus of essays for Brazilian Portuguese. However, three efforts have investigated AES for that language. Here, we briefly present them.

Bazelato and Amorim [2] crawled 429 graded essays from the Educação UOL Website to create the first corpus of essays for the Portuguese language. However, the crawled essays are very old and do not meet the criteria of the ENEM exam. Moreover, the collected essays are not available.

Amorim and Veloso [1] developed an automatic essay scoring method for the Brazilian Portuguese language. For that, they collected 1,840 graded essays from the Educação UOL Website. Next, they developed 19 features to feed a linear regression model to grade the essays. Then, to evaluate the approach, the authors compared the automatic scores with the scores of the essays using the Quadratic Weighted Kappa (QWK) metric [6], achieving 42.45%.

Fonseca et al. [11] addressed the task of automatic essay scoring in two ways. In the first one, they adopted a deep neural network architecture similar to that of Dong et al. [9], with two Bidirectional Long Short-Term Memory (BiLSTM) layers. The first layer reads word vectors and generates sentence vectors, which are read by the second layer to produce a single essay vector. This essay vector goes through an output layer with five units and a sigmoid activation function to obtain an essay score. In the second approach, the authors hand-crafted 681 features to feed a regressor to grade an essay. The authors evaluated the approaches on a corpus with 56,644 graded essays and reached the best result with the second method, achieving 75.20% in the QWK metric.

Although these works used essays written in Brazilian Portuguese to evaluate their methods, the authors did not make their corpora publicly available, making the development of alternative methods difficult. Moreover, each work used a different corpus, which also makes a fair comparison between them difficult.

In English, according to [12], there are five popular available corpora: ICLE [20], CLC-FCE [24], the Automated Student Assessment Prize (ASAP), TOEFL 11 [5], and AAE [19]. The ASAP corpus, one of the best-known and most established of them, was released as part of a Kaggle competition in 2012, becoming widely used for holistic scoring. The corpus is composed of 17,450 argumentative essays and 8 prompts written by United States students from grades 7 to 10.
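Both Portuguese studies above report their agreement with human graders using QWK. As a minimal sketch of how this metric can be computed for essay scoring, assuming integer total scores and using scikit-learn's cohen_kappa_score (the score values below are invented for illustration):

```python
# Minimal sketch: Quadratic Weighted Kappa (QWK) between human and system
# essay scores. The score values are hypothetical, for illustration only.
from sklearn.metrics import cohen_kappa_score

human_scores = [520, 600, 680, 720, 600, 840]   # hypothetical reviewer scores
system_scores = [560, 600, 640, 720, 680, 800]  # hypothetical system predictions

qwk = cohen_kappa_score(human_scores, system_scores, weights="quadratic")
print(f"QWK: {qwk:.4f}")
```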
In what follows, we present the ENEM exam.

3 The ENEM Exam

The ENEM (Exame Nacional do Ensino Médio, National High School Exam) is the main Brazilian high school exam and serves as an admission test for most universities in Brazil. More than that, it is the second-largest admission test in the world, after the National Higher Education Entrance Examination, the entrance examination for higher education in China. In the ENEM exam, the reviewers take five competencies into account to evaluate an essay:

1. Adherence to the formal written norm of Portuguese.
2. Conformity to the argumentative text genre and the proposed topic (prompt), developing a text using knowledge from different areas.
3. Selection, relation, organization, and interpretation of data and arguments in defense of a point of view.
4. Usage of argumentative linguistic structures.
5. Elaboration of a proposal to solve the problem in question.

Each competence is graded with scores ranging from 0 to 200 in intervals of 40. These scores are organized by proficiency levels, as shown in Table 1. In this table, the value 200 indicates excellent proficiency in the field of the competence, whereas the value 0 indicates no proficiency in it. The total score of an essay is thus the sum of the competence scores and may range from 0 to 1,000. At least two reviewers grade an essay in the ENEM exam, and the final score for each competence is the arithmetic mean of the two reviewers' scores. If the disagreement between the reviewers' scores is greater than 80, a new reviewer is invited to grade the essay, and the final score for each competence becomes the arithmetic mean of the three reviewers' scores.

4 The Essay-BR Corpus

The Essay-BR corpus contains 4,570 argumentative documents and 319 topics (prompts), collected from December 2015 to April 2020. The topics include human rights, political issues, healthcare, cultural activities, fake news, popular movements, COVID-19, and others. The essays are annotated with scores in the five competencies of the ENEM exam. Table 2 summarizes the Essay-BR corpus.

To create the Essay-BR corpus, we developed a Web scraper to extract essays from two public Websites: Vestibular UOL and Educação UOL. The essays from these Websites are public, may be used for research purposes, were written by high school students, and were graded by experts following the ENEM exam criteria. We collected 798 essays and 43 prompts from Educação UOL, and 3,772 essays and 276 prompts from Vestibular UOL. The difference in the number of essays is because the latter Website receives up to forty essays per month, while the former receives up to twenty essays per month. After collecting the essays, we applied some preprocessing to remove HTML tags and reviewer comments, so that the essays contain only the content written by the students. Then, we normalized the scores of the essays. Although these Websites adopted the same ENEM exam competencies to evaluate the essays, they have slightly different scoring strategies. Thus, we mapped the scores from the Websites to ENEM scores, as shown in Figure 1.

Although the corpus has a holistic score, it also has proficiency scores. Holistic scoring technologies are commercially valuable, since they allow automatically scoring millions of essays deterministically, summarizing the quality of an essay with a single score. However, holistic scoring is not adequate in classroom settings, where providing students with feedback on how to improve their essays is of utmost importance [12]. To mitigate this weakness, the Essay-BR corpus contains the five competence scores. Thus, the score of each competence shows how a student can improve their essay. For example, a student who got a score of 40 in the first competence, i.e., adherence to the formal written norm, receives as feedback that it is necessary to improve their grammar. We also present an example of the structure of our corpus in Table 3. From this table, the score is the sum of the competencies (C1 to C5), and the essay content is a list of paragraphs. It is important to note that some essays have no title, since, in the ENEM exam, the title is not mandatory.
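To make this structure concrete, the sketch below shows how one record of the corpus could be represented in Python; the field names and the prompt are illustrative assumptions, not the corpus's actual schema.

```python
# Illustrative sketch of an Essay-BR record; field names and values are
# assumptions for illustration, not the corpus's actual schema.
essay = {
    "prompt": "Desafios da mobilidade urbana no Brasil",  # hypothetical prompt
    "title": None,                                 # titles are optional in ENEM
    "essay": ["First paragraph ...", "Second paragraph ..."],  # paragraph list
    "competences": [160, 120, 120, 160, 80],  # C1..C5, each in {0, 40, ..., 200}
}

# The total (holistic) score is the sum of the five competence scores.
score = sum(essay["competences"])
assert 0 <= score <= 1000
print(score)  # 640
```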
Besides the structure, we computed some statistical and linguistic features of the essays in the corpus, as depicted in Tables 4 and 5, respectively. In Table 4, we can see that, on average, an essay has 4 paragraphs, and each paragraph has 2 sentences. Furthermore, the sentences are somewhat long, with an average of 30 tokens. Table 5 shows that most of the essays use the passive voice, since essays in Portuguese should be impersonal. We also calculated the Flesch score, which measures the readability of an essay; according to its value, the essays are compatible with a college reading level. Finally, we computed some vocabulary richness metrics, such as: hapax legomena, i.e., words that occur only once; lexical diversity, also known as the type-token ratio; and lexical density, which is the number of lexical tokens divided by the number of all tokens.

In order to create training, development, and testing sets, we divided the corpus in proportions of 70%, 15%, and 15%, which correspond to 3,198, 686, and 686 essays, respectively. Aiming to choose essays with a fair distribution of scores for each split, we computed the distribution of the total score of the essays, as depicted in Figure 2. The top 3 scores are 600, 680, and 720, corresponding to 13.00%, 10.68%, and 8.67% of the corpus, respectively, indicating that essays with these scores should appear more often in the training, development, and testing sets. Moreover, the scores in the corpus follow a slightly right-skewed normal distribution. We also computed the score distribution for each competence and present it in Table 6. From this table, 120 is the most frequent score in all competencies, showing that, in general, the students have a medium proficiency level in each field of competence. In the following subsection, we analyze the training, development, and testing sets of the Essay-BR corpus.

To create the three splits with score distributions similar to that of the complete corpus, we first shuffled all the data; then, we filled each split with essays based on the score distribution. Figure 3 presents the score distributions of the training, development, and testing sets, respectively. From this figure, one can see that they are similar to the score distribution of the complete corpus. As in Figure 2, the top 3 scores of the training set are 600, 680, and 720. Moreover, the development and testing sets have similar distributions. This means that if a machine learning algorithm performs well on the development set, it will probably also perform well on the testing set.

Beyond the scores, we also calculated some statistics on the splits, intending to verify whether the proportions of paragraphs, sentences, and tokens in each division remained close to those of the complete corpus. Comparing the results in Table 6 with the results of each split in Table 7, we can see that the proportions remained similar. For example, the averages of paragraphs per essay, sentences per essay, and sentences per paragraph had related results: 4, 10, and 2, respectively.
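A minimal sketch of one way to obtain such distribution-aware 70/15/15 splits is shown below, using scikit-learn's stratified splitting; this approximates, but is not identical to, the shuffle-and-fill procedure described above, and it assumes every score value occurs often enough to stratify on.

```python
# Minimal sketch: 70/15/15 splits that preserve the score distribution via
# stratified sampling. Approximates the shuffle-and-fill procedure in the
# text; assumes each score value appears at least a few times in the corpus.
from sklearn.model_selection import train_test_split

def split_corpus(essays, scores, seed=42):
    """Return (train, dev, test) splits of roughly 70%, 15%, and 15%."""
    train_x, rest_x, train_y, rest_y = train_test_split(
        essays, scores, test_size=0.30, stratify=scores, random_state=seed)
    dev_x, test_x, dev_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (dev_x, dev_y), (test_x, test_y)
```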
In what follows, we present the experiment and the obtained results.

5 Experiments

We carried out an experiment on the Essay-BR corpus to understand the challenges introduced by it. For that, we implemented the feature-based methods of Amorim and Veloso [1] and Fonseca et al. [11]. We are aware that, in recent years, the NLP area has been dominated by transformer architectures, such as BERT [7]. However, for the AES field, the results obtained by these architectures are similar to those of traditional models, such as N-gram-based ones, at a high computational cost [13]. Thus, as baselines, we preferred to implement feature-based methods, since they require less computational resources and effort.

Amorim and Veloso [1] developed 19 features, including the number of grammatical errors, the number of verbs, and the number of pronouns, among others. These features fed a linear regression model to score an essay. Fonseca et al. [11] created a pool of 681 features, such as the number of discourse markers, the number of oralities, and the number of correctly spelled words, among others, and these features fed a gradient boosting regressor to score an essay. To extract the features from the Essay-BR corpus, we used the same tools reported by the authors, and to implement the regressors, we used the scikit-learn library [16].

We evaluated these methods using the Quadratic Weighted Kappa (QWK), a metric commonly used to assess AES models [25], and the Root Mean Squared Error (RMSE), a metric employed for regression problems. Table 8 shows the results for the QWK metric, while Table 9 presents the results for the RMSE metric. For the QWK metric, the greater the value, the better the result, whereas for the RMSE metric, the smaller the value, the better the result. Although the approach of Fonseca et al. [11] achieved better results in both metrics for each competence (C1 to C5), these results are not adequate for summative student assessment, since, in the AES field, QWK values between 0.6 and 0.8 are normally used as a floor for testing purposes [13]. Furthermore, the method of Fonseca et al. [11], which achieved 75.20% in the QWK metric on their corpus, reached only 51% on the Essay-BR. This difference may be due to two factors. The first is the size of the corpus: Fonseca et al. [11] used a corpus with more than 50,000 essays, whereas our corpus has 4,570 essays. The second is implementation details: Fonseca et al. [11] used several lexical resources, but they did not make them available, so we do not know whether the lexical resources we used are the same as theirs. As we can see, it is necessary to develop more robust methods to grade essays for the Portuguese language in order to improve the results.
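To make the baseline setup concrete, the sketch below wires a few toy features into a gradient boosting regressor and evaluates it with QWK and RMSE. The features are illustrative stand-ins for the much richer feature sets of [1] and [11], and snapping predictions to multiples of 40 is our assumption about the total-score scale, not a detail taken from those works.

```python
# Minimal sketch of a feature-based baseline: toy hand-crafted features feed
# a gradient boosting regressor, evaluated with QWK and RMSE. The features
# are illustrative only; the works above use far richer feature sets.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import cohen_kappa_score, mean_squared_error

def toy_features(paragraphs):
    """Toy features: paragraph count, rough sentence count, token count."""
    text = " ".join(paragraphs)
    return [len(paragraphs), text.count("."), len(text.split())]

def train_and_evaluate(train_essays, train_scores, test_essays, test_scores):
    X_train = np.array([toy_features(e) for e in train_essays])
    X_test = np.array([toy_features(e) for e in test_essays])

    model = GradientBoostingRegressor(random_state=42)
    model.fit(X_train, train_scores)

    # Snap continuous predictions to the score scale (assumed multiples of 40).
    preds = (np.round(model.predict(X_test) / 40) * 40).astype(int)
    preds = np.clip(preds, 0, 1000)

    qwk = cohen_kappa_score(test_scores, preds, weights="quadratic")
    rmse = float(np.sqrt(mean_squared_error(test_scores, preds)))
    return qwk, rmse
```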
6 Conclusion and Future Work

In this paper, we presented a large corpus of essays written by Brazilian high school students and graded by experts following the evaluation criteria of the ENEM exam. This is the first publicly available corpus of graded essays for the Portuguese language. At this time, it has 4,570 essays and 319 prompts, but we have already scraped 13,306 additional essays from the Vestibular UOL Website. These essays are being preprocessed and will be made available as soon as possible. We hope that this resource will foster the research area for Portuguese through the development of alternative methods to grade essays.

More than that, according to Ke and Ng [12], the quality of an essay may be graded along different dimensions, as presented in Table 10. From this table, one can see that a corpus of essays may be graded along several dimensions. Assessing and scoring these dimensions helps students get better feedback on their essays, supporting them in identifying which aspects of an essay need improvement. Some of these dimensions do not seem challenging, such as the grammaticality, usage, and mechanics dimensions, since they have already been extensively explored. Several other dimensions, such as cohesion, coherence, thesis clarity, and persuasiveness, bring problems that involve computational modeling at different levels of the text. Modeling these challenging dimensions may require understanding the essay content and exploring the semantic and discourse levels of knowledge. Thus, there are several possible applications for which the Essay-BR may be useful. As future work, besides increasing the corpus, which is already in progress, we intend to provide the reviewers' corrections of the essays, aiming to develop machine learning models that learn from these corrections.

References

[1] A multi-aspect analysis of automatic essay scoring for Brazilian Portuguese
[2] A Bayesian classifier to automatic correction of Portuguese essays
[3] Topicality-based indices for essay scoring
[4] Automated evaluation of writing - 50 years and counting
[5] TOEFL11: A corpus of non-native English
[6] Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit
[7] BERT: Pre-training of deep bidirectional transformers for language understanding
[8] An overview of automated scoring of essays
[9] Attention-based recurrent convolutional neural network for automatic essay scoring
[10] Scoring persuasive essays using opinions and their targets
[11] Automatically grading Brazilian student essays
[12] Automated essay scoring: a survey of the state of the art
[13] Should you fine-tune BERT for automated essay scoring?
[14] Argument mining for improving the automated scoring of persuasive essays
[15] The imminence of... grading essays by computer
[16] Scikit-learn: Machine learning in Python
[17] Exit assessments: Evaluating writing ability through automated essay scoring
[18] Handbook of automated essay evaluation: Current applications and new directions
[19] Annotating argument components and relations in persuasive essays
[20] International Corpus of Learner English (Version 2)
[21] A neural approach to automated essay scoring
[22] Automated assessment of non-native learner essays: Investigating the role of linguistic features
[23] A framework for implementing automated scoring
[24] A new dataset and method for automatically grading ESOL texts
[25] Evaluating the performance of automated text scoring systems