key: cord-0653305-vd462vmc authors: Reyes, Lloyd Lois Antonie; Ibanez, Michael Antonio; Sapinit, Ranz; Hussien, Mohammed; Imperial, Joseph Marvin title: A Baseline Readability Model for Cebuano date: 2022-03-31 journal: nan DOI: nan sha: d357714566bb524cb47aa4b7b5cc54f0eb3b8d13 doc_id: 653305 cord_uid: vd462vmc
In this study, we developed the first baseline readability model for the Cebuano language. Cebuano is the second most-used native language in the Philippines with about 27.5 million speakers. As the baseline, we extracted traditional or surface-based features, syllable patterns based on Cebuano's documented orthography, and neural embeddings from the multilingual BERT model. Results show that combining the first two handcrafted linguistic feature sets yielded the best performance when used to train an optimized Random Forest model, with approximately 87% across all metrics. The feature sets and algorithm used are also similar to previous results in readability assessment for the Filipino language, showing potential for cross-lingual application. To encourage more work on readability assessment in Philippine languages such as Cebuano, we open-sourced both code and data.
The proper identification of the difficulty levels of reading materials is a vital aspect of the language learning process. It enables teachers and educators alike to assign appropriate materials that young learners can fully comprehend, preventing boredom and disinterest (Guevarra, 2011). However, assessing readability presents challenges, particularly when there is a large corpus of text to sift through. Manually extracting and calculating a wide range of linguistic features can be time-consuming and expensive, and can lead to subjective labels due to human error (Deutsch et al., 2020).
To tackle this problem, a growing body of research in the field has focused on experimenting with automated methods for extracting possible linguistic predictors to train models for readability assessment. While automating readability assessment is a challenge in itself, one of the original problems in the field starts with data. In the Philippines, the Mother-Tongue Based Multilingual Education (MTB-MLE) scheme was introduced by the Department of Education (DepEd) in 2013. Under this initiative, there were few to no available tools for automatically assessing the readability of reading resources, instructional materials, and grammatical materials in mother tongue languages other than Filipino, such as Cebuano, Hiligaynon, and Bikol (Medilo Jr, 2016). To answer this challenge, in this paper, we investigate various linguistic features ranging from traditional or surface-based predictors, orthography-based features from syllable patterns, and neural representations to develop a baseline readability assessment model for the Cebuano language. We use an array of traditional machine learning algorithms to train the assessment models with hyperparameter optimization. Our results show that non-neural features are enough to produce a competitive model for identifying the readability levels of children's books in Cebuano.
Readability assessment has long been studied by linguistic experts and book publishers as a method of measuring the comprehensibility of a given text or document. Villamin and de Guzman (1979) pioneered readability assessment for the Filipino language. Such formula-based techniques use hand-crafted indices and surface information from texts, such as hand counts of words, phrases, and sentences. An equivalent traditional formula was applied to the Waray language (Oyzon et al., 2015) to complement DepEd's MTB-MLE program in certain regions of the Philippines such as Samar and Leyte.
While traditional formulas relied on linear models, recent studies in readability assessment have shifted their focus to expanding the traditional method with more fine-grained features. Guevarra (2011) and Macahilig (2014) introduced the use of a logistic regression model trained with unique word counts, total word and sentence counts, and mean log of word frequency. A few years later, lexical, syllable pattern, morphological, and syntactic features were explored for the readability of Filipino text by Imperial and Ong (2021a, 2021b).
Cebuano (CEB) is an Austronesian language mostly spoken in the southern parts of the Philippines, such as in major regions of Visayas and Mindanao. It is the language with the second highest speaker count in the country with 27.5 million (https://www.ethnologue.com/language/ceb), just after Tagalog, from which the national language is derived, with 82 million speakers. Cebuano and Tagalog share linguistic similarities such as in derivation, prefixing, disyllabic roots, and reduplication (Blake, 1904). On the other hand, differences are seen in syntax such as the use of particles (ay, y), phonetic changes, and morphological changes on verbs. Figure 1 illustrates a portion of the Philippine language family tree, emphasizing where Cebuano originated. Cebuano is part of the Central Philippine subtree along with Tagalog and Bikol, which may account for the similarities and differences mentioned. The full image can be viewed in Oco et al. (2013).
We compiled the first Cebuano text corpus, composed of 277 expert-annotated literary pieces aligned to the first three grade levels (L1, L2, and L3) of Philippine primary education. For comparison to international grading systems, the standard age range for each level is 6-7, 7-8, and 8-9 respectively.
We collected the materials from three online, open-source book repositories: Let's Read, Bloom Library, and DepEd Commons. All materials are licensed under Creative Commons BY 4.0, which allows redistribution in any medium or format provided proper attribution is given. Table 1 shows the distribution of the collected corpus.
In this study, we extracted three linguistic feature groups from our Cebuano text corpus: traditional or surface-based features, orthography-based features, and neural embeddings. To the best of our knowledge, no study has explored the readability assessment of Cebuano text using these features.
Traditional or surface-based features are predictors that were used by experts in older readability formulas for Filipino, such as the sentence and word counts in Guevarra (2011). Despite claims that these features insufficiently measure deeper text properties for readability assessment (Redish, 2000), we still considered them for our baseline model development since this is the pioneering study for Cebuano. We adapted seven traditional features from existing works in Filipino (Imperial and Ong, 2020, 2021a,b): number of unique words, number of words, average word length, average number of syllables, total number of sentences, average sentence length, and number of polysyllable words.
Orthography-based features measure the character-level complexity of texts through combinations of various syllable patterns (Imperial and Ong, 2021b). As in Filipino, we adapted syllable patterns as features for the baseline model development but used only seven recognizable consonant-vowel combinations linguistically documented in the Cebuano language (Blake, 1904). We used consonant clusters and syllable pattern combinations of v, cv, cc, vc, cvc, ccv, and ccvc, normalized by the number of words.
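The surface-based counts and syllable pattern densities described above can be illustrated with a minimal sketch. Several choices here are assumptions rather than the authors' implementation: the whitespace tokenizer, the sentence splitter, treating the digraph "ng" as a single consonant, counting overlapping pattern matches on a per-letter consonant/vowel skeleton, and all function and key names. The two syllable-based traditional features (average number of syllables and number of polysyllable words) are left out because they require a syllabifier that the text does not specify.

```python
import re

VOWELS = set("aeiou")
SYLLABLE_PATTERNS = ["v", "cv", "cc", "vc", "cvc", "ccv", "ccvc"]


def tokenize(text):
    """Lowercase, split on whitespace, and keep only alphabetic characters."""
    cleaned = ("".join(ch for ch in tok.lower() if ch.isalpha()) for tok in text.split())
    return [tok for tok in cleaned if tok]


def surface_features(text):
    """Five of the seven traditional features (syllable-based ones omitted)."""
    words = tokenize(text)
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "unique_word_count": len(set(words)),
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "sentence_count": len(sentences),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }


def cv_skeleton(word):
    """Map a cleaned word to a string of 'c' and 'v' symbols, one per letter."""
    # Assumption: the digraph "ng" is treated as a single consonant.
    word = word.replace("ng", "ŋ")
    # Any letter that is not a plain vowel is counted as a consonant.
    return "".join("v" if ch in VOWELS else "c" for ch in word)


def syllable_pattern_densities(text):
    """Count each pattern over all words and normalize by the number of words."""
    words = tokenize(text)
    skeletons = [cv_skeleton(w) for w in words]
    n_words = max(len(words), 1)  # guard against empty input
    densities = {}
    for pattern in SYLLABLE_PATTERNS:
        # A zero-width lookahead counts overlapping occurrences of the pattern.
        total = sum(len(re.findall(f"(?={pattern})", s)) for s in skeletons)
        densities[f"{pattern}_density"] = total / n_words
    return densities
```

Calling `syllable_pattern_densities(text)` on a document returns one value per pattern; concatenating it with `surface_features(text)` gives a feature vector in the spirit of the TRAD and SYLL groups. A proper syllabifier would likely give different counts than this letter-level approximation.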
Transformer-based language model embeddings have been shown to be an effective substitute for handcrafted features in low-resource languages (Imperial, 2021). Probing tasks have shown that these representations contain information such as semantic and syntactic knowledge (Rogers et al., 2020), which can be useful in readability assessment. For this study, we extracted embedding representations with a dimension size of 768 from the multilingual BERT model (Devlin et al., 2019) as features for each instance from the Cebuano corpus. According to the training recipe of multilingual BERT, Cebuano data in the form of Wikipedia dumps was included in its development, which makes the model a viable option for this study.
The task at hand is a multiclass classification problem with three classes, namely the aforementioned grade levels. We specifically chose traditional learning algorithms such as Logistic Regression, Support Vector Machines, and Random Forest for building the baseline models, to allow the post-training interpretation techniques described in the succeeding sections. To reduce bias, we implemented k-fold cross-validation with k = 5. For the intrinsic evaluation, we used standard metrics such as accuracy, precision, recall, and macro F1-score. In addition, we used grid search to optimize the following model-specific hyperparameters: solver and regularization penalties for Logistic Regression; kernel type, maximum iterations, and regularization penalties for Support Vector Machines; and number of estimators, maximum features, and maximum depth for Random Forest.
To assess the effectiveness of the proposed framework, we examined model performance in three ablation studies: (a) linguistic features only, (b) neural embeddings only, and (c) the combination of the two via concatenation. The results of each optimized model on the given evaluation metrics are shown in Tables 2, 3, and 4.
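The experimental setup described above (5-fold cross-validation, grid search over Random Forest hyperparameters, and the four evaluation metrics) might look roughly like the following scikit-learn sketch. It is not the authors' released code: the data is a synthetic placeholder with the same number of documents (277) and an assumed 14 handcrafted features, the grid values are illustrative, macro averaging for precision and recall is an assumption, and the helper names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate


def make_placeholder_data(seed=0):
    """Synthetic stand-in for the feature matrix: 277 documents, 3 grade levels."""
    return make_classification(
        n_samples=277, n_features=14, n_informative=6, n_redundant=2,
        n_classes=3, n_clusters_per_class=1, random_state=seed,
    )


def tune_random_forest(X, y, seed=0):
    """Grid search over the three Random Forest hyperparameters named in the text."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    param_grid = {  # illustrative values, not the grid used in the paper
        "n_estimators": [50, 100],
        "max_features": ["sqrt"],
        "max_depth": [10, 20],
    }
    search = GridSearchCV(
        RandomForestClassifier(random_state=seed),
        param_grid, cv=cv, scoring="f1_macro",
    )
    search.fit(X, y)
    return search


def evaluate(model, X, y, seed=0):
    """Mean 5-fold accuracy, macro precision, macro recall, and macro F1."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    return {name: float(scores[f"test_{name}"].mean()) for name in scoring}


if __name__ == "__main__":
    X, y = make_placeholder_data()
    search = tune_random_forest(X, y)
    print(search.best_params_)
    print(evaluate(search.best_estimator_, X, y))
```

The same pattern applies to Logistic Regression and Support Vector Machines by swapping the estimator and the parameter grid. Scoring the selected model on the folds that were used to select it gives an optimistic estimate; a nested cross-validation would be stricter.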
Across the board, the best performing model and feature combination for Cebuano achieved approximately 87.3% on all metrics using the combination of TRAD and SYLL features with Random Forest. This top performing model makes use of 100 tree estimators, automatically adjusted maximum features, and a maximum depth of 20. Interestingly, the feature combination and the algorithm of choice are the same as for Filipino readability assessment, as seen in the work of Imperial and Ong (2021b). This may suggest that, despite language differences and similarities, surface-based features such as counts and syllable patterns are applicable to both Filipino and Cebuano in the readability assessment task. Referring again to Figure 1, both languages are part of the Central Philippine subtree, which opens the possibility of a cross-lingual application of linguistic features in future research. The effectiveness of surface-based features is also seen in the optimized Logistic Regression model, where using TRAD features obtained the best performance. In the case of the optimized Support Vector Machine model, the use of neural embeddings alone obtained better scores than the combination of traditional and syllable pattern features. This result affirms the observation in Imperial (2021) that the extracted neural embeddings can serve as substitute features and can be relatively on par with handcrafted features.
To understand which specific linguistic features contribute most during model training, we used two model interpretation algorithms specific to Random Forest models: permutation on the full model and mean decrease in impurity (MDI), as shown in Figures 3 and 2 respectively. Feature permutation recursively adds a predictor to a null model and evaluates the growth in accuracy, while mean decrease in impurity adds up all weighted impurity score reductions, or homogeneity, averaged over all tree estimators (Breiman, 2001).
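Both importance measures are available in scikit-learn, and a minimal sketch of how they might be computed follows. It uses `RandomForestClassifier.feature_importances_` for MDI and `sklearn.inspection.permutation_importance`, which shuffles one feature column at a time and records the drop in score; this is the standard formulation and may differ in detail from the procedure summarized above. The data is synthetic, and the feature names other than `v_density` and `cv_density` are hypothetical labels for the TRAD and SYLL features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Hypothetical names for the 7 TRAD and 7 SYLL features.
FEATURE_NAMES = [
    "unique_word_count", "word_count", "avg_word_length", "avg_syllable_count",
    "sentence_count", "avg_sentence_length", "polysyllable_count",
    "v_density", "cv_density", "cc_density", "vc_density",
    "cvc_density", "ccv_density", "ccvc_density",
]


def rank_features(seed=0):
    """Fit a Random Forest on placeholder data and return both importance scores."""
    X, y = make_classification(
        n_samples=277, n_features=len(FEATURE_NAMES), n_informative=6,
        n_redundant=2, n_classes=3, n_clusters_per_class=1, random_state=seed,
    )
    model = RandomForestClassifier(n_estimators=100, max_depth=20, random_state=seed)
    model.fit(X, y)

    # Mean decrease in impurity: normalized so the scores sum to 1.
    mdi = dict(zip(FEATURE_NAMES, model.feature_importances_))

    # Permutation importance on the fitted (full) model: mean drop in
    # accuracy over 10 random shuffles of each feature column.
    result = permutation_importance(model, X, y, n_repeats=10, random_state=seed)
    permutation = dict(zip(FEATURE_NAMES, result.importances_mean))
    return mdi, permutation


if __name__ == "__main__":
    mdi, permutation = rank_features()
    for name, score in sorted(mdi.items(), key=lambda kv: kv[1], reverse=True)[:5]:
        print(f"{name}: MDI={score:.3f}, permutation={permutation[name]:.3f}")
```

Because the data here is random, the resulting ranking carries no linguistic meaning; the sketch only shows the mechanics. Computing permutation importance on held-out data instead of the training set is a common variation when overfitting is a concern.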
In both feature importance results, the most important feature is v_density, or singular vowel density. This may indicate that the denser the vowels in a word, the more complex the text becomes. Likewise, both cv_density and consonant clusters emerged as the second top predictors in both analyses, which may suggest that in Cebuano, words with combined consonants and no intervening vowels are more apparent in complex sentences than in easier ones.
We also looked at model-independent feature analysis through Spearman correlation with respect to readability levels. Table 5 shows the top ten highly correlated features. In support of the findings described in Sections 6 and 7.1, all correlated linguistic features belong to the TRAD and SYLL feature sets, with the number of unique words at the top. This may suggest that the number of unique words tends to increase with the readability level. In addition, cv, cvc, and ccv densities are the only syllable pattern features that placed at the top in both model-dependent and model-independent feature interpretation techniques. This may hint at further potential as readability predictors for other text domains. Notably, the cv pattern is one of the most common consonant-vowel combinations in Cebuano (Zorc et al., 1976; Yap and Bunye, 2019).
We developed the first ever baseline machine learning model for readability assessment in Cebuano. Among the three linguistic feature groups extracted to build the model, the combination of traditional or surface-based features (TRAD) with syllable pattern based features (SYLL) produced the highest performance using an optimized Random Forest model. One of the main challenges in the field is the limited amount of tools and data, especially for low-resource languages (Vajjala, 2021).
To answer this call and encourage growth of research in this direction, we open-sourced the compiled dataset of annotated Cebuano reading materials and the code for model development.
Differences between Tagalog and Bisayan
Random forests. Machine Learning
Linguistic features for readability assessment
BERT: Pre-training of deep bidirectional transformers for language understanding
Development of a Filipino text readability index
BERT embeddings for automatic readability assessment
Exploring hybrid linguistic feature sets to measure Filipino text readability
Application of lexical features towards improvement of Filipino readability identification of children's literature
Diverse linguistic features for assessing reading difficulty of educational Filipino texts
A content-based readability formula for Filipino texts. The Normal Lights
The experience of mother tongue-based multilingual education teachers in Southern Leyte, Philippines
Dice's coefficient on trigram profiles as metric for language similarity
Validation study of Waray text readability instrument
Readability formulas have even more limitations than Klare discusses
A primer in BERTology: What we know about how BERT works
Sowmya Vajjala. 2021. Trends, limitations and open challenges in automatic readability assessment research
Pilipino readability formula: The derivation of a readability formula and a Pilipino word list
Cebuano grammar notes
The Bisayan dialects of the Philippines: Subgrouping and reconstruction. Pacific Linguistics
The authors would like to thank the anonymous reviewers for their valuable feedback. This project is supported by the Google AI Tensorflow Faculty Grant awarded to Joseph Marvin Imperial.