key: cord-0440911-17b1yihd
authors: Premasiri, Damith; Ranasinghe, Tharindu; Zaghouani, Wajdi; Mitkov, Ruslan
title: DTW at Qur'an QA 2022: Utilising Transfer Learning with Transformers for Question Answering in a Low-resource Domain
date: 2022-05-12
journal: nan
DOI: nan
sha: a56c8e6b2db32abe2c38bbc4a78a4a895137d15d
doc_id: 440911
cord_uid: 17b1yihd

The task of machine reading comprehension (MRC) is a useful benchmark for evaluating the natural language understanding of machines. It has gained popularity in the natural language processing (NLP) field mainly due to the large number of datasets released for many languages. However, MRC research remains understudied in several domains, including religious texts. The goal of the Qur'an QA 2022 shared task is to fill this gap by producing state-of-the-art question answering and reading comprehension research on the Qur'an. This paper describes the DTW entry to the Qur'an QA 2022 shared task. Our methodology uses transfer learning to take advantage of available Arabic MRC data. We further improve the results using various ensemble learning strategies. Our approach achieved a partial Reciprocal Rank (pRR) score of 0.49 on the test set, demonstrating its strong performance on the task.

Machine Reading Comprehension (MRC) is a challenging Natural Language Processing (NLP) application (Baradaran et al., 2022). The concept of MRC is similar to how humans are evaluated in examinations, where a person must understand a text and answer questions about it. Similarly, a typical MRC task requires a machine to read a set of text passages and then answer questions about them. MRC systems can be applied widely in NLP systems such as search engines and dialogue systems. Therefore, the NLP community has shown great interest in MRC tasks over recent years. The most common way of dealing with MRC tasks is to train a machine learning model on an annotated dataset.
Over the years, researchers have experimented with machine learning approaches ranging from traditional algorithms such as support vector machines (Suzuki et al., 2002; Yen et al., 2013) to embedding-based neural approaches such as transformers, with the latter providing state-of-the-art results on many datasets. We discuss them thoroughly in Section 2. However, an annotated dataset is an essential requirement for these machine learning models. Recognising this, the NLP community has developed several datasets in recent years. The most popular MRC dataset is the Stanford Question Answering Dataset (SQuAD), which contains more than 100,000 annotated examples (Rajpurkar et al., 2016). The SQuAD dataset has been extended to several languages, including Arabic (Mozannar et al., 2019), Dutch (Rouws et al., 2022), Persian (Abadani et al., 2021) and Sinhala (Jayakody et al., 2016). However, MRC datasets have been limited to common domains such as Wikipedia, and MRC in low-resource domains, including religious books, has not been explored widely by the community (Baradaran et al., 2022). Moreover, most researchers focus on a few popular MRC datasets, while most other MRC datasets are not widely known and studied by the community (Zeng et al., 2020). The Qur'an QA 2022 shared task (Malhas et al., 2022) has been organised to address these gaps in MRC research. The goal of the shared task is to trigger state-of-the-art question answering and reading comprehension research on a book held sacred by more than 1.8 billion people across the world. The shared task relies on a recently released dataset of 1,337 question-passage-answer triplets extracted from the holy Qur'an (Malhas and Elsayed, 2020). Despite its novelty, the dataset poses several challenges. Firstly, since the dataset contains texts from the Qur'an, modern embedding models would have problems encoding them.
Therefore, we experimented with different pre-processing techniques to handle the texts from the Qur'an. Secondly, the dataset is small compared to other MRC datasets such as SQuAD (Rajpurkar et al., 2016), and it would be difficult to fine-tune state-of-the-art neural models on it. We experiment with techniques such as transfer learning and ensemble learning to overcome this, and show that state-of-the-art neural models can be applied to smaller MRC datasets using these methods. We address two research questions in this paper:

RQ1: Do self-ensembled transformer models provide better results than single models on the Qur'an QA task?

RQ2: Can other Arabic MRC resources be used, via transfer learning, to improve the results for Qur'an MRC?

The code of the experiments has been released as an open-source GitHub project [1]. The project has been released as a Python package [2], and the pre-trained machine learning models are freely available to download from the HuggingFace model hub [3]. Furthermore, we have created a Docker image of the experiments adhering to the ACL reproducibility criteria [4]. The rest of the paper is structured as follows. Section 2 presents an overview of MRC datasets and machine learning models. Section 3 describes the data we used in the experiments. In Section 4 we explain the experiments carried out. Section 5 discusses the results, answering the research questions. Finally, the paper outlines future work and provides conclusions.

Machine reading comprehension is not a new proposal. The earliest known MRC system dates back to 1977, when Lehnert (1977) developed a question answering program called QUALM. In 1999, Hirschman et al. (1999) constructed a reading comprehension system exploiting a corpus of 60 development and 60 test stories of 3rd to 6th-grade material. Due to the lack of high-quality MRC datasets and the poor performance of MRC models, this research field was understudied until the early 2010s. However, with the creation of large MRC datasets and the success of word embedding based neural models in NLP, research in MRC has become popular in recent years.
We present the related work in MRC in two broad categories: datasets and models.

Datasets. In 2013, Richardson et al. (2013) created the MCTest dataset, which contained 500 stories and 2,000 questions; it can be considered the first large MRC dataset. A breakthrough in MRC was achieved in 2015, when Hermann et al. (2015) defined a new dataset generation method that provides large-scale supervised reading comprehension datasets. This was followed by the creation of large-scale MRC datasets such as SQuAD (Rajpurkar et al., 2016). Later, the SQuAD dataset was expanded to many languages, including Arabic (Mozannar et al., 2019), Dutch (Rouws et al., 2022), French (d'Hoffschmidt et al., 2020) and Russian (Efimov et al., 2020). Furthermore, SQuAD has been extended to low-resource languages such as Persian (Abadani et al., 2021) and Sinhala (Jayakody et al., 2016), proving that SQuAD has been an important benchmark in MRC research. MRC datasets have been compiled for different domains such as news (Trischler et al., 2017), publications (Dasigi et al., 2021) and natural sciences (Welbl et al., 2017). As far as we know, the Qur'an Reading Comprehension Dataset used in this shared task is the first dataset created on religious texts (Malhas and Elsayed, 2020).

Methods. Most MRC systems in the early 2000s were rule-based or statistical models (Riloff and Thelen, 2000; Charniak et al., 2000). These models do not provide good results compared to the neural methods introduced in recent years (Baradaran et al., 2022).

[1] https://github.com/DamithDR/QuestionAnswering
[2] https://pypi.org/project/quesans/
[3] https://huggingface.co/Damith/AraELECTRA-discriminator-SOQAL and https://huggingface.co/Damith/AraELECTRA-discriminator-QuranQA
[4] https://hub.docker.com/r/damithpremasiri/question-answering-quran
Hermann et al. (2015) developed a class of attention-based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure. Since 2015, with the emergence of various large-scale supervised datasets, neural network models have shown state-of-the-art results in MRC tasks. Recently introduced transformer models such as BERT (Devlin et al., 2019) have already exceeded human performance on the related MRC benchmark datasets (Zeng et al., 2020). A critical contribution of the SQuAD benchmark is that it provides a system for submitting MRC models and a leaderboard displaying the top results. This has enabled the NLP community to keep track of state-of-the-art MRC systems, and other languages have followed this approach. However, the NLP community has focused mainly on improving system performance on popular benchmarks such as SQuAD, and has paid less attention to improving results on benchmarks with limited coverage, which we address in this research paper. MRC tasks are usually divided into four categories: cloze style, multiple-choice, span prediction, and free form (Liu et al., 2019). The Qur'an QA 2022 shared task belongs to the span prediction category, where the MRC system needs to select the correct beginning and end of the answer text from the context. The event organisers provided the QRCD (Qur'an Reading Comprehension Dataset), which contains 1,093 question-passage pairs coupled with their extracted answers to constitute 1,337 question-passage-answer triplets. QRCD is a JSON Lines (JSONL) file; each line is a JSON object that comprises a question-passage pair and its answers extracted from the accompanying passage. Figure 1 shows a sample training tuple. SQuAD is widely used as the standard dataset for English MRC tasks; therefore, using a machine translation of the same dataset is helpful for the learning process.
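For concreteness, reading such a JSONL file can be sketched as below. The field names in the sample record (pq_id, question, passage, answers) are our illustrative assumptions and may differ from the released QRCD schema.

```python
import json

def load_qrcd(path):
    """Load QRCD-style JSON Lines records into a list of dicts.

    Each non-empty line of the file is one JSON object holding a
    question-passage pair and its answers. Field names below are
    assumptions for illustration, not the authoritative schema.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records

# A record in the assumed schema might look like:
sample = {
    "pq_id": "q1",
    "question": "...",
    "passage": "...",
    "answers": [{"text": "...", "start_char": 0}],
}
```

A span-prediction model then consumes the question and passage together and is supervised with the answer text and its character offset.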
Compared to QRCD, SOQAL is a large dataset, and both datasets belong to the span prediction MRC category. Therefore, they can be used for transfer learning, which we describe in Section 4. With the introduction of BERT (Devlin et al., 2019), transformer models have achieved state-of-the-art results in different NLP applications such as text classification, information extraction (Plum et al., 2022) and event detection (Giorgi et al., 2021). Furthermore, transformer architectures have shown promising results on the SQuAD dataset (Zhang et al., 2021; Zhang et al., 2020; Yamada et al., 2020; Lan et al., 2020). In view of this, we use transformers as the basis of our methodology. Transformer architectures are trained on general tasks such as language modelling and can then be fine-tuned for MRC tasks (Devlin et al., 2019). For the MRC task, transformer models take as input a single sequence containing the question and the paragraph separated by a [SEP] token. The model then introduces a start vector and an end vector. The probability of each word being the start word is calculated by taking the dot product between the final embedding of the word and the start vector, followed by a softmax over all the words; the end position is computed analogously with the end vector. The word with the highest probability value is selected. The architecture of the transformer-based MRC model is shown in Figure 2. We experimented with several Arabic pre-trained transformer models (Antoun et al., 2020), which are available in the HuggingFace model hub (Wolf et al., 2020). For all the experiments we used a batch size of eight, the Adam optimiser with a learning rate of 2e-5, and a linear learning rate warm-up over 10% of the training data. During the training process, the parameters of the transformer model as well as the parameters of the subsequent layers were updated. The models were trained for five epochs, using only the training data. For some of the experiments we used an Nvidia GeForce RTX 2070 GPU, whilst for others we used a GeForce RTX 3090 GPU.
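The start/end selection just described can be illustrated with a small self-contained numpy sketch. The random toy matrix stands in for the transformer's final-layer token embeddings; the dimensions and variable names are ours, not the shared-task code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
seq_len, hidden = 12, 16

# Stand-in for the final embeddings of the 12 tokens in [question SEP passage]
H = rng.normal(size=(seq_len, hidden))
start_vec = rng.normal(size=hidden)  # learned start vector
end_vec = rng.normal(size=hidden)    # learned end vector

# Dot product of each token embedding with the start/end vector,
# followed by a softmax over all tokens:
p_start = softmax(H @ start_vec)
p_end = softmax(H @ end_vec)

# Pick the most probable start; constrain the end to not precede it.
start = int(p_start.argmax())
end = start + int(p_end[start:].argmax())
```

The answer span is then the tokens from `start` to `end` inclusive, mapped back to the passage text.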
The choice of GPU was based purely on hardware availability and was not a methodological decision. We further used the following fine-tuning strategies to improve performance. Ensemble learning is a popular technique in machine learning in which different models contribute to a single solution. As different machine learning algorithms tend to learn differently, the final predictions each provides can differ slightly; with ensemble learning, they can all contribute to the final output. Usually, ensemble learning provides better results than single models (Sagi and Rokach, 2018). The transformer models that we used as the base are sensitive to the random seed: the same architecture can produce different results for different random seeds (Uyangodage et al., 2021). To mitigate this, we performed self-ensembling: we trained the same architecture using five different random seeds and ensembled the output files using Algorithm 1.
Algorithm 1: Ensemble learning algorithm for MRC

Input: R, the set of result files (one per random seed); Q, the set of questions
for each question q_j in Q do
    a_j <- the set of unique answers for q_j across all result files
    for each unique answer a_{j,k} in a_j do
        for each result file r_i in R do
            for each answer a_{i,j,m} given for q_j in r_i do
                if a_{i,j,m} = a_{j,k} then
                    score_{j,k} <- average(score(a_{i,j,m}), score_{j,k})
                end if
            end for
        end for
    end for
    sort the ensembled answers for q_j by score and assign ranks
end for

One limitation of the QRCD dataset is that the training set contains only 710 annotated question-answer pairs; as a result, transformer models would find it difficult to fine-tune their weights properly during training. A common practice to overcome this is transfer learning. The main idea of transfer learning is to train a machine learning model in a resource-rich setting, save the weights of the model, and then initialise the training process for the lower-resource setting from those saved weights. Transfer learning has improved results for many NLP tasks such as offensive language identification (Ranasinghe and Zampieri, 2020), machine translation (Nguyen and Chiang, 2017) and named entity recognition (Lee et al., 2018). For this task, we first trained a transformer-based MRC model on the SOQAL dataset, which contains more training data than QRCD, as mentioned in Section 3. When we then started training on the QRCD dataset, we initialised the model with the weights saved from the SOQAL training.
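The self-ensembling of Algorithm 1 can be sketched in plain Python as follows. We assume here, for illustration, that each run's result file has been parsed into a mapping from question id to (answer, score) pairs, and we interpret the repeated pairwise averaging as averaging all scores assigned to the same answer string.

```python
from collections import defaultdict

def ensemble(runs):
    """Self-ensemble answer lists from several seeded runs.

    runs: list of dicts, one per random seed, each mapping a question
    id to a list of (answer_text, score) pairs. For every question,
    scores given to the same answer string by different runs are
    averaged, and the unique answers are re-ranked by the averaged
    score (highest first). A minimal sketch; the real I/O format of
    the result files is simplified away.
    """
    merged = {}
    qids = set().union(*(r.keys() for r in runs))
    for qid in qids:
        scores = defaultdict(list)
        for run in runs:
            for answer, score in run.get(qid, []):
                scores[answer].append(score)
        merged[qid] = sorted(
            ((ans, sum(s) / len(s)) for ans, s in scores.items()),
            key=lambda pair: pair[1],
            reverse=True,
        )
    return merged
```

The position of each answer in the sorted list is its rank, which is what the pRR evaluation consumes.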
In this section, we report the experiments we conducted and their results. As advised by the task organisers, we used the partial Reciprocal Rank (pRR) score to measure model performance. It is a variant of the traditional Reciprocal Rank evaluation metric that gives credit for partial matches. We also report Exact Match (EM) and F1@1 in the results tables; these metrics are applied only to the top predicted answer. EM is a binary measure that rewards a system only if the top predicted answer exactly matches one of the gold answers, whereas F1@1 measures the token overlap between the top predicted answer and the best-matching gold answer. The reported results are for the dev set. As can be seen in Table 2, the camelbert-mix model produced the best results, with a pRR of 0.549, closely followed by camelbert-ca and AraELECTRA-discriminator. Transformer models built specifically for Arabic generally outperformed multilingual models. To answer RQ1, we performed self-ensemble learning. Table 3 shows the results of the different models with ensembling. Although there was a slight improvement for AraELECTRA-discriminator, the overall impact of ensembling was very low, and some models performed worse with it. However, the results were more stable than those of the single models. Therefore, we retained self-ensemble learning even though it did not improve the results. With these findings, we answer RQ1: ensemble models do not provide better results than single models, but they do provide more consistent results. To answer RQ2, we performed transfer learning from SOQAL (Mozannar et al., 2019) to the QRCD dataset, as described in Section 4. We conducted these experiments only for the best model from the self-ensemble setting. As the results in Table 4 show, transfer learning improved the results for AraELECTRA-discriminator.
Without transfer learning, AraELECTRA-discriminator scored only 0.528 pRR, while with transfer learning it achieved 0.616 pRR. We did not observe improvements for the other transformer models. However, the 0.616 pRR obtained with transfer learning and AraELECTRA-discriminator was the best result on the dev set. With this, we answer RQ2: other Arabic MRC resources, such as SOQAL (Mozannar et al., 2019), can be used to improve the results for Qur'an MRC. We believe this finding will be important to researchers working on low-resource MRC datasets. Based on the dev-set results, we selected three models for the final submission: camelbert-mix with ensemble learning but without transfer learning, camelbert-mix with both transfer learning and ensemble learning, and AraELECTRA-discriminator with both transfer learning and ensemble learning. Table 5 shows the results the organisers reported on the test set for our submitted models. AraELECTRA-discriminator performed best on the test set too. The camelbert-mix model without transfer learning dropped from 0.549 to 0.290, a 47% decrease. However, the models with transfer learning performed comparatively well, confirming our answer to RQ2. In this paper, we have presented the system submitted by the DTW team to the Qur'an QA 2022 shared task at the 5th Workshop on Open-Source Arabic Corpora and Processing Tools. We have shown that AraELECTRA-discriminator with transfer learning from an Arabic MRC dataset is the most successful of the several transformer models we experimented with. Our best system scored 0.495 pRR on the test set. With RQ1, we showed that self-ensembled transformer models provide more stable results than single models on the Qur'an QA task. Revisiting RQ2, we showed that transfer learning can be used to improve MRC results on the Qur'an.
We believe that this finding will pave the way to enhancing MRC in many low-resource domains. Our code, software and pre-trained models have been made freely available to researchers working on similar problems. In future work, we would like to explore transfer learning further. We will explore cross-lingual transfer learning with larger English MRC datasets such as SQuAD, as cross-lingual transfer learning has shown strong results in many NLP tasks. Furthermore, we will explore zero-shot and few-shot learning, which could benefit a multitude of low-resource languages.

This project was partially funded by the University of Wolverhampton's RIF4 Research Investment Funding provided for the Responsible Digital Humanities lab (RIGHT). We would like to thank the Qur'an QA 2022 shared task organisers for running this interesting shared task and for replying promptly to all our inquiries. Furthermore, we thank the anonymous OSACT 2022 reviewers for their insightful feedback.

References:
- ParSQuAD: Machine translated SQuAD dataset for Persian question answering
- AraBERT: Transformer-based model for Arabic language understanding
- AraELECTRA: Pre-training text discriminators for Arabic language understanding
- A survey on machine reading comprehension systems
- Reading comprehension programs in a statistical-language-processing class
- A dataset of information-seeking questions and answers anchored in research papers
- BERT: Pre-training of deep bidirectional transformers for language understanding
- FQuAD: French question answering dataset
- SberQuAD - Russian reading comprehension dataset: Description and analysis
- Discovering Black Lives Matter events in the United States: Shared task 3, CASE 2021
- Teaching machines to read and comprehend
- InfoMiner at WNUT-2020 task 2: Transformer-based COVID-19 informative tweet extraction
- Deep Read: A reading comprehension system (ACL '99)
- The interplay of variant, size, and task type in Arabic pre-trained language models
- Mahoshadha, the Sinhala tagged corpus based question answering system
- ALBERT: A lite BERT for self-supervised learning of language representations
- Transfer learning for named-entity recognition with neural networks (European Language Resources Association (ELRA))
- The process of question answering
- Neural machine reading comprehension: Methods and trends
- AyaTEC: Building a reusable verse-based test collection for Arabic question answering on the holy Qur'an
- Overview of the first shared task on question answering over the holy Qur'an
- Neural Arabic question answering
- Transfer learning across low-resource, related languages for neural machine translation
- Biographical: A semi-supervised relation extraction dataset
- SQuAD: 100,000+ questions for machine comprehension of text
- BRUMS at SemEval-2020 task 12: Transformer based multilingual offensive language identification in social media
- Multilingual offensive language identification with cross-lingual embeddings
- An exploratory analysis of multilingual word-level quality estimation with cross-lingual transformers
- MCTest: A challenge dataset for the open-domain machine comprehension of text
- A rule-based question answering system for reading comprehension tests
- Dutch SQuAD and ensemble learning for question answering from labour agreements
- Ensemble learning: A survey (WIREs Data Mining and Knowledge Discovery)
- SVM answer selection for open-domain question answering
- NewsQA: A machine comprehension dataset
- Transformers to fight the COVID-19 infodemic
- Crowdsourcing multiple choice science questions
- Transformers: State-of-the-art natural language processing
- LUKE: Deep contextualized entity representations with entity-aware self-attention
- A support vector machine-based context-ranking model for question answering
- A survey on machine reading comprehension: Tasks, evaluation metrics and benchmark datasets
- SG-Net: Syntax-guided machine reading comprehension
- Retrospective reader for machine reading comprehension