key: cord-0806797-mc86laj4
title: Remediating textbook deficiencies by leveraging community question answers
authors: Ghosh, Krishnendu
date: 2022-04-11
journal: Educ Inf Technol (Dordr)
DOI: 10.1007/s10639-022-10937-5
sha: 2448b06e9a8d43ea9eff9c09a247e9e2d2564966
doc_id: 806797
cord_uid: mc86laj4

The paper presents a method for recommending augmentations against conceptual gaps in textbooks. Question Answer (QA) pairs from community question-answering (cQA) forums are noted to offer precise and comprehensive illustrations of concepts. Our proposed method retrieves QA pairs for a target concept to suggest two types of augmentations: basic and supplementary. Basic augmentations are suggested for the concepts on which a textbook lacks fundamental references. We identified such deficiencies by employing a supervised machine learning-based approach trained on 12 features concerning the textbook's discourse. Supplementary augmentations aiming for additional references are suggested for all the concepts. Retrieved QA pairs were filtered to ensure their comprehensiveness for the target students. The proposed augmentation system was deployed using a web-based interface. We collected 28 Indian textbooks and manually curated them to create gold standards for assessing our proposed system. The quality of these augmentations was quantified by analyzing expert opinions and adopting an equivalent pretest-posttest setup for the students. We evaluated the usability of the interface from students' responses. Both system and human-based evaluations indicated that the suggested augmentations addressed the concept-specific deficiency and provided additional materials to stimulate learning interest. The learning interface was easy-to-use and showcased these augmentations effectively.

The institutional education model, as we know, has been carried out in a classroom-based environment for centuries.
However, the last decade witnessed a massive paradigm shift with the introduction of "e-learning," offering quality education for everyone, anywhere and anytime (Al-Qatawneh et al. 2019; Gelderblom et al. 2019). Although e-learning systems have kept changing their content structures (from traditional to multi-modal and customized) over time (Kempe and Grönlund 2019), they have always been favored for their low-cost educational resources (Kopciewicz and Bougsiaa 2021; Nipa and Kermanshachi 2020). Moreover, no significant difference has been observed between the conventional classroom-based environment and self-study sessions supported by the e-learning paradigm, as more and more users embrace this change (Leonard et al. 2021; Schuyten et al. 1999). During the recent Covid-19 pandemic, schools remained closed across the globe (Kingsbury 2021). Consequently, e-learning platforms continue to enhance the way they present content knowledge through various modalities (Ghosh et al. 2021; Mershad et al. 2019; Mitra and Gupta 2020; Singh et al. 2021). However, textbooks are still considered one of the primary conduits for catering basic knowledge forms to the students. The quality of these textbooks often suffers due to (i) the lack of detailed illustration of the concepts (Mohammad and Kumari 2007), (ii) improper sequencing of the concepts (Tyson-Bernstein 1988), and (iii) the absence of adequate references (Mohammad and Kumari 2007). A substantial number of studies have been undertaken to augment textbooks and overcome these issues. Augmentations from these studies can be categorized into three classes: (i) bibliographic, (ii) component-based, and (iii) cQA-based. Bibliographic augmentations (e.g., Wikipedia articles or search-snippets) are suggested to offer additional references (Garcia-Gonzalez et al. 2016; Augustijn et al.
2018). Being elaborate in nature, these text-based augmentations address several related concepts instead of focusing on a particular one. Component-based augmentations are recommended in the form of a relevant image or video that illustrates a specific topical discourse (Agrawal et al. 2012b, 2014). However, such augmentations do not furnish any additional references for motivated students. cQA-based augmentations are provided in the form of QA pairs combining texts and relevant components (e.g., solved problems, figures, or real-life examples). They illustrate the target concepts using precise descriptions. Moreover, they come with the components necessary to understand them better and with additional references that stimulate the students' interests. QA pairs are also considered more comprehensible to the students than a formal textbook in addressing their conceptual gaps (Singh et al. 2015). With this motivation, recent studies started augmenting textbooks using cQA-based augmentations (Kumar and Chauhan 2019; Srba et al. 2019).

cQA platforms (like Yahoo! Answers 1 or WikiAnswers 2) allow users from different backgrounds to satisfy their information needs by browsing the historical archive of question-answer (QA) pairs and posing/answering direct questions. As the cQA forums gained popularity, their respective archives accumulated a huge number of question-answer pairs. These archives are now used to recommend relevant questions and, consequently, avoid the time-lag between asking a question and obtaining a personal response from the experts (Zhou et al. 2013). The popularity of the cQA platforms gradually led to their use in the education domain. There have been numerous efforts at developing educational cQAs at different scales (large-scale cQAs like Stack Exchange 3, MedHelp 4, Chegg 5, Piazza 6, or Brainly 7 and small-scale cQAs (Aritajati and Narayanan 2013; Srba and Bielikova 2015)). Srba et al.
(2019) analyzed the significance of such cQAs in educational environments. These cQAs were developed as intervention-based forums, offering little scope for inactive students who are reluctant to participate. Consequently, there have been studies to improve the student experience and augment textbooks using the rich educational content offered by these cQAs (Uluyol and Agca 2012). Combining two related subjects (e.g., Geometry and Visual Arts) over similar cQA items, Schoevers et al. (2019) enriched the overall learning process. Kumar et al. (2019) proposed an approach for augmenting each sentence of a textbook with cQA contents. However, these augmentations were recommended without identifying the conceptual deficiencies. Such augmentations often target an adequately addressed concept in the textbook but seldom enrich the deficient one.

In the present work, we have proposed a system to suggest textbook augmentations in the form of question-answer (QA) pairs after examining the conceptual deficiencies. The challenges in developing the proposed system have been:

1. Extracting concepts: With the availability of various annotation services, one could easily extract concepts from a textbook. However, the difficulty lay in extracting the key phrases which carry domain-specific significance.

2. Identifying deficiencies: Concept-specific deficiencies had not been studied for Indian school-level textbooks. One needed to codify the possible deficiencies and model them over the textbook quality factors (Agrawal et al. 2012a).

3. Retrieving augmentations: Existing studies performed textbook augmentations without investigating concept deficiencies (Singh et al. 2015). QA pairs addressing the concept deficiencies could be retrieved by (i) creating useful queries, (ii) developing a potent question retrieval model, and (iii) filtering the augmentations to align them with the target students.
The current work has proposed a textbook augmentation system that addresses concept-specific deficiencies present in the textbooks and furnishes additional references to stimulate learning interests. To diagnose the conceptual deficiencies, the current work investigated the factors which led to the deficiencies and modeled the diagnosis as a supervised classification problem. Once the concepts had been diagnosed, queries were generated, and, using a three-stage question retrieval approach, relevant QA pairs were retrieved from the popular community question-answering forum Stack Exchange 8. Stack Exchange is a network of sites covering a wide range of educational fields. So, it is possible to locate relevant augmentation items from a single archive for different subjects and domains. The QA pairs from the Stack Exchange sites also come with detailed and organized metadata. Such knowledge can be helpful to identify the validity of the questions, the acceptability of the answers, or the expertise of the users and, in turn, retrieve quality augmentation items. Due to its archive size and organized metadata, we have preferred Stack Exchange over other cQAs. Finally, we deployed the augmentation system through a web-based user interface to present the augmented textbook contents to the students.

Considering the challenges mentioned earlier, the main contributions of this work have been as follows:

1. The current work recommended two types of augmentations: (i) basic and (ii) supplementary. Basic augmentations address the concepts that have been deficient in the textbook. On the other hand, supplementary augmentations offer additional references to stimulate learning interests. All the augmentations have been filtered to ensure that they are comprehensible to the target students.

2. Possible deficiencies in the school-level textbooks were surveyed, and the associated factors have been identified.
The proposed deficiency diagnosis module, modeled over these factors, automatically identifies these deficiencies.

3. The augmentation retrieval module has been designed by combining question retrieval models with suitable reranking and filtering techniques. While the retrieval and reranking techniques ensured the relevance of the augmentations, the filtering process guaranteed that they were understandable to the target students.

4. An easy-to-use web-based platform has been developed where students can easily access the augmentations while reading a learning material.

5. The proposed augmentation system has been assessed using a thorough evaluation plan. System-based evaluation tasks were performed for the major modules of the proposed augmentation system. The quality of the augmentations and the usability of the interface were assessed using human-based evaluation plans.

The paper is organized as follows. Section 2 discusses the existing works related to the modules of the proposed system and summarizes the research gaps. Section 3 illustrates the architecture of our augmentation system, while its constituent modules are elaborated in subsequent Sections 4-7. Implementation details of the interface are presented in Section 8. Section 9 presents the evaluation strategies and the system performance. Section 10 concludes the present work with directions for future work.

Apart from rule-based domain-dependent approaches (Wang et al. 2021), existing studies for extracting the set of phrases indicative of the topical discourse of study materials are classified as supervised and unsupervised approaches. One of the major supervised techniques was implemented in the KEA project, which employed a Naive Bayes classifier to score candidate concepts based on their TF-IDF values and first occurrence (Witten et al. 2005). Nguyen improved the extraction accuracy using positional information and a set of morphological features (Nguyen and Kan 2007).
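As a concrete illustration, the two classic KEA features (TF-IDF and relative position of first occurrence) can be computed for single-word candidates in a few lines of Python. This is a minimal sketch, not the actual KEA implementation: the function name, the word-level tokenization, and the IDF smoothing are our own simplifications.

```python
import math
import re

def kea_features(doc, corpus):
    """For each candidate word in `doc`, compute the two classic KEA features:
    (TF-IDF score, relative position of first occurrence), where position 0.0
    means the word first appears at the very start of the text. `corpus` is a
    list of documents used to estimate document frequency."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    n = len(tokens)
    features = {}
    for word in set(tokens):
        tf = tokens.count(word) / n
        df = sum(1 for d in corpus if word in re.findall(r"[a-z]+", d.lower()))
        idf = math.log(len(corpus) / (1 + df))   # smoothed IDF
        first_pos = tokens.index(word) / n       # earlier occurrence -> smaller value
        features[word] = (tf * idf, first_pos)
    return features
```

KEA itself feeds these two values into a Naive Bayes classifier trained on author-assigned keyphrases; the sketch only shows the feature side.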
Recent approaches explored relevant features to model the concepts and their keyphraseness over the training data. However, supervised techniques are rarely a good option for developing domain-independent systems.

Several techniques extract concepts in an unsupervised manner. The first approach is based on different statistics-based measures, e.g., TF-IDF scores, context, and span of the candidate concepts (Campos et al. 2018; El-Beltagy and Rafea 2009). With the introduction of Wikipedia, several Wikipedia-based properties (e.g., Wikipedia-based keyphraseness (Medelyan et al. 2009), Wikipedia links, Wikipedia-based IDF, phrase embedding, keyphraseness, and title overlap (Papagiannopoulou and Tsoumakas 2020)) were employed to improve these systems. The second approach comprises graph-clustering methods, where a graph is generated using the candidate concepts as nodes and their relations as edges. This graph is clustered, and the representative concepts from each cluster are drawn out to form the set of key concepts. One of the earlier graph-clustering approaches, TextRank, modeled the relations between the concepts based on their binary co-occurrence within an M-word window (Mihalcea and Tarau 2004). This method was extended in SingleRank (Wan and Xiao 2008) and RAKE (Rose et al. 2010), where the edges were weighted by co-occurrence counts. The third approach employed embedding-based techniques for extracting concepts. Bennani-Smires et al. (2018) designed EmbedRank using two different embeddings (Doc2Vec and Sent2Vec) for representing a document and the underlying candidate concepts. The candidature of a concept was finalized based on the distance between these embeddings. While the first two approaches suffer from expensive training data collection and feature extraction steps, embedding-based methods are considered easier to implement.
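A minimal, dependency-free sketch of the TextRank idea (a co-occurrence graph over an M-word window, ranked by PageRank) might look as follows. The window size, damping factor, and iteration count are illustrative defaults, and real implementations rank POS-filtered candidate phrases rather than every word.

```python
import re
from collections import defaultdict

def textrank(text, window=4, d=0.85, iters=50):
    """Minimal TextRank sketch: build an undirected co-occurrence graph over
    words that appear within a `window`-word frame, then rank the nodes with
    unweighted PageRank (damping factor `d`)."""
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + window]:   # co-occurrence within the window
            if u != w:
                graph[w].add(u)
                graph[u].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):                  # power iteration
        score = {
            w: (1 - d) + d * sum(score[u] / len(graph[u]) for u in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)
```

SingleRank differs only in weighting each edge by its co-occurrence count; the iteration then divides by the weighted degree instead of the set size.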
However, concept extraction being one of the fundamental and well-studied natural language processing tasks, there is little scope left for improving its performance. Such tasks are often outsourced to API-based services (e.g., DBpedia or Tagme) that combine the available concept extraction features. The linked Wikipedia articles also offer supplementary details of the concepts. Consequently, recent works started relying on DBpedia, Tagme, or similar tools for concept extraction.

A substantial number of studies have been undertaken for (i) determining the factors affecting the overall quality of the textbooks (Pawlowski et al. 2007; Woodward et al. 2013), (ii) proper sequencing of the curriculum (Chambliss and Calfee 1998), and (iii) creating subject-wise guidelines for textbook creation (Crossley and Murby 1994). Summarizing these studies, textbook deficiencies can be categorized as style-centric and concept-specific. Style-centric deficiencies depend on two aspects. The first aspect concerns individual contrasts (e.g., intellectual capacity, reading competence, prior experiences, and personal interest). The second aspect relates to the readability of the materials (e.g., format, organization, or writing style) (Kieras and Dechert 1985). As the first aspect is not directly associated with the textbook itself, we analyze the latter using four linguistic variables, namely, vocabulary load (Lockheed and Hanushek 1988), sentence structure, idea density (Asheim 1958), and human interest (Agrawal et al. 2011a). Such factors have been integrated into several readability formulas (Flesch 1948; Dale and Chall 1948; Mc Laughlin 1969; Coleman and Liau 1975). The current work focuses on identifying the concept-specific deficiencies and, therefore, employs simple linguistic features (average sentence length, average word length, and average word familiarity). Good-quality textbooks are typically organized into chapters and subdivided into sections.
They should be designed such that each section (i) precisely illustrates a few related concepts, (ii) explains the topics using necessary figures, examples, or solved problems (Chambliss and Calfee 1998), (iii) discusses adequate and relevant references, (iv) follows a sequence so that a new concept is not encountered before all of its prerequisite concepts have been explained (Anderson et al. 1984), and (v) is arranged so that related concepts are discussed in consecutive sections without creating memory-burden for students (Agrawal et al. 2012a). Investigating the issues above, Agrawal et al. (2011a; 2012b) recommended enrichments for textbook sections found deficient, employing linguistic factors like readability, different forms of syntactic complexity, and dispersion between concepts. However, proceeding from the existing literature, concept-specific features need to be determined to diagnose concept-specific deficiencies.

The task of suggesting augmentations comprises three different avenues. The first approach surveys how different learning resources influence different factors of the learning process. The second retrieves augmentations from freely available Web resources and offers them to the students without any intervention. The third designs its own educational cQA, where the students are required to participate in active knowledge-sharing tasks. Adesope et al. (2017) compared concept maps, refutations, and expository text in diminishing concept-gaps among students. Purgina et al. (2020) established the effectiveness of gamification by developing a game, "WordBricks," for second-language acquisition. Klippel et al. (2019) showed how a virtual field trip could cut costs yet achieve a better learning experience and higher actual lab scores.
Wang and Xing (2019) designed a structural model to identify how elementary students' learning gains are affected by external factors like self-efficacy, learning motivation, learning strategies, and parents' education level.

Research on textbook augmentation was initiated with the introduction of Web resources and cQAs. Such augmentations are primarily composed of texts, relevant components (such as images, videos, or solved problems), or QA pairs. Koć-Januchta et al. (2020) introduced questions relevant to the topics discussed in an AI-enriched e-book to help in retaining and analyzing the concerned topics. In a similar study, Alsalem (2018) demonstrated how meta-cognitive strategies could be used in an e-book to develop better reading comprehension skills among DHH (deaf and hard of hearing) students. A few studies also incorporated AR (Augmented Reality) techniques for specific learning tasks: learning atoms' and molecules' reactions (Ewais and Troyer 2019) and learning biology concepts (Weng et al. 2020). Most of these works suggested augmentations without investigating the deficiencies of the textbooks. Such augmentations often enrich concepts that are already precisely explained but seldom address the deficient ones (Singh et al. 2015). Although they claim to provide better learning, it is prudent not to provide definitional augmentations for the concepts which have been discussed sufficiently in the textbook.

The growing popularity of cQAs motivated recent works on designing and developing educational cQAs. Apart from offering a platform for knowledge-sharing among the students, such cQAs work towards routing questions to the best experts (Macina et al. 2017), balancing load between the experts (Babinec and Srba 2017), and refining the answers based on content quality (Choi et al. 2015; Le et al. 2016). Social networking sites inspired Brainly 9 to incorporate real-time chats and discussion forums.
In subsequent systems like Piazza 10 or Askalot 11, teachers also take part in answering queries and examining contents. While such systems are beneficial for active students, passive students remain deprived without automatically recommended augmentations (Srba et al. 2019).

Limitations concerning the tasks of diagnosing deficiencies and recommending augmentations can be enumerated from the surveyed literature as follows.

- Concept-specific deficiencies have not been surveyed for Indian school-level textbooks. Accordingly, the factors responsible for such deficiencies have also not been investigated systematically.

- None of the existing textbook augmentation systems have diagnosed concept-specific deficiencies. Subsequently, the recommended augmentations have failed to address the deficient concepts precisely but have often enriched the adequately illustrated concepts.

The proposed augmentation system has been realized using four major modules: (i) concept extraction, (ii) deficiency diagnosis, (iii) query generation, and (iv) textbook augmentation, as outlined in Fig. 1. Let us consider a textbook T for grade level GL, organized over a set of sections S = {s_1, s_2, ..., s_m}. Each of these sections s_i is illustrated using a set of concepts C(s_i) = {c_1, c_2, ..., c_n}. From each section s_i, first, the concepts C(s_i) were extracted. Each concept c_j ∈ C(s_i) from section s_i was diagnosed to classify its deficiency Def(c_j): C → D, where D is the set of deficiency types or classes. For concept c_j, the subsequent module generated a basic query BQ(c_j), addressing deficiency Def(c_j), and a supplementary query SQ(c_j), which provides additional references on c_j. These queries were formed using the concept c_j, its deficiency type Def(c_j), and its neighboring context Con(c_j). Querying the input question archive QA with query BQ(c_j) (or SQ(c_j)), a final set of augmentations BA(c_j) (or SA(c_j)) was retrieved.
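The four-module flow can be sketched as a thin orchestration layer. Everything below (the Section class, the callable names, the D1-D4 label strings) is hypothetical scaffolding around the paper's notation, not the deployed system.

```python
from dataclasses import dataclass, field

# Hypothetical data model mirroring the notation: a textbook is a list of
# sections s_1..s_m, each illustrated by concepts c_1..c_n.
@dataclass
class Section:
    title: str
    text: str
    concepts: list = field(default_factory=list)

def augment_textbook(sections, extract, diagnose, make_queries, retrieve):
    """Wire the four modules together; each argument is a pluggable callable:
    extract(s) -> C(s_i), diagnose(c, s) -> Def(c_j) in {D1..D4},
    make_queries(c, def, s) -> (BQ(c_j), SQ(c_j)), retrieve(q) -> QA pairs."""
    results = {}
    for s in sections:
        s.concepts = extract(s)
        for c in s.concepts:
            deficiency = diagnose(c, s)
            bq, sq = make_queries(c, deficiency, s)
            results[c] = {
                # basic augmentations only target deficient concepts (D1-D3)
                "basic": retrieve(bq) if deficiency != "D4" else [],
                # supplementary augmentations are suggested for all concepts
                "supplementary": retrieve(sq),
            }
    return results
```

The conditional on D4 reflects the paper's design: basic augmentations address deficiencies, while supplementary ones are always offered.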
In the subsequent sections, the input-output specifications and the implementation details for each of the constituent modules are presented.

The glossary provided in a textbook is the most reliable resource for extracting domain-specific concept phrases (Mulvany 2009). In the absence of such expert-administered indexes for most textbooks, the present work proposed an alternate approach to extract the set of key concepts C(s_i) = {c_1, c_2, ..., c_n} present in each section s_i ∈ S of a textbook T for grade level GL, organized over a set of sections S = {s_1, s_2, ..., s_m}. To extract meaningful, relevant concepts (or entities), the current work used: (i) linguistic patterns (Agrawal et al. 2011a, b), (ii) the DBpedia Spotlight tool (Daiber et al. 2013), and (iii) glossary terms. Phrases with the linguistic pattern A*N+ (where A represents an adjective and N a noun), commonly considered as concepts (Toutanova et al. 2003), were first extracted from the textbook sections. The list of concepts was further populated with the concepts annotated by the DBpedia tool and mentioned in expert-labeled glossary terms. Only concepts exactly matching the title of a Wikipedia article were considered for the final list of concepts. The decision to use Wikipedia for validating the concepts was taken considering its coverage of the topics in basic subjects. However, this did not rule out the possibility of discarding a valid concept not present in Wikipedia.

This module identifies the deficiency Def(c_j): C → D for a target concept c_j ∈ C(s_i) from textbook section s_i. To introduce the deficiency types, let us first consider the properties of good-quality textbooks:

1. Readability: A good-quality textbook should avoid the use of long sentences, complex sentence structure, and clauses.
With reduced syntactic and word-level complexity, readability improves (Gray and Leary 1935; DuBay 2004).

2. Focus: Each textbook section should explain a set of concepts that are related to each other (Clark et al. 2011).

3. Unity: Instead of being described across various sections, a concept should be discussed in a particular section (considered the key section for the concept) (Chambliss and Calfee 1998).

4. Sequentiality: Concepts should be explained sequentially where (i) before introducing a new concept, all of its prerequisite concepts are explained, and (ii) all its related concepts are discussed in consecutive sections (Paas et al. 2003).

Based on the properties mentioned above, we associate a concept with one of the following deficiencies:

1. Focus-Deficiency (D1): Considering the property "Focus", the concepts of a textbook section should be related to each other. However, if a concept is sparsely related to the other concepts, which, on the other hand, share substantial accordance, it is considered focus-deficient. Relatedness between concepts is commonly presented using concept graphs, where nodes are made of concepts and edges represent their relatedness.

[Fig. 4 caption: An image to illustrate the movement of sound waves (Hewitt 2002). In the absence of such relevant learning supports, textbooks suffer from deficiency D3.]

Figure 2 offers a concept graph formed with concepts from a textbook section. In this section, the concept "momentum" is a D1-deficient concept as it shares relatively little connectivity with the other concepts in terms of relatedness. However, this relatedness can be increased in the presence of the concepts "impulse," "conservation of momentum," "elastic collision," or "inelastic collision." Therefore, the focus deficiency associated with the concept "momentum" can be remediated by augmenting this section with the mentioned missing links.

2.
Sequence-Deficiency (D2): Introducing a concept without mentioning its prerequisite concepts may create unnecessary memory-burden for the students (Agrawal et al. 2012a). A concept discussed before or after the section where it should ideally be discussed is considered sequence-deficient. The sequence of the concepts from a textbook section is depicted in Fig. 3.

3. Component-Deficiency (D3): Complex concepts are meant to be discussed in detail with essential figures, examples, and relevant problems. In the absence of such components, students often find it difficult to assimilate the associated concepts. Visualizing the concepts "compression" and "rarefaction" is relatively easier when accompanied by an image, as shown in Fig. 4.

4. No deficiency (D4): A concept not having any of the deficiencies mentioned earlier is termed not deficient.

Deficient concepts can be identified by examining several aspects (e.g., related concepts, prerequisite concepts, dependent concepts, or relevant components). Manual annotation for these aspects being infeasible, the current work relied on the conceptual structure of Wikipedia. Considering A(c) as the Wikipedia article associated with concept c, we defined the relevant aspects as follows.

Definition 1 Related concepts R(c): The set of concepts related to a concept c is represented as R(c) = {c_k}, where the Wikipedia articles A(c) and A(c_k) are linked (by in-links or out-links).

Definition 2 Prerequisite concepts P(c): The set of concepts prerequisite to a concept c is represented as P(c) = {c_k} if the article A(c) has in-links from each Wikipedia article A(c_k).

Definition 3 Dependent concepts D(c): The set of concepts dependent on a concept c is represented as D(c) = {c_k} if the article A(c) has out-links to each Wikipedia article A(c_k).

Definition 4 Relevant components RC(c): A component is considered relevant for a concept c if the concept or any of its related concepts is present in the title/caption of the component.
Accordingly, the set of relevant components RC(c) is defined as RC(c) = {com : ({c} ∪ R(c)) ∩ Title(com) ≠ ∅}, where Title(com) is the set of concepts present in the title/caption of a component com.

Definition 5 Concept graph G = (N, E): The concepts are modeled as a graph G = (N, E), where N and E stand for the set of nodes and edges, respectively. N models the set of concepts C. An edge e_ij ∈ E between concepts c_i and c_j exists if the Wikipedia articles A(c_i) and A(c_j) are connected (by in-links or out-links). The presence of single or multiple links between the articles is treated equally, as the significance of the links cannot be determined without an external knowledge-base.

Definition 6 Section-based distance dis(c_i, c_j): Let s(i) and s(j) be the key sections of the concepts c_i and c_j, respectively. For a pair of concepts c_i and c_j, the section-based distance dis(c_i, c_j) is defined as the number of sections between s(i) and s(j) and is formulated as dis(c_i, c_j) = |s(i) − s(j)|.

The heuristics-based annotation (based on Definitions 1-4) of different concepts calls for evaluating the quality of the annotation. Consequently, the concepts that had been annotated using the proposed heuristics were manually inspected. The result of this manual inspection is reported using the standard accuracy measure in Table 1. As the quality of these heuristics for predicting concept aspects was satisfactory, the conceptual structure of Wikipedia has been used to annotate the concepts with different aspects.

Relying on the Wikipedia hierarchy and the various aspects of concepts, we proposed a supervised classification-based approach for diagnosing deficiency. This approach learned a mapping function Def(c): C → {D1, D2, D3, D4} from a dataset consisting of ⟨Concept, D1/D2/D3/D4⟩ items collected from textbooks. A Support Vector Machine (SVM) with an RBF kernel, implemented using the libSVM python package (Chang and Lin 2011), was used as the classifier. The kernel type and the optimal parameters were determined using 10-fold cross-validation with validation and test sets.
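A comparable training setup can be sketched with scikit-learn's SVC (which is also backed by libSVM) in place of the libSVM bindings used in the paper. This is a minimal sketch under stated assumptions: the feature vectors, label names, parameter grid, and scaling step are placeholders, not the paper's actual configuration.

```python
# Minimal sketch of the deficiency classifier: an RBF-kernel SVM with
# cross-validated hyper-parameter selection, using scikit-learn.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_deficiency_classifier(X, y, cv=10):
    """Fit an RBF-kernel SVM on feature vectors X with labels y (D1-D4),
    tuning C and gamma by cv-fold cross-validation."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.1, 1]}
    search = GridSearchCV(model, grid, cv=cv)
    search.fit(X, y)
    return search.best_estimator_
```

In practice X would hold the 12 section- and concept-specific features defined in the next section, one row per annotated concept.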
The features representing different aspects of a concept/section and its deficiency fall into two major classes: section-specific and concept-specific. Section-specific features, having the same values for all concepts from a section, are considered to capture "Readability" and "Focus" across sections. The other properties of good-quality textbooks are reflected using concept-specific features. For deficiency diagnosis, we employed four section-specific and eight concept-specific features, as mentioned in Fig. 5 and defined below.

The four section-specific features considered in this work are:

1. Average sentence length: It is determined by averaging the number of words present in the sentences of the concerned textbook section.

2. Average word length: This is measured by averaging the number of letters present in each word of the concerned textbook section.

3. Average word familiarity: This measure is the normalized proportion of the words that are generally considered uncommon and is formulated as

familiarity(s) = |{w ∈ Word(s) : w ∉ DC}| / |Word(s)| (1)

where Word(s) is the set of words (allowing repetitions) in textbook section s and DC is the set of familiar words from the Dale long list (Dale and Chall 1948).

4. Dispersion between concepts: According to the property "Focus" (Agrawal et al. 2012a), understanding a textbook section discussing related concepts is easier than one comprising many unrelated concepts. This premise has been formulated as a measure of dispersion between the concepts of a textbook section following Agrawal et al. (2011a). Dispersion measures the fraction of unrelated concept pairs in the set of concepts C(s) present in textbook section s. Agrawal et al. (2011a) generated a concept-graph for the concepts C(s) with E edges and determined the fraction as 1 − |E| / (|C(s)|(|C(s)| − 1)/2). Dispersion between the concepts of a section was formally determined using Algorithm 1, following Agrawal et al. (2011a).
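Under the definitions above, dispersion reduces to the fraction of concept pairs in a section whose Wikipedia articles are not linked. The sketch below assumes a simple out-link mapping in place of real Wikipedia link data; the function names are ours, not Algorithm 1's.

```python
from itertools import combinations

def related(a, b, out_links):
    """Two concepts are related when their articles are linked in either
    direction (the concept-graph edges of Definition 5)."""
    return b in out_links.get(a, ()) or a in out_links.get(b, ())

def dispersion(concepts, out_links):
    """Fraction of concept pairs in a section that are NOT related:
    0 = every pair related, 1 = mutually independent concepts."""
    pairs = list(combinations(concepts, 2))
    if not pairs:
        return 0.0
    unrelated = sum(1 for a, b in pairs if not related(a, b, out_links))
    return unrelated / len(pairs)
```

For example, a section with concepts {momentum, impulse, sound} where only momentum and impulse are linked has one related pair out of three, giving a dispersion of 2/3.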
The dispersion value ranges between 0 and 1, where 0 corresponds to a section in which every concept pair is related, and 1 corresponds to a section with mutually independent concepts.

The concept-specific features used in this work are:

6. Average section distance for related concepts: Considering that related concepts are to be discussed in the same or consecutive sections, the section distance between two related concepts should be low. Based on this premise, the average section distance ASD_R(c) for a concept c is defined as:

ASD_R(c) = (1 / |R(c)|) Σ_{c_k ∈ R(c)} dis(c, c_k)

Let us consider a concept c located in Section 1. The concepts related to c are c_1 and c_2 from Section 2, and c_3 from Section 4. So, R(c) = {c_1, c_2, c_3}. Figure 6 illustrates how to calculate the average section distance between these related concepts. As c_1 and c_2 are located in the section next to the one where c is located, their distances are dis(c, c_1) = dis(c, c_2) = 1. Similarly, dis(c, c_3) = 3 and, finally, ASD_R(c) becomes 5/3.

7. Average section distance for prerequisites: This feature estimates whether or not the prerequisites are discussed before a concept. For concept c, the average section distance between prerequisite concepts, ASD_P(c), is defined as:

ASD_P(c) = (1 / |P(c)|) Σ_{c_k ∈ P(c)} dis(c, c_k)

8. Relevant component count: Understanding a concept is relatively easier with a higher number of components illustrating the target concept. The number of components relevant to concept c is defined as |RC(c)|.

To retrieve the basic and supplementary augmentations from the cQA archive QA for a concept c_j, two keyword-based queries were generated. The basic query BQ(c_j) was created by combining the concept c_j, its context Con(c_j), and its deficiency type Def(c_j). Supplementary queries SQ(c_j) were formed using the concept c_j and its context Con(c_j). The basic query was generated by choosing the context Con(c_j) as follows:

- For Focus-Deficiency (D1), the context was formed using all the concepts related to the concerned concept.
-For Sequence-Deficiency (D2), the context was formed using the prerequisite and dependent concepts.
-For Component-Deficiency (D3), the context was created using all the concepts that (i) are related to the concerned concept but (ii) have no relevant components in the present textbook section.
Consequently, the basic query BQ(c_j) for concept c_j was formulated as ⟨c_j, Con(c_j)⟩, with Con(c_j) defined per deficiency type as above. For the supplementary query, considering that the associations of a concept are reflected through its related concepts, the context Con(c_j) was formed using all the concepts present in R(c_j). The textbook augmentation module consists of three sub-modules: (i) Augmentation Item Retrieval, (ii) Structural Re-ranking, and (iii) Filtering Augmentations. To retrieve a ranked list of QA pairs from a QA archive, the proposed system employed a question retrieval model following Ghosh et al. (2017) to determine the similarities between ⟨query, QA⟩ pairs. Instead of overloading students with numerous augmentations, the current application aims to return highly relevant augmentations that are easily comprehensible to the target students. The initially retrieved augmentation items were reranked based on their relevance to the query, and the final list of augmentations was produced by filtering these reranked augmentations by the associated grade level. Following Ghosh et al. (2017), the current application modeled a similarity function between a query q and the QA pairs Q_i ∈ QA. This similarity function, devised over three major categories of features, returns a ranked list of augmentations Q_init, and each retrieved augmentation Q_i ∈ Q_init obtains a score IS_{Q_i}(q). The features used for learning the similarity function are categorized as follows. Lexical features compute the word-level similarity between a query and a QA pair; the following lexical features were used in the current work. 1.
Word n-gram overlap (n = 1, 2, and 3): the ratio of the number of n-grams shared by the ⟨query, QA⟩ pair to the number of n-grams present in the QA pair. 2. BM25 score: this feature approximates the similarity of a ⟨query, QA⟩ pair considering term frequency, inverse document frequency, and document length (Manning et al. 2010). 3. Cosine similarity: the cosine similarity between word n-gram (n = 1, 2, and 3) representations of the query and the QA pair, measuring their lexical similarity. The syntactic features are computed from the syntactic structure of the sentences: 1. Noun overlap: the count of nouns present in both the query and the candidate QA pair, normalized by the number of nouns present in the QA pair. 2. Verb overlap: computed analogously by normalizing the overlap count of verbs. 3. Dependence-pair overlap: dependence pairs encode the grammatical dependency relationships between the words of a sentence (Chen and Manning 2014). For example, "Alexander" is the passive nominal subject of the verb "impressed" in the sentence "Alexander was impressed by Porus"; this relationship is represented as the dependence pair "nsubjpass(impressed, Alexander)." The count of dependence pairs common to a ⟨query, QA⟩ pair, normalized by the number of dependence pairs in the QA pair, is used as a syntactic feature. These dependence pairs were obtained with the spaCy Python package using the Stanford parser (Chen and Manning 2014). 4. Named-entity overlap: named entities refer to real-world objects such as persons, locations, organizations, or products. The normalized count of named entities common to a ⟨query, QA⟩ pair is used as another feature; we used the Stanford CoreNLP Python APIs to determine named entities. Aiming to match latent conceptual meaning, the following semantic features were used in the current application. 1. Word alignment: a query that is similar to a QA pair often tends to show significant word-level alignments.
The similarity between a ⟨q, QA⟩ pair can therefore be quantified by aggregating the alignment probabilities of constituent word/phrase pairs, one taken from q and the other from QA. Phrase-pair alignment probabilities were estimated by training a word-alignment model with Giza++ on a parallel corpus of 100,000 similar QA pairs collected from Stack Exchange. Finally, the similarities between the query and QA pairs were determined via statistical machine translation using IBM Model 1. 2. Common frames: frames present a schema for stating particular relationships between the constituents of a sentence. For example, in the sentence "Alexander was impressed by Porus," the relationship between "Alexander" and "Porus" can be affirmed using the frame "Impression." Even when the same meaning is expressed by differently worded sentences, the different representations often contain similar frames. The normalized count of frames common to the query and the question is considered as the feature "common frames"; these frames were obtained using the SLING Python API with FrameNet frames (Baker et al. 1998). The similarity function, parameterized over the features mentioned earlier, was modeled using a deep neural network (DNN) comprising one input layer, two hidden layers, and an output layer, following Ghosh et al. (2017). The features were fed to the input layer, and the output layer produced the similarity. The weights were initialized in a pre-training step using the greedy layer-wise contrastive divergence technique (Carreira-Perpinan and Hinton 2005) and were then discriminatively fine-tuned with standard backpropagation. To promote QA pairs within the set of initially retrieved augmentations Q_init, a structural reranking technique was proposed in the present work based on the following assumption from Kurland and Lee (2005): a QA pair that is relevant to a query tends to be similar to most of the other QA pairs retrieved for the same query.
To illustrate, consider a case where, due to the absence of specific terms in q, a QA pair Q_1 gets a lower similarity score while QA pairs Q_2 and Q_3 are scored higher. Considering the terms from the top-ranked QA pairs to be significant, Q_1's score can be modified based on the extent to which Q_1 matches the top-ranked QA pairs Q_2 and Q_3. Similarity with the other QA pairs therefore provides support for re-assessing the rank of the associated QA pair, using three steps: (i) Support Graph Generation, (ii) Cumulative Support Determination, and (iii) Final Scoring. The support relationship between two QA pairs was determined from their similarity value. Stated formally, a QA pair Q_j is assumed to receive support from another pair Q_i ∈ Q_init if Q_i is retrieved against Q_j taken as a query. The support relationships among all QA pairs from Q_init were encoded using a support graph. Definition 8 (Edge weight function wt(.)): Let Top(Q_j) represent the set of top QA pairs {Q_i | Q_i ∈ Q_init − {Q_j}} for query Q_j with respect to the scoring function IS_{Q_i}(Q_j). The support Q_j receives from Q_i is then the score IS_{Q_i}(Q_j) when Q_i ∈ Top(Q_j), and zero otherwise. The construction of the support graph started by creating one node for each QA pair in Q_init. For each QA pair Q_j, the construction strategy identified the top similar QA pairs in Q_init − {Q_j} and established edges between each respective pair. This process was conducted iteratively for all the QA pairs in Q_init. There might be cases where an insignificant amount of support is propagated to a target pair from the others; consequently, QA pairs that achieved an average score in the first phase might suffer a hefty penalty. The current application applied the PageRank smoothing technique (Brin and Page 1998) to compensate for this discrepancy: accordingly, the graph G was smoothed using a smoothing parameter. The amount of support a QA pair Q_i propagates to a QA pair Q_j depends on the support Q_i acquires from the others.
Hence, the cumulative support CS_R(Q_j) is determined by aggregating, over the smoothed support graph, the support Q_j receives from the other QA pairs. For a query-QA pair (q, Q_j), the final score of Q_j was modeled as a function of the initial retrieval score IS_{Q_j}(q) and the cumulative support score CS_R(Q_j), where the associated combination parameters had been learned to achieve better retrieval performance. Reranking the QA pairs Q_j according to their scores SR_R(Q_j, q) against query q, a list of QA pairs Q_rerank was returned. The set of reranked augmentation items Q_rerank was then filtered based on its suitability for the population of the target grade level. The present work relied on the following assumptions to determine whether an augmentation relates to a target grade level GL. Let C_GL represent the set of concepts that appear in the indices of the books for grade level GL. The grade level of a concept is then the grade level at which it is first introduced:
Grade(c_j) = min{GL : c_j ∈ C_GL}
Accordingly, augmentation Q_k's grade level is the maximum of Grade(c_j) over the set of concepts C(Q_k) present in the augmentation. Q_final, the final set of augmentations, was formed by filtering out those QA pairs Q_k from Q_rerank whose grade level was higher than GL. The present work proposed a web-based interface to carry out the annotation tasks and cater the textbook contents and their augmentations to the students. Based on the type of user, this platform offers two different interfaces: the annotator's interface and the student's interface. After logging in on the homepage, users are redirected to the interface that suits their purpose. The homepage is shown in Fig. 7. Each annotator is assigned a set of textbooks of specific subjects and grade levels according to their expertise. Upon logging in, annotators are redirected to either of the annotation tasks: identifying concept deficiencies or determining the relevance of the recommended augmentations.
For the former task, annotators are shown the textbook contents with hyperlinked concepts. A drop-down option appears on selecting a concept, and the annotator can set a deficiency tag for the selected concept. After selecting deficiency tags for all the concepts of a textbook chapter, the annotator can submit the response. Figure 8 shows a snapshot of the annotator's interface for the task of identifying concept deficiencies. To determine the relevance of the recommended augmentations, annotators are shown the textbook contents with concepts linked to their augmentations. The recommended augmentations appear in the right panel on selecting a concept, and the annotator can set a relevance tag (relevant or irrelevant) for each augmentation. After relevance has been selected for the augmentations recommended for all the concepts, the annotator's response can be submitted and stored. A snapshot of the annotator's interface for this task is shown in Fig. 9. From the homepage, students are redirected to the students' interface, which offers a list of textbooks across subjects and grade levels. Figure 10 shows this interface. On receiving a specific textbook as input, the system loads the textbook sections and extracts the concepts from each section. The input textbook contents appear on the interface chapter-wise, with the concepts present in each chapter linked to the corresponding augmentations in the form of question-answer pairs. Screenshots of the students' interface are shown in Fig. 11, which shows the textbook contents of the 2nd chapter of the "Biology" textbook of grade 11. Fig. 8 A snapshot from the annotator's interface where the annotator is asked to tag the concept 'Alternaria' with its deficiency class. This instance is taken from the chapter 'Biological Classification' of the Biology textbook of grade level 11.
Fig. 9 A snapshot from the annotator's interface where the annotator is asked to tag the augmentations of the concept 'Alternaria' with their relevance tags. When a student clicks on a certain concept in the augmented textbook content, the linked augmentation, in the form of a relevant QA pair, appears in the same window. A sample augmentation is shown in Fig. 12 for the concept "cyanobacteria", selected from the 2nd chapter of the "Biology" textbook of grade 11. The current work furnished the same set of augmentations to all students; however, students can choose the concepts for which they need further elaboration. The evaluation focused on two broad aspects of the proposed system: (1) automation to detect conceptual deficiencies and relevant augmentations, and (2) the effect of the proposed cQA-based augmentations on the learning process. While the automation aspect is evaluated with respect to a set of carefully developed gold-standard data, the effect of cQAs on learning has been studied through interactions with students and teachers. With this focus, we have categorized the evaluation tasks as follows. To evaluate the proposed deficiency diagnosis model, the current work collected a corpus of textbooks and created gold-standard data tagging the concepts present in it. QA archives were also collected from the concerned Stack Exchange sites to retrieve cQA-based augmentations. The present work collected a corpus of Indian school-level textbooks issued by the National Council of Educational Research and Training (NCERT). The corpus, collected from 28 books of grade levels 6-12, covers seven subjects: Science, Mathematics, Chemistry, Physics, Biology, Geography, and Economics. It consists of 68,264 concepts distributed across 8,062 sections and 383 chapters. A QA archive was collected from six Stack Exchange sites related to the textbook subjects: Mathematics, Chemistry, Physics, Biology, Economics, and Earth Science.
Detailed statistics of the QA archive are presented in Table 2. The gold standards were created by engaging 12 teachers who had been teaching the respective subjects in different schools for at least five years. We assigned the annotation task for six subjects (Physics+Science, Chemistry, Mathematics, Biology, Geography, and Economics) such that each textbook was assigned to 2 teachers. They were asked to annotate the concepts with their deficiency classes (D1 / D2 / D3 / D4). The inter-annotator agreement was measured using Cohen's kappa (McHugh 2012) from the annotation matrix presented in Table 3. For classes D1, D2, D3, and D4, kappa values of 0.75, 0.72, 0.62, and 0.81 were achieved. While the annotators agreed on the D1, D2, and D4 classes to a satisfactory level, agreement for D3 was relatively low. The annotators were consulted on a subset of the cases of disagreement in D3; however, the agreement did not improve significantly. This can be attributed to the fact that, in the annotators' view, D3 is somewhat more subjective than the other classes. For the augmentation gold standards, the same 12 teachers annotated the recommended augmentations as relevant or irrelevant. We assigned the annotation task for the six subjects (Physics+Science, Chemistry, Mathematics, Biology, Geography, and Economics) such that each textbook was assigned to 2 teachers. The annotation matrix is presented in Table 4, and the inter-annotator agreement was again measured using Cohen's kappa (McHugh 2012); for this annotation task we achieved a satisfactory agreement with a kappa value of 0.71. The task of assessing the performance of the constituent modules of our proposed textbook augmentation system raises the following research questions. 1. SYS-RQ1: How effective is the deficiency diagnosis module? 2. SYS-RQ2: How effective is the augmentation retrieval module?
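The inter-annotator agreement figures above come from Cohen's kappa, which can be computed directly from the two annotators' label sequences. A minimal standard-library sketch, with purely illustrative labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's marginal label counts.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

a = ["D1", "D1", "D2", "D4", "D3", "D1"]  # illustrative deficiency tags
b = ["D1", "D2", "D2", "D4", "D3", "D1"]
print(round(cohens_kappa(a, b), 2))  # 0.77
```

Values above roughly 0.6-0.7 are conventionally read as substantial agreement, which is why the 0.62 obtained for D3 was flagged as relatively low.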
In the absence of a prior study on diagnosing concept-specific deficiencies, the current work used the section-specific variant presented by Agrawal et al. (2012a) as the baseline. Using the section-specific features mentioned in Section 5, this baseline system classified the deficiencies. The baseline was implemented using an SVM classifier, as in the proposed system. To evaluate the significance of the features, the dependence between the features was computed using the F-statistic (Guyon et al. 2002), the chi-square statistic (Zhai et al. 2018), bagged decision trees (Tran et al. 2017), and PCA (Song et al. 2010). A feature judged insignificant according to these statistics was not used in predicting the deficiency of a concept. This feature selection criterion was applied to both the baseline and the proposed approach, and the choice of kernels and parameter values was determined in the same way. While training and testing, we handled data imbalance by randomly selecting the same number of instances from each deficiency group (D1, D2, D3, and D4). Performance of the baseline and the proposed classification-based approach was evaluated using standard metrics: accuracy, Hamming loss, precision score, recall score, F1 score, Jaccard score, and zero-one loss. Predictions of both approaches were tested against the gold-standard data, as shown in Tables 5 and 6. For some measures a higher value is better (accuracy, precision score, recall score, F1 score, and Jaccard score), whereas a lower value is better for two measures, Hamming loss and zero-one loss; these two categories are denoted in the tables with ↑ and ↓, respectively. From both tables, it is observed that the proposed approach (i) modeled all the deficiency classes better than the baseline and (ii) performed almost equally well across all the subjects.
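For reference, the headline classification metrics above (accuracy and macro-averaged precision/recall/F1) can be sketched with the standard library; in practice a library such as scikit-learn also provides the remaining measures (Hamming loss, Jaccard score, zero-one loss). The labels below are purely illustrative, not the paper's data:

```python
def macro_prf(y_true, y_pred):
    """Accuracy plus macro-averaged precision/recall/F1 over deficiency classes."""
    classes = sorted(set(y_true) | set(y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precs, recs, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(p == c and t != c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precs.append(prec); recs.append(rec); f1s.append(f1)
    n = len(classes)
    return acc, sum(precs) / n, sum(recs) / n, sum(f1s) / n

y_true = ["D1", "D2", "D3", "D4", "D1", "D2"]  # gold deficiency tags (illustrative)
y_pred = ["D1", "D2", "D3", "D1", "D1", "D3"]  # classifier output (illustrative)
acc, p, r, f = macro_prf(y_true, y_pred)
print(round(acc, 2))  # 0.67
```

Macro averaging weights each deficiency class equally, which matches the balanced-sampling setup described above.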
Error cases in diagnosing deficiencies are primarily attributed to the extraction of irrelevant concepts. The concept extraction module may return irrelevant concepts when glossary terms are unavailable; the related, prerequisite, or dependent concepts of such spurious concepts are then missing from the concerned sections, and for lack of contextual significance these concepts are often falsely predicted as "not deficient." A few cases also stem from poor recall in concept extraction: if the concept extraction module is made more restrictive, the number of extracted concepts drops, well-established entities present in the textual discourse are not extracted as concepts, and consequently the related, prerequisite, or dependent concepts of a concept cannot be located. Conversely, owing to a lack of precision in concept extraction, extracted concepts are often predicted deficient even though all their related concepts are present in the concerned section. In the present study, we analyzed the performance of the augmentation retrieval module by determining the relevance of the recommended augmentations. We showed each augmentation to a teacher of the concerned subject and asked them to tag its binary relevance (relevant or irrelevant). Performance of the augmentation retrieval module is reported using six standard metrics: MAP (mean average precision), MRR (mean reciprocal rank), RP (R-precision), P@1 (precision at 1), P@5 (precision at 5), and P@10 (precision at 10) at three stages (Retrieval; Retrieval + Reranking; and Retrieval + Reranking + Filtering), as shown in Table 7. The overall retrieval performance is satisfactory, and applying the reranking and filtering steps yielded consistently better performance than the initial retrieval approach alone.
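The ranking metrics above are standard IR measures over binary relevance judgments; a minimal sketch (the 0/1 relevance lists below are illustrative, not the paper's data):

```python
def precision_at_k(rels, k):
    """rels: list of 0/1 relevance judgments in ranked order."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """Mean of precision values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def mean_reciprocal_rank(ranked_lists):
    """Average of 1/rank of the first relevant item per query."""
    rr = []
    for rels in ranked_lists:
        rank = next((i for i, r in enumerate(rels, start=1) if r), None)
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr)

runs = [[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]]  # two queries' judged rankings
print(round(sum(average_precision(r) for r in runs) / len(runs), 3))  # MAP: 0.642
print(mean_reciprocal_rank(runs))  # MRR: 0.75
print(precision_at_k(runs[0], 5))  # P@5 for the first query: 0.4
```

MAP and MRR reward placing relevant QA pairs early, which is why the reranking stage improves them even when the retrieved set is unchanged.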
A few cases are reported where no augmentations were retrieved because (i) relevant QA pairs were missing from the associated archive or (ii) the relevant QA pairs were aligned to a higher grade level. In other cases, the suggested augmentations suffered from poor mapping between queries and QA pairs: owing to the difficulty of expressing the deficiencies of a concept in natural language, the present application forms a keyword-based query from the concept and its context, and since such queries consist of numerous concepts, their intent often cannot be accurately matched with relevant QA pairs, so irrelevant augmentations are suggested. We relied upon the opinions of students and expert teachers to judge the quality of the augmentations and the usability of the user interface. These assessment tasks led to the following research questions: 1. HUM-RQ1: Do the augmented textbooks contribute to learning gain? 2. HUM-RQ2: How satisfactory are the augmentations in the view of the teachers? 3. HUM-RQ3: How usable is the interface for presenting the augmentations? To conduct a quantitative evaluation of the proposed augmentations, the current work experimented with 200 students enrolled in grade levels 9-12. The experiment comprised five major tasks; the experimental flow is depicted in Fig. 13. It compares two types of study materials: TB, a normal textbook chapter, and ATB, the same chapter enhanced with the augmentations inserted at appropriate places. The details of the tasks are mentioned below. -Grouping students (Task 1): The students were randomly divided into two homogeneous groups, a control group (G_C) and an experimental group (G_E), using a simple randomization technique (Kim and Shin 2014). This division ensured an equal number of students from the same grade level in both groups.
-Forming questionnaire (Task 2): Two sets of question papers, Q1 and Q2, each comprising ten questions from textbook chapter TB, were developed to test the students on the basic concepts and supplementary references. Questions addressing the basic conceptual gaps and the supplementary references were populated in equal proportion in both questionnaires. These questionnaires were prepared by teachers who had been teaching the respective subjects at the concerned grade levels in schools for at least 5 years, and all the questions are within the scope of the curriculum and the associated textbook. Content validity for these questionnaires was determined following Lawshe (1975). According to Lawshe, each question should be judged on how useful it is for the given task and classified as 'essential,' 'useful, but not essential,' or 'not necessary'; if at least half the experts rate a question 'essential,' the question has some content validity. We performed this test by asking 6 subject-matter experts (teachers) about each question and calculated its content validity ratio as
CVR = (N_e − N/2) / (N/2)
where N_e is the number of experts who tagged the concerned question as 'essential' and N is the total number of experts. Averaging the CVR over all the questions yields the content validity ratio of a questionnaire. CVR ranges from +1 to -1, and positive values indicate that at least half the experts rated the question as essential. For our questionnaires Q1 and Q2, the average content validity ratios are 0.49 and 0.56, respectively. -Pretest (Task 3): A pretest with Q1 was conducted to judge the learning levels of the students of both groups before they studied the associated materials. For each correct answer, 1 mark was awarded.
-Study session (Task 4): A study session was arranged on the same day as the pretest. In this session, students from G_E were offered the augmented textbook chapter ATB, whereas students from G_C were provided the standard chapter TB. Following (Spano 2006), (i) no students were informed about their engagements, and (ii) the textbook chapters were selected such that the participating students were able to understand the contents but had not studied the chapter earlier. -Post-test (Task 5): Three days after the study session, a test was conducted using questionnaire Q2. As in the pretest, 1 mark was awarded for each correct answer. Descriptive statistics of the students' performance are given in Table 8 for four subjects: Science, Geography, Physics, and Biology. To measure the quality of the suggested augmentations, the present work calculated the Kullback-Leibler divergence between the score distributions of the following pairs of groups; the divergence values are given in Table 9. The homogeneity of the control and experimental groups is established, as no significant difference was observed between the distributions G_C-pretest and G_E-pretest (divergence value of 0.02). The significance of the learning materials (both TB and ATB) is evidenced by their significantly different pretest and posttest distributions (divergence values of 0.29 and 0.37, respectively). Moreover, the augmented textbook chapter ATB was found somewhat more effective in developing better learning levels among the students (divergence value of 0.1). The overall effect of the suggested augmentations is illustrated in Fig. 14 using boxplot representations, while Figs. 15 and 16 present the effect of the augmentations on learning gains across grade levels and subject domains.
From these diagrams, it is observed that (i) in all the posttest scenarios, the performance of G_E surpassed that of G_C, and (ii) the performance of G_E is markedly better at the higher grade levels and their associated subjects. The overall improvement in the learning levels of the students of G_E establishes the effectiveness of the proposed augmentations across grade levels and subjects. To evaluate the augmentations qualitatively, the current work engaged 12 teachers who had been teaching the respective subjects for at least five years in different schools. The survey used a satisfaction questionnaire of 5 questions on a 5-point Likert scale (score 1 referring to very poor and 5 to very good), as listed in Table 10. The opinions of the teachers are shown in Fig. 17 for each of the survey questions. Analyzing the reliability of the questionnaire using Cronbach's alpha (Cronbach and Shavelson 2004), the internal consistency was found to be within the acceptable range (0.78). According to this qualitative evaluation, 45 out of 60 responses achieved a score of 3 or more on all the questions, indicating that the augmentations reached an overall satisfactory level. Although the augmentations offered relevant QA pairs addressing the textbook deficiencies and additional references on related concepts, there is scope for improvement in creating learning interest. Learning interest is commonly attributed to the student's metacognition and situational factors (Tsai et al. 2018); modeling these factors could help retrieve augmentations that trigger learning interest among the students. To assess the usability of our interface, 15 students were asked to respond to the questions/statements of two standard questionnaires, the System Usability Scale (Brooke 1996) and the IBM Computer System Usability Questionnaire (Lewis 1995), referred to below as "SUS" and "IBM," respectively. -SUS: This questionnaire contains 10 questions.
A response to each question is given on a 5-point Likert scale (0 to 4, where 0 denotes "completely disagree" and 4 "completely agree"). With this range of values, the highest possible raw score is 4 × 10 = 40; this score is converted to a 0-100 scale by multiplying by 2.5. The responses of the 15 students (denoted P1-P15) to each question are stated in Table 11 (opinions of the students on the usability of the learning platform using questionnaire SUS). Averaging the scores over the responses of the 15 students, we determined the usability score of our interface. Our designed interface achieved a usability score of 58.2 against the standard acceptable score of 54.4 (Bangor et al. 2009). This score can be considered valid, as the minimum sample size for the SUS test is 12 participants (Lewis and Sauro 2009). We analyzed the student responses to validate this questionnaire using Cronbach's alpha (Cronbach and Shavelson 2004) and achieved a favorable internal consistency of 0.72. For two questions (questions 5 and 6), the attained scores were comparatively lower; however, these questions address functional aspects of the interface rather than usability. -IBM: Questionnaire IBM comprises 19 statements, each answered on a 7-point Likert scale (1 to 7, with 1 referring to "completely agree" and 7 to "completely disagree"). We present the responses of 14 students to each statement in Table 12, as one student did not respond to all the questions. The average score over the 19 statements from the 14 students was 1.86, with a standard deviation of around 0.7, denoting an overall satisfactory level of usability. We also analyzed the student responses to validate this questionnaire using Cronbach's alpha (Cronbach and Shavelson 2004) and achieved a favorable internal consistency of 0.94.
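The SUS conversion and the reliability analysis above follow standard formulas. A minimal sketch, assuming each 0-4 item response has already been oriented so that higher is better (the published SUS procedure reverses alternate items before summing); all responses below are illustrative:

```python
def sus_score(responses):
    """One participant's SUS score: sum of ten 0-4 item scores, scaled to 0-100."""
    assert len(responses) == 10 and all(0 <= r <= 4 for r in responses)
    return sum(responses) * 2.5

def cronbach_alpha(rows):
    """Cronbach's alpha for a participants-by-items score matrix."""
    k = len(rows[0])  # number of items
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)
    item_vars = [var([row[j] for row in rows]) for j in range(k)]
    total_var = var([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

print(sus_score([3, 2, 3, 2, 1, 1, 3, 2, 3, 3]))  # 57.5
```

Alpha approaches 1 when the items vary together across participants, which is the sense in which the 0.72 and 0.94 values above indicate internally consistent questionnaires.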
However, we noted a few statements for which the average score was above 2.2, with a standard deviation above 0.7. This result suggests improving the interface's functionality, while usability remains at a satisfactory level. The present application offered a novel perspective on textbook augmentation by automatically aligning the concepts in learning materials with relevant QA pairs. To summarize, our proposed augmentations are helpful to an average student as well as to a motivated one. We surveyed the existing studies to list the factors affecting overall readability and conceptual deficiencies in textbooks, and a supervised machine-learning-based approach was adopted to diagnose these concept deficiencies. Our deficiency diagnosis model can be extended to related tasks (e.g., determining textbook quality, listing issues in designing course contents, or creating guidelines for writing textbooks). Keyword-based queries were formed by combining the concepts with their context and deficiency patterns; these queries were used to retrieve QA pairs from the cQA archives, and the relevant QA pairs were then recommended to the students. We presented system- and human-based evaluation strategies to analyze the performance of the proposed augmentation system, the overall quality of the augmentations, and the usability of the interface. Although our proposed system performed at a satisfactory level, it has certain limitations. The concept extraction and deficiency diagnosis models depend on the link structure of the Wikipedia knowledge base, and the absence of a Wikipedia title corresponding to a candidate concept leads to inappropriate pruning of that concept. The deficiency diagnosis model makes use of aspects such as related and prerequisite concepts that are extracted by applying heuristics over the Wikipedia link structure; these structures are induced by the concepts (and aligned Wikipedia titles) being probed.
Incorrect concept to Wikipedia title mapping that may arise due to ambiguity may lead to erroneous link structure and consequently extraction of incorrect concept aspects. The retrieval performance, on the other hand, highly depends on the size of associated QA archives and their comprehension level. Consequently, a small number of augmentations have been recommended for lower grade levels and subjects having smaller QA archives. It is to be noted that the augmentations are student agnostic. We believe that the mentioned limitations may lead to research studies that aim to automate the process of improving learning materials. Comparative effects of computer-based concept maps, refutational texts, and expository texts on science learning Empowering authors to diagnose comprehension burden in textbooks Mining videos from the web for electronic textbooks Identifying enrichment candidates in textbooks Enriching textbooks with images Data mining for improving textbooks To e-textbook or not to e-textbook? a quantitative analysis of the extent of the use of e-textbooks at ajman university from students' perspectives. Education and Information Technologies Exploring metacognitive strategies utilizing digital books: Enhancing reading comprehension among deaf and hard of hearing students in saudi arabian higher education settings Content area textbooks. Learning to read in American Schools: Basal Readers and Content Texts Facilitating students' collaboration and learning in a question and answer system Readability: An appraisal of research and application (book review). College & Research Libraries The living textbook: Towards a new way of teaching geo-science Education-specific tag recommendation in cqa systems The berkeley framenet project Determining what individual sus scores mean: Adding an adjective rating scale Simple unsupervised keyphrase extraction using sentence embeddings The anatomy of a large-scale hypertextual web search engine. 
Acknowledgments: The author is grateful to the students and teachers from Mount Litera Zee School, Nalanda, and Jion Marshal English Medium School, Kharagpur, for their participation in the meticulous annotation and assessment tasks.

Data Availability Statement: Data used in the current work is shared using Google Drive. Details are provided in a GitHub repository: https://github.com/KrishnenduGhosh/TB

Code Availability: Code implementing the current work is provided in a GitHub repository: https://github.com/KrishnenduGhosh/TB

Appendix. Participant responses to the usability questionnaire (statements Q1-Q5, participants P1-P15):

      P1  P2  P3  P4  P5  P6  P7  P8  P9  P10 P11 P12 P13 P14 P15
Q1     3   3   4   2   3   4   3   4   2   4   3   4   4   3   3
Q2     4   4   3   4   3   3   4   3   3   2   4   3   3   3   2
Q3     2   2   3   2   4   3   3   3   3   3   3   4   4   3   4
Q4     4   3   3   4   2   4   3   2   4   2   3   3   2   2   3
Q5     2   1   3   2   3   3   2   3   3   2   2   2
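Per-statement summary statistics of the kind reported earlier (average scores around 2.2 with standard deviations around 0.7 for the weaker statements) can be computed from a response table such as the one above. The sketch below uses Python's standard `statistics` module on the Q1 scores:

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of Likert scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Q1 responses from the appendix table (participants P1-P15).
q1 = [3, 3, 4, 2, 3, 4, 3, 4, 2, 4, 3, 4, 4, 3, 3]
mean, sd = summarize(q1)
# mean is approximately 3.27, sd approximately 0.70
```

Using the sample (n-1) standard deviation is a design choice; the paper does not state which variant was used, so treat the exact values as illustrative.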