key: cord-0046910-4b5lpl2g authors: Dascalu, Maria-Dorinela; Ruseti, Stefan; Dascalu, Mihai; McNamara, Danielle S.; Trausan-Matu, Stefan title: Multi-document Cohesion Network Analysis: Visualizing Intratextual and Intertextual Links date: 2020-06-10 journal: Artificial Intelligence in Education DOI: 10.1007/978-3-030-52240-7_15 sha: 48f7e783a2232a1fd91a34dad0948826578e22d3 doc_id: 46910 cord_uid: 4b5lpl2g Reading comprehension requires readers to connect ideas within and across texts to produce a coherent mental representation. One important factor in that complex process regards the cohesion of the document(s). Here, we tackle the challenge of providing researchers and practitioners with a tool to visualize text cohesion both within (intra) and between (inter) texts. This tool, Multi-document Cohesion Network Analysis (MD-CNA), expands the structure of a CNA graph with lexical overlap links of multiple types, together with coreference links to highlight dependencies between text fragments of different granularities. We introduce two visualizations of the CNA graph that support the visual exploration of intratextual and intertextual links. First, a hierarchical view displays a tree-structure of discourse as a visual illustration of CNA links within a document. Second, a grid view available at paragraph or sentence levels displays links both within and between documents, thus ensuring ease of visualization for links spanning across multiple documents. Two use cases are provided to evaluate key functionalities and insights for each type of visualization. Comprehension is a difficult and challenging process, for which learners need to understand words and sentences, connect ideas and link them to prior knowledge, while creating a coherent mental representation of the read text. One important factor in the comprehension process regards the cohesion of text [1] , which considers the degree to which there are semantic links between ideas within a text. Cohesion is higher when there are multiple ideas and words that overlap and when the connections between ideas are explicit. Low cohesion text is more challenging to understand, particularly for low knowledge and less skilled readers [2] . The process of overcoming cohesion gaps is even more challenging when learners are faced with multiple documents that require establishing connections both within and between disparate text fragments. Making connections across multiple texts is considerably more difficult than doing so within a single text. Some text fragments may be semantically linked, while other may be isolated, distal, and thus more difficult to recognize or infer. While text cohesion is recognized as an important factor in comprehension and learning from text, there is currently no technique or tool available to visualize cohesive links between documents. We address this gap here, introducing Multi-document Cohesion Network Analysis (MD-CNA). CNA [3] relies on advanced natural language processing techniques, together with Social Network Analysis [4] measurements applied on the cohesion graph, to model discourse structure in terms of semantic links. The MD-CNA graph is a multi-layered graph that establishes semantic links between text elements of different granularities (i.e., the entire text, paragraphs, or sentences), including hierarchical inclusion links and links among elements of the same level. MD-CNA can be used to model both local and global cohesion, as it reflects the underlying semantic content of discourse within a document or between multiple texts [5] . In this study, we extend the CNA graph with lexical overlap links of two types (i.e., topic and content), together with coreference links, to better highlight dependencies between text fragments at different levels. We also introduce visualizations that highlight filtered links from the extended CNA graph, both within and between documents. The CNA graph [3] is centered mainly on semantic links computed using various models (e.g., Latent Semantic Analysis [6] , Latent Dirichlet Allocation [7] , word2vec [8] , FastText [9] , or Glove [10] ), that can be established either between text elements of the same level (e.g., among sentences), or between different layers of the hierarchy (e.g., sentences relating to the constituent paragraphs). The CNA graph was extended for this study with two new types of links. First, lexical overlap is computed as a Jaccard distance over a bag of word representation of the text elements. Two types of measurements are performed after preprocessing the text using in spaCy 1 . Content overlap considers the usage of content words which include useful information from the text (i.e., lemmatized forms of words having as part-of-speech one of the following: nouns, verbs, adjectives, or adverbs). Topic overlap considers a more constrained view which takes into account only lemmas responsible for text contextualization and inducing actions (i.e., only nouns and verbs). Second, coreference links are identified using NeuralCoref 2 , which includes a mentions-detection module based on rules built on top of spaCy, together with a feedforward neural-network to identify relevant pairs of mentions. The resulting clusters of co-referring mentions are used to enrich the CNA graph structure. All follow-up visualizations rely on this extended CNA graph that is rendered both using only one reference text, as well as sequences of documents. Two types of visualizations are introduced here, together with preliminary use cases to illustrate their extensive applicability. First, a hierarchical view groups nodes by granularity level (i.e., document, paragraph, and sentence), followed by the rendering of different types of links from the MD-CNA graph using a tree-structure of discourse (see Fig. 1 ). The corresponding use case explores differences between high-low cohesion documents from the study performed by McNamara, Louwerse, McCarthy and Graesser [11] . All types of links are filtered within the rendered visualizations by minimum similarity thresholds available for each type of link. These values can be easily adjusted within the user interface. For this use case, topic overlap was set at 0.4, content overlap was established at 0.3, and high level of semantic similarity (0.7) was imposed. Sentences from the same paragraph share its color. The size of each node is proportional to its semantic degreei.e., the sum of all in-bound and out-bound semantic links above a statically imposed threshold, which ensures a sufficiently high semantic relatedness based on the context and readers (for Fig. 1 , we considered the average plus standard deviation of all links, at each analysis level). On mouse-over, the link is colored in red, and a tooltip is displayed containing relevant details, including: link type, inter-connected text elements, similarity value (for content and semantic links) or pairs within the coreference cluster identified between the two nodes. The text from Fig. 1 .a has low cohesiononly 2 semantic links are above the imposed threshold (i.e., links between sentences 1.1-1.4, and 1.6-1.9 respectively). The text was modified to increase its cohesion and, as expected, there are considerably more links (2 versus 4 topic overlap links, 6 versus 6 content overlap links, 6 versus 14 semantic links, and 3 versus 8 coreference links; 17 versus 32 total links), covering more text elements which are distributed throughout the entire document. Moreover, the semantic degree of most nodes is higher, mainly in the first 2 paragraphs. The hierarchical view depicts only within document links, as the input consists of one text. The views are useful for analyzing text structure and cohesion, both locally at sentence level, as well as globally, between paragraphs. We can also observe cohesive sections of text and potential cohesion gaps, further providing improvement recommendations in terms of structure. Importantly, MD-CNA affords a visual illustration of cohesive links within a document, affording greater ease for researchers and educators in recognizing text cohesion and potentially increasing it for students. Second, the grid view ensures ease of visualization for links spanning across multiple documents (see Fig. 2 ). The corresponding use case explores the task of multidocument comprehension on the collection of four documents used in the experiments performed by Nicula, Perret, Dascalu and McNamara [5] . Topic overlap was set at 0.1 due to a more diverse vocabulary, content overlap was kept at 0.2, while semantic similarity was increased to 0.75 to reduce the clutter generated by a dense semantic network. This visualization shows connections both within (curved lines) and between documents (straight lines). The view can be rendered at two granularity levels (i.e., paragraph and sentence); for the second option, sentences have the same color as their corresponding paragraph. Documents are rendered as different columns in the grid, with constituent text elements displayed sequentially. As it can be observed, all documents are tightly related, with the 1 st and 4 th document containing many intratextual and intertextual semantic links. This second view enables researchers and educators to easily identify and trace semantically similar text segments between multiple documents, as well as to provide support to better target representative information (e.g., encourage bridging across multiple texts). This view can also be used to guide tutors to adequately order texts for presentation to learners, as well as to formulate comprehension questions that address a cohesive context spanning across multiple texts. In summary, the views provided in this study represent visual aids for researchers and educators to adequately evaluate and select texts to maximize cohesion flow and ease presentation of reading material. Our visualizations are designed to also scaffold readers to establish connections between texts and integrated concepts across documents, facilitating a more coherent understanding from separate sources of information. SERT: self-explanation reading training Reversing the reverse cohesion effect: good texts can be better for strategic, high-knowledge readers Cohesion network analysis of CSCL participation Social Network Analysis Predicting multi-document comprehension: cohesion network analysis A solution to Plato's problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge CNA graph for multi-document analysis at paragraph level Latent Dirichlet allocation Efficient estimation of word representation in vector space Enriching word vectors with subword information Glove: global vectors for word representation Coh-Metrix: capturing linguistic features of cohesion Acknowledgments. This research was been funded by the "Semantic Media Analytics -