CompRes: A Dataset for Narrative Structure in News
Effi Levi, Guy Mor, Shaul Shenhav, Tamir Sheafer
2020-07-09

Abstract

This paper addresses the task of automatically detecting narrative structures in raw texts. Previous works have utilized the oral narrative theory of Labov and Waletzky to identify various narrative elements in texts of personal stories. Instead, we direct our focus to news articles, motivated by their growing social impact as well as their role in creating and shaping public opinion. We introduce CompRes, the first dataset for narrative structure in news media. We describe the process in which the dataset was constructed: first, we designed a new narrative annotation scheme, better suited for news media, by adapting elements from the narrative theory of Labov and Waletzky (Complication and Resolution) and adding a new narrative element of our own (Success); then, we used that scheme to annotate a set of 29 English news articles (containing 1,099 sentences) collected from news and partisan websites. We use the annotated dataset to train several supervised models to identify the different narrative elements, achieving an F1 score of up to 0.7. We conclude by suggesting several promising directions for future work.

1 Introduction

Automatic extraction of narrative structures from texts is a multidisciplinary field of research, combining discourse and computational theories, which has been receiving increasing attention over the last few years.
Examples include modeling narrative structures for story generation (Gervás et al., 2006), using unsupervised methods to detect narrative event chains (Chambers and Jurafsky, 2008), detecting content zones in news articles (Baiamonte et al., 2016), using semantic features to detect narreme boundaries in fictitious prose (Delmonte and Marchesini, 2017), identifying turning points in movie plots (Papalampidi et al., 2019) and using temporal word embeddings to analyze the evolution of characters in the context of a narrative plot (Volpetti et al., 2020). A recent and more specific line of work focuses on using the theory laid out in Labov and Waletzky (1967), and later refined by Labov (2013), to characterize narrative elements in personal experience texts. Swanson et al. (2014) relied on Labov and Waletzky (1967) to annotate a corpus of 50 personal stories from weblog posts, and tested several models over hand-crafted features to classify clauses into three narrative clause types: orientation, evaluation and action. Ouyang and McKeown (2014) constructed a corpus from 20 oral narratives of personal experience collected by Labov (2013), and utilized logistic regression over hand-crafted features to detect instances of complicating actions. While these works concentrated their efforts on detecting narrative elements in personal experience texts, we direct our focus to detecting narrative structure in news stories; the social impact of news stories distributed by the media, and their role in creating and shaping public opinion, incentivized our efforts to adapt narrative structure analysis to this domain. To the best of our knowledge, ours is the first attempt to automatically detect the narrative elements of Labov (2013) in news articles. In this work, we introduce CompRes, a new dataset of news articles annotated with narrative structure.
For this purpose, we adapted two elements from the narrative theory presented in Labov and Waletzky (1967) and Labov (1972, 2013), namely Complication and Resolution, while adding a new narrative element, Success, to create a new narrative annotation scheme which is better suited for informational text than for personal experience. We used this scheme to annotate a newly-constructed corpus of 29 English news articles, containing a total of 1,099 sentences; each sentence was tagged with a subset of the three narrative elements (or, in some cases, none of them), thus defining a novel multi-label classification task. We employed two supervised models to solve this task: a baseline model, which used a linear SVM classifier over a bag-of-words feature representation, and a deep-learning model, a fine-tuned pre-trained state-of-the-art language model (a RoBERTa-based transformer). The latter significantly outperformed the baseline model, achieving an average F1 score of 0.7. The remainder of this paper is organized as follows: Section 2 gives theoretical background and describes the adjustments we have made to the scheme in Labov (2013) in order to adapt it to informational text. Section 3 provides a complete description of the new dataset and of the processes and methodologies which were used to construct and annotate it, along with a short analysis and some examples of annotated sentences. Section 4 describes the experiments conducted on the dataset, and reports and discusses our preliminary results. Finally, Section 5 contains a summary of our contributions as well as several suggested directions for future work.

2 Narrative Analysis

The study of narratives has always been associated, in one way or another, with an interest in the structure of texts.
Ever since the emergence of formalism and structuralist literary criticism (Propp, 1968) and throughout the development of narratology (Genette, 1980; Fludernik, 2009; Chatman, 1978; Rimmon-Kenan, 2003), narrative structure has been the focus of extensive theoretical and empirical research. While most of these studies were conducted in the context of literary analysis, the interest in narrative structures has made inroads into the social sciences. The classical work by Labov and Waletzky (1967) on oral narratives, as well as later works (Labov, 1972, 2013), signify this stream of research by providing a schema for an overall structure of narratives, according to which a narrative construction encompasses the following building blocks (Labov, 1972, 2013): abstract, orientation, complicating action, evaluation, resolution and coda. These building blocks provide useful and influential guidelines for a structural analysis of oral narratives. Despite the substantial influence of Labov and Waletzky (1967) and Labov (2013), scholars in the field of communication have noticed that this overall structure does not necessarily comply with the form of news stories (Thornborrow and Fitzgerald, 2004; Bell, 1991; Van Dijk, 1988) and have consequently proposed simpler narrative structures (Thornborrow and Fitzgerald, 2004). In line with this stream of research, our coding scheme was highly attentive to the unique features of news articles. Special consideration was given to the variety of contents, forms and writing styles typical of media texts. For example, we required a coding scheme that would fit laconic or problem-driven short reports (too short for a full-fledged Labovian narrative style), as well as complicated texts with multiple story-lines moving from one story to another. We addressed this challenge by focusing on two out of Labov's six elements: complicating action and resolution. Providing answers to the potential question "And then what happened?"
(Labov, 2013), we consider these two elements to be the most fundamental and relevant for news analysis. There are several reasons for our focus on these particular elements: first, it is in line with the understanding that worth-telling stories usually consist of protagonists facing and resolving problematic experiences (Eggins and Slade, 2005); from a macro-level perspective, this can be useful to capture or characterize the plot type of stories (Shenhav, 2015). Moreover, these elements resonate with what are considered by Entman (2004) to be the most important framing functions: problem definition and remedy. Our focus can also open up opportunities for further exploration of other important narrative elements in media stories, such as identifying villainous protagonists, who are expected to be strongly associated with the complication of the story and to be instrumental to a successful resolution (Shenhav, 2015). In order to adapt the original complicating action and resolution categories to news media content, we designed our annotation scheme as follows. Complicating action, henceforth Complication, was defined in our narrative scheme as an event, or series of events, that points to problems or tensions. Resolution refers to the way the story is resolved or to the release of the tension. An improvement from, or a manner of coping with, an existing or hypothetical situation was also counted as a resolution. We did this to accommodate the lack of closure typical of many social stories (Shenhav, 2015) and the often tentative or speculative nature of future resolutions in news stories (Thornborrow and Fitzgerald, 2004). We therefore included in this category any temporary or partial resolutions. The transitional characteristic of the resolution led us to subdivide this category into yet another derivative category, defined as Success.
Unlike the transitional aspect of the resolution, which refers, implicitly or explicitly, to a prior situation, this category was designed to capture any description or indication of an achievement or of a good and positive state.

3 The CompRes Dataset

Here we describe the process of constructing CompRes, our dataset of news articles annotated with narrative structures. The dataset contains 29 news articles, comprising 1,099 sentences. An overview of the dataset is given in Table 1. We started by conducting a pilot study, for the purpose of formalizing an annotation scheme and training our annotators. For this study, samples were gathered from print news articles in the broad domain of economics, published between 1995 and 2017 and collected via LexisNexis. We used these articles to refine elements from the theory presented in Labov and Waletzky (1967) and Labov (2013) into a narrative annotation scheme better suited for news media (as detailed in Section 2.2), as well as to perform extensive training for our annotators. The result was a multi-label annotation scheme containing three narrative elements: Complication, Resolution and Success. Following the conclusion of the pilot study, we used the samples which were collected and manually annotated during the pilot to train a multi-label classifier for this task by fine-tuning a RoBERTa-base transformer (Liu et al., 2019). This classifier was later used to provide labeled candidates for the annotators during the annotation stage of the CompRes dataset, in order to optimize annotation rate and accuracy. The pilot samples were then discarded. The news articles for the CompRes dataset were sampled from 120 leading news and partisan websites in the English language, all published between 2017 and 2020.
The result is a corpus of 29 news articles comprising a total of 1,099 sentences, with an average of 39.3 sentences per article (standard deviation 21.8) and an average of 22.2 tokens per sentence (standard deviation 13.0). The articles are semantically diverse, as they were sampled from a wide array of topics such as politics, economy, sports, culture and health. For each article in the corpus, additional metadata is included in the form of the article title and the URL from which the article was taken (for future reference). The news articles' content was extracted using Diffbot. The texts were scraped and split into sentences using the Punkt unsupervised sentence segmenter (Kiss and Strunk, 2006). Some remaining segmentation errors were manually corrected. Following the pilot study (Section 3.1), a code book containing annotation guidelines was produced. For each of the three categories in the annotation scheme (Complication, Resolution and Success), the guidelines provide:

• A general explanation of the category
• Select examples of sentences labeled exclusively with the category

We employed a three-annotator setup for annotating the collected news articles. First, the model which was trained during the pilot stage (Section 3.1) was used to produce annotation suggestions for each of the sentences in the corpus. Each sentence was then separately annotated by two trained annotators according to the guidelines described in Section 3.4.1. Each annotator had the choice to either accept the suggested annotation or change it by adding or removing any of the suggested labels. Disagreements were later decided by a third expert annotator (the project lead). Table 2 reports inter-coder reliability scores for each of the three categories, averaged across pairs of annotators: the raw agreement (in percentage) between annotators, and Cohen's Kappa coefficient, which accounts for chance agreement (Artstein and Poesio, 2008).
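The two reliability measures just mentioned can be sketched in a few lines. This is a minimal illustration with invented toy annotations (not the actual CompRes labels), computing raw agreement and Cohen's Kappa for a single binary category between two annotators:

```python
# Minimal sketch: raw agreement and Cohen's Kappa for one binary category
# (e.g. Complication). The label vectors below are invented toy data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotator_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # one entry per sentence
annotator_b = np.array([1, 0, 0, 1, 0, 0, 1, 1])

raw_agreement = (annotator_a == annotator_b).mean()  # fraction of matching labels
kappa = cohen_kappa_score(annotator_a, annotator_b)  # corrects for chance agreement

print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

In a multi-label scheme such as ours, these scores are computed separately per category and then averaged across annotator pairs, as in Table 2.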
Categories vary significantly in their prevalence in the corpus; their respective proportions in the dataset are given in Table 1 . The categories are unevenly distributed: Complication is significantly more frequent than Resolution and Success. This was to be expected, considering the known biases of "newsworthiness" towards problems, crises and scandals, and due to the fact that in news media, resolutions often follow reported complications. Table 3 reports pairwise Pearson correlations (φ coefficient) between the categories. A minor negative correlation was found between Complication and Success (φ = −0.26), and a minor positive correlation was found between Resolution and Success (φ = 0.22); these were not surprising, as success is often associated with resolving some complication. However, Complication and Resolution were found to be completely uncorrelated (φ = 0.01), which -in our opinion -indicates that the Success category does indeed bring added value to our narrative scheme. In Table 5 we display examples of annotated sentences from the CompRes dataset. Note that all the possible combinations of categories exist in the dataset; Table 4 summarizes the occurrences of each of the possible category combinations in the dataset. The fact that the dataset is composed of full coherent news articles allows the analysis of a range of micro, meso and macro stories in narrative texts. For example, an article in the dataset concerning the recent coronavirus outbreak in South Korea 1 opens with a one-sentence summary, tagged with both Complication and Resolution: "South Korea's top public health official hopes that the country has already gone through the worst of the novel coronavirus outbreak that has infected thousands inside the country." 
(Complication, Resolution). This problem-solution (or, in this case, hopeful-solution) plot structure reappears in the same article, but this time it is detailed over a series of sentences: "More than 7,300 coronavirus infections have been confirmed throughout South Korea, killing more than 50." (Complication) "The South Korean government has been among the most ambitious when it comes to providing the public with free and easy testing options." (Success) The sequence starts with two sentences tagged with Complication, followed by two additional ones tagged with both Complication and Resolution, and concludes with a sentence tagged as Success. This example demonstrates a more gradual transition from problem through solution to success.

4 Experiments

We randomly divided the news articles in the dataset into training, validation and test sets, while keeping the category distribution in the three sets as constant as possible; the statistics are given in Table 7. The training set was used to train the supervised models for the task; the validation set was used to select the best model during the training phase (further details are given in Section 4.2), and the test set was used to evaluate the chosen model and produce the results reported in Section 4.5. For our baseline model, we used unigram counts (bag-of-words) as the feature representation. We first applied basic pre-processing to the texts: sentences were tokenized and lowercased, numbers were removed and contractions expanded. All the remaining terms were used as features. We utilized a linear SVM classifier with the document-term matrix as input, and employed the one-vs-rest strategy for multi-label classification. The validation set was used to tune the C hyperparameter of the SVM algorithm, via a random search on the interval (0, 1000], in order to choose the best model. In addition to the baseline model, we experimented with a deep-learning model, fine-tuning a pre-trained language model for our multi-label classification task.
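Before turning to the deep model, here is a minimal sketch of the bag-of-words SVM baseline described above, using scikit-learn. The toy sentences and labels are our invention, and the full pre-processing (number removal, contraction expansion) and the random search over C are omitted for brevity:

```python
# Sketch of the baseline: unigram counts + one-vs-rest linear SVM for
# multi-label classification. Training data here is invented toy data.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_sentences = [
    "the crisis deepened as losses mounted",            # Complication
    "officials announced a plan to ease the shortage",  # Resolution
    "the program was hailed as a major achievement",    # Success
    "the outbreak worsened despite new measures",       # Complication
]
# Binary indicator matrix: [Complication, Resolution, Success] per sentence.
train_labels = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]])

model = make_pipeline(
    CountVectorizer(lowercase=True),         # bag-of-words (unigram counts)
    OneVsRestClassifier(LinearSVC(C=1.0)),   # one binary SVM per category
)
model.fit(train_sentences, train_labels)

pred = model.predict(["losses mounted as the crisis deepened"])  # shape (1, 3)
```

In the setup above, C is fixed at 1.0 for illustration; in our experiments it is tuned on the validation set via random search over (0, 1000].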
We used the RoBERTa-base transformer (Liu et al., 2019) as our base language model, utilizing the transformers Python package (Wolf et al., 2019). We appended a fully connected layer over the output of the language model, with three separate sigmoid outputs (one for each of the narrative categories), in order to fine-tune it to our task. The entire deep model was fine-tuned for 5 epochs and evaluated against the validation set after every epoch, as well as after every 80 training steps. The checkpoint with the best performance (smallest loss) on the validation set was chosen as the final model. Finally, we tested the effect of data augmentation in our setup: both models were re-trained with training data augmented via back-translation. Back-translation involves translating training samples to another language and back to the primary language, thus increasing the size of the training set and potentially improving the generalization capacity of the model (Shleifer, 2019). For this purpose, we used Google Translate as the translation engine. Translation was performed to German and back to English, discarding translations that exactly matched the original sentence. Following the augmentation, the training set size almost doubled.

Table 5: Example sentences from the CompRes dataset (per-sentence Complication/Resolution/Success labels not reproduced here):

1. It is no surprise, then, that the sensational and unverified accusations published online this week stirred a media frenzy.
2. America would lose access to military bases throughout Europe as well as NATO facilities, ports, airfields, etc.
3. How did some of the biggest brands in care delivery lose this much money?
4. Bleeding from the eyes and ears is also possible after use, IDPH said.
5. The gentrification project, which concluded this year, included closing more than 100 brothels and dozens of coffee shops (where cannabis can be bought), and trying to bring different kinds of businesses to the area.
6. His proposal to separate himself from his business would have him continue to own his company, with his sons in charge.
7. Instead, hospitals are pursuing strategies of market concentration.
8. The South Korean government has been among the most ambitious when it comes to providing the public with free and easy testing options.
9. The husband and wife team were revolutionary in this fast-changing industry called retail.
10. With its centuries-old canals, vibrant historic center and flourishing art scene, Amsterdam takes pride in its cultural riches.
11. Mr. Trump chose to run for president, he won and is about to assume office as the most powerful man in the world.
12. Soon after, her administration announced a set of measures intended to curb misconduct.
13. Voter suppression is an all-American problem we can fight - and win.
14. Though many of his rivals and some of his Jamaican compatriots have been suspended for violations, Bolt has never been sanctioned or been declared to have tested positive for a banned substance.
15. The Utah man's mother, Laurie Holt, thanked Mr. Trump and the lawmakers for her son's safe return, adding: "I also want to say thank you to President Maduro for releasing Josh and letting him come home."
16. They were fortunate to escape to America and to make good lives here, but we lost family in Kristallnacht.
17. Historically, such consolidation (and price escalation) has enabled hospitals to offset higher expenses.

We report our test results in Table 6. First, we observe that the deep models significantly outperformed the baseline models: an average F1 score of 0.7 compared to 0.39/0.4, which represents an increase of 75% in performance. The improvement is evident for every one of the narrative categories, but is particularly substantial for the Success category: an F1 score of 0.56 compared to 0.15, an increase of 273%.
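The multi-label fine-tuning architecture described earlier (a fully connected layer with three sigmoid outputs over a language model, trained with binary cross-entropy) can be sketched in PyTorch. To keep the example self-contained and free of model downloads, a tiny bag-of-embeddings encoder stands in for RoBERTa-base; this stand-in, the layer sizes and the toy batch are our assumptions, not the actual setup:

```python
# Sketch of the multi-label head used for fine-tuning. In the paper this sits
# on top of RoBERTa-base; here a tiny random encoder stands in so the example
# runs without downloading a pre-trained model.
import torch
import torch.nn as nn

NUM_CATEGORIES = 3  # Complication, Resolution, Success

class NarrativeClassifier(nn.Module):
    def __init__(self, encoder: nn.Module, hidden_size: int):
        super().__init__()
        self.encoder = encoder                 # e.g. a RoBERTa encoder
        self.head = nn.Linear(hidden_size, NUM_CATEGORIES)

    def forward(self, token_ids):
        pooled = self.encoder(token_ids)       # (batch, hidden_size)
        return self.head(pooled)               # one logit per category

# Stand-in encoder: mean of token embeddings (an assumption, not RoBERTa).
hidden = 16
encoder = nn.EmbeddingBag(100, hidden, mode="mean")
model = NarrativeClassifier(encoder, hidden)

token_ids = torch.randint(0, 100, (4, 12))     # toy batch of 4 "sentences"
targets = torch.tensor([[1., 0., 0.], [1., 1., 0.], [0., 0., 1.], [0., 0., 0.]])

# Three independent sigmoid outputs -> binary cross-entropy per category.
loss_fn = nn.BCEWithLogitsLoss()
logits = model(token_ids)
loss = loss_fn(logits, targets)
loss.backward()

probs = torch.sigmoid(logits)                  # per-category probabilities
predictions = (probs > 0.5).int()              # multi-label decisions
```

Thresholding each sigmoid output independently at 0.5 is what makes the task multi-label: a sentence can be assigned any subset of the three categories, or none.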
One plausible explanation we can offer has to do with the nature of our Success category: while the Complication and Resolution categories seem to be constrained by sets of generic terminologies, the definition of Success is more content-oriented, and thus highly sensitive to specific contexts. For example, linguistically speaking, the definition of the success of an athlete in never having tested positive for a banned substance (see sentence #14 in Table 5) is very different from the definition of success in the cultural context of the art scene of a city (sentence #10 in Table 5). Generally, the performance for each category appears to reflect the proportion of instances belonging to that category (see Table 1). This is most evident in the baseline models: F1 scores of 0.61, 0.4 and 0.15 in the SVM model, and F1 scores of 0.61, 0.43 and 0.17 in the augmented SVM model, for Complication, Resolution and Success respectively. However, in the deep models this behavior seems to be less extreme; in the augmented RoBERTa model, the F1 score for the Success category is higher by 0.05 than that of the Resolution category, despite Success being less frequent in the dataset. We also observe that the Success category consistently exhibits notably higher precision than recall, across all models, possibly due to the smaller number of samples encountered by the classifier during training. This is generally true for the Resolution category as well (except in the case of the RoBERTa model), though to a lesser extent. Interestingly, the data augmentation procedure does not seem to have any meaningful effect on model performance, either for the baseline model (an increase of 0.01 in the average F1 score) or for the deep model (no change in the average F1 score).

5 Conclusions and Future Work

We introduced CompRes, the first dataset for narrative structure in news media.
Motivated by the enormous social impact of news media and their role in creating and shaping public opinion, we designed a new narrative structure annotation scheme better suited to informational text, specifically news articles. We accomplished this by adapting two elements from the theory introduced in Labov and Waletzky (1967) and Labov (2013), Complication and Resolution, and adding a new element, Success. This scheme was used to annotate a set of 29 articles, containing 1,099 sentences, which were collected from news and partisan websites. We tested two supervised models on the newly created dataset, a baseline linear SVM classifier over bag-of-words features and a fine-tuned pre-trained RoBERTa-base transformer, and performed an analysis of their performance with respect to the different narrative elements in our annotation scheme. Our preliminary results, an average F1 score of up to 0.7, demonstrate the potential of supervised learning methods in inferring the narrative information encoded in our scheme from raw news text. We are currently engaged in an ongoing effort to improve the annotation quality of the dataset and to increase its size. In addition, we have several exciting directions for future work. First, we would like to explore incorporating additional elements from the narrative theory in Labov (2013) into our annotation scheme; for example, we believe that the evaluation element may be beneficial for encoding additional information over existing elements in the context of news media, such as the severity of a Complication or the 'finality' of a Resolution.
A related interesting option is to add completely new narrative elements specifically designed for informational texts and news articles, such as actor-based elements identifying entities which are related to one or more of the currently defined narrative categories; for instance, as mentioned in Section 2.2, we may add indications for villainous protagonists, who are strongly associated with the complication of the story and are expected to be instrumental to a successful resolution. Another direction which we would like to explore is enriching the scheme with clause-level annotation of the different narrative elements, effectively converting the task from multi-label classification into sequence prediction: detecting the boundaries of the different narrative elements within the sentence. Alternatively, we could introduce additional layers of information which encode more global narrative structures in the text, such as inter-sentence references between narratively-related elements (e.g., a Resolution referencing its inducing Complication), or even between narratively-related articles (e.g., different accounts of the same story).

Acknowledgments

This research was partially supported by the Israel Science Foundation (grant No. 1400/14). We wish to thank Vered Porzycki and Avishai Green for their helpful comments on the CompRes code-book and for their careful tagging.

References

Artstein and Poesio (2008). Survey article: Inter-coder agreement for computational linguistics.
Baiamonte et al. (2016). Annotating content zones in news articles. CLiC-it.
Bell (1991). The language of news media.
Chambers and Jurafsky (2008). Unsupervised learning of narrative event chains.
Chatman (1978). Story and Discourse: Narrative Structure in Fiction and Film.
Delmonte and Marchesini (2017). A semantically-based computational approach to narrative structure.
Eggins and Slade (2005). Analysing casual conversation.
Entman (2004). Projections of power: Framing news, public opinion, and US foreign policy.
Fludernik (2009). An introduction to narratology.
Genette (1980). Narrative discourse (Figures III). Ithaca: Cornell.
Gervás et al. (2006). Narrative models: Narratology meets artificial intelligence.
Kiss and Strunk (2006). Unsupervised multilingual sentence boundary detection.
Labov (1972). Language in the inner city: Studies in the Black English vernacular.
Labov (2013). The language of life and death: The transformation of experience in oral narrative.
Labov and Waletzky (1967). Narrative analysis: Oral versions of personal experience.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Ouyang and McKeown (2014). Towards automatic detection of narrative structure.
Papalampidi et al. (2019). Movie plot analysis via turning point identification.
Propp (1968). Morphology of the folktale.
Rimmon-Kenan (2003). Narrative fiction: Contemporary poetics.
Shenhav (2015). Analyzing social narratives.
Shleifer (2019). Low resource text classification with ULMFiT and backtranslation.
Swanson et al. (2014). Identifying narrative clause types in personal stories.
Thornborrow and Fitzgerald (2004). Storying the news through category, action, and reason.
Van Dijk (1988). News as discourse. Hillsdale, NJ: Lawrence Erlbaum.
Volpetti et al. (2020). Temporal word embeddings for narrative understanding.
Wolf et al. (2019). HuggingFace's Transformers: State-of-the-art natural language processing.