Answering Event-Related Questions over Long-Term News Article Archives
Jiexin Wang, Adam Jatowt, Michael Färber, Masatoshi Yoshikawa
Advances in Information Retrieval, 2020-03-17. DOI: 10.1007/978-3-030-45439-5_51

Long-term news article archives are valuable resources about our past, allowing people to learn detailed information about events that occurred at specific points in time. To make better use of such heritage collections, this work considers the task of large-scale question answering over long-term news article archives. Questions on such archives are often event-related. In addition, they usually exhibit strong temporal aspects and can be roughly divided into two types: (1) questions containing explicit temporal expressions, and (2) questions only implicitly associated with particular time periods. We focus on the latter type, as such questions are more difficult to answer, and we propose a retriever-reader model with an additional module that reranks articles by exploiting temporal information from different angles. Experimental results on a carefully constructed test set show that our model outperforms existing question answering systems, thanks to the additional module that finds more relevant documents.

With the application of digital preservation techniques, more and more old news articles are being digitized and made accessible online. News article archives help users learn detailed information about events that occurred at specific points in the past and constitute part of our heritage [1]. Some professionals, such as historians, sociologists, and journalists, need to work with these time-aligned document collections for a variety of purposes [2]. Moreover, average users can verify information about the past using original, primary resources. However, it is difficult for users to make efficient use of news archives due to their large size and complexity. Large-scale question answering (QA) systems can solve this problem, as they aim to identify the most likely correct answer in relevant documents for a particular information need expressed as a natural language question.

User questions on these archives are often event-related and include temporal aspects. They can be divided into two types: (1) those with explicit temporal expressions (e.g., "NATO signed a peace treaty with which country in 1999?"), and (2) those only implicitly associated with time periods and hence not containing any temporal expression (e.g., "How many members of the International Olympic Committee were expelled or resigned because of the bribery scandal?"). We focus on the latter type, which is more challenging, as the temporal information cannot be obtained directly. Table 1 shows some examples of the questions that we use.

This paper presents a large-scale question answering system called QANA (Question Answering in News Archives) designed specifically for answering event-related questions on news article archives. It exploits the temporal information of a question, of a document's content, and of its timestamp for reranking candidate documents. In the experiments, we use the New York Times (NYT) archive as the underlying knowledge source together with a carefully constructed test set of questions associated with past events.
The questions are selected from existing datasets and history quiz websites, and they lack any temporal expressions, which makes them particularly difficult to answer. Experimental results show that our proposed system improves retrieval effectiveness and outperforms existing QA systems commonly used for large-scale question answering. We make the following contributions: (a) we propose a new subtask of QA that uses long-term news archives as the data source, (b) we build effective models for solving this task by exploiting the temporal characteristics of both questions and documents, and (c) we perform experiments to prove their effectiveness and construct a novel dedicated test set for evaluating QA on news archives.

The remainder of this paper is structured as follows. The next section overviews the related work. In Sect. 3, we introduce our model. Section 4 describes experimental settings and results. Finally, we conclude the paper in Sect. 5.

Question Answering Systems. Current large-scale question answering systems usually consist of two modules: (1) an IR module (also called a document retriever) responsible for selecting relevant articles from an underlying corpus, and (2) a Machine Reading Comprehension (MRC) module (also called a document reader) used to extract answer spans from relevant articles, typically by using neural network models. The latest MRC models, especially those that use BERT [3], can even surpass human-level performance (based on EM (Exact Match) and F1 scores) on both SQuAD 1.1 [4] and SQuAD 2.0 [5], the two most widely used MRC datasets, where each question is paired with a given reading passage. However, recent studies [6-8] indicate that the IR module is a bottleneck with a significant impact on the performance of the whole system (the MRC component's performance degrades due to noisy input). Hence, a few works have tried to improve the IR task. Chen et al. [9] propose one of the most well-known large-scale question answering systems, DrQA, whose IR component is based on a TF-IDF retriever that uses bigrams with TF-IDF matching. Wang et al. [7] introduce the R³ model, where the IR component and the MRC component are trained jointly by reinforcement learning. Ni et al. [10] propose the ET-RR model, which improves the IR part by identifying essential terms of a question and reformulating the query.

Nonetheless, as existing question answering systems are essentially designed for synchronic document collections (e.g., Wikipedia), they are incapable of utilizing temporal information such as document timestamps when answering questions on long-term news article archives, despite temporal information constituting an important feature of events reported in news articles. The questions and documents are then processed in the same way as on synchronic collections. Even though some temporal question answering systems that can exploit temporal information of question and document content have been proposed in the past [11, 12], they are still designed for synchronic document collections (e.g., Wikipedia or the Web) and do not use document timestamps. Besides, they are based on traditional rule-based methods and their performance is rather poor. In addition, there are very few resources available for temporal question answering. Jia et al. [13] propose a dataset with 1,271 temporal question-answer pairs, of which 209 pairs contain no explicit temporal expression.
However, only a few of these pairs can be used in our case, as most concern events that happened a long time ago (e.g., the Viking invasion of England) or are not event-related. Our approach contains an additional module for reranking documents, which improves the retrieval of correct documents by exploiting temporal information from different angles. We not only utilize the time scope information inferred from the questions themselves, but also combine it with the document timestamp information and with the temporal information embedded inside document content. To the best of our knowledge, no studies, and no datasets that could help to design a question answering system for news article archives, have been proposed so far. Building a system that makes full use of past news articles and satisfies different user information needs is, however, of great importance due to the continuously growing document archives.

Temporal Information Retrieval. In the Information Retrieval (IR) domain, several research studies have addressed the temporal ranking of documents [14-16]. Li and Croft [17] introduce a time-based language model that takes into account the timestamp information of documents to favor recent documents. Metzler et al. [18] propose a method that analyzes query frequencies over time to infer the implicit temporal information of queries and exploits this information for ranking results. Arikan et al. [19] design a temporal retrieval model that integrates temporal expressions from document content into query-likelihood language modeling. Berberich et al. [20] propose a similar model that also considers uncertainty in temporal expressions. However, in [19] and [20], the temporal scopes of queries are explicitly given, and the proposed methods do not utilize timestamp information. Kanhabua and Nørvåg [21] propose three different methods to determine the implicit temporal scope of queries and exploit this temporal information to improve retrieval effectiveness by reranking documents. [21] is probably the work most closely related to ours, as it also linearly combines textual and temporal similarity to rerank documents; however, it does not use any temporal information embedded in document content, and the linear combination is done in a static way. In our experiments, for comparison with [21], we replace the ranking method in our reranking module with the best one proposed in [21]. All the above-mentioned temporal information retrieval methods are designed for short queries rather than questions, and none of them exploits both timestamps and content temporal information. We are the first to adapt and improve concepts from temporal information retrieval for the QA research domain, showing significant improvement in answering questions on long-term news archives.

In this section, we present the proposed system, which is designed specifically for answering questions over news archives. We focus on questions for which the time periods are not given explicitly, so that further knowledge is required for obtaining or inferring their time periods (e.g., "Who replaced Goss as the director of the Central Intelligence Agency?"). Figure 1 shows the architecture of the QANA system, which is composed of three modules: the Document Retriever Module, the Time-Aware Reranking Module, and the Document Reader Module. Compared with the architectures of other common large-scale question answering systems, we add an additional component, the Time-Aware Reranking Module, which exploits temporal information from different angles for selecting the best documents.

Document Retriever Module. This module first performs keyword extraction and expansion, and then retrieves candidate documents from the underlying document archive. First, single-token nouns, compound nouns, and verbs are extracted from each question based on part-of-speech (POS) and dependency information obtained with spaCy. After removing common stop words, the module expands the keywords with their synonyms taken from WordNet [22]. The synonyms are further filtered by keeping those whose POS types match the original term in the question and whose word embedding [23] similarity to the question terms is over 0.5. Finally, a query is issued to the Solr [24] search engine, which returns the top 300 documents ranked by BM25. A sketch of this step is shown below.
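The following is a minimal sketch of the keyword extraction and expansion step, not the authors' implementation: it assumes spaCy's en_core_web_md model (whose bundled vectors are GloVe-based) and NLTK's WordNet interface, and mirrors the filters described above (POS-matched synonyms with embedding similarity above 0.5).

```python
# Sketch of keyword extraction and expansion for the Document Retriever
# Module. Requires: pip install spacy nltk; python -m spacy download
# en_core_web_md; nltk.download("wordnet").
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_md")

# Map spaCy coarse POS tags to WordNet POS tags so synonym POS can be matched.
POS_MAP = {"NOUN": wn.NOUN, "PROPN": wn.NOUN, "VERB": wn.VERB}

def extract_keywords(question: str):
    """Keep single-token nouns, compound nouns, and verbs; drop stop words."""
    doc = nlp(question)
    keywords = [t for t in doc if t.pos_ in POS_MAP and not t.is_stop]
    # Compound nouns via the dependency label 'compound' (e.g., "peace treaty").
    compounds = [f"{t.text} {t.head.text}" for t in doc if t.dep_ == "compound"]
    return keywords, compounds

def expand_with_synonyms(token, sim_threshold=0.5):
    """WordNet synonyms filtered by POS match and embedding similarity > 0.5."""
    synonyms = set()
    for synset in wn.synsets(token.text, pos=POS_MAP[token.pos_]):
        for lemma in synset.lemma_names():
            lemma = lemma.replace("_", " ")
            if lemma.lower() == token.text.lower():
                continue
            cand = nlp(lemma)
            # Keep only synonyms with a vector and sufficient similarity.
            if cand.vector_norm and token.similarity(cand) > sim_threshold:
                synonyms.add(lemma)
    return synonyms
```

In the full pipeline, the surviving keywords and synonyms would then be joined into a Solr query whose results are ranked by BM25.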
Time-Aware Reranking Module. In this module, temporal information is exploited from different angles to rerank the retrieved candidate documents. Since the time scope information of questions is not provided explicitly, the module first determines candidate periods of the time scope $T(Q)$ of a question $Q$. These are supposed to represent when the event mentioned in the question could have occurred. Each inferred candidate period is assigned a weight to indicate its importance. Then, the module contrasts the query time scope against the information derived from the document timestamp $t_{pub}(d)$ and the temporal information embedded inside the document content $T_{text}(d)$, in order to compute two temporal scores, $S_{temp}^{pub}(d)$ and $S_{temp}^{text}(d)$, for each candidate document $d$. Finally, both the textual relevance score $S_{rel}(d)$ and the final temporal score $S_{temp}(d)$ are used for document reranking.

Query Time Scope Estimation. Although the time scope information of the questions is not given explicitly, the distribution of relevant documents over time should provide information about the temporal characteristics of a question. Examining the timeline of a query's result set should allow us to characterize how temporally dependent the topic is. For example, in Fig. 2, the dashed lines show the per-month distribution of relevant documents obtained from the NYT archive for two example questions: "Lewinsky told whom about her relationship with the President Clinton?" and "Which Hollywood star became governor of California?". We use a cross mark to indicate the time of each corresponding event, which is also the true time scope of the question. We can see that the actual time scope (January 1998) of the first question is reflected relatively well by its distribution of relevant documents, as these documents are generally located between 1998 and 1999. However, most of the relevant documents are published in October rather than January, because another event, the impeachment of Bill Clinton, occurred at that time. On the other hand, the distribution of relevant documents corresponding to the second question is more complex: it contains many peaks, the documents are not located in a specific short time period, and the number of relevant documents published around the actual event time is relatively small compared to the total number of relevant documents. However, the distribution near the actual time of the event (November 2003) still reveals useful features, i.e., the highest peak (maximum) of the dashed line is near the event time. Therefore, the characteristics of the distribution of relevant documents over time can be used for inferring the hidden time scopes of questions.

We perform burst detection on the retrieved relevant time-aligned documents, as the time and duration of bursts are likely to signify the start point and the end point of the events underlying the questions. More specifically, we apply the burst detection method used by Vlachos et al. [25], which is a simple yet effective approach. Bursts are detected as points with values higher than $\beta$ standard deviations above the mean value of the moving average (MA). Each detected burst then yields one candidate period $(t_i^s, t_i^e)$ of the time scope $T(Q)$ of question $Q$, together with a weight $w(t_i^s, t_i^e)$ indicating its importance. A sketch of this procedure follows.
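Below is a minimal sketch of this burst-based query time scope estimation, assuming per-month counts of retrieved relevant documents as input. The burst weighting (by a burst's mass above the cutoff, normalized to sum to one) is an assumption: the paper only states that candidate periods carry importance weights.

```python
# Sketch of moving-average burst detection (following the description of
# Vlachos et al. above) used to derive weighted candidate periods of T(Q).
import numpy as np

def candidate_periods(monthly_counts, window=4, beta=1.5):
    """Return [(t_start, t_end, weight), ...] from per-month document counts."""
    counts = np.asarray(monthly_counts, dtype=float)
    # Simple moving average over `window` months (centered via 'same' mode).
    kernel = np.ones(window) / window
    ma = np.convolve(counts, kernel, mode="same")
    # Burst cutoff: beta standard deviations above the mean of the MA.
    cutoff = ma.mean() + beta * ma.std()

    periods, start = [], None
    for i, v in enumerate(ma):
        if v > cutoff and start is None:
            start = i                       # a burst begins
        elif v <= cutoff and start is not None:
            periods.append((start, i - 1))  # the burst ends
            start = None
    if start is not None:                   # burst still open at archive end
        periods.append((start, len(ma) - 1))

    # Assumed weighting: each burst's mass above the cutoff, normalized.
    raw = [float(np.sum(ma[s:e + 1] - cutoff)) for s, e in periods]
    total = sum(raw) or 1.0
    return [(s, e, w / total) for (s, e), w in zip(periods, raw)]
```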
Timestamp-Based Temporal Score Calculation. After obtaining the candidate periods of the time scope $T(Q)$, the module computes the timestamp-based temporal score $S_{temp}^{pub}(d)$ of each candidate document $d$, as shown in Eq. 1. We calculate $S_{temp}^{pub}(d)$ based on the intuition that articles published within or soon after the time period of the question have a high probability of containing detailed information about the event mentioned in the question. $S_{temp}^{pub}(d)$ is estimated as $P(T(Q) \mid t_{pub}(d))$, the average probability of generating the $m$ candidate periods of the time scope $T(Q)$:

$$S_{temp}^{pub}(d) = P(T(Q) \mid t_{pub}(d)) = \frac{1}{m} \sum_{i=1}^{m} P((t_i^s, t_i^e) \mid t_{pub}(d)) \qquad (1)$$

The probability of generating a period $(t_i^s, t_i^e)$ given the document timestamp $t_{pub}(d)$ is defined as:

$$P((t_i^s, t_i^e) \mid t_{pub}(d)) = \begin{cases} 0 & \text{if } t_{pub}(d) < t_i^s \\[4pt] w(t_i^s, t_i^e) \left( 1 - \dfrac{\max(0,\; t_{pub}(d) - t_i^e)}{TimeSpan(D)} \right) & \text{elsewhere} \end{cases} \qquad (2)$$

$TimeSpan(D)$ is the length of the time span of the news archive $D$. In the experiments, we use the NYT archive with monthly granularity, so $TimeSpan(D)$ equals 246 units, corresponding to the number of months in the archive. $w(t_i^s, t_i^e)$ is the weight indicating the importance of $(t_i^s, t_i^e)$ among the candidate periods of the time scope $T(Q)$ (as explained before). $P((t_i^s, t_i^e) \mid t_{pub}(d))$ equals 0.0 when document $d$ is published before $t_i^s$, as such a document usually cannot provide much information on events that occurred after its publication. Otherwise, $P((t_i^s, t_i^e) \mid t_{pub}(d))$ becomes larger when the timestamp is closer to the time period $(t_i^s, t_i^e)$, and when the importance weight $w(t_i^s, t_i^e)$ of this period is large. A sketch of this computation follows.
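A minimal sketch of Eqs. 1-2, with time measured in months since the start of the archive; the decay term follows the reconstruction of Eq. 2 above and should be treated as an assumption rather than the paper's exact formula.

```python
# Sketch of the timestamp-based temporal score (Eqs. 1-2).
TIME_SPAN = 246  # months from January 1987 to June 2007 in the NYT archive

def period_prob(t_pub, t_s, t_e, weight):
    """P((t_s, t_e) | t_pub(d)) as in Eq. 2 (reconstructed decay)."""
    if t_pub < t_s:
        # A document cannot report events occurring after its publication.
        return 0.0
    distance = max(0, t_pub - t_e)  # zero if published within the period
    return weight * (1.0 - distance / TIME_SPAN)

def s_temp_pub(t_pub, periods):
    """Eq. 1: average generation probability over the m candidate periods."""
    if not periods:
        return 0.0
    return sum(period_prob(t_pub, s, e, w) for s, e, w in periods) / len(periods)
```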
Content-Based Temporal Score Calculation. Next, we compute another temporal score, $S_{temp}^{text}(d)$, of a candidate document $d$ based on the relation between the temporal information embedded in $d$'s content and the candidate periods of the time scope $T(Q)$. We compute $S_{temp}^{text}(d)$ because some news articles, even ones published a long time after the events mentioned in the questions, may retrospectively refer to these events, providing salient information about them, and can thus help to distinguish between similar events. For example, articles published near a certain US presidential election may also discuss previous elections for comparison or for other purposes. Such references are often in the form of temporal expressions that refer to particular points in the past. Temporal expressions are detected and normalized by a combination of a temporal tagger (we use SUTime [29]) and temporal signals (words that help to identify temporal relations, e.g., "before", "after", "during"). The normalized result of each temporal expression is mapped to a time interval with "start" and "end" information. For example, the temporal expression "between 1999 and 2002" is normalized to [('1999-01', '2002-12')]. Special cases like "until January 1992" are normalized as [('', '1992-01')], since the "start" temporal information cannot be determined.

In this way, we obtain a list of time scopes of the temporal expressions contained in a document $d$, denoted as $T_{text}(d) = \{\tau_1, \tau_2, ..., \tau_{m(d)}\}$, where $m(d)$ is the total number of temporal expressions found in $d$. As each time scope $\tau_i$ has its "start" information, denoted $\tau_i^s$, and "end" information, $\tau_i^e$, we create two lists, $T_{text}^s(d)$ and $T_{text}^e(d)$, containing all $\tau_i^s$ and all $\tau_i^e$, respectively. Next, we construct two probability density functions by applying kernel density estimation (KDE) to these two lists. KDE is a technique closely related to histograms, with characteristics that allow it to asymptotically converge to any density function. The densities at $t^s(Q)$ and $t^e(Q)$, denoted $S_{temp_b}^{text}(d)$ and $S_{temp_e}^{text}(d)$, respectively, can then be estimated using these probability density functions:

$$S_{temp_b}^{text}(d) = \frac{1}{m(d)\,h} \sum_{i=1}^{m(d)} K\!\left(\frac{t^s(Q) - \tau_i^s}{h}\right) \qquad (3)$$

where $h$ is a bandwidth (equal to 4) and $K$ is a Gaussian kernel defined by:

$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \qquad (4)$$

$S_{temp_e}^{text}(d)$ is calculated in the same way but using $\tau_i^e$ and $t^e(Q)$, and $S_{temp}^{text}(d)$ is:

$$S_{temp}^{text}(d) = \frac{S_{temp_b}^{text}(d) + S_{temp_e}^{text}(d)}{2} \qquad (5)$$

Final Temporal Score Calculation & Document Ranking. After computing the two temporal scores, the final temporal score of $d$ is given by:

$$S_{temp}(d) = \frac{\hat{S}_{temp}^{pub}(d) + \hat{S}_{temp}^{text}(d)}{2} \qquad (6)$$

where $\hat{S}_{temp}^{pub}(d)$ and $\hat{S}_{temp}^{text}(d)$ are the normalized values computed by dividing by the corresponding maximum scores among all candidate documents. Additionally, the document relevance score $S_{rel}(d)$ is used after normalization:

$$\hat{S}_{rel}(d) = \frac{S_{rel}(d)}{\max_{d'} S_{rel}(d')} \qquad (7)$$

Finally, we rerank documents by a linear combination of their relevance scores and temporal scores:

$$S(d) = (1 - \alpha(Q))\, \hat{S}_{rel}(d) + \alpha(Q)\, S_{temp}(d) \qquad (8)$$

$\alpha(Q)$ is an important parameter, which determines the proportion between the document temporal score and its relevance score. For example, when $\alpha(Q)$ equals 0.0, the temporal information is completely ignored. As different questions have differently shaped distributions of their relevant documents, we propose to dynamically determine $\alpha(Q)$ for each question. The idea is that when a question has many bursts, meaning that the event of the question is frequently mentioned at different times or that many similar or related events occurred over time, time should play a lesser role. In this case we want to decrease the $\alpha(Q)$ value to pay more attention to document relevance. In contrast, when only a few bursts are found, which means that the question has an obvious temporal character, time should be weighted more. $\alpha(Q)$ is computed as follows:

$$\alpha(Q) = e^{-c\,(|T(Q)| - 1)} \qquad (9)$$

where $|T(Q)|$ is the number of detected bursts and $c$ is a constant set to 0.25. $\alpha(Q)$ assumes small values when the number of bursts is high, while it is highest in the case of a single burst. When the relevant document distribution of the question does not exhibit any bursts, which also means that the list of candidate periods of the question time scope $T(Q)$ is empty, $\alpha(Q)$ is set to 0 and the reranking is based on document relevance alone. A sketch of the content-based score and the final reranking follows.
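A minimal sketch of Eqs. 3-9, written out with the Gaussian kernel and h = 4 rather than a library KDE. The evaluation points $t^s(Q)$ and $t^e(Q)$ are taken here from a single (e.g., top-weighted) candidate period, and the exponential form of $\alpha(Q)$ follows the reconstruction of Eq. 9 above; both are assumptions, not the authors' confirmed implementation.

```python
# Sketch of the content-based temporal score and the final reranking.
import math

def kde(x, samples, h=4.0):
    """Gaussian kernel density estimate at point x (Eqs. 3-4)."""
    if not samples:
        return 0.0
    k = lambda u: math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)
    return sum(k((x - s) / h) for s in samples) / (len(samples) * h)

def s_temp_text(t_s_q, t_e_q, starts, ends):
    """Eq. 5: average of the start- and end-point density scores."""
    return 0.5 * (kde(t_s_q, starts) + kde(t_e_q, ends))

def alpha(num_bursts, c=0.25):
    """Eq. 9 (reconstructed): decays as the number of bursts grows;
    alpha(10) ~ 0.1 with c = 0.25, matching the paper's example."""
    if num_bursts == 0:
        return 0.0  # no temporal evidence; rank by relevance alone
    return math.exp(-c * (num_bursts - 1))

def rerank(docs, a):
    """Eq. 8: combine normalized relevance with the final temporal score.
    Each doc dict holds raw 's_rel' and 's_temp' (the Eq. 6 output)."""
    max_rel = max(d["s_rel"] for d in docs) or 1.0
    for d in docs:
        d["score"] = (1 - a) * (d["s_rel"] / max_rel) + a * d["s_temp"]
    return sorted(docs, key=lambda d: d["score"], reverse=True)
```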
Document Reader Module. For this module, we utilize a commonly used MRC model called BiDAF [30], which achieves an Exact Match score of 68.0 and an F1 score of 77.5 on the SQuAD 1.1 dev set. We use the BiDAF model to extract answers from the top N reranked documents and select the most common answer as the final answer. Note that BiDAF could be replaced by other MRC models, for example, BERT-based models [3]. We use BiDAF for easy comparison with DrQA, whose reader component performs similarly to, though slightly better than, BiDAF.

Document Archive and Test Set. As mentioned before, the NYT archive [31] is used as the underlying document collection and is indexed using Solr. The archive contains over 1.8 million articles published from January 1987 to June 2007 and is often used in temporal information retrieval research [15, 16]. To evaluate the performance of our approach, we first need a set of answerable questions. To the best of our knowledge, there was no previous proposal for answering questions on news archives, nor any available question answering test set designed for news archives. Thus, we manually constructed a test set, making sure that the questions can be answered using the NYT archive. The final test set contains 200 questions carefully selected from existing datasets and history quiz websites, such that (a) they fall into the time frame of the NYT archive, (b) their answers can be found in the NYT archive, and (c) they do not contain any temporal expressions. The second condition was verified by manually selecting correct keywords from the questions and checking whether the correct answer could be inferred from at least one retrieved document. Table 2 shows the distribution of resources used for creating the test set, while Table 1 gives a few examples.

We test the following models: QANA (our full model), QANA-TempPub (using only the timestamp-based temporal score), QANA-TempCont (using only the content-based temporal score), QA-NLM-U [21] (our model with the reranking method replaced by the best method of [21]), QA-Not-Rerank [30] (our model without the reranking module), and DrQA [9]. We measure the performance of the compared models using exact match (EM) and F1 score, the two standard measures commonly used in QA research (see the sketch below).

As shown in Table 3, QANA with all components outperforms the other systems for all values of N, the number of reranked documents used in the Document Reader Module. The performance improvement is due to the use of temporal information, derived from the question itself, document timestamps, and document content, for locating more correct documents. We then compare our model with the others when considering the top 1 and top 5 documents. Compared with the DrQA system, which is often used as a QA baseline, the improvement ranges from 17.77% to 32.14% on EM and from 24.25% to 33.49% on F1. We have also examined the performance of DrQA when using Wikipedia articles as its knowledge source. In this case, the results are worse than those of any other compared method that uses NYT (including DrQA), which implies that Wikipedia cannot successfully answer questions on distant past events; they need to be answered using primary sources, i.e., news articles from the past. Compared with QA-NLM-U [21], the improvement ranges from 12.76% to 12.12% on EM and from 12.21% to 10.19% on F1. In addition, compared with QA-Not-Rerank [30], which does not include the reranking module, we also observe a clear improvement when considering the top 5 and top 15 documents, ranging from 23.33% to 8.33% on EM and from 15.64% to 7.11% on F1. Moreover, QANA-TempPub performs better than QANA-TempCont when using the top 1 and top 5 documents, but worse when using the top 10 and top 15. We also observe that using only timestamp information still allows achieving relatively good performance. Nevertheless, QANA with all the proposed components, which makes use of the inferred time scope of the questions and the temporal information from both document timestamps and document content, achieves the best results.
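For completeness, here is a sketch of the two evaluation measures, following their standard SQuAD-style definitions (lowercasing, punctuation and article stripping, token-level overlap); the paper does not spell out its normalization, so treat the details as assumptions.

```python
# Sketch of SQuAD-style EM and F1 over predicted and gold answer strings.
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))

def f1(prediction, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```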
We next evaluate the performance of QANA depending on the number of relevant documents per question, comparing it with QA-Not-Rerank. We first rank the questions by the number of documents they return and then split them into two equal groups. As shown in Table 4, both tested models achieve better results on questions with few relevant documents, as it is likely easier to locate the truly relevant documents among a small number of candidates. We also observe an improvement of our model over QA-Not-Rerank, especially for the top 5 and top 15 documents, which demonstrates the effectiveness of reranking with temporal information.

Moreover, we analyze the impact of the number of bursts on performance. About half of the questions (96 questions) have few bursts (four or fewer). Table 5 shows that both QANA and QA-Not-Rerank perform much better when answering such questions. The events in questions with many bursts are likely to be similar to other events that occurred at different times, which makes it difficult to distinguish between the events. As our system considers the importance of bursts by assigning weights to them, it significantly outperforms QA-Not-Rerank. Although $\alpha(Q)$ is smaller in this case (according to Eq. 9), it still plays an important part in selecting relevant documents. For example, if the number of bursts of a question is 10, $\alpha(Q)$ approximately equals 0.1, which means that about 10% of the final reranking score is driven by the temporal score.

Finally, we examine the effect of $\alpha(Q)$, which determines the proportion between the temporal relevance score and the query relevance score. As shown in Fig. 3, the model using a dynamic alpha (depicted by dashed lines) always performs better than the model with a static alpha, since the dynamic value depends on each question's distribution of relevant documents over time. The dynamic approach flexibly captures the changing importance of temporal information versus relevance information, resulting in better overall performance.

In this work we propose a new research task of answering event-related questions on long-term news archives and we present an effective solution for it. Unlike in other common QA systems designed for synchronic document collections, questions on long-term news archives are usually influenced by temporal aspects, resulting from the interplay between document timestamps, temporal information embedded in document content, and the query time scope. Therefore, exploiting temporal information is crucial for this type of QA, as also demonstrated in our experiments. We are also the first to incorporate and adapt temporal information retrieval approaches in QA systems.

Finally, our work makes a few general observations. First, to answer event-related questions on long-span news archives one needs to (a) infer the time scope embedded within a question and then (b) rerank documents based on their closeness and order relation to this time scope. Moreover, (c) using temporal expressions in documents further helps to select the best candidates. Lastly, (d) applying a dynamic way of determining the balance between query relevance and temporal relevance is quite helpful.
References

[1] Interacting with digital documents: a real life study of historians' task processes, actions and goals
[2] Searching for old news: user interests and behavior within a national collection
[3] BERT: pre-training of deep bidirectional transformers for language understanding
[4] SQuAD: 100,000+ questions for machine comprehension of text
[5] Know what you don't know: unanswerable questions for SQuAD
[6] Ask the right questions: active question reformulation with reinforcement learning
[7] R³: reinforced ranker-reader for open-domain question answering
[8] End-to-end open-domain question answering with BERTserini
[9] Reading Wikipedia to answer open-domain questions
[10] Learning to attend on essential terms: an enhanced retriever-reader model for scientific question answering
[11] Towards temporal web search
[12] Question answering based on temporal inference
[13] TempQuestions: a benchmark for temporal question answering
[14] On the value of temporal information in information retrieval
[15] Survey of temporal information retrieval and related applications
[16] Temporal information retrieval
[17] Time-based language models
[18] Improving search relevance for implicitly temporal queries
[19] Time will tell: leveraging temporal expressions in IR
[20] A language modeling approach for temporal information needs
[21] Determining time of queries for re-ranking search results
[22] WordNet: a lexical database for English
[23] GloVe: global vectors for word representation
[24] Apache Solr. https://lucene.apache.org/solr/
[25] Identifying similarities, periodicities and bursts for online search queries
[26] Parameter free bursty events detection in text streams
[27] Finding surprising patterns in textual data streams
[28] Bursty and hierarchical structure in streams
[29] SUTime: a library for recognizing and normalizing time expressions
[30] Bidirectional attention flow for machine comprehension
[31] The New York Times annotated corpus