title: Identifying Notable News Stories
authors: Saravanou, Antonia; Stefanoni, Giorgio; Meij, Edgar
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_44

The volume of news content has increased significantly in recent years, and systems that process and deliver this information in an automated fashion at scale are becoming increasingly prevalent. One critical component required in such systems is a method to automatically determine how notable a certain news story is, in order to prioritize stories during delivery. One way to do so is to compare each story in a stream of news stories to a notable event. In other words, the problem of detecting notable news can be defined as a ranking task; given a trusted source of notable events and a stream of candidate news stories, we aim to answer the question: "Which of the candidate news stories is most similar to the notable one?" We employ different combinations of features and learning to rank (LTR) models and gather relevance labels using crowdsourcing. In our approach, we use structured representations of candidate news stories (triples) and link them to corresponding entities. Our evaluation shows that the features in our proposed method outperform standard ranking methods, and that the trained model generalizes well to unseen news stories.

With the rise in popularity of social media and the increase in citizen journalism, news is increasing in volume and coverage all around the world. As a result, news consumers run the risk of either being overwhelmed by the sheer amount of news being produced, or missing out on news stories due to heavy filtering. To deal with this information overload, it is crucial to develop systems that can filter the noise in an intelligent fashion. Due to the highly condensed language used in news, automated systems have been developed to process news articles and generate well-defined structured representations of their content [9]. Each structure is a so-called triple that represents an event in the form of who did what to whom, with additional metadata about when and where this happened. Such representations (triples) form a knowledge graph (KG). There are multiple computational benefits when searching, labeling, and processing KGs due to their clean and simple structure [11, 14].

A common approach to measure the notability of a news event is to track it through a proxy metric. For example, Naseri et al. [7] decide whether an article describes a notable event by counting user interactions, while Setty et al. [10] cluster together similar news articles and then use the cluster size to decide whether the common theme is notable. Wang et al. [12] propose a recommendation framework that takes a stream of news as input and predicts the user's click-through rate. In this paper, we approach the problem of identifying notable news stories as a ranking task, i.e., we rank structured news stories represented as triples against notable events. We use Wikipedia's Current Events Portal (WCEP) [2] as a curated source of notable events and, using a combination of textual and semantic features, we build a learning to rank (LTR) model to solve the ranking problem.

Let Q = [q_0, ..., q_k] denote a stream of events, where each query event q_i ∈ Q is a notable event composed of a textual description and a publication date. Let C = [c_0, ..., c_l] denote a stream of candidate events. Each c_j ∈ C is a structured representation of a news story that consists of a triple of the form (s, p, o), where s is the subject, p is the predicate, and o is the object, together with information about the location (city, country) and the date of the news story. Given a query q_i ∈ Q and a stream of candidates C, we aim to rank each candidate c_j ∈ C by its relevance to the query q_i. A pair (q_i, c_j) is considered very relevant when the information from q_i and c_j matches completely; it is considered relevant when some of the information matches; otherwise, it is considered not relevant. Table 1 shows a query q_0 and two candidates c_0 and c_1. The pair (q_0, c_0) is very relevant because c_0 matches q_0 completely; in contrast, the pair (q_0, c_1) is relevant because q_0 and c_1 disagree only on the location of the event.
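To make the data model concrete, the following is a minimal sketch of how a query event and a candidate triple could be represented. The field names follow the definitions above, but the classes, the example values, and the choice of Python are illustrative assumptions rather than part of the paper's implementation.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QueryEvent:
    """A notable event q_i (e.g., from WCEP): a textual description plus a publication date."""
    text: str
    published: date

@dataclass
class CandidateEvent:
    """A structured news story c_j: an (s, p, o) triple with location and date metadata."""
    subject: str
    predicate: str
    obj: str
    city: str
    country: str
    published: date

# Illustrative (made-up) instances: a pair is very relevant when all of its
# information matches, relevant when only part of it matches (e.g., a different
# location), and not relevant otherwise.
q = QueryEvent("Protesters clash with police during a rally", date(2018, 5, 1))
c = CandidateEvent("Protesters", "Engage in political dissent", "Police",
                   "Athens", "Greece", date(2018, 5, 1))
```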
In this section we present our method to identify notable news stories, which consists of three steps: (1) creating (query, candidate) pairs, (2) extracting textual and semantic features, and (3) training a learning to rank (LTR) model.

(1) Creating Pairs. We create the set of all possible (query, candidate) pairs where (i) the query and the candidate have the same publication date, and (ii) the query and the candidate have at least one word in common, since a pair is unlikely to be relevant if its components share no words.

(2) Extracting Features. We extract a set of features for each constructed pair. Our features can be classified into three groups as follows. (i) Features related to a component. We compute the size of the query or the candidate, i.e., the number of terms in the query/candidate. (ii) Features related to the pair. We calculate the Okapi BM25 score, the term frequency (TF), and the term frequency-inverse document frequency (TF-IDF) for the query/candidate in the pair. We calculate these scores using the stemmed versions of the query/candidate (using the Porter stemmer [8]). We further define a similarity score, element match, EM(q_i, ele_{c_j}) = |q_i ∩ ele_{c_j}| / |ele_{c_j}|, where an element ele_{c_j} is one of the subject, predicate, object, description of the predicate, location, and date of the candidate c_j. For each of these, we calculate the fraction of the terms in ele_{c_j} that also appear in the query q_i, i.e., the number of common terms between ele_{c_j} and q_i divided by the total number of terms in ele_{c_j}. In addition, we compute all EM scores using the stemmed versions of the pair components. We also extract similarity scores for combinations of elements, for example EM(q_i, subject ∩ predicate ∩ object) and EM(q_i, city ∩ country). (iii) Semantic features. An entity is a well-defined, meaningful, and unique way to characterize a word or phrase. We therefore apply entity linking using the TagMe API [5] to identify entities (an example is shown in Table 1). Given the tagged query and the tagged candidate, we calculate the number of common entities and the Jaccard similarity of their entity sets.

(3) Ranking Pairs. We then use our features to train a learning to rank model in order to obtain a ranking of pairs. More details on the training and the tuning can be found in Sect. 4.
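As an illustration of the pair-level features defined above, the sketch below computes the element-match (EM) scores and the entity-overlap features. It assumes the query and each candidate element have already been tokenized (and, where needed, Porter-stemmed) into sets of terms, and that entity linking has already been performed; the BM25 and TF-IDF features are omitted, and the function names and dictionary layout are our own simplifications.

```python
def element_match(query_terms: set, element_terms: set) -> float:
    """EM(q_i, ele_{c_j}) = |q_i ∩ ele_{c_j}| / |ele_{c_j}|."""
    if not element_terms:
        return 0.0
    return len(query_terms & element_terms) / len(element_terms)

def entity_jaccard(query_entities: set, candidate_entities: set) -> float:
    """Jaccard similarity between the entity sets of the query and the candidate."""
    union = query_entities | candidate_entities
    return len(query_entities & candidate_entities) / len(union) if union else 0.0

def pair_features(query_terms, candidate_elements, query_entities, candidate_entities):
    """Assemble a (partial) feature vector for one (query, candidate) pair.

    candidate_elements maps element names ('subject', 'predicate', 'object',
    'location', 'date', ...) to their sets of terms.
    """
    feats = {f"em_{name}": element_match(query_terms, terms)
             for name, terms in candidate_elements.items()}
    feats["n_common_entities"] = len(query_entities & candidate_entities)
    feats["entity_jaccard"] = entity_jaccard(query_entities, candidate_entities)
    return feats
```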
For the candidate news stories, we use the Integrated Crisis Early Warning System (ICEWS) [1] dataset, which contains events that are automatically extracted from news articles using TABARI [3, 9]. This system uses grammatical parsing and human-generated rules to identify events (who did what to whom, when, and where). The events are triples consisting of coded actions between socio-political actors. The actors refer to individuals, groups, sectors, and nation states. The actions are coded into 312 categories. Geographical and temporal metadata are also associated with each triple (examples are shown in Table 1).

In our experiments, we use the same two weeks of data from ICEWS and WCEP. We remove triples with the generic action type "Make statement" as they do not convey any meaningful information. We then create pairs as described in Sect. 3. We build a crowdsourcing task (see below) to obtain golden truth labels. From the resulting annotated dataset, we only keep queries with at least one relevant ICEWS triple, as there are, e.g., sports events in the WCEP dataset but not in the ICEWS dataset. In total, the resulting dataset contains 9.1K pairs: 74 queries, with 123 candidates per query on average. To evaluate our method in a real-world setting, we split the dataset by date and use the first ten days for training, the next two days for validation, and the last two for testing.

Golden Truth. We employ crowdsourcing on the Figure Eight platform and ask annotators to judge the relevance of each pair on a 3-point scale (very relevant, relevant, not relevant). Each pair (q_i, c_j) is annotated by at least three annotators and we use majority voting to obtain the gold labels. Our task obtains an inter-annotator agreement of 96.57%. Table 2 shows the distribution of relevance labels among pairs. The resulting dataset is highly skewed, with 3% of the pairs annotated as very relevant, 1% as relevant, and 96% as not relevant.

We explore various LTR algorithms and include results from RankBoost (RB) [6], LambdaMART (LM) [13], and Random Forest (RF) [4]. We experiment with different sets of features: all features (All), all except entity-related features (All−), selected features (Sel), and baseline features (B). The Sel features include the BM25 and TF-IDF scores calculated from the original/stemmed versions, the EM scores for subject, predicate, object, and location, and the number of entities in common and the Jaccard similarity between the query and the candidate. For the B features, we only consider the BM25 and TF-IDF scores calculated from the original/stemmed versions of the query and the candidate. We evaluate using MAP, Precision@k, NDCG@k, and MRR.
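For the ranking step, the paper trains RankBoost, LambdaMART, and Random Forest models but does not spell out an implementation here. As a hedged sketch of the overall pipeline, the snippet below trains a pointwise Random Forest regressor on the graded labels (not relevant = 0, relevant = 1, very relevant = 2) and averages per-query NDCG@10 with scikit-learn; the toolkit, hyperparameters, and variable names are assumptions for illustration only, not the configuration used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import ndcg_score

def evaluate_ndcg10(X_train, y_train, X_test, y_test, qid_test):
    """Train a pointwise ranker and return mean per-query NDCG@10.

    X_*: feature matrices (one row per (query, candidate) pair),
    y_*: graded relevance labels {0, 1, 2},
    qid_test: query identifier for each test row.
    """
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X_train, y_train)            # pointwise: regress the graded label
    scores = model.predict(X_test)

    y_test = np.asarray(y_test)
    qid_test = np.asarray(qid_test)
    per_query = []
    for q in np.unique(qid_test):
        mask = qid_test == q
        if mask.sum() < 2:                 # NDCG needs at least two candidates
            continue
        per_query.append(ndcg_score([y_test[mask]], [scores[mask]], k=10))
    return float(np.mean(per_query))
```

With the date-based train/validation/test split described above, this mirrors the shape of the evaluation (per-query NDCG@10 averaged over test queries), though not the exact models or tuning reported in the paper.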
In this section we discuss our experimental results and answer the following research questions: How does our method compare against the baselines? Does the performance vary with different parameter settings? Does the number of relevant pairs affect performance? Do we benefit from tagging entities?

We compare the three LTR models on the All and B feature sets and show the results in Table 3. Our method (using the All features) achieves better results than using just the baseline B features. For each model and feature set, we only show the best model as tuned on the validation set. Our method consistently outperforms all baselines, achieving 5-12% improvements on NDCG@10. These improvements are statistically significant with p ≤ 0.01 using a paired t-test. We tune the parameters for each model on the validation set using NDCG@10. Figure 1 (left) shows a quartile plot of performance across different parameter settings; the RB and RF models are less sensitive to parameter tuning than LM.

We evaluate the models when ranking pairs using all annotations (VR, R, and NR), and we repeat the experiment using only the VR and NR labeled pairs. Figure 1 (right) shows that the models achieve better results when excluding the R labeled pairs. This is expected, as the relevant label is very rare (only 1%, see Table 2) and the models tend to treat it as noise.

Our next step is to evaluate different combinations of features (All, All−, Sel, B); we show our findings in Table 4. First, we compare our method using the All− and B feature sets and show that the proposed features (Sect. 3) yield better performance for all LTR models. Second, we evaluate the performance of the models when we add the entity features by comparing All and All−. Table 4 shows a statistically significant improvement (p ≤ 0.01) on MRR (+7%) when the entity-related features are added.

Finally, we show examples of the output from our best performing setting, i.e., RF with the All features using the VR and NR labeled pairs, looking at our best and worst per-query NDCG@10 performance. The best case achieves a score of 1, which indicates that our method was able to rank all pairs in the right order; its query is "At least 15 children are killed and 45 more are injured after a school bus collides with a truck in Etah, India."

In summary, we provide quantitative and qualitative performance analyses of our proposed method and conclude that learning to rank is a viable method to determine the notability of news stories. Among the key steps of our method are: (i) the extraction of textual and semantic features, and (ii) the exclusion of the pairs that do not convey a strong signal, i.e., the ones labeled as 'relevant'. The RF model outperforms all baselines and is also more robust with respect to hyperparameter settings. These findings show that our approach to detect notable news through ranking is a promising one. Although our method obtains high performance (MRR = 81%), we believe we can attain further improvements by leveraging relations of the identified entities to discover implicitly relevant ones.

In this paper, we present a method to rank notable news representations which leverages textual and semantic features. Our evaluation on labeled pairs from WCEP and ICEWS shows that our method is effective. In the future, we intend to include features based on the relations of the tagged entities from external KGs, such as DBpedia and Freebase.

References:
Wikipedia's Current Events Portal (WCEP)
ICEWS automated daily event data
Random forests
TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities)
An efficient boosting algorithm for combining preferences
Analyzing and predicting news popularity in an instant messaging service
An algorithm for suffix stripping
Automated coding of international event data using sparse parsing techniques
Modeling event importance for ranking daily news events
Learning to explain entity relationships in knowledge graphs
DKN: deep knowledge-aware network for news recommendation
Adapting boosting for information retrieval measures
Fast top-k search in knowledge graphs