key: cord-0183436-9tvl1kxm
authors: Jones, Shawn M.; Nwala, Alexander C.; Klein, Martin; Weigle, Michele C.; Nelson, Michael L.
title: SHARI -- An Integration of Tools to Visualize the Story of the Day
date: 2020-08-01
journal: nan
DOI: nan
sha: 93520599f12c21c08b8b9134a98bd21815f5b9d9
doc_id: 183436
cord_uid: 9tvl1kxm

Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the "biggest story" for a given date. StoryGraph clusters news articles together to identify a common news story. Hypercane leverages ArchiveNow to store URLs produced by StoryGraph in web archives. Hypercane analyzes these URLs to identify the most common terms, entities, and highest quality images for social media storytelling. Raintale then uses the output of these tools to produce a visualization of the news story for a given day. We name this process SHARI (StoryGraph Hypercane ArchiveNow Raintale Integration).

Tools such as Google News and Flipboard exist to convey daily news, but what about the news of the past? We have combined StoryGraph 1 with tools from the Dark and Stormy Archives Toolkit 2 to produce the StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) process. These tools represent disparate research efforts in news analysis, corpus summarization, web archiving, and visualization. The integration produces a summary of the "biggest story" for a given date. SHARI combines the following components from Old Dominion University's Web Science and Digital Libraries Research Group 3 :

- StoryGraph: a platform that downloads RSS feeds and analyzes the linked articles to cluster news stories [12] http://storygraph.cs.odu.edu/
- Hypercane: a framework for intelligently sampling and analyzing documents from web archive collections [5] https://oduwsdl.github.io/hypercane
- ArchiveNow: a library developed by Aturban et al. [2] that submits live web URI-Rs to web archives to create URI-Ms https://github.com/oduwsdl/archivenow
- Raintale: a MementoEmbed [3] client that creates stories from a sample of mementos https://oduwsdl.

Fig. 1: Excerpt of the StoryGraph JSON output for March 23, 2020.

    {
      "config": "/files/config/polar-media-consensus-graph/f6e84be9969ecef7adb20689002608d0/",
      "connected-comps": [
        {
          "avg-degree": 4.318181818181818,
          "density": 0.10042283298097252,
          "node-details": { "annotation": "polarity", "color": "green", "connected-comp-type": "event" },
          "nodes": [ 0, 1, ... additional node ids omitted for brevity ... ],
          "unique-source-count": 14
        },
        {
          "avg-degree": 1,
          "density": 1,
          "node-details": { "annotation": "polarity", "color": "red", "connected-comp-type": "cluster" },
          "nodes": [ 9, 67 ],
          "unique-source-count": 2
        }
      ],
      "links": [
        { "rank": 1, "sim": 0.57, "source": 2, "target": 21, "label": "1 (0.57)", "label-description": "rank (sim)" },
        ... additional link definitions omitted for brevity ...
        { "rank": 96, "sim": 0.3, "source": 53, "target": 73, "label": "96 (0.3)", "label-description": "rank (sim)" }
      ],
      "ner-version": "3.8.0",
      "nodes": [
        ... other nodes omitted for brevity ...
        {
          "entities": [
            { "class": "LOCATION", "entity": "Coney Island" },
            { "class": "LOCATION", "entity": "Brooklyn" },
            { "class": "PERSON", "entity": "Victor J. Blue" },
            ...
          ],
          "extraction-time": "2020-03-23T00:09:10.325362",
          "favicon": "https://www.nytimes.com/vi-assets/static-assets/favicon-4bf96cb6a1093748bf5b3c429accb9b4.ico",
          "id": "nytimes.com-1",
          "link": "https://www.nytimes.com/2020/03/22/health/coronavirus-restrictions-us.html",
          "node-details": { "annotation": "polarity", "color": "blue", "connected-comp-type": "event", "type": "left" },
          "published": "Sun, 22 Mar 2020 22:00:52 +0000",
          "rss-uri-m": "https://web.archive.org/web/20200323000609id_/https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
          "text": "Health |Harsh Steps Are Needed to Stop ...
        ...
      ],
      "self": "http://storygraph.cs.odu.edu/graphs/polar-media-consensus-graph/#cursor=0&hist=1440&t=2020-03-23T00:09:10",
      "timestamp": "2020-03-23T00:09:10.707796Z",
      "graph-pointer": { "cursor": 0, "hist": 1440, "cur-path": "2020/03/23" }
    }

Nwala et al. [10, 11] have focused on finding seeds within search engine result pages (SERPs), social media stories, and news feeds. As part of this research, Nwala et al. also developed StoryGraph [12], a service that saves RSS feeds from 17 news sources (Table 1 in Appendix A) every ten minutes. With these RSS feeds, StoryGraph analyzes the lexical connections between articles across feeds to generate JSON output, which drives a graph visualization. Figure 1 displays some of this JSON output for March 23, 2020. StoryGraph then visualizes this output, as shown in Figure 2.
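As a rough illustration of how a client might read this JSON, the sketch below picks the connected component with the most nodes and returns the article links of its members. It assumes that the keys use ASCII hyphens, that the node ids listed in a component are indices into the top-level "nodes" array, and that "biggest" means "most nodes". None of these assumptions is confirmed by the figure, and the function name and sample data are hypothetical.

```python
import json


def biggest_story_links(graph):
    """Return the article links in the connected component with the most nodes."""
    components = graph.get("connected-comps", [])
    if not components:
        return []
    # Assumption: the "biggest story" is the component with the most nodes.
    largest = max(components, key=lambda comp: len(comp["nodes"]))
    nodes = graph["nodes"]
    # Assumption: each node id is an index into the top-level "nodes" array.
    return [nodes[i]["link"] for i in largest["nodes"] if 0 <= i < len(nodes)]


# Hypothetical, heavily trimmed stand-in for the structure shown in Figure 1.
sample = json.loads("""
{
  "connected-comps": [
    {"nodes": [0, 2], "node-details": {"connected-comp-type": "event"}},
    {"nodes": [1], "node-details": {"connected-comp-type": "cluster"}}
  ],
  "nodes": [
    {"id": "a.example-1", "link": "https://a.example/story-one"},
    {"id": "b.example-1", "link": "https://b.example/other-story"},
    {"id": "c.example-1", "link": "https://c.example/story-one-again"}
  ]
}
""")

links = biggest_story_links(sample)
print(links)
```

The resulting list of links plays the role of the URI-Rs that the rest of the process consumes.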

Collections on specific topics exist at various web archives [7]. AlNoamany et al. [1] introduced how to use social media storytelling to summarize web archive collections. Klein et al. [8] have built collections from web archives by conducting focused crawls. Jones developed Hypercane [5] to intelligently sample mementos from larger collections. Jones also developed Raintale [4] for generating social media stories to summarize groups of mementos, providing visualizations that employ familiar techniques, like cards, that require no training for most users to understand.

The JSON data structure from Figure 1 provides all information gathered but is difficult for humans to understand at a glance. The graph shown in Figure 2 provides an overview of the JSON through favicons and edges, but a user requires some training to fully comprehend what it represents. Figure 3 displays the largest connected component from this graph visualized via the SHARI process. Through images, text snippets, titles, cards, domain names, favicons, and other content, the SHARI output allows the viewer to intuitively understand that the biggest news story for this date consists of different reactions to the growing COVID-19 pandemic.

The StoryGraph Hypercane ArchiveNow Raintale Integration (SHARI) [6] process automatically creates stories summarizing news for a day. Figure 4 details what each tool contributes to the story. Figure 5 shows the steps of the SHARI process.

1. With the StoryGraph Toolkit, we query the StoryGraph service for the URI-Rs belonging to the biggest story of the day.
2. Hypercane converts these URI-Rs to URI-Ms by first attempting to find a corresponding URI-M by querying the LANL Memento Aggregator 4 via the Memento Protocol [13]. For each URI-R that does not have a memento, Hypercane creates one by calling ArchiveNow [2] (Figure 6).
3. Hypercane runs the mementos through spaCy 5 to generate a list of named entities, sorted by frequency (Figure 7).
4. Hypercane runs the mementos through sumgram [9] and generates a list of sumgrams, sorted by frequency (Figure 8).
5. Hypercane scores all of the mementos' embedded images. Images that article authors reference in HTML META tags are favored first, followed by MementoEmbed [3] score, then pixel size, color count, the ratio of width to height, and finally position on the page (Figure 9).
6. Hypercane runs the mementos through newspaper3k 6 to extract each article's publication date and orders the URI-Ms by that date (Figure 10).
7. Hypercane consolidates the entities, terms, image scores, and ordered URI-Ms into a JSON file containing the structured data for the summary. During this step, Hypercane uses the highest scoring image as the striking image for the summary (Figure 11). In Figure 4, the highest-ranking image is the UK Prime Minister addressing his country about the COVID-19 pandemic.
8. Raintale renders the output as Jekyll HTML based on the contents of this JSON file, a template file, and information on each memento provided by MementoEmbed (Figure 11).
9. The SHARI script publishes the summary story to GitHub Pages for distribution. Figure 13 shows the output of our dsa tweeter bot which announces the story after publication through the @StormyArchives Twitter account.
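Steps 3, 5, and 6 amount to counting, multi-key ranking, and sorting. A minimal sketch of that logic follows; it is not Hypercane's implementation. All function and field names are hypothetical, and entity and date extraction are assumed to have already happened. The direction of each image tie-breaker (higher score, more pixels, more colors, larger width-to-height ratio, earlier position) is also an assumption, since the text gives only the order of the criteria.

```python
from collections import Counter
from datetime import datetime


def rank_entities(entities_per_memento):
    """Step 3 sketch: count entity mentions across mementos, most frequent first."""
    counts = Counter(e for entities in entities_per_memento for e in entities)
    return counts.most_common()


def image_sort_key(image):
    """Step 5 sketch: compare images criterion by criterion, in the stated order."""
    width, height = image["width"], image["height"]
    ratio = width / height if height else 0.0
    return (
        not image["in_meta_tag"],       # False sorts first: META-referenced images win
        -image["mementoembed_score"],   # assumption: higher score is better
        -(width * height),              # assumption: more pixels is better
        -image["color_count"],          # assumption: more colors is better
        -ratio,                         # assumption: larger width/height is better
        image["position"],              # assumption: earlier on the page is better
    )


def choose_striking_image(images):
    """Return the best-ranked image, or None when there are no images."""
    return min(images, key=image_sort_key) if images else None


def order_by_publication_date(mementos):
    """Step 6 sketch: order mementos by an already-extracted publication date."""
    return sorted(mementos, key=lambda m: m["published"])


# Hypothetical inputs standing in for the outputs of earlier steps.
entities = rank_entities([["COVID-19", "Brooklyn"], ["COVID-19"]])
images = [
    {"url": "https://example.org/a.png", "in_meta_tag": False, "mementoembed_score": 0.9,
     "width": 800, "height": 600, "color_count": 5000, "position": 0},
    {"url": "https://example.org/b.jpg", "in_meta_tag": True, "mementoembed_score": 0.4,
     "width": 1200, "height": 630, "color_count": 9000, "position": 3},
    {"url": "https://example.org/c.jpg", "in_meta_tag": True, "mementoembed_score": 0.7,
     "width": 640, "height": 480, "color_count": 3000, "position": 5},
]
mementos = [
    {"urim": "https://archive.example/2/", "published": datetime(2020, 3, 23, 1, 0)},
    {"urim": "https://archive.example/1/", "published": datetime(2020, 3, 22, 22, 0)},
]

striking = choose_striking_image(images)
ordered = order_by_publication_date(mementos)
print(entities[0], striking["url"], [m["urim"] for m in ordered])
```

Expressing the image criteria as a single tuple keeps the priority order explicit: a later criterion matters only when all earlier ones tie.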

StoryGraph is a valuable resource with additional unrealized potential. We can create stories not only for today or yesterday but for any date back to August 8, 2017, when Nwala launched StoryGraph. Figures 14, 15, and 16 show how the world has evolved each year on StoryGraph's launch date. In Figure 14, the biggest news story was North Korea threatening other nations with nuclear weapons. One year later, in Figure 15, the biggest news story is the results of several United States Congressional and gubernatorial primaries. Two years after StoryGraph's launch, Figure 16 shows that the biggest news story is the aftermath of the 2019 shootings in El Paso and Dayton.

Fig. 6: SHARI steps 1-2 illustrated with a single URI-R from the story shown in Figure 3. Here SHARI extracts the URI-R from StoryGraph and then creates a corresponding URI-M with ArchiveNow.

[...] services like Wakelet 7 because SHARI is entirely automated. The stories produced by SHARI are different from services like Google News 8 or Flipboard 9 because those tools focus on current events and personalized topics. Because StoryGraph samples content from multiple sides of the political spectrum, the SHARI process can provide a visualization of articles not tied to one interest area or even a single side's terminology. This process works because each component is loosely coupled, has high cohesion, has explicit interfaces, and engages in information hiding. Each command passes data in the expected format to the next.

We are also exploring how to improve striking image selection for stories. One could use SHARI to consider how the same story is told in different venues. For instance, one could ask StoryGraph to include only left-leaning sources and produce a SHARI story, then do the same for only the right-leaning sources. With both stories, one could compare the striking images and sumgrams that SHARI produces. We are investigating how to produce and render other news stories for a given day and any given period of time. Finally, we are examining how to best visualize significant events that span substantial periods of time, like the entire COVID-19 news story. Though StoryGraph is an existing service that gathers current news, we also want to apply its algorithm directly to mementos and tell the news stories of past events like the Hurricane Katrina disaster.

https://oduwsdl.github.io/dsa-puddles/stories/shari/2019/08/08/storygraph_biggest_story_2019-08-08/

[1] Generating Stories From Archived Collections
[2] ArchiveNow: Simplified, Extensible, Multi-Archive Preservation
[3] A Preview of MementoEmbed: Embeddable Surrogates for Archived Web Pages
[4] Raintale - A Storytelling Tool For Web Archives
[5] Hypercane Part 1: Intelligent Sampling of Web Archive Collections
[6] StoryGraph Hypercane ArchiveNow Raintale Integration - Combining WS-DL Tools For Current Events Storytelling
[7] The Many Shapes of Archive-It
[8] Focused crawl of web archives to build event collections
[9] Introducing sumgram, a tool for generating the most frequent conjoined ngrams
[10] Bootstrapping Web Archive Collections from Social Media
[11] Scraping SERPs for Archival Seeds: It Matters When You Start
[12] 365 Dots in 2019: Quantifying Attention of News Sources
[13] RFC 7089 - HTTP Framework for Time-Based Access to Resource States - Memento