title: Automatically Selecting Striking Images for Social Cards
authors: Jones, Shawn M.; Weigle, Michele C.; Klein, Martin; Nelson, Michael L.
date: 2021-03-08
journal: 13th ACM Web Science Conference, WebSci 2021
DOI: 10.1145/3447535.3462505

To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource's title, text summary, striking image, and domain name. News and scholarly articles on the web are frequently subject to social card creation when being shared on social media. However, we noticed that not all web resources offer sufficient metadata elements to enable appealing social cards. For example, the COVID-19 emergency has made it clear that scholarly articles, in particular, are at an aesthetic disadvantage in social media platforms when compared to their often more flashy disinformation rivals. Also, social cards are often not generated correctly for archived web resources, including pages that lack or predate standards for specifying striking images. With these observations, we are motivated to quantify the levels of inclusion of required metadata in web resources and its evolution over time for archived resources, and to create and evaluate an algorithm to automatically select a striking image for social cards. We find that more than 40% of archived news articles sampled from the NEWSROOM dataset and 22% of scholarly articles sampled from the PubMed Central dataset fail to supply striking images. We demonstrate that we can automatically predict the striking image with a Precision@1 of 0.83 for news articles from NEWSROOM and 0.78 for scholarly articles from the open access journal PLOS ONE.

Understanding the content behind a URL is important for web discourse. Though users might infer meaning from URLs, summaries are more effective. To provide these summaries, social media platforms provide social cards, as shown in Figure 1. Social cards provide different pieces of information to summarize the underlying content. These social card units typically consist of a title, a small text summary, a striking image, and a domain name that, in aggregate, summarize the page behind a URL. Social cards are similar to snippets in search engine result pages (SERPs) but have a slightly different purpose. Where search engine result snippets are typically dynamically contextualized based on the query and answer the question of "Will this meet my information need?", social cards are static and address the question "What does the underlying page contain?" Both allow for previewing a page and allow pages to "compete for clicks" relative to rivals that might be present elsewhere on the page.

Textual information is important, but a good striking image can also help readers better understand the underlying document [15]. As shown in the Twitter card from Figure 1, the title and description give the reader some idea of the information found in the underlying page, but the striking image provides the additional insight of a current map of COVID-19 infection locations, helping the user infer that not only is a visualization available at this resource, but that it is also updated to contain current information.
In 2010, Facebook and Twitter established HTML metadata standards so page authors could supply their own values for social card units. However, we have observed that not all web pages offer sufficient information to generate helpful cards. Compare the card in Figure 2, showing no image or descriptive metadata, with the one in Figure 3 that contains content for all social card units. The latter is more alluring and provides disinformation, whereas the former is a Nature article providing peer-reviewed research, illustrating the phenomenon detailed in "The Truth Is Paywalled But The Lies Are Free" [30]. To facilitate the spread of accurate information, social cards can help by making such content more accessible.

In the future, scholars will review the COVID-19 pandemic via web archives and their captures of specific observations of web pages. In prior work [14], we analyzed more than 60 platforms and determined that none reliably generated social cards for archived web pages. The roughly 150 billion web pages captured by the Internet Archive before 2010 [3] likely have none of the 2010 standardized metadata. In this work, we will show that 40% of our sample of archived news articles and 22% of our sample of scholarly articles fail to specify striking images. Even though it appears that 78% of scholarly articles contain striking images, we note that 74% of them reuse the same image among multiple articles and 52% of journals reuse the same image for all of their articles, often a publisher or journal logo that does not summarize the article content. The lack of metadata and the use of unsuitable images require social card creation tools to automatically generate descriptions and select striking images for these documents. This fact, along with the observed lack of platform support for creating cards for archived pages, the large number of pages that predate these metadata standards, and personal experience with poor support for some scholarly articles (e.g., Figure 2), inspired the research in this paper. Some of the endeavors benefiting from this research are social media storytelling, carousels for content management systems, and news aggregation platforms.

To quantify this problem effectively, we analyzed news articles and scholarly publications, resources that have undergone editorial review and, presumably, received some care in their publication. News articles and scholarly publications are also frequent sources for social cards on social media. Thus, in the following research questions, we contrast results between news articles (stored in the Internet Archive) and open access journal articles:

Research Question #1 (RQ1) - What are the distributions of HTML metadata elements (general and social card elements) in news articles (over time) and scholarly publications published on the web?

Research Question #2 (RQ2) - What approaches and image features are best suited to automatically select striking images from news articles and scholarly publications, and do the approaches differ for both resource types?

Archived web pages, or mementos, contain a document's original HTML and images downloaded at some time in the past, recorded as the memento's memento-datetime [37]. This memento-datetime represents when the archive captured the memento and when the archive observed its properties, allowing us to examine web authors' behavior in the past. It is not necessarily the same as the publication date since archiving can occur well after publication.
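As a concrete illustration of the memento-datetime (our own sketch, not part of the original study), Internet Archive capture URLs embed this datetime as a 14-digit timestamp between "/web/" and the original URL. The following Python snippet, using a hypothetical example URL, parses that timestamp; archives other than the Internet Archive may use different URL conventions.

# Sketch (assumption: Internet Archive-style capture URLs): extract the
# memento-datetime embedded as a 14-digit YYYYMMDDHHMMSS timestamp.
import re
from datetime import datetime

def parse_ia_capture_url(capture_url):
    """Return (memento_datetime, original_url) for an Internet Archive capture URL."""
    match = re.match(r"https?://web\.archive\.org/web/(\d{14})(?:[a-z_]+)?/(.+)", capture_url)
    if match is None:
        raise ValueError("not an Internet Archive capture URL")
    timestamp, original_url = match.groups()
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original_url

# Hypothetical example:
# parse_ia_capture_url("https://web.archive.org/web/20160101123456/https://example.com/article")
# -> (datetime(2016, 1, 1, 12, 34, 56), "https://example.com/article")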
Sometimes the capture process fails to fully render a page, causing missing stylesheets, images, or JavaScript when an end user revisits the memento, a phenomenon called memento-damage [4]. We refer to the URLs identifying these unchanging mementos as URI-Ms. Each URI-M identifies a capture of a specific version of a changing, live web resource known as the memento's corresponding original resource, identified by a URI-R [37]. Because of the problems with reliably generating cards for mementos [14], we created the social card creation tool MementoEmbed. We will update MementoEmbed with the results from this paper.

For RQ1, we use the NEWSROOM dataset [10] developed by Grusky et al. for evaluating automatic text summarization algorithms against news articles. NEWSROOM contains 1.3 million URI-Ms of news articles for which there are textual summaries present in the article's HTML META elements. These textual summaries may come from the *:description fields shown in Table 1 or they may be specified in the standard HTML META element description field. The NEWSROOM dataset represents captures from 1998 through 2016 of news articles from 29 news outlets.

Also for RQ1, we examine scholarly publications as a contrast with our results for news articles. A digital object identifier (DOI) is a persistent identifier for locating a scholarly article regardless of website redesigns, corporate publisher acquisitions, and other phenomena that lead to broken links. Our work focuses on the HTML landing pages and HTML articles of open access scholarly publications. We acquired the PubMed Central (PMC) open access commercial use dataset [24] consisting of 1.7 million open access articles formatted as XML or plain text files. From these files, we extracted the DOIs, dereferenced them to download their HTML counterparts, and then analyzed the results. These articles are not mementos but are current versions of these scholarly publications. While DOI resolution on the web is not always reliable [17], we used the same request methods and HTTP clients to obtain consistent results.

For RQ2, we reuse a subset of the NEWSROOM dataset for news articles. As we mentioned in the Introduction and will show in Section 4.2, most of the articles in the PMC dataset do not provide good ground truth for striking images. Instead, we use all 227,265 articles from the open access journal PLOS ONE found in the PMC dataset for evaluating striking image prediction. We chose to analyze the articles from PLOS ONE because their submission guidelines [25] encourage each article's author(s) to choose a striking image from their article to represent it. PLOS ONE also has other benefits, such as standardized URL patterns for detecting figures, tables, and equations within the document. This capability allowed us to produce a more intelligent image prediction approach that could discard images such as the PLOS logo, ORCID logo, and advertisements.

Automatic image selection has been applied to the reduction of a large set of images to a small set for building photo albums [26], selecting representative pages from historical manuscripts [7], choosing the best key frame to represent a video [8, 28], generating collages [31], and general image collection summarization [34]. Individual image selection has been applied to detecting specific categories of images, such as spam [22], advertisements [5], or landscapes [12], and coarsely identifying specific principal image subjects, such as vehicle or pet [16].
None of these solutions attempt to find the striking image that summarizes a single web page. In 2004, Hu and Bagga [13] analyzed the front pages of 25 randomly selected news sites and classified the images into seven categories. A story image provides a striking image for a set of news articles covering a specific story. Preview images provide striking images for specific articles. Commercial images are advertisements. Host images provide a photograph of an author. Heading images are navigational elements consisting of stylized text. Icon logos provide branding for the whole news source or a specific feature of the publication. Formatting images perform the function of shaping or arranging a page; examples include transparent spacing images or graphical horizontal rules, largely an artifact of the limited formatting abilities of the HTML of that era. The authors manually annotated 899 images across these 25 front pages. Their SVM classifier achieved an accuracy of 92.5% when combining the discrete cosine transforms of each image with the values of its color bands and the surrounding text's properties.

In 2006, Maekawa, Hara, and Nishio [23] analyzed forty websites and categorized 3,901 images into eleven categories. They then applied a custom classifier to the problem and achieved an accuracy rate of 83.1% across categories. Maekawa's goal was to classify images so that mobile browsers could avoid the unnecessary download of images that would not display well on the smaller displays of mobile devices. Many of their categories are similar to Hu's. Their solution relies on easy-to-calculate features like width, height, byte size, number of colors, content type, and aspect ratio, but they also include the number of images on the page with similar features and textual information. While their overall accuracy is 83.1%, they do poorly for certain categories, such as an F1 score of 0.458 for identifying buttons and 0.694 for advertisements.

Li, Shi, and Zhang [21] ran an experiment in 2008 that is more similar to our work. They were interested in identifying striking images for search results. Their dominant image is similar to our concept of a striking image. For ground truth, they randomly sampled 3,000 documents from a dataset of pages from msn.com, mit.edu, and cnn.com. They then asked three participants to label each image as dominant or non-dominant. With this training data, they applied a custom classifier to predict the class of each image on a page. They used the features of pixel size, aspect ratio, sharpness, contrast, number of colors, categorization of photo or graphic with or without a human face, content type, position on the page, size of the image compared to the size of the page rendered in a browser, number of images larger than this image on a page, whether the image came from an external site, and whether the image repeats across the same web site. Their classifier calculates a relevance score for each image on the page based on the user's query and achieves an accuracy of 0.85.

Koh and Kerne [18] took a different approach. Their method starts by analyzing the HTML document's DOM. It finds the deepest nodes and works its way back up to discover the nodes most likely to contain content. From there, they choose the largest image with the smallest aspect ratio based on empirically determined thresholds. Based on human labeling, their algorithm achieves an accuracy of 0.898 and an F1 of 0.921 across datasets of web pages consisting of 239 news pages and 254 research pages.
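To make the image-selection step of this kind of heuristic more concrete, the sketch below illustrates the "largest image with the smallest aspect ratio" idea. It is our own simplification, not Koh and Kerne's published implementation: it omits their DOM-depth analysis, and the thresholds are hypothetical placeholders.

# Illustrative sketch only (hypothetical thresholds, not Koh and Kerne's code):
# among a page's IMG elements, prefer the largest image whose shape is not
# extremely elongated, discarding tiny icons and spacer graphics.
from bs4 import BeautifulSoup

MIN_PIXELS = 5000        # hypothetical: ignore tiny icons and spacers
MAX_ASPECT_RATIO = 4.0   # hypothetical: ignore banner-shaped images

def select_candidate(html, image_dimensions):
    """image_dimensions maps an IMG src value to its (width, height) in pixels."""
    soup = BeautifulSoup(html, "html.parser")
    candidates = []
    for img in soup.find_all("img", src=True):
        dims = image_dimensions.get(img["src"])
        if dims is None:
            continue
        width, height = dims
        pixels = width * height
        aspect = max(width, height) / max(min(width, height), 1)
        if pixels >= MIN_PIXELS and aspect <= MAX_ASPECT_RATIO:
            candidates.append((pixels, -aspect, img["src"]))
    # largest image wins; ties broken by the smaller (more square) aspect ratio
    return max(candidates)[2] if candidates else None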
Our work differs from that of Hu, Maekawa, Li, and Koh in several ways. Where Hu analyzed the front pages of news sources, we work with individual articles (i.e., deep links within the site). Unlike Hu and Maekawa, we are trying to find a single striking image to summarize the page rather than classifying images into multiple categories. We do not manually label images to develop our ground truth dataset; instead, we rely upon the actual image selected as part of the document's editorial review process, as found in the og:image or twitter:image fields. Unlike Li's work, our method does not require a search query for selecting a striking image. We are inspired by many of the features chosen by all of this prior work. Unfortunately, some of these features are impossible to calculate reliably for our chosen documents. Mementos offer many challenges. For example, analyzing the images available on other pages from the same website may be impossible because the web archive did not capture other pages of the same website. Additionally, features that require rendering in a browser, like comparing page size to image size, may fail due to memento damage. Koh and Kerne are the only ones in this list to consider anything like scholarly publications. They did not process journal articles or conference proceedings but instead analyzed web pages from Scientific American, IBM Research, and Los Alamos National Laboratory. Their solution will only work for HTML documents that contain images of varying sizes and aspect ratios. Our PLOS ONE dataset contains many images whose size and aspect ratio are very similar, making it challenging to apply their solution.

To address RQ1, we sampled news articles from the NEWSROOM dataset and scholarly articles from the PMC dataset. We discovered that the 1.3 million article NEWSROOM dataset was unbalanced with respect to domain name and memento-datetime. For example, NEWSROOM has 186,095 mementos from nytimes.com and 1,429 mementos from economist.com. In terms of memento-datetime year, NEWSROOM contained only 19 mementos from 1998 but far more from later years. Facebook introduced the Open Graph Protocol (OGP) in 2010 [11], and Twitter will use that standard as a fallback for its own cards. Because the number of mementos in the NEWSROOM dataset is more heavily biased toward years closer to 2016 and we wanted to contrast metadata usage before and after card standards were published, we added all 90,570 NEWSROOM articles with memento-datetimes from 2009 and before to our sample. For those after 2010, we created a bucket for each domain and memento-datetime year. We randomly chose URI-Ms until we filled each domain/year bucket to a size of 1,307, the median size of all domain/year divisions after 2010. Our sample size after this process was 310,163 mementos.

We downloaded this sample in June 2020. To lessen the chance of being rate limited by the Internet Archive, we divided the URI-Ms in the sample into seven subsets and spread them across different servers in Amsterdam, Frankfurt, London, New York, Northern Virginia, San Francisco, and Toronto. We felt confident in this approach because the Internet Archive presents the content it recorded and does not alter it for different geographic locations. To address failed downloads caused by rate limiting, we repeated the downloads once in July and again in August 2020. We discovered that the downloads with HTTP status codes of 400, 403, 404, and 405 were indeed captures of pages with those status codes. Some mementos redirected to mementos captured after 2016, similar to behavior that Ainsworth et al. [2] reported in a different set of mementos.
We removed all mementos captured after 2016. After resolving these issues, as shown in Table 2, our downloaded NEWSROOM sample consisted of 277,724 mementos. Table 3 lists the number of mementos in the NEWSROOM sample capable of creating different combinations of social card units. Because Facebook is more forgiving with missing fields, 59.56% of articles can create a full card on Facebook, while only 43.86% can do so with Twitter.

We assigned each metadata field encountered to a category based on its corresponding standard or usage. We removed all instances where metadata fields were specific to a domain (e.g., only nytimes.com used the metadata field byl). Figure 4 demonstrates how these categories changed over time. There is a focus on HTML standard metadata throughout our sample because all mementos in the NEWSROOM dataset contain at least a textual summary in some form, and before the OGP or Twitter standards, these articles used the HTML standard description field. We observed that news articles rapidly adopted social card metadata fields, starting with 13.13% adoption of OGP in 2010 and reaching 93.05% by 2016. After 2010, there is a rise in all types of metadata usage, focusing on search engines, mobile apps, browser customization, and social media, showing that news publishers leveraged these standard metadata fields to promote their content once they became available.

The PMC dataset of 1.7 million articles was organized by journal. To equalize the numbers per journal, we randomly sampled 100 DOIs from each journal's articles in the dataset. To avoid rate limiting, we downloaded these articles' landing pages from servers located in the English-speaking locations of London, San Francisco, New York, and Toronto. We ensured that we equally represented each journal at each location. Table 4 shows how we were able to build a dataset of 110,900 articles in August 2020 for metadata analysis. This sample represents 1,109 journals from 209 publishers. These were not mementos but current web resources. Many of these articles are not stored in web archives, so we could not analyze metadata usage over time. Even though it appears that striking images for cards are well established for Facebook, 68,761/110,900 (73.98%) articles reuse the same og:image value as another article. Most of these striking images were journal or publisher logos. In fact, 572/1,109 (51.6%) journals used the same URL for og:image with every article. Thus, few publishers use an image from the document's content and instead favor publisher or journal promotion. This contrasts with the behavior we observed for news articles.

The numbers of mementos and scholarly articles lacking any meaningful striking image led to RQ2. Our overall goal was to find the approach that, given a set of images, will select the image closest to what a human selected, as found in the metadata, for the same document, regardless of whether or not that image also exists outside of the metadata. The striking images found in the metadata are our ground truth because the document's editorial process produced them. Thus, any dataset applied to this endeavor requires documents where all images are available, and all documents must contain at least one striking image. This disqualifies documents that we cannot download or cannot parse with BeautifulSoup [29] and documents with images that we cannot process with Pillow [6] or ImageMagick [33]. We also disqualify documents with only one image because we have no prediction to make, and these easy wins may skew results.
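A minimal sketch of this disqualification step appears below. It is our own illustration rather than the paper's actual tooling, and it assumes a document's images have already been downloaded to local files; the paper additionally checks ImageMagick processing, which is omitted here.

# Sketch (our illustration, not the authors' code): keep only documents whose
# images all open cleanly in Pillow and that offer more than one image.
from PIL import Image

def usable_document(image_paths):
    """Return True if every downloaded image opens cleanly and the document
    offers more than one image to choose from."""
    if len(image_paths) < 2:
        return False  # only one image: no prediction to make
    for path in image_paths:
        try:
            with Image.open(path) as img:
                img.verify()  # raises on truncated or corrupt image data
        except Exception:
            return False
    return True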
For each document, we consider image URLs found in the META elements from the HEAD as well as those provided by the src and srcset attributes of each IMG element found in the BODY. For analyzing news articles, we started with our NEWSROOM sample of 310,163 mementos. As shown in Table 6, the issues we encountered left us with 37,522 articles to evaluate. The fact that we lose much of our dataset to download issues further emphasizes that many mementos can benefit from automatic striking image prediction because some of their images may be missing or corrupted.

As mentioned in Section 2 and demonstrated in Section 4.2, the PMC dataset was unsuitable for striking image prediction because the ground truth was poor or non-existent. We instead used a subset of the PMC dataset consisting of 227,265 PLOS ONE articles because of their quality striking images. PLOS ONE's standardized URL patterns allowed us to produce a smarter image prediction approach that could discard images such as the PLOS logo or advertisements. We produced an HTML scraper that processes PLOS ONE articles with these patterns in mind. We did not consider images found in supplemental sections or appendices. Table 7 lists the issues we encountered when attempting to download and process these articles in July 2020, leaving us with 198,523 articles.

Once we had our datasets, we applied different prediction approaches. Each prediction approach required one or more image features (e.g., byte size) to be successful. Some approaches choose an image based on a single feature. We refer to the following as our base features:
• image byte size
• image width in pixels
• image height in pixels
• the number of columns in the image's histogram with a value of 0 (negative space)
• image size in pixels
• the image's aspect ratio
• the number of colors in the image

In an attempt to achieve better results, we also supplied multiple features as input to various scikit-learn [27] classifiers. When training classifiers, we placed images found in the HTML metadata into the class of present-in-metadata and other images into the class of other. When testing classifiers, we considered each document to be its own test case. We supplied the features of each image in a document to the classifier and asked it to provide the probability that the image comes from the class present-in-metadata. From that set, we choose the image that has the highest probability of being in the class present-in-metadata as the predicted striking image for that document. We chose this probability method so that each document contains at least one striking image prediction. If we had merely evaluated how well the classifier predicted an image's class (present-in-metadata or other), there would be documents for which the classifier found no striking image. With our method, even a low probability of belonging to present-in-metadata still predicts a striking image because all of the document's images' probabilities are compared.

Social card creation tools have to contend with many different types of images. The values for some image features, like byte size, have no upper bound, making proper scaling challenging to estimate for the long term. Scaling can also remove precision from some measures, leading to poor results. Thus, we only considered classifiers whose scikit-learn implementations provide class probability scores and do not require feature scaling. We evaluate classifier operation with 10-fold cross-validation by document.
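A condensed sketch of this probability-based selection follows. It is our own illustration rather than the authors' released code: the feature layout and function names are assumptions, and Random Forest stands in for whichever classifier is under test.

# Sketch (our illustration): images from the HTML metadata form the positive
# class during training; at test time, every image in a document is scored and
# the one most likely to be "present-in-metadata" becomes the prediction.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train(feature_matrix, in_metadata_labels):
    """feature_matrix: one row of base features (byte size, width, height,
    negative space, pixel size, aspect ratio, color count) per image;
    in_metadata_labels: 1 for present-in-metadata, 0 for other."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(feature_matrix, in_metadata_labels)
    return clf

def predict_striking_image(clf, document_image_features, image_urls):
    """Rank one document's images by the probability of the present-in-metadata
    class and return the most probable image URL plus the full ranking."""
    probabilities = clf.predict_proba(document_image_features)
    positive_index = list(clf.classes_).index(1)  # column for present-in-metadata
    ranking = np.argsort(-probabilities[:, positive_index])
    return image_urls[ranking[0]], ranking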
Instead of evaluating how well each prediction approach predicts an image's class (present-in-metadata or other), we instead evaluate how well, given a set of images found in a document, the approach selects the ground truth striking image for that document. This leads us away from considering metrics like F1 because they invite discussions about what recall means in this application. Instead, we consider other metrics commonly associated with information retrieval because we consider each document a query and the output of a prediction approach as a set of ranked results where the striking images are the relevant results. Thus, we apply Precision@1 (P@1) to determine the given approach's level of success for predicting the striking image from the metadata. If an approach produces a relevant result as its first result for a document, the P@1 score for that document and approach is 1, and 0 otherwise. We take the mean of the P@1 scores for all documents, making P@1 a proxy for accuracy. We also supply Mean Reciprocal Rank (MRR) to evaluate how well an approach performed even if it failed to achieve P@1 = 1.0 for a document. For a given document, we determine the rank of the first relevant image (r) as provided by the approach under test and compute the rank's reciprocal (1/r). We then compute the mean of all of these reciprocal ranks across all documents for the same approach. For example, if 10 images exist in a document and the approach ranks a relevant image in fourth place, then the reciprocal rank is 1/4 = 0.25. A score of 1.0 for P@1 and MRR is ideal.

We recognize that the same image may exist at different URLs or the striking image may be a cropped or resized form of the same image elsewhere in the document. For these issues, we apply a perceptive hash (pHash) distance to our evaluation as a proxy for human judgment of image similarity. If an approach selects image A, but the ground truth image is B, and if pHash(A, B) = 0, then we consider A to be just as relevant when computing P@1 and MRR. Different pHash implementations exist. We evaluated ImageHash's pHash [20], ImageMagick's pHash [32, 38], and Zauner's pHash [39]. We manually reviewed the intra- and inter-document similarity distances provided by each pHash implementation for the images found in 10 news and 10 scholarly articles. We concluded that ImageMagick's pHash provided the most intuitive similarity distances because it placed photographs at the same distance to each other, separate from logos and text. ImageMagick's pHash was more consistent with considering cropped or resized images to be similar to their original form. ImageMagick's pHash scores are not scaled and reached values as high as 6000 during this evaluation. We noted that, while low scores intuitively indicated similar images, as scores reached higher values there is a greater discrepancy between the scores and human perception; thus, we needed an upper bound for scaling that kept this discrepancy in mind. Our median score from this 20-article evaluation was 280.904. We scaled all distance scores such that scores above twice the median (561.808) became 1.0 and other scores are the result of dividing their value by 561.808. ImageMagick has an issue processing certain JPEGs [19], so we converted all images to PNGs before computing the pHash distance.

News Articles. Figure 5a demonstrates the performance of different prediction approaches with the NEWSROOM sample.
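Before walking through Figure 5a, the evaluation just described can be summarized with a short sketch. It is our own illustration, and scaled_phash_distance is a stand-in name for a function returning the scaled (0.0 to 1.0) ImageMagick pHash distance between two images.

# Sketch (our illustration) of per-document P@1 and reciprocal rank, where an
# image is relevant if it is the ground truth or within the pHash threshold.
def evaluate_document(ranked_images, ground_truth, scaled_phash_distance, threshold=0.0):
    """ranked_images: image URLs ordered by an approach's confidence.
    Returns (P@1, reciprocal rank) for one document."""
    def relevant(image):
        return image == ground_truth or scaled_phash_distance(image, ground_truth) <= threshold

    precision_at_1 = 1.0 if relevant(ranked_images[0]) else 0.0
    for rank, image in enumerate(ranked_images, start=1):
        if relevant(image):
            return precision_at_1, 1.0 / rank
    return precision_at_1, 0.0  # no relevant image found in the ranking

# Averaging the per-document values across all documents yields P@1 and MRR for
# an approach; e.g., a relevant image ranked fourth contributes 1/4 = 0.25 to MRR.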
Each line demonstrates the increasing MRR or P@1 score, identified by its points' y-axis values, as produced by an evaluation that we performed at the corresponding pHash distance on the x-axis. When we evaluate at a pHash distance of 1.0, any image selected by the approach is equivalent to the ground truth; thus, MRR and P@1 are 1. At a pHash distance of 0, only images that are perceptively equal to the ground truth image are relevant. The best approach achieves the highest MRR and P@1 at the lowest pHash distance, resulting in lines that start in the upper left quadrant of each graph. Randomly choosing images gives poor performance at a pHash distance of 0 with MRR = 0.4016 and P@1 = 0.1551. Training Random Forest with all base features results in the best performance at a distance of 0 with MRR = 0.8825 and P@1 = 0.8314.

Figure 5: These visualizations demonstrate the MRR and P@1 results for different striking image prediction approaches as run against each dataset: (a) 37,522 news articles from NEWSROOM; (b) 198,523 scholarly articles from PLOS ONE. The best approach achieves the highest MRR and P@1 at the lowest pHash distance, making the ideal situation one where an approach's lines start higher into a graph's upper left corner. Items in parentheses are feature categories applied to the classifier, if applicable.

To see if we could improve performance with fewer features, we applied Spearman's ρ to our features, as shown in Table 8. We see that aspect ratio and negative space have the lowest correlation. We retrained Random Forest with these features removed and achieved MRR = 0.8782 and P@1 = 0.8267 at a pHash distance of 0. Removing additional features produced results with P@1 < 0.8.

Scholarly publications have additional features not available in web-based news publications. To improve our chances of discovering a meaningful striking image, we only included images that had captions and were directly cited in the paper. This decision excludes equations, which were never a striking image in the PLOS ONE dataset. Equations as images are an artifact of HTML being incapable of reliably rendering mathematical notation. This left graphics. Section features: the first reference to a figure may indicate its importance to the paper. Our section index feature is the number of the section where the image is first referenced. For example, in this paper, the first image is referenced in the Introduction, which has a section index of 1. The scaled section index is a scaled version of this. Character position in section and word position in section are similar positional features recording where within that section the figure is first referenced.

Performing text analysis on captions may indicate which figure is most important and thus a good candidate for the striking image. For each paper, we used NLTK to generate a list of all words, with stopwords removed, and then calculated their term frequencies within that paper. We computed a score for each caption by summing the term frequencies of that caption's words. We then ranked all captions in the paper by these scores, giving us the feature caption TF rank; thus, a higher rank is a proxy for the caption's importance to the document. We scaled this across all captions in the paper for caption TF rank (scaled). Titles may contain insight into the meaning of a paper, so we computed the Jaccard distance between the words in the title and each caption.

Figure 5b demonstrates the results of some of our striking image prediction approaches with the PLOS ONE sample. As with Figure 5a, the x-axis shows the pHash distance and the y-axis is the corresponding MRR or P@1 score.
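Before turning to those numbers, the caption-based features just described can be sketched as follows. This is our reading of the description above; the exact tokenization, tie handling, and rank direction are assumptions, and the NLTK "punkt" and "stopwords" data must already be downloaded.

# Sketch (our reading, not the authors' code) of caption TF rank and the
# title-caption Jaccard distance for one paper.
from collections import Counter
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))

def tokens(text):
    return [w.lower() for w in word_tokenize(text)
            if w.isalpha() and w.lower() not in STOPWORDS]

def caption_features(title, full_text, captions):
    term_frequencies = Counter(tokens(full_text))
    scores = [sum(term_frequencies[w] for w in tokens(c)) for c in captions]
    # rank 1 = caption whose words are most frequent in the paper (assumed ordering)
    ranks = [sorted(scores, reverse=True).index(s) + 1 for s in scores]
    title_words = set(tokens(title))
    jaccard_distances = [
        1 - len(title_words & set(tokens(c))) / max(len(title_words | set(tokens(c))), 1)
        for c in captions
    ]
    return ranks, jaccard_distances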
Randomly selecting the image from a document performs at MRR = 0.5185 and P@1 = 0.2883 at a pHash distance of 0. Merely choosing the last figure's image achieves MRR = 0.7753 and P@1 = 0.6975 at a pHash distance of 0. Table 9 provides detailed numbers for results using different classifier-feature combinations. Our best scoring classifier-feature combination at a pHash distance of 0 was Random Forest trained with base features combined with figure position features at MRR = 0.8643 and P@1 = 0.7786. Applying the correlation information from Table 10 to remove features produced results with P@1 ≤ 0.7.

For those items in the NEWSROOM dataset that still exist on the web, it would be interesting to run our striking image prediction analysis on the current versions of those articles. Even though the article content is not current, the publishing platform has likely changed and may now produce card metadata. Our results for PLOS ONE may be specific to that journal, or they may be specific to the biology and medical fields serviced by PMC. Further striking image prediction analysis with different journals is needed to expand and generalize these results. The open access publisher Frontiers supplies striking images for its HTML articles and may offer suitable comparison journals. Scholarly articles are more commonly published as PDFs, so we will re-evaluate this process with PDF extraction tools. Additionally, another application of this research may be to suggest which images should be included in the main sections of a paper instead of being relegated to its supplemental sections or appendices.

Social cards describe the content behind a URL, helping a user answer the question of "What does the underlying page contain?" A social card summarizes each web page through the units of page title, striking image, domain name, and description. While the title and domain name are often easily extracted, automatically generating the description and a striking image is more challenging. Social media platforms provide standards so that authors can insert their own values for these social card units into their HTML pages as metadata. Per RQ1, our evaluation of the archived web pages (mementos) of 277,724 news articles from the NEWSROOM dataset revealed that card metadata fields are nonexistent for mementos captured before 2010. This translates into roughly 150 billion web pages at the Internet Archive for which social card creation tools will need to automatically generate descriptions and striking images. We found that news articles rapidly adopted social card metadata fields, with 13.13% adoption in 2010 and reaching 93.05% by 2016. We also evaluated 110,900 scholarly articles in the PMC open access dataset and discovered that while 77.86% of scholarly articles specified a striking image, 73.98% reuse the same image among multiple articles. In fact, 572/1,109 (51.6%) of journals used the same image in every article. Instead of selecting an image directly from the article to summarize it, the publications chose a colorful placeholder or a journal or publisher's logo. This practice is not in line with non-scholarly publishers, and thus scholarly articles are at a disadvantage on social media.

This dearth of quality striking images inspired RQ2. By analyzing 37,522 articles from the NEWSROOM dataset and 198,523 articles from the journal PLOS ONE, we have determined that the same automatic image selection approach cannot be applied to both types of documents.
With NEWSROOM, we achieve P@1 = 0.8314 and MRR = 0.8825 at a pHash distance of 0 with Random Forest and the base features of image width, height, byte size, pixel size, negative space, aspect ratio, and color count. In our PLOS ONE dataset experiment, applying these base features to Random Forest performed worse than randomly selecting an image. We achieved the best image selection performance of P@1 = 0.7786 and MRR = 0.8643 at a pHash distance of 0 for PLOS ONE with Random Forest by using the base features combined with the position of the figure on the page. We did not achieve better results with features like caption text or section references. These results have implications for social media, news aggregation platforms, content management systems, and social media storytelling. We believe that better social cards will help platforms better summarize topics, equalize the probabilities that readers will select more informative content, and, most important of all, help readers understand the documents that others share with them.

ACKNOWLEDGMENTS
The authors thank Jian Wu and Alexander Nwala for conversations that inspired parts of this research. We also thank Max Grusky for granting us access to the NEWSROOM dataset and PMC for making their Open Access Journal Dataset openly available.

REFERENCES
[1] HighWire at 25: Anurag Acharya (Google Scholar) looks back
[2] Only One Out of Five Archived Web Pages Existed as Presented
[3] Internet Archive
[4] Not all mementos are created equal: measuring the impact of missing resources
[5] Sensitivity Analysis of Neural Network Parameters for Advertising Images Detection
[6] Pillow (PIL Fork) Documentation
[7] Automatic Image Cropping and Selection Using Saliency: An Application to Historical Manuscripts
[8] Key frame selection to represent a video
[9] The Open Graph protocol
[10] Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
[11] Abstract: The Open Graph Protocol Design Decisions
[12] Landscape Image Retrieval with Query by Sketch and Icon
[13] Categorizing Images in Web Documents
[14] MementoEmbed and Raintale for Web Archive Storytelling
[15] Social Cards Probably Provide For Better Understanding Of Web Archive Collections
[16] WEB Image Classification Based on the Fusion of Image and Text Classifiers
[17] On the Persistence of Persistent Identifiers of the Scholarly Web
[18] Deriving image-text document surrogates to optimize cognition
[19] CorruptImageProfile 'xmp' @ warning/profile.c/SetImageProfileInternal/1701 · Issue #110 · ImageMagick/ImageMagick6
[20] Looks Like It - The Hacker Factor Blog
[21] Improving Relevance Judgment of Web Search Results with Image Excerpts
[22] A self-adaptable image spam filtering system
[23] Image classification for mobile web browsing
[24] Open Access Subset
[25] Submission Guidelines - PLOS ONE
[26] Diverse Neural Photo Album Summarization
[27] Scikit-learn: Machine Learning in Python
[28] Best Frame Selection in a Short Video
[29] Beautiful Soup Documentation
[30] The Truth Is Paywalled But The Lies Are Free
[31] ImageHive: Interactive Content-Aware Image Summarization
[32] Perceptual Hashing for Color Images Using Invariant Moments
[33] The ImageMagick Development Team
[34] Learning mixtures of submodular functions for image collection summarization
[35] Getting started with Cards
[37] RFC 7089 - HTTP Framework for Time-Based Access to Resource States
[38] Tests Of Perceptual Hash (PHASH) Compare Metric
[39] Implementation and Benchmarking of Perceptual Image Hash Functions