key: cord-0544701-0cfpq8d2 authors: Mujib, Munif Ishad; Heidenreich, Hunter Scott; Murphy, Colin J.; Santia, Giovanni C.; Zelenkauskaite, Asta; Williams, Jake Ryland title: NewsTweet: A Dataset of Social Media Embedding in Online Journalism date: 2020-08-06 journal: nan DOI: nan sha: 7738239f8ad214ddab509fbe63da436594bc22cb doc_id: 544701 cord_uid: 0cfpq8d2 The inclusion of social media posts---tweets, in particular---in digital news stories, both as commentary and increasingly as news sources, has become commonplace in recent years. In order to study this phenomenon with sufficient depth, robust large-scale data collection from both news publishers and social media platforms is necessary. This work describes the construction of such a data pipeline. In the data collected from Google News, 13% of all stories were found to include embedded tweets, with sports and entertainment news containing the largest volumes of them. Public figures and celebrities are found to dominate these stories; however, relatively unknown users have also been found to achieve newsworthiness. The collected data set, NewsTweet, and the associated pipeline for acquisition stand to engender a wave of new inquiries into social content embedding from multiple research communities. The appearance of user-generated content from social media in web-based news articles is rapidly becoming a familiar phenomenon to readers. Studies have been conducted on the impact of including such content on reader perception. As a novel and engaging method of surfacing the voice of the general population as well as a more authentic channel for reporting statements from sources of news, be they persons or organizations, this phenomenon has engendered interest in multiple research communities. 
Given this interest in what this work refers to as social media content embedding, both in terms of sourcing routines in the news [1] and of understanding social media sourcing in particular [2], the need for pertinent data collection and sharing is apparent. The presented data set, "NewsTweet", is an ongoing automatic collection from a broad range of news content categories. It collects user-level activity in addition to selected embedded tweets that may be used for predictive modeling. Further, it includes the proportions of tweets and associated users, thus providing a multilateral approach to social media sourcing. The construction of this data set is the first necessary step in the study of the social media embedding phenomenon: the design of a custom data stream spanning news outlets and genres that sit at the critical intersection of edited news and noisy social content.

Social media content from a variety of platforms appears in web articles in embedded form. In terms of reported embedding volume share, the largest platforms are Twitter (59%), YouTube (26%), Instagram (14%), and Facebook (1%) [3]. An exhaustive data collection pipeline should capture, follow, and analyze the embedded content from all of these sources. However, differing platform standards limit deep access to their content. As such, NewsTweet, in its first version presented here, collects social data only from Twitter, which has become the de facto standard for social media analysis. Twitter also remains the largest source of embedded social content. The data collection described in this work exhibits the expected proportions of embedded content from different platforms (see Table 1). NewsTweet attempts to acquire data on the interaction of embedded content with news production while maintaining representation of the various existing content categories in news.
With this stated goal, a reasonable starting point for data collection was deemed to be a generalized news aggregator, specifically, the Google News RSS feeds [4]. These feeds offer a common breakdown of news into 8 sections: Business (B), Entertainment (E), Health (H), Nation (N), Sports (S), Technology (T), World (W), and Headlines (X).¹ The result of each feed request is a collection of hyperlinks to news articles along with limited other metadata. All newly seen hyperlinks are then accessed for their HTML content, which includes embedded social media content. The resulting HTML content is processed to find and extract embedded tweets when present.

RSS feed data collection has several caveats. The Google News RSS feeds lack official documentation, although help forums discussing their usage can be found. Google offered a software application, Google Reader, from 2005 to 2013 that allowed users to follow and read RSS feeds. Google Reader was shut down due to declining user numbers, signaling a shift in policy and product direction within the company regarding RSS feeds. While the feeds continue to be available (presumably as part of the Google News product), Google may choose to sunset them at any time. Since RSS is an open standard and other RSS news aggregators are and will foreseeably continue to be available, collection can be readily switched to a different aggregator. Moreover, the surfeit of news sources already uncovered by this collection effort could be monitored directly, creating new custom RSS feeds. Nevertheless, Google News commands a substantial volume of users due to its close integration with Google Search [5], so the selection of articles aggregated on this platform is uniquely significant. Thus, while the loss of the Google News RSS feeds would certainly have a substantial impact on the analysis of online news-related phenomena, alternatives for data collection will remain available.
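The link-harvesting and tweet-extraction steps described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the authors' implementation. It assumes that each feed is an RSS 2.0 document with `<item><link>` elements and that embedded tweets appear in Twitter's standard `<blockquote class="twitter-tweet">` markup containing a status permalink. The names `new_links`, `EmbeddedTweetParser`, and `embedded_tweets` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Assumed permalink shape: twitter.com/<handle>/status/<tweet id>
STATUS_URL = re.compile(r"twitter\.com/(\w+)/status(?:es)?/(\d+)")


def new_links(rss_xml, seen):
    """Return article links in the feed that are not yet in `seen`, and record them."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


class EmbeddedTweetParser(HTMLParser):
    """Collect (handle, tweet_id) pairs found inside twitter-tweet blockquotes."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside an embed blockquote
        self.tweets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "blockquote":
            classes = (attrs.get("class") or "").split()
            if self.depth or "twitter-tweet" in classes:
                self.depth += 1
        elif tag == "a" and self.depth:
            match = STATUS_URL.search(attrs.get("href") or "")
            if match:
                self.tweets.append((match.group(1), match.group(2)))

    def handle_endtag(self, tag):
        if tag == "blockquote" and self.depth:
            self.depth -= 1


def embedded_tweets(html):
    """Return the (handle, tweet_id) pairs of all tweets embedded in a page."""
    parser = EmbeddedTweetParser()
    parser.feed(html)
    return parser.tweets
```

Restricting the search to links inside the embed blockquote is a deliberate choice in this sketch: an ordinary hyperlink to a tweet elsewhere on the page is not counted as an embedding.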
A particular artifact of the Google News feeds is the inclusion of YouTube video pages as articles. During the data collection process, it was discovered that approximately 6.8% of articles were YouTube pages. Although the HTML content of these pages is downloaded, they are ignored in subsequent analyses, since no social media embeddings can appear on a YouTube page.

Table 3: Descriptive statistics exhibiting the top five most-embedded users (by total embeds, top section), most re-embedded users (by fraction of embeddings unique, ascending, middle section), and most effectively-embedded accounts (by number of tweets embedded out of their total number produced over their embedding period, bottom section).

Embedded tweets, which primarily display user handles, timestamps, tweet text, and attached media to readers, appear in articles with limited metadata. From this metadata, individual tweet IDs are used to construct calls to the Twitter API for full tweet objects. These objects include user profile data and often geographic information as well. Once full tweet objects have been downloaded, the user IDs are leveraged to direct mass data collection from the Twitter platform and access large, continuous portions of the users' timelines (i.e., their tweets in chronological order). These timelines are accessed in reverse chronological order and restricted to the users' most recent 3,200 tweets according to Twitter API policies. Once the initial timeline depth is accessed for a given user, their account is then added to a running list of users whose timelines are tracked and "topped off", i.e., new tweets are gradually added to the original collection. Thus, once a tweet from a user is found embedded in an article, that user is subsequently continuously tracked for new tweets. Since the number of users to follow rapidly grows beyond the limits of the Twitter API's standard free tier, users are randomly sampled at regular intervals for new tweet downloads or top-offs to ensure that all users are eventually reached.
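The track-and-top-off scheme described above can be illustrated with a small sketch. It is not the authors' software: the `fetch` callable stands in for whatever Twitter API client is used and is purely hypothetical, as are the class and method names and the per-round budget. Only the 3,200-tweet depth and the uniform random sampling come from the text.

```python
import random

MAX_TIMELINE_DEPTH = 3200  # deepest reachable history per user, as stated in the text


class TimelineTracker:
    """Track users' timelines and top them off by uniform random sampling."""

    def __init__(self, fetch, budget_per_round, seed=None):
        # fetch is a hypothetical callable (user_id, since_id) -> list of tweet
        # dicts with an integer "id", newest first, all newer than since_id
        # (or the most recent tweets when since_id is None).
        self.fetch = fetch
        self.budget = budget_per_round  # users that can be refreshed per interval
        self.rng = random.Random(seed)
        self.timelines = {}             # user_id -> stored tweets, newest first

    def track(self, user_id):
        """Download the initial timeline once, then keep the user on the list."""
        if user_id not in self.timelines:
            initial = self.fetch(user_id, None)
            self.timelines[user_id] = initial[:MAX_TIMELINE_DEPTH]

    def top_off(self, user_id):
        """Prepend tweets newer than the newest one already stored."""
        timeline = self.timelines[user_id]
        since_id = timeline[0]["id"] if timeline else None
        timeline[:0] = self.fetch(user_id, since_id)

    def run_round(self):
        """Refresh a uniform random sample of tracked users and return it."""
        users = list(self.timelines)
        sampled = self.rng.sample(users, min(self.budget, len(users)))
        for user_id in sampled:
            self.top_off(user_id)
        return sampled
```

Calling `run_round()` once per interval gives every tracked user the same chance of being refreshed, so no account is starved even when the list outgrows the budget. An activity-weighted queue would replace the uniform `sample` call.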
This random sampling ensures that user timeline data collection does not lag behind the real-time stream. Efficient maintenance of the stream will ultimately require an automated queueing system that prioritizes topping off more active users more often. These enhancements are essential and remain a priority for implementation, as one of the goals of this project is to release this acquisition software to the community to advance research activity in this area. A full schematic of the data collection pipeline is presented in Figure 1.

Data collection began on May 15th, 2019. As of September 11th, 2019, the stream had acquired 273,899 articles (2,302 articles per day, on average). Of these, 35,218 (13%) had embedded tweets (296 articles per day, on average).² The articles containing embedded social content are not uniformly distributed over the eight Google News sections. This can be seen in Table 2, which presents these and other descriptive statistics by section. Overall, 13% of articles included embeddings, yet the proportion varies across content categories. The Sports section had the highest proportion of articles with embedded content (24%), followed by Entertainment (14%); Health had the lowest, at only 2%. The pattern reverses for the share of embedded tweets that are unique (93% in Health vs. 78% in Sports).

² However, the number of articles with embeds from all platforms is nearly double this, and is quantified in Table 1.

Many users are repeatedly embedded. Among these, @realdonaldtrump³ is a clear outlier, having been embedded an order of magnitude more often in the Nation and World sections. The top embedded users by total embeds in each section are shown in Table 3. These are consistently newsworthy accounts. When users are sorted by the lowest fraction of unique tweets, we see a different group of users in the middle section of Table 3.
For these users, only a few unique tweets were found newsworthy, but those tweets were embedded repeatedly by journalists. From this view, a completely different set of users (often celebrities and well-known organizations) receives highly focused attention, perhaps around very specific, time-limited events.

Taking a different view, that of users' effectiveness at getting embedded, we again see a completely different picture. In the bottom section of Table 3, the number of unique tweets each user had embedded is divided by the number of tweets they produced during their period of being embedded. Here, a perhaps unexpected picture emerges: the users who received the most embeds for the fewest tweets actually received more unique embeds than they produced unique tweets. This is possible only because there is a lag between the time a tweet is produced and the time it is embedded, so that tweets written before a user's embedding period began can still be embedded within it. For these users, journalists had to 'catch readers up' on the stories that emerged from possibly unknown individuals. Thus, we consider whether some users had a back story that necessarily had to be told in order for their newsworthiness to be qualified. Under this hypothesis, we ask whether these top-ranking users by effectiveness are those who truly gained celebrity status from their embeddings, perhaps having been less well known before.

The news side of the data collection also presents interesting characteristics. At the coarsest level, the Google News RSS feeds covered 5,961 different domains in all. Table 4 presents the top-ranked domains sorted both by total articles and by average embedded tweets per article.⁴ The top hosts that appear in Table 4 are what might be expected for both metrics. Top hosts by total articles are well-known mass media organizations, whereas top hosts by average embeddings seem largely to belong to categories that have higher embedding rates, i.e., Sports and Entertainment (further observable in the results in Table 2).
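The three user rankings of Table 3 can be made concrete with a short sketch. It assumes a flat list of embedding occurrences and a per-user count of tweets produced during the embedding period. Both inputs, and the names `user_embed_stats` and `rank`, are hypothetical and are not taken from the released data format.

```python
from collections import Counter, defaultdict


def user_embed_stats(embeds, produced):
    """Compute per-user embedding statistics.

    embeds:   iterable of (user, tweet_id), one pair per embedding occurrence.
    produced: dict mapping user -> number of tweets the user produced during
              their embedding period (assumed to be supplied by the caller).
    """
    total = Counter()
    unique = defaultdict(set)
    for user, tweet_id in embeds:
        total[user] += 1
        unique[user].add(tweet_id)

    stats = {}
    for user in total:
        n_unique = len(unique[user])
        stats[user] = {
            "total_embeds": total[user],
            # fraction of a user's embeddings that are distinct tweets
            "unique_fraction": n_unique / total[user],
            # unique tweets embedded per tweet produced; may exceed 1 because
            # of the lag between producing a tweet and its embedding
            "effectiveness": n_unique / produced[user] if produced.get(user) else None,
        }
    return stats


def rank(stats, key, descending=True):
    """Return users ordered by one statistic, skipping users where it is undefined."""
    users = [u for u in stats if stats[u][key] is not None]
    return sorted(users, key=lambda u: stats[u][key], reverse=descending)
```

Under these assumptions, `rank(stats, "total_embeds")` corresponds to the top section of the table, `rank(stats, "unique_fraction", descending=False)` to the middle section, and `rank(stats, "effectiveness")` to the bottom section.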
⁴ For the average embeddings per article ranking, an ad hoc threshold is applied to exclude domains that have published fewer than 10 articles. This prevents domains that have few articles but embed an abnormal number of tweets from dominating the ranking.

This data set stands to inform and enable numerous angles of inquiry into the phenomenon of social media content embedding. The patterns immediately visible in the collected data pose intriguing questions. In particular, this data connects well with existing research into user commenting typology based on the contribution practices in news to gauge patterns of influence [6]. A particularly salient question is that of newsworthiness. What makes social media content newsworthy? Which users are deemed more newsworthy? What is the likelihood for a user's content to be selected for embedding? What role does the user's status on the social platform (verification, affiliation, and reach) play in their newsworthiness? When social media content is embedded in the news, how often is the content itself the source of news, and how often is it commentary on news? What defines "staying power" in the news? How are "influencers" and internet celebrities born? What effects does sudden newsworthiness have on a user's social media following and behavior?

Many scholars of the early days of social media, such as [7], emphasized the potential of social media to "break through from a one-way, asymmetric model of communication to a more participatory and collective system, where citizens have the ability to participate in the news production process." How much of this potential has been achieved? To what degree and when are 'regular' people embedded into news stories? Answers to many of these questions can potentially be reached through exploration of and experimentation with NewsTweet data. By collecting timeline data from historically newsworthy users, an opportunity to study content production and selection also arises.
Since news is volatile by nature and stories evolve over time, NewsTweet data can potentially be used to construct meta-stories. These can be long-term phenomena, such as the emergence of the COVID-19 pandemic and the associated social frenzy over a number of months, as well as shorter-lasting, focused events, such as the compromise and abuse of high-profile accounts in a coordinated cryptocurrency scam in July 2020. Some authors [8] have suggested that users have very limited pre-defined spaces and that journalistic gatekeeping remains the key element in the user-generated content process. The collected data and the illustrated data collection pipeline present an opportunity to implement novel theoretical and methodological approaches to analyzing social media integration in the news. While social media in journalism has attracted academic interest, most approaches have so far focused on content analysis of the articles themselves, and have only had access to relatively small-scale, manually collected data sets. The description of the NewsTweet data set and its framework for a data collection pipeline are shared with the broader interdisciplinary research community in the hope of advancing new avenues of scholarship in social media, digital media, and journalism.

[1] Social media references in newspapers
[2] Re-evaluating journalistic routines in a digital age: A review of research on the use of online sources
[3] The state of social embeds. https://cdn.samdesk.io/static-content/The-State-of-Social-Embeds.pdf. Accessed
[4] How to get an RSS feed from Google News. com/googlenews/forum/AAAAKvAM41IWV2z_hgA6p0/?hl=en&gpf=%23!topic%2Fnews%2FWV2z_hgA6p0
[5] Search as news curator: The role of Google in shaping attention to news information
[6] "Information Warfare" and online news commenting: Analyzing forces of social influence through location-based commenting user typology
[7] From TV to Twitter: How ambient news became ambient journalism
[8] Value of user-generated content: Perceptions and practices regarding social and mobile media in two Italian radio stations

This document is based upon work supported by the National Science Foundation under grant no. #1850014.