Garbage, Glitter, or Gold: Assigning Multi-dimensional Quality Scores to Social Media Seeds for Web Archive Collections
Alexander C. Nwala, Michele C. Weigle, and Michael L. Nelson
2021-07-06

From popular uprisings to pandemics, the Web is an essential source consulted by scientists and historians for reconstructing and studying past events. Unfortunately, the Web is plagued by reference rot, which causes important Web resources to disappear. Web archive collections help reduce the costly effects of reference rot by saving Web resources that chronicle important stories and events before they disappear. These collections often begin with URLs called seeds, hand-selected by experts or scraped from social media. The quality of social media content varies widely; therefore, we propose a framework for assigning multi-dimensional quality scores to social media seeds for Web archive collections about stories and events. We leveraged contributions from social media research for attributing quality to social media content and users based on credibility, reputation, and influence. We combined these with additional contributions from Web archive research that emphasize the importance of considering geographical and temporal constraints when selecting seeds. Next, we developed the Quality Proxies (QP) framework, which assigns seeds extracted from social media a quality score across 10 major dimensions: popularity, geographical, temporal, subject expert, retrievability, relevance, reputation, and scarcity. We instantiated the framework and showed that seeds can be scored across multiple QP classes that map to different policies for ranking seeds, such as prioritizing seeds from local news, reputable and/or popular sources, etc. The QP framework is extensible and robust. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ~0.13) when novelty is and is not prioritized. These contributions provide an explainable score applicable to rank and select quality seeds for Web archive collections and other domains.

On March 11, 2020, the World Health Organization declared the Coronavirus (COVID-19) outbreak a pandemic. Shortly after, the Internet Archive sent out a tweet requesting social media users to contribute URLs about Coronavirus for preservation. It is important to preserve webpages chronicling important events such as the 2020 Coronavirus pandemic because, according to SalahEldeen and Nelson, 11% of Web resources shared on social media are lost after the first year of publication [37], so we run the risk of losing a portion of our collective digital heritage if they are not preserved. The Internet Archive (IA) was founded in 1996, and since then, it has been archiving the Web by collecting and saving public webpages. This is based on a simple idea: an archived copy of a webpage may be viewed in place of a lost original copy, but this is only possible if the original webpage was saved. Archive-It is a service created by the Internet Archive where individuals and institutions create Web archive collections (e.g., Fig. 1, right) that preserve webpages and their URLs about a particular topic (e.g., the 2020 Coronavirus).
These collections begin with the selection of an initial list of URLs called seed URLs or seeds which are crawled as part of the preservation process called Web archiving. Following the occurrence of major news events, various organizations collect and save webpages about the events before they are lost due to reference rot [6, 21, 40] . For example, archivists at the National Library of Medicine (NLM) saved seed URLs [24] during the 2014 Western African Ebola Virus Outbreak. The NLM Ebola virus Web archive collection includes websites of organizations, journalists, healthcare workers, and scientists, related to the 2014 Ebola virus discourse. Similarly, archivists at Michigan State University saved webpages [22] chronicling the Flint Water Crisis story. Unfortunately, we do not have enough curators to collect seeds amidst an abundance of local and global events, primarily because it is time-consuming to collect seeds manually, and collecting quality seeds requires domain expertise about the topic, which imposes an additional burden for curators. Consequently, organizations and researchers scrape social media posts for seeds (e.g., Fig. 1 , left) to cope with the shortage of curators with domain expertise [32, 39] . While social media offers a cheap method of crowd-sourcing domain expertise, the quality of social media content varies widely. Selecting quality seed URLs from social media is challenging and has not been extensively studied in the Web archiving community, who acknowledges the importance of selecting good seeds but often pay more attention to the mechanisms of building collections. The challenge of selecting quality seeds is embodied in the idea that it is difficult to define "quality," which could be subjective and is approximated with various metrics that are sensitive to relevance, popularity, or reputation. We developed the Quality Proxies (QPs) framework which generates a multi-dimensional quality score for seeds. A single QP assigns some quality trait to a seed, while different combinations of QPs combine to express different notions of quality (post popularity -{ , , ℎ}, reputation -{ }, local authority -{ , }, etc.) that are used to score seeds and select those that exceed a userdefined threshold. The Quality Proxies framework was inspired by social media research for attributing quality to social media content and users based on credibility, reputation, and influence [10, 11, 31] . It was additionally informed by Web archive research, which emphasizes the importance of geographical and temporal constraints when selecting seeds [30, 35] , and consists of three main classes ( , , and ) that sub-divide into additional classes enumerated in Table 1 . Given that the Quality Proxies is a framework, we additionally instantiated the framework with metrics ( Table 1) that approximate each class. The transparency and flexibility of the framework means the user can extend it and/or replace a QP instantiation. The quality score of a seed can be assigned by extracting Quality Proxy metrics across all classes, or a subset of classes as input to a quality score function (Eqn. 2), and selecting seeds that exceed a threshold. Our contributions are as follows. First, Quality Proxies provide a flexible means of encoding multiple definitions of quality instead of a single definition of quality based on relevance or popularity. 
By providing a multi-dimensional framework, a curator can use the Quality Proxies to score and select not just popular (post popularity -{ , , ℎ}) or relevant ( ) seeds but seeds from reputable sources ( ), local news organizations ( ), popular residents of a local community ({ , }), hard-to-find seeds ({ , }), etc. This is because Quality Proxies independently behave like alphabets that can be combined in different ways to provide various policies for scoring seeds. Second, the QP framework is robust to enable the assignment of quality scores even when a subset of the metrics is absent. Third, we instantiated each QP with metrics that approximate them. Fourth, the QP framework and quality score do not function as black boxes and thus produce explainable scores. Consequently, the QP instantiations can be critiqued, extended, or replaced if the requirements of the user demands such. Fifth, we compared seeds from Twitter Micro-collections that were scored/selected with QPs against seeds collected by human experts and scraped from Google (also scored/selected with QPs). Micro-collections (e.g., threaded conversations from single/multiple users) are social media posts that contain URLs that are gathered by humans (as opposed to search engines) as a demonstration of domain expertise and editorial activity [29] . Our evaluation results showed that QPs resulted in the selection of quality seeds with increased precision (by ∼0.13) when novelty is and is not prioritized. Our code and evaluation dataset (generated between 2014 and 2020), are publicly available [27] . The dataset is comprised of 1,552 seeds from reference collections (Google and manual selection by experts) and 2,027 seeds from 4,209 tweets. The goal of determining the quality of URLs found in social media posts is one shared by Web archivists (Section 2.1) and social media researchers (Section 2.2 and 2.3). Risse et al. [35] addressed the problem of determining attributes indicative of quality for Web archive seeds in the digital humanities domain by surveying scientists in social sciences, historical sciences, and law. Among others, they proposed that seeds should cover the evolution (topical dimension) of an event over time (temporal dimension) as opposed to a time or topic slice which gives an incomplete picture. Nwala et al. [30] proposed extracting seeds from local news sources for local events by showing that collections from local news articles produced older and lesser exposed stories than their non-local counterparts. These contributions from the Web archiving community inform the Qualities Proxies' ℎ and proximity classes. There are many studies that propose methods for assessing the credibility of information on social media platforms such as Twitter. These mostly focus on the content (e.g., text) of the social media posts and not the URLs (seeds) found in the posts, which is the focus of the QP framework. However, we posit that the quality of a seed URL can be approximated by the quality of the social media post that embeds it, and thus the following research is relevant to the QPs. Castillo et al. [11] adopted Merriam Webster's definition of credibility ("offering reasonable grounds for being believed") and automatically assessed the credibility of a given set of tweets and classified them as or not credible based on features extracted from the tweet content, ℎ , and . 
Similarly, to help extract credible tweets from a flood of tweets triggered by a major news event such as a natural disaster, Gupta and Kumaraguru [19] identified content (frequency of unique characters, swear words etc.) and user-based features (e.g., number of followers) to train a supervised machine learning and relevance feedback system. Their analysis of tweets posted about 14 high impact events in 2011 showed that on average 30% tweets contained situational information about the event, 14% were spam, and 17% contained situational awareness information that was credible. Bozarth and Budak [9] demonstrated the importance of an evaluation framework of fake news detection to complement traditional evaluation metrics like F1 and precision. They used error analysis to show that classifiers' performance varied depending on multiple factors including the choice of dataset and how the training data is split (e.g., 5-fold, 80/20). Shu et al. [38] approached fake news detection by studying the pattern of how news spreads on social media from news publishers to social media posts that (re)tweet content or conversation threads between accounts. Their experiments showed that the multi-level propagation network approach for fake news detection out-performed state-of-the-art fake news detection methods by at least 1.7% with an average F1>0.84. Similarly, Bal et al. [5] proposed an attention-based deep learning model that first identified tweets about the cause or cures of cancer and subsequently labeled those that spread misinformation. The Quality Proxies framework includes the class that approximates the credibility of the domain of seed URLs in social media posts. In this research, we did not develop a method for directly assigning credibility scores to seed domains but instead approximate them (Section 4.5) by counting how often the domains were cited by Wikipedia editors. However, the user of the framework could replace our instantiation for approximating with a different method such as those discussed in this section. Popularity is widely used as a proxy for quality and credibility [14] . However, algorithms that use popularity for ranking can be exploited [16, 34] . Consequently, Abbasi and Liu [1] introduced the CredRank algorithm for ranking social media users by a credibility score determined by their online behavior. CredRank first attempts to detect coordinated accounts that artificially inflate the popularity of some content. Next, it suppresses the votes of the culprits in order to give preference to independently popular (credible) content. Similarly, Pal and Counts [31] addressed the task of identifying social media users who are authorities for a given topic by proposing a set of features for characterizing social media authors, such as original tweets, conversational tweets, repeated tweets, mentions, etc. Next, using probabilistic clustering over these features, they ranked users in order to identify authorities. Agichtein et al. [2] explored using community feedback to identify high-quality content on question/answering social media platforms. They proposed a graph-based model of contributor relationships and combined it with content and usage statistics to identify quality questions and answers applied on Yahoo! Answers. Becker et al. [7] explored centrality-based approaches (Centroid, LexRank, and Degree) for identifying high quality text context from tweets for events. 
They defined high quality tweet text as that which contains relevant information (event time, location, participants, opinions) that is most representative of an event. Their results showed that the Centroid approach selected tweets most related to an event. Bian et al. [8] addressed quantifying the quality of content by combining content quality estimation with user reputation estimation in order to identify quality content, and improved the accuracy of search over community question answering archives above state-of-the-art methods. Similarly, Canini et al.'s [10] investigation into the factors that affect the credibility of users on social media led to the conclusion that both the topical content of information sources and social network structure affect credibility. This conclusion led them to design a method that automatically identifies and ranks Twitter users according to their relevance and expertise for a given topic. Similar to the reputation class, the QP framework includes the subject-expert class, which approximates the subject expertise of the domain (e.g., cdc.gov) of a seed URL found in a social media post. We did not develop a method for directly assigning subject-expertise scores to seed domains, but instead approximated them (Section 4.3) with search engines such as Google. However, the user of the framework could replace our instantiation for approximating subject-expert with a different method, such as those discussed in this section. The seed URL quality problem is not unique to social media. A Search Engine (SE) must return a small list of URLs (from possibly millions of candidates) to fulfill an informational request encoded in a search query. It starts by identifying relevant pages (relevance is a proxy for quality) but goes beyond relevance to rank webpages with a preference for popular webpages. In summary, SEs use popularity as one method to approximate quality. This is reasonable since one can argue that popularity is the reward for quality. Ciampaglia et al. [14] argue that measures such as the citation rates of scientific papers, the number of downloads of a song, or the number of social media followers are often used in the absence of measurable notions of quality. Additionally, the goal of algorithms that favor popular items "is to identify high-quality items such as reliable news, credible information sources, and important discoveries - in short, high-quality content should rank at the top." However, popularity does not always mean quality since popularity can be exploited by fake reviews, social bots, and astroturf campaigns [16, 34]. In SEs, the use of popularity in ranking algorithms was alleged to reduce novelty, a problem that could, however, be mitigated by diverse user queries [17]. Consequently, we argue that popularity is not sufficient as a QP and explore additional non-popularity-based QPs (e.g., proximity and reputation, Section 4). Nonetheless, popularity remains one effective proxy for quality, and as such is included in our QP framework. In this section, we introduce the classes that make up the QP framework and the metrics used to instantiate them. For each seed URL, the values (all normalized between 0 and 1) for each metric are used to populate the seed QP vector, which holds the multi-dimensional quality score of the seed. There are generally two approaches toward quantifying the popularity of URLs. The first, a computationally expensive, link-based approach (e.g., PageRank) [13], utilizes the link structure of the Web to assign weights to webpages.
We adopted the second, less computationally expensive approach, which leverages social media post statistics to assign popularity scores to URLs found in social media posts. Social media posts often keep statistics that track the number of times a post is shared (a "retweet" on Twitter), liked, or replied to. Transitively, the popularity of URLs from social media posts can be derived from the social media post statistics [15, 19, 23] and also used to rank posts. The post popularity classes assign popularity to a seed URL by quantifying the popularity of the post(s) containing the URL. We instantiated them with metrics that count how many people replied to (reply), shared (share), and liked (like) a social media post. All of these are min-max normalized (x' = (x - min) / (max - min)) in the QP vectors for seeds. The author popularity QP expresses the popularity of the author(s) who created the social media post(s) containing the seed URL. For example, Twitter and Instagram count followers (in-degree) and friends or following (out-degree). Unlike Twitter, which separately counts in-degree and out-degree, Facebook only counts friends (combined in-degree and out-degree). For social media platforms like Facebook with bi-directional links, we instantiated author popularity with the normalized count of friends. For social media platforms like Twitter, author popularity is the normalized difference in-degree - out-degree (e.g., followers - following for Twitter). If the in-degree < out-degree, the difference is negative. To fix this, an offset (the absolute value of the smallest difference between in-degree and out-degree) is added to each difference before normalization. Given a set of social media posts, with the in-degree and out-degree of the author of each post, Eqn. 1 instantiates author popularity. The domain popularity QP quantifies the popularity of a seed's domain. We instantiated it with Eqn. 1 by approximating the popularity of the social media account (e.g., @CDCgov) associated with the seed domain (e.g., cdc.gov). To calculate domain popularity for a seed (e.g., https://www.cdc.gov/coronavirus/2019-nCoV/), utilizing Twitter as an example, first, we must find the social media account (https://twitter.com/CDCgov) associated with the domain (e.g., cdc.gov). This is done by finding a bi-directional link between the social media account and the seed's website. For example, the cdc.gov domain links to the @CDCgov Twitter account and vice versa. Second, we extract the in-degree and out-degree details from the account. Third, we apply Eqn. 1. We have already discussed some limitations of popularity as a proxy for quality, such as the artificial manipulation of popularity by fake reviews, social bots, and astroturf campaigns. In addition to these, it is important to note that not all authoritative or credible sources are popular. For example, MLive, a local media organization located in Michigan, the epicenter of the Flint Water Crisis, is less popular than CNN, a national/international news organization, so one can argue that MLive is a local authority on topics about the Flint Water Crisis, more so than CNN. In fact, according to Denise Robbins, it took the national media one year after the E. coli outbreak to report the Flint story [36]. Consequently, it is pertinent to quantify quality (e.g., authority) across other classes in addition to popularity. This is the rationale for the following non-popularity-based Quality Proxy classes.
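The post popularity and author popularity instantiations above reduce to simple arithmetic. The following is a minimal Python sketch of the min-max normalization and of one reading of Eqn. 1 (whose display form is not reproduced in this text): the offset-shifted difference between in-degree and out-degree, normalized over a set of posts. Function and variable names are illustrative, not from the paper's released code.

def min_max_normalize(values):
    """Scale a list of raw counts (e.g., reply, share, like counts) to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all counts equal: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

def author_popularity(in_degrees, out_degrees):
    """One reading of Eqn. 1: offset-shifted (in-degree - out-degree), min-max normalized.

    in_degrees[i] / out_degrees[i] are the follower / following counts of the
    author of post i. The offset makes every shifted difference non-negative."""
    diffs = [i - o for i, o in zip(in_degrees, out_degrees)]
    offset = abs(min(diffs)) if min(diffs) < 0 else 0.0
    return min_max_normalize([d + offset for d in diffs])

# Example with three posts: raw like counts and their authors' degree counts.
print(min_max_normalize([3, 40, 400]))                     # post popularity (like)
print(author_popularity([120, 9000, 45], [300, 150, 60]))  # author popularity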
Stories and events are often associated with some geographical location. For example, Hurricane Harvey made landfall in Texas in August 2017. The geographical QP gives credit to a local source (local authority) when geographical location information is present. The local source could be an individual (the author geographical QP) or an organization (the domain geographical QP). For example, if our reference epicenter is Texas, USA, given two seeds about Hurricane Harvey from CNN and TexasObserver (Texas local media), the domain geographical QP would assign a higher value to TexasObserver. Similarly, given two individuals, a resident of Rockport, Texas, and a resident of San Francisco, California, the author geographical QP would give more credit to the Rockport resident. We instantiated the author (or domain) geographical QP with the normalized ([0, 1]) distance (measured with the Haversine formula) between a reference epicenter and the geo-location associated with the post author (for the author geographical QP) or with the social media account linked bi-directionally (similar to domain popularity) to the seed domain (for the domain geographical QP). We utilized the Google Maps Services Places API [18] to normalize names (e.g., "NYC" and "New York") into a single name and geo-coordinates. Stories and events often happen at a place (or places), but always happen at some time. After the occurrence of the event, or before it, news organizations report the story or event. For example, some of the earliest reports of the Flint Water Crisis story are from MLive. The temporal Quality Proxy rewards seeds published "early," when a priori information about what constitutes early is present. We instantiate it with the normalized time difference between the publication date of the seed and the reference point considered early. The subject expert QP approximates the subject expertise of a seed's domain. For example, given two seeds about the Coronavirus, one from the CDC and another from the blog of a high school senior, the subject expert QP would assign the CDC a higher subject expert score since the CDC is an authority on health topics. However, how does one measure the subject expertise of cdc.gov? We instantiated subject expert based on this simple assumption: a subject expert often has more to say about their subject of expertise. This means that, if indeed the CDC is an expert on Coronavirus, we would expect to see many more reports from the CDC about Coronavirus than from, say, ESPN. We acknowledge that this is a simplifying assumption that could be exploited. We used Document Frequency (DF) to instantiate the subject expertise of the domain of a seed. We extract DF scores by counting the number of result pages returned by Google for a given query, normalized by the total number of pages indexed by the search engine for the site. This normalization is needed to avoid giving more advantage to larger websites. Seeds extracted from social media posts could also be scraped from Search Engine Result Pages (SERPs). The retrievability QP approximates how easy a seed is to find [4]. For example, Wikipedia pages for various entities (e.g., political figures) are often placed on the front page of SERPs, meaning they have high retrievability. For this reason, retrievability quantifies the level of difficulty of finding a seed. It is often a desirable quality to identify relevant seeds that are not easy to find, to increase the novelty of a collection. We instantiated the retrievability of a seed (e.g., https://www.cdc.gov/vhf/ebola/index.html) with its reciprocal rank (e.g., 1/2) when searching the first (e.g., 20) Google SERPs for the seed with the query (e.g., "ebola virus") used to extract seeds.
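A minimal sketch, under stated assumptions, of two of the instantiations described above: the geographical QP as a normalized Haversine distance from a reference epicenter (converted here so that closer sources score higher, matching the intent of crediting local sources, though the paper's exact normalization may differ), and the retrievability QP as a reciprocal rank. Coordinates and names are illustrative.

from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def geo_qp(epicenter, locations):
    """Geographical QP: 1.0 at the epicenter, 0.0 at the farthest location in the set,
    so that closer (more local) sources receive more credit."""
    dists = [haversine_km(epicenter[0], epicenter[1], lat, lon) for lat, lon in locations]
    max_d = max(dists) or 1.0
    return [1.0 - d / max_d for d in dists]

def retrievability(rank):
    """Retrievability QP: reciprocal rank of the seed in the SERPs, 0 if not found."""
    return 1.0 / rank if rank else 0.0

# Hurricane Harvey example: epicenter Rockport, TX; an Austin, TX source vs. a San Francisco one.
rockport = (28.02, -97.05)
print(geo_qp(rockport, [(30.27, -97.74), (37.77, -122.42)]))
print(retrievability(2))  # a seed found at rank 2 gets 0.5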
Social media seed URLs originate from sources with varying reputations. Given two URLs about Coronavirus, one from InfoWars (which promotes conspiracy theories [33]) and another from the CDC, it would be problematic to consider the quality of information derived from both sources equal. Similar to the subject expert QP, the reputation QP approximates the reputation of the domain of seeds. We defined two kinds of reputation QPs. First, reputation-broad attributes reputation to the domain of a seed for having a record of publishing content about a topic (e.g., a health topic), while reputation-narrow attributes reputation to the domain of a seed for having a record of publishing content focused specifically on a story (e.g., Coronavirus). But the question remains: how does one approximate reputation? We instantiated reputation by leveraging the expertise of Wikipedia editors. We posit that Wikipedia editors presumably sample reputable sources [12]. Specifically, the reputation of the domain of a seed corresponds to the fraction of times it was cited as a reference in a gold-standard set of Wikipedia articles. For reputation-broad, the gold standard is represented by a collection of Wikipedia articles that focus on the topic (e.g., Disease outbreaks) of the seed. For reputation-narrow, the gold standard is represented by the canonical Wikipedia page for the story. The canonical page can be found by searching for the top-ranked Wikipedia page for the query (e.g., "ebola virus outbreak") representing the topic. To assign reputation-broad or reputation-narrow to the domain of a seed, we extracted the URIs from the references of the reputation gold-standard Wikipedia articles and calculated the fraction of times each domain was referenced. For example, in our reputation gold standard for the Disease outbreaks (https://en.wikipedia.org/wiki/List_of_epidemics) topic, cdc.gov appeared in 42 out of 57 gold-standard articles. Therefore, the cdc.gov domain has a reputation-broad score of 0.74. The cdc.gov domain appears 14 times out of 720 references in the canonical 2014 Western African Ebola Virus Outbreak Wikipedia page (https://en.wikipedia.org/wiki/Western_African_Ebola_virus_epidemic), and thus its reputation-narrow score is 0.02. In contrast, for sputniknews.com, reputation-broad = 0.02 (1/57) and reputation-narrow = 0 (0/720). The relevance QP measures the degree to which a seed is on-topic. A seed that receives high marks across all the other QP vector dimensions remains non-relevant if it is off-topic. We approximate relevance by simply measuring the cosine similarity between a seed's document vector and a gold-standard document vector that captures our definition of relevance. The gold standard is created by concatenating the text of hand-selected documents (Section 8.2, Step 1) that are relevant to a topic, and creating a feature (vocabulary) vector consisting of the TF or TFIDF weights of the terms in the concatenated document. The scarcity QP rewards seeds from domains that are rare in a collection of seeds. It is not surprising to find multiple seeds from news organizations (e.g., cnn.com, foxnews.com, bbc.co.uk) for news topics. Sometimes far-reaching news events are covered by organizations for which news is not their primary domain (e.g., eonline.com and espn.com) and which may offer a novel reporting perspective. The scarcity QP was created to surface such seeds and is approximated by 1 - f/D, where f is the frequency of a seed's domain out of D total domains.
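The reputation and scarcity instantiations just described reduce to simple fractions. The sketch below assumes reputation-broad gives each gold-standard Wikipedia article one vote, reputation-narrow counts citations among the references of the canonical story page, and scarcity reads "total domains" as the total number of domain occurrences (seeds) in the collection; helper names are illustrative.

from collections import Counter
from urllib.parse import urlparse

def domain_of(url):
    """Simplified domain extraction (no public-suffix handling)."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def reputation_broad(domain, gold_articles_refs):
    """Fraction of gold-standard Wikipedia articles (one vote each) whose references cite the domain."""
    votes = sum(1 for refs in gold_articles_refs if domain in {domain_of(u) for u in refs})
    return votes / len(gold_articles_refs)

def reputation_narrow(domain, canonical_refs):
    """Fraction of the canonical story page's references that point to the domain."""
    return sum(1 for u in canonical_refs if domain_of(u) == domain) / len(canonical_refs)

def scarcity(seed_url, all_seed_urls):
    """1 - (frequency of the seed's domain / total seeds), rewarding rarely seen domains."""
    counts = Counter(domain_of(u) for u in all_seed_urls)
    return 1.0 - counts[domain_of(seed_url)] / len(all_seed_urls)

seeds = ["https://www.cnn.com/a", "https://www.cnn.com/b", "https://www.eonline.com/c"]
print(scarcity("https://www.eonline.com/c", seeds))                                  # 1 - 1/3
print(reputation_narrow("cdc.gov", ["https://www.cdc.gov/x", "https://who.int/y"]))  # 0.5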
Thus far, the Quality Proxies have been presented with the assumption that the higher the QP value, the better the trait the QP captures. For example, a high author popularity score is a desirable trait, and a low author popularity score is not. However, desirability can be subjective. This means a curator might desire to surface seeds from authors that are not popular, in an effort to amplify the voices of obscure users. Consequently, this requires flipping the direction of the reward system of the QP under consideration. For example, before flipping, the most popular author would have an author popularity score of 1, but if we flip the Quality Proxy (represented with a bar over the QP), a score of 0 is assigned to the most popular author. Since all the Quality Proxies were designed to fall within [0, 1], a QP q is simply flipped by subtraction: the flipped value is 1 - q. The ability to flip QPs provides us with additional QPs (e.g., flipped author popularity or flipped retrievability). But it must be noted that the unflipped state and the flipped state of a QP are mutually exclusive. The seed Quality Proxy vector Q is a 14-dimensional vector (Q in R^14) of all the values of the metrics (e.g., reply, share, like, author popularity) that instantiate the classes of the Quality Proxies framework. The QP vector of a seed assigns quality scores to a seed across multiple dimensions. Each metric's value q_i in Q expresses some quality trait of a seed and is normalized (q_i in [0, 1]) such that 0 represents lowest quality and 1 represents highest quality. The dimensions of the QP vector, representing multiple quality traits, can be combined into a single score that can be used to score and/or rank seeds. We instantiated the QP score function (Eqn. 2) of a seed simply with the 2-norm of the n-dimensional QP vector of the seed: score(Q) = ||Q||_2. A user can control the relative importance of the metrics of Q depending on prior information or specific needs. Therefore, one can multiply a weight vector w in R^14 (with the weights summing to 1) element-wise with Q (Q' = w * Q) to reflect the importance of each metric and obtain a new Quality Proxy vector Q'. The weight vector can also be used to switch off specific metrics. For example, to switch off metric i, we set w_i = 0, such that q'_i = 0. In this section, we explore how different combinations of QPs map to different notions of quality and policies for selecting seeds for the 2020 Coronavirus Pandemic (Tables 2, 3, & 4), the Flint Water Crisis (Table 5), and Hurricane Harvey (Table 6). In the tables, the seed URL titles can be clicked. Table 2 illustrates that a combination of popularity-based Quality Proxies (reply, share, and like) unsurprisingly gives more credit to seeds from popular (well-known) domains (e.g., reuters.com, cnbc.com, washingtonpost.com) posted by popular authors (e.g., @HillaryClinton, @CNBC, and @SenSanders). Seeds from well-known domains are more likely to be replied to, shared, or liked as a result of the large audience they enjoy. Unlike Table 2, Table 3 shifts the reward system by prioritizing authors and domains geographically distant from New York. This resulted in the surfacing of authors and domains outside the United States with an international perspective. The top five authors are residents of two different countries (e.g., @ick_forPH, Philippines, and @OfficialKRU, Kenya).
Table 6: Hurricane Harvey: top five seeds extracted by combining relevance and the scarcity QP, used to increase the diversity of news sources (e.g., texasmonthly.com, eonline.com, and espn.com) by extracting seeds from domains with the smallest representation (Hits) in the collection.
Given the concerns of the spread of (mis/dis)information surrounding the Coronavirus pandemic, curators could potentially impose stringent rules that restrict the sources of seeds to reputable sources. This reputable-sources-only selection criterion aligns with the goal of the reputation-broad QP.
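Tying the scoring machinery above together, here is a minimal sketch (not the paper's reference implementation) of assembling a QP vector over the 14 metrics, flipping selected dimensions, applying an optional weight vector, and computing the Eqn. 2 score as the 2-norm. Dimension and parameter names are illustrative.

import numpy as np

QP_DIMS = ["reply", "share", "like", "author_pop", "domain_pop",
           "geo_author", "geo_domain", "temporal", "subject_expert",
           "retrievability", "relevance", "rep_broad", "rep_narrow", "scarcity"]

def qp_score(qp, weights=None, flip=()):
    """Eqn. 2 style score: 2-norm of the (optionally flipped and weighted) QP vector.

    qp      -- dict of metric name -> value in [0, 1]; missing metrics default to 0.
    flip    -- metric names to invert (q -> 1 - q), e.g., {"retrievability"} to
               reward hard-to-find seeds.
    weights -- optional dict of per-metric weights; names absent from it get weight 0,
               which switches those metrics off."""
    vec = np.array([qp.get(d, 0.0) for d in QP_DIMS])
    for d in flip:
        i = QP_DIMS.index(d)
        vec[i] = 1.0 - vec[i]
    if weights is not None:
        vec = vec * np.array([weights.get(d, 0.0) for d in QP_DIMS])
    return float(np.linalg.norm(vec, ord=2))

seed = {"reply": 0.1, "share": 0.3, "like": 0.2, "relevance": 0.9, "rep_broad": 0.7}
# Score with only relevance, broad reputation, and (flipped) retrievability switched on.
print(qp_score(seed, weights={"relevance": 0.4, "rep_broad": 0.4, "retrievability": 0.2},
               flip={"retrievability"}))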
Table 4 outlines the top five seeds when seeds are scored by their respective reputation scores. For a single seed (e.g., https://www.ncbi.nlm.nih.gov/pmc/ articles/PMC1592694/) in Table 4 , the score (e.g., 0.81) was approximated by counting the number of times the seed domain (e.g., nih.gov) was cited (e.g., 46 times) in a reputation gold standard of 57 representative Wikipedia documents (one vote per document) about Disease outbreaks. Accordingly, the most dominant seeds were from world-renowned health institutions such as the World Health Organization (who.int) which was referenced 47 times, National Institute of Health (nih.gov) referenced 46 times, and Centers for Disease Control and Prevention (cdc.gov) referenced 42 times out of 57 representative Wikipedia documents about public disease outbreaks. Table 5 illustrates how the QP helps surface local news organizations, such as mlive.com, which was critical to the coverage of the Flint Water Crisis, by giving credit to seed domains from organizations near a geographical reference (e.g., Flint, Michigan). Table 6 illustrates how the QP can help increase the diversity of sources by surfacing seeds from non-conventional news media outlets such as Taylor Swift Makes "Very Sizable Donation" to Houston Food Bank After Hurricane Harveyeonline.com and J.J. Watt's Hurricane Harvey charity fundraising closes with $37M-plus in donationsespn.com. The goal of this evaluation was two-fold. First, to assess the precision of the seeds selected by their Quality Proxy-assigned scores when novelty is not prioritized (Section 8.2). For brevity, we define QP seeds as the top ranked seeds selected when seed URLs extracted from social media posts (e.g., tweets) are ranked by their QP scores. It would be unreasonable to collect QP seeds if they are of poor quality compared to expert-generated seeds. We modeled good quality with prototypical seeds referred to as seeds scraped from Google and/or hand-selected by human-experts on Archive-It. Second, to assess the precision of seeds when novelty is prioritized (Section 8.3). It is a positive trait for QP seeds to be highly similar (low novelty) with respect to Google and/or expert-generated seeds, since this could be indicative of their high-quality. The goal of the first evaluation was to quantify the degree of similarity between QP seeds and reference seeds. However, we often need our seeds to be novel or, in other words, different from seeds produced by Google and/or experts but not at the expense of quality. Therefore, we assessed the precision of QP seeds when novelty is prioritized. Novelty of seeds was measured (Section 8.2, Step 4) by comparing them with reference (Google or Expert) seeds. To evaluate social media seed URLs selected with their QP scores (QP seeds), we generated a dataset (Table 7 , [27] ) consisting of seeds extracted from reference collections (Section 8.1.1) and Twitter Micro-collections (Section 8.1.2) for multiple topics. 8.1.1 Generating reference (Google/Expert) seeds. . The reference collections served as baselines for defining quality. Seeds from Google were scraped, while seeds from expert-generated collections were extracted from the Archive-It API [3] . 8.1.2 Extracting seeds from Micro-collections. . In addition to reference Google/Expert seeds, we extracted seeds from Twitter Micro-collections to be compared to the reference seeds. 
Micro-collections are social media posts that contain URLs that are gathered by humans as a demonstration of domain expertise and editorial activity [29] . On Twitter, they manifest as the threaded conversations created by single or multiple users. Seeds extracted from Twitter Micro-collections are different from those scraped exclusively from SERPs [28] . In total, the evaluation dataset (extracted from 2014 -2020) consisted of 1,552 seeds from reference collections, and 2,027 seeds from 4,209 tweets from Twitter Micro-collections. Even though we utilized Twitter for evaluation, our framework is applicable to other social media platforms such as Reddit and Facebook. The following five steps describe how we assessed the precision of QP seeds when novelty is not prioritized. Step 1: Extracting Quality Proxies for Seeds. . We instantiated the QP vectors for all seeds in the evaluation dataset by extracting all values for QP metrics (Table 1 ) except subject-expert and temporal ( ) resulting in the use of 12 QPs. The instantiation with the document frequency from Google was not determined to be a dependable approximation of since it fluctuated (for the same seed) with a high variance, hence we excluded it from our evaluation. Additionally, we did not impose a temporal bias to favor old or new documents, hence we excluded . We approximated the QP with the cosine similarity between document vectors for a seed and a gold standard document created from the text of the references of Wikipedia articles corresponding to each dataset topic. The author-popularity QP corresponds to the popularity of the social media author of the post. Since seeds from Google and Archive-It are not posted by social media authors, we approximated the QP with the reciprocal rank ( 1 ) of their seeds to ensure they are comparable to QP seeds. Step 2: Generating QPs Combinatorial States. . We utilized the 12 QPs from from the previous step to score (Eqn. 2) seeds, selected the top K seeds, and compared them with top reference seeds scored with the same QPs. We did not assign weights to the Quality Proxies. Additionally, we expanded the options for scoring seeds beyond 12 QPs as follows. First, we permitted flipping the QPs, resulting in 12 additional QPs (24 QPs total). Second, we permitted using a subset of the 24 QPs, leading to a combinatorial explosion of possible QP states for scoring seeds. However, we restricted our scoring to 1-, 2-, and 3-combinations which produced a total of 2,049 possible QP combinations (e.g., , { , }, { , , }) to score seeds. Table 8 shows 72 of these combinations. Step 3: Scoring Seeds with a Combination of QPs. . To score seeds from Twitter or reference Google or Expert collections, we first selected a single combination of Quality Proxies, for example, { , ℎ, }. Next, using only the QPs selected, we assigned a score to the seed with Eqn. 2. Step 4: Twitter vs Google/Expert: comparing top K QP seeds. Recall the QP seeds definition: the top ranked seeds selected when seed URLs extracted from social media posts (e.g., tweets) are ranked by their QP scores. The top K QP seeds with scores assigned by a given combination of QPs were compared to the top K reference (Google/Expert) seeds scored with the same QP combination. Comparison was done by measuring the domain (e.g., cdc.gov) overlap ( | ∩ | ( | |, | |) ) between the Twitter QP seeds and reference (Google and/or expert) seeds. We also measured the precision of the selected QP seeds and reference seeds. 
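A sketch of the comparison machinery in Steps 2 and 4, under stated assumptions: 1-, 2-, and 3-combinations are drawn from the 12 QPs and their 12 flipped counterparts, excluding any combination that contains both a QP and its own flip (the two states are mutually exclusive). This enumeration yields 2,048 combinations, close to but not exactly the 2,049 reported, so the paper's enumeration presumably differs slightly. Overlap between two seed sets is read here as |A ∩ B| / min(|A|, |B|) over their domains. Names are illustrative.

from itertools import combinations
from urllib.parse import urlparse

BASE_QPS = ["reply", "share", "like", "author_pop", "domain_pop", "geo_author",
            "geo_domain", "retrievability", "relevance", "rep_broad", "rep_narrow", "scarcity"]
ALL_QPS = BASE_QPS + ["~" + q for q in BASE_QPS]   # "~" marks the flipped state of a QP

def qp_combinations(max_size=3):
    """1-, 2-, and 3-combinations of the 24 QPs, skipping any set pairing a QP with its flip."""
    combos = []
    for k in range(1, max_size + 1):
        for c in combinations(ALL_QPS, k):
            bases = [q.lstrip("~") for q in c]
            if len(set(bases)) == len(bases):
                combos.append(c)
    return combos

def domain_overlap(seeds_a, seeds_b):
    """Overlap of two seed lists over their domains: |A ∩ B| / min(|A|, |B|)."""
    def dom(u):
        host = urlparse(u).netloc.lower()
        return host[4:] if host.startswith("www.") else host
    doms_a, doms_b = {dom(u) for u in seeds_a}, {dom(u) for u in seeds_b}
    return len(doms_a & doms_b) / min(len(doms_a), len(doms_b))

print(len(qp_combinations()))   # 2,048 combinations under this reading
print(domain_overlap(["https://cdc.gov/a", "https://who.int/b"],
                     ["https://www.cdc.gov/x", "https://nih.gov/y"]))   # 0.5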
For precision evaluation, if the cosine similarity between a seed and the gold standard document vector is at least a predefined relevance threshold (set at 0.20 for all dataset topics except Hurricane Harvey, for which it was 0.10), the seed is considered relevant. The threshold was estimated by finding the median similarity between each gold standard document and the rest of the gold standard documents. Median scores exceeding 0.20, which was empirically determined to produce a satisfactory baseline relevance, were set to 0.20.
Table 9: Combinations of QPs that produced the highest (QP Novel) and lowest (QP Non-Novel) novelty. Non-novelty mostly favors seeds from broadly-reputable sources (e.g., reputation-broad) that are easy to find (e.g., retrievability), while novelty mostly favors seeds that are non-popular (e.g., flipped author popularity, flipped post popularity metrics) and hard to find (e.g., flipped retrievability).
Step 5: Seed Precision when Novelty is not Prioritized. The final process of assessing the precision of seeds when novelty is not prioritized involved reporting the average overlap and average precision for QP combinations used to score and select the top K (10 <= K <= 100) seeds. This was achieved by reporting the top 10 (out of 2,049 QP combinations) overlap scores between Twitter and reference seeds and reporting the Precision at K (P@K) for the associated QP combinations used to score the seeds. Selecting the top 10 overlap enables us to learn the precision of seeds when overlap is at its best, albeit at the expense of novelty, since the higher the overlap between Twitter and reference seeds, the lower the novelty. Section 9.1 presents and discusses the results. Since we consider reference seeds to be quality seeds, a high overlap between reference and Twitter QP seeds could result in a high precision of the Twitter QP seeds. However, since novelty (low overlap) is also a desirable quality of seeds, it is crucial to additionally assess the precision of Twitter QP seeds when novelty is prioritized. The steps for assessing the precision of seeds when novelty is prioritized are the same as in the previous section (when novelty is not prioritized) except for Step 5. Instead of reporting the P@K for the QP combinations associated with the top 10 overlap scores, to prioritize novelty, we measured and reported the precision of QP combinations that produced no overlap (highest novelty) between Twitter and reference QP seeds. Section 9.2 discusses the results. Our overlap and precision results were shown to be statistically significant by a one-tailed Student's t-test with alpha = 0.05.
9.2 Novelty is prioritized: P@K of QP seeds. In Fig. 4, the heights represent the median of the average P@K for different overlap intervals, and horizontal lines mark the relevance threshold for each dataset topic. In all cases except Hurricane Harvey (collected 2020), the median of the average P@K of Twitter QP seeds for the 0 overlap (maximum novelty) interval was always above the relevance threshold. This suggests that maximum novelty (0 overlap) did not adversely affect the P@K for Twitter QP seeds, even though higher overlap resulted in a higher P@K. Our results (Table 9) also suggest that novelty mostly favors seeds that are non-popular (e.g., from non-popular authors and non-popular posts) and hard to find (e.g., flipped retrievability). Our correlation analysis (Table 10) showed strong (> 0.50) positive correlations among the popularity-based QP metrics (e.g., share, like, reply) and among the reputation QP metrics (reputation-broad and reputation-narrow). All positive correlations were statistically significant (p < .05), unlike the negative correlations.
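The correlation analysis mentioned above can be outlined with pairwise Pearson correlations over QP metric columns; this is a generic sketch with toy data, not the computation behind Table 10.

import numpy as np
from scipy.stats import pearsonr

def qp_correlations(qp_matrix, names):
    """Pairwise Pearson correlation (r) and p-value between QP metric columns.

    qp_matrix -- 2-D array where rows are seeds and columns are QP metric values."""
    results = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r, p = pearsonr(qp_matrix[:, i], qp_matrix[:, j])
            results[(names[i], names[j])] = (round(r, 2), round(p, 3))
    return results

# Toy data: likes and shares rise together across seeds; relevance is independent.
rng = np.random.default_rng(0)
likes = rng.random(50)
mat = np.column_stack([likes, 0.8 * likes + 0.2 * rng.random(50), rng.random(50)])
print(qp_correlations(mat, ["like", "share", "relevance"]))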
These correlation results are not surprising. For example, a post with many likes is highly likely to also be shared and/or replied to. Similarly, many domains (e.g., cdc.gov) with high broad (topic) reputation also have high narrow (story) reputation. The QP framework and the metrics that instantiate its classes have some limitations, which future work would address. First, the evaluation topics, such as the 2020 Coronavirus Pandemic, are well documented. We expect our framework to under-perform for esoteric or obscure stories due to sparse data. Second, the high correlation (Table 10) between QPs (e.g., the popularity-based QPs reply, share, and like) suggests popularity could be given more weight when combined with other QPs. Third, measuring relevance is limited by small text, which could result in false negative errors.
Figure 3: Average P@K (left) and average overlap with Expert seeds (right), showing that Twitter seed URLs scored and selected with Quality Proxies improved (higher solid lines) the P@K and overlap above the baseline (lower red dotted lines) precision and overlap, which did not use QPs. However, the improvement diminished as K (the number of seeds) increased.
The Web is one of the greatest outcomes of human endeavor, but it has some major flaws, one of which is that the Web forgets, causing the disappearance of Web resources chronicling important stories and events. Web archive collections reduce this problem by preserving Web resources, and they begin with seed URLs hand-selected by experts or scraped from social media posts. While social media is a valuable source of seed URLs, the quality of social media content varies widely. In this paper, we presented the Quality Proxies framework (and instantiations) for assigning quality scores to seed URLs extracted from social media posts. A QP assigns a quality trait to a seed within a single dimension. Seeds can be assigned a quality score by selecting different combinations of Quality Proxies, which map to different notions of quality across multiple dimensions, such as popularity, reputation, geographical proximity, etc. The QP framework is flexible (it enables multiple definitions of quality), robust (it operates with subsets), explainable, and extensible. Our results showed that Quality Proxies resulted in the selection of quality seeds with increased precision (by ~0.13) when novelty is and is not prioritized. To encourage reproducibility, we have provided our research data and code [27].
Figure 2: Overlap vs P@20 for Google (orange dots) and Twitter (blue dots) 2020 Coronavirus Pandemic Twitter-Latest seeds scored with different QPs. A single dot represents the overlap (X-axis) and P@20 (Y-axis) for seeds scored by a single Quality Proxy combination. The scatterplot shows how different combinations of QPs result in high or low overlap/P@20. Unsurprisingly, a combination that included the flipped relevance QP resulted in a low P@20 because relevance was penalized.
Consider the results (P@K/overlap) when novelty is not prioritized.
9.1.1 P@K of Twitter (and overlap with Google) QP seeds. Across all topics, for Google and Twitter seeds, the Minimum, Median, and Maximum (MMM) average overlap were 0.04, 0.32, and 1.0, respectively, when Quality Proxies were used to score seeds. Without the utilization of QP scores, the MMM average overlap values were smaller: 0.04, 0.14, and 0.27, respectively. These results (e.g., Fig. 3, right) suggest that the utilization of QP scores to rank and select seeds helped surface seeds from a common set of domains between Twitter and Google, unlike when QPs were not used. Additionally, they illustrate that different combinations of QPs can result in high or low overlap/precision, as expressed by Fig. 2. In Fig. 2, unsurprisingly, a QP combination that included the flipped relevance QP resulted in a low P@20 because relevance was penalized. Our results (Table 9) also suggest that non-novelty mostly favors seeds from broadly-reputable sources (e.g., reputation-broad) that are easy to find (e.g., retrievability). Across all topics, for Twitter seeds, with Google seeds as the reference, the MMM average Precision at K (P@K) were 0.0, 0.53, and 0.99, respectively, when QP scores were used. Without the utilization of QP scores, the MMM were smaller: 0.06, 0.45, and 0.65, respectively. These results (e.g., Fig. 3, left) showed that the utilization of Quality Proxies to score, rank, and select seeds improved the precision of seeds by 0.08 (0.53 vs. 0.45).
9.1.2 P@K of Twitter (and overlap with Expert) QP seeds. Across all topics, for Expert and Twitter seeds, the MMM average overlap were 0.09, 0.67, and 1.0, respectively, when Quality Proxies were used to score and select seeds.
Without the utilization of QPs, they were smaller: 0.03, 0.13, and 0.19, respectively. Similar to the overlap between Google and Twitter seeds, these results suggest that the utilization of QP scores to rank and select seeds facilitated the selection of seeds from a common set of domains for Twitter and Expert seeds. Across all topics, for Twitter seeds, with Expert seeds as the reference, the MMM average precision were 0.0, 0.72, and 0.95, respectively. Further investigation of the seeds that generated 0.0 precision showed that 5/10 were actually relevant based on human judgment. This means our relevance threshold of 0.20 was set too high and thus resulted in the production of false negative labels. The MMM of the average precision of seeds not scored with QPs were smaller (0.06/0.55/0.71) by 0.17 (0.72 vs. 0.55), suggesting again (as previously seen when Google was the reference) that the utilization of QP scores improved the precision of seeds.

REFERENCES
Measuring user credibility in social media
Finding high-quality content in social media
Archive-It. 2020. Archive-It
Retrievability: an evaluation measure for higher order information access tasks
Analysing the Extent of Misinformation in Cancer Related Tweets
Sic transit gloria telae: towards an understanding of the web's decay
Selecting Quality Twitter Content for Events
Learning to recognize reliable users and content in social media with coupled mutual reinforcement
Toward a Better Performance Evaluation Framework for Fake News Classification
Finding credible information sources in social networks based on content and social structure
Information credibility on twitter
An empirical examination of Wikipedia's credibility
Efficient crawling through URL ordering. Computer Networks and ISDN Systems
How algorithmic popularity bias hinders or promotes quality
An empirical study on learning to rank of tweets
The rise of social bots
Topical interests and the mitigation of search engine bias
Google. 2020. Place Search
Credibility ranking of tweets during high impact events
Internet Archive Global Events
Robust Links in Scholarly Communication
Flint Water Crisis Websites Archive
Ranking approaches for microblog search
Global Health Events Web Archive - Coronavirus
Global Health Events Web Archive - Ebola Virus
QP Framework dataset/code - Git Repo
Scraping SERPs for archival seeds: it matters when you start
Using Microcollections in Social Media to Generate Seeds for Web Archive Collections
Local Memory Project: Providing Tools to Build Collections of Stories for Local Events from Local Sources
Identifying topical authorities in microblogs
Seed selection for domain-specific search
Manufacturing Phobias: The political production of fear in theory and practice
Detecting and tracking political abuse in social media
What Do You Want to Collect from the Web
ANALYSIS: How Michigan And National Reporters Covered The Flint Water Crisis
Losing my revolution: How many resources shared on social media have been lost
Hierarchical propagation networks for fake news detection: Investigation and exploitation
A study of automation from seed URL generation to focused web archive development: the CTRnet context
Perma: Scoping and Addressing the Problem of Link and Reference Rot in Legal Citations