Informed Crowds Can Effectively Identify Misinformation
Paul Resnick, Aljohara Alfayez, Jane Im, Eric Gilbert
2021-08-17

Can crowd workers be trusted to judge whether news-like articles circulating on the Internet are wildly misleading, or do partisanship and inexperience get in the way? We assembled pools of both liberal and conservative crowd raters and tested three ways of asking them to make judgments about 374 articles. In a no research condition, they were just asked to view the article and then render a judgment. In an individual research condition, they were also asked to search for corroborating evidence and provide a link to the best evidence they found. In a collective research condition, they were not asked to search, but instead to look at links collected from workers in the individual research condition. The individual research condition reduced the partisanship of judgments. Moreover, the judgments of a panel of sixteen or more crowd workers were better than those of a panel of three expert journalists, as measured by alignment with a held-out journalist's ratings. Without research, the crowd judgments were better than those of a single journalist, but not as good as the average of two journalists.

Two recent studies elicited misinformation judgments from lay raters about specific articles Allen et al. [2021], Godel et al. [2021]. Perhaps surprisingly, they came to quite different conclusions. One found that a simple average of the ratings of a group of lay raters could identify misinformation pretty well, even though the raters saw just the headlines and ledes of articles and not the entire articles Allen et al. [2021]. The other found that crowds performed worse than a journalist, even if fancy machine learning algorithms were used to aggregate the signals from the crowd members Godel et al. [2021].

Here we report on a larger study, using articles from both of the other studies. More importantly, we study the effects of varying the elicitation process. In a control condition, raters assessed whether articles were false or misleading after opening the articles but without doing any additional research. In one treatment condition, each rater also searched for corroborating evidence. This was intended to elicit "informed" judgments rather than gut reactions. In the other treatment condition, raters consumed the results of others' searches. This was intended to reduce ideological polarization: we deliberately included corroborating evidence links discovered by both liberal and conservative raters in an effort to broaden the search horizon for any given rater. Both of the treatment conditions thus enforced some form of lateral reading.

We benchmarked the performance of simulated subsets of our raters against simulated subsets of four journalists who also rated the articles. To enable apples-to-apples comparisons, all simulated subsets of raters, whether lay raters or journalists, were scored by correlating their mean ratings with those of a single held-out journalist. We find that a large ideologically balanced panel of MTurk raters in the no research condition performed better than a benchmark of a single journalist. In the individual research condition, a large panel of lay raters performed better than a benchmark of a three-journalist panel.
We also find that ideological polarization between liberal and conservative lay raters was reduced in both the individual and collective research conditions.

The second author conducted iterative usability studies on the survey software itself and the qualification tasks over the course of several months, with pilot participants thinking aloud while using the labeling software. Insights into question wording, presentation and order, and interface controls led to many modifications before data collection began. Figure 3 in the Appendix shows the labeling interface in the no research condition. Figure 4 in the Appendix shows the additional request made of raters in the second condition, to search for evidence and paste the search terms used and a link to the best evidence found. Figure 5 in the Appendix shows the interface in the third condition, where subjects were asked to click on links found by subjects in the second condition and select the one they thought provided the best evidence.

Raters labeled two collections of English-language news articles, taken from two other studies that were conducted independently in parallel with this one, by different teams. The first collection was selected from among 796 articles provided by Facebook that were flagged by their internal algorithms as potentially benefitting from fact-checking. Among those, 207 articles were manually selected based on the headline or lede including a factual claim, because the study's focus was how little information needs to be presented to raters in order for them to make good judgments, and subjects were presented with only the headline and lede rather than viewing the entire article Allen et al. [2021]. Facebook also provided a topical classification based on an internal algorithm; of the 207 articles, 109 articles were classified by Facebook as "political".

The second collection, containing 165 articles, comes from a study that focused on whether raters could judge items soon after they were published Godel et al. [2021]. It consisted of the most popular article each day from each of five categories: liberal mainstream news; conservative mainstream news; liberal low-quality news; conservative low-quality news; and low-quality news sites with no clear political orientation. Five articles per day were selected on 31 days between November 13, 2019 and February 6, 2020. Our rating process occurred some months later. Four articles from the first collection and two articles from the second collection were removed because the URLs were no longer reachable when our journalists rated them, leaving a total of 368 articles between the two collections.

MTurk raters first completed a qualification task. Each was randomly assigned to one of three experiment conditions. They used the interface corresponding to their assigned experiment condition to label a sample item, after which they were given two attempts to correctly answer a set of multiple choice questions quizzing them on their understanding of the instructions. They then completed an online consent form, a four-question multiple choice knowledge quiz, and a questionnaire about demographics and ideology. Subjects who did not pass the quiz about the instructions, or did not answer at least two questions correctly on the political knowledge quiz, were excluded. Table 1 shows the recruitment funnel.
Because the individual research condition had slightly more complex instructions, there were more opportunities to fail the instructions quiz, and thus a higher rate of failure. In order to get equal numbers of subjects who passed the qualification test, more subjects had to be randomized to the individual research condition. In the discussion section, we return to the question of whether the results may be due to selection of a more qualified or conscientious pool of raters. Of the subjects who completed the qualification, a higher percentage in the second condition went on to complete rating tasks.

At the completion of the qualification task, workers were assigned to one of nine groups based on their randomly assigned condition and their ideology. We asked about both party affiliation and ideology, each on a five-point scale; raters who both leaned liberal and leaned toward the Democratic Party were classified as liberal; those who both leaned conservative and leaned toward the Republican Party were classified as conservative; others were classified as moderate. In each of the three conditions, eighteen people of each ideology rated each news article. Table 2 shows that more raters were liberal than moderate or conservative. Thus, in order to get eighteen ratings from each rater group, the moderate and conservative raters rated more items on average than the liberal raters, as shown in Table 3. Raters had the option to say they did not have enough information to make a judgment. This option was rarely used (< 3% in all conditions, see Table 5 in the appendix). Such ratings were excluded when computing averages. Figures 6 and 7 in the appendix show the frequency of judgments, by treatment condition and by ideology.

Four journalists rated all items from both collections. They used the labeling interface for Condition 2, which required them to do individual research. All four had just completed a prestigious, selective fellowship for mid-career journalists. Three were U.S.-based and the fourth had covered U.S. politics for many years. One had been through the American Press Institute's fact-checking bootcamp. One journalist did not rate one item. Another journalist did not rate two items and reported that they did not have enough information to make a judgment on 54 others. Such missing ratings were excluded when computing averages and correlations.

We report three evaluation metrics. First is inter-rater agreement, a measure of internal consistency. Second is ideological polarization, measured by correlation between liberal and conservative ratings. Third is correlation with journalist ratings, which we compare to benchmarks of journalist-to-journalist correlation.

Figure 1 provides a heatmap of the frequency of the 1-7 ratings for each item in the three conditions. Each item is a row, and rows are sorted based on the mean ratings for the item across all three conditions. Inter-rater agreement summary statistics are computed from these data. The color coding makes it obvious that in the two research conditions there were many more items with a consensus of 1 (not misleading at all) or 7 (false or extremely misleading), and fewer intermediate ratings.

Figure 1: Frequency of the 1-7 ratings for each item in the three conditions.

As a summary measure of agreement, we report the intraclass correlation of ratings (ICC) Shrout and Fleiss [1979]. Row 1 of Table 4 shows the ICC estimates.
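To make the agreement measure concrete, the following is a minimal sketch of the Shrout and Fleiss ICC(1,k) for a matrix of ratings. It is not the implementation used in this study, which relied on an R package accessed through rpy2 as described in the Appendix; the matrix layout, function name, and toy data are hypothetical.

import numpy as np

def icc_1k(ratings):
    # ratings: (n_items, k) array in which each row holds the k ratings an item
    # received; because the one-way model treats the k slots as draws from a
    # single rater pool, the same workers need not rate every item.
    n, k = ratings.shape
    item_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    ms_between = k * np.sum((item_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - item_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / ms_between   # ICC(1,k): reliability of the item means

# Toy illustration: five items, each rated 1-7 by three raters (made-up data).
toy = np.array([[7, 6, 7],
                [1, 2, 1],
                [4, 5, 3],
                [6, 7, 7],
                [2, 1, 2]])
print(round(icc_1k(toy), 3))

The one-way formulation is what allows an unstable rater pool, since individual rater identities never enter the calculation.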
We find that the ICC is highest in the collective research condition and that the differences are statistically significant.

We also assess whether there is systematic disagreement between raters with different political ideologies. For each item, we compute the mean rating among the eighteen liberal raters and the mean rating among the conservative raters. We then compute the Pearson correlation coefficient, across items, of these mean ratings. The liberal-conservative correlation was .82 in the no-research condition, .88 in the individual research condition, and .90 in the collective research condition. The effects, however, were not uniform for all kinds of items. On just the political items from the first collection, the liberal-conservative correlations in the three conditions were .69, .81, and .79. On just the non-political items from that collection, the correlations were .90, .93, and .98. On the items from the second collection, the liberal-conservative agreement hardly varied between conditions: .83, .83, and .85. See the Appendix for details about results on subsets of the items.

Our external validity metric is to compare the performance of panels of lay raters to the performance of benchmark panels of journalists. Journalists make an appropriate benchmark because of their expertise and professional ethos. To enable performance comparison, both lay rater panels and journalist panels are scored across the whole set of items based on their correlation with the ratings of a single held-out journalist. In other words, we measure the ability of the panel to predict the ratings of a held-out journalist. The logic is that each single journalist is presumed to be reasonably correlated with the underlying, hidden ground truth. Thus, if one panel correlates with a single journalist better than another panel does, we infer that the first panel correlates better with the hidden ground truth.

Rater panels equivalent to journalist panels, by condition (no research, individual research, collective research):
Liberal raters equivalent to three journalists: >18, 6.82, 13.81
Conservative raters equivalent to one journalist: >18, 5.48, 9.68
Conservative raters equivalent to two journalists: >18, >18, >18
Conservative raters equivalent to three journalists: >18, >18, >18

Benchmarking against the performance of simulated panels of journalists is especially interesting because platforms may face the practical question of whether and when to rely on crowd judgments to extend the reach of journalists' or fact checkers' judgments, which are available for only a limited set of items. If a platform would be willing to rely on the judgment of a single journalist, then arguably it should be willing to rely on a crowd that performs better than a single journalist at predicting what another journalist would say. If a platform would be willing to rely on the majority rating of three journalists, then arguably it should be willing to rely on a crowd that performs better than a panel of three journalists at predicting what another journalist would say.

Note that we always score simulated panels against a single held-out journalist. If, by contrast, we scored panels against the mean of several journalists rather than against a single journalist, the correlation scores for all simulated panels would be higher, but the relative ordering of scores should not change. By scoring all panels against just a single journalist, we leave three of the four journalists available for inclusion in simulated benchmark panels. Thus, we are able to compare the performance of large lay panels to the performance of groups of two and three journalists.
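As an illustration of this scoring procedure, the sketch below estimates the expected correlation between the mean rating of a randomly drawn panel of k lay raters and a randomly chosen held-out journalist. The matrix layout and names are hypothetical, and the actual analysis also excluded "not enough information" ratings and handled missing journalist ratings; this is only a simplified sketch of the metric.

import numpy as np

rng = np.random.default_rng(0)

def panel_power(lay, journalists, k, n_draws=1000):
    # lay: (n_items, n_lay_raters) matrix of 1-7 ratings from lay raters
    # journalists: (n_items, n_journalists) matrix of journalist ratings
    # Returns the expected Pearson correlation between the mean rating of a
    # random panel of k lay raters and a randomly chosen held-out journalist.
    n_items, n_lay = lay.shape
    n_j = journalists.shape[1]
    corrs = []
    for _ in range(n_draws):
        panel = rng.choice(n_lay, size=k, replace=False)   # draw a panel of k raters
        holdout = rng.integers(n_j)                        # draw the held-out journalist
        panel_mean = lay[:, panel].mean(axis=1)
        corrs.append(np.corrcoef(panel_mean, journalists[:, holdout])[0, 1])
    return float(np.mean(corrs))

A benchmark journalist panel can be scored the same way, except that the panel is drawn from the journalists other than the held-out one.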
Considering all four possible panels of three journalists, the average correlation with the remaining journalist was 0.73. For panels of two journalists, the average correlation with a held-out journalist was 0.70. For single journalists, averaged over the twelve possible pairings of one journalist with a held-out journalist, the correlation was 0.65.

Following the procedure in Resnick et al. [2021], we construct simulated panels of lay raters of various sizes, from one to fifty-four, to produce a "power curve", as shown in Figure 2. Each point on the power curve is the expected correlation of a randomly selected group of lay raters of that size with a randomly selected journalist, the power of a panel of that size to predict what a journalist will say. We use a bootstrap procedure to estimate confidence intervals over 500 resampled sets of items. The intersection points in the graphs in Figure 2 show the number of lay raters required to get the same predictive power as one or three journalists. For example, in the no-research condition, 7.69 lay raters were sufficient to achieve the same power as one journalist.

Comparing across the three conditions, we can see that lay raters in the two research conditions correlate with a journalist better than do raters in the no research condition, and that the individual research condition has greater power than the collective research condition for large groups of lay raters. In the individual research condition, C2, 15.22 lay raters were equivalent to three journalists; even 54 raters were not sufficient to achieve the same power as three journalists in the other two conditions, and in the no research condition, C1, 54 raters were also insufficient to achieve the same power as two journalists.

Figure 2: Power curves for the three conditions. The x-axis is the number of turkers. The y-axis is the correlation of the mean of k turkers' ratings with a journalist's rating. The green horizontal lines show the correlation of a randomly selected journalist's rating for each item with a held-out journalist (0.65). The blue lines show the correlation of the mean of three journalists with a held-out journalist (0.73).

To assess reliability, we considered the results separately for each bootstrap sample of items (see Table 6 in the Appendix). For all group sizes, the power of a group of lay raters in the individual research condition was higher than that of a group of the same size in the no research condition in all of the 500 bootstrap samples. With just a single rater, the collective research condition had higher power than the individual research condition on 71.5% of the item samples. With groups of five raters, however, the individual research condition performed better than the collective research condition on 98.4% of item samples. In condition C2, with individual research, eighteen lay raters outperformed two journalists on all of the bootstrap item samples and outperformed three journalists on 65.7% of samples (see Table 8 in the Appendix). 54 lay raters outperformed three journalists on 97.4% of bootstrap item samples.

As noted in the introduction, researchers found that with two 75-minute training sessions, students adopted practices of external search for supporting and challenging information, and lateral reading about the author and source Wineburg and McGrew [2019]. In our study, none of the raters received explicit training. Both of the treatment conditions, however, required raters to seek out or consider external evidence.
This led to improved misinformation judgments, as measured in a variety of ways. Judgments in conditions 2 and 3 were more internally consistent between raters, showed less partisan divide, and were better correlated with expert journalist judgments.

The results are mixed about the best way to ask raters to consider external evidence. Condition 2, where each person searched individually, had lower consistency among raters than Condition 3 and more partisanship. However, when averaging ratings from several raters, the correlation with a journalist was higher in Condition 2, with the difference becoming more and more reliable when averaging across larger rater sets. Each person doing their own search seems to yield a little more information per person, but also more noise in individual assessments.

Wisdom of crowds models posit that when averaging several judgments it is better for those judgments to be independent Surowiecki [2005]. Our results are consistent with that; in Condition 3, examining a common set of links to potential corroborating or challenging evidence may have yielded correlated errors in judgment. We suspect that the best procedure for eliciting judgments from raters will be some hybrid that encourages both individual search and considering links that have been discovered by others, especially links discovered by people with different ideologies.

If one imagines that the primary use of crowd-based misinformation judgments will be to judge hundreds or thousands of articles every day, basically to "scale up" real-time fact-checking, then cost-effectiveness is an important consideration. An alternate vision, which we favor, conceives of using citizen juries as a governance mechanism, rather than as a way to speed up decision making Zittrain [2019], Fan and Zhang [2020]. In that vision, crowd-based misinformation judgments could be used as part of appeals processes, as ground truth for transparency reports about platform performance, and as training data for human and automated processes. For that, they would need to operate on a medium rather than large scale. In either case, we think that it is most important at this stage in the development of crowd-sourced misinformation judgments to focus on processes that maximize quality rather than minimize costs; cost optimization can come a little later.

Even so, it is interesting to consider the cost effectiveness of different ways of eliciting judgments. We offered raters in the no research condition $.50 for each article rated and in the two research conditions $1 for each article rated, in order to compensate them for the extra time required to do research. If, in practice, one's goal were to get a rating of quality similar to what one would get from a single journalist, this could have been accomplished for $5.50 using eleven raters in the no research condition or $4 using four raters in either of the research conditions.

One limitation of our study is that Turkers passed the qualification test at a lower rate in Condition 2. Thus, it is possible that the better performance in that condition was due, in part, to a pool of raters who were more diligent or skilled, rather than the requirement that they search for a corroborating source.
It would be interesting in a future study to tease apart the selection effect from the task effect; if the selection effect is sufficient to yield the rater performance found in the second condition, without requiring raters to actually perform independent research, the costs of rater labeling could be further reduced.

Another limitation is that articles were rated well after they were first posted. It is possible that searching for corroborating evidence would not be as impactful soon after the articles were posted, and thus not as effective at driving raters to provide better misinformation judgments. The study that generated our second collection of articles was explicitly designed to compare judgments made by lay raters and journalists within a few days of an article's publication. It would be interesting to analyze how well those ratings correlate with our journalist and lay ratings that were collected several months later.

Finally, more training for journalists, more time spent evaluating each article, or incentives for agreeing with each other could lead to higher inter-rater agreement among journalists. That would set a higher benchmark for lay panels to compete against, as noted in Godel et al. [2021].

It is worth revisiting the two previous studies to try to assess possible reasons for discrepant results. Because we reused most of the same items, we can rule out some possible explanations, while others will require further research to tease apart.

First, we note that the three studies constructed the journalist benchmark slightly differently. All three studies compared a simulated panel of lay raters to a benchmark simulated panel of one or more journalists. We ensured apples-to-apples comparisons by scoring both lay panels and journalist panels on their ability to predict a single held-out journalist's ratings. Godel et al. [2021] scored both lay panels and a single journalist against the modal answer of a panel of journalists. However, for scoring the single journalist, they had to exclude that journalist, and thus they scored the journalist against the majority vote of a smaller panel. Most likely, the majority vote of a smaller panel will have higher variance, which would make their single-journalist benchmark score lower than it would be in a fair comparison. However, because of the tie-breaking methods they used (a 2-2 vote of four journalists was treated as "not misinformation"), it is possible that comparing to a smaller panel may have produced a score that was higher than it would be in a fair comparison.

The main analysis in Allen et al. [2021] had an even larger difference in the metrics for comparing the performance of lay rater panels and the performance of a benchmark single journalist. Lay rater panels were scored against the mean of four journalists. The benchmark single journalist, however, was scored against just one other journalist. With that comparison, they found that the lay rater panels scored better than a single journalist. However, in an appendix (Figure S9 in Allen et al. [2021]) they provide data for lay panels scored against single journalists. As expected, the correlation is lower; lower, in fact, than the score of the benchmark single journalist. Thus, there may be less discrepancy between the results of the two prior studies than first meets the eye.

Another possible explanation for discrepant results is uncertainty due to sampling error.
Godel et al. [2021] reported on one binary outcome (whether the panel's prediction matched the majority vote of the journalists) for each of 135 articles. The accuracy reported for the majority vote of a random crowd of 25 people was 62%, and for the single journalist benchmark it was 69%. But this difference in performance could easily occur by chance even if there were no difference between the two. Suppose, for example, that panels of 25 people and the benchmark journalist both had 65% prediction accuracy. The 95% confidence interval for the sampling distribution of 135 draws from a Bernoulli distribution with p=.65 is ±.08 (1.96 × sqrt(.65 × .35 / 135) ≈ .08).

In our study, we have a larger pool of items. We quantify the uncertainty of our results through bootstrap sampling of 500 simulated article sets. In our individual research condition, panels of seven or more lay raters outperformed a benchmark of a single journalist on all 500 simulated article sets, and the full panel of 54 lay raters outperformed a benchmark of three journalists on 98% of the simulated article sets.

An even bigger source of uncertainty comes from the individual journalists. We employed just four journalists and the other studies employed three and six. Even one or two who were outliers from their peers could have reduced the average ability of benchmark journalist panels to predict what other benchmark journalists would say. Since we included the same items from Allen et al. [2021] in our study, both teams were able to perform robustness checks using the other team's journalist ratings. Tables 9 and 10 in the Appendix show qualitatively similar results in our study whether we use ratings from our four journalists or from their three journalists for the analysis. Because the journalist ratings from the other study are not available, we have not been able to perform a similar robustness check on the items from the second article collection.

A third possible reason for discrepant results could be differences between the types of items in the two sets. The first article collection was selected to include only articles with a factual claim in the headline. These might be easier for lay raters to judge. To assess this, we ran our analyses separately on the two article collections. Results were qualitatively similar for both. On the 98 articles from the second collection that were from fringe rather than mainstream sources, the subset that was analyzed in Godel et al. [2021], lay raters correlated with a journalist less well but journalists also correlated with each other less well; nineteen or more lay raters outperformed a benchmark of three journalists (see Tables 11 and 12 in the Appendix). We did, however, find that lay rater performance may have been worse on the subset of 109 political items from the first collection, with panels of 15.26 raters matching the performance of two journalists but even the full panel of 54 lay raters failing to match the performance of three journalists (see Table 13 and Figure 9 in the Appendix). Because of the smaller sample size, however, confidence intervals are wider, so there is uncertainty about whether performance on these kinds of items is truly different.

Godel et al. [2021] suggests that the timing of when rating happens may be critical to the capabilities of lay crowds. In particular, they collected ratings within three days after articles first appeared. For some news items, follow-up articles or fact-checks might appear later that help to disseminate correct information.
Prior to that, lay raters may be poor judges of information quality. Our study and Allen et al. [2021] had both lay raters and journalists assess articles weeks to months after they first appeared. If lay crowds are especially poor at assessing articles soon after they are posted, relative to a benchmark of journalists, that would limit their use for real-time decision-making. As mentioned previously, if lay panels are conceived of as citizen juries, as part of platform governance and transparency procedures, the ability to make real-time judgments may not be as important.

Finally, we note significant differences in the rating procedures between the studies. Our finding that more informed raters make better judgments may explain some of the differences in results. The study from which we drew the first collection of articles asked raters to examine only the headline and lede, and in one condition the source, without clicking through to see the whole article Allen et al. [2021]. As noted, they found that the average of fifty lay raters had slightly lower performance than one fact-checker. Our first condition provided a little more information to raters, by asking them to examine the full article, and panels of eight or more lay raters performed better than a single journalist. Our second and third conditions, involving individual research or reviewing the results of collective research, provided our raters with even more information, and led to still better judgments.

It is also worth noting that refinement of the selection process and the user interface and instructions may make a big difference for the performance of lay crowds. Studies of crowdsourced labeling in other domains have found that quality control measures and small amounts of training are helpful Mitra et al. [2015]. For this study, we developed a custom web interface embedded as an iframe within MTurk pages, and went through multiple rounds of UX testing over several months. We also excluded raters who did not pass a simple test of whether they understood the interface and instructions and those who did not correctly answer two out of four knowledge questions. Finally, many MTurk workers are very conscientious, especially if they fear work rejections that will harm their ability to earn money in the future. In this study, in addition to the misinformation judgments, workers were asked to provide a subjective opinion about what enforcement action they thought platforms should take and to make a prediction about other raters' subjective opinions. They were told that their judgments and opinions would not be evaluated, but that if their predictions were too far off they would be disqualified, and this may have encouraged workers to be conscientious. All of these factors may have helped the lay crowds in our study to perform better than they might have otherwise.

Consider a stylized hierarchical model of individual raters' judgments when those raters are drawn from some defined pool (e.g., journalists, or liberal workers on Mechanical Turk):

Judgment ∼ Truth + GroupOffset + RaterNoise

Truth for an article is an unobserved hypothetical "correct label" for an item. The GroupOffset for an article is the difference between the Truth and the (also unobserved) mean rating in the rater population. RaterNoise is the deviation of individual raters from that rater population's mean for an item, due to inter-individual differences and factors like fatigue and distractions.
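To make the stylized model concrete, here is a small simulation sketch. The variance parameters, variable names, and item count are assumptions chosen only for illustration; they are not values estimated from our data.

import numpy as np

rng = np.random.default_rng(1)
n_items = 368

# Hypothetical variance components for this stylized model (illustrative only).
truth = rng.normal(0.0, 1.0, n_items)               # Truth, one value per item
lay_offset = rng.normal(0.0, 0.4, n_items)           # GroupOffset for a lay rater pool
journalist_offset = rng.normal(0.0, 0.2, n_items)    # GroupOffset for the journalist pool

def lay_panel_mean(k, noise_sd=1.0):
    # Each of k raters sees Truth + GroupOffset + independent RaterNoise;
    # averaging the k judgments shrinks the noise term by a factor of sqrt(k).
    noise = rng.normal(0.0, noise_sd, (n_items, k)).mean(axis=1)
    return truth + lay_offset + noise

journalist = truth + journalist_offset + rng.normal(0.0, 0.5, n_items)

for k in (1, 18):
    r = np.corrcoef(lay_panel_mean(k), journalist)[0, 1]
    print(f"panel of {k:2d} lay raters: correlation with one journalist = {r:.2f}")

In this kind of simulation, enlarging the panel raises the correlation with the journalist because only the RaterNoise term is averaged away; the GroupOffset remains.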
In this model, there is a Journalist Consensus for each item, consisting of the Truth plus possibly a GroupOffset. And then each journalist sees a draw from a distribution centered on the Journalist Consensus for that item. Similarly, in each condition, for each item there is a Liberal Lay Rater Consensus, a Conservative Lay Rater Consensus, and an overall Lay Rater Consensus. The GroupOffset can be further decomposed into two components: one due to imperfect expertise of the rater pool and the other due to ideological bias. While none of the underlying components, Truth, GroupOffset, and RaterNoise, can be directly observed, this model provides a framework for interpreting many of the results of our study.

Most importantly, the relatively high absolute correlations among judgments from random pairs of raters in all conditions suggest that the Truth component of judgments was fairly large; perhaps the notion of truth is not completely broken beyond repair.

The noise in individual ratings was much higher for lay raters than for journalists. This can be inferred from the correlations between random pairs of raters from each group. The average correlation between a pair of random lay raters in the individual research condition was 0.47. For a random pair of journalists, it was 0.65. The intuition behind the so-called wisdom of crowds is that, following the central limit theorem, the mean of a large sample of independent draws from a distribution has lower variance than a single draw Surowiecki [2005]. In our model, taking the mean of many raters should reduce the RaterNoise. We find just that: the correlation between the means of random collections of eighteen raters in the individual research condition was 0.94, up from 0.47 for pairs of individual raters. In our study, the reduction in noise was sufficient to overcome the presumably smaller GroupOffset that journalists have because of their expertise and professionalism. To see this, note that the correlation of a journalist with the mean of a group of eighteen lay raters in the individual research condition (C2) was 0.74, much better than the 0.65 correlation between a journalist and another journalist.

Liberal and conservative lay raters had systematically different GroupOffsets. This follows from the fact that groups of liberal and conservative lay raters correlated with each other less than groups of random lay raters. Even in the individual research condition (C2), where this effect was reduced, the correlation between the means of eighteen liberal and eighteen conservative lay raters was 0.88, short of the 0.94 correlation between the means of random collections of eighteen lay raters. This could reflect ideological bias in judgments by one or both groups, or differences in expertise between the groups. We cannot distinguish between ideological bias and expertise differences, nor can we determine which group's Consensus tended to be closer to the hypothetical Truth. Consequently, we also cannot determine definitively whether the journalists had any bias in their judgments.

Liberal lay raters correlated better with journalists than conservative lay raters did, meaning that the Journalist Consensus tended to be closer to the Liberal Lay Rater Consensus than it was to the Conservative Lay Rater Consensus. To the extent that liberal lay raters had ideological bias contributing to their GroupOffsets, the GroupOffsets for journalists must reflect similar ideological bias.
On the other hand, if the GroupOffsets for liberals reflected only limited expertise, then a better correlation with journalists might not imply any ideological bias on the part of the journalists.

Overall, the results suggest that juries composed of lay raters, perhaps the users of social media platforms themselves, could be a valuable resource in assessing misinformation. Together, their assessments can be comparable to or better than the assessments of three expert journalists. Importantly, these pools of raters should be readily available to most large social media platforms. Moreover, they could be selected to deliberately be representative of different ideological viewpoints, thus reducing concerns about potential bias of raters who are selected solely for their expertise. Much work remains, however, to refine the processes by which rater pools are selected and the research tasks they are asked to perform. Would even more research, or more research done in a different way, further reduce ideological disagreement between liberals and conservatives and further increase alignment with journalist ratings? Moreover, the present work treats all disagreements in judgments as equally important. However, in reality, we know that certain disagreements (e.g., election fraud, the dangers of COVID-19, etc.) have outsized capability to harm society. More work should be done to systematically understand these factors and how to address them.

We thank the four journalists who evaluated the articles. We thank the two other study teams whose article collections we used in this paper. In addition to sharing the article URLs, they provided useful feedback at a workshop where study designs were shared and at a second where preliminary results were shared. In particular, we thank Kevin Aslett, Adam Berinsky, William Godel, Amber Heffernan, Quentin Hsu, Jenna Koopman, Nathan Persily, David Rand, Zeve Sanderson, Luis Sarmenta, Henry Silverman, and Josh Tucker. We thank Kelly Garrett and Eshwar Chandrasekharan for early conversations that influenced the design of the study. Both workshops were sponsored by Facebook. Facebook also provided funding for this study. One of the authors of this study (Paul Resnick) served as a paid consultant to Facebook from May 2018 to May 2020.

6 Appendix

The ICC is computed with a one-way random effects model, using the mean of raters and consistency as internal metrics Koo and Li [2016]; we compute the ICC using the rpy2 bridge library, which accesses an underlying CRAN implementation of the ICC. This formulation of the ICC accounts for unstable rater pools: i.e., there is no guarantee that the same group of raters rates each URL. We compute the overall ICC and confidence intervals of each condition using a bootstrap procedure: we simulate 10,000 draws from our data via sampling with replacement.

For each of our 500 bootstrap samples of items, we computed the power curve giving the average, over many samples of k turkers, of the correlation of the mean of those turker ratings with a journalist's ratings. This was done in each condition. As a measure of the reliability of comparisons between conditions, we report the fraction of those samples where the correlation was higher in one condition than in another, as shown in Table 6.
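The following is a minimal sketch of that bootstrap power-curve computation for a single condition. The matrix names and sampling counts are placeholders rather than the exact implementation, and the handling of missing ratings is omitted.

import numpy as np

rng = np.random.default_rng(2)

def power_curves(lay, journalists, ks, n_boot=500, n_draws=200):
    # lay: (n_items, n_lay_raters) ratings for one condition
    # journalists: (n_items, n_journalists) ratings
    # Returns an array of shape (n_boot, len(ks)): for each bootstrap sample of
    # items, the expected correlation of the mean of k lay ratings with a
    # randomly chosen held-out journalist.
    n_items = lay.shape[0]
    curves = np.empty((n_boot, len(ks)))
    for b in range(n_boot):
        items = rng.integers(n_items, size=n_items)      # resample items with replacement
        L, J = lay[items], journalists[items]
        for j, k in enumerate(ks):
            cs = []
            for _ in range(n_draws):
                panel = rng.choice(L.shape[1], size=k, replace=False)
                holdout = rng.integers(J.shape[1])
                cs.append(np.corrcoef(L[:, panel].mean(axis=1), J[:, holdout])[0, 1])
            curves[b, j] = np.mean(cs)
    return curves

Comparing, bootstrap sample by bootstrap sample, which condition's curve is higher at each panel size yields reliability fractions of the kind reported in Table 6.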
From the same 500 bootstrap samples, a survey-equivalence value was computed: how many turkers are equivalent to one, two, or three journalists. In Table 7, we report a range that includes 95% of the values computed for the bootstrap samples. In Table 8, we report the fraction of bootstrap samples where k lay raters outperformed m journalists in predicting the ratings of a held-out journalist.

As described in the main text, there were 207 news items in the first sample, of which 109 were marked as political. There were 165 items in the second sample. We have ratings from four journalists for both samples. In addition, the team that assembled the first sample of items collected ratings from three other journalists, as described in Allen et al. [2021]. Those journalists did not answer exactly the same question we posed, but their ratings were also on a 1-7 scale. We were able to treat them as comparable to our journalist ratings after reverse coding them.

In Tables 9 and 11 we report key result metrics for the two article collections separately. Table 10 reports key result metrics for the three journalists from Allen et al. [2021]. Table 13 reports key result metrics for only the 109 political items. Since the results are somewhat different for the political items, Figure 9 plots the power curves and survey-equivalence values for just those items.

Table 10: Summary of results for article collection one only, using the journalist ratings from Allen et al. [2021]. Note that it is only possible to compute equivalences to panels of one or two journalists, because only three journalist ratings were available and one journalist always has to be held out as the reference rater.

Figure 8: Power curves for the three conditions for only the 98 articles from collection two that came from nonmainstream sites. The x-axis is the number of turkers. The y-axis is the correlation of the mean of k turkers' ratings with a journalist's rating. The green horizontal lines show the correlation of a randomly selected journalist's rating for each item with a held-out journalist (0.47). The blue lines show the correlation of the mean of three journalists with a held-out journalist (0.57).

Table 13: Summary of results for only the political articles from collection one, using our four journalist ratings.

Figure 9: Power curves for the three conditions for only the 109 political items from collection one. The y-axis is the correlation of the mean of k turkers' ratings with a journalist's rating. The green horizontal lines show the correlation of a randomly selected journalist's rating for each item with a held-out journalist (0.60). The blue lines show the correlation of the mean of three journalists with a held-out journalist (0.69).
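As a rough illustration of how a survey-equivalence value can be read off a power curve (the actual procedure follows Resnick et al. [2021]), the sketch below interpolates the panel size at which the expected correlation reaches a journalist benchmark. The function, the toy curve, and the interpolation rule are hypothetical simplifications.

import numpy as np

def survey_equivalence(ks, curve, benchmark):
    # ks: panel sizes (e.g., 1..54); curve: expected correlation with a held-out
    # journalist at each size; benchmark: e.g., 0.65 for one journalist or 0.73
    # for a three-journalist panel.  Returns inf if the curve never reaches the
    # benchmark, which corresponds to the ">18" or ">54" entries in the tables.
    curve = np.asarray(curve, dtype=float)
    above = np.nonzero(curve >= benchmark)[0]
    if len(above) == 0:
        return float("inf")
    i = above[0]
    if i == 0:
        return float(ks[0])
    # Linear interpolation between the last point below and the first point
    # above the benchmark, which is how fractional values such as 7.69 can arise.
    k_lo, k_hi = ks[i - 1], ks[i]
    c_lo, c_hi = curve[i - 1], curve[i]
    return float(k_lo + (benchmark - c_lo) / (c_hi - c_lo) * (k_hi - k_lo))

# Hypothetical usage with a toy curve:
ks = list(range(1, 11))
toy_curve = [0.45, 0.55, 0.60, 0.63, 0.65, 0.67, 0.68, 0.69, 0.70, 0.71]
print(survey_equivalence(ks, toy_curve, benchmark=0.65))  # -> 5.0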