title: Predicting the Factuality of Reporting of News Media Using Observations About User Attention in Their YouTube Channels
authors: Bozhanova, Krasimira; Dinkov, Yoan; Koychev, Ivan; Castaldo, Maria; Venturini, Tommaso; Nakov, Preslav
date: 2021-08-27

Abstract: We propose a novel framework for predicting the factuality of reporting of news media outlets by studying the user attention cycles in their YouTube channels. In particular, we design a rich set of features derived from the temporal evolution of the number of views, likes, dislikes, and comments for a video, which we then aggregate to the channel level. We develop and release a dataset for the task, containing observations of user attention on the YouTube channels of 489 news media. Our experiments demonstrate both complementarity and sizable improvements over state-of-the-art textual representations.

Disinformation in the news and in social media is perceived as having a major impact on society, e.g., during the 2016 US Presidential election (Grinberg et al., 2019) and the Brexit referendum (Gorrell et al., 2018). During the COVID-19 pandemic outbreak, the moral panic (McLuhan, 1964) around online disinformation grew to a whole new level as the first global infodemic (MIT Technology Review: tinyurl.com/y8oschng). Fighting disinformation online is now recognized as one of the most important issues facing societies around the world.

In this paper, we highlight an aspect of disinformation that is often neglected. Rather than examining the truth value of individual pieces of information, we investigate the general quality of the attention regimes in different news outlets by analyzing the YouTube channels of news media.

YouTube is the largest and most popular video-sharing platform, with over two billion users, and it is also the second most widely used news source in the USA, after Facebook. While the platform has been scrutinized for the way in which it may amplify marginal and sometimes radical content (Munn, 2019; Ribeiro et al., 2020), the connection between attention dynamics and disinformation levels is still largely unexplored.

Here, we do not focus on specific videos, but rather on the entire YouTube channels of news media and their attention dynamics. In particular, we are interested in differentiating the "attention cycles" (Downs, 1972; Leskovec et al., 2009) of YouTube channels, that is, in assessing the rapidity and the steepness with which their videos rise and fall in the consideration of their audiences. While some outlets encourage extensive and diverse discussions, others tend to concentrate everyone's attention on the latest hot-button issue, thus distracting public opinion instead of nourishing it (Venturini, 2019).

Our contributions are the following:
• We propose to model the factuality of news media based on the user attention cycles in their respective YouTube channels.
• We release a specialized dataset for the task.
• We show experimentally that considering attention cycles yields considerable performance gains on top of text representations for predicting the factuality of news media.

The paper is organized as follows: Section 2 presents related work. Section 3 describes our dataset. Section 4 discusses our methodology. Section 5 presents the experiments and results. Section 6 offers analysis and discussion. Section 7 concludes and points to directions for future work.
Significant efforts have been dedicated in recent years to automating the detection of disinformation (commonly referred to as fake news), which we review more closely in this section. We conclude by mentioning previous attempts to use data from YouTube for media classification tasks.

Analysis of the Content. Many approaches have been proposed to analyze both the style and the content of fake news. Using natural language processing techniques, Horne and Adalı (2017) pointed out that fake news can be characterized by stylistic features such as the overuse of proper nouns, punctuation, capital letters, negation terms, and repetitions in the text (Rubin et al., 2016). Fake news has also been associated with higher intensity of sentiment and emotion compared to mainstream news (Giachanou et al., 2019). Here, we also use textual representations, but (i) we focus mainly on analyzing the user attention cycles in YouTube channels, and (ii) we aim at categorizing entire news media outlets rather than individual pieces of news.

Analysis of the Response. Many researchers have used user reactions on social media platforms to identify disinformation, e.g., the content and the number of replies to a piece of news, or the propagation of the content through the network. For instance, Zhao et al. (2015) classified disputed claims based on their comments and reactions, assuming that, if a claim is not true, at least some replies would question its factuality. Indeed, later studies (Ruchansky et al., 2017; Nguyen et al., 2020) have shown that user response features are quite important. Here, we also focus on inspecting the user response and its potential link to disinformation, but we do so using user attention cycles on YouTube.

Analysis of the Source. Looking at a news media outlet as a source of low-quality content is another way to approach the problem. These methods use features modelling the overall trustworthiness of the source, such as: (i) Does the news media outlet use verified accounts on established platforms, such as Wikipedia and Twitter? (ii) If it does, do these accounts have a proper description, location, website references, etc.? (iii) What does the URL of the media's website look like? (iv) Does the medium express political bias or sentiment? Baly et al. (2018) and Baly et al. (2020) used features motivated by these questions to achieve better results in combination with content features and user profiles in social media. In our work, we also aim at classifying entire news media outlets, but we do so using user attention cycles on YouTube along with text.

Using Temporal Attention Data for Disinformation Detection. We focus on the analysis of temporal patterns associated with news outlets of different types and quality. Previous studies (Ruchansky et al., 2017; Nguyen et al., 2020) have suggested that a combination of temporal, content-based, and user-based features is promising for disinformation detection. As viral spreading dynamics are often associated with successful junk news, we also draw on studies modelling virality, such as (Hoang et al., 2011) for tweets. Similar features are well-suited for our task, as described in Section 4.1.2, but (i) we model the user behavior differently, and (ii) we focus on data collected from the YouTube channels of the target news media.

Using YouTube Data for Classification. The YouTube platform contains information that is still underexplored for the purposes of disinformation detection.
Dinkov et al. (2019) looked into detecting the left-centre-right political bias of YouTube channels. Baly et al. (2020) included features from the news source's YouTube channel, derived from both sound and user profiles. The above work uses raw statistics about the number of views, likes, dislikes, and comments per video. We, instead, use much richer temporality features in combination with the textual representation of the videos.

We started from a corpus of news media outlets whose reliability has been evaluated by Media Bias/Fact Check (MBFC). Led by a team of independent journalists and researchers, MBFC has analyzed close to 4,000 news outlets over the past six years. For each news outlet, they provide a detailed analysis summarized by a 'factuality' score chosen from the following: Very High, High, Mostly Factual, Mixed, Low, and Very Low.

We searched for the YouTube channels of the news outlets in the MBFC corpus, and we monitored all the videos they published from February 2020 to August 2020. Using the YouTube Data API (http://developers.google.com/youtube/v3/docs/videos), we collected the number of views, likes, dislikes, and comments accumulated during the first seven days after the publication of each video. We also stored each video's title and description.

We observed that the percentages of media channels labelled with the Very High and the Very Low categories were 3.1% and 1%, respectively. We thus merged Very High with High, Mostly Factual with Mixed, and Very Low with Low, ending up with a 3-way labelling: High, Mixed, and Low. Finally, for the sake of data balancing, we excluded the channels with fewer than 20 videos, and we capped the most prolific channels at their 100 most recent videos. The final distribution of channels and their corresponding videos for each level of factuality is shown in Table 1.

Our system is composed of two main components, focusing on (i) data preparation and (ii) sequential classification, respectively. Below, we describe the representations we use for video-level and for channel-level classification.

The data preparation component transforms the YouTube source data and produces representations (or features) for our model. We generate representations both for the textual content of the videos and for the user attention data, for which we introduce a number of novel features, presented in Section 4.1.2.

For each video, we gathered the title and the description, and we extracted Sentence-BERT embeddings (768 features each) for both. These embeddings are derived from a modification of the pre-trained BERT model that yields semantically meaningful sentence embeddings, trained to be directly comparable using cosine similarity (Reimers and Gurevych, 2019).
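As an illustration, the following is a minimal sketch of this step using the sentence-transformers library; the model checkpoint and the input records are our assumptions, not necessarily the authors' exact configuration:

```python
# A minimal sketch (our illustration, not the authors' pipeline) of
# extracting Sentence-BERT embeddings for video titles and descriptions.
import numpy as np
from sentence_transformers import SentenceTransformer

# Any pretrained SBERT checkpoint that produces 768-dimensional vectors
# fits the description in the paper; this one is an assumption.
model = SentenceTransformer("bert-base-nli-mean-tokens")

videos = [  # hypothetical records collected from the YouTube Data API
    {"title": "Breaking news ...", "description": "Full story at ..."},
]

title_emb = model.encode([v["title"] for v in videos])        # (n, 768)
desc_emb = model.encode([v["description"] for v in videos])   # (n, 768)

# Concatenating the two gives the 1,536 textual features per video.
text_features = np.concatenate([title_emb, desc_emb], axis=1)
```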
We hypothesize that the attention received over time by the videos in a YouTube channel can be used to predict the quality of its contribution to the online public debate, as captured (albeit imprecisely) by the factuality score assigned by MBFC. We generate a set of features that model the user attention cycles by looking at the temporal variation of the number of user actions (views, likes, dislikes, and comments) in the first week after a video has been published. We aim to model the following:
• How are user actions distributed hourly/daily?
• How much are the user actions concentrated in peak hours?
• At what moment in time does the peak hour for each user action type occur?
• How steep is the time series in terms of the distribution of hourly user actions?

We define $UA^d_i$ as the total number of user actions that have occurred by the end of day $i$, where $i$ ranges over $\{1, 2, \ldots, 7\}$ and $UA^d_0 = 0$. Similarly, $UA^h_j$ is the number of user actions that have occurred by the end of hour $j$ after the publication of the video, where $j$ ranges over $\{1, 2, \ldots, 168\}$ and $UA^h_0 = 0$. We generate a set of attention features for each user action type, which we group into the following categories:

1. User actions daily percentage ($D^d_i$), or the fraction of user actions out of the total that occurred on day $i$, where $1 \le i \le 7$ (7 features per user action):
$$D^d_i = \frac{UA^d_i - UA^d_{i-1}}{UA^d_7}$$

2. User actions daily cumulative percentage ($DC^d_i$), or the fraction of user actions out of the total that occurred by the end of day $i$, where $1 \le i \le 7$ (7 features per user action):
$$DC^d_i = \frac{UA^d_i}{UA^d_7}$$

3. User actions daily increase ($DI^d_i$), or the proportion of increase in the number of user actions on day $i$ compared to day $i-1$, where $2 \le i \le 7$ (6 features per user action):
$$DI^d_i = \frac{UA^d_i - UA^d_{i-1}}{UA^d_{i-1} - UA^d_{i-2}}$$

4. User actions hourly increase ($HI^h_j$), or the proportion of increase in the number of user actions during hour $j$ compared to those during hour $j-1$, where $2 \le j \le 168$ (167 features per user action):
$$HI^h_j = \frac{UA^h_j - UA^h_{j-1}}{UA^h_{j-1} - UA^h_{j-2}}$$

5.–6. User actions majority share time ($MST_T$), or the first hour by which a share $T$ of all user actions has accumulated, where $T$ (one of $\{0.5, 0.7, 0.9\}$) is the majority share (3 features per user action):
$$MST_T = \min\{\, j : UA^h_j \ge T \cdot UA^h_{168} \,\}$$

7. User actions peak delay interval ($PDI$), or the number of hours leading to the hour with the highest concentration of user actions (1 feature per user action):
$$PDI = \arg\max_{j} \left( UA^h_j - UA^h_{j-1} \right)$$

8. User actions alive interval length ($AI$), or the last hour up to which user actions were recorded (1 feature per user action):
$$AI = \max\{\, j : UA^h_j > UA^h_{j-1} \,\}$$

9. User actions peak share ($PS$), or the number of user actions during the peak hour divided by the total (1 feature per user action):
$$PS = \frac{\max_j \left( UA^h_j - UA^h_{j-1} \right)}{UA^h_{168}}$$

Most of the attention received by the videos in our corpus is concentrated in the first day after a video has been published, and thus we monitor the attention during this period more closely. Besides daily and hourly, we look at six additional periods during the first day, as depicted in Figure 1. Over these periods, we extract the following features: (i) Percentage of User Actions per Period, (ii) User Actions per Period Increase, and (iii) User Actions Average Hourly Increase for a Period. This yields 18 additional features and a total of 218 features per user action.

To model the opinion of the users regarding the videos, we also use features derived from ratios of different user action types:
• Positive Reactions: the ratio between the number of likes and the number of views;
• Negative Reactions: the ratio between the number of dislikes and the number of views;
• Engagement: the ratio between the number of comments and the number of views;
• Controversiality: the ratio between the number of likes and the sum of the numbers of likes and dislikes.

For each of these ratios, we calculate a set of features that show how the numbers change daily, similarly to the User Actions Daily Percentages (7 features per ratio) and the User Actions Daily Cumulative Percentages (6 features per ratio) features. We further generate ratio features for the more granular first-day periods (6 features per ratio), which yields a total of 19 ratio-driven features. Overall, we have 952 attention features per video.
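To make the definitions concrete, here is a minimal sketch of a few of the feature groups above, computed from the cumulative hourly counts for one user action type. The function name is ours, and the formulas follow our reading of the definitions, not necessarily the authors' exact implementation:

```python
import numpy as np

def attention_features(ua_h):
    """A few attention features for one user action type.

    ua_h: array of length 168 with the cumulative counts UA^h_j at the end
    of each hour j of the first week; assumes at least one action occurred,
    so ua_h[-1] > 0. All names are illustrative assumptions.
    """
    total = ua_h[-1]
    per_hour = np.diff(ua_h, prepend=0)   # actions occurring during each hour
    ua_d = ua_h[23::24]                   # cumulative counts at day boundaries

    feats = {}
    feats["daily_pct"] = np.diff(ua_d, prepend=0) / total        # D^d_i
    feats["daily_cum_pct"] = ua_d / total                        # DC^d_i
    for t in (0.5, 0.7, 0.9):                                    # MST_T
        feats[f"majority_share_time_{t}"] = int(np.argmax(ua_h >= t * total)) + 1
    feats["peak_delay"] = int(np.argmax(per_hour)) + 1           # PDI
    feats["alive_interval"] = int(per_hour.nonzero()[0].max()) + 1  # AI
    feats["peak_share"] = per_hour.max() / total                 # PS
    return feats
```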
Our architecture comprises two consecutive classification steps: (i) for YouTube videos, and (ii) for YouTube channels. As we want to make use of the features derived from the YouTube videos, we labelled each video with the factuality score of the channel that published it, i.e., using distant supervision. Thus, the video classifier learns to predict the factuality labels projected from the corresponding channels.

Naturally, not all videos published by a low-factuality channel necessarily contain disinformation. Yet, this is not a problem, since we do not aim at classifying individual videos correctly, but rather at detecting factuality-related patterns, which are then used at the channel level: our channel classifier uses the predictions of the video-level classifier to predict the factuality of channels.

For each video, we have 768 features from the sentence-level BERT representation. We calculated these features once for the title and once for the description of the video, obtaining a total of 1,536 textual features. We further have 952 attention-driven features per video. To validate the relevance of these features with respect to our classification task, we applied a set of feature selection methods over the training split of our dataset, namely ANOVA, Pearson correlation, and Spearman correlation. According to these methods, the ratio features turn out to be the most relevant ones. We selected the top 100 features according to each method, which yielded a union of 124 attention-based video-level features that we used in our classification experiments. Combined with the 1,536 textual features, this yields a total of 1,660 features per video.

For the second, channel-level classifier, we generate the following groups of features:
1. YouTube statistics (13 features in total), including popularity statistics (7 features);
2. Aggregated predictions of the video-level classifier:
• the maximum probability across the videos for each factuality label;
• the average probability across the videos for each factuality label;
• factuality distribution percentages: for each factuality label, the percentage of videos predicted to have that label.

We train two subsequent classifiers for factuality prediction: one for videos and one for channels. We conduct experiments with different models, and we compare them to a majority-class baseline. We evaluate the models in terms of accuracy, balanced accuracy, and mean absolute error (MAE). MAE is a more relevant measure in our case, as it takes the ordering of the labels into account: confusing high factuality with mixed factuality is a smaller error than confusing it with low factuality.

While our ultimate goal is to classify channels, we start with video classification, then we aggregate the predictions and use them to make predictions at the channel level. We divide the dataset into training, development, and test splits at the channel level. Then, for the video-level experiments, we use for training/development/testing the videos of the respective channels. Note that this guarantees that all videos of a given channel go into the same split.

Below, we report results using Gradient Boosted Decision Trees (GBDT). We also experimented with logistic regression, ordinal logistic regression, and SVMs with various kernels, but they performed worse. We trained separate models (a) using the textual representation, and (b) using the user attention cycles.

Note that our dataset is not well balanced and has very few examples of low-factuality videos and channels. To mitigate this, we apply oversampling using SMOTE (Chawla et al., 2002), which generates additional synthetic examples. Moreover, it is important that the video classifier generate predictions for the low-factuality and the mixed-factuality classes; otherwise, the predictions for these classes could be lost when aggregating for the channel classification. Thus, we also report balanced accuracy, as it is important when choosing which video-level models to use for the aggregation.
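To make the video-level training and the channel-level aggregation concrete, here is a minimal sketch assuming scikit-learn and imbalanced-learn as plausible stand-ins for the authors' implementation; the variable names and the default hyperparameters are illustrative assumptions:

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import GradientBoostingClassifier

# X_train: (n_videos, 1660) combined textual + attention features;
# y_train: per-video labels projected from the channels via distant
# supervision (e.g., 0 = low, 1 = mixed, 2 = high). Names are illustrative.
def train_video_classifier(X_train, y_train):
    # Oversample the rare low/mixed classes with synthetic examples.
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
    clf = GradientBoostingClassifier()
    clf.fit(X_res, y_res)
    return clf

def channel_features(clf, channel_videos_X):
    """Aggregate video-level posteriors into the channel-level features
    described above: per-class maximum, per-class average, and the
    distribution of predicted labels across the channel's videos."""
    proba = clf.predict_proba(channel_videos_X)          # (n_videos, 3)
    label_dist = np.bincount(proba.argmax(axis=1), minlength=3) / len(proba)
    return np.concatenate([proba.max(axis=0), proba.mean(axis=0), label_dist])
```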
We experimented with several approaches for channel classification:
• using aggregated video-level features to obtain a channel-level representation;
• using the posterior probabilities of the video-level classifiers;
• using the previous two together;
• using an ensemble of different channel-level classifiers.

For the ensemble aggregation, we experimented with three methods for choosing the most likely class for a given channel (a minimal sketch of these rules is given at the end of this section):
• averaging the predictions of the various models (mean);
• taking the maximum-probability prediction across the various models (max);
• taking the minimum-probability prediction across the various models (min), which tells us which class is least likely to be wrong.

The results are shown in Table 3. All experiments use GBDT, except for experiment 8, which uses ordinal logistic regression. Below, we analyze the results and perform an ablation study.

We can see in Table 3 that all models improve over the majority-class baseline by a sizable margin. We further see that using averaged information on user attention cycles (line 4) performs better than using textual features (lines 1–3). Moreover, combining the two yields the best result (line 8). The other two sets of user attention features, channel statistics and aggregated video-classifier predictions, do not contribute to the combined user attention model (compare line 4 to line 6).

The relative improvements over the majority-class baseline in terms of accuracy are generally smaller than those in terms of MAE, which can be explained by class imbalance. To improve accuracy, the models need to learn to assign the mixed and the low factuality labels properly (as the majority class is high factuality), but most of the trained models undervalue the underrepresented classes. When balancing techniques are applied, the models recognize the low and the mixed examples better, but at the cost of false positives for these classes coming from the high-factuality examples, which decreases the overall accuracy. In contrast, MAE rewards models that improve on the small classes, even at the risk of introducing some errors for the majority class.

As our focus is on attention cycles, we performed an ablation study of these features against the combined user attention channel model. The results are shown in Table 4. We can see that ratio features such as controversiality and positive reactions alone yield the best accuracy and MAE. Using the predictions of the video-level classifier as features yields the best balanced accuracy. This confirms the importance of having accurate low- and mixed-factuality predictions from the video classifier prior to the aggregation. Finally, the overall best results, when considering all measures, are achieved when combining all features.
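The three ensemble rules described earlier in this section can be stated compactly in code. The sketch below is our illustration, not the authors' implementation; it assumes the per-class posteriors of the individual channel-level models have been stacked into one array:

```python
import numpy as np

# probas: array of shape (n_models, n_channels, n_classes) stacking the
# posterior probabilities of several channel-level models. Names are ours.
def ensemble_predict(probas, method="mean"):
    if method == "mean":     # average the predicted distributions
        combined = probas.mean(axis=0)
    elif method == "max":    # take the most confident model per class
        combined = probas.max(axis=0)
    elif method == "min":    # the class least likely to be wrong
        combined = probas.min(axis=0)
    else:
        raise ValueError(f"unknown method: {method}")
    return combined.argmax(axis=1)   # predicted class per channel
```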
We proposed a novel framework for predicting the factuality of reporting of news outlets by studying the user attention cycles in their respective YouTube channels. We designed a rich set of features derived from the temporal evolution of the number of views, likes, dislikes, and comments for a video, which we then aggregated at the channel level. Our experiments demonstrated both complementarity and sizable improvements over state-of-the-art textual representations. We further developed and released a dataset containing observations of user attention on the YouTube channels of 489 news media, and we hope that it will enable future research on using data from video-sharing platforms.

In future work, we plan to address the class imbalance of the dataset by extending it with more examples. We further want to integrate additional features based on the comments on the videos, as well as on other information sources, such as Twitter and Wikipedia. Finally, we plan to study the utility of user attention cycles for other related tasks, such as political ideology detection for news media.

Data Collection. Our dataset was collected from YouTube using its public API.

User Privacy. Our dataset contains aggregated attention statistics without any user data.

Biases. Any biases found in the dataset are unintentional, and we do not intend to do harm to any group or individual.

Intended Use and Misuse Potential. Our dataset and the proposed model can enable the development of systems for automatic detection of reliable/unreliable YouTube channels, which could support media literacy, as well as analysis and decision making for the public good. However, they could also be misused by malicious actors.

Environmental Impact. Finally, we would also like to warn that the use of large-scale Transformers requires a lot of computation, including GPU/TPU training, which contributes to global warming (Strubell et al., 2019).

Acknowledgments. This research is part of the Tanbih mega-project, developed at the Qatar Computing Research Institute, HBKU, which aims to limit the impact of "fake news", propaganda, and media bias by making users aware of what they are reading, thus promoting media literacy and critical thinking. This research is also partially supported by Project UNITe BG05M2OP001-1.001-0004, funded by the OP "Science and Education for Smart Growth" and co-funded by the EU through the ESI Funds.

References
Baly et al. (2018). Predicting factuality of reporting and bias of news media sources.
Baly et al. (2020). What was written vs. who read it: News media profiling using text analysis and social media context.
Chawla et al. (2002). SMOTE: Synthetic minority over-sampling technique.
Dinkov et al. (2019). Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information.
Downs (1972). Up and down with ecology: The "issue-attention cycle".
Giachanou et al. (2019). Leveraging emotional signals for credibility detection.
Gorrell et al. (2018). Quantifying media influence and partisan attention on Twitter during the UK EU referendum.
Hoang et al. (2011). On modeling virality of Twitter content.
Horne and Adalı (2017). This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news.
Leskovec et al. (2009). Meme-tracking and the dynamics of the news cycle.
McLuhan (1964). Understanding media: The extensions of man.
Munn (2019). Alt-right pipeline: Individual journeys to extremism online.
Nguyen et al. (2020). FANG: Leveraging social context for fake news detection using graph representation.
Reimers and Gurevych (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks.
Ribeiro et al. (2020). Auditing radicalization pathways on YouTube.
Rubin et al. (2016). Fake news or truth? Using satirical cues to detect potentially misleading news.
Ruchansky et al. (2017). CSI: A hybrid deep model for fake news detection.
Strubell et al. (2019). Energy and policy considerations for deep learning in NLP.
Venturini (2019). From fake to junk news, the data politics of online virality.
Zhao et al. (2015). Enquiring minds: Early detection of rumors in social media from enquiry posts.