title: Clickbait Detection in YouTube Videos
authors: Gothankar, Ruchira; Troia, Fabio Di; Stamp, Mark
date: 2021-07-26

YouTube videos often include captivating descriptions and intriguing thumbnails designed to increase the number of views, and thereby increase the revenue for the person who posted the video. This creates an incentive for people to post clickbait videos, in which the content might deviate significantly from the title, description, or thumbnail. In effect, users are tricked into clicking on clickbait videos. In this research, we consider the challenging problem of detecting clickbait YouTube videos. We experiment with multiple state-of-the-art machine learning techniques using a variety of textual features.

Today, web content is increasingly popular and people rely on information obtained from the internet. Furthermore, with the diversity of available resources, the amount of time spent on the internet has increased. Many platforms provide a medium where virtually anyone can publish information that is accessible to a large number of people. However, the credibility of such information is not guaranteed. Online sources of information include blogs, video sharing platforms, and social media, among others. Many of these applications have been developed with the main intent to generate revenue. Hence, unscrupulous people can use false information to increase their viewership and increase their revenue.

Clickbait is false and deceptive information that lures users to click a link, watch a video, or read an article. It aims to exploit the user's curiosity by providing misleading, though captivating, information. Clickbait has become a marketing tool in many sectors to entice users and thereby to generate revenue.
Publishing eye-catching information to manipulate and trick users is a common practice to increase viewership and spread brand awareness. Clickbait can be an image, a sensational headline, or misleading video or audio content. While clickbait sources help in gaining attention, there are many disadvantages and negative ramifications. In fact, clickbait not only wastes the time of viewers, but also affects the trustworthiness of the underlying platform [25].

YouTube is a video publishing platform where users upload videos and share them with others. When uploading a video, the user adds a title, a description, and a thumbnail. Other users then view the title and thumbnail before deciding whether to view the video. Hence, these data become crucial parameters on which users base their decision to watch a video or not. For this reason, many YouTube content creators (aka YouTubers) use clickbait titles and thumbnails that might deviate from the actual content to increase viewership for a video, and thereby generate more revenue. A recent example is the COVID-19 pandemic, during which individuals have posted misleading health-related content, including some fake cures for COVID-19. Some other common examples of clickbait are video titles such as "You'll Never Believe What Happened Next. . .", "The 10 documentaries you should watch before you die", "You Can Now Travel Abroad Without Having to. . .", "You Won't Believe. . ." and so on [13]. Figure 1 shows an example of a clickbait video on YouTube.

Figure 1: Clickbait video example [30]

The clickbait problem is somewhat similar to that of spam detection. Spam, which is unsolicited email, often includes misleading messages that are sent to deceive users by redirecting them to websites for the purpose of advertising or attack. Therefore, considerable research has been focused on detecting spam. In this research, we are concerned with detecting clickbait YouTube videos.
The YouTube platform relies on users to manually flag suspected malicious or clickbait content. However, a more automated approach would clearly be desirable. We consider machine learning and deep learning based solutions to the clickbait detection problem.

The remainder of this paper is organized as follows. Section 2 considers relevant previous work and background topics related to natural language processing (NLP). In Section 3, we discuss our experimental setup, including the datasets used. Section 4 contains our experimental results and our analysis of these results. In Section 5, we give our conclusions and we discuss possible directions for future work.

This section discusses relevant work done in this field. We mainly focus on clickbait detection, fake news detection, image forgery detection, and hoax detection. Apart from these topics, we also discuss advancements in natural language processing (NLP).

Clickbait is a way to attract the attention of users by luring them to access specific content. However, misleading information is present on the internet in multiple forms, and the terms for these forms are often used interchangeably in different contexts. For example, a hoax is the spreading of false stories about, say, a celebrity death [30], while an example of a forgery is an image that suggests false information. We now discuss and analyze the performance of previous work on clickbait, fake news, forgery, and hoax detection.

In 2016, Chakraborty et al. [4] implemented an ML classifier to detect clickbait. They also created a browser extension to help readers navigate around clickbait. They used headlines from the Wiki-news corpus, taking 18,513 articles as legitimate posts. For the clickbait posts, they used articles from popular domains containing illegitimate content. To train their classifier, they used a set of 14 features spanning linguistic analysis, word patterns, and n-grams. They achieved an accuracy of about 89% using a support vector machine (SVM) classifier.
Elyashar et al. [8] developed an approach focused on feature engineering. Their work focused on detecting clickbait posts in online social media. They performed linguistic analysis using a machine learning classifier which could differentiate between legitimate and illegitimate posts. The dataset used for analysis was provided by the 2017 Clickbait Challenge [20]. The results of their experiments suggest that malicious content tends to be longer than benign content. They also concluded that the title of the post plays an important role in identifying clickbait.

Glenski et al. [11] developed a linguistically infused neural network model to detect fake tweets. This model, which is based on long short-term memory (LSTM) and convolutional neural networks (CNN), used the text of tweets, images, and descriptions for training. Furthermore, the pretrained embedding model GloVe was used as the embedding layer. They achieved an accuracy of 82%. Zhou [36] proposed a self-attentive neural network model using gated recurrent units (GRU) for predicting fake tweets. They performed multi-class classification using the annotation scheme. As proof of the success of their approach, they ranked first in the Clickbait Challenge 2017 with an F-score of 0.683.

Fake news is a type of misinformation that has received considerable attention in recent years. The main idea is to analyze the text content of a news item to check whether the statements are valid or not. Ahmad et al. [2] implemented an ensemble model based on the linguistic features of the text which involved a combination of multiple machine learning algorithms, namely, random forest, multilayer perceptron, and support vector machine (SVM), to detect fake news. They used XGBoost as an ensemble learner, achieving an accuracy of 92%. Thota et al. [28] presented a paper on detecting fake news using natural language processing. They used TF-IDF and Word2Vec with a dense neural network based on the news headline.
In another paper on fake news detection, Jwa et al. [14] implemented a model using bidirectional encoder representations from transformers (BERT). The deep contextualizing nature of BERT has yielded strong results, including the ability to determine the relationship between the headline and the body of a news article. As the name suggests, image forgery detection consists of trying to detect malicious information that is conveyed through images. In 2018, Zhang et al. [35] developed a "fauxtography" detector which could detect images which are misleading on social media platforms. Palod et al. [19] passed pretrained Word2Vec comment embeddings through an LSTM network to generate a "fakeness" vector, and achieved an F-score of 0.82. Shang et al. [25] proposed a model that involved network feature extraction, metadata feature extraction, and linguistic feature extraction to detect clickbait in YouTube videos. The network feature extraction used comments in the videos and extracted semantic features. In the linguistic feature extraction, they relied on document embedding for comments using Doc2Vec, and they also employed a metadata module. In 2019, Reddy et al. [22] implemented a model using word embedding and trained on a support vector machine (SVM). In [7] , Dong et al. have proposed a "deep similarity-aware attentive model" that focuses on the relation between the titles that are misleading and the target content. This method was quite different from traditional feature engineering and seemed to work reasonably well. In [24] , Setlur considered a semi-supervised confidence network along with a gated attention based network. Based on a small labeled dataset, this method gave promising results. In many of the above approaches, only the textual information given by the title and the description, along with the metadata features, have been taken into consideration while training a model. 
An exception is the work in [25], where the authors have also used comments to extract features. It is also worth noting that the embedding layers of Word2Vec, BERT, and Doc2Vec have been used in all of the implementations mentioned above. In this research, we experiment with multiple embedding layers, including BERT, DistilBERT, and Word2Vec. In previous research, BERT has proven to be effective because of its deep contextualizing nature [14]. A combination of multiple models, known as ensemble learning, has given interesting results in [28], and we also consider ensemble models in the form of random forest classifiers.

Articles in which facts are knowingly misrepresented can be viewed as hoaxes. These reports provide deceptive information to readers and present it as legitimate fact. One such example is a fake story about a celebrity death. In [27], the authors have proposed a technique that uses logistic regression for classifying hoaxes. In the proposed model, they have used features based on user interaction and have achieved an accuracy of 99%. Zaman et al. [33] employed a naïve Bayes algorithm which uses the feedback from users as an input to verify whether a news item is a hoax. Kumar et al. [16] have proposed a method which uses a random forest classifier to classify the credibility of articles on Wikipedia. They achieved an accuracy of 92%. Hoax detection is, however, a less explored area than the topics discussed above.

Natural language processing (NLP) is the ability of a machine to process and understand human language. It is used to solve many real-world problems, such as machine translation, question answering, and predicting words. Figure 2 shows a timeline of some recent advances in NLP. In the early 1990s, statistical and probabilistic approaches were employed to train NLP algorithms. However, with the arrival of the Web, the amount of data grew considerably, and such algorithms became inadequate. In 2001, Bengio et al.
experimented with feedforward neural networks. Later, recurrent neural networks (RNN) and long short-term memory (LSTM) models were introduced [12]. As of 2012, techniques such as latent semantic indexing (LSI), latent semantic analysis (LSA), and support vector machines (SVM) became popular in the NLP domain. Part of speech (POS) tagging is a commonly used approach [21].

In 2013, Mikolov et al. introduced Word2Vec, which is used to generate vector representations of words. These embeddings are obtained from the weights of a relatively simple neural network, and the vectors can capture important semantic information, based on the cosine distance between Word2Vec embeddings [23]. Global vectors for word representation (GloVe) was introduced in 2014 and is an attempt to combine the benefits of LSA, LSI, and Word2Vec. It is based on the occurrence of a word in the entire corpus.

CNNs and LSTMs have become popular for NLP related tasks in recent years, as such models can effectively utilize sequential information [12]. LSTM is a highly specialized type of RNN that mitigates the gradient issues that occur with plain vanilla RNNs. The gated recurrent unit (GRU) is a variant of LSTM introduced in 2014 that is lighter, in the sense of having fewer parameters that need to be trained. Sutskever et al. [26] proposed a sequence-to-sequence learning approach which uses an encoder-decoder architecture. In fact, such encoder-decoder models appear to be the main language modeling frameworks for NLP tasks today.

The concept of an attention mechanism was proposed by Bahdanau et al. [3] in 2015 to overcome the limitation of fixed vector length for input sentences in sequence-to-sequence models [31]. Attention provides information about the importance of a part of a sentence during the decision process. To better deal with the inherent complexity of attention mechanisms, transformers were introduced [18].
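As a rough illustration of how attention assigns importance to parts of an input, the following minimal sketch computes attention weights for a single query using the scaled dot-product form associated with transformers [18], rather than the additive form of [3]. The function names `softmax` and `attend` and the toy vectors are assumptions made for illustration only; they are not taken from the models used in this research.

```python
import math


def softmax(scores):
    """Convert raw scores to weights that are positive and sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Each key is compared with the query, the similarities are turned into
    weights, and the output is the weighted average of the value vectors.
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    context = [
        sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)
    ]
    return weights, context


# Toy example: the query is aligned with the first key, so the first
# value vector should dominate the output.
weights, context = attend(
    [1.0, 0.0],
    [[1.0, 0.0], [0.0, 1.0]],
    [[10.0, 0.0], [0.0, 10.0]],
)
```

In a real transformer the queries, keys, and values are learned linear projections of the token embeddings, and several such attention heads run in parallel.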
A transformer includes multiple stacks of encoder-decoder architecture, where at each step in the processing, the model takes the output of the previous step as an input. Figure 3 shows the architecture of a transformer, where the decoder is on the right and the encoder is on the left. Initially, the input tokens are converted to embedding vectors. Since this model does not have any RNN units, position indices are stored in a d-dimensional vector space in the form of embeddings. There are three fully connected layers in this particular attention mechanism, namely, the input key K, the value V, and the query Q, which is a matrix of queries. The algorithm defines weights for words based on all the words in the input, and it generates a vector representation for all words based on multi-head attention [18]. The other processes include context fragmentation and multiple parallel attention layers. Some examples of deep learning models that make use of transformers include BERT, RoBERTa, mBERT, and DistilBERT.

Figure 3: Architecture of a transformer [18]

Bidirectional encoder representations from transformers (BERT) uses a transformer, which is based on attention, to learn the contextual relations between words. The transformer involves an encoder, which reads the input, and a decoder, which predicts the output. BERT is called bidirectional because, instead of reading input sequentially from a specific direction, the transformer reads the sequence of words in both directions. This helps in learning the context of words based on previous and subsequent words. Figure 4 illustrates the input pattern used in a BERT model. BERT has four pretrained versions with different layers, hidden nodes, and parameters. Each of these BERT models can be fine-tuned for a specific task by adding additional layers. DistilBERT is a lighter and faster variant of BERT.

In this section, we discuss the various machine learning techniques that we have employed in this research.
Specifically, we have performed experiments for YouTube clickbait detection based on logistic regression, random forest, and MLP, with various embedding mechanisms.

Figure 4: Input for the BERT model [15]

Logistic regression is a supervised learning algorithm that is used when the output, which depends upon the input features, is a categorical prediction. In logistic regression, a sigmoid function is fitted to the data. The formula for the sigmoid function is $\sigma(x) = 1 / (1 + e^{-x})$, which produces a value in the range of 0 to 1, and hence it can be interpreted as a probability. The clickbait detection problem can be treated as a type of binomial logistic regression, where the output can be either zero or one [1].

A random forest is based on simple decision trees, where a large group of decision trees operate together in an ensemble-like manner. Each tree is trained on a subset of the data and features, a process known as bootstrap aggregation, or bagging. In bagging, the data for each tree is randomly selected with replacement [32]. The final prediction of the random forest can be obtained via a simple voting scheme. A random forest mitigates the tendency of individual decision trees to overfit the training data. The important hyperparameters in a random forest are n-estimators, n-jobs, max-features, and min-sample-leaf. The n-estimators parameter represents the number of trees that are constructed. Typically, adding more trees increases performance at the cost of computation time. The max-features parameter is the number of features considered when splitting at a specific node. The parameter n-jobs is the number of processors that work in parallel.

A multilayer perceptron (MLP) is a basic type of feedforward neural network that includes input and output layers, along with at least one hidden layer. An MLP with two hidden layers is illustrated in Figure 5. The output layer of an MLP can be used for prediction or classification.
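The two core ideas above, the sigmoid used in logistic regression and the bagging-plus-voting scheme of a random forest, can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names (`sigmoid`, `predict_clickbait`, `bootstrap_sample`, `majority_vote`), the 0.5 decision threshold, and the toy weights are assumptions, not the implementation used in our experiments.

```python
import math
import random
from collections import Counter


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_clickbait(weights, bias, features, threshold=0.5):
    """Binomial logistic regression: probability and 0/1 label.

    The weights and bias are assumed to have been learned already.
    """
    score = sum(w * f for w, f in zip(weights, features)) + bias
    probability = sigmoid(score)
    return probability, int(probability >= threshold)


def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Combine the class predictions of the individual trees."""
    return Counter(votes).most_common(1)[0][0]


# Hypothetical usage with made-up numbers.
probability, label = predict_clickbait([0.5, -0.25], 0.0, [2.0, 4.0])
sample = bootstrap_sample(list(range(10)), random.Random(0))
forest_label = majority_vote([1, 0, 1])
```

In practice each bootstrap sample would be used to fit one decision tree, and `majority_vote` would be applied to the per-tree predictions for a given video.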
Next, we briefly discuss regularization and activation functions; see [9] for additional details on these and related topics. Neural network models are prone to overfitting. An overfitted model is very effective in classifying the training data but obtains poor accuracy in predicting the test data. In effect, the model has "memorized" the training data, rather than learning from the training data. One useful technique to prevent overfitting is the use of dropouts, where some number of nodes are ignored during various training steps [9]. This simple approach forces nodes that would otherwise atrophy to become active in the learning process. An activation function is used to determine the output of a node in a neural network. There are multiple types of activation functions, including tanh, sigmoid, ReLU, and leaky ReLU [9]. In this research, we have experimented with ReLU and tanh.

This section includes details on the implementation used in this research. We discuss the setup used to train and execute the various machine learning models, the experimental design, and so on. In this research, we used multiple Conda virtual environments for each implementation. Conda is an open source package and environment management system which runs on multiple operating systems [5]. The host machine was configured as given in Table 1. All the training and the experiments were run on the host machine.

Our clickbait detection experiments are based on a set of labeled videos. The problem is formulated as a binary classification problem where, for each video, a machine learning algorithm classifies it as clickbait or non-clickbait. The information from multiple sources (e.g., title, description, comments) is combined and fed to the classification model. The performance is evaluated and analyzed using multiple measures, specifically, precision, recall, and the F-score. There are three types of features considered in this research.
The first involves features from the profile of the user who posted the video (subscriptions count, views count, and videos count). The second type of feature is based on extracting textual information from the video (title and description). The third component involves statistical features related to the video (like count, dislike count, like-dislike ratio, views, and number of comments). A classification model performs binary classification (clickbait or non-clickbait) based on some combination of these features. An overview is provided in Figure 6.

Features that provide information regarding the reputation of the channel and the videos include the number of subscribers of the YouTube channel, the number of likes or upvotes, and the age of the channel. These statistical features represent the response of viewers to the channel. Previous related work claims that clickbait videos tend to have a relatively small number of subscribers and likes [30]. Usually, the number of views for clickbait and non-clickbait videos is quite similar [34]. Useful information in determining the credibility of a video is given by the dislike ratio, the favorites count, the video age, the views count, and the comments count. Sometimes, the uploader of a clickbait video disables the comment section. This itself provides clues about the video [30].

Textual features include the headline of the video, the description of the video, and the comments by the viewers. YouTubers who upload clickbait usually employ deceptive techniques. They use catchy and exaggerated phrases for the title and description of the video. Some common phrases are "viral", "top", "won't believe", "epic", and similar. We tokenize the text and embed it in classification models using various embedding techniques, including Word2Vec, BERT, and DistilBERT.

Every month, billions of people visit YouTube and the videos are watched for over a billion hours.
A large number of videos are also uploaded by the users. In fact, YouTube is a platform where people can generate revenue by uploading videos and gaining viewership for their videos. In this research, the evaluation is done on a dataset of 8219 labeled videos, where 4300 are non-clickbait and 3919 are clickbait. The dataset was crawled using the Google YouTube API for the list of video IDs fetched from a GitHub source [17]. These sources were randomly picked and manually verified. The statistics for various parameters are shown in Table 2.

In this research, we experimented with multiple techniques, including multiple language modeling techniques. We used Word2Vec, BERT, and DistilBERT for word embeddings. The architecture for the individual models is also shown. A grid search was used for training and building the models to obtain the best set of parameters. In this section, we briefly describe each of our models, and in the next section we give the results for each experiment.

In this experiment, we used a Word2Vec model provided by Gensim [10] to generate the vector representations of words in the dataset. A logistic regression model is trained on these embeddings along with additional features, specifically, comments count, likes count, dislikes count, and subscriptions count for the channel.

In this experiment, a random forest classifier is trained on the Word2Vec embeddings. We again used the Word2Vec model provided by Gensim. The values tested for n-estimators are 10, 20, 30, 50, and 100. The input features are title, description, and metadata features such as comments count, likes count, dislikes count, and subscriptions count for the channel.

In this case, we again use the Word2Vec model provided by Gensim. The embedding for title and description is concatenated with the metadata features of the video and is fed to an MLP for classification. The batch size is 10 for 40 epochs. The activation functions used are ReLU and sigmoid.
Figure 11 in the appendix provides the overall architecture of the model. Note that we use two input embedding layers for textual data (namely, title and description), which are then concatenated together. After this step, the output from the dense layer is flattened and concatenated with the input for the metadata features. Finally, a fully connected layer is used for classification.

This experiment is an optimization of the previous experiment. In this model, additional dense layers, along with batch normalization and a dropout rate of 0.5, are employed. We have used parametric rectified linear units (PReLU) as the activation function. The batch size is again 10 for 40 epochs. Figure 12 in the appendix illustrates the overall architecture of the model. In this model, the output from the embedding layers for the textual data is concatenated, followed by a fully connected dense layer, batch normalization, and activation. This output is finally flattened and concatenated with the metadata features.

In this experiment, we have used BERT embeddings for the title and description of the video. The advantage of using BERT as an embedding model is that it provides a context-based representation for each word in a sentence. In contrast, Word2Vec provides representations which are fixed irrespective of where the word is used in the sentence. The pretrained BERT model that is used in this experiment has 12 layers, 110M parameters, and 768 hidden units. The BERT tokenizer is used to split the words into tokens, and attention masks are used for padding. A mask value of one is for tokens that are not masked, while the value zero means that the token is added by padding and should not be considered for attention. The model uses the Adam optimizer with a batch size of 10 for 5 epochs, and we have used sequences of length 180 for this experiment. Figure 13 in the appendix shows the model architecture.
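The role of the attention mask described above can be illustrated with a small, self-contained sketch. The function `pad_with_attention_mask`, the padding ID of 0, and the token IDs in the usage lines are assumptions made for illustration; an actual BERT tokenizer also inserts special tokens and handles truncation more carefully than this simplified version.

```python
def pad_with_attention_mask(token_ids, max_len, pad_id=0):
    """Pad (or truncate) a token ID sequence to a fixed length.

    Returns the padded IDs and a mask in which 1 marks a real token and
    0 marks a padding position that attention should ignore.
    """
    ids = list(token_ids)[:max_len]      # simple truncation if too long
    mask = [1] * len(ids)                # real tokens are attended to
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


# Hypothetical token IDs for a short title, padded to length 5.
padded_ids, attention_mask = pad_with_attention_mask([101, 2023, 102], 5)
# padded_ids     -> [101, 2023, 102, 0, 0]
# attention_mask -> [1, 1, 1, 0, 0]
```

With a sequence length of 180, as in this experiment, every title and description would be brought to 180 positions in this way before being passed to the model.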
Note that the output of the BERT embedding layer is followed by a dense layer, which is then concatenated with the metadata features. After this, a dropout layer followed by a fully connected layer is used for classifying the data.

DistilBERT is a faster, lighter model that is a variant of BERT; it runs 60% faster and has 40% fewer parameters than BERT [6]. For this experiment, we have used a pretrained DistilBERT model. The embeddings for title and description are fed into an MLP and, later, concatenated with the metadata features of the video and the YouTube channel. The model uses the Adam optimizer and the batch size is 10 for 5 epochs. Figure 14 in the appendix gives details on this model architecture. Note that in this model, the input from the metadata features is concatenated. Of course, the output layer is a dense layer that is used for classification.

Recall that in experiment I, we have used logistic regression with Word2Vec embeddings for the features title and description, along with the metadata features. In this case, we achieve an accuracy of 52% with just the title as input, and an accuracy of 70% with all of these features. This model is fast to train and simple to implement. Figure 7 shows the ROC curve for this logistic regression model.

Experiment II involves using a random forest classifier based on the title, description, likes count, dislikes count, comments count, and subscriptions count. We used Word2Vec embeddings for title and description. We trained this model on multiple sets of inputs. The first set of inputs includes just the title and metadata features. The last set of inputs included all the features. Not surprisingly, we find that the accuracy improves as more features are added. Table 3 reports the precision, along with an accuracy of 80.1%, for the model with the first set of input features, that is, title and two metadata features for likes count and dislikes count.
Table 4 shows the report for this experiment when we use the title and all the metadata as features. The accuracy for this experiment is 92.5%. The report shows the precision and recall of the model in classifying clickbait and non-clickbait videos. The model performs slightly better in classifying non-clickbait videos. Figure 8 shows the ROC curve for the random forest model where the input features included title, description, and all the metadata features, that is, likes count, dislikes count, comments count, and subscriptions count.

In experiment III, a simple MLP is used for classification, based on Word2Vec embeddings for title and description that are concatenated with metadata features. In this case, the test accuracy is observed to fluctuate during the training process, but the best average accuracy achieved is better than 91%. Figure 9 (a) shows the accuracy for this experiment over the 30 training epochs.

In experiment IV, a modified MLP is used with batch normalization and PReLU as an activation function. In this case, the accuracy is slightly worse than in experiment III, although the training is more stable, as can be observed in Figure 9 (b).

In experiment V, we have used a transfer learning model based on BERT for word embeddings. This experiment with BERT gives an accuracy of 94.5%. In this experiment, the length of the input sequence is fixed at 180 tokens. Figure 9 (c) shows the plot of accuracy over training epochs for both the train and validation sets. Note that the number of epochs is small due to the extended training time required, as compared to the other models considered.

In experiment VI, we have used a lighter variant of the BERT model for the word embeddings, namely, DistilBERT. The accuracy achieved in this case is around 92%. This model is significantly faster to train than BERT, although the accuracy obtained with BERT is slightly better than with DistilBERT. Table 5 shows the precision and recall for experiment VI, while Figure 9 (d) shows the training and test accuracy over epochs.
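For readers less familiar with the measures reported in the tables above, the following sketch shows how precision, recall, and the F-score follow from the counts of true positives, false positives, and false negatives for the clickbait class. The function name `precision_recall_f1` and the counts in the usage line are hypothetical and are not results from our experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F-score from confusion counts.

    tp: clickbait videos correctly flagged as clickbait
    fp: non-clickbait videos wrongly flagged as clickbait
    fn: clickbait videos that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, purely to show the arithmetic.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

The F-score is the harmonic mean of precision and recall, so it is high only when both are high.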
In Figure 10 we summarize the results of our six experiments in terms of accuracy (to two decimal places). Note that in the bar graph in Figure 10, "MLP plus" is used to denote our MLP model that includes dropout and batch normalization. Also, the bars from left to right represent experiments I through VI, respectively.

The goal of this research was to utilize state-of-the-art techniques to classify YouTube videos as clickbait or non-clickbait. A YouTube video has multiple characteristics that can serve as useful features for such classification. We leverage three main types of such features, namely, user profile, video statistics, and textual data. In this research, multiple classification techniques were considered, including logistic regression, random forest, and MLP, and we employed Word2Vec, BERT, and DistilBERT as language models. The best accuracy was achieved using an MLP classifier based on BERT embeddings, but the more lightweight DistilBERT performed almost as well. We also confirmed that the accuracy of the models could be increased by adding more features.

For future work, more features can be included. For instance, the transcript of the video might contain useful information. For example, the "distance" between the transcript and the title could provide important insight, as the content of clickbait videos often differs significantly from the title. The network structure of the comments and replies, which represents the semantic features and attributes, can also be considered [25]. In this research, we experimented with BERT, Word2Vec, and DistilBERT for word embeddings. For future work, Doc2Vec embeddings could also be considered. We used a random forest classifier, and other ensemble techniques, such as XGBoost, could be considered. Furthermore, we can also experiment with state-of-the-art attentive language models, such as XLNet, which is supposed to be better than BERT for determining long-term dependencies [29].
Logistic regression simplified
Fake news detection using machine learning ensemble methods
Neural machine translation by jointly learning to align and translate
Stop clickbait: Detecting and preventing clickbaits in online news media
Similarity-aware deep attentive model for clickbait detection
Detecting clickbait in online social media: You won't believe how we did it
A simple overview of multilayer perceptron (MLP) deep learning
Gensim: Python framework for fast vector space modelling
Dustin Arendt, and Svitlana Volkova. Fishing for clickbaits in social images and texts with linguistically-infused neural network models
Generating sequences with recurrent neural networks
12 surprising examples of clickbait headlines that work
Automatic fake news detection model based on bidirectional encoder representations from transformers (BERT). Applied Sciences
BERT explained: A complete guide with theory and tutorial
Disinformation on the Web: Impact, characteristics, and detection of Wikipedia hoaxes
Misleading metadata detection on YouTube
The clickbait challenge 2017: Towards a regression model for clickbait strength
Clickbait detection using multimodel fusion and transfer learning
An efficient word embedded click-bait classification of YouTube titles using SVM
Semi-supervised confidence network aided gated attention based recurrent neural network for clickbait detection
Towards reliable online clickbait video detection: A content-agnostic approach
Sequence to sequence learning with neural networks
Some like it hoax: Automated fake news detection in social networks
Fake news detection: A deep learning approach
A survey of the state-of-the-art language models up to early 2020
A unified approach for detection of clickbait videos on YouTube using cognitive evidences
Attention is all you need
Towards data science: Understanding random forest
Kretawiweka Nuraga Sani, and Endah Purwanti. An Indonesian hoax news detection system using reader feedback and naïve Bayes algorithm. Cybernetics and Information Technologies
Kostantinos Papadamou, and Michael Sirivianos. The good, the bad and the bait: Detecting and characterizing clickbait on YouTube
Fauxbuster: A content-free fauxtography detector using social media comments
Clickbait detection in tweets using self-attentive network