key: cord-0473995-s4vllhdt authors: Fakhruzzaman, Muhammad N.; Jannah, Saidah Z.; Ningrum, Ratih A.; Fahmiyah, Indah title: Clickbait Headline Detection in Indonesian News Sites using Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) date: 2021-02-02 journal: nan DOI: nan sha: 61cefa924d62b4dee767c29786a74acde05b314d doc_id: 473995 cord_uid: s4vllhdt

Click counts are related to the amount of money that online advertisers pay to news sites. This business model forced some news sites to employ the dirty trick of click-baiting, i.e., using hyperbolic and enticing words, sometimes an unfinished sentence, in a headline to purposefully tease readers. Some Indonesian online news sites have also joined the clickbait party, which indirectly degrades the credibility of other, established news sites. A neural network with the pre-trained language model M-BERT as an embedding layer, combined with a 100-node hidden layer and topped with a sigmoid classifier, was trained to detect clickbait headlines. With a total of 6632 headlines as a training dataset, the classifier performed remarkably well. Evaluated with 5-fold cross-validation, it achieved an accuracy of 0.914, an F1-score of 0.914, a precision of 0.916, and a ROC-AUC of 0.92. The use of multilingual BERT for an Indonesian text classification task was tested and can be enhanced further. Future possibilities, societal impact, and limitations of clickbait detection are discussed.

Journalism has changed. Before the emergence of internet news, we bought newspapers because we were enticed by the headline on the front page, which usually led to the truth; not anymore. The emergence of online news outlets created a whole new scheme of making money in the journalism world: the online ad. With internet advertising, a single click means money, even though it is not as much as a newspaper sale or sponsor advertising money, as in the olden days.
Now, a post headline has to rake in engagement, the metric that measures ratings in online news. The scheme of online advertising based on engagement has a negative influence on the original idea of journalism. Sadly, online news organizations now hunt for click money instead of the truth (Chen et al., 2015). This phenomenon promotes a unique style of headline writing, infamously known as clickbait. The more people click the post, the more engagement that post has, and the more advertising value the site gains. A study found that most online news organizations rely on clickbait ad money to support their daily activities. With an increasing number of online news sites in recent years, they have to contest for readers' clicks (Chakraborty et al., 2016). What makes it worse is that some news sources that were once credible are also resorting to clickbaiting, further obscuring the integrity of Indonesian online news organizations, as a previous study found that the usage of clickbait worsens a news site's reputation (Hurst, 2016). Clickbait refers to a headline that contains hyperbolic words to persuade the reader to click the following link while mostly not revealing any major information. It may also contain a controversial message without disclosing complete information about it in the sentence (Potthast et al., 2016). Some clickbait headlines use trending buzzwords, but most of their links lead to complete misunderstanding. An example of a clickbait and a non-clickbait headline is depicted in Figure 1. The first headline, which is non-clickbait, translates to "21,084 Vehicles Ticketed Due to Inability to Show SIKM Jakarta". The second headline, which is clickbait, translates to "Trump Calls Putin While the George Floyd Protest Rages in the USA, What Did They Talk About?". The example shows a clear distinction between a non-clickbait headline and a clickbait headline.
A non-clickbait headline delivers straightforward key information, while a clickbait headline entices us to seek more. The usage of clickbait capitalizes on the human nature of curiosity. Such curiosity arises when humans want to know about something new and feel the gap between what they already know and what they want to know (Loewenstein, 1994). That curiosity gap is exploited by providing teaser messages in a clickbait headline, which signal the reader about new information, provoke the reader's curiosity, and lead them to click the headline (Anand et al., 2017). Previous studies about automatic clickbait detection used a neural network trained on a specific labeled corpus of clickbait. Past researchers also tried to find a specific pattern in their corpus of clickbaits, but the pattern is constantly changing over time (Anand et al., 2017; Agrawal, 2016). The need for automatic clickbait detection was addressed in previous studies, but the detectors were mostly trained on English clickbait corpora (Anand et al., 2017; Chakraborty et al., 2016; Potthast et al., 2016). As Zuhroh & Rakhmawati (2020) stated in their literature review, there is a gap to be filled in Indonesian clickbait detection. Indonesia needs such a tool to increase the quality of journalism itself, while also indirectly enhancing public digital literacy as the use of clickbait in Indonesian online news grows. A past study that focused on detecting Indonesian clickbait using a neural network utilized TF-IDF as its feature extraction algorithm. The TF-IDF (Term Frequency - Inverse Document Frequency) algorithm represents the features of a text by counting the frequency of term or word appearances in a document to express its relevance in a corpus (Maulidi et al., 2018). However, term frequency is simply not enough to capture the characteristics of clickbait headlines.
In order to capture semantic and syntactic properties in the headline, the text in this study is represented with word embeddings. Word embedding is a text feature extraction technique that maps words into a vector space model, representing each word as a vector and enabling computers to measure distances between words, thus returning word similarity (Anand et al., 2017; Zuhroh & Rakhmawati, 2020). Recently, a state-of-the-art language representation model named BERT was released. It performed best among the available language models on NLP tasks (Devlin et al., 2018). With the availability of trained language models in Indonesian and other languages, this study used a pre-trained multilingual BERT model as the language model. The use of transfer learning from the multilingual model enables the model to extract features from headlines that mix English and Indonesian in one sentence. The approach of using a neural network to classify clickbait was deemed feasible due to the dynamic nature of clickbait writing (Zuhroh & Rakhmawati, 2020; Maulidi et al., 2018; Agrawal, 2016). Moreover, Anand et al. (2017) compared their neural network's performance in detecting clickbait to other baseline models, i.e., support vector machine, decision tree, and random forest; their neural network achieved higher accuracy. Therefore, this study attempted to use M-BERT as an embedding layer in a neural network to detect clickbait in Indonesian online news sites. The training process of the neural network used in this study is depicted in the flowchart in Figure 2. "BERT Weights" in the flowchart refers to the pre-trained model used for the embedding layer, while "BERT tokenizer" is a built-in tokenizing factory that splits the headline strings into tokens. The news headline corpus was retrieved from William & Sari (2020)'s dataset, consisting of annotated news headlines from 12 online news sites.
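The word-similarity idea behind the embeddings discussed above can be sketched with a toy example. The 3-dimensional vectors below are hypothetical illustrations, not values from any trained model:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings: "berita" (news) and "kabar" (news/tidings)
# should lie closer together than "berita" and "buah" (fruit).
embeddings = {
    "berita": [0.9, 0.1, 0.3],
    "kabar":  [0.8, 0.2, 0.4],
    "buah":   [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(embeddings["berita"], embeddings["kabar"])
sim_unrelated = cosine_similarity(embeddings["berita"], embeddings["buah"])
assert sim_related > sim_unrelated
```

A real BERT embedding is contextual, of course: the same word receives different vectors in different sentences, which is exactly why M-BERT is preferred here over static embeddings.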
The news sites (partially redacted) are listed in Table 1. The dataset contains 15,000 headlines, each labeled by 3 undergraduate students, and was deemed moderately reliable with a Fleiss' Kappa interrater agreement of 0.42 (William & Sari, 2020). However, in this study, only the headlines whose label all three raters agreed on were selected. With that, the dataset consists of 8613 headlines, with a Fleiss' Kappa of 1, which means full agreement between raters (William & Sari, 2020). To explain further and make a clear distinction between clickbait and non-clickbait, various sources have listed the criteria of clickbait headlines (Potthast et al., 2018; Biyani et al., 2016). The criteria are listed in Table 2. The dataset is loaded directly into the Python code, which is also available at the GitHub link specified at the end of this article. Because the classes are imbalanced, the dataset was balanced: 3316 non-clickbait headlines were randomly picked to match the 3316 clickbait headlines, for a total of 6632 training headlines. Then, using the BERT tokenizer from Hugging Face, the text is stemmed, tokenized into words, padded, and indexed, while also being formatted according to the BERT layer's input specification. The texts then had their stopwords removed using PySastrawi, an open-source Indonesian stopword remover (Robbani, 2018; Wolf et al., 2019). Finally, each headline in the dataset is transformed into a sequence of token ids. These sequences of integers, which refer to the respective words in the dictionary, are ready to be fed to the neural network (Wolf et al., 2019). The neural network configuration is as follows. The input layers are two Keras input layers, one handling the sequences of token ids and the other the attention masks (the markers for pad and non-pad tokens); both are then passed forward to the BERT layer.
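The balancing and padding steps described above can be sketched in plain Python. The token ids below are illustrative placeholders rather than real BERT vocabulary indices, and `undersample` stands in for whatever random selection the authors used:

```python
import random

def undersample(majority, minority, seed=42):
    """Randomly pick len(minority) items from the majority class,
    mirroring how 3316 non-clickbait headlines were drawn from 5297."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority))

def pad_sequences(seqs, pad_id=0):
    """Right-pad token-id sequences to the longest length and build the
    matching attention masks (1 = real token, 0 = padding)."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    masks = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, masks

# Illustrative token ids only, not actual tokenizer output.
seqs = [[101, 2023, 102], [101, 2003, 1037, 2307, 102]]
padded, masks = pad_sequences(seqs)
# padded[0] == [101, 2023, 102, 0, 0]; masks[0] == [1, 1, 1, 0, 0]
```

In practice the Hugging Face tokenizer produces both outputs at once; the sketch only shows why the model needs two input layers, one for ids and one for masks.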
The embedding layer is a multilingual BERT model trained on a multilingual Wikipedia dump dataset, which includes the Indonesian language (Devlin et al., 2018). The usage of a pre-trained language model allows the researcher to capture semantic features in the headline; it also enables the researcher to extract features from the headline corpus in a short time, without spending many computing hours training a language model. The hidden layer consists of 100 densely connected neurons, activated with the ReLU function. Before the sequence is passed into the dense layer, it passes through a GlobalAveragePooling layer and is flattened to fit the dense layer's dimensions. Finally, the sequence is passed into a final dense layer, activated with a sigmoid function, to classify it as either the clickbait class or the non-clickbait class. The model is compiled with the Adam optimizer and a learning rate of 1e-05, then trained for three iterations. The network architecture is depicted in Figure 3. The model was evaluated using the 5-fold cross-validation method to identify its accuracy, confusion matrices, and ROC-AUC plot. An additional evaluation was also employed: a total of 3237 labeled headlines collected in May 2020, a different collection time than the training dataset, was used as test data and evaluated with the same metrics. This additional evaluation tests whether the model can detect clickbait in another dataset with different topics and possibly different clickbait sentence structures. A total of 6632 headlines were used as training data. Specifically, 3316 headlines were labeled as clickbait, and 3316 headlines were randomly picked from a total of 5297 non-clickbait headlines to balance the classes. In the clickbait category, the word "ini", or "this" in English, appears very frequently, as seen in Figure 4a. Usually, the word "ini" is used as a pointing word that leads the reader to the curiosity gap, e.g.
"These 5 Kinds of Fruit Are Really Good for Your Skin!", translated from the Indonesian sentence "5 Macam Buah Ini Sangat Bagus Untuk Kulitmu!". Notice the use of "ini" in the headline. Although it may be just one signal word of clickbait, clickbait writing seems to use this word very often; hence, it appears as a top word in the clickbait category. Additionally, the word "bikin" also appears frequently in the clickbait category. The word "bikin" is considered conversational slang in Indonesia, mostly used among urban citizens. It is well suited to clickbait because slang is often used in clickbait headlines to bring the headline to an "easier level", so that readers can relate to it with relative ease. The word "jadi", meanwhile, can mean two different things: it means "become" in a less formal setting, and it can also be used as a conjunction, often coupled with another clause, in which case it translates to "so" in English. The word "jadi" appears often in both the clickbait and non-clickbait categories; this may confuse the model, because the word has different meanings in Indonesian. However, using M-BERT can resolve this confusion, because M-BERT captures semantic meaning from the context of the sentence a word appears in. Other words in the wordcloud depicted in Figure 4 also hint at which topics often appear in the clickbait category. Celebrity names appear in the clickbait wordcloud, indicating that clickbait is often used in gossip and tabloid headlines, or "soft news". Still, some "hard news" words also appear often, such as "jokowi", "bj habibie", and "indonesia", indicating that clickbait is used among "hard news" topics as well. Meanwhile, the non-clickbait wordcloud depicted in Figure 4b shows many "hard news" related words, e.g. "indonesia", "karhutla", "kpk", "polisi", and no signs of either celebrity names or informal words.
Top 10 words can also describe the corpus. Using bag-of-words tokenization, bar charts depicting the top 10 words in each category are shown in Figure 5. Looking at Figure 5a, the word "ini" appeared 881 times in the corpus, far more often than other words. Also, most of the top 10 words in clickbait did not refer directly to the topic of the article, although a good headline should refer to the topic directly. The clickbait bar chart describes the corpus well and matches the clickbait criteria. In Figure 5b, "kpk" appeared most often, with a frequency of 206. Looking at all the top 10 words of the non-clickbait corpus, most of them are nouns, e.g. "Indonesia", "habibie", "dpr". This shows that non-clickbait headlines often refer to a specific topic directly, without using pointing words or conjunctions, unlike clickbait. From the wordclouds and bar charts, the distinction between the two categories lies in the use of informal words, named entities, and different parts of speech. After the dataset was trained on the described neural network model for 3 iterations, it was evaluated using 5-fold cross-validation, with each fold yielding accuracy, precision, recall, and F1-score, which were then used for evaluating the model. Accuracy, precision, recall, and F1-score values indicate the performance of the model: the closer the value is to 1, the better the model classifies headlines. Specifically, accuracy represents the model's ability to classify each headline into its correct class. Precision represents the proportion of true positives, i.e., headlines predicted as clickbait that turned out to be clickbait, among all headlines predicted as clickbait. Recall represents the proportion of true positives relative to the actual clickbait class. The F1-score is a combination of precision and recall that represents both values in equal proportion.
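The metrics described above follow directly from the confusion-matrix counts. A minimal sketch with made-up counts (not the paper's actual confusion matrix):

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    treating 'clickbait' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # predicted clickbait that really was clickbait
    recall = tp / (tp + fn)      # actual clickbait that was caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

# Hypothetical counts for a single fold, for illustration only.
acc, prec, rec, f1 = classification_metrics(tp=90, fp=10, fn=8, tn=92)
# prec = 90/100 = 0.90; rec = 90/98, roughly 0.918
```

Because the training set was balanced to 3316 headlines per class, accuracy and F1 are expected to track each other closely, which matches the reported 0.914 for both.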
Additionally, a further evaluation was employed to see whether the model can adapt to a dataset from a different collection time period with a different media agenda. Specifically, the training dataset was collected pre-Covid-19, while the additional test dataset was collected during the Covid-19 pandemic. As depicted in Figure 7a, clickbait headlines still have the word "ini" mentioned quite frequently, similar to the clickbait category in the training dataset; it was mentioned 325 times in the clickbait group. However, in Figure 7b, "Covid-19" is the most frequent word in the non-clickbait group, mentioned 116 times. The other top 10 frequent words are still related to Covid-19, i.e., "corona", "new", "normal", and "pandemi", which is consistent with the media agenda-setting and public issues during the data acquisition. This difference of agenda may confuse the model, and the additional evaluation seeks to identify such flaws. The additional evaluation shows an average accuracy of 0.83, precision of 0.82, recall of 0.83, and F1-score of 0.83. These scores indicate that the model can still detect newer headlines with different topics and term frequencies, albeit with decreased performance. Finally, considering all of the evaluation metrics, the ROC-AUC scores, and the additional evaluation, the model was deemed to perform well. The finding indicates a possible future for using a pre-trained language model in classifying clickbait. With such a carefully curated dataset, clickbait detection can be further expanded to different NLP tasks as well (William & Sari, 2020). By using BERT, the whole model is simplified: using only a BERT layer and a standard hidden dense layer, topped with a sigmoid-activated neuron, the classifier worked remarkably well with an average accuracy of 92%.
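The 5-fold scheme used in the evaluations above can be sketched with index arithmetic. This is a plain contiguous split for illustration; the study likely used a library helper, and the exact fold assignment is an assumption:

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k folds,
    so every sample appears in exactly one test fold."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        if fold == k - 1:
            stop = n_samples  # the last fold absorbs any remainder
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test

# 6632 training headlines split into 5 folds.
folds = list(k_fold_indices(6632, k=5))
```

Averaging the per-fold accuracy, precision, recall, and F1 over these five train/test pairs gives the reported cross-validated scores.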
Furthermore, if the model is deployed into usable software, it can help alleviate the bias introduced by the priming words widely used in writing clickbait headlines. Flagging a clickbait headline gives the reader a chance to rethink their urge to click and fill their curiosity gap (Loewenstein, 1994). This study has some limitations, one of which is the broad range of news topics in the dataset. Such broad topics may introduce noise, because clickbait headlines often dominate the celebrity and gossip topics, and this noise may affect future predictions. Future research may fill the gap by focusing on specific hard news topics, such as politics, to learn more about clickbait usage across topics. Additionally, labeling clickbait is relatively hard, even for humans. Future research may expand the dataset and select only the data where every rater agreed, so that there is a clear distinction between clickbait and non-clickbait headlines. Also, the initial briefing of the raters should be conducted systematically, so that the dataset is reliable and unambiguous. The training dataset also needs expansion by adding more data from multiple collection time periods. This may increase the model's versatility by enabling it to capture the dynamic pattern of clickbait structure across time periods. Furthermore, this study used BERT pre-trained on Indonesian Wikipedia dumps as well as 100 other languages, which may not contain the sensational wording commonly used in clickbait headlines, due to Wikipedia's writing rules. Therefore, future researchers may collect a bigger Indonesian corpus that includes offensive and rude words, slang, and sensational words to fine-tune the BERT model, which might increase its performance. Yet, a qualitative approach to clickbait assessment is also needed to define clickbait characteristics thoroughly.
With a detailed specification of clickbait headlines, the labeling process can be less biased. A qualitative study could also confirm the descriptive analysis of this study, which found that clickbait headlines often use non-topic-related words and teasing words. Using neural networks to classify clickbait has been fairly common for the English language, but not for Indonesian. This study contributes by showing that Multilingual BERT, a state-of-the-art model, is able to classify Indonesian clickbait headlines. Furthermore, we would like to explore more methods for detecting clickbait in online news headlines. We also want to deploy the model into a usable component, such as a browser extension, so that the clickbait detector can be used by the public. Finally, future researchers can look into the effect of a clickbait detector on the general public, e.g., whether it influences adult literacy and inhibits the spread of misinformation. The complete Python notebook and datasets are stored at https://github.com/ruzcmc/ClickbaitIndotextclassifier.

References

Agrawal (2016). Clickbait detection using deep learning.
Anand et al. (2017). We used neural networks to detect clickbaits.
Biyani et al. (2016). "8 Amazing Secrets for Getting More Clicks": Detecting clickbaits in news streams using article informality.
Chakraborty et al. (2016). Stop clickbait: Detecting and preventing clickbaits in online news media. IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM).
Chen et al. (2015). News in an online world: The need for an "automatic crap detector".
Devlin et al. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Hurst (2016). To clickbait or not to clickbait? An examination of clickbait headline effects on source credibility.
Loewenstein (1994). The psychology of curiosity: A review and reinterpretation.
Maulidi et al. (2018). Penerapan Neural Network Backpropagation Untuk Klasifikasi Artikel Clickbait.
Potthast et al. (2018). Crowdsourcing a large corpus of clickbait on Twitter.
Potthast et al. (2016). Clickbait detection.
Robbani (2018). GitHub: Indonesian stemmer. Python port of the PHP Sastrawi project.
William & Sari (2020). CLICK-ID: A novel dataset for Indonesian clickbait headlines.
Wolf et al. (2019). Transformers: State-of-the-art natural language processing.
Zuhroh & Rakhmawati (2020). Clickbait detection: A literature review of the methods used.

Training dataset provided by A. William and Y. Sari, as cited. Additional test dataset provided by Fakhruzzaman et al.