key: cord-0790742-mlsk8r9p
authors: Olaleye, T.O.; Arogundade, O.T.; Abayomi-Alli, A.; Adesemowo, A.K.
title: An ensemble predictive analytics of COVID-19 infodemic tweets using bag of words
date: 2021-05-21
journal: Data Science for COVID-19
DOI: 10.1016/b978-0-12-824536-1.00004-6
sha: 5a60e45a6eb6e25440f50689ddcde4c3e30968ed
doc_id: 790742
cord_uid: mlsk8r9p

Fake COVID-19 tweets appear legitimate and appealing to unsuspecting internet users because of the lack of prior knowledge of the novel pandemic. Such news can be misleading, counterproductive, unethical, and unprofessional, and sometimes constitutes a cog in the wheel of global efforts toward flattening the virus spread curve. Therefore, aside from the COVID-19 pandemic itself, dealing with fake news and myths about the virus constitutes an infodemic issue which must be tackled to ensure that only valid information is consumed by the public. Following the research approach, this chapter aims at a predictive analytics of COVID-19 infodemic tweets that generates a classification rule and validates genuine information from verified, accredited health institutions and sources. On deployment of a Vote classifier ensemble formed by the base classifiers SMO, Voted Perceptron, Liblinear, REPTree, and Decision Stump on a dataset of 81,456 tokenized Bag of Words terms encapsulating 2964 COVID-19 tweet instances and 3169 extracted numeric vector attributes, experimental results show a 99.93% prediction accuracy on 10-fold cross validation, while the information gain of each of the 3169 extracted attributes is ranked to ascertain the most significant COVID-19 tweet words for the detection system. Other performance metrics, including ROC area and Relief-F, validate the reliability of the model and return SMO as the most efficient base classifier. The thrust of the model centers more on the trustworthiness of the COVID-19 tweet source than on the truthfulness of the tweet itself, which underscores the prominence of verified health institutions and contributes to the discourse on the inhibition and impact of fake news, especially during societal pandemics. The COVID-19 infodemic detection algorithm provides insight into the new spin on fake news in the age of social media and the era of pandemics.

The emergence of the novel coronavirus in the year 2019, also referred to as 2019-nCoV, came with diverse challenges apart from the global health threats that nations and institutions are grappling with toward flattening the curve of the trend [1,2]. Chief among such attendant challenges is the issue of fake news, myths, and misguiding information daily propagated by disgruntled elements in the society (and even innocent, unknowing citizens), thereby constituting a cog in the wheel of progress of health institutions. The World Health Organization (WHO), in its attempt to trigger global efforts toward taming the trend of COVID-19 fake news across the world, noted that apart from the COVID-19 pandemic, the next most challenging problem is what it terms the infodemic nature of news circulated daily, most especially in online media [3], p. 3. Whereas social networking sites ease communication and global interaction across the nations of the world, they also pave the way for fake news, which is seamlessly posted on web platforms, including social media platforms, to misguide unsuspecting citizens [4]. One of the most prominent social media platforms is Twitter.
It enjoys the patronage of both young and old, who throng their Twitter handles to post user-generated information popularly referred to as tweets, or simply join an ongoing hashtag discussion arranged in threads amounting, in some cases, to millions of tweet posts. As at January 2020, internet users in Nigeria totaled 85.49 million, an increment of 2.2 million (+2.6%) from 2019, while internet penetration stood at 42%, out of which a total of 27.00 million are social media users [5]. The aforementioned report shows that an additional 3.4 million (+14%) users were added between April 2019 and January 2020.

A major problem with the modeling of text, however, is the messy nature of words, most especially user-generated social media posts, which are stylishly crafted and posted with characteristic informal elements including emoticons, hashtags, haphazard punctuation, abbreviations, deliberate misspellings, etc., all of which contravene the accepted machine learning standard of well-defined, fixed-length inputs and outputs. Since raw Twitter posts are not acceptable text inputs for machine learning, tweets must be converted into vectors of numbers for the subsequent classification phase of the fake news detection model.

The rest of the paper is structured as follows: Section 2 discusses existing related works in the area of fake news detection and text classification data mining, Section 3 unveils the methodology deployed for the design of the proposed model, Section 4 discusses the results of the predictive analytics, and Section 5 concludes this work with recommendations.

The motivation to checkmate fake news, especially on social media, is commonplace across different professional callings and research works, with various approaches and methodologies deployed toward an eventual accurate model of the goal in sight. In the work of [7], an in-depth analysis of the relationship between fake news and the social media profiles of perpetrators and their cohorts was evaluated and established. Political bias, profile image, and location of users were the features used to train the random forest algorithm for the classification analysis, unlike [8], which developed a Check-it plug-in aimed at taming misinformation on social media. The authors of Check-it conceptualized a web browser plug-in approach that puts together a variety of signals into a channel for identifying invalid news. In Ref. [6], data management and mining were the perspective of the survey conducted by the authors toward studying approaches in the literature aimed at mitigating the effects of the scourge. Database and machine learning approaches were discussed as efficient ways of predicting beforehand the status of news items, while epidemiological approaches and influence intensification models were established as mitigating choices in stemming the tide. While identifying the usefulness of machine learning models in predictive analytics, Ref. [9] emphasized the towering role of data mining toward solving classification problems, as the author deploys academic data to predict future performance by assigning risk levels to students' current academic standing, thereby returning a high predictive accuracy which reechoes the efficiency of machine learning algorithms. In the work of Ref. [10], the subjectivity of both legitimate and fake news was considered of paramount importance in the predictive analytics of news categorization.
A set of objectivity lexicons initiated by linguists was featured as input variables, calculating the Word Mover's Distance to build the feature vectors extracted from each of the politics, sports, economy, and culture news domains for XGBoost and random forest classification. Ref. [11] discusses adversarial attacks on junk news filters and text classification, probing the robustness of machine learning models using synthetic attacks, an approach often referred to as adversarial training; the essence is to generate synthetic data instances through a generative adversarial network. In Ref. [12], detection of fake news on Twitter was the target of a weekly acquisition of tweet instances for a supervised classification approach. Tweets were labeled as trustworthy or untrustworthy and trained with XGBoost, reaching an F1 score of up to 0.9 in the binary classification of the training set. The dual classification approach of cross validation and training-set evaluation was deployed to predict the source of fake news and the text of fake news itself, respectively, with experimental results showing that a large-scale labeled training set, even with inexact labels, returns good accuracies as well. The work of [13] is an approach to detect spam accounts on Twitter through the deployment of classification algorithms including K-Nearest Neighbor, Decision Tree, Naive Bayes, Random Forest, Logistic Regression, support vector machine (SVM), and XGBoost, with the numbers of followings and followers as independent input features; random forest returned the best classification accuracy in detecting spammers. Text mining of the Facebook social media platform through Naïve Bayes was conducted by Ref. [14] to predict whether a post on Facebook is real or fake, while convolutional neural networks and convolutional recurrent neural networks were used to extract features from Chinese texts in Ref. [15] before Chinese text predictive analytics. Automatic detection of racist news forms the essence of the work of [16], using SVM through the identification of indicative words in racist texts; bag of words, bigrams, and part-of-speech tags were the approaches deployed for the classification model in their work. Classification of demographic attributes to identify latent Twitter users was the thrust of the work of [17] through a stacked-SVM model of single and ensemble approaches, with gender, age, political affiliation, and ethnicity forming the attributes of the predictive analytics. In an attempt to automatically detect hoaxes on the internet, Ref. [18] deploys content analysis in the classification of fake news, including mining the post diffusion pattern via social content features, returning 81.7% prediction accuracy. In Ref. [19], K-means clustering is deployed to count clusters in Twitter posts, which is faster than comparable methodologies in the literature regardless of the number of document vectors and their dimensions, although the task still incurs a considerable increase in execution time. A sentence-comment co-attention subnetwork that exploits news content toward the detection of fake news was the main aim of Ref. [20] in their model code-named Defend. In that work, explainable detection of fake news was the approach, placing emphasis on the rationale behind the decision to label a news item as fake, while also deploying explainable machine learning for the classification phase through intrinsic explainability and post-hoc explainability, achieved through phases of content encodings.
In Ref. [21], fake news detection on Twitter using a hybrid CNN and RNN with an 82% performance accuracy was the main thrust of the research work on the Twitter subdomain of the social network. The use of n-grams for lie detection was the aim of Ref. [22], which established datasets using crowd-sourcing, containing statements of people lying about their beliefs on issues centered on abortion, friendship, and the death penalty, in an attempt to decipher dissimilarities in the texts while evaluating the efficiency of n-grams in classifying lies from the truth. Naive Bayes and SVM algorithms were trained using term frequency vectors of n-grams in the texts as inputs. A 70% classification accuracy was achieved in identifying lies about beliefs and an accuracy of 75% in identifying lies about feelings. A linguistic analysis-based method is likewise proposed in Refs. [23,24] with the same n-gram classification approach for pretended appraisals, using crowd-sourced workers who were enjoined to create false positive assessments about hotels [23] and false negative reviews of hotels [24]. Their work revealed that fake assessments contained fewer spatial words, including floor, small, and location, expectedly because the worker is alien to the hotel and hence has little or no spatial detail available for the assessment. It also revealed that positive sentimental words were overstated in the false positive assessments when compared with their true counterparts, and a related exaggeration was observed for negative sentimental words in false negative assessments. A variant work proposed by Ref. [25] focuses specifically on the textual features extracted by adding an attention apparatus to an LSTM design to reflect the resemblance of fake news words and to depict portions of the text that are pointers to fact or sham, such that words indicated with darker red appear germane to the fake news detection task. Whereas adding attention does not specifically improve detection performance, it provides some interpretability from the extracted features by shedding more light on significant words, which can be deployed for qualitative fake news studies instead of detection. This is a different approach from the work of Ref. [26], which features a multitask learning approach to jointly provide stance detection and truth classification, enhancing accuracy by employing the interdependence of the two tasks. The proposed model is based on recurrent neural networks with shared and task-specific parameters, combining the tasks of stance detection and truth classification, where each task is tagged with a shared GRU layer and task-specific GRU layers. The intention of the shared GRU layer is to detect patterns common to both tasks, whereas the task-specific GRU aids the detection of patterns that are more paramount to one task than to the other. An ensemble classification of tweet sentiment was the thrust of the work of Ref. [27], which proposes an approach that classifies tweet sentiments by using classifier ensembles and lexicons with a binary classification of either positive or negative class output. The model is conceptualized for consumers who can search for products with sentiment analysis and for manufacturers who need to monitor public sentiment about their brands.
A public tweet sentiment dataset is deployed on an ensemble formed with the base classifiers logistic regression, SVM, Multinomial Naïve Bayes, and Random Forest, which significantly improved the predictive analytics of the classification. The work of Ref. [28] is a systematic literature review on combating fake news that analyzes the nitty-gritty of the modern-day fake news problem, highlighting the technical challenges associated therewith and discussing existing methodologies and techniques deployed for both detection and inhibition. The inherent characteristic features of public datasets in particular were discovered to have a significant influence on the eventual accuracy of any detection system designed for preventive measures. Categorization of existing detection methods in the survey reveals three major approaches, in no particular order. Content-based identification comprises cue and feature methods, deep learning content-based methods, and linguistic analysis methods. Feedback-based identification comprises hand-crafted attributes, broadcast pattern study, temporal shape analysis, and response text and response user analysis. The third approach is referred to as intervention-based solutions, which comprise identification and mitigation strategies. The survey recommended three germane factors, among which are the adoption of a dynamic knowledge-base dataset that could reflect the changes occurring in a fast-paced world, which influence the status of a news item along the timeline between being truthful or otherwise, and the deployment of datasets for news intent detection, as against the usual "true" or "false" binary output class discovered in all 23 public datasets surveyed in the systematic literature review.

The level of fake news associated with the global outpouring of impressions and opinions about COVID-19 has resulted in somewhat xenophobic attacks on those who have tested positive, and especially on Chinese tourists, as observed by Ref. [29], which asserts that with the avalanche of misconceptions being spread through social media, discrimination against Chinese nationals has become extensive in some parts of the world, resulting in more havoc than the virus itself. A trending #ChineseDon'tComeToJapan Twitter hashtag is noted as referring to Chinese people as insensitive, and even as bioterrorists, which is counterproductive to concerted global efforts toward finding an antidote to the menace. Furthermore, an evolution tree study in a computational approach for probing the roots and spreading forms of fake news was carried out in the work of Ref. [30]. The work uses recent progress in the field of evolution tree analysis by analyzing issues in the scope of the 2016 presidential election of the United States, accessing 307,738 tweets about 30 fake and 30 real news items. Results show that root tweets about fake news are mostly authored by Twitter handles of ordinary users with links to noncredible news platforms, while a significant variance between real and fake news items in terms of evolution patterns is observed. In a deception detection research endeavor, Ref. [31] designed LIAR, a publicly available dataset for false news detection, by collecting 12,800 manually labeled short statements made over a decade in numerous situations from POLITIFACT.COM, which provides exposition and links to source documents for each case.
An automated fake news detection system was then investigated on LIAR based on surface-level linguistic patterns, resulting in a novel hybrid convolutional neural network that integrates metadata with text and consequently shows that the hybrid approach improves on a text-only deep learning model. A Twitter bot detection through a one-class classification approach was proposed by Ref. [32] to detect malicious posts; the proposal is reported to steadily identify different kinds of bots with a 0.89+ performance measured using AUC, requiring little or no previous evidence about them. Ten classifiers were deployed for the work, using ROC curves to ascertain the performance of the classifiers at diverse thresholds. The AUC of the ROC curve was adopted as it summarizes the accuracy of a classifier into a single metric.

In this NLP predictive analytics research, numeric word vectors are derived from the corpus of 81,456 COVID-19 Twitter word-posts to reflect various linguistic properties of posted COVID-19 texts acquired from verified Twitter handles of national and global health institutions and from handles of unverified and random Twitter users, in an attempt to detect fake COVID-19 news, which forms the new global infodemic menace. With the execution of this model, Twitter users who post unsubstantiated, unauthorized, unverifiable, misleading, counterproductive, and misguiding COVID-19 tweets can be identified and their tweets disregarded or reported accordingly as coming from people who circulate fake news or are not authorized to issue classified statements to the public. Hence, a vocabulary of known words through Bag of Words is deployed in this work as dataset input encapsulating 2964 COVID-19 tweet instances for preprocessing and tokenization, and subsequently for the supervised dual classification phase of ensemble and single-classifier predictive analytics, with the aim of isolating fake COVID-19 tweets from valid ones, fake tweets having affected spirited government efforts in no small way. The Waikato Environment for Knowledge Analysis (WEKA) tool was used for the preprocessing, tokenization, and machine learning phases of this work.

The research work followed the research approach in Ref. [32] as outlined in Fig. 19.1, as it deploys NLP of user-generated COVID-19 tweets to ascertain the validity of the information being passed across the Twitter medium. Semantic features of tweet texts are tokenized through the Word-to-Vector filter in the WEKA application tool. The pipeline of this infodemic predictive analytics is in four phases: Tweet Acquisition, Tweet Tokenization, Ensemble Classification, and Rule Generation. The proposed infodemic COVID-19 tweet detection model framework is represented in Fig. 19.1. A corpus of 81,456 tweet texts from 2964 COVID-19-related tweets forms the training and testing dataset; the tweets come either from verified authorized institutions or from individual Twitter users, hence a binary classification task. A total of 1602 tweets from verified institutions, amounting to 54.048%, are labeled as valid or trusted COVID-19 tweets, while 1362 tweets from individuals, amounting to 45.95%, are labeled as invalid COVID-19 tweets. Table 19.1 shows the labeling of tweets depending on the source. The tweets, in their dirty and raw form, were subjected to cleaning to remove stop words, invalid icons, and emoticons that are characteristic of tweets across the demographics of the world.
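For illustration only, the cleaning step described above can be approximated outside WEKA. The following is a minimal Python sketch, not the chapter's implementation, that strips URLs, user mentions, emoticons, stray punctuation, and English stop words from a hypothetical file of labeled tweets; the file name covid_tweets.csv and its tweet and label columns are assumptions made for this example.

```python
import re
import csv
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def clean_tweet(text: str) -> str:
    """Roughly normalize a raw tweet: drop URLs, mentions, emoticons,
    stray punctuation, and common English stop words."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"@\w+", " ", text)                # user mentions
    text = re.sub(r"[^\w#\s]", " ", text)            # punctuation and emoticons (hashtags kept)
    tokens = [t for t in text.split() if t not in ENGLISH_STOP_WORDS]
    return " ".join(tokens)

# Hypothetical input file: one labeled tweet per row (columns: tweet, label).
with open("covid_tweets.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

cleaned_tweets = [clean_tweet(r["tweet"]) for r in rows]
labels = [r["label"] for r in rows]   # "valid" or "invalid", depending on the source
print(cleaned_tweets[:3], labels[:3])
```

In the chapter's own pipeline this role is played by WEKA's preprocessing before the ARFF file is built; the Python version is meant only to make the transformation concrete.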
Upon cleaning, the 81,456 tweet texts are prepared as an ARFF file before filtering in the second stage of this machine learning pipeline. To acquire useful classification elements embedded within the texts of tweets for a syntactic representation, the StringToWordVector filter is applied at this stage of the activity flow to make the text file classifiable by way of text extraction. By way of word counts, a Bag of Words emerges from this stage, which forms the input, as an ARFF file, into the next stage, the ensemble machine learning phase. The StringToWordVector filter converts the string of tweets into a set of numeric attributes representing the frequency of word appearance in the tweet corpus. Features extracted from the tweets serve as attributes from the 2964 collected valid and invalid Twitter posts. The word-count frequencies of these attributes serve as independent attributes that point to the status of each post, which is tagged as either valid or invalid in the class output labeling. The array of the resulting attributes is referred to as the Bag of Words in ARFF file format, which is the representation of the 81,456 COVID-19 Twitter texts. With Bag of Words, the trustworthiness of the tweet source takes prominence over the truth of the tweet itself, which underscores the prominence of constituted authorities in COVID-19 information dissemination efforts. The intuition is that some particular COVID-19 tweets are similar and from the same source if they have similar content in terms of some specific known words; hence the creation of a Boolean vector representation of each document such that the presence of a specific word is marked 1 and its absence 0. These words constitute the independent attributes of the dataset, and the following pseudocode generates the Bag of Words.

1. input: tweet T, vocabulary V
2. bow_feature <- zero vector of length |V|
3. for each word w in V do
4.   if w occurs in T then
5.     bow_feature[w] <- 1
6.   else
7.     bow_feature[w] <- 0
8.   end if
9. end for
10. return bow_feature

From the generated ARFF Bag of Words, a preprocessing of the dataset is implemented with an Information Gain (I.G.) evaluator filter to rank the entire 3169 word attributes and ascertain the relevance of each word in determining the valid or invalid status of each COVID-19 Twitter post. To achieve this, the Ranker calculates the entropy for each attribute, which varies from 0 (no information) to 1 (maximum information), and then assigns a high I.G. value to attributes that contribute more information toward the categorization task, as shown in Eq. (19.1):

IG(C, B_i) = E(C) - E(C | B_i)    (19.1)

where C is the output class, B_i is the ith word attribute, and E is the entropy.

Three ensembles of classifiers, ADABOOST, BAGGING, and VOTE, are separately deployed for the classification stage, each under 10-fold cross validation, followed by single-classifier runs with Decision Stump and REPTree of the tree subcategory and SMO, Liblinear, and Voted Perceptron of the function subcategory, all supervised learners, in a bid to increase the reliability of the study. Dual training is done on ADABOOST with SMO and Decision Stump base learners over 10 iterations, and likewise for BAGGING, deploying SMO and REPTree base learners, respectively, also over 10 iterations. The VOTE ensemble classification is achieved with SMO, Liblinear, Voted Perceptron, and REPTree base learners to derive the weighted average performance metrics.
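As a rough, non-WEKA illustration of this stage, the sketch below builds the binary Bag of Words matrix, ranks attributes with an information-gain-style score, and evaluates a hard-voting ensemble by cross validation. It uses scikit-learn stand-ins for the WEKA learners (a linear-kernel SVC for SMO, LinearSVC for Liblinear, Perceptron for Voted Perceptron, and a decision tree for REPTree) and a tiny invented corpus, so its numbers and component behavior will differ from the chapter's WEKA results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import Perceptron
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-ins for the cleaned corpus and its source-based labels.
cleaned_tweets = [
    "ncdcgov confirms new covid19 cases lagos",
    "covid19 cured drinking hot water garlic",
    "who situation report covid19 update",
    "5g towers spread coronavirus stay away",
]
labels = ["valid", "invalid", "valid", "invalid"]

# Boolean Bag of Words: 1 if a vocabulary word occurs in the tweet, else 0.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(cleaned_tweets)

# Information-gain-style ranking of the word attributes (mutual information
# used here as a stand-in for WEKA's InfoGainAttributeEval + Ranker).
scores = mutual_info_classif(X, labels, discrete_features=True, random_state=0)
ranked = sorted(zip(vectorizer.get_feature_names_out(), scores),
                key=lambda pair: pair[1], reverse=True)
print("top attributes:", ranked[:5])

# Hard-voting ensemble over analogues of the chapter's base learners.
vote = VotingClassifier(estimators=[
    ("smo_like", SVC(kernel="linear")),
    ("liblinear_like", LinearSVC()),
    ("voted_perceptron_like", Perceptron()),
    ("reptree_like", DecisionTreeClassifier()),
], voting="hard")

# The chapter uses 10-fold cross validation; the fold count is reduced here
# only because the toy corpus has four instances.
print("mean accuracy:", cross_val_score(vote, X, labels, cv=2).mean())
```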
BAGGING, often referred to as bootstrap aggregating, combines bootstrapping and aggregation to form one ensemble model. Given our dataset of numeric Bag of Words attributes, multiple bootstrapped subsamples are drawn, and a decision tree is formed on each data sample. Each subsample decision tree is then aggregated through an averaging process to form the most efficient predictor. ADABOOST considers homogeneous weak base learners, learning sequentially in an adaptive manner such that each base model depends on the ones preceding it, and then combines them following a deterministic strategy. The REPTree machine learning algorithm is of the supervised tree family and is deployed for rule generation to aid fake COVID-19 tweet identification with the Bag of Words input. It identifies word attributes that are significant for the unique identification of invalid tweets through the word frequency occurrence ratio in both valid and invalid tweet texts relating to COVID-19. In VOTING, multiple classification models are initially created from the training dataset, each base model being built on the same training set with a different algorithm. Each base model then forwards its prediction (vote) for the respective test instance under 10-fold cross validation; the final output prediction takes into cognizance the predictions of the better models multiple times.

Filtering the total 81,456-word tweet corpus extracted from the 2964 COVID-19 news-item tweets with the StringToWordVector filter returns a total of 3169 Bag of Words independent numeric attributes, which serve as the dictionary of words for the subsequent I.G. evaluation, the ensemble and single-classifier classification stages, and the subsequent rule generation phase. Tenfold cross validation was used to validate the results. As observed from the performance accuracy graph of Fig. 19, the attributes contributing the most information toward the determination of the status of a COVID-19 post were ranked; their average rank and merit altogether return the popular hashtag #Covid19 as the attribute with the highest information gain, @ncdcgov as the most significant Twitter handle, the World Health Organization abbreviation, WHO, as the eighth most significant attribute, and coronavirus as the 18th most significant word. Table 19.7 encapsulates our model-generated rules to detect COVID-19 infodemic tweets, which are the output of the REPTree supervised learning model. These rules uniquely identify fake COVID-19 tweets, which are not only misguiding but form various myths that impede government efforts toward engaging the citizenry to tame the COVID-19 trend in the country in an attempt to flatten its curve.
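The rule generation reported in Table 19.7 comes from the REPTree model inside WEKA. A comparable, purely illustrative way to surface human-readable word rules from a trained decision tree in Python is sketched below, with scikit-learn's DecisionTreeClassifier and export_text standing in for REPTree and the same invented toy corpus used in the earlier sketch; it is a sketch under those assumptions, not the chapter's rule generator.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny stand-in corpus; labels reflect the source-based valid/invalid tagging.
tweets = [
    "ncdcgov confirms new covid19 cases lagos",
    "covid19 cured drinking hot water garlic",
    "who situation report covid19 update",
    "5g towers spread coronavirus stay away",
]
labels = ["valid", "invalid", "valid", "invalid"]

# Binary Bag of Words attributes, mirroring the StringToWordVector step.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(tweets)

# A single decision tree as a rough stand-in for WEKA's REPTree rule generator.
tree = DecisionTreeClassifier(max_depth=4, random_state=0)
tree.fit(X, labels)

# Dump the learned splits as readable if/else rules over the word attributes.
print(export_text(tree, feature_names=list(vectorizer.get_feature_names_out())))
```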
Twitter and similar social media have turned into veritable platforms for disgruntled elements in the society who are resolute in their determination to constitute a clog in the wheel of governments' efforts toward stemming the tide of the global COVID-19 pandemic ravaging the nations of the world. This also holds true for citizens who are unintentional offenders and/or inept with social media ethics. Their penchant for authoring and circulating fake news has resulted in another infodemic challenge for governments and health institutions, which battle the trend to safeguard unsuspecting members of the public who digest fake news, COVID-19 myths, and misguided and misleading information hook, line, and sinker. This study consequently deploys machine learning intelligence toward the classification of valid and invalid COVID-19 news items emanating from Twitter. Through NLP of Bag of Words and an ensemble machine learning approach, the VOTE ensemble classification model returns an efficient model for detecting misguiding and fake COVID-19 tweets, with a satisfactory cross-validation accuracy of 99.93%, which serves the aim of this research by generating rules that uniquely identify fake COVID-19 tweets. Future deployment of an API is recommended for the acquisition of tweets for a more robust and wider acquisition scope.

The authors appreciate the efforts of the reviewers toward the final and better outcome of this chapter.

References
COVID-19 Situation Report, World Health Organisation
An interactive web-based dashboard to track COVID-19 in real time
COVID-19 infodemic: more retweets for science-based information on coronavirus than for false information
Differentiating data- and text-mining terminology
Digital 2020: Nigeria
Combating fake news: a data management and mining perspective
The role of user profiles for fake news detection
Check-it: a plugin for detecting and reducing the spread of fake news and misinformation on the web
A predictive model for students' performance and risk level indicators using machine learning
Fake news classification based on subjective language
Adversarial machine learning for text
Weakly supervised learning for fake news detection on Twitter
Detecting spam accounts on Twitter
Fake news detection
Chinese text feature extraction and classification based on deep learning
Classifying racist texts using a support vector machine
Classifying latent user attributes in Twitter
Automatic online fake news detection combining content and social signals
Counting clusters in Twitter posts
Text mining: finding nuggets in mountains of textual data
Defend: explainable fake news detection
The lie detector: explorations in the automatic recognition of deceptive language
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Finding deceptive opinion spam by any stretch of the imagination
Call attention to rumors: deep attention based recurrent neural networks for early rumor detection
Detect rumor and stance jointly by neural multi-task learning
Tweet sentiment analysis with classifier ensembles
Combating fake news: a survey on identification and mitigation techniques
2019-nCoV, fake news, and racism
A computational approach for examining the roots and spreading patterns of fake news: evolution tree analysis
"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection
A One-Class Classification Approach for Bot Detection on Twitter

Table 19.7 Classification rules of COVID-19 infodemic tweets by the REPTree classifier.