key: cord-0627310-ze0foltm authors: Varshney, Deepika; Vishwakarma, Dinesh Kumar title: An Automated Multi-Web Platform Voting Framework to Predict Misleading Information Proliferated during COVID-19 Outbreak using Ensemble Method date: 2021-09-19 journal: nan DOI: nan sha: b9262d872e37647a369d6c5a41500592176ac403 doc_id: 627310 cord_uid: ze0foltm The spread of misleading information on social web platforms has fuelled panic and confusion among the public regarding the coronavirus disease, and detecting such information is of paramount importance. To address this issue, we have developed an automated system that collects and validates facts from multiple web platforms to decide the credibility of a piece of content. To assess the credibility of a posted claim, probable instances/clues (titles) of news information are first gathered from various web platforms. A crucial set of features is then extracted and fed into an ensemble-based machine learning model that classifies the news as misleading or real. Four sets of features, based on content, linguistic/semantic cues, similarity, and sentiment, are gathered from the web platforms, and voting is applied to validate the news. Finally, the combined voting decides the support given to a specific claim. In addition to the validation part, a unique source platform is designed for collecting data/facts from three web platforms (Twitter, Facebook, Google) based on given queries/words. This platform can also help researchers build datasets and gather useful clues from various web platforms. We observed that the proposed strategy gives promising results and is quite effective in predicting misleading information. The proposed work has practical implications for policy makers and health practitioners and could be useful in limiting the proliferation of misleading information during this pandemic.
Recently, a new coronavirus disease has spread around the world. The disease emerged as a respiratory infection and is a significant concern for global public health. Initially, the disease was suspected to be transmitted from animals to humans; later it became clear that the infection is also transmitted from human to human via droplets and close contact. This created huge panic: approximately 6,359,182 confirmed cases and 380,663 deaths have been recorded so far, and the growth rate is still high, which has alarmed global authorities including the World Health Organization (WHO) [1]. The COVID-19 pandemic has affected the whole world badly; however, there is no shortage of people who treat this crisis as an opportunity for malicious activities or profit [2]. A large amount of health-related misleading information, including fake cures for COVID-19, has been posted by malicious users and has created confusion and misconceptions about the disease. During this pandemic, people keep an eye on any new announcement from government officials and on any news that might help them get rid of COVID-19. Because the disease is deadly, people are desperate to learn of a cure and rush to find a treatment for the new coronavirus disease. Some of the fake cures posted on social media are also harmful and give bad health advice. Recent examples of fake cures are shown in Fig. 1. Fig. 1(a) shows an image carrying a false claim that went viral: that drinking a lot of water and gargling with warm water and salt or vinegar eliminates the coronavirus. No significant evidence has been found to support this claim. Another fake cure, shown in Fig. 1(b), claims that a silver solution can kill the coronavirus within 12 hours.
The proliferation of this misleading information creates many misconceptions about the coronavirus disease; some users spread it without verification and fuel panic about COVID-19. According to [3], a misleading or fake post can be defined as any post that shares content that does not faithfully represent the event that it refers to. We follow this definition in our work and define "misleading information as the content that does not faithfully represent the event that it refers to and having no significant evidence of proof to validate the claim". Recent research has observed that a large amount of misleading content about the coronavirus is circulating and that it is becoming difficult to differentiate fake news from real news [4]. The propagation of misleading content about the virus could also be harmful to people. This has led to a dire need for a system that can differentiate fake from real. Previous research has reported methods for detecting fake news in online social media across a variety of applications [5]. Most of it counters the fake news problem with one of two types of algorithm: image-based and text-based [6] [7]. Many previous studies detect fake news with a text-based approach. Text-based approaches mainly extract text patterns and match them against already known patterns of fake news; they are sometimes referred to as linguistic approaches. In addition, many researchers have turned to credibility detection of posts/tweets using text-based features [8] [9]. Research has likewise been carried out using image-based approaches.
Researchers have explored image-based algorithms for the analysis of fake images, or images attached to false claims, mainly for two kinds of image: text-additive images and manipulated images. A manipulated image is an image in which a piece, part, or region has been altered with respect to the visual context. Various image-based features have been explored for the classification of images. The authors of [10] propose 5 visual features and 7 statistical features for the verification of news events. Besides manipulated images, some researchers have also considered text-additive images for the analysis of misleading content. A text-additive image is an image embedded with a false claim rather than one whose visual content has been manipulated. The authors of [11] incorporate text-additive images and apply a rule-based algorithm for the prediction of fake news. From recent research, we observed that no work has reported an analysis of fake news prediction for content propagated during the coronavirus pandemic. Many people share fake cures for the coronavirus disease without any verification and create many misconceptions. Governments and officials have also urged people to check the authenticity of a post before sharing it [4]. This motivated us to build an intelligent system for predicting the fake news spreading during this pandemic. We therefore developed a generalized multi-web-platform framework for detecting misleading content on social media platforms, in which COVID-19, a major pandemic, is taken as the application case study. The model is nevertheless general and works for other applications as well.
COVID-19 is an emerging issue, and very little research has been reported in this context so far, which motivated us to build an efficient framework to predict misleading content spreading during the COVID-19 outbreak. The key contributions of the work are as follows.
• To the best of our knowledge, we are the first to build a unique platform (Facts collector) for collecting crucial facts and knowledge from three prominently used social media and web search platforms (Twitter, YouTube and Google), and to provide different mechanisms for building the search query so as to obtain efficient and relevant results.
• Four sets of novel features, based on content, linguistic/semantic cues, similarity, and sentiment, are gathered from the web platforms and fed into an ensemble-based machine learning model that classifies the news as misleading or real. Finally, voting is applied to validate the news and to check the confidence/support given by the different web platforms.
• COVID-19 is an emerging issue, and to our knowledge no work has yet been reported on predicting the fake news propagating during this phase. This work contributes such an analysis, which can help researchers in further study.
• We investigate the model's performance with different classifiers, and a comparative analysis shows that the proposed method outperforms other state-of-the-art methods on the same dataset.
The remainder of this paper is organized as follows. Section 2 discusses previous work related to this field. Section 3 discusses the problem statement and the unique fact collector platform. Section 4 elaborates the strategy/method employed for misleading information detection, followed by a discussion of experimental results in Section 5. Lastly, the paper is concluded with some suggestions for future work.
Spreading false information is one of the crucial problems of the current era. It is quite difficult for online users to discriminate fake news from real news, which is why an intelligent system is required. Most of the methods proposed in earlier state-of-the-art work [12] to detect misleading information treat it as a classification problem that associates a label, true or false, with a particular claim/post. From the survey, we observed that the classification approaches are in turn divided into approaches based on machine learning and approaches based on deep learning. A detailed description is given below.

Studies have shown that machine learning algorithms are extremely useful in countering numerous problems in the information engineering field. In particular, many of the machine learning approaches implemented for misleading/fake information detection are applied as supervised learning strategies [13]. Among machine learning classification algorithms, support vector machines (SVMs) are among the most widely used. The authors of [14] propose a graph-kernel-based SVM classifier that detects rumors using propagation structure and content features, with an accuracy of 0.91 on a Sina Weibo dataset. In [15], the author reports a set of features to distinguish among fake news, real news, and satire. SVMs are also employed for clickbait detection in [16]. Like SVMs, random forests have been exploited in numerous works on fake news and rumor detection. Most studies report the random forest as a strong performer among machine learning algorithms [17], [18], [19], [20], [21]. In [18], the author proposes a set of temporal, structural, and linguistic features for classifying rumors in a tweet graph with a random forest, reaching an accuracy of 0.90.
The random forest has also been used for stance detection in [17], [22]. Comparative studies of different approaches in the context of rumor and fake news have shown competitive performance for logistic regression [19], [23], [24], [25]. The authors of [26] employ logistic regression for stance classification of news articles or headlines with respect to claims. Another widely studied family of algorithms for misleading content detection is the decision tree [27]. The effectiveness of decision tree algorithms such as J48 relative to other machine learning paradigms, including SVMs, has been reported in [8], [19]. The authors of [8] use content- and context-based features to evaluate the credibility of tweets, and their model performs well with an accuracy of 0.86. In [28], the author proposes a series of user trust metrics to evaluate the trustworthiness of social media users with a decision tree and reports an accuracy of 0.75.

Deep learning is one of the prominent and widely explored research topics in machine learning. Its main advantage over traditional machine learning approaches is that it does not rely on manually crafted features, which reduces feature extraction time. In addition, a deep learning framework can learn hidden representations from simpler inputs, in both context and content variation [29]. The two prominent and widely used paradigms in modern artificial neural networks are the RNN and the CNN. In [29], the authors propose RNN architectures, namely tanh-RNN, LSTM, and Gated Recurrent Unit (GRU), for the detection of rumors. The GRU obtained the best results on both datasets considered, with accuracies of 0.88 and 0.91, respectively.
In [30], the author proposes a multi-task learning framework with an LSTM layer shared among all tasks to counter the problem of rumor classification. CNNs have been widely studied for image recognition and many other fields of computer vision, and they are now gaining popularity in the NLP field as well [31]. The authors of [22] explore a CNN with single and multi-word embeddings for both stance and veracity classification of tweets, and report an accuracy of 0.70 for stance classification and 0.53 for veracity classification. In [32], paragraph embeddings are used to learn the representation of a small group of posts about a specific event, and these are used as input to a CNN model, which achieves an accuracy of 0.93 on Sina Weibo and 0.77 on Twitter. Much of the recent work combines RNNs and CNNs in a single model [33], [34], [35]. The authors of [35] propose an architecture, applied to the LIAR dataset, that encodes text information with a CNN and metadata about the author of the text with an LSTM layer. This hybrid model outperforms all other baselines, including a bi-LSTM architecture, with an accuracy of 0.27 on the test set. In [33], the author proposes an approach based on repost sequence patterns for the detection of false rumours. All the approaches discussed above use different machine learning and deep learning methods for the prediction of misleading content. In our study, we use a multi-web-platform framework to gather effective clues for predicting false information.
Clues can be gathered from multiple web platforms to obtain stronger support, since one platform may not give effective clues for predicting a piece of information while another can. Following this idea, instead of relying on one specific platform, our proposed model incorporates multiple web platforms for retrieving clues concerning a specific query. To the best of our knowledge, none of the previous studies has explored this concept. In addition, very few studies incorporate a unique platform that can collect information from various social media and web sources for building data. We discuss all these points in detail in later sections.

In this paper, we consider a binary classification problem. We assume that the

In this section, we elaborate our automated Multi-Web Platform Voting Framework to Predict Misleading Information Proliferated during the COVID-19 outbreak. The detailed flow diagram of the proposed model is shown in Fig. 3. In the first phase, the user supplies the input query that he or she wants to validate. The input query passes through a text processing phase in which the data is cleaned into a format suitable for further processing. This includes stopword removal, duplicate removal, handling of missing data, stemming, punctuation removal, translation of the text into English (Google translation API), and removal of URLs, symbols, emoji, etc. After text processing, the cleaned data is passed to the next module, called "Fact Collection". In this phase, the input is passed to the multi-web platform to retrieve relevant facts concerning the query. The two prominent platforms used to retrieve the facts are YouTube and Google. For building the query, we adopted case 1, as the other cases have some limitations, as discussed in Table 1. Table 1 describes the 3 possible cases we considered for building the query.
Table 1. Cases considered for building the query.
Case 1. Query: Stopword_removal(Input_query) + " " + "fake news". Limitations: -
Case 2. Query: N_grams(Input_query) + " " + "fake news". Limitations: 1. Sometimes the context of the query does not come out properly and is missed.
Case 3. Query: Pos(Input_query) + " " + "fake news". Limitations: 1. Sometimes the context of the query does not come out properly. 2. In some cases too many irrelevant facts are returned and the results go out of context.

In the second phase, the query is created by removing stopwords to make it clear and short, since in many cases stopwords make the query too long, which leads to irrelevant responses. After the stopwords are removed, a space and the keyword "fake news" are appended to the query. In our analysis this case of query generation worked well, and it is the one used in this study. The other cases concatenate the keyword "fake news" either with the N-grams of the query or with the proper nouns found by part-of-speech (POS) tagging of the input query; in the latter case, each proper noun is concatenated with the "fake news" keyword to build a query. However, these cases have certain limitations: sometimes the context of the query does not come out properly and relevance is lost. After this step, the built query is passed to two prominently used platforms, YouTube and Google, to retrieve relevant facts. Finally, the top 10 title headings are scraped automatically from both platforms using Selenium and are used for further analysis. The algorithm for fact collection is shown in Algorithm 1, which describes how facts are collected from the multi-web platform. In this study, we incorporate two social media and web search platforms for retrieving efficient facts/title headings concerning a query; these are then used in feature engineering and to validate the claim as fake or real.
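The text cleaning and case 1 query construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small `STOPWORDS` set, the function names, and the `fetchers` callables (hypothetical stand-ins for the Selenium-based scraping of Google and YouTube) are all introduced here for illustration.

```python
import re
import string
from typing import Callable, Dict, List

# Illustrative subset only (assumption); a real system would likely use a
# fuller stopword list such as the one shipped with NLTK.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "and", "or", "of",
             "to", "in", "on", "with", "that", "this", "can", "it", "for", "by"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text: str) -> List[str]:
    """Lowercase, then drop URLs, emoji/symbols, punctuation, and stopwords."""
    text = URL_RE.sub(" ", text.lower())
    # Dropping non-ASCII characters removes emoji; it assumes the text has
    # already been translated to English, as the pipeline above describes.
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_query(claim: str, keyword: str = "fake news") -> str:
    """Case 1 of Table 1: stopword-free claim + " " + "fake news"."""
    return " ".join(clean_text(claim)) + " " + keyword


def collect_facts(claim: str,
                  fetchers: Dict[str, Callable[[str], List[str]]],
                  top_k: int = 10) -> Dict[str, List[str]]:
    """Send the built query to every platform and keep the top_k titles.

    Each fetcher is a hypothetical callable that returns result titles for
    a query string; the scraping itself is outside this sketch.
    """
    query = build_query(claim)
    return {name: fetch(query)[:top_k] for name, fetch in fetchers.items()}
```

For example, `build_query("Drinking warm water with salt can kill the coronavirus!")` yields the query `drinking warm water salt kill coronavirus fake news`. Passing the fetchers in as callables keeps the query logic testable without a browser.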
Other platforms such as Twitter were also explored for the collection of facts. The issue with Twitter is that it supports keyword-based searching, and long query-based search is not applicable, which is a major obstacle to collecting relevant facts. With Google web search and YouTube, by contrast, we can fetch efficient responses concerning a query.

The third module is fact validation. This module takes the facts collected in the previous step and uses them to derive efficient clues for predicting the claim as fake or real. Four sets of features are employed, based on content, linguistic/semantic cues, similarity, and sentiment. Each feature category is discussed in detail below.

Content-based features have been widely explored in numerous data mining research fields. In this paper, we incorporate two content-based features for the prediction of misleading information: question mark count and fake word count. The question mark count gives clues about the confidence reflected in a sentence. If the sentence shows uncertainty, the claim is not sure about the event, and the question mark count plays a major role in finding that uncertainty. It returns true if any question mark is encountered in a title/heading retrieved from the web platform while searching a specific query. The fake_word_count is another important feature that discriminates fake from real. A false_phrase_corpus holds a list of keywords that are prominently used to indicate fake news. Whenever one of these keywords is encountered in the retrieved responses corresponding to a query, the fake count is incremented by 1. The feature is helpful in identifying fake news, as a title containing these phrases is more likely to indicate that the news is fake.
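The two content-based features described above can be sketched as follows. The phrase list `FALSE_PHRASES` is a hypothetical stand-in for the paper's false_phrase_corpus, whose actual contents are not given in the text, and counting one hit per (title, phrase) pair is likewise an assumption about how the fake count is incremented.

```python
from typing import Dict, List, Union

# Hypothetical stand-in for the false_phrase_corpus (assumption).
FALSE_PHRASES = ("fake", "hoax", "false claim", "myth",
                 "misleading", "debunked", "no evidence")


def content_features(titles: List[str]) -> Dict[str, Union[bool, int]]:
    """Question-mark and fake-word features over the retrieved titles."""
    lowered = [t.lower() for t in titles]
    # Number of titles that contain at least one question mark.
    question_titles = sum("?" in t for t in lowered)
    # One increment for every (title, phrase) pair where the phrase occurs.
    fake_count = sum(1 for t in lowered for p in FALSE_PHRASES if p in t)
    return {
        "question_mark": question_titles > 0,
        "question_mark_count": question_titles,
        "fake_word_count": fake_count,
    }
```

A title such as "Fake news: no evidence gargling works" contributes 2 to `fake_word_count` under this counting rule, because it matches both "fake" and "no evidence".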
It is very difficult to process raw text intelligently, since the same words in a different order can mean something completely different; linguistic knowledge makes it possible to solve some of these problems starting from the raw characters alone. For a given claim, it is very important to understand the context in which it is used. The Python function nltk.pos_tag is designed for this purpose. Given a tokenized text as input, it returns a list of (word, tag) pairs. NLTK uses a statistical model to predict which tag or label most likely applies in the given context. This is called part-of-speech (POS) tagging or grammatical tagging: marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context. POS tagging also describes the characteristics of lexical terms within a sentence or text, which can then be used to make predictions/assumptions about semantics. To compute the semantic text similarity between two sentences, we use POS-based text similarity. Different POS tags can be assigned to each word in a sentence. The NLTK POS tagger is employed to assign grammatical information to each word of the sentence. This feature is useful in computing the semantic text similarity between the user query and the clues retrieved from the web platforms. The tags generated by nltk.pos_tag are converted to the tags used by wordnet.synsets, and NLTK's WordNet synsets are used to measure the similarity.

Another category of feature used in this work is based on similarity. This feature helps separate the relevant titles/headings from all the retrieved responses, as not all responses are useful for validation. For the model to perform efficiently, irrelevant titles must be removed from the analysis; only those that cross a threshold value are used.
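The conversion from nltk.pos_tag output to WordNet tags mentioned above can be sketched as below. The mapping relies on the Penn Treebank tag prefixes (NN, VB, JJ, RB) and WordNet's single-letter POS codes; the sentence-level scoring rule and the injected `word_sim` callable are assumptions made for illustration, since the text does not spell out how word-level scores are aggregated.

```python
from typing import Callable, List, Optional, Tuple

Tagged = List[Tuple[str, str]]  # shape of nltk.pos_tag output: (word, tag)


def penn_to_wordnet(tag: str) -> Optional[str]:
    """Map a Penn Treebank tag to a WordNet POS code, or None if unmapped."""
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("RB"):
        return "r"  # adverb
    return None


def sentence_similarity(tagged_a: Tagged, tagged_b: Tagged,
                        word_sim: Callable[[str, str, str], float]) -> float:
    """Average best-match score of each content word in A against B.

    word_sim(word_a, word_b, pos) is a hypothetical word-level similarity;
    in practice it could be built on WordNet synsets of the given POS.
    """
    a = [(w.lower(), penn_to_wordnet(t)) for w, t in tagged_a]
    b = [(w.lower(), penn_to_wordnet(t)) for w, t in tagged_b]
    a = [(w, p) for w, p in a if p]
    b = [(w, p) for w, p in b if p]
    if not a:
        return 0.0
    total = 0.0
    for word, pos in a:
        # Only compare words that share the same WordNet part of speech.
        scores = [word_sim(word, other, pos) for other, q in b if q == pos]
        total += max(scores, default=0.0)
    return total / len(a)
```

Injecting `word_sim` keeps the sketch runnable without downloading the WordNet corpus; an exact-match function is enough to exercise the aggregation logic.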
One of the most prominently used similarity measures, cosine similarity, is used in this work to compute the similarity between two sentences irrespective of their size. The sentences are treated as two vectors, and the cosine similarity is computed from the angle θ between them. If the angle between two sentences is 0°, they are similar; if θ = 90°, they are dissimilar. The similarity between two sentences x and y is given by:

$$\cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

The sentiment-based features are as follows.
1) Query sentiment: the sentiment of the input query given by the user.
2) Title/heading sentiment: the sentiment of the responses (titles/headings) received as search results for a specific query.
3) Sentiment match count: out of the 10 responses retrieved from the web platforms, the number of times the sentiment of the query matches the sentiment of a title. It indicates whether the sentiment posed by the input query is equivalent to that of the responses received, that is, whether the query and the heading carry the same polarity.
All the features discussed above are summarized in Table 2.

In this section, the experimental analysis is performed on a publicly available dataset. Different performance measures (precision, recall, F1-score, accuracy, etc.) are adopted to measure the effectiveness of the proposed method, and the results are then presented together with a comparative analysis against other state-of-the-art methods. This section covers each of these points in the following subsections.

In this paper, we use the dataset of the Constraint-2021 shared task on detecting COVID-19 fake news in English. The dataset is collected from various social media such as Twitter, Facebook, Instagram, etc. The objective of the task is to classify a given social media post as fake or real. The dataset contains 10,700 manually annotated social media posts and articles of fake and real news on COVID-19 [36].
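The cosine similarity filter and the sentiment match count described above can be sketched as follows. The bag-of-words vectorization, the `threshold` value of 0.3, and the string sentiment labels are assumptions made for illustration; the text does not state which vector representation, threshold, or sentiment analyser was used.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(x: str, y: str) -> float:
    """Cosine of the angle between bag-of-words count vectors of x and y."""
    vx, vy = Counter(x.lower().split()), Counter(y.lower().split())
    dot = sum(vx[t] * vy[t] for t in vx.keys() & vy.keys())
    norm_x = math.sqrt(sum(c * c for c in vx.values()))
    norm_y = math.sqrt(sum(c * c for c in vy.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # an empty sentence is treated as dissimilar to anything
    return dot / (norm_x * norm_y)


def filter_relevant(query: str, titles: List[str],
                    threshold: float = 0.3) -> List[str]:
    """Keep only the titles whose similarity to the query meets the threshold.

    The default threshold is an illustrative assumption.
    """
    return [t for t in titles if cosine_similarity(query, t) >= threshold]


def sentiment_match_count(query_sentiment: str,
                          title_sentiments: List[str]) -> int:
    """Number of retrieved titles whose polarity equals the query's."""
    return sum(1 for s in title_sentiments if s == query_sentiment)
```

Identical sentences give a similarity of 1 (θ = 0°), and sentences with no words in common give 0 (θ = 90°), matching the interpretation in the text.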
The dataset is further split into training, validation, and test sets in the ratio 3:1:1, as shown in Table 3. The experimental analysis is performed with various machine learning algorithms, including Logistic Regression (LR), Support Vector Machine (SVM), Random Forest, and an ensemble-based classification model. We used this dataset to measure the performance of our model with respect to precision, recall, F1-score, and accuracy. Since we incorporate multiple web platforms, the analysis is performed on each web platform separately as well as on the hybrid (Google + YouTube) of both. The results of these experiments are shown in Table 4, Table 5, and Table 6, respectively. The first study incorporates the clues retrieved from the Google platform for a specific query. The performance has been analysed majorly on four having value 0.980 as shown in Table 6.

Some earlier studies have also worked on this problem and reported results on the Constraint 2021 COVID-19 fake news dataset. The comparative study with other state-of-the-art methods on the validation set is shown in

In this paper, we developed an intelligent, generalized strategy for identifying possible clues to predict misleading information, in which the fake news that proliferated during the COVID-19 outbreak is considered as a special case study and analysed in detail. We proposed an automated Multi-Web Platform Voting Framework that uses YouTube and Google as the major sources for the retrieval of clues. Four sets of novel features, based on content, linguistic/semantic cues, similarity, and sentiment, are gathered from these platforms and fed into an ensemble-based machine learning model that classifies the news as misleading or real. Retrieving clues from multiple web platforms improves the performance of the model, and with the ensemble-based classification model it outperforms other state-of-the-art techniques on the same dataset.
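As a rough illustration of the 3:1:1 split and of the combined voting over platforms that the framework relies on, the sketch below uses plain majority voting. The function names, the label strings, and the tie-handling rule are assumptions; the text does not specify how ties or vote weights are handled, so this should not be read as the authors' voting scheme.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def split_counts(total: int, ratio: Sequence[int] = (3, 1, 1)) -> Tuple[int, ...]:
    """Sizes of the train/validation/test sets for a given ratio."""
    return tuple(total * r // sum(ratio) for r in ratio)


def combined_vote(predictions: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Majority label across sources plus the share of votes backing it.

    predictions maps a source name (for example a web platform) to its
    predicted label. A tie returns (None, share), which is an assumption.
    """
    counts = Counter(predictions.values()).most_common()
    top_label, top_votes = counts[0]
    support = top_votes / len(predictions)
    if len(counts) > 1 and counts[1][1] == top_votes:
        return None, support  # no majority among the sources
    return top_label, support
```

With the 10,700 posts of the dataset, `split_counts(10700)` gives 6,420 training, 2,140 validation, and 2,140 test posts. The support value returned by `combined_vote` is one simple way to express how strongly the sources agree on a claim.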
In the future, we plan to incorporate and explore other platforms (Instagram, WhatsApp, etc.) to validate the news, and to expand the work by including other modalities of data (images, videos, etc.). We also plan to build a real-time application that lets users predict misleading content.

Comprehensive update on current outbreak of novel coronavirus infection (2019-nCoV)
Analysing and Identifying Crucial Evidences for the prediction of False Information proliferated during COVID-19 Outbreak: A Case Study
Detection and visualization of misleading content on Twitter
Effects of media reporting on mitigating spread of COVID-19 in the early phase of the outbreak
A unified approach for detection of Clickbait videos on YouTube using cognitive evidences
Fake news, rumor, information pollution in social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities
HAN, image captioning, and forensics ensemble multimodal fake news detection
Information Credibility on Twitter
Hoax news-inspector: a real-time prediction of fake news using content resemblance over web search results for authenticating the credibility of news articles
Novel Visual and Statistical Image Features for Microblogs News Verification
Detection and veracity analysis of fake news via scrapping and authenticating the web search
A review on rumour prediction and veracity assessment in online social network
A temporal ensembling based semi-supervised ConvNet for the detection of fake news articles
False Rumors Detection on Sina Weibo by Propagation Structures
This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News
Stop clickbait: Detecting and preventing clickbaits in online news media
Simple open stance classification for rumour analysis
Cues to deception in social media communications
Determining the Veracity of Rumours on Twitter
Prominent features of rumor propagation in online social media
#unconfirmed: Classifying rumor stance in crisis-related social media messages
SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours
Some Like it Hoax: Automated Fake News Detection in Social Networks
An exploratory study into deception detection in text-based computer-mediated communication
Emergent: a novel data-set for stance classification
Classification and regression trees
Classification and regression trees
Increasing the veracity of event detection on social media networks through user trust modeling
Detecting Rumors from Microblogs with Recurrent Neural Networks
All-in-one: Multi-task Learning for Rumour Verification
Understanding convolutional neural networks for text classification
Attention-based convolutional approach for misinformation identification from massive and noisy microblog posts
CED: Credible early detection of social media rumors
Fake News Identification on Twitter with Hybrid CNN and RNN Models
'Liar, Liar Pants on Fire': A New Benchmark Dataset for Fake News Detection
Fighting an infodemic: Covid-19 fake news dataset
Constraint 2021: Machine Learning Models for COVID-19 Fake News Detection Shared Task
A Heuristic-driven Ensemble Framework for COVID-19 Fake News Detection