key: cord-0642161-2qmpe3zd
authors: Gautam, Akansha; Jerripothula, Koteswar Rao
title: SGG: Spinbot, Grammarly and GloVe based Fake News Detection
date: 2020-08-16
journal: nan
DOI: nan
sha: bfbed226c13ca49d364635ee0f6fdc7cdb5832c3
doc_id: 642161
cord_uid: 2qmpe3zd

Recently, news consumption using online news portals has increased exponentially due to several reasons, such as low cost and easy accessibility. However, such online platforms inadvertently also become a cause of the spread of false information across the web. They are misused quite frequently as a medium to disseminate misinformation and hoaxes. Such malpractices call for a robust automatic fake news detection system that can keep such misinformation and hoaxes at bay. We propose a robust yet simple fake news detection system that leverages tools for paraphrasing, grammar-checking, and word-embedding. In this paper, we explore the potential of these tools in jointly unearthing the authenticity of a news article. Notably, we leverage Spinbot (for paraphrasing), Grammarly (for grammar-checking), and GloVe (for word-embedding) for this purpose. Using these tools, we were able to extract novel features that, when combined with some essential features, yield state-of-the-art results on the FakeNewsAMT dataset and comparable results on the Celebrity dataset. More importantly, the proposed method is found to be empirically more robust than the existing ones, as revealed in our cross-domain and multi-domain analyses.

News consumption using online news portals has increased exponentially in modern society due to lower cost, easy accessibility, and rapid dissemination of information. Nevertheless, it also encourages the spread of low-quality information and hoaxes. For example, since the beginning of the COVID-19 pandemic, the distribution of false information has negatively affected individuals and society.
Misinformation about the coronavirus ranged from false assertions to harmful health advice. Table I shows one example each of a COVID-19-related fake and legitimate article. News is often new, and this novelty, whether an article is legitimate or fake, calls for robustness in fake news detection. That is, fake news detection models should not be domain-dependent. Given several such challenges, fake news detection [1] has continued to attract researchers over the last decade. A relatively early study [2] on deceptive opinion spam found that a combined approach, involving computational-linguistic features, psychologically motivated features, and n-gram features, performed better in detecting such spam. Various methods have been proposed for automatic fake news detection, covering quite complex applications. Previous work [3] introduced two novel fake news datasets, one obtained through crowdsourcing that covers six news domains and another collected from the web. The authors also proposed classification models that rely on lexical, syntactic, and semantic information to detect misinformation. To detect satirical news, [4] proposed an SVM-based algorithm built on Absurdity, Humor, Grammar, Negative Affect, and Punctuation features that aids in minimizing the potentially deceptive impact of satire. A recent Science Daily report [5] indicated that the spread of fake news on social networking sites is a pernicious trend in modern society, with dire implications for the 2020 presidential election. It also shows that public engagement with false news is higher than with legitimate news from mainstream sources, which makes social media a powerful channel for propaganda. In this paper, we develop features using Grammarly, Spinbot, and a GloVe-based model. A previous study [6] has shown that fake news articles often carry many grammatical mistakes.
Table I. Example of COVID-19 fake and legitimate news articles.
News Type | Content
True News | Who can donate plasma for COVID-19? "In order to donate plasma, a person must meet several criteria. They have to have tested positive for COVID-19, recovered, have no symptoms for 14 days, currently test negative for COVID-19, and have high enough antibody levels in their plasma. A donor and patient must also have compatible blood types. Once plasma is donated, it is screened for other infectious diseases, such as HIV. Each donor produces enough plasma to treat one to three patients. Donating plasma should not weaken the donor's immune system nor make the donor more susceptible to getting reinfected with the virus."
Fake News | Due to the recent outbreak for the Coronavirus (COVID-19) the World Health Organization is giving away vaccine kits. Just pay $4.95 for shipping. "You just need to add water, and the drugs and vaccines are ready to be administered. There are two parts to the kit: one holds pellets containing the chemical machinery that synthesises the end product, and the other holds pellets containing instructions that tell the drug which compound to create. Mix two parts together in a chosen combination, add water, and the treatment is ready."

We use Grammarly to extract features that capture such grammatical mistakes and thereby help identify fake articles. Spinbot is a paraphrasing tool that is used to make the article content look scholarly. We use an innovative technique to extract sentiment features based on GloVe and K-Means. GloVe has been successful at representing the meaning of textual data as feature vectors. We use GloVe to transform articles into feature vectors and then apply K-Means to the set of feature vectors, so that each cluster groups similar words together. After feature extraction, we rely on machine learning algorithms to separate the two categories of news articles.
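To make the GloVe and K-Means step more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes toy 2-dimensional vectors in place of real GloVe embeddings, a plain Lloyd-style k-means written from scratch with hand-picked initial centers, and a per-article count of the words that fall into each cluster. The names `toy_embeddings`, `kmeans`, and `article_histogram` are hypothetical, and the histogram step in particular is an assumption about how cluster membership could be turned into article features.

```python
import math

# Toy 2-D "embeddings" standing in for real GloVe vectors (assumption).
toy_embeddings = {
    "good": (0.9, 0.8), "great": (1.0, 0.9), "happy": (0.8, 1.0),
    "bad": (-0.9, -0.8), "awful": (-1.0, -0.9), "sad": (-0.8, -1.0),
}

def nearest(vec, centers):
    """Return the index of the center closest to vec (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(vec, centers[j]))

def kmeans(vectors, init_centers, iterations=10):
    """Lloyd iterations: assign each vector to its nearest center,
    then replace each center by the mean of its assigned vectors."""
    centers = list(init_centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in vectors:
            groups[nearest(v, centers)].append(v)
        centers = [
            tuple(sum(dim) / len(g) for dim in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers

def article_histogram(words, embeddings, codebook):
    """Count how many known words of an article fall into each cluster."""
    hist = [0] * len(codebook)
    for w in words:
        if w in embeddings:
            hist[nearest(embeddings[w], codebook)] += 1
    return hist

vectors = list(toy_embeddings.values())
# Cluster means act as the codes of a small codebook (k = 2 here).
codebook = kmeans(vectors, init_centers=[vectors[0], vectors[3]])
hist = article_histogram("a great and happy but sad day".split(),
                         toy_embeddings, codebook)
print(codebook, hist)
```

With real data, the toy dictionary would be replaced by pretrained GloVe vectors and k would be chosen empirically; words missing from the embedding vocabulary are simply skipped in this sketch.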
Our proposed method correctly identifies the fake and legitimate COVID-19 news articles shown in Table I. It also outperforms previously established works in identifying misinformation across the web. Figure 1 presents our novel approach to the task of detecting whether a news article is fake or not. We develop innovative features based on Spinbot, Grammarly, and GloVe-based models. We also use other essential features to build the classifier for fake news detection. We conduct several experiments to robustly identify linguistic properties that are predominantly present in false news content.

Detection of misinformation has become an emerging research topic that attracts both the general public and researchers. In recent years, a practical approach to fake news detection has become necessary, as the problem poses a significant challenge to modern society. Each research paper brings its own ideas and strategies to the problem. A substantial group of studies develops machine learning classifiers [7] to automatically detect misinformation using a variety of news characteristics. Some prior studies show that the performance of fake news detection models improves when using features based on linguistic characteristics derived from the text of news stories. Previous work [8] has shown that newsworthy articles tend to contain URLs and to have deep propagation trees. The authors built a classification model based on a set of linguistic features, such as special characters, positive and negative sentiment words, and emojis, to determine information credibility on Twitter. Context-Free Grammar (CFG) parse trees [9] based on shallow lexical-syntactic features were applied to build a detection model. Lexicon patterns and part-of-speech tags [10] were also used.
Named entities, retweet ratio, and clue keywords [11] were analyzed to identify rumor tweets on Twitter. Another study used linguistic content such as swear words, emotion words, and pronouns [12] to assess the credibility of a tweet. The language style and source reliability of articles reporting a claim [13] were also used to determine the claim's likelihood. To differentiate fake news from legitimate news, lexical, syntactic, and semantic information [3] was used. Another group of approaches builds fake news detection models based on temporal-linguistic features. News entities [14] were categorized into content, social, and temporal dimensions, revealing mutual relations and dependencies that were used to analyze fake news dissemination. A classifier [15] was built based on inquiry phrases extracted from user comments. Prior work [16] has shown that leveraging non-expert, crowdsourced workers provides a useful way to detect fake news in popular Twitter threads. A comprehensive review of identifying fake news on social media [17] included fake news characterizations based on psychology and social theories. A bimodal variational autoencoder with a binary classifier [18], based on multimodal (textual+visual) [19], [20], [21] information, helped classify posts as fake or not. The work of [22] described the significance of stylistic cues in determining the truthfulness of text. Text fragments that contain propaganda techniques, together with their types [23], were detected to design a method that performs fine-grained analysis of texts. Another group of researchers [24] exploited both the textual and visual features of an article. In [25], a framework was proposed in which information captured from an image [26], [27] is processed and compared with information from a trusted party to detect tampered or photoshopped tweet images. In this paper, we present the extraction of novel features based on Grammarly, Spinbot, and the GloVe-based model.
To the best of our knowledge, no previous work has used this kind of approach in extracting features for fake news detection [28], [29].

This section describes how we use the Spinbot, Grammarly, and GloVe tools to develop a robust yet simple fake news detection system. We give an overview of our proposed approach in Figure 2. First of all, we paraphrase the original article (denoted as $o_i$) and obtain a paraphrased article (denoted as $p_i$). Then, we extract Grammarly features for both the original article and the paraphrased article. Let us denote this feature extraction function as $f_{gr}(\cdot)$. Similarly, we obtain GloVe-based features (denoted as $f_{gl}(\cdot)$) for both the original and paraphrased articles. In addition to these proposed features, we also extract some essential features ($f_{es}(\cdot)$), such as TF-IDF, emotion, and lexical features, again for both the original and paraphrased articles. We concatenate [30] these features into a single vector $f_i$ as shown below:

$$f_i = \left[\, f_{gr}(o_i),\ f_{gr}(p_i),\ f_{gl}(o_i),\ f_{gl}(p_i),\ f_{es}(o_i),\ f_{es}(p_i) \,\right] \qquad (1)$$

which is then fed to the random forest classifier (denoted by the function $RF(\cdot)$) to output the prediction $y_i$, fake news or real news, as shown below:

$$y_i = RF(f_i) \qquad (2)$$

where we essentially predict the authenticity of a given original article and its paraphrased article using an RF model learned through training on labeled data. Let us now discuss each of these steps in greater detail, one by one.

We use paraphrasing to obtain paraphrased content that looks scholarly. Assuming persons who spread fake news are not so scholarly, such paraphrasing serves two purposes in false news detection. If they did not use such tools, the paraphrased article would be too distant from the original one in terms of the features we develop. If they did use one, they would copy-paste the paraphrased article, leading to too much closeness. Either way, fake articles will be distinguishable from real articles, which may lie somewhere in between, neither too close nor too distant.
Real articles would not be too close because such automated tools may not completely capture the writer's real intention, so the writer will change the output a bit. We use the Spinbot tool for this paraphrasing task. Spinbot is a free automatic article spinning tool that rewrites human-readable text into additional, readable text and helps in optimizing content generation. Spinbot allows us to copy text and paste it into a text box. After selecting a couple of basic options and confirming that one is human, users receive a paraphrased version of the original text in the text box below. This tool is sophisticated; it makes the content look very scholarly. It rephrases text of up to 10,000 characters in a single go. Spinbot replaces many words with their synonyms to come up with plausibly "new" content. Spinbot provides unique, quality textual content that can quickly acquire legitimate web visibility in terms of human readership and search engine exposure. Let us denote this tool as a function $SP(\cdot)$ that takes $o_i$ as input and outputs $p_i$, as mentioned in Fig. 2 and below:

$$p_i = SP(o_i)$$

Table II. Grammarly features and their descriptions.
There are four sentence structures present in the English literature: simple sentences, complex sentences, compound sentences, and complex-compound sentences. This feature checks the proper usage of sentence structure in articles, as it detects and suggests corrections for missing verbs, faulty parallelism, and incorrect adverb placement.
Punctuation: Misplaced punctuation changes the meaning of sentences at a higher level. This feature spots misplaced punctuation and makes real-time suggestions to correct it.
Style: There are four major writing styles: expository, descriptive, persuasive, and narrative. The style feature analyzes the article to improve how it communicates with readers. Grammarly flags vague and redundant words that reduce the value of the content and suggests more engaging synonyms to add variety to articles.
Plagiarism: Grammarly's plagiarism checking tool checks 16 billion other articles across the web to correctly identify and cite text that is not 100% original.
This feature provides general metrics that include the character count, word count, and sentence count of the writing. It also gives the time required to read and to speak the article.
Readability: These features indicate text understandability. They compute word and sentence lengths and report a readability score that measures how likely the text is to be understood by a reader.
Vocabulary: It measures vocabulary depth by identifying words that are not among the 5,000 most common English words. It also measures unique words by counting the number of individual words.

The first set of features we extract is mistakes-based, assuming that the English writing of persons who spread fake news is not that good. To capture these mistakes, we use Grammarly. Grammarly is a proofreading application that uses artificial intelligence and natural language processing to detect mistakes in grammar, spelling, punctuation, word choice, and style. It is a comprehensive writing tool that provides real-time suggestions to correct these errors. Grammarly checks writing on various aspects, such as tone, confident language, politeness, formality level, and inclusive language, before making a decision. Grammarly scans the document with an AI assistant, automatically detects mistakes, suggests corrections, and shows the rationale behind those suggestions. Interestingly, it can detect up to 250 types of errors. Table II shows Grammarly's significant features with their descriptions [31]. Table III shows the Grammarly features extracted from the fake and legitimate news articles shown in Table IV.
For numerous features, the values differ markedly between these two examples, in line with the assumption made.

GloVe embedding is a popular word embedding algorithm. Given a word, it outputs a numeric feature vector of n dimensions. Since we need to classify articles, not words, we use a codebook approach to represent an article, as explained below. We first build a corpus using the different words present in a domain. Let us say there is a total of m words in the corpus. We use GloVe to convert this corpus into an embeddings corpus, which can be seen as a matrix of dimensions m × n that stacks all the feature vectors vertically. We then apply k-means [32], [33], [34], [35], [36] to the embeddings corpus to divide the matrix into k clusters of feature vectors. We compute the means of the feature vectors in the different clusters and assign them as codes in our embeddings codebook. The idea is to develop a frequency-

Our essential features can be classified into three categories: basic, TF-IDF, and emotion. Let us discuss each of them one by one.
1) Basic: We extract a basic feature set consisting of 7 types of lexical features for each news article: unique word count, stopword count, URL count, mean word length, hashtag count, numeric count, and uppercase count.
2) TF-IDF: We extract unigrams and bigrams from the TF-IDF representation of each news article. Term Frequency-Inverse Document Frequency (TF-IDF) [37] weights the frequency of a term in a document by its inverse document frequency. Words that are frequent in a document but appear in few documents of the collection receive a high TF-IDF, while words such as function words, although they appear frequently in every document, receive a low TF-IDF because of their low inverse document frequency.
3) Emotion Lexicon: We perform sentiment analysis on the corpus of news articles using TextBlob [38].
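As an illustration of the seven basic lexical features listed under item 1), here is a minimal sketch using only the Python standard library. The whitespace tokenization, the tiny stopword list, the choice to count uppercase words rather than uppercase characters, and the function name `basic_features` are all assumptions made for this example; the paper does not specify these details.

```python
import re

# Tiny illustrative stopword list (assumption; a real list would be larger).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "in", "on", "for"}

def basic_features(text):
    """Compute seven simple lexical counts for one article."""
    tokens = text.split()
    # Strip leading/trailing punctuation from each token to get plain words.
    words = [re.sub(r"^\W+|\W+$", "", t) for t in tokens]
    words = [w for w in words if w]
    return {
        "unique_word_count": len({w.lower() for w in words}),
        "stopword_count": sum(w.lower() in STOPWORDS for w in words),
        "url_count": sum(t.startswith(("http://", "https://")) for t in tokens),
        "mean_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        "hashtag_count": sum(t.startswith("#") for t in tokens),
        "numeric_count": sum(w.isdigit() for w in words),
        # Counts fully uppercase words (assumption about what "uppercase count" means).
        "uppercase_count": sum(w.isupper() for w in words),
    }

sample = "BREAKING: WHO is giving away 100 vaccine kits #COVID19 https://example.com"
features = basic_features(sample)
print(features)
```

The resulting dictionary can be flattened into a fixed-order vector and concatenated with the other feature blocks for the original and the paraphrased article.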
Sentiment analysis [39] determines the attitude or the emotion of the writer, i.e., whether it is positive, neutral, or negative. The sentiment function of TextBlob returns two properties: polarity and subjectivity. Polarity is a float value in the range [-1, 1], where 1 stands for a positive statement and -1 stands for a negative statement. Subjective sentences refer to personal opinion, emotion, or judgment, whereas objective sentences refer to factual information. Subjectivity is also a float value, which lies in the range [0, 1].

The different features obtained for both the original and the paraphrased articles are all concatenated into a feature vector. We apply 5-fold cross-validation to develop our random forest classifier for predicting whether an article is fake or real. Random forest is an ensemble tree-based learning algorithm that fits decision trees on sub-samples of the data and then aggregates their predictions, which improves accuracy and controls over-fitting.

In this section, we first describe the datasets and how we evaluate our approach. Then, we discuss the results we obtain by evaluating our method on these datasets and compare them with prior art on the same datasets. Our experiments are conducted on two publicly available benchmark datasets for fake news detection [3], namely Celebrity and FakeNewsAMT. The FakeNewsAMT dataset was collected by combining manual and crowdsourced annotation efforts and comprises a corpus of 480 news articles covering six news domains (i.e., Technology, Business, Education, Sports, Politics, and Entertainment). The news of the FakeNewsAMT dataset was obtained from various mainstream websites in the US, such as ABCNews, CNN, USAToday, NewYorkTimes, FoxNews, Bloomberg, and CNET, among others. Table IV shows examples of fake and legitimate news articles drawn from the FakeNewsAMT dataset [3]. The Celebrity dataset contains 500 celebrity news articles, collected directly from the web.
The news of the Celebrity dataset was obtained predominantly from online magazines such as Entertainment Weekly, People Magazine, Radar Online, and other entertainment-oriented publications. The two categories of news (fake and legitimate) are evenly distributed in both datasets. We have extended these two datasets with our rich features. We analyze our method's performance using metrics such as the confusion matrix and accuracy. The confusion matrices for the Celebrity and FakeNewsAMT datasets are given in Figures 4 and 5.

Table IV. Examples of fake and legitimate news articles from the FakeNewsAMT dataset [3].
Type | Content
Fake News | Super Mario Run to leave app store. The once popular Super Mario Run will be taken out of the Google play and apple app store on Friday. Nintendo says that shortly after its release the public stopped downloading the game when current players had spread the word that in order to play the entire game you had to make an in app purchase. Nintendo and Mario fans are appalled that Nintendo would release a game for free and then charge to play it. Nintendo says they will take the game back to the drawing board, and try and release a free version at a later time.
Legitimate News | How does nutrition affect children's school performance? As politicians debate spending and cuts in President Donald Trump's proposed budget, there have been questions about the effects of nutrition programs for kids. From before birth and through the school years, there are decades-old food programs designed to make sure children won't go hungry. Experts agree that the nutrition provided to millions of children through school meal programs is invaluable for their health.

Tables V and VI give the 5-fold cross-validation accuracy that we obtain on the Celebrity and FakeNewsAMT datasets, respectively, using individual and combined features. Notably, we achieve accuracies of 78% and 95%, respectively, when using all the features. This indicates that our proposed method classifies a large share of the news articles correctly. Tables VII and VIII compare our results with the recent works [40], [41], [3], and [42]. Our results are quite competitive.

Table VII. Comparison with prior work on the Celebrity dataset.
Method | Test Accuracy (%)
Linear SVM [3] | 76
F-NAD [41] | 82.61
Model 1 [40] | 76.53
Model 2 [40] | 79
Proposed Work | 78

Table VIII. Comparison with prior work on the FakeNewsAMT dataset.
Method | Test Accuracy (%)
Linear SVM [3] | 74
EANN [42] | 75.6
F-NAD [41] | 81
Model 1 [40] | 77.08
Model 2 [40] | 83.3
Proposed Work | 95

We perform a cross-domain analysis to test how well our proposed method distinguishes fake news across different domains using all the features. We train our best-performing classifier on the FakeNewsAMT dataset and test it on the Celebrity dataset, and vice versa. Table IX compares the results obtained in the cross-domain experiment. These results suggest that our method robustly outperforms the previous works.

We also explore how the amount of data affects the classifier's accuracy in identifying fake news. We plot the learning curves for the proposed approach and for [3] on both datasets using different fractions of the data for training, as shown in Figures 6 and 7. These results suggest that (1) our proposed method outperforms the previous work [3], and (2) our learning curve shows a steady improvement in both cases. This implies that a larger amount of training data has the potential to improve model performance.

The FakeNewsAMT dataset contains news articles from six domains (business, education, politics, technology, sports, and entertainment). In this multi-domain experiment, we train our classifier on five of the six domains and test on the news articles of the remaining domain.

In this paper, we addressed the task of identifying fake news using online tools.
We introduced two new sets of features, one obtained through Grammarly and another obtained through the GloVe-based model. Additionally, we introduced the paraphrased article into this problem. Our study shows that using features extracted with Grammarly, Spinbot, and the GloVe-based model, together with other essential features such as Basic, N-grams, and Emotion Lexicon, improves model performance significantly. The combination of these features achieved the best performance with the Random Forest classifier. Our proposed method obtains a testing accuracy of 78% on the Celebrity news dataset and 95% on the FakeNewsAMT dataset.

[1] The lie detector: Explorations in the automatic recognition of deceptive language
[2] Finding deceptive opinion spam by any stretch of the imagination
[3] Automatic detection of fake news
[4] Fake news or truth? Using satirical cues to detect potentially misleading news
[5] Red-flagging misinformation could slow the spread of fake news on social media
[6] Fake news detection using naive Bayes classifier
[7] Detection of gait abnormalities caused by neurological disorders
[8] Information credibility on Twitter
[9] Syntactic stylometry for deception detection
[10] Rumor has it: Identifying misinformation in microblogs
[11] Rumor detection on Twitter
[12] TweetCred: Real-time credibility assessment of content on Twitter
[13] Credibility assessment of textual claims on the web
[14] Studying fake news via network analysis: Detection and mitigation
[15] Enquiring minds: Early detection of rumors in social media from enquiry posts
[16] Automatically identifying fake news in popular Twitter thread
[17] Fake news detection on social media: A data mining perspective
[18] MVAE: Multimodal variational autoencoder for fake news detection
[19] CATS: Co-saliency activated tracklet selection for video co-localization
[20] Multimodal analysis of user-generated multimedia content
[21] Efficient video object co-localization with co-saliency activated tracklets
[22] Truth of varying shades: Analyzing language in fake news and political fact-checking
[23] Fine-grained analysis of propaganda in news articles
[24] SpotFake: A multi-modal framework for fake news detection
[25] A framework to detect fake tweet images on social media
[26] Quality-guided fusion-based co-saliency estimation for image co-segmentation and co-localization
[27] Object co-skeletonization with co-segmentation
[28] SpotFake: A multi-modal framework for fake news detection
[29] SpotFake+: A multimodal framework for fake news detection via transfer learning (student abstract)
[30] Image co-segmentation via saliency co-fusion
[31] How can Grammarly help you to write correct blog post?
[32] Some methods for classification and analysis of multivariate observations
[33] Automatic image co-segmentation using geometric mean saliency
[34] QCCE: Quality constrained co-saliency estimation for common object detection
[35] Co-saliency based visual object co-segmentation and co-localization
[36] Group saliency propagation for large scale and quick image co-segmentation
[37] IDF term weighting and IR research lessons
[38] Natural language processing for beginners: Using TextBlob
[39] Feature-level rating system using customer reviews and review votes
[40] A deep learning approach for automatic detection of fake news
[41] F-NAD: An application for fake news article detection using machine learning techniques
[42] Multimodal fake news detection on online socials