key: cord-0459452-pof3wsru
authors: Mazzeo, V.; Rapisarda, A.; Giuffrida, G.
title: Detection of fake news on CoViD-19 on Web Search Engines
date: 2021-03-22
journal: nan
DOI: nan
sha: df69bef43956c7ac835f673b595c3e5bcd076183
doc_id: 459452
cord_uid: pof3wsru

In early January 2020, after China reported the first cases of the new coronavirus (SARS-CoV-2) in the city of Wuhan, unreliable and not fully accurate information started spreading faster than the virus itself. Alongside this pandemic, people have experienced a parallel infodemic, i.e., an overabundance of information, some of it misleading or even harmful, that has spread widely around the globe. Although Social Media are increasingly being used as an information source, Web Search Engines, like Google or Yahoo!, still represent a powerful and trustworthy resource for finding information on the Web. This is due to their capability to capture the largest amount of information, helping users quickly identify the most relevant and useful, although not always the most reliable, results for their search queries. This study aims to detect potentially misleading and fake content by capturing and analysing the textual information that flows through Search Engines. Using a real-world dataset associated with the recent CoViD-19 pandemic, we first apply re-sampling techniques for class imbalance, and then use existing Machine Learning algorithms to classify unreliable news. By extracting lexical and host-based features of the Uniform Resource Locators (URLs) associated with news articles, we show that the proposed methods, common in phishing and malicious URL detection, can improve the efficiency and performance of classifiers. Based on these findings, we suggest that the use of both textual and URL features can improve the effectiveness of fake news detection methods.

The reliability and credibility of both the information source and the information itself have emerged as a global issue in contemporary society [1; 2].
Undoubtedly, in the last decades Social Media have revolutionised the way in which

• we use real-world data from Web Search Engines (WSEs), analysing both textual data (meta titles and descriptions) and URL information, by extracting feature representations;
• since most previous work on fake news detection focused on classifier enhancements rather than on feature engineering, in this document we propose a new direction for the classification of fake news: integrating the features most commonly used in fake news detection with features that play an important role in malicious URL detection. The purpose of feature engineering is indeed to transform the original data into new and meaningful feature representations that improve machine learning (ML) algorithms for classification. Currently, the problem of detecting fake news via URLs has not been sufficiently addressed. Several studies on fake news detection via ML in Social Networks [12] have looked at the presence of URLs in the user's published content [13], generally without performing further analysis on the source of information or extracting other potentially relevant URL-based features (features that are, indeed, more common in malicious URL and phishing detection classifiers). Although, in the past, the mere usage of URLs in a post or news item could have been a useful parameter for improving ML classifiers' performance, nowadays this may not be enough to differentiate a good source from a bad one in terms of information credibility without a URL-based feature engineering approach. In fact, as ML techniques have evolved over time, so have the schemes for spreading fake news;
• we apply re-sampling techniques, such as under-sampling and over-sampling, because of the class imbalance of the real-world dataset [14; 15]. Disproportion between classes still represents an open issue and a challenge for researchers working on classification problems.
In a typical news dataset, the number of fake news items is likely to be very small compared to the number of real ones, which makes the positive class (fake news) very small compared to the negative class (real news). This imbalance between the two classes would likely bias classifiers towards the majority class, leading them to classify all the instances in the dataset as belonging to the majority class;
• we compare different ML algorithms (Support Vector Machine, Stochastic Gradient Descent, Logistic Regression, Naïve Bayes, and Random Forest), based on their performance. Since we deal with imbalanced data, we evaluate the models using the F1 score and Recall metrics, and not predictive accuracy, as the latter is a misleading indicator that reflects the underlying class distributions [16; 17].

The paper is structured as follows: Section 2 introduces our material and methodology; Section 3 describes the results of our experimentation along with their evaluation. In Section 4 we summarise our key findings and interpret them, also discussing their implications. Finally, Section 5 draws the conclusions and gives some prospective points for future work.

We submitted several CoViD-19-related queries (Table 1) to a WSE. For each search result, we extracted metadata information, i.e., URL, title, meta description (or snippet), and date (Figure 1). The final dataset consisted of a collection of approximately 3350 news results (fake/misleading and trusted/'real'), gathered from 2084 different URLs. All the news items were published within a 7-month time interval, between January 20th and July 28th, 2020. We chose this time interval as it covered the first CoViD-19 pandemic lockdowns proclaimed in Italy and in other countries [18]. Queries were selected based on topics (e.g., generic information on the new virus; pseudo-scientific therapies; conspiracy theories; travels; etc.)
[19] that we were monitoring both on the Web (online newspapers) and on Social Media during the first lockdown period. We also looked at fact-checking websites, such as politifact.com or poynter.com, to check news and information credibility [20].

In order to reduce potential bias due to Search Engine optimisation, we carefully planned our data collection as follows:
• we used a VPN to be consistent with the WSE domain inspected and its results;
• in order to browse the Internet and query the WSE, we used a Private/Incognito window. This prevented our browsing history from being stored and from biasing our results. In Incognito/Private mode no detailed picture of our online activity was built: all cookies were removed at the end of each session and no information about the pages we visited was saved, which avoided results customised to our search history. This process was repeated for each query and for each day of collection;
• we collected all the results, and not only some of them (e.g., the first 2 pages or the top 10 results).

Although these measures reduced bias during our data collection, the results might still be biased by the ranking systems of the WSEs we were querying, which sort results by relevance. We did not have any control over that, but we tried to address this potential bias by comparing results from different WSEs, also across different days.

Once we had collected the data, the labelling procedure was done manually and consisted of assigning a binary class label indicating whether the news item was real (0) or fake/misleading (1). In the binary fake news detection problem, fake news is usually associated with the positive class, since these are the items the classifier is meant to detect. The data labelling process for training ML algorithms is not only critical but also time consuming.
Because of limited resources, we considered a limited sample size in our study, but one big enough to be considered reliable and sufficiently large for binary detection [21].

The ML workflow proposed in this study was implemented in Python 3.8. Its schematic representation is illustrated in Figure 2. In order to observe the most meaningful context words and to improve the performance of the classifiers, in the data pre-processing stage we removed from both titles and descriptions all parts that were irrelevant, redundant or unrelated to the content: punctuation (with the exception of a few symbols, such as the exclamation mark, question mark, and quotation marks), extra delimiters, symbols, dashes, and stopwords [22]. Following the guidance given by fact-checking websites (e.g., factcheck.org) and reputable news outlets (e.g., bbc.com) on how to spot fake news, we looked at the presence of words in capital letters and at the excessive use of punctuation marks in both titles and descriptions.

Figure 3 shows the frequency of specific punctuation characters ('!', '?', '"', ':') and upper case words in titles and descriptions for news labelled as fake (1) and real (0). Fake news differs markedly from real news in its excessive use of punctuation, quotes, interrogatives, words in all capital letters, and exclamation marks to alert people and urge them to read the news. The frequency distributions in Figure 4 illustrate the top 20 uppercase words in the fake news and in the real news datasets. The two histograms provide important information about the use of uppercase words in the two news sets.
It can be noticed, in fact, that in the real news dataset uppercase words are mostly abbreviations (e.g., US, UK), acronyms (e.g., UNICEF), or organisations' names (NBC, NCBI), while in the fake news dataset uppercase letters highlight potential warnings (e.g., CONTROL, CREATE), capitalising on coronavirus fears and conspiracy theories. This highlights a different use of words written entirely in capitals, an unusual habit for reporters working for trustworthy websites, who generally follow style guidelines and journalistic conventions.

A KNN imputation algorithm was used to fill in missing values and MinMaxScaler was used to rescale variables. However, the percentage of missing values in the dataset was very low (less than 1%). Since high correlation among features leads to redundant features and an unstable model, statistical tests, such as Chi-Squared and Pearson's correlation coefficient, were used for feature selection.

The corpus collected from WSEs was, therefore, pre-processed before being used as input for training the models. Due to the class imbalance in the dataset, re-sampling techniques were applied to the training set only, during cross-validation (k = 5 folds). Different classification models were then evaluated by scoring the classification outcomes on a testing set, in terms of the following performance metrics: F1-score, Recall, Accuracy and Precision.

[Figure caption fragment, word clouds] [...] respectively, the word clouds show words sized according to their weights in the datasets. The use of uppercase words differs between the two datasets: in real news, uppercase words more often indicate acronyms, brands and organisations, while in fake news uppercase words emphasise feelings, creating alerts and potential warnings.
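The stylistic cues discussed above (counts of '!', '?', '"', ':' and of words written entirely in capitals) could be computed along the following lines. This is only a minimal sketch, not the authors' code: the function name `stylistic_features`, the feature names, and the rule that an "uppercase word" is a run of two or more capital letters are assumptions made for illustration, and the example title is invented.

```python
import re
from collections import Counter

# Punctuation marks tracked in Figure 3; the feature names are illustrative.
TRACKED_MARKS = {"exclamation": "!", "question": "?", "quote": '"', "colon": ":"}


def stylistic_features(text):
    """Count tracked punctuation marks and all-uppercase words in a title or snippet."""
    counts = {name: text.count(mark) for name, mark in TRACKED_MARKS.items()}
    # Assumption: an uppercase word is a run of two or more capital letters.
    uppercase_words = re.findall(r"\b[A-Z]{2,}\b", text)
    counts["uppercase_words"] = len(uppercase_words)
    return counts, Counter(uppercase_words)


# Invented example title, not taken from the dataset.
title = 'They CREATE the virus to CONTROL you! Is "5G" to blame?'
features, uppercase_counter = stylistic_features(title)
```

Applied before punctuation is stripped in pre-processing, counts of this kind can be kept as numeric features next to the text representations; `uppercase_counter` accumulates the word frequencies needed for a top-20 histogram such as the one in Figure 4.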
In the feature engineering stage (which typically includes feature creation, transformation, extraction and selection), we used feature extraction methods, such as Bag-of-Words (BoW) [23] and Term Frequency-Inverse Document Frequency (TF-IDF) [24; 25], for mapping cleaned texts (titles and descriptions) into numeric representations. Further features (length, counting, binary) were also extracted from URLs [26]. Information on the age of domain names was gathered from both the Wayback Machine and WHOIS [27], two tools that are crucial resources in the fight against fake news, as they allow users to see how a website has changed and evolved through time, and to gather information on when the website was founded, on its country code top-level domain, and on its contact information.

Although various ML methods assume that the target classes have the same or a similar distribution, in real conditions this does not happen, as data are unbalanced [28], with most of the instances labelled with one class and only a few labelled with the other. Since we worked with real-world data [29; 30], our dataset presented a high class imbalance, with significantly fewer samples of fake news than of real news. To address poor performance in the case of an unbalanced dataset, we used:
• the minority class random over-sampling technique, which consists in over-sizing the minority class by adding observations;
• the majority class random under-sampling technique, which consists in down-sizing the majority class by randomly removing observations from the training dataset.

The choice of re-sampling algorithm depends on the nature of the data, specifically on the ratio between the two classes, fake/real. Although our classes were skewed (90:10), we could not treat our case as a problem of anomaly (or outlier) detection. In fact, to be considered such a case, the distribution between the normal (real) and rare (fake) classes would have had to be far more skewed (100:1).
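The two re-sampling strategies listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, over-sampling is assumed to duplicate randomly chosen minority rows (with replacement) until the classes are balanced, under-sampling is assumed to keep a random subset of the majority class of the same size as the minority class, and the toy data simply mimic the 90:10 ratio reported in the text.

```python
import random


def split_indices(labels, minority):
    """Return (minority_indices, majority_indices) for a binary label list."""
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    return minority_idx, majority_idx


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def random_undersample(rows, labels, minority=1, seed=0):
    """Randomly drop majority rows until both classes have the same size."""
    rng = random.Random(seed)
    minority_idx, majority_idx = split_indices(labels, minority)
    kept_majority = rng.sample(majority_idx, len(minority_idx))
    keep = kept_majority + minority_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training fold: 90 real (0) and 10 fake (1) items, identified by position.
rows = list(range(100))
labels = [0] * 90 + [1] * 10

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
```

In line with the workflow described in the text, such functions would be applied to the training folds only, so that duplicated or discarded observations never leak into the fold used for validation.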
Although the choice of the number of folds is still an open problem, researchers generally choose a number of folds equal to 3 (less common), 5 or 10. We used 5-fold cross-validation because our dataset was small, yet large enough to contain sufficient variation [31]. Each fold was used once for validation, while the k − 1 remaining folds formed the training set. This process was repeated until each of the 5 folds had been used as the testing set.

In this section we discuss the features extracted from URLs and the metrics used for evaluating models' performance, and we report the classification results.

We analysed lexical and host-based features from 2084 distinct URLs. To implement lexical features, we used a Bag-of-Words of the tokens in the URL, where '/', '?', '.', '=', '_', and '-' are delimiters. We distinguished tokens that appear in the host name, path, and top-level domain, and also used the lengths of the host name and of the URL as features [32; 33]. In Table 2 we show all the features extracted from URLs. Word-based features were introduced as well, as URLs were found to contain several suggestive word tokens. An example of URL structure is shown in Figure 5, where it is possible to distinguish the following parts:
• scheme: it refers to the protocol, i.e., a set method for exchanging or transferring data, that the browser should use to retrieve any resource on the web. https is the secure version;
• third-level domain: it is the next highest level following the second-level domain in the domain name hierarchy. The most commonly used third-level domain is www;
• second-level domain: it is the level directly before the top-level domain. It is generally the part of a URL that identifies the website's domain name;
• top-level domain (TLD): it is the domain's extension. The most used TLD is .com. The TLD can also give information about the geographic origin of a website, since each country has a unique domain suffix (e.g., .co.uk for UK websites).
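A few of the lexical URL features described above (tokens split on the listed delimiters, host and URL lengths, counts of dots, hyphens and digits) can be extracted with the Python standard library, as in the following sketch. The function name, the feature names and the example URL are hypothetical, and the top-level domain is naively taken to be the last label of the host name, which would return `uk` rather than `.co.uk` for a British site.

```python
import re
from urllib.parse import urlsplit

# Delimiters named in the text: '/', '?', '.', '=', '_' and '-'.
DELIMITERS = r"[/?.=_\-]"


def url_features(url):
    """Extract a small set of lexical features from a URL (illustrative subset)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    host_tokens = [t for t in host.split(".") if t]
    rest = parts.path + "?" + parts.query
    path_tokens = [t for t in re.split(DELIMITERS, rest) if t]
    return {
        "scheme": parts.scheme,
        "uses_https": parts.scheme == "https",
        # Naive assumption: the TLD is the last label of the host name.
        "tld": host_tokens[-1] if host_tokens else "",
        "host_tokens": host_tokens,
        "path_tokens": path_tokens,
        "url_length": len(url),
        "host_length": len(host),
        "n_dots": url.count("."),
        "n_hyphens": url.count("-"),
        "n_digits": sum(ch.isdigit() for ch in url),
    }


# Hypothetical URL used only to exercise the function.
example = url_features("https://www.example.com/covid-19/news_item?id=42")
```

The token lists can then be fed to the same Bag-of-Words machinery used for titles and descriptions, while the counts and lengths enter the model as numeric columns. Host-based features such as domain age need an external lookup (WHOIS or the Wayback Machine) and are not covered here.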
We used the Chi-Squared (χ²) statistical test to assess whether the association we observed in the data between the independent variables (URL features) and the dependent variable (fake/not fake) was significant; specifically:
• null hypothesis (H0): there is no significant association between the variables and the dependent variable (fake/not fake);
• alternate hypothesis (H1): there is an association between the variables and the dependent variable (fake/not fake).

We set a significance level of 0.05 [34]:
• if the p-value was less than the significance level, we rejected the null hypothesis and concluded that there was a statistically significant association between the variables;
• if the p-value was greater than or equal to the significance level, we failed to reject the null hypothesis because there was not enough evidence to conclude that the variables were associated.

The correlation-based feature selection (CFS) algorithm was used for evaluating the worth or merit of a subset of features, taking into account the usefulness of individual features for predicting the class label. In order to check for high correlations among independent variables, we also performed a multicollinearity test. Multicollinearity is indeed a common problem when estimating models such as logistic regression. Pearson pairwise correlation coefficients were used to quantify the degree of collinearity among predictors: an absolute correlation greater than or equal to 0.7 can be considered an appropriate indicator of strong correlation [35].

Permutation feature importance, which measures the increase in the prediction error of the model after the values of a feature are shuffled, was also employed. The method is most suitable when the number of features is not huge, as it is resource-intensive. It can also be used for feature selection, since it allows features to be selected based on their importance to the model.
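As a rough illustration of the permutation importance idea, the sketch below shuffles one feature column at a time and records the average drop in accuracy (equivalently, the increase in error) of an already fitted model. Everything here is assumed for the example: the function names, the use of accuracy as the score, the number of repeats, the toy rule standing in for a trained classifier, and the two invented features (digit count and dot count of a URL).

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a fitted classifier: flags URLs with more than 5 digits."""
    return int(row[0] > 5)


# Invented rows: (number of digits in the URL, number of dots in the URL).
rows = [(1, 2), (2, 3), (8, 2), (9, 1), (3, 2), (10, 3), (7, 1), (0, 2)]
labels = [toy_model(r) for r in rows]

importances = permutation_importance(toy_model, rows, labels)
```

Because the toy model ignores the second feature, its importance is exactly zero, while shuffling the first feature lowers accuracy. With a real classifier, the score would normally be computed on held-out data rather than on the rows used for training.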
If some features are correlated, the permutation importance will be low for all of the correlated features. The choice of permutation importance as an extra method for feature selection was also justified by our use of different models, both tree-based and non-tree-based [36]. These feature selection methods allowed us to select a small number of highly predictive features in order to avoid over-fitting.

[Table caption fragment] [...] which show the presence of specific words in a URL; 2 host-based features; and the remaining 10 features are lexical-based and include special characters count or show the presence of digits in a URL. The purpose of feature engineering was to find and explore new features to improve model performance for fake news detection. We used Chi-square statistics and a correlation-based feature selection (CFS) approach. Where an error is reported, it is the standard error of the mean.

[...] average for fake news URLs. The most surprising result concerns the number of dots in the URL: fake news URLs in our dataset do not contain more than 2 dots on average. Websites publishing fake news generally have more recently registered domain names than websites publishing reliable news (Table 3 and Figure 6). To validate the results shown in Table 3, we used Welch's t-test, which is usable independently of the data distribution thanks to the central limit theorem. The p-value we obtained (4.195e-17) is less than the chosen significance level (0.05), therefore we reject the null hypothesis in support of the alternative.

In terms of model performance measurement, the decisions made by the classifier can be represented as a 2 × 2 confusion matrix having the following four categories:
• True Positives (TP): fake news correctly classified as fake;
• True Negatives (TN): real news correctly classified as real;
• False Positives (FP): real news incorrectly classified as fake;
• False Negatives (FN): fake news incorrectly classified as real.

To evaluate the effectiveness of the models, we used the metrics shown in Table 4. Both F1 score and Recall are good metrics for the evaluation of imbalanced data.
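To make the role of these metrics concrete, the sketch below derives Accuracy, Precision, Recall and F1 from the four confusion-matrix counts, using their standard definitions (Table 4 itself is not reproduced in this text). The two "classifiers" and their predictions are invented toy cases built on the 90:10 class ratio reported for the dataset; the function names are assumptions.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN), with fake news (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, fp, tn, fn


def scores(y_true, y_pred):
    """Standard metrics; ratios with a zero denominator are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy ground truth: 90 real (0) followed by 10 fake (1) items.
y_true = [0] * 90 + [1] * 10

# Case 1: a trivial model that labels everything as real.
always_real = [0] * 100
# Case 2: a detector with 12 false alarms that catches 8 of the 10 fake items.
detector = [1] * 12 + [0] * 78 + [1] * 8 + [0] * 2

baseline = scores(y_true, always_real)
candidate = scores(y_true, detector)
```

In this toy case the trivial model reaches a higher accuracy than the detector (0.90 against 0.86) while finding no fake news at all, whereas Recall and F1 rank the two models in the intuitive order.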
Since we are dealing with imbalanced data, predictive accuracy is a misleading indicator, as it reflects the underlying class distributions and hides how well a classifier performs on the minority class [24]. For this reason, we used the F1 score [28; 41] and Recall metrics: the higher the value of these metrics, the better the class of interest is classified.

Table 5 shows the evaluation metrics for all the classifiers we considered. The classification metrics depend on the type of classifier and on the extracted features used for the classification. Logistic Regression with the BoW model was the most effective classifier when we over-sampled the data, reaching the highest F1-score (71%), followed by Naïve Bayes with the BoW model (70%), and SVM with TF-IDF (69%). When we used the under-sampling technique and removed instances from the majority class, the scores of the classifiers were very poor compared to those obtained with over-sampling. SGD with TF-IDF and Naïve Bayes with TF-IDF and with BoW came out the worst, with F1 scores of 34%, 35%, and 37%, respectively. When under-sampling was applied, only the Random Forest classifier achieved an F1-score greater than 50% (Table 5), though its Precision was very poor. Figure 7 shows a comparison of the classifiers using different feature extraction techniques (BoW and TF-IDF) based on the F1-score metric (Table 5).

Based on the analysis performed in Section 3.1, we observed a positive influence on the F1-score and Recall metrics (Figure 8) in some ML classifiers after including the most relevant features extracted from URLs. As shown in Table 6, the new features extracted from URLs successfully assisted the classifiers, improving their performance. A visual inspection of the metrics by model, before and after adding URL features to our ML classifiers, is illustrated in Figure 8.
The results verify the effectiveness of introducing URL features, with values approximately above 0.70 for the two types of pre-processing. Before URL feature selection, the highest F1-score was 0.71.

In binary classification problems, class imbalance represents an open challenge, as real-world datasets are usually skewed. One issue involves the determination of the most suitable metrics for evaluating model performance. The F1 score, defined as the harmonic mean of Precision and Recall (Section 3.2), is commonly used to evaluate performance on imbalanced data. Our data had a significantly high level of imbalance (the majority class, i.e., real news, was approximately 90% of our dataset, and the minority class, i.e., fake news, represented only 10% of the dataset). A way to address and mitigate the class imbalance problem was data re-sampling, which consists of either over-sampling or under-sampling the dataset. Over-sampling rebalances the distributions by supplementing additional instances of the minority class (i.e., fake news). Under-sampling, on the other hand, rebalances the distributions by removing instances of the majority class (i.e., real news). By under-sampling the majority class, we had to reduce the sample size, which became too small for training the models and caused poor performance. By over-sampling the data, we instead obtained better results in terms of both Recall and F1 score, boosting model performance.

We compared models based on popular feature representations, such as BoW and TF-IDF. After over-sampling the data, the evaluation metrics returned F1-scores of 70% or more for both the Logistic Regression and Naïve Bayes classifiers with BoW. In order to further improve the results, we decided to focus on news sources as well, exploring and selecting URL features that have displayed high impact in various studies [42; 43; 27].
In fact, just like phishing attacks (e.g., suspicious e-mails or malicious links), fake news continues to be a top concern, as it still spreads across the Web and will continue to spread until everyone understands how to spot it. A comparison between phishing websites and websites that have deliberately published fake news is shown in Table 7. It is evident that websites that publish and share misleading content generally have URLs with identifiable features (Table 2), like malicious URLs. As shown in Table 7, phishing is also carried out through typosquatting, i.e., by registering a domain name that is extremely similar to that of an existing popular one. In the past few years, various websites have been created to imitate trustworthy websites in order to publish misleading and fake content: for example, abcnews.com (registered in 1995) and abcnews.com.co (registered ahead of the 2016 US election); or ilfattoquotidiano.it (registered in 2009) and ilfattoquotidaino.it (registered in 2016).

One of the most relevant URL features was certainly the registration date. In our dataset, the average registration year of the domain names of websites publishing fake news was 2008, while that of websites publishing real news was 2004 (Table 3). Most websites publishing fake news are, therefore, newer than websites that spread reliable news. This was in line with our expectation, i.e., that websites publishing reliable news are typically older and have had more time to build a reputation, while those that publish fake news and misleading content are likely to be unknown websites created more recently.

The other features extracted from URLs also had a positive impact on the detection problem. By using a correlation matrix heatmap and looking at findings from other research works, we selected the features that most affected the target variable.
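The correlation-based part of this selection can be sketched as follows: features are ranked by the absolute value of their Pearson correlation with the fake/real label, and a helper flags pairs of features that reach the 0.7 threshold mentioned earlier for strong correlation. The feature names and values are invented for illustration, and this is a minimal sketch rather than the CFS algorithm or the heatmap analysis actually used in the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def strongly_correlated(x, y, threshold=0.7):
    """True when the absolute correlation reaches the threshold used in the text."""
    return abs(pearson(x, y)) >= threshold


# Invented toy data: three real (0) and three fake (1) items.
labels = [0, 0, 0, 1, 1, 1]
features = {
    "n_digits": [1, 0, 2, 8, 9, 7],
    "uses_https": [1, 1, 0, 1, 0, 1],
}

correlations = {name: pearson(values, labels) for name, values in features.items()}
ranking = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
```

In this toy example the digit count is strongly associated with the label while the protocol carries no signal, which mirrors the qualitative findings reported for the real dataset. Pearson correlation only captures linear association, so it complements rather than replaces the chi-squared test and permutation importance.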
As in phishing, websites or blogs that publish and share fake news may contain special symbols (such as @ and &) to obfuscate links and trick readers into thinking that the URL leads to a legitimate website. For example, abcnews.com.co is a fake website, where the dots are used to add an extension (i.e., .co). On the other hand, the proportion of http and https did not provide relevant information, as the secure https protocol is now commonly used. The breakdown of news by TLD showed that the most popular TLDs are .com (85% in the fake news dataset; 73.3% in the real news dataset) and .org (8.4% in the fake news dataset; 15% in the real news dataset) (Table 2). Furthermore, large numbers of digits and hyphens (greater than 7 on average) were found within URLs in the fake news dataset, which makes a comparison possible with the results obtained from the analysis of malicious URLs [44; 45].

After adding the selected URL features to the model, the Naïve Bayes classifier with BoW achieved an F1 score of 81%, while SVM with TF-IDF reached 79%, significantly exceeding the results based only on features built from lexical representations of the text (titles and descriptions). Based on this result, we concluded that the use of URL features increased the performance of the models.

In terms of challenges, the class imbalance of real-world data and the limited accessibility of high-quality labelled datasets are two of the major ones. The use of ML classification models in fake news detection still appears more challenging in realistic situations, especially on Web Search Engines, where metadata information from thousands of websites is collected. Furthermore, as in phishing attacks, those who write fake news and misleading content constantly look for new and creative ways to fool users into believing their stories come from a trustworthy source. This makes it necessary to keep models continuously updated, as fake news is becoming more and more sophisticated and difficult to spot.
Also, misleading content varies greatly and changes over time: therefore, it is essential to investigate new features.

In this study, we analysed metadata information extracted from Web Search Engines, after submitting specific search queries related to the CoViD-19 outbreak, simulating a normal user's activity. By using both textual and URL properties of the data, we trained different Machine Learning algorithms with pre-processing methods such as Bag-of-Words and TF-IDF. In order to deal with the class imbalance of real-world data, we applied re-sampling techniques, i.e., over-sampling of fake news and under-sampling of real news. While the over-sampling technique allowed us to get satisfactory results, the under-sampling method was not able to increase model performance, showing very poor results due to the small sample size.

Although news has some specific textual properties which can be used to classify it as fake or real, when we look at search results (titles, snippets, and links), additional pre-processing can be used to obtain specific extra features for fake news detection on WSEs. While text features are related to news content, gathered from both titles and snippets, URL features are based on the source websites returned as search results on WSEs. While most previous studies focused on fake news detection in Social Media, relying on data which can be directly gathered from the text (e.g., tweets) and on the usage of URLs for improving source credibility, our proposed approach goes further and analyses URL features of the source of information itself. We believe indeed that URL pattern analysis via phishing detection techniques can enhance the ability of ML algorithms to detect and mitigate the spread of fake news across the World Wide Web. Checking the source is, indeed, one of the most common pieces of advice that fact-checking websites give to online readers [46].
The results from this study suggest that information on URLs, extracted by using phishing detection techniques (e.g., number of digits, number of dots and length of the URL), could point researchers to a number of potentially useful features that future fake news detection algorithms might adopt or develop in order to bring out further valuable information on websites containing mostly false content and to improve model performance.

The analysis of fake news spreading on the Web might have, however, a potential limitation due to Search Engine optimisation. In this study we proposed a possible way to address it. In fact, since Search Engine results might be customised based on the user's location and search history, it helps to change the settings preferences, delete cache, cookies and search history, or use Incognito/Private windows in order to reduce bias due to prior searches on the WSEs. Furthermore, the use of proxies (or a VPN) could make searches on WSEs independent of location.

In terms of future research on fake news detection, we believe that techniques commonly used for malicious URL detection should also be considered for fake news detection: this would mean building classifiers based not only on traditional lexical and semantic features of texts, but also on lexical and host-based features of the URL. As future work, we therefore plan to construct more discriminative features to detect fake content by profiling malicious sources of information based on their domains. We also plan to investigate in more detail, with additional performance metrics such as the Net Reclassification Index (NRI), the improvement in prediction performance gained by adding a marker to the set of baseline predictors, in order to facilitate the design of even better classification models for fake news detection.
Evaluating the credibility of online information: A test of source and advertising influence
Exploring the effect of social media information quality, source credibility and reputation on informational fit-to-task: Moderating role of focused immersion
Social media use in the united states: implications for health communication
Social media as a tool to increase the impact of public health research
The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems
Clickbait pattern detection and classification of news headlines using natural language processing
From clickbait to fake news detection: An approach based on detecting the stance of headlines to articles
Detecting fake news in social media networks
Detecting misleading information on covid-19
Analysis of classifiers for fake news detection
Using artificial intelligence techniques for detecting covid-19 epidemic fake news in moroccan tweets
Collecting a large scale dataset for classifying fake news tweets using weak supervision
An improved hybrid approach for handling class imbalance problem
Handling class imbalance in direct marketing dataset using a hybrid data and algorithmic level solutions
Beyond accuracy, f-score and roc: a family of discriminant measures for performance evaluation
Federated learning on clinical benchmark data: Performance assessment
An exploration of how fake news is taking over social media and putting public health at risk. Health information and libraries journal
Sample size planning for classification models
Stopwords in technical language processing. ArXiv, abs
Understanding bag-of-words model: a statistical framework
A tool for fake news detection
Detecting opinion spams and fake news using text classification
Ofs-nn: An effective phishing websites detection model based on optimal feature selection and neural network
Improving malicious urls detection via feature engineering: Linear and nonlinear space transformation methods
The impact of class imbalance in classification performance metrics based on the binary confusion matrix
An improved oversampling algorithm based on the samples' selection strategy for classifying imbalanced data
A comparison of class imbalance techniques for real-world landslide predictions
Model selection and overfitting
Malicious url detection based on machine learning
Machine learning for malicious url detection
Statistical significance: p value, 0.05 threshold, and applications to radiomics-reasons for a conservative approach
Multicollinearity in regression analyses conducted in epidemiologic studies
Selecting the most important self-assessed features for predicting conversion to mild cognitive impairment with random forest and permutation-based methods
A framework for detection and measurement of phishing attacks
Intelligent phishing url detection using association rule mining
Email phishing: An enhanced classification model to detect malicious urls
Evaluation measures for models assessment over imbalanced data sets
Facing imbalanced data recommendations for the use of performance metrics
Detecting malicious urls in e-mail - an implementation
Phishing email detection based on binary search feature selection
Breaking bad: Detecting malicious domains using word segmentation
Malicious domain detection using machine learning on domain name features, host-based features and web-based features