key: cord-1033531-k6t01vl9 authors: Khanday, Akib Mohi Ud Din; Khan, Qamar Rayees; Rabani, Syed Tanzeel title: Identifying propaganda from online social networks during COVID-19 using machine learning techniques date: 2020-10-29 journal: Int J Inf Technol DOI: 10.1007/s41870-020-00550-5 sha: 51c66450cacd19c2fa43e335f329535baefbb8d9 doc_id: 1033531 cord_uid: k6t01vl9

COVID-19 has affected the entire world, in part because no vaccine is available. Because of social distancing, online social networks are used massively during the pandemic, and information is shared in enormous volumes without regard to the authenticity of its source. Propaganda is one type of information that is shared deliberately to gain political and religious influence. It is the systematic and deliberate shaping of opinion and influencing of a person's thoughts to achieve the desired intention of a propagandist. Various propagandistic messages about the deadly virus are being shared during COVID-19. We extracted data from Twitter using its application program interface (API) and annotated it manually. Hybrid feature engineering is performed to choose the most relevant features. Binary classification of tweets is then performed with machine learning algorithms. The decision tree gives the best results among the algorithms tested. For better results, feature engineering may be improved and deep learning can be used for the classification task.

Social networks have bridged the communication gap by providing a vast number of features for transferring data from one client to another. With the advancement of online social networks, information sharing has become easy. People use online social networks for purposes such as brand advertisement, marketing, education and business [1].
However, alongside these benefits there are limitations and side effects: malicious users exploit these platforms for illegal activities that are dangerous for society. Hate mongers have used them to spread false content, rumors and fake information. The information being shared can be misinformation, disinformation or propaganda. Political and religious activists mostly use propaganda to gain influence; propaganda can be either true or false [2]. Propaganda is spread in various forms, including text, image and video. Since Twitter has considerable influence on people's behavior and is widely used by politicians, religious activists, celebrities and other influential actors [3], the volume of propaganda on it rises sharply. Previous studies indicate that propaganda text is mostly related to sectarian and political discussions. Twitter limits a single tweet to 280 characters, which makes detecting propaganda posts a challenge. Trending events around the world attract propagandist users who spread hate, fear, hoaxes, etc. In late 2019, a virus known as COVID-19 emerged in Wuhan, China [4]. This virus has affected almost 10 million people worldwide. Through trade with other countries around the globe, the virus has spread to every corner of the world, affecting mostly European countries such as Italy, the UK and Spain, as well as the USA. It has also spread to Iran, India, Pakistan, etc.; so far the mortality rate in the Asian subcontinent is lower. Much research is under way to develop a drug for this pandemic virus. Fear mongers are spreading misinformation over social networks. Misinformation about cures has spread widely; supposed cures include drinking alcohol and drinking cow urine,
none of which has been medically proven to cure the disease. Politicians have also treated COVID-19 as a concern, and many around the world have appealed to the public to take the precautions recommended by the World Health Organization. Various propagandistic messages are being spread through online social networks, and various hashtags are used on Twitter to spread messages regarding COVID-19. In this paper we extracted data using the Twitter application program interface (API) by querying various hashtags. Our work consists of five sections: the background is described in Sect. 2, Sect. 3 gives the detailed methodology of the proposed system, results are shown and discussed in Sect. 4, and Sect. 5 concludes our work. The significant contributions of this paper are as follows:

• A novel dataset of 5 K tweets is generated.
• Enhanced feature engineering is performed to achieve better accuracy.

The ever-growing appeal of social networks directly or indirectly affects our daily life. It is not surprising that social media has become a weapon for manipulating sentiment by spreading disinformation in line with current trends. The adversarial use of these platforms mostly involves spreading unreliable or ambiguous information, which is a communal, financial and political threat [5]. Gupta et al. [6] analysed fake content on Twitter during the Boston attack; the results showed that fear mongers effectively use social media to trigger mass hysteria and panic. Arts et al. [7] discussed three types of attacks that take place in cyber network operations: physical, syntactic and semantic. Physical attacks affect the hardware of the system. Syntactic attacks arise from the technologies themselves, with no human hand involved. Semantic attacks are the most dangerous: they change the information content or the meaning of information [7].
Semantic attacks differ from the other two forms of cyber-attack. They target the human-computer interface, and their effect is not as visible as that of physical or syntactic attacks. Semantic attacks are divided into categories, viz. overt attacks (including phishing, spam, etc.) and covert attacks. Cybenko et al. [8] focused on covert attacks, i.e. misinformation, disinformation and propaganda. Kumar et al. and Sarwar et al. [9, 10] analysed textual data for predicting various diseases. They showed that text classification performs well in detecting disease, as well as any type of fraud, from text. Babcock et al. [1] suggested using the social calculating characteristics of consumers on online social media to determine the credibility of information. Information on social networks can be shared deliberately or unintentionally, and is categorized accordingly as disinformation or misinformation. Misinformation is information whose truthfulness is unknown to the user spreading it. In contrast, Kumar et al. [11] described disinformation as information that the user knows to be false or inaccurate and shares deliberately [11]. Disinformation usually occurs in politics, health, finance, technology, etc. Howard et al. [12] studied orchestrated astroturf, which is used to manipulate political conversations, even during election times. Esposito [13] proposed a semantic graph-based approach for radicalization detection in social media, showing that pro-ISIS users tend to discuss religion, historical events and ethnicity, while anti-ISIS users focus more on politics, geographical locations and interventions against ISIS. Varol et al. [14] detected promoted campaigns on social media at an early stage. The results showed that compromised accounts are used for spreading disinformation, and that these accounts may also be used for spreading propaganda. According to O'Donnell et al.
[15], propaganda is a type of disinformation, defined as the systematic and deliberate process of shaping opinions, influencing thoughts and directing the behaviour of a person to achieve the desired intention of a propagandist. Paul et al. [16] showed that propaganda is mainly used to gain people's faith in some person, community or party, and that it plays a significant role in politics. Lightfoot [17] studied the effect of social bots on politics (political propaganda through social bots). The study found that social bots play a vital role in spreading fake news, and that accounts that continuously spread misinformation are significantly more likely to be bots. Reference [18] showed that political propaganda played a significant role in Donald Trump's win in the 2016 USA presidential election. Badawy et al. [19] analysed jihadist propaganda and showed that radical propaganda can be shared by posting four types of messages: religious and sacred topics, violence, sectarian discussion, and dominant celebrities and events.

The proposed system for identifying propaganda during COVID-19 consists of (i) data collection, (ii) data preprocessing, (iii) feature engineering and (iv) classification. A graphical representation of the proposed system is depicted in Fig. 1.

Data is extracted using the Twitter API [20] with the help of the Python library tweepy, by querying hashtags that were trending during COVID-19. About 5.1 million tweets are extracted using the hashtags COVIDINDIA, CORONAVIRUS, CORONAJIHAD, CHINESEVIRUS, CORONAMUSLIM, etc. After analysis, we found three hashtags that are associated with spreading misinformation and propaganda: #CoronaJihad, #CoronaMuslim and #Chinesevirus. We manually annotated these tweets based on their content and semantics, with the help of 18 different propaganda techniques. We hired two journalists and one computer expert to label the data.
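The hashtag-based narrowing described above can be sketched in plain Python. This is a minimal illustration, not the authors' collection code: it assumes the tweets are already available as strings (the tweepy calls are not shown), and the names `TARGET_HASHTAGS`, `extract_hashtags` and `filter_by_hashtags`, as well as the sample tweets, are hypothetical.

```python
import re

# The three hashtags the text reports as associated with misinformation
# and propaganda, stored in lowercase for case-insensitive matching.
TARGET_HASHTAGS = {"coronajihad", "coronamuslim", "chinesevirus"}

HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(text):
    """Return the set of lowercase hashtags found in a tweet."""
    return {tag.lower() for tag in HASHTAG_RE.findall(text)}


def filter_by_hashtags(tweets, targets=TARGET_HASHTAGS):
    """Keep only tweets that carry at least one target hashtag."""
    return [t for t in tweets if extract_hashtags(t) & targets]


# Hypothetical sample tweets, not taken from the dataset.
sample = [
    "Stay home and stay safe #CoronaVirus",
    "Example text #CoronaJihad",
    "Another example #chinesevirus #COVIDINDIA",
]
kept = filter_by_hashtags(sample)
```

Matching on lowercase tags is a design choice made here so that variants such as #Chinesevirus and #CHINESEVIRUS are treated alike.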
In the annotation, about 5 K tweets were labelled into two classes, propaganda and non-propaganda, based on various propaganda identification techniques. Figure 2 depicts the labelled dataset with tweet lengths in characters.

The textual data in the corpus contains many missing values, URLs, hyperlinks, digits and stop words. To refine the data, various preprocessing tasks were performed, some of which are as follows. Tokenization splits the tweets into tokens: a sentence is fragmented into a number of tokens, and each word is considered a separate token. Stop words such as a, an, the, etc. are removed using an English stop word dictionary. In lemmatization, the lemma of each word is determined based on the intended meaning of that word.

Classification requires features. We use hybrid feature engineering, combining three types of features extracted using three different techniques: TF/IDF, bag of words and tweet length. Term frequency/inverse document frequency (TF/IDF) is a numerical statistic that reflects the importance of a word in a tweet relative to the whole corpus. It is calculated using the following equation:

$$\text{tf-idf}(t, w, D) = \text{tf}(t, w) \times \text{idf}(t, D)$$

where t is the term as a feature, w denotes each tweet in the corpus and D is the total number of tweets in the dataset (document space). The bag of words consists of word and lemma unigrams, bigrams and trigrams. We included bigrams and trigrams so that more information can be extracted from the text. Since Twitter allows only 280 characters in a single tweet, we also considered the length of the tweet. Our computations revealed that propagandistic tweets are longer than non-propagandistic tweets. In our work, we used this feature together with TF/IDF and bag of words to achieve better testing results.
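The hybrid feature set described above (TF/IDF, a bag of words over unigrams, bigrams and trigrams, and tweet length) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, length-normalized term frequency and an unsmoothed natural-log inverse document frequency are choices made here, and the function names `ngrams`, `tf_idf` and `hybrid_features` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all uni-, bi- and trigrams of a token list as strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def tf_idf(tokenized_tweets):
    """Per-tweet weights tf(t, w) * idf(t, D), with idf = log(D / df(t)).

    Assumed variant: tf is the count divided by the tweet's token count,
    and idf is unsmoothed, so a term present in every tweet gets weight 0.
    """
    D = len(tokenized_tweets)
    # Document frequency: number of tweets that contain each term.
    df = Counter(t for tokens in tokenized_tweets for t in set(tokens))
    weights = []
    for tokens in tokenized_tweets:
        counts = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(D / df[t])
                        for t, c in counts.items()})
    return weights


def hybrid_features(tweets):
    """Combine TF-IDF weights, n-gram counts and tweet length per tweet."""
    tokenized = [tweet.lower().split() for tweet in tweets]
    tfidf = tf_idf(tokenized)
    return [{"tfidf": tfidf[i],
             "bow": Counter(ngrams(tokens)),
             "length": len(tweets[i])}  # length in characters
            for i, tokens in enumerate(tokenized)]


# Hypothetical two-tweet corpus, not taken from the dataset.
feats = hybrid_features(["virus spreads fast", "virus news today"])
```

In practice the three feature groups would be flattened into a single numeric vector per tweet before being passed to a classifier.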
After feature engineering, the most correlated bigrams were 'dangerous muslim', 'rise coronajihad', 'coronavirus report', 'rt billyperrigo', 'coronajihad nar', 'india come', 'come coronavirus', 'billyperrigo already', 'already dangerous', 'muslim india', 'rt rose_k01', 'hashtag coronajihad'.

The main aim of this work is to build a classifier that assigns a tweet to the propaganda or non-propaganda class. Supervised machine learning algorithms are used because our corpus is labelled. Various traditional machine learning classifiers are trained and tested for this task.

Logistic regression predicts the numerical class value based on the class relationship with the label. Logistic regression is fine-tuned as: C = 1.0, class_weight = None, dual = False, fit_intercept = True, intercept_scaling = 1, max_iter = 100, multi_class = 'warn', n_jobs = None, penalty = 'l2', random_state = 8, solver = 'warn', tol = 0.0001, verbose = 0, warm_start = False.

Multinomial Naïve Bayes (MNB) uses the classical Bayes algorithm for text classification. Multinomial Naïve Bayes is fine-tuned as: alpha = 1.0, class_prior = None, fit_prior = True.

The support vector machine (SVM) is a supervised machine learning approach used for classification tasks as well as regression problems. It takes 'n' features for a particular text with the given label. The SVM is fine-tuned as: C = 0.1, cache_size = 200, class_weight = None, coef0 = 0.0, decision_function_shape = 'ovr', degree = 3, gamma = 'auto_deprecated', kernel = 'linear', max_iter = -1, probability = True, random_state = 8, shrinking = True, tol = 0.001, verbose = False.

In the decision tree approach, the input space is broken down into regions, and every region is classified independently.
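The idea of breaking the input space into regions can be illustrated with a single-feature split scored by Gini impurity, the criterion named in the decision tree settings. The sketch is a minimal illustration, not the configuration used in the experiments: the tweet lengths and labels are hypothetical, as are the helper names `gini`, `split_gini` and `best_threshold`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Weighted Gini impurity of the two regions value <= t and value > t."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try each observed value as a threshold and keep the purest split."""
    return min(sorted(set(values)),
               key=lambda t: split_gini(values, labels, t))


# Hypothetical tweet lengths and labels (1 = propaganda, 0 = non-propaganda).
lengths = [60, 80, 95, 150, 200, 240]
labels = [0, 0, 0, 1, 1, 1]
threshold = best_threshold(lengths, labels)
```

A full decision tree repeats this search over every feature and then recurses into each resulting region until a stopping rule is met.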
The decision tree classifier is fine-tuned as: class_weight = None, criterion = 'gini', max_depth = None, max_features = None, max_leaf_nodes = None, min_impurity_decrease = 0.0, min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2, min_weight_fraction_leaf = 0.0, presort = False, random_state = 0, splitter = 'best'.

In our experiment, we used logistic regression, multinomial Naïve Bayes, support vector machine and decision tree algorithms to separate propagandist text from non-propagandist text. The proposed hybrid feature engineering technique extracts the useful features that are supplied to the fine-tuned machine learning models. About 100 features are chosen for the binary classification; because of the computational complexity, information gain is used to select the most influential features. The dataset is split in a 70:30 ratio: 70% is used for training the machine learning models and 30% for testing them. The algorithms are tuned by testing them with different parameters. For the support vector machine we tried three kernels: RBF, polynomial and linear. The linear kernel gave better results than the other two. Similarly, the other machine learning algorithms gave better results after their parameters were fine-tuned. Multinomial Naïve Bayes performed best with alpha set to 1.0, and logistic regression gave good results with C set to 1.0 and a maximum of 100 iterations. In the decision tree, the Gini index was used as the splitting criterion, and it showed promising results. The comparison of the machine learning algorithms revealed that propagandistic tweets have greater length than non-propagandistic tweets.

The majority of data used in our research was related to COVID-19.
More data spanning numerous fields should be gathered for better analysis of propaganda. More human effort is needed to label the tweets into different classes. As the data grows, manual annotation becomes laborious, so an automatic annotation program that learns from the semantics of the provided text is required. More feature engineering is required to achieve better text classification results. The comparative analysis of the machine learning algorithms used in our work is shown in Fig. 7.

Machine learning has gained a lot of attention owing to its good and robust results in many fields. During COVID-19, various misinformation and propaganda is being shared. In this paper, data is extracted from the online social network platform Twitter using its API. The extracted data is manually labelled into two classes, propaganda and non-propaganda. Hybrid feature engineering is performed by combining three different textual features (TF/IDF, bag of words and tweet length). The results revealed that propagandistic text is longer than non-propagandistic text. Machine learning algorithms are used to classify tweets into the propaganda and non-propaganda classes. The decision tree classifier showed the best results among the machine learning algorithms, with 98.5% accuracy, 0.99 precision, 0.99 recall and 0.99 F1-score. In future, more features may be used to obtain better results, and deep learning can also be used for this task.

Different faces of false: the spread and curtailment of false information in the Black Panther Twitter discussion
Pro-ISIS fanboys network analysis and attack detection through Twitter data
Social media? Get serious! Understanding the functional building blocks of social media
Mohi ud Din M (2020) Machine learning based approaches for detecting COVID-19 using clinical text data
$1.00 per RT #BostonMarathon #PrayForBoston: Analyzing fake content on Twitter. APWG eCrime Researchers Summit
Cyberdeterrence and cyberwar
Cognitive hacking: a battle for the mind
Text classification algorithms for mining unstructured data: a SWOT analysis
Diagnosis of diabetes type-II using hybrid machine learning based ensemble model
A psychometric analysis of information propagation in online social networks using latent trait theory
computational propaganda during the UK-EU referendum. Available at SSRN 2798311
The Semantic Web
Early detection of promoted campaigns on social media
Propaganda & persuasion
The Russian
Political propaganda spread through social bots. Media, Culture, & Global Politics
Social bots distort the 2016 US Presidential election online discussion
The rise of jihadist propaganda on social networks
Twitter sentiment analysis on Indian government project using R