key: cord-1007140-qybntdi5
title: The influence of fake accounts on sentiment analysis related to COVID-19 in Indonesia
authors: Pratama, Rivanda Putra; Tjahyanto, Aris
date: 2022-12-31
journal: Procedia Computer Science
DOI: 10.1016/j.procs.2021.12.128
sha: d33c1929e6aa9cb40e8ca532abd436476fbb8c12
doc_id: 1007140
cord_uid: qybntdi5

Abstract: The continuing spread of COVID-19 in Indonesia has left public satisfaction with the government's handling of the virus fairly low. One way to measure satisfaction is to analyze social media: sentiment analysis can be applied to feedback from the public. Much research on sentiment analysis exists, but so far it has focused on the opinions contained in sentences and comments without considering the account that posted them. Meanwhile, fake accounts and bots are increasingly common on social media, reducing the credibility of opinion makers. Motivated by this, we conducted several sentiment-analysis experiments using a machine learning approach, with and without tweets from fake accounts, to measure the influence of fake accounts on sentiment analysis. The data were taken from Twitter. The results show that fake accounts reduce the performance of sentiment classification. The experiments with two algorithms also show that the Support Vector Machine outperforms Naïve Bayes for this case, with a highest accuracy of 80.6%. In addition, the sentiment visualizations show that fake accounts shift overall sentiment slightly toward the positive, although the effect is not significant.

Coronavirus Disease 2019 (COVID-19) is a heartbreaking disaster that has affected every country in the world, including Indonesia.
According to the official website of the COVID-19 Handling Task Force, around 67 million people worldwide have been exposed to COVID-19 and around 1.5 million have died. In Indonesia, nearly 600,000 people have been affected and around 18,000 have died from the virus [1]. At the start of the pandemic, several countries imposed lockdown policies to prevent the spread of COVID-19. These policies affected every segment of human life, from education, transportation, and tourism to, worst of all, the economy. Many schools, offices, and tourist attractions were closed to massively reduce physical contact, and people were encouraged to work from home where possible. The Indonesian economy declined continuously, and by the end of 2020 Indonesia had entered a recession. To accelerate the handling of COVID-19 and to recover and transform the national economy, the Indonesian government formed a Task Force, and it uses online media as a channel of communication and information to the public in support of the Task Force's work. Advances in information technology make information about COVID-19 easily accessible, anytime and anywhere. Research by HootSuite and We Are Social covering January 2019 to January 2020 shows that the number of internet users in Indonesia grew by 17% over the previous year: active internet users now reach 175 million people (64% of the total population) and active social media users 160 million (59% of the total population) [2]. The growing number of internet users in Indonesia reflects an era of ever-wider information openness, making it easier for people to obtain information from the internet and encouraging the government to disseminate COVID-19 information through online media.
In addition, online media lets the government gather feedback from the community through social media, and from this feedback it can measure public satisfaction with the government's performance in dealing with COVID-19. Feedback from the community can express either satisfaction or dissatisfaction: satisfaction can be measured as positive opinion, and dissatisfaction as negative opinion. Both forms of opinion can be classified using sentiment analysis techniques, also known in some studies as opinion mining. Sentiment analysis is a technique for analyzing opinions, sentiments, appraisals, and emotions about a product, service, organization, or individual and their attributes [3]. It can also be used to determine whether the sentiment of a piece of text is positive, negative, or neutral [4]. There has been a great deal of research on sentiment analysis, but it has focused on the opinions contained in sentences and comments without considering the account that posted them. Meanwhile, fake accounts and bots are increasingly common on social media, reducing the credibility of opinion makers. To address this problem, this research studies the influence of fake accounts on sentiment analysis. Several tests are conducted to obtain better sentiment classification performance, and visualizations of sentiment with and without fake accounts show the influence of fake accounts on sentiment analysis. Most sentiment analysis research uses a machine learning approach, which applies a set of features and algorithms to assign classification classes to the processed data [5].
Eliacik and Erdogan [6] studied microblog posts collected from Twitter, using the Naïve Bayes and Support Vector Machine algorithms for classification; they also compared the initial classification results with results combined with the PageRank, InterRank, and XiangRank methods to measure the impact of influential users. Kolog et al. [7] studied sentiment and social phenomena in student life, collecting a dataset through distributed questionnaires, clustering it with the K-Means algorithm, and comparing sentiment classification results from several algorithms, including Sequential Minimal Optimization, Multinomial Naïve Bayes, and the J48 decision tree. Chouchani and Abed [8] investigated how social influence can improve sentiment analysis on social networks, analyzing famous figures through their social media accounts with the SampleRank algorithm. Besides machine learning, several studies use a lexicon-based approach to classify sentiment, generally against a dictionary. Bae and Lee [9] applied a lexicon-based approach to a Twitter dataset to measure public sentiment toward the microblogs of popular figures, and analyzed the influence of those figures through the user graphs of their social networks. Bravo-Marquez et al. [10] used a lexicon-based approach with the Annotate-Sample-Average algorithm to generate training data from a Twitter dataset. There is also considerable research on detecting fake accounts on social media; commonly used approaches include filtering, rules, and machine learning [11], with machine learning giving the best results.
In research by Akyon [12], a set of attributes of accounts deemed fake was used as features, and a dataset of accounts was classified and compared using the Naïve Bayes, Support Vector Machine, Logistic Regression, and Neural Network algorithms. Velayutham and Tiwari [13] used the friend-to-follow ratio, called reputation, together with other fake-account attributes as features, classifying accounts with the Naïve Bayes algorithm. Ersahin et al. [14] developed a discretization technique for fake-account features, called Entropy Minimization Discretization, to improve the performance of Naïve Bayes in fake-account classification. Jia et al. [15] developed a new method called SybilWalk to overcome some limitations of random-walk algorithms in classifying fake accounts. In summary, this paper uses a machine learning approach for sentiment analysis and compares the Naïve Bayes and Support Vector Machine algorithms for sentiment classification. It differs from the related work in that we study the influence of fake accounts on the results of sentiment analysis regarding the Indonesian government's performance against COVID-19. Our methodology has five steps, each designed to test whether fake accounts influence sentiment analysis, both in classification performance and in the overall sentiment results. The first step is data collection: tweets were collected as the dataset from comments on the Twitter account of the Ministry of Health of the Republic of Indonesia (@KemenkesRI) regarding COVID-19.
This data source was chosen because most of the information, news, feedback, and public complaints related to COVID-19 in Indonesia appear on the Twitter account of the Ministry of Health of the Republic of Indonesia, so tweets on this account can be assumed to represent the Indonesian public. Tweets were collected using a crawling technique with several parameters; the detailed parameters and descriptions are given in Table 1. The tweets span January 2021 to June 2021, for a total of 4,170 tweets. The second step is removing tweets from fake accounts. Fake and real accounts are identified with Tweetbotornot, a tool that detects whether a Twitter account is a social bot [16]. Tweetbotornot is written in the R programming language and outputs a probability from 0 to 1; accounts with a probability above 0.5 are identified as social bots. The tool works from the username contained in the tweet data and uses three parameters as references for identifying account authenticity; the reference parameters and features of Tweetbotornot are listed in Table 2. Identification with Tweetbotornot found 1,952 tweets originating from fake accounts. After these tweets were deleted, the dataset without fake accounts contained 2,218 tweets. The identification of fake accounts produces three parameters, shown in Table 3:
Table 3. Detailed parameters and descriptions of the identification of fake accounts.
- Username: the username of the Twitter account.
- Probability: the probability score of the account.
- Label: whether the account is identified as a real or fake account.
The third step is data pre-processing.
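The fake-account removal step described above (keep only tweets whose author scores at or below Tweetbotornot's 0.5 bot-probability threshold) can be sketched in Python. Tweetbotornot itself is an R package, so this is only an illustration of the thresholding logic; the record layout (`username`, `prob_bot`, `text`) is an assumption, not the paper's actual schema.

```python
# Sketch of step 2: split tweets into real-account and fake-account subsets
# based on a precomputed bot-probability score, as a tool like Tweetbotornot
# would supply. Field names here are illustrative assumptions.
BOT_THRESHOLD = 0.5  # Tweetbotornot flags accounts above 0.5 as social bots

def remove_fake_account_tweets(tweets):
    """Return (real_tweets, fake_tweets) split on the bot-probability threshold."""
    real = [t for t in tweets if t["prob_bot"] <= BOT_THRESHOLD]
    fake = [t for t in tweets if t["prob_bot"] > BOT_THRESHOLD]
    return real, fake

# Toy records standing in for the crawled dataset
tweets = [
    {"username": "user_a", "prob_bot": 0.12, "text": "vaksinasi berjalan lancar"},
    {"username": "bot_x",  "prob_bot": 0.91, "text": "info covid terbaru klik link"},
    {"username": "user_b", "prob_bot": 0.47, "text": "antrian rumah sakit penuh"},
]
real, fake = remove_fake_account_tweets(tweets)
```

In the paper's data, this split removed 1,952 of 4,170 tweets, leaving the 2,218-tweet dataset without fake accounts.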
The collected tweets are cleaned by removing stop words, i.e., words that carry no meaning on their own, and converted to lowercase to make sentiment classification easier for the machine. The cleaned tweets are then labeled by type: positive, neutral, or negative. Of all labeled tweets, 1,062 were labeled positive, 1,889 neutral, and 1,219 negative. The fourth step is data processing: training on the dataset and running the algorithms. The first training run uses the entire tweet dataset, containing both real and fake accounts, sampled randomly. The second run excludes fake accounts. The number of tweets in the first run was made equal to the number in the second run to avoid imbalanced data. The fifth step is analyzing and discussing the scores produced by the algorithms during data processing. Machine learning offers several metrics for evaluating how accurate an algorithm is; in this research we use Accuracy, F1-Score, Precision, and Recall as performance metrics. Accuracy is the number of correct predictions as a ratio of all predictions made. Precision is the fraction of items predicted positive that are actually positive. Recall is the fraction of actually positive items that the system predicts as positive. F1-Score is the harmonic mean of precision and recall [14]. Finally, the sentiment classification results using all tweets and without fake accounts are visualized and compared. The steps are shown in Fig. 1.
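The pre-processing step above (lowercasing plus stop-word removal) can be sketched as follows. The stop-word set here is a tiny illustrative sample; the paper does not publish its Indonesian stop-word list, so in practice a full list from a library such as Sastrawi or NLTK would be substituted.

```python
import re

# Minimal pre-processing sketch: lowercase the tweet, strip punctuation,
# and drop stop words. The stop-word set is an illustrative assumption.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu", "untuk"}

def preprocess(tweet: str) -> str:
    tweet = tweet.lower()
    tweet = re.sub(r"[^a-z0-9\s]", " ", tweet)  # remove punctuation/symbols
    tokens = [w for w in tweet.split() if w not in STOP_WORDS]
    return " ".join(tokens)

cleaned = preprocess("Vaksinasi di Jakarta INI berjalan lancar!")
# -> "vaksinasi jakarta berjalan lancar"
```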
In the sentiment analysis experiments, classification is performed with the Naïve Bayes (NB) and Support Vector Machine (SVM) algorithms, both known for good classification performance in several machine learning studies. NB is the simplest and most widely used algorithm; it computes the probability of a class from the distribution of words in a document, using Bag of Words (BOW) feature extraction [3]. SVM transforms the training data into a higher-dimensional space and searches for the hyperplane with the largest margin separating the classes, using training tuples [5]. However, SVM natively supports only binary classification, so its use here is slightly modified because this research has three sentiment classes: we apply multiclass classification with a One-vs-Rest approach, in which each class is separated from the other two by its own hyperplane. The classification process is first run on a random dataset of tweets including fake accounts, then on the dataset without fake accounts; both datasets contain 2,218 tweets. Each process is tested four times, with training-to-testing data ratios of 60:40, 70:30, 80:20, and 90:10, to obtain the best sentiment classification results. The first experiment uses the NB algorithm; its results using fake accounts are given in Table 4, and without fake accounts in Table 5.
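The two classifiers described above can be sketched with scikit-learn: NB over bag-of-words counts, and a linear SVM made explicitly One-vs-Rest for the three sentiment classes. The tiny corpus below is invented for illustration (the real experiments used 2,218 labelled tweets), and the 70:30 split mirrors one of the paper's four ratios.

```python
# Sketch of the NB vs. SVM comparison, assuming scikit-learn; the toy
# Indonesian corpus and labels are illustrative, not the paper's data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

texts = [
    "pelayanan vaksin cepat dan ramah", "program vaksinasi berjalan baik",
    "terima kasih petugas kesehatan", "penanganan pandemi semakin membaik",
    "jadwal vaksin diumumkan besok", "data kasus harian telah dirilis",
    "lokasi vaksinasi ada di puskesmas", "pemerintah menggelar konferensi pers",
    "antrian vaksin sangat lama", "penanganan covid sangat buruk",
    "kasus terus naik tanpa solusi", "pelayanan rumah sakit mengecewakan",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4

# 70:30 train/test ratio, one of the four ratios tested in the paper
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels)

vectorizer = CountVectorizer()            # Bag of Words features
X_train = vectorizer.fit_transform(X_train_txt)
X_test = vectorizer.transform(X_test_txt)

nb = MultinomialNB().fit(X_train, y_train)
svm = OneVsRestClassifier(LinearSVC()).fit(X_train, y_train)  # One-vs-Rest SVM

nb_acc = accuracy_score(y_test, nb.predict(X_test))
svm_acc = accuracy_score(y_test, svm.predict(X_test))
```

`LinearSVC` is already one-vs-rest internally; the `OneVsRestClassifier` wrapper just makes the scheme the paper names explicit.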
These results show that the highest accuracy for the NB experiment on all tweets is 53.8%, at a 70:30 data ratio, and the lowest is 51%, at 90:10. Without fake accounts, NB's highest accuracy is 59%, at 90:10, and its lowest is 52.7%, at 70:30. Thus, for NB, sentiment classification without fake accounts outperforms classification on all tweets. The sentiment classification experiment continues with the SVM algorithm on the same datasets as the first experiment; its results using fake accounts are given in Table 6, and without fake accounts in Table 7. The highest accuracy for the SVM experiment on all tweets is 79.5%, at 70:30, and the lowest is 75.6%, at 80:20. Without fake accounts, SVM's highest accuracy is 80.6%, at 80:20, and its lowest is 74.4%, at 60:40. Thus, for SVM as well, classification without fake accounts outperforms classification on all tweets. The experiments with both algorithms therefore show that sentiment classification without fake accounts is better than classification using all tweets.
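The four metrics reported in the tables can be computed directly from true and predicted labels; the toy labels below are illustrative, not the paper's data. With three classes, per-class precision, recall, and F1 must be averaged somehow; the paper does not state its averaging scheme, so the macro average used here is an assumption.

```python
# Computing Accuracy, Precision, Recall, and F1-Score with scikit-learn on a
# toy set of true vs. predicted sentiment labels (illustrative values only).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = ["pos", "neg", "neu", "pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "neu", "neg", "neg", "neu", "pos", "pos"]

accuracy = accuracy_score(y_true, y_pred)                      # 6 of 8 correct = 0.75
precision = precision_score(y_true, y_pred, average="macro")   # macro-averaged
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")                 # harmonic mean per class
```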
These results prove that fake accounts exert an influence that reduces the performance of sentiment classification. The experiments also show that SVM reaches a highest accuracy of 80.6%, while NB reaches only 59%; SVM's Precision, Recall, and F1-Score are likewise higher than NB's. Sentiment classification with SVM therefore performs better than with NB for this case. Beyond the classification results, the tweet sentiments are also visualized as pie charts; the visualizations for all tweets and for tweets without fake accounts are shown in Fig. 2. Using all tweets, 25.47% of sentiment is positive, 45.3% neutral, and 29.23% negative. After tweets from fake accounts were removed, the proportions changed to 24.39% positive, 46.35% neutral, and 29.26% negative. This indicates that fake accounts pull sentiment slightly toward the positive, although not significantly, which contradicts the initial hypothesis of this research that fake accounts would push opinion toward negative sentiment. In this paper, several experiments were conducted to determine the influence of fake accounts on sentiment analysis. The collected tweets were divided into two datasets, one using all tweets including fake accounts and one without fake accounts, and each dataset was tested several times by classifying tweet sentiments.
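The pie-chart percentages above follow directly from the class counts. The overall counts (1,062 positive / 1,889 neutral / 1,219 negative of 4,170 tweets) are stated in the paper; the per-class counts after removing fake accounts are not stated, so the values below are back-computed from the reported percentages of the 2,218 remaining tweets and should be read as an assumption.

```python
# Reproducing the Fig. 2 percentages from sentiment class counts.
def sentiment_percentages(pos, neu, neg):
    total = pos + neu + neg
    return {label: round(100 * count / total, 2)
            for label, count in
            (("positive", pos), ("neutral", neu), ("negative", neg))}

# Counts stated in the paper (all 4,170 labelled tweets)
all_tweets = sentiment_percentages(1062, 1889, 1219)
# Counts back-computed from the reported percentages of the 2,218
# tweets left after removing fake accounts (assumption, not stated)
without_fake = sentiment_percentages(541, 1028, 649)
```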
This research also compares sentiment classification using the Naïve Bayes (NB) and Support Vector Machine (SVM) algorithms. For both algorithms, the experiments show that sentiment classification without fake accounts is better than classification using all tweets, proving that fake accounts reduce the performance of sentiment classification. The experiments also show that SVM performs better than NB for this case, with a highest accuracy of 80.6%. In addition, the sentiment visualizations show that fake accounts pull sentiment slightly toward the positive, although not significantly, contrary to the initial hypothesis that fake accounts would push opinion toward negative sentiment. For future work, we will try several ways to obtain better results, including more extensive datasets and other classification processes.
References:
- Komite Penanganan COVID-19 dan Pemulihan Ekonomi Nasional (Committee for Handling COVID-19 and National Economic Recovery)
- Digital 2020: Indonesia
Fig. 2. (a) Visualization of all tweets sentiment; (b) visualization of tweets sentiment without fake accounts.
- Random forest approach for sentiment analysis in Indonesian language
- Data Mining: Concepts and Techniques
- Influential user weighted sentiment analysis on topic based microblogging community
- Using machine learning for sentiment and social influence analysis in text
- Enhance sentiment analysis on social networks with social influence analytics
- Sentiment analysis of twitter audiences: Measuring the positive or negative influence of popular twitterers
- Annotate-Sample-Average (ASA): A new distant supervision approach for Twitter sentiment analysis
- Using Machine Learning to Detect Fake Identities: Bots vs Humans
- Instagram Fake and Automated Account Detection
- Bot identification: Helping analysts for right data in twitter
- Twitter fake account detection
- Random Walk Based Fake Account Detection in Online Social Networks
- Tweetbotornot