International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-8 Issue-3, September 2019
Published By: Blue Eyes Intelligence Engineering & Sciences Publication
Retrieval Number: C4432098319/19©BEIESP
DOI: 10.35940/ijrte.C4432.098319

Metoo Movement Analysis through the Lens of Social Media

P. Asha, K. Sri Neeharika, T. Sindhura

Abstract: Sentiment analysis is the task of analysing people's opinions as derived from textual data, and it is useful for a range of NLP applications. The difficulty with this task is that documents contain a variety of sentiments, expressed in diverse ways. It is therefore hard to capture all sentiments with a commonly used general-purpose dictionary. This work attempts to construct a domain sentiment dictionary by employing external textual data. In addition, various classification models can be used to classify documents according to their opinion. We have also implemented topic modelling, emoticon analysis and optimized gender classification in our proposed system. Many sectors in which women are being abused have been identified. Clusters are formed for these sectors, and the most affected sector is identified.

Keywords: Sentiment analysis, cluster, Classifier, Modelling.

I. INTRODUCTION

With the vast growth of networks, the internet has become a basic need of daily life. This growth has increased connectivity among people around the globe, and people are increasingly exposed to social media platforms. Public opinion analysis has therefore become common practice before an industry takes any further step. Hence the demand for sentiment analysis and opinion mining is growing.
In this era of machine learning, sentiment analysis [1-3] plays a pivotal role in creating awareness by analysing a large sample of social media users who share their thoughts, emotions and opinions. In this work, text mining is used to obtain results. Twitter is one of the many social media platforms on the internet. Its main advantage for this work is its hashtags, which make data collection easier.

Revised Manuscript Received on September 15, 2019
Dr. P. Asha, Asst. Prof., Dept. of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai. ashapandian225@gmail.com
K. Sri Neeharika, Asst. Prof., Dept. of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai. neeharikasri6@gmail.com
T. Sindhura, Asst. Prof., Dept. of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai. sindhura2397@gmail.com

II. LITERATURE REVIEW

Probabilistic Latent Semantic Analysis (PLSA), an unsupervised learning technique based on a statistical latent class model, was proposed. The authors affirmed that their approach is more principled than conventional Latent Semantic Analysis (LSA), as it possesses a strong statistical foundation and adopts an annealed likelihood function as its optimization criterion [4-7]. PLSA is considered a promising and productive unsupervised learning method that covers a wide spectrum of applications in text learning [8].

The authors stated the employment of semantic features in Twitter sentiment classification. They explored three approaches for assimilating the collected tweets for effective analysis: replacement, augmentation and interpolation [9]. Replacement involves replacing words with more meaningful words and deleting unnecessary words. Augmentation simply means adding.
Approaches to augmentation include adding noise and applying transformations to existing data. In sparse areas, imputation and dimensionality reduction are also used to augment the data sets. Interpolation is the process of drawing new data points from within the range of known data points. Interpolation helped the model achieve its best results by interpolating the generative words into the unigram language model of the Naïve Bayes (NB) classifier [10].

A new approach to sentiment analysis was introduced that uses support vector machines (SVM). The SVM is used mainly to bring together potentially pertinent information from different sources [11-13]. This includes various favourability measures for phrases and adjectives related to the topic of the text in the tweet. A merit of this approach is that it incorporates a variety of words (with the help of the SVM), whereas analysis was previously limited to the specific words present in the tweets. Because of this incorporation of words from various sources, the efficiency of the model declined.

III. PROPOSED SYSTEM

Data is collected in the first step and then pre-processed. The pre-processed data is used to obtain valuable insights through different visualization techniques. Finally, clusters are formed and the most affected cluster is identified (Fig. 1).

Fig 1. System Architecture

The input data was collected from Twitter using various hashtags (#Metoo, #politics, #Education, #work), with the Twitter API accessed through a consumer key and secret key. All the extracted tweets are stored in .csv format. With the help of hashtags such as politics, education and work, abuses in those fields are identified, and the data is stored under different sectors, which helps in the formation of clusters.
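
The grouping of tweets into sectors by hashtag can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SECTOR_HASHTAGS` mapping, the function names and the two-column CSV layout are assumptions made for illustration, and the sample tweets are invented.

```python
import csv
import io

# Hypothetical mapping from sector label to hashtag. The paper names the
# hashtags but not the exact sector labels or CSV layout it uses.
SECTOR_HASHTAGS = {
    "politics": "#politics",
    "education": "#education",
    "work": "#work",
}


def sectors_for(tweet_text):
    """Return the sectors whose hashtag appears in the tweet (case-insensitive)."""
    tokens = {token.lower().rstrip(".,!?:;") for token in tweet_text.split()}
    return [sector for sector, tag in SECTOR_HASHTAGS.items() if tag in tokens]


def write_sector_csv(tweets, out_file):
    """Write one (sector, text) row per matching sector for each tweet."""
    writer = csv.writer(out_file)
    writer.writerow(["sector", "text"])
    for text in tweets:
        for sector in sectors_for(text):
            writer.writerow([sector, text])


if __name__ == "__main__":
    # Invented sample tweets, used only to exercise the functions.
    sample = [
        "Speaking up at last #MeToo #work",
        "This must change #metoo #Politics",
        "No sector tag here #MeToo",
    ]
    buffer = io.StringIO()
    write_sector_csv(sample, buffer)
    print(buffer.getvalue())
```

A tweet carrying several sector hashtags is written once per sector here, so that each cluster sees every tweet relevant to it; whether the original system duplicates tweets in this way is not stated.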
Tweets are stored because the number that can be extracted depends on how many people are tweeting with a particular hashtag. If tweets are fetched dynamically, the number of tweets can sometimes be low. To avoid this constraint, the required input data is stored in CSV format. Two dictionaries are formed, containing catalogues of positive and negative words respectively.

Pre-processing must be applied to the collected input because it contains a great deal of noise. Here noise includes punctuation marks, numbers, stop words, tags, URLs, unparliamentary language and missing end marks; pre-processing also splits each sentence into words. Tokens such as RT, CRT, amp, thi and CrT, which appear at the beginning of each tweet, are also removed. All of the above are removed during pre-processing.

A. Topic Modelling

Topic modelling groups similar words into one cluster, which helps to identify the hidden patterns in the data.

B. Emoticon Analysis

Emoticon analysis is used to gauge the reaction of the person tweeting. In this analysis we replace each emoticon with a suitable word, so that the emoticon is taken into consideration while categorizing the tweet.

C. Gender Classification

Twitter doesn't disclose the gender of the person tweeting. However, the gender of a user can be inferred from the username. This can be achieved using traditional dictionary libraries.

D. Visualization

Visualizations are generally used to make people's reactions easy to understand. Various types of visualization help us to recognize hidden semantic patterns precisely. Some of the visualizations used in the project are:
1. Bar plot
2. Histogram
3. Word cloud

IV. RESULTS AND DISCUSSION

Calculating the score of a tweet is an important step in this analysis because the score distinguishes the type of tweet and determines how it is classified. This classification is done with the help of the dictionaries, which include a wide range of positive and negative words.
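
The pre-processing steps of Section III and the dictionary-based scoring introduced above can be sketched together in Python. This is an illustrative sketch under stated assumptions, not the authors' code: the word lists, the emoticon mapping, the treatment of "tags" as @-mentions and the five category thresholds are all hypothetical. The score itself is the number of positive words minus the number of negative words, as the paper defines it.

```python
import re

# Illustrative word lists only; the paper's dictionaries are far larger.
POSITIVE_WORDS = {"brave", "support", "strong", "happy", "hope"}
NEGATIVE_WORDS = {"abuse", "harassment", "fear", "sad", "angry"}
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "i", "so"}
# Residue tokens named in the text (RT, CRT, amp, thi), lower-cased.
RESIDUE = {"rt", "crt", "amp", "thi"}
# Assumed emoticon-to-word mapping; the paper does not list its own.
EMOTICON_WORDS = {":)": "happy", ":-)": "happy", ":(": "sad", ":-(": "sad"}


def clean_tweet(text):
    """Return the lower-cased content words of a tweet after noise removal."""
    text = re.sub(r"https?://\S+", " ", text)      # drop URLs
    text = re.sub(r"@\w+", " ", text)              # drop user tags (assumed meaning of "tags")
    for emoticon, word in EMOTICON_WORDS.items():  # emoticon -> word, before punctuation is stripped
        text = text.replace(emoticon, f" {word} ")
    text = re.sub(r"[^A-Za-z\s]", " ", text)       # drop punctuation and numbers
    words = text.lower().split()
    return [w for w in words if w not in STOP_WORDS and w not in RESIDUE]


def score(words):
    """Count of positive words minus count of negative words."""
    positives = sum(w in POSITIVE_WORDS for w in words)
    negatives = sum(w in NEGATIVE_WORDS for w in words)
    return positives - negatives


def categorize(value):
    """Map a score to one of five categories; the cut-offs are assumed."""
    if value >= 2:
        return "most positive"
    if value == 1:
        return "positive"
    if value == 0:
        return "neutral"
    if value == -1:
        return "negative"
    return "most negative"


if __name__ == "__main__":
    tweet = "RT @user: So brave :) &amp; strong, no fear! https://t.co/x 123"
    words = clean_tweet(tweet)
    print(words, score(words), categorize(score(words)))
```

Emoticons are replaced after URLs are removed but before punctuation is stripped, so that an emoticon's word still reaches the scorer. To score a whole collection, the three functions can be applied to each tweet with a list comprehension.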
Hence all the extracted tweets, including the cluster data, undergo this process. The score of a tweet is calculated from the number of positive and negative words in it: the final score is the number of positive words minus the number of negative words. Depending on the final score, the tweet is assigned to one of five categories: most positive, positive, neutral, negative and most negative, as shown in Fig. 2. This is achieved using the laply function. The process is applied to the cleansed data obtained after pre-processing.

Fig 2. Classification of tweets of #MeToo

Fig 3. Scores of the tweets

A. Identification of the Most-Affected Cluster

The calculated sentiment scores are grouped into negative and positive: a score below zero falls into the negative category and a score above zero into the positive category. Sentiment scores are calculated for each cluster, the negative categories are compared, and a bar plot is drawn to identify the most-affected cluster among the three clusters.

Fig 4. Identification of the most-affected cluster

Fig 5. Word cloud representing the most frequent words in tweets

Fig. 5 is a word cloud that displays the most frequent words used in the tweets. The more often a word occurs, the bigger it appears in the cloud. A special feature of this cloud is that hovering over any word displays its frequency.

V. CONCLUSION

Data is collected from Twitter, but the approach is not limited to a single platform. Data can be collected from any social media platform, provided that the collected data is accurate. Hence the collected data is pre-processed.
Several pre-processing techniques are applied so that clean, factual data is supplied as input to the process. The pre-processed data is visualized to obtain valuable insights. Visualizations include bar plots, word clouds, etc. The proposed strategy classifies the tweets into five categories based on their sentiment scores. For further analysis they are classified as positive and negative tweets, which supports the sentiment analysis and assists in identifying the most affected sector.

REFERENCES

1. Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyser: Extracting sentiments about a given topic using natural language processing techniques. In: 3rd IEEE Conf. on Data Mining (ICDM'03). (2003)
2. Lloyd, L., Kechagias, D., Skiena, S.: Lydia: A system for large-scale news analysis. In: String Processing and Information Retrieval (SPIRE 2005). Lecture Notes in Computer Science, vol. 3772, pp. 161–166. (2005)
3. Andreevskaia, A., Bergler, S.: Mining WordNet for a fuzzy sentiment: Sentiment tag extraction from WordNet glosses. In: EACL. (2006)
4. Mehler, A., Bao, Y., Li, X., Wang, Y., Skiena, S.: Spatial analysis of news sources. IEEE Trans. Visualization and Computer Graphics 12 (2006).
5. Xiaolong Wang, Furu Wei, Xiaohua Liu, Ming Zhou, Ming Zhang, "Topic Sentiment Analysis in Twitter: A Graph-based Hashtag Sentiment Classification Approach", ACM, CIKM'11, October 24–28, 2011, Glasgow, Scotland, UK, 2011.
6. Asha P., Albert Mayan J., Canessane A. (2018) Efficient Mining of Positive and Negative Itemsets Using K-Means Clustering to Access the Risk of Cancer Patients. Soft Computing Systems. ICSCS 2018. Communications in Computer and Information Science, vol 837. Springer, Singapore
7. Andrius Mudinas, Dell Zhang, Mark Levene, "Combining Lexicon and Learning based Approaches for Concept-Level Sentiment Analysis", WISDOM'12, August 12, 2012, Beijing, China. Copyright 2012, ACM.
8. Asha, P., Jebarajan, T.: SOTARM: size of transaction based association rule mining algorithm. Turk. J. Electr. Eng. Comput. Sci. 25(1), 278–291 (2017)
9. Asha, P., Srinivasan, S.: Analyzing the associations between infected genes using data mining techniques. Int. J. Data Min. Bioinform. 15(3), 250–271 (2016)
10. Chenghua Lin, Yulan He, Richard Everson, and Stefan Rüger, "Weakly Supervised Joint Sentiment-Topic Detection from Text", IEEE Trans. on Knowledge and Data Engineering, vol. 24, no. 6, June 2012.
11. Xiaohui Yu, Yang Liu, Jimmy Xiangji Huang, and Aijun An, "Mining Online Reviews for Predicting Sales Performance: A Case Study in the Movie Domain", IEEE Trans. on Knowledge and Data Engineering, vol. 24, no. 4, April 2012.
12. Danushka Bollegala, David Weir, and John Carroll, "Cross-Domain Sentiment Classification Using a Sentiment Sensitive Thesaurus", IEEE Trans. on Knowledge and Data Engineering, vol. 25, no. 8, August 2013.
13. Xin Chen, Mihaela Vorvoreanu, and Krishna Madhavan, "Mining Social Media Data for Understanding Students' Learning Experiences", IEEE Trans. on Learning Technologies, vol. 7, no. 3, July-September 2014.