title: How They Tweet? An Insightful Analysis of Twitter Handles of Saudi Arabia
authors: Khan, Mohammad Muzammil; Sohail, Shahab Saquib; Ishrat, Darakhshan
date: 2021-05-22

The emergence of social networking sites has attracted users across the world to share their feelings, news, achievements, and personal thoughts over several platforms. The recent crisis caused by the worldwide lockdown amid COVID-19 has shown how these online social platforms have grown stronger and become a major source of connection among people under widespread social distancing. We have therefore surveyed Twitter users and their behaviour with respect to language, frequency of tweets, region of origin, etc. These observations have been considered especially with respect to Saudi Arabia. An insightful analysis of the tweets and Twitter handles of the Kingdom is presented. The results show some interesting facts that are envisaged to lay a platform for further research in the social, political, and data sciences related to the Middle East.

Twitter is a social networking and micro-blogging platform. The service gained rapid success and popularity as it provided a unique way to interact on the Internet through posts termed "tweets". A tweet can carry a variety of content, including text, photos, videos, and polls. The service was launched in 2006, and after just 6 years, in 2012, the micro-blogging site had gained 140 million active users posting over 300 million tweets a day [1]. The platform initially limited a tweet to 140 characters. However, in 2017 Twitter doubled the limit for non-Asian languages to 280 characters [2]. Twitter provides many ways to categorize and tag tweets.
"Hashtags" can be used to group similar tweets by content type, or by a ticker symbol for stocks and companies; a popular hashtag is usually known as a trend, or as trending, on Twitter. Twitter generates an enormous amount of data that can be accessed using the official Application Programming Interface (API). The API requires an authentication key to access the data. However, there are limitations for each type of API. These limitations govern the number of times a request can be made, the amount of data a single user can access, and how much of the Twittersphere is accessible to a user. They are in place to protect data usage and to prevent the personalized targeting of an individual or a group of individuals. We aim to collect as much data as possible while respecting the limits and restrictions in place, and to analyze the tweets for useful information about the behaviour of users tweeting in a certain region. We collect and analyze the tweets from the perspective of the Kingdom of Saudi Arabia (KSA).

* Contact: mmkhan.sch@jamiahamdard.ac.in

Researchers have been studying and exploring social networks for quite a while now. Many have applied their studies to Twitter and found influence patterns [3]. Regarding influencers, researchers have found that the accounts of news media are better spreaders of content and that celebrities use Twitter to hold conversations [4]. The authors of [5] studied demographic data for the United States of America using the Twitter data from [3] and were able to identify location, gender, and race/ethnicity from the publicly accessible data alone. Researchers have also been able to use the data generated by online social networking sites to predict many things that seem random or cannot be predicted by humans unaided.
These predictions include the stock market [6, 7], book sales [8], movie-ticket sales [9, 10], product sales [11], infections, diseases, and consumer spending [12]. Authors have also used Twitter data for healthcare purposes and disease prediction [13, 14].

There have been few works with respect to Saudi Arabia and Arabic-language tweets. In 2011, Al-Khalifa [15] performed a social network analysis of the Twittersphere to understand the social structure of citizens and their interactions in discussions of political issues. Since then, Arabic dialects have received a lot of attention from researchers. With the use of natural language processing, many researchers are now able to investigate them in unprecedented depth. In [16], Al-Twairesh et al. studied the usage of Modern Standard Arabic (MSA) by Arabic-speaking users of Twitter. Just like the predictions made in [6-12], researchers have also been able to predict the Saudi Stock Market Index [17]. Authors have also been able to detect sarcasm through analysis of Arabic tweets [18]. In 2018, researchers profiled the Arabic-tweeting community to help identify and extract attributes of a Twitter user, using machine-learning-based classification of topics from tweets, which achieved 90% accuracy [19]. Sentiment analysis based on text mining is also being used in research on Arabic social media [20]. An emotion and mood visualizer called "Saudi Mood" was proposed in [21] to continuously monitor Arabic Twitter and detect the dominant emotion in the Kingdom of Saudi Arabia using real-time sentiment analysis. Another study [22] used sentiment analysis of Arabic tweets to gain insight into the different dialects used in them.
As of January 2020 [23], there are 14.35 million Twitter users in Saudi Arabia, which ranks 4th after the United States, Japan, and the United Kingdom with 59.35, 45.75, and 16.7 million users respectively. Since the number of users is high, one would expect the number of tweets from the Kingdom to be high as well. We have investigated whether Saudi Arabia falls within the top 10 most-tweeting countries. In addition, Arabic is the lingua franca of the Arab world. As it is the 6th most-spoken language worldwide [24], we have examined the ranking of Arabic among the top 10 languages used on Twitter. Regardless of the dialects spoken, Egypt is the most populous Arabic-speaking country [25]. However, many factors come into play when we consider who tweets more, such as social awareness of the platform and the social networking norms in the country. According to [26], Facebook is the most popular social media platform among the Egyptian public and Twitter is the 3rd. According to a report [27], Facebook is also the most favoured platform in Saudi Arabia; however, Twitter comes 2nd. Naturally, we have examined which country produces more Arabic tweets: Saudi Arabia or another country such as Egypt. A study conducted in 2016 [28] shows that the best time to get the most interactions for a tweet is around noon, from 12 PM to 1 PM. This study was conducted on a high volume of tweets. However, it did not include or mention Saudi Arabia explicitly. We have surveyed the Twitter data to find the period in which tweets concerning Saudi Arabia are posted most frequently.

To maximize efficiency and productivity, we divided the work into distinct yet integrated modules. The system consists of 4 modules, each of which takes the output of the previous module as its input and generates output to be used as input for the next module. The modules are the Crawler, Processor, Analyzer, and Pruner.
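The chaining of the four modules can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the record fields (`id`, `lang`, `text`), and the sample data are all hypothetical, and each stage is reduced to a trivial stand-in so that only the hand-off from one module to the next is shown.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical record type: one tweet as a plain dictionary.
Tweet = dict
Stage = Callable[[Iterable[Tweet]], Iterator[Tweet]]


def crawler(source: Iterable[Tweet]) -> Iterator[Tweet]:
    """Stand-in for the Crawler: pass tweets through from some source."""
    for tweet in source:
        yield tweet


def processor(tweets: Iterable[Tweet]) -> Iterator[Tweet]:
    """Stand-in for the Processor: drop tweets whose id was already seen."""
    seen = set()
    for tweet in tweets:
        if tweet["id"] not in seen:
            seen.add(tweet["id"])
            yield tweet


def analyzer(tweets: Iterable[Tweet]) -> Iterator[Tweet]:
    """Stand-in for the Analyzer: attach a derived field (word count)."""
    for tweet in tweets:
        yield {**tweet, "n_words": len(tweet["text"].split())}


def pruner(tweets: Iterable[Tweet]) -> Iterator[Tweet]:
    """Stand-in for the Pruner: keep only the fields needed downstream."""
    for tweet in tweets:
        yield {"id": tweet["id"], "lang": tweet["lang"], "n_words": tweet["n_words"]}


def run_pipeline(source: Iterable[Tweet], stages: List[Stage]) -> List[Tweet]:
    """Feed the output of each stage into the next one, in order."""
    data: Iterable[Tweet] = source
    for stage in stages:
        data = stage(data)
    return list(data)


# Hypothetical sample input with one duplicated tweet.
SAMPLE = [
    {"id": "1", "lang": "ar", "text": "مرحبا بالعالم"},
    {"id": "1", "lang": "ar", "text": "مرحبا بالعالم"},
    {"id": "2", "lang": "en", "text": "hello from Riyadh today"},
]

RESULT = run_pipeline(SAMPLE, [crawler, processor, analyzer, pruner])
```

Because every stage is a generator, tweets stream through the chain one at a time, which is one plausible way to handle a dataset too large to hold in memory at once.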
A customized crawler was created which accessed tweets through the official API in real time and stored the data synchronously in a well-defined format. We ran the Crawler for around 23 continuous days under the API rate limit of 450 requests per 15 minutes and obtained over 86.79 million tweets. The Crawler was programmed to save only the tweets in which (1) the user had entered a location and (2) the language was detected by Twitter. After trimming the results, we had a dataset of 58.15 million tweets.

We also created a custom Processor to process the data in whatever way we required. In the Processor, we programmed two sub-modules: one to check and verify the integrity of the dataset, and one to detect and extract countries and cities from plain text, which in this case is the location field self-reported by the user. We ran the Processor twice. In the first run, we checked for duplicates and for false negatives in location detection. After this first round of processing, we found 8.2 million duplicate tweets and 2.36 million distinct entries for unknown/undetected locations from 21.69 million tweets. We then manually evaluated the false negatives and adjusted the algorithm, modifying 0.37 million entries. While evaluating, we found that people were using a wide range of spellings and native languages for their locations. This process also confirmed that the dataset format was error-free, as we were able to process the whole dataset of crawled tweets. After deduplication and adjustments, we ran the Processor again; this left a dataset of 49.99 million tweets, and we were able to detect the country for 68.49% of it.

Then the Analyzer came into the picture. The Analyzer examined each tweet according to the inputs provided in order to categorize the data. We used a combination of regular expressions, mapping, and hardcoded commands to sort the data and to output the analysis in comma-separated-value (CSV) format.
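As an illustration of the regular-expression-and-mapping approach to the self-reported location field, together with deduplication and CSV output, a minimal Python sketch is given here. The alias patterns, the field names (`id`, `location`), and the helper names are assumptions made for this example and are not taken from the authors' code; a real alias table would need to cover far more spellings and languages.

```python
import csv
import io
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical alias table: canonical country name -> pattern that matches
# common English and Arabic spellings of the country or its major cities.
COUNTRY_PATTERNS = {
    "Saudi Arabia": re.compile(
        r"\b(?:saudi arabia|ksa|riyadh|jeddah)\b|السعودية|الرياض", re.IGNORECASE
    ),
    "Egypt": re.compile(r"\b(?:egypt|cairo)\b|مصر|القاهرة", re.IGNORECASE),
}


def detect_country(location: str) -> Optional[str]:
    """Return the first country whose pattern matches, or None if undetected."""
    for country, pattern in COUNTRY_PATTERNS.items():
        if pattern.search(location):
            return country
    return None


def country_counts_csv(tweets: Iterable[dict]) -> str:
    """Deduplicate tweets by id, count them per detected country, emit CSV."""
    seen_ids = set()
    counts: Counter = Counter()
    for tweet in tweets:
        if tweet["id"] in seen_ids:
            continue  # duplicate tweet, skip it
        seen_ids.add(tweet["id"])
        counts[detect_country(tweet["location"]) or "unknown"] += 1

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["country", "tweets"])
    for country, n in sorted(counts.items()):
        writer.writerow([country, n])
    return buffer.getvalue()


# Hypothetical sample records, including one duplicate and one unknown location.
SAMPLE = [
    {"id": 1, "location": "Riyadh, KSA"},
    {"id": 1, "location": "Riyadh, KSA"},
    {"id": 2, "location": "القاهرة"},
    {"id": 3, "location": "somewhere over the rainbow"},
    {"id": 4, "location": "jeddah"},
]

CSV_TEXT = country_counts_csv(SAMPLE)
```

Locations that match no pattern are counted under `unknown`, which corresponds to the undetected entries that had to be evaluated manually in the study.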
This format can be imported into any analysis tool for further study. We programmed the Analyzer to show a preliminary analysis after completing execution. The preliminary analysis showed that there were 12.85 million users, of whom 7.24 million posted tweets and 8.67 million retweeted, with 3.07 million users doing both. The dataset had 33.84 million words, 64 languages, and 237 countries.

Finally, the Pruner handled the analyzed data, which was more than 16 GB and too big to load every time it needed to be inspected. The Pruner therefore produced the final output, i.e. the sorted and trimmed data, which was relatively small and much more portable.

Twitter provides language detection for tweets within the API. At the time of crawling, we selected and saved only the tweets for which the API was able to detect the language. Our dataset had a total of 64 languages, among which English was the most popular with 29.35 million tweets, followed by Spanish with 8.62 million and Portuguese with 6.31 million. The least used, with a mere 5 tweets, was Laotian, the native language of the people of Laos. We find that Arabic, the official language of the Kingdom, ranked 8th with over 0.21 million tweets, or 0.42% of the whole dataset.

Moreover, as shown in table 1, Arabic is the most popular language in the KSA with 61,390 tweets, amounting to 67.15% of the tweets originating from the country. It is followed by English and Spanish with 27,161 and 464 tweets respectively, while Lithuanian, Icelandic, Hebrew, Sindhi, Telugu, Polish, Kannada, and Bangla have only 1 tweet each. Although Arabic is the most popular language among Twitter users residing in the Kingdom, Arabic tweets from the Kingdom amount to only 29.34% of all Arabic tweets in the dataset, and 47.17% of Arabic tweets come from unidentified locations, some share of which could well originate in the KSA itself.
The KSA is followed by Egypt, Kuwait, the United Arab Emirates, the United States, and Iraq with 5.93%, 4.45%, 2.39%, 1.79%, and 1.13% of the Arabic tweets respectively (table 2).

The Twitter API returned each tweet with its creation timestamp in Coordinated Universal Time (UTC). We stored this as well and then converted it to Arabia Standard Time (AST, UTC+3). We saw that Saudi users are least active at dawn, specifically at 0400 hours AST (0100 UTC), after which activity slowly picks up and reaches peak usage in the afternoon, specifically at 1300 hours AST (1000 UTC). Afterwards, usage declines until between 1600 hours AST (1300 UTC) and 1800 hours AST (1500 UTC) (figure 2). This finding corroborates the study [28] and suggests that the best time to get the most interactions from a tweet is around noon, specifically around 1300 AST, as most users are active around that time.

In this paper, we crawled around 86 million tweets over 23 days and then processed and analyzed the data concerning the trends and behaviour in Saudi Arabia and in the Arabic language on Twitter. Primarily, we focused on exploring whether Saudi Arabia is among the top 10 tweeting countries, since the number of Twitter users in the KSA is relatively high and more users should result in more tweets. We found that Saudi Arabia is placed 11th in terms of the number of tweets. Since Arabic is among the leading languages spoken worldwide, we also examined how Twitter users across the globe behave with respect to language, and found that Arabic remains one of the top languages by frequency on Twitter. In addition, we observed that the Kingdom of Saudi Arabia has the most Arabic tweets compared with any other country worldwide. The earlier study suggests that the most active time for users is at noon.
We investigated this and found that the most frequent tweeting time is 12 PM to 1 PM across the respective time zones of the countries concerned. In the future, we would like to extend this work by performing a deeper analysis of the tweets themselves and their impact on Arab culture. Sentiment analysis could be performed on the tweets to further classify them and extract features. We believe this study establishes a base for future Twitter research, not only Arab-centric research but also research on other regions and other aspects.

Twitter turns six
Measuring user influence in twitter: The million follower fallacy
The influentials: New approaches for analyzing influence on twitter
Understanding the demographics of Twitter users
Twitter mood predicts the stock market
Textual analysis of stock market prediction using breaking financial news: The AZFin text system
The predictive power of online chatter
Capturing Global Mood Levels using Blog Posts
Predicting the future with social media. IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology
ARSA: a sentiment-aware model for predicting sales performance using blogs
Predicting the present with Google Trends
Demographic-based content analysis of web-based health-related social media
Demographics analysis of twitter users who tweeted on psychological articles and tweets analysis
Exploring political activities in the Saudi Twitterverse
Towards analyzing Saudi tweets
Analysis of the relationship between Saudi twitter posts and the Saudi stock market
Arabic sarcasm detection in Twitter
Arabic Twitter Profiling For Arabic-Speaking Users
Sentiment Analysis for Arabic in Social Media Network: A Systematic Mapping Study
Saudi Mood: A Real-Time Informative Tool for Visualizing Emotions in Saudi Arabia Using Twitter. Saudi Computer Society National Computer Conference (NCC)
Sentiment Analysis of Arabic Tweets in Smart Cities: A Review of Saudi Dialect
The most spoken languages worldwide in 2019
List of countries where Arabic is an official language
Social Media Stats Saudi Arabia
The Biggest Social Media Science Study: What 4.8 Million Tweets Say About the Best Time to Tweet