key: cord-0145870-5x1kxoht
authors: Pavlyshenko, Bohdan M.
title: Forming Predictive Features of Tweets for Decision-Making Support
date: 2022-01-06
journal: nan
DOI: nan
sha: acf15e4aa05bd3b07c9105a6cb148a776ccbf4ed
doc_id: 145870
cord_uid: 5x1kxoht

The article describes approaches for forming different predictive features of tweet datasets and using them in predictive analysis for decision-making support. Graph theory, as well as the theory of frequent itemsets and association rules, is used for forming and retrieving different features from these datasets. These approaches make it possible to reveal a semantic structure in tweets related to a specified entity. It is shown that quantitative characteristics of semantic frequent itemsets can be used in predictive regression models with specified target variables.

Tweets, the messages of Twitter microblogs, have a high density of semantically important keywords. This makes it possible to extract semantically important information from tweets and to generate features of predictive models for decision-making support. Different studies of Twitter are considered in [3, 4, 5, 6, 9, 15, 17, 19, 23, 31, 34]. In [27, 28], we study the use of tweet features for forecasting different kinds of events. In [25], we study the modeling of COVID-19 spread and its impact on the stock market using different types of data, and we also consider the features of tweets related to the COVID-19 pandemic. In this paper, we study the predictive features of tweets using collected datasets of tweets related to the Tesla company.

The relationships among users can be considered as a graph, where vertices denote users and edges denote their connections. Using graph mining algorithms, one can detect user communities and rank users by various characteristics, such as Hub, Authority, PageRank, and Betweenness.
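As an illustration of ranking users by one of these characteristics, the following is a minimal sketch of PageRank computed by power iteration. It is written in plain Python rather than in the R environment used in the study, and the user names, the edge list, and the damping value of 0.85 are assumptions chosen only for the example.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by users without outgoing links is spread over all users.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1.0 - damping + damping * dangling) / len(nodes) for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical edges: (user who mentions or retweets, user who is mentioned).
edges = [
    ("ann", "tesla_fan"),
    ("bob", "tesla_fan"),
    ("cat", "tesla_fan"),
    ("tesla_fan", "ann"),
    ("bob", "cat"),
]
ranks = pagerank(edges)
top_users = sorted(ranks, key=ranks.get, reverse=True)
print(top_users)
```

The ordered list `top_users` plays the role of the ordered list of users by PageRank; the other characteristics (Hub, Authority, Betweenness) would need their own computations.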
To identify user communities, we used the Walktrap community detection algorithm, which is implemented in the igraph package [11] for the R programming environment. The Walktrap algorithm searches for closely connected subgraphs, also called communities, using random walks [30]. To visualize the graph of relationships between users, we used the Fruchterman-Reingold algorithm [12] from the same package.

We can assume that tweets carry predictive information for different business processes. For our case study, we collected tweets related to the Tesla company over a certain time period. The qualitative structure of the tweets can be used for aggregating different quantitative time series and, in this way, for creating new features for predictive models, which can be used, for example, for stock price forecasting. Let us consider which features we can retrieve from tweet sets for predictive analytics. Figure 1 shows the revealed user communities for a subset of tweets. Figure 2 shows the subgraph for users of highly isolated communities.

The theory of frequent itemsets and association rules is often used in intelligent data analysis [1, 2, 7, 10, 14, 16, 24, 32]. It can be used in text data analysis to identify and analyze certain sets of objects that occur often in large data arrays and are characterized by certain features. Let us consider the algorithms for detecting frequent itemsets and association rules using the example of processing microblog messages on Twitter. We can specify a thematic field, which is a set of keywords semantically related to the domain under study. Figure 3 shows the frequencies of keywords for the thematic field used in the frequent itemsets analysis. The thematic field makes it possible to narrow the semantic analysis of messages to the given thematic framework.
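The idea of restricting tweets to a thematic field and then looking for frequent itemsets and association rules can be sketched as follows. This is a brute-force enumeration over small itemsets, not one of the optimized algorithms cited above, and the tweets, the thematic field, and the thresholds are hypothetical values chosen for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for itemsets with support >= min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result


def association_rules(itemsets, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), k):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already stored in `itemsets`.
                confidence = support / itemsets[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


# Hypothetical thematic field and tweets (already lowercased and tokenizable).
thematic_field = {"tesla", "solar", "panel", "walmart", "fire", "stock"}
tweets = [
    "tesla solar panel fire at walmart",
    "walmart sues tesla over solar panel fire",
    "tesla stock falls today",
    "new tesla model review",
    "solar panel fire on walmart roof",
]
# Each tweet becomes a transaction holding only keywords of the thematic field.
transactions = [set(tweet.split()) & thematic_field for tweet in tweets]

itemsets = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(itemsets, min_confidence=0.9)
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

The support values stored in `itemsets` are the kind of quantitative characteristic that can later be used as a feature in a regression model.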
Based on the obtained frequent semantic itemsets, we analyze possible association rules that reflect the internal semantic connections of thematic concepts in messages. While the tweet dataset was being collected, an incident involving Tesla-manufactured solar panels on the roofs of Walmart stores took place. It is important to consider how trends related to this topic are reflected in various processes, in particular in the dynamics of the company's stock price in the financial market. Using frequent itemsets and association rules, we can find a semantic structure in specified semantic fields of lexemes. Figures 4 and 5 show semantic frequent itemsets for specified topics related to the Tesla company. Figures 6 and 7 show association rules represented by a graph and by a grouped matrix. Figure 8 shows sentiment and personality analytics characteristics obtained using IBM Watson Personality Insights [20].

Using the revealed user graph structure, the semantic structure, and topic-related keywords and hashtags, one can obtain keyword time series of tweet counts per day. These time series can be considered as features in predictive models. In some time series, we can see exactly when the incident with solar panels on Walmart roofs appeared and how long it was discussed on Twitter. Figure 9 shows the time series for different keywords and hashtags in the tweets.

[Figure 10: caption not recoverable from the source; it refers to Figure 9.]

One can see that, at the time of the Tesla solar panel incident, tweet activity increases in the time series of some keywords. Let us analyze how this incident affects the share price of Tesla. A linear model was created in which the time series of keywords and their time-shifted values (lags) were considered as independent regression variables. As a target variable, we considered the time series of the relative change in price during the day (price return). Using LASSO regression, weights were found for the analyzed features.
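A linear model of this kind can be sketched as follows: lagged copies of a keyword time series form the feature matrix, and LASSO weights are found by coordinate descent with soft thresholding. The keyword counts, the synthetic price return (constructed here to depend only on the one-day lag), and the penalty value are assumptions made for the illustration; they are not data or settings from the study.

```python
def make_lagged_features(series, max_lag):
    """Row for day t holds [x_t, x_(t-1), ..., x_(t-max_lag)]."""
    return [[series[t - lag] for lag in range(max_lag + 1)]
            for t in range(max_lag, len(series))]


def soft_threshold(value, threshold):
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, iterations=200):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iterations):
        for j in range(p):
            rho = 0.0
            z = 0.0
            for i in range(n):
                # Prediction from all features except feature j.
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w


# Hypothetical daily tweet counts for one keyword, centered around their mean.
keyword_counts = [2, 3, 1, 9, 4, 2, 8, 3, 1, 2, 7, 2]
mean_count = sum(keyword_counts) / len(keyword_counts)
centered = [count - mean_count for count in keyword_counts]

X = make_lagged_features(centered, max_lag=2)
# Synthetic target: the return reacts negatively to yesterday's tweet activity.
price_return = [-0.01 * row[1] for row in X]

weights = lasso_coordinate_descent(X, price_return, alpha=0.001)
print(weights)  # order of weights: lag 0, lag 1, lag 2
```

With a sufficiently large penalty, all weights shrink to zero, which is how LASSO selects a small subset of keyword and lag features.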
Figure 11 shows the dynamics of the Tesla stock price (ticker TSLA) in the stock market. The linear model uses the keyword time series and their lagged values as covariates and the stock price return time series for ticker TSLA as the target variable, with the weight coefficients of the features found by LASSO regression. Figure 12 shows the stock price return and the predicted values. Figure 13 shows the regression coefficients for the Bayesian inference.

The Bayesian approach makes it possible to calculate distributions for the model parameters and for the target variable, which is important for risk assessment [8, 13, 18]. Bayesian inference also makes it possible to take into account the non-Gaussian distributions of target variables that occur in many cases for financial time series. In [26], we considered different approaches to using Bayesian models for time series. Figure 14 shows the boxplots for feature coefficients in the Bayesian regression model.

It is interesting to use Q-learning to find an optimal trading strategy. Q-learning is an approach based on the Bellman equation [21, 22, 33]. In [29], we considered different approaches for sales time series analytics using deep Q-learning. Let us consider a simple trading strategy for the stock with ticker TSLA. In the simplest case of using deep Q-learning, we can apply three actions: 'buy', 'sell', and 'hold'. For state features, we used keyword time series. As a reward, we used the stock price return. The environment for the learning agent was modeled using the keyword and reward time series. Figure 15 shows the price return over the episodes of learning agent iterations. The results show that an intelligent agent can find an optimal profitable strategy. Of course, this is a very simplified case of analysis in which overfitting may occur, so this approach requires further study.
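The Bellman-equation update behind this setup can be illustrated with a tabular Q-learning sketch. Note that this swaps the deep Q-network used in the study for a lookup table, reduces the keyword features to a single hypothetical state with two levels ('low' or 'high' tweet activity), and uses invented returns; the learning rate, discount factor, and exploration rate are likewise assumptions.

```python
import random

ACTIONS = ["buy", "sell", "hold"]
# Position taken for the next day: long, short, or out of the market.
POSITION = {"buy": 1.0, "sell": -1.0, "hold": 0.0}


def train_q_table(states, returns, episodes=500, lr=0.1, gamma=0.9,
                  epsilon=0.2, seed=0):
    """Tabular Q-learning; the reward is position times next-day return."""
    rng = random.Random(seed)
    q = {state: {action: 0.0 for action in ACTIONS} for state in set(states)}
    for _ in range(episodes):
        for t in range(len(states) - 1):
            state, next_state = states[t], states[t + 1]
            # Epsilon-greedy choice between exploration and exploitation.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[state][a])
            reward = POSITION[action] * returns[t + 1]
            # Bellman update; the last step bootstraps from the final state's values.
            target = reward + gamma * max(q[next_state].values())
            q[state][action] += lr * (target - q[state][action])
    return q


# Hypothetical data: tweet activity level on day t and price return on day t.
states = ["low", "low", "high", "low", "high", "high", "low", "low"]
returns = [0.0, 0.03, 0.02, -0.03, 0.03, -0.02, -0.03, 0.02]

q_table = train_q_table(states, returns)
policy = {state: max(q_table[state], key=q_table[state].get) for state in q_table}
print(policy)
```

In this toy environment, high tweet activity is always followed by a negative return, so the greedy policy read from the table is expected to sell after high activity and buy after low activity. As in the study, such a result on a small historical sample is prone to overfitting.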
The main goal is to show that, using reinforcement learning and an environment model based on historical financial data and quantitative characteristics of tweets, it is possible to build a model in which an intelligent agent can find an optimal strategy that maximizes the reward function over episodes of interaction between the learning agent and the environment.

It was shown that time series of keyword features can be used as predictive features for different predictive analytics problems. Using Bayesian regression and quantitative features of tweets, one can estimate the uncertainty of the target variable, which is important for decision-making support. Using graph theory, user communities and influencers can be revealed from tweet characteristics. The analysis of tweets related to a specified area was carried out using frequent itemsets and association rules. The frequent itemsets and association rules found reveal the semantic structure of tweets related to a specified area. The quantitative characteristics of frequent itemsets and association rules, e.g. the value of support, can be used as features in regression models. Bayesian regression makes it possible to assess the uncertainty of tweet features and of the target variable. It is shown that tweet features can also be used in deep Q-learning for forming the optimal strategy of a learning agent, e.g. in the study of optimal trading strategies on the stock market.
[1] Fast discovery of association rules
[2] Fast algorithms for mining association rules
[3] Predicting the future with social media
[4] Improving cyberbullying detection using twitter users' psychological features and machine learning
[5] Characterizing user behavior in online social networks
[6] Twitter mood predicts the stock market
[7] Beyond market baskets: Generalizing association rules to correlations
[8] Stan: A probabilistic programming language
[9] Measuring user influence in twitter: The million follower fallacy
[10] Mining frequent itemsets from uncertain data
[11] The igraph software package for complex network research
[12] Graph drawing by force-directed placement. Software: Practice and experience
[13] Bayesian data analysis
[14] Efficiently mining maximal frequent itemsets
[15] Why we twitter: understanding microblogging usage and communities
[16] Finding interesting rules from large sets of discovered association rules
[17] The predictive power of public twitter sentiment for forecasting cryptocurrency prices
[18] Doing Bayesian data analysis: A tutorial with
[19] What is Twitter
[20] IBM Watson Personality Insights: The science behind the service
[21] Playing atari with deep reinforcement learning
[22] Human-level control through deep reinforcement learning
[23] Twitter as a corpus for sentiment analysis and opinion mining
[24] Discovering frequent closed itemsets for association rules
[25] Modeling COVID-19 Spread and Its Impact on Stock Market Using Different Types of Data. Electronics and information technologies
[26] Bayesian Regression Approach for Building and Stacking Predictive Models in Time Series Analytics
[27] Forecasting of Events by Tweets Data Mining. Electronics and information technologies
[28] Can Twitter Predict Royal Baby's Name? Electronics and information technologies
[29] Sales Time Series Analytics Using Deep Q-learning
[30] Computing communities in large networks using random walks
[31] Tweetgeist: Can the twitter timeline reveal the structure of broadcast events
[32] Mining association rules with item constraints
[33] Introduction to reinforcement learning
[34] A novel method for twitter sentiment analysis based on attentional-graph neural network