key: cord-0219686-huugpyem
authors: Kong, Quyu; Ram, Rohit; Rizoiu, Marian-Andrei
title: A Toolkit for Analyzing and Visualizing Online Users via Reshare Cascade Modeling
date: 2020-06-11
journal: nan
DOI: nan
sha: 2b40a59106b9b5cbfb848d2e7d8c99ed6f2ad9f8
doc_id: 219686
cord_uid: huugpyem

Modeling online discourse dynamics is a core activity in understanding the spread of information, both offline and online, and emergent online behavior. There is currently a disconnect between the practitioners of online social media analysis (usually social, political and communication scientists) and the availability of tools capable of handling large quantities of online data and examining online users and their behavior. We present two tools, birdspotter and evently, for analyzing online users based on their involvement in retweet cascades. birdspotter provides a toolkit to measure the social influence and botness of Twitter users. While birdspotter leverages the multimodal information of tweets, such as text contents, evently augments the user measurement by modeling the temporal dynamics of information diffusions using self-exciting processes. Both tools are designed for users with a wide range of computer expertise and include tutorials and detailed documentation. We illustrate a case study of a topical dataset relating to COVID-19, using both tools for end-to-end analysis of online user behavior.

The dissemination of information and opinion through social media drives change in our societies today. The existence of "viral" diffusion of information suggests that some users can exert a disproportionate influence on discourse [5, 10], and that "bad actors" can exploit misinformation campaigns, causing societal divisiveness [26]. Consequently, there is a clear need for tools to analyze the dynamics and weaknesses of online discourse systems, and to identify important users based on their activity.
There currently seems to be a disconnect between the practitioners of online social media analysis (who are most often social and political scientists, journalists or communication scientists) and the existing tools facilitating this analysis. The latter, when they exist, either require extensive programming experience or make unrealistic assumptions about the usage flow. The result is that practitioners carefully curate large social media datasets, which remain underutilized due to the lack of accessible tools. This work aims to fill this gap by proposing a suite of tools, aimed at non-computing experts, to analyze online discussions and users from the view of information reshare cascades.

* Both authors contributed equally to this research.

This work addresses three specific open questions concerning the tools to model reshare cascades and analyze online users. The first open question relates to profiling user botness and influence on previously collected data. The state-of-the-art bot detector, Botometer [15], can only be accessed through its web APIs and cannot produce predictions for users that are no longer accessible, such as suspended accounts. Since bots have a high tendency of being suspended by Twitter, measuring botness some time after collecting data risks missing a large proportion of the bots involved in discussions. Likewise, while there is plenty of research examining the influence and reputation of online users in the literature [6, 41], little of it has been converted into accessible tools for this task, and even existing tools often require knowledge of the social graph, which is sometimes impossible to capture retrospectively. The question is: can we have a tool that determines users' botness and influence, locally, on existing curated datasets? The second open question relates to modeling reshare cascades.
Recent works on information diffusion modeling [36, 44, 55] provide only individual scripts or packages for their proposed models, with disconnected API designs and (potentially complex) environment setups. A question, therefore, remains open: is there a tool that allows comparing multiple self-exciting models on real data while remaining easily accessible to non-experts? The third open question relates to describing users based both on their activity dynamics and on how other users react to their content. Informative temporal features of reshare cascades have been explored in prior research [33, 36], but no existing software can extract such features at the user level. The question is: can we easily extract reshare cascade features with a tool and show their effectiveness in online user analysis?

In this work we address the above-mentioned open questions by introducing an integrated suite of tools: two packages (evently and birdspotter) and one visualizer (birdspotter.ml). Starting from an already collected dataset containing one day's worth of Twitter discussion around COVID-19, we showcase the usage of the tools to analyze the reshare cascades and the online users. Both tools are open-source and available on GitHub, and they both feature extensive documentation and usage tutorials.

We address the first open question by introducing birdspotter, a python3 package designed to retrospectively measure the botness and social influence of Twitter users. For influence estimation, birdspotter implements a recent influence estimation algorithm [42] which relies only on the reshare cascades that users are involved in, and does not require the users' social graph, which is prohibitive to obtain.
For bot detection, birdspotter extracts a large number of user and activity features from the dataset only (i.e., it does not require accessing the user's current profile) and deploys a pre-trained XGBoost classifier to produce the likelihood of a user being a bot. The tool features easy-to-use Python and R interfaces, and it exposes a simple interface which allows researchers to annotate their own Twitter user collection. This allows fine-tuning predictions to a particular problem or dataset, and even predicting entirely different user attributes.

Figure 1: (a) The birdspotter.ml visualization system: Twitter users are plotted based on their user influence and botness (left panel), and a selected user's profile and cascade history are shown (right panel). (b) A schematic view of the software suite to analyze reshare cascades and online users, and the functionality and data dependencies between its components birdspotter and evently, designed to examine the static user attributes and dynamic user attributes respectively. The arrows depict the incoming features for birdspotter to compute botness and user influence, and for evently to model reshare cascades.

We address the second question by introducing evently, an R package dedicated to modeling online information reshare cascades using self-exciting point processes. It currently supports fitting and sampling realizations of the Hawkes process and its finite population version HawkesN [44], using the exponential and power-law decaying kernels, both unmarked and with a continuous event mark. evently exposes a number of functionalities around reshare cascades and online users. For online cascades, it can fit any of its supported models to observed data, and it can sample synthetic cascades from fitted models. It can be used to simulate likely continuations of partially observed cascades, and to compute their expected final popularity.
For online users, evently can jointly fit all cascades initiated by the same user and obtain a single model which is descriptive of the user. It also allows building a large number of dynamic user descriptors, such as the viral score (i.e., the expected size of a cascade posted by the user), and summaries of cascade size, reshare timing and event magnitude.

To answer the third question, we first introduce birdspotter.ml (shown in Fig. 1a), a visualizer designed to analyze Twitter users engaged in online discussions, which aims to provide both broad and specific views of the data. Second, we show how the functionalities of birdspotter and evently can be combined to extract user and reshare cascade information from a Twitter dataset surrounding COVID-19 discussions, fit cascades and predict final popularity, and build user features. Finally, we show that two clusters of active users emerge, and that bots are not among the influential users.

The main contributions of this paper are as follows:
• evently: a software package dedicated to modeling reshare cascades, and capable of characterizing online users based on the reshare dynamics of the cascades they generate.
• birdspotter: a software package designed to detect online bots from retrospective (already collected) data, and to estimate online user influence based on the reshare cascades.
• birdspotter.ml: an online visualizer designed to perform exploratory analysis of online Twitter users.
• A set of online tutorials showcasing how these tools can be used as an integrated toolkit aimed at non-computer science experts, and an example analysis of discussions around COVID-19.

In this section, we briefly review the theoretical prerequisites concerning modeling reshare cascades using point processes, and estimating reshare influence.

Reshare cascades. Both evently and birdspotter analyze the spread of online information in the form of online reshare cascades.
A reshare cascade consists of an initial user post and some reshare events of the post by other users. In Twitter, for example, this can happen when users use the retweet functionality. We denote a cascade observed up to time $T$ as $\mathcal{H}(T) = \{t_0, t_1, \dots\}$, where $t_i \in \mathcal{H}(T)$ are the event times relative to the first event ($t_0 = 0$). We refer to cascades with additional information about events (dubbed here event marks) as marked cascades. We use the notation $\mathcal{H}_m(T) = \{(t_0, m_0), (t_1, m_1), \dots\}$, where each event is a tuple of the event time and the event mark. For example, for retweet cascades, the numbers of followers of Twitter users are commonly adopted as event marks [36, 55].

The Hawkes processes. Both our proposed tools model reshare cascades using Hawkes processes [21], a type of point process with the self-exciting property, i.e., the occurrence of past events increases the likelihood of future events. The occurrence of events in a Hawkes process is controlled by the event intensity function:

$$\lambda(t) = \mu(t) + \sum_{t_i < t} \phi(t - t_i),$$

where $\mu(t)$ is the background intensity function and $\phi : \mathbb{R}^+ \to \mathbb{R}^+$ is a kernel function capturing the decaying influence from a historical event. We note that, for reshare cascades, all events are considered to be offspring of the initial event, i.e., there is no background event rate and $\mu(t) = 0$. Two widely adopted parametric forms for the kernel function $\phi$ are the exponential function $\phi_{EXP}(t) = \kappa\theta e^{-\theta t}$ and the power-law function $\phi_{PL}(t) = \kappa (t + c)^{-(1+\theta)}$.

The HawkesN process [44] is a finite-population variant of the Hawkes process, stemming from the connection of Hawkes with the classical SIR epidemic model [25]. HawkesN assumes a finite population $N$ (the maximum number of events in the process), and modulates the likelihood of future events by the remaining proportion of the total population. Its intensity function is defined as:

$$\lambda^{N}(t) = \left(1 - \frac{N_t}{N}\right) \left[\mu(t) + \sum_{t_i < t} \phi(t - t_i)\right],$$

where $N_t$ is the counting process, i.e., the number of events up to time $t$.
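The intensity definitions above can be made concrete with a few lines of Python. This is a minimal sketch for illustration only and not code from evently or birdspotter: the function names are hypothetical, the background rate is taken as zero as stated for reshare cascades, and $N_t$ is counted over events strictly before $t$.

```python
import math

def exp_kernel(t, kappa, theta):
    """Exponential kernel: kappa * theta * exp(-theta * t)."""
    return kappa * theta * math.exp(-theta * t)

def pl_kernel(t, kappa, theta, c):
    """Power-law kernel: kappa * (t + c) ** -(1 + theta)."""
    return kappa * (t + c) ** (-(1.0 + theta))

def hawkes_intensity(t, history, kernel):
    """Hawkes intensity at time t with zero background rate (mu = 0).

    `history` holds the event times; only events before t contribute.
    """
    return sum(kernel(t - ti) for ti in history if ti < t)

def hawkesn_intensity(t, history, kernel, N):
    """HawkesN intensity: the Hawkes intensity scaled by the share of the
    population of size N that has not yet generated an event."""
    n_t = sum(1 for ti in history if ti < t)  # events observed before t
    return (1.0 - n_t / N) * hawkes_intensity(t, history, kernel)

# Example: three events, exponential kernel, population of 10.
history = [0.0, 1.0, 2.5]
kernel = lambda s: exp_kernel(s, kappa=0.8, theta=1.0)
lam = hawkes_intensity(3.0, history, kernel)
lam_n = hawkesn_intensity(3.0, history, kernel, N=10)
```

With three of ten possible events already observed, the HawkesN intensity in this example is 0.7 times the Hawkes intensity, which is the finite-population damping described above.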
Note that $N_0 = 1$ due to the initial event $t_0 = 0$.

Marked models. Both evently and birdspotter implement marked versions of the point processes, where the mark is the number of followers of the user emitting the tweet. This is because the mark of each event governs the number of future events, e.g., a tweet from a user with many followers is likely to attract more retweets. The marked versions of both Hawkes [36] and HawkesN [29] processes are then derived by rescaling the kernel functions with the marks, i.e., $\phi(m, t) = m^{\beta} \phi(t)$, where $\beta$ controls the warping effect of the mark.

Sampling Hawkes realizations is useful when one wants to generate a synthetic reshare cascade given model parameters, or when one wants to continue a partially observed cascade. We apply the rejection-sampling algorithm [37] to simulate events from Hawkes and HawkesN processes. We refer to [43] for the detailed simulation algorithm.

The branching factor $n^*$ is an important quantity for the Hawkes and HawkesN processes (as discussed in Section 3.1). It is defined as the expected number of events directly spawned by a single event, i.e., $n^* = \int_0^{\infty} \phi(\tau)\, d\tau$. For the marked variants, an expectation is also taken over the distribution of marks, i.e., $n^* = \mathbb{E}_m\!\left[m^{\beta}\right] \int_0^{\infty} \phi(\tau)\, d\tau$.

SEISMIC [55] is a doubly stochastic formulation of Hawkes processes where the branching factor (dubbed infectiousness by Zhao et al. [55]) is a stochastic time-varying function $n^*(t)$ estimated from the observed events $\mathcal{H}_m(t)$.

Parameter estimation. We estimate the parameters of Hawkes and HawkesN using the log-likelihood function for point processes [14]:

$$\mathcal{L}(\Theta) = \sum_{t_i \in \mathcal{H}(T),\, i \geq 1} \log \lambda(t_i) - \int_0^{T} \lambda(\tau)\, d\tau.$$

SEISMIC is fitted using the R package by Zhao et al. [55].

Cascades joint modeling. When analyzing the reshare dynamics of online items (like YouTube videos and news articles) or users, it is desirable to account for the multiple cascades relating to them. Kong et al. [28] proposed to jointly model a group of cascades with a shared Hawkes model by summing the log-likelihood functions of individual cascades.
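To make the estimation objective and the joint-modeling idea concrete, the following sketch evaluates the point-process log-likelihood for an unmarked Hawkes process with the exponential kernel and zero background rate, and sums it over a group of cascades that share parameters. It is an illustration under stated assumptions, not evently's implementation (which the paper describes as relying on AMPL): the function names are hypothetical, event times are assumed sorted with $t_0 = 0$, and the initial event is excluded from the log-intensity sum because it is not generated by the process.

```python
import math

def exp_hawkes_loglik(times, T, kappa, theta):
    """Log-likelihood of one cascade observed on [0, T].

    Kernel: kappa * theta * exp(-theta * t); background rate mu = 0.
    `times` must be sorted, start at 0.0, and contain no value above T.
    """
    loglik = 0.0
    # Sum of log-intensities at every event after the initial one.
    for i in range(1, len(times)):
        lam = sum(
            kappa * theta * math.exp(-theta * (times[i] - tj))
            for tj in times[:i]
        )
        loglik += math.log(lam)
    # Integral of the intensity over [0, T], in closed form for this kernel.
    compensator = sum(
        kappa * (1.0 - math.exp(-theta * (T - tj))) for tj in times
    )
    return loglik - compensator

def joint_loglik(cascades, kappa, theta):
    """Joint objective: sum of per-cascade log-likelihoods with shared
    parameters. `cascades` is a list of (times, T) pairs."""
    return sum(exp_hawkes_loglik(times, T, kappa, theta)
               for times, T in cascades)

# Example: two cascades assumed to be initiated by the same user.
cascades = [([0.0, 1.0, 1.5], 5.0), ([0.0, 0.4], 5.0)]
value = joint_loglik(cascades, kappa=0.5, theta=1.0)
```

For this kernel the branching factor is $n^* = \kappa$, since $\int_0^{\infty} \kappa\theta e^{-\theta\tau}\, d\tau = \kappa$. Maximizing `joint_loglik` over $(\kappa, \theta)$ with any numerical optimizer would yield one shared model for the group of cascades.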
In Section 4, we jointly model the cascades initiated by the same user, and we show that the learned models can be used to separate active Twitter users from bots.

Final popularity prediction. The final popularity of a reshare cascade is the total number of events that have occurred by the time the cascade ends. Predicting the final popularity of a cascade that is still active (i.e., before its end) has been extensively explored in prior works [36, 44, 47, 55]. Given a Hawkes process with $n^* < 1$ observed until time $T$, its expected final size has an analytical solution:

$$N_{\infty} = N_T + \frac{\sum_{t_i \in \mathcal{H}(T)} \int_{T}^{\infty} \phi(\tau - t_i)\, d\tau}{1 - n^*}, \tag{4}$$

where $N_T$ is the number of events observed until time $T$. While SEISMIC predicts final popularities with Eq. (4), Mishra et al. [36] and Kong et al. [29] apply an additional regression layer over the model parameters to train a final popularity regressor.

Viral score $v$ describes a user or an online item, and it is defined as the expected popularity of a newly started cascade relating to the given user or item. It is obtained using the model jointly trained on all observed cascades of the user (item) [45]. For unmarked models with $n^* < 1$ it equals $v = 1/(1 - n^*)$; marked models additionally take the expectation over marks.

User influence estimation. birdspotter adopts the following definition for user influence, widely used in the literature [16, 42, 52]: the influence $\varphi(u)$ of a user $u$ is defined as the mean number of reshares generated directly and indirectly by a message posted by $u$, irrespective of whether it is an original message or a reshare. Estimating influence from retweet cascades has the additional difficulty of not observing the branching structure of the diffusion, i.e., the Twitter API attributes all retweets to the original tweet. birdspotter estimates Twitter user influence using only the observed retweet cascade $\mathcal{H}_m(T)$, where marks correspond to users' numbers of followers. Rizoiu et al. [42] propose a method to estimate user influence in the absence of the branching structure by assuming that retweets arrive following a Hawkes point process [43].
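Under this Hawkes assumption, each retweet can be attributed to the earlier events of the cascade in proportion to their share of the total intensity at the retweet time. The sketch below illustrates the idea for a marked exponential kernel $m^{\beta} e^{-\theta t}$ (any constant scale factor cancels in the ratio). It is a hypothetical illustration and not birdspotter's code: the function name is invented, follower counts are assumed to be positive, and the parameter values in the example are arbitrary.

```python
import math

def attribution_probabilities(times, followers, beta, theta):
    """Return a matrix p where p[i][j] is the probability that event j
    was triggered by the earlier event i (zero whenever i >= j).

    times:     sorted event times, with times[0] the original tweet
    followers: follower count (event mark) of each event's author, > 0
    """
    n = len(times)
    p = [[0.0] * n for _ in range(n)]
    for j in range(1, n):
        # Intensity contributed by each earlier event at time times[j].
        weights = [
            (followers[i] ** beta) * math.exp(-theta * (times[j] - times[i]))
            for i in range(j)
        ]
        total = sum(weights)
        for i in range(j):
            p[i][j] = weights[i] / total
    return p

# Example: an original tweet followed by two retweets.
p = attribution_probabilities(
    times=[0.0, 10.0, 20.0], followers=[100, 10, 5], beta=1.0, theta=0.1
)
```

Each column of `p` for a retweet sums to one, so the matrix describes a distribution over possible parents; the first retweet is necessarily attributed to the original tweet.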
We can quantify the probability that an event $v_j$ is generated by a previous event $v_i$ as the ratio of the event intensity generated by $v_i$ to the total intensity at time $t_j$. Formally, the probability that $v_j$ retweets $v_i$ is

$$p_{ij} = \frac{m_i^{\beta}\, \phi(t_j - t_i)}{\sum_{k < j} m_k^{\beta}\, \phi(t_j - t_k)}.$$

Rizoiu et al. [42] also introduce the pairwise influence score $m_{ij}$, intuitively defined as the amount of influence that $v_i$ exerts over $v_j$ either directly (when $v_j$ is a direct retweet of $v_i$) or indirectly (when $v_j$ is a retweet of a descendant of $v_i$); its formal definition is given in [42]. Finally, the influence of $v_i$ is $\varphi(v_i) = \sum_{k=i}^{n} m_{ik}$, and the influence of a user $u$ is the average of the influences of all of their tweets:

$$\varphi(u) = \frac{1}{|T(u)|} \sum_{v \in T(u)} \varphi(v),$$

where $T(u)$ is the set of all the tweets emitted by user $u$.

In this section, we give an overview of the two packages and the visualizer, and we describe their usage, functionalities, and design. Fig. 1b schematically shows the usages and data dependencies between the two packages.

evently is an R package for modeling online reshare cascades, and retweet cascades in particular, using Hawkes processes and their variants. By design, it provides an integrated set of functionalities for conducting cascade-level or user-level analysis of reshare diffusion.

Design. evently is designed around the interactions among three components: data (i.e., reshare cascades), models and diffusion measures. In applications, models can be used to simulate new cascades, and diffusion measures are analyzed with off-the-shelf supervised and unsupervised tools. Table 1 shows available diffusion measures and their corresponding R function calls at cascade and user levels. For cascade-level analysis, a reshare cascade is usually observed until a certain time $T$. A chosen model is then fitted on the cascade, capturing its temporal dynamics. From the learned model, evently characterizes the cascade with its branching factor (see Section 2).
It can also simulate possible future developments of the cascade after time $T$ and, in addition, derive the expectation over all future unfoldings (i.e., the final popularity in Eq. (4)). When performing user-level analysis, cascades are grouped based on the user that initiates them. evently models these cascades jointly, and the resulting fitted model encodes the reshare patterns at the user level. Similarly, new reshare cascades can be simulated from this model, and the viral score denoting the expected popularity of a new cascade from the same user can be derived (Section 2). Other temporal features for the user that can be derived from the group of cascades include six-point summaries (mean, first and third quartiles, median, minimum and maximum values) of cascade sizes, reshare event time intervals and event magnitudes [36].

Implementation. evently contains two core functions in terms of data and models: fit_series fits a model on given cascades; generate_series simulates cascades from a provided model. A model is selected by passing a model_type argument to these functions, where abbreviated strings denote models. For example, EXP and PL stand for Hawkes processes with an exponential kernel and a power-law kernel respectively, while mEXP and mPL are their marked variants. We refer to the package documentation for a complete table of model abbreviations.

Data structure. Cascades are structured as tables (or data.frames in R) where a time column stores event timestamps relative to the first event $t_0$ and an optional magnitude column holds the corresponding event mark information. The APIs of evently also work with an R list of cascade data.frames, assuming these cascades share the same model.

Optimization. As mentioned in Section 2, the model parameter estimation is done via AMPL, a modeling language designed to describe and solve large-scale optimization problems.
Compared to other optimization tools such as nlopt [23], which require precomputed or numerical gradients, AMPL provides automatic differentiation of functions, which makes model implementation more efficient. Moreover, it is also compatible with a wide range of solvers, including two standard non-linear solvers, IPOPT [46] and LGO [39].

Installation. evently can be installed in R directly from GitHub: remotes::install_github('behavioral-ds/evently'). Upon the first load, it automatically downloads and configures its dependencies AMPL and IPOPT, which would involve considerable effort if performed manually.

birdspotter is a python3 package providing a toolkit to measure the botness and social influence of Twitter users retrospectively, i.e., on previously collected tweets provided in the standard jsonl or json format. The package is aimed at practitioners who analyze discourse and user activity on social media, such as social scientists and journalists, and requires only basic R or Python experience and no statistical modeling knowledge.

Measure influence and botness. birdspotter measures user influence as outlined in Section 2, using by default a marked Hawkes exponential kernel with parameters $\beta = 1$, $\kappa = 1/\theta$ and $\theta = 6.8 \times 10^{-4}$. These were tuned on a large collection of real cascades [42], and can be customized using the function getInfluenceScores(). To measure botness, birdspotter internally leverages an XGBoost classifier [8] trained on a large tweet dataset of labeled bots. We construct user features in three categories: classic, semantic, and topic-based features. The classic features include user features (such as the follower-to-followee ratio, friends count and years on Twitter), activity features (such as retweet count, statuses rate, and favorite count) and lexical features (such as number of characters in tweets, number of punctuation marks, and number of mentions).
The semantic features are constructed from word2vec [35] embeddings of users' tweets and their user descriptions. By default, birdspotter automatically downloads and uses the GloVe 300d Twitter embeddings [38]; however, customized embeddings can easily be used. Finally, the topic-based features are constructed from the term frequency-inverse document frequency (TF-IDF) [24] of the hashtags of the tweets and retweets in which users participate.

Usage and functionalities. Given a dataset of tweets collected externally through the Twitter API, birdspotter's core functionality revolves around two steps. First, it loads the Twitter dataset, 1 birdspotter