key: cord-0285688-5471dogs authors: Mishra, Swapnil; Rizoiu, Marian-Andrei; Xie, Lexing title: Feature Driven and Point Process Approaches for Popularity Prediction date: 2016-08-17 journal: nan DOI: 10.1145/2983323.2983812 sha: fca8a095b67b9c13b969d3b726c3996b9550816 doc_id: 285688 cord_uid: 5471dogs

Predicting popularity, or the total volume of information outbreaks, is an important subproblem for understanding collective behavior in networks. Each of the two main types of recent approaches to the problem, feature-driven and generative models, has desirable qualities and clear limitations. This paper bridges the gap between these solutions with a new hybrid approach and a new performance benchmark. We model each social cascade with a marked Hawkes self-exciting point process, and estimate the content virality, memory decay, and user influence. We then learn a predictive layer for popularity prediction using a collection of cascade histories. To our surprise, the Hawkes process with a predictive overlay outperforms recent feature-driven and generative approaches on existing tweet data [43] and a new public benchmark on news tweets. We also found that a basic set of user features and event time summary statistics performs competitively in both classification and regression tasks, and that adding point process information to the feature set further improves predictions. From these observations, we argue that future work on popularity prediction should compare across feature-driven and generative modeling approaches in both classification and regression tasks.

Popularity prediction, especially in the context of online media, has seen increased attention, as successful solutions would help both content consumers to cope with information overload, and content producers to disseminate information within limited resources. Accurate models will help producers to identify trends, discover useful content and disseminate it faster, improving the content delivery of the network. Furthermore, this would provide more insight into understanding collective behaviour.

When presented with the task of predicting popularity, a clear gap appears between the two main classes of approaches: the feature-driven approaches and the generative approaches. Feature-driven approaches [3, 7, 25, 29] summarize network, user, and event history information into an extensive set of features, and they use machine learning approaches to predict future popularity. Generative methods [12, 34, 43, 44] leverage fine-grained timing information in the event series, however they make stronger assumptions about the diffusion process. A thorough understanding of the strengths and weaknesses of the two classes is currently missing, especially since the generative methods are usually designed for explaining the mechanisms that generate attention, and not optimized for prediction. Another gap arises between the different problem settings.
Prior work has addressed problems like predicting the volume of attention [34, 44], predicting if a cascade will double in size [7], whether an item will have 10 million views [33], or be among the top 5% most popular [42]. However, the understanding of how the different approaches and the different features generalize over the various problem settings is incomplete. In addition, a practical bottleneck is the availability of certain types of features, such as network data and the corresponding measures - for example, the number of users exposed to the diffusion up to a time t, or the potential number of reachable users. A third gap in the currently available work is the lack of consensus about which features are informative for feature-driven models, and the lack of benchmark datasets on which new methods can be devised and compared.

In this paper, we address the above challenges in the context of predicting the final size of retweet cascades. We build two predictive approaches, one generative and one feature-driven, and we show how to combine them into a hybrid predictor to further increase performance. The generative method is a two-layered approach, built on the intuitive Hawkes self-exciting process [15] model. Self-exciting point processes are a class of stochastic processes in which the occurrence of past events makes the occurrence of future events more probable. Three key factors in information diffusion are built into the proposed model: the social influence of users, the length of "social memory" and the inherent tweet quality. We use a predictive layer on top of the generative model to make final predictions. This helps us to take into account other cascades and to mitigate the limitations of model assumptions and parameter estimation.

Figure 1: Two retweet cascades announcing the death of "Mr. Spock". The underlying diffusion processes of tw1 and tw2 are very different: tw1 lasts for 2.5 hours. It attracts the attention of highly influential users, especially music related channels, however the generative kernel of tw2 amplifies more than that of tw1. tw2 diffuses faster and ends in just 12 minutes, but the final sizes for the two cascades are similar (224 retweets for tw1 and 219 for tw2). We model and interpret these cascades using our generative model in Sec. 3.4. x-axis: time in seconds; y-axis: number of followers each tweet can reach; φm(τ): memory kernel of Eq. (2), plotted with m = 1000.

Our second proposed approach, the feature-based method, uses only features which can be computed on data containing solely the message content and basic user profiles, and which were highlighted as most predictive by recent studies: user related features [25] and temporal activity features [7]. This results in a competitive feature-based predictor, which consistently outperforms the current state-of-the-art popularity prediction model [44] - reducing the mean absolute relative error by more than 200% when compared to the latter, after observing the retweet series for 10 minutes. We show that the same set of features can be employed in both regression and classification tasks. We further experiment with a hybrid predictor, which uses both data features and measures issued from the generative model, and we show additional performance improvement. We conduct experiments on a large dataset, containing 49.7 million tweets with links to the top 10 news sites in English, gathered using the public API over a period of 4 months in 2015.
Fig. 1 illustrates two of the retweet cascades in this dataset, relating to the passing of the actor who played "Mr. Spock". Even though the two cascades achieve similar popularities, our generative model is capable of differentiating between their very different diffusion dynamics. For example, the memory kernel φm(τ) of tw2 - i.e. the contribution of each event to the spawning of future events - amplifies more (shown by a larger area under φm(τ)), but is forgotten faster (the entire diffusion ends in just 12 minutes).

There are three main contributions of this paper:
• Bridging the popularity prediction problem space: we address three types of gaps: problem settings and task (e.g. regression vs. classification), type of approach (feature-based vs. generative) and the generalization of features across feature-driven approaches.
• Comparative understanding of feature-driven and generative models: we propose a generative method, based on Hawkes self-exciting processes, which features the advantages of generative methods - e.g. interpretable results - and which is adapted for predicting the final cascade popularity by using an additional predictive layer; we compare approaches to popularity prediction from the two main classes, i.e. feature-based and generative, and we show that combining them into a hybrid approach increases performance.
• Dataset: we identify a set of features that generalize over different settings and can be constructed with only message content and basic user features; we curate a large, domain-specific dataset of retweet diffusion cascades, which can be used by both feature-driven and generative models for prediction. The dataset, along with code, is publicly available for download. 1 This new dataset allows comprehensive performance benchmarks on features about users and temporal activity.

The rest of this paper is structured as follows. Sec. 2 presents related work. Sec. 3 details the proposed generative model and fitting methods, while Sec. 4 presents how the popularity prediction is performed, based on the fitted model parameters. The dataset and feature construction for the feature-driven approach are detailed in Sec. 5 and the experiments are outlined in Sec. 6. We conclude in Sec. 7.

We structure the discussion of related work into the two broad categories mentioned previously: feature-driven approaches and generative approaches. Since generative models are relatively recent, we discuss their application to popularity prediction as well as their use in other social media settings.

Predicting popularity with feature-driven models. Data-driven approaches [11, 19] treat popularity as a non-decomposable process and take a bottom-up approach. They rely on machine learning algorithms to make the connection between item popularity and an extensive set of features. Most often, the main contributions of such approaches are collecting large-scale datasets and constructing informative features. Bakshy et al. [4] and later Martin et al. [25] predict tweet popularity, defined as the total size of its retweet cascade. Both studies construct features relating to the user who posted the tweet and to the posted content. User features, particularly past user success - the ability of a user to start large cascades - are shown to be more informative; however, prediction performance remains limited. Martin et al. hypothesize an inherent randomness in popularity, which sets limits to its predictability.
The situation is different when part of the diffusion cascade is observable, as early activity has been shown to be predictive of total popularity [29, 37]. When predicting whether a Facebook cascade will double in size, Cheng et al. [7] found that temporal features - like the resharing rate or the acquired volume of views - are the most informative features. Prediction performance is further improved when adding features which account for specific contexts of the information diffusion. For example, Li et al. [22] add propagation network measurements, Chang et al. [6] add sophisticated features relating to audience behavior when predicting the popularity of serials, and Vallet et al. [38] construct features relating to two social networks: YouTube and Twitter. Many proposed features cannot be used outside of their original context, being either data-source specific or relying on information that is not publicly accessible. The feature-driven method that we propose uses only the best-performing features reported in previous studies which can be extracted from public data.

Generative models in social media. Generative models encode a set of phenomena into a parametric model and fit the parameters of the model to observed data. Such approaches have been successfully used in neuro-biology to model neuronal spiking activity [20], in economics to model the evolution of stock prices [14], in physics to model earthquake aftershock occurrence [35, 17, 36] and in criminology to model crime patterns [27]. In social media such approaches have been used extensively [18, 30, 31], mainly due to their capability to model phenomena like user interactions and diffusion susceptibility. Hidden network properties, such as connections between individuals [23, 24] or the evolution of the diffusion network topology [13], have been uncovered using generative methods such as self-exciting point processes. The main difficulty with such approaches is scalability, since each social phenomenon must be accounted for manually in the model.

Modeling and predicting popularity using generative models. Popularity modeling [9, 12, 43] and prediction [34, 44] have been a particularly fertile field for point-process based generative models. In their seminal work, Crane and Sornette [9] showed how a Hawkes point-process can account for popularity bursts and decays. Afterwards, more sophisticated models were proposed to model and simulate popularity in microblogs [43] and videos [12]. These approaches successfully account for the social phenomena which modulate online diffusion: the "rich-get-richer" phenomenon and social contagion. Certain models can output an estimate for the total size of a retweet cascade. Shen et al. [34] employ reinforced Poisson processes, modeling three phenomena: the fitness of an item, a temporal relaxation function and a reinforcement mechanism. Zhao et al. [44] propose SEISMIC, which employs a double stochastic process, one accounting for infectiousness and the other one for the arrival time of events. It is the current state of the art in popularity prediction and, throughout this paper, we compare against it. Our proposed generative model uses the same information as SEISMIC - tweet timestamps and the number of followers of the user posting the tweet - but uses a simpler, more intuitive modeling of the diffusion cascade. Generative models are typically designed to explain popularity, not predict it.
Most of the methods presented earlier introduce multiple regularization and correction factors in order to make meaningful predictions. This is sub-optimal both for prediction and for interpretation. By contrast, our Hawkes predictive model separates modeling from prediction: it accounts directly for the three most important factors at the individual cascade level - user influence, decaying social attention and content quality - and is followed by a predictive layer tuned on historical data.

We first introduce the Hawkes self-exciting point process, a stochastic event model for describing retweeting cascades. We then describe a procedure for maximum-likelihood estimation of model parameters from an observed cascade. We present observations of the Hawkes model on individual cascades and on collections of cascades, illustrating the interplay of the individual memory kernel and user influence in diffusion dynamics.

In self-exciting processes the occurrence of past events makes future events more probable. In particular, we model each retweet as a stochastic event, and model three key intuitions in a social network: magnitude of influence - tweets by users with many followers tend to get retweeted more; memory over time - most retweeting happens when the content is fresh [41]; and content quality. The marked Hawkes process [15] is a well-known self-exciting process that captures these properties. Its event rate is written as:

$\lambda(t) = \sum_{t_i < t} \phi_{m_i}(t - t_i),$   (1)

where λ(t) is the arrival rate of new events. It depends on each previously occurred event (mi, ti) of magnitude mi at time ti through the triggering kernel φm(τ). For the problem of estimating the size of a retweet cascade, there are no exogenously generated events (as in a general Hawkes process [16]) apart from the initial tweet, which is accounted for with magnitude m0 at time t0 = 0. The effects of mi and ti are separable in our triggering kernel:

$\phi_m(\tau) = \kappa\, m^{\beta} (\tau + c)^{-(1+\theta)},$   (2)

where κ describes the virality - or quality - of the tweet content and scales the subsequent retweet rate; β introduces a warping effect for the user influence and is related to the observed long-tail distribution of user influence in social networks; 1 + θ (θ > 0) is the power-law exponent, describing how fast an event is forgotten; and c > 0 is a temporal shift term that keeps φm(τ) bounded when τ → 0. Here κ m^β accounts for the magnitude of influence, and the power-law kernel (τ + c)^(−(1+θ)) models the memory over time. We approximate user influence m using the number of followers obtained from the Twitter API. Table 1 summarizes the parameters of the model, along with their interpretations.

Throughout the literature, three families of functions have mainly been used to model temporal decay with Hawkes point-processes [32]: power-law functions $\phi^p(\tau) = (\tau + c)^{-(1+\theta)}$, used in geophysics [17] and social networks [9]; exponential functions $\phi^e(\tau) = e^{-\theta\tau}$, used in financial data [14]; and Rayleigh functions $\phi^r(\tau) = e^{-\frac{1}{2}\theta\tau^2}$, used in epidemiology [40]. We experimented with all three kernels in the setting of Sec. 6, and found that the power-law kernel consistently outperforms the other two. Consequently, we only discuss detailed results from the power-law kernel in the rest of this paper.
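Before moving on to the branching factor, Eqs. (1) and (2) can be evaluated directly from an observed event history. Below is a minimal Python sketch; the function names, the toy cascade and the parameter values are illustrative choices of ours, not taken from the paper's released code.

```python
def powerlaw_kernel(tau, m, kappa, beta, c, theta):
    """Triggering kernel of Eq. (2): kappa * m^beta * (tau + c)^(-(1 + theta))."""
    return kappa * (m ** beta) * (tau + c) ** (-(1.0 + theta))

def event_rate(t, history, kappa, beta, c, theta):
    """Event rate lambda(t) of Eq. (1): sum of kernel contributions of all events before t.

    `history` is a list of (magnitude, timestamp) pairs,
    e.g. (#followers of the retweeter, retweet time in seconds).
    """
    return sum(powerlaw_kernel(t - t_i, m_i, kappa, beta, c, theta)
               for m_i, t_i in history if t_i < t)

# toy cascade: (#followers, time in seconds), with the initial tweet at t0 = 0
cascade = [(12122, 0.0), (193081, 35.0), (412, 80.0)]
print(event_rate(120.0, cascade, kappa=0.8, beta=0.3, c=30.0, theta=0.6))
```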
One of the key quantities that describe a Hawkes process is the branching factor n*, defined as the expected number of children events directly spawned by an event. Note that the total number of expected events from one initial event is the sum of a geometric series $\sum_{k=1}^{\infty} (n^*)^k$. For n* < 1, the process is in a subcritical regime: the total number of retweets is bounded, the event rate λ(t) decays to zero over time and the retweet cascade dies out. When n* > 1, the process is in a so-called supercritical regime, with the total number of retweets being unbounded.

Table 1: Summary of the quantities and parameters of the model, with their interpretations.
λ(t) - conditional intensity, or event rate, of a non-homogeneous Poisson process.
φm(τ) - triggering kernel: contribution of the event (m, t) to the total event rate, calculated at time t + τ.
n* - branching factor: mean number of children events spawned by a parent event. For n* < 1 the cascade dies out (subcritical regime); for n* > 1 the cascade explodes exponentially (supercritical regime).
T - time extent of the observed series of events in the beginning of a retweet cascade; T = max(ti).
n - number of events in the observed series of events in the beginning of a retweet cascade; n = max(i).
N∞ - expected total number of events in a given retweet cascade, derived from the generative model.
Nreal - the real number of events in a given retweet cascade, or total popularity.
ω - predictive factor learned using a Random Forest regressor.
N̂∞ - total popularity prediction made using the output of the predictive layer.
θ - the power-law exponent, describing how fast an event is forgotten; θ > 0.
κ - tweet quality: high quality tweets are more likely to generate more retweets; κ > 0.
c - temporal shift (cutoff) term so that φm(τ) stays bounded when τ → 0; c > 0.
β - user influence power-law warping exponent in m^β; β > 0, with 0 < β < α − 1 for having a defined branching factor n*.
α - exponent of the user influence power-law distribution, fitted to the #followers distribution: α = 2.016.

We compute the branching factor by integrating over time and event magnitudes:

$n^* = \int_{1}^{\infty} P(m) \int_{0}^{\infty} \phi_m(\tau)\, d\tau\, dm .$   (3)

Here we assume the mi are i.i.d. samples from a power-law distribution of social influence [21], $P(m) = (\alpha - 1)\, m^{-\alpha}$ for m ≥ 1, where α is an exponent which controls the heavy tail of the distribution. We extract the number of followers m for a large sample of users from our dataset described in Sec. 5.1 and fit it to a power-law distribution following the method detailed in [8]. We obtain α = 2.016 and we use it throughout the experiments. Substituting Eq. (2) and P(m) into (3), we obtain the closed-form expression of the branching factor:

$n^* = \frac{\kappa\, (\alpha - 1)}{(\alpha - 1 - \beta)\, \theta\, c^{\theta}} .$   (4)

Not only is the branching factor n* an intuitive descriptor of the virality of a cascade, it is also used for predicting the total popularity in Sec. 4 and as a constraint in model estimation in Sec. 3.3.

The log-likelihood function. Our marked Hawkes process is completely defined by the four parameters {κ, β, c, θ}. We now describe a maximum-likelihood estimation procedure of the model from a set of observed events, consisting of {(mi, ti), i = 0, . . . , n} for time t < T. The log-likelihood function of the point-process described by Eq. (1) and (2) is written as ([10], Ch. 7.2):

$L(\kappa, \beta, c, \theta) = \sum_{i=1}^{n} \log \kappa + \sum_{i=1}^{n} \log \left( \sum_{t_j < t_i} m_j^{\beta} (t_i - t_j + c)^{-(1+\theta)} \right) - \kappa \sum_{i=0}^{n} m_i^{\beta} \left[ \frac{1}{\theta c^{\theta}} - \frac{(T + c - t_i)^{-\theta}}{\theta} \right] .$   (5)

The first two terms in Eq. 5 come from the likelihood computed using the event rate λ(t); the last term is a normalizing factor obtained by integrating the event rate over the observation window [0, T).
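Eqs. (4) and (5) translate almost line-for-line into code. The sketch below assumes event times are sorted, measured in seconds, and include the initial tweet at t0 = 0; the helper names are ours.

```python
import numpy as np

ALPHA = 2.016  # exponent of the #followers power-law distribution, as fitted in the paper

def branching_factor(kappa, beta, c, theta, alpha=ALPHA):
    """Closed-form branching factor n* of Eq. (4); requires theta > 0 and beta < alpha - 1."""
    return kappa * (alpha - 1.0) / ((alpha - 1.0 - beta) * theta * c ** theta)

def neg_log_likelihood(params, magnitudes, times, T):
    """Negative log-likelihood of Eq. (5) for one observed cascade.

    `magnitudes` and `times` cover events i = 0..n (initial tweet included),
    with `times` sorted and contained in [0, T].
    """
    kappa, beta, c, theta = params
    m = np.asarray(magnitudes, dtype=float)
    t = np.asarray(times, dtype=float)
    ll = 0.0
    for i in range(1, len(t)):  # log-intensity at each observed retweet
        lam = kappa * np.sum(m[:i] ** beta * (t[i] - t[:i] + c) ** (-(1.0 + theta)))
        ll += np.log(lam)
    # compensator: the event rate integrated over the observation window [0, T]
    ll -= kappa * np.sum(m ** beta * (c ** -theta - (T - t + c) ** -theta) / theta)
    return -ll
```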
Finding the maximum-likelihood solution. Eq. 5 is a non-linear objective that we optimize to find the set of parameters that maximizes the data likelihood. Directly maximizing this objective is more efficient than expectation-maximization approaches [12], which try to estimate the branching structure (i.e., which event triggers which) and which have a quadratic number of latent variables. There are a few natural constraints on the model parameters, namely θ > 0, κ > 0, c > 0, and 0 < β < α − 1, for the branching factor to be meaningful (and positive). Furthermore, while supercritical regimes with n* > 1 are mathematically valid, they will not lead to a finite cascade size prediction. We note that in reality no retweet cascade can grow indefinitely, due to finite network sizes. We further incorporate n* < 1 as a non-linear constraint for the maximum likelihood estimation. We use Ipopt [39], a large-scale interior point solver that handles both non-linear objectives and non-linear constraints, with pre-programmed gradient functions, to find the solution. Details of the gradient computation and optimization can be found in an online supplement [1].

Individual cascades: fast vs slow. We first examine the model parameters for the two cascades illustrated in Fig. 1. These two cascades are about the same event and have similar observed popularities (224 vs. 219), reached through very different diffusion speeds arising from a complex interplay between the power-law memory (τ + c)^(−(1+θ)) and user influence. For tw1, the maximum-likelihood estimates of the parameters are {κ = 0.75, β = 0.27, c = 58.46, θ = 0.64}, with a corresponding n* = 0.12; for tw2, {κ = 1, β = 0.42, c = 27.77, θ = 0.77}, and n* = 0.16 (also shown in Fig. 2). The two diffusions unfold in different ways: tw1 has a lower value of κ and a higher waiting time c, hence its memory kernel φ(τ) has lower values than that of tw2 and a slower decay. However, tw1 attracts the attention of some very well-followed music accounts. The original poster (@screencrushnews) has 12,122 followers, and among the retweeters @TasteOfCountry (country music) has 193,081 followers, @Loudwire (rock) has 110,824 followers, @UltClassicRock (classic rock) has 99,074 followers and @PopCrush (pop music) has 114,050 followers. The resulting cascade reached 1/4 of its size after half an hour, and the final tweet was sent after 4 days. In contrast, tw2's memory kernel has higher values, but a faster decay. The most influential user in tw2 has 412 followers, and the entire diffusion lasted for only 12.5 minutes.

Interpreting a collection of cascades. We construct a large subset of 17,146 cascades of the News dataset, described in Sec. 5.1. For each cascade we fit the parameters of its generative model {κ, β, c, θ}. Fig. 2 shows the density distribution of parameters θ and κ. As one would expect, most cascades have a low value of κ, reflecting the long-tailed distribution of content quality. n* - the descriptor of the virality of a cascade - is distributed very similarly to κ, with most cascades showing low n* values (shown in the supplement [1]). Parameter θ - which controls the speed of the decay of the memory kernel φm(τ) - has a more surprising distribution. Recent work [44] on point processes with power-law kernels argued that a unique value of θ = 0.242 is sufficient to explain the general behavior of cascades. We find θ to vary in the interval (0, 2.8], showing two maxima at 0.01 and 0.34. This observation indicates that learning individual memory exponents θ allows us to better describe the diffusion dynamics of each cascade. The distributions of the other parameters are shown in the supplement: c follows a long-tail distribution; β shows a maximum of density around 0.08 and a smaller peak around β ∼ α − 1, resulting from the optimizer terminating at the boundary of the non-linear constraint n* = 1.
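A lightweight way to reproduce the constrained fit of Sec. 3.3 without Ipopt is to pass the negative log-likelihood of Eq. (5) to an off-the-shelf constrained optimizer. The sketch below uses SciPy's SLSQP with numerical gradients as a stand-in for the interior point solver, reusing branching_factor and neg_log_likelihood from the earlier sketch; the bounds, starting points and solver choice are our assumptions rather than the paper's exact setup.

```python
from scipy.optimize import minimize

def fit_hawkes(magnitudes, times, T, alpha=ALPHA):
    """Constrained maximum-likelihood fit of {kappa, beta, c, theta} for one cascade."""
    bounds = [(1e-6, None),                 # kappa > 0
              (1e-6, alpha - 1.0 - 1e-6),   # 0 < beta < alpha - 1
              (1e-6, None),                 # c > 0
              (1e-6, None)]                 # theta > 0
    constraints = [{"type": "ineq",         # keep n* at or below 1 (subcritical regime)
                    "fun": lambda p: 1.0 - branching_factor(p[0], p[1], p[2], p[3], alpha)}]
    best = None
    for start in [(0.5, 0.2, 10.0, 0.5), (1.0, 0.5, 100.0, 1.0)]:  # a couple of restarts
        res = minimize(neg_log_likelihood, start, args=(magnitudes, times, T),
                       method="SLSQP", bounds=bounds, constraints=constraints)
        if res.success and (best is None or res.fun < best.fun):
            best = res
    return None if best is None else dict(zip(["kappa", "beta", "c", "theta"], best.x))
```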
In this section, we first derive the expected number of future events in a cascade; we then describe a predictive layer tuned for estimating the total size, using information from other cascades in history.

Having observed a retweet cascade until time T and fitted the parameters of the point process, one can simulate a possible continuation of the cascade until it dies out - assuming n* < 1. In this section, we derive the expected number of events in the cascade, over all possible continuations. We group events based on the generation of their parents (i.e. the preceding event they are triggered by), and estimate the number of events in each generation. We denote as Generation1 the set of simulated events spawned by the observed events (shown in red in Fig. 3). Similarly, Generation2 (in green in Fig. 3) groups events spawned by events in Generation1, and so on. Let Ai be the expected number of events in Generationi. The expected total number of events in the cascade, N∞, is defined as:

$N_{\infty} = n + \sum_{i=1}^{\infty} A_i ,$   (6)

where n is the number of observed events. For generations after the first one, the best estimate of Ai, i ≥ 2, is constructed using the average number of children events n* and the number of events in the previous generation, i.e. Ai = Ai−1 n*. Assuming A1 known, we derive:

$A_i = A_1 (n^*)^{i-1}, \quad i \geq 1 .$   (7)

Therefore, the second term in (6) is the sum of a converging geometric progression (assuming n* < 1):

$\sum_{i=1}^{\infty} A_i = \frac{A_1}{1 - n^*} .$   (8)

A1 could also be calculated in a fashion similar to Eq. (7). However, a more precise estimation can be obtained, given that the magnitudes of the observed events - the parents of the events in Generation1 - are known:

$A_1 = \sum_{i=0}^{n} \int_{T}^{\infty} \phi_{m_i}(t - t_i)\, dt = \frac{\kappa}{\theta} \sum_{i=0}^{n} m_i^{\beta} (T + c - t_i)^{-\theta} .$   (9)

We obtain the estimate of the total number of events in the cascade by introducing (9) and (8) into (6):

$N_{\infty} = n + \frac{\kappa}{\theta (1 - n^*)} \sum_{i=0}^{n} m_i^{\beta} (T + c - t_i)^{-\theta} .$   (10)

Figure 4: Reduction of prediction error for a subset of cascades. For each cascade, the error made when predicting final popularity using the theoretical N∞ is shown using red circles, and the error made when using the predictive layer N̂∞ is shown with blue squares. Each gray arrow shows the reduction of error for a single cascade, and pairs together a red circle with a blue square. The error reduction is shown in relation to A1 (a) and to the percentile of n* (b).

Using generative models for prediction is often sub-optimal, due to simplifying assumptions about the generative process. Point processes, in general, are optimized for explaining the observed event history, not for prediction. Zhao et al. [44], for example, recently deployed a number of heuristic correction factors to discount the initial burst and account for a long-term decay. We consider that the Hawkes point process can benefit from systematically learning a predictor, to better account for data noise and limiting model assumptions. First, it is well known that using the number of followers as user influence mi is at best an approximation [5] and often does not scale with retweeting propensity. Second, the network in which cascades happen can be inhomogeneous over the lifetime of the diffusion, e.g. users who respond early may have an inherently shorter memory than those responding late. Last, point processes cannot generate a meaningful prediction for supercritical cascades, and the model estimation can be prone to local minima. In this section, we leverage a collection of previously observed cascade histories to fine-tune popularity prediction. This predictive layer accounts for the limitations above, and places point process estimates in a framework comparable to the feature-driven approaches.
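The theoretical estimate of Eq. (10), which the predictive layer described next corrects, can be computed directly from the fitted parameters. A minimal Python sketch, with names of ours:

```python
import numpy as np

def expected_total_size(magnitudes, times, T, kappa, beta, c, theta, alpha=2.016):
    """Theoretical total size N_inf = n + A1 / (1 - n*), following Eqs. (6)-(10).

    Returns None when n* >= 1 (supercritical regime), where no finite prediction exists.
    """
    m = np.asarray(magnitudes, dtype=float)
    t = np.asarray(times, dtype=float)
    n_star = kappa * (alpha - 1.0) / ((alpha - 1.0 - beta) * theta * c ** theta)
    if n_star >= 1.0:
        return None
    # A1, Eq. (9): expected number of direct children of the observed events after time T
    a1 = kappa * np.sum(m ** beta * (T + c - t) ** -theta) / theta
    return len(t) + a1 / (1.0 - n_star)
```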
Prediction setup. Each retweet cascade is described using four features: {c, θ, A1, n*}. These four features were found to be the most correlated with the squared error $(N_{real} - N_{\infty})^2$, i.e. the error made when predicting popularity using N∞. Furthermore, n* and θ are expressed as percentiles over the range observed in the training set, to account for their non-linear relation with this error. Note that the parameters κ and β are not used: the information in κ is already included in A1, whereas β was found to be uncorrelated with the error. In the classification task detailed in Sec. 6.3, the Random Forest is trained to output the selected class. In the popularity prediction task - Sec. 6.1 and 6.2 - we train the Random Forest to output a scaling factor ω for the expected number of future events in a given retweet cascade. The final corrected popularity prediction is computed as:

$\hat{N}_{\infty} = n + \omega\, \frac{A_1}{1 - n^*} .$

Effects of the predictive layer. The immediate effect of introducing the predictive layer is the reduction of the ARE prediction error, measured as detailed in Sec. 6.1. Fig. 4 illustrates this reduction on a subset of the cascades in the News dataset - described in Sec. 5.1 - and its relation with A1 and n*. The length of the arrows is proportional to the error reduction. The range of the errors made without the predictive layer increases linearly with A1. Consequently, the error reduction when using N̂∞ is generally higher when A1 is higher (Fig. 4a). The relation between n* and the prediction error is non-linear, therefore in Fig. 4b the horizontal axis shows the percentile values of n*. The variation of the error reduction has two maxima, one around the 40th percentile and another one around the 95th. Fig. 4 can be summarized as follows: the error reduction increases linearly with the expected number of events in Generation1, but exhibits two maxima in relation to the expected number of events in Generationi, i ≥ 2. Another interesting observation is that N∞ tends to over-estimate cascade sizes; in other words, cascades tend to achieve less popularity than expected when observed in isolation. This aligns with the hypothesis that content items compete with each other for a finite amount of human attention [26].
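A sketch of this predictive layer is given below. The text above states that a Random Forest regressor maps {c, θ (percentile), A1, n* (percentile)} to the scaling factor ω; how the training target for ω is constructed is not spelled out in the extracted text, so the sketch uses one plausible choice - the value of ω that would have made the prediction exact on each training cascade.

```python
import numpy as np
from scipy.stats import rankdata
from sklearn.ensemble import RandomForestRegressor

def train_predictive_layer(c, theta, a1, n_star, n_obs, n_real):
    """Fit the corrective layer on a collection of already-fitted cascades.

    Each argument holds one value per training cascade. theta and n* are mapped
    to percentiles of the training set, as described above.
    """
    X = np.column_stack([c,
                         rankdata(theta) / len(theta),    # theta -> percentile
                         a1,
                         rankdata(n_star) / len(n_star)]) # n*    -> percentile
    # assumed target: omega making n + omega * A1 / (1 - n*) equal to the real size
    omega = (np.asarray(n_real) - np.asarray(n_obs)) * (1.0 - np.asarray(n_star)) / np.asarray(a1)
    return RandomForestRegressor(n_estimators=200, random_state=0).fit(X, omega)

def corrected_prediction(rf, x_row, n_obs, a1, n_star):
    """Corrected popularity: N_hat = n + omega * A1 / (1 - n*).

    x_row = [c, theta percentile, A1, n* percentile], with percentiles computed
    with respect to the training set.
    """
    omega = rf.predict(np.asarray(x_row, dtype=float).reshape(1, -1))[0]
    return n_obs + omega * a1 / (1.0 - n_star)
```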
In this section we describe the two datasets, Tweet-1Mo and News, used in our experiments. Next, we describe the set of features constructed for our feature-driven approach.

Tweet-1Mo: Zhao et al. [44] made publicly available the dataset they used to predict retweet cascade sizes. It contains a sample of all tweets during one month (i.e. obtained using the restricted-access firehose Twitter API), further filtered so that the length of each cascade is greater than 50. For each cascade, the dataset contains information about the original tweet id and start time and, for each of the retweets in the cascade, the time offset of the retweet and the number of followers of the user posting the retweet. We randomly sample the cascades in the original dataset and we construct Tweet-1Mo to contain 30,463 cascades of length between 50 and 5000. The resulting dataset has a mean cascade length of 160 and a median of 95.

News: We construct a domain-specific dataset of tweets dealing with news articles, by collecting data using the free Twitter Streaming API 2, over a period of four months from April 2015 to July 2015. We track the official Twitter handles of ten popular news outlets, namely New York Times, Associated Press, CNN, Washington Post, Reuters, Yahoo News, The Guardian, BBC, Huffington Post and Google News. We also capture all tweets containing either i) user mentions of the above users or ii) a commonly used shortened URL pointing to any of the ten news websites. This leads to a collection of 49,735,271 tweets. Most of the cascades in the News dataset are of length one or two and, for compatibility with Tweet-1Mo, we present the statistics for cascades of length over 50: 110,954 cascades, with a mean cascade length of 158 and a median length of 90.

Difference between datasets. The News dataset contains additional information when compared to Tweet-1Mo: for each user in a retweet cascade, we record her number of friends, number of posted statuses and the account creation time (this information is recorded in every tweet we capture through the API). This additional information is required by the feature-driven method described in Section 5.2; therefore, in the experiments in Sec. 6, feature-based methods cannot run on Tweet-1Mo.

An open-access, domain-specific dataset. Many of the datasets used by prior work for popularity prediction were not made public because of user agreement terms, and they are difficult to replicate. The barriers to having an open benchmark include proprietary insider information (e.g. complete Facebook network measurements [7]), privileged API access (e.g. the Twitter firehose [44]) or restricted data from non-English sources [12, 43]. Furthermore, we argue that predicting popularity on a topic- or domain-specific dataset is as relevant as predicting it on the firehose feed - content producers and consumers are often interested in content within a source (e.g. NYTimes) or a topic (e.g. technology). Therefore, we chose to construct News, our domain-specific dataset, using only free-access APIs. We present a strong basis for future benchmarking by combining the descriptive power of the Feature-Driven approach and the modeling capabilities of Hawkes in Sec. 6.

Our feature-driven predictor describes each cascade using four types of features: a) basic user features, b) temporal features, c) volume features and d) past user success. They are chosen as the most informative in recent literature [4, 7, 25, 29, 37] and require only free access to the Twitter data source. In particular, Martin et al. [25] compare a wide range of features, including user, content and user-content interplay features, and they show that basic user features together with past user success features account for 87% of the prediction performance. Cheng et al. [7] showed that temporal features relating to the unfolding of the observed part of the diffusion account for the bulk of the total classification performance. They report only 0.025 additional accuracy when using all the other features combined in addition to temporal features. Lastly, Szabo et al. [37] and later Pinto et al. [29] show that early popularity is predictive of total popularity.

Basic User Features [4, 7, 25]. Basic user statistics capture the social influence of a user. When observing the prefix of a cascade, we use the five-point summary (min, median, max, 25th and 75th percentiles) to represent the distribution of each such feature.
Temporal Features [7]. We capture the temporal dynamics of a cascade using features computed from the observed retweet times.
Volume [29, 37]. The number of retweets in the observed part of the cascade, which captures early popularity.
Past User Success [4, 25]. The average size of the cascades started by a given user in the past. For an observed cascade, we use the five-point summary of the distribution of past user success over all previously known users who participate in the cascade. This feature requires a large volume of historical data and can be computed only for users that have started cascades in the past. We detail its construction in Sec. 6.1.
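The feature construction above can be sketched as follows. Since the exact list of temporal features is not reproduced in the extracted text, the temporal summaries below are illustrative stand-ins; the five-point summary and the other feature groups follow the description above.

```python
import numpy as np

def five_point_summary(values):
    """min, 25th percentile, median, 75th percentile, max of a per-user quantity."""
    v = np.asarray(values, dtype=float)
    return [v.min(), np.percentile(v, 25), np.median(v), np.percentile(v, 75), v.max()]

def cascade_feature_row(followers, times, past_user_success):
    """One feature row for the observed prefix of a cascade.

    `followers`, `past_user_success`: one value per observed (re)tweeter;
    `times`: event times in seconds. Inter-retweet gaps and time to first
    retweet are illustrative temporal summaries, not the paper's exact list.
    """
    times = np.sort(np.asarray(times, dtype=float))
    gaps = np.diff(times) if len(times) > 1 else np.array([0.0])
    return (five_point_summary(followers)                               # a) basic user features
            + five_point_summary(gaps)                                  # b) temporal (illustrative)
            + [times[1] - times[0] if len(times) > 1 else 0.0]          #    time to first retweet
            + [len(times)]                                              # c) volume: early popularity
            + five_point_summary(past_user_success))                    # d) past user success
```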
We first present in Sec. 6.1 a thorough comparison between the Hawkes model and Seismic, the state-of-the-art generative model for popularity prediction, using two datasets. In Sec. 6.2 and Sec. 6.3 we compare the feature-driven classifier to generative approaches. Sec. 6.1 and Sec. 6.2 tackle regression tasks, i.e., predicting the total cascade size after observing the cascade for a certain time, while Sec. 6.3 tackles a classification task, i.e., whether or not the cascade size will double from what has been observed.

We compare our point process model, Hawkes, with Seismic on two datasets, Tweet-1Mo and News. Note that Tweet-1Mo is the dataset on which Zhao et al. [44] reported results, while the News dataset supports extracting a richer set of features that is not available in Tweet-1Mo. In addition, Zhao et al. [44] showed that Seismic compared favorably against a number of baselines: from auto-regressive and generative models such as the dynamic Poisson model [2, 9] and the reinforced Poisson model [34], to linear regression and its variants [37].

Experimental setup. We run our experiments on the Tweet-1Mo dataset and a subset of News: 20,093 cascades from the month of July, which have a total length of at least 50. Both algorithms observe the initial part of each cascade for a limited extent of time and predict its final popularity. We estimate separate sets of parameters for Hawkes and Seismic after observing the cascades for 5 minutes, 10 minutes and 1 hour, respectively. After the fitting is finished, we train the Random Forest regressor in the predictive layer of the Hawkes model. Performance of this regression layer is reported using ten-fold cross-validation; 40% of the cascades are used for training, 60% for testing. All experimental protocol choices - minimum cascade length threshold, random training/testing split and length of time prefixes - are made to closely mimic the original experimental setup [44]. We use the Absolute Relative Error (ARE) to measure prediction performance on a given cascade w:

$ARE(w) = \frac{\left| \hat{N}_{\infty}(w) - N_{real}(w) \right|}{N_{real}(w)} ,$

where N̂∞(w) and Nreal(w) are the predicted and observed popularities of cascade w.
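For concreteness, a tiny sketch of the metric (the example numbers are ours):

```python
def absolute_relative_error(n_predicted, n_real):
    """Absolute Relative Error: |N_hat - N_real| / N_real."""
    return abs(n_predicted - n_real) / n_real

# e.g. predicting 180 retweets for a cascade that eventually reaches 224
print(absolute_relative_error(180, 224))  # ~0.196
```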
Results. The prediction results are summarized in Fig. 5 as boxplots of ARE for each combination of (dataset, approach, observed time), while Table 2 shows the mean ARE ± standard deviation. Note that generative models do not output a prediction for all cascades. For example, when the branching factor n* ≥ 1 for Hawkes, the estimate for the cascade size is infinite; similar failures also occur with Seismic when the infectiousness parameter p ≥ 1/n* ([44], Eq. (7)). We report the number of "failed" cascades in Table 2, and only report the average ARE on cascades for which both approaches produce a prediction. As shown in the right-hand side of Table 2, we observe that Hawkes is able to produce meaningful predictions for more cascades than Seismic. For example, on the News data observed for 5 minutes, we output predictions for 873 (or 0.7%) more cascades than Seismic. We can see that Hawkes consistently outperforms Seismic, as it achieves a lower average ARE on both datasets. The mean ARE for Hawkes is at least 40% better than Seismic on Tweet-1Mo and 33% better on News. We observe that the News dataset after 5 minutes of observation seems the most difficult to predict: Hawkes produces a mean ARE of 0.42, while the mean ARE for Seismic is 11.13. Fig. 5 (top row) shows that the median of Hawkes is at least 25% better than that of Seismic for any time prefix, suggesting that our approach predicts better for most of the cascades in Tweet-1Mo. Similar conclusions can be drawn from the bottom row, for the News dataset. We are glad to see that the Hawkes model achieves better predictive performance than Seismic, by exploiting the full flexibility of its non-linear parameters, and a predictive layer that systematically optimizes for future prediction using information from other cascades.

In this section we compare the performance of feature-driven and generative approaches on the News dataset only, as Tweet-1Mo does not contain the data necessary to construct the features detailed in Sec. 5.2.

Experimental setup. Similar to the setup in Sec. 6.1, we observe cascades for 5 minutes, 10 minutes and 1 hour and fit the Hawkes and Seismic models for each cascade. The train-test split is different, in order to replicate more closely the experimental setup of Martin et al. [25]: the data from the first half of July (1-15) is used for training and the data from the second half of July (16-31) for testing. We use the News historical data from April to June to construct the past user success feature for the feature-driven approach. We consider only those users who initiated at least 2 cascades in the past as active users. Non-active users have a past user success of 0. We also construct a Hybrid approach, which leverages the features of the Hawkes model - i.e. {c, θ, A1, n*} - together with the features detailed in Sec. 5.2. We use a Random Forest regressor for predicting the final volume, after tuning the predictor on the training set with 10-fold cross-validation.

Results. Similar to the previous subsection, we report in Table 3 the mean ARE ± standard deviation and in Fig. 6 the box plots of ARE for 10 minutes and 1 hour.

Figure 7: ARE distribution over popularity percentiles, on the News dataset, after observing cascades for 10 minutes. The blue line denotes the overall median ARE shown in Figure 6a. Note that the y-axes are not on the same scale for all three graphs.

In accordance with the results of Sec. 6.1, Hawkes in Fig. 6 shows a lower median and lower quartiles. This indicates that Hawkes predicts better than Feature-Driven for most cascades, but the error distribution of Hawkes is skewed by outliers. Hybrid performs the best, with a similar boxplot summarization as Hawkes, but with a 42% better mean ARE - from 0.17 to 0.11 for 1 hour. The differences between Hybrid and all the other approaches are statistically significant (details in the supplement [1]). This result shows that the Hawkes and Feature-Driven approaches are complementary, likely because they summarize the cascade information differently, and their errors are uncorrelated.

ARE distribution over cascade popularity. Figure 7 presents the ARE distribution over cascade popularity, as boxplots for each popularity decile. Both Hawkes and Feature-Driven predict best for cascades of middle popularity. At the extreme ends, performance is affected, because for lower popularity percentiles even a small error is amplified into a large ARE, and cascades in higher popularity bins are inherently more difficult to predict.
The prediction performance of Seismic seems consistent across popularity bins, but is always worse than that of Hawkes or Feature-Driven, as already shown by the aggregated results in Fig. 6a. We also tried to explain prediction performance by breaking it down by news source, but we did not see any notable patterns.

The learning problem in this section is to classify whether an observed cascade will double its volume or not. The experimental setup follows that of Cheng et al. [7], in which cascades are observed for a fixed number of retweets, instead of for a fixed extent of time as in Sec. 6.1 and 6.2. The proposed Hawkes model directly outputs the total popularity as a cascade size and uses the predictive layer to correct the theoretical estimate; as a consequence, it cannot be used directly in a classification setup. Instead, we construct a classifier, HawkesC, in which a Random Forest Classifier is trained on the same features as the predictive layer of Hawkes and used to output a binary decision.

Experimental setup. We use the July subset of the News dataset, filtered to only cascades of length greater than or equal to 25. We observe the cascades for 25 retweets and 50 retweets, and we predict whether they will reach 50 and 100 retweets, respectively. We perform a 40%-60% stratified train-test split. The Feature-Driven approach uses the features mentioned in Section 5.2, except for volume, which is constant for all cascades. The Hybrid approach combines features from both Hawkes and Feature-Driven. The random train-test split is repeated 10 times and we report the mean accuracy and standard deviation.

Results. Table 4 summarizes the classification performance. A random guess (majority class) would output an accuracy of 52% and 53% for an observed length of 25 and 50, respectively. The generative-based classifier HawkesC improves substantially over the random-guess baseline; however, Feature-Driven has the best prediction accuracy. Interestingly, combining the generative features with the data-driven features did not lead to a significant improvement, likely because Feature-Driven already has stronger results than HawkesC. Admittedly, predicting whether a cascade will double in size is an easier problem than forecasting its final volume. Overall, the results suggest that the predictions are robust, and the performance is comparable to earlier results reported by Cheng et al. [7] (0.81 accuracy after seeing the first 50 shares).
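A sketch of the HawkesC setup, assuming the per-cascade quantities {c, θ, A1, n*} have already been fitted on the observed prefix; the classifier hyperparameters are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def hawkes_c(X, doubled, seed=0):
    """HawkesC sketch: will a cascade at least double its observed size?

    X has one row per cascade with the fitted quantities [c, theta, A1, n*];
    `doubled` holds the boolean labels. A 40%/60% stratified split mirrors the
    setup of Sec. 6.3.
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, doubled, train_size=0.4, stratify=doubled, random_state=seed)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)  # mean accuracy on the held-out 60%
```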
This paper establishes a common benchmark that allows researchers and practitioners to compare feature-driven and generative approaches on different variants of the popularity prediction problem. Existing feature-based approaches employ either proprietary insider information, or global network information that is impossible to retrieve with the limited access that is openly available. In this work, we bridge the problem space by first selecting a small set of features which can be obtained using only free-access services, and constructing the Feature-Driven method from them. We further propose a Hawkes self-exciting model, which intuitively aligns with the social factors responsible for the diffusion of cascades: the social influence of users, social memory and inherent content quality. We systematically construct a predictive layer, which helps optimize prediction using information from other cascades. We perform an extensive evaluation on two large datasets: a benchmark dataset constructed by Zhao et al. [44] and a domain-specific dataset curated from news tweets. We compare the feature-driven and generative approaches on two tasks: i) predicting the total size of retweet cascades and ii) predicting whether cascades will double their size. Both our proposed generative Hawkes method and our Feature-Driven method outperform the current state-of-the-art predictor. Performance is further improved when combining both approaches into Hybrid, which leads us to argue that popularity modeling should leverage the best that both worlds have to offer. We plan to extend this work by including additional factors such as different network behaviour for particular types of content, user influence distributions that vary over time, and explicit modeling of the interplay between the unfolding of related cascades.

Additional density distribution graphics (supplement). Figure 8 shows the density distribution of parameters β and c, along with the branching factor n*. The density of β shows two local maxima: a maximum of density around 0.079 and a smaller peak around β ∼ α − 1, resulting from the optimizer terminating at the boundary of the non-linear constraint n* = 1. Parameter n* represents the expected number of events spawned by a single event in the Hawkes process. Its distribution shows a maximum around n* = 0.022 and a smaller local maximum around 1, resulting from the same process involving the non-linear constraint n* = 1. The low values of n* point towards a fast-decaying self-exciting process, resulting in small cascades with short life spans. Parameter c is a cutoff parameter; its purpose is to keep the kernel function φm(τ) bounded when τ ∼ 0. It controls the reactivity of the initial diffusion: small values lead to the first retweets being posted early after the initial tweet; large values introduce long waiting times between the first tweet and subsequent retweets. Fig. 8(c) shows its long-tail density distribution in log-log scale and Fig. 8(d) shows its cumulative density distribution. The large majority of cascades are very reactive, with a median value of c = 111 seconds.

Table 5: Mean and median values of the log-likelihood of unseen data on a subsample of the News dataset with 100 cascades, for Seismic and the unconstrained version of Hawkes.
          Mean     Median
Seismic   -6.620   -4.3964
Hawkes    -7.050   -6.620

References
[1] Supplemental material - feature driven and point process approaches for popularity prediction
[2] Spatio-temporal models for estimating click-through rate
[3] Predicting the future with social media
[4] Everyone's an influencer
[5] Measuring user influence in twitter: The million follower fallacy
[6] Predicting the Popularity of Online Serials with Autoregressive Models
[7] Can cascades be predicted?
[8] Power-Law Distributions in Empirical Data
[9] Robust dynamic classes revealed by measuring the response function of a social system
[10] An introduction to the theory of point processes
[11] High Impact Academic Paper Prediction Using Temporal and Topological Features
[12] Video Popularity Prediction by Sentiment Propagation via Implicit Network
[13] Coevolve: A joint point process model for information diffusion and network co-evolution
[14] Apparent criticality and calibration issues in the Hawkes self-excited point process model: application to high-frequency financial data
[15] Spectra of some self-exciting and mutually exciting point processes
[16] A Cluster Process Representation of a Self-Exciting Process
[17] Subcritical and supercritical regimes in epidemic models of earthquake aftershocks
[18] Download relaxation dynamics on the WWW following newspaper publication of URL. PhysA '00
[19] Spatio-temporal meme prediction
[20] A Granger causality measure for point process models of ensemble neural spiking activity
[21] What is twitter, a social network or a news media
[22] On popularity prediction of videos shared in online social networks
[23] Dyadic event attribution in social networks with mixtures of Hawkes processes
[24] Learning parametric models for social infectivity in multi-dimensional hawkes processes
[25] Exploring limits to prediction in complex social systems
[26] Limited communication capacity unveils strategies for human interaction
[27] Self-Exciting Point Process Modeling of Crime
[28] Seismicity Analysis through Point-process Modeling: A Review
[29] Using early view patterns to predict the popularity of youtube videos
[30] Trend detection in social networks using hawkes processes
[31] Modeling temporal activity patterns in dynamic social networks
[32] Uncovering the Temporal Dynamics of Diffusion Networks
[33] Viral actions: Predicting video view counts using synchronous sharing behaviors
[34] Modeling and Predicting Popularity Dynamics via Reinforced Poisson Processes
[35] Renormalization of earthquake aftershocks
[36] Endogenous versus exogenous shocks in systems with memory
[37] Predicting the popularity of online content
[38] Characterizing and Predicting Viral-and-Popular Video Content
[39] On the implementation of a primal-dual interior point filter line search algorithm for large-scale nonlinear programming
[40] Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures
[41] Novelty and collective attention
[42] Twitter-driven youtube views: Beyond individual influencers
[43] Uncovering and predicting the dynamic process of information cascades with survival model
[44] SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity

Acknowledgments. This material is based on research sponsored by the Air Force Research Laboratory, under agreement number FA2386-15-1-4018. We thank the National Computational Infrastructure (NCI) for providing computational resources, supported by the Australian Government.

Gradient computation (supplement). Introducing Eq. [2] into Eq. [14], we obtain

$\int_{t_i}^{T} \phi_{m_i}(t - t_i)\, dt = \kappa\, m_i^{\beta} \left[ \frac{1}{\theta c^{\theta}} - \frac{(T + c - t_i)^{-\theta}}{\theta} \right] .$   [15]

Hence Equation [13] can be written as Eq. [16], the log-likelihood of Eq. (5) in the main text. For differentiating L(κ, β, c, θ) w.r.t. β, c and θ, we first consider the second term on the right-hand side of Eq. [16]. Using the logarithmic derivation rule and Eq. [18] in Eq. [16], we obtain the partial derivative w.r.t. β; similarly, we compute the partial derivatives ∂L/∂c and ∂L/∂θ.

Goodness of fit on unseen data (supplement). We also carry out the task of evaluating how well our model fits unseen data. In order to achieve this, we first fit a model to all data seen up to some time tN for a cascade and then estimate the log-likelihood of the unseen data until its end time tM. We can obtain the log-likelihood of the unseen data by subtracting the log-likelihood of the seen part of the cascade from the log-likelihood of the whole cascade, after fitting the parameters on the seen part:

$LL_{(t_N, t_M]} = LL_{[0, t_M]} - LL_{[0, t_N]} .$   [22]

For Hawkes we can easily calculate the value in Equation 22 by using Equation 16 for both the seen part and the full cascade. But for Seismic the evaluation is not directly possible, as the authors do not derive the log-likelihood of their model explicitly.
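For Hawkes, Eq. [22] follows directly from the log-likelihood helper sketched earlier; a minimal version, assuming the parameters were fitted on the seen part only:

```python
import numpy as np

def unseen_log_likelihood(params, magnitudes, times, t_seen, t_end):
    """Held-out log-likelihood of Eq. [22]: LL over (t_seen, t_end] = LL(full) - LL(seen).

    `params` = (kappa, beta, c, theta); reuses `neg_log_likelihood` from the earlier sketch.
    """
    m = np.asarray(magnitudes, dtype=float)
    t = np.asarray(times, dtype=float)
    seen = t <= t_seen
    ll_full = -neg_log_likelihood(params, m, t, t_end)
    ll_seen = -neg_log_likelihood(params, m[seen], t[seen], t_seen)
    return ll_full - ll_seen
```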
For Seismic, the derivation of the log-likelihood of unseen data can be written as follows. We can calculate part A of Equation 23 by using Equation 1 in [1] and assuming that the infectiousness, once estimated at tN, remains constant. For part B in Equation 23, again using Equation 1 in [1], applying the trick used in Equation 14, and keeping in mind that the event times ti fall into two parts, i ∈ [1, N] and the remainder, we obtain the corresponding expressions. Using Equation 24 and Equation 26 we can then calculate the desired log-likelihood of the unseen data for Seismic.

To present results for goodness of fit, we use the unconstrained optimization of parameters for Hawkes and compare it against Seismic on a subsample of the News dataset with 100 cascades. The results are given in Table 5. As seen from the table, Seismic fits the unseen data slightly better than Hawkes, even though the predictive power of Hawkes is better, as demonstrated in the main text.

Statistical significance (supplement). In this section we present the analysis of the statistical significance of the results presented in Section 6.2 of the main text. Results are presented in Table 6 for the Seismic, Hawkes, Feature-Driven and Hybrid models on the News dataset; we use a strict cut-off p-value of 0.01, as our sample size is in the order of thousands. As seen from Table 6, the Hybrid model shows a statistically significant improvement over all other approaches almost every time. The results for Hawkes and Feature-Driven are in general better than those of Seismic, but statistically indistinguishable from each other.

Table 6: Statistical testing results for the Seismic, Hawkes, Feature-Driven and Hybrid models on the News dataset, indicating the algorithm with statistically significantly better results. A win is decided when an algorithm has a majority of lower error values than the competing algorithm and the p-value confirms the result as statistically significant; otherwise, we mark it as a draw in case of no statistical difference.
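As a closing illustration, the win/draw bookkeeping described in the Table 6 caption could be implemented along the following lines. The text does not name the statistical test used, so the paired Wilcoxon signed-rank test below is our assumption, not the paper's documented procedure.

```python
import numpy as np
from scipy.stats import wilcoxon

def head_to_head(errors_a, errors_b, alpha=0.01):
    """Pairwise comparison in the spirit of Table 6.

    Takes per-cascade errors of two algorithms on the same cascades. The 'win'
    goes to the algorithm with the majority of lower errors, provided the paired
    Wilcoxon signed-rank test is significant at the 0.01 level; otherwise 'draw'.
    """
    a = np.asarray(errors_a, dtype=float)
    b = np.asarray(errors_b, dtype=float)
    _, p_value = wilcoxon(a, b)
    if p_value >= alpha:
        return "draw", p_value
    return ("A" if np.mean(a < b) > 0.5 else "B"), p_value
```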