key: cord-0509153-3e53z60y authors: Zhang, Yafei; Wang, Lin; Zhu, Jonathan J. H.; Wang, Xiaofan title: Go viral or go broadcast? Characterizing the virality and growth of cascades date: 2020-06-01 journal: nan DOI: nan sha: fa2fdc4026d9d301e63a8b80fc38bb50f0078c77 doc_id: 509153 cord_uid: 3e53z60y Quantifying the virality of cascades is an important question across disciplines such as the transmission of disease, the spread of information and the diffusion of innovations. An appropriate virality metric should be able to disambiguate between a shallow, broadcast-like diffusion process and a deep, multi-generational branching process. Although several valuable works have been dedicated to this field, most of them fail to take the position of the diffusion source into consideration, which makes them fall into the trap of graph isomorphism and would result in imprecise estimation of cascade virality inevitably under certain circumstances. In this paper, we propose a root-aware approach to quantify the virality of cascades with proper consideration of the root node in a diffusion tree. With applications on synthetic and empirical cascades, we show the property and potential utility of the proposed virality measure. Based on preferential attachment mechanism, we further introduce a model to mimic the growth of cascades. The proposed model enables the interpolation between broadcast and viral spreading during the growth of cascades. Through numerical simulations, we show the effectiveness of the proposed model in characterizing the virality of growing cascades. Our work contributes to the scientific understanding of cascade virality and growth, and could offer practical implications in a range of policy domains including viral marketing, infectious disease and information diffusion. Spreading is a ubiquitous process across disciplines. Conceptually, many spreading processes can be viewed as tree-like cascades, including the retweeting or resharing of news in online social networks [1] [2] [3] [4] [5] , the generated discussion threads in online boards [6] [7] [8] , the outbreak or transmission of diseases [9] [10] [11] [12] , and the diffusion of new products or services [13] [14] [15] [16] . For instance, in a discussion thread, comments and the reply-to actions between them can be represented as nodes and edges in a rooted tree with the original post act as the root [8] . In some cases, a single node could directly account for a large proportion of the whole diffusion process, which is a typical characteristic in the media broadcast industry; while in others, diffusion is more likely to be driven by word-of-mouth mechanism and circulates in a viral way, where each node only accounts for a small fraction of the whole diffusion process [1] . For example, one NBA Christmas game is able to attract millions of views through merely broadcast, but one tweet could reach millions as well through viral spreading. As the backbone of so many diffusion processes, cascade structure yields a valuable source to learn the extent to which a cascade grows in a broadcast (or breadthfirst) or viral (or depth-first) way. From the view of tree traversal, we can simply interpret broadcast cascade as a result of "breadth first search" and viral cascade as a result of "depth first search". In this regard, there have been several important attempts to quantify the virality of cascades [1] [2] [3] 13] , where the more viral of a cascade, the larger of this virality quantity would be. However, unlike many other undirected graphs, once the root of a cascade is identified, the direction of its growth or flow is thus determined: from the root node to its descendants (see Fig. 1 , (a)-(d)). In other words, rooted cascades are actually directed graphs, rather than undirected graphs. For example, rooted cascades in Fig. 1 (a)-(d) exhibit distinct structures, and thus should have different virality scores, but if the root nodes are overlooked, they are deemed as the same structure under graph isomorphism ( Fig. 1 (e) ). More such examples are illustrated in Fig. 2 . For ease of visualization, the directed ties in each rooted cascade are not shown hereafter, as the existence of the root node already implies the direction of the flow. Therefore, it's anticipated that ignoring the root node and simply regarding cascades as undirected trees would inevitably result in imprecise estimation of the virality of cascades under certain situations. To remedy this, a root-aware approach is proposed to quantify the virality of cascades in this paper. Specifically, instead of computing the average distance between all pairs of nodes in a cascade tree (which is also termed Wiener index or Structural virality [1] ), we calculate the average distance between a node and its descendants, and summing the computed distances over all nodes yields the amended virality of a cascade. The proposed approach can operate in a recursive manner, and is able to provide high-resolution depiction of cascade virality. In light of preferential attachment mechanism [8, [17] [18] [19] [20] [21] [22] , this work further presents a cascade growth model based on node genealogy instead of merely node degree. Specifically, the genealogy of a given node is defined as the subtree rooted at this node and contains only the given node itself and its descendants, and the probability that a new node attached to an existing node depends on the genealogy size of the target node at different generations. Through numerical simulations, we show the benefits brought by genealogy-based modeling compared with degree-based modeling. Leveraging empirical data from five online discussion communities, we further verify the potential utility of the proposed model in capturing the virality landscape of online discussion threads. Quantifying the virality of cascades is a challenging task as it synthetically considers not only the size but also the depth and branches of a given cascade tree. Goel To obtain the virality VT of a tree T , we imagine the tree is rooted at node u. Once node u is removed, the tree would split into multiple subtrees (two subtrees in the given example), each rooted at one of u's children (which are v and w in the case). Repeat the root node removal process step by step, until no more nodes have any descendants. Count the average path length between root nodes (which are u, v and w in the case) and their descendants, then add them up immediately yields the desired VT . et al. [1] address this issue by introducing Wiener Index, which quantifies the average distance between any two nodes in a cascade tree, as the desired virality measure which they term Structural Virality. This attempt makes important advancement in quantitatively depicting the extent to which a cascade is viral, but it fails to account for the root of a cascade and inherently treats the cascade as an unrooted tree, thereby hindering its ability to differentiate cascades which are actually non-isomorphic but will be isomorphic if mistakenly represented by unrooted trees (See Fig. 1 and Fig. 2 ). To remedy this and more precisely quantify the virality of cascades, we propose a root-aware approach. The obtained virality score is then named as Cascade Virality. Specifically, the cascade virality of a rooted tree is quantified by the sum of the average path length between nodes and their descendants. Formally, for a rooted cascade tree T of size N (N ≥ 2), the cascade virality of T is d ij denotes the distance from node i to one of its descendant j, D i is the set of descendants of node i and | * | represents the size of a set. A schematic of the above process is shown in Fig. 3 . Starting from the root node u, the average distance from node u to its descendants is counted, denoted asd u = ave(1, 2, 2, 1, 2) = 1.6 ( Fig. 3(b) ); Suppose that node u is then removed, thus resulting in two subtrees with node v and w (i.e., child nodes of node u) act as new root node for each subtree; The average distance from node v and w to their descendants are then computed asd v = ave(1, 1) = 1 (Fig. 3(c) ) andd w = ave(1) = 1 ( Fig. 3(d) ); Repeat the above node removal and average path length calculation process, until no more nodes have any descendant (Fig. 3(e) ). The cascade virality of T , V T , is finally obtained by the sum over all the calculated average path lengthd, which isd u +d v +d w = 3.6 in the given example (Fig. 3) . Also note that the proposed approach to quantify the cascade virality of a rooted tree is easy to implement in a recursive manner, which enables the time complexity in the actual execution. A Python implementation of this approach is freely available online [23] . The obtained cascade virality has good properties. Fig. 4 illustrates the obtained virality of rooted cascades of size 5 in an increasing order of cascade virality. For totally broadcast cascade (e.g., Fig. 4 (a)), the virality is minimized and should be 1, as the root node directly connects all other nodes. For totally viral cascade of size N (e.g., Fig. 4 (i)), the virality is maximized and should be (N − 1)(N + 2)/4, as the cascade tree is ultimately a path graph with the root node located at one end. For any other type of cascade, the virality of it locates at the range [1, (N − 1)(N + 2)/4] (Proof can be found in Appendix A). It's also anticipated that, for cascades with a fixed size, cascades with larger generations (i.e., depths) are generally more viral than their counterparts with smaller generations. For example, as shown in Fig. 4 , cascades with a depth of 3 ( Fig. 4 (f)-(h)) are generally more viral than cascades with a depth of 2 ( Fig. 4 (b)-(e)) and cascade with a depth of 1 ( Fig. 4 (a) ). We first apply cascade virality on the phylogeny of virus strains. Fig. 5 shows the phylogenetic tree of two viruses-one is the ongoing novel coronavirus COVID-19 in the region of Asia (denoted as COVID-19/Asia) and the other is the influenza H1N1pdm (denoted as flu/H1N1pdm)-obtained from Nextstrain [11] . Note that the leaf nodes represent the sampled strains, while the rest of nodes represent inferred common ancestors, and edges indicate the evolutionary relationships [11, 24] . Intuitively, flu/H1N1pdm tends to evolve in a more viral way than COVID- 19 . Based on the obtained data, we find that flu strains evolve into more generations than currently available COVID-19 strains since three out of four flu cascades go beyond the depth of 60 but no regional cascades of COVID-19 do so in current stage. Except for two large COVID-19 cascades in Europe and North America, the rest of COVID-19 cascades seem less viral than flu cascades. Nevertheless, it's foreseeable that the virality of COVID-19 cascades would increase dramatically in future if the rapid proliferation of COVID-19 is still out of control around the global. The obtained cascades of virus strains can be considered as very rare events, we then apply cascade virality on very general cases with large-scale online discussion data as examples. A discussion thread can be structurally represented as a rooted cascade tree, where the root node denotes the post itself while each other nodes denotes a comment under this thread [21] . Leveraging discussion cascade data from five online communities in Reddit-a social news aggregation and discussion website-we conduct an analysis of cascade virality and its relation with cascade size and depth. The dataset includes discussion cascades in five online domains, namely r/conspiracy, r/history, r/sports, r/science and r/Bitcoin, in the year of 2017 [26] . There are 344 396 cascades in total, including 133 917 in conspiracy, 16 741 in history, 23 178 in sports, 23 871 in science and 146 689 in Bitcoin. The CCDFs (Complementary Cumulative Distribution Functions) of cascade size, depth and virality are shown in Fig. 6 . As we can see from the figure, most of the cascades have a size of less than 100 ( Fig. 6(a) ) and a depth of less than 10 ( Fig. 6(b) ) across domains. We also note that conspiracy cascades are generally more viral than other kinds of cascades like science and sports (Fig. 6(c) ). With a further exploration to the empirical cascade data, we find that both cascade size and depth empirically contribute to an increase of cascade virality across domains (Fig. 7) . As shown in Fig. 7 , cascade virality scores become larger and larger from the lower left to the upper right of the graph in each subplot. In other words, for cascades with equal sizes, those whose depths are deeper tend to have higher cascade virality scores. Likewise, for cascades with equal depths, those whose sizes are larger are also more likely to have higher cascade virality scores. In summary, in light of the discussion cascades from five online domains, we find empirical evidence that cascade virality is positively correlated with both cascade size and depth. A. Preferential attachment Preferential attachment, especially for BA (Barabási-Albert) preferential attachment, has been widely adopted to model the growth of complex networks [17-19, 24, 27-31] . Specifically, it assumes that, at each time step, the probability that a new coming node added to node i is proportional to node i's degree k i in the network. Namely, where k j denotes the degree of node j and the sum is over all nodes already present in the network. The linear preferential attachment kernel could result in many interesting network properties, such as the power law distribution in terms of node degree. The preferential attachment to high-degree nodes could also lead to the well know "the rich get richer" phenomenon. More generally, one might consider a non-linear attachment kernel, where the attachment is proportional to node's degree raised to the power γ, Clearly, γ = 1 recovers classical preferential attachment model while γ = 0 yields uniform attachment model. As one special case of complex network, a treestructure network is achieved when each new node only connects to one (instead of multiple) existing node [18] . Recalling the way we quantify cascade virality, it's anticipated that during the growth of a cascade, new node attachment to long-range leaf nodes would result in a more viral cascade than the attachment to the root node ( Fig. 8(a) ). To mimic the growth of cascades and particularly control the virality of cascades during the growth process, we propose a genealogy-based preferential attachment mechanism inspired by classical degree-based attachment kernel. We will illuminate the proposed model in detail below. For a cascade tree, once the root node is determined, the genealogy of it is readily available. Note that, for any node in a cascade tree, the genealogy of it can be represented by a (sub)tree rooted at itself. Fig. 8(b) illustrates such an example tree rooted at node 1, where the genealogy levels of node 1 are denoted by g. Specifically, nodes 2-8 are all descendants of node 1, resulting in a genealogy of size 8 for node 1 (including node 1 itself), while only nodes 3 and 5-8 are node 2's descendants in the rooted tree, thus resulting in a genealogy of size 6 for node 2 (including node 2 itself). More importantly, the genealogy of node 1 at level 1 includes nodes 1, 2 and 4, resulting in a genealogy of size 3 at this level; while the genealogy at level 2 includes two more nodes (nodes 3 and 5), suggesting a genealogy of size 5 at this level. Similarly, for genealogy of node 2, there are three nodes (nodes 2, 3 and 5) at level 1 and six nodes (nodes 2, 3 and 5-8) at level 2. Formally, for node i in a cascade tree, the complete genealogy of it contains itself and all the descendants of it, but the genealogy at level g contains itself and its descendants within g generations only. For the sake of consistency, the complete genealogy is also denoted as genealogy at level max. Here, we propose a genealogy-based preferential attachment model, where new node attachment relies on a node's genealogy size at specific genealogy levels instead of degree. Namely, where d jg is the genealogy size of node j at level g. Eq. (5) describes a linear attachment kernel, and more generally we have where parameter γ controls the attachment intensity. Clearly, γ = 1 recovers the linear attachment kernel while γ = 0 recovers the uniform attachment kernel. When γ > 1, the attachment kernel becomes superlinear, while in the interval 0 < γ < 1, the attachment kernel becomes sub-linear. For γ < 0, the model favors homogeneous attachment where new coming nodes are preferentially attached to existing nodes with small genealogy sizes. It's anticipated that in the limit γ → +∞, the model would generate totally broadcast cascade tree, where the root node directly connects all other nodes. In the limit γ → −∞, the model would generate totally viral cascade tree, where each node has one successor only except for the last node which acts as leaf node. Synthetic cascades when the complete genealogy is considered in linear attachment kernel (i.e., γ = 1 for genealogy-based kernel at level max). Taking cascades of size 100 for illustration purposes only. We numerically evaluate the effect of parameter γ in driving the virality of cascades. For ease of elaboration, we set the cascade size as 100 when constructing synthetic cascades. In Fig. 9 , we show the cascades generated by the linear attachment kernel based on complete genealogy size. As shown in the figure, the generated cascades tend to condense on nodes at low depths (particularly on the root node), where the number of successors diminishes with the increase of distance from the root node. We further show the generated cascades based on complete genealogy with the varying γ ∈ {5.0, 1.5, 1.0, 0.5, 0.2, 0, −0.2, −2.5, −10.0} in Fig. 10 . When γ is set by large positive values, such as 5.0 in the example, the generated cascade will condense completely on the root node, which corresponds to a totally broadcast cascade. With the decrease of γ, the condensation state diminishes gradually and even becomes trivial for negative values of γ. When γ is set by negative values with large magnitude, such as -10 in the given example, the generated cascade will form a long path with the root node located at one end, which corresponds to a totally viral cascade. Synthetic cascade examples based on degree attachment kernel can be found in Appendix B. We further numerically investigate the relationship between cascade virality and parameter γ in different cascade growth models. We consider not only the complete genealogy size but also genealogy at level 1 and classic degree based attachment kernel. Specifically, cascade size is set as 100, γ is set in the range of [-15, 10] with a step of 0.1, and 100 cascades are generated under each parameter γ. Cascade virality versus parameter γ according to different growth mechanisms (using cascades of size 100 for illustration). Cascade virality bounds are shown in grey dashed horizontal lines. Shading areas (in grey) indicate bootstrapped 95% confidence intervals obtained from 100 random simulations under each parameter γ. As shown in the figure, classic degree-based mechanism cannot exhaust the complete virality space. In Fig. 11 , we show the cascade virality with the increase of γ. Note that, for cascades of size 100, their virality scores are located at the range [1, 2524.5] . The upper and lower bounds are shown in grey dashed horizontal lines in the figure. Clearly, cascade virality decreases with the increase of parameter γ in each model, but classic degree-based kernel cannot exhaust the complete virality space. With the increase of γ, two genealogybased models ultimately converge to completely condensate state, which is not observed for degree-based kernel. Intuitively, genealogy-based models provide much more strong representation power in generating cascades with varying virality. We test the performance of cascade growth models on empirical cascades described above. For each real cascade, a synthetic cascade with the same size is generated according to each growth model. We than compare the virality of generated cascades with that of real cascades in terms of the distribution of virality scores. The comparison can be done by KS (Kolmogorov-Smirnov) statistic between the generated cascade virality scores and real cascade virality scores. The KS statistic of two distributions is given by the maximum discrepancy between their cumulative distribution functions (CDFs), and has been applied in assessing the goodness of fit between two degree distributions [24, 32] . Generally, smaller KS statistic favors better fit. Given the empirical virality distribution, we estimate γ by minimizing the KS statistic averaged over ten random simulations under each parameter γ. The optimal γ for each model under each community is chosen as the γ that results in the minimum KS statistic over the parameter space. For ease of computation, cascades whose sizes are larger than 100 or the 95th percentile, whichever is greater, are omitted in the analysis. Cascades of size 2 are also not included as any cascade growth model can exactly generate the same kind of cascade of size 2. Fig. 12 presents the results for degree, genealogy-1 and genealogy-max based models, where smaller minimum KS statistic indicates better model under each community. As shown in the figure, the minimum KS statistics achieved by degree-based model are larger than 0.95 across all five domains, while genealogy-based models show better performance with the minimum KS statistics all less than 0.95. Specifically, genealogy-1 achieves better performance than genealogy-max, suggesting that new node attachment during the growth of such cascades may be primarily driven by the size of the direct descendants plus the specific node itself, rather than by the size of the complete descendants. In summary, leveraging dataset of discussion cascades from five online domains, we demonstrate the performance of genealogy-based models to mimic the growth of cascades. In this work, we introduce a root-aware approach to quantify the virality of cascades. The proposed ap-proach considers cascades as directed instead of undirected graphs, thus avoiding the pitfall of graph isomorphism that previous studies have fallen into due to the overlook of cascade source. Using synthetic and empirical cascades, we show the property and utility of the proposed approach and particularly the relationships between cascade virality and cascade size and depth. In doing so, we find that both cascade size and depth empirically contribute to an increase of cascade virality from the obtained large-scale discussion data. To mimic the growth of cascades as an interpolation between broadcast and viral proliferation, we further introduce a cascade growth model inspired by preferential attachment mechanism. Specifically, the attachment kernel is constructed based on node genealogy rather than node degree. With the decrease of the attachment intensity, the generated cascade grows from a broadcast to a viral way. Numerical simulations on synthetic and empirical cascades reveal the advantage of genealogy-based attachment kernel over degree-based attachment kernel in characterizing the virality of cascades. Our work contributes to a further understand of how to quantify the virality of cascades and provides a rich avenue for compelling applications. For example, as we have demonstrated in the main text, conspiracy and science cascades display quite different patterns in the distribution of cascade virality. As such, the proposed approach to quantify cascade virality may provide additive power to distinguish different types of cascades. Moreover, as shown in the paper, the attachment kernel significantly affects the growth and virality of cascades, which may offer valuable insights to the implementation of viral marketing in practice. In this work, we mainly address tree-structure cascades, but future works may consider more complex diffusion structures, such as the locally tree-like diffusion process which is prevalent in network spreading. We have applied the proposed measure to capture the virality of COVID-19 and influenza strains, but future works can directly apply it on infection data (once such data are available) to have a better understand of the spread of virus. For a cascade T of size N (N ≥ 2), the cascade virality of it satisfies the following: 1 ≤ V T ≤ (N − 1)(N + 2)/4. 1. Extreme case when the cascade virality is minimized: a star graph with the root located at the hub (i.e., totally broadcast), and the corresponding virality of the cascade is V T = 1; 2. Extreme case when the cascade virality is maximized: a path graph with the root located at one end (i.e., totally viral) • For a complete path cascade T of size 2, the cascade virality is 1; • For a complete path cascade T of size N (N ≥ 3), the cascade virality is V T = 1 + 1+2 2 + ... + Proceedings of the 23rd international conference on World wide web Proceedings of the National Academy of Sciences Proceedings of the 16th ACM SIGKDD international conference on Knowledge Discovery and Data Mining Proceedings of the 22nd ACM conference on Hypertext and Hypermedia Proceedings of the 24th international conference on World Wide Web Proceedings of the 21th ACM SIGKDD international conference on Knowledge Discovery and Data Mining The World Wide Web Conference Algorithm for quantifying cascade virality is The raw data of Reddit