title: Conductance and Social Capital: Modeling and Empirically Measuring Online Social Influence
authors: Ram, Rohit; Rizoiu, Marian-Andrei
date: 2021-10-25

Social influence pervades our everyday lives and lays the foundation for complex social phenomena. In a crisis like the COVID-19 pandemic, social influence can determine whether life-saving information is adopted. Existing literature studying online social influence suffers from several drawbacks. First, a disconnect appears between psychology approaches, which are generally performed and tested in controlled lab experiments, and the quantitative methods, which are usually data-driven and rely on network and event analysis. The former are slow, expensive to deploy, and typically do not generalize well to topical issues (such as an ongoing pandemic); the latter often oversimplify the complexities of social influence and ignore psychosocial literature. This work bridges this gap and presents three contributions towards modeling and empirically quantifying online influence. The first contribution is a data-driven Generalized Influence Model that incorporates two novel psychosocial-inspired mechanisms: the conductance of the diffusion network and the social capital distribution. The second contribution is a framework to empirically rank users' social influence using a human-in-the-loop active learning method combined with crowdsourced pairwise influence comparisons. We build a human-labeled ground truth, calibrate our generalized influence model and perform a large-scale evaluation of influence. We find that our generalized model outperforms the current state-of-the-art approaches and corrects the inherent biases introduced by the widely used follower count. As the third contribution, we apply the influence model to discussions around COVID-19. We quantify users' influence, and we tabulate it against their professions. We find that the executives, media, and military are more influential than pandemic-related experts such as life scientists and healthcare professionals.

Figure 1: (bottom) An example cascade is modeled using Hawkes processes. Each event (timestamp on x-axis) has a mark (y-axis) and spawns new events following a time-decaying intensity (magenta areas). (top) The latent branching structure is shown with solid lines, and other valid pathways are shown with dotted lines. GIM has two psychosocial-inspired components. Conductance: edge thickness represents conductance, which modulates the likelihood of observing diffusions along that edge. Social capital distribution: a percentage of a node's capital (green shades) is transferred along diffusion edges (red arrows), from target to source. Influence is proportional to the accumulated capital.

Influence is foundational to the construction of opinions and enacting change. It governs interpersonal relationships and contributes to the establishment of societal institutions and reforms [51]. Social media ubiquity provides fertile ground for influence mechanisms to unfold and for a minority of users to exert disproportionate control. Although there is a rich trove of literature related to social influence, stretching from the psychosocial to computational domains, a quantitative approach to measuring online influence remains elusive [43, 53].
Successful solutions to this problem would allow identifying the online actors who shape societal views and would also be a first step toward addressing societal issues, such as the spreading of misinformation amid a pandemic that undermines immunization campaigns and stokes vaccine hesitancy. In this paper, we model and empirically measure online influence by bridging the divide between psychology and the quantitative approaches [5]. The psychology approaches [5, 16, 45] involve micro-level analysis in well-controlled laboratory experiments. Such works understand minute factors and infer causal relationships. Still, they are slow, expensive to deploy, and cannot be leveraged in real time (say, to analyze influence in pandemic-related discussions). The quantitative approaches [14, 18, 56] scale to large networks and model the emergent behavior therein. However, these are usually data-driven tools that make "empirically questionable simplifying assumptions" [43]. Such operationalizations only offer narrow interpretations of influence and often evaluate against on-hand metrics (for example, retweet counts) that are weak and biased influence representations (see our results in Section 4.2).

In this work, we address several open questions relating to modeling empirical influence. The first question concerns quantitatively modeling and estimating influence from online conversation data. The conductance ([16, 49], see Section 3.1.1) and social attribution ([16, 32, 45], see Section 3.1.2) are known in the psychosocial literature to modulate the exertion of social influence. However, these factors are not considered by quantitative models. For conductance, these models often assume that social ties are the primary influence channels, while other conductive channels (like recommender systems and homophily) are underexplored. For social attribution, the question is who gets the (social) credit in a chain of influence exertion. Consider the example in Fig. 1. Alice influences Bobbie, who influences several people; should Alice (the initiator) or Bobbie (the connector) be attributed more social capital? Therefore, we ask: how do we model influence to account for network conductance and social attribution mechanisms?

The second question relates to empirically measuring influence at scale. Psychological studies mainly concentrate on identifying the characteristics that allow individuals to exert influence, such as authority, likeability, attractiveness, and expertise [16]. However, such works deal less with measuring relative influence between individuals and the emergence of social capital (defined as the confluence of said influence-enabling characteristics). Furthermore, performing pairwise comparisons between individuals scales quadratically with the size of the cohort, quickly rendering the costs prohibitive and keeping the size of the studies small. Additionally, without laboratory-controlled environments, data fidelity can suffer. Therefore, how do we generate an empirical social capital ranking, true to the complexities of social influence, and scale it feasibly to large cohorts?

The third question concerns the exertion of influence within discussions with societal stakes (here, the COVID-19 pandemic). Ideally, populations would heed the advice of experts (i.e., epidemiologists, biomedical scientists, and healthcare professionals) during pandemics; however, the rise of the anti-vaxxer movement and vaccine hesitancy is evidence to the contrary.
The question is: can we identify the occupational groups which yield the most influence in the discussions around the COVID-19 crisis?

We address the above questions on two Twitter datasets about the Australian 'Black Summer' Bushfires and the COVID-19 pandemic. We address the first question by proposing the Generalized Influence Model (GIM) that quantifies influence based on observed information cascades. It leverages two psychosocial-inspired components, depicted in Fig. 1. The conductance mechanism modulates the likelihood of cascade pathways using user lexical and following similarity, and relationship ties. The social capital distribution mechanism transfers a proportion of a node's social capital to its parent in the information cascade, allowing highly connected nodes to accumulate social capital (which translates to influence).

We address the second question by constructing a crowdsource-based empirical influence measurement framework containing three components. First, we use a human-in-the-loop active learning technique that selects a minimal number of pairwise comparisons to be made and maximizes the utility of each decision. Second, we leverage an augmented Bradley-Terry model to build an influence ranking from pairwise comparisons while accounting for systematic noise in worker decisions. Third, we engineer an MTurk survey instrument; we use an ablation study to estimate the impact of design features on the worker decision accuracy. We design simulation and parameter fitting tools to estimate the budget required to scale the MTurk experiment given a number of targets. We use the MTurk empirical influence estimation to construct a ground truth for evaluation. We show GIM to outperform several baselines, including the current state-of-the-art influence quantification [56], and to correct the biases introduced by the widely used follower count.

We address the last question by applying GIM to estimate users' influence in a large dataset of COVID-19 discussions. We determine user occupation using the O*NET taxonomy [21], and find that executives and the media are the most influential occupations, while engineers and computer specialists are the least influential. Worryingly, the pandemic experts and healthcare workers are less influential than food processors, entertainers, and business professionals in pandemic-related discussions.

The main contributions of this work include:
• A Generalized Influence Model which incorporates two novel psychosocial-inspired mechanisms: conductance and social capital distribution.
• A crowdsourced empirical influence measurement framework to build the influence ranking of large cohorts.
• Quantifying the influence of occupational groups in COVID-19 related discussions and finding (perhaps unsurprisingly) that pandemic experts are not the most influential.

This section briefly introduces the prerequisites for this work. In Section 2.1, we link social influence and modeling diffusion cascades using the Hawkes self-exciting model. In Section 2.2, we cover the empirical psychometric measurement pipeline, including the use of pairwise comparisons, ranking from pairwise data, and practicalities of collecting pairwise data online.

The retweet, Twitter's affordance to share others' tweets (and retweets), is widely seen as an explicit endorsement of opinions and ideas. Consequently, prior works define a tweet's influence as the number of (direct and indirect) retweets it generates [19, 56].
However, the branching structure of retweet cascades is unobserved, as Twitter attributes all retweets to the original tweet. Rizoiu et al. [56] infer it by assuming retweets arrive following a Hawkes point process. Retweet cascades consist of an original tweet and the subsequent retweets. We denote a marked cascade observed up to time T as H(T) = {v_1, v_2, ...}, where v_i = (t_i, m_i) denotes the i-th tweet, t_i is the event time relative to the original tweet (t_1 = 0), and m_i ∈ R (dubbed the mark) is the event meta-data. The branching structure of a retweet cascade is a latent graph G(V, E), where V are the tweets and E contains the direct retweet relations. Given a cascade of n events, we define the set of all valid branching structures as Υ = {G | (v_i, v_j) ∈ E ⟹ t_i < t_j}. Rizoiu et al. [56] show that |Υ| = (n − 1)!. Fig. 1 shows an example retweet cascade, its most likely branching structure in solid lines and other potential branching structures in dashed lines. Next, we compute the probability mass function over the branching structure set Υ.

Hawkes processes [29] are stochastic processes with the self-exciting property (the arrival of an event increases the likelihood of future events), applicable here due to the property of social affirmation (i.e., past social actions encourage incoming actions). In Hawkes processes, events arrive following the conditional intensity

λ(t) = μ(t) + Σ_{t_i < t} m_i^β κ(t − t_i),

where μ(t) is the baseline intensity; each event v_i increases the overall intensity by m_i^β κ(t − t_i); m_i^β is one way to model marks, where β mediates the marks' effect; and the kernel κ : R+ → R+ controls the event intensity decay. The exponential κ(τ) = θ e^{−θτ} and the power-law κ(τ) = θ (τ + c)^{−(1+θ)} are common parametric forms. Hawkes and Oakes [30] propose the branching representation of Hawkes processes, where each event v_i generates offspring following a non-homogeneous Poisson process of intensity m_i^β κ(t − t_i), illustrated by the magenta areas in Fig. 1. Lewis and Mohler [37] use the branching representation to estimate the probability that v_j is a direct offspring of v_i as

p_{ij} = m_i^β κ(t_j − t_i) / λ(t_j).

Intuitively, p_{ij} is the proportion of intensity that v_i contributed to the total intensity at time t_j. Note that for retweet cascades μ(t) = 0, and p_{ij} = 0 when t_i > t_j. Finally, the probability of a valid branching structure G ∈ Υ is P(G) = Π_{(v_i, v_j) ∈ E} p_{ij}.

Hawkes-modeled Influence. Rizoiu et al. [56] measure influence as the expected number of offspring of a tweet across all valid branching structures. They formalize influence as

φ(v_i) = Σ_{G ∈ Υ} P(G) Σ_{v_j} 1(G : v_i → v_j),     (1)

where 1(G : v_i → v_j) indicates a path between v_i and v_j in the branching structure G. Using the independent cascades assumption [9] (that generating a tweet at t_j is independent of the diffusion structure up to t_j), Rizoiu et al. [56] devise an efficient iterative procedure for computing φ(v_i). They introduce m_{ij}, the pairwise influence exerted by v_i on v_j, either directly (when v_j is a direct offspring of v_i) or indirectly (when v_j lies on the same diffusion path as v_i). Formally,

m_{ij} = Σ_{k=i}^{j−1} m_{ik} p_{kj},     (2)

where m_{ij} = 1 when i = j, and m_{ij} = 0 when i > j. Intuitively, m_{ij} is the sum of the probabilities of all valid paths between v_i and v_j. Consider an example with three tweets v_1, v_2, v_3: the influence m_{13} is computed as m_{13} = m_{11} p_{13} + m_{12} p_{23} = p_{13} + p_{12} p_{23}, representing all possible paths between v_1 and v_3. A tweet's influence is the total influence it exerts, i.e., φ(v_i) = Σ_{j=i}^{n} m_{ij}. The {m_{ij}} matrix is computed in n matrix multiplications, and the total time complexity is O(n^3). Finally, a user's influence is the average influence of all their tweets.
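To make the computation above concrete, the following is a minimal sketch (our own illustration, not the authors' released code) of the direct-offspring probabilities p_{ij} and the pairwise-influence matrix m_{ij} for a single cascade, assuming an exponential kernel and ignoring numerical edge cases.

```python
import numpy as np

def offspring_probabilities(t, m, beta=1.0, theta=1.0):
    """p[i, j] = probability that tweet j is a direct retweet of tweet i,
    under a Hawkes process with exponential kernel kappa(tau) = theta * exp(-theta * tau)."""
    n = len(t)
    p = np.zeros((n, n))
    for j in range(1, n):
        weights = np.array([m[i] ** beta * theta * np.exp(-theta * (t[j] - t[i]))
                            for i in range(j)])
        p[:j, j] = weights / weights.sum()   # lambda(t_j), restricted to earlier events
    return p

def pairwise_influence(p):
    """m_inf[i, j] = sum over all paths i -> j of the product of edge probabilities (Eq. 2)."""
    n = p.shape[0]
    m_inf = np.eye(n)                        # m_ii = 1
    for j in range(1, n):
        for i in range(j):
            m_inf[i, j] = m_inf[i, :j] @ p[:j, j]
    return m_inf

# toy cascade: event times (seconds) and marks (e.g., follower counts)
t = np.array([0.0, 20.0, 50.0, 55.0])
m = np.array([300.0, 10.0, 45.0, 5.0])
p = offspring_probabilities(t, m, beta=0.5, theta=0.01)
influence = pairwise_influence(p).sum(axis=1)   # phi(v_i) = sum_j m_ij
print(influence)
```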
Crowdsourcing platforms, such as Amazon Mechanical Turk (MTurk), allow large pools of human workers to complete tasks that computers cannot do. The research community uses them for labeling tasks (say, identifying objects in pictures) and for psychometric studies. For the latter, they are significantly cheaper, quicker, and induce a more diverse participant pool than traditional surveys [11]. MTurk provides a programmatic interface to serve tasks and process worker responses, making it amenable to active learning setups [2], i.e., constructing the next task based on previously received answers. Additionally, researchers have meticulous control of the worker interface and can filter workers by their characteristics.

Pairwise influence comparison. There are two main psychometric approaches to estimate empirical influence. The first approach asks humans to rank (or score) a set of targets according to their perceived influence. However, this direct ranking has scale calibration issues when performed by non-experts [64]. The second approach is the pairwise comparison, which features several advantages: it leads to lower measurement errors [59], induces a more straightforward experimental task suitable for non-experts in crowdsourcing setups, and is amenable to the active learning paradigm [2], which reduces the number of comparisons required for high score fidelity. Methods for recovering a scoring from pairwise comparisons, such as Bradley-Terry and Thurstonian models, have a strong precedent in psychometric analysis [12, 35]. The Pairwise Comparison Matrix (PCM) W ∈ R^{N×N} represents the outcome of N items being compared, where W[i, j] is the number of times item i is favoured over item j, denoted hereafter as i ≺ j.

Bradley-Terry Model (BT) [10, 67] proposes a method for ranking individuals when only incomplete pairwise comparisons are available. It is commonly used in sports analysis [13, 67] (e.g., ranking chess players from matches) and psychometric studies [35]. Each individual (hereafter called a target) is ranked by its latent intensity s_i ∈ R. The probability that target i is preferred over target j is P(i ≺ j) = 1 / (1 + e^{−(s_i − s_j)}). The maximum-likelihood estimates (MLE) are computable even from incomplete sets of pairwise comparisons containing circular comparison results (e.g., i ≺ j, j ≺ k, k ≺ i) [12, 44]; this is particularly relevant for crowdsource experiments where workers judge difficult choices differently. Furthermore, adaptive methods for choosing the pairs to compare were proposed [58] to obtain high-fidelity scores with minimal comparisons.

Ranking items and Quicksort. When pairwise comparisons are deterministic, the optimal sorting complexity requires O(N log N) comparisons [31]. When pairwise comparisons are stochastic, all pairs must be compared repeatedly to overcome intransitivity. This requires O(N^2) comparisons, prohibitively expensive when worker remuneration is per comparison. Maystre and Grossglauser [44] show that Quicksort is an effective heuristic for choosing items to compare and can recover a ground-truth ranking with high probability given a sub-quadratic number of comparisons, consequently reducing comparison complexity to O(N log N). Quicksort [31] sorts a collection C by randomly choosing a pivot item p and partitioning the collection into two sub-collections: {x ∈ C : x < p} and {x ∈ C : x ≥ p}. When partitioning, p is compared to all other items in C. The algorithm is recursively applied to order these sub-collections, finally concatenating them to obtain a sorted collection.
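Putting these pieces together, the following generic sketch (our own, not the paper's fitting code) recovers BT scores from an incomplete and possibly intransitive set of comparisons; the tiny ridge term is only there to pin the arbitrary additive offset of the scores.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(comparisons, n_targets):
    """comparisons: list of (winner, loser) index pairs.
    Returns the MLE scores s (identifiable only up to an additive constant)."""
    def neg_log_lik(s):
        nll = sum(np.log1p(np.exp(-(s[w] - s[l]))) for w, l in comparisons)
        return nll + 1e-4 * np.dot(s, s)   # ridge pins the arbitrary offset
    res = minimize(neg_log_lik, np.zeros(n_targets), method="L-BFGS-B")
    return res.x

# toy example: 4 targets, a few (possibly circular) noisy comparisons
comparisons = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 1), (0, 3)]
scores = fit_bradley_terry(comparisons, n_targets=4)
print(np.argsort(-scores))   # ranking, most preferred target first
```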
This section introduces the Generalized Influence Model (GIM) and its core concepts: conductance and social capital distribution (Section 3.1). Next, we introduce a ranking model to determine the social influence of users from crowdsourced annotations (Section 3.2).

We propose GIM, which generalizes the Hawkes-modeled influence (see Section 2.1) by incorporating two mechanisms modeling crucial social information. The influence conductance encapsulates user relationships; the social capital distribution models the perceived allocation of social capital. Given a retweet cascade H(T) = {v_1, v_2, ...}, GIM quantifies the social influence of a tweet v_i as:

φ(v_i) = Σ_{G ∈ Υ} P(G) Σ_{v_j} Ψ(G : v_i → v_j) 1(G : v_i → v_j),     (3)

where Ψ(G : v_i → v_j) is the social capital distribution, and 1(G : v_i → v_j) indicates whether a path exists between v_i and v_j in G under the conductance-augmented probability distribution defined in Section 3.1.1. Ψ (Section 3.1.2) is defined at the level of edges, preserving the efficiency of the Hawkes-modeled influence computation.

Cognitive science [39] and social psychology [16, 49] literatures suggest that some relationships are more influential than others. Notably, people in the same community (or who share some similarities) are more influential to each other. The conductance assumes that different types of relations between users propagate influence more effectively (e.g., one might be more influenced by close family than by distant work colleagues); this makes some people more likely sources of influence than others. Intuitively, the social system propagates influence similarly to physical materials conducting electricity or heat. We denote the conductance of an edge (v_i, v_j) as c_{ij}, and define the updated probability that v_j is a direct offspring of v_i as

p'_{ij} = c_{ij} p_{ij} / Σ_{t_k < t_j} c_{kj} p_{kj}.     (4)

We consider two choices for conductance: topological (users' social network) and homophilic (users' similarity with others). Topological conductance assumes influence flows between users who are connected in the social graph (i.e., the follower relationship). Each valid edge (v_i, v_j), with t_i < t_j, has a baseline conductance b, regardless of whether v_j is connected to v_i, accounting for alternative influence conduits (such as news feeds or users following topics and hashtags). Formally, we define the topological conductance c_{ij} = b + f_{ij}, where f_{ij} = 1 if the author of v_j follows the author of v_i and 0 otherwise.

Homophilic conductance of an edge models the connection between similarity and influence, i.e., people similar to us influence us more [17, 24]. For each user we first build a user representation h_i ∈ R^d. Next, we quantify the homophilic conductance between two users using the cosine similarity between their user representations plus the baseline conductance b; formally, c_{ij} = b + cos(h_i, h_j). In our experiments in Section 4.3, we compute the user embeddings h_i using two lenses. The following lens leverages the observation that similar people consume similar content. On social media, following popular users is akin to consuming content. Accordingly, we measure the similarity between two users based on whether they follow the same people. The lexical lens exploits that similar people use similar language. The vocabulary and language style of users can be a strong indicator of their community, and we measure the similarity of users based on their choice of language (see more details in Section 4.1). Note that the homophilic conductance with the lexical lens does not require knowledge of the user following graph, which can be prohibitively expensive to obtain for Twitter.
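A small sketch of how the two conductance variants can re-weight the Hawkes offspring probabilities of Section 2.1 (our own illustrative implementation; the baseline value b = 0.18 below is just a placeholder, since in the paper b is selected by grid search):

```python
import numpy as np

def topological_conductance(follows, b=0.18):
    """follows[i, j] = 1 if the author of tweet j follows the author of tweet i; c_ij = b + f_ij."""
    return b + follows

def homophilic_conductance(H, b=0.18):
    """H: (n_tweets x d) matrix of user representations; c_ij = b + cosine(h_i, h_j)."""
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-12)
    return b + Hn @ Hn.T

def conductance_adjusted_probabilities(p, c):
    """Re-weight the Hawkes offspring probabilities p_ij by the conductance c_ij and
    renormalize each column, so that each later tweet still has exactly one parent."""
    q = p * c
    col_sums = q.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0      # column 0 (the initial tweet) has no parent
    return q / col_sums
```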
We propose a social capital distribution mechanism that leads to the accumulation of social influence and explains its dynamics. Psychologists have long recognised that particular characteristics in individuals are correlated with influence, namely authority [45], attractiveness [32], likeability, and others [16]. The congruence of these characteristics gives rise to social capital, the accumulation of which enables the exertion of influence. The distribution mechanism models the non-uniformity of social capital starting from information diffusions. We construct the social capital distribution as follows. Whenever the user of v_i directly influences v_j (i.e., v_j is a direct offspring of v_i in a retweet cascade), v_j transfers a portion of their social capital to v_i (denoted as α). Each tweet is endowed with 1 social capital for participation; it pays a proportion α ∈ (0, 1) of all their capital to their parent (if one exists) and keeps (1 − α); the initiator has no parent and does not transfer. Formally, the capital that v_i accumulates from v_j is the product of the transfers along the path connecting them,

Ψ(G : v_i → v_j) = α^{ℓ_G(v_i, v_j)},

where ℓ_G(v_i, v_j) is the length (in edges) of the path from v_i to v_j in G, and Ψ(G : v_i → v_i) = 1 is the tweet's own endowment. A user's social influence is proportional to the total social capital they accumulate via the capital distribution mechanism.

3.1.3 Iterative Computation. GIM can be computed efficiently by extending the pairwise influence (introduced in Section 2.1) to incorporate the concepts of conductance and social capital distribution. Formally,

m'_{ij} = Σ_{k=i}^{j−1} m'_{ik} α p'_{kj},     (5)

where m'_{ij} = 1 when i = j, and m'_{ij} = 0 when i > j. Consequently (and similar to the Hawkes-modeled influence), we obtain φ(v_i) = Σ_{j=i}^{n} m'_{ij}. The full derivation of the latter, from Eq. (3) via Eqs. (4) and (5), is shown in the online appendix [3, Appendix A]. Eq. (5) recursively generates all possible paths in the same way as Eq. (2), which allows an efficient iterative algorithm of temporal complexity O(n^3), fully detailed in the online appendix [3, Appendix B]. Visibly, the Hawkes-modeled influence [56] is a special case of GIM, with c_{ij} = α = 1 for all i, j.

This section introduces a cost-effective method to construct empirical influence rankings using crowdsourcing platforms. First, we deploy an active learning approach that leverages an augmented BT model [44] (Section 3.2.1). Next, we show the connection between model parameters and MTurk worker accuracy (Section 3.2.2). Finally, we propose a set of simulation and fitting tools that we leverage to estimate the minimum annotation budget (Section 3.2.3). Building the dense pairwise comparison matrix W (complexity O(N^2) for N targets) is unfeasible using crowdsourcing platforms, where costs scale linearly with the number of comparisons (see discussion in Section 2.2). The Bradley-Terry model has been successfully applied on sparse versions of W (i.e., not all pairs are compared) to build approximate rankings [36]. The question is how to select which pairs to compare to maximize the ranking quality with the minimum number of comparisons. Passive techniques choose pairs before running the experiment; however, they do not use the information learned during the experiment. Here, we employ a solution that exploits active learning to choose comparisons on the fly. Past comparisons inform future choices, which in turn are more informative than random choices.

Sparse pairwise comparison matrix via human-in-the-loop active learning. Our empirical influence quantification method builds on the active learning approach introduced by Maystre and Grossglauser [44]. We use the standard Quicksort algorithm to select pairs in the sparse pairwise comparison matrix W. We implement a human-in-the-loop system, in which human judges make the pairwise comparisons (using the MTurk platform), while the algorithm chooses which pairs to compare and builds the final ranking.
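A minimal, sequential sketch of such a human-in-the-loop driver follows (the oracle below is a random stand-in for posting an MTurk task; the production system described next parallelizes the recursion and batches comparisons):

```python
import random

def crowd_prefers(a, b):
    """Placeholder oracle: in a real pipeline this would post an MTurk task comparing
    targets a and b and return True if workers judge a as more influential than b."""
    return random.random() < 0.5   # stand-in answer, not a real worker decision

def crowd_quicksort(targets, recorded):
    """One Quicksort run: every pair is compared at most once; answers are appended to
    `recorded` as (winner, loser) pairs, later used to populate the sparse matrix W."""
    if len(targets) <= 1:
        return list(targets)
    pivot, *rest = targets
    more, less = [], []
    for t in rest:                               # one oracle call per (t, pivot) pair
        if crowd_prefers(t, pivot):
            more.append(t); recorded.append((t, pivot))
        else:
            less.append(t); recorded.append((pivot, t))
    return crowd_quicksort(more, recorded) + [pivot] + crowd_quicksort(less, recorded)

comparisons = []
ranking = crowd_quicksort(list(range(10)), comparisons)   # most influential first
```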
Furthermore, we implement Quicksort recursively; at each iteration, the sorting of the left (<) and right (≥) subpartitions is performed in parallel, fully taking advantage of MTurk's massive worker pool. In its design (see Section 2.2), Quicksort exploits information from past comparisons to reduce the number of future comparisons required to complete the task, therefore minimizing the total experiment cost. We compare a pair of targets at most once during one execution (denoted as a run); usually, multiple runs are required. Maystre and Grossglauser [44] show that Kendall's Tau (a ranking quality metric) improves with the number of comparisons made, and the estimated ranking asymptotically approaches the true ranking.

Target ranking using an augmented BT model. Response fidelity in psychometric experiments suffers from two types of noise. The first type is systematic noise, associated with worker subjectivity, worker inauthenticity, and perception biases. This type of noise can be minimized via experimental design interventions. The second type is stochastic noise, which we average out using repeated trials. To account for response fidelity, we use an augmented BT model [44] that introduces the noise γ into the probability of preferring the target i over j as

P_γ(i ≺ j) = 1 / (1 + e^{−(s_i − s_j)/γ}).     (6)

Visibly, as γ increases, the workers' decisions become closer to random (i.e., lim_{γ→∞} P_γ(i ≺ j) = 1/2). Maystre and Grossglauser [44] show that Kendall's Tau deteriorates as noise is increased.

Noise, budget, and quality. Designing an empirical experiment using MTurk requires a trade-off between three intertwined factors: noise, budget (number of comparisons, which translates into dollars), and ranking quality. For obtaining a required quality of ranking, higher noise requires a higher budget. Conversely, reducing the systematic noise reduces the required budget. In Section 4.1, we perform a series of design interventions to reduce the noise and increase worker accuracy, and in Section 4.2 we show how we trade off budget and quality for the best-obtained noise level. Here, we show the theoretical connection between the mean accuracy of worker decisions and the noise parameter γ. Let the decision of a worker comparing targets i and j (with i the truly more influential of the pair) be described by a Bernoulli random variable X_{ij} ∼ Bernoulli(P_γ(i ≺ j)), which equals 1 when the worker makes the correct choice; its mean is E[X_{ij}] = 1 × P_γ(i ≺ j) + 0 × (1 − P_γ(i ≺ j)) = P_γ(i ≺ j). The MTurk experiment is characterized by a series of Bernoulli trials, one trial per pair (i, j). Consequently, the accuracy of the human choices in the MTurk experiment is acc = (1/K) Σ_{(i,j)} X_{ij}, where K is the total number of comparisons. Note that the X_{ij} are independent but not identically distributed, since they depend on the choice of (i, j) for a given γ. The expected accuracy of the MTurk experiment over all worker choices is

E[acc] = (1/K) Σ_{(i,j)} P_γ(i ≺ j).     (7)

Eq. (7) links the mean worker accuracy and the noise parameter γ (via Eq. (6)). Visibly, lim_{γ→∞} E[acc] = 1/2, i.e., the accuracy of unbiased random choice in a binary classification problem. Fig. 2a plots the relation between noise and accuracy; the horizontal asymptote shows this limit.
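A sketch of this relation and of its numerical inversion, which is how the observed worker accuracy can be turned back into a noise estimate (our own illustration; the score gaps below are a synthetic stand-in for the gaps of the actually compared pairs):

```python
import numpy as np
from scipy.optimize import brentq

def expected_accuracy(gamma, score_gaps):
    """Eq. (7): mean of P_gamma(i over j) across compared pairs,
    where score_gaps[k] = s_i - s_j > 0 for the truly better target i."""
    return np.mean(1.0 / (1.0 + np.exp(-score_gaps / gamma)))

def fit_noise(observed_accuracy, score_gaps):
    """Invert Eq. (7): find gamma whose expected accuracy matches the observed one."""
    return brentq(lambda g: expected_accuracy(g, score_gaps) - observed_accuracy, 1e-3, 1e3)

gaps = np.abs(np.random.default_rng(0).normal(1.0, 0.5, size=1000))
print(fit_noise(0.72, gaps))   # gamma consistent with 72% worker accuracy for these gaps
```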
Here, we construct a set of tools to fit the parameters of the empirical influence measurement model (Section 3.2.1) and to sample synthetic worker comparisons (i.e., to generate synthetic MTurk experiments). Parameter fitting. The noise γ is a hyper-parameter of our empirical influence measurement model, as it depends on the quality of workers' decisions and not on the targets being evaluated (see Sections 3.2.1 and 3.2.2). Consequently, γ cannot be estimated from pairwise comparison data alone. In our pilot study in Section 4.1, we use the follower count as a proxy for influence, we measure the accuracy of the MTurk workers against the proxy, and we estimate γ̂ using Eqs. (6) and (7). Finally, given γ̂ and the set of comparisons {(i ≺ j)}, we obtain the influence score MLE ŝ by maximizing the log-likelihood Σ_{(i ≺ j)} log P_γ̂(i ≺ j).

Generate synthetic comparisons. Given a fixed noise level, the empirical influence measurement model can be used to generate synthetic pairwise comparisons. This is particularly useful to estimate the costs of a complete MTurk experiment. The simulation requires three parameters: the maximum number of comparisons (budget), the number of targets (#targets), and the noise parameter (noise). We start by sampling the latent influence intensity s_i from a power-law distribution for each target. We chose the power-law because the literature observes that social metrics tend to follow a rich-get-richer paradigm [7]. For example, the Twitter follower count is power-law distributed with exponent 2.016 [46], which we use for sampling the synthetic s_i. For a pair of targets (i, j), our simulated workers produce correct decisions with probability P_γ(i ≺ j), which is completely defined by s_i, s_j, and γ. We use the quicksort procedure to select comparisons, and we compute the BT estimates ŝ from the recorded responses. Finally, we measure ranking quality as the rank correlation between s and ŝ.

In this section, we first present the datasets, performance measures, and MTurk design choices (Section 4.1). We then perform a large-scale MTurk experiment to rank the influence of 500 users (Section 4.2), and evaluate GIM (Section 4.3). Finally, we estimate the influence of users in COVID-19 discussions, and we tabulate it against their profession (Section 4.4). The datasets were collected from Twitter in the context of two crises: the Australian 'Black Summer' Bushfires and the COVID-19 pandemic, respectively. The #ArsonEmergency dataset contains discussions and misinformation claiming that arsonists caused the bushfires [26]. The dataset was collected between 22 November 2019 and 9 January 2020, using keywords like arsonemergency, bushfireaustralia, bushfirecrisis, and others, and contains 197,475 tweets emitted by 129,778 users. The #Covid-19 dataset was constructed using the keyword covid19 during August 2020, and contains 143,356,591 tweets by 21,527,913 users. The homophilic conductance represents users through two lenses: following and lexical (see Section 3.1). For the following lens, we identify the 1000 most followed users, and collect the followees of all the users in our dataset (i.e., the users they follow). We represent a user as h ∈ R^1000, where h[i] = 1 if the user follows the i-th most followed user (h[i] = 0 otherwise). For the lexical lens, we construct user documents by concatenating the user's tweets; we represent them using TF-IDF (Term Frequency-Inverse Document Frequency) and a feature hashing dimensionality reduction technique [47]. Finally, we represent each user as h ∈ R^1,048,576.

Evaluation metrics. We evaluate the GIM ranking against the ground truth constructed in Section 4.2 using three measures: AUC-NDCG (see next), MAPE, and Spearman correlation. The information retrieval literature uses the Normalised Discounted Cumulative Gain (NDCG) to measure the overlap between two rankings. It privileges the correct ranking of the top-ranked positions and discounts errors in the lower rankings. Applied to influence, NDCG@k aims to order the most influential users correctly. We compute the Area Under the Curve for NDCG@k (AUC-NDCG) by varying k, producing a single metric value. We also compute the Mean Absolute Percentage Error (MAPE) of the difference in the ranking percentiles for each target.
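For completeness, a compact sketch of these measures as we describe them here (our own formulation; in particular, normalizing the percentile error by the ground-truth percentile is one plausible reading of the MAPE definition, not a confirmed detail):

```python
import numpy as np

def ndcg_at_k(relevance_true, scores_pred, k):
    """relevance_true: ground-truth relevance per user; scores_pred: model scores."""
    order = np.argsort(-scores_pred)[:k]
    ideal = np.sort(relevance_true)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    return np.sum(relevance_true[order] * discounts) / np.sum(ideal * discounts)

def auc_ndcg(relevance_true, scores_pred):
    """Average NDCG@k over k = 1..N, a discrete area under the NDCG@k curve."""
    n = len(relevance_true)
    return np.mean([ndcg_at_k(relevance_true, scores_pred, k) for k in range(1, n + 1)])

def mape_percentiles(rank_true, rank_pred):
    """MAPE between ranking percentiles (ranks are 1-based, so percentiles are > 0)."""
    n = len(rank_true)
    p_true, p_pred = rank_true / n, rank_pred / n
    return 100.0 * np.mean(np.abs(p_true - p_pred) / p_true)
```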
MTurk worker study design. We account for three types of design features for the interface and the study: user features, proxies, and qualifications (shown in Table 1).

Table 1: Design features of the MTurk worker study.
Follower Count: the number of people who follow the user.
Followee Count: the number of people the user follows.
Status Count: the number of posts the user has authored.
Proxy User: a third user on whom the influence of each of the targets is projected.
Qualifications: a mechanism to ban low-quality workers.

The user features show information about each of the two targets, allowing the MTurk workers to assess them against each other quickly. We provide links to the online Twitter profile should the worker want to investigate. Proxy users are introduced to reduce the workers' bias on their influence judgment; we ask workers who the proxy would find more influential. Proxies are selected similarly to targets (see Section 4.2). Finally, the qualifications are used to ban low-quality and ill-intentioned workers who randomly perform large amounts of comparisons, severely reducing worker accuracy. We describe in detail each of the design features, and we show the final worker interface in the online appendix [3, Appendix C]. Next, we perform a design feature ablation to show their relative importance for worker accuracy.

In this section, we build a user influence ground truth using the empirical influence measurement model introduced in Section 3.2.1. Ablation study of worker interface features. We design the MTurk study iteratively, seeking to reduce the systematic noise (see Section 3.2.1). We perform an ablation study to add and remove design features, and measure worker accuracy against the follower count baseline. We run each ablation three times, and report the mean accuracy. Fig. 2a plots the relation between mean worker accuracy and the noise (γ), and places the outcomes of each ablation on this line. We make three observations. First, the user features provide a 15% accuracy increase (from 0.57 for no user features to 0.72 for all features). The most important user features are follower count, followee count, and status count. Second, using proxies mitigates some worker bias (3% accuracy increase). Finally, removing the banning mechanism did not decrease the performance of the ablation study. This indicates that the low-quality workers did not show up for this experiment (probably discouraged by our prior banning). We henceforth use the complete set of features, which yields a worker accuracy of 0.72 (corresponding to γ = 1.22).

Estimate budget requirements. We use the determined noise parameter γ = 1.22 and the simulation procedure introduced in Section 3.2.3 to determine the comparison budget needed to achieve a suitably accurate influence ranking. We perform a grid search over the budget B and the number of targets N, and show in Fig. 2b the obtained correlation between the influence ranking estimate and the synthetic ground truth.

Figure 2: (a) Worker accuracy for ablations of the design features (see Table 1) and their MLE-fitted noise (relevant area zoomed in the inset); as more features are shown, worker accuracy increases. (b) Estimate of the required MTurk budget: we vary the number of targets (N, y-axis) and the maximum budget (B, x-axis); the color map and contour annotations show the Spearman correlation between the BT-estimated influence ranking (ŝ) and a synthetic ground truth (s); the cross denotes the chosen setup and estimated budget for our real-world experiments. (c) Evaluation of GIM against the MTurk ground truth in the space of NDCG-AUC (y-axis) and negative MAPE (x-axis); the solid shapes are the best models for each conductance-capital distribution combination; the empty shapes show the Pareto-dominated models in each combination, obtained via grid search in the space (b, α); the circle-crosses denote the baselines: Hawkes-modeled influence [56], PageRank [52], Retweet Influence [14], and ProfileRank [60]; the gray box is not to scale, and the coordinates for baselines are shown in brackets.
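A condensed sketch of one cell of this grid search (our own illustration: synthetic power-law targets, noisy simulated workers, and comparisons chosen uniformly at random here for brevity rather than via Quicksort):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import spearmanr

def simulate_ranking_quality(n_targets, budget, gamma, seed=0):
    rng = np.random.default_rng(seed)
    s = rng.pareto(2.016 - 1, size=n_targets)        # power-law latent influence (exponent ~2.016)
    wins = []                                        # (preferred, other) pairs from simulated workers
    for _ in range(budget):
        i, j = rng.choice(n_targets, size=2, replace=False)
        p_ij = 1.0 / (1.0 + np.exp(-(s[i] - s[j]) / gamma))   # noisy BT decision, Eq. (6)
        wins.append((i, j) if rng.random() < p_ij else (j, i))
    w = np.array([a for a, _ in wins]); l = np.array([b for _, b in wins])
    def nll(x):                                      # BT log-likelihood with a tiny ridge
        return np.log1p(np.exp(-(x[w] - x[l]))).sum() + 1e-4 * (x @ x)
    s_hat = minimize(nll, np.zeros(n_targets), method="L-BFGS-B").x
    rho, _ = spearmanr(s, s_hat)
    return rho

print(simulate_ranking_quality(n_targets=100, budget=2000, gamma=1.22))
```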
Visibly, the required budget increases with the number of targets, given a correlation level. In our experiment, we chose a correlation level of 0.93 and 500 targets, and estimate we require 30,000 comparisons (approximately US$120).

Build an influence ranking ground truth. We select 500 targets and 500 proxies with available profiles (not suspended nor protected) who post in English, as most MTurk workers come from English-speaking countries. We sample both high and low influence targets using the Hawkes-modeled influence baseline [56] (full details of user sampling in the online appendix [3, Appendix D]). We execute seven full quicksort runs, resulting in 36,252 comparisons, slightly higher than our budget; still, we prefer to complete the final run. We obtain the ordered influence ranking of the 500 targets. The generated influence ranking and the code for the MTurk empirical influence measurement method are available online at https://git.io/JK1cS. According to MTurk ethics and the US Federal minimum wage, we paid MTurk workers US$0.04 per task (10 pairwise comparisons).

Here, we first find the best combination of conductance-distribution mechanism for GIM; we evaluate it against several baselines in influence estimation, including the current state-of-the-art Hawkes-modeled influence [56]; and we show it to be an unbiased alternative to the follower count metric. Influence ranking baselines. We compare GIM to four baselines. Two baselines are widely used heuristics: PageRank [52, 65], which assumes influence flows via random walks on constructed social graphs (here, the follower network), and retweet influence [14], which counts the retweets of a user's authored tweets. They are centrality- and feature-based approaches, respectively. The other two baselines are purpose-built state-of-the-art influence estimators: Hawkes-modeled influence [56] and ProfileRank [60] (a PageRank variant on a content-user bipartite network).

GIM search. For each combination of conductance (topological, lexical, and following) and distribution mechanism (social capital, and none), we perform a grid search over the hyper-parameters b (conductance) and α (distribution). At each grid point we compare the influence scores obtained by GIM against the MTurk empirical influence ranking. Fig. 2c shows the baselines and GIM with several conductance-distribution combinations in the space of the performance measures: negative MAPE (x-axis) and NDCG-AUC (y-axis) (the top-right corner optimizes both measures). We make three key observations. First, GIM consistently Pareto-dominates (i.e., outperforms) all baselines for almost every hyper-parameter combination, showing that our psychosocial-inspired mechanisms render automatic influence quantification closer to the human judgment.
Among baselines, the best performing is the Hawkes-modeled influence, followed by PageRank, retweet influence, and the purpose-built ProfileRank. Second, we observe that the homophilic conductance typically outperforms the topological conductance and that the lexical lens outperforms the following conductance. Third, only two models are not Pareto-dominated: the topological-none (best NDCG-AUC) and the lexical-social capital (best negative MAPE). Notice, however, that the topological conductance requires recovering the follower network. In practice, this is prohibitive for large datasets (such as the #Covid-19 dataset) due to rate limitations of the Twitter API. Therefore, in our analysis in Section 4.4 we use the homophilic lexical conductance (b = 0.18) with the social capital distribution (α = 0.02). The 18% baseline conductance indicates that a relatively large proportion of conductance is not explained by the channels we account for. Note that while passing 2% of the social capital to the parent might not seem much, this adds up for nodes with high degrees (particularly given the long-tail distribution of follower count).

Debiasing the follower count. The follower count is widely used as a proxy for influence [14, 23, 55]; however, it has been repeatedly shown to be biased [7, 57, 62]. Fig. 3a (top) shows that the follower count residuals, the difference between the follower count percentile and the empirical score percentile, are positively correlated with the follower count percentile (R^2 = 0.48). In other words, the follower count overestimates the influence of the highly followed users and underestimates the lowly followed users. Fig. 3a (bottom) shows that GIM residuals are not correlated with the follower count percentile (R^2 = 0.073). That is, GIM is an unbiased influence estimator with respect to the follower count.

In this section, we first use GIM to compute the influence of all users in the #Covid-19 dataset. We tabulate their influence against their occupation and investigate the role of several occupations in pandemic-related discussions. Determine the occupations of Twitter users. We match user occupation against the Minor Group Occupational Classes of the O*NET occupational taxonomy [21] using textual fuzzy matching [68]. We search each user's Twitter description and select the first matched occupation (following [61]), assuming people list their actual occupation first, before hobbies and other information. The influence of occupations. Fig. 3b shows the distribution of influence scores for occupations with more than a thousand users in the #Covid-19 dataset. Executives, the Media, and the Military have among the highest online influence. This result is hardly surprising for the former two. The latter is an occupation with high bipartisan support and respect in the US, home of most English-speaking Twitter users. Notably, Media and Entertainers are not only influential but have the most prominent online presences, perhaps on account of their attention-related business models. At the opposite end of the spectrum (and somewhat surprisingly) are Architects, Engineers, and Computer Specialists with the lowest influence. Engineers, although prominent online, are not very influential, and Computer Specialists have less influence than the general public.
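A sketch of this kind of description-to-taxonomy matching (illustrative only: the title-to-group dictionary below is a tiny hypothetical subset of O*NET, and rapidfuzz is our choice of matcher here, not necessarily the one used in [68]):

```python
from rapidfuzz import process, fuzz

# Hypothetical examples of O*NET occupation titles mapped to their Minor Groups;
# the real taxonomy contains many more titles.
ONET_TITLES = {
    "Chief Executives": "Top Executives",
    "Reporters and Correspondents": "Media and Communication Workers",
    "Epidemiologists": "Life Scientists",
    "Registered Nurses": "Health Diagnosing and Treating Practitioners",
    "Software Developers": "Computer Specialists",
}

def match_occupation(description, threshold=80):
    """Scan the Twitter bio left to right and return the Minor Group of the first
    segment that fuzzily matches an O*NET title (people tend to list their job first)."""
    for segment in description.split(","):
        hit = process.extractOne(segment.strip(), list(ONET_TITLES.keys()),
                                 scorer=fuzz.token_set_ratio, score_cutoff=threshold)
        if hit is not None:
            return ONET_TITLES[hit[0]]
    return None

print(match_occupation("Epidemiologist, mum, coffee lover"))  # -> "Life Scientists"
```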
Pandemic-related occupations. Next, we analyze the influence of three occupations heavily involved in communicating during the COVID-19 pandemic: Healthcare (including diagnosing and treating practitioners), Media and Communication, and Life Scientists. In Fig. 3c, we compare pairs of these occupations tabulated against the follower count, influence's most important covariate. We split the users into five equally-sized bins according to their follower count percentiles, and we plot the influence distribution in each bin. We make three observations. First, the Media is more influential about the pandemic than the pandemic experts (Healthcare and Life Scientists), particularly in the highly-followed bins. While this is expected given the media's role to inform the public, it could be problematic if the media outlets are fringe and biased [20] or cover misinformation and conspiracy theories. Second, the distributions in each bin show a mode around low influence scores. This reinforces prior findings that followership does not translate into large influence [7, 57]. Finally, Healthcare is more influential than Life Scientists, despite the latter including epidemiologists, medical scientists, and microbiologists; users are relying more on the advice of medical doctors and frontline workers than on research experts.

Existing quantitative literature on influence measures is often limited to a narrow perspective of influence. Common approaches include centrality measures [6, 14, 22, 55] and influence maximization algorithms [1, 34, 38]; however, these often conflate influence with popularity and reach. Recent computational approaches [40, 41, 54] predict micro-level behavior but do not offer measures, and multivariate Hawkes approaches do not easily scale [50]. Furthermore, crowdsourcing is commonly used to recover psychosocial attributes [8, 15, 33]. Several recommender approaches utilize social concepts like homophily [27, 28, 63, 66] and diffusions [28, 48]. Arous et al. [4] use crowdsourcing to find influencers, aggregating open-ended answers, but do not provide an influence model capable of querying arbitrary users. To our knowledge, this is the first work to propose a Generalized Influence Model and tools to quantify it empirically.

Existing methods for online influence measurement rely on heuristic definitions and fail to model complex phenomena. In this work, we construct an online social influence pipeline. It includes an empirical measurement framework and GIM, accounting for socially informative channels (conductance) and attribution schemes (via distribution mechanisms). We estimate influence in discussions around the COVID-19 pandemic, finding that media, food processing, and even cooks are more influential than medical experts. Future work should include other influence signals beyond retweets and heterogeneous networks; it should incorporate assimilative and contrastive forms of influence.

This section shows the complete derivation from Eq. (3) to Eq. (5), both shown in the main text. We define the influence of a tweet v_j given the diffusion scenario G as:

φ(v_j | G) = Σ_{v_k ∈ V(G)} ψ(v_j, v_k | G) 1(G : v_j → v_k),     (8)

where V(G) is the set of nodes in G, and we denote as ψ(v_j, v_k | G) := Ψ(G : v_j → v_k) the social capital distribution along a path. We start from the definition of GIM given a retweet cascade (main text Eq. (3)):

φ(v_j) = Σ_{G ∈ Υ} P(G) φ(v_j | G).     (9)

Notably, due to the factorial number of diffusion scenarios in Υ, computing the influence for each graph is intractable. For example, there are 10^156 diffusion scenarios for a cascade of 100 retweets [56].

Incremental construction of diffusion scenarios.
We leverage the independent cascades assumption (see main text Section 2.1) to construct an efficient influence computation that overcomes the intractability. The key observation is that each tweet v_i is added simultaneously at time t_i to all diffusion scenarios constructed at time t_{i−1}; v_i contributes only once to the tweet influence of every tweet found on the path to which v_i is attached. The tweet influence is computed incrementally by updating φ(v_j), j < i, at each time t_i. We denote by φ_i(v_j) the value of the tweet influence of v_j after adding node v_i. As a result, we only track how the tweet influence increases over time steps, and we do not construct all valid diffusion scenarios. GIM assumes that a user's tweet is influenced by one of the precedent tweets, chosen stochastically from a discrete distribution over the valid edges. Alternatively, we can interpret that all previous tweets influence the new tweet proportionally to the same discrete distribution (a view in line with recent findings about influence and complex contagion).

Let Υ_{1:i−1} be the set of all possible diffusion scenarios at time t_{i−1}, and G^− ∈ Υ_{1:i−1} be one such diffusion scenario, with the set of nodes V^− = {v_1, v_2, ..., v_{i−1}}. When v_i arrives, it can attach to any node v_k ∈ V^−, generating i − 1 new diffusion scenarios G^+, with V^+ = V^− ∪ {v_i} and E^+ = E^− ∪ {(v_k, v_i)}. We can write the set of scenarios at time t_i as:

Υ_{1:i} = { G^+ = (V^− ∪ {v_i}, E^− ∪ {(v_k, v_i)}) | G^− ∈ Υ_{1:i−1}, v_k ∈ V^− }.     (10)

We write the tweet influence of v_j at time t_i as:

φ_i(v_j) = Σ_{G^+ ∈ Υ_{1:i}} P(G^+) φ(v_j | G^+).     (11)

Attach a new node v_i. We concentrate on the right-most factor in Eq. (11), the tweet influence in scenario G^+. We observe that the terms in Eq. (8) divide into two: the paths from v_j to all other nodes except v_i (i.e., the old nodes) and the path from v_j to v_i. We obtain:

φ(v_j | G^+) = Σ_{v_k ∈ V^−} ψ(v_j, v_k | G^+) 1(G^+ : v_j → v_k) + ψ(v_j, v_i | G^+) 1(G^+ : v_j → v_i).

Note that a path that does not involve v_i has the same influence capital contribution in G^+ and in its parent scenario G^−, i.e., ψ(v_j, v_k | G^+) = ψ(v_j, v_k | G^−), for k > j and k ≠ i. We obtain

φ(v_j | G^+) = φ(v_j | G^−) + ψ(v_j, v_i | G^+) 1(G^+ : v_j → v_i).     (12)

Combining Eq. (11) and (12), we obtain:

φ_i(v_j) = Σ_{G^+ ∈ Υ_{1:i}} P(G^+) φ(v_j | G^−) + Σ_{G^+ ∈ Υ_{1:i}} P(G^+) ψ(v_j, v_i | G^+) 1(G^+ : v_j → v_i).     (13)

Tweet influence at previous time step t_{i−1}. Given the definition of G^+ in Eq. (10) and the independent cascades assumption, we obtain that P(G^+) = P(G^−) p'_{ki}. Consequently, the first part in Eq. (13) can be written as:

Σ_{G^+ ∈ Υ_{1:i}} P(G^+) φ(v_j | G^−) = Σ_{G^− ∈ Υ_{1:i−1}} P(G^−) φ(v_j | G^−) Σ_{v_k ∈ V^−} p'_{ki} = φ_{i−1}(v_j),

which is the tweet influence of v_j at the previous time step t_{i−1}. Note that Σ_{v_k ∈ V^−} p'_{ki} = 1.

We define two matrices. First, the transfer matrix T = [p'_{ij} α_{ij}], where the element p'_{ij} is the probability that tweet v_j is a direct retweet of tweet v_i (defined in Eq. (4)) and α_{ij} is the proportion of capital transferred from v_j to v_i. Second, the influence accumulation matrix M = [m'_{ij}], where m'_{ij}, defined in Eq. (5), is the contribution of v_j to the influence of v_i. For each column j of M, we compute the first j − 1 elements by multiplying the sub-matrix M[1..j−1, 1..j−1] with the first j − 1 elements of the j-th column of the matrix T; the j-th element is 1, and the remaining elements are 0. The computation of matrix M finishes after n steps, where n is the total number of retweets in the cascade.
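A compact sketch of this matrix computation (our own implementation of the recursion as reconstructed above, assuming a uniform transfer proportion alpha and conductance-adjusted probabilities p' computed as in Eq. (4)):

```python
import numpy as np

def gim_influence(p_prime, alpha):
    """p_prime[i, j]: conductance-adjusted probability that tweet j is a direct offspring
    of tweet i (Eq. 4); alpha: proportion of capital transferred along an edge.
    Returns phi(v_i) = sum_j m'_ij, filled column by column as in Appendix B."""
    n = p_prime.shape[0]
    T = alpha * p_prime            # transfer matrix: capital moved along each candidate edge
    M = np.eye(n)                  # influence accumulation matrix, m'_ii = 1
    for j in range(1, n):          # add nodes one at a time (independent cascades assumption)
        M[:j, j] = M[:j, :j] @ T[:j, j]
    return M.sum(axis=1)

# The Hawkes-modeled influence is recovered as the special case c_ij = alpha = 1.
```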
In this section, we briefly describe the generic setup and design interventions utilized to reduce systematic noise. Here we illustrate several of the design features (summarised in Table 1), including the user component, proxies, and qualifications. We assess the design features via an ablation study, and fit the noise associated with particular feature sets (see Section 3.2.3), shown in Fig. 2a. Generic setup. We utilize the ubiquitous MTurk crowdsourcing platform to implement the quicksort active learning procedure. The implementation runs quicksort partitions concurrently, such that comparison pairs enter a FIFO queue and are served to MTurk workers in batches of 10. Each global quicksort procedure over the entire sample, called a run, is executed sequentially; no identical pairs are shown in the same batch since quicksort compares each pair at most once. The workers were presented with two target users and a proxy user (see below). Workers were asked, via a pool of differently worded questions, to determine which target user was more influential to the proxy. These questions are "Which user is the proxy user most likely to retweet?", "Who will the proxy user be more socially influenced by?", and "Which user would sway the proxy user's opinion more?". We performed three runs for each ablation and removed the banning of workers between ablations.

The user component was designed so workers could quickly glean the relevant information about users, with the users' names, pictures, descriptions, hyperlinks to their Twitter profiles, and a small sample of their tweets presented. Additionally, the component included various user metrics, including the follower count, followee count, and status count. Fig. 2a shows that the removal of these metrics significantly reduces the accuracy. When all metrics are removed, we observe the worst decision accuracy, 0.57. In particular, when only the followee or status metrics are present, we observe an increased accuracy of 0.61 and 0.68, respectively; when both are present (without follower count), we observe another accuracy increase to 0.70. Note that when all metrics are included, the accuracy increases to 0.72, which cumulatively suggests that all metrics are independently important signals of influence; however, there is some mutual information between the metrics.

Proxy users are used to reduce the effect of a worker's opinion on their influence judgment. In judging between two targets, a worker is asked who the proxy finds more influential. Proxy users do not eliminate the workers' subjectivity; however, they might mitigate some noise introduced in this area. Removing the feature noticeably reduces the decision accuracy to 0.69.

The qualification system restricts designated workers from completing tasks. A common phenomenon on MTurk is that workers accept many tasks of the same type concurrently (and effectively reserve those tasks for future completion), allowing workers to complete large proportions of the work. The responses of some workers might be consistently inaccurate for several reasons: they are answering randomly for monetary gain, they do not understand the task or social context, or they respond in bad faith for some other reason. We broadly label such workers as low-quality workers. Low-quality workers who complete large proportions of the work reduce the overall quality; accordingly, we ban them dynamically. We measure the quality of a worker, within a run, as the accuracy of the worker's responses with respect to the follower count percentiles of the targets. As the task is inherently subjective and difficult, we add some leniency to this banning scheme. Firstly, only comparisons where the difference of follower count percentiles is greater than 0.2 are included in determining accuracy. The intuition is that targets with quasi-equivalent influence are difficult to judge. Secondly, banning is only implemented after having completed 100 comparisons that satisfy the prior condition. Lastly, the banning decision is only made once per run (but a banned worker is banned forever). This banning scheme is lenient enough that work is completed quickly, while restrictive enough that response quality remains high.
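A sketch of this leniency rule (our own paraphrase of the scheme just described; the 0.5 accuracy cutoff is an assumed threshold, as the exact value is not stated):

```python
def should_ban(decisions, min_gap=0.2, min_decisions=100, min_accuracy=0.5):
    """decisions: list of (chosen_percentile, other_percentile) follower-count percentiles
    for one worker's answers within a run. Only clear-cut pairs count towards banning."""
    clear_cut = [(c, o) for c, o in decisions if abs(c - o) > min_gap]
    if len(clear_cut) < min_decisions:
        return False                       # not enough evidence yet
    accuracy = sum(c > o for c, o in clear_cut) / len(clear_cut)
    return accuracy < min_accuracy         # assumed cutoff; banning applies once per run
```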
Removing the intra-banning mechanism performs slightly better than the baseline, with an accuracy of 0.73, which might be explained by 'low-quality workers' being discouraged from the task by previous interactions with the experiment. For this reason, we use the full feature set for the final experiment.

We selected from the #ArsonEmergency dataset a sample of 500 targets and 500 proxies, controlled for availability, language, and Hawkes-modeled influence. The users' availability (suspended or protected status at the time of the experiment) was queried through the Twitter API before the experiment. To determine if a user was English-speaking (so the predominantly English-speaking MTurk workers could appropriately judge them), triple agreement was used between three language detection systems: langid [42], cld3 [25], and whatthelang. We verified that the (55.8% remaining) filtered users were uniform with respect to Hawkes-modeled influence, via a chi-square test at a 95% significance level. From this set of valid users, we used an inverse CDF sampling method, with nearest matching, to sample users with respect to Hawkes-modeled influence.

References
Effective influence estimation in twitter using temporal, profile, structural and interaction characteristics
Reconciling real scores with binary comparisons: A new logistic based model for ranking
Appendix: Conductance and Social Capital: Modeling and Empirically Measuring Online Social Influence
Opencrowd: A human-ai collaborative approach for finding social influencers via open-ended answers aggregation
Effects of group pressure upon the modification and distortion of judgments
A trust model for analysis of trust, influence and their relationship in social network communities
Everyone's an influencer: quantifying influence on twitter
Evaluating online labor markets for experimental research: Amazon.com's Mechanical Turk
A theory of fads, fashion, custom, and cultural change as informational cascades
Rank analysis of incomplete block designs: I. The method of paired comparisons
Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality data?
Models for paired comparison data: A review with emphasis on dependent data
Dynamic Bradley-Terry modelling of sports tournaments
Measuring user influence in twitter: The million follower fallacy
Ten social dimensions of conversations and relationships
Influence: Science and practice
Feedback effects between similarity and social influence in online communities
Scalable influence estimation in continuous-time diffusion networks
Scalable influence estimation in continuous-time diffusion networks
On the nature of real and perceived bias in the mainstream media
Influence-based Twitter browsing with NavigTweet
Robustness of centrality measures under uncertainty: Examining the role of network topology
The social dynamics of language change in online networks
Bushfires, bots and arson claims: Australia flung in the global disinformation spotlight
Rare: Social rank regulated large-scale network embedding
With a little help from my friends (and their friends): Influence neighborhoods for social recommendations
Spectra of some self-exciting and mutually exciting point processes
A cluster process representation of a self-exciting process
The effects of expertise and physical attractiveness upon opinion agreement and liking
Systems perspective of Amazon Mechanical Turk for organizational research: Review and recommendations
Maximizing the spread of influence through a social network
Patrick Mair: Modern Psychometrics with R
Using the method of paired comparisons in non-designed experiments
A nonparametric EM algorithm for multiscale Hawkes processes
Influence maximization on social graphs: A survey
Intergroup social influence on emotion processing in the brain
Analyzing and inferring human real-life behavior through online social networks with social influence deep learning
On the social influence in human behavior: Physical, homophily, and social communities
langid.py: An off-the-shelf language identification tool
Situating social influence processes: Dynamic, multidirectional flows of influence within social networks
Just sort it! A simple and effective approach to active preference learning
Obedience to authority
Feature driven and point process approaches for popularity prediction
Fast learning in multi-resolution hierarchies
GhostLink: Latent Network Inference for Influence-aware Recommendation
An approach to the study of communicative acts
Modeling Sparse Information Diffusion at Scale via Lazy Multivariate Hawkes Processes
The dynamics of societal transition: Modeling nonlinear change in the Polish economic system
The PageRank citation ranking: Bringing order to the web
Influence analysis in social networks: A survey
Deepinf: Social influence prediction with deep learning
Who are the most influential emergency physicians on Twitter?
#DebateNight: The role and influence of socialbots on twitter during the 1st 2016 US presidential debate
Influence and passivity in social media
Active learning literature survey
Estimation from pairwise comparisons: Sharp minimax bounds with topology dependence
ProfileRank: finding relevant content and influential users based on information diffusion
Who tweets? Deriving the demographic characteristics of age, occupation and social class from Twitter user meta-data
Influence estimation on social media networks using causal inference
Verse: Versatile graph embeddings from similarity measures
How to analyze paired comparison data
Pagerank with priors: An influence propagation perspective
Spectrum-enhanced pairwise learning to rank
Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung
Exploring Occupation Differences in Reactions to COVID-19 Pandemic on Twitter