key: cord-0043128-yxh25jmd authors: Costa, Gianni; Ortale, Riccardo title: Collaborative Recommendation of Temporally-Discounted Tag-Based Expertise for Community Question Answering date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47426-3_4 sha: 0f208bebd18b4f14be62fd1f072b76804dfd45ec doc_id: 43128 cord_uid: yxh25jmd We propose an innovative approach to finding experts for community question answering (CQA). The idea is to recommend answerers who are credited the highest expertise under question tags at routing time. The expertise of answerers under already-replied question tags is intuitively discounted by accounting for the observed tags, votes and temporal information of their answers. Instead, the discounted expertise under not-yet-replied tags is predicted via a latent-factor representation of both answerers and tags. These representations are inferred by means of Gibbs sampling under a new Bayesian probabilistic model of discounted user expertise and asking-answering behavior. The devised model unprecedentedly explains the latter two CQA aspects as the result of a generative process that seamlessly integrates probabilistic matrix factorization and network behavior characterization. An extensive comparative experimentation over real-world CQA data demonstrates that our approach outperforms several state-of-the-art competitors in recommendation effectiveness. Expert recommendation [17] enables the timely sharing of high-quality knowledge for community question answering (CQA) [16]. Unfortunately, despite several previous research efforts, expert recommendation still remains problematic for various reasons. Firstly, question-answering (QA) communities are inherently time-evolving [21], with new users (both askers and answerers) joining daily and the existing users changing their interests and behavior (e.g., long-term inactive users who turn into active participants).
Hitherto, expert recommendation has been studied mostly by ignoring the temporal information of posts. Accordingly, the devised approaches are suitable neither to deal with the natural drift of users' interests over time, nor to promote short-term answerers (i.e., users with a limited recent answering history). This is a severe limitation that lowers the effectiveness of expert recommendation, since the recommended experts may no longer reply to questions matching initially-relevant yet already-outdated interests. In addition, short-term answerers would likely not be recommended at all, since their expertise is gained over too short a time period for such users to build a solid reputation as actual experts. Secondly, there is no common agreement on the choice of the discriminative content features to capture answerers' expertise. In most studies, the latter is inferred from the raw text of their answers, suitably weighted by the respective votes from the QA community. Tags are mainly ignored or, alternatively, incorporated into post contents as in [20]. However, tags are more insightful, concise and explicit user-generated explanations of both post meaning and topical expertise, compared with the general topics inferrable from the textual post contents [19]. Thirdly, supplementing content features with further auxiliary data (e.g., networks of user interactions) for more effective expert recommendation involves devising a plausible joint processing of such information. Hitherto, both sources of information have been combined mainly through simplistic schemes such as, e.g., linear interpolation [17]. In this paper, we propose a new collaborative approach to recommending question-specific experts in QA communities. The expertise of answerers is determined from the tags, votes and temporal information of their answers, as well as the asking-answering relationships in the targeted QA community.
More precisely, answer tags are employed to capture and represent the topical expertise of answerers. Votes indicate the degree to which answerers are publicly acknowledged within the QA community as experts under the tags of the respective answers. Posting time allows for discounting [8] the earlier answers of responders, so as to account for the natural drift of their interests over time, without penalizing short-term answerers. Besides, asking-answering interactions inform the identification of experts, since repliers to expert askers are likely to be experts as well. Essentially, for each posted question, the intuition behind the presented approach consists in recommending answerers who are credited the highest degree of expertise under the tags of the particular question at routing time. In particular, the expertise of answerers under already-replied tags is intuitively determined by means of the votes and temporal information of their answers to the questions labelled with such tags. Instead, the unknown expertise of answerers under not-yet-replied tags is predicted through a latent-factor generative model of temporally-discounted user expertise and asking-answering behavior. Under such a model, Bayesian probabilistic matrix factorization [15] and the statistical formalization of asking-answering are seamlessly integrated. This allows for explaining both the expertise of users and their behavioral patterns as the result of a generative process that is governed by a certain number of latent factors. These are estimated via an MCMC algorithm that implements the derived mathematical details of Gibbs sampling inference under the devised model. Extensive tests over real-world CQA data show that our approach outperforms several state-of-the-art competitors in recommendation effectiveness. This paper proceeds as follows. Notation and preliminaries are introduced in Sect. 2. The devised model is developed in Sect. 3.
Expertise prediction for recommendation and posterior inference are covered in Sect. 4. The experimental evaluation of our approach is presented in Sect. 5. Finally, conclusions are drawn in Sect. 6, where future research is also previewed. The generic user u ∈ U can ask questions and/or provide answers. In order to capture the expertise of u, we focus on her answering history a_u = {a_{u,1}, ..., a_{u,N_u}}. a_u is the time sequence of N_u replies from u to as many questions posted by other users of D. The arbitrary answer a_{u,h} ∈ a_u (with h = 1, ..., N_u) is associated with a respective timestamp ts_{u,h}, an explicative set of tags t_{u,h} ⊆ T and a vote score s_{u,h}. ts_{u,h} indicates when a_{u,h} was posted. For any two answers a_{u,h_i}, a_{u,h_j} ∈ a_u, h_i < h_j iff ts_{u,h_i} < ts_{u,h_j}. Timestamps are useful for reasonably dealing with the drift of the interests and skills of u, across the respective answering history a_u, by means of gradual forgetting [8]. The latter consists in estimating the expertise of u from the whole answering history a_u, so that the earlier answers are realistically considered to be outdated and, thus, less informative of her current interests and skills. The tags in t_{u,h} are an insightful description of the actual themes covered by a_{u,h}. In principle, t_{u,h} is a more accurate representation of both the intended meaning of a_{u,h} and the topical expertise of u, in comparison with the more general topics inferrable from the textual content of a_{u,h} [19]. For this reason, the wording of a_{u,h} is disregarded and, consequently, the computational burden of processing very large amounts of raw text is avoided. s_{u,h} indicates the acknowledged degree of expertise gained by u with regard to the question answered through a_{u,h} and, by extension, under each tag within t_{u,h}. At the current timestamp now, the expertise of all users in U under the tags of T is summarized by the matrix E^(now).
Its generic entry E^(now)_{ut} quantifies the expertise of user u under tag t at time now as the weighted average

E^(now)_{ut} = ( Σ_{h : t ∈ t_{u,h}} w^(now)_{u,h} · s_{u,h} ) / ( Σ_{h : t ∈ t_{u,h}} w^(now)_{u,h} )    (1)

If u posted no answer tagged with t, E^(now)_{ut} is assumed to be 0, which corresponds to an unknown or missing value. Besides, w^(now)_{u,h} = exp(−λ · (now − ts_{u,h})) is a weighting scheme that implements gradual forgetting by exponential ageing. Intuitively, the earlier answers are not ignored in the estimation of the current expertise of u under t. Rather, their contribution to E^(now)_{ut} exponentially decays according to the respective timestamps. Remarkably, such a modeling choice does not penalize the expertise of those users with a short replying history (such as new users or mostly inactive users with a recent answering history), without discarding the old answers of long-term answerers. Notice that w^(now)_{u,h} is parameterized by the decay rate λ. The latter determines how rapidly the contribution of answers to user expertise decays over time. Essentially, larger values of λ imply a faster decay of earlier answers. As a supplement to the information from the answering history of users, their asking-answering interactions are also captured as edges of G. More precisely, an edge u_i → u_j from a responder u_i to an asker u_j belongs to A if u_i answered at least one question posted by u_j. By an abuse of notation, we also write G to denote the adjacency matrix associated with the asking-answering graph. The generic entry G_{ij} is 1 iff u_i → u_j ∈ A and 0 otherwise. Given a question q, let t_q be the set of tags attached to q by the asker. Also, assume that now is the time when q is routed to the answerers. We aim to recommend q to targeted users with the highest acknowledged expertise in the tags of t_q at time now, who are most likely to reply with high-quality answers. Unfortunately, in the context of the generic QA community D, E^(now) and G are generally very sparse. Consequently, the expertise of users under specific tags within t_q may not be known.
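The temporally-discounted expertise matrix can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: it assumes the exponential-ageing weight w = exp(−λ(now − ts)) and the weighted-average form described above, and the function and variable names are illustrative.

```python
import math
from collections import defaultdict

def discounted_expertise(answers, now, lam=0.2):
    """answers: list of (user, timestamp, tags, score) tuples.

    Returns a sparse map (user, tag) -> discounted expertise; absent
    pairs correspond to the unknown (0) entries of E^(now).
    """
    num = defaultdict(float)   # weighted sum of vote scores per (user, tag)
    den = defaultdict(float)   # sum of weights per (user, tag)
    for user, ts, tags, score in answers:
        w = math.exp(-lam * (now - ts))  # exponential ageing of the answer
        for t in tags:
            num[(user, t)] += w * score
            den[(user, t)] += w
    return {k: num[k] / den[k] for k in num}

# Toy usage: an old high-voted answer counts less than a recent low-voted one.
E = discounted_expertise(
    [("u1", 0.0, {"python"}, 10.0), ("u1", 9.0, {"python"}, 2.0)],
    now=10.0, lam=0.2)
```

Because the weights only rescale within each (user, tag) pair, a user whose entire history is recent is not penalized relative to long-term answerers, matching the short-term-answerer argument above.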
In this paper, we exploit latent-factor modeling to predict the unknown values of E^(now), which correspond to the current expertise of answerers under the various adopted tags. Thus, experts can be simply recommended from a list of answerers, ranked by their average expertise under the tags attached to q. Hereinafter, to avoid cluttering notation, we will write E to mean E^(now). ENGAGE (timE-evolviNG tAG-based Expertise) is a Bayesian generative latent-factor model of temporally-discounted expertise and asking-answering behavior in QA communities. Under ENGAGE, the matrices E and G of a QA community D are the result of a probabilistic generative process that is ruled by K latent factors. These are captured by embedding users and tags in a K-dimensional latent space, through the seamless integration of Bayesian probabilistic matrix factorization [15] and the statistical modeling of the asking-answering behavior. Formally, each user u ∈ U is associated with a column vector L_u ∈ R^K. The k-th entry of L_u (with k = 1, ..., K) is a random variable representing the unknown degree to which the k-th latent factor explains the expertise of u. Analogously, each tag t ∈ T is associated with a column vector H_t ∈ R^K. The k-th entry of H_t (with k = 1, ..., K) is a random variable representing the unknown extent to which the k-th latent factor is inherently characteristic of t. The latent-factor representations of all users and tags are collectively denoted as L ∈ R^{K×N} and H ∈ R^{K×M}, respectively. The data likelihood (i.e., the conditional distribution over the entries of E and G) is

Pr(E | L, H, α) = Π_{u ∈ U} Π_{t ∈ T} [ N(E_{ut} | L_u^T H_t, α^{−1}) ]^{δ_{ut}}    (2)

Pr(G | L, β) = Π_{u_i ∈ U} Π_{u_j ∈ U} [ N(G_{ij} | L_{u_i}^T L_{u_j}, β^{−1}) ]^{G_{ij}}    (3)

In the above Eq. 2 and Eq. 3, N(·|μ, α^{−1}) is the Gaussian distribution having mean μ and precision α. In particular, according to Eq. 2, the current expertise of answerers under the adopted tags is centered around the intuitive explanation provided by the dot product of the respective latent-factor representations.
δ_{ut} is 1 iff E_{ut} > 0 (i.e., if the expertise of u under t is actually acknowledged) and 0 otherwise. Equation 3 seamlessly incorporates the supplementary information regarding the asking-answering interactions of users. Specifically, according to Eq. 3, the asking-answering interactions are centered around the degree of agreement between the involved users. This provides a valuable contribution to the identification of experts, since those users who answer questions from other users with a high expertise are also likely to have gained a high expertise. The latent-factor representations of users and tags stem from multivariate Gaussian prior distributions parameterized, respectively, by Θ_L = {μ_L, Λ_L} and Θ_H = {μ_H, Λ_H}. In turn, such parameters are drawn from the below Gaussian-Wishart prior distributions (hereinafter indicated as NW) [2]

Pr(Θ_X | Θ_0) = NW(μ_X, Λ_X | Θ_0) = N(μ_X | μ_0, (β_0 Λ_X)^{−1}) · W(Λ_X; ν_0, W_0)

where X ∈ {L, H}, W(Λ_X; ν_0, W_0) denotes the Wishart distribution [2] and Θ_0 = {μ_0, β_0, ν_0, W_0} is a set of hyperparameters. The conditional (in)dependencies among the random variables under ENGAGE are shown by means of plate notation in Fig. 1a. Notice that unshaded nodes correspond to latent factors, whereas shaded nodes correspond to observed magnitudes. The generative process modeled by ENGAGE performs the realization of the observed random variables (i.e., the individual entries of E and G) according to the conditional (in)dependencies of Fig. 1a, as detailed in Fig. 1b. Under ENGAGE, the experts for a given question q are found by ranking users based on a recommendation score that involves the latent-factor representations of users and tags. The recommendation score is introduced in Sect. 4.1. The inference of the latent-factor representations is discussed in Sect. 4.2. The rank of an answerer u ∈ U in the list of experts for q is determined by the score P_{uq} of her acknowledged/predicted expertise. P_{uq} is computed by averaging the current expertise of u under the individual tags of t_q.
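A minimal NumPy sketch of this scoring rule follows (names and data layout are assumptions, not the paper's code): acknowledged entries of E are used directly, while missing entries are predicted by averaging the dot products of posterior samples of the latent vectors of u and t.

```python
import numpy as np

def score(u, q_tags, E, L_samples, H_samples, tag_index):
    """Average acknowledged/predicted expertise of user u over the tags of q."""
    total = 0.0
    for t in q_tags:
        j = tag_index[t]
        if E[u, j] > 0:                           # acknowledged expertise
            total += E[u, j]
        else:                                     # predicted from posterior samples
            total += float(np.mean([L[:, u] @ H[:, j]
                                    for L, H in zip(L_samples, H_samples)]))
    return total / len(q_tags)

# Toy usage: one user, two tags; expertise under "b" is missing and predicted.
E = np.array([[3.0, 0.0]])
L_samples = [np.ones((2, 1))]                     # a single posterior sample
H_samples = [np.array([[1.0, 2.0], [1.0, 2.0]])]
P = score(0, ["a", "b"], E, L_samples, H_samples, {"a": 0, "b": 1})
```

In the toy run, the acknowledged value 3.0 and the predicted dot product 4.0 are averaged into a score of 3.5.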
This requires distinguishing between two alternative cases. Let t be an adopted tag of t_q. If the expertise of u under t is acknowledged, E_{ut} can be directly used in the definition of P_{uq}. Otherwise, if the expertise of u under t is an unknown entry of E, then E_{ut} is suitably predicted by resorting to the latent-factor representations of u and t under ENGAGE. Accordingly, P_{uq} = (1 / |t_q|) Σ_{t ∈ t_q} Ê_{ut}, where Ê_{ut} is defined in the below Eq. 4, so as to incorporate the current expertise of u under t according to the two above cases:

Ê_{ut} = E_{ut} if δ_{ut} = 1, and Ê_{ut} = (1/S) Σ_{s=1}^{S} (L_u^{(s)})^T H_t^{(s)} otherwise    (4)

In Eq. 4, S is the number of samples of both L_u and H_t, which are respectively referred to as L_u^{(s)} and H_t^{(s)}. Ideally, such samples would be drawn from the posterior distribution Pr(Θ | E, G, α, β, Θ_0). However, the latter is analytically intractable. Therefore, the generic L_u^{(s)} and H_t^{(s)} are drawn through approximate posterior inference, as described in Sect. 4.2. A well-known technique for approximate stochastic inference [14] is Gibbs sampling. The latter defines a (first-order) Markov chain, whose stationary distribution eventually approaches the true posterior distribution Pr(Θ | E, G, α, β, Θ_0). This is accomplished by means of reiterated transitions from the current sample of the model parameters Θ to a new one. More precisely, at the generic transition, each parameter θ ∈ Θ is sequentially sampled from the respective full conditional Pr(θ | Θ − θ, E, G, α, β, Θ_0). This is the conditional distribution over θ, given all other parameters Θ − θ, the (hyper)parameters Θ_0 as well as the observations E and G. The derived full conditional distributions over the individual parameters of ENGAGE are reported next, along with the algorithm designed to perform Gibbs sampling inference. Parameters L_u and H_t.
Due to the conjugacy between the multivariate Gaussian distribution on L_u (with unknown parameters Θ_L) and the Gaussian-Wishart prior on Θ_L, the full conditional on L_u is a multivariate Gaussian distribution, i.e., N(L_u | μ*(u), (Λ*(u))^{−1}), whose posterior precision Λ*(u) and mean μ*(u) combine the prior parameters Θ_L with the observed entries of E and G involving u. Likewise, because of the conjugacy between the multivariate Gaussian distribution on H_t (with unknown parameters Θ_H) and the Gaussian-Wishart prior on Θ_H, the full conditional on H_t is a multivariate Gaussian distribution, i.e., N(H_t | μ*(t), (Λ*(t))^{−1}). In turn, the full conditionals on Θ_L and Θ_H are Gaussian-Wishart distributions with suitably updated parameters, where c is the number of columns within the respective matrix X ∈ {L, H}. Gibbs Sampling. Algorithm 1 sketches the pseudo code of the sampler designed to implement approximate posterior inference under ENGAGE. After a preliminary initialization (line 1), the sampler enters a loop (lines 2-11), whose generic iteration h embraces two steps: the hyperparameters Θ_L and Θ_H are drawn at the first step, whereas the latent-factor representations within Θ are drawn at the second step (lines 5-10). The maximum number H of iterations is established by following the widely-adopted convergence criterion in [12]. This allows the Markov chain behind the Gibbs sampler to reach its equilibrium after an initial burn-in period. As a consequence, the S samples used in Eq. 4 can be drawn when convergence is met (i.e., after the burn-in period, in which samples are instead still sensitive to the preliminary initialization). We comparatively investigated the recommendation effectiveness of ENGAGE. All experiments were conducted on Stack Overflow [1, 16], i.e., a real-world QA community for sharing knowledge on computer programming. More precisely, we formed our training and test sets from an anonymized and quarterly dump of all Stack Overflow data, produced by its users within a time interval ranging from Jan 1, 2015 to July 31, 2015. Such a dump is publicly released by the Stack Exchange network under the Creative Commons BY-SA 4.0 licence. More precisely, as far as the training set is concerned, we retained all those tags that were adopted at least 50 times in the time interval from Jan 1, 2015 to June 30, 2015.
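The Gaussian full conditional on L_u described above admits a compact NumPy sketch, following the standard Bayesian-PMF conjugate form. This is a hedged illustration, not the paper's derivation: for brevity it conditions only on the observed expertise entries of E (under ENGAGE, the asking-answering entries of G would contribute analogous terms to the posterior precision and mean), and all names are assumptions.

```python
import numpy as np

def sample_L_u(u, E, delta, H, mu_L, Lam_L, alpha, rng):
    """One draw from the Gaussian full conditional on L_u (E-part only)."""
    obs = np.flatnonzero(delta[u])            # tags with acknowledged expertise
    Hs = H[:, obs]                            # K x |obs| observed tag vectors
    Lam_star = Lam_L + alpha * Hs @ Hs.T      # posterior precision Lambda*(u)
    b = Lam_L @ mu_L + alpha * Hs @ E[u, obs]
    mu_star = np.linalg.solve(Lam_star, b)    # posterior mean mu*(u)
    return rng.multivariate_normal(mu_star, np.linalg.inv(Lam_star))

# Toy usage with K = 2 latent factors, 3 tags and a single user.
rng = np.random.default_rng(1)
H = rng.standard_normal((2, 3))
E = np.array([[1.5, 0.0, 2.0]])
delta = (E > 0).astype(int)
L_u = sample_L_u(0, E, delta, H, np.zeros(2), np.eye(2), alpha=2.0, rng=rng)
```

A symmetric update applies to H_t, with the roles of users and tags exchanged; sweeping such draws over all columns of L and H constitutes one Gibbs iteration.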
Further, we considered all those users who provided more than 80 posts [20] in the same period. The selected tags and users, along with their answers, the respective questions, timestamps and votes, were included into the training set. Overall, the latter consists of 3,376 users, 40,382 questions and 60,968 answers. Regarding the test set, we focused on a collection Q of questions (with |Q| = 1,357) that were posted by the users in the training set in a later time interval from July 1, 2015 to July 31, 2015. These questions are labelled with tags and answered by answerers in the training set. We chose such users, their answers to the questions of Q, the respective timestamps and votes as the test set. As a whole, the latter consists of 1,357 questions, 3,376 users and 3,771 answers. We contrasted ENGAGE against a selection of various competitors. Votes [17] ranks answerers based on the mean of the difference between the positive and negative votes of their answers, as well as the average percentage of the positive votes. InDegree [3] ranks answerers by their respective numbers of best answers. The state-of-the-art model in [19], hereinafter called TER (Tag-based Expert Recommendation), infers user expertise from the factorization of the user-tag matrix. The latter is built so that the generic entry reflects the expertise of an answerer under a tag, as captured by averaging the votes of her answers marked by that tag. Unlike ENGAGE, TER ignores both the drift of users' interests over time and their asking-answering behavior. TEM [20] is a state-of-the-art joint model of topics and expertise. Essentially, under TEM, tags are incorporated into the textual content of posts, in order to infer the topical interests of users. The specific expertise of users under the different topics is explicitly captured.
CQARank [20] combines the user topical interests and expertise under TEM with the link analysis of the asking-answering interaction graph, in order to enhance the inference of user topical expertise. Both TEM and CQARank disregard the drift of users' interests over time. We comparatively assessed the recommendation performance of ENGAGE through several evaluation metrics. Let q ∈ Q be a generic question of the test set. Assume that R(q) and R̂(q) are, respectively, the ground-truth and the recommended list of experts for q. Essentially, R(q) is the list of users who actually answered q, ranked by the known scores of their answers. Instead, the users in R̂(q) are ranked by the recommendation score of Sect. 4.1. r_best^(q) indicates the rank of the best answerer. The adopted evaluation metrics are enumerated next.
- Precision at top R(q) (Precision(q)@R(q)) [9] is the correctness of R̂(q), i.e., the fraction of the top-R(q) recommended experts who are ground-truth answerers: Precision(q)@R(q) = |R̂(q) ∩ R(q)| / |R̂(q)|.
- Recall at top R(q) (Recall(q)@R(q)) [9] is the coverage of R̂(q), i.e., the fraction of ground-truth answerers among the top-R(q) recommended experts: Recall(q)@R(q) = |R̂(q) ∩ R(q)| / |R(q)|.
- nDCG [10] (normalized Discounted Cumulative Gain) measures the goodness of the ranking of the recommended experts, based on their position in R̂(q). This is accomplished by accumulating expert relevance to question q along R̂(q), so that the relevance of lower-ranked experts is suitably discounted. Formally, nDCG(q) = DCG(q) / IDCG(q), where DCG(q) = Σ_{i=1}^{|R̂(q)|} (2^{s_i^(q)} − 1) / log_2(i + 1). In the above equation, s_i^(q) represents the relevance of R̂_i^(q) to q (according to thumbs-up/down). IDCG(q) is the DCG(q) value of the ideal ranking.
- Accuracy (Acc(q)) [22] measures the quality of the rank r_best^(q) of the best answerer within R̂(q).
Larger values of the above measures denote a higher recommendation effectiveness. Table 1 summarizes the average values of such measures over the whole set Q of questions for all competitors.
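The listed metrics can be implemented in a few lines of Python (variable names are assumptions): rec is the recommended list R̂(q), truth the set of ground-truth answerers, and rel maps a recommended user to her vote-based relevance for nDCG.

```python
import math

def precision_at(rec, truth):
    """Fraction of recommended experts who are ground-truth answerers."""
    return sum(1 for u in rec if u in truth) / len(rec)

def recall_at(rec, truth):
    """Fraction of ground-truth answerers covered by the recommendation."""
    return sum(1 for u in rec if u in truth) / len(truth)

def ndcg(rec, rel):
    """Gain-based ranking quality with logarithmic position discount."""
    def dcg(scores):
        # enumerate is 0-based, so position i maps to log2(i + 2) = log2(rank + 1)
        return sum((2 ** s - 1) / math.log2(i + 2) for i, s in enumerate(scores))
    ideal = dcg(sorted((rel.get(u, 0) for u in rec), reverse=True))
    actual = dcg([rel.get(u, 0) for u in rec])
    return actual / ideal if ideal > 0 else 0.0
```

For instance, recommending ["a", "b"] when only "a" answered yields precision 0.5 and recall 1.0, and nDCG drops below 1 exactly when a relevant answerer is ranked below an irrelevant one.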
The reported results were found by adopting the following empirical settings. In all tests, the time decay factor λ was fixed to 0.2. The number K of latent factors was set to 15. The number S of samples used in Eq. 4 was set to 200. The overall number H of iterations for Algorithm 1 was fixed to 1,000, in compliance with the convergence criterion in [12]. Additionally, for each q ∈ Q, the number R(q) of recommended answerers for q was set to 10. By looking at Table 1, it is evident that ENGAGE outperforms all tested competitors. In particular, the lower effectiveness of Votes and InDegree is due to the fact that both focus only on the importance of users, without accounting for their specific discounted expertise. TER is a state-of-the-art competitor that captures the tag-based expertise of answerers. Nonetheless, TER is still less effective than ENGAGE for two main reasons. Firstly, TER does not account for the drift of users' interests over time. Secondly, TER does not exploit any auxiliary information from the communication network that is shaped by the asking-answering behavior. The latter is instead conveniently used, under ENGAGE, in order to more accurately inform the latent-factor representation of users and tags. TEM and CQARank are two state-of-the-art competitors that use tags to capture topical expertise. However, their effectiveness is penalized with respect to ENGAGE, since tags are mixed up with the textual content of posts, rather than being used as user-generated explanations of their topical expertise. Moreover, neither TEM nor CQARank discounts the expertise of users, in order to account for the drift of their interests over time. We proposed a new latent-factor approach to expert recommendation in QA communities. The idea is to infer the time-evolving expertise of users from the tags of the answered questions, the votes and posting time of the respective answers, as well as the asking-answering behavior of the CQA users.
A thorough experimentation on real-world CQA data showed the superior recommendation effectiveness of our approach with respect to several state-of-the-art competitors. As future work, it would be interesting to explore the impact of alternative implementations of gradual forgetting on recommendation effectiveness. In this regard, temporal hyperbolic discounting [21] is a viable choice. Finally, three further lines of innovative research involve studying the incorporation of, respectively, user roles [5, 7, 13, 18], exposure [11] to posted questions, as well as the recent generative models of text corpora (such as, e.g., [4]) for more effective expert recommendation.

References
Discovering value from community activity on focused question answering sites: a case study of Stack Overflow
Pattern Recognition and Machine Learning
Identifying authoritative actors in question-answering forums: the case of Yahoo! Answers
Document clustering meets topic modeling with word embeddings
Mining overlapping communities and inner role assignments through Bayesian mixed-membership models of networks with context-dependent interactions
Optimal Statistical Decisions
Tracking user-role evolution via topic modeling in community question answering
Evaluating collaborative filtering recommender systems
Cumulated gain-based evaluation of IR techniques
Modeling user exposure in recommendation
Monte Carlo Strategies in Scientific Computing
A tri-role topic model for domain-specific question answering
Monte Carlo Statistical Methods
Bayesian probabilistic matrix factorization using Markov chain Monte Carlo
A comprehensive survey and classification of approaches for community question answering
A survey on expert recommendation in community question answering
Dual role model for question recommendation in community question answering
Tag-based expert recommendation in community question answering
CQArank: jointly model topics and expertise in community question answering
Moving from static to dynamic modeling of expertise for question routing in CQA sites
Expert finding for community-based question answering via ranking metric network learning