title: Inductive Document Network Embedding with Topic-Word Attention
authors: Brochier, Robin; Guille, Adrien; Velcin, Julien
date: 2020-03-17
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45439-5_22

Abstract. Document network embedding aims at learning representations for a structured text corpus, i.e. when documents are linked to each other. Recent algorithms extend network embedding approaches by incorporating the text content associated with the nodes in their formulations. In most cases, it is hard to interpret the learned representations. Moreover, little importance is given to the generalization to new documents that are not observed within the network. In this paper, we propose an interpretable and inductive document network embedding method. We introduce a novel mechanism, the Topic-Word Attention (TWA), that generates document representations based on the interplay between word and topic representations. We train these word and topic vectors through our general model, Inductive Document Network Embedding (IDNE), by leveraging the connections in the document network. Quantitative evaluations show that our approach achieves state-of-the-art performance on various networks, and we qualitatively show that our model produces meaningful and interpretable representations of the words, topics and documents.

Document networks, e.g. social media, question-and-answer websites and the scientific literature, are ubiquitous. Because these networks keep growing larger, navigating them efficiently becomes increasingly difficult. Modern information retrieval systems rely on machine learning algorithms to support users. The performance of these systems heavily depends on the quality of the document representations. Learning good features for documents is still challenging, in particular when they are structured in a network.

Recent methods learn the representations in an unsupervised manner by combining structural and textual information. Text-Associated DeepWalk (TADW) [28] incorporates text features into the low-rank factorization of a matrix describing the network. Graph2Gauss [2] learns a deep encoder, guided by the network, that maps the nodes' attributes to embeddings. GVNR-t [3] factorizes a random-walk-based matrix of node co-occurrences and integrates word vectors of the documents in its formulation. CANE [25] introduces a mutual attention mechanism that builds representations of a document contextually to each of its direct neighbors in the network.

Apart from Graph2Gauss, these methods are not intended to generate representations for documents with no connection to other documents and thus cannot induce a posteriori representations for new documents. Moreover, they provide little to no means of interpreting the learned representations. CANE is a notable exception, since its attention mechanism produces interpretable weights that highlight the words explaining the links between documents. Nevertheless, it lacks the ability to explain the representations of each document independently. In this paper, we describe and evaluate an inductive and interpretable method that learns word, topic and document representations in a single vector space, based on a new attention mechanism.
Our contributions are the following:

- we present a novel attention mechanism, Topic-Word Attention (TWA), that produces representations of a text where latent topic vectors attend to the word vectors of a document;
- we explain how to train the parameters of TWA by leveraging the links of the network. Our method, Inductive Document Network Embedding (IDNE), is able to produce representations for previously unseen documents, without network information;
- we quantitatively assess the performance of IDNE on several networks and show that our method performs better than recent methods in various settings, including when new documents, not part of the network, are inductively represented by the algorithms. To our knowledge, we are the first to evaluate this kind of inductive setting in the context of document network embedding;
- we qualitatively show that our model learns meaningful word and topic vectors and produces interpretable document representations.

The rest of the paper is organized as follows. In Sect. 2 we survey related works. We present our attention mechanism in detail and show how to train it on networks of documents in Sect. 3. Next, in Sect. 4, we present a thorough experimental study, where we assess the performance of our model following the usual evaluation protocol on node classification and further evaluate its capacity to induce representations for text documents with no connection to the network. In Sect. 5, we study the ability of our method to provide interpretable representations. Lastly, we conclude this paper and provide future directions in Sect. 6. The code for our model, the datasets and the evaluation procedure are made publicly available.

Network embedding (NE) provides an efficient approach to represent nodes in a low-dimensional vector space, suitable for solving various machine learning tasks. Recent techniques extend NE to document networks, showing that text and graph information can be combined to improve the resolution of classification and prediction tasks. In this section, we first cover important works in document NE and then relate recent advances in attention mechanisms.

DeepWalk [22] and node2vec [9] are the most well-known NE algorithms. They train dense embedding vectors by predicting node co-occurrences through random walks, adapting the Skip-Gram model initially designed for word embedding [19]. VERSE [24] proposes an efficient algorithm that can handle any type of similarity over the nodes.

Text-Associated DeepWalk (TADW) [28] extends DeepWalk to deal with textual attributes. Yang et al. prove, following the work in [17], that Skip-Gram with hierarchical softmax can be equivalently formulated as a matrix factorization problem. TADW then consists in constraining the factorization problem with a pre-computed representation T of the documents, obtained with Latent Semantic Analysis (LSA) [6]. The task is to optimize the objective

    min_{W,H} ||M − W^T H T||_F^2 + (λ/2)(||W||_F^2 + ||H||_F^2),

where M = (A + A^2)/2 is a normalized second-order adjacency matrix of the network, W is a matrix of one-hot node embeddings and H a feature transformation matrix. Final document embeddings are the concatenation of W and HT.

Graph2Gauss (G2G) [2] is an approach that embeds each node as a Gaussian distribution instead of a vector. The algorithm is trained by passing node attributes through a non-linear transformation via a deep neural network (encoder).
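For illustration, here is a minimal numpy sketch of this kind of text-associated factorization. It follows the TADW objective above, but the toy data, dimensions and plain gradient-descent updates are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, n_words, k, k_text = 50, 200, 16, 32

# Toy data: a sparse undirected adjacency matrix and a document-term matrix.
A = (rng.random((n_docs, n_docs)) < 0.05).astype(float)
A = np.maximum(A, A.T)
X = rng.random((n_docs, n_words)) * (rng.random((n_docs, n_words)) < 0.1)

# Second-order proximity matrix factorized by TADW: M = (A + A^2) / 2.
M = (A + A @ A) / 2

# Pre-computed text features T, here a truncated SVD (LSA) of the documents.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = (U[:, :k_text] * s[:k_text]).T          # shape (k_text, n_docs)

# Minimize ||M - W^T H T||_F^2 + (lam/2)(||W||_F^2 + ||H||_F^2) by gradient descent.
W = rng.normal(scale=0.1, size=(k, n_docs))
H = rng.normal(scale=0.1, size=(k, k_text))
lam, lr = 0.2, 1e-3
for _ in range(500):
    R = M - W.T @ H @ T                     # residual, shape (n_docs, n_docs)
    grad_W = -(H @ T @ R.T) + lam * W
    grad_H = -(W @ R @ T.T) + lam * H
    W -= lr * grad_W
    H -= lr * grad_H

# Final embeddings: concatenation of W and HT, one row per document after transposition.
embeddings = np.concatenate([W, H @ T], axis=0).T   # shape (n_docs, 2 * k)
```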
GVNR-t [3] is a matrix factorization approach for document network embedding, inspired by GloVe [21], that simultaneously learns word, node and document representations. In practice, a least-squares objective is optimized in which x_ij is the number of co-occurrences of nodes i and j, u_i is a one-hot encoding of node i and δ_j W / |δ_j|_1 is the average of the word embeddings of document j.

Context-Aware Network Embedding (CANE) [25] consists in a mutual attention mechanism trained on a document network. It learns several embeddings for a document according to its different contextual documents, represented by its neighbors in the network. The attention mechanism selects meaningful features from the text of pairs of documents that explain their relatedness in the graph. A similar approach is presented in [4], where the links between pairs of documents are predicted by computing the mutual contribution of their word embeddings.

In this work, we aim at constructing representations of documents that reflect their connections in a network. A key motivation behind our approach is to be able to predict a document's neighborhood given only its textual content. This allows our model to inductively produce embeddings for new documents for which no existing link is known. In that respect, Graph2Gauss is a similar approach. On the contrary, TADW and GVNR-t are not primarily designed for this purpose, as they both learn one-hot embeddings for each node in the document network. Note that while some methods like GraphSage [10], SDNE [27] and GAE [13] also enable induction on new nodes, they cannot deal with nodes that have no known connection. Our approach also differs from CANE, since the latter needs the neighbors of a document to generate its representation. IDNE learns to produce a single interpretable vector for each document in the network. In the next section, we review recent works in attention mechanisms for natural language processing (NLP) that inspired the conception of our method.

An attention mechanism uses a contextual representation to highlight or hide some parts of input data. Attention is an essential element of state-of-the-art neural machine translation (NMT) algorithms [18], providing a powerful way to capture dependencies between words. The Transformer [26] introduces a formalism of attention mechanisms for NMT. Given a query vector q, a set of key vectors K and a set of value vectors V, an attention vector is produced with the following formula:

    v_a = ω(q K^T) V.

qK^T measures the similarity between the query and each key k of K. ω is a normalization function such that all attention weights are positive and sum to 1. v_a is then the weighted sum of the values V according to the attention weights. Multiple attention vectors can be generated by using a set of queries Q.

In CANE, as for various NLP tasks [7], an attention mechanism generates attention weights that represent the strength of the relations between pairs of input words. However, in this paper, we do not seek to learn dependencies between pairs of words, but rather between words and some global topics. In this direction, the Set Transformer [16] constitutes a computationally efficient attention mechanism where the queries are replaced with a fixed-size set of learnable global inducing points. This model was not originally intended for NLP tasks; we therefore explore the capacity of such inducing points to play the role of topic representations when applied to textual data.
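As an illustration of this query-key-value formulation, here is a small numpy sketch using softmax as the normalization ω; the toy dimensions and the choice of softmax are assumptions made for the example.

```python
import numpy as np

def attention(Q, K, V):
    """Return one attention vector per query row: v_a = omega(q K^T) V."""
    scores = Q @ K.T                                  # query/key similarities
    scores -= scores.max(axis=1, keepdims=True)       # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # omega: softmax, weights sum to 1
    return weights @ V                                # weighted sums of the values

rng = np.random.default_rng(0)
p, n_queries, n_keys = 8, 4, 10
Q = rng.normal(size=(n_queries, p))   # a set of queries (e.g. learned inducing points)
K = V = rng.normal(size=(n_keys, p))  # keys and values (e.g. word vectors)
print(attention(Q, K, V).shape)       # (4, 8)
```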
Although we introduce the concept of topic vectors, the aim of this work is not to propose another topic model [5, 23]. We hypothesize that the introduction of global topic vectors in an attention mechanism can (1) lead to useful representations of documents for different tasks and (2) provide interpretable insight into the patterns learned by the model. Interpretability can help both machine learning practitioners to better refine their models and end users to understand automated recommendations.

We are interested in finding low-dimensional vector space representations of a set of n_d documents organized in a network, described by a document-term matrix X ∈ N^{n_d × n_w} and an adjacency matrix A ∈ N^{n_d × n_d}, where n_w stands for the number of words in our vocabulary. The method we propose, Inductive Document Network Embedding (IDNE), learns to represent the words and topics underlying the corpus in a single vector space. The document representations are computed by combining words and topics through an attention mechanism. In the following, we first describe how to derive the document vectors from known word and topic vectors through a novel attention mechanism, the Topic-Word Attention (TWA). Next, we show how to estimate the word and topic vectors, guided by the links connecting the documents of the network.

We assume a p-dimensional vector space in which both words and topics are represented. We denote by W ∈ R^{n_w × p} the matrix that contains the n_w word embedding vectors and by T ∈ R^{n_t × p} the matrix of the n_t topic vectors. Figure 1 shows the matrix computation of the attention weights.

Topic-Word Attention. Given a document i and its bag-of-words encoding X_i ∈ N^{n_w}, we measure the attention weights between topics and words, Z_i ∈ R^{n_t × n_w}, as follows:

    Z_i = g(T W^T).

The activation function g must satisfy two requirements: (1) all the weights are non-negative and (2) the columns of Z_i sum to one. The intuition behind the first requirement is that enforcing non-negativity should lead to sparse and interpretable topics. The second requirement transforms the raw weights into word-wise relative attention weights, which can be read as probabilities, similarly to what is done in neural topic models [23]. An obvious choice would be a column-wise softmax; however, we empirically find that a ReLU followed by a column-wise normalization performs best.

Document Representation. Given Z_i, we are able to calculate topic-specific representations of document i. From the perspective of topic k, the p-dimensional representation of document i is

    D_i^k = (1 / |X_i|_1) Σ_j X_ij Z_i^{kj} W_j.

Similarly to the attention formulation above, each topic vector, akin to a query, attends to the word vectors that play the role of keys to generate Z_i. The topic-specific representations are then the weighted sums of the values, also played by the word vectors. The final document vector is obtained by simple summation of all the topic-specific representations, which leads to d_i = Σ_k D_i^k. Scaling by 1/|X_i|_1 in the definition of D_i^k ensures that the document vectors have the same order of magnitude as the word vectors.

Since the corpus is organized in a network, we propose to estimate the parameters, W and T, by leveraging the links between the documents. We posit that the representations of documents connected by a short path in the network should be more similar in the vector space than those that are far apart. Thus, we learn W and T in a supervised manner, through the training of a discriminative model.
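To make the two steps above concrete, here is a minimal numpy sketch of TWA and of the resulting document vector, following the description in the text (ReLU then column-wise normalization for g, attention-weighted sums of word vectors, scaling by 1/|X_i|_1). The toy sizes are arbitrary and the handling of words whose attention column is entirely zero is an assumption for the example, not necessarily the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, p = 100, 6, 16

W = rng.normal(scale=0.1, size=(n_words, p))   # word vectors (keys and values)
T = rng.normal(scale=0.1, size=(n_topics, p))  # topic vectors (queries)
X_i = rng.poisson(0.2, size=n_words)           # bag-of-words counts of document i

# Topic-Word Attention: g = ReLU followed by a column-wise normalization, so that
# each word's attention weights over the topics are non-negative and sum to 1.
scores = np.maximum(T @ W.T, 0.0)              # (n_topics, n_words)
col_sums = scores.sum(axis=0, keepdims=True)
Z_i = np.divide(scores, col_sums, out=np.zeros_like(scores), where=col_sums > 0)

# Topic-specific representations: attention-weighted sums of the word vectors,
# restricted to the words of document i and scaled by 1 / |X_i|_1.
D_i = (Z_i * X_i) @ W / X_i.sum()              # (n_topics, p)

# Final document vector: sum of the topic-specific representations.
d_i = D_i.sum(axis=0)                          # (p,)
```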
Let Δ ∈ {0, 1}^{n_d × n_d} be a binary matrix such that δ_ij = 1 if document j is reachable from document i and δ_ij = 0 otherwise. We model the probability of a pair of documents being connected, given their representations, as the sigmoid of the dot product of d_i and d_j:

    P(δ_ij = 1 | d_i, d_j) = σ(d_i · d_j).

Assuming the document pairs are independent and identically distributed, we can express the log-likelihood of Δ given W and T:

    L(W, T) = Σ_{i,j} δ_ij log σ(d_i · d_j) + (1 − δ_ij) log(1 − σ(d_i · d_j)).

Through the maximization of this log-likelihood via a first-order optimization technique, we back-propagate the gradient and thus learn the word and topic vectors that lead to the document representations that best reconstruct Δ.

Common tasks in document network embedding are classification and link prediction. We assess the quality of the representations learned with IDNE for these tasks in two different settings: (1) a traditional setting where all links and documents are observed and (2) an inductive setting where only a fraction of the links and documents is observed during training. The first setting corresponds to a scenario where the goal is to propagate labels associated with a small portion of the documents. The second represents a scenario where we want to predict labels and links for new documents that have no network information, once the algorithm is already trained. This is a common setting in real-world applications. As an example, when a new user asks a question on a Q&A website, we would like to suggest tags for their question and to recommend potentially similar questions. In this case, the only information available to the algorithm is the textual content of the question. We detail here the setup we use to train IDNE.

Computing the Δ Matrix. We consider paths of length up to 2 and compute the Δ matrix in the following manner:

    δ_ij = 1 if (A + A^2)_ij > 0, and δ_ij = 0 otherwise.

This means that two documents are considered close in the network if they are direct neighbors or share at least one neighbor. Note that this matrix is the binarized version of the matrix TADW factorizes.

Optimizing the Log-Likelihood. We perform mini-batch SGD with the ADAM [12] update rule. Because most document networks are sparse, rather than uniformly sampling entries of Δ, we sample 5000 balanced mini-batches in order to favor convergence. We sample 16
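A compact PyTorch sketch of this training setup follows (binarized Δ from paths of length up to 2, balanced mini-batches, sigmoid link probabilities optimized with Adam). The encoder below is a plain linear map used as a stand-in for the TWA encoder, and all sizes, rates and batch compositions are illustrative assumptions, not the authors' settings; the loss is the negative of the log-likelihood above.

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
n_docs, n_words, p = 50, 100, 64

# Toy inputs: symmetric adjacency matrix A and bag-of-words matrix X.
A = (rng.random((n_docs, n_docs)) < 0.05).astype(np.float32)
A = np.maximum(A, A.T)
X = torch.tensor((rng.random((n_docs, n_words)) < 0.1).astype(np.float32))

# Supervision: delta_ij = 1 iff j is reachable from i in at most two hops,
# i.e. the binarized version of (A + A^2).
Delta = torch.tensor(((A + A @ A) > 0).astype(np.float32))
pos = torch.nonzero(Delta)        # connected pairs (i, j)
neg = torch.nonzero(Delta == 0)   # disconnected pairs

# Stand-in encoder mapping bag-of-words rows to p-dimensional document vectors.
# (In IDNE this would be the TWA encoder built from the word and topic vectors.)
encoder = torch.nn.Linear(n_words, p, bias=False)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(200):
    # Balanced mini-batch: as many connected pairs as disconnected ones.
    batch = torch.cat([pos[torch.randint(len(pos), (16,))],
                       neg[torch.randint(len(neg), (16,))]])
    labels = torch.cat([torch.ones(16), torch.zeros(16)])

    d = encoder(X)                                         # document vectors
    logits = (d[batch[:, 0]] * d[batch[:, 1]]).sum(dim=1)  # d_i . d_j
    # Negative Bernoulli log-likelihood of the observed links:
    # -[delta log sigma(d_i.d_j) + (1 - delta) log(1 - sigma(d_i.d_j))]
    loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```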
We consider 4 networks of documents of various nature:

- a well-known scientific citation network extracted from Cora, where each document is an article labelled with a conference;
- New York Times (NYT) titles of articles from January 2007, where articles are linked according to common tags (e.g. business, arts, technology) and are labeled with the section they appear in (e.g. opinion, news); this network is particularly dense and its documents are short;
- two networks from the Q&A website Stack Exchange (SE) from June 2019, namely gaming.stackexchange.com and travel.stackexchange.com. We only keep questions with at least 10 user votes and that have at least one answer with 10 user votes or more. We build the network by linking questions with their answers and by linking questions and answers of the same user. The labels are the tags associated with each question (Table 1).

For each network, we consider a traditional classification task, an inductive classification task and an inductive link prediction task:

- the traditional task refers to a setting where the model is trained on the entire network and the learned representations are used as features for a one-vs-all linear classifier, with a training set of labelled documents ranging from 2% to 10% for multi-class networks and from 10% to 50% for multi-label networks;
- the inductive tasks refer to a setting where 10% of the documents are removed from the network and the model is trained on the resulting sub-network. For the classification task, a linear classifier is trained with the representations and the labels of the observed documents. Representations for the hidden documents are then generated in an inductive manner, using their textual content only. Classification and link prediction are then performed on these induced representations.

To classify the learned representations, we use the LIBLINEAR [8] logistic regression [14] algorithm and we cross-validate the regularization parameter for each dataset and each model. Every experiment is repeated 10 times and we report the micro average of the area under the ROC curve (AUC). The AUC uses the probabilities of the logistic regression for all classes and evaluates the quality of the resulting ranking given the true labels. This metric is thus suitable for information retrieval tasks where we want to penalize wrong predictions depending on their ranks. For link prediction, we rank pairs of documents according to the cosine similarity between their representations.

For all document networks, we process the documents by tokenizing the text into words and discarding punctuation, stop words and words that appear fewer than 5 times or in more than 25% of the documents. We create document-term matrices that are used as input for 6 algorithms. Our baselines are representative of the different approaches to document NE. TADW and GVNR-t are based on matrix factorization whereas CANE and G2G are deep learning models. For each of them, we use the authors' implementations:

- LSA: we use a 256-dimensional SVD decomposition of the tf-idf vectors as a text-only baseline;
- TADW: we follow the guidelines of the original paper by using 20 iterations and a penalty term λ = 0.2. For induction, we generate a document vector by computing the textual component HT of the TADW objective;
- Graph2Gauss (G2G): we make sure the loss function converges before the maximum number of iterations;
- GVNR-t: we use γ = 10 random walks of length t = 40, a sliding window of size l = 5 and a threshold x_min = 5 with 1 iteration. For induction, we compute the average word embedding δ_j W / |δ_j|_1;
- CANE: we use the same parameters as in the original paper;
- IDNE: we run all experiments with n_t = 32 topic vectors. The effect of n_t is discussed in Sect. 4.6.
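As a rough illustration of this evaluation protocol, here is a scikit-learn sketch of the two metrics (micro-averaged AUC for classification, cosine-similarity ranking for link prediction). The random embeddings, labels and candidate pairs are placeholders, and a plain LogisticRegression with a fixed C stands in for the cross-validated LIBLINEAR setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
n_docs, dim, n_classes = 500, 64, 7
emb = rng.normal(size=(n_docs, dim))          # learned document representations
labels = rng.integers(0, n_classes, n_docs)   # document classes

# Classification: a linear classifier on a small labelled subset, scored with the
# micro-averaged AUC over the predicted class probabilities.
train_idx, test_idx = train_test_split(np.arange(n_docs), train_size=0.1,
                                       stratify=labels, random_state=0)
clf = LogisticRegression(max_iter=1000, C=1.0)   # C would be cross-validated in practice
clf.fit(emb[train_idx], labels[train_idx])
proba = clf.predict_proba(emb[test_idx])
y_true = label_binarize(labels[test_idx], classes=np.arange(n_classes))
print("classification AUC:", roc_auc_score(y_true, proba, average="micro"))

# Link prediction: rank candidate document pairs by cosine similarity.
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
i, j = rng.integers(0, n_docs, 200), rng.integers(0, n_docs, 200)  # candidate pairs
true_links = rng.integers(0, 2, 200)                               # toy ground truth
scores = (normed[i] * normed[j]).sum(axis=1)
print("link prediction AUC:", roc_auc_score(true_links, scores))
```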
Tables 2 and 3 detail the AUC scores on the traditional classification task. We report the results for CANE only for Cora, since the algorithm did not terminate within 10 hours on the other networks. In comparison, our method takes about 5 minutes to run on each network on a regular laptop. The classifier performs well on the representations we learned, achieving similar or better results than the baseline algorithms on Cora, Gaming and Travel Stack Exchange. However, regarding the New York Times network, GVNR-t and TADW have a slight advantage. Because of its high density, the links in this network carry little information, which may explain the relatively good scores of the LSA representations. We hypothesize that (1) TADW benefits from its input LSA features and that (2) GVNR-t benefits both from its random walk based matrix of node co-occurrences [20], which captures more precisely the proximities of the nodes in such a dense network, and from the short length of the documents, which makes the word embedding averaging effective [1, 15].

Table 4 shows the AUC scores in the inductive settings. For link prediction, IDNE performs best on three networks, showing its capacity to learn meaningful word and topic representations according to the network structure. For classification, LSA and GVNR-t achieve the best results while IDNE reaches similar but slightly lower scores on all datasets. On the contrary, TADW and Graph2Gauss show weaknesses on NYT and Gaming SE. In summary, IDNE shows consistent performance across all settings, whereas the other methods lack robustness to the type of network or the type of task. A surprising result is the good score of GVNR-t for inductive classification, which we did not expect given that only its textual component is used in this setting. However, for traditional classification, GVNR-t has difficulty handling networks with longer documents. IDNE does not suffer from the same problem because TWA carefully selects discriminative words before averaging them. In Sect. 5, we further show that IDNE learns meaningful representations of words and topics and builds interpretable document representations.

Figure 2 shows the impact of the number of topic vectors n_t and of the number of steps (mini-batches) on the AUC scores obtained in traditional classification on Cora. Note that we observe a similar behavior on the other networks. We see that the scores improve from 1 to 16 topics and tend to stagnate for larger values. In a similar manner, performance improves up to 5000 iterations, after which no increase is observed.

We first show in Sect. 5.1 that IDNE is capable of learning meaningful word and topic vectors. Then, we provide visualizations of documents that highlight the ability of the topic-word attention to reveal topics of interest. For all experiments, we set the number of topics to n_t = 6.

Table 5 shows the closest words to each topic, computed as the dot product between their respective vectors, learned on Cora. Word and topic vectors are trained to predict the proximity of the nodes in the network; meaningless words are thus always dissimilar to the topic vectors, since they do not help to predict a link. This can be verified by observing the words that have the largest and the smallest norms, also reported in Table 5. Even though the topics are learned in an unsupervised manner, we notice that, when we set the number of topics close to the number of classes, each topic seems to capture the semantics of one particular class.

To further highlight the ability of our model to bring interpretability, we show in Fig. 3 the topics that most likely generated the words of a document according to TWA. The document is the abstract of this paper, whose attention weights are inductively computed with IDNE previously trained on Cora. We compute its attention weights Z_i and associate each word k with the maximum value of its column in Z_i. We then colorize and underline each word associated with one of the two most represented topics in the document, if its weight is higher than 1/2. We see that the major topic (green and single underline), which accounts for 32% of the weights, deals with the type of data, here document networks. The second topic (blue and double underline), which represents 18% of the weights, relates to text modeling, with words like "interpretable" and "topics".
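The interpretability procedures above (closest words per topic, per-word topic assignment with a 1/2 threshold) reduce to a few array operations. A minimal numpy sketch follows; the random vectors, the synthetic vocabulary and the word indices are placeholders for trained IDNE parameters and a real document.

```python
import numpy as np

# Placeholders for trained IDNE parameters: word vectors W, topic vectors T and a
# vocabulary list (random here; in practice these come from a trained model).
rng = np.random.default_rng(0)
vocab = [f"word_{j}" for j in range(100)]
W = rng.normal(scale=0.1, size=(100, 16))
T = rng.normal(scale=0.1, size=(6, 16))

# Closest words to each topic, by dot product between topic and word vectors.
sim = T @ W.T                                        # (n_topics, n_words)
for k in range(T.shape[0]):
    top = np.argsort(-sim[k])[:5]
    print(f"topic {k}:", [vocab[j] for j in top])

# Per-word topic assignment in a document: compute the attention weights Z_i
# (ReLU + column-wise normalization) and keep, for each word, the topic with the
# highest weight, provided that this weight exceeds 1/2.
scores = np.maximum(sim, 0.0)
col_sums = scores.sum(axis=0, keepdims=True)
Z_i = np.divide(scores, col_sums, out=np.zeros_like(scores), where=col_sums > 0)
doc_word_ids = [3, 17, 42]                           # indices of the document's words
for j in doc_word_ids:
    k = int(Z_i[:, j].argmax())
    if Z_i[k, j] > 0.5:
        print(f"{vocab[j]} -> topic {k} (weight {Z_i[k, j]:.2f})")
```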
In this paper, we presented IDNE, an inductive document network embedding algorithm that learns word and latent topic representations via TWA, a topic-word attention mechanism able to produce interpretable document representations. We showed that IDNE achieves state-of-the-art results on various networks in different settings. Moreover, we showed that our attention mechanism provides an efficient way of interpreting the learned representations. In future work, we would like to study the effect of the sampling of the documents on the learned topics. In particular, the matrix Δ could capture other types of similarities between documents, such as SimRank [11], which measures structural relatedness between nodes instead of proximity. This could reveal complementary topics underlying a document network and could provide interpretable explanations of the roles played by documents in networks.

References

1. A simple but tough-to-beat baseline for sentence embeddings
2. Deep Gaussian embedding of graphs: unsupervised inductive learning via ranking
3. Global vectors for node representations
4. Link prediction with mutual attention for text-attributed networks
5. Relational topic models for document networks
6. Indexing by latent semantic analysis
7. BERT: pre-training of deep bidirectional transformers for language understanding
8. LIBLINEAR: a library for large linear classification
9. node2vec: scalable feature learning for networks
10. Inductive representation learning on large graphs
11. SimRank: a measure of structural-context similarity
12. Adam: a method for stochastic optimization
13. Variational graph auto-encoders
14. Logistic Regression. Statistics for Biology and Health
15. Distributed representations of sentences and documents
16. Set Transformer
17. Neural word embedding as implicit matrix factorization
18. Effective approaches to attention-based neural machine translation
19. Distributed representations of words and phrases and their compositionality
20. The PageRank citation ranking: bringing order to the web
21. GloVe: global vectors for word representation
22. DeepWalk: online learning of social representations
23. Autoencoding variational inference for topic models
24. VERSE: versatile graph embeddings from similarity measures
25. CANE: context-aware network embedding for relation modeling
26. Attention is all you need
27. Structural deep network embedding
28. Network representation learning with rich text information