key: cord-020885-f667icyt authors: Sharma, Ujjwal; Rudinac, Stevan; Worring, Marcel; Demmers, Joris; van Dolen, Willemijn title: Semantic Path-Based Learning for Review Volume Prediction date: 2020-03-17 journal: Advances in Information Retrieval DOI: 10.1007/978-3-030-45439-5_54 sha: doc_id: 20885 cord_uid: f667icyt Graphs offer a natural abstraction for modeling complex real-world systems where entities are represented as nodes and edges encode relations between them. In such networks, entities may share common or similar attributes and may be connected by paths through multiple attribute modalities. In this work, we present an approach that uses semantically meaningful, bimodal random walks on real-world heterogeneous networks to extract correlations between nodes and bring together nodes with shared or similar attributes. An attention-based mechanism is used to combine multiple attribute-specific representations in a late fusion setup. We focus on a real-world network formed by restaurants and their shared attributes and evaluate performance on predicting the number of reviews a restaurant receives, a strong proxy for popularity. Our results demonstrate the rich expressiveness of such representations in predicting review volume and the ability of an attention-based model to selectively combine individual representations for maximum predictive power on the chosen downstream task. Multimodal graphs have been extensively used in modeling real-world networks where entities interact and communicate with each other through multiple information pathways or modalities [1, 23, 31] . Each modality encodes a distinct view of the relation between nodes. For example, within a social network, users can be connected by their shared preference for a similar product or by their presence in the same geographic locale. Each of these semantic contexts links the same user set with a distinct edge set. Such networks have been extensively used for applications like semantic proximity search in existing interaction networks [7] , augmenting semantic relations between entities [36] , learning interactions in an unsupervised fashion [3] and augmenting traditional matrix factorization-based collaborative filtering models for recommendation [27] . Each modality within a multimodal network encodes a different semantic relation and exhibits a distinct view of the network. While such views contain relations between nodes based on interactions within a single modality, observed outcomes in the real-world are often a complex combination of these interactions. Therefore, it is essential to compose these complementary interactions meaningfully to build a better representation of the real world. In this work, we examine a multimodal approach that attempts to model the review-generation process as the end-product of complex interactions within a restaurant network. Restaurants share a host of attributes with each other, each of which may be treated as a modality. For example, they may share the same neighborhood, the same operating hours, similar kind of cuisine, or the same 'look and feel'. Furthermore, each of these attributes only uncovers a specific type of relation. For example, a view that only uses the location-modality will contain venues only connected by their colocation in a common geographical unit and will prioritize physical proximity over any other attribute. Broadly, each of these views is characterized by a semantic context and encodes modality-specific relations between restaurants. These views, although informative, are complementary and only record associations within the same modality. While each of these views encodes a part of the interactions within the network, performance on a downstream task relies on a suitable combination of views pertinent to the task [5] . In this work, we use metapaths as a semantic interface to specify which relations within a network may be relevant or meaningful and worth investigating. We generate bimodal low-dimensional embeddings for each of these metapaths. Furthermore, we conjecture that their relevance on a downstream task varies with the nature of the task and that this task-specific modality relevance should be learned from data. In this work, -We propose a novel method that incorporates restaurants and their attributes into a multimodal graph and extracts multiple, bimodal low dimensional representations for restaurants based on available paths through shared visual, textual, geographical and categorical features. -We use an attention-based fusion mechanism for selectively combining representations extracted from multiple modalities. -We evaluate and contrast the performance of modality-specific representations and joint representations for predicting review volume. The principle challenge in working with multimodal data revolves around the task of extracting and assimilating information from multiple modalities to learn informative joint representations. In this section, we discuss prior work that leverages graph-based structures for extracting information from multiple modalities, focussing on the auto-captioning task that introduced such methods. We then examine prior work on network embeddings that aim to learn discriminative representations for nodes in a graph. Graph-based learning techniques provide an elegant means for incorporating semantic similarities between multimedia documents. As such, they have been used for inference in large multimodal collections where a single modality may not carry sufficient information [2] . Initial work in this domain was structured around the task of captioning unseen images using correlations learned over multiple modalities (tag-propagation or auto-tagging). Pan et al. use a graph-based model to discover correlations between image features and text for automatic image-captioning [21] . Urban et al. use an Image-Context Graph consisting of captions, image features and images to retrieve relevant images for a textual query [32] . Stathopoulos et al. [28] build upon [32] to learn a similarity measure over words based on their co-occurrence on the web and use these similarities to introduce links between similar caption words. Rudinac et al. augment the Image-Context Graph with users as an additional modality and deploy it for generating visual-summaries of geographical regions [25] . Since we are interested in discovering multimodal similarities between restaurants, we use a graph layout similar to the one proposed by Pan et al. [21] for the image auto-captioning task but replace images with restaurants as central nodes. Other nodes containing textual features, visual features and users are retained. We also add categorical information like cuisines as a separate modality, allowing them to serve as semantic anchors within the representation. Graph representation learning aims to learn mappings that embed graph nodes in a low-dimensional compressed representation. The objective is to learn embeddings where geometric relationships in the compressed embedding space reflect structural relationships in the graph. Traditional approaches generate these embeddings by finding the leading eigenvectors from the affinity matrix for representing nodes [16, 24] . With the advent of deep learning, neural networks have become increasingly popular for learning such representations, jointly, from multiple modalities in an end-to-end pipeline [4, 11, 14, 30, 34] . Existing random walk-based embedding methods are extensions of the Random Walks with Restarts (RWR) paradigm. Traditional RWR-based techniques compute an affinity between two nodes in a graph by ascertaining the steadystate transition probability between them. They have been extensively used for the aforementioned auto-captioning tasks [21, 25, 28, 32] , tourism recommendation [15] and web search as an integral part of the PageRank algorithm [20] . Deep Learning-based approaches build upon the traditional paradigm by optimizing the co-occurrence statistics of nodes sampled from these walks. DeepWalk [22] uses nodes sampled from short truncated random walks as phrases to optimize a skip-gram objective similar to word2vec [17] . Similarly, node2vec augments this learning paradigm with second-order random walks parameterized by exploration parameters p and q which control between the importance of homophily and structural equivalence in the learnt representations [8] . For a homogeneous network, random walk based methods like DeepWalk and node2vec assume that while the probabilities of transitioning from one node to another can be different, every transition still occurs between nodes of the same type. For heterogeneous graphs, this assumption may be fallacious as all transitions do not occur between nodes of the same type and consequently, do not carry the same semantic context. Indeed, our initial experiments with node2vec model suggest that it is not designed to handle highly multimodal graphs. Clements et al. [5] demonstrated that in the context of content recommendation, the importance of modalities is strongly task-dependent and treating all edges in heterogeneous graphs as equivalent can discard this information. Metapath2Vec [6] remedies this by introducing unbiased walks over the network schema specified by a metapath [29] , allowing the network to learn the semantics specified by the metapath rather than those imposed purely by the topology of the graph. Metapath-based approaches have been extended to a variety of other problems. Hu et al. use an exhaustive list of semantically-meaningful metapaths for extracting Top-N recommendations with a neural co-attention network [10] . Shi et al. use metapath-specific representations in a traditional matrix factorization-based collaborative filtering mechanism [27] . In this work, we perform random walks on sub-networks of a restaurant-attribute network containing restaurants and attribute modalities. These attribute modalities may contain images, text or categorical features. For each of these sub-networks, we perform random walks and use a variant of the heterogeneous skip-gram objective introduced in [6] to generate low-dimensional bimodal embeddings. Bimodal embeddings have several interesting properties. Training relations between two modalities provide us with a degree of modularity where modalities can be included or held-out from the prediction model without affecting others. It also makes training inexpensive as the number of nodes when only considering two modalities is far lower than in the entire graph. In this section, we begin by providing a formal introduction to graph terminology that is frequently referenced in this paper. We then move on to detail our proposed method illustrated in Fig. 1 . Formally, a heterogeneous graph is denoted by G = (V, E, φ, σ) where V and E denote the node and edge sets respectively. For every node and edge, there exists mapping functions φ(v) → A and σ(e) → R where A and R are sets of node types and edge types respectively such that |A + R| > 2. For a heterogeneous graph G = (V, E, φ, σ), a network schema is a metagraph M G = (A, R) where A is the set of node types in V and R is the set of edge types in E. A network schema enumerates the possible node types and edge types that can occur within a network. A metapath M(A 1 , A n ) is a path on the network schema M G consisting of a sequence of ordered edge transitions: We use TripAdvisor to collect information for restaurants in Amsterdam. Each venue characteristic is then embedded as a separate node within a multimodal graph. In the figure above R nodes denote restaurants, I nodes denote images for a restaurant, D nodes are review documents, A nodes are categorical attributes for restaurants and L nodes are locations. Bimodal random walks are used to extract pairwise correlations between nodes in separate modalities which are embedded using a heterogeneous skip-gram objective. Finally, an attention-based fusion model is used to combine multiple embeddings together to regress the review volume for restaurants. Let G = (V, E) be the heterogeneous graph with a set of nodes V and edges E. We assume the graph to be undirected as linkages between venues and their attributes are inherently symmetric. Below, we describe the node types used to construct the graph (cf. Figs. 1 and 2 and use the penultimate layer output as a compressed low-dimensional representation for the image. Since the number of available images for each venue may vary dramatically depending on its popularity, adding a node for every image can lead to an unreasonably large graph. To mitigate this issue, we cluster image features for each restaurant using the K-Means algorithm and use the cluster centers as representative image features for a restaurant, similar to Zahálka et al. [35] . We chose K = 5 as a reasonable trade-off between the granularity of our representations and tractability of generating embeddings for this modality. The way patrons write about a restaurant and the usage of specialized terms can contain important information about a restaurant that may be missing from its categorical attributes. For example, usage of the Indian cottage cheese 'Paneer' can be found in similar cuisine types like Nepali, Surinamese, etc. and user reviews talking about dishes containing 'Paneer' can be leveraged to infer that Indian and Nepali cuisines share some degree of similarity. To model such effects, we collect reviews for every restaurant. Since individual reviews may not provide a comprehensive unbiased picture of the restaurant, we chose not to treat them individually, but to consider them as a single document. We then use a distributed bag-ofwords model from [13] to generate low-dimensional representations of these documents for each restaurant. Since the reviews of a restaurant can widely vary based on its popularity, we only consider the 10 most recent reviews for each restaurant to prevent biases from document length getting into the model. 6. Users: Since TripAdvisor does not record check-ins, we can only leverage explicit feedback from users who chose to leave a review. We add a node for each of the users who visited at least two restaurants in Amsterdam and left a review. Similar to [25, 28, 32] , we introduce two kinds of edges in our graph: 1. Attribute edges: These are heterogeneous edges that connect a restaurant node to the nodes of its categorical attributes, image features, review features and users. In our graph, we instantiate them as undirected, unweighted edges. 2. Similarity edges: These are homogeneous edges between the feature nodes within a single modality. For image features, we use a radial basis function as a non-linear transformation of the euclidean distances between image feature vectors. For document vectors, we use cosine similarity to find restaurants with similar reviews. Adding a weighted similarity edge between every node in the same modality would yield an extremely dense adjacency matrix. To avoid this, we only add similarity links between a node and its K nearest neighbors in each modality. By choosing the nearest K neighbors, we make our similarity threshold adaptive allowing it to adjust to varying scales of distance in multiple modalities. Metapaths can provide a modular and simple interface for injecting semantics into the network. Since metapaths, in our case, are essentially paths over the modality set, they can be used to encode inter-modality correlations. In this work, we generate embeddings with two specific properties: 1. All metapaths are binary and only include transitions over 2 modalities. Since venues/restaurants are always a part of the metapath, we only include one other modality. 2. During optimization, we only track the short-range context by choosing a small window size. Window size is the maximum distance between the input node and a predicted node in a walk. In our model, walks over the metapath only capture short-range semantic contexts and the choice of a larger window can be detrimental to generalization. For example, consider a random walk over the Restaurant -Cuisine -Restaurant metapath. In the sampled nodes below, restaurants are in red while cuisines are in blue. Optimizing over a large context window can lead to McDonald's (fast-food cuisine) and Kediri (Indonesian cuisine) being placed close in the embedding space. This is erroneous and does not capture the intended semantics which should bring restaurants closer only if they share the exact attribute. We use the metapaths in Table 1 to perform unbiased random walks on the graph detailed in Sect. 3.2. Each of these metapaths enforces similarity based on certain semantics. We train separate embeddings using the heterogeneous skip-gram objective similar to [6] . For every metapath, we maximize the probability of observing the heterogeneous context N a (v) given the node v. In Eq. (3) , A m is the node type-set and V m is the node-set for metapath m. arg max θ v∈Vm a∈Am ca∈Na (v) log p(c a |v; θ) The original Metapath2vec model [6] uses multiple metapaths [29] to learn separate embeddings, some of which perform better than the others. On the DBLP Metapath-Specific Embeddings Fig. 3 . Attention-weighted modality fusion: metapath-specific embeddings are fed into a common attention mechanism that generates an attention vector. Each modality is then reweighted with the attention vector and concatenated. This joint representation is then fed into a ridge regressor to predict the volume of ratings for each restaurant. bibliographic graph that consists of Authors (A), Papers (P) and Venues (V), the performance of their recommended metapath 'A-P-V-P-A' was empirically better than the alternative metapath 'A-P-A' on the node classification task. At this point, it is important to recall that in our model, each metapath extracts a separate view of the same graph. These views may contain complementary information and it may be disadvantageous to only retain the best performing view. For an optimal representation, these complementary views should be fused. In this work, we employ an embedding-level attention mechanism similar to the attention mechanism introduced in [33] that selectively combines embeddings based on their performance on a downstream task. Assuming S to be the set of metapath-specific embeddings for metapaths m 1 , m 2 , . . . , m N , following the approach outlined in Fig. 3 , we can denote it as: We then use a two-layer neural network to learn an embedding-specific attention A mn for metapath m n : Further, we perform a softmax transformation of the attention network outputs to an embedding-specific weight Finally, we concatenate the attention-weighted metapath-specific embeddings to generate a fused embedding We evaluate the performance of the embedding fusion model on the task of predicting the volume (total count) of reviews received by a restaurant. We conjecture that the volume of reviews is an unbiased proxy for the general popularity and footfall for a restaurant and is more reliable than indicators like ranking or ratings which may be biased by TripAdvisor's promotion algorithms. We use the review volume collected from TripAdvisor as the target variable and model this task as a regression problem. Data Collection. We use publicly-available data from TripAdvisor for our experiments. To build the graph detailed in Sect. 3.2, we collect data for 3,538 restaurants in Amsterdam, The Netherlands that are listed on TripAdvisor. We additionally collect 168,483 user-contributed restaurant reviews made by 105,480 unique users, of which only 27,318 users visit more than 2 restaurants in the city. We only retain these 27,318 users in our graph and drop others. We also collect 215,544 user-contributed images for these restaurants. We construct the restaurant network by embedding venues and their attributes listed in Table 1 as nodes. Bimodal Embeddings. We train separate bimodal embeddings by optimizing the heterogeneous skip-gram objective from Eq. (3) using stochastic gradient descent and train embeddings for all metapaths enumerated in Table 1 . We use restaurant nodes as root nodes for the unbiased random walks and perform 80 walks per root node, each with a walk length of 80. Each embedding has a dimensionality of 48, uses a window-size of 5 and is trained for 200 epochs. Embedding Fusion Models. We chose two fusion models in our experiments to analyze the efficacy of our embeddings: 1. Simple Concatenation Model: We use a model that performs a simple concatenation of the individual metapath-specific embeddings detailed in Sect. 3.4 to exhibit the baseline performance on the tasks detailed in Sect. 4. Simple concatenation is a well-established additive fusion technique in multimodal deep learning [18, 19] . Each of the models uses a ridge regression algorithm to estimate the predictive power of each metapath-specific embedding on the volume regression task. This regressor is jointly trained with the attention model in the Attention-Weighted Model. All models are optimized using stochastic gradient descent with the Adam optimizer [12] with a learning rate of 0.1. In Table 2 , we report the results from our experiments on the review-volume prediction task. We observe that metapaths with nodes containing categorical attributes perform significantly better than vector-based features. In particular, categorical attributes like Cuisines, Facilities, and Price have a significantly higher coefficient of determination (R 2 ) as compared to visual feature nodes. It is interesting to observe here that nodes like locations, images, and textual reviews are far more numerous than categorical nodes and part of their decreased performance may be explained by the fact that our method of short walks may not be sufficiently expressive when the number of feature nodes is large. In addition, as mentioned in related work, we performed these experiments with the node2vec model, but since it is not designed for heterogeneous multimodal graphs, it yielded performance scores far below the weakest single modality. A review of the fusion models indicates that taking all the metapaths together can improve performance significantly. The baseline simple concatenation fusion model, commonly used in literature, is considerably better than the best-performing metapath (Venues -Facilities -Venues). The attention basedmodel builds significantly over the baseline performance and while it employs a similar concatenation scheme as the baseline concatenation model, the introduction of the attention module allows it to handle noisy and unreliable modalities. The significant increase in the predictive ability of the attention-based model can be attributed to the fact that while all modalities encode information, some of them may be less informative or reliable than others, and therefore contribute less to the performance of the model. Our proposed fusion approach is, therefore, capable of handling weak or noisy modalities appropriately. In this work, we propose an alternative, modular framework for learning from multimodal graphs. We use metapaths as a means to specify semantic relations between nodes and each of our bimodal embeddings captures similarities between restaurant nodes on a single attribute. Our attention-based model combines separately learned bimodal embeddings using a late-fusion setup for predicting the review volume of the restaurants. While each of the modalities can predict the volume of reviews to a certain extent, a more comprehensive picture is only built by combining complementary information from multiple modalities. We demonstrate the benefits of our fusion approach on the review volume prediction task and demonstrate that a fusion of complementary views provides the best way to learn from such networks. In future work, we will investigate how the technique generalises to other tasks and domains. MANTIS: system support for MultimodAl NeTworks of in-situ sensors HyperLearn: a distributed approach for representation learning in datasets with many modalities Interaction networks for learning about objects, relations and physics Heterogeneous network embedding via deep architectures The task-dependent effect of tags and ratings on social media access Metapath2vec: scalable representation learning for heterogeneous networks M-HIN: complex embeddings for heterogeneous information networks via metagraphs Node2vec: scalable feature learning for networks Deep residual learning for image recognition Leveraging meta-path based context for top-n recommendation with a neural co-attention model Multimodal network embedding via attention based multi-view variational autoencoder Adam: a method for stochastic gradient descent Distributed representations of sentences and documents Deep collaborative embedding for social image understanding How random walks can help tourism Image labeling on a network: using social-network metadata for image classification Distributed representations of words and phrases and their compositionality Multimodal deep learning Multi-source deep learning for human pose estimation The PageRank citation ranking: bringing order to the web GCap: graph-based automatic image captioning DeepWalk: online learning of social representations The visual display of regulatory information and networks Nonlinear dimensionality reduction by locally linear embedding Generating visual summaries of geographic areas using community-contributed images ImageNet large scale visual recognition challenge Heterogeneous information network embedding for recommendation Semantic relationships in multi-modal graphs for automatic image annotation PathSim: meta path-based top-k similarity search in heterogeneous information networks LINE: large-scale information network embedding Study on optimal frequency design problem for multimodal network using probit-based user equilibrium assignment Adaptive image retrieval using a graph model for semantic feature integration Heterogeneous graph attention network Network representation learning with rich text information Interactive multimodal learning for venue recommendation MetaGraph2Vec: complex semantic path augmented heterogeneous network embedding