key: cord-0039572-ef481p8k
cord_uid: ef481p8k
doc_id: 39572
sha: 7fa388a14abee4a646f20a975c2bf4136ab91956
title: PMD: An Optimal Transportation-Based User Distance for Recommender Systems
authors: Meng, Yitong; Dai, Xinyan; Yan, Xiao; Cheng, James; Liu, Weiwen; Guo, Jun; Liao, Benben; Chen, Guangyong
date: 2020-03-24
journal: Advances in Information Retrieval
DOI: 10.1007/978-3-030-45442-5_34

Abstract. Collaborative filtering predicts a user's preferences by aggregating ratings from similar users, and thus the user similarity (or distance) measure is key to good performance. Existing similarity measures either consider only the co-rated items for a pair of users (but co-rated items are rare in real-world sparse datasets), or try to utilize the non-co-rated items via some heuristics. We propose a novel user distance measure, called Preference Mover's Distance (PMD), based on optimal transportation theory. PMD exploits all ratings made by each user and works even if users share no co-rated items at all. In addition, PMD is a metric and has favorable properties such as the triangle inequality and zero self-distance. Experimental results show that PMD achieves superior recommendation accuracy compared with state-of-the-art similarity measures, especially on highly sparse datasets.

Collaborative filtering (CF) is one of the most widely used recommendation techniques [14, 47]. Given a user, CF recommends items by aggregating the preferences of similar users. Among CF approaches, methods based on nearest neighbours (NN) are widely used thanks to their simplicity, efficiency and ability to produce accurate and personalized recommendations [13, 35, 44]. Although deep learning (DL) methods [16, 19, 43] have attracted much attention in the recommendation community over the past few years, a very recent study [12] shows that NN-based CF is still a strong baseline and outperforms many DL methods.
For NN-based methods, the user similarity measure plays an important role. It serves as the criterion for selecting a group of similar users whose ratings form the basis of recommendations, and it is used to weigh users so that more similar users have a greater impact on the recommendations. Besides CF, user similarity is also important for applications such as link prediction [4] and community detection [34].

Related Work. Traditional similarity measures, such as the cosine distance (COS) [9], Pearson's correlation coefficient (PCC) [9] and their variants [18, 29, 38, 39], have been widely used in CF [13, 44]. However, such measures consider only co-rated items and ignore ratings on all other items, and thus may capture users' preferences only coarsely, as ratings are sparse and co-rated items are rare in many real-world datasets [35, 40, 44]. Other similarity measures, such as Jaccard [22], MSD [39], JMSD [8], URP [27], NHSM [27], PIP [5] and BS [14], do not utilize all the rating information [6]. For example, Jaccard uses only the number of rated items and omits the specific rating values, while URP uses only the mean and the variance of the ratings. Critically, all these measures give a zero similarity value when there are no co-rated items, which harms recommendation performance. Recently, BCF [35] and HUSM [44] were proposed to alleviate the co-rating issue by modeling user similarity as a weighted sum of item similarities, where the weights are obtained using heuristics. As the weights are not derived in a principled manner, these measures do not satisfy properties such as the triangle inequality and zero self-distance, which a high-quality similarity measure should have.

The Earth Mover's Distance (EMD) is a distance metric on probability spaces that originates from optimal transportation theory [25, 37]. EMD has been applied in many areas, such as computer vision [7], natural language processing [17, 23] and signal processing [41].
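To make the co-rated-items limitation concrete, here is a minimal sketch of PCC computed over co-rated items only. The function name, the dict-based rating format and the zero fallback are our own illustrative choices, not from the paper:

```python
import numpy as np

def pcc(r_u, r_v):
    """Pearson correlation over co-rated items only; r_u, r_v map item id -> rating.
    Illustrative sketch: returns 0.0 when users share fewer than two co-rated items,
    which is exactly the failure mode on sparse data."""
    common = sorted(set(r_u) & set(r_v))
    if len(common) < 2:
        return 0.0  # no (or too few) co-rated items => no signal at all
    a = np.array([r_u[i] for i in common], dtype=float)
    b = np.array([r_v[i] for i in common], dtype=float)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Two users with perfectly aligned tastes on disjoint item sets still score 0.0 here, which motivates measures that use all ratings.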
EMD has also been applied to CF [48], but there it is used as a regularizer that forces the latent variable to fit a Gaussian prior in auto-encoder training, rather than as a user similarity measure.

Our Solution. We propose the Preference Mover's Distance (PMD), which considers all ratings made by each user and can evaluate user similarity even in the absence of co-rated items. Similar to BCF and HUSM, PMD uses item similarity as side information and assumes that if two users have similar opinions on similar items, then their tastes are similar. The key difference is that PMD formulates the distance between a pair of users as an optimal transportation problem [26, 36], so that the weights for item similarities are derived in a principled manner. In fact, PMD can be viewed as a special case of EMD [33, 37, 45], and is therefore a metric satisfying properties such as the triangle inequality and zero self-distance. We also make PMD practical for large datasets by employing the Sinkhorn algorithm [10] to speed up distance computation and HNSW [30] to further accelerate the search for similar users. Experimental results show that PMD leads to superior recommendation accuracy over state-of-the-art similarity measures, especially on sparse datasets.

Problem Definition. Let U be a set of m users and I a set of n items. The user-item interaction matrix is denoted by R ∈ R^{m×n}, where R(u, i) ≥ 0 is the rating user u gives to item i. R is a partially observed matrix and is usually highly sparse. For a user u ∈ U, her rated items are denoted by I_u ⊂ I. The item distances are described by a matrix D, where D(i, j) ≥ 0 denotes the distance between items i and j. Item distances can be derived from the ratings on items [35, 44] or from content information [46], such as item tags, comments, etc. In this paper, we assume D is given. We are interested in computing the distance between any pair (u, v) of users in U given R and D.
User similarity can be easily derived from the user distance as they are negatively correlated. Let Σ_k = {x ∈ R^k : x ≥ 0, x^T 1 = 1} denote the probability simplex, where 1 is an all-1 column vector. We model a user's preferences as a probability distribution p_u ∈ Σ_{|I_u|} on I_u, where p_u(i) indicates how much user u likes item i. In practice, the ground truth of p_u cannot be observed, and we estimate it by normalizing user u's ratings on I_u, i.e., p_u(i) ≈ R(u, i) / Σ_{j∈I_u} R(u, j) for i ∈ I_u.

We model the distance between users u and v, denoted by d(p_u, p_v), as a weighted average of the distances between their rated items, i.e.,

d(p_u, p_v) = Σ_{i∈I_u} Σ_{j∈I_v} W_{u,v}(i, j) D(i, j),    (1)

where W_{u,v}(i, j) ≥ 0 is the weight for the item pair (i, j), and we introduce the constraint set

U(p_u, p_v) = { W_{u,v} : W_{u,v} 1 = p_u, W_{u,v}^T 1 = p_v, W_{u,v} ≥ 0 }.    (2)

The row sum Σ_{j∈I_v} W_{u,v}(i, j) = p_u(i) is the aggregate weight received by item i for user u, and it should be large if p_u(i) is large, so that d(p_u, p_v) can focus on the items that user u likes. Similarly, the column sum Σ_{i∈I_u} W_{u,v}(i, j) = p_v(j). However, U(p_u, p_v) contains many different configurations of W_{u,v}, which means that the user distance is indeterminate. Therefore, we define the user distance as the smallest among all possibilities:

d(p_u, p_v) = min_{W_{u,v} ∈ U(p_u, p_v)} Σ_{i∈I_u} Σ_{j∈I_v} W_{u,v}(i, j) D(i, j).    (3)

Equation (3) is a special case of the earth mover's distance (EMD) [11] with moment parameter p = 1 on a discrete probability space. Moreover, PMD is a metric as long as D is a metric [37]. We call d(p_u, p_v) the preference mover's distance (PMD) to highlight its connection to EMD. Being a metric confers properties that make the user distance meaningful. For example, the triangle inequality implies that if both user A and user B are similar to a third user C, then user A and user B are also similar to each other. Moreover, a user is most similar to himself among all users if D(i, i) = 0. In contrast, it is unclear whether BCF and HUSM have these properties, as they determine the weights using heuristics.

Illustration. Intuitively, d(p_u, p_v) can be viewed as the minimum cost of transforming the ratings of user u into the ratings of user v, as shown in Fig. 1.
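Eq. (3) is a small linear program and can be solved directly. The following is a minimal illustration using SciPy's LP solver (the function and variable names are ours, and the paper uses the faster Sinkhorn approximation in practice rather than an exact LP):

```python
import numpy as np
from scipy.optimize import linprog

def pmd(p_u, p_v, D):
    """Exact PMD: min <W, D> over W >= 0 with row sums p_u and column sums p_v."""
    m, n = len(p_u), len(p_v)
    c = D.reshape(-1)  # objective: sum_ij W(i,j) * D(i,j), with W flattened row-major
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):                   # row-sum constraints: W 1 = p_u
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):                   # column-sum constraints: W^T 1 = p_v
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([p_u, p_v])
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return float(res.fun)

# Toy check of zero self-distance: identical preferences give distance 0.
p = np.array([0.5, 0.5])
D = np.array([[0.0, 1.0],
              [1.0, 0.0]])
```

On this toy D, moving all mass from item 0 to item 1 (p_u = [1, 0], p_v = [0, 1]) costs exactly 1, matching the intuition of transforming one user's ratings into the other's.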
p_u and p_v define two distributions of mass, while D(i, j) models the cost of moving one unit of mass from p_u(i) to p_v(j). Therefore, PMD can model the similarity between u and v even if they have no co-rated items. If two users like similar items, W_{u,v}(i, j) takes large values on item pairs with small D(i, j), which results in a small distance. This is the case for u_0 and u_1 in Fig. 1, as they both like science fiction movies. In contrast, if two users like dissimilar items, W_{u,v}(i, j) is large on item pairs with large D(i, j), which produces a large distance. In Fig. 1, u_0 likes science fiction movies while u_2 likes romantic movies, and thus d(p_{u_0}, p_{u_2}) is large. Even though u_0 has no co-rated movies with u_1 or u_2, PMD still gives d(p_{u_0}, p_{u_1}) < d(p_{u_0}, p_{u_2}), which implies that u_0 is more similar to u_1 than to u_2.

Speedup. An exact solution of the optimization problem in Eq. (3) takes O(q^3 log q) time [36], where q = |I_u ∪ I_v|. To reduce the complexity, we use the Sinkhorn algorithm [10], which produces a high-quality approximate solution with a complexity of O(q^2). To speed up the lookup of similar users in large datasets, we employ HNSW [30], the state-of-the-art algorithm for similarity search. HNSW builds a multi-layer k-nearest-neighbour (KNN) graph for the dataset and returns high-quality nearest neighbours for a query with O(log N) distance computations, where N is the number of users. With these two techniques, looking up the top 100 neighbours of a user takes only 0.02 s on average with a high recall of 99.2% on the Epinions dataset in our experiments. We conduct the experiments on a machine with two 2.0 GHz Intel(R) Xeon(R) E5-2620 CPUs (12 physical cores in total), 48 GB RAM, a 450 GB SATA disk (6 Gb/s, 10k rpm, 64 MB cache) and 64-bit CentOS release 7.2.

Positive/Negative Feedback.
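The Sinkhorn algorithm itself is simple to sketch: add an entropy regularizer with weight reg to Eq. (3) and alternately rescale the rows and columns of a Gibbs kernel. This is a generic illustration of the algorithm from [10], not the authors' code; the names and the toy parameters are ours:

```python
import numpy as np

def sinkhorn_pmd(p_u, p_v, D, reg=0.1, n_iters=200):
    """Entropy-regularized approximation of PMD via Sinkhorn scaling.
    Smaller reg is closer to the exact distance but converges more slowly."""
    K = np.exp(-D / reg)              # Gibbs kernel
    a = np.ones_like(p_u)
    for _ in range(n_iters):          # alternate row/column rescaling
        b = p_v / (K.T @ a)
        a = p_u / (K @ b)
    W = a[:, None] * K * b[None, :]   # approximate optimal transport plan
    return float(np.sum(W * D))
```

Each iteration costs O(q^2) for q rated items, versus O(q^3 log q) for the exact solver, which is what makes PMD practical for large datasets.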
We can split the user ratings into positive ratings R_p (e.g., scores 3, 4 and 5 on a 1-5 scale), which indicate that the user likes the item, and negative ratings R_n (e.g., scores 1 and 2), which indicate that the user dislikes the item. Based on R_p and R_n, we define a positive preference p^p_u and a negative preference p^n_u, i.e., p^p_u(i) = R_p(u, i) / Σ_{j∈I_u} R_p(u, j), with p^n_u defined analogously on R_n. Then we can define more fine-grained user distances using Eq. (3). For example, a small d(p^n_u, p^n_v) indicates that the two users dislike similar items and can be used to avoid making bad recommendations that may lose users. A small d(p^p_u, p^n_v) or d(p^n_u, p^p_v) means that the interests of the two users complement each other and may be used for friend recommendation in social networks. We may also construct a composite PMD (CPMD) such as

d_CPMD(u, v) = μ · d(p^p_u, p^p_v) + (1 − μ) · d(p^n_u, p^n_v),    (4)

where μ ∈ [0, 1] is a tuning parameter weighting the importance of the distances of the positive and negative preferences.

We evaluate PMD by comparing its performance for NN-based recommendation with various user similarity measures. Two well-known datasets, MovieLens-1M [2] and Epinions [1], are used and their statistics are reported in Table 1. The rating user u gives to item i is predicted as a mean-centred weighted sum over its top-K neighbours in the training set [13], i.e.,

R̂(u, i) = ū + Σ_{v∈N_u} s(u, v) (R(v, i) − v̄) / Σ_{v∈N_u} s(u, v),

where ū is the average of the ratings given by user u (and v̄ that of user v), N_u contains the top-K neighbours of u, and s(u, v) is the similarity between users u and v. We convert PMD into a similarity measure using s(u, v) = 2 − d(p_u, p_v) and divide all ratings into train/validation/test sets with an 8:1:1 ratio. Hyper-parameters are tuned to be optimal on the validation set for all methods. The mean absolute error (MAE) and the root mean square error (RMSE) [15, 31, 32] of the predicted ratings on the test set are used to evaluate recommendation performance. The item distance D(i, j) is constructed so that it is a metric on the item space. For fair comparison, the same item similarity matrix is used for PMD, BCF and HUSM.

Comparison Methods.
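The rating predictor described above is a standard mean-centred weighted average. A minimal sketch (our own helper with an assumed list-of-tuples input, not the authors' code):

```python
def predict_rating(u_mean, neighbours):
    """R_hat(u, i) = u_mean + sum_v s(u,v) * (R(v,i) - v_mean) / sum_v s(u,v).
    neighbours: list of (similarity, neighbour_rating, neighbour_mean) tuples
    for the top-K neighbours of u that rated item i."""
    den = sum(s for s, _, _ in neighbours)
    if den == 0:
        return u_mean  # no usable neighbours: fall back to the user's mean
    num = sum(s * (r - m) for s, r, m in neighbours)
    return u_mean + num / den
```

With PMD, the similarity fed in here would be s(u, v) = 2 − d(p_u, p_v), so closer users contribute more to the prediction.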
COS, PCC and MSD are three classical user similarity measures. Jaccard, JMSD, NHSM, BCF and HUSM are five state-of-the-art measures. NMF [28], SVD [21] and SVD++ [20] are latent factor models for CF.

We report the performance of the various similarity measures in Table 3, where PMD is based on Eq. (3) and CPMD on Eq. (4). The results show that PMD and CPMD consistently outperform the other similarity measures, and the improvement is more significant on the Epinions dataset, which is much sparser. We believe our methods perform well on sparse datasets mainly because they utilize all rating information and derive the item weights from optimal transportation theory, which works well when there are few or no co-rated items. This is favorable as ratings are sparse in many real-world datasets [40]. CPMD achieves better performance than PMD, which suggests that it is beneficial to distinguish positive from negative feedback. We also compare our methods with the latent factor models in Table 4. On the sparse Epinions dataset, both PMD and CPMD outperform the latent factor models.

We report the performance of CPMD-based NN CF under different configurations of K and μ in Table 2. CPMD performs best when μ is around 0.6 on both datasets, possibly because positive ratings represent a user's taste better than negative ratings. In contrast, the optimal value of K is dataset dependent.

We proposed PMD, a novel user distance measure based on optimal transportation, which addresses the limitation of existing methods on datasets with few co-rated items. PMD also has the favorable properties of a metric. Experimental results show that PMD leads to better recommendation accuracy for NN-based CF than state-of-the-art user similarity measures, especially when the ratings are highly sparse.
References

A new similarity measure for link prediction based on local structures in social networks
A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem
Collaborative filtering similarity measures: revisiting
A new collaborative filtering metric that improves the behavior of recommender systems
Empirical analysis of predictive algorithms for collaborative filtering
Sinkhorn distances: lightspeed computation of optimal transport
A primer on optimal transport
Are we really making much progress? A worrying analysis of recent neural recommendation approaches
A comprehensive survey of neighborhood-based recommendation methods
Recommender Systems Handbook
A novel Bayesian similarity measure for recommender systems
TrustSVD: collaborative filtering with both the explicit and implicit influence of user trust and of item ratings
Proceedings of the 26th International Conference on World Wide Web
Supervised word mover's distance
TrustWalker: a random walk model for combining trust-based and item-based recommendation
Item recommendation with variational autoencoders and heterogeneous priors
Factorization meets the neighborhood: a multifaceted collaborative filtering model
Matrix factorization techniques for recommender systems
FlexRecs: expressing and combining flexible recommendations
From word embeddings to document distances
Distributed representations of sentences and documents
The earth mover's distance is the Mallows distance: some insights from statistics
An efficient earth mover's distance algorithm for robust histogram comparison
A new user similarity model to improve the accuracy of collaborative filtering. Knowl.-Based Syst.
An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems
Effective missing data prediction for collaborative filtering
Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs
PsRec: social recommendation with pseudo ratings
Probabilistic matrix factorization
Mémoire sur la théorie des déblais et des remblais. Histoire de l'Académie royale des sciences de
Detecting community structure in complex networks via node similarity
A new similarity measure using Bhattacharyya coefficient for collaborative filtering in sparse data
Fast and robust earth mover's distances
A metric for distributions with applications to image databases
Item-based collaborative filtering recommendation algorithms
Social information filtering: algorithms for automating "word of mouth"
Collaborative filtering: fallacies and insights in measuring similarity
A transportation LP distance for signal analysis
The tag genome: encoding community knowledge to support novel interaction
Collaborative deep learning for recommender systems
A hybrid user similarity model for collaborative filtering
Integer and combinatorial optimization
Judging similarity: a user-centric study of related item recommendations
Collaborative filtering meets mobile recommendation: a user-centered approach
Wasserstein autoencoders for collaborative filtering