key: cord-020806-lof49r72 authors: Landin, Alfonso; Parapar, Javier; Barreiro, Álvaro title: Novel and Diverse Recommendations by Leveraging Linear Models with User and Item Embeddings date: 2020-03-24 journal: Advances in Information Retrieval DOI: 10.1007/978-3-030-45442-5_27 sha: doc_id: 20806 cord_uid: lof49r72 Nowadays, item recommendation is an increasing concern for many companies. Users tend to be more reactive than proactive for solving information needs. Recommendation accuracy became the most studied aspect of the quality of the suggestions. However, novel and diverse suggestions also contribute to user satisfaction. Unfortunately, it is common to harm those two aspects when optimizing recommendation accuracy. In this paper, we present EER, a linear model for the top-N recommendation task, which takes advantage of user and item embeddings for improving novelty and diversity without harming accuracy. In recent years, the way users access services has shifted from a proactive approach, where the user actively looks for the information, to one where the users take a more passive role, and content is suggested to them. Within this transformation, Recommender Systems have played a pivotal role, enabling an increase in user engagement and revenue. Recommender Systems are usually classified into three families [1] . The first approach, content-based systems, use item metadata to produce recommendations [7] . The second family, collaborative filtering, is composed of systems that exploit the past interactions of the users with the items to compute the recommendations [10, 17] . These interactions can take several forms, such as ratings, clicks, purchases. Finally, hybrid approaches combine both to generate suggestions. Collaborative Filtering (CF) systems can be divided into memory-based systems, that use the information about these interactions directly to compute the recommendations, and model-based systems, that build models from this information that are later used to make the recommendations. In this paper, we will present a CF model to address the top-N recommendation task [4] . The objective of a top-N recommender is to produce a ranked list of items for each user. These systems can be evaluated using traditional IR metrics over the rankings [2, 4] . In that evaluation approach, accuracy is usually the most important metric and has been the focus of previous research and competitions [3] . Nevertheless, other properties are also important, such as diversity and novelty [8, 13] . Diversity is the ability of the system to make recommendations that include items equitably from the whole catalog, which is usually desired by vendors [5, 22] . On the other hand, novelty is the capacity of the system to produce unexpected recommendations. This characteristic is a proxy for serendipity, associated with higher user engagement and satisfaction [6] . All these properties, accuracy, diversity and novelty, are linked to the extent that raising accuracy usually lowers the best achievable results in the other properties [11] . In this paper, we propose a method to augment an existing recommendation linear model to make more diverse and novel recommendations, while maintaining similar accuracy results. We do so by making use of user and item embeddings that are able to capture non-linear relations thanks to the way they are obtained [21] . Experiments conducted on three datasets show that our proposal outperforms the original model in both novelty and diversity while maintaining similar levels of accuracy. With reproducibility in mind, we also make the software used for the experiments publicly available 1 . In this section, we introduce FISM, the existing recommendation method we augment in our proposal. After that, we introduce prefs2vec, the user and item embedding model used to make this enhancement. FISM is a state-of-the-art model-based recommender system proposed by Kabbur et al [9] . This method learns a low rank factorization of an item-item similarity matrix, which is later used to compute the scores to make the predictions. This method is an evolution of a previous method, SLIM [16] , that learns this matrix without factorizing it. Factorizing the similarity matrix allows FISM to overcome SLIM's limitation of not being able to learn a similarity other than zero for items that have never been rated both by at least one user. As a side effect of this factorization, it lowers the space complexity from O( |I|. It also drops the non-negativity constraint and the constraint that the diagonal of the similarity matrix has to contain zeroes. As a consequence of these changes, the optimization problem can be solved using regular gradient descent algorithms, instead of the coordinated gradient descent used by SLIM, leading to faster training times. Embedding models allow transforming high-dimensional and sparse vector representations, such as classical one-hot and bag-of-words, into a space with much lower dimensionality. In particular, previous word embedding models, that produce fixed-length dense representations, have proven to be more effective in several NPL tasks [14, 15, 19] . Recently, prefs2vec [21] , a new embedding model for obtaining dense user and item representations, an adaptation of the CBOW model [14] , has shown that these embeddings can be useful for the top-N recommendation task. When used with a memory-based recommender, they are more efficient than the classical representation [21] . The results show that not only they can improve the accuracy of the results, but also their novelty and diversity. The versatility of this embedding model, in particular of the underlying neural model and the way it is trained, is also shown in [12] . Here the prediction capabilities of the neural model are used directly in a probabilistic recommender. In this section, we present our method to enhance diversity and novelty in recommendation, explaining how the model is trained and used to produce recommendations. Firstly, we introduce how the product of user and item embeddings (based on prefs2vec) can be used to make recommendations, which is later used as part of the proposal. As representations of users and items in a space with much lower dimensionality, prefs2vec embeddings can be viewed as latent vectors. However, there is no sense in multiplying both item and user vectors as they have different basis even when they have the same dimensions. This is a consequence of learning the item and user representations independently, how prefs2vec initializes the parameters of the model and how the training is performed. However, it is possible to make this product if we can compute a change of basis matrix T ∈ R d×d to transform the user embeddings into the item embeddings space. This way we can calculate an estimated ratings matrixR using the simple matrix multiplication: where E ∈ R |U |×d is the matrix of user embeddings, and F ∈ R |I|×d is the matrix of item embeddings, one embedding in each row. The transformation matrix T is learned by solving the optimization problem with 2 regularization: where R is the ratings matrix and β e is the regularization hyperparameter. This problem can be solved using gradient descent algorithms. Once the transformation matrix has been trained, recommendations can be produced by computing the estimated rating matrixR as described in Eq. 1. Recommendations are made to each user by sorting the corresponding row and picking the top-N items not already rated by the user. We dubbed this recommender ELP, short for Embedding Linear Product, and we present its performance in Table 3 in the experiments section. We have seen that linear methods, like FISM, can obtain good accuracy figures. On the other side, as results in Table 3 show, ELP is able to provide good figures in novelty and diversity, thanks to the embedding model capturing non-linear relations between users and items. We propose to capture both properties by joining the models together in the EER model (Embedding Enhanced Recommender). We choose the RMSE variant of FISM as it matches the loss used in ELP. We also use a trainable scalar parameter α to joint the models, as the scores obtained from each recommender need not be on the same scale. This results in the following equation to calculate the estimated ratings matrix:R where P ∈ R |I|×k and Q ∈ R k×|I| are the low rank factorization of the item-item similarity matrix. The parameters of the model, P , Q, T and α, are learned by solving the joint 2 regularized optimization problem resulting from the previous joint equation, using standard gradient descent algorithms: minimize P ,Q ,T ,α Similar to the case of ELP, once the parameters are learned, we make the recommendations by calculating the estimated ratings matrix using Eq. 3, sorting each row and picking the top-N items not yet rated by the user corresponding to that row. In this section, we introduce the datasets used to perform our experiments, the evaluation protocol followed and the metrics used. After that, we present the results of our experiments. To evaluate our proposal, we conducted a series of experiments on several datasets, from different domains: the MovieLens 20M dataset 2 , a movie dataset, the book dataset LibraryThing, and the BeerAdvocate dataset 3 , consisting of beer reviews. Table 1 shows statistics of each collection. In order to perform the experiments, the datasets were divided randomly into train and test sets. The training dataset consisted of 80% of the ratings of each user, with the remaining 20% forming the test dataset. We follow the TestItems evaluation methodology [2] to evaluate the performance. To assess the accuracy of the rankings, we use Normalized Discounted Cumulative Gain (nDCG), using the standard formulation as described in [23] , with the ratings in the test set as graded relevance judgments. We considered only items with a rating of 4 or more, on a 5 point scale, to be relevant for evaluation purposes. We also measured the diversity of the recommendations using the complement of the Gini index [5] . Finally, we use the mean self-information (MSI) [24] to assess the novelty of the recommendations. All the metrics are evaluated at cut-off 100 because it has shown to be more robust with respect to the sparsity and popularity biases than sallower cut-offs [20] . We perform a Wilcoxon test [18] to asses the statistical significance of the improvements regarding nDCG@100 and MSI@100, with p < 0.01. We cannot apply it to the Gini index because we are using a paired test and Gini is a global metric. Results in Table 3 are annotated with their statistical significance. We performed a grid search over the hyperparameters of the original model and our proposal tuning them to maximize nDCG@100. Although we aim to increase diversity and novelty, we want the recommendations to be effective, which is why the tuning is done over accuracy. For the parameters of the prefs2vec model, we took those that performed better in [21] . For reproducibility's sake, values for the best hyperparameters for each collection can be consulted in Table 2 . Table 3 shows the values of nDCG@100, Gini@100 and MSI@100 for FISM, EER and ELP. The results show that EER outperforms the baseline (FISM) on both novelty and diversity. It also surpasses it on accuracy on the MovieLens 20M and LibraryThing datasets. In the case of diversity, we can see important Table 2 . Best values of the hyperparameters for nDCG@100 for FISM and our proposals EER and ELP. LibraryThing BeerAdvocate FISM β = 1, k = 1000 β = 1000, k = 1000 β = 50, k = 1000 ELP βe = 0.1 βe = 10 βe = 10 EER β = 0.1, βe = 1, k = 1000 β = 500, βe = 10, k = 1000 β = 10, βe = 1, k = 1000 improvements. ELP, on the other hand, obtains the best diversity and novelty values, but this comes with a big reduction in accuracy. It is common in the field of recommender systems for methods with lower accuracy to have higher values in diversity and novelty. We believe that the ability of the embeddings to find nonlinear relationships contributes to the model novelty and diversity. This property of the model allows it, for example, to discover relationships between popular and not so popular items leading to better diversity. Moreover, the integration in the linear model allows to keep its advantage in terms on accuracy, clearly suparssing the use of embeddings in isolatation (ELP). In this paper, we presented EER, a method to enhance an existing recommendation algorithm to produce recommendations that are both more diverse and novel, while maintaining similar levels on accuracy. This process is done by combining two models, a linear one that is able to obtain good levels of accuracy, with a model based in an embedding technique that extracts non-linear relationships, allowing it to produce more diverse and novel recommendations. As future work, we plan to apply the same technique to other recommender systems, examining if it can be applied in general to enhance the recommendations, independently of the base algorithm chosen for the task. We also envision studying the effects that varying the value of α in Eq. 3 has on the recommendations. Fab: content-based, collaborative recommendation Precision-oriented evaluation of recommender systems The netflix prize Performance of recommender algorithms on top-n recommendation tasks Blockbuster culture's next rise or fall: the impact of recommender systems on sales diversity Beyond accuracy: evaluating recommender systems by coverage and serendipity Semantics-aware content-based recommender systems Recommender Systems Handbook Evaluating collaborative filtering recommender systems FISM: factored item similarity models for top-n recommender systems Advances in collaborative filtering When diversity met accuracy: a story of recommender systems PRIN: a probabilistic recommender with item priors and neural models Being accurate is not enough: how accuracy metrics have hurt recommender systems Efficient Estimation of Word Representations in Vector Space Distributed representations of words and phrases and their compositionality Slim: sparse linear methods for top-n recommender systems A comprehensive survey of neighborhoodbased recommendation methods Using score distributions to compare statistical significance tests for information retrieval evaluation Glove: global vectors for word representation On the robustness and discriminative power of information retrieval metrics for top-n recommendation Collaborative filtering embeddings for memory-based recommender systems Item-based relevance modelling of recommendations for getting rid of long tail products. Knowl.-Based Syst A theoretical analysis of NDCG ranking measures Solving the apparent diversity-accuracy dilemma of recommender systems