liu-relative-2016.pdf Relative Preference-based Recommender Systems By Shaowu Liu BIT (Hons) Submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Deakin University June, 2016 sfol Retracted Stamp sfol Retracted Stamp To my unborn baby... iv Table of Contents Table of Contents v List of Tables viii List of Figures ix Acknowledgements x Publication List xi Abstract xiii 1 Introduction 1 1.1 Background and Motivation . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Research Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Overview of the Proposed Methodology . . . . . . . . . . . . . . . . . 5 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2 Literature Review 9 2.1 Notation and Problem Formulation . . . . . . . . . . . . . . . . . . . 9 2.2 Recommendation Techniques . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.1 Content-based Recommender Systems . . . . . . . . . . . . . 12 2.2.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . 14 2.2.3 Context-Aware Recommendation . . . . . . . . . . . . . . . . 16 2.2.4 Graph-based Recommendation . . . . . . . . . . . . . . . . . . 17 2.2.5 Trust-based Recommendation . . . . . . . . . . . . . . . . . . 18 2.3 Relative Preference-based Recommender Systems . . . . . . . . . . . 18 2.3.1 Notation and Problem Statement . . . . . . . . . . . . . . . . 20 2.3.2 Memory-based Models . . . . . . . . . . . . . . . . . . . . . . 24 v 2.3.3 Model-based Models . . . . . . . . . . . . . . . . . . . . . . . 25 2.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.4.1 Accuracy Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 29 2.4.2 Diversity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.3 Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 2.4.4 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.4.5 Metrics for Relative Preference-based Models . . . . . . . . . . 31 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3 Ordinal Random Fields for Recommender Systems 34 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.2.1 Matrix Factorization . . . . . . . . . . . . . . . . . . . . . . . 39 3.2.2 Ordinal Matrix Factorization . . . . . . . . . . . . . . . . . . 40 3.2.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.3 Ordinal Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.3.1 Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . 43 3.3.2 ORF: Unifying MRF and OMF . . . . . . . . . . . . . . . . . 45 3.3.3 Feature Design . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.3.4 Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . 47 3.3.5 Preference Prediction . . . . . . . . . . . . . . . . . . . . . . . 48 3.4 Experiment and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 49 3.4.2 Result and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 52 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4 Preferecen Relation-based Recommender System 58 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.2.1 Preference Relation . . . . . . . . . . . . . . . . . . . . . . . . 62 4.2.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 63 4.2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3.1 Preference Relation-based Matrix Factorization . . . . . . . . 67 4.3.2 Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . 70 4.3.3 Conditional Random Fields . . . . . . . . . . . . . . . . . . . 73 4.3.4 PrefCRF: Unifying PrefNMF and CRF . . . . . . . . . . . . . 76 4.4 Experiment and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.4.1 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 85 vi 4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5 Learning from Heterogeneous Data Sources for Improved Top-N Recommendation 97 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 5.2 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 5.2.1 Heterogeneous Sources . . . . . . . . . . . . . . . . . . . . . . 99 5.2.2 Preference Relation . . . . . . . . . . . . . . . . . . . . . . . . 100 5.2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 5.3 Preference Relation-based Conditional Random Fields . . . . . . . . . 104 5.3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.2 Preference Relation-based Matrix Factorization . . . . . . . . 105 5.3.3 Conditional Random Fields . . . . . . . . . . . . . . . . . . . 106 5.3.4 Ordinal Logistic Regression . . . . . . . . . . . . . . . . . . . 109 5.3.5 PrefCRF: Unifying PrefNMF and CRF . . . . . . . . . . . . . 110 5.4 Experiment and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.4.1 Experimental Settings . . . . . . . . . . . . . . . . . . . . . . 115 5.4.2 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 118 5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 6 Conclusion and Future Work 127 6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 Bibliography 130 vii List of Tables 2.1 Notations used in rating-based RecSys . . . . . . . . . . . . . . . . . 11 2.2 Capabilities of existing methods . . . . . . . . . . . . . . . . . . . . . 19 2.3 Notations used in Relative Preference-based RecSys . . . . . . . . . . 23 3.1 Summary of Major Notations (Chapter 3) . . . . . . . . . . . . . . . 38 3.2 Performance of ORF model . . . . . . . . . . . . . . . . . . . . . . . 53 3.3 Paired t-test for the ORF and the OMF. . . . . . . . . . . . . . . . . 53 4.1 Summary of Major Notations (Chapter 4) . . . . . . . . . . . . . . . 65 4.2 Capabilities of PR-based methods . . . . . . . . . . . . . . . . . . . . 67 4.3 Results on MovieLens-1M dataset . . . . . . . . . . . . . . . . . . . . 86 4.4 Results on EachMovie dataset . . . . . . . . . . . . . . . . . . . . . . 87 4.5 Paired t-test for PrefMRF and PrefNMF. . . . . . . . . . . . . . . . . 93 5.1 Performance of PrefCRF on MovieLens-1M dataset . . . . . . . . . . 120 5.2 Results over ten runs on Amazon dataset. . . . . . . . . . . . . . . . 121 5.3 Results over ten runs on EachMovie dataset. . . . . . . . . . . . . . . 122 5.4 NDCG@10 on MovieLens-20M dataset. . . . . . . . . . . . . . . . . . 123 viii List of Figures 2.1 An overview of Recommender System . . . . . . . . . . . . . . . . . . 13 3.1 Example of undirected graphs . . . . . . . . . . . . . . . . . . . . . . 44 3.2 Impact of Minimum Correlation Threshold . . . . . . . . . . . . . . . 54 3.3 Impact of Minimum Correlation Threshold . . . . . . . . . . . . . . . 55 3.4 Impact of Regularization Coefficient . . . . . . . . . . . . . . . . . . . 56 4.1 Example of MRF graphs . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.2 Example of CRF graphs . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.3 Performance of different position T on MovieLens-1M dataset (Sparse). 89 4.4 Performance of different position T on MovieLens-1M dataset (Dense). 90 4.5 Performance of different position T on EachMovie dataset (Sparse). . 91 4.6 Performance of different position T on EachMovie dataset (Dense). . 92 4.7 Impact of Sparsity Levels. . . . . . . . . . . . . . . . . . . . . . . . . 94 4.8 Impact of Parameters (MovieLens-1M) . . . . . . . . . . . . . . . . . 95 5.1 Flow from user preferences to PR . . . . . . . . . . . . . . . . . . . . 101 5.2 Undirected graphs for users u and v . . . . . . . . . . . . . . . . . . . 108 5.3 Varying sparsity on MovieLens-1M dataset. . . . . . . . . . . . . . . 121 5.4 Varying position T on EachMovie dataset. . . . . . . . . . . . . . . . 123 5.5 Varying biases on MovieLens-20M dataset. . . . . . . . . . . . . . . . 124 5.6 Varying parameters on MovieLens-1M. . . . . . . . . . . . . . . . . . 125 ix Acknowledgements I would like to thank Dr. Gang Li, my supervisor, for his many suggestions and constant support during this research. I am also thankful to my associate supervisors Prof. Gleb Beliakov and A/Prof. Ping Xiong for useful suggestions and friendly encouragement. I would like to thank my research mates and friends: Dr. Jia Rong, Dr. Huy Quan Vu, Dr. Truyen Tran, Dr. Veelasha Moonsamy, Dr. Yongli Ren, Dr Tianqing Zhu, and so many more. I am also thankful to the support staff, in particular Ms. Lauren Browne, Ms. Judy Chow for their administrative help. Of course, I am grateful to the patience and love from my parents, who gave me birth at the first place and raised me up. I am also thankful for my parents in law for their support throughout my research studies. Last but not the least, I would like express my sincere gratitude to my beautiful wife, Manchun Li, for her patience and love. Without her everlasting encouragement and understanding, this work would never have come into existence. Melbourne, Australia Shaowu Liu June 2016 x Publication List Refereed Journal Articles 1. Shaowu Liu, Gang Li, Truyen Tran, and Yuan Jiang. Preference Relation- based Markov Random Fields for Recommender Systems. Machine Learning. (ERA 2010 ranking A*) 2. Veelasha Moonsamy, Jia Rong, Shaowu Liu, Mining Permission Patterns for Contrasting Clean and Malicious Android Applications. Future Generation Computer Systems, 2014, 36: 122–132. (ERA 2010 ranking A) 3. Shaowu Liu, Rob Law, Jia Rong, Gang Li, and John Hall. Analyzing Changes in Hotel Customers’ Expectations by Trip Mode. International Journal of Hos- pitality Management, 2013, 34: 359–371. (ERA 2010 ranking A) 4. Gleb Beliakov, Gang Li, Shaowu Liu, Parallel bucket sorting on Graphics Processing Units based on convex optimization. Optimization, 2015, 64(4): 1033-1055. (Impact Factor: 0.936) 5. Huy Quan Vu, Gang Li, Nadezda S. Sukhorukova, Gleb Beliakov, Shaowu Liu, Carole Philippe, Helene Amiel, Adrien Ugon. K-complex Detection using a Hybrid-Synergic Machine Learning method. IEEE Transactions on Sys- tems, Man, and Cybernetics Part C: Application and Reviews (IEEE TSMCC), 2012, 42(6): 1478–1490. (Impact Factor: 2.171) xi xii 6. Huy Quan Vu, Shaowu Liu, Xinghua Yang, Zhi Li, Yongli Ren. Identify- ing Microphone from Noisy Recordings by Using Representative Instance One Class-Classification Approach. Journal of Networks (JNW), 2012, 7(6): 908- 917. (ERA 2010 ranking A) Refereed Conference Papers 1. Shaowu Liu, Gang Li, Truyen Tran, Yuan Jiang. Preference Relation-based Markov Random Fields. In Proceedings of the 7th Asian Conference on Machine Learning (ACML 2015), pp. 157-172, Journal of Machine Learning Research: workshop and conference proceedings, 2015. 2. Shaowu Liu, Truyen Tran, Gang Li, Yuan Jiang. Ordinal Random Fields for Recommender Systems. In Proceedings of the 6th Asian Conference on Machine Learning (ACML 2014), pp. 283-298, Journal of Machine Learning Research: workshop and conference proceedings, 2014. 3. Veelasha Moonsamy, Jia Rong, Shaowu Liu, Gang Li, Lynn Batten. Contrast- ing Permission Patterns between Clean and Malicious Android Applications. In Proceedings of the 9th International Conference on Security and Privacy in Communication Networks (SecureComm 2013). 4. Huy Quan Vu, Shaowu Liu, Zhi Li, Gang Li. Microphone Identification using One Class-Classification Approach. In Proceedings of the 2nd Workshop on Applications and Techniques in Information Security (ATIS 2011). Book Chapter 1. Gleb Beliakov and Shaowu Liu. Parallel Monotone Spline Interpolation and Approximation on GPUs. Designing Scientific Applications on GPUs, CRC Press, Taylor and Francis Group, 2013, 295–310. Abstract Recommender systems aim to recommend users with some of their potentially inter- esting items by exploiting various information, especially the absolute ratings. Nev- ertheless, recent literature has suggested that rating-based systems are less reliable comparing to those based on relative preferences, i.e., “which one is better?” instead of “what do you think of this one?” However, a problem of these emerging relative preference-based models is that they consider either the second order interactions, such as similarities between users, or the higher order interactions, such as latent factors. This limitation reduces the performance of relative preference-based systems as the two types of interactions are complementary. On the other hand, due to the change of input format, existing relative preference-based systems do not consider side information such as user profiles and item content, which can be helpful to further improve the performance. Furthermore, the potential of relative preference-based systems to merge heterogeneous data sets was not identified in literature, which can help alleviate the cold-start problem of having limited information for new users or items. In this thesis, we tackle these three issues. We propose a novel model to exploit the ordinal properties possessed by ratings, where both the second and higher order interactions are considered. In this model, ratings are no longer considered as num- bers, but a sequence of ordinal labels. The proposed model used Markov Random Fields to combine two types of interactions. Another type of relative preference is Preference Relation (PR), i.e., compar- isons of items. For PR-based systems, we proposed a modified version of Markov xiii xiv Random Fields which accepts PR instead of ordinal preferences, by converting PR into user-wise preferences, and then into ordinal distributions through ordinal logis- tic regression. This process produces the first PR-based recommender system that captures both types of interactions. For incorporating side information, we extended the Markov Random Fields to Conditional Random Fields, in which the users profiles and item content are considered by designing new features. Despite of improving existing PR-based systems, we also identified a great po- tential of such systems to merge heterogeneous data sets. Specifically, data sets in different format, such as 5-star ratings, binary ratings, page views, and mouse clicks can all be converted into PR format and used by PR-based systems. This observation makes it possible to alleviate the cold-start problem by generating a much denser data set, which could not be done for rating-based systems. To evaluate the performance of proposed models, we conducted experiments on different public data sets against the state-of-the-art relative preference-based models measured by different metrics. The results presented in the experiment sections of each chapter show statistically significant improvement over existing models. The main contributions of this research are proposing the first relative preference-based models that can capture both types of interactions, and using PR-based models to alleviate the cold-start problem. Keywords: Recommender Systems, Preference Relation, Collaborative Filtering, Relative Preference, Ordinal Preference Chapter 1 Introduction 1.1 Background and Motivation Recommender Systems (RecSys) aim to suggest items (books, movies, tourism attrac- tions, etc.) that are potentially to be liked by the user. To identify the appropriate items, RecSys use various sources of information, such as historical ratings given by users [38] or content of items [5]. RecSys were originally designed for users with in- sufficient personal experience or with limited knowledge on the items. However, with the rapid expansion of Web 2.0 and e-commerce, overwhelming number of items are offered, and now every user can be benefited from RecSys [66]. Over the last decade there have been rapid advances in RecSys, from both academia and industry [7,20,37,44,49,76]. One of the most important events in RecSys was the one million Netflix Prize [7] launched in 2006, which sought for RecSys that out- perform Netflix company’s own RecSys. The dataset released in this competition contains historical ratings on movies given by individuals. The Netflix dataset, to- gether with other datasets released by companies such as Amazon and Yahoo! have become the popular benchmark datasets in this field. Due to the extensive use of 1 2 these datasets, which contain ratings, most RecSys to date are designed to exploit ratings [40,41,69]. However, user feedbacks are not always expressed in form of absolute ratings, and it is often expensive to collect such explicit feedbacks. Furthermore, studies [12,42] have reported that absolute ratings may not be completely trustworthy. For example, the rating 4 out of 5 may in general indicate high quality, but it can mean just OK for critics. In fact, users’ quantitative judgment can be affected by a number of irrelevant factors such as the mood when rating, and in psychology this is called misattribution of memory [71]. While users are not good at making consistent quantitative judgment, the relative preferences such as ordinal preferences [42,46,79,82] and preference relations (PR) [12, 18,19] have been considered as more consistent form of feedbacks across like-minded users. For example, by measuring the relative order between items, the PR is usually less variant to irrelevant factors: a user in bad mood may give lower ratings to all items but the relative orderings between items remain the same. Being a more reliable type of user preferences, PR is also easier to collect comparing to ratings as it can be inferred from implicit feedbacks. For example, the PR between two items can be inferred by comparing their ratings, page views, played counts, mouse clicks, etc. This property is important as not all users are willing to rate their preferences, where collecting feedbacks implicitly delivers a more user-friendly recommender system. In addition, as the ultimate goal of RecSys, obtaining the ranking of items by itself is to obtain the relative preferences, a more natural input than absolute ratings [42,81]. Despite of its potential, the newly emerged relative preference-based RecSys pro- vides less features comparing to the well-established rating-based RecSys. Meanwhile, 3 relative preference-based RecSys provides an alternative view of user preferences, thus can be used to resolve issues of rating-based RecSys. Currently, relative preference- based RecSys still faces the following unresolved issues: • Different Structures in User Preferences: Existing recommendation techniques can be largely divided into two forms: memory-based [65,69] and model-based [38]. Memory-based approaches focus on capturing the second-order interactions be- tween similar users [65] or items [69]. This type of information is called Local Structure (LS) of user preferences. On the other hand, model-based approaches focus on discovering the weaker but higher-order interactions among all users and items. This type of information is called Global Structure (GS) of user preferences. Previous studies have suggested that these two types of struc- ture are complementary since they address different aspects of the preferences [38,46,79]. However, there is yet no relative preference-based RecSys that can capture both LS and GS. • Side Information of User Preferences: While existing recommendation tech- niques focus on exploiting user preferences, side information such as item con- tent and user attributes [79] are also shown to be useful in improving recom- mendation quality. However, due to the change of input format, there is yet no relative preference-based RecSys can incorporate side information. • User Preferences from Heterogeneous Sources: Last decade has seen a growing trend towards creating and managing more profiles in Online Social Networks. User are now providing feedbacks on different platforms in different formats, such as 5-star ratings, thumbs up/down, as well as implicitly as mouse clicks. 4 These rich, but heterogeneous, user preferences provide an opportunity of allevi- ate the cold-start problem [72]. However, existing recommendation techniques usually assume the user preferences are in the same format, and therefore are unable to exploit these heterogeneous user preferences. The first two issues have constrained the potentials of relative preference-based RecSys, while the third issue is faced by all existing recommendation techniques. This thesis aims to address these issues to make relative preference-based ReSys more effective and applicable. 1.2 Research Objectives The objective of this work is to overcome the aforementioned weaknesses of existing relative preference-based RecSys, as well as resolving the heterogeneous data sources issue of traditional RecSys. More specifically, the research objectives of this work are: • Learning Local and Global Structures: Capturing both the local and global structures of user preferences have been done in rating-based recommendation techniques [38]. However, existing approaches are not directly applicable to relative preference-based RecSys as the format of input has changed. Recent advances in Markov Random Fields-based RecSys [79] have made it possible to capture both structures in a principled way by utilizing the flexibility of graphical models. This thesis will investigate how the two structures can be compiled into a single model in a probabilistic manner. • Incorporating Side Information: How to incorporate side information such as 5 item content and user attributes is a problem for relative preference-based Rec- Sys. In fact, no existing relative preference-based RecSys has attempted this task as these models are designed particularly for user preferences and have no flexibility to incorporate side information in a proper way. On the other hand, Conditional Random Fields [79], as an the extended version of Markov Ran- dom Fields, can easily incorporate side information in a probabilistic manner. However, it remains unknown how Conditional Random Fields can accept rel- ative preferences as input. This thesis will investigate how to design a relative preference-based Conditional Random Fields model. • Learning from Heterogeneous Data Sources: How to unify user preferences in different formats has been a problem for traditional rating-based RecSys. The main difficulty is that there is no suitable method to convert user preferences among formats without introducing noises. Furthermore, some conversions are impractical, such as converting mouse clicks to 5-star ratings. Fortunately, the relative preference provides a unified interface for all kinds of user preferences, where both mouse clicks and 5-star ratings can be converted into pairwise preference relations. In this thesis, we will investigate how to learn from het- erogeneous data sources using relative preference-based RecSys. 1.3 Overview of the Proposed Methodology Firstly, to address the problem of learning both local and global structures, we propose two Markov Random Fields-based models to capture and unify both the LS and GS information. Specifically, the proposed model employs Markov Random Fields 6 (MRF) to investigate the LS information while the Ordinal Matrix Factorization (OMF) captures the GS information. In this way, we take advantages of both the representational power of the MRF and the ease of modeling ordinal preferences by the OMF. Experimental result on public datasets demonstrates that the proposed model can capture both types of interactions, resulting in improved recommendation accuracy. On the other hand, when the input format is pairwise preference relation, the Preference Relation-based Markov Random Fields model is proposed to deal with the input format of pairwise comparisons of items. Secondly, to address the problem of incorporating side information, we extended the proposed Markov Random Fields-based models to Conditional Random Fields- based models, in which the side information are modeled as global observations of the graphical models. We performed experiments on public datasets and demonstrate that side information has been properly incorporated, and significantly improved recommendation performance has been achieved and validated by statistical tests. Finally, to address the problem unifying information from heterogeneous data sources, we employed the several models to convert and exploit user preferences of different formats. Specifically, all types of user preferences are converted into the unified preference relation format and modeled by our proposed models. Experiment results on public datasets demonstrate that our solutions to unifying data from het- erogeneous sources have successfully minimized the noises information introduced, resulting improved recommendation quality, especially in cold-start cases where each data source provides a limited amount of data. 7 1.4 Thesis Outline This section presents the overall organization of this thesis. As the objective of this thesis is to address the problems of relative preference-based RecSys, the content of each chapter is organized as follows: • Chapter 2 presents a comprehensive survey on recommender systems in gen- eral with a focus on relative preference-based RecSys. Specifically, the relevant concepts, assumptions, and emerging research issues in this area will be dis- cussed. Efforts have been made to identify current and future issues of relative preference-based RecSys. • Chapter 3 focuses on resolving the issue of learning from both the local and global structures in ordinal user preferences. This chapter specifically inves- tigates the scenario of using ordinal type of preference as input. An Ordinal Random Fields (ORF) method is proposed to capture and unify both types of structures in a principled way. Experiments on multiple public datasets are con- ducted to show that the proposed method effectively improves the performance of recommendation by utilizing both types of structures. • Chapter 4 proposes a novel Preference Relation-based Markov Random Fields model to address the issue of learning from both the local and global structures in preference relations. This chapter also proposes a Preference Relation-based Conditional Random Fields model, which incorporates side information of users and items. The proposed model does not rely on ratings but pairwise compar- isons of items, thus offers better reliability and can be applied to a wider range of applications. To validate its performance, we conducted experiments on several 8 public datasets together with side information, and performance improvements have been confirmed statistically. • Chapter 5 addresses the issue of learning from multiple heterogeneous data sources. This chapter identifies and formalizes the heterogeneous data sources problem, and proposes the preference relation-based method to unify heteroge- neous data. With consideration of multiple data sources, the proposed method can reduce the effect of cold-start problem where each data source provides limited amount of data. With the help of the proposed method, implicit user preferences such as page views and mouse clicks can be easily exploited to alle- viate the cold-start problem. • Chapter 6 summarizes the contributions of this thesis, as well as discusses some possible extensions and directions of future research. To maintain readability, some essential concepts, definitions, and motivations are recounted in each chapter to make it self-contained. For basic concepts of recom- mender systems, readers may refer to the recommender system handbook [66]. Chapter 2 Literature Review This chapter is devoted to provides an extensive literature review on recommender systems by racing the trends and directions of current research. We chronologically review contributions along each research direction regarding recommender systems with a focus on relative preference-based RecSys. Specifically, Section 2.1 introduces the basic notations and related concepts of recommender systems. Section 2.2 reviews popular recommendation techniques along with the latest developments. Section 2.3 focuses on relative preference-based RecSys, and will introduce the recent develop- ments on this emerging topic. Section 2.4 presents evaluation metrics that are used to evaluate recommendation performance. 2.1 Notation and Problem Formulation RecSys use historical data to predict future interest in items by users. Two objects are involved in RecSys: items and users. Let U = {u1,u2, ...,um} denote a set of m users, and T = {t1, t2, ..., tn} denote a set of n items, such as books, movies, etc. The interest of user u ∈ U in item t ∈ T is encoded as the preference ru,t ∈ R, where 9 10 R captures the known preferences for all users U. A typical form of preferences is the ratings (e.g. 1 − 5 stars), though many other forms exist, such as like/dislike, clicked/not clicked, etc. Definition 1 (Recommender System). Given item collection T and known preferences R of all users U, RecSys aims to identify the item t̂ ∈ T that maximizes the preference rua,t of the active user ua ∈ U [1]: t̂ = arg max t∈T (rua,t) (2.1.1) This definition often implies that individual items are suggested to the individual users, however, real-world applications may require suggesting a set of items and/or to a group of users. To handle such cases, Definition 1 needs to be extended, how- ever, it remains a challenging task to making recommendations to groups of users or recommending a set of items. For ease of reference, notations used by rating-based RecSys are summarized in Table 2.1. Over the last decade, the development on RecSys has been carried out along two research lines: Recommendation Techniques and Evaluation Metrics. Works on the recommendation techniques focus on how to generate recommendations based on various information sources, ranging from the item content, the known preferences, to more recent sources such as the context [2] and the social trust [28]. After the recommendations have been generated, the next task is to evaluate the quality of recommendations using evaluation metrics. Evaluating common machine learning tasks such as classification are in general less difficult as the ground truth is available to assess the predictions. Accuracy metrics such as mean absolute error (MAE) are often employed to assess the performance of machine learning tasks as well as the RecSys. However, it becomes tricky in RecSys where the ground truth 11 Table 2.1: Notations used in rating-based RecSys Notations Mathematical Meanings U set of all users U = {u1,u2, ...,un} T set of all items T = {t1, t2, ..., tm} G a user group, G ⊆ U K an item package, K ⊆ T R available preferences data of all users ru,t the preference of user u on item t R(ux) the set of items rated by user ux S(tx) the set of users rated item tx Txy the set of items co-rated by user ux and uy rG,K the group G’s preference on package K Ir(G,K) the inter-relevance between group G and package K Ih(K) the aggregation of inherence properties of package K |·| the cardinality of the set ·̂ the prediction, e.g. t̂ is the predicted item t ·̄ the arithmetic mean 12 (user’s satisfaction) may not be well represented by the preferences data such as ratings. For example, a user rated 5-star for the movie Titanic but he/she may not want to watch Titanic Extended Version, but a RecSys focuses on ratings may still consider Titanic Extended Version as a 5-star recommendation. For this reason, research of RecSys has gone beyond the accuracy metrics to many novel metrics such as diversity [11], novelty [25], etc. Figure 2.1 provides an overview of recommendation techniques and evaluation metrics in RecSys. 2.2 Recommendation Techniques Recommendation techniques aims to identify the right item for the user, where two fundamental approaches are Content-based methods [62] and Collaborative Filtering methods [65]. Conventionally, Content-based methods generate recommendations by exploiting regularities in the item content, while Collaborative Filtering methods generate recommendations based on available preferences data of users. More recent approaches are exploiting extra information such as the context [2] and the social trust [28]. In this section, we briefly review these recommendation techniques. 2.2.1 Content-based Recommender Systems Content-based methods generate recommendations for the active user ua based on the contents of related items, where other users’ information is not utilized. The basic idea is to identify the unrated items that are similar to the active user’s highly rated items. 13 R ec om m en de r S ys te m R ec om m en da tio n T ec hn iq ue s E va lu at io n M et ric s C on te nt -b as ed C ol la bo ra tiv e F ilt er in g C on te xt -A w ar e G ra ph -b as ed T ru st -b as ed M em or y- ba se d M od el -b as e S up er vi se d le ar ni ng U ns up er vi se d le ar ni ng M at rix F ac ot riz at io n M as sa e t a l., 2 00 4 G uy e t a l., 2 00 9 et c.S al to n, 1 98 9 et c. N ak am ur a et a l., 1 99 8 R es ni ck e t a l., 1 99 4 et c. M iy ah ar a et a l., 2 00 0 et c. H of m an n et a l., 1 99 9 S ar w ar e t a l., 2 00 2 et c. K or en , 2 00 8 et c. A do m av ic iu s et a l., 2 01 1 K or en , 2 00 9 et c. Z ho u et a l., 2 00 7 et c. A cc ur ac y M et ric s S ar w ar e t a l., 2 00 1 Z ho u et a l., 2 00 7 et c. D iv er si ty G e et a l., 2 01 0 Z ho u et a l., 2 00 8 et c. N ov el ty Lü e t a l., 2 01 2 et c. C ov er ag e G e et a l., 2 01 0 et c. S ta bi lit y A do m av ic iu s et a l., 20 12 et c. Le ge nd S ub je ct M et ho d R ep re se nt at iv e W or ks B ey on d A cc ur ac y F ig u re 2 .1 : A n o v er v ie w o f R ec o m m en d er S y st em 14 The prediction of active user’s preference rua,t on unrated item t is calculated based on known preferences of items similar to t. The similarity between two items is measured by comparing the content of the items. For example, two movies can be compared in terms of the actors, directors, genres, etc [47]. For text-based items, the features can be represented by keywords using term frequency/inverse document frequency (TF-IDF) [67]. Given the features, the similarity can be calculated using standard metrics such as the cosine distance. Despite of the simplicity, content-based methods have three limitations. Firstly, it can be difficult to define features or extract content from some types of items, such as audio, videos, and pictures. Secondly, the user will always be recommended with items that are highly similar to the items he/she liked, which leads to the lacking of diversity [11]. Finally, it is difficult to identify items for new users or users with few ratings, and this is referred to as the cold-start problem [72]. 2.2.2 Collaborative Filtering Collaborative Filtering (CF) looks for items highly rated by users similar to the active user. CF methods can be classified into two classes: memory-based methods and model-based methods. Memory-based Methods In memory-based methods [59,65], the preference pre- dictions are based on the entire collection of known preferences. The idea is that similar users should rate the same movie similarly. The preference rua,t of unrated item t for active user ua is calculated based on the preference ruj,t from every user uj ∈ U who is similar to the active user ua. The similarity between two users is defined by comparing their known preferences, and two 15 popular measures are Pearson Correlation Coefficient (PCC) [65] and Vector Space Similarity (VS) [1]: Pearson Correlation Coefficient (PCC) sim(ux,uy) = ∑ ti∈Txy(rxi − r̄x)(ryi − r̄y)√∑ ti∈Txy(rxi − r̄x) 2 ∑ ti∈Txy(ryi − r̄y) 2 (2.2.1) Vector Space Similarity (VS) sim(ux,uy) = cos(ux,uy) = ∑ ti∈Txy rxiryi√∑ ti∈Txy r 2 xi ∑ ti∈Txy r 2 yi (2.2.2) where Txy = {ti ∈ T|rxi �= Ø,ryi �= Ø} denotes the set of items co-rated by both ux and uy. Model-based Methods In contrast to memory-based methods, model-based meth- ods [57,68] will construct a model from the known preferences, and make future recommendations based on the model. Model-based methods often take more training time than memory-based methods, however, they are more efficient in generating recommendations. According to the type of model, model-based methods can be further divided into three classes: supervised learning-based [57], unsupervised learning-based [31,70], and matrix factorization-based [9,38]. Similar to content-based methods, CF also suffers from cold-start problem. In addition, new items rated by a small number of users will have a low chance to be recommended. However, CF has been one of the most popular recommendation techniques for to its efficiency and high quality recommendations. 16 2.2.3 Context-Aware Recommendation Both content-based methods and CF focus on the preferences, however, users’ inter- ests could also be affected by the context. For example, whether a user would like a movie not only depends on the user’s taste, but also the context such as when, where, and with whom. One of the first considered context is temporal information. In 2001, Zimdars et al. [86] treated CF as a uni-variate time series problem, where a user’s next preference is predicted based on the previous preference. However, temporal information did not attract much attention until the successful of timeSVD++ method [39] in Netflix Progress Prize competition. The timeSVD++ method predicts preference of active user ua on item i at time t as: r̂ai(t) = μ + bi(t) + ba(t) + q T i ⎛ ⎝pa(t) + |R(ua)|12 ∑ j∈R(ua) yj ⎞ ⎠ , (2.2.3) where μ denotes the overall average preference, bi(t) is the item’s bias at time t, ba(t) is the user’s bias at time t, R(ua) is the set of items rated by user ua, qi and yj are item-factor vectors, and pa(t) is the user-factor vector at time t, pa(t), qi, and yj are in a joint latent factor space as used in matrix factorization techniques [38]. In this formulation, temporal information is modeled by the time-based bias, and this makes it superior to other competitors. Recently, it has been recognized that temporal is not the only important context and various kinds of context can be exploited to improve the recommendation quality, and this kind of RecSys is referred to as Context-Aware Recommender Systems [2]. 17 2.2.4 Graph-based Recommendation Graph-based methods consider the recommendation task as a link prediction problem of bipartite graph [85]. On bitpartite graph, users U and items T are represented by two sets of nodes, and each pair of < user,item > can be connected with an edge. These edges represent users’ interests in items, and making recommendations is the same as connecting the missing edges. A representative work is the Network-Based Inference (NBI) proposed by Zhou et al. [85] which generates recommendations based on the resource-allocation process. To make predictions for user ua, NBI first initializes the network as: AC(ua, ti) = ⎧⎨ ⎩1 if ti ∈ R(ua) 0 otherwise (2.2.4) where R(ua) denotes the set of items rated by user ua, and AC(ua, ti) is the Allocated Resource (AC) to node of item ti that represents user’s interests. In this initialization, 1 is assigned to every item ti if rated by user ua, and 0 otherwise. After this initial- ization for all users, the allocated resources will be redistributed among all items in the following two steps and the item with the most AC at the final stage will be recommended to the active user ua. Spreading Step In spreading step, all initially allocated resources will flow from all items T to all users U. The resources flow to each user ux is calculated as: AC(ux) = |T |∑ i=1 cxiAC(ux, ti) |S(ti)| (2.2.5) where S(ti) denotes the set of users who rated item ti, AC(ux, ti) denotes the initial resources allocated to item ti by user ux, and cxi is 1 if ux rated item ti and 0 otherwise. 18 Redistribution Step In this step, the resources will flow back to items T from users U. The final resources allocated to each item ti is calculated as: ÂC(ti) = |U|∑ x=1 cxiAC(ui) |Tx| (2.2.6) The item with most resources ÂC(ti) allocated will be recommended to the active user ua. 2.2.5 Trust-based Recommendation Similarity in typical recommendation methods is often defined by standard metrics such as cosine. However, instead of finding recommendations from similar users, it is also reasonable to find recommendations from familiar users [28]. Intuitively, an item liked by the user’s good friend has the potential to be liked by the user. Recent developments in social networks have further revealed the social trust rela- tionships among users, and Massa and Avesani [53] termed this kind of recommender systems as Trust-Aware Recommendation. Empirical results from Guy’s work [28] in- dicated that familiarity-based methods can be superior to similarity-based methods. Despite of the performance comparison, the key advantage of trust-aware methods is that it provides a promising approach to cold-start problem [27,28]. 2.3 Relative Preference-based Recommender Sys- tems 19 User preferences can be modeled in three types: pointwise, pairwise, and listwise. Though RecSys is not limited to pointwise absolute ratings, the recommendation task is usually considered as a rating prediction problem [38, 40, 69, 78]. Recently, a considerable literature [12, 19, 45, 64, 74] has grown up around the theme of rela- tive preferences, especially the pairwise PR. Meanwhile, recommendation task is also shifting from rating prediction to item ranking [56,74,83] in which the ranking itself is also relative preferences. The use of relative preferences has been widely studied in the field of Information Retrieval for learning to rank tasks [21,22,35]. Recently, PR-based [12,19,45,64] and listwise-based [74] RecSys have been proposed. Among them, the PR-based approach is the most popular, which can be further categorized as memory-based methods [12] that capture local structure and model-based methods [19,45,64] that capture global structure. We summarize the capabilities of the existing methods in Table 2.2. Table 2.2: Capabilities of existing methods Method Input Output LS GS Pointwise Memory-based Ratings Ratings � Pointwise Model-based Ratings Ratings � Pointwise Hybrid Ratings Ratings � � Pairwise Memory-based Preference Relations Item Rankings � Pairwise Model-based Preference Relations Item Rankings � 20 2.3.1 Notation and Problem Statement Preference Relation A preference relation (PR) encodes user preferences in form of pairwise ordering between items. This representation is a useful alternative to absolute ratings for three reasons. Firstly, PR is more consistent across like-minded users [12,19] as it is invariant to many irrelevant factors, such as mood. Secondly, PR is a more natural and direct input for Top-N recommendation, as both the input and the output are relative preferences. Finally, and perhaps most importantly, PR can be obtained implicitly rather than asking the users explicitly. For example, the PR over two Web pages can be inferred by the stayed time, and consequently applies to the displayed items. This property is important as not all users are willing to rate their preferences, where collecting feedbacks implicitly delivers a more user-friendly RecSys. In addition, PR- based RecSys provides an opportunity to utilize the vast amount of implicit data that have already been collected over the years, such as activity logs. With these potential benefits, we shall take a closer look at the PR, and investigate how they can be utilized in RecSys. We formally define the PR as follows. Let U = {u}n and I = {i}m denote the set of n users and m items, respectively. The preference of a user u ∈ U between items i and j is encoded as πuij, which indicates the strength of user u’s PR for the ordered item pair (i,j). A higher value of πuij indicates a stronger preference on the first item over the second item. 21 Definition 2 (Preference Relation). The preference relation is defined as πuij = ⎧⎪⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎪⎩ (2 3 ,1] if i � j (u prefers i over j) [1 3 , 2 3 ] if i � j (i and j are equally preferable to u) [0, 1 3 ) if i ≺ j (u prefers j over i) (2.3.1) where πuij ∈ [0,1] and πuij = 1 − πuji. This definition is similar to [19], however, we allocate an interval for each pref- erence category, i.e., preferred, equally preferred, and less preferred. Indeed, each preference category can be further break down into more intervals. Similar to [12], the PR can be converted into user-wise preferences over items. Definition 3 (User-wise Preference). The user-wise preference is defined as pui = ∑ j∈Iu[[πuij > 2 3 ]] − ∑ j∈Iu[[πuij < 1 3 ]] |Πui| (2.3.2) where [[·]] gives 1 for true and 0 for false, and Πui is the set of user u’s PR related to item i. The user-wise preference pui falls in the interval [−1,1], where −1 and 1 indicate that item i is the least or the most preferred item for user u, respectively. The user- wise preference measures the relative position of an item for a particular user, which is different from absolute ratings. Preference relation has been widely studied in the field of Information Retrieval [14, 21,22,35]. Nevertheless, PR-based RecSys have only emerged recently [12,19,45,64]. Problem Statement Generally, the task of PR-based RecSys is to take PR as input and output Top-N recommendations. Specifically, let πuij ∈ Π encode the PR of each user u ∈ U. 22 Each πuij is defined over an ordered item pair (i, j), denoting i ≺ j, i � j, or i � j as described in Eq. 2.3.1. The goal is to estimate the value of each unknown πuij ∈ Πunknown, such that π̂uij approximates πuij. This can be considered as an optimization task performs directly on the PR: π̂uij = arg min π̂uij∈[0,1] (πuij − π̂uij)2 (2.3.3) However, it can be easier to estimate the π̂uij by the difference between the two user-wise preferences pui and puj, i.e., π̂uij = φ(p̂ui−p̂uj), where φ(·) is a function that bounds the value into [0,1] and ensures φ(0) = 0.5. For example, the inverse-logit function φ(x) = e x 1+ex can be used when user-wise preferences involve large values. Therefore, the objective of this paper is to solve the following optimization problem: (p̂ui, p̂uj) = arg min p̂ui,p̂uj (πuij − φ(p̂ui − p̂uj))2 (2.3.4) which optimizes the user-wise preferences directly, and Top-N recommendations can be obtained by simply sorting the estimated user-wise preferences. Let us consider an instance space X = {xi} (e.g. items) and a finite set of labels (e.g. ratings) Y = {yi|i = 1,2, ...,k}. One task of preference learning is to find a label ranking for any instance, e.g., determine the most likely rating for an item. The other task is to find an object ranking, e.g., to determine the ranking of items. For ease of reference, notations used in Relative Preference-based RecSys are summarized in Table 2.3. The letters u, v, a, b represent users, and the letters i, j, k, l represent items. 23 Table 2.3: Notations used in Relative Preference-based RecSys Notations Mathematical Meanings U the set of users I the set of items Π the set of preference relations pui the user-wise preference of user u on item i G an undirected graph encodes relations of user-wise preferences V the set of vertices each represents a user-wise preference E the set of edges each connects two vertices fuv the correlation feature between users u and v fij the correlation feature between items i and j wuv the weight associated to the user-user correlation feature fuv wij the weight associated to the item-item correlation feature fij Q(pui | u,i) the ordinal distribution o the side information, e.g., user attributes and item content 24 2.3.2 Memory-based Models A memory-based model is proposed in [12] to take preference relations as input and compute similarities between users. The proposed model has the following three steps: Collecting User Profiles When preference relations are employed, four values are possible for user preferences: • i � j indicates item i is preferred over item j • i ≺ j indicates item j is preferred over item i • i ≈ j indicates item i and j are equally preferable Each value correspond to one question answered by a user. However, there are too many possible questions which cannot be asked. Therefore, a decision must be made to decide which subset of questions to ask. Computing Similarities Between Users Let Iu be the set of preference relations of user u, and fu1,u2(i,j) indicating whether two users u1 and u2 agree on their preference on the two items i and j. Given an item pair (i,j), the function fu1,u2(i,j) gives the value 1 if the two users have the same preference, and 0 otherwise. The similarity measure between two users u1 and u2 is then defined as: cos�(u1,u2) = ∑ (i,j)∈I1∩I2 fu1,u2(i,j)√∑ (i,j)∈I1 fu1,u1(i,j) · √∑ (i,j)∈I2 fu2,u2(i,j) = ∑ (i,j)∈I1∩I2 fu1,u2(i,j)√ |I1| · |I2| (2.3.5) 25 where the numerator represents the number preferences that both users agreed, and the denominator normalizes the result. Making Recommendations To make recommendations, the preference relations are first converted into user-wise preferences. Denote: • c+u,i: the number of preference relations that i is preferred • c=u,i: the number of preference relations that i is equally preferred to others • c−u,i: the number of preference relations that i is less preferred Then the user-wise preference of item i by user u is defined as: pui = c−u,i − c+u,i c−u,i + c + u,i + c = u,i (2.3.6) With the user-wise preferences computed, the preference over an unknown item j can be predicted by: puj = ∑ v∈Nu sim(u,v) · pvj∑ v∈Nu sim(u,v) (2.3.7) where Nu is the set of users that have similar profiles to user u. 2.3.3 Model-based Models Ordinal Matrix Factorization The ordinal nature of preferences has been overlooked in RecSys literature, until recently Ordinal Matrix Factorization (OMF) [32,42,61,82] has emerged to explore the ordinal properties of ratings. 26 In general, OMF aims to generate an ordinal distribution Q(rui|u,i) over all pos- sible rating values for each user/item pair. Predicting the rating for user u on item i is then equivalent to identifying the rating with the greatest mass in the ordinal distribution Q(rui|u,i). While traditional RecSys approaches make only a point esti- mate, the OMF produces a full distribution and each prediction is associated with a probability as a confidence measure. Typical OMF approaches assume the existence of a latent utility xui that captures how much the user u is interested in the item i. The latent utility xui can be defined in different ways [32, 42, 61, 82], but under the same framework of Random Utility Models [55] xui = μui + �ui (2.3.8) where μui is an internal score represents the interaction between the user u and the item i. The �ui is the random noise normally assumed to follow the logistic distribution in practice [42]. The latent utility xui is then generated from a logistic distribution centred at μui with the scale parameter sui proportional to the standard deviation xui ∼ Logi(μui,sui) (2.3.9) In collaborative filtering, the user-item interaction is often captured by MF tech- niques, thereby the internal score μui can be substituted with the MF term bui +p T u qi xui = bui + p T u qi + �ui (2.3.10) where pu and qi are, respectively, the latent feature vectors of the user u and the item i. Modelling the latent utility with MF reflects the name OMF. Despite how the latent utility is modelled, an ordinal assumption is required to convert the numerical utility into ordinal values. A common approach is the ordinal 27 logistic regression originally described by McCullagh [54], which assumes that the rating is chosen based on the interval to which the utility belongs rui = l if xui ∈ (θl−1,θl] for l < L and rui = L if xui > θL−1 (2.3.11) where L is the number of ordinal levels and θl are the threshold values of interest. Other assumptions [51] are also possible but McCullagh’s model is by far the most popular. The probability of receiving a rating l is therefore Q(rui = l|u,i) = ∫ θl θl−1 P(xui|θ) = F(θl) − F(θl−1) (2.3.12) where F(θl) is the cumulative logistic distribution evaluated at θl F(xui ≤ l|θl) = 1 1 + exp(−θuil−μui sui ) (2.3.13) where the thresholds θl can be parameterised to depend on user or item. This paper employs the user-specific thresholds parameterisation described in [42]. Therefore a set of thresholds {θul}Ll=1 is defined for each user u to replace the thresholds θuil in Eq. 2.3.13. Given the learned ordinal distribution Q(rui|u,i), not only the ratings can be predicted but also the confidence for each prediction. Preference Relation-based Matrix Factorization Matrix Factorization (MF) [41] is a popular approach to RecSys that has mainly been applied to absolute ratings. Recently, the PrefNMF [19] model was proposed to adopt PR input for MF models. The PrefNMF model discovers the latent factor space shared between users and items, where the latent factors describe both the taste of users and the characteristics of items. The attractiveness of an item to a user is then measured by the inner product of their latent feature vectors. 28 Formally, each user u is associated with a latent feature vector uu ∈ Rk and each item i is associated with a latent feature vector vi ∈ Rk, where k is the dimension of the latent factor space. The attractiveness of items i and j to the user u are u�u vi and u�u vj, respectively. When u � u vi > u � u vj the item i is said to be more preferable to the user u than the item j, i.e., i � j. The strength of this preference relation πuij can be estimated by u�u (vi − vj), and the inverse-logit function is applied to ensure π̂uij ∈ [0,1]: π̂uij = eu � u (vi−vj) 1 + eu � u (vi−vj) (2.3.14) The latent feature vectors uu and vi are learned by minimizing regularized squared error with respect to the set of all known preference relations Π: min uu,vi∈Rk ∑ πuij∈Π∧(i θL−1 (3.2.6) where L is the number of ordinal levels and θl are the threshold values of interest. Other assumptions [51] are also possible but McCullagh’s model is by far the most popular. The probability of receiving a rating l is therefore Q(rui = l|u,i) = ∫ θl θl−1 P(xui|θ) = F(θl) − F(θl−1) (3.2.7) where F(θl) is the cumulative logistic distribution evaluated at θl F(xui ≤ l|θl) = 1 1 + exp(−θuil−μui sui ) (3.2.8) where the thresholds θl can be parameterized to depend on user or item. This paper employs the user-specific thresholds parameterization described in [42]. Therefore a set of thresholds {θul}Ll=1 is defined for each user u to replace the thresholds θuil in Eq. 3.2.8. Given the learned ordinal distribution Q(rui|u,i), not only the ratings can be predicted but also the confidence for each prediction. 3.2.3 Summary Matrix Factorization has been one of the most popular RecSys approaches, which primarily focuses on numerical preferences such as ratings. Nevertheless, the nature 42 of user preferences is often ordinal, and the importance of modeling ordinal properties has been recognized in recent works on OMF [32,42,61,82]. Although the OMF en- ables the modeling of ordinal properties, the employment of MF makes it only focuses on the higher-order interactions (the GS) regardless of the localized interactions (the LS), whereas both information are valuable [38, 79]. Furthermore, the OMF by its nature cannot model auxiliary information such as content [5] directly. The powerful representation of Markov Random Fields (MRF) offers an oppor- tunity to take advantages from all of these information, and have been developed in recent works [78, 79]. Nevertheless, exploiting the ordinal properties is not an easy task for MRF [79], therefore the strengths of the OMF and the MRF are nicely complementary. This observation leads to a naturally extension of unifying these two approaches, and motivates the present work. 3.3 Ordinal Random Fields In this section, we propose the Ordinal Random Fields (ORF) to model the ordinal properties and capture both the LS and the GS. Here we exploit the LS of the item- item correlations only, while the user-user correlations can be modeled in a similar manner. The rest of this section introduces the concept of the Markov Random Fields followed a detailed discussion of the ORF including its feature design, parameter estimation, and predictions. 43 3.3.1 Markov Random Fields Markov Random Fields (MRF) [16, 78] models a set of random variables having Markov property with respect to an undirected graph G. The undirected graph G consists a set of vertices V connected by a set of edges E without orientation, where two vertices are neighborhood of each other when connected. Each vertex in V encodes a random variable, and the Markov property implies that a variable is conditionally independent of other variables given its neighborhoods. In this work, we use MRF to model user preferences and their relations respect to a set of undirected graphs. Specifically for each user u, there is a graph Gu with a set of vertices Vu and a set of edges Eu. Each vertex in Vu represents a preference rui of user u on item i, and each edge in Eu captures a relation between two preferences by the same user. As we consider only the item-item correlations in this work, two preferences are connected by an edge if and only if they are given by the same user. Fig. 3.1 shows an example of two graphs for users u and v. Note that vertices of different graphs are not connected directly, however, the edges between the same pair of items are associated to the same item-item correlation. For example, the edge between rui and ruj and the edge between rvi and rvj are associated to the same item-item correlation between items i and j (see the green dashed line in Fig. 3.1). Formally, let I(u) be the set of all items rated by user u and ru = {rui|i ∈ I(u)} be the joint set of all preferences (the variables) related to user u, then the MRF defines a distribution P(ru) over the graph Gu: P(ru) = 1 Zu Ψ(ru) (3.3.1) 44 ruj ruk rvj rvk rui rvi i-j correlation j-k correlati on i-k correlati on Figure 3.1: Example of undirected graphs for users u and v Ψ(ru) = ∏ (ui,uj)∈Eu ψij(rui,ruj) (3.3.2) where Zu is the normalization term that ensures ∑ ru P(ru) = 1, and ψ(·) is a positive function known as potential. The potential ψij(rui,ruj) captures the correlation between items i and j ψij(rui,ruj) = exp{wijfij(rui,ruj)} (3.3.3) where fij(·) is the feature function and wij is the corresponding weight. The cor- relation features capture the LS, while the weights realize the importance of each correlation feature. In ORF, the weights also control the relative importance between the LS and the GS. With the weights estimated from data, the unknown preference rui can be predicted as r̂ui = arg max rui P(rui|ru) (3.3.4) where P(rui|ru) serves as the confidence measure of the prediction. 45 3.3.2 ORF: Unifying MRF and OMF The standard MRF approach captures the LS by modeling item-item correlations under the framework of probabilistic graphical models. However, it employs the log-linear modeling as shown in Eq. 3.3.3, and therefore does not enable a simple treatment of ordinal preferences. OMF, on the other hand, can nicely model the ordinal preferences in a probabilistic way but is weak in capturing the LS. The com- plementary between these two techniques calls for the unified ORF model to take all of the advantages. Essentially, the proposed ORF model promotes the agreement between the GS discovered by the OMF and the LS discovered by the MRF. More specifically, the ORF model combines the item-item correlations (Eq. 3.3.3) and the point-wise ordinal distribution Q(rui|u,i) obtained from the OMF (Eq. 3.2.7) P(ru) ∝ Ψu(ru) ∏ rui∈ru Q(rui|u,i) (3.3.5) where Ψu(ru) is the potential function capturing the interaction among items, and ru is the set of preferences from user u. The potential function Ψu(ru) can be further factorized into pairwise potentials based on Eq. 3.3.3 and Eq. 3.3.2: Ψu(ru) = exp ⎛ ⎝ ∑ rui,ruj∈ru wijfij(rui,ruj) ⎞ ⎠ (3.3.6) where fij(·) is the correlation feature between items i and j to be defined shortly in Section 3.3.3, and wij is the corresponding weight controls the relative importance of each correlation feature (LS) to the ordinal distribution (GS). Put all together, the 46 joint distribution P(ru) for each user u can be modelled as P(ru) ∝ exp ⎛ ⎝ ∑ rui,ruj∈ru wijfij(rui,ruj) ⎞ ⎠ ∏ rui∈ru Q(rui|u,i) (3.3.7) where there is a graph for each user but the weights are optimised by all users. In fact, the user-user correlations can also be captured as P(R) ∝ ∏ i Ψi(ri) ∏ u Ψu(ru) ∏ u,i Q(rui|u,i) (3.3.8) but we limit our discussion to item-item correlations in this paper. 3.3.3 Feature Design A feature is essentially a function f of n > 1 arguments that maps the (n-dimensional) input onto the unit interval f : Rn → [0,1], where the input can be ratings or auxiliary information such as content [78]. The item-item correlation is captured by the following feature f(rui,ruj) = g(|(rui − r̄i) − (ruj − r̄j)|) (3.3.9) where g(α) = 1−α/(L−1) does normalization, and r̄i and r̄j are the average ratings for items i and j, respectively. This correlation feature captures the intuition that correlated items should receive similar ratings by the same user after offsetting the goodness of each item. Though this work focuses on the item-item correlations, the feature for user-user correlations can be designed in a similar manner: f(rui,rvi) = g(|(rui − r̄u) − (rvi − r̄v)|) (3.3.10) where r̄u and r̄v are the global average ratings for users u and v respectively. 47 Although the user and item bias have been modeled by the underlying OMF, the ORF itself can also model the bias with identity features for item i and for user u fi(rui, i) = g(|rui − r̄i|), fu(rui,u) = g(|rui − r̄u|) (3.3.11) Indeed, auxiliary information such as content [5] and social relations [50] can also be modeled by designing corresponding features. That being said, the ORF is a generic framework with great extensibility to integrate multiple sub-components such as neighborhood, content, and ordinal ratings. Nevertheless, this work focuses on the item-item correlation features only. Since one correlation feature exists for each possible pair of co-rated items, the number of correlation features can be large, and this makes the estimation slow to converge and less robust. Therefore we only keep the correlation features if strong correlation exists between two items i and j. Specifically, the strong correlation features are extracted based on the Pearson correlation and a user-specified minimum correlation threshold. 3.3.4 Parameter Estimation In general, MRF models cannot be determined by standard maximum likelihood approaches, instead, approximation techniques such as Markov Chain Monte Carlo (MCMC) [26] and Pseudo-likelihood [8] are often used in practice. The pseudo- likelihood leads to exact computation of the loss function and its gradient with respect to parameters, and thus faster. The MCMC-based methods may, on the other hand, lead to better estimation given enough time. As the experiments involve different settings and large number of features, this study employs the pseudo-likelihood tech- nique to perform efficient parameter estimation by maximizing the regularized sum 48 of log local likelihoods L(w) = ∑ rui∈R log P(rui|ru\rui) − 1 2σ2 ∑ u∈U wTu wu (3.3.12) where σ is the regularization coefficient, and wu is the subset of weights related to user u. The local likelihood is defined as P(rui|ru\rui) = 1 Zui Q(rui|u,i) exp ⎛ ⎝ ∑ ruj∈ru\rui wijfij(rui,ruj) ⎞ ⎠ (3.3.13) where Zui is the normalization term. Zui = L∑ rui=1 Q(rui|u,i) exp ⎛ ⎝ ∑ ruj∈ru\rui wijfij(rui,ruj) ⎞ ⎠ (3.3.14) To optimize the parameters, we use the stochastic gradient ascent procedure that updates the parameters by passing through the set of ratings of each user: wu ← wu + η∇L(wu) (3.3.15) where η is the learning rate. More specifically, for each rui and its neighbor ruj in the set of ratings ru by user u, update the weight wij using the gradient of the log pseudo-likelihood ∂logL ∂wij = fij(rui,ruj) − L∑ rui=1 P(rui|ru\rui)fij(rui,ruj) (3.3.16) 3.3.5 Preference Prediction The prediction of rating rui is straightforward, which can be done by identifying the rating with the greatest mass in local likelihood: r̂ui = arg max rui P(rui|ru) (3.3.17) 49 where the local likelihood is given by Eq. 3.3.13. Prediction made in this approach identifies the most likely rating from discrete values 1 to L, and the local likelihood serves as a confidence measure. For predictions of scalar values, the expectation can be used instead: r̂ui = L∑ rui=1 ruiP(rui|ru) (3.3.18) Finally, Alg. 1 summarizes the learning and prediction procedures for the ORF. 3.4 Experiment and Analysis To study the performance of the proposed ORF model, comparisons were made with the following representative algorithms: a) K-Nearest Neighbors (K-NN) [65, 69], which represents the methods exploiting the LS; b) OMF [42], which exploits the GS and ordinal properties; c) and finally the ORF model, which takes ordinal prop- erties into account and exploits both the LS and the GS. Details of the experimental settings and results are presented in this section. 3.4.1 Experimental Settings Datasets Experiments were conducted on two public movie rating datasets: the MovieLens-100K and the MovieLens-1M 1 datasets. The MovieLens-1M dataset contains roughly 1 million ratings by 6040 users on 3900 movies. The MovieLens- 100K dataset contains 100K ratings by 943 users on 1682 movies. Ratings are on the 1 − 5 scale. To perform a reliable evaluation, we keep only users who rated at least 30 1http://grouplens.org/datasets/movielens 50 Algorithm 1 Ordinal Random Fields Algorithm Input: User preferences R; the ordinal distribution Q from Eq. 3.2.7. Step 1: Generate strong correlation features: fstrong ← {fij|Pearson(i,j) ≥ minCorr} Step 2: Initialize the weights: ∀wij ∈ w, wij ← N(0,0.01); Step 3: Repeat for each u ∈ U do for each rui,ruj ∈ ru, i �= j do if fij ∈ fstrong then Compute correlation feature fij according to Eq. 3.3.9 Compute normalization term Zui according to Eq. 3.3.14 Compute local likelihood according to Eq. 3.3.13 Compute the gradient for weight wij according to Eq. 3.3.16 Update wij with the gradient wij ← wij + η∇L(wij) end if end for end for Until stopping criteria met Predictions: * Predict most likely rating with confidence measure using Eq. 3.3.18. * Predict expectation using Eq. 3.3.17 51 movies, and each dataset is shuffled and split into the disjoint training set, validation set and test set. For each user, 5 ratings are kept in the validation set for tuning the hyper-parameters, 10 ratings are reserved for testing, and the rest for training. Evaluation Metric The Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) are used as the evaluation metric MAE = 1 |Rtest| ∑ (u,i)∈Rtest |r̂ui − rui|, RMSE = √∑ (u,i)∈Rtest(r̂ui − rui)2 |Rtest| (3.4.1) where Rtest is the test set kept aside until all parameters have been tuned. A smaller MAE or RMSE value indicates better performance. Although both metrics are used, we consider the MAE metric to be more suitable for ordinal preferences. The reason is that it makes more scenes to consider being off by 4 is just twice as bad as being off by 2 when the preferences are ordinal. The RMSE metric, on the other hand, can be skewed to methods that are optimized for numerical preferences. Parameters To perform a fair comparison, we fix the number of latent factors to the typical value of 50 for all algorithms, and all weights are randomly initialized from N(0,0.01). The number of neighbors for K-NN algorithms is set to 10 and 30. The minimum correlation threshold for the ORF is set to reasonable values considering both the prediction performance and computational efficiency. We will also report the effect of varying the minimum correlation threshold. 52 3.4.2 Result and Analysis We first compare the performance of the proposed ORF model with related algo- rithms: user-based K-NN, item-based K-NN and OMF, where the OMF is the tar- geted baseline. Then the impact of parameters is investigated for the ORF model, in particular the regularization coefficient and the minimum correlation threshold. Comparison with Other Methods The comparison results in terms of prediction accuracy are shown in Table 3.2. The global average is used only as a benchmark, which uses the average rating as the predictions. The following observations can be made based on the results. Firstly, the K-NN methods, especially the item-based K-NN, perform quite well. As the K-NN methods exploit only the LS, this result indicates the effectiveness of the LS. However, the ignorance of the GS makes the K-NN methods not less generalized and thus highly susceptible to noisy data. Secondly, the OMF fits the data quite well when predicting the most likely rat- ings for the MAE metric. However, it exploits only the GS and therefore further improvements are possible by incorporating the LS information. Finally, the ORF has made further improvements upon the OMF by unifying the modeling of both the LS and the GS, as well as ordinal properties. Note that the performance of the ORF relies on the ordinal distributions generated by the underlying OMF, which can be implemented in different ways [32, 42, 61, 82]. In present work, the improvements over the OMF are soley based on incorporating the LS information. To confirm the improvements, a paired t-test (two-tailed) with a confidence level of 53 Table 3.2: For both the OMF and the ORF, the expectation values (Eq. 3.3.18) are used for RMSE and the most likely values (Eq. 3.3.17) are used for MAE. MovieLens-100K MovieLens-1M Method RMSE MAE RMSE MAE Global Ave. 1.1186 0.9430 1.1123 0.9401 UserKNN, K=10 0.9687 0.7584 0.9350 0.7328 ItemKNN, K=10 0.9372 0.7305 0.9032 0.7041 UserKNN, K=30 0.9463 0.7413 0.9149 0.7173 ItemKNN, K=30 0.9315 0.7295 0.8987 0.70478 OMF 0.9525 0.7226 0.9144 0.6918 ORF, minCorr=0.4 0.9475 0.7185 0.9117 0.6887 ORF, minCorr=0.3 0.9448 0.7148 0.9093 0.6870 Table 3.3: Paired t-test for the ORF and the OMF. t-test statistics Methods df t p-value ORF vs. OMF on MAE 9 6.0163 0.0002 ORF vs. OMF on RMSE 9 4.8586 0.0009 54 95% has been applied to the ORF and the OMF. Results shown in Table 3.3 confirm that the performance of methods with and without capturing the LS is statistically significant. Impact of Minimum Correlation Threshold As mentioned in Section 3.3.3, the ORF model requires a minimum correlation thresh- old to control the number of correlation features. The reason is that the number of correlation features can be very large, which makes the model less robust and slow to converge. Specifically, when this threshold goes to minimum (e.g. −1 for Pearson cor- relation), the potential number of correlation features can be as large as n2/2 where n is the number of items. On the other hand, the number of correlation features goes to zero when the threshold goes to maximum, and the ORF reduces to the OMF. (a) Number of Correlation Features (b) Coverage of Correlation Features Figure 3.2: Impact of Minimum Correlation Threshold on Number of Correlation Features 55 Fig. 3.2(a) shows the number of correlation features for different minimum corre- lation thresholds. Given that the MovieLens-100K dataset contains less items com- paring the MovieLens-1M dataset, there are even more correlation features remained in the MovieLens-100K dataset when the threshold becomes larger. This observation implies that the items in the MovieLens-100K dataset are more correlated with each other. We also show the coverage of correlation features for both datasets, and the MovieLens-100K has consistently higher coverage of correlation features. (a) MAE (b) RMSE Figure 3.3: Impact of Minimum Correlation Threshold Having these statistics result, we further examine the impact of the minimum cor- relation threshold on prediction accuracy, as plotted in Fig. 3.3. It can be observed that the prediction accuracy improves as the minimum correlation threshold becomes smaller. However, we notice that the performance on the smaller MovieLens-100K dataset is not as stable as that on the MovieLens-1M dataset, where the curve of the MovieLens-1M dataset is smoother and shows better monotonicity. One explanation 56 is that the MovieLens-100K dataset may not have enough data to make robust esti- mation for large number of weights. However, given adequate data and time, the best prediction performance can be achieved by including all correlation features, i.e., the minimum correlation threshold is set to minimum. Impact of Regularization Coefficient While the number of correlation features can be large, the model might be prone to over-fitting. Therefore we investigate the impact of varying the the regularization coefficient. Fig. 3.4 shows the performance of the ORF under different regularization (a) MAE (b) RMSE Figure 3.4: Impact of Regularization Coefficient settings. We observe that by varying the regularization coefficient the prediction performance was not affected too much. One possible explanation is that the ordinal distribution employed in the ORF is generated by the underlying OMF with its own regularization mechanism, whereas the regularization term in the ORF controls only 57 the weights for the second-order item-item correlation features. In other words, the ORF model by itself is unlikely to over-fit the data given that the underlying OMF model has been properly regularized. 3.5 Summary In this work we presented the ORF model that takes advantages of both the rep- resentational power of the MRF and the ease of modeling ordinal preferences by the OMF. While the standard OMF approaches exploit only the GS, the ORF model is able to capture the LS as well. In addition, the ORF model defines a uniformed inter- face for different OMF approaches with various internal implementations. Last but not least, the ORF model is a generic framework that can be extended to incorporate additional information by designing more features. A future extension could take the user-user correlations into account as we modeled only the item-item correlations in this work. Incorporating the user-user correlations may further improve the prediction performance. Another future work is to take auxiliary information into consideration by replacing the MRF with the Conditional Random Fields [43]. Fusing auxiliary information such as the item content and social relations could improve the prediction performance especially when the data is highly sparse. Chapter 4 Preferecen Relation-based Recommender System 4.1 Introduction RecSys aim to recommend users with some of their potentially interesting items, which can be virtually anything ranging from movies to tourism attractions. To identify the appropriate items, RecSys attempts to exploit user preferences [41] and various side information including content [5,6], temporal dynamics [40], and social relations [50]. By far, Collaborative Filtering [41] is one of the most popular RecSys techniques, which exploits user preferences, especially in form of explicit absolute ratings. Nevertheless, relying on solely absolute ratings is prone to the cold-start problem [72] where few ratings are known for cold users or items. To alleviate the cold-start problem, additional information, which is usually heterogeneous [6] and implicit [64] must be acquired and exploited. Recently, a considerable literature [12, 19, 45, 64, 74] has grown up around the theme of relative preferences. The underlying motivation is that relative preferences are often easier to collect and more reliable as a measure of user preferences. For 58 59 example, it can be easier for users to tell which item is preferable than expressing the precise degree of liking. Furthermore, studies [12,42] have reported that absolute ratings may not be completely trustworthy. For example, rating 4 out of 5 may in general indicate high quality, but it can mean just OK for critics. In fact, users’ quantitative judgment can be affected by irrelevant factors such as the mood when rating, and this is called misattribution of memory [71]. While users are not good at making consistent quantitative judgment, the pref- erence relation (PR), as a kind of relative preference, has been considered as more consistent across like-minded users [12,18,19]. By measuring the relative order be- tween items, the PR is usually less variant to irrelevant factors. For example, a user in bad mood may give lower ratings to all items but the relative orderings between items remain the same. Being a more reliable type of user preferences, PR is also eas- ier to collect comparing to ratings as it can be inferred from implicit feedbacks. For example, the PR between two items can be inferred by comparing their ratings, page views, played counts, mouse clicks, etc. This property is important as not all users are willing to rate their preferences, where collecting feedbacks implicitly delivers a more user-friendly recommender system. In addition, as the ultimate goal of RecSys, obtaining the ranking of items by itself is to obtain the relative preferences, a more natural input than absolute ratings [42,81]. While the PR captures the user preferences in the pairwise form, most existing works [42,46] take the pointwise approach to exploiting ordinal properties possessed by absolute ratings. To accept the PR as input and output item rankings, pair- wise approaches to RecSys have recently emerged in two forms: memory-based [12] and model-based [19, 45, 64]. These studies have shown the feasibility of PR-based 60 methods, and demonstrated competitive performance comparing to their underly- ing models, such as memory-based K-Nearest Neighbor (KNN) [12] and model-based Matrix Factorization (MF) [19]. However, the limitations of these underlying models have constrained the poten- tials of their PR extensions. More specifically, both KNN and MF based methods can only capture one type of information at a time, while both the local and the global information are essential in achieving good performance [38,46,79]. We refer these two types of information as the local and the global structures of the preferences: Local Structure The local structure (LS) refers to the second-order interactions between similar users [65] or items [69]. This type of information is often used by neighborhood-based collaborative filtering, in which the predictions are made by looking at the neighborhood of users [65] or items [69]. LS-based approaches ignore the majority of preferences in making predictions, but are effective when the users or items correlations are highly localized. Global Structure The global structure (GS) refers to the weaker but higher-order interactions among all users and items [41]. This type of information is often used by latent factor models such as Matrix Factorization [41], which aim to discover the latent factor space in the preferences. GS-based approaches are often competitive in terms of accuracy and computational efficiency [41]. Previous studies have suggested that these two structures are complementary since they address different aspects of the preferences [38, 46, 79]. However, to the best of our knowledge, there is yet no PR-based method that can capture both LS and GS. Another problem of existing PR-based methods is that side information such as item content and user attributes can’t be easily incorporated, which is critical in 61 cold-start cases. All the above reasonings lead to the desired model with the following properties: 1) Accept PR as input; 2) Capture both LS and GS; 3) Side information can be easily incorporated; 4) Output item rankings. Recent advances in Markov Random Fields-based RecSys [16,46,77,79] have made it possible to achieve the above objectives. MRF-based RecSys was first developed in [79] to capture both LS and GS. Later on, it has been extended in [46] to exploit ordinal properties possessed by absolute ratings. Nevertheless, all of these attempts rely on absolute ratings. This work aims to push the MRF-based RecSys one step further by fitting it into the PR framework, namely the Preference Relation-based Markov Random Fields (PrefMRF) and the Preference Relation-based Conditional Random Fields (PrefCRF) when side information is incorporated. The remaining part of this paper is organized as follows. Section 4.2 introduces the concepts of PR-based RecSys and formalizes the problem, followed by a review of related work. Section 4.3 is devoted to the proposed PrefMRF and PrefCRF models. Benchmark results on Top-N recommendation are presented in Section 4.4. Finally, Section 4.5 concludes this work by summarizing the main contributions and envisaging future works. 4.2 Preliminaries RecSys aim at predicting users’ future interest in items, and the recommendation task can be considered as a preference learning problem, which aims to construct a predictive preference model from observed preference information [58]. Existing preference learning methods are based on different learning to rank approaches [23]. 62 Among them, the pointwise approach is the choice of most RecSys [38, 69], which exploit absolute ratings, though pairwise approach that exploits PR has been largely overlooked until recently. The rest of this section describes the basic concepts and formalizes the PR-based RecSys followed by a review of related work. 4.2.1 Preference Relation A preference relation (PR) encodes user preferences in form of pairwise ordering between items. This representation is a useful alternative to absolute ratings for three reasons. Firstly, PR is more consistent across like-minded users [12,19] as it is invariant to many irrelevant factors, such as mood. Secondly, PR is a more natural and direct input for Top-N recommendation, as both the input and the output are relative preferences. Finally, and perhaps most importantly, PR can be obtained implicitly rather than asking the users explicitly. For example, the PR over two Web pages can be inferred by the stayed time, and consequently applies to the displayed items. This property is important as not all users are willing to rate their preferences, where collecting feedbacks implicitly delivers a more user-friendly RecSys. In addition, PR- based RecSys provides an opportunity to utilize the vast amount of implicit data that have already been collected over the years, such as activity logs. With these potential benefits, we shall take a closer look at the PR, and investigate how they can be utilized in RecSys. We formally define the PR as follows. Let U = {u}n and I = {i}m denote the set of n users and m items, respectively. The preference of a user u ∈ U between items i and j is encoded as πuij, which indicates the strength of user u’s PR for the ordered 63 item pair (i,j). A higher value of πuij indicates a stronger preference on the first item over the second item. Definition 5 (Preference Relation). The preference relation is defined as πuij = ⎧⎪⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎪⎩ (2 3 ,1] if i � j (u prefers i over j) [1 3 , 2 3 ] if i � j (i and j are equally preferable to u) [0, 1 3 ) if i ≺ j (u prefers j over i) (4.2.1) where πuij ∈ [0,1] and πuij = 1 − πuji. This definition is similar to [19], however, we allocate an interval for each pref- erence category, i.e., preferred, equally preferred, and less preferred. Indeed, each preference category can be further break down into more intervals. Similar to [12], the PR can be converted into user-wise preferences over items. Definition 6 (User-wise Preference). The user-wise preference is defined as pui = ∑ j∈Iu[[πuij > 2 3 ]] − ∑ j∈Iu[[πuij < 1 3 ]] |Πui| (4.2.2) where [[·]] gives 1 for true and 0 for false, and Πui is the set of user u’s PR related to item i. The user-wise preference pui falls in the interval [−1,1], where −1 and 1 indicate that item i is the least or the most preferred item for user u, respectively. The user- wise preference measures the relative position of an item for a particular user, which is different from absolute ratings. 4.2.2 Problem Statement Generally, the task of PR-based RecSys is to take PR as input and output Top-N recommendations. Specifically, let πuij ∈ Π encode the PR of each user u ∈ U. 64 Each πuij is defined over an ordered item pair (i, j), denoting i ≺ j, i � j, or i � j as described in Eq. 4.2.1. The goal is to estimate the value of each unknown πuij ∈ Πunknown, such that π̂uij approximates πuij. This can be considered as an optimization task performs directly on the PR: π̂uij = arg min π̂uij∈[0,1] (πuij − π̂uij)2 (4.2.3) However, it can be easier to estimate the π̂uij by the difference between the two user-wise preferences pui and puj, i.e., π̂uij = φ(p̂ui−p̂uj), where φ(·) is a function that bounds the value into [0,1] and ensures φ(0) = 0.5. For example, the inverse-logit function φ(x) = e x 1+ex can be used when user-wise preferences involve large values. Therefore, the objective of this paper is to solve the following optimization problem: (p̂ui, p̂uj) = arg min p̂ui,p̂uj (πuij − φ(p̂ui − p̂uj))2 (4.2.4) which optimizes the user-wise preferences directly, and Top-N recommendations can be obtained by simply sorting the estimated user-wise preferences. For ease of reference, notations used throughout this chapter are summarized in Table 4.1. The letters u, v, a, b represent users, and the letters i, j, k, l represent items. 4.2.3 Related Work User preferences can be modeled in three types: pointwise, pairwise, and listwise. Though RecSys is not limited to the pointwise absolute ratings, the recommendation task is usually considered as a rating prediction problem. Recently, a considerable literature [12,17,19,24,36,45,64,74] has grown up around the theme of relative pref- erences, especially the pairwise PR. Meanwhile, recommendation task is also shifting 65 Table 4.1: Summary of Major Notations (Chapter 4) Notations Mathematical Meanings U the set of users I the set of items Π the set of preference relations pui the user-wise preference of user u on item i G an undirected graph encodes relations of user-wise preferences V the set of vertices each represents a user-wise preference E the set of edges each connects two vertices fuv the correlation feature between users u and v fij the correlation feature between items i and j wuv the weight associated to the user-user correlation feature fuv wij the weight associated to the item-item correlation feature fij Q(pui | u,i) the ordinal distribution produced by PrefNMF o the side information, e.g., user attributes and item content 66 from rating prediction to item ranking [56,74,83], in which the ranking itself is also relative preferences. Preference relation has been widely studied in the field of The use of relative preferences has been widely studied in the field of Information Retrieval [14,21,22,35,64] for learning to rank tasks. Recently, PR-based [12,19,45, 64] and listwise-based [74] RecSys have been proposed. Among them, the PR-based approach is the most popular, which can be further categorized as memory-based methods [12] that capture local structure and model-based methods [19, 45, 64] that capture global structure. To the best of our knowledge, there is yet no PR-based method that can capture both LS and GS. Advances in Markov Random Fields (MRF) and its extension Conditional Random Fields (CRF) have made it possible to utilize both LS and GS by taking advantages of MRF’s powerful representation capability. Nevertheless, exploiting the PR is not an easy task for MRF and CRF [46,79]. This observation leads to a natural exten- sion of unifying the MRF models with the PR-based models, to complement their strengths. We summarize the capabilities of the existing and our proposed PrefMRF and PrefCRF models in Table 4.2. 4.3 Methodology In this section, we propose the Preference Relation-based Markov Random Fields (PrefMRF) to model the PR and capture both LS and GS. When side information is taken into consideration, this model extends to Preference Relation-based Conditional Random Fields (PrefCRF). In this work, we exploit LS in terms of the item-item correlations as well as the user-user correlations. The rest of this section introduces 67 Table 4.2: Capabilities of PR-based methods Method Input Output LS GS Side Information Pointwise Memory-based Ratings Ratings � Pointwise Model-based Ratings Ratings � Pointwise Hybrid Ratings Ratings � � Pairwise Memory-based Preference Relations Item Rankings � Pairwise Model-based Preference Relations Item Rankings � PrefMRF Preference Relations Item Rankings � � PrefCRF Preference Relations Item Rankings � � � the concept of the Preference Relation-based Matrix Factorization(PrefNMF) [19] that will be our underlying model, and then followed a detailed discussion of the PrefMRF and PrefCRF on issues including feature design, parameter estimation, and predictions. 4.3.1 Preference Relation-based Matrix Factorization Matrix Factorization (MF) [41] is a popular approach to RecSys that has mainly been applied to absolute ratings. Recently, the PrefNMF [19] model was proposed to adopt PR input for MF models. Like MF models, the PrefNMF model discovers the latent factor space shared between users and items, where the latent factors describe both the taste of users and the characteristics of items. The attractiveness of an item to a user is then measured by the inner product of their latent feature vectors. 68 Point Estimation The PrefNMF model described in [19] provides a point estimation to each preference, i.e., a single value. Formally, each user u is associated with a latent feature vector uu ∈ Rk and each item i is associated with a latent feature vector vi ∈ Rk, where k is the dimension of the latent factor space. The attractiveness of items i and j to the user u are u�u vi and u � u vj, respectively. When u � u vi > u � u vj the item i is said to be more preferable to the user u than the item j, i.e., i � j. The strength of this preference relation πuij can be estimated by u � u (vi − vj), and the inverse-logit function is applied to ensure π̂uij ∈ [0,1]: π̂uij = eu � u (vi−vj) 1 + eu � u (vi−vj) (4.3.1) The latent feature vectors uu and vi are learned by minimizing regularized squared error with respect to the set of all known preference relations Π: min uu,vi∈Rk ∑ πuij∈Π∧(i θL−1 (4.3.5) where L is the number of ordinal levels and θl are the threshold values of interest. The probability of receiving a preference l is therefore Q(pui = l | u,i) = ∫ θl θl−1 P(xui | θ) dθ = F(θl) − F(θl−1) (4.3.6) 70 where F(θl) is the cumulative logistic distribution evaluated at θl with standard de- viation sui: F(xui ≤ l | θl) = 1 1 + exp(−θuil−μui sui ) (4.3.7) The thresholds θl can be parameterized to depend on user or item. This paper employs the user-specific thresholds parameterization described in [42]. Therefore a set of thresholds {θul}Ll=1 is defined for each user u to replace the thresholds θuil in Eq. 4.3.7, and is learned from data. Given the learned ordinal distribution Q(pui | u,i), not only the preferences can be predicted but also the confidence for each prediction. The ordinal distribution Q(pui | u,i) captures the GS information in a probabilistic form, and will be incorporated into MRF and CRF in Section 4.3.4. Note that the user preference is quantized into ordinal values in this process. 4.3.2 Markov Random Fields Markov Random Fields (MRF) [16,78] model a set of random variables having Markov property with respect to an undirected graph G. The undirected graph G consists a set of vertices V connected by a set of edges E without orientation, where two vertices are neighborhood of each other when connected. Each vertex in V encodes a random variable, and the Markov property implies that a variable is conditionally independent of others given its neighborhoods. In this work, we use MRF to model user-wise preference and their interactions respect to a set of undirected graphs. Specifically for each user u, there is a graph Gu with a set of vertices Vu and a set of edges Eu. Each vertex in Vu represents a preference pui of user u on the item i. Note that the term preference is used instead 71 of rating because in the new model the preference is not interpolated as absolute ratings but user-wise ordering of items. Each edge in Eu captures a relation between two preferences by the same user. Two preferences are connected by an edge if they are given by the same user or on the same item, corresponding to the item-item and user-user correlations, respec- tively. Modeling these correlations is actually capturing the LS information in the preferences. However, it is not easy to model two types of correlations at the same time as it will result a large graph. Instead, we model the item-item and user-user correlations separately, and merge their predictions. Fig. 4.1 shows an example of four graphs for user u, user v, item i, and item j. Note that vertices of different graphs are not connected directly, however, the weights are estimated across graphs when the edges correspond to the same correlation. For example, the edge between pui and puj and the edge between pvi and pvj are associated to the same item-item correlation ψij between items i and j. Formally, let Iu be the set of all items evaluated by the user u and Ui be the set of all users rated the item i. Then denote pu = {pui | i ∈ Iu} be the joint set of all preferences (the variables) expressed by user u, and pi = {pui | u ∈ Ui} be the joint set of all preferences (the variables) rated on item i. Under this setting, the MRF defines two distributions P(pu) and P(pi) over the graphs Gu and Gi, respectively: P(pu) = 1 Zu Ψu(pu) , P(pi) = 1 Zi Ψi(pi) (4.3.8) Ψu(pu) = ∏ (ui,uj)∈Eu ψij(pui,puj) , Ψi(pi) = ∏ (ui,vi)∈Ei ψuv(pui,pvi) (4.3.9) where Zu and Zi are the normalization terms that ensure ∑ pu P(pu) = 1 and ∑ pi P(pi) = 1. The term ψ(·) is a positive function known as potential. 72 pui puk puj pul ψij pvi pvk pvj pvl ψij ψik ψik puj pvj paj pbj ψua ψuv pui pvi pai pbi ψua ψuv Figure 4.1: Example of MRF graphs. u, v, a, and b are users. i, j, l, and k are items. The potentials ψij(pui,puj) and ψuv(pui,pvi) capture the correlation between items i and j and correlation between users u and v, respectively: ψij(pui,puj) = exp{wijfij(pui,puj)} (4.3.10) ψuv(pui,pvi) = exp{wuvfuv(pui,pvi)} (4.3.11) where fij(·) and fuv(·) are the feature functions to be designed shortly in Section 4.3.4, and wij and wuv are the corresponding weights. These correlation features capture the LS information, while the weights realize the importance of each correlation feature. With the weights estimated from data, the unknown preference pui can be predicted using item-item correlations: p̂ui = arg max pui∈[−1,1] P(pui | pu) (4.3.12) 73 or using user-user correlations: p̂ui = arg max pui∈[−1,1] P(pui | pi) (4.3.13) where the confidences of the predictions can be measured by P(pui | pu) and P(pui | pi). 4.3.3 Conditional Random Fields Despite of the user preferences, various side information including content [5, 6], temporal dynamics [40], and social relations [50] are also important in making quality recommendations. While there exist methods to incorporate side information, there is yet no PR-based method that can achieve this. One advantage of Markov Random Fields is its extensibility, thus side informa- tion can be easily incorporated by extending the MRF to Conditional Random Fields (CRF). In MRF, the item-item and user-user correlations are modeled in a set of graphs, where each graph has a set of vertices representing the preferences. To incor- porate side information, the MRF is extended to CRF by conditioning each vertex on a set of global observations o, i.e., the side information in our context. Specif- ically, each user u is associated with a set of attributes {ou} such as gender and occupation. Similarly, each item i is associated with a set of attributes {oi} such as genres of movie. This side information is encoded as the set of global observations o = {{ou},{oi}}. The graphs for item-item and user-user correlations conditioned on global observations are illustrated in Fig. 4.2. Using the same settings as MRF in Section 4.3.2, the CRF models the conditional 74 p p p i k l j u p p p i k l j v p p p u v b a i p p p u v b a j Figure 4.2: Example of CRF graphs. u , v , a , and b are users. i , j , l , and k are items. 75 distributions P(pu | o) and P(pi | o) over the graphs Gu and Gi, respectively: P(pu | o) = 1 Zu(o) Ψu(pu,o) , P(pi | o) = 1 Zi(o) Ψi(pi,o) (4.3.14) Ψu(pu,o) = ∏ (ui)∈Vu ψui(pui,o) ∏ (ui,uj)∈Eu ψij(pui,puj) (4.3.15) Ψi(pi,o) = ∏ (ui)∈Vi ψui(pui,o) ∏ (ui,vi)∈Ei ψuv(pui,pvi) (4.3.16) where Zu(o) and Zi(o) are the normalization terms ensure ∑ pu P(pu | o) = 1 and∑ pi P(pi | o) = 1. The term ψ(·) is a positive function known as potential. The potential ψui(·) captures the global observations associated to the user u and the item i: ψui(pui,o) = exp{w�u fu(pui,oi) + w�i fi(pui,ou))} (4.3.17) The potentials ψij(·) and ψuv(·) capture the item-item and user-user correlations, respectively: ψij(pui,puj) = exp{wijfij(pui,puj)} (4.3.18) ψuv(pui,pvi) = exp{wuvfuv(pui,pui)} (4.3.19) where fu, fi, fij, and fuv are the features to be designed shortly in Section 4.3.4, and wu, wi, wij, and wuv are the corresponding weights realize the importance of each feature. With the weights estimated from data, the unknown preference pui can be predicted using item-item correlations: p̂ui = arg max pui∈[−1,1] P(pui | pu,o) (4.3.20) or using user-user correlations: p̂ui = arg max pui∈[−1,1] P(pui | pi,o) (4.3.21) where P(pui | pu,o) and P(pui | pi,o) give the confidence of the predictions. 76 4.3.4 PrefCRF: Unifying PrefNMF and CRF The CRF model captures the LS information by modeling item-item and user-user correlations under the framework of probabilistic graphical models. However, it em- ploys the log-linear modeling as shown in Eq. 4.3.18 and Eq. 4.3.18, and therefore does not enable a simple treatment of PR. The PrefNMF model, on the other hand, can nicely model the PR but is weak in capturing the LS and side information. The complementary between these two techniques calls for the unified PrefCRF model to take all of the advantages. Essentially, the proposed PrefCRF model promotes the agreement between the GS discovered by the PrefNMF, the LS discovered by the MRF, and the side information discovered by the CRF. More specifically, the PrefCRF model combines the item-item and user-user correlations (Eq. 4.3.15 and Eq. 4.3.16) and the ordinal distributions Q(pui | u,i) over user-wise preferences obtained from Eq. 4.3.6: P(pu | o) ∝ Ψu(pu,o) ∏ pui∈pu Q(pui | u,i) (4.3.22) P(pi | o) ∝ Ψi(pi,o) ∏ pui∈pi Q(pui | u,i) (4.3.23) where Ψu is the potential function capturing the interaction among items evaluated by user u, and Ψi is the potential function capturing the interaction among users rated item i. Put all together, the joint distribution P(pu) for each user u can be modeled as: P(pu) ∝ exp ⎛ ⎝ ∑ pui,puj∈pu wijfij(pui,puj) + ∑ (ui)∈Vu ψui(pui,o) ⎞ ⎠ ∏ pui∈pu Q(pui | u,i) (4.3.24) 77 and the joint distribution P(pi) for each item i can be modeled as: P(pi) ∝ exp ⎛ ⎝ ∑ pui,pvi∈pi wuvfuv(pui,pvi) + ∑ (ui)∈Vi ψui(pui,o) ⎞ ⎠ ∏ pui∈pi Q(pui | u,i) (4.3.25) where there is a graph for each user or item but the weights are optimized by all users or all items. Feature Design A feature is essentially a function f of n > 1 arguments that maps the n-dimensional input into the unit interval f : Rn → [0,1]. We design the following kinds of features: Correlation Features The item-item correlation is captured by the feature: fij(pui,puj) = g(|(pui − p̄i) − (puj − p̄j)|) (4.3.26) and the user-user correlation is captured by the feature fuv(pui,pvi) = g(|(pui − p̄u) − (pvi − p̄v)|) (4.3.27) where g(α) normalizes feature values and α plays the role of deviation. The terms p̄i, p̄j, p̄i, and p̄j are the item or user averages. The item-item correlation feature captures the intuition that correlated items should be ranked similarly by the same user after offsetting the goodness of each item. Similarly, the user- user correlation feature captures the intuition that correlated users should rate the same item similarly. 78 Attribute Features Each user u and item i has a set of attributes ou and oi, respec- tively. These attributes are mapped to preferences by the following features: fi(pui) = oug(|(pui − p̄i)|) fu(pui) = oig(|(pui − p̄u)|) (4.3.28) where fi models which users like the item i and fu models which classes of items the user u likes. Since one correlation feature exists for each possible pair of co-rated items, the number of correlation features can be large which makes the estimation slow to con- verge and less robust. Therefore we only keep the correlation features if strong item- item correlation or user-user correlation exists. Specifically, the strong correlation features fstrong are extracted based on the Pearson Correlation and a user-specified minimum correlation threshold. Note that the correlation is calculated based on the user-wise preferences generated from PR thus the rule of using PR as input is not violated. Parameter Estimation In general, MRF-based models cannot be determined by standard maximum likeli- hood approaches, instead, approximation techniques are often used in practice such as Pseudo-likelihood [8] and Contrastive Divergence (CD) [30]. The Pseudo-likelihood leads to exact computation of the loss function and its gradient with respect to pa- rameters, and thus faster. The CD-based methods may, on the other hand, lead to better estimation given enough time. As the experiments involve different settings and large number of features, this study employs the Pseudo-likelihood technique to perform efficient parameter estimation by maximizing the regularized sum of log local 79 likelihoods: logL(w) = ∑ pui∈Π log P(pui | pu,o) − 1 2σ2 w�w (4.3.29) where w are the weights and 1/2σ2 controls the regularization. To make the notation uncluttered, we write pu instead of explicitly as pu\pui. In this section we describe the parameter estimation of item-item correlations, where the user-user correlations can be estimated in the same way by replacing items with users. The local likelihood in Eq. 4.3.29 is defined as: P(pui | pu,o) = 1 Zui(o) Q(pui | u,i)ψui(pui,o) ∏ puj∈pu ψij(pui,puj) (4.3.30) where Zui(o) is the normalization term: Zui = lmax∑ pui=lmin Q(pui | u,i)ψui(pui,o) ∏ puj∈pu ψij(pui,puj) (4.3.31) where lmin is the first and lmax is the last interval, i.e., 1 and 3 in our settings. To optimize the parameters, we use the stochastic gradient ascent procedure that updates the parameters by passing through the set of ratings of each user: w ← w + η∇logL(w) (4.3.32) where η is the learning rate. More specifically, for each pui we update the attribute weights wo = {wu,wi} and correlation weight wij for each neighbor puj ∈ pu using the gradient of the log pseudo-likelihood ∂logL ∂wo = fo(pui,o) − lmax∑ pui=lmin P(pui | pu,o)fo(pui,o) − wi σ2 (4.3.33) ∂logL ∂wij = fij(pui,puj) − ∑lmax pui=lmin P(pui | pu,o)fij(pui,puj) − wijσ2 (4.3.34) 80 Item Recommendation The ultimate goal of RecSys is often to rank the items and recommend the Top-N items to the user. To obtain the item rankings, PrefMRF estimates distributions over user-wise preferences which can be converted into point estimate: The PrefCRF produces distributions over the user-wise preferences, which can be converted into point estimate: Most Likely Preference The preference can be determined by selecting the pref- erence with the greatest mass in local likelihood: p̂ui = arg max pui P(pui | pu,o) (4.3.35) where the local likelihood is given by Eq. 4.3.30. The local likelihood serves as a confidence measure. Smoothed Expectation When the prediction is not strict to discrete values, the expectation can be used instead: p̂ui = lmax∑ pui=lmin puiP(pui | pu,o) (4.3.36) where l refers to the intervals of user-wise preferences: from least to most preferred. Note that l is limited to the simplest case of 3 intervals in our settings, but more intervals are possible. The predictions by item-item correlation and user-user correlations can be merged by taking the mean value, and then items can be sorted and ranked accordingly. Finally, Alg. 2 summarizes the learning and prediction procedures for the PrefCRF. 81 Algorithm 2 PrefCRF Algorithm Input: Explict or implicit preferences. Step 1: Infer PR from preferences. Step 2: Predict user-wise preferences p̂ui using Eq. 4.2.2. Step 3: Predict distribution for each p̂ui using Eq. 4.3.6. Step 4: Repeat for each u ∈ U do for each pui ∈ pu do Compute normalization term Zui using Eq. 4.3.31 Compute local likelihood using Eq. 4.3.30 Compute attribute feature fi and fu using Eq. 4.3.28 Compute gradients for attribute features fo using Eq. 4.3.33 Update wo with the gradient using Eq. 4.3.32 for each puj ∈ pu, i �= j ∧ fij ∈ fstrong do Get correlation feature fij and fuv using Eq. 4.3.26 and Eq. 4.3.27 Get gradient for correlation feature fij using Eq. 4.3.34 Update wij with the gradient using Eq. 4.3.32 end for end for end for Until stopping criteria met Predictions: * Predict user-wise preferences using Eq. 4.3.36 or Eq. 4.3.35. * Select Top-N items according to estimated preferences. 82 Computational Complexity We perform the computational complexity analysis on the PrefMRF and its under- lying PrefNMF algorithms. Given n users and m items each with du and di prefer- ences, respectively. Let us temporarily ignore the user-specified latent factors. Then the complexity of both PrefNMF and PrefMRF is O(nd2u). However, in practice few item co-rated by the same user are strong neighbors of each other due to the correla- tion threshold defined in Section 4.3.4. As a result, the computation time of PrefMRF tends to be O(nduc) where c is a factor of correlation threshold. 4.4 Experiment and Analysis Datasets Ideally, the experiments should be conducted on datasets that contain user preferences in two forms: PR and absolute ratings. Unfortunately no such a dataset is publicly available at the moment, therefore we choose to compile the rating-based datasets into the form of PR. We use the same conversion method as in [19] by comparing the ratings of each ordered pair of items co-rated by the same user. For example, 1 is assigned to the PR πuij if pui > puj; 0 is assigned if pui < puj, and 0.5 is assigned if pui = puj. Experiments were conducted on two datasets: the MovieLens-1M 1 and the Each- Movie 2 datasets. The MovieLens-1M dataset contains more than 1 million ratings 1http://grouplens.org/datasets/movielens 2http://grouplens.org/datasets/eachmovie 83 by 6,040 users on 3,900 movies. The EachMovie dataset contains 2.8 million ratings by 72,916 users on 1,628 movies. The minimum rating is 1 and we cap the maximum at 5 for both datasets. The impact of side information is studied on the MovieLens- 1M dataset which provides gender, age, and occupation information about users and genres of movies. For a reliable and fair comparison, each dataset is split into train and test sets, and the following settings are aligned to related work [83]. As the sparsity levels differ between the MovieLens-1M and the EachMovie datasets, different number of ratings are reserved for training and the rest for testing. Specifically, for each user in the MovieLens-1M we randomly select N = 30, 40, 50, 60 ratings for training, and put the rest for testing. Some users do not have enough ratings thus were excluded from experiments. The EachMovie has less items but much more users comparing to MovieLens-1M, therefore it is safe to remove some less active users and we set N = 70, 80, 90, 100 to investigate the performance on dense dataset. Evaluation Metrics Traditional recommender systems aim to optimize RMSE or MAE which emphasizes on absolute ratings. However, the ultimate goal of recommender systems is usually to obtain the ranking of items [42], where good performance on RMSE or MAE may not be translated into good ranking results [42]. Therefore, we employ two evaluation metrics: Normalized Cumulative Discounted Gain@T (NDCG@T) [34] which is popular in academia, and Mean Average Precision@T (MAP@T) [13] which 84 is popular in contests 3. Among them, the NDCG@T metric is defined as NDCG@T = 1 K(T) T∑ t=1 2rt − 1 log2 (t + 1) (4.4.1) where rt is the relevance judgment of the item at position t, and K(T) is the normal- ization constant. The MAP@T metric is defined as MAP@T = 1 |Utest| ∑ u∈Utest T∑ t=1 Pu(t) min(mu, t) (4.4.2) where mu is the number relevant items to user u, and Pu(t) is user u’s precision at position t. Both metrics are normalized to [0,1], and a higher value indicates better performance. These metrics, together with other ranking-based metrics, require a set of relevant items to be defined in the test set such that the predicted rankings can be evaluated against. The relevant items can be defined in different ways. For example, for each user we can consider only 5-star items in the testing set as relevant items, or those items above each user’s average as relevant items. In this paper, we follow the same selection criteria used in the related work [12,38] to consider items with the highest ratings as relevant. Parameter Setting For a fair comparison, we fix the number of latent factors to 50 for all algorithms, the same as in related work [15]. The number of neighbors for KNN algorithms is set to 50. We vary the minimum correlation threshold to examine the performances with different number of features. Different values of regularization coefficient are also tested. 3KDD Cup 2012 and Facebook Recruiting Competition 85 4.4.1 Results and Analysis We first compare the performance of the proposed PrefMRF and PrefCRF models with four related models: KNN, NMF, PrefKNN, and PrefNMF, where the PrefNMF is the targeted model, and then investigate the impact of parameter settings. Comparison on Top-N Recommendation Comparison of these algorithms is conducted by measuring the NDCG and the MAP metrics on Top-N recommendation tasks. Each experiment is repeated ten times with different random seeds and we report the mean results with standard deviations on MovieLens-1M dataset in Table 4.3 and EachMovie dataset in Table 4.4. Note that only the MovieLens-1M dataset has side information which is used by the PrefCRF model. The PrefMRF as well as other models are based on only preferences data. We also report the NDCG and MAP values by varying the position T (i.e., how many items to recommend) in Fig. 4.3 and Fig. 4.4 for MovieLens-1M dataset and in Fig. 4.5 and Fig. 4.6 for MovieLens-1M dataset. The following observations can be made based on the results. Firstly, the KNN and the PrefKNN methods didn’t perform well on MovieLens- 1M comparing with Matrix Factorization based methods. One possible reason is that predictions are made based only on the neighbors, and as a result too much information has been ignored especially when the dataset is large. However, the performance of KNN -based methods has improved on the EachMovie dataset as we reserved more ratings for training, i.e., better neighbors can be found for prediction. Secondly, PrefNMF outperforms NMF on MovieLens-1M dataset which is con- sistent to the results reported in [19]. However, PreNMF does not perform well 86 Table 4.3: Results over ten runs on MovieLens-1M dataset. Given 30 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.3969 ± 0.0020 0.4081 ± 0.0029 0.2793 ± 0.0021 0.2744 ± 0.0025 NMF 0.5232 ± 0.0057 0.5195 ± 0.0040 0.3866 ± 0.0055 0.3549 ± 0.0037 PrefKNN 0.3910 ± 0.0044 0.4048 ± 0.0038 0.2745 ± 0.0043 0.2720 ± 0.0037 PrefNMF 0.5729 ± 0.0049 0.5680 ± 0.0041 0.4387 ± 0.0046 0.3992 ± 0.0033 PrefMRF 0.6020 ± 0.0050 0.5934 ± 0.0039 0.4721 ± 0.0050 0.4244 ± 0.0036 PrefCRF 0.6316 ± 0.0076 0.5966 ± 0.0028 0.6254 ± 0.0073 0.4245 ± 0.0028 Given 40 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.4108 ± 0.0040 0.4252 ± 0.0036 0.2936 ± 0.0036 0.2877 ± 0.0034 NMF 0.5323 ± 0.0050 0.5291 ± 0.0034 0.3976 ± 0.0045 0.3631 ± 0.0035 PrefKNN 0.4122 ± 0.0024 0.4283 ± 0.0024 0.2944 ± 0.0023 0.2904 ± 0.0023 PrefNMF 0.5773 ± 0.0037 0.5732 ± 0.0028 0.4437 ± 0.0041 0.4019 ± 0.0032 PrefMRF 0.6215 ± 0.0029 0.6140 ± 0.0023 0.4844 ± 0.0025 0.4420 ± 0.0020 PrefCRF 0.6435 ± 0.0064 0.6092 ± 0.0023 0.6420 ± 0.0062 0.4392 ± 0.0021 Given 50 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.4273 ± 0.0040 0.4424 ± 0.0027 0.3078 ± 0.0038 0.3015 ± 0.0026 NMF 0.5360 ± 0.0041 0.5326 ± 0.0036 0.4010 ± 0.0040 0.3669 ± 0.0025 PrefKNN 0.4326 ± 0.0027 0.4483 ± 0.0030 0.3125 ± 0.0024 0.3070 ± 0.0022 PrefNMF 0.5761 ± 0.0067 0.5745 ± 0.0035 0.4424 ± 0.0064 0.4019 ± 0.0033 PrefMRF 0.6248 ± 0.0053 0.6172 ± 0.0032 0.4896 ± 0.0053 0.4460 ± 0.0027 PrefCRF 0.6648 ± 0.0055 0.6158 ± 0.0018 0.6580 ± 0.0059 0.4471 ± 0.0024 Given 60 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.4480 ± 0.0044 0.4622 ± 0.0035 0.3266 ± 0.0036 0.3163 ± 0.0027 NMF 0.5462 ± 0.0068 0.5409 ± 0.0063 0.4109 ± 0.0069 0.3734 ± 0.0055 PrefKNN 0.4526 ± 0.0062 0.4689 ± 0.0039 0.3301 ± 0.0051 0.3223 ± 0.0033 PrefNMF 0.5756 ± 0.0062 0.5733 ± 0.0048 0.4409 ± 0.0059 0.4007 ± 0.0037 PrefMRF 0.6422 ± 0.0037 0.6301 ± 0.0037 0.5112 ± 0.0035 0.4600 ± 0.0026 PrefCRF 0.6772 ± 0.0074 0.6242 ± 0.0018 0.6715 ± 0.0072 0.4536 ± 0.0016 87 Table 4.4: Results over ten runs on EachMovie dataset without side information. Given 70 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.7088 ± 0.0020 0.7115 ± 0.0015 0.6012 ± 0.0027 0.5767 ± 0.0017 NMF 0.7581 ± 0.0022 0.7577 ± 0.0017 0.6524 ± 0.0026 0.6225 ± 0.0020 PrefKNN 0.7260 ± 0.0022 0.7307 ± 0.0018 0.6197 ± 0.0020 0.5990 ± 0.0016 PrefNMF 0.7408 ± 0.0033 0.7348 ± 0.0039 0.6330 ± 0.0035 0.5800 ± 0.0038 PrefMRF 0.8317 ± 0.0032 0.8245 ± 0.0029 0.7512 ± 0.0039 0.6921 ± 0.0034 Given 80 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.7146 ± 0.0018 0.7168 ± 0.0017 0.6070 ± 0.0021 0.5825 ± 0.0019 NMF 0.7636 ± 0.0021 0.7638 ± 0.0018 0.6583 ± 0.0025 0.6286 ± 0.0018 PrefKNN 0.7337 ± 0.0028 0.7377 ± 0.0018 0.6271 ± 0.0029 0.6057 ± 0.0021 PrefNMF 0.7422 ± 0.0036 0.7319 ± 0.0040 0.6329 ± 0.0039 0.5774 ± 0.0033 PrefMRF 0.8364 ± 0.0036 0.8232 ± 0.0030 0.7553 ± 0.0038 0.6991 ± 0.0032 Given 90 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.7191 ± 0.0022 0.7279 ± 0.0028 0.6120 ± 0.0021 0.5933 ± 0.0013 NMF 0.7712 ± 0.0039 0.7692 ± 0.0033 0.6663 ± 0.0043 0.6431 ± 0.0034 PrefKNN 0.7418 ± 0.0028 0.7421 ± 0.0015 0.6357 ± 0.0030 0.6192 ± 0.0020 PrefNMF 0.7456 ± 0.0031 0.7358 ± 0.0038 0.6357 ± 0.0040 0.5819 ± 0.0036 PrefMRF 0.8394 ± 0.0035 0.8249 ± 0.0032 0.7474 ± 0.0037 0.7046 ± 0.0032 Given 100 Algorithm NDCG@5 NDCG@10 MAP@5 MAP@10 UserKNN 0.7279 ± 0.0028 0.7277 ± 0.0015 0.6238 ± 0.0032 0.5973 ± 0.0021 NMF 0.7741 ± 0.0030 0.7717 ± 0.0028 0.6719 ± 0.0034 0.6411 ± 0.0030 PrefKNN 0.7505 ± 0.0019 0.7511 ± 0.0012 0.6478 ± 0.0020 0.6231 ± 0.0014 PrefNMF 0.7391 ± 0.0033 0.7298 ± 0.0034 0.6318 ± 0.0039 0.5761 ± 0.0039 PrefMRF 0.8418 ± 0.0031 0.8277 ± 0.0030 0.7546 ± 0.0038 0.7063 ± 0.0036 88 on EachMovie where its performance is only slightly better than user-based KNN. The reason behind could be the EachMovie is much denser than the MovieLens-1M dataset, which makes the number of PR huge and difficult to tune optimal parame- ters. Besides, we observe that PrefNMF in general only achieves a slight improvement with more training data and even drops a bit with Given 60. Similarly for the Each- Movie dataset. With these observations, it appears that for a given number of users, the PrefNMF can be trained reasonably well with fewer data. Finally, the proposed PrefMRF and PrefCRF have made further improvement on both datasets upon the PrefNMF through capturing both LS and GS, as well as exploiting side information. From Fig. 4.3 and Fig. 4.4 we can see that the algorithms stabilized around position 10 and PrefMRF and PrefCRF consistently deliver better performance than others. It should be noted that the performance of PrefMRF and PrefCRF rely on their underlying model that captures the GS. In other words, the performance may vary when the PrefNMF is replaced with other alternative methods such as [45]. To confirm the improvements, a paired t-test (two-tailed) with a significance level of 95% has been applied to the best PrefMRF and the second best PrefNMF. Re- sults shown in Table 4.5 confirm that the performance of models with and without capturing the LS is statistically significant. Performance on Various Data Sparsity Levels To thoroughly examine the performance of these algorithms, we compare their per- formances under different settings of training set sizes: from Given 30 to Given 60 on MovieLens-1M dataset, and from Given 70 to Given 100 on EachMovie dataset. 89 (a) NDCG@T (Given 30) (b) MAP@T (Given 30) (c) NDCG@T (Given 40) (d) MAP@T (Given 40) Figure 4.3: Performance of different position T on MovieLens-1M dataset (Sparse). 90 (a) NDCG@T (Given 50) (b) MAP@T (Given 50) (c) NDCG@T (Given 60) (d) MAP@T (Given 60) Figure 4.4: Performance of different position T on MovieLens-1M dataset (Dense). 91 (a) NDCG@T (Given 70) (b) MAP@T (Given 70) (c) NDCG@T (Given 80) (d) MAP@T (Given 80) Figure 4.5: Performance of different position T on EachMovie dataset (Sparse). 92 (a) NDCG@T (Given 90) (b) MAP@T (Given 90) (c) NDCG@T (Given 100) (d) MAP@T (Given 100) Figure 4.6: Performance of different position T on EachMovie dataset (Dense). 93 Table 4.5: Paired t-test for PrefMRF and PrefNMF. Settings t-test statistics Dataset Sparsity Metric df t p-value MovieLens Given 60 NDCG@10 9 16.6218 < 0.00001 MovieLens Given 60 MAP@10 9 23.5517 < 0.00001 EachMovie Given 100 NDCG@10 9 72.4189 < 0.00001 EachMovie Given 100 MAP@10 9 72.1346 < 0.00001 Results are plotted in Fig. 4.7. It can be observed that in general more training data result in better performance. However, PrefNMF does not gain much benefit from more data and even perform slightly worse in Given 60. The PrefMRF on the other hand consistently gains performance from more data as the LS information can be better captured. Impact of Minimum Correlation Threshold As described in Section 4.3.4, a minimum correlation threshold is required to control the number of features in the PrefMRF model. By default, each pair of co-rated items has a feature which results in a large number of features. However, many of these features are useless if the item-item correlation are weak. To make the model more robust and with faster convergence, a minimum correlation threshold is applied to remove weak features. Specifically, the feature is removed if two items has a correlation measured by Pearson correlation less than the threshold. Results are 94 (a) Mean NDCG by training set sizes on MovieLens-1M dataset. (b) Mean MAP by training set sizes on MovieLens-1M dataset. (c) Mean NDCG by training set sizes on Each- Movie dataset. (d) Mean MAP by training set sizes on Each- Movie dataset. Figure 4.7: Impact of Sparsity Levels. 95 plotted in Fig. 4.8(a). It can be observed that a smaller correlation threshold delivers better performance, however, the number of features will also increase. To balance the performance and computation time, it is wise to select a moderate level of threshold depending on the dataset. (a) NDCG@10 by Threshold (b) NDCG@10 by Regularization Figure 4.8: Impact of Parameters (MovieLens-1M) Impact of Regularization Coefficient As the number of features in PrefMRF can be large, the model might be prone to over-fitting. Therefore, we investigate the impact of regularization settings as plotted in Fig. 4.8(b). We observe that the performance is better when a small regularization penalty applies. In other words, the PrefMRF can generalize reasonable well without too much regularization. This can be explained as the weights of item-item correlations 96 are not user-specific but shared by all users, thus they cannot over-fit every user perfectly. 4.5 Summary In this chapter we presented the PrefMRF model, which takes advantages of both the representational power of the MRF and the ease of modeling preference relations by the PrefNMF. To the best of our knowledge, there was no PR-based method that can capture both LS and GS, until the PrefMRF model is proposed in this work. In addition, side information can be easily incorporated by extending the PrefMRF model to the PrefCRF model. Experiment results on public datasets demonstrate that both types of interactions have been properly captured by PrefMRF, and significantly improved Top-N recommendation performance has been achieved. In addition, the PrefMRF model provides a generic interface for unifying various PR-based methods other than the PrefNMF used in this paper. In other words, any PR-based method that captures the GS should be able to take advantages from PrefMRF to capture LS as well. For future work, we would like to work on several directions. First, the compu- tation efficiency of PR-based matrix factorization needs to be improved given that the number of preference relations is usually much larger than absolute ratings. This is feasible as each user has his/her own set of preference relations, thus the learning process can be parallelized. Secondly, it would be interesting to see how PR-based methods perform on real implicit preferences dataset, such as page views and mouse clicks. Chapter 5 Learning from Heterogeneous Data Sources for Improved Top-N Recommendation 5.1 Introduction RecSys aim to recommend users with some of their potentially interesting items, which can be virtually anything ranging from movies to tourism attractions. To identify the appropriate items, RecSys attempts to exploit user preferences [41] and various side information [6]. However, the cold-start problem [72] raises when little information is known for cold users or items, e.g., a newly registered user. To alleviate the cold-start problem, additional information, which is usually heterogeneous, must be acquired. The last decade has seen a growing trend towards creating and managing more profiles in Online Social Networks (OSN), such as Facebook, LinkedIn, and Netflix etc. In light of this trend, it becomes possible to alleviate the cold-start problem by learning user preferences from heterogeneous data sources, e.g., a cold user of Netflix may have been used Facebook for a while. Nevertheless, user preferences from heterogeneous sources are heterogeneous whereas existing recommendation techniques 97 98 require the user preferences to be homogeneous. For example, user preferences are expressed as 5-star ratings for Netflix 1, 6-star ratings for EachMovie 2, and binary ratings for Facebook. Sometimes the user preferences may not even be expressed as ratings but as implicit feedbacks such as page views and clicks. Moreover, user preferences collected from heterogeneous sources may have different biases, as the user preferences not only reflect the quality of the items but also the quality of service providers. When the user preferences are heterogeneous and with biases, existing recommendation techniques cannot be directly applied. To the best of our knowledge, no previous work has considered the heterogeneous sources problem, in which the user preferences are heterogeneous with biases. In this work, we identify that the preference relations (PR), which measures the relative ordering between items, could be a key to the heterogeneous sources problem. With the assistance of PR, user preferences from heterogeneous sources can be fused seamlessly. For example, user preferences expressed as 5-star ratings, binary ratings, and page views can not be directly fused in general. However, all those user preferences can be deduced into the PR format by performing pairwise comparison on items. Once the user preferences are represented in PR, a direct merge can be performed. In fact, converting user preferences into PR not only provides a method to merge heterogeneous data but also reduces the biases that come with heterogeneity, i.e., the relative ordering of items is resistant to biases. In addition to collecting more user preferences, another method to alleviate the cold-start problem is to utilize the heterogeneous side information, such as attributes like age or occupation of users and items. 1http://www.netflixprize.com 2http://grouplens.org/datasets/eachmovie 99 This chapter aims to propose the Preference Relation-based Conditional Random Fields (PrefCRF) model to learn from heterogeneous data sources as well as the het- erogeneous side information The remaining part of this paper is organized as follows. Section 5.2 introduces the basic concepts of learning from heterogeneous sources and preference relations, followed by a review of related work. Section 5.3 is devoted to the proposed PrefCRF model. Benchmark results on Top-N recommendation are presented in Section 5.4. Finally, Section 5.5 concludes this chapter by summarizing the main contributions and envisaging future works. 5.2 Preliminaries This section briefly summarizes necessary background related to the heterogeneous sources problem and the preference relations that form the basis of our solution. 5.2.1 Heterogeneous Sources User preferences are usually assumed to come from a single homogeneous source. This assumption is becoming invalid given the rapid development of OSN, in which users maintain multiple profiles and the form of preferences diverges. We define two sources as heterogeneous if their preferences are 1) in different forms, e.g., ratings and clicks; 2) in different scales, e.g., 5-star scale and 6-star scale; 3) or biased differently due to factors irrelevant to the items’ quality, e.g., quality of the service providers. Based on this definition, not only the physically separated 100 sources are heterogeneous but a source changed significantly is also considered het- erogeneous to itself. In general, user preferences from heterogeneous sources cannot be merged directly as they may be in different forms. Even if their forms are the same, the scales could be different, where a force casting may change the meaning of preferences. In case that the scales are the same, biases are still introduced by the sources which make the recommendations inaccurate. 5.2.2 Preference Relation Preference relation (PR) encodes user preferences in the form of relative ordering between items, which is a useful alternative representation to absolute ratings as suggested in recent works [12, 19]. In fact, existing preferences such as ratings or other types of preferences can be easily represented as PR and then merged into a single dataset as shown in Fig. 5.1. . This property is particularly useful for the cold-start problem but has been overlooked in literature. We formally define the PR as follows. Let U = {u}n and I = {i}m denote the set of n users and m items, respectively. The PR of a user u ∈ U between items i and j is encoded as πuij, which indicates the strength of user u’s preference relation for the ordered item pair (i,j). A higher value of πuij indicates a stronger preference to the first item over the second item. 101 U se r Pr ef er en ce s Im pl ic it Pr ef er en ce s So ur ce 1 So ur ce 2 So ur ce n PR PR Pu rc ha se s Pl ay ed c ou nt C lic ks Pa ge v ie w s ... Ex pl ic it Pr ef er en ce s So ur ce 1 So ur ce 2 So ur ce n Th um bs u p/ do w n 5- st ar ra tin gs 6- st ar ra tin gs ... 5- st ar ra tin gs PR PR PR PR PR F ig u re 5 .1 : F lo w fr o m u se r p re fe re n ce s to P R 102 Definition 7 (Preference Relation). The preference relation is defined as πuij = ⎧⎪⎪⎪⎪⎪⎪⎨ ⎪⎪⎪⎪⎪⎪⎩ (2 3 ,1] if i � j (u prefers i over j) [1 3 , 2 3 ] if i � j (i and j are equally preferable) [0, 1 3 ) if i ≺ j (u prefers j over i) (5.2.1) where πuij ∈ [0,1] and πuij = 1 − πuji. An interval is allocated for each preference category, i.e., preferred, equally pre- ferred, and less preferred. Indeed, each preference category can be further break down into more intervals, though here in this paper we consider the minimal case of 3 intervals. Similar to [12], the PR can be converted into user-wise preferences over items which encode the ranking of items evaluated by a particular user. Definition 8 (User-wise Preference). The user-wise preference is defined as pui = ∑ j∈Iu\i[[πuij > 2 3 ]] − ∑ j∈Iu\i[[πuij < 1 3 ]] |Πui| (5.2.2) where Iu is the set of items related to user u, [[·]] gives 1 for true and 0 for false, and Πui is the set of user u’s PR related to item i. The user-wise preference pui falls in the interval [−1,1], where 1 and −1 indicate that item i is the most and the least preferred item for u, respectively. 5.2.3 Related Work Preference Relation (PR) has been widely studied in the field of Information Re- trieval [33]. Nevertheless, PR-based RecSys have only emerged recently [12,19,45,64]. 103 User preferences are usually categorized into three classes: pointwise, pairwise, and listwise. The pointwise preferences correspond to the absolute ratings widely used in RecSys, the pairwise preferences correspond to the PR presented in this work, and the listwise preferences is indeed the output of Top-N RecSys. Though RecSys is not limited to absolute ratings, the recommendation task is usually considered as a rating prediction problem. Recently, a considerable literature [12,19,45,64,74] has grown up around the theme of relative preferences. Meanwhile, recommendation task is also shifting from rating prediction to item ranking [74,83], in which the ranking itself is also relative preferences. Recently, pairwise preference relation-based [12, 19, 45, 64] and listwise-based [74] RecSys have been proposed. Among them, the pairwise approach is the most popular, which can be further cat- egorized as memory-based methods [12] and model-based methods [19, 45, 64]. The pairwise PR-based RecSys has the potential to unifying heterogeneous user prefer- ences, and this property, to the best of our knowledge, is first recognized in the present paper. Though the PR-based RecSys has the potential to alleviate the cold- start problem, it does not utilize the side information. Recent advances in PR-based RecSys [12, 19, 45, 64] and Conditional Random Fields (CRF) [78] have made it possible to unify the heterogeneous preferences into the unified PR format as well as modeling side information. This observation leads to a natural extension presented in this work to unify the CRF-based method with the PR-based methods, to complement their strengths. There exist two similar research topics that should be distinguished from our work. The first one is the cross-domain RecSys [60] which considers heterogeneous items, i.e., mixing movies and books, while our work considers the heterogeneous preferences 104 associated to the same type of items. The other topic is the mixture of experts [75], in which the non-personalized expert opinions from different sources are weighted into the personalized recommendations, while our work considers heterogeneous prefer- ences of the same user distributed across various sources. 5.3 Preference Relation-based Conditional Random Fields In this section, we propose the Preference Relation-based Conditional Random Fields (PrefCRF) to model both the heterogeneous preferences and the side informa- tion. The rest of this section defines the PR-based RecSys problem, and introduces the concept of the PrefNMF [19] that forms our underlying model, followed by a detailed description of the PrefCRF and discussion on issues such as feature design, parameter estimation, and predictions. 5.3.1 Problem Statement Generally, the task of PR-based RecSys is to take PR as input and output Top-N recommendations. Specifically, let πuij ∈ Π encode the PR of each user u ∈ U, and each πuij is defined over an ordered item pair (i, j), denoting i ≺ j, i � j, or i � j. The main task towards Top-N recommendations is to estimate the value of each unknown πuij ∈ Πunknown, such that π̂uij approximates πuij. This can be considered 105 as an optimization task that performs directly on the PR π̂uij = arg min π̂uij∈[0,1] (πuij − π̂uij)2 (5.3.1) However, it can be easier to estimate the π̂uij by the difference between two user- wise preferences pui and puj, i.e., π̂uij = φ(p̂ui − p̂uj), where φ(·) is a function that bounds the value into [0,1] and ensures φ(0) = 0.5. For example, the inverse-logit function φ(x) = e x 1+ex can be used when user-wise preferences involve large values. The objective of this paper is then to solve the following optimization problem (p̂ui, p̂uj) = arg min p̂ui,p̂uj (πuij − φ(p̂ui − p̂uj))2 (5.3.2) which optimizes the user-wise preferences directly, and Top-N recommendations can be obtained by simply sorting the estimated user-wise preferences. 5.3.2 Preference Relation-based Matrix Factorization Matrix Factorization (MF) [41] is a popular RecSys approach that has mainly been applied to absolute ratings. Recently, the PrefNMF [19] model was proposed to accommodate PR input for MF models. Like traditional MF models, the PrefNMF model discovers the latent factor space shared between users and items, where the latent factors describe both the taste of users and the characteristics of items. The attractiveness of an item to a user is then measured by the inner product of their latent feature vectors. Formally, each user u is associated with a latent feature vector uu ∈ Rk, and each item i is associated with a latent feature vector vi ∈ Rk, where k is the dimension of the latent factor space. The attractiveness of items i and j to user u are u�u vi and u�u vj, respectively. When u � u vi > u � u vj, the item i is said to be more preferable 106 to the user u than item j, i.e., i � j. The strength of this preference relation πuij can be estimated by u�u (vi − vj), and the inverse-logit function is applied to ensure π̂uij ∈ [0,1]: π̂uij = eu � u (vi−vj) 1 + eu � u (vi−vj) (5.3.3) The latent feature vectors uu and vi are learned by minimizing regularized squared error with respect to the set of all known preference relations Π: min uu,vi∈Rk ∑ πuij∈Π∧(i θL−1 (5.3.10) where L is the number of ordinal levels and θl are the threshold values of interest. 110 The probability of receiving a preference l is therefore: Q(pui = l | u,i) = ∫ θl θl−1 P(xui | θ) dθ = F(θl) − F(θl−1) (5.3.11) where F(θl) is the cumulative logistic distribution evaluated at θl with standard de- viation sui. F(xui ≤ l | θl) = 1 1 + exp(−θuil−μui sui ) (5.3.12) The thresholds θl can be user-specific or item-specific, and this work uses the user- specific parametrization same as in [42]. Then the thresholds θuil in Eq. 5.3.12 are replaced with a set of user-specific thresholds {θul}Ll=1 for each user u. These thresh- olds are then estimated from data for each user. 5.3.5 PrefCRF: Unifying PrefNMF and CRF The CRF provides a principled way of capturing both the side information and in- teractions among preferences. However, it employs the log-linear modeling as shown in Eq. 5.3.6, and therefore does not enable a simple treatment of PR. The PrefNMF, on the other hand, accepts PR but is weak in utilizing side information. The com- plementary between these two techniques calls for an unified PrefCRF model to take all the advantages. Unification Essentially, the proposed PrefCRF model captures the side information and promotes the agreement between the PrefNMF and the CRF. Specifically, the PrefCRF model combines the item-item correlations (Eq. 5.3.8) and the ordinal distributions Q(pui | 111 u,i) over user-wise preferences obtained from Eq. 5.3.11. P(pu | o) ∝ Ψu(pu,o) ∏ pui∈pu Q(pui | u,i) (5.3.13) where Ψu is the potential function capturing the side information and interaction among preferences related to user u. Though there is a separated graph for each user, the weights are optimized across all graphs. Feature Design A feature is essentially a function f of n > 1 arguments that maps the n-dimensional input into the unit interval f : Rn → [0,1]. We design the following kinds of features: Correlation Features The item-item correlation is captured by the feature fij(pui,puj) = g(|(pui − p̄i) − (puj − p̄j)|) (5.3.14) where g(α) normalizes feature values and α plays the role of deviation, and p̄i and p̄j are the average user-wise preference for items i and j, respectively. This correlation feature captures the intuition that correlated items should be ranked similarly by the same user after offsetting the goodness of each item. Attribute Features Each user u and item i has a set of attributes ou and oi, re- spectively. These attributes are mapped to preferences by the following features fi(pui) = oug(|(pui − p̄i)|) fu(pui) = oig(|(pui − p̄u)|) (5.3.15) where fi models which users like the item i and fu models which classes of items the user u likes. 112 Since one correlation feature exists for each pair of co-rated items, the number of correlation features can be large, and makes the estimation slow to converge and less robust. Therefore, we only keep strong correlation features fstrong extracted based on the Pearson correlation between items using a user-specified minimum correlation threshold. The correlations are computed using user-wise preferences generated from PR. Parameter Estimation In general, CRF models cannot be determined by standard maximum likelihood esti- mations, instead, approximation techniques are used in practice. This study employs the pseudo-likelihood [8] to estimate parameters by maximizing the regularized sum of log local likelihoods: logL(w) = ∑ pui∈Π log P(pui | pu,o) − 1 2σ2 w�w (5.3.16) where w are the weights and 1/2σ2 controls the regularization. To make the notation uncluttered, we write pu instead of explicitly as pu\pui. The local likelihood in Eq. 5.3.16 is defined as: P(pui|pu,o) = 1 Zui Q(pui|u,i)ψui(pui,o) ∏ puj∈pu ψij(pui,puj) (5.3.17) where Zui(o) is the normalization term: Zui = lmax∑ pui=lmin Q(pui | u,i)ψui(pui,o) ∏ puj∈pu ψij(pui,puj) (5.3.18) where lmin is the first and lmax is the last interval, i.e., 1 and 3 in our settings. 113 To optimize the parameters, we use the stochastic gradient ascent procedure that updates the parameters by passing through the set of ratings of each user: w ← w + η∇logL(w) (5.3.19) where η is the learning rate. More specifically, for each pui we update the attribute weights wo = {wu,wi} and correlation weight wij for each neighbor puj ∈ pu using the gradient of the log pseudo-likelihood ∂logL ∂wo = fo(pui,o) − lmax∑ pui=lmin P(pui | pu,o)fo(pui,o) − wi σ2 (5.3.20) ∂logL ∂wij = fij(pui,puj) − ∑lmax pui=lmin P(pui | pu,o)fij(pui,puj) − wijσ2 (5.3.21) Item Recommendation The ultimate goal of RecSys is often to rank the items and recommend the Top-N items to the user. As the input of PrefCRF is the preference relations, the final output will be item rankings instead of ratings. The PrefCRF produces distributions over the user-wise preferences, which can be converted into point estimates by computing the expectation p̂ui = lmax∑ pui=lmin puiP(pui | pu,o) (5.3.22) where l refers to the intervals of user-wise preferences: from the least to the most preferred. Given the predicted user-wise preferences, the items can be sorted and ranked accordingly. Alg. 3 summarizes the procedures of PrefCRF. 114 Algorithm 3 PrefCRF Algorithm Input: Heterogeneous preferences from sources. Step 1: Infer and merge PR. Step 2: Predict each user-wise preferences p̂ui using Eq. 5.3.3 and Eq. 5.2.2. Step 3: Predict distributions of p̂ui with Eq. 5.3.11. Step 4: Repeat for each u ∈ U do for each pui ∈ pu do Get local likelihood using Eq. 5.3.17 Get attribute feature fv using Eq. 5.3.15 Get attribute feature gradients using Eq. 5.3.20 Update wo with gradients using Eq. 5.3.19 for each puj ∈ pu, i �= j ∧ fij ∈ fstrong do Get correlation feature fij using Eq. 5.3.14 Get corr. feature gradient using Eq. 5.3.21 Update wij with gradient using Eq. 5.3.19 end for end for end for Until stopping criteria met Predictions: * Predict user-wise preferences using Eq. 5.3.22. * Sort and select the Top-N items. 115 Computational Complexity We perform the computational complexity analysis on the PrefCRF and its underly- ing PrefNMF algorithms. Given n users and m items each with du and di preferences, respectively. Let us temporarily ignore the user-specified latent factors. Then the complexity of both PrefNMF and PrefCRF is O(nd2u). However, in practice few item co-rated by the same user are strong neighbors of each other due to the correlation threshold defined in Section 5.3.5. As a result, the computation time of PrefCRF tends to be O(nduc) where c is a factor of correlation threshold. 5.4 Experiment and Analysis To study the performance of the proposed PrefCRF model, comparisons were done with the following representative algorithms: KNN [65], NMF [41], PrefKNN [12], and PrefNMF [19]. We implemented PrefCRF as well as the compared algorithms according to their original papers. 5.4.1 Experimental Settings Datasets and Experiment Design Experiments are conducted on four public datasets: MovieLens-1M 3, Amazon Movie Reviews 4, EachMovie 5, and MovieLens-20M 3. These datasets or their subsets are transformed to simulate four scenarios of heterogeneous data: 3http://grouplens.org/datasets/movielens 4http://snap.stanford.edu/data/web-Movies.html 5http://grouplens.org/datasets/eachmovie 116 Side Information The impact of side information is studied on the MovieLens-1M dataset which provides gender, age, and occupation information about users and genres of movies. The dataset contains more than 1 million ratings by 6,040 users on 3,900 movies. For a reliable comparison, the dataset is split into training and test sets with different sparsities, similar to related work [80,83]. Specifically, for each user we randomly select N = 30, 40, 50 and 60 ratings for training, and put the rest for testing. To ensure that each user has at least 10 movies for testing, users with less than 40, 50, 60 or 70 ratings are removed. Different Forms Amazon Movie Reviews dataset contains two forms of preferences: textual reviews and 5-star ratings. We extracted a dense subset by randomly selecting 5141 items with at least 60 reviews for each, and 2000 users with at least 60 movies reviews for each, and this results in 271K ratings. For each user, 50 random reviews are selected for training, and the rest are put aside for testing. The training set is further split into half ratings and half textual reviews. Rating-based models are trained on the ratings only, where PR-based models utilize textual reviews as well. Different Scales EachMovie dataset contains ratings in 6-star scale that can be easily converted into binary scale, i.e., ratings 1 − 3 and 4 − 6 are mapped to 0 and 1 respectively. We extract a subset by randomly selecting 3000 users who have rated at least 70 items as a dense dataset is required for splitting. The resultant dataset contains 120K ratings on 1495 items. For each user we randomly select 60 ratings for training and leave the rest for testing, and half of the ratings in the training set are mapped into binary scale. Rating-based models are trained on the 6-star ratings while PR-based models will exploit the 117 binary ratings as well. Different Biases We study the impact of biases by adding biases into a stable dataset with minimal existing biases. To prepare such dataset we extract a stable subset from the latest MovieLens-20M released on April-2015. Specif- ically, 258K ratings by 2020 users on 4408 movies released between 2010 and 2015 are extracted, where each user has rated at least 60 ratings. Biases are then introduced by adding a different Laplace noise sampled from Laplace(0,b) to each user and item. For PR-based methods, the same conversion method as in [19] is used to converted ratings into PR. For example, 1, 0 and 0.5 are assigned to the preference relation πuij when pui > puj, pui < puj, and pui = puj, respectively. Evaluation Metrics Traditional recommender systems aim to optimize RMSE or MAE which emphasizes on absolute ratings. However, the ultimate goal of recommender systems is usually to obtain the ranking of items [42], where good performance on RMSE or MAE may not be translated into good ranking results [42]. Therefore, we employ two evaluation metrics Normalized Cumulative Discounted Gain@T (NDCG@T) [34] that is popular in academia, and Mean Average Precision@T (MAP@T) [13] that is common in contests. The NDCG@T metric is defined as NDCG@T = 1 K(T) T∑ t=1 2rt − 1 log2 (t + 1) (5.4.1) 118 where rt is the relevance judgment of the item at position t, and K(T) is the normal- ization constant. The MAP@T metric is defined as MAP@T = 1 |Utest| ∑ u∈Utest T∑ t=1 Pu(t) min(mu, t) (5.4.2) where mu is the number of relevant items to user u, and Pu(t) is user u’s precision at position t. Both metrics are normalized to [0,1] with higher value indicates better performance. These metrics, together with other ranking-based metrics, require a set of relevant items to be defined in the test set such that the predicted rankings can be evaluated against. In this paper, we follow the same heuristics as in the related work [12,63] and consider items with the highest ratings as relevant. Parameter Setting For a fair comparison, we fix the number of latent factors to 50 for all algorithms, which is the same setting as used in [63].The number of neighbors for KNN algorithms is set to 50. We vary the minimum correlation threshold for the PrefCRF to examine the performance with different number of features. Different values of regularization coefficient are also tested. 5.4.2 Results and Analysis Algorithms are compared on four heterogeneous scenarios: side information, differ- ent forms, different scales and different biases. The impact of sparsity levels and parameters is studied on the MovieLens-1M dataset, while these settings for other experiments are fixed. Each experiment is repeated ten times with different random 119 seeds and we report the mean results with standard deviations. For each experiment, we also performed a paired t-test (two-tailed) with a significance level of 95% on the best and the second best results, and all p-values are less than 1 × 10−5. Fusing Side Information Table 5.1 shows the NDCG and MAP metrics on Top-N recommendation tasks by compared algorithms. It can be observed that the proposed PrefCRF, which captures the side information, consistently outperforms others. To confirm the improvement, we plot the results in Fig. 5.3(b) by varying the position T . The figure shows that PrefCRF not only outperforms others but has a strong emphasize on top items, i.e., T < 5. The impact of sparsity is investigated by plotting the results against sparsity levels as in Fig. 5.3(a). We can observe that the performance of PrefCRF increases linearly given more training data, while its underlying PrefNMF model is less extensible to denser dataset. One possible reason is that incorporating side information has extended the modeling capability of the model, resulting better utilization of more data. Fusing Preferences in Different Forms In this experiment, we first converted textual reviews into negative (−1), neutral (0), and positive (1) values using the NLTK library [10], and then converted them into PR. We study how these additional information can assist PR-based methods, and results over ten runs are shown in Table 5.2. Surprisingly, the performance of all PR-based methods except PrefCRF have decreased by incorporating textual reviews. 120 T a b le 5 .1 : R es u lt s o v er te n ru n s o n M o vi eL en s -1 M d a ta se t. G iv en 3 0 G iv en 4 0 A lg o ri th m N D C G @ 1 N D C G @ 1 0 M A P @ 1 M A P @ 1 0 N D C G @ 1 N D C G @ 1 0 M A P @ 1 M A P @ 1 0 U se rK N N 0 .4 3 0 6 ± 0 .0 0 1 1 0 .4 0 8 1 ± 0 .0 0 2 9 0 .3 5 3 9 ± 0 .0 0 7 1 0 .2 7 4 4 ± 0 .0 0 2 5 0 .3 6 9 5 ± 0 .0 0 4 8 0 .4 2 5 2 ± 0 .0 0 3 6 0 .3 6 6 3 ± 0 .0 0 4 7 0 .2 8 7 7 ± 0 .0 0 3 4 N M F 0 .5 2 7 4 ± 0 .0 0 8 4 0 .5 1 9 5 ± 0 .0 0 4 0 0 .5 2 2 5 ± 0 .0 0 8 1 0 .3 5 4 9 ± 0 .0 0 3 7 0 .5 4 2 4 ± 0 .0 0 6 7 0 .5 2 9 1 ± 0 .0 0 3 4 0 .5 3 7 7 ± 0 .0 0 6 6 0 .3 6 3 1 ± 0 .0 0 3 5 P re fK N N 0 .3 4 6 2 ± 0 .0 0 7 3 0 .4 0 4 8 ± 0 .0 0 3 8 0 .3 4 3 0 ± 0 .0 0 7 2 0 .2 7 2 0 ± 0 .0 0 3 7 0 .3 6 5 1 ± 0 .0 0 6 5 0 .4 2 8 3 ± 0 .0 0 2 4 0 .3 6 2 0 ± 0 .0 0 6 3 0 .2 9 0 4 ± 0 .0 0 2 3 P re fN M F 0 .5 7 7 8 ± 0 .0 1 1 2 0 .5 6 8 0 ± 0 .0 0 4 1 0 .5 7 2 4 ± 0 .0 1 0 9 0 .3 9 9 2 ± 0 .0 0 3 3 0 .5 8 8 3 ± 0 .0 0 7 3 0 .5 7 3 2 ± 0 .0 0 2 8 0 .5 8 3 2 ± 0 .0 0 7 3 0 .4 0 1 9 ± 0 .0 0 3 2 P re fC R F 0 .6 2 0 6 ± 0 .0 0 7 6 0 .5 8 5 6 ± 0 .0 0 2 8 0 .6 1 5 0 ± 0 .0 0 7 3 0 .4 1 9 5 ± 0 .0 0 2 8 0 .6 3 9 5 ± 0 .0 0 6 4 0 .5 9 9 0 ± 0 .0 0 2 3 0 .6 3 4 0 ± 0 .0 0 6 2 0 .4 2 9 4 ± 0 .0 0 2 1 G iv en 5 0 G iv en 6 0 A lg o ri th m N D C G @ 1 N D C G @ 1 0 M A P @ 1 M A P @ 1 0 N D C G @ 1 N D C G @ 1 0 M A P @ 1 M A P @ 1 0 U se rK N N 0 .3 8 3 1 ± 0 .0 0 6 3 0 .4 4 2 4 ± 0 .0 0 2 7 0 .3 8 0 3 ± 0 .0 0 6 0 0 .3 0 1 5 ± 0 .0 0 2 6 0 .4 0 3 5 ± 0 .0 0 9 0 0 .4 6 2 2 ± 0 .0 0 3 5 0 .4 0 0 2 ± 0 .0 0 8 5 0 .3 1 6 3 ± 0 .0 0 2 7 N M F 0 .5 4 3 0 ± 0 .0 0 8 3 0 .5 3 2 6 ± 0 .0 0 3 6 0 .5 3 9 0 ± 0 .0 0 8 2 0 .3 6 6 9 ± 0 .0 0 2 5 0 .5 5 4 7 ± 0 .0 1 0 9 0 .5 4 0 9 ± 0 .0 0 6 3 0 .5 5 0 4 ± 0 .0 1 1 3 0 .3 7 3 4 ± 0 .0 0 5 5 P re fK N N 0 .3 8 3 1 ± 0 .0 0 9 2 0 .4 4 8 3 ± 0 .0 0 3 0 0 .3 8 0 3 ± 0 .0 0 8 9 0 .3 0 7 0 ± 0 .0 0 2 2 0 .3 9 7 9 ± 0 .0 0 7 5 0 .4 6 8 9 ± 0 .0 0 3 9 0 .3 9 4 8 ± 0 .0 0 6 9 0 .3 2 2 3 ± 0 .0 0 3 3 P re fN M F 0 .5 8 7 3 ± 0 .0 0 9 6 0 .5 7 4 5 ± 0 .0 0 3 5 0 .5 8 3 0 ± 0 .0 0 9 8 0 .4 0 1 9 ± 0 .0 0 3 3 0 .5 8 5 4 ± 0 .0 1 4 5 0 .5 7 3 3 ± 0 .0 0 4 8 0 .5 8 0 8 ± 0 .0 1 4 2 0 .4 0 0 7 ± 0 .0 0 3 7 P re fC R F 0 .6 5 4 8 ± 0 .0 0 5 5 0 .6 0 6 8 ± 0 .0 0 1 8 0 .6 4 9 9 ± 0 .0 0 5 9 0 .4 3 7 2 ± 0 .0 0 2 4 0 .6 6 7 7 ± 0 .0 0 7 4 0 .6 1 3 9 ± 0 .0 0 1 8 0 .6 6 2 5 ± 0 .0 0 7 2 0 .4 4 3 6 ± 0 .0 0 1 6 121 (a) NDCG VS. Sparsity (b) NDCG@T (Given 60) Figure 5.3: Varying sparsity on MovieLens-1M dataset. We suspect that this is due to the misclassification errors introduced by sentiment classification on text. Indeed, converting preferences in different forms into PR can be accomplished in different ways, and may require domain knowledge fall beyond the scope of this paper. However, in the next subsection we will see that an accurate conversion can actually improve the performance. Table 5.2: Results over ten runs on Amazon dataset. Ratings Ratings + Textual Reviews Algorithm NDCG@10 MAP@10 NDCG@10 MAP@10 UserKNN 0.6244 ± 0.0040 0.4599 ± 0.0035 0.6244 ± 0.0037 0.4599 ± 0.0025 NMF 0.8073 ± 0.0040 0.6689 ± 0.0038 0.8073 ± 0.0041 0.6689 ± 0.0000 PrefKNN 0.6410 ± 0.0038 0.4690 ± 0.0029 0.5765 ± 0.0039 0.4083 ± 0.0029 PrefNMF 0.7495 ± 0.0040 0.5924 ± 0.0031 0.7377 ± 0.0030 0.5806 ± 0.0031 PrefCRF 0.8223 ± 0.0033 0.6813 ± 0.0027 0.8259 ± 0.0035 0.6890 ± 0.0026 122 Fusing Preferences in Different Scales In this experiment preferences in different scales are fused into PR to boost the performance of PR-based methods. The binary scale ratings are similar to the pos- itive/negative textual reviews, however without incorrect values introduced by text classification. Table 5.3: Results over ten runs on EachMovie dataset. 6-star Ratings 6-star Ratings + Binary Ratings Algorithm NDCG@10 MAP@10 NDCG@10 MAP@10 UserKNN 0.4374 ± 0.0047 0.3418 ± 0.0029 0.4374 ± 0.0047 0.3418 ± 0.0029 NMF 0.5211 ± 0.0078 0.3710 ± 0.0034 0.5211 ± 0.0078 0.3710 ± 0.0034 PrefKNN 0.4908 ± 0.0070 0.3793 ± 0.0031 0.5074 ± 0.0061 0.3938 ± 0.0044 PrefNMF 0.5233 ± 0.0061 0.3820 ± 0.0033 0.5454 ± 0.0060 0.3881 ± 0.0032 PrefCRF 0.5439 ± 0.0056 0.4006 ± 0.0045 0.5506 ± 0.0053 0.4038 ± 0.0043 Table 5.3 shows the performance of each method and the impact of position T is illustrated in Fig. 5.4. From the table, we can observe that the performance of all PR-based methods has increased by incorporating additional binary ratings, while the performance of rating-based methods remains the same. Fusing Preferences with Different Biases In this experiment we investigate the impact of different biases, particularly the user- wise and item-wise biases, which are sampled from Laplace(0,b) for each user and each item. Results for unbiased and biased datasets are reported in Table 5.4. For user-wise biases, we can see that the performance of rating-based methods 123 (a) NDCG@T (b) MAP@T Figure 5.4: Varying position T on EachMovie dataset. Table 5.4: NDCG@10 on MovieLens-20M dataset. Algorithm Bias = None User-bias = Laplace(0,2) Item-bias = Laplace(0,2) UserKNN 0.4465 ± 0.0033 0.3729 ± 0.0033 0.2914 ± 0.0017 NMF 0.4982 ± 0.0034 0.4566 ± 0.0032 0.3074 ± 0.0019 PrefKNN 0.4683 ± 0.0027 0.4683 ± 0.0027 0.3157 ± 0.0021 PrefNMF 0.4950 ± 0.0035 0.4950 ± 0.0035 0.3137 ± 0.0017 PrefCRF 0.5288 ± 0.0037 0.5288 ± 0.0037 0.3729 ± 0.0023 124 has decreased while PR-based methods are unaffected by such biases. This is further illustrated in Fig. 5.5(a). For item-wise biases, the performance of all methods has decreased, however, to different extent. Fig. 5.5(b) further shows the better resistance of PrefCRF to item-wise biases. (a) NDCG@10 (b) NDCG@10 Figure 5.5: Varying biases on MovieLens-20M dataset. Impact of Regularization and Correlation Threshold The proposed PrefCRF method has two user specified parameters: the regularization coefficient and a minimum correlation threshold that controls the number of corre- lation features. We examine the impact of these parameters on the MovieLens-1M dataset and the results are plotted in Fig. 5.6. For the regularization, we can see from Fig. 5.6(a) that the performance gets better when a small regularization penalty applies. In other words, PrefCRF can generalize reasonable well without too much regularization. One reason is that the model has already been regularized by its underlying PrefNMF model. Another reason is that 125 the weights of item-item correlations are not user-specific but shared by all users, thus they cannot over-fit every user perfectly. For the correlation threshold, Fig. 5.6(b) shows that a smaller threshold results better performance by including more correlation features, however, at the cost of more training time and more training data. (a) NDCG@10 (b) NDCG@10 Figure 5.6: Varying parameters on MovieLens-1M. 5.5 Summary In this chapter we proposed the PrefCRF model, which takes advantages of both the representational power of the CRF and the ease of modeling PR by the PrefNMF. To the best of our knowledge, this is the first study on unifying heterogeneous user preferences and heterogeneous side information. Experiment results on four public datasets demonstrate that various heterogeneous data have been properly handled by 126 PrefCRF, and significantly improved Top-N recommendation performance has been achieved. For future work, the computation efficiency of PR-based methods can be further improved given that the number of PR is usually much larger than ratings. Paral- lelization is feasible as as each user has a separated set of PR that can be processed simultaneously. Chapter 6 Conclusion and Future Work The research presented in this thesis consists of two parts: the first part focuses on the local and global structure and the side information issues and while the second part focuses on the heterogeneous data source issue. Several new research problems have been identified together with solutions. This chapter summarizes the research results and the main contributions of this thesis. 6.1 Contributions Theoretical and experimental results have led to the conclusion and main contribution of this thesis: • The Ordinal Random Fields (ORF) model is proposed to capture both the local and global structures of ordinal user preferences. Through this novel method, both structures are properly captured, resulting improved recommen- dation quality. ORF is one of first attempts on modeling both structures for ordinal user preferences. 127 128 • The Preference Relation-based Markov Random Fields (PrefMRF) model is pro- posed to capture both the local and global structures of Preference Relations. Due to the flexibility of previous preference relation-based models, only one type of structure can be modeled at a time. Through our novel method, both structures can be modeled, resulting significantly improved recommendation performance. PrefMRF is the first preference relation-based model that captures both types of structures. • The Preference Relation-based Conditional Random Fields (PrefCRF) model is proposed to incorporate side information such as item content and user profiles. PrefCRF is the first preference relation-based model that can incorporate side information in a principled way. • The preference relation-based models proposed in this thesis is applied to resolve the heterogeneous data sources problem. This is the first time this problem is spotted and we provide effective solutions to unify heterogeneous data. By exploiting multiple heterogeneous data sources help to alleviate the cold-start problem in which limited data is provided by each data source. 6.2 Future Work Although the proposed methods have addressed the aforementioned three issues to a certain extend, there remains several problems that need to be addressed in the future. We summarize these follows: • Parallelization of Preference Relation-based Models: The number of preference 129 relations are usually much larger than the number of traditional ratings. As a re- sult, the modeling process is often slower than rating-based methods. However, we observe that the preference relations of each user are kind of independent from other users’. Therefore, it it possible to parallelize the modeling process for each user to achieve faster modeling process. • Identifying Key Preference Relations: While there are so many possible prefer- ence relations, not all of them are important in making recommendations. It remains unknown how to measure the importance of each preference relation to the recommendation quality. If this information can be obtained in future work, then the number of preference relations to be used in modeling can be significantly reduced while the recommendation quality is preserved. Bibliography [1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans- actions on Knowledge and Data Engineering, 17(6):734–749, 2005. [2] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Rec- ommender systems handbook, pages 217–253. Springer, 2011. [3] G. Adomavicius and J. Zhang. On the stability of recommendation algorithms. In Proceedings of the fourth ACM conference on Recommender systems, pages 47–54. ACM, 2010. [4] G. Adomavicius and J. Zhang. Stability of recommendation algorithms. ACM Transactions on Information Systems (TOIS), 30(4):23, 2012. [5] M. Balabanović and Y. Shoham. Fab: content-based, collaborative recommen- dation. Commun. ACM, 40(3):66–72, 1997. [6] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML’04, pages 65–72. ACM, 2004. [7] J. Bennett and S. Lanning. The netflix prize. In Proceedings of KDD cup and workshop, volume 2007, page 35, 2007. [8] J. Besag. Spatial interaction and the statistical analysis of lattice systems. Jour- nal of the Royal Statistical Society, Series B, 36(2):192–236, 1974. 130 131 [9] D. Billsus and M. J. Pazzani. Learning collaborative information filters. In ICML, volume 98, pages 46–54, 1998. [10] S. Bird, E. Klein, and E. Loper. Natural language processing with Python. O’Reilly Media, Inc., 2009. [11] K. Bradley and B. Smyth. Improving recommendation diversity. In Proceedings of the Twelfth National Conference in Artificial Intelligence and Cognitive Science (AICS-01), pages 75–84. Citeseer, 2001. [12] A. Brun, A. Hamad, O. Buffet, and A. Boyer. Towards preference relations in recommender systems. In Preference Learning (PL 2010) ECML/PKDD, 2010. [13] O. Chapelle, D. Metlzer, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In CIKM’09, pages 621–630. ACM, 2009. [14] W. W. Cohen, R. E. Schapire, and Y. Singer. Learning to order things. J. Artif. Int. Res., 10(1):243–270, May 1999. [15] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-n recommendation tasks. In RecSys’10, pages 39–46. ACM, 2010. [16] A. Defazio and T. Caetano. A graphical model formulation of collaborative filter- ing neighbourhood methods with fast maximum entropy training. In ICML’12, pages 265–272, 2012. [17] M. S. Desarkar and S. Sarkar. Rating prediction using preference relations based matrix factorization. In UMAP Workshops, 2012. [18] M. S. Desarkar, S. Sarkar, and P. Mitra. Aggregating preference graphs for collaborative rating prediction. In RecSys’10, pages 21–28. ACM, 2010. 132 [19] M. S. Desarkar, R. Saxena, and S. Sarkar. Preference relation based matrix factorization for recommender systems. In UMAP’12, pages 63–75. Springer, 2012. [20] D. Fleder and K. Hosanagar. Blockbuster culture’s next rise or fall: The impact of recommender systems on sales diversity. Management science, 55(5):697–712, 2009. [21] Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. Journal of machine learning research, 4:933–969, 2003. [22] J. Fürnkranz and E. Hüllermeier. Pairwise preference learning and ranking. In Machine Learning: ECML’03, pages 145–156. Springer, 2003. [23] J. Fürnkranz and E. Hüllermeier. Preference learning. Springer, 2010. [24] Z. Gantner, L. Drumond, C. Freudenthaler, S. Rendle, and L. Schmidt-Thieme. Learning attribute-to-feature mappings for cold-start recommendations. In 2010 IEEE International Conference on Data Mining, pages 176–185. IEEE, 2010. [25] M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. In Proceedings of the fourth ACM conference on Recommender systems, pages 257–260. ACM, 2010. [26] P. J. Green. Reversible jump markov chain monte carlo computation and bayesian model determination. Biometrika, 82(4):711–732, 1995. [27] I. Guy, I. Ronen, and A. Raviv. Personalized activity streams: sifting through the river of news. In Proceedings of the fifth ACM conference on Recommender systems, pages 181–188. ACM, 2011. [28] I. Guy, N. Zwerdling, D. Carmel, I. Ronen, E. Uziel, S. Yogev, and S. Ofek- Koifman. Personalized recommendation of social software items based on social 133 relations. In Proceedings of the third ACM conference on Recommender systems, pages 53–60. ACM, 2009. [29] J. Han, M. Kamber, and J. Pei. Data mining: concepts and techniques. Morgan kaufmann, 2006. [30] G. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002. [31] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In IJCAI, volume 99, pages 688–693, 1999. [32] N. Houlsby, J. M. Hernandez-lobato, and Z. Ghahramani. Cold-start active learning with robust ordinal matrix factorization. In ICML’14, pages 766–774, 2014. [33] E. Hüllermeier, J. Fürnkranz, W. Cheng, and K. Brinker. Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16):1897–1916, 2008. [34] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of ir techniques. ACM TOIS, 20(4):422–446, 2002. [35] R. Jin, L. Si, and C. Zhai. Preference-based graphic models for collaborative filtering. In UAI’02, pages 329–336. Morgan Kaufmann Publishers Inc., 2002. [36] N. Jones, A. Brun, and A. Boyer. Comparisons instead of ratings: Towards more stable preferences. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, volume 1, pages 451–456. IEEE, 2011. [37] B. P. Knijnenburg, M. C. Willemsen, Z. Gantner, H. Soncu, and C. Newell. Explaining the user experience of recommender systems. User Modeling and User-Adapted Interaction, 22(4-5):441–504, 2012. 134 [38] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD’08, pages 426–434. ACM, 2008. [39] Y. Koren. Collaborative filtering with temporal dynamics. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 447–456. ACM, 2009. [40] Y. Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89–97, 2010. [41] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recom- mender systems. IEEE Computer, 42(8):30–37, 2009. [42] Y. Koren and J. Sill. Ordrec: an ordinal model for predicting personalized item rating distributions. In RecSys’11, pages 117–124. ACM, 2011. [43] J. D. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: prob- abilistic models for segmenting and labeling sequence data. In ICML’01, pages 282–289, 2001. [44] X. Li, G. Xu, E. Chen, and Y. Zong. Learning recency based comparative choice towards point-of-interest recommendation. Expert Systems with Applications, 42(9):4274–4283, 2015. [45] N. N. Liu, M. Zhao, and Q. Yang. Probabilistic latent preference analysis for collaborative filtering. In CIKM’09, pages 759–766. ACM, 2009. [46] S. Liu, T. Tran, G. Li, and Y. Jiang. Ordinal random fields for recommender systems. In ACML’14, pages 283–298. JMLR: workshop and conference proceed- ings, 2014. [47] P. Lops, M. de Gemmis, and G. Semeraro. Content-based recommender systems: State of the art and trends. In Recommender Systems Handbook, pages 73–105. Springer, 2011. 135 [48] L. Lü and W. Liu. Information filtering via preferential diffusion. Physical Review E, 83(6):066119, 2011. [49] L. Lü, M. Medo, C. H. Yeung, Y.-C. Zhang, Z.-K. Zhang, and T. Zhou. Recom- mender systems. Physics Reports, 519(1):1–49, 2012. [50] H. Ma, D. Zhou, C. Liu, M. R. Lyu, and I. King. Recommender systems with social regularization. In WSDM’11, pages 287–296. ACM, 2011. [51] R. D. Mare. Social background and school continuation decisions. Journal of the American Statistical Association, 75(370):295–305, 1980. [52] B. M. Marlin. Modeling user rating profiles for collaborative filtering. In NIPS’03. MIT Press, 2003. [53] P. Massa and P. Avesani. Trust-aware collaborative filtering for recommender systems. In On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE, pages 492–508. Springer, 2004. [54] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society, Series B, 42(2):109–142, 1980. [55] D. McFadden. Econometric models for probabilistic choice among products. Journal of Business, 53(3):S13–S29, 1980. [56] S. M. McNee, J. Riedl, and J. A. Konstan. Being accurate is not enough: how accuracy metrics have hurt recommender systems. In CHI EA’06, pages 1097– 1101. ACM, 2006. [57] K. Miyahara and M. J. Pazzani. Collaborative filtering with the simple bayesian classifier. In PRICAI 2000 Topics in Artificial Intelligence, pages 679–689. Springer, 2000. 136 [58] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of machine learning. MIT press, 2012. [59] A. Nakamura and N. Abe. Collaborative filtering using weighted majority pre- diction algorithms. In ICML, volume 98, pages 395–403, 1998. [60] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010. [61] U. Paquet, B. Thomson, and O. Winther. A hierarchical model for ordinal matrix factorization. Statistics and Computing, 22(4):945–957, 2012. [62] M. J. Pazzani and D. Billsus. Content-based recommendation systems. In The adaptive web, pages 325–341. Springer, 2007. [63] Y. Ren, G. Li, and W. Zhou. Learning rating patterns for top-n recommenda- tions. In ASONAM’12, pages 472–479. IEEE Computer Society, 2012. [64] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme. Bpr: Bayesian personalized ranking from implicit feedback. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 452–461. AUAI Press, 2009. [65] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. An open ar- chitecture for collaborative filtering of netnews. In CSCW’94, pages 175–186. ACM, 1994. [66] F. Ricci, L. Rokach, and B. Shapira. Introduction to recommender systems hand- book. Springer, 2011. [67] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Re- trieval of. Addison-Wesley, 1989. 137 [68] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimension- ality reduction in recommender system-a case study. Technical report, DTIC Document, 2000. [69] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-based collaborative fil- tering recommendation algorithms. In WWW’10, pages 285–295. ACM, 2001. [70] B. M. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Recommender systems for large-scale e-commerce: Scalable neighborhood formation using clustering. In Proceedings of the fifth international conference on computer and information technology, volume 1. Citeseer, 2002. [71] D. L. Schacter and C. S. Dodson. Misattribution, false recognition and the sins of memory. Philosophical Transactions of the Royal Society B: Biological Sciences, 356(1413):1385–1393, 2001. [72] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In SIGIR’02, pages 253–260. ACM, 2002. [73] A. Sharma and B. Yan. Pairwise learning in recommendation: experiments with community recommendation on linkedin. In RecSys’13, pages 193–200. ACM, 2013. [74] Y. Shi, M. Larson, and A. Hanjalic. List-wise learning to rank with matrix factorization for collaborative filtering. In RecSys’10, pages 269–272. ACM, 2010. [75] X. Su, R. Greiner, T. M. Khoshgoftaar, and X. Zhu. Hybrid collaborative filtering algorithms using a mixture of experts. In WI’07, pages 645–649. IEEE, 2007. [76] G. Takács, I. Pilászy, B. Németh, and D. Tikk. Scalable collaborative filtering approaches for large recommender systems. The Journal of Machine Learning Research, 10:623–656, 2009. 138 [77] T. Tran, D. Phung, and S. Venkatesh. Collaborative filtering via sparse markov random fields. arXiv preprint arXiv:1602.02842, 2016. [78] T. Tran, D. Q. Phung, and S. Venkatesh. Preference networks: Probabilistic models for recommendation systems. In AusDM’07, pages 195–202. ACS, 2007. [79] T. Tran, D. Q. Phung, and S. Venkatesh. Ordinal boltzmann machines for collaborative filtering. In UAI’09, pages 548–556. AUAI Press, 2009. [80] T. Tran, D. Q. Phung, and S. Venkatesh. Probabilistic models over ordered partitions with applications in document ranking and collaborative filtering. In SDM’11, pages 426–437. SIAM, 2011. [81] T. Tran, D. Q. Phung, and S. Venkatesh. Learning from ordered sets and appli- cations in collaborative ranking. In ACML’12, pages 427–442. JMLR: workshop and conference proceedings, 2012. [82] T. Tran, D. Q. Phung, and S. Venkatesh. A sequential decision approach to ordinal preferences in recommender systems. In AAAI’12, 2012. [83] M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola. Maximum margin matrix factorization for collaborative ranking. In NIPS’07, pages 1593–1600, 2007. [84] T. Zhou, L.-L. Jiang, R.-Q. Su, and Y.-C. Zhang. Effect of initial configuration on network-based recommendation. EPL (Europhysics Letters), 81(5):58004, 2008. [85] T. Zhou, J. Ren, M. Medo, and Y.-C. Zhang. Bipartite network projection and personal recommendation. Physical Review E, 76(4):046115, 2007. [86] A. Zimdars, D. M. Chickering, and C. Meek. Using temporal data for making recommendations. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence, pages 580–588. Morgan Kaufmann Publishers Inc., 2001.