key: cord-0045881-1bcqwnfs
title: Effect of Dataset Size on Efficiency of Collaborative Filtering Recommender Systems with Multi-clustering as a Neighbourhood Identification Strategy
date: 2020-05-22
journal: Computational Science - ICCS 2020
DOI: 10.1007/978-3-030-50420-5_25
sha: 04f0c6ea5c62afdef3174aea2c450de23b10399f
doc_id: 45881
cord_uid: 1bcqwnfs

Determination of an accurate neighbourhood of an active user (the user to whom recommendations are generated) is one of the essential problems that collaborative filtering based recommender systems encounter. A properly adjusted neighbourhood leads to more accurate recommendations generated by a recommender system. In the classical collaborative filtering technique, the neighbourhood is modelled by the kNN algorithm, but this approach scales poorly. Clustering techniques, although they improve the time efficiency of recommender systems, can negatively affect the quality (precision or accuracy) of recommendations. This article presents a new approach to collaborative filtering recommender systems that focuses on the problem of modelling an active user's neighbourhood. Instead of one clustering scheme, it works on a set of partitions and selects the most appropriate one, modelling the neighbourhood precisely. The article presents the results of experiments validating the advantage of the multi-clustering approach, M-CCF, over traditional methods based on single-scheme clustering. The experiments particularly focus on the effect of large dataset size on overall recommendation performance, including accuracy and coverage.

Recommender Systems (RSs) are solutions that cope with the information overload observed nowadays on the Internet. Their goal is to provide filtered data to a particular user [12]. As stated in [25], RSs are a special type of information retrieval that estimates the level of relevance of unknown items to a particular user and orders the items according to that relevance. There are non-personalized recommenders, based e.g. on average customer ratings, as well as personalized systems predicting preferences by analysing users' behaviour.

The most popular RSs are collaborative filtering (CF) methods, which build a model of users and the items in which the users were interested [1]. The model's data are preferences, e.g. visited or purchased items, or ratings [20]. CF then searches the model for similarities to generate a list of suggestions that fit the user's preferences [20]. Recommendations are based on either user-based or item-based similarity. The item-based approach usually generates more relevant recommendations, since it uses the user's own ratings [23]: items similar to a target item are identified, and the user's ratings on those items are used to extrapolate the rating of the target. This approach is also more resistant to changes in the ratings, because the number of users is usually considerably greater than the number of items, and new items are added to the dataset less frequently [2].

During recommendation generation, a huge amount of data is processed. To improve time efficiency and make it possible to generate lists of propositions in real time, RSs reduce the search space around an active user to its closest neighbourhood. A traditional method for this purpose is k Nearest Neighbours (kNN) [4].
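To make the item-based prediction step concrete, the following minimal Java sketch estimates a target rating from the active user's own ratings of the most similar items. It is an illustration only, not the paper's code: the dense ratings array, the convention that 0 means "not rated", and all identifiers are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of item-based CF prediction (illustrative, not the paper's implementation). */
public class ItemBasedSketch {

    /** Predicts the rating of item `target` for `user` from the user's k most similar rated items.
     *  ratings[u][i] is the rating of user u for item i; 0 stands for "not rated" (assumption). */
    static double predict(double[][] ratings, int user, int target, int k) {
        List<double[]> rated = new ArrayList<>(); // pairs {similarity to target, user's rating}
        for (int i = 0; i < ratings[0].length; i++) {
            if (i == target || ratings[user][i] == 0) continue;
            rated.add(new double[]{itemCosine(ratings, i, target), ratings[user][i]});
        }
        rated.sort((a, b) -> Double.compare(b[0], a[0])); // most similar items first
        double num = 0, den = 0;
        for (double[] p : rated.subList(0, Math.min(k, rated.size()))) {
            num += p[0] * p[1];       // similarity-weighted rating
            den += Math.abs(p[0]);
        }
        return den == 0 ? 0 : num / den; // weighted average extrapolates the target rating
    }

    /** Cosine similarity between the rating columns of items a and b. */
    static double itemCosine(double[][] r, int a, int b) {
        double dot = 0, na = 0, nb = 0;
        for (double[] row : r) { dot += row[a] * row[b]; na += row[a] * row[a]; nb += row[b] * row[b]; }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}
```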
It calculates all user-user or item-item similarities and identifies the k objects (users or items) most similar to the target object as its neighbourhood. Further calculations are then performed only on objects from the neighbourhood, improving processing time. The kNN algorithm is the reference method used to determine the neighbourhood of an active user for the collaborative filtering recommendation process [8]. Its advantages are simplicity and reasonably accurate results; its disadvantages are low scalability and vulnerability to sparsity in the data [24].

Clustering algorithms can be an efficient solution to the disadvantages of the kNN approach, because the neighbourhood is shared by all cluster members. The problems are that the results can differ between runs, as most clustering methods are non-deterministic, and that prediction accuracy usually drops significantly. The multi-clustering approach, instead of one clustering scheme, works on a set of partitions and selects the most appropriate one, modelling the neighbourhood precisely and thus reducing the negative impact of non-determinism.

The article is organised as follows: the first section presents the scalability problems occurring in collaborative filtering Recommender Systems, together with a solution based on clustering algorithms, including their advantages and disadvantages. The next section describes the proposed multi-clustering algorithm against the background of alternative clustering techniques, whereas the following section contains the results of experiments comparing the multi-clustering and single-clustering approaches. The last section concludes the paper.

Clustering is a part of the Machine Learning domain. The aim of clustering methods is to organise data into separate groups without any external information about their membership, such as class labels. They analyse only the relationships among the data; therefore clustering belongs to Unsupervised Learning techniques [13]. Because clusters are identified independently, a priori, clustering algorithms are an efficient solution to the problem of RS scalability, providing the recommendation process with a pre-defined neighbourhood [21]. Recently, clustering algorithms have drawn much attention from researchers, and new algorithms have been proposed specifically for recommender system applications [6, 16, 22].

The efficiency of clustering techniques derives from the fact that a cluster is a neighbourhood shared by all the cluster members, in contrast to the kNN approach, which determines neighbours for every object separately [2]. The disadvantage of this approach is usually a loss of prediction accuracy. The explanation for the decreased recommendation accuracy lies in the way clustering algorithms work. A typical approach is based on a single partitioning scheme, which is generated once and then not updated significantly. There are two major problems related to the quality of clustering. The first is that the clustering results depend on the algorithm's input parameters and, additionally, there is no reliable technique to evaluate clusters before the on-line recommendation process; moreover, some clustering schemes may suit certain applications better, whereas other schemes perform better in other settings [28]. The other issue behind decreased prediction accuracy is the imprecise neighbourhood modelling of data located on the borders of clusters [14, 18].
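The difference between the two neighbourhood strategies can be shown in a short sketch: kNN must rank all candidates for every active user, whereas the clustering variant only reads a cluster assignment computed once off-line. The precomputed similarity matrix and the assignment array are assumptions of this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative contrast of kNN and cluster-based neighbourhood identification. */
public class NeighbourhoodSketch {

    /** kNN: ranks all other users by similarity, separately for each active user (poor scalability). */
    static List<Integer> knnNeighbours(double[][] userSim, int active, int k) {
        List<Integer> candidates = new ArrayList<>();
        for (int u = 0; u < userSim.length; u++) if (u != active) candidates.add(u);
        candidates.sort((a, b) -> Double.compare(userSim[active][b], userSim[active][a]));
        return candidates.subList(0, Math.min(k, candidates.size()));
    }

    /** Clustering: the neighbourhood is the active user's cluster, shared by all members
     *  and determined once, off-line, before any recommendation is requested. */
    static List<Integer> clusterNeighbours(int[] clusterOf, int active) {
        List<Integer> members = new ArrayList<>();
        for (int u = 0; u < clusterOf.length; u++)
            if (u != active && clusterOf[u] == clusterOf[active]) members.add(u);
        return members;
    }
}
```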
A popular clustering technique is k-means, due to its simplicity and high scalability [13], and it is often used in the CF approach [21]. A variant of k-means clustering, bisecting k-means, was proposed for privacy-preserving applications [7] and for a web-based movie RS [21]. Another solution, ClustKNN [19], was used to cope with large-scale RS applications. However, the k-means approach, like many other clustering methods, does not always converge, and it additionally requires input parameters, e.g. the number of clusters.

The disadvantages described above can be addressed by techniques called alternate clustering, multi-view clustering, multi-clustering or co-clustering. They include a wide range of methods based on, broadly understood, multiple runs of clustering algorithms or multiple applications of the clustering process on different input data [5].

Multi-clustering and co-clustering have been applied to improve scalability in the RS domain. Co-clustering discovers samples that are similar to one another with respect to a subset of features; as a result, interesting patterns (co-clusters) are identified that cannot be found by traditional one-way clustering [28]. Multiple clustering approaches discover various partitioning schemes, each capturing different aspects of the data [3]. They can apply one clustering algorithm while changing the values of its input parameters or distance metrics, or they can use different clustering techniques to generate a complementary result [28]. The role of multi-clustering in the recommendation generation process applied in the approach described in this article is to determine the most appropriate neighbourhood for an active user. This means that the algorithm selects the best cluster from a set of clusters prepared previously (see the following section).

A method described in [18] combines the content-based and collaborative filtering approaches. The system uses multi-clustering; however, it is interpreted there as clustering of a single scheme over both techniques. It groups the ratings to create an item group-rating matrix and a user group-rating matrix. As a clustering algorithm it uses k-means combined with fuzzy set theory to represent the level of membership of an object in a cluster. A final prediction rating matrix is then calculated to represent the whole dataset. In the last step of the pre-recommendation process, k-means is used again on the new rating matrix to find groups of similar users. The groups represent the neighbourhood of users, limiting the search space for a collaborative filtering method. It is difficult to compare this approach to other techniques, including single-clustering ones, because the article [18] describes experiments on an unknown dataset containing only 1675 ratings.

Another solution is presented in [26]. The authors observed that users might have different interests over topics, and thus might share similar preferences with different groups of users over different sets of items. The method CCCF (Co-Clustering For Collaborative Filtering) first clusters users and items into several subgroups, where each subgroup includes a set of like-minded users and the set of items in which these users share an interest. The subgroups are analysed by collaborative filtering methods, and the resulting recommendations are aggregated over all the subgroups. This approach has advantages such as scalability, flexibility, interpretability and extensibility.

Other applications include accurate recommendations of tourist attractions based on co-clustering and bipartite graph theory [27], and OCuLaR (Overlapping co-CLuster Recommendation) [11], an algorithm for processing very large databases that detects co-clusters among users and items and provides interpretable recommendations. There are also methods, generally called multi-view clustering, that find partitioning schemes on different data (e.g. ratings and text descriptions) and combine the results afterwards ([5, 17]). The main objective of multi-view partitioning is to provide more information about the data, in order to understand them better, by generating distinct aspects of the data and searching for the mutual link information among the various views [10]. It has been stated that single-view data may contain incomplete knowledge, while multi-view data fill this gap with complementary and redundant information [9]; this is especially useful for the interpretability aspect being developed in Recommender Systems [11].
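All of the approaches above build on partitions computed off-line. Before M-CCF itself is described, the sketch below shows the kind of off-line step it relies on: repeated k-means runs over the same data, where different random initialisations yield the different schemes a multi-clustering method can later choose among. This is a plain illustrative implementation with invented names, not the Mahout-based one used in the experiments.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Off-line step sketched: cluster the same data ncs times into nc clusters each time. */
public class SchemeSetSketch {

    /** A set CS of ncs clustering schemes over the same data (illustrative k-means). */
    static List<int[]> clusteringSchemes(double[][] users, int nc, int ncs) {
        List<int[]> cs = new ArrayList<>();
        for (int run = 0; run < ncs; run++)
            cs.add(kMeans(users, nc, new Random(run))); // different seeds -> different schemes
        return cs;
    }

    /** One scheme: clusterOf[u] is the cluster index assigned to user u. */
    static int[] kMeans(double[][] users, int nc, Random rnd) {
        double[][] centers = new double[nc][];
        for (int c = 0; c < nc; c++)   // random initial centers: the source of non-determinism
            centers[c] = users[rnd.nextInt(users.length)].clone();
        int[] clusterOf = new int[users.length];
        for (int iter = 0; iter < 25; iter++) {
            for (int u = 0; u < users.length; u++) clusterOf[u] = nearest(centers, users[u]);
            for (int c = 0; c < nc; c++) {            // move each center to the mean of its members
                double[] mean = new double[users[0].length];
                int members = 0;
                for (int u = 0; u < users.length; u++)
                    if (clusterOf[u] == c) {
                        for (int d = 0; d < mean.length; d++) mean[d] += users[u][d];
                        members++;
                    }
                if (members > 0) {
                    for (int d = 0; d < mean.length; d++) mean[d] /= members;
                    centers[c] = mean;
                }
            }
        }
        return clusterOf;
    }

    /** Index of the center closest to x (squared Euclidean distance). */
    static int nearest(double[][] centers, double[] x) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int j = 0; j < x.length; j++) d += (x[j] - centers[c][j]) * (x[j] - centers[c][j]);
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```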
The approach presented in this article is named Multi-Clustering Collaborative Filtering (M-CCF); it defines a multi-clustering process as the generation of a set of clustering results obtained from an arbitrary clustering algorithm run on the same input data. The advantage of this approach is a better quality of neighbourhood modelling, leading to high-quality predictions while keeping the real-time effectiveness provided by clustering methods. The explanation lies in the imprecise neighbourhood modelling of data located on the borders of clusters: border objects have fewer neighbours in their closest area than objects located in the middle of a cluster. The multi-clustering technique selects the most appropriate cluster for the particular data object, where "most appropriate" means the one that includes the active object at the closest distance to the cluster's center, thus delivering more neighbours around it. A more detailed description of this phenomenon is given in [14, 15].

The general M-CCF algorithm is presented in Algorithm 1. The input set contains the data of n users, who rated a subset of items A = {a_1, ..., a_k}. The set of possible ratings, V, contains the values v_1, ..., v_c. The input data are clustered ncs times into nc clusters each time, giving as a result a set of clustering schemes CS. Finally, the algorithm generates a list of recommendations R_{x_a} for the active user.

The set of groups is identified by the clustering algorithm, which is run several times with the same or different values of its input parameters. In the experiments described in this article, k-means was used as the clustering method, and the set of clusters provided to the collaborative filtering process was generated with the same parameter k (the number of clusters). This step, although time-consuming, has a minor impact on overall system scalability, because it is performed rarely and in an off-line mode.

After the neighbourhood identification, the next step, the actual generation of recommendations, is executed. This process requires not only great precision but also high time effectiveness. The multi-clustering approach satisfies both conditions, because it can select the most suitable neighbourhood area of an active user for candidate searching, and the neighbourhood of every object is already determined.
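The on-line part of this flow can be sketched compactly. The cluster selection below uses similarity to cluster centers, the matching strategy formalised as Algorithm 2 in the next section; the data layout and identifiers are assumptions of this sketch.

```java
import java.util.List;

/** Illustrative on-line step of a multi-clustering CF flow: pick the best cluster
 *  for the active user across all schemes, then recommend within that cluster only. */
public class BestClusterSketch {

    /** Returns {scheme index, cluster index} of the center most similar to the active user. */
    static int[] bestCluster(double[] activeRatings, List<double[][]> centersPerScheme) {
        int bestScheme = 0, bestCluster = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int s = 0; s < centersPerScheme.size(); s++) {   // every clustering scheme in CS
            double[][] centers = centersPerScheme.get(s);
            for (int c = 0; c < centers.length; c++) {        // every center in the scheme
                double sim = cosine(activeRatings, centers[c]);
                if (sim > best) { best = sim; bestScheme = s; bestCluster = c; }
            }
        }
        // candidate items are then scored by ordinary item-based CF, restricted
        // to the members of the selected cluster (see the earlier item-based sketch)
        return new int[]{bestScheme, bestCluster};
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}
```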
One of the most important issues in this approach is to generate a wide set of input clusters that are not too numerous in size, thus providing high similarity for every user or item. The other issue concerns matching users with the best clusters as their neighbourhood, which can be done in the following ways. The first way compares the active user's ratings with the ratings of the cluster centers and searches for the most similar one using a chosen similarity measure. The other way, instead of using the cluster centers, compares the active user with all cluster members and selects the cluster with the highest overall similarity. Both solutions have their advantages and disadvantages, e.g. the first works well for clusters of spherical shape, whereas the other requires more computation time. In the experiments presented in this paper, the clusters for active users are selected based on their similarity to the centers of the groups (see Algorithm 2).

Algorithm 2:
Output:
- C_best(x_a): the best cluster for an active user x_a
- δ_best: a matrix of similarity within the best cluster
begin
  δ_1 .. δ_{ncs·nc} ← calculateSimilarity(x_a, CS_r, δ);
  δ_best ← selectTheHighestSimilarity(δ_1 .. δ_{ncs·nc});
  C_best(x_a) ← findTheBestCluster(δ_best, CS, CS_i);
end

Afterwards, the recommendation generation process works as a typical collaborative filtering approach, although candidates are searched for only within the selected neighbourhood cluster.

Evaluation of the performance of the proposed M-CCF algorithm was conducted on the MovieLens dataset [30]. The original set is composed of 25 million ratings; however, two subsets were used in the experiments: a small dataset (100k) and a big dataset (10M). The parameters of the subsets are presented in Table 1.

The results obtained with the M-CCF algorithm were compared with a recommender system whose neighbourhood identification is based on single-clustering. Attention was paid to the precision and completeness of the recommendation lists generated by the systems. The evaluation criteria were the following baselines: Root Mean Squared Error (RMSE), described by (1), and Coverage, described by (2) (in %). The symbols in the equations, as well as the method of calculation, are characterised in detail below.

RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( r_{real}(x_i) - r_{est}(x_i) \right)^2 }   (1)

Coverage = \frac{N_{est}}{N} \cdot 100\%   (2)

where RMSE \in IR^{+} (IR^{+} stands for the set of positive real numbers) and N_{est} denotes the number of ratings the system was able to estimate.

The performance of both approaches was evaluated in the following way. Before the clustering step, the whole input dataset was split into two parts: training and testing. In the case of the 100k subset, the parameters of the testing part were as follows: 393 ratings, 48 users, 354 items; in the case of the 10M subset: 432 ratings, 44 users and 383 items. This step provides the same testing data throughout the experiments and makes the comparison more objective. In the evaluation process, the values of the ratings from the testing part were removed and then estimated by the recommender systems. The difference between the original and the calculated value (represented respectively as r_real(x_i) and r_est(x_i) for user x_i and a particular item i) was taken for the RMSE calculation. The number of ratings is denoted as N in the equations. A lower value of RMSE indicates a better prediction ability. During the evaluation process, there were cases in which the estimation of a rating was not possible. This occurs when the item for which the calculations are performed is not present in the clusters to which the items with existing ratings belong. This is accounted for in the Coverage index (2).
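The evaluation loop implied by (1) and (2) can be summarised in a few lines of code. The convention that a non-estimable rating is marked with NaN is an assumption of this sketch, not the paper's implementation.

```java
/** Evaluation protocol sketched: estimate the hidden test ratings, then compute
 *  RMSE (1) over the ratings that could be estimated and Coverage (2) as their percentage. */
public class EvalSketch {

    /** real[i] and estimated[i] are r_real(x_i) and r_est(x_i) for the i-th test rating;
     *  Double.NaN marks a rating the recommender could not estimate (assumption). */
    static double[] rmseAndCoverage(double[] real, double[] estimated) {
        double sumSq = 0;
        int estimable = 0;
        for (int i = 0; i < real.length; i++) {
            if (Double.isNaN(estimated[i])) continue; // item outside every usable cluster
            double diff = real[i] - estimated[i];
            sumSq += diff * diff;
            estimable++;
        }
        double rmse = estimable == 0 ? Double.NaN : Math.sqrt(sumSq / estimable);
        double coverage = 100.0 * estimable / real.length;
        return new double[]{rmse, coverage};  // report RMSE only when Coverage is high enough
    }
}
```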
In every experiment, it was assumed that the RMSE is significant only if the value of Coverage is greater than 80%. This means that if the number of users for whom recommendations were calculated was 48, and for each of them on average 5 ratings were expected to be estimated, then at least 192 values should be present in the recommendation lists.

The clustering method and the similarity and distance measures were taken from the Apache Mahout environment [29]. To achieve a comparable time evaluation, the implementation of the multi-clustering algorithm also used data models (FileDataModel) and structures (FastIDMap, FastIDSet) derived from Apache Mahout. The following data models were implemented: ClusteringDataModel and MultiClusteringDataModel, both implementing the DataModel interface. The appropriate recommender and evaluator classes were implemented as well.

The first experiment was performed on the 100k dataset, which was clustered independently five times into 10 groups. The clustering algorithm was k-means, with the cosine value between the vectors formed from the data points as the distance measure. The number of groups (10) was determined experimentally as the optimal value leading to the highest values of Coverage in the recommendations. In every case, a new recommender system was built and evaluated. Table 2 contains the precision evaluation of the systems run with the following similarity indices: Cosine-based, LogLikelihood, Pearson correlation, Euclidean distance-based, CityBlock distance-based and Tanimoto coefficient. In the tables below they are represented by the following shortcuts, respectively: Cosine, LogLike, Pearson, Euclidean, CityBlock, Tanimoto. The RMSE values are presented with a reference value in brackets, which in this case stands for Coverage.

Table 2. RMSE of item-based collaborative filtering recommendations with the neighbourhood determined by single clustering (5 different runs of the k-means algorithm) as well as by multi-clustering (k-means with the cosine-based distance measure) for the small dataset. The best values are in bold.

It is visible that the values differ for different input data, although the number of clusters is the same in every result. As an example, the recommender system with Cosine-based similarity has an RMSE in the range from 0.87 to 0.9. The difference in values may seem small, but the table contains only values whose Coverage was high enough. Different values of RMSE mean that the precision of a recommender system depends on the quality of the clustering scheme, and there is no guarantee that the scheme selected for the recommendation process is optimal. Table 2 also contains the performance results of the recommender system whose neighbourhood was determined by the multi-clustering approach. There is one case where its precision is better (for the Euclidean distance-based similarity), but in the majority of cases it is slightly worse. Despite this, the multi-clustering approach eliminates the ambiguity of clustering scheme selection.

The goal of the second experiment was to examine the influence of the distance measure used in the clustering process on the final recommender system performance. The dataset, the similarity measures and the number of clusters remained the same; however, the distance between the data points was measured by the Euclidean distance. The results are presented in Table 3. In this case, one can observe the same values of RMSE regardless of the similarity measure.
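As a point of reference, an item-based recommender built from stock Taste components like those named above can be wired up in a few lines. This is a minimal sketch assuming Mahout's Taste API; the paper's own ClusteringDataModel and MultiClusteringDataModel classes are not publicly available, so the standard FileDataModel stands in for them here.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MahoutBaselineSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv in the usual FileDataModel format: userID,itemID,rating
        DataModel model = new FileDataModel(new File("ratings.csv"));
        // any of the similarity indices from the tables can be dropped in here, e.g.
        // LogLikelihoodSimilarity, EuclideanDistanceSimilarity, CityBlockSimilarity,
        // TanimotoCoefficientSimilarity or UncenteredCosineSimilarity
        ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);
        List<RecommendedItem> top10 = recommender.recommend(1L, 10); // 10 items for user 1
        for (RecommendedItem item : top10)
            System.out.println(item.getItemID() + " " + item.getValue());
    }
}
```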
Note that, in Table 3, the M-CCF algorithm generated results identical to the values from the single-clustering approach.

The following experiments were performed on the big dataset (10M). By analogy to the previous ones, the influence of the distance measure was examined as well. In the first of them, the cosine value between data vectors was taken as the distance measure. The results, for all the similarity indices, are presented in Table 4. The overall performance is worse, although the size of the dataset is considerably greater; there are more cases with insufficient Coverage.

Finally, the last experiment was performed on the big dataset (10M) clustered based on the Euclidean distance. Table 5 contains the results for RMSE and Coverage. Coverage values are visibly higher in this case, even for the Pearson correlation similarity index. The performance of the multi-clustering approach is still better than that of the method based on single-clustering: the RMSE values are lower in the majority of cases, and among the single-clustering runs there is only one scheme that slightly outperforms the M-CCF method.

Taking into consideration all the experiments presented in this article, it can be observed that the performance of a recommender system depends on the quality of the clustering scheme provided to the system by a clustering algorithm. In the case of single-clustering, the final precision of recommendations can differ across the several schemes generated by this approach. This means that a single run of a clustering algorithm is insufficient to build a good neighbourhood model for a recommender system. A multi-clustering recommender system, with its technique of dynamically selecting the most suitable clusters, offers valuable results, particularly for datasets of great size.

In this paper, a developed version of a collaborative filtering recommender system based on multi-clustering neighbourhood modelling was presented. The M-CCF algorithm dynamically selects the most appropriate cluster for every user to whom recommendations are generated. A properly adjusted neighbourhood leads to more accurate recommendations generated by a recommender system. The algorithm eliminates a disadvantage that appears when the neighbourhood is determined by a single-clustering method: the dependence of the final performance of a recommender system on the clustering scheme selected for the recommendation process.

The experiments described in this paper validated the better performance of the recommender system when the neighbourhood is modelled by the M-CCF algorithm. It was particularly evident in the case of the great dataset containing 10 million ratings. The experiments showed the good scalability of the method and the increased competitiveness of the M-CCF algorithm relative to the single-clustering approach on the bigger dataset. Additionally, the technique is free from the negative impact on precision caused by the selection of an inappropriate clustering scheme.

Future experiments will validate the proposed algorithm on different datasets, with a particular focus on great size. It is planned to check the impact of the type of clustering method on the recommender system's final performance, as well as the effect of a mixture of clustering schemes, instead of one algorithm's output, on the input of the recommender system.

References:
Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions
Time- and location-sensitive recommender systems. Recommender Systems
Evolutionary parameter setting of multi-clustering
Recommendation strategies in personalization applications
Alternative Clustering Analysis: A Review. Intelligent Decision Technologies: Data Clustering: Algorithms and Applications
An evolutionary scheme for improving recommender system using clustering
A scalable privacy-preserving recommendation scheme via bisecting k-means clustering
Recommender systems survey. Knowl.-Based Syst.
New approaches in multiview clustering
Multi-view collaborative locally adaptive clustering with Minkowski metric
Scalable and interpretable product recommendations via overlapping co-clustering
Recommender Systems: An Introduction
Finding Groups in Data: An Introduction to Cluster Analysis
Collaborative filtering recommender systems based on k-means multi-clustering
Multi-clustering used as neighbourhood identification strategy in recommender systems
A recommender system based on hierarchical clustering for cloud e-learning
Rough-fuzzy collaborative clustering
A multi-clustering hybrid recommender system
ClustKNN: a highly scalable hybrid model- & memory-based CF algorithm
Recommender systems: introduction and challenges
Recommender systems for large-scale e-commerce: scalable neighborhood formation using clustering
A novel Adaptive Genetic Neural Network (AGNN) model for recommender systems using modified k-means clustering approach
Collaborative filtering recommender systems
Scalability and sparsity issues in recommender datasets: a survey
Novelty and diversity enhancement and evaluation in recommender systems and information retrieval
CCCF: improving collaborative filtering via scalable user-item co-clustering
A novel recommendation algorithm frame for tourist spots based on multi-clustering bipartite graphs
Discovering multiple co-clusterings in subspaces

Acknowledgment. The work was supported by the grant from Bialystok University of Technology and funded with resources for research by the Ministry of Science and Higher Education in Poland.