Collaborative recommender systems: Combining effectiveness and efficiency

Panagiotis Symeonidis *, Alexandros Nanopoulos, Apostolos N. Papadopoulos, Yannis Manolopoulos
Department of Informatics, Aristotle University, 54124 Thessaloniki, Greece

* Corresponding author. E-mail addresses: symeon@csd.auth.gr (P. Symeonidis), ananopou@csd.auth.gr (A. Nanopoulos), papadopo@csd.auth.gr (A.N. Papadopoulos), manolopo@csd.auth.gr (Y. Manolopoulos).

Abstract

Recommender systems base their operation on past user ratings over a collection of items, for instance, books, CDs, etc. Collaborative filtering (CF) is a successful recommendation technique that confronts the "information overload" problem. Memory-based algorithms recommend according to the preferences of nearest neighbors, and model-based algorithms recommend by first developing a model of user ratings. In this paper, we bring to the surface the factors that affect the CF process, in order to identify existing false beliefs. In terms of accuracy, by being able to view the "big picture", we propose new approaches that substantially improve the performance of CF algorithms. For instance, we obtain more than a 40% increase in precision in comparison to widely used CF algorithms. In terms of efficiency, we propose a model-based approach based on latent semantic indexing (LSI) that reduces execution times by at least 50% compared to the classic CF algorithms.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Recommender system; Collaborative filtering; Nearest neighbors

1. Introduction

The "information overload" problem affects our everyday experience while searching for knowledge on a topic. To overcome this problem, we often rely on suggestions from others who have more experience on the topic. However, on the Web, where there are numerous suggestions, it is not easy to detect the trustworthy ones. Shifting from individual to collective suggestions makes the process of recommendation controllable. This is attained with the introduction of CF, which provides recommendations based on the suggestions of users who have similar preferences. Since CF is able to capture the particular preferences of a user, it has become one of the most popular methods in recommender systems.

Two types of CF algorithms have been proposed in the literature: memory-based algorithms, which recommend according to the preferences of nearest neighbors, and model-based algorithms, which recommend by first developing a model of user ratings. Both practical experience and related research have reported that memory-based algorithms (a.k.a. nearest-neighbor algorithms) present excellent performance, in terms of accuracy, for multi-value rating data. On the other hand, model-based algorithms efficiently handle scalability to large data sets.

1.1. Motivation

Nearest-neighbor CF is influenced by several factors. Related research on CF during the past decade has approached some of these factors. However, existing approaches may not be considered complete, because they examine the various factors only partially.
More specifically, existing CF algorithms and their experimental evaluation focus only on parts of the CF process and do not handle it as a whole. For the aspects that these partial considerations do not examine, they usually make choices which, as our study demonstrates, can be misleading. Through our study we are also able to confirm that there exist dependencies between the factors affecting CF. Therefore, we have to perform an evaluation of the entire CF process in order to produce reliable conclusions.

Moreover, to handle scalability, we have to extend our findings for nearest-neighbor CF algorithms through a model-based approach. This approach will combine the effectiveness of the nearest-neighbor CF algorithms in terms of accuracy with efficiency in terms of execution time. Towards this direction, latent semantic indexing (LSI) is a technique that has been extensively used in information retrieval. LSI detects latent relationships between documents and terms. In CF, LSI can be used to form users' trends from individual preferences, by detecting latent relationships between users and items. Therefore, with LSI, a higher-level representation of the original user-item matrix is produced, which presents a three-fold advantage: (i) it contains the main trends of users' preferences, (ii) noise is removed, (iii) it is much more condensed than the original matrix, thus it favors scalability.

1.2. Contributions

In this work, first, we provide a thorough analysis of the factors involved in CF. Notably, we examine several similarity measures, various criteria for generating the recommendation list, the appropriateness of evaluation metrics, and the impact of CF in real-world applications, which is considered through user satisfaction (measured with the popularity of recommended items) and the division of the ratings of the test user into past and future sets. During the analysis we identify choices that have been incorrectly adopted and new issues that have not been considered so far. As a result, we propose several extensions and new approaches, which substantially improve the entire CF process. Moreover, we propose a new model-based CF approach, which is based on LSI to produce a condensed model for the user-item matrix that handles the factor of scalability. Our contributions are summarized as follows:

• The proposed approach examines the factors involved through the entire CF process. This helps to: (a) reveal fallacies in existing beliefs for several of them, (b) better analyze their impact and provide insights, and (c) synthesize a novel method, which substantially improves the effectiveness of CF in terms of accuracy.
• The previous findings are extended with an approach based on LSI. This way, execution times for CF are significantly reduced. Moreover, improvement is possible in effectiveness too, because the proposed model identifies the main trends and removes noise. Notice that, differently from similar approaches in related work, we introduce the notion of pseudo-user in order to fold in
the vectors of target users in the produced model (related work incorrectly assumes the knowledge of target users during the building of the model).
• We carried out an extensive experimental evaluation, which, to our knowledge, considers all the involved factors for the first time. Our experimental results demonstrate the superiority of the proposed methods (more than 40% improvement in terms of precision over widely used CF algorithms and 50% in terms of execution times).

The rest of this paper is organized as follows. Section 2 summarizes the related work, whereas Section 3 contains the analysis of the CF factors. The proposed approaches are described in Sections 4 and 5. Experimental results are given in Section 6. Finally, Section 7 concludes this paper.

2. Related work

In 1992, the Tapestry system (Goldberg, Nichols, Brian, & Terry, 1992) introduced collaborative filtering (CF). In 1994, the GroupLens system (Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994) implemented a CF algorithm based on common user preferences. Nowadays, it is known as the user-based CF algorithm, because it employs users' similarities for the formation of the neighborhood of nearest users. Since then, many improvements of the user-based algorithm have been suggested, e.g., (Breese, Heckerman, & Kadie, 1998; Herlocker, Konstan, Borchers, & Riedl, 1999).

In 2001, another CF algorithm was proposed. It is based on item similarities for the generation of a neighborhood of nearest items (Karypis, 2001; Sarwar, Karypis, Konstan, & Riedl, 2001) and is denoted as the item-based CF algorithm.

Most recent work followed the two aforementioned directions (i.e., user-based and item-based). Herlocker, Konstan, and Riedl (2002) weight similarities by the number of common ratings between users/items. Deshpande and Karypis (2004) apply the item-based CF algorithm combined with conditional probability-based similarity and cosine similarity. Xue, Lin, and Yang (2005) suggest a hybrid integration of the aforementioned (memory-based) algorithms with model-based algorithms.

All aforementioned algorithms are memory-based. Their efficiency is affected by the scalability of data. This means that they face performance problems when the volume of data is extremely large. To deal with this problem, many model-based algorithms have been developed (Breese et al., 1998). However, there are two conflicting challenges: if an algorithm spends less execution time, this should not worsen its quality. The best result would be to improve quality with the minimum calculation effort.

Furnas, Deerwester, and Dumais (1988) proposed latent semantic indexing (LSI) in the information retrieval area to deal with the aforementioned challenges.
More specifically, LSI uses SVD to capture latent associations between the terms and the documents. SVD is a well-known factorization technique that factors a matrix into three matrices. Berry, Dumais, and O'Brien (1994) carried out a survey of the computational requirements for managing LSI-encoded databases (e.g., folding in, a simple technique that uses an existing SVD to represent new information). They claimed that the reduced-dimensions model is less noisy than the original data.

Sarwar, Karypis, and Konstan (2000, 2002) applied dimensionality reduction for the user-based CF approach. They also used SVD for generating predictions. In contrast to our work, Sarwar et al. (2000, 2002) do not consider two significant issues: (i) Predictions should be based on the users' neighbors and not on the test (target) user, as the ratings of the latter are not a priori known. For this reason we rely only on the neighborhood of the test user. (ii) The test users should not be included in the calculation of the model, because they are not known during the factorization phase. For this reason, we introduce the notion of pseudo-user in order to include a new user in the model (folding in), from which recommendations are derived.

Other related work also includes Goldberg, Roeder, Gupta, and Perkins (2001), who applied Principal Components Analysis (PCA) to facilitate off-line dimensionality reduction for clustering the users, and thereby manage to have rapid on-line computation of recommendations. Hofmann (2004) proposed a model-based algorithm which relies on latent semantic and statistical models.

3. Factors affecting the CF process

In this section, we identify the major factors that critically affect all CF algorithms.
Our analysis focuses on the basic operations of the CF process, which consists of three stages:

• Stage 1: formation of a user or item neighborhood, where objects inside the neighborhood have similar ratings and behavior.
• Stage 2: top-N list generation, with algorithms that construct a list of the best item recommendations for a user.
• Stage 3: quality assessment of the top-N list.

In the rest of this section we elaborate on the aforementioned factors, which are organized with respect to the stage in which each one is involved. The examined factors, which are detailed in the following, are described in Table 1. Table 2 summarizes the symbols that are used in the sequel.

Table 1
Factors affecting CF algorithms

  Factor name                         Short description                                          Stage
  Sparsity                            Limited percentage of rated products                       1
  Scalability                         Computation increase by the number of users and items      1
  Train/test data size                Data are divided into training and evaluation (test)       1, 3
  Neighborhood size                   Number of neighbors used for the neighborhood formation    1
  Similarity measure                  Measures that calculate the proximity of two objects       1
  Recommendation list size            Number of top-N recommended items                          2
  Generation of recommendation list   Algorithms for the top-N list generation                   2
  Positive rating threshold           Segregation of positive and negative ratings               2, 3
  Evaluation metrics                  Metrics that evaluate the quality of the top-N list        3
  Setting a baseline method           A simple method against which performance is compared      3
  Past/future items                   Segregation between a priori known and unknown items       3

Table 2
Symbols and definitions

  Symbol    Definition                            Symbol    Definition
  k         Number of nearest neighbors           r̄_i       Mean rating value for item i
  N         Size of recommendation list           p_{u,i}   Predicted rating for user u on item i
  NN(u)     Nearest neighbors of user u           |T|       Size of the test set
  NN(i)     Nearest neighbors of item i           c         Number of singular values
  Ps        Threshold for positive ratings        A         Original matrix
  I         Domain of all items                   U         Left singular vectors of A
  U         Domain of all users                   S         Singular values of A
  u, v      Some users                            V'        Right singular vectors of A
  i, j      Some items                            A*        Approximation matrix of A
  I_u       Set of items rated by user u          u         User vector
  U_i       Set of users who rated item i         u_new     Inserted user vector
  r_{u,i}   The rating of user u on item i        n         Number of training users
  r̄_u       Mean rating value for user u          m         Number of items

To ease the discussion, we will use the running example illustrated in Fig. 1, where U1–U10 are users and I1–I6 are items. As shown, the example data set is divided into a training and a test set. The null cells (no rating) are represented as zeros.

Fig. 1. (a) Training set (n × m) and (b) test set.

Note that related work has identified a few more factors, like the impact of subjectivity during the rating, or issues concerning the preprocessing of data (Breese et al., 1998; Mobasher, Dai, Luo, & Nakagawa, 2001; Sarwar et al., 2000). Nevertheless, we do not examine these factors, because their effect is less easily determinable.

3.1. First stage factors

Sparsity: In most real-world cases, users rate only a very small percentage of items. This causes data sets to become sparse. The problem of sparsity is extensively studied. In such cases, the recommendation engine cannot provide precise proposals, due to lack of sufficient information. A similar problem of CF algorithms is that of cold-start (O'Mahony, Hurley, Kushmerick, & Silvestre, 2004). In a few existing works (Mobasher et al., 2001; Sarwar et al., 2000), there is a preprocessing step that fills missing values. Our experimental examination indicated that this approach incurs the loss of valuable information. For the same reasons, the aforementioned direction has not been followed by subsequent works.

Several recent works (Herlocker et al., 2002; Karypis, 2001; Sarwar et al., 2001) focus only on very sparse data. Also, related work provides benchmark data sets with different sparsity, e.g., the Jester data set (Goldberg et al., 2001) is dense; in contrast, the MovieLens data sets (Herlocker et al., 1999) are relatively sparse. The degree of sparsity, however, depends on the application type. To provide complete conclusions, one has to experiment with the amount of sparsity as well.

Scalability: Scalability is important, because in real-world applications the number of users/items is very large. Related work (Sarwar et al., 2000) has proposed the use of dimensionality reduction techniques, which introduce a trade-off between the accuracy and the execution time of CF algorithms. In Section 5, we will propose a model to confront the scalability problem.
Train/test data size: There is a clear dependence between the training set's size and the accuracy of CF algorithms (Sarwar et al., 2001). Through our experimental study we verified this conclusion. Additionally, we saw that after an upper threshold of the training set size, the increase in accuracy is less steep. However, the effect of overfitting is less significant compared to general classification problems. In contrast, low training set sizes negatively impact accuracy. Therefore, the fair evaluation of CF algorithms should be based on adequately large training sets. Though most related research uses a size around 80%, there exist works that use significantly smaller sizes (McLauglin & Herlocher, 2004). From our experimental results we concluded that a 75% training set size corresponds to an adequate choice. But we have to notice that the training/test size should not be data set independent. (In the running example, we set the training size at 60%.)

Neighborhood size: The number, k, of nearest neighbors used for the neighborhood formation produces a trade-off: a very small k results in low accuracy, because there are not enough neighbors to base the prediction on. In contrast, a very large k impacts precision too, as the particularities of the user's preferences can be blunted by the large neighborhood size. In most related works (Herlocker et al., 1999; Sarwar, Karypis, Konstan, & Riedl, 2000), k has been examined in the range of values between 10 and 100. The optimum k depends on the data characteristics (e.g., sparsity). Therefore, CF algorithms should be evaluated against varying k, in order to tune it. (In the running example, we set k = 3.)

Similarity measure: Related work (Herlocker et al., 2002; McLauglin & Herlocher, 2004; Mobasher et al., 2001; Sarwar et al., 2001) has mainly used Pearson correlation and cosine similarity. In particular, user-based (UB) CF algorithms use the Pearson correlation (Eq. (1)), which measures the similarity between two users, u and v; here the means r̄_u, r̄_v are taken over the co-rated items of u and v. Item-based (IB) CF algorithms use a variation of adjusted cosine similarity (Eq. (2)), which measures the similarity between two items, i and j, and has been proved more accurate (McLauglin & Herlocher, 2004; Sarwar et al., 2001), as it normalizes the bias from subjective ratings; here the means r̄_u are taken over all ratings of each user.

$$\mathrm{sim}(u,v)=\frac{\sum_{\forall i\in S}(r_{u,i}-\bar{r}_u)(r_{v,i}-\bar{r}_v)}{\sqrt{\sum_{\forall i\in S}(r_{u,i}-\bar{r}_u)^2}\sqrt{\sum_{\forall i\in S}(r_{v,i}-\bar{r}_v)^2}},\quad S=I_u\cap I_v \tag{1}$$

$$\mathrm{sim}(i,j)=\frac{\sum_{\forall u\in T}(r_{u,i}-\bar{r}_u)(r_{u,j}-\bar{r}_u)}{\sqrt{\sum_{\forall u\in U_i}(r_{u,i}-\bar{r}_u)^2}\sqrt{\sum_{\forall u\in U_j}(r_{u,j}-\bar{r}_u)^2}},\quad T=U_i\cap U_j \tag{2}$$

Herlocker et al. (2002) proposed a variation of the previous measures, which henceforth is denoted as Weighted Similarity (WS). If sim is a similarity measure (e.g., Pearson or cosine), then WS is equal to $\frac{\max(c,\gamma)}{\gamma}\cdot\mathrm{sim}$, where c is the number of co-rated items and γ is a weighting threshold.

Eq. (1) takes into account only the set of items, S, that are co-rated by both users. This, however, ignores the items rated by only one of the two users. The number of the latter items denotes how much their preferences differ. Especially for the case of sparse data, by ignoring these items we discard significant information. Analogous reasoning applies to Eq. (2), which considers (in the numerator) only the set of users, T, who co-rated both of the examined pair of items, and to WS, which is based on Eq. (1) or (2).
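To make the first-stage computation concrete, the following is a minimal sketch (ours, not the authors' code) of the co-rated Pearson correlation of Eq. (1), assuming a numpy user-item matrix in which 0 denotes "no rating":

```python
import numpy as np

def pearson_corated(ratings, u, v):
    # S = I_u ∩ I_v: items rated by both users (0 denotes "no rating")
    corated = (ratings[u] > 0) & (ratings[v] > 0)
    if corated.sum() < 2:
        return 0.0  # our own guard: too little overlap for a meaningful correlation
    # Center each user's co-rated ratings on his mean over the co-rated items
    du = ratings[u, corated] - ratings[u, corated].mean()
    dv = ratings[v, corated] - ratings[v, corated].mean()
    denom = np.sqrt((du ** 2).sum()) * np.sqrt((dv ** 2).sum())
    return float(du @ dv / denom) if denom > 0 else 0.0
```

The early return for fewer than two co-rated items is our own choice; the paper does not specify this corner case.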
To address the problem, in the following we will examine alternative definitions for S and T.

Another issue that has not been precisely clarified in related work is whether we include in the neighborhood a user or item with negative similarity. In order to improve accuracy, we suggest keeping only the positive similarities for the neighborhood formation, even if fewer than the specified number k of neighbors remain. This approach is also followed in several works (Mobasher et al., 2001). Our experiments have shown that including users or items with negative similarities (just to maintain a constant neighborhood size) can impact the accuracy.

Moreover, we have to mention that the generation of the item similarities matrix can be performed off-line. In contrast, the user similarities matrix should be created on-line. More specifically, for the user similarities matrix we use the test data, while for the item-based similarities table we use only training data.

The application of the Pearson correlation (Eq. (1)) to the running example is depicted in Fig. 2a and the resulting k-nearest neighbors (k-NN) are given in Fig. 2b. Respectively, the similarities between items, calculated with the adjusted cosine measure (Eq. (2)), are given in Fig. 2c, and Fig. 2d depicts the nearest neighbors. As only positive values of similarities are considered during the neighborhood formation, the items have different neighborhood sizes.

Fig. 2. (a) Users' similarities matrix, (b) users' nearest neighbors in descending similarity matrix, (c) items' similarities matrix and (d) items' nearest neighbors in descending similarity matrix.
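The neighborhood-formation policy suggested above (keep only neighbors with positive similarity, even if fewer than k remain) can be sketched as follows, assuming a precomputed similarity matrix `sims`:

```python
import numpy as np

def positive_neighborhood(sims, target, k):
    # Sort candidates by descending similarity to the target user/item
    order = np.argsort(-sims[target])
    # Keep only positive similarities; the neighborhood may end up smaller than k
    return [int(j) for j in order if j != target and sims[target, j] > 0][:k]
```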
3.2. Second stage factors

Recommendation list size: The size, N, of the recommendation list corresponds to a trade-off: with increasing N, the absolute number of relevant items (i.e., recall) is expected to increase, but their ratio to the total size of the recommendation list (i.e., precision) is expected to decrease. (The recall and precision metrics are detailed in the following.) In related work (Karypis, 2001; Sarwar et al., 2001), N usually takes values between 10 and 50. (In the running example, we set N = 2.)

Positive rating threshold: It is evident that recommendations should be "positive"; it is no success to recommend an item that will be rated with 1 on a 1–5 scale. Nevertheless, this issue is not clearly defined in several existing works. We argue that "negatively" rated items should not contribute to the increase of accuracy, and we use a rating threshold, Ps, so that only recommended items whose rating is no less than this value count as successes. If no Ps value is used, then the results can become misleading, since negative ratings can contribute to the measurement of accuracy.

Generation of recommendation list: The most often used technique for the generation of the top-N list is the one that counts the frequency of each item inside the found neighborhood and recommends the N most frequent ones (Sarwar et al., 2000). Henceforth, this technique is denoted as Most-Frequent item recommendation (MF). MF can be applied to both user-based and item-based CF algorithms. For example, assume that we follow the aforementioned approach for the test user U7, for k = 3 and N = 2. For the case of user-based recommendation, the top-2 list includes items I3, I6. In contrast, for the case of item-based recommendation, the top-2 list includes two of the items I6, I2, I5, because all three have equal presence. Fig. 3 describes the two corresponding algorithms (for user-based and item-based CF, respectively).

Fig. 3. Generation of top-N list based on the most-frequent algorithm for (a) the user-based algorithm and (b) the item-based algorithm.

We have to mention that these two algorithms, in contrast to past work, additionally include the concept of the positive rating threshold (Ps). Thus, "negatively" rated items do not participate in the top-N list formation. Moreover, it is obvious that the generation of the top-N list for the user-based approach is more complex and time consuming. The reason is that the user-based algorithm first finds the user neighbors in the neighborhood matrix and then counts the presences of items in the user-item matrix. In contrast, with the item-based approach the whole work is completed in the item neighborhood matrix.

Karypis (2001) reports another technique, which additionally considers the degree of similarity between items. This takes into account that the similarities of the k neighbors may vary significantly. Thus, for each item in the neighborhood, this technique accumulates not just the number of appearances but the similarities of the neighbors as well. The N items with the highest sum of similarities are finally recommended. Henceforth, this technique is denoted as Highest-Sum-of-Similarities item recommendation (HSS). HSS is applicable only to item-based CF. For our example, the top-2 list based on this algorithm includes the items I6, I2, because they have the greatest sums of similarities. Note that for both techniques, we have to remove from the recommended items those that are already rated by the test user.

Based on the aforementioned rationale, we wanted to examine additional criteria against MF (used in the majority of existing works), in order to see whether this direction is promising for future work, because besides Karypis (2001), very few works elaborate on this issue.
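The two ranking criteria can be sketched as follows (our own illustration under the paper's conventions: a 0-valued cell means "unrated", Ps is the positive threshold; the function names and the `neighbors_of`/`sims_of` structures are ours):

```python
import numpy as np

def top_n_most_frequent(train, neighbors, test_ratings, Ps=3, N=20):
    """MF (user-based flavor): count, per item, the neighbors that rated it
    positively, and return the N most frequent unseen items."""
    freq = np.zeros(train.shape[1], dtype=int)
    for v in neighbors:
        freq += (train[v] >= Ps)
    freq[test_ratings > 0] = -1  # never recommend items the test user already rated
    return list(np.argsort(-freq)[:N])

def top_n_highest_sum_of_similarities(neighbors_of, sims_of, rated, N=20):
    """HSS (item-based only): rank candidate items by the sum of their
    similarities to the items already rated by the test user."""
    scores = {}
    for i in rated:  # items rated positively by the test user
        for j, s in zip(neighbors_of[i], sims_of[i]):
            if j not in rated and s > 0:
                scores[j] = scores.get(j, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:N]
```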
3.3. Third stage factors

Evaluation metrics: For the evaluation of CF algorithms, several metrics have been used in related work (Herlocker et al., 2002; Herlocker, Konstan, Terveen, & Riedl, 2004), for instance the Mean Absolute Error (MAE) or the Receiver Operating Characteristic (ROC) curve. Although MAE has been used in most related works, it has received criticism as well (McLauglin & Herlocher, 2004). From our experimental study we understood that MAE is able to characterize the accuracy of prediction, but it is not indicative of the accuracy of recommendation, as algorithms with worse MAE often produce more accurate recommendations than others with better MAE. Since in real-world recommender systems the experience of users mainly depends on the accuracy of recommendation, MAE may not be the preferred measure.

Other extensively used metrics are precision and recall. These metrics are simple, well known, and effective for measuring the accuracy of the recommendation procedure. For a test user that receives a top-N recommendation list, let R denote the number of relevant recommended items (the items of the top-N list that are rated higher than Ps by the test user). We define the following:

• Precision is the ratio of R to N.
• Recall is the ratio of R to the total number of relevant items for the test user (all items rated higher than Ps by him).

Notice that with the previous definitions, when an item in the top-N list is not rated at all by the test user, we consider it as irrelevant, and it counts negatively towards precision (as we divide by N). In the following we also use F1 = 2 · recall · precision / (recall + precision). F1 is used because it combines both of the previous metrics.

Recently, McLauglin and Herlocher (2004) have proposed a modified precision metric for evaluating user experience, to address obscure recommendations that cannot be identified with the MAE measure. To keep comparisons with prior work clear, we focus on the traditional precision/recall metrics. However, we also consider the issue of items' popularity, in order to test whether the examined algorithms recommend obscure items or not.

Setting a baseline method: Existing experimental evaluations lack a comparison against a baseline algorithm. The baseline algorithm has to be simple and to indicate what can be attained with as little effort as possible. Through a baseline, we can see the actual improvement due to existing CF algorithms.

Past/future data: In real-world applications, recommendations are derived only from the currently available ratings of the test user. However, in most related works, all the ratings of each test user are considered a priori known. For a more realistic evaluation, recommendations should consider the division of the items of the test user into two sets (Huang, Chen, & Zeng, 2004): (i) the past items of the test user and (ii) the future items of the test user. As most existing works ignore this division, their reported performance corresponds to the best case, because they exploit a priori known information. To make this clearer, consider an example. Assume that a test user has rated 10 items. According to most previous works, all 10 ratings are used to calculate his similarity with each training user. In contrast, we want to measure how algorithms behave when we use a smaller number of items than those he will finally rate. This simulates real-world applications, where the test user gradually rates items and receives recommendations before he provides all his ratings. In the previous example, if we use two past items, we have to compute similarities based only on the two provided ratings and we have to predict the remaining eight items (those in the future set). If only a fraction of items is included in the past set, then the accuracy is expected to decrease compared to the best case. For this reason, we study the effect of this issue.
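Evaluated over the future set of a test user, the three metrics above amount to a few lines (a sketch, ours; `future` maps each withheld item to its actual rating, and we treat ratings above Ps as positive):

```python
def precision_recall_f1(top_n, future, Ps=3):
    relevant = {i for i, r in future.items() if r > Ps}  # items rated higher than Ps
    R = len(relevant.intersection(top_n))                # relevant recommended items
    precision = R / len(top_n) if top_n else 0.0         # unrated items count against us
    recall = R / len(relevant) if relevant else 0.0
    f1 = 2 * recall * precision / (recall + precision) if R else 0.0
    return precision, recall, f1
```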
4. Proposed extensions for memory-based algorithms

In the sequel, we describe our proposed extensions in more detail. These extensions are based on the "big picture" of the CF process, which was obtained from the investigation of the factors described in Section 3.

4.1. First stage extension: the UNION similarity measures

We first examined the two most important factors that are involved in the first stage: sparsity and the similarity measure. As mentioned, for the formation of the sets S and T (see Eqs. (1) and (2)), past work (Herlocker et al., 2004; McLauglin & Herlocher, 2004; Sarwar et al., 2001) takes into account only the items that are co-rated by both users.

Example a: In Fig. 4, Example a depicts the ratings of four users, U1–U4, over six items. When comparing the similarity of users U1 and U2 with only co-rated items considered, the similarity measure will be computed based only on the ratings for I1 and I3.

Fig. 4. (a) Example of the ratings of four users over six items (a dash denotes an empty value) and (b) example of items rated positively by a test user Utest and two other users U1 and U2.

In the case of sparse data, we have a very small amount of provided ratings with which to compute the similarity measure. By additionally constraining S and T to co-rated items only, we further reduce the effective amount of used information. To avoid this, we consider alternative definitions for S and T, given in Eq. (3):

$$S=I_u\cup I_v,\qquad T=U_i\cup U_j \tag{3}$$

According to Eq. (3), S includes items rated by at least one of the users. In Example a of Fig. 4, besides the ratings for I1 and I3, the ratings for I2 and I4 will be considered too (the issue of how to treat items rated by only one user will be discussed in the following). Similar reasoning is followed for the set T, in the case of IB CF. By combining the definitions of S and T given in Eq. (3) with the Pearson correlation and adjusted cosine similarity measures, we get two reformed measures: UNION Pearson correlation (for UB) and UNION adjusted cosine (for IB), respectively. (Henceforth, when it is clearly understood from the context whether we discuss UB or IB, we use only the name UNION.) Notice that in the case of UNION Pearson correlation, user mean ratings correspond to the average user ratings over all rated items.

To further understand why we should not base similarity only on co-rated items, consider the following example.

Example b: In Fig. 4, Example b depicts the items rated positively (i.e., higher than Ps) by a test user Utest and two users, U1 and U2, belonging to the training set. Utest and U1 have co-rated items I1 and I2. Assume that Utest and U1 rated I1 with 5 on the 1–5 scale, whereas I2 has been rated by both of them with 4. Nevertheless, items I3–I9 are rated only by Utest or U1, and not by both. In this case, the Pearson measure of Eq. (1), which is based on co-rated items only, results in the maximum possible similarity value (i.e., equal to 1) between Utest and U1. However, this is based only on the two co-rated items and ignores the seven items that are rated by only one of them. On the other hand, assume that U2 rated I1 and I2 with 5 and 4, respectively. As previously, the Pearson measure of Eq. (1) results in the maximum possible similarity between Utest and U2, whereas Utest and U2 differ in three items rated by only one of them. This example reflects the inability of Eq. (1) to capture the actual notion of similarity: despite the fact that Utest and U1 differ at seven items and Utest and U2 differ at three, it assigns the same similarity value in both cases.
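A sketch of the UNION Pearson correlation (Eq. (3) combined with Eq. (1)); the `fill` value assigned to items rated by only one of the two users is discussed in Section 4.1.1 (the paper settles on 0), and, as noted above, user means are taken over all of each user's rated items:

```python
import numpy as np

def union_pearson(ratings, u, v, fill=0.0):
    # Assumes both users have rated at least one item each
    union = (ratings[u] > 0) | (ratings[v] > 0)  # S = I_u ∪ I_v
    if not union.any():
        return 0.0
    # Means over ALL items each user rated (not just co-rated ones)
    mu = ratings[u][ratings[u] > 0].mean()
    mv = ratings[v][ratings[v] > 0].mean()
    # Items rated by only one user get the assumed `fill` value for the other
    ru = np.where(ratings[u][union] > 0, ratings[u][union], fill) - mu
    rv = np.where(ratings[v][union] > 0, ratings[v][union], fill) - mv
    denom = np.sqrt((ru ** 2).sum()) * np.sqrt((rv ** 2).sum())
    return float(ru @ rv / denom) if denom > 0 else 0.0
```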
In the previous example, if we designate U1 as a neighbor of Utest, we ignore two issues: (i) Items I3 and I4 will not be recommended to Utest by U1, as U1 has not rated them; this fact harms recall. (ii) Items I5–I9 will be recommended by U1, but as they have not been rated by Utest, this will harm precision. It follows that a desirable property for a similarity measure is to maximize the number, x, of items that are co-rated by the test user and each of his neighbors, relative to the number, y, of items that are rated by only one of them (in the example of Fig. 4b, for Utest and U1, x = 2 and y = 7). In the best case, the ratio x/(x + y) has a value equal to 1 and in the worst case 0.

To evaluate the previously described argument, we compared Pearson correlation against UNION by performing the following measurement. We used the MovieLens 100K data set, and for each test user we computed his k nearest neighbors (k was set to 10) from the training set. Next, we measured x and y between each test user and each of his k neighbors. Fig. 5a illustrates, for each x, the resulting ratio x/(x + y). Fig. 5a clearly shows that the Pearson measure results in significantly lower ratios than UNION. This explains why UNION compares favorably to Pearson correlation in terms of precision and recall, as will be presented experimentally in Section 6. (Due to lack of space we do not present the analogous comparison between adjusted cosine and UNION.)

Fig. 5. (a) Measuring the ratio x/(x + y) and (b) impact of the assigned value for unrated items.

4.1.1. Assigning a value to unrated items

To calculate the UNION Pearson correlation between two users U1 and U2, we have to assign a rating by U1 to an item that is rated only by U2 (e.g., I2 in the example of Fig. 4a) and vice versa. The same requirement holds for the UNION adjusted cosine measure in the IB case. Notice that this problem cannot be effectively solved with existing techniques for filling missing values, because the sparsity of user-item matrices severely hinders this task (more than 99.9% missing values). For this reason we assign the same rating value to all the required cases. There could be several options for this value. For instance, on a 1–5 scale, we can assign the value 0, to reflect that the absence of a rating for an item denotes indifference. Assuming that user u did not rate item i, another option is to assign the average value of the ratings provided on other items by u, or the average of the ratings provided on i by all users.

To examine the impact of the selected value, we measured F1 versus the assigned value, which is depicted in Fig. 5b for the MovieLens 100K data set (it uses a 1–5 scale). The dashed line in Fig. 5b corresponds to the F1 of the Pearson correlation (it is independent of the assigned value). As shown, values between the positive threshold (in this case Ps was set to 3) and the maximum rating of the scale result in reduced F1 (notice that this range also includes the user average rating value). The reason is that these values impair the ability to distinguish the assigned values from the actual positive ratings.
However, when we assign values smaller than Ps or outside the rating scale, F1 is not affected. The reason is that with such assigned values we do not lose the ability to distinguish the assigned values from the actual positive ratings (as the latter are always within the provided scale). Thus, we conclude that UNION is not significantly affected by the selection of the assigned rating, as all values outside the rating scale or below Ps result in about the same F1. Even more, the values between Ps and the upper limit of the scale still result in significantly higher F1 than the Pearson measure. Henceforth, we assume that the assigned value is equal to 0.

4.2. Second stage extensions: the HPR and HSR algorithms

In Section 3.2, we described the two criteria, MF and HSS, that have been proposed so far for the generation of the top-N list. The ranking criteria are important, as they can significantly affect the performance of CF. For this reason we examined additional criteria, to see whether this direction of research is worth investigating.

As a first extension of the existing ranking criteria, one could use the predicted values for each item to rank them. Predicted values (McLauglin & Herlocher, 2004) are computed by Eqs. (4) and (5), for the cases of user-based and item-based CF, respectively. These equations have been used in related work for the purpose of MAE calculation (Eq. (6)), whereas we use them for the generation of the top-N list.

$$p_{u,i}=\bar{r}_u+\frac{\sum_{v\in U}\mathrm{sim}(u,v)(r_{v,i}-\bar{r}_v)}{\sum_{v\in U}|\mathrm{sim}(u,v)|} \tag{4}$$

$$p_{u,i}=\bar{r}_i+\frac{\sum_{j\in I}\mathrm{sim}(i,j)(r_{u,j}-\bar{r}_j)}{\sum_{j\in I}|\mathrm{sim}(i,j)|} \tag{5}$$

$$\mathrm{MAE}=\frac{\sum_{u=1}^{|T|}|r_{u,i}-p_{u,i}|}{|T|} \tag{6}$$

Therefore, we sort the items (in descending order) according to the predicted rating value and recommend the first N of them (if fewer than N items have positive predicted ratings, i.e., not less than Ps, then fewer than N items remain in the list). This ranking criterion, denoted as highest predicted rated item recommendation (HPR), is influenced by the good accuracy of prediction that existing related work reports through the MAE. HPR opts for recommending the items that are more probable to receive a higher rating. Nevertheless, as already mentioned, MAE is not indicative of the accuracy of recommendation. As our experimental results will demonstrate, HPR presents poor performance. This fact is another indication that MAE alone cannot characterize the performance of CF algorithms. In Fig. 6a, we present the aforementioned criterion for the user-based approach.

As another criterion, which resulted from observations during our experimental investigation, we sum the positive ratings of the items in the neighborhood, instead of just counting their frequency. This criterion is denoted as highest sum of rates item recommendation (HSR). The top-N list consists of the N items with the highest sum. The intuition behind HSR is that it takes into account both the frequency (as MF) and the actual ratings, because it wants to favor items that appear most frequently in the neighborhood and have the best ratings. Assume, for example, an item i that has a slightly smaller frequency than an item j. If i is rated much higher than j, then HSR will prefer i, whereas MF will favor j. In the running example, for test user U8, for the user-based approach with k = 3 and N = 1, using this criterion for the generation of the top-N list we finally recommend item I2, because its sum of ratings in the neighborhood is 9 (4 + 5), instead of item I3, whose sum of ratings is 8 (4 + 4). If we had used the most-frequent top-N list generation, these two items would have the same presence (2) in the neighborhood, so both of them could be proposed. In Fig. 6b, we present the aforementioned criterion for the user-based approach.

Fig. 6. Generation of top-N list based on (a) the highest predicted rated (HPR) algorithm and (b) the highest sum of rates (HSR) algorithm.
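Sketches of the prediction of Eq. (4) and of the HSR criterion (our own; `sims` is a user-user similarity matrix, and, although Eq. (4) is written over all users, the sum in practice runs over the formed positive-similarity neighborhood, which is what we sketch):

```python
import numpy as np

def user_mean(train, w):
    # Mean over the items user w actually rated (0 = unrated)
    rated = train[w] > 0
    return train[w, rated].mean() if rated.any() else 0.0

def predict_user_based(train, sims, u, i, neighbors):
    """Eq. (4): the test user's mean plus the similarity-weighted deviation
    of the neighbors' ratings on item i from their own means."""
    num = den = 0.0
    for v in neighbors:
        if train[v, i] > 0:
            num += sims[u, v] * (train[v, i] - user_mean(train, v))
            den += abs(sims[u, v])
    return user_mean(train, u) + (num / den if den > 0 else 0.0)

def top_n_hsr(train, neighbors, test_ratings, Ps=3, N=20):
    """HSR: sum the positive ratings found in the neighborhood, so both
    frequency (as in MF) and rating magnitude count."""
    score = np.zeros(train.shape[1])
    for v in neighbors:
        pos = train[v] >= Ps
        score[pos] += train[v][pos]
    score[test_ratings > 0] = -np.inf  # exclude already-rated items
    return list(np.argsort(-score)[:N])
```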
4.3. Third stage extension: a baseline algorithm

For the third stage, we considered all the issues that are described in Section 3.3. One issue that needs further explanation is the definition of a baseline algorithm. We propose the one that recommends the N items that are most frequently rated positively in the entire training set. This algorithm is denoted as BL. BL is very simple and, as will be shown in our experimental results, it is quite effective. For instance, our experiments with the MovieLens 100K data set have shown that with BL, when we simply propose the N = 20 most positively rated movies (the 20 most popular movies), precision reaches 40%. We examined these movies and verified that they are not obscure at all, as one can recognize in Fig. 7. Therefore, the most frequently rated items are very likely to be selected by the majority of the users. For the aforementioned reasons, BL is a tool for clearly evaluating the actual improvement of existing CF algorithms. We will show experimentally that, as we increase the k nearest neighbors, the performance of Pearson and adjusted cosine becomes equivalent to BL.

Fig. 7. The 20 most positively rated movies of the MovieLens 100K data set.
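BL reduces to a popularity count over the training set; a minimal sketch:

```python
import numpy as np

def baseline_top_n(train, test_ratings, Ps=3, N=20):
    # Count, per item, how many training users rated it positively
    freq = (train >= Ps).sum(axis=0)
    freq[test_ratings > 0] = -1  # skip items the test user has already rated
    return list(np.argsort(-freq)[:N])
```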
5. Proposed method for an LSI-based algorithm

Our approach initially applies Singular Value Decomposition (SVD) over the user-item matrix A. We tune the number, c, of principal components (i.e., dimensions) with the objective of revealing the major trends. The tuning of c is determined by the information percentage that is preserved compared to the original matrix. Therefore, a c-dimensional space is created, and each of the c dimensions corresponds to a distinctive rating trend. Next, given the current ratings of the target user u, we enter a pseudo-user vector into the c-dimensional space. Finally, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional space and apply either user- or item-based similarity to compute the top-N recommended items. Conclusively, the provided recommendations consider the existence of user rating trends, as the similarities are computed in the reduced c-dimensional space, where dimensions correspond to trends.

To ease the discussion, we will use the running example illustrated in Fig. 8, where I1–I4 are items and U1–U4 are users. As shown, the example data set is divided into a training and a test set. The null cells (no rating) are represented as zeros.

Fig. 8. (a) Training set (n × m) and (b) test set.

5.1. Applying SVD on training data

Initially, we apply SVD on the n × m training matrix A, which produces three matrices. Multiplying these three matrices reconstructs the initial matrix, as Eq. (7) and Fig. 9 show:

$$A_{n\times m}=U_{n\times n}\cdot S_{n\times m}\cdot V'_{m\times m} \tag{7}$$

Fig. 9. Example of: A_{n×m} (initial matrix A), U_{n×n} (left singular vectors of A), S_{n×m} (singular values of A), V'_{m×m} (right singular vectors of A).

5.2. Preserving the principal components

It is possible to reduce the n × m matrix S so that it keeps only the c largest singular values. Then, the reconstructed matrix is the closest rank-c approximation of the initial matrix A, as shown in Eq. (8) and Fig. 10:

$$A^{*}_{n\times m}=U_{n\times c}\cdot S_{c\times c}\cdot V'_{c\times m} \tag{8}$$

Fig. 10. Example of: A*_{n×m} (approximation matrix of A), U_{n×c} (left singular vectors of A*), S_{c×c} (singular values of A*), V'_{c×m} (right singular vectors of A*).

As mentioned, we tune the number, c, of principal components according to the information percentage that is preserved compared to the original matrix. We have to notice that in the running example we create a 2-dimensional space using 83% of the total information of the matrix (12.88/15.39). In our experiments we have seen that preserving only 10% is adequate to provide accurate results.

5.3. Inserting a test user in the c-dimensional space

Related work (Sarwar et al., 2000) has studied SVD on CF considering the test data as a priori known. It is evident that, for the user-based approach, the test data should be considered unknown in the c-dimensional space. Thus, a specialized insertion process should be used. Given the current ratings of the test user u, we enter a pseudo-user vector into the c-dimensional space using Eq. (9) (Furnas et al., 1988). In the current example, we insert U4 into the 2-dimensional space, as shown in Fig. 11:

$$u_{\mathrm{new}}=u\cdot V_{m\times c}\cdot S^{-1}_{c\times c} \tag{9}$$

In Eq. (9), u_new denotes the mapped ratings of the test user u, whereas V_{m×c} and S^{-1}_{c×c} are matrices derived from SVD. This u_new vector should be appended to the end of the U_{n×c} matrix, which is shown in Fig. 10. Notice that the inserted vector values of test user U4 are very similar to those of U2 after the insertion. This is reasonable, because these two users have similar ratings, as shown in Fig. 8.

Fig. 11. Example of: u_new (inserted new user vector), u (user vector), V_{m×c} (two left singular vectors of V), S^{-1}_{c×c} (two singular values of inverse S).

Note that this insertion process is omitted in the item-based approach, which makes the item-based approach more applicable in real applications in terms of complexity. In our experiments, we will present this drawback of the user-based approach in terms of execution time.
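Eqs. (7)–(9) map directly onto a few lines of numpy; the following sketch (ours, not the authors' implementation) builds the rank-c model and folds a new user in. Interpreting the "preserved information" percentage as the ratio of retained to total singular value mass mirrors the 12.88/15.39 computation of the running example:

```python
import numpy as np

def build_model(A, c):
    """Truncated SVD of Eq. (8): A* = U_{n×c} · S_{c×c} · V'_{c×m}."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # Eq. (7): A = U · S · V'
    kept = s[:c].sum() / s.sum()                      # preserved "information" ratio
    return U[:, :c], np.diag(s[:c]), Vt[:c, :], kept

def fold_in(u, Vt_c, S_c):
    """Eq. (9): u_new = u · V_{m×c} · S^{-1}_{c×c} maps a pseudo-user's
    raw rating vector into the c-dimensional space."""
    return u @ Vt_c.T @ np.linalg.inv(S_c)

# Toy usage on a small random user-item matrix
A = np.random.randint(0, 6, size=(8, 5)).astype(float)
U_c, S_c, Vt_c, kept = build_model(A, c=2)
u_new = fold_in(A[0], Vt_c, S_c)  # fold an (already known) row back in, as a check
```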
5.4. Generating the neighborhood of users/items

Having a reduced dimensional representation of the original space, we form the neighborhoods of users/items in that space. For the user-based approach, we find the k nearest neighbors of the pseudo-user vector in the c-dimensional space. The similarities between training and test users can be based on cosine similarity. First, we compute the matrix U_{n×c} · S_{c×c} and then we perform vector similarity. This n × c matrix is the c-dimensional representation for the n users. Note that in our experiments we also generate, as a second approach, the similarity matrix taking into account only the U_{n×c} matrix. As we will see, the results are quite impressive, but they apply only to the user-based approach.

For the item-based approach, we find the k nearest neighbors of an item vector in the c-dimensional space. First, we compute the matrix S_{c×c} · V'_{c×m} and then we perform vector similarity. This c × m matrix is the c-dimensional representation for the m items.

5.5. Generating the recommendation list

As mentioned in Section 3.2, existing ranking criteria, such as MF, are used for the generation of the top-N list in classic CF algorithms. We propose a ranking criterion that uses the predicted values of a user for each item. Predicted values are computed by Eqs. (4) and (5), for the cases of user-based and item-based CF, respectively. These equations have been used in related work for the purpose of MAE calculation, whereas we use them for the generation of the top-N list.

Therefore, we sort the items (in descending order) according to the predicted rating value and recommend the first N of them (if fewer than N items have positive predicted ratings, i.e., not less than Ps, then fewer than N items remain in the list). This ranking criterion, denoted as highest predicted rated item recommendation (HPR), is influenced by the good accuracy of prediction that existing related work reports through the MAE. HPR opts for recommending the items that are more probable to receive a higher rating. As our experimental results will demonstrate, HPR presents poor performance for the classic CF algorithms, but spectacular results when it is used in combination with SVD. The reason is that in the latter case it is based only on the major trends of the users.

5.6. Evaluation of the CF process

Related work (Sarwar et al., 2000) proposed Eq. (10) for the generation of predictions, where ⊕ denotes the addition of a scalar to a matrix:

$$p_{u,i}=\bar{r}_u\oplus U_{n\times c}\cdot\sqrt{S_{c\times c}}\cdot\sqrt{S_{c\times c}}\cdot V'_{c\times m} \tag{10}$$

We tested this equation and found that the predicted values were calculated not from other users but from the user himself. That is, the information of a test user inserted in the c-dimensional space was used to predict his own real ratings, so the predicted values were very close to the real ones. Therefore, although the produced MAE may be good, this procedure is not prediction, because the test user is considered a priori known. In contrast, we use Eqs. (4) and (5) for prediction, to exploit information from other users or items. Thus, we use these predicted values for the calculation of MAE.
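Putting Section 5.4 into code, a sketch of the user-based neighborhood formation in the reduced space. Whether the folded-in pseudo-user should also be scaled by S_{c×c} when the U_{n×c} · S_{c×c} representation is used is not spelled out in the text; scaling it consistently is our assumption:

```python
import numpy as np

def knn_in_reduced_space(U_c, S_c, u_new, k=10, scale=True):
    # Training users: rows of U·S (first approach) or of U alone (the SVDU variant)
    reps = U_c @ S_c if scale else U_c
    target = u_new @ S_c if scale else u_new  # assumption: scale the pseudo-user alike
    # Cosine similarity between the pseudo-user and every training user
    norms = np.linalg.norm(reps, axis=1) * np.linalg.norm(target)
    sims = reps @ target / np.where(norms > 0, norms, 1.0)
    order = np.argsort(-sims)
    return [int(j) for j in order if sims[j] > 0][:k]  # positive similarities only
```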
6. Performance study

In the sequel, we study the performance of the described extensions against existing CF algorithms by means of a thorough experimental evaluation. Both user-based and item-based algorithms are tested. Moreover, we study the performance of the described SVD dimensionality reduction techniques against existing CF algorithms. Several factors are considered, like sparsity, the similarity measures, and the criteria for generating the top-N list.

The additional factors that are treated as parameters are the following: the neighborhood size (k, default value 10), the size of the recommendation list (N, default value 20), the size of the training set (default value 75%), and the division between past/future data. Regarding WS, the γ value was set to 5. Evaluation is performed with the precision and recall metrics (given as percentages). We also use the F1 metric.

We perform experiments with four real data sets that have been used as benchmarks in prior work. In particular, we examined two MovieLens data sets: (i) the first one with 100,000 ratings assigned by 943 users on 1682 movies (this is the default data set; any use of other data sets is explicitly stated), and (ii) the second one with about 1 million ratings for 3592 movies by 6040 users. The range of ratings is between 1 (bad) and 5 (excellent). Moreover, we ran our experiments on the EachMovie data set (McJones & DeTreville, 1997), which contains 2,811,983 ratings entered by 72,916 users for 1628 different movies, and the Jester data set, which contains 4.1 million ratings of 100 jokes from 73,496 users. Finally, in all data sets we normalized the rating scale to the range 1–5, whereas Ps is set to 3 and the value of an unrated item is considered equal to zero.

6.1. Results for user-based CF algorithms

First, we examine user-based CF algorithms and compare the existing Pearson similarity and WS measures against the proposed UNION method. We also include the baseline (BL) algorithm. The results for precision and recall vs. k are displayed in Fig. 12a and b, respectively.

Fig. 12. Performance of user-based CF vs. k: (a) precision and (b) recall.

As shown, the existing Pearson measure, which is based on co-rated items, performs worse than BL. This result is surprising, as BL is very simple. WS improves on Pearson, because the disadvantage of Pearson due to co-rated items is downsized by the weighting with the number of common items. UNION clearly outperforms all other measures. The reason is that the MovieLens data set is sparse and these measures better exploit the available information. UNION reaches an optimum performance for a specific k. In contrast, in the examined range of k values, the performance of Pearson increases with increasing k. Outside the examined k range (not displayed), it stabilizes and never exceeds BL. We have to notice here that, as we increase the number of k nearest neighbors, the Pearson measure is literally transformed to BL.

We now examine the MAE metric. Results are illustrated in Fig. 13a (BL is only for recommendation, not prediction, and is thus omitted). As expected, Pearson yields the lowest MAE values, whereas WS is second best. This fact supports our explanation that MAE is indicative only for the evaluation of prediction and not of recommendation, as these measures did not attain the best performance in terms of precision and recall.

Fig. 13. Performance of user-based CF vs. k: (a) MAE and (b) F1 for dense data.

To consider the impact of density, we also examine the Jester data set. The results for the F1 metric are depicted in Fig. 13b. In this case, the relative differences are smaller than for the case of sparse data.
The reason is that dense data have a sufficient amount of information, thus there is less need to exploit information in the way UNION does.

Finally, we test the described criteria for the top-N list generation: MF, HPR, and HSR. We used the UNION similarity measure, because it presented the best performance in the previous measurements. The results for precision and recall are given in Fig. 14a and b, respectively. Fig. 14a shows that HPR is clearly outperformed by the other two criteria, a fact that further illustrates the unsuitability of the MAE metric for characterizing the task of recommendation. On the other hand, MF and HSR present similar precision. Fig. 14b presents the results on recall (to make the comparison between MF and HSR clearer, we omit HPR). HSR is constantly better than MF, though slightly.

Fig. 14. Comparison of criteria for the generation of the top-N list for user-based CF vs. k: (a) precision and (b) recall.

6.2. Results of LSI application on user-based CF algorithms

Regarding the performance of SVD dimensionality reduction in the user-based approach, we preserve, each time, a different fraction of the principal components of the SVD model. More specifically, we preserve 90%, 70%, 50%, 30% and 10% of the total information of the initial user-item matrix. The results for precision and recall vs. k are displayed in Fig. 15a and b, respectively.

Fig. 15. Performance of SVD dimensionality reduction of user-based CF vs. k: (a) precision and (b) recall.

As we can see, the performances of SVD50, SVD70 and SVD90 are similar in terms of precision and recall. The reason is that SVD50 is adequate for producing a good approximation of the original matrix. Thus, we will continue our study with two representative SVD models, SVD50 and SVD10. These choices are indicative of the behavior of the SVD model.

We now move on to the comparison of the existing user-based CF algorithm that uses Pearson similarity against the two representative SVD reductions. The results for precision and recall vs. k are displayed in Fig. 16a and b, respectively.

Fig. 16. Performance of user-based CF vs. k: (a) precision and (b) recall.

As shown, the existing Pearson measure, which is based on co-rated items, performs worse than the SVD reductions. The two proposed reductions clearly outperform the Pearson measure. We have to notice that SVD50 performs equally with the UNION algorithm. The reason is that the MovieLens data set is sparse and relatively large (high n value). The SVD reductions reveal the most essential dimensions and filter out the outliers and misleading information. So, with less information, we have almost the same result. SVD50 performs a little better than SVD10, as it uses more information. In contrast, the latter SVD reduction uses only 11 dimensions instead of the 157 of the former. This means that the latter runs faster than the former.

Now, we compare the SVD50 and UNION algorithms with the variation we described in Section 5.4, denoted as SVDU50. As we can see, SVDU50 performs even better than the UNION algorithm. We propose it as a second way of producing the users' similarity matrix.
To consider the impact of scalability, we also examine the 1M data set. The results for the F1 metric are depicted in Fig. 17b. In this case, the relative difference broadens as k increases, compared to the case of the 100K data set. The reason is that the 1M data set is sparser (the percentage of non-rated items exceeds 95%) than the 100K data set (where the corresponding percentage is 93%). Moreover, the rank (i.e., the number of independent dimensions) of the 1M data set is 3010, instead of 943 for the 100K one. This means that the former contains many dependent dimensions that can be compressed.

Fig. 17. Performance of user-based CF vs. k: (a) F1 for the 100K MovieLens data set and (b) F1 for the 1M MovieLens data set.

Finally, we test the HPR criterion for top-N list generation. The results for precision and recall are given in Fig. 18. As shown, the combination of the SVD similarity measure with HPR as the list generation algorithm clearly outperforms Pearson with HPR. This is due to the fact that, in the former, the remaining dimensions are the determinative ones and outlier users have been rejected. Note that with SVD50 we preserve only 157 basic dimensions, instead of the 943 of the latter.

Fig. 18. Comparison of the HPR criterion for the generation of the top-N list for user-based CF vs. k: (a) precision and (b) recall.

6.3. Results for item-based CF algorithms

We perform similar measurements for the case of item-based CF. Thus, we first examine precision and recall for the existing adjusted cosine measure (which considers co-rated items) against UNION for the item-based case (which considers not only co-rated items, through the extended definition of the T set). The results are depicted in Fig. 19 and are analogous to those of the user-based case. UNION clearly outperforms adjusted cosine and WS. Again, it is surprising to find that the item-based CF algorithm with the adjusted cosine measure loses out to BL.

Fig. 19. Performance of item-based CF vs. k: (a) precision and (b) recall.

Next, we compare adjusted cosine, UNION, and WS with respect to MAE. The results are illustrated in Fig. 20a. Differently from the case of user-based CF, all measures have similar MAE, and for larger k values (where all of them reach the optimum MAE value) UNION and adjusted cosine are equal. The reason that adjusted cosine does not present better MAE is that its denominator considers all items and not just the co-rated ones (see Eq. (2)). This improves its performance for the task of recommendation and worsens the performance of prediction.7

7 We have examined a variation of adjusted cosine that uses only the co-rated items in the denominator. As expected, it resulted in worse precision and recall, but better MAE.
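For reference, a minimal numpy sketch of the adjusted cosine computation follows. As above, 0 denotes an unrated entry; ratings are centered on each user's mean, the numerator runs over the users who co-rated both items, and, as noted for Eq. (2), the denominator is not restricted to the co-rated entries. The users × items matrix layout is an assumption for illustration.

```python
import numpy as np

def adjusted_cosine(R, i, j):
    """Adjusted cosine between items i and j (0 = unrated). Ratings
    are centered on each user's mean; the numerator runs over users
    who co-rated both items, the denominator over all users who
    rated each item (cf. Eq. (2))."""
    rated = R > 0
    means = np.where(rated.any(axis=1),
                     R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1),
                     0.0)
    dev = np.where(rated, R - means[:, None], 0.0)  # 0 where unrated
    co = rated[:, i] & rated[:, j]                  # co-rating users
    num = (dev[co, i] * dev[co, j]).sum()
    den = np.sqrt((dev[:, i] ** 2).sum() * (dev[:, j] ** 2).sum())
    return float(num / den) if den else 0.0

# Hypothetical 3-user, 3-item matrix.
R = np.array([[5, 3, 0],
              [4, 0, 2],
              [1, 5, 4.0]])
print(adjusted_cosine(R, 0, 1))
```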
Regarding the examination of the dense data set (Jester), the results for the F1 metric are illustrated in Fig. 20b. Although item-based CF has been designed to suit the needs of sparse data, we find that, for dense data, all item-based algorithms are outperformed by BL. This is the case even for UNION, although it performs better than adjusted cosine. This result clarifies the need to examine CF algorithms with respect to all the involved factors, in this case density, in order to draw complete conclusions.

Fig. 20. Performance of item-based CF vs. k: (a) MAE and (b) F1 for dense data.

Finally, we test the three described criteria, MF, HPR, and HSS, for top-N list generation by item-based CF algorithms. The results for precision are illustrated in Fig. 21a. Similarly to the user-based case, HPR performs very poorly. Fig. 21b shows the resulting recall (HPR is omitted to make the comparison clearer). MF performs slightly better than HSS. However, no clear winner can be identified, because for the other data sets HSS performed slightly better.

Fig. 21. Comparison of criteria for the generation of the top-N list for item-based CF vs. k: (a) precision and (b) recall.
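As a concrete illustration of the MF criterion used throughout these comparisons, the sketch below counts, among the nearest neighbors, how often each candidate item was rated positively (above Ps = 3) and returns the N most frequent items not yet rated by the target user. The dictionary-based data layout is a hypothetical convenience, not the paper's implementation.

```python
from collections import Counter

def top_n_most_frequent(neighbor_ratings, target_rated, n=20, ps=3):
    """Most-Frequent (MF) criterion: count how many nearest neighbors
    rated each candidate item positively (above Ps) and return the N
    most frequent items the target user has not rated yet."""
    counts = Counter(item
                     for ratings in neighbor_ratings      # one dict per neighbor
                     for item, r in ratings.items()
                     if r > ps and item not in target_rated)
    return [item for item, _ in counts.most_common(n)]

# Three hypothetical neighbors, each a {item_id: rating} dict.
neighbors = [{1: 5, 2: 4, 3: 2}, {2: 5, 4: 4}, {2: 4, 3: 5}]
print(top_n_most_frequent(neighbors, target_rated={1}, n=2))  # [2, 4]
```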
6.4. Results of LSI application on item-based CF algorithms

Regarding the performance of SVD dimensionality reduction in the item-based approach, we first examine precision and recall for the existing adjusted cosine measure (which considers co-rated items) against SVD50 and SVD10 for the item-based case. The results are depicted in Fig. 22 and are analogous to those of the user-based case. SVD50 and SVD10 clearly outperform adjusted cosine. Notice that, unlike the user-based case, the difference between SVD50 and SVD10 is greater. This means that the item-based algorithm cannot preserve accuracy satisfactorily when we decrease the percentage of preserved information.

Fig. 22. Performance of item-based CF vs. k: (a) precision and (b) recall.

6.5. Comparative results in terms of effectiveness

In this section, we compare user-based and item-based CF using the best options as they have resulted from the previous measurements. Therefore, for both user-based and item-based CF we use the UNION similarity measure. With respect to the criterion for the generation of the top-N list, for user-based CF we use HSR, whereas for item-based CF we use MF. The results for the F1 metric are displayed in Fig. 23a. These results demonstrate that user-based CF, when using the best options, compares favorably against item-based CF, even when the latter also uses its best options. The difference in F1 is larger than 0.3; the difference with respect to precision is larger than 10%, whereas with respect to recall it exceeds 5% (we refer to the optimum values resulting from the tuning of k). This conclusion contrasts the existing one, that item-based CF is preferable to user-based CF for sparse data. The reason is that user-based CF is more focused towards the preferences of the target user. In contrast, with item-based CF, the recommended items may have been found similar through the transactions of users with preferences much different from those of the target user; thus, they may not directly reflect the preferences of the latter. The differences reported in prior work can only be explained by the fact that the algorithms did not use the best possible options.

In Fig. 23b, we present again, for reasons of clarity, the comparison between the UNION algorithm and SVDU50. As we can see, SVDU50 performs favorably against the UNION algorithm. The reason is that the SVD reduction reveals the most essential dimensions and filters out the outliers and misleading information. So, by applying LSI to the user-based algorithm, we attain the best possible result in terms of accuracy.

Fig. 23. Comparison between item-based and user-based CF vs. k.

6.6. Comparative results in terms of efficiency

Regarding execution time, we measured the wall-clock time for the on-line parts of the user-based and item-based algorithms. The results regarding the time needed for a user recommendation vs. k are presented in Fig. 24a, whereas the results for a varying percentage of preserved information are depicted in Fig. 24b (in the latter measurement we set k = 10). As already mentioned, item-based CF needs less time to provide recommendations than user-based CF, and this holds for both of the aforementioned measurements. The reason is that, in the user-based approach, the user-rating vector has to be inserted into the c-dimensional space. Moreover, the generation of the top-N list further burdens the user-based CF process, because the algorithm first finds the user's neighbors in the neighborhood matrix and then counts the occurrences of items in the user-item matrix. In contrast, with the item-based approach, the whole work is completed within the item neighborhood matrix. So, in terms of execution time, the item-based approach is superior to the user-based one.

Fig. 24. Comparison between item-based and user-based CF in terms of execution time: (a) time vs. k and (b) time vs. percentage of preserved information.
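The insertion step mentioned above corresponds to the classic LSI folding-in of Berry et al. (1994): the new user's rating vector is projected into the c-dimensional space through the truncated factor matrices, which is how a pseudo-user is formed (the exact construction is given in Section 5.3). A minimal numpy sketch, with illustrative matrix sizes, follows.

```python
import numpy as np

def fold_in_user(u_new, Vt_c, s_c):
    """Classic LSI folding-in: project a new user's rating vector into
    the c-dimensional space, u_c = u_new . V_c . S_c^{-1}. This
    projection is the on-line cost that the user-based approach pays
    for every recommendation request."""
    return (u_new @ Vt_c.T) / s_c

# Illustrative sizes: 50 users, 30 items, c = 5 latent dimensions.
R = np.random.randint(0, 6, size=(50, 30)).astype(float)
U, s, Vt = np.linalg.svd(R, full_matrices=False)
c = 5
u_c = fold_in_user(R[0], Vt[:c], s[:c])   # the pseudo-user in c dimensions
```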
6.7. Examination of the additional factors

In this section, we examine the impact of the additional factors. In our measurements we consider the two existing cases, that is, user-based CF with Pearson similarity and item-based CF with adjusted cosine (both of which consider co-rated items). First, we examine the impact of N (the size of the recommendation list). The results are depicted in Fig. 25. As expected, with increasing N, recall increases and precision decreases. The relative differences between the algorithms are consistent with those in our previous measurements.

Fig. 25. Comparison vs. N: (a) precision and (b) recall.

Next, we test the impact of the size of the training set, which is expressed as a percentage of the total data set size. The results for the F1 metric are given in Fig. 26a (to make the graph clearer, we show results only for the two best user-based and item-based algorithms). As expected, when the training set is small, performance degrades for both algorithms; therefore, we should be careful to use adequately large training sets when evaluating CF algorithms. Similarly to the previous measurements, in all cases user-based UNION is better than item-based UNION. The performance of both reaches a peak at around 75%, after which it declines. The reason for this is the overfitting that results from very large training sets.

Fig. 26. (a) Comparison vs. training set size and (b) comparison vs. past size.

We also examine the performance when considering the division between past and future data. For each user's transaction in the test set, we keep 70% as future data and use a varying number of ratings from the remaining 30% as past data. This way, we examine the impact of the past's size. We compare user-based UNION with item-based UNION using the F1 metric. The results are illustrated in Fig. 26b. As the size of the considered past reduces, the performance of both algorithms reduces. This result demonstrates the need to evaluate CF algorithms using the division between past and future data, because this division is more indicative of the actual performance in real-world CF applications. Nevertheless, user-based UNION manages to keep a higher F1 value for small past sizes, whereas both algorithms converge to the same point for larger pasts. The merit of user-based UNION is evident, as in real-world applications the past size usually takes very small values (users should not have to wait long before they start receiving recommendations).

Finally, we test the examined algorithms with respect to the quality of the provided recommendations. The results are presented in Fig. 27. In particular, we are interested in determining whether these algorithms provide obscure recommendations or not. For this reason, we sort the items according to the number of positive ratings (i.e., higher than Ps) they received. Each time, we keep a varying number of these top rated items (e.g., the top 20, 40, etc.). These top rated items are considered more popular, thus they do not comprise obscure recommendations. We measure the percentage of the top rated items that are correctly predicted by each CF algorithm.8 This measurement indicates how many of the correct predictions belong to the most rated items.

8 The alternative of simply counting the predictions, regardless of whether they are correct, could provide misleading results and is thus avoided.

Fig. 27. Measuring the percentage of items that are correctly predicted and top rated.

As the number of considered top rated items increases, the Pearson similarity measure (which is based on co-rated items) does not increase the percentage of correctly predicted items that belong to these top rated items. This means that the recommendations of this algorithm tend to fall outside the range of the top rated items; thus, Pearson recommends more obscure items. The adjusted cosine similarity measure presents a small increase, which means that it exhibits the latter problem to a smaller degree. In contrast, both user-based UNION and item-based UNION clearly present the best percentage, and their resulting slope is steep. This means that these algorithms tend to avoid obscure items in their recommendations and concentrate on the first positions of the top rated items.
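A minimal sketch of this measurement follows: given the set of correctly predicted (hit) items and the per-item counts of positive ratings, it computes the fraction of hits that fall within the top-M most positively rated items. The data structures and values are hypothetical, and the exact bookkeeping of hits per algorithm is as described above.

```python
def hit_popularity(hits, item_positive_counts, top_m):
    """Fraction of correctly predicted (hit) items that fall within
    the top_m most positively rated items; a low value means the
    algorithm tends to recommend obscure items."""
    ranked = sorted(item_positive_counts,
                    key=item_positive_counts.get, reverse=True)
    top = set(ranked[:top_m])
    return sum(1 for item in hits if item in top) / max(len(hits), 1)

# Hypothetical counts of positive (> Ps) ratings per item.
counts = {'a': 120, 'b': 95, 'c': 40, 'd': 3}
print(hit_popularity(hits=['a', 'd'], item_positive_counts=counts, top_m=2))  # 0.5
```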
7. Conclusions

We have performed a thorough study of neighborhood-based CF, which brought to the surface several factors that have not been examined carefully in the past. Based on our observations, we proposed extensions of similarity measures, new criteria for generating the recommendation list, and new ways to evaluate the quality of the recommendation result. In addition to these extensions, we proposed an LSI-based approach that combines the effectiveness and efficiency of CF algorithms. We performed an experimental comparison of our proposed method against well-known CF algorithms, such as user-based and item-based methods, on real data sets. Our experimentation reforms several existing beliefs and provides new insights. In particular, we highlight the following conclusions from our examination:

• In contrast to what is reported in the majority of related work, MAE is not indicative of the accuracy of the recommendation process. It is, though, useful for characterizing the quality of the similarity measure (as reflected in the process of prediction).
• Constraining similarity measures to co-rated items weakens the measure. Though it is somewhat useful to consider the number of co-rated items (as WS does), strict constraining inside the formulae for similarity measures is not suitable.
• The proposed extensions, which do not use co-rated items only, substantially improve the performance of CF, especially for sparse data, because they exploit more information. In fact, the existing approaches based on co-rated items are outperformed by a simple baseline algorithm we developed.
• Our results showed that, when following the best options for both user-based and item-based CF, the former compares favorably to the latter. This contrasts with existing results in related work, because until now comparisons did not follow the best options we describe.
• Our results have shown that item-based CF is not appropriate for dense data.
• Our LSI-based approach significantly outperforms existing CF algorithms in terms of accuracy (measured through recall/precision). Moreover, in some cases it can slightly improve upon UNION, because it identifies the correct recommended items more clearly by focusing on trends and isolating noisy users (e.g., outliers). In terms of execution time, due to the use of smaller matrices, execution times are substantially reduced.
• In our experiments we have seen that only 10% of the original matrix is adequate to provide accurate results.
• The proposed HPR criterion, in combination with SVD, can compete with the existing MF criterion, which has been used by the majority of related work.
• The process of folding-in is omitted in the item-based approach; thus, item-based algorithms are faster than user-based ones. For this reason, they may be more appropriate for real-world applications, despite their worse accuracy, since execution time is a sine qua non in realistic deployments.

Summarizing the aforementioned conclusions, we see that, on the one hand, item-based algorithms are more appropriate for off-line computation and, thus, better in on-line response. On the other hand, user-based algorithms can produce more effective recommendations with a small fraction of the initial information and an efficient response time. With the proposed approaches, we combine effectiveness and efficiency, in contrast to the classic item-based and user-based CF algorithms, which achieve each goal individually.
In future work, we will consider unifying the best characteristics of these two cases into an integrated approach.

References

Berry, M., Dumais, S., & O'Brien, G. (1994). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573–595.
Breese, J., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the conference on uncertainty in artificial intelligence (pp. 43–52).
Deshpande, M., & Karypis, G. (2004). Item-based top-N recommendation algorithms. ACM Transactions on Information Systems, 22(1), 143–177.
Furnas, G., Deerwester, S., Dumais, S., et al. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the ACM SIGIR conference (pp. 465–480).
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–70.
Goldberg, K., Roeder, T., Gupta, D., & Perkins, C. (2001). Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2), 133–151.
Herlocker, J., Konstan, J., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. In Proceedings of the ACM SIGIR conference (pp. 230–237).
Herlocker, J., Konstan, J., & Riedl, J. (2002). An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Information Retrieval, 5(4), 287–310.
Herlocker, J., Konstan, J., Terveen, L., & Riedl, J. (2004). Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems, 22(1), 5–53.
Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM Transactions on Information Systems, 22(1), 89–115.
Huang, Z., Chen, H., & Zeng, D. (2004). Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering. ACM Transactions on Information Systems, 22(1), 116–142.
Karypis, G. (2001). Evaluation of item-based top-N recommendation algorithms. In Proceedings of the ACM CIKM conference (pp. 247–254).
McJones, P., & DeTreville, J. (1997). Each to Each programmer's reference manual. Technical Note 1997-023, Systems Research Center.
McLaughlin, M., & Herlocker, J. (2004). A collaborative filtering algorithm and evaluation metric that accurately model the user experience. In Proceedings of the ACM SIGIR conference (pp. 329–336).
Mobasher, B., Dai, H., Luo, T., & Nakagawa, M. (2001). Improving the effectiveness of collaborative filtering on anonymous web usage data. In Proceedings of the workshop on intelligent techniques for web personalization (pp. 53–60).
O'Mahony, M., Hurley, N., Kushmerick, N., & Silvestre, G. (2004). Collaborative recommendation: A robustness analysis. ACM Transactions on Internet Technology, 4(4), 344–377.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the conference on computer supported cooperative work (pp. 175–186).
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000). Analysis of recommendation algorithms for e-commerce. In Proceedings of the ACM electronic commerce conference (pp. 158–167).
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2000). Application of dimensionality reduction in recommender system – a case study. In Proceedings of the ACM WebKDD workshop.
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. In Proceedings of the WWW conference (pp. 285–295).
Sarwar, B., Karypis, G., Konstan, J., & Riedl, J. (2002). Incremental singular value decomposition algorithms for highly scalable recommender systems. In Proceedings of the international conference on computer and information technology.
Xue, G., Lin, C., Yang, Q., et al. (2005). Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the ACM SIGIR conference (pp. 114–121).