Clustering people according to their preference criteria

Jorge Díez*, Juan José del Coz, Oscar Luaces, and Antonio Bahamonde
Centro de Inteligencia Artificial, Universidad de Oviedo at Gijón, Campus de Viesques, E-33271 Gijón (Asturias), Spain
{jdiez, juanjo, oluaces, antonio}@aic.uniovi.es

Abstract. Learning preferences is a useful task in application fields such as collaborative filtering, information retrieval, adaptive assistants or the analysis of sensory data provided by panels. SVMs, using preference judgments, can induce ranking functions that map objects into real numbers, in such a way that more preferable objects achieve higher values. In this paper we present a new algorithm to build clusters of people with closely related tastes, and hence people whose preference judgment sets can be merged in order to learn more reliable ranking functions. In some application fields, these clusters can be seen as market segments that demand different kinds of products. The proposed method starts by representing people's preferences in a metric space, where it is possible to define a kernel-based similarity function; finally, a clustering algorithm discovers significant groups with homogeneous tastes. The key point of our proposal is to use the ranking functions induced from the preference judgments of each person; we will show that those functions codify the criteria used by each person to decide her preferences. To illustrate the performance of our approach, we present two experimental cases. The first one deals with the collaborative filtering database EachMovie. The second dataset describes a real case of consumers of beef meat.

Keywords: learning preferences, clustering, adaptive assistants, analysis of sensory data, market segmentation.

* Corresponding author. Fax: +34 985 18 21 25. Phone: +34 985 18 25 88. E-mail address: jdiez@aic.uniovi.es

1. Introduction

Supervised inductive learning deals with sets of training examples; these represent pairs formed by an input and the attached output of a function that has to be found in a given family of hypotheses. The input is described by a set of attribute values, while the output is in fact another attribute of the examples, called the class; its type determines the approach and even the name of the learning task. Regression is used when the class is a continuous value, and categorical classification is employed when the class or output of the training examples is one of a finite set of symbolic categories.

In this paper we tackle a slightly different problem: learning people's preferences for consumable products, for system configurations, or for responses to information requests. Here the training material can be expressed as in regression problems: the description of each object is then followed by a number that assesses the degree of satisfaction. Alternatively, training examples can be represented by preference judgments: pairs of vectors (x^{(1)}, x^{(2)}) where someone expresses the fact that he or she prefers x^{(1)} to x^{(2)}. In other words, training sets are samples of binary relations between objects. As pointed out in (Cohen et al., 1999; Dumais et al., 2003), obtaining preference information may be easier and more natural than obtaining the labels needed for a classification or regression approach.
Moreover, this type of information is more accurate, since people tend to rate their preferences in a relative way, comparing each object with the others presented in the same batch. There is a kind of batch effect that often biases the ratings: an object presented in a batch surrounded by worse objects will probably obtain a higher rating than if it were presented together with better objects.

There are algorithms in the literature able to learn preferences. Sometimes they aim to classify pairs of objects (x^{(1)}, x^{(2)}), deciding whether x^{(1)} is preferable to x^{(2)} or not, as in (Branting and Broos, 1997; Cohen et al., 1999). Another approach consists in learning a real preference or ranking function f defined on the space of objects considered, in such a way that f(x^{(1)}) > f(x^{(2)}) whenever x^{(1)} is preferable to x^{(2)}. This functional approach can start from a set of objects endowed with a (usually ordinal) rating, as in regression (Herbrich et al., 2000; Crammer and Singer, 2001; Shashua and Levin, 2002), or can stem from sets of preference judgments, as in (Tesauro, 1989; Utgoff and Clouse, 1991; Freund et al., 1998; Fiechter and Rogers, 2000; Joachims, 2002; Bahamonde et al., 2004).

In this paper we present a new algorithm that, given a family of preference judgment sets from a number of people, discovers groups with homogeneous tastes. The usefulness of this task is twofold. First, we can merge the sets of preference judgments of members of the same group, thus attaining more useful and reliable knowledge about people's preferences; this is usually a critical point, since the available data from individuals is frequently quite limited. Second, the discovery of clusters is important by itself. This is the case when we deal with data collected from panels of consumers; the groups found may then constitute significant market segments that demand different kinds of products. At the end of the paper, we present an application to the food industry: using the data from a panel of beef meat consumers, we discuss the implications of the results returned by the algorithm for meat industries and breeders. A shorter version of this case study was presented in (Díez et al., 2005).

To cluster anything it is necessary to define a similarity measure. The core idea of our proposal is that similarity can be defined between the ranking functions learned from the preference judgment set of each person; these functions codify the criteria used to decide the preferences. Since those ranking functions are induced by a classification Support Vector Machine (SVM) (Vapnik, 1998), the similarity measure is defined by means of a kernel method.

In the next section, we review the literature related to the usefulness of clusters of preference criteria in different areas of application. Then, we discuss the details of our proposal. Finally, the paper concludes with a report of the experiments conducted to evaluate the proposed algorithm. For this purpose we use EachMovie (McJones, 1997), a publicly available collaborative filtering database of movie ratings, and we then review the results obtained with the dataset of the panel of beef meat consumers.

2. How clusters can be useful

The learning tasks involved in recommender systems (Resnick and Varian, 1997) can be considered as special cases of ordinal regression. Here users rate one kind of object and receive recommendations about objects that they are likely to prefer.
Such advice can be elaborated according to the relationship between the properties of the objects and the user's past ratings; this is the content-based model (Basu et al., 1998; Pazzani, 1999). Alternatively, in the model called collaborative or social filtering, the recommendations are induced from the ratings of the user and of other users, formulating the problem as a learning task (Goldberg et al., 1992; Resnick et al., 1994; Shardanand and Maes, 1995). Within this context, as pointed out in (Hofmann and Puzicha, 1999), there is another fundamental problem in addition to the prediction of ratings: the discovery of meaningful groups or clusters of persons and objects able to explain the observed preferences by some smaller number of typical preference patterns. This point of view gives rise to the latent class model (Cheung et al., 2000). See also (Ungar and Foster, 1998).

There are other application fields where clusters and preferences appear together as a desirable mixture. For instance, in (Joachims, 2002), Joachims presents an information retrieval system equipped with a ranking function learned from click-through data collected from user interaction with a www search engine. To improve his proposal, the author acknowledges the need to obtain feasible training data. This raises the question of the convenience of developing clustering algorithms to find homogeneous groups of users. An Adaptive Route Advisor is described in (Fiechter and Rogers, 2000); the system is able to recommend a route to lead users through a digitalized road map taking into account their preferences. An interesting extension discussed in the paper is to modify route recommendations depending on the time of the day or the purpose of the trip. This approach suggests the inclusion of an algorithm that clusters user preferences into contexts.

3. Sensory data analysis

The quality or acceptability of market products can be measured in a number of different dimensions. Sensory analysis is concerned with those aspects that are principally appreciated through sensory impressions. It is typically used by food industries and breeders to improve some production decisions. An excellent survey of the use of this type of data in the food industry can be found in (Murray et al., 2001; Buck et al., 2001); for a Machine Learning perspective, see (Corney, 2002) and (Goyache et al., 2001; Del Coz et al., 2004; Luaces et al., 2004; Díez et al., 2005).

From a conceptual point of view, sensory data include the assessments of products provided by two different kinds of panels. The first one is made up of a small group of expert, trained judges; these describe each product by attribute-value pairs. Expert panelists are thus required to have enough sensory accuracy to discriminate between different and similar products; note that experts are not necessarily asked to rate the overall quality of products. This panel plays the role of a bundle of sophisticated sensors, so experts' assessments are used to describe the products in addition to other measures usually obtained by chemical, biological or physical devices. The second kind of panel is made up of untrained consumers; these are asked to rate their degree of acceptance of the tested products on a scale. The aim of the analysis of these data is to be able to relate product descriptions with consumer preferences. Therefore, sensory studies of markets will start out from tables such as Table 1. Each row represents a product rated by a consumer in a given tasting session.
The concept of session is important, since we can interpret consumer ratings relative to each session (Joachims, 2002; Bahamonde et al., 2004; Díez et al., 2004). In this way, we do not need to assume that a rating of "7" means the same thing to every consumer and in every session (Cohen et al., 1999). Additionally, when dealing with food products, we must realize that the size of the sample prevents panelists from testing all products. Hence, we cannot ask our panelists to spend long periods rating the whole set of food samples. Typically, each consumer only participates in one or a small number of tasting sessions, usually on the same day. Notice that tasting a large sample of food may be physically impossible, or the number of tests performed would damage the sensory capacity of consumers.

On the other hand, expert descriptions are ratings, on an ordinal scale, of different aspects of products related to their taste, odor, color, etc. Here we must assume that a rating of "7" (in, say, texture) means the same for a given expert in every product, though not necessarily for every expert. Sometimes, to describe a kind of food product, it is possible to use categorical attributes instead of numerical ones. In these cases, it is necessary to use an adequate kernel in the algorithms that we explain in the next section.

From a practical point of view, the market is not interested in the tastes of individual consumers; the purpose of marketing studies of sensory data is to discover, if they exist, widespread ways to appreciate food products that can be considered as market segments. These segments can be seen as clusters of consumers with similar tastes. In this paper, we show that the similarity of the preference criteria of consumers can be computed in a high-dimensional space by means of a kernel-based method.

4. Clustering people according to their preferences

In this section we discuss the available options to build a method for clustering people using their preference criteria as the measure of similarity. For ease of reference, let S be a set of n objects from an input space X that will be rated by m people. Therefore, given a space of ordinal values, Scale, we have a family

    r_i: S_i \rightarrow Scale, \quad i = 1, \dots, m    (1)

of rankings, one for each person, that in general are partially defined with respect to S, since usually not everybody rates everything; that is, S_i \subseteq S.

To measure the similarity between the preferences of two people p_i and p_j, a first attempt consists of comparing the vectors (r_i(x): x \in S_i) and (r_j(x): x \in S_j) of their ratings. To do this, we must realize that S_i and S_j can have few common objects; in fact, in sensory data it is frequent that S_i \cap S_j = \emptyset. Different tools have been employed to make comparisons when using these rating vectors for prediction tasks in collaborative filtering (see (Breese et al., 1998)); the most obvious and unsuccessful being Euclidean distance, used in nearest neighbor algorithms. Pearson's correlation or the cosine of the vectors has been put forward to account for the possible differences in the scales used by different people.

However, these comparison techniques, devised for prediction purposes, are not easily extendable to clustering. To illustrate this point, let us consider one person p_i with a coherent preference criterion, and let us divide p_i's rating vector in two parts:

    (r_i(x): x \in S_{i1}), \quad (r_i(x): x \in S_{i2}), \quad \text{with } S_{i1} \cap S_{i2} = \emptyset \text{ and } S_{i1} \cup S_{i2} = S_i.    (2)
These two vectors would not have anything in common for any reasonable comparison measure. However, both vectors represent the same rating criterion, the criterion of p_i; so p_i, whether rating S_{i1} or S_{i2}, must be included in the same cluster. It is not obvious, though, how to represent what we have loosely called the rating criterion of a person. The proposal presented in this paper is to represent preference criteria explicitly by means of a function able to generalize the preferences provided by the ratings. The following sections show how these functions can be defined and learned; then we describe the clustering method based on the similarity of these functions. But before going on, it is necessary to reconsider the kind of data that we will use as training material to learn people's preferences.

If we have a vectorial way to describe the objects in S and a rating vector (r(x): x \in S), we can try to use regression to induce a function that maps object descriptions into ratings. However, this is not a faithful way of capturing people's preferences; see (Herbrich et al., 2000) for a discussion of the differences between ordinal and continuous (metric or least squares) regression. Additionally, when consumer panels are involved, an important reason to discard any kind of regression is that ratings are relative orderings instead of absolute values, due to the so-called batch effect (Joachims, 2002; Díez et al., 2004). Therefore, whenever it is possible to consider explicitly the tasting session where the ratings were given, we will take it into account. Thus, a more general setting will be used. Following (Herbrich et al., 2000; Joachims, 2002), in order to capture the ordinal nature of ratings and the relative character of people's preferences, for each consumer p_i we will separate the ratings given at each tasting session. Then we will create a preference judgment set

    PJ_i = \{ (x^{(1)}, x^{(2)}) : x^{(1)}, x^{(2)} \in S_i,\; r_i(x^{(1)}) > r_i(x^{(2)}),\; session(p_i, x^{(1)}) = session(p_i, x^{(2)}) \}    (3)

suitable for our needs, just considering all pairs (x^{(1)}, x^{(2)}) of objects in S_i such that they were presented in the same session to consumer p_i and the rating of the first object was higher than the rating of the second.

4.1 Class separation and preferences

Let us first consider a simplified case in this subsection before presenting the general learning framework. So, let us assume that the input space X is in fact a subset of real vectors; in symbols, X \subset R^d. In order to learn our preference problems, we will try to find a real ranking (or preference) function f: R^d \rightarrow R that maximizes the probability of having f(x^{(1)}) > f(x^{(2)}) whenever x^{(1)} is preferable to x^{(2)}. If we now restrict the space of hypotheses to linear functions, each preference judgment set PJ_i gives rise to a set of constraints

    (x^{(1)}, x^{(2)}) \in PJ_i \;\Rightarrow\; f_i(x^{(1)}) > f_i(x^{(2)}) \;\Leftrightarrow\; f_i(x^{(1)} - x^{(2)}) > 0 \;\Leftrightarrow\; f_i(x^{(2)} - x^{(1)}) < 0.    (4)

Therefore, ranking functions can be learned using a binary classification algorithm able to discriminate the class of objects according to the sign returned, as happens with Support Vector Machines (SVM). Notice that the training set is

    T_i = \{ (x^{(1)} - x^{(2)}; +1),\; (x^{(2)} - x^{(1)}; -1) : (x^{(1)}, x^{(2)}) \in PJ_i \}.    (5)

The function learned passes through the origin of coordinates, and it is then defined by

    f_i(x) = \langle w_i, x \rangle = \sum_{k=1}^{d} w_{ik} x_k    (6)

where w_i is the weight vector of the consumer p_i; an illustrative sketch of this construction is given below.
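The following is a minimal sketch (in Python; it is our illustration, not code from the paper, and the helper names build_preference_judgments and learn_ranking_function, as well as the use of scikit-learn's LinearSVC, are assumptions) of how equations (3), (5) and (6) could be put into practice: preference judgments are collected per session, turned into labelled difference vectors, and a linear SVM constrained to pass through the origin yields the weight vector w_i.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC  # any linear SVM implementation would do


def build_preference_judgments(ratings, sessions):
    """Equation (3): pairs (x1, x2) rated in the same session with rating(x1) > rating(x2).

    ratings:  dict mapping object id -> rating given by one consumer
    sessions: dict mapping object id -> tasting session identifier
    """
    pj = []
    for a, b in combinations(ratings, 2):
        if sessions[a] != sessions[b] or ratings[a] == ratings[b]:
            continue
        pj.append((a, b) if ratings[a] > ratings[b] else (b, a))
    return pj


def learn_ranking_function(pj, descriptions):
    """Equations (5)-(6): train a linear SVM on difference vectors and return w_i.

    descriptions: dict mapping object id -> feature vector (numpy array)
    """
    X, y = [], []
    for preferred, other in pj:
        diff = descriptions[preferred] - descriptions[other]
        X.append(diff)
        y.append(+1)    # x(1) - x(2) labelled +1
        X.append(-diff)
        y.append(-1)    # x(2) - x(1) labelled -1
    clf = LinearSVC(fit_intercept=False, C=1.0)  # hyperplane through the origin
    clf.fit(np.array(X), np.array(y))
    return clf.coef_.ravel()  # w_i such that f_i(x) = <w_i, x>
```

With the returned weight vector w, a new object x is assessed as np.dot(w, x), and x^{(1)} is predicted to be preferable to x^{(2)} whenever np.dot(w, x1) > np.dot(w, x2).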
The value returned by f_i on an object representation x can be thought of as the assessment of x, in the sense that f_i(x) will be used to predict preferences between x and other products. On the other hand, the weight vector w_i is the director vector of a hyperplane, \langle w_i, x \rangle = 0, called the assessment hyperplane. From a geometrical point of view, the distance (in the direction of w_i) from the hyperplane to each object is proportional to the value returned by f_i. In fact (see Figure 1),

    distance(\langle w_i, x \rangle = 0;\; x') = \frac{\langle w_i, x' \rangle}{\| w_i \|} = \frac{f_i(x')}{\| w_i \|}.    (7)

Of course, it can be difficult to find a linear function able to separate the positive from the negative differences of the training set T_i. In terms of preferences, the assessment of an object will not always be proportional to the components of a vectorial description. It is not true that the more (or the less) the better; for instance, the amount of sugar or salt in our favorite foods usually has an optimal point, and increases or decreases from that equilibrium point are frequently rejected. Additionally, there may be situations where the objects considered do not have a straightforward description by means of real vectors. To take all these issues into account, it is possible to use a kernel trick (Herbrich et al., 2000).

To present this general framework, let us assume that the space X of objects can be mapped to a Hilbert space F using \phi: X \rightarrow F. Then, as we have just seen, to learn a ranking or preference function we only have to find a linear separation in F for the training set

    T_i^F = \{ (\phi(x^{(1)}) - \phi(x^{(2)}); +1),\; (\phi(x^{(2)}) - \phi(x^{(1)}); -1) : (x^{(1)}, x^{(2)}) \in PJ_i \}.    (8)

To avoid handling the components of product descriptions in F, the kernel trick helps us to redefine T_i as

    T_i = \{ (x^{(1)}, x^{(2)}; +1),\; (x^{(2)}, x^{(1)}; -1) : (x^{(1)}, x^{(2)}) \in PJ_i \}.    (9)

Therefore, the input space that we need is in fact the product of X by itself, and it is mapped into the feature space F using

    \Psi: X \times X \rightarrow F, \quad \Psi(x^{(1)}, x^{(2)}) = \phi(x^{(1)}) - \phi(x^{(2)}).    (10)

Hence, the kernel associated with this transformation is

    \mathbf{K}((x^{(1)}, x^{(2)}), (x^{(3)}, x^{(4)})) = \langle \Psi(x^{(1)}, x^{(2)}), \Psi(x^{(3)}, x^{(4)}) \rangle = \langle \phi(x^{(1)}) - \phi(x^{(2)}), \phi(x^{(3)}) - \phi(x^{(4)}) \rangle
        = K(x^{(1)}, x^{(3)}) - K(x^{(1)}, x^{(4)}) - K(x^{(2)}, x^{(3)}) + K(x^{(2)}, x^{(4)})    (11)

where K(\cdot, \cdot) is the kernel associated with the transformation \phi of the individual objects. The kernel \mathbf{K} so built is called the Herbrich kernel (Herbrich et al., 2000) attached to K.

The separation function induced by a classification SVM from T_i with kernel \mathbf{K} will be a function F_i: X \times X \rightarrow R of the form

    F_i(x^{(1)}, x^{(2)}) = \sum_{s \in SV_i} \alpha_s y_s \langle \phi(x_s^{(1)}) - \phi(x_s^{(2)}), \phi(x^{(1)}) - \phi(x^{(2)}) \rangle = \sum_{s \in SV_i} \alpha_s y_s \mathbf{K}((x_s^{(1)}, x_s^{(2)}), (x^{(1)}, x^{(2)}))    (12)

where SV_i is a set of indexes for the support vectors, and y_s is the class (+1 or -1) of the pair (x_s^{(1)}, x_s^{(2)}) of T_i. This function has an important property that is an immediate consequence of the kernel definition:

    F_i(x^{(1)}, 0) > F_i(x^{(2)}, 0) \;\Leftrightarrow\; F_i(x^{(1)}, x^{(2)}) > 0.    (13)

Thus, we define the following assessment function f_i: X \rightarrow R,

    f_i(x) = F_i(x, 0).    (14)

It is trivial to see that f_i is as coherent with the preference judgments of PJ_i as F_i is a separating function for T_i. Moreover, when K is the linear kernel, f_i coincides with the function defined in (6). The expression of f_i can be simplified, given that it is possible to skip a constant term from F_i(x, 0); in the linear case this constant is 0.
Therefore, in practice, the general assessment function f_i is given by

    f_i(x) = \sum_{s \in SV_i} \alpha_s y_s \langle \phi(x_s^{(1)}) - \phi(x_s^{(2)}), \phi(x) \rangle = \sum_{s \in SV_i} \alpha_s y_s \left( K(x_s^{(1)}, x) - K(x_s^{(2)}, x) \right).    (15)

4.2 Defining a similarity measure of people's preferences

Let us start with a motivating example, assuming that there are three people who have been asked about their preferences for a kind of product. To simplify the presentation, let us suppose that the products studied can be represented by a single number; think, for instance, of the proportion of milk to coffee in your café au lait. If the first person (p_1) prefers coffees with a proportion of 0.20 of milk, her preference function could be a parabola with its vertex at that point. Given that ranking or preference functions cross the origin of coordinates, the preference function of p_1 has the form

    f_1(x) = a_1 (-x^2 + 2 \cdot 0.20 \cdot x) = \langle a_1 (-1, 0.40), (x^2, x) \rangle    (16)

where a_1 is a positive constant. If the other two persons prefer the combinations around 0.21 and 0.40 respectively, the director vectors of the preference functions of each person are the following:

    w_1 = a_1 (-1, 0.40); \quad w_2 = a_2 (-1, 0.42); \quad w_3 = a_3 (-1, 0.80).    (17)

Notice that the values of the positive constants a_i are not relevant if we want to use the functions f_i to predict the preference of p_i between a pair of possible combinations. What is really important is the relative weight of the components of w_i. Thus, a way to measure the similarity of the tastes of these people is by the cosine of the director vectors of their preference functions:

    similarity(p_i, p_j) = \cos(w_i, w_j) = \frac{\langle w_i, w_j \rangle}{\| w_i \| \cdot \| w_j \|}.    (18)

Using this definition, the similarity of p_1, p_2, p_3 is

    similarity(p_1, p_2) = 0.9999; \quad similarity(p_1, p_3) = 0.9570; \quad similarity(p_2, p_3) = 0.9618.    (19)

In other words, as one could hope, using this approach the tastes of p_1 and p_2 are closer to each other than to the tastes of p_3. Notice that we have summarized the tastes of people in their preference functions instead of somehow using the set of ratings assigned to a collection of products. The products tasted were probably different for each person, which would make it difficult to define any reliable similarity measure based on these ratings.

Returning to the general setting, let ((PJ_i, w_i): i = 1, \dots, m) be a collection of m preference judgment sets PJ_i endowed with the director vectors w_i of their respective ranking functions. We define the similarity of their preference criteria by the cosine of equation (18). To make this definition operative, we need to rearrange equation (15) to make visible the director vector w_i attached to a set of preference judgments PJ_i. In fact, the assessment function f_i can be represented by a weight vector w_i in the higher dimensional space of features such that

    f_i(x) = \langle w_i, \phi(x) \rangle,    (20)

where

    w_i = \sum_{s \in SV_i} \alpha_s y_s \left( \phi(x_s^{(1)}) - \phi(x_s^{(2)}) \right).    (21)

Now, given that the definition of similarity uses scalar products instead of coordinates of weight vectors, we can easily rewrite (18) in terms of the kernels used in the derivations explained in the previous subsection. The essential equality is:

    \langle w_i, w_j \rangle = \sum_{s \in SV_i} \sum_{l \in SV_j} \alpha_s y_s \alpha_l y_l \langle \phi(x_s^{(1)}) - \phi(x_s^{(2)}), \phi(x_l^{(1)}) - \phi(x_l^{(2)}) \rangle = \sum_{s \in SV_i} \sum_{l \in SV_j} \alpha_s y_s \alpha_l y_l \mathbf{K}((x_s^{(1)}, x_s^{(2)}), (x_l^{(1)}, x_l^{(2)})).    (22)

4.3 The clustering method

Once we have defined a reasonable similarity measure for preference criteria, we proceed to look for clusters of consumers with homogeneous tastes. In principle, we could use any available clustering algorithm working on the pairwise similarities of equation (18); a sketch of how these similarities can be computed from the kernel expansion (22) is shown below.
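The following sketch (Python; our illustration, not code from the paper, and the helper names herbrich_kernel, dot_w and preference_similarity_matrix are assumptions) shows how the cosine of equations (18) and (22) could be computed directly from the dual coefficients and support pairs of each consumer's SVM, using the Herbrich kernel of equation (11) and never materializing the feature space F.

```python
import numpy as np


def herbrich_kernel(k, a1, a2, b1, b2):
    """Equation (11): pair kernel built from an object kernel k(x, y)."""
    return k(a1, b1) - k(a1, b2) - k(a2, b1) + k(a2, b2)


def dot_w(k, model_i, model_j):
    """Equation (22): <w_i, w_j> computed from support pairs and dual coefficients.

    A 'model' is assumed to be a dict with keys:
      'coef'  -> list with alpha_s * y_s for each support pair
      'pairs' -> list of support preference pairs (x_s1, x_s2)
    """
    total = 0.0
    for c_s, (a1, a2) in zip(model_i['coef'], model_i['pairs']):
        for c_l, (b1, b2) in zip(model_j['coef'], model_j['pairs']):
            total += c_s * c_l * herbrich_kernel(k, a1, a2, b1, b2)
    return total


def preference_similarity_matrix(k, models):
    """Equation (18): cosine between the (implicit) weight vectors of all consumers."""
    m = len(models)
    dots = np.array([[dot_w(k, models[i], models[j]) for j in range(m)] for i in range(m)])
    norms = np.sqrt(np.diag(dots))
    return dots / np.outer(norms, norms)


# One possible object kernel: a polynomial kernel of degree 2,
# the degree used in the experiments reported later in the paper.
def poly2(x, y):
    return (1.0 + np.dot(x, y)) ** 2
```

The matrix returned by preference_similarity_matrix plays the role of the proximity matrix that feeds the clustering step described next.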
However, we avoided those methods, like k-means, that require frequent recomputations of the centroids of each cluster: updating equation (22) for the centroids would be computationally expensive. Additionally, we need a mechanism able to estimate an appropriate number of clusters directly from the data, without any explicit manual intervention. Hence, we applied a nonparametric pairwise algorithm (Dubnov et al., 2002), although this is not the only possibility. The following paragraphs sketch a description of this algorithm as we used it in the experimental results reported in the last section.

Let M = (M_{ij}) be a square matrix where M_{ij} stands for the similarity between data points i and j; in our case, data points are the vectorial representations of the preference criteria of consumers, and similarities are given by equation (18). In the following, M will be called the proximity matrix. Matrix M is transformed iteratively, following a two-step procedure that converges to a two-valued matrix (with entries 1 and 0), yielding a bipartition of the data set into two clusters. Then, recursively, the partition mechanism is applied to each of the resulting clusters, represented by their corresponding submatrices. To guarantee that only meaningful splits take place, the authors of (Dubnov et al., 2002) provide a cross-validation method that measures an index that can be read as a significance level; we only accept splits whose level is above 0.90. The basic iterative transformation uses the following formulae to go from iteration t to t+1:

    V_{ij}(t+1) = \frac{M_{ij}(t)}{\max\{ M_{kj}(t) : k \}},

    M_{ij}(t+1) = 1 - \frac{1}{2} \sum_k \left[ V_{ik}(t+1) \log \frac{V_{ik}(t+1)}{\frac{1}{2}\left(V_{ik}(t+1) + V_{jk}(t+1)\right)} + V_{jk}(t+1) \log \frac{V_{jk}(t+1)}{\frac{1}{2}\left(V_{ik}(t+1) + V_{jk}(t+1)\right)} \right].    (23)

The first step normalizes the columns of the proximity matrix using the L∞ norm; then the proximities are re-estimated using the Jensen-Shannon divergence. The idea is to formalize that two preference criteria are close (after these two steps) if they were both similar and dissimilar to analogous sets of criteria before the transformation.

5. Experimental results

To illustrate the performance of our proposals, we used two datasets. The first one is EachMovie, a dataset widely used in the field of collaborative filtering. The second dataset deals with the ratings of beef meat consumers, and was previously used in (Díez et al., 2005); it was built in order to study beef market preferences.

5.1 Clustering spectators according to their tastes in movies

The database EachMovie contains ratings of 1,628 movies provided by 72,916 people. The scale used has 6 values, {0, 0.2, 0.4, 0.6, 0.8, 1}, but less than 2.4% of the possible entries are filled; thus, for instance, 11,651 people have not given any rating to any movie. To take into account explicitly the absence of ratings, which is so frequent in this collection, and as is usual when dealing with EachMovie, we moved the scale of ratings to {-1.5, -1, -0.5, 0.5, 1, 1.5}, treating zero as the missing value.

As many authors do, in order to avoid the extreme sparsity of the data, we considered a subset of EachMovie. We only considered movies with at least 1,000 ratings; there are 504 movies in such conditions. We likewise only took into consideration people who had submitted at least 275 ratings for these movies; this makes a total of 189 people. To set up our clustering experiment, we considered 100 people as our sample to study possible similarities of preference criteria; i.e.
we set m = 100 (see equation (1)), and the remaining available ratings were used to define the input space X of the objects (movies) considered in our experiment. In other words, we represented each movie by a vector of 89 (= 189 - 100) ratings. The distributions of ratings in the description of the movies, that is, in the attribute space, and those provided by our 100 spectators are quite similar; see Table 2. The main novelty in this setting with respect to typical collaborative filtering applications is that we use some consumers to describe the products. Notice that we only use "expert" consumers, those who evaluated a lot of movies, so in fact they form a kind of expert panel.

For each spectator p_i from the set of 100 people considered, we have a rating vector (r_i(x): x \in S_i), and we have to assemble a set of preference judgments PJ_i from it. Notice that |PJ_i| is of the order of O(|S_i|^2). Here |S_i| ≤ 504, so to avoid excessively large PJ_i datasets, we proceeded as follows. For each movie x^{(1)} \in S_i ranked by p_i, 10 more movies ranked by p_i with different ratings were randomly selected; then, if x^{(2)} is one of such movies, we include (x^{(1)}, x^{(2)}) in PJ_i if and only if r_i(x^{(1)}) > r_i(x^{(2)}); otherwise, we include (x^{(2)}, x^{(1)}) in PJ_i. Notice that in EachMovie there is no indication about rating sessions; therefore we assumed only one session, and hence all ratings are comparable.

Finally, we used two different kernels to assess their influence on the clusters obtained by applying the method described in this paper. In Figure 2 we depict the trees of clusters obtained with the linear kernel and with the polynomial kernel of degree 2. The overlap between the clusters obtained by these two kernels is reported in Table 3.

The set of all preference judgments, PJ = \cup (PJ_i: i = 1, \dots, m), has a lot of disagreements. If, for each pair of movies, we choose the most frequent relative ordering in PJ, then the percentage of pairs disagreeing with that majority opinion is 25.92% (see Table 4). Notice that this percentage is the minimum training error for any algorithm trying to learn the preferences of the whole group of 100 spectators. The clustering algorithm, with both kernels, discovered groups of almost 50 people whose disagreements are lower than the global percentage of disagreements: 17.40% for the linear kernel (li_r) and 16.71% for the polynomial one (po_r). These clusters are largely the same, since 42 spectators are present in these majority groups for both kernels (see Table 3). However, the other two smaller groups have internal disagreements similar to that of the whole group for both kernels. The reason for this behavior may be that these groups are made up of very peculiar spectators whose preference functions are difficult to learn. The preference functions learned from the union of the preference judgments of the members of each cluster have a misclassification estimate similar to the percentage of disagreements, especially in the case of the polynomial kernel; see Table 4.

5.2 Market segments in beef meat

To illustrate our method with a real world application, we used a database described in (Sañudo et al., 2004; Díez et al., 2005). The data collect the sensory ratings of a panel of beef meat consumers on two aspects: tenderness and acceptability. We first describe the dataset and then conclude with a discussion of the implications of the results obtained by our method.
5.2.1 Beef meat dataset

For this experiment, 101 animals of 7 Spanish breeds were slaughtered to obtain two kinds of carcasses: light, from animals with a live weight of around 300-350 kg, and heavy, from animals at 530-560 kg. The set of animals was uniformly distributed by breeds and weights. Additionally, to test the influence of aging on consumers' appreciation, each piece of meat was prepared with 3 aging periods: 1, 7, and 21 days. Notice that these settings give rise to 303 samples of meat (101 animals with 3 different aging periods).

On the other hand, the 7 breeds used constitute a wide representation of beef cattle. These breeds can be divided into four types: double muscled (DM, one breed), fast growth (FG, two breeds), dual purpose (DP, one breed), and unimproved rustic type (UR, three breeds). In Table 5, we show the average percentages of fat, muscle and bone for each breed. Notice that it is important to distinguish three kinds of fat: inter-muscular, subcutaneous, and intramuscular, the latter being included in the percentage of muscle. In the experimental results reported below, we used the carcass composition of each animal; the breed was only used to provide an explanation of the tastes of the clusters. Additionally, each kind of meat was also described by a panel of 11 trained experts who rated 12 traits of the products, such as fibrosis, flavor, odor, etc.; we considered the average rating of each trait. The characterization of the meat samples was completed with 6 physical features describing their texture.

The panel was organized in a set of tasting sessions, where a group of potential consumers assessed 3 or 7 instances of the available beef kinds. Frequently, each consumer only participated in a small number (sometimes only one) of tasting sessions, usually on the same day. To apply the clustering method described in this paper, we first selected those people involved in our consumers' panel whose ratings gave rise to more than 30 preference judgments; this yielded a set of 171 panelists (m = 171) who tested from 9 to 14 samples of meat. Notice that this is just a subset of all available panelists, since samples given the same rating do not produce preference judgment pairs.

This individual treatment contrasts with the global one that we reported in (Luaces et al., 2004; Del Coz et al., 2004; Díez et al., 2004). In those prior works, we were interested in modeling the general opinion of consumers; so, for each session, to summarize the opinions of consumers, we computed the mean of the ratings obtained by each piece of meat. Here, we will also use the polynomial kernel of degree 2 reported in those papers, since the nature of the learning problems is essentially the same. Since there are 303 samples of meat, the opinions of our panelists can only be estimated by inducing a preference or ranking function (see Section 4.1) from the PJ_i datasets obtained by applying equation (3). Notice that only such functions can be used in order to compare the preferences of different consumers; in general, two arbitrary consumers have not tested samples of the same animal prepared with the same aging. However, it is possible to compare the preference functions of any pair of consumers as vectors in a high-dimensional space following the kernel-based method of Section 4.2.

The clustering algorithm returns the trees depicted in Figure 3. Split nodes achieved a confidence level of 91% for the tenderness dataset, and 97% for acceptance.
The leaves of these trees reached lower confidence levels, and therefore their splits were rejected. In the scores reported in Table 6, we observe that both in acceptance and tenderness there are percentages of disagreements over 20%, while the clusters found decrease this amount by 5 points. Using the set of all preference judgments of each cluster, the preference functions learned (with a polynomial kernel of degree 2) have a misclassification estimate of around 20%.

5.2.2 Implications of the beef meat results

In general, it is well known that meat qualities are mainly the result of a set of complex factors. In this study, we focus our attention on two of them: breed and aging period. Specifically, we are interested in knowing whether there are different groups of people who prefer some breeds to others, or who prefer long to short aging periods. In fact, these are the most relevant features in the acceptance and tenderness appreciation of beef meat by consumers according to (Sañudo et al., 2004; Luaces et al., 2004).

To gain insight into the meaning of the preference criteria of each cluster, we used the ranking or preference functions to order the samples of meat; then we assigned 10 points to those samples included in the first decile, 9 to the second decile, and so on (a sketch of this scoring procedure is given at the end of this section). Graphical representations of the average points obtained by each breed and each aging period are shown in Figure 4 (acceptance) and Figure 5 (tenderness); notice that the average score of all samples is 5.5. The results are essentially the same if we use quartiles instead of deciles, or any other division of the relative rankings of each cluster.

In the acceptance dataset (Figure 4a), let us emphasize the opposite roles played by the Retinta and Asturiana breeds: each was first in one cluster and last (or almost last) in the other. In (Luaces et al., 2004) we used Boolean attributes to include the breed in the description of each sample, and Retinta and Asturiana were then found to be the most relevant Boolean features for explaining consumers' acceptance of meat. Additionally, these two breeds have significant differences in carcass composition (see Table 5). Notice that Asturiana is the only double-muscled breed in the sample, and thus it has the lowest percentages of subcutaneous fat, inter-muscular fat, and bone, while Retinta is the unimproved rustic breed with the highest percentages of fat and bone. Therefore, there are some reasons to assign opposite ratings to samples of these two breeds, although, in general, the final acceptance scores rely on a complex set of features. On the other hand, analyzing aging periods in both clusters (Figure 4b), we can see that people in the left cluster prefer a 7-day aging period, while in the right cluster longer aging periods have better acceptance.

In the tenderness dataset (Figure 5a), meat from the Pirenaica and Retinta breeds is the tenderest for people in the left cluster, but it is ranked in low positions in the right cluster. We can say exactly the opposite of meat from the Asturiana and Parda breeds. Again, the Asturiana and Retinta breeds play opposite roles in each cluster. In this case, we cannot appreciate differences between the clusters' preferences regarding aging periods (Figure 5b): in both clusters, all breeds obtain higher tenderness scores as aging periods increase. This is not a surprising result; it is well documented that long aging periods give rise to more tender meat (Sañudo et al., 2004).
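As an illustration of the decile-scoring procedure just described, the following sketch (Python; ours, not from the paper, with hypothetical names such as decile_scores) orders the samples with a cluster's ranking function, assigns 10 points to the first decile down to 1 point to the last, and averages the points per breed or aging period.

```python
from collections import defaultdict


def decile_scores(samples, ranking_fn, key):
    """Order the samples of a cluster with its ranking function, assign 10 points to the
    first decile, 9 to the second, ..., 1 to the last, and average the points per group.

    samples:    list of dicts describing the meat samples (each must contain `key`)
    ranking_fn: the preference function f learned for the cluster
    key:        attribute to group by, e.g. 'breed' or 'aging'
    """
    ordered = sorted(samples, key=ranking_fn, reverse=True)  # most preferred first
    n = len(ordered)
    totals, counts = defaultdict(float), defaultdict(int)
    for rank, sample in enumerate(ordered):
        points = 10 - int(10 * rank / n)  # 10 for the first decile, ..., 1 for the last
        totals[sample[key]] += points
        counts[sample[key]] += 1
    return {group: totals[group] / counts[group] for group in totals}
```

For example, decile_scores(samples, f_cluster, 'breed') would yield per-breed averages analogous to those plotted in Figure 4a, with an overall mean of 5.5.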
6. Conclusions

We have presented a new algorithm to build clusters of preference criteria. Starting from a collection of preference judgments of different people, our algorithm discovers groups or clusters of people with closely related tastes; that is to say, people whose preference judgments can be merged in order to learn more reliable ranking functions able to express the preferences of the people involved. These clusters can also be interpreted as market segments. In building these clusters, the key insight is that ranking functions, learned from the set of preference judgments of each person, codify the rationale or criteria for his or her preferences.

We believe that the contributions of this paper fall mainly within the applications of preference learning. Thus, clustering is useful for strengthening applications such as collaborative filtering, information retrieval or adaptive assistants. We illustrated the usefulness of the method by presenting the experimental results achieved with a collaborative filtering dataset (EachMovie) and with a dataset that gathers ratings of beef meat consumers. In the latter case, the clusters represent market segments with differentiated requirements for this kind of food product. The extension of this case to other fields that handle sensory data provided by consumer panels is straightforward.

Acknowledgements

The authors would like to thank Compaq for providing us with the EachMovie database (McJones, 1997), Thorsten Joachims for his SVMlight (Joachims, 1998), and the authors of Spider (Weston et al., 2006), a MATLAB toolbox that includes kernel-based algorithms; these systems were used in the experiments reported in this paper.

References

Bahamonde, A.; Bayón, G. F.; Díez, J.; Quevedo, J. R.; Luaces, O.; del Coz, J. J.; Alonso, J.; Goyache, F. (2004). Feature subset selection for learning preferences: a case study. Proceedings of the 21st International Conference on Machine Learning, ICML 2004, Banff, Canada, pp. 49-56.

Basu, C., Hirsh, H., and Cohen, W.W. (1998). Recommendation as classification: Using social and content-based information in recommendation. In Proc. AAAI. pp. 714-720.

Branting, K.L. and Broos, P.S. (1997). Automated acquisition of user preferences. International Journal of Human-Computer Studies 46:55-77.

Breese, J.S., Heckerman, D., and Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proc. Conference on Uncertainty in Artificial Intelligence. pp. 43-52.

Buck, D., Wakeling, I., Greenhoff, K., and Hasted, A. (2001). Predicting paired preferences from sensory data. Food Quality and Preference 12:481-487.

Cheung, W.K., Kwok, J.T., Law, M.H., and Tsui, K.C. (2000). Mining customer preference ratings for product recommendations using the support vector machine and the latent class model. In Proc. International Conference on Data Mining Methods and Databases for Engineering. pp. 601-610.

Cohen, W.W., Schapire, R.E., and Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research 10:243-270.

Corney, D. (2002). Designing food with Bayesian belief networks. In Proc. International Conference on Adaptive Computing in Engineering Design and Manufacture. pp. 83-94.

Crammer, K. and Singer, Y. (2001). Pranking with ranking. In Proc. Conference on Neural Information Processing Systems. pp. 641-647.

Del Coz, J. J.; Bayón, G. F.; Díez, J.; Luaces, O.; Bahamonde, A.; Sañudo, C. (2004). Trait selection for assessing beef meat quality using non-linear SVM. Proceedings of the 18th Annual Conference on Neural Information Processing Systems (NIPS 2004), Vancouver, British Columbia, Canada, pp. 321-328.
Díez, J.; Bayón, G.F.; Quevedo, J. R.; del Coz, J. J.; Luaces, O.; Alonso, J.; Bahamonde, A. (2004). Discovering relevancies in very difficult regression problems: applications to sensory data analysis. In Proceedings of the European Conference on Artificial Intelligence (ECAI 2004), Valencia, Spain, pp. 993-994.

Díez, J.; del Coz, J. J.; Sañudo, C.; Albertí, P.; Bahamonde, A. (2005). A kernel based method for discovering market segments in beef meat. Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, ECML/PKDD 2005, Porto, Portugal, pp. 462-469.

Dubnov, S.; El-Yaniv, R.; Gdalyahu, Y.; Schneidman, E.; Tishby, N.; Yona, G. (2002). A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning 47:35-61.

Dumais, S., Bharat, K., Joachims, T., and Weigend, A. (2003). Workshop on implicit measures of user interests and preferences. In ACM SIGIR Conference, Toronto, Canada.

Fiechter, C.N. and Rogers, S. (2000). Learning subjective functions with large margins. In Proc. International Conference on Machine Learning. pp. 287-294.

Freund, Y., Iyer, R., Schapire, R.E., and Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Proc. International Conference on Machine Learning. pp. 170-178.

Goldberg, D., Nichols, D., Oki, B.M., and Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM 35(12):61-70.

Goyache, F., Bahamonde, A., Alonso, J., López, S., del Coz, J.J., Quevedo, J.R., Ranilla, J., Luaces, O., Alvarez, I., Royo, L., and Díez, J. (2001). The usefulness of artificial intelligence techniques to assess subjective quality of products in the food industry. Trends in Food Science and Technology 12(10):370-381.

Herbrich, R., Graepel, T., and Obermayer, K. (2000). Large margin rank boundaries for ordinal regression. In A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pp. 115-132. MIT Press, Cambridge, MA.

Hofmann, T. and Puzicha, J. (1999). Latent class models for collaborative filtering. In Proc. International Joint Conference on Artificial Intelligence. pp. 688-693.

Joachims, T. (1998). Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA. Chapter 11.

Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proc. ACM Conference on Knowledge Discovery and Data Mining.

Luaces, O.; Bayón, G.F.; Quevedo, J.R.; Díez, J.; del Coz, J.J.; Bahamonde, A. (2004). Analyzing sensory data using non-linear preference learning with feature subset selection. Proceedings of the 15th European Conference on Machine Learning (ECML 2004), Pisa, Italy, pp. 286-297.

McJones, P. (1997). EachMovie collaborative filtering dataset. DEC (now Compaq) Systems Research Center. http://www.research.compaq.com/SRC/eachmovie/.

Murray, J.M., Delahunty, C.M., and Baxter, I.A. (2001). Descriptive sensory analysis: Past, present and future. Food Research International 34:461-471.

Pazzani, M.J. (1999). A framework for collaborative filtering, content-based and demographic filtering. Artificial Intelligence Review 13(5-6):393-408.

Resnick, P. and Varian, H. (1997). Introduction to special section on recommender systems. Communications of the ACM 40(3):56-58.
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proc. Conference on Computer Supported Cooperative Work. pp. 175-186.

Sañudo, C.; Macie, E.S.; Olleta, J.L.; Villarroel, M.; Panea, B.; Albertí, P. (2004). The effects of slaughter weight, breed type and ageing time on beef meat quality using two different texture devices. Meat Science 66:925-932.

Shardanand, U. and Maes, P. (1995). Social information filtering: Algorithms for automatic "Word of Mouth". In Proc. Conference on Human Factors in Computing Systems. pp. 210-217.

Shashua, A. and Levin, A. (2002). Ranking with large margin principle: Two approaches. In Proc. Neural Information Processing Systems.

Tesauro, G. (1989). Connectionist learning of expert preferences by comparison training. In Proc. Neural Information Processing Systems. pp. 99-106.

Ungar, L.H. and Foster, D.P. (1998). Clustering methods for collaborative filtering. In Proc. AAAI'98 Workshop on Recommendation Systems. pp. 112-125.

Utgoff, P.E. and Clouse, J. (1991). Two kinds of training information for evaluation function learning. In Proc. AAAI'91. pp. 596-600.

Vapnik, V. (1998). Statistical learning theory. John Wiley, New York.

Weston, J.; Elisseeff, A.; BakIr, G.; Sinz, F. (2006). SPIDER: object-oriented machine learning library. http://www.kyb.tuebingen.mpg.de/bs/people/spider/

Table 1. Sensory data collected from panels of experts and consumers. Each product is described by expert assessments (Att_j) in addition to other (O-Att_i) chemical, biological or physical analysis outputs

  Expert sensory appreciations                     Other descriptive attributes   Consumer preferences
  Expert-1             ...   Expert-k
  Att1 ... Attm        ...   Att1 ... Attm         O-Att1 ... O-Attn              Session   Consumer   Rating
  ...                  ...   ...                   ...                            ...       ...        ...

Table 2. Rating distribution in the data sets used for the experiments

  Movies'     In attribute space            In the 100 spectators' ratings
  rating      Number      Percentage        Number      Percentage
  0            4,392       17.02%            6,665       19.32%
  0.2          1,955        7.57%            2,651        7.68%
  0.4          3,813       14.77%            4,801       13.92%
  0.6          6,373       24.69%            8,193       23.75%
  0.8          5,774       22.37%            7,471       21.66%
  1            3,503       15.57%            4,716       13.67%
  Totals      25,810                        34,497

Table 3. Number of common spectators in clusters obtained with the linear kernel and the polynomial kernel of degree 2

                     Polynomial of degree 2
  Linear      po_l_l    po_l_r    po_r     Total
  li_l_l        12        12        3        27
  li_l_r         5        17        2        24
  li_r           7         0       42        49
  Total         24        29       47

Table 4. Number of disagreements (minimum possible training error) and classification error estimations in each cluster, depending on the kernel used

  Kernel        Disagreements   Cluster   Disagreements   Classification errors
  linear           25.92%       li_l_l       27.35%             32.71%
                                li_l_r       25.78%             30.62%
                                li_r         17.40%             19.18%
  polynomial       25.92%       po_l_l       23.80%             24.68%
                                po_l_r       30.36%             30.86%
                                po_r         16.71%             17.00%

Table 5. Carcass compositions of the 7 Spanish beef breeds used in the experiment

  Breed                        Inter-muscular   Subcutaneous   Bone    Muscle   Intramuscular
  Name               Type      fat %            fat %          %       %        fat %
  Asturiana Valles   DM             4.77             0.89      16.00   78.34        0.90
  Avileña            UR            13.17             3.53      19.25   64.05        2.28
  Morucha            UR            12.46             3.46      19.28   64.80        2.10
  Parda Alpina       DP             9.65             2.32      20.86   67.17        1.82
  Pirenaica          FG             9.02             3.01      17.33   70.63        1.48
  Retinta            UR            14.16             4.75      20.89   60.20        2.13
  Rubia Gallega      FG             5.73             1.20      16.56   76.52        1.12
Table 6. Number of disagreements (minimum possible training error) and classification error estimations in each cluster for acceptance and tenderness

  Dataset       Disagreements   Cluster   |PJ|    Disagreements   Classification errors
  acceptance       21.41%       left      1927       16.19%             19.20%
                                right     2150       17.07%             21.12%
  tenderness       20.25%       left      2487       15.96%             19.38%
                                right     2432       15.21%             19.59%

Fig. 1. The assessment of each object is proportional to the distance of the vector that represents it to the (assessment) hyperplane perpendicular to w_i. In the picture, x^{(1)} is preferable to x^{(2)}, since it is further from the hyperplane.

Fig. 2. Trace of the clustering algorithm in the EachMovie dataset. The left hand side corresponds to the linear kernel, while the right side represents the result obtained with the polynomial kernel of degree 2. In each node we report the number of spectators in bold, and the confidence level required to split the node (in parentheses). The leaves are labelled for further reference.

Fig. 3. Trace of the clustering algorithm in the beef meat dataset.

Fig. 4. Acceptance of beef meat. Average ranking scores for each breed and aging period.

Fig. 5. Tenderness of beef meat. Average ranking scores for each breed and aging period.