key: cord-0044809-v1te4zc6 authors: Tang, Mengzi; Pérez-Fernández, Raúl; De Baets, Bernard title: Combining Absolute and Relative Information with Frequency Distributions for Ordinal Classification date: 2020-05-15 journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems DOI: 10.1007/978-3-030-50143-3_47 sha: 54738265ca027312146d6a2b62a3f82e5ff690f2 doc_id: 44809 cord_uid: v1te4zc6 A large amount of labelled data (absolute information) is usually needed for an ordinal classifier to attain a good performance. As shown in a recent paper by the present authors, the lack of a large amount of absolute information can be overcome by additionally considering some side information in the form of relative information, thus augmenting the method of nearest neighbors. In this paper, we adapt the method of nearest neighbors for dealing with a specific type of relative information: frequency distributions of pairwise comparisons (rather than a single pairwise comparison). We test the proposed method on some classical machine learning datasets and demonstrate its effectiveness. Typically, for ordinal classification only absolute information is available, i.e., examples with an associated class label from a fixed ordinal scale. Unfortunately, in real-life applications, the amount of absolute information available is often limited, which largely impacts the performance of an ordinal classifier. Fortunately, different types of side information can be additionally collected to compensate for the limited amount of absolute information available [1, 2]. A popular type of such side information is relative information, i.e., couples of examples without an explicitly given class label but with an order relation between them. Interestingly, in real-life applications, relative information with frequency distributions arises quite commonly.
For instance, the emergence of online transaction platforms such as Amazon Mechanical Turk makes it possible to distribute evaluation tasks to consumers and collect a large amount of relative information. However, the order relations from relative information are usually less informative than the class labels from absolute information. It is also quite common for different consumers to provide contradictory order relations for the same couple of examples. Due to these facts, it is better to consult several consumers when collecting the preference between two examples, thus obtaining for each couple of examples a frequency distribution of order relations (hereinafter referred to as relative information with frequency distributions). Hence, our main goal becomes how to combine a small amount of absolute information with a large amount of relative information with frequency distributions. Some related works [3, 4] have shown the effectiveness of fusing absolute and relative information. In the field of ordinal classification, for example, Sader et al. [5] proposed an ordinal classification method that combines both absolute and relative information to perform prediction tasks. This method needs to learn many parameters by solving a constrained convex optimization problem. In a similar direction, our previous work [6] incorporated both types of information into the method of nearest neighbors, yielding an augmented method for ordinal classification that is non-parametric and easy to explain. However, this method was designed to deal with just one order relation for each couple of examples, not with a frequency distribution of order relations. An immediate extension of this method to the latter setting would reduce the study of the nearest couples of examples to just the single nearest couple of examples, thus impacting its overall performance.
To properly address our problem setting, where there is a small amount of absolute information and a large amount of relative information with frequency distributions available, we propose a method that incorporates both types of information into the method of nearest neighbors for ordinal classification on the basis of our previous work [6]. The remainder of this paper is organized as follows. In Sect. 2, we formulate our problem. We propose our method in Sect. 3. In Sect. 4, we perform experiments and analyze the performance. Some conclusions are presented in Sect. 5. The available data is composed of absolute and relative information. We denote the input examples by D = {x_1, x_2, ..., x_n}. The input examples x_i = (x_{i1}, ..., x_{id}) belong to the input space X ⊆ R^d and their corresponding class labels y_i belong to the output space Y = {C_1, C_2, ..., C_r}, where the class labels are supposed to be ordered as follows: C_1 ≺ C_2 ≺ ... ≺ C_r. Although for some examples there is no explicitly given class label, it is still possible to have some side information in the form of relative information. Relative information is typically expressed for a set of couples of examples. With each couple (a_i, b_i), a frequency distribution (α_i, β_i) is associated, α_i representing the proportion of times that a_i is preferred to b_i and β_i representing the proportion of times that b_i is preferred to a_i. Obviously, α_i + β_i = 1. Relative information is collected in a set R = {((a_1, b_1), (α_1, β_1)), ((a_2, b_2), (α_2, β_2)), ..., ((a_m, b_m), (α_m, β_m))}. If ((a_i, b_i), (α_i, β_i)) belongs to R, then ((b_i, a_i), (β_i, α_i)) is supposed also to belong to R. Note that here we do not consider the case in which a_i and b_i are equally preferred. The main characteristic of our problem is that the amount of absolute information is typically much smaller than the amount of relative information, i.e., n ≪ m.
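As a minimal illustration of this data layout (toy values and variable names are ours, not from the paper), absolute and relative information can be represented as follows:

```python
# Hypothetical toy data illustrating the problem setting (n << m).
# Absolute information: n labelled examples with ordinal class indices 1..r.
X_abs = [(0.1, 0.2), (0.4, 0.1), (0.9, 0.8)]
y_abs = [1, 2, 3]                    # ordinal classes C_1 < C_2 < C_3

# Relative information: couples (a_i, b_i) with a frequency distribution
# (alpha_i, beta_i), alpha_i + beta_i = 1, where alpha_i is the proportion
# of times a_i was preferred to b_i.
R = [
    (((0.2, 0.3), (0.8, 0.7)), (0.2, 0.8)),
    (((0.5, 0.5), (0.1, 0.1)), (0.9, 0.1)),
]

# Symmetry assumed in the text: if ((a, b), (alpha, beta)) is in R,
# then ((b, a), (beta, alpha)) is in R as well.
R += [((b, a), (beta, alpha)) for ((a, b), (alpha, beta)) in R]
```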
In this subsection, we recall the method proposed in our previous work [6]. Firstly, according to a fixed distance metric d, we find the k nearest neighbor examples D_k = {x_{i_1}, ..., x_{i_k}} of the test example x*. We see each couple (x*, x_{i_j}) as a new object and look for its ℓ nearest neighbor couples in R. For this process, we compute the distance between couples according to the product distance metric (see [7], page 83, with p = 1), which is defined as d((x_1, x_2), (z_1, z_2)) = d(x_1, z_1) + d(x_2, z_2). Secondly, we rely on the assumption that a couple and its nearest neighbor couples have similar order relations. More in detail, for the new object (x*, x_{i_j}), we focus on the nearest neighbor couple and get its corresponding order relation. For instance, if the nearest neighbor couple of (x*, x_{i_j}) is (a_{j_1}, b_{j_1}) and its given order relation is a_{j_1} ≻ b_{j_1}, then we assume the same order relation x* ≻ x_{i_j} for (x*, x_{i_j}). Since the class label of x_{i_j} is known to be, for instance, C_{c_j}, the class label of x* is expected to be at least C_{c_j}. The same applies to the other ℓ − 1 neighbor couples. For each of these relations, we obtain an interval of potential class labels for x*. We denote this interval by I_{jq} = [l_{I_{jq}}, r_{I_{jq}}], where j ∈ {1, ..., k} and q ∈ {1, ..., ℓ}. For instance, if the given class label of x_{i_j} is C_{c_j} and we obtain that the relation for the couple (x*, x_{i_j}) is x* ≻ x_{i_j} according to its q-th nearest neighbor couple, then the interval of possible values of y* is I_{jq} = [C_{c_j}, C_r]. Finally, we denote by I = (I_{jq})_{j ∈ {1,...,k}, q ∈ {1,...,ℓ}} the list of all the obtained intervals. We consider the penalty function associated with the median for intervals (see, for instance, Beliakov et al. [8]):

P(I, y) = Σ_{j=1}^{k} Σ_{q=1}^{ℓ} d(I_{jq}, y), with d(I_{jq}, y) = |l_{I_{jq}} − y| if y ≺ l_{I_{jq}}, 0 if y ∈ I_{jq}, and |y − r_{I_{jq}}| if r_{I_{jq}} ≺ y,

where |C_i − C_j| denotes the L_1-distance between two class labels C_i and C_j. Note that the L_1-distance metric treats all class labels of the ordinal scale as if they were equidistant, something that is not always advisable depending on the nature of Y.
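The base method recalled above can be sketched as follows. This is a simplified illustration, assuming classes are encoded as integer indices 1..r; the helper names are ours, not from the paper:

```python
def l1(x, z):
    """L1 distance between two feature vectors given as sequences."""
    return sum(abs(u - v) for u, v in zip(x, z))

def product_distance(c1, c2):
    """Product distance metric with p = 1 between couples:
    d((x1, x2), (z1, z2)) = d(x1, z1) + d(x2, z2)."""
    (x1, x2), (z1, z2) = c1, c2
    return l1(x1, z1) + l1(x2, z2)

def interval_penalty(intervals, y):
    """Median-type penalty of a class index y w.r.t. intervals [l, r] of
    class indices: zero inside an interval, L1 distance to the closest
    endpoint outside of it."""
    total = 0
    for l, r in intervals:
        if y < l:
            total += l - y
        elif y > r:
            total += y - r
    return total

def median_label(intervals, labels):
    """Penalty-based aggregation: the label minimizing the total penalty."""
    return min(labels, key=lambda y: interval_penalty(intervals, y))
```

For instance, `median_label([(2, 5), (1, 3), (4, 5)], range(1, 6))` returns 3; ties in the penalty are broken towards the smaller class index by `min`.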
The class label y* of x* is then determined using the corresponding penalty-based (aggregation) function: y* = arg min_{y ∈ Y} P(I, y). The method above only focuses on how to deal with couples provided with only one order relation. However, in our problem setting, we have relative information with frequency distributions. More specifically, we have a frequency distribution of order relations for each (a_j, b_j). We use (α_j, β_j) to characterize this frequency distribution. In the following, we explain how to deal with such information. Firstly, we repeat the process above to look for the k nearest neighbor examples of the test example x*. We see each couple (x*, x_{i_j}) as a new object, search for its ℓ nearest neighbor couples, and get their frequency distributions of order relations. We rely on the same assumption as above that a couple and its nearest neighbor couples have similar order relations. More in detail, if the q-th nearest neighbor couple of (x*, x_{i_j}) is (a_{j_q}, b_{j_q}) and its frequency distribution is (α_{j_q}, β_{j_q}), which implies that a proportion α_{j_q} of the times the order relation of (a_{j_q}, b_{j_q}) is a_{j_q} ≻ b_{j_q} and a proportion β_{j_q} of the times the order relation of (a_{j_q}, b_{j_q}) is a_{j_q} ≺ b_{j_q}, then, in case the couple (x*, x_{i_j}) needs to be labelled, we would expect that a proportion α_{j_q} of the times the order relation of (x*, x_{i_j}) is x* ≻ x_{i_j} and a proportion β_{j_q} of the times the order relation of (x*, x_{i_j}) is x* ≺ x_{i_j}. For the new couple (x*, x_{i_j}) in which the given class label of x_{i_j} is C_{c_j}, we get α_{j_q} times the interval [C_{c_j}, C_r] and β_{j_q} times the interval [C_1, C_{c_j}] for the potential class label y* of x*. We repeat this process for the other ℓ − 1 nearest neighbor couples. Exploiting all k nearest neighbors and their ℓ nearest neighbor couples, we get a list of weighted intervals of potential class labels for x*. We denote by I the list of all gathered intervals.
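The adaptation to frequency distributions thus replaces each crisp interval by an (α, β)-weighted pair of intervals. A minimal sketch (again with integer class indices 1..r and hypothetical helper names):

```python
def weighted_interval_penalty(weighted_intervals, y):
    """Penalty of class index y w.r.t. weighted intervals (w, (l, r)):
    each interval contributes w times the L1 distance from y to the
    interval (zero if y lies inside it)."""
    total = 0.0
    for w, (l, r) in weighted_intervals:
        if y < l:
            total += w * (l - y)
        elif y > r:
            total += w * (y - r)
    return total

def predict_label(neighbor_info, n_classes):
    """neighbor_info: one (c_j, alpha, beta) triple per nearest neighbor
    couple, where c_j is the class index of the neighbor example x_ij.
    Each triple contributes alpha times the interval [C_cj, C_r] and
    beta times the interval [C_1, C_cj]; the predicted class minimizes
    the weighted median-type penalty."""
    weighted = []
    for c_j, alpha, beta in neighbor_info:
        weighted.append((alpha, (c_j, n_classes)))
        weighted.append((beta, (1, c_j)))
    labels = range(1, n_classes + 1)
    return min(labels, key=lambda y: weighted_interval_penalty(weighted, y))
```

For example, with two neighbor couples `[(3, 0.8, 0.2), (2, 0.6, 0.4)]` on a five-class scale, `predict_label` returns 3.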
Finally, differently from the previous section, here we do not use the notation I_{jq} = [l_{I_{jq}}, r_{I_{jq}}] to represent a single interval. More specifically, we now have a proportion α_{j_q} of times the interval [C_{c_j}, C_r] and a proportion β_{j_q} of times the interval [C_1, C_{c_j}] for each nearest neighbor couple of (x*, x_{i_j}). Thus, the penalty function associated with the median reads as follows:

P(I, y) = Σ_{j=1}^{k} Σ_{q=1}^{ℓ} ( α_{j_q} · d([C_{c_j}, C_r], y) + β_{j_q} · d([C_1, C_{c_j}], y) ),

where d([C_a, C_b], y) is zero if y ∈ [C_a, C_b] and equals the L_1-distance from y to the closest endpoint otherwise, (a_{j_q}, b_{j_q}) is the q-th nearest neighbor couple of the couple (x*, x_{i_j}), (α_{j_q}, β_{j_q}) is the corresponding frequency distribution and the given class label of the j-th nearest neighbor x_{i_j} of x* is C_{c_j}. The class label y* of x* is then determined using the corresponding penalty-based (aggregation) function: y* = arg min_{y ∈ Y} P(I, y).

We perform the experiments on some classical machine learning datasets from some typical repositories [9, 10, 11]. The detailed characteristics of these datasets can be found in Table 1, including the number of examples, features and classes; for instance:

Table 1 (excerpt). Dataset | #Examples | #Features | #Classes
Bank1-10 (BA1-10) | 8192 | 8 | 10
Bank2-10 (BA2-10) | 8192 | 32 | 10
Computer1-10 (CO1-10) | 8192 | 12 | 10
Computer2-10 (CO2-10) | 8192 | 21 | 10

All features have been normalized to zero mean and unit standard deviation to avoid the impact of the scale of the features. Note that the datasets do not contain relative information with frequency distributions. Based on a generation process of relative information similar to the one described in [6], we generate couples with frequency distributions of order relations. More in detail, when comparing two examples for generating a couple (a_i, b_i), we randomly sample α_i or β_i from a uniform distribution. For example, if the real order relation of these two examples is a_i ≻ b_i, then we sample α_i from a uniform distribution on [0.5, 1] and set β_i = 1 − α_i.
Similarly, if the real order relation of these two examples is a_i ≺ b_i, then we sample β_i from a uniform distribution on [0.5, 1] and set α_i = 1 − β_i. Thus, we generate a couple ((a_i, b_i), (α_i, β_i)). To test our method, we construct two different datasets for each original dataset. Based on a generation process similar to the one in our previous work [6], we fix 10% of the data to be shared by both datasets for testing. The remaining 90% is used for generating the training data. We keep 5% of this remaining 90% as absolute information and use the other 95% for generating relative information as described above. Dataset 1 includes just the absolute information. Dataset 2 includes the same absolute information as Dataset 1, but additionally includes the relative information with frequency distributions. By comparing the performance on these two datasets, we test the impact of incorporating relative information with frequency distributions. We use the three most common performance measures for evaluating ordinal classification models [13, 14]: the Mean Zero-one Error (MZE), the Mean Absolute Error (MAE) and the C-index. The MZE describes the error rate of the classifier and is computed as

MZE = (1/T) Σ_{i=1}^{T} [[y*_i ≠ y_i]] = 1 − Acc,

where T is the number of test examples, y_i is the real class label, y*_i is the predicted class label, [[·]] denotes the indicator of the enclosed condition, and Acc is the accuracy of the classifier. The value of MZE ranges from 0 to 1. It describes the global performance, but it neglects the relations among the class labels. The MAE is the average absolute error between y_i and y*_i. If the class labels are represented by numbers, the MAE is computed as

MAE = (1/T) Σ_{i=1}^{T} |y_i − y*_i|.

The value of MAE ranges from 0 to r − 1 (the maximum absolute error between classes). Because the real distances among the class labels are unknown, the numerical representation of the class labels has a big impact on the MAE performance.
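The generation of relative information with frequency distributions described above can be sketched as follows (the function name and toy arguments are ours):

```python
import random

random.seed(0)  # for reproducibility of the sketch

def make_couple(a, y_a, b, y_b):
    """Generate ((a, b), (alpha, beta)) from two labelled examples following
    the generation scheme in the text: the weight of the true order relation
    is sampled from a uniform distribution on [0.5, 1].  Equal preference
    (y_a == y_b) is not considered, as in the paper."""
    assert y_a != y_b
    if y_a > y_b:                  # real relation: a preferred to b
        alpha = random.uniform(0.5, 1.0)
    else:                          # real relation: b preferred to a
        alpha = 1.0 - random.uniform(0.5, 1.0)
    return (a, b), (alpha, 1.0 - alpha)
```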
In order to avoid this impact, one could consider the relations between the real class labels and the predicted class labels. Here we use the concordance index or C-index to represent these relations. The C-index is computed as the proportion of the number of concordant pairs to the number of comparable pairs (see [15], page 50):

C-index = (1 / Σ_{p<q} T_{C_p} T_{C_q}) Σ_{y_i ≺ y_j} ( [[y*_i ≺ y*_j]] + (1/2) [[y*_i = y*_j]] ),

where T_{C_p} and T_{C_q} are respectively the numbers of test examples with class label C_p and C_q, {y_i, y_j} is a real pair from the test examples and {y*_i, y*_j} is the corresponding predicted pair. When there are only two different class labels, the C-index amounts to the area under the Receiver Operating Characteristic (ROC) curve [16] and ranges from 0.5 to 1. A lower MZE or MAE means a better performance, while a higher C-index means a better performance. Here, we replace the C-index by (1 − C-index) to keep an analogy with the other performance measures and facilitate the discussion of the results. In this subsection, we analyze the performance of the proposed method on the different datasets listed in Subsect. 4.1. All experimental results are obtained by applying ten-fold cross validation. We perform experiments on all datasets, setting the number k of nearest neighbor examples to 5. Table 2 shows the performance on Dataset 1 and Dataset 2. It is clear that the performance on Dataset 2 is better than the performance on Dataset 1 for all original datasets except for AB5 and AB10. In order to test whether there is a significant difference in performance on these two datasets, we perform the Wilcoxon signed-rank test [12] at a significance level of α = 0.05. If the p-value is smaller than the fixed significance level α, then there exists a statistically significant difference between the performance on the two datasets.
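The three performance measures can be sketched as follows, assuming class labels encoded as integers; the C-index implementation follows the comparable/concordant-pair description above, with predicted ties counting one half:

```python
def mze(y_true, y_pred):
    """Mean zero-one error: proportion of misclassified examples (1 - Acc)."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error between numeric class indices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def c_index(y_true, y_pred):
    """Proportion of comparable pairs (different real labels) that are
    predicted concordantly; predicted ties count for one half."""
    sign = lambda a, b: (a > b) - (a < b)
    concordant, comparable = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # only pairs with different real labels are comparable
            comparable += 1
            if sign(y_pred[i], y_pred[j]) == sign(y_true[i], y_true[j]):
                concordant += 1.0
            elif y_pred[i] == y_pred[j]:
                concordant += 0.5
    return concordant / comparable
```

The Wilcoxon signed-rank test on the paired per-dataset scores can then be carried out with, e.g., `scipy.stats.wilcoxon`.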
In Table 2 , it can be seen that the p-values for MZE, MAE and 1 − C-index are smaller than α, which means that there exists a statistically significant difference between the performance on these two datasets obtained from all original datasets. The experimental results, together with the obtained p-values and associated point estimates (median differences), show that using relative information with frequency distributions is meaningful. Based on our previous work [6] , we have proposed an augmented method for ordinal classification for the setting in which there exists a small amount of absolute information and a large amount of relative information with frequency distributions. Specifically, we adapt the method of nearest neighbors for dealing with relative information with frequency distributions. We have carried out experiments on some classical ordinal classification or regression datasets. The experimental results show that the performance improves when relative information with frequency distributions is considered, which validates the usefulness of taking into account relative information with frequency distributions. We see several interesting future directions for extending this work. On the one hand, absolute information with frequency distributions is also common. How to combine both absolute and relative information with frequency distributions for ordinal classification is still an open problem. On the other hand, in case the amount of relative information is large, it might be necessary to explore how to select the most informative pairwise comparisons for relative information in order to reduce the computational complexity of the proposed method. 
References
[1] Boosting with side information
[2] Efficiently learning the metric with side-information
[3] Ratings and rankings: reconsidering the structure of values and their measurement
[4] Insight into the relative merits of rating and ranking in a cross-national context using three-way correspondence analysis
[5] Integrating expert and novice evaluations for augmenting ordinal regression models
[6] Fusing absolute and relative information for augmenting the method of nearest neighbors for ordinal classification
[7] Encyclopedia of Distances
[8] Aggregation for Atanassov's intuitionistic and interval valued fuzzy sets: the median operator
[9] UCI Machine Learning Repository
[10] Gaussian processes for ordinal regression
[12] Statistical comparisons of classifiers over multiple data sets
[13] Metrics to guide a multi-objective evolutionary algorithm for ordinal classification
[14] Evaluation measures for ordinal regression
[15] Learning to rank: a ROC-based graph-theoretic approach
[16] AUC optimization vs. error rate minimization

Acknowledgements. Mengzi Tang is supported by the China Scholarship Council (CSC). Raúl Pérez-Fernández acknowledges the support of the Research Foundation of Flanders (FWO17/PDO/160) and the Spanish MINECO (TIN2017-87600-P). This research received funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme.