Nearest labelset using double distances for multi-label classification

Hyukjun Gweon1, Matthias Schonlau2 and Stefan H. Steiner2
1 Department of Statistical and Actuarial Sciences, University of Western Ontario, London, Ontario, Canada
2 Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario, Canada

ABSTRACT
Multi-label classification is a type of supervised learning where an instance may belong to multiple labels simultaneously. Predicting each label independently has been criticized for not exploiting any correlation between labels. In this article we propose a novel approach, Nearest Labelset using Double Distances (NLDD), that predicts the labelset observed in the training data that minimizes a weighted sum of the distances in both the feature space and the label space to the new instance. The weights specify the relative tradeoff between the two distances. The weights are estimated from a binomial regression of the number of misclassified labels as a function of the two distances. Model parameters are estimated by maximum likelihood. NLDD only considers labelsets observed in the training data, thus implicitly taking into account label dependencies. Experiments on benchmark multi-label data sets show that the proposed method on average outperforms other well-known approaches in terms of 0/1 loss and multi-label accuracy, and ranks second on the F-measure (after a method called ECC) and on Hamming loss (after a method called RF-PCT).

Subjects: Data Mining and Machine Learning, Data Science
Keywords: Multi-label classification, Label correlations, Nearest neighbor

INTRODUCTION
In multi-label classification, an instance can belong to multiple labels at the same time. This is different from multi-class or binary classification, where an instance can only be associated with a single label. For example, a newspaper article about electronic books may be labelled with multiple topics such as business, arts and technology simultaneously. Multi-label classification has been applied in many areas including text (Schapire & Singer, 2000; Godbole & Sarawagi, 2004), image (Boutell et al., 2004; Zhang & Zhou, 2007), music (Li & Ogihara, 2003; Trohidis et al., 2008) and bioinformatics (Elisseeff & Weston, 2001). A labelset for an instance is the set of all labels that are associated with that instance.

Approaches for solving multi-label classification problems may be categorized into either problem transformation methods or algorithm adaptation methods (Tsoumakas & Katakis, 2007). Problem transformation methods transform a multi-label problem into one or more single-label problems. For the single-label classification problems, binary or multi-class classifiers are used. The results are combined and transformed back into a multi-label representation. Algorithm adaptation methods, on the other hand, modify specific learning algorithms directly for multi-label problems.
Tsoumakas, Katakis & Vlahavas (2010), Madjarov et al. (2012) and Zhang & Zhou (2014) give overviews of multi-label algorithms and evaluation metrics.

In this article, we propose a new problem transformation approach to multi-label classification. Our proposed approach applies the nearest neighbor method, predicting the labelset of the training instance with the shortest distance in the feature space. However, because we have multiple labels, we additionally consider the shortest (Euclidean) distance in the label space, where the input of the test instance in the label space consists of probability outputs obtained by independent binary classifiers. We then find the labelset that minimizes the expected label misclassification rate as a function of both feature space and label space distances, thus exploiting high-order interdependencies between labels. The nonlinear function is estimated using maximum likelihood. The effectiveness of the proposed approach is evaluated with various multi-label data sets. Our experiments show that the proposed method performs on average better on standard evaluation metrics (Hamming loss, 0/1 loss, multi-label accuracy and the F-measure) than other commonly used algorithms.

The rest of this article is organized as follows: in 'Related work' we review previous work on multi-label classification. In 'The nearest labelset using double distances approach', we present the details of the proposed method. In 'Experimental Evaluation', we report on experiments that compare the proposed method with other algorithms on standard metrics. In 'Discussion' we discuss the results. In 'Conclusion', we draw conclusions.

RELATED WORK
In this section, we briefly review the multi-label approaches that are existing competitors to the proposed method.

There are several approaches to classifying multi-label data. The most common approach, binary relevance (BR) (Zhang & Zhou, 2005; Tsoumakas & Katakis, 2007), transforms a multi-label problem into separate binary problems. That is, using the training data, BR constructs a binary classifier for each label independently. For a test instance, the predicted set of labels is obtained simply by combining the individual binary results; in other words, the predicted labelset is the union of the results predicted from the L binary models. This approach requires one binary model for each label. The method has been applied in many domains including text (Gonçalves & Quaresma, 2003), music (Li & Ogihara, 2003) and images (Boutell et al., 2004). One drawback of the basic binary approach is that it does not account for any correlation that may exist between labels, because the labels are modelled independently. Taking correlations into account is often critical for prediction in multi-label problems (Godbole & Sarawagi, 2004; Ji et al., 2008).

Subset-Mapping (SMBR) (Schapire & Singer, 1999; Read et al., 2011) is a method related to BR. For a new instance, labels are first predicted by the binary outputs of BR. Then, the final prediction is the training labelset with the shortest Hamming distance to the predicted labelset. SMBR thus makes predictions by selecting labelsets observed in the training data.
SMBR is a nearest neighbor approach in the label space (from the set of predicted labels to the sets of labels observed in the training data), with Hamming distance as the distance metric.

An extension of binary relevance is the classifier chain (CC) (Read et al., 2011). CC fits labels sequentially using binary classifiers. Labels already predicted are included as features in subsequent classifiers until all labels have been fit. Including previous predictions as features "chains" the classifiers together and also takes into account potential label correlations. However, the order of the labels in a chain affects the predictive performance. Read et al. (2011) also introduced the ensemble of classifier chains (ECC), where multiple CCs are built with re-sampled training sets. The order of the labels in each CC is randomly chosen. The prediction of an ECC is obtained by a majority vote of the CC models.

Label Powerset learning (LP) transforms a multi-label classification problem into a multi-class problem (Tsoumakas & Katakis, 2007). In other words, LP treats each labelset as a single label. The transformed problem requires a single classifier. Although LP captures correlations between labels, the number of classes in the transformed problem increases exponentially with the number of original labels. LP learning can only choose observed labelsets for predictions (Tsoumakas & Katakis, 2007; Read, Pfahringer & Holmes, 2008).

The random k-labelsets method (RAKEL) (Tsoumakas & Vlahavas, 2007) is a variation on the LP approach. In a multi-label problem with L different labels, RAKEL employs m multi-class models, each of which considers k (≤ L) randomly chosen labels rather than the entire labelset. For a test instance, the predicted labelset is obtained by the majority vote of the results of the m models. RAKEL overcomes the problem that the number of multinomial classes increases exponentially as a function of the number of labels. It also considers interdependencies between labels by using multi-class models with subsets of the labels.

A hierarchy of multi-label classifiers (HOMER) (Tsoumakas, Katakis & Vlahavas, 2008) constructs a tree-shaped hierarchy by partitioning the labels recursively into smaller disjoint subsets (i.e., nodes) using a balanced clustering algorithm, which extends the k-means algorithm with an additional constraint on the size of each cluster. HOMER then constructs a multi-label classifier for the labelsets in each node. For the prediction of a new instance, HOMER follows a top-down recursive process from the root. A classifier on a non-root node is called only if the prediction of its parent node is positive. The final labelset is determined by the positive leaves (i.e., labels) whose parent nodes are all positive.

A popular lazy learning algorithm based on the k-nearest neighbours (kNN) approach is MLKNN (Zhang & Zhou, 2007). Like other kNN-based methods, MLKNN identifies the k nearest training instances in the feature space for a test instance. Then, for each label, MLKNN estimates the prior and likelihood for the number of neighbours associated with the label. Using Bayes' theorem, MLKNN calculates the posterior probability from which a prediction is made.

The Conditional Bernoulli Mixtures (CBM) approach (Li et al., 2016) transforms a multi-label problem into a mixture of binary and multi-class problems.
CBM divides the feature space into K regions and learns a multi-class classifier for the regional components as well as binary classifiers in each region. The posterior probability for a labelset is obtained by mixing the multi-class and the multiple binary classifiers. The model parameters are estimated using the Expectation-Maximization algorithm.

Multi-target classification approaches may also be used for multi-label classification. A number of multi-target learning methods use the predictive clustering tree (PCT) (Blockeel, Raedt & Ramon, 1998) as the base classifier. Random forest of predictive clustering trees (RF-PCT) (Kocev et al., 2007) has been shown to be competitive (Madjarov et al., 2012). RF-PCT is a tree-based ensemble method using PCTs as base classifiers. Different PCTs are constructed from different bootstrap samples and random subsets of the features.

THE NEAREST LABELSET USING DOUBLE DISTANCES APPROACH

Hypercube view of a multi-label problem
In multi-label classification, we are given a set of possible output labels L = {1, 2, ..., L}. Each instance with a feature vector x ∈ R^d is associated with a subset of these labels. Equivalently, the subset can be described as y = (y^(1), y^(2), ..., y^(L)), where y^(i) = 1 if label i is associated with the instance and y^(i) = 0 otherwise. A multi-label training data set is described as T = {(xi, yi)}, i = 1, 2, ..., N.

Any labelset y can be described as a vertex in the L-dimensional unit hypercube (Tai & Lin, 2012). Each component y^(i) of y represents an axis of the hypercube. As an example, Fig. 1 illustrates the label space of a multi-label problem with three labels (y^(1), y^(2), y^(3)).

Assume that the presence or absence of each label is modeled independently with a probabilistic classifier. For a new instance, the classifiers provide the probabilities, p^(1), ..., p^(L), that the corresponding labels are associated with the instance. Using the probability outputs, we may obtain an L-dimensional vector p̂ = (p^(1), p^(2), ..., p^(L)). Every element of p̂ has a value from 0 to 1 and the vector p̂ is an inner point in the hypercube (see Fig. 1).

Figure 1: An illustration of the label space when L = 3. Each vertex represents a labelset. The inner point represents a fitted vector of an instance. Dyi represents the distance between p̂ and yi.

Given p̂, the prediction task is completed by assigning the inner point to a vertex of the cube. For the new instance, we may calculate the Euclidean distance, Dyi, between p̂ and each yi (i.e., the labelset of the ith training instance). In Fig. 1, three training instances y1, y2 and y3 and the corresponding distances are shown. A small distance Dyi indicates that yi is likely to be the labelset for the new instance.
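To make the hypercube view concrete, the following Python sketch (assuming scikit-learn-style binary classifiers and a 0/1 label indicator matrix; the helper names are ours, not the authors') builds the fitted vector p̂ for a test instance from independent per-label classifiers and computes its Euclidean distance in the label space to every training labelset:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_label_classifiers(X_train, Y_train):
        # One independent probabilistic classifier per label (binary relevance style).
        return [LogisticRegression(max_iter=1000).fit(X_train, Y_train[:, i])
                for i in range(Y_train.shape[1])]

    def fitted_vector(classifiers, x):
        # Map a test instance to the inner point p_hat of the L-dimensional unit hypercube.
        x = np.asarray(x).reshape(1, -1)
        return np.array([clf.predict_proba(x)[0, 1] for clf in classifiers])

    def label_space_distances(p_hat, Y_train):
        # Euclidean distance D_y between p_hat and each training labelset (a vertex).
        return np.linalg.norm(Y_train - p_hat, axis=1)

Any probabilistic base classifier could replace the logistic regression here; the only requirement is a per-label probability estimate between 0 and 1.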
Nearest labelset using double distances (NLDD)
In addition to computing the distance in the label space, Dyi, we may also obtain the (Euclidean) distance in the feature space, denoted by Dxi. The proposed method, NLDD, uses both Dx and Dy as predictors to find a training labelset that minimizes the expected loss. For each test instance, we define the loss as the number of misclassified labels out of L labels. The expected loss is then Lθ, where θ = g(Dx, Dy) represents the probability of misclassifying each label. The predicted labelset, ŷ*, is the labelset observed in the training data that minimizes the expected loss:

ŷ* = argmin_{y∈T} L·g(Dx, Dy)    (1)

The loss follows a binomial distribution with L trials and parameter θ. We model θ = g(Dx, Dy) using binomial regression. Specifically,

log(θ / (1−θ)) = β0 + β1·Dx + β2·Dy    (2)

where β0, β1 and β2 are the model parameters. Greater values for β1 and β2 imply that θ becomes more sensitive to the distances in the feature and label spaces, respectively. The misclassification probability decreases as Dx and Dy approach zero. A test instance with Dx = Dy = 0 has a duplicate instance in the training data (i.e., with identical features). The predicted probabilities for the test instance are either 0 or 1 and match the labels of the duplicate training observation. For such a "double-duplicate" instance (i.e., Dx = Dy = 0), the probability of misclassification is 1/(1 + e^(−β0)) > 0. As expected, the uncertainty of a test observation with a "double-duplicate" training observation is greater than zero. This is not surprising: duplicate training observations do not necessarily have the same response, and neither do double-duplicate observations.

The model in Eq. (2) implies g(Dx, Dy) = 1/(1 + e^(−(β0 + β1·Dx + β2·Dy))). Because log(θ / (1−θ)) is a monotone transformation of θ and L is a constant, the minimization problem in (1) is equivalent to

ŷ* = argmin_{(x,y)∈T} (β1·Dx + β2·Dy)    (3)

That is, NLDD predicts by choosing the labelset of the training instance that minimizes the weighted sum of the distances. For prediction, the only remaining issue is how to estimate the weights.
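Because the logit in Eq. (2) is monotone in θ, minimizing the estimated misclassification probability over training labelsets reduces to minimizing the weighted distance sum of Eq. (3). A minimal Python sketch of this prediction rule, assuming the two distance vectors and the estimated coefficients are already available (variable names are ours):

    import numpy as np

    def nldd_predict(D_x, D_y, Y_train, beta1, beta2):
        # Score each training labelset by the weighted sum in Eq. (3);
        # beta0 shifts all scores equally and does not change the argmin.
        scores = beta1 * D_x + beta2 * D_y
        return Y_train[np.argmin(scores)]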
Estimating the relative weights of the two distances
The weights β0, β1 and β2 can be estimated using binomial regression. Binomial regression can be fit by running separate logistic regressions, one for each of the L labels. To run the regressions, Dx and Dy need to be computed on the training data. For this purpose we split the training data T equally into an internal training data set, T1, and an internal validation data set, T2.¹ We next fit a binary classifier to each of the L labels separately and obtain the labelset predictions (i.e., probability outcomes) for the instances in T2.

¹ The dataset is split equally into internal training and validation sets. An unequal split is not desirable: adding more instances to the internal training set may improve the performance of the individual probabilistic classifiers, but it would decrease the number of distance pairs available for the binomial regression (# distance pairs = 2 × # instances in the internal validation set). That is, reducing the size of the validation set decreases the amount of data used for the binomial regression.

In principle, each observation in T2 can be paired with each observation in T1, creating a (Dx, Dy) pair, and the regressions can be run on all possible pairs. Note that matching any single instance in T2 to those in T1 results in N/2 distance pairs. However, most of the pairs are uninformative because the distance in either the feature space or the label space is very large. Since candidate labelsets for the final prediction will have a small Dy and a small Dx, it is reasonable to focus on the behavior of the loss at small values of Dx and Dy rather than over the entire range of the distances. Moreover, since T2 contains N/2 instances, the number of possible pairs is potentially large (N²/4). Therefore, to reduce computational complexity, for each instance we identify only two pairs: the pair with the smallest distance in x and the pair with the smallest distance in y. In case of ties in one distance, the pair with the smallest value in the other distance is chosen.

More formally, we identify the first pair mi1 by

mi1 = argmin_{(Dx,Dy)∈Wix} Dy

where Wix is the set of pairs that are tied, i.e., that each correspond to the minimum distance in Dx. Similarly, the second pair mi2 is found by

mi2 = argmin_{(Dx,Dy)∈Wiy} Dx

where Wiy is the set of pairs that are tied with the minimal distance in Dy.

Figure 2: An illustration of how to identify mi1 and mi2 for N = 20. T1 and T2 contain 10 instances each. The 10 points in the scatter plot were obtained by calculating Dx and Dy between a single instance in T2 and the 10 instances in T1. In this example two points have the lowest distance in Dy and are candidates for mi2. Among the candidates, the point with the lowest Dx is chosen.

Figure 2 illustrates an example of how to identify mi1 and mi2 for N = 20. Our goal was to identify the instance with the smallest distance in x and the instance with the smallest distance in y. Note that mi1 and mi2 may be the same instance. If we find a single instance that minimizes both distances, we use just that instance. (A possible duplication of that instance is unlikely to make any difference in practice.)
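The pair selection for a single validation instance can be sketched in Python as follows (our own helper, not the authors' implementation), with ties in one distance broken by the other distance as described above:

    import numpy as np

    def select_pairs(D_x, D_y):
        # D_x, D_y: distances between one instance of T2 and every instance of T1.
        # m_i1: among pairs with minimal D_x, take the one with the smallest D_y.
        cand_x = np.flatnonzero(D_x == D_x.min())
        m_i1 = cand_x[np.argmin(D_y[cand_x])]
        # m_i2: among pairs with minimal D_y, take the one with the smallest D_x.
        cand_y = np.flatnonzero(D_y == D_y.min())
        m_i2 = cand_y[np.argmin(D_x[cand_y])]
        # If a single pair minimizes both distances, it is used only once.
        return {m_i1, m_i2}

The returned indices mark the rows of T1 whose (Dx, Dy) pairs are added to S for the binomial regression.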
The two pairs corresponding to the ith instance in T2 are denoted as the set Si = {mi1, mi2}, and their union over all instances is denoted as S = ∪_{i=1}^{N/2} Si. The binomial regression specified in Eq. (2) is performed on the 2 × N/2 = N instances in S. Algorithm 1 outlines the training procedure.

Algorithm 1: The training process of NLDD
Input: training data T, number of labels L
Output: probabilistic classifiers h^(i), binomial regression model g
  Split T into T1 and T2
  for i = 1 to L do
    train probabilistic classifier h^(i) based on T
    train probabilistic classifier h*^(i) based on T1
  end for
  S ← ∅
  for each instance in T2 do
    W ← ∅
    obtain p̂ = (h*^(1)(x), ..., h*^(L)(x))
    for each instance in T1 do
      compute Dx and Dy
      W ← W ∪ (Dx, Dy)
    end for
    find m1, m2 ∈ W
    update S ← S ∪ {m1, m2}
  end for
  Fit log(θ / (1−θ)) = β0 + β1·Dx + β2·Dy to S
  Obtain g: S → θ̂ = e^f̂ / (1 + e^f̂), where f̂ = β̂0 + β̂1·Dx + β̂2·Dy

For the classification of a new instance, we first obtain p̂ using the probabilistic classifiers fitted to the training data T. Dxj and Dyj are obtained by matching the instance with the jth training instance. Using the MLEs β̂0, β̂1 and β̂2, we calculate

θ̂j = e^f̂j / (1 + e^f̂j), where f̂j = β̂0 + β̂1·Dxj + β̂2·Dyj.

The final prediction for the new instance is obtained by

ŷ = argmin_{yj∈T} Ê(loss) = argmin_{yj∈T} θ̂j.

The second equality holds because Ê(loss) = Lθ̂ and L is a constant. As in LP, NLDD chooses a training labelset as the predicted vector. Algorithm 2 outlines the classification procedure.

Algorithm 2: The classification process of NLDD
Input: new instance x, binomial model g, probabilistic classifiers h^(i), training data T of size N
Output: multi-label classification vector ŷ
  for j = 1 to N do
    compute p̂ = (h^(1)(x), ..., h^(L)(x))
    compute Dxj and Dyj
    obtain θ̂j ← g(Dxj, Dyj)
  end for
  return ŷ ← argmin_{yj∈T} θ̂j

The training time of NLDD is O(L(f(d,N) + f(d,N/2) + g(d,N/2)) + N²(d+L) + N·log(k)), where O(f(d,N)) is the complexity of each binary classifier with d features and N training instances, O(g(d,N/2)) is the complexity of predicting each label for T2, N²(d+L) is the complexity of obtaining the distance pairs for the regression, and O(N·log(k)) is the complexity of fitting the binomial regression. T1 and T2 have N/2 instances each. Fitting the binary classifiers using T1 has complexity O(L·f(d,N/2)), and obtaining the probability results for T2 takes O(L·g(d,N/2)). For each instance of T2, we obtain N/2 distance pairs, which has complexity O((N/2)(d+L)). Since there are N/2 instances, overall this takes O((N/2)(N/2)(d+L)), or O(N²(d+L)) when omitting the constant. Among the N/2 pairs for each instance of T2, we identify at most two pairs. This implies N/2 ≤ s ≤ N, where s is the number of elements in S. Each iteration of the Newton–Raphson method has a complexity of O(N). For k-digit precision, complexity O(log k) is required (Ypma, 1995). Combined, the complexity of estimating the parameters with k-digit precision is O(N·log(k)). In practice, however, this term is dominated by N²(d+L) as we can set k <