A Practical Outlier Detection Approach for Mixed-Attribute Data Mohamed Bouguessa University of Quebec at Montreal Department of Computer Science Montreal, Qc, Canada bouguessa.mohamed@uqam.ca Abstract Outlier detection in mixed-attribute space is a challenging problem for which only few approaches have been proposed. However, such existing methods suf- fer from the fact that there is a lack of an automatic mechanism to formally discriminate between outliers and inliers. In fact, a common approach to outlier identification is to estimate an outlier score for each object and then provide a ranked list of points, expecting outliers to come first. A major problem of such an approach is where to stop reading the ranked list? How many points should be chosen as outliers? Other methods, instead of outlier ranking, implement var- ious strategies that depend on user-specified thresholds to discriminate outliers from inliers. Ad hoc threshold values are often used. With such an unprinci- pled approach it is impossible to be objective or consistent. To alleviate these problems, we propose a principled approach based on the bivariate beta mixture model to identify outliers in mixed-attribute data. The proposed approach is able to automatically discriminate outliers from inliers and it can be applied to both mixed-type attribute and single-type (numerical or categorical) attribute data without any feature transformation. Our experimental study demonstrates the suitability of the proposed approach in comparison to mainstream methods. Keywords: Data Mining, Outlier detection, Mixed-attribute data, Mixture model, Bivariate beta. Preprint submitted to Expert Systems with Applications February 4, 2016 1. Introduction1 Outlier detection is the practice of identifying data points which are consider-2 ably different from the remaining data (Aggarwal, 2013; Cao, Si, Zhang, and Jia,3 2010; Kriegel, Kroger, Schubert, and Zimek, 2011; Tan, Steinbach, and Kumar,4 2006). Outlier detection is also known as exception mining or deviation detec-5 tion because outlier points are exceptional in some sense or they have attribute6 values that deviate significantly from the expected or typical attribute values7 (Tan et al., 2006). Identifying outliers has practical applications in different do-8 mains such as intrusion and fraud detection, medical diagnosis, and many others9 (Fustes, Dafonte, Arcay, Manteiga, Smith, Vallenari, and Luri, 2013; Maervoet,10 Vens, Berghe, Blockeel, and Causmaecker, 2012; Alan and Catal, 2011). For11 example, in medical diagnosis, outliers may arise when the patient is afflicted12 with some disease, or suffers side-effects from a drug. Efficient detection of such13 outliers aids in identifying, preventing, and repairing the effects of malicious or14 faulty behavior (Penny and Jolliffe, 2011).15 Approaches to outlier detection can be categorised as supervised, semi-16 supervised, and unsupervised (Angiulli and Fassetti, 2014). In principle, super-17 vised, as well as semi-supervised learning methods, use labeled data to create18 a model which distinguishes outliers from inliers. On the other hand, unsuper-19 vised approaches do not require any labeled objects and detect outliers as points20 that are considerably dissimilar or inconsistent with respect to the remaining21 data using some quantified measures of outlierness (Aggarwal, 2013). To im-22 plement supervised and semi-supervised outlier detection methods, we should23 first label the training data (Wu and Wang, 2013). 
The problem here is that24 labeled data samples are more difficult, expensive and time consuming to obtain25 than unlabeled ones. This is why unsupervised approaches are more generally26 and widely used, since they do not require labeled information. In this paper27 we focus only on unsupervised outlier detection. For more surveys and details28 on outlier analysis, we refer the reader to Aggarwal (2013). In the following,29 we first describe some background information by providing a brief description30 2 of the key idea of some outlier detection approaches which are relevant to this31 work. Next, we discuss a number of elements that motivate this study and32 describe our contributions.33 1.1. Background Information34 Several unsupervised approaches have been proposed to identify outliers in35 numerical data. Such approaches can be broadly classified as statistical-based,36 distance-based, and density-based (Angiulli and Pizzuti, 2005). Statistical-37 based approaches attempt to fit the data set under investigation to a certain kind38 of distribution model (in general, the Gaussian model) (Yamanishi, Takeuchi,39 Williams, and Milne, 2000). Inliers occur in a high probability region of the40 model while outliers deviate strongly from the distribution. Distance-based41 approaches evaluate the outlierness of a point based on the distances to its k-42 nearest neighbors (kNN) (Angiulli and Pizzuti, 2005, 2002). Points with large43 kNN distance are defined as outliers. Finally, density-based approaches use the44 number of points within a specific local region of a data point in order to define45 local density (Breunig, Kriegel, Ng, and Sander, 2000). The local density val-46 ues could be then used to measure how isolated a point is with respect to the47 surrounding objects (Wu and Wang, 2013).48 The aforementioned approaches were specifically designed for numerical data.49 However, in several applications, attributes in real data sets are not numerical,50 but have categorical values. For categorical data sets, distance-based as well51 as density-based techniques must confront the problem of how to choose the52 measurement of distance or density (Wu and Wang, 2013). This poses a sig-53 nificant challenge in terms of generalizing algorithms for numerical data to the54 categorical domain (Aggarwal, 2013). To address this issue, a number of ap-55 proaches have been proposed to deal with categorical data (Koufakou, Secretan,56 and Georgiopoulos, 2011; He, Xu, Huang, and Deng, 2005). Some of these ap-57 proaches use the concept of frequent itemset mining to estimate an outlying58 score for each point. Inliers are those points which contain sets of items that59 co-occur frequently in the data sets, while outliers are likely to be the points60 3 Figure 1: Mixed-attribute data set with clustered objects and outliers. that contain rare itemsets (Koufakou et al., 2011).61 In many cases, categorical and numerical data are found in the same data62 set, as different attributes. This is referred to as mixed-attribute data (Ag-63 garwal, 2013). Outliers are those objects containing attribute values that are64 dissimilar to or inconsistent with the remaining objects in both the numerical65 and the categorical space (Koufakou and Georgiopoulos, 2010; Otey, Ghoting,66 and Parthasarathy, 2006). To illustrate, Fig. 1 shows a small data set composed67 of 18 objects with four numerical attributes (A1,A2,A3, and A4) and four cat-68 egorical attributes (A5,A6,A7, and A8). 
As can be seen from this figure, data69 objects O1,O2, . . . ,O15 are grouped into three clusters, while the remaining70 points, that is, O16,O17, and O18, are outliers which could not be located in71 any cluster. Note that in this figure each cluster is represented by a shade of72 gray and the unclustered background is white. Clusters thus exist in different73 subspaces spanned by different attributes. From Fig. 1, we can see that, in74 contrast to inliers (that is, the clustered objects), outliers contain dissimilar75 attribute values. In fact, compared to points that belong to clusters, outliers76 have non-correlated numerical attribute values along the numerical space and77 infrequent attribute values across the categorical space. On the other hand,78 4 from Fig. 1, we can see that objects grouped within clusters contain attribute79 values that are closely related along a specific subset of dimensions. For ex-80 ample, objects O1,O2,O3,O4, and O5, which form cluster 1, contain correlated81 attribute values along the numerical attributes A1,A2,A3, and a large number82 of common categorical attribute values along the categorical attribute A6.83 In practice, when faced with mixed-attribute data, it is common to discretize84 the numerical attributes and treat all the data as categorical so that categorical85 outlier detection algorithms can be applied to the entire data set. However, as86 suggested in Zhang and Jin (2010), discretizing numerical values into several87 bins could introduce noise or information losses. Improper discretizing thus88 would hamper the detection performance. To alleviate this problem, only few89 approaches (Koufakou and Georgiopoulos, 2010; Zhang and Jin, 2010; Otey,90 Ghoting, and Parthasarathy, 2006), have been proposed to handle outliers in91 the mixed-attribute space.92 The approach proposed in Otey et al. (2006) is based on the concept of93 frequent itemsets to deal with categorical attributes, and the covariance for94 continuous attributes. Specifically, the authors in Otey et al. (2006) assign to95 each point an outlier score inversely proportionate to its infrequent itemsets.96 They also maintain a covariance matrix for each itemset to compute anomaly97 scores in the continuous attribute space. A point is likely to be an outlier if it98 contains infrequent categorical sets, or if its continuous values differ from the99 covariance violation threshold. It is worth noting that the work proposed by100 Otey et al. (2006) has the merit of being the first outlier detection algorithm101 for mixed-attribute data.102 Koufakou and Georgiopoulos (2010) proposed an approach named ODMAD103 (Outlier Detection for Mixed Attribute Datasets). This algorithm calculates104 first, for each point in the categorical space, an outlier score which depends on105 the infrequent subsets contained in that point. Data points with score values less106 than a user-entered frequency threshold are isolated since they contain highly107 infrequent categorical values and may thus potentially correspond to outliers.108 This process results in a reduced data set based on which other outlier scores109 5 are calculated for the numerical space using the cosine similarity measure. As110 described in Koufakou and Georgiopoulos (2010), since minimum cosine simi-111 larity is 0 and maximum is 1, the data points with similarity close to 0 are more112 likely to be outliers. 
Experiments in Koufakou and Georgiopoulos (2010), show113 that ODMAD is fast and outperforms Otey’s approach.114 Zhang and Jin (2010) proposed a Pattern based Outlier Detection approach115 (POD). Patterns in Zhang and Jin (2010) are defined to describe the data objects116 as well as to capture interactions among different types of attributes. The more117 an object deviates from these patterns, the higher its outlier score. The authors118 in Zhang and Jin (2010) use logistic regression to learn patterns. These patterns119 are then used to estimate outlier scores for objects with mixed attribute. The120 top n points with the highest score values are declared as outliers. It is important121 to note that POD is not able to handle categorical values directly. To detect the122 target patterns, categorical attributes are first mapped into binary attributes.123 Then, these binary attributes are analyzed together with the original continuous124 attributes to detect outliers in the mixed-attribute space.125 1.2. Motivations and Contributions126 The area of outlier detection in mixed-attribute data offers several oppor-127 tunities for improvement. There are just very few approaches around in the128 literature so far, yet there are a number of problems still to solve. For instance,129 the output of POD (Zhang and Jin, 2010) is a ranked list of points that repre-130 sents the degree of outlierness of each point. The top n points in the list with131 the highest degree values are considered as outliers. This method encounters a132 major concern: at which level should this list be cut? Stated otherwise, starting133 from the first (ranked number one) object, how far should we go in that list? In134 general, no principled way is suggested on how many points should be selected135 from a ranked list. In some situations, the top n points are selected solely on the136 basis of specific knowledge of an application. Unfortunately, prior knowledge137 about the data under investigation is not always available.138 Since a ranked list has a particular disadvantage because there is no clear139 6 cut-off point of where to stop consulting the results, thresholding has turned140 out to be important in detecting outliers. For instance, ODMAD (Koufakou141 and Georgiopoulos, 2010) and the approach proposed by Otey et al. (2006)142 implement various strategies that depend on user-specified thresholds to detect143 outliers. In real situations, however, it is rarely possible for users to supply the144 threshold values accurately. Outlier detection accuracy can thus be seriously145 reduced if an incorrect threshold value is used. The experiments conducted in146 Koufakou and Georgiopoulos (2010) on the impact of using various threshold147 values on the outlier detection accuracy corroborate our claim. Finally, it is148 worth noting that ODMAD and Otey’s approach depend also on other input149 parameters such as the minimum support, the maximum length of itemset and150 the size of a window of categorical and numerical scores. Setting appropriate151 values of these parameters is not a straightforward task.152 To alleviate the aforementioned drawbacks of existing approaches for detect-153 ing outliers in the mixed-attribute space, we propose in this paper a principled154 approach which is able to automatically identify outliers. In our approach, we155 first estimate an outlying score, for each object, in the numerical space and156 another score in the categorical space. 
Next, we associate to each data point a157 two dimensional vector containing the estimated scores: one dimension contains158 the score estimated in the numerical space while the second one contains the159 outlying score calculated in the categorical space. We assume that, in both160 spaces, outliers are characterised by high score values. Finally, we propose a161 statistical framework, based on the bivariate beta mixture, in order to model162 the estimated outlier score vectors. The goal is to cluster the estimated vectors163 into several components such that data points associated to the component with164 the highest score values correspond to outliers.165 We have used the beta mixture mainly because it permits multiple modes166 and asymmetry and can thus approximate a wide variety of shapes (Dean and167 Nugent, 2013; Bouguila and Elguebaly, 2012; Bouguessa, 2012; Ji, Wu, Liu,168 Wang, and Coombes, 2005), while several other distributions are not able to do169 so. For example, the standard Gaussian distribution permits symmetric “bell”170 7 shape only. However, in many real life applications, the data under investigation171 is skewed with non-symmetric shapes. In this setting, as observed in Dean and172 Nugent (2013), and in Boutemedjet, Ziou, and Bouguila (2011), the standard173 Gaussian distribution may lead to inaccurate modeling (e.g. over estimation of174 the number of components in the mixture, increase of misclassification errors,175 etc.). In contrast to several distributions, the beta distribution is more flexible176 and powerful since it permits multiple symmetric and asymmetric modes, it177 may be skewed to the right, skewed to left or symmetric (Bouguila, Ziou, and178 Monga, 2006). This great shape flexibility of the beta distribution provides179 a better fitting of the outlier score vectors, which leads, in turn, to accurate180 detection of outliers. Our experimental results corroborate our claim.181 We summarize the significance of our work as follows:182 1. We view the task of identifying outliers from a mixture modeling perspec-183 tive, on which we devise a principled approach which is able to formally184 discriminate between outliers and inliers, while previous works provide185 only a ranked list of objects expecting outliers to come first.186 2. The proposed method automatically identifies outliers, while existing ap-187 proaches require human intervention in order to set a detection threshold188 or to manually define the number of outliers to be identified. Furthermore,189 our method is general, in the sense that it is not limited to mixed-attribute190 data and it can be applied to single-type attribute (numerical or categor-191 ical) data without any feature transformation.192 3. We conducted detailed experiments on several real data sets with mixed-193 attribute as well as with single-type attribute. The results suggest that194 the proposed approach achieves competitive results in comparison to main-195 stream outlier detection algorithms.196 The rest of this paper is organized as follows. Section 2 describes our ap-197 proach in detail. An empirical evaluation of the proposed method is given in198 Section 3. Finally, our conclusion is given in Section 4.199 8 Figure 2: Workflow of the proposed approach. 2. Proposed Approach200 We begin by fixing a proper notation to be used throughout the paper. Let201 D = {O1, . . . ,ON} be a set of N mixed-attribute data points. Each point con-202 tains An numerical attributes and Ac categorical attributes. 
The subspace of D that contains only numerical attributes is denoted $S_{num}$, while $S_{cat}$ refers to the subspace of D that contains only categorical attributes. In this paper, we represent a data point $O_i$ as $O_i = [O^n_i, O^c_i]$, where $O^n_i = (o^n_{i1}, \ldots, o^n_{il}, \ldots, o^n_{iA_n})$ and $O^c_i = (o^c_{i1}, \ldots, o^c_{it}, \ldots, o^c_{iA_c})$, with $o^n_{il}$ designating the $l$-th numerical attribute value and $o^c_{it}$ the $t$-th categorical attribute value. In what follows, we call $o^n_{il}$ a numerical 1D point and $o^c_{it}$ a categorical 1D point.

In our approach, we first estimate, for each object $O_i$, an outlier score in the numerical space and another score in the categorical space. Then, we associate to each data point a two-dimensional outlier score vector $\vec{V}_i$ containing the two estimated scores. Finally, based on $\{\vec{V}_i\}_{i=1,\ldots,N}$, we devise a probabilistic approach that uses the bivariate beta mixture model to automatically discriminate outliers from inliers in the full-dimensional space. Specifically, we first model $\{\vec{V}_i\}$ as a mixture of $M$ bivariate beta components. We then select the component that corresponds to the vectors with the highest score values. Data objects associated with the set of vectors that belong to the selected component correspond to outliers. Fig. 2 provides a simple visual illustration of the proposed approach. More details are given in what follows.

2.1. Estimating Outlier Score in the Numerical Space

It is widely accepted that outliers are data points that are considerably dissimilar from the remaining data (Aggarwal, 2013; Huang and Yang, 2011; Kriegel et al., 2011). In this setting, it is reasonable to assume that, in general, most of the attribute values of outlier objects projected along each of the dimensions in $S_{num}$ tend to be far apart from the remaining attribute values (Tan et al., 2006). On the other hand, inliers have attribute values that tend to be closely related along several (or all) dimensions in $S_{num}$. Our assumption is based on the fact that inliers tend to form dense regions across several dimensions of the numerical space, while outliers are sparsely distributed. With this intuition in mind, we define the outlier score $ON(O^n_i)$ of an object $O_i$ in the numerical attribute space as

$$ON(O^n_i) = \sum_{l=1}^{A_n} \log\big(W_N(o^n_{il}) + 1\big) \qquad (1)$$

with

$$W_N(o^n_{il}) = \sum_{j=1}^{k} \Big[d_l\big(o^n_{il}, kNN_j(o^n_{il})\big)\Big]^2 \qquad (2)$$

where, for a specific dimension $l$ in $S_{num}$, $kNN_j(o^n_{il})$ denotes the $j$-th nearest (1D point) neighbor of $o^n_{il}$ and $d_l$ denotes the distance between two numerical 1D points. In our case, this distance simply corresponds to the absolute value of the difference between two numerical attribute values of a specific dimension.

Figure 3: The estimated outlier scores in the numerical space for the data objects depicted by Fig. 1.

The outlier score defined in (1) is the sum, over all dimensions of the numerical space $S_{num}$, of the log of the weight $W_N(o^n_{il})$. As described by (2), $W_N(o^n_{il})$ computes the sum of the squared distances between the 1D point $o^n_{il}$ and its $k$ nearest neighbors in dimension $l$. Intuitively, a large value of $W_N(o^n_{il})$ means that $o^n_{il}$ falls into a sparse region in which its $k$ nearest neighboring attribute values are loosely related, while a small value indicates that $o^n_{il}$ belongs to a dense region in which its $k$ nearest neighbors are closely related.
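To make the score in (1)-(2) concrete, the following is a minimal Python/NumPy sketch, assuming the numerical attributes are supplied as an N x A_n matrix. The brute-force 1-D neighbor search is used for clarity only (the sorted-attribute shortcut discussed below would replace it in practice), and all function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

def numerical_outlier_scores(X_num, k=None):
    """Outlier score ON of Eq. (1) for every object.

    X_num : (N, A_n) array of numerical attribute values.
    k     : number of 1-D nearest neighbors; defaults to sqrt(N).
    """
    X_num = np.asarray(X_num, dtype=float)
    N, A_n = X_num.shape
    k = max(1, int(np.sqrt(N))) if k is None else k
    scores = np.zeros(N)
    for l in range(A_n):
        col = X_num[:, l]
        # Absolute differences between every pair of values in attribute l
        # (brute force; a sort-based search would be used on large data).
        d = np.abs(col[:, None] - col[None, :])
        np.fill_diagonal(d, np.inf)          # exclude the point itself
        knn = np.sort(d, axis=1)[:, :k]      # k nearest 1-D neighbors
        w = (knn ** 2).sum(axis=1)           # Eq. (2)
        scores += np.log(w + 1.0)            # Eq. (1); +1 avoids log(0)
    return scores
```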
Note that we use the square power in (2) in order to favor the weight of the 1D points belonging to a sparse region.

The weight $W_N(o^n_{il})$ captures the degree of isolation of an attribute value with respect to its neighbors: the higher the weight, the more distant its neighbors along dimension $l$ of $S_{num}$. Accordingly, based on (2), objects that are sparsely distributed over $S_{num}$ receive high $ON(O^n_i)$ values, while related points receive low score values. This means that outliers are characterized by high score values in contrast to inliers. As an illustration, Fig. 3 shows the estimated outlier scores in the numerical space for the data objects depicted by Fig. 1. As can be seen from Fig. 3, outlier objects $O_{16}$, $O_{17}$, and $O_{18}$ have high score values in comparison to the inliers $O_1, \ldots, O_{15}$.

It is important to note that we use the logarithm in (1) primarily to squeeze together the large values that characterize outliers and stretch out the smallest values, which correspond to inliers. This squeezing and stretching enhances the contrast between the largest and smallest values, which helps in distinguishing outliers from the rest of the points. Finally, note that we add 1 to $W_N(o^n_{il})$ in equation (1) to avoid a null value in the calculation of the logarithm, since $W_N(o^n_{il}) = 0$ is possible in the likely case where an attribute has more than $k$ duplicate values.

It is clear that the calculation of the $k$ nearest neighbors is, in general, an expensive task, especially when the number of data points $N$ is very large. However, since we are searching for the $k$ nearest neighbors in the one-dimensional space, we can perform the task efficiently by sorting the values of each attribute and limiting the number of distance comparisons to a maximum of $2k$ values. The computation of the kNN distance is sensitive to the value of $k$, which is a limitation common to all kNN-based approaches. We believe, however, that this limitation does not have a major impact on our approach: since we estimate the kNN distances in the one-dimensional space only, the choice of $k$ is not as critical as in the multi-dimensional case. As suggested in Bouguessa and Wang (2009), to gain a clear idea of the sparseness of the neighborhood of a 1D point, we suggest using $k = \sqrt{N}$ as a default value.

2.2. Estimating Outlier Score in the Categorical Space

As suggested in previous studies (Koufakou et al., 2011; Koufakou and Georgiopoulos, 2010; He et al., 2005), outliers in the categorical space are those points that have infrequent categorical attribute values along all dimensions compared to normal points. This means that every categorical 1D point of an outlier object is infrequent across all dimensions of $S_{cat}$, while inliers have several categorical 1D points which occur with higher frequency along several (or all) categorical attributes (Koufakou et al., 2011; Koufakou and Georgiopoulos, 2010).
Based on such a definition, the outlier score $OC(O^c_i)$ of an object $O_i$ in the categorical attribute space is formulated as

$$OC(O^c_i) = \sum_{t=1}^{A_c} \log\big(W_C(o^c_{it})\big) \qquad (3)$$

with

$$W_C(o^c_{it}) = f(o^c_{it}) \qquad (4)$$

where $f(o^c_{it})$ denotes the number of times $o^c_{it}$ appears in a specific categorical dimension $t$ of $S_{cat}$.

$OC(O^c_i)$ is defined as the sum, across all dimensions of the categorical space $S_{cat}$, of the log of the weight $W_C(o^c_{it})$, which, in turn, corresponds to the occurrence frequency of $o^c_{it}$ in the categorical attribute $t$. Rare categorical attribute values projected along dimension $t$ thus receive low weight values, while larger $W_C(o^c_{it})$ values indicate that $o^c_{it}$ is shared by several objects within dimension $t$. Accordingly, based on (3), points that share common categorical values across $S_{cat}$ get large $OC(O^c_i)$ values, while data objects with infrequent categorical values across $S_{cat}$ receive low $OC(O^c_i)$ values. As a result, since outliers are those points whose categorical attribute values occur very rarely along each dimension of $S_{cat}$ (Koufakou et al., 2011), it is easy to see that small values of $OC(O^c_i)$ designate outliers and high scores correspond to inliers. Finally, note that, as with the numerical outlier score described by (1), we use a logarithm in (3) to enhance the contrast between larger and smaller weight values.

In this paper, as mentioned in Section 1, we assume that outliers are characterized by large score values in contrast to inliers. However, as just discussed, large $OC(O^c_i)$ scores refer to inliers. To regularize such scores, we need to invert them. For this purpose, we simply take the difference between the observed score and the maximum possible estimated score $OC_{max}$. The inverted score is

$$OC_{inv}(O^c_i) = OC_{max} - OC(O^c_i) \qquad (5)$$

It is easy to show that this linear inversion does not affect the ranking stability of the inverted scores:

$$OC(O^c_1) \le OC(O^c_2) \iff -OC(O^c_1) \ge -OC(O^c_2) \iff OC_{max} - OC(O^c_1) \ge OC_{max} - OC(O^c_2) \iff OC_{inv}(O^c_1) \ge OC_{inv}(O^c_2).$$

Accordingly, based on such a linear inversion, outliers receive large score values while inliers receive the lowest score values. In the remainder of this paper, unless otherwise specified, we use only the inverted categorical outlier score values. As an illustration, Fig. 4 shows the estimated outlier scores in the categorical space for the data objects depicted by Fig. 1. As can be seen from Fig. 4, outlier objects $O_{16}$, $O_{17}$, and $O_{18}$ have high score values in comparison to the inliers $O_1, \ldots, O_{15}$.

Figure 4: The estimated outlier scores in the categorical space for the data objects depicted by Fig. 1.

Finally, as the reader may notice, our approach treats numerical and categorical attributes independently in order to estimate outlier scores in the numerical and the categorical space. In other words, we assume the independence of the numerical and categorical attributes. Such an assumption is mainly based on the general definition of outliers, which relies on the fact that outlier objects contain attribute values that are dissimilar to or inconsistent with the remaining points. Stated otherwise, outliers may contain many atypical attribute values across most (or all) attributes of the data in comparison to inliers. Accordingly, investigating individual attributes in order to localize attribute values that deviate significantly from the expected or typical attribute values is appropriate to effectively detect outliers in the whole space.
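A matching sketch for the categorical score and its inversion in (3)-(5) is given below, under the same caveats as before: the categorical attributes are assumed to arrive as an N x A_c array of strings, the maximum observed score is used as OC_max, and the names are illustrative.

```python
import numpy as np

def categorical_outlier_scores(X_cat):
    """Inverted categorical outlier score OC_inv of Eq. (5) for every object.

    X_cat : (N, A_c) array of categorical (string) values.
    """
    X_cat = np.asarray(X_cat, dtype=object)
    N, A_c = X_cat.shape
    oc = np.zeros(N)
    for t in range(A_c):
        # Occurrence frequency of each value in attribute t, Eq. (4).
        values, counts = np.unique(X_cat[:, t], return_counts=True)
        freq = dict(zip(values, counts))
        oc += np.log([freq[v] for v in X_cat[:, t]])   # Eq. (3)
    # Eq. (5): invert so that rare value combinations get the highest scores
    # (here the observed maximum is taken as OC_max).
    return oc.max() - oc
```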
2.3. Modeling Outlier Score Vectors

Once the outlier scores are estimated in both the numerical and the categorical spaces, we now focus on how to automatically identify outliers in the mixed-attribute space. To this end, we associate to each object $O_i$ a two-dimensional vector $\vec{V}_i$ such that the first element of this vector corresponds to the outlier score of $O_i$ in the numerical space, while the second element represents the outlier score of $O_i$ in the categorical space. Then, based on the estimated vectors, we propose a probabilistic approach that uses the bivariate beta mixture model to automatically discriminate outliers from inliers in the full-dimensional space. The probabilistic framework is described in what follows.

2.3.1. The Bivariate Beta Mixture Model

Since the beta distribution is defined on the interval [0,1], we first, without loss of generality, normalize the estimated outlier score values between 0 and 1. Let $\vec{V}_i = (V_{i1}, V_{i2})^T$, where $V_{i1}$ and $V_{i2}$ represent, respectively, the normalized outlier scores $ON(O^n_i)$ and $OC_{inv}(O^c_i)$. Under a mixture of bivariate beta distributions,

$$\vec{V}_i \sim \sum_{m=1}^{M} \lambda_m \, B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m) \qquad (6)$$

where $B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m)$ is the $m$-th bivariate beta distribution and $M$ denotes the number of components in the mixture; $\vec{x} = \{\vec{x}_1, \ldots, \vec{x}_M\}$ and $\vec{y} = \{\vec{y}_1, \ldots, \vec{y}_M\}$, where $\vec{x}_m = (x_{m1}, x_{m2})^T$ and $\vec{y}_m = (y_{m1}, y_{m2})^T$ are the parameters of the $m$-th component. $\lambda = \{\lambda_1, \ldots, \lambda_M\}$ represents the mixing coefficients, such that $\sum_{m=1}^{M} \lambda_m = 1$ and $\lambda_m > 0$.

The bivariate beta distribution can be obtained by cascading two beta variables together, that is, each element of the two-dimensional vector $\vec{V}_i$ is a scalar beta variable. In other words, the bivariate beta is the product of two univariate beta densities. Accordingly, the probability density function of the $m$-th bivariate beta component is expressed as

$$B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m) = \prod_{d=1}^{2} B(V_{id} \mid x_{md}, y_{md}) \qquad (7)$$

where $B(V_{id} \mid x_{md}, y_{md})$ is the probability density function of the univariate beta distribution, given by

$$B(V_{id} \mid x_{md}, y_{md}) = \frac{\Gamma(x_{md} + y_{md})}{\Gamma(x_{md})\,\Gamma(y_{md})}\, V_{id}^{\,x_{md}-1} (1 - V_{id})^{\,y_{md}-1} \qquad (8)$$

where $\Gamma(\cdot)$ is the gamma function, $\Gamma(\alpha) = \int_0^{\infty} \beta^{\alpha-1} \exp(-\beta)\, d\beta$, $\beta > 0$.
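The densities in (6)-(8) map directly onto scipy.stats.beta. The short sketch below, with illustrative names and the shape parameters of each component passed as length-2 arrays, is reused by the EM sketch further on.

```python
import numpy as np
from scipy.stats import beta

def bivariate_beta_pdf(V, x_m, y_m):
    """Eq. (7): product of two univariate beta densities B(V_id | x_md, y_md).

    V        : (N, 2) array of normalized score vectors in (0, 1).
    x_m, y_m : length-2 arrays holding the shape parameters of component m.
    """
    return beta.pdf(V[:, 0], x_m[0], y_m[0]) * beta.pdf(V[:, 1], x_m[1], y_m[1])

def mixture_pdf(V, lam, x, y):
    """Eq. (6): density of the M-component bivariate beta mixture."""
    return sum(lam[m] * bivariate_beta_pdf(V, x[m], y[m]) for m in range(len(lam)))
```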
2.3.2. Maximum Likelihood Estimate

A common approach for estimating the unknown parameters $x_{md}$ and $y_{md}$ $(m = 1, \ldots, M;\ d = 1, 2)$ is maximum likelihood estimation. The likelihood function corresponding to the $m$-th bivariate beta component $B_m$ is defined as

$$L\big(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m)\big) = \prod_{\vec{V}_i \in B_m} B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m) = \prod_{\vec{V}_i \in B_m} \prod_{d=1}^{2} B(V_{id} \mid x_{md}, y_{md}) \qquad (9)$$

The logarithm of the likelihood function is given by

$$\log\Big[L\big(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m)\big)\Big] = \sum_{i=1}^{N_m} \sum_{d=1}^{2} \log\big[B(V_{id} \mid x_{md}, y_{md})\big] \qquad (10)$$

where $N_m$ is the size of the $m$-th component.

We note that the parameter pair $\{x_{md}, y_{md}\}$ is independent of all other pairs. The problem of estimating the parameters of the model can thus be reduced to estimating the pair $\{x_{md}, y_{md}\}$ independently over each dimension of the outlier score vectors belonging to component $m$. In this setting, the values $\{\hat{x}_{md}, \hat{y}_{md}\}$ that maximize the likelihood are obtained by taking the derivative of the expectation of the log-likelihood function with respect to $x_{md}$ and $y_{md}$ and setting the gradient equal to zero:

$$\begin{pmatrix} \dfrac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial x_{md}} \\[2ex] \dfrac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial y_{md}} \end{pmatrix} = 0 \qquad (11)$$

where

$$\frac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial x_{md}} = \sum_{i=1}^{N_m} \frac{\partial}{\partial x_{md}} \log\!\Big(\frac{\Gamma(x_{md}+y_{md})}{\Gamma(x_{md})\Gamma(y_{md})} V_{id}^{\,x_{md}-1}(1-V_{id})^{\,y_{md}-1}\Big) = N_m \frac{\Gamma'(x_{md}+y_{md})}{\Gamma(x_{md}+y_{md})} - N_m \frac{\Gamma'(x_{md})}{\Gamma(x_{md})} + \sum_{i=1}^{N_m} \log(V_{id}) \qquad (12)$$

and

$$\frac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial y_{md}} = N_m \frac{\Gamma'(x_{md}+y_{md})}{\Gamma(x_{md}+y_{md})} - N_m \frac{\Gamma'(y_{md})}{\Gamma(y_{md})} + \sum_{i=1}^{N_m} \log(1-V_{id}) \qquad (13)$$

Equations (11), (12) and (13) yield

$$\begin{pmatrix} N_m\big[\psi(x_{md}+y_{md}) - \psi(x_{md})\big] + \sum_{i=1}^{N_m} \log(V_{id}) \\[1ex] N_m\big[\psi(x_{md}+y_{md}) - \psi(y_{md})\big] + \sum_{i=1}^{N_m} \log(1-V_{id}) \end{pmatrix} = 0 \qquad (14)$$

where $\psi(\cdot)$ is the digamma function, $\psi(\alpha) = \Gamma'(\alpha)/\Gamma(\alpha)$.

Since the digamma function is defined through an integration, a closed-form solution to (14) does not exist, so the parameter pair $\{x_{md}, y_{md}\}$ is estimated using the Newton-Raphson method (Ypma, 1995). Specifically, $\{x_{md}, y_{md}\}$ are estimated iteratively:

$$\begin{pmatrix} x_{md}^{(I+1)} \\ y_{md}^{(I+1)} \end{pmatrix} = \begin{pmatrix} x_{md}^{(I)} \\ y_{md}^{(I)} \end{pmatrix} - [\vec{h}_m]^T [H_m]^{-1} \qquad (15)$$

where $I$ is the iteration index, and $\vec{h}_m$ and $H_m$ are, respectively, the vector of first derivatives and the matrix of second derivatives of the log-likelihood function of the $m$-th component.

The vector $\vec{h}_m$ is defined as

$$\vec{h}_m = \begin{pmatrix} h_{1m} \\ h_{2m} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial x_{md}} \\[2ex] \dfrac{\partial E\big(\log[L(B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m))]\big)}{\partial y_{md}} \end{pmatrix} \qquad (16)$$

and the matrix $H_m$ is expressed as

$$H_m = \begin{pmatrix} \dfrac{\partial h_{1m}}{\partial x_{md}} & \dfrac{\partial h_{1m}}{\partial y_{md}} \\[1ex] \dfrac{\partial h_{2m}}{\partial x_{md}} & \dfrac{\partial h_{2m}}{\partial y_{md}} \end{pmatrix}$$

where

$$\frac{\partial h_{1m}}{\partial x_{md}} = N_m\big[\psi'(x_{md}+y_{md}) - \psi'(x_{md})\big], \quad \frac{\partial h_{1m}}{\partial y_{md}} = \frac{\partial h_{2m}}{\partial x_{md}} = N_m\,\psi'(x_{md}+y_{md}), \quad \frac{\partial h_{2m}}{\partial y_{md}} = N_m\big[\psi'(x_{md}+y_{md}) - \psi'(y_{md})\big] \qquad (17)$$

$\psi'(\cdot)$ is the trigamma function, given by $\psi'(\alpha) = \frac{\Gamma''(\alpha)}{\Gamma(\alpha)} - \big[\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}\big]^2$.

The Newton-Raphson update (15) is iterated until the estimates of $x_{md}$ and $y_{md}$ change by less than a small positive value $\epsilon$ between successive iterations, at which point they are taken as $\hat{x}_{md}$ and $\hat{y}_{md}$. Note that, in our implementation, we use the method of moments estimators of the beta distribution (Bain and Engelhardt, 2000) to define the starting values $\{x_{md}^{(0)}, y_{md}^{(0)}\}$ in (15). In this technique, the expected mean of the distribution is equated to the sample mean and the expected variance to the sample variance. Specifically, the method of moments estimators are

$$\hat{x}_{md}^{(0)} = \mu_{md}\Big[\frac{\mu_{md}(1-\mu_{md})}{\sigma^2_{md}} - 1\Big], \qquad \hat{y}_{md}^{(0)} = (1-\mu_{md})\Big[\frac{\mu_{md}(1-\mu_{md})}{\sigma^2_{md}} - 1\Big] \qquad (18)$$

where $\mu_{md}$ and $\sigma^2_{md}$ denote, respectively, the sample mean and variance of the normalized outlier score vectors belonging to the $m$-th component, projected along dimension $d$.
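A sketch of the per-dimension estimation just described is given below, using scipy.special for the digamma and trigamma functions. It assumes the values v of one dimension of one component lie strictly inside (0, 1), includes no safeguards such as step halving, and uses illustrative names throughout.

```python
import numpy as np
from scipy.special import psi, polygamma   # digamma and trigamma

def moments_init(v):
    """Eq. (18): method-of-moments starting values for one dimension."""
    mu, var = v.mean(), v.var()
    c = mu * (1.0 - mu) / var - 1.0
    return mu * c, (1.0 - mu) * c

def beta_shape_mle(v, tol=1e-6, max_iter=100):
    """Newton-Raphson solution of Eq. (14) for one shape pair (x_md, y_md)."""
    n = len(v)
    s1, s2 = np.log(v).sum(), np.log(1.0 - v).sum()
    x, y = moments_init(v)
    for _ in range(max_iter):
        # Gradient of the log-likelihood, Eqs. (12)-(14).
        g = np.array([n * (psi(x + y) - psi(x)) + s1,
                      n * (psi(x + y) - psi(y)) + s2])
        # Hessian, Eq. (17); polygamma(1, .) is the trigamma function.
        t = polygamma(1, x + y)
        H = n * np.array([[t - polygamma(1, x), t],
                          [t, t - polygamma(1, y)]])
        step = np.linalg.solve(H, g)           # Newton step of Eq. (15)
        x, y = x - step[0], y - step[1]
        if np.abs(step).max() < tol:
            break
    return x, y
```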
2.3.3. EM Algorithm for the Bivariate Beta Mixture

Let $P = \{\lambda_1, \ldots, \lambda_M, \vec{x}_1, \ldots, \vec{x}_M, \vec{y}_1, \ldots, \vec{y}_M\}$ denote the set of parameters of the mixture and $V = \{\vec{V}_1, \ldots, \vec{V}_N\}$ the set of normalized outlier score vectors. The usual choice for obtaining the maximum likelihood estimate of the distribution parameters is the EM algorithm (Dempster, Laird, and Rubin, 1977). This algorithm is based on the interpretation of $V$ as incomplete data. As mentioned in Figueiredo and Jain (2002), for a finite mixture the missing part is a set of $N$ label vectors $\eta = \{\vec{\eta}_1, \ldots, \vec{\eta}_N\}$ associated with the $N$ outlier score vectors, indicating to which component each $\vec{V}_i$ belongs. Specifically, each $\vec{\eta}_i = (\eta_{i1}, \ldots, \eta_{iM})^T$ is a binary vector, where $\eta_{im} = 1$ if $\vec{V}_i$ belongs to component $m$ and $\eta_{im} = 0$ otherwise.

The complete data are thus defined by the sets $\eta$ and $V$. The likelihood of the complete data is then

$$L(V, \eta \mid P) = \prod_{i=1}^{N} \prod_{m=1}^{M} \big[\lambda_m B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m)\big]^{\eta_{im}} \qquad (19)$$

and the complete log-likelihood is

$$\log\big(L(V, \eta \mid P)\big) = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \log\big[\lambda_m B_m(\vec{V}_i \mid \vec{x}_m, \vec{y}_m)\big] = \sum_{i=1}^{N} \sum_{m=1}^{M} \eta_{im} \Big[\log(\lambda_m) + \sum_{d=1}^{2} \log\big(B(V_{id} \mid x_{md}, y_{md})\big)\Big] \qquad (20)$$

The EM algorithm can now be used to estimate $P$. Specifically, the algorithm iterates between an Expectation step and a Maximization step in order to produce a sequence of estimates $\{\hat{P}\}^{(I)}$, $I = 0, 1, 2, \ldots$, where $I$ denotes the current iteration, until the change in the value of the complete log-likelihood (20) is negligible. Details of each step are given below.

In the Expectation step, each latent variable $\eta_{im}$ is replaced by its expectation

$$\hat{\eta}_{im}^{(I)} = E[\eta_{im} \mid \vec{V}_i, P] = \frac{\hat{\lambda}_m^{(I)} B_m(\vec{V}_i \mid \hat{\vec{x}}_m, \hat{\vec{y}}_m)}{\sum_{j=1}^{M} \hat{\lambda}_j^{(I)} B_j(\vec{V}_i \mid \hat{\vec{x}}_j, \hat{\vec{y}}_j)} \qquad (21)$$

In the Maximization step, the mixing coefficients $\{\lambda_m\}$ and the parameters $\{\vec{x}_1, \ldots, \vec{x}_M, \vec{y}_1, \ldots, \vec{y}_M\}$ are calculated using the values of $\hat{\eta}_{im}$ estimated in the Expectation step. Specifically, the mixing coefficients are calculated as

$$\hat{\lambda}_m^{(I+1)} = \frac{\sum_{i=1}^{N} \hat{\eta}_{im}^{(I)}}{N}, \qquad m = 1, \ldots, M \qquad (22)$$

The parameters $\{\vec{x}_m = (x_{m1}, x_{m2})^T\}_{m=1,\ldots,M}$ and $\{\vec{y}_m = (y_{m1}, y_{m2})^T\}_{m=1,\ldots,M}$ are estimated with the Newton-Raphson algorithm, based on (15), as described in the previous subsection.

Finally, note that the EM algorithm requires initial parameters for each component. Since EM is highly dependent on initialization, it is helpful to perform initialization by means of a clustering algorithm (Figueiredo and Jain, 2002). For this purpose, we use the k-means algorithm to partition the set $\{\vec{V}_i\}_{i=1,\ldots,N}$ into $M$ components. Then, based on this partition, we estimate the initial parameters of each component using the method of moments estimators of the beta distribution (Bain and Engelhardt, 2000) and pass them as initial parameters to the EM algorithm. The detailed procedure is summarized in Algorithm 1.

Algorithm 1: EM algorithm for the bivariate beta mixture
Input: {V_i} (i = 1, ..., N); M
Output: P̂ = {λ̂_1, ..., λ̂_M, x̂_1, ..., x̂_M, ŷ_1, ..., ŷ_M}
1. Initialization: apply the k-means algorithm to cluster the set {V_i} into M components;
2. Estimate the initial set of parameters of each component using (18);
3. repeat
4.   Expectation: estimate {η̂_im} (i = 1, ..., N; m = 1, ..., M) using (21);
5.   Maximization: estimate {λ̂_m} (m = 1, ..., M) using (22);
6.   Estimate {x̂_md, ŷ_md} (m = 1, ..., M; d = 1, 2) using (15);
7. until the change in (20) is negligible;
8. Return P̂;
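A compact EM sketch in the spirit of Algorithm 1 is given below, reusing moments_init, beta_shape_mle and bivariate_beta_pdf from the previous sketches. The k-means initialization is a few plain Lloyd iterations, the M-step re-fits each component from a hard assignment of vectors to their most probable component (a simplification of the weighted update), and degenerate or empty components are not guarded against.

```python
import numpy as np

def em_bivariate_beta(V, M, n_iter=200, tol=1e-6, seed=0):
    """EM for the bivariate beta mixture (Algorithm 1, simplified sketch).

    Returns mixing weights lam (M,), shape parameters x, y (M, 2) and the
    posterior responsibilities eta (N, M) of Eq. (21).
    """
    V = np.clip(np.asarray(V, dtype=float), 1e-6, 1.0 - 1e-6)
    N = len(V)
    rng = np.random.default_rng(seed)

    # Initialization: a few Lloyd (k-means) iterations on the score vectors.
    centers = V[rng.choice(N, M, replace=False)]
    for _ in range(20):
        labels = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([V[labels == m].mean(0) for m in range(M)])

    lam = np.array([(labels == m).mean() for m in range(M)])
    x, y = np.empty((M, 2)), np.empty((M, 2))
    for m in range(M):
        for d in range(2):
            x[m, d], y[m, d] = moments_init(V[labels == m, d])   # Eq. (18)

    prev = -np.inf
    for _ in range(n_iter):
        # E-step, Eq. (21): posterior probability of each component.
        dens = np.column_stack([lam[m] * bivariate_beta_pdf(V, x[m], y[m])
                                for m in range(M)])
        eta = dens / dens.sum(axis=1, keepdims=True)

        # M-step, Eq. (22) and Eq. (15): update mixing weights and shape
        # parameters from the current (hard) component memberships.
        lam = eta.mean(axis=0)
        labels = eta.argmax(axis=1)
        for m in range(M):
            for d in range(2):
                x[m, d], y[m, d] = beta_shape_mle(V[labels == m, d])

        loglik = np.log(dens.sum(axis=1)).sum()
        if abs(loglik - prev) < tol:          # change in (20) negligible
            break
        prev = loglik
    return lam, x, y, eta
```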
2.3.4. Estimating the Optimal Number of Components in the Mixture

The use of a mixture of bivariate beta distributions allows us to build a flexible model of the outlier score vectors. To form such a model, we need to estimate $M$, the number of components, and the parameters of each component. Several model selection approaches have been proposed to estimate $M$ (Bouguessa, Wang, and Sun, 2006). In this paper, we implement a deterministic approach that uses the EM algorithm described in Algorithm 1 to obtain a set of candidate models over a range of values of $M$ (from 1 to $M_{max}$, the maximum number of components in the mixture) which is assumed to contain the optimal $M$ (Figueiredo and Jain, 2002). The number of components is then selected according to

$$\hat{M} = \operatorname*{arg\,min}_{M=1,\ldots,M_{max}} \big\{C(\hat{P}, M)\big\} \qquad (23)$$

where $C(\hat{P}, M)$ is some model selection criterion. Ji et al. (2005) found that the Integrated Classification Likelihood-Bayesian Information Criterion (ICL-BIC) performs well in selecting the number of components in a beta mixture. ICL-BIC has also been used in Dean and Nugent (2013) to select the number of beta mixture components. Accordingly, we use ICL-BIC in our method to identify the optimal number of components. The ICL-BIC criterion is given by

$$ICL\text{-}BIC(\hat{P}, M) = -2\log\big(L(V, \hat{\eta} \mid \hat{P})\big) + Q_M \log(N) - 2\sum_{i=1}^{N}\sum_{m=1}^{M} \hat{\eta}_{im} \log(\hat{\eta}_{im}) \qquad (24)$$

where $Q_M$ denotes the number of parameters of the model with $M$ components and $\log(L(V, \hat{\eta} \mid \hat{P}))$ is the logarithm of the likelihood at the maximum likelihood solution of the investigated mixture model. The number of components that minimizes $ICL\text{-}BIC(\hat{P}, M)$ is considered to be the optimal value of $M$. The procedure for estimating the number of components in the mixture is summarized in Algorithm 2.

Algorithm 2: Estimating the number of components in the mixture
Input: {V_i} (i = 1, ..., N); M_max
Output: the optimal number of components M̂
1. for M = 1 to M_max do
2.   if M = 1 then estimate {x̂_d, ŷ_d} (d = 1, 2) using (15);
3.   else estimate the parameters of the mixture using Algorithm 1;
4.   Estimate ICL-BIC(P̂, M) using (24);
5. end for
6. Select M̂ = arg min_M ICL-BIC(P̂, M);
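Model selection per (23)-(24) can then be sketched as follows, reusing em_bivariate_beta and bivariate_beta_pdf from above. The free-parameter count Q_M = 5M - 1 (four beta shape parameters per component plus M - 1 independent mixing weights) and the use of the observed-data log-likelihood are our own reading of the criterion, not stated explicitly in the paper.

```python
import numpy as np

def icl_bic(V, lam, x, y, eta):
    """Eq. (24): ICL-BIC of a fitted bivariate beta mixture."""
    V = np.clip(np.asarray(V, dtype=float), 1e-6, 1.0 - 1e-6)
    N, M = eta.shape
    dens = np.column_stack([lam[m] * bivariate_beta_pdf(V, x[m], y[m])
                            for m in range(M)])
    loglik = np.log(dens.sum(axis=1)).sum()
    Q_M = 5 * M - 1                                   # assumed parameter count
    entropy = -(eta * np.log(np.clip(eta, 1e-12, None))).sum()
    return -2.0 * loglik + Q_M * np.log(N) + 2.0 * entropy

def select_num_components(V, M_max=5):
    """Algorithm 2: keep the mixture with the smallest ICL-BIC."""
    best, best_fit = np.inf, None
    for M in range(1, M_max + 1):
        lam, x, y, eta = em_bivariate_beta(V, M)
        crit = icl_bic(V, lam, x, y, eta)
        if crit < best:
            best, best_fit = crit, (lam, x, y, eta)
    return best_fit
```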
2.3.5. Automatic Identification of Outliers

Once the optimal number of components has been identified, we focus on detecting the bivariate beta component that corresponds to outliers. To this end, we use the results of the EM algorithm to derive a classification decision about which outlier score vector $\vec{V}_i$ belongs to which component of the mixture. The EM algorithm yields the final estimated posterior probability $\hat{\eta}_{im}$, whose value represents the posterior probability that $\vec{V}_i$ belongs to component $m$. We assign $\vec{V}_i$ to the component that corresponds to the maximum value of $\hat{\eta}_{im}$, thereby dividing the set of outlier score vectors into several components. As discussed earlier, we assume that outlier points are characterized by high score values. Therefore, we are interested in the bivariate beta component which contains the vectors with the highest score values. To identify this component, we first compute, for each component in the mixture, the average value of the numerical outlier scores and the average value of the categorical outlier scores (that is, we compute the averages of $V_{i1}$ and of $V_{i2}$ per component). Then, we select the component with the largest average values as our target component. This simple strategy for determining which component to pick works well in practice since it fits our assumption that outlier points are characterized by large score values in both the numerical and the categorical space. Finally, once the target component is identified, we identify the set of data objects associated with the outlier score vectors $\vec{V}_i$ that belong to it; the identified objects are outliers. The steps described in Algorithm 3 can be used to automatically identify outliers.

Algorithm 3: Automatic identification of outliers
Input: a data set D
Output: a set of outliers OUT
1. Estimate {ON(O^n_i)} (i = 1, ..., N) using (1);
2. Estimate {OC_inv(O^c_i)} (i = 1, ..., N) using (3) and (5);
3. Associate to each object O_i in D a vector V_i = (V_i1, V_i2)^T, where V_i1 and V_i2 represent, respectively, the normalized values of ON(O^n_i) and OC_inv(O^c_i) in [0,1];
4. Apply Algorithm 2 to cluster {V_i} (i = 1, ..., N) into M bivariate beta components;
5. Use the results of the EM algorithm to decide the membership of each outlier score vector V_i in the components;
6. Select the bivariate beta component that contains the vectors with the highest score values;
7. Identify the objects of D associated with the set of V_i that belong to the selected component and store them in OUT;
8. Return OUT;

Finally, it is worth noting that the proposed methodology can also be used to identify outlier objects in single-type (categorical or numerical) attribute data. In this particular case, we associate to each object only one score ($ON(O^n_i)$ or $OC_{inv}(O^c_i)$, depending on the attribute type of the data under investigation). Then, to automatically discriminate between outliers and inliers, we model the estimated scores as a finite mixture of univariate beta distributions, given by (8). The problem is thus reduced from modeling a set of two-dimensional outlier score vectors $\{\vec{V}_i\}_{i=1,\ldots,N}$ (in the case of mixed-attribute data) to modeling a list of scalar outlier score values ($\{ON(O^n_i)\}_{i=1,\ldots,N}$ or $\{OC_{inv}(O^c_i)\}_{i=1,\ldots,N}$). In this setting, the parameters of the univariate beta mixture model to be estimated are $\{\lambda_m, x_m, y_m\}_{m=1,\ldots,M}$. These parameters and the optimal number of components in the mixture are estimated using the EM algorithm with the Newton-Raphson method and ICL-BIC, as described in the subsections above. By doing so, we divide the outlier scores into several populations so that the larger scores can be identified and the associated objects declared as outliers.

3. Experimental Evaluation

In this section, we devise an empirical study to evaluate the suitability of the proposed approach. In the following, we first describe the technique that we have adopted to produce data for use in outlier detection and the performance metrics used in the evaluation. Next, we illustrate the effectiveness of our approach in identifying outliers in mixed-attribute data. Finally, we devise further experiments to evaluate the performance of the proposed methodology in detecting outliers in single-type attribute data.

3.1. Data Preparation and Metrics

We draw the attention of the reader to the fact that, at the time of writing this paper, there is a shortage of standard benchmark data that can be used to evaluate outlier detection algorithms.
Most of the publicly available labeled500 data are primarily designed for classification and machine learning applications.501 If the real data are unlabeled, then the evaluation of outlier detection accuracy502 must be done based on domain knowledge or with the help of a domain expert.503 However, this scenario is not practical for the purpose of evaluation since domain504 knowledge is not always available. All these factors make the evaluation of the505 proposed methodology a challenging task.506 In this paper, we saliently illustrate the performance of our approach in507 handling outliers using real data from the UCI Machine Learning Repository 1.508 Most of these data sets are labeled for classification purposes. Here, we have509 to be aware of the fact that these class labels are not the perfect ground truth510 in the sense that they do not correspond necessarily to potential outliers in511 the data. Keeping these issues in mind, we have adopted a principled way to512 produce real data for use in outlier detection.513 In our experiments, similar to the work in (Das and Schneider, 2007), we514 create simulated outlier objects by randomly selecting attribute values. Specif-515 ically, in the numerical attribute space, we first normalize the attribute values516 of each numerical attribute onto the interval [0, 1] and the then inject outlier517 points whose attribute values are randomly selected from [0, 1]. As a result of518 this process, all the points in our data sets have coordinates in the range [0, 1]519 and are either normal points or outliers. Note that the outliers are distributed520 at random throughout the entire space. On the other hand, to obtain outliers521 in the categorical space, we inject novel objects in the data set in such a way522 that, for each dimension t, the attribute value of the newly generated object is523 randomly selected from the whole set of distinct categorical values that form524 1http://archive.ics.uci.edu/ml/ 25 dimension t in the original data. Outliers in the mixed-attribute space are a525 random combination of the newly generated objects in both the numerical and526 the categorical spaces.527 For the purpose of evaluation, we used the following standard metrics: (1)528 Accuracy, which corresponds to the proportion of correctly partitioned objects,529 (2) True Positive Rate (TPR), measuring the proportion of outliers that are530 correctly identified as outliers, (3) False Positive Rate (FPR), corresponding to531 the proportion of inliers incorrectly classified as outliers, and (4) F-measure of532 the outliers class, corresponding to the harmonic mean between precision and533 recall of the outlier objects class.534 3.2. Experiments on Mixed-Attribute Data535 The goal of the experiments conducted in this section are to evaluate the536 suitability of the proposed approach in handling outliers in mixed-attribute data.537 We compare the performance of our approach to that of ODMAD (Koufakou538 and Georgiopoulos, 2010), the most recent approach for detecting outliers in the539 mixed-attribute space. Note that ODMAD requires a number of parameters to540 be set by the user. For fairness in comparison, several values were tried for the541 parameters of ODMAD, following the suggestions in its original paper, and we542 report results for the parameter settings that produced the best results. 
Note543 that the selection of the best result here refers to the best F-measure value,544 since this metric represents a good trade-off between TPR and FPR.545 We considered mixed-attribute data sets taken from the UCI Machine Learn-546 ing Repository. As mentioned in the previous subsection, to obtain data sets547 for use in outlier detection, we generated outlier objects by randomly flipping548 attribute values. We fixed the number of outliers injected in each set to 10%549 of the original data set size under investigation. Fig. 5 summarizes the main550 characteristics of the data sets used in our experiments. Note that some data551 sets (such as Credit Approval, Automobile and Cylinder Bands) originally con-552 tain a number of objects with missing attribute values. In our experiments, we553 simply ignore such objects.554 26 Figure 5: Mixed-attribute data sets characteristics. (a) Credit Approval (b) Heart (c) AutoUniv (au 6) Figure 6: Estimated density curve of the outlier score vectors that correspond to three mixed-attribute data sets. We used our approach to identify outliers in each of the mixed-attribute555 data sets considered in these experiments. To this end, we set M max to 5 and556 then, as discussed in Section 2, we selected the optimal number of components557 that minimize ICL-BIC. Here, the reader should be aware that the value of558 M max is not limited to 5 and the user can set any other value. Interestingly,559 we found that the estimated outlier score vectors in each of the ten data sets are560 well fitted by three bivariate beta components. For the purpose of illustration561 and in order to not encumber the paper, we show in Fig. 6 the estimated562 probability density function of the outlier score vectors, that corresponds to563 Credit Approval, Heart and AutoUniv (au 6) only. Data points associated with564 the bivariate beta component that contains the score vectors with the highest565 values correspond to outliers. Recall that the identification of the component566 containing the highest score values follows the procedure described in Section567 2.3.5.568 27 Figure 7: Performance results on mixed-attribute data sets. Fig. 7 compares the proposed method with ODMAD. Shaded regions in this569 figure correspond to the best values of the four evaluation metrics considered570 in the experiment. As can be seen from Fig. 7, our approach achieves the571 highest true positive rates and F-measure values across all the data sets under572 investigation and reports low false positive rates with high accuracy values. In573 fact, the proposed method achieves, on average, an accuracy of 96.53%, TPR574 and FPR of 92.45% and 3.01% respectively and finally an F-measure of 0.847, all575 pointing to fairly accurate results. On the other hand, the results provided by576 ODMAD are, on average, reasonable but less competitive than those achieved by577 our approach. As depicted by Fig. 7, ODMAD reports, on average, an accuracy578 of 94.75%, TPR and FPR of 72.22% and 2.93% respectively and finally an F-579 measure of 0.717. Overall, in term of Accuracy, TP rate and F-measure, the580 proposed method performs better than ODMAD while the FPR achieved by581 both approaches are comparable.582 From Fig 7, we observe that our proposed method reports an average 92.45%583 TP rate. This means that 7.55%, on average, of outliers were misclassified as584 inliers by our approach. This not necessarily an error, since data points have585 coordinates in the range [0, 1] and are either inliers or outliers. 
Outliers were586 randomly placed throughout the entire space. In this setting, it is probable that587 some of the outlier objects will have attribute values related to normal objects588 in the data set under investigation. Under these circumstances, it is possible589 that few outlier objects will have low outlier score values and consequently be590 28 considered as inliers.591 To summarize, the results presented in Fig. 7, suggest that the proposed592 method performs well on different data sets. Furthermore, in contrast to ODMAD593 which suffers from its dependency on several input parameters (detection thresh-594 old, minimum support, the maximum length of itemset and the size of a window595 of categorical and numerical scores), our approach is able to accurately iden-596 tify outliers in an automatic fashion. Such a notable feature of our approach597 illustrates its practical usability to effectively identify outliers in real-life appli-598 cations. Another advantage of our approach is that it is able to handle out-599 liers in single-type (numerical or categorical) attribute data without any feature600 transformation, while existing methods are not able to do so. The following601 two subsections investigate this point using real data sets characterized by only602 numerical or categorical attributes.603 3.3. Experiments on Numerical Data604 The experiments described in this section aim to illustrate the capability605 of the proposed methodology in detecting outlier objects in numerical data.606 As discussed at the end of Section 2, when the data contains only numerical607 attributes, we associate to each object the numerical score ON(Oni ) given by608 (1). Then, we model these scores as a mixture of univariate beta mixture.2609 The parameters of the model {λm,xm,ym}(m=1,...,M) and the optimal number610 of components in the mixture are estimated following the reasoning described611 in Section 2. This process results in grouping outlier scores into several beta612 components. Data objects associated with the beta component containing the613 highest score values are declared outliers.614 Fig. 8 summarizes the main characteristics of the UCI numerical data sets615 used in the experiments. Note that, as with the experiments on mixed-attribute616 data, we have adopted the same technique to produce outliers, that is, normal-617 izing the attribute values between 0 and 1 and then injecting outliers in the618 2To fit the beta distribution, the estimated outlier scores should be first normalized in [0,1]. 29 Figure 8: Numerical data sets characteristics. (a) Ecoli (b) Wine Quality - Red (c) Breast Cancer Figure 9: Estimated density curves of the numerical outlier scores that corre- spond to three numerical data sets. data by generating objects whose attribute values are randomly selected from619 the interval [0,1]. The number of outliers injected in each data set corresponds620 to 10% of the original data set size. For each numerical data set, we estimated621 ON(Oni ) for each object and then modelled these scores as a mixture of univari-622 ate beta distribution. To this end, we set M max to 5 and selected the optimal623 number of components that minimize ICL-BIC. We found that the number of624 components varies from two to three beta components. For the purpose of il-625 lustration, Fig. 9 shows the density curve of the numerical outlier scores that626 correspond to three UCI data sets: Ecoli, Wine Quality - Red and Wisconsin627 Diagnostic Breast Cancer. 
The last component in each plot depicted by Fig.628 9 represents the highest score values . Data points associated with the scores629 grouped in this component correspond to outliers.630 To demonstrate the effectiveness of our approach, we compared its perfor-631 mance to that of kNN weighed outlier algorithm (kNNW) (Angiulli and Pizzuti,632 2005, 2002). kNNW assigns a weight to each data point based on the sum of633 30 Figure 10: Performance results on numerical data sets. the distances separating that point from its k nearest neighbors in such a way634 that outliers are characterized by high weights while inliers receive low weight635 values. After ranking data points based on the estimated weights, the top n636 points are identified as outliers. The implementation of this algorithm, and637 many other outlier detection approaches, is available in the ELKI Data Min-638 ing Framework 3 (Achtert, Kriegel, Schubert, and Zimek, 2013). Note that we639 have chosen kNNW for its effectiveness. In fact, in our empirical investigation,640 we have evaluated several other mainstream outlier detection algorithms, such641 as COP (Kriegel, Kroger, Schubert, and Zimek, 2012), LDOF (Zhang, Hutter,642 and Jin, 2009), LOCI (Papadimitriou, Kitagawa, Gibbons, and Faloutsos, 2003)643 and LOF (Breunig et al., 2000), already implemented in ELKI. We found that644 kNNW was the algorithm which performs well.645 Fig. 10 illustrates the results of our approach and those of kNNW on the646 numerical data sets considered in the experiments. Shaded regions correspond647 to the best Accuracy, TPR, FPR and F-measure values. Recall that kNNW648 produces a ranked list of points expecting outliers to come first. Accordingly,649 to distinguish outliers from inliers, the user should specify the target number of650 outliers n. In this setting, and in order to compute the value of the four evalu-651 ation metrics used in the experiments (Accuracy, TPR, FPR and F-measure),652 we have simply set the value of n equal to the real number of outliers in the653 3http://elki.dbs.ifi.lmu.de 31 Figure 11: Categorical data sets characteristics. data set under investigation. Finally note that, as with ODMAD, we have tried654 multiple values of k for kNNW, and we only report the best results, that is,655 those which correspond to the highest F-measure value.656 As can be seen from Fig. 10, our approach achieves, on average, the highest657 Accuracy (97.60%), TPR (91.94%) and F-measure (0.884). On the other hand,658 kNNW reports the lowest average FPR (1.50%) while our approach achieves an659 average FPR of 1.88%. Overall, both competing algorithms show good perfor-660 mances. A significant advantage of our approach is that it is able to automati-661 cally discriminate outliers from inliers while with kNNW the user should specify662 how many points should be selected as outliers.663 3.4. Experiments on Categorical Data664 The aim of this section is to illustrate the suitability of the proposed ap-665 proach for handling outliers in data sets with categorical attributes only. To this666 end, we selected a number of categorical data from the UCI Machine Learning667 Repository. Recall that these data sets are principally labeled for classification668 purposes. 
3.4. Experiments on Categorical Data

The aim of this section is to illustrate the suitability of the proposed approach for handling outliers in data sets with categorical attributes only. To this end, we selected a number of categorical data sets from the UCI Machine Learning Repository. Recall that these data sets are principally labeled for classification purposes. Accordingly, as discussed in Section 3.1, to produce data for use in outlier detection, we inject novel data points in such a way that each attribute value of each newly inserted object is randomly selected from the set of distinct categorical values that initially form the corresponding attribute in the original data. As with our previous experiments, the number of outliers injected in each data set corresponds to 10% of the original data set size. The main characteristics of the categorical data sets used in the experiments are summarized in Fig. 11.

Figure 11: Categorical data sets characteristics.

To identify outliers in each of the categorical data sets considered in these experiments, we first estimated OCinv(Oci) for each object. These scores were then normalized in [0, 1] and modelled as a mixture of univariate beta distributions. To identify the optimal number of components in the mixture, we set Mmax to 5 and selected the number of components that minimizes ICL-BIC. Interestingly, as with the experiments on numerical data, we found that the optimal number of components varies from two to three. Fig. 12 illustrates the density curves of the outlier scores corresponding to three data sets: Audiology, Congressional Voting Records (Vote) and Lymphography. The last component in each plot depicted by Fig. 12 represents the highest score values. Data points associated with the scores grouped in this component correspond to outliers. The knowledgeable reader can also observe in this rendering, and in the previous illustrations of the estimated density curves depicted by Fig. 9 and Fig. 6, the great shape flexibility of the beta distribution, which leads to accurate partitioning of the outlier scores.

Figure 12: Estimated density curves of the categorical outlier scores that correspond to three categorical data sets: (a) Audiology, (b) Vote, (c) Lymphography.

Figure 13: Performance results on categorical data sets.

Fig. 13 compares the effectiveness of our approach to that of a recent outlier detection approach for categorical data named Information-Theory Based Single-Pass (ITB-SP) (Wu and Wang, 2013). It has been empirically illustrated that ITB-SP is an effective approach which outperforms several existing categorical outlier detection algorithms. The implementation of this algorithm has been kindly provided by its authors. As the name implies, this approach harnesses information-theoretic concepts to estimate an outlier score for each object. Specifically, the authors in Wu and Wang (2013) propose the concept of holoentropy as a new measure for outlier detection. As defined in Wu and Wang (2013), holoentropy is a combination of entropy and total correlation with attribute weighting, where the entropy measures the global disorder in the data and the total correlation measures the relationships between attributes. Based on this concept, the authors formulate a function to estimate an outlier score for each object in such a way that outliers are characterized by high score values. The top n objects with the highest score values are declared outliers. Note that, since ITB-SP requires the number of outliers n to be specified by the user, and in order to compute Accuracy, TPR, FPR and F-measure, we simply set the value of n equal to the real number of outliers in the data set under investigation.
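Since both ranking baselines (kNNW above and ITB-SP here) are evaluated by cutting their ranked lists at the true number of outliers n, the short sketch below shows one way to derive the four metrics used throughout these experiments from an outlier-score vector and ground-truth labels. It is an illustrative helper of our own, not code from the paper or from the compared methods.

import numpy as np

def top_n_metrics(scores, y_true, n):
    # Flag the n highest-scoring points as outliers and compute
    # Accuracy, TPR, FPR and F-measure against the ground truth
    # (y_true is 1 for an injected outlier, 0 for an inlier).
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.zeros_like(y_true)
    y_pred[np.argsort(scores)[::-1][:n]] = True       # top-n by outlier score

    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)

    accuracy = (tp + tn) / len(y_true)
    tpr = tp / max(tp + fn, 1)                        # true positive rate (recall)
    fpr = fp / max(fp + tn, 1)                        # false positive rate
    precision = tp / max(tp + fp, 1)
    f_measure = 2 * precision * tpr / max(precision + tpr, 1e-12)
    return accuracy, tpr, fpr, f_measure

# Example: 1000 objects, the last 100 of which are injected outliers.
rng = np.random.default_rng(0)
y = np.r_[np.zeros(900, int), np.ones(100, int)]
s = np.r_[rng.beta(2, 8, 900), rng.beta(8, 2, 100)]   # synthetic outlier scores
print(top_n_metrics(s, y, n=100))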
As can be seen from Fig. 13, the average performance results for our approach and ITB-SP are quite similar, except for the average TPR and FPR. Our method reports an average of 94.84% true positives, while the average TPR of ITB-SP is 88.57%. This means that, on average, only 5.16% of outliers were misclassified as inliers by our approach, compared to 11.43% for ITB-SP. On the other hand, as illustrated by Fig. 13, ITB-SP achieves the lowest FPR, that is 3.40%, while the proposed method reports an average of 4.48% false positives. Overall, the results illustrated in Fig. 13 suggest that both approaches display good performance. Our approach has, however, the non-negligible advantage of automatically discriminating outliers from inliers, while ITB-SP requires the number of outliers in the data to be specified by the user. As discussed earlier, in real applications for which no prior knowledge about the data is available, it is not always possible for the user to accurately set the value of this parameter.

4. Conclusion

In this paper, we have highlighted some limitations of existing outlier detection approaches for mixed-attribute data, including their dependency on user parameters, such as the detection threshold and the target number of outliers to be identified, which are difficult to tune, and their inability to formally discriminate between outliers and inliers. To alleviate these problems, we have proposed a principled approach that performs outlier detection in an automatic fashion.

In our approach, we first devised two functions in order to estimate, for each object, an outlier score in the numerical space and another score in the categorical space. Outliers in both spaces are characterized by high score values. Next, we associate to each data point in the data set under investigation a two-dimensional vector such that the first element of this vector corresponds to the estimated outlier score in the numerical space, while the second element corresponds to the outlier score estimated in the categorical space. Then, we model these vectors as a mixture of bivariate beta distributions. The bivariate beta component that corresponds to the highest score values represents outliers. The beta distribution was chosen for its great shape flexibility, which leads, in turn, to accurate fitting of the estimated outlier score vectors. We have described a statistical framework to illustrate how the bivariate beta mixture model can be used to identify outlier objects.

Finally, we have devised a detailed empirical study to illustrate the suitability of our approach in detecting outliers using several UCI data sets with mixed attributes. We have compared the performance of the proposed method to that of ODMAD, the most recent approach for detecting outliers in the mixed-attribute space. The results show that our approach achieves results that are, in most cases, better than those of ODMAD. Moreover, we have performed further experiments to demonstrate the capability of our methodology in handling outliers in single-type attribute data without any feature transformation. Tests and comparisons with previous ranking approaches on several numerical and categorical UCI data sets show that the proposed methodology exhibits competitive results.
Acknowledgements

The author gratefully thanks Dr. Shu Wu for providing the implementation of the ITB-SP algorithm. The author also would like to thank the reviewers for their valuable comments and important suggestions. This work is supported by research grants from the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Achtert, E., Kriegel, H.-P., Schubert, E., & Zimek, A., 2013. Interactive data mining with 3D-parallel-coordinate-trees. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1009–1012.

Aggarwal, C. C., 2013. Outlier Analysis. Springer.

Alan, O., & Catal, C., 2011. Thresholds based outlier detection approach for mining class outliers: An empirical case study on software measurement datasets. Expert Systems with Applications, 38 (4), 3440–3445.

Angiulli, F., & Fassetti, F., 2014. Exploiting domain knowledge to detect outliers. Data Mining and Knowledge Discovery, 28 (2), 519–568.

Angiulli, F., & Pizzuti, C., 2002. Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, 15–26.

Angiulli, F., & Pizzuti, C., 2005. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17 (2), 203–215.

Bain, L. J., & Engelhardt, M., 2000. Introduction to Probability and Mathematical Statistics, 2nd Edition. Duxbury Press.

Bezdek, J., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum.

Bouguessa, M., 2012. Modeling outlier score distributions. In Proceedings of the 8th International Conference on Advanced Data Mining and Applications, 713–725.

Bouguessa, M., & Wang, S., 2009. Mining projected clusters in high-dimensional spaces. IEEE Transactions on Knowledge and Data Engineering, 21 (4), 507–522.

Bouguessa, M., Wang, S., & Sun, H., 2006. An objective approach to cluster validation. Pattern Recognition Letters, 27 (13), 1419–1430.

Bouguila, N., & Elguebaly, T., 2012. A fully Bayesian model based on reversible jump MCMC and finite beta mixtures for clustering. Expert Systems with Applications, 39 (5), 5946–5959.

Bouguila, N., Ziou, D., & Monga, E., 2006. Practical Bayesian estimation of a finite beta mixture through Gibbs sampling and its applications. Statistics and Computing, 16 (2), 215–225.

Boutemedjet, S., Ziou, D., & Bouguila, N., 2011. Model-based subspace clustering of non-Gaussian data. Neurocomputing, 73 (10–12), 1730–1739.

Breunig, M. M., Kriegel, H.-P., Ng, R., & Sander, J., 2000. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 93–104.

Cao, H., Si, G., Zhang, Y., & Jia, L., 2010. Enhancing effectiveness of density-based outlier mining scheme with density-similarity-neighbor-based outlier factor. Expert Systems with Applications, 37 (12), 8090–8101.

Das, K., & Schneider, J., 2007. Detecting anomalous records in categorical datasets. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 220–229.

Dean, N., & Nugent, R., 2013. Clustering student skill set profiles in a unit hypercube using mixtures of multivariate betas. Advances in Data Analysis and Classification, 7 (3), 339–357.
Dempster, A., Laird, N., & Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–37.

Figueiredo, M. A. T., & Jain, A. K., 2002. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24 (3), 381–396.

Fustes, D., Dafonte, C., Arcay, B., Manteiga, M., Smith, K., Vallenari, A., & Luri, X., 2013. SOM ensemble for unsupervised outlier analysis. Application to outlier identification in the Gaia astronomical survey. Expert Systems with Applications, 40 (5), 1530–1541.

He, Z., Xu, X., Huang, J., & Deng, S., 2005. FP-Outlier: Frequent pattern based outlier detection. Computer Science and Information Systems, 2 (1), 103–118.

Huang, B., & Yang, P., 2011. Finding key knowledge attribute subspace of outliers in high-dimensional dataset. Expert Systems with Applications, 38 (8), 10147–10152.

Ji, Y., Wu, C., Liu, P., Wang, J., & Coombes, K., 2005. Applications of beta-mixture models in bioinformatics. Bioinformatics, 21 (9), 2118–2122.

Koufakou, A., & Georgiopoulos, M., 2010. A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes. Data Mining and Knowledge Discovery, 20 (2), 259–289.

Koufakou, A., Secretan, J., & Georgiopoulos, M., 2011. Fast outlier detection in large high-dimensional categorical data. Knowledge and Information Systems, 29 (3), 697–725.

Kriegel, H.-P., Kroger, P., Schubert, E., & Zimek, A., 2011. Interpreting and unifying outlier scores. In Proceedings of the 11th SIAM International Conference on Data Mining, 13–24.

Kriegel, H.-P., Kroger, P., Schubert, E., & Zimek, A., 2012. Outlier detection in arbitrarily oriented subspaces. In Proceedings of the 12th IEEE International Conference on Data Mining, 379–388.

Maervoet, J., Vens, C., Berghe, G. V., Blockeel, H., & Causmaecker, P. D., 2012. Outlier detection in relational data: A case study in geographical information systems. Expert Systems with Applications, 39 (5), 4718–4728.

Otey, M. E., Ghoting, A., & Parthasarathy, S., 2006. Fast distributed outlier detection in mixed-attribute data sets. Data Mining and Knowledge Discovery, 12 (2–3), 203–228.

Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C., 2003. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th IEEE International Conference on Data Engineering, 315–326.

Penny, K., & Jolliffe, I., 2011. A comparison of multivariate outlier detection methods for clinical laboratory safety data. Journal of the Royal Statistical Society, Series D (The Statistician), 50 (3), 295–308.

Tan, P.-N., Steinbach, M., & Kumar, V., 2006. Introduction to Data Mining. Addison Wesley.

Wu, S., & Wang, S., 2013. Information-theoretic outlier detection for large-scale categorical data. IEEE Transactions on Knowledge and Data Engineering, 25 (3), 589–602.

Yamanishi, K., Takeuchi, J.-I., Williams, G., & Milne, P., 2000. On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 320–324.

Ypma, T. J., 1995. Historical development of the Newton-Raphson method. SIAM Review, 37 (4), 531–551.
Zhang, K., Hutter, M., & Jin, H., 2009. A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, 813–822.

Zhang, K., & Jin, H., 2010. An effective pattern based outlier detection approach for mixed attribute data. In Proceedings of the 23rd Australasian Joint Conference on Advances in Artificial Intelligence, Lecture Notes in Artificial Intelligence, 6464, 122–131.