key: cord-0044826-waa3ltya
authors: Wagner, Nicolas; Antoine, Violaine; Koko, Jonas; Lardy, Romain
title: Fuzzy k-NN Based Classifiers for Time Series with Soft Labels
date: 2020-05-16
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50153-2_43
sha: d4483d9bfbbfb9d5aab4ff48a08e5aca887958cd
doc_id: 44826
cord_uid: waa3ltya

Time series are temporally ordered data available in many fields of science such as medicine, physics, astronomy, audio, etc. Various methods have been proposed to analyze time series. Amongst them, time series classification consists in predicting the class of a time series according to a set of already classified data. However, the performance of a time series classification algorithm depends on the quality of the known labels. In real applications, time series are often labeled by an expert or by an imprecise process, leading to noisy classes. Several algorithms have been developed to handle uncertain labels in the case of non-temporal data sets. As an example, the fuzzy k-NN assigns to each labeled object degrees of membership to the classes. In this paper, we combine two popular time series classification algorithms, Bag of SFA Symbols (BOSS) and Dynamic Time Warping (DTW), with the fuzzy k-NN. The new algorithms are called Fuzzy DTW and Fuzzy BOSS. Results show that our fuzzy time series classification algorithms outperform the non-soft algorithms, especially when the level of label noise is high.

Time series (TS) are data constrained by time order. Such data frequently appear in many fields such as economics, marketing, medicine, biology, and physics. There is a long-standing interest in time series analysis methods. Amongst the developed techniques, time series classification attracts much attention, since the need to accurately forecast and classify time series data spans a wide variety of application problems [2, 9, 20].

Most time series approaches consist in transforming the time series and/or creating an alternative distance measure so that a basic classifier can finally be employed. Thus, one of the most popular time series classifiers is a k-Nearest Neighbor (k-NN) using a similarity measure called Dynamic Time Warping (DTW) [12] that allows nonlinear mapping. More recently, a bag-of-words model combined with the Symbolic Fourier Approximation (SFA) algorithm [19] has been developed in order to deal with extraneous and erroneous data [18]. The algorithm, referred to as Bag of SFA Symbols (BOSS), converts time series into histograms. A distance is then proposed and applied within a k-NN classifier. The combinations of DTW and BOSS with a k-NN are simple and efficient approaches used as gold standards in the literature [1, 8].

The k-NN algorithm is a lazy classifier employing labeled data to predict the class of a new data point. In time series, labels are specified for each timestamp and are obtained by an expert or by a combination of sensors. However, changing from one label to another can span multiple timestamps. For example, in animal health monitoring, an animal is more likely to become sick gradually than suddenly. As a consequence, using soft labels instead of hard labels to describe the animal's state seems more intuitive. The use of soft labels in classification for non-time-series data sets has been studied and has been shown to make predictions robust against label noise [7, 21]. Several extensions of the k-NN algorithm have been proposed [6, 10, 14].
Amongst them, the fuzzy k-NN [11], which is the most popular algorithm [5], handles labels expressed as membership probabilities for each class. The fuzzy k-NN has been applied in many domains: bioinformatics [22], image processing [13], fault detection [24], etc. In this paper, we propose to replace the most popular time series classifier, i.e. the k-NN algorithm, with a fuzzy k-NN. As a result, two new fuzzy classifiers are proposed: the Fuzzy DTW (F-DTW) and the Fuzzy BOSS (F-BOSS). The purpose is to tackle the problem of gradual labels in time series. The rest of the work is organized as follows. Section 2 first recalls the DTW and BOSS algorithms; then, the fuzzy k-NN classifier as well as its combinations with BOSS and DTW are detailed. Section 3 presents a comparison between hard and soft labels on several data sets.

The most efficient way to deal with TS in classification is to use a specific metric such as DTW, or to transform the TS into non-ordered data, as BOSS does. A simple classifier can then be applied.

Dynamic Time Warping [3] is one of the most famous similarity measures between two time series. It accounts for the fact that two similar time series may have different lengths due to varying speeds. The DTW measure thus allows a non-linear mapping, which implies a time distortion. It has been shown that DTW gives better comparisons than the Euclidean distance. In addition, the combination of this elastic measure with the 1-NN algorithm is a gold standard that produces competitive results [1], although DTW is not a distance function: it does not satisfy the triangle inequality, even if in practice this property is often respected [17]. Although DTW has a quadratic complexity, using this measure with a simple classifier remains faster than other algorithms such as neural networks. Moreover, lower bounding techniques can reduce the complexity of the measure to linear [16].

The Bag of SFA Symbols algorithm (BOSS) [18] is a bag-of-words method using the Fourier transform in order to reduce noise and to handle variable lengths. First, a sliding window of size w is applied on each time series of a data set. Then, the windows from the same time series are converted into word sequences according to the Symbolic Fourier Approximation (SFA) algorithm [19]. Words are composed of l symbols over an alphabet of size c. The time series is then represented by a histogram counting the occurrences of each word. Finally, the 1-NN classifier can be used with a distance computed between histograms. Given two histograms B_1 and B_2, the measure, called d_BOSS, is

$$ d_{BOSS}(B_1, B_2) = \sum_{a \in B_1;\, B_1(a) > 0} \big( B_1(a) - B_2(a) \big)^2, $$

where a is a word and B_i(a) the number of occurrences of a in the i-th histogram. Note that the set of words is identical for B_1 and B_2, but the number of occurrences of some words can be equal to 0.
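To make the two measures concrete, here is a minimal Python sketch of the classical DTW recurrence and of the d_BOSS histogram distance defined above. This is an illustration under our own naming and simplifications (no warping window, no lower bounding), not the authors' implementation; in practice one would use an optimized DTW with lower bounds [16].

```python
import numpy as np

def dtw_distance(s, t):
    """Classical O(len(s) * len(t)) DTW recurrence with squared local costs
    (no warping window, no lower bounding)."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (s[i - 1] - t[j - 1]) ** 2
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(np.sqrt(D[n, m]))

def d_boss(b1, b2):
    """BOSS distance between two word histograms (dicts mapping word -> count).
    Only words occurring in the first histogram contribute, so the measure
    is non-symmetric."""
    return float(sum((c - b2.get(a, 0)) ** 2 for a, c in b1.items() if c > 0))
```

A 1-NN classifier then simply assigns to a query series the label of the training series with the smallest dtw_distance (or the smallest d_boss between their histograms).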
We propose to handle fuzzy labels in TS classification using the fuzzy k-NN algorithm. Let D = (X, y) be a data set composed of n = |X| instances, and let y_i ∈ C be the label assigned to each instance x_i ∈ X, with C the set of all possible labels. For conventional hard classification algorithms, it is possible to compute a characteristic function f_c : X → {0, 1} for each c ∈ C:

$$ f_c(x_i) = \begin{cases} 1 & \text{if } y_i = c, \\ 0 & \text{otherwise}. \end{cases} $$

Rather than hard labels, soft labels allow a degree of confidence to be expressed on the class membership of an object. Most of the time, this uncertainty is represented by a probability distribution; in that case, soft labels correspond to fuzzy labels. Thereby, the concept of characteristic function is generalized to a membership function u_c : X → [0, 1], c ∈ C, such that

$$ \sum_{c \in C} u_c(x_i) = 1, \qquad \forall x_i \in X. $$

There exists a wide range of k-NN variants using fuzzy labels in the literature [5]. The most famous and basic method, referred to as the fuzzy k-NN [11], predicts the class membership of an object x_i in two steps. First, similarly to the hard k-NN algorithm, the k nearest neighbors x_j ∈ K, |K| = k, of x_i are retrieved. The second step differs from the hard k-NN, as a membership degree is computed for each class:

$$ u_c(x_i) = \frac{\sum_{x_j \in K} u_c(x_j)\, d(x_i, x_j)^{-2/(m-1)}}{\sum_{x_j \in K} d(x_i, x_j)^{-2/(m-1)}}, \qquad \forall c \in C, \qquad (6) $$

with m a fixed coefficient controlling the fuzziness of the prediction and d(x_i, x_j) the distance between instances x_i and x_j. Usually m = 2, and the Euclidean distance is the most popular choice.

In order to deal with time series and fuzzy labels, we propose two fuzzy classifiers called F-DTW and F-BOSS. The F-DTW algorithm consists in using the fuzzy k-NN algorithm with DTW as the distance function (see Fig. 1). It takes as input a time series and computes the DTW distance to the labeled time series. Once the k closest time series are found, the class memberships are computed with Eq. (6). The F-BOSS algorithm consists in first applying the BOSS algorithm in order to transform the time series into histograms. Then, the fuzzy k-NN is applied with the BOSS distance. It generates fuzzy class memberships (see Fig. 2).

Once F-DTW and F-BOSS are defined, experiments are carried out to show the interest of taking soft labels into account when there is noise and/or uncertainty in the labels. The experiments consist in studying the parameter setting (i.e. the number of neighbors) and comparing soft and hard methods when labels are noisy. We have selected five data sets from the University of California Riverside (UCR) archive [4]. Each data set has different characteristics, detailed in Table 1. The hard labels are known for each data set; we therefore generate fuzzy labels as described in [15]. First, noise is introduced in the label set in order to represent uncertain knowledge: for each instance x_i, a probability p_i of altering the label y_i is randomly generated according to a beta distribution with variance set to 0.04 and expectation μ set to μ ∈ {0.1, 0.2, ..., 0.7}. In order to decide whether the label of x_i is modified, another random number p′_i is generated according to a uniform distribution. If p′_i < p_i, a new label y′_i ∈ C such that y′_i ≠ y_i is randomly assigned to x_i. Second, fuzzy labels are deduced using p_i. Let Π_c : X → [0, 1] be a possibilistic function computed for each instance x_i and each class c:

$$ \Pi_c(x_i) = \begin{cases} 1 & \text{if } c = y_i, \\ p_i & \text{otherwise}. \end{cases} \qquad (7) $$

This possibilistic distribution goes from total certainty when p_i = 0 to total uncertainty when p_i = 1. Since our algorithms employ fuzzy labels, the possibilities Π_c are converted into probabilities u_c by normalizing Eq. (7) by the sum of all possibilities:

$$ u_c(x_i) = \frac{\Pi_c(x_i)}{\sum_{c' \in C} \Pi_{c'}(x_i)}. \qquad (8) $$
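As an illustration, the sketch below implements the fuzzy k-NN rule of Eq. (6) and the label generation protocol of Eqs. (7)-(8). The function names, the seed, and the conversion of the stated mean and variance into beta shape parameters are our own assumptions; dist can be either of the two distances sketched earlier.

```python
import numpy as np

def fuzzy_knn_memberships(x, X_train, U_train, dist, k=5, m=2.0):
    """Eq. (6): U_train[j, c] holds u_c(x_j); returns the memberships u_c(x)."""
    d = np.array([dist(x, xj) for xj in X_train])
    d = np.maximum(d, 1e-12)            # guard against zero distances
    nn = np.argsort(d)[:k]              # the k nearest neighbors
    w = d[nn] ** (-2.0 / (m - 1.0))     # distance weights, m = 2 by default
    return (w[:, None] * U_train[nn]).sum(axis=0) / w.sum()

def make_fuzzy_labels(y, n_classes, mu=0.3, var=0.04, rng=None):
    """Noisy fuzzy labels: flip y_i with probability p_i drawn from a beta
    distribution of mean mu and variance var, then apply Eq. (7) and Eq. (8)."""
    if rng is None:
        rng = np.random.default_rng(0)
    s = mu * (1.0 - mu) / var - 1.0     # mean/variance -> Beta(a, b) shapes (assumed)
    a, b = mu * s, (1.0 - mu) * s
    U = np.empty((len(y), n_classes))
    for i, yi in enumerate(y):
        p = rng.beta(a, b)
        if rng.uniform() < p:           # alter the label with probability p_i
            yi = rng.choice([c for c in range(n_classes) if c != yi])
        pi = np.full(n_classes, p)      # Eq. (7): possibility p_i for c != y_i ...
        pi[yi] = 1.0                    # ... and 1 for the observed label
        U[i] = pi / pi.sum()            # Eq. (8): normalize to probabilities
    return U
```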
We propose to test and compare three strategies for dealing with noisy labels. The first two are dedicated to classifiers taking hard labels as input. The first strategy, called strategy 1, considers that the noise in the labels is unknown. As a result, soft labels are ignored and, for each instance x_i, the label y*_i is chosen using the maximum probability membership rule, i.e. argmax_c u_c(x_i). The second strategy, called strategy 2, consists in discarding the most uncertain labels and transforming the remaining soft labels into hard labels. For each instance x_i, the normalized entropy H_i is computed as follows:

$$ H_i = -\frac{1}{\log |C|} \sum_{c \in C} u_c(x_i) \log u_c(x_i). $$

Note that H_i ∈ [0, 1], where H_i = 0 corresponds to a state of total certainty and H_i = 1 corresponds to a uniform distribution. If H_i > θ, we consider the soft label of x_i as too uncertain, and x_i is discarded from the fuzzy data set. In the experiments, we set the threshold θ to 0.95. Finally, the third strategy, called strategy 3, keeps all the fuzzy labels and applies a classifier able to handle them. In order to compare the strategies, and since strategies 1 and 2 give hard labels whereas strategy 3 generates fuzzy labels, we convert the fuzzy labels predicted by strategy 3 using the maximum membership rule, i.e. argmax_c u_c(x_i).
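The following sketch summarizes how the first two strategies turn the fuzzy labels U (one row u_·(x_i) per instance, as generated above) into hard training sets; strategy 3 simply uses U directly. The naming and vectorization are ours.

```python
import numpy as np

def strategy_1_hard(U):
    """Strategy 1: ignore the noise and keep every instance with its most
    probable class (maximum membership rule)."""
    return np.arange(len(U)), U.argmax(axis=1)

def strategy_2_discard(U, theta=0.95):
    """Strategy 2: drop instances whose normalized entropy H_i exceeds theta,
    then harden the remaining fuzzy labels."""
    with np.errstate(divide="ignore", invalid="ignore"):
        H = -np.where(U > 0, U * np.log(U), 0.0).sum(axis=1) / np.log(U.shape[1])
    keep = np.flatnonzero(H <= theta)
    return keep, U[keep].argmax(axis=1)
```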
The best parameters of F-BOSS are found by leave-one-out cross-validation on the training set; the values of the parameters are fixed as in [1]. Usually, with DTW or BOSS and hard labels, the number of neighbors is set to 1. This experiment studies the influence of the parameter k when soft labels are used. Thus, we set μ = 0.3, in order to represent a moderate level of noise that can exist in real applications, and apply strategy 3 on all data sets. Figure 3 illustrates the results on the WormsTwoClass data set, i.e. the variation of the accuracy of the three classifiers according to k. First, for all values of k, the performance of the soft k-NN classifier is lower than that of the others. This result has also been observed on the other data sets. We also observe in Fig. 3 that the F-BOSS algorithm is often better than F-DTW. However, the F-BOSS curve is serrated, which makes it difficult to establish guidelines for the choice of k. In addition, the best k depends on the algorithm and the data set. Therefore, for the rest of the experiments, we set k to the median value k = 5.

Table 2 presents the results of all classifiers and all strategies on the five data sets for k = 5 and μ = 0.3. F-BOSS and F-DTW outperform the k-NN classifier on all data sets. This result is expected, since the DTW and BOSS algorithms were specially developed for time series problems. The best algorithm between F-DTW and F-BOSS depends on the data set: F-DTW is the best for Lightning2, Yoga and MedicalImages, and F-BOSS is the best for WormsTwoClass. Note that for ProximalPhalanxTW, F-DTW is the best with strategy 2 and F-BOSS is the best with strategy 3. Strategy 1 (i.e. hard labels) is most of the time worse than the two other strategies. This can be explained by the fact that strategy 1 does not take the noise into account. For the best classifiers on all data sets, strategy 3 is the best strategy, even though for ProximalPhalanxTW and MedicalImages strategy 2 competes with strategy 3. Strategy 3 (i.e. soft) is therefore better than strategy 2 (i.e. discard) for five algorithms and equal for two. However, the best algorithm between F-BOSS and F-DTW depends on the data set.

To observe the impact of the μ parameter, Figs. 4, 5 and 6 illustrate the accuracy variations for the WormsTwoClass, Lightning2 and MedicalImages data sets, respectively, according to the value of μ. The k-NN classifier and strategy 1 are not represented because of their poor performance (see Sect. 3.3). The figures also include the value μ = 0, which corresponds to the original data without fuzzy processing. Results are not presented for the Yoga and ProximalPhalanxTW data sets because the accuracy differences between the strategies and the classifiers are not significant, especially when μ < 0.3. For WormsTwoClass, F-BOSS is better than F-DTW, and inversely for the Lightning2 and MedicalImages data sets. For the WormsTwoClass and Lightning2 data sets, with a low or moderate level of noise (μ < 0.3), the third strategy is better than the second one. For the MedicalImages data set, strategies 2 and 3 are roughly equivalent, except for μ = 0.5, where the third strategy is better. When μ > 0.5, strategy 2 is better. Higher levels of noise lead to better results with strategy 2, which can be explained as follows: strategy 2 is less disturbed by the large number of mislabeled instances, since it removes them. Conversely, with a moderate level of noise, the soft algorithms are more accurate because they keep informative labels.

Predicting soft labels instead of hard labels provides the expert with extra information that can be analyzed. We propose to consider as uncertain all predicted fuzzy labels having a probability less than a threshold t. Figure 7 presents the accuracy and the number of discarded elements as functions of this threshold t for the WormsTwoClass data set. As can be observed, the higher t is, the better the accuracy and the more predicted instances are discarded. Thus t is a trade-off between good results and a sufficient number of predicted instances. The results above show that the methods designed for time series outperform the standard ones, and that the fuzzy strategies give better performance on noisily labeled data.

This paper considers the classification problem of time series having fuzzy labels, i.e. labels with probabilities of belonging to classes. We proposed two methods, F-BOSS and F-DTW, that combine a fuzzy classifier (the fuzzy k-NN) with methods dedicated to time series (BOSS and DTW). The new algorithms were tested on five data sets from the UCR archive. With F-BOSS and F-DTW, integrating the information of uncertainty about the class memberships of the labeled instances outperforms strategies that do not take such information into account. As a perspective, we propose to modify the classification part of F-BOSS and F-DTW in order to attribute a weight to the neighbors depending on their distance to the object to predict. This strategy, inspired by some soft k-NN algorithms for non-time-series data sets, should improve the performance by giving less importance to distant and uncertain labeled instances. Another perspective consists in adapting the soft algorithms to possibilistic labels. Indeed, possibilistic labels are more suitable for real applications, as they allow an expert to assign a degree of uncertainty of an object to a class independently from the other classes. For instance, in a dairy cow application where the goal is to detect anomalies like diseases or estrus [23], possibilistic labels are simple to retrieve and well suited, because a cow can have two or more anomalies at the same time (e.g. a disease and an estrus).
References

[1] The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances
[2] Interrupted time series regression for the evaluation of public health interventions: a tutorial
[3] Using dynamic time warping to find patterns in time series
[4] Hexagon-ML: the UCR time series classification archive
[5] Fuzzy nearest neighbor algorithms: taxonomy, experimental analysis and prospects
[6] A k-nearest neighbours method based on imprecise probabilities
[7] A study of the robustness of KNN classifiers trained using soft labels
[8] Deep learning for time series classification: a review
[9] Trade and income-exploiting time series in geography
[10] Possibilistic instance-based learning
[11] A fuzzy k-nearest neighbor algorithm
[12] Exact indexing of dynamic time warping
[13] A 2D-approach towards the detection of distress using fuzzy k-nearest neighbor
[14] A fuzzy vector valued KNN-algorithm for automatic outlier detection
[15] Parametric classification with soft labels using the evidential EM algorithm: linear discriminant analysis versus logistic regression
[16] Three myths about dynamic time warping data mining
[17] Is the DTW "distance" really a metric? An algorithm reducing the number of DTW comparisons in isolated word recognition
[18] The BOSS is concerned with time series classification in the presence of noise
[19] SFA: a symbolic Fourier approximation and index for similarity search in high dimensional datasets
[20] Time-series classification methods: review and applications to power systems data
[21] Classification on soft labels is robust against label noise
[22] An efficient approach for prediction of nuclear receptor and their subfamilies based on fuzzy k-nearest neighbor with maximum relevance minimum redundancy
[23] Machine learning to detect behavioural anomalies in dairy cows under subacute ruminal acidosis
[24] Fault analysis and prediction of transmission line based on fuzzy k-nearest neighbor algorithm