Anomaly detection in monitoring sensor data for preventive maintenance HAL Id: lirmm-00670917 https://hal-lirmm.ccsd.cnrs.fr/lirmm-00670917 Submitted on 16 Feb 2012 HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci- entific research documents, whether they are pub- lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés. Anomaly Detection in Monitoring Sensor Data for Preventive Maintenance Julien Rabatel, Sandra Bringay, Pascal Poncelet To cite this version: Julien Rabatel, Sandra Bringay, Pascal Poncelet. Anomaly Detection in Monitoring Sensor Data for Preventive Maintenance. Expert Systems with Applications, Elsevier, 2011, 38 (6), pp.7003-7015. �10.1016/j.eswa.2010.12.014�. �lirmm-00670917� https://hal-lirmm.ccsd.cnrs.fr/lirmm-00670917 https://hal.archives-ouvertes.fr Expert Systems with Applications 38 (2011) 7003–7015 Contents lists available at ScienceDirect Expert Systems with Applications j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / e s w a Anomaly detection in monitoring sensor data for preventive maintenance Julien Rabatel a,b,⇑, Sandra Bringay a,c, Pascal Poncelet a a LIRMM, Université Montpellier 2, CNRS, 161 rue Ada, 34392 Montpellier Cedex 5, France b Fatronik France Tecnalia Cap Omega, Rond-point Benjamin Franklin - CS 39521, 34960 Montpellier, France c Dpt MIAp, Université Montpellier 3, Route de Mende, 34199 Montpellier Cedex 5, France a r t i c l e i n f o Keywords: Anomaly detection Behavior characterization Sequential patterns Preventive maintenance 0957-4174/$ - see front matter � 2010 Elsevier Ltd. A doi:10.1016/j.eswa.2010.12.014 ⇑ Corresponding author at: LIRMM, Université Mon 34392 Montpellier Cedex 5, France. E-mail addresses: rabatel@lirmm.fr (J. Rabatel), b poncelet@lirmm.fr (P. Poncelet). 1 http://www.smartmotorist.com. a b s t r a c t Today, many industrial companies must face problems raised by maintenance. In particular, the anomaly detection problem is probably one of the most challenging. In this paper we focus on the railway main- tenance task and propose to automatically detect anomalies in order to predict in advance potential fail- ures. We first address the problem of characterizing normal behavior. In order to extract interesting patterns, we have developed a method to take into account the contextual criteria associated to railway data (itinerary, weather conditions, etc.). We then measure the compliance of new data, according to extracted knowledge, and provide information about the seriousness and the exact localization of a detected anomaly. � 2010 Elsevier Ltd. All rights reserved. 1. Introduction Today, many industrial companies must face problems raised by maintenance. The most common solution is called curative maintenance, i.e., equipment is replaced or repaired after the appearance of obvious failures, once the damage is occurred. This solution poses many problems. Curative maintenance is too be- lated and is particularly costly on several aspects. On the financial side first, for many companies, a few hours of downtime can result in millions of dollars in losses. It is generally much less expensive to make predictive maintenance to prevent a serious breakdown. In addition, the corrective maintenance is also a problem for secu- rity aspects. In many sensitive areas, equipment failures can cause death. For example, it is estimated that approximately 5% of motor vehicle accidents are caused by equipment malfunction or a lack of maintenance.1 Another aspect is related to the environment and energy saving. Indeed, equipment that is worn or subject to mal- functions often consumes more energy than equipment that oper- ates optimally. In addition, a systematic maintenance planning is not a satisfactory solution as too expensive compared to real needs. To reduce the problems of equipment maintenance, to propose ways to make maintenance both faster and more effective by antic- ipating serious breakdowns represents a particularly critical issue. Such a preventive maintenance consists in detecting anomalous behavior in order to prevent further damages and avoid more ll rights reserved. tpellier 2, CNRS, 161 rue Ada, ringay@lirmm.fr (S. Bringay), costly maintenance operations. To this end, it is necessary to mon- itor the working equipment. Usually, monitoring data is available through embedded sensors and provides us with important infor- mation such as temperatures, humidity rates, etc. Nevertheless, data collected by sensors are difficult to exploit for several reasons. First, a very large amount of data usually available at a rapid rate must be managed to provide a relevant description of the observed behaviors. Furthermore, they contain many errors: sensor data are very noisy and sensors themselves can become defective. Finally, when considering data transmission, very often lots of information are missing. In this paper, we focus on the field of train maintenance. Trains monitoring is also ensured by sensors positioned on the main components (wheels, motors, etc.) to provide much information (temperature, acceleration, velocity). In this context, we are subject to the difficulties we have described above: voluminous and noisy data, information transmission problems, etc. Moreover, it is important to take into account the different types of data available. We therefore wish to propose a method to exploit this informa- tion in order to assist the development of an effective predictive maintenance. The needs in the context of train maintenance are twofold. First, it is important to provide a better understanding of monitored sys- tems. Indeed, as they are often complex and contain many compo- nents, the experts have little knowledge about their actual behavior. This lack of knowledge makes the problem of mainte- nance very difficult. From another point of view it could be inter- esting to get an overview of the normal behavior (e.g., in case of monitoring) and then it is necessary to propose a way for charac- terizing such normal behaviors from a huge amount of historical data. Another challenging point that we must consider is that http://dx.doi.org/10.1016/j.eswa.2010.12.014 mailto:rabatel@lirmm.fr mailto:bringay@lirmm.fr mailto:poncelet@lirmm.fr http://www.smartmotorist.com http://dx.doi.org/10.1016/j.eswa.2010.12.014 http://www.sciencedirect.com/science/journal/09574174 http://www.elsevier.com/locate/eswa 7004 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 normal behavior strongly depends on the context. For example, a very low ambient temperature will probably affect a train behav- ior. Similarly, each itinerary with its own characteristics (slopes, turns, etc.) influences a journey. Consequently it is essential, in or- der to efficiently characterize the behavior of trains as well as to detect anomalies, to consider the surrounding context. In this pa- per, we will show how these elements can be directly addressed by data mining techniques and how they can be used to design a system for detecting anomalies in train behavior and help experts so that a detected problem can be dealt as promptly as possible, and the right decisions can be made. Our approach follows the framework presented in Fig. 1. We ad- dress the two issues mentioned above: (i) the knowledge discovery process about normal train behavior and (ii) the anomaly detection in new data. The characterization of normal behavior is divided into three steps. First, we consider the data describing the normal behavior (i.e., containing no anomalies) that were recorded in the past. These so-called historical data are segmented into different classes, which are defined by the context in which the data were recorded. This organization brings together all trips that were conducted un- der similar conditions. The criteria used to create the classes are, for example, the outside temperature, the itinerary, etc. Then, we extract the most frequent behaviors to characterize each of these classes. We thus obtain knowledge classes that describe very pre- cisely normal train behavior and also provide us with essential information about the impact of contextual criteria. For example, we can answer questions such as ‘‘What are the behaviors that are specific to a high outside temperature?’’. After having characterized the train behavior in the first step of our framework, we wish to detect anomalies in newly recorded data. To this end, we use the previously obtained knowledge. We have developed a method to compare new monitoring data with a class of knowledge. Thus, we can detect critical events and notify experts that a maintenance operation may be necessary. This paper is organized as follows. Section 2 describes the data representation in the context of train maintenance. Section 3 shows the characterization of normal behaviors by discovering sequential patterns. Then we present the anomaly detection for predictive maintenance approach in Section 4. Experiments con- ducted with real and simulated data are described in Section 5 and related work is presented in section 6. Finally, we conclude in Section 7. Fig. 1. General 2. Data preprocessing In this section, we address the problem of preprocessing railway data. From raw data collected by sensors, we design a suitable rep- resentation for data mining tasks. The train monitoring system exploited in this study is such that a large set of sensors is distributed on the main components of each monitored train. The key element of this system is the bogie, because failures and anomalous behaviors are most of the time associated with it. Each of the 8 bogies of a train has 32 sensors col- lecting information such as temperatures, accelerations and veloc- ity. Every five minutes during a journey, all sensors collect a value stored in a central database. A complete description of these data is available in Carrascal, Díez, and Azpeitia (2009). 2.1. Sensor data for train maintenance The data resulting from sensors for train maintenance is com- plex for the two following reasons: (i) very often errors and noisy values pervade the experimental data; (ii) multi-source informa- tion must be handled at the same time. For instance, in train main- tenance following data must be considered. 2.1.1. Sensors Each sensor describes one property of the global behavior of a train which can correspond to different information (e.g., temper- ature, velocity, acceleration). 2.1.2. Measurements They stand for numerical values recorded by the sensors and could be very noisy for different reasons such as failures, data transfer, etc. Note that the numerical values collected by the sen- sors are then discretized to obtain a set of data more suited to the step of data mining described in Section 3. 2.1.3. Readings They are defined as the set of values measured by all the sensors at a given date. The information carried out by a reading could be considered as the state of the global behavior observed at the given moment. Due to the data transfer, some errors may occur and then readings can become incomplete or even missing. framework. Table 1 Extract from sensor data. TIME A B C . . . 2008/03/27 06:36:39 0 16 16 . . . 2008/03/27 06:41:39 82.5 16 16 . . . 2008/03/27 06:46:38 135.6 19 21 . . . 2008/03/27 06:51:38 105 22 25 . . . J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7005 We consider that handled data are such as those described in Table 1, where a reading for a given date (first column) is de- scribed by sensor measurements (cells of other columns). The behavior of a sensor is described by the numerical values it measures. The discretization of these sensor values is a preprocess- ing that, for example, has been used in Chong, Krishnaswamy, Loke, and Gaben (2008), Mihail Halatchev (2005) to group values expressing the same information into classes of values (e.g., a value could be low, medium or high). For example, temperature values 25�C and 26�C are different but very close, so they can be regarded as carrying the same information. Moreover, for velocity as well as acceleration sensors, we also consider a zero value because it is a particular value that can not be confused with non-zero values. For example, a zero velocity indicates that the train is stationary, while a non-zero velocity, no matter how low it is, concerns a train in motion. 2.2. Granularity in railway data Data collected from a train constitutes a list of readings describ- ing its behavior over time. As such a representation is not appropri- ate to extract useful knowledge, we decompose the list of readings at different levels of granularity and then we consider the three fol- lowing concepts journeys, episodes and episode fragments which are defined as follows. 2.2.1. Journey The definition of a journey is associated to the railway context. For a train, a journey stands for the list of readings collected during the time interval between the departure and the arrival. Usually, a journey is several hours long and has some interruptions when the train stops in railway stations. We consider the decomposition into journeys as the coarsest granularity of railway data. Let minDuration be a minimum duration threshold, maxStop be a maximum stop duration, and J be a list of readings (rm, . . ., ri , . . .rn), where ri is the reading collected at time i. J is a journey if: 1. (n � m) > minDuration, 2. 9=ðru; . . . ; rv; . . . rwÞ# Jj ðw � uÞ > maxStop; and 8v 2 ½u; w�; velocityðvÞ¼ 0: � 2.2.2. Episode The main issue for characterizing train behavior is to compare elements which are similar. However, as trains can have different routes the notion of journey is not sufficient (for instance, between two different journeys, we could have different number of stops as well as a different delay between two railway stations). That is the reason why we segment the journeys into episodes to get a finer level of granularity. Episodes are obtained by relying on the stops of a train (easily recognizable considering the train velocity). An episode is defined as a list of readings (rm, . . .ri, . . . , rn) such as: � velocity(m) = 0 and velocity(n) = 0,2 � if m < i < n, velocity(i) – 0. 2 Here, the velocity of the train at time t is denoted as velocity(t). Fig. 2 describes a segmentation of a journey into episodes by considering the velocity changes. This level of granularity is con- sidered as the most relevant because it provides us with a set of homogeneous data. However, we can segment episodes in order to obtain a more detailed representation and a finer granularity level. 2.2.3. Episode fragment The level of granularity corresponding to the fragments is based on the fact that the behavior of a train during an episode can easily be divided into three chronological steps. First, the train is station- ary (i.e., velocity 0) then an acceleration begins. We call this step the starting step. More formally, let E = (rm, . . . , rn) be an episode. The starting fragment Estarting = (rm, . . . , rk) of this episode is a list of readings such as: 8i; j 2 ½m; k�; i < j () velocityðiÞ < velocityðjÞ: At the end of an episode, the train begins a deceleration ending with a stop. This is the ending step. More formally, let E = (rm, . . . , rn) be an episode. The ending fragment Eending = (rk, . . . , rn) of this epi- sode is a list of readings such as: 8i; j 2 ½k; n�; i < j () velocityðiÞ > velocityðjÞ: The traveling fragment is defined as the sublist of a given epi- sode between the starting fragment and the ending fragment. During this fragment, there are accelerations or decelerations, but no stop. More formally, let E be an episode, Estarting its starting fragment, and Eending its ending fragment. Then, the traveling fragment of E, de- noted as Etraveling, is a list of readings defined as: Etraveling ¼ E � Estarting � Eending: Fig. 2 shows the segmentation of an episode into three frag- ments: the starting fragment, the traveling fragment and the end- ing fragment. From now we thus consider that all the sensor data are stored in a database, containing all information about the different granular- ity levels. For example, all the sensor readings composing the frag- ment shown in Fig. 2 are indexed and we know that a particular fragment f is an ending fragment included in an episode e, belong- ing to the journey J. J is associated with the itinerary I and the index of e in I is 2 (i.e., the second portion of this route). 2.3. Data classes Previously, we have presented how to process the data in the railway context in order to obtain a relevant representation. How- ever, we have not considered the contextual aspects of such data. Indeed, in the railway field of application as well as other industrial applications, the behavior strongly depends on different exterior parameters. The nature of such context is inevitably dependant on the field of application. For example, the behavior of a train during a trip de- pends on contextual criteria such as the weather conditions or its geographical position. We have previously exploited different lev- els of granularity in data, we now use different existing contexts to consider the variability of train behavior according to them. Example 1. Table 2 presents a set of fragments, identified by the id column. Each fragment is associated with contextual criteria (the humidity rate, the exterior temperature, the global route, and the index of the episode in this route). For example, the fragment f2 was performed with a low humidity rate and a low exterior temperature. In addition, f2 is part of the second episode (E2) of the itinerary denoted by I1. Table 3 Some contextual classes. Class Humidity Exterior temperature [⁄,⁄] ⁄ ⁄ [low,⁄] low ⁄ [high,⁄] high ⁄ [low, high] low high [⁄, high] ⁄ high . . . . . . . . . Fig. 2. Segmentation of a journey into episodes. Table 2 Fragments and contextual information. ID Humidity Exterior temperature Itinerary Index f1 low high I1 E1 f2 low low I1 E2 f3 high high I2 E1 f4 low low I1 E1 f5 high low I1 E2 7006 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 2.3.1. Data classes Now we present how contextual criteria are handled, by divid- ing the data into classes according to the different contextual crite- ria. To this end, we will at first develop a formal representation of the different contexts that can be met. Based on the general principle of contextualization, here we de- fine how classes are constructed. We consider now a formal description of a context. Each fragment is described in a set of n contextual dimensions (e.g., Duration or Exterior Temperature), denoted by DC. Let c be a class, defined in DC. A class c is denoted by ½cD1 ; . . . ; cDi ; . . . ; cDk�, where cDi is the value of c for the dimension Di, Di 2 DC. We use a wildcard value, denoted by ⁄, which can sub- stitute any value on each dimension in DC. In other words, "X 2 DC, "x 2 Dim(X), {x} �⁄. Thus, an episode e belongs to a class c if the restriction of e on DC 3 is included in c: 8X 2 DC; eX # cX: We have identified a lattice structure, called the contextual lat- tice, to represent the set of contextual classes. We now explain how this structure is defined. 2.3.2. Contextual lattice The set of contextual classes can be represented in a multidi- mensional space containing all the combinations of different crite- ria as well as their possible values. Example 2. Table 3 shows some of the contextual classes corre- sponding to two dimensions: Humidity and Exterior Temperature (respectively denoted as Hum and ExtT below). A class c is denoted by [cExtT, cHum]. For example, the class denoted by [low,⁄] is equivalent to the context where the temper- ature is low (i.e., cExtT = low), for any humidity rate (i.e., cHum = ⁄). Using the dataset of Table 2, we can see that the set of fragments belonging to the class [low,⁄] is {f1, f2, f4}. Similarly, the set of fragments belonging to the class [low, high] is {f1}. 3 The restriction of e in DC is the description of e, limited to the dimensions of DC. Contextual classes and their relationships can be represented as a lattice. Nevertheless, we first have to define a generalization/ specialization order on the set of environmental classes. Definition 1. Let c, c0 be two classes. c > c0 ()8X 2 DC; c0X � cX . If c > c0, then c is said to be more general than c0 and c0 is said to be more specific than c. Moreover, if there is no class c0 such that c0 – [;, . . . ,;] and c > c0, then c is said to be a specialized class. In order to construct contextual classes, we provide a sum oper- ator (denoted by +) and a product operator (denoted by �). The sum of two classes gives us the most specific class generalizing them. The sum operator is defined as follows. Definition 2. Let c, c0 be two classes. z ¼ c þ c0 ()8X 2 DC; zX ¼ cX if cX ¼ c0X; � elsewhere: � Example 3. Given the set of classes in Table 3, we note that [low, low] + [low, high] = [low,⁄], and [low, high] + [high, low] = [⁄,⁄]. The product of two classes gives the most general class special- izing them. The product operator is defined as follows. Definition 3. Let c, c0 be two classes. The class u is defined as follows: 8X 2 DC; uX ¼ cX \ c00X . Then, z ¼ c � c0 () z ¼ u if 9= X 2 DCjuX ¼;; ½;; . . . ;;� elsewhere: � Example 4. Given the set of classes in Table 3, we note that [low, high]�[low,⁄] = [low, high], and [⁄, high]�[low, low] = [;,;]. We can now define a lattice, by using the generalization/specializa- tion order between classes and the operators defined above. The or- dered set hCS, > i is a lattice denoted as CL, in which Meet ( V ) and Join ( W ) elements are given by: 1. "C � CL, V C = +c2Cc 2. "C � CL, W C = �c2Cc Fig. 3. A contextual lattice. J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7007 Fig. 3 illustrates an extract of the lattice of contextual classes of the dataset provided in Table 2. We can now very precisely index historical data in appropriate classes. 3. Normal behavior characterization In this section, we focus on the data mining step in the knowl- edge discovery process and more precisely on the extraction of pat- terns characterizing normal behavior. 3.1. How to extract normal behavior? The objective of the behavior characterization is, from a data- base of sensor measurements, to provide a list of patterns depict- ing normal behavior. We want to answer the following question: which patterns often appear in the data? Such a problem, also known as pattern mining, has been extensively addressed by the data min- ing community in the last decade. Amongst all the data mining methods, we cite the sequential patterns mining problem. Sequential patterns were introduced in Agrawal and Srikant (1995) and can be considered as an extension of the concept of association rule Agrawal, Imieliński, and Swami (1993) by handling timestamps associated to items. The sequential patterns goal is to extract sets of items commonly associated over time. In the ‘‘basket market ’’ concern, a sequential pattern can be for example: ‘‘40 % of the customers buy a television, then buy later on a DVD player ’’. In the following we adapt the sequential pattern mining problem to our sensor data. In the following definitions and examples, we will consider a set of sensors, denoted by X, and the set of possible values collected by a sensor A 2 X, denoted by dom(A). Definition 4. An item Av is a pair {A, v}, where A 2 X and v 2 dom(A). It stands for the value v collected by the sensor A at a given time. The item Av is called an A-item. Moreover, we denote by value(Av) the value v, and by sensor(Av) the sensor A. Definition 5. Let x = {A1, A2, . . . , Ai, . . . , An} a set of sensors such that x # X. An itemset I is a reading or a part of a reading at a given time, i.e., a non-ordered set of items, denoted as: I ¼ A1v 1 A 2 v 2 . . . Aiv i . . . A n v n � � : Note that if Aiv i 2 I; 9= A i v0 i j Aiv0 i 2 I. Indeed, a sensor cannot read two different values at the same time. Definition 6. A sequence s is a list of itemsets denoted as: s ¼hI1I2 . . . Ii . . . Ini; where the itemset Ii stands for the ith itemset of s. An empty sequence is denoted as ;. Example 5. Data described in Table 1, are translated into the fol- lowing sequence: hðA0B16C16ÞðA82:5 B16C16ÞðA135:6 B19C21ÞðA105B22C25Þi: When data are discretized, the sequence becomes: hðA0BlowClowÞðAavg BlowClowÞðAhighBavg CavgÞðAhighBavg ChighÞi: Definition 7. Given two sequences s = hI1, I2, . . . , Imi and s0 ¼ hI01I 0 2 . . . I0ni, if there exist integers 1 6 i1 < i2 <� � �< im 6 n such that I1 # I 0 i1 ; I2 # I 0 i2 ; . . . ; Im # I 0 im , then the sequence s is a subsequence of the sequence s0, denoted as s v s0, and s0 supports s. Table 4 Sample of sequences. ID Sequence sa h(AhighBhigh)i sb h(Ahigh)(AavgBavg)(Blow)(Alow) i sc h(Ahigh)(Blow)i sd h(Ahigh)(Bavg)i 7008 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 Example 6. From Table 4, we can note that sc v sb and sd v sb. Definition 8. If a sequence s is not a subsequence of any other sequences, then we say that s is maximal. Example 7. From the set of sequences contained in Table 4, we can note that sa and sb are maximal sequences. Definition 9. Let s be a sequence. The length of s, denoted as jsj, is the number of itemsets in s. The size of s, denoted as ksk, is the number of items in s. Example 8. The length of sa is jsaj = 1, and its size is ksak = 2. Definition 10. The support of a sequence is defined as the fraction of total sequences in a sequence database DB that support this sequence. A sequence is said to be frequent if its support is greater than or equal to a threshold minimum support (minSupp) specified by the user. The sequential pattern mining problem is, for a given threshold minSupp and a sequence database DB, to find all frequent sequences in DB. 3.1.1. Contiguous subsequences mining To extract interesting patterns in a database of behavioral sequences, it is important to note that a frequent sequence is inter- esting only if the gap between each itemset is limited. By extract- ing frequent sequences in a database of behavioral sequences, we want to highlight the frequent interactions and correlations between sensors, but also between readings. However, the interest of those patterns is strongly associated with the temporal gap between each pair of itemsets in a sequence. To take this into consideration, we modify the concept of subsequence in order to consider only the consecutive itemsets in a sequence but first of all we define the notion of concatenation. Definition 11. The concatenation of sequences is denoted as s + s0, and the result is the sequence obtained by appending s0 to the end of s, so that we have js + s0j = jsj + js0j and ks + s0k = ksk + ks0k. Example 9. The sequence corresponding to the concatenation of sa and sc is: sa þ sc ¼hðAhighBhighÞðAhighÞðBlowÞi: Definition 12. A sequence s is a contiguous subsequence of the sequence s0, denoted as s v cs0, if there exist three sequences s1, s2, and s3 such that s0 = s1 + s2 + s3, jsj = js2j, and s v s0. Example 10. The sequence sd is a contiguous subsequence of sb, i.e., sd v csb. Indeed, there exist three sequences s1 = ;, s2 = h(Ahigh)- (AavgBavg)i and s3 = h(Alow)i, such that sd = s1 + s2 + s3, jsdj = js2j, and sb v s2. sb ¼ ; z}|{s1 þhðAhighÞðAavg BavgÞi zfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{s2 þhðBlowÞðAlowÞi zfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflffl{s3 ; sd ¼hðAhighÞðBavgÞi: 3.1.2. Aggregated sequences Handled behavioral data are highly redundant: the behavior described by the sensor measurements are generally stable for a reading to another. This characteristic is particularly visible for data that change slowly over time (e.g., a wheel temperature). Therefore, we are dealing with very large sequences with consecutive itemsets containing redundant information. Such sequences bring two problems: (i) the mining of this kind of se- quences is particularly difficult (the repetition of identical itemsets considerably increases the search space) and (ii) it yields no addi- tional information. We therefore propose the concept of aggre- gated sequence. Definition 13. Let s = hI1, I2, . . . , Ini be a sequence. The correspond- ing aggregated pattern, denoted by s⁄, is the maximal subse- quence of s respecting the following condition: s� ¼ hI�1 I � 2 . . . I � i . . . I � mi; such that 8I � i 2 s �; I�i – I � iþ1: Note that s⁄v s. Example 11. Let s = h(AlowBavg)(Alow)(Alow)(Bhigh)i. The aggregated pattern of s, denoted by s*, is: s� ¼ hðAlowBavgÞðAlowÞðBhighÞi: 3.2. Contextualized characterization We have seen in Section 2 that normal train behavior is related to contextual parameters. We have described the necessary con- cepts and methodologies to extract knowledge in a set of historical data. However, in Section 2 we have underlined the fact that nor- mal behavior are very dependent on the context. In consequence, to properly characterize such behavior, we take into account the data classes described previously. 3.2.1. Knowledge classes A class describes a context according to different criteria. To ex- tract knowledge about the influence of these criteria on a train behavior, each class is associated with (i) all behavioral data col- lected in the associated context (see Section 2), (ii) the patterns characterizing the normal behavior in this context. In Section 2.3.1, we have seen that the context of a class is pro- vided through the description of this class. We present below how are defined the corresponding set of data and the set of character- izing patterns. Definition 14. A sequence s is a c-general sequence if s is frequent in all child classes of c. If c is a specialized class, then the set of general sequences in c is the set of frequent sequences in c. A class c, defined in DC, is associated to the set of behavioral sequences contained in c as well as to the set of c-general se- quences. Consequently, a class c is now denoted by a triplet c ¼hD;E;Si, where: � D¼ descðcÞ, is the description of c in DC. � E ¼ dataðcÞ, is the set of fragments contained in c. � S ¼ seqðcÞ, is the set of c-general sequences. By using the previous definitions, the set of general sequences for each class in the class lattice is constructed in the following manner: 1. Let c be a specialized class. seq(c) is obtained by extracting all frequent sequences in data(c) (see Section 3.1). 2. Let c be a non-specialized class. seq(c) is defined as follows: Fig. 4. Sequences spreading in the contextual lattice. J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7009 seqðcÞ¼ \ ci2�c seqðciÞ: Thus, by extracting frequent sequences in specialized classes only, we construct all sets of general sequences. Fig. 4 is depicting the construction of general sequences in a contextual lattice, by following the previously described method. This method of construction of all general sequences for each class has the advantage of limiting the application of the algorithm for discovering sequential patterns. Indeed, the extraction is per- formed only on specialized classes, and the remaining sets of gen- eral sequences are deducted from them. We have presented in this section the approach for extracting knowledge from historical behavioral data collected by sensors. In particular, we have seen how to exploit contextual data classes, and thereby obtained results are not associated with all historical data, but with contextual criteria. Therefore, we can precisely de- scribe the impact of these criteria on the normal behavior of a train and answer questions such as: � What are the behaviors that appear only when the outside tem- perature is high? � What behaviors are specific to a given itinerary? � What behaviors are specific to the ending fragments? � What are the most general behaviors, which do not depend on the context? This accuracy greatly improves the lack of knowledge about the normal train behavior. The obtained knowledge can be used be- sides to detect anomalies in the data collected, as we will see in the next section. 4. Anomaly detection In this section, we present how anomaly detection is performed. We consider that we are provided with both one database contain- ing normal behavior on which knowledge have been extracted (see Section 3) and data corresponding to one new journey. During a journey, each component (and each sensor) has a behavior which is related to the behavior of other components. These interactions, both temporal and inter-components, are de- scribed by a set of sequential patterns that we have extracted ear- lier (see Section 3). Therefore, we want to make an evaluation of the behavior of a component based on obtained knowledge. The main idea is relying on the following remarks: � we can consider that a sensor behaves normally if we find enough patterns in our knowledge base that validate its current state, � we can say that its behavior is anomalous if we find enough pat- terns that contradict it, � if there are not enough patterns to validate or contradict the current sensor state, then the behavior is considered uncertain because we do not have sufficient knowledge to assess its cur- rent behavior. Thus, for each reading and each sensor, we compute two scores: a concordance score and a discordance score. Depending on the va- lue of these scores, we can then indicate whether the behavior is normal, anomalous, or uncertain. Below, we describe how the var- ious needed scores are calculated. 4.1. Preliminary definitions In order to use the extracted sequential patterns for measuring the compliance of a sequence describing a new journey, first of all we introduce the concept of covering sequences. Definition 15. Let I be an itemset and s = hI1, . . . , Ii, . . . , Ini a sequence. I is covering s if "Ii 2 s, I # Ii. Example 12. The itemset (Alow) is covering the sequence h(Alow Blow)(Alow)i. Definition 16. Let p ¼hI�1 . . . I � i . . . I � l i be an aggregated pattern and s = hI1, . . ., Ij,. . . , Imi a sequence. p is covering s, denoted by p � s, if there exists a set of sequences {s1, s2, . . . , sm} such that: (i) s = s1 + s2 + � � � + sm, (ii) 8ij1 6 i 6 m; I�i is covering Ii. Moreover, I � i is called the corre- sponding itemset of Ii in p. Example 13. Let s be a sequence, and p an aggregated pattern, such that: s ¼hðAlowÞðAlowBlowÞ zfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflffl{s1 ðAavg Bav gÞðAavg Bav gÞ zfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl}|fflfflfflfflfflfflfflfflfflfflfflfflfflfflfflfflffl{s2 ðBhighÞ zfflfflffl}|fflfflffl{s3 i p ¼hðAlowÞ|fflffl{zfflffl} I1 ðAavg BavgÞ|fflfflfflfflfflfflffl{zfflfflfflfflfflfflffl} I2 ðBhighÞ|fflfflffl{zfflfflffl} I3 i: We can note that s can be broken down into 3 sequences s1, s2 and s3, such that I1 v s1, I2 v s2, and I3 v s3. Thus, p � s. On the other hand, p is not covering the sequence s0 ¼ hðAlowÞðAlowBlowÞðAav gÞðAavg Bav gÞðAavg BavgÞðBhighÞi: Using the notion of covering sequence, we can now describe two types of patterns: (1) concordant patterns, validating the behavior of a sensor at a given time, and (2) discordant patterns contradicting this behavior. Definition 17. Let A 2 X, s = hI1, I2, . . . , Ini a sequence, and p ¼ hI�1I � 2 . . . I � mi an aggregated pattern. p is a concordant pattern for A in the ith itemset of s, i.e., a (A, i)-concordant pattern in s, if: (i) there exist integers h, j such that 1 6 h 6 i 6 j 6 n, and p �hIh, . . ., Ii , . . .Iji. (ii) let I⁄ be the corresponding itemset of Ii in p, there exists an item Av 2 I⁄. Table 5 Concordant patterns. ID Aggregated pattern Support p1 h(Alow)(AavgBavg)i 25% p2 h(AlowBavg)i 70% p3 h(AlowBavg)(Aavg)i 30% p4 h(AlowBlow)(AlowBavg)i 55% Table 6 Discordant patterns. ID Aggregated pattern Support p5 h(AlowBlow)(AavgBavg)i 45% p6 h(AhighBavg)i 20% 7010 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 Example 14. Let A, B and C be sensors, p1 = h(AavgBlow)(Bavg)i, p2 = h(Aavg)i and p3 = h(Alow)(Aavg)i three aggregated patterns, and s a sequence such that: s ¼hðAavg BlowClowÞðAhighBavg CavgÞðAhighBhighChighÞi: The aggregated patterns p1 and p2 are (A, 1)-concordant patterns. On the other hand, p3 is not a (A, 1)-concordant pattern. A discordant pattern for the state of a sensor at a given time de- scribes an unexpected behavior. Definition 18. Let A 2 X, s = hI1, . . ., Ii , . . .Ini a sequence such that Ii contains an A-item, denoted by is, and p ¼hI�1I � 2 . . . I � j . . . I � mi an aggregated pattern such that I�j contains an A-item, denoted by ip. p0 is the sequence p where ip has been replaced by is in Ij. p is a discordant pattern for A in the ith itemset of s, i.e., a (A, i)- discordant pattern in s if: (i) p is not a (A, i)-concordant pattern, (ii) p0 is a (A, i)-concordant pattern. The items is and ip are called discordant items. More precisely, is is the discordant item of s, and ip is the discordant item of p. Example 15. Let A, B and C be sensors, p1 = h(AlowBlow)(Bavg)i and p2 = h(AhighBavg)i two aggregated patterns, and s a sequence such that: s ¼hðAavg BlowClowÞðAhighBavg CavgÞðAhighBhighChighÞi: The aggregated pattern p1 is a (A, 1)-discordant pattern. On the other hand, p2 is not a (A, 1)-discordant pattern. 4.2. Conformity score In the following examples, we will consider the sequence s, such that: s ¼hðAlowBlowÞðAlowBavgÞðAlowBav gÞðAavg Bav gÞi: Table 5 (respectively Table 6), contains the set of (A, 3)-concor- dant patterns (respectively (A, 3)-discordant patterns) in s as well as their support. 4.2.1. Concordance score Computing a concordance score for a given sensor A in the ith itemset of a given sequence s consists in distinguishing, in a set of patterns such as that described in Table 5, the (A, i)-concordant pattern in s. Note that two concordant patterns do not always have the same weight. In Table 5, we can not give the same weight to all the patterns in the overall concordance score of sensor A in the 3th itemset of s. Indeed, the size of the pattern p2 is too low to consider that it will affect the final score. In addition, we believe that the support of a pattern also influences its weight. Therefore, we define the weight of a sequence as follows: Definition 19. Let p be an aggregated pattern and s a sequence such that p is an (A, i)-concordant pattern in s. The weight of such a concordant pattern is defined as follows: weightðpÞ¼ kpk supportðpÞ: Example 16. The weight of p1 is: weightðp1Þ¼ kp1k supportðp1Þ¼ 3 0:25 ¼ 0:75: We can now define the concordance score of a sensor A in the ith itemset of a sequence s. Definition 20. Let Pc the set of all (A, i)-concordant patterns. The concordance score of a sensor A in the ith itemset of a sequence s is defined as follows: scoreconcðA; iÞ¼ X p2Pc weightðpÞ: Example 17. The concordance score of A in the 3rd itemset of s is: scoreconcðA; iÞ¼ X p2Pc weightðpÞ¼ weightðp1Þþ . . . þ weightðp4Þ ¼ 0:75 þ 1:4 þ 0:9 þ 2:2 ¼ 5; 25: 4.2.2. Discordance score In this part, we focus on discordant patterns, which tend to inval- idate the behavior of a sensor A in the ith itemset of a sequence. Discordance degree To evaluate the weight of a discordant pat- tern, it is important to consider the gap between this value and the ‘‘expected value’’. Let A 2 X a sensor, and D¼ domðAÞ. D is such that D¼ðv 1; . . . v i; . . . ; v nÞ, where vi is a discrete value. Definition 21. Let p be a (A, i)-discordant pattern in a sequence s, ip the discordant item of p, and is the discordant item of s. We define the discordance degree of p as follows. Let us consider v k 2D and v l 2D, such that vk = value(ip) and vl = value(is). The discordance degree of p, denoted by discDegree(p), is: discDegreeðpÞ¼ jl � kj n : Example 18. In the previous sequence s, dom(A) = dom(B) = (i1, i2, i3), such that i1 = low, i2 = avg, and i3 = high. By considering the sequence s and the discordant pattern p5, we can note that the discordant item of s is Alow and the discordant item of p5 is Aavg. Thus, the discordance degree of p5 is: discDegreeðp5Þ¼ j2 � 1j jdomðAÞj ¼ 1=3: We can now define the weight of a discordant pattern. This weight must take into account three features: the discordance de- gree, the size of a pattern, and its support. Concerning the size of a discordant pattern, we believe that the important part of a discor- dant pattern is the part covering the tested sequence. Therefore, we consider the size of the sequence s without the discordant item. Definition 22. Let p be an aggregated pattern and s a sequence such that p is an (A, i)-discordant pattern. The weight of such a discordant pattern is defined as follows: weightðpÞ¼ ðkpk� 1Þ supportðpÞ discDegreeðpÞ: J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7011 Example 19. The weight of p5 is: weightðp5Þ¼ ðkp5k� 1Þ supportðp5Þ discDegreeðp5Þ ¼ ð4 � 1Þ 0:45 1 3 ¼ 0:45: Definition 23. Let Pd the set of all (A, i)-discordant patterns. The dis- cordance score of a sensor A in the ith itemset of a sequence s is: scorediscðA; iÞ¼ X p2Pd weightðpÞ: Fig. 5. Influence of the smoothing window on the anomaly score. Example 20. The discordance score of A in the 3rd itemset of s is calculated as follows: scorediscðA; 3Þ¼ X p2Pd weightðpÞ¼ weightðp5Þþ weightðp6Þ ¼ 0:45 þ 0:13 ¼ 0:58: 4.2.3. Conformity score For each sensor and each itemset in a sequence, we defined how to calculate a concordance score and a discordance score. We can now address the problem of defining a global conformity score of a sensor A in the ith itemset of a sequence, denoted as score (A, i). This score, defined between �1 and 1, must meet the following requirements: � if score(c, t) is close to 1, the state of A in i is considered as normal, � if score(c, t) is close to �1, the state of A in i is considered as abnormal, � if score(c, t) is close to 0, the state of A in i is considered as uncertain. Definition 24. Let A 2 X be a sensor and s a sequence. The confor- mity score of A in the ith itemset of s, denoted by score(A, i) is defined as follows: scoreðA; iÞ¼ scoreconcðA; iÞ� scorediscðA; iÞ maxðscoreconcðA; iÞ; scorediscðA; iÞÞ : Example 21. By considering the previous examples, we can now calculate the conformity score of A in the 3rd itemset of s as follows: scoreðA; 3Þ¼ scoreconcðA; 3Þ� scorediscðA; 3Þ maxðscoreconcðA; 3Þ; scorediscðA; 3ÞÞ ¼ 5; 25 � 0:58 5; 25 ¼ 0:89: 4.2.4. Smoothing window One important issue when addressing the problem of detecting anomalies concerns false alarms, especially when data are noisy. Thus, it is possible that a sensor has a bad score on a reading, which do not correspond to a real problem. In this case, the score of the sensor then quickly goes back to normal values. To avoid this, it is preferable to take into account the surrounding score values of the same sensor. We have thus developed the possibity to set a parameter called smoothing window and noted w. The score of a sensor for a given date is the average score on the smoothing window. Fig. 5 shows the evolution of the score of a sensor during a frag- ment, with and without smoothing (with w = 3). Without smooth- ing, we can see a decrease of the value that could be interpreted as an anomalous behavior. However, the immediate increasing of the score indicates that it is not a real anomaly. The smoothed score is an efficient method for restricting this phenomenon. We have described throughout this section how to use a list of sequential patterns, illustrating the normal behavior of the moni- tored system, to detect abnormal behavior among new collected data. This approach is particularly suited to the problem of moni- toring complex industrial systems because it does not only declare a data sequence as normal or abnormal, but provides also very pre- cise information about the anomaly detected: the localization of the fault (which sensor, which component, etc.), the precise mo- ment when this anomaly occurred, and a score quantifying the seriousness of the detected problem. Moreover, the described approach can also obtain information about the nature of the detected anomaly. We can in particular dis- tinguish the case where the detected anomaly corresponds to a sensor failure (which means that the behavior of the component on which the sensor is located is normal, but the measurements are corrupted) from the more serious case where the behavior of the component is actually involved in the anomaly. To differentiate those cases, we exploit the fact that several sensors are generally installed on a same component (e.g., 4 temperature sensors are in- stalled on each of the wheels of a train). We can therefore infer that if only one sensor from a group of sensors has a low score, then the corresponding anomaly relates to that sensor only. We can there- fore analyze the data according to several levels of granularity: the sensor level (by considering only the score of each sensor), the level of a component (e.g. by interpreting the average score of sensors installed on it), the level a car (by calculating the average score of components in the car), etc. 5. Experimental results In order to evaluate our proposal, several experiments were conducted on a real dataset. They correspond to the railway data collected on 12 trains where each train has 249 sensors, over a one-year period. A reading is collected every five minutes. 232 temperature sensors and 16 acceleration sensors are distributed on the different components (e.g., wheels, motors, etc.) and a sen- sor measures the overall speed of the train. The experimental protocol follows the organization of the proposals: 1. Characterization of normal behavior (see Section 3). We have studied the impact of contextualization on the characterization of normal behavior, and the benefit of the search for specific patterns in various contexts. Table 7 Number of general sequences and aggregated patterns, according to the contextual class. Class General sequences Aggregated patterns [T = ⁄; FT = ⁄; I = ⁄; E = ⁄] 5 5 [T = low; FT = ⁄; I = ⁄; E = ⁄] 20 17 [T = low; FT = middle; I = ⁄; E = ⁄] 123 89 [T = low; FT = middle; I = 1; E = ⁄] 270 189 [T = low; FT = middle; I = 1; E = 2] 542 455 Fig. 6. Simulated anomaly: blocked values. Fig. 7. Simulated anomaly: shifted values. 7012 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 2. Anomaly detection (see Section 4). We have evaluated the con- formity score by applying it on normal and abnormal data. 5.1. Normal behavior characterization The discovery of frequent sequences has been performed with the PSP algorithm described in Masseglia, Cathala, and Poncelet (1998) with the minimum support set to 0.3. To build a contextual lattice such as that presented in Section 3.2, we used four criteria: � the fragment type (denoted as FT): beginning, middle, or ending fragments. � the exterior temperature (denoted as T): low or high. � The itinerary (denoted as I): I1, I2, I3. � The episode in an itinerary (denoted as E): 1st, 2nd, 3rd, or 4th episode. Table 7 shows, for some classes extracted from the obtained hierarchy, the respective number of general sequences, as well as the number of aggregated patterns among them. For example, the class of middle fragments which have been achieved with a low exterior temperature for any itinerary and any episode, de- noted as [T = low; FT = middle; I = ⁄; E = ⁄], contains 123 general se- quences. 89 (72%) among them are aggregated patterns. We can note that the more a class is general, the less it contains general sequences. This shows the necessity of contextualizing the behavior characterization. Indeed, very few behaviors are common to all contexts: a normal behavior is actually dependent on the context in which it operates. Moreover, aggregated patterns have reduced the number of extracted sequential patterns, by removing all very long and redundant sequences. 5.2. Anomaly detection We described in Section 4 the approach used to detect anoma- lies in new data. Thus, we can use it to classify new fragments of journeys into two categories: normal data and abnormal data. To evaluate our approach in an appropriate manner, it is necessary to conduct experiments on both types of data. However, if it is easy to find data which do not contain faults, it is often more difficult to obtain a large data set containing anomalies. For this reason, we have simulated a set of anomalous data, on which we have con- ducted our experiments. To this end, we have used the real normal data set, and we have corrupted the data in order to reproduce classic behavioral anom- alies on the values collected by sensors. Anomalies generated are threefold: � Blocked values (see Fig. 6): the value collected by a sensor is not refreshing. This anomaly is usually related to a faulty sen- sor, or a problem of transmitting information between the sen- sor and the database, but very rarely describes a real problem on a component. � Shifted values (see Fig. 7): a value that is shifted in comparison with the normal behavior of a sensor (or group of sensors): a constant value is added (or substracted) to the real collected value. This type of anomaly may, for example, describe an abnormal overheating of a component. � Random values (see Fig. 8): in this case, collected values are randomized. They describe an aberrant behavior without any link with the expected behavior. We simulated these anomalies in order to build a data set con- taining about 600 fragments of episodes (i.e., 200 episodes) com- posed as follows: � 300 are fragments of real data validated as normal (i.e., without anomalies), � 300 fragments containing anomalies were generated from nor- mal data. The three types of anomalies described above are dis- tributed equally between these fragments. The approach was tested on the basis of cross-validation, by segmenting the total set of data into 10 equal parts. Thus, each part contains 600 fragments, of which 300 are normal data. Note that the learning step (i.e., characterization of normal behavior) is per- formed on normal data only. To quantify the number of anomalous fragments in our data set, we consider a fragment is abnormal if its anomaly score falls below �0.5. Thus, we have obtained the confusion matrices presented in Table 8, containing the average results obtained by cross validation Fig. 8. Simulated anomaly: random values. Table 8 Global confusion matrix. Predict. normal Predict. anomalous Recall Precision Real normal 272 28 90.67% 91.58% Real anomalous 25 275 91.67% 90.76% Global results 297 303 91.17% 91.17% Table 9 Confusion matrix for random values anomalies. Predict. anomalous Predict. normal Real anomalous 98 2 Table 10 Confusion matrix for blocked values anomalies. Predict. anomalous Predict. normal Real anomalous 85 15 Table 11 Confusion matrix for shifted values anomalies. Predict. anomalous Predict. normal Real anomalous 92 8 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7013 on the entire test set. We can note that the system is well balanced and has similar qualities to both recognize the normal behaviors, but also to identify anomalies. In order to evaluate the results, we have calculated two mea- sures of quality widely used: the precision and the recall. More- over, recall and precision is in all cases above 90%, so the approach limit both the number of false alarms and the number of undetected anomalies. Tables 9–11 contain the results for each type of anomaly. We note that the results are very good to recognize a ‘‘random values’’ or a ‘‘shifted values’’ anomaly. This is due to the fact that in this case the simulated behavior is very different from a normal behavior. In contrast, ‘‘blocked values’’ anomalies are different: the anomalous value may not be sufficiently distant from ‘‘expected’’ values for detecting the problem. 5.2.1. Additional information about detected anomalies We have seen how to detect anomalies in new collected data. Moreover, it is possible to obtain additional information about de- tected anomalies. In particular, we use the lattice of classes to bet- ter understand the anomalies. When an anomaly is detected in a specialized class it may be useful, for users taking maintenance decisions, to better under- stand anomalies sources in order to assess its seriousness. In par- ticular, the user can recalculate the conformity score in other classes of the contextual lattice, and particularly in the super- classes of its specialized class. Several cases are then possible. � The score is low in the more specific classes of the lattice, but is improved in the more general classes. In this case, the tested journey is anomalous, but the anomaly is not serious. It does not meet the most specific behaviors expected in its class, but nonetheless meets the most general behaviors. � The score is low in all classes of the lattice, including the general classes. Here, the observed journey is problematic. The behavior of the train does not correspond to what is expected. � The score is low in a branch of the lattice, but still good in another branch. For example, consider a journey that contains an anomaly in its specialized class [high, low] (i.e., high exterior temperature and low humidity rate). So it is particularly helpful for decision makers to know the scores obtained by this journey in the superclasses [high,⁄] or [⁄low]. If the anomaly is related to one of these classes, i.e., if the score is low in [high,⁄] and correct in [⁄, low] for example, then the user gets additional information, which will allow him to investigate more effec- tively the cause of the anomaly. This information, combined with the fact that user may at any time consult the list of concordant or discordant patterns when an anomaly is detected, can provide a better understanding of the trains behavior. Therefore, the approach not only detects prob- lems but also allows decision makers to investigate the causes and potential seriousness of an anomaly, and help them to make the right decisions. 6. Related work The needs to develop our approach are twofold: (i) extract knowledge from sensor data, and (ii) detect anomalies in such data. Recently, the problem of mining sensor data has been addressed by the data mining community. Different approaches focusing either on the data representation by performing for example sensor clus- tering Ci, Guizani, and Sharif (2007), Rodrigues and Gama (2006), or knowledge extraction by mining association rules Boukerche and Samarah (2007), Boukerche and Samarah (2008), Chong et al. (2008), Ma et al. (2004), Mihail Halatchev (2005), Yairi, Kato, and Hori (2001), or sequential patterns Cook et al. (2003), Guralnik and Haigh (2002), Tseng and Lu (2009), Wu, Peng, and Chen (2001) were proposed. The second problem we address in this paper concerns the detection of anomalous behavior from data collected by the sen- sors. This topic, usually known as the anomaly detection problem, has received considerable attention in recent years. It can be viewed as a classification problem in which a system behavior can be clas- sified as normal or anomalous. Defined in Grubbs (1969), an anom- aly or outlier is an observation that appears to deviate markedly from other members of the sample in which it occurs. More recently, different definitions have been proposed in literature (Barnett & Lewis, 1994; Douglas M., 1980; Ramaswamy, Rastogi, & Shim, 2000). More generally, we will consider an anomaly as 7014 J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 behaving differently from the norm. A lot of approaches have been developed in order to meet the requirements of very different appli- cation domains, such as network intrusion detection (Jayakumar et al., 2008; Siaterlis & Maglaris, 2004; Wang, Guan, & Zhang, 2008; Wu and, 2006), industrial systems monitoring (Keogh, Lin, Lee, & Van Herle, 2006; Yairi et al., 2001), medical condition moni- toring (Lin, Keogh, Fu, & Van Herle, 2005; Wong, Moore, Cooper, & Wagner, 2003), fraud detection (Brause, Langsdorf, & Hepp, 1999, Phua, Alahakoon, & Lee, 2004), etc. Of course, the techniques employed are strongly related to the nature of input data. In particular, numerous anomaly detection methods have been developed considering transaction data sets. However, a lot of real-world domains have to handle temporal data that can be stored as sequential data sets. In particular, the behav- ioral data collected by sensors in the complex systems monitoring problem belongs to this category. The temporal nature of sequen- tial data requires new techniques to detect anomalies within. Sev- eral methods for detecting anomalies in sequential data have been developed for different application domains (Lee, Stolfo, & Chan, 1997; Michael & Ghosh, 2000; Sun, Chawla, & Arunasalam, 2006), but they cannot meet all our requirements. We have previously seen that an anomaly behave differently from the norm. Conse- quently, in order to properly detect anomalies in a data set, we need to define what is the norm, i.e., to characterize normal behav- ior. Some application domains can benefit from a priori knowledge of experts to construct this characterization. However, in other do- mains such as train monitoring for example, a priori knowledge is difficult to obtain. In consequence, we have to develop a system for exploiting the extracted results to characterize normal behavior. These patterns are describing the expected norm. We also have to manage several difficulties inherent to the problem that we address. For example, developing a hierarchy of contexts is necessary to (i) take into account variations in the train behavior according to various criteria, and (ii) provide more expla- nations to maintenance experts when an anomaly is detected. In addition, we handle sequences particularly large and redundant (see Section 2). Therefore, we propose the concept of aggregated sequences to accomplish our goals. Moreover, we must attach importance to the interpretability of results in the normal behavior characterization process as well as in the anomaly detection. Indeed, the problem is very sensitive (er- rors can lead to heavy financial costs), the experts need to under- stand the results to make and justify the right decisions. For all these reasons, we must develop a new appropriate method. 7. Conclusion In this paper, we have addressed a problem involved in many areas: the maintenance of complex systems. There are several solu- tions to perform maintenance of such systems. Firstly, corrective maintenance, consisting in making the necessary maintenance operations after the occurrence of a failure, is not suited to this contexte as too costly and too dangerous. A planned and system- atic maintenance is too expensive, although it can usually avoid serious failures. Thus, we addressed the problem of developing an approach to allow preventive maintenance, which is a good compromise between the two previous solutions. Preventive maintenance consists in detecting abnormal behavior that may be harbingers of major failures, to perform only the necessary pre- ventive maintenance operations. However, the complexity of such systems (e.g., trains, indus- trial machinery, etc.) makes their behavior difficult to understand and interpret. In these circumstances it is particularly difficult to detect abnormal behavior that often herald significant and costly failures. The problem that we address is particularly challenging. Indeed, errors in diagnosis may cause many inconveniences. We have therefore developed an approach to use data collected by sensors in order to analyze behaviors and allow the detection of anomalies. Our contribution is divided into three parts. First, we have pro- posed an adequate representation of data in order to extract knowledge based on sensor data describing the past normal behav- ior of systems. Secondly, we studied the possibility to extract sequential patterns in historical data both to improve the under- standing of systems for experts, but also to provide a knowledge database used to develop a method for detecting anomalies. To characterize normal behavior adequately, we also took into ac- count the context. Indeed, the context in which a system is running often has an influence on its behavior, and on the definition of an anomaly. Finally, we proposed an approach to compare new pat- terns with all sequential patterns describing normal behavior. We provide experts with an anomaly score for each sensor and each component of the systems studied. The approach allows to lo- cate precisely the anomalies and to quantify the extent to which the anomaly seems problematic. One main advantage of the presented approach is that all ob- tained results are easily interpretable for decision makers. This is particularly important because users must be able to make deci- sions with certainty. Although the presented work meets our expectations in terms of results, it opens interesting perspectives. In particular, the devel- opment of a graphical user interface will allow users to access re- sults quickly and efficiently. Another important aspect may be to adapt the approach to a real-time context. This will detect the anomalies on the running trains. Furthermore, an interesting ques- tion may be addressed: how to manage the evolution of normal behavior over time? This will conduct the knowledge database to be incrementally updated in order to ensure handled knowledge to be valid over time. References Agrawal, R., Imieliński, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. SIGMOD Record, 22(2). Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In P. S. Yu & A. S. P. Chen (Eds.), Eleventh International Conference on Data Engineering. IEEE Computer Society Press. Barnett, V., & Lewis, T. (1994). Outliers in statistical data. Wiley series in probability & statistics. Wiley. April. Boukerche, A., & Samarah, S. (2007). A new in-network data reduction mechanism to gather data for mining wireless sensor networks. In MSWiM ’07: Proceedings of the 10th ACM symposium on modeling, analysis, and simulation of wireless and mobile systems. New York, NY, USA: ACM. Boukerche, A., & Samarah, S. (2008). A novel algorithm for mining association rules in wireless ad hoc sensor networks. IEEE Transactions on Parallel Distributed Systems, 19(7), 865–877. Brause, R., Langsdorf, T., & Hepp, M. (1999). Neural data mining for credit card fraud detection. In ICTAI ’99: Proceedings of the 11th IEEE international conference on tools with artificial intelligence (pp. 103). Washington, DC, USA: IEEE Computer Society. Carrascal, A., Díez, A., & Azpeitia, A. (2009). Unsupervised methods for anomalies detection through intelligent monitoring systems. In Proceedings of the fourth international conference on hybrid artificial intelligence systems (pp. 144). Springer. Chong, S. K., Krishnaswamy, S., Loke, S. W., & Gaben, M. M. (2008). Using association rules for energy conservation in wireless sensor networks. In SAC ’08: Proceedings of the 2008 ACM symposium on Applied computing. ACM. Ci, S., Guizani, M., & Sharif, H. (2007). Adaptive clustering in wireless sensor networks by mining sensor energy data. Computer Communications, 30(14–15), 2968–2975. Cook, D. J., Youngblood, M., Heierman, E. O., III, Gopalratnam, K., Rao, S., Litvin, A., et al. (2003). Mavhome: An agent-based smart home. In PERCOM ’03: Proceedings of the first IEEE international conference on pervasive computing and communications. Springer. Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11, 1–21. Gruenwald, L., & Halatchev, M. (2005). Estimating missing values in related sensor data streams. In J. R. Haritsa & T. M. Vijayaraman (Eds.), Proceedings of the 11th international conference on management of data (COMAD ’05). Computer Society of India. J. Rabatel et al. / Expert Systems with Applications 38 (2011) 7003–7015 7015 Guralnik, V., & Haigh, K. Z. (2002). Learning models of human behaviour with sequential patterns. In Proceedings of the AAAI-02 workshop ‘‘Automation as Caregiver’’. Hawkins Douglas, M. (1980). Identification of outliers. Chapman and Hall. Keogh, E., Lin, J., Lee, S.-H., & Van Herle, H. (2006). Finding the most unusual time series subsequence: algorithms and applications. Knowledge Information Systems, 11(1), 1–27. Lee, W., Stolfo, S. J., & Chan, P. K. (1997). Learning patterns from unix process execution traces for intrusion detection. In In AAAI workshop on AI approaches to fraud detection and risk management (pp. 50–56). AAAI Press. Lin, J., Keogh, E., Fu, A., & Van Herle, H. (2005). Approximations to magic: Finding unusual medical time series. In CBMS ’05: Proceedings of the 18th IEEE symposium on computer-based medical systems (pp. 329–334). Washington, DC, USA: IEEE Computer Society. Masseglia, F., Cathala, F., & Poncelet, P. (1998). The psp approach for mining sequential patterns. In J. M. Zytkow & M. Quafafou (Eds.), PKDD. Lecture Notes in Computer Science (Vol. 1510). Springer. Ma, X., Yang, D., Tang, S., Luo, Q., Li, S., & Zhang, D. (2004). Online mining in sensor networks. In H. Jin, G. R. Gao, Z. Xu, & H. Chen (Eds.), NPC. Lecture Notes in Computer Science (Vol. 3222). Springer. Michael, C. C., & Ghosh, A. (2000). Two state-based approaches to program-based anomaly detection. In ACSAC ’00: Proceedings of the 16th annual computer security applications conference (pp. 21). Washington, DC, USA: IEEE Computer Society. Phua, C., Alahakoon, D., & Lee, V. (2004). Minority report in fraud detection: Classification of skewed data. SIGKDD Explorations Newsletter, 6(1), 50–59. Ramaswamy, S., Rastogi, R., & Shim, K. Efficient algorithms for mining outliers from large data sets. In SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data (pp. 427–438). New York, USA: ACM. Rodrigues, P. P., & Gama, J. (2006). Online prediction of streaming sensor data. In: Jesus, S., Gama, A. -R. J., & Roure, J., (Eds.), Proceedings of the 3rd international workshop on knowledge discovery from data streams (IWKDDS 2006), in conjuntion with the 23rd international conference on machine learning. Shaikh, R. A., Jameel, H., d’Auriol, B. J., Lee, S., Song, Y.-J., & Lee, H. (2008). Trusting anomaly and intrusion claims for cooperative distributed intrusion detection schemes of wireless sensor networks. In ICYCS ’08: Proceedings of the 2008 the 9th international conference for young computer scientists (pp. 2038–2043). Washington, DC, USA: IEEE Computer Society. Siaterlis, C., & Maglaris, B. (2004). Towards multisensor data fusion for dos detection. In SAC ’04: Proceedings of the 2004 ACM symposium on applied computing (pp. 439–446). New York, NY, USA: ACM. Sun, P., Chawla, S., & Arunasalam, B. (2006). Mining for outliers in sequential databases. In D. Lambert, J. Ghosh, J. Srivastava, & D. B. Skillicorn (Eds.), SDM. SIAM. Tseng, V. S., & Lu, E. H.-C. (2009). Energy-efficient real-time object tracking in multi- level sensor networks by mining and predicting movement patterns. Journal of Systems and Software, 82(4), 697–706. Wang, W., Guan, X., & Zhang, X. (2008). Processing of massive audit data streams for real-time anomaly intrusion detection. Computer Communications, 31(1), 58–72. Wong, W.-K., Moore, A., Cooper, G., & Wagner, M. (2003). Bayesian network anomaly pattern detection for disease outbreaks. In Proceedings of the twentieth international conference on machine learning (pp. 808–815). AAAI Press. Wu, P.-H., Peng, W.-C., & Chen, M.-S. (2001). Mining sequential alarm patterns in a telecommunication database. In DBTel ’01: Proceedings of the VLDB 2001 international workshop on databases in telecommunications II. Springer-Verlag. Wu, N., & Zhang, J. (2006). Factor-analysis based anomaly detection and clustering. Decision Support Systems, 42(1), 375–389. Yairi, T., Kato, Y., Hori, K. (2001). Fault detection by mining association rules from house-keeping data. In Proceedings of the sixth international symposium on artificial intelligence, robotics and automation in space. Anomaly detection in monitoring sensor data for preventive maintenance Introduction Data preprocessing Sensor data for train maintenance Sensors Measurements Readings Granularity in railway data Journey Episode Episode fragment Data classes Data classes Contextual lattice Normal behavior characterization How to extract normal behavior? Contiguous subsequences mining Aggregated sequences Contextualized characterization Knowledge classes Anomaly detection Preliminary definitions Conformity score Concordance score Discordance score Conformity score Smoothing window Experimental results Normal behavior characterization Anomaly detection Additional information about detected anomalies Related work Conclusion References