key: cord-0210890-1cg3iv99 authors: Kabra, Rohan; Saxena, Divya; Patel, Dhaval; Cao, Jiannong title: Time Series Clustering for Human Behavior Pattern Mining date: 2021-10-14 journal: nan DOI: nan sha: 3546327bb6056630b62488d54cd50a8d60d72840 doc_id: 210890 cord_uid: 1cg3iv99
*Authors contributed equally.

Human behavior modeling deals with learning and understanding the behavior patterns inherent in humans' daily routines. Existing pattern mining techniques either assume that human dynamics are strictly periodic, require the number of modes as input, or do not consider uncertainty in the sensor data. To handle these issues, we propose in this paper a novel clustering approach, named MTpattern, for modeling human behavior from time-series data. To mine frequent human behavior patterns effectively, we use a three-stage pipeline: (1) time-series data are represented as a sequence of regularly sampled, equal-sized unit time intervals for better analysis; (2) a new distance measure scheme is proposed to cluster similar sequences while handling temporal variation and uncertainty in the data; and (3) an exemplar-based clustering mechanism is exploited, with its parameters fine-tuned to output the minimum number of clusters under a given permissible distance constraint, without knowing the number of modes present in the data. The average of all sequences in a cluster is then taken as a human behavior pattern. Empirical studies on two real-world datasets and a simulated dataset demonstrate the effectiveness of MTpattern with respect to internal and external measures of clustering.

Modeling human behavior using data originating from social networks, the Internet of Things (IoT), and smart homes is an active area of research. A behavior pattern refers to a recurrent way of acting or conduct by an individual or a group (such as mobile devices, animals, or vehicles) in a physical or virtual environment, while learning and understanding human behavior patterns from raw data is known as human behavior modeling. Researchers have discovered and identified various types of behaviors on the basis of the source and domain of the raw data, such as social behavior using online social networks [1], [2], biological health behavior using smart body sensors [3], online user behavior analysis using clickstream data [4], customer energy consumption behavior [5], customer spending behavior using transaction data [6], human activity behavior using multivariate temporal data [7], and mobility behavior using smart card data, GPS, and WiFi traces [8], [9], [10].

Identifying frequent human behaviors is very challenging in any behavior analysis. The complexity of modeling human behaviors comes from two aspects: the vast variety of human types and the irregular behaviors within each type. Mining patterns in human behavior has wide practical applications in several domains, such as recommendation systems, healthcare, and transportation. For instance, identifying the underlying patterns in human moving behavior at a location, such as a mall or restaurant, during the COVID-19 pandemic (i.e., contact tracing) can help the authorities understand who, and how many people, were in close contact with each other. The main aim of existing human visiting (or mobility) behavior pattern mining techniques is to model the behavior of humans/individuals at the single or group level along the temporal, spatial, or spatio-temporal dimension.
Some popular behavior pattern mining techniques include frequency spectrum analysis [11] to analyze the periodicity of recurring behavior patterns, partition- and model-based time-series clustering [11], [12], [13], [14], [15], [16], and the PCA-based eigenbehavior technique [17] to extract the structure of behavior patterns. Even though existing pattern mining methods have shown potential for mining visiting behavior patterns, they have the following limitations: (1) they assume that temporal dynamics are strictly periodic, whereas in practice human behavior patterns are inherently temporally variable; (2) sensors are not accurate and may produce false negatives (non-deterministic values), which leads to an uncertainty in the data that has not been addressed yet; (3) multi-modality is inherent in the behavior of individuals, meaning that multiple patterns may occur during a given time window, but the number of those patterns is unknown. A pattern may span an entire day or only a short portion of the day, which makes it challenging to identify the time of occurrence and the duration of all patterns. Most existing techniques need the number of behavior modes as input and identify only patterns that span the entire time period.

In many real-world problems, objects are described by a large number of binary features. For instance, documents are characterized by the presence or absence of certain keywords [18]; cancer patients are characterized by the presence or absence of certain mutations [19]; traffic road accident and crash patterns are identified using the presence or absence of accident and crash data [20], [21]; and trading patterns are identified by analyzing the presence or absence of a trading activity [22].

To better understand the attributes and characteristics of the data used to find the behavior of individuals at a location, consider an example. Suppose the longitudinal sequences in Figure 1 represent the observed day-wise visiting sequences of individuals as detected by (inaccurate) sensor(s). A black shade represents the deterministic presence of an individual during the respective time interval, while an unshaded region represents a non-deterministic value, i.e., uncertainty about the individual's presence or absence. Temporal data collected through sensors can carry such uncertainty, as the individual may actually be absent during the time interval or the sensor may have failed to detect the individual's presence (a false negative); this can lead to ambiguity in the observed data. Another aspect of behavior data is the temporal variability of the underlying patterns, because human routine behavior is not perfectly periodic. Moreover, human behavior is multi-modal in nature, and the number of modes is unknown beforehand. In the example, individuals exhibit four modes of mobility behavior: clusters A, B, and C represent patterns spanning the entire time period, whereas cluster D represents a localized pattern. The time of occurrence and the duration of the localized pattern are also unknown, which poses another challenge. In addition, determining the number of modes of behavior (C) in a person's visiting sequence is not an easy task. The optimal choice of C strikes a balance between maximum compression of the data using a single cluster and maximum accuracy by assigning each data point to its own cluster. There are many techniques, such as the elbow method or the silhouette index, to estimate the number of clusters, but these techniques do not work well in non-Euclidean spaces.
One such technique applies PCA before clustering and sets C equal to the number of dominant principal components. This method is not accurate, because it often underestimates the number of clusters in the dataset. Furthermore, the choice of distance measure is of utmost importance for any clustering. Conventional distance measures, such as the Euclidean, Manhattan, and Jaccard distances, compare corresponding time slots and capture neither the temporal dynamics nor the affinity between neighbouring time instances, which is very important in the presence of uncertainty, noise, and variability in the data. Moreover, they are not sensitive to local differences between sequences.

To handle the aforementioned issues, in this paper we focus on extracting and identifying the visiting (or mobility) behavior of individuals at a location, which requires grouping or clustering similar visiting sequences; we name the approach MTpattern. We propose a novel distance measure scheme to find an appropriate dissimilarity measure between visiting sequences that is invariant only to small temporal variations and to uncertainty in the observations. We then use a segment tree data structure to efficiently discover localized frequent patterns in a given time window, or to find frequent patterns of a given length. Next, we propose an effective clustering mechanism that exploits affinity propagation to cluster sequences under the given constraints when the number of modes (or clusters) of behavior is unknown. The average of all visiting sequences in a cluster, taken for every unit time interval, gives a behavior pattern of individuals. We formulate the problem of finding behavior patterns as a constrained optimization problem in which we minimize the number of mutually exclusive clusters that cover the entire dataset, under a given constraint on the maximum permissible local dissimilarity enforced by the dissimilarity metric (this problem is similar to finding a minimum dominating set of vertices in an undirected graph). We perform experiments on a simulated dataset and two real-world datasets; the results show that MTpattern outperforms three baseline clustering algorithms.

The contributions of this paper are as follows: 1) We propose a novel approach, MTpattern, to discover visiting (or mobility) behavior patterns of individuals of a given length by clustering similar visiting sequences, or to find localized frequent patterns in a given time window. 2) We represent time-series data as discretized sequences of regularly sampled, equal-sized unit time intervals for better analysis and propose a novel dissimilarity measure, called TDist, to cluster similar sequences by putting an upper bound on the local dissimilarity, thereby handling temporal variations and uncertainty in the data. We then fine-tune a non-parametric, exemplar-based clustering technique to cluster sequences under the given constraints when the number of behavior modes is unknown. 3) To validate MTpattern and show its usability, we evaluate our proposed approach on two real-world datasets and a simulated dataset. Results show that MTpattern outperforms three baseline clustering algorithms w.r.t. both internal and external measures.

The idea of discovering underlying patterns in individuals' behavior is not new. In this section, we discuss important and popular techniques for modeling mobility behavior using time-series data. The majority of works have employed time-series clustering algorithms to infer behavior patterns, which can be grouped into two main categories: partitional
and hierarchical clustering, and model-based clustering. We also discuss sequential pattern mining and the PCA-based eigenbehavior technique, which are commonly used to find the structure of frequent individual behaviors, along with the limitations and shortcomings of these existing techniques.

Previous works have used partitional (e.g., K-means or K-medoids) [12], [23] or hierarchical [11] clustering techniques for behavior modelling. For these techniques, the user has to either pre-specify the number of clusters or set an upper bound on the overall representational error to stop clustering. Determining the number of clusters, or the maximum permissible representational error, in advance is not trivial and is often subjective. In model-based clustering (HMM, Kalman filter, etc.), probabilistic models are first built on the time series. The asymmetric dissimilarity of time-series A from time-series B is calculated using the a posteriori probability of A given the probabilistic model of B. Probabilistic distance measures, such as the KL distance [13], [14], [15], [16], are popularly used. The major limitation of this approach is the need to pre-determine the number of clusters. Matsubara et al. [24] used a Baum-Welch-based approach for finding distinct high-level patterns in a large set of co-evolving sequences; the Baum-Welch algorithm is a special case of the EM algorithm used to find the unknown parameters of an HMM. However, model-based sequence clustering methods such as HMMs are more sensitive to the order of events and invariant to the actual time of occurrence of the events. Model-based clustering also assumes that the observations are deterministic, which does not hold for real-life data, as the data often contain uncertainty and non-deterministic observations.

In recent times, the a priori-based GSP algorithm [25], the projection-database-based FreeSpan [26] and PrefixSpan [27], and the vertical id-list, lattice-theory-based SPADE [28] algorithm have been used to find frequent subsequences in a given set of categorical time sequences. These methods consider only the topological order of events for finding frequent patterns, instead of the absolute time of occurrence or the duration of the events. Moreover, for interval events, unlike instantaneous events, establishing a topological order is not trivial. Some recent works [29], [30], [31] have proposed mining temporal interval patterns from interval data. For extracting temporal patterns from interval-based sequences [32], Guyet et al. [33] proposed converting temporal sequences into sequences of events linked by Allen's temporal relations. Rawassizadeh et al. [7] proposed a model for identifying frequent behavioral patterns with temporal granularity from real-time data; however, they dealt with temporal events that fit Allen's interval algebra, which is not about time-series analysis. Moreover, the dissimilarity measure used between subsequences when extracting temporal patterns from interval-based sequences (the city-block distance) does not take uncertain (or non-deterministic) observations in the data into account.

Another popular method to model time-series data is the 'eigenbehavior' [17]. Eigenbehaviors are the eigenvectors of the covariance matrix of behavior data, obtained by applying PCA or Singular Value Decomposition (SVD).
The first few eigenvectors (those with the highest eigenvalues) of the decomposition typically account for a very large percentage of the overall variance in the longitudinal data. It is claimed that every non-trivial (high-eigenvalue) eigenvector represents a recurrent dominant pattern. One advantage of this approach is that the number of modes in the data need not be pre-determined or predicted; it can be derived directly from the number of eigenvectors with large eigenvalues. The approach still suffers from a number of shortcomings, however. Like conventional distance measures, it fails to capture temporal dynamics, as it embeds the sequences in Euclidean space and treats every dimension independently, failing to capture the affinity between nearby time instances. The number of eigenvectors is limited by the number of dimensions, and if a cluster can be represented by a linear combination of eigenvectors with higher eigenvalues, then a new eigenvector in the direction of that cluster is not needed; therefore, PCA often underestimates the number of clusters. Since all eigenvectors must be orthogonal to each other, many eigenvectors do not point in the direction of actual clusters, because not all clusters are orthogonal to the principal eigenvector at the mean; thus, not all cluster centroids necessarily lie on some eigenvector. Eigenvectors are also biased toward clusters located at a larger distance from the mean, as such clusters cause a bigger variance in the data.

Our proposed approach handles the issues of temporal dynamics and uncertainty in the data, and at the same time does not require the number of behavior modes as input. To the best of our knowledge, no prior work studies human behavior pattern mining while handling all the above-mentioned issues simultaneously.

Human behavior at a location can help us infer interesting information about the location: simple statistics such as the number of visitors, the average stay time, and the visit frequency can reveal the semantics of a location. Applications exist in a number of domains, such as transportation (analyzing the number of visitors to identify the time of day when overcrowding usually takes place [34]), smart environments (in a residential location, an intelligent lighting, heating, or cooling system can be developed by modelling the presence of people in a room [35], [36]), health (monitoring the habits and mobility of patients can serve as an indicator of overall health [37]), and education (understanding how a campus is used can provide important information and insights to college authorities [38]).

We define the problem as follows. Given raw sensor data tracking the presence of individuals visiting a location over a long period of time and a threshold δ, and assuming that the presence of individuals at a location is not strictly periodic, that sensors are inaccurate and may produce false negatives, and that multi-modality is inherent in the individuals' visiting patterns with the number of modes unknown beforehand, the goal is to transform the raw sensor data into discretized sequences of regularly sampled, equal-sized unit time intervals for better analysis, and then to effectively summarize the frequent behavioral patterns of m individuals in any given time window during the day. Before formulating the problem, we introduce our definitions; Table 1 lists the notation used in this paper. Human behavior is recurring and influenced by a range of factors (such as time).
Here, human behavior under the influence of time is called a "frequent behavioral pattern".

Definition 1. A Time Point Sequence (PS) is an ordered sequence of p timestamps on a day d at which an individual i was detected by the sensor(s).

Definition 5. The complete Ω-covering of a BIS A is the set of all other BISs B_i, 1 ≤ i ≤ m, such that B_i is in the partial Ω-covering of A and, at the same time, A is in the partial Ω-covering of B_i.

Definition 6. Frequent behavioral patterns are the averages of all BISs in a cluster, such that every member of the cluster is in the complete Ω-covering of an exemplar BIS that is also a member of the cluster. The average obtained represents the probability of the presence of individuals in every unit time interval for that particular pattern.

In this section, we discuss our proposed solution, MTpattern, in detail. Figure 2 shows an overview of the approach, which is composed of four major parts:

1) Data Preprocessing. We calculate the Time Interval Sequence (IS) of every individual from the corresponding Time Point Sequence (PS). The Time Interval Sequences are then discretized into sequences of regularly sampled, equal-sized unit intervals, named Discretized Time Interval Sequences (BIS), for better analysis.

2) Segmentation. To facilitate piece-wise analysis, every BIS is hierarchically segmented and stored in a segment tree.

3) Dissimilarity Measure. To calculate a symmetric dissimilarity between pairs of same-length BIS segments that is invariant to uncertainty in the observations and to small temporal variations in the underlying patterns, we propose a novel symmetric dissimilarity metric, called TDist. The dissimilarity matrix is pre-computed for every segment in the segment tree; the dissimilarity (or distance) matrix for any time interval can then be computed from the dissimilarity matrices of the segments in the segment tree.

4) Pattern Discovery. Every row of the dissimilarity matrix represents a cluster, namely the complete Ω-covering of the corresponding BIS. These clusters are overlapping in nature, as any BIS may be a member of more than one complete Ω-covering. Therefore, we further extend our analysis and optimize the discovery of patterns by finding a minimum set of disjoint (non-overlapping) clusters such that every BIS is a member of exactly one cluster and every cluster has an exemplar BIS that contains all members of the cluster in its complete Ω-covering. To minimize the number of clusters, the dissimilarity matrix of the time window is fed into an affinity propagation module whose preference parameter is tuned to minimize the number of unique Ω-coverings that cover all the BISs, while maximizing the net similarity between the member BISs of each cluster and its exemplar BIS. We take the average of all BISs in an Ω-covering, which is a discrete probability distribution, as the representative of the corresponding cluster.

We first analyze and preprocess the collected raw sensor data. Figure 3(a) shows the Time Point Sequences (PS) of m individuals, in which a vertical bar (|) in a row marks the instantaneous timestamp at which a WiFi packet corresponding to a particular MAC address is received (i.e., an individual's presence is detected). However, storing information about closely spaced detections is redundant and costly. Therefore, we calculate the IS of every individual from the corresponding PS by inspecting the time delay between consecutive detections of the individual's presence in the PS.
If the time delay between consecutive bars (|) (i.e., between consecutive captured WiFi packets of an individual) is below a threshold δ, they form a time interval in the IS. We discuss in detail how an appropriate interval threshold δ is calculated in Section 6.1.3. Figure 3(b) shows the m ISs, where each filled rectangle shows a time interval during which an individual is present. However, performing piece-wise analysis on an IS is difficult, as the intervals in the continuous time domain are of varying length (or duration). The ISs are therefore discretized into sequences of regularly sampled, equal-sized unit intervals of length λ. We set λ equal to δ/2 so that any gap in the raw sensor data (consisting of point sequences) that is longer than δ minutes is captured in the BIS after discretization. We obtain the BIS by representing the IS in the discrete time domain: '1' represents a deterministic value and '0' a non-deterministic value. Figure 3(c) shows the discretized time interval sequences (BIS) of m individuals.

A pattern may span an entire time period (a day in our case) or only some time window during the day. Since we do not know when and for how long this regular behavior occurs, we segment every BIS into hierarchical segments to facilitate piece-wise analysis. These segments are arranged in a binary tree data structure called a segment tree, as shown in Figure 4. Every segment is divided into two equal parts at the next level of the segment tree hierarchy: the left child of any segment is the left half of the segment, and the right child is the right half. This data structure is particularly helpful when the solution to a problem can be represented as a combination of the solutions to its subproblems, which holds in our work, as any frequent behavior pattern is also piece-wise frequent and can be represented as a concatenation of the frequent patterns in its smaller pieces or segments. The length of the segments at the leaf nodes of the segment tree determines the granularity, L/2^B, where B is a positive integer in the range (0, log2 L); B = 0 corresponds to the lowest granularity, whereas B = log2 L gives the highest possible granularity (the segments at the leaf nodes are then of unit length). Any time interval [start, end] in the continuous domain can be approximated in the discretized time domain. Higher granularity gives more accurate results but increases the time and space complexity of the algorithm, making it more computationally intensive, so a trade-off between accuracy and complexity is needed; we therefore set L/2^B equal to Ω.

Every BIS segment is also augmented with extensions at both ends. These extensions are used for computing partial dissimilarities between segments at their borders, ensuring that no information is lost near the segment boundaries due to segmentation. The extensions contain only Ω/λ discretized unit time intervals, because when we compute the dissimilarity between two BIS segments, for every unit time interval with value '1' in one BIS we only need to inspect the unit time intervals in the other BIS at an interval temporal distance of less than Ω (discussed in Section 4.3). The augmented BIS with extensions is denoted eBIS.
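As a concrete illustration of the preprocessing pipeline (PS → IS → BIS), consider the following minimal sketch. It assumes timestamps expressed in minutes from midnight and the values δ = 15 and λ = δ/2 = 7.5 used later in the paper; the function names are ours, not from the authors' implementation.

def point_to_intervals(ps, delta=15.0):
    """PS -> IS: merge consecutive detections closer than delta minutes into intervals.

    ps is assumed non-empty and sorted in increasing order.
    """
    intervals = []
    start = prev = ps[0]
    for t in ps[1:]:
        if t - prev > delta:            # gap exceeds delta: close the current interval
            intervals.append((start, prev))
            start = t
        prev = t
    intervals.append((start, prev))
    return intervals

def discretize(intervals, lam=7.5, horizon=24 * 60):
    """IS -> BIS: sample intervals into equal-sized unit bins of length lam = delta/2."""
    n_bins = int(horizon / lam)
    bis = [0] * n_bins                   # 0: non-deterministic (absent, or detection missed)
    for start, end in intervals:
        for b in range(int(start // lam), min(int(end // lam) + 1, n_bins)):
            bis[b] = 1                   # 1: deterministic presence
    return bis

# Example: detections at 9:00, 9:05, 9:20, and 14:00 yield two intervals.
ps = [540, 545, 560, 840]
print(point_to_intervals(ps))            # [(540, 560), (840, 840)]
print(sum(discretize(point_to_intervals(ps))))   # 4 bins marked present

A detection gap longer than δ closes the current interval, and each interval then marks every λ-sized bin it touches as a deterministic '1'.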
In our work, every BIS is a data point in a non-Euclidean space. We define the dissimilarity metric between a pair of BISs by comparing the relative times of occurrence of the unit intervals with value '1' in the two BISs. For every unit interval with value '1' in BIS_1, we calculate the ITDist to the nearest unit interval with value '1' in BIS_2. The average of the ITDist values over all unit time intervals with value '1' in BIS_1 gives the partial dissimilarity of BIS_1 from BIS_2, denoted d(1, 2). We also record the count of unit time intervals with value '1' in BIS_1 as cnt_A. We repeat the process to find the partial dissimilarity of BIS_2 from BIS_1, denoted d(2, 1), and record the number of unit time intervals with value '1' in BIS_2 as cnt_B (see Algorithms 1 and 2). If, for some unit time interval with value '1' in one BIS, there is no unit time interval with value '1' in the other (extended) BIS within temporal distance Ω, the dissimilarity between the two BISs is set to infinite (or undefined), which means that the two BISs can never be linked together (i.e., neither of the two BISs can ever be the exemplar of the other). The overall dissimilarity between BIS_1 and BIS_2, D(1, 2) = D(2, 1), is calculated from the partial dissimilarities given by Algorithm 2. The dissimilarity (or distance) matrix thus obtained is symmetric about the main diagonal, so only one half needs to be stored (see Figure 7).

The value of Ω accounts for the uncertainty in the observed data and the temporal variability of the patterns. It has been shown that temporal patterns of human behavior tend to be normally distributed [39]. This property can be used to model the starting (arrival) and ending (departure) times of behavior patterns without uncertainty as normal distributions. The observed behavior patterns, however, contain uncertainty (false negatives), and the combined effect of normal temporal human behavior and observation uncertainty manifests itself in the observed arrival and departure distributions of every behavior pattern. To obtain a sample of this distribution, we record the timestamp (in the discrete time domain) at which a particular individual was first detected during the day.

We construct the dissimilarity (distance) matrix for any time interval from the dissimilarity matrices of its constituent segments in the segment tree. Let the complete dissimilarity between two BISs A and B of the same length be D(A, B) (see Algorithm 3). We divide A and B into k segments; let segment i of A and B be A_i and B_i, respectively, with complete dissimilarity D(A_i, B_i), and let Cnt(A_i) and Cnt(B_i) be the numbers of unit time intervals with value '1' in A_i and B_i, as described in Section 4.3. Then D(A, B) can be written as the count-weighted combination

D(A, B) = [ Σ_{i=1..k} D(A_i, B_i) · (Cnt(A_i) + Cnt(B_i)) ] / [ Σ_{i=1..k} (Cnt(A_i) + Cnt(B_i)) ].  (1)

Using Eq. 1, MTpattern calculates the dissimilarity between every pair of BISs in any time interval from the pairwise dissimilarity scores of their constituent BIS segments in the segment tree. MTpattern can thus reconstruct the dissimilarity matrix for any time interval without computing the dissimilarity score between every pair of BISs from scratch. The distance measure between any two BISs ensures that, for every interval in a member sequence of an Ω-covering, there exists an interval in the exemplar BIS such that their ITDist is less than the given threshold Ω; we call this minimum ITDist between constituent intervals the local distance. The transitivity property ensures that the maximum local distance between any two BISs in an Ω-covering is less than 2Ω.

In this section, we discuss the discovery of behavior patterns from the database of BISs.
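Before turning to pattern discovery, here is a minimal sketch of the TDist computation. It assumes Ω is expressed in unit intervals and reads the overall score as the count-weighted average of the two partial dissimilarities, consistent with the combination rule of Eq. 1; the function names are ours.

import math

def min_itdist(pos, other_ones, omega):
    """Smallest interval temporal distance from bin `pos` to a '1' bin in the other BIS."""
    best = min((abs(pos - q) for q in other_ones), default=math.inf)
    return best if best < omega else math.inf

def partial_dissimilarity(bis1, bis2, omega):
    """Average MIN-ITDist of every '1' in bis1 to the nearest '1' in bis2."""
    ones1 = [i for i, v in enumerate(bis1) if v == 1]
    ones2 = [i for i, v in enumerate(bis2) if v == 1]
    total, cnt = 0.0, len(ones1)
    for p in ones1:
        d = min_itdist(p, ones2, omega)
        if math.isinf(d):
            return math.inf, cnt         # no match within omega: sequences cannot be linked
        total += d
    return (total / cnt if cnt else 0.0), cnt

def tdist(bis1, bis2, omega):
    """Symmetric overall dissimilarity from the two partial scores (count-weighted)."""
    d12, cnt_a = partial_dissimilarity(bis1, bis2, omega)
    d21, cnt_b = partial_dissimilarity(bis2, bis1, omega)
    if math.isinf(d12) or math.isinf(d21) or cnt_a + cnt_b == 0:
        return math.inf
    return (d12 * cnt_a + d21 * cnt_b) / (cnt_a + cnt_b)

# Two sequences whose active periods are shifted by one unit interval:
print(tdist([0, 1, 1, 1, 0, 0], [0, 0, 1, 1, 1, 0], omega=2))   # small, non-zero (1/3)
print(tdist([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 1, 1], omega=2))   # inf: too far apart

Small temporal shifts thus yield small but non-zero dissimilarities, while sequences whose deterministic intervals lie farther apart than Ω can never be clustered together.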
The dissimilarity matrix represents overlapping BIS clusters, where every row represents the complete Ω-covering of the corresponding BIS (see Figure 7 for an example). These Ω-coverings are overlapping, as any given BIS may be a member of the complete Ω-covering of more than one BIS. If the cardinality of the complete Ω-covering of a BIS is small, the corresponding behavior pattern is not frequent. Ω-coverings that are subsets of other Ω-coverings can also be ignored, and if two Ω-coverings are equal, MTpattern ignores the one for which the average distance of all BISs to the exemplar BIS of the corresponding Ω-covering is higher. If the cardinality exceeds a threshold α, MTpattern takes the average of all BISs in the Ω-covering for every unit time interval to obtain a behavior pattern in the form of a discrete-time probability distribution; in the example of Figure 7, the threshold α is 3.

MTpattern optimizes the clustering of BISs by finding the minimum number of disjoint (non-overlapping) BIS clusters such that every BIS is a member of some cluster and the exemplar BIS of every cluster has every member BIS of that cluster in its complete Ω-covering. This task can be cast as a constrained optimization problem: minimize the number of clusters under the constraint that every BIS is a member of some cluster and every member of a cluster is Ω-covered by the exemplar BIS of the corresponding cluster. If multiple cluster arrangements satisfy this constraint, the arrangement that minimizes the net dissimilarity of all BISs to their corresponding cluster's exemplar BIS should be chosen.

MTpattern achieves this optimization using affinity propagation, a relatively recent clustering technique based on the concept of "message passing" between data points (here, BISs). It starts by considering all BISs as candidate exemplars and exchanges messages between every pair of BISs in every iteration until a good set of exemplars is obtained and the algorithm converges. The advantage of this technique is that the number of clusters need not be pre-specified, and it clusters around "exemplar" BISs [40] (members of the input set that are good representatives of their corresponding clusters). This suits the problem of behavior modeling, since the number of underlying modes or clusters in the visiting sequences of individuals cannot be known beforehand. MTpattern minimizes the number of clusters by tuning the preference parameter of affinity propagation ("The preference of point i, called p(i) or s(i, i), is the a priori suitability of point i to serve as an exemplar. Preferences can be set to a common global value, or customized for every data point. High values of the preferences will cause affinity propagation to find many exemplars (clusters), while low values will lead to a small number of exemplars (clusters)" [40]). The preference parameter determines the granularity of the clusters; its value is usually set to the median of the input similarities, which outputs a moderate number of clusters. It can be shown mathematically that, by setting the preference to a very large negative value (negative infinity), affinity propagation instead converges to a solution with the minimum number of clusters such that every BIS in every cluster is in the Ω-covering of the exemplar BIS of the corresponding cluster.
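This preference tuning can be tried directly with an off-the-shelf implementation. Below is a minimal sketch using scikit-learn's AffinityPropagation on a precomputed similarity matrix; the toy matrix and the stand-in value -1e4 for "very large negative" are illustrative assumptions, not the paper's actual settings.

import numpy as np
from sklearn.cluster import AffinityPropagation

# Similarity = negative TDist; undefined (infinite) dissimilarities are mapped
# to a similarity so low that the pair can never be linked.
D = np.array([[0.0,    0.3,    np.inf],
              [0.3,    0.0,    np.inf],
              [np.inf, np.inf, 0.0]])
S = np.where(np.isinf(D), -1e6, -D)

ap = AffinityPropagation(affinity="precomputed",
                         preference=-1e4,       # << median: fewest clusters
                         random_state=0).fit(S)
print(ap.cluster_centers_indices_)              # exemplar BIS indices
print(ap.labels_)                               # cluster membership per BIS

Here the first two BISs share an exemplar while the third, unreachable within Ω, forms its own cluster, mirroring the behavior described above.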
In the following equations, MTpattern uses a similarity metric instead of the dissimilarity metric, where a similarity score is the negative of the corresponding dissimilarity score. In Eq. 2, s(S_i, S_i) represents the self-similarity, or preference, of data point (BIS) i; MTpattern sets the preference of all BISs to the same global value, which ensures that affinity propagation is not biased toward choosing any particular BIS as an exemplar beforehand. In Eq. 3, s(i, j) is the similarity score between BIS i and BIS j, equal to the negative of their dissimilarity score; C represents the total number of clusters and ex_c the exemplar BIS of cluster c. In Eq. 4, MTpattern sets the value of the preference to a large negative value, much smaller than Inter_Sim, the sum of all pairwise similarities between similar BISs.

Affinity propagation seeks the number of clusters that maximizes the total similarity of each cluster, measured as the sum of the similarities between the non-exemplar BISs and their exemplar BIS plus the sum of the preferences of the selected exemplar BISs. A formal statement of the MTpattern optimization problem underlying affinity propagation begins with two inputs: the similarity matrix s(i, j) and the preference. Two sets of decision variables are associated with the optimization problem: y_c = 1 if BIS c is selected as an exemplar and 0 otherwise, for 1 ≤ c ≤ C; and x_ic = 1 if BIS i is assigned to the cluster for which BIS c serves as exemplar and 0 otherwise, for 1 ≤ i ≤ m and 1 ≤ c ≤ C. The integer linear programming formulation of the problem can then be stated as follows:

maximize Net_Sim = Σ_{i=1..m} Σ_{c=1..C} s(i, ex_c) · x_ic + Σ_{c=1..C} s(ex_c, ex_c) · y_c  (5)

subject to

Σ_{c=1..C} x_ic = 1, for all 1 ≤ i ≤ m, with ITDist(i, ex_c) < Ω;  (6)

x_ic ≤ y_c, for all 1 ≤ i ≤ m and 1 ≤ c ≤ C;  (7)

x_cc = y_c, for all 1 ≤ c ≤ C;  (8)

x_ic ∈ {0, 1}, for all 1 ≤ i ≤ m and 1 ≤ c ≤ C;  (9)

y_c ∈ {0, 1}, for all 1 ≤ c ≤ C.  (10)

The objective function (Eq. 5) of the optimization problem is Net_Sim; the first and second terms on the right-hand side are the BIS similarity and the exemplar preference similarity, respectively. Constraint set (6) guarantees that each BIS is assigned to exactly one exemplar and that each non-exemplar BIS in a cluster is Ω-covered by the exemplar BIS of the corresponding cluster, under the given constraint on local dissimilarity. Constraint set (7) ensures that a BIS is not assigned to a BIS that has not been selected as an exemplar. Constraint set (8), incorporated in the affinity propagation algorithm, ensures that if a BIS is selected as an exemplar, it is assigned to the cluster for which it serves as the exemplar. Finally, constraint sets (9) and (10) enforce the binary restrictions on the x_ic and y_c variables, respectively.

MTpattern maximizes Net_Sim just as affinity propagation does [40]. For x_ic = 1 and y_c = 1, we obtain Eq. 13, in which the exemplar term Σ_c s(ex_c, ex_c) ≈ C × preference, since s(ex_c, ex_c) equals the preference (from Eq. 2 and 4). As the preference is set to −∞, increasing the number of clusters C decreases Net_Sim, because the preference is negative and the goal is to maximize Net_Sim. Therefore, a globally stable solution minimizes the total number of clusters C at convergence.

Once clusters are obtained, the average of all BISs in a cluster is taken for every unit time interval. This average gives a probabilistic view of an individual's behavior in every time slot and is the manifestation of the cumulative effect of all BISs in the same cluster.
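For reference, the objective of Eq. 5 is straightforward to evaluate for any candidate clustering. A small sketch, assuming S is the (negative-TDist) similarity matrix, labels maps each BIS to a cluster index, and exemplars lists the exemplar BIS of each cluster (names are ours):

def net_sim(S, labels, exemplars, preference):
    """Eq. 5: member-to-exemplar similarities plus preference of every exemplar."""
    member_term = sum(S[i, exemplars[c]]
                      for i, c in enumerate(labels)
                      if i != exemplars[c])       # non-exemplar BISs only
    exemplar_term = preference * len(exemplars)   # s(ex_c, ex_c) = preference per exemplar
    return member_term + exemplar_term

# With preference -> -inf, each extra cluster lowers Net_Sim by |preference|,
# so the maximizing solution uses as few clusters as the Omega constraint allows.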
Because the dissimilarity matrix is sparse, the number of messages exchanged between data points in every iteration of affinity propagation is significantly smaller, reducing both the time complexity (assuming serial message passing) and the space complexity of clustering. In Figure 7, after applying affinity propagation, we obtain two clusters: the exemplar BIS of the first cluster is S-3, and the cluster contains S-1, S-3, S-4, S-6, and S-7; the exemplar BIS of the second cluster is S-5, and it contains S-2 and S-5. Net_Sim (see Eq. 13) is maximal in this configuration.

In this section, we discuss the time and space complexity of the two main components of the proposed solution, i.e., the dissimilarity measure and pattern discovery, in detail. The time complexity of the distance measure (Algorithm 3) between two segments of length N_s each is given by the total number of MIN-ITDist computations. The maximum number of times MIN-ITDist is executed for a pair of segments equals the total number of 1's in both sequences combined, which is 2N_s in the worst case, and one invocation of MIN-ITDist runs in O(Ω) time; hence, the complexity of one distance computation is 2·O(Ω)·N_s, i.e., O(N_s Ω). Since there are e^2 pairs of sequences for every segment (where e is the total number of days), the worst-case time complexity to generate the distance matrix for a time segment of length N_s is O(e^2 N_s Ω).

The above distance matrix represents one node in the segment tree, and a similar operation is needed for every node (every segment at every level). The time complexity of constructing the distance matrices for the whole segment tree equals the sum of the time complexities over all nodes. If the total length of the root node (the unsegmented sequence) is L, the height of the tree is O(log2 L), since the segment length halves at every level, and the total number of leaf nodes is 2^(log2 L) = L. The time complexity of computing the distance matrices of all leaf nodes is therefore O(L e^2 N_s Ω). Once the distance matrices of the leaf nodes are computed, the distance matrices of the non-leaf nodes can be computed from those of their children without any information loss, i.e., by using the eBIS. The time complexity of generating the distance matrices for all non-leaf nodes at a given depth d is the cost of one node, O(e^2) (the distance between one pair of segments is computed in constant time from the distance matrices of its children, and there are e^2 pairs), multiplied by the number of nodes at that level, O(2^d), giving O(e^2 2^d). The total cost, summing leaf and non-leaf nodes, is O(L e^2 N_s Ω) + Σ_d O(e^2 2^d) = O(L e^2 N_s Ω). Thus, building the hierarchical segments above the leaf nodes does not asymptotically cost more than computing the distance matrices of the leaf nodes alone.

For clustering, MTpattern uses affinity propagation, whose underlying concept is belief propagation; the worst-case time complexity is O(r' d_p^2), where r' is the total number of iterations until convergence and d_p is the total number of data points. For a sparse distance matrix, this time complexity is lower, as messages are not passed between data points whose distance is not defined (i.e., infinite). For a given distance matrix with d_p data points, the time complexity is O(r' d_p^2) ⇒ O(d_p^2). The space complexity of storing the distance matrices for the entire segment tree equals the size of one node multiplied by the total number of nodes. The exhaustive search for all Ω-coverings of a given time interval is performed in place, so it requires no extra space. For clustering with affinity propagation, two messages (responsibility and availability) from every other data point must be kept in memory, so the space complexity of a single instance of affinity propagation is O(e^2).
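The segment-tree reuse analyzed above amounts to one constant-time combination per pair of segments. A sketch of that step, assuming the count-weighted combination of Eq. 1 (our reconstruction of the combination rule):

import math

def combine(d_left, cnt_left, d_right, cnt_right):
    """Parent D(A, B) from the child segments' dissimilarities and '1'-counts.

    cnt_left / cnt_right stand for Cnt(A_i) + Cnt(B_i) of the left/right child.
    """
    if math.isinf(d_left) or math.isinf(d_right):
        return math.inf                  # an unlinkable half makes the whole pair unlinkable
    total_cnt = cnt_left + cnt_right
    if total_cnt == 0:
        return 0.0
    return (d_left * cnt_left + d_right * cnt_right) / total_cnt

Applying combine bottom-up over the tree reproduces the distance matrix of any ancestor interval without recomputing MIN-ITDist, which is what keeps the non-leaf levels asymptotically free.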
At a high level, an effective clustering algorithm should cluster similar users together and different ones separately, so we evaluate the quality of our behavioral clusters by how well they capture similar users. Internal and external evaluation measures are used to assess clustering quality: internal criteria apply when no ground truth is available for the dataset, while external criteria apply when ground truth is available. We conduct three experiments from different perspectives to evaluate and compare our proposed approach, MTpattern, with the baselines. We select three widely used clustering algorithms as baselines: K-means, Hierarchical Clustering (HC), and a variant of HMM, Expectation-Maximization (EM) [41]. We also compare MTpattern with one of the most popular dimensionality-reduction techniques, PCA.

Experiment 1: Evaluation and Comparison with Baselines through Internal Criteria. With internal measures, the clustering evaluation relies only on the result itself, i.e., on the structure of the found clusters and the relationships among them. Evaluating clustering quality with internal measures is preferred in several real-world scenarios, since ground truth is not always available with the data and data labeling is an expensive task.

Experiment 2: Evaluation and Comparison with Baselines through External Criteria. In most real applications, complete knowledge of the ground truth is not available; external measures are therefore widely used to evaluate synthetic data, and we create a synthetic dataset for this purpose.

Experiment 3: Evaluation of the Distance Measure. Clustering quality depends strongly on the distance measure used between the data objects. We compare our proposed distance measure with the widely used Euclidean distance (ED) and Dynamic Time Warping (DTW).

In MTpattern, results are generated for two values of the affinity propagation preference: the median of the similarity scores, and a value much smaller than the median (≈ −∞). In the WiFi dataset, each record is of the form <device id, MAC address, Time, Client RSSI, Location>, where device id is a PSU name, MAC address is the MAC id of the detected individual, Time is the physical time at which the individual is detected, Client RSSI is the signal strength of the individual, used to estimate the physical distance (in meters) between the PSU and the detected individual, and Location captures the GPS coordinates of the PSU, comprising latitude, longitude, and accuracy. PSUs periodically upload the collected data to a cloud server, where the data are stored in a timestamped manner. Moreover, for security and privacy reasons, all MAC addresses are strictly anonymized using the AES algorithm. Experiments are carried out on the Indian Institute of Technology, Roorkee (IITR) campus. IITR is an academic and research institute in the state of Uttarakhand, India, with a 1.48 km^2 campus comprising many facilities, such as academic departments, administrative buildings, hostels, a library, banks, a post office, a hospital, schools, canteen shops, etc. We set up our PSUs in the UGPC lab of the Department of Computer Science and Engineering, IITR, and tracked all smart devices with WiFi turned on for 3 months using SmartITS.
In total, 11,853 records were collected at the UGPC lab. The Reality Mining dataset was collected by the MIT Media Lab, where 95 academic mobile phone users were tracked for approximately 9 months; we use this dataset to mine the visiting patterns of homes (using the cell tower associated with the home ids) [43]. Both real-world datasets contain uncertainty: in the WiFi dataset, sensors may fail to capture packets emitted by individuals' smart devices, while the Reality Mining dataset can lose cell tower signal or record inconsequential tower transitions due to the dense tower network and overlapping tower ranges. Both datasets also share the common uncertainty arising from the inherent temporal variability of any individual's nature.

We set the threshold δ to 15 minutes based on an experimental analysis. To find an appropriate value of δ, we analyzed the duration between consecutive WiFi packets emitted by the same stationary device in the direct line of sight and within the range of the adapter, using four Android phones, one iPhone, and one Windows phone. Inspecting the time delay between all consecutive WiFi packets from the same device, we find that for more than 95% of packets the delay from the previous packet is below 15 minutes (see Figure 9). Therefore, we set δ to 15 minutes, with a false-negative rate of less than 5%. If devices are not placed in the direct line of sight and are not static, the average false-negative rate for δ = 15 minutes increases to 38% due to packet loss. In other words, δ reflects the transmission interval of wireless packets sent by client smart devices to advertise their presence to nearby devices or to actively discover access points in their proximity. The computation of δ depends on the data collection strategy: in the Reality Mining dataset, for example, Bluetooth scans occur periodically at 5-minute intervals, whereas we collect all incoming wireless packets and preprocess them into the Time Point Sequence (PS) shown in Figure 3, and then use δ to convert the point-sequence data into interval-sequence data. Furthermore, δ determines the width of the discretized bin, which in turn decides the granularity of clustering; the size of each cluster is described by this granularity. We can therefore say that we use the threshold δ to decide the size of a cluster instead of pre-specifying the number of clusters.

Clustering efficiency can be measured accurately on synthetic datasets, since the true distribution and its modes are known. We generate different planted patterns, with pre-determined temporal variability and sensor uncertainty, to remove any bias that may be present in the real-world datasets above. To simulate temporal variance, we vary the start and end of the sequences in a given visiting mode/pattern so that the start/end points follow a normal distribution around a mean, with 3σ (standard deviations) equal to 4λ (1λ = 7.5 minutes, 4λ = 30 minutes). When clustering the planted patterns, we set Ω between 30 minutes (4λ) and 1 hour (8λ). Note that, by definition, 99.74% of the data points lie between −3σ and 3σ, so we have significantly reduced the probability of any MIN-ITDist exceeding Ω because of temporal variation, in line with our assumption of a small inherent temporal variability in human actions. We set the visiting pattern with an error probability of 0.2, i.e., for a given IS, the sensor can fail to sense the signal with probability 0.2 even though the individual is present. The probability of erroneously failing to sense the signal over consecutive unit intervals falls as a power of 0.2, so the false-negative probability over 30 minutes (4λ) is 0.2^4 = 0.0016, which is again very small. The dataset contains 1000 different, random patterns.
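A minimal sketch of this planted-pattern generator, assuming λ = 7.5-minute bins over a day (192 bins), 3σ = 4 bins of temporal jitter on the start/end, and a per-bin false-negative probability of 0.2; the function names and the example mode are ours:

import numpy as np

rng = np.random.default_rng(0)

def planted_bis(mean_start, mean_end, n_bins=192, sigma=4 / 3, p_miss=0.2):
    """One observed BIS: normally jittered start/end plus per-bin sensor misses."""
    start = int(round(rng.normal(mean_start, sigma)))
    end = int(round(rng.normal(mean_end, sigma)))
    bis = np.zeros(n_bins, dtype=int)
    bis[max(start, 0):min(end, n_bins)] = 1
    miss = rng.random(n_bins) < p_miss       # sensor fails with probability 0.2
    return np.where(miss, 0, bis)

# One mode of behavior, e.g. presence roughly from bin 72 (9:00) to bin 112 (14:00):
dataset = np.stack([planted_bis(72, 112) for _ in range(100)])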
We evaluate MTpattern using two internal measures: the number of clusters and the accuracy score. Both metrics consider some data points as representatives of each cluster. We define them formally below. 1) Number of clusters: the total number of clusters obtained for the given data. 2) Accuracy score: how well an output pattern represents all members of a cluster; a small score means high accuracy and a high score means low accuracy.

We plot the total number of clusters obtained after applying MTpattern and the baseline clustering approaches. Figures 10(a), (b), and (c) show that, on all three datasets, the total number of clusters after clustering is lower for our proposed technique, MTpattern, than for all baseline approaches. For all approaches, the number of clusters decreases as Ω increases. MTpattern tunes affinity propagation to give the minimum number of clusters while satisfying the upper bound on local dissimilarity; there is no objective way to achieve such minimization for PCA or for any other partition-based clustering, i.e., K-means and HC.

Since eigenbehavior outputs eigenvectors (of unit magnitude), we first convert the frequent behavior patterns obtained from our proposed approach into unit vectors and then calculate the accuracy score as per Eq. 14. In our case, there are only two classes, present and absent. Every visiting sequence is compared with the unit vector representing its cluster: every correct prediction (in a unit time interval) is rewarded and every wrong prediction is penalized. The accuracy score is summed over all segments at all segmentation levels for all individuals to obtain the overall accuracy score of a technique. Here, e is the total number of sequences (days), N_s is the length of a segment, N_c is the total number of classes (present and absent), S_ij is the j-th bit of sequence i, and P_ijc is the probability, under P_i (the frequent unit-vector pattern representing the cluster to which sequence i belongs), that the j-th bit of sequence i is of class c. The equation rewards correct predictions and penalizes wrong ones.

Figures 11(a), (b), and (c) show that the accuracy score on all three datasets is smaller for our proposed technique, MTpattern, than for PCA and the baseline approaches. A small accuracy score means high accuracy, i.e., MTpattern has higher cluster accuracy than the baselines, because it takes into account the local proximity between discrete time intervals, unlike the baseline approaches, which treat every time interval as an independent dimension.
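As a rough sketch of this scoring, one plausible form consistent with the description of Eq. 14 (correct predictions rewarded, wrong ones penalized, lower is better) is the average probability mass that the cluster representative assigns to the wrong class in each unit interval; the exact weighting in Eq. 14 may differ, so this is an assumed stand-in, not the authors' formula.

import numpy as np

def accuracy_score(S, P):
    """S: (e, Ns) binary sequences; P: (e, Ns) per-bin probability of class 'present'.

    Each row of P is the representative pattern of the cluster the sequence belongs to.
    """
    # Probability mass on the wrong class per unit interval, averaged over all bins.
    wrong = np.where(S == 1, 1.0 - P, P)
    return wrong.sum() / S.size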
Even though the eigenbehavior representation has been used to find the structure of human behavior, it has several drawbacks. First, PCA is a dimensionality-reduction method that does not exploit domain-specific knowledge; moreover, assigning physical meanings to the weight values in eigenvectors is challenging and can be too subjective for discovering humans' behavior patterns, as the eigenvectors themselves have no well-defined physical meaning and are only used to project data to a low-dimensional space. Second, the eigenbehavior representation does not consider the uncertainty in human behavior. Third, eigenvectors are not suitable for clustering, as this process lacks theoretical support. Our proposed model, MTpattern, compensates for these drawbacks. After performing multi-modal analysis for each time slot, frequent co-occurring behaviors in different time slots of a day can be presented, which gives a broader view of individuals' behavior. The correlated behaviors need not be serial or contiguous; e.g., a peculiar early-morning behavior may lead to some different behavior at night. We owe this advantage to the segmentation of the day into time slots; earlier techniques like eigenbehavior fail to capture non-contiguous co-occurring frequent behaviors.

As already mentioned, determining the number of clusters in advance, or the maximum permissible representational error, is not trivial and is often subjective; partition-based clustering algorithms such as K-means and HC, and model-based clustering such as EM, suffer from this issue. Moreover, MTpattern puts an upper bound on the local dissimilarity allowed for two sequences to be considered similar, whereas no such provision exists in PCA, partition-based clustering, or model-based clustering. We also observe that model-based sequence clustering methods such as HMM and EM are more sensitive to the order of events and invariant to the actual time of occurrence of the events, as shown in Figure 12. We fix the number of hidden states to 2 (there is no golden rule for finding the correct number of hidden states in an HMM). As it turns out, EM produces the same sequence of hidden states in both cases, and the EM models of both cases also have the same transition and emission probability matrices, so EM-based clustering renders these two sequences indistinguishable. We also assume the observations in Figure 12 to be deterministic, which is a prerequisite for model-based clustering to work; real-world data, however, often contain uncertainty and non-deterministic observations, as shown in Example 1 (see Figure 1).

We also observe from the results in Figures 10 and 11 that the synthetic dataset shows performance patterns similar to the real-world datasets for both MTpattern and the baseline clustering approaches. This gives confidence that the synthetic dataset has a data distribution similar to the real-world datasets and can therefore be used further for evaluating clustering quality. Furthermore, we evaluate the time efficiency of MTpattern for the optimal number of clusters achieved on the WiFi dataset (Ω = 15 mins) against the K-means, EM, and hierarchical clustering algorithms. The execution times of MTpattern, K-means, EM, and hierarchical clustering are 9 mins, 28 mins, 16 mins, and 43 mins, respectively. MTpattern outperforms methods that randomly re-sample potential centers. Finding C clusters within d_p data points involves a search space of size (d_p choose C), i.e., [d_p(d_p−1)(d_p−2)...(d_p−C+1)]/[C(C−1)(C−2)...(2)(1)], which is polynomial in d_p if C is within a fixed distance of 1 or d_p, but exponential in d_p if C grows linearly with d_p.
k-centers clustering algorithms such as K-means depend on random initialization and perform better with a smaller search space, as the initialization is then more likely to be sampled from a region near the optimal solution; we can therefore say that these methods work well when C is close to 1 or d_p. MTpattern, on the other hand, exploits affinity propagation, which does not rely on random sampling: it exchanges messages between the d_p data points while considering all of them as potential exemplars, and it works better in the large-search-space situation where C is not close to 1 or d_p.

We evaluate MTpattern using three external measures: purity, Rand Index, and F-measure, defined formally below. For purity, every cluster is assigned to its most frequent class, and the accuracy of this assignment is the number of correctly assigned data points divided by the total number of data points. A high purity percentage shows that patterns are classified correctly, while a low purity percentage indicates wrong classification. The purity for K-means, HC, and EM is 68%, 66%, and 69%, respectively, at Ω = 30 mins, since these approaches are not temporally sensitive. The purity of MTpattern for preference << median (≈ −∞) is 98% at Ω = 30 mins. This result shows that MTpattern clusters sequences with much higher accuracy than the baselines. As Ω increases, purity and the number of clusters naturally decrease (note that the number of clusters is the inverse of data compression), since sequences that are farther apart and belong to different classes may be grouped into the same cluster.

We assign two items to the same cluster iff they are similar. A true positive (TP) decision assigns two similar items to the same cluster, and a true negative (TN) decision assigns two dissimilar items to different clusters; a false positive (FP) decision assigns two dissimilar items to the same cluster, while a false negative (FN) decision assigns two similar items to different clusters. The Rand Index (RI) measures the percentage of decisions that are correct, i.e., the accuracy, but it gives equal weight to FPs and FNs. Separating similar items is sometimes worse than putting pairs of dissimilar items in the same cluster; therefore, we also use the F-measure (F-score) to evaluate clustering quality, penalizing FNs more strongly than FPs by selecting a value β > 1. Table 2 compares the K-means, HC, EM, and MTpattern algorithms in terms of purity, Rand Index (RI), and F-measure. The purity, RI, and F-measure of MTpattern are significantly higher than those of the baseline approaches, confirming a better clustering quality.
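A compact sketch of these three external measures, with pair-wise TP/FP/FN/TN counting for the Rand Index and the F-measure (β > 1 penalizes false negatives more strongly, as discussed above); the helper names are ours:

from itertools import combinations
from collections import Counter

def purity(labels, truth):
    """Fraction of points matching the majority ground-truth class of their cluster."""
    clusters = {}
    for l, t in zip(labels, truth):
        clusters.setdefault(l, []).append(t)
    return sum(Counter(m).most_common(1)[0][1] for m in clusters.values()) / len(truth)

def pair_counts(labels, truth):
    """Classify every pair of items as TP, FP, FN, or TN."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_c, same_t = labels[i] == labels[j], truth[i] == truth[j]
        tp += same_c and same_t
        fp += same_c and not same_t
        fn += not same_c and same_t
        tn += not same_c and not same_t
    return tp, fp, fn, tn

def rand_index(labels, truth):
    tp, fp, fn, tn = pair_counts(labels, truth)
    return (tp + tn) / (tp + fp + fn + tn)

def f_measure(labels, truth, beta=2.0):
    tp, fp, fn, _ = pair_counts(labels, truth)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (beta**2 + 1) * p * r / (beta**2 * p + r)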
There are many methods to calculate distance information, and the choice of distance measure is a critical step in clustering: it determines how the similarity between two elements is calculated and influences the shape of the clusters. We compare our proposed distance measure with the well-known Euclidean distance (ED) and Dynamic Time Warping (DTW), where DTW is a generalization of ED. Our novel distance measure between discrete time series takes into account the temporal proximity of sequences and guarantees an upper limit on the maximum variance within a cluster; we feed the resulting sparse distance matrix into affinity propagation, which finds naturally dense clusters.

Table 3 shows the F-measure of the affinity propagation clustering algorithm under the different distance measures: ED, DTW, and our proposed measure, TDist. The results show that TDist outperforms ED and DTW. The reason is that the conventional distance measures used in clustering, such as the Euclidean, Manhattan, Jaccard, and Kullback-Leibler distances, consider every time slot an independent dimension and hence fail to capture the temporal dynamics between neighbouring time instances. Dynamic Time Warping (DTW), in contrast, is widely used to find the dissimilarity between temporal sequences independently of some non-linear variations between them and, unlike conventional distance measures, takes into account the affinity between neighbouring time instances. However, DTW only takes the difference in magnitude between the two sequences after aligning them and is not suitable for binary or categorical time series with uncertainty, as it does not account for the extent of warping needed to align two sequences perfectly. The DTW distance for categorical time series is always 0 after alignment, making it impossible to distinguish sequences by the extent of non-linear variation between them. Moreover, DTW must map every observation to some observation of the other signal, which may lead to unexpected results in the case of uncertain or non-deterministic observations.

In this paper, we mine temporally variable patterns from uncertain temporal data. We propose a novel approach, MTpattern, to effectively cluster the behavior patterns of individuals from temporal data. We propose a dissimilarity measure, called TDist, between visiting sequences that considers only the temporal distance between deterministic values; it is sensitive to local temporal differences between sequences and restricts the local temporal difference between similar sequences below a threshold. Since the number of patterns exhibited by an individual in a given segment cannot be known, we employ affinity propagation, a non-parametric, exemplar-based clustering technique, and optimize the clustering by tuning the preference value to output the minimum number of clusters such that every member sequence of a cluster is similar to the exemplar sequence of the cluster. We also use a segment tree data structure to pre-compute distance matrices for segments, which can in turn be used to compute the distance matrix of any interval. Our extensive experiments show that MTpattern outperforms several baseline clustering approaches. In the future, we plan to correlate behavior patterns from multi-source data with various external factors to help understand human behavior more accurately.
[1] Understanding user behavior in online social networks: A survey
[2] Modeling relationship strength in online social networks
[3] Integrating modelling and smart sensors for environmental and human health
[4] Unsupervised clickstream clustering for user behavior analysis
[5] Learning customer behaviors for effective load forecasting
[6] PurTreeClust: A clustering algorithm for customer segmentation from massive customer transaction data
[7] Scalable daily human behavioral pattern mining from multivariate temporal data
[8] Clustering smart card data for urban mobility analysis
[9] Analyzing web behavior in indoor retail spaces
[10] Scalable detection of crowd motion patterns
[11] Mining periodic behaviors for moving objects
[12] Clustering of time series data: a survey
[13] Similarity-based clustering of sequences using hidden Markov models
[14] A unified framework for model-based clustering
[15] Time series clustering: Complex is simpler
[16] A probabilistic distance measure for hidden Markov models
[17] Eigenbehaviors: identifying structure in routine
[18] A Bayesian non-parametric method for clustering high-dimensional binary data
[19] BioNumerics tutorial: Clustering a binary data set
[20] Clustering approach toward large truck crash analysis
[21] Identifying primary and secondary crashes from spatiotemporal crash impact analysis
[22] Binary time series models driven by a latent process
[23] Clustering weekly patterns of human mobility through mobile phone data
[24] AutoPlait: Automatic mining of co-evolving time sequences, ACM SIGMOD International Conference on Management of Data
[25] Mining sequential patterns: Generalizations and performance improvements
[26] FreeSpan: Frequent pattern-projected sequential pattern mining
[27] Mining sequential patterns by pattern-growth: the PrefixSpan approach
[28] SPADE: An efficient algorithm for mining frequent sequences
[29] Mining temporal patterns with quantitative intervals
[30] Discovery of quantitative sequential patterns from event sequences
[31] Uncertainty interval temporal sequences extraction
[32] Extracting temporal patterns from interval-based sequences
[33] Maintaining knowledge about temporal intervals
[34] A scalable distributed stream mining system for highway traffic data
[35] Trace analysis and mining for smart cities: issues, methods, and applications
[36] Privacy preserving data analytics for smart homes
[37] Monitoring the habits of elderly people through data mining from home automation devices data
[38] StudentLife: assessing mental health, academic performance and behavioral trends of college students using smartphones
[39] Temporal modeling of human activity in smart homes
[40] Clustering by passing messages between data points
[41] Maximum likelihood from incomplete data via the EM algorithm
[42] SmartITS: Smartphone-based identification and tracking using seamless indoor-outdoor localization
[43] Reality mining: sensing complex social systems