key: cord-0043198-hhqkibpv authors: Fournier-Viger, Philippe; Yang, Yanjun; Lin, Jerry Chun-Wei; Frnda, Jaroslav title: Mining Locally Trending High Utility Itemsets date: 2020-04-17 journal: Advances in Knowledge Discovery and Data Mining DOI: 10.1007/978-3-030-47436-2_8 sha: 48fe2aad4f383586c3787b1518b33f99ba242e81 doc_id: 43198 cord_uid: hhqkibpv High utility itemset mining consists of identifying all the sets of items that appear together and yield a high profit in a customer transaction database. Recently, this problem was extended to discover trending high utility itemsets (itemsets that yield an increasing or decreasing profit over time). However, an important limitation of that problem is that it is assumed that trends remain stable over time. But in real-life, trends may change in different time intervals due to specific events. To identify time intervals where itemsets have increasing/decreasing trends in terms of utility, this paper proposes the problem of mining Locally Trending High Utility Itemsets (LTHUIs) and their Trending High Utility Periods (THUPs). Properties of the problem are studied and an efficient algorithm named LTHUI-Miner is proposed to enumerate all the LTHUIs and their THUPs. An experimental evaluation shows that the algorithm is efficient and can discover insightful patterns not found by previous algorithms. Frequent itemset mining (FIM) is a popular data mining task, which consists of enumerating all sets of values (items) that have a support (occurrence frequency) that is no less than a minimum threshold in a transaction database [5] . FIM has recently been generalized as high utility itemset mining (HUIM) to consider items having non binary purchase quantities in transactions and weights indicating their relative importance [2, 4, 10, [13] [14] [15] . The goal of HUIM is to find all itemsets that have a high utility (e.g. yield a high profit). Though, HUIM is useful to understand customer behavior, a key problem of HUIM is that the time dimension is ignored. But in real-life, the utility of itemsets vary over time. For example, the sales of some products in a retail store may increase or decrease over a few weeks as it loses or gains in popularity. To discover high utility itemsets that have an increasing or decreasing utility over time, the problem of mining trending HUIs was proposed [9] . However, this problem only focuses on discovering itemsets that have trends spanning over the whole database (e.g. a set of products having sales that always follows an upward or downward trend). But that assumption is often unrealistic as an itemset may have upward or downward trends only during some time periods rather than in the whole database. For instance, the utility (profit) generated by the sale of sunscreen in a store may have an upward trend from May to July but not during the whole year. It is thus an important challenge to design algorithms to identify trends in non predefined time intervals. This is also challenging as it requires to not only consider a large search space of itemsets but also of time intervals. This paper addresses this issue by proposing a novel problem of mining locally trending high utility itemsets (LTHUIs), that is to find all time intervals where itemsets have a high utility and show an upward or downward trend. To efficiently discover these patterns, this paper proposes a novel algorithm named Locally Trending High Utility Itemset Miner (LTHUI-Miner). It relies on novel upper-bounds and pruning techniques. An experimental evaluation on real transaction data shows that the proposed algorithm has excellent performance and can discover insightful patterns not found by previous algorithms. The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 defines the proposed problem of LTHUI mining. Then, Sect. 4 describes the designed algorithm, Sect. 5 presents the experimental evaluation. Lastly, Sect. 6 draws a conclusion and discusses future work. HUIM extends FIM [1, 5] and thus algorithms for these problems have similarities. However, there is also a key difference. FIM algorithms discover frequent itemsets by relying on the anti-monotonicity property of the support measure, which states that the support of an itemset cannot be greater than that of its subsets [1, 5] . This is a very powerful property to reduce the search space, but it does not hold for the utility measure in HUIM. To mine high utility itemsets efficiently, state-of-the-art HUIM algorithms such as Two-Phase [14] , HUI-Miner [15] , d2hup [13] and HU-FIMi [10] introduced various upper-bounds on the utility measure that respect the anti-monotonicity property to reduce the search space, and novel data structures to perform utility computation efficiently. Though HUIM is useful to reveal profitable customer behavior, few HUIM algorithms consider the time dimension. The PHM algorithm [3] finds patterns that periodically appear and yield a high profit (e.g. a customer buys wine every week). The RUP algorithm [8] finds itemsets that recently had a high utility by applying a decay function to the utility measure (recent events are considered more important in utility calculations). And recently, to discover itemsets that follow some trends such as an increase or decrease in utility, the TPHUI-Miner algorithm [9] was designed. However, a major limitation of the three above algorithms is that they find patterns that shows some periodic behavior, recent behavior or trends valid for the whole database rather than for specific time intervals. But in real-life, the utility of itemsets vary over time, and some of these behaviors may only appear in some time intervals. To find itemsets that have a high utility in some specific time periods, onshelf high utility itemset mining was proposed [11] . However, the time periods need to be fixed by the user beforehand. To find high utility itemsets in non predefined time intervals, it was proposed to mine local high utility itemsets with the LHUI-Miner algorithm [7] . Though this algorithm can find insightful patterns, it is unable to discover trends such as an increase or decrease of utility in specific time periods. To address these limitations, the next section proposes the novel problem of discovering locally trending high utility itemsets. This section introduces HUIM, and then defines the proposed problem of mining locally trending high utility itemsets. The input of HUIM is a transaction database. Consider a set of items (products) I = {i 1 , i 2 , . . . , i n }. A subset X ⊆ I is called an itemset. An itemset {i} containing a single item i can be denoted without brackets as i, when the context is clear. A transaction T is an itemset, purchased by a customer. A transactional database is a multiset of transactions D = {T 1 , T 2 , . . . , T m }, where each transaction T tid ∈ D has a unique identifier tid and a timestamp t(T tid ), which may not be unique. Each item i appearing in a transaction T is associated with a number q(i, T ) ∈ N + called its internal utility (purchase quantity). Moreover, each item i ∈ I is associated with an external utility value p(i) ∈ N + representing its relative importance (e.g. unit profit). For instance, Table 1 shows a database containing five items (a, b, c, d, e) and nine transactions (T 1 , T 2 , . . . , T 9 ), which will be used as running example. Timestamps are denoted as d 1 , d 3 , . . . , d 12 . The internal utility of an item in a transaction is shown as a number besides the item, while the external utility of items is given in Table 2 . Transaction T 1 indicates that a customer purchased the items b, c, and e with purchase quantities (internal utility) of 2, 2 and 1, respectively. Their external utility (unit profit) are 2, 1 and 3, respectively. The task of HUIM consists of enumerating all high utility itemsets, i.e. itemsets having a utility that is no less than a positive minimum utility threshold (minutil) set by the user [14] . The utility of an item i in a transaction T is defined as u(i, T ) = p(i) × q(i, T ). The utility of an itemset X in T is defined as u(X, T ) = i∈X∧X⊆T u(i, T ) if X ⊆ T , and otherwise u(X, T ) = 0. The utility of an itemset X in a database D is defined as u(X) = To find HUIs having increasing/decreasing trends in terms of utility in a database, Hackman et al. [9] proposed to mine trending high utility itemsets, i.e. HUIs having a positive/negative slope for a whole database. The slope of a HUI is defined as follows. The utility of an itemset X at a timestamp d in a database D is defined as: u(X, d) = X⊆T ∈D∧t(T )=d u(X, T ). Let there be a HUI X and T S be the set of timestamps in a database D. The utility set of X in D is defined as the multiset where avg is the average. There are two important issues with the problem of mining trending HUIs [9] . First, in the above slope calculation, it can be argued that time should be used as denominator instead of the utility because the user is typically interested in how utility varies over time rather than the opposite. Second, the slope of a HUI is calculated for the whole database. Hence, the algorithm of Hackman et al. [9] is unable to find local trends such as a HUI that follows a trend only in a sub-time interval. To address these issues, this paper proposes to mine itemsets that have a high utility and follow an increasing/decreasing trend in some non predefined time intervals. This paper redefines the concepts of utility and slope such that the time is divided into non-overlapping consecutive bins to reduce the influence of small fluctuations in the utility of items. The user must set a bin length binlen ∈ Z + . Then, the average timestamp and average utility of each bin is used as basis for slope calculations. i, j such that 0 ≤ i ≤ j. The bin from time i to j is defined as B i,j = {T |i ≤ t(T ) ≤ j ∧ T ∈ D}. The length of a bin B i,j is length(B i,j ) = j − i + 1. The average timestamp of a bin B i,j is defined as at(B i,j ) = i+j 2 . The utility of an itemset X in a bin B i,j is defined as: u(X, B i,j ) = X⊆T ∈Bi,j u(X, T ). The average utility of X in B i,j is defined as au(X, B i,j ) = u(X,Bi,j ) length(Bi,j ) . D = {T 1 , T 2 , . . binlen . The sequence of bins in D, ordered by time, is defined as: To detect non predefined time intervals containing trends, a sliding window of length winlen is slided over the sequence of bins BS. Let there be a database D and a user-defined sliding window length winlen, such that winlen = binlen × k where k ∈ Z and k ≥ 2. Each window contains winlen/binlen bins. Let W [i,j] denotes the window containing the i-th bin until the j-th bin of the sequence BS, = 1.78. If binlen is set to a reasonably large value, the requirement that an itemset X appears in each bin of a sliding window to have a slope is reasonable, and ensures that the slope is not influenced by missing values. Based on the above definitions, the problem of mining locally trending HUIs is defined. Let there be some parameters binlen ∈ Z + , winlen = x × binlen for an integer x ∈ Z + such that x ≥ 2, minutil ≥ 0, minslope > 0 (or maxslope < 0) set by the user. A window W [ The search space in traditional HUIM consists of 2 I − 1 itemsets. For the proposed problem, if there are w sliding windows, then there are (2 I − 1) × w potential THUPs to be considered. To efficiently find LTHUIs, the proposed LTHUI-Miner uses three properties that reduce the search space by eliminating items or itemsets w.r.t. the whole database or a sliding window. This property was proven for HUIs in the traditional HUIM problem [14] . But it also holds for LTHUIM since every LTHUI must be a HUI. For example, if minutil = 20, binlen = 3 and winlen = 6, T W U(d) = T ∈D∧{d}∈T u(T, T ) = u(T 2 , T 2 ) = 18 < minutil. Thus, d is a low TWU item in the database, and any itemset X {d} is not a LTHUI. The second and third pruning properties require a total order ≺ on the set of items I, which is used by LTHUI-Miner to explore the search space of itemsets. LTHUI-Miner performs a depth-first search starting from itemsets containing single items, and recursively extends each itemset by appending single items according to that order. Formally, an itemset X ∪ {y} obtained by appending an item y to an itemset X is said to be an extension of X if i ≺ y, ∀i ∈ X. Property 2 (Pruning an Unpromising itemset using its Remaining Utility in a Database). The remaining utility of an itemset X in a transaction T is defined as ru(X, T ) = i∈T ∧i x∀x∈X u(i, T ) if X ⊆ T . The remaining utility of an itemset X in a database is defined as reu(X) = T ∈D∧X⊆T ru(X, T ). If u(X)+ reu(X) < minutil, then X and its transitive extensions are not LTHUIs. For example, if minutil = 30, binlen = 3 and winlen = 6. The TWU ascending order on items is a ≺ e ≺ b ≺ c. Note that item d has been pruned using Property 1. u({c}) + reu({c}) = 27 + 0 < minutil. Then, itemset {c} and its transitive extensions are not LTHUIs. Property 3 (Pruning an Unpromising itemset using its Remaining Utility in a sliding window). The remaining utility of an itemset X in a sliding Window W is defined as reu(X, W ) = T ∈B∈W ∧X⊆T ru(X, T ). If u(X, W ) + reu(X, W ) < minutil, then X and its transitive extensions have no THUP in W . This property can be proved by observing that such itemsets cannot have a utility greater than or equal to minutil in the sliding window W , and thus these itemsets cannot have a THUP in W . For example, if minutil = 20, binlen = 3 and winlen = 6, u({e}, W [1, 2] ) + reu({e}, W [1, 2] ) = 6 + 9 = 15 < minutil. Thus, the window W [1, 2] is not a THUP for itemset {e} and its transitive extensions. To efficiently calculate the utility of any itemset during the depth-first search and check the pruning conditions of Properties 2 and 3, the proposed algorithm utilizes a novel structure called Trending Utility-list (TU-list), which extends the utility-list structure used in traditional HUIM [4] with information about bins and time periods. The first part of a TU-list of an itemset X stores information about the utility of the itemset X in transactions where it appears, and about the utilities of items that could extend X in these transactions. Formally, the first part of a TU-list is a set of tuples called elements such that there is a tuple (tid, iutil, rutil) for each transaction T tid containing X where iutil = u(X, T tid ) and rutil = ru(X, T tid ). The second part of a TU-list contains four lists named binU tils, binRutils, trendP eriods and promisingP eriods. They store the utility of X for each bin, the remaining utility of X for each bin, the maximum trending high utility periods of X and the promising periods of X, respectively. A promising period of an itemset X is a time period where X and its transitive extensions may have a utility greater than or equal to minutil based on Property 3. Formally, let there be some parameters winlen ∈ Z + and minutil ≥ 0 set by the user. A window W [i,j] is a promising period of an itemset X if for any sliding window W [k,l] The TU-list structure of an itemset X has two interesting properties. First, it allows to directly calculate the utility u(X) of X without scanning the database, as the sum of the iutil values in the TU-list of X. Second, reu(X) can be calculated as the sum of rutil values. Moreover, the utility and remaining utility of an itemset X in a bin B and a window W can also be calculated from its TU-list by considering only transactions in B and W , respectively. For example, the TU-list of itemset {e} is elements = (0, 3, 6), (1, 3, 11), (2, 3, 11), (4, 6, 6), (7, 6, 6) , (8, 3, 12) , binU tils = 6, 9, 0, 9 , binRutils = 17, 17, 0, 18 , trendP eriods = and promisingP eriods = W [1, 2] . Then, the utility of itemset {e} in a database or window can be calculated without scanning the database again, e.g., u({e}) = 3 + 3 + 3 + 6 + 6 + 3 = 24, u({e}, W [1, 2] ) = binU tils [1] + binU tils [2] = 6 + 9 = 15. The remaining utility of itemset {e} in a window can also be calculated directly using binRutils: reu({e}, W [1, 2] ) = binRutils [1] + binRutils [2] = 17 + 17 = 34. Another property of TU-lists is that those of two itemsets of the form P ∪{x} and P ∪ {y} can be joined to obtain the TU-list of an itemset P ∪ {x, y}. This is done by first applying the construct procedure of HUI-Miner [15] . Then, the binU tils, binRutils, trendP eriods and promisingP eriods lists can be calculated by applying the findTrend procedure, presented in the next section. The Algorithm. We next present the proposed LTHUI-Miner algorithm by explaining how it finds increasing trends. Decreasing trends are found in a similar way. The algorithm takes as input a transaction database D and the binlen, winlen, minutil and minslope parameters. The algorithm outputs all LTHUIs and their maximum THUPs. The algorithm first scans the database to calculate the bins, sliding windows and T W U(i) of each item i. Then, each item i such that T W U(i) < minutil is ignored from further processing as it cannot be part of a LTHUI by Property 1. Then, the processing order ≺ on remaining items is defined as the increasing order of TWU, as in previous work [4] . Then, the algorithm scans the database again to create the TU-list of each remaining item. Thereafter, LTHUI-Miner recursively extends each of those items by appending items following the order. This is done by calling the LTHUISearch procedure (Algorithm 1) with six parameters: (1) an itemset P (initially P = ∅), (2) a set exP of one-item extensions of P of the form P x = P ∪ {x} where x ∈ I (initially, the remaining items), (3) binlen, (4) winlen, (5) minutil, and (6) minslope. The procedure first checks if the trendP eriods list of each itemset P x in the set exP is empty. If not, the itemset P x is output as a LT HU I with P x.T U list.trendP eriods as its maximum THUPs. Moreover, if promisingP eriods of P x is not empty and P x is promising in the database according to Property 2, the algorithm will try to extend P x. This is done by joining P x with each itemset P y ∈ exP such that y x, to obtain itemsets of the form P xy. The TU-list of P xy is constructed by calling the construct procedure. Then, the procedure F indT rend is called to construct the binU tils, binRutils, trendP eriods and promisingP eriods of that TU-list, and P xy is added to a set exP x. Then, the procedure LTHUI-Search is called with P x and exP x to check if itemsets in exP x are LT HU Is and explore their extensions. The F indT rend procedure takes as input (1) an itemset P , (2) a one item extension of P , (3) binlen, (4) winlen, (5) minutil and (6) minslope. First, the procedure scans the elements of the TU-list of P x to calculate binU tils and binRutils. Then, the procedure moves a sliding window over the sequence of bins BS to calculate the utility and slope of windows using two variables, namely winStart (the index in BS of the first bin of a sliding window, initialized to 0) and winEnd (the index in BS of the last bin of a sliding window, initialized to winlen/binlen). However, the process of sliding a window while calculating the slope and utility may be interrupted because some sliding windows in BS may have empty bins, and the slope cannot be calculated in that case. Thus, a loop is performed to find the next sliding window without empty bins, and then continue the sliding process until an empty bin is encountered or winEnd reaches the last bin of the sequence BS. In more details, this is done by first finding the first no-empty-bin sliding window starting from winStart, updating winStart, winEnd and calculating utils (utility of the itemset P x in that window), rutils (remaining utility of P x in that window) . Then, the following step is repeated until (winEnd + 1) reaches the last bin of BS or the utility of P x in the bin of index (winEnd + 1) is 0 or itemset P is unpromising in the sliding window W [winStart+1,winEnd+1] : (1) increase the index of the first and last bin of the sliding window, then update utils and rutils, (2) compare the value of utils, utils + rutils with minutil to determine whether to merge the sliding window with the previous period or add that window to P x.T U list.trendP eriods and P x.T U list.promisingP eriods. These latter are used to store maximum THUPs and promising periods. LTHUI-Miner is correct and complete, as it explores itemsets by recursively performing extensions of single items, and the algorithm only prunes extensions based on the pruning properties. To test the performance of LTHUI-Miner, experiments were done on a computer having an Intel Xeon E3-1270 v5 processor with 64 GB RAM, on Windows 10. LTHUI-Miner was implemented in Java. Two real-life datasets with timestamps were used: retail and f oodmart. Let |I|, |D| and A represents the number of distinct items, the number of transactions and the average transaction length. retail contains transactions from an anonymous Belgian retail store (|I| = 16,470, |D| = 88,162, A = 10.30). f oodmart is transactional data obtained and transformed from the SQL-Server 2000 distribution (|I| = 1559, |D| = 4141, A = 4.40). The timestamps of retail and f oodmart were generated by adopting a distribution used in prior work for retail data [7] . 176 times faster than lthui-no-prop3. Memory consumption was also measured to compare the two algorithm versions. It was found that in most cases, the memory usage of lthui is less than lthui-no-prop3, which shows that Property 3 reduces memory consumption. Details are not shown due to the page limitation. Influence of minslope on the Number of Patterns Found. In the second experiment, the minslope parameter was varied to evaluate its influence on the number of patterns found. Algorithms were run with binlen = 1000, winlen = 2 × binlen and minutil = 600 on the retail dataset and binlen = 250, winlen = 2 × binlen and minutil = 100 on the f oodmart dataset. Results for the number of patterns are shown in Fig. 1 (b) for the two datasets. It is observed that as minslope increases, the number of patterns decreases, which was expected. Pattern Analysis. On the two datasets, some patterns having a strong trend were found, which means that the utility of these itemsets was high and increased rapidly in their THUPs. For example, on retail and f oodmart dataset, 179 and 13 patterns have slope values greater than 1.1 and 0.6 respectively. Discovering such strong trends can be very helpful for a retail store manager to understand customer behavior and take decisions, since products in LTHUIs generate high profits and the profits is growing quickly during their THUPs. This paper has defined a novel problem of mining Locally Trending High Utility Itemsets having increasing/decreasing trend(s) in some non-predefined time periods. The properties of LTHUI mining were studied and a novel algorithm named LTHUI-Miner was proposed to efficiently mine all LTHUIs and their maximum THUPs. Besides, three pruning strategies were designed to improve the performance of LTHUI-Miner. The experimental evaluation has shown that the algorithm is efficient and can find useful patterns. In future work, techniques to automatically adjust parameters will be considered, as well as extensions for high utility episode mining [6] , incremental pattern mining [12] , [?] and using swarm optimization [16] . Fast algorithms for mining association rules Up-Hist Tree: an efficient data structure for mining high utility patterns from transaction databases PHM: mining periodic high-utility itemsets A survey of high utility itemset mining A survey of itemset mining HUE-Span: fast high utility episode mining Mining local and peak high utility itemsets Mining recent high-utility patterns from temporal databases with time-sensitive constraint Mining trending high utility itemsets from temporal transaction databases Efficiently finding high utility-frequent itemsets using cutoff and suffix utility Discovery of high utility itemsets from on-shelf time periods of products Efficient incremental high utility pattern mining based on pre-large concept Direct discovery of high utility itemsets without candidate generation A two-phase algorithm for fast discovery of high utility itemsets Efficient algorithms for high utility itemset mining without candidate generation Discovering high utility itemsets based on the artificial bee colony algorithm Algorithm 2: FindTrend input : P : an itemset, P x: a one item extension of P , binlen: the length of a bin, winlen:the sliding window length, minutil: the minimum utility threshold, and minslope: the minimum slope threshold. output: Calculate the binU tils, binRutils, trendP eriods and promisingP eriods of the TU-list of P x 1 Scan the elements of the TU-list of P x to calculate binU tils and binRutils; 2 numBP W = winlen/binlen (the number of bins per sliding window); 3 winStart = 0 (the index of a sliding window's first bin in BS); 4 winEnd = numBP W (the index of a sliding window's last bin in BS); 5 while winEnd < BS.size do 6 Find the first no-empty-bin sliding window W for P x starting from winStart; 7 Update winStart and winEnd in terms of W ; Because LTHUIM is a new problem, the performance of LTHUI-Miner cannot be compared with prior work. Thus, we compared three versions of LTHUI-Miner: (1) LTHUI-Miner (with all pruning techniques), denoted as lthui, (2) LTHUI-Miner without Property 3, denoted as lthui-no-prop3, and (3) a version of LTHUI-Miner without Property 2 and 3. However, that latter ran out of memory for all the experiments, and thus its results are not reported in the following. Experiments were done by varying the minutil and minslope parameters to see the influence on runtime and pattern count, respectively. No results are shown for an algorithm if it ran out of memory, or the runtime exceeded one hour. In the first experiment, the parameter minutil was varied to evaluate the performance of LTHUI-Miner in terms of runtime. LTHUI-Miner was run with winlen = 2000 (about 5.5 h), binlen = 1000 and minslope = 0.1 on the retail dataset, and run with winlen = 500, binlen = 250 and minslope = 0.1 on the f oodmart dataset. Fig. 1 (a) compares the runtimes of lthui and lthui-no-prop3 for the two datasets. It is observed that as minutil is decreased, runtime increases, which is reasonable since more patterns may be found. It is also observed that pruning an unpromising itemset in a sliding window using the remaining utility (Property 3) greatly reduces the runtime. For example, on the retail dataset, when minutil = 1500, the execution time of lthui-no-prop3 is 498 s, which is more than 32 times that of lthui, and on the f oodmart dataset, when minutil = 1400, lthui is up to