key: cord-0044881-z9kqhhpk
authors: Mirshahi, Soheyla; Novák, Vilém
title: A Fuzzy Approach for Similarity Measurement in Time Series, Case Study for Stocks
date: 2020-05-16
journal: Information Processing and Management of Uncertainty in Knowledge-Based Systems
DOI: 10.1007/978-3-030-50153-2_42
sha: 1a902c56698c888053d7f5bfa3b9231b4ca45b00
doc_id: 44881
cord_uid: z9kqhhpk

In this paper, we tackle the problem of assessing similarity among time series under the assumption that a time series can be additively decomposed into a trend-cycle and an irregular fluctuation. It has been proved before that the former can be well estimated using the fuzzy transform. In the suggested method, we first assign to each time series an adjoint one that consists of the sequence of its trend-cycles estimated using the fuzzy transform. Then we measure the distance between the local trend-cycles. An experiment is conducted to demonstrate the advantages of the suggested method. The method is easy to calculate, well interpretable, and, unlike the standard Euclidean distance, robust to outliers.

Time series are a feasible way of representing data in many fields, including the finance sector. Financial crises in the 19th and early 20th centuries created a challenging situation for economies and led to massive interest in economic and financial analysis. In this situation, any information that provides a better understanding of the behavior of markets is highly valuable. Among the many lines of research on data mining in time series (see [4, 7, 9, 10]), one of the key applications [11] is stock data mining. Assessing time series similarity, i.e., the degree to which a given time series resembles another one, is core to many mining, retrieval, clustering, and classification tasks [18]. In the construction of financial portfolios (see [5]), diversification, i.e., investing in a variety of assets, is key to reducing the risk of a chosen portfolio.
Thus, identifying stocks that share similar behavior is vital. No single approach is known to be the best measure for assessing similarity of time series. Surprisingly, simple approaches such as the plain Euclidean distance can outperform the most complicated ones [18]. Wang et al. (2013) performed an extensive comparison of nine measures across 38 data sets from various scientific domains (see [21]). One of their findings is that the Euclidean distance remains an accurate, robust, simple, and efficient way of measuring the similarity between two time series. However, stock markets have some properties that make the current similarity measures unfavorable. For instance, stocks react to many exogenous factors such as news (see, e.g., [2]); thus, the presence of outliers in them is inevitable. Therefore, developing a measure that reflects the nature of stock markets seems essential.

A very effective technique in the analysis of time series is the fuzzy transform. Using it, we can extract the trend-cycle (a low-frequency trend component) of a time series with high fidelity. The fuzzy transform provides not only the computed trend-cycle but also its analytic formula (cf. [16, 17]). In this paper, using the fuzzy transform, we first assign to each time series an adjoint one that consists of its local trend-cycles. Then we measure the distance between these approximating time series by a suggested formula.

There are several reasons to employ our fuzzy estimation of the trend-cycle for similarity measurement. Firstly, the trend-cycle of a stock smooths the price values and describes the behavior of the market with respect to changes in price. Thus, it is more intuitive for experts than the price values themselves. It has been proven that this goal can be reached using the fuzzy transform. Secondly, stock markets can be boisterous, with many outliers.
Consequently, assessing similarity based on actual price values without any preprocessing can lead to unrealistic results. Using our method, we can easily "wipe out" the outliers without harming the basic characteristics of the time series. Finally, our method is flexible and can answer the question of how to find stocks that behave similarly in various time slots. For instance, experts can measure the similarity between stocks that behave similarly over short to long terms (e.g., one to several weeks).

The paper is structured as follows. After the Introduction, we describe our method in Sects. 2 and 3. Section 4 is dedicated to an illustration of the proposed method and the evaluation of the results.

Our techniques stem from the following characterization of a time series. It is understood as a stochastic process (see, e.g., [1, 6])

$$X : T \times \Omega \to \mathbb{R},$$

where $\Omega$ is a set of elementary random events and $T = \{0, \dots, p\} \subset \mathbb{N}$ is a finite set of numbers interpreted as time moments. Since financial time series typically possess no seasonality, we assume that they can be decomposed into components as follows:

$$X(t) = TC(t) + R(t), \qquad t \in T, \tag{1}$$

where $TC(t) = Tr(t) + C(t)$ is called the trend-cycle and $R$ is random noise, i.e., a sequence of (possibly independent) random variables $R(t)$ such that for each $t \in T$, $R(t)$ has zero mean and finite variance.

The fuzzy transform (F-transform) is the fundamental theoretical tool for the suggested similarity measurement. Because of the lack of space, we only briefly outline the main principles of the F-transform and refer the reader to the extensive literature, e.g., [15, 16] and many others.

The F-transform is a procedure applied, in general, to a bounded real continuous function $f$ on an interval $[a, b]$. It is based on the concept of a fuzzy partition, that is, a set $A = \{A_0, \dots, A_n\}$, $n \ge 2$, of fuzzy sets fulfilling special axioms. The fuzzy sets are defined over nodes $a = c_0, \dots, c_n = b$ in such a way that for each $k = 0, \dots, n$, $A_k(c_k) = 1$ and $\mathrm{Supp}(A_k) = [c_{k-1}, c_{k+1}]$.
The nodes are usually (but not necessarily) uniformly distributed, i.e., $c_{k+1} = c_k + h$, where $h > 0$ is a given value. To emphasize that the fuzzy partition is formed using the distance $h$, we will write $A_h$.

The F-transform has two phases: direct and inverse. The direct F-transform assigns to each fuzzy set $A_k \in A$ a component of $f$; in the first-degree case, the component also provides an estimate of the average value of the tangent (slope) of $f$ over the area characterized by the fuzzy set $A_k \in A$. From the direct F-transform of $f$ we can form a function $I[f|A]$ by combining the components with the fuzzy sets of the partition. The function $I[f|A]$ is called the inverse F-transform of $f$, and it approximates the original function $f$. It can be proved that this approximation is universal.

The application of the F-transform to time series analysis is based on the following result (cf. [14, 16]). Let us now assume (without loss of generality) that the time series (1) contains periodic subcomponents with frequencies $\lambda_1 < \dots < \lambda_r$. These frequencies correspond to periodicities

$$T_1 > \dots > T_r, \tag{2}$$

respectively (via the equality $T = 2\pi/\lambda$).

Theorem 1. Let $\{X(t) \mid t \in T\}$ be a realization of the stochastic process (1). Let us assume that all subcomponents with frequencies $\lambda$ lower than $\lambda_q$ are contained in the trend-cycle $TC$. If we construct a fuzzy partition $A_h$ over the set of equidistant nodes with the distance $h = d\,T_q$, where $d \in \mathbb{N}$ and $T_q$ is the periodicity corresponding to $\lambda_q$, then the corresponding inverse F-transform $I[X|A]$ of $X(t)$ gives the following estimation of the trend-cycle:

$$|I[X|A](t) - TC(t)| \le 2\,\omega(h, TC) + D, \qquad t \in [c_1, c_{n-1}], \tag{3}$$

where $\omega(h, TC)$ is the modulus of continuity of $TC$ with respect to $h$ and $D$ is a small number.

The precise form of $D$ and the detailed proof of this theorem can be found in [13, 16]. It follows from this theorem that the F-transform makes it possible to filter out frequencies higher than a given threshold and also to reduce the noise $R$. Consequently, we have a tool for separating the trend-cycle or the trend. Theorem 1 tells us how the distance between nodes of the fuzzy partition should be set. This choice enables us to detect trend-cycles for different time frames of interest.
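Theorem 1 ties the node distance $h$ to a periodicity $T_q$ of the series. As a rough sketch of how such a periodicity might be located in practice, the following Python snippet computes a periodogram (up to a constant factor) with a plain discrete Fourier transform and derives $h = d\,T_q$ from it. The function names `periodogram` and `node_distance` are hypothetical, the synthetic series is invented for illustration, and taking the strongest peak as $T_q$ is an assumption of this example; which periodicity should play the role of $T_q$ remains a modelling decision.

```python
import math

def periodogram(x):
    """Return (period, power) pairs at the Fourier frequencies of x.

    Plain O(n^2) discrete Fourier transform of the mean-centred series;
    frequency index j corresponds to the period n / j (in time steps).
    """
    n = len(x)
    mean = sum(x) / n
    centred = [v - mean for v in x]
    result = []
    for j in range(1, n // 2 + 1):
        re = sum(v * math.cos(2 * math.pi * j * t / n) for t, v in enumerate(centred))
        im = sum(v * math.sin(2 * math.pi * j * t / n) for t, v in enumerate(centred))
        result.append((n / j, (re * re + im * im) / n))
    return result

def node_distance(x, d=1):
    """Assumed rule: take T_q as the period with the largest power, h = d * T_q."""
    t_q, _ = max(periodogram(x), key=lambda pair: pair[1])
    return d * t_q

# Synthetic series (illustration only): a strong cycle of period 8
# plus a weaker cycle of period 4 around a constant level.
series = [
    10 + math.sin(2 * math.pi * t / 8) + 0.3 * math.sin(2 * math.pi * t / 4)
    for t in range(64)
]
h = node_distance(series, d=1)
```

For the synthetic series the strongest peak sits at period 8, so `h` comes out as 8 time steps; a larger `d` would widen the partition and smooth more aggressively.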
The quality of the estimate in Theorem 1 naturally depends on the course of $TC$: the smaller the modulus of continuity $\omega(h, TC)$, the better the estimate (a small modulus of continuity is a natural assumption for a trend-cycle or a trend). The periodicities (2) can be found using the classical technique of the periodogram (see [1, 6]). Selection of $T_q$ in Theorem 1 can be based on the following general OECD specification:

Trend (tendency) is the component of a time series that represents variations of low frequency in a time series, the high and medium frequency fluctuations having been filtered out.

Trend-cycle is the component of the time series that represents variations of low frequency, the high frequency fluctuations having been filtered out.

In this section, we describe how the suggested method evaluates the pairwise similarity between time series.

In Definition 1 of the similarity $S(X, Y)$, $E(TC_X)$ and $E(TC_Y)$ are the mean values (averages) of $TC_X$ and $TC_Y$, respectively, and $|\cdot|$ denotes the absolute value. It is easy to show that $S(X, Y) \in [0, 1]$; its further features are described in the following theorem, which can be proved.

In Definition 1, it is necessary to emphasize that $TC_X$ and $TC_Y$ are estimates, not the real trend-cycles, since we do not know the latter (cf. formulas (1) and (3)).

A stock can be seen as a time series $\{X(t) \mid t = 1, \dots, n\}$, where $X(t)$ is the closing price at time $t$ within an interval $[0, T]$. For instance, let us consider the closing price of a stock from Nasdaq Inc. from 05.10.2008 to 30.09.2018 (522 weeks). In order to estimate its local trend-cycle, we first build a uniform fuzzy partition such that the length of each basic function $A_2, \dots, A_{n-1}$ is equal to a proper time slot. In our case, by setting the length of the basic functions to four, we obtain an approximation of the trend-cycles for one month. In other words, the monthly behavior of this stock is our concern here. Figure 1 depicts the mentioned weekly stock and the fuzzy approximation of its local trend-cycle.
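The procedure just described (uniform fuzzy partition, direct F-transform, inverse F-transform) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes triangular basic functions, the zero-degree F-transform, and integer time steps, and the helper names `triangular` and `f_transform_trend` are hypothetical. The value `h = 4` and the synthetic weekly series are chosen only for illustration; how the "length of a basic function" in the text maps onto `h` is left open here.

```python
def triangular(t, c, h):
    """Membership of time t in the triangular basic function centred at node c."""
    return max(0.0, 1.0 - abs(t - c) / h)

def f_transform_trend(x, h):
    """Estimate a trend-cycle by the direct and then the inverse F-transform.

    x : list of observations at times 0, 1, ..., len(x) - 1
    h : distance between neighbouring nodes (in time steps)
    Returns (components, trend).
    """
    n_points = len(x)
    if (n_points - 1) % h != 0:
        raise ValueError("this sketch needs len(x) - 1 to be a multiple of h")
    nodes = range(0, n_points, h)
    # Direct F-transform: weighted average of x under each basic function.
    components = []
    for c in nodes:
        weights = [triangular(t, c, h) for t in range(n_points)]
        components.append(sum(w * v for w, v in zip(weights, x)) / sum(weights))
    # Inverse F-transform: blend the components with the same basic functions.
    trend = [
        sum(f * triangular(t, c, h) for f, c in zip(components, nodes))
        for t in range(n_points)
    ]
    return components, trend

# Synthetic weekly "prices" (illustration only): a linear drift with one outlier.
weekly = [100 + 0.5 * t for t in range(41)]
weekly[20] += 40
components, trend = f_transform_trend(weekly, h=4)
```

On this synthetic series the outlier of size 40 at week 20 is damped to an excess of 10 in the estimated trend, while the first and last values of `trend` are biased because the boundary basic functions are incomplete.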
The first and the last components of the F-transform are subject to a large error because the corresponding basic functions $A_1$ and $A_n$ are incomplete. Regardless, it is clear that the F-transform has approximated the local trend-cycles of the stock successfully. As mentioned before, stock markets react to many exogenous factors; thus, the presence of outliers is unavoidable. A red square in Fig. 1 marks one of these outliers for the mentioned stock. The F-transform has clearly wiped out the outlier while preserving the core behavior of the stock. The similarity from Definition 1 can be used for measuring the similarity between any number of stocks. It can also be used to measure their local behavior. In the next section, we demonstrate how the suggested method works on a relatively large data set and compare it with the standard Euclidean distance.

Our data set consists of the closing prices of 92 stocks over 522 weeks obtained from Nasdaq Inc. An example of twenty stocks from this data set is depicted in Fig. 2, where the x-axis and y-axis represent price values in dollars and number of weeks, respectively. From this figure alone, it is clearly impossible to decide which time series are similar. Therefore, a formal similarity measure between time series is necessary.

One possible way to evaluate the competency of any new similarity (distance) measure is to apply it to data clustering. The quality of clustering based on the new and the current similarities can validate the competency of the suggested method [12, 19]. Therefore, below we cluster the time series and compare the behavior of our similarity with the Euclidean one. However, let us emphasize that time series clustering is not the primary goal of this research, since our focus is on discovering the most similar pairs of stocks available in the database.
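The search for the most similar pair of stocks can be sketched as follows for the Euclidean baseline. This is only an illustrative sketch under stated assumptions: the text does not say which normalization is used, so z-normalization is assumed here, and the names `z_normalize` and `most_similar_pair` as well as the toy data are hypothetical.

```python
import math
from itertools import combinations

def z_normalize(x):
    """Rescale a series to zero mean and unit standard deviation (assumed normalization)."""
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    if std == 0:
        raise ValueError("a constant series cannot be z-normalized")
    return [(v - mean) / std for v in x]

def most_similar_pair(series_by_name):
    """Return (distance, name_a, name_b) for the closest pair of normalized series."""
    normalized = {name: z_normalize(x) for name, x in series_by_name.items()}
    return min(
        (math.dist(normalized[a], normalized[b]), a, b)
        for a, b in combinations(sorted(normalized), 2)
    )

# Toy data (illustration only): A and B share a shape but differ in scale.
stocks = {
    "A": [10, 11, 12, 13, 14, 15],
    "B": [100, 110, 120, 130, 140, 150],
    "C": [15, 14, 13, 12, 11, 10],
}
best = most_similar_pair(stocks)
```

After normalization, A and B coincide up to rounding, so they are returned as the closest pair even though their raw price levels differ by a factor of ten.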
As mentioned before, the Euclidean distance is an accurate, robust, simple, and efficient way to measure the similarity between two time series and, surprisingly, can outperform most of the more complex approaches (see [18, 20]). Therefore, we compare our method with the Euclidean distance by means of the quality of hierarchical clustering on a data set. Hierarchical clustering is a method of cluster analysis that attempts to build a hierarchy of similar groups in data [8]. In this case, one problem to consider is the optimal number of clusters in a data set. Overall, none of the methods for determining the optimal number of clusters is flawless, and none of the suggested similarities is fully satisfactory. Hierarchical clustering does not itself reveal an adequate number of clusters, and estimating the proper number is rather intuitive. Hence, there is a fair amount of subjectivity in the determination of separate clusters.

Figures 3 and 4 show the dendrograms of the hierarchical clustering of the 92 stocks based on the suggested and the Euclidean similarity, respectively. The proper number of clusters for both similarities is equal to six. In these figures, the 92 stocks are represented on the x-axis, and their distances are depicted on the y-axis. Since the stocks are from various industries, they have different scales; for clustering with the Euclidean distance, we eliminate the different scaling by normalizing the data. This step is not required by the suggested method since the scale does not influence it.

To measure the quality of clustering, we apply the Davies-Bouldin index, which is commonly used in clustering. This measure evaluates intra-cluster similarity and inter-cluster differences [3]. Therefore, it is a proper metric for clustering evaluation. Table 1 shows the Davies-Bouldin index for different numbers of clusters based on both similarities.
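For readers unfamiliar with the Davies-Bouldin index, a minimal sketch of its usual centroid-based form is given below. It assumes vectors compared with the Euclidean distance; how the index was evaluated for the suggested similarity is not stated in the text, so this is not a reconstruction of the experiment. The function names and the four toy points are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def davies_bouldin(points, labels):
    """Davies-Bouldin index of a labelled set of vectors (lower is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centres = {lab: centroid(ps) for lab, ps in clusters.items()}
    # Scatter: mean distance of a cluster's members to its centroid.
    scatter = {
        lab: sum(math.dist(p, centres[lab]) for p in ps) / len(ps)
        for lab, ps in clusters.items()
    }
    # For each cluster, keep the worst (largest) ratio of combined scatter
    # to centroid separation; the sketch assumes distinct centroids.
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centres[i], centres[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)

# Toy data (illustration only): two tight groups far apart.
points = [[0, 0], [0, 2], [10, 0], [10, 2]]
good = davies_bouldin(points, [0, 0, 1, 1])   # groups match the geometry
bad = davies_bouldin(points, [0, 1, 0, 1])    # groups cut across it
```

The labelling that follows the geometry yields a much lower index than the one that cuts across it, which is the sense in which a lower score indicates better clustering.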
Since a lower score indicates better clustering quality, the results reveal that our method is not only reasonably comparable to the Euclidean method but also provides more efficient clustering for these examples.

Furthermore, as mentioned before, stock markets are prone to exogenous factors such as bad or good news (see, e.g., [2]). If a method pairs two stocks as similar, one can expect that after the occurrence of one or more outliers, the method would still evaluate these stocks as alike. Hence, we compare the performance of our method and the Euclidean distance for stocks containing outliers. Recall from the previous section that, based on both methods, stocks 52 and 53 are very similar to each other since their distance is minimal. Therefore, we first add some random artificial outliers to stock 52 but do not alter stock 53, as shown in Fig. 7. Subsequently, we apply both methods to re-evaluate the similarity between these stocks. Table 2 shows the results. After including the artificial outliers, the Euclidean distance jumps dramatically (an increase of around 1800%). At the same time, the proposed method shows a minimal increase in distance (33%), which means that the suggested method is much less sensitive to the presence of outliers.

Since the suggested method is based on the F-transform, it evaluates the similarity between stocks with respect to their local trend-cycles; therefore, it does not have the drawbacks of approaches based on raw data, such as the Euclidean distance. The latter methods are sensitive to noisy data [22]. One advantage of the Euclidean method is its simplicity; however, the suggested method is also relatively simple since it has only one parameter to set (the length of the basic functions). Moreover, experts are able to adjust the suggested similarity measure according to their time slot of interest.

Table 2.
The distance between stocks 52 and 53 before and after outliers

Method                  Distance before outliers   Distance after outliers
The suggested method    0.09                       0.12
The Euclidean method    0.17                       3.33

In this paper, we developed a new method for pairwise similarity measurement. The method is based on the application of the fuzzy transform and a customized metric. The idea is to estimate local trends using the inverse fuzzy transform. Time series can then be paired according to the similarity of the adjoint time series consisting of the local trends. We demonstrated the application of the suggested method to real-life data and compared it with the Euclidean distance. Experimental results verify the capability of the suggested method for measuring the similarity between time series. Further work will focus on the application of this method in portfolio management and the evaluation of its profitability in finance. Another extension of this work is to adapt the method to time series of various lengths and to compare the results with the so-called dynamic time warping (DTW) method.

1. Statistical Analysis of Time Series. SNTL, Praha
2. Stock price reaction to news and no-news: drift and reversal after headlines
3. A cluster separation measure
4. A review on time series data mining
5. Numerical Methods and Optimization in Finance
6. Time Series Analysis
7. Data mining: concepts and techniques
8. Finding Groups in Data: An Introduction to Cluster Analysis
9. On the need for time series data mining benchmarks: a survey and empirical demonstration
10. Clustering of time series data - a survey
11. Data Mining: Concepts and Techniques
12. An efficient and accurate method for evaluating time series similarity
13. Filtering out high frequencies in time series using F-transform with respect to raised cosine generalized uniform fuzzy partition
14. Forecasting seasonal time series based on fuzzy techniques. Fuzzy Sets and Systems
15. Insight into Fuzzy Modeling
16. Filtering out high frequencies in time series using F-transform
17. Analysis of seasonal time series using fuzzy approach
18. An empirical evaluation of similarity measures for time series classification. Knowl.-Based Syst
19. Indexing multidimensional time-series
20. Computing with Words
21. Experimental comparison of representation methods and distance measures for time series data
22. The curse of dimensionality and document clustering

Acknowledgment. The paper has been supported by the grant 18-13951S of GAČR, Czech Republic.