key: cord-0961513-acr4xftl authors: Watanabe, Norio title: A k-means method for trends of time series: An application to time series of COVID-19 cases in Japan date: 2022-03-03 journal: Jpn J Stat Data Sci DOI: 10.1007/s42081-022-00148-0 sha: 1825552e3f025a8afb0e3755dc88746bbb386362 doc_id: 961513 cord_uid: acr4xftl A k-means method style clustering algorithm is proposed for trends of multivariate time series. The usual k-means method is based on distances or dissimilarity measures among multivariate data and centroids of clusters. Some similarity or dissimilarity measures are also available for multivariate time series. However, suitability of dissimilarity measures depends on the properties of time series. Moreover, it is not easy to define the centroid for time series. The k-medoid clustering method can be applied to time series using one of dissimilarity measures without using centroids. However, the k-medoid method becomes restrictive if appropriate medoids do not exist. In this paper, the centroid is defined as a common trend and a dissimilarity measure is also introduced for trends. Based on these centroids and dissimilarity measures, a k-means method style algorithm is proposed for a multivariate trend. The proposed method is applied to the time series of COVID-19 cases in each prefecture of Japan. Clustering is one of the important issues in multivariate analysis. This is also true for multivariate time series. Clustering methods for time series depend on characteristics of time series. In this paper, we assume that time series contain trends, and then, they are nonstationary, and we propose a non-hierarchical clustering method, which has a k-means method style, for a multivariate trend. The usual k-means method is based on distances or dissimilarity measures among data and centers of gravity or centroids of clusters. For time series, some similarity or dissimilarity measures are available. Pearson's correlation coefficient and the cosine similarity are typical similarity measures, and the dynamic time warping method can also be applied (see Bagnal (2017) , Berndt and Clifford (1994) , Egghe and Leydesdorff (2009) , for example). However, suitability of dissimilarity measures depends on the properties of time series. Watanabe (2021) discussed dissimilarity measures for nonstationary time series and proposed other dissimilarity measures explained in the following section. In this paper, we introduce a new dissimilarity measure for applying to clustering. The new one has a similar form to those by Watanabe (2021) . Unlike similarity or dissimilarity measures, it is not easy to define the centroid for time series. Without using centroids, the k-medoid clustering method can be applied to time series using one of dissimilarity measures (cf. Kaufman and Rousseeuw (2009) ). However, the k-medoid method is restrictive if appropriate medoids, which represent the centers of clusters, do not exist. Our aim is to analyze a multivariate trend of nonstationary time series. In this case, an idea of common trend can be adopted for definition of the centroid for time series. In this paper, we apply the definition of common trend given by Watanabe (2021) . However, Watanabe (2021) does not consider the lags. Therefore, we provide a new definition of common trend by considering lags, since lags are important in time series analysis. The lag is especially crucial for multivariate time series. The main task of clustering is to find similar patterns in time series. We assume that time series under analysis are ratio scale data and positively correlated each other in some sense. We also assume that standardization is appropriate for comparison with each other. If time series are nonstationary, standardization is also difficult. The dissimilarity measure in this paper is based on weighting as standardization. We propose a k-means method style algorithm for a multivariate trend using a new dissimilarity measure and the common trend. The proposed method is applied to the time series of COVID-19 cases in each prefecture of Japan. We also discuss clustering of original nonstationary time series themselves not trends briefly using COVID-19 series. Target data in this paper are P-variate time series and we consider problems on their trends based on dissimilarity among time series. First, we discuss dissimilarity between two time series in this section. Let (x n , y n ) (n = 1, 2, . . . , N ) be an observed bivariate time series. We assume that x n = T n + v n (1) where {(v n , w n )} is a bivariate zero mean stationary process, and {T n } and {U n } are trends or mean value functions. We assume that trends are (conditionally) deterministic and that they are estimated appropriately. The moving average method is a simple way for trend estimation. Other methods are found in Kim (2009) and Watanabe and Watanabe (2015) , for example. Moreover we assume that the seasonality or periodic movements are absent in estimated trends; so-called cyclic components can be included in trends. Note that the period of cyclic component is relatively large comparing to the one of seasonality. It is not easy to capture dissimilarity between nonstationary time series, unlike stationary time series. Watanabe (2021) introduced the dissimilarity functions between {T n } and {U n } as follows: Definition 1 (Watanabe 2021) The simple dissimilarity measure function δ S ( ) is given by for = 0, ±1, ±2, . . .. Definition 2 (Watanabe 2021) The weighted dissimilarity measure function δ W ( ) is given by The region of minimization can be replaced by The choice depends on the property of time series. Definition 3 (Watanabe 2021) The normalized dissimilarity measure function δ N ( ) is given by for = 0, ±1, ±2, . . . and C is a bounded subset of 2-dimensional Euclidean space. The independent variable of these functions is the lag similarly to the cross correlation function for the stationary time series. Watanabe (2021) showed that the k-medoid clustering method (Kaufman and Rousseeuw 2009) can be applied to time series using these dissimilarity measures (an example in Watanabe (2021) are based on δ W (0)). The k-medoid method can be used when it is difficult to define centroids (cf. Kaufman and Rousseeuw (2009) ). The dissimilarity measures δ W and δ N are essentially invariant for the exchange of time series (δ for is equal to the original δ for − ). This symmetric property is not required for dissimilarity measure between one trend and a centroid of trends. If a kind of average time series in each cluster is available instead of the medoid, it is not necessarily adequate to use the weights for both time series. The reason is that the weights for the centroid should be invariant. In Watanabe (2021) , the common trend for multivariate time series is also defined as shown below. We can use the common trend in each cluster instead of the medoid as the centroid. Then, the weight for the common trend becomes unnecessary. In this paper, we propose another dissimilarity function for clustering based on δ W ( ) but in a slightly different form. Definition 4 Let {C n } be the given common trend. The dissimilarity measure function δ C ( ) between {T n } and the common trend {C n } is given by The restriction r > 0 means that time series under consideration are assumed to be positively correlated each other in some sense. It is easily shown that Moreover, we have the following theorem. Theorem 1 Suppose that Then, the minimum value of the right-hand side of Eq. (7) is attained at r = r 0 and we have The proof is easy. The cosine similarity is one of the well-known similarity measures (Egghe and Leydesdorff 2009) . The cosine similarity between {C n } and {T n+ } is given by We can consider that the above theorem provides some validity of the cosine similarity. (The relation between the cosine similarity and the dissimilarity δ S in Definition 1 is stated in Watanabe (2021) .) The use of the dissimilarity δ C with some appropriately defined common trend makes a k-means style clustering possible. In the following section, we introduce the common trend which is well suited to δ C . In this paper, we do not assume any models and define the common trend as the weighted sum of the multiple trends given by the solution of an optimization problem. The formulation is similar to the one by Watanabe Watanabe (2021) except for the existence of lags. Let T n = T 1n , . . . , T Pn (n = 1, 2, . . . , N ) be the P-dimensional vector of trends or mean value functions of P-variate time series. First, we introduce the common trend given by Watanabe Watanabe (2021) . (0) n } is the time series given by the optimization problem Each term in the objective function (12) has the same form as Eq. (7) with = 0. That is, lags are not considered in this definition. However, it is important to consider lags for multivariate time series. It is especially important for discussing the common trend. For example, studies on business cycle are crucial in the econometric field. The business cycle is related to the common trend of many time series, and these time series consist of leading, coincident, and lagging indicators. This means that plus or minus lags play key roles. In this paper, we generalize the above definition for considering lags. Let p denote the lag for {T pn } and assume that −L max ≤ p ≤ L max , where L max is a given integer. The common trend {C n } is the time series given by the optimization problem J C (r 1 , . . . , r P , 1 , . . . , P ) −→ minimize (13) with respect to r 1 , . . . , r P and 1 ,…, P , where r p ≥ 0, P p=1 r p = 1 and −L max ≤ p ≤ L max . The objective function is defined by where C n = P p=1 r p T p,n+ p . The optimization (13) is not easy unless both P and L max are small. Watanabe Watanabe (2021) proposed the recursive algorithm for C (0) n in Definition 5. In the following, we propose an extended recursive algorithm for C n in Definition 6. Step 1. Initialize C n by setting r p = 1/P, that is Step 2. For fixed {C n }, find r p and p that minimize Step 3. Replace r p by r p / P p=1 r p and calculate Step 4. Calculate J C in Eq. (14). Step 5. Go to Step 2 until some termination condition is satisfied. Step 6. Select {C n } with {r p , p | p = 1, . . . , P} that minimizes J C . We call this an ECT algorithm. In Step 2, the analytical solution is available. See Theorem 1. This algorithm is applied to practical time series in Sect. 5. A comparison with the Definition 6 is also considered. Now, we propose a k-means method based on the dissimilarity function δ C ( ) and the common trend calculated by the recursive algorithm in Sect. 3. Target data are the P-variate trend T n = T 1n , . . . , T Pn (n = 1, 2, . . . , N ). Let K be a given number of clusters. We define the function u kp as follows: where K k=1 u kp = 1(∀ p) and P p=1 u kp > 0(∀k). Let L MAX be the upper bound of lags and L KMT be a number of repetitions. The following is the recursive algorithm for clustering of trends. Step 1. (Initialize) Set {u kp } randomly. Step 2 Let δ kp be the minimum value with respect to . Step 4. (Reassignment) Redefine u kp as follows: (the tie is not considered here for simplicity). Step 5. If some change in {u kp } occurs, goto Step 2. If there is no change, calculate the value Step 6. Go to Step 1 L KMT − 1 times. Step 7. The classification with the minimum value of J is adopted as the result. We call this a KMT algorithm. Similarly to the usual k-means method, sensitivity of initial values is large. That is, it is not assured that the solution is the global minimum. Usually, L KMT should be set relatively large. If a tie occurs in Step 4, k can be determined randomly similarly to the usual k-means method. An efficient approach to solve the tie problem is to extend the hard clustering to soft clustering. In this paper, we consider hard clustering only. An important feature of the ECT and KMT algorithm is that any extra numerical optimization technique is required, though some nonlinear optimization is required for the use of δ W instead of δ C . Time series consist of daily record numbers of COVID-19 cases in prefectures. The number of prefectures of Japan is 47 ( = P). Data are provided by NHK (Japan Broadcasting Corporation). The upper plot in Fig. 1 shows 47 time series from January 16, 2020 to August 1, 2021. The length is 564. The first day of each month is indicated by the vertical dotted line. We estimate the trends of original time series by the moving average method. We adopt the triangular weight function whose support has the length 28 (4 weeks). The length becomes 536 (= N ), since any processing for both ends of series is not applied here. The lower plot in Fig. 1 shows the moving averaged series. A purpose here is to find similarity of patterns. For this purpose, we apply the proposed KMT algorithm. First, we examine the ECT algorithm. We consider the estimation of the common trend among three time series of Tokyo, Osaka, and Hokkaido without lags by two methods. The first is to apply numerical optimization technique to minimize the objective function in Definition 5. The second is to apply the ECT algorithm without lags (L MAX = 0). (A similar example is illustrated in Watanabe (2021) .) Original time series and moving averaged time series are shown in Fig. 2 . Calculation in this paper is carried out using MATLAB. The MATLAB function 'fmincon' is used for minimization under constraints in Definition 5. Two estimated common trends are illustrated in Fig. 3 . The right is the partly magnified plot. It is found that the difference is quite small. Values of the objective function J C obtained by two methods are plotted in Fig. 4 . We can say that the recursive ECT algorithm works well for these data. The estimated common trend and three weighted trends are plotted in Fig. 5 , where r 0 in Theorem 1 is used as each weight. Tendency of three time series is reflected roughly in the estimated common trend plotted by bold line. However, it seems that there exist lags in three time series, and then, the common trend fluctuates unnaturally. Figure 6 shows the estimated common trend obtained by the ECT algorithm with lags by setting L MAX = 30. We can say that the ECT algorithm with lags also works well. However, there exists some differences among patterns even if lags are considered. This suggests the (2) values of J C necessity of clustering. In the following, clustering based on the ECT algorithm with lags is applied to all time series. The proposed KMT algorithm is applied to 47 time series by setting K = 1, . . . , 8, L max = 30 and L KMT = 1000. The KMT algorithm with K = 1 is not clustering but means the estimation of the common trend of all time series. The results of K -means method for trends are summarized in Table 1 and Fig. 7 (K = 1, . . . , 6) . Table 1 includes sizes of clusters, values of J , the maximum of | p |, and the maximum of p − q in each cluster. Each trend in Fig. 7 is weighted using r 0 in Theorem 1. The determination of K is important but difficult problem. For usual k-means method, some methods have been proposed; for example, the elbow method, silhouette method, gap statistic, and so on (cf. Yuan and Yang (2019) ). However, there is no definitive method. This is true for the proposed K -means method for trends. Therefore, we do not refer the determination of K in detail, since this problem should be discussed separately. Values of J monotonically decrease and there is no clear "elbow". Selected lags or differences between lags are relatively large when K is 2, 3, 4, or 6, but it is difficult to explain large lags. When K is less than 5, it seems that peaks are not separated well (especially the third peak). One plausible candidate of the number of clusters is 5. The result for K = 5 is illustrated again in Fig. 8 . Table 2 shows prefectures in each cluster, where indicates the selected lag to the common trend of each cluster. It is meaningless to compare lags among different clusters. It is found that Tokyo, Osaka, and Hokkaido belong to different clusters. This means that the estimated common trend in the previous subsection is not so meaningful. It is said that these COVID-19 series contain four waves. The first peak appears around n = 70. Differences among patterns in each cluster appear in the heights of peaks or locations of peaks. In clusters 1, 2, and 4, the fourth peak is remarkably large. On the other hand, the third peak is largest in cluster 5. Cluster 5 consists of prefectures in Kanto area, whose center is Tokyo, except for Gumma. We can say that tendency of each prefecture in Kanto area is resemble. Cluster 1 mainly consists of most prefectures in Kinki area, whose center is Osaka, and prefectures in south Tohoku area. Cluster 3 mainly consists of prefectures in Kyushu area. It is expected that such a statistical analysis will provide epidemiologically or medically useful findings. In this paper, we have focused on the clustering of a multivariate trend. In this final section, we consider the clustering of original series briefly. The proposed KMT algorithm can be applied directly to original time series. However, the meaning of centroid of cluster becomes vague, since it is defined as the weighted sum of original series in the ECT algorithm. As an example, the estimated centroid of all original series of COVID-19 cases by the ECT algorithm is shown in the upper graph in Fig. 9 . The estimated series includes seasonality and irregular fluctuation. It is clear that this series cannot be regarded as the common trend. Moreover, the meaning of seasonality of this series becomes vague, since lags are considered. As a result, this series is not appropriate as the centroid. This means that the direct application of the KMT algorithm to original series is not appropriate usually and the modification of algorithm is required for the direct application. Note that the estimated series might be regarded as the common trend approximately, if P is large and seasonality is absent in original time series. A modification is achieved easily by appending a trend estimation step between Steps 3 and 4 in the ECT algorithm. The moving average method is a simple way for trend estimation. In this example, we adopt the moving average method with the triangular weight function whose support has the length 15 (half a month). The length of the smoothed series does not change here, since the both ends of series are processed in a simple way introduced in Brockwell and Davis (1991) for the sake of brevity. The estimated centroid of all original series of COVID-19 cases by the modified ECT algorithm is shown in the lower graph in Fig. 9 . The estimated series can be regarded as the common trend and then as the centroid. Clustering of original series can be achieved by the KMT algorithm combined with the modified ECT algorithm. A clustering result for original series of COVID-19 cases with K = 3 is illustrated in Fig. 10 . The result is similar to the one by the multivariate trend of COVID-19 series, but not identical. The differences occur from the seasonality components and large irregular fluctuations in original series. It should be considered well whether the clustering of original series is appropriate or not, since original series include trends, seasonal components, and irregular components usually. That is, results of clustering depend on various factors. When some meaningful clusters are found by the proposed method, there is a possibility that results can be applied to prediction. In this case, it is expected to analyze multiple time series in the same cluster from the viewpoint of multivariate time series analysis. However, we should note that the prediction of trends is difficult usually because of nonstationarity. In Step 3 of the KMT Algorithm, the number of δ C ( )'s to be computed becomes huge, when P is large and L max should be large. Some simplification will be required in this case. One way is to consider a subset of lags. Another way is to derive lagged time series previously by considering the dissimilarity function between each trend and the common trend of all series, and then to apply the proposed method without lags. In this paper, we assume that trends are estimated appropriately. For this assumption roles of trend estimation methods are important and essential. In the case of the moving average method, the length of the moving average is crucial. We have to pay sufficient attention for trend estimation. When all values of time series are positive, we can try the log transformation. For log-transformed time series, the validity of our methods becomes doubtful. However, the log transformation cannot be applied to time series including zero values like COVID-19 series. On the other hand, the proposed method can be applied to time series including not only zero but also negative values. Similarly to usual k-means method, the determination of the cluster size is important. Further studies are expected for determination of K . An extension to fuzzy clustering, that is, an extension from hard to soft clustering is an issue in future. The great time series classification bake off: A review and experimental evaluation of recent algorithmic advances Using dynamic time warping to find patterns in time series Time series: Theory and methods The relation between Pearson's correlation coefficient and Salton's cosine measure Finding groups in data: An introduction to cluster analysis Dissimilarity measures for time series and trend analysis: Application to COVID-19 cases series Weighted multivariate fuzzy trend model for seasonal time series Research on K-value selection method of K-means clustering algorithm Acknowledgements This work was supported by Chuo University Personal Research Grant. We would like to thank reviewers for useful comments. The author declare no conflict of interest.