title: DPar2: Fast and Scalable PARAFAC2 Decomposition for Irregular Dense Tensors
authors: Jang, Jun-Gi; Kang, U
date: 2022-03-24

Given an irregular dense tensor, how can we efficiently analyze it? An irregular tensor is a collection of matrices whose columns have the same size and whose rows have different sizes from each other. PARAFAC2 decomposition is a fundamental tool for dealing with an irregular tensor in applications including phenotype discovery and trend analysis. Although several PARAFAC2 decomposition methods exist, their efficiency is limited for irregular dense tensors due to the expensive computations involved with the tensor. In this paper, we propose DPar2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. DPar2 achieves high efficiency by effectively compressing each slice matrix of a given irregular tensor, carefully reordering the computations with the compression results, and exploiting the irregularity of the tensor. Extensive experiments show that DPar2 is up to 6.0x faster than competitors on real-world irregular tensors while achieving comparable accuracy. In addition, DPar2 is scalable with respect to the tensor size and target rank.

How can we efficiently analyze an irregular dense tensor? Many real-world multi-dimensional arrays are represented as irregular dense tensors; an irregular tensor is a collection of matrices with different row lengths. For example, stock data can be represented as an irregular dense tensor; the listing period is different for each stock (irregularity), and almost all of the entries of the tensor are observable during the listing period (high density). The irregular tensor of stock data is the collection of the stock matrices whose row and column dimensions correspond to time and features (e.g., the opening price, the closing price, the trade volume, etc.), respectively. In addition to stock data, many real-world data including music song data and sound data are also represented as irregular dense tensors. Each song can be represented as a slice matrix (e.g., a time-by-frequency matrix) whose rows correspond to the time dimension. Then, a collection of songs is represented as an irregular tensor consisting of slice matrices of songs, each with a different time length. Sound data are represented similarly.

Tensor decomposition has attracted much attention from the data mining community for analyzing tensors [1]-[10]. Specifically, PARAFAC2 decomposition has been widely used for modeling irregular tensors in various applications including phenotype discovery [11], [12], trend analysis [13], and fault detection [14]. However, existing PARAFAC2 decomposition methods are not fast and scalable enough for irregular dense tensors. Perros et al. [11] improve the efficiency of handling irregular sparse tensors by exploiting the sparsity patterns of a given irregular tensor. Many recent works [12], [15]-[17] adopt their idea to handle irregular sparse tensors. However, they are not applicable to irregular dense tensors, which have no sparsity pattern. Although Cheng and Haardt [18] improve the efficiency of PARAFAC2 decomposition by preprocessing a given tensor, there is plenty of room for improvement in terms of computational costs. Moreover, there remains a need for fully employing multi-core parallelism.
The main challenge in successfully designing a fast and scalable PARAFAC2 decomposition method is how to minimize the computational costs involved with an irregular dense tensor and the intermediate data generated in updating factor matrices. In this paper, we propose DPAR2 (Dense PARAFAC2 decomposition), a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. Based on the characteristics of real-world data, DPAR2 compresses each slice matrix of a given irregular tensor using randomized Singular Value Decomposition (SVD). The small compressed results and our careful ordering of computations considerably reduce the computational costs and the intermediate data. In addition, DPAR2 maximizes multi-core parallelism by considering the difference in sizes between slices. With these ideas, DPAR2 achieves higher efficiency and scalability than existing PARAFAC2 decomposition methods on irregular dense tensors. Extensive experiments show that DPAR2 outperforms the existing methods in terms of speed, space, and scalability, while achieving a comparable fitness, where the fitness indicates how well a method approximates given data (see Section IV-A). The contributions of this paper are as follows.
• Algorithm. We propose DPAR2, a fast and scalable PARAFAC2 decomposition method for decomposing irregular dense tensors.
• Analysis. We provide analysis of the time and space complexities of our proposed method DPAR2.
• Experiment. DPAR2 achieves up to 6.0× faster running time than previous PARAFAC2 decomposition methods based on ALS while achieving a similar fitness (see Fig. 1).
• Discovery. With DPAR2, we find that the Korean stock market and the US stock market have different correlations (see Fig. 12) between features (e.g., prices and technical indicators). We also find similar stocks (see Table III) on the US stock market during a specific event (e.g., COVID-19).

Fig. 1. [Best viewed in color] Measurement of the running time and fitness on real-world datasets for three target ranks R: 10, 15, and 20. DPAR2 provides the best trade-off between speed and fitness; DPAR2 is up to 6.0× faster than the competitors while having a comparable fitness.

In the rest of this paper, we describe the preliminaries in Section II, propose our method DPAR2 in Section III, present experimental results in Section IV, discuss related works in Section V, and conclude in Section VI. The code and datasets are available at https://datalab.snu.ac.kr/dpar2.

In this section, we describe tensor notations, tensor operations, Singular Value Decomposition (SVD), and PARAFAC2 decomposition. We use the symbols listed in Table I.

TABLE I. Symbols.
{X_k}_{k=1}^K : irregular tensor of slices X_k for k = 1, ..., K
X_k : slice matrix (∈ R^{I_k×J})
X(i, :) : i-th row of a matrix X
X(:, j) : j-th column of a matrix X
X(i, j) : (i, j)-th element of a matrix X
X_(n) : mode-n matricization of a tensor X
Q_k, S_k : factor matrices of the k-th slice
H, V : factor matrices of an irregular tensor
vec(X) : vectorization of a matrix X

We use boldface lowercase letters (e.g., x) and boldface capital letters (e.g., X) for vectors and matrices, respectively. In this paper, indices start at 1. An irregular tensor is a 3-order tensor X whose k-frontal slice X(:, :, k) is X_k ∈ R^{I_k×J}. We denote an irregular tensor by {X_k}_{k=1}^K instead of X, where K is the number of k-frontal slices of the tensor. An example is shown in Fig. 2.
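For concreteness, the following is a minimal NumPy sketch (our own illustration, not the paper's MATLAB code) of how an irregular tensor {X_k} can be held as a list of slice matrices that share the column size J but differ in row size I_k; all sizes below are made up.

```python
import numpy as np

J = 4                      # common number of columns (e.g., features)
row_sizes = [6, 3, 5]      # I_1, I_2, I_3 differ across slices (irregularity)

# {X_k}_{k=1}^K : one dense matrix per slice, all with J columns
X = [np.random.rand(I_k, J) for I_k in row_sizes]

for k, X_k in enumerate(X, start=1):
    print(f"X_{k}: shape {X_k.shape}")   # (I_k, J) with a shared J
```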
Algorithm 1: Randomized SVD [20]
Input: A ∈ R^{I×J}
Output: U ∈ R^{I×R}, Σ ∈ R^{R×R}, and V ∈ R^{J×R}
Parameters: target rank R, an oversampling parameter s, and an exponent q
1: generate a Gaussian test matrix Ω ∈ R^{J×(R+s)}
2: construct Y ← (AA^T)^q AΩ
3: QR ← Y using QR factorization
4: construct B ← Q^T A
5: ŨΣ̃Ṽ^T ← B using truncated SVD at rank R
6: return U = QŨ, Σ, and V

We refer the reader to [19] for the definitions of tensor operations including Frobenius norm, matricization, Kronecker product, and Khatri-Rao product.

Singular Value Decomposition (SVD) decomposes A ∈ R^{I×J} into A = UΣV^T. U ∈ R^{I×R} is the left singular vector matrix of A; U = [u_1 ··· u_R] is a column orthogonal matrix, where R is the rank of A and u_1, ..., u_R are the eigenvectors of AA^T. Σ is an R×R diagonal matrix whose diagonal entries are singular values. The i-th singular value σ_i is in Σ(i, i), where σ_1 ≥ σ_2 ≥ ··· ≥ σ_R ≥ 0. V ∈ R^{J×R} is the right singular vector matrix of A; V = [v_1 ··· v_R] is a column orthogonal matrix, where v_1, ..., v_R are the eigenvectors of A^T A.

Randomized SVD. Many works [20]-[22] have introduced efficient SVD methods that decompose a matrix A ∈ R^{I×J} by applying randomized algorithms. We describe a popular randomized SVD in Algorithm 1. Randomized SVD finds a column orthogonal matrix Q ∈ R^{I×(R+s)} of (AA^T)^q AΩ using a random matrix Ω, constructs a smaller matrix B = Q^T A (∈ R^{(R+s)×J}), and finally obtains the SVD result U (= QŨ), Σ, V of A by computing the SVD of B, i.e., B ≈ ŨΣ̃Ṽ^T. Given a matrix A, the time complexity of randomized SVD is O(IJR), where R is the target rank.

C. PARAFAC2 decomposition

PARAFAC2 decomposition, proposed by Harshman [23], successfully deals with irregular tensors. The definition of PARAFAC2 decomposition is as follows:

Definition 1 (PARAFAC2 Decomposition). Given a target rank R and a 3-order tensor {X_k}_{k=1}^K whose k-frontal slice is X_k ∈ R^{I_k×J} for k = 1, ..., K, PARAFAC2 decomposition approximates each k-th frontal slice X_k by U_k S_k V^T. U_k is a matrix of size I_k × R, S_k is a diagonal matrix of size R × R, and V is a matrix of size J × R which is common for all the slices.

Algorithm 2: PARAFAC2-ALS [24]
Input: X_k ∈ R^{I_k×J} for k = 1, ..., K
Output: U_k ∈ R^{I_k×R}, S_k ∈ R^{R×R} for k = 1, ..., K, and V ∈ R^{J×R}
Parameters: target rank R
1: initialize matrices H ∈ R^{R×R}, V, and S_k for k = 1, ..., K
2: repeat
3:   for k = 1, ..., K do
4:     Z_k Σ_k P_k^T ← SVD of X_k V S_k H^T
5:     Q_k ← Z_k P_k^T
6:   end for
7:   for k = 1, ..., K do
8:     Y_k ← Q_k^T X_k
9:   end for
10:  construct a tensor Y ∈ R^{R×J×K} from slices Y_k ∈ R^{R×J} for k = 1, ..., K
     /* running a single iteration of CP-ALS on Y */
11:  update H using Y_(1)
12:  update V using Y_(2)
13:  update W using Y_(3)
14:  for k = 1, ..., K do
15:    S_k ← diag(W(k, :))
16:  end for
17: until the maximum iteration is reached, or the error ceases to decrease
18: for k = 1, ..., K do
19:   U_k ← Q_k H
20: end for

The objective function of PARAFAC2 decomposition [23] is given as follows:

min_{U_k, S_k, V} Σ_{k=1}^{K} ||X_k − U_k S_k V^T||_F^2.   (1)

For uniqueness, Harshman [23] imposed the constraint

U_k^T U_k = Φ for all k,   (2)

and replaced U_k with Q_k H, where Q_k is a column orthogonal matrix and H is a common matrix for all the slices. Then, Equation (1) is reformulated with Q_k H:

min_{Q_k, H, S_k, V} Σ_{k=1}^{K} ||X_k − Q_k H S_k V^T||_F^2.   (3)

Fig. 2 shows an example of PARAFAC2 decomposition for a given irregular tensor. A common approach to solving the above problem is ALS (Alternating Least Squares), which iteratively updates a target factor matrix while fixing all factor matrices except for the target. Algorithm 2 describes PARAFAC2-ALS. First, we update each Q_k while fixing H, V, and S_k for k = 1, ..., K (lines 4 and 5). By computing the SVD of X_k V S_k H^T as Z_k Σ_k P_k^T, we update Q_k as Z_k P_k^T, which minimizes Equation (3) over Q_k [11], [24], [25].
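As a concrete illustration of the Q_k update just described (lines 4-5 of Algorithm 2), here is a small NumPy sketch under made-up sizes; it is our own example, not the authors' implementation.

```python
import numpy as np

def update_Q(X_k, V, S_k, H):
    """One Q_k update of PARAFAC2-ALS (a sketch of lines 4-5 of Algorithm 2):
    compute the SVD of X_k V S_k H^T and return Q_k = Z_k P_k^T."""
    T = X_k @ V @ S_k @ H.T                       # I_k x R
    Z, _, Pt = np.linalg.svd(T, full_matrices=False)
    return Z @ Pt                                 # I_k x R, column orthogonal

# toy example with made-up sizes
I_k, J, R = 50, 8, 3
X_k = np.random.rand(I_k, J)
V, H = np.random.rand(J, R), np.random.rand(R, R)
S_k = np.diag(np.random.rand(R))
Q_k = update_Q(X_k, V, S_k, H)
print(np.allclose(Q_k.T @ Q_k, np.eye(R)))        # True: Q_k^T Q_k = I
```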
After updating Q_k, the remaining factor matrices H, V, and S_k are updated by minimizing the following objective function:

Σ_{k=1}^{K} ||Q_k^T X_k − H S_k V^T||_F^2.   (4)

Minimizing this function amounts to updating H, V, and S_k using CP decomposition of a tensor Y ∈ R^{R×J×K} whose k-th frontal slice is Q_k^T X_k (lines 8 and 10). We run a single iteration of CP decomposition for updating them [24] (lines 11 to 16). Q_k, H, S_k, and V are alternately updated until convergence.

Fig. 2. Example of PARAFAC2 decomposition. Given an irregular tensor {X_k}_{k=1}^K, PARAFAC2 decomposes it into the factor matrices H, V, Q_k, and S_k for k = 1, ..., K. Note that Q_k H is equal to U_k.

Iterative computations with an irregular dense tensor require high computational costs and large intermediate data. RD-ALS [18] reduces the costs by preprocessing a given tensor and performing PARAFAC2 decomposition using the preprocessed result, but the improvement of RD-ALS is limited. Also, recent works have successfully dealt with sparse irregular tensors by exploiting sparsity. However, the efficiency of their models depends on the sparsity patterns of a given irregular tensor, and thus there is little improvement on irregular dense tensors. Specifically, computations with large dense slices X_k at each iteration are burdensome as the number of iterations increases. We focus on improving the efficiency and scalability on irregular dense tensors.

In this section, we propose DPAR2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. Before describing the main ideas of our method, we present the main challenges that need to be tackled.
C1. Dealing with large irregular tensors. PARAFAC2 decomposition (Algorithm 2) iteratively updates factor matrices (i.e., U_k, S_k, and V) using an input tensor. Dealing with a large input tensor is burdensome in updating the factor matrices as the number of iterations increases.
C2. Minimizing numerical computations and intermediate data. How can we minimize the intermediate data and overall computations?
C3. Maximizing multi-core parallelism. How can we parallelize the computations for PARAFAC2 decomposition?
The main ideas that address the challenges mentioned above are as follows:
I1. Compressing an input tensor using randomized SVD considerably reduces the computational costs of updating factor matrices (Section III-B).
I2. Careful reordering of computations with the compression results minimizes the intermediate data and the number of operations (Sections III-C to III-E).
I3. Careful distribution of work between threads enables DPAR2 to achieve high efficiency by considering the various lengths I_k for k = 1, ..., K (Section III-F).

Fig. 3. Overview of DPAR2. Given an irregular tensor {X_k}_{k=1}^K, DPAR2 first compresses the given irregular tensor by exploiting randomized SVD. Then, DPAR2 iteratively and efficiently updates the factor matrices, Q_k, H, S_k, and V, using only the compressed matrices, to get the result of PARAFAC2 decomposition.

As shown in Fig. 3, DPAR2 first compresses each slice of an irregular tensor using randomized SVD (Section III-B). The compression is performed once before the iterations, and only the compression results are used in the iterations. It significantly reduces the time and space costs of updating factor matrices.
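The compression relies on randomized SVD (Algorithm 1). Below is a minimal NumPy sketch of it; the oversampling s, the exponent q, and the test sizes are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def randomized_svd(A, R, s=5, q=1):
    """A sketch of Algorithm 1 (randomized SVD) with oversampling s and
    power-iteration exponent q; parameter names here are ours."""
    I, J = A.shape
    Omega = np.random.randn(J, R + s)            # Gaussian test matrix
    Y = A @ Omega
    for _ in range(q):                           # Y = (A A^T)^q A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # column orthogonal basis
    B = Q.T @ A                                  # small (R+s) x J matrix
    Ub, Sb, Vbt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub[:, :R]                            # I x R
    return U, np.diag(Sb[:R]), Vbt[:R].T         # U, Sigma, V

A = np.random.rand(1000, 10) @ np.random.rand(10, 40)      # a rank-10 matrix
U, Sigma, V = randomized_svd(A, R=10)
print(np.linalg.norm(A - U @ Sigma @ V.T) / np.linalg.norm(A))  # near zero
```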
After compression, DPAR2 updates the factor matrices at each iteration by exploiting the compression results (Sections III-C to III-E). Careful reordering of computations is required to achieve high efficiency. Also, by carefully allocating input slices to threads, DPAR2 accelerates the overall process (Section III-F).

DPAR2 (see Algorithm 3) is a fast and scalable PARAFAC2 decomposition method based on the ALS procedure described in Algorithm 2. The main challenge that needs to be tackled is to minimize the number of heavy computations involved with a given irregular tensor {X_k}_{k=1}^K consisting of slices X_k for k = 1, ..., K (lines 4 and 8 of Algorithm 2). As the number of iterations increases (lines 2 to 17 in Algorithm 2), these heavy computations make PARAFAC2-ALS slow. For efficiency, we preprocess a given irregular tensor into small matrices, and then update the factor matrices by carefully using the small ones.

Our approach to address the above challenges is to compress a given irregular tensor {X_k}_{k=1}^K before starting the iterations. As shown in Fig. 4, our main idea is a two-stage lossy compression of the given tensor with randomized SVD: 1) DPAR2 performs randomized SVD of each slice X_k for k = 1, ..., K at target rank R, and 2) DPAR2 performs randomized SVD of the matrix obtained by horizontally concatenating the singular value matrices and right singular vector matrices of the slices X_k. Randomized SVD allows us to compress slice matrices with low computational costs and low errors.

Fig. 4. Two-stage SVD for a given irregular tensor. In the first stage, DPAR2 performs randomized SVD of X_k for all k. In the second stage, DPAR2 performs randomized SVD of M ∈ R^{J×KR}, the horizontal concatenation of C_k B_k for k = 1, ..., K.

First Stage. In the first stage, DPAR2 compresses a given irregular tensor by performing randomized SVD of each slice X_k at target rank R (line 3 in Algorithm 3):

X_k ≈ A_k B_k C_k^T,

where A_k ∈ R^{I_k×R} is a matrix consisting of left singular vectors, B_k ∈ R^{R×R} is a diagonal matrix whose elements are singular values, and C_k ∈ R^{J×R} is a matrix consisting of right singular vectors.

Second Stage. Although small compressed data are generated in the first stage, there is room to further compress the intermediate data from the first stage. In the second stage, we compress the horizontal concatenation of the matrices C_k B_k obtained in the first stage. Compressing the matrix M maximizes the efficiency of updating the factor matrices H, V, and W (see Equation (3)) at later iterations. We construct a matrix M ∈ R^{J×KR} by horizontally concatenating C_k B_k for k = 1, ..., K (line 5 in Algorithm 3). Then, DPAR2 performs randomized SVD of M (line 6 in Algorithm 3):

M ≈ D E F^T,

where D ∈ R^{J×R} is a matrix consisting of left singular vectors, E ∈ R^{R×R} is a diagonal matrix whose elements are singular values, and F ∈ R^{KR×R} is a matrix consisting of right singular vectors. With the two stages, we obtain the compressed results D, E, F, and A_k for k = 1, ..., K. Before describing how to update the factor matrices, we re-express the k-th slice X_k using the compressed results:

X_k ≈ A_k B_k C_k^T ≈ A_k F^(k) E D^T,   (6)

where F^(k) ∈ R^{R×R} denotes the k-th vertical block of F. Since C_k B_k is the k-th horizontal block of M and D E F^(k)T is the k-th horizontal block of D E F^T, B_k C_k^T corresponds to F^(k) E D^T. Therefore, we obtain Equation (6) by replacing B_k C_k^T with F^(k) E D^T. In updating the factor matrices, we use A_k F^(k) E D^T instead of X_k. The two-stage compression lays the groundwork for efficient updates. Our goal is to efficiently update the factor matrices H, V, S_k, and Q_k for k = 1, ..., K, using the compressed results A_k F^(k) E D^T.
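A compact NumPy sketch of the two-stage compression (our illustration; it uses an exact truncated SVD in place of randomized SVD for brevity, and all sizes are made up):

```python
import numpy as np

def truncated_svd(A, R):
    # exact truncated SVD for brevity; DPar2 uses randomized SVD here instead
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :R], np.diag(s[:R]), Vt[:R].T

def two_stage_compress(X, R):
    """Stage 1: X_k ≈ A_k B_k C_k^T for every slice.
    Stage 2: M = [C_1 B_1, ..., C_K B_K] ≈ D E F^T.
    Together: X_k ≈ A_k F^(k) E D^T (Equation (6))."""
    A, blocks = [], []
    for X_k in X:
        A_k, B_k, C_k = truncated_svd(X_k, R)
        A.append(A_k)
        blocks.append(C_k @ B_k)                           # J x R block of M
    M = np.hstack(blocks)                                  # J x KR
    D, E, F = truncated_svd(M, R)                          # D: J x R, F: KR x R
    Fk = [F[k * R:(k + 1) * R, :] for k in range(len(X))]  # F^(k): R x R
    return A, D, E, Fk

X = [np.random.rand(I_k, 30) for I_k in (80, 50, 120)]     # toy irregular tensor
A, D, E, Fk = two_stage_compress(X, R=10)
print((A[0] @ Fk[0] @ E @ D.T).shape)                      # (80, 30), approximates X_1
```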
The main challenge in updating the factor matrices is to minimize the numerical computations and intermediate data by exploiting the compressed results obtained in Section III-B. A naive approach would reconstruct X̂_k = A_k F^(k) E D^T from the compressed results, and then update the factor matrices. However, this approach fails to improve the efficiency of updating the factor matrices. We propose an efficient update rule that uses the compressed results to 1) find Q_k and Y_k (lines 5 and 8 in Algorithm 2), and 2) compute a single iteration of CP-ALS (lines 11 to 13 in Algorithm 2). There are two differences between our update rule and PARAFAC2-ALS (Algorithm 2). First, we avoid explicit computations of Q_k and Y_k. Instead, we find small factorized matrices of Q_k and Y_k, respectively, and then exploit the small ones to update H, V, and W. The small matrices are computed efficiently by exploiting the compressed results. The second difference is that DPAR2 obtains H, V, and W using the small factorized matrices of Y_k. Careful ordering of computations with them considerably reduces the time and space costs at each iteration. We describe how to find the factorized matrices of Q_k and Y_k in Section III-D, and how to update the factor matrices in Section III-E.

Algorithm 3: DPAR2
Input: X_k ∈ R^{I_k×J} for k = 1, ..., K
Output: U_k ∈ R^{I_k×R}, S_k ∈ R^{R×R} for k = 1, ..., K, and V ∈ R^{J×R}
Parameters: target rank R
1: initialize matrices H ∈ R^{R×R}, V, and S_k for k = 1, ..., K
   /* Compressing slices in parallel */
2: for k = 1, ..., K do
3:   compute A_k B_k C_k^T ← randomized SVD of X_k at rank R
4: end for
5: construct M ∈ R^{J×KR} by horizontally concatenating C_k B_k for k = 1, ..., K
6: compute D E F^T ← randomized SVD of M at rank R
   /* Iteratively updating factor matrices */
7: repeat
8:   for k = 1, ..., K do
9:     Z_k Σ_k P_k^T ← SVD of F^(k) E D^T V S_k H^T
10:  end for
11-22: update H, V, and W using the factorized matrices P_k Z_k^T F^(k) E D^T of Y_k (line 12) and Lemmas 1 to 3: G^(1) is computed column by column (line 14), H is updated with the pseudoinverse term (line 15), V and W are updated analogously, and S_k ← diag(W(k, :)) is set for k = 1, ..., K (line 21)
23: until the maximum iteration is reached, or the error ceases to decrease
24: for k = 1, ..., K do
25:   U_k ← A_k Z_k P_k^T H
26: end for

The first goal of updating the factor matrices is to find the factorized matrices of Q_k and Y_k for k = 1, ..., K. In Algorithm 2, finding Q_k and Y_k is expensive due to the computations involved with X_k (lines 4 and 8 in Algorithm 2). To reduce the costs for Q_k and Y_k, our main idea is to exploit the compressed results A_k, D, E, and F^(k) instead of X_k. Additionally, we exploit the column orthogonal property of A_k, i.e., A_k^T A_k = I, where I is the identity matrix.

We first re-express Q_k using the compressed results obtained in Section III-B. DPAR2 reduces the time and space costs for Q_k by exploiting the column orthogonal property of A_k. A direct way is to compute the SVD of A_k F^(k) E D^T V S_k H^T (≈ X_k V S_k H^T), but there is a more efficient way than this approach. Thanks to the column orthogonal property of A_k, it suffices to compute the SVD of the much smaller matrix F^(k) E D^T V S_k H^T = Z_k Σ_k P_k^T, where Σ_k is a diagonal matrix whose entries are the singular values of F^(k) E D^T V S_k H^T, and the column vectors of Z_k and P_k are the left singular vectors and the right singular vectors of F^(k) E D^T V S_k H^T, respectively. Then, we obtain the factorized matrices of Q_k as follows:

Q_k ≈ A_k Z_k P_k^T,

where A_k Z_k and P_k are the left and the right singular vectors of A_k F^(k) E D^T V S_k H^T, respectively. We avoid the explicit construction of Q_k, and use A_k Z_k P_k^T instead of Q_k. Since A_k is already column orthogonal, we avoid performing the SVD of A_k F^(k) E D^T V S_k H^T, which is much larger than F^(k) E D^T V S_k H^T. Then, we represent Y_k as the following expression (line 12 in Algorithm 3):

Y_k = Q_k^T X_k ≈ (A_k Z_k P_k^T)^T A_k F^(k) E D^T = P_k Z_k^T (A_k^T A_k) F^(k) E D^T = P_k Z_k^T F^(k) E D^T,

where A_k^T A_k = I_{R×R}, the identity matrix of size R × R, is used for the last equality. By exploiting the factorized matrices of Q_k, we compute Y_k without involving A_k in the process. The next goal is to efficiently update the matrices H, V, and W using the small factorized matrices of Y_k.
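The following NumPy sketch (ours, with hypothetical shapes) shows how the small factors of Q_k and Y_k are obtained from the compressed results without ever forming X_k or Q_k:

```python
import numpy as np

def factorized_Q_and_Y(F_k, E, D, V, S_k, H):
    """Compute Z_k, P_k from an SVD of the small R x R matrix
    F^(k) E D^T V S_k H^T, so that Q_k ≈ A_k Z_k P_k^T and
    Y_k ≈ P_k Z_k^T F^(k) E D^T (A_k never appears here)."""
    T = F_k @ E @ D.T @ V @ S_k @ H.T              # R x R
    Z_k, _, P_kt = np.linalg.svd(T, full_matrices=False)
    P_k = P_kt.T
    Y_k_factor = P_k @ Z_k.T @ F_k @ E @ D.T       # R x J factor of Y_k
    return Z_k, P_k, Y_k_factor

# toy shapes: R = 3, J = 20
R, J = 3, 20
F_k, H = np.random.rand(R, R), np.random.rand(R, R)
E, S_k = np.diag(np.random.rand(R)), np.diag(np.random.rand(R))
D, V = np.random.rand(J, R), np.random.rand(J, R)
Z_k, P_k, Y_k_factor = factorized_Q_and_Y(F_k, E, D, V, S_k, H)
print(Y_k_factor.shape)                            # (3, 20): R x J, independent of I_k
```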
Naively, we would compute Y and run a single iteration of CP-ALS with Y to update H, V, and W (lines 11 to 13 in Algorithm 2). However, multiplying a matricized tensor and a Khatri-Rao product (e.g., Y_(1)(W ⊙ V)) is burdensome, and thus we exploit the structure of the decomposed results P_k Z_k^T F^(k) E D^T of Y_k to reduce memory requirements and computational costs. In other words, we do not compute Y_k; we need only the term G^(1) = Y_(1)(W ⊙ V). Therefore, we compute that term without the reconstruction, by carefully determining the order of computations and exploiting the factorized matrices of Y_(1), i.e., D, E, P_k, Z_k, and F^(k) for k = 1, ..., K. With Lemma 1, we reduce the computational cost of computing Y_(1)(W ⊙ V).

Lemma 1. G^(1) = Y_(1)(W ⊙ V) is computed as G^(1)(:, r) = Σ_{k=1}^{K} W(k, r) (P_k Z_k^T F^(k)) (E D^T V(:, r)) for r = 1, ..., R.

Proof. Y_(1) is represented as follows: Y_(1) = [Y_1 ··· Y_K] = [P_1 Z_1^T F^(1) ··· P_K Z_K^T F^(K)] (I_K ⊗ E D^T). Then, the r-th column of G^(1) = Y_(1)(W ⊙ V) is expressed as follows: G^(1)(:, r) = [P_1 Z_1^T F^(1) ··· P_K Z_K^T F^(K)] (I_K ⊗ E D^T)(W(:, r) ⊗ V(:, r)) = [P_1 Z_1^T F^(1) ··· P_K Z_K^T F^(K)] (W(:, r) ⊗ E D^T V(:, r)). The mixed-product property (i.e., (A ⊗ B)(C ⊗ D) = AC ⊗ BD) is used in the above equation. Therefore, G^(1)(:, r) is equal to [P_1 Z_1^T F^(1) ··· P_K Z_K^T F^(K)] (W(:, r) ⊗ E D^T V(:, r)). We represent it as Σ_{k=1}^{K} W(k, r) (P_k Z_k^T F^(k)) (E D^T V(:, r)) using block matrix multiplication, since the k-th vertical block of W(:, r) ⊗ E D^T V(:, r) is W(k, r) E D^T V(:, r).

As shown in Fig. 5, we compute Y_(1)(W ⊙ V) column by column. In computing G^(1)(:, r), we compute E D^T V(:, r), sum up W(k, r) P_k Z_k^T F^(k) for all k, and then perform matrix multiplication between the two preceding results (line 14 in Algorithm 3). After computing G^(1), we update H as H ← G^(1)((W^T W) ∗ (V^T V))^†, where † denotes the Moore-Penrose pseudoinverse and ∗ denotes the element-wise product (line 15 in Algorithm 3). Note that the pseudoinverse operation requires a lower computational cost compared to computing G^(1), since the size of (W^T W) ∗ (V^T V) is only R × R. Similarly, with Lemma 2 we reduce the computational cost of computing Y_(2)(W ⊙ H), by carefully determining the order of computations and exploiting the factorized matrices of Y_(2).

Lemma 2. G^(2) = Y_(2)(W ⊙ H) is computed as G^(2)(:, r) = D E Σ_{k=1}^{K} W(k, r) F^(k)T Z_k P_k^T H(:, r) for r = 1, ..., R.

Proof. The proof is analogous to that of Lemma 1: Y_(2) is the horizontal concatenation of Y_k^T = D E F^(k)T Z_k P_k^T for k = 1, ..., K, and applying the mixed-product property to each column of W ⊙ H gives the result.

Exploiting the factorized matrices of Y_(3) and carefully determining the order of computations improves the efficiency as well.

Lemma 3. G^(3) = Y_(3)(V ⊙ H) is computed as G^(3)(k, r) = vec(P_k Z_k^T F^(k))^T (E D^T V(:, r) ⊗ H(:, r)), where vec(·) denotes the vectorization of a matrix.

Proof. The k-th row of Y_(3) is vec(Y_k)^T; applying the identity x^T A y = vec(A)^T (y ⊗ x) to H(:, r)^T Y_k V(:, r) with Y_k = P_k Z_k^T F^(k) E D^T gives the result.

V and W are then updated with G^(2) and G^(3) in the same manner as H. We obtain S_k whose diagonal elements correspond to the k-th row vector of W (line 21 in Algorithm 3). After convergence, we obtain the factor matrices U_k (← A_k Z_k P_k^T H = Q_k H), S_k, and V (line 25 in Algorithm 3).

Convergence Criterion. At the end of each iteration, we determine whether to stop or not (line 23 in Algorithm 3) based on the variation of e = Σ_{k=1}^{K} ||X_k − X̂_k||_F^2, where X̂_k = Q_k H S_k V^T is the k-th reconstructed slice. However, measuring the reconstruction errors Σ_{k=1}^{K} ||X_k − X̂_k||_F^2 is inefficient since it requires high time and space costs proportional to the input slices X_k. To efficiently verify the convergence, our idea is to exploit A_k F^(k) E D^T instead of X_k, since the objective of our update process is to minimize the difference between A_k F^(k) E D^T and X̂_k = Q_k H S_k V^T. With this idea, we improve the efficiency by computing

e' = Σ_{k=1}^{K} ||A_k F^(k) E D^T − A_k Z_k P_k^T H S_k V^T||_F^2.

Since the Frobenius norm is unitarily invariant, we modify the computation as follows:

e' = Σ_{k=1}^{K} ||P_k Z_k^T F^(k) E D^T − H S_k V^T||_F^2,

where P_k^T P_k and Z_k Z_k^T are equal to I ∈ R^{R×R} since P_k and Z_k are orthonormal matrices. Note that the size of P_k Z_k^T F^(k) E D^T and H S_k V^T is R × J, which is much smaller than the size I_k × J of the input slices X_k. This modification completes the efficiency of our update rule.
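To make the update and the convergence check concrete, here is a NumPy sketch (our own, with hypothetical shapes) of the column-by-column computation of G^(1) from Lemma 1, the H update with the pseudoinverse, and the compressed convergence error:

```python
import numpy as np

def update_H(P, Z, Fk, E, D, V, W):
    """Lemma 1: G^(1)(:, r) = sum_k W(k, r) (P_k Z_k^T F^(k)) (E D^T V(:, r)),
    then H <- G^(1) ((W^T W) * (V^T V))^+ (Moore-Penrose pseudoinverse)."""
    K, R = W.shape
    G1 = np.zeros((R, R))
    for r in range(R):
        right = E @ (D.T @ V[:, r])                                   # R-vector
        left = sum(W[k, r] * (P[k] @ Z[k].T @ Fk[k]) for k in range(K))
        G1[:, r] = left @ right
    return G1 @ np.linalg.pinv((W.T @ W) * (V.T @ V))

def compressed_error(P, Z, Fk, E, D, H, S, V):
    """Convergence check of Section III-E, using only the small R x J matrices."""
    return sum(np.linalg.norm(P[k] @ Z[k].T @ Fk[k] @ E @ D.T - H @ S[k] @ V.T, 'fro') ** 2
               for k in range(len(P)))

# toy shapes: K = 4 slices, R = 3, J = 20
K, R, J = 4, 3, 20
P = [np.linalg.qr(np.random.rand(R, R))[0] for _ in range(K)]
Z = [np.linalg.qr(np.random.rand(R, R))[0] for _ in range(K)]
Fk = [np.random.rand(R, R) for _ in range(K)]
E, H = np.diag(np.random.rand(R)), np.random.rand(R, R)
D, V = np.random.rand(J, R), np.random.rand(J, R)
W = np.random.rand(K, R)
S = [np.diag(W[k]) for k in range(K)]
print(update_H(P, Z, Fk, E, D, V, W).shape, compressed_error(P, Z, Fk, E, D, H, S, V))
```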
The last challenge for an efficient and scalable PARAFAC2 decomposition method is how to parallelize the computations described in Sections III-B to III-E. Although a previous work [11] introduces parallelization with respect to the K slices, there is still room for maximizing parallelism. Our main idea is to carefully allocate input slices X_k to threads by considering the irregularity of a given tensor. The most expensive operation is to compute the randomized SVD of the input slices X_k for all k; thus we first focus on how well we parallelize this computation (i.e., lines 2 to 4 in Algorithm 3). A naive approach is to randomly allocate input slices to threads, and let each thread compute the randomized SVD of the allocated slices. However, the completion time of each thread can vary since the computational cost of computing randomized SVD is proportional to the number of rows of the slices; the numbers of rows of the input slices differ from each other, as shown in Fig. 8. Therefore, we need to distribute X_k fairly across the threads considering their numbers of rows. For i = 1, ..., T, consider that the i-th thread performs randomized SVD for the slices in a set T_i, where T is the number of threads. To reduce the completion time, the sums of rows of the slices in the sets should be nearly equal to each other. To achieve this, we exploit a greedy number partitioning technique that repeatedly adds a slice into the set with the smallest sum of rows. Algorithm 4 describes how to construct the sets T_i for compressing the input slices in parallel.

Algorithm 4: Greedy allocation of slices to threads
Input: the number T of threads, X_k ∈ R^{I_k×J} for k = 1, ..., K
Output: sets T_i for i = 1, ..., T
1: initialize T_i ← ∅ for i = 1, ..., T
2: construct a list S of size T whose elements are zero
3: construct a list L_init containing the number of rows of X_k for k = 1, ..., K
4: sort L_init in descending order, and obtain lists L_val and L_ind that contain the sorted values and their corresponding indices
5: for k = 1, ..., K do
6:   t_min ← argmin S
7:   l ← L_ind[k]
8:   T_{t_min} ← T_{t_min} ∪ {X_l}
9:   S[t_min] ← S[t_min] + L_val[k]
10: end for

Let L_init be a list containing the number of rows of X_k for k = 1, ..., K (line 3 in Algorithm 4). We first obtain lists L_val and L_ind, the sorted values and their corresponding indices, by sorting L_init in descending order (line 4 in Algorithm 4). We repeatedly add a slice X_k to the set T_i that has the smallest sum. For each k, we find the index t_min of the minimum in S, whose i-th element corresponds to the sum of row sizes of the slices in the i-th set T_i (line 6 in Algorithm 4). Then, we add a slice X_l to the set T_{t_min}, where l is equal to L_ind[k], and update the list S by S[t_min] ← S[t_min] + L_val[k] (lines 7 to 9 in Algorithm 4). Note that S[k], L_ind[k], and L_val[k] denote the k-th element of S, L_ind, and L_val, respectively. After obtaining the sets T_i for i = 1, ..., T, the i-th thread performs randomized SVD for the slices in the set T_i. After decomposing X_k for all k, we do not need to consider the irregularity for parallelism since there is no computation with A_k, which involves the irregularity. Therefore, we uniformly allocate computations across threads for all k slices. In each iteration (lines 8 to 22 in Algorithm 3), we easily parallelize the computations. First, we parallelize the iteration (lines 8 to 10) for all k slices. To update H, V, and W, we need to compute G^(1), G^(2), and G^(3) in parallel. In Lemmas 1 and 2, DPAR2 computes W(k, r) P_k Z_k^T F^(k) and W(k, r) F^(k)T Z_k P_k^T H(:, r) in parallel over k, respectively. In Lemma 3, DPAR2 computes vec(P_k Z_k^T F^(k))^T (E D^T V(:, r) ⊗ H(:, r)) in parallel over k.
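A minimal Python sketch of the greedy number partitioning in Algorithm 4 (our own illustration; the row counts are made up):

```python
def partition_slices(row_sizes, T):
    """Greedy number partitioning: assign slices to T threads so that the
    per-thread sums of row counts are nearly balanced (cf. Algorithm 4)."""
    sets = [[] for _ in range(T)]          # T_i: indices of slices for thread i
    loads = [0] * T                        # S: current sum of rows per thread
    order = sorted(range(len(row_sizes)), key=lambda k: -row_sizes[k])
    for k in order:                        # largest slices first
        t = loads.index(min(loads))        # thread with the smallest load
        sets[t].append(k)
        loads[t] += row_sizes[k]
    return sets, loads

sets, loads = partition_slices([900, 850, 400, 380, 300, 120, 100], T=3)
print(sets, loads)                         # per-thread slice indices and balanced row sums
```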
We analyze the time complexity of DPAR2. The total time cost of DPAR2 consists of the preprocessing cost (see Lemma 4) and the iteration cost (see Lemma 5). Note that the MJR² term is omitted since it is much smaller than Σ_{k=1}^{K} I_k JR and JKR².

The size of the preprocessed data of DPAR2 is O(Σ_{k=1}^{K} I_k R + KR² + JR).
Proof. The size of the preprocessed data of DPAR2 is proportional to the sizes of E, D, A_k, and F^(k) for k = 1, ..., K. The size of E and D is R and J × R, respectively. For each k, the size of A_k and F^(k) is I_k × R and R × R, respectively. Therefore, the size of the preprocessed data of DPAR2 is O(Σ_{k=1}^{K} I_k R + KR² + JR).

In this section, we experimentally evaluate the performance of DPAR2. We answer the following questions:
Q1 Performance (Section IV-B). How quickly and accurately does DPAR2 perform PARAFAC2 decomposition compared to other methods?
Q2 Data Scalability (Section IV-C). How well does DPAR2 scale up with respect to tensor size and target rank?
Q3 Multi-core Scalability (Section IV-D). How much does the number of threads affect the running time of DPAR2?
Q4 Discovery (Section IV-E). What can we discover from real-world tensors using DPAR2?

We describe the experimental settings for the datasets, competitors, parameters, and environments.
Machine. We use a workstation with 2 CPUs (Intel Xeon E5-2630 v4 @ 2.2GHz), each of which has 10 cores, and 512GB memory for the experiments.
Real-world Data. We evaluate the performance of DPAR2 and competitors on the real-world datasets summarized in Table II.

TABLE II. Summary of real-world datasets.
Dataset | Max I_k | J | K | Summary
KR Stock [3] | 5,270 | 88 | 3,664 | stock
Activity [28], [29] | 553 | 570 | 320 | video feature
Action [28], [29] | 936 | 570 | 567 | video feature
Traffic [30] | … | … | … | traffic

FMA dataset [26] is a collection of songs. Urban Sound dataset [27] is a collection of urban sounds such as drilling, siren, and street music. For the two datasets, we convert each time series into an image of a log-power spectrogram so that their forms are (time, frequency, song; value) and (time, frequency, sound; value), respectively. US Stock dataset is a collection of stocks on the US stock market. Korea Stock dataset [3] is a collection of stocks on the South Korea stock market. Each stock is represented as a matrix of (date, feature), where the feature dimension includes 5 basic features and 83 technical indicators. The basic features, collected daily, are the opening, the closing, the highest, and the lowest prices and the trading volume, and the technical indicators are calculated based on the basic features. The two stock datasets have the form of (time, feature, stock; value). Activity data and Action data are collections of features for motion videos. The two datasets have the form of (frame, feature, video; value). We refer the reader to [28] for their feature extraction. Traffic data is a collection of traffic volumes around Melbourne, and its form is (sensor, frequency, time; measurement). PEMS-SF data contain the occupancy rate of different car lanes of San Francisco bay area freeways: (station, timestamp, day; measurement). Traffic data and PEMS-SF data are 3-order regular tensors, but we can analyze them using PARAFAC2 decomposition approaches.
Synthetic Data. We evaluate the scalability of DPAR2 and competitors on synthetic tensors. Given the number K of slices and the slice sizes I and J, we generate a synthetic tensor using the tenrand(I, J, K) function in Tensor Toolbox [31], which randomly generates a tensor X ∈ R^{I×J×K}. We then construct a tensor {X_k}_{k=1}^K where X_k is equal to X(:, :, k) for k = 1, ..., K.
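The paper generates synthetic tensors with tenrand in the MATLAB Tensor Toolbox; an analogous NumPy sketch (ours, with small toy sizes rather than the experiments' dimensions) is:

```python
import numpy as np

def synthetic_irregular_tensor(I, J, K, seed=0):
    """Analogue of tenrand(I, J, K): draw a random dense tensor X in R^{I x J x K}
    and split it into slices X_k = X(:, :, k), so I_1 = ... = I_K = I."""
    rng = np.random.default_rng(seed)
    X = rng.random((I, J, K))
    return [X[:, :, k] for k in range(K)]

slices = synthetic_irregular_tensor(I=200, J=100, K=50)   # small toy sizes
print(len(slices), slices[0].shape)                        # 50 slices of shape (200, 100)
```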
Competitors. We compare DPAR2 with PARAFAC2 decomposition methods based on ALS. All the methods, including DPAR2, are implemented in MATLAB (R2020b).
• DPAR2: the proposed PARAFAC2 decomposition method, which preprocesses a given irregular dense tensor and updates factor matrices using the preprocessed result.
• RD-ALS [18]: PARAFAC2 decomposition which preprocesses a given irregular tensor. Since there is no public code, we implement it using Tensor Toolbox [31] based on its paper [18].
• PARAFAC2-ALS: PARAFAC2 decomposition based on the ALS approach. It is implemented based on Algorithm 2 using Tensor Toolbox [31].
• SPARTan [11]: fast and scalable PARAFAC2 decomposition for irregular sparse tensors. Although it targets irregular sparse tensors, it can be adapted to irregular dense tensors. We use the code implemented by the authors.
Parameters. We use the following parameters.
• Number of threads: we use 6 threads except in Section IV-D.
• Max number of iterations: the maximum number of iterations is set to 32.
• Rank: we set the target rank R to 10 except in the trade-off experiments of Section IV-B and Section IV-D. We also set the rank of randomized SVD to 10, the same as the target rank R of PARAFAC2 decomposition.
To compare running time, we run each method 5 times and report the average.
Fitness. We evaluate the fitness defined as follows:

Fitness = 1 − (Σ_{k=1}^{K} ||X_k − X̂_k||_F^2) / (Σ_{k=1}^{K} ||X_k||_F^2),

where X_k is the k-th input slice and X̂_k is the k-th reconstructed slice of PARAFAC2 decomposition. Fitness close to 1 indicates that a model approximates a given input tensor well.

We evaluate the fitness and the running time of DPAR2, RD-ALS, SPARTan, and PARAFAC2-ALS.
Trade-off. Fig. 1 shows that DPAR2 provides the best trade-off of running time and fitness on real-world irregular tensors for the three target ranks: 10, 15, and 20. DPAR2 achieves 6.0× faster running time than the competitors for the FMA dataset while having a comparable fitness. In addition, DPAR2 provides at least 1.5× faster running time than the competitors for the other datasets. The performance gap is large for the FMA and Urban datasets, whose sizes are larger than those of the other datasets. This implies that DPAR2 is more scalable than the competitors in terms of tensor sizes.
Preprocessing time. We compare DPAR2 with RD-ALS, and exclude SPARTan and PARAFAC2-ALS since only RD-ALS has a preprocessing step. As shown in Fig. 9(a), DPAR2 is up to 10× faster than RD-ALS. There is a large performance gap on the FMA and Urban datasets since RD-ALS cannot avoid the overheads for the large tensors. RD-ALS performs SVD of the concatenation of the slice matrices X_k^T for k = 1, ..., K, which leads to its slow preprocessing time.
Iteration time. Fig. 9(b) shows that DPAR2 outperforms the competitors in running time at each iteration. Compared to SPARTan and PARAFAC2-ALS, DPAR2 significantly reduces the running time per iteration due to the small size of the preprocessed results. Although RD-ALS reduces the computational cost at each iteration by preprocessing a given tensor, DPAR2 is up to 10.3× faster than RD-ALS. Compared to RD-ALS, which computes the variation of the fitness for its convergence criterion, DPAR2 efficiently verifies the convergence by computing the variation of Σ_{k=1}^{K} ||P_k Z_k^T F^(k) E D^T − H S_k V^T||_F^2, which affects the running time at each iteration. In summary, DPAR2 obtains U_k, S_k, and V in a reasonable running time even as the number of iterations increases.
Size of preprocessed data. We measure the size of the preprocessed data on real-world datasets. For PARAFAC2-ALS and SPARTan, we report the size of the input irregular tensor since they have no preprocessing step.
Compared to the input irregular tensor, DPAR2 generates preprocessed data that are smaller by up to 201 times, as shown in Fig. 10. Given input slices X_k of size I_k × J, the compression ratio increases as the number J of columns increases; the compression ratio is larger on the FMA, Urban, Activity, and Action datasets than on US Stock, KR Stock, Traffic, and PEMS-SF. This is because the compression ratio is proportional to

(size of an irregular tensor) / (size of the preprocessed results) ≈ IJK / (IKR + KR² + JR) = 1 / (R/J + R²/(IJ) + R/(IK)),

assuming I_1 = ... = I_K = I; R/J is the dominant term, which is much larger than R²/(IJ) and R/(IK).

We evaluate the data scalability of DPAR2 by measuring the running time on several synthetic datasets. We first compare the performance of DPAR2 and the competitors by increasing the size of an irregular tensor. Then, we measure the running time while changing the target rank.
Tensor Size. To evaluate the scalability with respect to the tensor size, we generate 5 synthetic tensors of increasing size I × J × K; for simplicity, we set I_1 = ··· = I_K = I. Fig. 11(a) shows that DPAR2 is up to 15.3× faster than the competitors on all synthetic tensors; in addition, the slope of DPAR2 is lower than that of the competitors. Also note that only DPAR2 obtains the factor matrices of PARAFAC2 decomposition within a minute for all the datasets.
Rank. To evaluate the scalability with respect to rank, we generate the following synthetic data: I_1 = ··· = I_K = 2,000, J = 2,000, and K = 4,000. Given the synthetic tensors, we measure the running time for 5 target ranks: 10, 20, 30, 40, and 50. DPAR2 is up to 15.9× faster than the second-fastest method with respect to rank in Fig. 11(b). For higher ranks, the performance gap slightly decreases since DPAR2 depends on the performance of randomized SVD, which is designed for a low target rank. Still, DPAR2 is up to 7.0× faster than the competitors at the highest rank used in our experiment.

We generate the following synthetic data: I_1 = ··· = I_K = 2,000, J = 2,000, and K = 4,000, and evaluate the running time of DPAR2 while varying the number of threads.

We discover various patterns using DPAR2 on real-world datasets.
1) Feature Similarity on Stock Dataset: We measure the similarities between features on the US Stock and Korea Stock datasets, and compare the results. We compute the Pearson Correlation Coefficient (PCC) between the rows V(i, :), each of which represents a latent vector of the i-th feature. For effective visualization, we select 4 price features (the opening, the closing, the highest, and the lowest prices) and 4 representative technical indicators: OBV (On Balance Volume), ATR (Average True Range), MACD (Moving Average Convergence Divergence), and STOCH (Stochastic Oscillator). Fig. 12(a) and 12(b) show correlation heatmaps for the US Stock data and the Korea Stock data, respectively. We analyze correlation patterns between the price features and the technical indicators. On both datasets, STOCH has a negative correlation and MACD has a weak correlation with the price features. On the other hand, the OBV and ATR indicators have different patterns on the two datasets. On the US stock dataset, ATR and OBV have a positive correlation with the price features. On the Korea stock dataset, OBV has little correlation with the price features. Also, ATR has little correlation with the price features except for the closing price. These different patterns are due to the differences between the two markets in terms of market size, market stability, tax, investment behavior, etc.
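The heatmaps above are based on pairwise PCC between selected rows of V; a minimal NumPy sketch (ours, with a random V and hypothetical feature indices):

```python
import numpy as np

def feature_correlations(V, idx):
    """Pairwise Pearson correlation coefficients between selected rows V(i, :)
    of the feature factor matrix V (row i is the latent vector of feature i)."""
    latent = V[idx, :]                     # (#selected features) x R
    return np.corrcoef(latent)             # PCC matrix

V = np.random.rand(88, 10)                 # e.g., 88 features, rank 10
selected = [0, 1, 2, 3, 10, 20, 30, 40]    # hypothetical indices of 8 features
print(feature_correlations(V, selected).shape)   # (8, 8)
```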
2) Finding Similar Stocks: On the US Stock dataset, which stock is similar to a target stock s_T in a time range that a user is curious about? In this section, we provide an analysis by setting the target stock s_T to Microsoft (Ticker: MSFT), and the time range to a period when COVID-19 was very active (Jan. 2, 2020 – Apr. 15, 2021). We efficiently answer the question by 1) constructing the tensor included in the range, 2) obtaining factor matrices with DPAR2, and 3) post-processing the factor matrices of DPAR2. Since U_k represents the temporal latent vectors of the k-th stock, the similarity sim(s_i, s_j) between stocks s_i and s_j is computed as follows:

sim(s_i, s_j) = exp(−γ ||U_{s_i} − U_{s_j}||_F^2),   (10)

where exp is the exponential function. We set γ to 0.01 in this section. Note that we use only the stocks that have the same target range, since U_{s_i} − U_{s_j} is defined only when the two matrices are of the same size. Based on sim(s_i, s_j), we find stocks similar to s_T using two different techniques: 1) k-nearest neighbors, and 2) Random Walks with Restarts (RWR). The first approach simply finds stocks similar to the target stock, while the second one finds similar stocks by considering the multi-faceted relationships between stocks.
k-nearest neighbors. We compute sim(s_T, s_j) for j = 1, ..., K, where K is the number of stocks to be compared, and find the top-10 stocks most similar to s_T, Microsoft (Ticker: MSFT). In Table III(a), the Microsoft stock is similar to stocks of the Technology sector or with a large capitalization (e.g., Amazon.com, Apple, and Alphabet) during the COVID-19 period. Moody's is also similar to the target stock.
Random Walks with Restarts (RWR). We find similar stocks using another approach, Random Walks with Restarts (RWR) [32]-[35]. To exploit RWR, we first construct a similarity graph based on the similarities between stocks. The elements of the adjacency matrix A of the graph are defined as follows:

A(i, j) = sim(s_i, s_j),   (11)

and we ignore self-loops by setting A(i, i) to 0 for i = 1, ..., K. After constructing the graph, we find similar stocks using RWR. The score vector r is computed by using the power iteration [36] as described in [37]:

r^(i) ← (1 − c) Ã^T r^(i−1) + c q,   (12)

where Ã is the row-normalized adjacency matrix, r^(i) is the score vector at the i-th iteration, c is a restart probability, and q is a query vector. We set c to 0.15, the maximum number of iterations to 100, and q to the one-hot vector whose element corresponding to Microsoft is 1 and whose other elements are 0. As shown in Table III, the common pattern of the two approaches is that many stocks among the top-10 belong to the technology sector. There is also a difference. In Table III, the blue color indicates the stocks that appear only in one of the two approaches among the top-10. In Table III(a), the k-nearest neighbors approach simply finds the top-10 stocks which are closest to Microsoft based on distances. On the other hand, the RWR approach finds the top-10 stocks by considering more complicated relationships. There are 4 stocks appearing only in Table III(b). S&P Global is included since it is very close to Moody's, which is ranked 4th in Table III(a). Netflix, Autodesk, and NVIDIA are relatively far from the target stock compared to stocks such as Intuit and Alphabet, but they are included in the top-10 since they are very close to Amazon.com, Adobe, ANSYS, and Synopsys. This difference comes from the fact that the k-nearest neighbors approach considers only distances from the target stock, while the RWR approach considers distances between other stocks in addition to the target stock. DPAR2 allows us to efficiently obtain factor matrices and find interesting patterns in data.
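A compact NumPy sketch (ours; toy data, with the parameter values stated above) of the similarity of Equation (10), the self-loop-free similarity graph, and the RWR power iteration of Equation (12):

```python
import numpy as np

def rwr_scores(U, target, gamma=0.01, c=0.15, iters=100):
    """sim(s_i, s_j) = exp(-gamma * ||U_i - U_j||_F^2), adjacency without
    self-loops, and RWR scores r <- (1-c) * A_tilde^T r + c * q."""
    K = len(U)
    A = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                A[i, j] = np.exp(-gamma * np.linalg.norm(U[i] - U[j]) ** 2)
    A_tilde = A / A.sum(axis=1, keepdims=True)       # row-normalize
    q = np.zeros(K); q[target] = 1.0                 # one-hot query vector
    r = q.copy()
    for _ in range(iters):
        r = (1 - c) * (A_tilde.T @ r) + c * q
    return r

# toy example: 5 stocks, each with a (same-length) temporal factor matrix U_k
U = [np.random.rand(60, 10) for _ in range(5)]
scores = rwr_scores(U, target=0)
print(np.argsort(-scores)[:3])    # ranking of stocks (the target itself ranks first)
```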
We review related works on tensor decomposition methods for regular and irregular tensors.
Tensor decomposition on regular dense tensors. There are efficient tensor decomposition methods for regular dense tensors. Pioneering works [38]-[43] efficiently decompose a regular tensor by exploiting techniques that reduce time and space costs. Also, many works [44]-[47] have proposed scalable tensor decomposition methods with parallelization to handle large-scale tensors. However, the aforementioned methods fail to deal with the irregularity of dense tensors since they are designed for regular tensors.
PARAFAC2 decomposition on irregular tensors. Cheng and Haardt [18] proposed RD-ALS, which preprocesses a given tensor and performs PARAFAC2 decomposition using the preprocessed result. However, RD-ALS requires high computational costs to preprocess a given tensor. Also, RD-ALS is less efficient in updating factor matrices since it computes reconstruction errors for the convergence criterion at each iteration. Recent works [11], [12], [15] attempted to analyze irregular sparse tensors. SPARTan [11] is a scalable PARAFAC2-ALS method for large electronic health records (EHR) data. COPA [12] improves the performance of PARAFAC2 decomposition by applying various constraints (e.g., smoothness). REPAIR [15] strengthens the robustness of PARAFAC2 decomposition by applying low-rank regularization. We do not compare DPAR2 with COPA and REPAIR since they concentrate on imposing practical constraints to handle irregular sparse tensors, especially EHR data. However, we do compare DPAR2 with SPARTan, on which the efficiency of COPA and REPAIR is based. TASTE [16] is a joint PARAFAC2 decomposition method for large temporal and static tensors. Although the above methods are efficient at PARAFAC2 decomposition for irregular tensors, they concentrate only on irregular sparse tensors, especially EHR data. LogPar [17], a logistic PARAFAC2 decomposition method, analyzes temporal binary data represented as an irregular binary tensor. SPADE [48] efficiently deals with irregular tensors in a streaming setting. TedPar [49] improves the performance of PARAFAC2 decomposition by explicitly modeling temporal dependency. Although the above methods effectively deal with irregular sparse tensors, especially EHR data, none of them focus on devising an efficient PARAFAC2 decomposition method for irregular dense tensors. On the other hand, DPAR2 is a fast and scalable PARAFAC2 decomposition method for irregular dense tensors.

In this paper, we propose DPAR2, a fast and scalable PARAFAC2 decomposition method for irregular dense tensors. By compressing an irregular input tensor, carefully reordering the operations with the compressed results in each iteration, and carefully partitioning the input slices, DPAR2 successfully achieves high efficiency in performing PARAFAC2 decomposition for irregular dense tensors. Experimental results show that DPAR2 is up to 6.0× faster than existing PARAFAC2 decomposition methods while achieving comparable accuracy, and it is scalable with respect to the tensor size and target rank. With DPAR2, we discover interesting patterns in real-world irregular tensors. Future work includes devising an efficient PARAFAC2 decomposition method in a streaming setting.
[1] MetaFac: Community discovery via relational hypergraph factorization.
[2] Link prediction on evolving data using tensor factorization.
[3] Fast and memory-efficient Tucker decomposition for answering diverse time range queries.
[4] High-performance Tucker factorization on heterogeneous platforms.
[5] SliceNStitch: Continuous CP decomposition of sparse tensor streams.
[6] Accurate online tensor factorization for temporal tensor streams with missing values.
[7] Time-aware tensor decomposition for sparse tensors.
[8] GTensor: Fast and accurate tensor analysis system using GPUs.
[9] S3CMTF: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization.
[10] Fast and scalable method for distributed Boolean tensor factorization.
[11] SPARTan: Scalable PARAFAC2 for large & sparse data.
[12] COPA: Constrained PARAFAC2 for sparse & large datasets.
[13] Estimating latent trends in multivariate longitudinal data via PARAFAC2 with functional and structural constraints.
[14] Application of PARAFAC2 to fault detection and diagnosis in semiconductor etch.
[15] Robust irregular tensor factorization and completion for temporal health data analysis.
[16] TASTE: Temporal and static tensor factorization for phenotyping electronic health records.
[17] LogPar: Logistic PARAFAC2 factorization for temporal binary data with missing values.
[18] Efficient computation of the PARAFAC2 decomposition.
[19] Tensor decompositions and applications.
[20] Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.
[21] A fast randomized algorithm for the approximation of matrices.
[22] Low-rank approximation and regression in input sparsity time.
[23] PARAFAC2: Mathematical and technical notes.
[24] PARAFAC2—Part I: A direct fitting algorithm for the PARAFAC2 model.
[25] Matrix computations.
[26] FMA: A dataset for music analysis.
[27] A dataset and taxonomy for urban sound research.
[28] Mining actionlet ensemble for action recognition with depth cameras.
[29] Multivariate LSTM-FCNs for time series classification.
[30] Traffic forecasting in complex urban networks: Leveraging big data and machine learning.
[31] MATLAB Tensor Toolbox version 3.0-dev.
[32] Signed random walk diffusion for effective representation learning in signed graphs.
[33] Accurate relational reasoning in edge-labeled graphs by multi-labeled random walk with restart.
[34] Random walk-based ranking in signed social networks: Model and algorithms.
[35] Supervised and extended restart in random walks for ranking and link prediction in networks.
[36] The PageRank citation ranking: Bringing order to the web.
[37] BePI: Fast and memory-efficient method for billion-scale random walk with restart.
[38] D-Tucker: Fast and memory-efficient Tucker decomposition for dense tensors.
[39] Low-rank Tucker decomposition of large tensors using TensorSketch.
[40] Fast and guaranteed tensor decomposition via sketching.
[41] MACH: Fast randomized tensor decompositions.
[42] A practical randomized CP tensor decomposition.
[43] Adaptive sketching for fast and convergent canonical polyadic decomposition.
[44] PARAFAC algorithms for large-scale problems.
[45] 2PCP: Two-phase CP decomposition for billion-scale dense tensors.
[46] Parallel tensor compression for large-scale scientific data.
[47] H-PARAFAC: Hierarchical parallel factor analysis of multidimensional big data.
[48] SPADE: Streaming PARAFAC2 decomposition for large datasets.
[49] TedPar: Temporally dependent PARAFAC2 factorization for phenotype-based disease progression modeling.