key: cord-0556671-kyq6ozm9 authors: Lyu, Hanbaek; Strohmeier, Christopher; Menz, Georg; Needell, Deanna title: COVID-19 Time-series Prediction by Joint Dictionary Learning and Online NMF date: 2020-04-20 journal: nan DOI: nan sha: 8df55596099e56ce2990629dac96129e4080a0c5 doc_id: 556671 cord_uid: kyq6ozm9 Predicting the spread and containment of COVID-19 is a challenge of utmost importance that the broader scientific community is currently facing. One of the main sources of difficulty is that a very limited amount of daily COVID-19 case data is available, and with few exceptions, the majority of countries are currently in the "exponential spread stage," and thus there is scarce information available which would enable one to predict the phase transition between spread and containment. In this paper, we propose a novel approach to predicting the spread of COVID-19 based on dictionary learning and online nonnegative matrix factorization (online NMF). The key idea is to learn dictionary patterns of short evolution instances of the new daily cases in multiple countries at the same time, so that their latent correlation structures are captured in the dictionary patterns. We first learn such patterns by minibatch learning from the entire time-series and then further adapt them to the time-series by online NMF. As we progressively adapt and improve the learned dictionary patterns to the more recent observations, we also use them to make one-step predictions by partial fitting. Lastly, by recursively applying the one-step predictions, we can extrapolate our predictions into the near future. Our prediction results can be directly attributed to the learned dictionary patterns due to their interpretability. The rapid spread of coronavirus disease (COVID-19) has had devastating effects globally. 
The virus first started to grow significantly in China and then in South Korea around January of 2020, then had a major outbreak in European countries within the next month, and as of April the US alone has over 400,000 cases with over 12,000 deaths. Predicting the rapid spread of COVID-19 is a challenge of utmost importance that the broader scientific community is currently facing. A conventional approach to this problem is to use compartmental models (see, e.g., [15], [8]), which are mathematical models used to simulate the spread of infectious diseases, governed by stochastic differential equations describing interactions between different compartments of the population (e.g. susceptible, infectious, and recovered). Namely, one may postulate a compartmental model tailored to COVID-19 and find optimal parameters for the model by fitting them to the available data. An alternative approach is to use data-driven machine learning techniques, especially deep learning algorithms [11], [2], [10], which have had great success in various problems including image classification, computer vision, and voice recognition [16], [6], [13], [1]. (CS and DN were supported in part by NSF CAREER #1348721 and NSF BIGDATA #1740325. The authors are with the Department of Mathematics, University of California, Los Angeles, CA 90095 USA (e-mail: hlyu@math.ucla.edu). Code is available at https://github.com/HanbaekLyu/ONMF-COVID19.) In this paper, we propose an entirely different approach to predicting the spread of COVID-19 based on dictionary learning (or topic modeling), which is a machine learning technique that is typically applied to text or image data in order to extract important features of a complex dataset so that one can represent said dataset in terms of a reduced number of extracted features, or dictionary atoms [24], [5]. 
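To make the compartmental approach mentioned above concrete, the SIR (susceptible-infectious-recovered) dynamics can be sketched in a few lines. This is only an illustration for contrast with the dictionary-learning approach of this paper; the parameter values (beta, gamma) and the population size are illustrative placeholders, not fitted to any COVID-19 data.

```python
def sir_step(s, i, r, beta=0.3, gamma=0.1):
    # One unit-time Euler step of the SIR dynamics; the population s + i + r
    # is conserved by construction.
    n = s + i + r
    new_infections = beta * s * i / n
    new_recoveries = gamma * i
    return (s - new_infections,
            i + new_infections - new_recoveries,
            r + new_recoveries)

def simulate_sir(s0, i0, r0, days):
    trajectory = [(s0, i0, r0)]
    for _ in range(days):
        trajectory.append(sir_step(*trajectory[-1]))
    return trajectory

# Illustrative run: 10 infectious individuals in a population of 10,000.
traj = simulate_sir(s0=9990.0, i0=10.0, r0=0.0, days=120)
```

Fitting such a model to data means choosing beta and gamma (and possibly further compartments) to match observed case counts, which is exactly the model-specific step the present paper avoids.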
Although dictionary learning has seen a wide array of applications in data science, to the best of our knowledge this work is the first to apply such an approach to time-series data and time-series prediction. Our proposed method has four components:
1. (Minibatch learning) Use online nonnegative matrix factorization (online NMF) to learn "elemental" dictionary atoms which may be combined to approximate short time evolution patterns of correlated time-series data.
2. (Online learning) Further adapt the minibatch-learned dictionary atoms by traversing the time-series data using online NMF.
3. (Partial fitting and one-step prediction) Progressively improve our learned dictionary atoms by online learning while concurrently making one-step predictions by partial fitting.
4. (Recursive extrapolation) By recursively applying the one-step predictions above, extrapolate into the future to predict future values of the time-series.
Our method enables us to learn dictionary atoms from a diverse collection of correlated time-series data (e.g. new daily cases of COVID-19, numbers of fatal and recovered cases, and degree of observance of social distancing measures). The learned dictionary atoms describe "elemental" short-time evolution patterns of the correlated data which may be superimposed to recover and even predict the original time-series data. Online nonnegative matrix factorization is at the core of our learning algorithm, which continuously adapts and improves the learned dictionary atoms on newly arrived time-series data sets. There are a number of advantages of our proposed approach that may complement some of the shortcomings of the more traditional model-based approach or the large-data-based machine learning approach. First, our method is completely model-free and applies universally across data types, as the dictionary atoms learned directly from the data serve as the 'model' for prediction. 
Hence a similar method could be applied to predict not only the spread of the virus but also other related parameters. These include the spread of COVID-19 media information, medical and food supply shortages and demands, patient subgroup infections, immunity, and many more. Second, our method does not lose interpretability as some of the deep-learning-based approaches do, which is particularly important in making predictions for health-related areas. Third, our method is computationally efficient and can be executed on a standard personal computer in a short time. This enables our method to be applied in real time in an online setting to generate continuously improving predictions. Lastly, our method has a strong theoretical foundation based on the recent work [19]. In this article, we demonstrate our general online NMF-based time-series prediction method on COVID-19 time-series data by learning a small number of fundamental time evolution patterns in the joint time-series among six countries and three case types (confirmed/death/recovered) concurrently. Our analysis shows that we can indeed extract interpretable dictionary atoms for the short-time evolution of such correlated time-series and use them to obtain accurate short-time predictions with small variation. This approach could further be extended by augmenting various other types of correlated time-series data sets that may contain nontrivial information on the spread of COVID-19 (e.g. time-series quantifying commodity, movement, and media data). This paper is organized as follows. In Section II, we give a brief overview of dictionary learning by nonnegative matrix factorization and provide the full statement of our learning and prediction algorithms. In Subsection III-A, we give a description of the time-series data set of new COVID-19 cases and discuss a number of pre-processing methods for regularizing high fluctuations in the data set. 
Then we discuss our data analysis scheme and simulation setup in Subsection III-B. In the following Subsections III-C-III-E, we present our main simulation results. Finally, we conclude and suggest further directions in Section IV. Matrix factorization provides a powerful mathematical setting for dictionary learning problems. We first organize n observations of d-dimensional samples into a data matrix X ∈ R^{d×n}, and then seek a factorization of X into the product WH for some W ∈ R^{d×r} and H ∈ R^{r×n}. This means that each column of the data matrix is approximated by a linear combination of the columns of the dictionary matrix W with coefficients given by the corresponding column of the code matrix H (see Figure 1). This problem has been extensively studied under many names, each with different constraints: dictionary learning, factor analysis, topic modeling, component analysis. It has also found applications in text analysis, image reconstruction, medical imaging, bioinformatics, and many other scientific fields [23], [3], [4], [9], [25], [7], [21]. Nonnegative matrix factorization (NMF) is an instance of matrix factorization where one seeks two nonnegative matrices whose product approximates a given nonnegative data matrix. Below we give an extension of NMF with an extra sparsity constraint on the code matrix which is particularly suited for dictionary learning problems [14]. Given a data matrix X ∈ R^{d×n}_{≥0}, the goal is to find a nonnegative dictionary W ∈ R^{d×r}_{≥0} and a nonnegative code matrix H ∈ R^{r×n}_{≥0} by solving the following optimization problem:

min_{W ≥ 0, H ≥ 0} ‖X − WH‖_F^2 + λ‖H‖_1,

where ‖A‖_F^2 = Σ_{i,j} A_{ij}^2 denotes the squared matrix Frobenius norm and λ ≥ 0 is the L1-regularization parameter for the code matrix H. A consequence of the nonnegativity constraints is that one must represent the data using the dictionary W without exploiting cancellation. This is a critical mechanism that gives a parts-based representation of the data (see [17]). 
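As a concrete illustration of the objective above, the following is a minimal sketch of sparse NMF via the standard multiplicative updates, where the L1 penalty on H enters as a constant added to the denominator of the H update. This is not the authors' implementation (see their GitHub repository); the function name, matrix sizes, and λ value are ours.

```python
import numpy as np

def sparse_nmf(X, r, lam=0.1, n_iter=200, seed=0):
    # Multiplicative updates for  min_{W,H >= 0} ||X - WH||_F^2 + lam * ||H||_1.
    # Nonnegativity is preserved automatically since every factor is nonnegative.
    rng = np.random.default_rng(seed)
    d, n = X.shape
    W = rng.random((d, r)) + 0.1
    H = rng.random((r, n)) + 0.1
    eps = 1e-10  # guard against division by zero
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + lam + eps)  # "+ lam" encodes the L1 penalty
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative data matrix shaped like the unfolded COVID-19 data (18 x 80).
X = np.random.default_rng(1).random((18, 80))
W, H = sparse_nmf(X, r=4, lam=0.05)
```

Larger values of lam drive more entries of H to (near) zero, so each data column is explained by fewer dictionary atoms.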
Many efficient iterative algorithms for NMF are based on block optimization schemes that have been proposed and studied following the introduction of the first, and most well-known, multiplicative update method by Lee and Seung [18] (see [12] for a survey). In this section, we provide algorithms for online dictionary learning and prediction for ensembles of correlated time-series. At the core of our online dictionary learning algorithm is the well-known online nonnegative matrix factorization (ONMF) algorithm [20], [19], which is an online extension of NMF that learns a sequence of dictionary matrices from a sequence of data matrices. We first illustrate the key idea in the simplest setting of time-series data for a single entity. Suppose we observe a single numerical value x_s ∈ R at each discrete time s. By adding a suitable constant to all observed values, we may assume without loss of generality that x_s ≥ 0 for all s ≥ 0. Fix integer parameters k, N, r ≥ 0. Suppose we only store N past data points at any given time, due to memory constraints. So at time t, we hold the vector D_t = [x_{t−N+1}, x_{t−N+2}, ..., x_t] in our memory. The goal is to learn a dictionary of k-step evolution patterns from the observed history (x_s)_{0≤s≤t} up to time t. A possible approach is to form a k by N−k+1 Hankel matrix X_t (see, e.g., [22]), whose ith column consists of the k consecutive values of D_t starting from its ith coordinate. We can then factorize this into a k by r dictionary matrix W and an r by N−k+1 code matrix H using an NMF algorithm: X_t ≈ WH. This approximate factorization tells us that we can approximately represent any k-step evolution pattern from our past data (x_s)_{t−N<s≤t} as a nonnegative linear combination of the r dictionary atoms, the columns of W. [Algorithm 1 (online dictionary learning with partial fitting and one-step prediction): for t = 1, ..., T, perform the sparse coding step (3), the dictionary update steps (5)-(7), and, concurrently, the prediction step (8) of Algorithm 2; the algorithm runs in T steps.] 
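The Hankel-matrix construction described above is easy to state in code; a pure-Python sketch (the function name is ours):

```python
def hankel_columns(series, k):
    # Build the k x (N-k+1) Hankel matrix of a length-N series: column i
    # holds the k consecutive values starting at position i.  The matrix is
    # returned as a list of k rows.
    N = len(series)
    return [[series[i + j] for i in range(N - k + 1)] for j in range(k)]

Xt = hankel_columns([1, 2, 3, 4, 5, 6], k=3)
# Columns read [1,2,3], [2,3,4], [3,4,5], [4,5,6].
```

Feeding Xt to an NMF routine then yields the k by r dictionary W of k-step evolution patterns.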
Also, we may use a different time-series (y_t)_{1≤t≤T} to find a dictionary tensor W_M and use it as the initial dictionary for the given time-series (x_t)_{1≤t≤T} in Algorithm 1. This "transfer learning" would be effective when the two time-series share a similar structure. The sparse coding problems in (3) and (8) can be solved by LASSO, whereas the constrained quadratic problem (7) can be solved by projected gradient descent algorithms. See [19] for more details and background. For practical use, we provide a Python implementation of our algorithms in our GitHub repository (see the footnote on the front page). We also remark that, using the recent contribution of Lyu, Needell, and Balzano [19] on online matrix factorization algorithms for dependent data streams, we can give a theoretical guarantee of convergence of the sequence of dictionary atoms learned by Algorithm 1 under suitable assumptions. The essential requirement is that the time-series data satisfies a weak "stochastic periodicity" condition. A complete statement of this result and its proof will be provided in our follow-up paper. While this is a substantial extension of the usual independence assumption on the streaming data set, unfortunately, most COVID-19 time-series do not satisfy any type of weak periodicity condition directly, which is one of the biggest challenges in predicting the spread of COVID-19. In the next section, we describe our experimental setting and how we may overcome this issue of "lack of periodicity" by combining the minibatch and online learning algorithms. It is important to note that the minibatch learning algorithm (Algorithm 3) converges for fixed T as M → ∞, as it is a version of online NMF on an i.i.d. sequence of data, which satisfies our stochastic periodicity condition. A. 
Data set and pre-processing. To illustrate our dictionary learning and prediction algorithms for time-series data, we analyze the historical daily reports of confirmed/death/recovered COVID-19 cases in six countries - South Korea, China, US, Italy, Spain, and Germany - from Jan. 19, 2020 to Apr. 12, 2020. The input data can be represented as a tensor of shape 6 × 80 × 3 corresponding to countries, days, and types of cases, respectively. In order to apply our dictionary learning algorithms, we first unfold this data tensor into a matrix X of shape (d × T) = (6 × 3) × 80, whose tth column, which we denote by x_t, gives the full 18-dimensional observation on day t, 0 ≤ t ≤ 80. Also, we find that the fluctuation in the original data set is too large to yield stable representations, let alone predictions. In order to remedy this, we pre-process the time-series data by taking a 5-day moving average and then applying the entry-wise log-transform x → log(x + 1). After applying the dictionary learning and prediction algorithms, we take the inverse transform x → exp(x) − 1 and plot the result. As the COVID-19 time-series data set we analyze here consists of only T = 80 highly non-repetitive observations, applying the online dictionary learning algorithm (Algorithm 1) with random initialization is not sufficient for proper learning and accurate prediction. We overcome this by using the minibatch learning algorithm (Algorithm 3) as the initialization for the online learning and prediction. Namely, we use the following scheme:
1. (Minibatch learning) Apply the minibatch learning algorithm (Algorithm 3) to the time-series (x_t)_{0≤t≤T} to obtain a dictionary tensor W_M ∈ R^{d×k×r}_{≥0} and aggregate matrices A_M ∈ R^{r×r}_{≥0}, B_M ∈ R^{r×dk}_{≥0}.
2. (Online learning and one-step prediction) Use the output of step 1 as the initialization for Algorithm 1. For each t = 1, ..., T, iterate the steps of Algorithm 1 as well as Algorithm 2. This outputs a dictionary tensor W_T and a prediction (x̂_t)_{k≤t≤T+1}. 
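The pre-processing pipeline just described (a 5-day moving average followed by the entry-wise log-transform, inverted after prediction) can be sketched as follows. The trailing-window alignment of the moving average is our assumption, since the paper does not specify the alignment; all function names are ours.

```python
import math

def moving_average(series, w=5):
    # Trailing w-day window; the window is shorter at the start of the series.
    return [sum(series[max(0, i - w + 1): i + 1]) / min(i + 1, w)
            for i in range(len(series))]

def log_transform(series):
    # Entry-wise x -> log(x + 1), applied before learning.
    return [math.log(x + 1.0) for x in series]

def inverse_transform(series):
    # Entry-wise x -> exp(x) - 1, applied to predictions before plotting.
    return [math.exp(x) - 1.0 for x in series]

raw = [0.0, 2.0, 5.0, 9.0, 14.0, 20.0]   # toy daily new-case counts
smoothed = moving_average(raw)
transformed = log_transform(smoothed)
recovered = inverse_transform(transformed)  # round-trips back to `smoothed`
```

The log-transform compresses the exponential-growth regime into roughly linear segments, which is what makes short evolution patterns comparable across countries at very different case counts.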
[Figures 3-5: Prediction of COVID-19 daily new confirmed, death, and recovered cases, each shown alongside a joint dictionary of 6-day evolution.] The role of the parameter β becomes clear when examining equations (5) and (6). It weights how much of the past data is used when updating the new dictionary matrix W_t. For example, in (7), in the extreme case β = 0 the present observation at time t is not used, whereas in the other extreme case β = ∞ only the present observation is used. An example of dictionary atoms obtained by the minibatch learning algorithm (Algorithm 3) from the COVID-19 daily new cases time-series data set is presented in Figure 2. We note that the time evolution structure in the data set is not used in this minibatch learning process, as slices of length-k evolution are sampled independently and uniformly at random from the entire history. Each dictionary atom is a 6 × 6 × 3 = 108-dimensional vector corresponding to time × country × case type. To each atom, we associate an "importance metric," first introduced in [19], as a measure of the total contribution of the atom in representing the original data set. Namely, the importance metric of each atom is the total sum of its linear coefficients in the sparse coding problem (3) during the entire learning process. This is computed as the row sums of the sum of all code matrices H_t in (3) obtained from the learning process. For example, the (1, 1) atom (in matrix coordinates) with importance 0.23 in Figure 2 indicates that the number of daily new confirmed cases in all six countries is almost constant and that China has significantly higher values than the other countries. 
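The importance metric just defined (row sums of the sum of all code matrices H_t) can be computed as in the sketch below. Normalizing the totals to sum to one is our assumption, made so that the values are comparable to the 0.23 and 0.07 reported for Figure 2; the function name is ours.

```python
def importance_metric(code_matrices):
    # Row sums of the sum of all code matrices H_t accumulated during
    # learning, normalized to sum to 1 (normalization is our assumption).
    r = len(code_matrices[0])
    totals = [0.0] * r
    for H in code_matrices:
        for i, row in enumerate(H):
            totals[i] += sum(row)
    s = sum(totals)
    return [t / s for t in totals] if s > 0 else totals

# Two toy 2 x 2 code matrices for a dictionary with r = 2 atoms.
H1 = [[1.0, 0.0], [0.0, 2.0]]
H2 = [[2.0, 0.0], [0.0, 0.0]]
imp = importance_metric([H1, H2])
# Atom 0 contributes 3 of the total 5, atom 1 contributes 2 of 5.
```

Atoms with high importance are the ones most frequently activated in the sparse coding step, i.e., the dominant short-time evolution patterns in the data.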
Also, the (1, 4) atom (in matrix coordinates) with importance 0.07 in Figure 2 indicates that the number of daily new confirmed cases is growing rapidly in Korea and Italy, while for the other four countries the values are almost constant. It is important to note that these dictionary atoms are learned by a nonnegative matrix factorization algorithm, so they retain the individual interpretations described above when representing the entire data set. Indeed, such patterns were dominant in the COVID-19 time-series data during the period of Jan. 21 - Mar. 1, 2020 (see, e.g., the blue plot in Figure 3). Similarly, a direct interpretation of the other dictionary atoms can be associated with features in the original data set. The learned dictionary atoms in Figure 2 are not only directly interpretable but also provide a compressed representation of the original data set. Namely, we can approximate any 6-day evolution pattern in the data set using only the 24 atoms in Figure 2 with suitable nonnegative linear coefficients found from the sparse coding problem (3). Obtaining a global approximation of the entire data set based on such local approximations by dictionary atoms is called reconstruction (see, for instance, the image reconstruction example in [19]). Our partial fitting and prediction algorithm (Algorithm 2) extends this reconstruction procedure by using the learned patterns to make one-step predictions adapted to the time-series data. The main point of our application is to illustrate that we may obtain reasonable predictions on our very limited T = 80 COVID-19 time-series data set by learning a small number of fundamental patterns in the joint time evolution among the six countries and three case types (confirmed/death/recovered) concurrently. 
This approach could further be extended by augmenting various other types of correlated time-series which contain nontrivial information on the spread of COVID-19 (e.g. time-series quantifying commodity, movement, and media data trends). One can think of learning the highly correlated temporal evolution patterns across different countries and case types as a model construction process for the joint time evolution of the data set. There are relatively few hyperparameters to tune in our algorithm compared to deep neural network-based models, and the parameters in our method are in some sense built into the temporal dictionary atoms, so that one does not need to begin by choosing a model to use. However, recall that the dictionary atoms learned by the minibatch algorithm are not adapted to the temporal structure of the data set. We find that further adapting them in the direction of time evolution by using our online learning algorithm (Algorithm 1), according to the scheme in Subsection III-B, significantly improves the prediction accuracy by reducing the standard deviation of the prediction curves over a number of trials. An example of the result of this further adaptation of the minibatch-learned dictionary atoms is shown in Figures 3, 4, and 5. In each figure, the online-improved dictionary atoms are shown on the right, and the original time-series data and its prediction computed by Algorithms 1-2 are shown in blue and red, respectively. Prediction curves also show error bars of one standard deviation over 1000 trials of the entire scheme (minibatch + online + extrapolation) under the same hyperparameters as in Subsection III-B. By comparing the corresponding dictionary atoms in Figures 2 and 3, for example, we find that the online learning process does not change the importance metric of the top 24 atoms, and only a few atoms change significantly in shape (especially the curves for China in the atoms at (1, 1), (1, 2), (1, 3), (2, 1), and (4, 2)). 
Such new patterns could not be learned by minibatch learning, but they were picked up by traversing the time-series data from the past to the present with our online learning algorithm. The "correctness" of the learned dictionary atoms can be verified by the accuracy of the one-step prediction up to time T = 80 (the end of the blue curve in Figure 3) in Figures 3, 4, and 5. We remark that our one-step predictions up to time T = 80 are not exact predictions, as our initial dictionary tensor for this step, learned by the minibatch algorithm, uses information from the entire time-series data. More precisely, one can think of this as a reconstruction procedure without seeing the last coordinate. This would have been a proper prediction if our initialization were independent of the data, but we find that online learning alone, without the minibatch initialization, gives inferior reconstruction results. We believe this is due to the fact that the COVID-19 time-series data set is too short (and far from periodic) for the online dictionary learning algorithm to converge in a single run. Nevertheless, in some sense the minibatch method is implemented to compensate for the lack of data; given a sufficient amount of data, we could omit this step. Next, we discuss the recursive extrapolation step, which gives the 30-day prediction of the new daily cases of all three types and all six countries simultaneously, shown in Figures 3, 4, and 5. For instance, by partially fitting the first five coordinates of the length-6 dictionary atoms to the last five days of the data, we can use Algorithm 2 to obtain a prediction of the future values at time T + 1. We can then recursively continue this extrapolation step using the predicted data into the future ad infinitum. Our 30-day prediction results show reasonable variation among trials. 
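The recursive extrapolation just described (partially fit the first k−1 rows of the length-k atoms to the last k−1 observations, read off the atoms' last row as the prediction for time T+1, append it, and repeat) can be sketched as follows. Plain least squares with clipping stands in here for the nonnegativity-constrained LASSO of Algorithm 2, and all names are ours.

```python
import numpy as np

def one_step_predict(W, history, k):
    # Partial fitting: regress the last k-1 observations on the first k-1
    # rows of the atoms (the columns of W), then read off the atoms' last
    # row weighted by the fitted codes as the one-step prediction.
    W = np.asarray(W, dtype=float)                  # shape (k, r)
    recent = np.asarray(history[-(k - 1):], dtype=float)
    coeffs, *_ = np.linalg.lstsq(W[:-1, :], recent, rcond=None)
    coeffs = np.clip(coeffs, 0.0, None)             # keep the codes nonnegative
    return float(W[-1, :] @ coeffs)

def extrapolate(W, history, k, horizon):
    # Recursively feed each one-step prediction back in as observed data.
    hist = list(history)
    for _ in range(horizon):
        hist.append(one_step_predict(W, hist, k))
    return hist[len(history):]

# Sanity check with a single constant atom of length k = 3:
# a flat series should extrapolate to a flat series.
W = np.array([[1.0], [1.0], [1.0]])
preds = extrapolate(W, [2.0, 2.0, 2.0, 2.0], k=3, horizon=5)
```

Because each predicted value is fed back as input, errors compound with the horizon, which is exactly the growth of variance with prediction length discussed below.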
However, the variation in prediction grows with the prediction length, so only a moderate range of predictions would have meaningful implications. We remark that we did not enforce any additional assumptions on the prediction curves (e.g. finite carrying capacity, logistic-like growth for the total, SIR-type structure) that are standard in many epidemiological models. Instead, we chose our hyperparameters in Subsection III-B so that the future prediction curves approximately satisfy such assumptions, highlighting the model-free nature of our approach. This high-level fitting is not without compromise in the uncertainty of the prediction. For example, choosing large values of the L1-regularization parameter λ used in the recursive extrapolation step reduces the variability of the prediction significantly, but the prediction curve drops to zero very rapidly right after the end of the current data, which is absurd. Lastly, we also mention that our recursive extrapolation defines a deterministic dynamical system in multidimensional space (dimension 18 in this case). The evolution is determined by the hyperparameters and the set of dictionary atoms at the end of the given time-series data (T = 80 in this case). Even though our dictionary learning algorithms are randomized, we find that our 30-day predictions over many trials are within a modest standard error. However, we find that the mean trajectory of the prediction could vary significantly with respect to changes in the hyperparameters. Developing a more systematic method of choosing these hyperparameters is a future direction of inquiry. With the rapidly changing situation involving COVID-19, it is critical to have accurate and effective methods for predicting the short-term and long-term behavior of many parameters relating to the virus. In this paper, we proposed a novel approach that uses dictionary learning to predict time-series data. 
We then applied this approach to analyze and predict the new daily reports of COVID-19 cases in multiple countries. Usually, dictionary learning is used for text and image data, often with impressive results. To the best of our knowledge, our work is the first to apply dictionary learning to time-series data. There are a number of advantages of our approach that may complement some of the shortcomings of the more traditional model-based approach or the large-data-based machine learning approach. First, our method is completely model-free, as the dictionary atoms learned directly from the data serve as the 'model' for prediction. Our approach also works with universal data types. For example, it could be applied to predict not only the spread of the virus but also other related parameters. These include the spread of COVID-19 media information, medical and food supply shortages and demands, subgroup infections, immunity, and many more. Second, the method works with small data sets. Most machine-learning methods need either model-specific input or large data sets. Because COVID-19 only appeared recently, there are no large data sets yet available. Third, the method does not lose interpretability as some of the deep-learning-based approaches do. This is particularly important when making predictions for health-related areas. The learned dictionary atoms are not only interpretable but also identify hidden correlations between multiple entities. Fourth, our approach uses only a few hyperparameters that are model-free and independent of the data set. Therefore our approach avoids the issue of over-fitting. Furthermore, the method is computationally efficient and can be executed on a standard personal computer in a short time. Hence, it could be applied in real-time or online settings to generate continuously improving predictions. 
Lastly, the method has a strong theoretical foundation: convergence of the minibatch algorithm is always guaranteed, and convergence of the online learning algorithm is guaranteed under the assumption of quasi-periodicity. Therefore we expect our method to be robust, i.e., small changes to the data set or to the parameters should not destroy the learning outcome. There are a number of future directions that we are envisioning. First, one could apply our method to county-level time-series data. One would obtain dictionary atoms describing county-wise correlation and prediction. This analysis could be critically used in distributing medical supplies and also in measuring the effect of re-opening the economy successively, a few counties at a time. Second, as not every individual can be tested, it is valuable to be able to transfer knowledge about tested subjects to those yet to be tested. This 'transfer learning' could naturally be done with our method, by learning a dictionary from one subject group and applying it to make predictions for the unknown group. Lastly, one may extend the method to a fully tensor-based setting, where a large number of related variables of different types could be encoded in a single tensor. Then one could use various direct tensor-factorization methods to learn higher-order dictionary atoms. For example, this might be useful in identifying a critical subgroup of variables for clinical data and therapeutics. 
[1] End-to-end speech recognition in English and Mandarin.
[2] Deep learning of representations: Looking forward.
[3] Email surveillance using non-negative matrix factorization.
[4] Algorithms and applications for approximate nonnegative matrix factorization.
[5] Probabilistic topic models: A focus on graphical model design and applications to document and image analysis.
[6] A theoretical analysis of feature pooling in visual recognition.
[7] Clustering-initiated factor analysis application for tissue classification in dynamic brain positron emission tomography.
[8] Compartmental models in epidemiology.
[9] Phoenix: A weight-based network coordinate system using matrix factorization.
[10] A tutorial survey of architectures, algorithms, and applications for deep learning.
[11] Recent advances in deep learning for speech research at Microsoft.
[12] The why and how of nonnegative matrix factorization. In: Regularization, Optimization, Kernels, and Support Vector Machines.
[13] Deep Speech: Scaling up end-to-end speech recognition.
[14] Non-negative matrix factorization with sparseness constraints.
[15] Networks and epidemic models.
[16] ImageNet classification with deep convolutional neural networks.
[17] Learning the parts of objects by non-negative matrix factorization.
[18] Algorithms for non-negative matrix factorization.
[19] Online matrix factorization for Markovian data and applications to network dictionary learning.
[20] Online learning for matrix factorization and sparse coding.
[21] Non-negative matrix factorization: Robust extraction of extended structures.
[22] Algorithms for triangular decomposition of block Hankel and Toeplitz matrices with application to factoring positive matrix polynomials.
[23] Correction for ambiguous solutions in factor analysis using a penalized least squares objective.
[24] Probabilistic topic models. In: Handbook of Latent Semantic Analysis.
[25] A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data.