title: Dynamic transformation of prior knowledge into Bayesian models for data streams
authors: Bach, Tran Xuan; Anh, Nguyen Duc; Linh, Ngo Van; Than, Khoat
date: 2020-03-13

We consider how to effectively use prior knowledge when learning a Bayesian model in streaming environments, where the data come infinitely and sequentially. This problem is highly important in the era of data explosion and rich sources of precious external knowledge such as pre-trained models, ontologies, Wikipedia, etc. We show that some existing approaches can forget any knowledge very fast. We then propose a novel framework that enables the incorporation of prior knowledge of different forms into a base Bayesian model for data streams. Our framework subsumes some existing popular models for time-series/dynamic data. Extensive experiments show that our framework outperforms existing methods by a large margin. In particular, our framework can help Bayesian models generalize well on extremely short text while other methods overfit. The implementation of our framework is available at https://github.com/bachtranxuan/TPS.git.

The Bayesian approach can efficiently model the uncertainty in data and make predictions about the future. A Bayesian model [1], however, might not generalize well in the presence of misspecification, sparsity, or noise. Misspecification [2] is a situation in which a particular model cannot cover all key aspects of reality, whereas sparsity is the case in which each data sample provides little information. Note that misspecification cannot be avoided, while sparse and noisy data are prevalent in practice, such as when modeling ratings or feedback in recommender systems [3], [4], or short text from social networks [5], [6]. Those situations cause various challenges [3], [7]-[9]. Theoretically, we may not be able to correctly recover a Bayesian model from sparse data even with an arbitrarily large number of samples [10], while in practice training on sparse and noisy data easily leads to overfitting [8], [9]. One efficient way to overcome those challenges is to exploit external or prior knowledge [11]-[18].^1

We are interested in streaming environments where the data come sequentially and endlessly. How can prior knowledge^2 be used effectively in Bayesian models for streaming environments? Interestingly, this question has rarely been considered, in spite of its great significance in the era of data explosion and rich sources of valuable prior knowledge such as pre-trained machine learning models, ontologies, Wikipedia, etc. In particular, pre-trained models have been playing an increasingly critical role in various applications [21]-[23], but are mostly used in static conditions. One key reason is that streaming conditions pose various challenges, e.g.: How can prior knowledge be used dynamically to help a Bayesian model generalize well? Can we ensure that the prior knowledge will not be forgotten quickly?

1. Two other ways are to use multimodal data or different data sources. The latter is closely related to Bayesian evidence synthesis [19], [20]. This work focuses on the exploitation of external knowledge, instead of data.
2. This concept should be interpreted in a wide context, and differs from the "prior" of the Bayesian approach, where a prior is often a probability distribution. Prior knowledge here refers to any kind of existing knowledge that can aid a learning process.
The forgetting issue is a natural consequence of Bayes' Theorem when conditioning on large (infinite) data sets. Some recent studies [24]-[27] have provided excellent solutions for learning Bayesian models from data streams. However, none of those methods considers exploiting external/prior knowledge.

Our first contribution is to show that streaming variational Bayes (SVB) [24] can forget any knowledge at a rate of O(T^{-1}) after learning from T more minibatches of data. Such a forgetting rate in SVB is much faster than the rate Ω(T^{-0.67}) in humans [28]. This forgetting problem potentially appears in other related methods. As a result, those approaches cannot solve the main question of interest.

The second contribution of this paper is a novel framework, called Dynamic Transformation of Prior knowledge into Bayesian models for data Streams (TPS), that answers the above question and provides a unified solution to the three challenges mentioned above. TPS is able to exploit knowledge represented by vectors, matrices, or graphs. The exploitation of prior knowledge in TPS is dynamic in nature, owing to the use of a discrete-time martingale of transformation matrices. Hence TPS helps a Bayesian model fit data streams better and generalize to unseen observations. Finally, TPS enables us to develop a streaming learning algorithm for a base model with few changes to an existing batch learning algorithm. This property is beneficial in practice, since Bayesian models for static conditions are prevalent. We further show that TPS subsumes some existing dynamic models [29], [30] as special cases when trained on a fixed data set.

Our third contribution is an extensive evaluation of different frameworks, using two base models (latent Dirichlet allocation (LDA) [31] for unsupervised learning, and Naive Bayes for classification) and three kinds of prior knowledge. The experiments show that TPS often outperforms the state-of-the-art methods in terms of generalization and model interpretability [32]. In particular, TPS can help LDA and Naive Bayes generalize well on short text while some approaches encounter overfitting.

ROADMAP: We first summarize closely related work. Then we present TPS and two case studies. After that we discuss some theoretical properties of TPS and the proof of catastrophic forgetting in SVB. Extensive evaluation appears in the last section and the Supplement.

There are two main directions for dealing with data streams. The first is to design a completely new model for the endlessly sequential data [29], [33]-[35]. The other is to design online/streaming algorithms for learning Bayesian models, i.e., to adapt a model from static conditions to streaming ones. Efficient methods in this direction include streaming variational Bayes (SVB) [24], population variational Bayes (PVB) [25], online learning [36], [37], sequential Monte Carlo [38], and surprise minimization [27]. Interestingly, rigorous studies on exploiting external/prior knowledge in streaming conditions are rare.

A wide range of studies have shown that an appropriate use of prior knowledge can significantly improve model interpretability and generalization. Useful prior knowledge might come in different forms, such as similarity graphs [39], [40], WordNet [11], pre-trained models [12], [13], or domain knowledge [14]-[18].
In particular, pre-trained models, considered as precious prior knowledge, have been playing a crucial role in various applications [21]-[23]. However, most existing works focus only on non-streaming conditions.

Existing methods have difficulty effectively exploiting human knowledge in streaming environments. SVB learns a model by uniformly balancing new knowledge with the old knowledge learned from data, and thus only uses the external knowledge in the first step of the learning process. This strategy can forget any knowledge very fast and limits the effect of external knowledge (a rigorous proof can be found in Appendix B). To avoid uniformity, power priors [41] can be exploited to balance the old with the new knowledge at each time step. One issue is that the balancing constant has to be set manually, which is a drawback in streaming conditions. The authors of [26] remove such a drawback by considering the balancing constant as a random variable that follows a hierarchical power prior (HPP). Therefore, SVB-HPP [26] is an elegant combination of SVB and HPP to balance the old with the new knowledge in a Bayesian way. Those observations suggest that SVB-HPP and SVB face the same difficulty when exploiting external knowledge. The authors of [42] suggest maintaining the prior knowledge directly in each learning step; however, the knowledge is encoded into a prior distribution which is static or gradually vanishing. Such a usage is not flexible and cannot utilize the full strength of human knowledge. Furthermore, the prior must be encoded as vectors, which limits the utilization of various forms of human knowledge. In contrast, TPS enables us to use richer types of knowledge, which can be represented by vectors, matrices, graphs, or pre-trained models. Further, the exploitation of knowledge in TPS is dynamic in nature.

A related topic is dynamic models for dynamic/time-series data of fixed size. Examples include [29], [30], [43]-[45]. One common limitation of most of those works is that their learning algorithms can only deal with training datasets of finite size, as many passes over the whole dataset are required in the training phase. In contrast, the learning method for TPS deals successfully with streams where the data may come sequentially and endlessly. The ability of TPS to work with real data streams and to efficiently exploit external knowledge goes beyond many existing dynamic models.

In this section, we present the ideas of our framework. We then explicitly describe applications to LDA and Naive Bayes.

A motivating example: We may want to analyze a tweet stream from Twitter to understand the hidden themes/topics (β). Each tweet contains some observed words (x), while each word has a hidden role (z) that makes the tweet meaningful. The theme of the tweets can change over time; e.g., COVID-19 rarely appeared in 2019 but was frequently tweeted about in 2020. One may not clearly understand COVID-19 when first reading such tweets, which are often short and noisy. In those cases, some reference knowledge (η) may facilitate one's understanding.

Following [25] and [46], we consider a general model B(β, z, x) with two kinds of variables: a global variable β of size K × V that models the latent structure shared among the data points x_{1:N}, and possibly a local variable z_i that models the latent structure governing the i-th data point x_i.^3 Such a model is general and has been successfully applied in static conditions. However, there are several challenges in a streaming environment.
A data stream is an infinite sequence of minibatches D = {D^1, D^2, ..., D^t, ...}, where each minibatch t consists of M observed data points, D^t = {x^t_1, ..., x^t_M}. Assume we have external knowledge η represented by a matrix of size L × V, where L is the embedding size. Note that a matrix can represent different kinds of knowledge in practice, such as pre-trained word embeddings [47], which use a vector to represent the meaning of each word, the relationships among entities, or social graphs encoding the connections among people. For example, the prior knowledge can come from a graph^4 such as WordNet, of size V × V (which means L = V), or from word embeddings of size E × V, where E is the embedding dimensionality (i.e., L = E).

In practice, the prior knowledge representation and the model's variables probably have different shapes: the model parameter β has size K × V while the prior has size L × V. For this problem, we create a mapping f to transform the knowledge η into β at each minibatch t.^5 This f maps the linear transformation π^t η into the space of β, where π^t is a transformation matrix of size K × L. Then, the global variable β^t at time t is computed by β^t = f(π^t η).

3. V is the dimensionality of the variable x, while K represents the number of hidden factors.
4. Clearly, those graphs can be represented by adjacency matrices. [48] further showed that any general graph knowledge can be represented in embedding spaces. The low-rank matrices in the embedding spaces help exploit the knowledge in the graph more effectively.
5. The mapping can be chosen as a (pre-specified) nonlinear function, a neural network, etc. As an example, we will use the standard softmax function as the mapping f in the later subsections.

The data stream may induce a dynamic of β over time (e.g., the theme of tweets can change significantly from 2019 to 2020). We need to model such dynamics, and our reparameterization above translates the dynamics into π. Therefore, we relate π^{t-1} and π^t to capture them. We assume π^t_k ~ N(π^{t-1}_k, σI), where k is the row index of π^t, I is the identity matrix of size L, and σ (≥ 0) is the variance parameter that lets π^t_k fluctuate around π^{t-1}_k. In this way, the sequence of transformation matrices composes a discrete-time martingale; π^t_k can also be interpreted as a Gaussian random walk. Note that π^t plays the role of weighting the knowledge before it is transformed into the global variable of the Bayesian model. The employment of a random walk helps TPS exploit the knowledge η dynamically. Given the global variable β^t at each minibatch t, the generative process of the data points is the same as in the original model B. The graphical representation of TPS appears in Figure 1a.

Learning in TPS: When facing sequential data, many approaches [38] formulate learning as a Bayesian filtering problem, for which one has to estimate the posterior p(π^1, π^2, ..., π^t | D^1, D^2, ..., D^t) or p(π^t | D^1, D^2, ..., D^t). Note that estimating either of those posteriors requires all past data, and thus is impractical for data streams as t → ∞. Here we propose an entirely different approach that avoids reusing past data: the learning process is performed at each minibatch t by maximizing the posterior p(z, π^t | π^{t-1}, η, D^t), where π^{t-1} is available from the previous minibatch. Hence, our approach is potentially more efficient and truly applicable to data streams.
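To make this construction concrete, below is a minimal NumPy sketch (ours, not the released implementation; the array names eta, pi_prev, pi_t and the dimensions are illustrative) of the transformation of the prior η into the global variable β^t and of the random-walk step on π. The softmax is used as the mapping f, as in the case studies below.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax, used here as the mapping f.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

K, L, V = 100, 200, 5000                 # hidden factors, knowledge dimension, vocabulary size
rng = np.random.default_rng(0)

eta = rng.normal(size=(L, V))            # prior knowledge, e.g. pre-trained word embeddings (L x V)
pi_prev = rng.normal(scale=0.1, size=(K, L))   # transformation matrix from the previous minibatch
sigma = 1.0                              # variance of the random walk

# Random-walk (martingale) step: pi_t ~ N(pi_{t-1}, sigma * I)
pi_t = pi_prev + rng.normal(scale=np.sqrt(sigma), size=pi_prev.shape)

# Transform the prior into the model's global variable: beta_t = f(pi_t @ eta)
beta_t = softmax(pi_t @ eta, axis=1)     # K x V; each row is a distribution over the vocabulary
```

During learning, π^t is not sampled from this random walk but estimated by maximizing the posterior described next; the draw above only illustrates the generative assumption.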
We decompose this posterior into components in order to reuse the inference steps of the original model B:

p(z, π^t | π^{t-1}, η, D^t) ∝ p(D^t, z | β^t) p(π^t | π^{t-1}),  where β^t = f(π^t η).

In log form, we have

LP(z, π^t) = log p(π^t | π^{t-1}) + log p(D^t, z | β^t) + constant.   (1)

The learning process is separated into two parts, for the local and global variables respectively. While the inference of the local variables (z, x) is inherited from the original model B (e.g., by maximizing or sampling from p(z, x | β^t)), we focus on maximizing LP with respect to π^t. We extract the component G(β^t) = G(f(π^t η)) that contains β^t from log p(D^t, z | β^t). Then we obtain the objective function LP(π^t) = log p(π^t | π^{t-1}) + G(f(π^t η)) and maximize it by gradient ascent. Algorithm 1 briefly describes the learning process.

Algorithm 1
  Initialize π^0 randomly
  for minibatch t = 0, 1, ... do
    Receive a minibatch D^t of data
    while not converged do
      Do inference w.r.t. the local variables (z, x), given β^t = f(π^t η) and D^t (e.g., by maximizing or sampling from p(z, x | β^t))
      Maximize (1) w.r.t. π^t, given the statistics from (z, x)
    end while
    Set π^{t+1} := π^t
  end for

Next we discuss how to apply TPS to LDA [31], one of the most popular Bayesian models. LDA consists of two global variables (β, α): α contributes to the topic mixture θ of each document and is fixed in this case study, and β = (β_1, β_2, ..., β_K), where each β_k is a topic distribution over V words. Suppose prior knowledge η of size L × V is available. We incorporate the prior knowledge into β by a linear transformation with a transformation matrix π of size K × L, followed by the softmax operator. The generative process of the documents in minibatch t is as follows (Figure 1b):
1) Draw the transformation matrix: π^t_k ~ N(π^{t-1}_k, σI) for each topic k.
2) Calculate the topic distributions: β^t_k = softmax(π^t_k η).
3) For each document d, draw a topic mixture θ_d ~ Dirichlet(α); for the i-th word of d, draw a topic index z_i ~ Multinomial(θ_d) and then draw the word w_i ~ Multinomial(β_{z_i}).

Learning parameters: We apply Algorithm 1 to estimate the posterior. We emphasize that our framework utilizes the available inference methods (e.g., variational inference, Gibbs sampling) for the local variables (w, θ, z) of the original LDA model. Here, we use mean-field variational inference as in the original paper [31]. According to [31], the inference for document d reduces to repeating the following updates until convergence:

φ_{dvk} ∝ β^t_{kv} exp(Ψ(γ_{dk})),    γ_{dk} = α + Σ_v n_{dv} φ_{dvk},

where n_{dv} is the frequency of term v in document d and Ψ is the digamma function.

The component of (1) that depends on the global variable π^t_k for each k, given the data D^t, is

LP(π^t_k) = -(1/(2σ)) ||π^t_k - π^{t-1}_k||²_2 + Σ_v n_{kv} log β^t_{kv},

after removing some constants, where n_{kv} = Σ_{d∈D^t} n_{dv} φ_{dvk}. In more detail, log β^t_{kv} = π^t_k η_v - log Σ_{v'} exp(π^t_k η_{v'}), where η_v denotes the v-th column of η. Consider the concavity of the function LP(π^t_k): it is obvious that -(1/(2σ)) ||π^t_k - π^{t-1}_k||²_2 and π^t_k η_v are concave functions with respect to π^t_k, and the log-sum-exp function is well known to be convex. Therefore, LP(π^t_k) is concave with respect to π^t_k, and we can use gradient ascent to find its maximum. We sum up the learning algorithm of TPS for LDA in Algorithm 2 (TPS training for LDA), which instantiates Algorithm 1 with the updates above.
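To illustrate the global step of Algorithm 1 for LDA, the sketch below maximizes LP(π^t_k) by plain gradient ascent, assuming the expected topic-word counts n_k = (n_k1, ..., n_kV) from the local variational step are already available. The function name, the fixed step size, and the fixed number of iterations are our own illustrative choices, not the settings of the released code.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def update_pi_k(pi_prev_k, eta, n_k, sigma=1.0, lr=1e-3, n_steps=200):
    """Gradient ascent on LP(pi_k^t) = -||pi_k - pi_prev_k||^2 / (2*sigma)
                                       + sum_v n_kv * log softmax(pi_k @ eta)_v.
    pi_prev_k : (L,)   row k of the previous transformation matrix
    eta       : (L, V) prior knowledge matrix
    n_k       : (V,)   expected counts sum_d n_dv * phi_dvk from the local step
    """
    pi_k = pi_prev_k.copy()
    N_k = n_k.sum()
    for _ in range(n_steps):
        beta_k = softmax(pi_k @ eta)      # (V,) current topic distribution over words
        # Gradient of the data term is eta @ (n_k - N_k * beta_k);
        # gradient of the transition term is -(pi_k - pi_prev_k) / sigma.
        grad = -(pi_k - pi_prev_k) / sigma + eta @ (n_k - N_k * beta_k)
        pi_k = pi_k + lr * grad
    return pi_k
```

Since LP(π^t_k) is concave, plain gradient ascent suffices; in practice one might add a line search or an adaptive step size.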
In this subsection, we apply TPS to multinomial Naive Bayes for classification on document streams. Let C be the number of classes and β_c be the distribution of class c over the V words of the vocabulary (where β_{cj} = P(j|c) and Σ_j β_{cj} = 1). Each document d belonging to class (label) c_d is represented by a bag of N_d words, and each word w_{d,i} is generated from Multinomial(β_{c_d}). Suppose that we have prior knowledge η of size L × V. The generative process of the documents in minibatch t is as follows: for each class c, draw π^t_c ~ N(π^{t-1}_c, σI) and calculate β_c = softmax(π^t_c η); generate document d by drawing a class label c_d ~ Multinomial(α) and then drawing each word w_{dn} ~ Multinomial(β_{c_d}).

Learning: From (1), we extract the term associated with π^t_c for each class c as

LP(π^t_c) = -(1/(2σ)) ||π^t_c - π^{t-1}_c||²_2 + Σ_{d∈D^t_c} Σ_v n_{dv} log β_{cv},

where D^t_c denotes the documents with class label c in minibatch t and n_{dv} is the frequency of term v in document d. Learning for NB is really simple: at each minibatch t, we use gradient ascent to maximize LP(π^t_c) with respect to π^t_c, for each c. α = 1/C is used in our experiments.

Another notable property is that, when trained on a dataset of bounded size, TPS subsumes many existing dynamic models [29], [30], [43]. For example, when the prior η is the identity matrix of size V × V and LDA is the base model, TPS reduces to dynamic topic models [29]. It is worth noting that the learning algorithms of those dynamic models can only work with datasets of fixed size, whereas the learning method for TPS deals successfully with streams of infinite size. The ability of TPS to work with real data streams and to efficiently exploit external knowledge is a significant advantage. Next, we analyze two key properties.

The ability to balance old and new knowledge is a basic requirement for a learning system. When learning from data streams, three main sources of knowledge should be considered: the old knowledge learned from past data, the new knowledge to be learned from the incoming data, and the external knowledge. TPS has a simple mechanism to balance those three sources, owing to the objective function in (1). The first term, log p(π^t | π^{t-1}), controls the flexibility of the new model. An increase in the variance σ implies that the new model at time t might be far from the previous one, and thus the new model is searched for in a larger region; as σ → ∞, TPS will not remember what has been learned before. In contrast, a decrease in σ implies that the new model should not be far from the previous one; at σ = 0, we cannot learn any new knowledge at all, since the first term dominates LP(z, π^t_k). The second term, log p(z, D^t | β^t), enables TPS to learn new knowledge from new data. In contrast to the static use of external knowledge in KPS [42], TPS exploits the prior dynamically owing to the transformation matrix π^t. Estimating π^t at each minibatch implies a dynamic balance between the prior and the new knowledge learned from the data at time t. Note that the variance σ also plays a key role in this balance: a lower σ means less knowledge can be learned from new data. From those observations, one can see that TPS provides a simple mechanism (σ) to dynamically balance three sources of knowledge, overcoming the limitation of existing methods.

A serious issue in many learning methods is catastrophic forgetting [49], i.e., the learned knowledge can be forgotten quickly as one learns from more data/tasks. This issue has been found repeatedly for neural networks, but was unclear for Bayesian models. More importantly, existing works did not theoretically show how fast a method can forget. Here, we show that SVB [24] has a fast forgetting rate; the detailed proof appears in Appendix B.

Theorem 1 (Forgetting in SVB for LDA). Let ξ^0 be the model at time 0, and ξ^t be the model after learning by SVB from t more minibatches. Then ||ξ^t - ξ^0||_1 ≥ t and ||ξ^0||_1 = O(t^{-1}) · ||ξ^t||_1, suggesting that ξ^0 will be quickly forgotten, at a rate of O(t^{-1}), in the learned model ξ^t.

It can be shown that this property of SVB holds for Naive Bayes and for a large class of LDA-based variants which are conjugate.
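To see the O(t^{-1}) rate numerically, the toy simulation below iterates the SVB recursion for a conjugate model, in which the new variational parameter is the previous one plus the sufficient statistics of the new minibatch. The random Poisson counts merely stand in for the real expected statistics; this illustrates the mechanism behind Theorem 1 and is not the formal proof of Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 10, 1000
xi = rng.gamma(1.0, 1.0, size=(K, V))    # xi_0: variational parameters encoding the initial knowledge
xi0_norm = xi.sum()                      # ||xi_0||_1 (all entries are nonnegative)

ratios = []
for t in range(1, 201):
    # SVB recursion: add the (nonnegative) statistics collected from the new minibatch.
    minibatch_stats = rng.poisson(5.0, size=(K, V)).astype(float)
    xi = xi + minibatch_stats
    ratios.append(xi0_norm / xi.sum())   # relative weight of the initial knowledge after t minibatches

print(ratios[0], ratios[9], ratios[99], ratios[199])   # shrinks roughly like 1/t
```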
Such a forgetting rate in SVB is much faster than the rate Θ(t^{-0.67}) in humans [28]. We conjecture that a similarly fast rate appears in many existing methods. In contrast, TPS does not encounter this problem: it has an explicit mechanism to balance the three sources of knowledge, as discussed in the last subsection. By manipulating σ, TPS can remember the knowledge better.

In this section, we conduct extensive experiments to evaluate the performance of TPS. Further quantitative and qualitative evaluations can be found in the appendices.

We first evaluate TPS when applied to LDA. We take four state-of-the-art baselines: SVB [24], PVB [25], SVB-PP [26], and KPS [42].^6 Except for KPS, none of SVB, PVB, and SVB-PP explicitly exploits external/human knowledge; they can only use the prior (η) at initialization. Therefore, for a fair comparison, we encode the external knowledge in the initialization of those baselines. Whenever the form of the prior knowledge is unsuitable for the baselines, we use PCA to transform the knowledge matrices into a shape the baselines can use.

6. SVB-HPP is not included since its application to LDA requires non-trivial effort. Further, as observed by [26], SVB-HPP is often comparable to the best SVB-PP.

Datasets: We use two regular-text datasets (Grolier, TMN) and four short-text datasets, with some statistics given in Table 1.^7 The short-text corpora contain documents of extremely short length, and are used in our evaluation to reveal the role of prior knowledge in cases of extreme sparsity.

Prior knowledge: We use word embeddings and a word graph as two kinds of prior knowledge. The word embeddings were pre-trained on 6 billion tokens of Wikipedia 2014 and Gigaword 5 by [50].^8 Each word is represented by a 200-dimensional vector (L = 200). The word graph represents the relationships among words and is encoded by a matrix of size V × V; we build the 500-nearest-neighbor graph based on the cosine similarity of the word embedding vectors and utilize it as prior knowledge. Due to the high computational cost of working with a matrix of size V × V, we only did these experiments on Grolier and TMN-title.

Evaluation metrics: Log predictive probability (LPP) [46] and normalized pointwise mutual information (NPMI) [32] are used to measure generalization and topic coherence, respectively; details appear in the Appendix. We use a grid search to select suitable hyperparameters for the baselines and report the best parameter set for each method and each dataset. The ranges of the parameters are: power-prior parameter ρ ∈ {0.6, 0.7, 0.8, 0.9, 0.99} for SVB-PP, population size in {10^2, 10^3, 10^4, 10^5, 10^6} for PVB, dimming factor κ ∈ {0, 0.01, 0.03, 0.07, 0.1, 0.6, 0.7, 0.8, 0.9} for KPS, and σ ∈ {0.01, 0.1, 1, 10, 100} for TPS.

Predictive capacity: Figures 2a and 3 show the results when using the word-embedding and word-graph priors, respectively. TPS with both kinds of prior performs significantly better than the baselines, often by a large margin. In particular, thanks to the dynamic use of prior knowledge at each minibatch, TPS keeps increasing its predictive ability as it receives more data. Moreover, TPS attains a very high predictive capacity in the early stages of the learning process. For regular text, its predictive ability in the first minibatches is far higher than that of the baselines. This suggests that the prior contains a large amount of information, and that TPS can exploit this knowledge better than KPS.
It is worth noticing that SVB and SVB-PP do not seem to work well with extremely short text, since their predictive capability decreases as they learn from more data. Short text often does not provide enough information or a clear context [51], [52], and hence causes various difficulties for SVB, SVB-PP, PVB, and KPS. KPS is able to use prior knowledge, but its benefit seems limited because its usage of the knowledge is static throughout the learning process. Figures 2a and 3 clearly demonstrate that existing methods are prone to overfitting on short text, whereas TPS generalizes well.

Topic coherence: The results of evaluating topic coherence with NPMI are reported in Figures 2b and 3. With the word-embedding prior, TPS obtains the best results, often by a large margin. Again, TPS is effective for short text: the prior injects knowledge of word relationships into the model. With the word-graph prior, Figure 3 shows that TPS is stably among the best methods. Interestingly, KPS performed better than TPS on TMN-title; it seems that TPS did not exploit the full advantage of this kind of knowledge, although its predictiveness is still the best.

The role of prior knowledge and the transition model: Two components can significantly affect the performance of TPS: the prior knowledge η, and the transition model (π^t_k ~ N(π^{t-1}_k, σI)) which connects the models at two consecutive time steps. We would like to see which one is really important to the performance of TPS. To this end, we take LDA as the base model, fix batchsize = 500, σ = 1, K = 100, and use the pre-trained word embeddings as the prior. Figure 4a shows the performance of TPS in three versions (Transition + Prior, No transition, and No Prior). One can observe that when there is no prior, TPS does not perform well and even encounters overfitting on short text. When a good prior is available, TPS performs significantly better and does not encounter overfitting. The transition model also plays a positive role, as removing it may result in worse performance. It is worth observing that TPS tends to improve as it learns from more data; this suggests that the prior knowledge does not overwhelm the data, but supports TPS in learning better.

Sensitivity of σ: Grolier (regular text) and TMN-title (short text) are used in this evaluation. We fix the batchsize to 1000 and K = 100 topics. The results are presented in Figure 4b. The figure shows that one should use a small σ for long text and a large σ for short text. The reason might be that short text contains little information, and a few changes can lead to a great variance in the meaning of the text; therefore, the new model π^t should be searched for in a large region around π^{t-1} to capture the large variance in the incoming data. This coincides well with our theoretical analysis.

We next compare TPS with SVB and KPS when applied to Naive Bayes for streaming classification of documents from 6 categories (business, culture, news, opinion, sport, and letters). We continuously train a model when a minibatch arrives, and then classify the documents of the next minibatch. Here, each minibatch contains the documents of one month.

Prior knowledge: We extract a V-dimensional feature vector for each class c whose element j is the ratio of the number of occurrences of word j in class c to the number of documents containing word j. We thus obtain a C × V matrix in which each term v is represented by a C-dimensional vector; this matrix is used as the prior for SVB and KPS. In TPS, we represent each word v by concatenating a one-hot vector (of dimension V) with this C-dimensional vector in order to get a sufficient representation, and use the resulting matrix of size (V + C) × V as the prior knowledge.
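A minimal sketch of this prior construction is given below, assuming a document-term frequency matrix and document labels are available from a reference portion of the stream; the function name and the small smoothing constant are our own illustrative choices.

```python
import numpy as np

def build_nb_prior(doc_word, doc_label, n_classes):
    """doc_word : (D, V) term-frequency matrix
       doc_label: (D,)   class label of each document
       Returns the C x V class-feature matrix (prior for SVB/KPS) and the
       (V + C) x V matrix used as the prior knowledge eta for TPS."""
    _, V = doc_word.shape
    # number of occurrences of word j in class c
    class_word = np.zeros((n_classes, V))
    for c in range(n_classes):
        class_word[c] = doc_word[doc_label == c].sum(axis=0)
    # number of documents containing word j (smoothed to avoid division by zero)
    doc_freq = (doc_word > 0).sum(axis=0) + 1e-12
    class_feat = class_word / doc_freq                 # C x V
    eta_tps = np.vstack([np.eye(V), class_feat])       # (V + C) x V: one-hot part + class features
    return class_feat, eta_tps
```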
Results: Figure 5 reports the accuracies of the three methods. TPS is comparable to KPS in the first 100 minibatches, and about 1-8.3% better than KPS in the remaining minibatches. We observe that the prior knowledge is definitely suitable for KPS, as it helps KPS obtain high accuracy. The gap between TPS and KPS becomes significant when the number of minibatches is large. In contrast, SVB only utilizes the knowledge at the first step, and hence often gets lower accuracy than the other methods. Note that TPS performs significantly better than both KPS and SVB in the last 100 minibatches. It is also worth noting that at some sudden changes in the data distribution, the performance of SVB and KPS drops significantly, while TPS can reduce the bad effect of those sudden changes. The main reason may come from the effective exploitation of prior knowledge. This seems to be an advantage of TPS in changing environments.

The final evaluation examines how well the baselines can utilize the full strength of external knowledge. The experiments with Naive Bayes in the previous subsection provide some good evidence, as all methods can use the original knowledge. However, in the experiments with LDA in Subsection 5.1 we had to transform the knowledge (pre-trained word embeddings) into a form that can be used in SVB, SVB-PP, PVB, and KPS, due to the mismatch in dimensionality and the negative values in the embedding vectors. The transformation may cause some loss of knowledge and hence may bias the comparison against the baselines, since TPS uses the original knowledge representation. Next, we examine the performance of those methods when directly using the original knowledge representation. In this case we have to match the dimensionality of the knowledge and the global variable in LDA. We evaluate LDA on three large datasets: NYT-title, Yahoo-title, and Irishtimes. All the settings are the same as in Subsection 5.1, except that the number of topics is K = 200, which is exactly the dimensionality of the pre-trained word embeddings. To ensure nonnegativity of the knowledge vectors, we normalize each embedding vector to lie in [0, 1]^200.

Figure 6 shows the results. We observe that the behaviors of the baselines are almost the same as in the experiments of Subsection 5.1. One interesting finding is that KPS in this evaluation does not seem to utilize the knowledge well, as its performance stays steady or deteriorates over time; this is in contrast to the case where the knowledge is transformed into a lower dimensionality by PCA and then input to the baselines. Figure 6 suggests that TPS can utilize the knowledge well and performs significantly better than the baselines on both measures.

In summary, TPS can directly exploit external knowledge of different forms when learning a model, while the other methods encounter difficulties. TPS can use the knowledge in its original representation, whereas the other methods often need a suitable transformation and hence do not fully exploit the strength of the external knowledge to improve a Bayesian model.

We presented a novel framework (TPS) that overcomes many drawbacks of existing approaches in streaming conditions. In particular, TPS exploits prior knowledge well, while other methods can forget it very fast.
TPS has the hyperparameter σ as a simple mechanism to balance different sources of knowledge. One interesting question for future work is how to learn σ efficiently.

STREAMING NAIVE BAYES: In this section, we explicitly describe the application of TPS, SVB [24], and KPS [42] to multinomial Naive Bayes (NB) for classification on document streams. NB models the documents in each class c by a multinomial distribution with parameter β_c, so a batch learning algorithm for NB focuses mostly on estimating β = (β_1, ..., β_C) for a classification problem with C classes.^9

9. Estimating the prior for each class is important, but for simplicity we use a uniform prior over class labels in this study.

For TPS, the derivation is presented in the main paper. Algorithm 3 presents the streaming learning of NB by TPS.

Algorithm 3
  Require: Prior knowledge η, variance σ, data sequence {D^1, D^2, ...}
  Ensure: π
  Initialize π^0 randomly
  for the t-th minibatch do
    Find π^t_c, for each class c with data D^t_c, by using gradient ascent to maximize LP(π^t_c)
  end for

Using variational inference, SVB [24] approximates the posterior distribution of β_c in Naive Bayes by the variational distribution q(β_c | ξ_c) = Dirichlet(· | ξ_c), where ξ_c is the variational parameter associated with class c. Therefore, learning NB translates into learning the variational parameters ξ_1, ..., ξ_C. Similarly to the case of LDA, we update the model at time step t by

ξ^t_c = ξ^{t-1}_c + ξ̂^t_c,

where ξ^{t-1} comes from the previous minibatch t-1 and ξ^0 is initialized with the prior η. The learned information ξ̂^t from the data D^t at minibatch t is inferred by variational inference as ξ̂^t_{cv} = Σ_{d∈D^t_c} n_{dv}, where n_{dv} is the frequency of term v in document d. Algorithm 4 summarizes the streaming learning of NB by SVB.

KPS [42] is a variant of SVB that explicitly exploits the prior knowledge η in all minibatches. In KPS, the model parameter ξ^t at minibatch t is computed by adding the prior η back into the update at every minibatch, where a dimming factor κ ≥ 0 decreases the impact of the prior knowledge gradually after a number of minibatches (the exact update is given in [42]).
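For concreteness, a minimal sketch of the SVB recursion for streaming NB is given below (the function name and matrix layout are our own). KPS additionally re-injects the prior η at every minibatch with a weight controlled by κ; its exact update is given in [42] and is omitted from the sketch.

```python
import numpy as np

def svb_nb_update(xi_prev, minibatch_docs, labels, n_classes):
    """One SVB step for multinomial Naive Bayes.
    xi_prev        : (C, V) Dirichlet variational parameters from the previous minibatch
                     (xi^0 is initialised with the prior knowledge eta).
    minibatch_docs : (M, V) term-frequency matrix of the current minibatch D^t
    labels         : (M,)   class label of each document
    """
    xi_hat = np.zeros_like(xi_prev)
    for c in range(n_classes):
        # Sufficient statistics learned from D^t_c: word counts of class c in this minibatch.
        xi_hat[c] = minibatch_docs[labels == c].sum(axis=0)
    return xi_prev + xi_hat    # SVB recursion: old parameters plus new statistics
```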
Interpretability is an important criterion for evaluating a model: the results from a model should be understandable and interpretable by humans. In this section, we consider the interpretability/clarity of the topics learned by LDA. In several circumstances, some methods are not able to expose a clear topic for a specific domain although that domain exists in the corpus; in such cases, we choose the closest topic based on the topic's keywords. For the evaluation of interpretability, we use two corpora: Grolier (long text) and NYT-title (short text). We fixed K = 50 for LDA, σ = 1.0 for TPS, batchsize = 500 for Grolier (due to its small size), and batchsize = 5000 for NYT-title. The other settings are the same as those in the experiment part of the main paper. Some results are shown in Tables 2 and 3. Table 2 shows the top 10 words of two topics (Military and Music) of the Grolier dataset, while Table 3 gives the top words of two topics (Business and Politics) of NYT-title. Ambiguous words are written in italics. It is clear that the topics learned by TPS have fewer ambiguous words than those of the other baselines. Moreover, the topics of TPS seem clearer than the others, with a more consistent relationship among the words in each topic, and the improvement is more significant for short text than for regular text. In summary, using prior knowledge is effective in improving the clarity of topics and making them easier for humans to interpret.

In this section, we investigate the effects of the parameters: the number K of topics, the batchsize, and the variance σ. Both regular text (Grolier) and short text (TMN-title) are used in this evaluation.

We fix batchsize = 500, σ = 1, and test the number of topics in {50, 100, 150, 200}. The results are shown in Figure 7 for regular text (Grolier) and short text (TMN-title), respectively. While TPS is stable on regular text when K changes, it seems more sensitive on short text. Moreover, a smaller number of topics helps TPS do well on short text.

To examine the sensitivity of TPS to the batchsize, we fix σ = 1.0 and K = 50. The results are shown in Figure 8. We can see that the batchsize has a similar impact on regular and short text. Under the assumptions of TPS, the streaming data are processed in chunks whose size is decided by the batchsize, which means that this parameter determines how much information from newly arrived data is balanced against the prior knowledge and the past minibatch. Therefore, TPS seems more sensitive to the batchsize than to K.

Figure 9 shows the sensitivity of TPS w.r.t. σ. Large values (σ ≥ 10) seem to perform worse than smaller values; σ ≤ 1 seems to be good, meaning that the model at each minibatch should not be far from that of the previous minibatch. The gap among the settings is noticeable in the first minibatches; however, the difference gradually decreases as more data arrive.

We also examine the effectiveness of the prior and the transition of TPS in multi-pass training, which passes (iterates) over the whole training data more than once. In detail, we pass through the data 50 times, each pass being an epoch, and evaluate the log predictive probability of the model after each epoch. The results for Grolier and TMN-title are shown in Fig. 10. We again see that having the prior (Transition + Prior and No transition) is significantly better than No Prior, which again confirms the importance of using prior knowledge. We also see that Transition + Prior and No transition have comparable performance. A possible explanation is that multi-pass training goes through the data many times, hence the model can still learn the transformation of the prior knowledge without using the transition. Moreover, multi-pass training allows TPS to achieve a higher LPP than single-pass training. However, in reality, the infinite and large amount of incoming data prevents us from passing over the data multiple times; we need to balance the trade-off between performance and the resources for storage and running time.

We measured the running time of all methods on two datasets (Grolier and TMN-title), using the same settings as in the sensitivity analysis: σ = 1, K = 50, and batchsize = 500 for Grolier and TMN-title. The results are shown in Table 4 and Figure 11. We observe that TPS reaches high performance at a faster rate than the baselines. This is surprising. One reason may be that the inference for each document in the baselines takes significant time due to the computation of the expectation of the Dirichlet distribution for the global parameters, whereas TPS does not need to compute such an expectation. Therefore, TPS runs faster than the baselines.

We follow the metric used in [46]. In general, given the model learned from training data D, each document in the evaluation set is divided into two disjoint parts: the held-out words w_ho and the observed words w_obs.
The local variables are inferred using w_obs, and then the predictive power of the model is evaluated by the log probability log p(w_ho | D, w_obs). In an LDA model with K topics, global topic-word distributions β, and document-specific distribution θ, we have

p(w_ho | D, w_obs) = ∫∫ ( Σ_{k=1}^{K} θ_k β_{k, w_ho} ) p(θ | w_obs, β) p(β | D) dθ dβ ≈ Σ_{k=1}^{K} E_{q(θ)}[θ_k] E_{q(β)}[β_{k, w_ho}],

where q(β) and q(θ) are the approximate distributions of the variables β and θ, respectively. Note that when β is a point estimate, E_q[β] is replaced by β, and θ is inferred from the observed words w_obs given β.

The NPMI metric was introduced by [32]; the NPMI score correlates with human-judged coherence. In detail, given a topic t with top-T topic words w_1, w_2, ..., w_T, the NPMI score of topic t is calculated by

NPMI(t) = (2 / (T(T-1))) Σ_{j=2}^{T} Σ_{i=1}^{j-1} [ log( P(w_i, w_j) / (P(w_i) P(w_j)) ) / ( -log P(w_i, w_j) ) ],

where P(w_i) is the probability of word w_i derived from the corpus and P(w_i, w_j) is the probability of co-occurrence of the two words w_i and w_j in the same document.
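A minimal sketch of this coherence computation is given below, assuming a Boolean document-word presence matrix from a reference corpus; the smoothing constant eps and the averaging over word pairs follow common practice and are our own choices.

```python
import itertools
import numpy as np

def npmi_score(top_words, doc_word_presence, eps=1e-12):
    """NPMI coherence of one topic.
    top_words         : indices of the top-T words of the topic
    doc_word_presence : (D, V) Boolean matrix; True if word v occurs in document d
    """
    scores = []
    for wi, wj in itertools.combinations(top_words, 2):
        p_i = doc_word_presence[:, wi].mean()
        p_j = doc_word_presence[:, wj].mean()
        p_ij = (doc_word_presence[:, wi] & doc_word_presence[:, wj]).mean()
        pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
        scores.append(pmi / (-np.log(p_ij + eps)))
    return float(np.mean(scores))
```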
REFERENCES
[1] Variational inference: A review for statisticians.
[2] Science and statistics.
[3] Applying associative retrieval techniques to alleviate the sparsity problem in collaborative filtering.
[4] Session-based recommendations with recurrent neural networks.
[5] Clustering short texts using Wikipedia.
[6] BTM: Topic modeling over short texts.
[7] Leveraging multi-modal prior knowledge for large-scale concept learning in noisy web data.
[8] Eliminating overfitting of probabilistic topic models on short and noisy text: The role of dropout.
[9] Boosting prior knowledge in streaming variational Bayes.
[10] Understanding the limiting factors of topic modeling via posterior contraction analysis.
[11] Incorporating knowledge graph embeddings into topic modeling.
[12] Improving topic models with latent feature word representations.
[13] A word embeddings informed focused topic model.
[14] Boosting signal-to-noise in complex biology: prior knowledge is power.
[15] Incorporating domain knowledge into topic modeling via Dirichlet forest priors.
[16] A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic.
[17] Incorporating lexical priors into topic models.
[18] Leveraging multi-domain prior knowledge in topic models.
[19] Estimating hepatitis C prevalence in England and Wales by synthesizing evidence from multiple data sources: assessing data conflict and model fit.
[20] A Bayesian evidence synthesis approach to estimate disease prevalence in hard-to-reach populations: hepatitis C in New York City.
[21] BERT: Pre-training of deep bidirectional transformers for language understanding.
[22] Why does unsupervised pre-training help deep learning?
[23] Word representations: a simple and general method for semi-supervised learning.
[24] Streaming variational Bayes.
[25] The population posterior and Bayesian modeling on streams.
[26] Bayesian models of data streams with hierarchical power priors.
[27] Balancing new against old information: The role of puzzlement surprise in learning.
[28] The form of the forgetting curve and the fate of memories.
[29] Dynamic topic models.
[30] Dynamic Poisson factorization.
[31] Latent Dirichlet allocation.
[32] Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality.
[33] Topics over time: a non-Markov continuous-time model of topical trends.
[34] Dynamic mixture models for multiple time-series.
[35] Continuous time dynamic topic models.
[36] On-line expectation-maximization algorithm for latent data models.
[37] Optimization methods for large-scale machine learning.
[38] An introduction to sequential Monte Carlo methods.
[39] Word features for latent Dirichlet allocation.
[40] Incorporating word correlation knowledge into topic modeling.
[41] The power prior: theory and applications.
[42] Keeping priors in streaming Bayesian learning.
[43] Dynamic joint sentiment-topic model.
[44] Scalable generalized dynamic topic models.
[45] The dynamic embedded topic model.
[46] Stochastic variational inference.
[47] Distributed representations of words and phrases and their compositionality.
[48] Translating embeddings for modeling multi-relational data.
[49] Continual lifelong learning with neural networks: A review.
[50] GloVe: Global vectors for word representation.
[51] Empirical study of topic modeling in Twitter.
[52] A biterm topic model for short texts.