Stability of Topic Modeling via Matrix Factorization

Mark Belford, Brian Mac Namee, Derek Greene
Insight Centre for Data Analytics, University College Dublin, Ireland
Email addresses: mark.belford@insight-centre.org (Mark Belford, corresponding author), brian.macnamee@ucd.ie (Brian Mac Namee), derek.greene@ucd.ie (Derek Greene)

Author's accepted version. Published in Expert Systems with Applications, 91: 159-169. DOI: 10.1016/j.eswa.2017.08.047

Abstract

Topic models can provide us with an insight into the underlying latent structure of a large corpus of documents. A range of methods have been proposed in the literature, including probabilistic topic models and techniques based on matrix factorization. However, in both cases, standard implementations rely on stochastic elements in their initialization phase, which can potentially lead to different results being generated on the same corpus when using the same parameter values. This corresponds to the concept of "instability" which has previously been studied in the context of k-means clustering. In many applications of topic modeling, this problem of instability is not considered and topic models are treated as being definitive, even though the results may change considerably if the initialization process is altered. In this paper we demonstrate the inherent instability of popular topic modeling approaches, using a number of new measures to assess stability. To address this issue in the context of matrix factorization for topic modeling, we propose the use of ensemble learning strategies. Based on experiments performed on annotated text corpora, we show that a K-Fold ensemble strategy, combining both ensembles and structured initialization, can significantly reduce instability, while simultaneously yielding more accurate topic models.

Keywords: Topic modeling, Topic stability, LDA, NMF

1. Introduction

Topic models aim to discover the latent semantic structure or topics within a corpus of documents, which can be derived from co-occurrences of words across the documents. Popular approaches for topic modeling have involved the application of probabilistic algorithms such as Latent Dirichlet Allocation (Blei et al., 2003).
More recently, Non-negative Matrix Factorization approaches (Lee and Seung, 1999) have also been successfully applied to identify topics in unstructured text (Arora et al., 2012; Kuang et al., 2015).

The standard formulations of both the LDA and NMF algorithms include stochastic elements in their initialization phase, prior to an optimization phase which produces a local solution. This random component can affect the final composition of the topics found and the rankings of the terms that describe those topics. This is problematic when seeking to capture a definitive topic modeling solution for a given corpus and represents a fundamental instability in these algorithms – different runs of the same algorithm on the same data can produce different outcomes. This problem has been widely studied in the context of partitional clustering algorithms such as k-means, which tends to converge to one of numerous local minima, depending on the choice of starting condition (Bradley and Fayyad, 1998). It has long been recognized as a significant drawback of such algorithms and a substantial number of works exist which attempt to address the issue (e.g. Pena et al. (1999); Kuncheva and Vetrov (2006)).

In the case of topic modeling, instability can manifest itself in two distinct aspects. The first can be observed when examining the topic descriptors (i.e. the top terms representing each topic) over multiple runs. The term rankings may change considerably, where certain terms may appear or disappear completely between runs. Secondly, issues of instability can also be observed when examining the degree to which documents have been associated with topics across different runs of the same algorithm on the same corpus. In both cases, such inconsistencies can potentially alter our interpretation and perception of a given topic model. Also, it is clear that any individual run should not be treated as a "definitive" summary of the underlying topics present in the data.

Generally speaking, in the comparative evaluation of topic modeling approaches, researchers tend to focus on either the coherence of the topic descriptors (Newman et al., 2010) or the extent to which the topics accurately coincide with a set of ground truth categories or human annotations (Kuang et al., 2015). However, few researchers have considered the evaluation of different approaches from the point of view of their stability across multiple runs.

In this paper we quantitatively assess the extent to which standard randomly-initialized NMF and LDA algorithms are unstable with respect to the topics that they produce on a diverse collection of text corpora. To do this we propose measures that capture the two distinct aspects of instability outlined above. We then focus on addressing the issue in the context of matrix factorization, exploring the use of strategies that involve improved initialization and ensemble learning. In particular, we propose a new combined approach, motivated by the traditional concept of k-fold cross-validation, which can yield stable results while also often producing more accurate and coherent models [1].

[1] See https://github.com/derekgreene/topic-ensemble

The rest of the paper is structured as follows. In Section 2 we provide an overview of relevant work in topic modeling and the more general area of cluster analysis.
In Section 3 we discuss the problem of topic model instability in more detail, describing three new measures to quantify instability in topic models. In Section 4 we propose ensemble approaches to address the issue, which are subsequently evaluated on ten different text corpora in Section 5. Finally in Section 6 we conclude the paper with ideas for future work.

2. Related Work

2.1. Topic Modeling

Topic models attempt to discover the hidden thematic structure within an unstructured collection of text without relying on any form of training data. These models date back to the early work on latent semantic analysis (LSA) by Deerwester et al. (1990), who proposed applying SVD to decompose a document-term matrix to uncover the associations between terms and concepts in the data. In basic terms, a topic model consists of k topics, each represented by a ranked list of strongly-associated terms (often referred to as a "topic descriptor"). Each document in the corpus can also be associated with one or more of these topics to varying degrees.

Considerable research on topic modeling has focused on the use of probabilistic methods, where a topic is viewed as a probability distribution over words, with documents being mixtures of topics (Steyvers and Griffiths, 2007). The most widely-applied probabilistic topic modeling approach has been LDA (Blei et al., 2003). Different approximation methods have been proposed for LDA inference, including variational inference and Markov chain Monte Carlo (MCMC). Such approximation algorithms can converge to different local maxima on the same data (Zhao et al., 2015). The most commonly-used implementation, provided by the Mallet software package (McCallum, 2002), relies on fast Gibbs sampling, where the initial state is determined by a user-specified random seed.

Alternative algorithms, such as Non-negative Matrix Factorization (Lee and Seung, 1999), have also been effective in discovering topics in text corpora (Arora et al., 2012; Kuang et al., 2015). NMF is an unsupervised approach for reducing the dimensionality of non-negative matrices. When working with a document-term matrix A, the goal of NMF is to approximate this matrix as the product of two non-negative factors W and H, each with k dimensions. The rows of the factor H can be interpreted as k topics, defined by non-negative weights for each of the m terms in the corpus vocabulary. Ordering each row provides a topic descriptor, in the form of a ranking of the terms relative to the corresponding topic. The columns in the matrix W provide membership weights for all documents with respect to each of the k topics. One of the advantages of NMF over traditional LDA methods is that there are fewer parameter choices involved in the modeling process, while it also has a tendency to identify more coherent topics than LDA (O'Callaghan et al., 2015).

NMF is commonly initialized by assigning random non-negative weights to the entries in the factors W and H. By applying an optimization process, such as alternating least squares (Lin, 2007), the factors are iteratively improved to reduce the approximation error until a local minimum is reached. As a result, the values in the initial pair of factors will have a significant impact on the values in the final factors (i.e. the topic-term and topic-document weights), even after a large number of iterations have been performed.

Alternative initialization schemes for NMF have focused on increasing the accuracy of the final factors by using a more structured process, such as seeding using a prior clustering algorithm (Wild et al., 2004). Another approach, Non-negative Double Singular Value Decomposition (NNDSVD) (Boutsidis and Gallopoulos, 2008), chooses initial factors based on a sparse SVD approximation of the original data matrix. This has been shown to be particularly effective on sparse data, such as text (O'Callaghan et al., 2015). In its basic form, NNDSVD contains no stochastic element and should technically converge to the same pair of factors each time, although this depends on the underlying SVD implementation being used.
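To make the role of initialization concrete, the following minimal sketch (not part of the original implementation) applies scikit-learn's NMF to a small placeholder corpus, once with random initialization under two different seeds and once with NNDSVD; the toy documents, seeds, and values of k and t are illustrative assumptions only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Placeholder corpus; in practice this would be a large document collection.
docs = ["stock market trading shares", "game season team coach",
        "market shares economy bank", "team players football season",
        "bank economy stock trading", "coach players game football"]
vec = TfidfVectorizer()
A = vec.fit_transform(docs)          # document-term matrix
terms = vec.get_feature_names_out()

def top_terms(H, t=3):
    # Rank the terms in each row of H to obtain the topic descriptors.
    return [[terms[i] for i in row.argsort()[::-1][:t]] for row in H]

# Random initialization: descriptors can differ from one seed to another.
for seed in (1, 2):
    model = NMF(n_components=2, init="random", random_state=seed, max_iter=100)
    W = model.fit_transform(A)                 # document-topic weights
    print(seed, top_terms(model.components_))  # H: topic-term weights

# NNDSVD initialization removes the explicit stochastic element.
model = NMF(n_components=2, init="nndsvd", max_iter=100)
W = model.fit_transform(A)
print("nndsvd", top_terms(model.components_))
```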
2.2. Stability in Cluster Analysis

Partitional clustering algorithms, such as k-means and k-medoids, have an inherent stability problem. That is to say, if we run the same algorithm on the same data, or on data drawn from the same source, repeatedly, we frequently achieve different results between each run. This variation can either be due to poor random seeds leading to convergence to different local minima (Pena et al., 1999), or be a result of perturbations in the data (Ben-Hur et al., 2002).

One widely-adopted approach for dealing with the issue is to adopt a better cluster initialization strategy that is either fully deterministic or at least produces less variation than random initialization, while simultaneously yielding more useful clusterings. A popular initialization approach proposed by Arthur and Vassilvitskii (2007), referred to as k-means++, involves choosing an initial seed item at random as the first cluster center and then choosing each subsequent cluster center with a probability proportional to its squared distance from the item's nearest existing cluster center. To further improve the resulting clustering, this process can be repeated for several different initial seed items. While this strategy is not deterministic, it does tend to yield more consistent results across multiple runs. Researchers have also proposed fully deterministic strategies, where initial cluster centers are determined based on embedding methods such as PCA (Su and Dy, 2004), or come from the prior application of another algorithm such as hierarchical clustering (Celebi and Kingravi, 2012).

2.3. Ensemble Methods

An alternative strategy for reducing instability in unsupervised learning is to use ensemble clustering techniques, which are based on the premise that combining large, diverse sets of clusterings can produce a more stable and accurate solution (Strehl and Ghosh, 2002). Ensemble approaches are usually divided into two different stages. Firstly, a collection of base clusterings are generated (i.e. the ensemble members), typically by repeatedly applying an algorithm such as k-means with random initialization to the full dataset or to random samples of the data (Minaei-Bidgoli et al., 2004). Secondly, an integration function is applied to combine the base clusterings into a single consensus clustering. One of the most common integration strategies is to leverage information from the ensemble regarding the level of "co-association" between all pairs of items. The underlying idea is that items that are frequently assigned together in different clusterings will naturally belong to the same underlying group (Strehl and Ghosh, 2002). The resulting consensus clustering represents an approximation of the "average" clustering from among the ensemble members.
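As a concrete illustration of co-association (not taken from the paper), the short sketch below builds a co-association matrix from repeated k-means runs; the synthetic data and the values of r and k are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(20, 5)   # synthetic data: 20 items, 5 features
r, k = 10, 3                               # number of runs and clusters (arbitrary)

C = np.zeros((len(X), len(X)))
for seed in range(r):
    labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
    C += (labels[:, None] == labels[None, :])   # 1 where two items share a cluster
C /= r   # C[i, j]: fraction of runs in which items i and j were clustered together
```

A final consensus clustering can then be derived by clustering the rows of C, for example with hierarchical clustering.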
Hadjitodorov et al. (2006) demonstrated a trade-off between diversity and quality in cluster ensembles, and proposed a number of measures to quantify diversity for such ensembles.

Work regarding the optimality and consistency of solutions produced by clustering and biclustering algorithms has been previously carried out in other domains (Bertoni and Valentini, 2005; Pio et al., 2015). However, in the general area of matrix factorization, there has been only some initial work on the use of ensemble approaches. This includes using a hierarchical scheme to combine multiple factorizations in the study of protein networks (Greene et al., 2008) and the generation of ensembles of factorizations via a boosting-like approach (Suh et al., 2016). Recent work has also looked at using stability as a means of identifying an appropriate number of topics in a given corpus when applying NMF to text data (Greene et al., 2014). However, these studies have not investigated the extent of the problems introduced by instability in the context of topic modeling.

3. Stability of Topic Modeling

In this section we introduce three new measures for assessing the stability of a collection of topic models, and use these to demonstrate how standard NMF and LDA approaches can be prone to produce unstable results when applied to text corpora.

3.1. Overview

As discussed in Section 2, standard implementations of topic modeling approaches, such as LDA and NMF, commonly employ stochastic initialization prior to optimization. As a result, the models they produce can vary quite considerably between different runs. Regardless of whether we are applying probabilistic or non-probabilistic algorithms, we can observe that this variation manifests itself in two ways: in relation to term-topic associations, or document-topic associations. In the former, the ranking of the top terms that describe a topic can change significantly between runs. In the latter, documents may be strongly associated with a given topic in one run, but may be more closely associated with an alternative topic in another run. In more extreme cases, a consequence of both manifestations is that topics can "appear" or "disappear" across different runs of the algorithm. This presents a challenge for domain experts who seek to gain a reliable insight into a particular corpus of documents. Depending on the topics resulting from a given algorithm run, their interpretation of the data may change considerably. However, the implications of this variation are rarely discussed in the topic modeling literature, particularly in the context of matrix factorization.

3.2. Measuring Stability

To quantify the level of stability (or instability) present in a collection of topic models {M_1, ..., M_r} generated over r runs on the same corpus, we propose three measures which reflect both aspects of topic model stability as described above. These measures are general in the sense that they can be applied to models generated using either probabilistic or matrix factorization algorithms.

3.2.1. Descriptor Set Difference

If the topics present in two topic models are similar, we should naturally expect that the prominent terms appearing in the topic descriptors in both models will be similar. Formally, if we represent each topic in a single model M_i with a number of top-ranked terms t, we can calculate the descriptor set as the union of top terms across all k topics, which we denote T_i. By measuring the symmetric difference between the descriptor sets for two different models, we can broadly gauge the similarity of the two models. This is useful as it captures the variance at the descriptor level, where terms may appear and disappear between runs. Formally, given two topic models M_i and M_j, each containing k topics represented by their top t terms, we calculate the descriptor set difference as:

    DSD(M_i, M_j) = \frac{|T_i \,\triangle\, T_j|}{t \times k}    (1)

A value of 0 indicates identical descriptor sets (i.e. no difference), while a value of 1 indicates that the topic descriptors for the two models share no common terms at all. Given a collection of r topic models, we can calculate the Average Descriptor Set Difference (ADSD):

    ADSD = \frac{1}{r \times (r-1)} \sum_{i,j,\, i \neq j}^{r} DSD(M_i, M_j)    (2)

This produces a value ∈ [0, 1], where a value closer to 0 for Eqn. 2 is indicative of a more stable collection of models.
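A direct transcription of Eqns. 1-2 might look as follows (a sketch, not the authors' released implementation); it assumes each model is supplied as a list of k topics, each a ranked list of terms.

```python
def descriptor_set(model, t=10):
    # T_i: union of the top-t terms over all k topics of one model.
    return {term for topic in model for term in topic[:t]}

def dsd(model_i, model_j, t=10):
    # Eqn. 1: size of the symmetric difference of the two descriptor sets,
    # normalized by t * k.
    k = len(model_i)
    return len(descriptor_set(model_i, t) ^ descriptor_set(model_j, t)) / (t * k)

def adsd(models, t=10):
    # Eqn. 2: mean DSD over all ordered pairs of distinct models.
    r = len(models)
    total = sum(dsd(models[i], models[j], t)
                for i in range(r) for j in range(r) if i != j)
    return total / (r * (r - 1))
```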
3.2.2. Topic-Term Stability

While the ADSD gives an overall measure of the difference between two models, it does not account for cases where topics are "mixed" across different runs of the algorithm (i.e. the same terms appear in different topics across different runs). Therefore, we propose a measure that compares the similarity between two topic models based on a pairwise matching process at the topic level. This is important as topics may appear and disappear between different runs, and it also helps to capture the variance at the individual topic level.

First, given a pair of individual topics represented by their top t terms, we can measure the similarity between them based on the Jaccard Index:

    Jac(R_i, R_j) = \frac{|R_i \cap R_j|}{|R_i \cup R_j|}    (3)

where R_i denotes the top t ranked terms for the i-th topic (i.e. its topic descriptor). We can use the above to build a measure of the agreement between two complete topic models, each containing k topics. We construct a k × k similarity matrix S, such that the entry S_{xy} indicates the agreement between the x-th topic in the first model and the y-th topic in the second model, as calculated using Eqn. 3. We then find the best match between the rows and columns of S (i.e. the topics from the first model and the second model). The optimal permutation π may be found in O(k^3) time by solving the minimal weight bipartite matching problem using the Hungarian method (Kuhn, 1955). From this, we can produce a Term Stability (TS) score:

    TS(M_i, M_j) = \frac{1}{k} \sum_{x=1}^{k} Jac(R_{ix}, \pi(R_{ix}))    (4)

where π(R_{ix}) denotes the topic in model M_j matched to R_{ix} in model M_i by the permutation π. Values for the above take the range [0, 1], where a comparison between two identical k-way topic models will result in a score of 1. For a collection of r topic models {M_1, ..., M_r}, we can calculate the Average Term Stability (ATS):

    ATS = \frac{1}{r \times (r-1)} \sum_{i,j,\, i \neq j}^{r} TS(M_i, M_j)    (5)

where a score of 1 indicates that all pairs of topic descriptors matched together across the r runs contain the same top t terms.
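One way to realize Eqns. 3-4 is sketched below (an illustration, not the authors' code), using scipy's Hungarian-method solver; negating the similarity matrix turns the maximum-agreement matching into the minimal-weight problem described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def jaccard(r_i, r_j):
    # Eqn. 3: Jaccard Index between two topic descriptors.
    s_i, s_j = set(r_i), set(r_j)
    return len(s_i & s_j) / len(s_i | s_j)

def term_stability(model_i, model_j, t=10):
    # Eqn. 4: build the k x k descriptor similarity matrix S, then find the
    # optimal one-to-one topic matching via the Hungarian method and average
    # the Jaccard scores of the matched pairs.
    k = len(model_i)
    S = np.array([[jaccard(model_i[x][:t], model_j[y][:t])
                   for y in range(k)] for x in range(k)])
    rows, cols = linear_sum_assignment(-S)   # maximize total agreement
    return S[rows, cols].mean()
```

ATS (Eqn. 5) then follows as the mean of term_stability over all pairs of models, exactly as adsd was computed above.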
3.2.3. Partition Stability

The second manifestation of topic model instability relates to document-topic associations. To measure the extent to which the associations between a document and one or more topics vary across different runs, for each run we can look at the dominant topic for every document. That is, we convert the document-topic associations (the probabilities in the case of LDA or the W factor weights in the case of NMF) into a disjoint partition by taking the maximum value for each document. We can then compare the similarity between the partitions generated in two runs using standard clustering agreement measures. Utilizing this information allows us to observe whether the dominant topic for each document changes frequently between runs.

One widely-used such measure is Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002), which quantifies the level of agreement between two partitions P_i and P_j:

    NMI(P_i, P_j) = \frac{I(P_i, P_j)}{\sqrt{H(P_i) H(P_j)}}    (6)

where I(P_i, P_j) is the mutual information between the assignments in the two partitions and H(P_i) is the entropy of the assignments in P_i alone. We can compute the overall level of agreement between a set of r partitions generated by r runs of an algorithm on the same corpus as the mean Pairwise Normalized Mutual Information (PNMI) for all pairs:

    PNMI = \frac{1}{r \times (r-1)} \sum_{i,j,\, i \neq j}^{r} NMI(P_i, P_j)    (7)

where P_i is the partition produced from the document-topic associations in model M_i. If the partitions across all models are identical, PNMI will yield a value of 1.
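A minimal sketch of Eqns. 6-7 (again assuming scikit-learn, not the authors' released code) follows; the "geometric" averaging option matches the square-root normalization in Eqn. 6, and since NMI is symmetric, averaging over unordered pairs is equivalent to Eqn. 7.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def partition(W):
    # Dominant topic per document: argmax over the document-topic weights
    # (topic probabilities for LDA, rows of the W factor for NMF).
    return np.asarray(W).argmax(axis=1)

def pnmi(weight_matrices):
    # Eqn. 7: mean pairwise NMI over the partitions from r runs.
    parts = [partition(W) for W in weight_matrices]
    scores = [normalized_mutual_info_score(p, q, average_method="geometric")
              for p, q in combinations(parts, 2)]
    return float(np.mean(scores))
```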
3.3. Examples of Instability

We now provide examples of how the measures proposed above can reflect the problem of instability, using a corpus of news articles from the New York Times as an example (the nytimes-2003 corpus described later in Section 5).

Firstly, to illustrate the issue of term instability, we consider topic models generated for r = 100 runs of randomly-initialized NMF and LDA, with a fixed number of topics k = 7 (corresponding to the number of annotated categories in the data). Fig. 1 shows the ADSD scores for each algorithm, as the number of top terms in the topic descriptors increases from 10 to 100. We can observe that, even with this relatively relaxed measure, there exists substantial variation in the terms appearing in the models for both algorithms. If we represent each topic using 10 terms, the ADSD score is as high as 0.47 for LDA and 0.31 for NMF. Even when we extend the topic descriptors to contain 100 terms, which we might expect to capture the bulk of the key terms for the topics in this corpus, the mean difference across runs is 0.23 and 0.19 respectively.

[Figure 1: Plot of the Average Descriptor Set Difference (ADSD) for 100 runs of NMF and LDA on the nytimes-2003 corpus, for increasing numbers of top terms representing each topic.]

We can explore this term instability further by inspecting the topic-term stability of the topics. As an example, Table 1 refers to five separate runs of a related topic (corresponding to the category "sport" in the ground truth for this corpus) for NMF and LDA. For both algorithms it is clear that the ordering of the top terms for this topic can change considerably, and it is also possible for terms to completely disappear between different runs.

Table 1: Top-ranked terms for topics related to "sport", generated across five runs of randomly-initialized NMF and LDA applied to a news corpus.

  # | LDA Top 10 Terms
  1 | game, team, season, play, coach, games, points, players, against, football
  2 | game, season, team, coach, play, games, points, league, football, players
  3 | game, season, coach, team, football, league, giants, play, jets, players
  4 | game, season, team, yankees, games, play, mets, nets, left, league
  5 | game, team, season, play, games, players, coach, yankees, time, against

  # | NMF Top 10 Terms
  1 | game, season, team, yankees, games, nets, play, points, players, coach
  2 | game, team, season, nets, points, games, coach, play, knicks, players
  3 | game, team, season, nets, points, games, play, coach, knicks, giants
  4 | game, nets, team, season, coach, points, knicks, jets, giants, play
  5 | game, team, season, nets, points, games, play, knicks, coach, kidd

Using the same corpus, we can also explore the partition stability afforded by both NMF and LDA. Fig. 2 plots the distributions of the NMI agreement scores between all pairs of partitions corresponding to the set of 100 models produced by each algorithm. Two partitions with identical document assignments would yield an NMI score of 1. However, only 0.4% of all pairs of partitions achieve this for NMF. In the case of LDA, of the 4950 unique pairs of partitions, none achieve perfect agreement and only 0.3% achieve an NMI score ≥ 0.9. When we average the agreement scores over all runs, the overall PNMI scores for the two algorithms are 0.78 and 0.66 respectively, indicating there is considerable variation in the outputs of both algorithms across these runs.

[Figure 2: Distribution of pairwise NMI agreement scores for document partitions resulting from 100 runs of (a) NMF, (b) LDA on the nytimes-2003 corpus.]

Again, manually inspecting the top-ranked documents in the topics for these models reveals the extent of the variation.

Table 2: Top-ranked documents for topics related to "sport", generated across five runs of randomly-initialized NMF and LDA applied to a news corpus.

  # | LDA Top 10 Documents
  1 | s4310, s5376, s4247, s5262, s6055, s1493, s3167, s5670, s4972, s6636
  2 | s4441, s6267, s4247, s3521, s8146, s5262, s4708, s4681, s4460, s8937
  3 | s0113, s3521, s4708, s1698, s5299, s4972, s8577, s8351, s5855, s6834
  4 | s6267, s0113, s5376, s9894, s3521, s5262, s9116, s4681, s4708, s8937
  5 | s6267, s0113, s9894, s4247, s5262, s4681, s2056, s4972, s8577, s6636

  # | NMF Top 10 Documents
  1 | s5995, s6558, s3547, s9993, s5281, s8029, s2484, s5114, s1227, s2934
  2 | s6558, s5995, s8029, s5457, s5843, s2484, s9993, s1227, s5193, s2068
  3 | s8029, s6558, s5995, s5457, s5843, s2484, s1227, s9993, s5193, s2068
  4 | s5457, s8029, s5843, s6558, s9993, s5193, s2068, s7687, s2484, s9924
  5 | s8029, s6558, s5995, s5457, s5843, s2484, s1227, s9993, s5193, s2265
Table 2 lists the identifiers of the top ten documents assigned to each of the topics related to "sport" selected from the five runs of NMF and LDA in Table 1. We observe that, similar to the case of the top terms, the ordering of documents is also subject to the same inherent instability.

3.4. Redundant Stability

While the examples above demonstrate that the production of robust, reliable topic models is important, it is also necessary to emphasize that stability should not be the sole requirement for a useful topic modeling algorithm. As observed by Ben-Hur et al. (2002) in the context of partitional clustering, in some situations stability can simply be indicative of an algorithm's tendency to converge to a given local solution, regardless of the quality of that solution. In the context of NMF, we could initialize the factors W and H in a deterministic way with arbitrary non-negative values. However, this "redundant stability" is unlikely to provide a useful model. Therefore, in the next section we propose techniques that yield solutions that are not only stable but also accurate – i.e. the topics are semantically coherent and provide a useful insight into the content of the corpus.

4. Methods

We now propose ensemble methods for topic modeling via matrix factorization, which can be utilized to address the issue of stability, while also potentially producing more accurate topic models for a corpus of unstructured text.

4.1. Basic Ensemble Method

We apply ensemble learning for topic modeling in the form of two layers of matrix factorization. Fig. 3 shows an overview of the method, which can naturally be divided into two steps, similar to existing strategies in ensemble clustering (Strehl and Ghosh, 2002):

1. Generation: Create a set of base topic models by executing r runs of NMF applied to the same corpus, represented as a document-term matrix A.

2. Integration: Transform the base topic models to a suitable intermediate representation, and apply a final run of NMF to produce a single ensemble topic model, which represents the final output of the method.

[Figure 3: Overview of a general ensemble strategy for topic modeling with matrix factorization, consisting of two steps: generation and integration.]

We now discuss both of these steps in more detail.

4.1.1. Ensemble Generation

Unsupervised ensemble procedures typically seek to encourage diversity with a view to improving the quality of the information available in the integration phase (Topchy et al., 2005). Therefore, we create a diverse set of r base topic models (i.e. the topic term descriptors and document assignments will differ from one base model to another). Here we encourage diversity by relying on the inherent instability of NMF with random initialization – we generate each base model by populating the factors W and H with values based on a different random seed, and then applying NMF to A. In each case we use a fixed pre-specified value for the number of topics k. After each run, the H factor from the base topic model (i.e. the topic-term weight matrix) is stored for later use.
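The generation step can be sketched as follows (a minimal illustration using scikit-learn, assuming the same conventions as the earlier snippets; the parameter defaults mirror those used later in Section 5).

```python
from sklearn.decomposition import NMF

def generate_base_models(A, k, r=100, max_iter=100):
    # Generation step: r randomly-initialized NMF runs on the same
    # document-term matrix A. Only the H factors (topic-term weights)
    # are retained for the later integration step.
    factors = []
    for seed in range(r):
        model = NMF(n_components=k, init="random",
                    random_state=seed, max_iter=max_iter)
        model.fit_transform(A)              # the W factor is not needed here
        factors.append(model.components_)   # H: k x m non-negative weights
    return factors
```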
4.1.2. Ensemble Integration

Once we have generated a collection of r factorizations, in the second step we create a new representation of our corpus in the form of a topic-term matrix M. The matrix is created by stacking the transpose of each H factor generated in the first step, as illustrated in Fig. 4, where each factor consists of k topics {t_1, ..., t_k} and m terms {w_1, ..., w_m}. We construct this topic-term matrix because we may often expect to see similar topics appearing between different runs. However, they may not be identical with respect to their terms, and we wish to leverage this variance. It is important to note that this process of combining the factors is order independent. This results in a matrix M where each row corresponds to a topic from one of the base topic models, and each column is a term from the original corpus. Each entry M_{ij} holds the weight of association for term j in relation to the i-th base model topic.

[Figure 4: Illustration of the stacking process where the H factors from r topic models, each consisting of k topics and m terms, are combined to create a single topic-term matrix.]

Once we have created M, we apply the second layer of NMF to this matrix to produce the final ensemble topic model. The reasoning behind applying NMF a second time to these topic descriptors is that they explicitly capture the variance between the base topic models. To improve the quality of the resulting topics, we generate initial factors using NNDSVD initialization (Boutsidis and Gallopoulos, 2008). As an input parameter to NMF, we specify a final number of k′ topics, which is typically set to be the same as the value k used in the generation step. The resulting H factor provides weights for the terms for each of the k′ ensemble topics – the top-ranked terms in each row can be used as descriptors for a topic. To produce weights for the original documents in our corpus, we can "fold" the documents into the ensemble model by applying a projection to the document-term matrix A:

    D = A \cdot H^T

Each row of D now corresponds to a document, with columns corresponding to the k′ ensemble topics. An entry D_{ij} indicates the strength of association of document i with ensemble topic j.
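The integration step then amounts to one stacking operation, one further factorization, and one projection; a sketch under the same assumptions as above:

```python
import numpy as np
from sklearn.decomposition import NMF

def integrate(factors, A, k_final):
    # Integration step: stack the base H factors row-wise to form the
    # topic-term matrix M (one row per base topic), then factorize M
    # again with NNDSVD initialization to obtain the k' ensemble topics.
    M = np.vstack(factors)                       # (r * k) x m
    model = NMF(n_components=k_final, init="nndsvd", max_iter=100)
    model.fit_transform(M)
    H = model.components_                        # k' x m ensemble topic-term weights
    D = A @ H.T                                  # fold documents back in: n x k'
    return H, D
```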
4.2. K-Fold Ensemble Method

While the basic ensemble generation approach described in Section 4.1.1 does yield a diverse set of base topic models, the use of random initialization means that some of these models will correspond to poor local minima with low accuracy. Furthermore, given the number of possible initial factors that could be generated in this way, there is still potential for several runs of the complete ensemble process to yield somewhat different final results. Therefore, we consider the use of improved initialization to generate more accurate base models, while also using a more structured strategy to create the models in order to reduce variability. This strategy is based on traditional k-fold cross-validation as performed in evaluation in supervised learning.

In our case, we randomly divide the corpus of documents into f folds of equal size. Each of the f folds is excluded in turn, and we apply NMF with NNDSVD initialization to the documents from the remaining (f − 1) folds, yielding f models. To reduce variability, we repeat the process for p rounds using different splits of the data, yielding a total of p × f topic models, each generated on a large subsample of the corpus. This collection of base topic models is then integrated as described in Section 4.1.2 to produce a final topic model. A full summary of this approach is given in Fig. 5.

Figure 5: Summary of the K-Fold ensemble topic modeling method.

1. Construct the full document-term matrix A for the corpus.
2. For p rounds:
   (a) Randomly divide the corpus into f folds.
   (b) For each of the f folds:
       i. Exclude the current fold.
       ii. Apply NMF with NNDSVD initialization to the documents in A from the other (f − 1) folds to generate k topics.
3. From the resulting p × f models, construct a topic-term matrix M by stacking the transpose of each H factor.
4. Apply NMF with NNDSVD initialization to M to generate k′ topics, where the factor H gives the final topic-term associations.
5. Compute D = A · H^T to find the final document-topic associations.
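One possible realization of the generation phase of this procedure (steps 1-2 above) is sketched below; the use of scikit-learn's KFold splitter is an assumption on my part rather than the authors' implementation, and the round index reused as a random seed is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.model_selection import KFold

def kfold_base_models(A, k, p=10, f=10, max_iter=100):
    # K-Fold generation: in each of p rounds, split the n documents into
    # f folds; each fold is excluded in turn and NNDSVD-initialized NMF
    # is applied to the remaining (f - 1) folds, giving p * f H factors.
    factors = []
    n = A.shape[0]
    for round_num in range(p):
        splitter = KFold(n_splits=f, shuffle=True, random_state=round_num)
        for keep_idx, _ in splitter.split(np.arange(n)):
            model = NMF(n_components=k, init="nndsvd", max_iter=max_iter)
            model.fit_transform(A[keep_idx])   # documents of the other f-1 folds
            factors.append(model.components_)
    return factors
```

The resulting p × f factors are then passed to the same integration step sketched in Section 4.1.2 (steps 3-5 above).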
5. Evaluation

In this section we comprehensively assess the problem of instability in topic modeling for standard NMF and LDA approaches on a diverse collection of corpora, and examine the extent to which superior initialization and ensemble methods can improve the stability of NMF-based approaches, while also yielding accurate models.

5.1. Datasets

For our experiments, we use a diverse set of ten corpora, including both high-quality long texts and user-generated content. All of these corpora have human-annotated "ground truth" topical categories, allowing us to evaluate model accuracy. Six of these datasets consist of news articles from individual mainstream news sources (BBC, The New York Times, The Guardian, and The Irish Times), categorized by subject matter. Two more datasets consist of pages from a 2014 Wikipedia dump, categorized by their associated WikiProject. These eight datasets were previously used for topic modeling evaluations (Greene et al., 2014). We also include the popular 20-newsgroups dataset, where the ground truth categories correspond to individual newsgroups (e.g. "comp.graphics", "comp.windows.x", "rec.autos"). To evaluate performance on social media data, we include a newly-collected corpus in our experiments, known as the 20-topics dataset, which consists of 4,170,382 tweets from 1,200 prominent Twitter accounts. These accounts have been manually assigned to 20 different categories (e.g. "aviation", "health", "tech"). Each document in the corpus corresponds to the concatenation of the tweets posted by a single user for a given week during the period March 2015 to February 2016. The corpus contains 40,498 such "user documents". A detailed summary of all corpora used in our experiments is provided in Table 3.

Table 3: Details of the ten corpora used in our experiments, including the total number of documents n, number of terms m, and number of categories k̂ in the associated "ground truth" annotations.

  Corpus          | n      | m      | k̂  | Description
  bbc             | 2,225  | 3,121  | 5  | General news articles from the BBC from 2003.
  bbc-sport       | 737    | 969    | 5  | Sports news articles from the BBC from 2003.
  guardian-2013   | 6,520  | 10,801 | 6  | News articles published by The Guardian during 2013.
  irishtimes-2013 | 3,246  | 4,832  | 7  | News articles published by The Irish Times during 2013.
  nytimes-1999    | 9,551  | 12,987 | 4  | A subset of the New York Times Annotated Corpus from 1999.
  nytimes-2003    | 11,527 | 15,001 | 7  | A subset of the New York Times Annotated Corpus from 2003.
  wikipedia-high  | 5,738  | 17,311 | 6  | Subset of a 2014 Wikipedia dump, where articles are assigned labels based on their high-level WikiProject.
  wikipedia-low   | 4,986  | 15,441 | 10 | Subset of a 2014 Wikipedia dump, where articles are labeled with fine-grained WikiProject sub-groups.
  20-newsgroups   | 18,662 | 9,954  | 20 | Collection of posts from 20 different internet newsgroups.
  20-topics       | 40,498 | 32,464 | 20 | Tweets from user accounts associated with 20 different topical areas.

5.2. Experimental Setup

When pre-processing the corpora, terms appearing in fewer than 20 documents are filtered. We use a single list of common English stop-words for all datasets. LDA operates on bag-of-words text representations, and so was applied to the raw frequency values. For NMF, the same documents were transformed to log-based Term Frequency-Inverse Document Frequency (TF-IDF) vectors, and document length normalization was subsequently applied to produce the final document-term matrix. (A sketch of this pre-processing appears after the parameter list below.)

In our experiments, we compare five different topic modeling approaches:

1. Standard LDA with random seeding, using the popular Mallet implementation with Gibbs sampling (McCallum, 2002).
2. NMF with random initialization, using the fast alternating least squares variant proposed by Lin (2007) and provided by the scikit-learn toolkit (Pedregosa et al., 2011).
3. NMF with non-random NNDSVD initialization, also implemented in scikit-learn.
4. Basic ensemble topic modeling for matrix factorization with random initialization, as described in Section 4.1.
5. K-Fold ensemble topic modeling for matrix factorization combined with improved initialization, as described in Section 4.2.

For these approaches, there are a number of common and distinct parameters which need to be specified:

Common parameters: For all approaches, the number of topics k is set to correspond to the number of ground truth categories for each dataset.

NMF parameters: For NMF with both random and NNDSVD initialization, the maximum number of iterations is set to 100 by default. For the random case, a different random seed is used for each run to populate values in the initial factors W and H. This process is repeated for r = 100 runs.

LDA parameters: The LDA algorithm has two additional hyperparameters. We use the Mallet default values, with α = 5 and β = 0.01. The maximum number of iterations is set to 1000. For each run, a different random seed is used to initialize the Gibbs sampling process. This process is repeated for r = 100 runs.

Basic ensemble: For the first ensemble approach, we integrate a collection of 100 members, generated via random initialization. The final number of topics k′ is set to be the same as the number of ground truth categories for each dataset. This entire process is repeated 20 times to allow us to assess stability.

K-Fold ensemble: For the second ensemble approach, we apply p = 10 rounds of f = 10 folds, thus also yielding a collection of 100 ensemble members for integration, with k′ determined as above. Again this entire process is repeated 20 times.
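A scikit-learn analogue of the NMF pre-processing described above might look as follows; note that the built-in "english" stop-word list and the synthetic placeholder corpus are assumptions standing in for the authors' own stop-word list and datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus large enough for the min_df=20 threshold to apply.
docs = ["sample text about topic %d" % (i % 3) for i in range(100)]

vectorizer = TfidfVectorizer(min_df=20,            # drop terms in < 20 documents
                             stop_words="english", # common English stop-words
                             sublinear_tf=True,    # log-based term frequencies
                             norm="l2")            # document length normalization
A = vectorizer.fit_transform(docs)  # final document-term matrix for NMF
```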
5.3. Model Stability

To assess the stability of a collection of models generated by each algorithm, we use the term-based measures ADSD (Eqn. 2) and ATS (Eqn. 5) with the top t = 10 terms for each topic, and the document-level PNMI measure (Eqn. 7). Results for these measures are shown in Tables 4, 5, and 6 respectively. Across all three measures, we observe that the NNDSVD NMF and K-Fold approaches clearly yield the most stable results. Both of these methods produce models with perfect stability for the majority of our datasets – i.e. they yield models in which the topic-term and document-topic associations remain the same. As expected, the randomly-initialized approaches perform the worst, due to the inherent instability caused by their stochastic elements, as reflected in their standard deviation scores.

While the basic ensemble approach yields high stability for the smaller corpora, we do see some variation between different runs at the term and document level for the larger corpora. Here, the random initialization used when generating the ensemble members still leads to variation in the results at the final ensemble integration phase, even with 100 ensemble members. However, the more structured nature of the generation phase for the K-Fold approach effectively negates this problem.

It is interesting to observe that, for the 20-newsgroups and 20-topics datasets, which contain noisier user-generated content and a larger number of underlying topics, the K-Fold ensemble approach yields higher levels of stability than NNDSVD-initialized NMF. This suggests that combining the subsampling element of the ensemble process with a structured NNDSVD initialization produces a more reliable solution.

It is important to note that the widely-used implementation of NNDSVD provided by the scikit-learn toolkit, as used in our experiments, relies on an approximate truncated singular value decomposition method involving randomization, in order to make it applicable to large data matrices. While the resulting decompositions are often identical, this is not always the case. Computing a full SVD would eliminate the instability, with the trade-off that the running time requirements for decomposing a large, high-dimensional document-term matrix would increase dramatically.

Table 4: Model stability — Comparison of Average Descriptor Set Difference (ADSD) scores for five topic modeling approaches.

  Corpus          | LDA         | NMF         | NNDSVD      | Ensemble    | K-Fold
  bbc             | 0.14 ± 0.15 | 0.15 ± 0.25 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
  bbc-sport       | 0.29 ± 0.14 | 0.21 ± 0.21 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
  guardian-2013   | 0.15 ± 0.16 | 0.18 ± 0.20 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.00 ± 0.00
  irishtimes-2013 | 0.35 ± 0.13 | 0.42 ± 0.22 | 0.00 ± 0.00 | 0.01 ± 0.01 | 0.01 ± 0.01
  nytimes-1999    | 0.23 ± 0.18 | 0.36 ± 0.22 | 0.00 ± 0.00 | 0.21 ± 0.23 | 0.00 ± 0.00
  nytimes-2003    | 0.47 ± 0.13 | 0.31 ± 0.18 | 0.00 ± 0.00 | 0.01 ± 0.01 | 0.00 ± 0.00
  wikipedia-high  | 0.26 ± 0.12 | 0.21 ± 0.13 | 0.00 ± 0.00 | 0.01 ± 0.01 | 0.00 ± 0.00
  wikipedia-low   | 0.21 ± 0.06 | 0.12 ± 0.09 | 0.00 ± 0.00 | 0.06 ± 0.07 | 0.00 ± 0.01
  20-newsgroups   | 0.31 ± 0.06 | 0.32 ± 0.09 | 0.18 ± 0.08 | 0.12 ± 0.05 | 0.05 ± 0.04
  20-topics       | 0.23 ± 0.06 | 0.23 ± 0.09 | 0.07 ± 0.07 | 0.11 ± 0.06 | 0.01 ± 0.01

Table 5: Model stability — Comparison of term stability scores for five topic modeling approaches, calculated using Average Term Stability (ATS).

  Corpus          | LDA         | NMF         | NNDSVD      | Ensemble    | K-Fold
  bbc             | 0.86 ± 0.14 | 0.88 ± 0.21 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  bbc-sport       | 0.68 ± 0.15 | 0.80 ± 0.20 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  guardian-2013   | 0.83 ± 0.18 | 0.83 ± 0.19 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  irishtimes-2013 | 0.61 ± 0.14 | 0.64 ± 0.18 | 1.00 ± 0.00 | 0.99 ± 0.01 | 0.99 ± 0.01
  nytimes-1999    | 0.65 ± 0.19 | 0.72 ± 0.17 | 1.00 ± 0.00 | 0.82 ± 0.19 | 1.00 ± 0.00
  nytimes-2003    | 0.53 ± 0.12 | 0.76 ± 0.14 | 1.00 ± 0.00 | 0.99 ± 0.01 | 1.00 ± 0.00
  wikipedia-high  | 0.79 ± 0.15 | 0.77 ± 0.13 | 1.00 ± 0.00 | 0.99 ± 0.01 | 1.00 ± 0.00
  wikipedia-low   | 0.79 ± 0.10 | 0.88 ± 0.08 | 1.00 ± 0.00 | 0.94 ± 0.07 | 0.99 ± 0.01
  20-newsgroups   | 0.65 ± 0.07 | 0.66 ± 0.09 | 0.81 ± 0.08 | 0.86 ± 0.05 | 0.93 ± 0.05
  20-topics       | 0.69 ± 0.08 | 0.78 ± 0.08 | 0.94 ± 0.07 | 0.89 ± 0.06 | 0.99 ± 0.01

Table 6: Model stability — Comparison of document stability scores for five topic modeling approaches, based on Pairwise Normalized Mutual Information (PNMI).

  Corpus          | LDA         | NMF         | NNDSVD      | Ensemble    | K-Fold
  bbc             | 0.86 ± 0.09 | 0.89 ± 0.10 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  bbc-sport       | 0.74 ± 0.08 | 0.87 ± 0.09 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  guardian-2013   | 0.84 ± 0.07 | 0.88 ± 0.07 | 1.00 ± 0.00 | 1.00 ± 0.00 | 1.00 ± 0.00
  irishtimes-2013 | 0.73 ± 0.07 | 0.81 ± 0.07 | 1.00 ± 0.00 | 0.99 ± 0.00 | 1.00 ± 0.00
  nytimes-1999    | 0.69 ± 0.09 | 0.67 ± 0.14 | 1.00 ± 0.00 | 0.83 ± 0.13 | 0.98 ± 0.01
  nytimes-2003    | 0.66 ± 0.06 | 0.78 ± 0.07 | 1.00 ± 0.00 | 0.97 ± 0.01 | 0.99 ± 0.00
  wikipedia-high  | 0.86 ± 0.07 | 0.86 ± 0.04 | 1.00 ± 0.00 | 0.99 ± 0.00 | 1.00 ± 0.00
  wikipedia-low   | 0.89 ± 0.04 | 0.89 ± 0.03 | 1.00 ± 0.00 | 0.96 ± 0.04 | 1.00 ± 0.00
  20-newsgroups   | 0.63 ± 0.02 | 0.69 ± 0.04 | 0.80 ± 0.05 | 0.89 ± 0.04 | 0.96 ± 0.03
  20-topics       | 0.84 ± 0.03 | 0.84 ± 0.03 | 0.97 ± 0.03 | 0.97 ± 0.02 | 1.00 ± 0.00

5.4. Model Quality

The primary focus of our work is on model stability. But as noted in Section 3.4, stability without meaningful and coherent topics is unlikely to be useful. Therefore we consider the quality of the models produced by each of the five methods, in terms of accuracy and coherence.

Partition accuracy: In either the case of NMF or LDA, we can convert a model's document-topic associations into a disjoint partition. We can then compare this partition with the corresponding disjoint "ground truth" categories for each corpus using NMI (Strehl and Ghosh, 2002).

Topic coherence: Coherence refers to the overall quality and the semantic relatedness of the terms appearing in a topic descriptor. While a range of measures have been proposed in the literature, we employ a widely-used measure, Normalized Pointwise Mutual Information (NPMI) (Bouma, 2009), which uses term co-occurrence counts from the full corpus to measure the average coherence of the topics in a given model, based on the top t terms in their descriptors. In our evaluations we use t = 10 terms. One common way of estimating such a coherence score is sketched below.
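The sketch that follows is an illustrative formulation of NPMI-based coherence from document-level co-occurrence counts, not necessarily the exact estimator of Bouma (2009) used in the paper; it assumes A is a binary scipy sparse document-term occurrence matrix and topics are given as lists of term indices.

```python
import numpy as np
from itertools import combinations

def npmi_coherence(topics, A, t=10, eps=1e-12):
    # topics: list of topics, each a list of term indices ranked by weight.
    # A: binary document-term occurrence matrix (scipy sparse, n x m).
    n = A.shape[0]
    topic_scores = []
    for topic in topics:
        pair_scores = []
        for i, j in combinations(topic[:t], 2):
            p_i = A[:, i].sum() / n
            p_j = A[:, j].sum() / n
            p_ij = A[:, i].multiply(A[:, j]).sum() / n     # co-occurrence probability
            pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
            pair_scores.append(pmi / -np.log(p_ij + eps))  # normalize to [-1, 1]
        topic_scores.append(np.mean(pair_scores))
    return float(np.mean(topic_scores))  # model coherence: mean over topics
```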
With regards to the NPMI coherence of the topics produced, Table 7 shows that our proposed basic ensemble and K-Fold approaches perform the best. However, it should be noted that in most cases the differences in the average coherence scores are small. The most noticeable gap is between the LDA approach and the other NMF-based approaches, which may reflect the tendency of LDA to produce more generic and less semantically-coherent terms (O'Callaghan et al., 2015).

Table 7: Model quality — Comparison of topic coherence scores for five topic modeling approaches, based on Normalized Pointwise Mutual Information (NPMI).

  Corpus          | LDA         | NMF         | NNDSVD      | Ensemble    | K-Fold
  bbc             | 0.09 ± 0.01 | 0.15 ± 0.01 | 0.16 ± 0.00 | 0.16 ± 0.00 | 0.16 ± 0.00
  bbc-sport       | 0.11 ± 0.01 | 0.14 ± 0.01 | 0.14 ± 0.00 | 0.15 ± 0.00 | 0.14 ± 0.00
  guardian-2013   | 0.11 ± 0.01 | 0.15 ± 0.01 | 0.16 ± 0.00 | 0.16 ± 0.00 | 0.16 ± 0.00
  irishtimes-2013 | 0.08 ± 0.01 | 0.12 ± 0.01 | 0.13 ± 0.00 | 0.13 ± 0.00 | 0.13 ± 0.00
  nytimes-1999    | 0.10 ± 0.01 | 0.12 ± 0.01 | 0.12 ± 0.01 | 0.12 ± 0.01 | 0.12 ± 0.01
  nytimes-2003    | 0.10 ± 0.03 | 0.12 ± 0.01 | 0.12 ± 0.01 | 0.10 ± 0.02 | 0.12 ± 0.01
  wikipedia-high  | 0.15 ± 0.01 | 0.23 ± 0.01 | 0.24 ± 0.00 | 0.23 ± 0.00 | 0.24 ± 0.00
  wikipedia-low   | 0.19 ± 0.01 | 0.24 ± 0.01 | 0.25 ± 0.00 | 0.24 ± 0.00 | 0.25 ± 0.00
  20-newsgroups   | 0.12 ± 0.01 | 0.15 ± 0.01 | 0.16 ± 0.01 | 0.15 ± 0.00 | 0.16 ± 0.00
  20-topics       | 0.19 ± 0.01 | 0.30 ± 0.01 | 0.30 ± 0.00 | 0.29 ± 0.00 | 0.29 ± 0.00

To provide a clearer measure of model quality, we next consider the quality of the five methods by evaluating the partition accuracy with respect to the ground truth labels. Table 8 shows the means and standard deviations of the NMI scores for the methods on the ten corpora. Here we see that the best-performing algorithms are NNDSVD-initialized NMF and the K-Fold ensemble approach, although this varies with the dataset. The randomly-initialized algorithms exhibit considerable variation in the quality of the models they produce, as indicated by the standard deviation scores, and are worse on average than the ensemble and SVD-based methods, with the exception of the case where LDA is applied to the two New York Times corpora.

Table 8: Model quality — Comparison of document partition accuracy scores for five topic modeling approaches, based on Normalized Mutual Information (NMI).

  Corpus          | LDA         | NMF         | NNDSVD      | Ensemble    | K-Fold
  bbc             | 0.81 ± 0.06 | 0.79 ± 0.04 | 0.82 ± 0.00 | 0.79 ± 0.00 | 0.80 ± 0.00
  bbc-sport       | 0.68 ± 0.05 | 0.80 ± 0.06 | 0.83 ± 0.00 | 0.85 ± 0.00 | 0.85 ± 0.00
  guardian-2013   | 0.77 ± 0.04 | 0.82 ± 0.04 | 0.83 ± 0.00 | 0.84 ± 0.00 | 0.84 ± 0.00
  irishtimes-2013 | 0.68 ± 0.04 | 0.72 ± 0.04 | 0.77 ± 0.00 | 0.76 ± 0.00 | 0.77 ± 0.00
  nytimes-1999    | 0.64 ± 0.04 | 0.53 ± 0.05 | 0.49 ± 0.00 | 0.51 ± 0.02 | 0.51 ± 0.00
  nytimes-2003    | 0.60 ± 0.03 | 0.58 ± 0.04 | 0.55 ± 0.00 | 0.56 ± 0.01 | 0.58 ± 0.00
  wikipedia-high  | 0.69 ± 0.02 | 0.72 ± 0.01 | 0.72 ± 0.00 | 0.73 ± 0.00 | 0.74 ± 0.00
  wikipedia-low   | 0.84 ± 0.02 | 0.87 ± 0.03 | 0.86 ± 0.00 | 0.89 ± 0.02 | 0.88 ± 0.00
  20-newsgroups   | 0.48 ± 0.01 | 0.47 ± 0.02 | 0.49 ± 0.01 | 0.48 ± 0.01 | 0.48 ± 0.01
  20-topics       | 0.84 ± 0.02 | 0.83 ± 0.02 | 0.85 ± 0.01 | 0.87 ± 0.01 | 0.88 ± 0.00
5.5. Statistical Significance

To further investigate the differences between the algorithms, we performed a series of statistical tests on the results presented in the previous section. We carried out a non-parametric Friedman's Aligned Rank test (García et al., 2010) for each of the five measures previously reported (ADSD, ATS, PNMI, NPMI, and NMI) to test for the presence of statistically significant differences in the results amongst the five algorithms and across the ten datasets. These tests returned p-values of 0.000004 (ADSD), 0.000002 (ATS), 0.000003 (PNMI), 0.00002 (NPMI), and 0.118 (NMI) respectively. This indicates that statistically significant differences, at the 1% confidence level, exist in the results achieved by the different algorithms for each measure, except for partition accuracy (NMI).

To determine if there was a statistically significant difference between our proposed K-Fold algorithm and the other topic modeling approaches with respect to the four remaining measures, we performed a series of Friedman's Aligned Rank Pairwise post hoc tests (García et al., 2010), with the K-Fold approach used as a control. The results from these tests are reported in Table 9. It is notable that there is a statistically significant difference, at the 1% confidence level, between our proposed approach and the two randomly-initialized topic modeling algorithms across all measures, which, along with the previously reported performance of the algorithm, suggests that the K-Fold approach produces more stable and higher-quality topic models.
There is no statistically significant difference between our proposed K-Fold approach and the basic ensemble and NNDSVD approaches, which may indicate that these three methods behave similarly because they all produce more deterministic solutions; this notion is further strengthened by their similar performance on the measures previously reported. It is also interesting to note that there is a statistically significant difference between LDA and the K-Fold approach with regards to coherence, likely due to LDA-based approaches generating topics with lower coherence (O'Callaghan et al., 2015).

Table 9: Post hoc p-values based on Friedman's Aligned Rank Pairwise test for each of the four measures that were statistically significant, using the K-Fold approach as a control algorithm. * p ≤ 0.05, ** p ≤ 0.01, *** p ≤ 0.001, **** p ≤ 0.0001.

  Measure | LDA           | NMF          | NNDSVD | Ensemble | K-Fold
  ADSD    | 0.00001 ****  | 0.00002 **** | 0.645  | 0.273    | NA
  ATS     | 0.000002 **** | 0.0001 ****  | 0.713  | 0.250    | NA
  PNMI    | 0.000002 **** | 0.00007 **** | 0.576  | 0.304    | NA
  NPMI    | 0.00001 ****  | 0.111        | 0.921  | 0.452    | NA

5.6. Computational Expense

While the timing of algorithms can vary depending on the hardware and implementation utilized, it is useful to obtain an estimate of how much longer the proposed ensemble approaches take relative to traditional topic modeling algorithms. Each topic modeling approach was run 100 times and the average times are reported in Table 10. The experiments were carried out on a machine with 12 2.4GHz cores and 128GB of RAM. These running times are affected by numerous factors, including the number of documents in the corpus, the dimensionality of the corresponding document-term matrix, and the number of topics k selected for the corpus. As expected, it is clear that the two ensemble approaches take considerably longer to run than the other algorithms. This is naturally due to the underlying nature of their generation step, where 100 runs of NMF have to be generated. While the LDA Mallet implementation is widely used for topic modeling, we observe that it is considerably slower in the majority of our experiments; in fact, it is frequently slower than our proposed K-Fold approach, which utilizes a more structured but slower initialization step.

Table 10: Comparison of average running time in seconds for the five topic modeling approaches.

  Corpus          | LDA     | NMF   | NNDSVD | Ensemble | K-Fold
  bbc             | 40.74   | 0.22  | 0.30   | 22.68    | 30.00
  bbc-sport       | 33.34   | 0.08  | 0.08   | 7.90     | 8.56
  guardian-2013   | 231.33  | 1.98  | 1.94   | 198.89   | 189.87
  irishtimes-2013 | 82.75   | 0.85  | 0.86   | 86.76    | 84.69
  nytimes-1999    | 370.85  | 2.95  | 3.74   | 296.18   | 375.97
  nytimes-2003    | 454.36  | 6.11  | 4.83   | 613.00   | 538.57
  wikipedia-high  | 667.88  | 5.12  | 3.31   | 513.68   | 319.96
  wikipedia-low   | 627.28  | 6.42  | 4.07   | 644.70   | 380.85
  20-newsgroups   | 229.29  | 14.16 | 15.15  | 1421.44  | 1527.62
  20-topics       | 1802.16 | 72.07 | 115.02 | 7226.22  | 6793.77

5.7. Discussion

So far we have discussed the two main criteria for evaluating topic modeling algorithms separately: the evaluation of model quality, by examining topic coherence and partition accuracy, and the evaluation of model stability, by examining term stability and document stability. However, it is important to note that both criteria should be considered together – our evaluations highlight that some topic modeling approaches perform well with respect to one criterion, while performing poorly with respect to the other. An example of this can be seen in the results produced by NNDSVD for the nytimes-1999 dataset.
For term and document stability, perfect stability scores are achieved (Tables 5 and 6). However, when we take into account partition accuracy for the same dataset (Table 8), the quality of this solution is not as good as it initially appears. The NNDSVD initialization actually performs the worst with regards to accuracy in this case, with a low NMI score of 0.49. Similarly, while randomly-initialized LDA out-performs all other approaches on both New York Times corpora in terms of partition accuracy (Table 8), the corresponding stability scores are consistently poor across all measures (Tables 4–6). Following the discussion of "redundant stability" in Section 3.4, these results raise an interesting problem: while we strive to produce results that are as stable as possible, it may also be the case that these results are of poor quality from a model quality standpoint.

Among the two newly-proposed approaches, it is interesting to observe that the basic ensemble approach does not perform as well as the K-Fold approach, even though they are based on a similar ensemble process. In the case of the former, we aim to promote diversity when generating our base ensemble members by using randomly-initialized NMF, as motivated by previous work in both supervised and unsupervised ensemble learning (Brown et al., 2005; Kuncheva and Hadjitodorov, 2004). However, for larger datasets, the stochastic nature of this approach tends to cause the final results to contain a degree of variance across different runs of the overall ensemble. In contrast, by combining structured document subsampling with NNDSVD initialization to generate each ensemble member, the K-Fold approach exhibits very little instability across the 20 runs in our experiments, as indicated by the results in Tables 4–6. These findings correspond to those of Hadjitodorov et al. (2006), who demonstrated that a moderate level of diversity leads to useful ensembles in cluster analysis.

Overall, among the techniques considered in our experiments, the K-Fold ensemble approach produces the best models when taking into account both quality and stability. While the observed NMI scores were lower for certain datasets, it still performs better than the other methods with respect to half of our corpora. This strategy also appears to handle noisy user-generated data well, in comparison to alternative techniques.

6. Conclusions

While topic modeling methods such as LDA and NMF are widely applied in a range of domains to analyze unstructured text, researchers often do not consider the effect that random initialization has on the models produced by these methods. In this paper we have demonstrated that, for both methods, this can result in significant variations in the topics produced over multiple runs on the same corpus. This effect is manifested at both the term and document level, which can potentially lead to different human interpretations of the underlying thematic structure of the data.

To address the issue of instability in the context of NMF, we have investigated the extent to which improved algorithm initialization and ensemble strategies can produce more stable models, which are potentially also more accurate and insightful. We compared the performance of these approaches with regards to five different metrics that measure stability, accuracy, and coherence of topics.
Overall, among the techniques considered in our experiments, the K-Fold ensemble approach produces the best models when taking into account both quality and stability. While the observed NMI scores were lower for certain datasets, it still performs better than the other methods on half of our corpora. This strategy also appears to handle noisy user-generated data well in comparison to alternative techniques.

6. Conclusions

While topic modeling methods such as LDA and NMF are widely applied in a range of domains to analyze unstructured text, researchers often do not consider the effect that random initialization has on the models produced by these methods. In this paper we have demonstrated that, for both methods, this can result in significant variations in the topics produced over multiple runs on the same corpus. This effect is manifested at both the term and the document level, and can potentially lead to different human interpretations of the underlying thematic structure of the data.

To address the issue of instability in the context of NMF, we have investigated the extent to which improved algorithm initialization and ensemble strategies can produce more stable models, which are also potentially more accurate and insightful. We compared the performance of these approaches with regard to five different metrics that measure the stability, accuracy, and coherence of topics. Our results indicate that the new K-Fold ensemble approach afforded the most stable and accurate set of models, although initializing NMF based on an SVD approximation of the document-term matrix can also provide a clear improvement over standard NMF and LDA methods.

One concern that arises in the application of ensemble learning techniques in general relates to scalability. While the ensemble techniques described in this paper can be naturally parallelized, there is considerable scope for reducing the computation time required to generate the ensemble. A potentially promising idea to investigate in this context relates to the concept of snapshot ensembles from supervised learning (Huang et al., 2017), where a single algorithm run is allowed to converge to several local minima during the optimization process, each providing a contribution to the overall ensemble. Such an approach might also be used to yield more stable topic models via matrix factorization while reducing the computational expense.
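On the first point, the base runs in the generation step are mutually independent, so they can be distributed across cores with off-the-shelf tools. The sketch below uses joblib's Parallel and delayed; the fit_member helper and the assumption that document subsamples have already been prepared are ours, for illustration only.

```python
from joblib import Parallel, delayed
from sklearn.decomposition import NMF


def fit_member(X_sub, k):
    """Fit one NNDSVD-initialized base model on a document subsample."""
    model = NMF(n_components=k, init="nndsvd")
    model.fit(X_sub)
    return model.components_


def parallel_generation(subsamples, k, n_jobs=-1):
    """Distribute the independent base runs across all available cores.

    Only the integration step that combines the returned topic-term
    factors into a final model remains sequential.
    """
    return Parallel(n_jobs=n_jobs)(
        delayed(fit_member)(X_sub, k) for X_sub in subsamples
    )
```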
Acknowledgement. This research was supported by Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

References

Arora, S., Ge, R., and Moitra, A. (2012). Learning topic models – Going beyond SVD. In Proc. 53rd Symposium on Foundations of Computer Science, pages 1–10. IEEE.

Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. In Proc. 18th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics.

Ben-Hur, A., Elisseeff, A., and Guyon, I. (2002). A stability based method for discovering structure in clustered data. In Proc. 7th Pacific Symposium on Biocomputing, pages 6–17.

Bertoni, A. and Valentini, G. (2005). Random projections for assessing gene expression cluster stability. In Proc. IEEE International Joint Conference on Neural Networks (IJCNN'05), volume 1, pages 149–154.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Bouma, G. (2009). Normalized Pointwise Mutual Information in Collocation Extraction. In Proc. International Conference of the German Society for Computational Linguistics and Language Technology, GCSL '09.

Boutsidis, C. and Gallopoulos, E. (2008). SVD based initialization: A head start for non-negative matrix factorization. Pattern Recognition, 41(4):1350–1362.

Bradley, P. S. and Fayyad, U. M. (1998). Refining initial points for k-means clustering. In Proc. 15th International Conference on Machine Learning (ICML'98), volume 98, pages 91–99.

Brown, G., Wyatt, J., Harris, R., and Yao, X. (2005). Diversity creation methods: a survey and categorisation. Information Fusion, 6(1):5–20.

Celebi, M. E. and Kingravi, H. A. (2012). Deterministic initialization of the k-means algorithm using hierarchical clustering. International Journal of Pattern Recognition and Artificial Intelligence, 26(07):1250018.

Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., and Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407.

García, S., Fernández, A., Luengo, J., and Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10):2044–2064.

Greene, D., Cagney, G., Krogan, N., and Cunningham, P. (2008). Ensemble Non-negative Matrix Factorization Methods for Clustering Protein-Protein Interactions. Bioinformatics, 24(15):1722–1728.

Greene, D., O'Callaghan, D., and Cunningham, P. (2014). How Many Topics? Stability Analysis for Topic Models. In Proc. European Conference on Machine Learning (ECML'14), pages 498–513. Springer.

Hadjitodorov, S. T., Kuncheva, L. I., and Todorova, L. P. (2006). Moderate diversity for better cluster ensembles. Information Fusion, 7(3):264–275.

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., and Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get M for free. CoRR, abs/1704.00109.

Kuang, D., Choo, J., and Park, H. (2015). Nonnegative Matrix Factorization for Interactive Topic Modeling and Document Clustering. In Partitional Clustering Algorithms, pages 215–243. Springer International Publishing.

Kuhn, H. W. (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97.

Kuncheva, L. I. and Hadjitodorov, S. T. (2004). Using diversity in cluster ensembles. In Proc. IEEE International Conference on Systems, Man and Cybernetics, volume 2, pages 1214–1219. IEEE.

Kuncheva, L. I. and Vetrov, D. P. (2006). Evaluation of stability of k-means cluster ensembles with respect to random initialization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1798–1808.

Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791.

Lin, C.-J. (2007). Projected Gradient Methods for Nonnegative Matrix Factorization. Neural Computation, 19(10):2756–2779.

McCallum, A. K. (2002). Mallet: A machine learning for language toolkit. URL http://mallet.cs.umass.edu/

Minaei-Bidgoli, B., Topchy, A., and Punch, W. F. (2004). Ensembles of partitions via data resampling. In Proc. International Conference on Information Technology: Coding and Computing (ITCC'04).

Newman, D., Lau, J. H., Grieser, K., and Baldwin, T. (2010). Automatic Evaluation of Topic Coherence. In Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 100–108.

O'Callaghan, D., Greene, D., Carthy, J., and Cunningham, P. (2015). An analysis of the coherence of descriptors in topic modeling. Expert Systems with Applications (ESWA), 42(13):5645–5657.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830.

Pena, J. M., Lozano, J. A., and Larranaga, P. (1999). An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognition Letters, 20(10):1027–1040.

Pio, G., Ceci, M., Malerba, D., and D'Elia, D. (2015). ComiRNet: a web-based system for the analysis of miRNA-gene regulatory networks. BMC Bioinformatics, 16(9):S7.

Steyvers, M. and Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.

Strehl, A. and Ghosh, J. (2002). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617.

Su, T. and Dy, J. (2004). A deterministic method for initializing k-means clustering. In Proc. 16th IEEE International Conference on Tools with Artificial Intelligence, pages 784–786. IEEE.
Suh, S., Choo, J., Lee, J., and Reddy, C. K. (2016). L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization. In Proc. IEEE 16th International Conference on Data Mining.

Topchy, A., Jain, A., and Punch, W. (2005). Clustering ensembles: Models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12):1866–1881.

Wild, S., Curry, J., and Dougherty, A. (2004). Improving non-negative matrix factorizations through structured initialization. Pattern Recognition, 37(11):2217–2232.

Zhao, W., Chen, J. J., Perkins, R., Liu, Z., Ge, W., Ding, Y., and Zou, W. (2015). A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics, 16(13):1.