Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge

Ryan J. Gallagher1,2, Kyle Reing1, David Kale1, and Greg Ver Steeg1
1Information Sciences Institute, University of Southern California
2Vermont Complex Systems Center, Computational Story Lab, University of Vermont
ryan.gallagher@uvm.edu, {reing,kale,gregv}@isi.edu

Abstract

While generative models such as Latent Dirichlet Allocation (LDA) have proven fruitful in topic modeling, they often require detailed assumptions and careful specification of hyperparameters. Such model complexity issues only compound when trying to generalize generative models to incorporate human input. We introduce Correlation Explanation (CorEx), an alternative approach to topic modeling that does not assume an underlying generative model, and instead learns maximally informative topics through an information-theoretic framework. This framework naturally generalizes to hierarchical and semi-supervised extensions with no additional modeling assumptions. In particular, word-level domain knowledge can be flexibly incorporated within CorEx through anchor words, allowing topic separability and representation to be promoted with minimal human intervention. Across a variety of datasets, metrics, and experiments, we demonstrate that CorEx produces topics that are comparable in quality to those produced by unsupervised and semi-supervised variants of LDA.

1 Introduction

The majority of topic modeling approaches utilize probabilistic generative models, models which specify mechanisms for how documents are written in order to infer latent topics. These mechanisms may be explicitly stated, as in Latent Dirichlet Allocation (LDA) (Blei et al., 2003), or implicitly stated, as with matrix factorization techniques (Hofmann, 1999; Ding et al., 2008; Buntine and Jakulin, 2006). The core generative mechanisms of LDA, in particular, have inspired numerous generalizations that account for additional information, such as the authorship (Rosen-Zvi et al., 2004), document labels (McAuliffe and Blei, 2008), or hierarchical structure (Griffiths et al., 2004).

However, these generalizations come at the cost of increasingly elaborate and unwieldy generative assumptions. While these assumptions allow topic inference to be tractable in the face of additional metadata, they progressively constrain topics to a narrower view of what a topic can be. Such assumptions are undesirable in contexts where one wishes to minimize model complexity and learn topics without preexisting notions of how those topics originated.

For these reasons, we propose topic modeling by way of Correlation Explanation (CorEx),[1] an information-theoretic approach to learning latent topics over documents. Unlike LDA, CorEx does not assume a particular data generating model, and instead searches for topics that are "maximally informative" about a set of documents. By learning informative topics rather than generated topics, we avoid specifying the structure and nature of topics ahead of time.

[1] Open source, documented code for the CorEx topic model is available at https://github.com/gregversteeg/corex_topic.

In addition, the lightweight framework underlying CorEx is versatile and naturally extends to hierarchical and semi-supervised variants with no additional modeling assumptions. More specifically, we
may flexibly incorporate word-level domain knowledge within the CorEx topic model. Topic models are often susceptible to portraying only dominant themes of documents. Injecting a topic model, such as CorEx, with domain knowledge can help guide it towards otherwise underrepresented topics that are of importance to the user. By incorporating relevant domain words, we might encourage our topic model to recognize a rare disease that would otherwise be missed in clinical health notes, focus more attention on topics from news articles that can guide relief workers in distributing aid more effectively, or disambiguate aspects of a complex social issue.

Our contributions are as follows: first, we frame CorEx as a topic model and derive an efficient alteration to the CorEx algorithm that exploits sparse data, such as word counts in documents, for dramatic speedups. Second, we show how domain knowledge can be naturally integrated into CorEx through "anchor words" and the information bottleneck. Third, we demonstrate that CorEx and anchored CorEx produce topics of comparable quality to unsupervised and semi-supervised variants of LDA over several datasets and metrics. Finally, we carefully detail several anchoring strategies that highlight the versatility of anchored CorEx on a variety of tasks.

2 Methods

2.1 CorEx: Correlation Explanation

Here we review the fundamentals of Correlation Explanation (CorEx), and adopt the notation used by Ver Steeg and Galstyan in their original presentation of the model (2014). Let X be a discrete random variable that takes on a finite number of values, indicated with lowercase x. Furthermore, if we have n such random variables, let X_G denote a sub-collection of them, where G ⊆ {1, . . . , n}. The probability of observing X_G = x_G is written as p(X_G = x_G), which is typically abbreviated to p(x_G). The entropy of X is written as H(X), and the mutual information of two random variables X_1 and X_2 is given by I(X_1 : X_2) = H(X_1) + H(X_2) − H(X_1, X_2).

The total correlation, or multivariate mutual information, of a group of random variables X_G is expressed as

TC(X_G) = \sum_{i \in G} H(X_i) - H(X_G)   (1)
        = D_{KL}\left( p(x_G) \,\Big\|\, \prod_{i \in G} p(x_i) \right).   (2)

We see that Eq. 1 does not quantify "correlation" in the modern sense of the word, and so it can be helpful to conceptualize total correlation as a measure of total dependence. Indeed, Eq. 2 shows that total correlation can be expressed using the Kullback-Leibler divergence and, therefore, it is zero if and only if the joint distribution of X_G factorizes, or, in other words, there is no dependence between the random variables.

The total correlation can also be written conditioned on another random variable Y:

TC(X_G \mid Y) = \sum_{i \in G} H(X_i \mid Y) - H(X_G \mid Y).

So, we can consider the reduction in total correlation when conditioning on Y:

TC(X_G ; Y) = TC(X_G) - TC(X_G \mid Y)   (3)
            = \sum_{i \in G} I(X_i : Y) - I(X_G : Y).   (4)

The quantity expressed in Eq. 3 acts as a lower bound of TC(X_G) (Ver Steeg and Galstyan, 2015), as readily verified by noting that TC(X_G) and TC(X_G | Y) are always non-negative. Also note that the joint distribution of X_G factorizes conditional on Y if and only if TC(X_G | Y) = 0. If this is the case, then TC(X_G ; Y) is maximized, and Y explains all of the dependencies in X_G. In the context of topic modeling, X_G represents a group of word types and Y represents a topic to be learned.
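To make Eqs. 1 and 2 concrete, the following small numeric sketch (ours, not part of the original presentation) estimates the total correlation of two binary word indicators from a hypothetical joint distribution and checks that the entropy form and the KL form agree; for two variables, total correlation reduces to the mutual information I(X_1 : X_2).

```python
# Illustrative sketch (not from the paper): total correlation of two binary
# word-occurrence variables under a hypothetical joint distribution.
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability vector, skipping zero entries."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint p(X1, X2): the two words tend to co-occur.
joint = np.array([[0.45, 0.05],
                  [0.05, 0.45]])   # rows: X1 = 0/1, columns: X2 = 0/1
p_x1 = joint.sum(axis=1)           # marginal of X1
p_x2 = joint.sum(axis=0)           # marginal of X2

# Eq. 1: sum of marginal entropies minus the joint entropy.
tc_entropy = entropy(p_x1) + entropy(p_x2) - entropy(joint.ravel())

# Eq. 2: KL divergence between the joint and the product of marginals.
product = np.outer(p_x1, p_x2)
tc_kl = np.sum(joint * np.log2(joint / product))

print(tc_entropy, tc_kl)   # both ~0.53 bits; equals I(X1 : X2) here
```

If a latent topic Y rendered X_1 and X_2 conditionally independent, then TC(X_G | Y) would be zero and, by Eq. 3, Y would explain all of this dependence.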
Since we are always interested in grouping multiple sets of words into multiple topics, we will denote the binary latent topics as Y_1, . . . , Y_m and their corresponding groups of word types as X_{G_j} for j = 1, . . . , m, respectively. The CorEx topic model seeks to maximally explain the dependencies of words in documents through latent topics by maximizing TC(X ; Y_1, . . . , Y_m). To do this, we maximize the following lower bound on this expression:

\max_{G_j,\, p(y_j \mid x_{G_j})} \sum_{j=1}^{m} TC(X_{G_j} ; Y_j).   (5)

As we describe in the following section, this objective can be efficiently approximated, despite the search occurring over an exponentially large probability space (Ver Steeg and Galstyan, 2014).

Since each topic explains a certain portion of the overall total correlation, we may choose the number of topics by observing diminishing returns to the objective. Furthermore, since the CorEx implementation depends on a random initialization (as described shortly), one may restart the CorEx topic model several times and choose the run that explains the most total correlation.

The latent factors, Y_j, are optimized to be informative about dependencies in the data and do not require generative modeling assumptions. Note that the discovered factors, Y, can be used as inputs to construct new latent factors, Z, and so on, leading to a hierarchy of topics. Although this extension is quite natural, we focus our analysis on the first level of topic representations for easier interpretation and evaluation.

2.2 CorEx Implementation

We summarize the implementation of CorEx as presented by Ver Steeg and Galstyan (2014) in preparation for the innovations introduced in the subsequent sections. The numerical optimization for CorEx begins with a random initialization of parameters and then proceeds via an iterative update scheme similar to EM. For computational tractability, we subject the optimization in Eq. 5 to the constraint that the groups, G_j, do not overlap, i.e. we enforce single-membership of words within topics. The optimization entails a combinatorial search over groups, so instead we look for a form that is more amenable to smooth optimization. We rewrite the objective using the alternate form in Eq. 4 while introducing indicator variables α_{i,j}, which are equal to 1 if and only if word X_i appears in topic Y_j (i.e. i ∈ G_j):

\max_{\alpha_{i,j},\, p(y_j \mid x)} \sum_{j=1}^{m} \left( \sum_{i=1}^{n} \alpha_{i,j} I(X_i : Y_j) - I(X : Y_j) \right)
\quad \text{s.t.} \quad \alpha_{i,j} = \mathbb{I}\left[ j = \arg\max_{\bar{j}} I(X_i : Y_{\bar{j}}) \right].   (6)

Note that the constraint on non-overlapping groups now becomes a constraint on α. To make the optimization smooth, we relax the constraint so that α_{i,j} ∈ [0, 1]. To do so, we replace the second line with a softmax function. The update for α at iteration t becomes

\alpha^{t}_{i,j} = \exp\left( \lambda_t \left( I(X_i : Y_j) - \max_{\bar{j}} I(X_i : Y_{\bar{j}}) \right) \right).

Now α ∈ [0, 1], and the parameter λ controls the sharpness of the softmax function. Early in the optimization we use a small value of λ, then increase it later in the optimization to enforce a hard constraint. The objective in Eq. 6 only lower bounds total correlation in the hard max limit. The constraint on α forces competition among latent factors to explain certain words, while setting λ = 0 results in all factors learning the same thing. Holding α fixed, taking the derivative of the objective with respect to the variables p(y_j | x), and setting it equal to zero leads to a fixed point equation. We use this fixed point to define update equations at iteration t.
p_t(y_j) = \sum_{\bar{x}} p_t(y_j \mid \bar{x})\, p(\bar{x})   (7)

p_t(x_i \mid y_j) = \sum_{\bar{x}} p_t(y_j \mid \bar{x})\, p(\bar{x})\, \mathbb{I}[\bar{x}_i = x_i] \,/\, p_t(y_j)

\log p_{t+1}(y_j \mid x^{\ell}) = \log p_t(y_j) + \sum_{i=1}^{n} \alpha^{t}_{i,j} \log \frac{p_t(x^{\ell}_i \mid y_j)}{p(x^{\ell}_i)} - \log Z_j(x^{\ell})   (8)

The first two lines just define the marginals in terms of the optimization parameter, p_t(y_j | x). We take p(x) to be the empirical distribution defined by some observed samples, x^ℓ, ℓ = 1, . . . , N. The third line updates p_t(y_j | x^ℓ), the probabilistic labels for each latent factor, Y_j, for a given sample, x^ℓ. Note that an easily calculated constant, Z_j(x^ℓ), appears to ensure the normalization of p_t(y_j | x^ℓ) for each sample. We iterate through these updates until convergence.

After convergence, we use the mutual information terms I(X_i : Y_j) to rank which words are most informative for each factor. The objective is a sum of terms for each latent factor, and this allows us to rank the contribution of each factor toward our lower bound on the total correlation. The log of the normalization constant, log Z_j(x), is often called the free energy; its expectation, E[log Z_j(x)], provides a free estimate of the j-th term in the objective (Ver Steeg and Galstyan, 2015), as can be seen by taking the expectation of Eq. 8 at convergence and comparing it to Eq. 6. Because our sample estimate of the objective is just the mean of contributions from individual sample points, x^ℓ, we refer to log Z_j(x^ℓ) as the pointwise total correlation explained by factor j for sample ℓ. Pointwise TC can be used to localize which samples are particularly informative about specific latent factors.

2.3 Sparsity Optimization

2.3.1 Derivation

To alter the CorEx optimization procedure to exploit sparsity in the data, we now assume that all variables, x_i, y_j, are binary, and that x is a binary vector where x^ℓ_i = 1 if word i occurs in document ℓ and x^ℓ_i = 0 otherwise. Since all variables are binary, the marginal distribution, p(x_i | y_j), is just a two-by-two table of probabilities and can be estimated efficiently. The time-consuming part of training is the subsequent update of the document labels in Eq. 8 for each document ℓ. Computing the log likelihood ratio for all n words over all documents is not efficient, as most words do not appear in a given document. We therefore rewrite the logarithm in the interior of the sum:

\log \frac{p_t(x^{\ell}_i \mid y_j)}{p(x^{\ell}_i)} = \log \frac{p_t(X_i = 0 \mid y_j)}{p(X_i = 0)} + x^{\ell}_i \log \left( \frac{p_t(X_i = 1 \mid y_j)\, p(X_i = 0)}{p_t(X_i = 0 \mid y_j)\, p(X_i = 1)} \right).   (9)

Note that when the word does not appear in the document, only the leading term of Eq. 9 is nonzero. However, when the word does appear, everything but log p_t(X_i = 1 | y_j)/p(X_i = 1) cancels out. So, we have taken advantage of the fact that the CorEx topic model binarizes documents to assume by default that a word does not appear in the document, and then correct the contribution to the update if the word does appear.

Thus, when substituting back into Eq. 8, the sum becomes a matrix multiplication between a sparse matrix, with entries x^ℓ_i, whose dimensions are the number of word types by the number of documents, and a dense matrix whose dimensions are the number of word types by the number of latent factors.
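The following is a minimal sketch of the sparse form of the Eq. 8 update just described; all function and array names, and the exact shapes, are our own illustrative choices rather than the released implementation. The per-document sum over words splits into a document-independent baseline (every word assumed absent) plus one sparse-by-dense matrix product that touches only the nonzero entries of the document-word matrix.

```python
# Hedged sketch of the sparse Eq. 8 update; names and shapes are illustrative.
import numpy as np
from scipy.sparse import csr_matrix  # documents the expected type of X

def sparse_label_update(X, alpha, log_p_y, log_base, log_corr):
    """
    X        : csr_matrix (N documents x n words), binary occurrence data
    alpha    : (n, m) word-to-factor assignment weights
    log_p_y  : (m, 2) log p_t(y_j) for y_j in {0, 1}
    log_base : (n, m, 2) log [ p_t(X_i = 0 | y_j) / p(X_i = 0) ], Eq. 9 baseline
    log_corr : (n, m, 2) the Eq. 9 correction ratio, applied only where x_i = 1
    Returns the unnormalized log p_{t+1}(y_j | x^l), with shape (N, m, 2).
    """
    N, m = X.shape[0], alpha.shape[1]
    out = np.zeros((N, m, 2))
    for y in (0, 1):
        # Baseline: pretend every word is absent; identical for all documents.
        baseline = (alpha * log_base[:, :, y]).sum(axis=0)          # (m,)
        # Correction for words that do appear: one sparse (N x n) times
        # dense (n x m) product, costing O(rho) rather than O(N n).
        correction = X @ (alpha * log_corr[:, :, y])                # (N, m)
        out[:, :, y] = log_p_y[:, y] + baseline + correction
    # Normalizing over the two values of y_j (e.g. via logsumexp) recovers
    # log Z_j(x^l), the pointwise total correlation of factor j for document l.
    return out
```

Only the correction term depends on which words actually occur, which is why the cost of the update scales with the number of nonzero entries rather than with the full document-word matrix.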
Given n variables, N samples, and ρ nonzero entries in the data matrix, the asymptotic scaling for CorEx goes from O(Nn) to O(n) + O(N) + O(ρ) when exploiting sparsity. Latent tree modeling approaches are quadratic in n or worse, so we expect CorEx's computational advantage to increase for larger datasets.

[Figure 1 shows three log-log panels (Disaster Relief Articles, New York Times, PubMed) plotting time in seconds for CorEx, Sparse CorEx, and LDA against the number of documents (500 words fixed) and the number of words (5,000 documents fixed).]
Figure 1: Speed comparisons to a fixed number of iterations as the number of documents and words vary. New York Times articles and PubMed abstracts were collected from the UCI Machine Learning Repository (Lichman, 2013). The disaster relief articles are described in Section 4.1, and are represented simply as bags of words, not phrases.

2.3.2 Optimization Evaluation

We perform experiments comparing the running time of CorEx before and after implementing the improvements that exploit sparsity. We also compare with Scikit-Learn's simple batch implementation of LDA using the variational Bayes algorithm (Hoffman et al., 2013). Experiments were performed on a four-core Intel i5 chip running at 4 GHz with 32 GB RAM. We show run time when varying the data size in terms of the number of word types and the number of documents. We used 50 topics for all runs and set the number of iterations to 10 for LDA and 50 for CorEx. Results are shown in Figure 1. We see that CorEx exploiting sparsity is orders of magnitude faster than the naive version and is generally comparable to LDA as the number of documents scales. The slope on the log-log plot suggests a linear dependence of running time on the dataset size, as expected.

2.4 Anchor Words via the Bottleneck

The information bottleneck formulates a trade-off between compressing data X into a representation Y, and preserving the information in X that is relevant to Z (typically labels in a supervised learning task) (Tishby et al., 1999; Friedman et al., 2001). More formally, the information bottleneck is expressed as

\max_{p(y \mid x)} \beta I(Z : Y) - I(X : Y),   (10)

where β is a parameter controlling the trade-off between compressing X and preserving information about the relevance variable, Z.

To see the connection with CorEx, we compare the CorEx objective as written in Eq. 6 with the bottleneck in Eq. 10. We see that we have exactly the same compression term for each latent factor, I(X : Y_j), but the relevance variables now correspond to Z ≡ X_i. If we want to learn representations that are more relevant to specific keywords, we can simply anchor a word type X_i to topic Y_j by constraining our optimization so that α_{i,j} = β_{i,j}, where β_{i,j} ≥ 1 controls the anchor strength. Otherwise, the updates on α remain the same. This scheme is a natural extension of the CorEx optimization, and it is flexible, allowing for multiple word types to be anchored to one topic, for one word type to be anchored to multiple topics, or for any combination of these semi-supervised anchoring strategies.

3 Related Work

With respect to integrating domain knowledge into topic models, we draw inspiration from Arora et al. (2012), who used anchor words in the context of non-negative matrix factorization.
Using an assumption of separability, these anchor words act as high-precision markers of particular topics and, thus, help discern the topics from one another. Although the original algorithm proposed by Arora et al. (2012), and subsequent improvements to their approach, find these anchor words automatically (Arora et al., 2013; Lee and Mimno, 2014), recent adaptations allow manual insertion of anchor words and other metadata (Nguyen et al., 2014; Nguyen et al., 2015). Our work is similar to the latter, where we treat anchor words as fuzzy logic markers and embed them into the topic model in a semi-supervised fashion. In this sense, our work is closest to Halpern et al. (2014; 2015), who have also made use of domain expertise and semi-supervised anchored words in devising topic models.

There is an adjacent line of work that has focused on incorporating word-level information into LDA-based models. Jagarlamudi et al. (2012) proposed SeededLDA, a model that seeds words into given topics and guides, but does not force, these topics towards these integrated words. Andrzejewski and Zhu (2009) presented a model that makes use of "z-labels," words that are known to pertain to specific topics and that are restricted to appearing in some subset of all the possible topics. Although the z-labels can be leveraged to place different senses of a word into different topics, it requires additional effort to determine when these different senses occur. Our anchoring approach allows a user to more easily anchor one word to multiple topics, allowing CorEx to naturally find topics that revolve around different senses of a word.

Andrzejewski et al. (2009) presented a second model which allows specification of Must-Link and Cannot-Link relationships between words that help partition otherwise muddled topics. These logical constraints help enforce topic separability, though these mechanisms less directly address how to anchor a single word or set of words to help a topic emerge. More generally, the Must/Cannot link and z-label topic models have been expressed in a powerful first-order-logic framework that allows the specification of arbitrary domain knowledge through logical rules (Andrzejewski et al., 2011). Others have built off this first-order-logic approach to automatically learn rule weights (Mei et al., 2014) and incorporate additional latent variable information (Foulds et al., 2015).

Mathematically, CorEx topic models most closely resemble topic models based on latent tree reconstruction (Chen et al., 2016). In Chen et al.'s (2016) analysis, their own latent tree approach and CorEx both report significantly better perplexity than hierarchical topic models based on the hierarchical Dirichlet process and the Chinese restaurant process. CorEx has also been investigated as a way to find "surprising" documents (Hodas et al., 2015).

4 Data and Evaluation Methods

4.1 Data

We use two challenging datasets with corresponding domain knowledge lexicons to evaluate anchored CorEx. Our first dataset consists of 504,000 humanitarian assistance and disaster relief (HA/DR) articles covering 21 disaster types collected from ReliefWeb, an HA/DR news article aggregator sponsored by the United Nations. To mitigate overwhelming label imbalances during anchoring, we both restrict ourselves to documents in English with one label, and randomly subsample 2,000 articles from each of the largest disaster type labels.
This leaves us with a corpus of 18,943 articles.[2]

We accompany these articles with an HA/DR lexicon of approximately 34,000 words and phrases. The lexicon was curated by first gathering 40–60 seed terms per disaster type from HA/DR domain experts and CrisisLex. This term list was then expanded by creating word embeddings for each disaster type, and taking terms within a specified cosine similarity of the seed words. These lists were then filtered by removing names, places, non-ASCII characters, and terms with fewer than three characters. Finally, the extracted terms were audited using CrowdFlower, where users rated the relevance of the terms on a Likert scale. Low relevance terms were dropped from the lexicon. Of these terms, 11,891 types appear in the HA/DR articles.

Our second dataset consists of 1,237 deidentified clinical discharge summaries from the Informatics for Integrating Biology and the Bedside (i2b2) 2008 Obesity Challenge.[3] These summaries are labeled by clinical experts with 15 conditions frequently associated with obesity. For these documents, we leverage a text pipeline that extracts common medical terms and phrases (Dai et al., 2008; Chapman et al., 2001), which yields 3,231 such term types.

[2] HA/DR articles and accompanying lexicon available at http://dx.doi.org/10.7910/DVN/TGOPRU
[3] Data available upon data use agreement at https://www.i2b2.org/NLP/Obesity/

For both sets of documents, we use their respective lexicons to break the documents down into bags of words and phrases. We also make use of the 20 Newsgroups dataset, as provided and preprocessed in the Scikit-Learn library (Pedregosa et al., 2011).

4.2 Evaluation

CorEx does not explicitly attempt to learn a generative model and, thus, traditional measures such as perplexity are not appropriate for model comparison against LDA. Furthermore, it is well-known that perplexity and held-out log-likelihood do not necessarily correlate with human evaluation of semantic topic quality (Chang et al., 2009). Therefore, we measure semantic topic quality using Mimno et al.'s (2011) UMass automatic topic coherence score, which correlates with human judgments.

We also evaluate the models in terms of multiclass logistic regression document classification (Pedregosa et al., 2011), where the feature set of each document is its topic distribution. We perform all document classification tasks using a 60/40 training-test split.

Finally, we measure how well each topic model does at clustering documents. We obtain a clustering by assigning each document to the topic that occurs with the highest probability. We then measure the quality within clusters (homogeneity) and across clusters (adjusted mutual information). The highest possible value for both measures is one. We do not report clustering metrics on the clinical health notes because the documents are multi-label and, in that case, the metrics are not well-defined.

4.3 Choosing Anchor Words

We wish to systematically test the effect of anchor words given the domain-specific lexicons. To do so, we follow the approach used by Jagarlamudi et al. (2012) to automatically generate anchor words: for each label in a dataset, we find the words that have the highest mutual information with the label. For word w and label L, this is computed as

I(L : w) = H(L) - H(L \mid w),   (11)

where for each document of label L we consider if the word w appears or not.
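As a concrete illustration of this procedure, the sketch below scores each word type by I(L : w) for one document label and then anchors the top-scoring words in the open-source CorEx implementation. The corextopic import path and the anchors / anchor_strength arguments reflect our reading of that package's documentation and should be verified against the released code; X, labels, and words are assumed inputs (a binary document-word matrix, the document labels, and the word types).

```python
# Hedged sketch: choose anchor words by mutual information with a label
# (Eq. 11) and anchor them in the open-source CorEx topic model. The
# corextopic API usage below is an assumption based on the package docs.
import numpy as np
from sklearn.metrics import mutual_info_score
from corextopic import corextopic as ct

def top_anchor_words(X, doc_labels, target_label, words, n_anchors=5):
    """Return the n_anchors word types with the highest I(L : w) for one label."""
    is_label = np.asarray(doc_labels) == target_label
    # Slow but simple: densify one column at a time and score it against the label.
    scores = [mutual_info_score(is_label, np.asarray(X[:, i].todense()).ravel() > 0)
              for i in range(X.shape[1])]
    return [words[i] for i in np.argsort(scores)[::-1][:n_anchors]]

# Assumed inputs: X (sparse binary document-word matrix), labels, words.
anchors = [top_anchor_words(X, labels, lab, words) for lab in sorted(set(labels))]

topic_model = ct.Corex(n_hidden=len(anchors))
topic_model.fit(X, words=words, anchors=anchors, anchor_strength=2)  # beta = 2
for j, topic in enumerate(topic_model.get_topics(n_words=10)):
    print(j, [entry[0] for entry in topic])  # first element of each entry is the word
```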
[Figure 2 shows, for each of the three datasets (Disaster Relief Articles, 20 Newsgroups, Clinical Health Notes), macro F1, micro F1, topic coherence, and homogeneity for CorEx and LDA as the number of topics varies from 20 to 100.]
Figure 2: Baseline comparison of CorEx to LDA with respect to topic coherence and document classification and clustering on three different datasets as the number of topics varies. Points are the average of 30 runs of a topic model. Confidence intervals are plotted but are so small that they are not distinguishable. CorEx is trained using binary data, while LDA is trained on count data. Homogeneity is not well-defined on the multi-label clinical health notes, so it is omitted.

5 Results

5.1 LDA Baseline Comparison

We compare CorEx to LDA in terms of topic coherence, document classification, and document clustering across three datasets. CorEx is trained on binary data, while LDA is trained on count data. While not reported here, CorEx consistently outperformed LDA trained on binary data. In doing these comparisons, we use the Gensim implementation of LDA (Řehůřek and Sojka, 2010). The results of comparing CorEx to LDA as a function of the number of topics are presented in Figure 2.

Across all three datasets, we find that the topics produced by CorEx yield document classification results that are on par with or better than those produced by LDA topics. In terms of clustering, CorEx consistently produces document clusters of higher homogeneity than LDA. On the disaster relief articles, the CorEx clusters are nearly twice as homogeneous as the LDA clusters.

Table 1: Examples of topics learned by the CorEx topic model. Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 50 topics on the ReliefWeb and 20 Newsgroups datasets, and 30 topics on the clinical health notes.

Disaster Relief Topics
  Rank 1: drought, farmers, harvest, crop, livestock, planting, grain, maize, rainfall, irrigation
  Rank 3: eruption, volcanic, lava, crater, eruptions, volcanos, slopes, volcanic activity, evacuated, lava flows
  Rank 8: winter, snow, snowfall, temperatures, heavy snow, heating, freezing, warm clothing, severe winter, avalanches
  Rank 23: military, armed, civilians, soldiers, aircraft, weapons, rebel, planes, bombs, military personnel

20 Newsgroups Topics
  Rank 3: team, game, season, player, league, hockey, play, teams, nhl
  Rank 14: car, bike, cars, engine, miles, road, ride, riding, bikes, ground
  Rank 26: nasa, launch, orbit, shuttle, mission, satellite, gov, jpl, orbital, solar
  Rank 39: medical, disease, doctor, patients, treatment, medicine, health, hospital, doctors, pain

Clinical Health Notes Topics
  Rank 12: vomiting, nausea, abdominal pain, diarrhea, fever, dehydration, chill, clostridium difficile, intravenous fluid, compazine
  Rank 19: anxiety state, insomnia, ativan, neurontin, depression, lorazepam, gabapentin, trazodone, fluoxetine, headache
  Rank 27: pain, oxycodone, tylenol, percocet, ibuprofen, morphine, osteoarthritis, hernia, motrin, bleeding

CorEx outperforms LDA in terms of topic coherence on two out of three of the datasets.
While LDA produces more coherent topics for the clinical health notes, it is particularly striking that CorEx is able to produce high quality topics while only leveraging binary count data. Examples of these topics are shown in Table 1. Despite the binary counts limitation, CorEx still finds meaningfully coherent and competitive structure in the data.

[Figure 3 shows homogeneity, adjusted mutual information, and topic coherence for the unsupervised models (CorEx, LDA) and the semi-supervised models (anchored CorEx, must/cannot link LDA, z-labels LDA) on the disaster relief articles (21 topics) and 20 Newsgroups (20 topics).]
Figure 3: Comparison of anchored CorEx to other semi-supervised topic models in terms of document clustering and topic coherence. For each dataset, the number of topics is fixed to the number of document labels. Each dot is the average of 30 runs. Confidence intervals are plotted but are so small that they are not distinguishable.

5.2 Anchored CorEx Analysis

We now examine the effects and benefits of guiding CorEx through anchor words. In doing so, we also compare anchored CorEx to other semi-supervised topic models.

5.2.1 Anchoring for Topic Separability

We are first interested in how anchoring can be used to encourage topic separability so that documents cluster well. We focus on the HA/DR articles and 20 Newsgroups datasets, since traditional clustering metrics are not well-defined on the multi-label clinical health notes. For both datasets, we fix the number of topics to be equal to the number of document labels. It is in this context that we compare anchored CorEx to two other semi-supervised topic models: z-labels LDA and must/cannot link LDA.

Table 2: Examples of topics learned by CorEx when simultaneously anchoring many topics with anchoring parameter β = 2. Anchor words are shown in bold. Words are ranked according to mutual information with the topic, and topics are ranked according to the amount of total correlation they explain. Topic models were run with 21 topics on the ReliefWeb articles and 20 topics on the 20 Newsgroups dataset.

Anchored Disaster Relief Topics
  Rank 1: harvest, locust, drought, food crisis, farmers, crops, crop, malnutrition, food aid, livestock
  Rank 4: tents, quake, international federation, red crescent, red cross, blankets, earthquake, richter scale, societies, aftershocks
  Rank 12: climate, impacts, warming, climate change, irrigation, consumption, household, droughts, livelihoods, interventions
  Rank 19: storms, weather, winds, coastal, tornado, meteorological, tornadoes, strong winds, tropical, roofs

Anchored 20 Newsgroups Topics
  Rank 5: government, congress, clinton, state, national, economic, general, states, united, order
  Rank 6: bible, christian, god, jesus, christians, believe, life, faith, world, man
  Rank 15: use, used, high, circuit, power, work, voltage, need, low, end
  Rank 20: baseball, pitching, braves, mets, hitter, pitcher, cubs, dl, sox, jays

Using the method described in Section 4.3, we automatically retrieve the top five anchors for each disaster type and newsgroup. We then filter these lists of any words that are ambiguous, i.e. words that are anchor words for more than one document label. For anchored CorEx and z-labels LDA, we simultaneously assign each set of anchor words to exactly one topic each.
For must/cannot link LDA, we create must-links within the words of the same anchor group, and create cannot-links between words of different anchor groups. Since we are simultaneously anchoring to many topics, we use a weak anchoring parameter β = 2 for anchored CorEx. Using the notation from their original papers, we use η = 1 for z-labels LDA, and η = 1000 for must/cannot link LDA. For both LDA variants, we use α = 0.5, β = 0.1, take 2,000 samples, and estimate the models using code implemented by the original authors.

The results of this comparison are shown in Figure 3, and examples of anchored CorEx topics are shown in Table 2. Across all measures, CorEx and anchored CorEx outperform LDA. We find that anchored CorEx always improves cluster quality over CorEx in terms of homogeneity and adjusted mutual information. Compared to CorEx, simultaneously anchoring many topics neither harms nor benefits the topic coherence of anchored CorEx. Together these metrics suggest that anchored CorEx is finding topics that are of equivalent coherence to CorEx, but more relevant to the document labels, since gains are seen in terms of document clustering.

Against the other semi-supervised topic models, anchored CorEx compares favorably. The document clustering of anchored CorEx is similar to, or better than, that of z-labels LDA and must/cannot link LDA. Across the disaster relief articles, anchored CorEx finds less coherent topics than the two LDA variants, while it finds similarly coherent topics as must/cannot link LDA on the 20 Newsgroups dataset.

5.2.2 Anchoring for Topic Representation

We now turn to studying how domain knowledge can be anchored to a single topic to help an otherwise dominated topic emerge, and how the anchoring parameter β affects that emergence. To discern this effect, we focus just on anchored CorEx along with the HA/DR articles and clinical health notes, the datasets for which we have a domain expert lexicon.

We devise the following experiment: first, we determine the top five anchor words for each document label using the methodology described in Section 4.3. Unlike in the previous section, we do not filter these lists of ambiguous anchor words. Second, for each document label, we run an anchored CorEx topic model with that label's anchor words anchored to exactly one topic. We compare this anchored topic model to an unsupervised CorEx topic model using the same random seeds, thus creating a matched pair where the only difference is the treatment of anchor words. Finally, this matched pairs process is repeated 30 times, yielding a distribution for each metric over each label.

[Figure 4 shows, for the disaster relief articles and the clinical health notes, the post-anchoring change in topic overlap, the percent change in topic coherence, and the F1 difference as the anchoring parameter β varies from 1 to 10.]
Figure 4: Effect of anchoring words to a single topic for one document label at a time as a function of the anchoring parameter β. Light gray lines indicate the trajectory of the metric for a given disaster or disease label. Thick red lines indicate the pointwise average across all labels for a fixed value of β.

We use 50 topics when modeling the ReliefWeb articles and 30 topics when modeling the i2b2 clinical health notes.
These values were chosen by observing diminishing returns to the total correlation explained by additional topics.

In Figure 4 we show how the results of this experiment vary as a function of the anchoring parameter β for each disaster and disease type in the two datasets. Since there is heavy variance across document labels for each metric, we also examine a more detailed cross section of these results in Figure 5, where we set β = 5 for the clinical health notes and β = 10 for the disaster relief articles. As we show momentarily, disaster and disease types that benefit the most from anchoring were underrepresented pre-anchoring. Document labels that were well-represented prior to anchoring achieve only marginal gain. This results in the variance seen in Figure 4.

[Figure 5 plots, for each disaster type (Tropical Cyclone, Flood, Epidemic, Earthquake, Drought, Volcano, Flash Flood, Insect Infestation, Cold Wave, Technological Disaster, Tsunami, Land Slide, Wild Fire, Severe Local Storm, Other, Snow Avalanche, Extratropical Cyclone, Mud Slide, Heat Wave, Storm Surge, Fire) and each disease type (Asthma, Coronary Heart Disease, Congestive Heart Failure, Depression, Diabetes, GERD, Gallstones, Gout, Hypercholesterolemia, Hypertension, Hypertriglyceridemia, Osteoarthritis, Obstructive Sleep Apnea, Obesity, Peripheral Vascular Disease), the topic overlap difference post-anchoring, the percent change of anchored topic coherence when it is the most predictive topic, and the F1 difference post-anchoring.]
Figure 5: Cross-section results of the anchoring metrics from fixing β = 5 for the clinical health notes, and β = 10 for the disaster relief articles. Disaster and disease types are sorted by frequency, with the most frequent document labels appearing at the top. Error bars indicate 95% confidence intervals. The color bars provide context for each metric: topic overlap pre-anchoring, proportion of topic model runs where the anchored topic was the most predictive topic, and F1 score pre-anchoring.

A priori, we do not know that anchoring will cause the anchor words to appear at the top of topics. So, we first measure how the topic overlap, the proportion of the top ten mutual information words that appear within the top ten words of the topics, changes before and after anchoring. From Figure 4 (row 1) we see that as β increases, more of these relevant words consistently appear within the topics. For the disaster relief articles, many disaster types see about two more words introduced, while in the clinical health notes the overlap increases by up to four words. Analyzing the cross section in Figure 5 (column 1), we see many of these gains come from disaster and disease types that appeared less in the topics pre-anchoring. Thus, we can sway the topic model towards less dominant themes through anchoring.
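For reference, one plausible reading of the topic-overlap measure above can be computed as follows; the helper and its names are ours, not released with the paper.

```python
# Minimal sketch (our own naming) of the topic-overlap measure: the fraction of
# a label's top-ten mutual-information words that appear among the top ten
# words of any learned topic.
def topic_overlap(label_top_words, topics_top_words):
    """
    label_top_words  : the ten highest-MI words for a document label
    topics_top_words : list of lists, the top ten words of each topic
    """
    covered = {w for topic in topics_top_words for w in topic}
    return sum(w in covered for w in label_top_words) / len(label_top_words)

# Toy example: anchoring pulls more of the label's key words into the topics.
label_words = ["drought", "crop", "harvest", "rainfall"]
before = topic_overlap(label_words, [["drought", "farmers"], ["flood", "rain"]])
after = topic_overlap(label_words, [["drought", "crop", "harvest"], ["rainfall"]])
print(before, after)   # 0.25, then 1.0
```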
Document labels that occur the most frequently are those for which the topic overlap changes the least.

Next, we examine whether these anchored topics are more coherent topics. To do so, we compare the coherence of the anchored topic with that of the most predictive topic pre-anchoring, i.e. the topic with the largest corresponding coefficient in magnitude of the logistic regression, when the anchored topic itself is most predictive. From Figure 4 (row 2), we see these results have more variance, but largely the anchored topics are more coherent. In some cases, the coherence is 1.5 to 2 times that of pre-anchoring. Furthermore, from the colors of the central panel of Figure 5, we find that the anchored topics are, indeed, often the most predictive topics for each document label. Similar to topic overlap, the labels that see the least improvement are those that appear the most and are already well-represented in the topic model.

Finally, we find that the anchored, more coherent topics can lead to modest gains in document classification. For the disaster relief articles, Figure 4 (row 3) shows that there are mixed results in terms of F1 score improvement, with some disaster types performing consistently better, and others performing consistently worse. The results are more consistent for the clinical health notes, where there is an average increase of about 0.1 in the F1 score, and some disease types see an increase of up to 0.3 in F1. Given that we are only anchoring five words to the topic model, these are significant gains in predictive power.

Unlike the gains in topic overlap and coherence, the F1 score increases do not simply correlate with which document labels appeared most frequently. For example, we see in Figure 5 (column 3) that Tropical Cyclone exhibits the largest increase in predictive performance, even though it is also one of the most frequently appearing document labels. Similarly, some of the major gains in F1 for the disease types, and major losses in F1 for the disaster types, do not come from the most or least frequent document labels. Thus, when anchoring single topics within CorEx for document classification, it is important to examine how the anchoring affects prediction for individual document labels.

5.2.3 Anchoring for Topic Aspects

Finding topics that revolve around a word, such as a name or location, or around a group of words, can aid in understanding how a particular subject or event has been framed. We finish with a qualitative experiment where we disambiguate aspects of a topic by anchoring a set of words to multiple topics within the CorEx topic model. Note, must/cannot link LDA cannot be used in this manner, and z-labels LDA would require us to know these aspects beforehand.

We consider tweets containing #Ferguson (case-insensitive), which detail reactions to the shooting of Black teenager Michael Brown by White police officer Darren Wilson on August 9th, 2014 in Ferguson, Missouri. These tweets were collected from the Twitter Gardenhose, a 10% random sample of all tweets, over the period August 9th, 2014 to November 30th, 2014. Since CorEx will seek maximally informative topics by exploiting redundancies, we remove duplicates of retweets, leaving us with 869,091 tweets. We filter these tweets of punctuation, stop words, hyperlinks, usernames, and the 'RT' retweet symbol, and use the top 20,000 word types.

In the wake of both the shooting and the eventual non-indictment of Darren Wilson, several protests occurred.
Some onlookers supported and encouraged such protests, while others characterized the protests as violent "riots." To disambiguate these different depictions, we train a CorEx topic model with 55 topics, anchoring "protest" and "protests" together to five topics, and "riot" and "riots" together to five topics with β = 2. These anchored topics are presented in Table 3.

Table 3: Topic aspects around "protest" and "riot" from running a CorEx topic model with 55 topics and anchoring "protest" and "protests" together to five topics and "riot" and "riots" together to five topics with β = 2. Anchor words are shown in bold. Note, topics are not ordered by total correlation.

Topic aspects of "protest"
  1: protest, protests, peaceful, violent, continue, night, island, photos, staten, nights
  2: protest, protests, #hiphopmoves, #cole, hiphop, nationwide, moves, fo, anheuser, boeing
  3: protest, protests, st, louis, guard, national, county, patrol, highway, city
  4: protest, protests, paddy, covering, beverly, walmart, wagon, hills, passionately, including
  5: protest, protests, solidarity, march, square, rally, #oakland, downtown, nyc, #nyc

Topic aspects of "riot"
  6: riot, riots, unheard, language, inciting, accidentally, jokingly, watts, waving, dies
  7: riot, black, riots, white, #tcot, blacks, men, whites, race, #pjnet
  8: riot, riots, looks, like, sounds, acting, act, animals, looked, treated
  9: riot, riots, store, looting, businesses, burning, fire, looted, stores, business
  10: gas, riot, tear, riots, gear, rubber, bullets, military, molotov, armored

The anchored topics reflect different aspects of the framing of the "protests" and "riots," and are generally interpretable, despite the typical difficulty of extracting coherent topics from short documents using LDA (Tang et al., 2014). The "protest" topic aspects describe protests in St. Louis, Oakland, Beverly Hills, and parts of New York City (topics 1, 3, 4, 5), resistance by law enforcement (topics 3 and 4), and discussion of whether the protests were peaceful (topic 1). Topic 2 revolves around hip-hop artists who marched in solidarity with protesters.
Both CorEx and anchored CorEx consistently produce topics that are of comparable quality to LDA-based methods, despite only making use of binarized word counts. Anchored CorEx is more flexible than previous attempts at integrating word-level information into topic models. Topic separability can be enforced by lightly anchoring disjoint groups of words to sepa- rate topics, topic representation can be promoted by assertively anchoring a group of words to a single topic, and topic aspects can be unveiled by anchor- ing a single group of words to multiple topics. The flexibility of anchoring through the information bot- tleneck lends itself to many other possible creative anchoring strategies that could guide the topic model in different ways. Different goals may call for dif- ferent anchoring strategies, and domain experts can shape these strategies to their needs. While we have demonstrated several advantages of the CorEx topic model to LDA, it does have some technical shortcomings. Most notably, CorEx re- lies on binary count data in its sparsity optimiza- tion, rather than the standard count data that is used as input into LDA and other topic models. While we have demonstrated CorEx performs at the level of LDA despite this limitation, its effect would be more noticeable on longer documents. This can be partly overcome if one chunks such longer docu- ments into shorter subdocuments prior to running the topic model. Our implementation also requires that each word appears in only one topic. These lim- itations are not fundamental limitations of the the- ory, but a matter of computational efficiency. In future work, we hope to remove these restrictions while preserving the speed of the sparse CorEx topic modeling algorithm. As we have demonstrated, the information- theoretic approach provided via CorEx has rich po- tential for finding meaningful structure in docu- ments, particularly in a way that can help domain experts guide topic models with minimal interven- tion to capture otherwise eclipsed themes. The lightweight and versatile framework of anchored CorEx leaves open possibilities for theoretical ex- tensions and novel applications within the realm of topic modeling. Acknowledgments We would like to thank the Machine Intelligence and Data Science (MINDS) research group at the Infor- mation Sciences Institute for their help and insight during the course of this research. We also thank the Vermont Advanced Computing Core (VACC) for its computational resources. Finally, we thank the anonymous reviewers and the TACL action editors Diane McCarthy and Kristina Toutanova for their time and effort in helping us improve our work. Ryan J. Gallagher was a visiting research assistant at the Information Sciences Institute while perform- ing this research. Ryan J. Gallagher and Greg Ver Steeg were supported by DARPA award HR0011- 15-C-0115 and David Kale was supported by the Alfred E. Mann Innovation in Engineering Doctoral Fellowship. 540 References David Andrzejewski and Xiaojin Zhu. 2009. Latent Dirichlet Allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Pro- cessing, pages 43–48. Association for Computational Linguistics. David Andrzejewski, Xiaojin Zhu, and Mark Craven. 2009. Incorporating domain knowledge into topic modeling via Dirichlet forest priors. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 25–32. David Andrzejewski, Xiaojin Zhu, Mark Craven, and Benjamin Recht. 
2011. A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. In Proceedings of the International Joint Conference on Artificial Intelligence, volume 22, page 1171.

Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models – going beyond SVD. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS), pages 1–10. IEEE.

Sanjeev Arora, Rong Ge, Yonatan Halpern, David M. Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference on Machine Learning, pages 280–288.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Wray Buntine and Aleks Jakulin. 2006. Discrete component analysis. In Subspace, Latent Structure and Feature Selection, pages 1–33. Springer.

Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L. Boyd-Graber, and David M. Blei. 2009. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296.

Wendy W. Chapman, Will Bridewell, Paul Hanbury, Gregory F. Cooper, and Bruce G. Buchanan. 2001. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310.

Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon, and Zhourong Chen. 2016. Progressive EM for latent tree models and hierarchical topic detection. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 1498–1504.

Manhong Dai, Nigam H. Shah, Wei Xuan, Mark A. Musen, Stanley J. Watson, Brian D. Athey, Fan Meng, et al. 2008. An efficient solution for mapping free text to ontology terms. AMIA Summit on Translational Bioinformatics, 21.

Chris Ding, Tao Li, and Wei Peng. 2008. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis, 52(8):3913–3927.

James Foulds, Shachi Kumar, and Lise Getoor. 2015. Latent topic networks: A versatile probabilistic programming framework for topic models. In Proceedings of the International Conference on Machine Learning, pages 777–786.

Nir Friedman, Ori Mosenzon, Noam Slonim, and Naftali Tishby. 2001. Multivariate information bottleneck. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 152–161.

Thomas L. Griffiths, Michael I. Jordan, Joshua B. Tenenbaum, and David M. Blei. 2004. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems, pages 17–24.

Yoni Halpern, Youngduck Choi, Steven Horng, and David Sontag. 2014. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium Proceedings. American Medical Informatics Association.

Yoni Halpern, Steven Horng, and David Sontag. 2015. Anchored discrete factor analysis. arXiv preprint arXiv:1511.03299.

Nathan Hodas, Greg Ver Steeg, Joshua Harrison, Satish Chikkagoudar, Eric Bell, and Courtney Corley. 2015. Disentangling the lexicons of disaster response in Twitter. In The 3rd International Workshop on Social Web for Disaster Management (SWDM'15).

Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. 2013. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347.

Thomas Hofmann. 1999.
Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 289–296.

Jagadeesh Jagarlamudi, Hal Daumé III, and Raghavendra Udupa. 2012. Incorporating lexical priors into topic models. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 204–213. Association for Computational Linguistics.

Moontae Lee and David Mimno. 2014. Low-dimensional embeddings for interpretable anchor-based topic inference. In Proceedings of Empirical Methods in Natural Language Processing, pages 1319–1328.

Moshe Lichman. 2013. UC Irvine Machine Learning Repository.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Advances in Neural Information Processing Systems, pages 121–128.

Shike Mei, Jun Zhu, and Jerry Zhu. 2014. Robust RegBayes: Selectively incorporating first-order logic domain knowledge into Bayesian models. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 253–261.

David Mimno, Hanna M. Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 262–272. Association for Computational Linguistics.

Thang Nguyen, Yuening Hu, and Jordan L. Boyd-Graber. 2014. Anchors regularized: Adding robustness and extensibility to scalable topic-modeling algorithms. In Proceedings of the Association for Computational Linguistics, pages 359–369.

Thang Nguyen, Jordan Boyd-Graber, Jeffrey Lund, Kevin Seppi, and Eric Ringger. 2015. Is your anchor going up or down? Fast and accurate supervised topic models. In Proceedings of the North American Chapter of the Association for Computational Linguistics.

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modeling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50.

Kyle Reing, David C. Kale, Greg Ver Steeg, and Aram Galstyan. 2016. Toward interpretable topic discovery via anchored correlation explanation. ICML Workshop on Human Interpretability in Machine Learning.

Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pages 487–494.

Jian Tang, Zhaoshi Meng, Xuanlong Nguyen, Qiaozhu Mei, and Ming Zhang. 2014. Understanding the limiting factors of topic modeling via posterior contraction analysis. In Proceedings of the International Conference on Machine Learning, pages 190–198.

Naftali Tishby, Fernando C. Pereira, and William Bialek. 1999. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377.

Greg Ver Steeg and Aram Galstyan. 2014. Discovering structure in high-dimensional data through correlation explanation. In Advances in Neural Information Processing Systems, pages 577–585.

Greg Ver Steeg and Aram Galstyan.
2015. Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pages 1004–1012.