Improving Topic Models with Latent Feature Word Representations

Mark Johnson
Joint work with Dat Quoc Nguyen, Richard Billingsley and Lan Du
Dept of Computing, Macquarie University, Sydney, Australia
July 2015

Outline
• Introduction
• Latent-feature topic models
• Experimental evaluation
• Conclusions and future work

High-level overview
• Topic models take a corpus of documents as input, and jointly cluster:
  – words, by the documents they occur in, and
  – documents, by the words they contain
• If the corpus is small and/or the documents are short, these clusters will be noisy
• Latent feature representations of words learnt from large external corpora (e.g., word2vec, GloVe) capture various aspects of word meanings
• Here we use latent feature representations learnt on a large external corpus to improve the topic-word distributions in a topic model
  – we combine the Dirichlet-multinomial models of Latent Dirichlet Allocation (LDA) with the distributed representations used in neural networks
  – the improvement is greatest on small corpora with short documents, e.g., Twitter data

Related work
• Phan et al. (2011) assumed that the small corpus is a sample of topics from a larger corpus such as Wikipedia, and used the topics discovered in the larger corpus to help shape the topic representations in the small corpus
  – if the larger corpus has many irrelevant topics, these will "use up" the topic space of the model
• Petterson et al. (2010) proposed an extension of LDA that uses external information about word similarity, such as thesauri and dictionaries, to smooth the topic-to-word distribution
• Sahami and Heilman (2006) employed web search results to enrich the information in short texts
• Neural network topic models of a single corpus have also been proposed (Salakhutdinov and Hinton, 2009; Srivastava et al., 2013; Cao et al., 2015)
Latent Dirichlet Allocation (LDA)

$$\theta_d \sim \mathrm{Dir}(\alpha) \quad z_{di} \sim \mathrm{Cat}(\theta_d) \quad \phi_z \sim \mathrm{Dir}(\beta) \quad w_{di} \sim \mathrm{Cat}(\phi_{z_{di}})$$

• Latent Dirichlet Allocation (LDA) is an admixture model, i.e., each document d is associated with a distribution over topics θ_d
• Inference is typically performed with a collapsed Gibbs sampler over the z_{di}, integrating out θ and φ (Griffiths et al., 2004):
$$P(z_{di} = t \mid \mathbf{Z}_{\neg di}) \propto (N^{t}_{d,\neg i} + \alpha)\,\frac{N^{t,w_{di}}_{\neg di} + \beta}{N^{t}_{\neg di} + V\beta}$$

The Dirichlet Multinomial Mixture (DMM) model

$$\theta \sim \mathrm{Dir}(\alpha) \quad z_{d} \sim \mathrm{Cat}(\theta) \quad \phi_z \sim \mathrm{Dir}(\beta) \quad w_{di} \sim \mathrm{Cat}(\phi_{z_{d}})$$

• The Dirichlet Multinomial Mixture (DMM) model is a mixture model, i.e., each document d is associated with a single topic z_d (Nigam et al., 2000)
• Inference can also be performed using a collapsed Gibbs sampler in which θ and the φ_z are integrated out (Yin and Wang, 2014):
$$P(z_{d} = t \mid \mathbf{Z}_{\neg d}) \propto (M^{t}_{\neg d} + \alpha)\,\frac{\Gamma(N^{t}_{\neg d} + V\beta)}{\Gamma(N^{t}_{\neg d} + N_{d} + V\beta)} \prod_{w \in W}\frac{\Gamma(N^{t,w}_{\neg d} + N^{w}_{d} + \beta)}{\Gamma(N^{t,w}_{\neg d} + \beta)}$$

Latent feature word representations
• Traditional count-based methods (Deerwester et al., 1990; Lund and Burgess, 1996; Bullinaria and Levy, 2007) for learning real-valued latent feature (LF) vectors rely on co-occurrence counts
• Recent approaches based on deep neural networks learn vectors by predicting words given their window-based context (Collobert and Weston, 2008; Mikolov et al., 2013; Pennington et al., 2014; Liu et al., 2015)
• We downloaded the pre-trained word2vec and GloVe vectors for this paper

Latent-feature topic-to-word distributions
• We assume that each word w is associated with a word vector ω_w
• We learn a topic vector τ_t for each topic t
• We use these to define a distribution CatE(w) over words:
$$\mathrm{CatE}(w \mid \tau_t\,\omega^{\top}) \propto \exp(\tau_t \cdot \omega_w)$$
  – τ_t ω^⊤ is a vector of unnormalised scores, one per word
• In our topic models, we mix the CatE distribution with a multinomial distribution over words, so we can capture idiosyncratic properties of the corpus (e.g., words not seen in the external corpus)
  – we use a Boolean indicator variable that records whether a word is generated from CatE or from the multinomial distribution

The Latent Feature LDA (LF-LDA) model

$$\theta_d \sim \mathrm{Dir}(\alpha) \quad z_{di} \sim \mathrm{Cat}(\theta_d) \quad \phi_z \sim \mathrm{Dir}(\beta) \quad s_{di} \sim \mathrm{Ber}(\lambda) \quad w_{di} \sim (1 - s_{di})\,\mathrm{Cat}(\phi_{z_{di}}) + s_{di}\,\mathrm{CatE}(\tau_{z_{di}}\,\omega^{\top})$$

• s_{di} is the Boolean indicator variable indicating whether word w_{di} is generated from CatE
• λ is a user-specified hyper-parameter determining how often words are generated from the CatE distribution
  – if we estimated λ from the data, we expect it would be driven towards 0, so that words would never be generated through CatE

The Latent Feature DMM (LF-DMM) model

$$\theta \sim \mathrm{Dir}(\alpha) \quad z_{d} \sim \mathrm{Cat}(\theta) \quad \phi_z \sim \mathrm{Dir}(\beta) \quad s_{di} \sim \mathrm{Ber}(\lambda) \quad w_{di} \sim (1 - s_{di})\,\mathrm{Cat}(\phi_{z_{d}}) + s_{di}\,\mathrm{CatE}(\tau_{z_{d}}\,\omega^{\top})$$

• s_{di} is the Boolean indicator variable indicating whether word w_{di} is generated from CatE
• λ is a user-specified hyper-parameter determining how often words are generated from the CatE distribution
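To make the topic-to-word construction concrete, here is a minimal NumPy sketch of the CatE distribution and the (1 − s)·Cat + s·CatE mixture used in LF-LDA and LF-DMM. It is an illustrative sketch, not the authors' implementation: the vocabulary, the random vectors standing in for pre-trained word2vec/GloVe embeddings, and all names (word_vecs, topic_vec, phi_t, lam) are assumptions.

```python
# Sketch of CatE(w | tau omega^T) and of sampling one word from the mixture.
# Toy sizes and random vectors stand in for pre-trained embeddings; names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 300                          # vocabulary size, embedding dimension
word_vecs = rng.normal(size=(V, D))       # omega: one row per word (normally pre-trained)
topic_vec = rng.normal(size=D)            # tau_t: learnt topic vector for topic t
phi_t = rng.dirichlet(np.full(V, 0.01))   # Dirichlet-multinomial topic-word distribution
lam = 0.6                                 # mixture weight lambda (hyper-parameter)

def cat_e(tau, omega):
    """CatE(w | tau omega^T): softmax of the unnormalised scores tau . omega_w."""
    scores = omega @ tau
    scores -= scores.max()                # for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_word(tau, omega, phi, lam):
    """Generate one word: s ~ Ber(lam) chooses CatE vs. the multinomial."""
    s = rng.random() < lam
    p = cat_e(tau, omega) if s else phi
    return rng.choice(len(p), p=p), s

w, s = sample_word(topic_vec, word_vecs, phi_t, lam)
print(f"sampled word id {w} from {'CatE' if s else 'multinomial'}")
```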
Inference for the LF-LDA model
• We integrate out θ and φ as in the Griffiths et al. (2004) sampler, and interleave MAP estimation for τ with Gibbs sweeps over the other variables
• Algorithm outline:
    initialise the word-topic variables z_di using the LDA sampler
    repeat:
      for each topic t:
        τ_t = argmax_{τ_t} P(τ_t | Z, S)
      for each document d and each word location i:
        sample z_di from P(z_di | Z_¬di, S_¬di, τ)
        sample s_di from P(s_di | Z, S_¬di, τ)

Inference for the LF-DMM model (1)
• We integrate out θ and φ as in the Yin and Wang (2014) sampler, and interleave MAP estimation for τ with Gibbs sweeps
• Algorithm outline:
    initialise the document-topic variables z_d using the DMM sampler
    repeat:
      for each topic t:
        τ_t = argmax_{τ_t} P(τ_t | Z, S)
      for each document d:
        sample z_d from P(z_d | Z_¬d, S_¬d, τ)
        for each word location i:
          sample s_di from P(s_di | Z, S_¬di, τ)
• Note: P(z_d | Z_¬d, S_¬d, τ) is computationally expensive to compute exactly, as it requires summing over all possible values of s_d

Inference for the LF-DMM model (2)
• The computational problems stem from the fact that all the words in a document have the same topic
  – we have to jointly sample the document topic z_d and the indicator variables s_d
  – the sampling probability is a product of ascending factorials
• We approximate these probabilities by assuming that the topic-word counts are "frozen", i.e., they don't increase within a document
  – the DMM is mainly used on short documents (e.g., Twitter), where the "one topic per document" assumption is accurate ⇒ "freezing" the counts should have little impact
  – this could be corrected with a Metropolis-Hastings accept-reject step
$$P(z_d = t, \mathbf{s}_d \mid \mathbf{Z}_{\neg d}, \mathbf{S}_{\neg d}, \tau) \propto \lambda^{K_d}(1-\lambda)^{N_d}\,(M^{t}_{\neg d} + \alpha)\,\frac{\Gamma(N^{t}_{\neg d} + V\beta)}{\Gamma(N^{t}_{\neg d} + N_{d} + V\beta)} \left( \prod_{w \in W}\frac{\Gamma(N^{t,w}_{\neg d} + N^{w}_{d} + \beta)}{\Gamma(N^{t,w}_{\neg d} + \beta)} \right) \left( \prod_{w \in W} \mathrm{CatE}(w \mid \tau_t\,\omega^{\top})^{K^{w}_{d}} \right)$$
  (here N^w_d and K^w_d count the occurrences of word w in document d generated by the multinomial and by CatE respectively, with N_d and K_d their totals)

Estimating the topic vectors τ_t
• Both the LF-LDA and LF-DMM models associate each topic t with a topic vector τ_t, which must be learnt from the training corpus
• After each Gibbs sweep:
  – the topic variables z identify which topic each word is generated from
  – the indicator variables s identify which words are generated from the latent feature distribution CatE
  ⇒ we can use a supervised estimation procedure to find τ
• We use L-BFGS to optimise the L2-regularised log-loss (MAP estimation):
$$L_t = -\sum_{w \in W} K^{t,w}\left( \tau_t \cdot \omega_w - \log \sum_{w' \in W} \exp(\tau_t \cdot \omega_{w'}) \right) + \mu \lVert \tau_t \rVert_2^2$$
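A minimal sketch of this per-topic MAP step, minimising L_t with L-BFGS via SciPy. The gradient is derived directly from the loss above; the counts, embeddings and the regularisation weight mu are invented toy values, and the variable names are assumptions rather than the authors' code.

```python
# Sketch of the MAP update for one topic vector tau_t with L-BFGS.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

rng = np.random.default_rng(0)
V, D = 500, 50
omega = rng.normal(size=(V, D))                   # word vectors (normally pre-trained)
counts = rng.poisson(0.5, size=V).astype(float)   # K^{t,w}: CatE-generated counts for topic t
mu = 0.01                                         # L2 regularisation weight (assumed value)

def loss_and_grad(tau):
    scores = omega @ tau                          # tau . omega_w for every word
    log_z = logsumexp(scores)
    loss = -np.sum(counts * (scores - log_z)) + mu * tau @ tau
    p = softmax(scores)                           # CatE probabilities under the current tau
    grad = -(omega.T @ counts - counts.sum() * (omega.T @ p)) + 2.0 * mu * tau
    return loss, grad

res = minimize(loss_and_grad, np.zeros(D), jac=True, method="L-BFGS-B")
tau_t = res.x                                     # MAP estimate of the topic vector
print(res.fun, np.linalg.norm(tau_t))
```

In the models above, this optimisation is run once per topic after every Gibbs sweep, with K^{t,w} read off the current z and s assignments.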
Goals of evaluation
• A topic model learns document-topic and topic-word distributions:
  – topic coherence evaluates the topic-word distributions
  – document clustering and document classification evaluate the document-topic distributions; the latent feature component only directly changes the topic-word distributions, so these are challenging evaluations
• Do the word2vec and GloVe word vectors behave differently in topic modelling?
• We expect the latent feature component to have the greatest impact on small corpora, so our evaluation focuses on these datasets:

  Dataset                          # labels   # docs   words/doc   # types
  N20 (20 newsgroups)                 20      18,820     103.3      19,572
  N20short (docs with ≤ 20 words)     20       1,794      13.6       6,377
  N20small (400 docs)                 20         400      88.0       8,157
  TMN (TagMyNews)                      7      32,597      18.3      13,428
  TMNtitle (TMN titles)                7      32,503       4.9       6,347
  Twitter                              4       2,520       5.0       1,390

Word2vec-DMM on the TagMyNews titles corpus (1)

  Topic 1
  Initdmm     Iter=1     Iter=2     Iter=5      Iter=10     Iter=20     Iter=50     Iter=100    Iter=500
  japan       japan      japan      japan       japan       japan       japan       japan       japan
  nuclear     nuclear    nuclear    nuclear     nuclear     nuclear     nuclear     nuclear     nuclear
  u.s.        u.s.       u.s.       u.s.        u.s.        u.s.        plant       u.s.        u.s.
  crisis      russia     crisis     plant       plant       plant       u.s.        plant       plant
  plant       radiation  china      crisis      radiation   quake       quake       quake       quake
  china       nuke       russia     radiation   crisis      radiation   radiation   radiation   radiation
  libya       iran       plant      china       china       crisis      earthquake  earthquake  earthquake
  radiation   crisis     radiation  russia      nuke        nuke        tsunami     tsunami     tsunami
  u.n.        china      nuke       nuke        russia      china       nuke        nuke        nuke
  vote        libya      libya      power       power       tsunami     crisis      crisis      crisis
  korea       plant      iran       u.n.        u.n.        earthquake  disaster    disaster    disaster
  europe      u.n.       u.n.       iran        iran        disaster    plants      oil         power
  government  mideast    power      reactor     earthquake  power       power       plants      oil
  election    pakistan   pakistan   earthquake  reactor     reactor     oil         power       japanese
  deal        talks      talks      libya       quake       japanese    japanese    tepco       plants

• Table shows the 15 most probable topical words in Topic 1 found by the 20-topic word2vec-DMM on the TMN titles corpus
• Words found by DMM but not by word2vec-DMM are underlined
• Words found by word2vec-DMM but not DMM are in bold

Word2vec-DMM on the TagMyNews titles corpus (2)

  Topic 4                              Topic 5
  Initdmm     Iter=50     Iter=500     Initdmm    Iter=50    Iter=500
  egypt       libya       libya        critic     dies       star
  china       egypt       egypt        corner     star       sheen
  u.s.        mideast     iran         office     broadway   idol
  mubarak     iran        mideast      video      american   broadway
  bin         opposition  opposition   game       idol       show
  libya       leader      protests     star       lady       american
  laden       u.n.        leader       lady       gaga       gaga
  france      protests    syria        gaga       show       tour
  bahrain     syria       u.n.         show       news       cbs
  air         tunisia     tunisia      weekend    critic     hollywood
  report      protesters  chief        sheen      film       mtv
  rights      chief       protesters   box        hollywood  lady
  court       asia        mubarak      park       fame       wins
  u.n.        russia      crackdown    takes      actor      charlie
  war         arab        bahrain      man        movie      stars

  Topic 19                             Topic 14
  Initdmm     Iter=50     Iter=500     Initdmm     Iter=50      Iter=500
  nfl         nfl         nfl          nfl         law          law
  idol        draft       sports       court       bill         texas
  draft       lockout     draft        law         governor     bill
  american    players     players      bill        texas        governor
  show        coach       lockout      wisconsin   senate       senate
  film        nba         football     players     union        union
  season      player      league       judge       obama        obama
  sheen       sheen       n.f.l.       governor    wisconsin    budget
  n.f.l.      league      player       union       budget       wisconsin
  back        n.f.l.      baseball     house       state        immigration
  top         coaches     court        texas       immigration  state
  star        football    coaches      lockout     arizona      vote
  charlie     judge       nflpa        budget      california   washington
  players     nflpa       basketball   peru        vote         arizona
  men         court       game         senate      federal      california

• Table shows the 15 most probable topical words in several topics found by the 20-topic word2vec-DMM on the TMN titles corpus
• Words found by DMM but not by w2v-DMM are underlined
• Words found by w2v-DMM but not DMM are in bold

Topic coherence evaluation
• Lau et al. (2014) showed that human scores on a word intrusion task are highly correlated with the normalised pointwise mutual information (NPMI) computed against a large external corpus (we used English Wikipedia)
• We found that the latent feature vectors produced a significant improvement in NPMI scores on all models and corpora
  – greatest improvement when λ = 1 (unsurprisingly)
[Figure: NPMI scores on the N20short dataset with 20 topics, as the mixture weight λ varies from 0 to 1]

Topic coherence on the Twitter corpus (NPMI, λ = 1.0)

  Data     Method      T=4          T=20         T=40         T=80
  Twitter  lda         -8.5 ± 1.1   -14.5 ± 0.4  -15.1 ± 0.4  -15.9 ± 0.2
           w2v-lda     -7.3 ± 1.0   -13.2 ± 0.6  -14.0 ± 0.3  -14.1 ± 0.3
           glove-lda   -6.2 ± 1.6   -13.9 ± 0.6  -14.2 ± 0.4  -14.2 ± 0.2
           Improve.     2.3          1.3          1.1          1.8
  Twitter  dmm         -5.9 ± 1.1   -10.4 ± 0.7  -12.0 ± 0.3  -13.3 ± 0.3
           w2v-dmm     -5.5 ± 0.7   -10.5 ± 0.5  -11.2 ± 0.5  -12.5 ± 0.1
           glove-dmm   -5.1 ± 1.2    -9.9 ± 0.6  -11.1 ± 0.3  -12.5 ± 0.4
           Improve.     0.8          0.5          0.9          0.8

• The normalised pointwise mutual information score improves for both LDA and DMM on the Twitter corpus, across a wide range of numbers of topics
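For reference, here is a small sketch of NPMI-based coherence for one topic's top words. It only loosely follows the Lau et al. (2014) evaluation: the co-occurrence probabilities would normally be estimated from sliding windows over Wikipedia, whereas the counts below are invented, the "-1 for unseen pairs" convention is an assumption, and all names (doc_freq, joint_freq, n_windows) are illustrative.

```python
# Sketch of average pairwise NPMI over a topic's top-N words.
import itertools
import math

def npmi_coherence(top_words, doc_freq, joint_freq, n_windows, eps=1e-12):
    """Average NPMI over all pairs of the topic's top words."""
    scores = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = doc_freq.get(w1, 0) / n_windows
        p2 = doc_freq.get(w2, 0) / n_windows
        p12 = joint_freq.get((w1, w2), joint_freq.get((w2, w1), 0)) / n_windows
        if p1 == 0 or p2 == 0 or p12 == 0:
            scores.append(-1.0)          # assumed convention: unseen pair gets the minimum score
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12 + eps))
    return sum(scores) / len(scores)

# Toy example with invented counts:
top = ["japan", "nuclear", "plant", "radiation"]
df = {"japan": 5000, "nuclear": 1200, "plant": 3000, "radiation": 800}
jf = {("japan", "nuclear"): 400, ("nuclear", "plant"): 500, ("nuclear", "radiation"): 300,
      ("japan", "plant"): 250, ("japan", "radiation"): 150, ("plant", "radiation"): 350}
print(npmi_coherence(top, df, jf, n_windows=1_000_000))
```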
Document clustering evaluation
• Cluster documents by assigning each document to its highest-probability topic
• Evaluate clusterings by purity and normalised mutual information (NMI) (Manning et al., 2008)
[Figure: evaluation of 20-topic LDA on the N20short corpus, as the mixture weight λ varies from 0 to 1]
• In general, the best results are obtained with λ = 0.6
  ⇒ we set λ = 0.6 in all further experiments

Document clustering of Twitter data

  Purity
  Method      T=4            T=20           T=40           T=80
  lda         0.559 ± 0.020  0.614 ± 0.016  0.626 ± 0.011  0.631 ± 0.008
  w2v-lda     0.598 ± 0.023  0.635 ± 0.016  0.638 ± 0.009  0.637 ± 0.012
  glove-lda   0.597 ± 0.016  0.635 ± 0.014  0.637 ± 0.010  0.637 ± 0.007
  Improve.    0.039          0.021          0.012          0.006
  dmm         0.552 ± 0.020  0.624 ± 0.010  0.647 ± 0.009  0.675 ± 0.009
  w2v-dmm     0.581 ± 0.019  0.641 ± 0.013  0.660 ± 0.010  0.687 ± 0.007
  glove-dmm   0.580 ± 0.013  0.644 ± 0.016  0.657 ± 0.008  0.684 ± 0.006
  Improve.    0.029          0.020          0.013          0.012

  NMI
  Method      T=4            T=20           T=40           T=80
  lda         0.196 ± 0.018  0.174 ± 0.008  0.170 ± 0.007  0.160 ± 0.004
  w2v-lda     0.249 ± 0.021  0.191 ± 0.011  0.176 ± 0.003  0.167 ± 0.006
  glove-lda   0.242 ± 0.013  0.191 ± 0.007  0.177 ± 0.007  0.165 ± 0.005
  Improve.    0.053          0.017          0.007          0.007
  dmm         0.194 ± 0.017  0.186 ± 0.006  0.184 ± 0.005  0.190 ± 0.003
  w2v-dmm     0.230 ± 0.015  0.195 ± 0.007  0.193 ± 0.004  0.199 ± 0.005
  glove-dmm   0.232 ± 0.010  0.201 ± 0.010  0.191 ± 0.006  0.195 ± 0.005
  Improve.    0.038          0.015          0.009          0.009

• On the short, small Twitter dataset our models obtain better clustering results than the baseline models for small T
  – with T = 4 we obtain 3.9% purity and 5.3% NMI improvements
• For small T ≤ 7, on the larger N20, TMN and TMNtitle datasets, our models and the baseline models obtain similar clustering results
• With larger T our models perform better than the baselines on the short TMN and TMNtitle datasets
• On the N20 dataset, the baseline LDA model obtains better clustering results than ours
• There is no reliable difference between word2vec and GloVe vectors

Document classification of the N20 and N20small corpora
• Train an SVM to predict the document label from the topic(s) assigned to the document

  F1 scores (λ = 0.6)
  Data      Model      T=6            T=20           T=40           T=80
  N20       lda        0.312 ± 0.013  0.635 ± 0.016  0.742 ± 0.014  0.763 ± 0.005
            w2v-lda    0.316 ± 0.013  0.641 ± 0.019  0.730 ± 0.017  0.768 ± 0.004
            glove-lda  0.288 ± 0.013  0.650 ± 0.024  0.733 ± 0.011  0.762 ± 0.006
            Improve.   0.004          0.015          -0.009         0.005
  N20small  lda        0.204 ± 0.020  0.392 ± 0.029  0.459 ± 0.030  0.477 ± 0.025
            w2v-lda    0.213 ± 0.018  0.442 ± 0.025  0.502 ± 0.031  0.509 ± 0.022
            glove-lda  0.181 ± 0.011  0.420 ± 0.025  0.474 ± 0.029  0.498 ± 0.012
            Improve.   0.009          0.050          0.043          0.032

• F1 scores (mean ± standard deviation) for the N20 and N20small corpora

Document classification of the TMN and TMNtitle corpora

  F1 scores (λ = 0.6)
  Data      Model      T=7            T=20           T=40           T=80
  TMN       lda        0.658 ± 0.026  0.754 ± 0.009  0.768 ± 0.004  0.778 ± 0.004
            w2v-lda    0.663 ± 0.021  0.758 ± 0.009  0.769 ± 0.005  0.780 ± 0.004
            glove-lda  0.664 ± 0.025  0.760 ± 0.006  0.767 ± 0.003  0.779 ± 0.004
            Improve.   0.006          0.006          0.001          0.002
  TMN       dmm        0.605 ± 0.023  0.724 ± 0.016  0.738 ± 0.008  0.741 ± 0.005
            w2v-dmm    0.619 ± 0.033  0.744 ± 0.009  0.759 ± 0.005  0.777 ± 0.005
            glove-dmm  0.624 ± 0.025  0.757 ± 0.009  0.761 ± 0.005  0.774 ± 0.010
            Improve.   0.019          0.033          0.023          0.036
  TMNtitle  lda        0.564 ± 0.015  0.625 ± 0.011  0.626 ± 0.010  0.624 ± 0.006
            w2v-lda    0.563 ± 0.029  0.644 ± 0.010  0.643 ± 0.007  0.640 ± 0.004
            glove-lda  0.568 ± 0.028  0.644 ± 0.010  0.632 ± 0.008  0.642 ± 0.005
            Improve.   0.004          0.019          0.017          0.018
  TMNtitle  dmm        0.570 ± 0.022  0.650 ± 0.011  0.654 ± 0.008  0.646 ± 0.008
            w2v-dmm    0.562 ± 0.022  0.670 ± 0.012  0.677 ± 0.006  0.680 ± 0.003
            glove-dmm  0.592 ± 0.017  0.674 ± 0.016  0.683 ± 0.006  0.679 ± 0.009
            Improve.   0.022          0.024          0.029          0.034

• F1 scores (mean ± standard deviation) for the TMN and TMNtitle corpora
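A minimal sketch of the two document-level evaluations used above: clustering by the highest-probability topic (purity, NMI) and SVM classification on the per-document topic proportions. The document-topic matrix here is random toy data; the specific choices (scikit-learn's LinearSVC, micro-averaged F1, an 80/20 split) are assumptions for illustration, not necessarily the authors' setup.

```python
# Sketch of purity, NMI and SVM-F1 evaluation from a document-topic matrix and gold labels.
import numpy as np
from sklearn.metrics import f1_score, normalized_mutual_info_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_docs, n_topics, n_labels = 2520, 4, 4
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)   # theta_d for each document (toy data)
labels = rng.integers(n_labels, size=n_docs)                # gold document labels (toy data)

# Clustering evaluation: assign each document to its most probable topic.
clusters = doc_topic.argmax(axis=1)
purity = sum(np.bincount(labels[clusters == k]).max()
             for k in np.unique(clusters)) / n_docs
nmi = normalized_mutual_info_score(labels, clusters)

# Classification evaluation: SVM trained on the topic proportions.
X_tr, X_te, y_tr, y_te = train_test_split(doc_topic, labels, test_size=0.2, random_state=0)
clf = LinearSVC(C=1.0).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), average="micro")

print(f"purity={purity:.3f}  NMI={nmi:.3f}  F1={f1:.3f}")
```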
Document classification of the Twitter corpus

  F1 scores (λ = 0.6)
  Data     Method     T=4            T=20           T=40           T=80
  Twitter  lda        0.526 ± 0.021  0.636 ± 0.011  0.650 ± 0.014  0.653 ± 0.008
           w2v-lda    0.578 ± 0.047  0.651 ± 0.015  0.661 ± 0.011  0.664 ± 0.010
           glove-lda  0.569 ± 0.037  0.656 ± 0.011  0.662 ± 0.008  0.662 ± 0.006
           Improve.   0.052          0.020          0.012          0.011
  Twitter  dmm        0.505 ± 0.023  0.614 ± 0.012  0.634 ± 0.013  0.656 ± 0.011
           w2v-dmm    0.541 ± 0.035  0.636 ± 0.015  0.648 ± 0.011  0.670 ± 0.010
           glove-dmm  0.539 ± 0.024  0.638 ± 0.017  0.645 ± 0.012  0.666 ± 0.009
           Improve.   0.036          0.024          0.014          0.014

• For document classification, the latent feature models generally perform better than the baseline models
  – on the small N20small and Twitter datasets, when the number of topics T equals the number of ground-truth labels (i.e., 20 and 4, respectively), our w2v-LDA model obtains a 5+% higher F1 score than the LDA model
  – our w2v-DMM model achieves 3.6% and 3.4% higher F1 scores than the DMM model on the TMN and TMNtitle datasets with T = 80, respectively

Conclusions
• Latent feature vectors induced from large external corpora can be used to improve topic modelling
  – latent features significantly improve topic coherence across a range of corpora, with both the LDA and DMM models
  – document clustering and document classification also improve significantly, even though these depend directly only on the document-topic distribution
• The improvements were greatest for small document collections and/or short documents
  – with enough training data, there is sufficient information in the corpus to accurately estimate the topic-word distributions
  – the improvement in the topic-word distributions also improves the document-topic distribution
• We did not detect any reliable difference between word2vec and GloVe vectors

Future directions
• Retrain the word vectors to fit the training corpus
  – how do we avoid losing information from the external corpus?
• More sophisticated latent-feature models of topic-word distributions
• More efficient training procedures (e.g., using SGD)
• Extend this approach to a richer class of topic models