Fine-tuning Encoders for Improved Monolingual and Zero-shot Polylingual Neural Topic Modeling
Aaron Mueller, Mark Dredze
2021-04-11

Neural topic models can augment or replace bag-of-words inputs with the learned representations of deep pre-trained transformer-based word prediction models. One added benefit when using representations from multilingual models is that they facilitate zero-shot polylingual topic modeling. However, while it has been widely observed that pre-trained embeddings should be fine-tuned to a given task, it is not immediately clear what supervision should look like for an unsupervised task such as topic modeling. Thus, we propose several methods for fine-tuning encoders to improve both monolingual and zero-shot polylingual neural topic modeling. We consider fine-tuning on auxiliary tasks, constructing a new topic classification task, integrating the topic classification objective directly into topic model training, and continued pre-training. We find that fine-tuning encoder representations on topic classification and integrating the topic classification task directly into topic modeling improves topic quality, and that fine-tuning encoder representations on any task is the most important factor for facilitating cross-lingual transfer.

Topic models (Blei et al., 2003) are widely used across numerous disciplines to study large corpora (Boyd-Graber et al., 2017). These data-driven models discover salient themes and semantic clusters without any supervision. Monolingual topic models are language-agnostic but do not align topics across languages, as they have a fixed language-specific vocabulary which cannot be aligned cross-lingually after training. Polylingual topic models (Mimno et al., 2009), however, enable users to consider multilingual corpora, and to discover and align topics across languages. Recent work has demonstrated the effectiveness of deep transformer-based language models for encoding text documents across a wide variety of applications (Xia et al., 2020). Furthermore, when trained on multilingual corpora, they have been able to discover cross-lingual alignments despite the lack of explicit cross-lingual links (Wu and Dredze, 2019). Models such as multilingual BERT (mBERT; Devlin et al., 2018) or XLM-RoBERTa (XLM-R; Conneau et al., 2019) can produce a representation of text in a shared subspace across multiple input languages, suitable for both monolingual and multilingual settings, including zero-shot language transfer (Pires et al., 2019). Simultaneously, topic models have increasingly incorporated neural components. This has included inference networks which learn representations of the input document (Miao et al., 2017; Srivastava and Sutton, 2017) that improve over using bags of words directly, as well as replacing bags of words with contextual representations. In particular, the latter allows topic models to benefit from pre-training on large corpora. For example, contextualized topic models (CTMs; Bianchi et al., 2020a) use autoencoded contextual sentence representations of input documents. An intriguing advantage of using encoders in topic models is their latent multilinguality. Polylingual topic models (Mimno et al., 2009) are lightweight in their cross-lingual supervision to align topics across languages, but they nonetheless require some form of cross-lingual alignment.
While the diversity of resources and approaches for training polylingual topic models enables us to consider many language pairs and domains, there may be cases where existing resources cannot support an intended use case. Can topic models become polylingual models by relying on multilingual encoders even without additional alignments? Bianchi et al. (2020a) show that CTMs based on contextual sentence representations enable zero-shot cross-lingual topic transfer. While promising, this line of work omits a key step in using contextualized embeddings: fine-tuning. It has been widely observed that task-specific fine-tuning of pre-trained embeddings, even with a small amount of supervised data, can significantly improve performance on many tasks, including in zero- and few-shot settings (Howard and Ruder, 2018; Wu and Dredze, 2019). However, in the case of unsupervised topic modeling, from where are we to obtain task-specific supervised training data?

We propose an investigation of how supervision should be bootstrapped to improve language encoders for monolingual and polylingual topic model learning. We also propose a set of experiments to better understand why certain forms of supervision are effective in this unsupervised task. Our contributions include the following:

1. We fine-tune contextualized sentence embeddings on various established auxiliary tasks, finding that many different tasks can be used to improve downstream topic quality and zero-shot topic model transfer.
2. We construct fine-tuning supervision for sentence embeddings through a proposed topic classification task, showing further improved topic coherence. This task uses only the data on which we perform topic modeling.
3. We integrate a topic classification objective directly into the neural topic model architecture (without fine-tuning the embeddings) to understand whether the embeddings or the topic classification objective is responsible for performance improvements. We find that this approach improves topic quality but has little effect on cross-language topic transfer.

We present results for both monolingual topic models and cross-lingual topic transfer from English to French, German, Portuguese, and Dutch. Our code, including instructions for replicating our dataset and experimental setup, is publicly available on GitHub.

Neural Topic Models
Neural topic models (NTMs) are defined by their parameterization with (deep) neural networks or their incorporation of neural elements. This approach has become practical largely due to advances in variational inference, specifically variational autoencoders (VAEs; Kingma and Welling, 2013). The Neural Variational Document Model (Miao et al., 2016) and Gaussian Softmax Model (Miao et al., 2017) rely on amortized variational inference to approximate the posterior (Zhao et al., 2017; Krishnan et al., 2018). As these methods employ Gaussian priors, they use softmax transforms to ensure non-negative samples. Another approach has used ReLU transforms (Ding et al., 2018). Conversely, ProdLDA (Srivastava and Sutton, 2017) uses a Dirichlet prior that produces non-negative samples which do not need to be transformed. ProdLDA uses an inference network with a VAE to map from an input bag of words to a continuous latent representation. The decoder network samples from this hidden representation to form latent topic representations. Bags of words are reconstructed for each latent space; these constitute the output topics.
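To make the ProdLDA-style architecture described above more concrete, the following is a minimal schematic sketch in PyTorch. It is not the authors' implementation: the layer sizes, the simple Gaussian KL term (standing in for ProdLDA's Laplace approximation to the Dirichlet prior), and all names are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProdLDASketch(nn.Module):
    """Schematic ProdLDA-style VAE: bag of words in, reconstructed word distribution out."""

    def __init__(self, vocab_size: int, num_topics: int, hidden: int = 100):
        super().__init__()
        # Inference (encoder) network: bag of words -> variational parameters.
        self.encoder = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
        )
        self.mu = nn.Linear(hidden, num_topics)
        self.logvar = nn.Linear(hidden, num_topics)
        # Decoder: topic proportions -> logits over the vocabulary (topic-word matrix).
        self.beta = nn.Linear(num_topics, vocab_size, bias=False)

    def forward(self, bow: torch.Tensor):
        h = self.encoder(bow)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterized sample from the approximate posterior.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        theta = F.softmax(z, dim=-1)                      # document-topic proportions
        log_recon = F.log_softmax(self.beta(theta), dim=-1)  # reconstructed word distribution
        recon_loss = -(bow * log_recon).sum(-1)              # reconstruction term of the ELBO
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # simplified KL to the prior
        return (recon_loss + kl).mean(), theta
```

A CTM keeps this decoder and bag-of-words reconstruction target but feeds a (projected) SBERT document embedding into the encoder in place of the bag of words.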
Others have reported that ProdLDA is the best-performing NTM with respect to topic coherence (Miao et al., 2017). Contextualized topic models (CTMs; Bianchi et al., 2020a,b) extend ProdLDA by replacing the input bag of words with Sentence-BERT (SBERT; Reimers and Gurevych, 2019) embeddings. If the SBERT embeddings are based on a multilingual model such as mBERT (Devlin et al., 2018) or XLM-R (Conneau et al., 2019), then the topic model becomes implicitly polylingual due to the unsupervised alignments induced between languages during pre-training. This is distinct from how polylinguality is induced in approaches based on Latent Dirichlet Allocation (LDA; Blei et al., 2003), which require some form of cross-lingual alignments (Mimno et al., 2009). Using embeddings in topic models is not new (Das et al., 2015; Liu et al., 2015; Li et al., 2016). While a few recent approaches have leveraged word embeddings for topic modeling (Gupta et al., 2019; Dieng et al., 2020; Sia et al., 2020), none of these have investigated cross-lingual topic transfer.

Polylingual Topic Models
Polylingual topic models require some form of cross-lingual alignments, which can come from comparable documents (Mimno et al., 2009), word alignments (Zhao and Xing, 2006), multilingual dictionaries (Jagarlamudi and Daumé, 2010), code-switched documents (Peng et al., 2014), or other distant alignments such as anchors (Yuan et al., 2018). Work on incomparable documents with soft document links (Hao and Paul, 2018) still relies on dictionaries. While these types of alignments have been common in multilingual learning (Ruder et al., 2019b), they no longer represent the state of the art. More recent approaches instead tend to employ large pre-trained multilingual models (Wu and Dredze, 2019) that induce unsupervised alignments between languages during pre-training.

Fine-tuning is known to improve an encoder's representations for a specific task when data directly related to the task is present (Howard and Ruder, 2018; Wu and Dredze, 2019). Nonetheless, this requires supervised data, which is absent in unsupervised tasks like ours. We consider several approaches to create fine-tuning supervision for topic modeling. In the absence of supervised training sets, transfer learning can be used to learn from one supervised task (or many tasks, in the case of meta-learning) for improvements on another (Ruder et al., 2019a). While transfer is typically performed from a pre-trained masked language model to downstream fine-tuning tasks, transfer can also be performed from one fine-tuning task to another. The aim is for the auxiliary task to induce representations similar to those needed for the target task. What task can serve as an effective auxiliary task for topic modeling? We turn to document classification, the task of identifying the primary topic present in a document from a fixed set of (typically human-identified and human-labeled) topics. We may not have a document classification dataset from the same domain as the topic modeling corpus, nor a dataset which uses the same topics as those present in the corpus. However, fine-tuning could teach the encoder to produce topic-level document representations, regardless of the specific topics present in the data. We use MLDoc (Schwenk and Li, 2018), a multilingual news document classification dataset, and fine-tune on English. For comparison, we fine-tune on a natural language inference (NLI) task.
While it is not closely related to topic modeling, this task is a popular choice for fine-tuning both word and sentence representations. This allows us to measure how much task relatedness matters for fine-tuning.

The auxiliary tasks use data from a different domain (and task) than the domain of interest for the topic model. Can we bootstrap more direct supervision on our data? We employ an LDA-based topic model to produce a form of topic supervision. We first run LDA on the target corpus to generate topic distributions for each document. Then, we use the inferred topic distributions as supervision by labeling each document with its most probable topic. We fine-tune on this data as we did for the document classification task; the setup is identical except for how the labels are obtained. The advantage of this method is that LDA topics can be created for any corpus.

Gururangan et al. (2020) advocated for adapting an encoder to the domain on which one will later fine-tune. This is done by performing continued pre-training over in-domain data using the masked language modeling (MLM) objective. Because continued pre-training requires no task-specific supervision, and because topic modeling implies a sizeable corpus of in-domain documents, we consider continued pre-training on the target corpus as another approach to adapting an encoder. As continued pre-training can be done before fine-tuning, we also try doing both. Does topic classification improve performance because fine-tuning itself induces better representations for topic modeling, or because the model has been exposed to in-domain data and/or supervision directly from the target corpus before topic modeling? Continued pre-training on the target corpus may allow us to answer this question, and it provides a further approach for adapting encoders to specific domains.

Both continued pre-training and fine-tuning provide supervision for our target task, but both create dependence on a pipeline: we must train and/or fine-tune sentence embeddings, then train a neural topic model using the modified embeddings. However, we can combine the topic classification task and topic modeling into a single end-to-end procedure by modifying the inference network of the CTM. This is similar to the architecture of Bianchi et al. (2020a), but with an added fully-connected layer and softmax to produce a topic classification for the input document from its hidden representation. Note that we do not necessarily expect this architecture to outperform fine-tuning sentence embeddings: rather, it allows us to ablate over the location of the topic classification objective, determining whether improvements in topic quality and/or transfer are due to improved sentence embeddings induced by fine-tuning or due to the topic classification task itself. We use the negative log-likelihood loss between the topic predicted by LDA (which we treat as the true label) and the topic predicted by our model, adding this loss term (weighted by a hyperparameter λ) to the contextualized topic model's loss function. Thus, the new loss becomes
$\mathcal{L}_{\mathrm{TCCTM}} = \mathcal{L}_{\mathrm{ELBO}} + \lambda \mathcal{L}_{\mathrm{NLL}},$
where $\mathcal{L}_{\mathrm{ELBO}}$ is the negated evidence lower bound objective of the CTM and $\mathcal{L}_{\mathrm{NLL}}$ is the negative log-likelihood loss over topic classifications. We refer to this as the topic classification contextualized topic modeling (TCCTM) loss, denoted $\mathcal{L}_{\mathrm{TCCTM}}$.
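A minimal sketch of this combined objective is shown below, assuming a PyTorch CTM whose forward pass exposes its ELBO loss and the VAE's hidden document representation. The names `tcctm_loss`, `ctm_elbo_loss`, and `TopicClassificationHead` are hypothetical, not the authors' code; they only illustrate how the two loss terms are summed.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopicClassificationHead(nn.Module):
    """Predicts the LDA-derived topic label from the CTM's hidden document representation."""

    def __init__(self, hidden_dim: int, num_lda_topics: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_lda_topics)

    def forward(self, hidden):
        # Log-probabilities over LDA topic labels.
        return F.log_softmax(self.classifier(hidden), dim=-1)

def tcctm_loss(ctm_elbo_loss, hidden, lda_labels, head, lam=1.0):
    """L_TCCTM = L_ELBO + lambda * L_NLL, with the NLL computed against LDA-assigned labels."""
    nll = F.nll_loss(head(hidden), lda_labels)
    return ctm_elbo_loss + lam * nll
```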
TCCTM modifies the topic model, but not the embeddings. This approach is therefore orthogonal to fine-tuning, and the two approaches can be combined; thus, we test the performance of TCCTM with and without fine-tuning.

Data
We begin by creating a multilingual dataset for topic modeling based on aligned Wikipedia articles extracted from the Wikipedia Comparable Corpora in English, French, German, Portuguese, and Dutch. We use 100,000 English articles for training the topic models and evaluating monolingual topic coherence. We also extract 100,000 aligned articles for each language to build comparable vocabularies for preprocessing the test data. For each language, we use a vocabulary of the 5,000 most frequent word types (case-insensitive), excluding stopwords (25,000 types total). We use the English training articles to evaluate monolingual topic quality, and hold out for cross-lingual evaluation a set of 10,000 aligned test articles per language. For out-of-domain topic classification, we use a dataset of COVID academic articles (in English). To facilitate comparison with the Wikipedia dataset, we extract 100,000 articles and use a vocabulary size of 5,000.

To obtain topic labels for each English document, we run LDA for 400 iterations and choose the number of topics τ by performing a search over {10, 20, ..., 250}, optimizing over NPMI coherence. We find that τ ∈ {100, 110, 120} is best and use τ = 100 here. We label each document with its most probable topic by counting the number of tokens in the document that appear in the top-10 token list for each topic, then taking the argmax. We perform the same procedure on the out-of-domain COVID dataset to generate out-of-domain topic classification supervision, finding that τ = 80 is best on this dataset with respect to NPMI coherence.

For the document classification task, we use MLDoc (Schwenk and Li, 2018), a multilingual news dataset; we fine-tune on the English data. For NLI, we follow Reimers and Gurevych (2020) in using a mixture of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), both of which only contain English data.

Training Details
We consider embeddings produced by both mBERT (Devlin et al., 2018) and XLM-R (Conneau et al., 2019). For fine-tuning, we append to these models a fully-connected layer followed by a softmax, using a negative log-likelihood loss for topic/document classification. We perform a search over the number of epochs in the range [1, 8], optimizing over downstream NPMI coherence during topic modeling. We follow the procedure of Reimers and Gurevych (2019) to create sentence embedding models from contextual word representations: we mean-pool word embeddings for two sentences simultaneously, feeding these as inputs to a softmax classifier. We use batch size 16; other hyperparameters are kept from Reimers and Gurevych (2019). For NLI fine-tuning, we follow the procedure and use the hyperparameters of Reimers and Gurevych (2020): we first fine-tune monolingual BERT on SNLI and MultiNLI; the embeddings are pooled during fine-tuning to create a sentence embedding model. We then perform a knowledge distillation step from the monolingual SBERT model to XLM-R or mBERT.

Continued pre-training is performed by training with the MLM objective on English Wikipedia. We run for 1 epoch, using gradient accumulation to achieve an effective batch size of 256. We can pool the embeddings from the resulting model directly or perform fine-tuning after continued pre-training.
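The sketch below illustrates the supervision-bootstrapping recipe described above under stated assumptions: it (a) labels each document with its most probable LDA topic by counting hits against each topic's top-10 token list, and (b) fine-tunes a mean-pooled XLM-R encoder with a fully-connected layer and softmax under a negative log-likelihood loss. The toy corpus, topic count, learning rate, and function names are illustrative only; the authors' released code should be treated as the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from transformers import AutoModel, AutoTokenizer

# --- (a) Bootstrap topic labels with LDA -----------------------------------
docs = [
    ["soccer", "match", "goal", "league"],
    ["election", "party", "vote", "parliament"],
] * 50  # toy stand-in for the preprocessed Wikipedia corpus

vocab = Dictionary(docs)
bows = [vocab.doc2bow(d) for d in docs]
lda = LdaModel(bows, id2word=vocab, num_topics=10, iterations=400)  # the paper uses tau = 100

# Top-10 token set for each LDA topic.
top10 = [{w for w, _ in lda.show_topic(k, topn=10)} for k in range(lda.num_topics)]

def label(doc_tokens):
    # Count how many tokens fall in each topic's top-10 list; the argmax is the pseudo-label.
    hits = [sum(t in s for t in doc_tokens) for s in top10]
    return max(range(len(hits)), key=hits.__getitem__)

labels = torch.tensor([label(d) for d in docs])

# --- (b) Fine-tune a mean-pooled encoder on the bootstrapped labels --------
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base")
clf = nn.Linear(encoder.config.hidden_size, lda.num_topics)
opt = torch.optim.AdamW(list(encoder.parameters()) + list(clf.parameters()), lr=2e-5)

def train_step(texts, y):
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch).last_hidden_state                 # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    pooled = (out * mask).sum(1) / mask.sum(1)                # mean pooling over tokens
    loss = F.nll_loss(F.log_softmax(clf(pooled), dim=-1), y)  # NLL over topic labels
    loss.backward(); opt.step(); opt.zero_grad()
    return loss.item()

train_step(["A short article about a soccer match."], labels[:1])
```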
When topic modeling, we run the CTM for 60 epochs, using an initial learning rate of $2 \times 10^{-3}$, dropout 0.2, and batch size 64. The VAE consists of two hidden layers of dimensionality 100 (as in Srivastava and Sutton 2017 and Bianchi et al. 2020b). The ProdLDA baseline is run using the same hyperparameters and the same architecture as a CTM, differing only in using bags of words as input instead of SBERT representations. For the LDA baseline, we employ MalletLDA (McCallum, 2002) as implemented in the gensim wrapper, running for 400 iterations on the Wikipedia data using τ = 100. We tune λ in the TCCTM objective over {0.1, 0.2, ..., 3.0}, finding that λ = 1.0 yields the best downstream topic coherence for the target Wikipedia data. We try TCCTM based on non-fine-tuned sentence embeddings, as well as models fine-tuned on document classification or NLI. We do not perform this approach on a model fine-tuned on in-domain topic classification, to avoid overfitting and confounds from performing the same task in multiple stages of the model.

To evaluate topic quality, we measure normalized pointwise mutual information (NPMI) coherence on the English Wikipedia dataset. NPMI is used because it is comparable across architectures and objectives, and because it tends to correlate better with human judgments of topic quality (Lau et al., 2014). While perplexity has been used to evaluate LDA (Blei et al., 2003) as well as neural topic models in the past (Miao et al., 2017), it is not comparable across different objective functions when using neural approaches (as it depends on the test loss) and tends to correlate poorly with human judgments (Chang et al., 2009). Topic significance ranking (AlSumait et al., 2009) has been used to measure and rank topics by semantic importance and relevance, though we care more about overall topic quality than about ranking topics.

As the contextualized topic model is based on a multilingual encoder, it is able to generate $\theta_i$ (a topic distribution for document $i$) given the input embedding $h_i$ of a document in any language it has seen. To evaluate multilingual generalization, we measure the proportion of aligned test documents for which the most probable English topic (the argmax of $\theta_i^{\mathrm{English}}$) is the same as the most probable target-language topic (the argmax of $\theta_i^{\mathrm{Target}}$); we call this the Match metric. We also measure the KL divergence between topic distributions, $D_{\mathrm{KL}}(\theta_i^{\mathrm{English}} \,\|\, \theta_i^{\mathrm{Target}})$, taking the mean over all aligned documents; we call this the KL metric. We construct a random baseline by randomly shuffling the English articles and then computing both metrics against the newly unaligned foreign articles.
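A minimal sketch of these two transfer metrics follows, assuming the English and target-language topic distributions for aligned documents are stored row-wise in two arrays; the array names and helper functions are illustrative, not the authors' evaluation code.

```python
import numpy as np
from scipy.stats import entropy

def match_and_kl(theta_en: np.ndarray, theta_tgt: np.ndarray):
    """theta_en, theta_tgt: (num_docs, num_topics) topic distributions for aligned documents."""
    # Match: fraction of documents whose most probable topic agrees across languages.
    match = float(np.mean(theta_en.argmax(axis=1) == theta_tgt.argmax(axis=1)))
    # KL: mean KL divergence D_KL(theta_en || theta_tgt) over aligned documents.
    kl = float(np.mean([entropy(p, q) for p, q in zip(theta_en, theta_tgt)]))
    return match, kl

def random_baseline(theta_en, theta_tgt, seed=0):
    # Shuffle the English rows so documents are no longer aligned, then recompute both metrics.
    rng = np.random.default_rng(seed)
    return match_and_kl(theta_en[rng.permutation(len(theta_en))], theta_tgt)
```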
We compare topic coherences on the 100,000 English Wikipedia articles for LDA and ProdLDA baselines, a CTM with no fine-tuning, a CTM with continued pre-training (CPT), and the integrated TCCTM model. We also compare the effect of fine-tuning (FT) on the NLI task, on a document classification task (MLDoc), and on labels from LDA for the out-of-domain COVID dataset and for the in-domain Wikipedia data (Table 1).

(Table 1 caption, partially recovered: results for the CTM variants and the TCCTM model on the English Wikipedia dataset, presented with and without fine-tuning for XLM-R and mBERT-based sentence embeddings. The right side of the table indicates whether each setup is based on a neural architecture, whether the SBERT embeddings are fine-tuned before topic modeling, whether the topic classification task/objective is present, and whether the embeddings have been trained or fine-tuned on the same data later used for topic modeling.)

The baseline LDA and ProdLDA models both achieve the same coherence score of 0.129. Compared to these baselines, models based on contextualized representations always achieve higher topic coherence. We find that when using a base CTM without modifying its objective, fine-tuning on any auxiliary task improves topic quality for CTMs. Specifically, fine-tuning on in-domain topic classification data is best for monolingual topic modeling, followed closely by document classification on MLDoc. Topic classification on the out-of-domain COVID data results in the same topic coherence scores as document classification, indicating that topic classification is an effective method for bootstrapping supervision, even compared to established document classification datasets with human-labeled documents. The further gains in topic coherence when fine-tuning on Wikipedia topic classification data may be due to the data being in-domain, rather than due to the topic classification task. Fine-tuning on NLI yields less coherent topics than document or topic classification. For any given approach, XLM-R always outperforms mBERT.

We find that CPT without fine-tuning performs worse than simply fine-tuning, but better than a CTM using embeddings which are not fine-tuned. Fine-tuning after performing continued pre-training (CPT+FT) slightly improves NPMI over CPT alone, but still results in less coherent topics than if we simply fine-tune on the in-domain Wikipedia data or the out-of-domain COVID data. Thus, the MLM objective seems to induce representations that are not conducive to topic modeling. Indeed, fine-tuning on any task is better than continuing to train the encoder on the exact data later used for the CTM. This means that the effectiveness of topic classification cannot be attributed solely to the model seeing in-domain data before topic modeling; rather, some property of fine-tuning itself is better at inducing representations conducive to topic modeling.

Conversely, the TCCTM approach using non-fine-tuned embeddings produces more coherent topics than all fine-tuning tasks except topic classification on in-domain Wikipedia data. This means that the topic classification task itself is also responsible for the high topic coherence observed, and not just the fine-tuned sentence embeddings. Nonetheless, topic classification is more effective when used to fine-tune sentence embeddings rather than as a part of the CTM objective, further cementing the importance of embeddings to topic quality. There seems to be interference (or perhaps overfitting) when combining TCCTM with embeddings fine-tuned on other tasks. Indeed, fine-tuning on document classification and NLI results in slightly less coherent topics than simply using TCCTM on non-fine-tuned sentence embeddings. Perhaps this could be mitigated with task-specific tuning of λ; we leave this to future work.

Table 2 presents results for zero-shot cross-lingual topic transfer. All models, including those without fine-tuning, are far better than random chance on both metrics. This indicates that multilingual encoders contain enough cross-lingual alignment as-is to induce cross-lingual topic alignment. Nonetheless, we also find that fine-tuning the embeddings on any task produces better multilingual topic alignments than not fine-tuning; NLI consistently shows the best cross-lingual transfer.
Document classification is generally a worse fine-tuning task than topic classification for cross-lingual transfer, despite achieving similar monolingual performance. When performing continued pre-training without fine-tuning, we find that results tend to be comparable to the CTM without fine-tuning, though slightly better. When performing both continued pre-training and fine-tuning, we achieve only slightly higher results compared to simply fine-tuning; thus, in both monolingual and multilingual settings, the fine-tuning task is more important for topic transfer than seeing in-domain data or having a better in-domain language model.

The TCCTM objective alone produces fairly poor multilingual topic alignments, despite its positive effect in monolingual contexts; however, it consistently performs effective cross-lingual transfer when paired with sentence embeddings fine-tuned on document classification. When paired with embeddings fine-tuned on NLI, TCCTM achieves almost the same scores as the CTM model using the same embeddings. Thus, the fine-tuning task used for the sentence embeddings is the most important factor for cross-lingual transfer.

Correlation with Existing Benchmarks
To further investigate the role of fine-tuning in inducing better transfer, we employ the Semantic Textual Similarity (STS) benchmark (Cer et al., 2017); this has been used to evaluate the quality of sentence embeddings more broadly in previous work (Reimers and Gurevych, 2019, 2020). Performance is evaluated by measuring the Spearman correlation between the cosine similarity of sentence representations and gold labels for the sentence similarity tasks contained in STS. Here, we try correlating this metric with measures of topic quality, as well as with topic transfer (Figure 2).

(Figure 2 caption: Performance on the Semantic Textual Similarity (STS) benchmark (Spearman correlation between the cosine similarity of sentence representations and gold labels for STS tasks) versus mean Match and KL per language, for sentence embedding models fine-tuned on various tasks (all with XLM-R-based sentence embeddings). The outlier is the XLM-R model fine-tuned on NLI, as it was explicitly designed and tuned for STS (Reimers and Gurevych, 2020). We do not include TCCTM as it does not modify sentence embeddings.)

While STS does not correlate strongly with NPMI (ρ = 0.46, p > 0.1), it correlates very well with both Match and KL (ρ = 0.93 and ρ = 0.96, respectively, with p < 0.005 for both). This implies that well-tuned sentence embeddings are not necessarily the most important factor in producing good topics, but they are quite important for cross-lingual transfer. However, cross-lingual transfer performance saturates quickly at STS Spearman coefficients over 55, such that an increase of over 50% in STS results in only an 8% increase in Match and a 4% reduction in KL. Thus, one could perhaps trade off STS for better cross-lingual transfer at scores above this threshold. We leave this to future work. We find further evidence for STS's weak correlation with NPMI and its strong correlation with Match and KL when observing the performance of TCCTM: it does not modify the sentence embeddings, so one would expect TCCTM to perform similarly to the regular CTM if sentence embeddings are of primary importance. This is not the case for NPMI, as TCCTM seems to greatly improve topic quality when using a non-fine-tuned model and to have a slightly negative effect when using a fine-tuned model.
However, cross-lingual TCCTM performance is consistently comparable to CTM performance with respect to Match and KL when the fine-tuning datasets are the same.

Why is fine-tuning important for cross-lingual transfer? Figure 3 displays confusion matrices comparing the topics obtained in English versus those obtained in French for the same documents, using both the CTM (not fine-tuned) and CTM+FT (NLI) models. We present confusion matrices for all target languages in Appendix A. When the embeddings are not fine-tuned, a typical pattern of error is that the CTM assigns foreign documents topics from a small subset of the 100 available topics, regardless of the actual content of the document; this is indicated by the frequency of vertical striping in the confusion matrix. After fine-tuning, errors look more evenly distributed across topics and less frequent in general, though there is still slight striping at topic 81. This striping also occurs after fine-tuning at topic 81 for Portuguese and (to a smaller extent) Dutch, but not for German. Thus, CTMs trained on monolingual data are prone to assigning foreign documents topics from a small subset of the available topics, but this can be heavily mitigated with well-tuned sentence embeddings.

What kinds of topics have high cross-lingual precision, and which have lower precision? We calculate the mean per-topic precision of cross-lingual topic transfer from English to all other target languages using the CTM+FT (NLI) model, finding that topics which are more qualitatively coherent tend to have higher cross-lingual precision. Topics that are less semantically clear, or which compete with similar topics, tend to exhibit more cross-lingual variance. Examples of the highest- and lowest-precision topics may be found in Table 3. We sometimes observe competing topics which semantically overlap. In our dataset, this typically occurs for short articles which describe small towns and obscure places, such as in the bottom example of Table 3; topics 51 and 21 appear most frequently for these articles. Many instances of topics 81 and 89 (the lowest-precision topics in our dataset) also occur in short articles about small towns or obscure places; we hypothesize that this is often due to the probability mass of more relevant topics being split, thus allowing these topics, which contain generally higher-probability tokens, to be assigned.

In monolingual settings, the best topics are achieved through contextualized topic modeling using sentence embeddings fine-tuned on the topic classification task. This holds whether the topic classification objective is used during fine-tuning or integrated into the CTM itself. However, in zero-shot polylingual settings, it is far more important to fine-tune sentence embeddings on any task than to have seen in-domain data during pre-training or to use the topic classification objective. As the topic classification task can be performed on any corpus which has enough documents for topic modeling, supervision for this task is always available; this supervision bootstrapping can therefore serve as a simple way to increase topic quality and transfer for contextualized topic models in the absence of any other data, regardless of domain. There exists a weak but positive correlation between sentence embedding quality (as measured by the STS benchmark) and topic coherence, but a strong correlation between sentence embedding quality and cross-lingual topic transfer performance.
Nonetheless, these preliminary findings also suggest that transfer saturates quickly at quite low STS scores and that STS does not correlate well with topic quality, so we do not necessarily recommend directly optimizing over STS for neural topic modeling. Future work should investigate fine-tuning on multilingual datasets, as well as explicitly inducing cross-lingual topic alignments. Because the CTM currently generates topics in one language and then transfers them into other languages, it would also be beneficial to investigate methods of generating topics in parallel across languages during topic modeling.

Appendix A
Figures 4, 5, 6, and 7 present row-normalized confusion matrices comparing topic assignments for aligned documents in English and all other target languages. We present figures for CTMs based on non-fine-tuned embeddings (left) as well as embeddings fine-tuned on NLI (right). All embeddings are based on XLM-R. As the provided confusion matrices are row-normalized, they do not present the relative frequency of various topics in English. Thus, we present counts of the most probable topics for the English test documents according to a CTM based on non-fine-tuned embeddings and a CTM based on embeddings fine-tuned on NLI (Figure 8).

(Figure 8 caption: Counts of the most probable topics for each English document in the aligned test set, according to an XLM-R-based CTM with non-fine-tuned embeddings (left) and embeddings fine-tuned on NLI (right).)

Table 4 presents more sample documents for various high-precision topics. The lowest-precision topics all contain similar top tokens and error patterns to those in Table 3 (as topics 81 and 89 do), so we focus on displaying other types of topics which transfer well across languages. We find that topics relating to science, sports, places in specific countries, and entertainment transfer well. Perhaps this is due to shared vocabulary for these subjects, as these all contain either scientific terms or proper nouns which are orthographically identical cross-lingually. Or perhaps these subjects are frequently seen during pre-training, thus enabling more isomorphic representations to form around such subjects.

References
Topic significance ranking of LDA generative models
Pre-training is a hot topic: Contextualized document embeddings improve topic coherence
Cross-lingual contextualized topic models with zero-shot learning
Latent Dirichlet allocation
A large annotated corpus for learning natural language inference
Applications of topic models
SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation
Reading tea leaves: How humans interpret topic models
Unsupervised cross-lingual representation learning at scale
Gaussian LDA for topic models with word embeddings
BERT: Pre-training of deep bidirectional transformers for language understanding
Topic modeling in embedding spaces
Coherence-aware neural topic modeling
Ves Stoyanov, and Alexis Conneau. 2020. Self-training improves pre-training for natural language understanding
Document informed neural autoregressive topic models with distributional prior
Don't stop pretraining: Adapt language models to domains and tasks
Learning multilingual topics from incomparable corpora
Universal language model fine-tuning for text classification
Extracting multilingual topics from unaligned comparable corpora
Autoencoding variational Bayes
On the challenges of learning with inference networks on sparse, high-dimensional data
Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality
Topic modeling for short texts with auxiliary word embeddings
Tat-Seng Chua, and Maosong Sun
MALLET: A machine learning for language toolkit
Discovering discrete latent topics with neural variational inference
Neural variational inference for text processing
Polylingual topic models
Learning polylingual topic models from code-switched social media documents
How multilingual is multilingual BERT?
Sentence-BERT: Sentence embeddings using Siamese BERT-networks
Making monolingual sentence embeddings multilingual using knowledge distillation
A survey of cross-lingual word embedding models
A corpus for multilingual document classification in eight languages
Tired of topic models?
Neural variational inference for topic models
A broad-coverage challenge corpus for sentence understanding through inference
Beto, bentz, becas: The surprising cross-lingual effectiveness of BERT
Which BERT? A survey organizing contextualized encoders
Multilingual anchoring: Interactive topic modeling and alignment across languages
BiTAM: Bilingual topic AdMixture models for word alignment

(Table 4 excerpt, topic 41: wrestler, ring, wwe, heavyweight, professional; sample document (pt): "James Maritato é um lutador de wrestling profissional ítalo-americano" ["James Maritato is an Italian-American professional wrestler"].)

Acknowledgments
This material is based on work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. 1746891. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. We wish to thank Shuoyang Ding, Chu-Cheng Lin, and the reviewers for their helpful feedback on earlier drafts of this work.