Evaluating the Stability of Embedding-based Word Similarities

Maria Antoniak, Cornell University, maa343@cornell.edu
David Mimno, Cornell University, mimno@cornell.edu

Transactions of the Association for Computational Linguistics, vol. 6, pp. 107–119, 2018.

Abstract

Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. We show that these effects are more prominent for smaller training corpora. We recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora.

1 Introduction

Word embeddings are a popular technique in natural language processing (NLP) in which the words in a vocabulary are mapped to low-dimensional vectors. Embedding models are easily trained—several implementations are publicly available—and relationships between the embedding vectors, often measured via cosine similarity, can be used to reveal latent semantic relationships between pairs of words. Word embeddings are increasingly being used by researchers in unexpected ways and have become popular in fields such as digital humanities and computational social science (Hamilton et al., 2016; Heuser, 2016; Phillips et al., 2017).

Embedding-based analyses of semantic similarity can be a robust and valuable tool, but we find that standard methods dramatically under-represent the variability of these measurements. Embedding algorithms are much more sensitive than they appear to factors such as the presence of specific documents, the size of the documents, the size of the corpus, and even seeds for random number generators. If users do not account for this variability, their conclusions are likely to be invalid. Fortunately, we also find that simply averaging over multiple bootstrap samples is sufficient to produce stable, reliable results in all cases tested.

NLP research in word embeddings has so far focused on a downstream-centered use case, where the end goal is not the embeddings themselves but performance on a more complicated task. For example, word embeddings are often used as the bottom layer in neural network architectures for NLP (Bengio et al., 2003; Goldberg, 2017). The embeddings' training corpus, which is selected to be as large as possible, is only of interest insofar as it generalizes to the downstream training corpus.

In contrast, other researchers take a corpus-centered approach and use relationships between embeddings as direct evidence about the language and culture of the authors of a training corpus (Bolukbasi et al., 2016; Hamilton et al., 2016; Heuser, 2016). Embeddings are used as if they were simulations of a survey asking subjects to free-associate words from query terms. Unlike the downstream-centered approach, the corpus-centered approach is based on direct human analysis of nearest neighbors to embedding vectors, and the training corpus is not simply an off-the-shelf convenience but rather the central object of study.
© 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Downstream-centered | Corpus-centered
Big corpus | Small corpus, difficult or impossible to expand
Source is not important | Source is the object of study
Only vectors are important | Specific, fine-grained comparisons are important
Embeddings are used in downstream tasks | Embeddings are used to learn about the mental model of word association for the authors of the corpus

Table 1: Comparison of downstream-centered and corpus-centered approaches to word embeddings.

While word embeddings may appear to measure properties of language, they in fact only measure properties of a curated corpus, which could suffer from several problems. The training corpus is merely a sample of the authors' language model (Shazeer et al., 2016). Sources could be missing or over-represented, typos and other lexical variations could be present, and, as noted by Goodfellow et al. (2016), "Many datasets are most naturally arranged in a way where successive examples are highly correlated." Furthermore, embeddings can vary considerably across random initializations, making lists of "most similar words" unstable.

We hypothesize that training on small and potentially idiosyncratic corpora can exacerbate these problems and lead to highly variable estimates of word similarity. Such small corpora are common in digital humanities and computational social science, and it is often impossible to mitigate these problems simply by expanding the corpus. For example, we cannot create more 18th Century English books or change their topical focus.

We explore causes of this variability, which range from the fundamental stochastic nature of certain algorithms to more troubling sensitivities to properties of the corpus, such as the presence or absence of specific documents. We focus on the training corpus as a source of variation, viewing it as a fragile artifact curated by often arbitrary decisions. We examine four different algorithms and six datasets, and we manipulate the corpus by shuffling the order of the documents and taking bootstrap samples of the documents. Finally, we examine the effects of these manipulations on the cosine similarities between embeddings.

We find that there is considerable variability in embeddings that may not be obvious to users of these methods. Rankings of most similar words are not reliable, and both ordering and membership in such lists are liable to change significantly. Some uncertainty is expected, and there is no clear criterion for "acceptable" levels of variance, but we argue that the amount of variation we observe is sufficient to call the whole method into question. For example, we find cases in which there is zero set overlap in "top 10" lists for the same query word across bootstrap samples. Smaller corpora and larger document sizes increase this variation. Our goal is to provide methods to quantify this variability, and to account for this variability, we recommend that as the size of a corpus gets smaller, cosine similarities should be averaged over many bootstrap samples.

2 Related Work

Word embeddings are mappings of words to points in a K-dimensional continuous space, where K is much smaller than the size of the vocabulary. Reducing the number of dimensions has two benefits: first, large, sparse vectors are transformed into small, dense vectors; and second, the conflation of features uncovers latent semantic relationships between the words.
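These latent relationships are what the rest of the paper measures, almost always as the cosine similarity between two embedding vectors. As a minimal, hypothetical illustration (the four-dimensional vectors below are invented for the example; only the formula is meaningful):

    import numpy as np

    def cosine_similarity(a, b):
        # cos(a, b) = (a . b) / (||a|| * ||b||)
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical embeddings for two words; real vectors come from a trained model.
    v_marijuana = np.array([0.2, -0.1, 0.7, 0.4])
    v_cannabis = np.array([0.3, -0.2, 0.6, 0.5])
    print(cosine_similarity(v_marijuana, v_cannabis))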
These semantic relationships are usually measured via cosine similarity, though other metrics such as Euclidean distance and the Dice coefficient are possible (Turney and Pantel, 2010). We focus on four of the most popular training algorithms: Latent Semantic Analysis (LSA) (Deerwester et al., 1990), Skip-Gram with Negative Sampling (SGNS) (Mikolov et al., 2013), Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), and Positive Pointwise Mutual Information (PPMI) (Levy and Goldberg, 2014) (see Section 5 for more detailed descriptions of these algorithms).

In NLP, word embeddings are often used as features for downstream tasks. Dependency parsing (Chen and Manning, 2014), named entity recognition (Turian et al., 2010; Cherry and Guo, 2015), and bilingual lexicon induction (Vulic and Moens, 2015) are just a few examples where the use of embeddings as features has increased performance in recent years.

Increasingly, word embeddings have been used as evidence in studies of language and culture. For example, Hamilton et al. (2016) train separate embeddings on temporal segments of a corpus and then analyze changes in the similarity of words to measure semantic shifts, and Heuser (2016) uses embeddings to characterize discourse about virtues in 18th Century English text. Other studies use cosine similarities between embeddings to measure the variation of language across geographical areas (Kulkarni et al., 2016; Phillips et al., 2017) and time (Kim et al., 2014). Each of these studies seeks to reconstruct the mental model of authors based on documents.

An example that highlights the contrast between the downstream-centered and corpus-centered perspectives is the exploration of implicit bias in word embeddings. Researchers have observed that embedding-based word similarities reflect cultural stereotypes, such as associations between occupations and genders (Bolukbasi et al., 2016). From a downstream-centered perspective, these stereotypical associations represent bias that should be filtered out before using the embeddings as features. In contrast, from a corpus-centered perspective, implicit bias in embeddings is not a problem that must be fixed but rather a means of measurement, providing quantitative evidence of bias in the training corpus.

Embeddings are usually evaluated on direct use cases, such as word similarity and analogy tasks via cosine similarities (Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2015; Shazeer et al., 2016). Intrinsic evaluations like word similarities measure the interpretability of the embeddings rather than their downstream task performance (Gladkova and Drozd, 2016), but while some research does evaluate embedding vectors on their downstream task performance (Pennington et al., 2014; Faruqui et al., 2015), the standard benchmarks remain intrinsic.

There has been some recent work in evaluating the stability of word embeddings. Levy et al. (2015) focus on the hyperparameter settings for each algorithm and show that hyperparameters such as the size of the context window, the number of negative samples, and the level of context distribution smoothing can affect the performance of embeddings on similarity and analogy tasks.
Hellrich and Hahn (2016) examine the effects of word frequency, word ambiguity, and the number of training epochs on the reliability of embeddings produced by the SGNS and skip-gram hierarchical softmax (SGHS) (a variant of SGNS), striving for reproducibility and recommending against sampling the corpus in order to preserve stability. Likewise, Tian et al. (2016) explore the robustness of SGNS and GloVe embeddings trained on large, generic corpora (Wikipedia and news data) and propose methods to align these embeddings across different iterations.

In contrast, our goal is not to produce artificially stable embeddings but to identify the factors that create instability and measure our statistical confidence in the cosine similarities between embeddings trained on small, specific corpora. We focus on the corpus as a fragile artifact and source of variation, considering the corpus itself as merely a sample of possible documents produced by the authors. We examine whether the embeddings accurately model those authors, using bootstrap sampling to measure the effects of adding or removing documents from the training corpus.

3 Corpora

We collected two sub-corpora from each of three datasets (see Table 2) to explore how word embeddings are affected by size, vocabulary, and other parameters of the training corpus. In order to better model realistic examples of corpus-centered research, these corpora are deliberately chosen to be publicly available, suggestive of social research questions, varied in corpus parameters (e.g. topic, size, vocabulary), and much smaller than the standard corpora typically used in training word embeddings (e.g. Wikipedia, Gigaword). Each dataset was created organically, over specific time periods, in specific social settings, by specific authors. Thus, it is impossible to expand these datasets without compromising this specificity.

Corpus | Number of documents | Unique words | Vocabulary density | Words per document
NYT Sports (2000) | 8,786 | 12,475 | 0.0020 | 708
NYT Music (2000) | 3,666 | 9,762 | 0.0037 | 715
AskScience | 331,635 | 16,901 | 0.0012 | 44
AskHistorians | 63,578 | 9,384 | 0.0022 | 66
4th Circuit | 5,368 | 16,639 | 0.0014 | 2,281
9th Circuit | 9,729 | 22,146 | 0.0011 | 2,108

Table 2: Comparison of the number of documents, number of unique words (after removing words that appear fewer than 20 times), vocabulary density (the ratio of unique words to the total number of words), and the average number of words per document for each corpus.

We process each corpus by lowercasing all text, removing words that appear fewer than 20 times in the corpus, and removing all numbers and punctuation. Because our methods rely on bootstrap sampling (see Section 6), which operates by removing or multiplying the presence of documents, we also remove duplicate documents from each corpus.
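A minimal sketch of this preprocessing step is shown below. It assumes the raw documents are available as a list of strings and uses a simple regular-expression tokenizer; the paper does not specify its exact tokenization, and the helper name is ours:

    import re
    from collections import Counter

    def preprocess(raw_documents, min_count=20):
        # Lowercase, strip digits and punctuation, and tokenize on whitespace.
        docs = []
        for text in raw_documents:
            text = re.sub(r"[^a-z\s]", " ", text.lower())
            docs.append(text.split())
        # Remove exact duplicate documents so bootstrap sampling behaves as intended.
        seen, deduped = set(), []
        for tokens in docs:
            key = tuple(tokens)
            if key not in seen:
                seen.add(key)
                deduped.append(tokens)
        # Remove words that appear fewer than min_count times in the corpus.
        counts = Counter(t for tokens in deduped for t in tokens)
        return [[t for t in tokens if counts[t] >= min_count] for tokens in deduped]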
U.S. Federal Courts of Appeals
The U.S. Federal courts of appeals are regional courts that decide appeals from the district courts within their federal judicial circuit. We examine the embeddings of the most recent five years of the 4th and 9th circuits (opinions collected from https://www.courtlistener.com/). The 4th circuit contains Washington D.C. and surrounding states, while the 9th circuit contains the entirety of the west coast. Social science research questions might involve measuring a widely held belief that certain courts have distinct ideological tendencies (Broscheid, 2011). Such bias may result in measurable differences in word association due to framing effects (Card et al., 2015), which could be observable by comparing the words associated with a given query term. We treat each opinion as a single document.

New York Times
The New York Times (NYT) Annotated Corpus (Sandhaus, 2008) contains newspaper articles tagged with additional metadata reflecting their content and publication context. To constrain the size of the corpora and to enhance their specificity, we extract data only for the year 2000 and focus on only two sections of the NYT dataset: sports and music. In the resulting corpora, the sports section is substantially larger than the music section (see Table 2). We treat an article as a single document.

Reddit
Reddit (https://www.reddit.com/) is a social website containing thousands of forums (subreddits) organized by topic. We use a dataset containing all posts for the years 2007-2014 from two subreddits: /r/AskScience and /r/AskHistorians. These two subreddits allow users to post any question in the topics of history and science, respectively. AskScience is more than five times larger than AskHistorians, though the document length is generally longer for AskHistorians (see Table 2). Reddit is a popular data source for computational social science research; for example, subreddits can be used to explore the distinctiveness and dynamicity of communities (Zhang et al., 2017). We treat an original post as a single document.

4 Corpus Parameters

Order and presence of documents
We use three different methods to sample the corpus: FIXED, SHUFFLED, and BOOTSTRAP. The FIXED setting includes each document exactly once, and the documents appear in a constant, chronological order across all models. The purpose of this setting is to measure the baseline variability of an algorithm, independent of any change in input data. Algorithmic variability may arise from random initializations of learned parameters, random negative sampling, or randomized subsampling of tokens within documents. The SHUFFLED setting includes each document exactly once, but the order of the documents is randomized for each model. The purpose of this setting is to evaluate the impact of variation in how we present examples to each algorithm. The order of documents could be an important factor for algorithms that use online training such as SGNS. The BOOTSTRAP setting samples N documents randomly with replacement, where N is equal to the number of documents in the FIXED setting. The purpose of this setting is to measure how much variability is due to the presence or absence of specific sequences of tokens in the corpus. See Table 3 for a comparison of these three settings.

Setting | Method | Tests... | Run 1 | Run 2 | Run 3
Fixed | Documents in consistent order | variability due to algorithm (baseline) | A B C | A B C | A B C
Shuffled | Documents in random order | variability due to document order | A C B | B A C | C B A
Bootstrap | Documents sampled with replacement | variability due to document presence | B A A | C A B | B B B

Table 3: The three settings that manipulate the document order and presence in each corpus.
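The three settings amount to a simple sampling routine over the preprocessed documents. The sketch below is our illustration of the scheme, not the authors' code; it assumes the output of the `preprocess` helper from the earlier sketch and NumPy's random generator:

    import numpy as np

    def make_training_corpus(documents, setting, rng):
        # documents: list of token lists; setting: "fixed", "shuffled", or "bootstrap".
        if setting == "fixed":
            return list(documents)                   # same documents, same order, every run
        if setting == "shuffled":
            order = rng.permutation(len(documents))  # same documents, random order
            return [documents[i] for i in order]
        if setting == "bootstrap":
            idx = rng.integers(0, len(documents), size=len(documents))
            return [documents[i] for i in idx]       # N documents sampled with replacement
        raise ValueError(f"unknown setting: {setting}")

    # Example: 50 bootstrap corpora, one per model, from the preprocessed documents `docs`.
    rng = np.random.default_rng(0)
    corpora = [make_training_corpus(docs, "bootstrap", rng) for _ in range(50)]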
Size of corpus
We expect the stability of embedding-based word similarities to be influenced by the size of the training corpus. As we add more documents, the impact of any specific document should be less significant. At the same time, larger corpora may also tend to be more broad in scope and variable in style and topic, leading to less idiosyncratic patterns in word co-occurrence. Therefore, for each corpus, we curate a smaller sub-corpus that contains 20% of the total corpus documents. These samples are selected using contiguous sequences of documents at the beginning of each training corpus (this ensures that the FIXED setting remains constant).

Length of documents
We use two document segmentation strategies. In the first setting, each training instance is a single document (i.e. an article for the NYT corpus, an opinion from the Courts corpus, and a post from the Reddit corpus). In the second setting, each training instance is a single sentence. We expect this choice of segmentation to have the largest impact on the BOOTSTRAP setting. Documents are often characterized by "bursty" words that are locally frequent but globally rare (Madsen et al., 2005), such as the name of a defendant in a court case. Sampling whole documents with replacement should magnify the effect of bursty words: a rare but locally frequent word will either occur in a BOOTSTRAP corpus or not occur. Sampling sentences with replacement should have less effect on bursty words, since the chance that an entire document will be removed from the corpus is much smaller.

5 Algorithms

Evaluating all current embedding algorithms and implementations is beyond the scope of this work, so we select four categories of algorithms that represent distinct optimization strategies. Recall that our goal is to examine how algorithms respond to variation in the corpus, not to maximize performance in the accuracy or effectiveness of the embeddings.

The first category is online stochastic updates, in which the algorithm updates model parameters using stochastic gradients as it proceeds through the training corpus. All methods implemented in the word2vec and fastText packages follow this format, including skip-gram, CBOW, negative sampling, and hierarchical softmax (Mikolov et al., 2013). We focus on SGNS as a popular and representative example. The second category is batch stochastic updates, in which the algorithm first collects a matrix of summary statistics derived from a pass through the training data that takes place before any parameters are set, and then updates model parameters using stochastic optimization. We select the GloVe algorithm (Pennington et al., 2014) as a representative example. The third category is matrix factorization, in which the algorithm makes deterministic updates to model parameters based on a matrix of summary statistics. As a representative example we include PPMI (Levy and Goldberg, 2014). Finally, to test whether word order is a significant factor we include a document-based embedding method that uses matrix factorization, LSA (Deerwester et al., 1990; Landauer and Dumais, 1997).

These algorithms each include several hyperparameters, which are known to have measurable effects on the resulting embeddings (Levy et al., 2015). We have attempted to choose settings of these parameters that are commonly used and comparable across algorithms, but we emphasize that a full evaluation of the effect of each algorithmic parameter would be beyond the scope of this work. For each of the following algorithms, we set the context window size to 5 and the embedding size to 100. Since we remove words that occur fewer than 20 times during preprocessing of the corpus, we set the frequency threshold for the following algorithms to 0. For all other hyperparameters, we follow the default or most popular settings for each algorithm, as described in the following sections.
5.1 LSA

Latent semantic analysis (LSA) factorizes a sparse term-document matrix X (Deerwester et al., 1990; Landauer and Dumais, 1997). X is factored using singular value decomposition (SVD), retaining K singular values such that

$X \approx X_K = U_K \Sigma_K V_K^T.$

The elements of the term-document matrix are weighted, often with TF-IDF, which measures the importance of a word to a document in a corpus. The dense, low-rank approximation of the term-document matrix, X_K, can be used to measure the relatedness of terms by calculating the cosine similarity of the relevant rows of the reduced matrix.

We use the scikit-learn package (http://scikit-learn.org/) to train our LSA embeddings. We create a term-document matrix with TF-IDF weighting, using the default settings except that we add L2 normalization and sublinear TF scaling, which scales the importance of terms with high frequency within a document. We perform dimensionality reduction via a randomized solver (Halko et al., 2009).

The construction of the term-count matrix and the TF-IDF weighting should introduce no variation to the final word embeddings. However, we expect variation due to the randomized SVD solver, even when all other parameters (training document order, presence, size, etc.) are constant.

5.2 SGNS

The skip-gram with negative sampling (SGNS) algorithm (Mikolov et al., 2013) is an online algorithm that uses randomized updates to predict words based on their context. In each iteration, the algorithm proceeds through the original documents and, at each word token, updates model parameters based on gradients calculated from the current model parameters. This process maximizes the likelihood of observed word-context pairs and minimizes the likelihood of negative samples.

We use an implementation of the SGNS algorithm included in the Python library gensim (https://radimrehurek.com/gensim/models/word2vec.html) (Řehůřek and Sojka, 2010). We use the default settings provided with gensim except as described above.

We predict that multiple runs of SGNS on the same corpus will not produce the same results. SGNS randomly initializes all the embeddings before training begins, and it relies on negative samples created by randomly selecting word and context pairs (Mikolov et al., 2013; Levy et al., 2015). We also expect SGNS to be sensitive to the order of documents, as it relies on stochastic gradient descent, which can be biased to be more influenced by initial documents (Bottou, 2012).

5.3 GloVe

Global Vectors for Word Representation (GloVe) uses stochastic gradient updates but operates on a "global" representation of word co-occurrence that is calculated once at the beginning of the algorithm (Pennington et al., 2014). Words and contexts are associated with bias parameters, b_w and b_c, where w is a word and c is a context, learned by minimizing the cost function

$L = \sum_{w,c} f(x_{wc}) \left( \vec{w} \cdot \vec{c} + b_w + b_c - \log x_{wc} \right)^2.$

We use the GloVe implementation provided by Pennington et al. (2014) (http://nlp.stanford.edu/projects/glove/). We use the default settings provided with GloVe except as described above.

Unlike SGNS, the algorithm does not perform model updates while examining the original documents. As a result, we expect GloVe to be sensitive to random initializations but not sensitive to the order of documents.
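For concreteness, the LSA and SGNS configurations of Sections 5.1 and 5.2 correspond roughly to the following sketch. It is an approximation under stated assumptions (current scikit-learn and gensim APIs; term vectors taken as rows of V_K scaled by the singular values, one common convention), not the authors' exact code:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from gensim.models import Word2Vec

    def train_lsa(token_docs, k=100):
        # Term-document matrix with sublinear TF and L2 normalization, then randomized SVD.
        texts = [" ".join(tokens) for tokens in token_docs]
        tfidf = TfidfVectorizer(sublinear_tf=True, norm="l2")
        X = tfidf.fit_transform(texts)                  # shape: (n_documents, n_terms)
        svd = TruncatedSVD(n_components=k, algorithm="randomized")
        svd.fit(X)
        # Term vectors: rows of V_K scaled by the singular values (one common convention).
        term_vectors = (svd.components_ * svd.singular_values_[:, None]).T
        vocab = tfidf.get_feature_names_out()           # scikit-learn >= 1.0
        return dict(zip(vocab, term_vectors))

    def train_sgns(token_docs, k=100):
        # Skip-gram with negative sampling: window 5, dimension 100, frequency threshold 0.
        model = Word2Vec(sentences=token_docs, vector_size=k, window=5,
                         min_count=0, sg=1, negative=5)  # gensim >= 4; older versions use size=
        return model.wv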
5.4 PPMI

The positive pointwise mutual information (PPMI) matrix, whose cells represent the PPMI of each pair of words and contexts, is factored using singular value decomposition (SVD) and results in low-dimensional embeddings that perform similarly to GloVe and SGNS (Levy and Goldberg, 2014):

$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}; \qquad \mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0).$

To train our PPMI word embeddings, we use hyperwords (https://bitbucket.org/omerlevy/hyperwords/src), an implementation provided as part of Levy et al. (2015); we altered the PPMI code to remove a fixed random seed in order to introduce variability given a fixed corpus, and no other change was made. We follow the authors' recommendations and set the context distributional smoothing (cds) parameter to 0.75, the eigenvalue matrix (eig) to 0.5, the subsampling threshold (sub) to 10^-5, and the context window (win) to 5.

Like GloVe and unlike SGNS, PPMI operates on a pre-computed representation of word co-occurrence, so we do not expect results to vary based on the order of documents. Unlike both GloVe and SGNS, PPMI uses a stable, non-stochastic SVD algorithm that should produce the same result given the same input, regardless of initialization. However, we expect variation due to PPMI's random subsampling of frequent tokens.

6 Methods

To establish statistical significance bounds for our observations, we train 50 LSA models, 50 SGNS models, 50 GloVe models, and 50 PPMI models for each of the three settings (FIXED, SHUFFLED, and BOOTSTRAP), for each document segmentation size, for each corpus.

For each corpus, we select a set of 20 relevant query words from high probability words from an LDA topic model (Blei et al., 2003) trained on that corpus with 200 topics. We calculate the cosine similarity of each query word to the other words in the vocabulary, creating a similarity ranking of all the words in the vocabulary. We calculate the mean and standard deviation of the cosine similarities for each pair of query word and vocabulary word across each set of 50 models.

From the lists of queries and cosine similarities, we select the 20 words most closely related to the set of query words and compare the mean and standard deviation of those pairs across settings. We calculate the Jaccard similarity between top-N lists to compare membership change in the lists of most closely related words, and we find average changes in rank within those lists. We examine these metrics across different algorithms and corpus parameters.
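A condensed sketch of this procedure for one algorithm and the BOOTSTRAP setting is shown below. It reuses the hypothetical helpers from the earlier sketches (`preprocess`, `make_training_corpus`, `train_sgns`) and gensim-style keyed vectors; it illustrates the bootstrap-and-average recommendation rather than reproducing the full pipeline:

    from collections import defaultdict
    import numpy as np

    def neighbor_statistics(docs, query, n_models=50, rng=None):
        # Train an ensemble on bootstrap samples and aggregate the cosine similarity
        # between the query word and every other vocabulary word across models.
        rng = rng or np.random.default_rng(0)
        sims = defaultdict(list)
        for _ in range(n_models):
            corpus = make_training_corpus(docs, "bootstrap", rng)
            wv = train_sgns(corpus)          # any of the four algorithms could stand in here
            if query not in wv:              # a rare query may vanish from a bootstrap sample
                continue
            q = wv[query] / np.linalg.norm(wv[query])
            for word in wv.index_to_key:
                if word != query:
                    v = wv[word]
                    sims[word].append(float(q @ v / np.linalg.norm(v)))
        # Report the mean and standard deviation across the ensemble, not a single run.
        return {w: (np.mean(s), np.std(s)) for w, s in sims.items()}

    # Example: ten nearest neighbors of "marijuana" by mean similarity, with their spread.
    stats = neighbor_statistics(docs, "marijuana")
    for word, (mean, std) in sorted(stats.items(), key=lambda kv: -kv[1][0])[:10]:
        print(f"{word}\t{mean:.3f} ± {std:.3f}")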
7 Results

We begin with a case study of the framing around the query term marijuana. One might hypothesize that the authors of various corpora (e.g. judges of the 4th Circuit, journalists at the NYT, and users on Reddit) have different perceptions of this drug and that their language might reflect those differences.

Indeed, after qualitatively examining the lists of most similar terms (see Table 4), we might come to the conclusion that the allegedly conservative 4th Circuit judges view marijuana as similar to illegal drugs such as heroin and cocaine, while Reddit users view marijuana as closer to legal substances such as nicotine and alcohol.

Figure 1: The mean standard deviations across settings and algorithms for the 10 closest words to the query words in the 9th Circuit and NYT Music corpora using the whole documents. Larger variations indicate less stable embeddings.

However, we observe patterns that cause us to lower our confidence in such conclusions. Table 4 shows that the cosine similarities can vary significantly. We see that the top ranked words (chosen according to their mean cosine similarity across runs of the FIXED setting) can have widely different mean similarities and standard deviations depending on the algorithm and the three training settings, FIXED, SHUFFLED, and BOOTSTRAP.

[Table 4 is a grid of panels: one row per algorithm (LSA, SGNS, GloVe, PPMI), one column per corpus (4th Circuit, NYT Sports, Reddit AskScience), each plotting the cosine similarities under the fixed, shuffled, and bootstrap settings.]

Table 4: The most similar words with their means and standard deviations for the cosine similarities between the query word marijuana and its 10 nearest neighbors (highest mean cosine similarity in the FIXED setting). Embeddings are learned from documents segmented by sentence.

As expected, each algorithm has a small variation in the FIXED setting. For example, we can see the effect of the random SVD solver for LSA and the effect of random subsampling for PPMI. We do not observe a consistent effect for document order in the SHUFFLED setting. Most importantly, these figures reveal that the BOOTSTRAP setting causes large increases in variation across all algorithms (with a weaker effect for PPMI) and corpora, with large standard deviations across word rankings. This indicates that the presence of specific documents in the corpus can significantly affect the cosine similarities between embedding vectors.

GloVe produced very similar embeddings in both the FIXED and SHUFFLED settings, with similar means and small standard deviations, which indicates that GloVe is not sensitive to document order. However, the BOOTSTRAP setting caused a reduction in the mean and widened the standard deviation, indicating that GloVe is sensitive to the presence of specific documents.

Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7
viability | fetus | trimester | surgery | trimester | pregnancies | abdomen
pregnancies | pregnancies | surgery | visit | surgery | occupation | tenure
abortion | gestation | visit | therapy | incarceration | viability | stepfather
abortions | kindergarten | tenure | pain | visit | abortion | wife
fetus | viability | workday | hospitalization | arrival | tenure | groin
gestation | headaches | abortions | neck | pain | visit | throat
surgery | pregnant | hernia | headaches | headaches | abortions | grandmother
expiration | abortion | summer | trimester | birthday | pregnant | daughter
sudden | pain | suicide | experiencing | neck | birthday | panic
fetal | bladder | abortion | medications | tenure | fetus | jaw

Table 5: The 10 closest words to the query term pregnancy are highly variable. None of the words shown appear in every run. Results are shown across runs of the BOOTSTRAP setting for the full corpus of the 9th Circuit, the whole document size, and the SGNS model.

Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7
selection | selection | selection | selection | selection | selection | selection
genetics | process | human | darwinian | convergent | evolutionary | darwinian
convergent | darwinian | humans | theory | darwinian | humans | nature
process | humans | natural | genetics | evolutionary | species | evolutionary
darwinian | convergent | genetics | human | genetics | convergent | convergent
abiogenesis | evolutionary | species | evolutionary | theory | process | process
evolutionary | species | did | humans | natural | natural | natural
natural | human | convergent | natural | humans | did | species
nature | natural | process | convergent | process | human | humans
species | theory | evolutionary | creationism | human | darwinian | favor

Table 6: The order of the 10 closest words to the query term evolution is highly variable. Results are shown across runs of the BOOTSTRAP setting for the full corpus of AskScience, the whole document length, and the GloVe model.

These patterns of larger or smaller variations are generalized in Figure 1, which shows the mean standard deviation for different algorithms and settings. We calculated the standard deviation across the 50 runs for each query word in each corpus, and then we averaged over these standard deviations. The results show the average levels of variation for each algorithm and corpus.
We observe that the FIXED and SHUFFLED settings for GloVe and LSA produce the least variable cosine similarities, while PPMI produces the most variable cosine similarities for all settings. The presence of specific documents has a significant effect on all four algorithms (lesser for PPMI), consistently increasing the standard deviations.

We turn to the question of how this variation in standard deviation affects the lists of most similar words. Are the top-N words simply re-ordered, or do the words present in the list substantially change? Table 5 shows an example of the top-N word lists for the query word pregnancy in the 9th Circuit corpus. Observing Run 1, we might believe that judges of the 9th Circuit associate pregnancy most with questions of viability and abortion, while observing Run 5, we might believe that pregnancy is most associated with questions of prisons and family visits. Although the lists in this table are all produced from the same corpus and document size, the membership of the lists changes substantially between runs of the BOOTSTRAP setting.

As another example, Table 6 shows results for the query evolution for the GloVe model and the AskScience corpus. Although this query shows less variation between runs, we still find cause for concern. For example, Run 3 ranks the words human and humans highly, while Run 1 includes neither of those words in the top 10.

These changes in top-N rank are shown in Figure 2. For each query word for the AskHistorians corpus, we find the N most similar words using SGNS. We generate new top-N lists for each of the 50 models trained in the BOOTSTRAP setting, and we use Jaccard similarity to compare the 50 lists. We observe similar patterns to the changes in standard deviation in Figure 2; PPMI displays the lowest Jaccard similarity across settings, while the other algorithms have higher similarities in the FIXED and SHUFFLED settings but much lower similarities in the BOOTSTRAP setting. We display results for both N = 2 and N = 10, emphasizing that even very highly ranked words often drop out of the top-N list.

Figure 2: The mean Jaccard similarities across settings and algorithms for the top 2 and 10 closest words to the query words in the AskHistorians corpus. Larger Jaccard similarity indicates more consistency in top N membership. Results are shown for the sentence document length.

Even when words do not drop out of the top-N list, they often change in rank, as we observe in Figure 3. We show both a specific example for the query term men and an aggregate of all the terms whose average rank is within the top-10 across runs of the BOOTSTRAP setting. In order to highlight the average changes in rank, we do not show outliers in this figure, but we note that outliers (large falls and jumps in rank) are common. The variability across samples from the BOOTSTRAP setting indicates that the presence of specific documents can significantly affect the top-N rankings.

Figure 3: The change in rank across runs of the BOOTSTRAP setting for the top 10 words. We show results for both a single query, men, and an aggregate of all the queries, showing the change in rank of the words whose average ranking falls within the 10 nearest neighbors of those queries. Results are shown for SGNS on the AskHistorians corpus and the sentence document length.
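The top-N membership comparison behind Figure 2 takes only a few lines. The sketch below assumes a list of gensim-style keyed vectors, one per trained model (for example the BOOTSTRAP ensemble from the earlier sketch; the name `bootstrap_models` is ours), and reports the mean pairwise Jaccard similarity between the 50 top-N lists:

    from itertools import combinations

    def topn_sets(models, query, n=10):
        # One set of the n nearest neighbors of the query word per trained model.
        return [set(w for w, _ in wv.most_similar(query, topn=n)) for wv in models]

    def mean_jaccard(sets):
        # Average Jaccard similarity over all pairs of top-N sets (1.0 = identical lists).
        pairs = list(combinations(sets, 2))
        return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

    # Example: consistency of the 10 nearest neighbors of "men" across 50 bootstrap models.
    print(mean_jaccard(topn_sets(bootstrap_models, "men", n=10)))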
We also find that document segmentation size affects the cosine similarities. Figure 4 shows that documents segmented at a more fine-grained level produce embeddings with less variability across runs of the BOOTSTRAP setting. Documents segmented at the sentence level have standard deviations clustering closer to the median, while larger documents have standard deviations that are spread more widely. This effect is most significant for the 4th Circuit and 9th Circuit corpora, as these have much larger "documents" than the other corpora. We observe a similar effect for corpus size in Figure 5. The smaller corpus shows a larger spread in standard deviation than the larger corpus, indicating greater variability.

Figure 4: Standard deviation of the cosine similarities between all rank N words and their 10 nearest neighbors. Results are shown for different document sizes (sentence vs whole document) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus.

Figure 5: Standard deviation of the cosine similarities between all rank N words and their 10 nearest neighbors. Results are shown at different corpus sizes (20% vs 100% of documents) in the BOOTSTRAP setting for SGNS in the 4th Circuit corpus, segmented by sentence.

Finally, we find that the variance usually stabilizes at about 25 runs of the BOOTSTRAP setting. Figure 6 shows that variability initially increases with the number of models trained. We observe this pattern across corpora, algorithms, and settings.

Figure 6: The mean of the standard deviation of the cosine similarities between each query term and its 20 nearest neighbors. Results are shown for different numbers of runs of the BOOTSTRAP setting on the 4th Circuit corpus.

8 Discussion

The most obvious result of our experiments is to emphasize that embeddings are not even a single objective view of a corpus, much less an objective view of language. The corpus is itself only a sample, and we have shown that the curation of this sample (its size, document length, and inclusion of specific documents) can cause significant variability in the embeddings. Happily, this variability can be quantified by averaging results over multiple bootstrap samples.

We can make several specific observations about algorithm sensitivities. In general, LSA, GloVe, SGNS, and PPMI are not sensitive to document order in the collections we evaluated. This is surprising, as we
had expected SGNS to be sensitive to document order and, anecdotally, we had observed cases where the embeddings were affected by groups of documents (e.g. in a different language) at the beginning of training. However, all four algorithms are sensitive to the presence of specific documents, though this effect is weaker for PPMI.

Although PPMI appears deterministic (due to its pre-computed word-context matrix), we find that this algorithm produced results under the FIXED ordering whose variability was closest to the BOOTSTRAP setting. We attribute this intrinsic variability to the use of token-level subsampling. This sampling method introduces variation into the source corpus that appears to be comparable to a bootstrap resampling method. Sampling in PPMI is inspired by a similar method in the word2vec implementation of SGNS (Levy et al., 2015). It is therefore surprising that SGNS shows noticeable differentiation between the BOOTSTRAP setting on the one hand and the FIXED and SHUFFLED settings on the other.

The use of embeddings as sources of evidence needs to be tempered with the understanding that fine-grained distinctions between cosine similarities are not reliable and that smaller corpora and longer documents are more susceptible to variation in the cosine similarities between embeddings. When studying the top-N most similar words to a query, it is important to account for variation in these lists, as both rank and membership can significantly change across runs. Therefore, we emphasize that with smaller corpora comes greater variability, and we recommend that practitioners use bootstrap sampling to generate an ensemble of word embeddings for each sub-corpus and present both the mean and variability of any summary statistics such as ordered word similarities.

We leave for future work a full hyperparameter sweep for the three algorithms. While these hyperparameters can substantially impact performance, our goal with this work was not to achieve high performance but to examine how the algorithms respond to changes in the corpus. We make no claim that one algorithm is better than another.

9 Conclusion

We find that there are several sources of variability in cosine similarities between word embedding vectors. The size of the corpus, the length of individual documents, and the presence or absence of specific documents can all affect the resulting embeddings. While differences in word association are measurable and are often significant, small differences in cosine similarity are not reliable, especially for small corpora. If the intention of a study is to learn about a specific corpus, we recommend that practitioners test the statistical confidence of similarities based on word embeddings by training on multiple bootstrap samples.

10 Acknowledgements

This work was supported by NSF #1526155, #1652536, and the Alfred P. Sloan Foundation. We would like to thank Alexandra Schofield, Laure Thompson, our Action Editor Ivan Titov, and our anonymous reviewers for their helpful comments.

References

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. 2016. Man is to computer programmer as woman is to homemaker?
Debiasing word embeddings. In NIPS, pages 4349–4357.

Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer.

Andreas Broscheid. 2011. Comparing circuits: Are some U.S. Courts of Appeals more liberal or conservative than others? Law & Society Review, 45(1), March.

Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The media frames corpus: Annotations of frames across issues. In ACL.

Danqi Chen and Christopher D. Manning. 2014. A fast and accurate dependency parser using neural networks. In EMNLP, pages 740–750.

Colin Cherry and Hongyu Guo. 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In HLT-NAACL, pages 735–745.

Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391.

Manaal Faruqui, Jesse Dodge, Sujay K. Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In HLT-NAACL, pages 1606–1615.

Anna Gladkova and Aleksandr Drozd. 2016. Intrinsic evaluations of word embeddings: What can we do better? In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 36–42.

Yoav Goldberg. 2017. Neural Network Methods for Natural Language Processing. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. 2009. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical Report No. 2009-05. Applied & Computational Mathematics, California Institute of Technology.

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic word embeddings reveal statistical laws of semantic change. In ACL.

Johannes Hellrich and Udo Hahn. 2016. Bad company: Neighborhoods in neural embedding spaces considered harmful. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2785–2796.

Ryan Heuser. 2016. Word vectors in the eighteenth century. In IPAM Workshop: Cultural Analytics.

Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. 2014. Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science.

Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. Freshman or fresher? Quantifying the geographic variation of language in online social media. In ICWSM, pages 615–618.

Thomas K. Landauer and Susan T. Dumais. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211–225.

Rasmus E. Madsen, David Kauchak, and Charles Elkan. 2005. Modeling word burstiness using the Dirichlet distribution. In Proceedings of the 22nd International Conference on Machine Learning, pages 545–552. ACM.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013.
Linguistic regularities in continuous space word representations. In HLT-NAACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Lawrence Phillips, Kyle Shaffer, Dustin Arendt, Nathan Hodas, and Svitlana Volkova. 2017. Intrinsic and extrinsic evaluation of spatiotemporal text representations in Twitter streams. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 201–210.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May. ELRA.

Evan Sandhaus. 2008. The New York Times Annotated Corpus. LDC2008T19. Linguistic Data Consortium.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing. arXiv:1602.02215.

Yingtao Tian, Vivek Kulkarni, Bryan Perozzi, and Steven Skiena. 2016. On the convergent properties of word embedding methods. arXiv preprint arXiv:1605.03956.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the ACL, pages 384–394. Association for Computational Linguistics.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Ivan Vulic and Marie-Francine Moens. 2015. Bilingual word embeddings from non-parallel document-aligned data applied to bilingual lexicon induction. In Proceedings of the ACL, pages 719–725. ACL.

Justine Zhang, William L. Hamilton, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky, and Jure Leskovec. 2017. Community identity and user engagement in a multi-community landscape. In Proceedings of ICWSM.