Improving Distributional Similarity with Lessons Learned from Word Embeddings Omer Levy Yoav Goldberg Ido Dagan Computer Science Department Bar-Ilan University Ramat-Gan, Israel {omerlevy,yogo,dagan}@cs.biu.ac.il Abstract Recent trends suggest that neural- network-inspired word embedding models outperform traditional count-based distri- butional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter op- timizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others. 1 Introduction Understanding the meaning of a word is at the heart of natural language processing (NLP). While a deep, human-like, understanding remains elu- sive, many methods have been successful in recov- ering certain aspects of similarity between words. Recently, neural-network based approaches in which words are “embedded” into a low- dimensional space were proposed by various au- thors (Bengio et al., 2003; Collobert and Weston, 2008). These models represent each word as a d- dimensional vector of real numbers, and vectors that are close to each other are shown to be se- mantically related. In particular, a sequence of pa- pers by Mikolov et al. (2013a; 2013b) culminated in the skip-gram with negative-sampling training method (SGNS): an efficient embedding algorithm that provides state-of-the-art results on various lin- guistic tasks. It was popularized via word2vec, a program for creating word embeddings. A recent study by Baroni et al. (2014) con- ducts a set of systematic experiments compar- ing word2vec embeddings to the more tradi- tional distributional methods, such as pointwise mutual information (PMI) matrices (see Turney and Pantel (2010) and Baroni and Lenci (2010) for comprehensive surveys). These results suggest that the new embedding methods consistently out- perform the traditional methods by a non-trivial margin on many similarity-oriented tasks. How- ever, state-of-the-art embedding methods are all based on the same bag-of-contexts representation of words. Furthermore, analysis by Levy and Goldberg (2014c) shows that word2vec’s SGNS is implicitly factorizing a word-context PMI ma- trix. That is, the mathematical objective and the sources of information available to SGNS are in fact very similar to those employed by the more traditional methods. What, then, is the source of superiority (or per- ceived superiority) of these recent embeddings? While the focus of the presentation in the word- embedding literature is on the mathematical model and the objective being optimized, other factors af- fect the results as well. In particular, embedding algorithms suggest some natural hyperparameters that can be tuned; many of which were already tuned to some extent by the algorithms’ design- ers. Some hyperparameters, such as the number of negative samples to use, are clearly marked as tunable. Other modifications, such as smoothing the negative-sampling distribution, are reported in passing and considered thereafter as part of the al- gorithm. Others still, such as dynamically-sized context windows, are not even mentioned in some of the papers, but are part of the canonical imple- mentation. 
All of these modifications and system design choices, which we collectively denote as hyperparameters, are part of the final algorithm, and, as we show, have a substantial impact on performance. In this work, we make these hyperparameters explicit, and show how they can be adapted and transferred into the traditional count-based approach. To assess how each hyperparameter contributes to the algorithms' performance, we conduct a comprehensive set of experiments and compare four different representation methods, while controlling for the various hyperparameters.

Once adapted across methods, hyperparameter tuning significantly improves performance in every task. In many cases, changing the setting of a single hyperparameter yields a greater increase in performance than switching to a better algorithm or training on a larger corpus. In particular, word2vec's smoothing of the negative sampling distribution can be adapted to PPMI-based methods by introducing a novel, smoothed variant of the PMI association measure (see Section 3.2). Using this variant increases performance by over 3 points per task, on average. We suspect that this smoothing partially addresses the "Achilles' heel" of PMI: its bias towards co-occurrences of rare words.

We also show that when all methods are allowed to tune a similar set of hyperparameters, their performance is largely comparable. In fact, there is no consistent advantage to one algorithmic approach over another, a result that contradicts the claim that embeddings are superior to count-based methods.

2 Background

We consider four word representation methods: the explicit PPMI matrix, SVD factorization of said matrix, SGNS, and GloVe. For historical reasons, we refer to PPMI and SVD as "count-based" representations, as opposed to SGNS and GloVe, which are often referred to as "neural" or "prediction-based" embeddings. All of these methods (as well as all other "skip-gram"-based embedding methods) are essentially bag-of-words models, in which the representation of each word reflects a weighted bag of context-words that co-occur with it. Such bag-of-words embedding models were previously shown to perform as well as or better than more complex embedding methods on similarity and analogy tasks (Mikolov et al., 2013a; Pennington et al., 2014).

Notation We assume a collection of words w ∈ VW and their contexts c ∈ VC, where VW and VC are the word and context vocabularies, and denote the collection of observed word-context pairs as D. We use #(w,c) to denote the number of times the pair (w,c) appears in D. Similarly, #(w) = Σ_{c′∈VC} #(w,c′) and #(c) = Σ_{w′∈VW} #(w′,c) are the number of times w and c occurred in D, respectively. In some algorithms, words and contexts are embedded in a space of d dimensions. In these cases, each word w ∈ VW is associated with a vector ~w ∈ R^d and similarly each context c ∈ VC is represented as a vector ~c ∈ R^d. We sometimes refer to the vectors ~w as rows in a |VW|×d matrix W, and to the vectors ~c as rows in a |VC|×d matrix C. When referring to embeddings produced by a specific method x, we may use W_x and C_x (e.g. W_SGNS or C_SVD). All vectors are normalized to unit length before they are used for similarity calculation, making cosine similarity and dot-product equivalent (see Section 3.3 for further discussion).

Contexts D is commonly obtained by taking a corpus w1, w2, ..., wn and defining the contexts of word wi as the words surrounding it in an L-sized window w_{i−L}, ..., w_{i−1}, w_{i+1}, ..., w_{i+L}. While other definitions of contexts have been studied (Padó and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014a), this work focuses on fixed-window bag-of-words contexts.
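To make the notation concrete, the following is a minimal Python sketch (not taken from the paper's released hyperwords code; the function and variable names are our own) of how the pair collection D and the counts #(w,c), #(w), #(c), and |D| can be gathered from a tokenized corpus with a fixed L-sized window:

```python
from collections import Counter

def extract_pairs(corpus, window=2):
    """Collect the multiset D of (word, context) pairs using a fixed
    L-sized window around each focus token."""
    pairs = Counter()
    for sentence in corpus:                      # corpus: iterable of token lists
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs[(word, sentence[j])] += 1   # #(w,c)
    return pairs

corpus = [["the", "cat", "sat", "on", "the", "mat"]]
pair_counts = extract_pairs(corpus, window=2)
word_counts, ctx_counts = Counter(), Counter()        # #(w) and #(c)
for (w, c), n in pair_counts.items():
    word_counts[w] += n
    ctx_counts[c] += n
D_size = sum(pair_counts.values())                    # |D|
```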
2.1 Explicit Representations (PPMI Matrix)

The traditional way to represent words in the distributional approach is to construct a high-dimensional sparse matrix M, where each row represents a word w in the vocabulary VW and each column a potential context c ∈ VC. The value of each matrix cell Mij represents the association between the word wi and the context cj. A popular measure of this association is pointwise mutual information (PMI) (Church and Hanks, 1990). PMI is defined as the log ratio between w and c's joint probability and the product of their marginal probabilities, which can be estimated by:

PMI(w,c) = log [ P̂(w,c) / (P̂(w) · P̂(c)) ] = log [ #(w,c) · |D| / (#(w) · #(c)) ]

The rows of M_PMI contain many entries of word-context pairs (w,c) that were never observed in the corpus, for which PMI(w,c) = log 0 = −∞. A common approach is thus to replace the M_PMI matrix with M_PMI0, in which PMI(w,c) = 0 in cases where #(w,c) = 0. A more consistent approach is to use positive PMI (PPMI), in which all negative values are replaced by 0:

PPMI(w,c) = max(PMI(w,c), 0)

Bullinaria and Levy (2007) showed that M_PPMI outperforms M_PMI0 on semantic similarity tasks.

A well-known shortcoming of PMI, which persists in PPMI, is its bias towards infrequent events (Turney and Pantel, 2010). A rare context c that co-occurred with a target word w even once will often yield a relatively high PMI score, because P̂(c), which is in PMI's denominator, is very small. This creates a situation in which the top "distributional features" (contexts) of w are often extremely rare words, which do not necessarily appear in the respective representations of words that are semantically similar to w. Nevertheless, the PPMI measure is widely regarded as state-of-the-art for these kinds of distributional-similarity models.

2.2 Singular Value Decomposition (SVD)

While sparse vector representations work well, there are also advantages to working with dense low-dimensional vectors, such as improved computational efficiency and, arguably, better generalization. Such vectors can be obtained by performing dimensionality reduction over the sparse high-dimensional matrix.

A common method of doing so is truncated Singular Value Decomposition (SVD), which finds the optimal rank d factorization with respect to L2 loss (Eckart and Young, 1936). It was popularized in NLP via Latent Semantic Analysis (LSA) (Deerwester et al., 1990).

SVD factorizes M into the product of three matrices U · Σ · V⊤, where U and V are orthonormal and Σ is a diagonal matrix of eigenvalues in decreasing order. By keeping only the top d elements of Σ, we obtain M_d = U_d · Σ_d · V_d⊤. The dot-products between the rows of W = U_d · Σ_d are equal to the dot-products between the rows of M_d.

In the setting of word-context matrices, the dense, d-dimensional rows of W can substitute for the very high-dimensional rows of M. Indeed, a common approach in the NLP literature is factorizing the PPMI matrix M_PPMI with SVD, and then taking the rows of:

W^SVD = U_d · Σ_d    C^SVD = V_d    (1)

as word and context representations, respectively.
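The two count-based representations above can be sketched in a few lines of NumPy. This is an illustrative toy version under the assumption of a small dense count matrix, not the memory-efficient sparse pipeline used in the experiments; the exponent p anticipates the eigenvalue-weighting hyperparameter discussed later in Section 3.3:

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI from a |Vw| x |Vc| co-occurrence count matrix #(w,c)."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total         # P(w)
    pc = counts.sum(axis=0, keepdims=True) / total          # P(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc))
    pmi[~np.isfinite(pmi)] = 0.0                             # unobserved pairs
    return np.maximum(pmi, 0.0)

def svd_embeddings(m, d, p=0.5):
    """Truncated SVD of m; W = U_d * Sigma_d^p, C = V_d. Equation (1) uses
    p = 1; p = 0.5 and p = 0 correspond to the symmetric variants of
    Section 3.3 (equation (5) would additionally scale C by sqrt(Sigma_d))."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return u[:, :d] * (s[:d] ** p), vt[:d].T

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(200, 200)).astype(float)    # toy counts
W, C = svd_embeddings(ppmi_matrix(counts), d=50, p=0.5)
```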
2.3 Skip-Grams with Negative Sampling (SGNS)

We present a brief sketch of SGNS – the skip-gram embedding model introduced in (Mikolov et al., 2013a), trained using the negative-sampling procedure presented in (Mikolov et al., 2013b). A detailed derivation of SGNS is available in (Goldberg and Levy, 2014).

SGNS seeks to represent each word w ∈ VW and each context c ∈ VC as d-dimensional vectors ~w and ~c, such that words that are "similar" to each other will have similar vector representations. It does so by trying to maximize a function of the product ~w · ~c for (w,c) pairs that occur in D, and minimize it for negative examples: (w, cN) pairs that do not necessarily occur in D. The negative examples are created by stochastically corrupting observed (w,c) pairs from D – hence the name "negative sampling". For each observation of (w,c), SGNS draws k contexts from the empirical unigram distribution P_D(c) = #(c) / |D|. In word2vec's implementation of SGNS, this distribution is smoothed, a design choice that boosts its performance. We explore this hyperparameter and others in Section 3.

SGNS as Implicit Matrix Factorization Levy and Goldberg (2014c) show that SGNS's corpus-level objective achieves its optimal value when:

~w · ~c = PMI(w,c) − log k

Hence, SGNS is implicitly factorizing a word-context matrix whose cells' values are PMI, shifted by a global constant (log k):

W · C⊤ = M_PMI − log k

SGNS performs a different kind of factorization from traditional SVD (see 2.2). In particular, the factorization's loss function is not based on L2, and is much less sensitive to extreme and infinite values due to a sigmoid function surrounding ~w · ~c. Furthermore, the loss is weighted, causing rare (w,c) pairs to affect the objective much less than frequent ones. Thus, while many cells in M_PMI equal log 0 = −∞, the cost incurred for reconstructing these cells as a small negative value, such as −5 instead of −∞, is negligible.¹

¹The logistic (sigmoidal) objective also curbs very high positive values of PMI. We suspect that this property, along with the weighted factorization property, addresses the aforementioned shortcoming of PMI, i.e. its overweighting of infrequent events.

An additional difference from SVD, which will be explored further in Section 3.3, is that SVD factorizes M into three matrices, two of them orthonormal and one diagonal, while SGNS factorizes M into two unconstrained matrices.

2.4 Global Vectors (GloVe)

GloVe (Pennington et al., 2014) seeks to represent each word w ∈ VW and each context c ∈ VC as d-dimensional vectors ~w and ~c such that:

~w · ~c + b_w + b_c = log(#(w,c))    ∀(w,c) ∈ D

Here, b_w and b_c (scalars) are word/context-specific biases, and are also parameters to be learned in addition to ~w and ~c.

GloVe's objective is explicitly defined as a factorization of the log-count matrix, shifted by the entire vocabularies' bias terms:

M_log(#(w,c)) ≈ W · C⊤ + ~b_w + ~b_c

where ~b_w is a |VW|-dimensional row vector and ~b_c is a |VC|-dimensional column vector.

If we were to fix b_w = log #(w) and b_c = log #(c), this would be almost² equivalent to factorizing the PMI matrix shifted by log(|D|). However, GloVe learns these parameters, giving an extra degree of freedom over SVD and SGNS. The model is fit to minimize a weighted least squares loss, giving more weight to frequent (w,c) pairs.³

²GloVe's objective ignores (w,c) pairs that do not co-occur in the training corpus, treating them as missing values. SGNS, on the other hand, does take such pairs into account through the negative sampling procedure.

³The weighting formula is another hyperparameter that could be tuned, but we keep to the default weighting scheme.

Finally, an important novelty introduced in (Pennington et al., 2014) is that, assuming VC = VW, one could take the representation of a word w to be ~w + ~c_w, where ~c_w is the row corresponding to w in C. This may improve results considerably in some circumstances, as we discuss in Sections 3.3 and 6.2.
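For illustration, a single stochastic update for the SGNS objective of Section 2.3 might look as follows. This is a didactic sketch, not word2vec's optimized implementation (which adds lookup tables, subsampling, a smoothed negative-sampling table, and a learning-rate schedule); the gradient expressions follow from the standard log-sigmoid objective and are our own rendering rather than code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(w, c_pos, c_negs, lr=0.025):
    """One SGD step for a single observed (w, c) pair and k sampled negative
    contexts, ascending the objective
        log sigmoid(w . c) + sum_i log sigmoid(-w . c_neg_i)."""
    g = 1.0 - sigmoid(w @ c_pos)       # gradient coefficient for the positive pair
    grad_w = g * c_pos
    c_pos += lr * g * w
    for c_neg in c_negs:
        g_neg = sigmoid(w @ c_neg)     # gradient coefficient for a negative pair
        grad_w -= g_neg * c_neg
        c_neg -= lr * g_neg * w
    w += lr * grad_w                   # updates happen in place

d, k = 100, 5
rng = np.random.default_rng(1)
w_vec = rng.normal(scale=0.1, size=d)
c_vec = rng.normal(scale=0.1, size=d)
neg_vecs = rng.normal(scale=0.1, size=(k, d))   # in practice drawn from the (smoothed) unigram distribution
sgns_update(w_vec, c_vec, neg_vecs)
```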
3 Transferable Hyperparameters

This section presents various hyperparameters implemented in word2vec and GloVe, and shows how to adapt and apply them to count-based methods. We divide these into: pre-processing hyperparameters, which affect the algorithms' input data; association metric hyperparameters, which define how word-context interactions are calculated; and post-processing hyperparameters, which modify the resulting word vectors.

3.1 Pre-processing Hyperparameters

All the matrix-based algorithms rely on a collection D of word-context pairs (w,c) as inputs. word2vec introduces three novel variations on the way D is collected, which can be easily applied to other methods beyond SGNS.

Dynamic Context Window (dyn) The traditional approaches usually use a constant-sized unweighted context window. For instance, if the window size is 5, then a word five tokens apart from the target is treated the same as an adjacent word. Following the intuition that contexts closer to the target are more important, context words can be weighted according to their distance from the focus word. Both GloVe and word2vec employ such a weighting scheme, and while less common, this approach was also explored in traditional count-based methods, e.g. (Sahlgren, 2006).

GloVe's implementation weights contexts using the harmonic function, e.g. a context word three tokens away will be counted as 1/3 of an occurrence. word2vec's implementation, on the other hand, is equivalent to weighting by the distance from the focus word divided by the window size. For example, a size-5 window will weigh its contexts by 5/5, 4/5, 3/5, 2/5, 1/5.

We call this modification dynamic context windows because word2vec implements its weighting scheme by uniformly sampling the actual window size between 1 and L, for each token (Mikolov et al., 2013a). The sampling method is faster than the direct method in terms of training time, since there are fewer SGD updates in SGNS and fewer non-zero matrix cells in the other methods. For our systematic experiments, we used the word2vec-style sampled version for all methods, including GloVe.

Subsampling (sub) Subsampling is a method of diluting very frequent words, akin to removing stop-words. The subsampling method presented in (Mikolov et al., 2013a) randomly removes words that are more frequent than some threshold t with a probability of p, where f marks the word's corpus frequency:

p = 1 − sqrt(t / f)    (2)

Following the recommendation in (Mikolov et al., 2013a), we use t = 10^−5 in our experiments.⁴

⁴word2vec's code implements a slightly different formula: p = (f − t)/f − sqrt(t/f). We followed the formula presented in the original paper (equation 2).

Another implementation detail of subsampling in word2vec is that the removal of tokens is done before the corpus is processed into word-context pairs. This practically enlarges the context window's size for many tokens, because they can now reach words that were not in their original L-sized windows. We call this kind of subsampling "dirty", as opposed to "clean" subsampling, which removes subsampled words without affecting the context window's size. We found their impact on performance comparable, and report results of only the "dirty" variant.
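A minimal sketch of these two pre-processing hyperparameters follows. The helper names are hypothetical; the drop probability implements equation (2), and the window sampling mirrors the word2vec-style dynamic scheme described above:

```python
import random
from collections import Counter

def subsample(tokens, freq, total, t=1e-5):
    """'Dirty' subsampling: drop frequent tokens before pair extraction,
    with discard probability p = 1 - sqrt(t / f)  (equation 2)."""
    kept = []
    for w in tokens:
        f = freq[w] / total                       # corpus frequency of w
        if random.random() >= max(0.0, 1.0 - (t / f) ** 0.5):
            kept.append(w)
    return kept

def dynamic_window_pairs(tokens, max_window=5):
    """word2vec-style dynamic windows: sample the actual window size
    uniformly from 1..L for every focus token."""
    for i, w in enumerate(tokens):
        win = random.randint(1, max_window)
        for j in range(max(0, i - win), min(len(tokens), i + win + 1)):
            if j != i:
                yield (w, tokens[j])

sentence = "the cat sat on the mat".split()
freq, total = Counter(sentence), len(sentence)
# t = 0.05 keeps the toy example non-degenerate; the paper uses t = 1e-5 on a full corpus.
pairs = list(dynamic_window_pairs(subsample(sentence, freq, total, t=0.05), max_window=2))
```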
Deleting Rare Words (del) While it is common to ignore words that are rare in the training corpus, word2vec removes these tokens from the corpus before creating context windows. As with subsampling, this variation narrows the distance between tokens, inserting new word-context pairs that did not exist in the original corpus with the same window size. Though this variation may also have an effect on performance, preliminary experiments showed that it was small, and we therefore do not investigate its effect in this paper.

3.2 Association Metric Hyperparameters

The PMI (or PPMI) between a word and its context is well known to be an effective association measure in the word similarity literature. Levy and Goldberg (2014c) show that SGNS is implicitly factorizing a word-context matrix whose cells' values are shifted PMI. Following their analysis, we present two variations of the PMI (and implicitly PPMI) association metric, which we adopt from SGNS. These enhancements of PMI are not directly applicable to GloVe, which, by definition, uses a different association measure.

Shifted PMI (neg) SGNS has a natural hyperparameter k (the number of negative samples), which affects the value that SGNS is trying to optimize for each (w,c): PMI(w,c) − log k. The shift caused by k > 1 can be applied to distributional methods through shifted PPMI (Levy and Goldberg, 2014c):

SPPMI(w,c) = max(PMI(w,c) − log k, 0)

It is important to understand that in SGNS, k has two distinct functions. First, it is used to better estimate the distribution of negative examples; a higher k means more data and better estimation. Second, it acts as a prior on the probability of observing a positive example (an actual occurrence of (w,c) in the corpus) versus a negative example; a higher k means that negative examples are more probable. Shifted PPMI captures only the second aspect of k (a prior). We experiment with three values of k: 1, 5, 15.

Context Distribution Smoothing (cds) In word2vec, negative examples (contexts) are sampled according to a smoothed unigram distribution. In order to smooth the original contexts' distribution, all context counts are raised to the power of α (Mikolov et al. (2013b) found α = 0.75 to work well). This smoothing variation has an analog when calculating PMI directly:

PMI_α(w,c) = log [ P̂(w,c) / (P̂(w) · P̂_α(c)) ]    (3)

P̂_α(c) = #(c)^α / Σ_{c′} #(c′)^α

Like other smoothing techniques (Pantel and Lin, 2002; Turney and Littman, 2003), context distribution smoothing alleviates PMI's bias towards rare words. It does so by enlarging the probability of sampling a rare context (since P̂_α(c) > P̂(c) when c is infrequent), which in turn reduces the PMI of (w,c) for any w co-occurring with the rare context c. In Section 6.2 we demonstrate that this novel variant of PMI is very effective, and consistently improves performance across tasks, methods, and configurations. We experiment with two values of α: 1 (unsmoothed) and 0.75 (smoothed).
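Both association-metric hyperparameters can be folded into a single matrix construction. The sketch below combines shifted PPMI (neg) with context distribution smoothing (cds) on a dense toy count matrix; it illustrates equation (3) and the SPPMI definition above, and is not the paper's released code:

```python
import numpy as np

def sppmi_matrix(counts, k=5, alpha=0.75):
    """Shifted positive PMI with context distribution smoothing:
    max(PMI_alpha(w, c) - log k, 0), i.e. the 'neg' and 'cds'
    hyperparameters applied to the count-based representation."""
    total = counts.sum()
    pw = counts.sum(axis=1, keepdims=True) / total               # P(w)
    c_counts = counts.sum(axis=0, keepdims=True)
    pc_alpha = c_counts ** alpha / (c_counts ** alpha).sum()     # P_alpha(c), equation (3)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts / total) / (pw * pc_alpha))
    pmi[~np.isfinite(pmi)] = 0.0                                 # unobserved (w, c) pairs
    return np.maximum(pmi - np.log(k), 0.0)

rng = np.random.default_rng(0)
M = sppmi_matrix(rng.poisson(1.0, size=(100, 100)).astype(float), k=5, alpha=0.75)
```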
3.3 Post-processing Hyperparameters

We present three hyperparameters that modify the algorithms' output: the word vectors.

Adding Context Vectors (w+c) Pennington et al. (2014) propose using the context vectors in addition to the word vectors as GloVe's output. For example, the word "cat" can be represented as:

~v_cat = ~w_cat + ~c_cat

where ~w and ~c are the word and context embeddings, respectively.

This vector combination was originally motivated as an ensemble method. Here, we provide a different interpretation of its effect on the cosine similarity function. Specifically, we show that adding context vectors effectively adds first-order similarity terms to the second-order similarity function.

Consider the cosine similarity of two words x and y:

cos(x,y) = (~v_x · ~v_y) / ( sqrt(~v_x · ~v_x) · sqrt(~v_y · ~v_y) )

= ( (~w_x + ~c_x) · (~w_y + ~c_y) ) / ( sqrt((~w_x + ~c_x) · (~w_x + ~c_x)) · sqrt((~w_y + ~c_y) · (~w_y + ~c_y)) )

= ( ~w_x · ~w_y + ~c_x · ~c_y + ~w_x · ~c_y + ~c_x · ~w_y ) / ( sqrt(~w_x² + 2 ~w_x · ~c_x + ~c_x²) · sqrt(~w_y² + 2 ~w_y · ~c_y + ~c_y²) )

= ( ~w_x · ~w_y + ~c_x · ~c_y + ~w_x · ~c_y + ~c_x · ~w_y ) / ( 2 · sqrt(~w_x · ~c_x + 1) · sqrt(~w_y · ~c_y + 1) )    (4)

(The last step follows because, as noted in Section 2, the word and context vectors are normalized after training, so ~w_x² = ~c_x² = 1.)

The resulting expression combines similarity terms which can be divided into two groups: second-order similarity (~w_x · ~w_y, ~c_x · ~c_y) and first-order similarity (~w_* · ~c_*). The second-order terms measure the extent to which the two words are replaceable based on their tendencies to appear in similar contexts, and are the manifestation of Harris's (1954) distributional hypothesis. The first-order terms measure the tendency of one word to appear in the context of the other. In SVD and SGNS, the first-order similarity terms between w and c converge to PMI(w,c), while in GloVe they converge to their log-count (with some bias terms).

The similarity calculated in equation 4 is thus a symmetric combination of the first-order and second-order similarities of x and y, normalized by a function of their reflective first-order similarities:

sim(x,y) = ( sim2(x,y) + sim1(x,y) ) / ( sqrt(sim1(x,x) + 1) · sqrt(sim1(y,y) + 1) )

This similarity measure states that words are similar if they tend to appear in similar contexts, or if they tend to appear in the contexts of each other (and preferably both).

The additive w+c representation can be trivially applied to other methods that produce distinct word and context vectors (e.g. SVD and SGNS). On the other hand, explicit methods such as PPMI are sparse by definition, and nullify the vast majority of first-order similarities. We therefore do not apply w+c to PPMI in this study.
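A small sketch of the w+c post-processing step and the normalized cosine similarity it induces (toy matrices; it assumes, as in Section 2, that word and context rows are unit-normalized before they are summed):

```python
import numpy as np

def normalize_rows(m):
    """L2-normalize every row, so dot products become cosine similarities."""
    return m / np.maximum(np.linalg.norm(m, axis=1, keepdims=True), 1e-12)

def similarity(i, j, W, C=None):
    """Cosine similarity of words i and j; if context vectors C are given,
    use the additive w+c representation over unit-normalized vectors."""
    R = normalize_rows(W) if C is None else normalize_rows(W) + normalize_rows(C)
    R = normalize_rows(R)                 # cosine over the (summed) vectors
    return float(R[i] @ R[j])

rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 50))           # toy word matrix
C = rng.normal(size=(1000, 50))           # toy context matrix
print(similarity(3, 7, W, C))
```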
Eigenvalue Weighting (eig) As mentioned in Section 2.2, the word and context vectors derived using SVD are typically represented by (equation 1):

W^SVD = U_d · Σ_d    C^SVD = V_d

However, this is not necessarily the optimal construction of W^SVD for word similarity tasks. We note that in the SVD-based factorization, the resulting word and context matrices have very different properties. In particular, the context matrix C^SVD is orthonormal while the word matrix W^SVD is not. On the other hand, the factorization achieved by SGNS's training procedure is much more "symmetric", in the sense that neither W^W2V nor C^W2V is orthonormal, and no particular bias is given to either of the matrices in the training objective. Similar symmetry can be achieved with the following factorization:

W = U_d · sqrt(Σ_d)    C = V_d · sqrt(Σ_d)    (5)

Alternatively, the eigenvalue matrix can be dismissed altogether:

W = U_d    C = V_d    (6)

While it is not theoretically clear why the symmetric approach is better for semantic tasks, it does work much better empirically (see Section 6.1). A similar observation was made by Caron (2001), who suggested adding a parameter p to control the eigenvalue matrix Σ:

W^SVD_p = U_d · Σ_d^p

Later studies show that weighting the eigenvalue matrix Σ_d with the exponent p can have a significant effect on performance, and should be tuned (Bullinaria and Levy, 2012; Turney, 2012). Adapting the notion of symmetric decomposition from SGNS, this study experiments only with symmetric variants of SVD (p = 0, p = 0.5; equations (6) and (5)) and the traditional factorization (p = 1; equation (1)).

Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W's rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity. This normalization is a hyperparameter setting in itself, and other normalizations are also applicable. The trivial case is using no normalization at all. Another setting, used by Pennington et al. (2014), normalizes the columns of W rather than its rows. It is also possible to consider a fourth setting that combines both row and column normalizations.

Note that column normalization is akin to dismissing the eigenvalues in SVD. While the hyperparameter setting eig = 0 has an important positive impact on SVD, the same cannot be said of column normalization on other methods. In preliminary experiments, we tried the four different normalization schemes described above (none, row, column, and both), and found the standard L2 normalization of W's rows (i.e. using the cosine similarity measure) to be consistently superior.

4 Experimental Setup

We explored a large space of hyperparameters, representations, and evaluation datasets.

4.1 Hyperparameter Space

Table 1 enumerates the hyperparameter space. We generated 72 PPMI, 432 SVD, 144 SGNS, and 24 GloVe representations; 672 overall.

Hyperparameter   Explored Values            Applicable Methods
win              2, 5, 10                   All
dyn              none, with                 All
sub              none, dirty, clean†        All
del              none, with†                All
neg              1, 5, 15                   PPMI, SVD, SGNS
cds              1, 0.75                    PPMI, SVD, SGNS
w+c              only w, w + c              SVD, SGNS, GloVe
eig              0, 0.5, 1                  SVD
nrm              none†, row, col†, both†    All

Table 1: The space of hyperparameters explored in this work. †Explored only in preliminary experiments.

4.2 Word Representations

Corpus All models were trained on English Wikipedia (August 2013 dump), pre-processed by removing non-textual elements, sentence splitting, and tokenization. The corpus contains 77.5 million sentences, spanning 1.5 billion tokens. Models were derived using windows of 2, 5, and 10 tokens to each side of the focus word (the window size parameter is denoted win). Words that appeared less than 100 times in the corpus were ignored, resulting in vocabularies of 189,533 terms for both words and contexts.

Training Embeddings We trained a 500-dimensional representation with SVD, SGNS, and GloVe. SGNS was trained using a modified version of word2vec which receives a sequence of pre-extracted word-context pairs (Levy and Goldberg, 2014a). GloVe was trained with 50 iterations using the original implementation (Pennington et al., 2014), applied to the pre-extracted word-context pairs.

4.3 Test Datasets

We evaluated each word representation on eight datasets covering similarity and analogy tasks.
Word Similarity We used six datasets to evaluate word similarity: the popular WordSim353 (Finkelstein et al., 2002) partitioned into two datasets, WordSim Similarity and WordSim Relatedness (Zesch et al., 2008; Agirre et al., 2009); Bruni et al.'s (2012) MEN dataset; Radinsky et al.'s (2011) Mechanical Turk dataset; Luong et al.'s (2013) Rare Words dataset; and Hill et al.'s (2014) SimLex-999 dataset. All these datasets contain word pairs together with human-assigned similarity scores. The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman's ρ) with the human ratings.

Analogy The two analogy datasets present questions of the form "a is to a∗ as b is to b∗", where b∗ is hidden, and must be guessed from the entire vocabulary. MSR's analogy dataset (Mikolov et al., 2013c) contains 8000 morpho-syntactic analogy questions, such as "good is to best as smart is to smartest". Google's analogy dataset (Mikolov et al., 2013a) contains 19544 questions, about half of the same kind as in MSR (syntactic analogies), and another half of a more semantic nature, such as capital cities ("Paris is to France as Tokyo is to Japan"). After filtering questions involving out-of-vocabulary words, i.e. words that appeared in English Wikipedia less than 100 times, we are left with 7118 instances in MSR and 19258 instances in Google.

The analogy questions are answered using 3CosAdd (addition and subtraction):

arg max_{b∗ ∈ VW \ {a∗, b, a}} cos(b∗, a∗ − a + b) = arg max_{b∗ ∈ VW \ {a∗, b, a}} ( cos(b∗, a∗) − cos(b∗, a) + cos(b∗, b) )

as well as 3CosMul, which is state-of-the-art in analogy recovery (Levy and Goldberg, 2014b):

arg max_{b∗ ∈ VW \ {a∗, b, a}} [ cos(b∗, a∗) · cos(b∗, b) / ( cos(b∗, a) + ε ) ]

where ε = 0.001 is used to prevent division by zero. We abbreviate the two methods "Add" and "Mul", respectively. The evaluation metric for the analogy questions is the percentage of questions for which the argmax result was the correct answer (b∗).
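The two evaluation protocols can be sketched as follows. This assumes unit-normalized rows of W and a word-to-row index; 3CosMul is implemented exactly as displayed above (practical implementations may first shift cosines into a positive range), and the function names are illustrative rather than taken from the released code:

```python
import numpy as np
from scipy.stats import spearmanr

def eval_similarity(W, index, pairs_with_scores):
    """Spearman's rho between model cosine similarities and human ratings.
    W: unit-normalized word vectors; index: word -> row id."""
    sims, gold = [], []
    for w1, w2, score in pairs_with_scores:
        sims.append(float(W[index[w1]] @ W[index[w2]]))
        gold.append(score)
    rho, _ = spearmanr(sims, gold)
    return rho

def answer_analogy(W, a, a_star, b, method="mul", eps=0.001):
    """Return the row index of the predicted b* using 3CosAdd or 3CosMul
    over unit-normalized rows of W, excluding the three query words."""
    cos_a, cos_a_star, cos_b = W @ W[a], W @ W[a_star], W @ W[b]
    if method == "add":
        scores = cos_a_star - cos_a + cos_b
    else:                                  # 3CosMul, as displayed above
        scores = cos_a_star * cos_b / (cos_a + eps)
    scores[[a, a_star, b]] = -np.inf       # exclude a, a*, b from the argmax
    return int(np.argmax(scores))
```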
Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google        MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Add / Mul     Add / Mul
PPMI     .709         .540          .688           .648              .393           .338          .491 / .650   .246 / .439
SVD      .776         .658          .752           .557              .506           .422          .452 / .498   .357 / .412
SGNS     .724         .587          .686           .678              .434           .401          .530 / .552   .578 / .592
GloVe    .666         .467          .659           .599              .403           .398          .442 / .465   .529 / .576

Table 2: Performance of each method across different tasks in the "vanilla" scenario (all hyperparameters set to default): win = 2; dyn = none; sub = none; neg = 1; cds = 1; w+c = only w; eig = 0.0.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google        MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Add / Mul     Add / Mul
PPMI     .755         .688          .745           .686              .423           .354          .553 / .629   .289 / .413
SVD      .784         .672          .777           .625              .514           .402          .547 / .587   .402 / .457
SGNS     .773         .623          .723           .676              .431           .423          .599 / .625   .514 / .546
GloVe    .667         .506          .685           .599              .372           .389          .539 / .563   .503 / .559
CBOW     .766         .613          .757           .663              .480           .412          .547 / .591   .557 / .598

Table 3: Performance of each method across different tasks using word2vec's recommended configuration: win = 2; dyn = with; sub = dirty; neg = 5; cds = 0.75; w+c = only w; eig = 0.0. CBOW is presented for comparison.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google        MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Add / Mul     Add / Mul
PPMI     .755         .697          .745           .686              .462           .393          .553 / .679   .306 / .535
SVD      .793         .691          .778           .666              .514           .432          .554 / .591   .408 / .468
SGNS     .793         .685          .774           .693              .470           .438          .676 / .688   .618 / .645
GloVe    .725         .604          .729           .632              .403           .398          .569 / .596   .533 / .580

Table 4: Performance of each method across different tasks using the best configuration for that method and task combination, assuming win = 2.

5 Results

We begin by comparing the effect of various hyperparameter configurations, and observe that different settings have a substantial impact on performance (Section 5.1); at times, this improvement is greater than that of switching to a different representation method. We then show that, in some tasks, careful hyperparameter tuning can also outweigh the importance of adding more data (5.2). Finally, we observe that our results do not agree with a few recent claims in the word embedding literature, and suggest that these discrepancies stem from hyperparameter settings that were not controlled for in previous experiments (5.3).

5.1 Hyperparameters vs Algorithms

We first examine a "vanilla" scenario (Table 2), in which all hyperparameters are "turned off" (set to default values): small context windows (win = 2), no dynamic contexts (dyn = none), no subsampling (sub = none), one negative sample (neg = 1), no smoothing (cds = 1), no context vectors (w+c = only w), and default eigenvalue weights (eig = 0.0).⁵ Overall, SVD outperforms other methods on most word similarity tasks, often having a considerable advantage over the second-best. In contrast, analogy tasks present mixed results; SGNS yields the best result in MSR's analogies, while PPMI dominates Google's dataset.

The second scenario (Table 3) sets the hyperparameters to word2vec's default values: small context windows (win = 2),⁶ dynamic contexts (dyn = with), dirty subsampling (sub = dirty), five negative samples (neg = 5), context distribution smoothing (cds = 0.75), no context vectors (w+c = only w), and default eigenvalue weights (eig = 0.0). The results in this scenario are quite different than those of the vanilla scenario, with better performance in many cases. However, this change is not uniform, as we observe that different settings boost different algorithms. In fact, the question "Which method is best?" might have a completely different answer when comparing on the same task but with different hyperparameter values. Looking at Table 2 and Table 3, for example, SVD is the best algorithm for SimLex-999 in the vanilla scenario, whereas in the word2vec scenario, it does not perform as well as SGNS.

⁵While it is more common to set eig = 1, this setting degrades SVD's performance considerably (see Section 6.1).

⁶While word2vec's default window size is 5, we present a single window size (win = 2) in Tables 2-4, in order to isolate win's effect from the effects of other hyperparameters. Running the same experiments with different window sizes reveals similar trends. Additional results with broader window sizes are shown in Table 5.
The third scenario (Table 4) enables the full range of hyperparameters given small context windows (win = 2); we evaluate each method on each task given every hyperparameter configuration, and choose the best performance. We see a considerable performance increase across all methods when comparing to both the vanilla (Table 2) and word2vec scenarios (Table 3): the best combination of hyperparameters improves up to 15.7 points beyond the vanilla setting, and over 6 points on average. It appears that selecting the right hyperparameter settings often has more impact than choosing the most suitable algorithm.

Main Result The numbers in Table 4 result from an "oracle" experiment, in which the hyperparameters are tuned on the test data, providing an upper bound on the potential performance improvement of hyperparameter tuning. Are such gains achievable in practice?

Table 5 describes a realistic scenario, where the hyperparameters are tuned on a training set, which is separate from the unseen test data. We also report results for different window sizes (win = 2, 5, 10). We use 2-fold cross-validation, in which, for each task, the hyperparameters are tuned on each half of the data and evaluated on the other half. The numbers reported in Table 5 are the averages of the two runs for each data-point.

The results indicate that approaching the oracle's improvements is indeed feasible. When comparing the performance of the trained configuration (Table 5) to that of the optimal one (Table 4), their average difference is about 1%, with larger datasets usually finding the optimal configuration. It is therefore both practical and beneficial to properly tune hyperparameters for word similarity and analogy detection tasks.

win   Method     WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google        MSR
                 Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Add / Mul     Add / Mul
2     PPMI       .732         .699          .744           .654              .457           .382          .552 / .677   .306 / .535
2     SVD        .772         .671          .777           .647              .508           .425          .554 / .591   .408 / .468
2     SGNS       .789         .675          .773           .661              .449           .433          .676 / .689   .617 / .644
2     GloVe      .720         .605          .728           .606              .389           .388          .649 / .666   .540 / .591
5     PPMI       .732         .706          .738           .668              .442           .360          .518 / .649   .277 / .467
5     SVD        .764         .679          .776           .639              .499           .416          .532 / .569   .369 / .424
5     SGNS       .772         .690          .772           .663              .454           .403          .692 / .714   .605 / .645
5     GloVe      .745         .617          .746           .631              .416           .389          .700 / .712   .541 / .599
10    PPMI       .735         .701          .741           .663              .235           .336          .532 / .605   .249 / .353
10    SVD        .766         .681          .770           .628              .312           .419          .526 / .562   .356 / .406
10    SGNS       .794         .700          .775           .678              .281           .422          .694 / .710   .520 / .557
10    GloVe      .746         .643          .754           .616              .266           .375          .702 / .712   .463 / .519
10    SGNS-LS    .766         .681          .781           .689              .451           .414          .739 / .758   .690 / .729
10    GloVe-LS   .678         .624          .752           .639              .361           .371          .732 / .750   .628 / .685

Table 5: Performance of each method across different tasks using 2-fold cross-validation for hyperparameter tuning. Configurations on large-scale (LS) corpora are also presented for comparison.
An interesting observation, which immediately appears when looking at Table 5, is that there is no single method that consistently performs better than the rest. This behavior is visible across all window sizes, and is discussed in further detail in Section 5.3.

5.2 Hyperparameters vs Big Data

An important factor in evaluating distributional methods is the size of the corpus and vocabulary, where larger corpora tend to yield better representations. However, training word vectors from larger corpora is more costly in computation time, which could be spent in tuning hyperparameters.

To compare the effect of bigger data versus more flexible hyperparameter settings, we created a large corpus with over 10.5 billion words (7 times larger than our original corpus). This corpus was built from an 8.5 billion word corpus suggested by Mikolov for training word2vec,⁷ to which we added UKWaC (Ferraresi et al., 2008). As with the original setup, our vocabulary contained every word that appeared at least 100 times in the corpus, amounting to about 620,000 words. Finally, we fixed the context windows to be broad and dynamic (win = 10, dyn = with), and explored 16 hyperparameter settings comprising subsampling (sub), shifted PMI (neg = 1, 5), context distribution smoothing (cds), and adding context vectors (w+c). This space is somewhat more restricted than the original hyperparameter space.

⁷word2vec.googlecode.com/svn/trunk/demo-train-big-model-v1.sh

In terms of computation, SGNS scales nicely, requiring about half a day of computation per setup. GloVe, on the other hand, took several days to run a single 50-iteration instance for this corpus. Applying the traditional count-based methods to this setting proved technically challenging, as they consumed too much memory to be efficiently manipulated. We thus present results for only SGNS and GloVe (Table 5).

Remarkably, there are some cases (3/6 word similarity tasks) in which tuning a larger space of hyperparameters is indeed more beneficial than expanding the corpus. In other cases, however, more data does seem to pay off, as evident with both analogy tasks.

5.3 Re-evaluating Prior Claims

Prior art raises several claims regarding the superiority of certain methods over others. However, these studies did not control for the hyperparameters presented in this work. We thus revisit these claims, and examine their validity based on the results in Table 5.⁸

⁸We note that all conclusions drawn in this section rely on the specific data and settings with which we experiment. It is indeed feasible that experiments on different tasks, data, and hyperparameters may yield other conclusions.

Are embeddings superior to count-based distributional methods? It is commonly believed that modern prediction-based embeddings perform better than traditional count-based methods. This claim was recently supported by a series of systematic evaluations by Baroni et al. (2014). However, our results suggest a different trend. Table 5 shows that in word similarity tasks, the average score of SGNS is actually lower than SVD's when win = 2, 5, and it never outperforms SVD by more than 1.7 points in those cases. In Google's analogies SGNS and GloVe indeed perform better than PPMI, but only by a margin of 3.7 points (compare PPMI with win = 2 and SGNS with win = 5).
MSR's analogy dataset is the only case where SGNS and GloVe substantially outperform PPMI and SVD.⁹ Overall, there does not seem to be a consistent significant advantage to one approach over the other, thus refuting the claim that prediction-based methods are superior to count-based approaches.

⁹Unlike PPMI, SVD underperforms in both analogy tasks.

The contradictory results in (Baroni et al., 2014) stem from creating word2vec embeddings with somewhat pre-tuned hyperparameters (recommended by word2vec), and comparing them to "vanilla" PPMI and SVD representations. In particular, shifted PMI (negative sampling) and context distribution smoothing (cds = 0.75, equation (3) in Section 3.2) were turned on for SGNS, but not for PPMI and SVD. An additional difference is Baroni et al.'s setting of eig = 1, which significantly deteriorates SVD's performance (see Section 6.1).

Is GloVe superior to SGNS? Pennington et al. (2014) show a variety of experiments in which GloVe outperforms SGNS (among other methods). However, our results show the complete opposite. In fact, SGNS outperforms GloVe in every task (Table 5). Only when restricted to 3CosAdd, a suboptimal configuration, does GloVe show a 0.8 point advantage over SGNS. This trend persists when scaling up to a larger corpus and vocabulary.

This contradiction can be explained by three major differences in the experimental setup. First, in our experiments, hyperparameters were allowed to vary; in particular, w+c was applied to all the methods, including SGNS. Secondly, Pennington et al. (2014) only evaluated on Google's analogies, but not on MSR's. Finally, in our work, all methods are compared using the same underlying corpus.

It is also important to bear in mind that, by definition, GloVe cannot use two hyperparameters: shifted PMI (neg) and context distribution smoothing (cds). Instead, GloVe learns a set of bias parameters that subsumes these two modifications and many other potential changes to the PMI metric. Despite its greater flexibility, GloVe does not fare better than SGNS in our experiments.

Is PPMI on par with SGNS on analogy tasks? Levy and Goldberg (2014b) show that PPMI and SGNS perform similarly on both Google's and MSR's analogy tasks. Nevertheless, the results in Table 5 show a clear advantage to SGNS. While the gap on Google's analogies is not very large (PPMI lags behind SGNS by only 3.7 points), SGNS consistently outperforms PPMI by a large margin on the MSR dataset.

MSR's analogy dataset captures syntactic relations, such as singular-plural inflections for nouns and tense modifications for verbs. We conjecture that capturing these syntactic relations may rely on certain types of contexts, such as determiners and function words, which SGNS might be better at capturing – perhaps due to the way it assigns weights to different examples, or because it also captures negative correlations which are filtered by PPMI.

A deeper look into Levy and Goldberg's (2014b) experiments reveals the use of PPMI with positional contexts (i.e. each context is a conjunction of a word and its relative position to the target word), whereas SGNS was employed with regular bag-of-words contexts. Positional contexts might contain relevant information for recovering syntactic analogies, explaining PPMI's relatively high score on MSR's analogy task in (Levy and Goldberg, 2014b).

Does 3CosMul recover more analogies than 3CosAdd?
Levy and Goldberg (2014b) show that using similarity multiplication (3CosMul) rather than addition (3CosAdd) improves results on all methods and on every task. This claim is consistent with our findings; indeed, 3CosMul dominates 3CosAdd in every case. The improvement is particularly noticeable for SVD and PPMI, which considerably underperform other methods when using 3CosAdd.

5.4 Comparison with CBOW

Another algorithm featured in word2vec is CBOW. Unlike the other methods, CBOW cannot be easily expressed as a factorization of a word-context matrix; it ties together the tokens of each context window by representing the context vector as the sum of its words' vectors. It is thus more expressive than the other methods, and has the potential to derive better word representations.

While Mikolov et al. (2013b) found SGNS to outperform CBOW, Baroni et al. (2014) report that CBOW had a slight advantage. We compared CBOW to the other methods when setting all the hyperparameters to the defaults provided by word2vec (Table 3). With the exception of MSR's analogy task, CBOW is not the best-performing method on any other task in this scenario. Other scenarios showed similar trends in our preliminary experiments.

While CBOW can potentially derive better representations by combining the tokens in each context window, this potential is not realized in practice. Nevertheless, Melamud et al. (2014) show that capturing joint contexts can indeed improve performance on word similarity tasks, and we believe it is a direction worth pursuing.

win   eig   Average Performance
2     0     .612
2     0.5   .611
2     1     .551
5     0     .616
5     0.5   .612
5     1     .534
10    0     .584
10    0.5   .567
10    1     .484

Table 6: The average performance of SVD on word similarity tasks given different values of eig, in the vanilla scenario.

6 Hyperparameter Analysis

We analyze the individual impact of each hyperparameter, and try to characterize the conditions in which a certain setting is beneficial.

6.1 Harmful Configurations

Certain hyperparameter settings might cripple the performance of a certain method. We observe two scenarios in which SVD performs poorly.

SVD does not benefit from shifted PPMI. Setting neg > 1 consistently deteriorates SVD's performance. Levy and Goldberg (2014c) made a similar observation, and hypothesized that this is a result of the increasing number of zero-cells, which may cause SVD to prefer a factorization that is very close to the zero matrix. SVD's L2 objective is unweighted, and it does not distinguish between observed and unobserved matrix cells.

Using SVD "correctly" is bad. The traditional way of representing words with SVD uses the eigenvalue matrix (eig = 1): W = U_d · Σ_d. Despite being theoretically well-motivated, this setting leads to very poor results in practice, when compared to other settings (eig = 0.5 or 0). Table 6 demonstrates this gap. The drop in average accuracy when setting eig = 1 is astounding.
We then counted the number of times each hyperparameter setting was chosen in these configurations (Table 7). Some trends emerge, such as PPMI and SVD’s prefer- ence towards shorter context windows10 (win = 2), and that SGNS always prefers numerous nega- tive samples (neg > 1). To get a closer look and isolate the effect of each hyperparameter, we controlled for said hy- perparameter, and compared the best configura- tions given each of the hyperparameter’s settings. Table 8 shows the difference between default and non-default settings of each hyperparameter. While many hyperparameter settings can im- prove performance, they may also degrade it when chosen incorrectly. For instance, in the case of shifted PMI (neg), SGNS consistently profits from neg > 1, while SVD’s performance is dra- matically reduced. For PPMI, the utility of ap- plying neg > 1 depends on the type of task: word similarity or analogy. Another example is dynamic context windows (dyn), which is benefi- cial for MSR’s analogy task, but largely detrimen- tal to other tasks. It appears that the only hyperparameter that can be “blindly” applied in any situation is context distribution smoothing (cds = 0.75), yielding a consistent improvement at an insignificant risk. Note that cds helps PPMI more than it does other methods; we suggest that this is because it re- duces the relative impact of rare words on the dis- tributional representation, thus addressing PMI’s “Achilles’ heel”. 10This might also relate to PMI’s bias towards infrequent events (see Section 2.1). Broader windows create more ran- dom co-occurrences with rare words, “polluting” the distribu- tional vector with random words that have high PMI scores. 7 Practical Recommendations It is generally advisable to tune all hyperparam- eters, as well as algorithm-specific hyperparame- ters, for the task at hand. However, this may be computationally expensive. We thus provide some “rules of thumb”, which we found to work well in our setting: • Always use context distribution smoothing (cds = 0.75) to modify PMI, as described in Section 3.2. It consistently improves performance, and is applicable to PPMI, SVD, and SGNS. • Do not use SVD “correctly” (eig = 1). Instead, use one of the symmetric variants (Section 3.3). • SGNS is a robust baseline. While it might not be the best method for every task, it does not signif- icantly underperform in any scenario. Moreover, SGNS is the fastest method to train, and cheapest (by far) in terms of disk space and memory con- sumption. • With SGNS, prefer many negative samples. • for both SGNS and GloVe, it is worthwhile to ex- periment with the ~w +~c variant, which is cheap to apply (does not require retraining) and can result in substantial gains (as well as substantial losses). 8 Conclusions Recent embedding methods introduce a plethora of design choices beyond network architecture and optimization algorithms. We reveal that these seemingly minor variations can have a large im- pact on the success of word representation meth- ods. By showing how to adapt and tune these hy- perparameters in traditional methods, we allow a proper comparison between representations, and challenge various claims of superiority from the word embedding literature. This study also exposes the need for more controlled-variable experiments, and extending the concept of “variable” from the obvious task, data, and method to the often ignored prepro- cessing steps and hyperparameter settings. 
We also stress the need for transparent and reproducible experiments, and commend authors such as Mikolov, Pennington, and others for making their code publicly available. In this spirit, we make our code available as well.¹¹

¹¹http://bitbucket.org/omerlevy/hyperwords

Method   win (2 : 5 : 10)   dyn (none : with)   sub (none : dirty)   neg (1 : 5 : 15)   cds (1.00 : 0.75)   w+c (only w : w + c)
PPMI     7 : 1 : 0          4 : 4               4 : 4                2 : 6 : 0          1 : 7               —
SVD      7 : 1 : 0          4 : 4               1 : 7                8 : 0 : 0          2 : 6               7 : 1
SGNS     2 : 3 : 3          6 : 2               4 : 4                0 : 4 : 4          3 : 5               4 : 4
GloVe    1 : 3 : 4          6 : 2               7 : 1                —                  —                   4 : 4

Table 7: The impact of each hyperparameter, measured by the number of tasks in which the best configuration had that hyperparameter setting. Non-applicable combinations are marked by "—".

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google   MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Mul      Mul
PPMI     +0.5%        –1.0%         0.0%           +0.1%             +0.4%          –0.1%         –0.1%    +1.2%
SVD      –0.8%        –0.2%         0.0%           +0.6%             +0.4%          –0.1%         +0.6%    +2.1%
SGNS     –0.9%        –1.5%         –0.3%          +0.1%             –0.1%          –0.1%         –1.0%    +0.7%
GloVe    –0.8%        –1.2%         –0.9%          –0.8%             +0.1%          –0.9%         –3.3%    +1.8%

(a) Performance difference between best models with dyn = with and dyn = none.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google   MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Mul      Mul
PPMI     +0.6%        +1.9%         +1.3%          +1.0%             –3.8%          –3.9%         –5.0%    –12.2%
SVD      +0.7%        +0.2%         +0.6%          +0.7%             +0.8%          –0.3%         +4.0%    +2.4%
SGNS     +1.5%        +2.2%         +1.5%          +0.1%             –0.4%          –0.1%         –4.4%    –5.4%
GloVe    +0.2%        –1.3%         –1.0%          –0.2%             –3.4%          –0.9%         –3.0%    –3.6%

(b) Performance difference between best models with sub = dirty and sub = none.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google   MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Mul      Mul
PPMI     +0.6%        +4.9%         +1.3%          +1.0%             +2.2%          +0.8%         –6.2%    –9.2%
SVD      –1.7%        –2.2%         –1.9%          –4.6%             –3.4%          –3.5%         –13.9%   –14.9%
SGNS     +1.5%        +2.9%         +2.3%          +0.5%             +1.5%          +1.1%         +3.3%    +2.1%
GloVe    —            —             —              —                 —              —             —        —

(c) Performance difference between best models with neg > 1 and neg = 1.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google   MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Mul      Mul
PPMI     +1.3%        +2.8%         0.0%           +2.1%             +3.5%          +2.9%         +2.7%    +9.2%
SVD      +0.4%        –0.2%         +0.1%          +1.1%             +0.4%          –0.3%         +1.4%    +2.2%
SGNS     +0.4%        +1.4%         0.0%           +0.1%             0.0%           +0.2%         +0.6%    0.0%
GloVe    —            —             —              —                 —              —             —        —

(d) Performance difference between best models with cds = 0.75 and cds = 1.

Method   WordSim      WordSim       Bruni et al.   Radinsky et al.   Luong et al.   Hill et al.   Google   MSR
         Similarity   Relatedness   MEN            M. Turk           Rare Words     SimLex        Mul      Mul
PPMI     —            —             —              —                 —              —             —        —
SVD      –0.6%        –0.2%         –0.4%          –2.1%             –0.7%          +0.7%         –1.8%    –3.4%
SGNS     +1.4%        +2.2%         +1.2%          +1.1%             –0.3%          –2.3%         –1.0%    –7.5%
GloVe    +2.3%        +4.7%         +3.0%          –0.1%             –0.7%          –2.6%         +3.3%    –8.9%

(e) Performance difference between best models with w+c = w + c and w+c = only w.

Table 8: The added value versus the risk of setting each hyperparameter. The figures show the differences in performance between the best achievable configurations when restricting a hyperparameter to different values. This difference indicates the potential gain of tuning a given hyperparameter, as well as the risks of decreased performance when not tuning it. For example, an entry of +9.2% in Table (d) means that the best model with cds = 0.75 is 9.2% more accurate (absolute) than the best model with cds = 1; i.e. on MSR's analogies, using cds = 0.75 instead of cds = 1 improved PPMI's accuracy from .443 to .535.
Acknowledgements This work was supported by the Google Research Award Program and the German Research Foun- dation via the German-Israeli Project Cooperation (grant DA 1600/1-1). We thank Marco Baroni and Jeffrey Pennington for their valuable comments. References Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A study on similarity and relatedness using distribu- tional and wordnet-based approaches. In Proceed- ings of Human Language Technologies: The 2009 Annual Conference of the North American Chap- ter of the Association for Computational Linguistics, pages 19–27, Boulder, Colorado, June. Association for Computational Linguistics. Marco Baroni and Alessandro Lenci. 2010. Dis- tributional memory: A general framework for corpus-based semantics. Computational Linguis- tics, 36(4):673–721. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Dont count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland, June. Association for Computational Linguistics. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic lan- guage model. Journal of Machine Learning Re- search, 3:1137–1155. Elia Bruni, Gemma Boleda, Marco Baroni, and Nam Khanh Tran. 2012. Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Lin- guistics (Volume 1: Long Papers), pages 136–145, Jeju Island, Korea, July. Association for Computa- tional Linguistics. John A Bullinaria and Joseph P Levy. 2007. Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods, 39(3):510–526. John A Bullinaria and Joseph P Levy. 2012. Extracting semantic representations from word co-occurrence statistics: Stop-lists, stemming, and SVD. Behavior Research Methods, 44(3):890–907. John Caron. 2001. Experiments with LSA scor- ing: optimal rank and basis. In Proceedings of the SIAM Computational Information Retrieval Work- shop, pages 157–169. Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicog- raphy. Computational Linguistics, 16(1):22–29. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Pro- ceedings of the 25th International Conference on Machine Learning, pages 160–167. Scott C. Deerwester, Susan T. Dumais, Thomas K. Lan- dauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JASIS, 41(6):391–407. C Eckart and G Young. 1936. The approximation of one matrix by another of lower rank. Psychome- trika, 1:211–218. Roi Reichart Felix Hill and Anna Korhonen. 2014. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. arXiv preprint arXiv:1408.3456. Adriano Ferraresi, Eros Zanchetta, Marco Baroni, and Silvia Bernardini. 2008. Introducing and evaluating ukwac, a very large web-derived corpus of English. In Proceedings of the 4th Web as Corpus Workshop (WAC-4), pages 47–54. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Ey- tan Ruppin. 2002. Placing search in context: The concept revisited. ACM Transactions on Informa- tion Systems, 20(1):116–131. Yoav Goldberg and Omer Levy. 2014. 
word2vec explained: deriving Mikolov et al.’s negative- sampling word-embedding method. arXiv preprint arXiv:1402.3722. Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162. Omer Levy and Yoav Goldberg. 2014a. Dependency- based word embeddings. In Proceedings of the 52nd Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers), pages 302–308, Baltimore, Maryland. Omer Levy and Yoav Goldberg. 2014b. Linguistic regularities in sparse and explicit word representa- tions. In Proceedings of the Eighteenth Confer- ence on Computational Natural Language Learning, pages 171–180, Baltimore, Maryland. Omer Levy and Yoav Goldberg. 2014c. Neural word embeddings as implicit matrix factorization. In Ad- vances in Neural Information Processing Systems 27: Annual Conference on Neural Information Pro- cessing Systems 2014, December 8-13 2014, Mon- treal, Quebec, Canada, pages 2177–2185. Minh-Thang Luong, Richard Socher, and Christo- pher D. Manning. 2013. Better word representa- tions with recursive neural networks for morphol- ogy. In Proceedings of the Seventeenth Confer- ence on Computational Natural Language Learning, pages 104–113, Sofia, Bulgaria, August. Associa- tion for Computational Linguistics. Oren Melamud, Ido Dagan, Jacob Goldberger, Idan Szpektor, and Deniz Yuret. 2014. Probabilistic modeling of joint-context in distributional similar- ity. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 181–190, Baltimore, Maryland, June. Association for Computational Linguistics. Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proceedings of the International Conference on Learning Represen- tations (ICLR). Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013b. Distributed rep- resentations of words and phrases and their compo- sitionality. In Advances in Neural Information Pro- cessing Systems, pages 3111–3119. Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013c. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751. Sebastian Padó and Mirella Lapata. 2007. Dependency-based construction of semantic space models. Computational Linguistics, 33(2):161–199. Patrick Pantel and Dekang Lin. 2002. Discovering word senses from text. In Proceedings of the eighth ACM SIGKDD international conference on Knowl- edge discovery and data mining, pages 613–619. ACM. Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October. Association for Computational Lin- guistics. Kira Radinsky, Eugene Agichtein, Evgeniy Gabrilovich, and Shaul Markovitch. 2011. A word at a time: Computing word relatedness using temporal semantic analysis. In Proceedings of the 20th international conference on World wide web, pages 337–346. ACM. Magnus Sahlgren. 2006. The Word-Space Model. Ph.D. thesis, Stockholm University. Peter D. Turney and Michael L. Littman. 2003. Mea- suring praise and criticism: Inference of semantic orientation from association. Transactions on Infor- mation Systems, 21(4):315–346. Peter D. Turney and Patrick Pantel. 2010. 
From frequency to meaning: Vector space models of se- mantics. Journal of Artificial Intelligence Research, 37(1):141–188. Peter D. Turney. 2012. Domain and function: A dual- space model of semantic relations and compositions. Journal of Artificial Intelligence Research, 44:533– 585. Torsten Zesch, Christof Müller, and Iryna Gurevych. 2008. Using wiktionary for computing semantic relatedness. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 2, AAAI’08, pages 861–866. AAAI Press.