From Paraphrase Database to Compositional Paraphrase Model and Back John Wieting∗ Mohit Bansal† Kevin Gimpel† Karen Livescu† ∗University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA wieting2@illinois.edu †Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA {mbansal,kgimpel,klivescu}@ttic.edu Abstract The Paraphrase Database (PPDB; Ganitke- vitch et al., 2013) is an extensive semantic re- source, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phrase pairs from the PPDB to build parametric paraphrase models that score paraphrase pairs more accurately than the PPDB’s internal scores while simul- taneously improving its coverage. They allow for learning phrase embeddings as well as im- proved word embeddings. Moreover, we in- troduce two new, manually annotated datasets to evaluate short-phrase paraphrasing mod- els. Using our paraphrase model trained using PPDB, we achieve state-of-the-art results on standard word and bigram similarity tasks and beat strong baselines on our new short phrase paraphrase tasks.1 1 Introduction Paraphrase detection2 is the task of analyzing two segments of text and determining if they have the same meaning despite differences in structure and wording. It is useful for a variety of NLP tasks like question answering (Rinaldi et al., 2003; Fader et al., 2013), semantic parsing (Berant and Liang, 2014), textual entailment (Bosma and Callison- Burch, 2007), and machine translation (Marton et al., 2009). 1We release our datasets, code, and trained models at http://web.engr.illinois.edu/˜wieting2/. 2See Androutsopoulos and Malakasiotis (2010) for a survey on approaches for detecting paraphrases. One component of many such systems is a para- phrase table containing pairs of text snippets, usu- ally automatically generated, that have the same meaning. The most recent work in this area is the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), a collection of confidence-rated paraphrases created using the pivoting technique of Bannard and Callison-Burch (2005) over large parallel corpora. The PPDB is a massive resource, containing 220 million paraphrase pairs. It captures many short paraphrases that would be difficult to obtain us- ing any other resource. For example, the pair {we must do our utmost, we must make every effort} has little lexical overlap but is present in PPDB. The PPDB has recently been used for monolingual align- ment (Yao et al., 2013), for predicting sentence sim- ilarity (Bjerva et al., 2014), and to improve the cov- erage of FrameNet (Rastogi and Van Durme, 2014). Though already effective for multiple NLP tasks, we note some drawbacks of PPDB. The first is lack of coverage: to use the PPDB to compare two phrases, both must be in the database. The second is that PPDB is a nonparametric paraphrase model; the number of parameters (phrase pairs) grows with the size of the dataset used to build it. In practice, it can become unwieldy to work with as the size of the database increases. A third concern is that the confidence estimates in PPDB are a heuristic com- bination of features, and their quality is unclear. We address these issues in this work by introduc- ing ways to use PPDB to construct parametric para- phrase models. 
First we show that initial skip-gram word vectors (Mikolov et al., 2013a) can be fine-tuned for the paraphrase task by training on word pairs from PPDB. We call them PARAGRAM word vectors. We find additive composition of PARAGRAM vectors to be a simple but effective way to embed phrases for short-phrase paraphrase tasks. We find improved performance by training a recursive neural network (RNN; Socher et al., 2010) directly on phrase pairs from PPDB.

We show that our resulting word and phrase representations are effective on a wide variety of tasks, including two new datasets that we introduce. The first, Annotated-PPDB, contains pairs from PPDB that were scored by human annotators. It can be used to evaluate paraphrase models for short phrases. We use it to show that the phrase embeddings produced by our methods are significantly more indicative of paraphrasability than the original heuristic scoring used by Ganitkevitch et al. (2013). Thus we use the power of PPDB to improve its contents.

Our second dataset, ML-Paraphrase, is a re-annotation of the bigram similarity corpus from Mitchell and Lapata (2010). The task was originally developed to measure semantic similarity of bigrams, but some annotations are not congruent with the functional similarity central to paraphrase relationships. Our re-annotation can be used to assess the paraphrasing capability of bigram compositional models.

In summary, we make the following contributions:

Provide new PARAGRAM word vectors, learned using PPDB, that achieve state-of-the-art performance on the SimLex-999 lexical similarity task (Hill et al., 2014b) and lead to improved performance in sentiment analysis.

Provide ways to use PPDB to embed phrases. We compare additive and RNN composition of PARAGRAM vectors. Both can improve PPDB by re-ranking the paraphrases in PPDB to improve correlations with human judgments. They can be used as concise parameterizations of PPDB, thereby vastly increasing its coverage. We also perform a qualitative analysis of the differences between additive and RNN composition.

Introduce two new datasets. The first contains PPDB phrase pairs and evaluates how well models can measure the quality of short paraphrases. The second is a new annotation of the bigram similarity task in Mitchell and Lapata (2010) that makes it suitable for evaluating bigram paraphrases.

We release the new datasets, complete with annotation instructions and raw annotations, as well as our code and the trained models.3

2 Related Work

There is a vast literature on representing words as vectors. The intuition of most methods to create these vectors (or embeddings) is that similar words have similar contexts (Firth, 1957). Earlier models made use of latent semantic analysis (LSA) (Deerwester et al., 1990). Recently, more sophisticated neural models, beginning with the work of Bengio et al. (2003), have been gaining popularity (Mikolov et al., 2013a; Pennington et al., 2014). These embeddings are now being used in new ways as they are being tailored to specific downstream tasks (Bansal et al., 2014).

Phrase representations can be created from word vectors using compositional models.
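The simplest compositional model referred to above, and one used repeatedly in this paper, is vector addition followed by cosine similarity. The sketch below is only an illustration of that idea with made-up toy vectors, not the released PARAGRAM embeddings:

```python
import numpy as np

# Toy 4-dimensional word vectors; the paper's PARAGRAM vectors are
# 25-dimensional and trained on PPDB (these values are illustrative only).
word_vecs = {
    "easy":   np.array([0.6, 0.1, -0.3, 0.2]),
    "job":    np.array([0.2, 0.7,  0.1, -0.1]),
    "simple": np.array([0.5, 0.2, -0.2, 0.3]),
    "task":   np.array([0.3, 0.6,  0.2, -0.2]),
}

def embed_additive(phrase):
    """Embed a phrase by summing its word vectors."""
    return sum(word_vecs[w] for w in phrase.split())

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embed_additive("easy job"), embed_additive("simple task")))
```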
Simple but effective compositional models were studied by Mitchell and Lapata (2008; 2010) and Blacoe and Lapata (2012). They compared a variety of bi- nary operations on word vectors and found that simple point-wise multiplication of explicit vector representations performed very well. Other works like Zanzotto et al. (2010) and Baroni and Zampar- elli (2010) also explored composition using models based on operations of vectors and matrices. More recent work has shown that the extremely efficient neural embeddings of Mikolov et al. (2013a) also do well on compositional tasks simply by adding the word vectors (Mikolov et al., 2013b). Hashimoto et al. (2014) introduced an alternative word embedding and compositional model based on predicate-argument structures that does well on two simple composition tasks, including the one intro- duced by Mitchell and Lapata (2010). An alternative approach to composition, used by Socher et al. (2011), is to train a recursive neural network (RNN) whose structure is defined by a bi- narized parse tree. In particular, they trained their RNN as an unsupervised autoencoder. The RNN captures the latent structure of composition. Recent work has shown that this model struggles in tasks in- volving compositionality (Blacoe and Lapata, 2012; Hashimoto et al., 2014).4 However, we found suc- 3http://web.engr.illinois.edu/˜wieting2/ 4We also replicated this approach and found training to be time-consuming even using low-dimensional word vectors. 346 http://web.engr.illinois.edu/~wieting2/ cess using RNNs in a supervised setting, similar to Socher et al. (2014), who used RNNs to learn representations for image descriptions. The objec- tive function we used in this work was motivated by their multimodal objective function for learning joint image-sentence representations. Lastly, the PPDB has been used along with other resources to learn word embeddings for several tasks, including semantic similarity, language mod- eling, predicting human judgments, and classifica- tion (Yu and Dredze, 2014; Faruqui et al., 2015). Concurrently with our work, it has also been used to construct paraphrase models for short phrases (Yu and Dredze, 2015). 3 New Paraphrase Datasets We created two novel datasets: (1) Annotated- PPDB, a subset of phrase pairs from PPDB which are annotated according to how strongly they rep- resent a paraphrase relationship, and (2) ML- Paraphrase, a re-annotation of the bigram similarity dataset from Mitchell and Lapata (2010), again an- notated for strength of paraphrase relationship. 3.1 Annotated-PPDB Our motivation for creating Annotated-PPDB was to establish a way to evaluate compositional para- phrase models on short phrases. Most existing para- phrase tasks focus on words, like SimLex-999 (Hill et al., 2014b), or entire sentences, such as the Mi- crosoft Research Paraphrase Corpus (Dolan et al., 2004; Quirk et al., 2004). To our knowledge, there are no datasets that focus on the paraphrasability of short phrases. Thus, we created Annotated-PPDB so that researchers can focus on local compositional phenomena and measure the performance of models directly—avoiding the need to do so indirectly in a sentence-level task. Models that have strong perfor- mance on Annotated-PPDB can be used to provide more accurate confidence scores for the paraphrases in the PPDB as well as reduce the need for large paraphrase tables altogether. 
Annotated-PPDB was created in a multi-step process (outlined below) involving various automatic filtering steps followed by crowdsourced human annotation. One of the aims for our dataset was to collect a variety of paraphrase types: we wanted to include pairs that were non-trivial to recognize as well as those with a range of similarity and length. We focused on phrase pairs with limited lexical overlap to avoid including those with only trivial differences. We started with candidate phrases extracted from the first 10M pairs in the XXL version of the PPDB and then executed the following steps.5

5 Note that the confidence scores for phrase pairs in PPDB are based on a weighted combination of features with weights determined heuristically. The confidence scores were used to place the phrase pairs into their respective sets (S, M, L, XL, XXL, etc.), where each larger set subsumes all smaller ones.

Filter phrases for quality: Only those phrases whose tokens were in our vocabulary were retained.6 Next, all duplicate paraphrase pairs were removed; in PPDB, these are distinct pairs that contain the same two phrases with the order swapped.

6 Throughout, our vocabulary is defined as the most common 100K word types in English Wikipedia, following tokenization and lowercasing (see §5).

Filter by lexical overlap: Next, we calculated the word overlap score in each phrase pair and then retained only those pairs that had a score of less than 0.5. By word overlap score, we mean the fraction of tokens in the smaller of the phrases with Levenshtein distance ≤ 1 to a token in the larger of the phrases. This was done to exclude less interesting phrase pairs like 〈my dad had, my father had〉 or 〈ballistic missiles, of ballistic missiles〉 that only differ in a synonym or the addition of a single word.

Select range of paraphrasabilities: To balance our dataset with both clear paraphrases and erroneous pairs in PPDB, we sampled 5,000 examples from each of ten chunks of the first 10M initial phrase pairs, where a chunk is defined as 1M phrase pairs.

Select range of phrase lengths: We then selected 1,500 phrases from each 5,000-example sample that encompassed a wide range of phrase lengths. To do this, we first binned the phrase pairs by their effective size. Let n1 be the number of tokens of length greater than one character in the first phrase and n2 the same for the second phrase. Then the effective size is defined as max(n1, n2). The bins contained pairs of effective size of 3, 4, and 5 or more, and 500 pairs were selected from each bin. This gave us a total of 15,000 phrase pairs.

Prune to 3,000: 3,000 phrase pairs were then selected randomly from the 15,000 remaining pairs to form an initial dataset, Annotated-PPDB-3K. The phrases were selected so that every phrase in the dataset was unique.

Annotate with Mechanical Turk: The dataset was then rated on a scale from 1-5 using Amazon Mechanical Turk, where a score of 5 denoted phrases that are equivalent in a large number of contexts, 3 meant that the phrases had some overlap in meaning, and 1 indicated that the phrases were dissimilar or contradictory in some way (e.g., can not adopt and is able to accept). We only permitted workers whose location was in the United States and who had done at least 1,000 HITs with a 99% acceptance rate. Each example was labeled by 5 annotators and their scores were averaged to produce the final rating. Table 1 shows some statistics of the data. Overall, the annotated data had a mean deviation (MD)7 of 0.80.
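To make the lexical-overlap filter and the effective-size binning concrete, here is a minimal sketch of both computations. This is our own reading of the description above (the 0.5 threshold and the edit-distance-≤-1 test come from the text; the helper names are ours), not the authors' released code:

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def word_overlap_score(phrase1, phrase2):
    """Fraction of tokens in the smaller phrase within edit distance 1
    of some token in the larger phrase."""
    toks1, toks2 = phrase1.split(), phrase2.split()
    small, large = sorted([toks1, toks2], key=len)
    matched = sum(1 for t in small if any(levenshtein(t, u) <= 1 for u in large))
    return matched / len(small)

def effective_size(phrase1, phrase2):
    """Max over the two phrases of the number of tokens longer than one character."""
    n1 = sum(len(t) > 1 for t in phrase1.split())
    n2 = sum(len(t) > 1 for t in phrase2.split())
    return max(n1, n2)

pair = ("my dad had", "my father had")
print(word_overlap_score(*pair))   # 1.0 > 0.5, so this pair would be filtered out
print(effective_size(*pair))       # 3
```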
Table 1 shows that overall, workers found the phrases to be of high quality, as more than two-thirds of the pairs had an average score of at least 3. Also from the table, we can see that workers had stronger agreement on very low and very high quality pairs and were less certain in the middle of the range.

Score Range   MD     % of Data
[1,2)         0.66   8.1
[2,3)         1.05   20.0
[3,4)         0.93   34.9
[4,5]         0.59   36.9

Table 1: An analysis of Annotated-PPDB-3K extracted from PPDB. The statistics shown are for the splits of the data according to the average score by workers. MD denotes mean deviation and % of Data refers to the percentage of our dataset that fell into each range.

7 MD is similar to standard deviation, but uses absolute value instead of squared value and thus is both more intuitive and less sensitive to outliers.

Prune to 1,260: To create our final dataset, Annotated-PPDB, we selected 1,260 phrase pairs from the 3,000 annotations. We did this by first binning the phrases into 3 categories: those with scores in the interval [1,2.5), those with scores in the interval [2.5,3.5], and those with scores in the interval (3.5,5]. We took the 420 phrase pairs with the lowest MD in each bin, as these have the most agreement about their label, to form Annotated-PPDB. These 1,260 examples were then randomly split into a development set of 260 examples and a test set of 1,000 examples. The development set had an MD of 0.61 and the test set had an MD of 0.60, indicating the final dataset had pairs of higher agreement than the initial 3,000.

3.2 ML-Paraphrase

Our second newly-annotated dataset, ML-Paraphrase, is based on the bigram similarity task originally introduced by Mitchell and Lapata (2010); we refer to the original annotations as the ML dataset.

The ML dataset consists of human similarity ratings for three types of bigrams: adjective-noun (JN), noun-noun (NN), and verb-noun (VN). Through manual inspection, we found that the annotations were not consistent with the notion of similarity central to paraphrase tasks. For instance, television set and television programme were the highest rated phrases in the NN section (based on average annotator score). Similarly, one of the highest ranked JN pairs was older man and elderly woman. This indicates that the annotations reflect topical similarity in addition to capturing functional or definitional similarity.

Therefore, we had the data re-annotated by two authors of this paper who are native English speakers.8 The bigrams were labeled on a scale from 1-5, where 5 denotes phrases that are equivalent in a large number of contexts, 3 indicates the phrases are roughly equivalent in a narrow set of contexts, and 1 means the phrases are not at all equivalent in any context. Following annotation, we collapsed the rating scale by merging 4s and 5s together and 1s and 2s together.

8 We tried using Mechanical Turk here, but due to such short phrases, with few having the paraphrase relationship, workers did not perform well on the task.

Statistics for the data are shown in Table 2.

Data   IA ρ   IA κ   ML comp. ρ   ML Human ρ
JN     0.87   0.79   0.56         0.52
NN     0.64   0.58   0.38         0.49
VN     0.73   0.73   0.55         0.55

Table 2: Inter-annotator agreement of ML-Paraphrase and comparison with the ML dataset. Columns 2 and 3 show the inter-annotator agreement between the two annotators measured with Spearman ρ and Cohen's κ. Column 4 shows the ρ between ML-Paraphrase and all of the ML dataset. The last column is the average human ρ on the ML dataset.
We show inter-annotator Spearman ρ and Cohen's κ in columns 2 and 3, indicating substantial agreement on the JN and VN portions but only moderate agreement on NN. In fact, when evaluating our NN annotations against those from the original ML data (column 4), we find ρ to be 0.38, well below the average human correlation of 0.49 (final column) reported by Mitchell and Lapata and also surpassed by pointwise multiplication (Mitchell and Lapata, 2010). This suggests that the original NN portion, more so than the others, favored a notion of similarity more related to association than paraphrase.

4 Paraphrase Models

We now present parametric paraphrase models and discuss training. Our goal is to embed phrases into a low-dimensional space such that cosine similarity in the space corresponds to the strength of the paraphrase relationship between phrases.

We use a recursive neural network (RNN) similar to that used by Socher et al. (2014). We first use a constituent parser to obtain a binarized parse of a phrase. For phrase p, we compute its vector g(p) through recursive computation on the parse. That is, if phrase p is the yield of a parent node in a parse tree, and phrases c1 and c2 are the yields of its two child nodes, we define g(p) recursively as follows:

g(p) = f(W[g(c_1); g(c_2)] + b)

where f is an element-wise activation function (tanh), [g(c_1); g(c_2)] ∈ R^{2n} is the concatenation of the child vectors, W ∈ R^{n×2n} is the composition matrix, b ∈ R^n is the offset, and n is the dimensionality of the word embeddings. If node p has no children (i.e., it is a single token), we define g(p) = W_w^{(p)}, where W_w is the word embedding matrix in which particular word vectors are indexed using superscripts. The trainable parameters of the model are W, b, and W_w.

4.1 Objective Functions

We now present objective functions for training on pairs extracted from PPDB. The training data consists of (possibly noisy) pairs taken directly from the original PPDB. In subsequent sections, we discuss how we extract training pairs for particular tasks.

We assume our training data consists of a set X of phrase pairs 〈x_1, x_2〉, where x_1 and x_2 are assumed to be paraphrases. To learn the model parameters (W, b, W_w), we minimize our objective function over the data using AdaGrad (Duchi et al., 2011) with mini-batches. The objective function follows:

\min_{W, b, W_w} \frac{1}{|X|} \Bigl( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_1) \cdot g(t_1)) + \max(0, \delta - g(x_1) \cdot g(x_2) + g(x_2) \cdot g(t_2)) \Bigr) + \lambda_W (\lVert W \rVert^2 + \lVert b \rVert^2) + \lambda_{W_w} \lVert W_{w_{\mathrm{initial}}} - W_w \rVert^2    (1)

where λ_W and λ_{W_w} are regularization parameters, W_{w_initial} is the initial word embedding matrix, δ is the margin (set to 1 in all of our experiments), and t_1 and t_2 are carefully-selected negative examples taken from a mini-batch during optimization. The intuition for this objective is that we want the two phrases to be more similar to each other (g(x_1)·g(x_2)) than either is to their respective negative examples t_1 and t_2, by a margin of at least δ.

Selecting Negative Examples To select t_1 and t_2 in Eq. 1, we simply chose the most similar phrase in the mini-batch (other than those in the given phrase pair). E.g., for choosing t_1 for a given 〈x_1, x_2〉:

t_1 = \operatorname*{argmax}_{t : \langle t, \cdot \rangle \in X_b \setminus \{\langle x_1, x_2 \rangle\}} g(x_1) \cdot g(t)

where X_b ⊆ X is the current mini-batch. That is, we want to choose a negative example t_i that is similar to x_i according to the current model parameters. The downside of this approach is that we may occasionally choose a phrase t_i that is actually a true paraphrase of x_i.
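As a rough sketch of how the margin loss of Eq. 1 can be evaluated for one mini-batch, consider the NumPy fragment below. It is not the authors' implementation: composition is applied right-branching rather than over a real binarized parse, the AdaGrad update and regularization terms are omitted, and all vectors are random; it only illustrates the hinge terms and the in-batch negative selection.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25                                    # embedding size used in the paper
W = rng.normal(scale=0.1, size=(n, 2 * n))
b = np.zeros(n)

def compose(vectors):
    """Right-branching composition g(p) = tanh(W [g(c1); g(c2)] + b).
    A real implementation would follow a binarized constituent parse."""
    g = vectors[-1]
    for v in reversed(vectors[:-1]):
        g = np.tanh(W @ np.concatenate([v, g]) + b)
    return g

def minibatch_loss(batch, delta=1.0):
    """Hinge terms of Eq. 1 (without regularization) for a mini-batch of
    (phrase1_word_vectors, phrase2_word_vectors) pairs."""
    embedded = [(compose(x1), compose(x2)) for x1, x2 in batch]
    loss = 0.0
    for i, (g1, g2) in enumerate(embedded):
        # in-batch negatives: the most similar phrase from any *other* pair
        others = [g for j, pair in enumerate(embedded) if j != i for g in pair]
        t1 = max(others, key=lambda g: float(g1 @ g))
        t2 = max(others, key=lambda g: float(g2 @ g))
        loss += max(0.0, delta - g1 @ g2 + g1 @ t1)
        loss += max(0.0, delta - g1 @ g2 + g2 @ t2)
    return loss / len(batch)

# Toy batch of two phrase pairs, each phrase given as a list of word vectors.
batch = [([rng.normal(size=n) for _ in range(3)], [rng.normal(size=n) for _ in range(2)]),
         ([rng.normal(size=n) for _ in range(2)], [rng.normal(size=n) for _ in range(4)])]
print(minibatch_loss(batch))
```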
We also tried a strategy in which we selected the least similar phrase that would trigger an update (i.e., g(t_i)·g(x_i) > g(x_1)·g(x_2) − δ), but we found the simpler strategy above to work better and used it for all experiments reported below.

Discussion The objective in Eq. 1 is similar to one used by Socher et al. (2014), but with several differences. Their objective compared text and projected images. They also did not update the underlying word embeddings; we do so here, and in a way such that they are penalized from deviating from their initialization. Also, for a given 〈x_1, x_2〉, they do not select a single t_1 and t_2 as we do, but use the entire training set, which can be very expensive with a large training dataset.

We also experimented with a simpler objective that sought to directly minimize the squared L2-norm between g(x_1) and g(x_2) in each pair, along with the same regularization terms as in Eq. 1. One problem with this objective function is that the global minimum is 0 and is achieved simply by driving the parameters to 0. We obtained much better results using the objective in Eq. 1.

Training Word Paraphrase Models To train just word vectors on word paraphrase pairs (again from PPDB), we used the same objective function as above, but simply dropped the composition terms. This gave us an objective that bears some similarity to the skip-gram objective with negative sampling in word2vec (Mikolov et al., 2013a). Both seek to maximize the dot products of certain word pairs while minimizing the dot products of others. This objective function is:

\min_{W_w} \frac{1}{|X|} \Bigl( \sum_{\langle x_1, x_2 \rangle \in X} \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_1)} \cdot W_w^{(t_1)}) + \max(0, \delta - W_w^{(x_1)} \cdot W_w^{(x_2)} + W_w^{(x_2)} \cdot W_w^{(t_2)}) \Bigr) + \lambda_{W_w} \lVert W_{w_{\mathrm{initial}}} - W_w \rVert^2    (2)

It is like Eq. 1 except with word vectors replacing the RNN composition function and with the regularization terms on W and b removed.

We further found we could improve this model by incorporating constraints. From our training pairs, for a given word w, we assembled all other words that were paired with it in PPDB and all of their lemmas. These were then used as constraints during the pairing process: a word t could only be selected as a negative example for w if it was not in w's list of assembled words.

5 Experiments – Word Paraphrasing

We first present experiments on learning lexical paraphrasability. We train on word pairs from PPDB and evaluate on the SimLex-999 dataset (Hill et al., 2014b), achieving the best results reported to date.

5.1 Training Procedure

To learn word vectors that reflect paraphrasability, we optimized Eq. 2. There are many tunable hyperparameters with this objective, so to make training tractable we fixed the initial learning rates for the word embeddings to 0.5 and the margin δ to 1. Then we did a coarse grid search over a parameter space for λ_{W_w} and the mini-batch size. We considered λ_{W_w} values in {10^-2, 10^-3, ..., 10^-7, 0} and mini-batch sizes in {100, 250, 500, 1000}. We trained for 20 epochs for each set of hyperparameters using AdaGrad (Duchi et al., 2011).

For all experiments, we initialized our word vectors with skip-gram vectors trained using word2vec (Mikolov et al., 2013a). The vectors were trained on English Wikipedia (tokenized and lowercased, yielding 1.8B tokens).9 We used a window size of 5 and a minimum count cut-off of 60, producing vectors for approximately 270K word types. We retained vectors for only the 100K most frequent words, averaging the rest to obtain a single vector for unknown words.
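The constraint mechanism used with Eq. 2 amounts to a lookup during negative-example selection. The sketch below is our own hedged reading of that step: the word pairs, lemma map, and vectors are fabricated stand-ins, and only the per-pair hinge terms are shown, not the full training loop.

```python
import numpy as np
from collections import defaultdict

# In the paper these pairs come from the lexical XL portion of PPDB;
# the three pairs and the lemma map here are made-up stand-ins.
ppdb_word_pairs = [("help", "assist"), ("help", "aid"), ("big", "large")]
lemma = {"help": "help", "assist": "assist", "aid": "aid", "big": "big", "large": "large"}

# For each word, collect every word it is paired with in PPDB plus the lemmas.
disallowed = defaultdict(set)
for w1, w2 in ppdb_word_pairs:
    for a, b in ((w1, w2), (w2, w1)):
        disallowed[a].update({b, lemma.get(b, b)})

def pick_negative(word, candidates, E):
    """Most similar in-batch word that is not a known paraphrase (or lemma) of `word`."""
    legal = [c for c in candidates if c != word and c not in disallowed[word]]
    return max(legal, key=lambda c: float(E[word] @ E[c]))

def hinge_terms(w1, w2, t1, t2, E, delta=1.0):
    """The two hinge terms of Eq. 2 for a single training pair."""
    pos = E[w1] @ E[w2]
    return max(0.0, delta - pos + E[w1] @ E[t1]) + max(0.0, delta - pos + E[w2] @ E[t2])

rng = np.random.default_rng(0)
E = {w: rng.normal(size=25) for w in ["help", "assist", "aid", "big", "large"]}
t1 = pick_negative("help", list(E), E)     # cannot be "assist" or "aid"
t2 = pick_negative("assist", list(E), E)   # cannot be "help"
print(t1, t2, hinge_terms("help", "assist", t1, t2, E))
```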
We will refer to this set of the 100K most frequent words as our vocabulary.

9 We used the December 2, 2013 snapshot.

5.2 Extracting Training Data

For training, we extracted word pairs from the lexical XL section of PPDB. We used the XL data for all experiments, including those for phrases. We used XL instead of XXL because XL has better quality overall while still being large enough so that we could be selective in choosing training pairs. There are a total of 548,085 pairs. We removed 174,766 that either contained numerical digits or words not in our vocabulary. We then removed 260,425 redundant pairs, leaving us with a final training set of 112,894 word pairs.

5.3 Tuning and Evaluation

Hyperparameters were tuned using the wordsim-353 (WS353) dataset (Finkelstein et al., 2001), specifically its similarity (WS-S) and relatedness (WS-R) partitions (Agirre et al., 2009). In particular, we tuned to maximize 2×WS-S correlation minus the WS-R correlation. The idea was to reward vectors with high similarity and relatively low relatedness, in order to target the paraphrase relationship.

After tuning, we evaluated the best hyperparameters on the SimLex-999 (SL999) dataset (Hill et al., 2014b). We chose SL999 as our primary test set as it most closely evaluates the paraphrase relationship. Even though WS-S is a close approximation to this relationship, it does not include pairs that are merely associated and assigned low scores, which SL999 does (see discussion in Hill et al., 2014b).

Note that for all experiments we used cosine similarity as our similarity metric and evaluated the statistical significance of dependent correlations using the one-tailed method of Steiger (1980).

5.4 Results

Table 3 shows results on SL999 when improving the initial word vectors by training on word pairs from PPDB, both with and without constraints. The "PARAGRAM WS" rows show results when tuning to maximize 2×WS-S − WS-R. We also show results for strong skip-gram baselines and the best results from the literature, including the state-of-the-art results from Hill et al. (2014a) as well as the inter-annotator agreement from Hill et al. (2014b).10

10 Hill et al. (2014a) did not report the dimensionality of the vectors that led to their state-of-the-art results.

Model                        n      SL999 ρ
skip-gram                    25     0.21
skip-gram                    1000   0.38
PARAGRAM WS                  25     0.56*
 + constraints               25     0.58*
Hill et al. (2014b)          200    0.446
Hill et al. (2014a)          -      0.52
inter-annotator agreement    N/A    0.67

Table 3: Results on the SimLex-999 (SL999) word similarity task obtained by performing hyperparameter tuning based on 2×WS-S − WS-R and treating SL999 as a held-out test set. n is word vector dimensionality. A * indicates statistical significance (p < 0.05) over the 1000-dimensional skip-gram vectors.

The table illustrates that, by training on PPDB, we can surpass the previous best correlations on SL999 by 4-6% absolute, achieving the best results reported to date. We also find that we can train low-dimensional word vectors that exceed the performance of much larger vectors. This is very useful as using large vectors can increase both time and memory consumption in NLP applications.

To generate word vectors to use for downstream applications, we chose hyperparameters so as to maximize performance on SL999.11 These word vectors, which we refer to as PARAGRAM vectors, had a ρ of 0.57 on SL999. We use them as initial word vectors for the remainder of the paper.

11 We did not use constraints during training.
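The tuning criterion (2×WS-S − WS-R) is easy to state in code. Below is a small hedged sketch of how a candidate set of vectors might be scored against the two WordSim-353 partitions using Spearman ρ of cosine similarities; the gold lists shown are placeholders, and loading the actual datasets is omitted.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman_on_pairs(embeddings, pairs):
    """Spearman rho between gold scores and cosine similarities of word pairs."""
    golds = [gold for _, _, gold in pairs]
    preds = [cosine(embeddings[w1], embeddings[w2]) for w1, w2, _ in pairs]
    return spearmanr(golds, preds).correlation

def tuning_score(embeddings, ws_sim_pairs, ws_rel_pairs):
    """Model-selection criterion from the paper: 2 x WS-S rho minus WS-R rho."""
    return 2 * spearman_on_pairs(embeddings, ws_sim_pairs) \
             - spearman_on_pairs(embeddings, ws_rel_pairs)

# Placeholder data; in practice these come from the WS-S and WS-R partitions.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=25) for w in ["tiger", "cat", "car", "train", "coffee", "cup"]}
ws_s = [("tiger", "cat", 7.35), ("car", "train", 6.31)]
ws_r = [("coffee", "cup", 6.58), ("car", "train", 6.31)]
print(tuning_score(emb, ws_s, ws_r))
```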
5.5 Sentiment Analysis

As an extrinsic evaluation of our PARAGRAM word vectors, we used them in a convolutional neural network (CNN) for sentiment analysis. We used the simple CNN from Kim (2014) and the binary sentence-level sentiment analysis task from Socher et al. (2013). We used the standard data splits, removing examples with a neutral rating. We trained on all constituents in the training set while only using full sentences from development and test, giving us train/development/test sizes of 67,349/872/1,821.

The CNN uses m-gram filters, each of which is an m×n vector. The CNN computes the inner product between an m-gram filter and each m-gram in an example, retaining the maximum match (so-called "max-pooling"). The score of the match is a single dimension in a feature vector for the example, which is then associated with a weight in a linear classifier used to predict positive or negative sentiment.

While Kim (2014) used m-gram filters of several lengths, we only used unigram filters. We also fixed the word vectors during learning (called "static" by Kim). After learning, the unigram filters correspond to locations in the fixed word vector space. The learned classifier weights represent how strongly each location corresponds to positive or negative sentiment. We expect this static CNN to be more effective if the word vector space separates positive and negative sentiment.

In our experiments, we compared baseline skip-gram embeddings to our PARAGRAM vectors. We used an AdaGrad learning rate of 0.1, mini-batches of size 10, and a dropout rate of 0.5. We used 200 unigram filters and rectified linear units as the activation (applied to the filter output + filter bias). We trained for 30 epochs, predicting labels on the development set after each set of 3,000 examples. We recorded the highest development accuracy and used those parameters to predict labels on the test set.

Results are shown in Table 4. We see improvements over the baselines when using PARAGRAM vectors, even exceeding the performance of higher-dimensional skip-gram vectors.

word vectors   n    accuracy (%)
skip-gram      25   77.0
skip-gram      50   79.6
PARAGRAM       25   80.9

Table 4: Test set accuracies when comparing embeddings in a static CNN on the binary sentiment analysis task from Socher et al. (2013).

6 Experiments – Compositional Paraphrasing

In this section, we describe experiments on a variety of compositional phrase-based paraphrasing tasks. We start with the simplest case of bigrams, and then proceed to short phrases. For all tasks, we again train on appropriate data from PPDB and test on various evaluation datasets, including our two novel datasets (Annotated-PPDB and ML-Paraphrase).

6.1 Training Procedure

We trained our models by optimizing Eq. 1 using AdaGrad (Duchi et al., 2011). We fixed the initial learning rates to 0.5 for the word embeddings and 0.05 for the composition parameters, and the margin to 1. Then we did a coarse grid search over a parameter space for λ_{W_w}, λ_W, and the mini-batch size. For λ_{W_w}, our search space again consisted of {10^-2, 10^-3, ..., 10^-7, 0}, for λ_W it was {10^-1, 10^-2, 10^-3, 0}, and we explored batch sizes of {100, 250, 500, 1000, 2000}. When initializing with PARAGRAM vectors, the search space for λ_{W_w} was shifted upwards to be {10, 1, 10^-1, 10^-3, ..., 10^-6} to reflect our increased confidence in the initial vectors. We trained only for 5 epochs for each set of parameters.
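A minimal sketch of the coarse grid search just described follows. Only the hyperparameter grids are taken from the text; train_and_score is a placeholder for training the composition model for 5 epochs with AdaGrad and evaluating it on development data.

```python
import itertools

# Hyperparameter grids from Section 6.1 (skip-gram initialization variant).
lambda_ww_grid = [10**-k for k in range(2, 8)] + [0]
lambda_w_grid = [1e-1, 1e-2, 1e-3, 0]
batch_size_grid = [100, 250, 500, 1000, 2000]

def train_and_score(lambda_ww, lambda_w, batch_size):
    # Placeholder: would train for 5 epochs and return a development-set correlation.
    return 0.0

best_score, best_config = float("-inf"), None
for config in itertools.product(lambda_ww_grid, lambda_w_grid, batch_size_grid):
    score = train_and_score(*config)
    if score > best_score:
        best_score, best_config = score, config

print(best_config)
```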
For baselines, we used the same initial skip-gram vectors as in Section 5.

6.2 Evaluation and Baselines

For all experiments, we again used cosine similarity as our similarity metric and evaluated the statistical significance using the method of Steiger (1980).

A baseline used in all compositional experiments is vector addition of skip-gram (or PARAGRAM) word vectors. Unlike explicit word vectors, where point-wise multiplication acts as a conjunction of features and performs well on composition tasks (Mitchell and Lapata, 2008), using addition with skip-gram vectors (Mikolov et al., 2013b) gives better performance than multiplication.

6.3 Bigram Paraphrasability

To evaluate our ability to paraphrase bigrams, we consider the original bigram similarity task from Mitchell and Lapata (2010) as well as our newly-annotated version of it: ML-Paraphrase.

Extracting Training Data Training data for these tasks was extracted from the XL portion of PPDB. The bigram similarity task from Mitchell and Lapata (2010) contains three types of bigrams: adjective-noun (JN), noun-noun (NN), and verb-noun (VN). We aimed to collect pairs from PPDB that mirrored these three types of bigrams.

We found parsing to be unreliable on such short segments of text, so we used a POS tagger (Manning et al., 2014) to tag the tokens in each phrase. We then used the word alignments in PPDB to extract bigrams for training. For JN and NN, we extracted pairs containing aligned, adjacent tokens in the two phrases with the appropriate part-of-speech tags. Thus we extracted pairs like 〈easy job, simple task〉 for the JN section and 〈town meeting, town council〉 for the NN section. We used a different strategy for extracting training data for the VN subset: we took aligned VN tokens and took the closest noun after the verb. This was done to approximate the direct object that would have been ideally extracted with a dependency parse. An example from this section is 〈achieve goal, achieve aim〉.

We removed phrase pairs that (1) contained words not in our vocabulary, (2) were redundant with others, (3) contained brackets, or (4) had Levenshtein distance ≤ 1. The final criterion helps to ensure that we train on phrase pairs with non-trivial differences. The final training data consisted of 133,997 JN pairs, 62,640 VN pairs, and 35,601 NN pairs.

Baselines In addition to RNN models, we report baselines that use vector addition as the composition function, both with our skip-gram embeddings and PARAGRAM embeddings from Section 5.

We also compare to several results from prior work. When doing so, we took their best correlations for each data subset. That is, the JN and NN results from Mitchell and Lapata (2010) use their multiplicative model and the VN results use their dilation model. From Hashimoto et al. (2014) we used their PAS-CLBLM Add_l and PAS-CLBLM Add_nl models. We note that their vector dimensionalities are larger than ours, using n = 2000 and 50, respectively.

Results Results are shown in Table 5. We report results on the test portion of the original Mitchell and Lapata (2010) dataset (ML) as well as the entirety of our newly-annotated dataset (ML-Paraphrase). RNN results on ML were tuned on the respective development sections and RNN results on ML-Paraphrase were tuned on the entire ML dataset.

                                    Mitchell and Lapata (2010) Bigrams    ML-Paraphrase
word vectors   n    comp.           JN      NN     VN      Avg            JN      NN     VN      Avg
skip-gram      25   +               0.36    0.44   0.36    0.39           0.32    0.35   0.42    0.36
PARAGRAM       25   +               0.44*   0.34   0.48*   0.42           0.50*   0.29   0.58*‡  0.46
PARAGRAM       25   RNN             0.51*†  0.40†  0.50*‡  0.47           0.57*‡  0.44†  0.55*   0.52
Hashimoto et al. (2014)             0.49    0.45   0.46    0.47           0.38    0.39   0.45    0.41
Mitchell and Lapata (2010)          0.46    0.49   0.38    0.44           -       -      -       -
Human                               -       -      -       -              0.87    0.64   0.73    0.75

Table 5: Results on the test section of the bigram similarity task of Mitchell and Lapata (2010) and our newly annotated version (ML-Paraphrase). n shows the word vector dimensionality and "comp." shows the composition function used: "+" is vector addition and "RNN" is the recursive neural network. The * indicates statistical significance (p < 0.05) over the skip-gram model, † statistical significance over the {PARAGRAM, +} model, and ‡ statistical significance over Hashimoto et al. (2014).

Our RNN model outperforms results from the literature on most sections in both datasets and its average correlations are among the highest.12 The one subset of the data that posed difficulty was the NN section of the ML dataset. We suspect this is due to the reasons discussed in Section 3.2; for our ML-Paraphrase dataset, by contrast, we do see gains on the NN section.

12 The results obtained here differ from those reported in Hashimoto et al. (2014) as we scored their vectors with a newer Python implementation of Spearman ρ that handles ties (Hashimoto, P.C.).

We also outperform the strong baseline of adding 1000-dimensional skip-gram embeddings, a model with 40 times the number of parameters, on our ML-Paraphrase dataset. This baseline had correlations of 0.45, 0.43, and 0.47 on the JN, NN, and VN partitions, with an average of 0.45, below the average ρ of the RNN (0.52) and even the {PARAGRAM, +} model (0.46).

Interestingly, the type of vectors used to initialize the RNN has a significant effect on performance. If we initialize using the 25-dimensional skip-gram vectors, the average ρ on ML-Paraphrase drops to 0.43, below even the {PARAGRAM, +} model.

6.4 Phrase Paraphrasability

In this section we show that by training a model based on filtered phrase pairs in PPDB, we can actually distinguish between quality paraphrases and poor paraphrases in PPDB better than the original heuristic scoring scheme from Ganitkevitch et al. (2013).

Extracting Training Data As before, training data was extracted from the XL section of PPDB. Similar to the procedure used to create our Annotated-PPDB dataset, phrases were filtered such that only those with a word overlap score of less than 0.5 were kept. We also removed redundant phrases and phrases that contained tokens not in our vocabulary. The phrases were then binned according to their effective size and 20,000 examples were selected from bins of effective sizes of 3, 4, and more than 5, creating a training set of 60,000 examples. Care was taken to ensure that none of our training pairs was also present in our development and test sets.

Baselines We compare our models with strong lexical baselines. The first, strict word overlap, is the percentage of words in the smaller phrase that are also in the larger phrase. We also include a version where the words are lemmatized prior to the calculation.

We also train a support vector regression model (epsilon-SVR) (Chang and Lin, 2011) on the 33 features that are included for each phrase pair in PPDB.
We scaled the features such that each lies in the interval [−1, 1] and tuned the parameters using 5-fold cross validation on our dev set.13 We then trained on the entire dev set after finding the best performing C and ε combination and evaluated on the test set of Annotated-PPDB.

13 We tuned both parameters over {2^-10, 2^-9, ..., 2^10}.

word vectors                 n    comp.   Annotated-PPDB
skip-gram                    25   +       0.20
PARAGRAM                     25   +       0.32*
PARAGRAM                     25   RNN     0.40*†‡
Ganitkevitch et al. (2013)                0.25
word overlap (strict)                     0.26
word overlap (lemmatized)                 0.20
PPDB+SVR                                  0.33

Table 6: Spearman correlation on Annotated-PPDB. The * indicates statistical significance (p < 0.05) over the skip-gram model, the † indicates statistical significance over the {PARAGRAM, +} model, and the ‡ indicates statistical significance over PPDB+SVR.

Results We evaluated on our Annotated-PPDB dataset described in §3.1. Table 6 shows the Spearman correlations on the 1000-example test set. RNN models were tuned on the development set of 260 examples. All other methods had no hyperparameters and therefore required no tuning.

We note that the confidence estimates from Ganitkevitch et al. (2013) reach a ρ of 0.25 on the test set, similar to the results of strict overlap. While 25-dimensional skip-gram embeddings only reach 0.20, we can improve this to 0.32 by fine-tuning them using PPDB (thereby obtaining our PARAGRAM vectors). By using the PARAGRAM vectors to initialize the RNN, we reach a correlation of 0.40, which is better than the PPDB confidence estimates by 15% absolute.

We again consider addition of 1000-dimensional skip-gram embeddings as a baseline, and they continue to perform strongly (ρ = 0.37). The RNN initialized with PARAGRAM vectors does reach a higher ρ (0.40), but the difference is not statistically significant (p = 0.16). Thus we can achieve similarly strong results with far fewer parameters.

This task also illustrates the importance of initializing our RNN model with appropriate word embeddings. An RNN initialized with skip-gram vectors has a modest ρ of 0.22, well below the ρ of the RNN initialized with PARAGRAM vectors. Clearly, initialization is important when optimizing non-convex objectives like ours, but it is noteworthy that our best results came from first improving the word vectors and then learning the composition model, rather than jointly learning both from scratch.

7 Qualitative Analysis

We performed a qualitative analysis to uncover sources of error and determine differences between adding PARAGRAM vectors and using an RNN initialized with them. To do so, we took the output of both systems on Annotated-PPDB and mapped their cosine similarities to the interval [1, 5]. We then computed their absolute error as compared to the gold ratings.

Score Range   +      RNN
[1,2)         2.35   2.08
[2,3)         1.56   1.38
[3,4)         0.87   0.85
[4,5]         0.43   0.47

Table 7: Average absolute error of the addition and RNN models on different ranges of gold scores.

Table 7 shows how the average of these absolute errors changes with the magnitude of the gold ratings. The RNN performs better (has lower average absolute error) for less similar pairs. Vector addition only does better on the most similar pairs. This is presumably because the most positive pairs have high word overlap and so can be represented effectively with a simpler model.

To further investigate the differences between these models, we removed those pairs with gold scores in [2,4], in order to focus on pairs with extreme scores.
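The error analysis behind Table 7 is straightforward to reproduce given model and gold scores. The sketch below is hedged: the score arrays are fabricated, and since the paper does not spell out the exact mapping, a linear rescaling of cosine similarity from [−1, 1] to [1, 5] is assumed here.

```python
import numpy as np

def map_to_gold_scale(cosine_scores):
    """Assumed linear map from cosine similarity in [-1, 1] to the 1-5 rating scale."""
    return 1.0 + 4.0 * (np.asarray(cosine_scores) + 1.0) / 2.0

def average_abs_error_by_range(cosine_scores, gold,
                               bins=((1, 2), (2, 3), (3, 4), (4, 5.01))):
    """Average |prediction - gold| within each range of gold scores (cf. Table 7)."""
    pred = map_to_gold_scale(cosine_scores)
    gold = np.asarray(gold)
    results = {}
    for lo, hi in bins:
        mask = (gold >= lo) & (gold < hi)
        results[(lo, hi)] = float(np.mean(np.abs(pred[mask] - gold[mask]))) if mask.any() else None
    return results

# Fabricated example scores for illustration only.
cos = [0.91, 0.10, 0.55, -0.20]
gold = [4.6, 1.8, 3.2, 1.2]
print(average_abs_error_by_range(cos, gold))
```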
We identified two factors that distinguished the performance between the two models: length ratio and the amount of lexical overlap. We did not find evidence that non-compositional phrases, such as idioms, were a source of error, as these were not found in ML-Paraphrase and only appear rarely in Annotated-PPDB.

We define length ratio as simply the number of tokens in the smaller phrase divided by the number of tokens in the larger phrase. Overlap ratio is the number of equivalent tokens in the phrase pair divided by the number of tokens in the smaller of the two phrases. Equivalent tokens are defined as tokens that are either exact matches or are paired up in the lexical portion of PPDB used to train the PARAGRAM vectors.

Table 9 shows how the performance of the models changes under different values of length ratio and overlap ratio.14 The values in this table are the percentage changes in absolute error when using the RNN over the PARAGRAM vector addition model. So negative values indicate superior performance by the RNN.

14 The bin delimiters were chosen to be uniform over the range of output values of the length ratio ([0.4, 1] with one outlier data point removed) and overlap ratio ([0, 1]).

Length Ratio          [0,0.6]   (0.6,0.8]   (0.8,1]
Positive Examples     -22.4     10.0        35.5
Negative Examples     -9.9      -11.1       -12.2
Both                  -13.0     -6.4        -2.0

Overlap Ratio         [0,1/3]   (1/3,2/3]   (2/3,1]
Positive Examples     -4.5      7.0         19.4
Negative Examples     -11.3     -7.5        -15.0
Both                  -10.6     -5.3        0.0

Table 9: Comparison of the addition and RNN models on phrase pairs of different overlap and length ratios. The values in the table are the percent change in absolute error from the addition model to the RNN model. Negative examples are defined as pairs from Annotated-PPDB whose gold score is less than 2 and positive examples are those with scores greater than 4. "Both" refers to both negative and positive examples.

A few trends emerge from this table. One is that as the length ratio increases (i.e., the phrase pairs are closer in length), addition surpasses the RNN for positive examples. For negative examples, the trend is reversed. The same trend appears for overlap ratio. Examples from Annotated-PPDB illustrating these trends on positive examples are shown in Table 8.

Index  Phrase 1                    Phrase 2                     Length Ratio  Overlap Ratio  Gold  RNN   +
1      scheduled to be held in     that will take place in      1.0           0.4            4.6   2.9   4.4
2      according to the paper ,    the newspaper reported that  0.8           0.5            4.6   2.8   4.1
3      at no cost to               without charge to            0.75          1.0            4.8   3.1   4.6
4      's surname                  family name of               0.67          1.0            4.4   2.8   4.1
5      could have an impact on     may influence                0.4           0.5            4.6   4.2   3.2
6      to participate actively     to play an active role       0.6           0.67           5.0   4.8   4.0
7      earliest opportunity        early as possible            0.67          0.0            4.4   4.3   2.9
8      does not exceed             is no more than              0.75          0.0            5.0   4.8   3.5

Table 8: Illustrative phrase pairs from Annotated-PPDB with gold similarity > 4. The last three columns show the gold similarity score, the similarity score of the RNN model, and the similarity score of vector addition. We note that addition performs better when the pairs have a high length ratio (rows 1-2) or overlap ratio (rows 3-4) while the RNN does better when those values are low (rows 5-6 and 7-8 respectively). Boldface indicates smaller error compared to gold scores.

When considering both positive and negative examples ("Both"), we see that the RNN excels on the most difficult examples (large differences in phrase length and less lexical overlap).
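Both diagnostic quantities are simple to compute. The sketch below follows the definitions above under the assumption that "equivalent tokens" are checked against a set of lexical PPDB word pairs; the empty pair set is a placeholder for the actual lexical XL pairs used to train the PARAGRAM vectors.

```python
def length_ratio(phrase1, phrase2):
    """Tokens in the smaller phrase divided by tokens in the larger phrase."""
    n1, n2 = len(phrase1.split()), len(phrase2.split())
    return min(n1, n2) / max(n1, n2)

def overlap_ratio(phrase1, phrase2, ppdb_pairs):
    """Equivalent tokens (exact match or paired in lexical PPDB) divided by
    the number of tokens in the smaller phrase."""
    toks1, toks2 = phrase1.split(), phrase2.split()
    small, large = sorted([toks1, toks2], key=len)
    def equivalent(t, u):
        return t == u or (t, u) in ppdb_pairs or (u, t) in ppdb_pairs
    matched = sum(1 for t in small if any(equivalent(t, u) for u in large))
    return matched / len(small)

ppdb_pairs = set()                          # placeholder for the lexical PPDB pairs
p1, p2 = "does not exceed", "is no more than"
print(length_ratio(p1, p2))                 # 3/4 = 0.75, matching row 8 of Table 8
print(overlap_ratio(p1, p2, ppdb_pairs))    # 0.0 with no lexical pairings, as in row 8
```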
For easier exam- ples, the two fare similarly overall (-2.0 to 0.0% change), but the RNN does much better on nega- tive examples. This aligns with the intuition that addition should perform well when two paraphrastic phrases have high lexical overlap and similar length. But when they are not paraphrases, simple addition is misled and the RNN’s learned composition func- tion better captures the relationship. This may sug- gest new architectures for modeling composition- ality differently depending on differences in length and amount of overlap. 8 Conclusion We have shown how to leverage PPDB to learn state-of-the-art word embeddings and compositional models for paraphrase tasks. Since PPDB was cre- ated automatically from parallel corpora, our models are also built automatically. Only small amounts of annotated data are used to tune hyperparameters. We also introduced two new datasets to evaluate compositional models of short paraphrases15, fill- ing a gap in the NLP community, as currently there are no datasets created for this purpose. Successful models on these datasets can then be used to extend the coverage of, or provide an alternative to, PPDB. 15http://web.engr.illinois.edu/˜wieting2/ 355 http://web.engr.illinois.edu/~wieting2/ There remains a great deal of work to be done in developing new composition models, whether with new network architectures or distance functions. In this work, we based our composition function on constituent parse trees, but this may not be the best approach—especially for short phrases. Depen- dency syntax may be a better alternative (Socher et al., 2014). Besides improving composition, another direction to explore is how to use models for short phrases in sentence-level paraphrase recognition and other downstream tasks. Acknowledgements We thank the editor and the anonymous reviewers as well as Juri Ganitkevitch, Dan Roth, Weiran Wang, and Kazuma Hashimoto for their valuable com- ments and technical assistance. We also thank Chris Callison-Burch, Dipanjan Das, Kuzman Ganchev, Ellie Pavlick, Slav Petrov, Owen Rambow, David Sontag, Oscar Täckström, Kapil Thadani, Lyle Un- gar, Benjamin Van Durme, and Mo Yu for helpful conversations. References Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Paşca, and Aitor Soroa. 2009. A study on similarity and relatedness using distributional and wordnet-based approaches. In Proceedings of Hu- man Language Technologies: The 2009 Annual Con- ference of the North American Chapter of the Associa- tion for Computational Linguistics, pages 19–27. As- sociation for Computational Linguistics. Ion Androutsopoulos and Prodromos Malakasiotis. 2010. A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, pages 135–187. Colin Bannard and Chris Callison-Burch. 2005. Para- phrasing with bilingual parallel corpora. In Proceed- ings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604. Associa- tion for Computational Linguistics. Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for depen- dency parsing. In Proceedings of the Annual Meeting of the Association for Computational Linguistics. Marco Baroni and Roberto Zamparelli. 2010. Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 Conference on Empiri- cal Methods in Natural Language Processing, pages 1183–1193. Association for Computational Linguis- tics. 
Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic lan- guage model. The Journal of Machine Learning Re- search, 3:1137–1155. Jonathan Berant and Percy Liang. 2014. Semantic pars- ing via paraphrasing. In Proceedings of ACL. Johannes Bjerva, Johan Bos, Rob van der Goot, and Malvina Nissim. 2014. The meaning factory: For- mal semantics for recognizing textual entailment and determining semantic similarity. SemEval 2014, page 642. William Blacoe and Mirella Lapata. 2012. A compari- son of vector-based representations for semantic com- position. In Proceedings of the 2012 Joint Confer- ence on Empirical Methods in Natural Language Pro- cessing and Computational Natural Language Learn- ing, EMNLP-CoNLL ’12, pages 546–556, Strouds- burg, PA, USA. Association for Computational Lin- guistics. Wauter Bosma and Chris Callison-Burch. 2007. Para- phrase substitution for recognizing textual entail- ment. In Proceedings of the 7th International Confer- ence on Cross-Language Evaluation Forum: Evalua- tion of Multilingual and Multi-modal Information Re- trieval, CLEF’06, pages 502–509, Berlin, Heidelberg. Springer-Verlag. Chih-Chung Chang and Chih-Jen Lin. 2011. Libsvm: a library for support vector machines. ACM Trans- actions on Intelligent Systems and Technology (TIST), 2(3):27. Scott C. Deerwester, Susan T Dumais, Thomas K. Lan- dauer, George W. Furnas, and Richard A. Harshman. 1990. Indexing by latent semantic analysis. JAsIs, 41(6):391–407. Bill Dolan, Chris Quirk, and Chris Brockett. 2004. Un- supervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Pro- ceedings of Coling 2004, pages 350–356, Geneva, Switzerland, Aug 23–Aug 27. COLING. John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-driven learning for open question answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1608–1618, Sofia, Bul- garia, August. Association for Computational Linguis- tics. 356 Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Pro- ceedings of the 2015 Conference of the North Ameri- can Chapter of the Association for Computational Lin- guistics: Human Language Technologies, pages 1606– 1615. Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The con- cept revisited. In Proceedings of the 10th international conference on World Wide Web, pages 406–414. ACM. J.R. Firth. 1957. A Synopsis of Linguistic Theory, 1930- 1955. Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. Ppdb: The paraphrase database. In HLT-NAACL, pages 758–764. The As- sociation for Computational Linguistics. Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2014. Jointly learning word representations and composition functions us- ing predicate-argument structures. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, October. Associa- tion for Computational Linguistics. Felix Hill, KyungHyun Cho, Sebastien Jean, Coline Devin, and Yoshua Bengio. 2014a. Not all neu- ral embeddings are born equal. 
arXiv preprint arXiv:1410.0718. Felix Hill, Roi Reichart, and Anna Korhonen. 2014b. Simlex-999: Evaluating semantic models with (gen- uine) similarity estimation. CoRR, abs/1408.3456. Yoon Kim. 2014. Convolutional neural networks for sen- tence classification. In Proceedings of the 2014 Con- ference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language pro- cessing toolkit. In Proceedings of 52nd Annual Meet- ing of the Association for Computational Linguistics: System Demonstrations, pages 55–60. Yuval Marton, Chris Callison-Burch, and Philip Resnik. 2009. Improved statistical machine translation using monolingually-derived paraphrases. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 381–390, Singapore, Au- gust. Association for Computational Linguistics. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representa- tions in vector space. arXiv preprint arXiv:1301.3781. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- rado, and Jeff Dean. 2013b. Distributed represen- tations of words and phrases and their composition- ality. In Advances in Neural Information Processing Systems, pages 3111–3119. Jeff Mitchell and Mirella Lapata. 2008. Vector-based models of semantic composition. In ACL, pages 236– 244. Citeseer. Jeff Mitchell and Mirella Lapata. 2010. Composition in distributional models of semantics. Cognitive Science, 34(8):1388–1439. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word rep- resentation. Proceedings of the Empiricial Methods in Natural Language Processing (EMNLP 2014), 12. Chris Quirk, Chris Brockett, and William Dolan. 2004. Monolingual machine translation for paraphrase gen- eration. In Dekang Lin and Dekai Wu, editors, Pro- ceedings of EMNLP 2004, pages 142–149, Barcelona, Spain, July. Association for Computational Linguis- tics. Pushpendre Rastogi and Benjamin Van Durme. 2014. Augmenting FrameNet via PPDB. In Proceedings of the Second Workshop on EVENTS: Definition, Detec- tion, Coreference, and Representation, pages 1–5, Bal- timore, Maryland, USA, June. Association for Compu- tational Linguistics. Fabio Rinaldi, James Dowdall, Kaarel Kaljurand, Michael Hess, and Diego Mollá. 2003. Exploit- ing paraphrases in a question answering system. In Proceedings of the Second International Workshop on Paraphrasing, pages 25–32, Sapporo, Japan, July. As- sociation for Computational Linguistics. Richard Socher, Christopher D Manning, and Andrew Y Ng. 2010. Learning continuous phrase representa- tions and syntactic parsing with recursive neural net- works. In Proceedings of the NIPS-2010 Deep Learn- ing and Unsupervised Feature Learning Workshop, pages 1–9. Richard Socher, Eric H Huang, Jeffrey Pennin, Christo- pher D Manning, and Andrew Y Ng. 2011. Dynamic pooling and unfolding recursive autoencoders for para- phrase detection. In Advances in Neural Information Processing Systems, pages 801–809. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. 
In Pro- ceedings of the 2013 Conference on Empirical Meth- ods in Natural Language Processing, pages 1631– 1642, Seattle, Washington, USA, October. Association for Computational Linguistics. Richard Socher, Andrej Karpathy, Quoc V. Le, Christo- pher D. Manning, and Andrew Y. Ng. 2014. 357 Grounded compositional semantics for finding and de- scribing images with sentences. TACL, 2:207–218. James H Steiger. 1980. Tests for comparing ele- ments of a correlation matrix. Psychological Bulletin, 87(2):245. Xuchen Yao, Benjamin Van Durme, Chris Callison- Burch, and Peter Clark. 2013. Semi-markov phrase- based monolingual alignment. In EMNLP, pages 590– 600. Mo Yu and Mark Dredze. 2014. Improving lexical em- beddings with semantic knowledge. In Proceedings of the 52nd Annual Meeting of the Association for Com- putational Linguistics (Volume 2: Short Papers), pages 545–550, Baltimore, Maryland, June. Association for Computational Linguistics. Mo Yu and Mark Dredze. 2015. Learning composi- tion models for phrase embeddings. Transactions of the Association for Computational Linguistics, 3:227– 242. Fabio Massimo Zanzotto, Ioannis Korkontzelos, Francesca Fallucchi, and Suresh Manandhar. 2010. Estimating linear models for compositional dis- tributional semantics. In Proceedings of the 23rd International Conference on Computational Linguis- tics, pages 1263–1271. Association for Computational Linguistics. 358