Overcoming Poor Word Embeddings with Word Definitions
Christopher Malon
2021-03-05

Modern natural language understanding models depend on pretrained subword embeddings, but applications may need to reason about words that were never or rarely seen during pretraining. We show that examples that depend critically on a rarer word are more challenging for natural language inference models. Then we explore how a model could learn to use definitions, provided in natural text, to overcome this handicap. Our model's understanding of a definition is usually weaker than a well-modeled word embedding, but it recovers most of the performance gap from using a completely untrained word.

The reliance of natural language understanding models on the information in pretrained word embeddings prevents these models from being applied reliably to rare words or technical vocabulary. To overcome this vulnerability, a model must be able to compensate for a poorly modeled word embedding with background knowledge to complete the required task. For example, a natural language inference (NLI) model based on pre-2020 word embeddings may not be able to deduce from "Jack has COVID" that "Jack is sick." By providing the definition "COVID is a respiratory disease," we want to assist this classification.

We describe a general procedure for enhancing a classification model, such as a natural language inference (NLI) or sentiment classification model, to perform the same task on sequences that include poorly modeled words, using definitions of those words. From the training set $T$ of the original model, we construct an augmented training set $T'$ for a model that may accept the same token sequence optionally concatenated with a word definition. In the case of NLI, where there are two token sequences, the definition is concatenated to the premise sequence. Because $T'$ has the same form as $T$, a model accepting the augmented information may be trained in the same way as the original model.

Because there are not enough truly untrained words like "COVID" in natural examples, we probe performance by scrambling real words so that their word embeddings become useless, and supplying definitions. Our method recovers most of the performance lost by scrambling. Moreover, the proposed technique removes biases that arise in more ad hoc solutions, such as adding definitions to examples without special training. We focus on NLI because it depends more deeply on word meaning than sentiment or topic classification tasks.

Chen et al. (2018) pioneered the addition of background information to an NLI model's classification on a per-example basis, augmenting a sequence of token embeddings with features encoding WordNet relations between pairs of words, to achieve a 0.6% improvement on the SNLI task (Bowman et al., 2015). Besides this explicit reasoning approach, implicit reasoning over background knowledge can be achieved by updating the base model itself with background information. Lauscher et al. (2020) follow this approach to add information from ConceptNet (Speer et al., 2018) and the Open Mind Common Sense corpus (Singh et al., 2002) through a fine-tuned adapter added to a pretrained language model, achieving better performance on subsets of NLI examples that are known to require world knowledge.
Talmor et al. (2020) explore the interplay between explicitly added knowledge and implicitly stored knowledge on artificially constructed NLI problems that require counting or relations from a taxonomy.

In the above works, explicit background information comes from a taxonomy or knowledge base. Only a few studies have worked with definition text directly, and not in the context of NLI. Tissier et al. (2017) used definitions to create embeddings that achieve better performance on word similarity tasks than word2vec (Mikolov et al., 2013) and fastText (Bojanowski et al., 2017), while maintaining performance on text classification. Recently, Kaneko and Bollegala (2021) used definitions to remove biases from pretrained word embeddings while maintaining coreference resolution accuracy. In contrast, our work reasons with natural language definitions without forming a new embedding.

The enhanced training set $T'$ will be built by providing definitions for words in existing examples, while obfuscating the existing embeddings of those words. If a random word of the original text is obfuscated, the classification still may be determined or strongly biased by the remaining words. To ensure the definitions matter, we select the words carefully.

To explain which words of a text are important for classification, Kim et al. (2020) introduced the idea of input marginalization. Given a sequence of tokens $x$, such that $x_{-i}$ represents the sequence without the $i$th token $x_i$, they marginalize the probability of predicting a class $y_c$ over possible replacement words $\tilde{x}_i$ in the vocabulary $V$ as

$$p(y_c \mid x_{-i}) = \sum_{\tilde{x}_i \in V} p(y_c \mid \tilde{x}_i, x_{-i}) \, p(\tilde{x}_i \mid x_{-i}),$$

and then compare $p(y_c \mid x_{-i})$ to $p(y_c \mid x)$ to quantify the importance of $x_i$. The probabilities $p(\tilde{x}_i \mid x_{-i})$ are computed by a language model. We simplify by looking only at the classification and not the probability. Like Kim et al. (2020), we truncate the computation of $p(y_c \mid \tilde{x}_i, x_{-i})$ to words such that $p(\tilde{x}_i \mid x_{-i})$ exceeds a threshold, here 0.05. Ultimately we mark a word $x_i$ as a critical word if there exists a replacement $\tilde{x}_i$ such that

$$\operatorname*{argmax}_y \, p(y \mid \tilde{x}_i, x_{-i}) \neq \operatorname*{argmax}_y \, p(y \mid x).$$

Additionally we require that the word not appear more than once in the example, because the meaning of repeated words usually impacts the classification less than the fact that they all match. Table 1 shows an example.

Premise: A young man sits, looking out of a *train* [side → Neutral, small → Neutral] window.
Hypothesis: The man is in his room.
Label: Contradiction

Table 1: An SNLI example, with critical words shown in italics and replacements shown in brackets.

A technicality remains because our classification models use subwords as tokens, whereas we consider replacements of whole words returned by pattern.en. We remove all subwords of $x_i$ when forming $x_{-i}$, but we consider only replacements $\tilde{x}_i$ that are a single subword long.
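As a concrete illustration, the following is a minimal sketch of this critical-word test, assuming the HuggingFace transformers library. A pretrained masked language model supplies the replacement probabilities $p(\tilde{x}_i \mid x_{-i})$; the fine-tuned NLI classifier (`nli_model`, `nli_tokenizer`) is hypothetical, and the whole-word masking by string replacement is a simplification of the pipeline described above, not the paper's exact implementation.

```python
# Minimal sketch of the critical-word test, assuming HuggingFace transformers.
# `nli_model` / `nli_tokenizer` stand for a hypothetical fine-tuned NLI classifier;
# the paper's pattern.en lemmatization and subword bookkeeping are omitted here.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

mlm_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def nli_label(premise, hypothesis, nli_model, nli_tokenizer):
    """Predicted class (argmax) of the NLI classifier for one premise/hypothesis pair."""
    enc = nli_tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        return nli_model(**enc).logits.argmax(dim=-1).item()

def label_changing_replacements(premise, hypothesis, word,
                                nli_model, nli_tokenizer, threshold=0.05):
    """Return replacements x~_i with p(x~_i | x_-i) > threshold that flip the
    predicted class; a non-empty result marks `word` as a critical word."""
    masked = premise.replace(word, mlm_tokenizer.mask_token, 1)
    enc = mlm_tokenizer(masked, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == mlm_tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        probs = mlm(**enc).logits[0, mask_pos].softmax(dim=-1)

    original = nli_label(premise, hypothesis, nli_model, nli_tokenizer)
    flips = []
    for token_id in (probs > threshold).nonzero().flatten().tolist():
        candidate = mlm_tokenizer.convert_ids_to_tokens(token_id)
        if candidate.startswith("##") or not candidate.isalpha() or candidate == word:
            continue  # keep only single, whole-word replacement candidates
        new_premise = premise.replace(word, candidate, 1)
        if nli_label(new_premise, hypothesis, nli_model, nli_tokenizer) != original:
            flips.append((candidate, probs[token_id].item()))
    return flips
```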
We use Wiktionary as a source of definitions. The code of Tissier et al. (2017) downloaded definitions from four commercial online dictionaries, but these are no longer freely available online as of January 2021. When possible, we look for a definition in the Simple English Wiktionary, because these definitions refer to more common usages of words and are written in simpler language. If one is not found, we consult the regular English Wiktionary.

To define a word, first we find its part of speech in the original context and lemmatize the word using the pattern.en library (Smedt and Daelemans, 2012). Then we look for a section labeled "English" in the retrieved Wiktionary article, and for a subsection for the part of speech we identified. We extract the first numbered definition in this subsection. There is no guarantee that this sense of the word matches the sense used in the text, but since the word embedding for any other word would be determined only by its spelling, we expect good performance even if a different sense of the word is chosen. In practice, we find that this method usually gives us short, simple definitions that match the usage in the original text. When defining a word, we always write its definition as "word means: definition." This common format ensures that the definitions and the word being defined can be recognized easily by the classifier.

Consider an example $(x, y_c) \in T$. If the example has a critical word $x_i \in x$ that appears only once in the example, and $\tilde{x}_i$ is the most likely replacement word that changes the classification, we let $x'$ denote the sequence where $x_i$ is replaced by $\tilde{x}_i$, and let $y'_c = \operatorname{argmax}_y p(y \mid x')$. If definitions $h_i$ and $h'_i$ for $x_i$ and $\tilde{x}_i$ are found by the method described above, we add $(x, h_i, y_c)$ and $(x', h'_i, y'_c)$ to the enhanced training set $T'$.

Scrambling a word prevents the model from relying on a useful word embedding. In this protocol, we generate random strings of letters, of random length between four and twelve letters, to substitute for $x_i$ and $\tilde{x}_i$, while still using the definitions of the original words. If the original words appear in their own definitions, those occurrences are also replaced by the same strings. Unfortunately, the random strings lose any morphological features of the original words. Table 2 shows an NLI example and the corresponding examples generated for the enhanced training set (premise / hypothesis / label).

Original: A blond man is drinking from a public fountain. / The man is drinking water. / Entailment
Scrambled word: a blond man is drinking from a public yfcqudqqg. yfcqudqqg means: a natural source of water; a spring. / the man is drinking water. / Entailment
Scrambled alternate: a blond man is drinking from a public lxuehdeig. lxuehdeig means: lxuehdeig is a transparent solid and is usually clear. windows and eyeglasses are made from it, as well as drinking glasses. / the man is drinking water. / Neutral

Table 2: An SNLI example and the scrambled examples generated from it for the enhanced training set.
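The construction of one scrambled, definition-augmented example can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the Wiktionary lookup is abstracted into a `definition` argument supplied by the caller, and only the random-string scrambling and the "word means: definition" formatting reflect the protocol above.

```python
# Minimal sketch of building one scrambled, definition-augmented example.
# The definition text is assumed to have been retrieved already (e.g. from
# Wiktionary); only the scrambling and formatting steps are shown.
import random
import string

def scramble_token() -> str:
    """Random string of 4-12 lowercase letters, replacing the critical word."""
    return "".join(random.choices(string.ascii_lowercase, k=random.randint(4, 12)))

def scrambled_example(premise: str, hypothesis: str, label: str,
                      word: str, definition: str):
    """Replace `word` by a random string in the premise and in its own
    definition, then append the definition as "token means: definition"."""
    token = scramble_token()
    scrambled_premise = premise.replace(word, token)
    scrambled_definition = definition.replace(word, token)
    augmented_premise = f"{scrambled_premise} {token} means: {scrambled_definition}"
    return augmented_premise, hypothesis, label

# Example mirroring the "Scrambled word" row of Table 2:
example = scrambled_example(
    "a blond man is drinking from a public fountain.",
    "the man is drinking water.",
    "Entailment",
    "fountain",
    "a natural source of water; a spring.")
```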
We consider the SNLI task (Bowman et al., 2015). We fine-tune an XLNet model (Yang et al., 2019), because it achieves near state-of-the-art performance on SNLI and outperforms RoBERTa (Liu et al., 2019) and BERT (Devlin et al., 2019) on later rounds of adversarial annotation for ANLI (Nie et al., 2020). Due to computing constraints we use the base, cased model. Training is run for three epochs distributed across 4 GPUs, with a batch size of 10 on each, a learning rate of $5 \times 10^{-5}$, 120 warmup steps, a single gradient accumulation step, and a maximum sequence length of 384. For the language model probabilities $p(\tilde{x}_i \mid x_{-i})$, pretrained BERT (base, uncased) is used rather than XLNet, because the XLNet probabilities have been observed to be very noisy on short sequences.

One test set, $SNLI^{full}_{crit}$, is constructed in the same way as the augmented training set, but our main test set, $SNLI^{true}_{crit}$, is additionally constrained to use only examples of the form $(x, h_i, y_c)$ where $y_c$ is the original label, because the labels of the examples $(x', h'_i, y'_c)$ might be incorrect. Not every SNLI example has a critical word, and we do not always find a definition with the right part of speech in Wiktionary. Our training and test sets have 272,492 and 2,457 examples, respectively (vs. 549,367 and 9,824 in SNLI). All of our derived datasets are available for download.

Our task cannot be solved well without reading definitions. When words are scrambled but no definitions are provided, an SNLI model without special training achieves 54.1% on $SNLI^{true}_{crit}$. If trained on $T'$ with scrambled words but no definitions, a model achieves 36.9%, which is even lower, reflecting that the training set is constructed to prevent a model from exploiting the contextual bias.

With definitions and scrambled words, performance is slightly below that of using the original words. Our method, using definitions applied to the scrambled words, yields 81.2%, compared to 84.6% if words are left unscrambled but no definitions are provided. Most of the accuracy lost by obfuscating the words is recovered, but evidently there is slightly more information accessible in the original word embeddings.

If alternatives to the critical words are not included, the classifier learns biases that do not depend on the definition. We explore restricting the training set to verified examples $T'_{true} \subset T'$ in the same way as $SNLI^{true}_{crit}$, still scrambling the critical or replaced words in the training and testing sets. Using this subset, a model that is not given the definitions can be trained to achieve 69.9% on $SNLI^{true}_{crit}$, showing a heavy contextual bias. A model trained on this subset that uses the definitions achieves marginally higher performance (82.3%) than the one trained on all of $T'$. On the other hand, testing on $SNLI^{full}_{crit}$ yields only 72.3%, compared to 80.3% using the full $T'$, showing that the classifier is less sensitive to the definition.

Noisy labels from replacements do not hurt accuracy much. The only difference between the "original" training protocol and "no scrambling, no defs" is that the original trains on $T$ and does not include examples with replaced words and unverified labels. Training including the replacements reduces accuracy by 0.5% on $SNLI^{true}_{crit}$, which includes only verified labels. For comparison, training and testing on all of SNLI with the original protocol achieves 90.4%, so a much larger effect on accuracy must be due to selecting harder examples for $SNLI^{true}_{crit}$.

Definitions are not well utilized without special training. The original SNLI model, if provided definitions of scrambled words at test time as part of the premise, achieves only 63.8%, compared to 81.2% for our specially trained model. If the defined words are not scrambled, the classifier uses the original embedding and ignores the definitions. When trained with definitions but no scrambling, the model achieves 85.2% accuracy, but it is unable to use the definitions when words are scrambled: it achieves 51.4% on that test set. We have not discovered a way to combine the benefit of the definitions with the knowledge in the original word embedding. To force the model to use both techniques, we prepare a version of the training set that is half scrambled and half unscrambled. This model achieves 83.5% on the unscrambled test set, below the result when no definitions are provided.

Definitions are not simply being memorized. We selected the subset $SNLI^{new}_{crit}$ of $SNLI^{true}_{crit}$ consisting of the 44 examples in which the defined word was not defined in any training example. The model trained with definitions and scrambled words achieves 68.2% on this set, well above 45.5% for the original SNLI model reading the scrambled words and definitions without special training. Remembering a definition from training is thus an advantage (reflected in the higher 81.2% accuracy on $SNLI^{true}_{crit}$), but not the whole capability.
Definition reasoning is harder than simple substitutions. When definitions are given as one-word substitutions, in the form "scrambled means: original" instead of "scrambled means: definition," the model achieves 84.7% on $SNLI^{true}_{crit}$, compared to 81.2% using the definition text. Of course this is not a possibility for rare words that are not synonyms of a well-trained word, but it suggests that the kind of multi-hop reasoning in which words just have to be matched in sequence is easier than understanding a text definition.

By construction of the SentencePiece dictionary (Kudo and Richardson, 2018), only the most frequent words in the training data of the XLNet language model are represented as single tokens. Other words are tokenized by multiple subwords. Sometimes the subwords reflect a morphological change to a well-modeled word, such as a change in tense or plurality. The language model probably

In Table 4 we apply various models constructed in the previous subsection to this hard test set. Ideally, a model leveraging definitions could compensate for these weaker word embeddings, but the method here does not do so.

This work shows how a model's training may be enhanced to support reasoning with definitions in natural text, to handle cases where word embeddings are not useful. Our method forces the definitions to be considered and avoids the application of biases independent of the definition. Using the approach, entailment examples like "Jack has COVID / Jack is sick," which are misclassified by an XLNet trained on normal SNLI, are correctly recognized as entailment when the definition "COVID is a respiratory disease" is added. Methods that can leverage definitions without losing the advantage of partially useful word embeddings are still needed. In an application, it will also be necessary to select the words that would benefit from definitions, and to make a model that can accept multiple definitions.

References

Bojanowski et al. (2017). Enriching word vectors with subword information.
Bowman et al. (2015). A large annotated corpus for learning natural language inference.
Chen et al. (2018). Neural natural language inference models enhanced with external knowledge.
Devlin et al. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding.
Kaneko and Bollegala (2021). Dictionary-based debiasing of pre-trained word embeddings.
Kim et al. (2020). Interpretation of NLP models through input marginalization.
Kudo and Richardson (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.
Lauscher et al. (2020). Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers.
Liu et al. (2019). RoBERTa: A robustly optimized BERT pretraining approach.
Mikolov et al. (2013). Efficient estimation of word representations in vector space.
Nie et al. (2020). Adversarial NLI: A new benchmark for natural language understanding.
Singh et al. (2002). Open Mind Common Sense: Knowledge acquisition from the general public.
Smedt and Daelemans (2012). Pattern for Python.
Speer et al. (2018). ConceptNet 5.5: An open multilingual graph of general knowledge.
Talmor et al. (2020). Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge.
Tissier et al. (2017). Dict2vec: Learning word embeddings using lexical dictionaries.
Yang et al. (2019). XLNet: Generalized autoregressive pretraining for language understanding.