Neural Lattice Language Models

Jacob Buckman
Language Technologies Institute
Carnegie Mellon University
jacobbuckman@gmail.com

Graham Neubig
Language Technologies Institute
Carnegie Mellon University
gneubig@cs.cmu.edu

Abstract

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions – including polysemy and the existence of multi-word lexical items – into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

1 Introduction

Neural network models have recently contributed towards a great amount of progress in natural language processing. These models typically share a common backbone: recurrent neural networks (RNNs), which have proven themselves capable of tackling a variety of core natural language processing tasks (Hochreiter and Schmidhuber, 1997; Elman, 1990). One such task is language modeling, in which we estimate a probability distribution over sequences of tokens that corresponds to observed sentences (§2). Neural language models, particularly models conditioned on a particular input, have many applications, including machine translation (Bahdanau et al., 2016), abstractive summarization (Chopra et al., 2016), and speech processing (Graves et al., 2013).

[Figure 1: Lattice decomposition of a sentence and its corresponding lattice language model probability calculation]

Similarly, state-of-the-art language models are almost universally based on RNNs, particularly long short-term memory (LSTM) networks (Jozefowicz et al., 2016; Inan et al., 2017; Merity et al., 2016). While powerful, LSTM language models usually do not explicitly model many commonly accepted linguistic phenomena. As a result, standard models lack linguistically informed inductive biases, potentially limiting their accuracy, particularly in low-data scenarios (Adams et al., 2017; Koehn and Knowles, 2017). In this work, we present a novel modification to the standard LSTM language modeling framework that allows us to incorporate some varieties of these linguistic intuitions seamlessly: neural lattice language models (§3.1). Neural lattice language models define a lattice over possible paths through a sentence, and maximize the marginal probability over all paths that lead to generating the reference sentence, as shown in Fig. 1. Depending on how we define these paths, we can incorporate different assumptions about how language should be modeled.
In the particular instantiations of neural lattice language models covered by this paper, we focus on two properties of language that could potentially be of use in language modeling: the existence of multi-word lexical units (Zgusta, 1967) (§4.1) and polysemy (Ravin and Leacock, 2000) (§4.2). Neural lattice language models allow the model to incorporate these aspects in an end-to-end fashion by simply adjusting the structure of the underlying lattices.

We run experiments to explore whether these modifications improve the performance of the model (§5). Additionally, we provide qualitative visualizations of the model to attempt to understand what types of multi-token phrases and polysemous embeddings have been learned.

2 Background

2.1 Language Models

Consider a sequence X for which we want to calculate its probability. Assume we have a vocabulary from which we can select a unique list of |X| tokens x_1, x_2, \ldots, x_{|X|} such that X = [x_1; x_2; \ldots; x_{|X|}], i.e. the concatenation of the tokens (with an appropriate delimiter). These tokens can be either on the character level (Hwang and Sung, 2017; Ling et al., 2015) or word level (Inan et al., 2017; Merity et al., 2016). Using the chain rule, language models generally factorize p(X) in the following way:

p(X) = p(x_1, x_2, \ldots, x_{|X|}) = \prod_{t=1}^{|X|} p(x_t \mid x_1, x_2, \ldots, x_{t-1}).    (1)

Note that this factorization is exact only in the case where the segmentation is unique. In character-level models, it is easy to see that this property is maintained, because each token is unique and non-overlapping. In word-level models, this also holds, because tokens are delimited by spaces, and no word contains a space.

2.2 Recurrent Neural Networks

Recurrent neural networks have emerged as the state-of-the-art approach to approximating p(X). In particular, the LSTM cell (Hochreiter and Schmidhuber, 1997) is a specific RNN architecture which has been shown to be effective on many tasks, including language modeling (Press and Wolf, 2017; Jozefowicz et al., 2016; Merity et al., 2016; Inan et al., 2017).[1] LSTM language models recursively calculate the hidden and cell states (h_t and c_t, respectively) given the input embedding e_{t-1} corresponding to token x_{t-1}:

h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, e_{t-1}, \theta),    (2)

then calculate the probability of the next token given the hidden state, generally by performing an affine transform parameterized by W and b, followed by a softmax:

p(x_t \mid h_t) := \mathrm{softmax}(W \ast h_t + b).    (3)

[1] In this work, we utilize an LSTM with linked input and forget gates, as proposed by Greff et al. (2016).
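To make Eqs. (1)–(3) concrete, the following is a minimal sketch of a token-level LSTM language model that scores a sentence by accumulating the per-token conditionals of Eq. (1). The class name, the hyperparameter values, the assumed beginning-of-sentence symbol, and the use of PyTorch's standard LSTMCell (rather than the linked input and forget gate variant mentioned in footnote 1) are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMLanguageModel(nn.Module):
    """Token-level LSTM language model in the sense of Eqs. (1)-(3)."""

    def __init__(self, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # x_{t-1} -> e_{t-1}
        self.cell = nn.LSTMCell(embed_dim, hidden_dim)    # recurrence of Eq. (2)
        self.proj = nn.Linear(hidden_dim, vocab_size)     # W and b of Eq. (3)
        self.hidden_dim = hidden_dim

    def sentence_log_prob(self, token_ids):
        """Return sum_t log p(x_t | x_1, ..., x_{t-1}), i.e. the log of Eq. (1).

        token_ids: 1-D LongTensor of token indices, assumed to begin with a
        beginning-of-sentence symbol so that the first real token is predicted.
        """
        h = torch.zeros(1, self.hidden_dim)
        c = torch.zeros(1, self.hidden_dim)
        log_prob = torch.zeros(())
        for t in range(1, len(token_ids)):
            e_prev = self.embed(token_ids[t - 1].view(1))  # embed x_{t-1}
            h, c = self.cell(e_prev, (h, c))               # Eq. (2)
            log_p = F.log_softmax(self.proj(h), dim=-1)    # Eq. (3), in log space
            log_prob = log_prob + log_p[0, token_ids[t]]
        return log_prob
```

For example, calling sentence_log_prob on the index sequence for "<s> dogs chased the small cat" returns the summed log-conditionals of Eq. (1); exponentiating the result recovers the model's p(X) for that sentence.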
3 Neural Lattice Language Models

3.1 Language Models with Ambiguous Segmentations

To reiterate, the standard formulation of language modeling in the previous section requires splitting sentence X into a unique set of tokens x_1, \ldots, x_{|X|}. Our proposed method generalizes the previous formulation to remove the requirement of uniqueness of segmentation, similar to that used in non-neural n-gram language models such as Dupont and Rosenfeld (1997) and Goldwater et al. (2007).

First, we define some terminology. We use the term “token”, designated by x_i, to describe any indivisible item in our vocabulary that has no other vocabulary item as its constituent part. We use the term “chunk”, designated by k_i or x_i^j, to describe a sequence of one or more tokens that represents a portion of the full string X, containing the unit tokens x_i through x_j: x_i^j = [x_i; x_{i+1}; \ldots; x_j]. We also refer to the “token vocabulary”, which is the subset of the vocabulary containing only tokens, and to the “chunk vocabulary”, which similarly contains all chunks.

Note that we can factorize the probability of any sequence of chunks K using the chain rule, in precisely the same way as sequences of tokens:

p(K) = p(k_1, k_2, \ldots, k_{|K|}) = \prod_{t=1}^{|K|} p(k_t \mid k_1, k_2, \ldots, k_{t-1}).    (4)

We can factorize the overall probability of a token list X in terms of its chunks by using the chain rule, and marginalizing over all segmentations. For any particular token list X, we define a set of valid segmentations S(X), such that for every sequence s \in S(X), X = [x_{s_0}^{s_1 - 1}; x_{s_1}^{s_2 - 1}; \ldots; x_{s_{|s|-1}}^{s_{|s|}}]. The factorization is:

p(X) = \sum_{s} p(X, s) = \sum_{s} p(X \mid s)\, p(s) = \sum_{s \in S(X)} p(s) = \sum_{s \in S(X)} \prod_{t=1}^{|s|} p\big(x_{s_{t-1}}^{s_t - 1} \mid x_{s_0}^{s_1 - 1}, x_{s_1}^{s_2 - 1}, \ldots, x_{s_{t-2}}^{s_{t-1} - 1}\big).    (5)

Note that, by definition, there exists a unique segmentation of X such that x_1, x_2, \ldots are all tokens, in which case |s| = |X|. When only that one unique segmentation is allowed per X, S(X) contains only that one element, so the summation drops out, and therefore for standard character-level and word-level models, Eq. (5) reduces to Eq. (1), as desired. However, for models that license multiple segmentations per X, computing this marginalization directly is generally intractable. For example, consider segmenting a sentence using a vocabulary containing all words and all 2-word expressions. The size of S(X) would grow exponentially with the number of words in X, meaning we would have to marginalize over trillions of unique segmentations for even modestly sized sentences.

3.2 Lattice Language Models

To avoid this, it is possible to re-organize the computations in a lattice, which allows us to dramatically reduce the number of computations required (Dupont and Rosenfeld, 1997; Neubig et al., 2010). All segmentations of X can be expressed as the edges of paths through a lattice over token-level prefixes of X: x_{<1}, x_{<2}, \ldots, X. The infimum is the empty prefix x_{<1}; the supremum is X; an edge from prefix x
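To make the marginalization in Eq. (5), and the savings from re-organizing it over token-level prefixes, more concrete, the sketch below sums over every segmentation of a token list with a forward pass whose states are the prefixes x_{<1}, \ldots, X. The function name, the chunk scorer chunk_log_prob, the maximum chunk length, and the simplification that a chunk's score does not depend on the preceding chunks are all illustrative assumptions; in Eq. (5) each chunk is conditioned on the chunks that precede it.

```python
import math

def lattice_log_marginal(tokens, chunk_log_prob, max_chunk_len=3):
    """Log of p(X) as in Eq. (5), under a context-independent chunk scorer.

    alpha[j] holds the log of the total probability of all segmentations of
    the prefix tokens[:j]; an edge (i -> j) in the lattice corresponds to
    emitting the chunk tokens[i:j].
    """
    n = len(tokens)
    alpha = [float("-inf")] * (n + 1)
    alpha[0] = 0.0  # the empty prefix x_{<1} has probability 1
    for j in range(1, n + 1):
        incoming = []
        for i in range(max(0, j - max_chunk_len), j):
            score = alpha[i] + chunk_log_prob(tuple(tokens[i:j]))
            if score > float("-inf"):
                incoming.append(score)
        if incoming:  # log-sum-exp over all edges entering prefix j
            m = max(incoming)
            alpha[j] = m + math.log(sum(math.exp(v - m) for v in incoming))
    return alpha[n]

# Toy usage with a hypothetical chunk vocabulary of unigrams and bigrams,
# each assigned the same probability purely for illustration.
def toy_chunk_log_prob(chunk):
    vocab = {("dogs",), ("chased",), ("the",), ("small",), ("cat",),
             ("dogs", "chased"), ("the", "small"), ("small", "cat")}
    return math.log(0.1) if chunk in vocab else float("-inf")

print(lattice_log_marginal("dogs chased the small cat".split(), toy_chunk_log_prob))
```

Because each prefix is visited once and considers at most max_chunk_len incoming edges, this forward pass evaluates O(|X| * max_chunk_len) chunks, in contrast to the exponential number of explicit segmentations discussed above.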