Minimally Supervised Number Normalization

Kyle Gorman and Richard Sproat
Google, Inc.
111 8th Ave., New York, NY, USA

Abstract

We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.

1 Introduction

Many speech and language applications require text tokens to be converted from one form to another. For example, in text-to-speech synthesis, one must convert digit sequences (32) into number names (thirty-two), and appropriately verbalize date and time expressions (12:47 → twelve forty-seven) and abbreviations (kg → kilograms) while handling allomorphy and morphological concord (e.g., Sproat, 1996). Quite a bit of recent work on SMS (e.g., Beaufort et al., 2010) and text from social media sites (e.g., Yang and Eisenstein, 2013) has focused on detecting and expanding novel abbreviations (e.g., cn u plz hlp). Collectively, such conversions all fall under the rubric of text normalization (Sproat et al., 2001), but this term means radically different things in different applications. For instance, it is not necessary to detect and verbalize dates and times when preparing social media text for downstream information extraction, but this is essential for speech applications.

While expanding novel abbreviations is also important for speech (Roark and Sproat, 2014), numbers, times, dates, measure phrases and the like are far more common in a wide variety of text genres. Following Taylor (2009), we refer to categories such as cardinal numbers, times, and dates—each of which is semantically well-circumscribed—as semiotic classes. Some previous work on text normalization proposes minimally-supervised machine learning techniques for normalizing specific semiotic classes, such as abbreviations (e.g., Chang et al., 2002; Pennell and Liu, 2011; Roark and Sproat, 2014). This paper continues this tradition by contributing minimally-supervised models for normalization of cardinal number expressions (e.g., ninety-seven). Previous work on this semiotic class includes formal linguistic studies by Corstius (1968) and Hurford (1975) and computational models proposed by Sproat (1996; 2010) and Kanis et al. (2005). Of all semiotic classes, numbers are by far the most important for speech, as cardinal (and ordinal) numbers are not only semiotic classes in their own right, but knowing how to verbalize numbers is important for most of the other classes: one cannot verbalize times, dates, measures, or currency expressions without knowing how to verbalize that language's numbers as well.

One computational approach to number name verbalization (Sproat, 1996; Kanis et al., 2005) employs a cascade of two finite-state transducers (FSTs). The first FST factors the integer, expressed as a digit sequence, into sums of products of powers of ten (i.e., in the case of a base-ten number system). This is composed with a second FST that defines how the numeric factors are verbalized, and may also handle allomorphy or morphological concord in languages that require it.
Number names can be relatively easy (as in English) or complex (as in Russian; Sproat, 2010) and thus these FSTs may be relatively easy or quite difficult to develop. While the Google text-to-speech (TTS) (see Ebden and Sproat, 2014) and automatic speech recognition (ASR) systems depend on hand-built number name grammars for about 70 languages, developing these grammars for new languages requires extensive research and labor. For some languages, a professional linguist can develop a new grammar in as little as a day, but other languages may require days or weeks of effort. We have also found that it is very common for these hand-written grammars to contain difficult-to-detect errors; indeed, the computational models used in this study revealed several long-standing bugs in hand-written number grammars.

The amount of time, effort, and expertise required to produce error-free number grammars leads us to consider machine learning solutions. Yet it is important to note that number verbalization poses a dauntingly high standard of accuracy compared to nearly all other speech and language tasks. While one might forgive a TTS system that reads the ambiguous abbreviation plz as plaza rather than the intended please, it would be inexcusable for the same system to ever read 72 as four hundred seventy two, even if it rendered the vast majority of numbers correctly.

To set the stage for this work, we first (§2–3) briefly describe several experiments with a powerful and popular machine learning technique, namely recurrent neural networks (RNNs). When provided with a large corpus of parallel data, these systems are highly accurate, but may still produce occasional errors, rendering them unusable for applications like TTS. In order to give the reader some background on the relevant linguistic issues, we then review some cross-linguistic properties of cardinal number expressions and propose a finite-state approach to number normalization informed by these linguistic properties (§4). The core of the approach is an algorithm for inducing language-specific number grammar rules. We evaluate this technique on data from four languages.

Figure 1: The neural net architecture for the preliminary Russian cardinal number experiments. Purple LSTM layers perform forwards transitions and green LSTM layers perform backwards transitions. The output is produced by a CTC layer with a softmax activation function. Input tokens are characters and output tokens are words.

2 Preliminary experiment with recurrent neural networks

As part of a separate strand of research, we have been experimenting with various recurrent neural network (RNN) architectures for problems in text normalization. In one set of experiments, we trained RNNs to learn a mapping from digit sequences marked with morphosyntactic (case and gender) information, and their expression as Russian cardinal number names. The motivation for choosing Russian is that the number name system of this language, like that of many Slavic languages, is quite complicated, and therefore serves as a good test of the abilities of any text normalization system.

The architecture used was similar to a network employed by Rao et al. (2015) for grapheme-to-phoneme conversion, a superficially similar sequence-to-sequence mapping problem. We used a recurrent network with an input layer, four hidden feed-forward LSTM layers (Hochreiter and Schmidhuber, 1997), and a connectionist temporal classification (CTC) output layer with a softmax activation function (Graves et al., 2006).¹ Two of the hidden layers modeled forward sequences and the other two backward sequences. There were 32 input nodes—corresponding to characters—and 153 output nodes—corresponding to predicted number name words. Each of the hidden layers had 256 nodes. The full architecture is depicted in Figure 1.

¹ Experiments with a non-CTC softmax output layer yielded consistently poor results, and we do not report them here.
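As a rough illustration only, the following is a minimal sketch in tf.keras of a network of the shape described above and in Figure 1. It is our reconstruction rather than the authors' implementation: the layer sizes (32 input symbols, 153 output labels, four recurrent layers of 256 units, two per direction) follow the text, while the embedding size, the use of tf.keras itself, and the training setup (a CTC loss such as tf.nn.ctc_loss applied to the per-frame posteriors) are assumptions.

import tensorflow as tf

NUM_INPUT_SYMBOLS = 32    # character vocabulary, as described above
NUM_OUTPUT_LABELS = 153   # predicted number-name words

inputs = tf.keras.Input(shape=(None,), dtype="int32")           # character ids
x = tf.keras.layers.Embedding(NUM_INPUT_SYMBOLS, 32)(inputs)    # embedding size assumed
# Four recurrent layers of 256 units: two forward and two backward.
for _ in range(2):
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True))(x)
# Per-frame label posteriors; training would apply a CTC loss to these.
outputs = tf.keras.layers.Dense(NUM_OUTPUT_LABELS, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)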
The system was trained on 22M unique digit sequences ranging from one to one million; these were collected by applying an existing TTS text normalization system to several terabytes of web text. Each training example consisted of a digit sequence, gender and case features, and the Russian cardinal number verbalization of that number. Thus, for example, the system has to learn to produce the feminine instrumental form of 60. Examples of these mappings are shown in Table 1, and the various inflected forms of a single cardinal number are given in Table 2. In preliminary experiments, it was discovered that short digit sequences were poorly modeled due to undersampling, so an additional 240,000 short sequence samples (of three or fewer digits) were added to compensate. 2.2M examples (10%) were held out as a development set.

The system was trained for one day, after which it had a 0% label error rate (LER) on the development data set. When decoding 240,000 tokens of held-out test data with this model, we achieved very high accuracy (LER < .0001). The few remaining errors, however, are a serious obstacle to using this system for TTS. The model appears to make no mistakes applying inflectional suffixes to unseen data. Plausibly, this task was made easier by our positioning of the morphological feature string at the end of the input, making it local to the output inflectional suffix (at least for the last word in the number expression). But it does make errors with respect to the numeric value of the expression. For example, for 9801 plu.ins. (девятью тысячами восьмьюстами одними), the system produced девятью тысячами семьюстами одними (9701 plu.ins.): the morphology is correct, but the numeric value is wrong.²

² The exact number of errors and their particular details varied from run to run.

This pattern of errors was exactly the opposite of what we want for speech applications. One might forgive a TTS system that reads 9801 with the correct numeric value but in the wrong case form: a listener would likely notice the error but would usually not be misled about the message being conveyed. In contrast, reading it as nine thousand seven hundred and one is completely unacceptable, as this would actively mislead the listener.

It is worth pointing out that the training set used here—22M examples—was quite large, and we were only able to obtain such a large amount of labeled data because we already had a high-quality hand-built grammar designed to do exactly this transduction. It is simply unreasonable to expect that one could obtain this amount of parallel data for a new language (e.g., from naturally-occurring examples, or from speech transcriptions). This problem is especially acute for low-resource languages (i.e., most of the world's languages), where data is by definition scarce, but where it is also hard to find high-quality linguistic resources or expertise, and where a machine learning approach is thus most needed.
In conclusion, the system does not perform as well as we demand, nor is it in any case a practical solution due to the large amount of training data needed. The RNN appears to have done an impressive job of learning the complex inflectional morphology of Russian, but it occasionally chooses the wrong number names altogether.

3 Number normalization with RNNs

For the purpose of more directly comparing the performance of RNNs with the methods we report on below, we chose to ignore the issue of allomorphy and morphological concord, which appears to be “easy” for generic sequence models like RNNs, and focus instead on verbalizing number expressions in whatever morphological category represents the language's citation form.

3.1 Data and general approach

For our experiments we used three parallel data sets where the target number name was in citation form (in Russian, nominative case):

• A large set consisting of 28,000 examples extracted from several terabytes of web text using an existing TTS text normalization system

• A medium set of 9,000 randomly-generated examples (for details, see Appendix A)

• A minimal set of 300 examples, consisting of the counting numbers up to 200, and 100 carefully-chosen examples engineered to cover a wide variety of phenomena

A separate set of 1,000 randomly-generated examples were held out for evaluation.

5 neu.gen.   → пяти                   five
24 mas.acc.  → двадцать четыре        twenty-four
99 plu.ins.  → девяноста девятью      ninety-nine
11 fem.nom.  → одиннадцать            eleven
81 fem.gen.  → восьмидесяти одной     eighty-one
60 fem.ins.  → шестьюдесятью          sixty
91 neu.ins.  → девяноста одним        ninety-one
3 mas.gen.   → трёх                   three

Table 1: Example input and output data (and glosses) for the Russian RNN experiments.

шестьдесят      nominative (nom.)
шестидесяти     genitive (gen.)
шестидесяти     dative (dat.)
шестьдесят      accusative (acc.)
шестьюдесятью   instrumental (ins.)
шестидесяти     prepositional (pre.)

Table 2: Inflectional forms of the cardinal number “60” in Russian.

The minimal set was intended to be representative of the sort of data one might obtain from a native speaker when asked to provide all the essential information about number names in their language.³

In these experiments we used two different RNN models. The first was the same LSTM architecture as above (henceforth referred to as “LSTM”), except that the numbers of input and output nodes were 13 and 53, respectively, due to the smaller input and output vocabularies.

The second was a TensorFlow-based RNN with an attention mechanism (Mnih et al., 2014), using an overall architecture similar to that used in a system for end-to-end speech recognition (Chan et al., 2016). Specifically, we used a 4-layer pyramidal bidirectional LSTM reader that reads input characters, a layer of 256 attentional units, and a 2-layer decoder that produces word sequences. The reader is referred to Chan et al., 2016 for further details. Henceforth we refer to this model as “Attention”.

All models were trained for 24 hours, at which point they were determined to have converged.

³ Note that the native speaker in question merely needs to be able to answer questions of the form “how do you say ‘23’ in your language?”; they do not need to be linguistically trained. In contrast, hand-built grammars require at least some linguistic sophistication on the part of the grammarian.
3.2 Results and discussion

Results for these experiments on a test corpus of 1,000 random examples are given in Table 3. The RNN with attention clearly outperformed the LSTM in that it performed perfectly with both the medium and large training sets, whereas the LSTM made a small percentage of errors. Note that since the numbers were in citation form, there was little room for the LSTM to make inflectional errors, and the errors it made were all of the “silly” variety, in which the output simply denotes the wrong number. But neither system was capable of learning valid transductions given just 300 training examples.⁴

Training size   LSTM Acc.   Attention Acc.   Overlap
28,000          0.999       1.000            56%
9,000           0.994       1.000            0%
300             < 0.001     < 0.001          < 1%

Table 3: Accuracies on a test corpus of 1,000 random Russian citation-form number-name examples for the two RNN architectures. “Overlap” indicates the percentage of the test examples that are also found in the training data.

⁴ The failure of the RNNs to generalize from the minimal training set suggests they are not expressive enough for the sort of “clever” inference that is needed to generalize from so little data. It is plausible that an alternative RNN architecture could learn with such a small amount of data, though we leave it to future research to discover just what such an architecture might be. In an attempt to provide the RNNs with additional support, we also performed an evaluation with the minimal training set in which inputs were encoded so that each decimal position above 0 was represented with a letter (A for 10, B for 100, and so forth). Thus 2034 was represented as 2C3A4. In principle, this ought to have prevented errors which fail to take positional information into account. Unfortunately, this made no difference whatsoever.
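The alternative input encoding described in the footnote can be made concrete with a short sketch (our reconstruction, not the authors' code; in particular, how zero digits in non-final positions are handled is an assumption):

def positional_encode(digits: str) -> str:
    """Mark each decimal position above the ones place with a letter
    (A for tens, B for hundreds, C for thousands, ...), so that
    "2034" becomes "2C3A4"."""
    out = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i    # exponent of this position
        if power == 0:
            out.append(d)              # the ones digit is written bare
        elif d != "0":                 # assumed: non-final zeros are dropped
            out.append(d + chr(ord("A") + power - 1))
    return "".join(out)

assert positional_encode("2034") == "2C3A4"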
We draw two conclusions from these results. First, even a powerful machine learning model known to be applicable to a wide variety of problems may not be appropriate for all superficially-similar problems. Second, it remains to be seen whether any RNN could be designed to learn effectively from an amount of data as small as our smallest training set. Learning from minimal data sets is of great practical concern, and we will proceed to provide a plausible solution to this problem below. We note again that very low error rates do not ensure that a system is usable, since not all errors are equally forgivable.

4 Number normalization with finite-state transducers

The problem of number normalization naturally decomposes into two subproblems: factorization and verbalization of the numeric factors. We first consider the latter problem, the simpler of the two.

Let λ be the set of number names in the target language, and let ν be the set of numerals, the integers denoted by a number name. Then let L : ν* → λ* be a transducer which replaces a sequence of numerals with a sequence of number names. For instance, for English, L will map 90 7 to ninety seven. In languages where there are multiple allomorphs or case forms for a numeral, L will be non-functional (i.e., one-to-many); we return to this issue shortly. In nearly all cases, however, there are no more than a few dozen numerals in ν,⁵ and no more than a few names in λ for the equivalent numeral in ν. Therefore, we assume it is possible to construct L with minimal effort and minimal knowledge of the language. Indeed, all the information needed to construct L for the experiments conducted in this paper can be found in English-language Wikipedia articles.

⁵ At worst, a small number of languages, such as several Indic languages of North India, effectively use unique numerals for all counting numbers up to 100.

The remaining subproblem, factorization, is responsible for converting digit sequences to numeral factors. In English, for example, 97000 is factored as 90 7 1000. Factorization is also language-specific. In Standard French, for example, there is no simplex number name for ‘90’; instead this is realized as quatre-vingt-dix “four twenty ten”, and thus 97000 (quatre-vingt-dix-sept mille) is factored as 4 20 10 7 1000. It is not a priori obvious how one might go about learning language-specific factorizations. For inspiration, we turn to a lesser-known body of linguistics research focusing on number grammars.

Hurford (1975) surveys cross-linguistic properties of number naming and proposes a syntactic representation which directly relates verbalized number names to the corresponding integers. Hurford interprets complex number constructions as arithmetic expressions in which operators (and the parentheses indicating associativity) have been elided. By far the two most common arithmetic operations are multiplication and addition.⁶ In French, for example, the expression dix-sept, literally ‘ten seven’, denotes 17, the sum of its terms, and quatre-vingt(s), literally ‘four twenty’, refers to 80, the product of its terms. These may be combined, as in quatre-vingt-dix-sept. To visualize arithmetic operations and associativities, we henceforth write factorizations using s-expressions—pre-order serializations of k-ary trees—with numeral terminals and arithmetic operator non-terminals. For example, quatre-vingt-dix-sept is written (+ (* 4 20) 10 7).

⁶ Some languages make use of half-counting, or multiplication by one half (e.g., Welsh hanner cant, ‘50’, literally ‘half hundred’), or back-counting, i.e., subtraction (e.g., Latin undevīgintī, ‘19’, literally ‘one from twenty’; Menninger, 1969, 94f.). But these do not reduce the generality of the approach here.

Within any language there are cues to this elided arithmetic structure. In some languages, some or all addends are separated by a word translated as and. In other languages it is possible to determine whether terms are to be multiplied or summed depending on their relative magnitudes. In French (as in English), for instance, an expression XY usually is interpreted as a product if X < Y, as in quatre-vingt(s) ‘80’, but as a sum if X > Y, as in vingt-quatre ‘24’. Thus the problem of number denormalization—that is, recovering the integer denoted by a verbalized number—can be thought of as a special case of grammar induction from pairs of natural language expressions and their denotations (e.g., Kwiatkowski et al., 2011).

4.1 FST model

The complete model consists of four components:

1. A language-independent covering grammar F, transducing from integers expressed as digit sequences to the set of possible factorizations for that integer

2. A language-specific numeral map M, transducing from digit sequences to numerals

3. A language-specific verbalization grammar G, accepting only those factorizations which are licit in the target language

4. A language-specific lexical map L, transducing from sequences of numerals (e.g., 20) to number names (already defined)
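To make the division of labor concrete, the following is a toy sketch of the lexical map L (component 4) for English, written as plain Python rather than as an FST; the real system implements L and the other components with finite-state machinery, and the particular numeral inventory shown here is illustrative only.

NUMERAL_NAMES = {"7": "seven", "90": "ninety", "1000": "thousand"}

def lexical_map(numerals: str) -> str:
    """Replace each numeral in a factorization with its number name,
    e.g. "90 7 1000" -> "ninety seven thousand"."""
    return " ".join(NUMERAL_NAMES[n] for n in numerals.split())

assert lexical_map("90 7 1000") == "ninety seven thousand"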
As the final component, the lexical map L, has already been described, we proceed to describe the remaining three components of the system.

4.1.1 Finite-state transducer algorithms

While we assume the reader has some familiarity with FSTs, we first provide a brief review of a few key algorithms we employ below.

Our FST model is constructed using composition, denoted by the ◦ operator. When both arguments to composition are transducers, composition is equivalent to chaining the two relations described. For example, if A transduces string x to string y, and B transduces y to z, then A ◦ B transduces from string x to string z. When the left-hand side of composition is a transducer and the right-hand side is an acceptor, then their composition produces a transducer in which the range of the left-hand side relation is intersected with the set of strings accepted by the right-hand side argument. Thus if A transduces string x to strings {y, z}, and B accepts y, then A ◦ B transduces from x to y.

We make use of two other fundamental operations, namely inversion and projection. Every transducer A has an inverse denoted by A−1, which is the transducer such that A−1(y) → x if and only if A(x) → y. Any transducer A also has input and output projections denoted by πi(A) and πo(A), respectively. If the transducer A has the domain α* and the range β*, then πi(A) is the acceptor over α* which accepts x if and only if A(x) → y for some y ∈ β*; output projection is defined similarly. The inverse, input projection, and output projection of an FST (or a pushdown transducer) are computed by swapping and/or copying the input or output labels of each arc in the machine. See Mohri et al., 2002 for more details on these and other finite-state transducer algorithms.

4.1.2 Covering grammar

Let A be an FST which, when repeatedly applied to an arithmetic s-expression string, produces the s-expression's value. For example, one application of A to (+ (* 4 20) 10 7) produces (+ 80 10 7), and a second application produces 97. Let μ be the set of s-expression markup symbols {‘(’, ‘)’, ‘+’, ‘*’} and Δ be the set {0, 1, 2, . . . , 9}. Then,

    F : Δ* → (μ ∪ Δ)* = A−1 ◦ A−1 ◦ A−1 . . .     (1)

is an FST which transduces an integer expressed as a digit string to all its candidate factorizations expressed as s-expression strings.⁷ Let C(d) = πo(d ◦ F), which maps from a digit sequence d to the set of all possible factorizations—in any language—of that digit sequence, encoded as s-expressions. For example, C(97) contains strings such as:

(+ 90 7)
(+ 80 10 7)
(+ (* 4 20) 10 7)
…

⁷ In practice, our s-expressions never have a depth exceeding five, so we assume F = A−1 ◦ A−1 ◦ A−1 ◦ A−1 ◦ A−1.
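The action of A can be illustrated with a short plain-Python sketch (our illustration, not the FST itself): it evaluates an s-expression string such as (+ (* 4 20) 10 7) to 97, handling k-ary sums and products.

import math

def evaluate_sexpr(tokens):
    """Evaluate a whitespace-tokenized s-expression such as
    ["(", "+", "(", "*", "4", "20", ")", "10", "7", ")"]."""
    def parse(i):
        if tokens[i] == "(":
            op, args, i = tokens[i + 1], [], i + 2
            while tokens[i] != ")":
                value, i = parse(i)
                args.append(value)
            total = sum(args) if op == "+" else math.prod(args)
            return total, i + 1          # skip the closing parenthesis
        return int(tokens[i]), i + 1     # a numeral terminal
    return parse(0)[0]

assert evaluate_sexpr("( + ( * 4 20 ) 10 7 )".split()) == 97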
4.1.3 Grammar inference

Let M : (μ ∪ Δ)* → ν* be a transducer which deletes all markup symbols in μ and replaces sequences of integers expressed as digit sequences with the appropriate numerals in ν. Let D(l) = πi(M ◦ L ◦ l), which maps from a verbalization l to the set of all s-expressions which contain l as terminals. For example, D(4 20 10 7) contains:

(+ 4 20 10 7)
(+ 4 20 (* 10 7))
(+ (* 4 20) 10 7)
…

Then, given (d, l) where d ∈ Δ* is an integer expressed as a digit sequence, and l ∈ λ* is d's verbalization, their intersection

    E(d, l) = C(d) ◦ D(l)     (2)

will contain the factorization(s) of d that verbalizes as l. In most cases, E will contain exactly one path for a given (d, l) pair. For instance, if d is 97000 and l is ninety seven thousand, E(d, l) is (* (+ 90 7) 1000).

We can use E to induce a context-free grammar (CFG) which accepts only those number verbalizations present in the target language. The simplest possible such CFG uses ‘*’ and ‘+’ as non-terminal labels, and the elements in the domain of L (e.g., 20) as terminals. The grammar will then consist of binary productions extracted from the s-expression derivations produced by E. Table 4 provides a fragment of such a grammar.

S → (7 | 90 | * | +)
* → (7 | 90 | +) 1000
+ → 90 7

Table 4: A fragment of an English number grammar which accepts factorizations of the numbers {7, 90, 97, 7000, 90000, and 97000}. S represents the start symbol, and ‘|’ denotes disjunction. Note that this fragment is regular rather than context-free, though this is rarely the case for complete grammars.
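For illustration, the following sketch (ours, not the induction code itself) extracts binary productions from a derivation represented as nested tuples; applied to (* (+ 90 7) 1000), it yields the two non-start productions shown in Table 4.

def extract_productions(tree):
    """Yield binary CFG productions (lhs, rhs) from a binarized derivation
    given as nested tuples, e.g. ("*", ("+", "90", "7"), "1000")."""
    if isinstance(tree, tuple):
        op, left, right = tree
        lhs_left = left[0] if isinstance(left, tuple) else left
        lhs_right = right[0] if isinstance(right, tuple) else right
        yield (op, (lhs_left, lhs_right))
        yield from extract_productions(left)
        yield from extract_productions(right)

derivation = ("*", ("+", "90", "7"), "1000")
assert set(extract_productions(derivation)) == {("*", ("+", "1000")),
                                                ("+", ("90", "7"))}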
With this approach, we face the familiar issues of ambiguity and sparsity. Concerning the former, the output of E is not unique for all inputs. We address this either by applying normal form constraints on the set of permissible productions, or by ignoring ambiguous examples during induction. One case of ambiguity arises from expressions involving addition with 0 or multiplication by 1, both identity operations that leave the identity element (i.e., 0 or 1) free to associate either to the left or to the right. From our perspective, this ambiguity is spurious, so we stipulate that identity elements may only be siblings to (i.e., on the right-hand side of a production with) another terminal. Thus an expression like one thousand one hundred can only be parsed as (+ (* 1 1000) (* 1 100)). But not all ambiguities can be handled by normal form constraints. Some expressions are ambiguous due to the presence of “palindromes” in the verbalization string. For instance, two hundred two can either be parsed as (+ 2 (* 100 2)) or (+ (* 2 100) 2). The latter derivation is “correct” insofar as it follows the syntactic patterns of other English number expressions, but there is no way to determine this except with reference to the very language-specific patterns we are attempting to learn. Therefore we ignore such expressions during grammar induction, forcing the relevant rules to be induced from unambiguous expressions. Similarly, multiplication and addition are associative, so expressions like three hundred thousand can be binarized either as (* (* 3 100) 1000) or (* 3 (* 100 1000)), though both derivations are equally “correct”. Once again we ignore such ambiguous expressions, instead extracting the relevant rules from unambiguous expressions.

Since we only admit two non-terminal labels, the vast majority of our rules contain numeral terminals on their right-hand sides, and as a result, the number of rules tends to be roughly proportional to the size of the terminal vocabulary. Thus it is common that we have observed, for instance, thirteen thousand and fourteen million but not fourteen thousand or thirteen million, and as a result, the CFG may be deficient simply due to sparsity in the training data, particularly in languages with large terminal vocabularies. To enhance our ability to generalize from a small number of examples, we optionally insert preterminal labels during grammar induction to form classes of terminals assumed to pattern together in all productions. For instance, by introducing ‘teen’ and ‘power_of_ten’ preterminals, all four of the previous expressions are generated by the same top-level production. The full set of preterminal labels we use here is shown in Table 5.

digit        → (2 | 3 | 4 | … 9)
teen         → (11 | 12 | 13 | … 19)
decade       → (20 | 30 | 40 | … 90)
century      → (200 | 300 | 400 | … 900)
power_of_ten → (1000 | 10000 | …)

Table 5: Optional preterminal rules.

In practice, obtaining productions using E is inefficient: it is roughly equivalent to a naïve algorithm which generates all possible derivations for the numerals given, then filters out all of those which do not evaluate to the expected total, violate the aforementioned normal form constraints, or are otherwise ambiguous. This fails to take advantage of top-down constraints derived from the particular structure of the problem. For example, the naïve algorithm entertains many candidate parses for quatre-vingt-dix-sept ‘97’ where the root is ‘*’ and the first child is ‘4’, despite the fact that no such hypothesis is viable as 4 is not a divisor of 97.

We inject arithmetic constraints into the grammar induction procedure, as follows. The inputs to the modified algorithm are tuples of the form (T, ν0, . . . , νn) where T is the numeric value of the expression and ν0, . . . , νn are the n + 1 numerals in the verbalization. Consider a hypothesized numeric value of the leftmost child of the root, T0...i, which dominates ν0, . . . , νi where i < n. For this to be viable, it must be the case that T0...i ≤ T. And, if we further hypothesize that the root node is ‘+’, then the remaining children must evaluate to T − T0...i. Similarly, if we hypothesize that the root node is ‘*’, then the remaining children must evaluate to T / T0...i, and this quantity must be integral.

This approach can be implemented with a backtracking recursive descent parser enforcing the aforementioned normal form constraints and propagating the top-down arithmetic constraints. In practice, however, we implement the search using a straightforward dynamic programming algorithm. The algorithm proceeds by recursively generating all possible leftmost children of the tree and then using these top-down constraints to prune branches of the search space which have no viable completion (though our implementation does not fully propagate these constraints). While the number of left subtrees is exponential in the length of the verbalization, our implementation remains feasible since real-world examples tend to have verbalizations consisting of relatively few terminals. Pseudocode for our implementation is provided in Appendix B.
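The top-down arithmetic constraints just described amount to the following check (a sketch for illustration, not the authors' implementation): given the total T and a hypothesized value for the leftmost child, it returns the value that the remaining children must evaluate to, or None if no completion is viable.

def remaining_target(total, left_value, op):
    """What must the remaining children evaluate to, under root `op`?"""
    if op == "+":
        rest = total - left_value
        return rest if rest > 0 else None
    if op == "*":
        quotient, remainder = divmod(total, left_value)
        return quotient if remainder == 0 and quotient > 0 else None

# 4 is not a divisor of 97, so no parse of quatre-vingt-dix-sept has a '*'
# root whose first child is 4; with a '+' root and a left child worth 80
# (quatre-vingt), the remainder must verbalize 17 (dix-sept).
assert remaining_target(97, 4, "*") is None
assert remaining_target(97, 80, "+") == 17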
4.1.4 Grammar compilation

Once productions have been collected, they are used to specify a recursive transition network (Woods, 1970) which is then compiled into a pushdown acceptor (Allauzen and Riley, 2012) over ν*, henceforth G. An example is shown in Figure 2.

Figure 2: A pushdown acceptor that accepts the language of the grammar fragment in Table 4. Arc labels that contain parentheses indicate “push” and “pop” stack operations, respectively, and must balance along a path.

4.1.5 Final model and remaining issues

Then, the verbalization for d is given by

    V(d) = πo(d ◦ F ◦ M ◦ G ◦ L).     (3)

As noted above, the lexicon transducer L is non-functional when there are multiple number names for a single numeral, as may arise in number systems with allomorphy or morphological concord. When this is the case, we compose the lattice produced by V with a language model (LM) of verbalized numbers (over λ*) and then decode using the shortest path algorithm. Note that whereas the construction of G requires parallel data, the LM requires only “spoken” data. While it is not common in most languages to write out complex cardinal numbers in their verbalized form, it is nonetheless possible to find a large sample of such expressions at web scale (Sproat, 2010); such expressions can be identified by matching against the unweighted πo(F ◦ M ◦ G ◦ L).

4.2 Materials and methods

The FST-based verbalizer V was constructed and evaluated using four languages: English, Georgian, Khmer, and Russian (the latter targeting citation forms only). The medium and minimal sets are used for all four languages; in Russian, we also reuse the large data set (see §3.1). In all cases the test sets consisted of 1,000 randomly generated examples, the same examples as in previous experiments.

The size of V varied by language, with the smallest, English, consisting of roughly 8,000 states and arcs, and the largest, Russian, measuring roughly 80,000 states and arcs and comprising approximately a megabyte of uncompressed binary data.

No LM was required for either English or Khmer as both have a functional L. However, the Georgian and Russian L are both ambiguous, so the best path through the output lattice was selected according to a trigram language model with Witten-Bell smoothing. The language models were constructed using the medium training set.

Locale   Training size   Num. acc.   Morph. acc.   Overlap
eng_us   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
kat_ge   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
khm_kh   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
rus_ru   28,000          1.000       1.000         56%
         9,000           1.000       0.998         0%
         300             1.000       0.998         < 1%

Table 6: Evaluation of the FST verbalizer on English (eng_us), Georgian (kat_ge), Khmer (khm_kh), and Russian (rus_ru); errors are separated into those which affect the numeric denotation (“Num. acc.”) and those which merely use incorrect morphological forms (“Morph. acc.”).

4.3 Results and discussion

The results were excellent for all four languages. There were no errors at all in English, Georgian, and Khmer with either data set. While there were a few errors in Russian, crucially all were agreement errors rather than errors in the factorization itself, exactly the opposite pattern of error to the ones we observed with the LSTM model. For example, 70,477,170 was rendered as семьдесят миллион четыреста семьдесят семь тысяч сто семьдесят; the second word should be миллионов, the genitive plural form. More surprisingly, verbalizers trained on the 300 examples of the minimal data set performed just as well as ones trained with two orders of magnitude more labeled data.

5 Discussion

We presented two approaches to number normalization. The first used a general RNN architecture that has been used for other sequence mapping problems, and the second an FST-based system that uses a fair amount of domain knowledge. The RNN approach can achieve very high accuracy, but with two caveats: it requires a large amount of training data, and the errors it makes may result in the wrong number. The FST-based solution, on the other hand, can learn from a tiny dataset and never makes that particularly pernicious type of error. The small size of training data needed and the high accuracy make this a particularly attractive approach for low-resource scenarios.
In fact, we suspect that the FST model could be made to learn from a smaller number of examples than the 300 that make up the “minimal” set. Finding the minimum number of examples necessary to cover the entire number grammar appears to be a case of the set cover problem—which is NP-complete (Karp, 1972)—but it is plausible that a greedy algorithm could identify an even smaller training set.

The grammar induction method used for the FST verbalizer is near to the simplest imaginable such procedure: it treats rules as well-formed if and only if they have at least one unambiguous occurrence in the training data. More sophisticated induction methods could be used to improve both generalization and robustness to errors in the training data. Generalization might be improved by methods that “hallucinate” unobserved productions (Mohri and Roark, 2006), and robustness could be improved using manual or automated tree annotation (e.g., Klein and Manning, 2003; Petrov and Klein, 2007). We leave this for future work.

Above, we focused solely on cardinal numbers, and specifically their citation forms. However, in all four languages studied here, ordinal numbers share the same factorization and differ only superficially from cardinals. In this case, the ordinal number verbalizer can be constructed by applying a trivial transduction to the cardinal number verbalizer. However, it is an open question whether this is a universal or whether there may be some languages in which the discrepancy is much greater, so that separate methods are necessary to construct the ordinal verbalizer.

The FST verbalizer does not provide any mechanism for verbalization of numbers in morphological contexts other than citation form. One possibility would be to use a discriminative model to select the most appropriate morphological variant of a number in context. We also leave this for future work.

One desirable property of the FST-based system is that FSTs (and PDTs) are trivially invertible: if one builds a transducer that maps from digit sequences to number names, one can invert it, resulting in a transducer that maps number names to digit sequences. (Invertibility is not a property of any RNN solution.) This allows one, with the help of the appropriate target-side language model, to convert a normalization system into a denormalization system, which maps from spoken to written form rather than from written to spoken. During ASR decoding, for example, it is often preferable to use spoken representations (e.g., twenty-three) rather than the written forms (e.g., 23), and then perform denormalization on the resulting transcripts so they can be displayed to users in a more readable form (Shugrina, 2010; Vasserman et al., 2015). In ongoing work we are evaluating FST verbalizers for use in ASR denormalization.

6 Conclusions

We have described two approaches to number normalization, a key component of speech recognition and synthesis systems. The first used a recurrent neural network and large amounts of training data, but very little knowledge about the problem space. The second used finite-state transducers and a learning method totally specialized for this domain but which requires very little training data. While the former approach is certainly more appealing given current trends in NLP, only the latter is feasible for low-resource languages which most need an automated approach to text normalization.
To be sure, we have not demonstrated that RNNs—or similar models—are inapplicable to this problem, nor does it seem possible to do so. However, number normalization is arguably a sequence-to-sequence transduction problem, and RNNs have been shown to be viable end-to-end solutions for similar problems, including grapheme-to-phoneme conversion (Rao et al., 2015) and machine translation (Sutskever et al., 2014), so one might reasonably have expected them to have performed better without making the “silly” errors that we observed. Much of the recent rhetoric about deep learning suggests that neural networks obviate the need for incorporating detailed knowledge of the problem to be solved; instead, one merely needs to feed pairs consisting of inputs and the required outputs, and the system will self-organize to learn the desired mapping (Graves and Jaitly, 2014). While that is certainly a desirable ideal, for this problem one can achieve a much more compact and data-efficient solution if one is willing to exploit knowledge of the domain.

Acknowledgements

Thanks to Cyril Allauzen, Jason Eisner, Michael Riley, Brian Roark, and Ke Wu for helpful discussion, and to Navdeep Jaitly and Haşim Sak for assistance with RNN modeling. All finite-state models were constructed using the OpenGrm libraries (http://opengrm.org).

References

Cyril Allauzen and Michael Riley. 2012. A pushdown transducer extension for the OpenFst library. In CIAA, pages 66–77.

Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In ACL, pages 770–779.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960–4964.

Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. 2002. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612–620.

H. Brandt Corstius, editor. 1968. Grammars for number names. D. Reidel, Dordrecht.

Peter Ebden and Richard Sproat. 2014. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):1–21.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, pages 1764–1772.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James R. Hurford. 1975. The Linguistic Theory of Numerals. Cambridge University Press, Cambridge.

Jakub Kanis, Jan Zelinka, and Luděk Müller. 2005. Automatic number normalization in inflectional languages. In SPECOM, pages 663–666.

Richard M. Karp. 1972. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum, New York.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL, pages 423–430.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, pages 1512–1523.

Karl Menninger. 1969. Number Words and Number Symbols. MIT Press, Cambridge. Translation of Zahlwort und Ziffer, published by Vandenhoeck & Ruprecht, Breslau, 1934.
Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS, pages 2204–2212.

Mehryar Mohri and Brian Roark. 2006. Probabilistic context-free grammar induction based on structural zeros. In NAACL, pages 312–319.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Deana Pennell and Yang Liu. 2011. Toward text message normalization: Modeling abbreviation generation. In ICASSP, pages 5364–5367.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In NAACL, pages 404–411.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In ICASSP, pages 4225–4229.

Brian Roark and Richard Sproat. 2014. Hippocratic abbreviation expansion. In ACL, pages 364–369.

Maria Shugrina. 2010. Formatting time-aligned ASR transcripts for readability. In NAACL, pages 198–206.

Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15(3):287–333.

Richard Sproat. 1996. Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4):369–380.

Richard Sproat. 2010. Lightly supervised learning of text normalization: Russian number names. In IEEE Workshop on Speech and Language Technology, pages 436–441.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Paul Taylor. 2009. Text to Speech Synthesis. Cambridge University Press, Cambridge.

Lucy Vasserman, Vlad Schogol, and Keith Hall. 2015. Sequence-based class tagging for robust transcription in ASR. In INTERSPEECH, pages 473–477.

William A. Woods. 1970. Transition network grammars for natural language analysis. Communications of the ACM, 13(10):591–606.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In EMNLP, pages 61–72.

A Random sampling procedure

Randomly-generated data sets were produced by sampling from a Yule-Simon distribution with ρ = 1, then rounding each sample's k trailing digits, where k is a random variable in the discrete uniform distribution U{0, n} and n is the order of the sampled number. Duplicate samples were then removed. The following R function implements this procedure.

require(VGAM)
EPSILON <- 1e-12

rnumbers <- function(n) {
  x <- ryules(n, rho=1)
  num.digits <- floor(log10(x + EPSILON)) + 1
  sig.digits <- ceiling(runif(n, min=0, max=num.digits))
  unique(signif(x, sig.digits))
}

B Parse generation algorithm

The following algorithm is used to generate parses from parallel (written/spoken) data. It depends upon a procedure GetSubtrees(…) generating all possible labeled binary subtrees given a sequence of terminals, which is left as an exercise to the reader.

 1: procedure GetOracles(T, v0, …, vn)    ▷ Total T, terminals v0, …, vn.
 2:   if n = 1 then
 3:     if eval(v0) = T then
 4:       yield s-expression v0
 5:     end if
 6:     return
 7:   end if
 8:   for i ∈ 1 … n − 1 do    ▷ Size of left child.
 9:     for all L ∈ GetSubtrees(v0, …, vi) do
10:       TL ← eval(L)
11:       TR ← T − TL    ▷ Hypothesizes + root.
12:       if TR > 0 then
13:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
14:           yield s-expression (+ L R)
15:         end for
16:       end if
17:       TR ← T / TL    ▷ Hypothesizes * root.
18:       if TR ∈ N then    ▷ “is a natural number”.
19:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
20:           yield s-expression (* L R)
21:         end for
22:       end if
23:     end for
24:   end for
25: end procedure
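For readers who prefer runnable code, the following is a rough Python transcription of the pseudocode above (our sketch, not the authors' implementation). Trees are represented as nested tuples rather than s-expression strings, GetSubtrees is given a straightforward recursive definition, the base case is taken to be a single terminal, and the normal-form constraints of §4.1.3 are not enforced here.

def get_subtrees(terms):
    """All binary trees over `terms` (in order) with internal labels + or *."""
    if len(terms) == 1:
        yield terms[0]
        return
    for i in range(1, len(terms)):
        for left in get_subtrees(terms[:i]):
            for right in get_subtrees(terms[i:]):
                yield ("+", left, right)
                yield ("*", left, right)

def evaluate(tree):
    """Numeric value of a tree of nested ("+"/"*", left, right) tuples."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

def get_oracles(total, terms):
    """Yield binary parses of `terms` evaluating to `total`, pruning with the
    top-down arithmetic constraints (lines 11-22 of the pseudocode)."""
    if len(terms) == 1:
        if terms[0] == total:
            yield terms[0]
        return
    for i in range(1, len(terms)):                 # size of the left child
        for left in get_subtrees(terms[:i]):
            t_left = evaluate(left)
            if total - t_left > 0:                 # hypothesize a '+' root
                for right in get_oracles(total - t_left, terms[i:]):
                    yield ("+", left, right)
            if t_left > 0 and total % t_left == 0:  # hypothesize a '*' root
                for right in get_oracles(total // t_left, terms[i:]):
                    yield ("*", left, right)

# quatre-vingt-dix-sept: one parse is (+ (* 4 20) (+ 10 7)).
assert ("+", ("*", 4, 20), ("+", 10, 7)) in list(get_oracles(97, [4, 20, 10, 7]))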