Minimally Supervised Number Normalization

Kyle Gorman and Richard Sproat
Google, Inc.
111 8th Ave., New York, NY, USA

Abstract

We propose two models for verbalizing numbers, a key component in speech recognition and synthesis systems. The first model uses an end-to-end recurrent neural network. The second model, drawing inspiration from the linguistics literature, uses finite-state transducers constructed with a minimal amount of training data. While both models achieve near-perfect performance, the latter model can be trained using several orders of magnitude less data than the former, making it particularly useful for low-resource languages.

1 Introduction

Many speech and language applications require text tokens to be converted from one form to another. For example, in text-to-speech synthesis, one must convert digit sequences (32) into number names (thirty-two), and appropriately verbalize date and time expressions (12:47 → twelve forty-seven) and abbreviations (kg → kilograms) while handling allomorphy and morphological concord (e.g., Sproat, 1996). Quite a bit of recent work on SMS (e.g., Beaufort et al., 2010) and text from social media sites (e.g., Yang and Eisenstein, 2013) has focused on detecting and expanding novel abbreviations (e.g., cn u plz hlp). Collectively, such conversions all fall under the rubric of text normalization (Sproat et al., 2001), but this term means radically different things in different applications. For instance, it is not necessary to detect and verbalize dates and times when preparing social media text for downstream information extraction, but this is essential for speech applications.

While expanding novel abbreviations is also important for speech (Roark and Sproat, 2014), numbers, times, dates, measure phrases and the like are far more common in a wide variety of text genres. Following Taylor (2009), we refer to categories such as cardinal numbers, times, and dates—each of which is semantically well-circumscribed—as semiotic classes. Some previous work on text normalization proposes minimally-supervised machine learning techniques for normalizing specific semiotic classes, such as abbreviations (e.g., Chang et al., 2002; Pennell and Liu, 2011; Roark and Sproat, 2014). This paper continues this tradition by contributing minimally-supervised models for normalization of cardinal number expressions (e.g., ninety-seven). Previous work on this semiotic class includes formal linguistic studies by Corstius (1968) and Hurford (1975) and computational models proposed by Sproat (1996; 2010) and Kanis et al. (2005). Of all semiotic classes, numbers are by far the most important for speech, as cardinal (and ordinal) numbers are not only semiotic classes in their own right, but knowing how to verbalize numbers is important for most of the other classes: one cannot verbalize times, dates, measures, or currency expressions without knowing how to verbalize that language's numbers as well.

One computational approach to number name verbalization (Sproat, 1996; Kanis et al., 2005) employs a cascade of two finite-state transducers (FSTs). The first FST factors the integer, expressed as a digit sequence, into sums of products of powers of ten (i.e., in the case of a base-ten number system). This is composed with a second FST that defines how the numeric factors are verbalized, and may also handle allomorphy or morphological concord in languages that require it.
Number names can be relatively easy (as in English) or complex (as in Russian; Sproat, 2010) and thus these FSTs may be relatively easy or quite difficult to develop. While the Google text-to-speech (TTS) (see Ebden and Sproat, 2014) and automatic speech recognition (ASR) systems depend on hand-built number name grammars for about 70 languages, developing these grammars for new languages requires extensive research and labor. For some languages, a professional linguist can develop a new grammar in as little as a day, but other languages may require days or weeks of effort. We have also found that it is very common for these hand-written grammars to contain difficult-to-detect errors; indeed, the computational models used in this study revealed several long-standing bugs in hand-written number grammars.

The amount of time, effort, and expertise required to produce error-free number grammars leads us to consider machine learning solutions. Yet it is important to note that number verbalization poses a dauntingly high standard of accuracy compared to nearly all other speech and language tasks. While one might forgive a TTS system that reads the ambiguous abbreviation plz as plaza rather than the intended please, it would be inexcusable for the same system to ever read 72 as four hundred seventy two, even if it rendered the vast majority of numbers correctly.

To set the stage for this work, we first (§2–3) briefly describe several experiments with a powerful and popular machine learning technique, namely recurrent neural networks (RNNs). When provided with a large corpus of parallel data, these systems are highly accurate, but may still produce occasional errors, rendering them unusable for applications like TTS. In order to give the reader some background on the relevant linguistic issues, we then review some cross-linguistic properties of cardinal number expressions and propose a finite-state approach to number normalization informed by these linguistic properties (§4). The core of the approach is an algorithm for inducing language-specific number grammar rules. We evaluate this technique on data from four languages.

Figure 1: The neural net architecture for the preliminary Russian cardinal number experiments. Purple LSTM layers perform forwards transitions and green LSTM layers perform backwards transitions. The output is produced by a CTC layer with a softmax activation function. Input tokens are characters and output tokens are words.

2 Preliminary experiment with recurrent neural networks

As part of a separate strand of research, we have been experimenting with various recurrent neural network (RNN) architectures for problems in text normalization. In one set of experiments, we trained RNNs to learn a mapping from digit sequences marked with morphosyntactic (case and gender) information, and their expression as Russian cardinal number names. The motivation for choosing Russian is that the number name system of this language, like that of many Slavic languages, is quite complicated, and therefore serves as a good test of the abilities of any text normalization system.

The architecture used was similar to a network employed by Rao et al. (2015) for grapheme-to-phoneme conversion, a superficially similar sequence-to-sequence mapping problem. We used a recurrent network with an input layer, four hidden feed-forward LSTM layers (Hochreiter and Schmidhuber, 1997), and a connectionist temporal classification (CTC) output layer with a softmax activation function (Graves et al., 2006).¹ Two of the hidden layers modeled forward sequences and the other two backward sequences. There were 32 input nodes—corresponding to characters—and 153 output nodes—corresponding to predicted number name words. Each of the hidden layers had 256 nodes. The full architecture is depicted in Figure 1.

¹ Experiments with a non-CTC softmax output layer yielded consistently poor results, and we do not report them here.
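As a rough illustration only, the following is a minimal sketch in tf.keras of a network of the shape described above and in Figure 1. It is our reconstruction rather than the authors' implementation: the layer sizes (32 input symbols, 153 output labels, four recurrent layers of 256 units, two per direction) follow the text, while the embedding size, the use of tf.keras itself, and the training setup (a CTC loss such as tf.nn.ctc_loss applied to the per-frame posteriors) are assumptions.

import tensorflow as tf

NUM_INPUT_SYMBOLS = 32    # character vocabulary, as described above
NUM_OUTPUT_LABELS = 153   # predicted number-name words

inputs = tf.keras.Input(shape=(None,), dtype="int32")           # character ids
x = tf.keras.layers.Embedding(NUM_INPUT_SYMBOLS, 32)(inputs)    # embedding size assumed
# Four recurrent layers of 256 units: two forward and two backward.
for _ in range(2):
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(256, return_sequences=True))(x)
# Per-frame label posteriors; training would apply a CTC loss to these.
outputs = tf.keras.layers.Dense(NUM_OUTPUT_LABELS, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)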
The system was trained on 22M unique digit sequences ranging from one to one million; these were collected by applying an existing TTS text normalization system to several terabytes of web text. Each training example consisted of a digit sequence, gender and case features, and the Russian cardinal number verbalization of that number. Thus, for example, the system has to learn to produce the feminine instrumental form of 60. Examples of these mappings are shown in Table 1, and the various inflected forms of a single cardinal number are given in Table 2. In preliminary experiments, it was discovered that short digit sequences were poorly modeled due to undersampling, so an additional 240,000 short sequence samples (of three or fewer digits) were added to compensate. 2.2M examples (10%) were held out as a development set.

The system was trained for one day, after which it had a 0% label error rate (LER) on the development data set. When decoding 240,000 tokens of held-out test data with this model, we achieved very high accuracy (LER < .0001). The few remaining errors, however, are a serious obstacle to using this system for TTS. The model appears to make no mistakes applying inflectional suffixes to unseen data. Plausibly, this task was made easier by our positioning of the morphological feature string at the end of the input, making it local to the output inflectional suffix (at least for the last word in the number expression). But it does make errors with respect to the numeric value of the expression. For example, for 9801 plu.ins. (девятью тысячами восьмьюстами одними), the system produced девятью тысячами семьюстами одними (9701 plu.ins.): the morphology is correct, but the numeric value is wrong.²

² The exact number of errors and their particular details varied from run to run.

This pattern of errors was exactly the opposite of what we want for speech applications. One might forgive a TTS system that reads 9801 with the correct numeric value but in the wrong case form: a listener would likely notice the error but would usually not be misled about the message being conveyed. In contrast, reading it as nine thousand seven hundred and one is completely unacceptable, as this would actively mislead the listener.

It is worth pointing out that the training set used here—22M examples—was quite large, and we were only able to obtain such a large amount of labeled data because we already had a high-quality hand-built grammar designed to do exactly this transduction. It is simply unreasonable to expect that one could obtain this amount of parallel data for a new language (e.g., from naturally-occurring examples, or from speech transcriptions). This problem is especially acute for low-resource languages (i.e., most of the world's languages), where data is by definition scarce, but where it is also hard to find high-quality linguistic resources or expertise, and where a machine learning approach is thus most needed.
In conclusion, the system does not perform as well as we demand, nor is it in any case a practical solution due to the large amount of training data needed. The RNN appears to have done an impressive job of learning the complex inflectional morphology of Russian, but it occasionally chooses the wrong number names altogether.

3 Number normalization with RNNs

For the purpose of more directly comparing the performance of RNNs with the methods we report on below, we chose to ignore the issue of allomorphy and morphological concord, which appears to be “easy” for generic sequence models like RNNs, and focus instead on verbalizing number expressions in whatever morphological category represents the language's citation form.

3.1 Data and general approach

For our experiments we used three parallel data sets where the target number name was in citation form (in Russian, nominative case):

• A large set consisting of 28,000 examples extracted from several terabytes of web text using an existing TTS text normalization system

• A medium set of 9,000 randomly-generated examples (for details, see Appendix A)

• A minimal set of 300 examples, consisting of the counting numbers up to 200, and 100 carefully-chosen examples engineered to cover a wide variety of phenomena

A separate set of 1,000 randomly-generated examples were held out for evaluation.

5 neu.gen.   → пяти                   five
24 mas.acc.  → двадцать четыре        twenty-four
99 plu.ins.  → девяноста девятью      ninety-nine
11 fem.nom.  → одиннадцать            eleven
81 fem.gen.  → восьмидесяти одной     eighty-one
60 fem.ins.  → шестьюдесятью          sixty
91 neu.ins.  → девяноста одним        ninety-one
3 mas.gen.   → трёх                   three

Table 1: Example input and output data (and glosses) for the Russian RNN experiments.

шестьдесят      nominative (nom.)
шестидесяти     genitive (gen.)
шестидесяти     dative (dat.)
шестьдесят      accusative (acc.)
шестьюдесятью   instrumental (ins.)
шестидесяти     prepositional (pre.)

Table 2: Inflectional forms of the cardinal number “60” in Russian.

The minimal set was intended to be representative of the sort of data one might obtain from a native speaker when asked to provide all the essential information about number names in their language.³

In these experiments we used two different RNN models. The first was the same LSTM architecture as above (henceforth referred to as “LSTM”), except that the numbers of input and output nodes were 13 and 53, respectively, due to the smaller input and output vocabularies.

The second was a TensorFlow-based RNN with an attention mechanism (Mnih et al., 2014), using an overall architecture similar to that used in a system for end-to-end speech recognition (Chan et al., 2016). Specifically, we used a 4-layer pyramidal bidirectional LSTM reader that reads input characters, a layer of 256 attentional units, and a 2-layer decoder that produces word sequences. The reader is referred to Chan et al., 2016 for further details. Henceforth we refer to this model as “Attention”.

All models were trained for 24 hours, at which point they were determined to have converged.

³ Note that the native speaker in question merely needs to be able to answer questions of the form “how do you say ‘23’ in your language?”; they do not need to be linguistically trained. In contrast, hand-built grammars require at least some linguistic sophistication on the part of the grammarian.
3.2 Results and discussion

Results for these experiments on a test corpus of 1,000 random examples are given in Table 3. The RNN with attention clearly outperformed the LSTM in that it performed perfectly with both the medium and large training sets, whereas the LSTM made a small percentage of errors. Note that since the numbers were in citation form, there was little room for the LSTM to make inflectional errors, and the errors it made were all of the “silly” variety, in which the output simply denotes the wrong number. But neither system was capable of learning valid transductions given just 300 training examples.⁴

Training size   LSTM Acc.   Attention Acc.   Overlap
28,000          0.999       1.000            56%
9,000           0.994       1.000            0%
300             < 0.001     < 0.001          < 1%

Table 3: Accuracies on a test corpus of 1,000 random Russian citation-form number-name examples for the two RNN architectures. “Overlap” indicates the percentage of the test examples that are also found in the training data.

⁴ The failure of the RNNs to generalize from the minimal training set suggests they are not expressive enough for the sort of “clever” inference that is needed to generalize from so little data. It is plausible that an alternative RNN architecture could learn with such a small amount of data, though we leave it to future research to discover just what such an architecture might be. In an attempt to provide the RNNs with additional support, we also performed an evaluation with the minimal training set in which inputs were encoded so that each decimal position above 0 was represented with a letter (A for 10, B for 100, and so forth). Thus 2034 was represented as 2C3A4. In principle, this ought to have prevented errors which fail to take positional information into account. Unfortunately, this made no difference whatsoever.
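The alternative input encoding described in the footnote can be made concrete with a short sketch (our reconstruction, not the authors' code; in particular, how zero digits in non-final positions are handled is an assumption):

def positional_encode(digits: str) -> str:
    """Mark each decimal position above the ones place with a letter
    (A for tens, B for hundreds, C for thousands, ...), so that
    "2034" becomes "2C3A4"."""
    out = []
    for i, d in enumerate(digits):
        power = len(digits) - 1 - i    # exponent of this position
        if power == 0:
            out.append(d)              # the ones digit is written bare
        elif d != "0":                 # assumed: non-final zeros are dropped
            out.append(d + chr(ord("A") + power - 1))
    return "".join(out)

assert positional_encode("2034") == "2C3A4"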
We draw two conclusions from these results. First, even a powerful machine learning model known to be applicable to a wide variety of problems may not be appropriate for all superficially-similar problems. Second, it remains to be seen whether any RNN could be designed to learn effectively from an amount of data as small as our smallest training set. Learning from minimal data sets is of great practical concern, and we will proceed to provide a plausible solution to this problem below. We note again that very low error rates do not ensure that a system is usable, since not all errors are equally forgivable.

4 Number normalization with finite-state transducers

The problem of number normalization naturally decomposes into two subproblems: factorization and verbalization of the numeric factors. We first consider the latter problem, the simpler of the two.

Let λ be the set of number names in the target language, and let ν be the set of numerals, the integers denoted by a number name. Then let L : ν* → λ* be a transducer which replaces a sequence of numerals with a sequence of number names. For instance, for English, L will map 90 7 to ninety seven. In languages where there are multiple allomorphs or case forms for a numeral, L will be non-functional (i.e., one-to-many); we return to this issue shortly. In nearly all cases, however, there are no more than a few dozen numerals in ν,⁵ and no more than a few names in λ for the equivalent numeral in ν. Therefore, we assume it is possible to construct L with minimal effort and minimal knowledge of the language. Indeed, all the information needed to construct L for the experiments conducted in this paper can be found in English-language Wikipedia articles.

⁵ At worst, a small number of languages, such as several Indic languages of North India, effectively use unique numerals for all counting numbers up to 100.

The remaining subproblem, factorization, is responsible for converting digit sequences to numeral factors. In English, for example, 97000 is factored as 90 7 1000. Factorization is also language-specific. In Standard French, for example, there is no simplex number name for ‘90’; instead this is realized as quatre-vingt-dix “four twenty ten”, and thus 97000 (quatre-vingt-dix-sept mille) is factored as 4 20 10 7 1000. It is not a priori obvious how one might go about learning language-specific factorizations. For inspiration, we turn to a lesser-known body of linguistics research focusing on number grammars.

Hurford (1975) surveys cross-linguistic properties of number naming and proposes a syntactic representation which directly relates verbalized number names to the corresponding integers. Hurford interprets complex number constructions as arithmetic expressions in which operators (and the parentheses indicating associativity) have been elided. By far the two most common arithmetic operations are multiplication and addition.⁶ In French, for example, the expression dix-sept, literally ‘ten seven’, denotes 17, the sum of its terms, and quatre-vingt(s), literally ‘four twenty’, refers to 80, the product of its terms. These may be combined, as in quatre-vingt-dix-sept. To visualize arithmetic operations and associativities, we henceforth write factorizations using s-expressions—pre-order serializations of k-ary trees—with numeral terminals and arithmetic operator non-terminals. For example, quatre-vingt-dix-sept is written (+ (* 4 20) 10 7).

⁶ Some languages make use of half-counting, or multiplication by one half (e.g., Welsh hanner cant, ‘50’, literally ‘half hundred’), or back-counting, i.e., subtraction (e.g., Latin undevīgintī, ‘19’, literally ‘one from twenty’; Menninger, 1969, 94f.). But these do not reduce the generality of the approach here.

Within any language there are cues to this elided arithmetic structure. In some languages, some or all addends are separated by a word translated as and. In other languages it is possible to determine whether terms are to be multiplied or summed depending on their relative magnitudes. In French (as in English), for instance, an expression XY usually is interpreted as a product if X < Y, as in quatre-vingt(s) ‘80’, but as a sum if X > Y, as in vingt-quatre ‘24’. Thus the problem of number denormalization—that is, recovering the integer denoted by a verbalized number—can be thought of as a special case of grammar induction from pairs of natural language expressions and their denotations (e.g., Kwiatkowski et al., 2011).

4.1 FST model

The complete model consists of four components:

1. A language-independent covering grammar F, transducing from integers expressed as digit sequences to the set of possible factorizations for that integer

2. A language-specific numeral map M, transducing from digit sequences to numerals

3. A language-specific verbalization grammar G, accepting only those factorizations which are licit in the target language

4. A language-specific lexical map L, transducing from sequences of numerals (e.g., 20) to number names (already defined)
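To make the division of labor concrete, the following is a toy sketch of the lexical map L (component 4) for English, written as plain Python rather than as an FST; the real system implements L and the other components with finite-state machinery, and the particular numeral inventory shown here is illustrative only.

NUMERAL_NAMES = {"7": "seven", "90": "ninety", "1000": "thousand"}

def lexical_map(numerals: str) -> str:
    """Replace each numeral in a factorization with its number name,
    e.g. "90 7 1000" -> "ninety seven thousand"."""
    return " ".join(NUMERAL_NAMES[n] for n in numerals.split())

assert lexical_map("90 7 1000") == "ninety seven thousand"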
As the final component, the lexical map L, has already been described, we proceed to describe the remaining three components of the system.

4.1.1 Finite-state transducer algorithms

While we assume the reader has some familiarity with FSTs, we first provide a brief review of a few key algorithms we employ below.

Our FST model is constructed using composition, denoted by the ◦ operator. When both arguments to composition are transducers, composition is equivalent to chaining the two relations described. For example, if A transduces string x to string y, and B transduces y to z, then A ◦ B transduces from string x to string z. When the left-hand side of composition is a transducer and the right-hand side is an acceptor, then their composition produces a transducer in which the range of the left-hand side relation is intersected with the set of strings accepted by the right-hand side argument. Thus if A transduces string x to strings {y, z}, and B accepts y, then A ◦ B transduces from x to y.

We make use of two other fundamental operations, namely inversion and projection. Every transducer A has an inverse denoted by A−1, which is the transducer such that A−1(y) → x if and only if A(x) → y. Any transducer A also has input and output projections denoted by πi(A) and πo(A), respectively. If the transducer A has the domain α* and the range β*, then πi(A) is the acceptor over α* which accepts x if and only if A(x) → y for some y ∈ β*; output projection is defined similarly. The inverse, input projection, and output projection of an FST (or a pushdown transducer) are computed by swapping and/or copying the input or output labels of each arc in the machine. See Mohri et al., 2002 for more details on these and other finite-state transducer algorithms.

4.1.2 Covering grammar

Let A be an FST which, when repeatedly applied to an arithmetic s-expression string, produces the s-expression's value. For example, one application of A to (+ (* 4 20) 10 7) produces (+ 80 10 7), and a second application produces 97. Let μ be the set of s-expression markup symbols {‘(’, ‘)’, ‘+’, ‘*’} and Δ be the set {0, 1, 2, . . . , 9}. Then,

    F : Δ* → (μ ∪ Δ)* = A−1 ◦ A−1 ◦ A−1 . . .     (1)

is an FST which transduces an integer expressed as a digit string to all its candidate factorizations expressed as s-expression strings.⁷ Let C(d) = πo(d ◦ F), which maps from a digit sequence d to the set of all possible factorizations—in any language—of that digit sequence, encoded as s-expressions. For example, C(97) contains strings such as:

(+ 90 7)
(+ 80 10 7)
(+ (* 4 20) 10 7)
…

⁷ In practice, our s-expressions never have a depth exceeding five, so we assume F = A−1 ◦ A−1 ◦ A−1 ◦ A−1 ◦ A−1.
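The action of A can be illustrated with a short plain-Python sketch (our illustration, not the FST itself): it evaluates an s-expression string such as (+ (* 4 20) 10 7) to 97, handling k-ary sums and products.

import math

def evaluate_sexpr(tokens):
    """Evaluate a whitespace-tokenized s-expression such as
    ["(", "+", "(", "*", "4", "20", ")", "10", "7", ")"]."""
    def parse(i):
        if tokens[i] == "(":
            op, args, i = tokens[i + 1], [], i + 2
            while tokens[i] != ")":
                value, i = parse(i)
                args.append(value)
            total = sum(args) if op == "+" else math.prod(args)
            return total, i + 1          # skip the closing parenthesis
        return int(tokens[i]), i + 1     # a numeral terminal
    return parse(0)[0]

assert evaluate_sexpr("( + ( * 4 20 ) 10 7 )".split()) == 97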
4.1.3 Grammar inference

Let M : (μ ∪ Δ)* → ν* be a transducer which deletes all markup symbols in μ and replaces sequences of integers expressed as digit sequences with the appropriate numerals in ν. Let D(l) = πi(M ◦ L ◦ l), which maps from a verbalization l to the set of all s-expressions which contain l as terminals. For example, D(4 20 10 7) contains:

(+ 4 20 10 7)
(+ 4 20 (* 10 7))
(+ (* 4 20) 10 7)
…

Then, given (d, l) where d ∈ Δ* is an integer expressed as a digit sequence, and l ∈ λ* is d's verbalization, their intersection

    E(d, l) = C(d) ◦ D(l)     (2)

will contain the factorization(s) of d that verbalizes as l. In most cases, E will contain exactly one path for a given (d, l) pair. For instance, if d is 97000 and l is ninety seven thousand, E(d, l) is (* (+ 90 7) 1000).

We can use E to induce a context-free grammar (CFG) which accepts only those number verbalizations present in the target language. The simplest possible such CFG uses ‘*’ and ‘+’ as non-terminal labels, and the elements in the domain of L (e.g., 20) as terminals. The grammar will then consist of binary productions extracted from the s-expression derivations produced by E. Table 4 provides a fragment of such a grammar.

S → (7 | 90 | * | +)
* → (7 | 90 | +) 1000
+ → 90 7

Table 4: A fragment of an English number grammar which accepts factorizations of the numbers {7, 90, 97, 7000, 90000, and 97000}. S represents the start symbol, and ‘|’ denotes disjunction. Note that this fragment is regular rather than context-free, though this is rarely the case for complete grammars.
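For illustration, the following sketch (ours, not the induction code itself) extracts binary productions from a derivation represented as nested tuples; applied to (* (+ 90 7) 1000), it yields the two non-start productions shown in Table 4.

def extract_productions(tree):
    """Yield binary CFG productions (lhs, rhs) from a binarized derivation
    given as nested tuples, e.g. ("*", ("+", "90", "7"), "1000")."""
    if isinstance(tree, tuple):
        op, left, right = tree
        lhs_left = left[0] if isinstance(left, tuple) else left
        lhs_right = right[0] if isinstance(right, tuple) else right
        yield (op, (lhs_left, lhs_right))
        yield from extract_productions(left)
        yield from extract_productions(right)

derivation = ("*", ("+", "90", "7"), "1000")
assert set(extract_productions(derivation)) == {("*", ("+", "1000")),
                                                ("+", ("90", "7"))}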
With this approach, we face the familiar issues of ambiguity and sparsity. Concerning the former, the output of E is not unique for all inputs. We address this either by applying normal form constraints on the set of permissible productions, or by ignoring ambiguous examples during induction. One case of ambiguity arises from expressions involving addition with 0 or multiplication by 1, both identity operations that leave the identity element (i.e., 0 or 1) free to associate either to the left or to the right. From our perspective, this ambiguity is spurious, so we stipulate that identity elements may only be siblings to (i.e., on the right-hand side of a production with) another terminal. Thus an expression like one thousand one hundred can only be parsed as (+ (* 1 1000) (* 1 100)). But not all ambiguities can be handled by normal form constraints. Some expressions are ambiguous due to the presence of “palindromes” in the verbalization string. For instance, two hundred two can either be parsed as (+ 2 (* 100 2)) or (+ (* 2 100) 2). The latter derivation is “correct” insofar as it follows the syntactic patterns of other English number expressions, but there is no way to determine this except with reference to the very language-specific patterns we are attempting to learn. Therefore we ignore such expressions during grammar induction, forcing the relevant rules to be induced from unambiguous expressions. Similarly, multiplication and addition are associative, so expressions like three hundred thousand can be binarized either as (* (* 3 100) 1000) or (* 3 (* 100 1000)), though both derivations are equally “correct”. Once again we ignore such ambiguous expressions, instead extracting the relevant rules from unambiguous expressions.

Since we only admit two non-terminal labels, the vast majority of our rules contain numeral terminals on their right-hand sides, and as a result, the number of rules tends to be roughly proportional to the size of the terminal vocabulary. Thus it is common that we have observed, for instance, thirteen thousand and fourteen million but not fourteen thousand or thirteen million, and as a result, the CFG may be deficient simply due to sparsity in the training data, particularly in languages with large terminal vocabularies. To enhance our ability to generalize from a small number of examples, we optionally insert preterminal labels during grammar induction to form classes of terminals assumed to pattern together in all productions. For instance, by introducing ‘teen’ and ‘power_of_ten’ preterminals, all four of the previous expressions are generated by the same top-level production. The full set of preterminal labels we use here is shown in Table 5.

digit        → (2 | 3 | 4 | … 9)
teen         → (11 | 12 | 13 | … 19)
decade       → (20 | 30 | 40 | … 90)
century      → (200 | 300 | 400 | … 900)
power_of_ten → (1000 | 10000 | …)

Table 5: Optional preterminal rules.

In practice, obtaining productions using E is inefficient: it is roughly equivalent to a naïve algorithm which generates all possible derivations for the numerals given, then filters out all of those which do not evaluate to the expected total, violate the aforementioned normal form constraints, or are otherwise ambiguous. This fails to take advantage of top-down constraints derived from the particular structure of the problem. For example, the naïve algorithm entertains many candidate parses for quatre-vingt-dix-sept ‘97’ where the root is ‘*’ and the first child is ‘4’, despite the fact that no such hypothesis is viable as 4 is not a divisor of 97.

We inject arithmetic constraints into the grammar induction procedure, as follows. The inputs to the modified algorithm are tuples of the form (T, ν0, . . . , νn) where T is the numeric value of the expression and ν0, . . . , νn are the n + 1 numerals in the verbalization. Consider a hypothesized numeric value of the leftmost child of the root, T0...i, which dominates ν0, . . . , νi where i < n. For this to be viable, it must be the case that T0...i ≤ T. And, if we further hypothesize that the root node is ‘+’, then the remaining children must evaluate to T − T0...i. Similarly, if we hypothesize that the root node is ‘*’, then the remaining children must evaluate to T / T0...i, and this quantity must be integral.

This approach can be implemented with a backtracking recursive descent parser enforcing the aforementioned normal form constraints and propagating the top-down arithmetic constraints. In practice, however, we implement the search using a straightforward dynamic programming algorithm. The algorithm proceeds by recursively generating all possible leftmost children of the tree and then using these top-down constraints to prune branches of the search space which have no viable completion (though our implementation does not fully propagate these constraints). While the number of left subtrees is exponential in the length of the verbalization, our implementation remains feasible since real-world examples tend to have verbalizations consisting of relatively few terminals. Pseudocode for our implementation is provided in Appendix B.
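The top-down arithmetic constraints just described amount to the following check (a sketch for illustration, not the authors' implementation): given the total T and a hypothesized value for the leftmost child, it returns the value that the remaining children must evaluate to, or None if no completion is viable.

def remaining_target(total, left_value, op):
    """What must the remaining children evaluate to, under root `op`?"""
    if op == "+":
        rest = total - left_value
        return rest if rest > 0 else None
    if op == "*":
        quotient, remainder = divmod(total, left_value)
        return quotient if remainder == 0 and quotient > 0 else None

# 4 is not a divisor of 97, so no parse of quatre-vingt-dix-sept has a '*'
# root whose first child is 4; with a '+' root and a left child worth 80
# (quatre-vingt), the remainder must verbalize 17 (dix-sept).
assert remaining_target(97, 4, "*") is None
assert remaining_target(97, 80, "+") == 17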
4.1.4 Grammar compilation

Once productions have been collected, they are used to specify a recursive transition network (Woods, 1970) which is then compiled into a pushdown acceptor (Allauzen and Riley, 2012) over ν*, henceforth G. An example is shown in Figure 2.

Figure 2: A pushdown acceptor that accepts the language of the grammar fragment in Table 4. Arc labels that contain parentheses indicate “push” and “pop” stack operations, respectively, and must balance along a path.

4.1.5 Final model and remaining issues

Then, the verbalization for d is given by

    V(d) = πo(d ◦ F ◦ M ◦ G ◦ L).     (3)

As noted above, the lexicon transducer L is non-functional when there are multiple number names for a single numeral, as may arise in number systems with allomorphy or morphological concord. When this is the case, we compose the lattice produced by V with a language model (LM) of verbalized numbers (over λ*) and then decode using the shortest path algorithm. Note that whereas the construction of G requires parallel data, the LM requires only “spoken” data. While it is not common in most languages to write out complex cardinal numbers in their verbalized form, it is nonetheless possible to find a large sample of such expressions at web scale (Sproat, 2010); such expressions can be identified by matching against the unweighted πo(F ◦ M ◦ G ◦ L).

4.2 Materials and methods

The FST-based verbalizer V was constructed and evaluated using four languages: English, Georgian, Khmer, and Russian (the latter targeting citation forms only). The medium and minimal sets are used for all four languages; in Russian, we also reuse the large data set (see §3.1). In all cases the test sets consisted of 1,000 randomly generated examples, the same examples as in previous experiments.

The size of V varied by language, with the smallest, English, consisting of roughly 8,000 states and arcs, and the largest, Russian, measuring roughly 80,000 states and arcs and comprising approximately a megabyte of uncompressed binary data.

No LM was required for either English or Khmer as both have a functional L. However, the Georgian and Russian L are both ambiguous, so the best path through the output lattice was selected according to a trigram language model with Witten-Bell smoothing. The language models were constructed using the medium training set.

Locale   Training size   Num. acc.   Morph. acc.   Overlap
eng_us   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
kat_ge   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
khm_kh   9,000           1.000       1.000         0%
         300             1.000       1.000         < 1%
rus_ru   28,000          1.000       1.000         56%
         9,000           1.000       0.998         0%
         300             1.000       0.998         < 1%

Table 6: Evaluation of the FST verbalizer on English (eng_us), Georgian (kat_ge), Khmer (khm_kh), and Russian (rus_ru); errors are separated into those which affect the numeric denotation (“Num. acc.”) and those which merely use incorrect morphological forms (“Morph. acc.”).

4.3 Results and discussion

The results were excellent for all four languages. There were no errors at all in English, Georgian, and Khmer with either data set. While there were a few errors in Russian, crucially all were agreement errors rather than errors in the factorization itself, exactly the opposite pattern of error to the ones we observed with the LSTM model. For example, 70,477,170 was rendered as семьдесят миллион четыреста семьдесят семь тысяч сто семьдесят; the second word should be миллионов, the genitive plural form. More surprisingly, verbalizers trained on the 300 examples of the minimal data set performed just as well as ones trained with two orders of magnitude more labeled data.

5 Discussion

We presented two approaches to number normalization. The first used a general RNN architecture that has been used for other sequence mapping problems, and the second an FST-based system that uses a fair amount of domain knowledge. The RNN approach can achieve very high accuracy, but with two caveats: it requires a large amount of training data, and the errors it makes may result in the wrong number. The FST-based solution, on the other hand, can learn from a tiny dataset and never makes that particularly pernicious type of error. The small size of training data needed and the high accuracy make this a particularly attractive approach for low-resource scenarios.
In fact, we suspect that the FST model could be made to learn from a smaller number of examples than the 300 that make up the “minimal” set. Finding the minimum number of examples necessary to cover the entire number grammar appears to be a case of the set cover problem—which is NP-complete (Karp, 1972)—but it is plausible that a greedy algorithm could identify an even smaller training set.

The grammar induction method used for the FST verbalizer is near to the simplest imaginable such procedure: it treats rules as well-formed if and only if they have at least one unambiguous occurrence in the training data. More sophisticated induction methods could be used to improve both generalization and robustness to errors in the training data. Generalization might be improved by methods that “hallucinate” unobserved productions (Mohri and Roark, 2006), and robustness could be improved using manual or automated tree annotation (e.g., Klein and Manning, 2003; Petrov and Klein, 2007). We leave this for future work.

Above, we focused solely on cardinal numbers, and specifically their citation forms. However, in all four languages studied here, ordinal numbers share the same factorization and differ only superficially from cardinals. In this case, the ordinal number verbalizer can be constructed by applying a trivial transduction to the cardinal number verbalizer. However, it is an open question whether this is a universal or whether there may be some languages in which the discrepancy is much greater, so that separate methods are necessary to construct the ordinal verbalizer.

The FST verbalizer does not provide any mechanism for verbalization of numbers in morphological contexts other than citation form. One possibility would be to use a discriminative model to select the most appropriate morphological variant of a number in context. We also leave this for future work.

One desirable property of the FST-based system is that FSTs (and PDTs) are trivially invertible: if one builds a transducer that maps from digit sequences to number names, one can invert it, resulting in a transducer that maps number names to digit sequences. (Invertibility is not a property of any RNN solution.) This allows one, with the help of the appropriate target-side language model, to convert a normalization system into a denormalization system, which maps from spoken to written form rather than from written to spoken. During ASR decoding, for example, it is often preferable to use spoken representations (e.g., twenty-three) rather than the written forms (e.g., 23), and then perform denormalization on the resulting transcripts so they can be displayed to users in a more readable form (Shugrina, 2010; Vasserman et al., 2015). In ongoing work we are evaluating FST verbalizers for use in ASR denormalization.

6 Conclusions

We have described two approaches to number normalization, a key component of speech recognition and synthesis systems. The first used a recurrent neural network and large amounts of training data, but very little knowledge about the problem space. The second used finite-state transducers and a learning method totally specialized for this domain but which requires very little training data. While the former approach is certainly more appealing given current trends in NLP, only the latter is feasible for low-resource languages which most need an automated approach to text normalization.
To be sure, we have not demonstrated that RNNs—or similar models—are inapplicable to this problem, nor does it seem possible to do so. However, number normalization is arguably a sequence-to-sequence transduction problem, and RNNs have been shown to be viable end-to-end solutions for similar problems, including grapheme-to-phoneme conversion (Rao et al., 2015) and machine translation (Sutskever et al., 2014), so one might reasonably have expected them to have performed better without making the “silly” errors that we observed. Much of the recent rhetoric about deep learning suggests that neural networks obviate the need for incorporating detailed knowledge of the problem to be solved; instead, one merely needs to feed pairs consisting of inputs and the required outputs, and the system will self-organize to learn the desired mapping (Graves and Jaitly, 2014). While that is certainly a desirable ideal, for this problem one can achieve a much more compact and data-efficient solution if one is willing to exploit knowledge of the domain.

Acknowledgements

Thanks to Cyril Allauzen, Jason Eisner, Michael Riley, Brian Roark, and Ke Wu for helpful discussion, and to Navdeep Jaitly and Haşim Sak for assistance with RNN modeling. All finite-state models were constructed using the OpenGrm libraries (http://opengrm.org).

References

Cyril Allauzen and Michael Riley. 2012. A pushdown transducer extension for the OpenFst library. In CIAA, pages 66–77.

Richard Beaufort, Sophie Roekhaut, Louise-Amélie Cougnon, and Cédrick Fairon. 2010. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In ACL, pages 770–779.

William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. 2016. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, pages 4960–4964.

Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. 2002. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612–620.

H. Brandt Corstius, editor. 1968. Grammars for number names. D. Reidel, Dordrecht.

Peter Ebden and Richard Sproat. 2014. The Kestrel TTS text normalization system. Natural Language Engineering, 21(3):1–21.

Alex Graves and Navdeep Jaitly. 2014. Towards end-to-end speech recognition with recurrent neural networks. In ICML, pages 1764–1772.

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In ICML, pages 369–376.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

James R. Hurford. 1975. The Linguistic Theory of Numerals. Cambridge University Press, Cambridge.

Jakub Kanis, Jan Zelinka, and Luděk Müller. 2005. Automatic number normalization in inflectional languages. In SPECOM, pages 663–666.

Richard M. Karp. 1972. Reducibility among combinatorial problems. In Raymond E. Miller and James W. Thatcher, editors, Complexity of Computer Computations, pages 85–103. Plenum, New York.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In ACL, pages 423–430.

Tom Kwiatkowski, Luke Zettlemoyer, Sharon Goldwater, and Mark Steedman. 2011. Lexical generalization in CCG grammar induction for semantic parsing. In EMNLP, pages 1512–1523.

Karl Menninger. 1969. Number Words and Number Symbols. MIT Press, Cambridge. Translation of Zahlwort und Ziffer, published by Vandenhoeck & Ruprecht, Breslau, 1934.
Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent models of visual attention. In NIPS, pages 2204–2212.

Mehryar Mohri and Brian Roark. 2006. Probabilistic context-free grammar induction based on structural zeros. In NAACL, pages 312–319.

Mehryar Mohri, Fernando Pereira, and Michael Riley. 2002. Weighted finite-state transducers in speech recognition. Computer Speech & Language, 16(1):69–88.

Deana Pennell and Yang Liu. 2011. Toward text message normalization: Modeling abbreviation generation. In ICASSP, pages 5364–5367.

Slav Petrov and Dan Klein. 2007. Improved inference for unlexicalized parsing. In NAACL, pages 404–411.

Kanishka Rao, Fuchun Peng, Haşim Sak, and Françoise Beaufays. 2015. Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks. In ICASSP, pages 4225–4229.

Brian Roark and Richard Sproat. 2014. Hippocratic abbreviation expansion. In ACL, pages 364–369.

Maria Shugrina. 2010. Formatting time-aligned ASR transcripts for readability. In NAACL, pages 198–206.

Richard Sproat, Alan W. Black, Stanley Chen, Shankar Kumar, Mari Ostendorf, and Christopher Richards. 2001. Normalization of non-standard words. Computer Speech and Language, 15(3):287–333.

Richard Sproat. 1996. Multilingual text analysis for text-to-speech synthesis. Natural Language Engineering, 2(4):369–380.

Richard Sproat. 2010. Lightly supervised learning of text normalization: Russian number names. In IEEE Workshop on Speech and Language Technology, pages 436–441.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112.

Paul Taylor. 2009. Text to Speech Synthesis. Cambridge University Press, Cambridge.

Lucy Vasserman, Vlad Schogol, and Keith Hall. 2015. Sequence-based class tagging for robust transcription in ASR. In INTERSPEECH, pages 473–477.

William A. Woods. 1970. Transition network grammars for natural language analysis. Communications of the ACM, 13(10):591–606.

Yi Yang and Jacob Eisenstein. 2013. A log-linear model for unsupervised text normalization. In EMNLP, pages 61–72.

A Random sampling procedure

Randomly-generated data sets were produced by sampling from a Yule-Simon distribution with ρ = 1, then rounding each sample's k trailing digits, where k is a random variable in the discrete uniform distribution U{0, n} and n is the order of the sampled number. Duplicate samples were then removed. The following R function implements this procedure.

require(VGAM)
EPSILON <- 1e-12

rnumbers <- function(n) {
  x <- ryules(n, rho=1)
  num.digits <- floor(log10(x + EPSILON)) + 1
  sig.digits <- ceiling(runif(n, min=0, max=num.digits))
  unique(signif(x, sig.digits))
}

B Parse generation algorithm

The following algorithm is used to generate parses from parallel (written/spoken) data. It depends upon a procedure GetSubtrees(…) generating all possible labeled binary subtrees given a sequence of terminals, which is left as an exercise to the reader.

 1: procedure GetOracles(T, v0, …, vn)    ▷ Total T, terminals v0, …, vn.
 2:   if n = 1 then
 3:     if eval(v0) = T then
 4:       yield s-expression v0
 5:     end if
 6:     return
 7:   end if
 8:   for i ∈ 1 … n − 1 do    ▷ Size of left child.
 9:     for all L ∈ GetSubtrees(v0, …, vi) do
10:       TL ← eval(L)
11:       TR ← T − TL    ▷ Hypothesizes + root.
12:       if TR > 0 then
13:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
14:           yield s-expression (+ L R)
15:         end for
16:       end if
17:       TR ← T / TL    ▷ Hypothesizes * root.
18:       if TR ∈ N then    ▷ “is a natural number”.
19:         for all R ∈ GetOracles(TR, vi+1, …, vn) do
20:           yield s-expression (* L R)
21:         end for
22:       end if
23:     end for
24:   end for
25: end procedure
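For readers who prefer runnable code, the following is a rough Python transcription of the pseudocode above (our sketch, not the authors' implementation). Trees are represented as nested tuples rather than s-expression strings, GetSubtrees is given a straightforward recursive definition, the base case is taken to be a single terminal, and the normal-form constraints of §4.1.3 are not enforced here.

def get_subtrees(terms):
    """All binary trees over `terms` (in order) with internal labels + or *."""
    if len(terms) == 1:
        yield terms[0]
        return
    for i in range(1, len(terms)):
        for left in get_subtrees(terms[:i]):
            for right in get_subtrees(terms[i:]):
                yield ("+", left, right)
                yield ("*", left, right)

def evaluate(tree):
    """Numeric value of a tree of nested ("+"/"*", left, right) tuples."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

def get_oracles(total, terms):
    """Yield binary parses of `terms` evaluating to `total`, pruning with the
    top-down arithmetic constraints (lines 11-22 of the pseudocode)."""
    if len(terms) == 1:
        if terms[0] == total:
            yield terms[0]
        return
    for i in range(1, len(terms)):                 # size of the left child
        for left in get_subtrees(terms[:i]):
            t_left = evaluate(left)
            if total - t_left > 0:                 # hypothesize a '+' root
                for right in get_oracles(total - t_left, terms[i:]):
                    yield ("+", left, right)
            if t_left > 0 and total % t_left == 0:  # hypothesize a '*' root
                for right in get_oracles(total // t_left, terms[i:]):
                    yield ("*", left, right)

# quatre-vingt-dix-sept: one parse is (+ (* 4 20) (+ 10 7)).
assert ("+", ("*", 4, 20), ("+", 10, 7)) in list(get_oracles(97, [4, 20, 10, 7]))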