Probabilistic Verb Selection for Data-to-Text Generation Dell Zhang†1, Jiahao Yuan‡, Xiaoling Wang‡2, and Adam Foster† †Birkbeck, University of London, Malet Street, London WC1E 7HX, UK ‡Shanghai Key Lab of Trustworthy Computing, East China Normal University, 3663 North Zhongshan Road, Shanghai 200062, China 1dell.z@ieee.org, 2xlwang@sei.ecnu.edu.cn Abstract In data-to-text Natural Language Generation (NLG) systems, computers need to find the right words to describe phenomena seen in the data. This paper focuses on the problem of choosing appropriate verbs to express the di- rection and magnitude of a percentage change (e.g., in stock prices). Rather than simply using the same verbs again and again, we present a principled data-driven approach to this prob- lem based on Shannon’s noisy-channel model so as to bring variation and naturalness into the generated text. Our experiments on three large-scale real-world news corpora demon- strate that the proposed probabilistic model can be learned to accurately imitate human authors’ pattern of usage around verbs, outperforming the state-of-the-art method significantly. 1 Introduction Natural Language Generation (NLG) is a fundamen- tal task in Artificial Intelligence (AI) (Russell and Norvig, 2009). It aims to automatically turn struc- tured data into prose (Reiter, 2007; Belz and Kow, 2009) — the opposite of the better-known field of Natural Language Processing (NLP) that transforms raw text into structured data (e.g., a logical form or a knowledge base) (Jurafsky and Martin, 2009). Being dubbed “algorithmic authors” or “robot journalists”, NLG systems have attracted a lot of attention in re- cent years, thanks to the rise of big data (Wright, 2015). The use of NLG in financial services has been growing very fast. One particularly important NLG problem for summarizing financial or business data is to automatically generate textual descriptions of trends between two data points (such as stock prices). In this paper, we elect to use relative percentages rather than absolute numbers to describe the change from one data point to another. This is because an absolute number might be considered small in one case but large in another, depending on the unit and the context (Krifka, 2007; Smiley et al., 2016). For example, 1000 British pounds are worth much more than 1000 Japanese yen; a rise of 100 US dollars in car price might be negligible but the same amount of increase in bike price would be significant. Given two data points (e.g., on a stock chart), the percentage change can always be calculated easily. The challenge is to select the appropriate verb for any percentage change. For example, in newspa- pers, we often see headlines like “Apple’s stock had jumped 34% this year in anticipation of the next iPhone . . . ” and “Microsoft’s profit climbed 28% with shift to Web-based software . . . ”. The journal- ists writing such news stories use descriptive lan- guage such as the verbs like jump and climb to express the direction and magnitude of a percent- age change. It is of course possible to simply keep using the same neutral verbs, e.g., increase and decrease for upward and downward changes re- spectively, again and again, as in most existing data- to-text NLG systems. However, the generated text would sound much more natural if computers could use a variety of verbs suitable in the context like human authors do. 
Expressions of percentage changes are readily available in many natural language text datasets and 511 Transactions of the Association for Computational Linguistics, vol. 6, pp. 511–527, 2018. Action Editor: Alexander Koller. Submission batch: 1/2018; Revision batch: 5/2018; Published 8/2018. c©2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. can be easily extracted. Therefore computers should be able to learn from such expressions how people de- cide which verbs to use for what kind of percentage changes. In this paper, we address the problem of verb se- lection for data-to-text NLG through a principled data-driven approach. Specifically, we show how to employ Bayesian reasoning to train a probabilistic model for verb selection based on large-scale real- world news corpora, and demonstrate its advantages over existing verb selection methods. The rest of this paper is organized as follows. In Section 2, we review the related work in literature. In Section 3, we describe the dataset used for our inves- tigation. In Section 4, we present our probabilistic model for verb selection in detail. In Section 5, we conduct experimental evaluation. In Section 6, we discuss possible extensions to the proposed approach. In Section 7, we draw conclusions. 2 Related Work The most successful NLG applications, from the com- mercial perspective, have been data-to-text NLG sys- tems which generate textual descriptions of databases or datasets (Reiter, 2007; Belz and Kow, 2009). A typical example is the automatic generation of tex- tual weather forecasts from weather data that has been used by Environment Canada and UK Met Of- fice (Goldberg et al., 1994; Belz, 2008; Sripada et al., 2014). The TREND system (Boyd, 1998) focuses on generating descriptions of historical weather patterns. Their method concentrates primarily on the detection of upward and downward trends in the weather data, and uses a limited set of verbs to describe different types of movements. Ramos-Soto et al. (2013) also address the surface realization of weather trend data by creating an “intermediate language” for temper- ature, wind etc. and then using four different ways to verbalize temperatures based on the minimum, maximum and trend in the time frame considered. An empirical corpus-based study of human-written weather forecasts has been conducted in SUMTIME- MOUSAM (Reiter et al., 2005), and one aspect of their research focused on verb selection in weather forecasts. They built a classifier to predict the choice of verb based on type (speed vs. direction), informa- tion content (change or transition from one wind state to another) and near-synonym choice. There is more and more interest in using NLG to enhance acces- sibility, for example by describing data in the form of graphs etc. to visually impaired people. In such NLG systems, there has also been exploration into the generation of text for trend data which should be au- tomatically adapted to users’ reading levels (Moraes et al., 2014). There exists wide-spread usage of NLG systems on the financial and business data. For ex- ample, the SPOTLIGHT system developed at A.C. Nielsen automatically generated readable English text based on the analysis of large amounts of retail sales data. For another example, in 2016 Forbes re- ported that FactSet used NLG to automatically write hundreds of thousands of company descriptions a day. 
It is not difficult to imagine that different kinds of such data-to-text NLG systems can be utilized by a modern chatbot like Amazon Echo or Microsoft XiaoIce (Shum et al., 2018) to enable users access a variety of online data resources via natural language conversation. Typically, a complete data-to-text NLG system im- plements a pipeline which involves both content se- lection (“what to say”) and surface realization (“how to say”). In recent years, researchers have made much progress in the end-to-end joint optimization of those two aspects: Angeli et al. (2010) treat the generation process as a sequence of local decisions represented by log-linear models; Konstas and Lapata (2013) employ a probabilistic context-free grammar (PCFG) specifying the structure of the event records and complement it with an n-gram language model as well as a dependency model; the most advanced method to date is the LSTM recurrent neural net- work (RNN) based encoder-aligner-decoder model proposed by Mei et al. (2016) which is able to learn content selection and surface realization together di- rectly from database-text pairs. The verb selection problem that we focus on in this paper belongs to the lexicalization step of content selection, more specifi- cally, sentence planning. Similar to the above men- tioned joint optimization methods, our approach to verb selection is also automatic, unsupervised, and domain-independent. It would be straightforward to generalize our proposed model to select other types of words (like adjectives and adverbs), or even textual templates as used by Angeli et al. (2010), to describe numerical data. Due to its probabilistic nature, our 512 proposed model could be plugged into, or interpo- lated with, a bigger end-to-end probabilistic model (Konstas and Lapata, 2013) relatively easily, but it is not obvious how this model could fit into a neural architecture (Mei et al., 2016). The existing work on lexicalization that is most similar to ours is a corpus based method for verb se- lection developed by Smiley et al. (2016) at Thomson Reuters. They analyze the usage patterns of verbs expressing percentage changes in a very large corpus, the Reuters News Archive. For each verb, they cal- culate the interquartile range (IQR) of its associated percentage changes in the corpus. Given a new per- centage change, their method randomly selects a verb from those verbs whose IQRs cover the percentage in question, with equal probabilities. A crowdsourcing based evaluation has demonstrated the superiority of their verb selection method to the random baseline that just chooses verbs completely randomly. It is notable that their method has been incorporated into Thomson Reuters EikonTM, their commercial data- to-text NLG software product for macro-economic indicators and mergers-and-acquisitions deals (Pla- chouras et al., 2016). We will make experimental comparisons between our proposed approach and theirs in Section 5. 3 Data 3.1 The WSJ Corpus The first (and main) dataset that we have used to investigate the problem of verb selection is BLLIP 1987-89 Wall Street Journal (WSJ) Corpus Release 1 which contains a three-year Wall Street Journal (WSJ) collection of 98,732 stories from ACL/DCI (LDC93T1), approximately 30 million words (Char- niak et al., 2000). We first utilized the Stanford CoreNLP1 (Manning et al., 2014) toolkit to extract “relation triples” from all the documents in the dataset, via its open-domain information extraction (OpenIE) functionality. 
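A minimal sketch of this extraction step is given below. It is illustrative rather than the exact script used here: it assumes a CoreNLP server is running locally on port 9000 and reads the documented JSON output of the openie annotator.

    import requests

    CORENLP_URL = "http://localhost:9000"  # assumes a local CoreNLP server has been started

    def extract_triples(text):
        """Return (subject, relation, object) triples produced by CoreNLP's OpenIE annotator."""
        props = ('{"annotators": "tokenize,ssplit,pos,lemma,depparse,natlog,openie",'
                 ' "outputFormat": "json"}')
        resp = requests.post(CORENLP_URL, params={"properties": props},
                             data=text.encode("utf-8"))
        resp.raise_for_status()
        triples = []
        for sentence in resp.json()["sentences"]:
            for t in sentence.get("openie", []):
                triples.append((t["subject"], t["relation"], t["object"]))
        return triples

    # e.g. extract_triples("Google's revenue rose 22.2 percent.")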
Then, with the help of part-of-speech (POS) tagging provided by the Python package NLTK2 (Bird et al., 2009), we filtered the extracted relation triples and retained only those expressing a percentage change in the following format:

    Google's revenue [subject]   rose [verb]   22.2% [percentage]

Here the numerical value of the percentage change could be written using either the symbol % or the word percent. Note that all auxiliary verbs (including modal verbs) were removed, and lemmatization (Manning et al., 2008; Jurafsky and Martin, 2009) was applied to all main verbs so that the different inflectional forms of the same verb were reduced to their common base form.

1 https://stanfordnlp.github.io/CoreNLP/
2 http://www.nltk.org/

After extracting 57,005 candidate triples for a total of 1,355 verbs, we eliminated rare verbs which occur fewer than 50 times in the dataset. Furthermore, we manually annotated the direction of each verb as upward or downward, and discarded verbs like yield which do not indicate the direction of percentage change. The above preprocessing left us with 25 (normalized) verbs of which 11 are upward and 14 are downward. There are 21,766 verb-percentage pairs in total.

Furthermore, it is found that most of the percentage changes in this dataset reside within the range [0%, 100%]. Only a tiny portion of percentage changes are beyond that range: 1.35% for upward verbs and 0.10% for downward verbs. Those out-of-range percentage changes are considered outliers and are excluded from our study in this paper, though the way to relax this constraint will be discussed later in Section 6.

3.2 The Reuters Corpus

We have also validated our model on a widely-used public dataset, the Reuters-21578 text categorization collection3. It is a collection of 21,578 documents that appeared on Reuters newswire in 1987. The documents were assembled and indexed with categories, but the categories were not needed in this paper.

3 https://goo.gl/NrOfu

The same preprocessing as on the WSJ corpus has been applied to this dataset, except that the minimum occurring frequency of verbs was not 50 but 5 times due to the smaller size of this dataset. After manual annotation and filtering, we ended up with 8 verbs including 4 upward ones and 4 downward ones. There are 603 verb-percentage pairs in total.

3.3 The Chinese Corpus

Furthermore, to verify the effectiveness of our approach in other languages, we have also made use of the Chinese Gigaword (5th edition) dataset. It is a comprehensive archive of newswire text data that has been acquired from eight distinct sources of Chinese newswire by LDC over a number of years (LDC2011T13), and contains more than 10 million sentences.

Since we could not find any open-domain information extraction toolkit for "relation triples" in Chinese, we resorted to regular expression matching to extract, from Chinese sentences, the expressions of percentage together with their local contexts. A number of regular expression patterns were utilized to ensure that they cover all the different ways to write a percentage in Chinese. Then, after POS tagging, we identified the verb immediately preceding each percentage if it was associated with one.
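A minimal sketch of this regular-expression route follows. The two patterns and the choice of the jieba tagger are purely for illustration: the actual pattern set used on the Gigaword corpus was larger, and no particular Chinese POS tagger is prescribed here.

    import re
    import jieba.posseg as pseg  # illustrative choice of Chinese POS tagger

    # illustrative patterns; the full set must cover every way of writing a percentage in Chinese
    PERCENT_PATTERNS = [
        re.compile(r"\d+(?:\.\d+)?\s*%"),
        re.compile(r"百分之[零一二三四五六七八九十百点]+"),
    ]

    def extract_verb_percentage(sentence):
        """Return (verb, percentage expression) if a verb immediately precedes a percentage."""
        for pattern in PERCENT_PATTERNS:
            match = pattern.search(sentence)
            if match is None:
                continue
            tokens = list(pseg.cut(sentence[:match.start()]))  # POS-tag the text before the match
            if tokens and tokens[-1].flag.startswith("v"):
                # the adverb right before the verb can be kept as well, as discussed just below
                return tokens[-1].word, match.group(0)
        return None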
For our application, a big difference between Chinese and English is that the available choices of verbs to express upward or downward percentage changes are quite limited in Chinese: the variation mostly comes from the adverb used together with the verb. Therefore, when we talk about the problem of Chinese verb selection in this paper, we actually mean the choice of not just verbs but adverb+verb combinations, e.g., 狂升 (rise crazily) and 略降 (fall slightly). Our proposed probabilistic model for verb selection, described below in Section 4, can be extended straightforwardly to such generalized Chinese "verbs".

Similar to the preprocessing of the other datasets, rarely occurring verbs with frequency less than 50 were filtered out. In the end, we got 18 Chinese verbs of which 14 are upward and 4 are downward. There are 2,829 verb-percentage pairs in total.

4 Approach

In this section, we propose to formulate the task of verb selection for data-to-text NLG (see Section 1) as a supervised learning problem (Hastie et al., 2009) and to address it using Shannon's noisy-channel model (Shannon, 1948). For each of the two possible change directions (upward and downward), we need to build a specific model. Without loss of generality, in the subsequent discussion we focus on selecting the verbs of one particular direction; the way to deal with the other direction is exactly the same. Thus, within one model, a percentage change is fully specified by its magnitude.

The set-up of our supervised learning problem is as follows. Suppose that we have a set of training examples D = {(x_1, w_1), ..., (x_N, w_N)}, where each example consists of a percentage change x_i paired with the verb w_i used by the human author to express that percentage change. Such training data could be obtained from a large corpus as described in Section 3. Let X denote the set of possible percentage changes: as mentioned earlier, in this paper we assume that X = [0%, 100%]. Let V denote the set of possible verbs, i.e., the vocabulary. Our task is to learn a predictive function f : X → V that maps any given percentage change x to an appropriate verb w = f(x).

Clearly, there is inherent uncertainty in the above process of predicting the choice of verb for a percentage change. Making use of probabilistic reasoning, the principled approach to handling uncertainty, we argue that the function f should be determined by the posterior probability P(w|x). However, it is difficult to directly estimate the parameters of such a conditional model, aka a discriminative model, for every possible value of x, which is a continuous variable. Hence, we turn to the easier alternative often used in machine learning: constructing a generative model. Rather than directly estimating the conditional probability distribution, we instead estimate the joint probability P(x, w) over (x, w) pairs in the generative model. The joint probability can be decomposed as follows:

    P(x, w) = P(w) P(x|w) ,                                           (1)

where P(w) is the prior probability distribution over verbs w, and P(x|w) is the likelihood, i.e., the probability of seeing the percentage change x given that the associated verb is w. The benefit of making this decomposition is that the parameters of P(w) and P(x|w) can be estimated separately.
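In code, the training data and the two separately estimated components can be organised as in the sketch below (one model per direction; the names are ours, and the density fit for P(x|w) is deferred to Section 4.2):

    from collections import Counter, defaultdict

    def split_by_direction(pairs, verb_direction):
        """pairs: iterable of (percentage, verb); verb_direction: manual annotation, e.g. {"rise": "up"}.

        For each direction, returns the raw material for the two model components:
        a Counter of verb frequencies (for the prior P(w)) and, per verb, the list of
        observed percentage changes (for the likelihood P(x|w))."""
        data = {}
        for x, w in pairs:
            direction = verb_direction.get(w)
            if direction is None or not (0.0 <= x <= 100.0):  # drop unannotated verbs and outliers
                continue
            prior_counts, samples = data.setdefault(direction, (Counter(), defaultdict(list)))
            prior_counts[w] += 1
            samples[w].append(x)
        return data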
Given such a generative model, we can then use Bayes' rule to derive the posterior probability P(w|x) for any new example x:

    P(w|x) = P(w) P(x|w) / P(x) ,                                     (2)

where

    P(x) = Σ_{w∈V} P(x, w) = Σ_{w∈V} P(w) P(x|w)                      (3)

is the model evidence acting as the normalizing constant in the formula.

Intuitively, this generative model can be considered a noisy channel (Shannon, 1948). When we see a percentage change x, we can imagine that it has been generated in two steps (Raviv, 1967). First, a verb w is chosen with the prior probability P(w). Second, the verb w is passed through a communication "channel" and corrupted by "noise" to produce the percentage change x according to the likelihood function (aka the channel model) P(x|w). In other words, the percentage change x that we see is actually the distorted form of its associated verb w. An alternative, but equivalent, interpretation is that when a pair (x, w) is passed through the noisy channel, the verb w is lost and only the percentage change x is seen. The task is to recover the lost w based on the observed x.

Shannon's noisy-channel model is in fact a kind of Bayesian inference. It has been applied to many NLP tasks such as text categorization, spell checking, question answering, speech recognition, and machine translation (Jurafsky and Martin, 2009). Our application, probabilistic verb selection, is different from them because the observed data are continuous real-valued numbers rather than discrete symbols. More importantly, in most of those applications, such as text categorization using the Naïve Bayes algorithm (Manning et al., 2008), the objective is "decoding", i.e., to find the single most likely label w* for any given input x from the model:

    w* = argmax_{w∈V} P(w|x)
       = argmax_{w∈V} P(w) P(x|w) / P(x)
       = argmax_{w∈V} P(w) P(x|w) ,                                   (4)

and therefore the normalizing constant P(x) does not need to be calculated. However, this is actually undesirable for the task of verb selection, because it implies that a percentage change x would always be expressed by the same "optimal" verb w* corresponding to it. To achieve variation and naturalness, we must maintain the diversity of word usage. So the right way to generate a verb w for a given percentage change x is to compute the posterior probability distribution P(w|x) over all the possible verbs in the vocabulary V using Eq. (2) and then randomly sample a verb from that distribution. Although this means that the normalizing constant P(x) needs to be calculated each time, the computation is still efficient, as unlike in many other applications the vocabulary size |V| is quite small in practice (see Section 3).

In the following two subsections, we study the two components of our proposed probabilistic model for verb selection: the prior probability distribution and the likelihood function, respectively.

4.1 Prior

The prior probability distribution P(w) can simply be obtained by maximum likelihood estimation (MLE):

    P(w)MLE = Nw / N ,                                                (5)

where Nw is the number of training examples with the verb w, and N is the total number of training examples.

The relationship between a verb's rank and frequency in the WSJ corpus is depicted by the log-log plot in Fig. 1, revealing that the empirical distribution of verbs follows Zipf's law (Powers, 1998), which is related to the power law (Adamic, 2000; Newman, 2005).
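Putting Eqs. (2), (3) and (5) together, verb selection for one direction can be sketched as below. The class and its names are ours; the likelihood here is fitted with SciPy's Gaussian KDE purely as a stand-in for the density models discussed in Section 4.2.

    import numpy as np
    from scipy.stats import gaussian_kde

    class VerbSelector:
        """Noisy-channel verb selector for one change direction."""

        def __init__(self, prior_counts, samples):
            # prior_counts: verb -> frequency; samples: verb -> observed percentage changes
            total = sum(prior_counts.values())
            self.verbs = sorted(prior_counts)
            self.prior = np.array([prior_counts[w] / total for w in self.verbs])  # Eq. (5)
            self.likelihoods = {w: gaussian_kde(samples[w]) for w in self.verbs}  # P(x|w)

        def posterior(self, x):
            joint = self.prior * np.array([self.likelihoods[w](x)[0] for w in self.verbs])
            return joint / joint.sum()                                            # Eqs. (2) and (3)

        def select(self, x, rng=None):
            # sample from P(w|x) instead of taking the argmax of Eq. (4), to preserve diversity
            rng = rng or np.random.default_rng()
            return rng.choice(self.verbs, p=self.posterior(x))

The prior estimated by Eq. (5) is, as Fig. 1 shows, dominated by a handful of frequently used verbs.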
Specifically, the frequency of the i-th popular verb, fi, is proportional to 1/is, where s is the ex- ponent characterizing the distribution (shown as the slope of the straight line in the corresponding log-log plot). This implies that in the context of expressing percentage changes, the human choice of verbs is dominated by a few frequently used ones, and many other verbs are only used very occasionally. Smoothing: If we would like to intentionally boost the diversity of verb choices, we could mitigate the high skewness of the empirical distribution of verbs by smoothing (Zhai and Lafferty, 2004). A simple smoothing technique suitable for this purpose is the Jelinek-Mercer smoothing (Jelinek and Mercer, 1980) 515 0.0 0.5 1.0 1.5 2.0 2.5 log(rank) 4 5 6 7 8 9 lo g( fre q) (a) upward verbs 0.0 0.5 1.0 1.5 2.0 2.5 log(rank) 4 5 6 7 8 9 lo g( fre q) (b) downward verbs Figure 1: The empirical distribution of verbs P(w)MLE follows the Zipf’s law, in the WSJ corpus. which uses a linear interpolation between the maxi- mum likelihood estimation of a verb w’s prior proba- bility distribution with the uniform distribution over the vocabulary of verbs V, i.e., P(w) = λP(w)MLE + (1 −λ) 1|V| , (6) where P(w)MLE is given by Eq. (5), and the parame- ter λ ∈ [0, 1] provides a means to explicitly control the trade-off between accuracy and diversity. The smaller the parameter λ is, the more diverse the gen- erated verbs would be. When λ = 0, the prior prob- ability is completely ignored and the selection of a verb solely depends on how compatible the verb is with the given percentage change. When λ = 1, it backs off to the original model without smoothing. The optimal value of the parameter λ could be tuned on a development set (see Section 5.3). 4.2 Likelihood For each verb w ∈V, we analyze the distribution of its associated percentage changes and calculate the following descriptive statistics: mean, standard devi- ation (std), skewness, kurtosis, median, and interquar- tile range (IQR). All those descriptive statistics for the WSJ corpus are given in Table 1. In addition, Fig. 2 shows the box plots of percentage changes for top-10 (most frequent) verbs in the WSJ corpus, where the rectangular box corresponding to each verb represents the span from the first quartile to the third quartile, i.e., the interquartile range (IQR), with the segment inside the box indicating the median and the whiskers outside the box indicating the rest of the distribution (except for the points that are determined to be “outliers” using the so-called Tukey box plot method). It can be seen that the choice of verbs often im- ply the magnitude of percentage change: some verbs (such as soar and plunge) are mostly used to ex- press big changes (large medians), while some verbs (such as advance and ease) are mostly used to express small changes (small medians). Generally speaking, the former is associated with a relatively wide range of percentage changes (large IQRs) while the latter is associated with a relatively narrow range of percentage changes (small IQRs). Moreover, it is interesting to see that for almost all the verbs, the distribution of percentage changes is heavily skewed to the left side (i.e., smaller changes). Given a new percentage change x, in order to cal- culate its probability of being generated from a verb w in the above described generative model, we need to fit the likelihood function, i.e., the probability dis- tribution P(x|w), for each word w ∈ V, based on the training data. 
One common technique for this purpose is kernel density estimation (KDE) (Hastie et al., 2009), a non- parametric way to estimate the probability density function as follows: P(x|w) = 1 Nwh Nw∑ i=1 K ( x−xi h ) , (7) 516 verbs mean std skewness kurtosis median IQR upward rise 16.93 18.58 1.77 2.80 9.40 [04.90, 22.00] increase 17.05 18.06 1.76 3.01 10.45 [05.00, 23.00] grow 15.46 17.48 1.77 2.93 8.40 [03.20, 21.00] climb 17.22 18.32 1.81 3.26 10.00 [05.57, 23.00] jump 31.28 23.64 0.77 -0.24 24.20 [12.53, 48.00] surge 29.03 25.43 0.85 -0.33 21.00 [08.00, 46.00] gain 13.78 16.79 1.95 3.89 7.50 [02.00, 20.00] soar 39.39 27.68 0.42 -0.94 35.00 [15.20, 58.00] raise 16.54 15.54 1.83 4.19 11.40 [05.00, 22.75] advance 15.83 15.47 1.87 3.49 10.55 [06.03, 20.00] boost 20.15 16.16 1.68 2.80 16.00 [09.78, 24.99] downward fall 17.52 19.93 1.61 1.86 8.90 [04.18, 24.00] decline 14.81 17.09 1.87 3.07 8.00 [04.58, 19.00] drop 18.36 19.00 1.51 1.72 10.00 [05.47, 26.00] slip 11.95 17.51 2.09 3.24 6.00 [02.00, 09.12] plunge 38.87 26.92 0.48 -0.83 34.05 [15.08, 58.00] slide 23.09 22.29 1.00 -0.03 15.00 [05.25, 38.65] lose 23.65 21.65 1.05 0.47 17.00 [06.00, 36.98] tumble 28.84 22.46 0.98 0.42 24.90 [10.00, 39.20] plummet 36.43 23.89 0.62 -0.35 31.00 [19.90, 50.00] ease 11.02 17.27 2.25 3.97 5.50 [01.95, 08.67] decrease 19.72 18.67 1.25 0.82 12.00 [05.60, 30.80] reduce 25.72 21.81 1.41 1.21 20.00 [10.00, 30.00] dip 13.98 18.98 2.01 2.91 6.85 [03.75, 10.25] shrink 23.82 20.72 1.33 1.37 15.00 [10.00, 35.00] Table 1: The descriptive statistics of percentage changes (in %) for each verb, in the WSJ corpus. 0 20 40 60 80 100 percent rise increase climb grow gain jump soar surge raise advance ve rb (a) upward verbs 0 20 40 60 80 100 percent fall decline drop tumble slip lose plunge ease slide plummet ve rb (b) downward verbs Figure 2: The box plots of percentage changes (in %) for the top-10 verbs, in the WSJ corpus. where Nw is the number of training examples with the verb w, K(·) is the kernel (a non-negative func- tion that integrates to one and has mean zero), and h > 0 is a smoothing parameter called the bandwidth. Fig. 3 shows the likelihood function P(x|w) fitted by KDE with Gaussian kernels and automatic band- width determination using the rule of Scott (2015), for the most popular upward and downward verbs in the WSJ corpus: rise and fall. It is also possible to fit a parametric model of P(x|w) which would be more efficient than KDE. Since in this paper x is assumed to be a continuous random variable within the range [0%, 100%] (see Section 3), we choose to fit P(x|w) with the Beta 517 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 (a) the verb rise 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 (b) the verb fall Figure 3: The likelihood function P(x|w) fitted by kernel density estimation (KDE). 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 (a) the verb rise 0 20 40 60 80 100 0.00 0.02 0.04 0.06 0.08 (b) the verb fall Figure 4: The likelihood function P(x|w) fitted by the Beta distribution. distribution which is a continuous distribution sup- ported on the bounded interval [0, 1]: P(x|w) = Beta(α,β) = Γ(α + β) Γ(α)Γ(β) xα−1(1 −x)β−1 . (8) Although there exist a number of continuous dis- tributions supported on the bounded interval such as the truncated normal distribution, the Beta dis- tribution is picked here as it has the ability to take a great variety of different shapes using only two parameters α and β. These two parameters can be es- timated using the method of moments, or maximum likelihood. 
For example, using the former, we have α̂ = x̄ ( x̄(1−x̄) v̄ − 1 ) and β̂ = (1−x̄) ( x̄(1−x̄) v̄ − 1 ) if v̄ < x̄(1 − x̄), where x̄ and v̄ are the sample mean and sample variance respectively. Fig. 4 shows the likelihood function P(x|w) fitted by the Beta dis- tribution using SciPy4 for the most popular upward and downward verbs in the WSJ corpus: rise and fall. 5 Experiments 5.1 Baselines Thomson Reuters: The only published approach that we are aware of to this specific task of verb selec- tion in the context of data-to-text NLG is the method adopted by Thomson Reuters EikonTM (Smiley et al., 2016). This baseline method’s effectiveness has been verified through crowdsourcing, as we have mentioned before (see Section 2). Furthermore, it is fairly new (published in 2016), therefore should 4https://www.scipy.org/ 518 represent the state of the art in this field. Note that their model was not taken off-the-shelf but re-trained on our datasets to ensure a fair comparison with our approach. Neural Network: Another baseline method that we have tried is a feed-forward artificial neural net- work with hidden layers, aka, a multi-layer percep- tron (Russell and Norvig, 2009; Goodfellow et al., 2016). It is because neural networks are well-known universal function approximators, and they represent quite a different family of supervised learning algo- rithms. Unlike our proposed probabilistic approach which is essentially a generative model, the neural network used in our experiments is a discrimina- tive model which takes the percentage change in- put (represented as a single floating-point number) and then predicts the verb choice directly. Since we would like to have probability estimates for each verb, the softmax function was used for the output layer of neurons, and the network was trained via back-propagation to minimize the cross-entropy loss function. An l2 regularization term was also added to the loss function that would shrink model param- eters to prevent overfitting. The activation function was set to the rectified linear unit (ReLU) (Hahn- loser et al., 2000). The Adam optimization algo- rithm (Kingma and Ba, 2014) was employed as the solver, with the samples shuffled after each iteration. The initial learning rate was set to 0.001, and the maximum number of iterations (epochs) was set to 1500. For our datasets, a single hidden layer of 100 neurons would be sufficient and adding more neu- rons or layers could not help. This was found using the development set through a line search from 20 to 500 hidden neurons with step size 20. Note that when applying the trained neural network to select verbs, we should use not argmax but sampling from the predicted probability distribution (given by the softmax function), in the same way as we do in our proposed probabilistic model (see Section 4). 5.2 Code The Python code for our experiments, along with the datasets of verb-percentage pairs extracted from those three corpora (see Section 3), have been made available to the research community5. 5https://goo.gl/gkj8Fa 5.3 Automatic Evaluation The end users’ perception of a verb selection algo- rithm’s quality depends on not only how accurately the chosen verbs reflect the corresponding percent- age changes but also how diverse the chosen verbs are, which are two largely orthogonal dimensions for evaluation. 
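Before turning to these two evaluation dimensions, we note that the Beta likelihood of Eq. (8) is cheap to fit. Below is a sketch of the method-of-moments estimates quoted in Section 4.2, with percentages rescaled to [0, 1]; the function names are ours.

    import numpy as np
    from scipy import stats

    def fit_beta_mom(percentages):
        """Method-of-moments estimates of the Beta parameters in Eq. (8) from percentages in [0, 100]."""
        x = np.asarray(percentages, dtype=float) / 100.0
        mean, var = x.mean(), x.var()
        if var >= mean * (1.0 - mean):
            raise ValueError("method of moments not applicable: sample variance too large")
        common = mean * (1.0 - mean) / var - 1.0
        return mean * common, (1.0 - mean) * common            # alpha, beta

    def beta_likelihood(x, alpha, beta):
        """Density P(x|w) of a percentage change x (in %) under the fitted Beta distribution."""
        return stats.beta.pdf(x / 100.0, alpha, beta) / 100.0  # change of variables back to the % scale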
Accuracy: The easiest way to assess the accuracy of an NLG method or system is to compare the texts generated by computers and the texts written by hu- mans for the same input data (Mellish and Dale, 1998; Reiter and Belz, 2009), using an automatic metric such as BLEU (Papineni et al., 2002). For our task of verb selection, we decide to use the metric MRR that stands for mean reciprocal rank (Voorhees, 1999; Radev et al., 2002) and can be calculated as follows: MRR = 1 |Q| ∑ (x′i,w ′ i)∈Q 1 rank(w′i) , (9) where Q = {(x′1,w′1), . . . , (x′M,w′M )} is the set of test examples, and rank(w′i) refers to the rank po- sition of w′i — the verb really used by the human author to describe the percentage change x′i — in the list of predicted verbs ranked in the descending order of their probabilities of correctness given by the model. The MRR metric is most widely used for the evaluation of automatic question answering which is similar to automatic verb selection in the following sense: they both aim to output just one suitable response (answer or verb) to any given input (question or percentage change). Through 5-fold cross-validation (Hastie et al., 2009), we have got the MRR scores of our proposed model (see Section 4) and the two baseline mod- els (see Section 5.1) which are shown in Table 2. The models were trained/tested separately on each dataset (see Section 3). In each round of 5-fold cross- validation, 20% of the data would become the test set; in the remaining 80% of the data, randomly selected 60% would be the training set and the other 20% would be the development set if parameter tuning is needed (otherwise the whole 80% would be used for training). The parameter λ of our model controls the strength of smoothing over the prior probability (see Sec- tion 4.1) and thus dictates the trade-off between ac- curacy and diversity. If we focus on the accuracy 519 corpus method upward verbs downward verbs WSJ Thomson Reuters 0.119 ± 0.002 0.106 ± 0.003 Neural Network 0.581 ± 0.044 0.567 ± 0.013 Our Approach (λ = 1 , KDE) 0.724 ± 0.011 0.686 ± 0.016 Our Approach (λ = 1 , Beta) 0.730 ± 0.011 0.685 ± 0.015 Our Approach (λ = 0.05, KDE) 0.533 ± 0.018 0.516 ± 0.003 Our Approach (λ = 0.05, Beta) 0.527 ± 0.012 0.532 ± 0.011 Reuters Thomson Reuters 0.370 ± 0.033 0.339 ± 0.023 Neural Network 0.860 ± 0.050 0.855 ± 0.044 Our Approach (λ = 1 , KDE) 0.887 ± 0.038 0.881 ± 0.036 Our Approach (λ = 1 , Beta) 0.887 ± 0.045 0.872 ± 0.038 Our Approach (λ = 0.05, KDE) 0.729 ± 0.060 0.799 ± 0.036 Our Approach (λ = 0.05, Beta) 0.721 ± 0.070 0.695 ± 0.054 Chinese Thomson Reuters 0.167 ± 0.005 0.345 ± 0.019 Neural Network 0.508 ± 0.057 0.668 ± 0.058 Our Approach (λ = 1 , KDE) 0.525 ± 0.011 0.702 ± 0.047 Our Approach (λ = 1 , Beta) 0.528 ± 0.016 0.696 ± 0.042 Our Approach (λ = 0.05, KDE) 0.433 ± 0.013 0.656 ± 0.040 Our Approach (λ = 0.05, Beta) 0.445 ± 0.012 0.639 ± 0.044 Table 2: The accuracy of verb selection measured by MRR (mean±std) via 5-fold cross-validation. only and ignore the diversity, the optimal value of λ should just be 1 (i.e., no smoothing). In order to strike a healthy balance between accuracy and diver- sity, we carried out a line search for the value of λ from 0 to 1 with step size 0.05 using the development set. It turned out that the smoothing effect upon diver- sity would only become noticeable when λ ≤ 0.1, so we further conducted a line search from 0 to 0.1 with step size 0.01, and found that using λ = 0.05 consis- tently yield a good performance on different corpora. 
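Both ingredients of this tuning loop take only a few lines each: the Jelinek-Mercer smoothed prior of Eq. (6) and the MRR of Eq. (9) evaluated on the development set. A sketch follows (names ours; the surrounding search loop is hypothetical and its helper is not defined here):

    import numpy as np

    def smoothed_prior(prior_counts, lam):
        """Eq. (6): interpolate the MLE prior with the uniform distribution over the vocabulary."""
        verbs = sorted(prior_counts)
        total = sum(prior_counts.values())
        mle = np.array([prior_counts[w] / total for w in verbs])
        return verbs, lam * mle + (1.0 - lam) / len(verbs)

    def mrr(rank_verbs, pairs):
        """Eq. (9): rank_verbs(x) must return the vocabulary ordered by decreasing P(w|x)."""
        return sum(1.0 / (rank_verbs(x).index(w) + 1.0) for x, w in pairs) / len(pairs)

    # hypothetical line search over lambda on the development set:
    # for lam in np.arange(0.0, 1.01, 0.05):
    #     model = build_selector(train_pairs, lam)   # build_selector is assumed, not defined here
    #     print(lam, mrr(model.ranked_verbs, dev_pairs))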
Actually, this phenomenon should not be very sur- prising, given the Zipfian distribution of verbs which is highly skewed (see Fig. 1). Our observation in the experiments still indicate that smoothing with a none-zero λ worked better than setting λ = 0. That is to say, it would not be wise to go to extremes to ignore the prior entirely which would unnecessarily harm the accuracy. An alternative smoothing solution for mitigating the severe skewness of the empirical prior that we also considered is to make the smoothed prior probability proportional to the logarithm of the raw prior probability, but we did not take that route as (i) we could not find a good principled interpreta- tion for such a trick and; (ii) using a small λ value like 0.05 seemed to work sufficiently well. It will be shown later that sampling verbs from the posterior probability distribution rather than just using the one with the maximum probability would help to alleviate the problem of prior skewness and thus prevent verb selection from being dominated by the most popular verbs. It can be observed from the experimental results that smoothing (see Section 4.1) does reduce the accuracy of verb selection. The MRR scores with λ = 0.05 are lower than those with λ = 1. Nev- ertheless, as we shall soon see, strong smoothing is crucially important for achieving a good level of diversity. Furthermore, there seemed to be little per- formance difference between the usage of the KDE technique or the Beta distribution to fit the likelihood function in our approach. This suggests that the latter is preferable because it is as effective as the former but much more efficient. Therefore, in the remaining part of this paper, we shall focus on this specific ver- sion of our model (with λ = 0.05, Beta) even though it may not be the most accurate. The MRR scores achieved by our approach are around 0.4 – 0.8 which implies that, on average, the first or the second verb selected by our approach would be the “correct” verb used by human authors. Across all the three corpora, our proposed proba- bilistic model, whether it is smoothed or not, whether it uses the KDE technique or the Beta distribution, outperforms the Thomson Reuters baseline by a large 520 margin in terms of MRR. According to the Wilcoxon signed-rank test (Wilcoxon, 1945; Kerby, 2014), the performance improvements brought by our approach over the Thomson Reuters baseline are statistically significant with the (two-sided) p-value � 0.0001 on the two English corpora and = 0.0027 on the Chinese corpus. With respect to the Neural Network baseline, on all the three corpora, its accuracy is slightly better than that of our smoothed model (λ = 0.05) though it still could not beat our original unsmoothed model (λ = 1). The major problem with the Neural Net- work baseline is that, similar to the probabilistic model without smoothing, its verb choices would concentrate on the most frequent ones and thus have very poor diversity. A prominent advantage of our proposed probabilistic model, in comparison with discriminative learning algorithms such as the Neural Network baseline, is that we are able to explicitly control the trade-off between accuracy and diversity by adjusting the strength of smoothing. It is worth emphasizing that the accuracy of a verb selection method only reflects its ability to imitate how writers (journalists) use verbs, but this is not necessarily the same as how readers interpret the verbs. 
Usually the ultimate goal of an NLG sys- tem is to successfully communicate information to readers. Previous research in NLG and psychology suggests that there is wide variation in how different people interpret verbs and words in general, which is probably much larger in the general population than amongst journalists. Specifically, the MRR metric would probably underestimate the effectiveness of a verb selection method, since a verb different from the one really used by the writer is not necessarily a less appropriate choice for the corresponding percentage change from the reader’s perspective. Diversity: Other than the accuracy of reproducing the verb choices made by human authors, verb selec- tion methods could also be automatically evaluated in terms of diversity. Following Kingrani et al. (2015), we borrow the diversity measures from ecology (Magurran, 1988) to quantitatively analyze the diversity of verb choices: each specific verb is considered as a particular species. When measuring the biological diversity of a habitant, it is important to consider not only the number of distinct species present but also the rela- tive abundance of each species. In the literature of ecology, the former is called richness and the latter is called evenness. Here we utilize the well-known Inverse Simpson Index aka Simpson’s Reciprocal In- dex (Simpson, 1949) which takes both richness and evenness into account: D = (∑R i=1 p 2 i )−1 , where R is the total number of distinct species (i.e., rich- ness), and pi is the the proportion of the individuals belonging to the i-th species relative to the entire population. The evenness is given by the value of diversity normalized to the range between 0 and 1, so it can be calculated as D/R. Table 3 shows the diversity scores of verb choices made by our approach and the Thomson Reuters base- line for 450 randomly sampled percentage changes (see Section 5.4). Overall, in terms of diversity, our approach would lose to Thomson Reuters. The Neu- ral Network baseline is omitted here because its di- versity scores were very low. Discussion: Figs. 5 and 6 show the confusion ma- trices of our approach (λ = 0.05, Beta) on the WSJ corpus as (row-normalized) heatmaps: in the former we choose the verb with the highest posterior proba- bility (argmax) while in the latter we sample the verb from the posterior probability distribution (see Sec- tion 4). The argmax way would be dominated by a few verbs (e.g., “rise”, “soar”, “fall”, and “plummet”). In contrast, random sampling would lead to a much wider variety of verbs. The experimental results of all verb selection methods reported in this paper are generated by the sampling strategy, if not indicated otherwise. It can be seen from Fig. 6 that the verbs “soar” and “plunge” are the easiest to be predicted. Generally speaking, the prediction of verbs is rela- tively more accurate for bigger percentage changes, whether upwards or downwards. This is probably be- cause there are fewer verbs available to describe such radical percentage changes (see Fig. 2) and thus the model faces less uncertainty. Most misclassification (confusion) happens when a verb is incorrectly pre- dicted to be the most frequent one (“rise” or “fall”). 5.4 Human Evaluation The two aspects, accuracy and diversity, are both im- portant for the task of verb selection. 
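Both aspects can be scored automatically; for completeness, the diversity numbers reported in Table 3 amount to the following few lines (names ours):

    from collections import Counter

    def inverse_simpson(selected_verbs):
        """Richness, Inverse Simpson diversity D = 1 / sum_i p_i^2, and evenness D / richness."""
        counts = Counter(selected_verbs)
        n = sum(counts.values())
        diversity = 1.0 / sum((c / n) ** 2 for c in counts.values())
        richness = len(counts)
        return richness, diversity, diversity / richness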
Although we have shown that automatic evaluation could be car- 521 ris e in cr ea se gr ow cl im b ju m p su rg e ga in so ar ra is e ad va nc e bo os t rise increase grow climb jump surge gain soar raise advance boost 0.0 0.1 0.2 0.3 0.4 0.5 (a) upward verbs fa ll de cl in e dr op sl ip pl un ge sl id e lo se tu m bl e pl um m et ea se de cr ea se re du ce di p sh rin k fall decline drop slip plunge slide lose tumble plummet ease decrease reduce dip shrink 0.00 0.15 0.30 0.45 0.60 (b) downward verbs Figure 5: The confusion matrix heatmap of our approach on the WSJ corpus: choosing the verb with the highest posterior probability. ris e in cr ea se gr ow cl im b ju m p su rg e ga in so ar ra is e ad va nc e bo os t rise increase grow climb jump surge gain soar raise advance boost 0.08 0.10 0.12 0.14 0.16 (a) upward verbs fa ll de cl in e dr op sl ip pl un ge sl id e lo se tu m bl e pl um m et ea se de cr ea se re du ce di p sh rin k fall decline drop slip plunge slide lose tumble plummet ease decrease reduce dip shrink 0.045 0.060 0.075 0.090 0.105 0.120 (b) downward verbs Figure 6: The confusion matrix heatmap of our approach on the WSJ corpus: sampling the verb from the posterior probability distribution. corpus method upward verbs downward verbs richness evenness diversity richness evenness diversity WSJ Our Approach 5 0.6324 3.162 5 0.4698 2.349 Thomson Reuters 11 0.8771 9.648 14 0.6821 9.550 Reuters Our Approach 3 0.7520 2.256 3 0.5933 1.780 Thomson Reuters 4 0.6453 2.581 4 0.5720 2.288 Chinese Our Approach 6 0.7965 4.779 4 0.5265 2.106 Thomson Reuters 14 0.5831 8.164 4 0.7150 2.860 Table 3: The diversity of verb selection measured by the Inverse Simpson Index. 522 corpus verbs Our Approach vs Thomson Reuters Our Approach vs Neural Network > < ≈ p-value > < ≈ p-value WSJ upward 43 32 0 0.2480 53 22 0 0.0004 downward 44 28 3 0.0764 42 32 1 0.2954 both 87 60 3 0.0316 95 54 1 0.0010 Reuters upward 37 28 10 0.3211 43 24 8 0.0271 downward 39 31 5 0.4030 50 23 2 0.0021 both 76 59 15 0.1683 93 47 10 0.0001 Chinese upward 42 30 3 0.1945 65 9 1 � 0.0001 downward 29 37 9 0.3891 37 34 4 0.8126 both 71 67 12 0.7985 102 43 5 � 0.0001 All both 234 186 30 0.0217 290 144 16 � 0.0001 Table 4: The results of human evaluation, where the p-values are given by the sign test (two-sided). ried out for either accuracy or diversity alone, there is no obvious way to assess the overall effectiveness of a verb selection method using machines only. The ultimate judgment on the quality of verb selection would have to come from human assessors (Mellish and Dale, 1998; Reiter and Belz, 2009; Smiley et al., 2016). To manually compare our approach (the version with λ = 0.05, Beta) with a baseline method (Thom- son Reuters or Neural Network), we conduct a ques- tionnaire survey with 450 multiple-choice questions. In each question, a respondent would see a pair of generated sentences describing the same percentage change with the verbs selected by two different meth- ods respectively and need to judge which one sounds better than the other (or it is hard to tell). For exam- ple, a respondent could be shown the following pair of generated sentences: (1) Net profit declines 3% (2) Net profit plummets 3% and then they were supposed to choose one of the three following options as their answer: [a] Sentence (1) sounds better. [b] Sentence (2) sounds better. [c] They are equally good. 
The respondents would be blinded to whether the first verb or the second verb was provided by our proposed method, as their appearing order would have been randomized in advance. The questionnaire survey system withheld the information about the source of each verb until the answers from all respondents had been collected, and then it would count how many times the verb selected by our proposed method was deemed better than (>), worse than (<), or as good as (≈) the verb selected by the baseline method. For each corpus, we produced 150 different ques- tions, of which half were about upward verbs and half were about downward verbs. As we have explained above, each question compares a pair of generated sentences describing the same percentage change with different verbs. The sentence generation process is the same as that used by Smiley et al. (2016). The subjects were randomly picked from the most popular ones in the corpus (e.g., “gross domestic product”), and the percentage changes (as the objects) were ran- domly sampled from the corpus as well. Each of the two verb selection methods, in comparison, would provide one verb (as the predicate) for describing that specific percentage change. Note that in this sentence generation process, a pair of sentences would be re- tained only if the verbs selected by the two methods were different, as it would be meaningless to compare two identical sentences. A total of 15 college-educated people participated in the questionnaire survey. They are all bilingual, i.e., native or fluent speakers of both English and Chinese. Each person was given 30 questions: 10 questions (including 5 upward and 5 downward ones) from each corpus. We (the authors of this paper) were excluded from participating in the questionnaire survey to avoid any conscious or unconscious bias. The results of human evaluation are shown in Ta- ble 4. Altogether, respondents prefer the verb se- lected by our approach 234/450=52% of times, as opposed to 186/450=41% for the Thomson Reuters baseline; respondents prefer the verb selected by 523 our approach 290/450=64% of times, as opposed to 144/450=32% for the Neural Network baseline. According to the sign test (Wackerly et al., 2007), our approach works significantly better than the two baseline methods, Thomson Reuters and Neural Net- work: overall the (two-sided) p-values are less than 0.05. Discussion: Our approach exhibits more superior- ity over the Thomson Reuters baseline on the English datasets than on the Chinese dataset. Since the Chi- nese dataset is bigger than the Reuters dataset, though smaller than the WSJ dataset, the performance differ- ence is not caused by corpus size but due to language characteristics. Remember that for Chinese we are actually predicting adverb+verb combinations (see Section 3.3). Retrospective manual inspection of the experimental results suggests that users seem to have relatively higher expectations of diversity for Chinese adverbs than for English verbs. 6 Extensions Robustness: It is still possible, though very un- likely, for the proposed probabilistic model to gen- erate atypical uses of a verb. A simple measure to avoid such situations is to reject the sampled verb w∗ if the posterior probability P(w∗|x) < τ where τ is a predefined threshold, e.g., 5%, and then resample w∗ until P(w∗|x) ≥ τ. 
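On top of the VerbSelector sketch from Section 4, this thresholded resampling takes only a few lines; tau, the retry cap, and the fallback below are illustrative choices rather than part of the evaluated system.

    import numpy as np

    def select_robust(model, x, tau=0.05, max_tries=100, rng=None):
        """Sample a verb from P(w|x) but reject atypical choices whose posterior falls below tau."""
        rng = rng or np.random.default_rng()
        probs = model.posterior(x)
        for _ in range(max_tries):
            i = rng.choice(len(probs), p=probs)
            if probs[i] >= tau:
                return model.verbs[i]
        return model.verbs[int(np.argmax(probs))]  # fall back to the most probable verb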
Unlimited Range: If the magnitude of a percent- age change is allowed to go beyond 100%, we would no longer be able to use the Beta distribution to fit the likelihood function P(x|w) as it is supported on a bounded interval. However, it should be straight- forward to use a flexible probability distribution sup- ported on the semi-infinite interval [0, +∞], such as the Gamma distribution. Subject: The context, in particular the subject of the percentage change, has not been taken into ac- count by the presented models. As illustrated by the two example sentences below, the same verb (“surge”) could be used for quite different percentage changes (“181%” vs “8%”) depending on the subject (“wheat price” vs “inflation”). • “According to World Bank figures, wheat prices have surged up by 181 percent in the past three years to February 2008.” • “While inflation has surged to almost 8% in 2008, it is projected by the Commission to fall in 2009.” Furthermore, the significance of a percentage change often depends on the domain, and consequently, so does the most appropriate verb to describe a per- centage change. For example, a 10% increase in stock price is interesting, while a 10% increase in body temperature is life-threatening. It is, of course, possible to incorporate the subject information into our probabilistic model by extending Eq. (2) to P(w|x,s) = P(w,s)P(x|w,s)/P(x,s) where s is the subject word in the triple. On one hand, this should make the model more effective, for the rea- sons explained above. On the other hand, this would require a lot more data for reliable estimation of the model parameters, which is one of the reasons why we leave it for future work. Language Modeling: Thanks to its probabilistic nature, our proposed model for verb selection could be seamlessly plugged into an n-gram statistical lan- guage model (Jurafsky and Martin, 2009), e.g., for the MSR Sentence Completion Challenge6. This might be able to reduce the language model’s perplex- ity, as the probability of 〈subject, verb, percentage〉 triples could be calculated more precisely. Hierarchical Modeling: The choice of verb to de- scribe a particular percentage change could be af- fected by the style of the author, the topic of the document, and other contextual factors. To take those dimensions into account and build a finer prob- abilistic model for verb selection, we could embrace Bayesian hierarchical modeling (Gelman et al., 2013; Kruschke, 2014) which, for example, could let each author’s model borrow the “statistical power” from other authors’. Psychology: There exist a lot of studies in psy- chology on how people interpret probabilities and risks (Reagan et al., 1989; Berry et al., 2004). They could provide useful insights for further enhancing our verb selection method. 7 Conclusions The major research contribution of this paper is a probabilistic model that can select appropriate verbs 6https://goo.gl/yyKBYa 524 to express percentage changes with different direc- tions and magnitudes. This model is not relying on hard-wired heuristics, but learned from training ex- amples (in the form of verb-percentage pairs) that are extracted from large-scale real-world news corpora. The choices of verbs made by the proposed model are found to match our intuitions about how differ- ent verbs are collocated with percentage changes of different sizes. The real challenge here is to strike the right balance between accuracy and diversity, which can be realized via smoothing. 
Our experi- ments have confirmed that the proposed model can capture human authors’ pattern of usage around verbs better than the existing method currently employed by Thomson Reuters EikonTM. We hope that this probabilistic model for verb selection could help data- to-text NLG systems achieve greater variation and naturalness. Acknowledgments The research is partly funded by the National Key R&D Program of China (ID: 2017YFC0803700) and the NSFC grant (No. 61532021). The Titan X Pascal GPU used for our experiments was kindly donated by the NVIDIA Corporation. Prof Xuanjing Huang (Fudan) has helped with the datasets. We thank the anonymous reviewers and the ac- tion editor for their constructive and helpful com- ments. We also gratefully acknowledge the support of Geek.AI for this work. References Lada A Adamic. 2000. Zipf, power-laws, and Pareto — A ranking tutorial. Technical report, HP Labs. Gabor Angeli, Percy Liang, and Dan Klein. 2010. A simple domain-independent probabilistic approach to generation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 502–512. Anja Belz and Eric Kow. 2009. System building cost vs. output quality in data-to-text generation. In Pro- ceedings of the 12th European Workshop on Natural Language Generation (ENLG), pages 16–24. Anja Belz. 2008. Automatic generation of weather fore- cast texts using comprehensive probabilistic generation- space models. Natural Language Engineering (NLE), 14(04):431–455. Dianne Berry, Theo Raynor, Peter Knapp, and Elisabetta Bersellini. 2004. Over the counter medicines and the need for immediate action: A further evaluation of Eu- ropean Commission recommended wordings for com- municating risk. Patient Education and Counseling, 53(2):129–134. Steven Bird, Ewan Klein, and Edward Loper. 2009. Natu- ral Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O’Reilly Media. Sarah Boyd. 1998. TREND: A system for generating intelligent descriptions of time series data. In Pro- ceedings of the 2nd IEEE International Conference on Intelligent Processing Systems (ICIPS). Eugene Charniak, Don Blaheta, Niyu Ge, Keith Hall, John Hale, and Mark Johnson. 2000. BLLIP 1987-89 WSJ Corpus Release 1 LDC2000T43. Web Download. Philadelphia: Linguistic Data Consortium. Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. 2013. Bayesian Data Analysis. CRC, 3rd edition. Eli Goldberg, Norbert Driedger, and Richard I. Kittredge. 1994. Using natural-language processing to produce weather forecasts. IEEE Expert, 9(2):45–53. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. Richard H.R. Hahnloser, Rahul Sarpeshkar, Misha A. Ma- howald, Rodney J. Douglas, and H. Sebastian Seung. 2000. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405(6789):947–951. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning: Data Min- ing, Inference, and Prediction. Springer, 2nd edition. Frederick Jelinek and Robert Mercer, 1980. Interpolated Estimation of Markov Source Parameters from Sparse Data, pages 381–402. North-Holland Publishing. Daniel Jurafsky and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 2nd edition. Dave S Kerby. 2014. The simple difference formula: An approach to teaching nonparametric correlation. 
Com- prehensive Psychology, 3(1). Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. Suneel Kumar Kingrani, Mark Levene, and Dell Zhang. 2015. Diversity analysis of web search results. In Pro- ceedings of the Annual International ACM Web Science conference (WebSci). Ioannis Konstas and Mirella Lapata. 2013. A global model for concept-to-text generation. Journal of Artifi- cial Intelligence Research (JAIR), 48:305–346. 525 Manfred Krifka. 2007. Approximate interpretations of number words: A case for strategic communication. In Cognitive Foundations of Interpretation, pages 111– 126. John K Kruschke. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. Academic Press, 2nd edition. Anne E. Magurran. 1988. Ecological Diversity and Its Measurement. Princeton University Press. Christopher D. Manning, Prabhakar Raghavan, and Hin- rich Schütze. 2008. Introduction to Information Re- trieval. Cambridge University Press. Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David Mc- Closky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguis- tics (ACL), System Demonstrations, pages 55–60. Hongyuan Mei, Mohit Bansal, and Matthew R. Walter. 2016. What to talk about and how? Selective gen- eration using LSTMs with coarse-to-fine alignment. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL- HLT), pages 720–730. Chris Mellish and Robert Dale. 1998. Evaluation in the context of natural language generation. Computer Speech & Language, 12(4):349–373. Priscilla Moraes, Kathleen McCoy, and Sandra Carberry. 2014. Adapting graph summaries to the users’ reading levels. In Proceedings of the 8th International Natural Language Generation Conference (INLG), pages 64– 73. Mark E. J. Newman. 2005. Power laws, Pareto distribu- tions and Zipf’s law. Contemporary Physics, 46(5):323– 351. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. BLEU: A method for automatic evalu- ation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), pages 311–318. Vassilis Plachouras, Charese Smiley, Hiroko Bretz, Ola Taylor, Jochen L. Leidner, Dezhao Song, and Frank Schilder. 2016. Interacting with financial data us- ing natural language. In Proceedings of the 39th In- ternational ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 1121–1124. David MW Powers. 1998. Applications and explanations of Zipf’s law. In Proceedings of the Joint Conferences on New Methods in Language Processing and Computa- tional Natural Language Learning (NeMLaP/CoNLL), pages 151–160. Dragomir R. Radev, Hong Qi, Harris Wu, and Weiguo Fan. 2002. Evaluating web-based question answer- ing systems. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC). Alejandro Ramos-Soto, Alberto Bugarı́n, Senén Barro, and Juan Taboada. 2013. Automatic generation of textual short-term weather forecasts on real prediction data. In Proceedings of the 10th International Confer- ence on Flexible Query Answering Systems (FQAS), pages 269–280. Josef Raviv. 1967. Decision making in Markov chains applied to the problem of pattern recognition. IEEE Transactions on Information Theory, 13(4):536–551. Robert T. 
Reagan, Frederick Mosteller, and Cleo Youtz. 1989. Quantitative meanings of verbal probability ex- pressions. Journal of Applied Psychology, 74(3):433. Ehud Reiter and Anja Belz. 2009. An investigation into the validity of some metrics for automatically evalu- ating natural language generation systems. Computa- tional Linguistics, 35(4):529–558. Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. 2005. Choosing words in computer- generated weather forecasts. Artificial Intelligence, 167(1-2):137–169. Ehud Reiter. 2007. An architecture for data-to-text sys- tems. In Proceedings of the 11th European Workshop on Natural Language Generation (ENLG), pages 97– 104. Stuart Russell and Peter Norvig. 2009. Artificial Intelli- gence: A Modern Approach. Prentice Hall, 3rd edition. David W Scott. 2015. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons. Claude E. Shannon. 1948. A mathematical theory of com- munication. Bell System Technical Journal, 27:623– 656. Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From Eliza to XiaoIce: Challenges and opportunities with social chatbots. arXiv preprint arXiv:1801.01957. Edward H Simpson. 1949. Measurement of diversity. Nature. Charese Smiley, Vassilis Plachouras, Frank Schilder, Hi- roko Bretz, Jochen L. Leidner, and Dezhao Song. 2016. When to plummet and when to soar: Corpus based verb selection for natural language generation. In Pro- ceedings of the 9th International Natural Language Generation Conference (INLG), pages 36–39. Somayajulu Sripada, Neil Burnett, Ross Turner, John Mastin, and Dave Evans. 2014. A case study: NLG meeting weather industry demand for quality and quan- tity of textual weather forecasts. In Proceedings of the 8th International Natural Language Generation Con- ference (INLG), pages 1–5. 526 Ellen M. Voorhees. 1999. The TREC-8 question an- swering track report. In Proceedings of the 8th Text REtrieval Conference (TREC), pages 77–82. Dennis Wackerly, William Mendenhall, and Richard Scheaffer. 2007. Mathematical Statistics with Applica- tions. Nelson Education. Frank Wilcoxon. 1945. Individual comparisons by rank- ing methods. Biometrics Bulletin, 1(6):80–83. Alex Wright. 2015. Algorithmic authors. Communica- tions of the ACM (CACM), 58(11):12–14. Chengxiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to in- formation retrieval. ACM Transactions on Information Systems (TOIS), 22(2):179–214. 527 528