Modeling Semantic Expectation: Using Script Knowledge for Referent Prediction

Ashutosh Modi(1,3)  Ivan Titov(2,4)  Vera Demberg(1,3)  Asad Sayeed(1,3)  Manfred Pinkal(1,3)
(1) {ashutosh,vera,asayeed,pinkal}@coli.uni-saarland.de  (2) titov@uva.nl
(3) Universität des Saarlandes, Germany
(4) ILLC, University of Amsterdam, the Netherlands

Abstract

Recent research in psycholinguistics has provided increasing evidence that humans predict upcoming content. Prediction also affects perception and might be a key to robustness in human language processing. In this paper, we investigate the factors that affect human prediction by building a computational model that can predict upcoming discourse referents based on linguistic knowledge alone vs. linguistic knowledge jointly with common-sense knowledge in the form of scripts. We find that script knowledge significantly improves model estimates of human predictions. In a second study, we test the highly controversial hypothesis that predictability influences referring expression type but do not find evidence for such an effect.

1 Introduction

Being able to anticipate upcoming content is a core property of human language processing (Kutas et al., 2011; Kuperberg and Jaeger, 2016) that has received a lot of attention in the psycholinguistic literature in recent years. Expectations about upcoming words help humans comprehend language in noisy settings and deal with ungrammatical input. In this paper, we use a computational model to address the question of how different layers of knowledge (linguistic knowledge as well as common-sense knowledge) influence human anticipation.

Here we focus our attention on semantic predictions of discourse referents for upcoming noun phrases. This task is particularly interesting because it allows us to separate the semantic task of anticipating an intended referent and the processing of the actual surface form. For example, in the context of I ordered a medium sirloin steak with fries. Later, the waiter brought . . .
, there is a strong expectation of a specific discourse referent, i.e., the referent introduced by the object NP of the preceding sentence, while the possible referring expression could be either the steak I had ordered, the steak, our food, or it. Existing models of human prediction are usually formulated using the information-theoretic concept of surprisal. In recent work, however, surprisal is usually not computed for DRs, which represent the relevant semantic unit, but for the surface form of the referring expressions, even though there is an increasing amount of literature suggesting that human expectations at different levels of representation have separable effects on prediction and, as a consequence, that the modelling of only one level (the linguistic surface form) is insufficient (Kuperberg and Jaeger, 2016; Kuperberg, 2016; Zarcone et al., 2016). The present model addresses this shortcoming by explicitly modelling and representing common-sense knowledge and conceptually separating the semantic (discourse referent) and the surface level (referring expression) expectations.

Our discourse referent prediction task is related to the NLP task of coreference resolution, but it substantially differs from that task in the following ways: 1) we use only the incrementally available left context, while coreference resolution uses the full text; 2) coreference resolution tries to identify the DR for a given target NP in context, while we look at the expectations of DRs based only on the context before the target NP is seen.

The distinction between referent prediction and prediction of referring expressions also allows us to study a closely related question in natural language generation: the choice of a type of referring expression based on the predictability of the DR that is intended by the speaker. This part of our work is inspired by a referent guessing experiment by Tily and Piantadosi (2009), who showed that highly predictable referents were more likely to be realized with a pronoun than unpredictable referents, which were more likely to be realized using a full NP. The effect they observe is consistent with a Gricean point of view, or the principle of uniform information density (see Section 5.1). However, Tily and Piantadosi do not provide a computational model for estimating referent predictability. Also, they do not include selectional preference or common-sense knowledge effects in their analysis.

We believe that script knowledge, i.e., common-sense knowledge about everyday event sequences, represents a good starting point for modelling conversational anticipation. This type of common-sense knowledge includes temporal structure which is particularly relevant for anticipation in continuous language processing. Furthermore, our approach can build on progress that has been made in recent years in methods for acquiring large-scale script knowledge; see Section 1.1. Our hypothesis is that script knowledge may be a significant factor in human anticipation of discourse referents. Explicitly modelling this knowledge will thus allow us to produce more human-like predictions.
Script knowledge enables our model to generate anticipations about discourse referents that have already been mentioned in the text, as well as anticipations about textually new discourse referents which have been activated due to script knowledge. By modelling event sequences and event participants, our model captures many more long-range dependencies than normal language models are able to. As an example, consider the following two alternative text passages:

We got seated, and had to wait for 20 minutes. Then, the waiter brought the ...
We ordered, and had to wait for 20 minutes. Then, the waiter brought the ...

Preferred candidate referents for the object position of the waiter brought the ... are instances of the food, menu, or bill participant types. In the context of the alternative preceding sentences, there is a strong expectation of instances of a menu and a food participant, respectively.

This paper represents foundational research investigating human language processing. However, it also has the potential for application in assistant technology and embodied agents. The goal is to achieve human-level language comprehension in realistic settings, and in particular to achieve robustness in the face of errors or noise. Explicitly modelling expectations that are driven by common-sense knowledge is an important step in this direction.

In order to be able to investigate the influence of script knowledge on discourse referent expectations, we use a corpus that contains frequent reference to script knowledge, and provides annotations for coreference information, script events and participants (Section 2). In Section 3, we present a large-scale experiment for empirically assessing human expectations on upcoming referents, which allows us to quantify at what points in a text humans have very clear anticipations vs. when they do not. Our goal is to model human expectations, even if they turn out to be incorrect in a specific instance. The experiment was conducted via Mechanical Turk and follows the methodology of Tily and Piantadosi (2009). In Section 4, we describe our computational model that represents script knowledge. The model is trained on the gold standard annotations of the corpus, because we assume that human comprehenders usually will have an analysis of the preceding discourse which closely corresponds to the gold standard. We compare the prediction accuracy of this model to human predictions, as well as to two baseline models in Section 4.3. One of them uses only structural linguistic features for predicting referents; the other uses general script-independent selectional preference features. In Section 5, we test whether surprisal (as estimated from human guesses vs. computational models) can predict the type of referring expression used in the original texts in the corpus (pronoun vs. full referring expression). This experiment also has wider implications with respect to the on-going discussion of whether the referring expression choice is dependent on predictability, as predicted by the uniform information density hypothesis.

The contributions of this paper consist of:
• a large dataset of human expectations, in a variety of texts related to every-day activities.
• an implementation of the conceptual distinction between the semantic level of referent prediction and the type of a referring expression.
• a computational model which significantly improves modelling of human anticipations.
• showing that script knowledge is a significant factor in human expectations.
• testing the hypothesis of Tily and Piantadosi that the choice of the type of referring expression (pronoun or full NP) depends on the predictability of the referent.

(I)_(1:bather) [decided]_(wash) to take a (bath)_(2:bath) yesterday afternoon after working out. Once (I)_(1:bather) got back home, (I)_(1:bather) [walked]_(enter bathroom) to (my)_(1:bather) (bathroom)_(3:bathroom) and first quickly scrubbed the (bathroom tub)_(4:bathtub) by [turning on]_(turn water on) the (water)_(5:water) and rinsing (it)_(4:bathtub) clean with a rag. After (I)_(1:bather) finished, (I)_(1:bather) [plugged]_(close drain) the (tub)_(4:bathtub) and began [filling]_(fill water) (it)_(4:bathtub) with warm (water)_(5:water) set at about 98 (degrees)_(6:temperature).

Figure 1: An excerpt from a story in the InScript corpus. Referring expressions are in parentheses, followed by the index of the corresponding discourse referent and its participant type; referring expressions of the same discourse referent share the same index. Script-relevant events are in square brackets, followed by the event type.

1.1 Scripts

Scripts represent knowledge about typical event sequences (Schank and Abelson, 1977), for example the sequence of events happening when eating at a restaurant. Script knowledge thereby includes events like order, bring and eat as well as participants of those events, e.g., menu, waiter, food, guest. Existing methods for acquiring script knowledge are based on extracting narrative chains from text (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009; Jans et al., 2012; Pichotta and Mooney, 2014; Rudinger et al., 2015; Modi, 2016; Ahrendt and Demberg, 2016) or by eliciting script knowledge via crowdsourcing on Mechanical Turk (Regneri et al., 2010; Frermann et al., 2014; Modi and Titov, 2014).

Modelling anticipated events and participants is motivated by evidence showing that event representations in humans contain information not only about the current event, but also about previous and future states, that is, humans generate anticipations about event sequences during normal language comprehension (Schütz-Bosbach and Prinz, 2007). Script knowledge representations have been shown to be useful in NLP applications for ambiguity resolution during reference resolution (Rahman and Ng, 2012).

2 Data: The InScript Corpus

Ordinary texts, including narratives, encode script structure in a way that is too complex and too implicit at the same time to enable a systematic study of script-based expectation. They contain interleaved references to many different scripts, and they usually refer to single scripts in a point-wise fashion only, relying on the ability of the reader to infer the full event chain using their background knowledge. We use the InScript corpus (Modi et al., 2016) to study the predictive effect of script knowledge. InScript is a crowdsourced corpus of simple narrative texts. Participants were asked to write about a specific activity (e.g., a restaurant visit, a bus ride, or a grocery shopping event) which they personally experienced, and they were instructed to tell the story as if explaining the activity to a child. This resulted in stories that are centered around a specific scenario and that explicitly mention mundane details.
Thus, they generally realize longer event chains associated with a single script, which makes them particularly appropriate to our purpose.

The InScript corpus is labelled with event-type, participant-type, and coreference information. Full verbs are labeled with event type information, heads of all noun phrases with participant types, using scenario-specific lists of event types (such as enter bathroom, close drain and fill water for the "taking a bath" scenario) and participant types (such as bather, water and bathtub). On average, each template offers a choice of 20 event types and 18 participant types. The InScript corpus consists of 910 stories addressing 10 scenarios (about 90 stories per scenario). The corpus has 200,000 words, 12,000 verb instances with event labels, and 44,000 head nouns with participant instances. Modi et al. (2016) report an inter-annotator agreement of 0.64 for event types and 0.77 for participant types (Fleiss' kappa).

We use gold-standard event- and participant-type annotation to study the influence of script knowledge on the expectation of discourse referents. In addition, InScript provides coreference annotation, which makes it possible to keep track of the mentioned discourse referents at each point in the story. We use this information in the computational model of DR prediction and in the DR guessing experiment described in the next section. An example of an annotated InScript story is shown in Figure 1.

3 Referent Cloze Task

We use the InScript corpus to develop computational models for the prediction of discourse referents (DRs) and to evaluate their prediction accuracy. This can be done by testing how often our models manage to reproduce the original discourse referent (cf. also the "narrative cloze" task of Chambers and Jurafsky (2008), which tests whether a verb together with a role can be correctly guessed by a model). However, we do not only want to predict the "correct" DRs in a text but also to model human expectation of DRs in context. To empirically assess human expectation, we created an additional database of crowdsourced human predictions of discourse referents in context using Amazon Mechanical Turk. The design of our experiment closely resembles the guessing game of Tily and Piantadosi (2009) but extends it in a substantial way.

(I)(1) decided to take a (bath)(2) yesterday afternoon after working out. Once (I)(1) got back home, (I)(1) walked to (my)(1) (bathroom)(3) and first quickly scrubbed the (bathroom tub)(4) by turning on the (water)(5) and rinsing (it)(4) clean with a rag. After (I)(1) finished, (I)(1) plugged XXXXXX

Figure 2: An illustration of the Mechanical Turk experiment for the referent cloze task. Workers are supposed to guess the upcoming referent (indicated by XXXXXX above). They can either choose from the previously activated referents, or they can write something new.

Figure 3 (bar chart; y-axis: number of workers): Response of workers corresponding to the story in Figure 2. Workers guessed two already activated discourse referents (DRs), DR 4 and DR 1. Some of the workers also chose the "new" option and wrote different lexical variants of "bathtub drain", a new DR corresponding to the participant type "the drain".
Workers had to read stories of the InScript corpus (available at http://www.sfb1102.uni-saarland.de/?page_id=2582) and guess upcoming participants: for each target NP, workers were shown the story up to this NP excluding the NP itself, and they were asked to guess the next person or object most likely to be referred to. In case they decided in favour of a discourse referent already mentioned, they had to choose among the available discourse referents by clicking an NP in the preceding text, i.e., some noun with a specific, coreference-indicating color; see Figure 2. Otherwise, they would click the "New" button, and would in turn be asked to give a short description of the new person or object they expected to be mentioned. The percentage of guesses that agree with the actually referred entity was taken as a basis for estimating the surprisal. The experiment was done for all stories of the test set: 182 stories (20%) of the InScript corpus, evenly taken from all scenarios. Since our focus is on the effect of script knowledge, we only considered those NPs as targets that are direct dependents of script-related events. Guessing started from the third sentence only, in order to ensure that a minimum of context information was available. To keep the complexity of the context manageable, we restricted guessing to a maximum of 30 targets and skipped the rest of the story (this applied to 12% of the stories). We collected 20 guesses per NP for 3346 noun phrase instances, which amounts to a total of around 67K guesses. Workers selected a context NP in 68% of cases and "New" in 32% of cases.

Our leading hypothesis is that script knowledge substantially influences human expectation of discourse referents. The guessing experiment provides a basis to estimate human expectation of already mentioned DRs (the number of clicks on the respective NPs in text). However, we expect that script knowledge has a particularly strong influence in the case of first mentions. Once a script is evoked in a text, we assume that the full script structure, including all participants, is activated and available to the reader.

Tily and Piantadosi (2009) are interested in second mentions only and therefore do not make use of the worker-generated noun phrases classified as "New". To study the effect of activated but not explicitly mentioned participants, we carried out a subsequent annotation step on the worker-generated noun phrases classified as "New". We presented annotators with these noun phrases in their contexts (with co-referring NPs marked by color, as in the M-Turk experiment) and, in addition, displayed all participant types of the relevant script (i.e., the script associated with the text in the InScript corpus). Annotators did not see the "correct" target NP. We asked annotators to either (1) select the participant type instantiated by the NP (if any), (2) label the NP as unrelated to the script, or (3) link the NP to an overt antecedent in the text, in the case that the NP is actually a second mention that had been erroneously labeled as new by the worker. Option (1) provides a basis for a fine-grained estimation of first-mention DRs. Option (3), which we added when we noticed the considerable number of overlooked antecedents, serves as a correction of the results of the M-Turk experiment.
Out of the 22K annotated "New" cases, 39% were identified as second mentions, 55% were linked to a participant type, and 6% were classified as really novel.

4 Referent Prediction Model

In this section, we describe the model we use to predict upcoming discourse referents (DRs).

4.1 Model

Our model should not only assign probabilities to DRs already explicitly introduced in the preceding text fragment (e.g., "bath" or "bathroom" for the cloze task in Figure 2) but also reserve some probability mass for 'new' DRs, i.e., DRs activated via the script context or completely novel ones not belonging to the script. In principle, different variants of the activation mechanism must be distinguished. For many participant types, a single participant belonging to a specific semantic class is expected (referred to with the bathtub or the soap). In contrast, the "towel" participant type may activate a set of objects, elements of which then can be referred to with a towel or another towel. The "bath means" participant type may even activate a group of DRs belonging to different semantic classes (e.g., bubble bath and salts). Since it is not feasible to enumerate all potential participants, for 'new' DRs we only predict their participant type ("bath means" in our example). In other words, the number of categories in our model is equal to the number of previously introduced DRs plus the number of participant types of the script plus 1, reserved for a new DR not corresponding to any script participant (e.g., cellphone). In what follows, we slightly abuse the terminology and refer to all these categories as discourse referents.

Unlike standard co-reference models, which predict co-reference chains relying on the entire document, our model is incremental, that is, when predicting a discourse referent d(t) at a given position t, it can look only in the history h(t) (i.e., the preceding part of the document), excluding the referring expression (RE) for the predicted DR. We also assume that past REs are correctly resolved and assigned to correct participant types (PTs). Typical NLP applications use automatic coreference resolution systems, but since we want to model human behavior, this might be inappropriate, since an automated system would underestimate human performance. This may be a strong assumption, but for reasons explained above, we use gold standard past REs.

We use the following log-linear model ("softmax regression"):

p(d^{(t)} = d \mid h^{(t)}) = \frac{\exp(w^\top f(d, h^{(t)}))}{\sum_{d'} \exp(w^\top f(d', h^{(t)}))},

where f is the feature function we will discuss in the following subsection, w are model parameters, and the summation in the denominator is over the set of categories described above.

Table 1: Summary of feature types.
Recency | Shallow linguistic
Frequency | Shallow linguistic
Grammatical function | Shallow linguistic
Previous subject | Shallow linguistic
Previous object | Shallow linguistic
Previous RE type | Shallow linguistic
Selectional preferences | Linguistic
Participant type fit | Script
Predicate schemas | Script
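For concreteness, the following is a minimal sketch (our own illustration, not the authors' code) of how such a log-linear scorer over candidate referents can be implemented; the feature layout and values are hypothetical.

```python
import numpy as np

def referent_distribution(feature_matrix: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Softmax over candidate discourse referents.

    feature_matrix: shape (num_candidates, num_features), one row per category
                    (previously mentioned DRs, script participant types, 'new').
    w:              shape (num_features,), learned feature weights.
    Returns p(d | h) for every candidate.
    """
    scores = feature_matrix @ w          # w^T f(d, h) for each candidate d
    scores -= scores.max()               # stabilize the exponentials
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy usage: three candidates described by three hypothetical features
# (recency, frequency, participant-type fit).
features = np.array([[0.9, 2.0, 0.7],
                     [0.1, 1.0, 0.4],
                     [0.0, 0.0, 0.6]])
weights = np.array([1.5, 0.3, 2.0])
print(referent_distribution(features, weights))   # probabilities summing to 1
```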
Some of the features included in f are a function of the predicate syntactically governing the unobservable target RE (corresponding to the DR being predicted). However, in our incremental setting, the predicate is not available in the history h(t) for subject NPs. In this case, we use an additional probabilistic model, which estimates the probability of the predicate v given the context h(t), and marginalize out its predictions:

p(d^{(t)} = d \mid h^{(t)}) = \sum_{v} p(v \mid h^{(t)}) \, \frac{\exp(w^\top f(d, h^{(t)}, v))}{\sum_{d'} \exp(w^\top f(d', h^{(t)}, v))}

The predicate probabilities p(v | h(t)) are computed based on the sequence of preceding predicates (i.e., ignoring any other words) using a recurrent neural network language model estimated on our training set; we used the RNNLM toolkit (Mikolov et al., 2011; Mikolov et al., 2010) with default settings. The expression f(d, h(t), v) denotes the feature function computed for the referent d, given the history composed of h(t) and the predicate v.

4.2 Features

Our features encode properties of a DR as well as characterize its compatibility with the context. We face two challenges when designing our features. First, although the sizes of our datasets are respectable from the script annotation perspective, they are too small to learn a richly parameterized model. For many of our features, we address this challenge by using external word embeddings (300-dimensional word embeddings estimated on Wikipedia with the skip-gram model of Mikolov et al. (2013): https://code.google.com/p/word2vec/) and associate parameters with some simple similarity measures computed using these embeddings. Consequently, there are only a few dozen parameters which need to be estimated from scenario-specific data. Second, in order to test our hypothesis that script information is beneficial for the DR prediction task, we need to disentangle the influence of script information from general linguistic knowledge. We address this by carefully splitting the features apart, even if it prevents us from modeling some interplay between the sources of information. We will describe both classes of features below; also see a summary in Table 1.

4.2.1 Shallow Linguistic Features

These features are based on Tily and Piantadosi (2009). In addition, we consider a selectional preference feature.

Recency feature. This feature captures the distance l_t(d) between the position t and the last occurrence of the candidate DR d. As a distance measure, we use the number of sentences from the last mention and exponentiate this number to make the dependence more extreme; only very recent DRs will receive a noticeable weight: exp(−l_t(d)). This feature is set to 0 for new DRs.

Frequency. The frequency feature indicates the number of times the candidate discourse referent d has been mentioned so far. We do not perform any bucketing.

Grammatical function. This feature encodes the dependency relation assigned to the head word of the last mention of the DR, or a special none label if the DR is new.

Previous subject indicator. This binary feature indicates whether the candidate DR d is coreferential with the subject of the previous verbal predicate.

Previous object indicator. The same, but for the object position.

Previous RE type. This three-valued feature indicates whether the previous mention of the candidate DR d is a pronoun, a non-pronominal noun phrase, or has never been observed before.
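Purely as an illustration of how these shallow cues could be turned into feature values (the data structures and names below are our own, not the paper's), consider this sketch:

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Mention:
    sentence_idx: int      # sentence in which the mention occurs
    dep_relation: str      # dependency relation of its head word, e.g. "nsubj"
    is_pronoun: bool

def shallow_features(mentions: List[Mention], current_sentence: int,
                     prev_subject_dr: Optional[int], prev_object_dr: Optional[int],
                     dr_id: int) -> Dict[str, float]:
    """Shallow linguistic features for one candidate DR (hypothetical layout)."""
    if not mentions:                        # a 'new' DR: no mention history
        return {"recency": 0.0, "frequency": 0.0, "gram_none": 1.0,
                "prev_subj": 0.0, "prev_obj": 0.0, "re_type_unseen": 1.0}
    last = mentions[-1]
    distance = current_sentence - last.sentence_idx
    return {
        "recency": math.exp(-distance),             # exp(-l_t(d))
        "frequency": float(len(mentions)),           # number of previous mentions
        f"gram_{last.dep_relation}": 1.0,            # grammatical function of last mention
        "prev_subj": float(prev_subject_dr == dr_id),
        "prev_obj": float(prev_object_dr == dr_id),
        ("re_type_pronoun" if last.is_pronoun else "re_type_full_np"): 1.0,
    }
```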
The similarities are calculated using a Distributional Memory approach similar to that of Baroni and Lenci (2010). Their structured vector space representation has been shown to work well on tasks that evaluate correlation with human the- matic fit estimates (Baroni and Lenci, 2010; Baroni et al., 2014; Sayeed et al., 2016) and is thus suited to our task. The representation xd is computed as an aver- age of head word representations of all the previ- ous mentions of DR d, where the word vectors are obtained from the TypeDM model of Baroni and Lenci (2010). This is a count-based, third-order co- occurrence tensor whose indices are a word w0, a second word w1, and a complex syntactic relation r, which is used as a stand-in for a semantic link. The values for each (w0,r,w1) cell of the tensor are the local mutual information (LMI) estimates obtained from a dependency-parsed combination of large cor- pora (ukWaC, BNC, and Wikipedia). Our procedure has some differences with that of Baroni and Lenci. For example, for estimating the fit of an alternative new DR (in other words, xd based on no previous mentions), we use an aver- age over head words of all REs in the training set, a “null referent.” xv,r is calculated as the average of the top 20 (by LMI) r-fillers for v in TypeDM; in other words, the prototypical instrument of rub may be represented by summing vectors like towel, soap, eraser, coin. . . If the predicate has not yet been en- countered (as for subject positions), scores for all scenario-relevant verbs are emitted for marginaliza- tion. 4.2.3 Script Features In this section, we describe features which rely on script information. Our goal will be to show that such common-sense information is beneficial in per- forming DR prediction. We consider only two script features. Participant type fit This feature characterizes how well the participant type (PT) of the candidate DR d fits a specific syn- tactic role r of the governing predicate v; it can be regarded as a generalization of the selectional prefer- ence feature to participant types and also its special- isation to the considered scenario. Given the candi- date DR d, its participant type p, and the syntactic (I)(1) decided to take a (bath)(2) yesterday afternoon after working out . (I)(1) was getting ready to go out and needed to get cleaned before (I)(1) went so (I)(1) decided to take a (bath)(2). (I)(1) filled the (bath- tub)(3) with warm (water)(4) and added some (bub- ble bath)(5). (I)(1) got undressed and stepped into the (water)(4). (I)(1) grabbed the (soap)(5) and rubbed it on (my)(1) (body)(7) and rinsed XXXXXX Figure 4: An example of the referent cloze task. Similar to the Mechanical Turk experiment (Figure 2), our refer- ent prediction model is asked to guess the upcoming DR. relation r, we collect all the predicates in the train- ing set which have the participant type p in the posi- tion r. The embedding of the DR xp,r is given by the average embedding of these predicates. The feature is computed as the dot product of xp,r and the word embedding of the predicate v. Predicate schemas The following feature captures a specific aspect of knowledge about prototypical sequences of events. This knowledge is called predicate schemas in the recent co-reference modeling work of Peng et al. (2015). In predicate schemas, the goal is to model pairs of events such that if a DR d participated in the first event (in a specific role), it is likely to partici- pate in the second event (again, in a specific role). 
(I)(1) decided to take a (bath)(2) yesterday afternoon after working out. (I)(1) was getting ready to go out and needed to get cleaned before (I)(1) went so (I)(1) decided to take a (bath)(2). (I)(1) filled the (bathtub)(3) with warm (water)(4) and added some (bubble bath)(5). (I)(1) got undressed and stepped into the (water)(4). (I)(1) grabbed the (soap)(5) and rubbed it on (my)(1) (body)(7) and rinsed XXXXXX

Figure 4: An example of the referent cloze task. Similar to the Mechanical Turk experiment (Figure 2), our referent prediction model is asked to guess the upcoming DR.

Table 2: Summary of model features.
Base | shallow linguistic features: Recency, Frequency, Grammatical function, Previous subject, Previous object
Linguistic | shallow linguistic features + Selectional preferences
Script | shallow linguistic features + Selectional preferences + script features: Participant type fit, Predicate schemas

Predicate schemas. The following feature captures a specific aspect of knowledge about prototypical sequences of events. This knowledge is called predicate schemas in the recent co-reference modeling work of Peng et al. (2015). In predicate schemas, the goal is to model pairs of events such that if a DR d participated in the first event (in a specific role), it is likely to participate in the second event (again, in a specific role). For example, in the restaurant scenario, if one observes a phrase John ordered, one is likely to see John waited somewhere later in the document. The specific arguments are not that important (whether it is John or some other DR); what is important is that the argument is reused across the predicates. This would correspond to the rule X-subject-of-order → X-subject-of-eat. (In this work, we limit ourselves to rules where the syntactic function is the same on both sides of the rule. In other words, we can, in principle, encode the pattern X pushed Y → X apologized but not the pattern X pushed Y → Y cried.) Unlike the previous work, our dataset is small, so we cannot induce these rules directly, as there would be very few rules and the model would not generalize to new data well enough. Instead, we again encode this intuition using similarities in the real-valued embedding space.

Recall that our goal is to compute a feature ϕ(d, h(t)) indicating how likely a potential DR d is to follow, given the history h(t). For example, imagine that the model is asked to predict the DR marked by XXXXXX in Figure 4. Predicate-schema rules can only yield previously introduced DRs, so the score ϕ(d, h(t)) = 0 for any new DR d. Let us use "soap" as an example of a previously introduced DR and see how the feature is computed. In order to choose which inference rules can be applied to yield "soap", we can inspect Figure 4. There are only two preceding predicates which have the DR "soap" as their object (rubbed and grabbed), resulting in two potential rules, X-object-of-grabbed → X-object-of-rinsed and X-object-of-rubbed → X-object-of-rinsed. We define the score ϕ(d, h(t)) as the average of the rule scores. More formally, we can write

\phi(d, h^{(t)}) = \frac{1}{|N(d, h^{(t)})|} \sum_{(u,v,r) \in N(d, h^{(t)})} \psi(u, v, r),   (1)

where ψ(u, v, r) is the score for a rule X-r-of-u → X-r-of-v, N(d, h(t)) is the set of applicable rules, and |N(d, h(t))| denotes its cardinality. (In all our experiments, rather than considering all potential predicates in the history to instantiate rules, we take into account only the 2 preceding verbs; in other words, u and v can be interleaved by at most one verb, and |N(d, h(t))| is in {0, 1, 2}.) We define ϕ(d, h(t)) as 0 when the set of applicable rules is empty (i.e., |N(d, h(t))| = 0).

We define the scoring function ψ(u, v, r) as a linear function of a joint embedding x_{u,v} of the verbs u and v: ψ(u, v, r) = α_r^T x_{u,v}. The two remaining questions are (1) how to define the joint embeddings x_{u,v}, and (2) how to estimate the parameter vector α_r. The joint embedding of two predicates, x_{u,v}, can, in principle, be any composition function of the embeddings of u and v, for example their sum or component-wise product. Inspired by Bordes et al. (2013), we use the difference between the word embeddings:

\psi(u, v, r) = \alpha_r^\top (x_u - x_v),

where x_u and x_v are external embeddings of the corresponding verbs.
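Under these definitions, the predicate-schema score for a candidate DR can be sketched as follows (our own toy illustration: the "history" is reduced to the last two verbs with their argument DRs, and the embeddings are random stand-ins):

```python
import numpy as np

def predicate_schema_score(candidate_dr: int, target_verb: str, target_role: str,
                           preceding_events: list, emb: dict, alpha: np.ndarray) -> float:
    """phi(d, h): average of psi(u, v, r) over applicable rules (cf. equation (1)),
    with psi(u, v, r) = alpha_r . (x_u - x_v) as defined above.

    preceding_events: list of (verb, role, dr_id) triples for the last two verbs.
    """
    rules = [(u, target_verb, role) for (u, role, dr) in preceding_events
             if dr == candidate_dr and role == target_role]
    if not rules:                                   # new DR, or no applicable rule
        return 0.0
    scores = [float(alpha @ (emb[u] - emb[v])) for (u, v, _r) in rules]
    return sum(scores) / len(scores)

# Toy usage mirroring the "soap" example: two applicable object rules.
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(5) for w in ["grabbed", "rubbed", "rinsed"]}
alpha_obj = np.ones(5)                              # stand-in for the learned alpha_obj
events = [("grabbed", "obj", 5), ("rubbed", "obj", 5)]   # DR 5 = "soap"
print(predicate_schema_score(5, "rinsed", "obj", events, emb, alpha_obj))
```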
Encoding the succession relation as translation in the embedding space has one desirable property: the scoring function will be largely agnostic to the morphological form of the predicates. For example, the difference between the embeddings of rinsed and rubbed is very similar to that of rinse and rub (Botha and Blunsom, 2014), so the corresponding rules will receive similar scores. Now, we can rewrite equation (1) as

\phi(d, h^{(t)}) = \alpha_{r(h^{(t)})}^\top \sum_{(u,v,r) \in N(d, h^{(t)})} \frac{x_u - x_v}{|N(d, h^{(t)})|},   (2)

where r(h(t)) denotes the syntactic function corresponding to the DR being predicted (object in our example).

As for the parameter vector α_r, there are again a number of potential ways in which it can be estimated. For example, one can train a discriminative classifier to estimate the parameters. However, we opted for a simpler approach: we set it equal to the empirical estimate of the expected feature vector x_{u,v} on the training set (this essentially corresponds to using a Naive Bayes model with the simplistic assumption that the score differences are normally distributed with spherical covariance matrices):

\alpha_r = \frac{1}{D_r} \sum_{l,t} \delta_r(r(h^{(l,t)})) \sum_{(u,v,r') \in N(d^{(l,t)}, h^{(l,t)})} (x_u - x_v),   (3)

where l refers to a document in the training set, t is (as before) a position in the document, and h^{(l,t)} and d^{(l,t)} are the history and the correct DR for this position, respectively. The term δ_r(r') is the Kronecker delta, which equals 1 if r = r' and 0 otherwise. D_r is the total number of rules for the syntactic function r in the training set:

D_r = \sum_{l,t} \delta_r(r(h^{(l,t)})) \times |N(d^{(l,t)}, h^{(l,t)})|.

Let us illustrate the computation with an example. Imagine that our training set consists of the document in Figure 1, and the trained model is used to predict the upcoming DR in our referent cloze example (Figure 4). The training document includes the pair X-object-of-scrubbed → X-object-of-rinsing, so the corresponding term (x_scrubbed − x_rinsing) participates in the summation (3) for α_obj.
As we rely on external embeddings, which encode semantic similarities between lexical items, the dot product of this term and (x_rubbed − x_rinsed) will be high. (The score would have been even higher had the predicate appeared in the morphological form rinsing rather than rinsed; however, the embeddings of rinsing and rinsed are still sufficiently close to each other for our argument to hold.) Consequently, ϕ(d, h(t)) is expected to be positive for d = "soap", thus predicting "soap" as the likely forthcoming DR. Unfortunately, there are other terms (x_u − x_v), both in expression (3) for α_obj and in expression (2) for ϕ(d, h(t)). These terms may be irrelevant to the current prediction, as X-object-of-plugged → X-object-of-filling from Figure 1, and may not even encode any valid regularities, as X-object-of-got → X-object-of-scrubbed (again from Figure 1). This may suggest that our feature will be too contaminated with noise to be informative for making predictions. However, recall that independent random vectors in high dimensions are almost orthogonal, and, assuming they are bounded, their dot products are close to zero. Consequently, the products of the relevant ("non-random") terms, in our example (x_scrubbed − x_rinsing) and (x_rubbed − x_rinsed), are likely to overcome the ("random") noise. As we will see in the ablation studies, the predicate-schema feature is indeed predictive of a DR and contributes to the performance of the full model.

4.3 Experiments

We would like to test whether our model can produce accurate predictions and whether the model's guesses correlate well with human predictions for the referent cloze task. In order to be able to evaluate the effect of script knowledge on referent predictability, we compare three models: our full Script model uses all of the features introduced in Section 4.2; the Linguistic model relies only on the 'linguistic features' but not the script-specific ones; and the Base model includes all the shallow linguistic features. The Base model differs from the Linguistic model in that it does not model selectional preferences. Table 2 summarizes the features used in the different models.

The data set was randomly divided into training (70%), development (10%, 91 stories from 10 scenarios), and test (20%, 182 stories from 10 scenarios) sets. The feature weights were learned using L-BFGS (Byrd et al., 1995) to optimize the log-likelihood.

Evaluation against original referents. We calculated the percentage of correct DR predictions. See Table 3 for the averages across 10 scenarios. We can see that the task appears hard for humans: their average performance reaches only 73% accuracy. As expected, the Base model is the weakest system (an accuracy of 31%). Modeling selectional preferences yields an extra 18% in accuracy (Linguistic model). The key finding is that the incorporation of script knowledge increases the accuracy by a further 13%, although still far behind human performance (62% vs. 73%). Besides accuracy, we use perplexity, which we computed not only for all our models but also for human predictions. This was possible as each task was solved by multiple humans. We used unsmoothed normalized guess frequencies as the probabilities. As we can see from Table 3, the perplexity scores are consistent with the accuracies: the script model again outperforms the other methods and, as expected, all the models are weaker than humans.

Table 3: Accuracies (in %) and perplexities for different models and scenarios, given per scenario as "accuracy / perplexity". The script model substantially outperforms the linguistic and base models (p < 0.001, significance tested with McNemar's test (Everitt, 1992)). As expected, the human prediction model outperforms the script model (p < 0.001, McNemar's test).

Scenario | Human Model | Script Model | Linguistic Model | Tily Model
Grocery Shopping | 74.80 / 2.13 | 68.17 / 3.16 | 53.85 / 6.54 | 32.89 / 24.48
Repairing a flat bicycle tyre | 78.34 / 2.72 | 62.09 / 3.89 | 51.26 / 6.38 | 29.24 / 19.08
Riding a public bus | 72.19 / 2.28 | 64.57 / 3.67 | 52.65 / 6.34 | 32.78 / 23.39
Getting a haircut | 71.06 / 2.45 | 58.82 / 3.79 | 42.82 / 7.11 | 28.70 / 15.40
Planting a tree | 71.86 / 2.46 | 59.32 / 4.25 | 47.80 / 7.31 | 28.14 / 24.28
Borrowing book from library | 77.49 / 1.93 | 64.07 / 3.55 | 43.29 / 8.40 | 33.33 / 20.26
Taking Bath | 81.29 / 1.84 | 67.42 / 3.14 | 61.29 / 4.33 | 43.23 / 16.33
Going on a train | 70.79 / 2.39 | 58.73 / 4.20 | 47.62 / 7.68 | 30.16 / 35.11
Baking a cake | 76.43 / 2.16 | 61.79 / 5.11 | 46.40 / 9.16 | 24.07 / 23.67
Flying in an airplane | 62.04 / 3.08 | 61.31 / 4.01 | 48.18 / 7.27 | 30.90 / 30.18
Average | 73.63 / 2.34 | 62.63 / 3.88 | 49.52 / 7.05 | 31.34 / 23.22

Table 4: Accuracies and perplexities from the ablation experiments.

Model | Accuracy | Perplexity
Linguistic Model | 49.52 | 7.05
Linguistic Model + Predicate Schemas | 55.44 | 5.88
Linguistic Model + Participant type fit | 58.88 | 4.29
Full Script Model (both features) | 62.63 | 3.88
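The perplexities in Tables 3 and 4 are a function of the probability each predictor assigns to the referent that actually occurs; a generic sketch of the computation (with human probabilities taken as unsmoothed normalized guess frequencies, as described above):

```python
import math

def perplexity(probs_of_correct: list) -> float:
    """Perplexity over a set of cloze targets, given the probability each
    predictor assigned to the referent that actually occurred."""
    avg_log = sum(math.log(p) for p in probs_of_correct) / len(probs_of_correct)
    return math.exp(-avg_log)

def human_probability(guess_counts: dict, correct_dr: str) -> float:
    """Unsmoothed normalized guess frequency for the correct DR
    (e.g. out of 20 worker guesses per target)."""
    total = sum(guess_counts.values())
    return guess_counts.get(correct_dr, 0) / total

# Toy example: 14 of 20 workers guessed the correct referent on one target.
print(human_probability({"bathtub": 14, "drain (new)": 5, "bather": 1}, "bathtub"))  # 0.7
```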
As we used two sets of script features, capturing different aspects of script knowledge, we performed extra ablation studies (Table 4). The experiments confirm that both feature sets were beneficial.

Evaluation against human expectations. In the previous subsection, we demonstrated that the incorporation of selectional preferences and, perhaps more interestingly, the integration of automatically acquired script knowledge lead to improved accuracy in predicting discourse referents. Now we turn to another question raised in the introduction: does incorporation of this knowledge make our predictions more human-like? In other words, are we able to accurately estimate human expectations? This includes not only being sufficiently accurate but also making the same kind of incorrect predictions.

In this evaluation, we therefore use human guesses collected during the referent cloze task as our target. We then calculate the relative accuracy of each computational model. As can be seen in Figure 5, the Script model, at approx. 53% accuracy, is a lot more accurate in predicting human guesses than the Linguistic model and the Base model.

Figure 5 (bar chart; relative accuracy in %: Script 52.9, Linguistic 38.4, Base 34.5): Average relative accuracies of different models w.r.t. human predictions.

We can also observe that the margin between the Script model and the Linguistic model is a lot larger in this evaluation than between the Base model and the Linguistic model. This indicates that the model which has access to script knowledge is much more similar to human prediction behavior in terms of top guesses than the script-agnostic models.

Now we would like to assess if our predictions are similar as distributions rather than only yielding similar top predictions. In order to compare the distributions, we use the Jensen-Shannon divergence (JSD), a symmetrized version of the Kullback-Leibler divergence. Intuitively, JSD measures the distance between two probability distributions; a smaller JSD value is indicative of more similar distributions.

Figure 6 (bar chart; JS divergence: Script 0.50, Linguistic 0.57, Base 0.66): Average Jensen-Shannon divergence between human predictions and models.

Figure 6 shows that the probability distributions resulting from the Script model are more similar to human predictions than those of the Linguistic and Base models. In these experiments, we have shown that script knowledge improves predictions of upcoming referents and that the script model is the best among our models in approximating human referent predictions.
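For reference, the Jensen-Shannon divergence between a human guess distribution and a model distribution over the same candidate set can be computed as below (a generic sketch using the natural logarithm; not the paper's evaluation code):

```python
import math

def kl(p: list, q: list) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p: list, q: list) -> float:
    """Jensen-Shannon divergence: symmetrized KL to the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy: distribution of human guesses vs. a model's distribution over the
# same candidate referents for one cloze target.
human = [0.70, 0.25, 0.05]
model = [0.55, 0.30, 0.15]
print(jsd(human, model))
```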
5 Referring Expression Type Prediction Model (RE Model)

Using the referent prediction models, we next attempt to replicate Tily and Piantadosi's findings that the choice of the type of referring expression (pronoun or full NP) depends in part on the predictability of the referent.

5.1 Uniform Information Density hypothesis

The uniform information density (UID) hypothesis suggests that speakers tend to convey information at a uniform rate (Jaeger, 2010). Applied to the choice of referring expression type, it would predict that a highly predictable referent should be encoded using a short code (here: a pronoun), while an unpredictable referent should be encoded using a longer form (here: a full NP). Information density is measured using the information-theoretic measure of the surprisal S of a message m_i:

S(m_i) = -\log P(m_i \mid \text{context})

UID has been very successful in explaining a variety of linguistic phenomena; see Jaeger et al. (2016). There is, however, controversy about whether UID affects pronominalization. Tily and Piantadosi (2009) report evidence that writers are more likely to refer using a pronoun or proper name when the referent is easy to guess and use a full NP when readers have less certainty about the upcoming referent; see also Arnold (2001). But other experiments (using highly controlled stimuli) have failed to find an effect of predictability on pronominalization (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014). The present study hence contributes to the debate on whether UID affects referring expression choice.

5.2 A model of Referring Expression Choice

Our goal is to determine whether referent predictability (quantified in terms of surprisal) is correlated with the type of referring expression used in the text. Here we focus on the distinction between pronouns and full noun phrases. Our data also contains a small percentage (ca. 1%) of proper names (like "John"). Due to this small class size and earlier findings that proper nouns behave much like pronouns (Tily and Piantadosi, 2009), we combined pronouns and proper names into a single class of short encodings. For the referring expression type prediction task, we estimate the surprisal of the referent from each of our computational models from Section 4 as well as from the human cloze task. The surprisal of an upcoming discourse referent d(t) based on the previous context h(t) is thereby estimated as:

S(d^{(t)}) = -\log p(d^{(t)} \mid h^{(t)})

In order to determine whether referent predictability has an effect on referring expression type over and above other factors that are known to affect the choice of referring expression, we train a logistic regression model with referring expression type as the response variable and discourse referent predictability, as well as a large set of other linguistic factors (based on Tily and Piantadosi, 2009), as explanatory variables. The model is defined as follows:

p(n^{(t)} = n \mid d^{(t)}, h^{(t)}) = \frac{\exp(v^\top g(n, d^{(t)}, h^{(t)}))}{\sum_{n'} \exp(v^\top g(n', d^{(t)}, h^{(t)}))},

where d(t) and h(t) are defined as before, g is the feature function, and v is the vector of model parameters. The summation in the denominator is over NP types (full NP vs. pronoun/proper noun).

5.3 RE Model Experiments

We ran four different logistic regression models. These models all contained exactly the same set of linguistic predictors but differed in the estimates used for referent type surprisal and residual entropy. One logistic regression model used surprisal estimates based on the human referent cloze task, while the three other models used estimates based on the three computational models (Base, Linguistic and Script). For our experiment, we are interested in the choice of referring expression type for those occurrences of references where a "real choice" is possible. We therefore exclude from the analysis reported below all first mentions as well as all first and second person pronouns (because there is no optionality in how to refer to first or second person). This subset contains 1345 data points.
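The corresponding analysis can be sketched as a binary logistic regression of RE type on surprisal plus the other predictors; the snippet below uses scikit-learn purely as a stand-in (the paper's analysis reports coefficients and significance levels as in Table 5, which a plain scikit-learn fit does not provide), and all column values are made up:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical design matrix: one row per (non-first-mention) referring expression.
# Columns: recency, frequency, pastSubj, pastObj, pastExpPronoun, surprisal.
X = np.array([[0.9, 3, 1, 0, 1, 0.4],
              [0.1, 1, 0, 0, 0, 2.7],
              [0.7, 2, 0, 1, 1, 1.1],
              [0.0, 1, 0, 0, 0, 3.5]])
# Response: 1 = pronoun/proper name (short form), 0 = full NP.
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.coef_)                    # per-predictor coefficients (cf. Table 5)
print(model.predict_proba(X)[:, 1])   # P(short form | predictors)
```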
5.4 Results

The results of all four logistic regression models are shown in Table 5. We first take a look at the results for the linguistic features. While there is a bit of variability in terms of the exact coefficient estimates between the models (this is simply due to small correlations between these predictors and the predictors for surprisal), the effect of all of these features is largely consistent across models. For instance, the positive coefficient for the recency feature means that when a previous mention happened very recently, the referring expression is more likely to be a pronoun (and not a full NP).

Table 5: Coefficients obtained from the regression analysis for the different models, given as "estimate (standard error), p-value" per model. Two NP types are considered: full NP and Pronoun/ProperNoun, with full NP as the base class. Significance: '***' < 0.001, '**' < 0.01, '*' < 0.05, and '.' < 0.1.

Predictor | Human | Script | Linguistic | Base
(Intercept) | -3.4 (0.244), <2e-16 *** | -3.418 (0.279), <2e-16 *** | -3.245 (0.321), <2e-16 *** | -3.061 (0.791), 0.00011 ***
recency | 1.322 (0.095), <2e-16 *** | 1.322 (0.095), <2e-16 *** | 1.324 (0.096), <2e-16 *** | 1.322 (0.097), <2e-16 ***
frequency | 0.097 (0.098), 0.317 | 0.103 (0.097), 0.289 | 0.112 (0.098), 0.251 | 0.114 (0.102), 0.262
pastObj | 0.407 (0.293), 0.165 | 0.396 (0.294), 0.178 | 0.423 (0.295), 0.151 | 0.395 (0.3), 0.189
pastSubj | -0.967 (0.559), 0.0838 . | -0.973 (0.564), 0.0846 . | -0.909 (0.562), 0.106 | -0.926 (0.565), 0.101
pastExpPronoun | 1.603 (0.21), 2.19e-14 *** | 1.619 (0.207), 5.48e-15 *** | 1.616 (0.208), 7.59e-15 *** | 1.602 (0.245), 6.11e-11 ***
depTypeSubj | 2.939 (0.299), <2e-16 *** | 2.942 (0.347), <2e-16 *** | 2.656 (0.429), 5.68e-10 *** | 2.417 (1.113), 0.02994 *
depTypeObj | 1.199 (0.248), 1.35e-06 *** | 1.227 (0.306), 6.05e-05 *** | 0.977 (0.389), 0.0119 * | 0.705 (1.109), 0.525
surprisal | -0.04 (0.099), 0.684 | -0.006 (0.097), 0.951 | 0.002 (0.117), 0.988 | -0.131 (0.387), 0.735
residualEntropy | -0.009 (0.088), 0.916 | 0.023 (0.128), 0.859 | -0.141 (0.168), 0.401 | -0.128 (0.258), 0.619

The coefficients for the surprisal estimates of the different models are, however, not significantly different from zero. Model comparison shows that they do not improve model fit. We also used the estimated models to predict referring expression type on new data and again found that surprisal estimates from the models did not improve prediction accuracy. This effect even holds for our human cloze data. Hence, it cannot be interpreted as a problem with the models: even human predictability estimates are, for this dataset, not predictive of referring expression type.

We also calculated regression models for the full dataset, including first and second person pronouns as well as first mentions (3346 data points). The results for the full dataset are fully consistent with the findings shown in Table 5: there was no significant effect of surprisal on referring expression type.

This result contrasts with the findings by Tily and Piantadosi (2009), who reported a significant effect of surprisal on RE type for their data. In order to replicate their settings as closely as possible, we also included residualEntropy as a predictor in our model (see the last predictor in Table 5); however, this did not change the results.

6 Discussion and Future Work

Our study on incrementally predicting discourse referents showed that script knowledge is a highly important factor in determining human discourse expectations. Crucially, the computational modelling approach allowed us to tease apart the different factors that affect human prediction, as we cannot manipulate this in humans directly (by asking them to "switch off" their common-sense knowledge).
By modelling common-sense knowledge in terms of event sequences and event participants, our model captures many more long-range dependencies than normal language models. The script knowledge is automatically induced by our model from crowd- sourced scenario-specific text collections. In a second study, we set out to test the hypoth- esis that uniform information density affects refer- ring expression type. This question is highly con- troversial in the literature: while Tily and Piantadosi (2009) find a significant effect of surprisal on refer- ring expression type in a corpus study very similar to ours, other studies that use a more tightly con- trolled experimental approach have not found an ef- fect of predictability on RE type (Stevenson et al., 1994; Fukumura and van Gompel, 2010; Rohde and Kehler, 2014). The present study, while replicating exactly the setting of T&P in terms of features and analysis, did not find support for a UID effect on RE type. The difference in results between T&P 2009 and our results could be due to the different corpora and text sorts that were used; specifically, we would expect that larger predictability effects might be ob- servable at script boundaries, rather than within a script, as is the case in our stories. A next step in moving our participant predic- tion model towards NLP applications would be to replicate our modelling results on automatic text- to-script mapping instead of gold-standard data as done here (in order to approximate human level of processing). Furthermore, we aim to move to more complex text types that include reference to several scripts. We plan to consider the recently published ROC Stories corpus (Mostafazadeh et al., 2016), a large crowdsourced collection of topically unre- stricted short and simple narratives, as a basis for these next steps in our research. 42 Acknowledgments We thank the editors and the anonymous review- ers for their insightful suggestions. We would like to thank Florian Pusse for helping with the Ama- zon Mechanical Turk experiment. We would also like to thank Simon Ostermann and Tatjana Anikina for helping with the InScript corpus. This research was partially supported by the German Research Foundation (DFG) as part of SFB 1102 ‘Informa- tion Density and Linguistic Encoding’, European Research Council (ERC) as part of ERC Starting Grant BroadSem (#678254), the Dutch National Sci- ence Foundation as part of NWO VIDI 639.022.518, and the DFG once again as part of the MMCI Cluster of Excellence (EXC 284). References Simon Ahrendt and Vera Demberg. 2016. Improving event prediction by representing script participants. In Proceedings of NAACL-HLT. Jennifer E. Arnold. 2001. The effect of thematic roles on pronoun use and frequency of reference continuation. Discourse Processes, 31(2):137–162. Marco Baroni and Alessandro Lenci. 2010. Distribu- tional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4):673– 721. Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! A systematic compari- son of context-counting vs. context-predicting seman- tic vectors. In Proceedings of ACL. Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Trans- lating embeddings for modeling multi-relational data. In Proceedings of NIPS. Jan A. Botha and Phil Blunsom. 2014. Compositional morphology for word representations and language modelling. In Proceedings of ICML. Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. 
A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208. Nathanael Chambers and Daniel Jurafsky. 2008. Unsu- pervised learning of narrative event chains. In Pro- ceedings of ACL. Nathanael Chambers and Dan Jurafsky. 2009. Unsuper- vised learning of narrative schemas and their partici- pants. In Proceedings of ACL. Brian S. Everitt. 1992. The analysis of contingency ta- bles. CRC Press. Lea Frermann, Ivan Titov, and Manfred Pinkal. 2014. A hierarchical Bayesian model for unsupervised induc- tion of script knowledge. In Proceedings of EACL. Kumiko Fukumura and Roger P. G. van Gompel. 2010. Choosing anaphoric expressions: Do people take into account likelihood of reference? Journal of Memory and Language, 62(1):52–66. T. Florian Jaeger, Esteban Buz, Eva M. Fernandez, and Helen S. Cairns. 2016. Signal reduction and linguis- tic encoding. Handbook of psycholinguistics. Wiley- Blackwell. T. Florian Jaeger. 2010. Redundancy and reduction: Speakers manage syntactic information density. Cog- nitive psychology, 61(1):23–62. Bram Jans, Steven Bethard, Ivan Vulić, and Marie Francine Moens. 2012. Skip n-grams and ranking functions for predicting script events. In Proceedings of EACL. Gina R. Kuperberg and T. Florian Jaeger. 2016. What do we mean by prediction in language comprehension? Language, cognition and neuroscience, 31(1):32–59. Gina R. Kuperberg. 2016. Separate streams or proba- bilistic inference? What the N400 can tell us about the comprehension of events. Language, Cognition and Neuroscience, 31(5):602–616. Marta Kutas, Katherine A. DeLong, and Nathaniel J. Smith. 2011. A look around at what lies ahead: Pre- diction and predictability in language processing. Pre- dictions in the brain: Using our past to generate a fu- ture. Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cer- nockỳ, and Sanjeev Khudanpur. 2010. Recurrent neu- ral network based language model. In Proceedings of Interspeech. Tomas Mikolov, Stefan Kombrink, Anoop Deoras, Lukar Burget, and Jan Cernocky. 2011. RNNLM-recurrent neural network language modeling toolkit. In Pro- ceedings of the 2011 ASRU Workshop. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Cor- rado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of NIPS. Ashutosh Modi and Ivan Titov. 2014. Inducing neural models of script knowledge. Proceedings of CoNLL. Ashutosh Modi, Tatjana Anikina, Simon Ostermann, and Manfred Pinkal. 2016. Inscript: Narrative texts anno- tated with script information. Proceedings of LREC. Ashutosh Modi. 2016. Event embeddings for semantic script modeling. Proceedings of CoNLL. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of com- monsense stories. Proceedings of NAACL. 43 Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL. Karl Pichotta and Raymond J Mooney. 2014. Statistical script learning with multi-argument events. Proceed- ings of EACL. Altaf Rahman and Vincent Ng. 2012. Resolving com- plex cases of definite pronouns: the Winograd schema challenge. In Proceedings of EMNLP. Michaela Regneri, Alexander Koller, and Manfred Pinkal. 2010. Learning script knowledge with web experiments. In Proceedings of ACL. Hannah Rohde and Andrew Kehler. 2014. 
Grammati- cal and information-structural influences on pronoun production. Language, Cognition and Neuroscience, 29(8):912–927. Rachel Rudinger, Vera Demberg, Ashutosh Modi, Ben- jamin Van Durme, and Manfred Pinkal. 2015. Learn- ing to predict script events from domain-specific text. Proceedings of the International Conference on Lexi- cal and Computational Semantics (*SEM 2015). Asad Sayeed, Clayton Greenberg, and Vera Demberg. 2016. Thematic fit evaluation: an aspect of selectional preferences. In Proceedings of the Workshop on Eval- uating Vector Space Representations for NLP (RepE- val2016). Roger C. Schank and Robert P. Abelson. 1977. Scripts, Plans, Goals, and Understanding. Lawrence Erlbaum Associates, Potomac, Maryland. Simone Schütz-Bosbach and Wolfgang Prinz. 2007. Prospective coding in event representation. Cognitive processing, 8(2):93–102. Rosemary J. Stevenson, Rosalind A. Crawley, and David Kleinman. 1994. Thematic roles, focus and the rep- resentation of events. Language and Cognitive Pro- cesses, 9(4):519–548. Harry Tily and Steven Piantadosi. 2009. Refer effi- ciently: Use less informative expressions for more pre- dictable meanings. In Proceedings of the workshop on the production of referring expressions: Bridging the gap between computational and empirical approaches to reference. Alessandra Zarcone, Marten van Schijndel, Jorrig Vo- gels, and Vera Demberg. 2016. Salience and atten- tion in surprisal-based accounts of language process- ing. Frontiers in Psychology, 7:844. 44