Edinburgh Research Explorer 
 
 
Context-aware Frame-Semantic Role Labeling

Citation for published version:
Roth, M & Lapata, M 2015, 'Context-aware Frame-Semantic Role Labeling', Transactions of the Association
for Computational Linguistics, vol. 3, pp. 449-460.
<https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/652>

Link:
Link to publication record in Edinburgh Research Explorer

Document Version:
Publisher's PDF, also known as Version of record

Published In:
Transactions of the Association for Computational Linguistics


https://www.research.ed.ac.uk/portal/en/publications/contextaware-framesemantic-role-labeling(110f80eb-8701-4638-8998-e38efcb6d50e).html


Context-aware Frame-Semantic Role Labeling

Michael Roth and Mirella Lapata
School of Informatics, University of Edinburgh

10 Crichton Street, Edinburgh EH8 9AB
{mroth,mlap}@inf.ed.ac.uk

Abstract

Frame semantic representations have been
useful in several applications ranging from
text-to-scene generation, to question answer-
ing and social network analysis. Predicting
such representations from raw text is, how-
ever, a challenging task and corresponding
models are typically only trained on a small
set of sentence-level annotations. In this pa-
per, we present a semantic role labeling sys-
tem that takes into account sentence and dis-
course context. We introduce several new fea-
tures which we motivate based on linguistic
insights and experimentally demonstrate that
they lead to significant improvements over the
current state-of-the-art in FrameNet-based se-
mantic role labeling.

1 Introduction

The goal of semantic role labeling (SRL) is to iden-
tify and label the arguments of semantic predicates
in a sentence according to a set of predefined re-
lations (e.g., “who” did “what” to “whom”). In
addition to providing definitions and examples of
role labeled text, resources like FrameNet (Ruppen-
hofer et al., 2010) group semantic predicates into so-
called frames, i.e., conceptual structures describing
the background knowledge necessary to understand
a situation, event or entity as a whole as well as
the roles participating in it. Accordingly, semantic
roles are defined on a per-frame basis and are shared
among predicates.

In recent years, frame representations have been
successfully applied in a range of downstream tasks,
including question answering (Shen and Lapata,
2007), text-to-scene generation (Coyne et al., 2012),
stock price prediction (Xie et al., 2013), and so-
cial network extraction (Agarwal et al., 2014).
Whereas some tasks directly utilize information
encoded in the FrameNet resource, others make
use of FrameNet indirectly through the output of
SRL systems that are trained on data annotated
with frame-semantic representations. While ad-
vances in machine learning have recently given
rise to increasingly powerful SRL systems follow-
ing the FrameNet paradigm (Hermann et al., 2014;
Täckström et al., 2015), little effort has been devoted
to improving such models from a linguistic perspective.

In this paper, we explore insights from the lin-
guistic literature suggesting a connection between
discourse and role labeling decisions and show how
to incorporate these in an SRL system. Although
early theoretical work (Fillmore, 1976) has recog-
nized the importance of discourse context for the
assignment of semantic roles, most computational
approaches have shied away from such considera-
tions. To see how context can be useful, consider as
an example the DELIVERY frame, which states that
a THEME can be handed off to either a RECIPIENT
or “more indirectly” to a GOAL. While the distinc-
tion between the latter two roles might be clear for
some fillers (e.g., people vs. locations), there are oth-
ers where both roles are equally plausible and addi-
tional information is required to resolve the ambigu-
ity (e.g., countries). If we hear about a letter being
delivered to Greece, for instance, reliable cues might
be whether the sender is a person or a country and
whether Greece refers to the geographic region or to
the Greek government.

Transactions of the Association for Computational Linguistics, vol. 3, pp. 449–460, 2015. Action Editor: Diana McCarthy.
Submission batch: 5/2015; Revision batch 7/2015; Published 8/2015.

©2015 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

The example shows that context can generally in-
fluence the choice of correct role label. Accordingly,
we assume that modeling contextual information,
such as the meaning of a word in a given situation,
can improve semantic role labeling performance. To
validate this assumption, we explore different ways
of incorporating contextual cues in an SRL model and
provide experimental support that demonstrates the
usefulness of such additional information.

The remainder of this paper is structured as fol-
lows. In Section 2, we present related work on se-
mantic role labeling and the various features applied
in traditional SRL systems. In Section 3, we provide
additional background on the FrameNet resource.
Sections 4 and 5 describe our baseline system and
contextual extensions, respectively, and Section 6
presents our experimental results. We conclude the
paper by discussing in more detail the output of our
system and highlighting avenues for future work.

2 Related Work

Early work in SRL dates back to Gildea and Juraf-
sky (2002), who were the first to model role assign-
ment to verb arguments based on FrameNet. Their
model makes use of lexical and syntactic features,
including binary indicators for the words involved,
syntactic categories, dependency paths as well as po-
sition and voice in a given sentence. Most subse-
quent work in SRL builds on Gildea and Jurafsky’s
feature set, often with the addition of features that
describe relevant syntactic structures in more de-
tail, e.g., the argument’s leftmost/rightmost depen-
dent (Johansson and Nugues, 2008).

More sophisticated features include the use of
convolution kernels (Moschitti, 2004; Croce et
al., 2011) in order to represent predicate-argument
structures and their lexical similarities more accu-
rately. Beyond lexical and syntactic information,
a few approaches employ additional semantic fea-
tures based on annotated word senses (Che et al.,
2010) and selectional preferences (Zapirain et al.,
2013). Deschacht and Moens (2009) and Huang
and Yates (2010) use sentence-internal sequence in-
formation, in the form of latent states in a hidden
Markov model. More recently, a few approaches
(Roth and Woodsend, 2014; Lei et al., 2015; Foland
and Martin, 2015) explore ways of using low-rank
vector and tensor approximations to represent lex-
ical and syntactic features as well as combinations
thereof.

To the best of our knowledge, there exists no
prior work where features based on discourse con-
text are used to assign roles on the sentence level.
Discourse-like features have been previously ap-
plied in models that deal with so-called implicit ar-
guments, i.e., roles which are not locally realized
but resolvable within the greater discourse context
(Ruppenhofer et al., 2010; Gerber and Chai, 2012).
Successful features for resolving implicit arguments
include the distance between mentions and any dis-
course relations occurring between them (Gerber
and Chai, 2012), roles assigned to mentions in the
previous context, the discourse prominence of the
denoted entity (Silberer and Frank, 2012), and its
centering status (Laparra and Rigau, 2013). None
of these features have been used in a standard SRL
system to date (and trivially, not all of them will be
helpful as, for example, the number of sentences be-
tween a predicate and an argument is always zero
within a sentence). In this paper, we extend the
contextual features used for resolving implicit ar-
guments to the SRL task and show how a set of
discourse-level enhancements can be added to a tra-
ditional sentence-level SRL model.

3 FrameNet

The Berkeley FrameNet project (Ruppenhofer et al.,
2010) develops a semantic lexicon and an annotated
example corpus based on Fillmore’s (1976) theory
of frame semantics. Annotations consist of frame-
evoking elements (i.e., words in a sentence that are
associated with a conceptual frame) and frame ele-
ments (i.e., instantiations of semantic roles, which
are defined per frame and filled by words or word
sequences in a given sentence). For example, the
DELIVERY frame describes a scene or situation in
which a DELIVERER hands off a THEME to a RECIPIENT
or a GOAL.1 In total, there are 1,019 frames and 8,886
frame elements defined in the latest publicly available
version of FrameNet.2 On average, 11.6 different
frame-evoking elements are provided for each frame
(11,829 in total). Following previous work on
FrameNet-based SRL, we use the full text annotation
data set, which contains 23,087 frame instances.

1See https://framenet2.icsi.berkeley.edu/ for a comprehensive list of frames and their definitions.

Semantic annotations for frame instances and
fillers of frame elements are generally provided at
the level of word sequences, which can be single
words, complete or incomplete phrases, and entire
clauses (Ruppenhofer et al., 2010, Chapter 4). An
instance of the DELIVERY frame, with annotations
of the frame-evoking element (underlined) and in-
stantiated frame elements (in brackets), is given in
the example below:

(1) The Soviet Union agreed to speed up [oil]_THEME
deliveries_DELIVERY [to Yugoslavia]_RECIPIENT.

Note that the oil deliveries here concern Yugoslavia
as a geopolitical entity and hence the RECIPIENT
role is assigned. If Yugoslavia were referred to as
the location of a delivery, the GOAL role would be
assigned instead. In general, roles can be restricted
by so-called semantic types (e.g., every filler of the
THEME element in the DELIVERY frame needs to
be a physical object). However, not all roles are
typed and whether a specific phrase is a suitable
filler largely depends on context.

4 Baseline Model

As a baseline for implementing contextual enhance-
ments to an SRL model, we use the semantic role
labeling components provided by the mate-tools
(Björkelund et al., 2010). Given a frame-evoking el-
ement in a sentence and its associated frame (i.e., a
predicate and its sense), the mate-tools form a
pipeline of logistic regression classifiers that iden-
tify and label frame elements which are instantiated
within the same sentence (i.e., a given predicate’s
arguments).

The adopted SRL system has been developed
for PropBank/NomBank-style role labeling and we
make several changes to adapt it to FrameNet.
Specifically, we change the argument labeling procedure
from predicate-specific to frame-specific roles and
implement I/O methods to read and generate FrameNet
XML files. For direct comparison with the previous
state-of-the-art for FrameNet-based SRL, we further
implement additional features used in the SEMAFOR
system (Das et al., 2014) and combine the role labeling
components of mate-tools with SEMAFOR’s preprocessing
toolchain.3 All features used in our system are
listed in Table 1.

2Version 1.5, released September 2010.

The main differences between our adaptation of
mate-tools and SEMAFOR are as follows: whereas
the latter implements identification and labeling of
role fillers in one step, mate-tools follow the in-
sight that these two steps are conceptually differ-
ent (Xue and Palmer, 2004) and should be modeled
separately. Accordingly, mate-tools contain a global
reranking component which takes into account iden-
tification and labeling decisions while SEMAFOR
only uses reranking techniques to filter overlapping
argument predictions and other constraints (see Das
et al., 2014 for details). We discuss the advantage of
a global reranker for our setting in Section 5.

5 Extensions based on Context

Context can be relevant for semantic role labeling
in various different ways. In this section, we moti-
vate and describe four extensions over previous ap-
proaches.

The first extension is a set of features that model
document-specific aspects of word meaning using
distributional semantics. The motivation for this fea-
ture class stems from the insight that the meaning of
a word in context can influence correct role assign-
ment. While concepts such as polysemy, homonymy
and metonymy are all relevant here, the scarce train-
ing data available for FrameNet-based SRL calls for
a light-weight model that can be applied without
large amounts of labeled data. We therefore employ
distributional word representations which we criti-
cally adapt based on document content. We describe
our contribution in Section 5.1.

Entities that fill semantic roles are sometimes
mentioned in discourse. Given a specific mention
for which a role is to be predicted, we can also directly
use previous role assignments as classification
cues. We describe our implementation of this feature
in Section 5.2.

3We note that better results have been reported in Hermann
et al. (2014) and Täckström et al. (2015). However, both of
these more recent approaches rely on a custom frame identification
component as well as proprietary tools and models for
tagging and parsing which are not publicly available.

Argument identification and classification

Lemma form of f
POS tag of f
Any syntactic dependents of f*
Subcat frame of f*
Voice of a*
Any lemma in a*
Number of words in a
First word and POS tag in a
Second word and POS tag in a
Last word and POS tag in a
Relation from first word in a to its parent
Relation from second word in a to its parent
Relation from last word in a to its parent
Relative position of a with respect to p
Voice of a and relative position with respect to p*

Identification only

Lemma form of the first word in a
Lemma form of the syntactic head of a
Lemma form of the last word in a
POS tag of the first word in a
POS tag of the syntactic head of a
POS tag of the last word in a
Relation from syntactic head of a to its parent
Dependency path from a to f
Length of dependency path from a to f
Number of words between a and f

Table 1: Features from Das et al. (2014) which we adopt
in our model; a denotes the argument span under consideration,
f refers to the corresponding frame evoking
element. Identification features are instantiated as binary
indicator features. Features marked with an asterisk are
role specific. All other features apply to combinations of
role and frame.

The filler of a semantic role is often a word or
phrase which occurs only once or a few times in
a document. If neither syntax nor aspects of lexi-
cal meaning provide cues indicating a unique role,
useful information can still be derived from the dis-
course salience of the denoted entity. Our model
makes use of a simple salience indicator that can be
reliably derived from automatically computed coref-
erence chains. We describe the motivation and ac-
tual implementation of this feature in Section 5.3.

The aforementioned features will influence role
labeling decisions directly; however, further improvements
can be gained by considering interactions
between labeling decisions. As discussed in
Das et al. (2014), role annotations in FrameNet are
unique with respect to a frame instance in more
than 96% of cases. This means that even if a feature
is not a positive indicator for a candidate role filler,
knowing it would be a better cue for another can-
didate can also prevent a hypothetical model from
assigning a frame element label incorrectly. While
this kind of knowledge has been successfully im-
plemented as constraints in recent FrameNet-based
SRL models (Hermann et al., 2014; Täckström et al.,
2015), earlier work on PropBank-based role label-
ing suggests that better performance can be achieved
with a re-ranking component which has the poten-
tial to learn such constraints and other interactions
implicitly (Toutanova et al., 2005; Björkelund et al.,
2010). In our model, we adopt the latter method and
extend it with additional frame-based features. We
describe this approach in more detail in Section 5.4.

5.1 Modeling Word Meaning in Context

The underlying idea of distributional models of se-
mantics is that meaning can be acquired based on
distributional properties (typically represented by
co-occurrence counts) of linguistic entities such as
words and phrases (Sahlgren, 2008). Although the
absolute meaning of distributional representations
remains unclear, they have proven highly success-
ful for modeling relative aspects of meaning, as re-
quired for instance in word similarity tasks (Mikolov
et al., 2013; Pennington et al., 2014). Given their
ability to model lexical similarity, it is not surpris-
ing that such representations are also successful at
representing similar words in semantic tasks related
to role labeling (Pennacchiotti et al., 2008; Croce et
al., 2010; Zapirain et al., 2013).

Although distributional representations can be
used directly as features for role labeling (Padó et
al., 2008; Gorinski et al., 2013; Roth and Woodsend,
2014, inter alia), further gains should be possible
when considering document-specific properties
such as genre and context. This is particularly true
in the context of FrameNet, where different senses
are observed across a diverse range of texts including
spoken dialogue and debate transcripts as well
as travel guides and newspaper articles. Country
names, for example, can be observed as fillers for
different roles depending on the text genre and its
perspective. Whereas some text may talk about a
country as an interesting holiday destination (e.g.,
“Berlitz Intro to Jamaica”), others may discuss what
a country is good at or interested in (e.g., “Iran [Nuclear]
Introduction”). A list of the most frequent
roles assigned to different country names is displayed
in Table 2.

Country   Frame               Frame Element

Iran      Supply              RECIPIENT
          Commerce buy        BUYER

China     Supply              SUPPLIER
          Commerce sell       SELLER

Iraq      Locative relation   GROUND
          Arriving            GOAL

Table 2: Most frequent roles assigned to country names
appearing in FrameNet texts: whereas Iran and China are
mostly mentioned in an economic context, references to
Iraq are mainly found in a news article about a politician’s
visit to the country.

Previous approaches model word meaning in con-
text (Thater et al., 2010; Dinu and Lapata, 2010, in-
ter alia) using sentence-level information which is
already available in traditional SRL systems in the
form of explicit features. Here, we go one step fur-
ther and define a simple model in which word mean-
ing representations are adapted to each document.
As a starting point, we use the GloVe toolkit (Pennington
et al., 2014) for learning representations4
and apply it to the Wikipedia corpus made available
by the Westbury Lab.5 The learned representations
can be seen as word vectors whose components en-
code basic bits of related encyclopaedic knowledge.
We adapt these general representations to the ac-
tual meaning of a word in a particular text by run-
ning additional iterations of the GloVe toolkit us-
ing document-specific co-occurrences as input and
Wikipedia-based representations for initialization.

4We selected this toolkit in our work due to its flexibility: as
it directly operates over co-occurrence matrices, we can manip-
ulate counts prior to word vector computation and easily take
into account multiple matrices.

5http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html

To make up for the large difference in data size be-
tween the Wikipedia corpus and a single document,
we normalize co-occurrence counts based on the ra-
tio between the absolute numbers of co-occurrences
in both resources.

Given co-occurrence matrices $C_{wiki}$ and $C_d$, and
the vocabulary $V$, we formally define the features
of our SRL model as the components of the vector
space $\vec{w}_i$ of words $w_i$ ($1 \le i \le |V|$) occurring
in document $d$. The representations are learned by
applying GloVe to optimize the following objective
for $n$ iterations ($1 \le t \le n$):

$$J_t = \sum_{i,j} f(X_{ij}) \left( \vec{w}_i^{T} \vec{w}_j - \log X_{ij} \right)^2, \tag{2}$$

$$\text{where } X = \begin{cases} C_{wiki} & \text{if } t < t_d \\ C_d & \text{otherwise} \end{cases} \tag{3}$$

The weighting function f scales the impact of each
word pair such that unseen pairs do not contribute
to the overall objective and frequent co-occurrences
are not overweighted. In our experiments, we use
the same weighting function and parametrization as
defined in Pennington et al. (2014). We further set
the number of iterations to be performed on each
co-occurrence matrix following results of an ini-
tial cross-validation experiment on our training data
(td = 50, n = 100).
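The two-phase schedule of Equations (2) and (3) can be illustrated with a small NumPy sketch. It is not the GloVe toolkit: it uses one tied vector per word and no bias terms (as in Equation 2), plain gradient descent instead of AdaGrad, and toy counts. The function names, the learning rate and the way document counts are rescaled (to the total mass of the general corpus) are assumptions made for illustration only; the weighting function uses the defaults of Pennington et al. (2014).

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(x): zero for unseen pairs, capped at 1 for frequent pairs."""
    return np.minimum((x / x_max) ** alpha, 1.0)

def safe_log(x):
    """log(x) where x > 0 and 0 elsewhere (those cells receive weight 0 anyway)."""
    positive = x > 0
    return np.where(positive, np.log(np.where(positive, x, 1.0)), 0.0)

def objective(W, X):
    """J = sum_ij f(X_ij) * (w_i . w_j - log X_ij)^2, i.e. Eq. (2) with tied vectors."""
    return float(np.sum(glove_weight(X) * (W @ W.T - safe_log(X)) ** 2))

def gradient_step(W, X, lr):
    """One plain gradient-descent step on the objective (an illustrative optimizer)."""
    err = glove_weight(X) * (W @ W.T - safe_log(X))
    return W - lr * 2.0 * (err + err.T) @ W

def adapt_vectors(C_wiki, C_d, W_init, t_d=50, n=100, lr=0.001):
    """Iterations t < t_d use the general counts, the remaining ones the document counts (Eq. 3)."""
    # Assumed normalization: rescale document counts to the total mass of the general corpus.
    C_d_scaled = C_d * (C_wiki.sum() / C_d.sum())
    W = W_init.copy()
    for t in range(1, n + 1):
        X = C_wiki if t < t_d else C_d_scaled
        W = gradient_step(W, X, lr)
    return W

# Toy symmetric co-occurrence counts for a 4-word vocabulary (made-up numbers).
C_wiki = np.array([[0, 40, 5, 1], [40, 0, 8, 2], [5, 8, 0, 30], [1, 2, 30, 0]], dtype=float)
C_d = np.array([[0, 1, 3, 0], [1, 0, 0, 2], [3, 0, 0, 1], [0, 2, 1, 0]], dtype=float)
W0 = np.random.default_rng(0).normal(scale=0.1, size=(4, 3))
W_doc = adapt_vectors(C_wiki, C_d, W0)  # one 3-dimensional vector per word
```

The rows of `W_doc` play the role of the document-specific vectors whose components are used as features; initializing the second phase from the vectors learned in the first phase is what ties the document-level representation to the general one.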

5.2 Co-occurring Roles

If an entity is mentioned several times in discourse,
it is likely that it also fills several roles. Whereas
the distributional model described in Section 5.1
provides us with information regarding the role as-
signments suitable for an entity given co-occurring
words, we can also explicitly consider previous
role assignments to the same entity. As shown in
Table 2, a country that fills the SUPPLIER role is
more likely to also fill the role of a SELLER than
that of a BUYER. Given the high number of different
frame elements in FrameNet, only a small fraction of
pairs can be found in the training data, which entails
that directly utilizing role co-occurrences might not
be helpful. In order to benefit from previous role
assignments in discourse, we follow related work
on resolving implicit arguments (Ruppenhofer et al.,
2011; Silberer and Frank, 2012) and consider the se-
mantic types of role assignments (see Section 3) as


features instead of the role labels themselves. This
tremendously reduces the feature space from more
than 8,000 options (number of defined frame ele-
ments) to just 27 (number of semantic types ob-
served for frame elements in the training data).

In practice, we define one binary indicator fea-
ture fs for each semantic type s observed at train-
ing time. Given a potential filler, we set the feature
value of fs to 1 (otherwise 0) if and only if there ex-
ists a co-referent entity mention annotated as a frame
element filler with semantic type s. Since texts in
FrameNet do not contain any manual mark-up of
coreference relations, we rely on entity mentions
and coreference chains predicted by the Stanford
Coreference Resolution system (Lee et al., 2013).
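The indicator features described above can be sketched in a few lines. The data layout is hypothetical: mentions are plain identifiers, coreference chains are lists of such identifiers, and `filler_types` maps a mention to the semantic types of the frame elements it has already been assigned. The type names in the example are illustrative placeholders, not the 27 types observed in the training data.

```python
def semantic_type_features(mention, coref_chains, filler_types, known_types):
    """Return {s: 1 or 0}: 1 iff a co-referent mention fills a frame element of type s."""
    # Find the chain containing the candidate mention (empty if it is a singleton).
    chain = next((c for c in coref_chains if mention in c), [])
    seen = set()
    for other in chain:
        if other != mention:  # only *other* mentions of the same entity count
            seen.update(filler_types.get(other, ()))
    return {s: int(s in seen) for s in known_types}

# Hypothetical example: two mentions of one entity, one unrelated mention.
chains = [["China#1", "China#2"], ["Greece#4"]]
types_of_fillers = {"China#1": {"Sentient"}}
known = ["Sentient", "Location", "Physical_object"]

features = semantic_type_features("China#2", chains, types_of_fillers, known)
```

Here `features` marks `Sentient` as active for the second mention of the entity, while a mention without co-referent fillers receives all zeros.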

5.3 Discourse Newness
Our third contextual feature type is based on the
observation that the salience of a discourse entity
and its semantic prominence are interrelated. Previous
work (Rose, 2011) showed that semantic prominence,
as signaled by semantic roles, can better
explain subsequent phenomena related to discourse
salience (such as pronominalization) than syntactic
indicators. Our question here is whether this insight
can also be applied in reverse: can information on
discourse salience be useful as an indicator for semantic
roles?

For this feature, we make use of the same coref-
erence chains as predicted for determining co-
occurring roles. Unfortunately, automatically pre-
dicted mentions and coreference chains are noisy.
To identify particularly reliable indicators for dis-
course salience, we inspected held-out development
data. One such indicator is whether an entity is men-
tioned for the first time (discourse-new) or has been
mentioned before (discourse-old). Let $w$ denote an
entity and $R_1 \ldots R_n$ the set of all co-reference chains
with mentions $r_1 \ldots r_m \in R_i$ ($1 \le i \le n$) ordered
by their appearance in text. We define discourse
newness based on head words $r.\mathit{head}$ as:

$$\mathit{new}(w) = \begin{cases} 0 & \text{if } \exists r_j \in R_i : j > 1 \wedge r_j.\mathit{head} \equiv w \\ 1 & \text{else} \end{cases} \tag{4}$$

Although this feature is a simple binary indicator, it
can be very useful for distinguishing between roles
that are more or less likely to be assigned to new
entities. For example, it is easy to imagine that
the RESULT of a CAUSATION is more likely to be
discourse-new than the EFFECT that caused it. Table 3
provides an overview of frames found in the
training and development data which have roles with
substantially different likelihoods for discourse-new
fillers.

Frame                  Frame Element    new/old

Statement              SPEAKER          43.8
                       MESSAGE          99.1
                       MEDIUM           80.0

Leadership             LEADER           78.0
                       GOVERNED         93.4

Intentionally create   CREATOR          58.8
                       CREATED ENTITY   90.1

Table 3: Frequent frames that have elements with different
likelihoods of discourse-new vs. discourse-old fillers;
new/old ratios as observed on the development set.
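The discourse-newness indicator of Equation (4) reduces to a membership test over coreference chains. In the minimal sketch below, each chain is assumed to be a list of head-token positions ordered by appearance in the text; this representation and the function name are illustrative assumptions.

```python
def discourse_new(head_token, coref_chains):
    """Eq. (4): 0 if the token heads a non-first mention of some chain, else 1."""
    for chain in coref_chains:
        # chain[0] is the first mention of the entity; later mentions are discourse-old.
        if head_token in chain[1:]:
            return 0
    return 1

# Hypothetical chains of head-token positions, ordered by appearance in the text.
chains = [[3, 17, 42], [8, 25]]
```

With these chains, tokens 3 and 8 (first mentions) and any token outside all chains count as discourse-new, whereas tokens 17, 42 and 25 count as discourse-old.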

5.4 Frame-based Reranking

Our goal is to learn a better model for FrameNet-
based semantic role labeling using linguistically in-
spired features such as those described in the previ-
ous sections. To do this, we need a framework for
representing single role assignments and a model of
how such assignments depend on each other within
a frame instance. Inspired by previous work on
reranking in SRL, we assume that we can find the
correct filler of a frame element based on the top k
roles predicted for each candidate word sequence.
We leverage this assumption to train a reranking
model that considers the top predictions for each
candidate and uses all relevant features to select the
best overall structure.

Our implementation of the reranking model is
an adaptation of the reranker made available in the
mate-tools (see Section 4), which we extend to deal
with frame-specific features and arbitrary role la-
bels. As features for the global component, we apply
all local features and additionally use the following
two types of indicator features on the whole frame
structure:

• Total number of roles in the predicted structure
• Ordered set of predicted role labels

Frames     SRL model        P      R      F1

gold       SEMAFOR7         78.4   73.1   75.7∗
gold       Framat           80.3   71.7   75.8∗
gold       Framat+context   80.4   73.0   76.5

SEMAFOR    SEMAFOR          69.2   65.1   67.1∗
SEMAFOR    Framat           71.1   63.7   67.2∗
SEMAFOR    Framat+context   71.1   64.8   67.8

Table 4: Full structure prediction results using gold (top)
and predicted frames (bottom). All numbers are percentages.
∗ Significantly different (p<0.05) from
Framat+context.

At test time, the reranker takes as input the n-best la-
bels for the m-best fillers of a frame structure, com-
putes a global score for each of the n × m possible
combinations and returns the structure with the high-
est overall score as its prediction output. Based on
initial experiments on our training data, we set these
parameters to m = 8 and n = 4.
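The way global indicator features can change the choice between candidate structures is sketched below. This is a simplified illustration, not the mate-tools reranker: candidate structures are assumed to be given, each role assignment is a dictionary with a hypothetical `start` position, `label` and local probability `prob`, and the score is the sum of local log-probabilities plus weights for the two structure-level features listed above. The weight value in the example is made up.

```python
from math import log

def global_features(structure):
    """Indicator features on the whole frame structure."""
    ordered = sorted(structure, key=lambda a: a["start"])
    labels = tuple(a["label"] for a in ordered)
    return {("num_roles", len(labels)): 1.0, ("label_sequence", labels): 1.0}

def score(structure, weights):
    """Local log-probabilities plus weighted global features (a linear stand-in model)."""
    local = sum(log(a["prob"]) for a in structure)
    glob = sum(weights.get(f, 0.0) * v for f, v in global_features(structure).items())
    return local + glob

def rerank(structures, weights):
    """Return the candidate structure with the highest overall score."""
    return max(structures, key=lambda s: score(s, weights))

# Two hypothetical candidate structures for one frame instance.
repeated = [{"start": 0, "label": "THEME", "prob": 0.6},
            {"start": 3, "label": "THEME", "prob": 0.5}]
distinct = [{"start": 0, "label": "THEME", "prob": 0.6},
            {"start": 3, "label": "RECIPIENT", "prob": 0.4}]

# A learned penalty for assigning the same role twice (illustrative value).
weights = {("label_sequence", ("THEME", "THEME")): -2.0}
best = rerank([repeated, distinct], weights)
```

Without the penalty, the structure with the repeated role wins on local scores alone; with it, the structure with two distinct roles is selected. This is the sense in which a reranker can learn role uniqueness implicitly instead of enforcing it as a hard constraint.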

6 Experiments

In this section, we demonstrate the usefulness of
contextual features for FrameNet-based SRL mod-
els. Our hypothesis is that contextual information
can considerably improve an existing semantic role
labeling system. Accordingly, we test this hypothe-
sis based on the output of three different systems.
The first system, henceforth called Framat (short
for FrameNet-adapted mate-tools) is the baseline
system described in Section 4. The second sys-
tem, henceforth Framat+context, is an enhanced ver-
sion of the baseline that additionally uses all exten-
sions described in Section 5. Finally, we also con-
sider the output of SEMAFOR (Das et al., 2014), a
state-of-the-art model for frame-semantic role label-
ing. Although all systems are provided with entire
documents as input, SEMAFOR and Framat pro-
cess each document sentence-by-sentence whereas
Framat+context also uses features over all sentences.

For evaluation, we use the same FrameNet train-
ing and evaluation texts as established in Das and
Smith (2011). We compute precision, recall and
F1-score using the modified SemEval-2007 scorer
from the SEMAFOR website.6
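For reference, the three measures relate as in the following sketch. It implements only the standard definitions over counts of correct, predicted and gold role assignments; it is not the SemEval-2007 scorer, whose matching rules are not reproduced here. The function names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(num_correct, num_predicted, num_gold):
    """Precision, recall and F1 from raw counts of role assignments."""
    precision = num_correct / num_predicted if num_predicted else 0.0
    recall = num_correct / num_gold if num_gold else 0.0
    return precision, recall, f1_score(precision, recall)

# Consistency check against reported figures: P = 80.4 and R = 73.0 give F1 = 76.5.
f1_context = round(f1_score(80.4, 73.0), 1)
```

The same computation applied to P = 78.4 and R = 73.1 yields 75.7, matching the corresponding row of Table 4.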

6http://www.ark.cs.cmu.edu/SEMAFOR/eval/
7Results produced by running SEMAFOR on the exact same frame instances for training and testing as our own models.

Model/added feature        P      R      F1

Framat w/o reranker        77.5   72.5   74.9
+discourse newness         77.6   72.3   74.9
+word meaning vectors      77.9   72.7   75.2
+cooccurring roles         77.9   72.8   75.3
+reranker                  80.6   72.7   76.4
+frame structure           80.4   73.0   76.5

Table 5: Full structure prediction results using gold
frames, Framat and different sets of context features. All
numbers are percentages.

Results Table 4 summarizes our results with Framat,
Framat+context, and SEMAFOR using gold and
predicted frames (see the upper and lower half of
the table, respectively). Although differences in
system architecture lead to different precision/recall
trade-offs for Framat and SEMAFOR, both systems
achieve comparable F1 (for both gold and predicted
frames). Compared to Framat, we can see
that the contextual enhancements implemented in
our Framat+context model lead to immediate gains
of 1.3 points in recall with gold frames, corresponding
to a significant increase of 0.7 points in F1.
Framat+context’s recall is slightly below that of
SEMAFOR (73.0% vs. 73.1%); however, it achieves
a much higher level of precision (80.4% vs. 78.4%).

We examined whether differences in performance
among the three systems are significant using an
approximate randomization test over sentences (Yeh,
2000). SEMAFOR and Framat perform significantly
worse (p<0.05) than Framat+context
with both gold and predicted frames. In the
remainder of this section we discuss results based on
gold frames, since the focus of this work lies primarily
on the role labeling task.

Impact of Individual Features We demonstrate
the effect of adding individual context-based features
to the Framat model in a separate experiment.
Whereas all models in the previous experiment used
a reranker for direct comparability, here we start
with the Framat baseline (without a reranker) and
add each enhancement described in Section 5
incrementally. As summarized in Table 5, the baseline
without a reranker achieves a precision and
recall of 77.5% and 72.5%, respectively. Addition
of our discourse newness feature increases
precision (+0.1%), but also reduces recall (−0.2%).
Adding word meaning vectors compensates for the
loss in recall (+0.4%) and further increases preci-
sion (+0.3%). Information about role assignments
to coreferring mentions increases recall (+0.1%)
while retaining the same level of precision. Finally,
we can see that jointly considering role labeling
decisions in a global reranker with additional fea-
tures on frame structure leads to the strongest boost
in performance, with combined additional gains in
precision and recall of +2.5% and +0.2%, respec-
tively. Interestingly, the gains realized here are much
higher compared to when adding the reranker to the
Framat model without contextual features, which
corresponds to a +2.8% increase in precision but
a −0.8% reduction in recall.
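The approximate randomization test over sentences used for the significance results above can be sketched as follows. The sketch assumes per-sentence counts of (correct, predicted, gold) role assignments for each system and micro-averaged F1 as the test statistic; the number of trials, the seed and all names are illustrative choices rather than the settings used in the experiments.

```python
import random

def micro_f1(per_sentence_counts):
    """Micro-averaged F1 from per-sentence (correct, predicted, gold) counts."""
    correct = sum(c for c, _, _ in per_sentence_counts)
    predicted = sum(p for _, p, _ in per_sentence_counts)
    gold = sum(g for _, _, g in per_sentence_counts)
    if correct == 0 or predicted == 0 or gold == 0:
        return 0.0
    precision, recall = correct / predicted, correct / gold
    return 2 * precision * recall / (precision + recall)

def approximate_randomization(system_a, system_b, trials=1000, seed=0):
    """Estimate the p-value for the observed F1 difference between two systems."""
    rng = random.Random(seed)
    observed = abs(micro_f1(system_a) - micro_f1(system_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(system_a, system_b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(micro_f1(shuffled_a) - micro_f1(shuffled_b)) >= observed:
            at_least_as_large += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_large + 1) / (trials + 1)

# Hypothetical per-sentence counts for two systems on 30 sentences.
strong = [(5, 5, 5)] * 30
weak = [(0, 5, 5)] * 30
p_value = approximate_randomization(strong, weak)
```

Shuffling at the sentence level keeps the outputs of both systems paired, so the test only asks how often a random relabeling of the two systems produces a difference at least as large as the observed one.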
General vs. Document-specific Vectors We also
assessed the impact of adapting vectors to docu-
ments (see Table 6). Specifically, we compared
a version of the Framat+context model without any
vectors against a model using the adaptation tech-
nique presented in Section 5.1 and a simpler alterna-
tive which obtains GloVe representations trained on
the Wikipedia corpus and FrameNet texts. The lat-
ter model does not explicitly take document infor-
mation into account, but it should be able to yield
vectors representative of the FrameNet domains,
merely by being trained on them. As shown in Ta-
ble 6, our adaptation technique is superior to learn-
ing word representations based on Wikipedia and
all FrameNet texts at once. Using the components
of document-specific vectors as features improves
precision and recall by +0.7 percentage points over
Framat+context without vectors. Word representations
trained on Wikipedia and FrameNet improve preci-
sion by +0.2 percentage points and recall by +0.6.
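
The figures reported in this section can be cross-checked with simple arithmetic. The following sketch is an illustration only and not part of the evaluation pipeline: it accumulates the per-feature changes reported above, starting from the baseline without a reranker, and recomputes F1 as the harmonic mean of precision and recall for the rows of Table 6. The function and variable names are assumptions made for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2 * precision * recall / (precision + recall)

# Baseline without a reranker, followed by the reported changes (P, R).
baseline = (77.5, 72.5)
deltas = [
    ("discourse-new feature", +0.1, -0.2),
    ("word meaning vectors", +0.3, +0.4),
    ("roles of coreferring mentions", 0.0, +0.1),
    ("global reranker + frame structure", +2.5, +0.2),
]

p, r = baseline
for _name, dp, dr in deltas:
    p, r = p + dp, r + dr
final = (round(p, 1), round(r, 1))

# Rows of Table 6 as (P, R); F1 is recomputed rather than copied.
table6 = {
    "without vectors": (79.7, 72.2),
    "document-specific vectors": (80.4, 73.0),
    "general (Wiki+FN) vectors": (79.9, 72.8),
}
f1_scores = {name: round(f1(pr, rc), 1) for name, (pr, rc) in table6.items()}

print("after all features:", final)
for name, score in f1_scores.items():
    print(f"{name}: F1 = {score}")
```

Under these assumptions, the accumulated values come out at 80.4 precision and 73.0 recall, which is the row for document-specific vectors in Table 6, and the recomputed F1 values agree with the F1 column of the table after rounding.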

Qualitative Improvements In addition to quantitative
gains, we also observe qualitative improvements
when considering contextual features. A set
of example predictions by different models is listed
in Table 7. The annotations show that Framat and
SEMAFOR mislabel several cases that are correctly
classified by Framat+context.

Model/word representations          P     R     F1

Framat+context without vectors     79.7  72.2  75.8
+document-specific vectors         80.4  73.0  76.5
+general (Wiki+FN) vectors         79.9  72.8  76.2

Table 6: Full structure prediction results using gold
frames, Framat+context and different vector representations.
All numbers are percentages.

In the first example, only Framat+context is able
to predict that on Dec. 1 fills the frame element
TIME. This may seem trivial at first glance but is
actually remarkable as the word token Dec neither
occurs in the training data nor is well represented
as a time expression in Wikipedia. The only way
the model is able to label this phrase correctly is
by finding that corresponding word tokens are sim-
ilarly distributed across the test document as other
time expressions are in the training data. In the
second and third examples, correct assignments re-
quire some form of world knowledge which is not
expressed within the respective sentences but might
be approximated based on context. For example,
knowing that aunt, uncle and grandmother are role
fillers of a KINSHIP frame means that they are of
the semantic type human and thus only compatible
with the frame element RECIPIENT, not with GOAL.
Similarly, correctly classifying the relation between
Clinton and stooge in the last example is only possi-
ble if the model has access to some information that
makes Clinton a likely filler of the SUPERIOR role.
We conjecture that document-specific word vector
representations provide such information given that
Clinton co-occurs in the document with words such
as president, chief, and claim.
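
To make the effect of such corrections on precision and recall concrete, the following sketch scores the three predictions for the first example in Table 7 against the labeling that the table marks as correct. It is a simplified illustration and not the official evaluation script: we assume exact-match scoring over (span, frame element) pairs and ignore the frame itself, and all names are hypothetical.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (span, role) pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# First example of Table 7; the Framat+context output is the one
# that the table marks as correct, so it serves as the gold labeling.
gold = {("he", "THEME"), ("to Paris", "GOAL"), ("on Dec. 1", "TIME")}

predictions = {
    "SEMAFOR": {("he", "THEME"), ("to Paris", "GOAL")},
    "Framat": {("he", "THEME"), ("to Paris on Dec. 1", "GOAL")},
    "Framat+context": {
        ("he", "THEME"), ("to Paris", "GOAL"), ("on Dec. 1", "TIME"),
    },
}

scores = {model: precision_recall(pred, gold)
          for model, pred in predictions.items()}

for model, (prec, rec) in scores.items():
    print(f"{model}: P = {prec:.2f}, R = {rec:.2f}")
```

Under this simplified scheme, the missing TIME element costs SEMAFOR recall only, whereas the overly long GOAL span costs Framat both precision and recall.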

Overall, we find that the features introduced in
Section 5 model a fair amount of contextual information
which can help a semantic role labeling
model to make better decisions.

7 Discussion

In this section, we discuss the extent to which our
model leverages the full potential of contextual fea-
tures for semantic role labeling. We manually ex-
amine role assignments to frame elements which
seem particularly sensitive to context. We analyze
such frame elements based on differences in label
assignment between Framat and Framat+context that
can be traced back to factors such as agency in
discourse and word sense in context. We investigate
whether our model captures these factors successfully
and showcase examples while reporting absolute
changes in precision and recall.

SEMAFOR         *Can [he]_THEME go_MOTION [to Paris]_GOAL on Dec. 1 ?
Framat          *Can [he]_THEME go_MOTION [to Paris on Dec. 1]_GOAL ?
Framat+context   Can [he]_THEME go_MOTION [to Paris]_GOAL [on Dec. 1]_TIME ?

SEMAFOR         *Send_SENDING [my regards]_THEME to my aunt , uncle and grandmother .
Framat          *Send_SENDING [my regards]_THEME [to my aunt , uncle and grandmother]_GOAL .
Framat+context   Send_SENDING [my regards]_THEME [to my aunt , uncle and grandmother]_RECIPIENT .

SEMAFOR         *Stephanopoulos does n’t want to seem a Clinton stooge_SUBORDINATES_AND_SUPERIORS
Framat          *Stephanopoulos does n’t want to seem a [Clinton]_DESCRIPTOR stooge_SUBORDINATES_AND_SUPERIORS
Framat+context   Stephanopoulos does n’t want to seem a [Clinton]_SUPERIOR stooge_SUBORDINATES_AND_SUPERIORS

Table 7: Examples of frame structures that are labeled incorrectly (marked by asterisks) without contextual features.

7.1 Agency and Discourse

Many frame elements in FrameNet indicate agency,
a property that we expect to highly correlate with
contextual features on semantic types of assigned
roles (see Section 5.2) and discourse salience (see
Section 5.3). Analysis of system output revealed
that such features indeed affect and generally
improve role labeling. Considering all AGENT
elements across frames, we observe absolute
improvements of 4% in precision and 3% in recall.
In the following, we provide a more detailed analysis
of two specific frame elements: the low-frequency
AGENT element of the PROJECT frame and the
high-frequency SPEAKER element in the STATEMENT frame.

The AGENT of a PROJECT is defined as the
“individual or organization that carries out the
PROJECT”. The main difficulty in identifying in-
stances of this frame element is that the frame-
evoking target word is typically a noun such as
project, plan, or program and hence syntactic fea-
tures on word-word dependencies do not provide
sufficient cues. We found several cases where con-
text provided missing cues, leading to an increase
in recall from 56% to 78%. In cases where additional
features did not help, we identified two types
of errors: firstly, the filler was too far from the
target word and therefore could not be identified as
a filler at all (“[North Korea]_AGENT is developing
... program_PROJECT”), and secondly, earlier mentions
indicating agency were not detected by the
coreference resolution system (“The IAEA assisted
Syria (...) This study was part of an [IAEA]_AGENT ...
program_PROJECT”).

The SPEAKER of a STATEMENT is defined as
“the sentient entity that produces [a] MESSAGE”.
Instances of the STATEMENT frame are frequently
evoked by verbs such as say, mention, and claim.
The SPEAKER role can be hard to identify in sub-
ject position as an unknown entity could also fill the
MEDIUM role. For example, “a report claims that
...” should be analyzed differently from “a person
claims”. Our contextual features improve role label-
ing in cases where the subject can be classified based
on previous role assignments. On the negative side,
we found our model to be too conservative in some
cases where a subject is discourse new. Additional
gains would be possible with improved coreference
chains that include pronouns such as some and I.
Such chains could be established through a better
preprocessing pipeline or by utilizing additional lin-
guistic resources.
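
The two discourse-level cues discussed in this subsection, namely whether a mention is discourse new and which roles were previously assigned to coreferring mentions, can be sketched as follows. This is a minimal illustration under stated assumptions (mentions arrive in document order, each with a coreference chain identifier and an optional earlier role label); it is not the feature extraction code of the released system, and all names are hypothetical.

```python
from collections import defaultdict

def discourse_features(mentions):
    """Derive discourse-level features for mentions in document order.

    Each mention is a (chain_id, role) pair, where role is the frame
    element assigned to that mention, or None if it has none.
    """
    roles_so_far = defaultdict(list)  # chain_id -> roles seen earlier
    chains_seen = set()
    features = []
    for chain_id, role in mentions:
        features.append({
            # True if no earlier mention belongs to the same chain.
            "discourse_new": chain_id not in chains_seen,
            # Roles assigned to earlier mentions of the same chain.
            "prior_roles": tuple(roles_so_far[chain_id]),
        })
        chains_seen.add(chain_id)
        if role is not None:
            roles_so_far[chain_id].append(role)
    return features

# Toy document: chain "c1" is first labeled SPEAKER and mentioned again later.
mentions = [("c1", "SPEAKER"), ("c2", None), ("c1", None)]
features = discourse_features(mentions)

for mention, feats in zip(mentions, features):
    print(mention, feats)
```

In the toy input, the third mention is not discourse new and carries SPEAKER as a prior role, which is the kind of evidence that could help separate a SPEAKER subject from a MEDIUM subject.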

7.2 Word Meaning and Context

As discussed earlier, we expect that the meaning of
a word in context provides valuable cues regarding
potential frame elements. Two types of words are
of particular interest here: ambiguous words, for
which different senses might apply depending on
context, and out-of-vocabulary words, for which no
clear sense could be established during training. In
the following, we take a closer look at differences in
role assignment between Framat and Framat+context
for such fillers.

Ambiguous words that occur as fillers of different
frame elements in the test set include party,
power, program, and view. We find occurrences
of these words in two broad types of contexts: po-
litical and non-political. Within political contexts,
party and power fill frame elements such as POS-
SESSION and LEADER. Outwith political contexts,
we find frame elements such as ELECTRICITY and
SOCIAL EVENT to be far more likely. The Framat
model exhibits a general bias towards the political
domain, often missing instances of frame elements
that are more common in non-political contexts
(e.g., “the six-[party]_INTERLOCUTORS talks_DISCUSSION”).
Framat+context, in contrast, shows less of a bias and
provides better classification based on context
features for all frame elements. Overall, precision for
the four ambiguous words is improved from 86% to
93%, with a few errors remaining due to rare
dependency paths (e.g., [program]_ACT ←NMOD− which
←SBAR− is ←PRD− violation_COMPLIANCE) and differences
between frame elements that depend on factors such as
number (COGNIZER vs. COGNIZER_1).

A frequently observed error by the baseline
model is to assign peripheral frame elements such
as TIME to role fillers that actually are not time
expressions. This happens because words which
have not been seen frequently during training but
appear in adverbial positions are generally likely to
fill the frame element TIME. We find that the use
of document-specific word vector representations
drastically reduces the number of such errors (e.g.,
“to give_GIVING [generously]_MANNER vs. *TIME”), with
absolute gains in precision and recall of 14% and
9%, respectively, presumably because non-time
expressions are often distributed differently across
a document than time expressions. Document-
specific word vector representations also improve
recall for out-of-vocabulary words, as seen with the
example of Dec discussed in Section 6. However,
such representations by themselves might be insufficient
to determine which aspects of a word sense
are applicable across a document, as occurrences
in specific contexts may also be misleading (e.g.,
“... changes [throughout the community]” vs. “...
[throughout the ages]_TIME”). Some of these cases
could be resolved using higher-level features that
explicitly model interactions between (predicted)
word meaning in context and other factors; however,
we leave this to future work.
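
The intuition that time expressions are distributed differently across a document than other words can be illustrated with a toy computation. The sketch below builds count-based co-occurrence vectors within a single made-up document and compares them by cosine similarity. It is a deliberately simple stand-in and not the adaptation technique of Section 5.1; the document, window size, and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

def cooccurrence_vector(tokens, target, window=2):
    """Count the words within `window` positions of each `target` token."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start, end = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up document in which "Dec" plays the part of an unseen token.
document = "we meet on Dec 1 . we meet on Friday . we fly to Paris .".split()

dec = cooccurrence_vector(document, "Dec")
friday = cooccurrence_vector(document, "Friday")
paris = cooccurrence_vector(document, "Paris")

sim_time = cosine(dec, friday)   # similarity to a known time expression
sim_place = cosine(dec, paris)   # similarity to a non-time word

print(f"Dec vs. Friday: {sim_time:.2f}")
print(f"Dec vs. Paris:  {sim_place:.2f}")
```

In this toy document, Dec ends up closer to Friday than to Paris because the two tokens share most of their contexts, which mirrors the kind of document-internal evidence discussed above.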

8 Conclusions

In this paper, we enriched a traditional semantic role
labeling model with additional information from
context. The corresponding features we defined can
be grouped into three categories: (1) discourse-level
features that directly utilize discourse knowledge in
the form of coreference chains (newness, prior role
assignments), (2) sentence-level features that model
properties of a frame structure as a whole, and (3)
lexical features that can be computed using methods
from distributional semantics and an adaptation to
model document-specific word meaning.

To implement our discourse-level enhancements,
we modified a semantic role labeling system de-
veloped for PropBank/NomBank which we found
to achieve competitive performance on FrameNet-
based annotations. Our main contribution lies in
extending this system to the discourse level. Our
experiments revealed that discourse-aware features
can significantly improve semantic role labeling
performance, leading to gains of over 2.0 percentage
points in precision and state-of-the-art results
in terms of F1. Analysis of system output revealed
two reasons for improvement. Firstly, contextual
features provide necessary additional information to
understand and assign roles on the sentence level,
and secondly, some of our discourse-level features
generalize better than traditional lexical and syntac-
tic features. We further found that additional gains
can be achieved using improved preprocessing tools
and a more sophisticated model for feature inter-
actions. In the future, we are planning to assess
whether discourse-level features generalize cross-
linguistically. We would also like to investigate
whether semantic role labeling can benefit from rec-
ognizing textual entailment and high-level discourse
relations. Our code is publicly available at
http://github.com/microth/mateplus.

Acknowledgements

We are grateful to Diana McCarthy and three anony-
mous referees whose feedback helped to substan-
tially improve the present paper. The research pre-
sented in this paper was funded by a DFG Research
Fellowship (RO 4848/1-1).

